By James Wing on 2017-02-20

Everybody loves S3 data storage. While there are as many different ways to deliver content to S3 as there are users, there are some common patterns for solving S3 content delivery with Apache NiFi. This article describes, compares, and contrasts three patterns for S3 data delivery in NiFi with respect to typical design concerns:

  • Security and exposing an API
  • Tracking the latest changes
  • Reading from multiple buckets
  • Working across AWS accounts
  • Availability of the solution

The three patterns we will consider are:

  1. NiFi as a Gateway to S3
  2. NiFi processing S3 Event Notifications
  3. NiFi Listing S3 to process objects

Let's examine each in detail and consider their appropriate scenarios, pros, and cons.

NiFi as a Gateway to S3

In the Gateway to S3 pattern, external data traffic comes directly to NiFi first. NiFi processes the data as little or as much as you choose before writing it out to an S3 storage location. Typical processing would include validating, reformatting, and merging data into an optimal shape for downstream processing. For an example, see S3 Ingest with NiFi.

Positioning NiFi as the API gives you flexibility and control over the communication protocol and security scheme. It allows the full range of NiFi capabilities, including HTTP(S), NiFi site-to-site, TCP, and many more. There are also many more options for NiFi to pull data in, rather than serving an API endpoint itself.

The existence of the S3 bucket is hidden behind NiFi, so there is no need to share any AWS credentials.

A single NiFi is capable of acting as a gateway for many S3 buckets, and the need to route content to an appropriate bucket is a typical driver for this pattern.
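The routing decision itself is simple to picture. Here is a minimal sketch in Python, assuming a hypothetical `source.system` attribute on each incoming item and made-up bucket names. In a real flow this decision would live in NiFi's own routing and S3 processors rather than in a script, so treat it as an illustration of the idea only.

```python
# Hypothetical routing table: attribute value -> destination bucket.
# Bucket names and the "source.system" attribute are assumptions for this sketch.
BUCKET_BY_SOURCE = {
    "partner-a": "example-partner-a-bucket",
    "web-logs": "example-logs-bucket",
}
DEFAULT_BUCKET = "example-unrouted-bucket"


def choose_bucket(attributes):
    """Pick a destination bucket from a flowfile-like attribute dict."""
    source = attributes.get("source.system", "").strip().lower()
    # Anything we do not recognize lands in a catch-all bucket for review.
    return BUCKET_BY_SOURCE.get(source, DEFAULT_BUCKET)


print(choose_bucket({"source.system": "Partner-A", "filename": "orders.json"}))
print(choose_bucket({"filename": "mystery.bin"}))
```

Sending unrecognized traffic to a default bucket, rather than rejecting it, is one possible design choice; a stricter gateway might refuse it at the API instead.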

However, acting as the gateway requires NiFi to handle 100% of data traffic. Availability of the solution also depends on NiFi availability.

NiFi processing S3 event notifications

The S3 Event Notifications pattern involves configuring NiFi to process S3 event notifications after files are written to or deleted from the collection bucket. You configure an SQS queue to receive the event notifications, and NiFi reads notifications from the queue. Each S3 event notification contains metadata about the file's bucket, key, size, etc., which NiFi can use to selectively process files. This method is best for reacting to the most recent S3 activity, and it effectively deals with S3's eventual consistency. For an example, see Process S3 Event Notifications with NiFi.
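To make the notification metadata concrete, here is a small Python sketch that picks the bucket, key, and size out of a notification body and keeps only newly created objects under a prefix. The sample message is trimmed to the fields used here, and the bucket name, keys, and `created_objects` helper are invented for illustration. It is not how NiFi itself parses the message.

```python
import json
from urllib.parse import unquote_plus

# Trimmed, made-up sample of an S3 event notification as it might arrive
# in an SQS message body. Real notifications carry many more fields.
SAMPLE_BODY = json.dumps({
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-collection-bucket"},
                "object": {"key": "incoming/2017/02/report+1.csv", "size": 2048},
            },
        },
        {
            "eventName": "ObjectRemoved:Delete",
            "s3": {
                "bucket": {"name": "example-collection-bucket"},
                "object": {"key": "incoming/old.csv"},
            },
        },
    ]
})


def created_objects(message_body, prefix=""):
    """Return (bucket, key, size) for each created object under prefix."""
    results = []
    for record in json.loads(message_body).get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue  # skip deletes and other event types
        s3 = record["s3"]
        # Object keys are URL-encoded in notifications, with spaces sent as '+'.
        key = unquote_plus(s3["object"]["key"])
        if key.startswith(prefix):
            results.append((s3["bucket"]["name"], key, s3["object"].get("size", 0)))
    return results


print(created_objects(SAMPLE_BODY, prefix="incoming/"))
```

Decoding the key before using it matters in practice: fetching the object with the still-encoded key fails for any name containing spaces or special characters.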

Because S3 is the exposed API, you would need to grant S3 access to remote writers, and they will certainly know they are writing to your S3 bucket. This may be just fine for internal services, but it can sometimes be a limitation for working with partners and customers.

S3 security is strong, and it can still be a good choice for accepting partner input. In fact, the security, reliability, and throughput performance of S3 are strong arguments for taking this approach. S3 enjoys very wide tool support, and writing to S3 may be easier for remote clients than adapting to a custom API.

The existence of your NiFi instance is hidden from writers. NiFi is decoupled from the direct client to storage data flow, so NiFi availability is not a limitation. Reading S3 event notifications from a stable SQS queue makes it unnecessary for NiFi to process them 24x7. SQS can be configured to hold messages up to 14 days.
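SQS expresses retention in seconds through the `MessageRetentionPeriod` queue attribute, so 14 days works out to 1,209,600 seconds. The following Python sketch builds that attribute and rejects values outside the allowed range. The `retention_attributes` helper is hypothetical, and the resulting dictionary is only a guess at what you would hand to your SQS client or provisioning tool.

```python
# SQS retention bounds, in seconds.
MIN_RETENTION_SECONDS = 60                 # 1 minute
MAX_RETENTION_SECONDS = 14 * 24 * 60 * 60  # 14 days = 1,209,600 seconds


def retention_attributes(days):
    """Build a queue-attributes dict that keeps messages for `days` days."""
    seconds = int(days * 24 * 60 * 60)
    if not MIN_RETENTION_SECONDS <= seconds <= MAX_RETENTION_SECONDS:
        raise ValueError(f"retention of {days} days is outside the SQS limits")
    # Queue attribute values are passed as strings.
    return {"MessageRetentionPeriod": str(seconds)}


print(retention_attributes(14))
```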

A single NiFi can process events for many S3 buckets, even buckets in other AWS accounts. Compared with the other patterns, there are no added limitations on processing: you can read from many buckets and write to many buckets as necessary.

On the downside, you surrender some control over S3 writes to remote partners. They may write the wrong data, at the wrong time, or in the wrong structure. You have to pay for the storage, and while S3 rates are reasonable, it might still add up. Consider a lifecycle policy to expire or archive old objects. Again, these concerns may not apply to clients within your organization, but they are real concerns with remote partners or customers.

NiFi Listing S3 to Process Objects

Listing S3 involves enumerating the objects in an S3 bucket using NiFi's ListS3 processor. ListS3 returns metadata about each object, which can be used to selectively process files, much as with the event notifications. ListS3 tracks the key and timestamp of the last object read to allow it to incrementally read from a location. This pattern works best for re-processing objects already at rest in S3.
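The incremental behavior can be sketched as a small state machine: remember the newest timestamp seen, plus the keys seen at that timestamp, and on the next pass emit only what is newer. The Python below is a simplified illustration of that idea under my own assumptions, not ListS3's actual implementation; `list_new` and its state layout are hypothetical.

```python
from datetime import datetime, timezone


def list_new(objects, state, prefix=""):
    """Return (new_objects, new_state) for an iterable of (key, last_modified).

    state holds the newest timestamp already emitted and the keys emitted
    at exactly that timestamp, so ties are not emitted twice.
    """
    last_ts = state.get("timestamp")
    seen = set(state.get("keys", ()))
    fresh = []
    for key, modified in sorted(objects, key=lambda o: (o[1], o[0])):
        if not key.startswith(prefix):
            continue  # narrow the scope of work to one key prefix
        if last_ts is not None:
            if modified < last_ts or (modified == last_ts and key in seen):
                continue  # already handled on an earlier pass
        fresh.append((key, modified))
    if not fresh:
        return [], state
    newest = fresh[-1][1]
    keys_at_newest = {k for k, m in fresh if m == newest}
    if newest == last_ts:
        keys_at_newest |= seen
    return fresh, {"timestamp": newest, "keys": sorted(keys_at_newest)}


t1 = datetime(2017, 2, 19, tzinfo=timezone.utc)
t2 = datetime(2017, 2, 20, tzinfo=timezone.utc)
bucket = [("a.csv", t1), ("b.csv", t2)]

first, state = list_new(bucket, {})
second, state = list_new(bucket + [("c.csv", t2)], state)
print([k for k, _ in first], [k for k, _ in second])
```

Because the state is just a timestamp and a few keys, it is cheap to persist, which is what lets a listing stop and pick up again where it left off.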

Listing can be performed from NiFi alone, with no SQS queues or public APIs required. Relative to the other patterns, it is quick and easy to employ, which makes it great for ad-hoc or troubleshooting use.

NiFi's ListS3 processor only works from a single source bucket, but you can have as many processors as you want. It is easy to configure ListS3 to focus on a particular key prefix to narrow the scope of work.

Availability and throughput are entirely within NiFi, although listing can safely stop and resume across NiFi restarts.

The Listing S3 pattern may not be the best choice for processing the most recent changes. S3 has an eventual consistency model which makes listing operations inconsistent with respect to just-written content. I recommend thinking of this less as a "processing" pattern and more as a "re-processing" pattern. Use this after you solve a bug in your flow and need to re-process yesterday's data (I'm sure your flows are bug-free, just like mine).

S3 Pattern Comparison

None of these three is "best", of course. It all depends on the particular design concerns and available freedom in your data flow.

Concern                       Gateway to S3          S3 Event Notifications       Listing S3
Endpoint exposed by           NiFi push/pull         S3                           -
Endpoint security defined by  NiFi                   S3                           -
Tracking updates              Always first to know   Low-latency notifications    Always after the fact
Availability limited by*      NiFi                   S3                           NiFi
Throughput limited by**       NiFi                   S3                           NiFi
Complexity                    Medium                 High                         Low
Use case fit                  Think "Inline"         Think "Event-Driven"         Think "Reprocess"

* This assumes that S3 availability is always better than NiFi availability. I do not wish to insult any NiFi admins, but I hope you will at least accept that high availability S3 is more affordable than high availability NiFi.

** S3 throughput is not unlimited, and is easily exceeded when writing large numbers of small objects. Using NiFi to capture and merge small objects is a great case for the Gateway to S3 pattern.
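To illustrate why merging helps, here is a minimal Python sketch that concatenates many small records into fewer, larger payloads before they would be written. The `merge_small` helper and the 1,000-byte target are assumptions chosen for the example. In NiFi you would use its built-in merging capabilities rather than code like this.

```python
def merge_small(records, target_bytes):
    """Yield concatenated batches of at least target_bytes each.

    The final batch may be smaller, since it holds whatever is left over.
    """
    buffer, size = [], 0
    for record in records:
        buffer.append(record)
        size += len(record)
        if size >= target_bytes:
            yield b"".join(buffer)
            buffer, size = [], 0
    if buffer:
        yield b"".join(buffer)  # flush the remainder


# Five 400-byte records become two writes instead of five.
small_records = [b"x" * 400 for _ in range(5)]
batches = list(merge_small(small_records, target_bytes=1000))
print([len(b) for b in batches])
```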

For Further Reference