Amazon Elastic MapReduce (EMR) is a great managed Hadoop offering that allows clusters to be both easily deployed and easily disolved. EMR can be used to set up long-lived clusters or run scripted jobs priced by the hour. But getting your data into an EMR cluster can be challenging, for a number of reasons:
- Network access to EMR clusters is tightly restricted by default
- Data formats and patterns in Hadoop are specialized
- Hadoop clusters host a number of components -- HDFS, HBase, Hive, Spark, etc. -- each with their own configuration
In this post, we'll go over some of the basics for connecting Apache NiFi to an EMR cluster.
S3 - The Easy Way
The simplest and easiest way to get data from NiFi to EMR is to write the data to S3. Here we're referring to S3 as an integration data store between NiFi and EMR, and it's worth expanding on when this works well vs. when it falls short. S3 is well supported by both systems, has a good permissions model, supports large data volumes and durable storage. It is easy to configure network access between S3 and NiFi or EMR clusters, and S3 gateways can connect within your VPC. And S3 is pretty cheap. If you are running scripted batch jobs on EMR, this is the most likely scenario for you.
But if you are running a long-lived cluster, S3 may only be halfway there, because you really needed the data in HDFS files, or stored in the context of a Hadoop patform component like HBase. If you are not ingesting the data directly, you still need a solution to get the data the rest of the way from S3 to HBase.
NiFi as Data Gateway
When you need to get data directly into EMR, but you want to keep EMR secure from public access, NiFi does a great job acting as a data gateway. In this design, NiFi is deployed in an EC2 Security Group that permits public access. The EMR cluster grants access only to NiFi. This pattern works doubly well if you launched your EMR cluster in a private subnet, and network traffic is tightly restricted. NiFi makes an elegant gateway you can trust to bring in data under secure conditions.
Granting NiFi Access to EMR
Concepts are nice, but configuring the nuts and bolts of access are another matter. It's easy in EC2 -- but only after you've mastered EC2 network configuration, of course. Even a simple configuration includes three or more Security Groups and a number of network routes from NiFi to EMR. Some things to watch out for include:
- Make sure NiFi is in a security group that allows public access, but EMR is not
- Configure NiFi access to EMR Security Groups using the NiFi Security Group as the traffic source
- EC2 instances can be assigned more than one Security Group, so you can define public access and EMR ingest access separately
- NiFi may need access to both the EMR Master Security Group and the EMR Worker ("Slave") Security Groups, depending on the application you are connecting to.
Decifering EMR Configuration
One of the biggest challenges of getting data into EMR is decifering the configuration of individual components, both network requirements and application configuration.
- Configuration files like core-site.xml, hdfs-site.xml, and component-specific files like hbase-site.xml need to be copied from EMR to your NiFi instance.
- Many uses require access to ZooKeeper on the EMR Master node, typically port 2181.
- In many cases, you will need to refer to the documentation of components like Apache HBase to understand the default port configuration
- Hadoop application ports vary by EMR and application version, so double-check the Release Guide for your EMR version before you go insane.
Packaging Data for Hadoop
There is more to getting your data into EMR than just connecting to the cluster nodes. Fortunately, NiFi is a stellar tool for repackaging your data into a format appropriate for Hadoop. Optimal data packaging depends on the Hadoop application or component, but consider typical HDFS best practices for example.
- Format - Many formats are used in Hadoop, including text, JSON, Avro, Orc, Parquet, and more. NiFi has processors to convert these formats, and can be easily modified to convert differently when the format changes.
- Size - You should not create too many small files, you want to aggregate small records up to the block size of HDFS (128 MB by default). NiFi can help you do this with the MergeContent processor.
- Timeliness - While you want nicely sized files, you may also have timeliness requirements such that you put data to EMR before it gets old. MergeContent can handle that, too, by configuring the Max Bin Age property.
- Compression - Many text-based file formats benefit tremendously from compression, which NiFi can do explicitly with the CompressContent processor, or implicitly as part of the PutHDFS processor.
- File Path - File paths need to be standardized to your HDFS organization pattern. NiFi supports this through attributes and expressions.
- Kite Datasets - NiFi also supports Apache Kite datasets through the StoreInKiteDataset processor. Kite can be configured to do much of the above in a consistent way.