James Wing By James Wing on 2016-04-19

Apache NiFi is a great tool for building flexible and performant data ingestion pipelines. Since relational databases are a staple for many data cleaning, storage, and reporting applications, it makes sense to use NiFi as an ingestion tool for MySQL, SQL Server, Postgres, Oracle, etc. In this article we will look at Apache NiFi's built-in features for getting FlowFile data into your database, and some strategies you should consider in implementing a database flow.

Concept

The concept is dead simple - we take incoming data records, do some processing on them, then insert them into our database. Exactly the sort of thing you expect to do with NiFi.

NiFi Database Ingest Concept

Things get a bit more complicated when you consider the stampede of data going through NiFi, and how that will translate to your RDBMS. Thankfully, NiFi has some good solutions.

PutSQL

The PutSQL processor is the basic NiFi utility for inserting records into a relational database. PutSQL not only handles the basic mechanics of connecting to the database, it will also batch requests together for improved performance. By default, up to 100 FlowFiles may be processed at a time. PutSQL supports two patterns for formatting database statements from FlowFiles:

  1. Literal statement in FlowFile content
  2. Template statement in FLowFile content with parameters in specially structured attributes

In practice, I find these to be similar in that both require one or more processors upstream of PutSQL to marshal the FlowFile content and attributes into shape.

But I prefer method #1 above, to format the FlowFile content as the SQL statement prior to the PutSQL processor. Partly, I find it easier to format the FlowFile content rather than formatting several parameter attributes. But I really like this method for making it easier to troubleshoot. Having the SQL statement in plain view makes it much easier to either eyeball any errors, or to copy/paste the text into a database utility that helps you troubleshoot further.

Typical PutSQL Flow

My typical PutSQL flows look something like this (gist):

PutSQL Flow

Where the database insert is a three-step sequence of processors:

  1. UpdateAttributes - to format attributes that will be used in the database statement, both extracting from the current FlowFile content and calculated using NiFi's expression language. In this pattern, the FlowFile content is about to be replaced, so this may be the last chance to work with it.
  2. ReplaceText - to format the new FlowFile content as a SQL INSERT statement, using the attributes collected above to format the values in the statement using NiFi's expression language. In my simple sample flow, I use "Always Replace" as the Replacement Strategy, with text like this:
    INSERT INTO FlowFiles (uuid, filename, datetime, message)
    VALUES ('${uuid}', '${filename}', '${datetime}', '${message}')
  3. PutSQL - to finally execute the statement against the database and route based on the response. Routing the rety relationship back to PutSQL is standard, as these FlowFiles are assumed to be untried and good. However, routing the failed FlowFiles may be very dependent on your data and the types of errors you see. My template shows them routing back to the processor, which is probably not what you should do, we'll talk more about this below.
UpdateAttributes processor Extract Attributes

UpdateAttributes processor Extract Attributes

ReplaceText processor Format INSERT Statement

ReplaceText processor Format INSERT Statement

ReplaceText processor Format INSERT Statement (detail)

ReplaceText processor Format INSERT Statement (detail)

PutSQL processor Insert SQL

PutSQL processor Insert SQL

DBCPConnectionPool service MySqlConnectionPool

DBCPConnectionPool service MySqlConnectionPool

Troubleshooting

Because the PutSQL processor attempts to batch statements together for optimal performance, the error messages returned from failed statements are not obvious or individualized to the failing FlowFile and its statement. Batch error messages look like this:

Failed to update database due to a failed batch update. There were a total of 1 FlowFiles that failed, 2 that succeeded, and 5 that were not execute and will be routed to retry

In the case above, 8 FlowFiles went in to the processor, only 1 failed, but only 2 succeeded. The 5 FlowFiles that were not executed may be retried as-is, as long as you make sure to route the retry relationship back to PutSQL.

But what should you do with these failures? If you are in a development environment, or troubleshooting connectivity with test data, routing FlowFiles back to PutSQL can make sense. But in production, routing back to the processor won't help if the FlowFiles are certain to fail again. At the very least, you should have somewhere for them to go that allows you to spot the failures, understand the causes, and debug your data and flow.

One possible solution would be to route the FlowFiles out to disk files to be edited and reloaded. For flow application errors, a more NiFi solution is to fix the upstream cause and use the provenance data to replay the FlowFile through the fixed flow. This works great, but can be difficult to do in bulk.

Other Strategies

I like the strategy above because it is fairly general-purpose. But there is always another way with NiFi, depending on what your data looks like. Some other good options include:

  • ConvertJSONToSQL - as the name suggests, formats SQL from JSON data. Has some good options specific to this use case, like generating new FlowFiles for the SQL statement and passing on the original.
  • TransformXml - this might be a great option if you have XML data
  • ExecuteScript - if all else fails, you can use custom scripting

Bulk Load Methods

Many databases support bulk load capabilities outside the SQL standard, typically a command to load records from a text or XML file. The flexibility of NiFi shines here, as it has many features for batching incoming data, form the necessary files, then triggering the load operation. I'll try this out in a future post.