from Hacker News

Apache NiFi

by boredgamer2 on 5/11/20, 5:13 PM with 135 comments

by cs02rm0 on 5/11/20, 6:52 PM
I've used it a fair bit, though not for a couple of years. Few points, some of which may be out of date:
* I've seen customers fall into the trap of thinking they don't need expensive developers because you can drag and drop, just people who can use a mouse can crack on with NiFi.
* It persisted its config to an XML file, including the positions of boxes on the UI. Trying to keep this config in source control with multiple devs working on it was impossible.
* Some people take the view that you should use 'native' NiFi processors and not custom code. This results in huge graphs of processors with 1000s of boxes with lines between you have to follow. Made both better and worse by being able to descend and ascend levels in the graph. The complexity that way quickly becomes insane.
* You're essentially programming with it. I've no doubt you could use it to write, say, an XMPP server if so inclined. Which means you can do a great many things of huge complexity. Programming tools have developed models for inheritance and composition, abstraction, static analysis, etc. which NiFi just didn't have. The amount of repeated logic I've seen it's configuration accumulate is beyond anything I've seen from any novice programmer.
I ended up feeling like it could be an OK choice in a very small number of places, but I never got to work on one of those. The NSA linking together multiple systems with a light touch is possibly one such use case. For most everyone else, I couldn't recommend it.
by _57jb on 5/11/20, 6:13 PM
We used NiFi...one of the worst experiences.
It installs like an appliance and feels like you are grappling with a legacy tool weighed down by a classic view on architecture and maintenance.
We had built a data pipeline and it was for very high-scale data. The theory of it was very much like a TIBCO type approach around data-pipelines.
Sadly the reality was also like a TIBCO type approach around data-pipelines.
One persons experience and opinion and I am super jaded by it due to some vendor cramming it down one of our directors throats who subsequently crammed it down ours when we warned how it would turn out. It ended up being a very leaky and obtuse abstraction that didn't belong in our data-pipeline when you planned how it was maintained longer-term.
I ultimately left that company. It had to do with as much of their leadership and tooling dictation as anything else, NiFi was one of many pains. I am sure there are places that are using NiFi who will never outgrow the tool so take it with a grain of salt.
Said company ultimately struggled for the very reasons those of us who left were predicting (the tooling pipeline was a mess and was thrashing on trying to get it right, constantly breaking by forcing this solution, along with others, into the flow. Lots of finger-pointing).
Sucks to have that: "I told you so..." moment when you never wanted that outcome for them....I just couldn't be a part of their spiral anymore.
by gopalv on 5/11/20, 6:43 PM
NiFi's biggest strength is that it is a 2-way system - it is not Storm, it is not Flink, it is not Kafka, it is not SQS+Lambda.
I like to think of it like Scribe from FB, but with an extremely dynamic configuration protocol.
The places where it really shines is where you can't get away with those 3 and the problem is actually something that needs a system which can back-pressure and modify flows all the way to the source - it is a spiderweb data collection tool.
So someone trying to Complex Event Processing workflows or time-range join operations with it, will probably succeed at the small scale, but start pulling their hair out at the 5-10GB/s rate.
So its real utility is in that it deploys outside your DC, not inside it.
This is the Site-to-Site functionality and MiniFI is the smallest chunk of it, which can be shrunk into a simple C++ something you can deploy it in every physical location (say warehouse or grocery store).
The actually useful part of that is the SDLC cycle for NiFi, which lets you push updates to a flow. So you might want to start with a low granularity parsing of your payment logs on the remote side as you start, but you can switch your attention over it to & remove sampling on the fly if you want.
If you're an airline flying over the arctic, you might have an airline rated MiniFI box on board which is sending low traffic until a central controller pushes a "give me more info on fuel rates".
Or a cold chain warehouse which is monitoring temperature on average, until you notice spikes and ask for granular data to compare to power fluctuations.
It is a data extraction & collection tool, not a processing and reporting tool (though it can do that, it is still a tool for bringing data after extraction/sampling, not enrichment).
by monstrado on 5/11/20, 6:21 PM
Incredible piece of software. I've used it in production at my last two jobs. You can build almost anything in NiFi once you get into the mindset of how it works.
A good way to get started with NiFi is to use it as a highly available quartz-cron scheduler. For example, running "some process" every 5 seconds.
Disclaimer: I'm an Apache NiFi committer.
An article you might find interesting about it's ability to scale.
https://blog.cloudera.com/benchmarking-nifi-performance-and-...
Disclaimer v2: I used to work at Cloudera
by taftster on 5/11/20, 9:38 PM
NiFi at first glance sometimes just looks like a glorified GUI for building out a data-delivery application. But NiFi doesn't just compile an application to be deployed on your network. Instead, the "power" of NiFi is that it allows an operations staff to perform the regular day-in-day-out task of monitoring, regulating and if needed modifying the delivery of data to an enterprise.
NiFi gives insight to your enterprise data streams in a way that allows "active" dataflow management. If a system is down, NiFi allows dataflow operations to make changes and deal with problems directly, right at tier 1 support.
It's often the case that an enterprise software developer has an ongoing role of ensuring the healthy state of the applications from their team. They don't just develop, they are frequently on call and must ensure that data is flowing properly. NiFi helps decouple those roles, so that the operations of dataflow can be actively managed by a dedicated support team that is more tightly integrated with the "mission" of their dataflow.
NiFi additionally offers some features that most programmers skip to help with the resiliency of the application. For example:
- the concept of "back pressure" is baked into NiFi. This helps ensure that downstreams systems don't get overrun by data, allowing NiFi to send upstream signals to slow or buffer the stream.
- data provenance, the ability to see where every piece of data in the system originated and was delivered (the pedigree of the data). Includes the ability to "replay" data as needed.
- dynamic routing, allowing a dataflow operator to actively manage a stream, splicing it, or stopping delivery to one source and delivering to another. Sources and Sinks can be temporarily stopped and queued data placed into another route. Representational forms can be changed (csv -> xml -> json, avro), and even schemas can be changed based on stream.
Anyone can write a shell script that uses curl to connect with a data source, piping to grep/sed/awk and sending to a database. NiFi is more about visualizing that dataflow, seeing it in real-time, and making adjustments to it as needed. It also helps answer the "what happens when things go wrong" question, the ability to back-off if under contention, or replay in case of failure.
(disclaimer: affiliated with NiFi)
by banjoriver on 5/11/20, 7:52 PM
NiFi is vey good at reliably moving data at very high volumes, low latency, with a large number of mature integrations, in a way that allows for fine grained tuning, and i've seen first hand that it is very scalable. It's internal architecture is very principled: https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.ht...
Out of the box it is incredibly powerful and easy to use; in particular it's data provenance, monitoring, queueing, and back pressure capabilities are hard to match; custom solution would take extensive dev to even come close to the features.
It is not code, and that means it is resistant to code based tooling. For years it's critical weakness was related to migrating flows between environments, but this has been mostly resolved. If you are in a place with dev teams and separate ops teams, and lots of process required to make prod changes, then this was problematic.
However, the GUI flow programming is insanely powerful and is ideal when you need to do rapid prototyping, or quickly adapt existing pipelines; this same power and flexibility means that you can shoot yourself in the foot. As others have said, this is not a tool for non technical people; you need to understand systems, resource management, and the principles of scaling high volume distributed workloads.
This flow based visual approach makes understanding what is happening easier for someone coming later. I've seen a solution that required a dozen containers of redis, two multiple programming languages, zookeeper, a custom gui, and and mediocre operational visibility, be migrated to a simple nifi flow that was 10 connected squares in a row. The complexity of the custom solution, even though it was very stable and had nice code quality, meant that that solution became a legacy debt quickly after it was deployed. Now that same data flow is much easier to understand, and has great operational monitoring.
Some suggestions: - limit NiFi's scope to data routing and movement, and avoid data transformations or ETL in the flow. This ensures you can scale to your network limits, and aren't cpu/memory bound by the transformation of content. - constrain the scope of each instance of nifi, and not deploy 100s of flows onto a single cluster. - you can do alot with a single node, only go to a cluster for HA and when you know you need the scale.
by unixhero on 5/12/20, 7:53 AM
Phew! Happy to have read the comments here. They say a lot. I will go with Apache Airflow for all my workflow needs from now on. I wasn't entirely sure if this was the best bet, but after seeing all of this I am now.
I know a massive installation [0] which is about to be open sourced, where Apache NIFI is used in the middle of the stack as a key component. No dismissal of the capabilities this package offers intended.
[0] https://sikkerhetsfestivalen.no/bidrag2019/138
slides [slide #32]: https://static1.squarespace.com/static/5c2f61585b409bfa28a47...
by pacofvf on 5/11/20, 6:14 PM
For the love of god, don't use NiFi to trigger an Airflow DAG.
by yawz on 5/11/20, 6:34 PM
If you're considering Apache NiFi, you should also look at Apache Airflow and Uber Cadence to decide what model would work best for you.
by corndoge on 5/11/20, 6:03 PM
Can someone explain what this is? I can't find anything on the website that explains it
by rfsliva on 5/11/20, 7:55 PM
We are using NiFi as our dataflow engine for real time data ingest. We are using a current version, 1.11.4, and have several instances running including a development instance. The interface provides our team the ability to do quick iterative development and testing. An example of one of our use cases is we have 2 dataflows that ingest data from 2 different vehicle location/status systems and pump them into SQL Server. At the same time another dataflow merges the data from SQL Server and sends the data to Azure Event Hub. These dataflows were easy to setup, test and extend. This replaced a process that was written in Go.
by endlessmike89 on 5/11/20, 6:39 PM
Nifi is a good (not great) tool, mostly because of all of the functionality you get out of the box. It comes with almost any kind of connector you would need for moving data. There's a pretty steep learning curve, but once you push through that, creating a new data flow from scratch is quick and easy. It sucks that other people in this thread have had bad experiences with Nifi, and I can't say that I haven't. However, it has generally been a positive addition to my team's stack.
by haddr on 5/11/20, 6:44 PM
Had some second hand opinions on running NiFi in prod and all of them were rather negative, some saying it was a mistake. That was around one year ago. I wonder if things have changed since then.
by sixhobbits on 5/11/20, 10:10 PM
I have never heard of this before, and I'm sad that profit-driven, marketing speak has taken over even non-profit product pages.
> An easy to use, powerful, and reliable system.
This is the title. That's the most important sentence, and it's absolutely meaningless.
It's bad enough that everything has to "sell" - just describe to me what your product does and I'll decide if I need or not. Don't try to convince me.
If you have to sell, do it by differentiating yourself from your competitors. No one is calling themselves "Difficult to use, weak, and unreliable", so saying the opposite is not differentiation.
When did we accept that marketing-speak was default communication. Can't we have some landing pages that are essays? Or even a few paragraphs instead of trying-to-be-catchy bullet point phrases in large font?
by pazo on 5/11/20, 8:14 PM
I have experience from multiple projects with NiFi and it was the main reason for me and others quitting the company. Somehow management were convinced by some salesmen that this would be the golden bullet, however, all of their deliveries were delayed. We experienced issues debugging flows with performance problems, and even basic version control was problematic due to ids being replaced every time.
by josephmosby on 5/11/20, 10:16 PM
NiFi is a fantastic tool for a certain set of organizational constraints.
* It doesn't need much in the way of dependencies to run. If you can get Java onto a machine, you can probably get NiFi to run on that machine. - That is HUGE if you are operating in an environment where getting any new dependencies installed on a machine is an operational nightmare.
* It doesn't require a lot of overhead. Specifically, no database.
* You can write components for it that don't require a whole lot of tweaking for small changes to the incoming data. So, if I have a machine processing a JSON file that looks like XXYX and another machine processing a nearly identical JSON file that looks like XYXX, the tweaks can be made pretty easily.
So, if you're looking for a lightweight, low overhead, easily configurable tool that may be running in an environment where you've got to run lots of little instances that are mostly similar but not quite, NiFi is great.
If you are running a centralized data pipeline where you have a dedicated team of data engineers to keep the data flowing, there are better options out there.
by tspann on 5/11/20, 7:27 PM
No more XML. Check out NiFi 1.11.4, it does everything you need for easy ingest. If you are reading some files putting them into Kafka or S3 or a database or MongoDB or Hbase or Hive or Impala or Oracle or Kudu or ..., it's genius.
https://www.datainmotion.dev/
by Sodman on 5/12/20, 1:15 AM
Having used NiFi in production, my biggest issue with it is handling source control and multiple environments. As the "IDE" is effectively also the runtime, the lines between "local", "stage", and "prod" are easy to blur.
They have a built-in source control product called "NiFi Registry", which can even be backed by git. The workflow for promoting flows between environments feels clunky though, especially as so much environment-specific configuration is required once your number of components gets high enough.
Moving our Java, Ruby or Go code between environments or handling versioning and releases was a piece of cake, in comparison.
by tomrod on 5/11/20, 6:32 PM
Do I understand what this is: general purposes SSIS-type data integration, pipeline, and workflow tool?
If so, how does it compare to SSIS, dbt, and other projects (please name!)?
Otherwise, what is an analogous toolset?
by benjaminwootton on 5/11/20, 6:23 PM
I have been working on a new product which competes with NiFi, providing streaming data transformations.
Think, if order value > 100 and the customer has ordered 3 times in the last hour and the product will be in stock tomorrow.
Kafka streams, Flink and Dataflow are super powerful and I think there is room for a GUI tool.
Would be great to hear experiences of NiFi in this domain or discuss the space with any experienced users. Will add contact details in my profile.
by kentosi on 5/12/20, 2:06 AM
For those who were around during the mid-2000s, is this basically another revival of SOA (Service Oriented Architecture)?
I watched one of the explanation videos and it brought back memories.
My dislike of the phase back then, which I hope they've addressed now, is that while everything looked find and dandy while designing things on a UI, when something broke it was a whole heaps of generated XML no one could read.
by jszymborski on 5/11/20, 6:24 PM
So, the comments here have mostly ranged between neutral to negative regarding their experience.
I have a problem where I want to stream data to an ML layer and then stream that to a web app (e.g. Laravel or Django).
Reading the docs here, this seems like this would solve this problem, but was wondering if people had alternatives given that people seem to think poorly of this application.
by aasasd on 5/11/20, 8:52 PM
So Apache has at least a handful of software packages that do about the same thing, but with different interfaces and connectivity?
by ibishvintilli on 5/12/20, 11:04 AM
I like the way how it buffers messages. You basically stop a process and it will continue when it left off. It is easy to create a distributed cluster. It has hundreds of different connectors for external sources. On the other hand is very bulky. Making it to work with https was horrible. You cannot just put it behind a reverse proxy.
by dikei on 5/12/20, 1:22 AM
We only use NiFi on the edge of our data lake. It's very good at bulk loading, pulling log files/sensor data from hundreds of systems into our systems.
However, it does not handle small records well, and deploying custom processors is a pain, so don't use it to replace your stream processing framework.
by gatorbait83 on 5/11/20, 6:13 PM
Our team found this adapter to integrate ML with NiFI pretty handy: https://dev.to/tspannhw/easy-deep-learning-in-apache-nifi-wi...
by takeda on 5/11/20, 8:42 PM
It reminds me of LONI Pipeline[1], which was created for the need of a neuroscience lab to process images of brain scans.
[1] http://pipeline.loni.usc.edu/
by onetrickwolf on 5/11/20, 7:57 PM
I am new to this site, why is there just a link to Apache NiFI on the front page? Is this somehow news? Sorry not trying to be rude just confuses me a bit since NiFi has been around for some time.
by throwawaysea on 5/11/20, 6:58 PM
Is this an open source self hosted equivalent to https://aws.amazon.com/datapipeline/?
by fmakunbound on 5/12/20, 4:40 AM
Reminds me of the early 2000s when we were all into BPM, graphical or otherwise. The drawbacks are pretty obvious. I bet the engineers who built it had fun, tho.
by yalogin on 5/11/20, 11:51 PM
For someone not involved in the web stack, reading through that page tells me nothing. Can someone tell me what use cases this is meant to solve?
by dmtroyer on 5/11/20, 6:45 PM
Has anyone replaced Mirth with NiFi for HL7 slinging?
by hestefisk on 5/11/20, 10:33 PM
Does NiFi still use XML output for its source code? It makes it very hard to put under source control. Overall it’s a nice tool and very fast.
by meh206 on 5/12/20, 7:52 AM
Our devs hated it due to ease of accidental flow fsck ups. And good luck scaling it or trying to put it behind an LB!
by iofiiiiiiiii on 5/12/20, 7:45 PM
> An easy to use, powerful, and reliable system to process and distribute data.
But what is it?
by century19 on 5/11/20, 6:02 PM
Brought to you by the NSA ;-)
by J0_k3r on 5/12/20, 3:04 AM
inb4 this shit gets bloated to hell like apache httpd
by mberning on 5/11/20, 6:08 PM
We have a team using this at work. They had built a process and needed it to be put on a VM and run periodically. They said the requirements were a dual core machine and 8gb of ram. The “binary” was like 1.8gb. I’m sure it included a jre and a full nifi runtime, but god damn that is ridiculous. Had this process been built using go or crystal or something like that it probably would have been less than a megabyte and able to run with 512mb of ram.