The document discusses how Apache NiFi can empower self-organizing teams by allowing them to control their own data movement pipelines. It describes NiFi as a data-movement production line that enables teams to build and change pipelines quickly. This helps reduce time-to-production and feedback loops by moving responsibility for data integration to individual teams rather than having changes controlled in a centralized manner.
[ 1 min ] [ 1.12 min ]
Hi There! I’m Sebastian
I’m a senior consultant here in EMEA and I focus on HDF technologies and Apache NiFi
Started using NiFi about three years ago in Australia.
Pre open-source days -- very different to now but that early learning has definitely helped
Using NiFi ever since those early days and really enjoyed the NiFi journey.
Fantastic product that really solves a lot of problems unlike anything else out there
One of the most transformational NiFi applications I have seen
[ 1 min ] [ 1 min ]
So first a quick poll:
Who has heard of NiFi?
Keep them up if you know what NiFi does?
Keep them up if you have ever used NiFi?
I summarise NiFi as the Data-Movement Production-line
[ 1 min ]
Data movement
Moving Data around
See on image
Business units
Machines
Bandwidth links
Availability times
Sounds easy but it isn’t
There are some tools that specialise in certain areas, but *in general*, if you're moving data, NiFi is probably what you want
[ 1:50 min ]
Production line
Flow based programming model
FlowFiles - the actual data - product
Processors - the work units
Queues - connectors
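As a rough mental model (not NiFi's actual implementation), the production-line analogy above can be sketched in a few lines of Python. The FlowFile, Processor and Queue names mirror the bullets; the uppercase transform is a made-up example of a work unit:

```python
from collections import deque
from dataclasses import dataclass, field

# Toy sketch of the flow-based model: FlowFiles (the data) move between
# Processors (the work units) through Queues (the connectors).

@dataclass
class FlowFile:
    content: bytes
    attributes: dict = field(default_factory=dict)

class Queue:
    """A connector between two processors."""
    def __init__(self):
        self._items = deque()
    def put(self, flowfile):
        self._items.append(flowfile)
    def get(self):
        return self._items.popleft() if self._items else None

class UppercaseProcessor:
    """A made-up work unit: take a FlowFile in, transform it, pass it on."""
    def __init__(self, inbound, outbound):
        self.inbound, self.outbound = inbound, outbound
    def on_trigger(self):
        ff = self.inbound.get()
        if ff is not None:
            ff.content = ff.content.upper()
            ff.attributes["processed.by"] = "UppercaseProcessor"
            self.outbound.put(ff)

# Wire a tiny two-queue "production line" and push one FlowFile through it.
inbound, outbound = Queue(), Queue()
proc = UppercaseProcessor(inbound, outbound)
inbound.put(FlowFile(b"hello nifi"))
proc.on_trigger()
result = outbound.get()
print(result.content)  # b'HELLO NIFI'
```

In real NiFi the framework schedules processors and persists queues for you; the point of the sketch is just the shape: data flows as discrete items through connected work units.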
[ 30 sec ]
Why would I use NiFi? Moving Data
What does it resemble? The production line
Don’t worry, you’ll see more later
[ 1 min ] [ 2:40 ]
So armed with that ridiculously brief overview, I want to talk about how NiFi can be used to facilitate the agile philosophy of the self-organising team
What is a self-organising team? One where the team defines the how, and management defines the what.
Why is this good? Well, let’s take an example:
Management says - at 3pm, take the red truck out the back, load all the apples and deliver across the bridge to the supermarket.
For this to go well, everything has to work, all the steps have to be known in advance.
What happens if … truck is blue? Take it? No fuel? Break down?
Putting the decisions into the hands of the people who are most qualified to make them - speeds up delivery and increases quality (often solving the problems in novel ways to boot)
Generally it increases quality and decreases time to completion
How does this work with Data Ingestion Pipelines?
[ 1 min ]
Using NiFi can help to speed up the process:
Less risk of losing perishable insights
Reduce costs
[ 2 min ]
You might have a very simplistic representation of the organisation like so
We’re agile, so we have cross-functional teams (maybe consisting of developers, sysadmins, data professionals and subject matter experts)
This depicts the information flow through the teams
This isn’t that special at the moment, looks like a normal ‘change-managed data backbone’ - everything goes through core
And anytime one of the teams needs a change, it goes through the core team.
For example, if the In Store team wants information from the Supply Chain team, they have to coordinate with core, who talks to Supply Chain, with all the complexities there.
It’s difficult to get core to prioritise your work because their requests come in thick and fast
Often leads to prioritisation by volume - or those who yell the loudest win
Often have to do testing in test infrastructure to verify that changes work
Then can submit change request
Only to find out that test isn’t actually like production, you exceed the change window and have to revert
Starting the change management cycle over again.
Even companies that don’t have these specific procedures are often bogged down in slow change cycles
BUT .. let’s assume these are all NiFi instances.
Website, core, in store etc all have their own NiFi instance - we haven’t changed the organisational structure at all
Just changed the tool that passes data around - so right now it looks the same. But let’s replay the scenario from above - In Store wants supply chain data
[ 1 min ]
Direct Connect
S2S
Easy to use once set up
Intuitive UI
Flow based
For simple data movement, anyone can use
[ 1 min ]
Team to Team - No Core
Direct connect - straight to the team in question
No core
No change requests minimal process
Team affected by change are the ones who implement it
Decisions made by people most well positioned to do so
[ 1 min ]
Individual to Individual
Just 2 people!
[ 1 min ]
Not just techies
Could theoretically be anyone on the team - not just the NiFi guru
Whoever is using the data, can get it
[ 1 min ]
Productionable
NiFi itself and these features are not gimmicks
They are the same robust, secure features used in all our deployments
[ 1 min ]
Immediate
Changes take effect immediately
Can quickly see and debug issues
[ 1 min ]
Not Just S2S!
[ 4 min ]
Flexible - almost any endpoint
Integrators dream!
Options include: files, Hadoop, Kafka, plain TCP, HTTP, JMS, CDC, WebSockets, email, HBase, MongoDB, SNMP, Solr, Splunk and even Twitter!
[ 1 min ]
Now I want to look at improvement across the organisation as a whole
You’ve seen the improvements that NiFi can bring to a team if all teams have their own NiFi instance that is under their control
But this requires NiFi everywhere, which is the case for only a handful of organisations
So how do we get there? Well we could just change overnight, right …? Mandate that all teams use NiFi? Pump huge amounts of money into looking at the potential risks, mitigating those, rolling out changes and all the traditional stuff? But ...
[ 2 min ]
[ .5 min ]
Here we have a traditional data movement pipeline
The Buyers are looking at historical sales trends and trying to see if their predictions were correct. For this to happen we need to go all the way back to the warehouse database:
Start at the database, the warehouse team want to get a report on all the items in the database
Ops probably does some sort of manual process - logging in to a firewalled machine, bringing up a shell and executing the required report
Then placed on the Shared Drive
The warehouse team pick this up and load it into Excel
Check that things look OK and pass to the Supply Chain team
However, supply chain don’t sit in the warehouse and use a different Shared Drive. So it’s placed on an SFTP server
It’s joined with other reports from other warehouses, reconciled and placed on the second HQ internal SAN
Picked up by the buyers
They don’t need to modify the report, simply ingest the data for analysis
Do so and pass the results to the business by email.
While obviously fictional, these sorts of ingestion pipelines are the norm, not the exception. They are especially hard to root out because each team is generally siloed and doesn’t have visibility of the process as a whole
Let’s pick a team that has decided to try out NiFi for automating some of this movement - the Supply Chain team.
We go ahead and install NiFi inside the Supply chain team
We change just one task at first: Using the FetchSFTP processor to watch the server and
pick up the warehouse report
And place it in a location where it can be worked on
So we have just automated one step
The employee looking after this step now knows that the file will appear, ready to be worked on
To simplify another step: once this file is received, we can send an email to the SC team notifying them of its arrival.
Now the warehouse employee is freed from doing that step.
Great! That’s two down. There’s still the manual analysis, but that’s what humans are good at, so let them do that.
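The automated step above can be sketched in plain Python. In NiFi it would be the FetchSFTP processor feeding something like PutFile and PutEmail; here the directory and file names are invented for illustration, and the "notification" is just appended to a list:

```python
import shutil
import tempfile
from pathlib import Path

def poll_and_deliver(drop_dir: Path, work_dir: Path, notifications: list) -> bool:
    """If a warehouse report has landed in drop_dir, move it to the team's
    working area and record a notification for the Supply Chain team."""
    reports = sorted(drop_dir.glob("warehouse_report_*.csv"))
    if not reports:
        return False  # nothing to do this polling cycle
    report = reports[0]
    dest = work_dir / report.name
    shutil.move(str(report), str(dest))
    notifications.append(f"Report {dest.name} is ready in {work_dir}")
    return True

# Demo with temporary directories standing in for the SFTP server and shared drive.
drop = Path(tempfile.mkdtemp())
work = Path(tempfile.mkdtemp())
(drop / "warehouse_report_2018-01.csv").write_text("sku,qty\nA1,10\n")
sent = []
moved = poll_and_deliver(drop, work, sent)
print(moved, sent)
```

The value of NiFi here is that nobody has to write or schedule this script: the same polling, moving and emailing is a couple of drag-and-drop processors.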
Now the warehouse guy comments that that is a cool thing
You say: well, if you had your own NiFi, you could do the same thing!
So you convince them to try setting up their own NiFi
At first, it just automates one step: moving the file to the FTP server
That’s great for the warehouse guys, one less thing
But now we can eliminate the clunky file → FTP → pickup → put-down chain with NiFi S2S
[ 10 sec ]
We’ve seen what benefits can come from the NiFi web
But why NiFI?
What features make it a good fit?
[ 2 min ]
Cross-functional teams - wide skill set
Want anyone to be able to come in with minimal training and be in control of their own data
Most people are aware of flow charts and Graphical web interfaces
Don’t need special software - web browser
Changes take effect immediately:
Allows faster feed-back cycles
Avoids long or specialised deployment cycles (e.g. must know how to use git, jenkins or pull requests)
Lots of good visual cues to help show where issues are and how to resolve them
[ 1:30 min ]
In many systems it’s difficult to get a snapshot of the stages
Would have to get a sample, transport to somewhere you could access and then probably download
This is often just more work for the developer
Fast feedback
Again fast feedback!
Can see exactly what has changed
(12:20 to here from slide 41)
So as I’ve mentioned previously, S2S is a very nice way to communicate with NiFi clusters
Only have to open one port
Easy to configure
Ties in nicely with the UI
Maintains Attributes!
Balances across clusters
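As a concrete illustration, enabling S2S on a receiving instance is typically just a handful of nifi.properties entries (the host name and port below are invented examples; only that one extra port needs to be reachable from the sender):

```properties
# Illustrative nifi.properties entries for Site-to-Site (example values)
nifi.remote.input.host=nifi-supplychain.example.com
nifi.remote.input.secure=true
nifi.remote.input.socket.port=10443
nifi.remote.input.http.enabled=true
```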
But as previously mentioned
Integrates with bloody everything!
Can see anyone who has made changes
Strong authorisation: you can be sure that the people you have authorised are the ones making the change
Can then reverse changes if required (no undo, but can just re-apply the old setting)
Can very tightly control who can do what on the system
E.g. have someone who can see the data but can’t move it, someone who can move but not see
Can now use process groups to group flows and secure those
Could have one group with access to one and one group with access to the other but no cross talk
Similar model to the NiFi Web but on one cluster
Has pros/cons but is possible
Make managing these things easier
Can get started in 10 minutes, but will also handle gigabytes and thousands of events per second
Can be secured