Damien Contreras futureofdata-20170428

Some good practices learned from 2 years using Apache NiFi & Hadoop.

https://www.youtube.com/watch?v=HApHmuQdEZY

Published in: Technology

  1. 2 years with NiFi: in a manufacturing company. Flow-Based Programming study group (フローベースドプログラミング勉強会). Damien Contreras (ダミアン コントレラ), Mobile & Platform Development Manager, Coca-Cola East Japan, 2017-04-28. Email: damien.contreras@ccej.co.jp, LinkedIn: damiencontreras, Twitter: @dvolute, GitHub: damienContreras
  2. Quick background: siloed data; five companies and their information systems; point-to-point connections; night batch processing and file-driven integration.
  3. Our standard dataflow pattern. NiFi's role: extract data from the data sources, verify the data, save it to HDFS & Hive, kick-start the aggregation, and notify the consumer applications. The NiFi processor flow: set the filename, table name, and partition storage location; save to the right directory based on the dataflow attributes; verify the row count; verify that all data are available; launch the ingestion into ORC (a program); launch the aggregation; notify consumers through web services with a JSON payload. The flow is parameterized by data sources, tables, validation criteria, processing, and a release trigger.
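As an illustration, the JSON payload sent to consumer applications could look like the sketch below; the deck only says a JSON payload is posted through web services, so every field name here is hypothetical:

    {
      "table": "mbew",
      "partition": "dt=2017-04-28",
      "rowCount": 184532,
      "status": "AVAILABLE",
      "location": "/data/sap/mbew/dt=2017-04-28"
    }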
  4. NiFi environment (architecture diagram). NiFi nodes (Node 0, …) connect the RDBMS, legacy systems, SAP ECC, and vending machine data on Azure to the Hadoop production and dev environments through site-to-site; the diagram contrasts the current environment with the new one (Hadoop production environment x2).
  5. Site-to-site connections simplify three scenarios: connecting local data sources with cloud resources (local data source(s) to Hadoop on the cloud), preparing the QA environment (production env to QA/dev env), and migrating data (old production to new production). Easy to implement: share the same certificate across the different NiFi instances and open the two necessary ports.
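A minimal nifi.properties sketch for such a secured site-to-site setup, assuming the two shared ports are the HTTPS web/API port and the remote input socket port (the deck does not name them; paths, passwords, and port numbers are illustrative):

    # identical keystore / truststore distributed to every NiFi instance
    nifi.security.keystore=/opt/nifi/conf/keystore.jks
    nifi.security.keystorePasswd=********
    nifi.security.truststore=/opt/nifi/conf/truststore.jks
    nifi.security.truststorePasswd=********

    # the two ports opened between the environments (assumed)
    nifi.web.https.port=9443
    nifi.remote.input.secure=true
    nifi.remote.input.socket.port=10443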
  6. Using process groups to structure flows. Rationale: make flows easy to maintain and keep the canvas uncluttered (less than 15 processors in width). How: the 1st breakdown separates extraction from ingestion/processing; the 2nd breakdown is per data source (SAP, Oracle, MDM, …); the 3rd creates logical groups; input/output ports connect all the groups, as in the sketch below.
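A sketch of the resulting canvas hierarchy; the SAP, Oracle, and MDM groups come from the slide, the logical group names are illustrative:

    Root canvas
    ├── Extraction
    │   ├── SAP
    │   ├── Oracle
    │   └── MDM
    └── Ingestion / Processing
        ├── Validation      (logical group)
        └── ORC ingestion   (logical group)

    (each group exposes input / output ports so the groups can be wired together)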
  7. Mutualize with common processors & attributes. Rationale: ease of maintenance and a focus on reusability. How: ensure that some attributes are common to all flows; use attributes to set the table name, HDFS folders, timestamp, …; and let several extraction branches share a single processor (e.g. save to HDFS, update the Hive metastore) that takes those attributes as input. A sketch follows.
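One way to realize this in NiFi is an UpdateAttribute processor per extraction branch plus a single shared PutHDFS; the attribute names and the path layout below are assumptions, only the idea of attribute-driven shared processors is from the deck:

    # UpdateAttribute, configured once per extraction branch
    tablename = mbew
    hdfs.dir  = /data/${source}/${tablename}/dt=${now():format('yyyy-MM-dd')}

    # the single shared PutHDFS processor then just reads the attribute
    Directory = ${hdfs.dir}

Because every branch sets the same attribute names, downstream processors never need branch-specific configuration.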
  8. Slow data sources. Some databases don't handle multiple concurrent connections properly, so we daisy-chain the extraction processors on their success relationship: extract the 1st table, on success extract the 2nd, then the 3rd; the results are saved to HDFS and the Hive partition is updated (see the sketch below).
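The partition update can be a single Hive statement, for example the sketch below; the partition column dt and the HDFS path are assumptions:

    ALTER TABLE mbew ADD IF NOT EXISTS PARTITION (dt='2017-04-28')
      LOCATION '/data/sap/mbew/dt=2017-04-28';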
  9. Retry & error handling. On failure a processor either retries instantly, or takes the delayed-retry path: write to an error log; every 5 minutes, read from the error log, re-process, update the error log, and send the data on. Errors bubble up as events per domain (master data / ingestion / …) into an error notification.
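Making the delayed retry replayable only requires each failure to be logged as a self-describing record; the layout below is a guess, since the deck only shows the write / read / update steps:

    {
      "flow": "master-data/ingestion",
      "table": "mbew",
      "failedProcessor": "PutHDFS",
      "failureTs": 1493347200,
      "retryCount": 2
    }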
  10. MDM (MySQL) syncing. Maxwell tails the binlogs of the MySQL MDM repository and generates row files as JSON, e.g.:

    {"database":"INTEGRATION","table":"mbew","type":"update","ts":1491046864,"xid":918998302,"data":{"field1":"value",…}}

  In NiFi: get the file, parse the JSON, determine the type of operation, and identify the table and search for its primary keys. The operations on the ORC tables in Hive: every event type triggers an insert into mbew_History_orc; on mbew_snapshot_orc, type insert performs an insert, type update an update, and type delete a delete.
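A minimal Hive sketch of those per-event operations, assuming transactional (ACID) ORC tables and a single looked-up primary-key column matnr; all column names and values are illustrative:

    -- every event, whatever its type, is appended to the history table
    INSERT INTO mbew_history_orc VALUES (1491046864, 'update', '23', 'value');

    -- the snapshot table mirrors the current state of the source row
    INSERT INTO mbew_snapshot_orc VALUES ('23', 'value');               -- type: insert
    UPDATE mbew_snapshot_orc SET field1 = 'value' WHERE matnr = '23';   -- type: update
    DELETE FROM mbew_snapshot_orc WHERE matnr = '23';                   -- type: delete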
