MAKING BIG DATA COME ALIVE
Integrating Apache Spark And NiFi
For Data Lakes
Ron Bodkin Founder & President
Scott Reisdorf R&D Architect
Agenda
• Requirements
• Design
• Demo
Goals for a Data Lake
• A central repository with trusted, consistent data
• Reduce costs by offloading analytical systems and archiving cold data
• Derive value quickly with easier discovery and prototyping
• A laboratory for experimenting with new technologies and data
What’s Needed For A Hadoop Data Lake?
• Automation of pipelines with metadata and performance tracking
• Governance with clear distinction of roles and responsibilities
• SLA tracking with alerts on failures or violations
• Interactive data discovery and experimentation
Example Ingestion Project
• 4000+ unique flat files and RDBMS tables, plus a few streaming data feeds
• Mix of incremental and snapshot data (see the extraction sketch after this list)
• Ingest into Hadoop (minimally HDFS and Hive tables)
• Cleansing/encryption and data validation
• Metadata capture
Focus shifts over time from data ingestion to transformation, then to analytics.
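A minimal sketch of what an incremental extract can look like in Spark, assuming a JDBC source with a high-water-mark column; the table, column, and watermark-store names are hypothetical, and in the framework this would be driven by metadata rather than hand-written jobs.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical incremental pull: fetch only rows changed since the last ingest.
object IncrementalExtract {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("incremental-extract")
      .enableHiveSupport()
      .getOrCreate()

    // High-water mark recorded by the previous run (assumed metadata table).
    val lastIngested = spark
      .sql("SELECT MAX(high_water_ts) FROM metadata.feed_watermarks WHERE feed = 'orders'")
      .first().getTimestamp(0)

    // Snapshot feeds would read the whole table; incremental feeds filter on the mark.
    val delta = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://source-db:5432/sales") // assumed source
      .option("dbtable", s"(SELECT * FROM orders WHERE updated_at > '$lastIngested') AS delta")
      .option("user", sys.env("DB_USER"))
      .option("password", sys.env("DB_PASS"))
      .load()

    delta.write.mode("append").saveAsTable("lake.orders_raw")
  }
}
```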
Design
Apache Spark Functions
• Cleanse
• Validate
• Profile
• Wrangle
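A minimal sketch of how the first three of these functions can look as Spark code; the column names and rules below are assumptions for illustration, not the framework's generated pipeline. Wrangling is covered later in the deck, where users compose transforms like these interactively.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object FeedFunctions {
  // Cleanse: standardize formats and fill obvious gaps.
  def cleanse(df: DataFrame): DataFrame =
    df.withColumn("email", lower(trim(col("email"))))
      .na.fill("UNKNOWN", Seq("country"))

  // Validate: split rows into pass/fail sets against a rule (here, a simple email check).
  def validate(df: DataFrame): (DataFrame, DataFrame) = {
    val rule = col("email").rlike("^[^@\\s]+@[^@\\s]+$")
    (df.filter(rule), df.filter(!rule))
  }

  // Profile: per-column summary statistics (count, mean, stddev, min, max).
  def profile(df: DataFrame): DataFrame = df.describe()
}
```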
Pipeline design with Apache NiFi
• Visual drag-and-drop
• Dozens of data connectors
• 150+ pre-built transforms
• Data lineage
• Batch and Streaming
• Extensible
Role separation
Apache NiFi side (IT designers):
• Design models in NiFi
• Register them with the framework
• Integrated development process
Think Big framework side (users):
• Configure new feeds
• Based on the common model
• Generated and executed in NiFi
(Designers register models with the framework; the framework deploys configured feeds back into NiFi.)
Design Approach
• User features around org. roles
• Visual design
• Streaming and batch
• Fully governed
• Integrated best practices
• Secure, modern architecture
• Will be open source (Apache license)
Ingest and Prepare
• UI-guided feed creation
• Data protection
• Data cleanse
• Data validation
• Data profiling
• Powered by Apache Spark
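As a hedged illustration of the data-protection and cleanse steps, a Spark job might hash or mask sensitive fields before the data lands in Hive; the table and column names below are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ProtectAndCleanse {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("protect-cleanse")
      .enableHiveSupport()
      .getOrCreate()

    val raw = spark.table("lake.customers_raw") // assumed feed landing table

    val protectedDf = raw
      .withColumn("ssn", sha2(col("ssn"), 256)) // one-way hash of a sensitive field
      .withColumn("phone", regexp_replace(col("phone"), "\\d(?=\\d{4})", "*")) // mask all but last 4 digits
      .withColumn("name", initcap(trim(col("name")))) // simple standardization

    protectedDf.write.mode("overwrite").saveAsTable("lake.customers_clean")
  }
}
```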
Data Ingest Model
Metadata and data policies determine the behavior of individual components; the framework adds many Hadoop-specific, higher-level NiFi processors.
Sources:
• Get File(s) (filesystem)
• Extract Table (JDBC)
• Message (JMS/Kafka)
• Other (HTTP/REST, etc.)
Pipeline steps:
• Unpack and/or merge small files
• Put file (HDFS)
• Cleanse/Standardize (Spark)
• Validate (Spark)
• Data Profile (Spark)
• Merge / Dedupe (Hive)
• Index Text (Elasticsearch)
• Compress & Archive Originals (HDFS, S3)
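A hedged sketch of the Merge/Dedupe step above, keeping the newest record per business key so that re-delivered rows collapse to one; the table and key names are illustrative, not the framework's generated code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object MergeDedupe {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("merge-dedupe")
      .enableHiveSupport()
      .getOrCreate()

    val incoming = spark.table("lake.orders_valid") // newly validated batch
    val existing = spark.table("lake.orders")       // previously ingested data

    // Newest row wins per business key, so duplicates collapse to one record.
    val latestPerKey = Window.partitionBy("order_id").orderBy(col("updated_at").desc)

    existing.unionByName(incoming)
      .withColumn("rn", row_number().over(latestPerKey))
      .filter(col("rn") === 1)
      .drop("rn")
      // Write to a separate table to avoid overwriting a table being read from.
      .write.mode("overwrite").saveAsTable("lake.orders_merged")
  }
}
```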
Data self-service and “wrangle”
• Graphical SQL builder
• 100+ transform functions
• Machine learning
• Publish and schedule
• Powered by Apache Spark
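Under the hood, the wrangler compiles user actions down to Spark. A rough, hypothetical equivalent of a published transform (table and column names assumed):

```scala
// What a visually built transform might compile to; names are illustrative.
val wrangled = spark.sql("""
  SELECT upper(trim(country))                AS country,
         CAST(order_total AS decimal(10, 2)) AS order_total,
         to_date(order_ts)                   AS order_date
  FROM lake.orders_merged
  WHERE order_total IS NOT NULL
""")
wrangled.write.mode("overwrite").saveAsTable("lake.orders_wrangled")
```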
Data Discovery
• Google-like searching
• Extensible metadata
• Data profile
• Data sampling
Operations
• Dashboard
• Health Monitoring
• Data Confidence
• SLA enforcement
• Alerts
• Performance reports
Elasticsearch – Full Text Indexing
• Powerful search capabilities for users against data (think Google-like searching)
• A NiFi processor extracts source data from a Hadoop table for indexing in Elasticsearch
• Incremental updates during ingest
Flow: Data Lake → select id, user, tweet from twitter_feed → extract JSON → Elasticsearch
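A hedged sketch of this indexing step using the elasticsearch-hadoop Spark connector; the index name and Elasticsearch endpoint are assumptions, and in practice the slide's NiFi processor drives this rather than a standalone job.

```scala
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._ // elasticsearch-hadoop Spark connector

object IndexTweets {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("index-tweets")
      .config("es.nodes", "elasticsearch-host:9200") // assumed ES endpoint
      .enableHiveSupport()
      .getOrCreate()

    // The slide's query: pull the searchable fields from the Hive table.
    val tweets = spark.sql("SELECT id, user, tweet FROM twitter_feed")

    // Each row is written as a JSON document into the given index.
    tweets.saveToEs("twitter_feed/tweet") // index/type addressing, as in ES 2.x/5.x
  }
}
```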
Demo

Editor's Notes

• #13 (Data self-service and “wrangle”): Notice that we delegate processing to the Spark and Hadoop cluster for much of our work.