MAKING BIG DATA COME ALIVE
Integrating Apache Spark And NiFi
For Data Lakes
Ron Bodkin Founder & President
Scott Reisdorf R&D Architect
Agenda
• Requirements
• Design
• Demo
Goals for a Data Lake
• A central repository with trusted, consistent data
• Reduce costs by offloading analytical systems and archiving cold data
• Derive value quickly with easier discovery and prototyping
• A laboratory for experimenting with new technologies and data
What’s Needed For A Hadoop Data Lake?
• Automation of pipelines with metadata and performance tracking
• Governance with clear distinction of roles and responsibilities
• SLA tracking with alerts on failures or violations
• Interactive data discovery and experimentation
Example Ingestion Project
• 4000+ unique flat files and RDBMS tables, plus a few streaming data feeds
• Mix of incremental and snapshot data (see the extraction sketch after this list)
• Ingest into Hadoop (minimally HDFS and Hive tables)
• Cleansing/encryption and data validation
• Metadata capture
Focus shifts over time from data ingestion to transformation, then to analytics.
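A minimal sketch of what an incremental extract can look like in Spark, assuming a JDBC source with a high-water-mark column; the table, column, and watermark-store names are hypothetical, and in the framework this would be driven by metadata rather than hand-written jobs.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical incremental pull: fetch only rows changed since the last ingest.
object IncrementalExtract {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("incremental-extract")
      .enableHiveSupport()
      .getOrCreate()

    // High-water mark recorded by the previous run (assumed metadata table).
    val lastIngested = spark
      .sql("SELECT MAX(high_water_ts) FROM metadata.feed_watermarks WHERE feed = 'orders'")
      .first().getTimestamp(0)

    // Snapshot feeds would read the whole table; incremental feeds filter on the mark.
    val delta = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://source-db:5432/sales") // assumed source
      .option("dbtable", s"(SELECT * FROM orders WHERE updated_at > '$lastIngested') AS delta")
      .option("user", sys.env("DB_USER"))
      .option("password", sys.env("DB_PASS"))
      .load()

    delta.write.mode("append").saveAsTable("lake.orders_raw")
  }
}
```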
Design
Apache Spark Functions
• Cleanse
• Validate
• Profile
• Wrangle
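A minimal sketch of how the first three of these functions can look as Spark code; the column names and rules below are assumptions for illustration, not the framework's generated pipeline. Wrangling is covered later in the deck, where users compose transforms like these interactively.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object FeedFunctions {
  // Cleanse: standardize formats and fill obvious gaps.
  def cleanse(df: DataFrame): DataFrame =
    df.withColumn("email", lower(trim(col("email"))))
      .na.fill("UNKNOWN", Seq("country"))

  // Validate: split rows into pass/fail sets against a rule (here, a simple email check).
  def validate(df: DataFrame): (DataFrame, DataFrame) = {
    val rule = col("email").rlike("^[^@\\s]+@[^@\\s]+$")
    (df.filter(rule), df.filter(!rule))
  }

  // Profile: per-column summary statistics (count, mean, stddev, min, max).
  def profile(df: DataFrame): DataFrame = df.describe()
}
```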
Pipeline design with Apache NiFi
• Visual drag-and-drop
• Dozens of data connectors
• 150+ pre-built transforms
• Data lineage
• Batch and Streaming
• Extensible
Role separation
Apache NiFi side (IT designers):
• Design models in NiFi
• Register them with the framework
• Integrated development process
Think Big framework side (users):
• Configure new feeds
• Based on the common model
• Generated and executed in NiFi
(Designers register models with the framework; the framework deploys configured feeds back into NiFi.)
Design Approach
• User features around org. roles
• Visual design
• Streaming and batch
• Fully governed
• Integrated best practices
• Secure, modern architecture
• Will be open source (Apache license)
Ingest and Prepare
• UI-guided feed creation
• Data protection
• Data cleanse
• Data validation
• Data profiling
• Powered by Apache Spark
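As a hedged illustration of the data-protection and cleanse steps, a Spark job might hash or mask sensitive fields before the data lands in Hive; the table and column names below are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ProtectAndCleanse {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("protect-cleanse")
      .enableHiveSupport()
      .getOrCreate()

    val raw = spark.table("lake.customers_raw") // assumed feed landing table

    val protectedDf = raw
      .withColumn("ssn", sha2(col("ssn"), 256)) // one-way hash of a sensitive field
      .withColumn("phone", regexp_replace(col("phone"), "\\d(?=\\d{4})", "*")) // mask all but last 4 digits
      .withColumn("name", initcap(trim(col("name")))) // simple standardization

    protectedDf.write.mode("overwrite").saveAsTable("lake.customers_clean")
  }
}
```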
Data Ingest Model
Metadata and data policies determine the behavior of individual components; the framework adds many Hadoop-specific, higher-level NiFi processors.
Sources:
• Get File(s) (filesystem)
• Extract Table (JDBC)
• Message (JMS/Kafka)
• Other (HTTP/REST, etc.)
Pipeline steps:
• Unpack and/or merge small files
• Put file (HDFS)
• Cleanse/Standardize (Spark)
• Validate (Spark)
• Data Profile (Spark)
• Merge / Dedupe (Hive)
• Index Text (Elasticsearch)
• Compress & Archive Originals (HDFS, S3)
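A hedged sketch of the Merge/Dedupe step above, keeping the newest record per business key so that re-delivered rows collapse to one; the table and key names are illustrative, not the framework's generated code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object MergeDedupe {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("merge-dedupe")
      .enableHiveSupport()
      .getOrCreate()

    val incoming = spark.table("lake.orders_valid") // newly validated batch
    val existing = spark.table("lake.orders")       // previously ingested data

    // Newest row wins per business key, so duplicates collapse to one record.
    val latestPerKey = Window.partitionBy("order_id").orderBy(col("updated_at").desc)

    existing.unionByName(incoming)
      .withColumn("rn", row_number().over(latestPerKey))
      .filter(col("rn") === 1)
      .drop("rn")
      // Write to a separate table to avoid overwriting a table being read from.
      .write.mode("overwrite").saveAsTable("lake.orders_merged")
  }
}
```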
Data self-service and “wrangle”
• Graphical SQL builder
• 100+ transform functions
• Machine learning
• Publish and schedule
• Powered by Apache Spark
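Under the hood, the wrangler compiles user actions down to Spark. A rough, hypothetical equivalent of a published transform (table and column names assumed):

```scala
// What a visually built transform might compile to; names are illustrative.
val wrangled = spark.sql("""
  SELECT upper(trim(country))                AS country,
         CAST(order_total AS decimal(10, 2)) AS order_total,
         to_date(order_ts)                   AS order_date
  FROM lake.orders_merged
  WHERE order_total IS NOT NULL
""")
wrangled.write.mode("overwrite").saveAsTable("lake.orders_wrangled")
```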
Data Discovery
• Google-like searching
• Extensible metadata
• Data profile
• Data sampling
Operations
• Dashboard
• Health Monitoring
• Data Confidence
• SLA enforcement
• Alerts
• Performance reports
Elasticsearch – Full Text Indexing
• Powerful search capabilities for users against data (think Google-like searching)
• A NiFi processor extracts source data from a Hadoop table for indexing in Elasticsearch
• Incremental updates during ingest
Flow: Data Lake → select id, user, tweet from twitter_feed → extract JSON → Elasticsearch
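A hedged sketch of this indexing step using the elasticsearch-hadoop Spark connector; the index name and Elasticsearch endpoint are assumptions, and in practice the slide's NiFi processor drives this rather than a standalone job.

```scala
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._ // elasticsearch-hadoop Spark connector

object IndexTweets {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("index-tweets")
      .config("es.nodes", "elasticsearch-host:9200") // assumed ES endpoint
      .enableHiveSupport()
      .getOrCreate()

    // The slide's query: pull the searchable fields from the Hive table.
    val tweets = spark.sql("SELECT id, user, tweet FROM twitter_feed")

    // Each row is written as a JSON document into the given index.
    tweets.saveToEs("twitter_feed/tweet") // index/type addressing, as in ES 2.x/5.x
  }
}
```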
Demo

Editor's Notes

• #13 (Data self-service and “wrangle”): Notice that we delegate processing to the Spark and Hadoop cluster for much of our work.