
Integrating Apache Spark and NiFi for Data Lakes


Published in: Technology


  1. MAKING BIG DATA COME ALIVE
     Integrating Apache Spark and NiFi for Data Lakes
     Ron Bodkin, Founder & President
     Scott Reisdorf, R&D Architect
  2. Agenda
     • Requirements
     • Design
     • Demo
  3. Goals for a Data Lake
     • A central repository with trusted, consistent data
     • Reduce costs by offloading analytical systems and archiving cold data
     • Derive value quickly with easier discovery and prototyping
     • A laboratory for experimenting with new technologies and data
  4. What’s Needed For A Hadoop Data Lake?
     • Automation of pipelines with metadata and performance tracking
     • Governance with a clear distinction of roles and responsibilities
     • SLA tracking with alerts on failures or violations
     • Interactive data discovery and experimentation
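The SLA-tracking bullet can be made concrete with a minimal sketch. This is illustrative plain Python, not code from the framework shown in the deck; the feed record layout and the `check_slas` helper are hypothetical.

```python
from datetime import datetime, timedelta

def check_slas(feeds, now):
    """Return alert messages for feeds that missed their SLA window.

    `feeds` maps a feed name to a dict with 'last_success' (datetime of the
    last successful run) and 'sla' (max timedelta allowed between successes).
    """
    alerts = []
    for name, info in feeds.items():
        if now - info["last_success"] > info["sla"]:
            alerts.append(f"SLA violation: feed '{name}' last succeeded "
                          f"at {info['last_success']:%Y-%m-%d %H:%M}")
    return alerts
```

In a real deployment the alert list would feed a dashboard or notification channel rather than being returned to the caller.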
  5. Example Ingestion Project
     • 4,000+ unique flat files and RDBMS tables, plus a few streaming data feeds
     • Mix of incremental and snapshot data
     • Ingest into Hadoop (minimally HDFS and Hive tables)
     • Cleansing/encryption and data validation
     • Metadata capture
     Focus shifts over time from data ingestion to transformation, then to analytics.
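For the mix of incremental and snapshot data, an incremental load has to be merged into previously ingested rows, keeping the newest version of each key (a snapshot feed would simply replace the table). A plain-Python sketch of that merge logic, with hypothetical `id` and `updated_at` field names:

```python
def merge_incremental(existing, batch, key="id", ts="updated_at"):
    """Merge an incremental batch into existing rows, keeping the newest
    version of each key by comparing the timestamp field."""
    merged = {row[key]: row for row in existing}
    for row in batch:
        current = merged.get(row[key])
        if current is None or row[ts] > current[ts]:
            merged[row[key]] = row
    return sorted(merged.values(), key=lambda r: r[key])
```

At scale this would be a Hive or Spark merge/dedupe job rather than an in-memory dict, but the keep-latest-per-key rule is the same.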
  6. Design
  7. Apache Spark Functions
     • Cleanse
     • Validate
     • Profile
     • Wrangle
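A rough, framework-independent sketch of what the cleanse, validate, and profile stages do per record. In the actual design these run as Apache Spark jobs over partitions, but the row-level logic is the same; all function and field names here are illustrative, not the framework's API:

```python
def cleanse(row):
    # Trim whitespace and normalize empty strings to None (null).
    return {k: (v.strip() or None) if isinstance(v, str) else v
            for k, v in row.items()}

def validate(row, required):
    # A row is valid when every required field is present and non-null.
    return all(row.get(f) is not None for f in required)

def profile(rows, field):
    # Minimal column profile: row count, null count, distinct values.
    values = [r.get(field) for r in rows]
    return {"count": len(values),
            "nulls": sum(v is None for v in values),
            "distinct": len({v for v in values if v is not None})}
```

In Spark, `cleanse` and `validate` would be applied per row (e.g. via DataFrame transformations) and `profile` as an aggregation over the whole column.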
  8. Pipeline Design with Apache NiFi
     • Visual drag-and-drop
     • Dozens of data connectors
     • 150+ pre-built transforms
     • Data lineage
     • Batch and streaming
     • Extensible
     © 2016 Think Big, a Teradata Company, 7/10/2016
  9. Role Separation
     Apache NiFi (IT designers):
     • Design models in NiFi
     • Register models with the framework
     • Integrated development process
     Think Big framework (users):
     • Configure new feeds
     • Based on a common model
     • Generated and executed in NiFi
     Models are registered with the framework; configured feeds are deployed back to NiFi.
  10. Design Approach
     • User features organized around organizational roles
     • Visual design
     • Streaming and batch
     • Fully governed
     • Integrated best practices
     • Secure, modern architecture
     • Will be open source (Apache license)
  11. Ingest and Prepare
     • UI-guided feed creation
     • Data protection
     • Data cleanse
     • Data validation
     • Data profiling
     • Powered by Apache Spark
  12. Data Ingest Model
     Sources: Get File(s) (filesystem), Extract Table (JDBC), Message (JMS/Kafka), Other (HTTP/REST, etc.)
     Pipeline: unpack and/or merge small files → put file (HDFS) → cleanse/standardize (Spark) → validate (Spark) → data profile (Spark) → merge/dedupe (Hive) → index text (Elasticsearch) → compress & archive originals (HDFS, S3)
     • Metadata and data policies determine the behavior of individual components
     • Adds many Hadoop-specific, higher-level NiFi processors
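The point that metadata determines the behavior of individual components can be sketched as a pipeline whose enabled steps come from feed metadata. This is a toy illustration, not the framework's API; `run_feed` and the step registry are hypothetical:

```python
def run_feed(metadata, batch, steps):
    """Run only the pipeline steps the feed's metadata enables, in order.

    `metadata['steps']` lists step names; `steps` maps each name to a
    function that takes and returns a batch of records. In the real
    system each step would be a NiFi processor invoking a Spark job.
    """
    for step_name in metadata.get("steps", []):
        batch = steps[step_name](batch)
    return batch
```

Two feeds can then share the same processor library while their metadata selects different subsets (e.g. one feed skips profiling, another adds archiving).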
  13. Data Self-Service and “Wrangle”
     • Graphical SQL builder
     • 100+ transform functions
     • Machine learning
     • Publish and schedule
     • Powered by Apache Spark
  14. Data Discovery
     • Google-like searching
     • Extensible metadata
     • Data profile
     • Data sampling
  15. Operations
     • Dashboard
     • Health monitoring
     • Data confidence
     • SLA enforcement
     • Alerts
     • Performance reports
  16. Elasticsearch: Full-Text Indexing
     • Powerful search capabilities for users against data (think Google-like searching)
     • A NiFi processor extracts source data from a Hadoop table for indexing in Elasticsearch
     • Incremental updates during ingest
     Example flow: data lake → select id, user, tweet from twitter_feed → extract JSON → index
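Incremental indexing into Elasticsearch typically goes through its `_bulk` API, where each document is indexed under its row id, so re-ingested rows overwrite their previous version. A sketch that only builds the bulk request body (no client library; the index name and id field are illustrative):

```python
import json

def bulk_index_lines(rows, index, id_field="id"):
    """Build the action/document line pairs that Elasticsearch's _bulk
    API expects: one action line and one document line per row."""
    lines = []
    for row in rows:
        lines.append(json.dumps({"index": {"_index": index, "_id": row[id_field]}}))
        lines.append(json.dumps(row))
    return "\n".join(lines) + "\n"
```

A NiFi processor would POST this body to the cluster's `_bulk` endpoint after each ingested batch, which is what makes the updates incremental.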
  17. Demo
