Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop

Beginning with HDP 2.1, Hortonworks Data Platform ships with Apache Falcon for Hadoop data governance. Himanshu Bari, Hortonworks senior product manager, and Venkatesh Seetharam, Hortonworks co-founder and committer to Apache Falcon, lead this 30-minute webinar, including:
+ Why you need Apache Falcon
+ Key new Falcon features
+ Demo: Defining data pipelines with replication; policies for retention and late data arrival; managing Falcon server with Ambari

Published in: Software, Technology

  1. Page 1 © Hortonworks Inc. 2014 Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop. Hortonworks. We do Hadoop.
  2. Speakers • Justin Sears, Hortonworks Product Marketing Manager • Himanshu Bari, Hortonworks Senior Product Manager & PM for Apache Falcon & Apache Storm in Hortonworks Data Platform • Venkatesh Seetharam, Foundational Hadoop Architect, Engineer & Committer for Apache Falcon and Apache Knox Gateway projects
  3. Agenda • Why You Need Apache Falcon • Key New Falcon Features • Demo – Defining data pipelines – Policies for retention – Managing Falcon server with Apache Ambari
  4. A Modern Data Architecture • OPERATIONS TOOLS: Provision, Manage & Monitor • DEV & DATA TOOLS: Build & Test • APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications • DATA SYSTEM: REPOSITORIES (RDBMS, EDW, MPP) and ENTERPRISE HADOOP (Governance & Integration, Security, Operations, Data Access, Data Management) • SOURCES: OLTP, ERP, CRM Systems; Documents, Emails; Web Logs, Click Streams; Social Networks; Machine Generated; Sensor Data; Geolocation Data
  5. HDP 2.1: Enterprise Hadoop (HDP 2.1 Hortonworks Data Platform) • GOVERNANCE & INTEGRATION: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, NFS, WebHDFS) • DATA ACCESS: Batch (MapReduce), Script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), Others (In-Memory Analytics, ISV engines) • DATA MANAGEMENT: YARN: Data Operating System; HDFS (Hadoop Distributed File System), 1 … N • SECURITY: Authentication, Authorization, Accounting, Data Protection; Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox • OPERATIONS: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
  6. HDP 2.1: Enterprise Hadoop (HDP 2.1 Hortonworks Data Platform) • GOVERNANCE & INTEGRATION: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, NFS, WebHDFS) • DATA ACCESS: Batch (MapReduce), Script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), Others (In-Memory Analytics, ISV engines) • DATA MANAGEMENT: YARN: Data Operating System; HDFS (Hadoop Distributed File System), 1 … N • SECURITY: Authentication, Authorization, Accounting, Data Protection; Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox • OPERATIONS: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
  7. Outline • Falcon Overview • Features • Architecture & Demo
  8. Simple Data Pipeline in Hadoop • A simple data pipeline: Data Sources, Raw Data, Clean Data, Prepped Data, BI Tools, held in an HDFS data lake and processed with MR/Pig/Hive • Has a relatively simple Oozie workflow: Job1, Job2, Job3, JobN
  9. Quickly Gets Complicated… Typical data governance requirements: • Data stewards: Impact analysis, Monitor pipeline, Track ownership, Late data & failure handling • Compliance teams: Audit, Retention, Eviction • IT admins: Monitor infra, Replication, Archival • Business & data analysts: Verify data quality. To cover the Raw, Clean and Prep stages, teams manually write & wire multiple complex Oozie workflows plus other Hadoop tools, e.g. DistCp.
  10. Apache Falcon to the Rescue • The data pipeline (Raw, Clean, Prep) is defined in Falcon • Falcon auto-generates & orchestrates the multiple complex Oozie workflows and other Hadoop ecosystem tools, e.g. DistCp • Falcon adds the required data governance features: DEFINITION (Replication | Retention | Eviction | Late data), MONITORING, TRACING (Audit | Lineage | Tagging)
  11. Outline • Falcon Overview • Features • Architecture & Demo
  12. Falcon Basic Concepts • Cluster: Represents the “interfaces” to a Hadoop cluster • Feed: Defines a “dataset” (feeds are a.k.a. “datasets”) • Process: Consumes feeds, invokes processing logic & produces feeds. All these put together represent “Data Pipelines” in Hadoop. Diagram: CLUSTER; a FEED (aka DATASET) is INPUT TO a PROCESS; a PROCESS CREATES a FEED.
  13. Data Pipeline Definition • XML based pipeline specification • Modular: clusters, feeds & processes are defined separately and then linked together; easy to re-use across multiple pipelines • Out of the box policies: predefined policies for replication, retention & late data handling; easy customization of policies • Extensible: plug in external solutions at any step of the pipeline, e.g. invoke third party data obfuscation components
  14. Replication & Retention • Staged Data: Retain 5 Years • Cleansed Data: Retain 3 Years • Conformed Data: Retain 3 Years • Presented Data: Retain Last Copy Only • Sophisticated retention policies expressed in one place • Simplify data retention for audit, compliance, or for data re-processing
  15. Data Pipeline Monitoring • Centralized monitoring of data pipeline with Falcon + Ambari • Pipeline scheduling • Pipeline run history • Pipeline run alerts. Diagram: DATA moves through raw, clean and prep stages on Hadoop Cluster-1 and Hadoop Cluster-2 (Primary site, DR site).
  16. Data Pipeline Tracing • Data pipeline dependencies: view dependencies between clusters, datasets and processes (Purchase feed, Customer feed, Product feed, Store feed) • Data pipeline tagging: add arbitrary tags to feeds & processes (Credit feed: Sensitive, encrypted) • Data pipeline audits: know who modified a dataset, when, and into what • Data pipeline lineage: analyze how a dataset reached a particular state (File-1, File-2, File-3)
  17. Falcon User Flow (User, Falcon CLI or API, Falcon Server) • Define pipeline: create cluster entity & process XML specifications • Deploy pipeline: SUBMIT (validate and save specifications to HDFS), then SCHEDULE (kick off feeds & processes; schedule “instances” of feeds & processes to run) • Manage pipeline: ensure feeds & processes run as expected; update feeds & processes as needed; ‘instance’ suspend, resume, kill
  18. Outline • Falcon Overview • Features • Architecture & Demo
  19. Falcon Architecture: Centralized Falcon Orchestration Framework • Components: Data stewards + Hadoop admins; API & UI; AMBARI; Falcon Server; JMS; Oozie; HDFS / Hive • Flows: Entity Specs, Scheduled Jobs, Process Status • Hadoop ecosystem tools: MapRed / Pig / Hive / Sqoop / Flume / DistCP
  20. Clickstream enrichment data pipeline: use case description • Clicks & impressions data lands hourly in my primary cluster (under HDFS location /../ {date}). • Cluster is located in the Oregon data center. • Data arrives from all NA-west-coast production servers. • The input data feeds are often late by up to 4 hrs. • We need to enrich the clickstream data with Ad impression metadata and make it available to our marketing data science team for customer segmentation analysis. • Primary Hadoop cluster does not need the raw and enriched click data after 3 months. • Our IT policy requires us to back up all enriched click data and store it for 3 years in our secondary Hadoop cluster in the Virginia data center.
  21. Falcon Entity Relationships: CLICKSTREAM ENRICHMENT PIPELINE • Datasets: Clicks, Impressions, Enriched clicks • Processes: Clicks ingest, Impressions ingest, Click enrichment • Clusters: Oregon Hadoop cluster (PRIMARY CLUSTER), Virginia Hadoop cluster (BACKUP CLUSTER) • Relationship labels: Creates, Runs on, Stored on, Backup to
  22. Learn More About Data Governance in Hadoop • Hortonworks.com/labs/data-management/ • Register for the remaining 4 Discover HDP 2.1 Webinars: Hortonworks.com/webinars • Next Webinar: Apache Hadoop 2.4.0, YARN and HDFS, Wednesday, May 28, 9am Pacific
  23. Thank you!
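
Slides 12 and 21 describe clusters, feeds and processes that are defined separately and then linked into a pipeline. The Python sketch below is one rough way to picture that linkage for the clickstream example. It is not Falcon's API or XML schema: the class names, field names and entity names are hypothetical, and the assumption that the click enrichment process reads both the clicks and impressions feeds is a reading of the use case on slide 20, not something the slides state outright.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str
    role: str  # "primary" or "backup" (labels taken from slide 21)


@dataclass(frozen=True)
class Feed:
    name: str
    stored_on: tuple = ()  # names of the clusters holding this dataset


@dataclass(frozen=True)
class Process:
    name: str
    runs_on: str           # name of the cluster the process runs on
    inputs: tuple = ()     # names of the feeds it consumes
    outputs: tuple = ()    # names of the feeds it creates


def downstream_feeds(feed_name, processes):
    """Return every feed derived, directly or indirectly, from feed_name."""
    found = set()
    frontier = [feed_name]
    while frontier:
        current = frontier.pop()
        for proc in processes:
            if current in proc.inputs:
                for out in proc.outputs:
                    if out not in found:
                        found.add(out)
                        frontier.append(out)
    return found


# Hypothetical entity names for the clickstream enrichment pipeline.
clusters = [Cluster("oregon", "primary"), Cluster("virginia", "backup")]
feeds = [
    Feed("clicks", stored_on=("oregon",)),
    Feed("impressions", stored_on=("oregon",)),
    Feed("enriched-clicks", stored_on=("oregon", "virginia")),
]
processes = [
    Process("clicks-ingest", "oregon", outputs=("clicks",)),
    Process("impressions-ingest", "oregon", outputs=("impressions",)),
    Process("click-enrichment", "oregon",
            inputs=("clicks", "impressions"), outputs=("enriched-clicks",)),
]

print(sorted(downstream_feeds("clicks", processes)))  # ['enriched-clicks']
```

A traversal like `downstream_feeds` is the kind of question the "impact analysis" and "dependencies" bullets on slides 9 and 16 point at: which datasets are affected if a given feed changes.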
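
Slides 14 and 20 state retention rules per dataset and per cluster (3 months on the primary cluster, 3 years on the backup). As a minimal sketch of what such a rule means operationally, the Python function below picks the feed instances that fall outside a retention window. The function name and the day-based approximations of "3 months" and "3 years" are assumptions made for illustration; the slides do not describe how Falcon evaluates retention internally.

```python
from datetime import datetime, timedelta


def instances_to_evict(instance_times, retention, now):
    """Return instance timestamps older than the retention window, oldest first."""
    cutoff = now - retention
    return sorted(t for t in instance_times if t < cutoff)


now = datetime(2014, 5, 21, 12, 0)
# Feed instances that are 1, 30, 89, 91 and 400 days old.
instances = [now - timedelta(days=d) for d in (1, 30, 89, 91, 400)]

primary_retention = timedelta(days=90)      # "3 months", approximated as 90 days
backup_retention = timedelta(days=3 * 365)  # "3 years", approximated as 1095 days

# On the primary cluster the 91-day and 400-day instances are past retention.
print(instances_to_evict(instances, primary_retention, now))
# On the backup cluster nothing is old enough to evict yet.
print(instances_to_evict(instances, backup_retention, now))
```

Keeping the retention period as data per cluster, with no hard-coded cleanup job, mirrors the slide's point that retention policies are "expressed in one place".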
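
Slide 17 lists the actions a user takes through the Falcon CLI or API: submit, schedule, and then suspend, resume or kill. The sketch below folds those actions into a single small state machine in Python to show which action makes sense in which state. The state names and the transition table are hypothetical simplifications for illustration; the slide applies suspend, resume and kill to pipeline "instances", and the real server's states may differ.

```python
# (current state, action) -> next state; states are illustrative, not Falcon's.
TRANSITIONS = {
    ("defined", "submit"): "submitted",
    ("submitted", "schedule"): "running",
    ("running", "suspend"): "suspended",
    ("suspended", "resume"): "running",
    ("running", "kill"): "killed",
    ("suspended", "kill"): "killed",
}


def next_state(state, action):
    """Return the state reached by applying action, or raise ValueError."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action!r} while {state!r}") from None


state = "defined"
for action in ("submit", "schedule", "suspend", "resume", "kill"):
    state = next_state(state, action)
print(state)  # killed
```

Rejecting an out-of-order action, such as scheduling something that was never submitted, reflects the slide's ordering of SUBMIT before SCHEDULE.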
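
The use case on slide 20 says input feeds are often late by up to 4 hours, and slide 13 mentions a predefined policy for late data handling. One way to picture such a policy is a cut-off window after the expected arrival time, sketched below in Python. The function name, the three labels and the idea that data past the cut-off is simply not picked up are assumptions for illustration, not a description of Falcon's actual behavior.

```python
from datetime import datetime, timedelta


def classify_arrival(expected_by, arrived_at, cutoff):
    """Label an arrival as 'on-time', 'late' (within cutoff) or 'too-late'."""
    if arrived_at <= expected_by:
        return "on-time"
    if arrived_at <= expected_by + cutoff:
        return "late"
    return "too-late"


expected_by = datetime(2014, 5, 21, 10, 0)
cutoff = timedelta(hours=4)  # the "up to 4 hrs" from the use case

print(classify_arrival(expected_by, datetime(2014, 5, 21, 9, 30), cutoff))  # on-time
print(classify_arrival(expected_by, datetime(2014, 5, 21, 13, 0), cutoff))  # late
print(classify_arrival(expected_by, datetime(2014, 5, 21, 15, 0), cutoff))  # too-late
```

In this sketch a "late" instance would be a candidate for reprocessing, while a "too-late" one would need manual attention.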
