Discover HDP 2.2: Apache Falcon for Hadoop Data Governance

Hortonworks Data Platform 2.2 includes Apache Falcon for Hadoop data governance. In this 30-minute webinar, we discussed why the enterprise needs Falcon for governance and demonstrated data pipeline construction, data retention policies, and management with Ambari. We also discussed new features, including integration of user authentication, data lineage, an improved interface for pipeline management, and the new Falcon capability to establish an automated policy for cloud backup to Microsoft Azure or Amazon S3.


  1. Discover HDP 2.2: Apache Falcon for Hadoop Data Governance. Hortonworks: We do Hadoop. © Hortonworks Inc. 2014
  2. Speakers
     • Justin Sears, Hortonworks Product Marketing Manager
     • Andrew Ahn, Hortonworks Director of Product Management for Data Governance in Hortonworks Data Platform
     • Venkatesh Seetharam, Foundational Hadoop Architect, Committer and PMC Member for Apache Falcon
  3. Agenda
     • Introduction to Apache Falcon
     • New Innovation in Apache Falcon 0.6.0
       – HDFS Mirroring
       – Cloud Replication
     • A Look Ahead
     • Q & A
     We’ll move quickly:
     • Attendee phone lines are muted
     • Text any questions to Andrew Ahn using Webex chat
     • Questions answered at the end
     • Unanswered questions and answers in upcoming blog post
  4. Big Data, Hadoop & Data Center Re-platforming
     Business Drivers
     • From reactive analytics to proactive interactions
     • Insights that drive competitive advantage & optimal returns
     Financial Drivers
     • Cost of data systems, as % of IT spend, continues to grow
     • Cost advantages of commodity hardware & open source software
     Technical Drivers
     • Data is growing exponentially & existing systems are overwhelmed
     • Predominantly driven by NEW types of data that can inform analytics
     There is an inequitable balance between vendor and customer in the market
  5. Hadoop Value: New Types of Data
     • Clickstream: Capture and analyze website visitors’ data trails and optimize your website
     • Sensors: Discover patterns in data streaming automatically from remote sensors and machines
     • Server Logs: Research logs to diagnose process failures and prevent security breaches
     • Sentiment: Understand how your customers feel about your brand and products, right now
     • Geographic: Analyze location-based data to manage operations where they occur
     • Unstructured: Understand patterns in files across millions of web pages, emails, and documents
  6. A Shift from Reactive to Proactive Interactions
     HDP and Hadoop allow organizations to use data to shift interactions from Reactive (Post Transaction) to Proactive (Pre Decision)
     • A shift in Advertising: from mass branding to 1x1 Targeting
     • A shift in Financial Services: from Educated Investing to Automated Algorithms
     • A shift in Healthcare: from mass treatment to Designer Medicine
     • A shift in Retail: from static branding to Real-time Personalization
     • A shift in Telco: from break then fix to repair before break
  7. Enterprise Goals for the Modern Data Architecture
     • Consolidate siloed data sets, structured and unstructured
     • Central data set on a single cluster
     • Multiple workloads across batch, interactive and real time
     • Central services for security, governance and operation
     • Preserve existing investment in current tools and platforms
     • Single view of the customer, product, supply chain
     [Diagram: APPLICATIONS (Business Analytics, Custom Applications, Packaged Applications); DATA SYSTEM (RDBMS, EDW, MPP; Batch, Interactive and Real-Time workloads on YARN: Data Operating System over HDFS (Hadoop Distributed File System)); SOURCES (EXISTING Systems: CRM, ERP, Other; Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured)]
  8. YARN Transformed Hadoop & Opened a New Era
     YARN: The Architectural Center of Hadoop
     • Common data platform, many applications
     • Support multi-tenant access & processing
     • Batch, interactive & real-time use cases
     [Diagram: BATCH, INTERACTIVE & REAL-TIME DATA ACCESS engines (Script: Pig; SQL: Hive; Java/Scala: Cascading; Stream: Storm; Search: Solr; NoSQL: HBase, Accumulo; In-Memory: Spark; Others: ISV Engines; with Tez and Slider layers) on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System)]
  9. YARN Extends Hadoop to Other Data Center Leaders
     YARN: The Architectural Center of Hadoop
     • Common data platform, many applications
     • Support multi-tenant access & processing
     • Batch, interactive & real-time use cases
     • Supports 3rd-party ISV tools (e.g. SAS, Syncsort, Actian)
     YARN Ready Applications: facilitates ongoing innovation and enterprise adoption via ecosystem of new and existing “YARN Ready” solutions
     [Diagram: BATCH, INTERACTIVE & REAL-TIME DATA ACCESS engines (Script: Pig; SQL: Hive; Java/Scala: Cascading; Stream: Storm; Search: Solr; NoSQL: HBase, Accumulo; In-Memory: Spark; Others: ISV Engines; with Tez and Slider layers) on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System)]
  10. Enterprise Hadoop: Central Set of Services
      Enables Apache Hadoop to be an Enterprise Data Platform with centralized services for:
      • Governance: load data and manage according to policy
      • Operations: deploy and effectively manage the platform (Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie)
      • Security: provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
      Everything that plugs into Hadoop inherits these services
      [Diagram: GOVERNANCE, SECURITY and OPERATIONS services alongside BATCH, INTERACTIVE & REAL-TIME DATA ACCESS engines (Script: Pig; SQL: Hive; Java/Scala: Cascading; Stream: Storm; Search: Solr; NoSQL: HBase, Accumulo; In-Memory: Spark; Others: ISV Engines; with Tez and Slider layers) on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System)]
  11. Hortonworks Development Investment for the Enterprise: Vertical Integration with YARN and HDFS
      • Ensure engines can run reliably and respectfully in a YARN based cluster
      • Implement features throughout the stack to accommodate
      [Diagram: GOVERNANCE (load data and manage according to policy), SECURITY (layered approach to security through Authentication, Authorization, Accounting, and Data Protection) and OPERATIONS (deploy and effectively manage the platform; Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie) alongside BATCH, INTERACTIVE & REAL-TIME DATA ACCESS engines (Script: Pig; SQL: Hive; Java/Scala: Cascading; Stream: Storm; Search: Solr; NoSQL: HBase, Accumulo; In-Memory: Spark; Others: ISV Engines; with Tez and Slider layers) on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System)]
  12. Hortonworks Development Investment for the Enterprise: Horizontal Integration for Enterprise Services
      • Ensure consistent enterprise services are applied across the entire Hadoop stack
      • Integrate with and extend existing data center solutions for these key requirements
      [Diagram: GOVERNANCE (load data and manage according to policy), SECURITY (layered approach to security through Authentication, Authorization, Accounting, and Data Protection) and OPERATIONS (deploy and effectively manage the platform; Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie) alongside BATCH, INTERACTIVE & REAL-TIME DATA ACCESS engines (Script: Pig; SQL: Hive; Java/Scala: Cascading; Stream: Storm; Search: Solr; NoSQL: HBase, Accumulo; In-Memory: Spark; Others: ISV Engines; with Tez and Slider layers) on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System)]
  13. HDP Delivers Enterprise Hadoop: Hortonworks Data Platform 2.2
      YARN is the architectural center of HDP
      • Common data set across all applications
      • Batch, interactive & real-time workloads
      • Multi-tenant access & processing
      Provides comprehensive enterprise capabilities
      • Governance
      • Security
      • Operations
      Enables broad ecosystem adoption
      • ISVs can plug directly into Hadoop
      The widest range of deployment options
      • Linux & Windows
      • On premises & cloud
      [Diagram: GOVERNANCE (Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS); BATCH, INTERACTIVE & REAL-TIME DATA ACCESS (Script: Pig; SQL: Hive; Java/Scala: Cascading; Stream: Storm; Search: Solr; NoSQL: HBase, Accumulo; In-Memory: Spark; Others: ISV Engines; with Tez and Slider layers) on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System); SECURITY (Authentication, Authorization, Audit, Data Protection; Storage: HDFS; Resources: YARN; Access: Hive; Pipeline: Falcon; Cluster: Ranger; Cluster: Knox); OPERATIONS (Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie); Deployment Choice (Linux, Windows, On-Premises, Cloud)]
  14. HDP Delivers Enterprise Hadoop: Hortonworks Data Platform 2.2
      YARN is the architectural center of HDP
      • Common data set across all applications
      • Batch, interactive & real-time workloads
      • Multi-tenant access & processing
      Provides comprehensive enterprise capabilities
      • Governance
      • Security
      • Operations
      Enables broad ecosystem adoption
      • ISVs can plug directly into Hadoop
      The widest range of deployment options
      • Linux & Windows
      • On premises & cloud
      [Diagram: BATCH, INTERACTIVE & REAL-TIME DATA ACCESS (Script: Pig; SQL: Hive; Java/Scala: Cascading; Stream: Storm; Search: Solr; NoSQL: HBase, Accumulo; In-Memory: Spark; Others: ISV Engines; with Tez and Slider layers) on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System); SECURITY (Authentication, Authorization, Audit, Data Protection; Storage: HDFS; Resources: YARN; Access: Hive; Pipeline: Falcon; Cluster: Ranger; Cluster: Knox); OPERATIONS (Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie); Deployment Choice (Linux, Windows, On-Premises, Cloud); GOVERNANCE (Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS)]
  15. Introduction to Apache Falcon
  16. Falcon Overview: The data traffic cop
      Centrally Manage Data Lifecycle
      – Centralized definition & management of pipelines for data ingest, process & export
      Business Continuity & Disaster Recovery
      – Out of the box policies for data replication & retention
      – End to end monitoring of data pipelines
      Address audit & compliance requirements
      – Visualize data pipeline lineage
      – Track data pipeline audit logs
      – Tag data with business metadata
  17. Falcon Architecture: Centralized Falcon Orchestration Framework
      [Diagram labels: Data stewards + Hadoop admins; API & UI; AMBARI; Falcon Server; Entity Specs; Scheduled Jobs; Process Status; JMS; Oozie; HDFS / Hive; Hadoop ecosystem tools: MapRed / Pig / Hive / Sqoop / Flume / DistCP]
  18. Data Pipeline: Definition
      • XML based pipeline specification
        – Modular: clusters, feeds & processes defined separately and then linked together
        – Easy to re-use across multiple pipelines
      • Out of the box policies
        – Predefined policies for replication, late data handling & eviction
        – Easy customization of policies
      • Extensible
        – Plug in external solutions at any step of the pipeline
        – E.g. invoke third party data obfuscation components
  19. Data Pipeline: Monitoring
      Centralized monitoring of data pipeline with Falcon + Ambari
      • Pipeline run alerts
      • Pipeline run history
      • Pipeline scheduling
      [Diagram: Hadoop Cluster-1 (Primary site) and Hadoop Cluster-2 (DR site), each holding raw, clean and prep data]
  20. Data Pipeline: Tracing
      • Data pipeline dependencies: view dependencies between clusters, datasets and processes (diagram feeds: Store, Customer, Purchase, Product)
      • Data pipeline tagging: add arbitrary tags to feeds & processes (diagram: Credit feed tagged Sensitive, Encrypted)
      • Data pipeline audits: know who modified a dataset when and into what
      Coming Soon
      • Data pipeline lineage: analyze how a dataset reached a particular state (diagram: File-1, File-2, File-3)
  21. Replication with Falcon
      • Falcon manages workflow and replication
      • Enables business continuity without requiring full data reprocessing
      • Failover clusters can be smaller than primary clusters
      [Diagram: Primary Hadoop Cluster (Staged Data, Cleansed Data, Conformed Data, Presented Data) with Replication of Staged Data and Presented Data to a Failover Hadoop Cluster; BI / Analytics (BusinessObjects BI)]
  22. Data Retention with Falcon
      • Sophisticated retention policies expressed in one place
      • Simplify data retention for audit, compliance, or for data re-processing
      [Diagram: Retention Policy per dataset. Datasets: Staged Data, Cleansed Data, Conformed Data, Presented Data. Policy labels: Retain 5 Years, Retain Last Copy Only, Retain 3 Years, Retain 3 Years]
  23. Late Data Handling with Falcon
      • Processing waits until all required input data is available
      • Checks for late data arrivals and retriggers processing as necessary
      • Eliminates writing complex data handling rules within applications
      [Diagram: Online Transaction Data (via Sqoop) and Web Log Data (via FTP) feed Staged Data and Combined Data; wait up to 4 hours for FTP data to arrive]
  24. Falcon Investment Plans (DATES AND FEATURES SUBJECT TO CHANGE)
      November 2014
      • Authentication & Authorization Integration
      • Pipeline, (HDFS file & Hive) table Lineage GA
      • HDFS DR Replication with Recipes
      • UI for Lineage management
      • Replicate to Cloud - Azure & S3
      Post-HDP 2.2 Tech Preview
      • Hive/HCat metastore Replication
      • Expanded UI Entity creation and management
      Future Release
      • Hive/HCat metastore Replication GA
      • Pipeline Run Notification via SNMP, e-mail, etc.
      • Hive ACID support
      • HDFS Snapshot Integration
      • File import SSH & SCP
      • Visual Pipeline Designer
      • Resource Metrics
      • Automated migration of data through HDFS storage tiers
  25. New in Apache Falcon 0.6.0: HDFS Mirroring
  26. DR Mirroring of HDFS with Recipes
      • Mirroring for Disaster Recovery and Business Continuity use cases
      • Customizable for multiple targets and frequency of synchronization
      • Recipes: template model for re-use of complex workflows
      [Diagram: three Recipes, each pairing Properties with a Workflow Template (Reduce, Cleanse, Replicate)]
  27. New in Apache Falcon 0.6.0: Cloud Replication
  28. Replication to Cloud
      • Seamlessly replicate to Cloud targets
      • Replicate from Cloud as a source
      • Support for Amazon S3 and Microsoft Azure
      [Diagram: On Prem Cluster, Azure, Amazon S3]
  29. A Look Ahead
  36. Q & A
  37. Thank you! Learn more at: hortonworks.com/hadoop/falcon/. Register for the remaining 5 Discover HDP 2.2 Webinars: Hortonworks.com/webinars
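
Slide 22 describes retention policies such as "Retain 5 Years", "Retain 3 Years" and "Retain Last Copy Only". As a rough illustration of how such policies could be evaluated, here is a minimal Python sketch. It is not Falcon's implementation or API: `RetentionPolicy`, `years` and `instances_to_evict` are hypothetical names invented for this example, the sample timestamps are made up, and a year is approximated as 365 days.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class RetentionPolicy:
    """Hypothetical policy object (not a Falcon type).

    limit: keep instances younger than this age; None means no age limit.
    last_copy_only: keep only the most recent instance.
    """
    limit: Optional[timedelta] = None
    last_copy_only: bool = False


def years(n: int) -> timedelta:
    # Assumption: a year is approximated as 365 days.
    return timedelta(days=365 * n)


def instances_to_evict(instances: List[datetime],
                       policy: RetentionPolicy,
                       now: datetime) -> List[datetime]:
    """Return the dataset instances the policy would delete, oldest first."""
    ordered = sorted(instances)
    if policy.last_copy_only:
        return ordered[:-1]          # everything except the newest copy
    if policy.limit is None:
        return []                    # no limit set: keep everything
    cutoff = now - policy.limit
    return [t for t in ordered if t < cutoff]


if __name__ == "__main__":
    now = datetime(2014, 11, 1)
    instances = [datetime(2008, 1, 1), datetime(2010, 6, 1),
                 datetime(2013, 1, 1), datetime(2014, 10, 1)]
    for label, policy in [("retain 5 years", RetentionPolicy(limit=years(5))),
                          ("retain 3 years", RetentionPolicy(limit=years(3))),
                          ("last copy only", RetentionPolicy(last_copy_only=True))]:
        print(label, "->", instances_to_evict(instances, policy, now))
```

Keeping the policy as data, separate from the eviction logic, mirrors the slide's point that retention rules are expressed in one place rather than inside each application.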

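
Slide 23 describes late data handling: processing waits until all required inputs are available, waits up to 4 hours for late FTP data, and retriggers processing when late data arrives. The following Python sketch shows one way to express that decision logic. It is an illustration under stated assumptions, not Falcon's code: `next_action` and `should_retrigger` are hypothetical helpers, the feed names are invented, and the "timed_out" outcome after the cut-off is an assumption, since the slide does not say what happens then.

```python
from datetime import datetime, timedelta
from typing import Iterable

# From the slide: wait up to 4 hours for late data to arrive.
CUT_OFF = timedelta(hours=4)


def next_action(nominal: datetime,
                now: datetime,
                required: Iterable[str],
                arrived: Iterable[str],
                cut_off: timedelta = CUT_OFF) -> str:
    """Decide what to do with one pipeline instance.

    Returns "run" when every required input has arrived, "wait" while
    inputs are missing but the cut-off has not passed, and "timed_out"
    otherwise (assumption: the slide does not define this case).
    """
    if set(required) <= set(arrived):
        return "run"
    if now <= nominal + cut_off:
        return "wait"
    return "timed_out"


def should_retrigger(nominal: datetime,
                     late_arrival: datetime,
                     cut_off: timedelta = CUT_OFF) -> bool:
    """True when data that arrives after a run still falls inside the cut-off window."""
    return nominal <= late_arrival <= nominal + cut_off


if __name__ == "__main__":
    nominal = datetime(2014, 11, 1, 0, 0)
    required = {"transactions", "weblogs"}   # hypothetical feed names
    print(next_action(nominal, nominal + timedelta(hours=1), required, {"transactions"}))
    print(should_retrigger(nominal, nominal + timedelta(hours=3)))
```

Putting the cut-off in one shared value, rather than in each job, reflects the slide's claim that applications no longer need their own late-data rules.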