Implementing a Data Lake with Enterprise Grade Data Governance

Hadoop provides a powerful platform for data science and analytics, where data engineers and data scientists can draw on a wide range of external and internal data sources to uncover new insights. That power also brings new challenges: the business wants more and more self-service, while IT tries to keep up with the demand for data and still maintain architecture and data governance standards.

In this webinar, Andrew Ahn, Data Governance Initiative Product Manager at Hortonworks, will address the gaps and offer best practices in providing end-to-end data governance in HDP. Andrew Ahn will be followed by Oliver Claude of Waterline Data, who will share a case study of how Waterline Data Inventory works with HDP in the Modern Data Architecture to automate the discovery of business and compliance metadata, data lineage, as well as data quality metrics.

Published in: Software

Implementing a Data Lake with Enterprise Grade Data Governance

  1. Implementing a Data Lake with Enterprise Grade Data Governance. We do Hadoop. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
  2. Your speakers: Andrew Ahn, Governance Product Manager, Hortonworks; Oliver Claude, CMO at Waterline
  3. HDP: Data Governance. We Do Hadoop
  4. Enterprise Data Governance Goals
     GOAL: Provide a common approach to data governance across all systems and data within the organization
     •  Transparent: Governance standards & protocols must be clearly defined and available to all
     •  Reproducible: Recreate the relevant data landscape at a point in time
     •  Auditable: All relevant events and assets must be traceable with appropriate historical lineage
     •  Consistent: Compliance practices must be consistent
     [Diagram: Governance Framework spanning ETL/DQ, BPM, Business Analytics, Visualization & Dashboards, ERP, CRM, SCM, MDM, and Archive]
  5. Data Governance Challenges WITHIN Hadoop
     •  No comprehensive governance within the Hadoop stack
     •  Mostly disjoint, as each project defines its own future and there is no common framework
     •  Disparate tools, such as HCatalog, Ranger, and Falcon, provide pieces of the overall solution
     •  No integration with external governance frameworks
     •  Difficult to get right because each project is autonomous and you need insight into traditional IT
     [Diagram: Apache Pig, Apache Hive, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm]
  6. Data Governance Initiative for Hadoop
     TWO Requirements
     1.  Hadoop must snap in to the existing frameworks and be a good citizen
     2.  Hadoop must also provide governance within its own stack of technologies
     A group of companies dedicated to meeting these requirements in the open
     [Diagram: Data Governance Initiative, Common Governance Framework; enterprise systems: ETL/DQ, BPM, Business Analytics, Visualization & Dashboards, ERP, CRM, SCM, MDM, Archive; Hadoop projects: Apache Pig, Apache Hive, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm]
  7. Common Data Governance Use Cases
     •  Financial: Reporting Chain of custody, Lineage Narratives
     •  Telco: Device log management, Correlation, Analysis, and Mitigation
     •  Retail: Point of sale analysis, Price optimization
     •  Healthcare: 30 day measures reporting
  8. Apache Atlas Overview. We Do Hadoop
  9. New Project Proposal: Apache Atlas
     Apache Atlas: Proposed open source project aimed at solving the Hadoop data governance challenge in the open.
     Key Capabilities
     •  Data Classification
     •  Metadata Exchange
     •  Centralized Auditing
     •  Search & Lineage (Browse)
     •  Security & Policy Engine
     Essential Timeline
     1H 2015 GA
     •  REST API
     •  Centralized Taxonomy
     •  Import / export metadata
     •  Basic Policy Rules Engine
     •  Real-time access control
     •  Column Level Tagging
     Phase-2
     •  Advanced audit reporting
     •  Advanced Policy Engine
     •  Row / Column Masking
     •  3rd party Metadata exchange
     Phase-3
     •  Collaboration Features
     •  Self Service
     •  Steward Delegation
     •  Profiling & Pattern Analysis
     •  Visualization
     [Diagram: Apache Atlas components: Knowledge Store, Audit Store, Models, Type-System, Policy Rules, Taxonomies, Tag Based Policies, Data Lifecycle Management, Real Time Tag Based Access Control, REST API, Services (Search, Lineage, Exchange); industry taxonomies: Healthcare (HIPAA, HL7), Financial (SOX, Dodd-Frank), Energy (PPDM), Retail (PCI, PII), Other (CWM)]
  10. Apache Atlas Capabilities: Overview
      Data Classification
      •  Import or define taxonomies of business-oriented annotations for data
      •  Define, annotate, and automate capture of relationships between data sets and underlying elements, including source, target, and derivation processes
      •  Export metadata to third-party systems
      Centralized Auditing
      •  Capture security access information for every application, process, and interaction with data
      •  Capture the operational information for execution, steps, and activities
      Search & Lineage (Browse)
      •  Pre-defined navigation paths to explore the data classification and audit information
      •  Text-based search features locate relevant data and audit events across the Data Lake quickly and accurately
      •  Browse visualization of data set lineage, allowing users to drill down into operational, security, and provenance-related information
      Security & Policy Engine
      •  Rationalize compliance policy at runtime based on data classification schemes
      •  Advanced definition of policies for preventing data derivation based on classification (i.e. re-identification)
      [Diagram: Apache Atlas components: Knowledge Store, Audit Store, Models, Type-System, Policy Rules, Taxonomies, Tag Based Policies, Data Lifecycle Management, Real Time Tag Based Access Control, REST API, Services (Search, Lineage, Exchange); industry taxonomies: Healthcare (HIPAA, HL7), Financial (SOX, Dodd-Frank), Energy (PPDM), Retail (PCI, PII), Other (CWM)]
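The lineage capability on this slide (relationships between data sets, including source, target, and derivation processes) can be pictured as a directed graph. What follows is a minimal Python sketch of that idea only; it is not Atlas code and does not use the Atlas API. The data set names and the `upstream` helper are hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: each target data set maps to the sources it was derived from.
LINEAGE = {
    "sales_report": ["sales_clean"],
    "sales_clean": ["sales_raw", "store_dim"],
    "sales_raw": [],
    "store_dim": [],
}


def upstream(dataset, lineage):
    """Return every data set that `dataset` was derived from, directly or indirectly."""
    seen = set()
    queue = deque(lineage.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited; also protects against cycles
        seen.add(current)
        queue.extend(lineage.get(current, []))
    return seen


if __name__ == "__main__":
    print(sorted(upstream("sales_report", LINEAGE)))
```

A breadth-first walk like this is one simple way to answer "where did this data come from?"; a drill-down lineage browser would attach operational and security details to each node.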
  11. Apache Atlas Overview: Knowledge Store
      Knowledge store categorized with appropriate business-oriented taxonomy
      •  Data sets & objects
      •  Tables / Columns
      •  Logical context
      •  Source, destination
      Support exchange of metadata between foundation components and third-party applications/governance tools
      Leverages existing Hadoop metastores
      [Diagram: Apache Atlas component overview; focus: Knowledge Store]
  12. Apache Atlas Overview: Data Lifecycle Management
      Leverage existing investment in Apache Falcon with a focus on:
      •  Provenance
      •  Multi-cluster replication
      •  Data set retention/eviction
      •  Late data handling
      •  Automation
      [Diagram: Apache Atlas component overview; focus: Data Lifecycle Management]
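To make "data set retention/eviction" concrete, here is a small Python sketch of the underlying logic: given dated partitions and a retention window, decide which partitions are due for eviction. It is an illustration only, not Falcon configuration or API, and the function name `partitions_to_evict` is hypothetical.

```python
from datetime import date, timedelta


def partitions_to_evict(partition_dates, today, retention_days):
    """Return the partition dates older than the retention window, oldest first."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in partition_dates if d < cutoff)


if __name__ == "__main__":
    parts = [date(2015, 3, 1), date(2015, 1, 1), date(2015, 2, 1)]
    for old in partitions_to_evict(parts, today=date(2015, 3, 15), retention_days=30):
        print("evict", old.isoformat())
```

In a lifecycle tool the retention window is typically declared once per data set and applied on a schedule, rather than computed ad hoc like this.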
  13. Apache Atlas Overview: Audit Store
      Historical repository for all governance events
      •  Security: Access Grant & Deny
      •  Operational: Data Provenance & Metrics
      •  Indexed and Searchable
      [Diagram: Apache Atlas component overview; focus: Audit Store]
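As a rough illustration of an "indexed and searchable" repository of governance events, the following Python sketch keeps an append-only list of events plus an index by asset. It is a toy model under assumed names (`AuditEvent`, `AuditStore`), not the Atlas audit store or its schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditEvent:
    user: str
    asset: str
    action: str   # e.g. "read" or "write"
    outcome: str  # "grant" or "deny"


class AuditStore:
    """Append-only event list with an index by asset for fast lookup."""

    def __init__(self):
        self._events = []
        self._by_asset = defaultdict(list)

    def record(self, event):
        self._events.append(event)
        self._by_asset[event.asset].append(event)

    def search(self, asset, outcome=None):
        """Return events for `asset`, optionally filtered by outcome."""
        hits = self._by_asset.get(asset, [])
        return [e for e in hits if outcome is None or e.outcome == outcome]


if __name__ == "__main__":
    store = AuditStore()
    store.record(AuditEvent("ann", "hive.sales.ssn", "read", "deny"))
    store.record(AuditEvent("bob", "hive.sales.ssn", "read", "grant"))
    print(store.search("hive.sales.ssn", outcome="deny"))
```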
  14. Apache Atlas Overview: Security
      Integration with HDP Advanced Security investments to ensure compliance. Establish global security policies based on data classification. Leverages Ranger plug-in architecture for policy enforcement
      [Diagram: Apache Atlas component overview; focus: Security]
  15. Apache Atlas Overview: Policy Engine
      Runtime rationalization of policy rules with respect to data asset combinations and time. Fully extensible.
      •  Metadata based
      •  Geo based rules
      •  Time-based rules
      •  Hive Column Prohibitions
      •  Preview: Hive Row and Column Masking
      [Diagram: Apache Atlas component overview; focus: Policy Rules, Policy Engine]
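The rule types listed on this slide (metadata-based, geo-based, time-based) can be illustrated with a toy evaluator. This Python sketch is an assumption-laden illustration, not the Atlas policy engine or a Ranger plug-in: the `Request` type, the `deny_reasons` function, the "PII" tag, the allowed countries, and the business-hours window are all hypothetical.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Request:
    user_country: str
    local_time: time
    column_tags: frozenset


def deny_reasons(request):
    """Evaluate three illustrative rules and return the reasons access is denied."""
    reasons = []
    # Metadata-based rule: columns tagged PII are prohibited.
    if "PII" in request.column_tags:
        reasons.append("column tagged PII")
    # Geo-based rule: only requests from allowed countries pass.
    if request.user_country not in {"US", "CA"}:
        reasons.append("country not allowed")
    # Time-based rule: access is limited to 08:00-18:00 local time.
    if not time(8, 0) <= request.local_time <= time(18, 0):
        reasons.append("outside business hours")
    return reasons


if __name__ == "__main__":
    req = Request("DE", time(22, 30), frozenset({"PII"}))
    print(deny_reasons(req))
```

Returning the list of reasons, rather than a bare yes/no, keeps each decision explainable, which matters when decisions must also be recorded for audit.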
  16. Apache Atlas Overview: RESTful interface
      •  Extensible enterprise classification of data assets, relationships, and policies organized in a meaningful way, aligned to business organization
      •  Supports exploration via user interface
      •  Supports extensibility via API and CLI exposure
      [Diagram: Apache Atlas component overview]
  17. Coming 2H 2015
  18. Apache Atlas Overview: Enhanced Audit Store
      Historical repository for all governance events
      •  Immutable file format
      •  Events Metadata Taggable
      •  Advanced Reporting
      •  Security: Access Grant & Deny
      •  Operational: Data Provenance & Metrics
      •  Indexed and Searchable
      [Diagram: Apache Atlas component overview; focus: Audit Store]
  19. Summary
  20. Apache Atlas Capabilities: Overview (repeats slide 10)
  21. Governance Ready Certification Program
      •  Curated group of vendor partners to provide rich & complete features
      •  Customers choose the features that they want to deploy, a la carte
      •  Low switching costs!
      •  HDP at core to provide stability and interoperability
      [Diagram: Discovery, Tagging, Prep / Cleanse, ETL, Governance, BPM, Self Service, Visualization]
  22. Waterline Data improves speed to value and compliance
      •  Deliver a Business-Ready Data Lake
      •  Accelerate Data Prep Process
      •  Govern Data in Hadoop
      [Diagram: Data Warehouse Offload, Data Science/Analytics Sandbox, Data Lake; axis labels: VALUE CREATION, COST SAVINGS]
  23. Find, understand and govern data in Hadoop
  24. The Modern Data Architecture
  25. Apache Atlas Capabilities: Overview
      Business Glossary, Automated Classification (Tagging), Automated Lineage Discovery, Profiling and Data Quality, Schema Discovery, Change Detection and Audit
      Exchanged via REST API:
      •  Glossary
      •  Tags
      •  Lineage
      •  Models
      [Diagram: Apache Atlas components: Knowledge Store, Audit Store, Models, Type-System, Policy Rules, Taxonomies, Tag Based Policies, Data Lifecycle Management, Real Time Tag Based Access Control, REST API, Services (Search, Lineage, Exchange); industry taxonomies: Healthcare (HIPAA, HL7), Financial (SOX, Dodd-Frank), Energy (PPDM), Retail (PCI, PII), Other (CWM)]
  26. Governance Ready Certification Program
      [Diagram: Discovery, Tagging, Prep / Cleanse, ETL, Governance, BPM, Self Service, Visualization]
  27. Imagine shopping on Amazon.com
      [Diagram: Inventory, Find and Understand, Provision, with GOVERNANCE across all three]
  28. Waterline Data is like Amazon.com for data in Hadoop
      [Diagram: Inventory, Find and Understand, Provision, with GOVERNANCE across all three]
  29. Inventory
  30. Find and Understand
  31. Provision
  32. Governance
  33. Find, understand and govern data in Hadoop
      •  Big Data IT Architect: Deliver a Business-Ready Data Lake
      •  Data Engineer/Data Scientist: Accelerate Data Prep Process
      •  CDO/Data Steward: Govern Data in Hadoop
  34. Deliver a business-ready data lake
      “It’s easy to get data into Hadoop, but it’s not necessarily easy to get data out of Hadoop. There is a need for data as a service to help the business find, understand, and govern data in Hadoop.”
      Joe DosSantos, EMC Big Data Practice Leader
  35. Deliver a business-ready data lake (repeats slide 34)
  36. Accelerate data prep process
      “80% of Big Data analytics is data prep, and 80% of data prep is inventorying data.”
      Data Engineering Director, Financial Services
  37. Accelerate data prep process
      “Waterline Data fills a critical gap in big data exploratory analytics by automating the tagging and cataloging of data, which in turn can help analytic teams provision the right data for their analyses.”
      Tony Baer, Principal Analyst, Ovum
  38. Govern data in Hadoop
      “Data lakes therefore carry substantial risks. The most important is the inability to determine data quality or the lineage of findings by other analysts or users that have found value, previously, in using the same data in the lake. By its definition, a data lake accepts any data, without oversight or governance. Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a data swamp. And without metadata, every subsequent use of data means analysts start from scratch.”
      “Gartner Says Beware of the Data Lake Fallacy” post on the Gartner website
  39. Govern data in Hadoop
      “The first step to governing Big Data is to build an inventory.”
      Sunil Soares, Managing Partner, Information Asset
  40. Best practice approach to implement an enterprise grade data lake
      1. Create and populate landing area
      2. Build inventory of data
      3. Integrate with enterprise metadata repository
      4. Protect sensitive data
      5. Open up to users
      6. Monitor and maintain
  41. Best practices in deployment landscape: 1. Create and populate landing area
      •  Create Landing directory structure
      •  Set up ETL processes using Falcon to orchestrate
      •  Implement ETL jobs using ETL tools (Syncsort, Talend, Informatica, etc.), Hadoop tools (Sqoop, Flume, etc.), or FTP
      [Diagram: deployment landscape; components shown: Falcon]
  42. Best practices in deployment landscape: 2. Build inventory of data
      •  Crawl the cluster
      •  Profile files
      •  Automatically discover technical, business, and compliance metadata at a field level
      •  Create Hive tables as needed
      •  Import lineage
      •  Export to Atlas
      [Diagram: deployment landscape; components shown: Falcon, HCatalog, Atlas]
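To illustrate what "profile files" and "discover metadata at a field level" can mean in practice, here is a minimal Python sketch that suggests a tag for a column when most of its values match a pattern. It is not Waterline Data Inventory's method; the patterns, the threshold, and the `profile_fields` function are assumptions made for the example.

```python
import csv
import io
import re

# Hypothetical tag patterns; a real profiler would use far richer rules.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def profile_fields(csv_text, threshold=0.8):
    """Suggest a tag per column when at least `threshold` of its non-empty values match."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    tags = {}
    if not rows:
        return tags
    for field in rows[0].keys():
        values = [r[field] for r in rows if r[field]]
        if not values:
            continue
        for tag, pattern in PATTERNS.items():
            share = sum(bool(pattern.match(v)) for v in values) / len(values)
            if share >= threshold:
                tags[field] = tag
                break
    return tags


SAMPLE = (
    "name,contact,id\n"
    "Ann,ann@example.com,123-45-6789\n"
    "Bob,bob@example.org,987-65-4321\n"
)

if __name__ == "__main__":
    print(profile_fields(SAMPLE))
```

Tagging on the share of matching values, rather than requiring every value to match, tolerates a few dirty rows, which is common in landing-area files.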
  43. Best practices in deployment landscape: 3. Integrate with enterprise metadata repository
      •  Import business glossary terms and export new tags and updated definitions
      •  Synchronize Atlas and Waterline Data Inventory
      •  Export metadata and lineage from Hadoop to Enterprise repository
      [Diagram: deployment landscape; components shown: Falcon, HCatalog, Atlas]
  44. Best practices in deployment landscape: 4. Protect sensitive data
      •  Use Waterline Data Inventory to find sensitive data
      •  Create access privileges in Ranger
      •  Encrypt or de-identify
      [Diagram: deployment landscape; components shown: HCatalog, Ranger, Falcon, Atlas]
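One common way to "de-identify" a field is to replace it with a keyed hash, so that equal inputs still map to equal tokens and joins keep working. The Python sketch below shows that technique with the standard library; it is one possible approach, not the method prescribed by the slides, and the function names and sample key are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Replace a sensitive value with a keyed SHA-256 hash (hex string)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(record, sensitive_fields, secret_key):
    """Return a copy of `record` with the sensitive fields pseudonymized."""
    return {
        key: pseudonymize(val, secret_key) if key in sensitive_fields else val
        for key, val in record.items()
    }


if __name__ == "__main__":
    key = b"example-key-keep-the-real-one-secret"  # assumption: key management is out of scope here
    row = {"name": "Ann", "ssn": "123-45-6789"}
    print(deidentify(row, {"ssn"}, key))
```

A keyed hash is used instead of a plain hash because low-entropy values such as identification numbers are otherwise easy to recover by brute force.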
  45. Best practices in deployment landscape: 5. Open up to users
      •  Create account with Kerberos, LDAP, etc.
      •  Set up ACLs (leverage Ranger)
      •  Users can browse securely through Waterline Data Inventory
      [Diagram: deployment landscape; components shown: HCatalog, Ranger, Falcon, Atlas]
  46. Best practices in deployment landscape: 6. Monitor and maintain
      •  Continue profiling new or changed files and sync with Atlas
      •  Continue monitoring for sensitive data, use Ranger to protect
      •  Build a folksonomy and synchronize with business glossary in Atlas and Enterprise Business Glossary
      [Diagram: deployment landscape; components shown: HCatalog, Ranger, Falcon, Atlas]
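"Continue profiling new or changed files" implies some form of change detection. A simple way to sketch it in Python is to compare content fingerprints against those recorded on the previous run. This is an illustration under assumed names (`fingerprint`, `files_to_reprofile`) and an in-memory stand-in for the file system, not how any of the named products detect changes.

```python
import hashlib


def fingerprint(content):
    """Return a SHA-256 hex digest of the file content (bytes)."""
    return hashlib.sha256(content).hexdigest()


def files_to_reprofile(current_files, known_fingerprints):
    """Return paths that are new or whose content changed since the last run, sorted."""
    changed = []
    for path, content in current_files.items():
        if known_fingerprints.get(path) != fingerprint(content):
            changed.append(path)
    return sorted(changed)


if __name__ == "__main__":
    last_run = {"/landing/a.csv": fingerprint(b"id,name\n1,Ann\n")}
    now = {
        "/landing/a.csv": b"id,name\n1,Ann\n2,Bob\n",  # changed
        "/landing/b.csv": b"id\n7\n",                   # new
    }
    print(files_to_reprofile(now, last_run))
```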
  47. Find, understand and govern data in Hadoop
      •  CDO/Data Steward: Discover lineage and business metadata automatically, and manage metadata
      •  Big Data Architect: Automate cataloging of data assets at scale, with secure provisioning to business users
      •  Data Engineer/Data Scientist/Business Analyst: Find and understand best-suited and most trusted data without having to explore every file manually
  48. Questions and Answers
  49. Next Steps…
      •  Download the Hortonworks Sandbox
      •  Learn Hadoop
      •  Build Your Analytic App
      •  Try Hadoop 2
      •  More about Waterline Data & Hortonworks: http://hortonworks.com/partner/waterline-data
      •  Joint tutorial: bit.ly/DataLakeTutorial
      •  Modern Data Architecture Paper: go.waterlinedata.com/hw-mda
  50. The Largest Hadoop Community Events in Europe and North America (www.hadoopsummit.org)
      SAN JOSE June 9-11; BRUSSELS April 15-16
      •  Deep-dive technical content •  65+ sessions and 5 tracks •  1,000 attendees •  Sponsorships Available •  Including Pre and Post event community meetups and BOFs •  Hadoop training available
      •  100+ sessions and 7 tracks •  Deep-dive technical content •  5,000 attendees •  Sponsorships Available •  Including Pre and Post event community meetups and BOFs •  Hadoop training available
