
How to build a successful Data Lake


  1. How to Build a Successful Data Lake. Alex Gorelik, Founder and CEO, Waterline Data
  2. Data Lakes Power Data-Driven Decision Making
  3. Maximize Business Value With a Data Lake: How do you democratize the data lake to maximize business value? (Chart: data democratization on one axis, from tight "governed" control to self-service, and business value on the other, from no value to enterprise impact; plotted along it are the data swamp, the data puddle, DW off-loading, and the data lake.)
  4. Data Swamps: raw data that users can't find or use, and that can't be opened up without protecting sensitive data.
  5. Data Warehouse Offloading: Cost Savings. Typical objections: "I prefer a data warehouse, it's more predictable." "It takes IT 3 months of data architecture and ETL work to add new data to the data lake." "I can't get the original data."
  6. Data Puddles: Limited Scope and Value. Low variety of data and low adoption: focused use case (e.g., fraud detection); fully automated programs (e.g., ETL off-loading); small user community (e.g., data science sandbox). Strong technical skill set required.
  7. What Makes a Successful Data Lake? Right Platform + Right Data + Right Interface
  8. Right Platform: volume (massively scalable); variety (schema on read); future proof and modular (the same data can be used by many different projects and technologies); platform cost (extremely attractive cost structure).
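To make "schema on read" concrete, here is a minimal sketch (not from the deck) using PySpark: raw JSON is landed unchanged, and a schema is applied only when the data is read, so different projects can interpret the same files differently. The paths and field names are hypothetical.

```python
# Minimal schema-on-read sketch with PySpark (hypothetical paths and fields).
# Raw events are landed as-is; a schema is applied only at read time.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Ingest: copy raw JSON into the landing zone without transforming it.
raw = spark.read.text("hdfs:///lake/landing/clickstream/2016/06/")
raw.write.mode("append").text("hdfs:///lake/raw/clickstream/")

# Read: each consumer projects only the fields it cares about.
click_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("revenue", DoubleType()),
])
clicks = spark.read.schema(click_schema).json("hdfs:///lake/raw/clickstream/")
clicks.groupBy("user_id").sum("revenue").show()
```

Because the raw files are never rewritten, a later project can reread them with a different schema without re-ingesting the data.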
  9. Right Data Challenges: Most Data Is Lost, So It Can't Be Analyzed Later. Only a small portion of enterprise data today is saved in data warehouses; the rest is discarded as data exhaust.
  10. Right Data: Save Raw Data Now to Analyze Later. You don't know now what data will be needed later, so save as much data as possible now to analyze later.
  11. 11. • Don’t know now what data will be needed later • Save as much data as possible now to analyze later • Save raw data, so it can be treated correctly for each use case Right Data: Save Raw Data Now to Analyze Later
  12. Right Data Challenges: Data Silos and Data Hoarding. Departments hoard and protect their data and do not share it with the rest of the enterprise; frictionless ingestion does not depend on data owners.
  13. Right Interface: Key to Broad Adoption. A data marketplace for data self-service; providing data at the right level of expertise.
  14. Providing Data at the Right Level of Expertise: raw data for data scientists; clean, trusted, prepared data for business analysts.
  15. Roadmap to Data Lake Success: organize the lake, set up for self-service, open the lake to the users.
  16. Organize the Data Lake into Zones. (Roadmap step: organize the lake.)
  17. Multi-modal IT: Different Governance Levels for Different Zones. Raw or Landing zone: minimal governance; make sure there is no sensitive data. Work zone: minimal governance; make sure there is no sensitive data. Gold or Curated zone: heavy governance; trusted, curated data; lineage and data quality. Sensitive zone: heavy governance; restricted access. The zones are used by data stewards, data scientists, data engineers, and business analysts, depending on the zone.
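As a rough sketch of how such zones might be expressed in code, the snippet below routes datasets to zone paths based on tags. The directory names, access lists, and routing rules are assumptions for illustration, not a governance model prescribed by the slides.

```python
# Hypothetical zone layout for a Hadoop-based data lake; paths, access lists,
# and routing rules are illustrative assumptions.
ZONES = {
    "landing":   {"path": "/lake/landing",   "governance": "minimal", "access": ["data_engineers", "data_scientists"]},
    "work":      {"path": "/lake/work",      "governance": "minimal", "access": ["data_scientists"]},
    "gold":      {"path": "/lake/gold",      "governance": "heavy",   "access": ["business_analysts", "data_scientists"]},
    "sensitive": {"path": "/lake/sensitive", "governance": "heavy",   "access": ["data_stewards"]},
}

def zone_for(dataset_tags):
    """Route a dataset to a zone based on curation and sensitivity tags."""
    if "pii" in dataset_tags:
        return ZONES["sensitive"]
    if "curated" in dataset_tags:
        return ZONES["gold"]
    return ZONES["landing"]

print(zone_for({"pii", "clickstream"})["path"])  # /lake/sensitive
```

In practice the access lists would be enforced by the platform's security layer (for example, policies on the zone directories) rather than in application code.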
  18. Business Analyst Self-Service Workflow: find and understand, provision, prep, analyze. (Roadmap step: set up for self-service.)
  19. Finding, understanding, and governing data in a data lake is like shopping at a flea market. "We have 100 million fields of data, how can anyone find or trust anything?" (Telco executive)
  20. The pain by role. Data scientist / business analyst: needs data to use with self-service tools, but can't explore everything manually to find and understand it. Data steward: can't govern and trust data (unknown metadata, data quality, PII, data lineage). Big data architect: can't catalog all the data manually and keep up with data provisioning.
  21. Instead, imagine shopping on Amazon.com: catalog; find, understand, and collaborate; provision.
  22. Waterline Data is like Amazon for data in Hadoop: catalog; find, understand, and collaborate; provision.
  23. Finding and Understanding Data (find and understand): crowdsource metadata and automate creation of a catalog; institutionalize tribal data knowledge; automate discovery to cover all data sets; establish trust through curated, annotated data sets, lineage, data quality, and governance.
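For a feel of what automated discovery and tagging can look like, here is a deliberately simple sketch that profiles CSV columns against regex rules and suggests tags. It is not Waterline's algorithm; the rules, tag names, and threshold are invented for illustration.

```python
# Minimal sketch of automated field tagging for a catalog; rules and tags
# are illustrative and far simpler than a production profiler.
import re
import csv

TAG_RULES = {
    "email":  re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "phone":  re.compile(r"^\+?[\d\-\s()]{7,15}$"),
}

def profile_column(values, threshold=0.8):
    """Suggest tags for a column when most non-empty values match a rule."""
    values = [v for v in values if v]
    tags = []
    for tag, pattern in TAG_RULES.items():
        hits = sum(1 for v in values if pattern.match(v))
        if values and hits / len(values) >= threshold:
            tags.append(tag)
    return tags

def tag_csv(path):
    """Build {column_name: [suggested_tags]} for one CSV file."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    return {col: profile_column([r[col] for r in rows]) for col in rows[0]} if rows else {}
```

Suggested tags would then go to a data steward for curation rather than being trusted blindly.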
  24. Accessing and Provisioning Data (provision). You cannot give all access to all users, and you must protect PII and sensitive business information. Agile/self-service approach: create a metadata-only catalog; when users request access, the data is de-identified and provisioned. Top-down approach: find and de-identify all sensitive data, then provide access to every user for every dataset as needed.
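A tiny sketch of the "de-identify on request" step might look like the following. The PII column names, hashing scheme, and salt handling are assumptions; a real deployment would use managed keys or tokenization rather than a hard-coded salt.

```python
# Hypothetical de-identification applied when a user requests a dataset;
# column names and salt handling are assumptions for illustration.
import hashlib

PII_COLUMNS = {"ssn", "email", "full_name"}

def pseudonymize(value, salt="lake-salt"):
    """One-way hash so records stay joinable without exposing the raw value."""
    return hashlib.sha256((salt + str(value)).encode()).hexdigest()[:16]

def provision(records, requested_columns):
    """Return only the requested columns, masking any PII fields."""
    out = []
    for rec in records:
        row = {}
        for col in requested_columns:
            row[col] = pseudonymize(rec[col]) if col in PII_COLUMNS else rec[col]
        out.append(row)
    return out

sample = [{"ssn": "123-45-6789", "email": "a@example.com", "region": "EMEA"}]
print(provision(sample, ["ssn", "region"]))
```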
  25. Provide a Self-Service Interface to Find, Understand, and Provision Data
  26. Prepare Data for Analytics (prep). Clean data: remove or fix bad data, fill in missing values, convert to common units of measure. Shape data: combine (join, concatenate); resolve entities (create a single customer record from multiple records or sources); transform (aggregate, bucketize, filter, convert codes to names, etc.). Blend data: harmonize data from multiple sources to a common schema or model. Tooling: many great dedicated data wrangling tools on the horizon; some capabilities in BI and data visualization tools; SQL and scripting languages for the more technical analysts.
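As a small, self-contained example of the clean / shape / blend steps (not tied to any particular wrangling tool), the pandas snippet below fills missing values, converts codes to names, aggregates per customer, and joins a second source. The column names and data are invented.

```python
# Small data-prep sketch with pandas (hypothetical columns and sources),
# mirroring the clean / shape / blend steps described above.
import pandas as pd

orders = pd.DataFrame({
    "cust_id": [1, 1, 2],
    "amount_usd": [100.0, None, 250.0],
    "status_code": ["A", "A", "C"],
})
customers = pd.DataFrame({"cust_id": [1, 2], "region": ["EMEA", "APAC"]})

# Clean: fill missing values and convert codes to names.
orders["amount_usd"] = orders["amount_usd"].fillna(0.0)
orders["status"] = orders["status_code"].map({"A": "active", "C": "cancelled"})

# Shape: aggregate to one record per customer.
per_customer = orders.groupby("cust_id", as_index=False)["amount_usd"].sum()

# Blend: join with a second source on a common key.
result = per_customer.merge(customers, on="cust_id", how="left")
print(result)
```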
  27. Data Analysis (analyze): many wonderful self-service BI and data visualization tools; a mature space with many established and innovative vendors. (Cited: Gartner Magic Quadrant for Business Intelligence and Analytics Platforms, 04 February 2016, ID G00275847; analysts: Josh Parenteau, Rita L. Sallam, Cindi Howson, Joao Tapadinhas, Kurt Schlegel, Thomas W. Oestreich.)
  28. Unlock the Value of the Data Lake with the Waterline Data Smart Data Catalog: time to value, tribal knowledge sharing, trust.
  29. Waterline Data Is the Only Smart Data Catalog for the Data Lake. Selected quotes: "Use an information catalog to maximize business value from information assets"; "automatically identify, profile, and metatag files in HDFS and make them available for analysis and exploration"; "tapped into an important and underserved opportunity"; "comprehensive big data governance and discovery platform"; "opens the data to a wider variety of people"; "fills a critical gap in big data exploratory analytics by automating the tagging and cataloging of data".
  30. Current Customers: healthcare, insurance, life sciences, aerospace, automotive, banking, government, marketing. "Opening up a data lake for self-service analytics requires a data catalog that's smart enough to automatically catalog every field of data so business analysts can maximize time to value" (Jerry Megaro, Global Head of Data Analytics, Merck KGaA). "Understanding where your data came from and what it means in context is vital to making a data lake initiative successful and not just another data quagmire – the catalog plays a critical component in this" (Global Head of Data Governance, Risk, and Standard, international multi-line insurer). "A governed yet agile data catalog is key to open up the data lake to business people" (Paolo Arvati, Big Data, CSI-Piemonte).
  31. We Run Natively on Hadoop and Integrate with Existing Tools
  32. Workflow of Enabling Self-Service Analytics with Hortonworks. (Architecture diagram: smart data discovery performs profiling, sensitive data and data lineage discovery, and automated tagging; data stewardship curates the tags; a self-service data catalog lets users find, collaborate, and take action; data prep and analytics & visualization tools sit on top; metadata, tags, data lineage, roles, and access control are exchanged with Hortonworks Atlas and Ranger.)
  33. A Successful Data Lake: Right Platform + Right Data + Right Interface
  34. Come to Booth 303 to see a demo and talk to us about your data lake. Come to the Atlas session at 4:00 PM on Thursday in room 210C.
  35. Waterline Data: The Smart Data Catalog Company
