Deploying a Governed Data Lake


  1. Deploying a Governed Data Lake
  2. Everyone needs data to make better decisions
  3. A data lake
     • “Size and low cost”
     • “Fidelity: Hadoop data lakes preserve data in its original form”
     • “Ease of accessibility: Accessibility is easy in the data lake”
     • “Late binding: Hadoop lends itself to flexible, task-oriented structuring and does not require up-front data models”
     • “Nearly unlimited potential for operational insight and data discovery. As data volumes, data variety, and metadata richness grow, so does the benefit.”
     Source: http://www.pwc.com/us/en/technology-forecast/2014/issue1/features/data-lakes.jhtml
  4. Data warehouse vs. data lake
     Data Warehouse
     • Production system
     • Well-defined usage
     • Well-defined schema
     • Clean, trusted data
     • Heavy IT reliance
       – Less technical analysts
       – Large IT teams: DBAs, Data Architects, ETL Developers, BI Developers, DQ Developers, Data Modelers, Data Stewards
     Data Lake
     • Non-production system
     • Future, experimental usage
     • No schema (schema on read)
     • Raw data, frictionless ingestion
     • Self-service
       – More technical analysts
       – IT manages the cluster and ingestion, but no IT involvement when working with data
  5. Hadoop as the platform for a scalable data lake infrastructure
     • Lots of data (Volume): cost-effective storage and scalable processing (✔ Hadoop)
     • Flexibility to handle all kinds of data (Variety) (✔ Hadoop)
     • Will be around for a long time: modularity to ensure future-proofing (✔ Hadoop)
  6. Is Hadoop enough?
     “We have Hadoop, now what?”
     Diagram labels: Big Data Architect, Hadoop (10-20 nodes)
  7. “How do I get the business to start using it?”
     Diagram labels: Big Data Architect, Data Scientist/Business Analyst, Hadoop (10-20 nodes)
  8. “How do I get the business to start using it?”
     “How do I find and understand data easily to do big data analytics?”
     Diagram labels: Big Data Architect, Data Scientist/Business Analyst, Self-service, Hadoop (10-20 nodes)
  9. No security and governance
     “How do I ensure compliance with regulations and data policies? Sensitive data?”
     Diagram labels: Big Data Architect, Data Scientist/Business Analyst, Risk/Data Governance Executive, Hadoop (10-20 nodes)
  10. “How do I scale?”
      Manual process to catalog the lake can’t scale
      Diagram labels: Big Data Architect, Data Scientist/Business Analysts, Hadoop (100s/1000s of nodes)
  11. The platform for a scalable data lake infrastructure
      • Lots of data (Volume): cost-effective storage and scalable processing (✔ Hadoop)
      • Flexibility to handle all kinds of data (Variety) (✔ Hadoop)
      • Will be around for a long time: modularity to ensure future-proofing (✔ Hadoop)
      • Self-service to help users find, understand and use the data (X Hadoop)
      • Governance to protect sensitive data, document lineage and assess quality (X Hadoop)
  12. Waterline Data Inventory broadens Hadoop adoption through governed self-service
      • Self-service
      • Security and governance
      • Massive scale
      Diagram labels: Big Data Architect, Data Scientist/Business Analyst, Risk/Data Governance Executive, Hadoop (100s/1000s of nodes)
  13. 3-phase approach to a governed data lake
      • Organize the lake
      • Inventory the lake
      • Open up the lake
  14. Organize the lake into zones (phase: Organize the lake)
  15. Establish access control per zone (phase: Organize the lake)
      Zones: Sensitive, Landing, Gold, Work
      User groups, in slide order: • Business Analysts • Data Scientists • Data Scientists • Data Engineers • Data Scientists • Data Engineers • Data Stewards
  16. The governed data lake
      Diagram labels: Data Scientist/Business Analyst, Data Steward, Big Data Architect, HDFS, Hive, Waterline Data Inventory, Find/understand, Govern, Governed data layer, Governance, Inventory, Self-Service
  17. Inventory the lake: profile and discover the content of files and Hive tables
      Diagram labels (The governed data lake): Metadata Curation, Self-Service Catalog/Provisioning, Big Data Architect, Find/understand, Governed data layer, Data Scientist/Business Analyst, Data Steward, HDFS, Hive, Waterline Data Inventory, Govern, Inventory
  18. Inventory
      • Parse multiple content types
      • Create catalog automatically
      • Discover lineage automatically
  19. Govern the lake
      Governance
      • Inspect files and perform tag curation
      • Identify sensitive data
      • Assess data quality
      • Discover data lineage
      • Manage glossary
      Diagram labels (The governed data lake): Self-Service Catalog/Provisioning, Big Data Architect, Find/understand, Governed data layer, Data Scientist/Business Analyst, Data Steward, HDFS, Hive, Waterline Data Inventory, Govern, Inventory
  20. Navigate Lineage of Files in Hadoop
      Clickable, navigable lineage discovered using file content or imported from other tools through REST APIs
  21. Automated Data Profiling Helps with Quality Assessment
      Infographic shows contents at a glance:
      • Different types of data in the same field
      • Number of missing values
      Separate profiles for each data type, including number of unique values (cardinality), uniqueness (selectivity), and type-specific measures like mean and standard deviation for numbers
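The profile measures named on this slide can be illustrated with a minimal Python sketch. It is not Waterline Data Inventory's implementation: the function names `infer_type` and `profile_field`, the two-way number/string type split, and the definition of selectivity as unique values divided by non-missing values are all assumptions made for illustration.

```python
import math
from collections import Counter


def infer_type(value):
    """Classify a raw field value as 'missing', 'number', or 'string' (assumed type system)."""
    if value is None or value.strip() == "":
        return "missing"
    try:
        float(value)
        return "number"
    except ValueError:
        return "string"


def profile_field(values):
    """Build one profile per data type found in a single field (hypothetical helper)."""
    types = Counter(infer_type(v) for v in values)
    missing = types.pop("missing", 0)
    profiles = {}
    for type_name in types:
        typed = [v.strip() for v in values if infer_type(v) == type_name]
        unique = len(set(typed))
        stats = {
            "count": len(typed),
            "cardinality": unique,                # number of unique values
            "selectivity": unique / len(typed),   # assumed definition of uniqueness
        }
        if type_name == "number":
            nums = [float(v) for v in typed]
            mean = sum(nums) / len(nums)
            variance = sum((x - mean) ** 2 for x in nums) / len(nums)
            stats["mean"] = mean
            stats["std_dev"] = math.sqrt(variance)  # population standard deviation
        profiles[type_name] = stats
    return {"missing": missing, "types": dict(types), "profiles": profiles}


if __name__ == "__main__":
    # A field that mixes numbers, a string, and missing values.
    print(profile_field(["10", "20", "", "abc", "20", None]))
```

Keeping a separate profile per type is what lets a mixed field (numbers alongside stray strings) show up as a quality problem instead of being averaged away.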
  22. Data Preview and Visualization Helps Understand the Data
      • Visualization helps understand the shape and distribution of data
      • Most frequent values for each field
  23. Discover Sensitive Data
      (Screenshot) Find all fields that may have SSN
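One simple way to flag fields that may hold SSNs is pattern matching over sampled values. The sketch below is an assumption about how such a check could work, not a description of the product's discovery engine: the dashed `NNN-NN-NNNN` pattern, the 0.8 threshold, and the function names are all hypothetical.

```python
import re

# Assumed format: US Social Security numbers written with dashes, e.g. 123-45-6789.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")


def ssn_match_ratio(values):
    """Fraction of non-empty values in a field that look like a dashed SSN."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return 0.0
    hits = sum(1 for v in non_empty if SSN_PATTERN.fullmatch(v))
    return hits / len(non_empty)


def find_possible_ssn_fields(columns, threshold=0.8):
    """Return names of fields whose match ratio meets the (hypothetical) threshold."""
    return [name for name, values in columns.items()
            if ssn_match_ratio(values) >= threshold]


if __name__ == "__main__":
    sample = {
        "customer_id": ["123-45-6789", "987-65-4321", ""],
        "phone": ["650-555-0100", "650-555-0199"],
    }
    print(find_possible_ssn_fields(sample))
```

A ratio with a threshold, instead of a single hit, leaves room for a steward to accept or reject the suggested tag when the evidence is partial.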
  24. Curate Discovered Sensitive Data Fields
      Curate the field and accept or reject the tag
  25. Manage Glossary
      • Import or create a business glossary
      • Manage tags
  26. View and search history
      Screenshot of history tab
      Another screenshot of searching history (made up)
      Data Inventory keeps track of all user tagging, schema changes, and lineage changes in Audit History
  27. Open up the lake: explore catalog and provision data securely
      Diagram labels (Open up the data lake): Data Steward, Govern, Big Data Architect, Governed data layer, HDFS, Hive, Waterline Data Inventory, Inventory, Governance, Self-Service, Find/understand, Data Scientist/Business Analyst
  28. Find and Understand
      • Automatically propagate user-defined tags (crowdsource ontology)
      • Discover meaning of fields and tag automatically
      • Multi-faceted drill down
      • Automated facet creation based on metadata
      • Business metadata-based search
  29. Annotate fields, files and folders with tags
      • Analysts can tag fields and files with meaningful business tags
      • Type-ahead shows existing available tags that match the typed string
      • Users can choose one or create a new tag
      • Period in tag name automatically creates tag hierarchy (e.g., Restaurant.Name creates category “Restaurant” and tag “Name”)
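The period-based hierarchy and type-ahead behavior described on this slide can be sketched in a few lines of Python. The nested-dictionary glossary, the function names, and the case-insensitive substring match are assumptions for illustration, not the product's data model or API.

```python
def parse_tag(tag_name):
    """Split a dotted tag name into (categories, tag), e.g. 'Restaurant.Name'."""
    parts = [p.strip() for p in tag_name.split(".")]
    if any(not p for p in parts):
        raise ValueError(f"empty segment in tag name: {tag_name!r}")
    *categories, tag = parts
    return categories, tag


def add_tag(glossary, tag_name):
    """Insert a tag into a nested-dict glossary, creating categories as needed."""
    categories, tag = parse_tag(tag_name)
    node = glossary
    for category in categories:
        node = node.setdefault(category, {})
    node.setdefault(tag, {})
    return glossary


def suggest_tags(existing_tags, typed):
    """Type-ahead: existing tag names containing the typed string (case-insensitive)."""
    needle = typed.lower()
    return [t for t in existing_tags if needle in t.lower()]


if __name__ == "__main__":
    glossary = {}
    add_tag(glossary, "Restaurant.Name")
    add_tag(glossary, "Restaurant.Address")
    print(glossary)
    print(suggest_tags(["Restaurant.Name", "Restaurant.Address", "Customer.Name"], "name"))
```

Because categories are created on demand, a user who types a new dotted name extends the hierarchy without a separate glossary-editing step.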
  30. Based on a single field in one file tagged as Restaurant.Name, the Waterline Data Inventory discovery engine found 25 additional instances of Restaurant Name automatically.
      • User-assigned tags are solid blue
      • Automatically suggested tags are faded blue with confidence level
      • Delimited files don’t have field names
      • Waterline Data Inventory learns from analysts who manually tag fields and automatically finds and tags similar fields
  31. Create Hive tables
      Screenshot of file with “Generate Hive Table” option selected - Replace Hive with Drill
      Generate Hive Tables
  32. (no text)
  33. Company overview
      • Headquartered in Mountain View, CA
      • Funded in 2013 by Menlo Ventures and Sigma West
      • Management Team:
        – Alex Gorelik, Founder, CEO. Founded Exeros (IBM) and Acta (SAP), IBM DE, Informatica GM. Columbia BSCS, Stanford MSCS.
        – Oliver Claude, Marketing VP SAP, VP Informatica, IBM, Siebel. Nova Southeastern MS MIS.
        – Jason Chen, Engineering VP Teradata, Acta, Sybase. USC PhD CS.
        – Ravi Ramachandran, Sales CSC-Infochimps Big Data, AppLabs, Xchanging, Pegasystems. Scient (Razorfish)
      WATERLINE DATA NAMED COOL VENDOR
      Gartner, Cool Vendors in Information Governance and MDM, 2015. Guido DeSimoni, Roxane Edjlali, Saul Judah, Bill O'Kane, Andrew White
  34. Visit our exhibit in the ballroom to get more information
