
Big Data: Setting Up the Data Lake

The 19th annual Enterprise Data World (EDW) Conference is recognized as the most comprehensive educational conference on data management in the world. Joe Caserta, President of Caserta Concepts, presented "Big Data: Setting Up the Data Lake".

For more information on the services offered, please visit our website at http://casertaconcepts.com/.


  1. @joe_Caserta #edwdc15 – Big Data: Setting Up the Data Lake. Joe Caserta, President, Caserta Concepts. March 31, 2015, Grand Hyatt, Washington, DC
  2. About Joe Caserta (Caserta Concepts named a Top 20 Big Data consulting firm by CIO Review). Timeline highlights, 1986–2014:
     • 25+ years of hands-on experience building database solutions; began consulting in database programming and data modeling
     • Dedicated to data warehousing and business intelligence since 1996
     • Founded Caserta Concepts in NYC
     • Co-author, with Ralph Kimball, of The Data Warehouse ETL Toolkit (Wiley)
     • Web log analytics solution published in Intelligent Enterprise
     • Launched the Big Data practice, with a laser focus on extending data warehouses with Big Data solutions
     • Launched the Big Data Warehousing (BDW) Meetup in NYC – 2,000 members
     • Launched the training practice, teaching data concepts worldwide
     • Partnered with Big Data vendors Cloudera, Hortonworks, IBM, Cisco, Datameer, Basho, and more; formalized alliances/partnerships with system integrators
     • Established best practices for big data ecosystem implementation in healthcare, finance, and insurance
     • Dedicated to data governance techniques on Big Data (innovation)
  3. Today's business environment requires Big Data. Architecture overview (diagram): source systems such as Enrollments, Claims, Finance, and others feed both a traditional EDW (ETL, ad-hoc query, canned reporting, traditional BI) and a Big Data Lake, a horizontally scalable environment optimized for analytics, built on the Hadoop Distributed File System (HDFS) across cluster nodes (N1–N5) with MapReduce, Pig/Hive, Mahout, and NoSQL databases, supporting ETL, ad-hoc and canned reporting, big data analytics, and data science.
  4. Innovation is the only sustainable competitive advantage a company can have.
  5. Components of the Data Lake
      Hadoop distribution: Cloudera, Hortonworks, MapR, Pivotal-HD, IBM
      Tools:
        Hive: map data to structures and use SQL-like queries
        Pig: data transformation language for big data
        Sqoop: extracts from external sources and loads into Hadoop
        Spark: general-purpose cluster computing framework
        Storm: real-time ETL
      NoSQL:
        Document: MongoDB, CouchDB
        Graph: Neo4j, Titan
        Key-value: Riak, Redis
        Columnar: Cassandra, HBase
      Search: Lucene, Solr, ElasticSearch
      Languages: Python, SciPy, Java, R, Scala
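To make the tooling on slide 5 concrete, here is a minimal sketch, not taken from the deck, of the Hive-plus-Spark pattern it describes: raw files in HDFS are mapped to a structure and then queried with SQL. It assumes a Spark 1.x cluster with Hive configured; the path, table, and column names are hypothetical.

```python
# Minimal sketch: map raw claim files in HDFS to a Hive structure and query
# them with SQL from Spark. Paths, table, and column names are hypothetical.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="data-lake-sketch")
hc = HiveContext(sc)  # Hive-aware SQL entry point in Spark 1.x

# Hive: "map data to structures" over files already sitting in the lake
hc.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS landing_claims (
        claim_id     STRING,
        member_id    STRING,
        claim_amount DOUBLE,
        service_date STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/landing/claims'
""")

# ...and "use SQL-like queries" against that structure
totals = hc.sql("""
    SELECT member_id, SUM(claim_amount) AS total_paid
    FROM landing_claims
    GROUP BY member_id
""")
totals.show(10)
```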
  6. Why is Big Data Governance Important?
      It is the convergence of data quality, management and policies, and all of the data in an organization
      A set of processes that ensures important data assets are formally managed throughout the enterprise
      Ensures data can be trusted, with people made accountable for low data quality
      It is about putting people and technology in place to fix and prevent issues with data so that the enterprise can become more efficient
  7. The Challenges of Building a Data Lake span the four V's: Volume, Variety, Velocity, and Veracity
     • Data is coming in so fast – how do we monitor it?
     • Truly real-time analytics
     • What does "complete" mean?
     • Dealing with sparse, incomplete, volatile, and highly manufactured data – how do you certify sentiment analysis?
     • A wider breadth of datasets and sources in scope requires larger data governance support
     • Data governance cannot start at the data warehouse
     • Data volume is higher, so the process must rely more on programmatic administration and less on people and manual process
  8. What's Old is New Again
      Before data warehousing governance:
        Users tried to produce reports from raw source data
        No data conformance, no master data management, no data quality processes
        No trust: two analysts were almost guaranteed to come up with two different sets of numbers!
      Before data lake governance:
        We can put "anything" in Hadoop, and we can analyze anything
        "We're scientists, we don't need IT, we make the rules"
      Rule #1: Dumping data into Hadoop with no repeatable process, procedure, or data governance will create a mess
      Rule #2: Information harvested from an ungoverned system will take us back to the old days: No Trust = Not Actionable
  9. Making it Right
      The promise is an "agile" data culture where communities of users are encouraged to explore new datasets in new ways: new tools, external data, data blending, decentralization
      With all the V's, data scientists, new tools, and new data, we must rely less on humans and more on systemic administration
      We need systems and tools to help with big data governance, and this space is extremely immature!
      Steps toward data governance for the data lake:
       1. Establish the difference between traditional data governance and big data governance
       2. Establish basic rules for where the new data governance can be applied
       3. Establish processes for graduating the products of data science into governance
       4. Establish a set of tools to make governing Big Data feasible
  10. Data Governance for the Data Lake: process architecture (diagram)
      BDG provides vision, oversight, and accountability for leveraging corporate information assets to create competitive advantage and accelerate the vision of integrated delivery
      Framework components: organization, communication, IFP governance administration, compliance reporting, standards, value proposition, risk/reward, information accountabilities, stewardship, architecture, data integrity metrics, control mechanisms, principles and standards, information usability
      Enterprise Data Council: executive oversight, prioritizes work, drives change, accountable for results, owns definitions
      Governance committees and data stewards: value creation, act on requirements
      Project teams: build capabilities, do the work, responsible for adherence
  11. Components of Data Governance
      Organization: this is the 'people' part – establishing the Enterprise Data Council, data stewards, etc.
      Metadata: definitions, lineage (where does this data come from), business definitions, technical metadata
      Privacy/Security: identify and control sensitive data, regulatory compliance
      Data Quality and Monitoring: data must be complete and correct – measure, improve, certify
      Business Process Integration: policies around data frequency, source availability, etc.
      Master Data Management: ensure consistent business-critical data, e.g. members, providers, agents
      Information Lifecycle Management (ILM): data retention, purge schedule, storage/archiving
     For Big Data, these components must be extended:
      • Add Big Data to the overall framework and assign responsibility
      • Add data scientists to the stewardship program
      • Assign stewards to new data sets (Twitter, call center logs, etc.)
      • Graph databases are more flexible than relational
      • Lower-latency services are required
      • Distributed data quality and matching algorithms
      • Data quality and monitoring will probably be home-grown (Drools?)
      • Quality checks are not only SQL: machine learning, Pig, and MapReduce (a sketch follows this slide)
      • Acting on large-dataset quality checks may require distribution
      • Larger scale, new datatypes
      • Integrate with the Hive Metastore, HCatalog, and home-grown tables
      • Secure and mask multiple data types (not just tabular)
      • Deletes are more uncommon (unless there is a regulatory requirement)
      • Take advantage of compression and archiving (like AWS Glacier)
      • Data detection and masking on unstructured data upon ingest
      • Near-zero latency, DevOps, a core component of business operations
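The "quality checks not only SQL" point above is where distribution matters most. Below is a minimal sketch, not from the deck, of a completeness check written as a Spark job; the dataset path, record layout, and threshold are hypothetical.

```python
# Minimal sketch of a distributed quality check: measure completeness of a
# members extract as a Spark job rather than a SQL query. The dataset path,
# record layout, and threshold are hypothetical.
from pyspark import SparkContext

sc = SparkContext(appName="dq-completeness-check")

raw = sc.textFile("/data/lake/members/*.tsv")

def parse(line):
    fields = line.split("\t")
    fields += [None] * (4 - len(fields))   # pad short records
    return fields[:4]                      # member_id, name, dob, plan_code

records = raw.map(parse).cache()
total = records.count()

# Completeness: fraction of records with a non-empty member_id and dob
complete = records.filter(lambda r: r[0] and r[2]).count()
completeness = complete / float(total) if total else 0.0
print("records: %d  completeness: %.2f%%" % (total, completeness * 100))

# Gate the batch so low-quality data never graduates toward the trusted tier
if completeness < 0.98:
    raise SystemExit("Data quality gate failed for members batch")
```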
  12. Data Lake Governance Realities
      Full data governance can only be applied to "structured" data: the data must have a known and well-documented schema
      This can include materialized endpoints such as files or tables, or projections such as a Hive table
      Governed structured data must have: a known schema with metadata; a known and certified lineage; a monitored, quality-tested, managed process for ingestion and transformation; a governed usage
      Data isn't just for enterprise BI tools anymore
      We talk about unstructured data in Hadoop, but more often it is semi-structured or structured data with a definable schema
      Even for truly unstructured data, structure must be extracted or applied in just about every case imaginable before analysis can be performed (a schema-on-read sketch follows this slide)
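As an illustration of "structure must be extracted/applied before analysis", here is a minimal schema-on-read sketch, not from the deck, that projects semi-structured JSON events through an explicit schema; the file location and field names are hypothetical.

```python
# Minimal sketch of schema-on-read: apply an explicit schema to semi-structured
# JSON events at read time. File location and field names are hypothetical.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

sc = SparkContext(appName="schema-on-read")
sqlc = SQLContext(sc)

schema = StructType([
    StructField("member_id", StringType(), False),
    StructField("channel", StringType(), True),
    StructField("sentiment_score", DoubleType(), True),
])

# Records that do not fit the declared structure surface as nulls, which the
# quality-monitoring process can then measure and act on.
events = sqlc.read.json("/data/landing/social_events/", schema=schema)
events.registerTempTable("social_events")
sqlc.sql("SELECT channel, COUNT(*) AS n FROM social_events GROUP BY channel").show()
```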
  13. The Data Scientists Can Help!
      Data science to Big Data Warehouse mapping – full data governance requirements:
        Provide full process lineage
        A data certification process run by data stewards and business owners
        Ongoing data quality monitoring that includes quality checks
      Provide requirements for the Data Lake:
        Proper metadata established: catalog, data definitions, lineage, quality monitoring
        Know and validate data completeness
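One way to meet the "full process lineage" requirement is to have each graduating job emit a lineage record for the catalog. The sketch below is illustrative only; the record fields and log location are hypothetical, not something the deck prescribes.

```python
# Minimal sketch of a process-lineage record that a graduating data science
# job could emit for the catalog. Record fields and the log location are
# hypothetical, not prescribed by the deck.
import json
import getpass
import datetime

def lineage_record(job_name, inputs, outputs, quality_checks):
    return {
        "job": job_name,
        "run_at": datetime.datetime.utcnow().isoformat() + "Z",
        "run_by": getpass.getuser(),
        "inputs": inputs,                   # upstream datasets (full process lineage)
        "outputs": outputs,                 # datasets produced and where they land
        "quality_checks": quality_checks,   # checks executed and their results
    }

record = lineage_record(
    job_name="member_churn_features_v1",
    inputs=["/data/lake/members", "/data/landing/call_center_logs"],
    outputs=["/data/bdw/member_churn_features"],
    quality_checks=[{"check": "completeness(member_id)", "passed": True}],
)

# Append to a lineage log that the stewards' catalog can ingest
with open("/var/log/bdw/lineage.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```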
  14. The Big Data Governance Pyramid. Hadoop has different governance demands at each tier: a metadata catalog and ILM (who has access, how long do we "manage" it) apply throughout, data quality and monitoring (completeness of data) is added as data moves up, and only the top tier is fully governed:
      Landing Area – source data in "full fidelity": raw machine data collection, collect everything
      Data Lake – integrated sandbox: data is ready to be turned into information – organized, well defined, complete
      Data Science Workspace: agile business insight through data munging, machine learning, blending with external data, and development of to-be BDW facts
      Big Data Warehouse: fully data governed (trusted) – we refer to this as the Trusted tier of the Big Data Warehouse; the user community runs arbitrary queries and reporting
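A lightweight way to make the pyramid actionable is to encode each tier's location and the governance controls it demands as configuration that ingestion and publishing jobs consult. The sketch below is a hypothetical illustration; the paths and control names are not from the deck.

```python
# Minimal sketch: encode the pyramid's tiers as configuration so ingestion and
# publishing jobs know which governance controls apply where. Paths and
# control names are hypothetical.
PYRAMID = {
    "landing":   {"path": "/data/landing", "controls": ["metadata_catalog", "ilm"]},
    "data_lake": {"path": "/data/lake",    "controls": ["metadata_catalog", "ilm",
                                                        "completeness_monitoring"]},
    "science":   {"path": "/data/science", "controls": ["metadata_catalog", "ilm",
                                                        "completeness_monitoring"]},
    "bdw":       {"path": "/data/bdw",     "controls": ["metadata_catalog", "ilm",
                                                        "completeness_monitoring",
                                                        "certified_lineage",
                                                        "steward_certification"]},
}

def controls_for(path):
    """Return the governance controls that apply to a dataset path."""
    for tier in PYRAMID.values():
        if path.startswith(tier["path"]):
            return tier["controls"]
    raise ValueError("Path is outside the governed pyramid: %s" % path)

print(controls_for("/data/bdw/member_churn_features"))
```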
  15. Securing Big Data – determining who sees what:
      Need to be able to secure as many data types as possible; auto-discovery is important!
      Current products:
        Sentry – SQL security semantics for Hive
        Knox – central authentication mechanism for Hadoop
        Cloudera Navigator – central security auditing
        Hadoop – good old *NIX permissions with LDAP (a sketch follows this slide)
        Dataguise – auto-discovery, masking, encryption
        Datameer – the BI tool for Hadoop
      Recommendation: assemble a solution based on existing tools
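For the "good old *NIX permissions with LDAP" option, a small script can lock each zone's HDFS directory down to an owning service account and an LDAP-backed group. This is a minimal sketch; the accounts, groups, and paths are hypothetical, while the hdfs dfs commands themselves are standard.

```python
# Minimal sketch of the "*NIX permissions with LDAP" option: lock each zone's
# HDFS directory down to an owning service account and an LDAP-backed group.
# Accounts, groups, and paths are hypothetical; the hdfs dfs calls are standard.
import subprocess

ZONES = [
    # (path,           owner,        group,             mode)
    ("/data/landing",  "ingest_svc", "data_engineers",  "770"),
    ("/data/lake",     "ingest_svc", "data_scientists", "750"),
    ("/data/bdw",      "etl_svc",    "bi_users",        "750"),
]

def hdfs(*args):
    subprocess.check_call(["hdfs", "dfs"] + list(args))

for path, owner, group, mode in ZONES:
    hdfs("-chown", "-R", "%s:%s" % (owner, group), path)
    hdfs("-chmod", "-R", mode, path)
```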
  16. Metadata
     • For now, the Hive Metastore plus HCatalog plus custom metadata tables might be best
     • HCatalog gives great "abstraction" services: it maps data to a relational schema, so developers don't need to worry about data formats and storage
     • Can use SuperLuminate to get started
     Recommendation: leverage HCatalog plus custom metadata tables (a sketch follows this slide)
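Here is a minimal sketch, not from the deck, of the "Hive Metastore + custom metadata tables" recommendation: the Metastore/HCatalog keeps the technical schema, while a small custom table carries business metadata such as ownership, stewardship, and certification. The database, table, and column names are hypothetical.

```python
# Minimal sketch of "Hive Metastore + custom metadata tables": HCatalog/the
# Metastore holds the technical schema, while a small custom table carries the
# business metadata it lacks. Database, table, and column names are hypothetical.
from pyspark import SparkContext
from pyspark.sql import HiveContext, Row

sc = SparkContext(appName="custom-metadata")
hc = HiveContext(sc)

hc.sql("CREATE DATABASE IF NOT EXISTS governance")

entry = Row(dataset_name="landing_claims",
            business_owner="Claims Operations",
            steward="j.smith",
            source_system="claims_adjudication",
            certified=False)

# Append the catalog entry to a custom metadata table kept next to the Metastore
hc.createDataFrame([entry]).write.mode("append").saveAsTable("governance.dataset_catalog")

# Because HCatalog abstracts storage, a consumer can read any registered
# dataset without knowing its file format
claims = hc.table("landing_claims")
print(claims.schema)
```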
  17. Closing Thoughts – Enable the Future
      Today's business environment requires the convergence of data quality, data management, data engineering, and business policies
      Make sure your data can be trusted and that people can be held accountable for the impact caused by low data quality
      Get experts to help calm the turbulence… it can be exhausting!
      Blaze new trails! Polyglot persistence: "where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it." – Martin Fowler
  18. Thank You! Joe Caserta, President, Caserta Concepts. joe@casertaconcepts.com, (914) 261-3648, @joe_Caserta
