Introducing
Data Lakes
Pravin Singh
Why?
• Once upon a time, there was a Data Warehouse
– Data pre-categorized at the point of entry
– Data well organized, but in silos
– Common, predetermined data model for “optimal” analysis
– Upfront DB modeling and ETL effort
– A single-source-of-truth, but at the cost of flexibility
– Complex system with low tolerance for human error; IT help required for even the smallest enhancements
– Not to mention the high costs
• Then came the Big Bang of Information!
• Data Lake to the Rescue
What?
Source: PwC
Benefits
• Breaks the silos
• Flexible Data Model (Schema on Read)
• Data Provenance
• No upfront modeling and data cleansing
• Low cost of ownership
• Focused on exploration, not on operations
• Can work as staging area for ETL
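"Schema on Read" is the key idea above: raw data lands in the lake untouched, and a structure is imposed only when someone reads it. A minimal sketch in plain Python (the field names and records here are hypothetical, purely for illustration):

```python
import io
import json

# Raw, heterogeneous JSON-lines records as they might sit in the lake.
# Note the inconsistent types and the stray extra field -- nothing was
# validated or modeled at write time.
raw_events = io.StringIO(
    '{"user": "alice", "amount": "12.50", "ts": "2015-01-01"}\n'
    '{"user": "bob", "amount": 7, "extra_field": true}\n'
)

def read_with_schema(lines, schema):
    """Apply a schema at read time: project each raw record onto the
    requested fields, coercing types and ignoring everything else."""
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        yield row

# The schema lives with the query, not with the storage -- another
# analyst could read the same raw lines with a different schema.
schema = {"user": str, "amount": float}
rows = list(read_with_schema(raw_events, schema))
```

Contrast this with a warehouse, where the cast-and-project step happens once, up front, for everyone.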
Pitfalls and Challenges
• Data Lake as Data Graveyard
• Metadata
• Governance
• Information Lifecycle Management (ILM)
• Security and Privacy
• Training
Lake Maturity
Source: PwC
Four Stages of Data Lake Adoption
1: Life Before Hadoop
– Applications stand alone with their databases
– Some applications contribute data to a data warehouse
– Analysts run reporting and analytics in the data warehouse
Four Stages of Data Lake Adoption
2: Hadoop is Introduced
– Applications contribute data to Hadoop
– Hadoop runs batch MapReduce jobs
– Hadoop used for ETL into warehouse or analytic databases
– Hadoop data reintroduced into applications
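The batch MapReduce jobs mentioned above follow a simple three-phase model: map emits key/value pairs, a shuffle groups them by key, and reduce aggregates each group. A pure-Python sketch of that model (illustrative only, not the Hadoop API; the word-count job is the standard example):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every input line
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: sort all pairs by key and group them, as Hadoop does
    # between the map and reduce phases
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reduce_phase(grouped):
    # Reduce: aggregate the values collected for each key
    for key, group in grouped:
        yield (key, sum(count for _, count in group))

lines = ["data lake", "data warehouse", "lake"]
counts = dict(reduce_phase(shuffle(map_phase(lines))))
# counts: {"data": 2, "lake": 2, "warehouse": 1}
```

Hadoop's value is running exactly this pattern in batch across many machines, which is why it slots in naturally as an ETL engine feeding the warehouse.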
Four Stages of Data Lake Adoption
3: Growing the Data Lake
– Newly built systems center around Hadoop by default
– Applications use each other’s data via Hadoop
– Hadoop becomes a default data destination; governance and metadata become important
– Data warehouse use becomes the exception, where legacy or special requirements dictate
Four Stages of Data Lake Adoption
4: Data Lake and Application Cloud
– New applications are built on a Hadoop application platform around the data lake
– Hadoop matures as an elastic distributed data computing platform
– Data lake adds security and governance layers
– Data availability increases, application deployment time decreases
– Some apps still have special or legacy needs and execute independently
Questions?
