Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Top Trends in Building Data Lakes for Machine Learning and AI


Published on

Presentation by Ashish Thusoo, Co-Founder & CEO at Qubole, on exploring the big data industry trends in moving from data warehouses to cloud-based data lakes.This presentation will cover how companies today are seeing a significant rise in the success of their big data projects by moving to the cloud to iteratively build more cost-effective data pipelines and new products with ML and AI.

Uncovering how services like AWS, Google, Oracle, and Microsoft Azure provide the storage and compute infrastructure to build self-service data platforms that can enable all teams and new products to scale iteratively.

Published in: Data & Analytics
  • Be the first to comment

Top Trends in Building Data Lakes for Machine Learning and AI

  1. 1. Big Data in the Cloud State of the Union and Future Trends Ashish Thusoo Qubole CEO & Co-Founder
  2. 2. 00Copyright 2017 © Qubole About Me Today - CEO & Co-founder of Qubole ➢ Big Data Platform built for the Cloud • Scaling to process 700+ PB of data a month in popular Clouds (AWS, Microsoft Azure, Oracle) • Provides Auto-Orchestration and Self-Service for big data jobs (e.g. ETL, Machine Learning, Adhoc) • Cloud and workload optimized Spark, Hive, Hadoop and Presto Engines Background ➢ Started career at Oracle as a Principal Architect ➢ Ran Data Infrastructure team at Facebook from 2007-2011: • Built out the Self-service Big Data Platform at Facebook for internal operations • Saw huge growth, while chartered to provide all teams unified analytics (Marketing, Analytics, Engineering, Sales Finance, etc.) • Spawned developments of big data engines such as Apache Hive and Apache Presto ➢ Co-created and led Apache Hive project
  3. 3. From Data Warehouse to Data Lake Evolution of Big Data
  4. 4. 00Copyright 2017 © Qubole Changing Nature of Data
  5. 5. 00Copyright 2017 © Qubole Changing Nature of Analytics Descriptive Analytics Diagnostic Analytic Predictive Analytics Prescriptive Analytics What happened? Why did it happen? What will happen? How can we make it happen? DIFFICULTY VALUE Analytics Value Escalator
  6. 6. 00Copyright 2017 © Qubole Breakdown of Data Warehouse – Emergence of Data Lake ENTERPRISE DATA WAREHOUSE ETL BI Data Mining DATA LAKE INGEST ELT ML Data Discovery Data Warehouse LEGACY DATA PLATFORM MODEREN DATA PLATFORM
  7. 7. 00Copyright 2017 © Qubole Differences Between Data Lakes and Data Warehouses DATA LAKE vs DATA WAREHOUSE Semi-structured / unstructured / structured / raw DATA Structured data SQL / Machine Learning / ETL / Graph Analytics etc. ANALYTICS FLEXIBILITY SQL Cheap storage for large volumes of data VOLUME Expensive at large volumes of data High agility with ability to quickly reconfigure for new workloads AGILITY Fixed configuration and limited agility Data Engineers / Data Scientists / Analysts USERS Analysts / Business Users
  8. 8. 00Copyright 2017 © Qubole Data Back Office with a Data Warehouse Centric Approach Marketing Ops DeveloperSales Ops Product Managers Data Infrastructure Data Team Operations Analyst AnalystData Architect Business Users Product Support Customer Support
  9. 9. 00Copyright 2017 © Qubole Data Back Office Transformation with a Data Lake Operations Analyst Marketing Ops Analyst Data Architect Business Users Product Support Customer Support Developer Sales Ops Product Managers Data Infrastructure
  10. 10. State of Data Lake Adoption in Enterprises
  11. 11. 00Copyright 2017 © Qubole The Five Stages of Data Lake Maturity 01 Stage Aspiration - Production Reporting/DW - Researching 02 Stage Experimentation - Initial Big Data Deployment - Targeted Use Case 03 Stage Expansion - Multiple Departments - Multiple Engines - Top Down Use Cases 04 Stage Inversion - Enterprise Transformation - Bottoms up use cases - 05 Stage Nirvana - Digital Enterprise - Ubiquitous Insights - True Business Transformation
  12. 12. 00Copyright 2017 © Qubole Is there a plan in place to move to a big data self-service analytics model? Yes 65% No 35% Data Lake’s Reality Gap: Everyone wants self-service nirvana 65% Moving to Self-Service Model to Enable Data Professionals
  13. 13. 00Copyright 2017 © Qubole How confident are you that the data team can achieve self-service analytics? 16% 39% 32% 11% 2% 0% 20% 40% 60% 80% 100% Extremely confident Very confident Confident Somewhat doubtful Extremely doubtful Data Lake’s Reality Gap: IT is confident they can get there 87% Confident They Can Provide Self-Service Analytics
  14. 14. 00Copyright 2017 © Qubole Data Lake’s Reality Gap: Only 8% are there today How do you assess your big data maturity? In process but still have a lot to learn 35% Need to scale and optimize our processes 28% Still figuring it out 22% Fully mature process providing tools and access to insights for all stakeholders 8% We have proven processes but still can’t meet business demand 7% Only 8% Have Mature Big Data Processes
  15. 15. 00Copyright 2017 © Qubole A Prescription to Success - Move to the Cloud Adaptability • Best machine configuration for the workload • Best Engine for the workload • On demand and elastic; automatically scale up or down Agility • Initial provisioning in min/hours, not months • Change configurations dynamically • Compute and Storage scale independently Cost • Pay only for what you actually use • Use spot instances to reduce cost by up to 80%
  16. 16. Cloud Computing and Data Lakes Infrastructure as a Service
  17. 17. 00Copyright 2017 © Qubole Changing Nature of the IT Infrastructure Mainframe Computing Client-Server Computing Cloud Computing 1960 1980 2000 2020
  18. 18. 00Copyright 2017 © Qubole Cloud vs Data Centers • Infrastructure is an API – Therefore Infrastructure can with the Application App STATIC & RIGID FLEXIBLE & ELASTIC App
  19. 19. 00Copyright 2017 © Qubole Properties of Data Lakes Big Data Analysis Is Bursty e.g. at Qubole we see on an average the minimum to maximum size of infrastructure to vary by 3400% Ever Expanding e.g. data processed on Qubole as grown 2.5x in a year Rapidly Evolving e.g. Spark, Presto and others addressing gaps in technology
  20. 20. 00Copyright 2017 © Qubole Creating Data Lakes in the Cloud + 1. Faster Time to Production (Minutes and Hours instead of 6-18 months) 2. Increasing success in Big Data and Data Science projects 3. Agility and Flexibility of Infrastructure to change compute for workloads and technology 4. Significant Cost Reduction (Infrastructure fits avg utilization as opposed to peak)
  22. 22. 00Copyright 2017 © Qubole Architecture – Putting Together Cloud Data Lakes
  23. 23. 00Copyright 2017 © Qubole Cloud Data Lakes vs Data Center Data Lakes - Automation Cluster Lifecycle Management Auto start/terminate Auto-scaling up/down Performance Optimization Cluster re-balancing Performance/Caching Cost Optimization Spot node usage Resource substitution
  24. 24. 00Copyright 2017 © Qubole Cloud Data Lakes vs Data Center Data Lakes - TCO 25 5 10 15 20 Clustersize(Nodes) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Time Minimum nodes Maximum nodes On Prem Cluster (fixed) Actual Compute Usage Cluster Lifecycle Savings (19% / 90%) Workload Aware Auto-Scaling Savings (48% / 90%) Spot Shopper Savings (15% / 45%)
  25. 25. 00Copyright 2017 © Qubole Cloud Data Lakes vs Data Center Data Lakes – Concurrency and Elasticity select t.county, count(1) from (select transform( using ‘’ as a.county from SMALL_TABLE a) tgroup by t.county; insert overwrite table dest select,, count(distinct b.uid) from ads a join LARGE_TABLE b on ( group by,; OBJECT STORAGE ELASTIC COMPUTE
  26. 26. 00Copyright 2017 © Qubole Bringing it Together Using the Cloud for Big Data Platforms and Data Science Operations is Fundamentally Different from Operating Big Data Platforms On-Premise. Done Properly This Leads to 1. Faster Time to Value 2. Increased Flexibility and Scale 3. Better Adoption of Analytics 4. Ability to Spend Smarter, and Not Less
  27. 27. Example Cases of Success
  28. 28. 00Copyright 2017 © Qubole Case Studies Reduced Time to Market for New Products - Ibotta, Inc. ➔ Ibotta was faced with the challenges and limitations of a cloud Data Warehouse, reaching constraints of being able to scale and leaving new data on the floor, which was prevent new features such as personalized app experiences and campaign optimizations. This drove building a datalake, being the source of ETL pipelines from ingestion, along with enabling Data Science and Product teams for feature engineering. Bursty ETL on Telemetry & User Data - ➔ Lyft crunches through all the data that an application that their drivers and riders create. Lyft generates 3-5x more data on weekends vs. weekdays. Their data platform has to be flexible enough to process a significant amount more on weekends in order to deliver the right insights back to their users.
  29. 29. 00Copyright 2017 © Qubole Case Studies Data Governance for Customer 360 - ➔ Turner had a initiative for giving Engineering, Data Science and Analyst breaking down their data silos in order to provide a 360 view of their engagement from campaigns through viewership. Using QDS, they could easily set up governance and permissions in S3, eliminating the risk of exposing sensitive data to the wrong teams, while enabling them the right access for their tools. Data Science Development Workbench - ➔ Acxiom had a need to share data across different data science teams with the control to lock down certain users. With Qubole, they have built a “Data Science Playground” that allows various DS teams have a single environment for developing and testing models and algorithms before rolling them into production.
  30. 30. 00Copyright 2017 © Qubole Right Tool for the Right Team Built for Anyone Who Uses Data Analysts Data Scientists Data Engineers Data Admins BI teams Executive Dashboards Products A Single Platform for Any Use Case Reporting Ad Hoc Queries Machine Learning Stream Processing Vertical Apps ETL Workflows Open Source Engines, Optimized for the Cloud Cloud-Native, Optimized for Agnostics Web App User Interface, SDK (Python, Java, R, Ruby APIs), ODBC/JDBC connectors (visualization software and databases) Workflows orchestration (Airflow, Oozie, Azkaban)
  31. 31. Future Directions
  32. 32. 00Copyright 2017 © Qubole (Re)Emergence of Deep Learning
  33. 33. 00Copyright 2017 © Qubole Deep Learning Applications Applications today focused in areas around • Image Recognition and Processing • Speech Recognition and Processing • NLP and Text Analysis
  34. 34. 00Copyright 2017 © Qubole Emergence of New Use Cases and Technology Deep Learning Platforms are Emerging
  35. 35. 00Copyright 2017 © Qubole Serverless Computing
  36. 36. 00Copyright 2017 © Qubole Advantages of Serverless • Zero Administration • Fast Bursting • Cost Advantages
  37. 37. 00Copyright 2017 © Qubole Emergence of Serverless Computing
  38. 38. Let’s Chat… Learn More at @qubole