Top Trends in Building Data Lakes for Machine Learning and AI

Big Data in the Cloud
State of the Union and Future Trends
Ashish Thusoo
Qubole CEO & Co-Founder

00Copyright 2017 © Qubole
About Me
Today - CEO & Co-founder of Qubole
➢ Big Data Platform built for the Cloud
• Scaling to process 700+ PB of data a month in popular Clouds (AWS, Microsoft Azure, Oracle)
• Provides Auto-Orchestration and Self-Service for big data jobs (e.g. ETL, Machine Learning, Adhoc)
• Cloud and workload optimized Spark, Hive, Hadoop and Presto Engines
Background
➢ Started career at Oracle as a Principal Architect
➢ Ran Data Infrastructure team at Facebook from 2007-2011:
• Built out the Self-service Big Data Platform at Facebook for internal operations
• Saw huge growth, while chartered to provide all teams unified analytics (Marketing, Analytics,
Engineering, Sales Finance, etc.)
• Spawned developments of big data engines such as Apache Hive and Apache Presto
➢ Co-created and led Apache Hive project

From Data Warehouse to Data Lake
Evolution of Big Data

Changing Nature of Data

Changing Nature of Analytics
Descriptive
Analytics
Diagnostic
Analytic
Predictive
Analytics
Prescriptive
Analytics
What
happened?
Why did it
happen?
What will
happen?
How can we
make it happen?
DIFFICULTY
VALUE
Analytics Value Escalator

Breakdown of Data Warehouse – Emergence of Data Lake
ENTERPRISE DATA
WAREHOUSE
ETL
BI
Data
Mining
DATA LAKE
INGEST
ELT ML
Data
Discovery
Data
Warehouse
LEGACY DATA PLATFORM MODEREN DATA PLATFORM

Differences Between Data Lakes and Data Warehouses
DATA LAKE vs DATA WAREHOUSE
Semi-structured / unstructured /
structured / raw DATA
Structured data
SQL / Machine Learning / ETL /
Graph Analytics etc. ANALYTICS FLEXIBILITY
SQL
Cheap storage for large volumes
of data VOLUME
Expensive at large volumes of
data
High agility with ability to quickly
reconfigure for new workloads AGILITY
Fixed configuration and limited
agility
Data Engineers / Data Scientists /
Analysts USERS
Analysts / Business Users

Data Back Office with a Data Warehouse Centric Approach
Marketing
Ops
DeveloperSales
Ops
Product
Managers
Data Infrastructure
Data Team
Operations
Analyst
AnalystData
Architect
Business
Users
Product
Support
Customer
Support

Data Back Office Transformation with a Data Lake
Operations
Analyst
Marketing Ops
Analyst
Data
Architect
Business
Users
Product
Support
Customer
Support
Developer
Sales Ops
Product
Managers
Data
Infrastructure

State of Data Lake Adoption in Enterprises

The Five Stages of Data Lake Maturity
01
Stage
Aspiration
- Production
Reporting/DW
- Researching
02
Stage
Experimentation
- Initial Big Data
Deployment
- Targeted Use
Case
03
Stage
Expansion
- Multiple
Departments
- Multiple Engines
- Top Down Use
Cases
04
Stage
Inversion
- Enterprise
Transformation
- Bottoms up
use cases
-
05
Stage
Nirvana
- Digital
Enterprise
- Ubiquitous
Insights
- True Business
Transformation

Is there a plan in place
to move to a big data
self-service analytics
model?
Yes
65%
No
35%
Data Lake’s Reality Gap: Everyone wants self-service nirvana
65% Moving to Self-Service Model to Enable Data Professionals

How confident are you
that the data team can
achieve self-service
analytics?
16% 39% 32% 11% 2%
0% 20% 40% 60% 80% 100%
Extremely confident
Very confident
Confident
Somewhat doubtful
Extremely doubtful
Data Lake’s Reality Gap: IT is confident they can get there
87% Confident They Can Provide Self-Service Analytics

Data Lake’s Reality Gap: Only 8% are there today
How do you assess
your big data
maturity?
In process but still
have a lot to learn
35%
Need to scale and
optimize our
processes
28%
Still figuring it out
22%
Fully mature
process providing
tools and access
to insights for all
stakeholders
8%
We have proven
processes but still
can’t meet
business demand
7%
Only 8% Have Mature Big Data Processes

A Prescription to Success - Move to the Cloud
Adaptability
• Best machine configuration for the workload
• Best Engine for the workload
• On demand and elastic; automatically scale up
or down
Agility
• Initial provisioning in min/hours, not months
• Change configurations dynamically
• Compute and Storage scale independently
Cost
• Pay only for what you actually use
• Use spot instances to reduce cost by up to 80%

Cloud Computing and Data Lakes
Infrastructure as a Service

Changing Nature of the IT Infrastructure
Mainframe
Computing
Client-Server
Computing
Cloud
Computing
1960 1980 2000
2020

Cloud vs Data Centers
• Infrastructure is an API – Therefore Infrastructure can with the Application
App
STATIC & RIGID
FLEXIBLE & ELASTIC
App

Properties of Data Lakes
Big Data Analysis Is
Bursty
e.g. at Qubole we see
on an average the
minimum to maximum
size of infrastructure
to vary by 3400%
Ever
Expanding
e.g. data
processed on
Qubole as grown
2.5x in a year
Rapidly
Evolving
e.g. Spark,
Presto and
others
addressing gaps
in technology

Creating Data Lakes in the Cloud
+
1. Faster Time to Production (Minutes and Hours instead of 6-18 months)
2. Increasing success in Big Data and Data Science projects
3. Agility and Flexibility of Infrastructure to change compute for workloads and
technology
4. Significant Cost Reduction (Infrastructure fits avg utilization as opposed to peak)

Key Architecture – Separation of Compute and Storage
COMPUTE AND STORAGE COMPUTE AND STORAGE COMPUTE AND STORAGE
COMPUTE AND STORAGE COMPUTE AND STORAGE COMPUTE AND STORAGE
OBJECT STORAGE
ELASTIC COMPUTE
LEGACY DATA CENTER ARCHITECTURE CLOUD INFRASTRUCTURE ARCHITECTURE

Architecture – Putting Together Cloud Data Lakes

Cloud Data Lakes vs Data Center Data Lakes - Automation
Cluster Lifecycle Management
Auto start/terminate
Auto-scaling up/down
Performance Optimization
Cluster re-balancing
Performance/Caching
Cost Optimization
Spot node usage
Resource substitution

Cloud Data Lakes vs Data Center Data Lakes - TCO
25
5
10
15
20
Clustersize(Nodes)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Time
Minimum nodes
Maximum nodes
On Prem Cluster (fixed)
Actual Compute Usage
Cluster Lifecycle Savings (19% / 90%)
Workload Aware Auto-Scaling Savings (48% / 90%)
Spot Shopper Savings (15% / 45%)

Cloud Data Lakes vs Data Center Data Lakes – Concurrency and
Elasticity
select t.county, count(1) from (select
transform(a.zip) using ‘geo.py’ as a.county
from SMALL_TABLE a) tgroup by t.county;
insert overwrite table dest
select a.id, a.zip, count(distinct b.uid)
from ads a join LARGE_TABLE b on
(a.id=b.ad_id) group by a.id, a.zip;
OBJECT STORAGE
ELASTIC COMPUTE

Bringing it Together
Using the Cloud for Big Data Platforms and Data Science Operations is
Fundamentally Different from Operating Big Data Platforms On-Premise.
Done Properly This Leads to
1. Faster Time to Value
2. Increased Flexibility and Scale
3. Better Adoption of Analytics
4. Ability to Spend Smarter, and Not Less

Case Studies
Reduced Time to Market for New Products - Ibotta, Inc.
➔ Ibotta was faced with the challenges and limitations of a cloud Data Warehouse, reaching constraints
of being able to scale and leaving new data on the floor, which was prevent new features such as
personalized app experiences and campaign optimizations. This drove building a datalake, being the
source of ETL pipelines from ingestion, along with enabling Data Science and Product teams for
feature engineering.
Bursty ETL on Telemetry & User Data -
➔ Lyft crunches through all the data that an application that their drivers and riders create. Lyft
generates 3-5x more data on weekends vs. weekdays. Their data platform has to be flexible enough
to process a significant amount more on weekends in order to deliver the right insights back to their
users.

Case Studies
Data Governance for Customer 360 -
➔ Turner had a initiative for giving Engineering, Data Science and Analyst breaking down their data
silos in order to provide a 360 view of their engagement from campaigns through viewership. Using
QDS, they could easily set up governance and permissions in S3, eliminating the risk of exposing
sensitive data to the wrong teams, while enabling them the right access for their tools.
Data Science Development Workbench -
➔ Acxiom had a need to share data across different data science teams with the control to lock down
certain users. With Qubole, they have built a “Data Science Playground” that allows various DS
teams have a single environment for developing and testing models and algorithms before rolling
them into production.

Right Tool for the Right Team
Built for Anyone Who Uses
Data
Analysts
Data Scientists
Data Engineers
Data Admins
BI teams
Executive Dashboards
Products
A Single Platform for Any
Use Case
Reporting
Ad Hoc Queries
Machine Learning
Stream Processing
Vertical Apps
ETL Workflows
Open Source Engines,
Optimized for the Cloud
Cloud-Native, Optimized
for Agnostics
Web App User Interface, SDK (Python, Java, R, Ruby APIs), ODBC/JDBC connectors (visualization software and databases)
Workflows orchestration (Airflow, Oozie, Azkaban)

(Re)Emergence of Deep Learning

Deep Learning Applications
Applications today focused in areas around
• Image Recognition and Processing
• Speech Recognition and Processing
• NLP and Text Analysis

Emergence of New Use Cases and Technology
Deep Learning Platforms are Emerging

Serverless Computing

Advantages of Serverless
• Zero Administration
• Fast Bursting
• Cost Advantages

Emergence of Serverless Computing

Let’s Chat…
athusoo@qubole.com
Learn More at www.qubole.com
@qubole

Top Trends in Building Data Lakes for Machine Learning and AI

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (7)

Similar to Top Trends in Building Data Lakes for Machine Learning and AI

Similar to Top Trends in Building Data Lakes for Machine Learning and AI (20)

Recently uploaded

Recently uploaded (20)

Top Trends in Building Data Lakes for Machine Learning and AI