HIVE AND PRESTO FOR BIG DATA ANALYTICS IN THE CLOUD
WHAT’S NEW ABOUT BIG DATA, YOU SAY…
• Traditionally, analytics on data internal to an organization
  • Customer data
  • ERP data
  • Some pre-digested external data like market research
• Sophisticated analytics using new data sources
  • Social data
  • Website data
• Low density, fine grained and massive
• “Most EDWs are < 2TB”
LOW DENSITY, HIGH VOLUME DATA
Social media
Amul comment data: 18000 * 140 * 60 * 24 * 30 ≈ 100 GB per month
Website data
Category                   Unique visitors
Retail – Luxury goods      20 million
Retail – consumer goods    30 million
Retail – tickets           26 million
Traditional technologies cannot handle this low-density, high-volume data
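The monthly-volume estimate above can be checked with a few lines of Python. The slide does not state units, so this sketch assumes the factors mean 18,000 comments per minute at 140 bytes each, over 60 minutes, 24 hours and 30 days; the constant and function names are illustrative only.

```python
# Assumed reading of the slide's factors; the units are not given in the source.
COMMENTS_PER_MINUTE = 18_000
BYTES_PER_COMMENT = 140
MINUTES_PER_HOUR = 60
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30


def monthly_bytes() -> int:
    """Multiply the factors from the slide to get bytes per month."""
    return (COMMENTS_PER_MINUTE * BYTES_PER_COMMENT
            * MINUTES_PER_HOUR * HOURS_PER_DAY * DAYS_PER_MONTH)


total = monthly_bytes()
print(f"{total:,} bytes per month")         # 108,864,000,000 bytes per month
print(f"{total / 10**9:.1f} GB (decimal)")  # 108.9 GB (decimal)
```

Under this reading the product comes to roughly 109 GB, which the slide rounds to 100 GB per month.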
SKELETON OF A BIG DATA PROJECT
[Diagram: internal data and external data (TB - PBs) feed an analytics workflow that produces an actionable report]
HOW DO THE BIG GUYS DO IT?
• Build data centers
• Buy or build custom big-data software
• Hire ETL engineers who manage bringing data into the system
• Hire admins to keep it all running
• Hire data scientists to come up with interesting questions
• Hire developers who can translate questions into programs
BIG DATA PROJECT ENTAILS
• Lots of upfront investment
• Long time to get started
• Lots of risks
LANDSCAPE IS CHANGING
• Advent of public clouds
  • Cheap, reliable storage
  • Provision 10s to 1000s of machines in a couple of minutes
  • Pay as you go, grow as you please
• Free / inexpensive big-data software
  • Hadoop, Hive, Presto
CLOUD PRIMITIVES
• Persistent object store, e.g. AWS S3
  • Reliability is basically solved for you (*)
• Ability to provision clusters with pre-built images in a couple of minutes
  • Pay by the hour (or by the minute)
  • Spot instances (AWS)
• Relational DB as a Service
  • MySQL, PostgreSQL etc.
THE CLOUD CAN HANDLE YOUR DATA
CLOUD’S COMPUTE FLEXIBILITY
• Analytics workloads tend to be bursty
  • Most orgs struggle to predict usage 2-3 months down the line
  • Tend to overprovision compute
  • Result: < 30% utilization of their hardware
• Cloud allows you to scale up and down
  • Trickier for a big data system, but possible
[Figure: provisioning for peak workload. Source: Chen et al., VLDB 2012]
BIG DATA SOFTWARE
• Many open source projects
  • Hadoop, based on Google’s MapReduce paper (Yahoo)
  • Hive (SQL-on-Hadoop)
  • Presto (Fast SQL)
• Production ready, running at scale at Yahoo, FB and many other environments
ENTER HADOOP
• Open-source implementation of MapReduce, the model used by Google to index trillions of web pages
• Allows programmers to write distributed programs using map and reduce abstractions
• Ability to run these programs on large amounts of data
• Uses a bunch of cheap hardware, can tolerate failures
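To make the map and reduce abstractions concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code and uses no Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative. In a real cluster the map and reduce calls would run on many machines, with the framework doing the grouping step in between.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


print(word_count(["hive on hadoop", "presto and hive"]))
```

Because each map call sees only one line and each reduce call sees only one key, both phases can be spread across machines independently.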
HADOOP SCALES!
HIVE: SQL ON HADOOP
• Facebook had a multi-petabyte warehouse
  • Had 80+ engineers writing Hadoop jobs
• Files are insufficient data abstractions
  • Need tables, schemas, partitions, indices
• SQL is highly popular
  • So, implement SQL on top of Hadoop
  • Allowed non-programmers to process all the data
• FB open-sourced it
• Production ready
  • Processes 25PB of data at FB
  • Processes 20PB of data at Qubole
HIVE ALLOWS YOU TO DESCRIBE DATA
• Example
  • My data lives in Amazon S3 in a specific location
  • It is in delimited text format
  • Please create a virtual table for me
• Supports a number of data formats: JSON, Text, Binary, Avro, ProtoBuf, Thrift
• Analytics is often a downstream process
  • Conversion of data is time consuming and not productive
CREATE EXTERNAL TABLE nation (
  N_NATIONKEY INT,
  N_NAME STRING,
  N_REGIONKEY INT,
  N_COMMENT STRING
)
ROW FORMAT DELIMITED
STORED AS TEXTFILE
LOCATION 's3n://public-qubole/datasets/tpch5G/nation';
HIVE EXTENSIBILITY
• Connect to external data sources like MongoDB
• Write code to understand new data formats (SerDes)
• Custom UDFs in Java
• Plug in custom code in Python or any other language
SELECT TRANSFORM (hosting_ids, user_id, d)
  USING 'python combine_arrays.py'
  AS (hosting_ranks_array, user_id, d)
FROM s_table;
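A script plugged in through TRANSFORM receives each row on standard input as tab-separated text (Hive's default) and writes tab-separated rows to standard output. The contents of `combine_arrays.py` are not shown in the deck, so the sketch below is hypothetical: it assumes `hosting_ids` arrives as a comma-separated string and uses placeholder logic that replaces each id with its 1-based position.

```python
import io


def transform_line(line: str) -> str:
    """Turn one tab-separated input row into one tab-separated output row."""
    hosting_ids, user_id, d = line.rstrip("\n").split("\t")
    # Placeholder logic (assumption): rank each id by its position, starting at 1.
    ids = [x for x in hosting_ids.split(",") if x]
    ranks = ",".join(str(i) for i, _ in enumerate(ids, start=1))
    return "\t".join([ranks, user_id, d])


def run(stream_in, stream_out) -> None:
    """Read rows from stream_in and write transformed rows to stream_out."""
    for line in stream_in:
        if line.strip():
            stream_out.write(transform_line(line) + "\n")


# Demo with in-memory streams standing in for the rows Hive would send.
demo_in = io.StringIO("10,20,30\tu1\t2013-01-01\n7\tu2\t2013-01-02\n")
demo_out = io.StringIO()
run(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

When used as an actual TRANSFORM script, the file would end with a call such as `run(sys.stdin, sys.stdout)` instead of the in-memory demo.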
HIVE ALLOWS YOU TO QUERY THE DATA
• SQL-like
• Query is parallelized using Hadoop as the execution engine
SELECT count(*) FROM nation;
[Diagram: Count(*) is computed in parallel branches and the partial counts are combined with Sum()]
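The idea of answering count(*) with parallel partial counts and a final sum can be sketched in plain Python. This is an illustration of the pattern only, not Hive's implementation; the thread pool stands in for separate machines, and `partial_count` and `parallel_count` are illustrative names.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_count(partition):
    """Count the rows in one partition (one file or split of the table)."""
    return sum(1 for _ in partition)


def parallel_count(partitions):
    """Count each partition in a worker, then sum the partial counts."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(partial_count, partitions))
    return sum(partials)


# Three partitions of a small table, counted independently and then combined.
print(parallel_count([["r1", "r2", "r3"], ["r4", "r5"], []]))  # 5
```

Only the small partial counts travel to the final step, which is why this kind of aggregate parallelizes well.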
HIVE EXECUTION
• Split Hive query into multiple Hadoop/MR jobs
  • Run Job 1, save intermediate output to HDFS
  • Run Job 2..
  • Return results
• Data parallel because every Hadoop job runs on a number of machines
[Diagram: 10 files split 5 and 5 between tasks T11 and T12, each labeled 100MB]
TASK PARALLELISM
[Diagram: 10 files flow through tasks T1, T2 and T3, each labeled 100MB]
EXECUTION MODEL 1
[Diagram: 10 files processed by tasks T1, T2 and T3, each labeled 100MB]
• Only 100MB of memory required
• Can stop and resume
• Allows for multiplexing multiple pipelines
• Can tolerate failures
• Spilling can be expensive
• Time to first result is high
EXECUTION MODEL 2
[Diagram: 10 files processed by tasks T1, T2 and T3, each labeled 100MB]
• Task parallelism
• Needs 3X memory
• No spilling, hence much faster
• Early first results
• Stop and resume is trickier
• Multiplexing is more difficult
• Cannot tolerate failures
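The two execution models can be contrasted with a small Python sketch. It is a simplification under stated assumptions, not how Hive or Presto are implemented: three toy tasks stand in for T1, T2 and T3, a list stands in for intermediate output written to disk, and chained generators stand in for pipelined tasks that are all live at once.

```python
def strip_rows(rows):
    """T1 (toy): trim whitespace from every row."""
    for row in rows:
        yield row.strip()


def drop_empty(rows):
    """T2 (toy): filter out empty rows."""
    for row in rows:
        if row:
            yield row


def lengths(rows):
    """T3 (toy): map every row to its length."""
    for row in rows:
        yield len(row)


TASKS = [strip_rows, drop_empty, lengths]


def run_staged(rows, tasks):
    """Model 1: run each task to completion and materialize its full output."""
    data = list(rows)
    for task in tasks:
        data = list(task(data))  # the list stands in for intermediate files
    return data


def run_pipelined(rows, tasks):
    """Model 2: chain the tasks so each row streams through all of them."""
    stream = iter(rows)
    for task in tasks:
        stream = task(stream)
    return stream


rows = [" hive ", "", "presto"]
print(run_staged(rows, TASKS))           # [4, 6]
print(list(run_pipelined(rows, TASKS)))  # [4, 6]
```

Both functions return the same answer. The staged version can restart from the last materialized output, while the pipelined version yields its first result after reading a single input row but keeps every task active at the same time.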
ENTER PRESTO
• Hive used EM1 and had the associated disadvantages
• Internal project at Facebook to implement EM2 (Presto)
  • Use case was interactive queries over the same data
• Open sourced late 2013
• Promised much faster query performance
  • In-memory processing, aggressive pipelining
• Supports all the data formats that Hive does
• Can’t plug in user code at this point, vanilla SQL
CONTRASTING HIVE AND PRESTO
Hive                                    Presto
Uses Hadoop MR for execution (EM1)      Pipelined execution model (EM2)
Spills intermediate data to FS          Intermediate data in memory
Can tolerate failures                   Does not tolerate failures
Automatic join ordering                 User-specified join ordering
Can handle joins of two large tables    One table needs to fit in memory
Supports grouping sets                  Does not support grouping sets
Plug in custom code                     Cannot plug in custom code
More data types                         Limited data types
Versions compared: Hive 0.11 vs Presto 0.60
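The constraint that one table needs to fit in memory is characteristic of an in-memory hash join. The sketch below shows that general technique in plain Python; it is not taken from Presto's code, and the function name, row layout and sample tables are illustrative assumptions.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, key):
    """Join two lists of dict rows on `key`, holding build_rows in memory."""
    # Build side: the whole (smaller) table is loaded into a hash table.
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)
    # Probe side: the larger table is streamed one row at a time.
    for row in probe_rows:
        for match in table.get(row[key], []):
            yield {**match, **row}


regions = [{"regionkey": 2, "r_name": "ASIA"},
           {"regionkey": 3, "r_name": "EUROPE"}]
nations = [{"n_name": "INDIA", "regionkey": 2},
           {"n_name": "FRANCE", "regionkey": 3}]

for joined in hash_join(regions, nations, "regionkey"):
    print(joined)
```

Only the build side has to fit in memory, so choosing which table plays that role matters, which is one reason join ordering is significant in this model.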
PERFORMANCE COMPARISON
• Presto is 2.5-7x faster
• But, some queries just run out of memory
• Contrasts the execution models
IN A NUTSHELL
SAMPLE SETUP
[Diagram labels: Application, Sqoop, Sync, Cloud Storage, Heavy duty queries, Interactive queries]
CRYSTAL BALL
• Hive is actively working on task parallelism as part of the Stinger Initiative
• Presto is also making rapid progress in bridging some of its gaps
• There are other open source projects:
  • Impala, Shark, Drill, Tajo
• Lots of goodies for users
CONCLUSION
• Big Data Analytics is becoming accessible and affordable
• Public clouds give flexibility and change economics
• Hive and Presto provide intuitive and powerful ways to interact with your data
