Data Science on Hadoop:
How Cloudera Impala Unlocks New
Productivity and Insights
Justin Erickson | Product Manager
Marcel Kornacker | Software Engineer
Ravikumar Visweswara | Software Engineer
October 2012
Why Data Scientists Love Hadoop

  •   Massive volumes of data




  •   Data preparation & analytics in 1 environment
  •   Highly flexible environment for creating & testing machine learning models




  •   10% the cost/TB under management
Hadoop Use Cases Moving to Real-Time




  •   Already query Hadoop using Hive
  •   Already load data into CDH every 90 mins or less
  •   Already use HBase for real-time data access




                                      Source: Cloudera customer survey August 2012
But Hadoop Isn’t Fast Enough




  •   Need faster queries on Hadoop data
  •   Move data from Hadoop to RDBMS for interactive SQL
  •   See value today in consolidating to a single platform




                               Source: Cloudera customer survey August 2012
Beyond Batch – The Next Stage for Hadoop
             HADOOP TODAY IS TOO SLOW
                     MapReduce is batch
       Simple queries can take minutes / tens of minutes


    CURRENT DATA MANAGEMENT IS TOO COMPLEX
                Optimized for rigid schemas &
                 special purpose applications
            Redundant data storage & processes
           Very expensive systems: $20K-150K / TB
Cloudera Enterprise RTQ
Real-Time Query for Data Stored in Hadoop
Powered by Cloudera Impala.
                           Supports Hive SQL
                           4-30X faster than Hive over MapReduce
                           Supports multiple storage engines &
                           file formats
                           Uses existing drivers, integrates with existing
                           metastore, works with leading BI tools
                           Flexible, cost-effective, no lock-in

                           Deploy & operate with Cloudera Manager
Cloudera Now Powered by Impala
          BEFORE IMPALA                                  WITH IMPALA
                                      USER INTERFACE



                                      BATCH PROCESSING       REAL-TIME ACCESS




  • Unified Storage:
     Supports HDFS and HBase
     Flexible file formats
  • Unified Metastore
  • Unified Security
  • Unified Client Interfaces:
     ODBC, SQL syntax, Hue Beeswax
  • With Impala:
     Real-time SQL queries
     Native distributed query engine
     Optimized for low latency
  • Provides:
     Answers as fast as you can ask
     Lets everyone ask questions of all data
     Big data storage and analytics together
Cloudera Impala Details
Common Hive SQL and interface                      Unified metadata and scheduler
           SQL App                          Hive                                    State
                                          Metastore      YARN       HDFS NN         Store
            ODBC




    Query Planner                 Query Planner       Fully MPP        Query Planner
 Query Coordinator              Query Coordinator     Distributed    Query Coordinator
 Query Exec Engine              Query Exec Engine                    Query Exec Engine
 HDFS DN     HBase              HDFS DN    HBase                    HDFS DN         HBase
                                                             Local Direct Reads
Cloudera Impala Details
Common Hive SQL and interface
           SQL App                             Hive                        State
                                             Metastore   YARN   HDFS NN    Store
            ODBC

                     SQL Request

    Query Planner                    Query Planner                Query Planner
 Query Coordinator                 Query Coordinator            Query Coordinator
 Query Exec Engine                 Query Exec Engine            Query Exec Engine
 HDFS DN     HBase                 HDFS DN    HBase             HDFS DN   HBase
Cloudera Impala Details
                                       Unified metadata and scheduler
          SQL App               Hive                                    State
                              Metastore      YARN       HDFS NN         Store
           ODBC




  Query Planner       Query Planner                        Query Planner
Query Coordinator   Query Coordinator                   Query Coordinator
Query Exec Engine   Query Exec Engine                    Query Exec Engine
HDFS DN     HBase   HDFS DN    HBase                    HDFS DN         HBase
Cloudera Impala Details
          SQL App               Hive                               State
                              Metastore     YARN        HDFS NN    Store
           ODBC




  Query Planner       Query Planner       Fully MPP       Query Planner
Query Coordinator   Query Coordinator     Distributed   Query Coordinator
Query Exec Engine   Query Exec Engine                   Query Exec Engine
HDFS DN     HBase   HDFS DN    HBase                    HDFS DN   HBase
Cloudera Impala Details
          SQL App               Hive                              State
                              Metastore   YARN      HDFS NN       Store
           ODBC




  Query Planner       Query Planner                    Query Planner
Query Coordinator   Query Coordinator                Query Coordinator
Query Exec Engine   Query Exec Engine                Query Exec Engine
HDFS DN     HBase   HDFS DN    HBase                HDFS DN       HBase
                                             Local Direct Reads
Cloudera Impala Details
          SQL App                             Hive                              State
                                            Metastore     YARN       HDFS NN    Store
           ODBC

                    SQL Results

  Query Planner                     Query Planner       In Memory      Query Planner
Query Coordinator                 Query Coordinator      Transfers   Query Coordinator
Query Exec Engine                 Query Exec Engine                  Query Exec Engine
HDFS DN     HBase                 HDFS DN    HBase                   HDFS DN   HBase
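The query flow in the diagrams above (SQL request in, plan fanned out to every node, local reads, in-memory transfers, SQL results out) can be illustrated with a minimal Python sketch. This is not Impala code: the node names, the sample rows, and the functions `exec_fragment` and `coordinate` are all hypothetical, and threads stand in for separate machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each "node" holds a local slice of a table
# as (keyword, clicks) rows. Names and values are illustrative only.
NODE_DATA = {
    "node1": [("hotel", 3), ("flight", 1)],
    "node2": [("hotel", 2), ("car", 4)],
    "node3": [("flight", 5)],
}

def exec_fragment(node):
    """Stand-in for a query exec engine: aggregate only this node's rows."""
    partial = Counter()
    for keyword, clicks in NODE_DATA[node]:
        partial[keyword] += clicks
    return partial

def coordinate():
    """Stand-in for a query coordinator: fan out, then merge partial results."""
    with ThreadPoolExecutor() as pool:
        # Every node runs the same fragment against its own local data.
        partials = list(pool.map(exec_fragment, NODE_DATA))
    total = Counter()
    for partial in partials:
        total.update(partial)  # merged in memory, no intermediate files
    return dict(total)

print(coordinate())
```

The point of the sketch is the shape of the computation: each node reduces its own data before anything crosses the network, so only small partial results travel to the coordinator.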
Advantages of Our Approach
•   No high-latency MapReduce batch processing
•   Local processing avoids network bottlenecks
•   No costly data format conversion overhead
•   All data immediately query-able
•   Single machine pool to scale
•   All machines available to both Impala and MapReduce
•   Single, open, and unified metadata and scheduler

[Diagram: alternative approaches, in three panels: MapReduce (Hive on MR over HDFS DataNodes), Remote Query (query nodes separate from the NameNode and DataNodes), Side Storage (query engine alongside MR and DataNode)]
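To make "local processing avoids network bottlenecks" concrete, here is a small Python sketch of locality-aware work assignment. It is a greedy illustration under stated assumptions, not Impala's actual scheduler: the block IDs, node names, and the function `assign_local` are hypothetical.

```python
# Hypothetical block map: block id -> nodes that hold a replica of it.
BLOCK_REPLICAS = {
    "blk_1": ["node1", "node2"],
    "blk_2": ["node2", "node3"],
    "blk_3": ["node1", "node3"],
    "blk_4": ["node2", "node3"],
}

def assign_local(block_replicas):
    """Give each block to the least-loaded node that already stores it."""
    load = {}
    assignment = {}
    for block, replicas in sorted(block_replicas.items()):
        # Ties on load are broken by node name to keep the result deterministic.
        node = min(replicas, key=lambda n: (load.get(n, 0), n))
        assignment[block] = node
        load[node] = load.get(node, 0) + 1
    return assignment

assignment = assign_local(BLOCK_REPLICAS)
# A read is "remote" when the assigned node does not hold the block.
remote_reads = sum(
    1 for block, node in assignment.items() if node not in BLOCK_REPLICAS[block]
)
print(assignment, remote_reads)
```

Because work is only ever placed on a node that holds a replica, every scan is a local disk read. A remote-query design, by contrast, would pull each block across the network to a separate query node.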
Cloudera Impala Demo
Benefits of Cloudera Impala
Real-Time Query for Data Stored in Hadoop
                       • Get answers as fast as you can ask questions
                       • Interactive analytics directly on source data
                       • No jumping between data silos
                       • Reduce duplicate storage with EDW
                       • Reduce data movement for interactive analysis
                       • Leverage existing tools and employee skills
                       • Ask questions of all your data
                       • No information loss from aggregation or
                         conforming to relational schemas for analysis

                       • Single metadata store from origination through analysis
                       • No need to hunt through multiple data silos
Cloudera powers real-time data hub
     The Challenge:
     • Needs to understand 2 years of clickstream data for greater insight
     • Legacy system cannot scale for data processing and analytics
                                                      So Expedia can optimize end-user
                                                      data-driven search results and
                                                      maximize Google AdWords spend.

                                                   The Solution:
                                                   • Cloudera Enterprise – 4 Petabytes
                                                   • One single scalable platform for Big data for
                                                     archive, ETL & analytics with real-time BI
                                                   • Running Impala

Validated Beta Partners
Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unlocks New Productivity and Insights


Editor's Notes

  • #19 Expedia’s use case for Impala: As the world’s leading online travel provider, Expedia’s business requires a fine-tuned website that understands what its visitors want and can deliver results to partner hotels, airlines and other travel vendors. Expedia has historically used traditional relational data warehouses to capture and analyze the clickstream data generated to, from and within its website, but saw the value in being able to capture greater volumes of historical, detailed data leveraging Hadoop. The goal: to better understand keyword conversions driving traffic to the site in order to optimize Google AdWords spend.
    Today, Expedia uses Hadoop to empower its full data lifecycle: data is collected from online activity, loaded into Hadoop, scored and analyzed, and that data generates scoring engines which impact the recommendations, search results and sort orders on Expedia.com.
    Most recently, Expedia has kicked off a project using HBase and Impala for real-time BI that will power their Market Manager, an interactive application used by merchants such as hotels so they can see how Expedia is performing vs. competitors. For example, if one hotel notices they aren’t getting many bookings through Expedia around Christmastime, they can drill into the application to find out why: is it because their prices are too high? Or are they running low on inventory for certain dates? With this solution, Expedia can glean these insights and proactively reach out to merchants with recommendations on how they might drive greater bookings. Impala will allow Expedia’s business users to access Hadoop in a more interactive, ad hoc, speed-of-thought manner. Latency will be cut in half, and Impala provides an extensible solution that will scale with the growth of the business.