Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unlocks New Productivity and Insights

This talk will cover what tools and techniques work and don’t work well for data scientists working on Hadoop today and how to leverage the lessons learned by the experts to increase your productivity as well as what to expect for the future of data science on Hadoop. We will leverage insights derived from the top data scientists working on big data systems at Cloudera as well as experiences from running big data systems at Facebook, Google, and Yahoo.

  • Expedia’s use case for Impala: As the world’s leading online travel provider, Expedia needs a fine-tuned website that understands what its visitors want and can deliver results to partner hotels, airlines and other travel vendors. Expedia has historically used traditional relational data warehouses to capture and analyze the clickstream data generated to, from and within its website, but saw the value in capturing greater volumes of detailed historical data with Hadoop. The goal: to better understand which keyword conversions drive traffic to the site, in order to optimize Google AdWords spend. Today, Expedia uses Hadoop to power its full data lifecycle: data is collected from online activity, loaded into Hadoop, scored and analyzed, and the results feed scoring engines that shape the recommendations, search results and sort orders on Expedia.com. Most recently, Expedia has kicked off a project using HBase and Impala for real-time BI that will power its Market Manager, an interactive application that merchants such as hotels use to see how Expedia is performing vs. competitors. For example, if a hotel notices it isn’t getting many bookings through Expedia around Christmastime, it can drill into the application to find out why: are its prices too high, or is it running low on inventory for certain dates? With this solution, Expedia can glean these insights and proactively reach out to merchants with recommendations on how they might drive more bookings. Impala will let Expedia’s business users access Hadoop in a more interactive, ad hoc, speed-of-thought manner. Latency will be cut in half, and Impala provides an extensible solution that will scale with the growth of the business.

Presentation Transcript

  • Data Science on Hadoop: How Cloudera Impala Unlocks New Productivity and Insights. Justin Erickson | Product Manager • Marcel Kornacker | Software Engineer • Ravikumar Visweswara | Software Engineer. October 2012
  • Why Data Scientists Love Hadoop • Massive volumes of data • Data preparation & analytics in 1 environment • Highly flexible environment for creating & testing machine learning models • 10% the cost/TB under management
  • Hadoop Use Cases Moving to Real-Time • Already query Hadoop using Hive • Already load data into CDH every 90 mins or less • Already use HBase for real-time data access (Source: Cloudera customer survey, August 2012)
  • But Hadoop Isn’t Fast Enough • Need faster queries on Hadoop data • Move data from Hadoop to RDBMS for interactive SQL • See value today in consolidating to a single platform (Source: Cloudera customer survey, August 2012)
  • Beyond Batch – The Next Stage for Hadoop. HADOOP TODAY IS TOO SLOW: • MapReduce is batch • Simple queries can take minutes / tens of minutes. CURRENT DATA MANAGEMENT IS TOO COMPLEX: • Optimized for rigid schemas & special purpose applications • Redundant data storage & processes • Very expensive systems: $20K-150K / TB
  • Cloudera Enterprise RTQ: Real-Time Query for Data Stored in Hadoop. Powered by Cloudera Impala. • Supports Hive SQL • 4-30X faster than Hive over MapReduce • Supports multiple storage engines & file formats • Uses existing drivers, integrates with existing metastore, works with leading BI tools • Flexible, cost-effective, no lock-in • Deploy & operate with Cloudera Manager
  • Cloudera Now Powered by Impala (diagram columns: BEFORE IMPALA, WITH IMPALA; layers: USER INTERFACE, BATCH PROCESSING, REAL-TIME ACCESS) • Unified Storage: Supports HDFS and HBase; Flexible file formats • Unified Metastore • Unified Security • Unified Client Interfaces: ODBC, SQL syntax, Hue Beeswax • With Impala: Real-time SQL queries; Native distributed query engine; Optimized for low-latency • Provides: Answers as fast as you can ask; Everyone to ask questions for all data; Big data storage and analytics together
  • Cloudera Impala Details (architecture diagram): a SQL App connects over ODBC; shared services are the Hive Metastore, YARN, HDFS NN and State Store; each node runs a Query Planner, Query Coordinator and Query Exec Engine alongside an HDFS DN and HBase. Callouts: Common Hive SQL and interface • Unified metadata and scheduler • Fully MPP Distributed • Local Direct Reads
  • Cloudera Impala Details (same diagram): Common Hive SQL and interface. The SQL Request arrives from the SQL App over ODBC.
  • Cloudera Impala Details (same diagram): Unified metadata and scheduler.
  • Cloudera Impala Details (same diagram): Fully MPP Distributed.
  • Cloudera Impala Details (same diagram): Local Direct Reads.
  • Cloudera Impala Details (same diagram): In Memory Transfers. The SQL Results return to the SQL App.
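
The request flow in the architecture slides above (a planner splits the query, a coordinator distributes the work, exec engines read node-local data, partial results are merged in memory) can be sketched as a toy scatter-gather in Python. This is only an illustrative model under stated assumptions, not Impala's implementation: the node names, the sample rows, and the functions `plan`, `execute_fragment` and `coordinate` are all hypothetical, and threads stand in for separate machines.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each "node" holds its own local rows of (keyword, clicks).
NODE_LOCAL_ROWS = {
    "node1": [("hotels", 3), ("flights", 1)],
    "node2": [("hotels", 2), ("cars", 4)],
    "node3": [("flights", 5)],
}


def plan(query):
    """Toy planner: emit one scan-and-aggregate fragment per node."""
    return [(node, query) for node in NODE_LOCAL_ROWS]


def execute_fragment(fragment):
    """Toy exec engine: read only node-local rows and aggregate in memory."""
    node, _query = fragment
    partial = defaultdict(int)
    for keyword, clicks in NODE_LOCAL_ROWS[node]:
        partial[keyword] += clicks
    return dict(partial)


def coordinate(query):
    """Toy coordinator: fan fragments out, then merge the partial results."""
    fragments = plan(query)
    with ThreadPoolExecutor(max_workers=len(fragments)) as pool:
        partials = list(pool.map(execute_fragment, fragments))
    merged = defaultdict(int)
    for partial in partials:
        for keyword, clicks in partial.items():
            merged[keyword] += clicks
    return dict(sorted(merged.items()))


# The query string is illustrative only; this sketch does not parse SQL.
result = coordinate("SELECT keyword, SUM(clicks) FROM clicks GROUP BY keyword")
print(result)
```

Each fragment touches only the rows held by its own node, so only the small partial aggregates cross node boundaries. That is the intuition behind the "Local Direct Reads" and "In Memory Transfers" callouts.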
  • Advantages of Our Approach • No high-latency MapReduce batch processing • Local processing avoids network bottlenecks • No costly data format conversion overhead • All data immediately query-able • Single machine pool to scale • All machines available to both Impala and MapReduce • Single, open, and unified metadata and scheduler (diagram: alternative approaches labeled MapReduce, Remote Query and Side Storage, drawn as query nodes and MR/Hive engines over HDFS NN and DNs)
  • Cloudera Impala Demo
  • Benefits of Cloudera Impala: Real-Time Query for Data Stored in Hadoop • Get answers as fast as you can ask questions • Interactive analytics directly on source data • No jumping between data silos • Reduce duplicate storage with EDW • Reduce data movement for interactive analysis • Leverage existing tools and employee skills • Ask questions of all your data • No information loss from aggregation or conforming to relational schemas for analysis • Single metadata store from origination through analysis • No need to hunt through multiple data silos
  • Cloudera powers real-time data hub. The Challenge: • Needs to understand 2 years of clickstream data for greater insight • Legacy system cannot scale for data processing and analytics. So Expedia can optimize end-user, data-driven search results and maximize Google AdWords spend. The Solution: • Cloudera Enterprise – 4 Petabytes • One single scalable platform for big data for archive, ETL & analytics with real-time BI • Running Impala
  • Validated Beta Partners