Data Science Day New York: The Platform for Big Data

Understand the problem current data systems are facing with the emergence of Big Data and the solution of a combined storage/compute layer.

  • Open Source – 100% Open Source, 100% Apache licensed, 100% Free.
  • Enterprise Ready – Predictable releases, Documentation, Hotfix Patches, Intensive QA.
  • Proven at Scale – Deployed at hundreds of enterprises across many industries.
  • Integrated – All required component versions & dependencies are properly managed.
  • Industry Standard – Existing RDBMS, ETL and BI systems work with it.
  • Many Form Factors – Public Cloud, Private Cloud, RHEL, Ubuntu, 32/64bit, etc.
  • Data Science Day New York: The Platform for Big Data

    1. The Platform for Big Data
       Amr Awadallah | CTO, Founder, Cloudera, Inc.
       aaa@cloudera.com, twitter: @awadallah
    2. The Problems with Current Data Systems
       • 1. Moving Data To Compute Doesn’t Scale
       • 2. Archiving = Premature Data Death
       • 3. Can’t Explore Original High Fidelity Raw Data
       • Diagram labels, top to bottom: BI Reports + Interactive Apps; RDBMS (aggregated data); ETL Compute Grid; Storage Only Grid (original raw data); Mostly Append; Collection; Instrumentation
       ©2012 Cloudera, Inc. All Rights Reserved.
    3. The Solution: A Combined Storage/Compute Layer
       • 1. Scalable Throughput For ETL & Aggregation (ETL Acceleration)
       • 2. Keep Data Alive Forever (Active Archive)
       • 3. Data Exploration & Advanced Analytics
       • Diagram labels, top to bottom: BI Reports + Interactive Apps; RDBMS (aggregated data); Hadoop: Storage + Compute Grid; Mostly Append; Collection; Instrumentation
    4. So What is Apache Hadoop?
       • A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license).
       • Core Hadoop has two main systems:
         • Hadoop Distributed File System: self-healing high-bandwidth clustered storage.
         • MapReduce: distributed fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction.
       • Key business values:
         • Flexibility – Store any data, Run any analysis.
         • Scalability – Start at 1TB/3-nodes grow to petabytes/1000s of nodes.
         • Economics – Cost per TB at a fraction of traditional options.
    5. The Hadoop Big Bang
       • Fastest sort of a TB, 62 secs over 1,460 nodes
       • Sorted a PB in 16.25 hours over 3,658 nodes
       • Hadoop World 2009, 500 attendees
    6. The Key Benefit: Agility/Flexibility
       • Schema-on-Write (RDBMS):
         • Schema must be created before any data can be loaded.
         • An explicit load operation has to take place which transforms data to DB internal serialization format.
         • New columns must be added explicitly before new data for such columns can be loaded into the database.
         • Pros: OLAP is Fast; Standards/Governance
       • Schema-on-Read (Hadoop):
         • Data is simply copied to the file store, no transformation is needed.
         • A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding).
         • New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.
         • Pros: Load is Fast; Flexibility/Agility
    7. Scalability: Scalable Software Development
       • Grows without requiring developers to re-architect their algorithms/application.
       • Diagram label: AUTO SCALE
    8. Economics: Return on Byte
       • Return on Byte (ROB) = value to be extracted from that byte divided by the cost of storing that byte
       • If ROB is < 1, the data will be buried in the tape wasteland; thus we need more economical active storage.
       • Diagram labels: High ROB, Low ROB
    9. The Big Data Platform: CDH4 – June 2012
       • Build/Test: APACHE BIGTOP
       • Job Workflow: APACHE OOZIE
       • Data Processing Lib: DataFu for Pig
       • Data Mining Lib: APACHE MAHOUT
       • Web Console: HUE
       • Interactive SQL: Impala
       • Metadata: APACHE HIVE MetaStore
       • Batch Processing Languages: APACHE PIG, APACHE HIVE
       • Fast Read/Write Access: APACHE HBASE
       • Data Integration: APACHE FLUME, APACHE SQOOP
       • Hadoop Core Kernel: MapReduce, HDFS
       • Cloud Deployment: APACHE WHIRR
       • Connectivity: ODBC/JDBC/FUSE/HTTPS
       • Coordination: APACHE ZOOKEEPER
       • Cloudera Manager Free Edition (Installation Wizard)
    10. CDH in the Enterprise Data Stack
       • Users: ENGINEERS, DATA SCIENTISTS, ANALYSTS, BUSINESS USERS, DATA ARCHITECTS, SYSTEM OPERATORS, CUSTOMERS
       • Tools: IDEs, Modeling Tools, BI / Analytics, Enterprise Reporting, Meta Data/ETL Tools, Cloudera Manager
       • Connectivity: ODBC, JDBC, NFS, HTTP
       • Connected systems: Enterprise Data Warehouse, Online Serving Systems
       • Data movement: Sqoop, Flume
       • Data sources: Logs, Files, Web Data, Relational Databases, Web/Mobile Applications
    11. HBase versus HDFS
       • HDFS:
         • Optimized For: Large Files; Sequential Access (Hi Throughput); Append Only
         • Use For: Fact tables that are mostly append only and require sequential full table scans.
       • HBase:
         • Optimized For: Small Records; Random Access (Lo Latency); Atomic Record Updates
         • Use For: Dimension tables which are updated frequently and require random low-latency lookups.
       • Not Suitable For: Low Latency Interactive OLAP.
    12. Use Case Examples
       • Retail: Price Optimization
       • Media: Content Targeting
       • Finance: Fraud Detection
       • Manufacturing: Diagnostics
       • Info Services: Satellite Imagery
       • Agriculture: Seed Optimization
       • Power: Smart Consumption
    13. Core Benefits of the Platform for Big Data
       • 1. FLEXIBILITY
         • STORE ANY DATA
         • RUN ANY ANALYSIS
         • KEEPS PACE WITH THE RATE OF CHANGE OF INCOMING DATA
       • 2. SCALABILITY
         • PROVEN GROWTH TO PBs/1,000s OF NODES
         • NO NEED TO REWRITE QUERIES, AUTOMATICALLY SCALES
         • KEEPS PACE WITH THE RATE OF GROWTH OF INCOMING DATA
       • 3. ECONOMICS
         • COST PER TB AT A FRACTION OF OTHER OPTIONS
         • KEEP ALL OF YOUR DATA ALIVE IN AN ACTIVE ARCHIVE
         • POWERING THE DATA BEATS ALGORITHM MOVEMENT
    14. Thank you!
       Amr Awadallah, CTO, Founder, Cloudera, Inc. <aaa@cloudera.com> @awadallah
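
The schema-on-read idea from slide 6 (raw data is copied in as-is, and columns are bound only when the data is read) can be illustrated with a small sketch. This is plain Python standing in for a SerDe, not the Hive SerDe API; the sample records, the column names, and the `read_with_schema` helper are all hypothetical.

```python
import io

# Raw records are stored exactly as they arrived; nothing is parsed at write time.
# (Hypothetical tab-separated log lines, used only for illustration.)
RAW_LOG = io.StringIO(
    "2012-06-01\talice\t/home\t200\n"
    "2012-06-01\tbob\t/cart\t500\n"
)


def read_with_schema(raw, columns):
    """Bind column names to raw fields at read time (late binding)."""
    raw.seek(0)  # re-read the same stored bytes on every call
    for line in raw:
        fields = line.rstrip("\n").split("\t")
        # zip stops at the shorter sequence, so fields that the
        # current column list does not name are simply ignored.
        yield dict(zip(columns, fields))


# The first version of the reader knows only three columns.
v1 = list(read_with_schema(RAW_LOG, ["date", "user", "path"]))

# Updating the reader exposes the fourth column for data that was
# already stored, without reloading or transforming that data.
v2 = list(read_with_schema(RAW_LOG, ["date", "user", "path", "status"]))

print(v1[1])
print(v2[1])
```

The stored bytes never change between the two reads; only the column list applied at read time does, which is why the new column appears retroactively for old records.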