Cloudera - Amr Awadallah - Hadoop World 2010

Business Analyst Tools & Applications for Hadoop

Amr Awadallah
Cloudera

    Cloudera - Amr Awadallah - Hadoop World 2010 Presentation Transcript

    • Business Analyst Tools for Hadoop
      Amr Awadallah
      CTO, Cloudera, Inc.
      Hadoop World
      October 12th, 2010
      Copyright 2010 Cloudera Inc. All Rights Reserved.
    • The Spectrum of Hadoop Users
      [Diagram. Users: Engineers, Analysts, Operators, Business Users, Customers. Tools: IDEs; BI, Analytics; Enterprise Reporting; Cloudera Enterprise. Systems and data sources: Enterprise Data Warehouse, Low-Latency Serving Systems, Web Application, Logs, Relational Databases, Web Data, Files.]
    • Evolution of Hadoop Query/Programming Languages
      Java MapReduce: Gives the most flexibility and performance, but with a potentially long development cycle (the “assembly language” of Hadoop).
      Streaming MapReduce (also Pipes): Lets you develop in any programming language of your choice, at the cost of slightly lower performance and less flexibility.
      Cascading: A thin Java library that sits on top of MapReduce and lets developers assemble complex processes.
      Pig: A high-level language out of Yahoo, suitable for batch data flow workloads.
      Hive: A SQL interpreter out of Facebook; it also includes a metastore that maps files to their schemas and associated SerDe.
      Oozie: A workflow server engine that uses an XML process definition language (PDL) to define a workflow of jobs composed of any of the above.
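
To make the Streaming MapReduce entry concrete, here is a minimal Python sketch of a word-count mapper and reducer in the streaming style: records are tab-separated text lines, and the reducer relies on its input arriving sorted by key. The function names (`mapper`, `reducer`, `run_local`) and the sample input are assumptions made for illustration, and `run_local` only simulates the map, sort, and reduce phases in a single process.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated (word, 1) record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts for each key; the input must already be sorted by key."""
    parsed = (record.split("\t", 1) for record in sorted_records)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_local(lines):
    """Simulate map -> sort (shuffle) -> reduce in one process, for illustration only."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for record in run_local(["hadoop hive pig", "hive pig", "pig"]):
        print(record)
```

In a real streaming job the mapper and reducer would typically be two separate scripts that read standard input and write standard output, with the framework handling the sort between them.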
    • Hive vs Pig Example (count distinct values > 0)
      Hive syntax:
      SELECT COUNT(DISTINCT col1)
      FROM mytable
      WHERE col1 > 0;
      Pig syntax:
      mytable = LOAD 'myfile' AS (col1, col2, col3);
      mytable = FOREACH mytable GENERATE col1;
      mytable = FILTER mytable BY col1 > 0;
      mytable = DISTINCT mytable;
      -- GROUP ... ALL puts every row in one group, so COUNT returns a single total
      grouped = GROUP mytable ALL;
      total = FOREACH grouped GENERATE COUNT(mytable);
      DUMP total;
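
Both snippets are meant to produce the same single number: how many distinct values of col1 are greater than zero. As a plain reference for that result, here is a small Python sketch; the function name and the sample rows are assumptions made for illustration, and NULLs are modeled as `None` and dropped by the filter.

```python
def count_distinct_positive(rows):
    """Return the number of distinct col1 values greater than zero.

    rows is an iterable of (col1, col2, col3) tuples.
    """
    col1_values = (row[0] for row in rows)  # project col1 only
    positive = (v for v in col1_values if v is not None and v > 0)  # keep col1 > 0
    return len(set(positive))  # de-duplicate, then count


sample_rows = [
    (3, "a", "x"),
    (3, "b", "y"),   # duplicate col1, counted once
    (-1, "c", "z"),  # filtered out: not > 0
    (0, "d", "w"),   # filtered out: not > 0
    (7, "e", "v"),
    (None, "f", "u"),  # filtered out: missing value
]

if __name__ == "__main__":
    print(count_distinct_positive(sample_rows))  # prints 2
```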
    • Hive Features
      A subset of SQL covering the most common statements
      Agile data types: Array, Map, Struct, and JSON objects
      User Defined Functions and Aggregates
      Regular Expression support
      MapReduce support
      JDBC/ODBC support
      Partitions and Buckets (for performance optimization)
      In the works: Indices, Columnar Storage, Views, MicroStrategy compatibility, Explode/Collect
      More details: http://wiki.apache.org/hadoop/Hive
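
The idea behind partitions and buckets can be sketched conceptually in Python: rows are grouped by a partition value so a query that filters on it reads only the matching group, and within each partition rows are spread across a fixed number of buckets by hashing a key. This is an illustration under stated assumptions, not Hive's implementation: the column names (`dt`, `user_id`), the bucket count, and the use of `zlib.crc32` as the hash are all hypothetical.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count


def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Assign a key to a bucket with a stable hash (crc32 is a stand-in, not Hive's hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def build_layout(rows):
    """Group rows by partition value (dt), then by bucket of user_id."""
    layout = defaultdict(lambda: defaultdict(list))
    for row in rows:
        layout[row["dt"]][bucket_for(row["user_id"])].append(row)
    return layout


def scan(layout, dt):
    """Read only the partition matching dt; other partitions are never touched."""
    return [row for bucket in layout.get(dt, {}).values() for row in bucket]


rows = [
    {"dt": "2010-10-11", "user_id": 1},
    {"dt": "2010-10-12", "user_id": 1},
    {"dt": "2010-10-12", "user_id": 2},
]

if __name__ == "__main__":
    layout = build_layout(rows)
    print(len(scan(layout, "2010-10-12")))  # prints 2
```

Because the same key always hashes to the same bucket, operations such as sampling or joining on that key can work bucket by bucket instead of over the whole table.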
    • The Hadoop Query Tool Ecosystem
      [Diagram. Tool categories: In Memory, ETL, Query Authoring, Spreadsheet, BI/OLAP, Developer, Reporting, Stats/Math.]
      [Tools shown: MicroStrategy, IBM Cognos, SAP BOBJ, Microsoft SSRS, Jaspersoft, Pentaho; Informatica, Pervasive, IBM DataStage, Microsoft SSIS, Talend, Kettle; Karmasphere, Eclipse, Cascading; PowerPivot, QlikTech, EdgeSpring; Tableau, Karmasphere, Quest (Toad); SAS, IBM SPSS, Matlab, R/RHIPE, Mahout, Hama; SAP Crystal, Actuate/BIRT; IBM BigSheets, Datameer. Platform: Cloudera Enterprise, Cloudera’s Distribution for Hadoop.]
      Hadoop is very flexible; use the right tool for the job at hand.
    • Toad for Cloud (for Query Authoring)
      Hadoop
      RDBMS
      Learn more at: http://www.ToadForCloud.com
    • Karmasphere (for Developers and Analysts)
    • Tableau (for Advanced Visualization)
    • Datameer (for Analysts, Spreadsheet UI)
    • MicroStrategy (for interactive Dashboards)
    • Talend (for Extract-Transform-Load, aka ETL)
    • General Advice for Choosing the Right Tool.
      First and foremost, what problem are you trying to solve? And what is your skill set? Use the tool that gets you there fastest.
      What is the learning curve involved with this new tool?
      Does the tool interoperate with other systems?
      Does the tool leverage your existing investment in Pig/Hive?
      Does the tool lock you in to a proprietary file format?
      Is the tool certified for Cloudera’s Distribution of Hadoop?
    • Appendix
    • Hive Agile Data Types
      STRUCTS:
      SELECT mytable.mycolumn.myfield FROM …
      MAPS (Hashes):
      SELECT mytable.mycolumn[mykey] FROM …
      ARRAYS:
      SELECT mytable.mycolumn[5] FROM …
      JSON:
      SELECT get_json_object(mycolumn, objpath) FROM …
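
For readers more familiar with general-purpose languages, the four access patterns above map roughly onto everyday Python structures: a struct behaves like a record with named fields, a map like a dictionary, an array like a zero-indexed list, and a JSON column like a string that is parsed on access. The sketch below is an analogy only; the sample row, its column names, and the helper `get_json_field` (which supports only simple `$.a.b` paths) are assumptions made for illustration.

```python
import json

# One hypothetical row with a column of each kind.
row = {
    "struct_col": {"myfield": "value-in-struct"},
    "map_col": {"mykey": 42},
    "array_col": [10, 11, 12, 13, 14, 15],
    "json_col": '{"user": {"name": "alice"}}',
}


def get_json_field(json_text, path):
    """Simplified stand-in for a JSON path lookup: supports only '$.a.b' style paths."""
    node = json.loads(json_text)
    for key in path.lstrip("$").strip(".").split("."):
        if not isinstance(node, dict) or key not in node:
            return None  # missing path yields no value
        node = node[key]
    return node


struct_value = row["struct_col"]["myfield"]  # like mycolumn.myfield
map_value = row["map_col"]["mykey"]          # like mycolumn[mykey]
array_value = row["array_col"][5]            # like mycolumn[5]: index 5 is the sixth element
json_value = get_json_field(row["json_col"], "$.user.name")

if __name__ == "__main__":
    print(struct_value, map_value, array_value, json_value)
```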