Cloudera - Amr Awadallah - Hadoop World 2010


Cloudera, Inc.

Business Analyst Tools & Applications for Hadoop Amr Awadallah Cloudera

Business Analyst Tools for Hadoop
Amr Awadallah
CTO, Cloudera, Inc.
Hadoop World
October 12th, 2010
Copyright 2010 Cloudera Inc. All Rights Reserved.

The Spectrum of Hadoop Users
[Diagram labels: Logs, Files, Web Data; Enterprise Data Warehouse; Web Application; Enterprise Reporting; BI, Analytics; Relational Databases; Low-Latency Serving Systems; IDEs; Cloudera Enterprise. User groups: Analysts, Business Users, Customers, Engineers, Operators.]

Evolution of Hadoop Query/Programming Languages
1. Java MapReduce: Gives the most flexibility and performance,
but potentially a long development cycle (the “assembly
language” of Hadoop).
2. Streaming MapReduce (also Pipes): Lets you develop in
any programming language of your choice, at the cost of slightly
lower performance and less flexibility.
3. Cascading: A thin Java library that sits on top of
MapReduce and lets developers assemble complex processes.
4. Pig: A high-level language out of Yahoo, suitable for batch data
flow workloads.
5. Hive: A SQL interpreter out of Facebook; it also includes a
metastore that maps files to their schemas and associated SerDe.
6. Oozie: A PDL XML workflow server engine that enables creating
a workflow of jobs composed of any of the above.
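
As an illustration of item 2, here is a minimal Python sketch of the mapper and reducer logic that a Streaming job might use to count distinct positive values in the first column of tab-separated input. The function names, the tab-separated layout, and the single-reducer assumption are all hypothetical and chosen for this example; a real Streaming job would read from standard input and write to standard output, and `run_locally` only simulates the sort between the two phases.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'value<TAB>1' record per line whose first column is an integer > 0."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        try:
            value = int(fields[0])
        except ValueError:
            continue  # skip rows whose first column is not an integer
        if value > 0:
            yield f"{value}\t1"


def reducer(sorted_records):
    """Count distinct keys, relying on records arriving sorted by key."""
    keys = (record.split("\t", 1)[0] for record in sorted_records)
    return sum(1 for _key, _group in groupby(keys))


def run_locally(lines):
    """Simulate map -> sort -> reduce in one process (assumes a single reducer)."""
    return reducer(sorted(mapper(lines)))
```

With more than one reducer, each reducer would see only a subset of the keys, so the per-reducer counts would still need to be summed in a further step.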

Hive vs Pig Example (count distinct values > 0)
• Hive syntax:
SELECT COUNT(DISTINCT col1)
FROM mytable
WHERE col1 > 0;
• Pig syntax:
mytable = LOAD 'myfile' AS (col1, col2, col3);
mytable = FOREACH mytable GENERATE col1;
mytable = FILTER mytable BY col1 > 0;
mytable = DISTINCT mytable;
mytable = GROUP mytable ALL;
mytable = FOREACH mytable GENERATE COUNT(mytable);
DUMP mytable;
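
For readers who know neither language, the result both queries aim for can be sketched in plain Python. This is only a reference for the intended semantics under the assumption that rows are tuples with `col1` first; the function name and sample rows are made up for illustration.

```python
def count_distinct_positive(rows):
    """Count the distinct values of the first column that are greater than zero."""
    col1 = (row[0] for row in rows)        # project the first column
    positive = (v for v in col1 if v > 0)  # keep values > 0
    distinct = set(positive)               # drop duplicates
    return len(distinct)                   # one count over all remaining values


# Hypothetical sample data: (col1, col2, col3)
rows = [(5, "a", "x"), (5, "b", "y"), (-2, "c", "z"), (9, "d", "w"), (0, "e", "v")]
print(count_distinct_positive(rows))  # prints 2
```

The count is taken over a single group containing every surviving value; grouping by the value itself would instead produce one count per distinct value.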

Hive Features
• A subset of SQL covering the most common statements
• Agile data types: Array, Map, Struct, and JSON objects
• User Defined Functions and Aggregates
• Regular Expression support
• MapReduce support
• JDBC/ODBC support
• Partitions and Buckets (for performance optimization)
• In the works: Indices, Columnar Storage, Views, MicroStrategy
compatibility, Explode/Collect
• More details: http://wiki.apache.org/hadoop/Hive
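
To make the "Partitions and Buckets" bullet more concrete, the following Python sketch models the general idea: rows are grouped by a partition value (like a directory) and then spread across a fixed number of buckets (like files), so a query that filters on the partition column only touches the matching groups. The column names `dt` and `user_id`, the bucket count, and the hash used here are assumptions for illustration and are not taken from Hive itself.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count


def bucket_for(user_id, num_buckets=NUM_BUCKETS):
    """Illustrative bucket assignment: non-negative hash modulo the bucket count."""
    return (hash(user_id) & 0x7FFFFFFF) % num_buckets


def build_layout(rows):
    """Group rows by (partition value, bucket number)."""
    layout = defaultdict(list)
    for row in rows:
        layout[(row["dt"], bucket_for(row["user_id"]))].append(row)
    return layout


def scan(layout, dt):
    """Return only the rows stored under the partition that matches dt."""
    return [r for (part, _bucket), rs in layout.items() if part == dt for r in rs]


rows = [
    {"dt": "2010-10-11", "user_id": 1},
    {"dt": "2010-10-12", "user_id": 6},
    {"dt": "2010-10-12", "user_id": 7},
]
layout = build_layout(rows)
print(len(scan(layout, "2010-10-12")))
```

The point of the sketch is that `scan` never looks inside groups belonging to other partition values, which is where the performance benefit comes from.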

The Hadoop Query Tool Ecosystem
• Cloudera Enterprise
• Cloudera’s Distribution for Hadoop
• In Memory: PowerPivot, QlikTech, EdgeSpring, Tableau
• ETL: Informatica, Pervasive, IBM DataStage, Microsoft SSIS, Talend, Kettle
• Query Authoring: Karmasphere, Quest (Toad)
• Spreadsheet: IBM BigSheets, Datameer
• BI/OLAP: MicroStrategy, IBM Cognos, SAP BOBJ, Microsoft SSRS, Jaspersoft, Pentaho
• Developer: Karmasphere, Eclipse, Cascading
• Stats/Math: SAS, IBM SPSS, Matlab, R/RHIPE, Mahout, Hama
• Reporting: SAP Crystal, Actuate/BIRT
Hadoop is very flexible; use the right tool for the job at hand.

Toad for Cloud (for Query Authoring)
[Screenshot labels: RDBMS, Hadoop]
Learn more at: http://www.ToadForCloud.com

Karmasphere (for Developers and Analysts)

Tableau (for Advanced Visualization)

Datameer (for Analysts, Spreadsheet UI)

MicroStrategy (for interactive Dashboards)

Talend (for Extract-Transform-Load, aka ETL)

General Advice for Choosing the Right Tool
• First and foremost, what problem are you trying to solve? And
what is your skill set? Use the tool that gets you there fastest.
• What is the learning curve involved with this new tool?
• Does the tool interoperate with other systems?
• Does the tool leverage your existing investment in Pig/Hive?
• Does the tool lock you into a proprietary file format?
• Is the tool certified for Cloudera’s Distribution for Hadoop?

Appendix

Hive Agile Data Types
• STRUCTS:
• SELECT mytable.mycolumn.myfield FROM …
• MAPS (Hashes):
• SELECT mytable.mycolumn[mykey] FROM …
• ARRAYS:
• SELECT mytable.mycolumn[5] FROM …
• JSON:
• SELECT get_json_object(mycolumn, objpath) FROM …
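
As a rough analogy, the four access patterns above map onto familiar Python operations. The sketch below uses a made-up row and a simplified path walker, `get_json_path`, that only understands dotted paths rooted at `$`; it is a hypothetical helper written for this example and not a reimplementation of Hive's `get_json_object`.

```python
import json

# Hypothetical row: one column per data type from the slide.
row = {
    "mystruct": {"myfield": "hello"},
    "mymap": {"mykey": 42},
    "myarray": [10, 11, 12, 13, 14, 15],
    "myjson": '{"store": {"name": "north", "tags": ["a", "b"]}}',
}


def get_json_path(json_text, path):
    """Walk a dotted path such as '$.store.name'; return None if it is absent."""
    if not path.startswith("$"):
        return None
    try:
        node = json.loads(json_text)
    except ValueError:
        return None  # invalid JSON text
    for key in path[1:].strip(".").split("."):
        if not key:
            continue  # the path was just '$'
        if isinstance(node, dict) and key in node:
            node = node[key]
        else:
            return None
    return node


print(row["mystruct"]["myfield"])                     # struct field access
print(row["mymap"]["mykey"])                          # map lookup by key
print(row["myarray"][5])                              # array element by index
print(get_json_path(row["myjson"], "$.store.name"))   # JSON path extraction
```

The JSON case differs from the other three in that the column holds plain text, which is parsed only at query time.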
