The Hadoop Ecosystem


                       J Singh, DataThinks.org

                                   March 12, 2012
The Hadoop Ecosystem
• Introduction
   – What Hadoop is, and what it’s not
   – Origins and History
   – Hello Hadoop
• The Hadoop Bestiary
• The Hadoop Providers
• Hosted Hadoop Frameworks




© J Singh, 2011                          2
What Hadoop is, and what it’s not
• A Framework for Map Reduce

• A Top-level Apache Project

• Hadoop is
   – A Framework, not a “solution”
        • Think Linux or J2EE
   – Scalable
   – Great for pipelining massive amounts of data to achieve the end result
   – Sometimes the only option

• Hadoop is not
   – A painless replacement for SQL
   – Uniformly fast or efficient
   – Great for ad hoc analysis


You are ready for Hadoop when…
• You no longer get enthused by the prospect of more data
   – Rate of data accumulation is increasing
   – The idea of moving data from hither to yon is positively scary
   – A hit man threatens to delete your data in the middle of the night
        • And you want to pay him to do it


• Seriously, you are ready for Hadoop when analysis is the bottleneck
   – Could be because of data size
   – Could be because of the complexity of the data
   – Could be because of the level of analysis required
   – Could be because the analysis requirements are fluid




MapReduce Conceptual Underpinnings
• Based on Functional Programming model
   – From Lisp
        • (map square '(1 2 3 4)) → (1 4 9 16)
        • (reduce plus '(1 4 9 16)) → 30
   – From APL
        • +/ N → 10, where N ← 1 2 3 4


• Easy to distribute (based on each element of the vector)

• New for Map/Reduce: Nice failure/retry semantics
   – Hundreds and thousands of low-end servers are running at the
     same time
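The Lisp and APL snippets above map directly onto Python's built-ins; a minimal illustration of the functional model (not Hadoop code):

```python
from functools import reduce

# Map: apply a function to each element independently (trivially parallelizable).
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))

# Reduce: fold the mapped results into a single value.
total = reduce(lambda a, b: a + b, squares)

print(squares, total)  # [1, 4, 9, 16] 30
```

Because each map call is independent, the elements can be squared on different machines; only the reduce step needs to see the combined results.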



MapReduce Flow

                   Word Count Example




   Lines       MapOut      Result
   foo bar     foo 1       foo 3
   quux foo    bar 1       labs 1
   foo labs    quux 1      quux 2
   quux        foo 1       bar 1
               foo 1
               labs 1
               quux 1
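The flow above can be simulated in a few lines of Python: an in-process sketch of the map, shuffle/sort, and reduce phases (illustrative only, not the Hadoop API):

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit (word, 1) for every word on the line.
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # Reduce phase: sum all partial counts for one word.
    return word, sum(counts)

lines = ["foo bar", "quux foo", "foo labs", "quux"]

# Shuffle/sort phase: group the map output by key.
grouped = defaultdict(list)
for line in lines:
    for word, count in mapper(line):
        grouped[word].append(count)

result = dict(reducer(w, c) for w, c in grouped.items())
print(result)  # {'foo': 3, 'bar': 1, 'quux': 2, 'labs': 1}
```

Hadoop runs the same three phases, but with mappers and reducers spread across many machines and the grouping done by a distributed sort.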



Hello Hadoop
• Word Count
   – Example with Unstructured Data
   – Load 5 books from Gutenberg.org
     into /tmp/gutenberg
   – Load them into HDFS
   – Run Hadoop
        • Results are put into HDFS
   – Copy results into file system

   – What could be simpler?

   – DIY instructions for Amazon EC2
     available on DataThinks.org blog




The Hadoop Ecosystem
• Introduction
• The Hadoop Bestiary
   –   Core: Hadoop Map Reduce and Hadoop Distributed File System
   –   Data Access: HBase, Pig, Hive
   –   Algorithms: Mahout
   –   Data Import: Flume, Sqoop and Nutch
• The Hadoop Providers
• Hosted Hadoop Frameworks




The Core: Hadoop and HDFS
• Hadoop
   – One master, n slaves
   – Master
        • Schedules mappers & reducers
        • Connects pipeline stages
        • Handles failure semantics

• Hadoop Distributed File System
   – Robust data storage across machines, insulating against failure
   – Keeps n copies of each file
        • Configurable number of copies
        • Distributes copies across racks and locations




Hadoop Bestiary (p1a): HBase, Pig
• Database Primitives
   – HBase
        • Wide column data structure built on HDFS

• Processing
   – Pig
        • A high(-ish) level data-flow language and execution framework for parallel computation
        • Accesses HDFS and HBase
        • Batch as well as interactive
        • Integrates UDFs written in Java, Python, JavaScript
        • Compiles to map & reduce functions – not 100% efficiently




In Pig (Latin)

   Users    = load 'users' as (name, age);
   Filtered = filter Users by age >= 18 and age <= 25;
   Pages    = load 'pages' as (user, url);
   Joined   = join Filtered by name, Pages by user;
   Grouped  = group Joined by url;
   Summed   = foreach Grouped generate group,
                      COUNT(Joined) as clicks;
   Sorted   = order Summed by clicks desc;
   Top5     = limit Sorted 5;

   store Top5 into 'top5sites';
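To make the data flow concrete, here is the same query sketched in plain Python over small, made-up in-memory stand-ins for the 'users' and 'pages' inputs (the records below are hypothetical):

```python
from collections import Counter

# Hypothetical stand-ins for the 'users' and 'pages' inputs.
users = [("alice", 20), ("bob", 30), ("carol", 22)]
pages = [("alice", "a.com"), ("carol", "a.com"), ("alice", "b.com"), ("bob", "a.com")]

# filter Users by age >= 18 and age <= 25
filtered = {name for name, age in users if 18 <= age <= 25}

# join Filtered by name, Pages by user; group Joined by url; count clicks per url
clicks = Counter(url for user, url in pages if user in filtered)

# order Summed by clicks desc; limit Sorted 5
top5 = clicks.most_common(5)
print(top5)  # [('a.com', 2), ('b.com', 1)]
```

Pig generates the equivalent filter, join, group, and sort as a chain of MapReduce jobs, which is what the next slide shows.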


                  Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Pig Translation into Map Reduce

   Logical plan                     Pig script
   Load Users      Load Pages       Users   = load …
   Filter by age                    Fltrd   = filter …
                                    Pages   = load …
   Job 1: Join on name              Joined  = join …
          Group on url              Grouped = group …
                                    Summed  = … count()…
   Job 2: Count clicks              Sorted  = order …
          Order by clicks           Top5    = limit …

   Job 3: Take top 5


Hadoop Bestiary (p1b): HBase, Hive
• Database Primitives
   – HBase
        • Wide column data structure built on HDFS

• Processing
   – Hive
        • Data Warehouse Infrastructure
        • QL, a subset of SQL that supports primitives supportable by Map Reduce
        • Support for custom mappers and reducers for more sophisticated analysis
        • Compiles to map & reduce functions – not 100% efficiently

            Hive Example
        CREATE TABLE page_view(viewTime INT, userid BIGINT,
                         page_url STRING, referrer_url STRING,
                         ip STRING COMMENT 'IP Address of the User')
        :: ::
        STORED AS SEQUENCEFILE;

Hadoop Bestiary (p2): Mahout
• Algorithms
   – Mahout
        • Scalable machine learning and data mining
        • Runs on top of Hadoop
        • Written in Java
        • In active development
            – Algorithms being added

• Examples
   – Clustering Algorithms
        • Canopy Clustering
        • K-Means Clustering
        • …
   – Recommenders / Collaborative Filtering Algorithms
   – Other
        • Regression Algorithms
        • Neural Networks
        • Hidden Markov Models
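As a rough sketch of what the K-Means step does (a minimal one-dimensional version in Python; Mahout's implementation is distributed, multi-dimensional, and far more sophisticated):

```python
def kmeans_1d(points, centers, iterations=10):
    """Toy 1-D k-means: alternate assignment and update steps."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(ps) / len(ps) if ps else c for c, ps in clusters.items()]
    return sorted(centers)

points = [1.0, 1.2, 0.8, 9.0, 9.2, 8.8]
print(kmeans_1d(points, [0.0, 5.0]))  # converges near [1.0, 9.0]
```

On Hadoop, each iteration becomes a MapReduce pass: mappers assign points to the nearest center, reducers recompute the cluster means.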




Hadoop Bestiary (p3): Data Import
• Data Import Mechanisms
   – Sqoop: Structured Data
        • Import from RDBMS to HDFS
        • Export too
   – Flume: Streams
        • Import streams
            – Text Files
            – System Logs
   – Nutch
        • Import from Web
        • Note: Nutch + Hadoop = Lucene




Hadoop Bestiary (p4): Complete Picture




The Hadoop Ecosystem
• Introduction
• The Hadoop Bestiary
• The Hadoop Providers
   – Apache
   – Cloudera
   – Options when your data lives in a Database
• Hosted Hadoop Frameworks




Apache Distribution
• The Definitive Repository
   – The hub for Code, Documentation, Tutorials

   – Many contributors, for example
        • Pig was a Yahoo! Contribution
        • Hive came from Facebook
        • Sqoop came from Cloudera


• Bare metal install option:
   – Download to your machine(s) from Apache
   – Install and Operate
        • Modify to fit your business better




Cloudera
• Cloudera : Hadoop :: Red Hat : Linux

• Cloudera’s Distribution Including Apache Hadoop (CDH)
   – A packaged set of Hadoop modules that work together
   – Now at CDH3
   – Largest contributor of code to Apache Hadoop


• $76M in Venture funding so far




When the data lives in a Database…

• Objective: keeping Analytics and Data as close as possible


• Options for RDBMS:
   – Sqoop data to/from HDFS
        • Need to move the data
   – In-database analytics
        • Available for Teradata, Greenplum, etc.
        • If you have the need
            – And the $$$

• Options for NoSQL Databases
   – Sqoop-like connectors
        • Need to move the data
        • Can utilize all parts of Hadoop
   – Built-in Map Reduce available for most NoSQL databases
        • Knows about and tuned to the storage mechanism
        • But typically only offers map and reduce
            – No Pig, Hive, …



The Hadoop Ecosystem
• Introduction
• The Hadoop Bestiary
• The Hadoop Providers
• Hadoop Platforms as a Service
   –   Amazon Elastic MapReduce
   –   Hadoop in Windows Azure
   –   Google App Engine
   –   Other
        • Infochimps
        • IBM SmartCloud




Amazon Elastic Map Reduce (EMR)
• Hosted Map Reduce
   – CLI on your laptop
        • Control over size of cluster
        • Automatic spin-up/down instances


   – Map & Reduce programs on S3
        • Pig, Hive or
        • Custom in Java, Ruby, Python,
          Perl, PHP, R, C++, Cascading


   – Data In/Out on S3 or
   – Data In/Out on DynamoDB


• Keep in mind:
   – Hadoop on EC2 is also an option

Hadoop in Windows Azure
• Basic Level
   – Hive Add-in for Excel
   – Hive ODBC Driver


• Hadoop-based Distribution for Windows Server and Azure
   – Strategic Partnership with Hortonworks
   – Windows-based CLI on your laptop


• Broadest Level
   – JavaScript framework for Hadoop
   – Hadoop connectors for SQL Server and Parallel Data Warehouse




Google App Engine MapReduce
• Map Reduce as a Service
   – Distinct from Google’s internal Map Reduce
   – Part of Google App Engine


• Works with Google Datastore
   – A Wide Column Store


• A “purely programmatic” environment
   – Write Map and Reduce functions in Python / Java




Map Reduce Use at Google




Takeaways
• There are many flavors of
  Hadoop.
   – The important part is
     Functional Programming and
     Map Reduce

   – Don’t let the proliferation of
     choices stump you.

   – Experiment with it!




Thank you
• J Singh
   – President, Early Stage IT
        • Technology Services and Strategy for Startups


• DataThinks.org is a new service of Early Stage IT
   – “Big Data” analytics solutions






Editor's Notes

  • #4 Sources: “Top 5 Reasons Not to Use Hadoop for Analytics”, “The Dark Side of Hadoop”, “Hadoop Don’t’s: What not to do to harvest Hadoop’s full potential”
  • #8 Get started with Hadoop
  • #11 http://pig.apache.org/docs/r0.9.2/index.html, Apache Hadoop, Cascading
  • #14 http://pig.apache.org/docs/r0.9.2/index.html
  • #16 Flume Users Guide, Thrift Paper
  • #17 Missing components: Cascading