07.11.13

uweseiler

Introduction to the
Hadoop Ecosystem
07.11.13

About me

Big Data Nerd

Hadoop Trainer MongoDB Author

Photography Enthusiast

Travelpirate
07.11.13

About us

is a bunch of…

Big Data Nerds

Agile Ninjas

Continuous Delivery Gurus

Join us!
Enterprise Java Spec...
07.11.13

Agenda

• What is Big Data & Hadoop?
• Core Hadoop
• The Hadoop Ecosystem
• Use Cases
• What‘s next? Hadoop 2.0!
07.11.13

Agenda

• What is Big Data & Hadoop?
• Core Hadoop
• The Hadoop Ecosystem
• Use Cases
• What‘s next? Hadoop 2.0!
Big Data
Big Data is like teenage sex:
everybody talks about it,
nobody really knows how to
do it, everyone thinks
everyone else is...
Slides from APCON: Big Data in Action (http://de.slideshare.net/cnkelly/big-data-in-action)
07.11.13

My favorite definition
07.11.13

The classic definition

Volume

The 3 V’s of Big Data

Velocity
Variety
07.11.13

«Big Data» != Hadoop
g

NoSQL
Classification of NoSQL

07.11.13

Key-Value Stores

Column Stores

Graph Databases
...
Horizontal
Scaling
07.11.13

Vertical Scaling

RAM
CPU
Storage
07.11.13

Vertical Scaling

RAM
CPU
Storage
07.11.13

Vertical Scaling

RAM
CPU
Storage
07.11.13

Horizontal Scaling

RAM
CPU
Storage
07.11.13

RAM
CPU
Storage

Horizontal Scaling

RAM
CPU
Storage

RAM
CPU
Storage

RAM
CPU
Storage

RAM
CPU
Storage
07.11.13

Horizontal Scaling

RAM
CPU
Storage

RAM
CPU
Storage

RAM
CPU
Storage

RAM
CPU
Storage

RAM
CPU
Storage

RAM
CPU...
07.11.13

Why Hadoop?
Traditional data stores are expensive to scale
and by design difficult to distribute

Scale out is th...
How to scale data?

07.11.13

“Data“
w
worker

r

w
worker

r
“Result“

w
worker

r
07.11.13

But…

Parallel processing is
complicated!
07.11.13

But…

Data storage is not
trivial!
07.11.13

What is Hadoop?

Distributed Storage and
Computation Framework
07.11.13

What is Hadoop?

Hadoop != Database
07.11.13

What is Hadoop?

“Swiss army knife
of the 21st century”

http://www.guardian.co.uk/technology/2011/mar/25/media-...
The Hadoop App Store

07.11.13

HDFS

MapRed

HCat

Pig

Hive

HBase

Ambari

Avro

Cassandra

Chukwa

Flume

Hana

HyperT...
07.11.13

The Hadoop App Store
Hadoop
Distributions

Apache
Hadoop

+
+

•
•
•
•
less

HDFS
MapReduce
Hadoop Ecosystem
Had...
07.11.13

Agenda

• What is Big Data & Hadoop?
• Core Hadoop
• The Hadoop Ecosystem
• Use Cases
• What‘s next? Hadoop 2.0!
07.11.13

Data Storage

OK, first things
first!
I want to store all of
my <<Big Data>>
07.11.13

Data Storage
07.11.13

Hadoop Distributed File System

• Distributed file system for
redundant storage
• Designed to reliably store dat...
07.11.13

Hadoop Distributed File System

Intended for
• large files
• batch inserts
HDFS Architecture

07.11.13

Client

Master

Helper

File

NameNode

Secondary
NameNode

#1

#2

Rack 1
Slave

DataNode
#1...
07.11.13

HDFS

Let’s have a look…
07.11.13

Data Processing

Data stored, check!
Now I want to
create insights
from my data!
07.11.13

Data Processing
07.11.13

MapReduce

• Programming model for
distributed computations at a
massive scale
• Execution framework for
organiz...
07.11.13

Typical large-data problem

• Extract something of interest from each

Map

• Iterate over a large number of rec...
MapReduce Flow

07.11.13

Map
a

Map

b 2
Combine

a

c

3

Map

c

a

6

Combine

b 2
Partition

c

3

c

Map
2

b

Combi...
Combined Hadoop Architecture

07.11.13

Client

Master

Job

JobTracker

File

NameNode

Secondary
NameNode

Slave

Slave
...
07.11.13

Word Count Mapper in Java

public class WordCountMapper extends MapReduceBase implements
Mapper<LongWritable, Te...
07.11.13

Word Count Reducer in Java

public class WordCountReducer extends MapReduceBase
implements Reducer<Text, IntWrit...
07.11.13

Map/Reduce

Let’s have a look…
07.11.13

Agenda

• What is Big Data & Hadoop?
• Core Hadoop
• The Hadoop Ecosystem
• Use Cases
• What‘s next? Hadoop 2.0!
07.11.13

Scripting for Hadoop

Java for MapReduce?
I dunno, dude…
I’m more of a
scripting guy…
07.11.13

Scripting for Hadoop
07.11.13

Apache Pig

• High-level data flow language
• Made of two components:
• Data processing language Pig Latin
• Com...
07.11.13

Pig in the Hadoop ecosystem
Pig
Scripting

HCatalog
Metadata Management

MapReduce
Distributed Programming Frame...
07.11.13

Pig Latin

users = LOAD 'users.txt' USING PigStorage(',') AS (name,
age);
pages = LOAD 'pages.txt' USING PigStor...
07.11.13

Pig Execution Plan
07.11.13

Try that with Java…
07.11.13

Pig

Let’s have a look…
07.11.13

SQL for Hadoop

OK, Pig seems quite
useful…
But I’m more of a
SQL person…
07.11.13

SQL for Hadoop
07.11.13

Apache Hive

• Data Warehousing Layer on top of
Hadoop
• Allows analysis and queries
using a SQL-like language
07.11.13

Hive in the Hadoop ecosystem
Pig

Hive

Scripting

Query

HCatalog
Metadata Management

MapReduce
Distributed Pr...
07.11.13

Hive Architecture

Hive
Shell

Hive

Metastore

Hive
Server
Hive
Engine

Hive Thrift
Driver

Thrift
Applications...
07.11.13

Hive Example

CREATE TABLE users(name STRING, age INT);
CREATE TABLE pages(user STRING, url STRING);
LOAD DATA I...
07.11.13

Hive

Let’s have a look…
07.11.13

But wait, there’s still more!

More components of the
Hadoop Ecosystem
Mahout
07.11.13

Machine Learning

Hive

Scripting

SQL-like queries

Data storage

Sqoop

Flume

Import & Export of
relat...
07.11.13

Agenda

• What is Big Data & Hadoop?
• Core Hadoop
• The Hadoop Ecosystem
• Use Cases
• What‘s next? Hadoop 2.0!
Classical enterprise platform

Applications

07.11.13
Business
Intelligence

Business
Applications

Custom
Applications

D...
Big Data Platform

Applications

07.11.13
Business
Intelligence

Business
Applications

Custom
Applications

Dev Tools

Da...
Pattern #1: Refine data

Applications

07.11.13
Business
Intelligence

Business
Applications

Custom
Applications

Data Sy...
Pattern #2: Explore data

Applications

07.11.13
Business
Intelligence

Business
Applications

Custom
Applications
1

Data...
Pattern #3: Enrich data

Applications

07.11.13

Business
Applications

Custom
Applications
1

Data Systems

3
Traditional...
07.11.13

Bringing it all together…

One example…
07.11.13

Digital Advertising

• 6 billion ad deliveries per day
• Reports (and bills) for the
advertising companies neede...
AdServing Architecture

FFM

AdServer

AdServer

07.11.13

Hadoop Cluster

Synchronisation

Campaign
Database

Campaign
Da...
07.11.13

What’s next?

Hadoop 2.0
aka YARN
Hadoop 1.0

07.11.13

Built for web-scale batch apps
Single App

Single App

Batch

Batch

Single App

Single App

Single ...
07.11.13

MapReduce is good for…

• Embarrassingly parallel algorithms
• Summing, grouping, filtering, joining
• Off-line ...
07.11.13

MapReduce is OK for…

• Iterative jobs (e.g., graph algorithms)
– Each iteration must read/write data to
disk
– ...
07.11.13

MapReduce is not good for…

• Jobs that need shared state/coordination
– Tasks are shared-nothing
– Shared-state...
07.11.13

•

MapReduce limitations

Scalability
– Maximum cluster size ~ 4,500 nodes
– Maximum concurrent tasks – 40,000
–...
07.11.13

Hadoop 2.0: Next-gen platform

Single use system
Batch Apps

Hadoop 1.0

MapReduce
Cluster resource mgmt.
+ data...
Taking Hadoop beyond batch

07.11.13

Store all data in one place

Interact with data in multiple ways
Applications run na...
07.11.13

A brief history of Hadoop 2.0

• Originally conceived & architected by the
team at Yahoo!
–

The team at Hortonw...
07.11.13

Hadoop 2.0 Projects

• YARN
• HDFS Federation aka HDFS 2.0
• Stinger & Tez aka Hive 2.0
07.11.13

Hadoop 2.0 Projects

• YARN
• HDFS Federation aka HDFS 2.0
• Stinger & Tez aka Hive 2.0
07.11.13

YARN: Architecture

Split up the two major functions of the JobTracker

Cluster resource management & Applicatio...
07.11.13

YARN: Architecture

• Resource Manager
– Global resource scheduler
– Hierarchical queues

•

Node Manager
– Per-...
07.11.13

YARN: Architecture
ResourceManager

Scheduler
NodeManager

NodeManager

NodeManager

NodeManager

MapReduce 1

m...
07.11.13

Hadoop 2.0 Projects

• YARN
• HDFS Federation aka HDFS 2.0
• Stinger & Tez aka Hive 2.0
07.11.13

HDFS Federation

• Removes tight coupling of Block
Storage and Namespace
• Scalability & Isolation
• High Availa...
HDFS Federation: Architecture

07.11.13

NameNodes do not talk to each other

NameNode 1

NameNode 2

Namespace 1
logs

fi...
07.11.13

Only the active
writes edits

HDFS: Quorum based storage
Journal
Node

Journal
Node

Active NameNode
Block
Map

...
07.11.13

Hadoop 2.0 Projects

• YARN
• HDFS Federation aka HDFS 2.0
• Stinger & Tez aka Hive 2.0
07.11.13

Real-Time
• Online systems
• R-T analytics
• CEP

0-5s

Hive: Current Focus Area

Interactive
• Parameterized
Re...
07.11.13

Real-Time
• Online systems
• R-T analytics
• CEP

Stinger: Extending the sweet spot
NonInteractive

Interactive
...
07.11.13

Stinger Initiative at a glance
07.11.13

Tez: The Execution Engine

•

Low level data-processing execution engine

•

Use it for the base of MapReduce, H...
Pig/Hive MR vs. Pig/Hive Tez

07.11.13

SELECT a.state, COUNT(*),
AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c O...
07.11.13

Tez Service

• MapReduce Query Startup is expensive:
– Job launch & task-launch latencies are fatal for
short qu...
07.11.13

Tez: Low latency
SELECT a.state, COUNT(*),
AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId =...
07.11.13

Stinger: Summary

* Real numbers, but handle with care!
07.11.13

•
•
•
•
•
•
•
•

Hadoop 2.0 Applications

MapReduce 2.0
HOYA - HBase on YARN
Storm, Spark, Apache S4
Hamster (MP...
07.11.13

•
•
•
•
•
•
•
•

Hadoop 2.0 Applications

MapReduce 2.0
HOYA - HBase on YARN
Storm, Spark, Apache S4
Hamster (MP...
07.11.13

MapReduce 2.0

• Basically a port to the YARN
architecture
• MapReduce becomes a user-land
library
• No need ...
07.11.13

•
•
•
•
•
•
•
•

Hadoop 2.0 Applications

MapReduce 2.0
HOYA - HBase on YARN
Storm, Spark, Apache S4
Hamster (MP...
07.11.13

HOYA: HBase on YARN

• Create on-demand HBase clusters
• Configure different HBase instances
differently
• Bette...
07.11.13

•
•
•
•
•
•
•
•

Hadoop 2.0 Applications

MapReduce 2.0
HOYA - HBase on YARN
Storm, Spark, Apache S4
Hamster (MP...
07.11.13

Twitter Storm

• Stream-processing
• Real-time processing
• Developed as standalone application
• https://github...
07.11.13

Storm: Conceptual view
Bolt:

Spout:
Source of streams

Spout

Bolt

Consumer of streams,
Processing of tuples,
...
07.11.13

•
•
•
•
•
•
•
•

Hadoop 2.0 Applications

MapReduce 2.0
HOYA - HBase on YARN
Storm, Spark, Apache S4
Hamster (MP...
07.11.13

Spark

• High-speed in-memory analytics over
Hadoop and Hive
• Separate MapReduce-like engine
–
–

Speedup of up...
07.11.13

Data Sharing in Spark
07.11.13

•
•
•
•
•
•
•
•

Hadoop 2.0 Applications

MapReduce 2.0
HOYA - HBase on YARN
Storm, Spark, Apache S4
Hamster (MP...
07.11.13

Apache Giraph

• Giraph is a framework for processing semistructured graph data on a massive scale.
• Giraph is ...
07.11.13

Hadoop 2.0 Summary

1. Scale
2. New programming models &
Services
3. Improved cluster utilization
4. Agility
5. ...
07.11.13

Getting started…

One more thing…
07.11.13

Hortonworks Sandbox

http://hortonworks.com/products/hortonworsk-sandbox
07.11.13

1.

Books about Hadoop
Hadoop - The Definitive Guide, Tom White,
3rd ed., O’Reilly, 2012.

2.

Hadoop in Action, C...
07.11.13

The end…or the beginning?

Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

Talk held at the IT-Stammtisch Darmstadt on 08.11.2013

Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What‘s next? Hadoop 2.0!

Published in: Technology, Business

Transcript of "Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)"

  1. 1. 07.11.13 uweseiler Introduction to the Hadoop Ecosystem
  2. 2. 07.11.13 About me Big Data Nerd Hadoop Trainer MongoDB Author Photography Enthusiast Travelpirate
  3. 3. 07.11.13 About us is a bunch of… Big Data Nerds Agile Ninjas Continuous Delivery Gurus Join us! Enterprise Java Specialists Performance Geeks
  4. 4. 07.11.13 Agenda • What is Big Data & Hadoop? • Core Hadoop • The Hadoop Ecosystem • Use Cases • What‘s next? Hadoop 2.0!
  5. 5. 07.11.13 Agenda • What is Big Data & Hadoop? • Core Hadoop • The Hadoop Ecosystem • Use Cases • What‘s next? Hadoop 2.0!
  6. 6. Big Data
  7. 7. Big Data is like teenage sex: everybody talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…
  8. 8. Slides from APCON: Big Data in Action (http://de.slideshare.net/cnkelly/big-data-in-action)
  9. 9. Slides from APCON: Big Data in Action (http://de.slideshare.net/cnkelly/big-data-in-action)
  10. 10. Slides from APCON: Big Data in Action (http://de.slideshare.net/cnkelly/big-data-in-action)
  11. 11. Slides from APCON: Big Data in Action (http://de.slideshare.net/cnkelly/big-data-in-action)
  12. 12. Slides from APCON: Big Data in Action (http://de.slideshare.net/cnkelly/big-data-in-action)
  13. 13. Slides from APCON: Big Data in Action (http://de.slideshare.net/cnkelly/big-data-in-action)
  14. 14. Slides from APCON: Big Data in Action (http://de.slideshare.net/cnkelly/big-data-in-action)
  15. 15. Slides from APCON: Big Data in Action (http://de.slideshare.net/cnkelly/big-data-in-action)
  16. 16. Slides from APCON: Big Data in Action (http://de.slideshare.net/cnkelly/big-data-in-action)
  17. 17. Slides from APCON: Big Data in Action (http://de.slideshare.net/cnkelly/big-data-in-action)
  18. 18. Slides from APCON: Big Data in Action (http://de.slideshare.net/cnkelly/big-data-in-action)
  19. 19. 07.11.13 My favorite definition
  20. 20. 07.11.13 The classic definition Volume The 3 V’s of Big Data Velocity Variety
  21. 21. 07.11.13 «Big Data» != Hadoop
  22. 22. g NoSQL
  23. 23. Classification of NoSQL 07.11.13 Key-Value Stores • Column Stores • Graph Databases • Document Stores
  24. 24. Horizontal Scaling
  25. 25. 07.11.13 Vertical Scaling RAM CPU Storage
  26. 26. 07.11.13 Vertical Scaling RAM CPU Storage
  27. 27. 07.11.13 Vertical Scaling RAM CPU Storage
  28. 28. 07.11.13 Horizontal Scaling RAM CPU Storage
  29. 29. 07.11.13 RAM CPU Storage Horizontal Scaling RAM CPU Storage RAM CPU Storage RAM CPU Storage RAM CPU Storage
  30. 30. 07.11.13 Horizontal Scaling RAM CPU Storage RAM CPU Storage RAM CPU Storage RAM CPU Storage RAM CPU Storage RAM CPU Storage RAM CPU Storage RAM CPU Storage RAM CPU Storage RAM CPU Storage RAM CPU Storage RAM CPU Storage RAM CPU Storage RAM CPU Storage RAM CPU Storage
  31. 31. 07.11.13 Why Hadoop? Traditional data stores are expensive to scale and, by design, difficult to distribute. Scale out is the way to go!
  32. 32. How to scale data? 07.11.13 “Data“ w worker r w worker r “Result“ w worker r
  33. 33. 07.11.13 But… Parallel processing is complicated!
  34. 34. 07.11.13 But… Data storage is not trivial!
  35. 35. 07.11.13 What is Hadoop? Distributed Storage and Computation Framework
  36. 36. 07.11.13 What is Hadoop? Hadoop != Database
  37. 37. 07.11.13 What is Hadoop? “Swiss army knife of the 21st century” http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop
  38. 38. The Hadoop App Store 07.11.13 HDFS MapRed HCat Pig Hive HBase Ambari Avro Cassandra Chukwa Flume Hana HyperT Impala Mahout Nutch Oozie Sqoop Scribe Tez Vertica Whirr ZooKee Horton Cloudera MapR EMC Intel IBM Talend TeraData Pivotal Informat Microsoft. Pentaho Jasper Sync Kognitio Tableau Splunk Platfora Rack Karma Actuate MicStrat
  39. 39. 07.11.13 The Hadoop App Store, from less to more functionality: Apache Hadoop (HDFS, MapReduce, Hadoop Ecosystem, Hadoop YARN) + Hadoop Distributions (• Test & Packaging • Installation • Monitoring • Business Support) + Big Data Suites (• Integrated Environment • Visualization • (Near-)Realtime analysis • Modeling • ETL & Connectors)
  40. 40. 07.11.13 Agenda • What is Big Data & Hadoop? • Core Hadoop • The Hadoop Ecosystem • Use Cases • What‘s next? Hadoop 2.0!
  41. 41. 07.11.13 Data Storage OK, first things first! I want to store all of my <<Big Data>>
  42. 42. 07.11.13 Data Storage
  43. 43. 07.11.13 Hadoop Distributed File System • Distributed file system for redundant storage • Designed to reliably store data on commodity hardware • Built to expect hardware failures
  44. 44. 07.11.13 Hadoop Distributed File System Intended for • large files • batch inserts
  45. 45. HDFS Architecture 07.11.13 Client Master Helper File NameNode Secondary NameNode #1 #2 Rack 1 Slave DataNode #1 Block Map Journal Log periodical merges Rack 2 Slave DataNode #1 Slave DataNode #1
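The block map kept by the NameNode can be pictured with a small in-memory sketch. It is not HDFS code: the class and method names are hypothetical, the 64 MB block size and replication factor 3 are assumed example values rather than figures from the slides, and the round-robin placement is a deliberate simplification that ignores racks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a NameNode-style block map; not the HDFS API.
final class BlockMapSketch {

    // Number of blocks a file occupies (the last block may be only partly filled).
    static long blockCount(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize;
    }

    // Simplified round-robin placement: block b is stored on `replication`
    // distinct DataNodes, starting at node b. Placement by rack is ignored here.
    static List<List<String>> place(long fileSize, long blockSize,
                                    List<String> dataNodes, int replication) {
        if (replication > dataNodes.size()) {
            throw new IllegalArgumentException("not enough DataNodes for the requested replication");
        }
        List<List<String>> blockMap = new ArrayList<>();
        long blocks = blockCount(fileSize, blockSize);
        for (long b = 0; b < blocks; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                replicas.add(dataNodes.get((int) ((b + r) % dataNodes.size())));
            }
            blockMap.add(replicas);
        }
        return blockMap;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed example: a 200 MB file, 64 MB blocks, 4 DataNodes, 3 replicas.
        List<List<String>> blockMap =
                place(200 * mb, 64 * mb, List.of("dn1", "dn2", "dn3", "dn4"), 3);
        for (int b = 0; b < blockMap.size(); b++) {
            System.out.println("block " + b + " -> " + blockMap.get(b));
        }
    }
}
```

With these assumed values the file occupies four blocks, and losing any single DataNode still leaves two replicas of every block.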
  46. 46. 07.11.13 HDFS Let’s have a look…
  47. 47. 07.11.13 Data Processing Data stored, check! Now I want to create insights from my data!
  48. 48. 07.11.13 Data Processing
  49. 49. 07.11.13 MapReduce • Programming model for distributed computations at a massive scale • Execution framework for organizing and performing such computations • Data locality is king
  50. 50. 07.11.13 Typical large-data problem • Extract something of interest from each Map • Iterate over a large number of records • Shuffle and sort intermediate results • Generate final output Reduce • Aggregate intermediate results
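The steps on this slide can be sketched on a single machine in plain Java. This is only an illustration of the programming model, not the Hadoop API; the class and method names are hypothetical, and word counting is used as the example problem.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of map -> shuffle/sort -> reduce. Hypothetical names, not the Hadoop API.
final class MapReduceSketch {

    // Map: extract something of interest from each record (one (word, 1) pair per word).
    static List<Map.Entry<String, Integer>> map(String record) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : record.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle and sort: group the intermediate values by key, with keys in sorted order.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: aggregate the intermediate values of one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs all three phases over the input records and returns the final output.
    static Map<String, Integer> run(List<String> records) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String record : records) {
            intermediate.addAll(map(record));
        }
        Map<String, Integer> output = new TreeMap<>();
        shuffle(intermediate).forEach((key, values) -> output.put(key, reduce(values)));
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "c b a"))); // prints {a=3, b=2, c=1}
    }
}
```

In a real cluster the map calls run in parallel on the nodes that hold the data, and the shuffle moves the intermediate pairs across the network to the reducers.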
  51. 51. MapReduce Flow 07.11.13 Map a Map b 2 Combine a c 3 Map c a 6 Combine b 2 Partition c 3 c Map 2 b Combine 9 a Partition 3 c 7 c Combine b 2 Partition 7 c Partition Shuffle and Sort a 1 3 Reduce a 4 b 7 Reduce b 9 8 c 2 8 Reduce c 19 9 8
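The Combine and Partition boxes in this flow can be illustrated with two small functions. This is a sketch with hypothetical names: the combiner pre-aggregates one mapper's output locally before anything is sent over the network, and the partitioner picks a reducer from the key's hash. The number of reducers (3) is an assumed example value.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the combine and partition steps. Hypothetical names, not the Hadoop API.
final class CombinePartitionSketch {

    // Combine: sum the counts of equal keys within ONE mapper's output,
    // so fewer pairs have to be shuffled to the reducers.
    static Map<String, Integer> combine(List<String> mapperKeys) {
        Map<String, Integer> local = new TreeMap<>();
        for (String key : mapperKeys) {
            local.merge(key, 1, Integer::sum);
        }
        return local;
    }

    // Partition: choose the reducer for a key. Masking with Integer.MAX_VALUE
    // keeps the result non-negative even when hashCode() is negative.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        System.out.println(combine(List.of("a", "b", "a"))); // prints {a=2, b=1}
        for (String key : List.of("a", "b", "c")) {
            System.out.println(key + " -> reducer " + partition(key, 3));
        }
    }
}
```

Because the partition depends only on the key, every pair with the same key ends up at the same reducer, regardless of which mapper produced it.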
  52. 52. Combined Hadoop Architecture 07.11.13 Client Master Job JobTracker File NameNode Secondary NameNode Slave Slave Slave TaskTracker TaskTracker TaskTracker Task Task Task DataNode Block DataNode Block Helper DataNode Block
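  53. 53. 07.11.13 Word Count Mapper in Java
      import java.io.IOException;
      import java.util.StringTokenizer;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.MapReduceBase;
      import org.apache.hadoop.mapred.Mapper;
      import org.apache.hadoop.mapred.OutputCollector;
      import org.apache.hadoop.mapred.Reporter;

      public class WordCountMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emits (word, 1) for every token of the input line.
        public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
          }
        }
      }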
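  54. 54. 07.11.13 Word Count Reducer in Java
      import java.io.IOException;
      import java.util.Iterator;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.MapReduceBase;
      import org.apache.hadoop.mapred.OutputCollector;
      import org.apache.hadoop.mapred.Reducer;
      import org.apache.hadoop.mapred.Reporter;

      public class WordCountReducer extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {

        // Sums all counts collected for one word and emits (word, sum).
        public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }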
  55. 55. 07.11.13 Map/Reduce Let’s have a look…
  56. 56. 07.11.13 Agenda • What is Big Data & Hadoop? • Core Hadoop • The Hadoop Ecosystem • Use Cases • What‘s next? Hadoop 2.0!
  57. 57. 07.11.13 Scripting for Hadoop Java for MapReduce? I dunno, dude… I’m more of a scripting guy…
  58. 58. 07.11.13 Scripting for Hadoop
  59. 59. 07.11.13 Apache Pig • High-level data flow language • Made of two components: • Data processing language Pig Latin • Compiler to translate Pig Latin to MapReduce
  60. 60. 07.11.13 Pig in the Hadoop ecosystem Pig Scripting HCatalog Metadata Management MapReduce Distributed Programming Framework HDFS Hadoop Distributed File System
  61. 61. 07.11.13 Pig Latin
      users = LOAD 'users.txt' USING PigStorage(',') AS (name, age);
      pages = LOAD 'pages.txt' USING PigStorage(',') AS (user, url);
      filteredUsers = FILTER users BY age >= 18 and age <= 50;
      joinResult = JOIN filteredUsers BY name, pages by user;
      grouped = GROUP joinResult BY url;
      summed = FOREACH grouped GENERATE group, COUNT(joinResult) as clicks;
      sorted = ORDER summed BY clicks desc;
      top10 = LIMIT sorted 10;
      STORE top10 INTO 'top10sites';
  62. 62. 07.11.13 Pig Execution Plan
  63. 63. 07.11.13 Try that with Java…
  64. 64. 07.11.13 Pig Let’s have a look…
  65. 65. 07.11.13 SQL for Hadoop OK, Pig seems quite useful… But I’m more of a SQL person…
  66. 66. 07.11.13 SQL for Hadoop
  67. 67. 07.11.13 Apache Hive • Data Warehousing Layer on top of Hadoop • Allows analysis and queries using a SQL-like language
  68. 68. 07.11.13 Hive in the Hadoop ecosystem Pig Hive Scripting Query HCatalog Metadata Management MapReduce Distributed Programming Framework HDFS Hadoop Distributed File System
  69. 69. 07.11.13 Hive Architecture Hive Shell Hive Metastore Hive Server Hive Engine Hive Thrift Driver Thrift Applications Hive JDBC Driver JDBC Applications Hive ODBC Driver ODBC Applications MapReduce HDFS
  70. 70. 07.11.13 Hive Example
      CREATE TABLE users(name STRING, age INT);
      CREATE TABLE pages(user STRING, url STRING);
      LOAD DATA INPATH '/user/sandbox/users.txt' INTO TABLE users;
      LOAD DATA INPATH '/user/sandbox/pages.txt' INTO TABLE pages;
      SELECT pages.url, count(*) AS clicks
      FROM users JOIN pages ON (users.name = pages.user)
      WHERE users.age >= 18 AND users.age <= 50
      GROUP BY pages.url
      ORDER BY clicks DESC
      LIMIT 10;
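For intuition, the data flow of this query (filter users by age, join with pages, group by URL, count, sort descending, take the top entries) can be written against small in-memory collections in plain Java. This is a sketch, not how Hive or Pig execute the query: the class names are hypothetical, user names are assumed to be unique so that a set lookup behaves like the join, and ties are broken by URL only to keep the output deterministic.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// In-memory sketch of the "top sites" query. Hypothetical names; assumes unique user names.
final class TopSitesSketch {

    static final class User {
        final String name;
        final int age;
        User(String name, int age) { this.name = name; this.age = age; }
    }

    static final class Page {
        final String user;
        final String url;
        Page(String user, String url) { this.user = user; this.url = url; }
    }

    static List<Map.Entry<String, Long>> topSites(List<User> users, List<Page> pages, int limit) {
        // WHERE age >= 18 AND age <= 50
        Set<String> eligible = users.stream()
                .filter(u -> u.age >= 18 && u.age <= 50)
                .map(u -> u.name)
                .collect(Collectors.toSet());

        // JOIN on user name, then GROUP BY url with COUNT(*)
        Map<String, Long> clicks = pages.stream()
                .filter(p -> eligible.contains(p.user))
                .collect(Collectors.groupingBy(p -> p.url, Collectors.counting()));

        // ORDER BY clicks DESC (ties by url), then LIMIT
        return clicks.entrySet().stream()
                .sorted((a, b) -> {
                    int byClicks = Long.compare(b.getValue(), a.getValue());
                    return byClicks != 0 ? byClicks : a.getKey().compareTo(b.getKey());
                })
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<User> users = List.of(new User("alice", 25), new User("bob", 17), new User("carol", 40));
        List<Page> pages = List.of(
                new Page("alice", "x.com"), new Page("bob", "x.com"), new Page("carol", "x.com"),
                new Page("alice", "y.com"), new Page("carol", "z.com"));
        System.out.println(topSites(users, pages, 2)); // prints [x.com=2, y.com=1]
    }
}
```

The in-memory version only works while the data fits on one machine, which is exactly the limit that the cluster-based tools are meant to remove.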
  71. 71. 07.11.13 Hive Let’s have a look…
  72. 72. 07.11.13 But wait, there’s still more! More components of the Hadoop Ecosystem
  73. 73. 07.11.13 Mahout: Machine Learning • Pig: Scripting • Hive: SQL-like queries • HDFS: Data storage • Sqoop: Import & Export of relational data • Flume: Import & Export of data flows • Oozie: Workflow automation • MapReduce: Data processing • Ambari: Cluster installation & management • ZooKeeper: Cluster Coordination • HBase: NoSQL Database • HCatalog: Metadata Management
  74. 74. 07.11.13 Agenda • What is Big Data & Hadoop? • Core Hadoop • The Hadoop Ecosystem • Use Cases • What‘s next? Hadoop 2.0!
  75. 75. Classical enterprise platform 07.11.13 Applications: Business Intelligence, Business Applications, Custom Applications • Dev Tools: Build & Test • Data Systems: Traditional Systems (RDBMS, EDW, MPP, …) • Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …) • Operation: Manage & Monitor
  76. 76. Big Data Platform 07.11.13 Applications: Business Intelligence, Business Applications, Custom Applications • Dev Tools: Build & Test • Data Systems: Traditional Systems (RDBMS, EDW, MPP, …), Enterprise Hadoop Platform • Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …), New Sources (Logs, Mails, Social Media, Sensor, …) • Operation: Manage & Monitor
  77. 77. Pattern #1: Refine data 07.11.13 Applications: Business Intelligence, Business Applications, Custom Applications • Data Systems: Traditional Systems (RDBMS, EDW, MPP, …), Enterprise Hadoop Platform • Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …), New Sources (Logs, Mails, Social Media, Sensor, …) • Steps: 1. Capture all data 2. Process the data 3. Exchange using traditional systems 4. Process & Visualize with traditional applications
  78. 78. Pattern #2: Explore data 07.11.13 Applications: Business Intelligence, Business Applications, Custom Applications • Data Systems: Traditional Systems (RDBMS, EDW, MPP, …), Enterprise Hadoop Platform • Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …), New Sources (Logs, Mails, Social Media, Sensor, …) • Steps: 1. Capture all data 2. Process the data 3. Explore the data using applications with support for Hadoop
  79. 79. Pattern #3: Enrich data 07.11.13 Applications: Business Applications, Custom Applications • Data Systems: Traditional Systems (RDBMS, EDW, MPP, …), Enterprise Hadoop Platform • Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …), New Sources (Logs, Mails, Social Media, Sensor, …) • Steps: 1. Capture all data 2. Process the data 3. Directly ingest the data
  80. 80. 07.11.13 Bringing it all together… One example…
  81. 81. 07.11.13 Digital Advertising • 6 billion ad deliveries per day • Reports (and bills) for the advertising companies needed • Own C++ solution did not scale • Adding functions was a nightmare
  82. 82. AdServing Architecture FFM AdServer AdServer 07.11.13 Hadoop Cluster Synchronisation Campaign Database Campaign Data AMS Binary Log Format TCP Interface TCP Interface Custom Flume Source Custom Flume Source Pig Report Engine Hive Temporary data Aggregated data NAS Local files Start Job Scheduler Flume HDFS Sink Config UI Job Config XML Direct Download
  83. 83. 07.11.13 What’s next? Hadoop 2.0 aka YARN
  84. 84. Hadoop 1.0 07.11.13 Built for web-scale batch apps Single App Single App Batch Batch Single App Single App Single App Batch Batch Batch HDFS HDFS HDFS
  85. 85. 07.11.13 MapReduce is good for… • Embarrassingly parallel algorithms • Summing, grouping, filtering, joining • Off-line batch jobs on massive data sets • Analyzing an entire large dataset
  86. 86. 07.11.13 MapReduce is OK for… • Iterative jobs (e.g., graph algorithms) – Each iteration must read/write data to disk – I/O and latency cost of an iteration is high
  87. 87. 07.11.13 MapReduce is not good for… • Jobs that need shared state/coordination – Tasks are shared-nothing – Shared-state requires scalable state store • Low-latency jobs • Jobs on small datasets • Finding individual records
  88. 88. 07.11.13 MapReduce limitations • Scalability – Maximum cluster size ~ 4,500 nodes – Maximum concurrent tasks: 40,000 – Coarse synchronization in JobTracker • Availability – Failure kills all queued and running jobs • Hard partition of resources into map & reduce slots – Low resource utilization • Lacks support for alternate paradigms and services – Iterative applications implemented using MapReduce are 10x slower
  89. 89. 07.11.13 Hadoop 2.0: Next-gen platform Single use system Batch Apps Hadoop 1.0 MapReduce Cluster resource mgmt. + data processing HDFS Redundant, reliable storage Multi-purpose platform Batch, Interactive, Streaming, … Hadoop 2.0 MapReduce Others Data processing Data processing YARN Cluster resource management HDFS 2.0 Redundant, reliable storage
  90. 90. Taking Hadoop beyond batch 07.11.13 Store all data in one place • Interact with data in multiple ways • Applications run natively in Hadoop: Batch (MapReduce), Interactive (Tez), Online (HOYA), Streaming (Storm, …), Graph (Giraph), In-Memory (Spark), Other (Search, …) • YARN: Cluster resource management • HDFS 2.0: Redundant, reliable storage
  91. 91. 07.11.13 A brief history of Hadoop 2.0 • Originally conceived & architected by the team at Yahoo! – Arun Murthy created the original JIRA in 2008 and now is the YARN release manager • The team at Hortonworks has been working on YARN for 4 years: – 90% of code from Hortonworks & Yahoo! • Hadoop 2.0 based architecture running at scale at Yahoo! – Deployed on 35,000 nodes for 6+ months
  92. 92. 07.11.13 Hadoop 2.0 Projects • YARN • HDFS Federation aka HDFS 2.0 • Stinger & Tez aka Hive 2.0
  93. 93. 07.11.13 Hadoop 2.0 Projects • YARN • HDFS Federation aka HDFS 2.0 • Stinger & Tez aka Hive 2.0
  94. 94. 07.11.13 YARN: Architecture Split up the two major functions of the JobTracker Cluster resource management & Application life-cycle management ResourceManager Scheduler NodeManager NodeManager AM 1 NodeManager Container 1.1 NodeManager Container 2.1 Container 2.3 NodeManager NodeManager NodeManager NodeManager Container 1.2 AM 2 Container 2.2
  95. 95. 07.11.13 YARN: Architecture • Resource Manager – Global resource scheduler – Hierarchical queues • Node Manager – Per-machine agent – Manages the life-cycle of container – Container resource monitoring • Application Master – Per-application – Manages application scheduling and task execution – e.g. MapReduce Application Master
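The division of labour on this slide can be sketched as a toy scheduler: NodeManagers register their capacity, an Application Master asks for containers, and the global scheduler hands out capacity wherever it is free. Everything here is a simplifying assumption: the names are hypothetical, memory is the only resource modelled, queues are left out, and first-fit is just one possible allocation policy.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Toy model of a global resource scheduler. Hypothetical names, not the YARN API.
final class YarnSketch {

    // Free memory (MB) per NodeManager, in registration order.
    private final Map<String, Integer> freeMemoryMb = new LinkedHashMap<>();

    // A NodeManager announces its capacity to the scheduler.
    void registerNode(String nodeManager, int memoryMb) {
        freeMemoryMb.put(nodeManager, memoryMb);
    }

    // An Application Master requests one container of the given size.
    // First-fit: the first node with enough free memory hosts the container.
    Optional<String> allocate(int memoryMb) {
        for (Map.Entry<String, Integer> node : freeMemoryMb.entrySet()) {
            if (node.getValue() >= memoryMb) {
                node.setValue(node.getValue() - memoryMb);
                return Optional.of(node.getKey());
            }
        }
        return Optional.empty();
    }

    // A finished container returns its memory to the node.
    void release(String nodeManager, int memoryMb) {
        freeMemoryMb.merge(nodeManager, memoryMb, Integer::sum);
    }

    public static void main(String[] args) {
        YarnSketch scheduler = new YarnSketch();
        scheduler.registerNode("nm1", 4096);
        scheduler.registerNode("nm2", 2048);
        System.out.println(scheduler.allocate(3072)); // prints Optional[nm1]
        System.out.println(scheduler.allocate(2048)); // prints Optional[nm2]
        System.out.println(scheduler.allocate(2048)); // prints Optional.empty
    }
}
```

Unlike fixed map and reduce slots, a container in this model is just an amount of resources, so any kind of application can use whatever capacity is free.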
  96. 96. 07.11.13 YARN: Architecture ResourceManager Scheduler NodeManager NodeManager NodeManager NodeManager MapReduce 1 map 1.1 reduce 2.2 map 2.1 Region server 2 reduce 2.1 nimbus 1 vertex 3 NodeManager NodeManager NodeManager NodeManager HBase Master map 1.2 MapReduce 2 map 2.2 nimbus 2 Region server 1 vertex 4 vertex 2 NodeManager NodeManager NodeManager NodeManager HOYA reduce 1.1 Tez map 2.3 vertex 1 Region server 3 Storm
  97. 97. 07.11.13 Hadoop 2.0 Projects • YARN • HDFS Federation aka HDFS 2.0 • Stinger & Tez aka Hive 2.0
  98. 98. 07.11.13 HDFS Federation • Removes tight coupling of Block Storage and Namespace • Scalability & Isolation • High Availability • Increased performance Details: https://issues.apache.org/jira/browse/HDFS-1052
  99. 99. HDFS Federation: Architecture 07.11.13 NameNodes do not talk to each other NameNode 1 NameNode 2 Namespace 1 logs finance Block Management 1 1 2 DataNode 1 3 4 DataNode 2 Namespace 2 insights reports Block Management 2 5 6 DataNode 3 Each NameNode manages only a slice of the namespace 7 8 DataNode 4 DataNodes can store blocks managed by any NameNode
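Since each NameNode owns only a slice of the namespace, something has to decide which NameNode to ask for a given path. A minimal sketch of that lookup, using the directory names from the slide, could look as follows. The class is hypothetical, and routing by the first path component is an assumption made for illustration, not a description of the actual HDFS client.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of routing paths to independent NameNodes. Hypothetical names, not the HDFS API.
final class FederationSketch {

    // Top-level directory -> NameNode that manages that slice of the namespace.
    private final Map<String, String> mountTable = new TreeMap<>();

    void mount(String topLevelDir, String nameNode) {
        mountTable.put(topLevelDir, nameNode);
    }

    // Resolves the responsible NameNode from the first component of an absolute path,
    // e.g. "/logs/2013/app.log" splits into ["", "logs", "2013", "app.log"].
    String resolve(String path) {
        String[] parts = path.split("/");
        if (parts.length < 2) {
            throw new IllegalArgumentException("path is not under a mounted directory: " + path);
        }
        String nameNode = mountTable.get(parts[1]);
        if (nameNode == null) {
            throw new IllegalArgumentException("no NameNode mounted for: " + path);
        }
        return nameNode;
    }

    public static void main(String[] args) {
        FederationSketch client = new FederationSketch();
        client.mount("logs", "NameNode 1");
        client.mount("finance", "NameNode 1");
        client.mount("insights", "NameNode 2");
        client.mount("reports", "NameNode 2");
        System.out.println(client.resolve("/logs/2013/app.log")); // prints NameNode 1
        System.out.println(client.resolve("/reports/q3.csv"));    // prints NameNode 2
    }
}
```

The NameNodes never need to coordinate in this picture, which matches the note on the slide that they do not talk to each other.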
  100. 100. 07.11.13 Only the active writes edits HDFS: Quorum based storage Journal Node Journal Node Active NameNode Block Map DataNode Edits File DataNode Journal Node Standby NameNode Block Map DataNode Edits File DataNode The state is shared on a quorum of journal nodes The Standby simultaneously reads and applies the edits DataNode DataNodes report to both NameNodes but listen only to the orders from the active one
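The idea of sharing state on a quorum of journal nodes comes down to simple majority arithmetic: an edit counts as durable once more than half of the journal nodes have acknowledged it. The sketch below shows only that arithmetic. The names are hypothetical, and the node counts (3 and 5) are example values, not figures from the slides.

```java
// Majority arithmetic behind quorum-based edit logging. Hypothetical names.
final class QuorumSketch {

    // Smallest number of journal nodes that forms a majority.
    static int majority(int journalNodes) {
        return journalNodes / 2 + 1;
    }

    // An edit is committed once a majority of journal nodes acknowledged it.
    static boolean isCommitted(boolean[] acks) {
        int acked = 0;
        for (boolean ack : acks) {
            if (ack) {
                acked++;
            }
        }
        return acked >= majority(acks.length);
    }

    // Number of journal nodes that may fail while writes can still reach a majority.
    static int toleratedFailures(int journalNodes) {
        return journalNodes - majority(journalNodes);
    }

    public static void main(String[] args) {
        System.out.println(isCommitted(new boolean[] {true, true, false}));  // prints true
        System.out.println(isCommitted(new boolean[] {true, false, false})); // prints false
        System.out.println(toleratedFailures(3)); // prints 1
        System.out.println(toleratedFailures(5)); // prints 2
    }
}
```

This is why journal nodes are typically deployed in odd numbers: going from three to four nodes raises the majority to three without tolerating any additional failure.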
  101. 101. 07.11.13 Hadoop 2.0 Projects • YARN • HDFS Federation aka HDFS 2.0 • Stinger & Tez aka Hive 2.0
  102. 102. 07.11.13 Hive: Current Focus Area • Real-Time (0-5s): Online systems, R-T analytics, CEP • Interactive (5s – 1m): Parameterized Reports, Drilldown, Visualization, Exploration • Non-Interactive (1m – 1h): Data preparation, Incremental batch processing, Dashboards / Scorecards • Batch (1h+): Operational batch processing, Enterprise Reports, Data Mining • Current Hive Sweet Spot • Data Size
  103. 103. 07.11.13 Stinger: Extending the sweet spot • Real-Time (0-5s): Online systems, R-T analytics, CEP • Interactive (5s – 1m): Parameterized Reports, Drilldown, Visualization, Exploration • Non-Interactive (1m – 1h): Data preparation, Incremental batch processing, Dashboards / Scorecards • Batch (1h+): Operational batch processing, Enterprise Reports, Data Mining • Future Hive Expansion • Data Size • Improve Latency & Throughput: Query engine improvements, New “Optimized RCFile” column store, Next-gen runtime (elim’s M/R latency) • Extend Deep Analytical Ability: Analytics functions, Improved SQL coverage, Continued focus on core Hive use cases
  104. 104. 07.11.13 Stinger Initiative at a glance
  105. 105. 07.11.13 Tez: The Execution Engine • Low level data-processing execution engine • Use it for the base of MapReduce, Hive, Pig, etc. • Enables pipelining of jobs • Removes task and job launch times • Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline • Does not write intermediate output to HDFS – Much lighter disk and network usage • Built on YARN
  106. 106. Pig/Hive MR vs. Pig/Hive Tez 07.11.13 SELECT a.state, COUNT(*), AVG(c.price) FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemId = c.itemId) GROUP BY a.state; Pig/Hive - MR: Job 1, I/O Synchronization Barrier, Job 2, I/O Synchronization Barrier, Job 3; Pig/Hive - Tez: Single Job
  107. 107. 07.11.13 Tez Service • MapReduce query startup is expensive: – Job-launch & task-launch latencies are fatal for short queries (on the order of 5s to 30s) • Solution: – Tez Service (= preallocated Application Master) • Removes job-launch overhead (Application Master) • Removes task-launch overhead (pre-warmed containers) – Hive/Pig • Submit query plan to Tez Service – Native Hadoop service, not ad-hoc
  108. 108. 07.11.13 Tez: Low latency SELECT a.state, COUNT(*), AVG(c.price) FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemId = c.itemId) GROUP BY a.state
    Existing Hive: Parse Query 0.5s, Create Plan 0.5s, Launch MapReduce 20s, Process MapReduce 10s, Total 31s
    Hive/Tez: Parse Query 0.5s, Create Plan 0.5s, Launch MapReduce 20s, Process MapReduce 2s, Total 23s
    Tez & Tez Service: Parse Query 0.5s, Create Plan 0.5s, Submit to Tez Service 0.5s, Process Map-Reduce 2s, Total 3.5s
    * No exact numbers, for illustration only
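The totals on the Tez latency slide are simply the sums of the per-stage times. A small sketch, using the slide's illustrative numbers (the dictionary layout and stage names are my own), makes it easy to see that the MapReduce launch dominates short queries:

```python
# Stage latencies in seconds, copied from the slide (illustrative numbers only).
stages = {
    "Existing Hive": {
        "parse query": 0.5,
        "create plan": 0.5,
        "launch MapReduce": 20.0,
        "process MapReduce": 10.0,
    },
    "Hive/Tez": {
        "parse query": 0.5,
        "create plan": 0.5,
        "launch MapReduce": 20.0,
        "process MapReduce": 2.0,
    },
    "Tez & Tez Service": {
        "parse query": 0.5,
        "create plan": 0.5,
        "submit to Tez Service": 0.5,
        "process MapReduce": 2.0,
    },
}

# End-to-end latency per setup is the sum of its stages.
totals = {name: sum(times.values()) for name, times in stages.items()}

for name, total in totals.items():
    print(f"{name}: {total:.1f}s")
```

Replacing the 20s launch step with a 0.5s submission to an already running service accounts for almost all of the difference between the second and third setup.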
  109. 109. 07.11.13 Stinger: Summary * Real numbers, but handle with care!
  110. 110. 07.11.13 Hadoop 2.0 Applications • MapReduce 2.0 • HOYA - HBase on YARN • Storm, Spark, Apache S4 • Hamster (MPI on Hadoop) • Apache Giraph • Apache Hama • Distributed Shell • Tez
  112. 112. 07.11.13 MapReduce 2.0 • Basically a port to the YARN architecture • MapReduce becomes a user-land library • No need to rewrite MapReduce jobs • Increased scalability & availability • Better cluster utilization
  114. 114. 07.11.13 HOYA: HBase on YARN • Create on-demand HBase clusters • Configure different HBase instances differently • Better isolation • Create (transient) HBase clusters from MapReduce jobs • Elasticity of clusters for analytic / batch workload processing • Better utilization of cluster resources
  116. 116. 07.11.13 Twitter Storm • Stream processing • Real-time processing • Developed as a standalone application • https://github.com/nathanmarz/storm • Ported to YARN • https://github.com/yahoo/storm-yarn
  117. 117. 07.11.13 Storm: Conceptual view • Spout: Source of streams • Bolt: Consumer of streams, processes tuples, possibly emits new tuples • Stream: Unbounded sequence of tuples • Tuple: List of name-value pairs • Topology: Network of Spouts & Bolts as the nodes and streams as the edges
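The concepts on the Storm slide map naturally onto chained generators. The sketch below is plain Python, not the Storm API; `sentence_spout`, `split_bolt` and `count_bolt` are hypothetical names, and a short finite list stands in for an unbounded stream.

```python
from collections import Counter


def sentence_spout():
    """Spout: the source of a stream (a finite list stands in for an unbounded one)."""
    for sentence in ["big data", "big cluster"]:
        # A tuple is modelled as a dict of name-value pairs.
        yield {"sentence": sentence}


def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one new tuple per word."""
    for tup in stream:
        for word in tup["sentence"].split():
            yield {"word": word}


def count_bolt(stream):
    """Bolt: keeps a running count per word and emits the updated count."""
    counts = Counter()
    for tup in stream:
        counts[tup["word"]] += 1
        yield {"word": tup["word"], "count": counts[tup["word"]]}


# Topology: spout -> split bolt -> count bolt, with the streams as the edges.
topology = count_bolt(split_bolt(sentence_spout()))
results = list(topology)
```

Each tuple flows through the whole chain as soon as it is emitted, which is the difference from a batch job that waits for the complete input first.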
  119. 119. 07.11.13 Spark • High-speed in-memory analytics over Hadoop and Hive • Separate MapReduce-like engine – Speedup of up to 100x – On-disk queries 5-10x faster • Compatible with Hadoop's Storage API • Available as standalone application – https://github.com/mesos/spark • Experimental support for YARN since 0.6 – http://spark.incubator.apache.org/docs/0.6.0/running-on-yarn.html
  120. 120. 07.11.13 Data Sharing in Spark
  122. 122. 07.11.13 Apache Giraph • Giraph is a framework for processing semistructured graph data on a massive scale. • Giraph is loosely based upon Google's Pregel • Giraph performs iterative calculations on top of an existing Hadoop cluster. • Available on GitHub – https://github.com/apache/giraph
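The Pregel-style iterative model that Giraph builds on can be sketched as a loop of supersteps in which vertices exchange messages with their neighbours. This is a single-process toy, not the Giraph API; `pregel_max` is a hypothetical helper that propagates the largest vertex value through the graph.

```python
def pregel_max(graph, initial_values):
    """Propagate the maximum value through the graph in Pregel-style supersteps.

    graph: vertex -> list of neighbouring vertices (every neighbour must be a key).
    initial_values: vertex -> starting value.
    """
    values = dict(initial_values)

    # Superstep 0: every vertex sends its value to all of its neighbours.
    inbox = {v: [] for v in graph}
    for vertex, neighbours in graph.items():
        for n in neighbours:
            inbox[n].append(values[vertex])
    supersteps = 1

    # Keep iterating while messages are in flight; a vertex that learns
    # nothing new stays silent, which plays the role of voting to halt.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for vertex, messages in inbox.items():
            if messages and max(messages) > values[vertex]:
                values[vertex] = max(messages)
                for n in graph[vertex]:
                    outbox[n].append(values[vertex])
        inbox = outbox
        supersteps += 1

    return values, supersteps


# A small chain a - b - c; the largest value starts at vertex c.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
values, supersteps = pregel_max(graph, {"a": 3, "b": 1, "c": 6})
```

Messages sent in one superstep are only visible in the next one, so every vertex can be processed independently within a superstep, and that independence is what allows the computation to be spread across a cluster.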
  123. 123. 07.11.13 Hadoop 2.0 Summary 1. Scale 2. New programming models & Services 3. Improved cluster utilization 4. Agility 5. Beyond Java
  124. 124. 07.11.13 Getting started… One more thing…
  125. 125. 07.11.13 Hortonworks Sandbox http://hortonworks.com/products/hortonworks-sandbox
  126. 126. 07.11.13 Books about Hadoop 1. Hadoop: The Definitive Guide, Tom White, 3rd ed., O’Reilly, 2012 2. Hadoop in Action, Chuck Lam, Manning, 2011 3. Programming Pig, Alan Gates, O’Reilly, 2011 4. Hadoop Operations, Eric Sammer, O’Reilly, 2012
  127. 127. 07.11.13 The end…or the beginning?