Novi Sad
05.09.2013
Introduction to the
Hadoop Ecosystem
uweseiler

About me
Big Data Nerd
Travelpirate
Photography Enthusiast
Hadoop Trainer
MongoDB Author

About us
is a bunch of…
Big Data Nerds Agile Ninjas Continuous Delivery Gurus
Enterprise Java Specialists Performance Geeks
Join us!

Agenda
• What is Big Data & Hadoop?
• Core Hadoop
• The Hadoop Ecosystem
• Use Cases
• What’s next? Hadoop 2.0!

Let’s face it…
Big Data is the latest …

But on the other hand…
Data is an…

Think about it…
• Advertisements
• Recommendations
• Anything that sells

The 3 V’s of Big Data
Volume, Variety, Velocity

My favorite definition

Why Hadoop?
Traditional data stores are expensive to scale
and by design difficult to distribute.
Scale out is the way to go!
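
How to scale data?
[Diagram: “Data” is partitioned into pieces of work (w), each piece is handled by a worker that produces a partial result (r), and the partial results are combined into the “Result”.]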

But…
• Parallel processing is complicated!
• Data storage is not trivial!

What is Hadoop?
• Distributed Storage and Computation Framework
• «Big Data» != Hadoop
• Hadoop != Database
• “Swiss army knife of the 21st century”
http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop

The Hadoop App Store
HDFS MapRed HCat Pig Hive HBase Ambari Avro Cassandra Chukwa Intel Sync
Flume Hana HyperT Impala Mahout Nutch Oozie Sqoop
Scribe Tez Vertica Whirr ZooKee Horton Cloudera MapR EMC
IBM Talend TeraData Pivotal Informat Microsoft Pentaho Jasper
Kognitio Tableau Splunk Platfora Rack Karma Actuate MicStrat

The Hadoop App Store
Functionality: less → more
Apache Hadoop
• HDFS
• MapReduce
• Hadoop Ecosystem
• Hadoop YARN
Hadoop Distributions (Apache Hadoop +)
• Test & Packaging
• Installation
• Monitoring
• Business Support
Big Data Suites (Hadoop Distributions +)
• Integrated Environment
• Visualization
• (Near-)Realtime analysis
• Modeling
• ETL & Connectors

Core Hadoop

Data Storage
OK, first things first!
I want to store all of my «Big Data»

Hadoop Distributed File System
• Distributed file system for redundant storage
• Designed to reliably store data on commodity hardware
• Built to expect hardware failures
Intended for
• large files
• batch inserts
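
HDFS Architecture
[Diagram: a client stores a file through the NameNode (master), which keeps the block map and the journal log. The file's blocks (#1, #2) are held by DataNodes (slaves) in Rack 1 and Rack 2, with block #1 replicated on several DataNodes. A Secondary NameNode (helper) performs periodic merges.]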
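The roles in the HDFS architecture (a NameNode that keeps the block map, DataNodes that hold replicated blocks) can be mimicked in a few lines of plain Java. This is a minimal sketch without any Hadoop dependency: the class `ToyNameNode`, its methods, the block size, the replication factor and the round-robin placement are all assumptions made for illustration. Real HDFS placement is rack-aware and considerably more involved.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of a NameNode block map. Names and the placement policy are
// illustrative assumptions, not the real HDFS implementation.
class ToyNameNode {
    private final List<String> dataNodes;
    private final int replication;
    // Block map: block id -> DataNodes holding a replica of that block.
    private final Map<String, List<String>> blockMap = new LinkedHashMap<>();
    private int nextNode = 0;

    ToyNameNode(List<String> dataNodes, int replication) {
        if (replication > dataNodes.size()) {
            throw new IllegalArgumentException("not enough DataNodes for replication");
        }
        this.dataNodes = dataNodes;
        this.replication = replication;
    }

    // Splits a file of the given size into fixed-size blocks and assigns
    // each block to `replication` distinct DataNodes (simple round-robin).
    List<String> addFile(String name, long fileSize, long blockSize) {
        List<String> blockIds = new ArrayList<>();
        long blocks = (fileSize + blockSize - 1) / blockSize; // ceiling division
        for (long b = 0; b < blocks; b++) {
            String blockId = name + "#" + (b + 1);
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                replicas.add(dataNodes.get((nextNode + r) % dataNodes.size()));
            }
            nextNode = (nextNode + 1) % dataNodes.size();
            blockMap.put(blockId, replicas);
            blockIds.add(blockId);
        }
        return blockIds;
    }

    // Looks up which DataNodes hold the given block.
    List<String> locate(String blockId) {
        return blockMap.get(blockId);
    }

    public static void main(String[] args) {
        List<String> nodes = new ArrayList<>();
        nodes.add("dn1");
        nodes.add("dn2");
        nodes.add("dn3");
        ToyNameNode nn = new ToyNameNode(nodes, 2);
        // A 150 MB file with 64 MB blocks needs 3 blocks.
        for (String id : nn.addFile("file", 150L << 20, 64L << 20)) {
            System.out.println(id + " -> " + nn.locate(id));
        }
    }
}
```

The point of the sketch is the separation of concerns: the client only asks the NameNode where blocks live, while the data itself would be read from and written to the DataNodes.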

Data Processing
Data stored, check!
Now I want to create insights from my data!

MapReduce
• Programming model for distributed computations at a massive scale
• Execution framework for organizing and performing such computations
• Data locality is king

Typical large-data problem
• Iterate over a large number of records (Map)
• Extract something of interest from each (Map)
• Shuffle and sort intermediate results
• Aggregate intermediate results (Reduce)
• Generate final output (Reduce)

MapReduce Flow
• Map (output of four map tasks): a 1, b 2 | c 3, c 6 | a 3, c 2 | b 7, c 8
• Combine (per map task): a 1, b 2 | c 9 | a 3, c 2 | b 7, c 8
• Partition: each key is assigned to one reduce task
• Shuffle and Sort: a → 1 3 | b → 2 7 | c → 2 8 9
• Reduce: a 4 | b 9 | c 19
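The flow can be replayed in memory to see where each number comes from. The following is a minimal sketch in plain Java, not the Hadoop API; the class and method names are hypothetical, and the input pairs are chosen to be consistent with the reduce totals on the slide (a 4, b 9, c 19). Partitioning is left out because the sketch runs a single reduce step over all keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory walk through map output -> combine -> shuffle/sort -> reduce.
// Plain Java for illustration only; this is not the Hadoop API.
class MapReduceFlowSketch {

    // Combine: pre-aggregate the values per key within ONE mapper's output.
    static Map<String, Integer> combine(String[][] mapOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (String[] kv : mapOutput) {
            combined.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
        }
        return combined;
    }

    // Shuffle and sort: group the values of all mappers by key, keys sorted.
    static Map<String, List<Integer>> shuffle(List<Map<String, Integer>> combinedOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> output : combinedOutputs) {
            for (Map.Entry<String, Integer> e : output.entrySet()) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        return grouped;
    }

    // Reduce: sum all values that arrived for one key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            result.put(e.getKey(), sum);
        }
        return result;
    }

    static Map<String, Integer> run() {
        // Output of the four map tasks.
        String[][][] mapOutputs = {
            {{"a", "1"}, {"b", "2"}},
            {{"c", "3"}, {"c", "6"}},
            {{"a", "3"}, {"c", "2"}},
            {{"b", "7"}, {"c", "8"}}
        };
        List<Map<String, Integer>> combined = new ArrayList<>();
        for (String[][] mapOutput : mapOutputs) {
            combined.add(combine(mapOutput));
        }
        return reduce(shuffle(combined));
    }

    public static void main(String[] args) {
        System.out.println(run()); // {a=4, b=9, c=19}
    }
}
```

Note that the combiner only shrinks the second mapper's output (c 3 and c 6 become c 9); the final totals are the same with or without it, which is why a combiner is an optimization rather than a required stage.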

Combined Hadoop Architecture
[Diagram: the client hands files to the NameNode and jobs to the JobTracker, both on the master; a Secondary NameNode acts as helper. Each of the three slaves runs a DataNode holding blocks and a TaskTracker running tasks.]
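
Word Count Mapper in Java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable>
{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emits (word, 1) for every whitespace-separated token of the input line.
    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
    {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens())
        {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}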
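
Word Count Reducer in Java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable>
{
    // Sums all counts received for one word and emits (word, sum).
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
    {
        int sum = 0;
        while (values.hasNext())
        {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}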

The Hadoop Ecosystem

Scripting for Hadoop
Java for MapReduce? I dunno, dude…
I’m more of a scripting guy…

Apache Pig
• High-level data flow language
• Made of two components:
  • Data processing language Pig Latin
  • Compiler to translate Pig Latin to MapReduce

Pig in the Hadoop ecosystem
• Pig: Scripting
• HCatalog: Metadata Management
• MapReduce: Distributed Programming Framework
• HDFS: Hadoop Distributed File System

Pig Latin
users = LOAD 'users.txt' USING PigStorage(',') AS (name, age);
pages = LOAD 'pages.txt' USING PigStorage(',') AS (user, url);
filteredUsers = FILTER users BY age >= 18 AND age <= 50;
joinResult = JOIN filteredUsers BY name, pages BY user;
grouped = GROUP joinResult BY url;
summed = FOREACH grouped GENERATE group, COUNT(joinResult) AS clicks;
sorted = ORDER summed BY clicks DESC;
top10 = LIMIT sorted 10;
STORE top10 INTO 'top10sites';
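To make the semantics of the script concrete, here is what it computes, written as a small in-memory Java program. This is only a sketch of the result, not of how Pig executes it: there is no MapReduce involved, the class `TopSitesSketch` and its method are hypothetical, and user names are assumed to be unique (a real join would multiply rows for duplicate names).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// In-memory equivalent of the Pig Latin script, for illustration only.
class TopSitesSketch {
    // users: rows of {name, age}; pages: rows of {user, url}.
    // Assumption: user names are unique.
    static Map<String, Long> topSites(List<String[]> users, List<String[]> pages, int limit) {
        // FILTER users BY age >= 18 AND age <= 50
        Set<String> filteredUsers = new HashSet<>();
        for (String[] u : users) {
            int age = Integer.parseInt(u[1].trim());
            if (age >= 18 && age <= 50) {
                filteredUsers.add(u[0].trim());
            }
        }
        // JOIN filteredUsers BY name, pages BY user; GROUP BY url; COUNT
        Map<String, Long> clicks = pages.stream()
            .filter(p -> filteredUsers.contains(p[0].trim()))
            .collect(Collectors.groupingBy(p -> p[1].trim(), Collectors.counting()));
        // ORDER BY clicks DESC; LIMIT
        Map<String, Long> top = new LinkedHashMap<>();
        clicks.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(limit)
            .forEachOrdered(e -> top.put(e.getKey(), e.getValue()));
        return top;
    }

    public static void main(String[] args) {
        List<String[]> users = Arrays.asList(
            new String[] {"amy", "25"}, new String[] {"bob", "17"});
        List<String[]> pages = Arrays.asList(
            new String[] {"amy", "example.org/a"}, new String[] {"bob", "example.org/b"});
        System.out.println(topSites(users, pages, 10));
    }
}
```

The in-memory version is short only because everything fits on one machine. Expressing the same join, grouping and sorting as hand-written MapReduce jobs takes several chained jobs, which is the work the Pig compiler takes over.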

Pig Execution Plan

Try that with Java…

SQL for Hadoop
OK, Pig seems quite useful…
But I’m more of a SQL person…

Apache Hive
• Data Warehousing Layer on top of Hadoop
• Allows analysis and queries using a SQL-like language

Hive in the Hadoop ecosystem
• Pig: Scripting
• Hive: Query
• HCatalog: Metadata Management
• MapReduce: Distributed Programming Framework
• HDFS: Hadoop Distributed File System

Hive Architecture
[Diagram: Thrift, JDBC and ODBC applications connect through the Hive Thrift, Hive JDBC and Hive ODBC drivers to the Hive Server; the Hive Shell is a further entry point. Inside Hive, the Hive Engine works with the Metastore and runs on MapReduce and HDFS.]

Hive Example
CREATE TABLE users(name STRING, age INT);
CREATE TABLE pages(user STRING, url STRING);
LOAD DATA INPATH '/user/sandbox/users.txt' INTO TABLE users;
LOAD DATA INPATH '/user/sandbox/pages.txt' INTO TABLE pages;
SELECT pages.url, count(*) AS clicks
FROM users JOIN pages ON (users.name = pages.user)
WHERE users.age >= 18 AND users.age <= 50
GROUP BY pages.url
SORT BY clicks DESC
LIMIT 10;

But wait, there’s still more!
More components of the Hadoop Ecosystem

• HDFS: Data storage
• MapReduce: Data processing
• HCatalog: Metadata management
• Pig: Scripting
• Hive: SQL-like queries
• HBase: NoSQL database
• Mahout: Machine learning
• ZooKeeper: Cluster coordination
• Sqoop: Import & export of relational data
• Ambari: Cluster installation & management
• Oozie: Workflow automation
• Flume: Import & export of data flows

Use Cases

Classical enterprise platform
• Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …)
• Data Systems: Traditional Systems (RDBMS, EDW, MPP, …)
• Applications: Business Intelligence, Business Applications, Custom Applications
• Operation: Manage & Monitor
• Dev Tools: Build & Test

Big Data Platform
• Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …) and New Sources (Logs, Mails, Sensor, Social Media, …)
• Data Systems: Traditional Systems (RDBMS, EDW, MPP, …) and the Enterprise Hadoop Platform
• Applications: Business Intelligence, Business Applications, Custom Applications
• Operation: Manage & Monitor
• Dev Tools: Build & Test

Pattern #1: Refine data
[Diagram: traditional and new data sources, traditional systems, the Enterprise Hadoop Platform and the applications, with numbered arrows for the steps below.]
1. Capture all data
2. Process the data
3. Exchange using traditional systems
4. Process & visualize with traditional applications

Pattern #2: Explore data
[Diagram: traditional and new data sources, traditional systems, the Enterprise Hadoop Platform and the applications, with numbered arrows for the steps below.]
1. Capture all data
2. Process the data
3. Explore the data using applications with support for Hadoop

Pattern #3: Enrich data
[Diagram: traditional and new data sources, traditional systems, the Enterprise Hadoop Platform and the business and custom applications, with numbered arrows for the steps below.]
1. Capture all data
2. Process the data
3. Directly ingest the data
05.09.2013 Bringing it all together…
One example…

Digital Advertising
• 6 billion ad deliveries per day
• Reports (and bills) for the advertising companies needed
• Own C++ solution did not scale
• Adding functions was a nightmare

AdServing Architecture
[Diagram. Components shown: AdServers at two sites (FFM, AMS) with TCP interfaces; custom Flume sources; Flume HDFS sink; local files; campaign database and campaign data synchronisation; Hadoop cluster with binary log format, Pig, Hive and temporary data; NAS with aggregated data; report engine; direct download; job scheduler with start trigger; config UI and job config XML.]

What’s next?
Hadoop 2.0 aka YARN

Hadoop 1.0
Built for web-scale batch apps
[Diagram: several separate stacks, each running a single batch app on top of its own HDFS.]

MapReduce is good for…
• Embarrassingly parallel algorithms
• Summing, grouping, filtering, joining
• Off-line batch jobs on massive data sets
• Analyzing an entire large dataset

MapReduce is OK for…
• Iterative jobs (e.g., graph algorithms)
– Each iteration must read/write data to disk
– I/O and latency cost of an iteration is high

MapReduce is not good for…
• Jobs that need shared state/coordination
– Tasks are shared-nothing
– Shared state requires a scalable state store
• Low-latency jobs
• Jobs on small datasets
• Finding individual records

MapReduce limitations
• Scalability
– Maximum cluster size ~ 4,500 nodes
– Maximum concurrent tasks ~ 40,000
– Coarse synchronization in JobTracker
• Availability
– A JobTracker failure kills all queued and running jobs
• Hard partition of resources into map & reduce slots
– Low resource utilization
• Lacks support for alternate paradigms and services
– Iterative applications implemented using MapReduce are 10x slower

Hadoop 2.0: Next-gen platform
Hadoop 1.0: Single use system (Batch Apps)
• MapReduce: Cluster resource mgmt. + data processing
• HDFS: Redundant, reliable storage
Hadoop 2.0: Multi-purpose platform (Batch, Interactive, Streaming, …)
• MapReduce and others: Data processing
• YARN: Cluster resource management
• HDFS 2.0: Redundant, reliable storage

Taking Hadoop beyond batch
Applications run natively in Hadoop
Store all data in one place, interact with data in multiple ways
• Batch: MapReduce
• Interactive: Tez
• Online: HOYA
• Streaming: Storm, …
• Graph: Giraph
• In-Memory: Spark
• Other: Search, …
All of the above run on
• YARN: Cluster resource management
• HDFS 2.0: Redundant, reliable storage

A brief history of Hadoop 2.0
• Originally conceived & architected by the team at Yahoo!
– Arun Murthy created the original JIRA in 2008 and now is the YARN release manager
• The team at Hortonworks has been working on YARN for 4 years:
– 90% of code from Hortonworks & Yahoo!
• Hadoop 2.0 based architecture running at scale at Yahoo!
– Deployed on 35,000 nodes for 6+ months

Hadoop 2.0 Projects
• YARN
• HDFS Federation aka HDFS 2.0
• Stinger & Tez aka Hive 2.0

YARN: Architecture
Split up the two major functions of the JobTracker:
cluster resource management & application life-cycle management
[Diagram: a ResourceManager with its Scheduler above a grid of NodeManagers. Application Master AM 1 runs Containers 1.1 and 1.2, AM 2 runs Containers 2.1, 2.2 and 2.3, spread over the NodeManagers.]

YARN: Architecture
• Resource Manager
– Global resource scheduler
– Hierarchical queues
• Node Manager
– Per-machine agent
– Manages the life-cycle of containers
– Container resource monitoring
• Application Master
– Per-application
– Manages application scheduling and task execution
– e.g. MapReduce Application Master
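The division of labour between the three roles can be sketched in plain Java. This is a toy model under stated assumptions, not the YARN API: the classes `ToyResourceManager` and `ToyApplicationMaster`, the memory-only resource model and the first-fit policy are all made up for illustration (the real scheduler works with hierarchical queues and richer resource requests).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the YARN roles. All names and the first-fit policy are
// illustrative assumptions; this is not the YARN API.
class ToyResourceManager {
    // Free memory (MB) per NodeManager, in registration order.
    private final Map<String, Integer> freeMb = new LinkedHashMap<>();
    // Number of containers granted so far per application.
    private final Map<String, Integer> containerCount = new LinkedHashMap<>();

    void registerNode(String node, int memoryMb) {
        freeMb.put(node, memoryMb);
    }

    // Global scheduling decision: first NodeManager with enough free memory.
    // Returns the container name, or null if the cluster is full.
    String allocate(String appId, int memoryMb) {
        for (Map.Entry<String, Integer> node : freeMb.entrySet()) {
            if (node.getValue() >= memoryMb) {
                node.setValue(node.getValue() - memoryMb);
                int n = containerCount.merge(appId, 1, Integer::sum);
                return appId + "/container-" + n + "@" + node.getKey();
            }
        }
        return null;
    }
}

// Per-application master: asks the ResourceManager for containers and
// would then run its own tasks in them.
class ToyApplicationMaster {
    private final String appId;
    private final ToyResourceManager rm;

    ToyApplicationMaster(String appId, ToyResourceManager rm) {
        this.appId = appId;
        this.rm = rm;
    }

    // Requests one container per task; returns the containers obtained.
    List<String> runTasks(int tasks, int memoryMbPerTask) {
        List<String> containers = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            String container = rm.allocate(appId, memoryMbPerTask);
            if (container != null) {
                containers.add(container);
            }
        }
        return containers;
    }
}
```

The ResourceManager in this sketch knows nothing about map or reduce tasks. It only hands out generic containers, which is what allows frameworks other than MapReduce to share the same cluster.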

YARN: Architecture
[Diagram: the ResourceManager with its Scheduler and a grid of NodeManagers running several frameworks side by side: MapReduce 1 (map 1.1, map 1.2, reduce 1.1), MapReduce 2 (map 2.1 to 2.3, reduce 2.1, reduce 2.2), Tez (vertex 1 to 4), HOYA (HBase Master, Region servers 1 to 3) and Storm (nimbus 1, nimbus 2).]

HDFS Federation
• Removes tight coupling of Block Storage and Namespace
• Scalability & Isolation
• High Availability
• Increased performance
Details: https://issues.apache.org/jira/browse/HDFS-1052

HDFS Federation: Architecture
• NameNodes do not talk to each other
• Each NameNode manages only a slice of the namespace
• DataNodes can store blocks managed by any NameNode
[Diagram: NameNode 1 holds Namespace 1 (logs, finance) and Block Management 1 (blocks 1 to 4); NameNode 2 holds Namespace 2 (insights, reports) and Block Management 2 (blocks 5 to 8); DataNodes 1 to 4 sit below both NameNodes.]
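Because the NameNodes do not talk to each other, the client side has to know which NameNode owns which slice of the namespace. HDFS offers a client-side mount table for this purpose; the sketch below only mimics the idea in plain Java. The class `FederatedNamespaceSketch`, its methods and the directory names `/logs`, `/finance`, `/insights` and `/reports` (taken from the namespace labels in the diagram) are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Client-side view of a federated namespace: each top-level directory is
// served by exactly one NameNode. Names and paths are illustrative.
class FederatedNamespaceSketch {
    private final Map<String, String> mountTable = new LinkedHashMap<>();

    void mount(String pathPrefix, String nameNode) {
        mountTable.put(pathPrefix, nameNode);
    }

    // Returns the NameNode responsible for the path, or null if no slice matches.
    String resolve(String path) {
        for (Map.Entry<String, String> e : mountTable.entrySet()) {
            String prefix = e.getKey();
            if (path.equals(prefix) || path.startsWith(prefix + "/")) {
                return e.getValue();
            }
        }
        return null;
    }

    // Mount table matching the two namespaces of the diagram.
    static FederatedNamespaceSketch slideExample() {
        FederatedNamespaceSketch fs = new FederatedNamespaceSketch();
        fs.mount("/logs", "NameNode 1");
        fs.mount("/finance", "NameNode 1");
        fs.mount("/insights", "NameNode 2");
        fs.mount("/reports", "NameNode 2");
        return fs;
    }

    public static void main(String[] args) {
        FederatedNamespaceSketch fs = slideExample();
        System.out.println(fs.resolve("/logs/2013/09/05.log")); // NameNode 1
        System.out.println(fs.resolve("/reports/q3"));          // NameNode 2
    }
}
```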

HDFS: Quorum based storage
• Only the active NameNode writes edits
• The state is shared on a quorum of journal nodes
• The standby NameNode simultaneously reads and applies the edits
• DataNodes report to both NameNodes but listen only to the orders from the active one
[Diagram: an active and a standby NameNode, each with its block map and edits file, three journal nodes between them and five DataNodes below.]
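The quorum idea can be reduced to a single rule: an edit counts as written once a majority of journal nodes has acknowledged it. The following plain Java sketch shows just that rule. It is not the HDFS JournalNode protocol; the class and field names are hypothetical, and recovery of edits that reached only a minority of nodes is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of quorum-based edit logging. Illustrative only; not the HDFS
// JournalNode protocol.
class QuorumJournalSketch {
    static class JournalNode {
        final List<String> edits = new ArrayList<>();
        boolean up = true;

        // Stores the edit and acknowledges it, unless the node is down.
        boolean append(String edit) {
            if (!up) {
                return false;
            }
            edits.add(edit);
            return true;
        }
    }

    private final List<JournalNode> journalNodes;

    QuorumJournalSketch(List<JournalNode> journalNodes) {
        this.journalNodes = journalNodes;
    }

    // The active NameNode sends the edit to every journal node; the edit
    // counts as durable once a majority has acknowledged it.
    boolean writeEdit(String edit) {
        int acks = 0;
        for (JournalNode jn : journalNodes) {
            if (jn.append(edit)) {
                acks++;
            }
        }
        return acks > journalNodes.size() / 2;
    }
}
```

With three journal nodes, one node may be down without blocking the active NameNode; with two nodes down, writes can no longer reach a majority.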

Hive: Current Focus Area
Workloads by latency (data size grows along the same scale):
• Real-Time (0-5s): Online systems, R-T analytics, CEP
• Interactive (5s – 1m): Parameterized reports, Drilldown, Visualization, Exploration
• Non-Interactive (1m – 1h): Data preparation, Incremental batch processing, Dashboards / Scorecards
• Batch (1h+): Operational batch processing, Enterprise reports, Data mining
[Chart: the current Hive sweet spot is highlighted on this scale.]

Stinger: Extending the sweet spot
[Chart: the same latency scale as on the previous slide (Real-Time, Interactive, Non-Interactive, Batch), with the future Hive expansion highlighted.]
Improve Latency & Throughput
• Query engine improvements
• New “Optimized RCFile” column store
• Next-gen runtime (eliminates M/R latency)
Extend Deep Analytical Ability
• Analytics functions
• Improved SQL coverage
• Continued focus on core Hive use cases

Stinger Initiative at a glance

Tez: The Execution Engine
• Low-level data-processing execution engine
• Serves as the base for MapReduce, Hive, Pig, etc.
• Enables pipelining of jobs
• Removes task and job launch times
• Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline
• Does not write intermediate output to HDFS
– Much lighter disk and network usage
• Built on YARN

Pig/Hive MR vs. Pig/Hive Tez
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
[Diagram: with Pig/Hive on MR the query runs as Job 1, Job 2 and Job 3, separated by I/O synchronization barriers; with Pig/Hive on Tez it runs as a single job.]

Tez Service
• MapReduce query startup is expensive:
– Job-launch & task-launch latencies are fatal for short queries (in the order of 5s to 30s)
• Solution:
– Tez Service (= preallocated Application Master)
  • Removes job-launch overhead (Application Master)
  • Removes task-launch overhead (pre-warmed containers)
– Hive/Pig
  • Submit query plan to Tez Service
– Native Hadoop service, not ad-hoc

Tez: Low latency
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

Step                  Existing Hive   Hive/Tez   Tez & Tez Service
Parse Query           0.5s            0.5s       0.5s
Create Plan           0.5s            0.5s       0.5s
Launch Map-Reduce     20s             20s        0.5s (Submit to Tez Service)
Process Map-Reduce    10s             2s         2s
Total                 31s             23s        3.5s
* No exact numbers, for illustration only
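The totals are simple sums of the per-step latencies, which makes it easy to see that launch overhead, not processing, dominates the first two variants. A small Java sketch of that arithmetic, using the illustrative numbers from the slide (the class and method names are hypothetical):

```java
// Adds up the illustrative per-step latencies of the three variants.
class LatencySketch {
    // Sums the per-step latencies (in seconds) of one query.
    static double total(double... stepsInSeconds) {
        double sum = 0;
        for (double s : stepsInSeconds) {
            sum += s;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Steps: parse query, create plan, launch (or submit), process.
        System.out.println(total(0.5, 0.5, 20, 10)); // Existing Hive: 31.0
        System.out.println(total(0.5, 0.5, 20, 2));  // Hive/Tez: 23.0
        System.out.println(total(0.5, 0.5, 0.5, 2)); // Tez & Tez Service: 3.5
    }
}
```

In the Hive/Tez column, 20 of the 23 seconds are launch time, which is the part the Tez Service removes.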

Stinger: Summary
* Real numbers, but handle with care!

Hadoop 2.0 Applications
• MapReduce 2.0
• HOYA - HBase on YARN
• Storm, Spark, Apache S4
• Hamster (MPI on Hadoop)
• Apache Giraph
• Apache Hama
• Distributed Shell
• Tez

MapReduce 2.0
• Essentially a port to the YARN architecture
• MapReduce becomes a user-land library
• No need to rewrite MapReduce jobs
• Increased scalability & availability
• Better cluster utilization

HOYA: HBase on YARN
• Create on-demand HBase clusters
• Configure different HBase instances differently
• Better isolation
• Create (transient) HBase clusters from MapReduce jobs
• Elasticity of clusters for analytic / batch workload processing
• Better utilization of cluster resources

Twitter Storm
• Stream processing
• Real-time processing
• Developed as standalone application
– https://github.com/nathanmarz/storm
• Ported to YARN
– https://github.com/yahoo/storm-yarn

Storm: Conceptual view
• Spout: Source of streams
• Bolt: Consumer of streams, processing of tuples, possibly emits new tuples
• Tuple: List of name-value pairs
• Stream: Unbounded sequence of tuples
• Topology: Network of spouts & bolts as the nodes and streams as the edges
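The concepts map onto a small pipeline. Below is a minimal sketch in plain Java that wires one spout to two bolts (split sentences, then count words). It is not the Storm API: the `Spout` and `Bolt` interfaces and the class `TopologySketch` are hypothetical, tuples are reduced to plain strings, and the stream is a finite iterator, whereas a real stream is unbounded and processed in parallel.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy spout/bolt pipeline in plain Java. Interface and class names are
// hypothetical; this is not the Storm API.
class TopologySketch {
    // Spout: source of a stream of tuples.
    interface Spout {
        Iterator<String> stream();
    }

    // Bolt: consumes one tuple and possibly emits new tuples.
    interface Bolt<I, O> {
        List<O> process(I tuple);
    }

    static Map<String, Integer> run(Spout spout) {
        // Bolt 1: split a sentence tuple into word tuples.
        Bolt<String, String> split = sentence -> Arrays.asList(sentence.split("\\s+"));
        // Bolt 2: keep a running count per word; emits nothing downstream.
        Map<String, Integer> counts = new TreeMap<>();
        Bolt<String, Void> count = word -> {
            counts.merge(word, 1, Integer::sum);
            return Collections.<Void>emptyList();
        };
        // Wire the topology: spout -> split -> count.
        Iterator<String> tuples = spout.stream();
        while (tuples.hasNext()) {
            for (String word : split.process(tuples.next())) {
                count.process(word);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Spout sentences = () -> Arrays.asList("to be or", "not to be").iterator();
        System.out.println(run(sentences)); // {be=2, not=1, or=1, to=2}
    }
}
```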

Spark
• High-speed in-memory analytics over Hadoop and Hive
• Separate MapReduce-like engine
– Speedup of up to 100x
– On-disk queries 5-10x faster
• Compatible with Hadoop’s Storage API
• Available as standalone application
– https://github.com/mesos/spark
• Experimental support for YARN since 0.6
– http://spark.incubator.apache.org/docs/0.6.0/running-on-yarn.html

Data Sharing in Spark

Apache Giraph
• Giraph is a framework for processing semi-structured graph data on a massive scale.
• Giraph is loosely based upon Google's Pregel.
• Giraph performs iterative calculations on top of an existing Hadoop cluster.
• Available on GitHub
– https://github.com/apache/giraph
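The Pregel model that Giraph follows is vertex-centric: in each superstep a vertex reads its incoming messages, updates its value and sends messages to its neighbours, and the computation stops when no messages are left. A minimal single-machine sketch in plain Java, using the common "propagate the maximum value" example. This is not the Giraph API; the class `MaxValueSketch` and its methods are hypothetical, and every vertex that appears as a neighbour is assumed to have an entry in `values`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Pregel-style "propagate the maximum value" in plain Java.
// Illustrative only; this is not the Giraph API.
class MaxValueSketch {
    // edges: vertex id -> ids of its out-neighbours; values: initial vertex values.
    // Returns the number of supersteps executed; `values` is updated in place.
    static int run(Map<Integer, List<Integer>> edges, Map<Integer, Integer> values) {
        // Superstep 0: every vertex sends its value to its neighbours.
        Map<Integer, List<Integer>> inbox = send(edges, values, values.keySet());
        int supersteps = 1;
        while (!inbox.isEmpty()) {
            List<Integer> changed = new ArrayList<>();
            for (Map.Entry<Integer, List<Integer>> e : inbox.entrySet()) {
                int max = values.get(e.getKey());
                for (int msg : e.getValue()) {
                    max = Math.max(max, msg);
                }
                if (max > values.get(e.getKey())) {
                    values.put(e.getKey(), max);
                    changed.add(e.getKey());
                }
            }
            // Only vertices whose value changed stay active and send messages.
            inbox = send(edges, values, changed);
            supersteps++;
        }
        return supersteps;
    }

    // Delivers the current value of each sender to all of its out-neighbours.
    private static Map<Integer, List<Integer>> send(Map<Integer, List<Integer>> edges,
            Map<Integer, Integer> values, Iterable<Integer> senders) {
        Map<Integer, List<Integer>> inbox = new HashMap<>();
        for (int v : senders) {
            for (int target : edges.getOrDefault(v, Collections.<Integer>emptyList())) {
                inbox.computeIfAbsent(target, k -> new ArrayList<>()).add(values.get(v));
            }
        }
        return inbox;
    }
}
```

For a line graph 1 - 2 - 3 with initial values 3, 6 and 2, all vertices end up with the value 6. Each loop iteration corresponds to one superstep, which is the iterative structure that is costly to express as chained MapReduce jobs.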

Hadoop 2.0 Summary
1. Scale
2. New programming models & services
3. Improved cluster utilization
4. Agility
5. Beyond Java

Getting started…
One more thing…

Hortonworks Sandbox
http://hortonworks.com/products/hortonworsk-sandbox

Books about Hadoop
1. Hadoop: The Definitive Guide, Tom White, 3rd ed., O’Reilly, 2012
2. Hadoop in Action, Chuck Lam, Manning, 2011
3. Programming Pig, Alan Gates, O’Reilly, 2011
4. Hadoop Operations, Eric Sammer, O’Reilly, 2012

The end…or the beginning?

More Related Content

What's hot

Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Hadoop Installation presentation
Hadoop Installation presentationHadoop Installation presentation
Hadoop Installation presentationpuneet yadav
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop EcosystemJ Singh
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Simplilearn
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing DataWorks Summit
 
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...Simplilearn
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map ReduceApache Apex
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache RangerDataWorks Summit
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Simplilearn
 
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleHive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleDataWorks Summit
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBMongoDB
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013mumrah
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBill Liu
 
Modularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkDatabricks
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 

What's hot (20)

Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Hadoop Installation presentation
Hadoop Installation presentationHadoop Installation presentation
Hadoop Installation presentation
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache Ranger
 
Mongo db intro.pptx
Mongo db intro.pptxMongo db intro.pptx
Mongo db intro.pptx
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
 
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleHive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
 
Modularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache Spark
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Apache hive introduction
Apache hive introductionApache hive introduction
Apache hive introduction
 

Viewers also liked

Druid at SF Big Analytics 2015-12-01
Druid at SF Big Analytics 2015-12-01Druid at SF Big Analytics 2015-12-01
Druid at SF Big Analytics 2015-12-01gianmerlino
 
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...Sudhir Tonse
 
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming DataDruid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming DataDataWorks Summit
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with HadoopOReillyStrata
 
OLAP options on Hadoop
OLAP options on HadoopOLAP options on Hadoop
OLAP options on HadoopYuta Imai
 

Viewers also liked (7)

Druid at SF Big Analytics 2015-12-01
Druid at SF Big Analytics 2015-12-01Druid at SF Big Analytics 2015-12-01
Druid at SF Big Analytics 2015-12-01
 
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...
 
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming DataDruid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
 
Hadoop Family and Ecosystem
Hadoop Family and EcosystemHadoop Family and Ecosystem
Hadoop Family and Ecosystem
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with Hadoop
 
OLAP options on Hadoop
OLAP options on HadoopOLAP options on Hadoop
OLAP options on Hadoop
 
Scalable Real-time analytics using Druid
Scalable Real-time analytics using DruidScalable Real-time analytics using Druid
Scalable Real-time analytics using Druid
 

Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Edition)

  • 1. Novi Sad 05.09.2013 Introduction to the Hadoop Ecosystem uweseiler
  • 2. About me: Big Data Nerd, Travelpirate, Photography Enthusiast, Hadoop Trainer, MongoDB Author
  • 3. About us: a bunch of… Big Data Nerds, Agile Ninjas, Continuous Delivery Gurus, Enterprise Java Specialists, Performance Geeks. Join us!
  • 4. Agenda • What is Big Data & Hadoop? • Core Hadoop • The Hadoop Ecosystem • Use Cases • What's next? Hadoop 2.0!
  • 5. Agenda • What is Big Data & Hadoop? • Core Hadoop • The Hadoop Ecosystem • Use Cases • What's next? Hadoop 2.0!
  • 6. Let's face it… Big Data is the latest …
  • 7. But on the other hand… Data is an…
  • 8. Think about it… Advertisements
  • 9. Think about it… Recommendations
  • 10. Think about it… Anything that sells
  • 11. The 3 V's of Big Data: Variety, Volume, Velocity
  • 12. My favorite definition
  • 13. Why Hadoop? Traditional data stores are expensive to scale and by design difficult to distribute. Scale out is the way to go!
  • 14. How to scale data? [Diagram: several workers read (r) from “Data” in parallel and write (w) to “Result”]
  • 15. But… Parallel processing is complicated!
  • 16. But… Data storage is not trivial!
  • 17. What is Hadoop? Distributed Storage and Computation Framework
  • 18. What is Hadoop? «Big Data» != Hadoop
  • 19. What is Hadoop? Hadoop != Database
  • 20. What is Hadoop? “Swiss army knife of the 21st century” http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop
  • 21. The Hadoop App Store HDFS MapRed HCat Pig Hive HBase Ambari Avro Cassandra Chukwa Intel Sync Flume Hana HyperT Impala Mahout Nutch Oozie Sqoop Scribe Tez Vertica Whirr ZooKee Horton Cloudera MapR EMC IBM Talend TeraData Pivotal Informat Microsoft Pentaho Jasper Kognitio Tableau Splunk Platfora Rack Karma Actuate MicStrat
  • 22. The Hadoop App Store (functionality: less → more). Apache Hadoop: HDFS, MapReduce, Hadoop Ecosystem, Hadoop YARN. Hadoop Distributions add: Test & Packaging, Installation, Monitoring, Business Support. Big Data Suites add: Integrated Environment, Visualization, (Near-)Realtime analysis, Modeling, ETL & Connectors.
  • 23. Agenda • What is Big Data & Hadoop? • Core Hadoop • The Hadoop Ecosystem • Use Cases • What's next? Hadoop 2.0!
  • 24. Data Storage: OK, first things first! I want to store all of my «Big Data»
  • 26. Hadoop Distributed File System • Distributed file system for redundant storage • Designed to reliably store data on commodity hardware • Built to expect hardware failures
  • 27. Hadoop Distributed File System: Intended for • large files • batch inserts
  • 28. HDFS Architecture [Diagram: a Client writes a File; the NameNode (master) keeps the Block Map and the Journal Log; the Secondary NameNode (helper) performs periodic merges; DataNodes (slaves) in Rack 1 and Rack 2 hold the replicated blocks #1 and #2]
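To make the block map on the HDFS architecture slide more tangible, here is a minimal plain-Java sketch of how a file might be cut into fixed-size blocks and how each block could be assigned to several DataNodes. It does not use the HDFS API. The class `BlockMapSketch`, the node names, the 64 MB block size and the replication factor of 3 are assumptions chosen for illustration, and the round-robin placement is a simplification that ignores racks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a NameNode block map (hypothetical, not the HDFS API). */
class BlockMapSketch {

    /** Number of fixed-size blocks needed to hold a file of the given size. */
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes; // ceiling division
    }

    /**
     * Maps each block index to the DataNodes that hold a replica of it.
     * Replicas are handed out round-robin; this ignores rack placement.
     */
    static Map<Integer, List<String>> placeBlocks(long blocks, List<String> dataNodes, int replication) {
        if (replication > dataNodes.size()) {
            throw new IllegalArgumentException("not enough DataNodes for the replication factor");
        }
        Map<Integer, List<String>> blockMap = new LinkedHashMap<>();
        for (int block = 0; block < blocks; block++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                replicas.add(dataNodes.get((block + r) % dataNodes.size()));
            }
            blockMap.put(block, replicas);
        }
        return blockMap;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;  // assumed block size
        long fileSize = 200L * 1024 * 1024;  // hypothetical 200 MB file
        List<String> nodes = Arrays.asList("dn1", "dn2", "dn3", "dn4");
        placeBlocks(blockCount(fileSize, blockSize), nodes, 3)
            .forEach((block, replicas) -> System.out.println("block #" + block + " -> " + replicas));
    }
}
```

Because every block lives on three different nodes in this sketch, losing any single DataNode still leaves two copies of each block, which is the point of "built to expect hardware failures".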
  • 29. Data Processing: Data stored, check! Now I want to create insights from my data!
  • 31. MapReduce • Programming model for distributed computations at a massive scale • Execution framework for organizing and performing such computations • Data locality is king
  • 32. Typical large-data problem • Iterate over a large number of records • Extract something of interest from each • Shuffle and sort intermediate results • Aggregate intermediate results • Generate final output (labels on the slide: Map, Reduce)
  • 33. MapReduce Flow [Diagram: four Map tasks emit key/value pairs for the keys a, b and c; each map output passes through Combine and Partition; Shuffle and Sort groups the values by key; three Reduce tasks produce a 4, b 9, c 19]
  • 34. Combined Hadoop Architecture [Diagram: a Client submits a File and a Job; the master runs the NameNode and the JobTracker, with a Secondary NameNode as helper; each slave runs a TaskTracker and a DataNode, holding a Block and running a Task]
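  • 35. Word Count Mapper in Java
      import java.io.IOException;
      import java.util.StringTokenizer;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.MapReduceBase;
      import org.apache.hadoop.mapred.Mapper;
      import org.apache.hadoop.mapred.OutputCollector;
      import org.apache.hadoop.mapred.Reporter;

      // Emits (word, 1) for every whitespace-separated token of an input line.
      public class WordCountMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();  // reused for every token to avoid allocations

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
          }
        }
      }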
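  • 36. Word Count Reducer in Java
      import java.io.IOException;
      import java.util.Iterator;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.MapReduceBase;
      import org.apache.hadoop.mapred.OutputCollector;
      import org.apache.hadoop.mapred.Reducer;
      import org.apache.hadoop.mapred.Reporter;

      // Sums all counts emitted for one word and writes (word, total).
      public class WordCountReducer extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }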
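The stages named on the "Typical large-data problem" and "MapReduce Flow" slides can be traced without a cluster. The following is a minimal in-memory sketch in plain Java, not Hadoop code: the class `MapReduceFlowSketch` and its helpers are hypothetical, and the individual map output pairs are an assumption reconstructed from the flattened figure so that they add up to the reduce results shown on the slide (a 4, b 9, c 19). Partitioning is left out because everything runs in one process.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory walk-through: map output -> combine -> shuffle and sort -> reduce. */
class MapReduceFlowSketch {

    /** One key/value pair as a mapper would emit it. */
    static final class Pair {
        final String key;
        final int value;

        Pair(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Combine: sum the values per key within a single mapper's output. */
    static List<Pair> combine(List<Pair> mapOutput) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Pair p : mapOutput) {
            sums.merge(p.key, p.value, Integer::sum);
        }
        List<Pair> combined = new ArrayList<>();
        for (Map.Entry<String, Integer> e : sums.entrySet()) {
            combined.add(new Pair(e.getKey(), e.getValue()));
        }
        return combined;
    }

    /** Shuffle and sort: collect all values per key, with keys in sorted order. */
    static Map<String, List<Integer>> shuffleAndSort(List<List<Pair>> combinedOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Pair> output : combinedOutputs) {
            for (Pair p : output) {
                grouped.computeIfAbsent(p.key, k -> new ArrayList<>()).add(p.value);
            }
        }
        return grouped;
    }

    /** Reduce: sum the grouped values of each key. */
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            result.put(e.getKey(), sum);
        }
        return result;
    }

    /** Runs all stages over the outputs of several mappers. */
    static Map<String, Integer> run(List<List<Pair>> mapOutputs) {
        List<List<Pair>> combined = new ArrayList<>();
        for (List<Pair> output : mapOutputs) {
            combined.add(combine(output));
        }
        return reduce(shuffleAndSort(combined));
    }

    /** Assumed outputs of the four mappers (reconstructed, see lead-in). */
    static List<List<Pair>> sampleMapOutputs() {
        return Arrays.asList(
            Arrays.asList(new Pair("a", 1), new Pair("b", 2)),
            Arrays.asList(new Pair("c", 3), new Pair("c", 6)),
            Arrays.asList(new Pair("a", 3), new Pair("c", 2)),
            Arrays.asList(new Pair("b", 7), new Pair("c", 8)));
    }

    public static void main(String[] args) {
        System.out.println(run(sampleMapOutputs())); // prints {a=4, b=9, c=19}
    }
}
```

The combiner only pays off for the second mapper here, where (c, 3) and (c, 6) collapse into (c, 9) before anything is shuffled; the final sums are the same with or without it.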
  • 37. Agenda • What is Big Data & Hadoop? • Core Hadoop • The Hadoop Ecosystem • Use Cases • What's next? Hadoop 2.0!
  • 38. Scripting for Hadoop: Java for MapReduce? I dunno, dude… I'm more of a scripting guy…
  • 40. Apache Pig • High-level data flow language • Made of two components: • Data processing language Pig Latin • Compiler to translate Pig Latin to MapReduce
  • 41. Pig in the Hadoop ecosystem: HDFS (Hadoop Distributed File System), MapReduce (Distributed Programming Framework), HCatalog (Metadata Management), Pig (Scripting)
  • 42. Pig Latin
      users = LOAD 'users.txt' USING PigStorage(',') AS (name, age);
      pages = LOAD 'pages.txt' USING PigStorage(',') AS (user, url);
      filteredUsers = FILTER users BY age >= 18 AND age <= 50;
      joinResult = JOIN filteredUsers BY name, pages BY user;
      grouped = GROUP joinResult BY url;
      summed = FOREACH grouped GENERATE group, COUNT(joinResult) AS clicks;
      sorted = ORDER summed BY clicks DESC;
      top10 = LIMIT sorted 10;
      STORE top10 INTO 'top10sites';
  • 43. Pig Execution Plan
  • 44. Try that with Java…
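For comparison, the logic of the Pig Latin script (filter users by age, join with page views, count clicks per URL, keep the top entries) can be written for small in-memory lists with Java streams. This is only a sketch of the data flow, not a MapReduce job, and it says nothing about how Pig compiles the script. The class `TopSitesSketch` and the `User` and `Page` types are hypothetical, and the join is simplified by assuming that user names are unique.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/** In-memory equivalent of the "top sites" data flow (hypothetical types). */
class TopSitesSketch {

    static final class User {
        final String name;
        final int age;

        User(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    static final class Page {
        final String user;
        final String url;

        Page(String user, String url) {
            this.user = user;
            this.url = url;
        }
    }

    /** Returns up to `limit` (url, clicks) entries, most clicked first. */
    static List<Map.Entry<String, Long>> topSites(List<User> users, List<Page> pages, int limit) {
        // FILTER users BY age >= 18 AND age <= 50
        Set<String> filteredUsers = users.stream()
            .filter(u -> u.age >= 18 && u.age <= 50)
            .map(u -> u.name)
            .collect(Collectors.toSet());

        // JOIN on user name, then GROUP BY url and COUNT
        Map<String, Long> clicks = pages.stream()
            .filter(p -> filteredUsers.contains(p.user))
            .collect(Collectors.groupingBy(p -> p.url, Collectors.counting()));

        // ORDER BY clicks DESC (ties broken by url), then LIMIT
        return clicks.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()))
            .limit(limit)
            .collect(Collectors.toList());
    }
}
```

This stays short only because everything fits in memory on one machine. Expressing the same join, grouping and global sort as chained MapReduce jobs by hand takes far more code, which is the work the Pig compiler takes over.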
  • 45. SQL for Hadoop: OK, Pig seems quite useful… But I'm more of a SQL person…
  • 47. Apache Hive • Data Warehousing Layer on top of Hadoop • Allows analysis and queries using a SQL-like language
  • 48. Hive in the Hadoop ecosystem: HDFS (Hadoop Distributed File System), MapReduce (Distributed Programming Framework), HCatalog (Metadata Management), Pig (Scripting), Hive (Query)
  • 49. Hive Architecture [Diagram: Thrift, JDBC and ODBC applications connect through the Hive Thrift, JDBC and ODBC drivers to the Hive Server; alongside the Hive Shell, it drives the Hive Engine, which uses the Metastore and runs on MapReduce and HDFS]
  • 50. Hive Example
      CREATE TABLE users(name STRING, age INT);
      CREATE TABLE pages(user STRING, url STRING);
      LOAD DATA INPATH '/user/sandbox/users.txt' INTO TABLE users;
      LOAD DATA INPATH '/user/sandbox/pages.txt' INTO TABLE pages;
      SELECT pages.url, count(*) AS clicks
      FROM users JOIN pages ON (users.name = pages.user)
      WHERE users.age >= 18 AND users.age <= 50
      GROUP BY pages.url
      SORT BY clicks DESC
      LIMIT 10;
  • 51. But wait, there's still more! More components of the Hadoop Ecosystem
  • 52. HDFS (data storage), MapReduce (data processing), HCatalog (metadata management), Pig (scripting), Hive (SQL-like queries), HBase (NoSQL database), Mahout (machine learning), ZooKeeper (cluster coordination), Sqoop (import & export of relational data), Ambari (cluster installation & management), Oozie (workflow automation), Flume (import & export of data flows)
  • 53. Agenda • What is Big Data & Hadoop? • Core Hadoop • The Hadoop Ecosystem • Use Cases • What's next? Hadoop 2.0!
  • 54. Classical enterprise platform. Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …). Data Systems: Traditional Systems (RDBMS, EDW, MPP, …). Applications: Business Intelligence, Business Applications, Custom Applications. Operation: Manage & Monitor. Dev Tools: Build & Test.
  • 55. Big Data Platform. Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …), New Sources (Logs, Mails, Sensor, Social Media, …). Data Systems: Traditional Systems (RDBMS, EDW, MPP, …), Enterprise Hadoop Platform. Applications: Business Intelligence, Business Applications, Custom Applications. Operation: Manage & Monitor. Dev Tools: Build & Test.
  • 56. Pattern #1: Refine data. (1) Capture all data (2) Process the data (3) Exchange using traditional systems (4) Process & visualize with traditional applications. Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …), New Sources (Logs, Mails, Sensor, Social Media, …). Data Systems: Traditional Systems (RDBMS, EDW, MPP, …), Enterprise Hadoop Platform. Applications: Business Intelligence, Business Applications, Custom Applications.
  • 57. Pattern #2: Explore data. (1) Capture all data (2) Process the data (3) Explore the data using applications with support for Hadoop. Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …), New Sources (Logs, Mails, Sensor, Social Media, …). Data Systems: Traditional Systems (RDBMS, EDW, MPP, …), Enterprise Hadoop Platform. Applications: Business Intelligence, Business Applications, Custom Applications.
  • 58. Pattern #3: Enrich data. (1) Capture all data (2) Process the data (3) Directly ingest the data. Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …), New Sources (Logs, Mails, Sensor, Social Media, …). Data Systems: Traditional Systems (RDBMS, EDW, MPP, …), Enterprise Hadoop Platform. Applications: Business Applications, Custom Applications.
  • 59. Novi Sad 05.09.2013 Bringing it all together… One example…
  • 60. Novi Sad 05.09.2013 Digital Advertising • 6 billion ad deliveries per day • Reports (and bills) for the advertising companies needed • Own C++ solution did not scale • Adding functions was a nightmare
  • 61. Novi Sad 05.09.2013 AdServing Architecture (diagram labels): Campaign Database; FFM; AMS; TCP Interface (2x); Custom Flume Source (2x); Flume HDFS Sink; Local files; Campaign Data; Hadoop Cluster; Binary Log Format; Synchronisation; Pig; Hive; Temporary data; NAS; Aggregated data; Report Engine; Direct Download; Job Scheduler; Config UI; Job Config XML; Start; AdServer (2x)
  • 62. Novi Sad 05.09.2013 What’s next? Hadoop 2.0 aka YARN
  • 63. Novi Sad 05.09.2013 Hadoop 1.0: Built for web-scale batch apps (diagram: several "Single App / Batch" boxes on top of HDFS)
  • 64. Novi Sad 05.09.2013 MapReduce is good for… • Embarrassingly parallel algorithms • Summing, grouping, filtering, joining • Off-line batch jobs on massive data sets • Analyzing an entire large dataset
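The "summing, grouping" case above can be illustrated with a minimal, single-process sketch of the map, shuffle and reduce phases. This is not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the "campaign,price" input format are assumptions made up for this illustration.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record and collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Call the reducer once per key with all values for that key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def delivery_mapper(line):
    """Hypothetical mapper: parse 'campaign,price' and emit (campaign, price)."""
    campaign, price = line.split(",")
    return [(campaign, float(price))]


def sum_reducer(key, values):
    """Sum all prices observed for one campaign."""
    return sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on an in-memory list."""
    return reduce_phase(shuffle(map_phase(records, delivery_mapper)), sum_reducer)


if __name__ == "__main__":
    print(run_job(["a,1.0", "b,2.5", "a,0.5"]))
```

Each mapper call only sees one record and each reducer call only sees one key, which is what makes the work easy to spread across many machines.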
  • 65. Novi Sad 05.09.2013 MapReduce is OK for… • Iterative jobs (e.g., graph algorithms) – Each iteration must read/write data to disk – I/O and latency cost of an iteration is high
  • 66. Novi Sad 05.09.2013 MapReduce is not good for… • Jobs that need shared state/coordination – Tasks are shared-nothing – Shared-state requires scalable state store • Low-latency jobs • Jobs on small datasets • Finding individual records
  • 67. Novi Sad 05.09.2013 MapReduce limitations • Scalability – Maximum cluster size ~ 4,500 nodes – Maximum concurrent tasks – 40,000 – Coarse synchronization in JobTracker • Availability – Failure kills all queued and running jobs • Hard partition of resources into map & reduce slots – Low resource utilization • Lacks support for alternate paradigms and services – Iterative applications implemented using MapReduce are 10x slower
  • 68. Novi Sad 05.09.2013 Hadoop 2.0: Next-gen platform. Hadoop 1.0 (single use system; batch apps): MapReduce (cluster resource mgmt. + data processing) on top of HDFS (redundant, reliable storage). Hadoop 2.0 (multi-purpose platform; batch, interactive, streaming, …): MapReduce (data processing) and Others (data processing) on top of YARN (cluster resource management) on top of HDFS 2.0 (redundant, reliable storage).
  • 69. Novi Sad 05.09.2013 Taking Hadoop beyond batch. Applications run natively in Hadoop: Batch (MapReduce), Interactive (Tez), Online (HOYA), Streaming (Storm, …), Graph (Giraph), In-Memory (Spark), Other (Search, …), all on top of YARN (cluster resource management) and HDFS 2.0 (redundant, reliable storage). Store all data in one place; interact with data in multiple ways.
  • 70. Novi Sad 05.09.2013 A brief history of Hadoop 2.0 • Originally conceived & architected by the team at Yahoo! – Arun Murthy created the original JIRA in 2008 and now is the YARN release manager • The team at Hortonworks has been working on YARN for 4 years: – 90% of code from Hortonworks & Yahoo! • Hadoop 2.0 based architecture running at scale at Yahoo! – Deployed on 35,000 nodes for 6+ months
  • 71. Novi Sad 05.09.2013 Hadoop 2.0 Projects • YARN • HDFS Federation aka HDFS 2.0 • Stinger & Tez aka Hive 2.0
  • 72. Novi Sad 05.09.2013 Hadoop 2.0 Projects • YARN • HDFS Federation aka HDFS 2.0 • Stinger & Tez aka Hive 2.0
  • 73. Novi Sad 05.09.2013 YARN: Architecture. Split up the two major functions of the JobTracker: cluster resource management & application life-cycle management. Diagram: a ResourceManager (with Scheduler) and eight NodeManagers hosting AM 1 with Containers 1.1 and 1.2, and AM 2 with Containers 2.1, 2.2 and 2.3.
  • 74. Novi Sad 05.09.2013 YARN: Architecture • Resource Manager – Global resource scheduler – Hierarchical queues • Node Manager – Per-machine agent – Manages the life-cycle of container – Container resource monitoring • Application Master – Per-application – Manages application scheduling and task execution – e.g. MapReduce Application Master
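The division of labour between the three roles above can be sketched as a toy model. This is not the YARN API: the class and method names and the "place on the node with the most free memory" policy are assumptions chosen only to keep the example short. Real schedulers use hierarchical queues and richer resource requests.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-machine agent: tracks the containers running on one node."""
    name: str
    memory_mb: int
    containers: list = field(default_factory=list)

    def free_mb(self):
        return self.memory_mb - sum(size for _, size in self.containers)

    def launch(self, container_id, size_mb):
        self.containers.append((container_id, size_mb))


class ResourceManager:
    """Global scheduler: decides which node hosts each requested container."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, container_id, size_mb):
        # Toy policy (an assumption): pick the node with the most free memory.
        candidates = [n for n in self.nodes if n.free_mb() >= size_mb]
        if not candidates:
            return None  # the cluster is full for this request
        node = max(candidates, key=lambda n: n.free_mb())
        node.launch(container_id, size_mb)
        return node.name


class ApplicationMaster:
    """Per-application: requests containers for its own tasks."""

    def __init__(self, app_id, resource_manager):
        self.app_id = app_id
        self.rm = resource_manager

    def run_tasks(self, task_count, size_mb):
        placements = {}
        for i in range(1, task_count + 1):
            container_id = f"{self.app_id}.{i}"
            placements[container_id] = self.rm.allocate(container_id, size_mb)
        return placements


if __name__ == "__main__":
    nodes = [NodeManager("nm1", 2048), NodeManager("nm2", 1024)]
    rm = ResourceManager(nodes)
    print(ApplicationMaster("1", rm).run_tasks(3, 1024))
```

The ResourceManager knows nothing about map or reduce tasks; it only hands out containers, which is what allows non-MapReduce applications to share the same cluster.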
  • 75. Novi Sad 05.09.2013 YARN: Architecture. Diagram: a ResourceManager (with Scheduler) and twelve NodeManagers running several applications side by side: MapReduce 1 (map 1.1, map 1.2, reduce 1.1), MapReduce 2 (map 2.1, map 2.2, map 2.3, reduce 2.1, reduce 2.2), Tez (vertex 1 to vertex 4), HOYA (HBase Master, Region servers 1 to 3), Storm (nimbus 1, nimbus 2).
  • 76. Novi Sad 05.09.2013 Hadoop 2.0 Projects • YARN • HDFS Federation aka HDFS 2.0 • Stinger & Tez aka Hive 2.0
  • 77. Novi Sad 05.09.2013 HDFS Federation • Removes tight coupling of Block Storage and Namespace • Scalability & Isolation • High Availability • Increased performance Details: https://issues.apache.org/jira/browse/HDFS-1052
  • 78. Novi Sad 05.09.2013 HDFS Federation: Architecture • NameNodes do not talk to each other • Each NameNode manages only a slice of the namespace • DataNodes can store blocks managed by any NameNode. Diagram: NameNode 1 (Namespace 1: logs, finance; Block Management 1: blocks 1, 2, 3, 4) and NameNode 2 (Namespace 2: insights, reports; Block Management 2: blocks 5, 6, 7, 8), backed by DataNodes 1 to 4.
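Because federated NameNodes do not talk to each other, something on the client side has to know which NameNode owns which part of the namespace. A minimal sketch of such a lookup, using the namespace slices from the slide (logs, finance, insights, reports), could look like the following. The table, the host names `namenode1` / `namenode2` and the function `resolve_namenode` are assumptions for illustration, not Hadoop configuration.

```python
# Hypothetical client-side mount table: path prefix -> responsible NameNode.
MOUNT_TABLE = {
    "/logs": "namenode1",
    "/finance": "namenode1",
    "/insights": "namenode2",
    "/reports": "namenode2",
}


def resolve_namenode(path, mount_table=MOUNT_TABLE):
    """Return the NameNode that manages the namespace slice containing path."""
    for prefix, namenode in mount_table.items():
        # Match the prefix itself or anything below it, but not "/logsx".
        if path == prefix or path.startswith(prefix + "/"):
            return namenode
    raise KeyError(f"no NameNode is responsible for {path!r}")


if __name__ == "__main__":
    print(resolve_namenode("/logs/2013/09/05.log"))
    print(resolve_namenode("/reports"))
```

Adding a NameNode then amounts to adding entries to this table; existing slices and their NameNodes are unaffected, which is where the isolation and scalability benefits come from.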
  • 79. Novi Sad 05.09.2013 HDFS: Quorum based storage • Only the active NameNode writes edits • The state is shared on a quorum of journal nodes • The Standby simultaneously reads and applies the edits • DataNodes report to both NameNodes but listen only to the orders from the active one. Diagram: Active NameNode and Standby NameNode (each with Block Map and Edits File), three Journal Nodes, five DataNodes.
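The phrase "shared on a quorum of journal nodes" means that an edit counts as durable once a majority of journal nodes has acknowledged it. A minimal sketch of that rule follows. The `JournalNode` class and `write_edit` function are assumptions for illustration, and the sketch leaves out everything else a real implementation needs (epochs, fencing, recovery).

```python
class JournalNode:
    """Toy journal node: stores edits in memory and may be unavailable."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.edits = []

    def append(self, txid, edit):
        """Store the edit and acknowledge it, unless the node is down."""
        if not self.available:
            return False
        self.edits.append((txid, edit))
        return True


def write_edit(journal_nodes, txid, edit):
    """Send the edit to all journal nodes; succeed if a majority acknowledges."""
    acks = sum(1 for node in journal_nodes if node.append(txid, edit))
    majority = len(journal_nodes) // 2 + 1
    return acks >= majority


if __name__ == "__main__":
    nodes = [JournalNode("jn1"), JournalNode("jn2"), JournalNode("jn3", available=False)]
    print(write_edit(nodes, 1, "mkdir /logs"))
```

With three journal nodes the active NameNode tolerates the loss of one; with five it tolerates two.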
  • 80. Novi Sad 05.09.2013 Hadoop 2.0 Projects • YARN • HDFS Federation aka HDFS 2.0 • Stinger & Tez aka Hive 2.0
  • 81. Novi Sad 05.09.2013 Hive: Current Focus Area (latency classes across data sizes). Real-Time (0-5s): Online systems, R-T analytics, CEP. Interactive (5s – 1m): Parameterized Reports, Drilldown, Visualization, Exploration. Non-Interactive (1m – 1h): Data preparation, Incremental batch processing, Dashboards / Scorecards. Batch (1h+): Operational batch processing, Enterprise Reports, Data Mining. Highlighted: Current Hive Sweet Spot.
  • 82. Novi Sad 05.09.2013 Stinger: Extending the sweet spot (latency classes across data sizes). Real-Time (0-5s): Online systems, R-T analytics, CEP. Interactive (5s – 1m): Parameterized Reports, Drilldown, Visualization, Exploration. Non-Interactive (1m – 1h): Data preparation, Incremental batch processing, Dashboards / Scorecards. Batch (1h+): Operational batch processing, Enterprise Reports, Data Mining. Highlighted: Future Hive Expansion. Improve Latency & Throughput • Query engine improvements • New “Optimized RCFile” column store • Next-gen runtime (eliminates M/R latency) Extend Deep Analytical Ability • Analytics functions • Improved SQL coverage • Continued focus on core Hive use cases
  • 83. Novi Sad 05.09.2013 Stinger Initiative at a glance
  • 84. Novi Sad 05.09.2013 Tez: The Execution Engine • Low-level data-processing execution engine • Serves as the base for MapReduce, Hive, Pig, etc. • Enables pipelining of jobs • Removes task and job launch times • Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline • Does not write intermediate output to HDFS – Much lighter disk and network usage • Built on YARN
  • 85. Novi Sad 05.09.2013 Pig/Hive MR vs. Pig/Hive Tez. Query: SELECT a.state, COUNT(*), AVG(c.price) FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemId = c.itemId) GROUP BY a.state. Pig/Hive - MR: Job 1, Job 2 and Job 3, separated by I/O synchronization barriers. Pig/Hive - Tez: a single job.
  • 86. Novi Sad 05.09.2013 Tez Service • MapReduce query startup is expensive: – Job-launch & task-launch latencies are fatal for short queries (on the order of 5s to 30s) • Solution: – Tez Service (= preallocated Application Master) • Removes job-launch overhead (Application Master) • Removes task-launch overhead (pre-warmed containers) – Hive/Pig • Submit query plan to Tez Service – Native Hadoop service, not ad-hoc
  • 87. Novi Sad 05.09.2013 Tez: Low latency. Query: SELECT a.state, COUNT(*), AVG(c.price) FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemId = c.itemId) GROUP BY a.state. Existing Hive: Parse Query 0.5s, Create Plan 0.5s, Launch MapReduce 20s, Process MapReduce 10s, Total 31s. Hive/Tez: Parse Query 0.5s, Create Plan 0.5s, Launch MapReduce 20s, Process MapReduce 2s, Total 23s. Tez & Tez Service: Parse Query 0.5s, Create Plan 0.5s, Submit to Tez Service 0.5s, Process MapReduce 2s, Total 3.5s. * No exact numbers, for illustration only
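The totals on this slide are simple sums of the stage times, and adding them up shows where the time goes: in the first two variants the 20s launch dominates. The small sketch below recomputes them; the stage values are the illustrative numbers from the slide, while the dictionary layout and names are assumptions of this example.

```python
# Illustrative stage latencies in seconds, copied from the slide.
STAGES = {
    "Existing Hive": {
        "Parse Query": 0.5,
        "Create Plan": 0.5,
        "Launch MapReduce": 20.0,
        "Process MapReduce": 10.0,
    },
    "Hive/Tez": {
        "Parse Query": 0.5,
        "Create Plan": 0.5,
        "Launch MapReduce": 20.0,
        "Process MapReduce": 2.0,
    },
    "Tez & Tez Service": {
        "Parse Query": 0.5,
        "Create Plan": 0.5,
        "Submit to Tez Service": 0.5,
        "Process MapReduce": 2.0,
    },
}


def total_latency(stages):
    """Sum the per-stage latencies of one execution variant."""
    return sum(stages.values())


TOTALS = {name: total_latency(stages) for name, stages in STAGES.items()}

if __name__ == "__main__":
    for name, total in TOTALS.items():
        print(f"{name}: {total}s")
```

Faster processing alone only brings the total from 31s to 23s; replacing the launch step with a 0.5s submission to an already running service is what brings it down to 3.5s.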
  • 88. Novi Sad 05.09.2013 Stinger: Summary * Real numbers, but handle with care!
  • 89. Novi Sad 05.09.2013 Hadoop 2.0 Applications • MapReduce 2.0 • HOYA - HBase on YARN • Storm, Spark, Apache S4 • Hamster (MPI on Hadoop) • Apache Giraph • Apache Hama • Distributed Shell • Tez
  • 90. Novi Sad 05.09.2013 Hadoop 2.0 Applications • MapReduce 2.0 • HOYA - HBase on YARN • Storm, Spark, Apache S4 • Hamster (MPI on Hadoop) • Apache Giraph • Apache Hama • Distributed Shell • Tez
  • 91. Novi Sad 05.09.2013 MapReduce 2.0 • Essentially a port to the YARN architecture • MapReduce becomes a user-land library • No need to rewrite MapReduce jobs • Increased scalability & availability • Better cluster utilization
  • 92. Novi Sad 05.09.2013 Hadoop 2.0 Applications • MapReduce 2.0 • HOYA - HBase on YARN • Storm, Spark, Apache S4 • Hamster (MPI on Hadoop) • Apache Giraph • Apache Hama • Distributed Shell • Tez
  • 93. Novi Sad 05.09.2013 HOYA: HBase on YARN • Create on-demand HBase clusters • Configure different HBase instances differently • Better isolation • Create (transient) HBase clusters from MapReduce jobs • Elasticity of clusters for analytic / batch workload processing • Better cluster resources utilization
  • 94. Novi Sad 05.09.2013 Hadoop 2.0 Applications • MapReduce 2.0 • HOYA - HBase on YARN • Storm, Spark, Apache S4 • Hamster (MPI on Hadoop) • Apache Giraph • Apache Hama • Distributed Shell • Tez
  • 95. Novi Sad 05.09.2013 Twitter Storm • Stream processing • Real-time processing • Developed as a standalone application • https://github.com/nathanmarz/storm • Ported to YARN • https://github.com/yahoo/storm-yarn
  • 96. Novi Sad 05.09.2013 Storm: Conceptual view • Spout: Source of streams • Bolt: Consumer of streams; processes tuples and possibly emits new tuples • Tuple: List of name-value pairs • Stream: Unbounded sequence of tuples • Topology: Network of spouts and bolts as the nodes, with streams as the edges
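The terms above can be made concrete with plain Python generators. This is a conceptual sketch only and does not use the Storm API. The word-count topology, the function names and the use of a finite list in place of an unbounded stream are all assumptions of this example. Tuples are modelled as dictionaries of name-value pairs.

```python
def sentence_spout(sentences):
    """Spout: the source of a stream, emitting one tuple per sentence."""
    for sentence in sentences:
        yield {"sentence": sentence}


def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one new tuple per word."""
    for tup in stream:
        for word in tup["sentence"].split():
            yield {"word": word}


def count_bolt(stream):
    """Bolt: keeps a running count per word and emits the updated count."""
    counts = {}
    for tup in stream:
        word = tup["word"]
        counts[word] = counts.get(word, 0) + 1
        yield {"word": word, "count": counts[word]}


def run_topology(sentences):
    """Topology: spout -> split bolt -> count bolt, wired as a chain."""
    latest = {}
    for tup in count_bolt(split_bolt(sentence_spout(sentences))):
        latest[tup["word"]] = tup["count"]
    return latest


if __name__ == "__main__":
    print(run_topology(["big data", "big hadoop"]))
```

Each tuple flows through the whole chain as soon as it is emitted, without waiting for the input to end, which is the main difference from a batch job.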
  • 97. Novi Sad 05.09.2013 Hadoop 2.0 Applications • MapReduce 2.0 • HOYA - HBase on YARN • Storm, Spark, Apache S4 • Hamster (MPI on Hadoop) • Apache Giraph • Apache Hama • Distributed Shell • Tez
  • 98. Novi Sad 05.09.2013 Spark • High-speed in-memory analytics over Hadoop and Hive • Separate MapReduce-like engine – Speedup of up to 100x – On-disk queries 5-10x faster • Compatible with Hadoop‘s Storage API • Available as standalone application – https://github.com/mesos/spark • Experimental support for YARN since 0.6 – http://spark.incubator.apache.org/docs/0.6.0/running-on-yarn.html
  • 99. Novi Sad 05.09.2013 Data Sharing in Spark
  • 100. Novi Sad 05.09.2013 Hadoop 2.0 Applications • MapReduce 2.0 • HOYA - HBase on YARN • Storm, Spark, Apache S4 • Hamster (MPI on Hadoop) • Apache Giraph • Apache Hama • Distributed Shell • Tez
  • 101. Novi Sad 05.09.2013 Apache Giraph • Giraph is a framework for processing semi-structured graph data on a massive scale. • Giraph is loosely based upon Google's Pregel • Giraph performs iterative calculations on top of an existing Hadoop cluster. • Available on GitHub – https://github.com/apache/giraph
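The Pregel-style iterative model that Giraph is loosely based on can be sketched in a few lines: in every superstep each active vertex sends messages to its neighbours, and a vertex stays active only if a message changed its value. The example below propagates the maximum value through a graph. It is a single-process illustration, not the Giraph API, and the function name `pregel_max` and the adjacency-dictionary graph format are assumptions.

```python
def pregel_max(graph, values):
    """Propagate the maximum vertex value through the graph in supersteps.

    graph:  dict mapping each vertex to a list of its neighbours
    values: dict mapping each vertex to its initial value
    Returns the final values and the number of supersteps executed.
    """
    values = dict(values)
    active = set(graph)  # every vertex is active in the first superstep
    supersteps = 0
    while active:
        # Message passing: active vertices send their value to all neighbours.
        inbox = {vertex: [] for vertex in graph}
        for vertex in active:
            for neighbour in graph[vertex]:
                inbox[neighbour].append(values[vertex])
        # Compute: a vertex updates and stays active only if it learned
        # a larger value; otherwise it votes to halt.
        active = set()
        for vertex, messages in inbox.items():
            if messages and max(messages) > values[vertex]:
                values[vertex] = max(messages)
                active.add(vertex)
        supersteps += 1
    return values, supersteps


if __name__ == "__main__":
    chain = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    print(pregel_max(chain, {"a": 3, "b": 6, "c": 2}))
```

The computation stops when no vertex is active any more, so the number of supersteps depends on the graph's shape rather than on a fixed iteration count.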
  • 102. Novi Sad 05.09.2013 Hadoop 2.0 Summary 1. Scale 2. New programming models & Services 3. Improved cluster utilization 4. Agility 5. Beyond Java
  • 103. Novi Sad 05.09.2013 Getting started… One more thing…
  • 104. Novi Sad 05.09.2013 Hortonworks Sandbox http://hortonworks.com/products/hortonworsk-sandbox
  • 105. Novi Sad 05.09.2013 Books about Hadoop 1. Hadoop: The Definitive Guide, Tom White, 3rd ed., O’Reilly, 2012 2. Hadoop in Action, Chuck Lam, Manning, 2011 3. Programming Pig, Alan Gates, O’Reilly, 2011 4. Hadoop Operations, Eric Sammer, O’Reilly, 2012
  • 106. Novi Sad 05.09.2013 The end…or the beginning?