Hadoop
Ecosystem
Ran
Silberman
Dec. 2014
What types of systems exist in the ecosystem?
● Systems that are based on MapReduce
● Systems that replace MapReduce
● Complementary databases
● Utilities
● See complete list here
Systems based
on MapReduce
Hive
● Part of the Apache project
● General SQL-like syntax for querying HDFS or other
large databases
● Each SQL statement is translated to one or more
MapReduce jobs (in some cases none)
● Supports pluggable Mappers, Reducers and SerDe’s
(Serializer/Deserializer)
● Pro: Convenient for analytics people who use SQL
Hive Architecture
Hive Usage
Start a Hive shell:
$ hive
Create a Hive table:
hive> CREATE TABLE tikal (id BIGINT, name STRING, startdate TIMESTAMP, email STRING);
Show all tables:
hive> SHOW TABLES;
Add a new column to the table:
hive> ALTER TABLE tikal ADD COLUMNS (description STRING);
Load an HDFS data file into the table:
hive> LOAD DATA INPATH '/home/hduser/tikal_users' OVERWRITE INTO TABLE tikal;
Query employees that have worked for more than a year:
hive> SELECT name FROM tikal WHERE (unix_timestamp() - startdate > 365 * 24 * 60 * 60);
Pig
● Part of the Apache project
● A programming language that is compiled into one or more MapReduce jobs
● Supports User Defined functions
● Pro: More convenient to write than pure MapReduce
Pig Usage
Start a Pig shell (grunt is the Pig Latin shell prompt):
$ pig
grunt>
Load an HDFS data file:
grunt> employees = LOAD 'hdfs://hostname:54310/home/hduser/tikal_users'
as (id,name,startdate,email,description);
Dump the data to console:
grunt> DUMP employees;
Query the data:
grunt> employees_more_than_1_year = FILTER employees BY (float)rating > 1.0;
grunt> DUMP employees_more_than_1_year;
Store query result to new file:
grunt> store employees_more_than_1_year into '/home/hduser/employees_more_than_1_year';
Cascading
● An infrastructure with an API that is compiled into one or more MapReduce jobs
● Provides a graphical view of the MapReduce job workflow
● Offers ways to tweak settings and improve the performance of the workflow
● Pros:
○ Hides MapReduce API and joins jobs
○ Graphical view and performance tuning
MapReduce workflow
● MapReduce framework operates exclusively on
Key/Value pairs
● There are three phases in the workflow:
○ map
○ combine
○ reduce
(input) <k1, v1> =>
map => <k2, v2> =>
combine => <k2, v2> =>
reduce => <k3, v3> (output)
WordCount in MapReduce Java API
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    // emit <word, 1> for every token in the input line
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
WordCount in MapReduce Java Cont.
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
WordCount in MapReduce Java Cont.
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
MapReduce workflow example.
Let’s consider two text files:
$ bin/hdfs dfs -cat /user/joe/wordcount/input/file01
Hello World Bye World
$ bin/hdfs dfs -cat /user/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop
Mapper code
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
Mapper output
For two files there will be two mappers.
For the given sample input the first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
The second map emits:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
Set Combiner
We defined a combiner in the code:
job.setCombinerClass(IntSumReducer.class);
Combiner output
Output of each map is passed through the local combiner
for local aggregation, after being sorted on the keys.
The output of the first map:
< Bye, 1>
< Hello, 1>
< World, 2>
The output of the second map:
< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>
Reducer code
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
Reducer output
The reducer sums up the values
The output of the job is:
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
The Cascading core components
● Tap (Data resource)
○ Source (Data input)
○ Sink (Data output)
● Pipe (data stream)
● Filter (Data operation)
● Flow (assembly of Taps and Pipes)
WordCount in Cascading Visualization
● source (Document Collection)
● pipes (Tokenize, Count)
● sink (Word Count)
WordCount in Cascading Cont.
// define source and sink Taps.
Scheme sourceScheme = new TextLine( new Fields( "line" ) );
Tap source = new Hfs( sourceScheme, inputPath );
Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );
Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );
// the 'head' of the pipe assembly
Pipe assembly = new Pipe( "wordcount" );
// For each input Tuple
// parse out each word into a new Tuple with the field name "word"
// regular expressions are optional in Cascading
String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
Function function = new RegexGenerator( new Fields( "word" ), regex );
assembly = new Each( assembly, new Fields( "line" ), function );
// group the Tuple stream by the "word" value
assembly = new GroupBy( assembly, new Fields( "word" ) );
WordCount in Cascading Cont.
// For every Tuple group
// count the number of occurrences of "word" and store result in
// a field named "count"
Aggregator count = new Count( new Fields( "count" ) );
assembly = new Every( assembly, count );
// initialize app properties, tell Hadoop which jar file to use
Properties properties = new Properties();
FlowConnector.setApplicationJarClass( properties, Main.class );
// plan a new Flow from the assembly using the source and sink Taps
// with the above properties
FlowConnector flowConnector = new FlowConnector( properties );
Flow flow = flowConnector.connect( "word-count", source, sink, assembly );
// execute the flow, block until complete
flow.complete();
Diagram of Cascading Flow
Scalding
● Extension to Cascading
● Programming language is Scala instead of Java
● Good for functional programming paradigms in Data Applications
● Pro: code can be very compact!
WordCount in Scalding
import com.twitter.scalding._
class WordCountJob(args : Args) extends Job(args) {
TypedPipe.from(TextLine(args("input")))
.flatMap { line => line.split("""\s+""") }
.groupBy { word => word }
.size
.write(TypedTsv(args("output")))
}
Summingbird
● An open source project from Twitter
● An API that is compiled to Scalding and to Storm
topologies.
● Can be written in Java or Scala
● Pro: Suits the Lambda Architecture, where you write one codebase that runs on both Hadoop and Storm
WordCount in Summingbird
def wordCount[P <: Platform[P]]
(source: Producer[P, String], store: P#Store[String, Long]) =
source.flatMap { sentence =>
toWords(sentence).map(_ -> 1L)
}.sumByKey(store)
Systems that
replace MapReduce
Spark
● Part of the Apache project
● Replaces MapReduce with its own engine that works much faster without compromising consistency
● Architecture is not based on MapReduce but rather on two concepts: RDD (Resilient Distributed Dataset) and DAG (Directed Acyclic Graph)
● Pros:
○ Works much faster than MapReduce
○ Fast-growing community
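WordCount in Spark
A minimal sketch for comparison with the WordCount examples above (in Scala; the app name and command-line paths are illustrative, not from the original deck):

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("word count"))
    sc.textFile(args(0))              // read lines from HDFS
      .flatMap(_.split("""\s+"""))    // split each line into words
      .map(word => (word, 1L))        // emit <word, 1> pairs
      .reduceByKey(_ + _)             // sum the counts per word
      .saveAsTextFile(args(1))        // write results back to HDFS
    sc.stop()
  }
}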
Impala
● Open Source from Cloudera
● Used for Interactive queries with SQL syntax
● Replaces MapReduce with its own Impala Server
● Pro: Can get much faster response time for SQL over
HDFS than Hive or Pig.
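Impala Usage
A hypothetical session in the style of the Hive and Pig usage slides, assuming the tikal table from the Hive example exists in the Hive metastore (which Impala shares):

$ impala-shell
> SELECT name FROM tikal WHERE (unix_timestamp() - startdate > 365 * 24 * 60 * 60);

The same query that Hive compiles into MapReduce jobs is executed directly by the Impala daemons.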
Impala benchmark
Note: Impala runs over Parquet!
Impala replaces MapReduce
Impala architecture
● Impala architecture was inspired by Google Dremel
● MapReduce is great for functional programming, but not efficient for SQL
● Impala replaces MapReduce with a distributed query engine that is optimized for fast queries
Dremel architecture
Dremel: Interactive Analysis of Web-Scale Datasets
Impala architecture
Presto, Drill, Tez
● Several more alternatives:
○ Presto by Facebook
○ Apache Drill pushed by MapR
○ Apache Tez pushed by Hortonworks
● All are alternatives to Impala that do more or less the same: provide faster response times for queries over HDFS
● Each of the above claims to have very fast results
● Be careful with the benchmarks they publish: to get better results they use indexed data rather than sequential files in HDFS (e.g., ORC files, Parquet, HBase)
Complementary
Databases
HBase
● Apache project
● NoSQL cluster database that can grow linearly
● Can store billions of rows × millions of columns
● Storage is based on HDFS
● API based on MapReduce
● Pros:
○ Strongly consistent read/writes
○ Good for high-speed counter aggregations
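HBase Usage
An illustrative HBase shell session (the table and column family names are made up):

$ hbase shell
hbase> create 'tikal', 'info'
hbase> put 'tikal', 'row1', 'info:name', 'Ran'
hbase> put 'tikal', 'row1', 'info:email', 'ran@tikal.com'
hbase> get 'tikal', 'row1'
hbase> count 'tikal'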
Parquet
● Apache (incubator) project. Initiated by Twitter &
Cloudera
● Columnar File Format - write one column at a time
● Integrated with Hadoop ecosystem (MapReduce, Hive)
● Supports Avro, Thrift and ProtoBuf
● Pro: keeps I/O to a minimum by reading from disk only the data required for the query
Columnar format (Parquet)
Advantages of Columnar formats
● Better compression as data is more homogeneous.
● I/O will be reduced as we can efficiently scan only a
subset of the columns while reading the data.
● When storing data of the same type in each column,
we can use encodings better suited to the modern
processors’ pipeline by making instruction branching
more predictable.
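As an illustrative sketch of the Hive integration mentioned above (Hive 0.13+ syntax; the table name is hypothetical):

hive> CREATE TABLE tikal_parquet (id BIGINT, name STRING, email STRING)
STORED AS PARQUET;
hive> INSERT INTO TABLE tikal_parquet SELECT id, name, email FROM tikal;

A query that touches only the name column of tikal_parquet will read only that column's data from disk.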
Utilities
Flume
● Cloudera product (now an Apache project)
● Used to collect files from distributed systems and send them to a central repository
● Designed for integration with HDFS but can write to other file systems
● Supports listening to TCP and UDP sockets
● Main Use Case: collect distributed logs to HDFS
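Flume Usage
A minimal agent configuration sketch for the main use case above; the agent name, log path and HDFS path are placeholders:

# flume.conf: tail a local log file and ship it to HDFS
agent.sources = logsrc
agent.channels = mem
agent.sinks = hdfssink

agent.sources.logsrc.type = exec
agent.sources.logsrc.command = tail -F /var/log/app.log
agent.sources.logsrc.channels = mem

agent.channels.mem.type = memory

agent.sinks.hdfssink.type = hdfs
agent.sinks.hdfssink.hdfs.path = hdfs://hostname:54310/flume/logs
agent.sinks.hdfssink.channel = mem

Start the agent with: $ flume-ng agent --conf-file flume.conf --name agent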
Avro
● An Apache project
● Data Serialization by Schema
● Supports rich data structures, defined in a JSON-like syntax
● Supports schema evolution
● Integrated with Hadoop I/O API
● Similar to Thrift and ProtocolBuffers
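An example schema sketch (hypothetical; it mirrors the employee fields used in the Hive slides). The union with null plus a default value is what allows old and new schema versions to coexist:

{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "id",    "type": "long"},
    {"name": "name",  "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}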
Oozie
● An Apache project
● Workflow Scheduler for Hadoop jobs
● Very close integration with the Hadoop API
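A skeletal workflow sketch that runs a single MapReduce action (the ${...} properties are placeholders supplied at submission time):

<workflow-app name="wordcount-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="wordcount"/>
  <action name="wordcount">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property><name>mapred.input.dir</name><value>${inputDir}</value></property>
        <property><name>mapred.output.dir</name><value>${outputDir}</value></property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>Workflow failed</message></kill>
  <end name="end"/>
</workflow-app>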
Mesos
● Apache project
● Cluster manager that abstracts resources
● Integrated with Hadoop to allocate resources
● Scalable to 10,000 nodes
● Supports physical machines, VMs, Docker
● Multi resource scheduler (memory, CPU, disk, ports)
● Web UI for viewing cluster status