Full stack analytics with Hadoop 2

Gabriele Modena, Research Engineer at Improve Digital
Trento, 2014-09-11 
CS.ML
Data Scientist
ML & Data Mining
Academia & Industry
Learning Hadoop 2 for Packt_Publishing (together with Garry Turkington). TBD.
This talk is about tools
Your mileage may vary
I will avoid benchmarks
Back in 2012 
HDFS
Google paper (2003)
Distributed storage
Block ops
[diagram: a Name Node coordinating block operations across several Data Nodes]
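To make the division of labour concrete, here is a minimal client-side sketch (the path is illustrative): the FileSystem API asks the Name Node where blocks live, and the returned stream reads the bytes from the Data Nodes.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Metadata lookup goes to the Name Node; block reads go to Data Nodes
    FSDataInputStream in = fs.open(new Path("/data/tweets.tsv"));
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    reader.close();
  }
}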
MapReduce
Google paper (2006)
Divide and conquer functional model
Concepts from database research
Batch workloads
Aggregation operations (e.g. GROUP BY)
Two phases 
Map 
Reduce 
Programs are chains of jobs
All in all
Great when records (jobs) are independent
Composability monsters
Computation vs. communication tradeoff
Low level API (see the word count sketch below)
Tuning required
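To see why the API counts as low level: this is essentially the stock Hadoop 2 word count (a sketch; the class name is mine). Even the simplest aggregation needs a mapper, a reducer and a driver.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MRWordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one); // emit (word, 1)
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum)); // emit (word, total)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(MRWordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}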
Computation with MapReduce
Crunch
Higher level abstractions, still geared towards batch workloads
Dremel (Impala, Drill)
Google paper (2010)
Access blocks directly from data nodes (partition the fs namespace)
Columnar store (optimize for OLAP)
Appeals to database / BI crowds
Ridiculously fast (as long as you have memory)
Computation beyond MapReduce
Iterative workloads
Low latency queries
Real-time computation
High level abstractions
Hadoop 2, as a stack (top to bottom):
Applications (Hive, Pig, Crunch, Cascading, etc.)
Batch (MapReduce) | Interactive (Tez) | In memory (Spark) | Streaming (Storm, Spark, Samza) | Graph (Giraph) | HPC (MPI)
Resource Management (YARN)
HDFS
Full stack analytics with Hadoop 2
Tez (Dryad)
Microsoft paper (2007)
Generalization of MapReduce as dataflow
Express dependencies, I/O pipelining
Low level API for building DAGs
Mainly an execution engine (Hive-on-Tez, Pig-on-Tez)
DAG dag = new DAG("WordCount"); 
dag.addVertex(tokenizerVertex) 
.addVertex(summerVertex) 
.addEdge( 
new Edge(tokenizerVertex, summerVertex, 
edgeConf.createDefaultEdgeProperty())); 
package org.apache.tez.mapreduce.examples;

import java.io.IOException;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileAlreadyExistsException;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.tez.client.TezClient;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.InputDescriptor;
import org.apache.tez.dag.api.OutputDescriptor;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.TezConfiguration;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.dag.api.client.DAGClient;
import org.apache.tez.dag.api.client.DAGStatus;
import org.apache.tez.mapreduce.committer.MROutputCommitter;
import org.apache.tez.mapreduce.common.MRInputAMSplitGenerator;
import org.apache.tez.mapreduce.hadoop.MRHelpers;
import org.apache.tez.mapreduce.input.MRInput;
import org.apache.tez.mapreduce.output.MROutput;
import org.apache.tez.mapreduce.processor.SimpleMRProcessor;
import org.apache.tez.runtime.api.Output;
import org.apache.tez.runtime.library.api.KeyValueReader;
import org.apache.tez.runtime.library.api.KeyValueWriter;
import org.apache.tez.runtime.library.api.KeyValuesReader;
import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfigurer;
import org.apache.tez.runtime.library.partitioner.HashPartitioner;

import com.google.common.base.Preconditions;

public class WordCount extends Configured implements Tool {

  public static class TokenProcessor extends SimpleMRProcessor {
    IntWritable one = new IntWritable(1);
    Text word = new Text();

    @Override
    public void run() throws Exception {
      Preconditions.checkArgument(getInputs().size() == 1);
      Preconditions.checkArgument(getOutputs().size() == 1);
      MRInput input = (MRInput) getInputs().values().iterator().next();
      KeyValueReader kvReader = input.getReader();
      Output output = getOutputs().values().iterator().next();
      KeyValueWriter kvWriter = (KeyValueWriter) output.getWriter();
      while (kvReader.next()) {
        StringTokenizer itr = new StringTokenizer(kvReader.getCurrentValue().toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          kvWriter.write(word, one);
        }
      }
    }
  }

  public static class SumProcessor extends SimpleMRProcessor {
    @Override
    public void run() throws Exception {
      Preconditions.checkArgument(getInputs().size() == 1);
      MROutput out = (MROutput) getOutputs().values().iterator().next();
      KeyValueWriter kvWriter = out.getWriter();
      KeyValuesReader kvReader = (KeyValuesReader) getInputs().values().iterator().next()
          .getReader();
      while (kvReader.next()) {
        Text word = (Text) kvReader.getCurrentKey();
        int sum = 0;
        for (Object value : kvReader.getCurrentValues()) {
          sum += ((IntWritable) value).get();
        }
        kvWriter.write(word, new IntWritable(sum));
      }
    }
  }

  private DAG createDAG(FileSystem fs, TezConfiguration tezConf,
      Map<String, LocalResource> localResources, Path stagingDir,
      String inputPath, String outputPath) throws IOException {

    Configuration inputConf = new Configuration(tezConf);
    inputConf.set(FileInputFormat.INPUT_DIR, inputPath);
    InputDescriptor id = new InputDescriptor(MRInput.class.getName())
        .setUserPayload(MRInput.createUserPayload(inputConf,
            TextInputFormat.class.getName(), true, true));

    Configuration outputConf = new Configuration(tezConf);
    outputConf.set(FileOutputFormat.OUTDIR, outputPath);
    OutputDescriptor od = new OutputDescriptor(MROutput.class.getName())
        .setUserPayload(MROutput.createUserPayload(
            outputConf, TextOutputFormat.class.getName(), true));

    Vertex tokenizerVertex = new Vertex("tokenizer", new ProcessorDescriptor(
        TokenProcessor.class.getName()), -1, MRHelpers.getMapResource(tezConf));
    tokenizerVertex.addInput("MRInput", id, MRInputAMSplitGenerator.class);

    Vertex summerVertex = new Vertex("summer",
        new ProcessorDescriptor(
            SumProcessor.class.getName()), 1, MRHelpers.getReduceResource(tezConf));
    summerVertex.addOutput("MROutput", od, MROutputCommitter.class);

    OrderedPartitionedKVEdgeConfigurer edgeConf = OrderedPartitionedKVEdgeConfigurer
        .newBuilder(Text.class.getName(), IntWritable.class.getName(),
            HashPartitioner.class.getName(), null).build();

    DAG dag = new DAG("WordCount");
    dag.addVertex(tokenizerVertex)
        .addVertex(summerVertex)
        .addEdge(
            new Edge(tokenizerVertex, summerVertex, edgeConf.createDefaultEdgeProperty()));
    return dag;
  }

  private static void printUsage() {
    System.err.println("Usage: " + " wordcount <in1> <out1>");
    ToolRunner.printGenericCommandUsage(System.err);
  }

  public boolean run(String inputPath, String outputPath, Configuration conf) throws Exception {
    System.out.println("Running WordCount");
    // conf and UGI
    TezConfiguration tezConf;
    if (conf != null) {
      tezConf = new TezConfiguration(conf);
    } else {
      tezConf = new TezConfiguration();
    }
    UserGroupInformation.setConfiguration(tezConf);
    String user = UserGroupInformation.getCurrentUser().getShortUserName();

    // staging dir
    FileSystem fs = FileSystem.get(tezConf);
    String stagingDirStr = Path.SEPARATOR + "user" + Path.SEPARATOR
        + user + Path.SEPARATOR + ".staging" + Path.SEPARATOR
        + Path.SEPARATOR + Long.toString(System.currentTimeMillis());
    Path stagingDir = new Path(stagingDirStr);
    tezConf.set(TezConfiguration.TEZ_AM_STAGING_DIR, stagingDirStr);
    stagingDir = fs.makeQualified(stagingDir);

    // No need to add jar containing this class as assumed to be part of
    // the tez jars.
    // TEZ-674 Obtain tokens based on the Input / Output paths. For now assuming staging dir
    // is the same filesystem as the one used for Input/Output.
    TezClient tezSession = new TezClient("WordCountSession", tezConf);
    tezSession.start();

    DAGClient dagClient = null;
    try {
      if (fs.exists(new Path(outputPath))) {
        throw new FileAlreadyExistsException("Output directory "
            + outputPath + " already exists");
      }
      Map<String, LocalResource> localResources =
          new TreeMap<String, LocalResource>();
      DAG dag = createDAG(fs, tezConf, localResources,
          stagingDir, inputPath, outputPath);

      tezSession.waitTillReady();
      dagClient = tezSession.submitDAG(dag);

      // monitoring
      DAGStatus dagStatus = dagClient.waitForCompletionWithAllStatusUpdates(null);
      if (dagStatus.getState() != DAGStatus.State.SUCCEEDED) {
        System.out.println("DAG diagnostics: " + dagStatus.getDiagnostics());
        return false;
      }
      return true;
    } finally {
      fs.delete(stagingDir, true);
      tezSession.stop();
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      printUsage();
      return 2;
    }
    WordCount job = new WordCount();
    job.run(otherArgs[0], otherArgs[1], conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new WordCount(), args);
    System.exit(res);
  }
}
Spark
AMPLab paper (2010), builds on Dryad
Resilient Distributed Datasets (RDDs)
High level API (and a REPL)
Also an execution engine (Hive-on-Spark, Pig-on-Spark)
JavaRDD<String> file = spark.textFile("hdfs://infile.txt");

JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});

JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});

JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});

counts.saveAsTextFile("hdfs://outfile.txt");
Rule of thumb
Avoid spill-to-disk
Spark and Tez don't mix well
Join on 50+ TB = Hive+Tez, MapReduce
Direct access to API (in memory) = Spark
OLAP = Hive+Tez, Cloudera Impala
Good stuff. So what?
The data <adjective>
[diagram: S3, mysql, nfs, … feeding HDFS via ingestion, with metadata, processing and workflow coordination around it]
Analytics on Hadoop 2
Batch & interactive
Data warehousing & computing
Dataset size and velocity
Integrations with existing tools
Distributions will constrain your stack
Use cases
Data warehousing
Exploratory Data Analysis
Stream processing
Predictive Analytics
Data warehousing
Data ingestion
Pipelines
Transform and enrich (ETL) queries - batch
Low latency (presentation) queries - interactive
Interoperable data formats and metadata
Workflow orchestration
Collection and ingestion
$ hadoop distcp <src> <dst>
e.g. $ hadoop distcp s3n://bucket/tweets hdfs:///data/tweets (paths are illustrative)
Once data is in HDFS
Apache Hive
HiveQL
Data stored on HDFS
Metadata kept in MySQL (metastore)
Metadata exposed to third parties (HCatalog)
Suitable both for interactive and batch queries
set hive.execution.engine=tez
set hive.execution.engine=mr
The nature of Hive tables
CREATE TABLE and (LOAD DATA) produce metadata
Schema based on the data "as it has already arrived"
Data files underlying a Hive table are no different from any other file on HDFS
Primitive types behave as in Java
Data formats
Record oriented (Avro, text)
Column oriented (Parquet, ORC)
Text (tab separated)

create external table tweets
(
  created_at string,
  tweet_id string,
  text string,
  in_reply_to string,
  retweeted boolean,
  user_id string,
  place_id string
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '$input'

$ hadoop fs -cat /data/tweets.tsv
2014-03-12T17:34:26.000Z  443802208698908672  Oh &amp; I'm chuffed for @GeraintThomas86, doing Wales proud in yellow!! #ParisNice #Cymru  NULL  223224878  NULL
2014-03-12T17:34:26.000Z  443802208706908160  Stalker48 Kembali Lagi Cek Disini http://t.co/4BMTFByFH5 236  NULL  629845435  NULL
2014-03-12T17:34:26.000Z  443802208728268800  @Piconn ou melhor, eu era :c mudei  NULL  255768055  NULL
2014-03-12T17:34:26.000Z  443802208698912768  I swear Ryan's always in his own world. He's always like 4 hours behind everyone else.  NULL  2379282889  NULL
2014-03-12T17:34:26.000Z  443802208702713856  @maggersforever0 lmfao you gotta see this, its awesome http://t.co/1PvXEELlqi  NULL  355858832  NULL
2014-03-12T17:34:26.000Z  443802208698896384  Crazy... http://t.co/G4QRMSKGkh  NULL  106439395  NULL
SELECT COUNT(*) 
FROM tweets
Apache Avro
Record oriented
Migrations (forward, backward)
Schema on write
Interoperability

{
  "namespace": "com.mycompany.avrotables",
  "name": "tweets",
  "type": "record",
  "fields": [
    {"name": "created_at", "type": "string", "doc": "date_time of tweet"},
    {"name": "tweet_id_str", "type": "string"},
    {"name": "text", "type": "string"},
    {"name": "in_reply_to", "type": ["string", "null"]},
    {"name": "is_retweeted", "type": ["string", "null"]},
    {"name": "user_id", "type": "string"},
    {"name": "place_id", "type": ["string", "null"]}
  ]
}

CREATE TABLE tweets
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
  'avro.schema.url'='hdfs:///schema/avro/tweets_avro.avsc'
)
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';

insert into table tweets select * from tweets_ext;
Some thoughts on schemas
Only make additive changes
Think about schema distribution
Manage schema versions explicitly
Parquet
Ad hoc use case
Cloudera Impala's default file format
Execution engine agnostic
HIVE-5783
Let it handle block size

create table tweets (
  created_at string,
  tweet_id string,
  text string,
  in_reply_to string,
  retweeted boolean,
  user_id string,
  place_id string
) STORED AS PARQUET;

insert into table tweets
select * from tweets_ext;
If possible, use both
Table Optimization
Create tables with workloads in mind
Partitions (example below)
Bucketing
Join strategies
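As an illustration (the table name, columns and bucket count are mine, not from the talk), a partitioned and bucketed variant of the tweets table; the dynamic partition settings on the next slide, plus hive.enforce.bucketing=true, are needed for the insert:

create table tweets_by_day (
  created_at string,
  tweet_id string,
  text string,
  user_id string
)
partitioned by (dt string)              -- one partition per day
clustered by (user_id) into 32 buckets  -- helps user_id joins and sampling
stored as parquet;

insert into table tweets_by_day partition (dt)
select created_at, tweet_id, text, user_id,
       substr(created_at, 1, 10) as dt  -- derive the partition key from the timestamp
from tweets;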
Plenty of tunables

# partitions
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=10000;
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.created.files=1000000;

# merge small files
SET hive.merge.size.per.task=256000000;
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=16000000;

# Compression
SET mapred.output.compress=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.Lz4Codec;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.Lz4Codec;
Apache Oozie
Data pipelines
Workflow execution and coordination
Time and availability based execution
Configuration over code
MapReduce centric
Actions: Hive, Pig, fs, shell, sqoop

<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
  ...
  <action name="[NODE-NAME]">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>[JOB-TRACKER]</job-tracker>
      <name-node>[NAME-NODE]</name-node>
      <prepare>
        <delete path="[PATH]"/>
        ...
        <mkdir path="[PATH]"/>
        ...
      </prepare>
      <job-xml>[HIVE SETTINGS FILE]</job-xml>
      <configuration>
        <property>
          <name>[PROPERTY-NAME]</name>
          <value>[PROPERTY-VALUE]</value>
        </property>
        ...
      </configuration>
      <script>[HIVE-SCRIPT]</script>
      <param>[PARAM-VALUE]</param>
      ...
      <param>[PARAM-VALUE]</param>
      <file>[FILE-PATH]</file>
      ...
      <archive>[FILE-PATH]</archive>
      ...
    </hive>
    <ok to="[NODE-NAME]"/>
    <error to="[NODE-NAME]"/>
  </action>
  ...
</workflow-app>
EDA
[plot: luminosity in xkcd comics (courtesy of R-bloggers)]
Sample the dataset
Use Hive-on-Tez, Impala
Spark & IPython Notebook

from pyspark import SparkContext
sc = SparkContext(CLUSTER_URL, 'ipython-notebook')

Works with Avro, Parquet etc.
Move computation close to data
NumPy, scikit-learn, matplotlib
Setup can be tedious
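A toy session in such a notebook (path, fraction and seed are illustrative): sample the dataset down to something that fits in the driver, then switch to the usual Python tooling.

# sample 1% of a TSV on HDFS and summarise it locally
import numpy as np

data = sc.textFile('hdfs:///data/tweets.tsv')
sample = data.sample(False, 0.01, 42)  # withReplacement, fraction, seed
lengths = np.array(sample.map(lambda t: len(t)).collect())
print lengths.mean(), lengths.std()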
Stream processing
Statistics in real time
Data feeds
Machine generated (sensor data, logs)
Predictive analytics
Several niches
Low latency (Storm, S4)
Persistency and resiliency (Samza)
Apply complex logic (Spark Streaming)
Type of message stream (Kafka)
Apache Samza
Kafka for streaming
YARN for resource management and exec
Samza API for processing
Sweet spot: seconds, minutes
[diagram: Samza API layered on YARN, layered on Kafka]
public void process( 
IncomingMessageEnvelope envelope, 
MessageCollector collector, 
TaskCoordinator coordinator)
public void window( 
MessageCollector collector, 
TaskCoordinator coordinator)
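Put together, a minimal task sketch using both callbacks (the counting logic and the kafka/counts output stream are my assumptions, not from the talk): process() accumulates per-key counts, window() flushes them on every window tick.

import java.util.HashMap;
import java.util.Map;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;
import org.apache.samza.task.WindowableTask;

public class CountTask implements StreamTask, WindowableTask {
  // Hypothetical output: a "counts" topic on the "kafka" system
  private static final SystemStream OUTPUT = new SystemStream("kafka", "counts");
  private final Map<String, Integer> counts = new HashMap<String, Integer>();

  @Override
  public void process(IncomingMessageEnvelope envelope,
      MessageCollector collector, TaskCoordinator coordinator) {
    String key = (String) envelope.getMessage();
    Integer n = counts.get(key);
    counts.put(key, n == null ? 1 : n + 1);
  }

  @Override
  public void window(MessageCollector collector, TaskCoordinator coordinator) {
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      collector.send(new OutgoingMessageEnvelope(OUTPUT, e.getKey(), e.getValue()));
    }
    counts.clear(); // counts are per window interval
  }
}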
Bootstrap streams
Samza can consume messages from multiple streams
Rewind on historical data does not preserve ordering
If a task has any bootstrap streams defined, then it will read these streams until they are fully processed
Predictive modelling 
Learning from data
Predictive model = statistical learning
Simple = parallelizable
Garbage in = garbage out
A couple of things we can do
1. Parameter tuning 
2. Feature engineering 
3. Learn on all data 
Train against all data
Ensemble methods (cooperative and competitive)
Avoid multi-pass / iterations
Apply models to live data
Keep models up to date
Off the shelf
Apache Mahout (MapReduce, Spark)
MLlib (Spark) - see the sketch below
Cascading-pattern (MapReduce, Tez, Spark)
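As a taste of the MLlib route (Spark 1.x era API; the CSV layout and paths are made up), training a logistic regression off the shelf:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

public class TrainModel {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local", "train");
    // Hypothetical layout: label,feature1,feature2 per line
    JavaRDD<LabeledPoint> points = sc.textFile("hdfs:///data/points.csv").map(
        new Function<String, LabeledPoint>() {
          public LabeledPoint call(String line) {
            String[] parts = line.split(",");
            return new LabeledPoint(Double.parseDouble(parts[0]),
                Vectors.dense(Double.parseDouble(parts[1]),
                    Double.parseDouble(parts[2])));
          }
        });
    LogisticRegressionModel model =
        LogisticRegressionWithSGD.train(points.rdd(), 100); // 100 SGD iterations
    System.out.println("prediction: " + model.predict(Vectors.dense(0.5, 1.5)));
  }
}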
Apache Mahout 0.9
Once the default solution for ML with MapReduce
Quality may vary
Good components are really good
Is it a library? A framework? A recommendation system?
The good
The go-to if you need a Recommendation System
SGD (optimization)
Random Forest (classification/regression)
SVD (feature engineering)
ALS (collaborative filtering)
The puzzling
SVM?
Model updates are implementation specific!
Feature encoding and input format are often model specific
Apache Mahout trunk
Moving away from MapReduce
Spark + Scala DSL = new classes of algorithms
Major code cleanup
It needs major infrastructure work around it
batch + streaming
There’s a buzzword for that 
http://lambda-architecture.net/ 
Wrap up
With Hadoop 2
Cluster as an Operating System
YARN, mostly
Multiparadigm, better interop
Same system, different tools, multiple use cases
Batch + interactive
This said
Ops is where a lot of time goes
Building clusters is hard
Distro fragmentation
Bleeding edge rush
Heavy lifting needed
That’s all, folks
Thanks for having me
Let’s discuss