Optimized Big Data: Efficient Architectures
for Building MapReduce Pipelines
Fabiane Bizinella Nardon
@fabianenardon
Me and Big Data
How big is BIG?
HOW TO KNOW IF YOUR DATA IS REALLY BIG:
All of your data doesn't fit
on a single machine
(photo by Fernando Stankuns)
HOW TO KNOW IF YOUR DATA IS REALLY BIG:
You talk more in Terabytes
than in Gigabytes
HOW TO KNOW IF YOUR DATA IS REALLY BIG:
The amount of data you process grows
constantly. And it should double next year.
(photo by Saulo Cruz)
FOR EVERYTHING ELSE:
KEEP IT SIMPLE!
Hadoop
HBase
Hive
Crunch
HDFS
Cascading
Pig Mahout
Redis
MongoDB
MySQL
Cassandra
[Diagram: Data → Map → Reduce → New Data → Map → Reduce; the output of one MapReduce job becomes the input of the next, forming a pipeline.]
Pipeline (Example)
http://www.tailtarget.com/home/
http://cnn.com/news
http://www.tailtarget.com/about/
http://www.tailtarget.com/home/ - Technology
http://cnn.com/news - News
http://www.tailtarget.com/about/ - Technology
u=0C010003 - Technology
u=12070002 - News
u=00AD0e12 - Technology
u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
u=12070002 - http://cnn.com/news - 189.19.123.161
u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127
u=0C010003 - http://www.tailtarget.com/home/
u=12070002 - http://cnn.com/news
u=00AD0e12 - http://www.tailtarget.com/about/
MapReduce Pipelines
- Tools -
Orchestrate
Chain
Optimize
Hadoop
HBase
Hive
Crunch
HDFS
Cascading
Pig
Mahout
Redis
MongoDB
MySQL
Cassandra
Apache Crunch
A library for building MapReduce pipelines
on top of Hadoop
Interleaves and orchestrates different
MapReduce functions
As a bonus, it optimizes and simplifies the
implementation of MapReduce
Based on FlumeJava: Easy, Efficient Data-Parallel
Pipelines (Google, 2010)
Crunch – Anatomy of a Pipeline
[Diagram: Data Source → DoFn 1 → PCollection* → DoFn 2 → PTable* → Write → Data Target, spanning Hadoop Node 1 and Hadoop Node 2, each backed by HDFS.]
* PCollection, PTable, or PGroupedTable
Crunch – Anatomy of a Pipeline
[Same diagram: each DoFn is attached to the pipeline through parallelDo().]
Crunch – Anatomy of a Pipeline
[Variation of the diagram: DoFn 1 runs replicated across both Hadoop nodes before feeding DoFn 2, and results flow to more than one Data Target.]
Crunch – Anatomy of a Pipeline
[Variation of the diagram: the replicated DoFns and multiple Data Targets execute in parallel across the nodes, all wired up through parallelDo().]
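As a sketch of the anatomy above (ExtractUrl and ExtractHost are hypothetical DoFns, not from the deck): consecutive parallelDo() calls like these are fused by the Crunch planner into a single map phase instead of two separate MapReduce jobs.

// Minimal sketch: two chained DoFns that Crunch fuses into one map phase.
PCollection<String> lines = pipeline.readTextFile("my/file");                         // Data Source
PCollection<String> urls = lines.parallelDo(new ExtractUrl(), Writables.strings());   // DoFn 1
PCollection<String> hosts = urls.parallelDo(new ExtractHost(), Writables.strings());  // DoFn 2
pipeline.writeTextFile(hosts, "my/output");                                           // Write to the Data Target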
u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
u=12070002 - http://cnn.com/news - 189.19.123.161
u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127
tailtarget.com - 2
cnn.com - 1
// Build the pipeline: read the log, count visitors per host, write the totals.
Pipeline pipeline = new MRPipeline(SimpleNaiveMapReduce.class, getConf());
PCollection<String> lines = pipeline.readTextFile("my/file");
// parallelDo() applies the DoFn to every line, producing (host, 1) pairs.
PTable<String, Integer> visitors = lines.parallelDo("Count Visitors",
        new NaiveCountVisitors(),
        Writables.tableOf(Writables.strings(), Writables.ints()));
// groupByKey() triggers the shuffle; combineValues() sums the counts per host.
PGroupedTable<String, Integer> grouped = visitors.groupByKey();
PTable<String, Integer> counts = grouped.combineValues(Aggregators.<String>SUM_INTS());
pipeline.writeTextFile(counts, "my/output/file");
// Nothing runs until done(): Crunch plans and submits the MapReduce jobs here.
PipelineResult pipelineResult = pipeline.done();
public class NaiveCountVisitors extends DoFn<String, Pair<String, Integer>> {
    public void process(String line, Emitter<Pair<String, Integer>> emitter) {
        String[] parts = line.split(" ");
        try {
            // Emit (host, 1) for every visit; the counting happens downstream.
            URL url = new URL(parts[2]);
            emitter.emit(Pair.of(url.getHost(), 1));
        } catch (MalformedURLException e) {
            // Skip malformed log lines instead of failing the task.
        }
    }
}
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private Text page = new Text();
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] parts = line.split(" ");
        // MalformedURLException is an IOException, so it propagates naturally.
        page.set(new URL(parts[2]).getHost());
        context.write(page, one);
    }
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable counter = new IntWritable();
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        counter.set(count);
        context.write(key, counter);
    }
}
PGroupedTable<String, Integer> grouped = visitors.groupByKey();
PTable<String, Integer> counts = grouped.combineValues(Aggregators.<String>SUM_INTS());
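The two Crunch lines above replace the whole Reduce class. For comparison, the classic Hadoop version also needs driver boilerplate along the lines of this sketch (CountVisitorsJob and the paths are illustrative, not from the deck):

// Illustrative plain-Hadoop driver for the Map/Reduce classes above.
Job job = Job.getInstance(getConf(), "count-visitors");
job.setJarByClass(CountVisitorsJob.class);   // hypothetical enclosing class
job.setMapperClass(Map.class);
job.setCombinerClass(Reduce.class);          // a summing reducer can double as combiner
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path("my/file"));
FileOutputFormat.setOutputPath(job, new Path("my/output/file"));
job.waitForCompletion(true);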
MapReduce (Without Crunch)
MapReduce (With Crunch)
[Diagram: classic MapReduce data flow. HDFS chunks → Record Reader → Map → Combine → Local Storage, then Copy/Sort across the network into Reduce.]
MapReduce without Crunch (left columns) vs. with Crunch (right columns):

| Counter | Map (no Crunch) | Reduce (no Crunch) | Total (no Crunch) | Map (Crunch) | Reduce (Crunch) | Total (Crunch) |
|---|---|---|---|---|---|---|
| SLOTS_MILLIS_MAPS | 0 | 0 | 1,635,906 | 0 | 0 | 1,434,544 |
| SLOTS_MILLIS_REDUCES | 0 | 0 | 870,082 | 0 | 0 | 384,755 |
| FILE_BYTES_WRITTEN | 1,907,284,471 | 956,106,354 | 2,863,390,825 | 3,776,871 | 681,575 | 4,458,446 |
| Map input records | 33,809,720 | 0 | 33,809,720 | 33,809,720 | 0 | 33,809,720 |
| Map output records | 33,661,880 | 0 | 33,661,880 | 33,661,880 | 0 | 33,661,880 |
| Combine input records | 0 | 0 | 0 | 33,714,223 | 0 | 33,714,223 |
| Combine output records | 0 | 0 | 0 | 74,295 | 0 | 74,295 |
| Reduce input records | 0 | 33,661,880 | 33,661,880 | 0 | 21,952 | 21,952 |
| Reduce output records | 0 | 343 | 343 | 0 | 343 | 343 |
| Map output bytes | 888,738,480 | 0 | 888,738,480 | 888,738,480 | 0 | 888,738,480 |
| Map output materialized bytes | 956,063,008 | 0 | 956,063,008 | 657,536 | 0 | 657,536 |
| Reduce shuffle bytes | 0 | 940,985,238 | 940,985,238 | 0 | 657,536 | 657,536 |
| Physical memory (bytes) snapshot | 11,734,376,448 | 527,491,072 | 12,261,867,520 | 12,008,472,576 | 86,654,976 | 12,095,127,552 |
| Spilled Records | 67,103,496 | 33,661,880 | 100,765,376 | 74,295 | 21,952 | 96,247 |
| Total committed heap usage (bytes) | 10,188,226,560 | 396,902,400 | 10,585,128,960 | 10,188,226,560 | 59,441,152 | 10,247,667,712 |
| CPU time spent (ms) | 456,03 | 79,84 | 535,87 | 450,26 | 6,23 | 456,49 |
// In-mapper combining: accumulate counts in memory and emit them only once
// per map task, in cleanup(), instead of once per input line.
private Map<String, Integer> items = null;

public void initialize() {
    items = new HashMap<String, Integer>();
}

public void process(String line, Emitter<Pair<String, Integer>> emitter) {
    String[] parts = line.split(" ");
    try {
        String host = new URL(parts[2]).getHost();
        Integer value = items.get(host);
        items.put(host, value == null ? 1 : value + 1);
    } catch (MalformedURLException e) {
        // Skip malformed log lines.
    }
}

public void cleanup(Emitter<Pair<String, Integer>> emitter) {
    // Emit one (host, count) pair per distinct host seen by this map task.
    for (Entry<String, Integer> item : items.entrySet()) {
        emitter.emit(Pair.of(item.getKey(), item.getValue()));
    }
}
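Plugging the pre-combining version in only changes the DoFn; assuming the class above is named CombiningCountVisitors (the name is not in the deck), the rest of the pipeline stays the same:

PTable<String, Integer> visitors = lines.parallelDo("Count Visitors",
        new CombiningCountVisitors(),
        Writables.tableOf(Writables.strings(), Writables.ints()));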
Crunch with pre-combining
With Crunch (left columns) vs. with pre-combining (right columns):

| Counter | Map (Crunch) | Reduce (Crunch) | Total (Crunch) | Map (pre-combine) | Reduce (pre-combine) | Total (pre-combine) |
|---|---|---|---|---|---|---|
| SLOTS_MILLIS_MAPS | 0 | 0 | 1,434,544 | 0 | 0 | 1,151,037 |
| SLOTS_MILLIS_REDUCES | 0 | 0 | 384,755 | 0 | 0 | 614,053 |
| FILE_BYTES_WRITTEN | 3,776,871 | 681,575 | 4,458,446 | 2,225,432 | 706,037 | 2,931,469 |
| Map input records | 33,809,720 | 0 | 33,809,720 | 33,809,720 | 0 | 33,809,720 |
| Map output records | 33,661,880 | 0 | 33,661,880 | 21,952 | 0 | 21,952 |
| Combine input records | 33,714,223 | 0 | 33,714,223 | 21,952 | 0 | 21,952 |
| Combine output records | 74,295 | 0 | 74,295 | 21,952 | 0 | 21,952 |
| Reduce input records | 0 | 21,952 | 21,952 | 0 | 21,952 | 21,952 |
| Reduce output records | 0 | 343 | 343 | 0 | 343 | 343 |
| Map output bytes | 888,738,480 | 0 | 888,738,480 | 613,248 | 0 | 613,248 |
| Map output materialized bytes | 657,536 | 0 | 657,536 | 657,92 | 0 | 657,92 |
| Reduce shuffle bytes | 0 | 657,536 | 657,536 | 0 | 657,92 | 657,92 |
| Physical memory (bytes) snapshot | 12,008,472,576 | 86,654,976 | 12,095,127,552 | 12,065,873,920 | 171,278,336 | 12,237,152,256 |
| Spilled Records | 74,295 | 21,952 | 96,247 | 21,952 | 21,952 | 43,904 |
| Total committed heap usage (bytes) | 10,188,226,560 | 59,441,152 | 10,247,667,712 | 10,188,226,560 | 118,882,304 | 10,307,108,864 |
| CPU time spent (ms) | 450,26 | 6,23 | 456,49 | 295,79 | 9,86 | 305,65 |
Bottlenecks are usually caused
by the amount of data that
travels across the network
Architecting the Pipeline
http://www.tailtarget.com/home/
http://cnn.com/news
http://www.tailtarget.com/about/
http://www.tailtarget.com/home/ - Technology
http://cnn.com/news - News
http://www.tailtarget.com/about/ - Technology
u=0C010003 - Technology
u=12070002 - News
u=00AD0e12 - Technology
u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
u=12070002 - http://cnn.com/news - 189.19.123.161
u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127
u=0C010003 - http://www.tailtarget.com/home/
u=12070002 - http://cnn.com/news
u=00AD0e12 - http://www.tailtarget.com/about/
Architecting the Pipeline
[The same data flow as above, drawn as a DAG: the MapReduce jobs are numbered 1–6, and a Merge step joins their intermediate outputs.]
Architecting the Pipeline
{http://www.tailtarget.com/home/, [u=0C010003], Technology}
{http://cnn.com/news, [u=12070002], News}
{http://www.tailtarget.com/about/, [u=00AD0e12], Technology}
{http://www.tailtarget.com/home/, [u=0C010003]}
{http://cnn.com/news, [u=12070002]}
{http://www.tailtarget.com/about/, [u=00AD0e12]}
u=0C010003 - Technology
u=12070002 - News
u=00AD0e12 - Technology
u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
u=12070002 - http://cnn.com/news - 189.19.123.161
u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127
[Diagram: steps numbered 1–4, merging the raw logs into one record per URL with its users and category.]
Do as much as you can with
the data you have at hand
Architecting the Pipeline
http://www.tailtarget.com/home/
http://cnn.com/news
http://www.tailtarget.com/about/
http://www.tailtarget.com/home/ - Technology
http://cnn.com/news - News
http://www.tailtarget.com/about/ - Technology
u=0C010003 - Technology
u=12070002 - News
u=00AD0e12 - Technology
u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
u=12070002 - http://cnn.com/news - 189.19.123.161
u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127
u=0C010003 - http://www.tailtarget.com/home/
u=12070002 - http://cnn.com/news
u=00AD0e12 - http://www.tailtarget.com/about/
Architecting the Pipeline
u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
u=12070002 - http://cnn.com/news - 189.19.123.161
u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127
u=0C010003 - http://www.tailtarget.com/home/
u=12070002 - http://cnn.com/news
u=00AD0e12 - http://www.tailtarget.com/about/
http://www.tailtarget.com/home/
http://cnn.com/news
http://www.tailtarget.com/home/ - Technology
http://cnn.com/news - News
u=0C010003 - Technology
u=12070002 - News
u=00AD0e12 - Technology
[Diagram: the same DAG with its Merge step, steps numbered 1–6, now with Redis storing intermediate results.]
Architecting the Pipeline
u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
u=12070002 - http://cnn.com/news - 189.19.123.161
u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127
u=0C010003 - http://www.tailtarget.com/home/
u=12070002 - http://cnn.com/news
u=00AD0e12 - http://www.tailtarget.com/about/
http://www.tailtarget.com/home/
http://cnn.com/news
http://www.tailtarget.com/home/ - Technology
http://cnn.com/news - News
u=0C010003 - Technology
u=12070002 - News
u=00AD0e12 - Technology
Redis
[Diagram: the datasets above are numbered 1–6; two pipelines share dataset 1 as input and run independently.]
Pipeline A: Input: 1, Output: 2, 4
Pipeline B: Input: 1, Output: 3, 6
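Because Crunch plans the whole graph lazily, both pipelines can hang off the same input and be submitted together; a minimal sketch, assuming hypothetical ClassifyA and ClassifyB DoFns:

// One Crunch pipeline, one shared input (1), two independent outputs.
PCollection<String> input = pipeline.readTextFile("my/input");
pipeline.writeTextFile(input.parallelDo(new ClassifyA(), Writables.strings()), "my/output/a");
pipeline.writeTextFile(input.parallelDo(new ClassifyB(), Writables.strings()), "my/output/b");
pipeline.done(); // the planner schedules both branches, maximizing parallelism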
// After done(), Crunch exposes the planned execution graph as a Graphviz
// .dot file in the job configuration.
PipelineResult pipelineResult = pipeline.done();
String dotFileContents = pipeline.getConfiguration()
        .get(PlanningParameters.PIPELINE_PLAN_DOTFILE);
FileUtils.writeStringToFile(
        new File("/tmp/logpipelinegraph.dot"), dotFileContents);
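The resulting .dot file describes the planned job graph and can be rendered with Graphviz (for example, dot -Tpng /tmp/logpipelinegraph.dot -o plan.png) to inspect how the planner grouped your DoFns into MapReduce jobs.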
Maximize parallelism
Nothing has more impact
on your application's performance
than optimizing your
own code
Small Data*
*It's big data, remember?
Hadoop/HDFS don't handle
small files well.
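One common mitigation (a sketch, not from the deck): pack many small files into fewer, larger input splits, for example with Hadoop's CombineTextInputFormat in a plain MapReduce job:

// Assumes a plain-Hadoop Job reading many small text files.
// CombineTextInputFormat packs several small files into one split,
// so the job does not pay one map task per tiny file.
job.setInputFormatClass(CombineTextInputFormat.class);
CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024); // ~128 MB per split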
What if the source isn't a text file?
// Crunch's contrib DataBaseSource reads rows from a JDBC database
// (Visitors is a Writable bean representing one result row).
Pipeline pipeline = new MRPipeline(SimpleNaiveMapReduce.class, getConf());
DataBaseSource<Visitors> dbsrc = new DataBaseSource.Builder<Visitors>(Visitors.class)
        .setDriverClass(org.h2.Driver.class)
        .setUrl("jdbc://…").setUsername("root").setPassword("")
        .selectSQLQuery("SELECT URL, UID FROM TEST")
        .countSQLQuery("select count(*) from Test").build();
PCollection<Visitors> visitors = pipeline.read(dbsrc);
// Same pipeline from here on; CountVisitors stands for a variant of the
// DoFn that receives Visitors objects instead of raw text lines.
PTable<String, Integer> counts = visitors.parallelDo("Count Visitors",
        new CountVisitors(),
        Writables.tableOf(Writables.strings(), Writables.ints()));
…
PipelineResult pipelineResult = pipeline.done();
Or create your own data source…
public class RedisSource<T extends Writable> implements Source<T> {
    @Override
    public void configureSource(org.apache.hadoop.mapreduce.Job job, int inputId)
            throws IOException {
        // Wire the Redis connection details into the job configuration
        // (redisMasters, dbNumber, etc. are fields of this Source).
        Configuration configuration = job.getConfiguration();
        RedisConfiguration.configureDB(configuration,
                redisMasters, dbNumber, dataStructure);
        // Tell Hadoop to read splits and records through the custom InputFormat.
        job.setInputFormatClass(RedisInputFormat.class);
        RedisInputFormat.setInput(job, inputClass, redisMasters,
                dbNumber, sliceExpression, maxRecordsToReturn,
                dataStructure, null, null);
    }
}
What you have to implement:
- Read
- getSplits
- Write
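Under the hood, a custom Source delegates to a Hadoop InputFormat; a skeleton of what the (hypothetical) RedisInputFormat has to provide under the standard Hadoop contract:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.*;

// Skeleton only: the real class must decide how to slice Redis key
// ranges into splits and how to iterate over the records of a split.
public class RedisInputFormat<T extends Writable> extends InputFormat<T, NullWritable> {
    @Override
    public List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException {
        // One split per Redis master / key slice, so maps run in parallel.
        return new ArrayList<InputSplit>();
    }

    @Override
    public RecordReader<T, NullWritable> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException {
        // Iterates over the records of one split, feeding the map tasks.
        return null; // a RedisRecordReader in the real implementation
    }
}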
Make sure your Big Data is
really Big
Optimize your code. It will be
repeated MANY times
Design your pipeline to
maximize parallelism
In the cloud you have virtually
unlimited resources.
But so is the cost.
Optimized Big Data: Efficient Architectures
for Building MapReduce Pipelines
Fabiane Bizinella Nardon
@fabianenardon
Editor's Notes
  1. The bottleneck usually is caused by the amount of data going across the network