Optimized Big Data: Efficient Architectures for Building MapReduce Pipelines
Fabiane Bizinella Nardon
@fabianenardon

Cleaning, aggregating, analyzing, transforming: real-world data science applications usually involve running several processing steps, each one adding more value to your data. Architecting and orchestrating these pipelines efficiently demands a good deal of knowledge about the inner workings of MapReduce algorithms, plus a few tricks you only learn after processing several terabytes. This talk shows how to architect efficient MapReduce pipelines using the Apache Crunch framework, how to integrate your pipelines with external data sources such as Redis, MongoDB, or even relational databases, what the best granularity for your jobs is, and when investing in a MapReduce architecture really makes sense.

Talk presented by Fabiane Bizinella Nardon at QConSP 2013.

    1. Optimized Big Data: Efficient Architectures for Building MapReduce Pipelines (Fabiane Bizinella Nardon, @fabianenardon)
    2. Me and Big Data
    3. How big is BIG?
    4. HOW TO KNOW IF YOUR DATA IS REALLY BIG: all of your data does not fit on a single machine (photo: Fernando Stankuns)
    5. HOW TO KNOW IF YOUR DATA IS REALLY BIG: you talk in terabytes rather than gigabytes
    6. HOW TO KNOW IF YOUR DATA IS REALLY BIG: the amount of data you process grows constantly, and should double next year (photo: Saulo Cruz)
    7. FOR EVERYTHING ELSE: KEEP IT SIMPLE!
    8. Hadoop, HBase, Hive, Crunch, HDFS, Cascading, Pig, Mahout, Redis, MongoDB, MySQL, Cassandra
    9. Data → Map → Reduce → New Data (diagram)
    10. Data → Map → Reduce (diagram)
    11. Pipeline (Example): a diagram of the example pipeline's data sets.
        Access logs:
            u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
            u=12070002 - http://cnn.com/news - 189.19.123.161
            u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127
        User/page pairs:
            u=0C010003 - http://www.tailtarget.com/home/
            u=12070002 - http://cnn.com/news
            u=00AD0e12 - http://www.tailtarget.com/about/
        Pages:
            http://www.tailtarget.com/home/
            http://cnn.com/news
            http://www.tailtarget.com/about/
        Pages classified by topic:
            http://www.tailtarget.com/home/ - Tecnologia
            http://cnn.com/news - Notícias
            http://www.tailtarget.com/about/ - Tecnologia
        Users classified by topic:
            u=0C010003 - Tecnologia
            u=12070002 - Notícias
            u=00AD0e12 - Tecnologia
    12. MapReduce Pipelines, the tools: orchestrate, chain, optimize. Hadoop, HBase, Hive, Crunch, HDFS, Cascading, Pig, Mahout, Redis, MongoDB, MySQL, Cassandra
    13. Apache Crunch: a library for building MapReduce pipelines on top of Hadoop. It interleaves and orchestrates different MapReduce functions and, as a bonus, optimizes and simplifies the MapReduce implementation. Based on FlumeJava: Easy, Efficient Data-Parallel Pipelines (Google, 2010).
    14-17. Crunch – Anatomy of a Pipeline (a diagram, built up across slides 14 to 17): Data Source → parallelDo() → DoFN 1 → PCollection*/PTable* → DoFN 2 → Write → Data Target, with the DoFns distributed across Hadoop Node 1 and Hadoop Node 2 and the pipeline reading from and writing to HDFS. (* = PCollection, PTable, or PGroupedTable.) A minimal code skeleton of this shape follows.
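        The diagram maps directly onto the Crunch API. A sketch of that shape, where the driver class, DoFn names, and paths are hypothetical placeholders (the deck's real example comes on slides 19-27):

        // All names hypothetical; this only mirrors the diagram's structure.
        Pipeline pipeline = new MRPipeline(MyDriver.class, getConf());
        PCollection<String> input = pipeline.readTextFile("in/path");                   // Data Source
        PCollection<String> step1 = input.parallelDo(new DoFn1(), Writables.strings()); // DoFN 1
        PCollection<String> step2 = step1.parallelDo(new DoFn2(), Writables.strings()); // DoFN 2
        pipeline.writeTextFile(step2, "out/path");                                      // Write -> Data Target
        pipeline.done();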
    18. Example input and output for the code that follows. Input:
            u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
            u=12070002 - http://cnn.com/news - 189.19.123.161
            u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127
        Output (visits per domain):
            tailtarget.com - 2
            cnn.com - 1
    19-27. A simple Crunch pipeline (slides 19 to 27 step through this code line by line):

        Pipeline pipeline = new MRPipeline(SimpleNaiveMapReduce.class, getConf());
        // Data Source: read the log file into a PCollection
        PCollection<String> lines = pipeline.readTextFile("my/file");
        // DoFn: emit one (host, 1) pair per log line
        PTable<String, Integer> visitors = lines.parallelDo("Count Visitors",
            new NaiveCountVisitors(),
            Writables.tableOf(Writables.strings(), Writables.ints()));
        // Group by host and sum the counts
        PGroupedTable<String, Integer> grouped = visitors.groupByKey();
        PTable<String, Integer> counts = grouped.combineValues(Aggregators.<String>SUM_INTS());
        // Data Target: write the result
        pipeline.writeTextFile(counts, "my/output/file");
        // Nothing runs until done(): Crunch plans and launches the MapReduce jobs here
        PipelineResult pipelineResult = pipeline.done();
    28. public class NaiveCountVisitors extends DoFn<String, Pair<String, Integer>> {
            public void process(String line, Emitter<Pair<String, Integer>> emitter) {
                String[] parts = line.split(" ");
                try {
                    URL url = new URL(parts[2]);
                    emitter.emit(Pair.of(url.getHost(), 1));
                } catch (MalformedURLException e) {
                    throw new RuntimeException(e);   // new URL() throws a checked exception
                }
            }
        }
    29. public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable one = new IntWritable(1);
            private Text page = new Text();
            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String line = value.toString();
                String[] parts = line.split(" ");
                page.set(new URL(parts[2]).getHost());
                context.write(page, one);
            }
        }
    30. public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable counter = new IntWritable();
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int count = 0;
                for (IntWritable value : values) {
                    count = count + value.get();
                }
                counter.set(count);
                context.write(key, counter);
            }
        }

        // The Crunch equivalent of this entire Reducer:
        PGroupedTable<String, Integer> grouped = visitors.groupByKey();
        PTable<String, Integer> counts = grouped.combineValues(Aggregators.<String>SUM_INTS());
    31. MapReduce (without Crunch)
    32. MapReduce (with Crunch)
    33. Diagram of the MapReduce execution flow: HDFS chunks → Record Reader → Map → Combine → local storage → copy/sort → Reduce. (A sketch of wiring the Combine step in plain Hadoop follows.)
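        In plain Hadoop, the Combine step in this diagram is opt-in: you register a combiner class on the job, and the framework may run it over each map task's local output before the copy/sort phase. A sketch of the wiring, reusing the Reduce class from slide 30 as the combiner (legal here because summing is associative and commutative); the driver code itself is an assumption, not shown in the deck:

        Job job = Job.getInstance(getConf(), "count visitors");  // driver sketch, not from the slides
        job.setJarByClass(SimpleNaiveMapReduce.class);
        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class);   // Hadoop may invoke this zero or more times per map task
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("my/file"));
        FileOutputFormat.setOutputPath(job, new Path("my/output"));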
    34. Job counters, plain MapReduce vs. with Crunch:

        Counter                             | MapReduce (Map / Reduce / Total)                | With Crunch (Map / Reduce / Total)
        SLOTS_MILLIS_MAPS                   | 0 / 0 / 1,635,906                               | 0 / 0 / 1,434,544
        SLOTS_MILLIS_REDUCES                | 0 / 0 / 870,082                                 | 0 / 0 / 384,755
        FILE_BYTES_WRITTEN                  | 1,907,284,471 / 956,106,354 / 2,863,390,825     | 3,776,871 / 681,575 / 4,458,446
        Map input records                   | 33,809,720 / 0 / 33,809,720                     | 33,809,720 / 0 / 33,809,720
        Map output records                  | 33,661,880 / 0 / 33,661,880                     | 33,661,880 / 0 / 33,661,880
        Combine input records               | 0 / 0 / 0                                       | 33,714,223 / 0 / 33,714,223
        Combine output records              | 0 / 0 / 0                                       | 74,295 / 0 / 74,295
        Reduce input records                | 0 / 33,661,880 / 33,661,880                     | 0 / 21,952 / 21,952
        Reduce output records               | 0 / 343 / 343                                   | 0 / 343 / 343
        Map output bytes                    | 888,738,480 / 0 / 888,738,480                   | 888,738,480 / 0 / 888,738,480
        Map output materialized bytes       | 956,063,008 / 0 / 956,063,008                   | 657,536 / 0 / 657,536
        Reduce shuffle bytes                | 0 / 940,985,238 / 940,985,238                   | 0 / 657,536 / 657,536
        Physical memory (bytes) snapshot    | 11,734,376,448 / 527,491,072 / 12,261,867,520   | 12,008,472,576 / 86,654,976 / 12,095,127,552
        Spilled Records                     | 67,103,496 / 33,661,880 / 100,765,376           | 74,295 / 21,952 / 96,247
        Total committed heap usage (bytes)  | 10,188,226,560 / 396,902,400 / 10,585,128,960   | 10,188,226,560 / 59,441,152 / 10,247,667,712
        CPU time spent (ms)                 | 456,03 / 79,84 / 535,87                         | 450,26 / 6,23 / 456,49
    35. // In-mapper combining: accumulate counts locally, emit once per map task.
        private Map<String, Integer> items = null;

        public void initialize() {
            items = new HashMap<String, Integer>();
        }

        public void process(String line, Emitter<Pair<String, Integer>> emitter) {
            String[] parts = line.split(" ");
            try {
                String host = new URL(parts[2]).getHost();
                Integer value = items.get(host);
                if (value == null) {
                    items.put(host, 1);
                } else {
                    items.put(host, value + 1);
                }
            } catch (MalformedURLException e) {
                throw new RuntimeException(e);
            }
        }

        public void cleanup(Emitter<Pair<String, Integer>> emitter) {
            for (Entry<String, Integer> item : items.entrySet()) {
                emitter.emit(Pair.of(item.getKey(), item.getValue()));
            }
        }
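        To use this in-mapper combining version, wrap the three methods above in a DoFn subclass and swap it in for NaiveCountVisitors; the rest of the pipeline is unchanged. A sketch, where the class name PreCombiningCountVisitors is hypothetical (the slide shows only the method bodies):

        PTable<String, Integer> visitors = lines.parallelDo("Count Visitors",
            new PreCombiningCountVisitors(),   // hypothetical name for the DoFn sketched above
            Writables.tableOf(Writables.strings(), Writables.ints()));
        // groupByKey() and combineValues() stay exactly as before, but they now see at
        // most one record per host per map task instead of one record per log line.
        PTable<String, Integer> counts = visitors.groupByKey()
            .combineValues(Aggregators.<String>SUM_INTS());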
    36. Crunch with pre-combining
    37. Job counters, Crunch vs. Crunch with pre-combining:

        Counter                             | With Crunch (Map / Reduce / Total)              | With pre-combining (Map / Reduce / Total)
        SLOTS_MILLIS_MAPS                   | 0 / 0 / 1,434,544                               | 0 / 0 / 1,151,037
        SLOTS_MILLIS_REDUCES                | 0 / 0 / 384,755                                 | 0 / 0 / 614,053
        FILE_BYTES_WRITTEN                  | 3,776,871 / 681,575 / 4,458,446                 | 2,225,432 / 706,037 / 2,931,469
        Map input records                   | 33,809,720 / 0 / 33,809,720                     | 33,809,720 / 0 / 33,809,720
        Map output records                  | 33,661,880 / 0 / 33,661,880                     | 21,952 / 0 / 21,952
        Combine input records               | 33,714,223 / 0 / 33,714,223                     | 21,952 / 0 / 21,952
        Combine output records              | 74,295 / 0 / 74,295                             | 21,952 / 0 / 21,952
        Reduce input records                | 0 / 21,952 / 21,952                             | 0 / 21,952 / 21,952
        Reduce output records               | 0 / 343 / 343                                   | 0 / 343 / 343
        Map output bytes                    | 888,738,480 / 0 / 888,738,480                   | 613,248 / 0 / 613,248
        Map output materialized bytes       | 657,536 / 0 / 657,536                           | 657,92 / 0 / 657,92
        Reduce shuffle bytes                | 0 / 657,536 / 657,536                           | 0 / 657,92 / 657,92
        Physical memory (bytes) snapshot    | 12,008,472,576 / 86,654,976 / 12,095,127,552    | 12,065,873,920 / 171,278,336 / 12,237,152,256
        Spilled Records                     | 74,295 / 21,952 / 96,247                        | 21,952 / 21,952 / 43,904
        Total committed heap usage (bytes)  | 10,188,226,560 / 59,441,152 / 10,247,667,712    | 10,188,226,560 / 118,882,304 / 10,307,108,864
        CPU time spent (ms)                 | 450,26 / 6,23 / 456,49                          | 295,79 / 9,86 / 305,65
    38. Bottlenecks are usually caused by the amount of data that travels over the network. (Compare the Reduce shuffle bytes in the tables above: 940,985,238 without a combiner versus 657,536 with one.)
    39. Architecting the Pipeline (diagram: the complete example pipeline from slide 11, with all of its intermediate data sets)
    40. Architecting the Pipeline (the same diagram, with the data sets numbered 1 to 6 and a Merge step added)
    41. Architecting the Pipeline (diagram, data sets numbered 1 to 4):
        1. Access logs:
            u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
            u=12070002 - http://cnn.com/news - 189.19.123.161
            u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127
        2. Pages with their visitors:
            {http://www.tailtarget.com/home/, [u=0C010003]}
            {http://cnn.com/news, [u=12070002]}
            {http://www.tailtarget.com/about/, [u=00AD0e12]}
        3. Pages with visitors and topic:
            {http://www.tailtarget.com/home/, [u=0C010003], Tecnologia}
            {http://cnn.com/news, [u=12070002], Notícias}
            {http://www.tailtarget.com/about/, [u=00AD0e12], Tecnologia}
        4. Users classified by topic:
            u=0C010003 - Tecnologia
            u=12070002 - Notícias
            u=00AD0e12 - Tecnologia
    42. Do as much as you can with the data you already have in hand.
    43. Architecting the Pipeline (the full example diagram again, as on slide 39)
    44. Architecting the Pipeline (diagram: the same flow with a Merge step, using Redis to hold intermediate results such as already classified pages between stages; data sets numbered as on slide 40)
    45. Architecting the Pipeline (diagram: the data sets numbered 1 to 6, with Redis holding intermediate results, split into two pipelines; see the sketch below):
        Pipeline A: Input: 1; Output: 2, 4
        Pipeline B: Input: 1; Output: 3, 6
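        One way to express this split in Crunch is a single pipeline with more than one write: the planner shares the common upstream stages between branches where it can. A sketch, where ExtractUserUrls, ClassifyUsers, and the paths are hypothetical stand-ins for the deck's real steps:

        PCollection<String> logs = pipeline.readTextFile("logs/input");    // data set 1
        PTable<String, String> userUrls = logs.parallelDo("extract",
            new ExtractUserUrls(),                                         // hypothetical DoFn
            Writables.tableOf(Writables.strings(), Writables.strings()));
        pipeline.writeTextFile(userUrls, "out/user-urls");                 // data set 2
        PTable<String, String> userTopics = userUrls.parallelDo("classify",
            new ClassifyUsers(),                                           // hypothetical DoFn
            Writables.tableOf(Writables.strings(), Writables.strings()));
        pipeline.writeTextFile(userTopics, "out/user-topics");             // data set 3
        PipelineResult result = pipeline.done();   // both branches run in one planned job graph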
    46. To inspect the execution plan Crunch produced, dump it as a Graphviz dot file after the pipeline runs:

        PipelineResult pipelineResult = pipeline.done();
        String dotFileContents = pipeline.getConfiguration()
            .get(PlanningParameters.PIPELINE_PLAN_DOTFILE);
        FileUtils.writeStringToFile(
            new File("/tmp/logpipelinegraph.dot"), dotFileContents);
    47. Maximize parallelism.
    48. Nothing has more impact on your application's performance than optimizing your own code.
    49. Small Data*
    50. *It is big data, remember? Hadoop/HDFS do not handle small files well: by default each file costs a map task of its own, plus an entry in the NameNode's memory. (A common packing remedy is sketched below.)
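        Hadoop works best with files at least around the size of an HDFS block. A sketch of one common remedy, packing many small files into a single SequenceFile of (filename, contents) records; this is an assumption of mine, not something shown in the deck:

        // An assumption, not from the deck: pack small files into one SequenceFile.
        Configuration conf = getConf();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path("packed.seq"), Text.class, BytesWritable.class);
        try {
            for (FileStatus status : fs.listStatus(new Path("small-files/"))) {
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                FSDataInputStream in = fs.open(status.getPath());
                IOUtils.copyBytes(in, buffer, conf, true);   // reads the whole file, then closes it
                writer.append(new Text(status.getPath().getName()),
                    new BytesWritable(buffer.toByteArray()));
            }
        } finally {
            writer.close();
        }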
    51. What if the source is not a text file?

        Pipeline pipeline = new MRPipeline(SimpleNaiveMapReduce.class, getConf());
        DataBaseSource<Visitors> dbsrc = new DataBaseSource.Builder<Visitors>(Visitors.class)
            .setDriverClass(org.h2.Driver.class)
            .setUrl("jdbc://…").setUsername("root").setPassword("")
            .selectSQLQuery("SELECT URL, UID FROM TEST")
            .countSQLQuery("select count(*) from Test").build();
        PCollection<Visitors> visitors = pipeline.read(dbsrc);
        PTable<String, Integer> counts = visitors.parallelDo("Count Visitors",
            new NaiveCountVisitors(),   // adapted to take Visitors records instead of String lines
            Writables.tableOf(Writables.strings(), Writables.ints()));
        …
        PipelineResult pipelineResult = pipeline.done();
    52. Or create your own data source… (a usage sketch follows)

        public class RedisSource<T extends Writable> implements Source<T> {
            @Override
            public void configureSource(org.apache.hadoop.mapreduce.Job job, int inputId)
                    throws IOException {
                Configuration configuration = job.getConfiguration();
                RedisConfiguration.configureDB(configuration, redisMasters, dbNumber,
                    dataStructure);
                job.setInputFormatClass(RedisInputFormat.class);
                RedisInputFormat.setInput(job, inputClass, redisMasters, dbNumber,
                    sliceExpression, maxRecordsToReturn, dataStructure, null, null);
            }
        }
        // Also to implement: read, getSplits, write.
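        A Source built this way plugs into a pipeline like any built-in one. A sketch of reading from it; the RedisSource constructor and the ParseRedisEntries DoFn are hypothetical, since the slide shows only configureSource():

        Source<Text> redis = new RedisSource<Text>(redisMasters, dbNumber, dataStructure);  // hypothetical constructor
        PCollection<Text> entries = pipeline.read(redis);
        PTable<String, Integer> counts = entries.parallelDo("parse",
            new ParseRedisEntries(),                                       // hypothetical DoFn
            Writables.tableOf(Writables.strings(), Writables.ints()));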
    53. Make sure your Big Data really is Big. Optimize your code: it will be repeated MANY times. Design your pipeline to maximize parallelism.
    54. In the cloud you have virtually unlimited resources. But virtually unlimited cost, too.
    55. Optimized Big Data: Efficient Architectures for Building MapReduce Pipelines (Fabiane Bizinella Nardon, @fabianenardon)
