Optimized Big Data: Efficient Architectures for Building MapReduce Pipelines

Clean, aggregate, analyze, transform: real-world data science applications usually involve running several processing steps, each one adding more value to your data. Architecting and orchestrating these pipelines efficiently is a task that demands a good deal of knowledge about the inner workings of MapReduce algorithms, plus a few tricks you only learn after processing several terabytes. This talk shows how to architect efficient MapReduce pipelines with the Apache Crunch framework, how to integrate your pipelines with external data sources such as Redis, MongoDB, or even relational databases, what the best granularity for your jobs is, and when investing in a MapReduce architecture really makes sense.

Talk presented by Fabiane Bizinella Nardon at QConSP 2013.

Optimized Big Data: Efficient Architectures for Building MapReduce Pipelines - Presentation Transcript

  • 1. Optimized Big Data: Efficient Architectures for Building MapReduce Pipelines. Fabiane Bizinella Nardon (@fabianenardon)
  • 2. Big Data and me
  • 3. How big is BIG?
  • 4. HOW TO KNOW IF YOUR DATA REALLY IS BIG: all of your data does not fit on a single machine (photo: Fernando Stankuns)
  • 5. HOW TO KNOW IF YOUR DATA REALLY IS BIG: you talk more in terabytes than in gigabytes
  • 6. HOW TO KNOW IF YOUR DATA REALLY IS BIG: the amount of data you process grows constantly, and should double next year (photo: Saulo Cruz)
  • 7. FOR EVERYTHING ELSE: KEEP IT SIMPLE!
  • 8. [word cloud: Hadoop, HBase, Hive, Crunch, HDFS, Cascading, Pig, Mahout, Redis, MongoDB, MySQL, Cassandra]
  • 9. [diagram: Data -> Map -> Reduce -> New Data]
  • 10. [diagram: Data -> Map -> Reduce]
  • 11. Pipeline (example) [diagram: raw access log records such as
        u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
        u=12070002 - http://cnn.com/news - 189.19.123.161
        u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127
    are reduced to user/URL pairs, the distinct URLs are classified by topic
    (http://www.tailtarget.com/home/ - Technology, http://cnn.com/news - News,
    http://www.tailtarget.com/about/ - Technology), and the classifications are
    joined back to the users (u=0C010003 - Technology, u=12070002 - News,
    u=00AD0e12 - Technology)]
  • 12. MapReduce pipelines: tools to orchestrate, chain, and optimize them - Hadoop, HBase, Hive, Crunch, HDFS, Cascading, Pig, Mahout, Redis, MongoDB, MySQL, Cassandra
  • 13. Apache Crunch: a library for building MapReduce pipelines on top of Hadoop. It interleaves and orchestrates different MapReduce functions and, as a bonus, optimizes and simplifies the MapReduce implementation. Based on FlumeJava: Easy, Efficient Data-Parallel Pipelines (Google, 2010).
  • 14-17. Crunch - Anatomy of a Pipeline [diagram, built up over four slides: a Data Source in HDFS is read into a PCollection, PTable, or PGroupedTable; DoFns applied with parallelDo() run in parallel across the Hadoop nodes; write() sends the result to a Data Target in HDFS]
  • 18. Example: counting visitors per site.
    Input:
        u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
        u=12070002 - http://cnn.com/news - 189.19.123.161
        u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127
    Output:
        tailtarget.com - 2
        cnn.com - 1
  • 19-27. The visitor-count pipeline in Crunch:

        Pipeline pipeline = new MRPipeline(SimpleNaiveMapReduce.class, getConf());
        // Read the input file into a PCollection of lines
        PCollection<String> lines = pipeline.readTextFile("my/file");
        // Map: emit a (host, 1) pair for each log line
        PTable<String, Integer> visitors = lines.parallelDo("Count Visitors",
                new NaiveCountVisitors(),
                Writables.tableOf(Writables.strings(), Writables.ints()));
        // Group by host (the shuffle)
        PGroupedTable<String, Integer> grouped = visitors.groupByKey();
        // Sum the counts for each host
        PTable<String, Integer> counts = grouped.combineValues(Aggregators.<String>SUM_INTS());
        pipeline.writeTextFile(counts, "my/output/file");
        // Nothing runs until done(): Crunch plans and submits the jobs lazily
        PipelineResult pipelineResult = pipeline.done();
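    Nothing in the lines above executes when it is declared: Crunch first builds a logical plan and only compiles it into actual MapReduce jobs when done() is called, fusing adjacent operations (here the parallelDo, groupByKey, and combineValues) into as few jobs as possible. This is the FlumeJava optimization cited on slide 13.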
  • 28. The DoFn used in the pipeline above:

        public class NaiveCountVisitors extends DoFn<String, Pair<String, Integer>> {
            @Override
            public void process(String line, Emitter<Pair<String, Integer>> emitter) {
                String[] parts = line.split(" ");
                try {
                    // parts[2] holds the URL; emit a (host, 1) pair
                    URL url = new URL(parts[2]);
                    emitter.emit(Pair.of(url.getHost(), 1));
                } catch (MalformedURLException e) {
                    // skip malformed log lines
                }
            }
        }
  • 29. The same map in plain Hadoop MapReduce:

        public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable one = new IntWritable(1);
            private Text page = new Text();

            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String line = value.toString();
                String[] parts = line.split(" ");
                try {
                    page.set(new URL(parts[2]).getHost());
                    context.write(page, one);
                } catch (MalformedURLException e) {
                    // skip malformed log lines
                }
            }
        }
  • 30. The plain Hadoop reducer, and the two Crunch lines that replace it:

        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable counter = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int count = 0;
                for (IntWritable value : values) {
                    count += value.get();
                }
                counter.set(count);
                context.write(key, counter);
            }
        }

        PGroupedTable<String, Integer> grouped = visitors.groupByKey();
        PTable<String, Integer> counts = grouped.combineValues(Aggregators.<String>SUM_INTS());
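    Note that combineValues takes an associative aggregation, so Crunch is free to apply it both map-side, as a Hadoop combiner, and reduce-side. That is what the counter comparison on the following slides shows.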
  • 31. MapReduce (without Crunch)
  • 32. MapReduce (with Crunch)
  • 33. [diagram: the MapReduce data path - HDFS chunks are read by RecordReaders into Map tasks, optionally pre-aggregated by a Combine step, spilled to local storage, then copied and sorted across the network into the Reduce tasks]
  • 34. Job counters, plain MapReduce versus Crunch (each cell: Map / Reduce / Total):

        Counter                          | MapReduce                                     | With Crunch
        SLOTS_MILLIS_MAPS                | 0 / 0 / 1,635,906                             | 0 / 0 / 1,434,544
        SLOTS_MILLIS_REDUCES             | 0 / 0 / 870,082                               | 0 / 0 / 384,755
        FILE_BYTES_WRITTEN               | 1,907,284,471 / 956,106,354 / 2,863,390,825   | 3,776,871 / 681,575 / 4,458,446
        Map input records                | 33,809,720 / 0 / 33,809,720                   | 33,809,720 / 0 / 33,809,720
        Map output records               | 33,661,880 / 0 / 33,661,880                   | 33,661,880 / 0 / 33,661,880
        Combine input records            | 0 / 0 / 0                                     | 33,714,223 / 0 / 33,714,223
        Combine output records           | 0 / 0 / 0                                     | 74,295 / 0 / 74,295
        Reduce input records             | 0 / 33,661,880 / 33,661,880                   | 0 / 21,952 / 21,952
        Reduce output records            | 0 / 343 / 343                                 | 0 / 343 / 343
        Map output bytes                 | 888,738,480 / 0 / 888,738,480                 | 888,738,480 / 0 / 888,738,480
        Map output materialized bytes    | 956,063,008 / 0 / 956,063,008                 | 657,536 / 0 / 657,536
        Reduce shuffle bytes             | 0 / 940,985,238 / 940,985,238                 | 0 / 657,536 / 657,536
        Physical memory (bytes) snapshot | 11,734,376,448 / 527,491,072 / 12,261,867,520 | 12,008,472,576 / 86,654,976 / 12,095,127,552
        Spilled Records                  | 67,103,496 / 33,661,880 / 100,765,376         | 74,295 / 21,952 / 96,247
        Total committed heap usage       | 10,188,226,560 / 396,902,400 / 10,585,128,960 | 10,188,226,560 / 59,441,152 / 10,247,667,712
        CPU time spent (ms)              | 456,03 / 79,84 / 535,87                       | 450,26 / 6,23 / 456,49
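    The decisive rows are the shuffle ones: reduce shuffle bytes fall from 940,985,238 to 657,536, because the combiner Crunch inserts collapses 33,714,223 map outputs into 74,295 records before they ever cross the network, and spilled records drop from about 100 million to under 100 thousand.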
  • 35. Pre-combining inside the map with an in-memory table:

        private Map<String, Integer> items = null;

        public void initialize() {
            items = new HashMap<String, Integer>();
        }

        public void process(String line, Emitter<Pair<String, Integer>> emitter) {
            String[] parts = line.split(" ");
            try {
                URL url = new URL(parts[2]);
                // Accumulate partial counts locally instead of emitting (host, 1) pairs
                Integer value = items.get(url.getHost());
                if (value == null) {
                    items.put(url.getHost(), 1);
                } else {
                    items.put(url.getHost(), value + 1);
                }
            } catch (MalformedURLException e) {
                // skip malformed log lines
            }
        }

        public void cleanup(Emitter<Pair<String, Integer>> emitter) {
            // Emit each (host, partial count) once, at the end of the map task
            for (Entry<String, Integer> item : items.entrySet()) {
                emitter.emit(Pair.of(item.getKey(), item.getValue()));
            }
        }
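    One caveat the slide leaves implicit: this HashMap keeps one entry per distinct host seen by the map task, so a high-cardinality key can exhaust the mapper's heap. Below is a minimal sketch of a bounded variant; the class name, the MAX_ENTRIES threshold, and the flush() helper are illustrative, not from the talk. It emits partial counts and clears the table whenever it grows too large, and the downstream combineValues() merges the partial sums.

        import java.net.MalformedURLException;
        import java.net.URL;
        import java.util.HashMap;
        import java.util.Map;

        import org.apache.crunch.DoFn;
        import org.apache.crunch.Emitter;
        import org.apache.crunch.Pair;

        public class BoundedPreCombineVisitors extends DoFn<String, Pair<String, Integer>> {

            // Hypothetical cap on the number of distinct keys held in memory; tune to your heap
            private static final int MAX_ENTRIES = 100000;

            private Map<String, Integer> items;

            @Override
            public void initialize() {
                items = new HashMap<String, Integer>();
            }

            @Override
            public void process(String line, Emitter<Pair<String, Integer>> emitter) {
                String[] parts = line.split(" ");
                try {
                    String host = new URL(parts[2]).getHost();
                    Integer value = items.get(host);
                    items.put(host, value == null ? 1 : value + 1);
                    // Bound the memory footprint: emit partial sums and start over;
                    // combineValues() downstream merges the partial sums per host
                    if (items.size() >= MAX_ENTRIES) {
                        flush(emitter);
                    }
                } catch (MalformedURLException e) {
                    // skip malformed log lines
                }
            }

            private void flush(Emitter<Pair<String, Integer>> emitter) {
                for (Map.Entry<String, Integer> item : items.entrySet()) {
                    emitter.emit(Pair.of(item.getKey(), item.getValue()));
                }
                items.clear();
            }

            @Override
            public void cleanup(Emitter<Pair<String, Integer>> emitter) {
                flush(emitter);
            }
        }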
  • 36. Crunch with pre-combining
  • 37. Job counters, Crunch versus Crunch with pre-combining (each cell: Map / Reduce / Total):

        Counter                          | With Crunch                                   | With pre-combining
        SLOTS_MILLIS_MAPS                | 0 / 0 / 1,434,544                             | 0 / 0 / 1,151,037
        SLOTS_MILLIS_REDUCES             | 0 / 0 / 384,755                               | 0 / 0 / 614,053
        FILE_BYTES_WRITTEN               | 3,776,871 / 681,575 / 4,458,446               | 2,225,432 / 706,037 / 2,931,469
        Map input records                | 33,809,720 / 0 / 33,809,720                   | 33,809,720 / 0 / 33,809,720
        Map output records               | 33,661,880 / 0 / 33,661,880                   | 21,952 / 0 / 21,952
        Combine input records            | 33,714,223 / 0 / 33,714,223                   | 21,952 / 0 / 21,952
        Combine output records           | 74,295 / 0 / 74,295                           | 21,952 / 0 / 21,952
        Reduce input records             | 0 / 21,952 / 21,952                           | 0 / 21,952 / 21,952
        Reduce output records            | 0 / 343 / 343                                 | 0 / 343 / 343
        Map output bytes                 | 888,738,480 / 0 / 888,738,480                 | 613,248 / 0 / 613,248
        Map output materialized bytes    | 657,536 / 0 / 657,536                         | 657,92 / 0 / 657,92
        Reduce shuffle bytes             | 0 / 657,536 / 657,536                         | 0 / 657,92 / 657,92
        Physical memory (bytes) snapshot | 12,008,472,576 / 86,654,976 / 12,095,127,552  | 12,065,873,920 / 171,278,336 / 12,237,152,256
        Spilled Records                  | 74,295 / 21,952 / 96,247                      | 21,952 / 21,952 / 43,904
        Total committed heap usage       | 10,188,226,560 / 59,441,152 / 10,247,667,712  | 10,188,226,560 / 118,882,304 / 10,307,108,864
        CPU time spent (ms)              | 450,26 / 6,23 / 456,49                        | 295,79 / 9,86 / 305,65
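    Here the gain moves earlier in the path: map output records fall from 33,661,880 to 21,952, since the partial sums are computed before anything is serialized at all, and total CPU time drops by roughly a third.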
  • 38. Bottlenecks are usually caused by the amount of data traveling across the network.
  • 39. Architecting the pipeline [diagram: the example pipeline from slide 11 again - access logs, user/URL pairs, distinct URLs, URL categories, user profiles]
  • 40. Architecting the pipeline [diagram: the same flow with its stages numbered 1 to 6 and an explicit Merge step joining the URL categories back to the user/URL pairs]
  • 41. Architecting the pipeline [diagram, stages numbered 1 to 4: the access logs are grouped into URL/visitor-list records such as {http://www.tailtarget.com/home/, [u=0C010003]}, joined with the categories into records such as {http://www.tailtarget.com/home/, [u=0C010003], Technology}, and unfolded into per-user profiles (u=0C010003 - Technology, u=12070002 - News, u=00AD0e12 - Technology)]
  • 42. Do as much as you can with the data you already have in hand.
  • 43. Architecting the pipeline [diagram: the full example pipeline once more]
  • 44. Architecting the pipeline [diagram: the same flow, stages numbered 1 to 6, with Redis serving the URL categories in place of the Merge step]
  • 45. Architecting the pipeline [diagram: the flow split into two independent pipelines that share data through Redis. Pipeline A: input 1, outputs 2 and 4. Pipeline B: input 1, outputs 3 and 6]
  • 46. Inspecting the plan Crunch produced:

        PipelineResult pipelineResult = pipeline.done();
        // Crunch exposes the execution plan it generated as a Graphviz .dot file
        String dotFileContents = pipeline.getConfiguration()
                .get(PlanningParameters.PIPELINE_PLAN_DOTFILE);
        FileUtils.writeStringToFile(
                new File("/tmp/logpipelinegraph.dot"), dotFileContents);
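    Rendering the file with Graphviz (for example: dot -Tpng logpipelinegraph.dot) shows which DoFns the planner fused into each MapReduce job, a quick way to check that the pipeline parallelizes the way you intended.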
  • 47. Maximize parallelism.
  • 48. Nothing has more impact on your application's performance than optimizing your own code.
  • 49. Small Data
  • 50. It is still big data, remember? Hadoop and HDFS do not handle small files well: each file consumes NameNode memory and typically gets its own map task, so lots of small files means the overhead dwarfs the useful work.
  • 51. What if the source is not a text file?

        Pipeline pipeline = new MRPipeline(SimpleNaiveMapReduce.class, getConf());
        DataBaseSource<Visitors> dbsrc = new DataBaseSource.Builder<Visitors>(Visitors.class)
                .setDriverClass(org.h2.Driver.class)
                .setUrl("jdbc://…").setUsername("root").setPassword("")
                .selectSQLQuery("SELECT URL, UID FROM TEST")
                .countSQLQuery("select count(*) from Test").build();
        // Read database rows instead of text lines
        PCollection<Visitors> visitors = pipeline.read(dbsrc);
        // From here on the pipeline is the same (the DoFn must now accept Visitors records)
        PTable<String, Integer> counts = visitors.parallelDo("Count Visitors",
                new NaiveCountVisitors(),
                Writables.tableOf(Writables.strings(), Writables.ints()));
        …
        PipelineResult pipelineResult = pipeline.done();
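    DataBaseSource comes from Crunch's contrib module rather than the core API; the separate count query presumably lets the source report its input size so that the read can be split across mappers.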
  • 52. Or build your own data source (RedisSource, RedisInputFormat, and RedisConfiguration here are the speaker's custom classes, not part of Crunch or Hadoop):

        public class RedisSource<T extends Writable> implements Source<T> {

            @Override
            public void configureSource(org.apache.hadoop.mapreduce.Job job, int inputId)
                    throws IOException {
                Configuration configuration = job.getConfiguration();
                RedisConfiguration.configureDB(configuration, redisMasters,
                        dbNumber, dataStructure);
                // Plug the custom input format into the Hadoop job
                job.setInputFormatClass(RedisInputFormat.class);
                RedisInputFormat.setInput(job, inputClass, redisMasters, dbNumber,
                        sliceExpression, maxRecordsToReturn, dataStructure, null, null);
            }
        }

    What you have to implement: read, getSplits, and write.
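    Besides configureSource, a Crunch Source also has to expose the PType of its records (getType()) and an estimated input size (getSize(Configuration)), which the planner uses when laying out jobs; the actual parallelism comes from the custom InputFormat's getSplits, where a natural choice (not shown in the talk) is one split per Redis master or key slice.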
  • 53. Make sure your big data really is big. Optimize your code: it will run MANY times. Design your pipeline to maximize parallelism.
  • 54. In the cloud your resources are virtually unlimited. But so is the cost.
  • 55. Optimized Big Data: Efficient Architectures for Building MapReduce Pipelines. Fabiane Bizinella Nardon (@fabianenardon)