Open XKE - Big Data, Big Mess by Bertrand Dechoux

  1. Big Data, Big Mess? By Bertrand Dechoux
  2. Hadoop experience • first contact with Hadoop in early 2010 • Hadoop consultant and trainer @ Xebia
  3. Agenda • Hadoop MapReduce 101 • Java API, Hadoop Streaming • Hive, Pig and Cascading • What about the data?
  4. Hadoop MapReduce 101
  5. One problem, one solution • Goals: distributed computing, high data volume • Choices: commodity hardware, local reads
  6. Map and Reduce [diagram: input DATA blocks fan out to parallel map tasks, whose output is regrouped by key into reduce tasks that produce the output DATA]
  7. What you get • primitives • in Java • functional • for distributed batch processing
  8. Java API, Hadoop Streaming
  9. The Java API:

     import java.io.IOException;
     import java.util.StringTokenizer;

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.io.IntWritable;
     import org.apache.hadoop.io.LongWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapreduce.Job;
     import org.apache.hadoop.mapreduce.Mapper;
     import org.apache.hadoop.mapreduce.Reducer;
     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
     import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
     import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
     import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

     public class WordCount {

         // emits (word, 1) for every token of every input line
         public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
             private final static IntWritable one = new IntWritable(1);
             private Text word = new Text();

             public void map(LongWritable key, Text value, Context context)
                     throws IOException, InterruptedException {
                 String line = value.toString();
                 StringTokenizer tokenizer = new StringTokenizer(line);
                 while (tokenizer.hasMoreTokens()) {
                     word.set(tokenizer.nextToken());
                     context.write(word, one);
                 }
             }
         }

         // sums all the 1s emitted for a given word
         public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
             public void reduce(Text key, Iterable<IntWritable> values, Context context)
                     throws IOException, InterruptedException {
                 int sum = 0;
                 for (IntWritable val : values) {
                     sum += val.get();
                 }
                 context.write(key, new IntWritable(sum));
             }
         }

         public static void main(String[] args) throws Exception {
             Configuration conf = new Configuration();
             Job job = new Job(conf, "wordcount");
             job.setOutputKeyClass(Text.class);
             job.setOutputValueClass(IntWritable.class);
             job.setMapperClass(Map.class);
             job.setReducerClass(Reduce.class);
             job.setInputFormatClass(TextInputFormat.class);
             job.setOutputFormatClass(TextOutputFormat.class);
             FileInputFormat.addInputPath(job, new Path(args[0]));
             FileOutputFormat.setOutputPath(job, new Path(args[1]));
             job.waitForCompletion(true);
         }
     }
  10. Straightforward industrialization • dependencies -> maven • tests -> MRUnit + JUnit + maven • release -> maven + jenkins + nexus (a test sketch follows below)
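     A minimal sketch of what such a mapper test can look like, assuming MRUnit's mapreduce MapDriver and the WordCount.Map class from the previous slide (the test class name and input text are illustrative):

         import org.apache.hadoop.io.IntWritable;
         import org.apache.hadoop.io.LongWritable;
         import org.apache.hadoop.io.Text;
         import org.apache.hadoop.mrunit.mapreduce.MapDriver;
         import org.junit.Test;

         public class WordCountMapTest {

             @Test
             public void emitsOneCountPerToken() throws Exception {
                 // runs the mapper in isolation, without a cluster
                 MapDriver.newMapDriver(new WordCount.Map())
                     .withInput(new LongWritable(0), new Text("big data big mess"))
                     .withOutput(new Text("big"), new IntWritable(1))
                     .withOutput(new Text("data"), new IntWritable(1))
                     .withOutput(new Text("big"), new IntWritable(1))
                     .withOutput(new Text("mess"), new IntWritable(1))
                     .runTest();
             }
         }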
  11. Classic use case • log centralization • but how do operations teams actually use the logs?
  12. Beyond Java: Hadoop Streaming • reads and writes on stdin/stdout • integrates legacy code • only for simple jobs • industrialization without trouble (a command-line sketch follows below)
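     A minimal command-line sketch, using the classic cat/wc pair as mapper and reducer; the location of the streaming jar and the HDFS paths are assumptions that depend on the installation:

         hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
             -input /logs/raw \
             -output /logs/counted \
             -mapper /bin/cat \
             -reducer /usr/bin/wc

     Any executable that reads lines on stdin and writes lines on stdout can fill the mapper or reducer slot, which is exactly what makes legacy integration possible.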
  13. Hive, Pig and Cascading
  14. Hive and Pig • Hive: HiveQL, structured, tree-shaped plans • Pig: Pig Latin, 'eats anything', DAG-shaped plans (two word-count sketches follow below)
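     As a point of comparison, a word count sketched in both languages; the docs table, its line column and the paths are illustrative. In HiveQL:

         SELECT word, count(*) AS cnt
         FROM (SELECT explode(split(line, '\\s+')) AS word FROM docs) words
         GROUP BY word;

     And roughly the same dataflow in Pig Latin:

         lines  = LOAD '/data/docs' AS (line:chararray);
         words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
         grpd   = GROUP words BY word;
         counts = FOREACH grpd GENERATE group, COUNT(words);
         STORE counts INTO '/data/wordcount';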
  15. Industrialization? • dependencies -> maven • tests -> JUnit + maven • release -> maven + jenkins + nexus
  16. Laborious industrialization • 1 MapReduce job -> at least 10 seconds • 1 query -> already too long for a test suite • n queries -> ???
  17. Cascading • same principle as Hive and Pig • a higher-level API in Java • or in Scala: Scalding • or in Clojure: Cascalog • Hadoop is not the only platform (a sketch follows below)
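     A word-count sketch against the Cascading 1.x-style API (class names moved between packages in later versions, so treat the imports as indicative; paths come from args):

         import java.util.Properties;

         import cascading.flow.Flow;
         import cascading.flow.FlowConnector;
         import cascading.operation.aggregator.Count;
         import cascading.operation.regex.RegexSplitGenerator;
         import cascading.pipe.Each;
         import cascading.pipe.Every;
         import cascading.pipe.GroupBy;
         import cascading.pipe.Pipe;
         import cascading.scheme.TextLine;
         import cascading.tap.Hfs;
         import cascading.tap.SinkMode;
         import cascading.tap.Tap;
         import cascading.tuple.Fields;

         public class CascadingWordCount {
             public static void main(String[] args) {
                 Tap source = new Hfs(new TextLine(new Fields("line")), args[0]);
                 Tap sink = new Hfs(new TextLine(), args[1], SinkMode.REPLACE);

                 // split each line into words, group by word, count each group
                 Pipe assembly = new Pipe("wordcount");
                 assembly = new Each(assembly, new Fields("line"),
                         new RegexSplitGenerator(new Fields("word"), "\\s+"));
                 assembly = new GroupBy(assembly, new Fields("word"));
                 assembly = new Every(assembly, new Count(new Fields("count")));

                 Flow flow = new FlowConnector(new Properties())
                         .connect("wordcount", source, sink, assembly);
                 flow.complete();
             }
         }

     The Scalding and Cascalog DSLs compile down to the same kind of pipe assembly.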
  18. What about the data?
  19. Files • types: plain text, SequenceFile, Avro • a trade-off between interoperability and performance (a SequenceFile sketch follows below)
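     A minimal sketch of writing a SequenceFile with the classic (Hadoop 1 era) API; the path argument and the key/value pairs are illustrative:

         import org.apache.hadoop.conf.Configuration;
         import org.apache.hadoop.fs.FileSystem;
         import org.apache.hadoop.fs.Path;
         import org.apache.hadoop.io.IntWritable;
         import org.apache.hadoop.io.SequenceFile;
         import org.apache.hadoop.io.Text;

         public class SequenceFileWriteDemo {
             public static void main(String[] args) throws Exception {
                 Configuration conf = new Configuration();
                 FileSystem fs = FileSystem.get(conf);
                 Path path = new Path(args[0]);

                 // binary, splittable key/value pairs instead of plain text
                 SequenceFile.Writer writer =
                         SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
                 try {
                     writer.append(new Text("big"), new IntWritable(2));
                     writer.append(new Text("mess"), new IntWritable(1));
                 } finally {
                     writer.close();
                 }
             }
         }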
  20. The filesystem: HDFS • few files • large files • optimized for streaming reads
  21. The database: HBase • a BigTable clone • essentially a Map with sorted keys (a client sketch follows below)
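     A minimal sketch with the old (0.9x era) HBase client API; the events table, the f column family and the row key are illustrative, and the table is assumed to already exist:

         import org.apache.hadoop.conf.Configuration;
         import org.apache.hadoop.hbase.HBaseConfiguration;
         import org.apache.hadoop.hbase.client.Get;
         import org.apache.hadoop.hbase.client.HTable;
         import org.apache.hadoop.hbase.client.Put;
         import org.apache.hadoop.hbase.client.Result;
         import org.apache.hadoop.hbase.util.Bytes;

         public class HBaseDemo {
             public static void main(String[] args) throws Exception {
                 Configuration conf = HBaseConfiguration.create();
                 HTable table = new HTable(conf, "events");

                 // a row is addressed by its key; keys are kept sorted,
                 // values live in (column family, qualifier) cells
                 Put put = new Put(Bytes.toBytes("row-1"));
                 put.add(Bytes.toBytes("f"), Bytes.toBytes("count"), Bytes.toBytes(42L));
                 table.put(put);

                 Result result = table.get(new Get(Bytes.toBytes("row-1")));
                 long count = Bytes.toLong(
                         result.getValue(Bytes.toBytes("f"), Bytes.toBytes("count")));
                 System.out.println(count);

                 table.close();
             }
         }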
  22. Data Management • HCatalog: inspired by the Hive metastore, describes the datasets • Avro: a file carrying its own description, performant (a schema sketch follows below)
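     An Avro schema is plain JSON that is stored alongside the data in the file itself, which is how a file can carry its own description; this record and its fields are illustrative:

         {
             "type": "record",
             "name": "LogEvent",
             "fields": [
                 {"name": "timestamp", "type": "long"},
                 {"name": "level",     "type": "string"},
                 {"name": "message",   "type": "string"}
             ]
         }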
  23. Data Management • management = coordination • data steward / data custodian
  24. Does all of this matter?
  25. Questions? Thank you!
