Open XKE - Big Data, Big Mess par Bertrand Dechoux

Presentation Transcript

    • Big Data, Big Mess? By Bertrand Dechoux 1
    • Hadoop experience: first contact with Hadoop in early 2010; Hadoop consultant and trainer @ Xebia. 2
    • Agenda: Hadoop MapReduce 101; Java API, Hadoop Streaming; Hive, Pig and Cascading; and what about the data? 3
    • Hadoop MapReduce 101 4
    • One problem, one solution. Objectives: distributed computation, high data volumes. Choices: commodity hardware, local reads. 5
    • Map and Reduce [diagram: input data blocks feed parallel map tasks, whose output is shuffled into reduce tasks that produce the resulting data] 6
    • What you get: primitives, in Java, functional in style, for distributed batch processing. 7
    • Java API, Hadoop Streaming 8
    • The Java API 9

      public class WordCount {

        public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
              word.set(tokenizer.nextToken());
              context.write(word, one);
            }
          }
        }

        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
              sum += val.get();
            }
            context.write(key, new IntWritable(sum));
          }
        }

        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Job job = new Job(conf, "wordcount");
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          job.setMapperClass(Map.class);
          job.setReducerClass(Reduce.class);
          job.setInputFormatClass(TextInputFormat.class);
          job.setOutputFormatClass(TextOutputFormat.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          job.waitForCompletion(true);
        }
      }
    • Industrialization: simple. Dependencies -> maven; tests -> MRUnit + JUnit + maven; release -> maven + jenkins + nexus. 10
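
    The slide cites MRUnit and JUnit; as a hedged illustration (not part of the deck), a unit test for the WordCount.Map class shown earlier could look like the sketch below, assuming MRUnit 0.9+ and its mapreduce MapDriver. The test class name and input sentence are invented for the example.

      import java.io.IOException;

      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mrunit.mapreduce.MapDriver;
      import org.junit.Before;
      import org.junit.Test;

      public class WordCountMapperTest {

          private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

          @Before
          public void setUp() {
              // drive the mapper in isolation, without a cluster
              mapDriver = MapDriver.newMapDriver(new WordCount.Map());
          }

          @Test
          public void shouldEmitOneCountPerToken() throws IOException {
              mapDriver.withInput(new LongWritable(0), new Text("big data big mess"))
                       .withOutput(new Text("big"), new IntWritable(1))
                       .withOutput(new Text("data"), new IntWritable(1))
                       .withOutput(new Text("big"), new IntWritable(1))
                       .withOutput(new Text("mess"), new IntWritable(1))
                       .runTest();
          }
      }

    Such a test runs with plain JUnit under maven, which is what makes this toolchain straightforward to plug into jenkins and nexus.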
    • Classic use case: log centralization. How do the operations teams actually use those logs? 11
    • Beyond Java: Hadoop Streaming. Reading and writing on stdin/stdout; integration of legacy code; only simple jobs; industrialization without problems. 12
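
    A streaming mapper can be any executable that reads records on stdin and writes tab-separated key/value pairs on stdout; in practice it is often a shell, Python or legacy program. To stay in one language, here is a minimal word-count mapper written as a plain Java program; the class name and whitespace tokenization are illustrative, not from the presentation.

      import java.io.BufferedReader;
      import java.io.InputStreamReader;

      // Reads raw text lines on stdin and emits "word<TAB>1" pairs on stdout,
      // which is the contract Hadoop Streaming expects from a mapper.
      public class StreamingWordCountMapper {
          public static void main(String[] args) throws Exception {
              BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
              String line;
              while ((line = in.readLine()) != null) {
                  for (String word : line.split("\\s+")) {
                      if (!word.isEmpty()) {
                          System.out.println(word + "\t1");
                      }
                  }
              }
          }
      }

    The same contract is what makes it easy to wrap existing legacy executables, as long as the job stays simple.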
    • Hive, Pig and Cascading 13
    • Hive and Pig. Hive: HiveQL, structured data, tree-shaped plans. Pig: PigLatin, 'eats anything', DAG-shaped plans. 14
    • Industrialization? Dependencies -> maven; tests -> JUnit + maven; release -> maven + jenkins + nexus. 15
    • Industrialization: laborious. 1 MapReduce job -> at least 10 seconds; 1 query -> too long; n queries -> ??? 16
    • Cascading: same principle as Hive and Pig; a higher-level API in Java; or in Scala: scalding; or in Clojure: cascalog; and Hadoop is not the only supported platform. 17
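
    To make the "higher-level API in Java" concrete, here is a rough word-count sketch assuming Cascading 2.x with its Hadoop planner; the class name and command-line paths are placeholders, not taken from the deck.

      import java.util.Properties;

      import cascading.flow.Flow;
      import cascading.flow.hadoop.HadoopFlowConnector;
      import cascading.operation.aggregator.Count;
      import cascading.operation.regex.RegexSplitGenerator;
      import cascading.pipe.Each;
      import cascading.pipe.Every;
      import cascading.pipe.GroupBy;
      import cascading.pipe.Pipe;
      import cascading.scheme.hadoop.TextLine;
      import cascading.tap.Tap;
      import cascading.tap.hadoop.Hfs;
      import cascading.tuple.Fields;

      public class CascadingWordCount {
          public static void main(String[] args) {
              // source and sink taps over the HDFS paths given on the command line
              Tap source = new Hfs(new TextLine(new Fields("line")), args[0]);
              Tap sink = new Hfs(new TextLine(), args[1]);

              // split each line into words, group by word, count each group
              Pipe pipe = new Each("wordcount", new Fields("line"),
                      new RegexSplitGenerator(new Fields("word"), "\\s+"));
              pipe = new GroupBy(pipe, new Fields("word"));
              pipe = new Every(pipe, new Count());

              Flow flow = new HadoopFlowConnector(new Properties())
                      .connect(source, sink, pipe);
              flow.complete();
          }
      }

    The flow is planned into one or more MapReduce jobs, and the same pipe-assembly idea is what scalding and cascalog expose with Scala and Clojure syntax.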
    • And what about the data? 18
    • Files: plain text, SequenceFile, Avro; a trade-off between interoperability and performance. 19
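
    As a small, hedged illustration of the Hadoop-native binary option (not shown in the deck), writing a SequenceFile of Text/IntWritable pairs with the standard org.apache.hadoop.io.SequenceFile API could look like this; the output path and sample records are placeholders.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.SequenceFile;
      import org.apache.hadoop.io.Text;

      public class SequenceFileWriteExample {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.get(conf);
              Path path = new Path(args[0]);

              // append key/value pairs to a binary, splittable container
              SequenceFile.Writer writer = SequenceFile.createWriter(
                      fs, conf, path, Text.class, IntWritable.class);
              try {
                  writer.append(new Text("big"), new IntWritable(2));
                  writer.append(new Text("data"), new IntWritable(1));
              } finally {
                  writer.close();
              }
          }
      }

    Plain text stays the most interoperable option; SequenceFile and Avro trade some of that interoperability for performance.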
    • The filesystem: HDFS. Few files, large files, optimized for streaming reads. 20
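
    "Optimized for streaming reads" shows up at the API level as opening a file through the FileSystem abstraction and consuming it sequentially; a minimal sketch, with the path taken from the command line purely for illustration.

      import java.io.BufferedReader;
      import java.io.InputStreamReader;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataInputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class HdfsSequentialRead {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.get(conf);

              // open the file and read it front to back: the access pattern HDFS favours
              FSDataInputStream in = fs.open(new Path(args[0]));
              BufferedReader reader = new BufferedReader(new InputStreamReader(in));
              try {
                  String line;
                  while ((line = reader.readLine()) != null) {
                      System.out.println(line);
                  }
              } finally {
                  reader.close();
              }
          }
      }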
    • The database: HBase. A clone of BigTable; essentially a Map with sorted keys. 21
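
    "Essentially a Map with sorted keys" can be illustrated with the HBase client API of that period (HTable, Put, Get): a value is written under a row key and a column family:qualifier, then read back by key. The 'events' table, 'd' family and row key below are hypothetical.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.client.Get;
      import org.apache.hadoop.hbase.client.HTable;
      import org.apache.hadoop.hbase.client.Put;
      import org.apache.hadoop.hbase.client.Result;
      import org.apache.hadoop.hbase.util.Bytes;

      public class HBaseSortedMapExample {
          public static void main(String[] args) throws Exception {
              Configuration conf = HBaseConfiguration.create();
              HTable table = new HTable(conf, "events"); // hypothetical table name
              try {
                  // write: row key + family:qualifier -> value
                  Put put = new Put(Bytes.toBytes("2013-06-20#app1"));
                  put.add(Bytes.toBytes("d"), Bytes.toBytes("count"), Bytes.toBytes(42L));
                  table.put(put);

                  // read the same cell back by key
                  Result result = table.get(new Get(Bytes.toBytes("2013-06-20#app1")));
                  long count = Bytes.toLong(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("count")));
                  System.out.println(count);
              } finally {
                  table.close();
              }
          }
      }

    Because row keys are stored sorted, key design (here a date prefix) directly drives scan locality.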
    • Data Management. HCatalog: inspired by the Hive metastore, describes the datasets. Avro: a file that carries its own description, performant. 22
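
    "A file that carries its own description" refers to the fact that an Avro data file embeds its schema, so readers need no external metadata; a minimal sketch using the GenericRecord API, with an invented LogLine schema and values.

      import java.io.File;

      import org.apache.avro.Schema;
      import org.apache.avro.file.DataFileWriter;
      import org.apache.avro.generic.GenericData;
      import org.apache.avro.generic.GenericDatumWriter;
      import org.apache.avro.generic.GenericRecord;

      public class AvroSelfDescribingFile {
          public static void main(String[] args) throws Exception {
              // hypothetical record schema; it is written into the file header
              Schema schema = new Schema.Parser().parse(
                      "{\"type\":\"record\",\"name\":\"LogLine\",\"fields\":["
                    + "{\"name\":\"host\",\"type\":\"string\"},"
                    + "{\"name\":\"status\",\"type\":\"int\"}]}");

              GenericRecord record = new GenericData.Record(schema);
              record.put("host", "web-01");
              record.put("status", 200);

              DataFileWriter<GenericRecord> writer =
                      new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
              writer.create(schema, new File("loglines.avro"));
              writer.append(record);
              writer.close();
          }
      }

    HCatalog plays a complementary role: rather than embedding the description in each file, it centralizes dataset descriptions in a metastore.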
    • Data Management: management = coordination; data steward / data custodian. 23
    • Is all of this important? 24
    • Any questions? Thank you! 25