Introduction to Hadoop
   The bazooka for your data


       Iván de Prado Alonso // @ivanprado // @datasalt
Datasalt

  Focused on Big Data
  –   Open Source contributions
  –   Consulting & development
  –   Training
BIG
“MAC”
 DATA



Anatomy of a Big Data project


             Acquisition


             Processing


             Serving
Types of Big Data systems

●   Offline
    –   Latency is not an issue
●   Online
    –   Data freshness matters
●   Mixed
    –   The most common case

                Offline                       Online
    MapReduce                     NoSQL databases
    Hadoop                        Search engines
    Distributed RDBMS
“Swiss army knife of the 21st century”
                              Media Guardian Innovation Awards

http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop
History

●   2004-2006
    –   Google publishes the GFS and MapReduce papers
    –   Doug Cutting implements an Open Source version in Nutch
●   2006-2008
    –   Hadoop is split off from Nutch
    –   Web scale is reached in 2008
●   2008-present
    –   Hadoop becomes popular and starts to be exploited
        commercially

                               Source: “Hadoop: a brief history”, Doug Cutting
Hadoop

     “The Apache Hadoop software library is a framework that
      allows for the distributed processing of large data sets
      across clusters of computers using a simple programming
      model”

              From the Hadoop home page
Distributed File System

●   Hadoop Distributed File System (HDFS)
    –   Large blocks: 64 MB
         ●   Stored on the OS file system
    –   Fault tolerant (through replication)
    –   Common formats:
         ●   Plain-text files (CSV)
         ●   SequenceFiles
              –   Sequences of [key, value] pairs
MapReduce

●   Two functions (Map and Reduce)
    –   Map(k, v) : [z, w]*
    –   Reduce(k, v*) : [z, w]*
●   Example: word count
    –   Map([document, null]) -> [word, 1]*
    –   Reduce(word, 1*) -> [word, total]
●   MapReduce and SQL
    –   SELECT word, count(*) GROUP BY word
●   Distributed execution on a cluster with horizontal
    scalability
The classic Word Count

  Input:
      Esto es una linea
      Esto también

  Map:
      map(“Esto es una linea”) = (esto, 1), (es, 1), (una, 1), (linea, 1)
      map(“Esto también”) = (esto, 1), (también, 1)

  Reduce:
      reduce(es, {1}) = (es, 1)
      reduce(esto, {1, 1}) = (esto, 2)
      reduce(linea, {1}) = (linea, 1)
      reduce(también, {1}) = (también, 1)
      reduce(una, {1}) = (una, 1)

  Result:
      es, 1
      esto, 2
      linea, 1
      también, 1
      una, 1
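The flow above can be simulated in a few lines of plain Java, with no Hadoop involved: a map phase that emits (word, 1) pairs and a reduce phase that groups by key and sums. The class and method names here are ours, purely for illustration; this is a sketch of the logic, not the framework API.

```java
import java.util.*;

public class WordCountSim {

    // Map phase: emit a (word, 1) pair for every token in every line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle + reduce phase: group pairs by key and sum the values.
    static SortedMap<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("Esto es una linea", "Esto también");
        System.out.println(reduce(map(input)));
        // {es=1, esto=2, linea=1, también=1, una=1}
    }
}
```

In a real cluster the grouping step (the shuffle) happens across machines; here a TreeMap stands in for it.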
Word Count in Hadoop

   public class WordCountHadoop extends Configured implements Tool {

       public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

           private final static IntWritable one = new IntWritable(1);
           private Text word = new Text();

           public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
               StringTokenizer itr = new StringTokenizer(value.toString());
               while(itr.hasMoreTokens()) {
                   word.set(itr.nextToken());
                   context.write(word, one);
               }
           }
       }

       public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

           private IntWritable result = new IntWritable();

           public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException,
               InterruptedException {
               int sum = 0;
               for(IntWritable val : values) {
                   sum += val.get();
               }
               result.set(sum);
               context.write(key, result);
           }
       }

       @Override
       public int run(String[] args) throws Exception {

           if(args.length != 2) {
               System.err.println("Usage: wordcount-hadoop <in> <out>");
               System.exit(2);
           }

           Path output = new Path(args[1]);
           HadoopUtils.deleteIfExists(FileSystem.get(output.toUri(), getConf()), output);

           Job job = new Job(getConf(), "word count hadoop");
           job.setJarByClass(WordCountHadoop.class);
           job.setMapperClass(TokenizerMapper.class);
           job.setCombinerClass(IntSumReducer.class);
           job.setReducerClass(IntSumReducer.class);
           job.setOutputKeyClass(Text.class);
           job.setOutputValueClass(IntWritable.class);
           FileInputFormat.addInputPath(job, new Path(args[0]));
           FileOutputFormat.setOutputPath(job, output);
           job.waitForCompletion(true);

           return 0;
       }

       public static void main(String[] args) throws Exception {
           ToolRunner.run(new WordCountHadoop(), args);
       }
   }

 Let's go through it piece by piece!
Word Count in Hadoop - Mapper


    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {

            StringTokenizer itr = new StringTokenizer(value.toString());
            while(itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }
Word Count in Hadoop - Reducer


    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
            int sum = 0;
            for(IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
Word Count in Hadoop – Configuration and execution

        if(args.length != 2) {
            System.err.println("Usage: wordcount-hadoop <in> <out>");
            System.exit(2);
        }

        Path output = new Path(args[1]);
        HadoopUtils.deleteIfExists(FileSystem.get(output.toUri(), getConf()), output);

        Job job = new Job(getConf(), "word count hadoop");
        job.setJarByClass(WordCountHadoop.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, output);
        job.waitForCompletion(true);
Execution of a MapReduce job

[Diagram: the blocks of the input file are read by Mappers running on each
node (Node 1, Node 2); the intermediate data is shuffled to the Reducers,
which write the final result.]
Serialization

 ●   Writables
     • Hadoop's native serialization
     • Very low level
     • Basic types: IntWritable, Text, etc.
 ●   Others
     • Thrift, Avro, Protostuff
     • Backwards compatibility
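To see what a Writable boils down to, here is a plain-Java sketch of the kind of fixed, low-level byte contract that IntWritable implements: an int serialized as four big-endian bytes. It uses only java.io, not the Hadoop classes, and the class name is ours, for illustration only.

```java
import java.io.*;

public class IntWritableSketch {

    // Serialize: write the int as 4 big-endian bytes,
    // the same wire format Hadoop's IntWritable uses.
    static byte[] serialize(int value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new DataOutputStream(bytes).writeInt(value);
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e); // cannot happen with in-memory streams
        }
    }

    // Deserialize: read the 4 bytes back into an int.
    static int deserialize(byte[] data) {
        try {
            return new DataInputStream(new ByteArrayInputStream(data)).readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = serialize(42);
        System.out.println(data.length + " bytes -> " + deserialize(data));
        // 4 bytes -> 42
    }
}
```

This compactness is why Writables are fast, and also why they are low level: there is no schema, no field names, and no built-in backwards compatibility, which is what Thrift, Avro and Protostuff add.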
Hadoop's
learning curve
  is steep
Tuple MapReduce

●   A simpler MapReduce
    –   Tuples instead of key/value pairs
    –   At the job level you define
         ●   The fields to group by
         ●   The fields to sort by
    –   Tuple MapReduce join
Pangool

●   An implementation of
    Tuple MapReduce
    –   Developed by Datasalt
    –   Open Source
    –   Performance comparable to
        Hadoop's
●   Goal: to replace the Hadoop
    API
●   If you want to learn
    Hadoop, start with
    Pangool
Pangool's performance

●   Comparable to Hadoop's

    See http://pangool.net/benchmark.html
Pangool – URL resolution

●   A join example
    –   Very hard in Hadoop. Easy in Pangool.
●   The problem:
    –   There are many URL shorteners and redirects
    –   For data analysis, it is often useful to replace each URL with its
        canonical URL
    –   Assume we have both datasets
        ●   A map with URL → canonical URL entries
        ●   A dataset with the URLs we want to resolve, plus other fields
    –   The following Pangool job solves the problem in a scalable way
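The join that the following slides implement with Pangool can be sketched in plain Java: for each register record, look up the canonical URL and rewrite the record. The field layout ({url, timestamp, ip}) mirrors the schemas defined next; the class name and the in-memory map are ours, for illustration only. Pangool performs the same matching in a distributed reduce, grouping both datasets by URL, precisely because the datasets do not fit in one machine's memory.

```java
import java.util.*;

public class UrlJoinSketch {

    // Join logic: the urlMap record (if any) provides the canonical URL
    // that replaces the URL in every register record with the same key.
    static List<String> resolve(Map<String, String> urlMap, List<String[]> registers) {
        List<String> resolved = new ArrayList<>();
        for (String[] reg : registers) {                    // reg = {url, timestamp, ip}
            String canonical = urlMap.getOrDefault(reg[0], reg[0]);
            resolved.add(canonical + "\t" + reg[1] + "\t" + reg[2]);
        }
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, String> urlMap = Map.of("http://bit.ly/x", "http://example.com/page");
        List<String[]> registers = List.of(
            new String[] {"http://bit.ly/x", "1000", "1.2.3.4"});
        System.out.println(resolve(urlMap, registers));
    }
}
```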
URL Resolution – Defining the schemas


    static Schema getURLRegisterSchema() {
        List<Field> urlRegisterFields = new ArrayList<Field>();
        urlRegisterFields.add(Field.create("url", Type.STRING));
        urlRegisterFields.add(Field.create("timestamp", Type.LONG));
        urlRegisterFields.add(Field.create("ip", Type.STRING));
        return new Schema("urlRegister", urlRegisterFields);
    }

    static Schema getURLMapSchema() {
        List<Field> urlMapFields = new ArrayList<Field>();
        urlMapFields.add(Field.create("url", Type.STRING));
        urlMapFields.add(Field.create("canonicalUrl", Type.STRING));
        return new Schema("urlMap", urlMapFields);
    }
URL Resolution – Loading the file to resolve


    public static class UrlProcessor extends TupleMapper<LongWritable, Text> {

        private Tuple tuple = new Tuple(getURLRegisterSchema());

        @Override
        public void map(LongWritable key, Text value, TupleMRContext context, Collector collector)
            throws IOException, InterruptedException {

            String[] fields = value.toString().split("\t");
            tuple.set("url", fields[0]);
            tuple.set("timestamp", Long.parseLong(fields[1]));
            tuple.set("ip", fields[2]);
            collector.write(tuple);
        }
    }
URL Resolution – Loading the URL map


    public static class UrlMapProcessor extends TupleMapper<LongWritable, Text> {

        private Tuple tuple = new Tuple(getURLMapSchema());

        @Override
        public void map(LongWritable key, Text value, TupleMRContext context, Collector collector)
            throws IOException, InterruptedException {

            String[] fields = value.toString().split("\t");
            tuple.set("url", fields[0]);
            tuple.set("canonicalUrl", fields[1]);
            collector.write(tuple);
        }
    }
URL Resolution – Resolving in the reducer

    public static class Handler extends TupleReducer<Text, NullWritable> {

        private Text result;

        @Override
        public void reduce(ITuple group, Iterable<ITuple> tuples, TupleMRContext context, Collector collector)
            throws IOException, InterruptedException, TupleMRException {
            if (result == null) {
                result = new Text();
            }
            String canonicalUrl = null;
            for(ITuple tuple : tuples) {
                if("urlMap".equals(tuple.getSchema().getName())) {
                    canonicalUrl = tuple.get("canonicalUrl").toString();
                } else {
                    result.set(canonicalUrl + "\t" + tuple.get("timestamp") + "\t" + tuple.get("ip"));
                    collector.write(result, NullWritable.get());
                }
            }
        }
    }
URL Resolution – Configuring and launching the job

  String input1 = args[0];
  String input2 = args[1];
  String output = args[2];

  deleteOutput(output);

  TupleMRBuilder mr = new TupleMRBuilder(conf, "Pangool Url Resolution");
  mr.addIntermediateSchema(getURLMapSchema());
  mr.addIntermediateSchema(getURLRegisterSchema());
  mr.setGroupByFields("url");
  mr.setOrderBy(
      new OrderBy().add("url", Order.ASC).addSchemaOrder(Order.ASC));
  mr.setTupleReducer(new Handler());
  mr.setOutput(new Path(output),
      new HadoopOutputFormat(TextOutputFormat.class),
      Text.class,
      NullWritable.class);
  mr.addInput(new Path(input1),
      new HadoopInputFormat(TextInputFormat.class),
      new UrlMapProcessor());
  mr.addInput(new Path(input2),
      new HadoopInputFormat(TextInputFormat.class),
      new UrlProcessor());
  mr.createJob().waitForCompletion(true);
 
Applying Retrieval-Augmented Generation (RAG) to Combat Hallucinations in GenAI
Applying Retrieval-Augmented Generation (RAG) to Combat Hallucinations in GenAIApplying Retrieval-Augmented Generation (RAG) to Combat Hallucinations in GenAI
Applying Retrieval-Augmented Generation (RAG) to Combat Hallucinations in GenAI
ssuserd4e0d2
 
CHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSE
CHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSECHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSE
CHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSE
kumarjarun2010
 
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
ArgaBisma
 
Tirana Tech Meetup - Agentic RAG with Milvus, Llama3 and Ollama
Tirana Tech Meetup - Agentic RAG with Milvus, Llama3 and OllamaTirana Tech Meetup - Agentic RAG with Milvus, Llama3 and Ollama
Tirana Tech Meetup - Agentic RAG with Milvus, Llama3 and Ollama
Zilliz
 
Litestack talk at Brighton 2024 (Unleashing the power of SQLite for Ruby apps)
Litestack talk at Brighton 2024 (Unleashing the power of SQLite for Ruby apps)Litestack talk at Brighton 2024 (Unleashing the power of SQLite for Ruby apps)
Litestack talk at Brighton 2024 (Unleashing the power of SQLite for Ruby apps)
Muhammad Ali
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
ishalveerrandhawa1
 
Observability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetryObservability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetry
Eric D. Schabell
 
WPRiders Company Presentation Slide Deck
WPRiders Company Presentation Slide DeckWPRiders Company Presentation Slide Deck
WPRiders Company Presentation Slide Deck
Lidia A.
 
Best Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdfBest Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdf
Tatiana Al-Chueyr
 
The Role of Technology in Payroll Statutory Compliance (1).pdf
The Role of Technology in Payroll Statutory Compliance (1).pdfThe Role of Technology in Payroll Statutory Compliance (1).pdf
The Role of Technology in Payroll Statutory Compliance (1).pdf
paysquare consultancy
 
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
Kief Morris
 
Choose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presenceChoose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presence
rajancomputerfbd
 
DealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 editionDealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 edition
Yevgen Sysoyev
 
Three New Criminal Laws in India 1 July 2024
Three New Criminal Laws in India 1 July 2024Three New Criminal Laws in India 1 July 2024
Three New Criminal Laws in India 1 July 2024
aakash malhotra
 
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptxRPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
SynapseIndia
 
find out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challengesfind out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challenges
huseindihon
 
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
maigasapphire
 
Use Cases & Benefits of RPA in Manufacturing in 2024.pptx
Use Cases & Benefits of RPA in Manufacturing in 2024.pptxUse Cases & Benefits of RPA in Manufacturing in 2024.pptx
Use Cases & Benefits of RPA in Manufacturing in 2024.pptx
SynapseIndia
 
Pigging Unit Lubricant Oil Blending Plant
Pigging Unit Lubricant Oil Blending PlantPigging Unit Lubricant Oil Blending Plant
Pigging Unit Lubricant Oil Blending Plant
LINUS PROJECTS (INDIA)
 

Recently uploaded (20)

Dublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptx
Dublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptxDublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptx
Dublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptx
 
Applying Retrieval-Augmented Generation (RAG) to Combat Hallucinations in GenAI
Applying Retrieval-Augmented Generation (RAG) to Combat Hallucinations in GenAIApplying Retrieval-Augmented Generation (RAG) to Combat Hallucinations in GenAI
Applying Retrieval-Augmented Generation (RAG) to Combat Hallucinations in GenAI
 
CHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSE
CHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSECHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSE
CHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSE
 
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
 
Tirana Tech Meetup - Agentic RAG with Milvus, Llama3 and Ollama
Tirana Tech Meetup - Agentic RAG with Milvus, Llama3 and OllamaTirana Tech Meetup - Agentic RAG with Milvus, Llama3 and Ollama
Tirana Tech Meetup - Agentic RAG with Milvus, Llama3 and Ollama
 
Litestack talk at Brighton 2024 (Unleashing the power of SQLite for Ruby apps)
Litestack talk at Brighton 2024 (Unleashing the power of SQLite for Ruby apps)Litestack talk at Brighton 2024 (Unleashing the power of SQLite for Ruby apps)
Litestack talk at Brighton 2024 (Unleashing the power of SQLite for Ruby apps)
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
 
Observability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetryObservability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetry
 
WPRiders Company Presentation Slide Deck
WPRiders Company Presentation Slide DeckWPRiders Company Presentation Slide Deck
WPRiders Company Presentation Slide Deck
 
Best Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdfBest Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdf
 
The Role of Technology in Payroll Statutory Compliance (1).pdf
The Role of Technology in Payroll Statutory Compliance (1).pdfThe Role of Technology in Payroll Statutory Compliance (1).pdf
The Role of Technology in Payroll Statutory Compliance (1).pdf
 
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
 
Choose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presenceChoose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presence
 
DealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 editionDealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 edition
 
Three New Criminal Laws in India 1 July 2024
Three New Criminal Laws in India 1 July 2024Three New Criminal Laws in India 1 July 2024
Three New Criminal Laws in India 1 July 2024
 
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptxRPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
 
find out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challengesfind out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challenges
 
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
 
Use Cases & Benefits of RPA in Manufacturing in 2024.pptx
Use Cases & Benefits of RPA in Manufacturing in 2024.pptxUse Cases & Benefits of RPA in Manufacturing in 2024.pptx
Use Cases & Benefits of RPA in Manufacturing in 2024.pptx
 
Pigging Unit Lubricant Oil Blending Plant
Pigging Unit Lubricant Oil Blending PlantPigging Unit Lubricant Oil Blending Plant
Pigging Unit Lubricant Oil Blending Plant
 

Introducción a hadoop

  • 1. Introduction to Hadoop: the data bazooka. Iván de Prado Alonso // @ivanprado // @datasalt
  • 2. Datasalt. Focused on Big Data: contributions to open source, consulting & development, training.
  • 4. Anatomy of a Big Data project: acquisition → processing → serving.
  • 5. Types of Big Data systems. Offline: latency is not an issue. Online: data freshness matters. Mixed: the most common case. Typical offline tools: MapReduce, Hadoop, distributed RDBMSs. Typical online tools: NoSQL databases, search engines.
  • 6. “Swiss army knife of the 21st century” (Media Guardian Innovation Awards). http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop
  • 7. History. 2004-2006: Google publishes the GFS and MapReduce papers; Doug Cutting implements an open-source version in Nutch. 2006-2008: Hadoop splits off from Nutch; web scale is reached in 2008. 2008-present: Hadoop becomes popular and starts to be exploited commercially. Source: "Hadoop: a brief history", Doug Cutting.
  • 8. Hadoop. “The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model” (from the Hadoop website).
  • 9. Distributed file system (HDFS). Large blocks (64 MB) stored on the OS file system; fault tolerant (via replication). Common formats: plain-text files (CSV) and SequenceFiles (sequences of [key, value] pairs).
  • 10. MapReduce. Two functions: Map(k, v) : [z, w]* and Reduce(k, v*) : [z, w]*. Example, word count: Map([document, null]) -> [word, 1]*; Reduce(word, 1*) -> [word, total]. The SQL equivalent: SELECT word, COUNT(*) ... GROUP BY word. Distributed execution on a cluster with horizontal scalability.
  • 11. The classic Word Count. Input lines: "Esto es una linea", "Esto también".
    Map:
      map("Esto es una linea") = esto, 1; es, 1; una, 1; linea, 1
      map("Esto también") = esto, 1; también, 1
    Reduce:
      reduce(es, {1}) = es, 1
      reduce(esto, {1, 1}) = esto, 2
      reduce(linea, {1}) = linea, 1
      reduce(también, {1}) = también, 1
      reduce(una, {1}) = una, 1
    Result: es, 1; esto, 2; linea, 1; también, 1; una, 1
  • 12. Word Count in Hadoop. The whole job at a glance (let's go through it piece by piece!):

    public class WordCountHadoop extends Configured implements Tool {

      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while(itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
          int sum = 0;
          for(IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      @Override
      public int run(String[] args) throws Exception {
        if(args.length != 2) {
          System.err.println("Usage: wordcount-hadoop <in> <out>");
          System.exit(2);
        }
        Path output = new Path(args[1]);
        HadoopUtils.deleteIfExists(FileSystem.get(output.toUri(), getConf()), output);
        Job job = new Job(getConf(), "word count hadoop");
        job.setJarByClass(WordCountHadoop.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, output);
        job.waitForCompletion(true);
        return 0;
      }

      public static void main(String[] args) throws Exception {
        ToolRunner.run(new WordCountHadoop(), args);
      }
    }
  • 13. Word Count in Hadoop – the Mapper:

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while(itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);
        }
      }
    }
  • 14. Word Count in Hadoop – the Reducer:

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      private IntWritable result = new IntWritable();

      public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for(IntWritable val : values) {
          sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
      }
    }
  • 15. Word Count in Hadoop – configuration and execution:

    if(args.length != 2) {
      System.err.println("Usage: wordcount-hadoop <in> <out>");
      System.exit(2);
    }
    Path output = new Path(args[1]);
    HadoopUtils.deleteIfExists(FileSystem.get(output.toUri(), getConf()), output);
    Job job = new Job(getConf(), "word count hadoop");
    job.setJarByClass(WordCountHadoop.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, output);
    job.waitForCompletion(true);
  • 16. Execution of a MapReduce job (diagram): blocks of the input file → mappers (on nodes 1 and 2) → intermediate data → reducers (on nodes 1 and 2) → result.
  • 17. Serialization. Writables: Hadoop's native serialization; very low level; basic types such as IntWritable, Text, etc. Alternatives: Thrift, Avro, Protostuff, which offer backwards compatibility.
  • 18. Hadoop's learning curve is steep.
  • 19. Tuple MapReduce: a simpler MapReduce. Tuples instead of key/value pairs; at the job level you declare the fields to group by and the fields to sort by; Tuple MapReduce-join.
  • 20. Pangool: an implementation of Tuple MapReduce. Developed by Datasalt; open source; performance comparable to Hadoop's. Goal: to replace the Hadoop API. If you want to learn Hadoop, start with Pangool.
  • 21. Pangool's performance: comparable to Hadoop's. See http://pangool.net/benchmark.html
  • 22. Pangool – URL resolution. A join example: very hard in Hadoop, easy in Pangool. The problem: there are many URL shorteners and redirects, and for data analysis it is often useful to replace each URL with its canonical URL. Suppose we have both datasets: a map with URL → canonical URL entries, and a dataset of URLs to resolve plus other fields. The following Pangool job solves the problem in a scalable way.
  • 23. URL resolution – defining the schemas:

    static Schema getURLRegisterSchema() {
      List<Field> urlRegisterFields = new ArrayList<Field>();
      urlRegisterFields.add(Field.create("url", Type.STRING));
      urlRegisterFields.add(Field.create("timestamp", Type.LONG));
      urlRegisterFields.add(Field.create("ip", Type.STRING));
      return new Schema("urlRegister", urlRegisterFields);
    }

    static Schema getURLMapSchema() {
      List<Field> urlMapFields = new ArrayList<Field>();
      urlMapFields.add(Field.create("url", Type.STRING));
      urlMapFields.add(Field.create("canonicalUrl", Type.STRING));
      return new Schema("urlMap", urlMapFields);
    }
  • 24. URL resolution – loading the file to resolve:

    public static class UrlProcessor extends TupleMapper<LongWritable, Text> {
      private Tuple tuple = new Tuple(getURLRegisterSchema());

      @Override
      public void map(LongWritable key, Text value, TupleMRContext context, Collector collector)
          throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        tuple.set("url", fields[0]);
        tuple.set("timestamp", Long.parseLong(fields[1]));
        tuple.set("ip", fields[2]);
        collector.write(tuple);
      }
    }
  • 25. URL resolution – loading the URL map:

    public static class UrlMapProcessor extends TupleMapper<LongWritable, Text> {
      private Tuple tuple = new Tuple(getURLMapSchema());

      @Override
      public void map(LongWritable key, Text value, TupleMRContext context, Collector collector)
          throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        tuple.set("url", fields[0]);
        tuple.set("canonicalUrl", fields[1]);
        collector.write(tuple);
      }
    }
  • 26. URL resolution – resolving in the reducer:

    public static class Handler extends TupleReducer<Text, NullWritable> {
      private Text result;

      @Override
      public void reduce(ITuple group, Iterable<ITuple> tuples, TupleMRContext context, Collector collector)
          throws IOException, InterruptedException, TupleMRException {
        if (result == null) {
          result = new Text();
        }
        String canonicalUrl = null;
        for(ITuple tuple : tuples) {
          if("urlMap".equals(tuple.getSchema().getName())) {
            canonicalUrl = tuple.get("canonicalUrl").toString();
          } else {
            result.set(canonicalUrl + "\t" + tuple.get("timestamp") + "\t" + tuple.get("ip"));
            collector.write(result, NullWritable.get());
          }
        }
      }
    }
  • 27. URL resolution – configuring and launching the job:

    String input1 = args[0];
    String input2 = args[1];
    String output = args[2];

    deleteOutput(output);

    TupleMRBuilder mr = new TupleMRBuilder(conf, "Pangool Url Resolution");
    mr.addIntermediateSchema(getURLMapSchema());
    mr.addIntermediateSchema(getURLRegisterSchema());
    mr.setGroupByFields("url");
    mr.setOrderBy(new OrderBy().add("url", Order.ASC).addSchemaOrder(Order.ASC));
    mr.setTupleReducer(new Handler());
    mr.setOutput(new Path(output), new HadoopOutputFormat(TextOutputFormat.class),
        Text.class, NullWritable.class);
    mr.addInput(new Path(input1), new HadoopInputFormat(TextInputFormat.class), new UrlMapProcessor());
    mr.addInput(new Path(input2), new HadoopInputFormat(TextInputFormat.class), new UrlProcessor());
    mr.createJob().waitForCompletion(true);
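The map and reduce functions of slide 10 need no cluster to understand. The following plain-Java sketch (class and method names are ours, not part of any Hadoop API) simulates the shuffle step with an in-memory map, grouping the pairs emitted by map before handing each group to reduce:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Minimal in-memory simulation of the MapReduce word count:
// map emits (word, 1) pairs, the "shuffle" groups them by key,
// and reduce sums the values of each group.
public class WordCountSketch {

    // map(k, v) : [z, w]* -- one (word, 1) pair per token in the line
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            out.add(Map.entry(word, 1));
        }
        return out;
    }

    // reduce(k, v*) : [z, w]* -- here the output is just the total count
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Drives map, simulates the shuffle (group by key), then calls reduce.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((word, counts) -> result.put(word, reduce(word, counts)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("Esto es una linea", "Esto también")));
    }
}
```

On a real cluster the grouping is done by the framework across machines, which is exactly what makes the model scale horizontally.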
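What slide 17 calls "very low level" can be seen without Hadoop: a Writable serializes itself field by field to a DataOutput, and the reader must consume the bytes back in exactly the same order. A dependency-free sketch of the idea (the class is illustrative; IntWritable.write() essentially does the writeInt call, and Text uses a similar length-prefixed encoding to writeUTF):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Writable-style serialization: raw bytes, no schema, no field names --
// whoever reads must know the exact order and types of the fields.
public class WritableSketch {

    static byte[] write(int count, String word) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(count);   // 4 raw bytes, like IntWritable
        out.writeUTF(word);    // length-prefixed string, like Text
        return bytes.toByteArray();
    }

    static Object[] read(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        // Must mirror write() exactly: int first, then the string.
        return new Object[] { in.readInt(), in.readUTF() };
    }

    public static void main(String[] args) throws IOException {
        Object[] back = read(write(42, "hadoop"));
        System.out.println(back[0] + " " + back[1]);
    }
}
```

The lack of schema and field names is why formats such as Thrift, Avro or Protostuff, which carry that metadata, make backwards compatibility much easier.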
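The effect of the URL-resolution join of slides 22-27, simulated in memory (names here are illustrative; in the real job the grouping by "url" happens in the shuffle, and the urlMap tuple of each group arrives before the register tuples thanks to the schema order):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Effect of the reduce-side join: every register row (url, timestamp, ip)
// is rewritten with the canonical URL when the urlMap dataset has an
// entry for it; otherwise the original URL is kept.
public class UrlJoinSketch {

    static List<String> join(Map<String, String> urlMap, List<String[]> registers) {
        List<String> out = new ArrayList<>();
        for (String[] reg : registers) {  // reg = {url, timestamp, ip}
            String canonical = urlMap.getOrDefault(reg[0], reg[0]);
            out.add(canonical + "\t" + reg[1] + "\t" + reg[2]);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> urlMap = Map.of("http://bit.ly/x", "http://example.com/page");
        List<String[]> registers = List.of(new String[] {"http://bit.ly/x", "1000", "1.2.3.4"});
        System.out.println(join(urlMap, registers));
    }
}
```

The in-memory lookup table is exactly what does not scale when the URL map has billions of entries; the Pangool job obtains the same result with both datasets streamed through the shuffle.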