Anatomy of distributed computing with Hadoop
What is Hadoop?
   Hadoop started out as a subproject of Nutch, created by
    Doug Cutting

   Hadoop boosted Nutch’s scalability

   Enhanced by Yahoo!, it became an Apache top-level
    project

   System for distributed big data processing
       Big data means terabytes, petabytes and more…
       Exabyte and zettabyte datasets?
Why does anyone need Hadoop?
Hadoop use cases
Hadoop basics
 Implements Google’s MapReduce whitepaper:
   http://research.google.com/archive/mapreduce.html

 Hadoop is a combination of:
   HDFS: storage
   MapReduce: computation
HDFS
Hadoop Distributed File System
   It’s a file system
    bin/hadoop dfs <command> <options>



                   <command>
cat              expunge         put
chgrp            get             rm
chmod            getmerge        rmr
chown            ls              setrep
copyFromLocal    lsr             stat
copyToLocal      mkdir           tail
cp               moveFromLocal   test
du               moveToLocal     text
dus              mv              touchz
Hadoop Distributed File System
   It’s accessible
Hadoop Distributed File System
   It’s distributed
   It employs a master/slave architecture
Hadoop Distributed File System
   Name Node:
    Stores file system metadata

   Secondary Name Node(s):
    Periodically merges the file system image with the edit log

   Data Node(s):
    Stores actual data (blocks)
    Allows data to be replicated
MapReduce
      A programming model for distributed data
       processing

      The data processing primitives are functions:
             Mappers and Reducers
MapReduce

    To decompose a problem for MapReduce, think of data in
    terms of keys and values:

<key, value>
<user id, user profile>
<timestamp, apache log entry>
<tag, list of tagged images>
MapReduce
 Mapper
 Function that takes a key and a value and emits
 zero or more keys and values

 Reducer
 Function that takes a key and all “mapped”
 values for it and emits zero or more new keys and
 values
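The Mapper and Reducer contracts can be sketched in plain Java without any Hadoop dependency. This is only an in-memory illustration of the model, not Hadoop's API: the names `MiniMapReduce`, `MapFn`, `ReduceFn`, `run` and `wordCount` are hypothetical, and grouping is done with a sorted map instead of a distributed shuffle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

/** In-memory illustration of map -> group by key -> reduce. Not Hadoop code. */
final class MiniMapReduce {

    /** Takes one <key, value> pair and emits zero or more intermediate pairs. */
    interface MapFn<K1, V1, K2, V2> {
        void map(K1 key, V1 value, BiConsumer<K2, V2> emit);
    }

    /** Takes one key with all of its mapped values and emits zero or more output pairs. */
    interface ReduceFn<K2, V2, K3, V3> {
        void reduce(K2 key, List<V2> values, BiConsumer<K3, V3> emit);
    }

    static <K1, V1, K2 extends Comparable<K2>, V2, K3, V3> List<Map.Entry<K3, V3>> run(
            List<Map.Entry<K1, V1>> input,
            MapFn<K1, V1, K2, V2> mapper,
            ReduceFn<K2, V2, K3, V3> reducer) {
        // Map phase: intermediate values are grouped by key; TreeMap keeps keys sorted.
        Map<K2, List<V2>> grouped = new TreeMap<>();
        for (Map.Entry<K1, V1> record : input) {
            mapper.map(record.getKey(), record.getValue(),
                    (k, v) -> grouped.computeIfAbsent(k, unused -> new ArrayList<>()).add(v));
        }
        // Reduce phase: one reduce call per distinct key.
        List<Map.Entry<K3, V3>> output = new ArrayList<>();
        for (Map.Entry<K2, List<V2>> group : grouped.entrySet()) {
            reducer.reduce(group.getKey(), group.getValue(),
                    (k, v) -> output.add(Map.entry(k, v)));
        }
        return output;
    }

    /** Word count expressed with the two functions above (line number is the input key). */
    static List<Map.Entry<String, Integer>> wordCount(List<String> lines) {
        List<Map.Entry<Integer, String>> input = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            input.add(Map.entry(i, lines.get(i)));
        }
        MapFn<Integer, String, String, Integer> mapper = (lineNo, line, emit) -> {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    emit.accept(word, 1);
                }
            }
        };
        ReduceFn<String, Integer, String, Integer> reducer = (word, ones, emit) -> {
            int sum = 0;
            for (int one : ones) {
                sum += one;
            }
            emit.accept(word, sum);
        };
        return run(input, mapper, reducer);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("hello world", "hello hadoop")));
    }
}
```

In real Hadoop the grouping step runs across machines, but the shape of the two functions is the same.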
MapReduce example
 “Hello World” for Hadoop:
       http://wiki.apache.org/hadoop/WordCount


 “Tag Cloud” example for Hadoop:

 tag1 tag2 tag3
 tag1 tag3          →  weight(tagi)
 tag3
 tag4 tag5 tag6
Tag Cloud example
   Input is taggable content (images, posts,
    videos) with space-separated tags:
    <posti, “tag1 tag2 … tagn”>

   Output is each tagi with its count, plus the total number of tags:
    <tagi, tag count>
    <total tags, total tags count>

   Results:
    weight(tagi)=tagi count/total tags
    font(tagi)=fn(weight(tagi))
Tag Cloud Mapper
   Mapper extends class:
    org.apache.hadoop.mapreduce.Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>

   Mapper input:
       <post1, “tag1 tag3”>
       <post2, “tag3”>
       <post3, “tag2 tag3 tag4”>
       <post4, “tag1 tag2 tag3”>

   Simplify the model and make the line number the key:
       <line1, “tag1 tag3”>
       <line2, “tag3”>
       <line3, “tag2 tag3 tag4”>
       <line4, “tag1 tag2 tag3”>

   Write the raw tags to the input file, one line per post
Tag Cloud Mapper
   Mapper input:
       <0, “tag1 tag3”>
       <1, “tag3”>
       <2, “tag2 tag3 tag4”>
       <3, “tag1 tag2 tag3”>

   Read values (tags) from the file; the line number is the key:
       “tag1 tag3” // space-separated tags

       // body of map(key, value, context)
       String line = value.toString();
       StringTokenizer tokenizer = new StringTokenizer(line, " ");
       // emit the number of tags on this line under the shared total key
       context.write(TOTAL_TAGS_KEY,
                     new IntWritable(tokenizer.countTokens()));
       while (tokenizer.hasMoreTokens()) {
           Text tag = new Text(tokenizer.nextToken());
           context.write(tag, new IntWritable(1)); // emit <tag, 1> as intermediate output
       }

   Mapper output, written by context.write():
       <“total tags”, 2>
       <“tag1”, 1>
       <“tag3”, 1>

       <“total tags”, 1>
       <“tag3”, 1>

       <“total tags”, 3>
       <“tag2”, 1>
       <“tag3”, 1>
       <“tag4”, 1>

       <“total tags”, 3>
       <“tag1”, 1>
       <“tag2”, 1>
       <“tag3”, 1>
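To try the map step without a Hadoop installation, the same tokenizing logic can be reproduced in plain Java. This is a stand-in sketch, assuming `String` and `Integer` in place of Hadoop's `Text` and `IntWritable`; the class name `TagCloudMapSketch` and the returned list (in place of `context.write()`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

/** Plain-Java stand-in for the Tag Cloud map step (no Hadoop types). */
final class TagCloudMapSketch {
    static final String TOTAL_TAGS_KEY = "total tags";

    /** Returns the pairs a mapper would emit for one line of space-separated tags. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line, " ");
        // First pair: how many tags this line contributes to the grand total.
        emitted.add(Map.entry(TOTAL_TAGS_KEY, tokenizer.countTokens()));
        // Then one <tag, 1> pair per tag on the line.
        while (tokenizer.hasMoreTokens()) {
            emitted.add(Map.entry(tokenizer.nextToken(), 1));
        }
        return emitted;
    }

    public static void main(String[] args) {
        for (String line : List.of("tag1 tag3", "tag3", "tag2 tag3 tag4", "tag1 tag2 tag3")) {
            System.out.println(line + " -> " + map(line));
        }
    }
}
```

Note that `countTokens()` is called before the loop consumes any tokens, so it reports the full number of tags on the line.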
Reducer phases
   1. Shuffle or Copy phase:
    Copies output from the Mappers to the Reducer’s local file system

   2. Sort phase:
    Sorts Mapper output by key. This becomes the Reducer input

    Mapper output:
           <“total tags”, 2>
           <“tag1”, 1>
           <“tag3”, 1>

           <“total tags”, 1>
           <“tag3”, 1>

           <“total tags”, 3>
           <“tag2”, 1>
           <“tag3”, 1>
           <“tag4”, 1>

           <“total tags”, 3>
           <“tag1”, 1>
           <“tag2”, 1>
           <“tag3”, 1>

    Reducer input (after shuffle and sort by key):
           <“tag1”, 1>
           <“tag1”, 1>

           <“tag2”, 1>
           <“tag2”, 1>

           <“tag3”, 1>
           <“tag3”, 1>
           <“tag3”, 1>
           <“tag3”, 1>

           <“tag4”, 1>

           <“total tags”, 2>
           <“total tags”, 1>
           <“total tags”, 3>
           <“total tags”, 3>

   3. Reduce or Emit phase:
    Performs reduce() for each group of sorted <key, value> inputs
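The effect of the shuffle and sort phases on the Tag Cloud data can be imitated in memory. This is a single-process sketch only: a `TreeMap` stands in for Hadoop's distributed copy and merge sort, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Single-process imitation of shuffle + sort: group mapper output by key, keys sorted. */
final class ShuffleSortSketch {

    static SortedMap<String, List<Integer>> shuffleAndSort(
            List<Map.Entry<String, Integer>> mapperOutput) {
        SortedMap<String, List<Integer>> reducerInput = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            // All values that share a key end up in the same group.
            reducerInput.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                    .add(pair.getValue());
        }
        return reducerInput;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapperOutput = List.of(
                Map.entry("total tags", 2), Map.entry("tag1", 1), Map.entry("tag3", 1),
                Map.entry("total tags", 1), Map.entry("tag3", 1),
                Map.entry("total tags", 3), Map.entry("tag2", 1), Map.entry("tag3", 1),
                Map.entry("tag4", 1),
                Map.entry("total tags", 3), Map.entry("tag1", 1), Map.entry("tag2", 1),
                Map.entry("tag3", 1));
        shuffleAndSort(mapperOutput)
                .forEach((key, values) -> System.out.println(key + " -> " + values));
    }
}
```

Each entry of the returned map corresponds to one reduce() call: a key together with every value emitted for it.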
Tag Cloud Reduce phase
   Reducer extends class:
    org.apache.hadoop.mapreduce.Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>

   Reducer input, pairs grouped by tagi:
       <“tag1”, 1>
       <“tag1”, 1>

       <“tag2”, 1>
       <“tag2”, 1>

       <“tag3”, 1>
       <“tag3”, 1>
       <“tag3”, 1>
       <“tag3”, 1>

       <“tag4”, 1>

       <“total tags”, 2>
       <“total tags”, 1>
       <“total tags”, 3>
       <“total tags”, 3>

   For each key the Reducer receives its grouped values, e.g. [<“tag1”, 1>, <“tag1”, 1>]:
       // body of reduce(key, values, context)
       int tagsCount = 0;
       for (IntWritable value : values) {
           tagsCount += value.get();
       }
       context.write(key, new IntWritable(tagsCount));

   Reducer output, written by context.write():
       <tag1, 2>
       <tag2, 2>
       <tag3, 4>
       <tag4, 1>
       <total tags, 9>
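The summing logic of the reduce step can be checked in isolation with plain Java. This is a sketch under the assumption that `Integer` stands in for `IntWritable` and that the grouped input is already available as a map; `TagCloudReduceSketch` and `reduceAll` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java stand-in for the Tag Cloud reduce step (no Hadoop types). */
final class TagCloudReduceSketch {

    /** Sums all values that were grouped under one key. */
    static int reduce(String key, Iterable<Integer> values) {
        int tagsCount = 0;
        for (int value : values) {
            tagsCount += value;
        }
        return tagsCount;
    }

    /** Applies reduce() once per key, preserving the key order of the input. */
    static Map<String, Integer> reduceAll(Map<String, List<Integer>> reducerInput) {
        Map<String, Integer> output = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> group : reducerInput.entrySet()) {
            output.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> reducerInput = new TreeMap<>();
        reducerInput.put("tag1", List.of(1, 1));
        reducerInput.put("tag2", List.of(1, 1));
        reducerInput.put("tag3", List.of(1, 1, 1, 1));
        reducerInput.put("tag4", List.of(1));
        reducerInput.put("total tags", List.of(2, 1, 3, 3));
        System.out.println(reduceAll(reducerInput));
    }
}
```

The same function handles both kinds of key: per-tag groups contain only ones, while the "total tags" group contains the per-line counts.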
Tag Cloud Output
    Reducer output is a list of tag counts plus the total:
    <tag1, 2>
    <tag2, 2>
    <tag3, 4>
    <tag4, 1>
    <total tags, 9>
   Tag’s weight:
    weight(tagi)=tagi count/total tags

    <weight(tag1), 2/9>
    <weight(tag2), 2/9>
    <weight(tag3), 4/9>
    <weight(tag4), 1/9>

   Size of font:
    font(tagi)=fn(weight(tagi))
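A small post-processing sketch shows how weights and font sizes could be derived from the reducer output. The slides leave fn unspecified, so the linear mapping between a minimum and maximum point size used here is purely an assumption, as are the names `TagWeightSketch`, `weights` and `fontSize`.

```java
import java.util.Map;
import java.util.TreeMap;

/** Derives weight(tag) = count / total from reducer output, then a font size. */
final class TagWeightSketch {
    static final String TOTAL_TAGS_KEY = "total tags";

    /** Returns tag -> weight, excluding the "total tags" entry itself. */
    static Map<String, Double> weights(Map<String, Integer> reducerOutput) {
        Integer total = reducerOutput.get(TOTAL_TAGS_KEY);
        if (total == null || total <= 0) {
            throw new IllegalArgumentException("reducer output needs a positive total");
        }
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : reducerOutput.entrySet()) {
            if (!TOTAL_TAGS_KEY.equals(entry.getKey())) {
                result.put(entry.getKey(), entry.getValue() / total.doubleValue());
            }
        }
        return result;
    }

    /** Assumed fn: linear interpolation between minPt (weight 0) and maxPt (weight 1). */
    static double fontSize(double weight, double minPt, double maxPt) {
        return minPt + weight * (maxPt - minPt);
    }

    public static void main(String[] args) {
        Map<String, Integer> reducerOutput = Map.of(
                "tag1", 2, "tag2", 2, "tag3", 4, "tag4", 1, TOTAL_TAGS_KEY, 9);
        weights(reducerOutput).forEach((tag, weight) ->
                System.out.printf("%s weight=%.3f font=%.1fpt%n",
                        tag, weight, fontSize(weight, 10.0, 30.0)));
    }
}
```

Any monotonic function would serve as fn; a logarithmic scale is a common alternative when a few tags dominate.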
Between Map and Reduce
   Combiner:
     extends class
    org.apache.hadoop.mapreduce.Reducer
     works as an in-memory Reducer on the Mapper side
     serves as an additional optimization

    Mapper output:
        <“total tags”, 2>
        <“tag1”, 1>
        <“total tags”, 1>
        <“tag1”, 1>
        <“tag3”, 1>

    Combiner output (in-memory combine):
        <“total tags”, 3>
        <“tag1”, 2>
        <“tag3”, 1>

   Partitioner:
     extends class
    org.apache.hadoop.mapreduce.Partitioner
     assigns each intermediate <key, value> pair from the
    Mapper to its designated Reducer partition
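Both ideas can be sketched in plain Java. The combiner below pre-sums values per key before anything leaves the map side, and the partitioner uses the hash-modulo scheme that Hadoop's default hash partitioner is based on. `String` keys stand in for Hadoop's `Text` here (so actual hash values will differ from a real job), and `CombinePartitionSketch` is a hypothetical name.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of a summing combiner and a hash partitioner. */
final class CombinePartitionSketch {

    /** Combiner: sums values per key locally, shrinking what is sent to the reducers. */
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapperOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    /** Partitioner: the same key always maps to the same partition in [0, numPartitions). */
    static int partition(String key, int numPartitions) {
        // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapperOutput = List.of(
                Map.entry("total tags", 2), Map.entry("tag1", 1),
                Map.entry("total tags", 1), Map.entry("tag1", 1), Map.entry("tag3", 1));
        Map<String, Integer> combined = combine(mapperOutput);
        System.out.println("combined: " + combined);
        for (String key : combined.keySet()) {
            System.out.println(key + " -> partition " + partition(key, 2));
        }
    }
}
```

Summation is safe to use as a combiner because it is associative and commutative; the reducer produces the same totals whether or not the combiner ran.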
Time for a Workshop
   Standalone mode
   Build “Tag Cloud” project jar:
cd $TAG_CLOUD_HOME
mvn clean install

  Check input directory:
$HADOOP_HOME/bin/hadoop fs -ls $TAG_CLOUD_HOME/input/

  Check input file:
$HADOOP_HOME/bin/hadoop fs -cat $TAG_CLOUD_HOME/input/tags01

  Submit TagCloudJob to Hadoop:
$HADOOP_HOME/bin/hadoop jar $TAG_CLOUD_HOME/target/tagcloud-1.0.jar \
    com.altoros.rnd.hadoop.tagcloud.TagCloudJob $TAG_CLOUD_HOME/input \
    $TAG_CLOUD_HOME/output

  Check output directory:
$HADOOP_HOME/bin/hadoop fs -ls $TAG_CLOUD_HOME/output/

  Check output file:
$HADOOP_HOME/bin/hadoop fs -cat $TAG_CLOUD_HOME/output/part-r-00000
Apache Pig
   Higher-level data processing layer on top
    of Hadoop
   Data-flow oriented language (pig scripts)
   Data types include sets, associative
    arrays, tuples
   Developed at Yahoo!
Apache Hive
   Feature set is similar to Pig
   SQL-like data warehouse infrastructure
   Language is more strictly SQL
   Supports SELECT, JOIN, GROUP BY, etc.
   Developed at Facebook
Apache HBase
    Column-oriented database (modeled after Google
     Bigtable)
    HDFS is the underlying file system
    Holds extremely large datasets (multi-TB)
    Constrained access model
Apache Mahout
     Scalable machine learning algorithms on
      top of Hadoop:
     – filtering,
     – recommendations,
     – classifiers,
     – clustering
Apache ZooKeeper
     Common services for distributed
      applications:
      - group services,
      - configuration management,
      - naming services,
      - synchronization
Oozie
   Workflow engine for Hadoop
   Orchestrates dependencies between
    jobs running on Hadoop (including HDFS,
    Pig and MapReduce)
   Another query processing API
   Developed at Yahoo!
Apache Chukwa
    System for reliable large-scale log
     collection
    Displaying, monitoring and analyzing results
    Built on top of the Hadoop Distributed File
     System (HDFS) and Map/Reduce
    Incubated at apache.org
Questions

             links:
http://www.slideshare.net/tazija/anatomy-of-distributed-computing-with-hadoop
https://github.com/tazija/TagCloud

             skype: siarhei_bushyk
             mailto: tazija@gmail.com
             mailto: sergey.bushik@altoros.com

Scaling API-first – The story of a global engineering organization
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 

Anatomy of distributed computing with Hadoop

  • 2. What is Hadoop?
      - Hadoop started out as a subproject of Nutch by Doug Cutting
      - Hadoop boosted Nutch’s scalability
      - Enhanced by Yahoo!, it became an Apache top-level project
      - System for distributed big data processing
      - Big data means terabytes, petabytes and more… exabyte and zettabyte datasets?
  • 7. Hadoop basics
      - Implements the MapReduce model from Google’s whitepaper: http://research.google.com/archive/mapreduce.html
      - Hadoop is a combination of:
          HDFS         storage
          MapReduce    computation
  • 8. HDFS: Hadoop Distributed File System
      - It’s a file system: bin/hadoop dfs <command> <options>
      - <command> is one of:
          cat              expunge          put
          chgrp            get              rm
          chmod            getmerge         rmr
          chown            ls               setrep
          copyFromLocal    lsr              stat
          copyToLocal      mkdir            tail
          cp               moveFromLocal    test
          du               moveToLocal      text
          dus              mv               touchz
  • 9. Hadoop Distributed File System
      - It’s accessible
  • 10. Hadoop Distributed File System
      - It’s distributed
      - It employs a master/slave architecture
  • 11. Hadoop Distributed File System
      - Name Node: stores file system metadata
      - Secondary Name Node(s): periodically merge the file system image with the edit log
      - Data Node(s): store the actual data (blocks) and allow the data to be replicated
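The split between metadata (Name Node) and block storage (Data Nodes) can be pictured with a toy model. This is only a sketch under stated assumptions: the class `ToyNameNode`, its methods and the round-robin placement are hypothetical and are not part of Hadoop, whose real placement policy is rack-aware and considerably more involved.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of Name Node metadata. Hypothetical names; not the Hadoop API. */
final class ToyNameNode {
    private final List<String> dataNodes;
    private final int replication;
    // Metadata only: file path -> for each block, the Data Nodes holding a replica.
    private final Map<String, List<List<String>>> blockLocations = new LinkedHashMap<>();
    private int next = 0;

    ToyNameNode(List<String> dataNodes, int replication) {
        if (replication < 1 || replication > dataNodes.size()) {
            throw new IllegalArgumentException("replication must be between 1 and the number of Data Nodes");
        }
        this.dataNodes = new ArrayList<>(dataNodes);
        this.replication = replication;
    }

    /** Splits a file into fixed-size blocks and assigns each block to `replication` distinct nodes. */
    List<List<String>> addFile(String path, long lengthBytes, long blockSizeBytes) {
        // Ceiling division: a 300-byte file with 128-byte blocks needs 3 blocks.
        int blocks = (int) ((lengthBytes + blockSizeBytes - 1) / blockSizeBytes);
        List<List<String>> locations = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                replicas.add(dataNodes.get((next + r) % dataNodes.size()));
            }
            next = (next + 1) % dataNodes.size(); // simple round-robin, an assumption of this sketch
            locations.add(replicas);
        }
        blockLocations.put(path, locations);
        return locations;
    }

    /** Answers "where are the blocks of this file?" without touching any data. */
    List<List<String>> locate(String path) {
        return blockLocations.get(path);
    }
}
```

The point of the sketch is that a client asks the Name Node only for locations, then reads the blocks from the Data Nodes directly; losing one Data Node still leaves a replica of every block when the replication factor is 2 or more.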
  • 12. MapReduce
      - A programming model for distributed data processing
      - Its data processing primitives are functions: Mappers and Reducers
  • 13. MapReduce
      - To decompose MapReduce, think of data in terms of keys and values:
          <key, value>
          <user id, user profile>
          <timestamp, apache log entry>
          <tag, list of tagged images>
  • 14. MapReduce
      - Mapper: function that takes a key and a value and emits zero or more keys and values
      - Reducer: function that takes a key and all “mapped” values for that key and emits zero or more new keys and values
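The Mapper and Reducer definitions above can be written down as two small interfaces plus a single-process driver. This is a minimal sketch in plain Java, not the Hadoop API: `MiniMapReduce`, `MapFn`, `ReduceFn` and `run` are names invented for illustration, and the driver does everything in memory on one machine.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

/** Toy, single-process model of MapReduce. Hypothetical names; not the Hadoop API. */
final class MiniMapReduce {

    /** Takes one <key, value> pair and emits zero or more intermediate pairs. */
    interface MapFn<K1, V1, K2, V2> {
        void map(K1 key, V1 value, BiConsumer<K2, V2> emit);
    }

    /** Takes one key with all of its mapped values and emits zero or more output pairs. */
    interface ReduceFn<K2, V2, K3, V3> {
        void reduce(K2 key, List<V2> values, BiConsumer<K3, V3> emit);
    }

    static <K1, V1, K2 extends Comparable<K2>, V2, K3, V3> Map<K3, V3> run(
            Map<K1, V1> input,
            MapFn<K1, V1, K2, V2> mapper,
            ReduceFn<K2, V2, K3, V3> reducer) {
        // Map phase: every emitted value is collected under its key (TreeMap sorts by key).
        Map<K2, List<V2>> grouped = new TreeMap<>();
        input.forEach((k, v) -> mapper.map(k, v,
                (k2, v2) -> grouped.computeIfAbsent(k2, x -> new ArrayList<>()).add(v2)));

        // Reduce phase: one reduce() call per key, with all values for that key.
        Map<K3, V3> output = new LinkedHashMap<>();
        grouped.forEach((k2, values) -> reducer.reduce(k2, values, output::put));
        return output;
    }
}
```

A word count, for instance, would use a `MapFn<Integer, String, String, Integer>` that emits `<word, 1>` for each word of a line and a `ReduceFn<String, Integer, String, Integer>` that emits the number of values per key. Declaring the two functions with explicit types before passing them to `run` keeps Java's type inference straightforward.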
  • 15. MapReduce example
      - “Hello World” for Hadoop: http://wiki.apache.org/hadoop/WordCount
      - “Tag Cloud” example for Hadoop (figure: a cloud of tag1 … tag6, each tag sized by weight(tagi))
  • 16. Tag Cloud example
      - Input is taggable content (images, posts, videos) with space-separated tags: <posti, “tag1 tag2 … tagn”>
      - Output is each tagi with its count, plus the total number of tags: <tagi, tag count>, <total tags, total tags count>
      - Results: weight(tagi) = tagi count / total tags; font(tagi) = fn(weight(tagi))
  • 17. Tag Cloud Mapper
      - Mapper extends the class org.apache.hadoop.mapreduce.Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
      - Mapper input:
          <post1, “tag1 tag3”>
          <post2, “tag3”>
          <post3, “tag2 tag3 tag4”>
          <post4, “tag1 tag2 tag3”>
      - Simplify the model and make the line number a key:
          <line1, “tag1 tag3”>
          <line2, “tag3”>
          <line3, “tag2 tag3 tag4”>
          <line4, “tag1 tag2 tag3”>
      - Write the raw tags to the input file
  • 18. Tag Cloud Mapper
      - Mapper input (values are tags read from the file; the line number is the key):
          <0, “tag1 tag3”>
          <1, “tag3”>
          <2, “tag2 tag3 tag4”>
          <3, “tag1 tag2 tag3”>
      - Mapper code:
          // value is e.g. "tag1 tag3": space separated tags
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line, " ");
          // emit the number of tags on this line
          context.write(TOTAL_TAGS_KEY, new IntWritable(tokenizer.countTokens()));
          while (tokenizer.hasMoreTokens()) {
              Text tag = new Text(tokenizer.nextToken());
              context.write(tag, new IntWritable(1)); // emit <tag, 1> as intermediate output
          }
      - Mapper output:
          <“total tags”, 2> <“tag1”, 1> <“tag3”, 1>
          <“total tags”, 1> <“tag3”, 1>
          <“total tags”, 3> <“tag2”, 1> <“tag3”, 1> <“tag4”, 1>
          <“total tags”, 3> <“tag1”, 1> <“tag2”, 1> <“tag3”, 1>
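To try the mapper logic without a Hadoop installation, the same tokenizing steps can be run in plain Java. This is a sketch under assumptions: `TagCloudMapSketch` and its `map` method are illustrative names, and plain `String`/`Integer` pairs stand in for Hadoop's `Text` and `IntWritable`.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

/** Plain-Java stand-in for the Tag Cloud map step. Hypothetical names; no Hadoop types. */
final class TagCloudMapSketch {
    static final String TOTAL_TAGS_KEY = "total tags";

    /** Emits one <"total tags", n> pair for the line, then one <tag, 1> pair per tag. */
    static List<Map.Entry<String, Integer>> map(long lineKey, String line) {
        // lineKey is ignored, just as the slide's mapper ignores its input key.
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line, " ");
        out.add(new SimpleEntry<>(TOTAL_TAGS_KEY, tokenizer.countTokens()));
        while (tokenizer.hasMoreTokens()) {
            out.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return out;
    }
}
```

Calling `TagCloudMapSketch.map(2, "tag2 tag3 tag4")` yields a total of 3 followed by three `<tag, 1>` pairs, matching the third input line on the slide.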
  • 19. Reducer phases
      - 1. Shuffle or Copy phase: copies output from the Mapper to the Reducer’s local file system
      - 2. Sort phase: sorts Mapper output by key; this becomes the Reducer input
          Mapper output:
            <“total tags”, 2> <“tag1”, 1> <“tag3”, 1>
            <“total tags”, 1> <“tag3”, 1>
            <“total tags”, 3> <“tag2”, 1> <“tag3”, 1> <“tag4”, 1>
            <“total tags”, 3> <“tag1”, 1> <“tag2”, 1> <“tag3”, 1>
          Reducer input (after shuffle and sort by key):
            <“tag1”, 1> <“tag1”, 1>
            <“tag2”, 1> <“tag2”, 1>
            <“tag3”, 1> <“tag3”, 1> <“tag3”, 1> <“tag3”, 1>
            <“tag4”, 1>
            <“total tags”, 2> <“total tags”, 1> <“total tags”, 3> <“total tags”, 3>
      - 3. Reduce or Emit phase: calls reduce() once for each group of sorted input pairs that share a key
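The effect of the shuffle and sort phases on the intermediate pairs can be imitated in memory. This is a rough sketch, assuming everything fits in one process: `ShuffleSortSketch` is a made-up name, and a `TreeMap` stands in for the copy and merge-sort work that Hadoop does across machines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory imitation of shuffle and sort. Hypothetical names; not the Hadoop API. */
final class ShuffleSortSketch {

    /** Groups intermediate pairs by key; the TreeMap keeps keys in sorted order. */
    static Map<String, List<Integer>> shuffleAndSort(List<Map.Entry<String, Integer>> mapped) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }
}
```

Each entry of the returned map corresponds to one reduce() call: the key, together with every value that any mapper emitted for it.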
  • 20. Tag Cloud Reduce phase
      - Reducer extends the class org.apache.hadoop.mapreduce.Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
      - Reducer input (pairs grouped by tagi, e.g. [<“tag1”, 1>, <“tag1”, 1>]):
          <“tag1”, 1> <“tag1”, 1>
          <“tag2”, 1> <“tag2”, 1>
          <“tag3”, 1> <“tag3”, 1> <“tag3”, 1> <“tag3”, 1>
          <“tag4”, 1>
          <“total tags”, 2> <“total tags”, 1> <“total tags”, 3> <“total tags”, 3>
      - Reducer code:
          int tagsCount = 0;
          for (IntWritable value : values) {
              tagsCount += value.get(); // sum all values for this key
          }
          context.write(key, new IntWritable(tagsCount));
      - Reducer output:
          <tag1, 2>
          <tag2, 2>
          <tag3, 4>
          <tag4, 1>
          <total tags, 9>
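The reduce step is a per-key sum, which can be checked against the slide's numbers in plain Java. This is a sketch only: `TagCountReduceSketch`, `reduce` and `reduceAll` are illustrative names, and `Integer` replaces Hadoop's `IntWritable`.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java stand-in for the Tag Cloud reduce step. Hypothetical names; no Hadoop types. */
final class TagCountReduceSketch {

    /** Sums all values that were grouped under one key. */
    static int reduce(String key, Iterable<Integer> values) {
        int tagsCount = 0;
        for (int value : values) {
            tagsCount += value;
        }
        return tagsCount;
    }

    /** Applies reduce() once per key, as the framework would for each group. */
    static Map<String, Integer> reduceAll(Map<String, List<Integer>> grouped) {
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            output.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return output;
    }
}
```

Note that the same function handles both ordinary tags and the special "total tags" key: summing 2, 1, 3 and 3 gives the total of 9 shown on the slide.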
  • 21. Tag Cloud Output
      - Reducer output is a weighted list:
          <tag1, 2>
          <tag2, 2>
          <tag3, 4>
          <tag4, 1>
          <total tags, 9>
      - Tag’s weight: weight(tagi) = tagi count / total tags
          <weight(tag1), 2/9>
          <weight(tag2), 2/9>
          <weight(tag3), 4/9>
          <weight(tag4), 1/9>
      - Size of font: font(tagi) = fn(weight(tagi))
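The weight formula is a single division, and fn() can be any monotonic mapping from weight to font size. The slides do not say which fn() is used, so the linear interpolation below is purely an assumption, as are the names `TagWeightSketch`, `weight` and `fontSizePx`.

```java
/** Weight and font size for a tag. Hypothetical names; fn() here is an assumed linear mapping. */
final class TagWeightSketch {

    /** weight(tag) = tag count / total tags, as defined on the slide. */
    static double weight(int tagCount, int totalTags) {
        if (totalTags <= 0) {
            throw new IllegalArgumentException("totalTags must be positive");
        }
        return (double) tagCount / totalTags;
    }

    /** One possible fn(): interpolate linearly between a minimum and a maximum font size. */
    static long fontSizePx(double weight, int minPx, int maxPx) {
        return Math.round(minPx + weight * (maxPx - minPx));
    }
}
```

With the slide's counts, `weight(4, 9)` is 4/9 for tag3 and `weight(1, 9)` is 1/9 for tag4, so tag3 is rendered larger than tag4 under any increasing fn().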
  • 22. Between Map and Reduce
      - Combiner:
          - extends the class org.apache.hadoop.mapreduce.Reducer
          - works as an in-memory Reducer on the output of a single Mapper
          - serves as an additional optimization
          Mapper output:   <“total tags”, 3> <“tag1”, 1> <“tag1”, 1> <“tag3”, 1>
          (in-memory combine)
          Combiner output: <“total tags”, 3> <“tag1”, 2> <“tag3”, 1>
      - Partitioner:
          - extends the class org.apache.hadoop.mapreduce.Partitioner
          - assigns each intermediate <key, value> pair from the Mapper to its designated Reducer partition
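Both steps between map and reduce are small functions, sketched here in plain Java. `CombinePartitionSketch` and its method names are hypothetical. The `partition` arithmetic (non-negative hash code modulo the number of reducers) follows what Hadoop's default HashPartitioner does, although Java's `String.hashCode()` is used here rather than the hash of a Hadoop `Text` key, so the partition numbers themselves will differ from a real job.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java stand-ins for a combiner and a hash partitioner. Hypothetical names. */
final class CombinePartitionSketch {

    /** Combiner: pre-sums one mapper's output per key before it is sent to the reducers. */
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapperOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    /** Partitioner: the same key always lands in the same partition, in the range [0, numReduceTasks). */
    static int partition(String key, int numReduceTasks) {
        // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Summation is safe to use as a combiner because it is associative and commutative: the reducer gets the same totals whether it receives two `<tag1, 1>` pairs or one `<tag1, 2>` pair. The partitioner's guarantee matters for correctness, since all values for a key must reach the same reducer.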
  • 23. Time for a Workshop: Standalone mode
      - Build “Tag Cloud” project jar:
          cd $TAG_CLOUD_HOME
          mvn clean install
      - Check input directory:
          $HADOOP_HOME/bin/hadoop fs -ls $TAG_CLOUD_HOME/input/
      - Check input file:
          $HADOOP_HOME/bin/hadoop fs -cat $TAG_CLOUD_HOME/input/tags01
      - Submit TagCloudJob to Hadoop:
          $HADOOP_HOME/bin/hadoop jar $TAG_CLOUD_HOME/target/tagcloud-1.0.jar com.altoros.rnd.hadoop.tagcloud.TagCloudJob $TAG_CLOUD_HOME/input $TAG_CLOUD_HOME/output
      - Check output directory:
          $HADOOP_HOME/bin/hadoop fs -ls $TAG_CLOUD_HOME/output/
      - Check output file:
          $HADOOP_HOME/bin/hadoop fs -cat $TAG_CLOUD_HOME/output/part-r-00000
  • 24. Apache Pig
      - Higher-level data processing layer on top of Hadoop
      - Data-flow oriented language (Pig scripts)
      - Data types include sets, associative arrays, tuples
      - Developed at Yahoo!
  • 25. Apache Hive
      - Feature set is similar to Pig
      - SQL-like data warehouse infrastructure
      - Language is more strictly SQL
      - Supports SELECT, JOIN, GROUP BY, etc.
      - Developed at Facebook
  • 26. Apache HBase
      - Column-store database (after the Google BigTable model)
      - HDFS is the underlying file system
      - Holds extremely large datasets (multi-terabyte)
      - Constrained access model
  • 27. Apache Mahout
      - Scalable machine learning algorithms on top of Hadoop:
          - filtering
          - recommendations
          - classifiers
          - clustering
  • 28. Apache ZooKeeper
      - Common services for distributed applications:
          - group services
          - configuration management
          - naming services
          - synchronization
  • 29. Oozie
      - Workflow engine for Hadoop
      - Orchestrates dependencies between jobs running on Hadoop (including HDFS, Pig and MapReduce)
      - Another query processing API
      - Developed at Yahoo!
  • 30. Apache Chukwa
      - System for reliable large-scale log collection
      - Displaying, monitoring and analyzing results
      - Built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce
      - Incubated at apache.org
  • 31. Questions
      - links: http://www.slideshare.net/tazija/anatomy-of-distributed-computing-with-hadoop
               https://github.com/tazija/TagCloud
      - skype: siarhei_bushyk
      - mail: tazija@gmail.com, sergey.bushik@altoros.com

Editor's Notes

  1. DataNodes are constantly reporting to the NameNode. Blocks are stored on the Data Nodes.
  2. Standalone operation mode:
       1. export TAG_CLOUD_HOME=/Users/tazija/Projects/hadoop/tagcloud
       2. export HADOOP_HOME=/Users/tazija/Programs/apache-hadoop-0.23.0
       3. cd $TAG_CLOUD_HOME
       4. mvn clean install
       5. $HADOOP_HOME/bin/hadoop fs -ls $TAG_CLOUD_HOME/input
          Input directory is $TAG_CLOUD_HOME/input
       6. $HADOOP_HOME/bin/hadoop fs -cat $TAG_CLOUD_HOME/input/tags01
          We use the InputFormat for plain text files. Files are broken into lines. Either a linefeed or a carriage return signals the end of a line. Keys are the position in the file, and values are the line of text.
       7. $HADOOP_HOME/bin/hadoop jar $TAG_CLOUD_HOME/target/tagcloud-1.0.jar com.altoros.rnd.hadoop.tagcloud.TagCloudJob $TAG_CLOUD_HOME/input $TAG_CLOUD_HOME/output
     Distributed mode:
       1. /etc/hadoop/hdfs-site.xml
          <configuration>
            <property>
              <name>dfs.namenode.name.dir</name>
              <value>file:/Users/tazija/Programs/apache-hadoop-0.23.0/data/hdfs/namenode</value>
            </property>
            <property>
              <name>dfs.replication</name>
              <value>1</value>
            </property>
          </configuration>
       2. Format the filesystem:
          bin/hadoop namenode -format
       3. Start the daemons:
          ./sbin/hadoop-daemon.sh start namenode
          ./sbin/hadoop-daemon.sh start datanode
          ./sbin/hadoop-daemon.sh start secondarynamenode
       4. Check HDFS status: http://localhost:50070/