Anatomy of
distributed
computing with
Hadoop
What is Hadoop?
   Hadoop started out as a subproject of Nutch, created by
    Doug Cutting

   Hadoop boosted Nutch’s scalability

   Enhanced by Yahoo!, it became an Apache top-level
    project

   System for distributed big data processing
       Big data is terabytes, petabytes and more…
       Exabyte and zettabyte datasets?
Why does anyone need Hadoop?
Hadoop use cases
Hadoop basics
 Implements Google’s whitepaper:
   http://research.google.com/archive/mapreduce.html

 Hadoop is a combination of:
   HDFS         Storage
   MapReduce    Computation
HDFS
Hadoop Distributed File System
   It’s a file system
    bin/hadoop dfs <command> <options>

    <command>:
cat              expunge         put
chgrp            get             rm
chmod            getmerge        rmr
chown            ls              setrep
copyFromLocal    lsr             stat
copyToLocal      mkdir           tail
cp               moveFromLocal   test
du               moveToLocal     text
dus              mv              touchz
Hadoop Distributed File System
   It’s accessible
Hadoop Distributed File System
   It’s distributed
   It employs a master/slave architecture
Hadoop Distributed File System
   Name Node:
    Stores file system metadata

   Secondary Name Node(s):
    Periodically merge the file system image with the edit log

   Data Node(s):
    Store the actual data (blocks)
    Allow data to be replicated
MapReduce
      A programming model for distributed data
       processing

      Its data processing primitives are functions:
             Mappers and Reducers
MapReduce

    To decompose a problem for MapReduce, think of the data in
    terms of keys and values:

<key, value>
<user id, user profile>
<timestamp, apache log entry>
<tag, list of tagged images>
MapReduce
 Mapper
 Function that takes a key and a value and emits
 zero or more keys and values

 Reducer
 Function that takes a key and all “mapped”
 values for it and emits zero or more new keys and
 values
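
The two definitions above can be written down as plain Java types. The following is only a sketch that does not use Hadoop: the names `MapReduceModel`, `MapFn`, `ReduceFn`, `WORD_MAPPER` and `SUM_REDUCER` are hypothetical, and the word-count logic is chosen just to show "zero or more" emitted pairs.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical names for illustration only; this is not Hadoop's API.
class MapReduceModel {

    /** Takes one <key, value> pair and emits zero or more <key, value> pairs. */
    interface MapFn<K1, V1, K2, V2> {
        List<Map.Entry<K2, V2>> map(K1 key, V1 value);
    }

    /** Takes a key plus all values mapped to it and emits zero or more pairs. */
    interface ReduceFn<K2, V2, K3, V3> {
        List<Map.Entry<K3, V3>> reduce(K2 key, List<V2> values);
    }

    /** Word-count mapper: <line number, text> -> <word, 1> for every word. */
    static final MapFn<Long, String, String, Integer> WORD_MAPPER = (lineNo, text) -> {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {          // a blank line emits zero pairs
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    };

    /** Word-count reducer: <word, [1, 1, ...]> -> <word, sum>. */
    static final ReduceFn<String, Integer, String, Integer> SUM_REDUCER = (word, counts) -> {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        out.add(new SimpleEntry<>(word, sum));
        return out;
    };

    public static void main(String[] args) {
        System.out.println(WORD_MAPPER.map(0L, "a b a"));
        System.out.println(SUM_REDUCER.reduce("a", Arrays.asList(1, 1)));
    }
}
```

In a real job the framework, not the caller, collects the mapper output and hands each key with its values to the reducer.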
MapReduce example
 “Hello World” for Hadoop:
       http://wiki.apache.org/hadoop/WordCount


 “Tag Cloud” example for Hadoop:

 tag1 tag2 tag3
 tag1 tag3        weight(tag_i)
 tag3
 tag4 tag5 tag6
Tag Cloud example
   Input is taggable content (images, posts,
    videos) with space-separated tags:
    <post_i, “tag1 tag2 … tagn”>

   Output is tag_i with its count, plus the total number of tags:
    <tag_i, tag count>
    <total tags, total tags count>

   Results:
    weight(tag_i) = tag_i count / total tags
    font(tag_i) = fn(weight(tag_i))
Tag Cloud Mapper
    Mapper extends the class:
    org.apache.hadoop.mapreduce.Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>

   Mapper input:
       <post1, “tag1 tag3”>
       <post2, “tag3”>
       <post3, “tag2 tag3 tag4”>
       <post4, “tag1 tag2 tag3”>

                   simplify model & make line number a key

       <line1, “tag1 tag3”>
       <line2, “tag3”>
       <line3, “tag2 tag3 tag4”>
       <line4, “tag1 tag2 tag3”>

                    write raw tags to input file
Tag Cloud Mapper
   Mapper input:
    <0, “tag1 tag3”>
    <1, “tag3”>
    <2, “tag2 tag3 tag4”>
    <3, “tag1 tag2 tag3”>

   Read values (tags) from the file; the line number is the key:
    “tag1 tag3” // space-separated tags

    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line, " ");
    // emit the number of tags on this line under the shared "total tags" key
    context.write(TOTAL_TAGS_KEY,
                  new IntWritable(tokenizer.countTokens()));
    while (tokenizer.hasMoreTokens()) {
        Text tag = new Text(tokenizer.nextToken());
        context.write(tag, new IntWritable(1)); // emit <tag, 1> as intermediate map output
    }

   Mapper output (context.write()):
    <“total tags”, 2>
    <“tag1”, 1>
    <“tag3”, 1>

    <“total tags”, 1>
    <“tag3”, 1>

    <“total tags”, 3>
    <“tag2”, 1>
    <“tag3”, 1>
    <“tag4”, 1>

    <“total tags”, 3>
    <“tag1”, 1>
    <“tag2”, 1>
    <“tag3”, 1>
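
To try the map step without a Hadoop installation, the same tokenizing logic can be run on plain strings. This is a sketch under that assumption: `TagCloudMapSketch` and its `map` method are hypothetical names, and a returned list of pairs stands in for `context.write()`.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hadoop-free simulation of the Tag Cloud map step; names are hypothetical.
class TagCloudMapSketch {
    static final String TOTAL_TAGS_KEY = "total tags";

    /** Emits <"total tags", n> for the line, then <tag, 1> for each of its n tags. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line, " ");
        out.add(new SimpleEntry<>(TOTAL_TAGS_KEY, tokenizer.countTokens()));
        while (tokenizer.hasMoreTokens()) {
            out.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    public static void main(String[] args) {
        String[] lines = {"tag1 tag3", "tag3", "tag2 tag3 tag4", "tag1 tag2 tag3"};
        for (String line : lines) {
            System.out.println(map(line));
        }
    }
}
```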
Reducer phases
   1. Shuffle or Copy phase:
    Copies output from the Mappers to the Reducer's local file system

   2. Sort phase:
    Sorts Mapper output by key. This becomes the Reducer input

    Mapper output:
     <“total tags”, 2>
     <“tag1”, 1>
     <“tag3”, 1>

     <“total tags”, 1>
     <“tag3”, 1>

     <“total tags”, 3>
     <“tag2”, 1>
     <“tag3”, 1>
     <“tag4”, 1>

     <“total tags”, 3>
     <“tag1”, 1>
     <“tag2”, 1>
     <“tag3”, 1>

    Shuffle & sort by key

    Reducer input:
     <“tag1”, 1>
     <“tag1”, 1>

     <“tag2”, 1>
     <“tag2”, 1>

     <“tag3”, 1>
     <“tag3”, 1>
     <“tag3”, 1>
     <“tag3”, 1>

     <“tag4”, 1>

     <“total tags”, 2>
     <“total tags”, 1>
     <“total tags”, 3>
     <“total tags”, 3>

   3. Reduce or Emit phase:
    Performs reduce() once for each key and its group of sorted values
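
The effect of shuffle and sort can be imitated in memory by grouping pairs under a sorted map. This is a sketch, not how Hadoop moves data between nodes; `ShuffleSortSketch`, `groupByKey` and `pair` are hypothetical names.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// In-memory stand-in for shuffle and sort; names are hypothetical.
class ShuffleSortSketch {

    /** Groups mapper output by key; the TreeMap keeps the keys sorted. */
    static SortedMap<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> mapped) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Small helper to build one <key, value> pair. */
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new SimpleEntry<>(key, value);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapped = Arrays.asList(
            pair("total tags", 2), pair("tag1", 1), pair("tag3", 1),
            pair("total tags", 1), pair("tag3", 1),
            pair("total tags", 3), pair("tag2", 1), pair("tag3", 1), pair("tag4", 1),
            pair("total tags", 3), pair("tag1", 1), pair("tag2", 1), pair("tag3", 1));
        // Keys come out sorted, each with the list of values mapped to it.
        System.out.println(groupByKey(mapped));
    }
}
```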
Tag Cloud Reduce phase
   Reducer extends the class:
    org.apache.hadoop.mapreduce.Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>

   Reducer input (pairs grouped by tag_i):
    <“tag1”, 1>
    <“tag1”, 1>

    <“tag2”, 1>
    <“tag2”, 1>

    <“tag3”, 1>
    <“tag3”, 1>
    <“tag3”, 1>
    <“tag3”, 1>

    <“tag4”, 1>

    <“total tags”, 2>
    <“total tags”, 1>
    <“total tags”, 3>
    <“total tags”, 3>

   reduce() is called once per group, e.g. [<“tag1”, 1>, <“tag1”, 1>]:

    int tagsCount = 0;
    for (IntWritable value : values) {
        tagsCount += value.get();
    }
    context.write(key, new IntWritable(tagsCount));

   Reducer output (context.write()):
    <tag1, 2>
    <tag2, 2>
    <tag3, 4>
    <tag4, 1>
    <total tags, 9>
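
The summing loop from this slide can be checked in isolation with plain Java collections. A minimal sketch, assuming the grouped input is already available as lists of integers; `TagCloudReduceSketch` is a hypothetical name and the return value stands in for `context.write()`.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hadoop-free sketch of the reduce step; names are hypothetical.
class TagCloudReduceSketch {

    /** Sums all counts grouped under one key, like the reduce() body on the slide. */
    static int reduce(String key, Iterable<Integer> values) {
        int tagsCount = 0;
        for (int value : values) {
            tagsCount += value;
        }
        return tagsCount;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        grouped.put("tag1", Arrays.asList(1, 1));
        grouped.put("tag2", Arrays.asList(1, 1));
        grouped.put("tag3", Arrays.asList(1, 1, 1, 1));
        grouped.put("tag4", Arrays.asList(1));
        grouped.put("total tags", Arrays.asList(2, 1, 3, 3));
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            System.out.println("<" + group.getKey() + ", "
                + reduce(group.getKey(), group.getValue()) + ">");
        }
    }
}
```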
Tag Cloud Output
    Reducer output is a list of tag counts:
    <tag1, 2>
    <tag2, 2>
    <tag3, 4>
    <tag4, 1>
    <total tags, 9>

   Tag’s weight:
    weight(tag_i) = tag_i count / total tags

    <weight(tag1), 2/9>
    <weight(tag2), 2/9>
    <weight(tag3), 4/9>
    <weight(tag4), 1/9>

   Size of font:
    font(tag_i) = fn(weight(tag_i))
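
The weight formula is a single division, and the font function is left open on the slide. The sketch below assumes one possible choice for fn, a linear scale between a minimum and a maximum point size; `TagWeightSketch`, `fontSize`, and the 10pt and 40pt bounds are assumptions made for illustration.

```java
// Sketch of the post-processing step; fontSize() is an assumed linear mapping.
class TagWeightSketch {

    /** weight(tag_i) = tag_i count / total tags */
    static double weight(int tagCount, int totalTags) {
        return (double) tagCount / totalTags;
    }

    /** Hypothetical fn: scales a weight in [0, 1] linearly between minPt and maxPt. */
    static int fontSize(double weight, int minPt, int maxPt) {
        return minPt + (int) Math.round(weight * (maxPt - minPt));
    }

    public static void main(String[] args) {
        int total = 9;
        int[] counts = {2, 2, 4, 1};   // counts for tag1..tag4 from the reducer output
        for (int i = 0; i < counts.length; i++) {
            double w = weight(counts[i], total);
            System.out.printf("tag%d weight=%.3f font=%dpt%n", i + 1, w, fontSize(w, 10, 40));
        }
    }
}
```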
Between Map and Reduce
   Combiner:
     extends the class
      org.apache.hadoop.mapreduce.Reducer
     works as an in-memory Reducer on the map side
     serves as an additional optimization

    Mapper output:
     <“total tags”, 3>
     <“tag1”, 1>
     <“tag1”, 1>
     <“tag3”, 1>

    Combiner output (in-memory combine):
     <“total tags”, 3>
     <“tag1”, 2>
     <“tag3”, 1>

   Partitioner:
     extends the class
      org.apache.hadoop.mapreduce.Partitioner
     assigns each intermediate <key, value> pair from the
      Mapper to a designated Reducer partition
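
Both steps can be sketched without Hadoop. The combiner below pre-sums values per key from one mapper's output, and the partitioner uses a hash-modulo rule of the same shape as Hadoop's default `HashPartitioner`. The class and method names are hypothetical, and Java `String.hashCode()` is used here in place of the hash of a Hadoop `Text` key, so the partition numbers would differ from a real job.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hadoop-free sketch; class and method names are hypothetical.
class CombinePartitionSketch {

    /** Combiner: pre-sums values per key on the map side before anything is shuffled. */
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapperOutput) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    /** Partitioner: hash-modulo rule, so equal keys always reach the same reducer. */
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Small helper to build one <key, value> pair. */
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new SimpleEntry<>(key, value);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapperOutput = Arrays.asList(
            pair("total tags", 3), pair("tag1", 1), pair("tag1", 1), pair("tag3", 1));
        Map<String, Integer> combined = combine(mapperOutput);
        System.out.println(combined);
        for (String key : combined.keySet()) {
            System.out.println(key + " -> reducer " + partition(key, 2));
        }
    }
}
```

A combiner like this is only safe when the reduce operation can be applied to partial results, which holds for summing counts.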
Time for a Workshop
   Standalone mode
   Build “Tag Cloud” project jar:
cd $TAG_CLOUD_HOME
mvn clean install

  Check input directory:
$HADOOP_HOME/bin/hadoop fs -ls $TAG_CLOUD_HOME/input/

  Check input file:
$HADOOP_HOME/bin/hadoop fs -cat $TAG_CLOUD_HOME/input/tags01

  Submit TagCloudJob to Hadoop:
$HADOOP_HOME/bin/hadoop jar $TAG_CLOUD_HOME/target/tagcloud-1.0.jar \
    com.altoros.rnd.hadoop.tagcloud.TagCloudJob $TAG_CLOUD_HOME/input \
    $TAG_CLOUD_HOME/output

  Check output directory:
$HADOOP_HOME/bin/hadoop fs -ls $TAG_CLOUD_HOME/output/

  Check output file:
$HADOOP_HOME/bin/hadoop fs -cat $TAG_CLOUD_HOME/output/part-r-00000
Apache Pig
   Higher-level data processing layer on top
    of Hadoop
   Data-flow oriented language (Pig Latin scripts)
   Data types include sets, associative
    arrays, tuples
   Developed at Yahoo!
Apache Hive
   Feature set is similar to Pig
   SQL-like data warehouse infrastructure
   Language is more strictly SQL
   Supports SELECT, JOIN, GROUP BY, etc.
   Developed at Facebook
Apache HBase
    Column-store database (modeled after Google
     BigTable)
    HDFS is the underlying file system
    Holds extremely large datasets (multi-TB)
    Constrained access model
Apache Mahout
     Scalable machine learning algorithms on
      top of Hadoop:
     – filtering,
     – recommendations,
     – classifiers,
     – clustering
Apache ZooKeeper
     Common services for distributed
      applications:
      - group services,
      - configuration management,
      - naming services,
      - synchronization
Oozie
   Workflow engine for Hadoop
   Orchestrates dependencies between
    jobs running on Hadoop (including HDFS,
    Pig and MapReduce)
   Another query processing API
   Developed at Yahoo!
Apache Chukwa
    System for reliable large-scale log
     collection
    Displaying, monitoring and analyzing results
    Built on top of the Hadoop Distributed File
     System (HDFS) and Map/Reduce
    Incubated at apache.org
Questions

             links:
http://www.slideshare.net/tazija/anatomy-of-distributed-computing-with-hadoop
https://github.com/tazija/TagCloud

             skype: siarhei_bushyk
             mailto: tazija@gmail.com
             mailto: sergey.bushik@altoros.com

Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 

Anatomy of distributed computing with Hadoop

  • 2. What is Hadoop?
      - Hadoop started out as a subproject of Nutch by Doug Cutting
      - Hadoop boosted Nutch’s scalability
      - Enhanced by Yahoo!, it became an Apache top-level project
      - A system for distributed big data processing
      - Big data means terabytes, petabytes and more… exabyte and zettabyte datasets?
  • 7. Hadoop basics
      - Implements the MapReduce model from Google’s whitepaper: http://research.google.com/archive/mapreduce.html
      - Hadoop is a combination of:
          HDFS         Storage
          MapReduce    Computation
  • 8. HDFS: Hadoop Distributed File System
      - It’s a file system: bin/hadoop dfs <command> <options>
      - <command> is one of:
          cat              expunge         put
          chgrp            get             rm
          chmod            getmerge        rmr
          chown            ls              setrep
          copyFromLocal    lsr             stat
          copyToLocal      mkdir           tail
          cp               moveFromLocal   test
          du               moveToLocal     text
          dus              mv              touchz
  • 9. Hadoop Distributed File System
      - It’s accessible
  • 10. Hadoop Distributed File System
      - It’s distributed
      - It employs a master/slave architecture
  • 11. Hadoop Distributed File System
      - Name Node: stores file system metadata
      - Secondary Name Node(s): periodically merge the file system image with the edit log
      - Data Node(s): store the actual data (blocks) and allow data to be replicated
  • 12. MapReduce
      - A programming model for distributed data processing
      - Its data processing primitives are functions: Mappers and Reducers
  • 13. MapReduce
      - To decompose a problem for MapReduce, think of the data in terms of keys and values:
          <key, value>
          <user id, user profile>
          <timestamp, apache log entry>
          <tag, list of tagged images>
  • 14. MapReduce
      - Mapper: a function that takes a key and a value and emits zero or more keys and values
      - Reducer: a function that takes a key and all values “mapped” to it and emits zero or more new keys and values
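The two definitions on slide 14 can be sketched as plain Java functions. This is only an illustration of the shapes involved, not Hadoop code: the class `MapReduceShapes` and its methods are hypothetical names, and plain JDK collections stand in for Hadoop's `Context` and `Writable` types.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Hadoop or the TagCloud project.
public class MapReduceShapes {

    // Mapper shape: one <key, value> in, zero or more <key, value> pairs out.
    public static List<Map.Entry<String, Integer>> map(long lineNumber, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {           // a blank line emits nothing
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Reducer shape: one key plus all values mapped to it in, zero or more pairs out.
    public static List<Map.Entry<String, Integer>> reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        out.add(new SimpleEntry<>(key, sum));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map(0, "tag1 tag3"));
        System.out.println(reduce("tag1", Arrays.asList(1, 1)));
    }
}
```

Returning a list from each function makes the "zero or more" part of the definitions explicit: a blank input line yields an empty list rather than a special value.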
  • 15. MapReduce example
      - “Hello World” for Hadoop: http://wiki.apache.org/hadoop/WordCount
      - “Tag Cloud” example for Hadoop
        [Figure: a tag cloud of tag1 … tag6, each tag sized by weight(tagi)]
  • 16. Tag Cloud example
      - Input is taggable content (images, posts, videos) with space-separated tags: <posti, “tag1 tag2 … tagn”>
      - Output is each tagi with its count, plus the total number of tags: <tagi, tag count> <total tags, total tags count>
      - Results:
          weight(tagi) = tagi count / total tags
          font(tagi) = fn(weight(tagi))
  • 17. Tag Cloud Mapper
      - Mapper extends the class: org.apache.hadoop.mapreduce.Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
      - Mapper input:
          <post1, “tag1 tag3”>
          <post2, “tag3”>
          <post3, “tag2 tag3 tag4”>
          <post4, “tag1 tag2 tag3”>
      - To simplify the model, write the raw tags to an input file and make the line number the key:
          <line1, “tag1 tag3”>
          <line2, “tag3”>
          <line3, “tag2 tag3 tag4”>
          <line4, “tag1 tag2 tag3”>
  • 18. Tag Cloud Mapper
      - Mapper input (values are tags read from the file; the line number is the key):
          <0, “tag1 tag3”>
          <1, “tag3”>
          <2, “tag2 tag3 tag4”>
          <3, “tag1 tag2 tag3”>
      - Mapper code:
          // value holds one line of space-separated tags, e.g. "tag1 tag3"
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line, " ");
          // emit the number of tags on this line under the shared "total tags" key
          context.write(TOTAL_TAGS_KEY, new IntWritable(tokenizer.countTokens()));
          while (tokenizer.hasMoreTokens()) {
              Text tag = new Text(tokenizer.nextToken());
              context.write(tag, new IntWritable(1)); // emit <tag, 1> as intermediate Mapper output
          }
      - Mapper output, via context.write():
          <“total tags”, 2> <“tag1”, 1> <“tag3”, 1>
          <“total tags”, 1> <“tag3”, 1>
          <“total tags”, 3> <“tag2”, 1> <“tag3”, 1> <“tag4”, 1>
          <“total tags”, 3> <“tag1”, 1> <“tag2”, 1> <“tag3”, 1>
  • 19. Reducer phases
      - 1. Shuffle or Copy phase: copies the output from the Mappers to the Reducer’s local file system
      - 2. Sort phase: sorts the Mapper output by key; this becomes the Reducer input
          Mapper output:
            <“total tags”, 2> <“tag1”, 1> <“tag3”, 1>
            <“total tags”, 1> <“tag3”, 1>
            <“total tags”, 3> <“tag2”, 1> <“tag3”, 1> <“tag4”, 1>
            <“total tags”, 3> <“tag1”, 1> <“tag2”, 1> <“tag3”, 1>
          Reducer input (after shuffle & sort by key):
            <“tag1”, 1> <“tag1”, 1>
            <“tag2”, 1> <“tag2”, 1>
            <“tag3”, 1> <“tag3”, 1> <“tag3”, 1> <“tag3”, 1>
            <“tag4”, 1>
            <“total tags”, 2> <“total tags”, 1> <“total tags”, 3> <“total tags”, 3>
      - 3. Reduce or Emit phase: performs reduce() for each group of sorted <key, value> inputs
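The effect of the shuffle and sort phases on slide 19 can be imitated in memory with a sorted map. This is a minimal sketch assuming everything fits in one JVM; `ShuffleSortSketch`, `pair` and `groupByKey` are hypothetical names, and real Hadoop copies and merges the data across machines and disks instead.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for shuffle & sort; not Hadoop code.
public class ShuffleSortSketch {

    // Small helper to build a <key, value> pair.
    public static Map.Entry<String, Integer> pair(String key, int value) {
        return new SimpleEntry<>(key, value);
    }

    // Groups all values under their key and keeps the keys in sorted order,
    // which is the shape of input that each reduce() call receives.
    public static SortedMap<String, List<Integer>> groupByKey(
            List<Map.Entry<String, Integer>> mapperOutput) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : mapperOutput) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Mapper output for the first two input lines of the slide's example.
        List<Map.Entry<String, Integer>> mapperOutput = new ArrayList<>();
        mapperOutput.add(pair("total tags", 2));
        mapperOutput.add(pair("tag1", 1));
        mapperOutput.add(pair("tag3", 1));
        mapperOutput.add(pair("total tags", 1));
        mapperOutput.add(pair("tag3", 1));
        System.out.println(groupByKey(mapperOutput));
    }
}
```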
  • 20. Tag Cloud Reduce phase
      - Reducer extends the class: org.apache.hadoop.mapreduce.Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
      - Reducer input (pairs grouped by tagi, e.g. [<“tag1”, 1>, <“tag1”, 1>]):
          <“tag1”, 1> <“tag1”, 1>
          <“tag2”, 1> <“tag2”, 1>
          <“tag3”, 1> <“tag3”, 1> <“tag3”, 1> <“tag3”, 1>
          <“tag4”, 1>
          <“total tags”, 2> <“total tags”, 1> <“total tags”, 3> <“total tags”, 3>
      - Reducer code:
          // sum all values that were grouped under this key
          int tagsCount = 0;
          for (IntWritable value : values) {
              tagsCount += value.get();
          }
          context.write(key, new IntWritable(tagsCount));
      - Reducer output, via context.write():
          <tag1, 2>
          <tag2, 2>
          <tag3, 4>
          <tag4, 1>
          <total tags, 9>
  • 21. Tag Cloud Output
      - Reducer output is a weighted list:
          <tag1, 2> <tag2, 2> <tag3, 4> <tag4, 1> <total tags, 9>
      - Tag’s weight: weight(tagi) = tagi count / total tags
          <weight(tag1), 2/9>
          <weight(tag2), 2/9>
          <weight(tag3), 4/9>
          <weight(tag4), 1/9>
      - Size of font: font(tagi) = fn(weight(tagi))
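The post-processing on slide 21 is plain arithmetic over the Reducer output. The sketch below computes weight(tagi) as defined on the slide. The slides leave fn unspecified, so `fontSize` is only one possible choice (a linear mapping between an assumed minimum and maximum point size); `TagWeightSketch` and its method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical post-processing of the Reducer output; not part of the TagCloud project.
public class TagWeightSketch {

    // weight(tag) = tag count / total tags, for every key except the total itself.
    public static Map<String, Double> weights(Map<String, Integer> reducerOutput, String totalKey) {
        int total = reducerOutput.get(totalKey);
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, Integer> e : reducerOutput.entrySet()) {
            if (!e.getKey().equals(totalKey)) {
                result.put(e.getKey(), e.getValue() / (double) total);
            }
        }
        return result;
    }

    // Assumed fn: linear interpolation between minPt and maxPt (the slides do not define fn).
    public static int fontSize(double weight, int minPt, int maxPt) {
        return (int) Math.round(minPt + weight * (maxPt - minPt));
    }

    public static void main(String[] args) {
        Map<String, Integer> reducerOutput = new LinkedHashMap<>();
        reducerOutput.put("tag1", 2);
        reducerOutput.put("tag2", 2);
        reducerOutput.put("tag3", 4);
        reducerOutput.put("tag4", 1);
        reducerOutput.put("total tags", 9);
        for (Map.Entry<String, Double> e : weights(reducerOutput, "total tags").entrySet()) {
            System.out.println(e.getKey() + " weight=" + e.getValue()
                    + " font=" + fontSize(e.getValue(), 10, 40));
        }
    }
}
```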
  • 22. Between Map and Reduce
      - Combiner:
          - extends org.apache.hadoop.mapreduce.Reducer
          - works as an in-memory Reducer on the Mapper output
          - serves as an additional optimization
          Mapper output: <“total tags”, 2> <“tag1”, 1> <“tag1”, 1> <“tag3”, 1>
          Combiner output (after in-memory combine): <“total tags”, 3> <“tag1”, 2> <“tag3”, 1>
      - Partitioner:
          - extends the abstract class org.apache.hadoop.mapreduce.Partitioner
          - assigns each intermediate <key, value> pair from the Mapper to its designated Reducer partition
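Both steps on slide 22 can be sketched without Hadoop on the classpath. `partition` uses the hash-modulo rule that, to my understanding, Hadoop's default `HashPartitioner` also applies; `combine` pre-sums one Mapper's pairs per key. `PartitionSketch` and its method names are hypothetical, and the input data is chosen for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plain-Java sketch of a combiner and a hash partitioner; not Hadoop code.
public class PartitionSketch {

    // Masks off the sign bit so the result is always in [0, numReducers).
    public static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Sums the values of one Mapper's output per key before it leaves the Mapper,
    // so fewer pairs have to be shuffled to the Reducers.
    public static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapperOutput) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> p : mapperOutput) {
            combined.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapperOutput = new ArrayList<>();
        mapperOutput.add(new SimpleEntry<>("tag1", 1));
        mapperOutput.add(new SimpleEntry<>("tag1", 1));
        mapperOutput.add(new SimpleEntry<>("tag3", 1));
        for (Map.Entry<String, Integer> e : combine(mapperOutput).entrySet()) {
            System.out.println(e + " -> reducer " + partition(e.getKey(), 4));
        }
    }
}
```

Because the partition depends only on the key, every pair with the same key reaches the same Reducer, which is what lets a single reduce() call see all values for that key. Pre-summing in the combiner is safe here only because addition is associative and commutative.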
  • 23. Time for a Workshop: Standalone mode
      - Build the “Tag Cloud” project jar:
          cd $TAG_CLOUD_HOME
          mvn clean install
      - Check the input directory:
          $HADOOP_HOME/bin/hadoop fs -ls $TAG_CLOUD_HOME/input/
      - Check the input file:
          $HADOOP_HOME/bin/hadoop fs -cat $TAG_CLOUD_HOME/input/tags01
      - Submit TagCloudJob to Hadoop:
          $HADOOP_HOME/bin/hadoop jar $TAG_CLOUD_HOME/target/tagcloud-1.0.jar com.altoros.rnd.hadoop.tagcloud.TagCloudJob $TAG_CLOUD_HOME/input $TAG_CLOUD_HOME/output
      - Check the output directory:
          $HADOOP_HOME/bin/hadoop fs -ls $TAG_CLOUD_HOME/output/
      - Check the output file:
          $HADOOP_HOME/bin/hadoop fs -cat $TAG_CLOUD_HOME/output/part-r-00000
  • 24. Apache Pig
      - Higher-level data processing layer on top of Hadoop
      - Data-flow oriented language (Pig scripts)
      - Data types include sets, associative arrays, tuples
      - Developed at Yahoo!
  • 25. Apache Hive
      - Feature set is similar to Pig
      - SQL-like data warehouse infrastructure
      - Language is more strictly SQL
      - Supports SELECT, JOIN, GROUP BY, etc.
      - Developed at Facebook
  • 26. Apache HBase
      - Column-store database (after the Google BigTable model)
      - HDFS is the underlying file system
      - Holds extremely large datasets (multi-TB)
      - Constrained access model
  • 27. Apache Mahout
      - Scalable machine learning algorithms on top of Hadoop:
          - filtering
          - recommendations
          - classifiers
          - clustering
  • 28. Apache ZooKeeper
      - Common services for distributed applications:
          - group services
          - configuration management
          - naming services
          - synchronization
  • 29. Oozie
      - Workflow engine for Hadoop
      - Orchestrates dependencies between jobs running on Hadoop (including HDFS, Pig and MapReduce)
      - Another query processing API
      - Developed at Yahoo!
  • 30. Apache Chukwa
      - System for reliable large-scale log collection
      - Displaying, monitoring and analyzing results
      - Built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce
      - Incubated at apache.org
  • 31. Questions
      - links:
          http://www.slideshare.net/tazija/anatomy-of-distributed-computing-with-hadoop
          https://github.com/tazija/TagCloud
      - skype: siarhei_bushyk
      - mailto: tazija@gmail.com
      - mailto: sergey.bushik@altoros.com

Editor's Notes

  1. DataNodes constantly report to the NameNode. Blocks are stored on the DataNodes.
  2. Standalone operation mode:
       1. export TAG_CLOUD_HOME=/Users/tazija/Projects/hadoop/tagcloud
       2. export HADOOP_HOME=/Users/tazija/Programs/apache-hadoop-0.23.0
       3. cd $TAG_CLOUD_HOME
       4. mvn clean install
       5. $HADOOP_HOME/bin/hadoop fs -ls $TAG_CLOUD_HOME/input
          The input directory is $TAG_CLOUD_HOME/input.
       6. $HADOOP_HOME/bin/hadoop fs -cat $TAG_CLOUD_HOME/input/tags01
          We use the InputFormat for plain text files. Files are broken into lines; either a linefeed or a carriage return signals the end of a line. Keys are the position in the file, and values are the line of text.
       7. $HADOOP_HOME/bin/hadoop jar $TAG_CLOUD_HOME/target/tagcloud-1.0.jar com.altoros.rnd.hadoop.tagcloud.TagCloudJob $TAG_CLOUD_HOME/input $TAG_CLOUD_HOME/output
     Distributed mode:
       1. /etc/hadoop/hdfs-site.xml
          <configuration>
            <property>
              <name>dfs.namenode.name.dir</name>
              <value>file:/Users/tazija/Programs/apache-hadoop-0.23.0/data/hdfs/namenode</value>
            </property>
            <property>
              <name>dfs.namenode.name.dir</name>
              <value>file:/Users/tazija/Programs/apache-hadoop-0.23.0/data/hdfs/namenode</value>
            </property>
            <property>
              <name>dfs.replication</name>
              <value>1</value>
            </property>
          </configuration>
       2. Format the filesystem:
          bin/hadoop namenode -format
       3. Start the daemons:
          ./sbin/hadoop-daemon.sh start namenode
          ./sbin/hadoop-daemon.sh start datanode
          ./sbin/hadoop-daemon.sh start secondarynamenode
       4. Check HDFS status: http://localhost:50070/