Anatomy of distributed computing with Hadoop
Published in: Technology
  • DataNodes constantly report to the NameNode. Blocks are stored on the DataNodes.
  • Standalone operation mode:
    1. export TAG_CLOUD_HOME=/Users/tazija/Projects/hadoop/tagcloud
    2. export HADOOP_HOME=/Users/tazija/Programs/
    3. cd $TAG_CLOUD_HOME
    4. mvn clean install
    5. $HADOOP_HOME/bin/hadoop fs -ls $TAG_CLOUD_HOME/input (the input directory is $TAG_CLOUD_HOME/input)
    6. $HADOOP_HOME/bin/hadoop fs -cat $TAG_CLOUD_HOME/input/tags01
       We use an InputFormat for plain text files. Files are broken into lines; either a linefeed or a carriage return signals end of line. Keys are positions in the file, and values are the lines of text.
    7. $HADOOP_HOME/bin/hadoop jar $TAG_CLOUD_HOME/target/tagcloud-1.0.jar com.altoros.rnd.hadoop.tagcloud.TagCloudJob $TAG_CLOUD_HOME/input $TAG_CLOUD_HOME/output
    Distributed mode:
    1. /etc/hadoop/hdfs-site.xml:
       <configuration>
         <property>
           <name></name>
           <value>file:/Users/tazija/Programs/apache-hadoop-0.23.0/data/hdfs/namenode</value>
         </property>
         <property>
           <name></name>
           <value>file:/Users/tazija/Programs/apache-hadoop-0.23.0/data/hdfs/namenode</value>
         </property>
         <property>
           <name>dfs.replication</name>
           <value>1</value>
         </property>
       </configuration>
    2. Format the filesystem: bin/hadoop namenode -format
    3. Start the daemons: ./sbin/ start namenode; ./sbin/ start datanode; ./sbin/ start secondarynamenode
    4. Check HDFS status: http://localhost:50070/
  • Transcript

    • 1. Anatomy of distributed computing with Hadoop
    • 2. What is Hadoop? Hadoop started out as a subproject of Nutch, created by Doug Cutting. Hadoop boosted Nutch’s scalability. It was enhanced by Yahoo! and became an Apache top-level project. A system for distributed big data processing: big data means terabytes and petabytes and more… exabyte and zettabyte datasets?
    • 3. Why does anyone need Hadoop?
    • 4. Hadoop use cases
    • 5. Hadoop use cases
    • 6. Hadoop use cases
    • 7. Hadoop basics Implements Google’s whitepapers (GFS and MapReduce). Hadoop is a combination of: HDFS for storage, MapReduce for computation.
    • 8. HDFS: Hadoop Distributed File System. It’s a file system: bin/hadoop dfs <command> <options>, where <command> is one of: cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, cp, du, dus, expunge, get, getmerge, ls, lsr, mkdir, moveFromLocal, moveToLocal, mv, put, rm, rmr, setrep, stat, tail, test, text, touchz
    • 9. Hadoop Distributed File System It’s accessible
    • 10. Hadoop Distributed File System It’s distributed. It employs a master/slave architecture.
    • 11. Hadoop Distributed File System Name Node: stores file system metadata. Secondary Name Node(s): periodically merge the file system image. Data Node(s): store the actual data (blocks) and allow data to be replicated.
    • 12. MapReduce A programming model for distributed data processing. Its data processing primitives are functions: Mappers and Reducers.
    • 13. MapReduce To decompose a problem for MapReduce, think of data in terms of keys and values, <key, value>: <user id, user profile>, <timestamp, apache log entry>, <tag, list of tagged images>
    • 14. MapReduce Mapper Function that takes key and value and emits zero or more keys and values Reducer Function that takes key and all “mapped” values and emits zero or more new keys and value
    • 15. MapReduce example The “Hello World” for Hadoop: a “Tag Cloud” example. [Slide shows a cloud of tags (tag1 … tag6) whose font sizes are driven by weight(tagi).]
    • 16. Tag Cloud example Input is taggable content (images, posts, videos) with space-separated tags: <posti, “tag1 tag2 … tagn”>. Output is each tagi with its count, plus the total number of tags: <tagi, tag count>, <total tags, total tags count>. Results: weight(tagi) = tagi count / total tags; font(tagi) = fn(weight(tagi))
    • 17. Tag Cloud Mapper Mapper implements the interface: org.apache.hadoop.mapreduce.Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>. We write the raw tags to an input file and simplify the model by making the line number the key, so the Mapper input <post1, “tag1 tag3”>, <post2, “tag3”>, <post3, “tag2 tag3 tag4”>, <post4, “tag1 tag2 tag3”> becomes <line1, “tag1 tag3”>, <line2, “tag3”>, <line3, “tag2 tag3 tag4”>, <line4, “tag1 tag2 tag3”>.
    • 18. Tag Cloud Mapper Read the values (space-separated tags, e.g. “tag1 tag3”) from the file; the line number is the key.
      Mapper input:            Mapper output:
      <0, “tag1 tag3”>         <“total tags”, 2>, <“tag1”, 1>, <“tag3”, 1>
      <1, “tag3”>              <“total tags”, 1>, <“tag3”, 1>
      <2, “tag2 tag3 tag4”>    <“total tags”, 3>, <“tag2”, 1>, <“tag3”, 1>, <“tag4”, 1>
      <3, “tag1 tag2 tag3”>    <“total tags”, 3>, <“tag1”, 1>, <“tag2”, 1>, <“tag3”, 1>
      map() body (emits via context.write()):
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line, " ");
      context.write(TOTAL_TAGS_KEY, new IntWritable(tokenizer.countTokens()));
      while (tokenizer.hasMoreTokens()) {
          Text tag = new Text(tokenizer.nextToken());
          context.write(tag, new IntWritable(1)); // write to HDFS
      }
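The map() logic on this slide can be tried out in plain Java without a cluster. This is a minimal sketch that swaps Hadoop’s Text/IntWritable for String/Integer and collects the emitted pairs in a list instead of writing through the Context:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Cluster-free simulation of the TagCloud mapper body above.
class TagCloudMapperSim {
    static final String TOTAL_TAGS_KEY = "total tags";

    // Emits one ("total tags", n) pair plus (tag, "1") for every tag on the line.
    static List<String[]> map(String line) {
        List<String[]> emitted = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line, " ");
        emitted.add(new String[] {TOTAL_TAGS_KEY, String.valueOf(tokenizer.countTokens())});
        while (tokenizer.hasMoreTokens()) {
            emitted.add(new String[] {tokenizer.nextToken(), "1"});
        }
        return emitted;
    }

    public static void main(String[] args) {
        for (String[] pair : map("tag1 tag3")) {
            System.out.println("<" + pair[0] + ", " + pair[1] + ">");
        }
        // prints <total tags, 2>, <tag1, 1>, <tag3, 1> — matching line 0 on the slide
    }
}
```

Note that countTokens() is read before any tokens are consumed, which is why the total comes out first, exactly as in the slide’s output column.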
    • 19. Reducer phases 1. Shuffle (copy) phase: copies output from the Mapper to the Reducer’s local file system. 2. Sort phase: sorts the Mapper output by key; this becomes the Reducer input.
      Mapper output: <“total tags”, 2>, <“tag1”, 1>, <“tag3”, 1>, <“total tags”, 1>, <“tag3”, 1>, <“total tags”, 3>, <“tag2”, 1>, <“tag3”, 1>, <“tag4”, 1>, <“total tags”, 3>, <“tag1”, 1>, <“tag2”, 1>, <“tag3”, 1>
      after shuffle & sort by key becomes
      Reducer input: <“tag1”, 1>, <“tag1”, 1>, <“tag2”, 1>, <“tag2”, 1>, <“tag3”, 1>, <“tag3”, 1>, <“tag3”, 1>, <“tag3”, 1>, <“tag4”, 1>, <“total tags”, 2>, <“total tags”, 1>, <“total tags”, 3>, <“total tags”, 3>
      3. Reduce (emit) phase: performs reduce() for each group of sorted <key, value> input pairs.
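The shuffle & sort step amounts to grouping the mapper’s pairs by key in sorted key order. A minimal sketch, with a TreeMap standing in for Hadoop’s sorting and grouping machinery:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simulation of shuffle & sort: group mapper output pairs by key, keys sorted.
class ShuffleSortSim {
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> mapperOutput) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>(); // TreeMap keeps keys sorted
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // The mapper output from the slide, in emission order.
        List<Map.Entry<String, Integer>> mapperOutput = List.of(
            Map.entry("total tags", 2), Map.entry("tag1", 1), Map.entry("tag3", 1),
            Map.entry("total tags", 1), Map.entry("tag3", 1),
            Map.entry("total tags", 3), Map.entry("tag2", 1), Map.entry("tag3", 1), Map.entry("tag4", 1),
            Map.entry("total tags", 3), Map.entry("tag1", 1), Map.entry("tag2", 1), Map.entry("tag3", 1));
        System.out.println(shuffle(mapperOutput));
        // prints {tag1=[1, 1], tag2=[1, 1], tag3=[1, 1, 1, 1], tag4=[1], total tags=[2, 1, 3, 3]}
    }
}
```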
    • 20. Tag Cloud Reduce phase Reducer implements the interface: org.apache.hadoop.mapreduce.Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.
      Reducer input pairs are grouped by tagi, e.g. <“tag1”, 1>, <“tag1”, 1> becomes [<“tag1”, 1>, <“tag1”, 1>]; likewise for the <“tag2”, 1>, <“tag3”, 1>, <“tag4”, 1> and <“total tags”, 2>, <“total tags”, 1>, <“total tags”, 3>, <“total tags”, 3> pairs.
      reduce() body (emits via context.write()):
      int tagsCount = 0;
      for (IntWritable value : values) {
          tagsCount += value.get();
      }
      context.write(key, new IntWritable(tagsCount));
      Reducer output: <tag1, 2>, <tag2, 2>, <tag3, 4>, <tag4, 1>, <total tags, 9>
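The reduce() body is just a sum over each key’s grouped values. Simulated in plain Java, with List<Integer> standing in for Hadoop’s Iterable<IntWritable>:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Cluster-free simulation of the TagCloud reduce() body above.
class TagCloudReducerSim {
    // Sums the grouped counts for one key, as reduce() does.
    static int reduce(List<Integer> values) {
        int tagsCount = 0;
        for (int value : values) {
            tagsCount += value;
        }
        return tagsCount;
    }

    public static void main(String[] args) {
        // The grouped reducer input from the slides.
        Map<String, List<Integer>> reducerInput = new LinkedHashMap<>();
        reducerInput.put("tag1", List.of(1, 1));
        reducerInput.put("tag2", List.of(1, 1));
        reducerInput.put("tag3", List.of(1, 1, 1, 1));
        reducerInput.put("tag4", List.of(1));
        reducerInput.put("total tags", List.of(2, 1, 3, 3));
        reducerInput.forEach((key, values) ->
            System.out.println("<" + key + ", " + reduce(values) + ">"));
        // prints <tag1, 2> <tag2, 2> <tag3, 4> <tag4, 1> <total tags, 9>
    }
}
```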
    • 21. Tag Cloud Output Reducer output is a weighted list: <tag1, 2>, <tag2, 2>, <tag3, 4>, <tag4, 1>, <total tags, 9>. Tag’s weight: weight(tagi) = tagi count / total tags, giving <weight(tag1), 2/9>, <weight(tag2), 2/9>, <weight(tag3), 4/9>, <weight(tag4), 1/9>. Size of font: font(tagi) = fn(weight(tagi))
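The weight and font formulas translate directly into code. In this sketch the 10–40 pt font range is a hypothetical choice for fn, not something specified on the slides:

```java
// weight(tag_i) = tag_i count / total tags; font(tag_i) = fn(weight(tag_i)).
class TagWeights {
    static double weight(int tagCount, int totalTags) {
        return (double) tagCount / totalTags;
    }

    // Hypothetical fn: scale the weight linearly into a 10..40 pt font range.
    static double font(double weight) {
        return 10 + 30 * weight;
    }

    public static void main(String[] args) {
        System.out.println(weight(4, 9)); // tag3: 4/9
        System.out.println(font(weight(4, 9)));
    }
}
```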
    • 22. Between Map and Reduce Combiner: implements the interface org.apache.hadoop.mapreduce.Reducer; works as an in-memory Reducer over the Mapper output; serves as an additional optimization. In-memory combine example: Mapper output <“total tags”, 2>, <“tag1”, 1>, <“tag1”, 1>, <“tag3”, 1> combines to Combiner output <“total tags”, 3>, <“tag1”, 2>, <“tag3”, 1>. Partitioner: implements the interface org.apache.hadoop.mapreduce.Partitioner; assigns each intermediate <key, value> pair from the Mapper to its designated Reducer partition.
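A partitioner of the kind described assigns each intermediate key to one of N reducer partitions. This sketch mirrors the usual hashCode-modulo approach (Hadoop’s default HashPartitioner works this way):

```java
// Simulation of a hash partitioner: key -> reducer partition index.
class PartitionerSim {
    static int getPartition(String key, int numReduceTasks) {
        // Mask the sign bit so the partition index is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Every occurrence of a key lands on the same reducer, so all of a
        // tag's <tag, 1> pairs can be summed in one place.
        for (String key : new String[] {"tag1", "tag2", "tag3", "total tags"}) {
            System.out.println(key + " -> reducer " + getPartition(key, 2));
        }
    }
}
```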
    • 23. Time for a Workshop Standalone mode
      Build the “Tag Cloud” project jar:
      cd $TAG_CLOUD_HOME
      mvn clean install
      Check the input directory:
      $HADOOP_HOME/bin/hadoop fs -ls $TAG_CLOUD_HOME/input/
      Check the input file:
      $HADOOP_HOME/bin/hadoop fs -cat $TAG_CLOUD_HOME/input/tags01
      Submit TagCloudJob to Hadoop:
      $HADOOP_HOME/bin/hadoop jar $TAG_CLOUD_HOME/target/tagcloud-1.0.jar com.altoros.rnd.hadoop.tagcloud.TagCloudJob $TAG_CLOUD_HOME/input $TAG_CLOUD_HOME/output
      Check the output directory:
      $HADOOP_HOME/bin/hadoop fs -ls $TAG_CLOUD_HOME/output/
      Check the output file:
      $HADOOP_HOME/bin/hadoop fs -cat $TAG_CLOUD_HOME/output/part-r-00000
    • 24. Apache Pig Higher-level data processing layer on top of Hadoop Data-flow oriented language (pig scripts) Data types include sets, associative arrays, tuples Developed at Yahoo!
    • 25. Apache Hive Feature set is similar to Pig SQL-like data warehouse infrastructure Language is more strictly SQL Supports SELECT, JOIN, GROUP BY, etc Developed at Facebook
    • 26. Apache HBase Column-store database (modeled after Google BigTable) HDFS is the underlying file system Holds extremely large datasets (multiple TB) Constrained access model
    • 27. Apache Mahout  Scalable machine learning algorithms on top of Hadoop: – filtering, – recommendations, – classifiers, – clustering
    • 28. Apache ZooKeeper  Common services for distributed applications: - group services, - configuration management, - naming services, - synchronization
    • 29. Oozie Workflow engine for Hadoop Orchestrates dependencies between jobs running on Hadoop (including HDFS, Pig, and MapReduce) Developed at Yahoo!
    • 30. Apache Chukwa  System for reliable large-scale log collection  Displaying, monitoring and analyzing results  Built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce  Incubated at
    • 31. Questions links: skype: siarhei_bushyk mailto: