Anatomy of distributed computing with Hadoop


  • DataNodes are constantly reporting to the NameNode. Blocks are stored on the Data Nodes.
  • Standalone operation mode:
    1. export TAG_CLOUD_HOME=/Users/tazija/Projects/hadoop/tagcloud
    2. export HADOOP_HOME=/Users/tazija/Programs/
    3. cd $TAG_CLOUD_HOME
    4. mvn clean install
    5. $HADOOP_HOME/bin/hadoop fs -ls $TAG_CLOUD_HOME/input (the input directory is $TAG_CLOUD_HOME/input)
    6. $HADOOP_HOME/bin/hadoop fs -cat $TAG_CLOUD_HOME/input/tags01
       We use an InputFormat for plain text files. Files are broken into lines; either linefeed or carriage return signals end of line. Keys are the position in the file, and values are the line of text.
    7. $HADOOP_HOME/bin/hadoop jar $TAG_CLOUD_HOME/target/tagcloud-1.0.jar com.altoros.rnd.hadoop.tagcloud.TagCloudJob $TAG_CLOUD_HOME/input $TAG_CLOUD_HOME/output
    Distributed mode:
    1. /etc/hadoop/hdfs-site.xml:
       <configuration>
         <property>
           <name></name>
           <value>file:/Users/tazija/Programs/apache-hadoop-0.23.0/data/hdfs/namenode</value>
         </property>
         <property>
           <name></name>
           <value>file:/Users/tazija/Programs/apache-hadoop-0.23.0/data/hdfs/namenode</value>
         </property>
         <property>
           <name>dfs.replication</name>
           <value>1</value>
         </property>
       </configuration>
    2. Format the filesystem: bin/hadoop namenode -format
    3. Start daemons:
       ./sbin/ start namenode
       ./sbin/ start datanode
       ./sbin/ start secondarynamenode
    4. Check HDFS status: http://localhost:50070/

    1. 1. Anatomy of distributed computing with Hadoop
    2. 2. What is Hadoop? Hadoop started out as a subproject of Nutch by Doug Cutting. Hadoop boosted Nutch's scalability. It was enhanced by Yahoo! and became an Apache top-level project. A system for distributed big data processing: big data is terabytes and petabytes and more… Exabyte, zettabyte datasets?
    3. 3. Why does anyone need Hadoop?
    4. 4. Hadoop use cases
    5. 5. Hadoop use cases
    6. 6. Hadoop use cases
    7. 7. Hadoop basics Implements Google's whitepapers (GFS and MapReduce). Hadoop is a combination of: HDFS for storage and MapReduce for computation.
    8. 8. HDFS: Hadoop Distributed File System. It's a file system: bin/hadoop dfs <command> <options>, where <command> is one of: cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, cp, du, dus, expunge, get, getmerge, ls, lsr, mkdir, moveFromLocal, moveToLocal, mv, put, rm, rmr, setrep, stat, tail, test, text, touchz
    9. 9. Hadoop Distributed File System It’s accessible
    10. 10. Hadoop Distributed File System It's distributed: it employs a master/slave architecture
    11. 11. Hadoop Distributed File System Name Node: Stores file system metadata Secondary Name Node(s): Periodically merges file system image Data Node(s): Stores actual data (blocks) Allows data to be replicated
    12. 12. MapReduce  A programming model for distributed data processing  The data processing primitives are functions: Mappers and Reducers
    13. 13. MapReduce! To decompose MapReduce, think of data in terms of keys and values: <key, value>, e.g. <user id, user profile>, <timestamp, apache log entry>, <tag, list of tagged images>
    14. 14. MapReduce Mapper Function that takes key and value and emits zero or more keys and values Reducer Function that takes key and all “mapped” values and emits zero or more new keys and value
    15. 15. MapReduce example The "Hello World" for Hadoop: a "Tag Cloud" example. [Diagram: tags tag1 … tag6 rendered at a font size given by weight(tagi)]
    16. 16. Tag Cloud example Input is taggable content (images, posts, videos) with space-separated tags: <posti, "tag1 tag2 … tagn">. Output is each tagi with its count, plus the total tag count: <tagi, tag count>, <total tags, total tags count>. Results: weight(tagi) = tagi count / total tags; font(tagi) = fn(weight(tagi))
    17. 17. Tag Cloud Mapper Mapper implements the interface org.apache.hadoop.mapreduce.Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>. Mapper input: <post1, "tag1 tag3">, <post2, "tag3">, <post3, "tag2 tag3 tag4">, <post4, "tag1 tag2 tag3">. To simplify the model, make the line number the key and write the raw tags to the input file: <line1, "tag1 tag3">, <line2, "tag3">, <line3, "tag2 tag3 tag4">, <line4, "tag1 tag2 tag3">
    18. 18. Tag Cloud Mapper
        Mapper input (line number is the key; the value is the line of space-separated tags, e.g. "tag1 tag3"):
        <0, "tag1 tag3">  <1, "tag3">  <2, "tag2 tag3 tag4">  <3, "tag1 tag2 tag3">
        Mapper code:
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line, " ");
        context.write(TOTAL_TAGS_KEY, new IntWritable(tokenizer.countTokens()));
        while (tokenizer.hasMoreTokens()) {
            Text tag = new Text(tokenizer.nextToken());
            context.write(tag, new IntWritable(1)); // write to HDFS
        }
        Mapper output:
        <"total tags", 2> <"tag1", 1> <"tag3", 1>
        <"total tags", 1> <"tag3", 1>
        <"total tags", 3> <"tag2", 1> <"tag3", 1> <"tag4", 1>
        <"total tags", 3> <"tag1", 1> <"tag2", 1> <"tag3", 1>
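    The mapper above only runs inside a Hadoop job. As a rough, self-contained sketch of the same tokenizing logic in plain Java (the class name TagCloudMapperSketch and the list-based output are invented for this example; the real code writes pairs via Hadoop's context), it could look like:

    ```java
    import java.util.AbstractMap;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.StringTokenizer;

    public class TagCloudMapperSketch {
        static final String TOTAL_TAGS_KEY = "total tags";

        // Mirrors the Hadoop mapper: emit ("total tags", n) once per input line,
        // then (tag, 1) for every space-separated tag on the line.
        static List<Map.Entry<String, Integer>> map(String line) {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            StringTokenizer tokenizer = new StringTokenizer(line, " ");
            out.add(new AbstractMap.SimpleEntry<>(TOTAL_TAGS_KEY, tokenizer.countTokens()));
            while (tokenizer.hasMoreTokens()) {
                out.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1));
            }
            return out;
        }

        public static void main(String[] args) {
            System.out.println(map("tag1 tag3")); // [total tags=2, tag1=1, tag3=1]
        }
    }
    ```

    The count-then-iterate pattern matches the slide's snippet: the total is emitted before the individual (tag, 1) pairs.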
    19. 19. Reducer phases
        1. Shuffle (copy) phase: copies output from the Mapper to the Reducer's local file system.
        2. Sort phase: sorts the Mapper output by keys; this becomes the Reducer input.
        Mapper output:
        <"total tags", 2> <"tag1", 1> <"tag3", 1> <"total tags", 1> <"tag3", 1> <"total tags", 3> <"tag2", 1> <"tag3", 1> <"tag4", 1> <"total tags", 3> <"tag1", 1> <"tag2", 1> <"tag3", 1>
        After shuffle & sort by key, Reducer input:
        <"tag1", 1> <"tag1", 1> <"tag2", 1> <"tag2", 1> <"tag3", 1> <"tag3", 1> <"tag3", 1> <"tag3", 1> <"tag4", 1> <"total tags", 2> <"total tags", 1> <"total tags", 3> <"total tags", 3>
        3. Reduce (emit) phase: performs reduce() for each sorted <key, value> input group.
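    As an illustrative, non-Hadoop sketch, the shuffle & sort step amounts to grouping the mapper's pairs by key in sorted key order; a TreeMap gives exactly that behavior (the class and method names here are invented for the example):

    ```java
    import java.util.AbstractMap;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.SortedMap;
    import java.util.TreeMap;

    public class ShuffleSortSketch {
        // Group mapper output pairs by key, sorted by key, mimicking
        // Hadoop's shuffle & sort between the map and reduce phases.
        static SortedMap<String, List<Integer>> shuffleSort(List<Map.Entry<String, Integer>> mapperOutput) {
            SortedMap<String, List<Integer>> grouped = new TreeMap<>();
            for (Map.Entry<String, Integer> pair : mapperOutput) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
            return grouped;
        }

        public static void main(String[] args) {
            List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
            pairs.add(new AbstractMap.SimpleEntry<>("tag3", 1));
            pairs.add(new AbstractMap.SimpleEntry<>("tag1", 1));
            pairs.add(new AbstractMap.SimpleEntry<>("tag3", 1));
            System.out.println(shuffleSort(pairs)); // {tag1=[1], tag3=[1, 1]}
        }
    }
    ```

    In a real job this grouping happens across machines, with each reducer pulling only its partition of keys; the TreeMap stands in for that sorted, grouped view.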
    20. 20. Tag Cloud Reduce phase
        Reducer implements the interface org.apache.hadoop.mapreduce.Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.
        Reducer input (pairs grouped by tagi, e.g. [<"tag1", 1>, <"tag1", 1>]):
        <"tag1", [1, 1]>  <"tag2", [1, 1]>  <"tag3", [1, 1, 1, 1]>  <"tag4", [1]>  <"total tags", [2, 1, 3, 3]>
        Reducer code:
        int tagsCount = 0;
        for (IntWritable value : values) {
            tagsCount += value.get();
        }
        context.write(key, new IntWritable(tagsCount));
        Reducer output:
        <tag1, 2> <tag2, 2> <tag3, 4> <tag4, 1> <total tags, 9>
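    Outside a Hadoop job, the same summing reducer can be sketched in plain Java (the method name and types are illustrative; Hadoop actually passes an Iterable<IntWritable> and writes through a Context):

    ```java
    import java.util.List;

    public class TagCloudReducerSketch {
        // Mirrors the Hadoop reducer: sum all mapped counts for one key.
        static int reduce(String key, List<Integer> values) {
            int tagsCount = 0;
            for (int value : values) {
                tagsCount += value;
            }
            return tagsCount;
        }

        public static void main(String[] args) {
            System.out.println(reduce("tag3", List.of(1, 1, 1, 1)));       // 4
            System.out.println(reduce("total tags", List.of(2, 1, 3, 3))); // 9
        }
    }
    ```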
    21. 21. Tag Cloud Output
        Reducer output is a weighted list: <tag1, 2> <tag2, 2> <tag3, 4> <tag4, 1> <total tags, 9>
        Tag's weight: weight(tagi) = tagi count / total tags:
        <weight(tag1), 2/9> <weight(tag2), 2/9> <weight(tag3), 4/9> <weight(tag4), 1/9>
        Size of font: font(tagi) = fn(weight(tagi))
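    Turning the reducer's counts into weights is a simple division per the slide's formula; a minimal sketch, with invented names (TagWeightSketch, weights), assuming the total tag count is passed in separately:

    ```java
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class TagWeightSketch {
        // weight(tag) = tag count / total tags
        static Map<String, Double> weights(Map<String, Integer> tagCounts, int totalTags) {
            Map<String, Double> result = new LinkedHashMap<>();
            for (Map.Entry<String, Integer> e : tagCounts.entrySet()) {
                result.put(e.getKey(), e.getValue() / (double) totalTags);
            }
            return result;
        }

        public static void main(String[] args) {
            Map<String, Integer> counts = new LinkedHashMap<>();
            counts.put("tag1", 2);
            counts.put("tag3", 4);
            // tag1 -> 2/9, tag3 -> 4/9
            System.out.println(weights(counts, 9));
        }
    }
    ```

    A front end would then map each weight to a font size via some fn, e.g. a linear scale between a minimum and maximum size.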
    22. 22. Between Map and Reduce
        Combiner:
        - implements the interface org.apache.hadoop.mapreduce.Reducer
        - the function works as an in-memory Reducer on the Mapper output
        - serves for additional optimization
        Example: Mapper output <"total tags", 2> <"tag1", 1> <"tag1", 1> <"tag3", 1>, after in-memory combine: Combiner output <"total tags", 3> <"tag1", 2> <"tag3", 1>
        Partitioner:
        - implements the interface org.apache.hadoop.mapreduce.Partitioner
        - the function assigns each intermediate <key, value> pair from the Mapper to its designated Reducer partition
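    A combiner's effect can be illustrated outside Hadoop as a local, in-memory aggregation that shrinks the pairs sent across the network to the reducer. This plain-Java sketch (names invented; a real combiner subclasses Reducer and may run zero or more times per mapper) sums values per key:

    ```java
    import java.util.AbstractMap;
    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class CombinerSketch {
        // Local in-memory combine: sum values per key before the shuffle,
        // reducing the volume of intermediate Mapper -> Reducer data.
        static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapperOutput) {
            Map<String, Integer> combined = new LinkedHashMap<>();
            for (Map.Entry<String, Integer> pair : mapperOutput) {
                combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
            }
            return combined;
        }

        public static void main(String[] args) {
            List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
            pairs.add(new AbstractMap.SimpleEntry<>("tag1", 1));
            pairs.add(new AbstractMap.SimpleEntry<>("tag1", 1));
            pairs.add(new AbstractMap.SimpleEntry<>("tag3", 1));
            System.out.println(combine(pairs)); // {tag1=2, tag3=1}
        }
    }
    ```

    Because the combiner may or may not run, it is only safe for operations like summation where combining partial results does not change the final reduce output.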
    23. 23. Time for a Workshop (standalone mode)
        Build the "Tag Cloud" project jar:
        cd $TAG_CLOUD_HOME
        mvn clean install
        Check the input directory:
        $HADOOP_HOME/bin/hadoop fs -ls $TAG_CLOUD_HOME/input/
        Check the input file:
        $HADOOP_HOME/bin/hadoop fs -cat $TAG_CLOUD_HOME/input/tags01
        Submit TagCloudJob to Hadoop:
        $HADOOP_HOME/bin/hadoop jar $TAG_CLOUD_HOME/target/tagcloud-1.0.jar com.altoros.rnd.hadoop.tagcloud.TagCloudJob $TAG_CLOUD_HOME/input $TAG_CLOUD_HOME/output
        Check the output directory:
        $HADOOP_HOME/bin/hadoop fs -ls $TAG_CLOUD_HOME/output/
        Check the output file:
        $HADOOP_HOME/bin/hadoop fs -cat $TAG_CLOUD_HOME/output/part-r-00000
    24. 24. Apache Pig Higher-level data processing layer on top of Hadoop Data-flow oriented language (pig scripts) Data types include sets, associative arrays, tuples Developed at Yahoo!
    25. 25. Apache Hive Feature set is similar to Pig SQL-like data warehouse infrastructure The language is closer to SQL Supports SELECT, JOIN, GROUP BY, etc. Developed at Facebook
    26. 26. Apache HBase  Column-store database (after Google's BigTable model)  HDFS is the underlying file system  Holds extremely large datasets (multiple terabytes)  Constrained access model
    27. 27. Apache Mahout  Scalable machine learning algorithms on top of Hadoop: – filtering, – recommendations, – classifiers, – clustering
    28. 28. Apache ZooKeeper  Common services for distributed applications: - group services, - configuration management, - naming services, - synchronization
    29. 29. Oozie Workflow engine for Hadoop Orchestrates dependencies between jobs running on Hadoop (including HDFS, Pig and MapReduce) Another query processing API Developed at Yahoo!
    30. 30. Apache Chukwa  System for reliable large-scale log collection  Displaying, monitoring and analyzing results  Built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce  Incubated at
    31. 31. Questions? skype: siarhei_bushyk