Big Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs

IMC Institute Course: "Big Data on Public Cloud" 5 Aug 2013


  1. 1. Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 1 Big Data using Hadoop On Amazon Elastic MapReduce Hands On Workshop Dr.Thanachart Numnonda thanachart@imcinstitute.com Danairat T. Certified Java Programmer, TOGAF – Silver danairat@gmail.com, +66-81-559-1446
  2. 2. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Lecture: Big Data Development Process
  3. 3. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Big Data Development Process Guideline Architecture Planning • Targeted Users • Target Opportunities • Data Scientist • Data Source/Type • Data Capturing Approach • Data Processing and Visualize Planning • Technology Architecture • Big Data EcoSystem • (Hadoop Ecosystem) • Sizing • Integration • Security • Administration and Operation Planning Big Data Development • Develop Use Cases • Set up Big Data Pseudo-distribution Mode • Set up HDFS • Develop Data Capturing System • Develop Data Analytic • Map Reduce • Hive • R • Etc. • Integrate result to Enterprise Analytic System • Set up Big Data Cluster Mode Operation and Support • Monitor HDFS utilization and capacity planning • Monitor Job Tracker availability • Monitor Data Capturing System • Upgrade or Patch Big Data Hadoop ecosystem • System admin. Training • Helpdesk Training • End-User Training (Analytic Results) System Evaluation • Adoption Rates for each analytics results • No. of Missing Analytic Results • No. of Missing Data • Lost hours per month • Avg. of each Analytic Result Response Time • No. of Technology System Failure per month
  4. 4. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Hands-On: Running Hadoop on Local Mode
  5. 5. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Hadoop Installation Hadoop provides three installation choices: ● Local mode: This is an unzip-and-run mode to get you started right away, where all parts of Hadoop run within the same JVM ● Pseudo-distributed mode: This mode runs the different parts of Hadoop as separate Java processes, but within a single machine ● Distributed mode: This is the real setup that spans multiple machines
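A quick sketch of local mode (the /usr/local/hadoop path and the bundled examples jar name are assumptions based on the VM used later in this workshop; adjust them to your installation). No daemons have to be started at all:

    # Local (standalone) mode: everything runs inside one JVM,
    # and input/output are plain local directories
    cd /usr/local/hadoop
    mkdir /tmp/local-input
    cp conf/*.xml /tmp/local-input
    bin/hadoop jar hadoop-examples-*.jar wordcount /tmp/local-input /tmp/local-output
    ls /tmp/local-output && cat /tmp/local-output/part-*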
  6. 6. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Installing Hadoop and Ecosystem 1. Installing VirtualBox or VMware Player 2. Running Image File 3. Start Hadoop 4. Hadoop Web Console 5. Stop Hadoop Notes:- Hadoop and IPv6; Apache Hadoop is not currently supported on IPv6 networks. It has only been tested and developed on IPv4 stacks. Hadoop needs IPv4 to work, and only IPv4 clients can talk to the cluster. If your organisation moves to IPv6 only, you will encounter problems. Source: http://wiki.apache.org/hadoop/HadoopIPv6
  7. 7. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop MapReduce (Job Scheduling/Execution System) HDFS (Hadoop Distributed File System) Pig Sqoop HBase Hive Hadoop's Ecosystem in the VM
  8. 8. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Starting Hadoop [hdadmin@localhost hadoop]$ /usr/local/hadoop/bin/start-all.sh Starting up a Namenode, Datanode, Jobtracker and a Tasktracker on your machine. [hdadmin@localhost hadoop]$ /usr/lib/jvm/jdk1.6.0_39/bin/jps 11567 Jps 10766 NameNode 11099 JobTracker 11221 TaskTracker 10899 DataNode 11018 SecondaryNameNode [hdadmin@localhost hadoop]$ Checking Java Process and you are now running Hadoop as pseudo distributed mode
  9. 9. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Hadoop is up!
  10. 10. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Stopping Hadoop [hdadmin@localhost hadoop]$ /usr/local/hadoop/bin/stop-all.sh stopping jobtracker localhost: stopping tasktracker stopping namenode localhost: stopping datanode localhost: stopping secondarynamenode
  11. 11. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Hands-On: Importing Data to HDFS using Hadoop Command Line
  12. 12. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Importing Data to Hadoop Creating new file in /tmp $ vi /tmp/input_test.txt GNOME Terminal is a terminal emulation application that you can use to perform the following tasks: Access a UNIX shell in the GNOME environment A shell is a program that interprets and executes the commands that you type at a command line prompt. When you start GNOME Terminal, the application starts the default shell that is specified in your system account. You can switch to a different shell at any time. Typing for the text file, Please type your own data $hadoop dfs -mkdir /input $hadoop dfs -mkdir /output $hadoop dfs -copyFromLocal /tmp/input_test.txt /input
  13. 13. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Hands-On: Reviewing, Retrieving, Deleting Data from HDFS
  14. 14. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Review file in Hadoop HDFS [hdadmin@localhost bin]$ hadoop dfs -ls /input Found 1 items -rw-r--r-- 1 hdadmin supergroup 1016 2013-03-13 20:11 /input/input_test.txt [hdadmin@localhost bin]$ hadoop dfs -cat /input/input_test.txt List HDFS File Read HDFS File Retrieve HDFS File to Local File System Please see also http://hadoop.apache.org/docs/r1.0.4/commands_manual.html [hdadmin@localhost bin]$ hadoop dfs -copyToLocal /input/input_test.txt /tmp/file.txt
  15. 15. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Review file in Hadoop HDFS using WebUI: http://localhost:50070/
  16. 16. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Review file in Hadoop HDFS using WebUI
  17. 17. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Review file in Hadoop HDFS using WebUI
  18. 18. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Review file in Hadoop HDFS using WebUI Scroll Down
  19. 19. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Review file in Hadoop HDFS using WebUI
  20. 20. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Review file in Hadoop HDFS using WebUI
  21. 21. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Review file in Hadoop HDFS using WebUI
  22. 22. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Hadoop Port Numbers (Daemon / Default Port / Configuration Parameter in conf/*-site.xml): HDFS Namenode: 50070 (dfs.http.address); Datanodes: 50075 (dfs.datanode.http.address); Secondarynamenode: 50090 (dfs.secondary.http.address); MR JobTracker: 50030 (mapred.job.tracker.http.address); Tasktrackers: 50060 (mapred.task.tracker.http.address)
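A quick way to confirm which daemons are up is to request each web UI on its default port from the shell; a minimal sketch for the pseudo-distributed VM used in the earlier labs:

    # Each daemon answers HTTP on its default port when it is running
    curl -s -o /dev/null -w "namenode     %{http_code}\n" http://localhost:50070/
    curl -s -o /dev/null -w "datanode     %{http_code}\n" http://localhost:50075/
    curl -s -o /dev/null -w "jobtracker   %{http_code}\n" http://localhost:50030/
    curl -s -o /dev/null -w "tasktracker  %{http_code}\n" http://localhost:50060/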
  23. 23. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Review Content from System shell [hdadmin@localhost current]$ cd /app/hadoop/tmp/dfs/data/current [hdadmin@localhost current]$ ls -l total 24 -rw-r--r--. 1 hdadmin hadoop 1016 Mar 13 20:11 blk_1997667773574667398 -rw-r--r--. 1 hdadmin hadoop 15 Mar 13 20:11 blk_1997667773574667398_1005.meta -rw-r--r--. 1 hdadmin hadoop 4 Mar 13 20:04 blk_-6735227193197163844 -rw-r--r--. 1 hdadmin hadoop 11 Mar 13 20:04 blk_-6735227193197163844_1004.meta -rw-r--r--. 1 hdadmin hadoop 482 Mar 13 20:18 dncp_block_verification.log.curr -rw-r--r--. 1 hdadmin hadoop 154 Mar 13 20:03 VERSION [hdadmin@localhost current]$ more blk_1997667773574667398 GNOME Terminal is a terminal emulation application that you can use to perform the following tasks: Access a UNIX shell in the GNOME environment A shell is a program that interprets and executes the commands that you type at a command lin e prompt. When you start GNOME Terminal, the application starts the default shell that is specified in your system account. You can switch to a different shell at any time. [hdadmin@localhost current]$
  24. 24. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Removing data from HDFS using Shell Command hdadmin@localhost detach]$ hadoop dfs -rm /input/input_test.txt Deleted hdfs://localhost:54310/input/input_test.txt hdadmin@localhost detach]$
  25. 25. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Hands-On: Running Hadoop on Amazon Elastic MapReduce
  26. 26. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Architecture Overview of Amazon EMR
  27. 27. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Creating an AWS account
  28. 28. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Signing up for the necessary services ● Simple Storage Service (S3) ● Elastic Compute Cloud (EC2) ● Elastic MapReduce (EMR) Caution! This costs real money!
  29. 29. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Creating Amazon S3 bucket
  30. 30. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Create access key using Security Credentials in the AWS Management Console
  31. 31. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
  32. 32. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Creating a new Job Flow in EMR
  33. 33. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
  34. 34. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
  35. 35. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
  36. 36. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
  37. 37. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
  38. 38. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
  39. 39. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
  40. 40. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop View Result from the S3 bucket
  41. 41. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Lecture: Understanding Map Reduce Processing (Diagram: a Client submits jobs to the Name Node / Job Tracker; the Map and Reduce tasks run on the Task Trackers co-located with the Data Nodes)
  42. 42. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop MapReduce Framework map: (K1, V1) -> list(K2, V2) reduce: (K2, list(V2)) -> list(K3, V3)
  43. 43. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop MapReduce Processing – The Data flow 1. InputFormat, InputSplits, RecordReader 2. Mapper - your focus is here 3. Partition, Shuffle & Sort 4. Reducer - your focus is here 5. OutputFormat, RecordWriter
  44. 44. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop How does the MapReduce work? (Diagram: an InputSplit is read by a RecordReader and fed to the Mapper, which writes its output as a list of (Key, Value) to an intermediate file; after Partitioning and Sorting, the Reducer receives a list of (Key, List of Values) and writes the final output through a RecordWriter)
  45. 45. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop How does the MapReduce work? (Diagram: the same flow as the previous slide, but with Combining on the map side, e.g. two (Car, 1) records become (Car, 2) before the shuffle, so the Reducer receives Bear, {1,1}; Car, {2,1}; Deer, {1,1}; River, {1,1})
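The combining step shown above can be switched on with a single extra driver call. A minimal sketch, assuming the org.myorg.WordCount classes shown later in this deck; reusing Reduce as the combiner is only safe because summing counts is associative and commutative:

    package org.myorg;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    public class WordCountWithCombiner {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount-with-combiner");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WordCount.Map.class);
        // Local pre-aggregation on the map side: two (Car, 1) records become (Car, 2)
        conf.setCombinerClass(WordCount.Reduce.class);
        conf.setReducerClass(WordCount.Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }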
  46. 46. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop InputFormat (Description / Key / Value): TextInputFormat: default format; reads lines of text files (key: the byte offset of the line; value: the line contents). KeyValueInputFormat: parses lines into key, value pairs (key: everything up to the first tab character; value: the remainder of the line). SequenceFileInputFormat: a Hadoop-specific high-performance binary format (key: user-defined; value: user-defined).
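In the old org.apache.hadoop.mapred API the format is chosen on the JobConf (the actual class behind the "KeyValueInputFormat" row is KeyValueTextInputFormat). A driver-side sketch, one call per row of the table, assuming a JobConf named conf as in the WordCount example later in this deck:

    // default: the mapper receives <LongWritable byte offset, Text line>
    conf.setInputFormat(TextInputFormat.class);
    // tab-separated lines: the mapper receives <Text key, Text value>
    conf.setInputFormat(KeyValueTextInputFormat.class);
    // binary container written by a previous job: key/value types are whatever was written
    conf.setInputFormat(SequenceFileInputFormat.class);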
  47. 47. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop InputSplit An InputSplit describes a unit of work that comprises a single map task. InputSplit presents a byte-oriented view of the input. You can control the split size by setting the mapred.min.split.size parameter in core-site.xml, or by overriding it in the JobConf object used to submit a particular MapReduce job. RecordReader RecordReader reads <key, value> pairs from an InputSplit. Typically the RecordReader converts the byte-oriented view of the input, provided by the InputSplit, and presents a record-oriented view to the Mapper
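A sketch of the driver-side override mentioned above; the 128 MB figure is only an illustrative value:

    // In the job driver, after creating JobConf conf = new JobConf(WordCount.class):
    // ask for input splits of at least 128 MB, which reduces the number of map tasks
    // when the input consists of many small files or blocks
    conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);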
  48. 48. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Mapper Mapper: The Mapper applies the user-defined logic to each input (key, value) pair and emits (key, value) pair(s), which are forwarded to the Reducers. Partition, Shuffle & Sort After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. The Partitioner controls how map outputs are assigned to reduce tasks; the total number of partitions is the same as the number of reduce tasks for the job. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer. This process of moving map outputs to the reducers is known as shuffling.
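As an illustration, a hypothetical custom Partitioner for the old mapred API (wired into the driver with conf.setPartitionerClass(AlphabetPartitioner.class) and conf.setNumReduceTasks(2)); words starting with a-m go to the first reducer, everything else to the second:

    package org.myorg;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class AlphabetPartitioner implements Partitioner<Text, IntWritable> {
      public void configure(JobConf job) { }   // no configuration needed

      public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions < 2) return 0;       // single reducer: everything goes to partition 0
        String word = key.toString().toLowerCase();
        char first = word.isEmpty() ? 'a' : word.charAt(0);
        return (first >= 'a' && first <= 'm') ? 0 : 1;
      }
    }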
  49. 49. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Reducer This is an instance of user-provided code that reads each key and its iterator of values in the partition assigned to it. The OutputCollector object in the Reducer phase has a method named collect() which will collect a (key, value) output. OutputFormat, Record Writer OutputFormat governs the writing format in OutputCollector and RecordWriter writes output into HDFS. OutputFormat (Description): TextOutputFormat: default; writes lines in "key \t value" form. SequenceFileOutputFormat: writes binary files suitable for reading into subsequent MapReduce jobs. NullOutputFormat: generates no output files.
  50. 50. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Hands-On: Writing your own Map Reduce Program
  51. 51. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Wordcount (HelloWorld in Hadoop)
    package org.myorg;

    import java.io.IOException;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.*;

    public class WordCount {

      public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
          }
        }
      }
  52. 52. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Wordcount (HelloWorld in Hadoop)
      public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }
  53. 53. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Wordcount (HelloWorld in Hadoop)
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);

        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
      }
    }
  54. 54. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Hands-On: Packaging Map Reduce and Deploying to Hadoop Runtime Environment
  55. 55. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Packaging Map Reduce Program Usage Assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is the Hadoop version installed, compile WordCount.java and create a jar: $ mkdir /home/hduser/wordcount_classes $ cd /home/hduser $ javac -classpath /usr/local/hadoop/hadoop-core-0.20.205.0.jar -d wordcount_classes WordCount.java $ jar -cvf ./wordcount.jar -C wordcount_classes/ . $ hadoop jar ./wordcount.jar org.myorg.WordCount /input/* /output/wordcount_output_dir Output: ……. $ hadoop dfs -cat /output/wordcount_output_dir/part-00000
  56. 56. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Reviewing MapReduce Output Result Scroll Down the web page
  57. 57. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Reviewing MapReduce Output Result
  58. 58. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Reviewing MapReduce Output Result
  59. 59. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Reviewing MapReduce Output Result
  60. 60. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Reviewing MapReduce Output Result
  61. 61. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Reviewing MapReduce Output Result
  62. 62. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Reviewing MapReduce Output Result
  63. 63. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Reviewing MapReduce Output Result
  64. 64. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Hands-On: Running WordCount.jar on Amazon EMR
  65. 65. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Upload .jar file and input file to Amazon S3 1. Select <yourbucket> in Amazon S3 service 2. Create folder : applications 3. Upload wordcount.jar to the applications folder 4. Create another folder: input 5. Upload input_test.txt to the input folder
  66. 66. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Create a new Job Flow in EMR
  67. 67. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Input JAR Location and Arguments
  68. 68. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
  69. 69. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
  70. 70. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
  71. 71. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
  72. 72. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop View the Result
  73. 73. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Lecture Understanding Hive
  74. 74. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Introduction A Petabyte Scale Data Warehouse Using Hadoop Hive was developed by Facebook and is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called HiveQL, which is based on SQL
  75. 75. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop What Hive is NOT Hive is not designed for online transaction processing and does not offer real-time queries and row level updates. It is best used for batch jobs over large sets of immutable data (like web logs, etc.).
  76. 76. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop System Architecture and Components • Metastore: To store the meta data. • Query compiler and execution engine: To convert SQL queries to a sequence of map/reduce jobs that are then executed on Hadoop. • SerDe and ObjectInspectors: Programmable interfaces and implementations of common data formats and types. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will take a Java object that Hive has been working with, and turn it into something that Hive can write to HDFS or another supported system. • UDF and UDAF: Programmable interfaces and implementations for user defined functions (scalar and aggregate functions). • Clients: Command line client similar to Mysql command line. hive.apache.org
  77. 77. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Architecture Overview (Diagram: queries and browsing/DDL requests arrive through the Hive CLI, Management WebUI and Thrift API; the HiveQL Parser, Planner and Execution engine, together with the MetaStore and the SerDe layer (Thrift, Jute, JSON, ...), turn them into Map Reduce jobs over HDFS) Hive.apache.org
  78. 78. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Sample HiveQL The Query compiler uses the information stored in the metastore to convert SQL queries into a sequence of map/reduce jobs, e.g. the following query SELECT * FROM t where t.c = 'xyz' SELECT t1.c2 FROM t1 JOIN t2 ON (t1.c1 = t2.c1) SELECT t1.c1, count(1) from t1 group by t1.c1 Hive.apache.org
  79. 79. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Running Hive Hive Shell ● Interactive hive ● Script hive -f myscript ● Inline hive -e 'SELECT * FROM mytable' Hive.apache.org
  80. 80. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Hands-On: Creating Table and Retrieving Data using Hive
  81. 81. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Hive Hands-On Labs 1. Creating Hive Table 2. Reviewing Hive Table in HDFS 3. Alter and Drop Hive Table 4. Loading Data to Hive Table 5. Querying Data from Hive Table 6. Reviewing Hive Table Content from HDFS Command and WebUI 7. Insert Overwriting the Hive Table
  82. 82. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Starting Hive Re-Start Hive CLI again $ hive Logging initialized using configuration in file:/usr/local/hive- 0.9.0-bin/conf/hive-log4j.properties Hive history file=/tmp/hdadmin/hive_job_log_hdadmin_201303171635_1944738265.txt hive> hive> quit; Quit from Hive
  83. 83. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop 1. Creating Hive Table hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; OK Time taken: 4.069 seconds hive (default)> show tables; OK test_tbl Time taken: 0.138 seconds hive (default)> describe test_tbl; OK id int country string Time taken: 0.147 seconds hive (default)> See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html
  84. 84. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop 2. Reviewing Hive Table in HDFS [hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse Found 1 items drwxr-xr-x - hdadmin supergroup 0 2013-03-17 17:51 /user/hive/warehouse/test_tbl [hdadmin@localhost hdadmin]$ Review Hive Table from HDFS WebUI
  85. 85. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop 3. Alter and Drop Hive Table hive (default)> alter table test_tbl add columns (remarks STRING); hive (default)> describe test_tbl; OK id int country string remarks string Time taken: 0.077 seconds hive (default)> drop table test_tbl; OK Time taken: 0.9 seconds See also: https://cwiki.apache.org/Hive/adminmanual-metastoreadmin.html
  86. 86. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop 3. Creating an External Hive Table CREATE EXTERNAL TABLE weblog_entries ( ip STRING, dash1 STRING, dash2 STRING, date STRING, status1 STRING, getstr STRING, link STRING, http STRING, Status STRING, size INT ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n' LOCATION '/data/'; weblog.hsql hive -f weblog_create_external_table.hql See also: https://cwiki.apache.org/Hive/adminmanual-metastoreadmin.html
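One practical difference worth noting (a sketch, assuming the weblog_entries and test_tbl tables from these labs): dropping an EXTERNAL table removes only the metastore entry, while dropping a managed table also deletes its files under the Hive warehouse directory.

    -- external table: only metadata is removed, the files under /data/ stay in HDFS
    DROP TABLE weblog_entries;
    -- managed table from the earlier lab: /user/hive/warehouse/test_tbl is deleted as well
    DROP TABLE test_tbl;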
  87. 87. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop 4. Loading Data to Hive Table $ hive hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; Creating Hive table hive (default)> LOAD DATA LOCAL INPATH '/tmp/test_tbl_data.csv' INTO TABLE test_tbl; Copying data from file:/tmp/test_tbl_data.csv Copying file: file:/tmp/test_tbl_data.csv Loading data to table default.test_tbl OK Time taken: 0.241 seconds hive (default)> Loading data to Hive table
  88. 88. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop 5. Querying Data from Hive Table hive (default)> select * from test_tbl; OK 1 USA 62 Indonesia 63 Philippines 65 Singapore 66 Thailand Time taken: 0.287 seconds hive (default)>
  89. 89. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop 5. Querying Data from Hive Table hive (default)> select country from test_tbl; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_201303171733_0001, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201303171733_0001 Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201303171733_0001 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0 2013-03-17 18:13:19,097 Stage-1 map = 0%, reduce = 0% 2013-03-17 18:13:25,151 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.25 sec 2013-03-17 18:13:26,161 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.25 sec 2013-03-17 18:13:27,175 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.25 sec 2013-03-17 18:13:28,186 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.25 sec 2013-03-17 18:13:29,208 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.25 sec 2013-03-17 18:13:30,217 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.25 sec 2013-03-17 18:13:31,224 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 0.25 sec MapReduce Total cumulative CPU time: 250 msec Ended Job = job_201303171733_0001 MapReduce Jobs Launched: Job 0: Map: 1 Cumulative CPU: 0.25 sec HDFS Read: 282 HDFS Write: 45 SUCCESS Total MapReduce CPU Time Spent: 250 msec OK USA Indonesia Philippines Singapore Thailand
  90. 90. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop 6. Reviewing Hive Table Content from HDFS Command and WebUI [hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse/test_tbl Found 1 items -rw-r--r-- 1 hdadmin supergroup 59 2013-03-17 18:08 /user/hive/warehouse/test_tbl/test_tbl_data.csv [hdadmin@localhost hdadmin]$ [hdadmin@localhost hdadmin]$ hadoop fs -cat /user/hive/warehouse/test_tbl/test_tbl_data.csv 1,USA 62,Indonesia 63,Philippines 65,Singapore 66,Thailand [hdadmin@localhost hdadmin]$
  91. 91. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop 7. Insert Overwriting the Hive Table hive (default)> LOAD DATA LOCAL INPATH '/tmp/test_tbl_data_updated.csv' overwrite INTO TABLE test_tbl; Copying data from file:/tmp/test_tbl_data_updated.csv Copying file: file:/tmp/test_tbl_data_updated.csv Loading data to table default.test_tbl Deleted hdfs://localhost:54310/user/hive/warehouse/test_tbl OK Time taken: 0.204 seconds hive (default)>
  92. 92. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Review Hive Table Created in HDFS and WebUI [hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse/test_tbl Found 1 items -rw-r--r-- 1 hdadmin supergroup 3510 2013-03-17 18:25 /user/hive/warehouse/test_tbl/test_tbl_data_updated.csv [hdadmin@localhost hdadmin]$ [hdadmin@localhost hdadmin]$ hadoop fs -cat /user/hive/warehouse/test_tbl/test_tbl_data_updated.csv 93,Afghanistan 355,Albania 213,Algeria 1684,AmericanSamoa 376,Andorra 244,Angola 1264,Anguilla 672,Antarctica 1268,AntiguaandBarbuda 54,Argentina 374,Armenia 297,Aruba 61,Australia 43,Austria 994,Azerbaijan 1242,Bahamas 973,Bahrain
  93. 93. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Hands-On: Install the Amazon EMR Command Line Interface
  94. 94. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Installing Amazon EMR CLI 1. Install Ruby 2. Download the Amazon EMR CLI 3. Install the Amazon EMR CLI 4. Create your credentials file (credentials.json) 5. Create an Amazon EC2 key pair 6. Configure your SSH credentials 7. Verify installation of the Amazon EMR CLI Instruction: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-cli-install.html
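Once the steps above are done, a quick sanity check from the directory where the CLI was unpacked (the flags follow the AWS instructions linked above; treat them as assumptions if your CLI version differs):

    ruby --version                     # the CLI is a Ruby script
    ./elastic-mapreduce --version      # prints the CLI version if the install worked
    ./elastic-mapreduce --list         # lists your job flows (may be empty at first)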
  95. 95. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Example: Credentials file { "access_id": "AKI..........................A", "private_key": "SaJHI4wjyK.............UWDaYOw2el", "keypair": "imckey", "key-pair-file": "~/elastic-mapreduce-cli/imckey.pem", "log_uri": "s3n://imcbucket/", "region": "us-west-2" }
  96. 96. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Running Amazon EMR CLI THANACHARTs-MacBook-Air:~ THANACHART$ cd elastic-mapreduce-cli/ THANACHARTs-MacBook-Air:elastic-mapreduce-cli THANACHART$ THANACHARTs-MacBook-Air:elastic-mapreduce-ruby THANACHART$ ./elastic-mapreduce --list j-2JW8QBWXIYNV8 TERMINATED ec2-54-213-112-102.us-west-2.compute.amazonaws.com HBase CLI COMPLETED Start HBase j-1JNA9G1O7ET2G TERMINATED ec2-54-213-112-74.us-west-2.compute.amazonaws.com Hive Interactive2 COMPLETED Setup Hive j-1H7NX8OGFNFRW TERMINATED ec2-54-213-10-135.us-west-2.compute.amazonaws.com Hive Interactive
  97. 97. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Hands-On: Running Hive Interactive on Amazon EMR
  98. 98. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Running Hive on Amazon EMR ● Amazon EMR enables you to run Hive scripts in two modes: ● Interactive ● Batch Hive.apache.org
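The console walkthrough on the next pages uses interactive mode; the same job flow can also be started from the EMR CLI installed earlier. A hedged sketch (the flow name is illustrative; the key pair and region come from credentials.json, and the flags are as documented in the EMR CLI guide of that era):

    # Start an interactive Hive job flow and keep it alive until terminated manually
    ./elastic-mapreduce --create --alive --name "Hive Interactive Demo" --hive-interactive
    # Poll until the flow reaches WAITING, then SSH in and run hive
    # (as shown later in the "Running CLI to check the Job Flow" slide)
    ./elastic-mapreduce --list -j <jobflow-id>
    ./elastic-mapreduce --ssh <jobflow-id>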
  99. 99. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Upload an input file to Amazon S3 1. Select <yourbucket> in Amazon S3 service 2. Create a folder: data 3. Upload hdi-data.csv to the data folder
  100. 100. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Running Hive Interactive
  101. 101. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
  102. 102. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
  103. 103. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Select EC2 Key Pair
  104. 104. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
  105. 105. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Find Job Flow ID
  106. 106. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Running CLI to check the Job Flow $ ./elastic-mapreduce --list -j j-37WK3Z1T2FZ7D j-37WK3Z1T2FZ7D STARTING ec2-54-213-119-89.us-west- 2.compute.amazonaws.com Hive Interactive Demo PENDING Setup Hive $ ./elastic-mapreduce --list -j j-37WK3Z1T2FZ7D j-37WK3Z1T2FZ7D RUNNING ec2-54-213-119-89.us-west- 2.compute.amazonaws.com Hive Interactive Demo RUNNING Setup Hive $ ./elastic-mapreduce --ssh j-37WK3Z1T2FZ7D hadoop@ip-172-31-24-126:~$hive Logging initialized using configuration in file:/home/hadoop/.versions/hive- 0.8.1/conf/hive-log4j.properties Hive history file=/mnt/var/lib/hive_081/tmp/history/hive_job_log_hadoop_201308011448_80 0175951.txt hive>
  107. 107. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Create a table using HiveQL hive> CREATE TABLE HDI( > id INT, country STRING, hdi FLOAT, lifeex INT, mysch INT, eysch > INT, gni INT) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY "," > STORED AS TEXTFILE > LOCATION "s3://imcbucket/data"; OK Time taken: 4.292 seconds hive> SHOW TABLES; OK hdi Time taken: 0.305 seconds
  108. 108. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Running a SELECT statement hive> SELECT country, gni FROM hdi WHERE gni > 2000; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_201308011444_0001, Tracking URL = http://ip-172-31-24- 126:9100/jobdetails.jsp?jobid=job_201308011444_0001 Kill Command = /home/hadoop/bin/hadoop job -Dmapred.job.tracker=172.31.24.126:9001 -kill job_201308011444_0001 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0 2013-08-01 14:55:53,846 Stage-1 map = 0%, reduce = 0% 2013-08-01 14:58:37,725 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 15.52 sec
  109. 109. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Running a SELECT statement (cont.) MapReduce Total cumulative CPU time: 15 seconds 520 msec Ended Job = job_201308011444_0001 Counters: MapReduce Jobs Launched: Job 0: Map: 1 Accumulative CPU: 15.52 sec HDFS Read: 372 HDFS Write: 2435 SUCCESS Total MapReduce CPU Time Spent: 15 seconds 520 msec OK Norway 47557 Australia 34431 Netherlands 36402 United States 43017 New Zealand 23737 ...
  110. 110. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Lecture Understanding Pig
  111. 111. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Introduction A high-level platform for creating MapReduce programs using Hadoop Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
  112. 112. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Pig Components ● Two Components ● Language (Pig Latin) ● Compiler ● Two Execution Environments ● Local pig -x local ● Distributed pig -x mapreduce Hive.apache.org
  113. 113. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Running Pig ● Script pig myscript ● Command line (Grunt) pig ● Embedded Writing a java program Hive.apache.org
  114. 114. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Pig Latin Hive.apache.org
  115. 115. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Pig Execution Stages (Source: Introduction to Apache Hadoop-Pig, Prashant Kommireddi) Hive.apache.org
  116. 116. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Why Pig? ● Makes writing Hadoop jobs easier ● 5% of the code, 5% of the time ● You don't need to be a programmer to write Pig scripts ● Provides major functionality required for data warehousing and analytics ● Load, Filter, Join, Group By, Order, Transform ● Users can write custom UDFs (User Defined Functions) Hive.apache.org (Source: Introduction to Apache Hadoop-Pig, Prashant Kommireddi)
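A small Pig Latin sketch of the Load / Filter / Group By / Transform / Order steps listed above, assuming the same hdi-data.csv layout used in the hands-on that follows:

    A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int);
    B = FILTER A BY gni > 2000;                    -- Filter
    C = GROUP B BY lifeex;                         -- Group By
    D = FOREACH C GENERATE group, AVG(B.gni);      -- Transform / aggregate
    E = ORDER D BY group;                          -- Order
    dump E;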
  117. 117. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Pig vs. Hive Hive.apache.org
  118. 118. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Hands-On: Running a Pig script
  119. 119. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Starting Pig Command Line [hdadmin@localhost ~]$ pig -x local 2013-08-01 10:29:00,027 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1 (r1459641) compiled Mar 22 2013, 02:13:53 2013-08-01 10:29:00,027 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hdadmin/pig_1375327740024.log 2013-08-01 10:29:00,066 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hdadmin/.pigbootup not found 2013-08-01 10:29:00,212 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:/// grunt>
  120. 120. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop countryFilter.pig A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int); B = FILTER A BY gni > 2000; C = ORDER B BY gni; dump C; #Preparing Data [hdadmin@localhost ~]$ cp hadoop_data/hdi-data.csv /usr/local/pig-0.11.1/bin/ #Edit Your Script [hdadmin@localhost ~]$ cd /usr/local/pig-0.11.1/bin/ [hdadmin@localhost ~]$ vi countryFilter.pig Writing a Pig Script
  121. 121. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop [hdadmin@localhost ~]$ cd /usr/local/pig-0.11.1/bin/ [hdadmin@localhost ~]$ pig -x local grunt > run countryFilter.pig .... (150,Cameroon,0.482,51,5,10,2031) (126,Kyrgyzstan,0.615,67,9,12,2036) (156,Nigeria,0.459,51,5,8,2069) (154,Yemen,0.462,65,2,8,2213) (138,Lao People's Democratic Republic,0.524,67,4,9,2242) (153,Papua New Guinea,0.466,62,4,5,2271) (165,Djibouti,0.43,57,3,5,2335) (129,Nicaragua,0.589,74,5,10,2430) (145,Pakistan,0.504,65,4,6,2550) Running a Pig Script
  122. 122. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Writing a Join operation script CountryJoin.pig A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int); B = FILTER A BY gni > 2000; C = ORDER B BY gni; D = load 'export-data.csv' using PigStorage(',') AS (country:chararray, expct:float); E = JOIN C BY country, D by country; dump E;
  123. 123. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Hands-On: Running a Pig script on Amazon EMR
  124. 124. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Upload .pig file to Amazon S3 1. Select <yourbucket> in Amazon S3 service 2. Upload countryFilter-EMR.pig to the data folder
  125. 125. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Creating a Pig program
  126. 126. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
  127. 127. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
  128. 128. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
  129. 129. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Viewing a result
  130. 130. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
  131. 131. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Lecture Understanding HBase
  132. 132. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Introduction An open source, non-relational, distributed database HBase is an open source, non-relational, distributed database modeled after Google's BigTable and is written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS, providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.
  133. 133. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop HBase Features ● Column oriented data store, known as the Hadoop Database ● Supports random realtime CRUD operations (unlike HDFS) ● NoSQL database ● Open source, written in Java ● Runs on a cluster of commodity hardware Hive.apache.org
  134. 134. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop HBase Architecture Hive.apache.org
  135. 135. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop When to use HBase? ● When you need high volume data to be stored ● Un-structured data ● Sparse data ● Column-oriented data ● Versioned data (same data template, captured at various times; time-elapsed data) ● When you need high scalability Hive.apache.org
  136. 136. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Hands-On: Running HBase
  137. 137. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Starting HBase shell [hdadmin@localhost ~]$ start-hbase.sh starting master, logging to /usr/local/hbase-0.94.10/logs/hbase-hdadmin- master-localhost.localdomain.out [hdadmin@localhost ~]$ jps 3064 TaskTracker 2836 SecondaryNameNode 2588 NameNode 3513 Jps 3327 HMaster 2938 JobTracker 2707 DataNode [hdadmin@localhost ~]$ hbase shell HBase Shell; enter 'help<RETURN>' for list of supported commands. Type "exit<RETURN>" to leave the HBase Shell Version 0.94.10, r1504995, Fri Jul 19 20:24:16 UTC 2013 hbase(main):001:0>
  138. 138. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Create a table and insert data in HBase hbase(main):009:0> create 'test', 'cf' 0 row(s) in 1.0830 seconds hbase(main):010:0> put 'test', 'row1', 'cf:a', 'val1' 0 row(s) in 0.0750 seconds hbase(main):011:0> scan 'test' ROW COLUMN+CELL row1 column=cf:a, timestamp=1375363287644, value=val1 1 row(s) in 0.0640 seconds hbase(main):002:0> get 'test', 'row1' COLUMN CELL cf:a timestamp=1375363287644, value=val1 1 row(s) in 0.0370 seconds
  139. 139. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Hands-On: Running HBase commands on Amazon EMR
  140. 140. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Create a HBase shell
  141. 141. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
  142. 142. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
  143. 143. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
  144. 144. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
  145. 145. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Find Job Flow ID
  146. 146. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Starting HBase Shell $ ./elastic-mapreduce --list -j j-3MKWRS0K8IH7K j-3MKWRS0K8IH7K WAITING ec2-54-213-117-162.us-west-2.compute.amazonaws.com HBase Interactive COMPLETED Start HBase $ ./elastic-mapreduce --ssh j-3MKWRS0K8IH7K hadoop@ip-172-31-33-161:~$ hbase shell
  147. 147. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Recommendation to Further Study Hadoop Beginner's Guide Hadoop: The Definitive Guide, 3rd Edition
  148. 148. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Recommendation to Further Study Hadoop in Practice Hadoop MapReduce Cookbook
  149. 149. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Recommendation to Further Study Amazon Elastic MapReduce Developer Guide
  150. 150. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Thank you
