
Hadoop Workshop on EC2 : March 2015

This workshop is for the "Big Data using Hadoop" course at IMC Institute in March 2015. It is based on Apache Hadoop and uses an EC2 server on AWS.

Published in: Technology

  1. Danairat T., 2013, danairat@gmail.com. Big Data Hadoop – Hands On Workshop. Big Data using Hadoop: Hands-On Workshop, March 2015. Dr. Thanachart Numnonda, Certified Java Programmer, thanachart@imcinstitute.com. Danairat T., Certified Java Programmer, TOGAF – Silver, danairat@gmail.com, +66-81-559-1446
  2. Danairat T., danairat@gmail.com; Thanachart Numnonda, thanachart@imcinstitute.com. Aug 2013. Big Data Hadoop on Amazon EMR – Hands On Workshop. Hands-On: Launch a virtual server on EC2 (Amazon Web Services)
  4. Hadoop Installation. Hadoop provides three installation choices:
       1. Local mode: an unzip-and-run mode to get you started right away, in which all parts of Hadoop run within the same JVM.
       2. Pseudo-distributed mode: the different parts of Hadoop run as separate Java processes, but on a single machine.
       3. Distributed mode: the real setup, which spans multiple machines.
  5. Virtual Server. This lab uses an EC2 virtual server to install a Hadoop server, with the following settings:
       ● Ubuntu Server 14.04 LTS
       ● m3.medium: 1 vCPU, 3.75 GB memory
       ● Security group: default
       ● Key pair: imchadoop
  6. Select the EC2 service and click Launch Instance
  7. Select an Amazon Machine Image (AMI): Ubuntu Server 14.04 LTS (PV)
  8. Choose the m3.medium instance type for the virtual server
  9. Leave the configuration details as default
  10. Add Storage: 20 GB
  11. Name the instance
  12. Select an existing security group > Select Security Group Name: default
  13. Click Launch and choose imchadoop as the key pair
  14. Review the instance, then click Connect for instructions on connecting to it
  15. Connect to an instance from Mac/Linux
  16. Connect to an instance from Windows using PuTTY
  17. Connect to the instance
  18. Hands-On: Installing Hadoop
  19. Installing Hadoop and Ecosystem
       1. Update the system
       2. Configuring SSH
       3. Installing JDK 1.7
       4. Download/Extract Hadoop
       5. Installing Hadoop
       6. Configure xml files
       7. Formatting HDFS
       8. Start Hadoop
       9. Hadoop Web Console
       10. Stop Hadoop
      Note on Hadoop and IPv6: Apache Hadoop is not currently supported on IPv6 networks. It has only been tested and developed on IPv4 stacks. Hadoop needs IPv4 to work, and only IPv4 clients can talk to the cluster. If your organisation moves to IPv6 only, you will encounter problems. Source: http://wiki.apache.org/hadoop/HadoopIPv6
  20. 1) Update the system: sudo apt-get update
  21. 2) Configuring SSH: ssh-keygen
  22. Enabling SSH access to your local machine:
        $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
      Testing the SSH setup by connecting to your local machine:
        $ ssh 54.68.149.232
      Type exit to close the session:
        $ exit
  23. 3) Install JDK 1.7: sudo apt-get install openjdk-7-jdk (enter Y when prompted), then verify the installation with: java -version
  24. 4) Download/Extract Hadoop
        1) Type command > wget http://mirror.issp.co.th/apache/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
        2) Type command > tar -xvzf hadoop-1.2.1.tar.gz
        3) Type command > sudo mv hadoop-1.2.1 /usr/local/hadoop
  25. 5) Installing Hadoop
        1) Type command > sudo vi $HOME/.bashrc
        2) Add the configuration shown in the slide's figure
        3) Type command > exec bash
        4) Type command > sudo vi /usr/local/hadoop/conf/hadoop-env.sh
        5) Edit the file as shown in the slide's figure
  26. 6) Configuring Hadoop conf/*-site.xml
        1. core-site.xml (hadoop.tmp.dir, fs.default.name)
        2. hdfs-site.xml (dfs.replication)
        3. mapred-site.xml (mapred.job.tracker)
  27. Configuring core-site.xml
        1) Type command > sudo vi /usr/local/hadoop/conf/core-site.xml
        2) Add the private IP of the server as shown in the slide's figure (in this case the private IP is 172.31.12.11)
  28. Configuring mapred-site.xml
        1) Type command > sudo vi /usr/local/hadoop/conf/mapred-site.xml
        2) Add the private IP of the JobTracker server as shown in the slide's figure (in this case the private IP is 172.31.12.11)
  29. Configuring hdfs-site.xml
        1) Type command > sudo vi /usr/local/hadoop/conf/hdfs-site.xml
        2) Add the configuration shown in the slide's figure
  30. 7) Formatting HDFS
        1) Type command > sudo mkdir /usr/local/hadoop/tmp
        2) Type command > sudo chown ubuntu /usr/local/hadoop
        3) Type command > sudo chown ubuntu /usr/local/hadoop/tmp
        4) Type command > hadoop namenode -format
  31. Starting Hadoop. Start a NameNode, DataNode, JobTracker and TaskTracker on your machine:
        ubuntu@ip-172-31-12-11:~$ start-all.sh
      Check the Java processes:
        ubuntu@ip-172-31-12-11:~$ jps
        11567 Jps
        10766 NameNode
        11099 JobTracker
        11221 TaskTracker
        10899 DataNode
        11018 SecondaryNameNode
        ubuntu@ip-172-31-12-11:~$
      You are now running Hadoop in pseudo-distributed mode.
  32. Hadoop is up! View HDFS using the web UI: http://54.68.149.232:50070/
  33. Stopping Hadoop
        ubuntu@ip-172-31-12-11:~$ /usr/local/hadoop/bin/stop-all.sh
        stopping jobtracker
        localhost: stopping tasktracker
        stopping namenode
        localhost: stopping datanode
        localhost: stopping secondarynamenode
  34. Hands-On: Importing Data to HDFS using Hadoop Command Line
  35. Importing Data to Hadoop. Download the full text of War and Peace: www.gutenberg.org/ebooks/2600
  36. Importing Data to Hadoop. Download the file pg2600.txt:
        $ wget https://dl.dropboxusercontent.com/u/12655380/pg2600.txt
      Import to Hadoop:
        $ hadoop fs -mkdir /input
        $ hadoop fs -mkdir /output
        $ hadoop fs -copyFromLocal pg2600.txt /input
  37. Hands-On: Reviewing, Retrieving, Deleting Data from HDFS
  38. Review file in Hadoop HDFS
      List HDFS File
      Read HDFS File:
        ubuntu@ip-172-31-12-11:~$ hadoop fs -cat /input/pg2600.txt
      Retrieve HDFS File to Local File System:
        ubuntu@ip-172-31-12-11:~$ hadoop fs -copyToLocal /input/pg2600.txt /tmp/file.txt
      Please see also http://hadoop.apache.org/docs/r1.0.4/commands_manual.html
  39. Review file in Hadoop HDFS using WebUI
  40. Hadoop Port Numbers

      | Group | Daemon            | Default Port | Configuration Parameter in conf/*-site.xml |
      |-------|-------------------|--------------|--------------------------------------------|
      | HDFS  | Namenode          | 50070        | dfs.http.address                           |
      | HDFS  | Datanodes         | 50075        | dfs.datanode.http.address                  |
      | HDFS  | Secondarynamenode | 50090        | dfs.secondary.http.address                 |
      | MR    | JobTracker        | 50030        | mapred.job.tracker.http.address            |
      | MR    | Tasktrackers      | 50060        | mapred.task.tracker.http.address           |
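The port table can be turned into web UI addresses mechanically. The following is a minimal sketch in plain Java, not part of the workshop material: the class name `HadoopWebUiSketch` and its methods are hypothetical, and only the port numbers come from the slide.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not a Hadoop API): builds web UI URLs from the
// default Hadoop 1.x ports listed on the "Hadoop Port Numbers" slide.
class HadoopWebUiSketch {

    // Daemon name -> default HTTP port, in the order used on the slide.
    static Map<String, Integer> defaultPorts() {
        Map<String, Integer> ports = new LinkedHashMap<>();
        ports.put("NameNode", 50070);
        ports.put("DataNode", 50075);
        ports.put("SecondaryNameNode", 50090);
        ports.put("JobTracker", 50030);
        ports.put("TaskTracker", 50060);
        return ports;
    }

    // Returns the web UI URL of one daemon on the given host.
    static String webUiUrl(String host, String daemon) {
        Integer port = defaultPorts().get(daemon);
        if (port == null) {
            throw new IllegalArgumentException("Unknown daemon: " + daemon);
        }
        return "http://" + host + ":" + port + "/";
    }

    public static void main(String[] args) {
        // The host is an assumption; pass the server's public IP as the first argument.
        String host = args.length > 0 ? args[0] : "localhost";
        for (String daemon : defaultPorts().keySet()) {
            System.out.println(daemon + " -> " + webUiUrl(host, daemon));
        }
    }
}
```

On EC2 these ports are reachable from a browser only if the instance's security group allows inbound traffic on them.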
  41. Review Content from System shell
  42. Removing data from HDFS using Shell Command
        [hdadmin@localhost detach]$ hadoop dfs -rm /input/input_test.txt
        Deleted hdfs://localhost:54310/input/input_test.txt
        [hdadmin@localhost detach]$
  43. Lecture: Understanding Map Reduce Processing. [Diagram: Client; Name Node and Job Tracker; three Data Node / Task Tracker pairs; Map and Reduce]
  44. High Level Architecture of MapReduce
  45. Before MapReduce…
        ● Large scale data processing was difficult!
          – Managing hundreds or thousands of processors
          – Managing parallelization and distribution
          – I/O Scheduling
          – Status and monitoring
          – Fault/crash tolerance
        ● MapReduce provides all of these, easily!
        Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html
  46. MapReduce Overview
        ● What is it?
          – Programming model used by Google
          – A combination of the Map and Reduce models with an associated implementation
          – Used for processing and generating large data sets
  47. MapReduce Overview
        ● How does it solve our previously mentioned problems?
          – MapReduce is highly scalable and can be used across many computers.
          – Many small machines can be used to process jobs that normally could not be processed by a large machine.
  48. MapReduce Framework. Source: www.bigdatauniversity.com
  49. How Map and Reduce Work Together
  50. How Map and Reduce Work Together
        ● Map returns information
        ● Reduce accepts information
        ● Reduce applies a user-defined function to reduce the amount of data
  51. Map Abstraction
        ● Inputs a key/value pair
          – Key is a reference to the input value
          – Value is the data set on which to operate
        ● Evaluation
          – Function defined by user
          – Applies to every value in value input
            ● Might need to parse input
        ● Produces a new list of key/value pairs
          – Can be different type from input pair
  52. Reduce Abstraction
        ● Starts with intermediate Key / Value pairs
        ● Ends with finalized Key / Value pairs
        ● Starting pairs are sorted by key
        ● Iterator supplies the values for a given key to the Reduce function.
  53. Reduce Abstraction
        ● Typically a function that:
          – Starts with a large number of key/value pairs
            ● One key/value for each word in all files being grepped (including multiple entries for the same word)
          – Ends with very few key/value pairs
            ● One key/value for each unique word across all the files, with the number of instances summed into this entry
        ● Broken up so a given worker works with input of the same key.
  54. Other Applications
        ● Yahoo!
          – Webmap application uses Hadoop to create a database of information on all known webpages
        ● Facebook
          – Hive data center uses Hadoop to provide business statistics to application developers and advertisers
        ● Rackspace
          – Analyzes server log files and usage data using Hadoop
  55. Why is this approach better?
        ● Creates an abstraction for dealing with complex overhead
          – The computations are simple, the overhead is messy
        ● Removing the overhead makes programs much smaller and thus easier to use
          – Less testing is required as well. The MapReduce libraries can be assumed to work properly, so only user code needs to be tested
        ● Division of labor is also handled by the MapReduce libraries, so programmers only need to focus on the actual computation
  56. MapReduce Framework
        map: (K1, V1) -> list(K2, V2)
        reduce: (K2, list(V2)) -> list(K3, V3)
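The two signatures can be written down as Java generic interfaces and exercised without a cluster. This is a minimal in-memory sketch, not Hadoop's API: `MapFn`, `ReduceFn`, `run` and `wordCount` are hypothetical names, and the character-based offsets only approximate the byte offsets Hadoop would pass as keys.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the map/reduce type signatures (no Hadoop involved).
class MapReduceSignatureSketch {

    // map: (K1, V1) -> list(K2, V2)
    interface MapFn<K1, V1, K2, V2> {
        List<Map.Entry<K2, V2>> map(K1 key, V1 value);
    }

    // reduce: (K2, list(V2)) -> list(K3, V3)
    interface ReduceFn<K2, V2, K3, V3> {
        List<Map.Entry<K3, V3>> reduce(K2 key, List<V2> values);
    }

    // Applies map to every record, groups the results by key (sorted), then reduces each group.
    static <K1, V1, K2 extends Comparable<K2>, V2, K3, V3> List<Map.Entry<K3, V3>> run(
            List<Map.Entry<K1, V1>> input,
            MapFn<K1, V1, K2, V2> mapper,
            ReduceFn<K2, V2, K3, V3> reducer) {
        Map<K2, List<V2>> groups = new TreeMap<>();
        for (Map.Entry<K1, V1> record : input) {
            for (Map.Entry<K2, V2> kv : mapper.map(record.getKey(), record.getValue())) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        List<Map.Entry<K3, V3>> output = new ArrayList<>();
        for (Map.Entry<K2, List<V2>> group : groups.entrySet()) {
            output.addAll(reducer.reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    // Word count expressed with the two functions: K1=offset, V1=line, K2=K3=word, V2=V3=count.
    static List<Map.Entry<String, Integer>> wordCount(List<String> lines) {
        List<Map.Entry<Long, String>> input = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            input.add(new SimpleEntry<Long, String>(offset, line));
            offset += line.length() + 1; // approximate offset: counts characters, not bytes
        }
        MapFn<Long, String, String, Integer> mapper = (key, line) -> {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.add(new SimpleEntry<String, Integer>(word, 1));
                }
            }
            return out;
        };
        ReduceFn<String, Integer, String, Integer> reducer = (word, counts) -> {
            int sum = 0;
            for (int count : counts) {
                sum += count;
            }
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            out.add(new SimpleEntry<String, Integer>(word, sum));
            return out;
        };
        return run(input, mapper, reducer);
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("deer bear river");
        lines.add("car car river");
        System.out.println(wordCount(lines));
    }
}
```

Keeping the grouping step inside `run` mirrors the division of labor described earlier: the user supplies only the two functions, and the framework handles everything between them.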
  57. How does MapReduce work? [Diagram labels: InputSplit; RecordReader; output as a list of (Key, Value) in the intermediate file; Partitioning; Sorting; output as a list of (Key, List of Values) in the intermediate file; RecordWriter]
  58. How does MapReduce work? [Diagram labels: InputSplit; RecordReader; output as a list of (Key, Value) in the intermediate file; Combining: (Car, 2); Partitioning; Sorting; output as a list of (Key, List of Values) in the intermediate file: Bear, {1,1}; Car, {2,1}; River, {1,1}; Deer, {1,1}; RecordWriter]
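The grouped values on this slide (Car, {2,1} and so on) can be reproduced with a small plain-Java simulation of combining followed by shuffle and sort. This is only a sketch: the three input splits are an assumption chosen to yield the groups shown in the diagram, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simulates map + combine per split, then shuffle and sort across splits (no Hadoop involved).
class ShuffleCombineSketch {

    // Map and combine for one split: emit (word, 1) per token and pre-sum locally,
    // so two occurrences of "Car" in the same split become a single (Car, 2).
    static Map<String, Integer> mapAndCombine(String split) {
        Map<String, Integer> combined = new TreeMap<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                combined.merge(word, 1, Integer::sum);
            }
        }
        return combined;
    }

    // Shuffle and sort: gather the combined values of every split under their key,
    // with keys in sorted order, as the reducer would receive them.
    static Map<String, List<Integer>> shuffle(List<String> splits) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String split : splits) {
            for (Map.Entry<String, Integer> kv : mapAndCombine(split).entrySet()) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Assumed input, one string per split.
        List<String> splits = Arrays.asList("Deer Bear River", "Car Car River", "Deer Car Bear");
        System.out.println(shuffle(splits));
        // prints {Bear=[1, 1], Car=[2, 1], Deer=[1, 1], River=[1, 1]}
    }
}
```

Without the combine step the reducer would receive Car, {1,1,1} instead; the final sum is the same, but more intermediate data crosses the network.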
  59. MapReduce Processing – The Data flow
        1. InputFormat, InputSplits, RecordReader
        2. Mapper - your focus is here
        3. Partition, Shuffle & Sort
        4. Reducer - your focus is here
        5. OutputFormat, RecordWriter
  60. InputFormat

      | InputFormat             | Description                                      | Key                                      | Value                     |
      |-------------------------|--------------------------------------------------|------------------------------------------|---------------------------|
      | TextInputFormat         | Default format; reads lines of text files        | The byte offset of the line              | The line contents         |
      | KeyValueTextInputFormat | Parses lines into key, val pairs                 | Everything up to the first tab character | The remainder of the line |
      | SequenceFileInputFormat | A Hadoop-specific high-performance binary format | user-defined                             | user-defined              |
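The key/value rule for tab-separated lines is easy to show in isolation. The sketch below is plain Java rather than the Hadoop class itself; `KeyValueLineSketch.parse` is a hypothetical helper, and treating a line without a tab as a key with an empty value is an assumption about how such lines are handled.

```java
// Splits a line at the first tab: everything before it is the key, the rest is the value.
class KeyValueLineSketch {

    // Returns a two-element array {key, value}.
    static String[] parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            // Assumption: a line without a tab becomes a key with an empty value.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    public static void main(String[] args) {
        String[] kv = parse("Car\t3\textra");
        // Only the first tab separates key and value; later tabs stay in the value.
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
    }
}
```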
  61. InputSplit: An InputSplit describes a unit of work that comprises a single map task, and presents a byte-oriented view of the input. You can control the split size by setting the mapred.min.split.size parameter in core-site.xml, or by overriding the parameter in the JobConf object used to submit a particular MapReduce job.
      RecordReader: A RecordReader reads <key, value> pairs from an InputSplit. Typically it converts the byte-oriented view of the input provided by the InputSplit into a record-oriented view for the Mapper.
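To make the byte-oriented versus record-oriented distinction concrete, here is a minimal plain-Java sketch of what a line-based record reader does: it turns the raw bytes of one split into (byte offset, line) records. It does not use Hadoop's classes; `LineRecordSketch` and `Record` are hypothetical names, and the sketch ignores lines that straddle split boundaries, which a real reader has to handle.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Converts a byte-oriented split into record-oriented (offset, line) pairs.
class LineRecordSketch {

    static final class Record {
        final long offset; // byte offset of the line within the split (the key)
        final String line; // line contents without the trailing newline (the value)

        Record(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }

        @Override
        public String toString() {
            return "(" + offset + ", \"" + line + "\")";
        }
    }

    static List<Record> readRecords(byte[] split) {
        List<Record> records = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= split.length; i++) {
            boolean atEnd = (i == split.length);
            if (atEnd || split[i] == '\n') {
                // Emit a record at each newline, plus a final unterminated line if one exists.
                if (!atEnd || i > start) {
                    records.add(new Record(start,
                            new String(split, start, i - start, StandardCharsets.UTF_8)));
                }
                start = i + 1;
            }
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] split = "Deer Bear River\nCar Car River\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readRecords(split));
    }
}
```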
  62. Mapper: The Mapper applies the user-defined logic to each input (key, value) pair and emits (key, value) pair(s), which are forwarded to the Reducers.
      Partition, Shuffle & Sort: After the first map tasks have completed, the nodes may still be performing several more map tasks each, but they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling. The Partitioner controls which reduce task each map output is assigned to; the total number of partitions is the same as the number of reduce tasks for the job. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.
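A hash-based partitioning rule is one common way to assign keys to reduce tasks. The sketch below shows that rule in plain Java as an illustration, not as the workshop's configuration; the class name `HashPartitionSketch` is hypothetical.

```java
// Assigns each key to one of numReduceTasks partitions using its hash code.
class HashPartitionSketch {

    // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3; // assumed number of reduce tasks for the example
        for (String key : new String[] { "Bear", "Car", "Deer", "River" }) {
            System.out.println(key + " -> partition " + partitionFor(key, numReduceTasks));
        }
    }
}
```

Because the partition depends only on the key, every value for a given key reaches the same reducer, which is what allows the reducer to see the complete list of values for that key.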
  63. Reducer: An instance of user-provided code that reads each key, together with the iterator over that key's values, in the partition assigned to it. The OutputCollector object in the Reducer phase has a method named collect(), which collects a (key, value) output.
      OutputFormat, RecordWriter: The OutputFormat governs the format in which the collected output is written, and the RecordWriter writes the output into HDFS.

      | OutputFormat             | Description                                                          |
      |--------------------------|----------------------------------------------------------------------|
      | TextOutputFormat         | Default; writes lines in "key \t value" form                         |
      | SequenceFileOutputFormat | Writes binary files suitable for reading into subsequent MapReduce jobs |
      | NullOutputFormat         | Generates no output files                                            |
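The default text output layout, one tab-separated key and value per line, can be sketched in plain Java as follows. This is an illustration of the layout only, not Hadoop's writer; `TextOutputLineSketch` and its methods are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Formats (key, value) results as tab-separated lines, one pair per line.
class TextOutputLineSketch {

    static String formatLine(String key, int value) {
        return key + "\t" + value;
    }

    static String formatAll(Map<String, Integer> results) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Integer> entry : results.entrySet()) {
            out.append(formatLine(entry.getKey(), entry.getValue())).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("Bear", 2);
        counts.put("Car", 3);
        System.out.print(formatAll(counts));
    }
}
```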
  64. Hands-On: Writing your own Map Reduce Program
  65. Wordcount (HelloWorld in Hadoop)

        package org.myorg;

        import java.io.IOException;
        import java.util.*;

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.conf.*;
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.mapred.*;
        import org.apache.hadoop.util.*;

        public class WordCount {

            // Mapper: emits (word, 1) for every token in the input line.
            public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
                private final static IntWritable one = new IntWritable(1);
                private Text word = new Text();

                public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
                    String line = value.toString();
                    StringTokenizer tokenizer = new StringTokenizer(line);
                    while (tokenizer.hasMoreTokens()) {
                        word.set(tokenizer.nextToken());
                        output.collect(word, one);
                    }
                }
            }

  66. Wordcount (HelloWorld in Hadoop), continued

            // Reducer: sums the counts collected for each word.
            public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
                public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
                    int sum = 0;
                    while (values.hasNext()) {
                        sum += values.next().get();
                    }
                    output.collect(key, new IntWritable(sum));
                }
            }

  67. Wordcount (HelloWorld in Hadoop), continued

            // Driver: configures the job and submits it.
            public static void main(String[] args) throws Exception {
                JobConf conf = new JobConf(WordCount.class);
                conf.setJobName("wordcount");

                conf.setOutputKeyClass(Text.class);
                conf.setOutputValueClass(IntWritable.class);

                conf.setMapperClass(Map.class);
                conf.setReducerClass(Reduce.class);

                conf.setInputFormat(TextInputFormat.class);
                conf.setOutputFormat(TextOutputFormat.class);

                // args[0] is the HDFS input path, args[1] the HDFS output path.
                FileInputFormat.setInputPaths(conf, new Path(args[0]));
                FileOutputFormat.setOutputPath(conf, new Path(args[1]));

                JobClient.runJob(conf);
            }
        }
  68. Hands-On: Packaging Map Reduce and Deploying to Hadoop Runtime Environment
  69. Packaging Map Reduce Program. Usage: assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is the Hadoop version installed, compile WordCount.java and create a jar. The hduser directory receives the compiled classes, so run javac from the directory that contains WordCount.java:
        $ wget https://dl.dropboxusercontent.com/u/12655380/WordCount.java
        $ mkdir hduser
        $ javac -classpath /usr/local/hadoop/hadoop-core-1.2.1.jar -d hduser WordCount.java
        $ jar -cvf ./wordcount.jar -C hduser/ .
        $ hadoop jar ./wordcount.jar org.myorg.WordCount /input/* /output/wordcount_output_dir
      Output: …….
        $ hadoop fs -cat /output/wordcount_output_dir/part-00000
  70. Reviewing MapReduce Output Result
  76. Hands-On: Writing Map/Reduce Program on Eclipse
  77. Starting Eclipse
  78. Create a Java Project. Let's name it HadoopWordCount
  79. Add dependencies to the project
        ● Add the following two JARs to your build path: hadoop-common.jar and hadoop-mapreduce-client-core.jar. Both can be found at /usr/lib/hadoop/client
        ● Do this by performing the following steps:
          – Add a folder named lib to the project
          – Copy the two JARs into this folder
          – Right-click on the project name >> select Build Path >> then Configure Build Path
          – Click on Add Jars, then select these two JARs from the lib folder
  81. Writing the source code
        ● Right-click the project, then select New >> Package
        ● Name the package org.myorg
        ● Right-click org.myorg, then select New >> Class
        ● Name the class WordCount
        ● Write the source code as shown in the previous slides
  83. Building a Jar file
        ● Right-click the project, then select Export
        ● Select Java and then JAR file
        ● Provide the JAR name, wordcount.jar
        ● Leave the JAR package options as default
        ● In the JAR Manifest Specification section, at the bottom, specify the Main class
        ● In this case, select WordCount
        ● Click on Finish
        ● The JAR file will be built and placed in cloudera/workspace
        Note: you may need to resize the dialog font by selecting Windows >> Preferences >> Appearance >> Colors and Fonts
  84. Lecture: Understanding Hive
  85. Introduction: A Petabyte Scale Data Warehouse Using Hadoop. Hive was originally developed at Facebook and is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL
  86. What Hive is NOT. Hive is not designed for online transaction processing and does not offer real-time queries and row level updates. It is best used for batch jobs over large sets of immutable data (like web logs, etc.).
  87. Hive Metastore
        ● Stores Hive metadata
        ● Configurations
          – Embedded: in-process metastore, in-process database
          – Local: in-process metastore, out-of-process database
          – Remote: out-of-process metastore, out-of-process database
  88. Hive Schema-On-Read
        ● Faster loads into the database (simply copy or move)
        ● Slower queries
        ● Flexibility: multiple schemas for the same data
  89. HiveQL
        ● Hive Query Language
        ● SQL dialect
        ● No support for:
          – UPDATE, DELETE
          – Transactions
          – Indexes
          – HAVING clause in SELECT
          – Updateable or materialized views
          – Stored procedures
  90. Hive Tables
        ● Managed: CREATE TABLE
          – LOAD: file moved into Hive's data warehouse directory
          – DROP: both data and metadata are deleted
        ● External: CREATE EXTERNAL TABLE
          – LOAD: no file moved
          – DROP: only metadata deleted
          – Use when sharing data between Hive and Hadoop applications, or when you want to use multiple schemas on the same data
  91. Running Hive: Hive Shell
        ● Interactive: hive
        ● Script: hive -f myscript
        ● Inline: hive -e 'SELECT * FROM mytable'
        Source: hive.apache.org
  92. System Architecture and Components
        • Metastore: stores the metadata.
        • Query compiler and execution engine: converts SQL queries to a sequence of map/reduce jobs that are then executed on Hadoop.
        • SerDe and ObjectInspectors: programmable interfaces and implementations of common data formats and types. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record and translates it into a Java object that Hive can manipulate. The Serializer does the reverse: it takes a Java object that Hive has been working with and turns it into something that Hive can write to HDFS or another supported system.
        • UDF and UDAF: programmable interfaces and implementations for user defined functions (scalar and aggregate functions).
        • Clients: command line client similar to the MySQL command line.
        Source: hive.apache.org
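The serializer/deserializer pairing can be illustrated with a tiny plain-Java round trip over a comma-delimited row with two columns, an integer id and a country string, matching the test_tbl table created later in this workshop. This is a conceptual sketch only: it does not implement Hive's SerDe interfaces, and `DelimitedSerDeSketch` and `Row` are hypothetical names.

```java
// Conceptual serializer/deserializer pair for a comma-delimited (id, country) row.
class DelimitedSerDeSketch {

    static final class Row {
        final int id;
        final String country;

        Row(int id, String country) {
            this.id = id;
            this.country = country;
        }
    }

    // Deserializer: text representation of a record -> Java object.
    static Row deserialize(String record) {
        String[] fields = record.split(",", 2);
        int id = Integer.parseInt(fields[0].trim());
        // A record without a second field yields a null country in this sketch.
        String country = fields.length > 1 ? fields[1] : null;
        return new Row(id, country);
    }

    // Serializer: Java object -> text representation that could be written to storage.
    static String serialize(Row row) {
        return row.id + "," + row.country;
    }

    public static void main(String[] args) {
        Row row = deserialize("1,Thailand");
        System.out.println(row.id + " / " + row.country);
        System.out.println(serialize(row));
    }
}
```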
  93. 93. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Architecture Overview HDFS Hive CLI QueriesBrowsing Map Reduce MetaStore Thrift API SerDe Thrift Jute JSON.. Execution Hive QL Parser Planner Mgmt. WebUI HDFS DDL Hive Hive.apache.org
  94. 94. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop Sample HiveQL The Query compiler uses the information stored in the metastore to convert SQL queries into a sequence of map/reduce jobs, e.g. the following query SELECT * FROM t where t.c = 'xyz' SELECT t1.c2 FROM t1 JOIN t2 ON (t1.c1 = t2.c1) SELECT t1.c1, count(1) from t1 group by t1.c1 Hive.apache.org
  95. Hands-On: Creating Table and Retrieving Data using Hive
  96. Hive Hands-On Labs
      1. Installing Hive
      2. Configuring / Starting Hive
      3. Creating Hive Table
      4. Reviewing Hive Table in HDFS
      5. Alter and Drop Hive Table
      6. Preparing Dataset
      7. Loading Data to Hive Table
      8. Querying Data from Hive Table
      9. Reviewing Hive Table Content from HDFS Command and WebUI
  97. 1. Installing Hive
      Install Hive binary file:
      $ wget http://apache.mesi.com.ar/hive/hive-1.1.0/apache-hive-1.1.0-bin.tar.gz
      $ tar -xvzf apache-hive-1.1.0-bin.tar.gz
      $ sudo mv apache-hive-1.1.0-bin /usr/local
      $ rm apache-hive-1.1.0-bin.tar.gz
  98. 1. Installing Hive
      Edit $HOME/.bashrc:
      $ sudo vi $HOME/.bashrc
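The lines added to .bashrc on this slide were shown as a screenshot and are not in the transcript. A plausible sketch, assuming Hive was moved to /usr/local/apache-hive-1.1.0-bin as in the install step, is:

```shell
# Assumed install location, matching the "mv" target in the install step.
export HIVE_HOME=/usr/local/apache-hive-1.1.0-bin
# Put the hive launcher on the PATH.
export PATH="$PATH:$HIVE_HOME/bin"
```

After saving the file, run `source $HOME/.bashrc` (or open a new shell) so the variables take effect.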
  99. 2. Configuring Hive
      Creating HDFS Directory for Hive
      Create the HDFS /tmp/hive and /user/hive/warehouse directories:
      [hdadmin@localhost ~]$ hadoop fs -mkdir /tmp/hive
      [hdadmin@localhost ~]$ hadoop fs -mkdir /user/hive/warehouse
      [hdadmin@localhost ~]$ hadoop fs -chmod 777 /tmp/hive
      [hdadmin@localhost ~]$ hadoop fs -chmod 777 /user/hive/warehouse
  100. 2. Start Hive
       Starting Hive:
       $ hive
       Quit from Hive:
       hive> quit;
  101. 3. Creating Hive Table
       hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
       OK
       Time taken: 4.069 seconds
       hive (default)> show tables;
       OK
       test_tbl
       Time taken: 0.138 seconds
       hive (default)> describe test_tbl;
       OK
       id int
       country string
       Time taken: 0.147 seconds
       hive (default)>
       See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html
  102. 4. Reviewing Hive Table in HDFS
       [hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse
       Found 1 items
       drwxr-xr-x - hdadmin supergroup 0 2013-03-17 17:51 /user/hive/warehouse/test_tbl
       [hdadmin@localhost hdadmin]$
       Review Hive Table from HDFS WebUI
  103. 5. Alter and Drop Hive Table
       hive (default)> alter table test_tbl add columns (remarks STRING);
       hive (default)> describe test_tbl;
       OK
       id int
       country string
       remarks string
       Time taken: 0.077 seconds
       hive (default)> drop table test_tbl;
       OK
       Time taken: 0.9 seconds
       See also: https://cwiki.apache.org/Hive/adminmanual-metastoreadmin.html
  104. 6. Preparing Large Dataset
       http://grouplens.org/datasets/movielens/
  105. MovieLens Dataset
       1) Type command > wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
       2) Type command > sudo apt-get install unzip
       3) Type command > unzip ml-100k.zip
       4) Type command > more ml-100k/u.user
  106. 6. Loading Data to Hive Table
       Loading data into HDFS (the target directory is created first so that u.user lands inside it; on Hadoop 2.x use -mkdir -p):
       hive (default)> exit;
       ubuntu@ip-172-31-12-11:~/ml-100k$ hadoop fs -mkdir /dataset/movielens/users
       ubuntu@ip-172-31-12-11:~/ml-100k$ hadoop fs -put u.user /dataset/movielens/users
       Creating Hive table:
       $ hive
       hive (default)> CREATE EXTERNAL TABLE users (userid INT, age INT, gender STRING, occupation STRING, zipcode STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/dataset/movielens/users';
  107. 7. Querying Data from Hive Table
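The queries on this slide were shown as a screenshot. As a rough sketch of what could be run against the users table (columns userid, age, gender, occupation, zipcode as defined in the CREATE EXTERNAL TABLE statement), the script below previews a few rows and counts users per occupation. The file name query_users.hql is a placeholder, and the script assumes the hive client is on the PATH.

```shell
# Write two sample queries to a script file (query_users.hql is a hypothetical name).
cat > query_users.hql <<'EOF'
-- First few rows of the external table.
SELECT * FROM users LIMIT 10;
-- Number of users per occupation, most common first.
SELECT occupation, COUNT(*) AS cnt
FROM users
GROUP BY occupation
ORDER BY cnt DESC;
EOF

# Run the script only if the hive client is installed.
if command -v hive >/dev/null 2>&1; then
  hive -f query_users.hql
else
  echo "hive not found on PATH; skipping"
fi
```

The first query reads the file directly, while the GROUP BY query is compiled into a map/reduce job, so it takes noticeably longer.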
  108. 8. Loading Data to test_tbl Table
       Creating Hive table:
       $ hive
       hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
       Loading data to Hive table:
       hive (default)> LOAD DATA LOCAL INPATH '/tmp/test_tbl_data.csv' INTO TABLE test_tbl;
       Copying data from file:/tmp/test_tbl_data.csv
       Copying file: file:/tmp/test_tbl_data.csv
       Loading data to table default.test_tbl
       OK
       Time taken: 0.241 seconds
       hive (default)>
  109. 9. Reviewing Hive Table Content from HDFS Command and WebUI
       [hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse/test_tbl
       Found 1 items
       -rw-r--r-- 1 hdadmin supergroup 59 2013-03-17 18:08 /user/hive/warehouse/test_tbl/test_tbl_data.csv
       [hdadmin@localhost hdadmin]$
       [hdadmin@localhost hdadmin]$ hadoop fs -cat /user/hive/warehouse/test_tbl/test_tbl_data.csv
       1,USA
       62,Indonesia
       63,Philippines
       65,Singapore
       66,Thailand
       [hdadmin@localhost hdadmin]$
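The LOAD DATA LOCAL INPATH step assumes that /tmp/test_tbl_data.csv already exists on the local file system, but the slides do not show how it was created. A minimal sketch that recreates it from the five rows printed by hadoop fs -cat:

```shell
# Recreate the five-row sample file used by LOAD DATA LOCAL INPATH.
cat > /tmp/test_tbl_data.csv <<'EOF'
1,USA
62,Indonesia
63,Philippines
65,Singapore
66,Thailand
EOF

# Print the size in bytes; it should match the 59 bytes in the HDFS listing.
wc -c < /tmp/test_tbl_data.csv
```

Create the file before running the LOAD DATA statement in step 8.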
  110. Loading Data to Hive Table
       Creating Hive table:
       $ hive
       hive (default)> CREATE TABLE products ( prod_name STRING, description STRING, category STRING, qty_on_hand INT, prod_num STRING, packaged_with ARRAY<STRING> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' STORED AS TEXTFILE;
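The slides do not include data for the products table. To illustrate how the two delimiters interact, the sketch below writes two made-up rows (the product names and the path /tmp/products.csv are invented for illustration): fields are separated by ',' and the items of the packaged_with array by ':'.

```shell
# Hypothetical sample rows: six comma-separated fields, with the last
# field (packaged_with) holding ':'-separated array items.
cat > /tmp/products.csv <<'EOF'
widget,small blue widget,hardware,10,P100,box:manual:screws
gadget,pocket gadget,electronics,3,P200,charger:case
EOF

# Show how the last field splits into array elements.
awk -F',' '{ n = split($6, items, ":"); print $1 " -> " n " items" }' /tmp/products.csv
```

Such a file could then be loaded with LOAD DATA LOCAL INPATH '/tmp/products.csv' INTO TABLE products; and individual array elements read with an index, for example SELECT prod_name, packaged_with[0] FROM products;.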
  111. Lecture: Understanding Pig
  112. Introduction
       A high-level platform for creating MapReduce programs using Hadoop
       Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
  113. Pig Components
       ● Two Components
         ● Language (Pig Latin)
         ● Compiler
       ● Two Execution Environments
         ● Local: pig -x local
         ● Distributed: pig -x mapreduce
  114. Running Pig
       ● Script: pig myscript
       ● Command line (Grunt): pig
       ● Embedded: writing a Java program
  115. Pig Latin
  116. Pig Execution Stages
       Source: Introduction to Apache Hadoop-Pig, Prashant Kommireddi
  117. Why Pig?
       ● Makes writing Hadoop jobs easier
         ● 5% of the code, 5% of the time
         ● You don't need to be a programmer to write Pig scripts
       ● Provides the major functionality required for data warehousing and analytics
         ● Load, Filter, Join, Group By, Order, Transform
       ● Users can write custom UDFs (User Defined Functions)
       Source: Introduction to Apache Hadoop-Pig, Prashant Kommireddi
  118. Pig vs. Hive
  119. Hands-On: Running a Pig script
  120. Installing Pig
       Install Pig binary file:
       $ wget http://archive.apache.org/dist/hadoop/pig/stable/pig-0.7.0.tar.gz
       $ tar -xvzf pig-0.7.0.tar.gz
       $ sudo mv pig-0.7.0 /usr/local/
       $ rm pig-0.7.0.tar.gz
  121. Installing Pig
       Edit $HOME/.bashrc:
       $ sudo vi $HOME/.bashrc
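The .bashrc additions for Pig were also shown only as a screenshot. A likely sketch, assuming Pig was moved to /usr/local/pig-0.7.0 as in the install step and that JAVA_HOME is already set from the Hadoop installation:

```shell
# Assumed install location, matching the "mv" target in the install step.
export PIG_HOME=/usr/local/pig-0.7.0
# Put the pig launcher on the PATH.
export PATH="$PATH:$PIG_HOME/bin"
```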
  122. Starting Pig Command Line
  123. Writing a Pig Script
       #Preparing Data
       ubuntu@ip-172-31-12-11:~$ wget https://www.dropbox.com/s/pp168a6oiwqkxyu/hdi-data.csv
       #Edit Your Script
       ubuntu@ip-172-31-12-11:~$ vi countryFilter.pig
       countryFilter.pig:
       A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int);
       B = FILTER A BY gni > 2000;
       C = ORDER B BY gni;
       dump C;
  124. Running a Pig Script
       ubuntu@ip-172-31-12-11:~$ pig -x local
       grunt> run countryFilter.pig
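The script can also be run in batch mode, without entering Grunt, by combining the "pig myscript" form from the Running Pig slide with the local execution flag. The sketch below writes countryFilter.pig exactly as shown in the workshop and runs it only when both pig and hdi-data.csv are available in the current directory.

```shell
# Write the workshop script to a file instead of typing it in vi.
cat > countryFilter.pig <<'EOF'
A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int);
B = FILTER A BY gni > 2000;
C = ORDER B BY gni;
dump C;
EOF

# Run in local mode without the Grunt shell, if pig and the data file are present.
if command -v pig >/dev/null 2>&1 && [ -f hdi-data.csv ]; then
  pig -x local countryFilter.pig
else
  echo "pig or hdi-data.csv not found; skipping"
fi
```

Batch mode is convenient when the script is part of a larger shell workflow, while Grunt is better suited to trying statements one at a time.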
  125. Lecture: Understanding Sqoop
  126. Introduction
       Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:
       • Imports individual tables or entire databases to files in HDFS
       • Generates Java classes to allow you to interact with your imported data
       • Provides the ability to import from SQL databases straight into your Hive data warehouse
       See also: http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html
  127. Architecture Overview
  128. Hands-On: Loading Data from DBMS to Hadoop HDFS
  129. Sqoop Hands-On Labs
       1. Loading Data into MySQL DB
       2. Installing Sqoop
       3. Configuring Sqoop
       4. Installing DB driver for Sqoop
       5. Importing data from MySQL to Hive Table
       6. Reviewing data from Hive Table
       7. Reviewing HDFS Database Table files
  130. 1. MySQL RDS Server on AWS
       An RDS server is running on AWS with the following configuration:
       > database: imc_db
       > username: admin
       > password: imcinstitute
       > addr: imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com [This address may change]
  131. 1. country_tbl data
       Testing data query from MySQL DB
       Table name > country_tbl
  132. 2. Installing Sqoop
       $ wget http://apache.osuosl.org/sqoop/1.4.5/sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz
       $ tar -xvzf sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz
       $ sudo mv sqoop-1.4.5.bin__hadoop-1.0.0 /usr/local/
       $ rm sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz
  133. Installing Sqoop
       Edit $HOME/.bashrc:
       $ sudo vi $HOME/.bashrc
  134. 3. Configuring Sqoop
       ubuntu@ip-172-31-12-11:~$ cd /usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/conf/
       ubuntu@ip-172-31-12-11:/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/conf$ vi sqoop-env.sh
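The contents of sqoop-env.sh were shown as a screenshot. The variable names below come from the sqoop-env-template.sh file shipped in the same conf directory. The values are assumptions: /usr/local/hadoop is a guess at where Hadoop was installed earlier in the workshop, and the Hive path matches the Hive install step.

```shell
# Assumed Hadoop location; adjust to wherever Hadoop was actually installed.
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
# Hive location from the Hive installation step (needed for --hive-import).
export HIVE_HOME=/usr/local/apache-hive-1.1.0-bin
```

If sqoop-env.sh does not exist yet, it can be created by copying sqoop-env-template.sh.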
  135. 4. Installing DB driver for Sqoop
       ubuntu@ip-172-31-12-11:~$ cd /usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib/
       ubuntu@ip-172-31-12-11:/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib$ wget https://www.dropbox.com/s/6zrp5nerrwfixcj/mysql-connector-java-5.1.23-bin.jar
       ubuntu@ip-172-31-12-11:/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib$ exit
  136. 5. Importing data from MySQL to Hive Table
       [hdadmin@localhost ~]$ sqoop import --connect jdbc:mysql://imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com/imc_db --username admin -P --table country_tbl --hive-import --hive-table country -m 1
       Warning: /usr/lib/hbase does not exist! HBase imports will fail.
       Please set $HBASE_HOME to the root of your HBase installation.
       Warning: $HADOOP_HOME is deprecated.
       Enter password: <enter here>
  137. 6. Reviewing data from Hive Table
  138. 7. Reviewing HDFS Database Table files
       Open a web browser at http://54.68.149.232:50070, then navigate to /user/hive/warehouse
  139. 7. Reviewing HDFS Database Table files
  140. Lecture: Understanding HBase
  141. Introduction
       An open source, non-relational, distributed database
       HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS, providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.
  142. HBase Features
       ● Hadoop database modelled after Google's Bigtable
       ● Column-oriented data store, known as the Hadoop Database
       ● Supports random, real-time CRUD operations (unlike HDFS)
       ● NoSQL database
       ● Open source, written in Java
       ● Runs on a cluster of commodity hardware
  143. When to use HBase?
       ● When you need to store a high volume of data
       ● Unstructured data
       ● Sparse data
       ● Column-oriented data
       ● Versioned data (same data template captured at various times; time-lapse data)
       ● When you need high scalability
  144. Which one to use?
       ● HDFS
         ● Append-only dataset (no random write)
         ● Read the whole dataset (no random read)
       ● HBase
         ● Need random write and/or read
         ● Thousands of operations per second on TB+ of data
       ● RDBMS
         ● Data fits on one big node
         ● Need full transaction support
         ● Need real-time query capabilities
  147. HBase Components
       ● Region
         ● Where the rows of a table are stored
       ● Region Server
         ● Hosts the tables
       ● Master
         ● Coordinates the Region Servers
       ● ZooKeeper
       ● HDFS
       ● API
         ● The Java Client API
  148. HBase Architecture
  149. HBase Shell Commands
  150. Hands-On: Running HBase
  151. Installing HBase
       $ wget http://apache.cs.utah.edu/hbase/hbase-1.0.0/hbase-1.0.0-bin.tar.gz
       $ tar -xvzf hbase-1.0.0-bin.tar.gz
       $ sudo mv hbase-1.0.0 /usr/local/
       $ rm hbase-1.0.0-bin.tar.gz
  152. Installing HBase
       Edit $HOME/.bashrc:
       $ sudo vi $HOME/.bashrc
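As with the other tools, the .bashrc lines for HBase are missing from the transcript. A probable sketch, assuming HBase was moved to /usr/local/hbase-1.0.0 as in the install step:

```shell
# Assumed install location, matching the "mv" target in the install step.
export HBASE_HOME=/usr/local/hbase-1.0.0
# Put start-hbase.sh and the hbase launcher on the PATH.
export PATH="$PATH:$HBASE_HOME/bin"
```

Setting HBASE_HOME should also stop the "/usr/lib/hbase does not exist" warning that Sqoop prints when the variable is unset.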
  153. Starting HBase shell
       ubuntu@ip-172-31-12-11:~$ start-hbase.sh
       starting master, logging to /usr/local/hbase-0.94.10/logs/hbase-hdadmin-master-localhost.localdomain.out
       ubuntu@ip-172-31-12-11:~$ jps
       3064 TaskTracker
       2836 SecondaryNameNode
       2588 NameNode
       3513 Jps
       3327 HMaster
       2938 JobTracker
       2707 DataNode
       ubuntu@ip-172-31-12-11:~$ hbase shell
       HBase Shell; enter 'help<RETURN>' for list of supported commands.
       Type "exit<RETURN>" to leave the HBase Shell
       Version 0.94.10, r1504995, Fri Jul 19 20:24:16 UTC 2013
       hbase(main):001:0>
  154. Create a table and insert data in HBase
       hbase(main):009:0> create 'test', 'cf'
       0 row(s) in 1.0830 seconds
       hbase(main):010:0> put 'test', 'row1', 'cf:a', 'val1'
       0 row(s) in 0.0750 seconds
       hbase(main):011:0> scan 'test'
       ROW COLUMN+CELL
       row1 column=cf:a, timestamp=1375363287644, value=val1
       1 row(s) in 0.0640 seconds
       hbase(main):002:0> get 'test', 'row1'
       COLUMN CELL
       cf:a timestamp=1375363287644, value=val1
       1 row(s) in 0.0370 seconds
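When the 'test' table is no longer needed it can be removed. HBase requires a table to be disabled before it is dropped. The sketch below puts the two commands into a script file (the name hbase_cleanup.txt is a placeholder) and passes it to hbase shell, assuming the hbase launcher is on the PATH and HBase is running.

```shell
# Write the cleanup commands to a script file (hypothetical file name).
cat > hbase_cleanup.txt <<'EOF'
disable 'test'
drop 'test'
exit
EOF

# Run the commands non-interactively if the hbase launcher is installed.
if command -v hbase >/dev/null 2>&1; then
  hbase shell hbase_cleanup.txt
else
  echo "hbase not found on PATH; skipping"
fi
```

The same two commands can be typed at the hbase(main) prompt instead.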
  155. Recommendation to Further Study
  156. Thank you
       www.imcinstitute.com
       www.facebook.com/imcinstitute
