Big Data on Public Cloud 
Using Cloudera on GoGrid 
& Amazon EMR 
Hands On Workshop 
October 2014 
Dr.Thanachart Numnonda 
IMC Institute 
thanachart@imcinstitute.com 
Modifiy from Original Version by Danairat T. 
Certified Java Programmer, TOGAF – Silver 
danairat@gmail.com 
Danairat T., 2013, Big Data Hadoop – Hands On Workshop 1 danairat@gmail.com
Hands-On: Running Hadoop 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Running this lab using Cloudera Live 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Using Cloudera Live on GoGrid 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Alternatively Using Cloudera VM 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Starting Hue on Cloudera 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Viewing HDFS 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Hands-On: Importing/Exporting 
Data to HDFS 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Importing Data to Hadoop 
Download War and Peace Full Text 
www.gutenberg.org/ebooks/2600 
$hadoop fs -mkdir input 
$hadoop fs -mkdir output 
$hadoop fs -copyFromLocal Downloads/pg2600.txt input 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Review file in Hadoop HDFS 
List HDFS File 
Read HDFS File 
[hdadmin@localhost bin]$ hadoop fs -cat input/pg2600.txt 
Retrieve HDFS File to Local File System 
[hdadmin@localhost bin]$ hadoop fs -copyToLocal input/pg2600.txt tmp/file.txt 
Please see also http://hadoop.apache.org/docs/r1.0.4/commands_manual.html 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Review file in Hadoop HDFS using 
File Browse 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Review file in Hadoop HDFS using Hue 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Hadoop Port Numbers 
Daemon Default 
Port 
Configuration Parameter in 
conf/*-site.xml 
HDFS Namenode 50070 dfs.http.address 
Datanodes 50075 dfs.datanode.http.address 
Secondarynamenode 50090 dfs.secondary.http.address 
MR JobTracker 50030 mapred.job.tracker.http.addre 
ss 
Tasktrackers 50060 mapred.task.tracker.http.addr 
ess 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Removing data from HDFS using 
Shell Command 
hdadmin@localhost detach]$ hadoop fs -rm input/pg2600.txt 
Deleted hdfs://localhost:54310/input/pg2600.txt 
hdadmin@localhost detach]$ 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Client 
Map Reduce 
Name Node Job Tracker 
Data Node 
Task Tracker 
Data Node 
Task Tracker 
Data Node 
Task Tracker 
Lecture: Understanding Map Reduce 
Processing 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
High Level Architecture of MapReduce 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Before MapReduce… 
● Large scale data processing was difficult! 
– Managing hundreds or thousands of processors 
– Managing parallelization and distribution 
– I/O Scheduling 
– Status and monitoring 
– Fault/crash tolerance 
● MapReduce provides all of these, easily! 
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html 
Danairat T., 2013, Big Data Hadoop – Hands On Workshop 18 danairat@gmail.com
MapReduce Overview 
● What is it? 
– Programming model used by Google 
– A combination of the Map and Reduce models with an 
associated implementation 
– Used for processing and generating large data sets 
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html 
Danairat T., 2013, Big Data Hadoop – Hands On Workshop 19 danairat@gmail.com
MapReduce Overview 
● How does it solve our previously mentioned problems? 
– MapReduce is highly scalable and can be used across many 
computers. 
– Many small machines can be used to process jobs that 
normally could not be processed by a large machine. 
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html 
Danairat T., 2013, Big Data Hadoop – Hands On Workshop 20 danairat@gmail.com
MapReduce Framework 
Source: www.bigdatauniversity.com 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
How does the MapReduce work? 
Partitioning 
Sorting 
RecordWriter 
Output in a list of (Key, List of Values) 
in the intermediate file 
InputSplit 
RecordReader 
Output in a list of (Key, Value) 
in the intermediate file 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
How does the MapReduce work? 
Partitioning 
Sorting 
RecordWriter 
Output in a list of (Key, List of Values) 
in the intermediate file 
InputSplit 
RecordReader 
Output in a list of (Key, Value) 
in the intermediate file 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Map Abstraction 
● Inputs a key/value pair 
– Key is a reference to the input value 
– Value is the data set on which to operate 
● Evaluation 
– Function defined by user 
– Applies to every value in value input 
● Might need to parse input 
● Produces a new list of key/value pairs 
– Can be different type from input pair 
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html 
Danairat T., 2013, Big Data Hadoop – Hands On Workshop 24 danairat@gmail.com
Reduce Abstraction 
● Starts with intermediate Key / Value pairs 
● Ends with finalized Key / Value pairs 
● Starting pairs are sorted by key 
● Iterator supplies the values for a given key to the 
Reduce function. 
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html 
Danairat T., 2013, Big Data Hadoop – Hands On Workshop 25 danairat@gmail.com
Reduce Abstraction 
● Typically a function that: 
– Starts with a large number of key/value pairs 
● One key/value for each word in all files being greped 
(including multiple entries for the same word) 
– Ends with very few key/value pairs 
● One key/value for each unique word across all the files with 
the number of instances summed into this entry 
● Broken up so a given worker works with input of the 
same key. 
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html 
Danairat T., 2013, Big Data Hadoop – Hands On Workshop 26 danairat@gmail.com
How Map and Reduce Work Together 
● Map returns information 
● Reduces accepts information 
● Reduce applies a user defined function to reduce the 
amount of data 
Danairat T., 2013, Big Data Hadoop – Hands On Workshop 27 danairat@gmail.com
Other Applications 
● Yahoo! 
– Webmap application uses Hadoop to create a database of 
information on all known webpages 
● Facebook 
– Hive data center uses Hadoop to provide business statistics to 
application developers and advertisers 
● Rackspace 
– Analyzes sever log files and usage data using Hadoop 
Danairat T., 2013, Big Data Hadoop – Hands On Workshop 28 danairat@gmail.com
MapReduce Framework 
map: (K1, V1) -> list(K2, V2)) 
reduce: (K2, list(V2)) -> list(K3, V3) 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Hands-On: Writing you own Map 
Reduce Program 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Wordcount (HelloWord in Hadoop) 
1. package org.myorg; 
2. 
3. import java.io.IOException; 
4. import java.util.*; 
5. 
6. import org.apache.hadoop.fs.Path; 
7. import org.apache.hadoop.conf.*; 
8. import org.apache.hadoop.io.*; 
9. import org.apache.hadoop.mapred.*; 
10. import org.apache.hadoop.util.*; 
11. 
12. public class WordCount { 
13. 
14. public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, 
IntWritable> { 
15. private final static IntWritable one = new IntWritable(1); 
16. private Text word = new Text(); 
17. 
18. public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter 
reporter) throws IOException { 
19. String line = value.toString(); 
20. StringTokenizer tokenizer = new StringTokenizer(line); 
21. while (tokenizer.hasMoreTokens()) { 
22. word.set(tokenizer.nextToken()); 
23. output.collect(word, one); 
24. } 
25. } 
26. } 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Wordcount (HelloWord in Hadoop) 
27. 
28. public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, 
IntWritable> { 
29. public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> 
output, Reporter reporter) throws IOException { 
30. int sum = 0; 
31. while (values.hasNext()) { 
32. sum += values.next().get(); 
33. } 
34. output.collect(key, new IntWritable(sum)); 
35. } 
36. } 
37. 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Wordcount (HelloWord in Hadoop) 
38. public static void main(String[] args) throws Exception { 
39. JobConf conf = new JobConf(WordCount.class); 
40. conf.setJobName("wordcount"); 
41. 
42. conf.setOutputKeyClass(Text.class); 
43. conf.setOutputValueClass(IntWritable.class); 
44. 
45. conf.setMapperClass(Map.class); 
46. 
47. conf.setReducerClass(Reduce.class); 
48. 
49. conf.setInputFormat(TextInputFormat.class); 
50. conf.setOutputFormat(TextOutputFormat.class); 
51. 
52. FileInputFormat.setInputPaths(conf, new Path(args[0])); 
53. FileOutputFormat.setOutputPath(conf, new Path(args[1])); 
54. 
55. JobClient.runJob(conf); 
57. } 
58. } 
59. 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Hands-On: Packaging Map Reduce 
and Deploying to Hadoop Runtime 
Environment 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Packaging Map Reduce Program 
Usage 
Assuming we have already downloaded wordcount.jar into the folder /Downloads: 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Reviewing MapReduce Job in Hue 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Reviewing MapReduce Job in Hue 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Reviewing MapReduce Output Result 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Reviewing MapReduce Output Result 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Reviewing MapReduce Output Result 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Running MapReduce Job Using Oozie 
: Select Java 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Hands-On: Running WordCount.jar 
on Amazon EMR 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Architecture Overview of Amazon EMR 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Creating an AWS account 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Signing up for the necessary services 
● Simple Storage Service (S3) 
● Elastic Compute Cloud (EC2) 
● Elastic MapReduce (EMR) 
Caution! This costs real money! 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Creating Amazon S3 bucket 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Upload .jar file and input file to 
Amazon S3 
1. Select <yourbucket> in Amazon S3 service 
2. Create folder : applications 
3. Upload wordcount.jar to the applications folder 
4. Create another folder: input 
5. Upload input_test.txt to the input folder 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Create a new cluster in EMR 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Assign a cluster name and log location 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Assign hardware configuration 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Select Custom JAR 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Input JAR Location and Arguments 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Select Create Cluster 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Monitoring the cluster 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
View Result from the S3 bucket 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Lecture 
Understanding Hive 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Introduction 
A Petabyte Scale Data Warehouse Using Hadoop 
Hive is developed by Facebook, designed to enable easy data 
summarization, ad-hoc querying and analysis of large 
volumes of data. It provides a simple query language called 
Hive QL, which is based on SQL 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
What Hive is NOT 
Hive is not designed for online transaction processing and 
does not offer real-time queries and row level updates. It is 
best used for batch jobs over large sets of immutable data 
(like web logs, etc.). 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Hive Metastore 
● Store Hive metadata 
● Configurations 
– Embedded: in-process metastore, in-process database 
– Local: in-process metastore, out-of-process database 
– Remote: out-of-process metastore,out-of-process database 
Danairat T., 2013, Big Data Hadoop – Hands On Workshop 59 danairat@gmail.com
Hive Schema-On-Read 
● Faster loads into the database (simply copy or move) 
● Slower queries 
● Flexibility – multiple schemas for the same data 
Danairat T., 2013, Big Data Hadoop – Hands On Workshop 60 danairat@gmail.com
HiveQL 
● Hive Query Language 
● SQL dialect 
● No support for: 
– UPDATE, DELETE 
– Transactions 
– Indexes 
– HAVING clause in SELECT 
– Updateable or materialized views 
– Srored procedure 
Danairat T., 2013, Big Data Hadoop – Hands On Workshop 61 danairat@gmail.com
Hive Tables 
● Managed- CREATE TABLE 
– LOAD- File moved into Hive's data warehouse directory 
– DROP- Both data and metadata are deleted. 
● External- CREATE EXTERNAL TABLE 
– LOAD- No file moved 
– DROP- Only metadata deleted 
– Use when sharing data between Hive and Hadoop applications 
or you want to use multiple schema on the same data 
Danairat T., 2013, Big Data Hadoop – Hands On Workshop 62 danairat@gmail.com
Running Hive 
Hive Shell 
● Interactive 
hive 
● Script 
hive -f myscript 
● Inline 
hive -e 'SELECT * FROM mytable' 
Hive.apache.org 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
System Architecture and Components 
• Metastore: To store the meta data. 
• Query compiler and execution engine: To convert SQL queries to a 
sequence of map/reduce jobs that are then executed on Hadoop. 
• SerDe and ObjectInspectors: Programmable interfaces and 
implementations of common data formats and types. 
A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary 
representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will take a Java 
object that Hive has been working with, and turn it into something that Hive can write to HDFS or another supported system. 
• UDF and UDAF: Programmable interfaces and implementations for 
user defined functions (scalar and aggregate functions). 
• Clients: Command line client similar to Mysql command line. 
hive.apache.org 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Architecture Overview 
HDFS 
Hive 
Hive CLI 
Browsing Queries 
Map Reduce 
Thrift API 
MetaStore 
Execution 
SerDe 
Hive QL 
Thrift Jute JSON.. 
DDL 
Parser 
Planner 
Mgmt. 
Web UI 
HDFS 
Hive.apache.org 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Sample HiveQL 
The Query compiler uses the information stored in the metastore to 
convert SQL queries into a sequence of map/reduce jobs, e.g. the 
following query 
SELECT * FROM t where t.c = 'xyz' 
SELECT t1.c2 FROM t1 JOIN t2 ON (t1.c1 = t2.c1) 
SELECT t1.c1, count(1) from t1 group by t1.c1 
Hive.apache.org 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Hands-On: Creating Table and 
Retrieving Data using Hive 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Running Hive from terminal 
Starting Hive 
Quit from Hive 
hive> quit; 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Starting Hive Editor from Hue 
Scroll Down 
the web page 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Creating Hive Table 
hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW 
FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; 
OK 
Time taken: 4.069 seconds 
hive (default)> show tables; 
OK 
test_tbl 
Time taken: 0.138 seconds 
hive (default)> describe test_tbl; 
OK 
id int 
country string 
Time taken: 0.147 seconds 
hive (default)> 
See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Using Hue Query Editor 
See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Using Hue Query Editor 
See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Using Hue Query Editor 
See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Reviewing Hive Table in HDFS 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Alter and Drop Hive Table 
hive (default)> alter table test_tbl add columns (remarks STRING); 
hive (default)> describe test_tbl; 
OK 
id int 
country string 
remarks string 
Time taken: 0.077 seconds 
hive (default)> drop table test_tbl; 
OK 
Time taken: 0.9 seconds 
See also: https://cwiki.apache.org/Hive/adminmanual-metastoreadmin.html 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Loading Data to Hive Table 
Creating Hive table 
$ hive 
hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW 
FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; 
Loading data to Hive table 
hive (default)> LOAD DATA LOCAL INPATH '/tmp/country.csv' INTO TABLE test_tbl; 
Copying data from file:/tmp/test_tbl_data.csv 
Copying file: file:/tmp/test_tbl_data.csv 
Loading data to table default.test_tbl 
OK 
Time taken: 0.241 seconds 
hive (default)> 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Querying Data from Hive Table 
hive (default)> select * from test_tbl; 
OK 
1 USA 
62 Indonesia 
63 Philippines 
65 Singapore 
66 Thailand 
Time taken: 0.287 seconds 
hive (default)> 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Insert Overwriting the Hive Table 
hive (default)> LOAD DATA LOCAL INPATH 
'/tmp/test_tbl_data_updated.csv' overwrite INTO TABLE test_tbl; 
Copying data from file:/tmp/test_tbl_data_updated.csv 
Copying file: file:/tmp/test_tbl_data_updated.csv 
Loading data to table default.test_tbl 
Deleted hdfs://localhost:54310/user/hive/warehouse/test_tbl 
OK 
Time taken: 0.204 seconds 
hive (default)> 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Lecture 
Understanding Pig 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Introduction 
A high-level platform for creating MapReduce programs Using Hadoop 
Pig is a platform for analyzing large data sets that consists of 
a high-level language for expressing data analysis programs, 
coupled with infrastructure for evaluating these programs. 
The salient property of Pig programs is that their structure is 
amenable to substantial parallelization, which in turns enables 
them to handle very large data sets. 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Pig Components 
● Two Compnents 
● Language (Pig Latin) 
● Compiler 
● Two Execution Environments 
● Local 
pig -x local 
● Distributed 
pig -x mapreduce 
Hive.apache.org 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Running Pig 
● Script 
pig myscript 
● Command line (Grunt) 
pig 
● Embedded 
Writing a java program 
Hive.apache.org 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Pig Latin 
Hive.apache.org 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Pig Execution Stages 
Source Introduction to Apache Hadoop-Pig: PrashantKommireddi Hive.apache.org 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Why Pig? 
● Makes writing Hadoop jobs easier 
● 5% of the code, 5% of the time 
● You don't need to be a programmer to write Pig scripts 
● Provide major functionality required for 
DatawareHouse and Analytics 
● Load, Filter, Join, Group By, Order, Transform 
● User can write custom UDFs (User Defined Function) 
Source Introduction to Apache Hadoop-Pig: PrashantKommireddi Hive.apache.org 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Hive.apache.org 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Hands-On: Running a Pig script 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Starting Pig Command Line 
[hdadmin@localhost ~]$ pig -x local 
2013-08-01 10:29:00,027 [main] INFO org.apache.pig.Main - Apache Pig 
version 0.11.1 (r1459641) compiled Mar 22 2013, 02:13:53 
2013-08-01 10:29:00,027 [main] INFO org.apache.pig.Main - Logging error 
messages to: /home/hdadmin/pig_1375327740024.log 
2013-08-01 10:29:00,066 [main] INFO org.apache.pig.impl.util.Utils - 
Default bootup file /home/hdadmin/.pigbootup not found 
2013-08-01 10:29:00,212 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
to hadoop file system at: file:/// 
grunt> 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Starting Pig from Hue 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Writing a Pig Script 
#Preparing Data 
Download hdi-data.csv 
#Edit Your Script 
[hdadmin@localhost ~]$ cd Downloads/ 
[hdadmin@localhost ~]$ vi countryFilter.pig 
countryFilter.pig 
A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float, 
lifeex:int, mysch:i 
nt, eysch:int, gni:int); 
B = FILTER A BY gni > 2000; 
C = ORDER B BY gni; 
dump C; 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Running a Pig Script 
[hdadmin@localhost ~]$ cd Downloads 
[hdadmin@localhost ~]$ pig -x local 
grunt > run countryFilter.pig 
.... 
(150,Cameroon,0.482,51,5,10,2031) 
(126,Kyrgyzstan,0.615,67,9,12,2036) 
(156,Nigeria,0.459,51,5,8,2069) 
(154,Yemen,0.462,65,2,8,2213) 
(138,Lao People's Democratic Republic,0.524,67,4,9,2242) 
(153,Papua New Guinea,0.466,62,4,5,2271) 
(165,Djibouti,0.43,57,3,5,2335) 
(129,Nicaragua,0.589,74,5,10,2430) 
(145,Pakistan,0.504,65,4,6,2550) 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Lecture: Understanding Sqoop 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Introduction 
Sqoop (“SQL-to-Hadoop”) is a straightforward command-line 
tool with the following capabilities: 
• Imports individual tables or entire databases to files in 
HDFS 
• Generates Java classes to allow you to interact with your 
imported data 
• Provides the ability to import from SQL databases straight 
into your Hive data warehouse 
See also: http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Architecture Overview 
Hive.apache.org 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Hands-On: Loading Data from DBMS 
to Hadoop HDFS 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Loading Data into MySQL DB 
[root@localhost ~]# service mysqld start 
Starting mysqld: [ OK ] 
[root@localhost ~]# mysql 
mysql> create database countrydb; 
Query OK, 1 row affected (0.00 sec) 
mysql> use countrydb; 
Database changed 
mysql> create table country_tbl(id INT, country VARCHAR(100)); 
Query OK, 0 rows affected (0.02 sec) 
mysql> LOAD DATA LOCAL INFILE '/tmp/country_data.csv' INTO TABLE 
country_tbl FIELDS terminated by ',' LINES TERMINATED BY 'rn'; 
Query OK, 237 rows affected, 4 warnings (0.00 sec) 
Records: 237 Deleted: 0 Skipped: 0 Warnings: 0 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Loading Data into MySQL DB 
Testing data query from MySQL DB 
mysql> select * from country_tbl; 
+------+------------------------------+ 
| id | country | 
+------+------------------------------+ 
| 93 | Afghanistan | 
| 355 | Albania | 
| 213 | Algeria | 
| 1684 | AmericanSamoa | 
| 376 | Andorra | 
... 
+------+------------------------------+ 
237 rows in set (0.00 sec) 
mysql> 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Installing DB driver for Sqoop 
[hdadmin@localhost conf]$ su - 
Password: 
[root@localhost ~]# cp /01_hadoop_workshop/03_sqoop1.4.2/mysql-connector- 
java-5.0.8-bin.jar /usr/local/sqoop-1.4.2.bin__hadoop- 
0.20/lib/ 
[root@localhost ~]# chown hdadmin:hadoop /usr/local/sqoop- 
1.4.2.bin__hadoop-0.20/lib/mysql-connector-java-5.0.8-bin.jar 
[root@localhost ~]# exit 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Importing data from MySQL to Hive Table 
[hdadmin@localhost ~]$ sqoop import --connect 
jdbc:mysql://localhost/countrydb --username root -P --table 
country_tbl --hive-import --hive-table country_tbl -m 1 
Warning: /usr/lib/hbase does not exist! HBase imports will fail. 
Please set $HBASE_HOME to the root of your HBase installation. 
Warning: $HADOOP_HOME is deprecated. 
Enter password: <enter here> 
13/03/21 18:07:43 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override 
13/03/21 18:07:43 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc. 
13/03/21 18:07:43 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset. 
13/03/21 18:07:43 INFO tool.CodeGenTool: Beginning code generation 
13/03/21 18:07:44 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `country_tbl` AS t LIMIT 1 
13/03/21 18:07:44 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `country_tbl` AS t LIMIT 1 
13/03/21 18:07:44 INFO orm.CompilationManager: HADOOP_HOME is /usr/local/hadoop/libexec/.. 
Note: /tmp/sqoop-hdadmin/compile/0b65b003bf2936e1303f5edf93338215/country_tbl.java uses or overrides a deprecated API. 
Note: Recompile with -Xlint:deprecation for details. 
13/03/21 18:07:44 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdadmin/compile/0b65b003bf2936e1303f5edf93338215/country_tbl.jar 
13/03/21 18:07:44 WARN manager.MySQLManager: It looks like you are importing from mysql. 
13/03/21 18:07:44 WARN manager.MySQLManager: This transfer can be faster! Use the --direct 
13/03/21 18:07:44 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path. 
13/03/21 18:07:44 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql) 
13/03/21 18:07:44 INFO mapreduce.ImportJobBase: Beginning import of country_tbl 
13/03/21 18:07:45 INFO mapred.JobClient: Running job: job_201303211744_0001 
13/03/21 18:07:46 INFO mapred.JobClient: map 0% reduce 0% 
13/03/21 18:08:02 INFO mapred.JobClient: map 100% reduce 0% 
13/03/21 18:08:07 INFO mapred.JobClient: Job complete: job_201303211744_0001 
13/03/21 18:08:07 INFO mapred.JobClient: Counters: 18 
13/03/21 18:08:07 INFO mapred.JobClient: Job Counters 
13/03/21 18:08:07 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=12154 
13/03/21 18:08:07 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 
Big Data Hadoop 13/03/21 18:on 08:07 INFO Amazon mapred.JobClient: Total EMR time spent by – all Hands maps waiting after reserving On slots Workshop (ms)=0 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013 
13/03/21 18:08:07 INFO mapred.JobClient: Launched map tasks=1
Reviewing data from Hive Table 
[hdadmin@localhost ~]$ hive 
Logging initialized using configuration in file:/usr/local/hive-0.9.0- 
bin/conf/hive-log4j.properties 
Hive history 
file=/tmp/hdadmin/hive_job_log_hdadmin_201303211810_964909984.txt 
hive (default)> show tables; 
OK 
country_tbl 
test_tbl 
Time taken: 2.566 seconds 
hive (default)> select * from country_tbl; 
OK 
93 Afghanistan 
355 Albania 
..... 
Time taken: 0.587 seconds 
hive (default)> quit; 
[hdadmin@localhost ~]$ 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Reviewing HDFS Database Table files 
Start Web Browser to http://localhost:50070/ then navigate to /user/hive/warehouse 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Lecture 
Understanding HBase 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Introduction 
An open source, non-relational, distributed database 
HBase is an open source, non-relational, distributed database 
modeled after Google's BigTable and is written in Java. It is 
developed as part of Apache Software Foundation's Apache 
Hadoop project and runs on top of HDFS (, providing 
BigTable-like capabilities for Hadoop. That is, it provides a 
fault-tolerant way of storing large quantities of sparse data. 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
HBase Features 
● Hadoop database modelled after Google's Bigtab;e 
● Column oriented data store, known as Hadoop Database 
● Support random realtime CRUD operations (unlike 
HDFS) 
● No SQL Database 
● Opensource, written in Java 
● Run on a cluster of commodity hardware 
Hive.apache.org 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
When to use Hbase? 
● When you need high volume data to be stored 
● Un-structured data 
● Sparse data 
● Column-oriented data 
● Versioned data (same data template, captured at various 
time, time-elapse data) 
● When you need high scalability 
Hive.apache.org 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Which one to use? 
● HDFS 
● Only append dataset (no random write) 
● Read the whole dataset (no random read) 
● HBase 
● Need random write and/or read 
● Has thousands of operation per second on TB+ of data 
● RDBMS 
● Data fits on one big node 
● Need full transaction support 
● Need real-time query capabilities 
Hive.apache.org 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
HBase Components 
Hive.apache.org 
● Region 
● Row of table are stores 
● Region Server 
● Hosts the tables 
● Master 
● Coordinating the Region 
Servers 
● ZooKeeper 
● HDFS 
● API 
● The Java Client API 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
HBase Shell Commands 
Hive.apache.org 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Hands-On: Running HBase 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Starting HBase shell 
[hdadmin@localhost ~]$ start-hbase.sh 
starting master, logging to /usr/local/hbase-0.94.10/logs/hbase-hdadmin-master- 
localhost.localdomain.out 
[hdadmin@localhost ~]$ jps 
3064 TaskTracker 
2836 SecondaryNameNode 
2588 NameNode 
3513 Jps 
3327 HMaster 
2938 JobTracker 
2707 DataNode 
[hdadmin@localhost ~]$ hbase shell 
HBase Shell; enter 'help<RETURN>' for list of supported commands. 
Type "exit<RETURN>" to leave the HBase Shell 
Version 0.94.10, r1504995, Fri Jul 19 20:24:16 UTC 2013 
hbase(main):001:0> 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Create a table and insert data in HBase 
hbase(main):009:0> create 'test', 'cf' 
0 row(s) in 1.0830 seconds 
hbase(main):010:0> put 'test', 'row1', 'cf:a', 'val1' 
0 row(s) in 0.0750 seconds 
hbase(main):011:0> scan 'test' 
ROW COLUMN+CELL 
row1 column=cf:a, timestamp=1375363287644, 
value=val1 
1 row(s) in 0.0640 seconds 
hbase(main):002:0> get 'test', 'row1' 
COLUMN CELL 
cf:a timestamp=1375363287644, value=val1 
1 row(s) in 0.0370 seconds 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Using Data Browsers in Hue for HBase 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Using Data Browsers in Hue for HBase 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Recommendation to Further Study 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
Thank you 
www.imcinstitute.com 
www.facebook.com/imcinstitute 
Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013

Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR

  • 1.
    Big Data onPublic Cloud Using Cloudera on GoGrid & Amazon EMR Hands On Workshop October 2014 Dr.Thanachart Numnonda IMC Institute thanachart@imcinstitute.com Modifiy from Original Version by Danairat T. Certified Java Programmer, TOGAF – Silver danairat@gmail.com Danairat T., 2013, Big Data Hadoop – Hands On Workshop 1 danairat@gmail.com
  • 2.
    Hands-On: Running Hadoop Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 3.
    Running this labusing Cloudera Live Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 4.
    Using Cloudera Liveon GoGrid Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 5.
    Alternatively Using ClouderaVM Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 6.
    Starting Hue onCloudera Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 7.
    Viewing HDFS DanairatT., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 8.
    Danairat T., ,danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 9.
    Hands-On: Importing/Exporting Datato HDFS Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 10.
    Importing Data toHadoop Download War and Peace Full Text www.gutenberg.org/ebooks/2600 $hadoop fs -mkdir input $hadoop fs -mkdir output $hadoop fs -copyFromLocal Downloads/pg2600.txt input Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 11.
    Review file inHadoop HDFS List HDFS File Read HDFS File [hdadmin@localhost bin]$ hadoop fs -cat input/pg2600.txt Retrieve HDFS File to Local File System [hdadmin@localhost bin]$ hadoop fs -copyToLocal input/pg2600.txt tmp/file.txt Please see also http://hadoop.apache.org/docs/r1.0.4/commands_manual.html Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 12.
    Review file inHadoop HDFS using File Browse Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 13.
    Review file inHadoop HDFS using Hue Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 14.
    Hadoop Port Numbers Daemon Default Port Configuration Parameter in conf/*-site.xml HDFS Namenode 50070 dfs.http.address Datanodes 50075 dfs.datanode.http.address Secondarynamenode 50090 dfs.secondary.http.address MR JobTracker 50030 mapred.job.tracker.http.addre ss Tasktrackers 50060 mapred.task.tracker.http.addr ess Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 15.
    Removing data fromHDFS using Shell Command hdadmin@localhost detach]$ hadoop fs -rm input/pg2600.txt Deleted hdfs://localhost:54310/input/pg2600.txt hdadmin@localhost detach]$ Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 16.
    Client Map Reduce Name Node Job Tracker Data Node Task Tracker Data Node Task Tracker Data Node Task Tracker Lecture: Understanding Map Reduce Processing Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 17.
    High Level Architectureof MapReduce Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 18.
    Before MapReduce… ●Large scale data processing was difficult! – Managing hundreds or thousands of processors – Managing parallelization and distribution – I/O Scheduling – Status and monitoring – Fault/crash tolerance ● MapReduce provides all of these, easily! Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html Danairat T., 2013, Big Data Hadoop – Hands On Workshop 18 danairat@gmail.com
  • 19.
    MapReduce Overview ●What is it? – Programming model used by Google – A combination of the Map and Reduce models with an associated implementation – Used for processing and generating large data sets Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html Danairat T., 2013, Big Data Hadoop – Hands On Workshop 19 danairat@gmail.com
  • 20.
    MapReduce Overview ●How does it solve our previously mentioned problems? – MapReduce is highly scalable and can be used across many computers. – Many small machines can be used to process jobs that normally could not be processed by a large machine. Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html Danairat T., 2013, Big Data Hadoop – Hands On Workshop 20 danairat@gmail.com
  • 21.
    MapReduce Framework Source:www.bigdatauniversity.com Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 22.
    How does theMapReduce work? Partitioning Sorting RecordWriter Output in a list of (Key, List of Values) in the intermediate file InputSplit RecordReader Output in a list of (Key, Value) in the intermediate file Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 23.
    How does theMapReduce work? Partitioning Sorting RecordWriter Output in a list of (Key, List of Values) in the intermediate file InputSplit RecordReader Output in a list of (Key, Value) in the intermediate file Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 24.
    Map Abstraction ●Inputs a key/value pair – Key is a reference to the input value – Value is the data set on which to operate ● Evaluation – Function defined by user – Applies to every value in value input ● Might need to parse input ● Produces a new list of key/value pairs – Can be different type from input pair Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html Danairat T., 2013, Big Data Hadoop – Hands On Workshop 24 danairat@gmail.com
  • 25.
    Reduce Abstraction ●Starts with intermediate Key / Value pairs ● Ends with finalized Key / Value pairs ● Starting pairs are sorted by key ● Iterator supplies the values for a given key to the Reduce function. Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html Danairat T., 2013, Big Data Hadoop – Hands On Workshop 25 danairat@gmail.com
  • 26.
    Reduce Abstraction ●Typically a function that: – Starts with a large number of key/value pairs ● One key/value for each word in all files being greped (including multiple entries for the same word) – Ends with very few key/value pairs ● One key/value for each unique word across all the files with the number of instances summed into this entry ● Broken up so a given worker works with input of the same key. Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html Danairat T., 2013, Big Data Hadoop – Hands On Workshop 26 danairat@gmail.com
  • 27.
    How Map andReduce Work Together ● Map returns information ● Reduces accepts information ● Reduce applies a user defined function to reduce the amount of data Danairat T., 2013, Big Data Hadoop – Hands On Workshop 27 danairat@gmail.com
  • 28.
    Other Applications ●Yahoo! – Webmap application uses Hadoop to create a database of information on all known webpages ● Facebook – Hive data center uses Hadoop to provide business statistics to application developers and advertisers ● Rackspace – Analyzes sever log files and usage data using Hadoop Danairat T., 2013, Big Data Hadoop – Hands On Workshop 28 danairat@gmail.com
  • 29.
    MapReduce Framework map:(K1, V1) -> list(K2, V2)) reduce: (K2, list(V2)) -> list(K3, V3) Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 30.
    Hands-On: Writing youown Map Reduce Program Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 31.
    Wordcount (HelloWord inHadoop) 1. package org.myorg; 2. 3. import java.io.IOException; 4. import java.util.*; 5. 6. import org.apache.hadoop.fs.Path; 7. import org.apache.hadoop.conf.*; 8. import org.apache.hadoop.io.*; 9. import org.apache.hadoop.mapred.*; 10. import org.apache.hadoop.util.*; 11. 12. public class WordCount { 13. 14. public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { 15. private final static IntWritable one = new IntWritable(1); 16. private Text word = new Text(); 17. 18. public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { 19. String line = value.toString(); 20. StringTokenizer tokenizer = new StringTokenizer(line); 21. while (tokenizer.hasMoreTokens()) { 22. word.set(tokenizer.nextToken()); 23. output.collect(word, one); 24. } 25. } 26. } Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 32.
    Wordcount (HelloWord inHadoop) 27. 28. public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { 29. public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { 30. int sum = 0; 31. while (values.hasNext()) { 32. sum += values.next().get(); 33. } 34. output.collect(key, new IntWritable(sum)); 35. } 36. } 37. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 33.
    Wordcount (HelloWord inHadoop) 38. public static void main(String[] args) throws Exception { 39. JobConf conf = new JobConf(WordCount.class); 40. conf.setJobName("wordcount"); 41. 42. conf.setOutputKeyClass(Text.class); 43. conf.setOutputValueClass(IntWritable.class); 44. 45. conf.setMapperClass(Map.class); 46. 47. conf.setReducerClass(Reduce.class); 48. 49. conf.setInputFormat(TextInputFormat.class); 50. conf.setOutputFormat(TextOutputFormat.class); 51. 52. FileInputFormat.setInputPaths(conf, new Path(args[0])); 53. FileOutputFormat.setOutputPath(conf, new Path(args[1])); 54. 55. JobClient.runJob(conf); 57. } 58. } 59. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 34.
    Hands-On: Packaging MapReduce and Deploying to Hadoop Runtime Environment Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 35.
    Packaging Map ReduceProgram Usage Assuming we have already downloaded wordcount.jar into the folder /Downloads: Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 36.
    Reviewing MapReduce Jobin Hue Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 37.
    Reviewing MapReduce Jobin Hue Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 38.
    Reviewing MapReduce OutputResult Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 39.
    Reviewing MapReduce OutputResult Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 40.
    Reviewing MapReduce OutputResult Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 41.
    Running MapReduce JobUsing Oozie : Select Java Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 42.
    Hands-On: Running WordCount.jar on Amazon EMR Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 43.
    Architecture Overview ofAmazon EMR Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 44.
    Creating an AWSaccount Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 45.
    Signing up forthe necessary services ● Simple Storage Service (S3) ● Elastic Compute Cloud (EC2) ● Elastic MapReduce (EMR) Caution! This costs real money! Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 46.
    Creating Amazon S3bucket Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 47.
    Upload .jar fileand input file to Amazon S3 1. Select <yourbucket> in Amazon S3 service 2. Create folder : applications 3. Upload wordcount.jar to the applications folder 4. Create another folder: input 5. Upload input_test.txt to the input folder Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 48.
    Create a newcluster in EMR Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 49.
    Assign a clustername and log location Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 50.
    Assign hardware configuration Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 51.
    Select Custom JAR Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 52.
    Input JAR Locationand Arguments Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 53.
    Select Create Cluster Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 54.
    Monitoring the cluster Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 55.
    View Result fromthe S3 bucket Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 56.
    Lecture Understanding Hive Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 57.
    Introduction A PetabyteScale Data Warehouse Using Hadoop Hive is developed by Facebook, designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 58.
    What Hive isNOT Hive is not designed for online transaction processing and does not offer real-time queries and row level updates. It is best used for batch jobs over large sets of immutable data (like web logs, etc.). Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 59.
    Hive Metastore ●Store Hive metadata ● Configurations – Embedded: in-process metastore, in-process database – Local: in-process metastore, out-of-process database – Remote: out-of-process metastore,out-of-process database Danairat T., 2013, Big Data Hadoop – Hands On Workshop 59 danairat@gmail.com
  • 60.
    Hive Schema-On-Read ●Faster loads into the database (simply copy or move) ● Slower queries ● Flexibility – multiple schemas for the same data Danairat T., 2013, Big Data Hadoop – Hands On Workshop 60 danairat@gmail.com
  • 61.
    HiveQL ● HiveQuery Language ● SQL dialect ● No support for: – UPDATE, DELETE – Transactions – Indexes – HAVING clause in SELECT – Updateable or materialized views – Srored procedure Danairat T., 2013, Big Data Hadoop – Hands On Workshop 61 danairat@gmail.com
  • 62.
    Hive Tables ●Managed- CREATE TABLE – LOAD- File moved into Hive's data warehouse directory – DROP- Both data and metadata are deleted. ● External- CREATE EXTERNAL TABLE – LOAD- No file moved – DROP- Only metadata deleted – Use when sharing data between Hive and Hadoop applications or you want to use multiple schema on the same data Danairat T., 2013, Big Data Hadoop – Hands On Workshop 62 danairat@gmail.com
  • 63.
    Running Hive HiveShell ● Interactive hive ● Script hive -f myscript ● Inline hive -e 'SELECT * FROM mytable' Hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 64.
    System Architecture andComponents • Metastore: To store the meta data. • Query compiler and execution engine: To convert SQL queries to a sequence of map/reduce jobs that are then executed on Hadoop. • SerDe and ObjectInspectors: Programmable interfaces and implementations of common data formats and types. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will take a Java object that Hive has been working with, and turn it into something that Hive can write to HDFS or another supported system. • UDF and UDAF: Programmable interfaces and implementations for user defined functions (scalar and aggregate functions). • Clients: Command line client similar to Mysql command line. hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 65.
    Architecture Overview HDFS Hive Hive CLI Browsing Queries Map Reduce Thrift API MetaStore Execution SerDe Hive QL Thrift Jute JSON.. DDL Parser Planner Mgmt. Web UI HDFS Hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 66.
    Sample HiveQL TheQuery compiler uses the information stored in the metastore to convert SQL queries into a sequence of map/reduce jobs, e.g. the following query SELECT * FROM t where t.c = 'xyz' SELECT t1.c2 FROM t1 JOIN t2 ON (t1.c1 = t2.c1) SELECT t1.c1, count(1) from t1 group by t1.c1 Hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 67.
    Hands-On: Creating Tableand Retrieving Data using Hive Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 68.
    Running Hive fromterminal Starting Hive Quit from Hive hive> quit; Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 69.
    Starting Hive Editorfrom Hue Scroll Down the web page Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 70.
    Creating Hive Table hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; OK Time taken: 4.069 seconds hive (default)> show tables; OK test_tbl Time taken: 0.138 seconds hive (default)> describe test_tbl; OK id int country string Time taken: 0.147 seconds hive (default)> See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 71.
    Using Hue QueryEditor See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 72.
    Using Hue QueryEditor See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 73.
    Using Hue QueryEditor See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 74.
    Reviewing Hive Tablein HDFS Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 75.
    Alter and DropHive Table hive (default)> alter table test_tbl add columns (remarks STRING); hive (default)> describe test_tbl; OK id int country string remarks string Time taken: 0.077 seconds hive (default)> drop table test_tbl; OK Time taken: 0.9 seconds See also: https://cwiki.apache.org/Hive/adminmanual-metastoreadmin.html Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 76.
    Loading Data toHive Table Creating Hive table $ hive hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; Loading data to Hive table hive (default)> LOAD DATA LOCAL INPATH '/tmp/country.csv' INTO TABLE test_tbl; Copying data from file:/tmp/test_tbl_data.csv Copying file: file:/tmp/test_tbl_data.csv Loading data to table default.test_tbl OK Time taken: 0.241 seconds hive (default)> Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 77.
    Querying Data fromHive Table hive (default)> select * from test_tbl; OK 1 USA 62 Indonesia 63 Philippines 65 Singapore 66 Thailand Time taken: 0.287 seconds hive (default)> Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 78.
    Insert Overwriting theHive Table hive (default)> LOAD DATA LOCAL INPATH '/tmp/test_tbl_data_updated.csv' overwrite INTO TABLE test_tbl; Copying data from file:/tmp/test_tbl_data_updated.csv Copying file: file:/tmp/test_tbl_data_updated.csv Loading data to table default.test_tbl Deleted hdfs://localhost:54310/user/hive/warehouse/test_tbl OK Time taken: 0.204 seconds hive (default)> Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 79.
    Lecture Understanding Pig Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 80.
    Introduction A high-levelplatform for creating MapReduce programs Using Hadoop Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 81.
    Pig Components ●Two Compnents ● Language (Pig Latin) ● Compiler ● Two Execution Environments ● Local pig -x local ● Distributed pig -x mapreduce Hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 82.
    Running Pig ●Script pig myscript ● Command line (Grunt) pig ● Embedded Writing a java program Hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 83.
    Pig Latin Hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 84.
    Pig Execution Stages Source Introduction to Apache Hadoop-Pig: PrashantKommireddi Hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 85.
    Why Pig? ●Makes writing Hadoop jobs easier ● 5% of the code, 5% of the time ● You don't need to be a programmer to write Pig scripts ● Provide major functionality required for DatawareHouse and Analytics ● Load, Filter, Join, Group By, Order, Transform ● User can write custom UDFs (User Defined Function) Source Introduction to Apache Hadoop-Pig: PrashantKommireddi Hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 86.
    Hive.apache.org Danairat T.,, danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 87.
    Hands-On: Running aPig script Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 88.
    Starting Pig CommandLine [hdadmin@localhost ~]$ pig -x local 2013-08-01 10:29:00,027 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1 (r1459641) compiled Mar 22 2013, 02:13:53 2013-08-01 10:29:00,027 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hdadmin/pig_1375327740024.log 2013-08-01 10:29:00,066 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hdadmin/.pigbootup not found 2013-08-01 10:29:00,212 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:/// grunt> Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 89.
    Starting Pig fromHue Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 90.
    Writing a PigScript #Preparing Data Download hdi-data.csv #Edit Your Script [hdadmin@localhost ~]$ cd Downloads/ [hdadmin@localhost ~]$ vi countryFilter.pig countryFilter.pig A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:i nt, eysch:int, gni:int); B = FILTER A BY gni > 2000; C = ORDER B BY gni; dump C; Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 91.
    Running a PigScript [hdadmin@localhost ~]$ cd Downloads [hdadmin@localhost ~]$ pig -x local grunt > run countryFilter.pig .... (150,Cameroon,0.482,51,5,10,2031) (126,Kyrgyzstan,0.615,67,9,12,2036) (156,Nigeria,0.459,51,5,8,2069) (154,Yemen,0.462,65,2,8,2213) (138,Lao People's Democratic Republic,0.524,67,4,9,2242) (153,Papua New Guinea,0.466,62,4,5,2271) (165,Djibouti,0.43,57,3,5,2335) (129,Nicaragua,0.589,74,5,10,2430) (145,Pakistan,0.504,65,4,6,2550) Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 92.
    Lecture: Understanding Sqoop Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 93.
    Introduction Sqoop (“SQL-to-Hadoop”)is a straightforward command-line tool with the following capabilities: • Imports individual tables or entire databases to files in HDFS • Generates Java classes to allow you to interact with your imported data • Provides the ability to import from SQL databases straight into your Hive data warehouse See also: http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 94.
    Architecture Overview Hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 95.
    Hands-On: Loading Datafrom DBMS to Hadoop HDFS Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 96.
    Loading Data intoMySQL DB [root@localhost ~]# service mysqld start Starting mysqld: [ OK ] [root@localhost ~]# mysql mysql> create database countrydb; Query OK, 1 row affected (0.00 sec) mysql> use countrydb; Database changed mysql> create table country_tbl(id INT, country VARCHAR(100)); Query OK, 0 rows affected (0.02 sec) mysql> LOAD DATA LOCAL INFILE '/tmp/country_data.csv' INTO TABLE country_tbl FIELDS terminated by ',' LINES TERMINATED BY 'rn'; Query OK, 237 rows affected, 4 warnings (0.00 sec) Records: 237 Deleted: 0 Skipped: 0 Warnings: 0 Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 97.
    Loading Data intoMySQL DB Testing data query from MySQL DB mysql> select * from country_tbl; +------+------------------------------+ | id | country | +------+------------------------------+ | 93 | Afghanistan | | 355 | Albania | | 213 | Algeria | | 1684 | AmericanSamoa | | 376 | Andorra | ... +------+------------------------------+ 237 rows in set (0.00 sec) mysql> Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 98.
    Installing DB driverfor Sqoop [hdadmin@localhost conf]$ su - Password: [root@localhost ~]# cp /01_hadoop_workshop/03_sqoop1.4.2/mysql-connector- java-5.0.8-bin.jar /usr/local/sqoop-1.4.2.bin__hadoop- 0.20/lib/ [root@localhost ~]# chown hdadmin:hadoop /usr/local/sqoop- 1.4.2.bin__hadoop-0.20/lib/mysql-connector-java-5.0.8-bin.jar [root@localhost ~]# exit Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 99.
    Importing data fromMySQL to Hive Table [hdadmin@localhost ~]$ sqoop import --connect jdbc:mysql://localhost/countrydb --username root -P --table country_tbl --hive-import --hive-table country_tbl -m 1 Warning: /usr/lib/hbase does not exist! HBase imports will fail. Please set $HBASE_HOME to the root of your HBase installation. Warning: $HADOOP_HOME is deprecated. Enter password: <enter here> 13/03/21 18:07:43 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override 13/03/21 18:07:43 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc. 13/03/21 18:07:43 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset. 13/03/21 18:07:43 INFO tool.CodeGenTool: Beginning code generation 13/03/21 18:07:44 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `country_tbl` AS t LIMIT 1 13/03/21 18:07:44 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `country_tbl` AS t LIMIT 1 13/03/21 18:07:44 INFO orm.CompilationManager: HADOOP_HOME is /usr/local/hadoop/libexec/.. Note: /tmp/sqoop-hdadmin/compile/0b65b003bf2936e1303f5edf93338215/country_tbl.java uses or overrides a deprecated API. Note: Recompile with -Xlint:deprecation for details. 13/03/21 18:07:44 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdadmin/compile/0b65b003bf2936e1303f5edf93338215/country_tbl.jar 13/03/21 18:07:44 WARN manager.MySQLManager: It looks like you are importing from mysql. 13/03/21 18:07:44 WARN manager.MySQLManager: This transfer can be faster! Use the --direct 13/03/21 18:07:44 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path. 13/03/21 18:07:44 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql) 13/03/21 18:07:44 INFO mapreduce.ImportJobBase: Beginning import of country_tbl 13/03/21 18:07:45 INFO mapred.JobClient: Running job: job_201303211744_0001 13/03/21 18:07:46 INFO mapred.JobClient: map 0% reduce 0% 13/03/21 18:08:02 INFO mapred.JobClient: map 100% reduce 0% 13/03/21 18:08:07 INFO mapred.JobClient: Job complete: job_201303211744_0001 13/03/21 18:08:07 INFO mapred.JobClient: Counters: 18 13/03/21 18:08:07 INFO mapred.JobClient: Job Counters 13/03/21 18:08:07 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=12154 13/03/21 18:08:07 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 Big Data Hadoop 13/03/21 18:on 08:07 INFO Amazon mapred.JobClient: Total EMR time spent by – all Hands maps waiting after reserving On slots Workshop (ms)=0 Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013 13/03/21 18:08:07 INFO mapred.JobClient: Launched map tasks=1
  • 100.
    Reviewing data fromHive Table [hdadmin@localhost ~]$ hive Logging initialized using configuration in file:/usr/local/hive-0.9.0- bin/conf/hive-log4j.properties Hive history file=/tmp/hdadmin/hive_job_log_hdadmin_201303211810_964909984.txt hive (default)> show tables; OK country_tbl test_tbl Time taken: 2.566 seconds hive (default)> select * from country_tbl; OK 93 Afghanistan 355 Albania ..... Time taken: 0.587 seconds hive (default)> quit; [hdadmin@localhost ~]$ Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 101.
    Reviewing HDFS DatabaseTable files Start Web Browser to http://localhost:50070/ then navigate to /user/hive/warehouse Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 102.
    Lecture Understanding HBase Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 103.
    Introduction An opensource, non-relational, distributed database HBase is an open source, non-relational, distributed database modeled after Google's BigTable and is written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (, providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 104.
    HBase Features ●Hadoop database modelled after Google's Bigtab;e ● Column oriented data store, known as Hadoop Database ● Support random realtime CRUD operations (unlike HDFS) ● No SQL Database ● Opensource, written in Java ● Run on a cluster of commodity hardware Hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 105.
    When to useHbase? ● When you need high volume data to be stored ● Un-structured data ● Sparse data ● Column-oriented data ● Versioned data (same data template, captured at various time, time-elapse data) ● When you need high scalability Hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 106.
    Which one touse? ● HDFS ● Only append dataset (no random write) ● Read the whole dataset (no random read) ● HBase ● Need random write and/or read ● Has thousands of operation per second on TB+ of data ● RDBMS ● Data fits on one big node ● Need full transaction support ● Need real-time query capabilities Hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 107.
    Danairat T., ,danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 108.
    Danairat T., ,danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 109.
    HBase Components Hive.apache.org ● Region ● Row of table are stores ● Region Server ● Hosts the tables ● Master ● Coordinating the Region Servers ● ZooKeeper ● HDFS ● API ● The Java Client API Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 110.
    HBase Shell Commands Hive.apache.org Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 111.
    Hands-On: Running HBase Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 112.
    Starting HBase shell [hdadmin@localhost ~]$ start-hbase.sh starting master, logging to /usr/local/hbase-0.94.10/logs/hbase-hdadmin-master- localhost.localdomain.out [hdadmin@localhost ~]$ jps 3064 TaskTracker 2836 SecondaryNameNode 2588 NameNode 3513 Jps 3327 HMaster 2938 JobTracker 2707 DataNode [hdadmin@localhost ~]$ hbase shell HBase Shell; enter 'help<RETURN>' for list of supported commands. Type "exit<RETURN>" to leave the HBase Shell Version 0.94.10, r1504995, Fri Jul 19 20:24:16 UTC 2013 hbase(main):001:0> Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 113.
    Create a tableand insert data in HBase hbase(main):009:0> create 'test', 'cf' 0 row(s) in 1.0830 seconds hbase(main):010:0> put 'test', 'row1', 'cf:a', 'val1' 0 row(s) in 0.0750 seconds hbase(main):011:0> scan 'test' ROW COLUMN+CELL row1 column=cf:a, timestamp=1375363287644, value=val1 1 row(s) in 0.0640 seconds hbase(main):002:0> get 'test', 'row1' COLUMN CELL cf:a timestamp=1375363287644, value=val1 1 row(s) in 0.0370 seconds Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 114.
    Using Data Browsersin Hue for HBase Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 115.
    Using Data Browsersin Hue for HBase Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 116.
    Recommendation to FurtherStudy Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013
  • 117.
    Thank you www.imcinstitute.com www.facebook.com/imcinstitute Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.Big Data Hadoop on Amazon EMR – Hands On Workshop com Aug 2013