Hadoop Hands-On by @techmilind

In the associated hands-on session, participants will learn how to copy data to and from HDFS, browse HDFS, and write, run, and monitor MapReduce jobs by fitting a linear regression model on a real-world data set. The hands-on exercises are carried out on a virtual machine running the Greenplum HD distribution, based on Apache Hadoop.


  1. Data Computing Division
     Hadoop Hands-On Session
     Milind Bhandarkar
     Greenplum, A Division of EMC
     Monday, February 18, 13
  2. Prerequisites
     • Make sure you have VMware Player installed (VMware Fusion for Mac OS X)
     • Copy the GPHD (Greenplum Distribution of Hadoop v1.0) virtual machine to your laptop
     • Also copy the exercise.zip file to your laptop, and decompress it
  3. Setting Up
     • Start the GPHD virtual machine
     • Make sure you can log in to it
     • Copy exercise.zip from your laptop to the VM, and unzip it in ~/exercise
  4. Preparation
     • Make sure HDFS is running
     • Make sure MapReduce is running
     • Check the configuration files (*-site.xml)
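On a single-node Hadoop 1.x VM like GPHD, one quick way to confirm that HDFS and MapReduce are up is to look for their daemon processes in `jps` output. The helper below is a sketch, not part of the exercise scripts, and assumes the standard daemon names for a pseudo-distributed Hadoop 1.x setup:

```python
import subprocess

# Daemons expected on a single-node Hadoop 1.x VM:
# NameNode + DataNode for HDFS, JobTracker + TaskTracker for MapReduce.
EXPECTED = {"NameNode", "DataNode", "JobTracker", "TaskTracker"}

def missing_daemons(jps_output):
    """Return the expected daemon names absent from `jps` output text."""
    running = {line.split()[-1] for line in jps_output.splitlines() if line.strip()}
    return EXPECTED - running

# To check the live VM (requires a JDK on the PATH):
# out = subprocess.run(["jps"], capture_output=True, text=True).stdout
# print(missing_daemons(out) or "all daemons running")
```

If anything is missing, restart the corresponding service before moving on; the later steps assume both HDFS and MapReduce are healthy.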
  5. Hands-On
     • Objective: implement linear regression using MapReduce, and use it to train a model
     • Data set: from the Marine Resources Division, Department of Primary Industries and Fisheries, Tasmania
     • 4177 samples from observations
  6. Data
     • Attributes about a type of fish
     • M/F, Length, Diameter, Height, Weight, Rings on shell
     • Problem: predict the number of rings as a function of the other attributes
  7. Step 1
     • Copy the small sample data set to HDFS
     • See: Scripts/cp_to_grid.sh
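cp_to_grid.sh is not reproduced on the slide, but copying a local file into HDFS boils down to `hadoop fs -mkdir` and `hadoop fs -put`. A minimal sketch of those invocations, with illustrative file and directory names (the real paths are defined by the exercise scripts):

```python
import subprocess

def hdfs_put(local_path, hdfs_dir):
    """Commands to create an HDFS directory and upload a local file into it."""
    return [
        ["hadoop", "fs", "-mkdir", hdfs_dir],
        ["hadoop", "fs", "-put", local_path, hdfs_dir],
    ]

# On the VM (paths are illustrative, not the exercise's actual paths):
# for cmd in hdfs_put("sample.csv", "/user/gpadmin/input"):
#     subprocess.run(cmd, check=True)
```

You can verify the upload afterwards with `hadoop fs -ls` on the target directory.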
  8. Step 2
     • Blow up the dataset 1000 times by adding Gaussian noise to most fields
     • Output: 4M sample observations
     • Using Hadoop Streaming
     • See: Scripts/stream_replicate.sh
     • Monitor this job in the JobTracker UI
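The slide doesn't show stream_replicate.sh itself, but a Hadoop Streaming mapper for this step could look roughly like the following: read each record from stdin, emit it 1000 times with Gaussian noise added to the numeric fields, so 4177 input rows become roughly 4.2M. The record layout and noise level here are assumptions, not taken from the exercise:

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper for Step 2: replicate each input
# record REPLICAS times, jittering numeric fields with Gaussian noise.
# Assumed record layout: sex,length,diameter,height,weight,rings (CSV).
import random
import sys

REPLICAS = 1000
NOISE_STDDEV = 0.01  # assumption; the real script's noise level is not shown

def replicate(line, replicas=REPLICAS, rng=random):
    fields = line.strip().split(",")
    out = []
    for _ in range(replicas):
        jittered = []
        for f in fields:
            try:
                jittered.append("%.4f" % (float(f) + rng.gauss(0, NOISE_STDDEV)))
            except ValueError:
                jittered.append(f)  # leave non-numeric fields (e.g. M/F) alone
        out.append(",".join(jittered))
    return out

if __name__ == "__main__":
    for line in sys.stdin:
        for rec in replicate(line):
            print(rec)
```

A streaming job like this needs no reducer; running it with zero reduce tasks writes the mapper output straight to HDFS.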
  9. Step 3
     • Train a model based on linear regression
     • See: Scripts/stream_train_linreg.sh
     • Monitor the job
     • Copy the model to a local directory
     • Check it
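stream_train_linreg.sh is likewise not shown, but the usual MapReduce pattern for linear regression is to have each mapper emit partial sufficient statistics (counts and sums of x, y, x², xy over its split) and a single reducer combine them and solve the normal equations. A pure-Python sketch of that pattern for one predictor; the exercise's script presumably regresses rings on all attributes, which generalizes this to matrix form:

```python
# Sketch of MapReduce-style linear regression on a single predictor.
# Mappers emit partial sums; the reducer adds them and solves the
# normal equations for (slope, intercept).

def map_stats(pairs):
    """Partial sufficient statistics for one mapper's (x, y) records."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in pairs:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)

def reduce_fit(partials):
    """Combine partial statistics and solve for (slope, intercept)."""
    n, sx, sy, sxx, sxy = map(sum, zip(*partials))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Two "mapper" splits of points lying exactly on y = 2x + 1:
parts = [map_stats([(0, 1), (1, 3)]), map_stats([(2, 5), (3, 7)])]
slope, intercept = reduce_fit(parts)
```

Because the statistics are simple sums, they combine associatively across any number of mappers, which is exactly what makes this formulation a good fit for MapReduce.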