
Hadoop Hands-On by @techmilind



In the associated hands-on session, participants will learn how to copy data to and from HDFS, browse HDFS, and write, run, and monitor MapReduce jobs by fitting a linear regression model to a real-world data set. The hands-on exercises will be carried out on a virtual machine running the Greenplum HD distribution, based on Apache Hadoop.


  1. Hadoop Hands-On Session. Milind Bhandarkar, Greenplum, A Division of EMC (Data Computing Division). February 18, 2013
  2. Prerequisites • Make sure you have VMware Player installed • VMware Fusion for Mac OS X • Copy the GPHD (Greenplum Distribution of Hadoop v1.0) virtual machine to your laptop • Also copy the file to your laptop, and decompress it
  3. Setting Up • Start the GPHD virtual machine • Make sure you can log in to it • Copy the file from your laptop to the VM, and unzip it in ~/exercise
  4. Preparation • Make sure HDFS is running • Make sure MapReduce is running • Check the configuration files (*-site.xml) • (see the daemon-check sketch after this list)
  5. Hands-On • Objective: Implement Linear Regression using MapReduce, and use it to train a model • Data set: from the Marine Resources Division, Department of Primary Industries and Fisheries, Tasmania • 4177 samples from observations
  6. Data • Attributes about a type of fish • M/F, Length, Diameter, Height, Weight, Rings on shell • Problem: To predict the number of rings as a function of the other attributes
  7. Step 1 • Copy the small sample data set to HDFS • See: Scripts/ • (see the HDFS copy sketch after this list)
  8. Step 2 • Blow up the dataset 1000 times by adding Gaussian noise to most fields • Output: 4M sample observations • Using Hadoop Streaming • See: Scripts/ • Monitor this job in the JobTracker UI • (see the streaming mapper sketch after this list)
  9. Step 3 • Train a model based on Linear Regression • See: Scripts/ • Monitor the job • Copy the model to a local directory • Check it • (see the linear regression sketch after this list)
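
The slides assume HDFS and MapReduce are already up before the exercises start. A minimal check, not taken from the exercise scripts, is to look for the Hadoop 1.x daemons (NameNode, DataNode, JobTracker, TaskTracker) with `jps` and try a trivial HDFS listing; the sketch below assumes `jps` and `hadoop` are on the PATH inside the GPHD VM.

```python
#!/usr/bin/env python
"""Sanity check: are the Hadoop 1.x daemons running on the GPHD VM?

A sketch only; assumes `jps` and `hadoop` are on the PATH and that the
stock Apache Hadoop 1.0 daemon names are used.
"""
import subprocess

EXPECTED = {"NameNode", "DataNode", "JobTracker", "TaskTracker"}

def running_daemons():
    # `jps` prints one "<pid> <MainClass>" line per local JVM process
    out = subprocess.check_output(["jps"]).decode()
    return {parts[1] for parts in (l.split() for l in out.splitlines()) if len(parts) == 2}

if __name__ == "__main__":
    up = running_daemons()
    for daemon in sorted(EXPECTED):
        print("%-12s %s" % (daemon, "running" if daemon in up else "NOT running"))
    # Cheap end-to-end HDFS check: list the root directory
    subprocess.check_call(["hadoop", "fs", "-ls", "/"])
```

If any daemon is missing, the *-site.xml files (core-site.xml, hdfs-site.xml, mapred-site.xml) are the first place to look.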
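Step 1 only needs the standard `hadoop fs` shell commands. The sketch below wraps them from Python; the local file name `abalone.data` and the HDFS directory `/user/gpadmin/abalone/input` are placeholders, and the real paths come from the scripts in Scripts/.

```python
#!/usr/bin/env python
"""Step 1 sketch: copy the small sample data set into HDFS.

`abalone.data` and `/user/gpadmin/abalone/input` are placeholder names;
substitute the paths used by the exercise scripts.
"""
import subprocess

LOCAL_FILE = "abalone.data"                # placeholder: local sample file
HDFS_DIR = "/user/gpadmin/abalone/input"   # placeholder: HDFS target directory

def hdfs(*args):
    # Thin wrapper around the `hadoop fs` command-line client
    subprocess.check_call(["hadoop", "fs"] + list(args))

if __name__ == "__main__":
    hdfs("-mkdir", HDFS_DIR)               # create the target directory
    hdfs("-put", LOCAL_FILE, HDFS_DIR)     # upload the sample data
    hdfs("-ls", HDFS_DIR)                  # verify that the file arrived
```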
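Step 2 is a map-only Hadoop Streaming job: each input record is re-emitted 1000 times with Gaussian noise added to the numeric fields. The mapper below is an illustration, not the script shipped in Scripts/; it assumes comma-separated input, keeps the categorical M/F field and the Rings field untouched, and uses an arbitrary noise scale of 1% of each value.

```python
#!/usr/bin/env python
"""Step 2 sketch: Hadoop Streaming mapper that inflates the data 1000x.

Assumptions: comma-separated input, first field is M/F (left unchanged),
last field is Rings (left unchanged), all other fields are numeric.
"""
import random
import sys

COPIES = 1000
NOISE_FRACTION = 0.01   # noise std dev as a fraction of each value (arbitrary)

for line in sys.stdin:
    fields = line.strip().split(",")
    if len(fields) < 3:
        continue
    sex = fields[0]                              # categorical, copied as-is
    rings = fields[-1]                           # target, copied as-is
    numeric = [float(v) for v in fields[1:-1]]
    for _ in range(COPIES):
        noisy = [v + random.gauss(0.0, abs(v) * NOISE_FRACTION) for v in numeric]
        print(",".join([sex] + ["%.4f" % v for v in noisy] + [rings]))
```

A typical Hadoop 1.0 streaming invocation for a map-only job would pass this script with -mapper, ship it with -file, and set the reducer count to zero; the exact jar path and options on the GPHD VM may differ, so follow the commands in Scripts/. The JobTracker UI (usually on port 50030) shows the job's map progress.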
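Step 3 fits the linear regression via the normal equations: the coefficient vector b solves (X^T X) b = X^T y, and both X^T X and X^T y are plain sums over records, so a MapReduce job can accumulate them. The sketch below is one way to do that with Hadoop Streaming and is not the Scripts/ implementation: the mapper emits each record's contribution to every cell of X^T X and X^T y, the reducer sums them, and the small d-by-d system is then solved locally (for example with numpy.linalg.solve) after the output is copied out of HDFS. It assumes comma-separated input, drops the categorical M/F field, uses Rings (the last field) as the target, and prepends an intercept term.

```python
#!/usr/bin/env python
"""Step 3 sketch: accumulate X^T X and X^T y for linear regression.

Run with argument "map" as the streaming mapper, otherwise as the reducer.
Because the reduce step is a plain sum, the same script also works as a
combiner to cut down the shuffle volume.
"""
import sys

def mapper():
    for line in sys.stdin:
        fields = line.strip().split(",")
        if len(fields) < 3:
            continue
        x = [1.0] + [float(v) for v in fields[1:-1]]  # intercept + numeric features
        y = float(fields[-1])                         # Rings, the target
        d = len(x)
        for i in range(d):
            print("XTY\t%d\t%d\t%.6f" % (i, 0, x[i] * y))
            for j in range(d):
                print("XTX\t%d\t%d\t%.6f" % (i, j, x[i] * x[j]))

def reducer():
    totals = {}
    for line in sys.stdin:
        name, i, j, v = line.strip().split("\t")
        key = (name, int(i), int(j))
        totals[key] = totals.get(key, 0.0) + float(v)
    for (name, i, j), v in sorted(totals.items()):
        print("%s\t%d\t%d\t%.6f" % (name, i, j, v))

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "map":
        mapper()
    else:
        reducer()
```

After the job finishes, `hadoop fs -get` copies the summed matrices to a local directory, where the coefficients can be recovered by solving the d-by-d system and then inspected, matching the "copy the model to a local directory, check it" step on the slide.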