R Jobs on the Cloud


An introduction to working with large data sets in R on Amazon EC2 with S3 support, covering data distribution and job fragmentation using Hadoop as the MapReduce implementation.


  1. R Jobs on the Cloud. Doxaras Yiannis, for mineknowledge.
  2. [Diagram: My Company's Cloud, Your Company's Cloud, Another Company's Cloud, Competitors' Clouds]
  3. Instance Types. Data set sizes for a single-instance R AMI: 1-2 GB, 4-5 GB, 9-10 GB.
  4. Pricing* • Small (ami.small) • Large (ami.large) • XLarge (ami.xlarge). Data received by EC2 instances costs 10¢ per GB (1024³ bytes). Data sent from EC2 instances is charged on a sliding scale, depending on the volume of data transferred during the month: 18¢/GB from 0 to 10 TB, 16¢/GB from 10 to 50 TB, and 13¢/GB for any amount over 50 TB. Data transfers between EC2 instances incur no transfer fees. Data transfers between EC2 instances and S3 buckets located in the United States are also free, but data transfers between EC2 instances and S3 buckets located in Europe incur the standard transfer fees.
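     As a quick worked example of the sliding scale (a sketch using only the rates quoted above; current AWS pricing differs), a small R helper can estimate a month's transfer bill:

        transfer_cost <- function(outbound_gb, inbound_gb = 0) {
          # Outbound traffic is tiered: 0-10 TB, 10-50 TB, over 50 TB.
          tier1 <- min(outbound_gb, 10 * 1024)
          tier2 <- min(max(outbound_gb - 10 * 1024, 0), 40 * 1024)
          tier3 <- max(outbound_gb - 50 * 1024, 0)
          # Inbound traffic is a flat 10 cents per GB.
          tier1 * 0.18 + tier2 * 0.16 + tier3 * 0.13 + inbound_gb * 0.10
        }
        transfer_cost(500)        # 500 GB out            -> $90
        transfer_cost(500, 200)   # plus 200 GB received  -> $110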
  5. EBS. Think of an EBS volume as an external hard drive: persistent storage (backed by snapshots on S3) that survives AMI reboots and failures.
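     A minimal sketch of putting an EBS volume to work, assuming the classic ec2-api-tools CLI is installed and configured; the volume, instance and device ids below are placeholders:

        # Create a 10 GB volume in the instance's availability zone, then attach it.
        system("ec2-create-volume --size 10 -z us-east-1a")               # prints vol-xxxxxxxx
        system("ec2-attach-volume vol-xxxxxxxx -i i-xxxxxxxx -d /dev/sdh")
        # On the instance itself (first use only): format and mount the new device.
        # mkfs.ext3 /dev/sdh && mkdir -p /mnt/ebs && mount /dev/sdh /mnt/ebs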
  6. Security. • Key pairs: public/private key cryptography (openssl). • Network security: the pre-configured "default" group allows intra-EC2 traffic; configure it for external communication.
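     A minimal sketch of both steps with the classic ec2-api-tools (assumed installed; the key pair name is a placeholder):

        # Create a key pair and keep the private half for ssh logins.
        system("ec2-add-keypair r-jobs-key > ec2-private-key.enc")
        # Open port 22 in the pre-configured "default" group for external access.
        system("ec2-authorize default -p 22 -s 0.0.0.0/0")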
  7. AMI Setup. • Search for the AMI manifest ID. • Image location in S3. • m1.manifest.xml.
  8. AMI Statuses. • pending (0): launching and not yet started. • running (16): launched and performing like a normal computer (though not necessarily finished booting the AMI's operating system). • shutting-down (32): in the process of terminating. • terminated (48): no longer running.
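     A minimal sketch of waiting for the running (16) state from R before logging in, assuming the classic ec2-api-tools are on the path; the instance id is a placeholder:

        wait_until_running <- function(instance_id, sleep_s = 15) {
          repeat {
            out <- system(paste("ec2-describe-instances", instance_id), intern = TRUE)
            if (any(grepl("running", out))) break   # state column reports "running"
            Sys.sleep(sleep_s)
          }
          invisible(out)
        }
        # wait_until_running("i-xxxxxxxx")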
  9. Starting Instances • ImageId* • MinCount* • MaxCount* • KeyName • SecurityGroup • InstanceType • UserData • AddressingType
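     A minimal sketch of supplying those parameters from R by shelling out to the classic ec2-api-tools (assumed installed and configured); the AMI id and key pair name are placeholders:

        launch_instance <- function(image_id, key_name, group = "default",
                                    type = "m1.small", count = 1) {
          cmd <- paste("ec2-run-instances", image_id,
                       "-n", count,        # MinCount / MaxCount
                       "-k", key_name,     # KeyName
                       "-g", group,        # SecurityGroup
                       "-t", type)         # InstanceType
          system(cmd, intern = TRUE)       # returns the RESERVATION / INSTANCE lines
        }
        # launch_instance("ami-xxxxxxxx", key_name = "r-jobs-key")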
  10. Logging@Instance. • Proper security group. • Proper public DNS entry. • RSA key authentication. • $ ssh -i ec2-private-key.enc root@ec2-67-202-4-222.z-1.compute-1.amazonaws.com • chmod 400 ec2-private-key.enc
  11. Logging@Instance (example session). The ssh login shows the EC2 public image banner ("Welcome to an EC2 Public Image :-)"), points to the getting-started notes in /etc/ec2/release-notes.txt, and ends at a root prompt: [root@domU-12-31-35-00-53-82 ~]#
  12. Register an AMI. • Bundle and upload to S3 with manifest.xml. • Register an AMI. • Describe AMI attributes. • Reset AMI attributes. • Confirm AMI product code.
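     A minimal sketch of the bundle, upload and register cycle using the Amazon AMI/API tools (assumed installed); the key, certificate, account id and bucket names are placeholders:

        # Bundle the running volume into an image under /mnt.
        system("ec2-bundle-vol -d /mnt -k pk.pem -c cert.pem -u 111122223333")
        # Upload the bundle and its manifest.xml to an S3 bucket.
        system(paste("ec2-upload-bundle -b my-r-ami-bucket",
                     "-m /mnt/image.manifest.xml",
                     "-a $AWS_ACCESS_KEY -s $AWS_SECRET_KEY"))
        # Register the manifest; the command returns the new AMI id.
        system("ec2-register my-r-ami-bucket/image.manifest.xml")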
  13. Performance Issues • Instance Type • Shared Subsystems* • Network Bandwidth • Storage Space Initialization • RAID
  14. Persistence • S3 is the main storage service for EC2 • Cloud programming involves backup mechanisms from the beginning of deployment!
  15. Persistence (continued) • Use EC2 as a cache. • Perform scheduled backups. • Perform scheduled bundling to an AMI. • Mount S3 as a local partition. • Push your luck.
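     A minimal sketch of the "perform scheduled backups" idea, assuming the s3cmd utility is installed and configured on the instance; the bucket name is a placeholder:

        backup_to_s3 <- function(files, bucket = "s3://my-r-results") {
          for (f in files) {
            status <- system(paste("s3cmd put", shQuote(f), file.path(bucket, basename(f))))
            if (status != 0) warning("upload failed for ", f)
          }
        }
        # Driven from cron on the instance, e.g. hourly:
        # 0 * * * * Rscript -e 'source("backup.R"); backup_to_s3(list.files("out", full.names = TRUE))'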
  16. Our AMI Choice • Operating System* • Software* • Auditing Actions* • Configure System Services* • Installed Amazon Building Tools* • Develop Shell Util Scripts* • Build and Upload to S3
  17. R on Fedora • Extra packages* (using R scripting; plotting to eps, pdf) • Plotting utilities? • Services? • Data distribution* (integration via web services with Oracle BI and Microsoft Reports)
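     A minimal sketch of the "plotting to eps, pdf" point: from an R script on a headless instance, write plots straight to file devices instead of a screen.

        pdf("report.pdf", width = 7, height = 5)   # or postscript("report.eps") for eps output
        plot(rnorm(100), type = "l", main = "Sample output")
        dev.off()                                  # close the device so the file is written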
  18. Demo: ssh@AMI. Tools used: ElasticFox, S3Fox, bash scripting, Python, Rscript.
  19. R Cloud Data Handling. [Diagram: R INPUT and R OUTPUT on AWS S3; EC2 instances AMI #1, AMI #2, AMI #3 plus an AMI Backup, with root, sda, sdb and sdd drives, connected over the network]
  20. Batch Processing With R
      #! /usr/bin/Rscript --vanilla --default-packages=utils
      args <- commandArgs(TRUE)
      res <- try(install.packages(args))
      if (inherits(res, "try-error")) q(status = 1) else q()

      $ R --vanilla --slave < hello_world.R
      $ R --vanilla --slave < hello_world.R > result.txt
      $ cat > print_my_args.R << EOF
      args <- commandArgs(TRUE)
      print(args)
      q()
      EOF
      $ R --slave "--args a=100 b=200" < print_my_args.R
  21. Large Data Sets • Excel, SAS, SPSS, etc. • Upload files to S3 (use scripts) • Data parallelism vs. task parallelism • Service queuing • Messaging interfaces
  22. R Data Fragmentation • No correlation-type algorithms should be used in the R scripting • Data capture and delivery • Choose the proper AMI type • Probabilistic algorithm outcomes • Consider data fragmentation in the R scripting* (* S3 integration and data preparation?)
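     A minimal sketch of fragmenting a data set before distributing it to several AMIs; the file names and chunk count are placeholders, and the whole input must still fit in memory on the machine doing the split:

        fragment_data <- function(infile, n_instances, outdir = "chunks") {
          dat <- read.csv(infile)
          dir.create(outdir, showWarnings = FALSE)
          groups <- cut(seq_len(nrow(dat)), n_instances, labels = FALSE)
          parts  <- split(dat, groups)
          for (i in seq_along(parts)) {
            write.csv(parts[[i]], file.path(outdir, sprintf("chunk_%02d.csv", i)),
                      row.names = FALSE)
          }
          invisible(list.files(outdir, full.names = TRUE))   # chunk files ready for S3 upload
        }
        # fragment_data("bigdata.csv", n_instances = 5)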
  23. To Parallelize or Not? • R is not thread safe • R stores all data in memory • Algorithms are serial processes • Solutions like Rmpi raise the learning curve
  24. Data Parallelism vs. Task Parallelism. [Diagram labelled "Parallel Agent"]
  25. R Parallel Loop: Parallelization for Task Fragmentation
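     A minimal sketch of the idea behind loop parallelization by task fragmentation: split the iteration range into fragments and hand each one to a separate worker (simulated here with lapply; on EC2 each fragment would go to its own instance):

        run_fragmented <- function(n_iter, n_fragments, fun) {
          fragments <- split(seq_len(n_iter),
                             cut(seq_len(n_iter), n_fragments, labels = FALSE))
          results <- lapply(fragments, function(idx) sapply(idx, fun))   # one call per worker
          unlist(results, use.names = FALSE)
        }
        # run_fragmented(100, n_fragments = 4, fun = sqrt)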
  26. Rmpi
      # Load the R MPI package if it is not already loaded.
      if (!is.loaded("mpi_initialize")) {
          library("Rmpi")
      }
      # Spawn as many slaves as possible
      mpi.spawn.Rslaves()
      # In case R exits unexpectedly, have it automatically clean up
      # resources taken up by Rmpi (slaves, memory, etc...)
      .Last <- function() {
          if (is.loaded("mpi_initialize")) {
              if (mpi.comm.size(1) > 0) {
                  print("Please use mpi.close.Rslaves() to close slaves.")
                  mpi.close.Rslaves()
              }
              print("Please use mpi.quit() to quit R")
              .Call("mpi_finalize")
          }
      }
      # Tell all slaves to return a message identifying themselves
      mpi.remote.exec(paste("I am", mpi.comm.rank(), "of", mpi.comm.size()))
      # Tell all slaves to close down, and exit the program
      mpi.close.Rslaves()
      mpi.quit()
  27. Rmpi: cluster configuration from inside R (same example as slide 26).
  28. REvolution Parallel R
      # Load the Parallel R stack
      require('doNWS')
      # We define the function f in our local environment
      f <- function(x) { sqrt(x) }
      # Start up two R worker processes and register them with the foreach/parallel version
      setSleigh(sleigh(workerCount = 2))
      registerDoNWS()
      # Run a simple foreach loop in parallel on the two workers
      foreach(j = 1:9, .combine = c) %dopar% f(j)
      # Note that the workers use the function f from our local environment,
      # even though it was not explicitly defined on the workers!
  29. REvolution Parallel R: the same example, annotated to highlight the foreach and iterators packages.
  30. Map Reduce
  31. Map Reduce
  32. R as a Data Worker for Hadoop. 1. Data plumbing: take apply calls in R and present them to Hadoop as input for a job; essentially, split the input vector into partitions, each of which goes to a Mapper task, then have the Reducers combine the results, which are sent back to R, and R continues its processing. 2. R algorithm parallelization: rewriting critical parts of popular algorithms implemented in R so that they can take advantage of R-Hadoop integration.
  33. R as a Data Worker for Hadoop: HadoopStreaming (same two points as slide 32).
  34. R as a Data Worker for Hadoop: HadoopStreaming (same two points as slide 32). User experiences: 400 GB -> 5 AMIs.
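     A minimal sketch of the "data plumbing" direction: an R script used as a Hadoop Streaming mapper (not the exact script from the talk), reading records from stdin and emitting key/value pairs for the reducers to combine; the job submission line and file names are illustrative:

        #! /usr/bin/Rscript --vanilla
        # mapper.R: apply a function to each numeric input record.
        con <- file("stdin", open = "r")
        while (length(line <- readLines(con, n = 1)) > 0) {
          x <- as.numeric(line)
          if (!is.na(x)) cat("sqrt\t", sqrt(x), "\n", sep = "")
        }
        close(con)
        # Submitted with Hadoop Streaming, e.g.:
        # hadoop jar hadoop-streaming.jar -input in/ -output out/ \
        #   -mapper mapper.R -reducer reducer.R -file mapper.R -file reducer.R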
  35. R Data Parallelization • Use RToolkit for "Parallel R" processing • DNS and DynDNS node configuration • Node and memory optimization • Develop the R script and distribute
  36. Further Lookup • http://calculator.s3.amazonaws.com/calc5.html • Secure EC2 Instance: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1233 • http://www.revolution-computing.com/ • http://math.acadiau.ca/ACMMaC/Rmpi/sample.html • http://finzi.psych.upenn.edu/R/library/utils/html/Rscript.html • http://www.rparallel.org/ • http://cran.r-project.org/web/views/
  37. Hopefully you have more clouded days from now on. doxaras@mineknowledge.com
