Parallel Computing for Econometricians with Amazon Web Services

  1. Parallel Computing for Econometricians with Amazon Web Services. Stephen J. Barr, University of Rochester, March 2, 2011
  2. The Old Way [image slide]
  4. The New Way [image slide]
  6. Table of Contents
     Tools Overview
     Hadoop
     Amazon Web Services
     A Simple EMR and R Example
       The R code - mapper
       Resources List
     segue and a SML Example
       Simulated Maximum Likelihood Example
       multicore - on the way to segue
       diving into segue
     Other EC2 Software Options
     Conclusion
  8. Algorithms and Implementations
     "Stupidly parallel" - e.g. a for loop where each iteration is independent.
     Only 1 computer? (need 1-8 cores) - use the R multicore package on a single EC2 node.
     Need more? Use Hadoop / MapReduce - can do complicated mapping and aggregation, in addition to the stupidly parallel stuff.
     MapReduce - use Hadoop directly (Java), Hadoop Streaming (any programming language), or the rhipe R package (R on Hadoop).
  9. In this presentation, we will be using Hadoop either directly through Elastic MapReduce or indirectly via the Segue package for R.
  10. Alternatives
      Wait a long time
      Use multiple cores, e.g. http://www.rforge.net/doc/packages/multicore/mclapply.html
      Take over the computer lab and start jobs by hand
      Buy your own cluster (huge initial cost, and it will be underutilized most of the time)
  12. What is it?
      Hadoop is made by the Apache Software Foundation, which makes open source software. Contributors to the foundation are both large companies and individuals.
        Hadoop Common: the common utilities that support the other Hadoop subprojects.
        HDFS: a distributed file system that provides high-throughput access to application data.
        MapReduce: a software framework for distributed processing of large data sets on compute clusters.
      Often, when people say "Hadoop" they mean Hadoop's implementation of the MapReduce algorithm, made by Google and documented here: http://labs.google.com/papers/mapreduce.html
  13. What is it for?
      Used to process many TB of webserver logs for metrics, targeted ad placement, etc. Users include:
        Google - calculating PageRank, processing traffic, etc.
        Yahoo - more than 100,000 CPUs in various clusters, including a 4,000 node cluster. Used for ad placement, etc.
        LinkedIn - huge social network graphs - "you may know..."
        Amazon - creating product search indices
      See: http://wiki.apache.org/hadoop/PoweredBy
  14. MapReduce Example - Word Count [diagram: input documents pass through a Map phase, where mappers emit key/document pairs such as ("This", Doc1) and ("Word", Doc3); the pairs are sorted by key, then a Reduce phase counts them, producing output such as "This", 3 and "Word", 2]
  15. Algorithm
      The idea is that the job is broken into map and reduce steps.
        The mapper processes input and creates chunks.
        The reducer aggregates the chunks.
      Hadoop provides a Java implementation of this algorithm. Features include fault-tolerance, adding nodes on the fly, extreme speed, and more. Hadoop is implemented in Java, and Hadoop Streaming allows mappers and reducers in any language, communicating over <STDIN> and <STDOUT>.
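To make the streaming idea concrete, a word-count mapper could be written in R as follows. This is an illustrative sketch, not code from the talk; `LongValueSum` is the typed-key prefix understood by Hadoop's aggregate reducer, which sums the emitted 1s per word on the reduce side.

```r
# Sketch of a Hadoop Streaming word-count mapper in R (illustration only).
# Each output line "LongValueSum:<word>\t1" tells the aggregate reducer
# to sum the 1s for that word across all mappers.
mapLine <- function(line) {
    words <- strsplit(tolower(line), "[^a-z]+")[[1]]
    words <- words[nchar(words) > 0]
    paste0("LongValueSum:", words, "\t1")
}

# On Hadoop, the mapper loops over <STDIN> and prints to <STDOUT>:
runMapper <- function() {
    con <- file("stdin", open = "r")
    while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
        cat(mapLine(line), sep = "\n")
    }
    close(con)
}
```

Locally, `mapLine("This word")` returns the two lines `LongValueSum:this\t1` and `LongValueSum:word\t1`; with `-reducer aggregate`, Hadoop turns those into per-word counts.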
  16. Hadoop Performance Statistics
      Hadoop is FAST! From the 2010 competition at http://sortbenchmark.org/ [results table omitted]
  18. What is this cloud?
      Cloud computing is the idea of abstracting away from hardware.
      All data and computing resources are managed services.
      Pay per hour, based on need.
  19. AWS Overview
      Get ready for some acronyms! Amazon Web Services (AWS) is full of them. The relevant ones are:
        EC2 - Elastic Compute Cloud - dynamically get N computers for a few cents per hour. Computers range from micro instances ($0.02/hr) to 8-core, 70GB RAM "quad-xl" instances ($2.00/hr) to GPU machines ($2.10/hr).
        EMR - Elastic MapReduce - automates the instantiation of Hadoop jobs. Builds the cluster and runs the job, completely in the background.
        S3 - Simple Storage Service - store VERY large objects in the cloud.
        RDS - Relational Database Service - a managed MySQL database. An easy way to store data and later load it into R with the package RMySQL. E.g. select date,price from myTable where TICKER='AMZN'
  20. AWS Links
      EC2 - http://aws.amazon.com/ec2/
      EMR - http://aws.amazon.com/elasticmapreduce/
        Getting started guide - http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/
      S3 - http://aws.amazon.com/s3/
  22. Steps
      1. Write the mapper in R. The output will be aggregated by Hadoop's aggregate function.
      2. Create input files
      3. Upload all to S3
      4. Configure the EMR job in the AWS Management Console
      5. Done!
  23. Files
      The directory emr.simpleExample/simpleSimRmapper contains the following:
        makeData.R generates 1000 csv files with 1,000,000 rows and 4 columns each. Each file is about 76 MB.
        fileSplit.sh takes a directory of input files and prepares them for use with EMR (more on this later).
        sjb.simpleMapper.R takes the name of a file from the command line, gets it from S3, runs a regression, and hands back the coefficients. These coefficients are then aggregated using aggregate, a standard Hadoop reducer.
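A sketch of what fileSplit.sh might do (the original script is on the author's site; this version, built on the standard split utility, is an assumption): list the input files, then split the listing into chunks so each mapper task receives one chunk of file names.

```shell
#!/bin/sh
# Hypothetical sketch of fileSplit.sh: list the input files in a
# directory, then split the listing into fixed-size chunks so that
# each EMR mapper task receives one chunk of file names.
DIR=${1:-data}      # directory holding the generated csv files
CHUNK=${2:-100}     # number of file names per list
mkdir -p "$DIR" filelists
ls "$DIR" > filelists/all.txt
# split the master list into chunks of $CHUNK lines: xaa, xab, ...
( cd filelists && split -l "$CHUNK" all.txt )
```

The resulting list files (xaa, xab, ...) are what gets uploaded to S3 as mapper input.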
  25. Mapper functions
      INPUT: <STDIN>. This can be
        a seed to a random number generator
        raw data text to process
        a list of file names to process - we are doing this one.
      OUTPUT: <STDOUT> (print it!), which next goes to the reducer.
  26. General R Mapper Code Outline
      trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
      con <- file("stdin", open = "r")
      while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
          line <- trimWhiteSpace(line)
          # process and print results
      }
      close(con)
  27. Simple Mapper
      File: sjb.simpleMapper.R. Algorithm:
        get the file from S3
        read it
        run the regression
        print results in a way that aggregate can read
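A hedged sketch of that last step, printing coefficients so the aggregate reducer can consume them. The key names and the sum-plus-count scheme for later averaging are assumptions; the real sjb.simpleMapper.R is on the author's site.

```r
# Emit regression coefficients as typed keys for Hadoop's aggregate
# reducer ("<aggregator>:<key>\t<value>"). DoubleValueSum sums each
# coefficient across files; LongValueSum counts the files, so the sums
# can be converted to averages afterwards.
emitCoefs <- function(fit) {
    cf <- coef(fit)
    for (nm in names(cf)) {
        cat(sprintf("DoubleValueSum:%s\t%.10f\n", nm, cf[[nm]]))
    }
    cat("LongValueSum:nfiles\t1\n")
}

# toy usage, with simulated data standing in for a file fetched from S3:
d <- data.frame(x = 1:10, y = 3 + 2 * (1:10))
emitCoefs(lm(y ~ x, data = d))
```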
  28. Let's run it!
  29. Overview
      1. Made some data with makeData.R.
      2. Used fileSplit.sh to make lists of files to grab from S3. These lists will be fed into the mapper. Then transferred the data and lists to S3. See moveToS3.sh for a list of commands, but don't try to run it directly.
      3. sjb.simpleMapper.R reads lines. Each line is a file. It opens the file, does some work, and prints some output.
      4. Configured the job on EMR using the AWS Management Console, using the standard aggregator to aggregate results.
  30. Numbers
      Consider this: in less than 10 minutes we
        instantiated a cluster of 13 m2.xlarge (68.4 GB RAM, 8 cores each)
        installed the Linux OS and Hadoop software on all nodes
        distributed approx. 20GB of data to the nodes
        ran some analysis in R
        aggregated the results
        shut down the cluster
  32. Useful Links
      Good EMR R Discussion
      Hadoop on EMR with C# and F#
      Hadoop Aggregate
  34. Description
      From the project website:
          "Segue has a simple goal: Parallel functionality in R; two lines of code; in under 15 minutes." - J.D. Long
      From the segue homepage: http://code.google.com/p/segue/
  35. AWS API - the segue underlying
      API stands for Application Program Interface.
      All Amazon Web Services have APIs, which allow programmatic access. This exposes many more features than the AWS Management Console.
      For example, through the API one can start and stop a cluster without adding jobs, add nodes to a running cluster, etc.
      Using the API, you can write programs and treat clusters as native objects.
      segue is such a program.
  36. segue usage
      Segue is ideal for CPU-bound applications - e.g. simulations.
      It replaces lapply, which applies a function to the elements of a list, with emrlapply, which distributes the evaluation of the function to a cluster via Elastic MapReduce.
      The list can be anything - seeds to a random number generator, matrices to invert, data frames to analyse, etc.
  38. code overview
      Note: code available on my website, http://econsteve.com/r.
      Showing 3 levels of optimization:
        for loops to matrices
        evaluating firms on multiple cores
        evaluating firms on multiple computers on EC2
  39. Simulated MLE
      We use the simulator
          \[ \ln \hat{L}_{NR} = \sum_{i=1}^{N} \ln \left[ \frac{1}{R} \sum_{r=1}^{R} \prod_{t=1}^{T_i} h(y_{it} \mid x_{it}, \theta, u_i^r) \right] \]
      where i ∈ {1, ..., N} indexes a person among people, or a firm in a set of firms, R is the number of simulations to do, with R ∝ √N, and T_i is the length of the data for firm i.
  40. With for loops - R pseudocode
      panelLogLik.simple <- function(THETA, dataList, seedMatrix) {
          logLik <- 0
          uir <- qnorm(seedMatrix)
          for (n in 1:N) {
              LiR <- 0
              for (r in 1:R) {
                  myProduct <- 1
                  alpha.r <- mu.a + uir[r, (2*n)-1] * sigma.a
                  beta.r  <- mu.b + uir[r, (2*n)]   * sigma.b
                  for (t in 1:T) {
                      # fi = residual using Y, THETA
                      myProduct <- myProduct * fi
                  }
                  LiR <- LiR + myProduct
              }  # end for r in R
              Li <- LiR / R
              logLik <- logLik + log(Li)
          }  # end for n
          return(logLik)
      }
  41. With for loops - R pseudocode
      We then maximize the likelihood function as:
          optimRes <- optim(THETA.init1, panelLogLik.simple, ...)
      This is extremely slow on one processor, and does not lend itself to parallelization. (30 min for 60 firms - didn't bother to test more.)
  42. Opt 1 - matrices, lists, lapply
      We adopt a new approach with the following rules:
        Structure the data as a list of lists, where each sublist contains the data, ticker symbol, and uir draws for the relevant coefficients.
        Make a firm (i ∈ N) likelihood function, and an outer panel likelihood function which sums the results over the firms.
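The list-of-lists these rules describe might be built like this (a sketch; the field names TICKER, DATA, UIRALPHA, and UIRBETA match the firmLikelihood code on the next slide, while the simulated numbers and dimensions are placeholders):

```r
# One sublist per firm: its data, ticker symbol, and matrices of
# standard-normal draws for the alpha and beta coefficients.
set.seed(1)
N <- 3; R <- 5; T <- 4
dataList <- lapply(1:N, function(n) {
    list(TICKER   = paste0("FIRM", n),   # hypothetical ticker label
         DATA     = data.frame(X = rnorm(T), Y = rnorm(T)),
         UIRALPHA = matrix(rnorm(R * T), nrow = R),
         UIRBETA  = matrix(rnorm(R * T), nrow = R))
})
```

Because each firm's sublist is self-contained, any flavor of lapply (serial, mclapply, emrlapply) can process the firms independently.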
  43. Opt 1 - matrices, lists, lapply - firm likelihood
      # this should be an extremely fast firm likelihood function
      firmLikelihood <- function(dataListItem, THETA, R) {
          sigma.e <- THETA[1]; mu.a <- THETA[2]; sigma.a <- THETA[3]
          mu.b <- THETA[4]; sigma.b <- THETA[5]
          data.n <- dataListItem$DATA; X.n <- data.n$X; Y.n <- data.n$Y
          T <- nrow(data.n)
          uirAlpha <- dataListItem$UIRALPHA
          uirBeta  <- dataListItem$UIRBETA
          alpha.rmat <- mu.a + uirAlpha * sigma.a
          beta.rmat  <- mu.b + uirBeta  * sigma.b
          YtStack <- repmat(Y.n, R, 1)   # repmat() as in the pracma package
          XtStack <- repmat(X.n, R, 1)
          residMat <- YtStack - alpha.rmat - XtStack * beta.rmat
          fitMat <- (1 / (sigma.e * sqrt(2 * pi))) * exp(-(residMat^2) / (2 * sigma.e^2))
          myProductVec <- apply(fitMat, 1, prod)
          Li2 <- sum(myProductVec) / R
          return(Li2)
      }
  44. The list-based outer loop
      panelLogLik.faster <- function(THETA, dataList, seedMatrix) {
          # the seed matrix has R rows and 2*N columns, where there are
          # N firms and 2 parameters of interest (alpha and beta)
          uir <- qnorm(seedMatrix)
          R <- nrow(seedMatrix)
          # notice that we can calculate the likelihoods independently for
          # each firm, so we can make a function and use lapply. This will
          # be useful for parallelization
          firmLik <- lapply(dataList, firmLikelihood, THETA, R)
          logLik <- sum(log(unlist(firmLik)))
          return(logLik)
      }
  46. The list-based outer loop - multicore
      Use the R multicore library, and replace lapply with mclapply at the outer loop.
          library(multicore)
          ...
          firmLik <- mclapply(dataList, firmLikelihood, THETA, R)
      This will lead to some substantial speedups.
  47. multicore
      N: 200, R: 150, T: 80, logLik: -34951.8. On a 4-core laptop:
          > proc.time()
             user  system elapsed
          389.180  36.960 125.674
      N: 1000, R: 320, T: 80, logLik: -174621.9. On EC2 2XL:
          > proc.time()
             user  system elapsed
          2705.77 2686.08  417.74
      N: 5000, R: 710, T: 80, logLik: -870744.4:
          > proc.time()
              user    system   elapsed
          16206.480 16067.150 2768.588
      multicore can provide quick and easy parallelization. Write the program so that the parallel part is an operation on a list, then replace lapply with mclapply.
  48. Bad [image slide]
  49. Good [image slide]
  50. multicore is nice for optimizing a local job. Most machines today have at least 2 cores. Many have 4 or 8. However, that is still only 1 machine. Let's use n of them →
  52. installing segue
      Install the prerequisite packages rJava and caTools. On Ubuntu Linux:
          sudo apt-get install r-cran-rjava r-cran-catools
      Then download and install segue: http://code.google.com/p/segue/
  53. Using segue
      Now in R we do:
          > library(segue)
      As we will be using our AWS account, we need to set credentials so that other people can't launch clusters in our name. To get our credentials, go to http://aws.amazon.com/account/ and click "Security Credentials". Back in R:
          setCredentials("ABC123", "REALLY+LONG+12312312+STRING+456456")
  54. Firing up the cluster in segue
      Use the createCluster command:
          createCluster(numInstances=2, cranPackages, filesOnNodes,
                        rObjectsOnNodes, enableDebugging=FALSE,
                        instancesPerNode, masterInstanceType="m1.small",
                        slaveInstanceType="m1.small", location="us-east-1a",
                        ec2KeyName, copy.image=FALSE,
                        otherBootstrapActions, sourcePackagesToInstall)
      In our case, let's fire up 10 m2.4xlarge. This gives us 80 cores and 684 GB of RAM to play with.
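The ten-node cluster described above might be launched like this. This is a sketch of a live AWS session, not runnable offline; the handle name myCluster is an assumption matching the later examples, and stopCluster is segue's shutdown call.

```r
library(segue)
# hypothetical launch: 10 m2.4xlarge workers = 80 cores, 684 GB RAM
myCluster <- createCluster(numInstances = 10,
                           masterInstanceType = "m2.4xlarge",
                           slaveInstanceType  = "m2.4xlarge")
# ... run emrlapply() jobs against myCluster here ...
stopCluster(myCluster)  # shut down when finished, so billing stops
```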
  55. parallel random number generation
      > myList <- NULL
      > set.seed(1)
      > for (i in 1:10) {
            a <- c(rnorm(999), NA)
            myList[[i]] <- a
        }
      > outputLocal <- lapply(myList, mean, na.rm = T)
      > outputEmr <- emrlapply(myCluster, myList, mean, na.rm = T)
      > all.equal(outputEmr, outputLocal)
      [1] TRUE
      segue handles this for you. This is very important for simulation.
  56. Monte Carlo π estimation
      estimatePi <- function(seed) {
          set.seed(seed)
          numDraws <- 1e6
          r <- .5  # radius... in case the unit circle is too boring
          x <- runif(numDraws, min = -r, max = r)
          y <- runif(numDraws, min = -r, max = r)
          inCircle <- ifelse((x^2 + y^2)^.5 < r, 1, 0)
          return(sum(inCircle) / length(inCircle) * 4)
      }
      seedList <- as.list(1:100)
      require(segue)
      myEstimates <- emrlapply(myCluster, seedList, estimatePi)
      myPi <- Reduce(sum, myEstimates) / length(myEstimates)
      > format(myPi, digits = 10)
      [1] "3.14166556"
  57. parallel MLE
      Using code from sml.segue.R on my website. It is exactly the same as the multicore example, but with the addition of 2 lines to start the cluster.
  59. EC2 has GPUs
      Cluster GPU Quadruple Extra Large Instance:
        22 GB of memory
        33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core Nehalem architecture)
        2 x NVIDIA Tesla Fermi M2050 GPUs
        1690 GB of instance storage
        64-bit platform
        I/O Performance: Very High (10 Gigabit Ethernet)
        API name: cg1.4xlarge
      The Fermi chip is important because it has ECC memory, so simulations are accurate. These cards are much more robust than gamer GPUs and cost $2800 per card. Each machine has 2. You can use them for $2.10 per hour.
  60. RHIPE
      RHIPE = R and Hadoop Integrated Processing Environment
      http://www.stat.purdue.edu/~sguha/rhipe/
      Implements the rhlapply function
      Exposes much more of Hadoop's underlying functionality, including the HDFS ⇒ may be better for large-data applications
  61. StarCluster I
      Allows instantiation of generic clusters on EC2
      Use MPI (Message Passing Interface) for much more complicated parallel programs, e.g. holding one giant matrix across the RAM of several nodes
      From their page:
        Simple configuration with sensible defaults
        Single "start" command to automatically launch and configure one or more clusters on EC2
        Support for attaching and NFS-sharing Amazon Elastic Block Storage (EBS) volumes for persistent storage across a cluster
        Comes with a publicly available Amazon Machine Image (AMI) configured for scientific computing
        AMI includes OpenMPI, ATLAS, Lapack, NumPy, SciPy, and other useful libraries
  62. StarCluster II
      Clusters are automatically configured with NFS, the Sun Grid Engine queuing system, and password-less ssh between machines
      Supports user-contributed "plugins" that allow users to perform additional setup routines on the cluster after StarCluster's defaults
      http://web.mit.edu/stardev/cluster/
  63. Matlab
      You can do it in theory, but you need either a license manager or the Matlab compiler
      It will cost you. Whitepaper from Mathworks: http://www.mathworks.com/programs/techkits/ec2_paper.html
      You may be able to coax EMR to run a compiled Matlab script, but you would have to bootstrap each machine with the libraries required to run compiled Matlab applications
      Mathworks has no incentive to support this behaviour
      Requires toolboxes ($$$)
  65. EC2 and Hadoop are Extremely Powerful
      Huge and active communities behind both Hadoop (Apache) and EC2 (Amazon).
      EC2 and AWS in general let you change the way you think about computing resources: as a service rather than as devices to manage.
      New AWS features are always being added.
  66. AWS in Education
      AMAZON WILL GIVE YOU MONEY
        Researcher - send them your proposal, they send you credits, you thank them in the paper.
        Teacher - if you are teaching a class, each student gets $100 credit, good for one year. This would be great for teaching econometrics, where you can provide a machine image with software and data already available.
      Additionally, AWS works for your backups (S3) and other tech needs.
  67. Resources
      My website http://www.econsteve.com/r for the code in this presentation
      AWS Management Console http://aws.amazon.com/console/
      AWS Blog http://aws.typepad.com
      AWS in Education http://aws.amazon.com/education/
