The document discusses parallel computing for econometric analysis using Amazon Web Services. It introduces Hadoop and MapReduce algorithms for distributed processing. It then provides a simple example of using R and Elastic MapReduce on AWS to run regressions in parallel on simulated data and aggregate the results. Various AWS services like EC2, S3, and EMR are described for flexible and scalable cloud computing.
6. Table of Contents
Tools Overview
Hadoop
Amazon Web Services
A Simple EMR and R Example
The R code - mapper
Resources List
segue and a SML Example
Simulated Maximum Likelihood Example
multicore - on the way to segue
diving into segue
Other EC2 Software Options
Conclusion
8. Algorithms and Implementations
“Stupidly parallel” - e.g. a for loop where each iteration is independent.
Only 1 computer? (need 1-8 cores) - use the R multicore package on a single EC2 node (a minimal sketch follows this list).
Need more? Use Hadoop / MapReduce - it can do complicated mapping and aggregation, in addition to the stupidly parallel stuff.
MapReduce options: use Hadoop directly (Java), Hadoop Streaming (any programming language), or the rhipe R package (R on Hadoop).
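A minimal sketch of the single-machine route, assuming nothing beyond the multicore package itself (on current R, library(parallel) provides the same mclapply):

library(multicore)  # library(parallel) on modern R; mclapply is unchanged
# Eight independent simulations, one per core - "stupidly parallel" because
# each iteration depends only on its own seed.
results <- mclapply(1:8, function(seed) {
  set.seed(seed)
  mean(rnorm(1e6))  # stand-in for a real per-iteration computation
})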
9. In this presentation, we will be using Hadoop either directly through Elastic MapReduce or indirectly via the Segue package for R.
10. Alternatives
Wait a long time
Use multicore, e.g. http://www.rforge.net/doc/packages/multicore/mclapply.html
Take over the computer lab and start jobs by hand
Buy your own cluster (huge initial cost, and it will be unutilized most of the time)
12. What is it?
Hadoop is made by the Apache Software Foundation, which makes open source software. Contributors to the foundation are both large companies and individuals.
Hadoop Common: the common utilities that support the other Hadoop subprojects.
HDFS: a distributed file system that provides high-throughput access to application data.
MapReduce: a software framework for distributed processing of large data sets on compute clusters.
Often, when people say “Hadoop” they mean Hadoop's implementation of the MapReduce algorithm, which was created by Google and documented here: http://labs.google.com/papers/mapreduce.html
13. What is it for?
Used to process many TB of webserver logs for metrics, targeted ad placement, etc.
Users include:
Google - calculating PageRank, processing traffic, etc.
Yahoo - more than 100,000 CPUs in various clusters, including one 4,000-node cluster. Used for ad placement, etc.
LinkedIn - huge social network graphs - “you may know...”
Amazon - creating product search indices
See: http://wiki.apache.org/hadoop/PoweredBy
15. Algorithm
The idea is that the job is broken into map and reduce steps: the mapper processes input and creates chunks, and the reducer aggregates the chunks (a toy illustration of the two steps follows below).
Hadoop provides a Java implementation of this algorithm. Features include fault tolerance, adding nodes on the fly, extreme speed, and more.
Hadoop itself is implemented in Java, but Hadoop Streaming allows mappers and reducers in any language, communicating over <STDIN> and <STDOUT>.
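To make the two steps concrete, here is a toy word count in plain R - not Hadoop code, just the same map-then-reduce shape on a single machine:

lines <- c("the cat sat", "the dog sat")
# map: turn each input record into intermediate values - here, individual words
mapped <- unlist(lapply(lines, function(l) strsplit(l, " ")[[1]]))
# reduce: group the intermediate values by key and aggregate - here, count them
reduced <- table(mapped)
print(reduced)  # cat: 1, dog: 1, sat: 2, the: 2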
18. What is this cloud?
Cloud computing is the idea of abstracting away from hardware.
All data and computing resources are managed services.
Pay per hour, based on need.
19. AWS Overview
Get ready for some acronyms! Amazon Web Services (AWS) is full of them. The relevant ones are:
EC2 - Elastic Compute Cloud - dynamically get N computers for a few cents per hour. Machines range from micro instances ($0.02/hr) to 8-core, 70GB-RAM “quad-xl” instances ($2.00/hr) to GPU machines ($2.10/hr).
EMR - Elastic MapReduce - automates the instantiation of Hadoop jobs. Builds the cluster and runs the job, completely in the background.
S3 - Simple Storage Service - store VERY large objects in the cloud.
RDS - Relational Database Service - a managed MySQL database. An easy way to store data and later load it into R with the RMySQL package, e.g.
select date, price from myTable where TICKER = 'AMZN'
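A hedged sketch of that last workflow (the RDS endpoint, credentials, and table name are all placeholders):

library(RMySQL)
# connect to the managed MySQL instance; every connection detail here is made up
con <- dbConnect(MySQL(), host = "mydb.abc123.us-east-1.rds.amazonaws.com",
                 user = "analyst", password = "secret", dbname = "prices")
amzn <- dbGetQuery(con, "select date, price from myTable where TICKER = 'AMZN'")
dbDisconnect(con)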
22. Steps
1. Write the mapper in R. The output will be aggregated by Hadoop's aggregate function.
2. Create input files.
3. Upload all to S3.
4. Configure the EMR job in the AWS Management Console.
5. Done!
23. Files
The directory emr.simpleExample/simpleSimRmapper contains the following:
makeData.R - generates 1000 csv files with 1,000,000 rows and 4 columns each. Each file is about 76 MB.
fileSplit.sh - takes a directory of input files and prepares them for use with EMR (more on this later).
sjb.simpleMapper.R - takes the name of a file from the command line, gets it from S3, runs a regression, and hands back the coefficients. These coefficients are then aggregated using aggregate, a standard Hadoop reducer.
25. Mapper functions
INPUT: <STDIN>. This can be:
a seed to a random number generator
raw data text to process
a list of file names to process - we are doing this one.
OUTPUT: <STDOUT> (print it!), which next goes to the reducer.
26. General R Mapper Code Outline

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  line <- trimWhiteSpace(line)

  # process and print results
}
close(con)
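Because Hadoop Streaming only ever sees <STDIN> and <STDOUT>, a mapper in this shape can be tested locally before touching a cluster, e.g. cat fileList.txt | Rscript mapper.R (where mapper.R and fileList.txt are hypothetical names for the script above and a sample input file).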
27. Simple Mapper
file: sjb.simpleMapper.R. Algorithm:
get the file from S3
read it
run the regression
print the results in a way that aggregate can read
(a hedged reconstruction follows below)
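The actual sjb.simpleMapper.R ships with the author's code; purely as a hedged reconstruction (the S3 fetch command, column names, and regression formula are all assumptions), the four steps might look like:

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  fname <- trimWhiteSpace(line)                        # each input line names one S3 object
  system(paste("hadoop fs -get", fname, "local.csv"))  # get the file from S3 (s3cmd would also work)
  d <- read.csv("local.csv")                           # read it
  fit <- lm(y ~ x1 + x2 + x3, data = d)                # run the regression (assumed columns)
  # print one key/value line per coefficient, in the form the aggregate reducer sums
  for (v in names(coef(fit))) {
    cat(sprintf("DoubleValueSum:%s\t%.10f\n", v, coef(fit)[v]))
  }
}
close(con)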
. . . . . .
29. Overview
1. Made some data with makeData.R.
2. Used fileSplit.sh to make lists of files to grab from S3.
These lists are fed into the mapper. Then transferred the
data and the lists to S3. See moveToS3.sh for the list of
commands, but don't try to run it directly.
3. sjb.simpleMapper.R reads lines. Each line is a file name.
It opens the file, does some work, and prints some output.
4. Configured the job on EMR using the AWS Management Console,
using the standard aggregator to aggregate the results.
. . . . . .
30. Numbers
Consider this: in less than 10 minutes, we
Instantiated a cluster of 13 m2.xlarge instances (68.4 GB RAM,
8 cores each)
Installed the Linux OS and Hadoop software on all nodes
Distributed approx. 20 GB of data to the nodes
Ran some analysis in R
Aggregated the results
Shut down the cluster
. . . . . .
32. Useful Links
Good EMR R Discussion
Hadoop on EMR with C# and F#
Hadoop Aggregate
. . . . . .
34. Description
From the project website:
"Segue has a simple goal: Parallel functionality in R; two
lines of code; in under 15 minutes." - J.D. Long
segue homepage: http://code.google.com/p/segue/
. . . . . .
35. AWS API - the segue underlying
API stands for Application Programming Interface.
All Amazon Web Services have APIs, which allow
programmatic access. These expose many more features than
the AWS Management Console.
For example, through the API one can start and stop a cluster
without adding jobs, add nodes to a running cluster, etc.
Using the API, you can write programs that treat clusters as
native objects.
segue is such a program.
. . . . . .
36. segue usage
Segue is ideal for CPU-bound applications, e.g. simulations.
It replaces lapply, which applies a function to the elements of
a list, with emrlapply, which distributes the evaluation of the
function to a cluster via Elastic MapReduce.
The list can be anything: seeds for a random number
generator, matrices to invert, data frames to analyse, etc.
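In sketch form, assuming credentials are already set and using the
cluster functions introduced on the next slides:
library(segue)
myCluster <- createCluster(numInstances = 2)
results <- emrlapply(myCluster, as.list(1:10), function(x) x^2)
stopCluster(myCluster)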
. . . . . .
38. code overview
Note: the code is available on my website, http://econsteve.com/r.
We show three levels of optimization:
for loops to matrices
evaluating firms on multiple cores
evaluating firms on multiple computers on EC2
. . . . . .
39. Simulated MLE
We use the simulated log-likelihood

\ln \hat{L}_{NR} = \sum_{i=1}^{N} \ln \left[ \frac{1}{R} \sum_{r=1}^{R} \prod_{t=1}^{T_i} h(y_{it} \mid x_{it}, \theta, u_i^r) \right]

where i indexes a person among N people, or a firm in a set of N
firms, R is the number of simulation draws (with R ∝ √N), and
T_i is the length of the data for firm i.
. . . . . .
40. With for loops - R pseudocode
panelLogLik.simple <- function(THETA, dataList, seedMatrix) {
  logLik <- 0
  uir <- qnorm(seedMatrix)
  for (n in 1:N) {
    LiR <- 0
    for (r in 1:R) {
      myProduct <- 1
      alpha.r <- mu.a + uir[r, (2 * n) - 1] * sigma.a
      beta.r  <- mu.b + uir[r, (2 * n)]     * sigma.b
      for (t in 1:T) {
        # fi = density of the residual, using Y and THETA
        myProduct <- myProduct * fi
      }
      LiR <- LiR + myProduct
    } # end for r in R
    Li <- LiR / R
    logLik <- logLik + log(Li)
  } # end for n
  return(logLik)
}
. . . . . .
41. With for loops - R pseudocode
We then maximize the likelihood function as:
optimRes <- optim(THETA.init1, panelLogLik.simple,
This is extremely slow on one processor, and does not lend itself
to parallelization (30 minutes for 60 firms; we didn't bother to
test more).
. . . . . .
42. Opt 1 - matrices, lists, lapply
We adopt a new approach with the following rules:
Structure the data as a list of lists, where each sublist contains
the data, the ticker symbol, and the uir draws for the relevant
coefficients (a sketch follows this list)
Make a firm-level (i ∈ N) likelihood function, and an outer panel
likelihood function which sums the results over the firms
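A hedged sketch of that list of lists, using the field names that
firmLikelihood expects on the next slide (tickers, Y, X, and uir
are assumed to exist already):
dataList <- lapply(1:N, function(n) {
  list(TICKER   = tickers[n],                         # ticker symbol
       DATA     = data.frame(Y = Y[, n], X = X[, n]), # the firm's panel
       UIRALPHA = uir[, (2 * n) - 1],                 # draws for alpha
       UIRBETA  = uir[, (2 * n)])                     # draws for beta
})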
. . . . . .
43. Opt 1 - matrices, lists, lapply - firm likelihood
# this should be an extremely fast firm likelihood function
firmLikelihood <- function(dataListItem, THETA, R) {
  sigma.e <- THETA[1]; mu.a <- THETA[2]; sigma.a <- THETA[3]
  mu.b <- THETA[4]; sigma.b <- THETA[5]
  data.n <- dataListItem$DATA; X.n <- data.n$X; Y.n <- data.n$Y
  T <- nrow(data.n)
  uirAlpha <- dataListItem$UIRALPHA
  uirBeta  <- dataListItem$UIRBETA
  alpha.rmat <- mu.a + uirAlpha * sigma.a
  beta.rmat  <- mu.b + uirBeta  * sigma.b
  YtStack <- repmat(Y.n, R, 1)
  XtStack <- repmat(X.n, R, 1)
  residMat <- YtStack - alpha.rmat - XtStack * beta.rmat
  fitMat <- (1 / (sigma.e * sqrt(2 * pi))) *
              exp(-(residMat^2) / (2 * sigma.e^2))
  myProductVec <- apply(fitMat, 1, prod)
  Li2 <- sum(myProductVec) / R
  return(Li2)
}
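Note that repmat is not a base R function. A common one-line
definition, borrowed from the Matlab idiom, is:
# tile matrix a in an n-by-m block pattern, like Matlab's repmat
repmat <- function(a, n, m) kronecker(matrix(1, n, m), a)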
. . . . . .
44. The list-based outer loop
panelLogLik.faster <- function(THETA, dataList, seedMatrix) {
  # the seed matrix has R rows and 2*N columns, where there are
  # N firms and 2 parameters of interest (alpha and beta)
  uir <- qnorm(seedMatrix)
  R <- nrow(seedMatrix)
  # notice that we can calculate the likelihoods independently for
  # each firm, so we can make a function and use lapply. This will
  # be useful for parallelization
  firmLik <- lapply(dataList, firmLikelihood, THETA, R)
  logLik <- sum(log(unlist(firmLik)))
  return(logLik)
}
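As before, the outer function is handed to optim. A hedged sketch
(optim minimizes by default, so we flip the sign via fnscale):
optimRes <- optim(THETA.init1, panelLogLik.faster,
                  dataList = dataList, seedMatrix = seedMatrix,
                  control = list(fnscale = -1))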
. . . . . .
46. The list-based outer loop - multicore
Use the R multicore library, and replace lapply with mclapply at
the outer loop:
library(multicore)
...
firmLik <- mclapply(dataList, firmLikelihood, THETA, R)
This will lead to substantial speedups.
. . . . . .
47. multicore
N: 200, R: 150, T: 80, logLik: -34951.8. On a 4-core laptop:
> proc.time()
    user   system  elapsed
 389.180   36.960  125.674
N: 1000, R: 320, T: 80, logLik: -174621.9. On an EC2 2XL:
> proc.time()
    user   system  elapsed
 2705.77  2686.08   417.74
N: 5000, R: 710, T: 80, logLik: -870744.4:
> proc.time()
      user    system   elapsed
 16206.480 16067.150  2768.588
multicore can provide quick and easy parallelization. Write the
program so that the parallel part is an operation on a list, then
replace lapply with mclapply.
. . . . . .
50. multicore is nice for optimizing a local job.
Most machines today have at least 2 cores, and many have 4 or 8.
However, that is still only one machine. Let's use n of them →
. . . . . .
52. installing segue
Install the prerequisite packages rJava and caTools. On Ubuntu
Linux:
sudo apt-get install r-cran-rjava r-cran-catools
Then, download and install segue
http://code.google.com/p/segue/
. . . . . .
53. Using segue
Now in R we do:
> library(segue)
Since we will be using our AWS account, we need to set credentials
so that other people can't launch clusters in our name.
To get the credentials, go to http://aws.amazon.com/account/ and
click "Security Credentials".
Then, back in R:
setCredentials("ABC123", "REALLY+LONG+12312312+STRING+456456")
. . . . . .
54. Firing up the cluster in segue
Use the createCluster command:
createCluster(numInstances = 2, cranPackages, filesOnNodes,
  rObjectsOnNodes, enableDebugging = FALSE, instancesPerNode,
  masterInstanceType = "m1.small", slaveInstanceType = "m1.small",
  location = "us-east-1a", ec2KeyName, copy.image = FALSE,
  otherBootstrapActions, sourcePackagesToInstall)
In our case, let's fire up 10 m2.4xlarge instances. This gives us
80 cores and 684 GB of RAM to play with.
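A call matching that signature might look like this (the instance
count and types are our choices, not segue defaults):
myCluster <- createCluster(numInstances = 10,
                           masterInstanceType = "m2.4xlarge",
                           slaveInstanceType  = "m2.4xlarge")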
. . . . . .
55. parallel random number generation
> myList <- NULL
> set.seed(1)
> for (i in 1:10) {
    a <- c(rnorm(999), NA)
    myList[[i]] <- a
  }
> outputLocal <- lapply(myList, mean, na.rm = T)
> outputEmr <- emrlapply(myCluster, myList, mean, na.rm = T)
> all.equal(outputEmr, outputLocal)
[1] TRUE
segue handles this for you. This is very important for simulation.
. . . . . .
56. Monte Carlo π estimation
estimatePi <- function(seed) {
  set.seed(seed)
  numDraws <- 1e6
  r <- .5  # radius... in case the unit circle is too boring
  x <- runif(numDraws, min = -r, max = r)
  y <- runif(numDraws, min = -r, max = r)
  inCircle <- ifelse((x^2 + y^2)^.5 < r, 1, 0)
  return(sum(inCircle) / length(inCircle) * 4)
}
seedList <- as.list(1:100)
require(segue)
myEstimates <- emrlapply(myCluster, seedList, estimatePi)
myPi <- Reduce(sum, myEstimates) / length(myEstimates)
> format(myPi, digits = 10)
[1] "3.14166556"
. . . . . .
57. parallel MLE
Using the code from sml.segue.R on my website. It is exactly the
same as the multicore example, but with the addition of two lines
to start the cluster, as sketched below.
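A hedged sketch of the change (the cluster size is an assumption):
library(segue)
myCluster <- createCluster(numInstances = 10)  # the two added lines
# ... and emrlapply replaces mclapply in the outer loop:
firmLik <- emrlapply(myCluster, dataList, firmLikelihood, THETA, R)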
. . . . . .
59. EC2 has GPUs
Cluster GPU Quadruple Extra Large Instance:
22 GB of memory
33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core
Nehalem architecture)
2 x NVIDIA Tesla "Fermi" M2050 GPUs
1690 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
API name: cg1.4xlarge
The Fermi chips matter because they have ECC memory, so
simulations are accurate. They are much more robust than gamer
GPUs and cost about $2,800 per card; each machine has two. You can
use one for $2.10 per hour.
. . . . . .
60. RHIPE
RHIPE = R and Hadoop Integrated Processing Environment
http://www.stat.purdue.edu/~sguha/rhipe/
Implements the rhlapply function
Exposes much more of Hadoop's underlying functionality,
including HDFS, the Hadoop Distributed File System
May be better suited for large-data applications
. . . . . .
61. StarCluster I
Allows instantiation of generic clusters on EC2
Use MPI (Message Passing Interface) for much more
complicated parallel programs, e.g. holding one giant matrix
across the RAM of several nodes
From their page:
Simple configuration with sensible defaults
Single "start" command to automatically launch and
configure one or more clusters on EC2
Support for attaching and NFS-sharing Amazon Elastic Block
Storage (EBS) volumes for persistent storage across a cluster
Comes with a publicly available Amazon Machine Image
(AMI) configured for scientific computing
The AMI includes OpenMPI, ATLAS, LAPACK, NumPy, SciPy, and
other useful libraries
. . . . . .
62. StarCluster II
Clusters are automatically configured with NFS, the Sun Grid
Engine queuing system, and password-less ssh between
machines
Supports user-contributed "plugins" that allow users to
perform additional setup routines on the cluster after
StarCluster's defaults
http://web.mit.edu/stardev/cluster/
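For a sense of the workflow, launching, using, and shutting down a
cluster is roughly (the cluster name is assumed):
$ starcluster start mycluster      # boot and configure the nodes
$ starcluster sshmaster mycluster  # log in to the master node
$ starcluster terminate mycluster  # shut everything down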
. . . . . .
63. Matlab
You can do it in theory, but you need either a license manager
or the Matlab Compiler
Either way, it will cost you
Whitepaper from Mathworks:
http://www.mathworks.com/programs/techkits/ec2_paper.html
You may be able to coax EMR into running a compiled Matlab
script, but you would have to bootstrap each machine with the
libraries required to run compiled Matlab applications
Mathworks has no incentive to support this behaviour
Requires toolboxes ($$$)
. . . . . .
65. EC2 and Hadoop are Extremely Powerful
Huge and active community behind both Hadoop (Apache)
and EC2 (Amazon).
EC2, and AWS in general, let you change the way you think
about computing resources: as a service rather than as
devices to manage.
New AWS features are being added all the time.
. . . . . .
66. AWS in Education
AMAZON WILL GIVE YOU MONEY
Researchers - send them your proposal, they send you credits,
and you thank them in the paper.
Teachers - if you are teaching a class, each student gets a $100
credit, good for one year. This would be great for teaching
econometrics, since you can provide a machine image with the
software and data already in place.
Additionally, use AWS for your backups (S3) and other tech needs.
. . . . . .
67. Resources
My website, http://www.econsteve.com/r, for the code in this
presentation
AWS Management Console: http://aws.amazon.com/console/
AWS Blog: http://aws.typepad.com
AWS in Education: http://aws.amazon.com/education/
. . . . . .