Workshop: From Zero to Hadoop
Budapest DW Forum 2014
Agenda today
1. Some setup before we start
2. (Back to the) introduction
3. Our workshop today
4. Parts 1–3: Pig Latin and Scalding jobs on EMR
Some setup before we start
There is a lot to copy and paste – so let’s all join a Google Hangout chat:
http://bit.ly/1xgSQId
• If I forget to paste some content into the chat room, just shout
out and remind me
First, let’s all download and set up VirtualBox and Vagrant:
http://docs.vagrantup.com/v2/installation/index.html
https://www.virtualbox.org/wiki/Downloads
Now let’s set up our development environment
$ vagrant plugin install vagrant-vbguest
If you have git already installed:
$ git clone --recursive https://github.com/snowplow/dev-environment.git
If not:
$ wget https://github.com/snowplow/dev-environment/archive/temp.zip
$ unzip temp.zip
$ wget https://github.com/snowplow/ansible-playbooks/archive/temp.zip
$ unzip temp.zip
Now let’s set up our development environment
$ cd dev-environment
$ vagrant up
$ vagrant ssh
Final step for now, let’s install some software
$ ansible-playbook /vagrant/ansible-playbooks/aws-tools.yml --inventory-file=/home/vagrant/ansible_hosts --connection=local
$ ansible-playbook /vagrant/ansible-playbooks/scala-sbt.yml --inventory-file=/home/vagrant/ansible_hosts --connection=local
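(As the playbook names suggest, aws-tools.yml installs AWS command-line tooling into the VM, and scala-sbt.yml installs Scala and the sbt build tool – we will need both later.)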
(Back to the) introduction
Snowplow is an open-source web and event analytics platform,
built on Hadoop
• Co-founders Alex Dean and Yali Sassoon met at OpenX, the open-source ad-technology business, in 2008
• We released Snowplow as a skunkworks prototype at the start of 2012:
github.com/snowplow/snowplow
• We built Snowplow on top of Hadoop from the
very start
We wanted to take a fresh approach to web analytics
• Your own web event data -> in your own data warehouse
• Your own event data model
• Slice / dice and mine the data in highly bespoke ways to answer your
specific business questions
• Plug in the broadest possible set of analysis tools to drive value from your
data
[Diagram: a data pipeline feeding a data warehouse, so you can analyse your data in any analysis tool]
And we saw the potential of new “big data” technologies and
services to solve these problems in a scalable, low-cost manner
These tools make it possible to capture, transform, store and analyse all your
granular, event-level data, so you can perform any analysis
[Diagram: CloudFront → Amazon S3 → Amazon EMR → Amazon Redshift]
Our Snowplow event processing flow runs on Hadoop,
specifically Amazon’s Elastic MapReduce hosted Hadoop service
[Diagram: the Snowplow Hadoop data pipeline – website / webapp with JavaScript event tracker → CloudFront-based or Clojure-based event collector → Scalding-based enrichment on Hadoop → Amazon S3 → Amazon Redshift / PostgreSQL]
Why did we pick Hadoop?
• Scalability: we have customers processing 350m Snowplow events a day in Hadoop – runs in <2 hours
• Easy to reprocess data: if business rules change, we can fire up a large cluster and re-process all historical raw Snowplow events
• Highly testable: we write unit and integration tests for our jobs and run them locally, giving us confidence that our jobs will run correctly at scale on Hadoop
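As an aside, here is a minimal sketch of what one of those local tests can look like, using Scalding’s JobTest harness against the word count job we build in Part 2 (illustrative only – it assumes Specs2 on the classpath, and the field types in the real project’s spec may differ slightly):

import com.twitter.scalding._
import org.specs2.mutable.Specification

class WordCountJobSpec extends Specification {
  "A WordCount job" should {
    JobTest("com.snowplowanalytics.hadoop.scalding.WordCountJob")
      .arg("input", "inputFile")
      .arg("output", "outputFile")
      // Feed the job fake input as (line offset, line) pairs
      .source(TextLine("inputFile"), List((0, "Hello world"), (1, "Goodbye world")))
      // Intercept what the job writes and assert on it
      .sink[(String, Int)](Tsv("outputFile")) { buffer =>
        "count words correctly" in {
          buffer.toMap must_== Map("hello" -> 1, "goodbye" -> 1, "world" -> 2)
        }
      }
      .run
      .finish
  }
}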
And why Amazon’s Elastic MapReduce (EMR)?
• No need to run our own cluster: running your own Hadoop cluster is a huge pain – not for the fainthearted. By contrast, EMR just works (most of the time!)
• Elastic: Snowplow runs as a nightly (sometimes more frequent) batch job. We spin up the EMR cluster to run the job, and shut it down straight after
• Interop with other AWS services: EMR works really well with Amazon S3 as a file store. We are big fans of Amazon Redshift (hosted columnar database) too
Our workshop today
Hadoop is complicated…
… for our workshop today, we will stick to using Elastic
MapReduce and try to avoid any unnecessary complexity
… and we will learn by doing!
• Lots of books and articles about Hadoop and the theory of
MapReduce
• We will learn by doing – no theory unless it’s required to
directly explain the jobs we are creating
• Our priority is to get you up-and-running on Elastic
MapReduce, and confident enough to write your own
Hadoop jobs
Part 1: a simple Pig Latin job on EMR
What is Pig (Latin)?
• Pig is a high-level platform for creating MapReduce jobs which can run
on Hadoop
• The language you write Pig jobs in is called Pig Latin
• For quick-and-dirty scripts, Pig just works
[Diagram: the Hadoop stack – Hadoop DFS and Hadoop MapReduce at the base, Java on top of them, and Crunch, Cascading, Hive and Pig as higher-level layers]
Let’s all come up with a unique name for ourselves
• Lowercase letters, no spaces or hyphens or anything
• E.g. I will be alexsnowplow – please come up with a unique name for
yourself!
• It will be visible to other participants so choose something you don’t mind being public :)
• In the rest of this workshop, wherever you see YOURNAME, replace it
with your unique name
Let’s restart our Vagrant and do some setup
$ mkdir zero2hadoop
$ aws configure
// And type in:
AWS Access Key ID [None]: AKIAILD6DCBTFI642JPQ
AWS Secret Access Key [None]: KMVdr/bsq4FDTI5H143K3gjt4ErG2oTjd+1+a+ou
Default region name [None]: eu-west-1
Default output format [None]:
Let’s create some buckets in Amazon S3 – this is where our data
and our apps will live
$ aws s3 mb s3://zero2hadoop-in-YOURNAME
$ aws s3 mb s3://zero2hadoop-out-YOURNAME
$ aws s3 mb s3://zero2hadoop-jobs-YOURNAME
// Check those worked
$ aws s3 ls
Let’s get some source data uploaded
$ mkdir -p ~/zero2hadoop/part1/in
$ cd ~/zero2hadoop/part1/in
$ wget https://raw.githubusercontent.com/snowplow/scalding-example-project/master/data/hello.txt
$ cat hello.txt
Hello world
Goodbye world
$ aws s3 cp hello.txt s3://zero2hadoop-in-YOURNAME/part1/hello.txt
Let’s get our EMR command-line tools installed (1/2)
$ /vagrant/emr-cli/elastic-mapreduce
$ rvm install ruby-1.8.7-head
$ rvm use 1.8.7
$ alias emr=/vagrant/emr-cli/elastic-mapreduce
Let’s get our EMR command-line tools installed (2/2)
Add this file:
{
  "access_id": "AKIAI55OSYYRLYWLXH7A",
  "private_key": "SHRXNIBRdfWuLPbCt57ZVjf+NMKUjm9WTknDHPTP",
  "region": "eu-west-1"
}
to: /vagrant/emr-cli/credentials.json
(If the VM’s clock has drifted, sync it first: $ sudo sntp -s 24.56.178.140)
Now let’s check the EMR command-line tools work
// This should work fine now:
$ emr --list
<no output – we have no job flows running yet>
Let’s do some local file work
$ mkdir -p ~/zero2hadoop/part1/pig
$ cd ~/zero2hadoop/part1/pig
$ wget https://gist.githubusercontent.com/alexanderdean/d8371cebdf00064591ae/raw/cb3030a6c48b85d101e296ccf27331384df3288d/wordcount.pig
// The original: https://gist.github.com/alexanderdean/d8371cebdf00064591ae
Now upload to S3
$ aws s3 cp wordcount.pig s3://zero2hadoop-jobs-YOURNAME/part1/
$ aws s3 ls --recursive s3://zero2hadoop-jobs-YOURNAME/part1/
2014-06-06 09:10:31        674 part1/wordcount.pig
And now we run our Pig script
$ emr --create --name "part1 YOURNAME" \
  --set-visible-to-all-users true \
  --pig-script s3n://zero2hadoop-jobs-YOURNAME/part1/wordcount.pig \
  --ami-version 2.0 \
  --args "-p,INPUT=s3n://zero2hadoop-in-YOURNAME/part1,-p,OUTPUT=s3n://zero2hadoop-out-YOURNAME/part1"
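(A quick note on the arguments: --ami-version 2.0 selects the EMR machine image, and the --args string passes -p,INPUT=… and -p,OUTPUT=… through to Pig as parameters, which the script presumably references as $INPUT and $OUTPUT.)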
Let’s check out the jobs running in Elastic MapReduce – first at
the console
$ emr --list
j-1HR90SWPP40M4     STARTING            part1 YOURNAME
   PENDING        Setup Pig
   PENDING        Run Pig Script
and also in the UI
Okay let’s check the output of our job! (1/2)
$ aws s3 ls --recursive s3://zero2hadoop-out-YOURNAME/part1
2014-06-06 09:57:53          0 part1/_SUCCESS
2014-06-06 09:57:50         26 part1/part-r-00000
Okay let’s check the output of our job! (2/2)
$ mkdir -p ~/zero2hadoop/part1/out
$ cd ~/zero2hadoop/part1/out
$ aws s3 cp --recursive s3://zero2hadoop-out-YOURNAME/part1 .
$ ls
part-r-00000 _SUCCESS
$ cat part-r-00000
2 world
1 Hello
1 Goodbye
Part 2: a simple Scalding job on EMR
What is Scalding?
• Scalding is a Scala API over Cascading, the Java framework for building
data processing pipelines on Hadoop:
[Diagram: the Hadoop stack again – Hadoop DFS and Hadoop MapReduce at the base, Java on top, then Cascading alongside Pig and others, with Scalding, Cascalog, PyCascading and cascading.jruby layered over Cascading]
Cascading has a “plumbing” abstraction over vanilla MapReduce
which should be quite comfortable to DW practitioners
Scalding improves further on Cascading by reducing boilerplate
and making more complex pipelines easier to express
• Scalding is written in Scala – this removes a lot of boilerplate versus vanilla Cascading, making it easier to look at a job in its entirety and see what it does
• Scalding was created and is supported by Twitter, who use it throughout their organization
• We believe that data pipelines should be as strongly typed as possible – all the other DSLs/APIs on top of Cascading encourage dynamic typing
Strongly typed data pipelines – why?
• Catch errors as soon as possible – and report them in a strongly typed way too
• Define the inputs and outputs of each of your data processing steps in an
unambiguous way
• Forces you to formally address the data types flowing through your system
• Lets you write code like this:
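(The original slide showed a code screenshot that isn’t preserved in this transcript. The sketch below illustrates the idea with Scalding’s typed API – the PageView case class, its fields and TopPagesJob are hypothetical names, not from the slide:)

import com.twitter.scalding._

// Hypothetical event type – the compiler now knows exactly what flows through the pipe
case class PageView(userId: String, path: String, durationMs: Long)

class TopPagesJob(args: Args) extends Job(args) {
  TypedPipe.from(TypedTsv[(String, String, Long)](args("input")))
    .map { case (user, path, ms) => PageView(user, path, ms) } // typed from here on
    .filter(_.durationMs > 0)   // a typo like .durationMS would fail to compile
    .groupBy(_.path)
    .size                       // (path, view count)
    .write(TypedTsv[(String, Long)](args("output")))
}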
Okay let’s get started!
• Head to https://github.com/snowplow/scalding-example-project
Let’s get this code down locally and build it
$ mkdir -p ~/zero2hadoop/part2
$ cd ~/zero2hadoop/part2
$ git clone git://github.com/snowplow/scalding-example-project.git
$ cd scalding-example-project
$ sbt assembly
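(sbt assembly compiles the project, runs the test suite – as the next slide shows, this project runs its tests as part of assembly – and packages the job plus its dependencies into a single “fat jar” under target/scala-2.10/, which is the file we will submit to EMR.)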
Here is our MapReduce code
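(The slide showed the job’s source, which isn’t preserved in this transcript. It is essentially the classic Scalding word count – roughly the sketch below; see the repository for the authoritative version:)

package com.snowplowanalytics.hadoop.scalding

import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  // Read raw lines, split each into words, count occurrences of each word
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))

  // Lowercase the text, strip punctuation, split on whitespace
  def tokenize(text: String): Array[String] =
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
}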
Good, tests are passing, now let’s upload this to S3 so it’s available to our EMR job
$ aws s3 cp target/scala-2.10/scalding-example-project-0.0.5.jar s3://zero2hadoop-jobs-YOURNAME/part2/
// If that doesn’t work:
$ aws s3 cp s3://snowplow-hosted-assets/third-party/scalding-example-project-0.0.5.jar s3://zero2hadoop-jobs-YOURNAME/part2/
$ aws s3 ls s3://zero2hadoop-jobs-YOURNAME/part2/
And now we run it!
$ emr --create --name "part2 YOURNAME" \
  --set-visible-to-all-users true \
  --jar s3n://zero2hadoop-jobs-YOURNAME/part2/scalding-example-project-0.0.5.jar \
  --arg com.snowplowanalytics.hadoop.scalding.WordCountJob \
  --arg --hdfs \
  --arg --input --arg s3n://zero2hadoop-in-YOURNAME/part1/hello.txt \
  --arg --output --arg s3n://zero2hadoop-out-YOURNAME/part2
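(The first --arg names the Scalding job class inside the jar to run; --hdfs tells Scalding to run on the Hadoop cluster rather than in local mode; --input and --output are picked up by the job through its Args. Note that we deliberately reuse Part 1’s hello.txt as the input.)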
Let’s check out the jobs running in Elastic MapReduce – first at
the console
$ emr --list
j-1M62IGREPL7I      STARTING            scalding-example-project
   PENDING        Example Jar Step
and also in the UI
Okay let’s check the output of our job!
$ aws s3 ls --recursive s3://zero2hadoop-out-YOURNAME/part2
$ mkdir -p ~/zero2hadoop/part2/out
$ cd ~/zero2hadoop/part2/out
$ aws s3 cp --recursive s3://zero2hadoop-out-YOURNAME/part2 .
$ ls
$ cat part-00000
goodbye 1
hello 1
world 2
Part 3: a more complex Scalding job on EMR
Let’s explore another tutorial together
https://github.com/sharethrough/scalding-emr-tutorial
Questions?
http://snowplowanalytics.com
https://github.com/snowplow/snowplow
@snowplowdata
To talk offline – @alexcrdean on Twitter or alex@snowplowanalytics.com