Getting started with Hadoop, Hive,
and Elastic MapReduce
Hunter Blanks, hblanks@monetate.com / @tildehblanks
github.com/hblanks/talks/




PhillyAWS. January 12, 2012
Overview
In the figure, 4 inputs are mapped to 16 (key, value) tuples.

These tuples are sorted by key and then reduced into 4 new (key, value) tuples.
Finally, these new tuples are once again sorted and reduced, and so on...

source: http://code.google.com/p/mapreduce-framework/wiki/MapReduce
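
The map, sort, and reduce stages can be sketched in plain Python, with no Hadoop involved. This is only an illustration: the function names (map_words, reduce_counts, run_mapreduce) and the word-count task are assumptions chosen for the example, not part of any framework.

```python
from itertools import groupby
from operator import itemgetter


def map_words(line):
    """Map one input line to (word, 1) tuples."""
    for word in line.split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce all values for one key to a single (key, value) tuple."""
    return word, sum(counts)


def run_mapreduce(lines):
    # Map: every input produces zero or more (key, value) tuples.
    mapped = [pair for line in lines for pair in map_words(line)]
    # Sort: bring tuples with the same key next to each other.
    mapped.sort(key=itemgetter(0))
    # Reduce: collapse each run of identical keys into one tuple.
    return [
        reduce_counts(key, (value for _, value in group))
        for key, group in groupby(mapped, key=itemgetter(0))
    ]


if __name__ == "__main__":
    for word, count in run_mapreduce(["a b a", "b c"]):
        print(word, count)
```

A real cluster does the same three things, but spreads the map and reduce calls across machines and performs the sort as a distributed shuffle.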
Timeline

2004: MapReduce paper published.
Between 2004 and 2008: Doug Cutting (and other Yahoos) start work on Hadoop.
2008: Hadoop wins terabyte sort benchmark.
2009: Elastic MapReduce launched. Facebook presents first paper on Hive.
2010: mrjob open sourced.
An overview of available (and not available) tools
1) MapReduce. Google's C++ implementation, which sits atop GFS, Chubby,
BigTable, and who knows what else. See also Sawzall.

2) Hadoop. Apache's/Facebook's/Yahoo's Java implementation, which
also includes, or else integrates with: HDFS (a reimplementation of
GFS), ZooKeeper (a reimplementation of Chubby), and HBase (a
reimplementation of BigTable). See also Pig (not exactly Sawzall, but another
take on it).

3) Hive. Apache's/Facebook's data warehousing system, which allows users to
query large datasets using SQL-like syntax, over Hadoop.

4) Elastic MapReduce. Amazon's API for spinning up temporary Hadoop
clusters (aka "Jobflows"), typically reading input and writing output to S3. It is
much, much easier than bringing up your own Hadoop cluster.

5) mrjob. Yelp's Python framework for running MapReduce jobs on Hadoop
clusters, with or without Elastic MapReduce.
Terms for Elastic MapReduce
1) job flow. A transient (or semi-transient) Hadoop cluster. Jobflows typically
terminate after all their steps are done, but they can also be flagged as
"permanent," which just means they don't shut down on their own.

2) master, core, and task nodes. A jobflow has one master node; when it
dies, your jobflow dies (or is at least supposed to). It has a fixed number of
core nodes (determined when the jobflow was started) -- these core nodes can
do map/reduce tasks, and also serve HDFS. It can also have a changeable
number of task nodes, which do tasks but don't serve HDFS.

3) step. A jobflow contains one or more "steps". These may or may not be
actual Hadoop "jobs" -- sometimes a step is just for initialization, other times it
is an actual map/reduce job.

4) spot instance vs on-demand. By default, any job flow you start will use
normal, "on-demand" EC2 instances, with a fixed hourly price. You may
alternatively request spot instances by naming a maximum price (a bid price)
you're willing to pay. If your bid price is above the current spot price, then you
will pay the spot price for your instances (typically 50-66% cheaper than
on-demand instances).
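
To make the spot pricing arithmetic concrete, here is a small sketch. It assumes a simplified model in which a request runs only while the bid is at or above the current spot price, and running instances are billed at the spot price rather than the bid. The function name hourly_cost and all prices are hypothetical.

```python
def hourly_cost(bid_price, spot_price, instance_count):
    """Return the hourly cost of a spot request, or None if it is not fulfilled.

    Simplified model (an assumption): instances run only while the bid is
    at or above the spot price, and are billed at the spot price.
    """
    if bid_price < spot_price:
        return None  # bid too low: no instances are running
    return spot_price * instance_count


if __name__ == "__main__":
    on_demand = 0.34  # hypothetical on-demand price, USD per hour
    spot = 0.12       # hypothetical current spot price, USD per hour
    cost = hourly_cost(bid_price=0.20, spot_price=spot, instance_count=10)
    savings = 1 - spot / on_demand
    print(f"cost per hour: {cost:.2f}, savings vs on-demand: {savings:.0%}")
```

Note that bidding higher does not raise the bill in this model; it only lowers the chance that the instances are reclaimed when the spot price moves.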
The easiest ways to get started with Elastic MapReduce, #1
If you know Python, start with mrjob.

Here's a sample MRJob class, which you might write to wordcounter.py:

    from mrjob.job import MRJob

    class MRWordCounter(MRJob):
        def mapper(self, key, line):
            for word in line.split():
                yield word, 1

        def reducer(self, word, occurrences):
            yield word, sum(occurrences)

    if __name__ == '__main__':
        MRWordCounter.run()



And how you run it:

    export AWS_ACCESS_KEY_ID=...; export AWS_SECRET_ACCESS_KEY=...
    python wordcounter.py -r emr < input > output

These samples, and much more documentation, are at
http://packages.python.org/mrjob/writing-and-running.html.
The easiest ways to get started with Elastic MapReduce, #2
If you know any other scripting language, start with the AWS elastic-mapreduce
command line tool, or the AWS web console. Both are outlined at:

  http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/

An example of the command line, assuming a file wordSplitter.py, is:

   ./elastic-mapreduce --create --stream \
       --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
       --input s3://elasticmapreduce/samples/wordcount/input \
       --output s3://mybucket.foo.com/wordcount \
       --reducer aggregate

A (not so) quick reference card for the command line tool is at:

  http://s3.amazonaws.com/awsdocs/ElasticMapReduce/latest/emr-qrc.pdf
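
For a sense of what a streaming mapper for this command looks like, here is a minimal sketch. It is not the contents of the sample file referenced above; the function name map_lines and the word pattern are assumptions. What it relies on is that Hadoop streaming passes input lines on stdin, reads tab-separated records from stdout, and that the built-in aggregate reducer sums values whose keys carry the LongValueSum: prefix.

```python
#!/usr/bin/env python3
import re
import sys

# Assumed definition of a "word": a letter followed by letters or digits.
WORD = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def map_lines(lines):
    """Yield one tab-separated 'LongValueSum:<word>' / '1' record per word."""
    for line in lines:
        for word in WORD.findall(line):
            # The prefix tells the aggregate reducer to sum the values.
            yield "LongValueSum:%s\t1" % word.lower()


if __name__ == "__main__":
    # Hadoop streaming feeds input on stdin and reads records from stdout.
    for record in map_lines(sys.stdin):
        print(record)
```

Because the mapper is just a filter over stdin, it can be checked locally by piping a text file through it before paying for a jobflow.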
Getting started with Hive is not so easy
1) You're going to need to put your data into some sort of warehouse, probably in S3.
(HDFS on a jobflow does not persist once the jobflow terminates.) This may involve its
own MapReduce step, but with careful attention to how you output your files to S3.
Partitioning matters, but you won't get it with just the default Hadoop streaming jar!
(We use oddjob's.)

2) To start a jobflow with Hive programmatically (i.e., not using the command line
elastic-mapreduce client), you need to add several initial steps to the jobflow -- notably
ones that start Hive and possibly configure its heapsize. Fortunately, Amazon's
elastic-mapreduce client is easy enough to reverse engineer...

3) You're going to need to make a schema (surprise) for your data warehouse.

4) You may have warehoused data in JSON. Hive much prefers flat, tabular files.

5) To talk to Hive programmatically from other servers, you'll need to use Hive's Thrift
binding, and also open up the ElasticMapReduce-master security group (yes, EMR
created this for you when you first ran a jobflow) so that your application box can talk to
the master node of your jobflow.

6) All that said, it may beat writing your own custom MapReduce jobs.
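
As an illustration of point 4, nested JSON records can be flattened into tab-delimited rows before Hive sees them. This is a sketch under stated assumptions: the column names, the dotted-path convention, and the helper names lookup and to_row are all hypothetical, and the Hive table is assumed to be declared with tab-delimited fields. Missing values are written as \N, which Hive's default text format reads as NULL.

```python
import json

# Hypothetical column order; it must match the Hive table definition.
COLUMNS = ["user_id", "event", "page.url", "page.referrer"]


def lookup(record, dotted_path):
    """Follow a dotted path such as 'page.url' into nested dicts."""
    value = record
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def to_row(json_line, columns=COLUMNS):
    """Convert one JSON document into one tab-delimited row."""
    record = json.loads(json_line)
    fields = []
    for column in columns:
        value = lookup(record, column)
        # Missing values become \N so Hive reads them as NULL.
        text = r"\N" if value is None else str(value)
        # Tabs or newlines inside a value would break the row layout.
        fields.append(text.replace("\t", " ").replace("\n", " "))
    return "\t".join(fields)


if __name__ == "__main__":
    print(to_row('{"user_id": 7, "event": "click", "page": {"url": "/a"}}'))
```

A function like to_row could serve as the body of the MapReduce step mentioned in point 1, so that warehousing and flattening happen in one pass.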
Other pitfalls to beware
1) The JVM is not your friend. It especially loves memory.

- be prepared to add swap on your task nodes, at least until Hadoop's fork()/exec()
  behavior is fixed in AWS' version of Hadoop

- be prepared to profile your cluster with Ganglia (it's really, really easy to set up). Just
  add this bootstrap action to do so:
  s3://elasticmapreduce/bootstrap-actions/install-ganglia

2) Multiple output files can get you into trouble, especially if you don't sort your outputs
first.

3) Hadoop, and Hive as well, much prefer larger files to lots of small files.
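
One way to act on point 3 is to plan how many small files get concatenated into each larger file before loading. The sketch below only does the planning step; the function name batch_by_size, the (name, size) input shape, and the idea of a target size in bytes are assumptions, and actually merging the files is left out.

```python
def batch_by_size(files, target_bytes):
    """Group (name, size) pairs into batches of roughly target_bytes each."""
    batches, current, current_size = [], [], 0
    for name, size in files:
        # Start a new batch once adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    listing = [("a", 40), ("b", 40), ("c", 40), ("d", 200)]
    for batch in batch_by_size(listing, target_bytes=100):
        print(batch)
```

A file larger than the target still gets a batch of its own, so nothing is dropped; the target is a goal, not a hard cap.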
Further reading

 - "MapReduce: Simplified Data Processing on Large Clusters"
   OSDI'04. http://research.google.com/archive/mapreduce.html

 - "Hive - A Warehousing Solution Over a Map-Reduce Framework"
   VLDB, 2009. http://www.vldb.org/pvldb/2/vldb09-938.pdf

 - Hadoop: The Definitive Guide. O'Reilly Media, 2009.


Thank you!




                                        (P.S. We're hiring. hblanks@monetate.com)
