Big Data and APIs
for PHP Developers
 SXSW Interactive 2011
    Austin, Texas

     #BigDataAPIs
 

Text REDCROSS to 90999 to make a $10 donation and
support the American Red Cross' disaster relief efforts to help
those affected by the earthquake in Japan and tsunami
throughout the Pacific.
Topics & Goal

Topics:
   o Introductions
   o Definition of Big Data
   o Working with Big Data
   o APIs
   o Visualization
   o MapReduce

 Goal:
    To provide an interesting discussion revolving around all
aspects of big data & to spark your imagination on the subject.
Who we are
Bradley Holt (@bradleyholt)

Curator of this Workshop

Co-Founder and Technical Director,
Found Line

Author
Julie Steele (@jsteeleeditor)

Pythonista
   (but we like her anyway)

Acquisitions Editor, O'Reilly Media

Graphic Designer, freelance

Visualization curator
Laura Thomson (@lxt)

Webtools Engineering Manager,
Mozilla

crash-stats.mozilla.com

PHP, Python, Scaling, Systems
Eli White (@eliw)

Worked for:
  Digg, TripAdvisor, Hubble

PHP guy at heart

Currently unemployed (Hiring?)

Author:
Dennis Yang (@sinned)

Director of Product & Marketing,
   Infochimps

Previously:
   mySimon, CNET,
   & cofounder of Techdirt
David Zuelke (@dzuelke)

Lead Developer: Agavi

Managing Partner,
  Bitextender GmbH

Addicted to rockets, sharks w/
friggin laser beams, helicopters,
HTTP, REST, CouchDB and
MapReduce. And PHP.
Who are you?

Let's Learn about the Audience!
Tell Us About You

•   Who is currently working on a Big Data problem?
•   Have you integrated with an existing API?
•   Do you publish your own API?
•   How many of you are PHP developers?
•   Tell us what you hope to learn here today.
What is Big Data?
Different Types of Big Data

Large chunks of data
   o Massive XML files, Images, Video, log files

Massive amounts of small data points
   o Typical Web Data, Survey votes, Tweets

Requests/Traffic
   o Serving lots of data, regardless of set size

Processing vs storing
   o Different concerns if only processing versus storing
Working with Big Data
CAP Theorem
Consistency
     All clients will see a consistent view of the data.

Availability
     Clients will have access to read and write data.

Partition Tolerance
     The system won't fail if individual nodes can't communicate.


You can't have all three—pick two!

http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf
Common Tools

     Hadoop and HBase

            Cassandra

                  Riak

             MongoDB

             CouchDB

   Amazon Web Services
Scaling Data vs. Scaling Requests

Highly related? (or are they?)

At Digg, both were related, as more requests meant more data,
and more data made it harder to handle more requests.

At TripAdvisor, the big data was pass-through: we never stored
it, but had to process it quickly.

Mozilla's Socorro has a small number of webapp users (100?),
but catches 3 million crashes a day via POST, median size
150k, and must store, process, and analyze these.
OLTP vs. OLAP

OLTP: low latency, high volume
OLAP: long running, CPU intensive, low volume (?)
Trying to do both with the same system happens more often
than you might think (and will make you cry)

One bad MapReduce can bring a system to its knees
Online vs offline solutions

Keep only recent data in online system

Replicate / snapshot data to secondary system

Run expensive jobs on secondary
Cache output of expensive jobs in primary

Make snapshot/window data available to end users for ad hoc
processing
Case Study: Socorro
What's Socorro?
Scaling up

A different type of scaling:
Typical webapp: scale to millions of users without degradation
of response time
Socorro: less than a hundred users, terabytes of data.

Basic law of scale still applies:
The bigger you get, the more spectacularly you fail
Some numbers

At peak we receive 3000 crashes per minute
3 million per day
Median crash size 100k -> 150k
30TB stored in HBase and growing every day
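
A rough back-of-the-envelope check of these figures, as a minimal sketch. It assumes the median crash size can stand in for the mean and reads "150k" as 150 kB, so treat the result as an order-of-magnitude estimate only.

```python
# Figures taken from the slide above.
CRASHES_PER_DAY = 3_000_000
MEDIAN_CRASH_BYTES = 150 * 1000  # assumption: "150k" means 150 kB


def daily_volume_gb(crashes=CRASHES_PER_DAY, size_bytes=MEDIAN_CRASH_BYTES):
    """Approximate raw crash volume per day in gigabytes (10**9 bytes)."""
    return crashes * size_bytes / 10**9


def average_per_minute(crashes=CRASHES_PER_DAY):
    """Average crashes per minute, for comparison with the 3000/min peak."""
    return crashes / (24 * 60)


print(daily_volume_gb())              # 450.0
print(round(average_per_minute()))    # 2083
```

Under these assumptions the average rate is about two thirds of the stated peak of 3000 crashes per minute.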
What can we do?

Does betaN have more (null signature) crashes than other
betas?
Analyze differences between Flash versions x and y crashes
Detect duplicate crashes
Detect explosive crashes
Email victims of a malware-related crash
Find frankeninstalls
War stories

HBase
  Low latency writes
  Fast to retrieve one record or a range; anything else needs MapReduce

Stability problems and solutions
  Big clusters: network is key
  Need spare capacity, secondary systems
  Need great instrumentation
  Redundant layers (disk, RDBMS, cache)
  Next: Universal API

Big data is hard to move
APIs for Public Datasets
Sources for public data

• Data.gov

• DataSF: http://datasf.org/

• Public Data Sets on AWS:
  http://aws.amazon.com/publicdatasets/

• UNData: http://data.un.org/

• OECD: http://stats.oecd.org/
Infochimps

Over 10,000 data sets listed at Infochimps
Over 1,800 data sets available through our API
APIs allow easy access to terabyte scale data
Data is the lifeblood of your product

Any great application consists of:
• Awesome code
• The right data

And, as we've seen with sites like CNET, NPR, Netflix & Twitter,
it has become a best practice to build applications against APIs
to access that data.
Infochimps Screennames Autocomplete API
GET
http://www.infochimps.com/datasets/twitter-screen-name-autocomplete?prefix=infochi&apikey=api_test-W1cipwpcdu9Cbd9pmm8D4Cjc469



{
"completions":["infochimps",
"InfoChile",
"infochiapas",
"infochilecompra",
"Infochick",
"infochimp",
"infoChief1",
"infochip2",
"infochild",
"infochiocciola",
How we make the Screenname Autocomplete API
• Flying Monkey Scraper
   o continually crisscrosses the user graph, to discover new
     users
   o 1B objects to insert a day, 8 nodes
• Hadoop
   o To do the processing, 15+ nodes
   o Pig, Wukong
   o Precalculate 100M usernames -> prefixes
     -> a few hundred million rows
   o Sorted by Trstrank
• Apeyeye
   o load balanced cluster of 10 nodes, across 2 data centers
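
The "usernames -> prefixes" precalculation can be sketched as below. This is only an illustration of the idea, not the Infochimps pipeline: the function name, the `max_prefix_len` and `top_n` limits, and the scores (standing in for Trstrank) are all hypothetical.

```python
from collections import defaultdict


def build_prefix_index(ranked_names, max_prefix_len=8, top_n=10):
    """Map each lowercase prefix to its best-ranked names.

    `ranked_names` is a list of (name, score) pairs; a higher score
    means the name should be suggested earlier.
    """
    index = defaultdict(list)
    # Visit names best-first so every prefix list ends up in rank order.
    for name, _score in sorted(ranked_names, key=lambda pair: -pair[1]):
        lowered = name.lower()
        for end in range(1, min(len(lowered), max_prefix_len) + 1):
            bucket = index[lowered[:end]]
            if len(bucket) < top_n:
                bucket.append(name)
    return dict(index)


# Made-up scores, for illustration only.
names = [("infochimps", 9.1), ("InfoChile", 4.2), ("infochimp", 1.5), ("php", 7.0)]
index = build_prefix_index(names)
print(index["infochi"])  # ['infochimps', 'InfoChile', 'infochimp']
```

Doing this work ahead of time turns each autocomplete request into a single key lookup, at the cost of storing several rows per username.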
Infochimps Yahoo Stock API

GET
http://api.infochimps.com/economics/finance/stocks/y_historical/price_range

{
"results":[{
"open":19.1,
"adj_close":9.55,
"close":19.09,
"high":19.33,
"symbol":"AAPL",
"date":20010813,
"exchange":"NASDAQ",
"volume":5285600,
"low":18.76
},
How we make the Stock API

Changes every day...

You can get Yahoo stock data, every day too, in CSV form.

• Hackbox:
   o little piece of armored sausage that takes the data and
     munges it up to be useful
• Troop:
   o Publishes it into the datastore
   o Writes the API docs
   o Puts the API endpoint into the code
   o Stages for deploy
Build these APIs yourself.. if you want.

You can build these APIs yourself.
Check out http://infochimps.com/labs to see our open
sourced codebase.

Or, you can let us do it, and you can focus on writing awesome
code for your application, and let us do the monkeying with the
data.
And actually....

These two APIs illustrate great reasons why you *wouldn't*
want to build them yourself:

• The data involved is too large to practically handle
• The data involved updates frequently, so it's a hassle
Examples of data available through the Infochimps API
• Trstrank
  o   How trustworthy is a Twitter user?
• Conversations
  o   Given two users, get a summary of their interactions
• Twitter Name Autocomplete
  o   Given a string, find twitter names that complete it
• Geo Name Autocomplete
  o   Given a string, find place names that complete it
• Qwerly
  o   maps from Twitter handle to other social networks
• Daily Historical Stock Quotes
• IP to Census
  o   Given an IP address, map to Geo, then map to US Census info
And many more APIs...

Many more APIs available:
• Word Frequencies
• Word Lists
• Freebase
• DBPedia
• AggData
• Material Safety Data Sheets (MSDS)
• Internet "Weather" from Cedexis

And... many more to come.

Relocate your subroutine to the cloud.
http://infochimps.com/apis
Developing a Big Data API
Rules of thumb

Design APIs that give only incremental access; don't shove
more data at a user than needed.

For performance reasons, don't allow the user to request too
much data in one request, and throttle requests.

Consider building API against secondary system.

Asynchronous APIs: request/queue data, pick it up later.

Caches for big data don't necessarily help.
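
Two of the rules above (cap the amount of data per request, and throttle requests) can be sketched as follows. All names and limits here (`MAX_PAGE_SIZE`, `get_page`, `Throttle`) are hypothetical; a real API would pick its own limits and keep throttle state in shared storage.

```python
import time

MAX_PAGE_SIZE = 100  # hypothetical cap on records per request


def get_page(records, offset=0, limit=MAX_PAGE_SIZE):
    """Return one capped slice plus the offset of the next page (or None)."""
    offset = max(0, offset)
    limit = max(1, min(limit, MAX_PAGE_SIZE))
    page = records[offset:offset + limit]
    next_offset = offset + limit if offset + limit < len(records) else None
    return {"results": page, "next_offset": next_offset}


class Throttle:
    """Allow at most `max_requests` per `window` seconds for each API key."""

    def __init__(self, max_requests, window, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window
        self.clock = clock
        self.hits = {}  # api_key -> timestamps of recent accepted requests

    def allow(self, api_key):
        now = self.clock()
        recent = [t for t in self.hits.get(api_key, []) if now - t < self.window]
        allowed = len(recent) < self.max_requests
        if allowed:
            recent.append(now)
        self.hits[api_key] = recent
        return allowed
```

A client that asks for 1000 records still gets at most 100 and follows `next_offset` for the rest, which is the incremental access the first rule describes.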
Visualization
Exploring vs. Explaining




  L. fineartamerica.com/featured/exploring-archimedes-david-robinson.html
  R. sgp.undp.org/web/projects/10771/environmental_awareness_and_familiarity_with_animals.html
Structure has a purpose.




  fiveless.deviantart.com/art/Periodic-Table-of-the-Elements-147350318
squidspot.com/Periodic_Table_of_Typefaces.html
michaelvandaniker.com/labs/browserVisualization/
Firefox rulz
It hurts. Make it stop.
Chart junk makes baby Jesus cry.




L. www.flickr.com/photos/santo_cuervo/3693877386/
R. www.gereports.com/a-good-look-at-the-cost-of-chronic-diseases/
http://jec.senate.gov/republicans/public/
index.cfm?p=CommitteeNews&ContentRecord_id=bb302d88-3d0d-4424-8e33-3c5d2578c2b0
www.flickr.com/photos/robertpalmer/3743826461/
It deosn't mttaer waht oredr the ltteers in a wrod
are, the olny iprmoetnt tihng is taht the frist and
lsat ltteres are at the rghit pclae. The rset can be a
tatol mses and you can sitll raed it wouthit a
porbelm. Tihs is bcuseae we do not raed ervey
lteter by it slef but the wrod as a wlohe.


IT DEOSN'T MTTAER WAHT OREDR THE LTTEERS IN
A WROD ARE, THE OLNY IPRMOETNT TIHNG IS TAHT
THE FRIST AND LSAT LTTERES ARE AT THE RGHIT
PCLAE. THE RSET CAN BE A TATOL MSES AND YOU
CAN SITLL RAED IT WOUTHIT A PORBELM. TIHS IS
BCUSEAE WE DO NOT RAED ERVEY LTETER BY IT
SLEF BUT THE WROD AS A WLOHE.
The Stroop Effect


             RED    YELLOW   BLUE   GREEN


             RED    YELLOW   BLUE   GREEN
Color: function vs. decoration




Martin Wattenberg and Fernanda Viégas, Chapter 11, Beautiful Visualization (O'Reilly Media)
Andrew Odewahn, Chapter 8, Beautiful Visualization (O'Reilly Media)
Be kind to the color blind




    courses.washington.edu/info424/Labs/ChoroplethMap.html
Cool tools to try

• ManyEyes
    www-958.ibm.com/software/data/cognos/manyeyes/
• Wordle
    www.wordle.net/
• GraphViz
    www.graphviz.org/
• Protovis
    vis.stanford.edu/protovis/
• Tableau
    www.tableausoftware.com/
Your turn!




datavis.tumblr.com/post/
2746708037/the-sequel-map
MapReduce

Or: How to even store and process
 that much data in the first place?
how much is "that much"?
Some Numbers

Facebook, new data per day:

• 03/2008: 200 GB

• 04/2009: 2 TB

• 10/2009: 4 TB

• 03/2010: 12 TB

Google's processing jobs:

• 400 PB per month (in 2007!)

• Average job size is 180 GB
what if you have this much data?
what if it's just 1% of what Facebook has to deal with?
no problemo, you say?
reading 180 GB off a hard disk will take ~45 minutes
and then you haven't even processed it yet!
today's computers process data way faster than it can be read
solution: parallelize your I/O
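
The arithmetic behind the "~45 minutes" figure, and what splitting the read across disks buys, can be sketched like this. The 67 MB/s sustained throughput is an assumption chosen to match the slide, not a measured value, and the model ignores coordination overhead.

```python
# Assumed sustained throughput of one disk; roughly what
# "~45 minutes for 180 GB" implies.
ASSUMED_MB_PER_SECOND = 67


def read_minutes(size_gb, mb_per_second=ASSUMED_MB_PER_SECOND, disks=1):
    """Minutes needed to read `size_gb` split evenly across `disks`."""
    seconds = (size_gb * 1000) / (mb_per_second * disks)
    return seconds / 60


print(round(read_minutes(180)))                # 45
print(round(read_minutes(180, disks=100), 2))  # 0.45
```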
but now you need to coordinate
that's hard
what if a node dies?
does the whole job have to re-start?
   can another node take over?
   how do you coordinate this?
Enter: Our Hero

MapReduce to the Rescue!
olden days
distribute workload across a grid
ship data between nodes
store it centrally on a SAN
I/O bottleneck
2004
MapReduce: Simplified Data Processing on Large Clusters
    http://labs.google.com/papers/mapreduce.html
distribute the data up front across nodes
then ship the computation to the nodes where the data is
share-nothing architecture
scaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaalable
Basic Principle: The Mapper

Mapper reads records and emits key and value pairs.

Take an Apache web server log file as an example:

• Each line is a record.

• Mapper extracts request URI and number of bytes sent.

• Mapper emits the URI as the key and the bytes as the value.

Parallelize by having log files per hour, splitting up the files into
even smaller chunks (by line) and so forth.
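
The mapper described above might look like the following minimal sketch. It assumes the log uses Apache's Common Log Format; the regular expression and the function name `map_log_line` are illustrative, not part of any framework.

```python
import re

# Matches the quoted request ("GET /foobar HTTP/1.1"), the status code,
# and the bytes-sent field, which is "-" when no body was sent.
LOG_PATTERN = re.compile(r'"\S+ (?P<uri>\S+)[^"]*" \d{3} (?P<bytes>\d+|-)')


def map_log_line(line):
    """Yield (uri, bytes_sent) for one log line; skip lines that do not parse."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return
    sent = match.group("bytes")
    yield match.group("uri"), 0 if sent == "-" else int(sent)


line = '127.0.0.1 - - [13/Mar/2011:10:00:00 -0600] "GET /foobar HTTP/1.1" 200 2326'
print(list(map_log_line(line)))  # [('/foobar', 2326)]
```

Because each line is handled on its own, any number of mappers can work on different chunks of the log at the same time.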
Basic Principle: The Reducer

All values (from all nodes) for the same key are sent to the
same reducer.

Keys get sorted, and in case of a simple count or sum, you can
do a first reduce run on each mapper node once it's finished to
cut down on I/O (that's the combiner).

Apache web server log example to the rescue again:

• Reducer is invoked for a URI like "/foobar" and a list of all
  the byte counts emitted for it.

• Sum up the bytes, and we have the total traffic per URI!
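
A minimal in-memory sketch of the reducer and combiner steps follows. The two "nodes" and their values are made up, and `run_reduce` stands in for the sorting and grouping that the framework would normally do.

```python
from itertools import groupby
from operator import itemgetter


def reduce_bytes(uri, byte_counts):
    """Reducer: total traffic for one URI."""
    return uri, sum(byte_counts)


def run_reduce(pairs):
    """Sort pairs by key, group them, and call the reducer once per key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        reduce_bytes(uri, (count for _, count in group))
        for uri, group in groupby(ordered, key=itemgetter(0))
    ]


# Output of two hypothetical mapper nodes.
node_a = [("/foobar", 2326), ("/", 512), ("/foobar", 100)]
node_b = [("/foobar", 74), ("/", 488)]

# Combiner: addition is associative, so the same reduce can run on each
# node's local output before anything crosses the network.
combined = run_reduce(node_a) + run_reduce(node_b)
totals = run_reduce(combined)
print(totals)  # [('/', 1000), ('/foobar', 2500)]
```

Reusing the reducer as a combiner only works for operations like sums and counts; an average, for example, would need to carry both a sum and a count.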
Hello, Hadoop

MapReduce for the Masses
Hadoop is a MapReduce framework
comes with a distributed FS, task tracker and so forth
so we can focus on writing MapReduce jobs
works quite well, too
Big Hadoop Installations

Facebook:

• Mostly used with Hive

• 8400 cores, 13 PB total storage capacity

   o   8 cores, 32 GB RAM, 12 TB disk per node

   o   1 GbE per node

• 4 GbE between racks

Yahoo:

• 40% of jobs use Pig

• > 100,000 CPU cores in > 25,000 servers

• Largest cluster: 4000 nodes

   o   2 x 4 CPU cores and 16 GB RAM per node
Hadoop at Facebook

Daily usage:

• 25 TB logged by Scribe

• 135 TB compressed data scanned

• 7500+ Hive jobs

• ~80k compute hours

Data per day growth:

• I/08: 200 GB

• II/09: 2 TB compressed

• III/09: 4 TB compressed

• I/10: 12 TB compressed
HDFS

Hadoop Distributed File System
HDFS Overview

• Designed for very large data sets, transparent compression,
  block-based storage (64 MB block size by default)

• Designed for streaming rather than random reads

• Write-once, read-many (although there is a way to append)

• Stores data redundantly (three replicas by default), is aware
  of your network topology

• Namenode has metadata and knows where blocks reside

• Datanodes hold the data
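
To get a feel for the defaults named above, here is a small worked calculation. It is a sketch that ignores compression and reads 1 GB as 1024 MB; the function name is illustrative.

```python
import math

BLOCK_SIZE_MB = 64  # default block size named on the slide
REPLICAS = 3        # default replication named on the slide


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replicas=REPLICAS):
    """Return (blocks, block_replicas, raw_storage_mb) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return blocks, blocks * replicas, file_size_mb * replicas


# A 180 GB file, the average job size quoted earlier.
print(hdfs_footprint(180 * 1024))  # (2880, 8640, 552960)
```

The namenode tracks each of those blocks and their replica locations, while the datanodes store the roughly 540 GB of raw data.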
Task Processing

How Hadoop Gets the Job Done
Job Processing

• Input Formats split up your data into individual records

• Mappers do their work, then a partitioner partitions & sorts

• Combiner can perform local pre-reduce on each mapper

• Reducers perform reduction for each key

• Mapper, Combiner and Reducer can be an external process

   o   Called Hadoop Streaming, uses STDIN & STDOUT

        Shameless plug: http://github.com/dzuelke/hadoophp
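
With Hadoop Streaming, a mapper or reducer is any program that reads lines from STDIN and writes tab-separated key/value lines to STDOUT, so PHP works as well as anything else. The sketch below uses Python and simplified "uri bytes" input lines; the function names and the in-memory `simulate` helper are illustrative stand-ins, not part of Hadoop.

```python
import io


def streaming_mapper(stdin, stdout):
    """Emit 'uri<TAB>bytes' for each 'uri bytes' input line."""
    for line in stdin:
        fields = line.split()
        if len(fields) == 2 and fields[1].isdigit():
            stdout.write(f"{fields[0]}\t{fields[1]}\n")


def streaming_reducer(stdin, stdout):
    """Sum values per key; input lines are expected to arrive sorted by key."""
    current, total = None, 0
    for line in stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current:
            if current is not None:
                stdout.write(f"{current}\t{total}\n")
            current, total = key, 0
        total += int(value)
    if current is not None:
        stdout.write(f"{current}\t{total}\n")


def simulate(text):
    """Stand-in for the framework: map, sort by key, reduce, all in memory."""
    mapped = io.StringIO()
    streaming_mapper(io.StringIO(text), mapped)
    lines = sorted(mapped.getvalue().splitlines(keepends=True))
    reduced = io.StringIO()
    streaming_reducer(io.StringIO("".join(lines)), reduced)
    return reduced.getvalue()


print(simulate("/foobar 100\n/ 50\n/foobar 25\n"), end="")
```

In a real job each function would live in its own script and be called with `sys.stdin` and `sys.stdout`; the framework handles the sort between the two steps.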
</BigDataAPIs>

• Bradley Holt, @bradleyholt, bradley.holt@foundline.com

• Julie Steele, @jsteeleeditor, jsteele@oreilly.com

• Laura Thomson, @lxt, laura@mozilla.com

• Eli White, @eliw, eli@eliw.com

• Dennis Yang, @sinned, dennis@infochimps.com

• David Zuelke, @dzuelke, david.zuelke@bitextender.com

Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024
 
Strategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsStrategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering Teams
 
Server-Driven User Interface (SDUI) at Priceline
Server-Driven User Interface (SDUI) at PricelineServer-Driven User Interface (SDUI) at Priceline
Server-Driven User Interface (SDUI) at Priceline
 
Motion for AI: Creating Empathy in Technology
Motion for AI: Creating Empathy in TechnologyMotion for AI: Creating Empathy in Technology
Motion for AI: Creating Empathy in Technology
 
What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
 
Transforming The New York Times: Empowering Evolution through UX
Transforming The New York Times: Empowering Evolution through UXTransforming The New York Times: Empowering Evolution through UX
Transforming The New York Times: Empowering Evolution through UX
 
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
ECS 2024 Teams Premium - Pretty Secure
ECS 2024   Teams Premium - Pretty SecureECS 2024   Teams Premium - Pretty Secure
ECS 2024 Teams Premium - Pretty Secure
 
AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101
 

Big data and APIs for PHP developers - SXSW 2011

  • 1. Big Data and APIs for PHP Developers SXSW Interactive 2011 Austin, Texas #BigDataAPIs
  • 2.   Text REDCROSS to 90999 to make a $10 donation and support the American Red Cross' disaster relief efforts to help those affected by the earthquake in Japan and tsunami throughout the Pacific.
  • 3. Topics & Goal Topics: o Introductions o Definition of Big Data o Working with Big Data o APIs o Visualization o MapReduce Goal: To provide an interesting discussion revolving around all aspects of big data & to spark your imagination on the subject.
  • 5. Bradley Holt (@bradleyholt) Curator of this Workshop Co-Founder and Technical Director, Found Line Author
  • 6. Julie Steele (@jsteeleeditor) Pythonista (but we like her anyway) Acquisitions Editor, O'Reilly Media Graphic Designer, freelance Visualization curator
  • 7. Laura Thomson (@lxt) Webtools Engineering Manager, Mozilla crash-stats.mozilla.com PHP, Python, Scaling, Systems
  • 8. Eli White (@eliw) Worked for: Digg, TripAdvisor, Hubble PHP guy at heart Currently unemployed (Hiring?) Author:
  • 9. Dennis Yang (@sinned) Director of Product & Marketing, Infochimps Previously: mySimon, CNET, & cofounder of Techdirt
  • 10. David Zuelke (@dzuelke) Lead Developer: Agavi Managing Partner, Bitextender GmbH Addicted to rockets, sharks w/ friggin laser beams, helicopters, HTTP, REST, CouchDB and MapReduce. And PHP.
  • 11. Who are you? Let's Learn about the Audience!
  • 12. Tell Us About You • Who is currently working on a Big Data problem? • Have you integrated with an existing API? • Do you publish your own API? • How many of you are PHP developers? • Tell us what you hope to learn here today.
  • 13. What is Big Data?
  • 14. Different Types of Big Data Large chunks of data o Massive XML files, Images, Video, log files Massive amounts of small data points o Typical Web Data, Survey votes, Tweets Requests/Traffic o Serving lots of data, regardless of set size Processing vs storing o Different concerns if only processing versus storing
  • 16. CAP Theorem Consistency All clients will see a consistent view of the data. Availability Clients will have access to read and write data. Partition Tolerance The system won't fail if individual nodes can't communicate. You can't have all three—pick two! http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf
  • 17. Common Tools Hadoop and HBase Cassandra Riak MongoDB CouchDB Amazon Web Services
  • 19. Scaling Data vs. Scaling Requests Highly related? (or are they?) At Digg, both were related: more requests meant more data, and more data made it harder to handle more requests. At TripAdvisor, the big data was pass-through: we never stored it, but had to process it quickly. Mozilla's Socorro has a small number of webapp users (100?), but catches 3 million crashes a day via POST, median size 150k, and must store, process, and analyze these.
  • 20. OLTP vs. OLAP OLTP: low latency, high volume OLAP: long running, CPU intensive, low volume (?) Trying to do both with the same system happens more often than you might think (and will make you cry) One bad MapReduce can bring a system to its knees
  • 21. Online vs offline solutions Keep only recent data in online system Replicate / snapshot data to secondary system Run expensive jobs on secondary Cache output of expensive jobs in primary Make snapshot/window data available to end users for ad hoc processing
  • 27. Scaling up A different type of scaling: Typical webapp: scale to millions of users without degradation of response time Socorro: less than a hundred users, terabytes of data. Basic law of scale still applies: The bigger you get, the more spectacularly you fail
  • 28. Some numbers At peak we receive 3000 crashes per minute 3 million per day Median crash size 100k -> 150k 30TB stored in HBase and growing every day
  • 29. What can we do? Does betaN have more (null signature) crashes than other betas? Analyze differences between Flash versions x and y crashes Detect duplicate crashes Detect explosive crashes Email victims of a malware-related crash Find frankeninstalls
  • 30. War stories HBase Low latency writes Fast to retrieve one record or a range, anything else MR Stability problems and solutions Big clusters: network is key Need spare capacity, secondary systems Need great instrumentation Redundant layers (disk, RDBMS, cache) Next: Universal API Big data is hard to move
  • 31. APIs for Public Datasets
  • 32. Sources for public data • Data.gov • DataSF: http://datasf.org/ • Public Data Sets on AWS: http://aws.amazon.com/publicdatasets/ • UNData: http://data.un.org/ • OECD: http://stats.oecd.org/
  • 33. Infochimps Over 10,000 data sets listed at Infochimps Over 1,800 data sets available through our API APIs allow easy access to terabyte scale data
  • 34. Data is the lifeblood of your product Any great application consists of: • Awesome code • The right data And, as we've seen with sites like CNET, NPR, Netflix & Twitter, it has become a best practice to build applications against APIs to access that data.
  • 36. How we make the Screenname Autocomplete API • Flying Monkey Scraper o continually crisscrosses the user graph, to discover new users o 1B objects to insert a day, 8 nodes • Hadoop o To do the processing, 15+ nodes o Pig, Wukong o Precalculate 100M usernames -> prefixes -> a few hundred million rows o Sorted by Trstrank • Apeyeye o load balanced cluster of 10 nodes, across 2 data centers
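The precalculation step on this slide (usernames -> prefixes -> rows sorted by Trstrank) can be sketched roughly as below. This is a toy in-memory illustration in Python, not the actual Pig/Wukong job. The function names, the sample users, their rank values, and the `max_len` and `top_n` limits are all made up for the example.

```python
from collections import defaultdict


def prefix_rows(users, max_len=4):
    """Yield one (prefix, rank, name) row per prefix of each username."""
    for name, rank in users:
        lowered = name.lower()
        for end in range(1, min(len(lowered), max_len) + 1):
            yield lowered[:end], rank, name


def build_index(users, max_len=4, top_n=3):
    """Group rows by prefix and keep the top_n names, highest rank first."""
    rows_by_prefix = defaultdict(list)
    for prefix, rank, name in prefix_rows(users, max_len):
        rows_by_prefix[prefix].append((rank, name))
    index = {}
    for prefix, rows in rows_by_prefix.items():
        ordered = sorted(rows, key=lambda row: (-row[0], row[1]))
        index[prefix] = [name for _, name in ordered[:top_n]]
    return index


# Hypothetical users with made-up rank scores.
USERS = [("alice", 2.0), ("alina", 5.0), ("bob", 3.5)]
INDEX = build_index(USERS)
print(INDEX["al"])  # names starting with "al", highest rank first
```

Each username produces several prefix rows, which is why a list of usernames grows into a much larger number of rows once prefixes are precalculated.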
  • 37. Infochimps Yahoo Stock API GET http://api.infochimps.com/economics/finance/stocks/y_historical/price_range { "results":[{ "open":19.1, "adj_close":9.55, "close":19.09, "high":19.33, "symbol":"AAPL", "date":20010813, "exchange":"NASDAQ", "volume":5285600, "low":18.76 },
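A response shaped like the one on this slide could be consumed along these lines. This is a minimal sketch: the slide shows only the start of the response, so the closing `]}` in `SAMPLE` is added here as an assumption to make the snippet parse. `parse_quotes` is a hypothetical helper, and no request is sent to the API.

```python
import json
from datetime import datetime

# One record copied from the slide. The closing "]}" is an assumption,
# added only so that the truncated sample is valid JSON.
SAMPLE = """{"results":[{"open":19.1,"adj_close":9.55,"close":19.09,"high":19.33,
"symbol":"AAPL","date":20010813,"exchange":"NASDAQ","volume":5285600,"low":18.76}]}"""


def parse_quotes(body):
    """Decode the JSON body and turn the integer date into a date object."""
    quotes = []
    for row in json.loads(body)["results"]:
        quote = dict(row)
        quote["date"] = datetime.strptime(str(row["date"]), "%Y%m%d").date()
        quotes.append(quote)
    return quotes


QUOTES = parse_quotes(SAMPLE)
first = QUOTES[0]
print(first["symbol"], first["date"].isoformat(), first["close"])
```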
  • 38. How we make the Stock API Changes every day... You can get Yahoo stock data, every day too, in CSV form. • Hackbox: o little piece of armored sausage that takes the data and munges it up to be useful • Troop: o Publishes it into the datastore o Writes the API docs o Puts the API endpoint into the code o Stages for deploy
  • 39. Build these APIs yourself.. if you want. You can build these APIs yourself. Check out http://infochimps.com/labs to see our open sourced codebase. Or, you can let us do it, and you can focus on writing awesome code for your application, and let us do the monkeying with the data.
  • 40. And actually.... These two APIs illustrate great reasons why you *wouldn't* want to build them yourself: • The data involved is too large to practically handle • The data involved updates frequently, so it's a hassle
  • 41. Examples of data available through the Infochimps API • Trstrank o How trustworthy is a Twitter user? • Conversations o Given two users, get a summary of their interactions • Twitter Name Autocomplete o Given a string, find twitter names that complete it • Geo Name Autocomplete o Given a string, find place names that complete it • Qwerly o maps from Twitter handle to other social networks • Daily Historical Stock Quotes • IP to Census o Given an IP address, map to Geo, then map to US Census info
  • 42. And many more APIs... Many more APIs available: • Word Frequencies • Word Lists • Freebase • DBPedia • AggData • Material Safety Data Sheets (MSDS) • Internet "Weather" from Cedexis And... many more to come. Relocate your subroutine to the cloud. http://infochimps.com/apis
  • 43. Developing a Big Data API
  • 44. Rules of thumb Design APIs that give only incremental access; don't shove more data than needed at a user. For performance reasons, don't allow the user to request too much data in one request, and throttle requests. Consider building the API against a secondary system. Asynchronous APIs: request/queue data, pick it up later. Caches for big data don't necessarily help.
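Two of these rules of thumb, capping how much one request may return and throttling requests, might look like the sketch below. It is one possible shape, not a recommendation from the slides: the page-size cap, the fixed-window limiter, and all names are assumptions chosen for illustration.

```python
MAX_PAGE_SIZE = 100  # hypothetical server-side cap per request


def paginate(records, offset=0, limit=MAX_PAGE_SIZE):
    """Return one page and the offset of the next page (None when done)."""
    limit = max(1, min(limit, MAX_PAGE_SIZE))  # never serve more than the cap
    page = records[offset:offset + limit]
    next_offset = offset + limit if offset + limit < len(records) else None
    return page, next_offset


class Throttle:
    """Fixed-window limiter: at most max_requests per window per client."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.windows = {}  # client id -> (window start, request count)

    def allow(self, client, now):
        start, count = self.windows.get(client, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0  # a new window begins
        if count >= self.max_requests:
            self.windows[client] = (start, count)
            return False
        self.windows[client] = (start, count + 1)
        return True


DATA = list(range(250))
page, next_offset = paginate(DATA, offset=0, limit=1000)  # asks for too much
print(len(page), next_offset)
```

The caller follows `next_offset` to fetch the rest, so nobody receives more than one capped page per request.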
  • 46. Exploring vs. Explaining L. fineartamerica.com/featured/exploring-archimedes-david-robinson.html R. sgp.undp.org/web/projects/10771/environmental_awareness_and_familiarity_with_animals.html
  • 47. Structure has a purpose. fiveless.deviantart.com/art/Periodic-Table-of-the-Elements-147350318
  • 51. It hurts. Make it stop.
  • 52. Chart junk makes baby Jesus cry. L. www.flickr.com/photos/santo_cuervo/3693877386/ R. www.gereports.com/a-good-look-at-the-cost-of-chronic-diseases/
  • 55. It deosn't mttaer waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteres are at the rghit pclae. The rset can be a tatol mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by it slef but the wrod as a wlohe. IT DEOSN'T MTTAER WAHT OREDR THE LTTEERS IN A WROD ARE, THE OLNY IPRMOETNT TIHNG IS TAHT THE FRIST AND LSAT LTTERES ARE AT THE RGHIT PCLAE. THE RSET CAN BE A TATOL MSES AND YOU CAN SITLL RAED IT WOUTHIT A PORBELM. TIHS IS BCUSEAE WE DO NOT RAED ERVEY LTETER BY IT SLEF BUT THE WROD AS A WLOHE.
  • 56. The Stroop Effect RED YELLOW BLUE GREEN RED YELLOW BLUE GREEN
  • 57. Color: function vs. decoration Martin Wattenberg and Fernanda Viégas, Chapter 11, Beautiful Visualization (O'Reilly Media)
  • 58. Andrew Odewahn, Chapter 8, Beautiful Visualization (O'Reilly Media)
  • 59. Be kind to the color blind courses.washington.edu/info424/Labs/ChoroplethMap.html
  • 60. Cool tools to try • ManyEyes www-958.ibm.com/software/data/cognos/manyeyes/ • Wordle www.wordle.net/ • GraphViz www.graphviz.org/ • Protovis vis.stanford.edu/protovis/ • Tableau www.tableausoftware.com/
  • 63. MapReduce Or: How to even store and process that much data in the first place?
  • 64. how much is "that much"?
  • 73. Some Numbers Facebook, new data per day: • 03/2008: 200 GB • 04/2009: 2 TB • 10/2009: 4 TB • 03/2010: 12 TB Google's processing jobs: • 400 PB per month (in 2007!) • Average job size is 180 GB
  • 74. what if you have this much data?
  • 75. what if it's just 1% of what Facebook has to deal with?
  • 77. reading 180 GB off a hard disk will take ~45 minutes
  • 78. and then you haven't even processed it yet!
  • 79. today's computers process data way faster than it can be read
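The "~45 minutes" figure can be checked with a little arithmetic. The sketch below assumes decimal gigabytes and a single disk read sequentially. The 20-disk case is a hypothetical illustration of why spreading the read over many machines helps, not a number from the slides.

```python
GB = 10**9  # decimal gigabyte (an assumption; the slides do not say which)
MB = 10**6


def implied_throughput_mb_s(size_gb, minutes):
    """Sustained read rate needed to read size_gb in the given minutes."""
    return size_gb * GB / (minutes * 60) / MB


def read_minutes(size_gb, throughput_mb_s, disks=1):
    """Minutes to read size_gb spread evenly over disks read in parallel."""
    seconds = size_gb * GB / (throughput_mb_s * MB * disks)
    return seconds / 60


rate = implied_throughput_mb_s(180, 45)
print(round(rate, 1))  # MB/s implied by "180 GB in ~45 minutes"
print(round(read_minutes(180, rate, disks=20), 2))  # same data on 20 disks
```

At roughly 67 MB/s per disk the read itself is the bottleneck, so the time falls almost linearly with the number of disks read in parallel.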
  • 81. but now you need to coordinate
  • 87. what if a node dies? does the whole job have to re-start? can another node take over? how do you coordinate this?
  • 88. Enter: Our Hero MapReduce to the Rescue!
  • 92. store it centrally on a SAN
  • 94. 2004
  • 95. MapReduce: Simplified Data Processing on Large Clusters http://labs.google.com/papers/mapreduce.html
  • 96. distribute the data up front across nodes
  • 97. then ship the computation to the nodes where the data is
  • 106. Basic Principle: The Mapper Mapper reads records and emits key and value pairs. Take an Apache web server log file as an example: • Each line is a record. • Mapper extracts request URI and number of bytes sent. • Mapper emits the URI as the key and the bytes as the value. Parallelize by having log files per hour, splitting up the files into even smaller chunks (by line) and so forth.
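The mapper described on this slide can be sketched in a few lines. This is an illustrative Python version, not code from the talk: it assumes log lines in the Apache common log format, and the regular expression, function names, and sample lines are made up for the example.

```python
import re

# Matches the quoted request, the status code and the bytes-sent field of
# an Apache common-log line, e.g.: "GET /foobar HTTP/1.1" 200 2326
LOG_PATTERN = re.compile(r'"\S+ (?P<uri>\S+)[^"]*" \d{3} (?P<bytes>\d+|-)')


def map_line(line):
    """Yield (uri, bytes_sent) for one record; skip lines that do not parse."""
    match = LOG_PATTERN.search(line)
    if match:
        sent = match.group("bytes")
        # Apache logs "-" when no body was sent; count that as 0 bytes.
        yield match.group("uri"), 0 if sent == "-" else int(sent)


def mapper(lines):
    """Each line is a record; emit the URI as key and the bytes as value."""
    for line in lines:
        yield from map_line(line)


SAMPLE_LOG = [
    '127.0.0.1 - - [13/Mar/2011:10:00:00 -0600] "GET /foobar HTTP/1.1" 200 2326',
    '127.0.0.1 - - [13/Mar/2011:10:00:01 -0600] "GET /foobar HTTP/1.1" 200 1024',
    '127.0.0.1 - - [13/Mar/2011:10:00:02 -0600] "GET /other HTTP/1.1" 304 -',
    "not a log line",
]
PAIRS = list(mapper(SAMPLE_LOG))
print(PAIRS)
```

Because `map_line` looks at one record at a time and keeps no state, the input can be split by hour, by file, or by line and mapped in parallel.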
  • 112. Basic Principle: The Reducer All values (from all nodes) for the same key are sent to the same reducer. Keys get sorted, and in case of a simple count or sum, you can do a first reduce run on each mapper node once it's finished to cut down on I/O (that's the combiner). Apache web server log example to the rescue again: • Reducer is invoked for a URI like "/foobar" and a list of all number of bytes. • Sum up the bytes, and we have the total traffic per URI!
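The reducer side, including the optional combiner, can be simulated in memory as follows. This is a sketch under stated assumptions: the two "nodes" are plain Python lists with made-up (URI, bytes) pairs, and `combine`, `shuffle`, and `reduce_all` are illustrative names, not Hadoop APIs.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def combine(pairs):
    """Local pre-reduce on one mapper node: sum bytes per URI."""
    totals = defaultdict(int)
    for uri, sent in pairs:
        totals[uri] += sent
    return list(totals.items())


def shuffle(outputs_per_node):
    """Merge every node's output and sort by key so equal keys are adjacent."""
    merged = [pair for output in outputs_per_node for pair in output]
    return sorted(merged, key=itemgetter(0))


def reduce_all(sorted_pairs):
    """Invoke the reducer once per key with all of that key's values."""
    for uri, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield uri, sum(sent for _, sent in group)


# Hypothetical mapper output on two nodes.
NODE_A = [("/foobar", 2326), ("/foobar", 1024), ("/other", 512)]
NODE_B = [("/foobar", 100), ("/other", 88)]

TOTALS = dict(reduce_all(shuffle([combine(NODE_A), combine(NODE_B)])))
print(TOTALS)
```

Summation is associative and commutative, so running the combiner first changes only how many pairs cross the network, not the totals.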
  • 114. Hadoop is a MapReduce framework
  • 115. comes with a distributed FS, task tracker and so forth
  • 116. so we can focus on writing MapReduce jobs
  • 129. Big Hadoop Installations Facebook: • Mostly used with Hive • 8400 cores, 13 PB total storage capacity o 8 cores, 32 GB RAM, 12 TB disk per node o 1 GbE per node • 4 GbE between racks Yahoo: • 40% of jobs use Pig • > 100,000 CPU cores in > 25,000 servers • Largest cluster: 4000 nodes o 2 x 4 CPU cores and 16 GB RAM per node
  • 140. Hadoop at Facebook Daily usage: • 25 TB logged by Scribe • 135 TB compressed data scanned • 7500+ Hive jobs • ~80k compute hours Data per day growth: • I/08: 200 GB • II/09: 2 TB compressed • III/09: 4 TB compressed • I/10: 12 TB compressed
  • 148. HDFS Overview • Designed for very large data sets, transparent compression, block-based storage (64 MB block size by default) • Designed for streaming rather than random reads • Write-once, read-many (although there is a way to append) • Stores data redundantly (three replicas by default), is aware of your network topology • Namenode has metadata and knows where blocks reside • Datanodes hold the data
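To get a feel for the block and replica defaults on this slide, the sketch below counts what a single large file turns into. The 180 GB file size is borrowed from the average job size quoted earlier in the deck. The helper name and the use of binary units (1 MB = 1024² bytes) are assumptions.

```python
BLOCK_SIZE = 64 * 1024**2  # 64 MB default block size from the slide
REPLICAS = 3               # three replicas by default


def block_layout(file_bytes, block_size=BLOCK_SIZE, replicas=REPLICAS):
    """Count the blocks, block replicas and raw bytes stored for one file."""
    blocks = -(-file_bytes // block_size)  # integer ceiling division
    return {
        "blocks": blocks,
        "block_replicas": blocks * replicas,
        "raw_bytes": file_bytes * replicas,
    }


layout = block_layout(180 * 1024**3)  # a hypothetical 180 GB file
print(layout["blocks"], layout["block_replicas"])
```

The namenode has to track the location of every one of those block replicas, while the datanodes store the bytes.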
  • 149. Task Processing How Hadoop Gets the Job Done
  • 157. Job Processing • Input Formats split up your data into individual records • Mappers do their work, then a partitioner partitions & sorts • Combiner can perform local pre-reduce on each mapper • Reducers perform reduction for each key • Mapper, Combiner and Reducer can be an external process o Called Hadoop Streaming, uses STDIN & STDOUT  Shameless plug: http://github.com/dzuelke/hadoophp
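With Hadoop Streaming, the mapper and reducer are ordinary programs that read lines from STDIN and write tab-separated key/value lines to STDOUT. A minimal sketch in Python is shown below. It takes file-like objects instead of touching the real STDIN so that it can be tried locally, and it assumes a simplified input format of `uri bytes` per line. `run_locally` is a made-up stand-in for the sort that Hadoop performs between the two phases.

```python
import io


def stream_mapper(stdin, stdout):
    """Emit 'uri<TAB>bytes' for every input line of the form 'uri bytes'."""
    for line in stdin:
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stdout.write(f"{parts[0]}\t{parts[1]}\n")


def stream_reducer(stdin, stdout):
    """Input arrives sorted by key; sum the values until the key changes."""
    current, total = None, 0
    for line in stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current:
            if current is not None:
                stdout.write(f"{current}\t{total}\n")
            current, total = key, 0
        total += int(value)
    if current is not None:
        stdout.write(f"{current}\t{total}\n")


def run_locally(text):
    """Simulate map -> sort -> reduce in memory."""
    mapped = io.StringIO()
    stream_mapper(io.StringIO(text), mapped)
    shuffled = "".join(sorted(mapped.getvalue().splitlines(keepends=True)))
    reduced = io.StringIO()
    stream_reducer(io.StringIO(shuffled), reduced)
    return reduced.getvalue()


RESULT = run_locally("/foobar 2326\n/other 512\n/foobar 1024\n")
print(RESULT, end="")
```

In a real streaming job each function would live in its own script and be called with `sys.stdin` and `sys.stdout`. A streaming reducer sees a sorted stream of lines rather than one call per key, which is why it has to watch for the key changing.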
  • 164. </BigDataAPIs> • Bradley Holt, @bradleyholt, bradley.holt@foundline.com • Julie Steele, @jsteeleeditor, jsteele@oreilly.com • Laura Thomson, @lxt, laura@mozilla.com • Eli White, @eliw, eli@eliw.com • Dennis Yang, @sinned, dennis@infochimps.com • David Zuelke, @dzuelke, david.zuelke@bitextender.com