1.
Big Data and APIs
for PHP Developers
SXSW Interactive 2011
Austin, Texas
#BigDataAPIs
2.
Text REDCROSS to 90999 to make a $10 donation and
support the American Red Cross' disaster relief efforts to help
those affected by the earthquake in Japan and tsunami
throughout the Pacific.
3.
Topics & Goal
Topics:
o Introductions
o Definition of Big Data
o Working with Big Data
o APIs
o Visualization
o MapReduce
Goal:
To provide an interesting discussion revolving around all
aspects of big data & to spark your imagination on the subject.
5.
Bradley Holt (@bradleyholt)
Curator of this Workshop
Co-Founder and Technical Director,
Found Line
Author
6.
Julie Steele (@jsteeleeditor)
Pythonista
(but we like her anyway)
Acquisitions Editor, O'Reilly Media
Graphic Designer, freelance
Visualization curator
7.
Laura Thomson (@lxt)
Webtools Engineering Manager,
Mozilla
crash-stats.mozilla.com
PHP, Python, Scaling, Systems
8.
Eli White (@eliw)
Worked for:
Digg, Tripadvisor, Hubble
PHP guy at heart
Currently unemployed (Hiring?)
Author:
9.
Dennis Yang (@sinned)
Director of Product & Marketing,
Infochimps
Previously:
mySimon, CNET,
& cofounder of Techdirt
10.
David Zuelke (@dzuelke)
Lead Developer: Agavi
Managing Partner,
Bitextender GmbH
Addicted to rockets, sharks w/
friggin laser beams, helicopters,
HTTP, REST, CouchDB and
MapReduce. And PHP.
12.
Tell Us About You
• Who is currently working on a Big Data problem?
• Have you integrated with an existing API?
• Do you publish your own API?
• How many of you are PHP developers?
• Tell us what you hope to learn here today.
14.
Different Types of Big Data
Large chunks of data
o Massive XML files, Images, Video, log files
Massive amounts of small data points
o Typical Web Data, Survey votes, Tweets
Requests/Traffic
o Serving lots of data, regardless of set size
Processing vs storing
o Different concerns if only processing versus storing
16.
CAP Theorem
Consistency
All clients will see a consistent view of the data.
Availability
Clients will have access to read and write data.
Partition Tolerance
The system won't fail if individual nodes can't communicate.
You can't have all three—pick two!
http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf
17.
Common Tools
Hadoop and HBase
Cassandra
Riak
MongoDB
CouchDB
Amazon Web Services
18.
Scaling Data vs. Scaling Requests
Highly related? (or are they?)
At Digg, both were related, as more requests meant more data,
and more data made it harder to handle more requests.
At TripAdvisor, it was about handling pass-through big data: we never
stored it, but had to process it quickly.
Mozilla's Socorro has a small number of webapp users (100?),
but catches 3 million crashes a day via POST, median size
150k, and must store, process, and analyze these.
19.
OLTP vs. OLAP
OLTP: low latency, high volume
OLAP: long running, CPU intensive, low volume (?)
Trying to do both with the same system happens more often
than you might think (and will make you cry)
One bad MapReduce can bring a system to its knees
20.
Online vs offline solutions
Keep only recent data in online system
Replicate / snapshot data to secondary system
Run expensive jobs on secondary
Cache output of expensive jobs in primary
Make snapshot/window data available to end users for ad hoc
processing
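One way to picture the "cache output of expensive jobs in primary" step in PHP: the online code path only ever reads a cache, and the slow path to the secondary system is taken on a miss. This is a minimal sketch under assumed names; loadReportFromSecondary() is a hypothetical helper, and Memcached is just one possible cache.

<?php
// Sketch: serve precomputed results from a cache in the primary system,
// and only fall back to the (slow) secondary system when needed.
// loadReportFromSecondary() and the report name are hypothetical.
function getExpensiveReport(Memcached $cache, $reportName)
{
    $cacheKey = 'report:' . $reportName;

    $cached = $cache->get($cacheKey);
    if ($cached !== false) {
        return $cached; // hit: output of last night's batch job
    }

    // Miss: pull from the secondary system (slow path), then cache it
    // so the online system stays fast for the next hour.
    $report = loadReportFromSecondary($reportName);
    $cache->set($cacheKey, $report, 3600);

    return $report;
}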
23.
Scaling up
A different type of scaling:
Typical webapp: scale to millions of users without degradation
of response time
Socorro: less than a hundred users, terabytes of data.
Basic law of scale still applies:
The bigger you get, the more spectacularly you fail
24.
Some numbers
At peak we receive 3000 crashes per minute
3 million per day
Median crash size 100k -> 150k
30TB stored in HBase and growing every day
25.
What can we do?
Does betaN have more (null signature) crashes than other
betas?
Analyze differences between Flash versions x and y crashes
Detect duplicate crashes
Detect explosive crashes
Email victims of a malware-related crash
Find frankeninstalls
26.
War stories
HBase
Low latency writes
Fast to retrieve one record or a range; anything else needs MapReduce
Stability problems and solutions
Big clusters: network is key
Need spare capacity, secondary systems
Need great instrumentation
Redundant layers (disk, RDBMS, cache)
Next: Universal API
Big data is hard to move
28.
Sources for public data
• Data.gov
• DataSF: http://datasf.org/
• Public Data Sets on AWS:
http://aws.amazon.com/publicdatasets/
• UNData: http://data.un.org/
• OECD: http://stats.oecd.org/
29.
Infochimps
Over 10,000 data sets listed at Infochimps
Over 1,800 data sets available through our API
APIs allow easy access to terabyte scale data
30.
Data is the lifeblood of your product
Any great application consists of:
• Awesome code
• The right data
And, as we've seen with sites like CNET, NPR, Netflix & Twitter,
it has become a best practice to build applications against APIs
to access that data.
31.
Infochimps Screennames Autocomplete
API
GET http://www.infochimps.com/datasets/twitter-screen-name-autocomplete?
prefix=infochi&apikey=api_test-W1cipwpcdu9Cbd9pmm8D4Cjc469
{
"completions":["infochimps",
"InfoChile",
"infochiapas",
"infochilecompra",
"Infochick",
"infochimp",
"infoChief1",
"infochip2",
"infochild",
"infochiocciola",
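A minimal PHP sketch of calling this endpoint, using the URL and test API key exactly as shown above (the response on the slide is truncated; error handling is kept to a bare minimum):

<?php
// Fetch screen name completions for the prefix "infochi" and print them.
$apiKey = 'api_test-W1cipwpcdu9Cbd9pmm8D4Cjc469';
$url = 'http://www.infochimps.com/datasets/twitter-screen-name-autocomplete?'
     . http_build_query(array('prefix' => 'infochi', 'apikey' => $apiKey));

$json = file_get_contents($url);
if ($json === false) {
    die("Request failed\n");
}

$data = json_decode($json, true);
foreach ($data['completions'] as $screenName) {
    echo $screenName, "\n";
}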
32.
How we make the Screenname
Autocomplete API
• Flying Monkey Scraper
o continually crisscrosses the user graph to discover new users
o 1B objects to insert a day, 8 nodes
• Hadoop
o To do the processing, 15+ nodes
o Pig, Wukong
o Precalculate 100M usernames -> prefixes
-> a few hundred million rows
o Sorted by Trstrank
• Apeyeye
o load balanced cluster of 10 nodes, across 2 data centers
33.
Infochimps Yahoo Stock API
GET http://api.infochimps.com/economics/finance/stocks/y_historical/price_range
{
"results":[{
"open":19.1,
"adj_close":9.55,
"close":19.09,
"high":19.33,
"symbol":"AAPL",
"date":20010813,
"exchange":"NASDAQ",
"volume":5285600,
"low":18.76
},
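Consuming a response shaped like the one above from PHP might look like this. The sample record is copied from the slide; the query parameters the endpoint actually needs (symbol, date range, apikey) are not shown on the slide, so the HTTP call itself is left out here.

<?php
// Decode a y_historical/price_range style response and walk the results.
$json = '{"results":[{"open":19.1,"adj_close":9.55,"close":19.09,"high":19.33,
"symbol":"AAPL","date":20010813,"exchange":"NASDAQ","volume":5285600,"low":18.76}]}';

$data = json_decode($json, true);
foreach ($data['results'] as $quote) {
    printf(
        "%s %d: open %.2f, close %.2f, volume %d\n",
        $quote['symbol'],
        $quote['date'],
        $quote['open'],
        $quote['close'],
        $quote['volume']
    );
}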
34.
How we make the Stock API
Changes every day...
You can get the Yahoo stock data every day too, in CSV form.
• Hackbox:
o little piece of armored sausage that takes the data and
munges it up to be useful
• Troop:
o Publishes it into the datastore
o Writes the API docs
o Puts the API endpoint into the code
o Stages for deploy
35.
Build these APIs yourself... if you want.
You can build these APIs yourself.
Check out http://infochimps.com/labs to see our open
sourced codebase.
Or you can let us do it: you focus on writing awesome code for your
application, and we do the monkeying with the data.
36.
And actually....
These two APIs illustrate great reasons why you *wouldn't*
want to build them yourself:
• The data involved is too large to practically handle
• The data involved updates frequently, so it's a hassle
37.
Examples of data available through the
Infochimps API
• Trstrank
o How trustworthy is a Twitter user?
• Conversations
o Given two users, get a summary of their interactions
• Twitter Name Autocomplete
o Given a string, find twitter names that complete it
• Geo Name Autocomplete
o Given a string, find place names that complete it
• Qwerly
o maps from Twitter handle to other social networks
• Daily Historical Stock Quotes
• IP to Census
o Given an IP address, map to Geo, then map to US Census info
38.
And many more APIs...
Many more APIs available:
• Word Frequencies
• Word Lists
• Freebase
• DBPedia
• AggData
• Material Safety Data Sheets (MSDS)
• Internet "Weather" from Cedexis
And... many more to come.
Relocate your subroutine to the cloud.
http://infochimps.com/apis
40.
Rules of thumb
Design APIs that give incremental access (see the sketch below); don't
shove more data than needed at a user.
For performance, don't allow the user to request too much data in one
request, and throttle requests.
Consider building the API against a secondary system.
Asynchronous APIs: request/queue data, pick it up later.
Caches for big data don't necessarily help.
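A server-side sketch of the incremental access rule, assuming a hypothetical fetchRows() data access helper and hypothetical parameter names: cap the page size and hand back a cursor, so clients page through the data instead of pulling it all at once.

<?php
// Incremental access sketch: enforce a maximum page size and return a
// "next_offset" cursor. fetchRows() and the parameter names are hypothetical.
define('MAX_LIMIT', 100);

$limit  = isset($_GET['limit'])  ? (int) $_GET['limit']  : MAX_LIMIT;
$limit  = max(1, min($limit, MAX_LIMIT));          // never more than MAX_LIMIT rows
$offset = isset($_GET['offset']) ? max(0, (int) $_GET['offset']) : 0;

// A real endpoint would also check a per-API-key request counter here
// to throttle clients.
$rows = fetchRows($offset, $limit);                // e.g. a LIMIT/OFFSET query

header('Content-Type: application/json');
echo json_encode(array(
    'results'     => $rows,
    'next_offset' => count($rows) === $limit ? $offset + $limit : null,
));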
42.
Exploring vs. Explaining
L. fineartamerica.com/featured/exploring-archimedes-david-robinson.html
R. sgp.undp.org/web/projects/10771/environmental_awareness_and_familiarity_with_animals.html
43.
Structure has a purpose.
fiveless.deviantart.com/art/Periodic-Table-of-the-Elements-147350318
48.
Chart junk makes baby Jesus cry.
L. www.flickr.com/photos/santo_cuervo/3693877386/
R. www.gereports.com/a-good-look-at-the-cost-of-chronic-diseases/
51.
It deosn't mttaer waht oredr the ltteers in a wrod
are, the olny iprmoetnt tihng is taht the frist and
lsat ltteres are at the rghit pclae. The rset can be a
tatol mses and you can sitll raed it wouthit a
porbelm. Tihs is bcuseae we do not raed ervey
lteter by it slef but the wrod as a wlohe.
IT DEOSN'T MTTAER WAHT OREDR THE LTTEERS IN
A WROD ARE, THE OLNY IPRMOETNT TIHNG IS TAHT
THE FRIST AND LSAT LTTERES ARE AT THE RGHIT
PCLAE. THE RSET CAN BE A TATOL MSES AND YOU
CAN SITLL RAED IT WOUTHIT A PORBELM. TIHS IS
BCUSEAE WE DO NOT RAED ERVEY LTETER BY IT
SLEF BUT THE WROD AS A WLOHE.
52.
The Stroop Effect
RED YELLOW BLUE GREEN
RED YELLOW BLUE GREEN
53.
Color: function vs. decoration
Martin Wattenberg and Fernanda Viégas, Chapter 11, Beautiful Visualization (O'Reilly Media)
54.
Andrew Odewahn, Chapter 8, Beautiful Visualization (O'Reilly Media)
55.
Be kind to the color blind
courses.washington.edu/info424/Labs/ChoroplethMap.html
68.
Some Numbers
Facebook, new data per day:
• 03/2008: 200 GB
• 04/2009: 2 TB
• 10/2009: 4 TB
• 03/2010: 12 TB
Google's processing jobs:
• 400 PB per month (in 2007!)
• Average job size is 180 GB
100.
Basic Principle: The Mapper
Mapper reads records and emits key and value pairs.
Take an Apache web server log file as an example:
• Each line is a record.
• Mapper extracts request URI and number of bytes sent.
• Mapper emits the URI as the key and the bytes as the value.
Parallelize by having log files per hour, splitting up the files into
even smaller chunks (by line) and so forth.
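A minimal mapper for this example as a PHP script for Hadoop Streaming (records on STDIN, tab-separated key/value pairs on STDOUT, as covered on the Job Processing slide). The regular expression assumes the common/combined Apache log format.

#!/usr/bin/php
<?php
// Streaming mapper sketch: read Apache access log lines from STDIN and
// emit "request URI <TAB> bytes sent". Unparsable lines are skipped.
while (($line = fgets(STDIN)) !== false) {
    // e.g. 192.0.2.1 - - [13/Mar/2011:06:25:43 -0600] "GET /foobar HTTP/1.1" 200 5432
    if (!preg_match('/"[A-Z]+ (\S+) [^"]*" \d{3} (\d+)/', $line, $m)) {
        continue;
    }
    echo $m[1], "\t", $m[2], "\n"; // key: URI, value: bytes sent
}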
106.
Basic Principle: The Reducer
All values (from all nodes) for the same key are sent to the
same reducer.
Keys get sorted, and for a simple count or sum you can do a first
reduce run on each mapper node once it's finished to cut down on
I/O (that's the combiner).
Apache web server log example to the rescue again:
• Reducer is invoked for a URI like "/foobar" and the list of all
byte counts recorded for it.
• Sum up the bytes, and we have the total traffic per URI!
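And a matching reducer sketch in PHP for Hadoop Streaming: because the input arrives sorted by key, it can sum bytes until the URI changes and then emit the total.

#!/usr/bin/php
<?php
// Streaming reducer sketch: input is "URI <TAB> bytes", sorted by URI.
// Sum the bytes for each URI and emit "URI <TAB> total bytes".
$currentUri = null;
$totalBytes = 0;

while (($line = fgets(STDIN)) !== false) {
    $parts = explode("\t", rtrim($line, "\n"), 2);
    if (count($parts) !== 2) {
        continue;
    }
    list($uri, $bytes) = $parts;

    if ($uri !== $currentUri) {
        if ($currentUri !== null) {
            echo $currentUri, "\t", $totalBytes, "\n";
        }
        $currentUri = $uri;
        $totalBytes = 0;
    }
    $totalBytes += (int) $bytes;
}

if ($currentUri !== null) {
    echo $currentUri, "\t", $totalBytes, "\n"; // flush the last key
}

For a quick local test, no cluster needed: cat access.log | php mapper.php | sort | php reducer.php. The same pair of scripts can be passed to Hadoop Streaming as the mapper and reducer commands (see the Job Processing slide).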
123.
Big Hadoop Installations
Facebook:
• Mostly used with Hive
• 8400 cores, 13 PB total storage capacity
o 8 cores, 32 GB RAM, 12 TB disk per node
o 1 GbE per node
• 4 GbE between racks
Yahoo:
• 40% of jobs use Pig
• > 100,000 CPU cores in > 25,000 servers
• Largest cluster: 4000 nodes
o 2 x 4 CPU cores and 16 GB RAM per node
134.
Hadoop at Facebook
Daily usage:
• 25 TB logged by Scribe
• 135 TB compressed data scanned
• 7500+ Hive jobs
• ~80k compute hours
Data per day growth:
• I/08: 200 GB
• II/09: 2 TB compressed
• III/09: 4 TB compressed
• I/10: 12 TB compressed
142.
HDFS Overview
• Designed for very large data sets, transparent compression,
block-based storage (64 MB block size by default)
• Designed for streaming rather than random reads
• Write-once, read-many (although there is a way to append)
• Stores data redundantly (three replicas by default), is aware
of your network topology
• Namenode has metadata and knows where blocks reside
• Datanodes hold the data
151.
Job Processing
• Input Formats split up your data into individual records
• Mappers do their work, then a partitioner partitions & sorts
• Combiner can perform local pre-reduce on each mapper
• Reducers perform reduction for each key
• Mapper, Combiner and Reducer can be an external process
o Called Hadoop Streaming, uses STDIN & STDOUT
Shameless plug: http://github.com/dzuelke/hadoophp