The Other Way of Doing Big Data: Declarative, Decoupled, Federated, Simple, and Resilient.
Also known as: How to Win at Scale and Influence People. Originally presented by Flip Kromer to the Research Board (http://www.researchboard.com/), June 2012.
11. • Manage 100s of machines: architecture as code
• Contain system complexity: relentlessly decouple
• Maintain coherency: federated truth
• Manage true costs: optimize for people not machines
• Manage failure & change: resiliency engineering
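A concrete (and purely illustrative) reading of the first bullet, "architecture as code": the cluster is declared as data, and a reconcile step drives the real machines toward that declaration. Infochimps' actual tool for this was Ironfan, a Chef-based Ruby DSL; the Python sketch below only mirrors the idea, and every name in it is hypothetical.

    # Purely illustrative: the cluster is a declaration, not a runbook.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Facet:
        name: str
        instances: int
        roles: tuple

    CLUSTER = (
        Facet("master", instances=1,  roles=("namenode", "jobtracker")),
        Facet("worker", instances=30, roles=("datanode", "tasktracker")),
    )

    def running(facet):
        # Stand-in for a cloud API query: how many machines of this facet are live?
        return 0

    def reconcile(cluster):
        # Converge reality toward the declaration instead of scripting steps.
        for facet in cluster:
            for _ in range(facet.instances - running(facet)):
                print(f"launch 1 x {facet.name} with roles {facet.roles}")

    reconcile(CLUSTER)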
12. The Other Way
Declarative, not Homogeneous
Decoupled, not Standardized
Federated, not Centralized
Simple, not Performant
Resilient, not Reliable
38. Data Stores in Production
• HBase
• ElasticSearch
• Cassandra
• TokyoTyrant
• SimpleDB
• MongoDB
• MySQL
• Redis
• sqlite
• whisper (graphite)
• file system
• S3
39. Programs Used for This Talk
• Emacs
• Keynote
• Preview
• Chrome
• ruby (pry)
• Skitch
• finder
• flickr.com
• google image search
• ssh
40. How’s my Batch Job Going?
• 1 x Job Status
• 1 x Counters & App Metrics
• N x Task Status
• M x Machine System Stats
• 1 x Cloud Status
• 1 x Chef Server
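Counting those sources makes the point: to answer one question you consult six different systems. A hypothetical fan-in sketch, where each fetcher stands in for a real source (job tracker, app counters, per-task status, per-machine stats, cloud console, Chef server):

    # Hypothetical sketch: one report assembled from every source on the slide.
    def job_status():    return {"state": "RUNNING"}                                   # 1 x
    def app_counters():  return {"records_in": 1_200_000_000}                          # 1 x
    def task_status():   return [{"task": i, "state": "DONE"} for i in range(400)]     # N x
    def machine_stats(): return [{"host": f"w{i:02}", "load": 3.1} for i in range(30)] # M x
    def cloud_status():  return {"instances": 31}                                      # 1 x
    def chef_status():   return {"nodes_converged": 31}                                # 1 x

    def batch_job_report():
        return {
            "job":      job_status(),
            "counters": app_counters(),
            "tasks":    task_status(),
            "machines": machine_stats(),
            "cloud":    cloud_status(),
            "chef":     chef_status(),
        }

    print(batch_job_report()["job"]["state"])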
52. n^2 law of coupling: 100 things vs. 5 + 3 + 2 things + 2 (tax)
53. n^2 law of coupling: 10,000 things to go wrong vs. 2500 + 900 + 400 + 400 = 4200 things to go wrong
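The arithmetic behind these two slides: couple n things together and roughly n^2 interactions can go wrong. A monolith of 100 pieces gives 100^2 = 10,000; slide 52's 5 + 3 + 2 (+ 2 tax) split, scaled to 100 pieces, means subsystems of 50, 30, and 20 plus about 20 pieces of integration tax, so 50^2 + 30^2 + 20^2 + 20^2 = 2500 + 900 + 400 + 400 = 4200. A two-line check:

    # n^2 law of coupling: failure modes grow with the square of coupled parts.
    def things_to_go_wrong(*sizes):
        return sum(n * n for n in sizes)

    print(things_to_go_wrong(100))             # monolith: 10,000
    print(things_to_go_wrong(50, 30, 20, 20))  # decoupled, plus tax: 4,200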
55. Infochimps.com 2011 [architecture diagram: infochimps.com, text search, Planet of the APIs, API acct'g, models, A/B testing, cloud services]
56. Infochimps.com 2012 [the same diagram a year later, much busier: datasets catalog API, API docs, text search, content, dashboards, Planet of the APIs, API acct'g, auth & payment, layout, console, models, A/B testing, blog, press, cloud services, collateral]
57. Infochimps.com 2012 [the same map redrawn as the systems behind it: icsexpl (infochimps), catalog API, capuchin (saas), elasticsrch, kanzi, beergoggls, Planet of the APIs, MongoDB, george, alphamale, MySQL, redis, WPEngine, totem, cloud services, hubspot]
58. [annotating the whole diagram] this drawing fits in my head
[pointing at the datasets catalog API] this app fits in my head, and my laptop
59. Infochimps.com 2012 [repeats the architecture diagram from slide 57]
This is on a 15-person organization.
Federated, meaning the data is semantically disparate.
People are walking around as if we used to have one kind of database and now we have two. The important fact isn't that one of them is sharded; the important fact is that they're proliferating -- and that's a good thing.
Google, Facebook, and Amazon had to solve the scalability problem.
Now I know this sounds like the lunacy of a ritalin-addled architecture astronaut spending too much time on StackOverflow.