Guest lecture 2013-08-27 at General Assembly in SF for the Data Science program taught by Jacob Bollinger and Thomson Nguyen https://generalassemb.ly/education/data-science/san-francisco
Many thanks to Thomson, Jacob, and the participants in the course. Excellent Q&A!
Received a bottle o' Cardhu (my fave Scotch) in payment for lecture, and since it's Burning Man Week, the city was emptied so we had enough to share with the class :)
Evidence:
https://plus.google.com/u/0/110794698656267747127/posts/GvjhhQ99CTs
An invited talk by Paco Nathan in the speaker series at the University of Chicago's Data Science for Social Good fellowship (2013-08-12) http://dssg.io/2013/05/21/the-fellowship-and-the-fellows.html
Learnings generalized from trends in Data Science:
a 30-year retrospective on Machine Learning,
a 10-year summary of Leading Data Science Teams,
and a 2-year survey of Enterprise Use Cases.
http://www.eventbrite.com/event/7476758185
The document discusses challenges in analytics for big data. It notes that big data refers to data that exceeds the capabilities of conventional algorithms and techniques to derive useful value. Some key challenges discussed include handling the large volume, high velocity, and variety of data types from different sources. Additional challenges include scalability for hierarchical and temporal data, representing uncertainty, and making the results understandable to users. The document advocates for distributed analytics from the edge to the cloud to help address issues of scale.
The Internet of Things (IoT) is a vision for a ubiquitous society wherein people and “Things” are connected in an immersively networked computing environment, with the connected “Things” providing utility to people, enterprises, and their digital shadows through intelligent social and commercial services. However, translating this idea into reality has been a work in progress for close to two decades, mostly due to assumptions that favour a “Things”-centric rather than a “Human”-centric approach, coupled with the evolution and deployment ecosystem of IoT technologies.
Estimates of the spread and economic impact of IoT over the next few years run to 50 billion or more connected “Things,” with a market exceeding $350 billion through smarter cities and infrastructure, intelligent appliances, and healthier lifestyles. While many of these potential benefits are real and achievable, the road to accomplishing them may need a rethink.
In the last few years, there has been a realization that an effective IoT architecture (particularly for emerging nations with limited technology penetration at the national scale) that is both affordable and sustainable should be based on tangible technology advances available now, ubiquitous capabilities of the present and near future, and practical application scenarios of social and entrepreneurial value. Hence, there is renewed interest in rethinking the above assumptions, and this exercise has led to a more plausible set of scenarios wherein humans, along with data, communication, and devices, play key roles.
This presentation attempts to disaggregate these core problems and offers a trajectory, with a set of design paradigms, for a renewed IoT ecosystem.
Data Science in 2016: Moving up, by Paco Nathan at Big Data Spain 2015 - Big Data Spain
This document discusses trends in data science in 2016, including how data science is moving into new use cases such as medicine, politics, government, and neuroscience. It also covers trends in hardware, generalized libraries, leveraging workflows, and frameworks that could enable a big leap ahead. The document discusses learning trends like MOOCs, inverted classrooms, collaborative learning, and how O'Reilly Media is embracing Jupyter notebooks. It also covers measuring distance between learners and subject communities, and the importance of both people and automation working together.
The document discusses big data basics, infrastructure, challenges, and use cases. It defines big data as large volumes of structured, semi-structured, and unstructured data that is difficult to process using traditional databases and software. Common big data infrastructure includes clustered network attached storage, object storage, Hadoop, and data appliances like HP Vertica and Teradata Aster. Challenges discussed include log management, data integrity, backup management, and database management in the big data era. Potential big data use cases include modeling risk, customer churn analysis, and recommendation engines.
Talk given at Los Alamos National Labs in Fall 2015.
As research becomes more data-intensive and platforms become more heterogeneous, we need to shift focus from performance to productivity.
This document provides an overview of big data by discussing its background and definitions. It describes how data has grown exponentially in recent years due to factors like the internet, cloud computing, and internet of things. Big data is defined as data that cannot be processed by traditional technologies due to its huge size, speed of growth, and variety of data types. The document outlines several common definitions of big data, including the 3Vs (volume, velocity, variety) and 4Vs (volume, variety, velocity, value) models. It aims to provide readers with a comprehensive understanding of the emerging field of big data.
An invited talk in the Big Data session of the Industrial Research Institute meeting in Seattle, Washington.
Some notes on how to train data science talent and exploit the fact that the membrane between academia and industry has become more permeable.
The document discusses developing an integrated framework to utilize big data for higher education institutions in Saudi Arabia. It aims to develop a framework to support decision making and improve performance in education sectors using big data. The study collected data through surveys and interviews to analyze factors affecting adoption and implementation of big data. The framework addresses issues related to adoption of big data in education.
Semantic Web Investigation within Big Data Context - Murad Daryousse
This document discusses how the semantic web can help address challenges associated with big data. It describes the 5 V's of big data: volume, variety, velocity, veracity, and value. For each V, it outlines related challenges in data acquisition, integration, and analysis. The document argues that semantic web concepts like ontologies, linked data, and reasoning can help solve problems of data heterogeneity, scale, and timeliness across different phases of the big data analysis pipeline, in order to ultimately extract value from data.
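As a small illustration of the linked-data idea, here is a minimal sketch in Python using rdflib, where two hypothetical sources describe the same entity with different vocabularies and a single SPARQL query spans both (all URIs and data below are invented for illustration):

```python
# Minimal sketch: linked data integrating two heterogeneous sources.
from rdflib import Graph, Literal, Namespace, URIRef

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
SCHEMA = Namespace("https://schema.org/")
person = URIRef("http://example.org/customer/42")  # shared URI across sources

source_a = Graph()
source_a.add((person, FOAF.name, Literal("Ada Lovelace")))

source_b = Graph()
source_b.add((person, SCHEMA.name, Literal("Ada Lovelace")))
source_b.add((person, SCHEMA.email, Literal("ada@example.org")))

# Graph union: because both sources identify the entity by URI,
# merging requires no schema mapping step.
merged = source_a + source_b

# Property-path alternation queries across both vocabularies at once.
q = """
SELECT ?name ?email WHERE {
  ?p foaf:name|schema:name ?name .
  OPTIONAL { ?p schema:email ?email }
}
"""
for name, email in merged.query(q, initNs={"foaf": FOAF, "schema": SCHEMA}):
    print(name, email)
```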
The current challenges and opportunities of big data and analytics in emergen... - IBM Analytics
Big data and analytics present many possibilities for emergency management specialists and first responders. Some of these benefits include pinpointing vulnerabilities, bringing in the right resources and maximizing existing resources to pave the way to adoption. However, these opportunities are not without challenges. Emergency management experts Adam Crowe, Director, Emergency Preparedness at Virginia Commonwealth University; William Moorhead, President of All Clear Emergency Management Group; and Gary Nestler, Associate Partner and Global Leader, Emergency Management solutions at IBM discuss these challenges and opportunities in this slideshare—which is intended to help disaster management stakeholders achieve the most accurate situational awareness using analytics.
Discover analytics solutions for emergency management http://ibm.co/emergencymgmt
This document discusses challenges and outlooks related to big data. It begins with an introduction describing how big data is being collected and analyzed in various fields such as science, education, healthcare, urban planning, and more. It then outlines the key phases in big data analysis: data acquisition and recording, information extraction and cleaning, data integration and representation, query processing and analysis, and result interpretation. For each phase, it discusses challenges and how existing techniques can be applied or extended to address big data issues. Some of the major challenges discussed are data scale, heterogeneity, lack of structure, privacy, timeliness, provenance, and visualization across the entire big data analysis pipeline.
This document discusses big data and its challenges related to the Internet of Things (IoT). It first defines big data and explains how the aggregation of data from many IoT systems can lead to big data. It then discusses some key challenges of big data, including issues with data volume, velocity, variety, and veracity. Specific challenges for big data from IoT systems are also reviewed, such as authentication, security, and uncertainty of data. Finally, the document outlines some potential solutions to big data challenges, such as using MapReduce for heterogeneous data, data cleaning techniques for inconsistencies, and cloud-based security platforms for IoT devices.
Roger Hoerl SAY Award presentation 2013 - Roger Hoerl
This document discusses how statistical engineering principles can help address challenges with "Big Data" projects. It argues that simply having powerful algorithms and large datasets does not guarantee good models or results. The leadership challenge for statisticians is to ensure Big Data projects are built on sound modeling foundations rather than hype. Statistical engineering principles like understanding data quality, using sequential approaches, and integrating subject matter knowledge can help improve the success of Big Data analyses and provide the statistical profession an opportunity for leadership in this area. Statistical engineering provides a framework to structure Big Data projects and incorporate fundamentals of good science that are sometimes overlooked.
In this presentation, I have talked about Big Data and its importance in brief. I have included the very basics of Data Science and its importance in the present day, through a case study. You can also get an idea about who a data scientist is and what all tasks he performs. A few applications of data science have been illustrated in the end.
A talk at the Urban Science workshop at the Puget Sound Regional Council, July 20, 2014, organized by the Northwest Institute for Advanced Computing, a joint effort between Pacific Northwest National Labs and the University of Washington.
The presentation is about the career path in the field of Data Science. Data Science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
The document provides an overview of data science applications and use cases. It defines data science as using computer science, statistics, machine learning and other techniques to analyze data and create data products to help businesses make better decisions. It discusses big data challenges, the differences between data science and software engineering, and key areas of data science competence including data analytics, engineering, domain expertise and data management. Finally, it outlines several common data science applications and use cases such as recommender systems, credit scoring, dynamic pricing, customer churn analysis and fraud detection with examples of how each works and real world cases.
This document provides an introduction to business analytics. It discusses how analytics has evolved from simple number crunching to a competitive strategy that is driving innovation. It explains the importance of analytics in decision making and its impact on organizational performance. Examples are given of companies that use analytics successfully, like Amazon's recommender system. The document outlines the data-driven decision making process and how analytics is used across organizations to solve problems and make decisions at different levels from process improvement to competitive strategy.
The document discusses big data analytics, including its characteristics, tools, and applications. It defines big data analytics as the application of advanced analytics techniques to large datasets. Big data is characterized by its volume, variety, and velocity. New tools and methods are needed to store, manage, and analyze big data. The document reviews different big data storage, processing, and analytics tools and methods that can be applied in decision making.
Data Science - An emerging Stream of Science with its Spreading Reach & Impact - Dr. Sunil Kr. Pandey
This is my presentation on the topic "Data Science - An emerging Stream of Science with its Spreading Reach & Impact". I have compiled statistics and data from different sources. It may be useful for students and anyone interested in this field of study.
This document provides an overview of data science. It defines data as facts such as numbers, words, measurements, and descriptions. Data science involves developing methods to analyze and extract useful insights from both structured and unstructured data. While data mining focuses on analyzing large datasets, data science covers the entire data lifecycle. There is a growing demand for data scientists as every industry relies on data. Data scientists use various statistical techniques to find patterns in data and gain knowledge. Netflix is used as a case study to show how it has become a data-driven business that uses data science to power recommendations and improve the customer experience.
This document discusses uncertainty in big data analytics. It begins by providing background on big data, defining the common "5 V's" characteristics of big data - volume, variety, velocity, veracity, and value. It then discusses uncertainty, which exists in big data due to noise, incompleteness, and inconsistency in data. The document surveys techniques for big data analytics and how uncertainty impacts machine learning, natural language processing, and other artificial intelligence approaches. It identifies challenges that uncertainty presents and strategies for mitigating uncertainty in big data analytics.
This document discusses big data and its characteristics. It provides examples of how companies like Walmart and Facebook handle large amounts of data. It defines big data and describes the types of data: structured, unstructured, and semi-structured. The key characteristics of big data are identified as volume, variety, velocity, and variability. The document concludes that with billions more people gaining internet access, the amount of data will continue growing exponentially and we have only begun to see the potential of big data.
The profile of the management (data) scientist: Potential scenarios and skill... - Juan Mateos-Garcia
Big and social media data opens up new scenarios and opportunities for management research (such as using internal communication data to map knowledge networks inside firms, or using web data to study firm capabilities and strategies). This presentation, given at the British Academy of Management 2014 conference, proposes a typology of such scenarios, describes the skills required to exploit them, and considers implications for the education and training of management researchers.
The document discusses the paradigm shift in social science research enabled by big data. Key points:
- Advances in data collection technologies and analytics tools now allow researchers to study social phenomena at an unprecedented scale, depth, and scope. This represents a potential scientific paradigm shift toward computational social science.
- Factors driving this shift include the massive growth of digital data from various sources, reduced costs of data collection, and new capabilities for large-scale empirical research.
- The new approaches enabled by big data help address traditional tradeoffs in research between generalizability, control, and realism. Large, unobtrusively collected data sets allow for more realistic and controlled studies of real-world phenomena.
On Digital Markets, Data, and Concentric Diversification - Bernhard Rieder
This document discusses how large tech companies like Google and Facebook have expanded from their original businesses through a strategy of concentric diversification. It argues that their accumulation of large data assets and algorithmic capabilities allows them to computerize new domains. For example, Google uses its knowledge bases and machine learning to expand from search into areas like self-driving cars. Facebook leverages its social graph and identity resolution to enter new ad tech businesses. The document analyzes how these companies' technological systems grow more valuable as their assets transfer to new sectors, creating economies of scale that affect market dynamics and relationships between firms.
Opportunities and methodological challenges of Big Data for official statist... - Piet J.H. Daas
1) The document discusses opportunities and challenges of using Big Data for official statistics. It describes Big Data as data that is difficult to collect, store, or process using conventional statistical systems due to issues of volume, velocity, structure, or variety.
2) The author outlines their experiences at Statistics Netherlands using various Big Data sources like traffic sensor data, mobile phone data, and social media data. They discuss methodological challenges in accessing and analyzing large volumes of data, dealing with noisy and unstructured data, and addressing issues of selectivity.
3) The document emphasizes the need for new skills like data science, high performance computing, and people with open and pragmatic mindsets to work with Big Data. It also addresses privacy
Toward a System Building Agenda for Data Integration (and Data Science) - juliennehar
Toward a System Building Agenda for Data Integration (and Data Science)
AnHai Doan, Pradap Konda, Paul Suganthan G.C., Adel Ardalan, Jeffrey R. Ballard, Sanjib Das, Yash Govind, Han Li, Philip Martinkus, Sidharth Mudgal, Erik Paulson, Haojun Zhang
University of Wisconsin-Madison
Abstract
We argue that the data integration (DI) community should devote far more effort to building systems, in order to truly advance the field. We discuss the limitations of current DI systems, and point out that there is already an existing popular DI “system” out there, which is PyData, the open-source ecosystem of 138,000+ interoperable Python packages. We argue that rather than building isolated monolithic DI systems, we should consider extending this PyData “system”, by developing more Python packages that solve DI problems for the users of PyData. We discuss how extending PyData enables us to pursue an integrated agenda of research, system development, education, and outreach in DI, which in turn can position our community to become a key player in data science. Finally, we discuss ongoing work at Wisconsin, which suggests that this agenda is highly promising and raises many interesting challenges.
1 Introduction
In this paper we focus on data integration (DI), broadly interpreted as covering all major data preparation steps such as data extraction, exploration, profiling, cleaning, matching, and merging [10]. This topic is also known as data wrangling, munging, curation, unification, fusion, preparation, and more. Over the past few decades, DI has received much attention (e.g., [37, 29, 31, 20, 34, 33, 6, 17, 39, 22, 23, 5, 8, 36, 15, 35, 4, 25, 38, 26, 32, 19, 2, 12, 11, 16, 2, 3]). Today, as data science grows, DI is receiving even more attention. This is because many data science applications must first perform DI to combine the raw data from multiple sources, before analysis can be carried out to extract insights.
Yet despite all this attention, today we do not really know whether the field is making good progress. The vast majority of DI works (with the exception of efforts such as Tamr and Trifacta [36, 15]) have focused on developing algorithmic solutions. But we know very little about whether these (ever-more-complex) algorithms are indeed useful in practice. The field has also built mostly isolated system prototypes, which are hard to use and combine, and are often not powerful enough for real-world applications. This makes it difficult to decide what to teach in DI classes. Teaching complex DI algorithms and asking students to do projects using our prototype systems can train them well for doing DI research, but are not likely to train them well for solving real-world DI problems in later jobs. Similarly, outreach to real users (e.g., domain scientists) is difficult. Given that we have ...
Applications of Big Data Analytics in Businesses - T.S. Lim
The document discusses big data and big data analytics. It begins with definitions of big data from various sources that emphasize the large volumes of structured and unstructured data. It then discusses key aspects of big data including the three Vs of volume, variety, and velocity. The document also provides examples of big data applications in various industries. It explains common analytical methods used in big data including linear regression, decision trees, and neural networks. Finally, it discusses popular tools and frameworks for big data analytics.
McKinsey Global Institute - Big data: The next frontier for innova....docx - andreecapon
McKinsey Global Institute
Big data: The next frontier for innovation, competition, and productivity
2. Big data techniques and technologies
A wide variety of techniques and technologies has been developed and adapted to aggregate, manipulate, analyze, and visualize big data. These techniques and technologies draw from several fields including statistics, computer science, applied mathematics, and economics. This means that an organization that intends to derive value from big data has to adopt a flexible, multidisciplinary approach. Some techniques and technologies were developed in a world with access to far smaller volumes and variety in data, but have been successfully adapted so that they are applicable to very large sets of more diverse data. Others have been developed more recently, specifically to capture value from big data. Some were developed by academics and others by companies, especially those with online business models predicated on analyzing big data.
This report concentrates on documenting the potential value that leveraging big data can create. It is not a detailed instruction manual on how to capture value, a task that requires highly specific customization to an organization’s context, strategy, and capabilities. However, we wanted to note some of the main techniques and technologies that can be applied to harness big data to clarify the way some of the levers for the use of big data that we describe might work. These are not comprehensive lists—the story of big data is still being written; new methods and tools continue to be developed to solve new problems. To help interested readers find a particular technique or technology easily, we have arranged these lists alphabetically. Where we have used bold typefaces, we are illustrating the multiple interconnections between techniques and technologies. We also provide a brief selection of illustrative examples of visualization, a key tool for understanding very large-scale data and complex analyses in order to make better decisions.
TECHNIQUES FOR ANALYZING BIG DATA
There are many techniques that draw on disciplines such as statistics and computer science (particularly machine learning) that can be used to analyze datasets. In this section, we provide a list of some categories of techniques applicable across a range of industries. This list is by no means exhaustive. Indeed, researchers continue to develop new techniques and improve on existing ones, particularly in response to the need
to analyze new combinations of data. We note that not all of these techniques strictly require the use of big data—some of them can be applied effectively to smaller datasets (e.g., A/B testing, regression analysis). However, all of the techniques we list here can be applied to big data and, in general, larger and more diverse datasets can be used to generate more numerous and insightful results than smaller, less diverse ones.
A/B testing. A technique in which a control group is compa ...
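The excerpt above names A/B testing as a technique that applies to datasets of any size. As a concrete illustration, here is a minimal sketch of a two-proportion z-test in Python using statsmodels; the conversion counts are invented for the example:

```python
# Minimal A/B test sketch: do variants A and B convert at different rates?
from statsmodels.stats.proportion import proportions_ztest

conversions = [412, 466]   # invented: successes in control (A) and treatment (B)
visitors = [10000, 10000]  # sample size of each group

z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# A p-value below the usual 0.05 threshold suggests the difference
# in conversion rates is unlikely to be due to chance alone.
```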
Big data presents challenges at the data, model, and system levels. At the data level, issues include heterogeneous sources, missing/uncertain values, and privacy/errors. At the model level, generating global models from local patterns is difficult. At the system level, linking complex relationships between data sources and handling growth is challenging. Addressing these issues requires high-performance computing, algorithms to analyze distributed data and models, and carefully designed systems to form useful patterns from unstructured data and identify trends over time. Big data technologies may help provide more accurate social sensing and understanding.
The document provides an overview of data science and what it entails. It discusses the hype around big data and data science, and how data science has evolved due to improvements in technology that allow for large-scale data processing. It defines data science as a process that involves collecting, cleaning, analyzing and extracting meaningful insights from data. Data scientists come from a variety of academic backgrounds and work in both industry and academia developing solutions to real-world problems using data-driven approaches.
This document discusses the need for a new paradigm in big data analytics using algorithms. It begins by describing the limitations of traditional analytics approaches like statistical analysis, data mining, visualization and business intelligence tools when applied to big data. These approaches are query-based and labor intensive. Emerging big data tools like Hadoop and in-memory databases help with storage and queries but do not provide automated insights. The document argues that the new paradigm should focus on algorithms that can automatically surface insights from data in seconds, replacing the need for data analysts to manually query databases. This represents a shift from humans digging for insights to algorithms surfacing insights for humans to evaluate.
Introduction to Data Analytics and data analytics life cycle - Dr. Radhey Shyam
The document provides an overview of data analytics and big data concepts. It discusses the characteristics of big data, including the four V's of volume, velocity, variety and veracity. It also describes different types of data like structured, semi-structured and unstructured data. The document then introduces big data platforms and tools like Hadoop, Spark and Cassandra. Finally, it discusses the need for data analytics in business, including enabling better decision making and improving efficiency.
NOVA Data Science Meetup 8-10-2017 Presentation - State of Data Science Educa... - NOVA DATASCIENCE
The document provides a brief history of data science as an academic field, outlines current data science educational programs, and predicts future trends. It notes that data science emerged from a series of academic concepts in the 1960s and gained popularity in the 2000s. Currently, there are three main educational paradigms for data science: business analytics, data science, and data engineering. The document also discusses best practices in data science education and predicts that future trends may include increased specialization, collaboration between education and industry, and evolving skills demands.
Big Data: Are you ready for it? Can you handle it? - ScaleFocus
Big data presents both opportunities and challenges for companies. It provides a competitive advantage but organizing, analyzing, and drawing accurate conclusions from vast amounts of unsorted data can be difficult. Companies must critically examine their data to avoid making miscalculations from biases, gaps, or false senses of reliability. Technical solutions like Hadoop can help by supporting flexible handling of multiple data sources at low cost for tasks like data staging, processing, and archiving. However, big data requires experienced teams to ask the right questions and leverage these tools to accomplish business goals, rather than viewing them as guarantees of success. Companies must assess their readiness by considering resources, change management, success criteria, and partner selection.
This document proposes an Impact Monitoring Framework to measure the impact of open data. It combines the principles of Social Return on Investment (SROI) with existing open data impact literature. The framework outlines a theory of change model with inputs, outputs, outcomes and impacts. Inputs refer to resources used to publish data. Outputs are deliverables like open data portals. Outcomes are re-use activities by third parties. Impacts adjust outcomes to estimate effects caused by open data alone. The framework could help focus resources on high-impact activities and improve new initiatives. Empirical testing of real open data projects is needed to validate the framework.
This document discusses data science career paths and the role of a data scientist. It defines data science as the scientific process of transforming data into insights to make better decisions. Data scientists are skilled at statistics, software engineering, machine learning, and communicating findings. The document outlines common data science career paths, including roles in fraud detection and social media analytics. It also lists important skills for data scientists such as data mining, machine learning, statistics, visualization, programming, and working with big data. Finally, it provides an example of tasks a data scientist might complete in a typical day.
Data and Analytics Career Paths, Presented at IEEE LYC'19.
About Speaker:
Ahmed Amr is a Data/Analytics Engineer at Rubikal, where he leads, develops, and creates daily data/analytics operations, including data ingestion, data streaming, data warehousing, and analytical dashboards. Ahmed graduated from the Computer Engineering Department, Alexandria University, and is currently pursuing his MSc degree in Computer Science at AAST. Professionally, Ahmed has worked with Egyptian/US startups such as Badr, Incorta, and WhoKnows to develop their data/analytics projects. Academically, Ahmed has worked as a Teaching Assistant in the CS department at AAST. Ahmed helps software companies develop robust data engineering infrastructure and powerful analytical insights.
References:
1) https://www.datacamp.com/community/tutorials/data-science-industry-infographic
2) Analytics: The real-world use of big data, IBM, Executive Report
This document summarizes several research projects related to big data and social science knowledge. It discusses projects that analyzed large social media platforms like Facebook, Twitter, and Wikipedia to study information diffusion and social influences. It also discusses challenges like securing access to commercial data and ensuring replicability of findings. Examples demonstrate how big data can provide novel insights but are limited by the objects studied and incomplete representation of populations. The document discusses debates around the implications of big data for privacy, prediction, exclusion, and manipulation. It argues that knowledge depends on how research technologies advance knowledge within ethical and legal frameworks.
Similar to Data Center Computing for Data Science: an evolution of machines, middleware, math, and Mesos
Human in the loop: a design pattern for managing teams working with ML - Paco Nathan
Strata CA 2018-03-08
https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/64223
Although it has long been used for use cases like simulation, training, and UX mockups, human-in-the-loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. One approach, active learning (a special case of semi-supervised learning), employs mostly automated processes based on machine learning models, but exceptions are referred to human experts, whose decisions help improve new iterations of the models.
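As a concrete illustration of this exception-routing pattern, here is a minimal sketch in Python assuming a scikit-learn text classifier; the training examples and the confidence threshold are invented, and this is not the implementation from the talk:

```python
# Minimal HITL sketch: confident predictions flow through automatically,
# low-confidence cases get referred to a human expert.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labeled_texts = ["great talk", "awful slides", "loved the demo", "boring intro"]
labels = [1, 0, 1, 0]  # toy sentiment labels

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(labeled_texts, labels)

THRESHOLD = 0.8  # hypothetical confidence cutoff; tune per use case

def route(text):
    """Return (route, label): auto-label or defer to a human expert."""
    proba = model.predict_proba([text])[0]
    if proba.max() >= THRESHOLD:
        return "auto", int(np.argmax(proba))
    # The expert's eventual label gets added to the training set,
    # improving the next iteration of the model.
    return "refer_to_expert", None

print(route("loved the slides"))
```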
Human-in-the-loop: a design pattern for managing teams that leverage ML - Paco Nathan
Strata Singapore 2017 session talk 2017-12-06
https://conferences.oreilly.com/strata/strata-sg/public/schedule/detail/65611
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called active learning allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We’ll consider some of the technical aspects — including available open source projects — as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn’t it applicable?
* How do HITL approaches compare/contrast with more “typical” use of Big Data?
* What’s the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time:
* In what ways do the humans involved learn from the machines?
In particular, we'll examine use cases at O'Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
Human-in-a-loop: a design pattern for managing teams which leverage ML - Paco Nathan
Human-in-a-loop: a design pattern for managing teams which leverage ML
Big Data Spain, 2017-11-16
https://www.bigdataspain.org/2017/talk/human-in-the-loop-a-design-pattern-for-managing-teams-which-leverage-ml
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called _active learning_ allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We'll consider some of the technical aspects -- including available open source projects -- as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn't it applicable?
* How do HITL approaches compare/contrast with more "typical" use of Big Data?
* What's the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time
* In what ways do the humans involved learn from the machines?
In particular, we'll examine use cases at O'Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
Humans in a loop: Jupyter notebooks as a front-end for AI - Paco Nathan
JupyterCon NY 2017-08-24
https://www.safaribooksonline.com/library/view/jupytercon-2017-/9781491985311/video313210.html
Paco Nathan reviews use cases where Jupyter provides a front-end to AI as the means for keeping "humans in the loop". This talk introduces *active learning* and the "human-in-the-loop" design pattern for managing how people and machines collaborate in AI workflows, including several case studies.
The talk also explores how O'Reilly Media leverages AI in Media, and in particular some of our use cases for active learning such as disambiguation in content discovery. We're using Jupyter as a way to manage active learning ML pipelines, where the machines generally run automated until they hit an edge case and refer the judgement back to human experts. In turn, the experts train the ML pipelines purely through examples, not feature engineering, model parameters, etc.
Jupyter notebooks serve as one part configuration file, one part data sample, one part structured log, one part data visualization tool. O'Reilly has released an open source project on GitHub called `nbtransom` which builds atop `nbformat` and `pandas` for our active learning use cases.
This work anticipates upcoming work on collaborative documents in JupyterLab, based on Google Drive. In other words, where the machines and people are collaborators on shared documents.
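As a rough illustration of people and machines sharing a notebook as a common document, here is a minimal sketch using nbformat directly (which nbtransom builds atop); the file names and the "label" metadata key are hypothetical, not nbtransom's actual API:

```python
# Minimal sketch: programmatically annotate a notebook, treating it as
# one part configuration file, one part structured log.
import nbformat

nb = nbformat.read("example.ipynb", as_version=4)  # hypothetical file

for cell in nb.cells:
    if cell.cell_type == "markdown" and "TODO" in cell.source:
        # Shared annotation that both machines and people can read/write.
        cell.metadata["label"] = "needs_expert_review"

# Append a cell recording the pending human review step.
nb.cells.append(nbformat.v4.new_markdown_cell("Reviewed by: (pending)"))
nbformat.write(nb, "example_annotated.ipynb")
```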
Humans in the loop: AI in open source and industry - Paco Nathan
Nike Tech Talk, Portland, 2017-08-10
https://niketechtalks-aug2017.splashthat.com/
O'Reilly Media gets to see the forefront of trends in artificial intelligence: what the leading teams are working on, which use cases are getting the most traction, previews of advances before they get announced on stage. Through conferences, publishing, and training programs, we've been assembling resources for anyone who wants to learn. An excellent recent example: Generative Adversarial Networks for Beginners, by Jon Bruner.
This talk covers current trends in AI, industry use cases, and recent highlights from the AI Conf series presented by O'Reilly and Intel, plus related materials from Safari learning platform, Strata Data, Data Show, and the upcoming JupyterCon.
Along with reporting, we're leveraging AI in Media. This talk dives into O'Reilly uses of deep learning -- combined with ontology, graph algorithms, probabilistic data structures, and even some evolutionary software -- to help editors and customers alike accomplish more of what they need to do.
In particular, we'll show two open source projects in Python from O'Reilly's AI team:
• pytextrank built atop spaCy, NetworkX, datasketch, providing graph algorithms for advanced NLP and text analytics
• nbtransom leveraging Project Jupyter for a human-in-the-loop design pattern approach to AI work: people and machines collaborating on content annotation
Lessons learned from 3 (going on 4) generations of Jupyter use cases at O'Reilly Media. In particular, about "Oriole" tutorials which combine video with Jupyter notebooks, Docker containers, backed by services managed on a cluster by Marathon, Mesos, Redis, and Nginx.
https://conferences.oreilly.com/fluent/fl-ca/public/schedule/detail/62859
https://conferences.oreilly.com/velocity/vl-ca/public/schedule/detail/62858
O'Reilly Media has experimented with different uses of Jupyter notebooks in their publications and learning platforms. Their latest approach embeds notebooks with video narratives in online "Oriole" tutorials, allowing authors to create interactive, computable content. This new medium blends code, data, text, and video into narrated learning experiences that run in isolated Docker containers for higher engagement. Some best practices for using notebooks in teaching include focusing on concise concepts, chunking content, and alternating between text, code, and outputs to keep explanations clear and linear.
See 2020 update: https://derwen.ai/s/h88s
SF Python Meetup, 2017-02-08
https://www.meetup.com/sfpython/events/237153246/
PyTextRank is a pure Python open source implementation of *TextRank*, based on the [Mihalcea 2004 paper](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) -- a graph algorithm which produces ranked keyphrases from texts. Keyphrases are generally more useful than simple keyword extraction. PyTextRank integrates use of `TextBlob` and `spaCy` for NLP analysis of texts, including full parse, named entity extraction, etc. It also produces auto-summarization of texts, making use of an approximation algorithm, `MinHash`, for better performance at scale. Overall, the package is intended to complement machine learning approaches -- specifically deep learning used for custom search and recommendations -- by developing better feature vectors from raw texts. This package is in production use at O'Reilly Media for text analytics.
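For a sense of what keyphrase extraction looks like in practice, here is a minimal sketch using pytextrank's current spaCy-pipeline API (which postdates, and differs from, the 2017-era interface described above):

```python
# Minimal sketch: ranked keyphrase extraction with pytextrank + spaCy.
import spacy
import pytextrank  # noqa: F401 -- importing registers the "textrank" component

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

doc = nlp("Mihalcea and Tarau showed that graph-based ranking "
          "extracts useful keyphrases from natural language texts.")

# Phrases come back sorted by TextRank score, highest first.
for phrase in doc._.phrases[:5]:
    print(f"{phrase.rank:.4f}  {phrase.text}")
```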
Use of standards and related issues in predictive analytics – Paco Nathan
My presentation at KDD 2016 in SF, in the "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data" morning track about PMML and PFA http://dmg.org/kdd2016.html
The document discusses how data science may reinvent learning and education. It begins with background on the author's experience in data teams and teaching. It then questions what an "Uber for education" may look like and discusses definitions of learning, education, and schools. The author argues interactive notebooks like Project Jupyter and flipped classrooms can improve learning at scale compared to traditional lectures or MOOCs. Content toolchains combining Jupyter, Thebe, Atlas and Docker are proposed for authoring and sharing computational narratives and code-as-media.
Jupyter for Education: Beyond Gutenberg and Erasmus – Paco Nathan
O'Reilly Learning is focusing on evolving learning experiences using Jupyter notebooks. Jupyter notebooks allow combining code, outputs, and explanations in a single document. O'Reilly is using Jupyter notebooks as a new authoring environment and is exploring features like computational narratives, code as a medium for teaching, and interactive online learning environments. The goal is to provide a better learning architecture and content workflow that leverages the capabilities of Jupyter notebooks.
GalvanizeU Seattle: Eleven Almost-Truisms About Data – Paco Nathan
http://www.meetup.com/Seattle-Data-Science/events/223445403/
Almost a dozen almost-truisms about Data that almost everyone should consider carefully as they embark on a journey into Data Science. There are a number of preconceptions about working with data at scale where the realities beg to differ. This talk estimates that number to be at least eleven, though probably much larger. At least that number has a great line from a movie. Let's consider some of the less-intuitive directions in which this field is heading, along with likely consequences and corollaries -- especially for those who are just now beginning to study the technologies, the processes, and the people involved.
Microservices, containers, and machine learning – Paco Nathan
http://www.oscon.com/open-source-2015/public/schedule/detail/41579
In this presentation, an open source developer community considers itself algorithmically. It shows how to surface data insights from the developer email forums for just about any Apache open source project, leveraging advanced techniques for natural language processing, machine learning, graph algorithms, time series analysis, etc. As an example, we use data from the Apache Spark email list archives to help understand its community better; however, the code can be applied to many other communities.
Exsto is an open source project that demonstrates Apache Spark workflow examples for SQL-based ETL (Spark SQL), machine learning (MLlib), and graph algorithms (GraphX). It surfaces insights about developer communities from their email forums. Natural language processing services in Python (based on NLTK, TextBlob, WordNet, etc.) get containerized and used to crawl and parse email archives. These produce JSON data sets, and then we run machine learning on a Spark cluster to find insights such as:
* What are the trending topic summaries?
* Who are the leaders in the community for various topics?
* Who discusses most frequently with whom?
This talk shows how to use cloud-based notebooks for organizing and running the analytics and visualizations. It reviews the background for how and why the graph analytics and machine learning algorithms generalize patterns within the data — based on open source implementations for two advanced approaches, Word2Vec and TextRank. The talk also illustrates best practices for leveraging functional programming for big data.
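As a toy illustration (not the Exsto code), training Word2Vec on tokenized email bodies with gensim might look like the sketch below; `tokenized_messages` is a hypothetical list of token lists parsed from the archive, and the parameters assume gensim 4.x:

    from gensim.models import Word2Vec

    tokenized_messages = [
        ["spark", "streaming", "micro", "batch"],
        ["graphx", "pagerank", "community"],
        # ... one token list per email body
    ]

    model = Word2Vec(
        sentences=tokenized_messages,
        vector_size=100,  # embedding dimensionality
        window=5,
        min_count=1,
        workers=2,
    )

    # nearest neighbors in the learned vector space
    print(model.wv.most_similar("spark", topn=3))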
GraphX: Graph analytics for insights about developer communities – Paco Nathan
The document provides an overview of Graph Analytics in Spark. It discusses Spark components and key distinctions from MapReduce. It also covers GraphX terminology and examples of composing node and edge RDDs into a graph. The document provides examples of simple traversals and routing problems on graphs. It discusses using GraphX for topic modeling with LDA and provides further reading resources on GraphX, algebraic graph theory, and graph analysis tools and frameworks.
Graph analytics can be used to analyze a social graph constructed from email messages on the Spark user mailing list. Key metrics like PageRank, in-degrees, and strongly connected components can be computed using the GraphX API in Spark. For example, PageRank was computed on the 4Q2014 email graph, identifying the top contributors to the mailing list.
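The talk computed these metrics with the GraphX API on Spark; purely as a single-machine illustration (an assumption, not the talk's code), the same measures can be sketched with networkx over hypothetical "A replied to B" edges:

    import networkx as nx

    # hypothetical reply edges from an email archive
    replies = [("alice", "bob"), ("bob", "carol"), ("carol", "alice"),
               ("dave", "alice"), ("dave", "bob")]

    G = nx.DiGraph()
    G.add_edges_from(replies)

    pagerank = nx.pagerank(G, alpha=0.85)   # top contributors
    in_degree = dict(G.in_degree())         # who receives the most replies
    n_scc = nx.number_strongly_connected_components(G)

    print(sorted(pagerank.items(), key=lambda kv: -kv[1]))
    print(in_degree, n_scc)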
Apache Spark and the Emerging Technology Landscape for Big Data – Paco Nathan
The document discusses Apache Spark and its role in big data and emerging technologies for big data. It provides background on MapReduce and the emergence of specialized systems. It then discusses how Spark provides a unified engine for batch processing, iterative jobs, SQL queries, streaming, and more. It can simplify programming by using a functional approach. The document also discusses Spark's architecture and performance advantages over other frameworks.
QCon São Paulo: Real-Time Analytics with Spark Streaming – Paco Nathan
The document provides an overview of real-time analytics using Spark Streaming. It discusses Spark Streaming's micro-batch approach of treating streaming data as a series of small batch jobs. This allows for low-latency analysis while integrating streaming and batch processing. The document also covers Spark Streaming's fault tolerance mechanisms and provides several examples of companies like Pearson, Guavus, and Sharethrough using Spark Streaming for real-time analytics in production environments.
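As a minimal sketch of the micro-batch model described above (assuming the classic DStream API; the source host/port are hypothetical), a streaming word count treats the stream as a series of small batch jobs:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="streaming-wordcount")
    ssc = StreamingContext(sc, batchDuration=10)   # 10-second micro-batches

    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()   # print each micro-batch's counts

    ssc.start()
    ssc.awaitTermination()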
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More – Paco Nathan
Spark and Databricks component of the O'Reilly Media webcast "2015 Data Preview: Spark, Data Visualization, YARN, and More", as a preview of the 2015 Strata + Hadoop World conference in San Jose http://www.oreilly.com/pub/e/3289
A New Year in Data Science: ML Unpaused – Paco Nathan
This document summarizes Paco Nathan's presentation at Data Day Texas in 2015. Some key points:
- Paco Nathan discussed observations and trends from the past year in machine learning, data science, big data, and open source technologies.
- He argued that the definitions of data science and statistics are flawed and ignore important areas like development, visualization, and modeling real-world business problems.
- The presentation covered topics like functional programming approaches, streaming approximations, and the importance of an interdisciplinary approach combining computer science, statistics, and other fields like physics.
- Paco Nathan advocated for newer probabilistic techniques for analyzing large datasets that provide approximations using less resources compared to traditional batch processing approaches.
Microservices, Containers, and Machine Learning – Paco Nathan
Session talk for Data Day Texas 2015, showing GraphX and SparkSQL for text analytics and graph analytics of an Apache developer email list -- including an implementation of TextRank in Spark.
What do a Lego brick and the XZ backdoor have in common? – Speck&Tech
ABSTRACT: At first glance, a Lego brick and the XZ backdoor might have in common the fact that both are building blocks, or dependencies, of creative and software projects. In reality, a Lego brick and the case of the XZ backdoor share much more than that.
Join the presentation to dive into a story of interoperability, standards, and open formats, and then discuss the important role that contributors play in a sustainable open source community.
BIO: An advocate for free software and for standard, open formats. She has been an active member of the Fedora and openSUSE projects and co-founded the LibreItalia Association, where she has been involved in several LibreOffice-related events, migrations, and training efforts. She previously worked on LibreOffice migrations and training courses for several public administrations and private organizations. Since January 2020 she has worked at SUSE as a Software Release Engineer for Uyuni and SUSE Manager; when not pursuing her passion for computers and for Geeko, she cultivates her curiosity about astronomy (the source of her nickname, deneb_alpha).
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, as well as schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Generative AI Deep Dive: Advancing from Proof of Concept to Production – Aggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Encryption in Microsoft 365 – ExpertsLive Netherlands 2024 – Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AI – Vladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Communications Mining Series – Zero to Hero – Session 1 – DianaGray10
This session provides an introduction to UiPath Communications Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communications Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack – shyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Climate Impact of Software Testing at Nordic Testing Days – Kari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing is discussed in the talk. ICT and testing must carry their part of the global responsibility to help counter climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint: a positive impact on the climate. Sustainability can be added to the quality characteristics, and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ... – James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Securing your Kubernetes cluster: a step-by-step guide to success! – KatiaHIMEUR1
Today, after several years of existence, with an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Observability Concepts EVERY Developer Should Know – DeveloperWeek Europe – Paige Cruz
Monitoring and observability aren’t traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to, plus whatever is part of our current company’s observability stack.
While the dev and ops silo continues to crumble… many organizations still relegate monitoring & observability to ops, infra, and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on.
Large Language Model (LLM) and its Geospatial Applications
Data Center Computing for Data Science: an evolution of machines, middleware, math, and Mesos
1. General Assembly SF, 2013-08-27:
“Data Center Computing for Data Science:
an evolution of machines, middleware, math, and Mesos”
Learnings generalized from trends in Data Science:
a 30-year retrospective on Machine Learning,
a 10-year summary of Leading Data Science Teams,
and a 2-year survey of Enterprise Use Cases
Paco Nathan @pacoid
Chief Scientist, Mesosphere
2. Learnings generalized from trends in Data Science:
1. the practice of leading data science teams
2. strategies for leveraging data at scale
3. machine learning and optimization
4. large-scale data workflows
5. an evolution of cluster computing
GA/SF, 2013-08-27
3. Statistical Thinking
employing a mode of thought which includes both logical and analytical reasoning: evaluating the whole of a problem, as well as its component parts; attempting to assess the effects of changing one or more variables
this approach attempts to understand not just problems and solutions, but also the processes involved and their variances
particularly valuable in Big Data work when combined with hands-on experience in physics – roughly 50% of my peers come from physics or physical engineering… programmers typically don’t think this way… however, both systems engineers and data scientists must
(diagram: Process, Variation, Data, Tools)
4. Modeling
back in the day, we worked with practices based on
data modeling
1. sample the data
2. fit the sample to a known distribution
3. ignore the rest of the data
4. infer, based on that fitted distribution
that served well with ONE computer, ONE analyst,
ONE model… just throw away annoying “extra” data
circa late 1990s: machine data, aggregation, clusters, etc.
algorithmic modeling displaced the prior practices
of data modeling
because the data won’t fit on one computer anymore
5. Two Cultures
“A new research community using these tools sprang up. Their goal
was predictive accuracy. The community consisted of young computer
scientists, physicists and engineers plus a few aging statisticians.
They began using the new tools in working on complex prediction
problems where it was obvious that data models were not applicable:
speech recognition, image recognition, nonlinear time series prediction,
handwriting recognition, prediction in financial markets.”
Statistical Modeling: The Two Cultures
Leo Breiman, 2001
bit.ly/eUTh9L
chronicled a sea change from data modeling (silos, manual
process) to the rising use of algorithmic modeling (machine
data for automation/optimization) which led in turn to the
practice of leveraging inter-disciplinary teams
6. approximately 80% of the costs for data-related projects
gets spent on data preparation – mostly on cleaning up
data quality issues: ETL, log files, etc., generally by socializing
the problem
unfortunately, data-related budgets tend to go into
frameworks that can only be used after clean up
most valuable skills:
‣ learn to use programmable tools that prepare data
‣ learn to understand the audience and their priorities
‣ learn to socialize the problems, knocking down silos
‣ learn to generate compelling data visualizations
‣ learn to estimate the confidence for reported results
‣ learn to automate work, making process repeatable
What is needed most?
(figure: a wall of raw event-log names – e.g., UniqueRegistration, NUI:TutorialMode, BirthdayMessage, CreateNewPet, FeedPet, ChatNow, WebsiteLogin, NUI:DressUpMode – illustrating the messy machine-generated data that needs cleanup)
7. Team Process = Needs
• discovery: help people ask the right questions
• modeling: allow automation to place informed bets
• integration: deliver data products at scale to LOB end uses
• apps: build smarts into product features
• systems: keep infrastructure running, cost-effective
roles spanning the process: analysts, engineers, inter-disciplinary leadership
8. Team Composition = Roles
• Domain Expert: business process, stakeholder
• Data Scientist: data prep, discovery, modeling, etc.
• App Dev: software engineering, automation
• Ops: systems engineering, availability
introduced capability: data science
leverage non-traditional pairing among roles, to complement skills and tear down silos
10. Alternatively, Data Roles × Skill Sets
Harlan Harris, et al.
datacommunitydc.org/blog/wp-content/uploads/2012/08/SkillsSelfIDMosaic-edit-500px.png
Analyzing the Analyzers
Harlan Harris, Sean Murphy,
Marck Vaisman
O’Reilly, 2013
amazon.com/dp/B00DBHTE56
11. Learning Curves
difficulties in the commercial use of distributed systems
often get represented as issues of managing complexity
much of the risk in managing a data science team is about
budgeting for learning curve: some orgs practice a kind of
engineering “conservatism”, with highly structured process
and strictly codified practices – people learn a few things
well, then avoid having to struggle with learning many new
things perpetually…
that anti-pattern leads to big teams, low ROI
(diagram: complexity vs. scale)
ultimately, the challenge is about managing learning curves within a social context
12. Learnings generalized from trends in Data Science:
1. the practice of leading data science teams
2. strategies for leveraging data at scale
3. machine learning and optimization
4. large-scale data workflows
5. an evolution of cluster computing
GA/SF, 2013-08-27
13. Business Disruption through Data
Geoffrey Moore
Mohr Davidow Ventures, author of Crossing The Chasm
@Hadoop Summit, 2012:
what Amazon did to the retail sector… has put the
entire Global 1000 on notice over the next decade…
data as the major force… mostly through apps –
verticals, leveraging domain expertise
Michael Stonebraker
INGRES, PostgreSQL, Vertica, VoltDB, Paradigm4, etc.
@XLDB, 2012:
complex analytics workloads are now displacing SQL
as the basis for Enterprise apps
14. Data Categories
Three broad categories of data
Curt Monash, 2010
dbms2.com/2010/01/17/three-broad-categories-of-data
• Human/Tabular data – human-generated data which fits into tables/arrays
• Human/Nontabular data – all other data generated by humans
• Machine-Generated data
let’s now add other useful distinctions:
• Open Data
• Curated Metadata
• A/D conversion for sensors (IoT)
15. Open Data notes
successful apps incorporate three components:
• Big Data (consumer interest, personalization)
• Open Data (monetizing public data)
• Curated Metadata
most of the largest Cascading deployments leverage some
Open Data components: Climate Corp, Factual, Nokia, etc.
consider buildingeye.com, which aggregates building permits:
• pricing data for home owners looking to remodel
• sales data for contractors
• imagine joining data with building inspection history,
for better insights about properties for sale…
research notes about
Open Data use cases:
goo.gl/cd995T
16. Trends in Public Administration
late 1880s – late 1920s (Woodrow Wilson)
as hierarchy, bureaucracy → only for the most educated, elite
late 1920s – late 1930s
as a business, relying on “Scientific Method”, gov as a process
late 1930s – late 1940s (Robert Dahl)
relationships, behavioral-based → policy not separate from politics
late 1940s – 1980s
yet another form of management → less “command and control”
1980s – 1990s (David Osborne, Ted Gaebler)
New Public Management → service efficiency, more private sector
1990s – present (Janet & Robert Denhardt)
Digital Age → transparency, citizen-based “debugging”, bankruptcies
Adapted from:
The Roles, Actors, and Norms Necessary to
Institutionalize Sustainable Collaborative Governance
Peter Pirnejad
USC Price School of Policy
2013-05-02
Drivers, circa 2013
• governments have run out of money,
cannot increase staff and services
• better data infra at scale (cloud, OSS, etc.)
• machine learning techniques to monetize
• viable ecosystem for data products, APIs
• mobile devices enabling use cases
17. Open Data ecosystem
(pipeline: municipal departments → publishing platforms → aggregators → data product vendors → end use cases; e.g., Palo Alto/Chicago/DC → Junar/Socrata → OpenStreetMap/WalkScore → Factual/Marinexplore → Facebook/Climate)
Data feeds structured for public private partnerships
18. Open Data ecosystem – caveats for agencies
(same ecosystem pipeline as above)
Required Focus
• respond to viable use cases
• not budgeting hackathons
19. Open Data ecosystem – caveats for publishers
(same ecosystem pipeline as above)
Required Focus
• surface the metadata
• curate, allowing for joins/aggregation
• not scans as PDFs
20. Open Data ecosystem – caveats for aggregators
(same ecosystem pipeline as above)
Required Focus
• make APIs consumable by automation
• allow for probabilistic usage
• not OSS licensing for data
21. Open Data ecosystem – caveats for data vendors
(same ecosystem pipeline as above)
Required Focus
• supply actionable data
• track data provenance carefully
• provide feedback upstream, i.e., cleaned data at source
• focus on core verticals
22. Open Data ecosystem – caveats for end uses
(same ecosystem pipeline as above)
Required Focus
• address consumer needs
• identify community benefits of the data
23. Recipes for Success
algorithmic modeling + machine data (Big Data) + curation, metadata + Open Data
→ data products, as feedback into automation
→ evolution of feedback loops
→ less about “bigness”, more about complexity
internet of things + A/D conversion + more complex analytics
→ accelerated evolution, additional feedback loops
→ orders of magnitude higher data rates
“A kind of Cambrian explosion” (images: National Geographic)
24. Trendlines
Big Data? we’re just getting started:
• ~12 exabytes/day, jet turbines on commercial flights
• Google self-driving cars, ~1 Gb/s per vehicle
• National Instruments initiative: Big Analog Data™
• 1m resolution satellites skyboximaging.com
• open resource monitoring reddmetrics.com
• Sensing XChallenge nokiasensingxchallenge.org
consider the implications of Jawbone, Nike, etc.,
plus the effects of Google Glass…
technologyreview.com/...
26. Learnings generalized from trends in Data Science:
1. the practice of leading data science teams
2. strategies for leveraging data at scale
3. machine learning and optimization
4. large-scale data workflows
5. an evolution of cluster computing
GA/SF, 2013-08-27
27. Learning Theory
in general, apps alternate between learning patterns/rules and retrieving similar things…
machine learning – scalable, arguably quite ad-hoc, generally “black box” solutions, enabling you to make billion dollar mistakes, with oh so much commercial emphasis (i.e., the “heavy lifting”)
statistics – rigorous, much slower to evolve, confidence and rationale become transparent, preventing you from making billion dollar mistakes; any good commercial project has ample stats work used in QA (i.e., “CYA, cover your analysis”)
once Big Data projects get beyond merely digesting log files, optimization will likely become the next overused buzzword :)
28. Generalizations about Machine Learning…
great introduction to ML, plus a proposed categorization
for comparing different machine learning approaches:
A Few Useful Things to Know about Machine Learning
Pedro Domingos, U Washington
homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
toward a categorization for Machine Learning algorithms:
• representation: classifier must be represented in some
formal language that computers can handle (algorithms, data
structures, etc.)
• evaluation: evaluation function (objective function, scoring
function) is needed to distinguish good classifiers from bad
ones
• optimization: method to search among the classifiers in
the language for the highest-scoring one
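To make the three components concrete, here is a toy sketch (not from the paper) where the representation is a decision stump, the evaluation is accuracy, and the optimization is exhaustive search; the data set is synthetic:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = (X[:, 1] > 0.2).astype(int)        # synthetic labels

    def stump_predict(X, feature, threshold):
        # representation: threshold on a single feature
        return (X[:, feature] > threshold).astype(int)

    def accuracy(y_true, y_pred):
        # evaluation: fraction of correct predictions
        return float(np.mean(y_true == y_pred))

    # optimization: exhaustive search over candidate stumps
    best = max(
        ((f, t) for f in range(X.shape[1]) for t in np.linspace(-2, 2, 41)),
        key=lambda ft: accuracy(y, stump_predict(X, *ft)),
    )
    print("best stump:", best,
          "accuracy:", accuracy(y, stump_predict(X, *best)))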
29. Something to consider about Algorithms…
many algorithm libraries used today are based on implementations
back when people used DO loops in FORTRAN, 30+ years ago
MapReduce is Good Enough?
Jimmy Lin, U Maryland
umiacs.umd.edu/~jimmylin/publications/Lin_BigData2013.pdf
astrophysics and genomics are light years ahead of e-commerce in
terms of data rates and sophisticated algorithms; that work – as Breiman
suggested in 2001 – may take a few years to percolate into industry
other game-changers:
• streaming algorithms, sketches, probabilistic data structures
• significant “Big O” complexity reduction (e.g., skytree.net)
• better architectures and topologies (e.g., GPUs and CUDA)
• partial aggregates – parallelizing workflows
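As one concrete example of the sketches mentioned above, here is a minimal Count-Min sketch (an illustrative toy, not tuned for production): it answers frequency queries in sub-linear memory, overestimating with bounded error:

    import hashlib

    class CountMin:
        def __init__(self, width=1024, depth=4):
            self.width, self.depth = width, depth
            self.table = [[0] * width for _ in range(depth)]

        def _buckets(self, item):
            for row in range(self.depth):
                h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
                yield row, int(h, 16) % self.width

        def add(self, item, count=1):
            for row, col in self._buckets(item):
                self.table[row][col] += count

        def query(self, item):
            # true count <= estimate, with high probability
            return min(self.table[row][col] for row, col in self._buckets(item))

    cm = CountMin()
    for word in ["spark", "spark", "hadoop", "spark"]:
        cm.add(word)
    print(cm.query("spark"))   # >= 3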
30. Make It Sparse…
also, take a moment to check this out…
(and related work on sparse Cholesky, etc.)
QR factorization of a “tall-and-skinny” matrix
• used to solve many data problems at scale,
e.g., PCA, SVD, etc.
• numerically stable with efficient implementation
on large-scale Hadoop clusters
suppose that you have a sparse matrix of customer
interactions where there are 100MM customers,
with a limited set of outcomes…
cs.purdue.edu/homes/dgleich
stanford.edu/~arbenson
github.com/ccsevers/scalding-linalg
David Gleich, slideshare.net/dgleich
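A single-machine sketch of why the tall-and-skinny case matters (illustrative only; the cited work distributes this across Hadoop clusters): when m >> n, the n-by-n factor R is tiny even if the interaction matrix is huge, and least squares or PCA can then operate on R:

    import numpy as np

    rng = np.random.default_rng(1)
    m, n = 100_000, 8                 # tall and skinny: many rows, few columns
    A = rng.normal(size=(m, n))
    b = rng.normal(size=m)

    Q, R = np.linalg.qr(A)            # reduced QR: Q is m x n, R is n x n
    x = np.linalg.solve(R, Q.T @ b)   # least-squares solution via QR

    print(R.shape)                    # (8, 8) – small enough for one machine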
31. Sparse Matrix Collection
for those times when you really, really need
a wide variety of sparse matrix examples…
University of Florida Sparse Matrix Collection
cise.ufl.edu/research/sparse/matrices/
Tim Davis, U Florida
cise.ufl.edu/~davis/welcome.html
Yifan Hu, AT&T Research
www2.research.att.com/~yifanhu/
32. A Winning Approach…
consider that if you know priors about a system, then
you may be able to leverage low dimensional structure
within high dimensional data… what impact does that
have on sampling rates?
1. real-world data
2. graph theory for representation
3. sparse matrix factorization for production work
4. cost-effective parallel processing
for machine learning apps at scale
33. Just Enough Mathematics?
having a solid background in statistics becomes vital,
because it provides formalisms for what we’re trying
to accomplish at scale
along with that, some areas of math help – regardless
of the “calculus threshold” invoked at many universities…
linear algebra e.g., calculating algorithms for large-scale apps efficiently
graph theory e.g., representation of problems in a calculable language
abstract algebra e.g., probabilistic data structures in streaming analytics
topology e.g., determining the underlying structure of the data
operations research e.g., techniques for optimization … in other words, ROI
34. ADMM: a general approach for optimizing learners
Distributed Optimization and Statistical Learning
via the Alternating Direction Method of Multipliers
Stephen Boyd, Neal Parikh, et al., Stanford
stanford.edu/~boyd/papers/admm_distr_stats.html
“Throughout, the focus is on applications rather than theory, and a main goal is
to provide the reader with a kind of ‘toolbox’ that can be applied in many situations
to derive and implement a distributed algorithm of practical use. Though the focus
here is on parallelism, the algorithm can also be used serially, and it is interesting
to note that with no tuning, ADMM can be competitive with the best known
methods for some problems.”
“While we have emphasized applications that can be concisely explained, the
algorithm would also be a natural fit for more complicated problems in areas
like graphical models. In addition, though our focus is on statistical learning
problems, the algorithm is readily applicable in many other cases, such as in
engineering design, multi-period portfolio optimization, time series analysis,
network flow, or scheduling.”
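A compact numpy sketch of ADMM applied to the lasso (minimize ||Ax - b||^2/2 + lam*||x||_1), following the update equations in the monograph cited above; the problem data here are synthetic:

    import numpy as np

    def soft_threshold(v, k):
        return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

    def admm_lasso(A, b, lam, rho=1.0, iters=200):
        n = A.shape[1]
        x = z = u = np.zeros(n)
        # factor (A^T A + rho I) once; it is reused every iteration
        L = np.linalg.cholesky(A.T @ A + rho * np.eye(n))
        Atb = A.T @ b
        for _ in range(iters):
            x = np.linalg.solve(L.T, np.linalg.solve(L, Atb + rho * (z - u)))
            z = soft_threshold(x + u, lam / rho)   # prox of the l1 term
            u = u + x - z                          # dual update
        return z

    rng = np.random.default_rng(2)
    A = rng.normal(size=(50, 20))
    x_true = np.zeros(20); x_true[:3] = [2.0, -1.0, 0.5]
    b = A @ x_true + 0.01 * rng.normal(size=50)
    print(np.round(admm_lasso(A, b, lam=1.0), 2))   # recovers the sparse signal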
35. Learnings generalized from trends in Data Science:
1. the practice of leading data science teams
2. strategies for leveraging data at scale
3. machine learning and optimization
4. large-scale data workflows
5. an evolution of cluster computing
GA/SF, 2013-08-27
36. Enterprise Data Workflows
middleware for Big Data applications is evolving, with commercial examples that include:
• Cascading, Lingual, Pattern, etc. – Concurrent
• ParAccel Big Data Analytics Platform – Actian
• Anaconda supporting IPython Notebook, Pandas, Augustus, etc. – Continuum Analytics
(workflow diagram: data sources → ETL → data prep → predictive model → end uses)
37. Anatomy of an Enterprise app
definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…
(workflow diagram, annotated: ANSI SQL for ETL)
38. Anatomy of an Enterprise app
definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…
(workflow diagram, annotated: J2EE for business logic)
39. Anatomy of an Enterprise app
definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…
(workflow diagram, annotated: SAS for predictive models)
40. Anatomy of an Enterprise app
definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…
(workflow diagram, annotated: ANSI SQL for ETL and SAS for predictive models represent most of the licensing costs…)
41. Anatomy of an Enterprise app
definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…
(workflow diagram, annotated: J2EE for business logic represents most of the project costs…)
42. Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source
(workflow diagram: data sources → ETL → data prep → predictive model → end uses; Lingual: DW → ANSI SQL; Pattern: SAS, R, etc. → PMML; business logic in Java, Clojure, Scala, etc.; source taps for Cassandra, JDBC, Splunk, etc.; sink taps for Memcached, HBase, MongoDB, etc.)
a compiler sees it all… one connected DAG:
• optimization
• troubleshooting
• exception handling
• notifications
cascading.org
43. Anatomy of an Enterprise app
a compiler sees it all… (same workflow diagram as above)
Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source

FlowDef flowDef = FlowDef.flowDef()
  .setName( "etl" )
  .addSource( "example.employee", emplTap )
  .addSource( "example.sales", salesTap )
  .addSink( "results", resultsTap );

SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( sqlStatement );

flowDef.addAssemblyPlanner( sqlPlanner );

cascading.org
44. Anatomy of an Enterprise app
a compiler sees it all… (same workflow diagram as above)
Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source

FlowDef flowDef = FlowDef.flowDef()
  .setName( "classifier" )
  .addSource( "input", inputTap )
  .addSink( "classify", classifyTap );

PMMLPlanner pmmlPlanner = new PMMLPlanner()
  .setPMMLInput( new File( pmmlModel ) )
  .retainOnlyActiveIncomingFields();

flowDef.addAssemblyPlanner( pmmlPlanner );
45. Cascading – functional programming
Key insight: MapReduce is based on functional programming
– back to LISP in the 1970s. Apache Hadoop use cases are
mostly about data pipelines, which are functional in nature.
to ease staffing problems as “Main Street” Enterprise firms
began to embrace Hadoop, Cascading was introduced
in late 2007, as a new Java API to implement functional
programming for large-scale data workflows:
• leverages JVM and Java-based tools without any
need to create new languages
• allows programmers who have J2EE expertise
to leverage the economics of Hadoop clusters
Edgar Codd alluded to this (DSLs for structuring data)
in his original paper about the relational model
46. Cascading – functional programming
• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc.,
have invested in open source projects atop Cascading –
used for their large-scale production deployments
• new case studies for Cascading apps are mostly based on
domain-specific languages (DSLs) in JVM languages which
emphasize functional programming:
Cascalog in Clojure (2010)
Scalding in Scala (2012)
github.com/nathanmarz/cascalog/wiki
github.com/twitter/scalding/wiki
Why Adopting the Declarative Programming Practices Will Improve Your Return from Technology
Dan Woods, 2013-04-17, Forbes
forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-practices-will-improve-your-return-from-technology/
47. Workflow Abstraction – pattern language
Cascading uses a “plumbing” metaphor in Java to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.
(flow diagram elements: Document Collection, Tokenize, Regex token, Scrub token, Stop Word List, HashJoin Left/RHS, GroupBy token, Count, Word Count – with M/R phase boundaries)
data is represented as flows of tuples
operations in the flows bring functional programming aspects into Java
A Pattern Language
Christopher Alexander, et al.
amazon.com/dp/0195019199
48. Workflow Abstraction – business process
following the essence of literate programming, Cascading
workflows provide statements of business process
this recalls a sense of business process management
for Enterprise apps (think BPM/BPEL for Big Data)
Cascading creates a separation of concerns between
business process and implementation details (Hadoop, etc.)
this is especially apparent in large-scale Cascalog apps:
“Specify what you require, not how to achieve it.”
by virtue of the pattern language, the flow planner then
determines how to translate business process into efficient,
parallel jobs at scale
49. The Ubiquitous Word Count
Definition: count how often each word appears in a collection of text documents

void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");

void reduce (String word, Iterator group):
  int count = 0;
  for each pc in group:
    count += Int(pc);
  emit(word, String(count));

this simple program provides an excellent test case for parallel processing:
• requires a minimal amount of code
• demonstrates use of both symbolic and numeric values
• shows a dependency graph of tuples as an abstraction
• is not many steps away from useful search indexing
• serves as a “Hello World” for Hadoop apps
a distributed computing framework that runs Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems
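For readers who want to run the idea, here is a single-process Python analogue of the pseudocode above (an illustration, not Hadoop code), with a dictionary standing in for the shuffle phase that groups emitted pairs by key:

    from collections import defaultdict

    def map_phase(doc_id, text):
        for w in text.split():            # stand-in for segment(text)
            yield (w, 1)

    def reduce_phase(word, group):
        yield (word, sum(group))

    docs = {"doc1": "a quick brown fox", "doc2": "a lazy dog and a fox"}

    groups = defaultdict(list)            # the "shuffle": group values by key
    for doc_id, text in docs.items():
        for word, count in map_phase(doc_id, text):
            groups[word].append(count)

    for word in sorted(groups):
        print(next(reduce_phase(word, groups[word])))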
53. WordCount – Cascalog / Clojure

(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\](),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))

; Paul Lam
; github.com/Quantisan/Impatient

(flow diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count)
54. github.com/nathanmarz/cascalog/wiki
• implements Datalog in Clojure, with predicates backed
by Cascading – for a highly declarative language
• run ad-hoc queries from the Clojure REPL –
approx. 10:1 code reduction compared with SQL
• composable subqueries, used for test-driven development
(TDD) practices at scale
• Leiningen build: simple, no surprises, in Clojure itself
• more new deployments than other Cascading DSLs –
Climate Corp is largest use case: 90% Clojure/Cascalog
• has a learning curve, limited number of Clojure developers
• aggregators are the magic, and those take effort to learn
WordCount – Cascalog / Clojure
55. WordCount – Scalding / Scala

import com.twitter.scalding._

class WordCount(args : Args) extends Job(args) {
  Tsv(args("doc"),
      ('doc_id, 'text),
      skipHeader = true)
    .read
    .flatMap('text -> 'token) {
      text : String => text.split("[ \\[\\](),.]")
    }
    .groupBy('token) { _.size('count) }
    .write(Tsv(args("wc"), writeHeader = true))
}

(same flow diagram as above)
56. github.com/twitter/scalding/wiki
• extends the Scala collections API so that distributed lists
become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram
and function calls
• extensive libraries are available for linear algebra, abstract
algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale
• less learning curve than Cascalog
WordCount – Scalding / Scala
57. WordCount – Apache Hive

CREATE TABLE text_docs (line STRING);
LOAD DATA LOCAL INPATH 'data/rain.txt'
OVERWRITE INTO TABLE text_docs
;
SELECT
  word, COUNT(*)
FROM
  (SELECT
     split(line, '\t')[1] AS text
   FROM text_docs
  ) t
LATERAL VIEW explode(split(text, '[ ,.()]')) lTable AS word
GROUP BY word
;

(same flow diagram as above)
58. WordCount – Apache Hive
hive.apache.org
pro:
‣ most popular abstraction atop Apache Hadoop
‣ SQL-like language is syntactically familiar to most analysts
‣ simple to load large-scale unstructured data and run ad-hoc queries
con:
‣ not a relational engine, many surprises at scale
‣ difficult to represent complex workflows, ML algorithms, etc.
‣ one poorly-trained analyst can bottleneck an entire cluster
‣ app-level integration requires other coding, outside of script language
‣ logical planner mixed with physical planner; cannot collect app stats
‣ non-deterministic exec: number of maps+reduces may change unexpectedly
‣ business logic must cross multiple language boundaries: difficult to troubleshoot, optimize, audit, handle exceptions, set notifications, etc.
59. WordCount – Apache Pig

docPipe = LOAD '$docPath' USING PigStorage('\t', 'tagsource')
  AS (doc_id, text);
docPipe = FILTER docPipe BY doc_id != 'doc_id';
-- specify regex to split "document" text lines into token stream
tokenPipe = FOREACH docPipe
  GENERATE doc_id, FLATTEN(TOKENIZE(text, ' [](),.')) AS token;
tokenPipe = FILTER tokenPipe BY token MATCHES '\\w.*';
-- determine the word counts
tokenGroups = GROUP tokenPipe BY token;
wcPipe = FOREACH tokenGroups
  GENERATE group AS token, COUNT(tokenPipe) AS count;
-- output
STORE wcPipe INTO '$wcPath' USING PigStorage('\t', 'tagsource');
EXPLAIN -out dot/wc_pig.dot -dot wcPipe;

(same flow diagram as above)
60. WordCount – Apache Pig
pig.apache.org
pro:
‣ easy to learn data manipulation language (DML)
‣ interactive prompt (Grunt) makes it simple to prototype apps
‣ extensibility through UDFs
con:
‣ not a full programming language; must extend via UDFs outside of language
‣ app-level integration requires other coding, outside of script language
‣ simple problems are simple to do; hard problems become quite complex
‣ difficult to parameterize scripts externally; must rewrite to change taps!
‣ logical planner mixed with physical planner; cannot collect app stats
‣ non-deterministic exec: number of maps+reduces may change unexpectedly
‣ business logic must cross multiple language boundaries: difficult to troubleshoot, optimize, audit, handle exceptions, set notifications, etc.
61. Two Avenues to the App Layer…
(diagram: complexity vs. scale)
Enterprise: must contend with complexity at scale everyday…
incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff
Start-ups: crave complexity and scale to become viable…
new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding
62. Learnings generalized from trends in Data Science:
1. the practice of leading data science teams
2. strategies for leveraging data at scale
3. machine learning and optimization
4. large-scale data workflows
5. an evolution of cluster computing
GA/SF, 2013-08-27
63. Q3 1997: inflection point
four independent teams were working toward horizontal
scale-out of workflows based on commodity hardware
this effort prepared the way for huge Internet successes
in the 1997 holiday season… AMZN, EBAY, Inktomi
(YHOO Search), then GOOG
MapReduce and the Apache Hadoop open source stack
emerged from this period
64. Circa 1996: pre- inflection point
(diagram elements: Customers ↔ Web App ↔ transactions ↔ RDBMS; BI Analysts run SQL Queries, turning result sets into Excel pivot tables and PowerPoint slide decks for Stakeholders; Product strategy, Engineering requirements, optimized code)
65. Circa 1996: pre- inflection point
(same diagram as above, annotated: “throw it over the wall”)
66. Circa 2001: post- big ecommerce successes
(diagram elements: customer transactions flow from Web Apps through Middleware (servlets, models) into Logs and an RDBMS; DW/ETL feeds SQL Queries and result sets; Algorithmic Modeling over event history powers recommenders + classifiers; aggregation feeds dashboards for Product, Engineering, UX, and Stakeholders)
67. Circa 2001: post- big ecommerce successes
(same diagram as above, annotated: “data products”)
68. Circa 2013: clusters everywhere
(diagram elements: Web Apps, Mobile, etc. feed transactions, content, and social interactions; Log Events, History, RDBMS, DW, In-Memory Data Grid, and Hadoop, etc. sit under a Cluster Scheduler and Planner; workflows span batch and near-time services to deliver Data Products to Customers; use cases cross topologies, involving s/w dev, data science (discovery + modeling), Ops dashboards/metrics, business process, optimized capacity, and taps; roles: Data Scientist, App Dev, Ops, Domain Expert; introduced capability alongside the existing SDLC)
69. Circa 2013: clusters everywhere
(same diagram as above, annotated: “optimize topologies”)
70. Amazon
“Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
eBay
“The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
Inktomi (YHOO Search)
“Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
youtu.be/E91oEn1bnXM
Google
“Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtu.be/qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
MIT Media Lab
“Social Information Filtering for Music Recommendation” – Pattie Maes
pubs.media.mit.edu/pubs/papers/32paper.ps
ted.com/speakers/pattie_maes.html
Primary Sources
71. Cluster Computing’s Dirty Little Secret
people like me make a good living by leveraging high ROI
apps based on clusters, and so the execs agree to build
out more data centers…
clusters for Hadoop/HBase, for Storm, for MySQL,
for Memcached, for Cassandra, for Nginx, etc.
this becomes expensive!
a single class of workloads on a given cluster is simpler
to manage; but terrible for utilization… various notions
of “cloud” help
Cloudera, Hortonworks, probably EMC soon: sell a notion
of “Hadoop as OS”… All your workloads are belong to us
regardless of how architectures change, death and taxes
will endure: servers fail, and data must move
(image: Google Data Center, Fox News, ~2002)
72. Three Laws, or more?
meanwhile, architectures evolve toward much, much larger data…
pistoncloud.com/ ...
Rich Freitas, IBM Research
Q:
what kinds of disruption in topologies
could this imply? because there’s
no such thing as RAM anymore…
73. Topologies
Hadoop and other topologies arose from a need for fault-tolerant workloads, leveraging horizontal scale-out based on commodity hardware, because the data won’t fit on one computer anymore
a variety of Big Data technologies has since emerged, which can be categorized in terms of topologies and the CAP Theorem
[diagram: CAP triangle; labels: strong consistency (C), high availability (A), partition tolerance (P), eventual consistency]
“You can have at most two of these properties for any shared-data system… the choice of which feature to discard determines the nature of your system.” – Eric Brewer, 2000 (Inktomi/YHOO)
cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
julianbrowne.com/article/viewer/brewers-cap-theorem
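(Not from the deck: a toy sketch to make the CAP trade-off concrete. An “AP” system keeps accepting writes on both sides of a partition, so replicas diverge and must reconcile afterward. The Replica class and last-write-wins rule here are illustrative assumptions, not any particular datastore’s protocol.)

```python
# Toy illustration of the CAP trade-off (not any real datastore):
# an "AP" system keeps accepting writes during a partition, so the
# replicas diverge and must reconcile later, here by last-write-wins.

class Replica:
    def __init__(self, name):
        self.name = name
        self.store = {}                     # key -> (timestamp, value)

    def write(self, key, value, ts):
        self.store[key] = (ts, value)

    def merge(self, other):
        # reconcile after the partition heals: keep the newest write
        for key, (ts, value) in other.store.items():
            if key not in self.store or ts > self.store[key][0]:
                self.store[key] = (ts, value)

a, b = Replica("a"), Replica("b")
a.write("cart", ["book"], ts=1)             # replicated normally
b.write("cart", ["book"], ts=1)

# partition: both sides remain available, and their writes diverge
a.write("cart", ["book", "lamp"], ts=2)
b.write("cart", ["book", "mug"], ts=3)

# heal: replicas converge, but one concurrent write is discarded
a.merge(b)
b.merge(a)
assert a.store == b.store
print(a.store["cart"])                      # (3, ['book', 'mug'])
```

Choosing strong consistency instead would mean refusing one side’s writes during the partition, trading availability for agreement; that is Brewer’s point about which property you discard.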
74. Some Topologies Other Than Hadoop…
Spark (iterative/interactive)
Titan (graph database)
Redis (data structure server)
Zookeeper (distributed metadata)
HBase (columnar data objects)
Riak (durable key-value store)
Storm (real-time streams)
ElasticSearch (search index)
MongoDB (document store)
ParAccel (MPP)
SciDB (array database)
75. “Return of the Borg”
consider that Google is generations ahead of Hadoop, etc., with much improved ROI on its data centers…
Borg serves as a kind of “secret sauce” for a data center OS, with Omega as its next evolution:
2011 GAFS Omega – John Wilkes, et al.
youtu.be/0ZFMlO98Jkc
Omega: flexible, scalable schedulers for large compute clusters – Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, John Wilkes
eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf
77. “Return of the Borg”
Return of the Borg: How Twitter Rebuilt Google’s Secret Weapon – Cade Metz
wired.com/wiredenterprise/2013/03/google-borg-twitter-mesos
The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines – Luiz André Barroso, Urs Hölzle
research.google.com/pubs/pub35290.html
78. Mesos – definitions
a common substrate for cluster computing: heterogeneous assets in your data center or cloud, made available as a homogeneous set of resources
• top-level Apache project
• scalability to 10,000s of nodes
• obviates the need for virtual machines
• isolation between tasks with Linux Containers (pluggable)
• fault-tolerant replicated master using ZooKeeper
• multi-resource scheduling (memory and CPU aware)
• APIs in C++, Java, Python (see the sketch after this list)
• web UI for inspecting cluster state
• available for Linux, Mac OS X, OpenSolaris
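(Not from the deck: a minimal sketch of what the Python API looks like, assuming the classic mesos / mesos_pb2 bindings of this era; the framework name, task command, and ZooKeeper URL are placeholders.)

```python
# Minimal Mesos framework sketch in Python (classic mesos /
# mesos_pb2 bindings, circa 2013); illustrative, not production code.
import mesos
import mesos_pb2

class HelloScheduler(mesos.Scheduler):
    def resourceOffers(self, driver, offers):
        # Mesos offers resources; the framework decides what to launch.
        for offer in offers:
            task = mesos_pb2.TaskInfo()
            task.task_id.value = "hello-" + offer.id.value
            task.slave_id.value = offer.slave_id.value
            task.name = "hello"
            task.command.value = "echo hello from mesos"

            cpus = task.resources.add()
            cpus.name = "cpus"
            cpus.type = mesos_pb2.Value.SCALAR
            cpus.scalar.value = 0.1

            mem = task.resources.add()
            mem.name = "mem"
            mem.type = mesos_pb2.Value.SCALAR
            mem.scalar.value = 32

            driver.launchTasks(offer.id, [task])

framework = mesos_pb2.FrameworkInfo()
framework.user = ""                   # let Mesos fill in the current user
framework.name = "hello-framework"    # placeholder name

# the master can be host:port or a zk:// URL; this one is a placeholder
driver = mesos.MesosSchedulerDriver(
    HelloScheduler(), framework, "zk://localhost:2181/mesos")
driver.run()
```

The shape of the API is the point: the framework receives resource offers and decides what to launch where, while Mesos handles allocation and isolation.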
79. Mesos – simplifies app development
[stack diagram: frameworks such as Chronos, Spark, Hadoop, DPark, and MPI run on Mesos, via JVM (Java, Scala, Clojure, JRuby), Python, and C++ bindings]
80. Mesos – data center OS stack
[stack diagram: kernel / OS / apps; Hadoop, Storm, Chronos, Rails, and JBoss run on Mesos; supporting layers: telemetry, capacity planning, GUI, security, smarter scheduling]
81. Prior Practice: Dedicated Servers
[diagram: a datacenter of statically dedicated servers]
• low utilization rates
• longer time to ramp up new services
82. Prior Practice: Virtualization
[diagram: datacenter carved into provisioned VMs]
• even more machines to manage
• substantial performance decrease due to virtualization
• VM licensing costs
83. Prior Practice: Static Partitioning
[diagram: datacenter carved into static partitions]
• even more machines to manage
• substantial performance decrease due to virtualization
• VM licensing costs
• static partitioning limits elasticity
84. Mesos: One Large Pool of Resources
[diagram: the datacenter as a single shared pool managed by Mesos]
“We wanted people to be able to program for the data center just like they program for their laptop.” – Ben Hindman
85. What are the costs of Virtualization?
benchmark type    OpenVZ improvement
mixed workloads   210%-300%
LAMP (related)    38%-200%
I/O throughput    200%-500%
response time     order of magnitude; more pronounced at higher loads
86. What are the costs of Single Tenancy?
[charts: CPU load (0-100%) over time for single-tenant Rails, Memcached, and Hadoop clusters, versus the combined CPU load when Rails, Memcached, and Hadoop share the same nodes]
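(Not from the deck: a back-of-the-envelope sketch of why single tenancy is costly. The load numbers below are invented to illustrate the effect; the point is that workloads peaking at different times can share capacity.)

```python
# Hypothetical hourly CPU-load profiles, as fractions of one
# cluster's capacity; the numbers are invented for illustration,
# not measurements from the slide.
rails     = [0.7, 0.6, 0.2, 0.1]   # peaks during the day
memcached = [0.3, 0.4, 0.3, 0.2]   # fairly flat
hadoop    = [0.1, 0.1, 0.8, 0.9]   # batch jobs run overnight

# single tenancy: each workload gets a cluster sized for its own peak
single_tenant = max(rails) + max(memcached) + max(hadoop)

# shared pool: capacity only needs to cover the combined peak
shared = max(r + m + h for r, m, h in zip(rails, memcached, hadoop))

print(round(single_tenant, 2))   # 2.0 clusters' worth of capacity
print(round(shared, 2))          # 1.3, because the peaks don't coincide
```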
87. Compelling arguments for a Data Center OS
• obviates the need for VMs (licensing; adios VMware)
• provides OS-level building blocks for developing new distributed frameworks (learning curve; adios Hadoop)
• removes significant VM overhead (performance)
• requires less h/w to buy (CapEx) and to power and fix (OpEx)
• implies fewer VMs, thus less Ops overhead (staff)
• removes the complexity of Chef/Puppet (staff)
• allows higher utilization rates (ROI)
• reduces latency for data updates (OLTP + OLAP on the same server)
• reshapes cluster resources dynamically (100s of ms vs. minutes)
• runs dev/test clusters on the same h/w as production (flexibility)
• evaluates multiple versions without more h/w (vendor lock-in)
88. Opposite Ends of the Spectrum, One Substrate
[diagram: isolation options on one substrate: built-in / bare metal, hypervisors, Solaris Zones, Linux CGroups]
89. Opposite Ends of the Spectrum, One Substrate
[diagram: workloads on one substrate, from request/response services to batch]
90. Case Study: Twitter (bare metal / on premise)
“Mesos is the cornerstone of our elastic compute infrastructure – it’s how we build all our new services and is critical for Twitter’s continued success at scale. It’s one of the primary keys to our data center efficiency.”
– Chris Fry, SVP Engineering
blog.twitter.com/2013/mesos-graduates-from-apache-incubation
• key services run in production: analytics, typeahead, ads
• Twitter engineers rely on Mesos to build all new services
• instead of thinking about static machines, engineers think about resources like CPU, memory, and disk
• allows services to scale and leverage a shared pool of servers across data centers efficiently
• reduces the time between prototyping and launching
91. Case Study: Airbnb (fungible cloud infrastructure)
“We think we might be pushing data science in the field of travel more so than anyone has ever done before… a smaller number of engineers can have higher impact through automation on Mesos.”
– Mike Curtis, VP Engineering
gigaom.com/2013/07/29/airbnb-is-engineering-itself-into-a-data-driven...
• improves resource management and efficiency
• helps advance an engineering strategy of building small teams that can move fast
• key to letting engineers make the most of AWS-based infrastructure beyond just Hadoop
• allowed the company to migrate off Elastic MapReduce
• enables use of Hadoop along with Chronos, Spark, Storm, etc.