Intro to Data Science for Enterprise Big Data
 

If you need a different format (PDF, PPT) instead of Keynote, please email me: pnathan AT concurrentinc DOT com

An overview of Data Science for Enterprise Big Data. In other words, how to combine structured and unstructured data, leveraging the tools of automation and mathematics, for highly scalable businesses. We discuss management strategy for building Data Science teams, basic requirements of the "science" in Data Science, and typical data access patterns for working with Big Data. We review some great algorithms, tools, and truisms for building a Data Science practice, plus some great references for further study.

Presented initially at the Enterprise Big Data meetup at Tata Consultancy Services, Santa Clara, 2012-08-20 http://www.meetup.com/Enterprise-Big-Data/events/77635202/


Usage Rights

CC Attribution-ShareAlike License
Comments

  • That's an interesting metric - ARPU/OperationalDataStoreSize
  • This is brilliant! Thank you!
  • @davekincaid we didn't get a recording last night, but we're looking to cover this material again and will! we'd also like to wrap parts of it as webcasts
  • Amazing stuff! By any chance is there a recording of your actual presentation? The narration that goes along with these would be fantastic!
  • responsible for net lift, or we work on something else

Presentation Transcript

  • Intro to Data Science. Paco Nathan, Concurrent, Inc. pnathan@concurrentinc.com @pacoid. Copyright @2012, Concurrent, Inc. (title slide; background shows a Word Count workflow: Document Collection → Scrub → Tokenize → HashJoin Left/RHS with Stop Word List, Regex → GroupBy → Count)
  • opportunity Unstructured Data meets Enterprise Scale
  • core values Data Science teams develop actionable insights, building confidence for decisions that work may influence a few decisions worth billions (e.g., M&A) or billions of small decisions (e.g., AdWords) probably somewhere in-between… solving for pattern, at scale. NB: projects require teams, not sole players
  • Intro to Data Science: backstory (section divider; Word Count workflow diagram)
  • personal timeline (figure spanning the 1980s through 2010s: school at Stanford; research at Bell Labs, IBM, NASA; enterprise at Moto; start-up CTO at BNTI; consulting; leading data teams at Symbiot, Adknowledge, ShareThis, IMVU, etc.)
  • inflection point: demand side. Huge Internet successes after the 1997 holiday season: AMZN, EBAY, then GOOG, Inktomi (YHOO Search). Consider this metric: annual revenue per customer / operational data store size dropped more than 100x within a few years after 1997. Storage and processing costs plummeted; now we must work much smarter to extract ROI from Big Data, and our methods must adapt. The “conventional wisdom” of RDBMS and BI tools became less viable; the business cadre still focused on pivot tables and pie charts, which tends toward inertia. MapReduce and the Hadoop open source stack grew directly out of this context, but that only solves parts. Massive disruption in retail, advertising, etc.: “All of Fortune 500 is now on notice over the next 10-year period.” – Geoffrey Moore, 2012 (Mohr Davidow Ventures)
  • inflection point: supply side (sources: DJ Patil; R-Bloggers)
  • statistical thinking (Process, Variation, Data, Tools): a mode of thinking which includes both logical and analytical reasoning: evaluating the whole of a problem, as well as its component parts; attempting to assess the effects of changing one or more variables. This approach attempts to understand not just problems and solutions, but also the processes involved and their variances. Particularly valuable in Big Data work when combined with hands-on experience in physics – roughly 50% of my peers come from physics or physical engineering… programmers typically don’t think this way
  • most valuable skills • approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cleaning up data quality issues • unfortunately, data-related budgets for many companies tend to go into frameworks which can only be used after clean up • most valuable skills: ‣ learn to use programmable tools that prepare data ‣ learn to generate compelling data visualizations ‣ learn to estimate the confidence for reported results ‣ learn to automate work, making analysis repeatable. The rest of the skills – modeling, algorithms, etc. – are secondary
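The "programmable tools that prepare data" point can be sketched in a few lines. A minimal, hypothetical example (the field names and cleanup rules are invented for illustration, not from the talk): scripting the scrub step makes the cleanup itself repeatable, rather than a one-off manual fix.

```python
import csv
import io
import re

# Hypothetical raw extract: padded ids, currency strings, missing values.
RAW = """customer_id,revenue
 001 ,$1200
002,N/A
003,$450.50
"""

def clean_row(row):
    """Normalize ids, parse currency, flag unparseable values as None."""
    cid = row["customer_id"].strip().lstrip("0") or "0"
    rev = row["revenue"].strip()
    m = re.match(r"^\$?(\d+(?:\.\d+)?)$", rev)
    return {"customer_id": cid,
            "revenue": float(m.group(1)) if m else None}

rows = [clean_row(r) for r in csv.DictReader(io.StringIO(RAW))]
valid = [r for r in rows if r["revenue"] is not None]
print(len(valid), sum(r["revenue"] for r in valid))
```

Because the rules live in code, the same cleanup reruns identically on the next data drop, and the rejected rows remain countable, which feeds the "estimate the confidence for reported results" skill.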
  • social caveats • the phrase “This data cannot be correct!” may be an early warning about the organization itself • much depends on how the people whom you work alongside tend to arrive at their decisions: ‣ probably good: Induction, Abduction, Circumscription ‣ probably poor: Deduction, Speculation, Justification in general, one good data visualization can put many ongoing verbal arguments to rest xkcd
  • reference Statistical Modeling: The Two Cultures by Leo Breiman Statistical Science, 2001 http://bit.ly/eUTh9L
  • reference Data Quality by Jack Olson Morgan Kaufmann, 2003 http://www.amazon.com/dp/1558608915
  • reference Building Data Science Teams by DJ Patil O’Reilly, 2011 http://www.amazon.com/dp/B005O4U3ZE
  • reference Data Jujitsu by DJ Patil O’Reilly, 2012 http://www.amazon.com/dp/B008HMN5BE
  • reference RStudio download and run it on your laptop http://rstudio.org/
  • Intro to Data Science: build: data science teams (section divider; Word Count workflow diagram)
  • process: discovery – help people ask the right questions; modeling – allow automation to place informed bets; integration – deliver products at scale to customers; apps – leverage smarts in product features; systems – keep infrastructure running, cost-effective (figure includes a Gephi graph)
  • matrix = needs × roles (grid: columns are needs – discovery, modeling, integration, apps, systems; rows are roles – stakeholder, scientist, developer, ops)
  • matrix: usage. A conceptual tool for managing Data Science teams: overlay your project requirements (needs) with your team’s strengths (roles); that will show very quickly where to focus. NB: bring in individuals who cover 2-3 needs, particularly for team leads.
  • matrix: needs. One dimension is “needs”: discovery, modeling, integration, apps, systems. These are the primary phases of leveraging Big Data. Stakeholders represent the domain: the key aspect to leverage. Analysts usually drive from discovery toward integration, while the engineers tend to drive from systems toward integration. NB: effective, hands-on management in Data Science must live in the space of integration, not delegate it.
  • matrix: roles. The other dimension is “roles”: stakeholder, scientist, developer, ops. Each role leverages different disciplines, opportunities, and risks. There’s great power in pairing people with complementary skills, in team environments where they can recognize each other’s priorities and perspectives. Blurring these roles is wonderful when you find great people capable of doing so, e.g., DevOps; however, when businesses get into trouble, they will tend to “push down” these roles, blurring boundaries in ways which stress teams and limit scalability.
  • matrix: example team (grid showing where one team’s members fall across needs × roles)
  • matrix: example team. Summary: this team seems heavy on systems, and may need more overlap between modeling and integration, particularly among team leads.
  • typical hand-offs (diagram: vendor data sources and customer interactions feed a production cluster, data warehouse, and query hosts; outputs include BI & dashboards, reporting, presentations, decision support, predictive analytics, classifiers, and recommenders; engineers and analysts hand off across availability, integrity, discovery, modeling, and communications, via internal APIs, crons, etc.)
  • data priorities: Availability. Top priority: providing access to data as needed. Lack of availability causes large hidden costs to a business.
  • data priorities: Integrity. Work within Engineering to ensure that customer data, internal metrics, third-party sources, etc., get collected and maintained in ways which are meaningful and consistent for required business use cases.
  • data priorities: Discovery. Analyze and visualize data on behalf of business stakeholders. Leverage statistics so that we not only say “What” decisions to take, but can answer “Why?” and “How good are they?”
  • data priorities: Modeling. Use business learnings in automated, scalable ways. For example, manage an automated bid system. Principally “algorithmic modeling”, not “data modeling”.
  • data priorities: Communications. Work closely with stakeholders so that insights gleaned from data + analysis are understood, and important to the business. The sum of learnings from this ongoing process represents our primary value.
  • Intro to Data Science: theory: wrangle the data (section divider; Word Count workflow diagram)
  • CAP theorem (diagram: triangle of strong consistency (C), high availability (A), and partition tolerance (P), with eventual consistency along the A–P edge)
  • CAP theorem “You can have at most two of these properties for any shared-data system… the choice of which feature to discard determines the nature of your system.” – Eric Brewer, 2000 (Inktomi) • revenue transactions in ecommerce typically require strong consistency and partition tolerance • most analytics jobs for business use cases generally require availability and eventual consistency, but tend to not tolerate highly partitioned data • ETL becomes an Achilles’ heel for “agile”: ‣ agile/experiment-driven/scale-out, which leads to… ‣ provably-hard-to-detect metadata drift, leading to… ‣ high-risk technical debt
  • interpretation • purpose: theoretical limits for data access patterns • essence: ‣ consistency ‣ availability ‣ partition tolerance • best case scenario: you may pick two … or spend billions struggling to obtain all three at scale (GOOG) • translated: cost of doing business https://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
  • data access patterns • the world is not made of data warehouses… • a handful of common data access patterns prevail • learn to recognize these for any given problem • typically expressed as trade-offs among: ‣ speed & volume (latency and throughput) ‣ reads & writes (access and storage) ‣ consistency / availability / partition tolerance as for roles on teams, some mixing is valuable; OTOH, too much blurring of boundaries causes stress
  • data access patterns • design patterns: originated in consensus negotiation for architecture, later used in software engineering • consider the corollaries in large-scale data work… • essential advice: select data frameworks based on your data access patterns • in other words, decouple use cases based on needs – to avoid “one size fits all” blockers • let’s review some examples…
  • access → frameworks → forfeits
    financial transactions: general ledger in RDBMS – CAx
    ad-hoc queries: RDS (hosted MySQL) – CAx
    reporting, dashboards: like Pentaho – CAx
    log rotation/persistence: like Riak – xxP
    search indexes: like Lucene/Solr – xAP
    static content, archives: S3 (durable storage) – xAP
    customer facts: like Redis, Membase – xAP
    distributed counters, locks, sets: like Redis – xAP*
    data objects CRUD: key/value – like, NoSQL on MySQL – CxP
    authoritative metadata: like Zookeeper – CxP
    data prep, modeling at scale: like Hadoop/Cascading + R – CxP
    graph analysis: like Hadoop + Redis + Gephi – CxP
    data marts: like Hadoop/HBase – CxP
  • Amdahl’s law (figure source: Wikipedia)
  • interpretation • purpose: theoretical limits for scalable computation • essence: task overhead and data independence define limits of parallelism for any given problem; however, these also suggest how well a problem can be scaled-out • translated: return on investment http://en.wikipedia.org/wiki/Amdahls_law http://www.bu.edu/tech/research/training/tutorials/matlab-pct/scalability/
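The "theoretical limits" framing is easy to make concrete: with parallel fraction p and n workers, Amdahl's law gives speedup S(n) = 1 / ((1 - p) + p / n), which is capped at 1 / (1 - p) no matter how many machines you buy. A small sketch (the 95%-parallel figure is an illustrative assumption, not from the talk):

```python
def speedup(p, n):
    """Amdahl's law: speedup for parallel fraction p on n workers."""
    return 1.0 / ((1.0 - p) + p / n)

# A job that is 95% parallelizable: 8 workers give ~5.9x,
# and even unlimited workers cannot exceed 1 / (1 - 0.95) = 20x.
print(round(speedup(0.95, 8), 2), round(1 / (1 - 0.95), 1))
```

This is the "return on investment" translation: the serial fraction, not the cluster size, sets the ceiling on scale-out.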
  • parallel computation • parallelism allows for horizontal scale-out, which creates business “levers” in cost/performance at scale • NB: MapReduce provides a compute framework which is part-parallel and part-serial… that tends to complicate app development • most hard problems in industry have portions which do not allow data independence, or which require iteration • current efforts in massively parallel algorithms research may help to parallelize problems and reduce iteration – estimates are 3-5 years out for industry use. GPUs and other hardware architecture advancements will likely make Hadoop unrecognizable 3-5 years out
  • Intro to Data Science: theory: manage the science (section divider; Word Count workflow diagram)
  • the science in data science • Estimate Probability! • Calculate Analytic Variance!! • Apply Learning Theory!!! • Manipulate Order Complexity!!!! (background: a mirrored cloud of raw event-log labels, e.g., NUI:DressUpMode, Website Login, Feed Pet, Buy an Item)
  • probability estimation “a random variable or stochastic variable is a variable whose value is subject to variations” “an estimator is a rule for calculating an estimate of a given quantity based on observed data” estimators and probability distributions provide the essential basis for our insights bayesian methods, shrinkage… these are our friends quantile estimation, empirical CDFs… …versus frequentist notions
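The estimator idea above can be made concrete with the standard library alone. A minimal sketch (the Gaussian sample is synthetic, for illustration): an empirical CDF and quantile estimates stand in for the unknown distribution, exactly the "rule for calculating an estimate based on observed data" the slide describes.

```python
import random
import statistics

random.seed(42)
# Synthetic observations from an unknown-to-us process.
sample = [random.gauss(100.0, 15.0) for _ in range(10_000)]

def ecdf(data, x):
    """Empirical CDF: fraction of observations <= x."""
    return sum(1 for v in data if v <= x) / len(data)

mean_hat = statistics.fmean(sample)       # estimator of the mean
pct = statistics.quantiles(sample, n=100) # empirical percentile estimates
print(round(mean_hat, 1), round(ecdf(sample, mean_hat), 2))
```

For a roughly symmetric sample the ECDF evaluated at the estimated mean lands near 0.5, and the 50th percentile estimate lands near the mean, which is the kind of internal consistency check these estimators invite.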
  • analytic variance our tools for automation leverage deep understanding of covariance cannot overstate the importance of sampling… insist on metrics described as confidence intervals, where valid bootstrapping, bagging… these are our friends Monte Carlo methods resolve “black box” problems point estimates may help prevent “uninformed” decisions do not skimp on this part, ever… a hard lesson learned from BI failures
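The bootstrapping the slide recommends fits in a few lines: resample the data with replacement, recompute the statistic each time, and report a percentile interval instead of a bare point estimate. A minimal sketch (the skewed synthetic metric is an invented stand-in for real business data):

```python
import random
import statistics

random.seed(7)
# A skewed metric, e.g., something revenue-like; mean around 50.
data = [random.expovariate(1 / 50.0) for _ in range(500)]

def bootstrap_ci(data, stat, n_boot=2000, alpha=0.05):
    """Percentile bootstrap confidence interval for stat(data)."""
    reps = sorted(
        stat(random.choices(data, k=len(data))) for _ in range(n_boot)
    )
    return (reps[int(n_boot * alpha / 2)],
            reps[int(n_boot * (1 - alpha / 2)) - 1])

lo, hi = bootstrap_ci(data, statistics.fmean)
print(round(lo, 1), round(statistics.fmean(data), 1), round(hi, 1))
```

Reporting the (lo, hi) pair rather than the point estimate is precisely the "metrics described as confidence intervals" discipline the slide insists on.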
  • learning theory in general, apps alternate between learning patterns/rules and retrieving similar things… statistical learning theory – rigorous, prevents you from making billion dollar mistakes, probably our future machine learning – scalable, enables you to make billion dollar mistakes, much commercial emphasis supervised vs. unsupervised arguably, optimization is a related area once Big Data projects get beyond merely digesting log files, optimization will likely become yet another buzzword :)
  • order complexity techniques for manipulating order complexity: dimensional reduction… with clustering as a common case e.g., you may have 100 million HTML docs, but there are only ~10K useful keywords low-dimensional structures, PCA linear algebra tricks: eigenvalues, matrix decomposition, etc. many hard problems resolved by “divide and conquer” this is an area ripe for much advancement in algorithms research near-term
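The "linear algebra tricks" line can be illustrated in two dimensions without any library: for a symmetric 2×2 covariance matrix [[a, b], [b, c]] the eigenvalues have a closed form, and the top eigenvalue's share of total variance shows how much a single direction explains. A minimal PCA-flavored sketch (the nearly one-dimensional synthetic cloud is an invented example):

```python
import math
import random

random.seed(1)
xs = [random.gauss(0, 1) for _ in range(2000)]
ys = [x * 2.0 + random.gauss(0, 0.1) for x in xs]  # nearly 1-D data

def cov(u, v):
    """Sample covariance of two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

a, b, c = cov(xs, xs), cov(xs, ys), cov(ys, ys)
# Eigenvalues of [[a, b], [b, c]]: (a + c)/2 +/- sqrt(((a - c)/2)^2 + b^2)
mid, rad = (a + c) / 2, math.hypot((a - c) / 2, b)
lam1, lam2 = mid + rad, mid - rad
explained = lam1 / (lam1 + lam2)
print(round(explained, 3))  # near 1.0 for this almost 1-D cloud
```

When one eigenvalue dominates like this, the 2-D data can be replaced by its projection onto a single component with little loss, which is dimensional reduction in miniature.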
  • Intro to Data Science: praxis (section divider; Word Count workflow diagram)
  • some great tools…
    reporting: PowerPivot, Pentaho, Jaspersoft, SAS
    visualization: ggplot2, D3, Gephi
    analytics/modeling: R, Weka, Matlab, PMML, GLPK
    text: LDA, WordNet, OpenNLP, Mallet, Bixo, NLTK
    apps: Cascading, Scalding, Cascalog, R markdown, SWF
    scale-out: Scalr, RightScale, CycleComputing, vFabric, Beanstalk
    graph: Gremlin, GraphLab, Neo4J
    column: Vertica, HBase, Drill
    key/val: Redis, Dynamo, Membase
    index: Lucene/Solr, ElasticSearch
    relational: usual suspects, MySQL
    stream/iter: Storm, Spark
    hadoop: EMR, HW, MapR, EMC, Azure, Compute
    durable storage: ASV, S3, Riak, Couch
  • a sample of great algorithms… time series analysis seasonal variation geospatial hidden markov models ARIMA bayesian point estimates kriging k-d trees funnel optimization topics lang id anti-fraud regression linear programming cosine similarity LDA TextRank LID TF-IDF random forest GLM/GAM elasticity of demand recommender key phrase doc similarity classifier differential equations k-medoids PCA LSH k-means|| probabilistic hashing customer lifetime value market segmentation dimensional reduction customer experiments connected components markov random walk association rules multi-arm bandit sessionization social graph what if ? sample variance affiliation networks MCMC bootstrapping
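Two of the algorithms sampled above, TF-IDF and cosine similarity, compose into a tiny document-similarity pipeline using only the standard library. A sketch on an invented toy corpus (the documents are made up for illustration):

```python
import math
from collections import Counter

docs = ["big data teams build data products",
        "data science teams model big data",
        "cats nap in warm sunbeams"]

tokenized = [d.split() for d in docs]
n = len(tokenized)
# Document frequency: how many docs contain each term.
df = Counter(t for doc in tokenized for t in set(doc))

def tfidf(doc):
    """Term frequency weighted by inverse document frequency."""
    tf = Counter(doc)
    return {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = [tfidf(d) for d in tokenized]
print(round(cosine(vecs[0], vecs[1]), 3), round(cosine(vecs[0], vecs[2]), 3))
```

The two data-themed documents score well above the unrelated one, which is the doc-similarity building block behind the recommender and key-phrase entries in the list.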
  • Intro to Data Science. Paco Nathan, Concurrent, Inc. pnathan@concurrentinc.com @pacoid. Copyright @2012, Concurrent, Inc. (closing slide; repeats the Word Count workflow diagram)