How does using Hadoop in the cloud for data analytics fit into the context of continuous deployment? We also look at how the CAP theorem can be used to match data-access patterns with appropriate data frameworks.
How future astronomy projects will generate enormous amounts of data, and what does that mean for astronomical data processing. Part of the virtual observatory course by Juan de Dios Santander Vela, as imparted for the MTAF (Métodos y Técnicas Avanzadas en Física, Advanced Methods and Techniques in Physics) Master at the University of Granada (UGR).
The database world is undergoing a major upheaval. NoSQL databases such as MongoDB and Cassandra are emerging as a compelling choice for many applications. They can simplify the persistence of complex data models and offer significantly better scalability and performance. But these databases have very different and unfamiliar data models and APIs, as well as a limited transaction model. Moreover, the relational world is fighting back with so-called NewSQL databases such as VoltDB, which, by using a radically different architecture, offers high scalability and performance as well as the familiar relational model and ACID transactions. Sounds great, but unlike a traditional relational database, you can't use JDBC and must partition your data.
In this presentation you will learn about popular NoSQL databases - MongoDB and Cassandra - as well as VoltDB. We will compare and contrast each database's data model and Java API using NoSQL and NewSQL versions of a use case from the book POJOs in Action. We will learn about the benefits and drawbacks of using NoSQL and NewSQL databases.
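The requirement to partition your data, mentioned above for VoltDB-style NewSQL systems, can be illustrated with a minimal hash-partitioning sketch. This is plain Python for illustration only, not any vendor's API; the key names and partition count are hypothetical.

```python
# Hypothetical sketch of hash-based data partitioning, the scheme NewSQL
# systems rely on so that most transactions touch only one partition.
import hashlib

NUM_PARTITIONS = 4

def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a partition key to a partition id deterministically."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# Rows keyed by customer id always land on the same partition, so
# single-partition transactions never need cross-node coordination.
rows = [("customer-1", "Alice"), ("customer-2", "Bob"), ("customer-3", "Eve")]
partitions = {i: [] for i in range(NUM_PARTITIONS)}
for key, value in rows:
    partitions[partition_for(key)].append((key, value))
```

Queries that include the partition key can be routed to a single node, while queries without it must fan out to every partition.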
Presentation: Overview of Kognitio, Kognitio Cloud and the Kognitio Analytical Platform
Kognitio is driving the convergence of Big Data, in-memory analytics and cloud computing. Having delivered the first in-memory analytical platform in 1989, Kognitio designed its software from the ground up to provide highly scalable compute power for rapid execution of complex analytical queries, without the administrative overhead of manipulating data. Kognitio software runs on industry-standard x86 servers, as an appliance, or in Kognitio Cloud, a ready-to-use analytical platform. Kognitio Cloud is a secure, private or public cloud Platform-as-a-Service (PaaS), leveraging the cloud computing model to make the Kognitio Analytical Platform available on a subscription basis. Clients span industries, including market research, consumer packaged goods, retail, telecommunications, financial services, insurance, gaming, media and utilities.
To learn more, visit www.kognitio.com and follow us on Facebook, LinkedIn and Twitter.
Next Generation Data Platforms - Deon Thomas, Thoughtworks
A new generation of technologies and architectures designed to economically extract value from very large volumes of a wide variety of data, by enabling high velocity capture, discovery and/or analysis.
NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve... - Felix Gessert
The unprecedented scale at which data is consumed and generated today has created a strong demand for scalable data management and given rise to non-relational, distributed "NoSQL" database systems. Two central problems triggered this process: 1) the vast amounts of user-generated content in modern applications and the resulting request loads and data volumes, and 2) the desire of the developer community to employ problem-specific data models for storage and querying. To address these needs, various data stores have been developed by both industry and research, arguing that the era of one-size-fits-all database systems is over. The heterogeneity and sheer number of these systems - now commonly referred to as NoSQL data stores - make it increasingly difficult to select the most appropriate system for a given application. Therefore, these systems are frequently combined in polyglot persistence architectures to leverage each system in its respective sweet spot. This tutorial gives an in-depth survey of the most relevant NoSQL databases to provide comparative classification and highlight open challenges. To this end, we analyze the approach of each system to derive its scalability, availability, consistency, data modeling and querying characteristics. We present how each system's design is governed by a central set of trade-offs over irreconcilable system properties. We then cover recent research results in distributed data management to illustrate that some shortcomings of NoSQL systems could already be solved in practice, whereas other NoSQL data management problems pose interesting and unsolved research challenges.
If you'd like to use these slides for e.g. teaching, contact us at gessert at informatik.uni-hamburg.de - we'll send you the PowerPoint.
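The polyglot persistence idea from the tutorial abstract, using each store in its sweet spot, amounts to a routing decision at the application layer. A minimal sketch, in which the access patterns and store categories are purely illustrative:

```python
# Illustrative sketch of polyglot persistence: route each access pattern to
# the category of store best suited for it. All names here are hypothetical.
ROUTING = {
    "session_cache": "key-value store",      # low-latency reads and writes
    "product_catalog": "document store",     # flexible, nested records
    "friend_graph": "graph database",        # traversal-heavy queries
    "order_ledger": "relational database",   # ACID transactions
}

def store_for(access_pattern):
    """Pick a backing store for a given access pattern."""
    # Default to a relational store when no specialized fit is known.
    return ROUTING.get(access_pattern, "relational database")
```

In a real architecture this decision is made per service or per dataset, and the routing table reflects the trade-offs (consistency, latency, query model) the tutorial surveys.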
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget - Cloudera, Inc.
YapMap is a new kind of search platform that does multi-quanta search to better understand threaded discussions. This talk will cover how HBase made it possible for two self-funded guys to build a new kind of search platform. We will discuss our data model and how we use row based atomicity to manage parallel data integration problems. We’ll also talk about where we don’t use HBase and instead use a traditional SQL based infrastructure. We’ll cover the benefits of using MapReduce and HBase for index generation. Then we’ll cover our migration of some tasks from a message based queue to the Coprocessor framework as well as our future Coprocessor use cases. Finally, we’ll talk briefly about our operational experience with HBase, our hardware choices and challenges we’ve had.
This presentation contains the following slides:
Introduction To OLAP
Data Warehousing Architecture
The OLAP Cube
OLTP Vs. OLAP
Types Of OLAP
ROLAP vs. MOLAP
Benefits Of OLAP
Introduction - Apache Kylin
Kylin - Architecture
Kylin - Advantages and Limitations
Introduction - Druid
Druid - Architecture
Druid vs Apache Kylin
References
For any queries, contact us: argonauts007@gmail.com
Presented at a workshop by Apalia on building a private cloud: our feedback describing the advantages of a private cloud used to operate a legacy COBOL application migrated from a mainframe to Java and Linux.
Accelerating Big Data Analytics with Apache Kylin - Tyler Wishnoff
Learn about the latest advancements in Apache Kylin and how its OLAP technology is making analytics faster and insights more actionable.
Learn more about Apache Kylin: https://kyligence.io/apache-kylin-overview/
Learn more about Apache Kylin's enterprise version Kyligence: https://kyligence.io/
Human in the loop: a design pattern for managing teams working with ML - Paco Nathan
Strata CA 2018-03-08
https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/64223
Although it has long been used for use cases like simulation, training, and UX mockups, human-in-the-loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. One approach, active learning (a special case of semi-supervised learning), employs mostly automated processes based on machine learning models, but exceptions are referred to human experts, whose decisions help improve new iterations of the models.
Human-in-the-loop: a design pattern for managing teams that leverage ML - Paco Nathan
Strata Singapore 2017 session talk 2017-12-06
https://conferences.oreilly.com/strata/strata-sg/public/schedule/detail/65611
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called active learning allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We’ll consider some of the technical aspects — including available open source projects — as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn’t it applicable?
* How do HITL approaches compare/contrast with more “typical” use of Big Data?
* What’s the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time:
* In what ways do the humans involved learn from the machines?
* In particular, we’ll examine use cases at O’Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
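The active-learning loop these HITL abstracts describe can be sketched in a few lines. This is a hypothetical illustration, not code from the talks: the model, the confidence threshold, and the expert-routing stub are all stand-ins.

```python
# Minimal human-in-the-loop / active-learning sketch. The toy model, the 0.8
# confidence threshold, and the expert stub are hypothetical stand-ins.
def classify(item):
    """Stand-in for an ML model: returns (label, confidence)."""
    # Toy rule: short items are confidently "spam"; long ones are uncertain.
    return ("spam", 0.95) if len(item) < 10 else ("ham", 0.55)

def ask_expert(item):
    """Stand-in for routing an exception to a human annotator."""
    return "ham"

CONFIDENCE_THRESHOLD = 0.8
training_examples = []  # expert judgements feed the next model iteration

def label(item):
    predicted, confidence = classify(item)
    if confidence >= CONFIDENCE_THRESHOLD:
        return predicted            # confident: fully automated path
    judged = ask_expert(item)       # exception: defer to a human expert
    training_examples.append((item, judged))
    return judged
```

Confident predictions flow straight through; low-confidence cases escalate to a person, and the collected judgements become labeled examples for the next model iteration.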
Human-in-a-loop: a design pattern for managing teams which leverage ML - Paco Nathan
Big Data Spain, 2017-11-16
https://www.bigdataspain.org/2017/talk/human-in-the-loop-a-design-pattern-for-managing-teams-which-leverage-ml
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called _active learning_ allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We'll consider some of the technical aspects -- including available open source projects -- as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn't it applicable?
* How do HITL approaches compare/contrast with more "typical" use of Big Data?
* What's the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time
* In what ways do the humans involved learn from the machines?
In particular, we'll examine use cases at O'Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
Humans in a loop: Jupyter notebooks as a front-end for AI - Paco Nathan
JupyterCon NY 2017-08-24
https://www.safaribooksonline.com/library/view/jupytercon-2017-/9781491985311/video313210.html
Paco Nathan reviews use cases where Jupyter provides a front-end to AI as the means for keeping "humans in the loop". This talk introduces *active learning* and the "human-in-the-loop" design pattern for managing how people and machines collaborate in AI workflows, including several case studies.
The talk also explores how O'Reilly Media leverages AI in Media, and in particular some of our use cases for active learning such as disambiguation in content discovery. We're using Jupyter as a way to manage active learning ML pipelines, where the machines generally run automated until they hit an edge case and refer the judgement back to human experts. In turn, the experts train the ML pipelines purely through examples, not feature engineering, model parameters, etc.
Jupyter notebooks serve as one part configuration file, one part data sample, one part structured log, one part data visualization tool. O'Reilly has released an open source project on GitHub called `nbtransom` which builds atop `nbformat` and `pandas` for our active learning use cases.
This work anticipates upcoming work on collaborative documents in JupyterLab, based on Google Drive. In other words, where the machines and people are collaborators on shared documents.
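The description above treats a notebook as a shared, machine-writable document: part configuration file, part structured log. A minimal sketch of that idea, in the spirit of `nbtransom` but deliberately dependency-free (it manipulates the notebook's JSON directly rather than using the `nbformat` API, and the cell fields shown are the minimum, not a full schema):

```python
# Sketch: a Jupyter notebook is JSON, so people and machines can both append
# cells to it as a shared document. Hand-rolled here; nbtransom builds on
# nbformat and pandas for the real thing.
import json

def new_notebook():
    return {"nbformat": 4, "nbformat_minor": 5, "metadata": {}, "cells": []}

def append_code_cell(nb, source):
    """Either a human expert or an automated pipeline can add a cell."""
    nb["cells"].append({
        "cell_type": "code", "metadata": {}, "execution_count": None,
        "outputs": [], "source": source,
    })

nb = new_notebook()
append_code_cell(nb, "print('annotation from a human expert')")
serialized = json.dumps(nb)        # one part structured log, one part config
roundtrip = json.loads(serialized)
```

Because the whole artifact round-trips through JSON, it can be versioned in Git and shared between collaborators, human or machine.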
Humans in the loop: AI in open source and industry - Paco Nathan
Nike Tech Talk, Portland, 2017-08-10
https://niketechtalks-aug2017.splashthat.com/
O'Reilly Media gets to see the forefront of trends in artificial intelligence: what the leading teams are working on, which use cases are getting the most traction, previews of advances before they get announced on stage. Through conferences, publishing, and training programs, we've been assembling resources for anyone who wants to learn. An excellent recent example: Generative Adversarial Networks for Beginners, by Jon Bruner.
This talk covers current trends in AI, industry use cases, and recent highlights from the AI Conf series presented by O'Reilly and Intel, plus related materials from Safari learning platform, Strata Data, Data Show, and the upcoming JupyterCon.
Along with reporting, we're leveraging AI in Media. This talk dives into O'Reilly uses of deep learning -- combined with ontology, graph algorithms, probabilistic data structures, and even some evolutionary software -- to help editors and customers alike accomplish more of what they need to do.
In particular, we'll show two open source projects in Python from O'Reilly's AI team:
• pytextrank built atop spaCy, NetworkX, datasketch, providing graph algorithms for advanced NLP and text analytics
• nbtransom leveraging Project Jupyter for a human-in-the-loop design pattern approach to AI work: people and machines collaborating on content annotation
Lessons learned from 3 (going on 4) generations of Jupyter use cases at O'Reilly Media. In particular, about "Oriole" tutorials which combine video with Jupyter notebooks, Docker containers, backed by services managed on a cluster by Marathon, Mesos, Redis, and Nginx.
https://conferences.oreilly.com/fluent/fl-ca/public/schedule/detail/62859
https://conferences.oreilly.com/velocity/vl-ca/public/schedule/detail/62858
Strata UK 2017. Computable content leverages Jupyter notebooks to make learning materials more powerful by integrating compute engines, data sources, etc. O’Reilly Media extended this approach to create the new Oriole Online Tutorial medium, publishing notebooks from authors along with video timelines. (A free public tutorial, Regex Golf, by Peter Norvig demonstrates what’s possible with this technology integration.) Each user session launches a Docker container on a Mesos cluster for fully personalized compute environments. The UX is entirely browser based.
See 2020 update: https://derwen.ai/s/h88s
SF Python Meetup, 2017-02-08
https://www.meetup.com/sfpython/events/237153246/
PyTextRank is a pure Python open source implementation of *TextRank*, based on the [Mihalcea 2004 paper](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) -- a graph algorithm which produces ranked keyphrases from texts. Keyphrases are generally more useful than simple keyword extraction. PyTextRank integrates use of `TextBlob` and `SpaCy` for NLP analysis of texts, including full parse, named entity extraction, etc. It also produces auto-summarization of texts, making use of an approximation algorithm, `MinHash`, for better performance at scale. Overall, the package is intended to complement machine learning approaches -- specifically deep learning used for custom search and recommendations -- by developing better feature vectors from raw texts. This package is in production use at O'Reilly Media for text analytics.
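The core TextRank idea can be sketched compactly: build a co-occurrence graph over words, then run PageRank on it. This is a simplified illustration with a hand-rolled power iteration, not the PyTextRank implementation, which adds part-of-speech filtering, lemmatization, and phrase aggregation on top:

```python
# Simplified TextRank sketch: co-occurrence graph + hand-rolled PageRank.
def textrank(words, window=2, damping=0.85, iterations=30):
    # Edges connect distinct words that co-occur within a sliding window.
    edges = set()
    for i in range(len(words)):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[i] != words[j]:
                edges.add(frozenset((words[i], words[j])))
    nodes = set(words)
    neighbors = {n: set() for n in nodes}
    for e in edges:
        a, b = tuple(e)
        neighbors[a].add(b)
        neighbors[b].add(a)
    # Power iteration: a word is important if important words co-occur with it.
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        rank = {
            n: (1 - damping) / len(nodes) + damping * sum(
                rank[m] / len(neighbors[m]) for m in neighbors[n] if neighbors[m]
            )
            for n in nodes
        }
    return sorted(rank, key=rank.get, reverse=True)

tokens = "graph algorithm ranks keyphrases from graph of words".split()
top = textrank(tokens)
```

Words that sit in well-connected neighborhoods bubble to the top, which is why the method tends to surface keyphrases rather than merely frequent words.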
Use of standards and related issues in predictive analytics - Paco Nathan
My presentation at KDD 2016 in SF, in the "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data" morning track about PMML and PFA http://dmg.org/kdd2016.html
Presented 2015-08-24 at SF Bay ACM, held at the eBay south campus in San Jose.
http://meetup.com/SF-Bay-ACM/events/221693508/
Project Jupyter https://jupyter.org/ evolved from IPython notebooks, and now supports a wide variety of programming language back-ends. Notebooks have proven to be effective tools used in Data Science, providing convenient packages for what Don Knuth coined as "literate programming" in the 1980s: code plus exposition in markdown. Results of running the code appear in-line as interactive graphics -- all packaged as collaborative, web-based documents. Some have said that the introduction of cloud-based notebooks is nearly as fundamental a change in software practice as the introduction of spreadsheets.
O'Reilly Media has been considering the question, "What comes after books and video?" Or, as one might imagine more pointedly, what comes after Kindle? To that point, we have collaborated with Project Jupyter to integrate notebooks into our content management process, allowing authors to generate articles, tutorials, reports, and other media products as notebooks that also incorporate video segments. Code dependencies are containerized using Docker, and all of the content gets managed in Git repositories. We have added another layer, an open source project called Thebe, which provides a kind of "media player" for embedding the containerized notebooks into web pages.
GalvanizeU Seattle: Eleven Almost-Truisms About Data - Paco Nathan
http://www.meetup.com/Seattle-Data-Science/events/223445403/
Almost a dozen almost-truisms about Data that almost everyone should consider carefully as they embark on a journey into Data Science. There are a number of preconceptions about working with data at scale where the realities beg to differ. This talk estimates that number to be at least eleven, though probably much larger. At least that number has a great line from a movie. Let's consider some of the less-intuitive directions in which this field is heading, along with likely consequences and corollaries -- especially for those who are just now beginning to study about the technologies, the processes, and the people involved.
Microservices, containers, and machine learning - Paco Nathan
http://www.oscon.com/open-source-2015/public/schedule/detail/41579
In this presentation, an open source developer community considers itself algorithmically. This shows how to surface data insights from the developer email forums for just about any Apache open source project. It leverages advanced techniques for natural language processing, machine learning, graph algorithms, time series analysis, etc. As an example, we use data from the Apache Spark email list archives to help understand its community better; however, the code can be applied to many other communities.
Exsto is an open source project that demonstrates Apache Spark workflow examples for SQL-based ETL (Spark SQL), machine learning (MLlib), and graph algorithms (GraphX). It surfaces insights about developer communities from their email forums. Natural language processing services in Python (based on NLTK, TextBlob, WordNet, etc.) get containerized and used to crawl and parse email archives. These produce JSON data sets; we then run machine learning on a Spark cluster to surface insights such as:
* What are the trending topic summaries?
* Who are the leaders in the community for various topics?
* Who discusses most frequently with whom?
This talk shows how to use cloud-based notebooks for organizing and running the analytics and visualizations. It reviews the background for how and why the graph analytics and machine learning algorithms generalize patterns within the data — based on open source implementations for two advanced approaches, Word2Vec and TextRank. The talk also illustrates best practices for leveraging functional programming for big data.
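One of the Exsto-style questions above, "who discusses most frequently with whom", reduces to counting reply pairs in the parsed email archive. A plain-Python sketch with made-up sample data (field names are hypothetical; the real project runs the equivalent at scale on Spark):

```python
# Count who replies to whom in an email archive. The message fields and
# sample data are illustrative, not the Exsto schema.
from collections import Counter

messages = [
    {"id": "m1", "sender": "alice", "in_reply_to": None},
    {"id": "m2", "sender": "bob", "in_reply_to": "m1"},
    {"id": "m3", "sender": "alice", "in_reply_to": "m2"},
    {"id": "m4", "sender": "carol", "in_reply_to": "m1"},
]

def reply_pairs(msgs):
    """Undirected counts of reply interactions between distinct senders."""
    by_id = {m["id"]: m for m in msgs}
    pairs = Counter()
    for m in msgs:
        parent = by_id.get(m["in_reply_to"])
        if parent and parent["sender"] != m["sender"]:
            pairs[frozenset((m["sender"], parent["sender"]))] += 1
    return pairs

pairs = reply_pairs(messages)
top_pair, count = pairs.most_common(1)[0]
```

The resulting pair counts become edge weights in the community graph that the GraphX analysis then mines for leaderboards and interaction patterns.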
https://www.eventbrite.com/e/talk-by-paco-nathan-graph-analytics-in-spark-tickets-17173189472
Big Brains meetup hosted by BloomReach, 2015-06-04
Case study / demo of a large-scale graph analytics project, leveraging GraphX in Apache Spark to surface insights about open source developer communities — based on data mining of their email forums. The project works with any Apache email archive, applying NLP and machine learning techniques to analyze message threads, then constructs a large graph. Graph analytics, based on concise Scala coding examples in Spark, surface themes and interactions within the community. Results are used as feedback for respective developer communities, such as leaderboards, etc. As an example, we will examine analysis of the Spark developer community itself.
QCon São Paulo: Real-Time Analytics with Spark Streaming - Paco Nathan
"Real-Time Analytics with Spark Streaming" presented at QCon São Paulo, 2015-03-26
http://qconsp.com/presentation/real-time-analytics-spark-streaming
This talk presents an overview of Spark and its history and applications, then focuses on the Spark Streaming component used for real-time analytics. We compare it with earlier frameworks such as MillWheel and Storm, and explore industry motivations for open-source micro-batch streaming at scale.
The talk will include demos for streaming apps that include machine-learning examples. We also consider public case studies of production deployments at scale.
We’ll review the use of open-source sketch algorithms and probabilistic data structures that get leveraged in streaming – for example, the trade-off of 4% error bounds on real-time metrics for two orders of magnitude reduction in required memory footprint of a Spark app.
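The memory-for-accuracy trade-off described above is exactly what sketch data structures provide. A minimal count-min sketch, one of the probabilistic structures commonly used in streaming (a generic illustration, not code from the talk; the width and depth parameters are arbitrary):

```python
# Minimal count-min sketch: bounded memory, small overestimation error,
# and it never undercounts. Width/depth here are illustrative choices.
import hashlib

class CountMinSketch:
    def __init__(self, width=1000, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One independent-ish hash per row, derived from a salted SHA-256.
        h = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += 1

    def count(self, item):
        # Minimum across rows bounds the overestimate from hash collisions.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
for event in ["click", "click", "view", "click"]:
    cms.add(event)
```

Memory is fixed at width x depth counters regardless of how many distinct events stream through, which is the two-orders-of-magnitude footprint reduction the talk alludes to.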
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More - Paco Nathan
Spark and Databricks component of the O'Reilly Media webcast "2015 Data Preview: Spark, Data Visualization, YARN, and More", as a preview of the 2015 Strata + Hadoop World conference in San Jose http://www.oreilly.com/pub/e/3289
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... - Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
GraphRAG is All You Need? LLM & Knowledge Graph - Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
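The knowledge-graph side of GraphRAG-style retrieval can be sketched as: store facts as triples, pull the subgraph around an entity, and use it to ground an LLM prompt. This is a toy illustration only; the triples and the prompt format are made up, and real systems (FalkorDB, the GraphRAG pipeline) use proper graph storage and query languages:

```python
# Toy sketch of graph-grounded retrieval: facts as (subject, predicate,
# object) triples, with an entity neighborhood used as prompt context.
triples = [
    ("FalkorDB", "is_a", "graph database"),
    ("GraphRAG", "combines", "knowledge graphs"),
    ("GraphRAG", "combines", "large language models"),
    ("knowledge graphs", "store", "triples"),
]

def neighborhood(entity):
    """All facts mentioning the entity, as grounding context."""
    return [t for t in triples if entity in (t[0], t[2])]

def grounding_prompt(entity):
    facts = "; ".join(f"{s} {p} {o}" for s, p, o in neighborhood(entity))
    return f"Answer using only these facts: {facts}"
```

Constraining generation to retrieved facts is what lets the graph curb hallucination on narrative, private data, the motivation behind the GraphRAG work cited above.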
JMeter webinar - integration with InfluxDB and Grafana - RTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
1. Hidden Gems found with Hadoop
Paco Nathan
Lead, Analytics team @ IMVU.com
2. Ask Questions Early…
‣ How do Hadoop and “Big Data” fit into the practice
of Continuous Deployment ?
‣ Why don’t we simply load all our data into Oracle,
then generate reports and spreadsheets as needed ?
‣ Given all the conflicting “NoSQL” options, how
does an engineer design an effective data store ?
‣ Is there one framework we can just buy and resolve
all these annoying data issues ?
‣ What kinds of analytics work can be performed
using Hadoop in the cloud ?
‣ Is IMVU currently hiring ? ☺
3. Continuous Deployment
• IMVU: ~50 engineers work in parallel, builds push live every ~8 minutes
• depends on “immune system” regression checks, progressive roll-outs
• dedication to transparency and metrics: data-intensive company culture
• extensive use of customer experiments (A/B testing) on millions of users
• instrumentation, alerting, strict discipline on config and resource usage
• Ops excellence, plus big investment in a finely tuned production environment
http://www.quora.com/What-are-best-examples-of-companies-using-continuous-deployment
http://www.slideshare.net/bgdurrett/3-reasons-you-should-use-continuous-deployment
http://www.startuplessonslearned.com/2009/06/why-continuous-deployment.html
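The “immune system” regression checks mentioned above can be sketched as a post-deploy gate that compares a key metric to its pre-deploy baseline; the metric, threshold, and function names below are illustrative assumptions, not IMVU’s actual tooling:

```python
# Illustrative sketch of an "immune system" deploy gate: after a push goes
# live, compare the observed error rate to the pre-deploy baseline and
# signal a rollback on regression. The threshold and names are assumptions.
REGRESSION_THRESHOLD = 1.5  # alarm if the metric worsens by more than 50%

def immune_check(baseline_error_rate, observed_error_rate):
    """Return 'rollback' when the post-deploy error rate regresses,
    otherwise 'keep' to let the progressive roll-out continue."""
    if observed_error_rate > baseline_error_rate * REGRESSION_THRESHOLD:
        return "rollback"
    return "keep"
```

In a real pipeline a check like this would run continuously during a progressive roll-out, widening the exposed cohort only while the gate keeps returning “keep”.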
5. Data Analytics
• data usage downstream from production cluster is a lower priority
• industry truism: data usage downstream almost never trumps
the priority of direct revenue transactions
• even so, business strategy depends on data analytics – which in
practice, at scale, must live downstream from transactions
• however, data analytics jobs tend to break the extensive
testing/monitoring work that makes continuous deployment possible:
- mission critical code which can’t be verified readily by unit tests
- “slow queries” trip immune system, signaling regressions
- likewise for large data transfers within production cluster
- tightly configured environment vs. elastic resource needs
6. How Did We Get Here?
• big Internet successes after 1997 holiday season…
AMZN, EBAY, then GOOG, Inktomi (YHOO Search)
• consider how, among tech firms, this metric:
annual revenue per customer / operational data store size
dropped more than 100x within a few years after 1997
• “conventional wisdom” of RDBMS and BI tools became
much less viable; however, business cadre which came of
age when “spreadsheets were new” tends to carry along
too much inertia to confront these issues pro-actively
• on the one hand, storage and processing costs plummeted…
on the other hand, we must now work much smarter
to extract ROI from “Big Data”, so methods must adapt
• MapReduce and the Hadoop open source stack grew
directly out of this context… but they only solve part
of these problems
7. CAP Theorem
• Eric Brewer, 2000: “You can have at most two of these properties for
any shared-data system … the choice of which feature to discard
determines the nature of your system.”
• direct revenue apps in consumer Internet require consistency and
partition tolerance
• data analytics jobs for business uses generally require availability and
eventual consistency, but tend not to tolerate highly partitioned data
• ETL becomes an Achilles heel for “Lean Startup™”:
‣ agile/experiment-driven/scale-out, which leads to…
‣ provably-hard-to-detect metadata drift, which leads to…
‣ high-risk technical debt
[CAP triangle diagram: C = strong consistency, A = high availability, P = partition tolerance; RDBMS sits on the C-A side, eventual consistency on the A-P side]
https://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
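A toy sketch of the trade-off above: an AP-style store acknowledges a write at one replica and lets a background anti-entropy pass converge the others, so reads can be stale in between, which analytics tolerates and revenue transactions cannot. Everything here is illustrative and not the API of any real database:

```python
# Toy model of eventual consistency: writes are acknowledged by the
# primary alone (availability over consistency); replicas converge only
# after a background anti-entropy pass. Names are illustrative.

class Replica:
    def __init__(self):
        self.data = {}

def write(replicas, key, value):
    """Acknowledge the write after the primary alone has it."""
    replicas[0].data[key] = value

def anti_entropy(replicas):
    """Background sync: propagate the primary's state to every replica."""
    for r in replicas[1:]:
        r.data.update(replicas[0].data)

replicas = [Replica() for _ in range(3)]
write(replicas, "credits", 100)
stale_reads = [r.data.get("credits", 0) for r in replicas]      # [100, 0, 0]
anti_entropy(replicas)
converged_reads = [r.data.get("credits", 0) for r in replicas]  # [100, 100, 100]
```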
8. Data Access Patterns
• design patterns: originated in consensus negotiation
for architecture, then software engineering
• consider the corollaries in large scale data wrangling…
• essential advice:
select data frameworks based on your data access patterns
• in other words, decouple usage based on need –
to avoid “one size fits all” blockers
• let’s review some examples…
9. Access Patterns ↔ Frameworks
financial transactions general ledger in RDBMS CAx
ad-hoc queries RDS (hosted MySQL) CAx
reporting, dashboards like Pentaho CAx
log rotation/persistence like Riak xxP
search indexes like Lucene, Solr xAP
static content, archives S3 (durable storage) xAP
customer facts like Redis, Membase xAP
distributed counters, locks, sets like Redis x A P*
data objects CRUD key/value – like, NoSQL on MySQL CxP
authoritative metadata like Zookeeper CxP
data prep, modeling at scale like Hadoop/Hive/Cascading + R CxP
graph analysis like Hadoop + Redis + Gephi CxP
data marts like Hadoop/Hive/HBase CxP
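Restating the table above as data makes the “x” column queryable: which frameworks forfeit a given CAP property. The dict and helper are an illustrative sketch, not any framework’s API:

```python
# The slide's table as data: access pattern -> (example framework, the
# CAP property/properties forfeited, i.e. the "x" positions in the slide).
PATTERNS = {
    "financial transactions":    ("general ledger in RDBMS",   "P"),
    "ad-hoc queries":            ("RDS (hosted MySQL)",        "P"),
    "reporting, dashboards":     ("Pentaho",                   "P"),
    "log rotation/persistence":  ("Riak",                      "CA"),
    "search indexes":            ("Lucene, Solr",              "C"),
    "static content, archives":  ("S3",                        "C"),
    "customer facts":            ("Redis, Membase",            "C"),
    "authoritative metadata":    ("Zookeeper",                 "A"),
    "data prep, modeling":       ("Hadoop/Hive/Cascading + R", "A"),
    "data marts":                ("Hadoop/Hive/HBase",         "A"),
}

def forfeits(prop):
    """List the frameworks that give up the named CAP property."""
    return sorted(fw for fw, gives_up in PATTERNS.values() if prop in gives_up)
```

For example, `forfeits("A")` returns the strongly consistent, partition-tolerant frameworks (Zookeeper, the Hadoop stack) that a revenue-critical, always-on path should avoid.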
11. Data Prep → Modeling at Scale
Analytics jobs performed in the cloud with Hadoop, R, etc.:
• log clean-up, sessionization
• roll-ups, slices, sampling, data cubes, visualizations
• language identification, key phrase extraction
• co-occurrence analysis, topic trending
• custom search indexes
• random forests and other classifiers
• connected components, effects across social graph
• virtual economy metrics
Business use cases:
• customer segmentation
• retention models
• anti-fraud
• content recommendation
• ad optimization
[Slide visual: mirrored screenshots of IMVU client event streams, with event names such as NUI:DressUpMode, Add Buddy, Website Login, Unique Registration, Chat Now]
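The first job on the list, log clean-up and sessionization, can be sketched in the map/reduce style used with Hadoop streaming; the tab-separated log format, event names, and 30-minute inactivity gap below are illustrative assumptions:

```python
from itertools import groupby
from operator import itemgetter

# Sessionization sketch in the MapReduce style: the mapper keys raw log
# lines by user, the reducer splits each user's time-sorted events into
# sessions wherever the inactivity gap exceeds a threshold.
SESSION_GAP = 30 * 60  # seconds of inactivity that ends a session

def map_phase(log_lines):
    """Map: parse 'user_id<TAB>timestamp<TAB>event' lines into keyed records."""
    for line in log_lines:
        user, ts, event = line.rstrip("\n").split("\t")
        yield user, (int(ts), event)

def reduce_phase(mapped):
    """Reduce: per user, split the time-sorted events into sessions."""
    for user, group in groupby(sorted(mapped), key=itemgetter(0)):
        events = sorted(ts_ev for _, ts_ev in group)
        sessions, current = [], [events[0]]
        for prev, cur in zip(events, events[1:]):
            if cur[0] - prev[0] > SESSION_GAP:
                sessions.append(current)
                current = []
            current.append(cur)
        sessions.append(current)
        yield user, sessions

logs = ["u1\t0\tlogin", "u1\t100\tchat", "u1\t10000\tlogin", "u2\t50\tlogin"]
result = dict(reduce_phase(map_phase(logs)))
# u1's 9900-second gap splits its events into two sessions; u2 has one
```

On a real cluster the two phases would run as separate streaming processes, with Hadoop performing the shuffle/sort between them.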
12. Finding Hidden Gems…
[Architecture diagram: data objects and transactions in MySQL partitions feed, via ETL and S3, into cloud-based data marts built on Hadoop, Hive, RDS, Lucene/Solr, Redis, R, and Gephi; these serve data access patterns (reporting, ad-hoc queries, search, cache, recommenders, graph analysis, sessionization, predictive modeling, social graph, factor analysis, time series, data visualization) as data services for business use cases]
14. Analytics Team, IMVU.com
• IMVU: 90 employees in Bay Area, $40MM annual rev
• largest virtual goods catalog: 6MM+ UGC items
- Best Places to Work in Bay Area, 2011 & 2010
- Red Herring Global 100 Tech Startup, 2010
- Inc. 500, 2010
http://www.imvu.com/jobs/
@pacoid
Editor's Notes
• prior teams: Jive, ShareThis, Adknowledge, HeadCase
• worked with Ray while at ShareThis on our DW and recommender systems
• 5 years experience with AWS, some at firms 100% in the cloud
• I’m a big believer in asking many questions up-front…
• this talk examines how Hadoop fits into what IMVU is famous for: continuous deployment
• we do some critical work with large data sets which makes RDBMS not a good fit
• CD allows many developers to respond to immediate needs, to experiment frequently
• transparency, measurement, and consistent data-driven decisions are absolutely requisite
• in short, we can handle in minutes or hours what other firms might take days, weeks, or months to do
• decisions and actions are highly distributed, and engineering process is well disciplined
• my team works in Analytics, and our data usage is at a different priority than our production cluster
• this is generally true throughout the industry
• business strategy depends on analytics –
• however, analytics work tends to break what we’ve so carefully instrumented
• how did we reach this condition?
• 1997Q4 through 1998Q1, AMZN/EBAY/GOOG/YHOO redefined data use
• revenue/data size, as a metric, fell through the floor
• previous practices in relational DBs and BI no longer worked so well
• CAP theorem explains an inherent conflict there…
• Internet transactions tend to need different kinds of data management than analytics
• partitioned databases are a solution for one aspect, but in turn cause ETL to become a huge problem
• fortunately, there are patterns we can use to engineer around those conflicts…
• providing that you don’t buy into “one size fits all” sales rhetoric from DB vendors
• design patterns help here: choose data frameworks which fit your data access patterns
• hopefully, this table states the CAP forfeits correctly – email me corrections, please :)
• some of these patterns migrate well to the cloud; you may miss a big opportunity if you don’t
• Redis is notable; rich/flexible atomic operations lend to not-shared cases
• let’s drill down into the Hadoop use cases…
• here are a variety of kinds of data preparation, discovery, modeling, and visualization for which my teams have used Hadoop and AWS
• generally the goal is to automate most all of the work, as “pipelines”, and deliver data products/data services
• these visualizations are actually some recent products from my team (less a few details stripped out)…
• geolocation, topic trending from text analytics, measuring effects across the social graph, and comparing features vs. retention
• BTW, Redis provides an excellent “left brain” to pair with Hadoop “right brain”
• this is not strictly “real-time” analytics, but cost-effective and follows guidance from CAP
• in other words, scalable data frameworks based on prevalent data access patterns
• here is some further reading, which I will post online…