Socializing Big Data
         Lessons from the Hadoop Community



         Jeff Hammerbacher
         Chief Scientist and Vice President of Products, Cloudera
         November 10, 2009



My Background
        Thanks for Asking
        ▪   hammer@cloudera.com
        ▪   Studied Mathematics at Harvard
        ▪   Worked as a Quant on Wall Street
        ▪   Conceived, built, and led the Data team at Facebook
            ▪   Nearly 30 amazing engineers and data scientists
            ▪   Released Hive and Cassandra as open source projects
            ▪   Published research at conferences: SIGMOD, CHI, ICWSM
        ▪   Founder of Cloudera
            ▪   Rethinking data analysis with Apache Hadoop at the core

Presentation Outline
        ▪   What is Hadoop?
        ▪   Hadoop at Facebook
            ▪   Brief history of the Facebook Data team
            ▪   Summary of how we used Hadoop
            ▪   Reasons for choosing Hadoop
        ▪   How is software built and adopted?
            ▪   “Laboratory Life”
            ▪   Social Learning Theory
            ▪   Organizations and tools in open source development
        ▪   Moving from the “Age of Data” to the “Age of Learning”


What I’m Not Talking About
        Ask Questions
        ▪   How to build a team of data scientists
        ▪   Where and how to use data analysis in your organization
        ▪   The growing importance of measurement and attention
        ▪   Which tools to use for collecting, storing, and analyzing data
        ▪   Statistics, Machine Learning, Data Visualization, Open Data
        ▪   How data analysis is done outside of the web domain
        ▪   What Big Data means for your startup




The Apache Hadoop community is producing innovative, world-class
        software for web-scale data management and analysis.

        By studying how software is built and adopted, we can enhance the
        rate at which data processing technologies evolve.

        The Hadoop community is open to everyone and will play a central
        role in this evolution. You should join us!


What is Hadoop?
        Not Just a Stuffed Elephant
        ▪   Open source project, written mostly in Java
        ▪   Inspired by Google infrastructure
            ▪   Software for “warehouse-scale computers”
        ▪   Hundreds of production deployments
        ▪   Project structure
            ▪   Hadoop Distributed File System (HDFS)
            ▪   Hadoop MapReduce
            ▪   Hadoop Common: client libraries and management tools
            ▪   Other subprojects: Avro, HBase, Hive, Pig, Zookeeper

Anatomy of a Hadoop Cluster
        ▪   Commodity servers
            ▪   2 x 4 core CPU, 8 GB RAM, 4 x 1 TB SATA, 2 x 1 GbE NIC
        ▪   Typically arranged in a 2-level architecture
            ▪   Nodes are commodity Linux PCs
            ▪   40 nodes per rack
        ▪   Inexpensive to acquire and maintain

        [Diagram: a commodity hardware cluster]
HDFS
        Store Enterprise Data

        HDFS manages storage on the cluster by breaking incoming files into
        pieces, called “blocks,” and storing each of the blocks redundantly
        across the pool of servers. In the common case, HDFS stores three
        complete copies of each file by copying each piece to three
        different servers.

        [Figure: HDFS distributes file blocks among servers]
MapReduce
                     MapReduce pushes work out to the data
        Hadoop takes advantage of HDFS’s data distribution strategy to push
        work out to many nodes in a cluster. This allows analyses to run in
        parallel and eliminates the bottlenecks imposed by monolithic
        storage systems.

        [Figure: Hadoop pushes work out to the data]
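        As an illustration of the programming model (the standard word-count
        example, sketched against the circa-0.20 Java API rather than taken
        from the talk), a job supplies a map function that runs on the nodes
        holding each block and a reduce function that aggregates the
        shuffled output:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // The map task runs where the data lives, emitting (word, 1) pairs.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // The reduce task sums the counts for each word after the shuffle.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}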
Hadoop Subprojects
        ▪   Avro
            ▪   Cross-language framework for data serialization and RPC
        ▪   HBase
            ▪   Table storage above HDFS, modeled after Google’s BigTable
        ▪   Hive
            ▪   SQL interface to structured data stored in HDFS (see the
                sketch after this list)
        ▪   Pig
            ▪   Language for data flow programming
        ▪   Zookeeper
            ▪   Coordination service for distributed systems
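
        To make “SQL interface” concrete, here is a hedged sketch of
        querying Hive through its early JDBC driver (a HiveServer listening
        on its default port 10000); the host, table, and columns are
        hypothetical, and in this era Hive compiled such queries into
        MapReduce jobs over files in HDFS.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Driver class and URL scheme follow the original Hive JDBC
        // interface; the host and table below are hypothetical.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive://hiveserver.example.com:10000/default", "", "");

        Statement stmt = conn.createStatement();
        // Hive turns this query into MapReduce jobs over HDFS files.
        ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(1) AS views FROM page_views GROUP BY page");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getString(2));
        }
        conn.close();
    }
}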

Hadoop at Yahoo!
        ▪   Jan 2006: Hired Doug Cutting
        ▪   Apr 2006: Sorted 1.9 TB on 188 nodes in 47 hours
        ▪   Apr 2008: Sorted 1 TB on 910 nodes in 209 seconds
        ▪   Aug 2008: Deployed 4,000 node Hadoop cluster
        ▪   May 2009: Sorted 1 TB on 1,460 nodes in 62 seconds
            ▪   Sorted 1 PB on 3,658 nodes in 16.25 hours
        ▪   Other data points
            ▪   Over 25,000 nodes running Hadoop across 17 clusters
            ▪   Hundreds of thousands of jobs per day from over 600 users
            ▪   82 PB of data


Facebook Before Hadoop
        Early 2006: The First Research Scientist
        ▪   Source data living on horizontally partitioned MySQL tier
        ▪   Intensive historical analysis difficult
        ▪   No way to assess impact of changes to the site


        ▪   First try: Python scripts pull data into MySQL
        ▪   Second try: Python scripts pull data into Oracle


        ▪   ...and then we turned on impression logging



Facebook Data Infrastructure
        2007
        [Diagram: a Scribe Tier and a MySQL Tier feed a Data Collection
        Server, which loads an Oracle Database Server]
Facebook Data Infrastructure
        2008
        [Diagram: a Scribe Tier and a MySQL Tier feed a Hadoop Tier, which
        loads Oracle RAC Servers]
Facebook Workloads
        ▪   Data collection
            ▪   server logs
            ▪   application databases
            ▪   web crawls
        ▪   Thousands of multi-stage processing pipelines
            ▪   Summaries consumed by external users
            ▪   Summaries for internal reporting
            ▪   Ad optimization pipeline
            ▪   Experimentation platform pipeline
        ▪   Ad hoc analyses


Workload Statistics
        Facebook 2009
        ▪   Largest cluster running Hive: 4,800 cores, 5.5 PB of storage
        ▪   4 TB of compressed new data added per day
        ▪   135 TB of compressed data scanned per day
        ▪   7,500+ Hive jobs per day
        ▪   80K compute hours per day
        ▪   Around 200 people per month run Hive jobs



            (data from Ashish Thusoo’s Hadoop World NYC presentation)


Why Did Facebook Choose Hadoop?
        1. Demonstrated effectiveness for primary workload
        2. Proven ability to scale past any commercial vendor
        3. Easy provisioning and capacity planning with commodity nodes
        4. Data access for engineers and business analysts
        5. Single system to manage XML, JSON, text, and relational data
        6. Absence of schemas enabled data collection without involving the Data team
        7. Cost of software: zero dollars
        8. Deep commitment to continued development from Yahoo!
        9. Active user and developer community
        10. Apache-licensed open source code; ASF owns copyright

Hadoop Community Support
        People Build Technology
        ▪   185+ contributors to the open source code base
            ▪   ~50 engineers at Yahoo!, ~15 at Facebook, ~15 at Cloudera
        ▪   Over 500 (paid!) attendees at Hadoop World NYC
        ▪   Three books (O’Reilly, Apress, Manning)
        ▪   Training videos free online
        ▪   Regular user group meetups in many cities
        ▪   University courses across the world
        ▪   Growing consultant and systems integrator expertise
        ▪   Commercial training, certification, and support from Cloudera

How Software is Built
        Methodological Reflexivity
        ▪   Latour and Woolgar’s “Laboratory Life”
            ▪   Study scientists doing science
            ▪   Use “thick descriptions” and focus on “microconcerns”
        ▪   Some studies of closed and open source development exist
            ▪   “The Mythical Man-Month”, “The Cathedral and the Bazaar”
            ▪   Hertel et al. surveyed 141 Linux kernel developers
        ▪   Focus on the people creating code
        ▪   Less religion, more empirical analyses
        ▪   Build tools to facilitate interaction and output

Building Open Source Software
        Structural Conditions for Success
        ▪   Moon and Sproull proposed some rules for successful projects
            ▪   Authority comes from competence
            ▪   Leaders have clear responsibilities and delegate often
            ▪   The code has a modular structure
            ▪   Establish a parallel release policy: stable and experimental
            ▪   Give credit to non-source contributions, e.g. documentation
            ▪   Communicate clear rules and norms for community online
            ▪   Use simple and reliable communication tools



Building Software Faster
        Consolidate Best Practices
        ▪   JavaScript frameworks starting to converge
            ▪   Many adopting jQuery’s selector syntax
            ▪   Significant benchmarks emerging
        ▪   Web frameworks push idioms into project structure
            ▪   What would be the Rails/Django equivalent for data storage?
            ▪   Reusable components also nice, e.g. log-structured merge trees
            ▪   Compare work on BOOM, RodentStore
        ▪   Debian distributes release note writing responsibility via “beats”



Complications of Open Source
        ▪   Intellectual property
            ▪   Trademark, Copyright, Patent, and Trade Secret
            ▪   Litigation history
        ▪   Business models and foundations to ensure long-term support
            ▪   Direct support: Red Hat, MySQL
            ▪   Indirect support: LLVM, GSoC
            ▪   Foundations: Apache, Python, Django
        ▪   Diversity of licenses
            ▪   Licenses form communities
            ▪   Licenses change over time (cf. Rambus BSD incident)


How Software is Adopted
        Choosing the Right Tool for the Job
        ▪   Must be aware that a software project exists
            ▪   Tools like GitHub, Ohloh, Launchpad
            ▪   Sites like Reddit and Hacker News
        ▪   Existing example use cases are critical
            ▪   At Facebook, we studied motivations for content production
            ▪   Especially effective: Bandura’s “Social Learning Theory”
            ▪   Hadoop being run in production at scale by Yahoo!/Facebook
        ▪   Active user communities and great documentation
            ▪   Reward first approach

Open Learning
        Open Data, Hypotheses and Workflows
        ▪   In science, data is generated once and analyzed many times
            ▪   IceCube
            ▪   LHC
        ▪   Lots of places where data and visualizations get shared
            ▪   data.gov, Many Eyes, Swivel, theinfo.org, InfoChimps, iCharts
        ▪   Record which hypotheses and workflows have been applied
        ▪   Increase diversity of questions asked and applications built
        ▪   Analysis skills unevenly distributed; send skills to the data!



The Future of Data Processing
        Hadoop, the Browser, and Collaboration
        ▪   “The Unreasonable Effectiveness of Data”, “MAD Skills”
        ▪   Single namespace for your organization’s bits
        ▪   Single engine for distributed data processing
        ▪   Materialization of structured subsets into optimized stores
        ▪   Browser as client interface with focus on user experience
        ▪   The system gets better over time using workload information
        ▪   Cloning and sharing of common libraries and workflows
        ▪   Global metadata store driving collection, analysis, and reporting
        ▪   Version control within and between sites, cf. Orchestra

Cloudera Offerings
        Only One Slide, I Promise
        ▪   Two software products
            ▪   Cloudera’s Distribution for Hadoop
            ▪   Cloudera Desktop
            ▪   ...more on the way
        ▪   Training and Certification
            ▪   For Developers, Operators, and Managers
        ▪   Support
        ▪   Professional services



Cloudera Desktop
                             Big Data can be Beautiful




(c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0



