Hadoop and Cloudera
Managing Petabytes with Open Source

Jeff Hammerbacher
Chief Scientist and Vice President of Products, Cloudera
Tuesday, October 27, 2009
Presentation at Genentech on October 27, 2009

Hadoop and Cloudera
Managing Petabytes with Open Source
Jeff Hammerbacher
Chief Scientist and Vice President of Products, Cloudera
October 27, 2009
Why You Should Care
Hadoop in the Life Sciences
▪ CloudBurst: Highly Sensitive Short Read Mapping with MapReduce
    ▪ “CloudBurst reduces the running time from hours to mere minutes”
▪ Crossbow: Genotyping from short reads using cloud computing
    ▪ “Crossbow shows how Hadoop can be an enabling technology for computational biology”
▪ SMARTS substructure searching using the CDK and Hadoop
    ▪ “The Hadoop framework makes handling large data problems pretty much trivial”
▪ Smith-Waterman Protein Alignment
    ▪ “Existing algorithms ported easily to Hadoop”
My Background
Thanks for Asking
▪ hammer@cloudera.com
▪ Studied Mathematics at Harvard
▪ Worked as a Quant on Wall Street
▪ Conceived, built, and led the Data team at Facebook
    ▪ Nearly 30 amazing engineers and data scientists
    ▪ Several open source projects and research papers
▪ Founder of Cloudera
    ▪ Vice President of Products and Chief Scientist (other titles)
▪ Also, check out the book “Beautiful Data”
Presentation Outline
▪ What is Hadoop?
    ▪ HDFS
    ▪ MapReduce
    ▪ Hive, Pig, Avro, Zookeeper, and friends
▪ Solving big data problems with Hadoop at Facebook and Yahoo!
    ▪ Short history of Facebook’s Data team
    ▪ Hadoop applications at Yahoo!, Facebook, and Cloudera
    ▪ Other examples: LHC, smart grid, genomes
▪ Questions and Discussion
What is Hadoop?
▪ Apache Software Foundation project, mostly written in Java
▪ Inspired by Google infrastructure
▪ Software for programming warehouse-scale computers (WSCs)
▪ Hundreds of production deployments
▪ Project structure
    ▪ Hadoop Distributed File System (HDFS)
    ▪ Hadoop MapReduce
    ▪ Hadoop Common
    ▪ Other subprojects
        ▪ Avro, HBase, Hive, Pig, Zookeeper
Anatomy of a Hadoop Cluster
▪ Commodity servers
    ▪ 1 RU, 2 x 4 core CPU, 8 GB RAM, 4 x 1 TB SATA, 2 x 1 GbE NIC
▪ Typically arranged in a 2-level architecture
    ▪ 40 nodes per rack
▪ Inexpensive to acquire and maintain
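A back-of-envelope check on what one such rack yields. This is a sketch, not vendor sizing guidance; the 3x replication factor is HDFS's default and is assumed here rather than stated on this slide.

```python
# Back-of-envelope storage capacity for one rack of the commodity servers
# described above: 40 nodes, each with 4 x 1 TB SATA disks. The 3x
# replication factor is HDFS's default and is an assumption, not a figure
# from the slide.

nodes_per_rack = 40
disks_per_node = 4
tb_per_disk = 1
replication = 3

raw_tb = nodes_per_rack * disks_per_node * tb_per_disk  # 160 TB raw
usable_tb = raw_tb / replication                        # ~53.3 TB of HDFS data

print(f"Raw capacity per rack: {raw_tb} TB")
print(f"Usable HDFS capacity:  {usable_tb:.1f} TB (after 3x replication)")
```

In practice usable space shrinks further (intermediate MapReduce output, OS overhead), which is part of why these clusters stay inexpensive to grow: adding racks adds capacity linearly.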
HDFS
▪ Pool commodity servers into a single hierarchical namespace
▪ Break files into 128 MB blocks and replicate blocks
▪ Designed for large files written once but read many times
    ▪ Files are append-only
▪ Two major daemons: NameNode and DataNode
    ▪ NameNode manages file system metadata
    ▪ DataNode manages data using local filesystem
▪ HDFS manages checksumming, replication, and compression
▪ Throughput scales nearly linearly with cluster size
▪ Access from Java, C, command line, FUSE, or Thrift
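The block-and-replica scheme above can be sketched in a few lines. This is an illustration of the idea only, not Hadoop's actual NameNode logic; the round-robin placement is a stand-in for HDFS's real rack-aware placement policy.

```python
# Simplified sketch of HDFS-style storage: split a file into fixed-size
# blocks, then assign each block to several distinct servers. Illustrative
# only; real HDFS placement is rack-aware and handled by the NameNode.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size cited on the slide
REPLICATION = 3                 # HDFS's default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def place_replicas(num_blocks, servers, replication=REPLICATION):
    """Round-robin each block's replicas across distinct servers."""
    return [
        [servers[(b + r) % len(servers)] for r in range(replication)]
        for b in range(num_blocks)
    ]

# A 300 MB file becomes two full 128 MB blocks plus one 44 MB block.
blocks = split_into_blocks(300 * 1024 * 1024)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
```

Because every block lives on multiple servers, losing a machine costs no data, and reads can be served from whichever replica is closest.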
HDFS
HDFS manages storage on the cluster by breaking incoming files into pieces, called “blocks,” and storing each of the blocks redundantly across the pool of servers. In the common case, HDFS stores three complete copies of each file by copying each piece to three different servers.
[Figure: HDFS distributes file blocks among servers]
Hadoop MapReduce
▪ Fault tolerant execution layer and API for parallel data processing
▪ Can target multiple storage systems
▪ Key/value data model
▪ Two major daemons: JobTracker and TaskTracker
▪ Many client interfaces
    ▪ Java
    ▪ C++
    ▪ Streaming
    ▪ Pig
    ▪ SQL (Hive)
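The key/value model is easiest to see in the classic word-count job. Below is a single-process simulation of the map, shuffle, and reduce phases, not code that runs on a cluster; a real job would submit equivalent functions through one of the client interfaces listed above, and the framework would run them on TaskTrackers near the data.

```python
# Local simulation of the MapReduce key/value model: a mapper emits
# (key, value) pairs, a shuffle groups values by key, and a reducer
# aggregates each group. Word count is the canonical example.

from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit (word, 1) for every word in an input line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Sum all counts emitted for one word.
    return (word, sum(counts))

def run_job(lines):
    # Map phase: apply the mapper to every input record.
    pairs = [kv for line in lines for kv in mapper(line)]
    # Shuffle/sort phase: bring all values for a key together.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key.
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )

counts = run_job(["hadoop stores data", "hadoop processes data"])
# counts["hadoop"] == 2, counts["stores"] == 1
```

The same mapper/reducer pair, reading stdin and writing stdout, is essentially what the Streaming interface lets you run on a real cluster in any language.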
MapReduce
MapReduce pushes work out to the data
Hadoop takes advantage of HDFS’s data distribution strategy to push work out to many nodes in a cluster. This allows analyses to run in parallel and eliminates the bottlenecks imposed by monolithic storage systems.
[Figure: Hadoop pushes work out to the data]
Hadoop Subprojects
▪ Avro
    ▪ Cross-language framework for RPC and serialization
▪ HBase
    ▪ Table storage on top of HDFS, modeled after Google’s BigTable
▪ Hive
    ▪ SQL interface to structured data stored in HDFS
▪ Pig
    ▪ Language for data flow programming; also Owl, Zebra, SQL
▪ Zookeeper
    ▪ Coordination service for distributed systems
Hadoop Community Support
▪ 185 total contributors to the open source code base
    ▪ 60 engineers at Yahoo!, 15 at Facebook, 15 at Cloudera
▪ Over 500 (paid!) attendees at Hadoop World NYC
    ▪ Hadoop World Beijing later this month
▪ Three books (O’Reilly, Apress, Manning)
▪ Training videos free online
▪ Regular user group meetups in many cities
▪ University courses across the world
▪ Growing consultant and systems integrator expertise
▪ Commercial training, certification, and support from Cloudera
Hadoop Project Mechanics
▪ Trademark owned by ASF; Apache 2.0 license for code
▪ Rigorous unit, smoke, performance, and system tests
▪ Release cycle of 3 months (-ish)
    ▪ Last major release: 0.20.0 on April 22, 2009
    ▪ 0.21.0 will be the last release before 1.0; nearly complete
    ▪ Subprojects on different release cycles
▪ Releases put to a vote according to Apache guidelines
▪ Releases made available as tarballs on Apache and mirrors
▪ Cloudera packages its own release for many platforms
    ▪ RPM and Debian packages; AMI for Amazon’s EC2
Hadoop at Facebook
Early 2006: The First Research Scientist
▪ Source data living on horizontally partitioned MySQL tier
▪ Intensive historical analysis difficult
▪ No way to assess impact of changes to the site
▪ First try: Python scripts pull data into MySQL
▪ Second try: Python scripts pull data into Oracle
▪ ...and then we turned on impression logging
Facebook Data Infrastructure
2007
[Diagram: Scribe Tier, MySQL Tier, Data Collection Server, Oracle Database Server]
Facebook Data Infrastructure
2008
[Diagram: Scribe Tier, MySQL Tier, Hadoop Tier, Oracle RAC Servers]
Major Data Team Workloads
▪ Data collection
    ▪ server logs
    ▪ application databases
    ▪ web crawls
▪ Thousands of multi-stage processing pipelines
    ▪ Summaries consumed by external users
    ▪ Summaries for internal reporting
    ▪ Ad optimization pipeline
    ▪ Experimentation platform pipeline
▪ Ad hoc analyses
Workload Statistics
Facebook 2009
▪ Largest cluster running Hive: 4,800 cores, 5.5 PB of storage
▪ 4 TB of compressed new data added per day
▪ 135 TB of compressed data scanned per day
▪ 7,500+ Hive jobs run per day
▪ 80K compute hours per day
▪ Around 200 people per month run Hive jobs
Hadoop at Yahoo!
▪ Jan 2006: Hired Doug Cutting
▪ Apr 2006: Sorted 1.9 TB on 188 nodes in 47 hours
▪ Apr 2008: Sorted 1 TB on 910 nodes in 209 seconds
▪ Aug 2008: Deployed 4,000 node Hadoop cluster
▪ May 2009: Sorted 1 TB on 1,460 nodes in 62 seconds
▪ Data Points
    ▪ Over 25,000 nodes running Hadoop across 17 clusters
    ▪ Hundreds of thousands of jobs per day
    ▪ Typical HDFS cluster: 1,400 nodes, 2 PB capacity
    ▪ Sorted 1 PB on 3,658 nodes in 16.25 hours
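A quick arithmetic check on the terabyte sorts above: aggregate throughput grew by roughly 3x between the 2008 and 2009 runs. Treating 1 TB as 10^12 bytes is an assumption; the slide does not state the benchmark's unit convention.

```python
# Aggregate throughput implied by the 1 TB sort results on this slide,
# assuming 1 TB = 10**12 bytes (the slide doesn't specify units).

def sort_throughput_gb_per_s(terabytes, seconds):
    """Aggregate bytes sorted per second, expressed in GB/s."""
    return terabytes * 1e12 / seconds / 1e9

apr_2008 = sort_throughput_gb_per_s(1, 209)  # 910 nodes: ~4.8 GB/s
may_2009 = sort_throughput_gb_per_s(1, 62)   # 1,460 nodes: ~16.1 GB/s

print(f"Apr 2008: {apr_2008:.1f} GB/s aggregate")
print(f"May 2009: {may_2009:.1f} GB/s aggregate")
```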
Example Hadoop Applications
▪ Yahoo!
    ▪ Yahoo! Search Webmap
    ▪ Content and ad targeting optimization
▪ Facebook
    ▪ Fraud and abuse detection
    ▪ Lexicon (text mining)
▪ Cloudera
    ▪ Facial recognition for automatic tagging
    ▪ Genome sequence analysis
    ▪ Financial services, government, and of course: HEP!
Cloudera Offerings
Only One Slide, I Promise
▪ Two software products
    ▪ Cloudera’s Distribution for Hadoop
    ▪ Cloudera Desktop
▪ Training and Certification
    ▪ For Developers, Operators, and Managers
▪ Support
▪ Professional services
Cloudera Desktop
Big Data can be Beautiful
(c) 2009 Cloudera, Inc. or its licensors. “Cloudera” is a registered trademark of Cloudera, Inc. All rights reserved. 1.0