Your SlideShare is downloading. ×
0
20091203gemini
20091203gemini
20091203gemini
20091203gemini
20091203gemini
20091203gemini
20091203gemini
20091203gemini
20091203gemini
20091203gemini
20091203gemini
20091203gemini
20091203gemini
20091203gemini
20091203gemini
20091203gemini
20091203gemini
20091203gemini
20091203gemini
20091203gemini
20091203gemini
20091203gemini
20091203gemini
20091203gemini
20091203gemini
20091203gemini
20091203gemini
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

20091203gemini

2,297

Published on

Published in: Technology, Education
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
2,297
On Slideshare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
0
Comments
0
Likes
0
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  1. Thursday, December 3, 2009
  2. Hadoop and Cloudera Managing Petabytes with Open Source Jeff Hammerbacher Chief Scientist and Vice President of Products, Cloudera December 3, 2009 Thursday, December 3, 2009
  3. My Background Thanks for Asking ▪ hammer@cloudera.com ▪ Studied Mathematics at Harvard ▪ Worked as a Quant on Wall Street ▪ Conceived, built, and led Data team at Facebook ▪ Nearly 30 amazing engineers and data scientists ▪ Several open source projects and research papers ▪ Founder of Cloudera ▪ Vice President of Products and Chief Scientist (other titles) ▪ Also, check out the book “Beautiful Data” Thursday, December 3, 2009
  4. Presentation Outline ▪ What is Hadoop? ▪ HDFS ▪ MapReduce ▪ Hive, Pig, Avro, Zookeeper, and friends ▪ Solving big data problems with Hadoop at Facebook and Yahoo! ▪ Short history of Facebook’s Data team ▪ Hadoop applications at Yahoo!, Facebook, and Cloudera ▪ Other examples: LHC, smart grid, genomes ▪ Questions and Discussion Thursday, December 3, 2009
  5. What is Hadoop? ▪ Apache Software Foundation project, mostly written in Java ▪ Inspired by Google infrastructure ▪ Software for programming warehouse-scale computers (WSCs) ▪ Hundreds of production deployments ▪ Project structure ▪ Hadoop Distributed File System (HDFS) ▪ Hadoop MapReduce ▪ Hadoop Common ▪ Other subprojects ▪ Avro, HBase, Hive, Pig, Zookeeper Thursday, December 3, 2009
  6. Anatomy of a Hadoop Cluster ▪ Commodity servers ▪ 1 RU, 2 x 4 core CPU, 8 GB RAM, 4 x 1 TB SATA, 2 x 1 gE NIC ▪ Typically arranged in 2 level architecture ▪ Commodity 40 nodes per rack Hardware Cluster ▪ Inexpensive to acquire and maintain •! Typically in 2 level architecture –! Nodes are commodity Linux PCs Thursday, December 3, 2009 –! 40 nodes/rack
  7. HDFS ▪ Pool commodity servers into a single hierarchical namespace ▪ Break files into 128 MB blocks and replicate blocks ▪ Designed for large files written once but read many times ▪ Files are append-only via a single writer ▪ Two major daemons: NameNode and DataNode ▪ NameNode manages file system metadata ▪ DataNode manages data using local filesystem ▪ HDFS manages checksumming, replication, and compression ▪ Throughput scales nearly linearly with node cluster size Thursday, December 3, 2009
  8. '$*31%10$13+3&'1%)#$#I% #79:"5$)$3-.".0&2$3-"&)"06-"*+,.0-2"84"82-$?()3"()*&5()3" /(+-."()0&"'(-*-.;"*$++-%"C8+&*?.;D"$)%".0&2()3"-$*6"&/"06-"8+&*?." HDFS 2-%,)%$)0+4"$*2&.."06-"'&&+"&/".-2=-2.<""B)"06-"*&55&)"*$.-;" #79:".0&2-."062--"*&5'+-0-"*&'(-."&/"-$*6"/(+-"84"*&'4()3"-$*6" '(-*-"0&"062--"%(//-2-)0".-2=-2.E" HDFS distributes file blocks among servers " " !" " F" I" !" " H" H" F" !" " F" G" #79:" G" I" I" H" " !" " F" G" G" I" H" " !"#$%&'()'*+!,'-"./%"0$/&.'1"2&'02345.'6738#'.&%9&%.' " Thursday, December 3, 2009
  9. Hadoop MapReduce ▪ Fault tolerant execution layer and API for parallel data processing ▪ Can target multiple storage systems ▪ Key/value data model ▪ Two major daemons: JobTracker and TaskTracker ▪ Many client interfaces ▪ Java ▪ C++ ▪ Streaming ▪ Pig ▪ SQL (Hive) Thursday, December 3, 2009
  10. MapReduce MapReduce pushes work out to the data (#)**+%$#41'% Q" K" #)5#0$#.1%*6%(/789% )#$#%)&'$3&:;$&*0% !" Q" '$3#$1.<%$*%+;'"%=*34% N" N" *;$%$*%>#0<%0*)1'%&0%#% ?@;'$13A%B"&'%#@@*='% #0#@<'1'%$*%3;0%&0% K" +#3#@@1@%#0)%1@&>&0#$1'% $"1%:*$$@101?4'% P" &>+*'1)%:<%>*0*@&$"&?% !" '$*3#.1%'<'$1>'A% Q" K" P" P" !" N" " !"#$%&'()'*+,--.'.$/0&/'1-%2'-$3'3-'30&',+3+' Thursday, December 3, 2009 "
  11. Hadoop Subprojects ▪ Avro ▪ Cross-language framework for RPC and serialization ▪ HBase ▪ Table storage on top of HDFS, modeled after Google’s BigTable ▪ Hive ▪ SQL interface to structured data stored in HDFS ▪ Pig ▪ Language for data flow programming; also Owl, Zebra, SQL ▪ Zookeeper ▪ Coordination service for distributed systems Thursday, December 3, 2009
  12. Hadoop Community Support ▪ 185+ contributors to the open source code base ▪ ~60 engineers at Yahoo!, ~15 at Facebook, ~15 at Cloudera ▪ Over 500 (paid!) attendees at Hadoop World NYC ▪ Three books (O’Reilly, Apress, Manning) ▪ Training videos free online ▪ Regular user group meetups in many cities ▪ New York Meetup group has 238 members ▪ University courses across the world ▪ Growing consultant and systems integrator expertise ▪ Commercial training, certification, and support from Cloudera Thursday, December 3, 2009
  13. Hadoop Project Mechanics ▪ Trademark owned by ASF; Apache 2.0 license for code ▪ Rigorous unit, smoke, performance, and system tests ▪ Release cycle of 9 months ▪ Last major release: 0.20.0 on April 22, 2009 ▪ 0.21.0 will be last release before 1.0; nearly complete ▪ Subprojects on different release cycles ▪ Releases put to a vote according to Apache guidelines ▪ Releases made available as tarballs on Apache and mirrors ▪ Cloudera packages a distribution for many platforms ▪ RPM and Debian packages; AMI for Amazon’s EC2 Thursday, December 3, 2009
  14. Hadoop at Facebook Early 2006: The First Research Scientist ▪ Source data living on horizontally partitioned MySQL tier ▪ Intensive historical analysis difficult ▪ No way to assess impact of changes to the site ▪ First try: Python scripts pull data into MySQL ▪ Second try: Python scripts pull data into Oracle ▪ ...and then we turned on impression logging Thursday, December 3, 2009
  15. Facebook Data Infrastructure 2007 Scribe Tier MySQL Tier Data Collection Server Oracle Database Server Thursday, December 3, 2009
  16. Facebook Data Infrastructure 2008 Scribe Tier MySQL Tier Hadoop Tier Oracle RAC Servers Thursday, December 3, 2009
  17. Major Data Team Workloads ▪ Data collection ▪ server logs ▪ application databases ▪ web crawls ▪ Thousands of multi-stage processing pipelines ▪ Summaries consumed by external users ▪ Summaries for internal reporting ▪ Ad optimization pipeline ▪ Experimentation platform pipeline ▪ Ad hoc analyses Thursday, December 3, 2009
  18. Workload Statistics Facebook 2009 ▪ Largest cluster running Hive: 4,800 cores, 5.5 PB of storage ▪ 4 TB of compressed new data added per day ▪ 135TB of compressed data scanned per day ▪ 7,500+ Hive jobs on per day ▪ 80K compute hours per day ▪ Around 200 people per month run Hive jobs (data from Ashish Thusoo’s Hadoop World NYC presentation) Thursday, December 3, 2009
  19. Hadoop at Yahoo! ▪ Jan 2006: Hired Doug Cutting ▪ Apr 2006: Sorted 1.9 TB on 188 nodes in 47 hours ▪ Apr 2008: Sorted 1 TB on 910 nodes in 209 seconds ▪ Aug 2008: Deployed 4,000 node Hadoop cluster ▪ May 2009: Sorted 1 TB on 1,460 nodes in 62 seconds ▪ Sorted 1 PB on 3,658 nodes in 16.25 hours ▪ Other data points ▪ Over 25,000 nodes running Hadoop across 17 clusters ▪ Hundreds of thousands of jobs per day from over 600 users ▪ 82 PB of data Thursday, December 3, 2009
  20. Cloudera Offerings Only One Slide, I Promise ▪ Two software products ▪ Cloudera’s Distribution for Hadoop ▪ Cloudera Desktop ▪ ...more on the way ▪ Support ▪ Professional services Thursday, December 3, 2009
  21. Hadoop at Cloudera Cloudera’s Distribution for Hadoop ▪ Open source distribution of Apache Hadoop for enterprise use ▪ Includes HDFS, MapReduce, Pig, Hive, and ZooKeeper ▪ Ensures cross-subproject compatibility ▪ Adds backported patches and customer-specific patches ▪ Adds Cloudera utilities like MRUnit and Sqoop ▪ Better integration with daemon administration utilities ▪ Follows the Filesystem Hierarchy Standard (FHS) for file layout ▪ Tools for automatically generating a configuration ▪ Packaged as RPM, DEB, AMI, or tarball Thursday, December 3, 2009
  22. Hadoop at Cloudera Training and Certification ▪ Free online training ▪ Basic, Intermediate (including Hive and Pig), and Advanced ▪ Includes a virtual machine with software and exercises ▪ Live training sessions ▪ One live session per month somewhere in the world ▪ If you have a large group, we may come to you ▪ Certification ▪ Exams for Developers, Administrators, and Managers ▪ Administered online or in person Thursday, December 3, 2009
  23. Hadoop at Cloudera Services and Support ▪ Professional Services ▪ Get Hadoop up and running in your environment ▪ Optimize an existing Hadoop infrastructure ▪ Design new algorithms to make the most of your data ▪ Support ▪ Unlimited questions for Cloudera’s technical team ▪ Access to our Knowledge Base ▪ Help prioritize feature development for CDH ▪ Early access to upcoming Cloudera software products Thursday, December 3, 2009
  24. Hadoop at Cloudera Commercial Software ▪ General thesis: build commercially-licensed software products which complement CDH for data management and analysis ▪ Current products ▪ Cloudera Desktop ▪ Extensible user interface for users of Cloudera software ▪ Upcoming products ▪ Talk to me in private Thursday, December 3, 2009
  25. Cloudera Desktop Big Data can be Beautiful Thursday, December 3, 2009
  26. Gemini-Specific Questions ▪ Scala ▪ ScalaNLP’s SMR, Jonhnny Weslley’s SHadoop ▪ Jeff Hodges’s Componentize ▪ Compliance/regulatory domain ▪ Security roadmap ▪ HADOOP-4487 ▪ HIVE-842 ▪ Real-time BI ▪ MapReduce Online Prototype (MOP) ▪ Talk to me in private Thursday, December 3, 2009
  27. (c) 2009 Cloudera, Inc. or its licensors.  "Cloudera" is a registered trademark of Cloudera, Inc.. All rights reserved. 1.0 Thursday, December 3, 2009

×