• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
20091203gemini
 

20091203gemini

on

  • 3,247 views

 

Statistics

Views

Total Views
3,247
Views on SlideShare
3,246
Embed Views
1

Actions

Likes
2
Downloads
54
Comments
0

1 Embed 1

http://www.linkedin.com 1

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    20091203gemini 20091203gemini Presentation Transcript

    • Thursday, December 3, 2009
    • Hadoop and Cloudera Managing Petabytes with Open Source Jeff Hammerbacher Chief Scientist and Vice President of Products, Cloudera December 3, 2009 Thursday, December 3, 2009
    • My Background Thanks for Asking ▪ hammer@cloudera.com ▪ Studied Mathematics at Harvard ▪ Worked as a Quant on Wall Street ▪ Conceived, built, and led Data team at Facebook ▪ Nearly 30 amazing engineers and data scientists ▪ Several open source projects and research papers ▪ Founder of Cloudera ▪ Vice President of Products and Chief Scientist (other titles) ▪ Also, check out the book “Beautiful Data” Thursday, December 3, 2009
    • Presentation Outline ▪ What is Hadoop? ▪ HDFS ▪ MapReduce ▪ Hive, Pig, Avro, Zookeeper, and friends ▪ Solving big data problems with Hadoop at Facebook and Yahoo! ▪ Short history of Facebook’s Data team ▪ Some stats about Hadoop use at Yahoo! ▪ What we’re building at Cloudera ▪ Questions and Discussion ▪ Including some questions Ajay sent me Thursday, December 3, 2009
    • What is Hadoop? ▪ Apache Software Foundation project, mostly written in Java ▪ Inspired by Google infrastructure ▪ Software for programming warehouse-scale computers (WSCs) ▪ Hundreds of production deployments ▪ Project structure ▪ Hadoop Distributed File System (HDFS) ▪ Hadoop MapReduce ▪ Hadoop Common ▪ Other subprojects ▪ Avro, HBase, Hive, Pig, Zookeeper Thursday, December 3, 2009
    • Anatomy of a Hadoop Cluster ▪ Commodity servers ▪ 1 RU, 2 x 4 core CPU, 8 GB RAM, 4 x 1 TB SATA, 2 x 1 gE NIC ▪ Typically arranged in 2 level architecture ▪ Commodity 40 nodes per rack Hardware Cluster ▪ Inexpensive to acquire and maintain •! Typically in 2 level architecture –! Nodes are commodity Linux PCs Thursday, December 3, 2009 –! 40 nodes/rack
    • HDFS ▪ Pool commodity servers into a single hierarchical namespace ▪ Break files into 128 MB blocks and replicate blocks ▪ Designed for large files written once but read many times ▪ Files are append-only via a single writer ▪ Two major daemons: NameNode and DataNode ▪ NameNode manages file system metadata ▪ DataNode manages data using local filesystem ▪ HDFS manages checksumming, replication, and compression ▪ Throughput scales nearly linearly with node cluster size Thursday, December 3, 2009
    • '$*31%10$13+3&'1%)#$#I% #79:"5$)$3-.".0&2$3-"&)"06-"*+,.0-2"84"82-$?()3"()*&5()3" /(+-."()0&"'(-*-.;"*$++-%"C8+&*?.;D"$)%".0&2()3"-$*6"&/"06-"8+&*?." HDFS 2-%,)%$)0+4"$*2&.."06-"'&&+"&/".-2=-2.<""B)"06-"*&55&)"*$.-;" #79:".0&2-."062--"*&5'+-0-"*&'(-."&/"-$*6"/(+-"84"*&'4()3"-$*6" '(-*-"0&"062--"%(//-2-)0".-2=-2.E" HDFS distributes file blocks among servers " " !" " F" I" !" " H" H" F" !" " F" G" #79:" G" I" I" H" " !" " F" G" G" I" H" " !"#$%&'()'*+!,'-"./%"0$/&.'1"2&'02345.'6738#'.&%9&%.' " Thursday, December 3, 2009
    • Hadoop MapReduce ▪ Fault tolerant execution layer and API for parallel data processing ▪ Can target multiple storage systems ▪ Key/value data model ▪ Two major daemons: JobTracker and TaskTracker ▪ Many client interfaces ▪ Java ▪ C++ ▪ Streaming ▪ Pig ▪ SQL (Hive) Thursday, December 3, 2009
    • MapReduce MapReduce pushes work out to the data (#)**+%$#41'% Q" K" #)5#0$#.1%*6%(/789% )#$#%)&'$3&:;$&*0% !" Q" '$3#$1.<%$*%+;'"%=*34% N" N" *;$%$*%>#0<%0*)1'%&0%#% ?@;'$13A%B"&'%#@@*='% #0#@<'1'%$*%3;0%&0% K" +#3#@@1@%#0)%1@&>&0#$1'% $"1%:*$$@101?4'% P" &>+*'1)%:<%>*0*@&$"&?% !" '$*3#.1%'<'$1>'A% Q" K" P" P" !" N" " !"#$%&'()'*+,--.'.$/0&/'1-%2'-$3'3-'30&',+3+' Thursday, December 3, 2009 "
    • Hadoop Subprojects ▪ Avro ▪ Cross-language framework for RPC and serialization ▪ HBase ▪ Table storage on top of HDFS, modeled after Google’s BigTable ▪ Hive ▪ SQL interface to structured data stored in HDFS ▪ Pig ▪ Language for data flow programming; also Owl, Zebra, SQL ▪ Zookeeper ▪ Coordination service for distributed systems Thursday, December 3, 2009
    • Hadoop Community Support ▪ 185+ contributors to the open source code base ▪ ~60 engineers at Yahoo!, ~15 at Facebook, ~15 at Cloudera ▪ Over 500 (paid!) attendees at Hadoop World NYC ▪ Three books (O’Reilly, Apress, Manning) ▪ Training videos free online ▪ Regular user group meetups in many cities ▪ New York Meetup group has 238 members ▪ University courses across the world ▪ Growing consultant and systems integrator expertise ▪ Commercial training, certification, and support from Cloudera Thursday, December 3, 2009
    • Hadoop Project Mechanics ▪ Trademark owned by ASF; Apache 2.0 license for code ▪ Rigorous unit, smoke, performance, and system tests ▪ Release cycle of 9 months ▪ Last major release: 0.20.0 on April 22, 2009 ▪ 0.21.0 will be last release before 1.0; nearly complete ▪ Subprojects on different release cycles ▪ Releases put to a vote according to Apache guidelines ▪ Releases made available as tarballs on Apache and mirrors ▪ Cloudera packages a distribution for many platforms ▪ RPM and Debian packages; AMI for Amazon’s EC2 Thursday, December 3, 2009
    • Hadoop at Facebook Early 2006: The First Research Scientist ▪ Source data living on horizontally partitioned MySQL tier ▪ Intensive historical analysis difficult ▪ No way to assess impact of changes to the site ▪ First try: Python scripts pull data into MySQL ▪ Second try: Python scripts pull data into Oracle ▪ ...and then we turned on impression logging Thursday, December 3, 2009
    • Facebook Data Infrastructure 2007 Scribe Tier MySQL Tier Data Collection Server Oracle Database Server Thursday, December 3, 2009
    • Facebook Data Infrastructure 2008 Scribe Tier MySQL Tier Hadoop Tier Oracle RAC Servers Thursday, December 3, 2009
    • Major Data Team Workloads ▪ Data collection ▪ server logs ▪ application databases ▪ web crawls ▪ Thousands of multi-stage processing pipelines ▪ Summaries consumed by external users ▪ Summaries for internal reporting ▪ Ad optimization pipeline ▪ Experimentation platform pipeline ▪ Ad hoc analyses Thursday, December 3, 2009
    • Workload Statistics Facebook 2009 ▪ Largest cluster running Hive: 4,800 cores, 5.5 PB of storage ▪ 4 TB of compressed new data added per day ▪ 135TB of compressed data scanned per day ▪ 7,500+ Hive jobs on per day ▪ 80K compute hours per day ▪ Around 200 people per month run Hive jobs (data from Ashish Thusoo’s Hadoop World NYC presentation) Thursday, December 3, 2009
    • Hadoop at Yahoo! ▪ Jan 2006: Hired Doug Cutting ▪ Apr 2006: Sorted 1.9 TB on 188 nodes in 47 hours ▪ Apr 2008: Sorted 1 TB on 910 nodes in 209 seconds ▪ Aug 2008: Deployed 4,000 node Hadoop cluster ▪ May 2009: Sorted 1 TB on 1,460 nodes in 62 seconds ▪ Sorted 1 PB on 3,658 nodes in 16.25 hours ▪ Other data points ▪ Over 25,000 nodes running Hadoop across 17 clusters ▪ Hundreds of thousands of jobs per day from over 600 users ▪ 82 PB of data Thursday, December 3, 2009
    • Hadoop at Cloudera Cloudera’s Distribution for Hadoop ▪ Open source distribution of Apache Hadoop for enterprise use ▪ Includes HDFS, MapReduce, Pig, Hive, and ZooKeeper ▪ Ensures cross-subproject compatibility ▪ Adds backported patches and customer-specific patches ▪ Adds Cloudera utilities like MRUnit and Sqoop ▪ Better integration with daemon administration utilities ▪ Follows the Filesystem Hierarchy Standard (FHS) for file layout ▪ Tools for automatically generating a configuration ▪ Packaged as RPM, DEB, AMI, or tarball Thursday, December 3, 2009
    • Hadoop at Cloudera Training and Certification ▪ Free online training ▪ Basic, Intermediate (including Hive and Pig), and Advanced ▪ Includes a virtual machine with software and exercises ▪ Live training sessions ▪ One live session per month somewhere in the world ▪ If you have a large group, we may come to you ▪ Certification ▪ Exams for Developers, Administrators, and Managers ▪ Administered online or in person Thursday, December 3, 2009
    • Hadoop at Cloudera Services and Support ▪ Professional Services ▪ Get Hadoop up and running in your environment ▪ Optimize an existing Hadoop infrastructure ▪ Design new algorithms to make the most of your data ▪ Support ▪ Unlimited questions for Cloudera’s technical team ▪ Access to our Knowledge Base ▪ Help prioritize feature development for CDH ▪ Early access to upcoming Cloudera software products Thursday, December 3, 2009
    • Hadoop at Cloudera Commercial Software ▪ General thesis: build commercially-licensed software products which complement CDH for data management and analysis ▪ Current products ▪ Cloudera Desktop ▪ Extensible user interface for users of Cloudera software ▪ Upcoming products ▪ Talk to me in private Thursday, December 3, 2009
    • Cloudera Desktop Big Data can be Beautiful Thursday, December 3, 2009
    • Gemini-Specific Questions ▪ Scala ▪ ScalaNLP’s SMR, Jonhnny Weslley’s SHadoop ▪ Jeff Hodges’s Componentize ▪ Compliance/regulatory domain ▪ Security roadmap ▪ HADOOP-4487 ▪ HIVE-842 ▪ Real-time BI ▪ MapReduce Online Prototype (MOP) ▪ Talk to me in private Thursday, December 3, 2009
    • (c) 2009 Cloudera, Inc. or its licensors.  "Cloudera" is a registered trademark of Cloudera, Inc.. All rights reserved. 1.0 Thursday, December 3, 2009