20100201hplabs
    20100201hplabs Presentation Transcript

    • Monday, February 1, 2010
    • What is Cloudera Building? For Now, At Least Jeff Hammerbacher Chief Scientist and Vice President of Products, Cloudera February 1, 2010
    • My Background Thanks for Asking ▪ hammer@cloudera.com ▪ Studied Mathematics at Harvard ▪ Worked as a Quant on Wall Street ▪ Conceived, built, and led Data team at Facebook ▪ Nearly 30 amazing engineers and data scientists ▪ Several open source projects and research papers ▪ Founder of Cloudera ▪ Vice President of Products and Chief Scientist ▪ Also, check out the book “Beautiful Data”
    • Presentation Outline ▪ What is Hadoop? ▪ HDFS and MapReduce ▪ Hive, Pig, Avro, ZooKeeper, HBase ▪ My time at Facebook ▪ The problem we were trying to solve ▪ Why we chose Hadoop ▪ What we’re building at Cloudera ▪ What we build and sell right now ▪ Some plans for the future ▪ Questions and Discussion
    • What is Hadoop? ▪ Apache Software Foundation project, mostly written in Java ▪ Inspired by Google infrastructure ▪ Software for programming warehouse-scale computers (WSCs) ▪ Hundreds of production deployments ▪ Project structure ▪ Hadoop Distributed File System (HDFS) ▪ Hadoop MapReduce ▪ Hadoop Common ▪ Other subprojects ▪ Avro, HBase, Hive, Pig, ZooKeeper
    • Anatomy of a Hadoop Cluster ▪ Commodity servers ▪ 1 RU, 2 x 4 core CPU, 8 GB RAM, 4 x 1 TB SATA, 2 x 1 GbE NIC ▪ Or: 2 RU, 2 x 8 core CPU, 32 GB RAM, 12 x 1 TB SATA ▪ Typically arranged in a 2-level architecture ▪ Inexpensive to acquire and maintain [Figure: Commodity Hardware Cluster. Typically in a 2-level architecture; nodes are commodity Linux PCs; 40 nodes/rack]
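To make the hardware numbers on this slide concrete, the following is a rough sketch of per-rack totals. The two node configurations and the figure of 40 nodes per rack come from the slide. The `NodeSpec` and `rack_totals` names are made up for illustration, and the choice of 20 nodes per rack for the 2 RU configuration is an assumption (same rack space as 40 nodes of 1 RU), not something the slide states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """One commodity server, described the way the slide lists it."""
    name: str
    cores: int    # total cores per node (sockets x cores per socket)
    ram_gb: int
    disks: int
    disk_tb: int

    @property
    def raw_tb(self) -> int:
        # Raw (unreplicated) disk capacity of a single node.
        return self.disks * self.disk_tb


# Configurations taken from the slide.
SMALL = NodeSpec("1 RU", cores=2 * 4, ram_gb=8, disks=4, disk_tb=1)
LARGE = NodeSpec("2 RU", cores=2 * 8, ram_gb=32, disks=12, disk_tb=1)

NODES_PER_RACK_1RU = 40  # from the slide's figure text
NODES_PER_RACK_2RU = 20  # ASSUMPTION: same rack space, nodes twice as tall


def rack_totals(spec: NodeSpec, nodes: int) -> dict:
    """Aggregate cores, RAM, and raw disk for one rack of identical nodes."""
    return {
        "cores": spec.cores * nodes,
        "ram_gb": spec.ram_gb * nodes,
        "raw_tb": spec.raw_tb * nodes,
    }


if __name__ == "__main__":
    print(SMALL.name, rack_totals(SMALL, NODES_PER_RACK_1RU))
    print(LARGE.name, rack_totals(LARGE, NODES_PER_RACK_2RU))
```

These are raw totals only. Usable HDFS capacity is lower once block replication and space reserved for intermediate data are taken into account.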
    • HDFS ▪ Pool commodity servers into a single hierarchical namespace ▪ Break files into 128 MB blocks and replicate blocks ▪ Designed for large files written once but read many times ▪ Files are append-only via a single writer ▪ Two major daemons: NameNode and DataNode ▪ NameNode manages file system metadata ▪ DataNode manages data using local filesystem ▪ HDFS manages checksumming, replication, and compression ▪ Throughput scales nearly linearly with cluster size
    • HDFS HDFS distributes file blocks among servers ▪ HDFS manages storage on the cluster by breaking incoming files into pieces, called “blocks,” and storing each of the blocks redundantly across the pool of servers. In the common case, HDFS stores three complete copies of each file by copying each piece to three different servers. [Figure: HDFS distributes file blocks among servers]
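The block splitting and replication described on the HDFS slides can be sketched in a few lines. The 128 MB block size comes from the slides, and a replication factor of 3 is the usual HDFS default. Everything else here is an assumption for illustration: the function names are invented, and the round-robin placement is a deliberately simple stand-in, since real HDFS placement is decided by the NameNode and takes rack topology into account.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, as on the slide
REPLICATION = 3                 # usual HDFS default


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list:
    """Return (block_index, length_in_bytes) pairs covering the whole file."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks


def place_replicas(num_blocks: int, servers: list,
                   replication: int = REPLICATION) -> dict:
    """Toy round-robin placement (NOT the real HDFS policy).

    Replica r of block b goes to servers[(b + r) % n], so the replicas
    of one block always land on distinct servers.
    """
    n = len(servers)
    if replication > n:
        raise ValueError("need at least as many servers as replicas")
    return {
        b: [servers[(b + r) % n] for r in range(replication)]
        for b in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
    placement = place_replicas(len(blocks), ["s0", "s1", "s2", "s3", "s4"])
    for index, length in blocks:
        print(index, length // (1024 * 1024), "MB ->", placement[index])
```

With this layout, losing any single server still leaves two copies of every block, which is the property the replication on the slide is meant to provide.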
    • Hadoop MapReduce ▪ Fault tolerant execution layer and API for parallel data processing ▪ Can target multiple storage systems ▪ Key/value data model ▪ Two major daemons: JobTracker and TaskTracker ▪ Many client interfaces ▪ Java ▪ C++ ▪ Streaming ▪ Pig ▪ SQL (Hive)
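The key/value data model is easiest to see in a small word count. What follows is an in-memory sketch of the map, shuffle, and reduce phases in plain Python. It does not use the Hadoop API, and all function names are illustrative. A real job would run the same mapper and reducer logic across many machines, with the framework handling the shuffle and failure recovery.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every (key, value) input record."""
    for key, value in records:
        yield from mapper(key, value)


def shuffle(pairs):
    """Group intermediate values by key and sort by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups, reducer):
    """Apply the reducer once per key with all of that key's values."""
    for key, values in groups:
        yield from reducer(key, values)


def word_count_mapper(offset, line):
    # Input key is the line's offset (unused); emit (word, 1) per word.
    for word in line.lower().split():
        yield word, 1


def word_count_reducer(word, counts):
    yield word, sum(counts)


def run_job(records, mapper, reducer):
    return list(reduce_phase(shuffle(map_phase(records, mapper)), reducer))


if __name__ == "__main__":
    lines = [(0, "the quick brown fox"), (20, "the lazy dog")]
    for word, count in run_job(lines, word_count_mapper, word_count_reducer):
        print(word, count)
```

Because both the mapper and the reducer only see key/value pairs, the same two functions could sit behind any of the client interfaces listed on the slide.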
    • MapReduce MapReduce pushes work out to the data ▪ Hadoop takes advantage of HDFS’ data distribution strategy to push work out to many nodes in a cluster. This allows analyses to run in parallel and eliminates the bottlenecks imposed by monolithic storage systems. [Figure: Hadoop pushes work out to the data]
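One way to picture "pushing work out to the data" is a scheduler that prefers to run each map task on a node that already holds a replica of its input block. The sketch below is a simplified assumption, not the actual JobTracker algorithm (which hands out tasks in response to TaskTracker heartbeats). The `assign_tasks` function and its inputs are hypothetical.

```python
def assign_tasks(block_locations: dict, slots_per_node: dict) -> dict:
    """Assign one map task per block, preferring nodes that hold the block.

    block_locations: block id -> list of nodes holding a replica
    slots_per_node:  node -> number of free task slots
    Returns block id -> (node, "local" or "remote").
    """
    free = dict(slots_per_node)  # copy so the caller's dict is untouched
    assignments = {}
    for block, holders in block_locations.items():
        # Nodes that hold a replica and still have a free slot.
        local = [node for node in holders if free.get(node, 0) > 0]
        if local:
            node, kind = max(local, key=lambda n: free[n]), "local"
        else:
            # Fall back to any node with capacity; the block must then
            # be read over the network.
            candidates = [node for node, slots in free.items() if slots > 0]
            if not candidates:
                raise RuntimeError("no free task slots in the cluster")
            node, kind = max(candidates, key=lambda n: free[n]), "remote"
        free[node] -= 1
        assignments[block] = (node, kind)
    return assignments


if __name__ == "__main__":
    locations = {"b0": ["n1", "n2"], "b1": ["n1", "n2"], "b2": ["n1"]}
    slots = {"n1": 1, "n2": 1, "n3": 1}
    for block, (node, kind) in assign_tasks(locations, slots).items():
        print(block, "->", node, kind)
```

In the usage example, the third block has to run remotely because its only holder is already busy. Keeping several replicas of each block is one reason such remote reads stay uncommon.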
    • Hadoop Subprojects ▪ Avro ▪ Cross-language framework for RPC and serialization ▪ HBase ▪ Table storage on top of HDFS, modeled after Google’s BigTable ▪ Hive ▪ SQL interface to structured data stored in HDFS ▪ Pig ▪ Language for data flow programming; also Owl, Zebra, SQL ▪ ZooKeeper ▪ Coordination service for distributed systems
    • Hadoop Community Support ▪ 185+ contributors to the open source code base ▪ ~60 engineers at Yahoo!, ~15 at Facebook, ~15 at Cloudera ▪ Over 500 (paid!) attendees at Hadoop World NYC ▪ Regular user group meetups in many cities ▪ Bay Area Meetup group has 534 members ▪ Three books (O’Reilly, Apress, Manning) ▪ Training videos free online ▪ University courses across the world ▪ Growing consultant and systems integrator expertise ▪ Training, certification, support, and services from Cloudera
    • Hadoop Project Mechanics ▪ Trademark owned by ASF; Apache 2.0 license for code ▪ Rigorous unit, smoke, performance, and system tests ▪ Release cycle of 9 months ▪ Last major release: 0.20.0 on April 22, 2009 ▪ 0.21.0 will be last release before 1.0; nearly complete ▪ Subprojects on different release cycles ▪ Releases put to a vote according to Apache guidelines ▪ Releases made available as tarballs on Apache and mirrors ▪ Cloudera packages a distribution for many platforms ▪ RPM and Debian packages; AMI for Amazon’s EC2
    • Hadoop at Facebook Early 2006: The First Research Scientist ▪ Source data living on horizontally partitioned MySQL tier ▪ Intensive historical analysis difficult ▪ No way to assess impact of changes to the site ▪ First try: Python scripts pull data into MySQL ▪ Second try: Python scripts pull data into Oracle ▪ ...and then we turned on impression logging
    • Facebook Data Infrastructure 2007 Scribe Tier MySQL Tier Data Collection Server Oracle Database Server
    • Facebook Data Infrastructure 2008 Scribe Tier MySQL Tier Hadoop Tier Oracle RAC Servers
    • Major Data Team Workloads ▪ Data collection ▪ server logs ▪ application databases ▪ web crawls ▪ Thousands of multi-stage processing pipelines ▪ Summaries consumed by external users ▪ Summaries for internal reporting ▪ Ad optimization pipeline ▪ Experimentation platform pipeline ▪ Ad hoc analyses
    • Workload Statistics Facebook 2010 ▪ Largest cluster running Hive: 8,400 cores, 12.5 PB of storage ▪ 12 TB of compressed new data added per day ▪ 135 TB of compressed data scanned per day ▪ 7,500+ Hive jobs per day ▪ 80K compute hours per day ▪ Around 200 people per month run Hive jobs (data from Ashish Thusoo’s Bay Area ACM DM SIG presentation)
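A few derived figures help put these workload statistics in proportion. The inputs below are the numbers on the slide. The derivations rest on assumptions that the slide does not confirm: that a "compute hour" means one core busy for one hour, that all jobs run on the largest cluster, and that 1 TB is 1,000 GB. Treat the results as back-of-the-envelope estimates only.

```python
CORES = 8_400                    # largest Hive cluster
COMPUTE_HOURS_PER_DAY = 80_000   # "80K compute hours per day"
SCANNED_TB_PER_DAY = 135         # compressed data scanned per day
JOBS_PER_DAY = 7_500             # slide says "7,500+", so a lower bound


def core_utilization(compute_hours: float, cores: int) -> float:
    """Fraction of available core-hours in a day that were used.

    ASSUMPTION: one compute hour equals one core busy for one hour.
    """
    return compute_hours / (cores * 24)


def per_job_averages(compute_hours: float, scanned_tb: float, jobs: int) -> dict:
    """Average compute hours and data scanned (GB, decimal) per job."""
    return {
        "compute_hours": compute_hours / jobs,
        "scanned_gb": scanned_tb * 1000 / jobs,
    }


if __name__ == "__main__":
    util = core_utilization(COMPUTE_HOURS_PER_DAY, CORES)
    averages = per_job_averages(COMPUTE_HOURS_PER_DAY, SCANNED_TB_PER_DAY, JOBS_PER_DAY)
    print(f"average core utilization: {util:.1%}")
    print(f"compute hours per job:    {averages['compute_hours']:.1f}")
    print(f"data scanned per job:     {averages['scanned_gb']:.0f} GB")
```

Under these assumptions the cluster averages roughly 40% core utilization, and a typical job scans on the order of 18 GB of compressed data. Since the job count is a lower bound, the true per-job averages are somewhat smaller.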
    • Why Did Facebook Choose Hadoop? 1. Demonstrated effectiveness for primary workload 2. Proven ability to scale past any commercial vendor 3. Easy provisioning and capacity planning with commodity nodes 4. Data access for all: engineers, business analysts, sales managers 5. Single system to manage XML/JSON, text, and relational data 6. No schemas enabled data collection without involving Data team 7. Simple, modular architecture 8. Easy to build, deploy, and monitor 9. Apache-licensed open source code granted to ASF
    • Why Did Facebook Choose Hadoop? ▪ Most importantly: the community ▪ Broad and deep commitment to future development from multiple organizations ▪ Interaction with a community often useful for recruiting ▪ Growing body of users and operators with prior expertise meant lower cost of training new users ▪ Learn about best practices from other organizations ▪ Widely available public materials for improving skills ▪ Not then, but now ▪ Commercial training, certification, support, and services ▪ Growing body of complementary software
    • Hadoop at Cloudera Cloudera’s Distribution for Hadoop ▪ Open source distribution of Apache Hadoop for enterprise use ▪ Includes HDFS, MapReduce, Pig, Hive, HBase, and ZooKeeper ▪ Ensures cross-subproject compatibility ▪ Adds backported patches and customer-specific patches ▪ Adds Cloudera utilities like MRUnit and Sqoop ▪ Better integration with daemon administration utilities ▪ Follows the Filesystem Hierarchy Standard (FHS) for file layout ▪ Tools for automatically generating a configuration ▪ Packaged as RPM, DEB, AMI, or tarball
    • Hadoop at Cloudera Training and Certification ▪ Free online training ▪ Basic, Intermediate (including Hive and Pig), and Advanced ▪ Includes a virtual machine with software and exercises ▪ Live training sessions ▪ One live session per month somewhere in the world ▪ If you have a large group, we may come to you ▪ Certification ▪ Exams for Developers, Administrators, and Managers ▪ Administered online or in person
    • Hadoop at Cloudera Services and Support ▪ Professional Services ▪ Get Hadoop up and running in your environment ▪ Optimize an existing Hadoop infrastructure ▪ Design new algorithms to make the most of your data ▪ Support ▪ Unlimited questions for Cloudera’s technical team ▪ Access to our Knowledge Base ▪ Help prioritize feature development for CDH ▪ Early access to upcoming Cloudera software products
    • Hadoop at Cloudera Upcoming Work: HDFS ▪ Symlinks ▪ Security ▪ Native client ▪ Refactoring of read and write path ▪ Fast upgrade and recovery; later, high availability ▪ Integration of Avro ▪ Integration with existing monitoring tools ▪ Better stress testing at scale
    • Hadoop at Cloudera Upcoming Work: MapReduce ▪ Avro end-to-end ▪ Avro in the shuffle ▪ Avro input and output formats ▪ Security ▪ Continued API evolution; better support for non-Java languages ▪ System counters ▪ Integration with a resource manager: Nexus? ▪ Potential rewrite of master process, the JobTracker ▪ Potential integration of MOP
    • Hadoop at Cloudera Upcoming Work: Avro ▪ Complete the RPC spec: SASL over TCP sockets, HTTP ▪ Also: counters for debugging ▪ Improved language support ▪ Currently: C, Java, C++, Python, Ruby ▪ Define “levels” of implementation ▪ Better command line utilities and tools for authoring schemas ▪ Additional compression codecs ▪ Additional benchmarks ▪ Integration with other projects (HDFS, MapReduce, Hive, HBase)
    • Hadoop at Cloudera Commercial Software ▪ General thesis: build commercially-licensed software products which complement CDH for data management and analysis ▪ Current products ▪ Cloudera Desktop ▪ Extensible interface for users of Cloudera software ▪ Upcoming products ▪ Nothing in writing
    • Cloudera Desktop Big Data can be Beautiful
    • (c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved.