Hadoop Lecture for Harvard's CS 264 -- October 19, 2009

  • 1. (More) Apache Hadoop Philip Zeyliger (Math, Dunster ‘04) philip@cloudera.com @philz42 @cloudera October 19, 2009 CS 264
  • 2. Who am I? Software Engineer Zak’s classmate Worked at (Interns)
  • 3. Outline Review of last Wednesday Your Homework Data Warehousing Short Break Some Hadoop Internals Research & Hadoop
  • 4. Last Wednesday
  • 5. The Basics Clusters, not individual machines Scale Linearly Separate App Code from Fault-Tolerant Distributed Systems Code Systems Programmers Statisticians
  • 6. Some Big Numbers Yahoo! Hadoop Clusters: > 82PB, >25k machines (Eric14, HadoopWorld NYC ’09) Google: 40 GB/s GFS read/write load (Jeff Dean, LADIS ’09) [~3,500 TB/day] Facebook: 4TB new data per day; DW: 4800 cores, 5.5 PB (Dhruba Borthakur, HadoopWorld)
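The bracketed daily figure for Google follows from the per-second rate. A quick check, assuming decimal units (1 TB = 1000 GB) and a load sustained for all 86,400 seconds of a day:

```latex
40\ \tfrac{\text{GB}}{\text{s}} \times 86{,}400\ \tfrac{\text{s}}{\text{day}}
  = 3{,}456{,}000\ \tfrac{\text{GB}}{\text{day}}
  = 3{,}456\ \tfrac{\text{TB}}{\text{day}}
  \approx 3{,}500\ \tfrac{\text{TB}}{\text{day}}
```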
  • 7. M-R Model Logical Physical Flow Physical
  • 8. Important APIs (→ is 1:many)
      M/R Flow: Input Format: data→K₁,V₁; Mapper: K₁,V₁→K₂,V₂; Combiner: K₂,iter(V₂)→K₂,V₂; Partitioner: K₂,V₂→int; Reducer: K₂,iter(V₂)→K₃,V₃; Out. Format: K₃,V₃→data
      Other: Writable; JobClient; *Context; Filesystem
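To make the type signatures on this slide concrete, here is a minimal in-memory sketch of the flow in plain Java. It is not the Hadoop API: `MiniMapReduce`, its `Mapper` and `Reducer` interfaces, and `wordCount` are hypothetical names invented for illustration, the combiner is left out, and a sorted map per reducer stands in for the shuffle.

```java
import java.util.*;
import java.util.function.*;

/** In-memory sketch of the M/R flow; hypothetical names, not the Hadoop API. */
final class MiniMapReduce {
    /** Mapper: (K1, V1) -> zero or more (K2, V2) pairs, hence "1:many". */
    interface Mapper<K1, V1, K2, V2> {
        void map(K1 key, V1 value, BiConsumer<K2, V2> emit);
    }

    /** Reducer: (K2, iter(V2)) -> zero or more (K3, V3) pairs. */
    interface Reducer<K2, V2, K3, V3> {
        void reduce(K2 key, Iterable<V2> values, BiConsumer<K3, V3> emit);
    }

    /** Partitioner: key -> int, choosing which reducer sees the key (hash modulo). */
    static <K2> int partition(K2 key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    static <K1, V1, K2 extends Comparable<K2>, V2, K3, V3> List<Map.Entry<K3, V3>> run(
            List<Map.Entry<K1, V1>> input,
            Mapper<K1, V1, K2, V2> mapper,
            Reducer<K2, V2, K3, V3> reducer,
            int numReducers) {
        // One sorted map per reducer stands in for the shuffle/sort phase.
        List<TreeMap<K2, List<V2>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) partitions.add(new TreeMap<>());

        // Map phase: every emitted pair is routed to its partition and grouped by key.
        for (Map.Entry<K1, V1> record : input) {
            mapper.map(record.getKey(), record.getValue(), (k2, v2) ->
                partitions.get(partition(k2, numReducers))
                          .computeIfAbsent(k2, k -> new ArrayList<>()).add(v2));
        }

        // Reduce phase: each key is seen once, together with all of its values.
        List<Map.Entry<K3, V3>> output = new ArrayList<>();
        for (TreeMap<K2, List<V2>> p : partitions) {
            for (Map.Entry<K2, List<V2>> group : p.entrySet()) {
                reducer.reduce(group.getKey(), group.getValue(),
                    (k3, v3) -> output.add(new AbstractMap.SimpleEntry<>(k3, v3)));
            }
        }
        return output;
    }

    /** Word count over (offset, line) records, using a single reducer. */
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<Long, String>> input = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            input.add(new AbstractMap.SimpleEntry<>(offset, line));
            offset += line.length() + 1;
        }
        List<Map.Entry<String, Integer>> out =
            MiniMapReduce.<Long, String, String, Integer, String, Integer>run(
                input,
                (k, line, emit) -> {
                    for (String w : line.split("\\s+")) if (!w.isEmpty()) emit.accept(w, 1);
                },
                (word, ones, emit) -> {
                    int sum = 0;
                    for (int one : ones) sum += one;
                    emit.accept(word, sum);
                },
                1);
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> e : out) counts.put(e.getKey(), e.getValue());
        return counts;
    }
}
```

Calling `MiniMapReduce.wordCount(Arrays.asList("a b a", "b a"))` yields a map with `a` counted three times and `b` twice. Keeping the reducer's input as `Iterable<V2>` mirrors the `iter(V₂)` in the slide's signatures.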
  • 9. the “grep” example
      public int run(String[] args) throws Exception {
        if (args.length < 3) {
          System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
          ToolRunner.printGenericCommandUsage(System.out);
          return -1;
        }
        Path tempDir = new Path("grep-temp-"+Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
        JobConf grepJob = new JobConf(getConf(), Grep.class);
        try {
          grepJob.setJobName("grep-search");
          FileInputFormat.setInputPaths(grepJob, args[0]);
          grepJob.setMapperClass(RegexMapper.class);
          grepJob.set("mapred.mapper.regex", args[2]);
          if (args.length == 4)
            grepJob.set("mapred.mapper.regex.group", args[3]);
          grepJob.setCombinerClass(LongSumReducer.class);
          grepJob.setReducerClass(LongSumReducer.class);
          FileOutputFormat.setOutputPath(grepJob, tempDir);
          grepJob.setOutputFormat(SequenceFileOutputFormat.class);
          grepJob.setOutputKeyClass(Text.class);
          grepJob.setOutputValueClass(LongWritable.class);
          JobClient.runJob(grepJob);
          JobConf sortJob = new JobConf(Grep.class);
          sortJob.setJobName("grep-sort");
          FileInputFormat.setInputPaths(sortJob, tempDir);
          sortJob.setInputFormat(SequenceFileInputFormat.class);
          sortJob.setMapperClass(InverseMapper.class);
          // write a single file
          sortJob.setNumReduceTasks(1);
          FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
          // sort by decreasing freq
          sortJob.setOutputKeyComparatorClass(LongWritable.DecreasingComparator.class);
          JobClient.runJob(sortJob);
        } finally {
          FileSystem.get(grepJob).delete(tempDir, true);
        }
        return 0;
      }
  • 10.
      $ cat input.txt
      adams dunster kirkland dunster kirland dudley dunster adams dunster winthrop
      $ bin/hadoop jar hadoop-0.18.3-examples.jar grep input.txt output1 'dunster|adams'
      $ cat output1/part-00000
      4 dunster
      2 adams
  • 11. Job 1 of 2
      JobConf grepJob = new JobConf(getConf(), Grep.class);
      try {
        grepJob.setJobName("grep-search");
        FileInputFormat.setInputPaths(grepJob, args[0]);
        grepJob.setMapperClass(RegexMapper.class);
        grepJob.set("mapred.mapper.regex", args[2]);
        if (args.length == 4)
          grepJob.set("mapred.mapper.regex.group", args[3]);
        grepJob.setCombinerClass(LongSumReducer.class);
        grepJob.setReducerClass(LongSumReducer.class);
        FileOutputFormat.setOutputPath(grepJob, tempDir);
        grepJob.setOutputFormat(SequenceFileOutputFormat.class);
        grepJob.setOutputKeyClass(Text.class);
        grepJob.setOutputValueClass(LongWritable.class);
        JobClient.runJob(grepJob);
      } ...
  • 12. Job 2 of 2
        JobConf sortJob = new JobConf(Grep.class);
        sortJob.setJobName("grep-sort");
        FileInputFormat.setInputPaths(sortJob, tempDir);
        sortJob.setInputFormat(SequenceFileInputFormat.class);
        sortJob.setMapperClass(InverseMapper.class);
        // (implicit identity reducer)
        // write a single file
        sortJob.setNumReduceTasks(1);
        FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
        // sort by decreasing freq
        sortJob.setOutputKeyComparatorClass(LongWritable.DecreasingComparator.class);
        JobClient.runJob(sortJob);
      } finally {
        FileSystem.get(grepJob).delete(tempDir, true);
      }
      return 0;
      }
  • 13. The types there... (?, Text); (Text, Long); (Text, list(Long)); (Text, Long); (Long, Text)
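The type chain on this slide can be traced through the two grep jobs with a small plain-Java imitation. This is only a sketch: `GrepSketch` and its methods are hypothetical names, no Hadoop classes are involved, and the alphabetical tie-break for equal counts is an assumption added to keep the output deterministic.

```java
import java.util.*;
import java.util.regex.*;

/** Plain-Java imitation of the two grep jobs; hypothetical names, no Hadoop classes. */
final class GrepSketch {
    /** Job 1: map each line to (match, 1), then sum per match, giving (Text, Long). */
    static Map<String, Long> grepSearch(List<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {                      // map: (?, Text) -> (Text, Long)
            Matcher m = pattern.matcher(line);
            while (m.find()) {
                counts.merge(m.group(), 1L, Long::sum);  // reduce: sum over list(Long)
            }
        }
        return counts;
    }

    /** Job 2: invert to (Long, Text) and sort by decreasing count. */
    static List<String> grepSort(Map<String, Long> counts) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            // Assumption: break ties alphabetically so the result is deterministic.
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : entries) {
            out.add(e.getValue() + "\t" + e.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "adams dunster kirkland dunster kirland dudley dunster adams dunster winthrop");
        for (String line : grepSort(grepSearch(input, "dunster|adams"))) {
            System.out.println(line);
        }
    }
}
```

Run on the words from the `input.txt` slide, `main` prints the count for `dunster` (4) followed by the count for `adams` (2), matching the `part-00000` listing.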
  • 14. A Simple Join
      People:
        Id | Last       | First
        1  | Washington | George
        2  | Lincoln    | Abraham
      Key Entry Log:
        Location | Id | Time
        Dunster  | 1  | 11:00am
        Dunster  | 2  | 11:02am
        Kirkland | 2  | 11:08am
      You want to track individuals throughout the day. How would you do this in M/R, if you had to?
  • 15. (white-board)
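The whiteboard discussion is not captured in the transcript. One common answer is a reduce-side join: key every record by Id, tag it with its source table, and let the reducer pair each person with that person's log entries. The plain-Java sketch below illustrates that idea only; it is not necessarily what was drawn in the lecture, and `JoinSketch` and the `"P"`/`"L"` tags are hypothetical.

```java
import java.util.*;

/** Reduce-side join sketch in plain Java; one possible approach, hypothetical names. */
final class JoinSketch {
    static List<String> join(List<String[]> people, List<String[]> keyEntryLog) {
        // "Map" phase: key every record by Id and tag it with its source table.
        Map<Integer, List<String[]>> byId = new TreeMap<>();
        for (String[] p : people) {              // {Id, Last, First}
            byId.computeIfAbsent(Integer.valueOf(p[0]), k -> new ArrayList<>())
                .add(new String[] {"P", p[2] + " " + p[1]});
        }
        for (String[] e : keyEntryLog) {         // {Location, Id, Time}
            byId.computeIfAbsent(Integer.valueOf(e[1]), k -> new ArrayList<>())
                .add(new String[] {"L", e[0] + " " + e[2]});
        }

        // "Reduce" phase: each Id's group holds its person record and its log entries.
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, List<String[]>> group : byId.entrySet()) {
            String name = null;
            List<String> visits = new ArrayList<>();
            for (String[] tagged : group.getValue()) {
                if (tagged[0].equals("P")) name = tagged[1];
                else visits.add(tagged[1]);
            }
            if (name == null) continue;          // inner join: skip entries with no person
            for (String visit : visits) out.add(name + " @ " + visit);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> people = Arrays.asList(
            new String[] {"1", "Washington", "George"},
            new String[] {"2", "Lincoln", "Abraham"});
        List<String[]> log = Arrays.asList(
            new String[] {"Dunster", "1", "11:00am"},
            new String[] {"Dunster", "2", "11:02am"},
            new String[] {"Kirkland", "2", "11:08am"});
        for (String row : join(people, log)) System.out.println(row);
    }
}
```

Buffering a group's values in memory is fine for a sketch like this. At scale one would more likely arrange for the person record to arrive first in each group, which is left out here.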
  • 16. Your Homework (this is the only lolcat in this lecture)
  • 17. Implement PageRank over Wikipedia Pages. Mental Challenges: Learn an algorithm; Adapt it to M/R Model. Practical Challenges: Learn Finicky Software; Debug an unfamiliar environment
  • 18. Advice Tackle Parts Separately Algorithm Implementing in M/R (What are the type signatures?) Starting a cluster on EC2 Small dataset Large dataset
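As a starting point for the "what are the type signatures?" question, a single PageRank iteration can be phrased as one map step and one reduce step. The plain-Java sketch below makes several assumptions that a real solution would need to revisit: the damping factor of 0.85 is the conventional choice rather than something the slides specify, every link target must also appear as a key of `links`, and pages with no outgoing links simply lose their rank. `PageRankSketch` is a hypothetical name.

```java
import java.util.*;

/** One PageRank iteration as map and reduce steps; plain Java, hypothetical names. */
final class PageRankSketch {
    static final double DAMPING = 0.85;   // conventional value; an assumption here

    /** links: page -> outgoing links. ranks: page -> current rank. */
    static Map<String, Double> iterate(Map<String, List<String>> links,
                                       Map<String, Double> ranks) {
        int n = links.size();

        // Map: (page, (rank, outlinks)) -> (target, rank / outDegree) per outlink.
        Map<String, List<Double>> contributions = new TreeMap<>();
        for (String page : links.keySet()) contributions.put(page, new ArrayList<>());
        for (Map.Entry<String, List<String>> e : links.entrySet()) {
            List<String> out = e.getValue();
            for (String target : out) {
                contributions.get(target).add(ranks.get(e.getKey()) / out.size());
            }
        }

        // Reduce: (page, list(contribution)) -> (page, new rank).
        Map<String, Double> next = new TreeMap<>();
        for (Map.Entry<String, List<Double>> e : contributions.entrySet()) {
            double sum = 0.0;
            for (double c : e.getValue()) sum += c;
            next.put(e.getKey(), (1 - DAMPING) / n + DAMPING * sum);
        }
        return next;
    }
}
```

In a real job the link structure also has to be carried from one iteration to the next, so the mapper would emit the outlink list alongside the rank contributions. That detail is omitted here to keep the types visible.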
  • 19. More Advice Wealth of “Getting Started” materials online Feel free to work together Don’t be a perfectionist about it; data is dirty! if (____ ≫ Java), use “streaming”
  • 20. Good Luck!
  • 21. Data Warehousing 101
  • 22. What is DW? a.k.a. BI “Business Intelligence” Provides data to support decisions Not the operational/transactional database e.g., answers “what has our inventory been over time?”, not “what is our inventory now?”
  • 23. Why DW? Learn from data; Reporting; Ad-hoc analysis, e.g.: which trail mix should TJ’s discontinue? (and other important business questions) [Background image: a page from a Trader Joe’s flyer.]
  • 24. Traditionally... Big databases Schemas Dimensional Modelling (Ralph Kimball)
  • 25. “MAD Skills”: Magnetic, Agile, Deep. [Slide shows the first page of the paper “MAD Skills: New Analysis Practices for Big Data” by Jeffrey Cohen (Greenplum), Brian Dolan (Fox Interactive Media), Mark Dunlap (Evergreen Technologies), Joseph M. Hellerstein (U.C. Berkeley), and Caleb Welton (Greenplum). From the abstract: “In this paper we highlight the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence.”]
  • 26. MADness is Enabling: BI / Reporting; Ad-hoc Queries?; RDBMS (Aggregates); Data Mining?; ETL (Extraction, Transform, Load); Storage (Raw Data); Collection; Instrumentation } Traditional DW
  • 27. BI / Reporting Ad-hoc Queries RDBMS (Aggregates) Data Mining ETL (Extraction, Transform, Load) Storage (Raw Data) Collection Instrumentation } Traditional DW
  • 28. Facebook’s DW (phase N): be Tier; MySQL Tier; Data Collection Server; Oracle Database Server
  • 29. Facebook’s DW (phase M), M > N: Facebook Data Infrastructure 2008; Scribe Tier; MySQL Tier; Hadoop Tier; Oracle RAC Servers (Wednesday, April 1, 2009)
  • 30. Short Break
  • 31. Hadoop Internals
  • 32. HDFS: Namenode; Datanodes (One Rack; A Different Rack). Example files: 3x64MB file, 3 rep; 4x64MB file, 3 rep; Small file, 7 rep
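The file labels on the HDFS slide imply some simple arithmetic: a file is cut into 64 MB blocks, and each block is stored once per replica. A back-of-the-envelope sketch follows; `HdfsBlockMath` is a hypothetical helper, and the 64 MB block size is taken from the slide rather than from any particular cluster's configuration.

```java
/** Back-of-the-envelope HDFS block math; hypothetical helper, 64 MB taken from the slide. */
final class HdfsBlockMath {
    static final long BLOCK_MB = 64;

    /** Number of blocks needed for a file, rounding the last partial block up. */
    static long blocks(long fileMb) {
        return (fileMb + BLOCK_MB - 1) / BLOCK_MB;
    }

    /** Total block replicas that the datanodes hold for the file. */
    static long replicas(long fileMb, int replication) {
        return blocks(fileMb) * replication;
    }

    /** Raw disk consumed across the cluster, in MB (a partial block uses only its own size). */
    static long rawMb(long fileMb, int replication) {
        return fileMb * replication;
    }

    public static void main(String[] args) {
        // The "3x64MB file, 3 rep" example: 192 MB of data.
        System.out.println(blocks(192) + " blocks, " + replicas(192, 3)
            + " replicas, " + rawMb(192, 3) + " MB raw");
    }
}
```

For the 3x64MB file at replication 3 this gives 3 blocks, 9 block replicas, and 576 MB of raw storage. A small file at replication 7 still occupies only one block, stored seven times.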
  • 33. HDFS Write Path
  • 34. HDFS Failures? Datanode crash? Clients read another copy; background rebalance. Namenode crash? uh-oh
  • 35. M/R: Tasktrackers on the same machines as datanodes. Legend: Job on stars; Different job; Idle. One Rack; A Different Rack
  • 36. M/R
  • 37. M/R Failures Task fails Try again? Try again somewhere else? Report failure Retries possible because of idempotence
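The point about idempotence can be illustrated with a small retry loop: because re-running a task yields the same result, the framework can simply run it again after a failure. This is a plain-Java sketch, not Hadoop's scheduler; `RetrySketch`, `runWithRetries`, and the attempt limit are hypothetical.

```java
import java.util.function.Supplier;

/** Sketch of task retry; hypothetical names, not Hadoop's scheduler. */
final class RetrySketch {
    /** Runs the task until it succeeds or maxAttempts is used up, then reports failure. */
    static <T> T runWithRetries(Supplier<T> task, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();   // safe to repeat only because the task is idempotent
            } catch (RuntimeException e) {
                last = e;            // a real scheduler might pick a different node here
            }
        }
        throw new IllegalStateException(
            "task failed after " + maxAttempts + " attempts", last);
    }
}
```

A task with side effects that are not idempotent, such as appending to a shared file, would break this scheme, since a retry after a partial run would apply the effect twice.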
  • 38. Programming these systems... Everything can fail Inherently multi-threaded Toolset still young Mental models are different...
  • 39. Research & Hadoop
  • 40. Scheduling & Sharing Mixed use Batch Interactive Text Real-time Isolation Metrics: Latency, Throughput, Utilization (per resource)
  • 41. Scheduling Fair and LATE Scheduling (Berkeley) Nexus (Berkeley) Quincy (MSR)
  • 42. Implementation: BOOM Project (Berkeley); Overlog (Berkeley). [Slide background is a page of OverLog source from a paper appendix: “A. Narada in OverLog” and “B. Chord in OverLog”.]
  • 43. Debugging and Visualization: Mochi (CMU); Parallax (UW). [Slide shows figures from the Mochi paper: “Figure 5: Summarized Swimlanes plot for RandomWriter (top) and Sort (bottom)” and “Figure 6: Matrix-vector Multiplication before optimization (above), and after optimization (below)”; the plots show per-task durations against Time/s for JT_Map and JT_Reduce.]
  • 44. Usability
  • 45. Performance Need for benchmarks (besides GraySort) Low-hanging fruit!
  • 46. Higher-Level Languages Hive (a lot like SQL) (Facebook/Apache) Pig Latin (Yahoo!/Apache) DryadLINQ (Microsoft) Sawzall (Google) SCOPE (Microsoft) JAQL (IBM)
  • 47. Optimizations For a single query.... For a single workflow... Across workflows... Bring out last century’s DB research! (joins) And file system research too! (RAID) HadoopDB (Yale) Data Formats (yes, in ’09)
  • 48. New Datastore Models File System Bigtable, Dynamo, Cassandra, ... Database
  • 49. New Computation Models MPI M/R Online M/R Dryad Pregel for Graphs Iterative ML Algorithms
  • 50. Hardware Larger-Scale Computing Data Center Design (Hamilton, Barroso, Hölzle) Energy-Efficiency Network Topology and Hardware What does flash mean in this context? What about multi-core?
  • 51. Synchronization, Coordination, and Consistency Chubby, ZooKeeper, Paxos, ... Eventual Consistency
  • 52. Applied Research (research using M/R) “Unreasonable Effectiveness of Data” WebTables (Cafarella) Translation ML...
  • 53. Conferences... (some in exotic locales) SIGMOD SOSP NSDI VLDB LADIS SC/ISC ICDE OSDI SoCC CIDR SIGCOMM Others (ask a prof!) HPTS HotCloud
  • 54. Parting Thoughts
  • 55. “Look, there are lots of different types of wheels!” – Todd Lipcon
      Don’t Re-invent: Focus on your data/problem. What about... Reliability, Durability, Stability, Tooling
      Re-invent! Lots of new possibilities! New Models! New implementations! Better optimizations!
      [Background image: a page from a Trader Joe’s flyer, including the line “Uh-oh. Looks like Joe’s been reinventing the wheel again.”]
  • 56. Conclusion It’s a great time to be in Distributed Systems. Participate! Build! Collaborate!
  • 57. Questions? philip@cloudera.com (we’re hiring) (interns)