The Hadoop Community


Jeff Hammerbacher
Manager, Data
May 28 - 29, 2008
One in a series of presentations given at the IBM Cloud Computing Center in Dublin.

Transcript of "20080529dublinpt1"

  1. The Hadoop Community
     Jeff Hammerbacher, Manager, Data
     May 28 - 29, 2008
  2. Who is Using Hadoop?
     ▪ Mentioned previously: Yahoo!, Powerset, Quantcast, Last.fm, Autodesk
     ▪ A9.com
         ▪ Build Amazon’s product search indices
         ▪ Session analytics
     ▪ FIM
         ▪ Log analysis and machine learning
     ▪ Wikia Search
         ▪ 125 nodes
     ▪ Veoh, NetSeer, Krugle, Rapleaf, Joost, New York Times
     ▪ For more, check out http://wiki.apache.org/hadoop/PoweredBy
  3. Hadoop in the ASF
     ▪ Began life as subproject of Lucene
     ▪ Now a top level project: http://hadoop.apache.org
         ▪ HBase now a subproject of Hadoop
         ▪ Pig and Mahout are related projects
         ▪ Zookeeper and virtual cluster management project in Apache soon
     ▪ Apache Software Foundation provides organizational and legal support
         ▪ Membership in ASF is by invitation only
         ▪ ASF Members elect a Board of Directors
         ▪ Each top level project has a Project Management Committee (PMC)
         ▪ Each PMC has a VP who is an officer of the ASF appointed by the Board
         ▪ The VP of the Hadoop PMC is Owen O’Malley of Yahoo
  4. The Hadoop PMC
     Name                 Organization
     Andrzej Bialecki     Getopt
     Doug Cutting         Yahoo!
     Dhruba Borthakur     Facebook
     Enis Soztutar        Agmlab
     Jim Kellerman        Powerset
     Nigel Daley          Yahoo!
     Owen O’Malley        Yahoo!
     Michael Stack        Powerset
     Christophe Taton     INRIA
     Tom White            Independent Consultant
  5. Apache Infrastructure for Hadoop
     ▪ Web site: Apache Forrest
     ▪ Wiki: MoinMoin
     ▪ Version Control: Subversion
     ▪ API Documentation: JavaDoc
     ▪ Bug Tracking: JIRA
     ▪ Continuous Build Server: Hudson
     ▪ IRC Channel: #hadoop on irc.freenode.org
     ▪ And of course, the mailing lists:
         ▪ core-user@hadoop.apache.org
         ▪ core-dev@hadoop.apache.org
  6. Contributing to Hadoop
     ▪ Get comfortable with available documentation on the website
     ▪ Read through the wiki
     ▪ Browse the mailing list archives
     ▪ Dig into the JIRA!
         ▪ Open source bug tracking software from Atlassian
         ▪ “Issues”: Bugs, feature requests, documentation requests
         ▪ Issues categorized by “component” and “version”
         ▪ “Workflow”: Issue as FSM; each state is a “status”
         ▪ http://www.atlassian.com/software/jira/docs/latest/introduction.html
  7. Contributing to Hadoop: More About JIRA
     ▪ Every Issue has a unique, numbered Key
         ▪ Type
         ▪ Status
         ▪ Priority
         ▪ Assignee
         ▪ Reporter
         ▪ Votes
         ▪ Watchers
  8. Contributing to Hadoop: More About Ticket Classification
     ▪ Status: Open, In Progress, Reopened, Resolved, Closed, or Patch Available
     ▪ Priorities: Blocker, Critical, Major, Minor, Trivial
     ▪ Type: Bug, Improvement, New Feature, Task
     ▪ Voting on an issue means you actively want to see it fixed
     ▪ Watching an issue means you can passively track progress
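
The slides describe the JIRA workflow as a finite-state machine in which each state is a status. The following is a minimal Python sketch of that idea. The status names come from the slide; the transition table is an assumption made purely for illustration and is not taken from the Hadoop project's actual JIRA configuration.

```python
# Statuses exactly as listed on the slide.
STATUSES = {"Open", "In Progress", "Reopened", "Resolved", "Closed", "Patch Available"}

# ASSUMPTION: illustrative transitions only. The real workflow is whatever
# the project's JIRA instance is configured with.
TRANSITIONS = {
    "Open": {"In Progress", "Patch Available", "Resolved"},
    "In Progress": {"Open", "Patch Available", "Resolved"},
    "Patch Available": {"Open", "In Progress", "Resolved"},
    "Resolved": {"Reopened", "Closed"},
    "Reopened": {"In Progress", "Patch Available", "Resolved"},
    "Closed": {"Reopened"},
}


def advance(current, target):
    """Return target if the move is allowed, otherwise raise ValueError."""
    if current not in STATUSES or target not in STATUSES:
        raise ValueError(f"unknown status: {current!r} or {target!r}")
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current!r} to {target!r}")
    return target


def replay(path):
    """Walk a sequence of statuses, validating every step, and return the last one."""
    state = path[0]
    for target in path[1:]:
        state = advance(state, target)
    return state
```

Keeping the transitions in a plain dictionary means the workflow can be changed without touching the validation logic.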
  9. Contributing to Hadoop: More About JIRA
     ▪ Title, Created Time, Updated Time, Component, Affect/Fix Version, Links/Sub-Tasks, Description, Comments
  10. Contributing to Hadoop: Filters and the Issue Navigator
      ▪ You can view related Issues via the Issue Navigator
          ▪ “Filter” determines what is shown in the Navigator
      ▪ Common Filters on the right-hand side of main login
          ▪ Outstanding, Assigned to Me, Reported by Me, Resolved Recently, Added Recently, Updated Recently, Most Important
          ▪ “Most Important” Filter just sorts by Issue Priority
          ▪ I’d recommend the “... Recently” and “Most Important” Filters first
      ▪ Can also click “Find Issues” on top nav to build your own Filters
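
Since the “Most Important” Filter is described as a plain sort by Issue Priority, its effect can be sketched in a few lines of Python. The priority order is the one given in the slides; the `Issue` class, the `DEMO-*` keys, and the sample data are hypothetical and exist only for this illustration. This is not JIRA's own implementation.

```python
from dataclasses import dataclass

# Priority names in the order given on the slide, most severe first.
PRIORITY_ORDER = ["Blocker", "Critical", "Major", "Minor", "Trivial"]
RANK = {name: position for position, name in enumerate(PRIORITY_ORDER)}


@dataclass(frozen=True)
class Issue:
    """Hypothetical, stripped-down issue record (not JIRA's data model)."""
    key: str
    priority: str


def most_important(issues):
    """Sort issues by priority; sorted() is stable, so ties keep input order."""
    return sorted(issues, key=lambda issue: RANK[issue.priority])


# Made-up sample data for illustration only.
SAMPLE = [
    Issue("DEMO-1", "Minor"),
    Issue("DEMO-2", "Blocker"),
    Issue("DEMO-3", "Major"),
    Issue("DEMO-4", "Blocker"),
]

if __name__ == "__main__":
    for issue in most_important(SAMPLE):
        print(issue.key, issue.priority)
```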
  11. The JIRA Issue Navigator
  12. The JIRA Issue Navigator: Creating a New Filter
  13. Contributing to Hadoop: JIRA Reports and Release Notes
      ▪ Reports add a visualization component to Filters
          ▪ Most can be applied to any saved filter
          ▪ Some Reports have a chart configured
      ▪ Common Reports:
          ▪ Road Map
          ▪ Open Issues
          ▪ Popular Issues (based on number of Votes)
      ▪ To keep up with what’s new, the Release Notes are quite useful
  14. Future Directions for Hadoop: HDFS
      ▪ For 0.18
          ▪ HADOOP-1700: Append to Files in HDFS
              ▪ Numerous blocking issues; hope to have code freeze by early June
              ▪ 8 Voters, 21 Watchers
          ▪ HADOOP-3022: Fast Cluster Restart
          ▪ HADOOP-1702: Reduce buffer copies when data is written to DFS
          ▪ HADOOP-3164: Use FileChannel.transferTo() when data is read from DN
          ▪ HADOOP-3058: Hadoop DFS to report more replication metrics
          ▪ HADOOP-3246: FTP client over HDFS
  15. Future Directions for Hadoop: HDFS
      ▪ Scalability
          ▪ Separate DFS into multiple volumes and have a NN per volume
          ▪ Manage volume metadata in Zookeeper
      ▪ Availability
          ▪ Mirroring
          ▪ Have Zookeeper manage metadata
      ▪ Backup and Recovery
          ▪ Synchronized global snapshot via ZFS or LVM
      ▪ http://wiki.apache.org/hadoop/HdfsFutures
  16. Future Directions for Hadoop: MapReduce
      ▪ For 0.18
          ▪ HADOOP-544: Replace the job, tip and task ids with objects
          ▪ HADOOP-3245: Provide ability to persist running jobs
          ▪ HADOOP-3130: Shuffling takes too long to get the last map output
          ▪ HADOOP-3221: Need a "LineBasedTextInputFormat"
          ▪ HADOOP-3149: Supporting multiple outputs for M/R jobs
          ▪ HADOOP-2182: Input Split details for maps should be logged
          ▪ HADOOP-3226: Run combiner when merging spills from map output
          ▪ HADOOP-3227: Implement a binary input/output format for Streaming
  17. Future Directions for Hadoop: MapReduce
      ▪ Scheduling
          ▪ Factor job and task scheduling out of code to allow for testing different policies (HADOOP-3412)
          ▪ Augment JobTracker to be a resource manager and job scheduler
      ▪ Speculative Execution Policies
          ▪ Separate logic for Mapper and Reducer
      ▪ Break Reducer into more granular tasks
      ▪ Allow for execution across many different data sources
          ▪ for example, MySQL
  18. Future Directions for Hadoop: Other Interesting Tickets
      ▪ HADOOP-4: Tool to mount dfs on linux
      ▪ HADOOP-249: Improving Map -> Reduce performance and Task JVM reuse
      ▪ HADOOP-2510: Map-Reduce 2.0
      ▪ HADOOP-2864: Improve the Scalability and Robustness of IPC
      ▪ HADOOP-2884: Refactor Hadoop package structure and source tree
      ▪ HADOOP-3366: Shuffle/Merge improvements
      ▪ HADOOP-3421: Requirements for a Resource Manager for Hadoop
      ▪ HADOOP-3444: Implementing a Resource Manager (V1) for Hadoop
  19. Contributing to Hadoop: Patch Submission
      ▪ http://wiki.apache.org/hadoop/HowToContribute
      ▪ Basically run “svn diff” on your checkout of trunk and write output to a “.patch” file, then attach it to the issue
      ▪ Hudson will pick up patch and apply to trunk
      ▪ Make sure to have tests and JavaDoc comments
      ▪ Performance regressions tested via DFSIO and GridMix benchmarks
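
The patch step above (run “svn diff” on a trunk checkout and write the output to a “.patch” file) can be sketched as a small Python helper. This is an illustration under stated assumptions, not the project's official tooling: the function name `write_patch`, the injectable `run` parameter, and naming the file after the issue key are all choices made for this example, and `svn` is assumed to be on the PATH. Attaching the file to the issue is still done by hand in JIRA.

```python
import subprocess
from pathlib import Path


def write_patch(checkout_dir, issue_key, run=subprocess.run):
    """Run `svn diff` inside checkout_dir and save the output as <issue_key>.patch.

    `run` defaults to subprocess.run and can be swapped out for testing.
    ASSUMPTION: naming the file after the issue key is only a convention
    chosen for this sketch.
    """
    result = run(
        ["svn", "diff"],
        cwd=checkout_dir,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if svn exits non-zero
    )
    patch_path = Path(checkout_dir) / f"{issue_key}.patch"
    patch_path.write_text(result.stdout)
    return patch_path
```

A call such as `write_patch("/path/to/trunk", "HADOOP-1234")` (both arguments hypothetical) would leave `HADOOP-1234.patch` in the checkout directory, ready to attach to the issue.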
  20. Contributing to Hadoop: Project Ideas
      ▪ http://wiki.apache.org/hadoop/ProjectSuggestions
          ▪ Testing, Tools, and Research
      ▪ Security
      ▪ Tools
          ▪ Performance monitoring and benchmarking
          ▪ Anomaly detection
          ▪ General system management
  21. Contributing to Hadoop: Project Ideas continued
      ▪ Performance
          ▪ Speculative execution policies
          ▪ Resource-aware task scheduling (instead of slot-based)
          ▪ Better failure detection algorithms
      ▪ Linear Algebra, Statistics, and Machine Learning
          ▪ SAS/R for massive data sets
          ▪ Vector and Matrix algebra libraries
          ▪ Common statistical functions: point estimation, hypothesis testing
          ▪ Model training and validation libraries
  22. (c) 2008 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0