Hadoop World 2011: Practical Knowledge for  Your First Hadoop Project - Mark Slusar, NAVTEQ, Boris Lublinsky, NAVTEQ, & Mike Segel, Segel & Associates
 


This session will discuss a collection of guidelines and advice to help a technologist complete their first Hadoop project. Part 1 reviews tactics to "sell" Hadoop to stakeholders and senior management, including understanding what Hadoop is, alignment of goals, picking the right project, and level-setting expectations. Part 2 entails running a successful Hadoop development project. Topics covered include training, preparation & planning activities, development & test activities, and deployment & operations activities. Also included are talking points to help with educating stakeholders.

Usage Rights

© All Rights Reserved

  • This is our obligatory slide that tells you who we are and that some of us are really old and have been doing this for far too long. Everyone does their own introduction, so that the audience knows who we are. Maybe have everyone introduce themselves, but I really don’t want to pimp myself. [Mikey]
  • [Mikey] I want to preface this slide by stating that the ideal audience for this presentation is someone who’s just starting to investigate Hadoop and wants to introduce it to their organization. If you’ve already started implementing a project, please pay attention to Part 2, where we discuss ways to increase your project’s chance of success. Also, any feedback on your Hadoop selling experience will be valuable to the authors.
  • [Mikey] Step one: setting your goals. The first thing you need to do is identify what problems you want to solve. Create a ‘short list’ of the problems and determine which problem is the best candidate. Look for a problem that can be solved in a M/R environment. Look for a problem where you’re not ‘betting the farm’ - one where, if you fail to deliver a solution on time and on budget, you’re not going to condemn Hadoop as an option for future projects. Create a problem statement which in plain English identifies the problem you are attempting to solve, plus some ‘boxing’ constraints which limit the scope of the problem. Once you have identified the problem you want to solve, you need to sell the problem and solution to your stakeholders. In selling the solution you want to focus on the solution itself and not the underlying technology - in this case, Hadoop. Sure, Hadoop is sexy and everyone wants to learn it… to pad their resume. But your stakeholders don’t care about the technology, just that you have a potential solution which solves their problem and is cost effective. While we are here because we like and use Hadoop, remember that Hadoop is just a tool; it’s not a ‘cure-all’ and perfect for every problem. If you’re at the stage where you know you want to use Hadoop but you don’t know what sort of problems you need to solve, it’s time to identify the potential stakeholders, those who…
  • [Mikey] This leads us to our next point: when do we want to use Hadoop? What sort of problems do we think will be a good fit for Hadoop, and what problems do we think would be better solved using a different tool? These are all questions that we have to think about before settling on a tool. [Boris will walk through slide]
  • [Mikey] Part of the selling process is to first realize what you want to sell to ‘management’. You first have to set your goals and know what you want to gain from the project (besides learning how to work with a really cool tool and padding your resume…). If you do not yet have a problem to solve, you may want to do some research and talk with your stakeholders. So… we set realistic goals, like processing X records per hour or Y incoming files - some metric that you know you can beat and that should really be obtainable. Once you have the project and the goals, you need to set boundaries, like processing only a specific stream of data, or only handling CSV files and not XML input files. Once you’ve set your boundaries, avoid scope creep if at all possible, noting that you can always add to the project after you get it working. Lock down the requirements at the start of the project. [Talk through points on slide…]
  • [Mikey] There is a psychology to the selling process, at a high level. Even if you’ve done your homework and know the answer before presenting the solution, if you provide the answer too quickly, your stakeholders will suspect you and your solution. You have to listen to your stakeholders, learn their ‘pain’, and determine the scope of the problem and what the constraints on the problem are. (Proposing a million-dollar solution when there’s only $100,000 in the budget doesn’t help.) By listening to the stakeholders, you are showing them that you are crafting your solution to meet their needs, and when you present your solution you can address and re-affirm their pain. Once you have a rough idea, get estimates rather than relying on a SWAG. It’s OK to say you don’t know something; make an action item and take the time to get the right answer. The stakeholders need to buy in to your solution and to take ownership of it. They need to appreciate the underlying technology, its challenges, and its risks.
  • [Mikey] When presenting the solution to the stakeholders, you need to have your ducks in a row. You need to re-affirm their stated pains, along with any latent pains you find while talking with the stakeholders’ team(s). This not only shows that you are presenting a solution, but that your solution addresses their needs, and it starts the process of their taking ownership of the solution. During the presentation you want to avoid ‘cutting to the chase’, or going straight to the bottom line of saying that it costs X dollars. By going straight to the bottom line, you don’t give the stakeholders and project leads time to digest the solution and to take ownership of the problem. Stakeholders typically don’t care about the underlying technology. They are more interested in finding a cost-effective solution that solves their problem and can be modified as the business or environment changes. This is not to say that explaining the technology in the solution isn’t important, but your ‘sales’ success is going to be based on how well you meet their criteria for success. Does the solution meet their needs? What’s the time to value? Relative to other possible solutions, is this cost effective? There are some objections that you can’t overcome. In those situations, Hadoop isn’t a good fit and you should move on to the next potential problem to solve.
  • Mark
  • Boris + Mark
  • Hadoop is an unregistered trademark of Apache and is meant to refer to Apache’s release only. Any release which is not the official release from Apache would be a derivative work. Cloudera offers a derivative work which is free and also has commercial support. MapRTech has a derivative work that replaces the underlying HDFS with their own proprietary use of C++ and writing directly to the raw disk. Mikey, Boris - Amazon
  • The KISS principle has been around for ages. Regardless of your design methodology, you want to start off simple and build out. This allows you to learn the technology and work through the design challenges. When working through the software design, start by creating a simple English description of how you want to process the data and what you want to achieve in each step. This is useful when you need to go back to the SME/Business Analyst, who may not be familiar with UML or a class diagram. (It’s also a document that you can use to verify the other diagrams.) Boris, Mike - hardware
  • Boris
  • [Mike] I am not sure what you want to say with this slide - please add speaker notes! [Mark] One of your first tasks is setting your environment up. Whether you go virtual or physical for your first project, you will need to refer to documentation; do not stray too far from the default configuration until you are comfortable or advised to do so. Use a tool like Puppet or Chef for configuration management. Additional configuration tips can be found in Mike Guenther’s presentation and at docs.cloudera.com. As you develop features, you will need to address your data model. You will also be writing code to ingest data, process it, and display it. Keep in mind that these features will be part of your report on how you succeeded with Hadoop. This is a multi-level slide: 1. You need a reproducible environment - you can’t afford to rely on manual tweaking every time you have to re-install. 2. Without proper configuration your application will not work. Configuration is a two-level process: optimize your cluster to run any job well, then optimize your job for the cluster that you are using. Give an example of separating HBase configuration from your table configuration. 3. Describe design steps. Mark?
  • Boris.
  • Boris, Mike community
  • Mark
  • Mark
  • Mark?
  • All

Presentation Transcript

  • Boris Lublinsky / NAVTEQ • Mark Slusar / NAVTEQ • Mike Segel / Segel & Assoc.
  • Boris Lublinsky • 25+ yrs experience as an enterprise architect with a focus on end-to-end solutions, distributed systems, SOA, BPM, etc. • InfoQ SOA editor, OASIS member, writer, speaker • FermiLab, SSA, Platinum, CNA, Navteq, et al. Mike Segel • 20+ yrs experience in the IT industry with a focus on high-powered computing, information management, and philosophy • Founder of Chicago Hadoop User Group (an excuse to drink beer and eat pizza) • Clients include NAVTEQ, Orc, IBM, Informix, Montgomery Ward, CCC, and others… Mark Slusar • 15 yrs experience with a background of design, technology, and leadership • Sponsor of Chicago Hadoop User Group • Federal Reserve, NEC, United Airlines, NAVTEQ, et al.
  •  This presentation is based on our 2+ years of Hadoop project onboarding experiences. Part 1: Tactics to ‘sell’ Hadoop to Stakeholders and Senior Management • Understanding what Hadoop is • Alignment of goals • Picking the right project • Level-setting expectations Part 2: Running a Successful Development Project • Training • Preparation & Planning Activities • Development & Test Activities • Deployment & Operations Activities
  • • Define the problem: • Understanding the company’s pain(s) • Finding the right problem to solve • Low-hanging fruit • High value • High visibility • Don’t bet the farm • Create a problem statement Sell the solution(s), not a technology • Selling is an educational process • Understand that Hadoop is a tool, not a panacea or ‘cure-all’
  • Hadoop Not HadoopLarge data storage Real-time data processing Bringing execution to the (difficult) data Data Set is not large enough Structured and Processing algorithm not unstructured data compatible w M/R Massively Parallel Existing processes are well processing suited to solve the problem. Extensible ecosystem ACID Requirements (Transaction Based) One person doing a million things vs. one million people doing one thing.
  •  Set realistic goals Set boundaries Avoid scope creep Embrace what you don’t know: • Honest evaluation of your and your team’s skills • Hadoop is a paradigm shift, therefore you need to alter your approach to solving the problem. • Level-set expectations • Technology is new to the organization • There is a learning curve • TANSTAAFL (There ain’t no such thing as a free lunch.) • Think for yourself: take Hadoop urban legends with a grain of salt
  •  The sales process takes time. Selling is an educational process  For you: • Learn the stakeholders’ pain • Determine the scope of the problem • Formulate your own estimates  For your stakeholders: • Must ‘buy in’ to your solution • Appreciate the underlying technology • Understand the risks Don’t oversell and underestimate
  •  Reaffirm the stated pains and any identified latent pain(s). Give your audience time to digest the presented information. Show how the solution solves their problem Avoid ‘The Bottom Line’ Understand common objections and overcome them. • “…We can do this in a RDBMS…” • “…This sounds risky…” • “…Who else is doing this?…” • “…Who’s using it in production?…” • “…Sounds expensive…” Talking points included at the end of the slide presentation
  •  Executive Sponsorship – Identify the key players and understand their ‘pains’. Project is Sufficiently Funded Project Charter – The project is well defined with set goals and expectations. Level Set Expectations: The technology is new to your company, and it should be expected that you will face setbacks during the project. (Lower the expectations to a point where you know you can exceed them.) Outside Expertise. (Buy/Build/Blended Model)
  •  Resources have been identified and have been dedicated to this project.  Business Analyst support – A good understanding of data and access patterns is essential.  Architecture – Hadoop is a paradigm shift; it is essential to reflect it in a solution architecture. Integration with existing enterprise applications can provide additional challenges.  Developers – Candidates need Java/Unix proficiency with a myriad of data-driven projects under their belt, plus the ability and desire to learn new tricks.  Infrastructure Support – Have Hadoop administrators who are experienced and/or capable of learning. Training – Not just APIs, but also Hadoop concepts and patterns.
  •  Hadoop is an unregistered TM of Apache There exist several companies that provide commercial support for Hadoop and Hadoop derivatives. • Cloudera • MapRTech • HortonWorks • Others (HStreaming, DataStax, …) And there is also Amazon…
  •  Application – Walk through the business process and create a simple plain-English outline of what you want to achieve in each step. Hardware – Determine your initial data set(s) and design your cluster accordingly. Design & development are iterative processes.  Your first iteration is rarely your last iteration.  Don’t be embarrassed by your code; share it with others for feedback and improvement. KISS, KISS, KISS Data storage – Which to use: HDFS or HBase?
  • HDFS vs. HBase:
    HDFS: Use HDFS when you are always going to access your data as an entire set or a very large subset. HDFS access is sequential and read-only; it supports only create and append. HDFS is mainly used in Map/Reduce; direct access from the client is possible but typically requires indexing, and it provides language (Java) APIs only. When using HDFS you always want large (GB-sized) files; packaging smaller files into larger ones requires development effort.
    HBase: Use HBase when you want random access to your data set - access individual records, partial records, and subsets of records. HBase provides more control over partitioning data. HBase supports get, put, update, and scan of sequential keys. HBase can be accessed from either a Map/Reduce program or directly from a client; it supports Java, REST, and Thrift APIs. HBase provides built-in versioning and purging of data. Many new enhancements are coming; coprocessors are the most significant one.
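To make the contrast concrete, here is a toy, in-memory Python sketch of the two access patterns this slide compares. This is not the real HDFS or HBase API; every class and method name here is invented purely for illustration.

```python
from bisect import bisect_left, insort_left

class AppendOnlyFile:
    """HDFS-style store: create/append only; reads are full sequential scans."""
    def __init__(self):
        self._records = []

    def append(self, record):
        # Only create/append is supported; no update or delete in place.
        self._records.append(record)

    def scan_all(self):
        # The whole data set (or a large subset) is read sequentially.
        return list(self._records)

class KeyedTable:
    """HBase-style store: random get/put plus scans over sorted row keys."""
    def __init__(self):
        self._rows = {}
        self._keys = []  # kept sorted, like HBase's ordered row keys

    def put(self, key, value):
        # put doubles as update, mirroring HBase semantics.
        if key not in self._rows:
            insort_left(self._keys, key)
        self._rows[key] = value

    def get(self, key):
        # Random access to an individual record.
        return self._rows.get(key)

    def scan(self, start, stop):
        # Scan a contiguous range of sequential keys: start <= key < stop.
        i = bisect_left(self._keys, start)
        j = bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[i:j]]

# Tiny demo of each access pattern.
store = AppendOnlyFile()
for rec in ("r1", "r2", "r3"):
    store.append(rec)

table = KeyedTable()
table.put("row2", {"v": 2})
table.put("row1", {"v": 1})
table.put("row3", {"v": 3})
```

The point of the sketch is the shape of the API: the file store offers nothing between "append" and "read everything", while the keyed table makes individual records and key ranges cheap to reach.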
  •  Automate your environment setup  Use Puppet, Chef, Cloudera Enterprise Manager, etc.  Rely on the Hadoop ecosystem whenever possible. Configuration • See Mike Guenther’s lecture (CHUG archive) • Use the Cloudera docs • Configuration is a continuous process • Tune both the cluster and the application independently. • Don’t optimize your cluster for your application; optimize your application for your cluster. Plan your development iterations • Data storage model • ETL (loading data in/out of Hadoop) • Automate environment setup • Processing • Integration (interacting with other enterprise applications) • Reporting interface & diagnostics to show speed and utilization
  •  Understand the MapReduce model and patterns – read Jimmy Lin and Chris Dyer’s book Data-Intensive Text Processing with MapReduce. See if you really need reducers (they are expensive), and if you do, try to use combiners. Use a custom InputFormat if you need better control over the execution of maps. Programmatic writes to predetermined files might lead to unpredictable results. Use Oozie for orchestrating multiple MapReduce jobs. Use Oozie to start your jobs automatically when data arrives. Don’t be afraid to ask for help.
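The combiner advice above can be sketched without any framework. This is a framework-free Python model of the map, combine, shuffle, reduce flow; the function names (`map_fn`, `combine`, `reduce_fn`, `run_job`) are ours for illustration and are not Hadoop APIs. The combiner pre-aggregates each map task's output so fewer (key, value) pairs would cross the network during the shuffle.

```python
from collections import defaultdict

def map_fn(line):
    # Mapper: emit (word, 1) for every word in a line.
    for word in line.split():
        yield (word, 1)

def combine(pairs):
    # Combiner: locally pre-aggregate one mapper's output. For word count
    # the combiner can reuse the reducer's logic because sum is associative.
    acc = defaultdict(int)
    for key, value in pairs:
        acc[key] += value
    return list(acc.items())

def reduce_fn(key, values):
    # Reducer: final aggregation per key after the shuffle.
    return key, sum(values)

def run_job(lines):
    # Treat each line as one "map task": map, then combine locally,
    # then shuffle the combined pairs by key to the reducers.
    shuffled = defaultdict(list)
    for line in lines:
        for key, value in combine(map_fn(line)):
            shuffled[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in shuffled.items())

counts = run_job(["hadoop is a tool", "hadoop is not a panacea"])
```

A combiner is only safe when the reduce operation is associative and commutative (sums, counts, maxima); an averaging reducer, for example, cannot simply be reused as a combiner.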
  •  Be prepared to refactor your code many times. You often start wrong, but your goal is to end right. Tom White’s Hadoop book Lars George’s HBase book In addition to MapReduce, investigate additional Hadoop technologies (Pig, Hive, Flume, et al.) Be prescriptive; use only the technology you really need. Don’t forget about the community; they will be extremely helpful. See http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG/ [Shameless plug.]
  •  Unit-test the application and the interface Test Hadoop – report issues to Cloudera. Opening support tickets* is a life saver for new teams. (Cloudera offers support contracts.) Optimize your application, not the cluster End-to-end testing – it matters; it ensures confidence Performance testing – it’s one of the drivers of the project.  Make sure you test on realistic data volumes – results can be deceiving on smaller data sets.  Showcase the ability of the cluster compared to existing systems Consulting – have consultants look over your application, but do not outsource the implementation to them. Make sure you build internal knowledge. *Assumes that you have a corporate license…
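The unit-testing and realistic-volumes advice above can be sketched with a plain unit test. The `parse_record` function and its CSV layout are invented here purely for illustration; a real Java MapReduce project would test its own mappers and parsers (for instance with Apache MRUnit) rather than this stand-in.

```python
import unittest

def parse_record(line):
    # Toy parser for a hypothetical "id,timestamp,value" row.
    # Malformed rows return None so the job can skip them.
    parts = line.strip().split(",")
    if len(parts) != 3:
        return None
    try:
        return parts[0], float(parts[2])
    except ValueError:
        return None

class ParseRecordTest(unittest.TestCase):
    def test_good_row(self):
        self.assertEqual(parse_record("a1,2011-11-08,3.5"), ("a1", 3.5))

    def test_bad_rows_are_skipped(self):
        self.assertIsNone(parse_record("too,many,fields,here"))
        self.assertIsNone(parse_record("a1,2011-11-08,oops"))

    def test_larger_volume(self):
        # Tiny fixtures can hide problems; also run over many rows,
        # echoing the "test on realistic data volumes" advice.
        rows = [f"id{i},2011-11-08,{i}.0" for i in range(10_000)]
        parsed = [parse_record(r) for r in rows]
        self.assertTrue(all(p is not None for p in parsed))

suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParseRecordTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Testing the parsing and mapping logic outside the cluster keeps the feedback loop fast; the cluster is then only needed for integration, end-to-end, and performance runs.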
  •  SLAs – Not advisable for Hadoop Project #1 Involve Deployment & Operations personnel from the get-go; they will be supporting it Operations Team : • Hadoop Administration Training • Operations Team – Data Analysts & Users trained and involved with process as stakeholders Data Maintenance – The role of the DBA begins to change, existing DBAs should have interest in Hadoop Playbooks – should help address many Hadoop related issues without involving developers & architects UATs – use as needed and depending on methodology
  •  What worked well in the first project? What did not work? Ready to process Mission Critical Data? Begin to establish SLAs? Consider real-time data delivery? Ready to support enterprise data?
  • http://hadoop.apache.org/ (Apache Hadoop) http://www.cloudera.com/ (Cloudera) http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG/ (CHUG) Or find Mike, Boris, or Mark on LinkedIn
  • Appendix
  • • Scalability – A large data problem can be broken into many pieces processed in parallel by 10, 100, or 1,000 machines, all working toward a common goal. Adding more machines improves scalability. • Incredible performance – Hadoop holds the performance record for data processing (terabyte sort in 209 seconds – Yahoo). • Data integrity – Data is stored multiple times across nodes. • Separation of concerns – Developers need to write only business code – mappers and reducers. All infrastructure “heavy lifting” & job management is done by the framework.
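The scalability point above - split a large problem into pieces, process the pieces in parallel, combine partial results - can be sketched in miniature with a worker pool standing in for cluster nodes. This is purely illustrative: CPython threads do not give true CPU parallelism, whereas a Hadoop cluster runs the pieces on separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Stand-in for one node's work on its piece of the data.
    return sum(chunk)

# The "large" data set, split into 4 disjoint pieces (one per "node").
data = list(range(1_000_000))
chunks = [data[i::4] for i in range(4)]

# Process the pieces concurrently and combine the partial results,
# mirroring the map (per-chunk) and reduce (combine) phases.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_chunk, chunks))

total = sum(partials)
```

Adding more chunks and workers changes only the split, not the per-chunk logic, which is why the model scales by adding machines.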
  • • Yahoo – Content optimization, sorting, ad placement • Facebook – Largest Hadoop cluster, terabytes of insights processed per DAY; social email • LinkedIn – Computationally intensive operations for enterprise data: “People You May Know”, “Viewers of this Profile Also Viewed”, “Job Recommendations” • Groupon – Analytics and data mining on “extreme data” • Nokia – See http://www.cloudera.com/videos/apache-hadoop-nokia-josh-devins • For more companies see: http://wiki.apache.org/hadoop/PoweredBy
  • • Massive data storage – Ability to correlate seemingly disparate data; ability to store lots of historical data. • Computational power – Ability to run reports and ask questions that could previously not be asked – asking “golden questions”. • Throughput – Time to complete jobs allows even more “golden questions”. • “Golden questions” – change the game, drive profits, and positively disrupt businesses.
  • • Commodity resources – Nodes cost as much as a workstation; no specialized hardware. • Expenditures – No software purchases, no negotiations with vendors, no licensing headaches – free downloads. (For the initial PoC installation.) • Easily proved – A proof of concept can be executed in a virtualized environment or on a public cloud.