Harnessing the Power of Apache Hadoop


Rapidly, Reliably Operationalize Apache Hadoop and Unlock the Value of Your Data.

Published in: Technology
  • Many modern businesses have IT infrastructures made up of products that have been bolted on and cobbled together over the years. These systems rarely talk to each other, and when they do, it’s even less likely that the business is linking the data in a way that makes it valuable for them and for you. (CLICK - ANIMATED BUBBLES) You can find varying definitions of Big Data nearly everywhere you look. Who’s got it right, and why should you care? Potentially every definition has a piece of the truth – the one everyone can agree upon. (CLICK - ANIMATED RECTANGLE)
  • What types of answers are being derived from Big Data? {Talk about your sports car/train analogy – why people are turning to Hadoop} What is Hadoop good for? What’s a database good for? Hadoop is like the train: not as fast as the sports car, but it carries the whole load of answers to previously unsolvable problems. With Oracle and SQL, for example, you have speed, and some transactions have to happen quickly. But only Hadoop does Big Data. When Hadoop comes into a datacenter, it’s usually because the old systems cannot handle the volume and variety of data. Analytics and data processing are the two workloads that really matter. Nothing scales this big or is this flexible. Nothing else can handle this variety of data. When it was created, Hadoop seemed magical. What types of previously unsolvable problems are being answered? {Touch on a couple of these examples to provide context.}
  • {Set the stage with this slide – do not get into the nitty-gritty of installation.} This is a massively simplified view of how you go about installing Apache Hadoop. It is a manual process and requires a great deal of expertise, not to mention that Hadoop is now implemented as a “stack.” Installation is therefore no longer just a matter of deploying MapReduce and HDFS (i.e., “core Hadoop”). In fact, more than half (55%) of Hadoop users today have implemented 5 or more components (Hive, Pig, Zookeeper, etc.), and this number will continue to increase as the ecosystem grows. It’s the apps that run on top that are important. These analytical apps would be impossible without Hadoop. Per some of the use cases we just discussed, these apps help you find bad guys, help find fraud, etc.
  • {Refer to the installation steps from the previous slide and explain that, at a high level, a Hadoop deployment in production might look like this. Touch on some of the various ways customers are putting Hadoop into production. Transition with the cautionary tale, reiterated from the previous slide, that, while this slide might look simple, it is indeed a major undertaking to “go it alone.”}
  • As mentioned, installing Hadoop isn’t quite “that” simple. {CLICK – RECTANGLES FADE IN} People face challenges when deploying Hadoop on their own. We’re illustrating just some of the potential issues that could be encountered. {Talk to a couple of the challenges as they appear.} {CLICK – BLOCK FADES IN} One of the biggest challenges, along with having the expertise necessary to deploy Hadoop, is the length of time between the initiation of production and when you begin seeing value. This is one of the many reasons what we’re doing at Cloudera is so exciting. The empowerment we provide to the end user and the ability to see results faster over the entire lifecycle of the deployment are honestly unparalleled. Let me share some specifics about how a distribution like Cloudera’s Distribution Including Apache Hadoop (aka CDH) truly helps you rapidly and efficiently operationalize Hadoop.
  • {Why does it make sense to consider a distribution like CDH? Why is CDH a visionary distribution? Beyond the selling points below, really help the audience understand why CDH is beyond what they could do for themselves or by adopting anyone else’s distribution. Do not comment on competitor faults, but focus on where CDH is going and Cloudera’s commitment to the adoption of Apache Hadoop with a distribution like CDH.} {You are going to deep dive on CDH, SCM Express and Cloudera Enterprise in the next three slides, so hit these next comments at a high level.}
    {CLICK} The “packaging” steps are done for you – CDH is an integrated Apache Hadoop-based stack
    * Contains all of the components needed for production use
    * Tested and packaged to work together
    {CLICK} It’s completely free and open source
    * Nothing proprietary
    * Leverages the rapid innovation by a community of 500+ developers
    * Broad partner ecosystem – solution vendors are integrating with open source Apache Hadoop
    {CLICK} It streamlines your path to success with Hadoop
    * Simplified deployment with SCM Express
    * Wizard-based installation
    * Automated configuration based on best practices
    * Central management of services across the cluster
    * Free download from Cloudera.com
    {CLICK} Rapidly “operationalize” your Apache Hadoop cluster with Cloudera Enterprise
    * Get comprehensive visibility and control over your cluster with the Cloudera Management Suite
    * Leverage Cloudera’s expertise to consistently meet or exceed your SLAs with Cloudera Support
  • While Cloudera’s solutions might be the right fit for you, there are many decisions to make that warrant serious consideration before you actually deploy Hadoop. {Don’t get down into the weeds on these. Stay future-focused – why should people be thinking about these things? How does Cloudera help them arrive at the answers? What is Cloudera doing and where are they going that could make answering these questions easier?}
    What are the use cases?
    * Beyond core Hadoop, which components will you need?
    * Where are the integration points?
    * How much capacity/performance do I need?
    Will the cluster be a shared resource?
    * Which group(s) will be active on the cluster?
    * How will cluster resources be provisioned?
    * What level of security is required?
    What service levels do I need to maintain?
    * Do the service levels differ from group to group?
    How will we manage and operate the cluster?
    * What capabilities do I need?
    * What tools/applications are available?
    * Build or buy?
    How equipped is my organization?
    * What is the level of in-house Hadoop expertise?
    * How do I fill the gaps (hiring, training)?
    Who will provide advanced support?
    {CLICK} How do I manage all of this within my time and budget constraints?
    {As you close this slide, lead the audience away from deployment and into lifecycle management.}
  • So, let’s talk a bit more technically about what Cloudera brings to the table to get you running quickly and reliably pulling real value from all of your data. CDH is the foundation of your deployment because, as I mentioned, it contains all of the components needed for production use and is tested and packaged to work together:
    * Apache Hadoop – Reliable, scalable distributed computing
    * Apache Hive – SQL-like language and metadata repository
    * Apache Pig – High-level language for expressing data analysis programs
    * Apache HBase – Hadoop database for random, real-time read/write access
    * Apache Zookeeper – Highly reliable distributed coordination service
    * Apache Flume* – Distributed service for collecting and aggregating log and event data
    * Apache Whirr* – Library for running Hadoop in the cloud
    * Apache Sqoop* – Integrating Hadoop with RDBMS
    * Apache Oozie* – Server-based workflow engine for Hadoop activities
    * Fuse-DFS – Module within Hadoop for mounting HDFS as a traditional file system
    * Hue – Browser-based desktop interface for interacting with Hadoop
  • SCM Express is the key piece that helps ensure your rapid deployment. It can help you provision a complete Apache Hadoop stack in minutes, and it centrally manages system services through a user-friendly interface for up to 50 nodes. It is also available for free on cloudera.com. Let me explain some of the specific challenges people were facing that led us to offer this solution… Now, let me dive into some of the key features that will speed your deployment… {CLICK for each feature – 5 clicks}
  • If you really want to spend your time finding the meaningful answers buried in your data, you don’t necessarily want to be concerned about Hadoop. The easier it is for you to monitor and manage it over its entire lifecycle, the more you can focus on what matters to your organization – the things that keep you competitive, secure, relevant and productive. Getting tied up in the weeds of operational challenges can inhibit progress toward your organizational goals. {CLICK} Obviously, despite these challenges, we believe in the value of Apache Hadoop. We developed Cloudera Enterprise so that others can more easily recognize that value, not just in deployment efforts, but throughout its lifecycle in production.
  • So what exactly is it, and how exactly does it help you use Apache Hadoop in a way that really unlocks your data?
  • {You may want to point out here the depth of expertise on your staff – do not point at competitors’ lack of depth, but really exploit Cloudera’s expertise.}
  • Once in production, your operation will likely resemble this on the surface (a very basic illustration of a complex environment), and, over time, there will be a natural ebb and flow of your Hadoop usage. Aside from making your deployment easier, we care about this ebb and flow of the full Hadoop lifecycle. As workloads shift, your team evolves and the types of questions you want to ask change, you want your system to be able to easily shift up and shift down the resources you spend on these things. (Activity Monitor, for example, lets you build a picture of the general health and shifting workloads of the data, the speed of jobs, and what jobs ran in the past.) You need to be able to shift your clusters. You also need to be able to plan for expansion. We let you start, then operate, and then grow or sunset pieces of the cluster over time. This is a steady-state, long-term, ongoing platform that delivers value to admins. We let you not just get Hadoop, but use Hadoop long term.
  • If you would like to learn more about the Hadoop lifecycle, I encourage you to attend the next webcast in this series. Charles Zedlewski, our VP of Product, will explain how to plan for and manage the Apache Hadoop lifecycle inside a Cloudera deployment. {Make a mention of Hadoop World. Something simple along these lines: “Also, don’t forget that Hadoop World is coming up November 8th through the 9th. We encourage you to register now.”} In the meantime, I welcome questions on today’s presentation. We built in time to have a “discussion” on the topic of deploying Apache Hadoop.
  • Harnessing the Power of Apache Hadoop

    1. Harnessing the Power of Apache Hadoop
       Rapidly, Reliably Operationalize Apache Hadoop and Unlock the Value of Your Data
       Mike Olson | CEO
    2. Big Data – Who’s Got it Right?
       Exabytes/Petabytes
       Volume, Velocity, Variety
       When you derive value from ALL your data, you can find the answer to just about everything.
       Internal and 3rd Party Data
       Too big for traditional databases
       Massive amounts of unstructured data
    3. What Do Enterprises Derive from Their Data Using Hadoop?
       Vertical: industry term (advanced analytics) / industry term (data processing)
       Web: Social Network Analysis / Clickstream Sessionization
       Media: Content Optimization / Engagement
       Telco: Network Analytics / Mediation
       Retail: Loyalty & Promotions Analysis / Data Factory
       Financial: Fraud Analysis / Trade Reconciliation
       Federal: Entity Analysis / SIGINT
       Bioinformatics: Sequencing Analysis / Genome Mapping
    4. Steps to Install Apache Hadoop
       (1) Decide on required components
       (2) Download components
       (3) Install on desired nodes
       (4) Configure and start services
       (5) Test & optimize
       (6) Go live
    5. Apache Hadoop in Production
       How Apache Hadoop fits into your existing infrastructure.
       CUSTOMERS, ANALYSTS, BUSINESS USERS, OPERATORS, ENGINEERS
       Web Application, Management Tools, IDE’s, BI / Analytics, Enterprise Reporting
       Enterprise Data Warehouse, Operational Rules Engines
       Logs, Files, Web Data, Relational Databases
    6. Deploying Hadoop on Your Own
       CHALLENGES
       Select components based on use case
       Manage component versions and interoperability
       Deploying services across your cluster
       Ongoing configuration & mgt of cluster
       Support & meeting SLAs
       Ensure repeatable success
       Time to Value
    7. Why Consider CDH
       Integrated Hadoop-based Stack
       Free & Open Source
       Simplified Deployment (SCM Express)
       Operationalize with Cloudera Enterprise
    8. Top 5 Considerations for Operationalizing Apache Hadoop
       Managing all this within time and budget constraints
    9. Complete, Integrated Hadoop Stack
       CDH Components
       * Apache Hadoop
       * Apache Hive
       * Apache Pig
       * Apache HBase
       * Apache Zookeeper
       * Apache Flume*
       * Apache Whirr*
       * Apache Sqoop*
       * Apache Oozie*
       * Fuse-DFS
       * Hue
       File System Mount: FUSE-DFS
       UI Framework: HUE
       SDK: HUE SDK
       Workflow: APACHE OOZIE
       Scheduling: APACHE OOZIE
       Metadata: APACHE HIVE
       Languages / Compilers: APACHE PIG, APACHE HIVE
       Data Integration: APACHE FLUME, APACHE SQOOP
       Fast Read/Write Access: APACHE HBASE
       Coordination: APACHE ZOOKEEPER
       * Currently undergoing Incubation at the Apache Software Foundation.
    10. SCM Express
       Service & Configuration Manager (SCM) Express
       Removes the complexity from deploying and configuring CDH
       KEY FEATURES
       * Automated, wizard-based installation of the complete Apache Hadoop stack
       * Central, real-time dashboard for configuration management
       * Incorporates comprehensive validation and error checking
       * Ability to enact changes across the cluster while it’s running
       * Automates the expansion of services to new nodes when they come online
    11. Running Apache Hadoop in Production
       Why Consider Cloudera Enterprise
       * Distributed system presents unique operational challenges
       * Fixed cost of internal patch and release infrastructure is prohibitive
       * Skills and expertise are scarce
       * Tracking to community development efforts is challenging
       Production support backed by the Apache committers
       Management suite supports full lifecycle of operationalizing Apache Hadoop
       Has the depth of experience supporting hundreds of production Apache Hadoop clusters
    12. Cloudera Enterprise
       Includes the Cloudera Management Suite.
    13. Cloudera Enterprise
       And includes Cloudera Support.
    14. Cloudera Helps Enterprises Operationalize Hadoop
       Cloudera Services
       * Consulting Services
       * Cloudera University
       Cloudera Enterprise
       * Cloudera Management Suite
       * Cloudera Support
       Cloudera’s Distribution Including Apache Hadoop (CDH) & SCM Express
       CUSTOMERS, ANALYSTS, BUSINESS USERS, OPERATORS, ENGINEERS
       Web Application, Management Tools, IDE’s, BI / Analytics, Enterprise Reporting
       Enterprise Data Warehouse, Operational Rules Engines
       Logs, Files, Web Data, Relational Databases
    15. Join us!
       Next in the Harnessing the Power of Apache Hadoop series
       How to Manage your Apache Hadoop Lifecycle
       Charles Zedlewski, VP of Product
       August 4, 2011
       10:00 a.m. PDT (1:00 p.m. EDT)
       www.cloudera.com/company/events/
       Q & A
       Visit www.hadoopworld.com
       * Call for speaker submissions
       * Registration now open
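
The speaker notes name MapReduce and HDFS as “core Hadoop” without showing what a MapReduce job does. As a rough illustration only, the sketch below imitates the map, shuffle and reduce phases in plain, single-process Python. It is not Hadoop API code and is not taken from the deck; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample lines are hypothetical.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines (hypothetical helper)."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    # Made-up sample input; a real job would read its input from HDFS.
    sample = [
        "Hadoop stores data in HDFS",
        "MapReduce processes data in Hadoop",
    ]
    print(word_count(sample))
```

On a real cluster the map and reduce steps run in parallel across many nodes and the grouping happens over the network; this sketch only shows the shape of the computation.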