Delivering Insightful Analytics in a Data Driven World



Traditional data management and business intelligence are struggling under the pressure of increasing data volumes, diverse new data types, and a growing number of business users who need to make better data-driven decisions. Hadoop provides an affordable and scalable platform for storing, processing, and analyzing "big data", but it's just the beginning.

Join Datameer and Cloudera as we share what it takes to deliver successful analytic applications in the new world of big data.

This webinar will address the following questions:

What is Hadoop? What problems does it solve?
How do leading organizations get the most out of Hadoop?
How can you get started using Hadoop?

Published in: Technology

  • When setting the context for the emergence of Hadoop, it helps to start by characterizing how data has changed over the past three decades. This can be described fairly easily by four attributes, which you’ve probably heard referred to as the “4 V’s of Big Data”: Volume, Variety, Velocity and Value. When relational databases were originally developed, the data landscape looked very different along all four of these axes. But the rise of end-user computing, more sophisticated machines that constantly generate data, and the proliferation of mobile devices have led to a situation where organizations are drowning in data. Their systems are permanently overwhelmed. At the same time, a fundamental shift is taking place in how we look at and use data to drive decisions. Much more data is available to us now, so in addition to the operational reporting we’ve always done, there is keen interest in more free-form exploration: combining many different sources of data in much larger volumes to identify new trends, better serve customers, understand markets and create new data-driven products. This is the environment out of which Hadoop has emerged, and here’s why. (transition to next slide)
  • Prior to Hadoop, there wasn’t really any single system that could handle all four attributes of Big Data. Some handled volume, some handled variety, and none could really handle velocity all that well due to the “schema on write” design paradigm of the relational database. So as you look across many data management infrastructures, you’ll notice a pattern: a proliferation of specialized systems, each solving a piece of the overall problem. Compounding this is the fact that many organizations have different departments or lines of business with their own data management practices, so there are silos preventing anyone from getting a complete view of the business (car loan and home loan example).
    Having many specialized systems means:
    - Complexity: data is constantly moving from place to place, and it’s duplicated and fragmented.
    - Cost: a recent customer of ours calculated over a 10x spend increase per incremental terabyte to expand their current infrastructure vs. deploying Cloudera Enterprise.
    So when it comes to Big Data workloads specifically, by which I mean workloads that operate on diverse sets of data from many sources and involve heavy data processing and/or more free-form “exploratory” analysis, the promise of Hadoop (and our vision for the platform) is to deliver a central platform or “reservoir” where data from across the business (and outside the business) can be landed and acted upon. It destroys data silos and solves end-to-end problems: from data ingestion, to transformation, to exploratory analysis, and ultimately migration of clean, “productized” sets of data into specialized systems for reporting and operational analytics. So what is it about Hadoop that makes it game-changing enough to deliver on this vision? (transition to next slide)
  • The answer is pretty simple, really: this system was designed from the ground up to handle all four attributes of Big Data.
    - Volume: a distributed architecture scales cost-effectively. Storage and compute resources scale linearly as capacity and performance needs grow. Open source software running on industry-standard hardware changes the economics of data management.
    - Variety: store data in any format. Hadoop handles structured and unstructured data without the need to fit a single data model.
    - Velocity: load raw data and define “how you look at it” later. A flexible file system allows you to define data structures at query time and makes it much easier to adapt to changes in data types and sources. This removes the bottlenecks that are often associated with ingesting data from many sources and makes data available to add value to the business sooner.
    - Value: process data faster, ask any question. Distributed processing leverages data locality to bring drastic performance increases to processing workloads like ETL/ELT and machine learning. The “schema-on-read” approach, which allows you to define data structures at query time, enables you to ask any question.
  • [Pause for questions]
  • There are four main things you should consider as you “operationalize” Hadoop: the Hadoop platform, system management, support, and equipping your organization. In my remaining time, I’ll walk through these one by one. (transition to next slide)
  • Let’s start with the Hadoop platform itself. Hadoop is open source, so unlike commercial software, you have the option to simply pull down the bits and deploy it yourself. However, while that IS an option, we always recommend that customers select a commercial distribution of Hadoop such as our own CDH, as it will greatly accelerate your time to value and make ongoing maintenance much easier. But regardless of how you choose to deploy Hadoop, there are certain things you should keep in mind:
    - Functionality: What will the platform need to do in order to work effectively? There are a couple of angles to this. Hadoop is not just Hadoop: there are a number of other open source projects in the ecosystem designed to make "core Hadoop" work as part of an overall data management infrastructure, so you need to figure out which projects you need to make the system work in your environment. As an example, there are currently more than a dozen projects bundled into CDH and, on average, our customers use at least 5 of those for things like streaming data into the system, integrating with relational databases and BI tools, and coordinating jobs and processes. You also need to consider features that will allow the platform to comply with your standards and best practices, and use projects that conform to those. Examples are security, audit controls, high availability and business continuity.
    - Cohesiveness: How unified are the different components of the system? One of the tricky things about open source is that there is no global product manager sitting over the Hadoop ecosystem making sure all the projects conform to a certain set of standards for logging, documentation, versions of libraries, etc. But at the end of the day, you want the product to feel like a single product and not 14 different products that have been stitched together. This is another great advantage of using a commercial distribution like CDH: we go to great lengths to integrate the different projects and make them behave more similarly than they would otherwise.
    - Stability & Predictability: What are the standards for testing and documentation? What is the cycle of updates and new releases? There is a reason enterprises consume commercial software: there are set processes and standards for testing, documenting and releasing that are well known and followed, so when you buy commercial you know what you’re getting. Those things are challenging with open source. Again, each project has its own development community and governing body, and there aren't really any global standards or coordination of development efforts. I may be sounding like a broken record, but this is another reason to go with a distro: there are standards for QA, there's documentation that covers not only each individual component but also how they work together, and there are predictable schedules for "platform" releases so that you can plan for updates and upgrades.
  • Once you’ve made your platform decision, the next thing to consider is how you will manage it, and there are three things to think about there:
    - Manage Complexity: Hadoop is more than a dozen services running across many machines, with hundreds of hardware components, thousands of settings and limitless permutations. So you need something that is going to simplify that complexity. Ideally, you'd want a single tool, something like Cloudera Manager, that provides end-to-end administration: deployment, service and host management, monitoring, alerting and diagnostic tools. Essentially, you want to make sure that whether you're managing 10 nodes with 3 services or 500 nodes with a dozen services, the experience is the same.
    - Provide "Hadoop" Context: Hadoop is a system, not just a collection of parts. Everything is interrelated. Raw data about individual pieces is not enough; you must extract what’s important.
    - Maximize Efficiency: simplify complicated, error-prone workflows, with built-in safeguards and error checking, and diagnostic tools.
  • One other thing to consider is support. You have two alternatives here: you can self-support or go with a commercial vendor. Since Cloudera provides technical support, obviously we see it as a huge advantage, and I wanted to take a minute to explain the benefits:
    - Extend your team
    - Leverage experts
    - Influence roadmaps
  • The last thing to consider is how you will equip your organization to manage and use Hadoop once it goes into production. There are three things to consider here, and all of them are things that Cloudera’s training and professional services teams can help you with:
    - Train your team: developers, operators, data analysts.
    - Evaluate and prove value: use case discovery, proof-of-concept, and ultimately production deployment.
    - Establish a Center of Excellence: How do you expand your cluster beyond a single use case and ultimately operate it as a strategic platform that’s used across the business? Build data ingestion pipelines. Construct an information architecture. Enact standards and processes: access controls and resource allocation, supported applications and platforms, maintenance and upgrades.
  • So to summarize, as you're looking to design and deploy your Hadoop infrastructure, it's important to keep in mind that Hadoop is not a black box; it's part of an overall data management strategy. You must integrate it with your current infrastructure as well as the processes and teams that help you turn data into information that drives decisions. So you need to make sure you have all your bases covered, from the platform itself, to the management tools, to the equipping of your team. And this is really the focus of our business at Cloudera: we want to be your partner, providing not only the platform, but also the management, training and services you need to ensure that Hadoop becomes a technology that adds incredible value to the business.
  • [Pause for questions and turn over to Matt]
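
As a rough illustration of the “schema-on-read” idea from the speaker notes above (load raw data now, define “how you look at it” later), here is a minimal Python sketch. It is not Hadoop code: the sample records, field names and helper functions are all hypothetical, and it only shows the general pattern of applying a structure at query time rather than at load time.

```python
import json

# Hypothetical raw events, landed as-is with no upfront schema.
RAW_LINES = [
    '{"user": "a1", "amount": "19.99", "ts": "2013-01-05", "channel": "web"}',
    '{"user": "b2", "amount": "5.00", "ts": "2013-01-06"}',
    '{"user": "a1", "amount": "42.50", "ts": "2013-01-07", "channel": "mobile"}',
]


def read_with_schema(lines, schema):
    """Apply a schema (field name -> converter) while reading, not while writing."""
    for line in lines:
        record = json.loads(line)
        # Fields missing from a raw record become None instead of failing the load.
        yield {
            field: convert(record[field]) if field in record else None
            for field, convert in schema.items()
        }


# Two different "views" of the same raw data, each defined at query time.
SPEND_SCHEMA = {"user": str, "amount": float}
CHANNEL_SCHEMA = {"user": str, "channel": str}


def total_spend_by_user(lines):
    """Question 1: how much has each user spent?"""
    totals = {}
    for row in read_with_schema(lines, SPEND_SCHEMA):
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return totals


def channels_seen(lines):
    """Question 2: which channels appear in the data at all?"""
    rows = read_with_schema(lines, CHANNEL_SCHEMA)
    return sorted({row["channel"] for row in rows if row["channel"] is not None})


if __name__ == "__main__":
    print(total_spend_by_user(RAW_LINES))
    print(channels_seen(RAW_LINES))
```

Both questions run against the same untouched raw lines. A new question only needs a new schema dictionary, not a reload of the data, which is the flexibility the notes attribute to defining data structures at query time.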

    1. Agenda: (1) Hadoop Overview (Drew); (2) Getting Started: Operating Hadoop (Drew); (3) Getting Started: Use Cases & Demo (Matt)
    2. The Progression of Data (THEN → NOW). VOLUME: GB → PB. VARIETY: Structured → Structured + Unstructured. VELOCITY: Trickle → Torrent. VALUE: Operational Reporting → Reporting + Data Discovery.
    3. The Promise of Hadoop. LEGACY: multiple platforms (complex, fragmented, costly). NEW: a single data platform (simplified, unified, efficient).
    4. Apache Hadoop: A Revolutionary Platform for Big Data. INGEST, STORE, EXPLORE, PROCESS, ANALYZE, SERVE. VOLUME: Distributed architecture scales cost-effectively. VARIETY: Store data in any format. VELOCITY: Load raw data and define “how you look at it” later. VALUE: Process data faster, ask any question.
    5. Agenda: (1) Hadoop Overview (Drew); (2) Getting Started: Operating Hadoop (Drew); (3) Getting Started: Use Cases & Demo (Matt)
    6. Operational Considerations: (1) Hadoop Platform; (2) System Management; (3) Support; (4) Equipping Your Organization
    7. Operational Considerations: Hadoop Platform • Functionality • Cohesiveness • Stability & Predictability. CLOUDERA ENTERPRISE: CDH4
    8. Operational Considerations: System Management • Manage Complexity • Provide “Hadoop” Context • Maximize Efficiency. CLOUDERA ENTERPRISE: CLOUDERA MANAGER
    9. Operational Considerations: Support • Extend Your Team • Leverage Experts • Influence Roadmaps. CLOUDERA ENTERPRISE: CLOUDERA SUPPORT
    10. Operational Considerations: Equipping Your Organization • Train Your Team • Evaluate and Prove Value • Establish a COE. CLOUDERA UNIVERSITY & PRO SERVICES
    11. Hadoop in Production (diagram). Callouts: Training; Professional Services; System Management; Support. Users: Operators, Engineers, Analysts, Business Users, Customers. Tools: Management Tools, IDE’s, BI / Analytics, Enterprise Reporting, Web Application. Systems: Enterprise Data Warehouse, Hadoop Platform, Operational Rules Engines. Data sources: Logs, Files, Web Data, Relational Databases.
    12. Agenda: (1) Hadoop Overview (Drew); (2) Getting Started: Operating Hadoop (Drew); (3) Getting Started: Use Cases & Demo (Matt)