Realizing the Promise of Big Data with Hadoop - Cloudera Summer Webinar Series | Forrester

Apache Hadoop, an open-source platform, is increasingly gaining adoption within organizations trying to draw insight from all the big data being generated. Hadoop, and a handful of open-source tools that complement it, are promising to make gigantic and diverse datasets easily and economically available for quick analysis. A burgeoning partner ecosystem is also essential to helping organizations turn big data into business value.

  • http://www.flickr.com/photos/ychi2010/6769591849/sizes/m/in/photostream/ For decades, companies have been making decisions based on transactional data stored in relational databases. Beyond that data is a potential treasure trove of non-traditional, less structured data that can be mined for useful insight. Decreases in the cost of storage and compute power have made it feasible to collect this data, which would have been thrown away only a few years ago. As a result, more and more companies are looking to include non-traditional yet potentially valuable data with their traditional enterprise data in their analysis processes.
  • Data science involves looking at data differently. Rather than creating a uniform schema (rows and columns), tools like Hadoop give data scientists the flexibility to store data in a format that fits the question they are trying to answer. This requires a flexible underlying system: one that can store and process any type of data, starting with its original raw format, and that lets scientists transform the data and apply a schema to suit the particular problem. Data scientists use storage formats that are compact, fast to read and write, and accessible from a wide variety of languages. We use libraries such as Avro, which give us flexibility in how we structure and process data.
  • Standard pitch from CDH4 launch… When we talk about bringing Hadoop to the enterprise, there are six essential characteristics or areas that we focus on. High Availability: most customers want to use Hadoop to power mission-critical applications and workflows, so the system must run with maximum uptime to keep all data and processes available to the business. Granular Security: enterprises require the ability to secure sensitive data types as well as control who has access to system resources and when. Cloudera works with the open source community to build these capabilities into the platform and provides simple configuration and enforcement through our management application. Robust Management: Hadoop is a distributed system with many moving parts, so centralized management is critical for successful implementation. Scalable and Extensible: one of the great things about Hadoop is its massive scalability. We want to make it easy for you to take advantage of this by integrating your applications with the platform. Certified and Compatible: enterprises have invested significant amounts of time and money in their existing infrastructure (data warehouses, BI applications, etc.). We want to make sure that Hadoop integrates seamlessly with those technologies. Global Support and Services: as Hadoop becomes a critical component of the data management infrastructure, we want to empower our customers to meet stringent service level agreements and build out their own Hadoop workforce.
  • Hadoop is an open-source framework for running applications on large clusters of commodity hardware. As a result, it delivers enormous processing power and the ability to handle a very large number of concurrent tasks and jobs, making it a remarkably low-cost complement to traditional enterprise data infrastructure. Organizations use Hadoop in five ways: 1) as a staging area for the data warehouse and analytics store, 2) for initial discovery and analysis, 3) for storage and analysis of unstructured and semi-structured content, 4) to make total data available for analysis, and 5) for low-cost storage of large data volumes. With traditional database and data analytics tools, information is stored in neat rows and columns, and there are limits to how much data you can juggle and how quickly. The Hadoop Distributed File System provides an environment for massively parallel processing against large amounts of data. Hadoop changes the dynamics of large-scale computing: you can distribute raw data across a vast cluster of low-cost machines and process that data in the same place you store it. The result is that you can store all your data and analyze it as needed. This is a paradigm shift: merging the power of analytics with the power of Hadoop data storage and processing to get better answers faster. It will significantly improve an organization’s ability to assimilate vast data assets and give it the compute and analytical power to tackle problems and opportunities it never thought possible. As businesses become more analytical to gain competitive advantage and comply with new regulations, enterprise data warehouses are pushed to answer more ad hoc questions from more people analyzing vastly larger volumes of data, often in real time. Hadoop and next-generation analytic platforms are fundamental building blocks of the architecture needed to compete effectively in a data-driven world. Hadoop is the next wave of strategic enterprise information management.
THE ‘BIG DATA’ SHIFT: “Big Data analysis is usually iterative: you ask one question or examine one data set, then think of more questions or decide to look at more data. That’s different from the ‘single source of truth’ approach to standard BI and data warehousing.” (PwC 2010 Technology Forecast)
BRINGS STORAGE AND COMPUTATION TOGETHER IN A SINGLE SYSTEM: PROCESS & ANALYZE DATA IN PLACE; REMOVE NETWORK BOTTLENECKS; ELIMINATE DATA MIGRATIONS.
WORKS WITH EVERY TYPE OF DATA, IN ITS NATIVE FORMAT: NO NEED TO FIT A SINGLE SCHEMA; NOTHING LOST THROUGH ETL; LOOK AT ALL YOUR DATA FOR A COMPREHENSIVE VIEW.
CHANGES THE ECONOMICS OF DATA MANAGEMENT: OSS + COMMODITY HARDWARE; KEEP EVERYTHING ONLINE; SUPERCOMPUTING FOR EVERYONE.
  • Hadoop is not a single entity. It is a rich, complex, and evolving ecosystem of multiple open source products from Apache. In addition, the ecosystem expands almost daily as more open source and vendor products support or extend Hadoop products and technical approaches. We are a platform company. Within our partner ecosystem you get everything you need to leverage big data. Hadoop is now a first-class citizen in the enterprise IT department. With so many key IT vendors “attaching to Hadoop” via the Cloudera Connect program, the penetration of Hadoop-related technologies into the heart of the enterprise analytics environment is accelerating. Coordinating your traditional and Big Data processes takes a vendor that understands both the legacy and the modern approach to data processing. Cloudera is differentiated by its combination of platform + methodology + ecosystem (methodology = data computing).
  • The possibilities of big data continue to evolve rapidly, driven by innovation in the underlying technologies, platforms, and analytic capabilities for handling data, as well as by changing user behavior as more and more individuals live and work digitally. To become a “data-driven” organization that competes on data, the business must make better use of data as it moves through daily operations, which demands a radical rethinking of traditional data warehousing and transaction processing. Hadoop leverages several resources that have been outside the information architectures we have today. It brings in new programming languages, new skills, and new data, and it is being deployed as a new platform. Think about how it can extend and supplement the way we leverage information; the result is synergistic if we put the pieces together right. What is possible now that so many of the constraints are removed?
  • Business challenges: We need to use all the data we collect to help our customers. Small reductions to time in AR lead to big savings and better cash flow. Relay has an existing suite of analytics products, but we always want to do more; this means keeping data at much higher fidelity. Regulatory challenges: we need to store these transactions to meet regulatory compliance.
  • Storage of transaction data: millions of transactions per day; thousands of files coming in, as well as data flowing through web service and direct connection requests. Storage of log data: on average, over 150 GB of log data is collected per day; the data is used for troubleshooting customer issues and may be used 30 to 60 days after it is collected.
  • A project is in place to meet the business requirement around storage and retrieval of data. We looked at traditional solutions. Database: too costly, and would not allow for easy indexing of files. File system: using enterprise standards (lots of CPUs and SAN), this proved to be untenable when searching. Hybrid (file system + Solr): not investigated very thoroughly, as there were issues around working with that volume of data.
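
The notes above describe keeping data in its raw form, applying a schema only when a question is asked, and then aggregating in a map/reduce style. A minimal single-machine sketch of that idea follows, written in plain Python rather than against Hadoop itself. The log line layout, the field names, and the `parse_line`, `map_phase`, and `reduce_phase` helpers are all hypothetical and invented for illustration; none of them come from the presentation or from any RelayHealth or Cloudera code.

```python
from collections import defaultdict

# Hypothetical raw log lines, kept exactly as received (no up-front schema).
RAW_LOGS = [
    "2012-08-01T10:00:00 customer=acme status=OK bytes=512",
    "2012-08-01T10:00:05 customer=zenith status=ERROR bytes=128",
    "2012-08-01T10:00:09 customer=acme status=ERROR bytes=256",
    "malformed line without fields",
]


def parse_line(line):
    """Apply a schema at read time: a timestamp followed by key=value fields."""
    parts = line.split()
    fields = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
    if "customer" not in fields or "status" not in fields:
        return None  # the raw line stays stored; this question just skips it
    return {"ts": parts[0], **fields}


def map_phase(lines):
    """Emit (customer, 1) for every record whose status is ERROR."""
    for line in lines:
        record = parse_line(line)
        if record and record["status"] == "ERROR":
            yield record["customer"], 1


def reduce_phase(pairs):
    """Sum the emitted counts per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


errors_per_customer = reduce_phase(map_phase(RAW_LOGS))
print(errors_per_customer)
```

Because the schema lives in `parse_line` rather than in the stored data, a different question (for example, bytes transferred per customer) would only need a different map function over the same raw lines.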

Transcript

  • 1. REALIZING THE PROMISE OF BIG DATA WITH HADOOP. Noel Yuhanna - Forrester; Omer Trajman - Cloudera; Jeremy Dyer & Marty Smith - RelayHealth
  • 2. Hadoop and Big Data. Noel Yuhanna, Principal Analyst. © 2009 Forrester Research, Inc. Reproduction Prohibited 2012
  • 3. Enterprises have 100s of terabytes or petabytes of data, but most of it is unused… Unused data is a valuable asset and should be leveraged! © 2012 Forrester Research, Inc. Reproduction Prohibited
  • 4. Big Data - Problem or opportunity? Big data presents serious challenges: it strains the current limits of IT infrastructure and resources; it requires an upgrade across the stack (storage, compute). A huge opportunity exists with big data: improve operational efficiency; offer new insights that can provide competitive advantage; deliver advanced, predictive analytics with more precision; support activities and analysis that generate revenue and bring businesses closer to their customers much faster.
  • 5. Big Data requires a new approach to data processing and analytics. Organizations need to be able to: process any data at any given time; manage very large data sets that run into 100s of TB and PBs; process data economically; integrate with many sources of data; support predictive analytics and a self-service data management platform.
  • 6. What is Hadoop and how can it help? Open source software that enables distributed parallel processing of large amounts of data across low-cost commodity servers. It leverages an extensible framework for building advanced analytics and new data management capabilities. It’s already being commercialized and adopted rapidly in enterprises. [Diagram labels: Large amounts of data; Hadoop; Flexible; Distributed processing; Economical; Scalable; Open Source; Insights]
  • 7. How are organizations adopting Hadoop? Hadoop adoption: current adoption estimate is 20%, seen mostly in mid-sized to large organizations; adoption is likely to double through 2016; adoption is seen across all vertical industries with various use cases; many organizations are currently doing a POC/sandbox with the Hadoop platform. How Hadoop will evolve in organizations: it will start out as an independent project focusing on priority analytics; it will start to integrate with existing systems, apps, and databases; it will embed seamlessly into data management and analytical platforms; Hadoop will become the data platform delivering self-service capabilities.
  • 8. How to get going on the Big Data journey. Big Data is here to stay! Hadoop is here to stay! Hadoop should be part of your data management and BI strategy. Integrate Hadoop with existing data management, databases, and apps. Hadoop can help save money and deliver new insights and possibilities. Don’t limit yourself to structured data only. A big data initiative is not a one-time project; it’s an ongoing process.
  • 9. CLOUDERA: THE STANDARD FOR APACHE HADOOP IN THE ENTERPRISE. OMER TRAJMAN, VP CUSTOMER SOLUTIONS
  • 10. “YOU CAN’T SOLVE 21ST CENTURY PROBLEMS WITH 20TH CENTURY TECHNOLOGIES”
  • 11. HOSPITALS NEED MORE COMPREHENSIVE PATIENT INFORMATION. BANKS MUST DETECT FRAUD FASTER. BROADCAST NETWORKS WANT TO DELIVER CUSTOMIZED CONTENT BY HOUSEHOLD. AIRLINES WANT TO UPDATE FLIGHT PRICES IN REAL-TIME. POWER COMPANIES WANT TO SAVE CUSTOMERS MONEY BY ANALYZING USAGE DATA. OIL COMPANIES WANT TO PREDICT THE LOCATION OF DEPOSITS MORE ACCURATELY. RETAILERS WANT TO CREATE MORE TARGETED OFFERS TO CUSTOMERS. PARTICLE PHYSICISTS WANT REAL-TIME DATA FROM THE HADRON COLLIDER.
  • 12. SCIENTIFIC APPROACH TO DATA REQUIRES… STORAGE FORMATS; FLEXIBILITY; EXTENSIBILITY; COMPACT STORAGE; FAST LOAD/STORE; WIDELY SUPPORTED
  • 13. SIX CHARACTERISTICS OF ENTERPRISE-GRADE HADOOP. 1 HIGH AVAILABILITY: THERE’S NO DOWNTIME. YOUR DATA IS ALWAYS AVAILABLE FOR DECISIONS. 2 GRANULAR SECURITY: PROCESS AND CONTROL SENSITIVE DATA WITH CONFIDENCE. 3 ROBUST MANAGEMENT: ACHIEVE OPTIMAL PERFORMANCE VIA CENTRALIZED ADMINISTRATION. 4 SCALABLE AND EXTENSIBLE: ADAPTS TO YOUR WORKLOAD AND GROWS WITH THE BUSINESS. 5 CERTIFIED AND COMPATIBLE: EXTEND AND LEVERAGE EXISTING INFRASTRUCTURE INVESTMENTS. 6 GLOBAL SUPPORT AND SERVICES: ACHIEVE SLAs AND ADHERE TO EXISTING IT POLICIES.
  • 14. HADOOP PROVIDES A DATA HUB FOR ALL BIG DATA WORKLOADS • Brings storage and computation together in one single system • Works with every type of data in its native format • Changes the economics of data management
  • 15. APACHE HADOOP CO-EXISTS WITH EDW, ETL & BI TOOLS. [Architecture diagram labels. Users: Operators; Engineers; Analysts; Business Users; Customers. Tools: Enterprise Management; IDEs; BI / Analytics Tools; Enterprise Reporting; Web Application. Platforms: Enterprise Data Warehouse; Cloudera’s Distribution Including Apache Hadoop (CDH) & Cloudera Manager Free Edition; Operational Rules Engines. Data sources: Logs; Files; Web Data; Relational Databases. Cloudera Enterprise: Cloudera Manager; Cloudera Support. Cloudera Services: Consulting Services; Cloudera University.]
  • 16. CLOUDERA’S PARTNER ECOSYSTEM: WIDEST INTEGRATION. All the industry leaders develop on CDH. CDH4 (STORAGE, COMPUTATION, ACCESS, INTEGRATION): Big Data storage, processing and analytics platform based on Apache Hadoop, 100% open source. Partner categories: BI / Analytics; Data Integration; Database; OS / Cloud / Sys Mgmt; Hardware.
  • 17. REDEFINE WHAT’S POSSIBLE WITH YOUR DATA
  • 18. Why Hadoop, Why Cloudera, Why Now? Agenda ✛ RH overview ✛ What is our need ✛ Why our system/data is complicated ✛ How Hadoop meets our needs
  • 19. McKesson Corporation ✛ Largest healthcare company in the world: $103+ billion in revenues; Fortune 15; S&P 500; Est. 1833; Headquarters: San Francisco ✛ Business: Distribution Solutions; Technology Solutions ✛ Extensive resource base: 32,000+ employees solely dedicated to healthcare ✛ Comprehensive array of solutions: significant value through a single relationship ✛ Broadest customer base in healthcare: experienced partners in improving healthcare
  • 20. Overview of Financial Solutions. 200,000 Physicians; 2,000 Hospitals; 1900 Payers / Health Plans. Provider-to-Payer Interactions. Total Interactions: 2.4 Billion/Year
  • 21. Business Challenges ✛ Help customers save money ✛ Small reductions to time in AR lead to big savings, better cash flow ✛ Meet regulatory challenges > Must store 7 years of transactional data
  • 22. What Big Data Means to RelayHealth Every single day: + millions of transactions generated + thousands of files received + 150GB+ log data collected …to be stored for 7 years
  • 23. Why RelayHealth Considered Hadoop ✛ Business requirement around data storage & retrieval ✛ Looked at traditional solutions. Database: $$$; not easy to index files. File System: untenable when searching. Hybrid (File System + Solr): not scalable.
  • 24. Achieving Operational Efficiency with Hadoop & Cloudera ✛ Why Hadoop? > Store billions of files across machines > Mine data in files using M/R > Aggregate log data & search through it using unique customer identifying information > Store data in its highest fidelity state ✛ Why Cloudera? > Core Apache Hadoop leveraging OSS community > Integration with other open source solutions: HBase, Solr, Camel > Committer level knowledge of code & how it works > World-class support > Cloudera Manager
  • 25. Changing Perception ✛ Simple archive vs. a way to share data across the organization ✛ Building the ability to collect data flowing through our system at all points needed ✛ Integrating CDH into the rest of the enterprise > Storing data in its highest fidelity state > Moving away from traditional warehousing systems > Ability to distill data in the cluster for mining in other systems (CDH connectors)
  • 26. Summary ✛ Challenge: ✛ Shorten healthcare providers’ payment cycles via streamlined message processing ✛ RDBMS can’t keep up with growing data volumes + data storage mandates for regulatory compliance ✛ Solution: ✛ Hadoop: scalable, flexible data processing & analysis on multi-structured data ✛ Cloudera Enterprise, adding expertise, support & management tools to open source Hadoop
  • 27. Q&A
  • 28. THANK YOU! REGISTER NOW FOR THE REMAINING ‘POWER OF HADOOP’ WEBINARS: (1) WHAT THE HADOOP: WHY YOUR BUSINESS CAN’T AFFORD TO IGNORE THE POWER OF HADOOP. GIGAOM PRO AND CLOUDERA. WEDNESDAY, AUGUST 29, 10AM PST. (2) THE BUSINESS ADVANTAGE OF HADOOP: LESSONS FROM THE FIELD. 451 RESEARCH AND CLOUDERA. THURSDAY, SEPTEMBER 26, 10AM PST.
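
The figures on slide 22 (150GB+ of log data per day, to be stored for 7 years) can be turned into a rough capacity estimate. The sketch below does that arithmetic for the log stream alone, under two assumptions that are not stated in the deck: leap days are ignored, and the data is stored with HDFS's default replication factor of 3. Since 150 GB is given as a lower bound and transaction files are not counted, the result is a floor, not a sizing recommendation.

```python
# Figures from slide 22: at least 150 GB of log data per day, kept for 7 years.
LOG_GB_PER_DAY = 150
RETENTION_YEARS = 7
DAYS_PER_YEAR = 365  # assumption: leap days ignored

# Assumption: HDFS's default replication factor of 3. The deck does not say
# what RelayHealth actually configured.
REPLICATION_FACTOR = 3


def retained_gb(gb_per_day, years, days_per_year=DAYS_PER_YEAR):
    """Total volume accumulated over the retention window, in GB."""
    return gb_per_day * days_per_year * years


raw_gb = retained_gb(LOG_GB_PER_DAY, RETENTION_YEARS)
replicated_gb = raw_gb * REPLICATION_FACTOR

# Decimal units: 1 TB = 1,000 GB and 1 PB = 1,000,000 GB.
print(f"raw log data:     {raw_gb / 1000:.1f} TB")
print(f"with replication: {replicated_gb / 1_000_000:.2f} PB")
```

Under these assumptions the log stream alone comes to roughly 383 TB raw, or about 1.15 PB once replicated three ways, which gives a sense of why the speakers describe traditional databases and SAN-backed file systems as too costly or untenable.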