Frank Ketelaars & Robert Hartevelt, IBM - Big Data - BI Symposium 2012

Notes for slides:
  • Big data has become a business issue, or at least an issue that business people are aware of. Look at the coverage it's getting in the business press: from the Wall Street Journal ("Companies are being inundated with data") to the Financial Times ("Increasingly, businesses are applying analytics to social media such as Facebook and Twitter") to Forbes ("Big Data has arrived at Seton Health Care Family"). Why is it getting this type of coverage? Because it has the potential to profoundly affect the way we do business. The quote on CNBC really exemplifies this: "Data is the new oil." Data is a natural resource that keeps growing. Like any resource, it is difficult to extract; it comes in many types – a huge variety; and it is difficult to refine, or analyze. Many organizations do not even tap into this natural resource: they ignore data, or use it for a single purpose, largely because it is difficult to structure and restructure for different purposes. But some organizations have cracked the code. They have figured out how to process and analyze the data available to them, and they are using it to achieve breakthrough outcomes. If data is a natural resource, what is your company doing to capitalize on it?
  • Big data has four key characteristics. The first is volume. This may seem obvious, but it is more complex than you might think. Yes, the volume of data is growing: experts predict that the volume of data in the world will grow to 25 zettabytes by 2020, and that same phenomenon affects every business – their data is growing at the same exponential rate. But it isn't just the volume of data that is growing; the number of sources of that data is growing too, which leads to the third characteristic of big data, variety, covered below. The second characteristic is velocity: the speed at which data is created and integrated keeps accelerating. We've moved from batch to a real-time business. Data comes at you at a record or byte level, not always in bulk, and the demands of the business have increased as well – from an answer next week to an answer in a minute. The world is also becoming more instrumented and interconnected, and the volume of data streaming off those instruments is exponentially larger than it was even two years ago. Variety presents an equally difficult challenge: the growth in data sources has fuelled the growth in data types. In fact, 80% of the world's data is unstructured, yet most traditional methods apply analytics only to structured information. And finally we have veracity: how can you act upon information if you don't trust it? Establishing trust in big data presents a huge challenge as the sources and variety grow.
  • Big data will impact every aspect of your business, or at least it has the potential to do so. The primary areas of business impact are listed on this slide. Knowing everything about your customers has been the holy grail from CRM to customer portals to master data management. What has held those systems back from 'knowing everything'? The variety of data is one aspect: if 80% of data is unstructured, then it stands to reason that 80% of the data about your customers is unknown to those systems. Zero-latency operations – the ability to analyze streaming data from instrumented operational systems, or to deeply analyze inventory and supply chain operations – yields incredible insights and results. Organizations already have the data they need to optimize; the only thing holding them back is the technology to analyze big data. Product innovation: few companies do it well, yet it can be the key difference-maker in winning or losing a market. What if most companies could level the playing field with the "R&D giants"? What if they could analyze market data from social media to identify and capitalize on key market trends? What if they could analyze their product, service, and sales data using a variety of analytical techniques? Instant awareness of fraud and risk means looking for a rare event and finding it in time to prevent it – a big data problem of volume and velocity, and sometimes variety. Some organizations are realizing value by analyzing more data to develop better fraud models, or by finding the rare occurrence by literally looking everywhere; others are analyzing data in motion to detect potential events and take appropriate action. And finally, many organizations are exploiting instrumented assets: oil and gas companies are pursuing predictive diagnostics on remote oil rigs, and new smart buildings monitor heating and cooling systems to lower costs. The opportunities are as endless as the data streaming off those instrumented assets.
  • Business-centric big data is about identifying a business problem, or a pain point, and then applying the appropriate technology to address it. Too often, big data starts as a research or IT project in search of a business problem; successful implementations start the other way around. Getting started in the right place is crucial: because you are demonstrating a new technology (big data) to your organization, the success of that first project may determine how widely it is adopted. The most common pain points are:
    – Unlock big data – quickly get a view of, and understand, big data sources
    – Analyze raw data – ingest and analyze data in its native format
    – Simplify your warehouse – optimize your warehouse by offloading deep analytics tasks to purpose-built appliances
    – Reduce cost with Hadoop – offload workloads and data sets to Hadoop for cost-efficient processing
    – Analyze data in motion – harness streaming data and analyze it
    These five pain points lead to entry points in the big data platform; each requires a different set of big data capabilities to get started, and each will lead to new use cases and new big data capabilities over time.
  • Let's first look at unlocking big data. The customer need is to understand existing data sources without moving any of the data – to discover, navigate, view, and search big data in a federated manner. One customer was able to get up and running in a few months to search and navigate big data across many existing sources. This type of implementation can yield significant business value, from cutting manual effort to search and retrieve big data, to gaining a better understanding of existing sources before further analysis. The payback period is often short. Customer example: Procter & Gamble …. The entry point in the big data platform is Vivisimo Velocity (now IBM Data Explorer) – it enables federated search and navigation.
  • Next we have a pain point around analyzing raw data. The primary need is to analyze unstructured, or semi-structured, data from one or multiple sources. Often the content is textual, and the meaning is hidden within the text. Another common need is to combine different data types – structured and unstructured – for combined analysis. Customers often gain significant value from this approach: they unlock insights that were previously unknown, and those insights can be the key to retaining a valuable customer, identifying a previously undetected fraud, or discovering a game-changing efficiency in operational processes. One client, a financial services regulatory organization, analyzed a variety of new data sources and integrated the insights with their existing data warehouse to further enhance their risk modeling processes. The big data platform entry point is InfoSphere BigInsights, a Hadoop-based analytics system. A small sketch of the structured-plus-unstructured idea follows.
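To make the combined analysis concrete, here is a minimal, hypothetical Java sketch – not from the deck, and far simpler than real text-analytics tooling such as BigInsights' – in which a crude keyword signal is pulled from free-text call-center notes and joined with a structured customer record. All names and values are invented for illustration.

```java
// Illustrative only: joining an unstructured text signal with structured data.
import java.util.Map;

public class CombinedAnalysis {
  // Structured side: customer id -> lifetime value, as it might come from a warehouse.
  static final Map<String, Double> LIFETIME_VALUE =
      Map.of("cust-17", 92000.0, "cust-42", 1500.0);

  // Unstructured side: a crude keyword-based churn signal from call-center notes.
  static boolean churnSignal(String notes) {
    String t = notes.toLowerCase();
    return t.contains("cancel") || t.contains("competitor") || t.contains("frustrated");
  }

  public static void main(String[] args) {
    String custId = "cust-17";
    String notes = "Customer frustrated with outages, mentioned a competitor offer.";

    // Combined insight: a churn signal matters most for a high-value customer.
    if (churnSignal(notes) && LIFETIME_VALUE.getOrDefault(custId, 0.0) > 50000) {
      System.out.println("Flag " + custId + " for retention outreach");
    }
  }
}
```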
  • Often data warehouse environments are anything but simple. Warehouses can become glutted with data and end up not well suited to any one task. Organizations are often hampered by poor analytics performance – queries take hours or even days to run – and the cost of the data warehouse, and of improving its performance, can be prohibitively high. The value here is striking: many organizations realize a 10 to 100 times performance boost on deep analytics, so queries that took hours now take minutes. The cost and performance benefit is significant, and the efficiency of employees is boosted. The appliance is also extremely simple to install and administer, yielding significantly lower administration costs. One customer example is Catalina Marketing, which executes 10x the predictive workloads with the same staffing level. The entry point for this pain point is IBM Netezza.
  • Hadoop is a cost-efficient platform with the ability to significantly lower the cost of certain workloads. Organizations may have particular pain around reducing the overall cost of their data warehouse: certain groups of data may be seldom used and are candidates for offloading to a lower-cost platform, and certain operations, such as transformations, can often be offloaded to a more cost-efficient platform. The primary area of value creation is cost savings. By pushing workloads and data sets onto a Hadoop platform, organizations preserve their queries while taking advantage of Hadoop's cost-effective processing. One customer example, a financial services firm, moved processing of applications and reports from an operational data warehouse to Hadoop HBase; they were able to preserve their existing queries and reduce the operating costs of their data management platform. The entry point for this pain is InfoSphere BigInsights, IBM's Hadoop-based product.
  • Key points: since most Hadoop vendors offer some form of Hadoop, let's point out what makes BigInsights better for enterprises than open source Hadoop and than competitors who simply package open source components without any added value. On the left, you see the characteristics that make Hadoop different and so valuable for analyzing big data. On the right, you see what we add around open source Hadoop. I've talked quite a bit about our accelerators and integration; let me also mention some of the things we are doing for performance and reliability: Adaptive MapReduce, an IBM innovation that speeds up MapReduce jobs without changing the way those jobs are written; compression capabilities to reduce storage costs and query time; and indexing to reduce the latency of text searches.
  • Key points: here you can see the different parts of the BigInsights platform, and how BigInsights builds and adds value on top of open source Hadoop. Immediately on top of Hadoop, we've added a set of engines that optimize the performance of MapReduce workloads, simplify workload scheduling, and index analysis results to speed access to results, plus an enterprise-level security module to control user and data access. Above the engines, we've added our unique text analytics capability to analyze unstructured data without having to first convert it to structured data, and a collection of accelerators – packaged content and best practices to solve common generalized and industry big data problems. On top sit the visualization, development tooling, and administration console – professional tooling and user interfaces for data scientists, developers, and administrators. On the right are the integration and governance capabilities that support the BigInsights platform. More on each of these value-add capabilities in the following slides.
  • Key points: Hadoop is an open source framework that grew out of work at Yahoo!, building on techniques published by Google, to optimize internet search workloads – one of the early big data challenges. As part of the open source Apache project, it has evolved beyond internet search and proven to be an effective platform for: analyzing large volumes (petabytes or more) of data, where analyzing an entire data set (versus a subset of the available data) provides more accurate analyses and much better predictions; deriving new insights from combinations of data types, combining data from multiple sources and types (structured and unstructured) to uncover relationships and insights that do not surface when silos of structured data are analyzed independently; storing data volumes that are too expensive to keep in existing data warehousing technologies; and serving as a sandbox for data discovery and exploration, where data scientists can uncover new data relationships and dependencies that impact the business. The sketch after this bullet shows the programming model.
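For readers new to the model, here is the classic word-count example, a minimal Java sketch of the MapReduce pattern the slide describes. It is not from the deck; it follows the shape of the standard Apache Hadoop tutorial example.

```java
// Canonical Hadoop MapReduce word count: map emits (word, 1), reduce sums counts.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel over blocks of the input, emitting (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: receives all counts for one word and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```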
  • Customers often have many sources of streaming data, yet they are unable to take full advantage of them. Sometimes it's because there is simply too much data to collect and store before analyzing it; sometimes it's timing – by the time they store the data on disk, analyze it, and respond, it's too late. They need a way to harness the natural resource of streaming data and turn it into actionable insight. The benefits of streaming analytics are immediately obvious: dramatic cost savings from analyzing data and storing only what is necessary, and the ability to detect and decide in real time – from retaining a customer, to detecting fraud, to cross-selling a product. One client, Ufone, analyzed Call Detail Records (CDRs) as data streamed off their network; by analyzing CDRs in real time, they were able to detect potential customer service issues and proactively respond, thereby reducing customer churn. The entry point to the big data platform is InfoSphere Streams, often accompanied by a system to persist insights and perform deeper analysis to adjust the streaming analytic models – either Netezza or InfoSphere BigInsights.
  • Key points: a new paradigm is required to analyze data in motion – some big data problems simply don't allow you to persist and then analyze the data. Streams can process multiple streams of data at the same time, with a modular design that offers unlimited scalability – millions of events per day. It is designed for variety, analyzing many data types simultaneously: video, audio, text, social media, devices (smart meters, RFID, instruments), as well as structured data. It can perform complex calculations on the data in real time, and it has built-in integration with the other capabilities in the IBM big data platform: data warehousing (Netezza, InfoSphere Warehouse, Smart Analytics System) and Hadoop (BigInsights).
  • Key points: stream computing is a different paradigm. The left of the slide shows the traditional way data is accessed, using queries to pull data from a storage device such as a data warehouse or database – still valid for many requirements. The new stream computing paradigm brings data to the query: data is pushed through, or flows through, the analytics. This is required for many new big data use cases. The common drivers are: when you need an immediate response or action and persisting-then-analyzing stored data isn't fast enough; and when it is too expensive to store the data to be analyzed – e.g., most of it is throw-away, and it's more efficient to analyze and filter it as you receive it and store only the filtered results. The sketch after this bullet illustrates the push model.
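As an illustration only – InfoSphere Streams applications are actually written in its own Streams Processing Language and run on a distributed runtime, so every name below is hypothetical – here is a plain-Java sketch of the push model, where analytics subscribe to a stream and each arriving event flows to them without being stored first:

```java
// Toy push-model pipeline: data is brought to the analytics, not queried from disk.
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class PushPipeline {

  // An "operator" is just a callback the stream pushes each event into.
  private final List<Consumer<Double>> operators = new ArrayList<>();

  public void subscribe(Consumer<Double> operator) {
    operators.add(operator);
  }

  // Called once per arriving event: no storage, no query; data flows to the analytics.
  public void onEvent(double value) {
    for (Consumer<Double> op : operators) {
      op.accept(value);
    }
  }

  public static void main(String[] args) {
    PushPipeline pipeline = new PushPipeline();

    // Filter-as-you-receive: act only on interesting events,
    // discarding the rest instead of storing everything first.
    pipeline.subscribe(v -> {
      if (v > 100.0) {
        System.out.println("ALERT: threshold exceeded: " + v);
      }
    });

    // Simulated source; in practice this would be a live feed.
    for (double v : new double[] {42.0, 87.5, 103.2, 55.1}) {
      pipeline.onEvent(v);
    }
  }
}
```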
  • This is a research project by UOIT (the University of Ontario Institute of Technology, Toronto) for premature babies in neonatal intensive care units. In ICUs, the instrumentation is there – dozens of sensors for each infant – but the instruments work in isolation and only produce visual signals, auditory signals in case of anomalies, and sometimes a paper record on demand. Streams is a highly usable platform for finally capturing all that data and combining it in useful ways, in real time. Since this kind of combined analysis has not been done before, Streams first needs to enable the research effort of figuring out which patterns are relevant and predictive, before any of this can be standardized and put into an approved product and practice. In summary, Streams enables real-time analytics and correlations on physiological data streams – for instance blood pressure, temperature, EKG, and blood oxygen saturation. This, in turn, can enable detection of the onset of potentially life-threatening conditions up to 24 hours earlier than current medical practice. Early intervention leads to lower patient morbidity and better long-term outcomes, and the technology also enables physicians to verify new clinical hypotheses. A toy example of this kind of check follows.
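Purely to make the idea concrete – the actual UOIT models are research results, not public code – here is a toy Java sketch of one kind of real-time check such a system might run: comparing a short moving average of heart rate against a longer baseline and flagging sustained drift. Window sizes and the threshold are invented for illustration.

```java
// Toy sketch (not the UOIT research models): flag sustained heart-rate drift
// by comparing a short moving average against a longer baseline window.
import java.util.ArrayDeque;
import java.util.Deque;

public class VitalsMonitor {
  private final Deque<Double> shortWin = new ArrayDeque<>();
  private final Deque<Double> longWin = new ArrayDeque<>();
  private static final int SHORT_N = 10;    // recent samples
  private static final int LONG_N = 300;    // baseline window
  private static final double DRIFT = 15.0; // beats/min, illustrative only

  // Called for every heart-rate sample as it streams in.
  public void onSample(double bpm) {
    push(shortWin, bpm, SHORT_N);
    push(longWin, bpm, LONG_N);
    if (longWin.size() == LONG_N
        && Math.abs(avg(shortWin) - avg(longWin)) > DRIFT) {
      System.out.println("Possible early-warning sign: sustained HR drift");
    }
  }

  private static void push(Deque<Double> win, double v, int cap) {
    win.addLast(v);
    if (win.size() > cap) win.removeFirst();
  }

  private static double avg(Deque<Double> win) {
    double sum = 0;
    for (double v : win) sum += v;
    return sum / win.size();
  }

  public static void main(String[] args) {
    VitalsMonitor m = new VitalsMonitor();
    java.util.Random rnd = new java.util.Random(42);
    for (int i = 0; i < 400; i++) {
      // Baseline ~140 bpm, then a simulated upward drift near the end.
      double bpm = 140 + rnd.nextGaussian() * 2 + (i > 320 ? 25 : 0);
      m.onSample(bpm);
    }
  }
}
```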
  • There are many entry points to the big data platform – it isn't a one-time, one-size-fits-all proposition. They are illustrated on this slide and in the previous slides. <Read pains and entry points to re-iterate>. The key point is that clients will start with one pain and entry point, and adopt others over time. And there is a benefit to doing so: they can leverage reusable aspects of the platform as they adopt new capabilities – sharing analytics, accelerators, and so on from one implementation to the next. That is the power of the platform: the ability to carry work from one project to the next and to go faster.
  • We've explored common pain points around big data and the entry points they lead to in the big data platform, and we've covered the new technologies that make it possible to capitalize on big data. The only question that remains is the benefit of a platform approach: why a platform versus individual products for individual needs? While it's true that the five pain points often lead to an individual product or capability in the big data platform, there is often a supporting set of capabilities needed right in phase one. But the real benefit of the platform comes in the second phase and beyond: the ability to leverage your investment for multiple purposes – for example, to reuse the text analytics developed for Hadoop in your stream computing deployment, to use the integration points among the various parts of the platform to reduce your overall development cost and time, or to rely on a common integration and governance foundation across Hadoop, stream computing, and data warehousing. The real benefit of the platform is leverage over your entire big data journey. IBM is the only vendor with this broad and balanced a view of big data and the needs of a platform, and the benefit is pre-integration of its components to reduce your implementation time and cost.

    1. IBM Big Data Platform – Frank Ketelaars & Robert Hartevelt, Presales IBM Big Data © 2012 IBM Corporation
    2. (Section divider – no content)
    3. “Data is the new Oil” – Clive Humby. In its raw form, oil has little value; once processed and refined, it helps power the world. Press quotes shown on the slide:
       • “Companies are being inundated with data – from information on customer-buying habits to supply-chain efficiency. But many managers struggle to make sense of the numbers.” (Wall Street Journal)
       • “Increasingly, businesses are applying analytics to social media such as Facebook and Twitter, as well as to product review websites, to try to understand where customers are, what makes them tick and what they want,” says Deepak Advani, who heads IBM’s predictive analytics group. (Financial Times)
       • “Big Data has arrived at Seton Health Care Family, fortunately accompanied by an analytics tool that will help deal with the complexity of more than two million patient contacts a year…” (Forbes)
       • “At the World Economic Forum last month in Davos, Switzerland, Big Data was a marquee topic. A report by the forum, ‘Big Data, Big Impact,’ declared data a new class of economic asset, like currency or gold.”
       • “…now Watson is being put to work digesting millions of pages of research, incorporating the best clinical practices and monitoring the outcomes to assist physicians in treating cancer patients.”
       • “The Oscar Senti-meter – a tool developed by the L.A. Times, IBM and the USC Annenberg Innovation Lab – analyzes opinions about the Academy Awards race shared in millions of public messages on Twitter.”
    4. The characteristics of big data:
       • Volume – cost-efficiently processing the growing volume of data (50x growth projected, 2010–2020)
       • Velocity – responding to the increasing speed of data (30 billion RFID sensors and counting)
       • Variety – collectively analyzing a broadening range of data types (80% of the world’s data is unstructured)
       • Veracity – establishing trust in big data sources (1 in 3 business leaders don’t trust the information they use to make decisions)
    5. Big Data will impact every aspect of your business:
       • Know everything about your customer – analyze all sources of data to know your customers as individuals, from channel interactions to social media
       • Run zero-latency operations – analyze all available operational data and react in real time to optimize processes; reduce the cost of IT with new cost-effective technologies
       • Innovate new products at speed and scale – capture all sources of feedback and analyze vast amounts of market and research data to drive innovation
       • Instant awareness of fraud and risk – develop better fraud/risk models by analyzing all available data, and detect fraud in real time with streaming transaction analysis
       • Exploit instrumented assets – monitor assets from real-time data feeds to predict and prevent maintenance issues and develop new products & services
    6. Business-centric big data enables you to start with a critical business pain and expand the foundation for future requirements:
       • “Big data” isn’t just a technology – it’s a business strategy for capitalizing on information resources
       • Getting started is crucial
       • Success at each entry point is accelerated by products within the big data platform
       • Build the foundation for future requirements by expanding further into the big data platform
    7. 1 – Unlock Big Data
       • Customer need: understand existing data sources; expose the data within existing content management and file systems for new uses, without copying the data to a central location; search and navigate big data from federated sources
       • Value statement: get up and running quickly and discover and retrieve relevant big data; use big data sources in new information-centric applications
       • Customer example: Procter & Gamble – connect employees with a 360° view of big data sources
       • Get started with: IBM Data Explorer (formerly known as Vivisimo Velocity)
    8. 2 – Analyze Raw Data
       • Customer need: ingest data as-is into Hadoop and derive insight from it; process large volumes of diverse data within Hadoop; combine insights with the data warehouse; low-cost ad hoc analysis with Hadoop to test new hypotheses
       • Value statement: gain new insights from a variety and combination of data sources; overcome the prohibitively high cost of converting unstructured data sources to a structured format; extend the value of the data warehouse by bringing in new types of data and driving new types of analysis; experiment with analysis of different data combinations to modify the analytic models in the data warehouse
       • Customer example: financial services regulatory organization – managed additional data types and integrated with their existing data warehouse
       • Get started with: InfoSphere BigInsights
    9. 3 – Simplify your Warehouse
       • Customer need: business users are hampered by the poor analytics performance of a general-purpose enterprise warehouse – queries take hours to run; the enterprise data warehouse is encumbered by too much data for too many purposes; need to ingest huge volumes of structured data and run multiple concurrent deep analytic queries against it; IT needs to reduce the cost of maintaining the data warehouse
       • Value statement: speed – 10-100x faster performance on deep analytic queries; simplicity – minimal administration and tuning of the appliance; up and running quickly
       • Customer example: Catalina Marketing – executing 10x the amount of predictive workloads with the same staff
       • Get started with: PureData for Analytics / Netezza
    10. 4 – Reduce Costs with Hadoop
       • Customer need: reduce the overall cost of maintaining data in the warehouse – often it’s seldom used and kept ‘just in case’; lower costs as data grows within the data warehouse; reduce expensive infrastructure used for processing and transformations
       • Value statement: support existing and new workloads on the most cost-effective alternative, while preserving existing access and queries; lower storage costs; reduce processing costs by pushing processing onto commodity hardware and the parallel processing of Hadoop
       • Customer example: financial services firm – moved processing of applications and reports to Hadoop HBase while preserving existing queries
       • Get started with: IBM InfoSphere BigInsights
    11. What’s so Special About Open Source Hadoop?
       • Storage: distributed, reliable, commodity gear
       • MapReduce: parallel programming, fault tolerant
       • Scalable: new nodes can be added on the fly
       • Affordable: massively parallel computing on commodity servers
       • Flexible: Hadoop is schema-less and can absorb any type of data
       • Fault tolerant: through the MapReduce software framework
    12. InfoSphere BigInsights – A Closer Look. Built on Apache Hadoop, BigInsights adds:
       • Performance & workload optimizations (MapReduce + indexing, workload management)
       • Unique text analytics engines
       • Spreadsheet-style visualization for data discovery & exploration
       • Built-in IDE & admin consoles (visualization, dev tools, administration console)
       • Enterprise-class security
       • High-speed connectors for integration with other systems (databases, content management, information integration & governance)
       • Analytical accelerators (text analytics and application accelerators)
    13. Hadoop is Well Suited for Handling Certain Types of Big Data Challenges:
       • Analyzing larger volumes may provide better results
       • Deriving new insights from combinations of data types
       • Larger data volumes that are cost-prohibitive with existing technology
       • Exploring data – a sandbox for ad hoc analytics
    14. 5 – Analyze Streaming Data
       • Customer need: harness and process streaming data sources; select valuable data and insights to be stored for further processing; quickly process and analyze perishable data, and take timely action
       • Value statement: significantly reduced processing time and cost – process first, then store what’s valuable; react in real time to capture opportunities before they expire
       • Customer example: Ufone – telco Call Detail Record (CDR) analytics for customer churn prevention
       • Get started with: InfoSphere Streams
       (Slide graphic: streaming data sources → Streams computing → ACTION)
    15. InfoSphere Streams – Streaming Analytics for Big Data
       • Built to analyze data in motion: multiple concurrent input streams, massive scalability
       • Process and analyze a variety of data: structured and unstructured content, video, audio; advanced analytic operators
       • Enables adaptive real-time analytics: with data warehousing and with Hadoop systems
    16. Stream Computing Represents a Paradigm Shift. Traditional computing vs. stream computing:
       • Historical fact finding vs. current fact finding
       • Find and analyze information stored on disk vs. analyze data in motion, before it is stored
       • Batch paradigm, pull model vs. low-latency paradigm, push model
       • Query-driven (submit queries to static data) vs. data-driven (bring data to the analytics) – real-time analytics
    17. University of Ontario Institute of Technology
       • Use case: neonatal infant monitoring – predict infection in the ICU 24 hours in advance
       • Solution: 120 children monitored; 120K messages/sec, a billion messages/day; trials expanding to include hospitals in the US and China
       (Slide graphic: sensor network → event pre-processor → analysis framework; stream-based, distributed, interoperable health care infrastructure and applications)
    18. Entry points are accelerated by products within the big data platform:
       • 1 – Unlock big data: IBM Vivisimo
       • 2 – Analyze raw data: InfoSphere BigInsights
       • 3 – Simplify your warehouse: PureData for Analytics / Netezza
       • 4 – Reduce costs with Hadoop: InfoSphere BigInsights
       • 5 – Analyze streaming data: InfoSphere Streams
       (Slide graphic: analytic applications – BI/reporting, exploration/visualization, functional and industry apps, predictive analytics, content analytics – on top of the IBM Big Data Platform: visualization & discovery, application development, systems management, accelerators, Hadoop system, stream computing, data warehouse, information integration & governance)
    19. IBM’s Big Data Platform:
       • Netezza – 10s to 100s of TB of structured data; extreme performance; in-database analytics; extend SQL with MapReduce
       • Streams – act on data “in motion”; structured & unstructured data; extremely scalable real-time analytics
       • BigInsights – act on data “at rest”; structured & unstructured data; text analytics; commodity hardware; “cold data” storage
    20. The Platform Advantage
       • The platform provides benefit as you move from an entry point to a second and third project
       • Shared components and integration between systems lower deployment costs
       • Key points of leverage: reuse text analytics across Streams and Hadoop; HDFS connectors between Streams and Hadoop; common integration, metadata and governance across all engines; accelerators built across multiple engines – common analytics, models, and visualization
    21. THINK
