Big data has become a business issue, or at least an issue that business people are aware of. Look at the coverage it's getting in the business press: from the Wall Street Journal ("Companies are being inundated with data") to the Financial Times ("Increasingly businesses are applying analytics to social media such as Facebook and Twitter") to Forbes ("Big Data has arrived at Seton Health Care Family"). Why is it getting this type of coverage? Because it has the potential to profoundly affect the way we do business. The quote on CNBC really exemplifies this: "Data is the new oil." Data is a natural resource, and it keeps growing. Like any resource, it is difficult to extract. It comes in many types – a huge variety. It is also difficult to refine, or analyze. Many organizations do not even tap into this natural resource – they ignore data, or they use it for a single purpose – largely because it is difficult to structure and restructure for different purposes. But some organizations have cracked the code: they have figured out how to process and analyze the data available to them, and they are using it to achieve breakthrough outcomes. If data is a natural resource, what is your company doing to capitalize on it?
Big data has four key characteristics. The first is volume. This may seem obvious, but it is more complex than you may think. Yes, the volume of data is growing – experts predict that the volume of data in the world will grow to 25 zettabytes by 2020. That same phenomenon affects every business: their data is growing at the same exponential rate. But it isn't just the volume of data that is growing – it's the number of sources of that data as well, and that leads to the characteristic of variety, which we will cover in a moment. The second characteristic is velocity: the speed at which data is created, and at which it must be integrated, keeps increasing. We've moved from batch processing to a real-time business. Data comes at you at a record or byte level, not always in bulk, and the demands of the business have increased as well – from an answer next week to an answer in a minute. The world is also becoming more instrumented and interconnected, and the volume of data streaming off those instruments is exponentially larger than it was even two years ago. The third characteristic, variety, presents an equally difficult challenge. The growth in data sources has fuelled the growth in data types – in fact, 80% of the world's data is unstructured – yet most traditional methods apply analytics only to structured information. And finally we have veracity. How can you act upon information if you don't trust it? Establishing trust in big data presents a huge challenge as the sources and the variety grow.
Big data will impact every aspect of your business, or at least it has the potential to do so. The primary areas of business impact are listed on this slide. Knowing everything about your customers – that has been the holy grail from CRM to customer portals to master data management. What has held those systems back from "knowing everything"? The variety of data is one aspect: if 80% of data is unstructured, then it stands to reason that 80% of the data about your customers is unknown to those systems. Or zero-latency operations – the ability to analyze streaming data from instrumented operational systems, or to deeply analyze inventory and supply chain operations, yields incredible insights and results. Organizations already have the data they need to optimize – the only thing holding them back is the technology to analyze big data. Product innovation: so few companies do it well, yet it can be the key difference maker in winning or losing a market. What if most companies could level the playing field with the "R&D giants"? What if they could analyze market data from social media to identify and capitalize on key market trends? What if they could analyze their product, service, and sales data utilizing a variety of analytical techniques? Instant awareness of fraud and risk – it's looking for a rare event and finding it in time to prevent it. That's a big data problem of both volume and velocity, and sometimes variety. Some organizations are realizing value by analyzing more data to develop better fraud models, or by finding the rare occurrence by literally looking everywhere. Others are analyzing data in motion to detect potential events and take appropriate action. And finally, many organizations are exploiting instrumented assets. Oil and gas companies are pursuing predictive diagnostics on remote oil rigs. New smart buildings monitor heating and cooling systems to lower costs. The opportunities are as endless as the data streaming off those instrumented assets.
Business-centric big data is about identifying a business problem, or a pain point, and then applying the appropriate technology to address it. Too often, big data starts as a research or IT project in search of a business problem; successful implementations start the other way around. And getting started in the right place is crucial: because you are demonstrating a new technology (big data) to your organization, the success of that first project may determine how widely it is adopted. The most common pain points are listed here:
Unlock big data – quickly get a view of, and understand, big data sources
Analyze raw data – ingest and analyze data in its native format
Simplify your warehouse – optimize your warehouse by offloading deep analytics tasks to purpose-built appliances
Reduce cost with Hadoop – offload workloads and data sets to Hadoop for cost-efficient processing
Analyze data in motion – harness streaming data and analyze it in real time
These five pain points lead to entry points in the big data platform – each requires a different set of big data capabilities to get started. And each will lead to expanding to new use cases and new big data capabilities over time.
Let's first look at unlocking big data. The customer need is to understand existing data sources without moving any of the data – to discover, navigate, view, and search big data in a federated manner. One customer was able to get up and running in a few months to search and navigate big data across many existing sources. This type of implementation can yield significant business value – from cutting the manual effort to search and retrieve big data, to gaining a better understanding of existing sources of big data before further analysis. The payback period is often short. Customer example – Procter & Gamble …. The entry point in the big data platform is Vivisimo Velocity – it enables federated search and navigation.
Next we have a pain point around analyzing raw data. The primary need is to analyze unstructured, or semi-structured, data from one or multiple sources. Often the content is textual, and the meaning is hidden within the text. Another common need is to combine different data types – structured and unstructured – for combined analysis. Customers often gain significant value from this approach: they unlock insights that were previously unknown. Those insights can be the key to retaining a valuable customer, to identifying a previously undetected fraud, or to discovering a game-changing efficiency in operational processes. One client, a financial services regulatory organization, analyzed a variety of new data sources and integrated the insights with their existing data warehouse to further enhance their risk modeling processes. The big data platform entry point is InfoSphere BigInsights, a Hadoop-based analytics system.
Often data warehouse environments are anything but simple. Warehouses can become glutted with data and end up well suited to no particular task. Organizations are often hampered by poor analytics performance – queries can take hours or even days to run – and the cost of the data warehouse, and of improving its performance, can be prohibitively high. The value here is striking. Many organizations realize a 10 to 100 times performance boost on deep analytics: queries that took hours now take minutes. So the cost and performance benefit is significant, and the efficiency of employees is boosted. It is also extremely simple to install and administer, yielding significantly lower administration costs. One customer example is Catalina Marketing, which executes 10x the amount of predictive workloads with the same staffing level. The entry point for this pain point is IBM Netezza.
Hadoop is a cost-efficient platform with the ability to significantly lower the cost of certain workloads. Organizations may have particular pain around reducing the overall cost of their data warehouse. Certain groups of data may be seldom used and are therefore candidates to offload to a lower-cost platform; certain operations, such as transformations, can often be offloaded to a more cost-efficient platform as well. The primary area of value creation is cost savings. By pushing workloads and data sets onto a Hadoop platform, organizations are able to preserve their queries and take advantage of Hadoop's cost-effective processing capabilities. One customer example, a financial services firm, moved processing of applications and reports from an operational data warehouse to Hadoop HBase; they were able to preserve their existing queries and reduce the operating costs of their data management platform. The entry point for this pain is InfoSphere BigInsights – IBM's Hadoop-based product.
Key Points
Since most Hadoop vendors offer some form of Hadoop, let's point out what makes BigInsights better for enterprises than open source Hadoop, and than competitors who simply package open source components without any added value. On the left, you see the characteristics that make Hadoop different and so valuable for analyzing big data. On the right, you see what we add around open source Hadoop. I've talked quite a bit about our accelerators and integration. Let me also mention some of the things we are doing from a performance and reliability standpoint:
Adaptive MapReduce – an IBM innovation that speeds up MapReduce jobs without changing the way those jobs are written
Compression capabilities to reduce storage costs and query time
Indexing to reduce the latency of text searches
Key Points
Here you can see the different parts of the BigInsights platform. Notice how BigInsights builds and adds value on top of open source Hadoop. Immediately on top of Hadoop, we've added a set of engines that optimize the performance of MapReduce workloads, simplify workload scheduling, and index analysis results to speed up access to results. Additionally, we've added an enterprise-level security module to control user and data access. Above the engines, we've added our unique text analytics capability to analyze unstructured data without having to first convert it to structured data. We've also created a collection of accelerators – packaged content and best practices to solve common generalized and industry-specific big data problems. On top, you see the Visualization, Development Tooling, and Administration Console – professional tooling and user interfaces for data scientists, developers, and administrators. On the right, you see the integration and governance capabilities that support the BigInsights platform. More on each of these value-added capabilities in the following slides.
Key Points
Hadoop is an open source framework that grew out of work at Google and Yahoo to optimize internet search workloads – one of the early big data challenges. As part of the open source Apache group, it has evolved beyond internet search and has proven to be an effective platform for:
Analyzing large volumes (petabytes or more) of data – analyzing an entire data set (versus a subset of the available data) yields more accurate analyses and much better predictions
Deriving new insights from combinations of data types – combining data from multiple sources and types (structured and unstructured) uncovers data relationships and insights that cannot be found by independently analyzing silos of structured data
Storing data volumes that are too expensive to keep in existing data warehousing technologies
Providing a sandbox for data discovery and exploration – a place where data scientists can uncover new data relationships and dependencies that impact the business
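To make the programming model concrete, here is a toy sketch of the map/shuffle/reduce pattern that Hadoop implements, written in plain single-process Python. Real Hadoop distributes these phases across a cluster; the function names here are illustrative, not Hadoop APIs.

```python
from collections import defaultdict

# Toy illustration of the MapReduce model Hadoop implements.
# A real Hadoop job distributes these phases across a cluster;
# this single-process sketch only shows the programming model.

def map_phase(records):
    """Map: emit (key, value) pairs -- here, (word, 1) per word."""
    for record in records:
        for word in record.lower().split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key (Hadoop does this between map and reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(grouped):
    """Reduce: aggregate the values for each key -- here, a simple sum."""
    return {key: sum(values) for key, values in grouped}

logs = ["big data big insights", "data in motion"]
counts = reduce_phase(shuffle(map_phase(logs)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

Because each phase only sees independent key groups, the same logic scales out: Hadoop runs many mappers and reducers in parallel over data that is too large for one machine.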
Customers often have many sources of streaming data, yet they are unable to take full advantage of them. Sometimes it's because there is simply too much data to collect and store before analyzing it. Or it may be a matter of timing: by the time they store the data on disk, analyze it, and respond, it's too late. They need a way to harness the natural resource of streaming data and turn it into actionable insight. The benefits of streaming analytics are immediately obvious: dramatic cost savings by analyzing data and storing only what is necessary, and the ability to detect and make real-time decisions – whether retaining a customer, detecting fraud, or cross-selling a product. One client, Ufone, analyzed Call Detail Records as the data streamed off their network. By analyzing CDRs in real time, they were able to detect potential customer service issues and proactively respond, thereby reducing customer churn. The entry point to the big data platform is InfoSphere Streams, often accompanied by a system to persist insights and perform deeper analysis to adjust the streaming analytic models – either Netezza or InfoSphere BigInsights.
Key Points
A new paradigm is required to analyze data in motion – some big data problems simply don't allow you to persist and then analyze the data
Can process multiple streams of data at the same time
Modular design with virtually unlimited scalability – millions of events per day
Designed for variety – to analyze many data types simultaneously: video, audio, text, social media, devices (smart meters, RFID, instruments) as well as structured data
Can perform complex calculations on the data in real time
Built-in integration with the other capabilities in the IBM big data platform: data warehousing (Netezza, InfoSphere Warehouse, Smart Analytics System) and Hadoop (BigInsights)
Key Points
Stream computing is a different paradigm. The left shows the traditional way data is accessed: queries pull data from a storage device such as a data warehouse or database – which is still valid for many requirements. The new stream computing paradigm brings the data to the query: data is pushed through, and flows through, the analytics. This is required for many new big data use cases. Common drivers for those new use cases:
When you need an immediate response or action, and persisting and then analyzing stored data isn't fast enough
When it is too expensive to store the data to be analyzed – e.g. most of it is throw-away, and it's more efficient to analyze and filter the data as you receive it and store only the filtered results
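The "bring the data to the query" idea can be sketched as a standing pipeline that events are pushed through, with only the surviving events ever stored. This is a minimal illustrative sketch in Python; InfoSphere Streams itself uses its own SPL language and runtime, and the sensor names and threshold here are hypothetical.

```python
# Sketch of a standing query: data flows through the analytics,
# rather than being stored first and queried later.

def over_threshold(events, limit):
    """Standing filter: pass through only readings above the limit."""
    for event in events:
        if event["value"] > limit:
            yield event

def enrich(events):
    """Standing transform: flag each passing event for action."""
    for event in events:
        yield {**event, "alert": True}

# Simulated feed (in practice an unbounded stream; finite here).
feed = [{"sensor": "s1", "value": 10},
        {"sensor": "s2", "value": 99},
        {"sensor": "s1", "value": 120}]

# Only the filtered results would ever be persisted -- the rest is
# discarded as it flows by, which is where the storage savings come from.
alerts = list(enrich(over_threshold(iter(feed), limit=50)))
print(len(alerts))  # 2
```

The pull model would run a query against all three stored records after the fact; the push model evaluates each record once, as it arrives, and never stores the two-thirds that fail the filter.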
This is a research project done by UOIT – the University of Ontario Institute of Technology (Toronto) – for premature babies in neonatal Intensive Care Units. In ICUs, the instrumentation is there: dozens of sensors for each infant. But the instruments work in isolation and only produce visual signals, auditory signals in case of anomalies, and sometimes a paper record on demand. Streams is a highly usable platform for finally capturing all that data and combining it in useful ways, in real time. Since this kind of combined analysis has not been done before, Streams first needs to enable the research effort of figuring out which patterns are relevant and predictive, before any of this can be standardized and put into an approved product and practice. In summary, Streams enables real-time analytics and correlations on physiological data streams – for instance blood pressure, temperature, EKG, and blood oxygen saturation. This, in turn, can enable early detection of the onset of potentially life-threatening conditions up to 24 hours earlier than current medical practice. Early intervention leads to lower patient morbidity and better long-term outcomes. This technology also enables physicians to verify new clinical hypotheses.
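The core mechanic behind this kind of monitoring is windowed analysis of a physiological stream. Here is a hypothetical sketch: flag when the moving average of one vital sign drifts too far below a baseline. The window size, baseline, and threshold are invented for illustration; the actual UOIT research correlates many streams with clinically validated models.

```python
from collections import deque

# Hypothetical sketch of sliding-window analysis on one
# physiological stream: alert when the moving average of a
# reading drifts more than `drop` below its baseline.
# All parameters here are illustrative, not clinical values.

def monitor(readings, window=5, baseline=140.0, drop=10.0):
    """Yield the windowed mean whenever it falls too far below baseline."""
    recent = deque(maxlen=window)  # keeps only the last `window` readings
    for value in readings:
        recent.append(value)
        if len(recent) == window:
            avg = sum(recent) / window
            if baseline - avg > drop:
                yield avg

# Simulated feed showing a gradual downward drift.
stream = [141, 140, 139, 128, 127, 126, 125, 124]
alerts = list(monitor(stream))
print(len(alerts) > 0)  # True
```

The point of the window is that a single low reading is noise, while a sustained drift across the window is a trend worth alerting on – which is what lets this style of analysis surface a deterioration before any individual instrument alarm fires.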
There are many entry points to the big data platform. It isn't a one-time, one-size-fits-all proposition – the entry points are illustrated on this slide and in the previous slides. <Read pains and entry points to re-iterate>. The key point is that clients will start with one pain and entry point, and adopt others over time. And there is a benefit to doing so: they can leverage reusable aspects of the platform as they adopt new capabilities – sharing analytics, accelerators, etc. from one implementation to the next. That is the power of the platform – the ability to leverage work from one project to the next and to go faster.
We've explored common pain points around big data, and the entry points they lead to in the big data platform. We've covered the new technologies that make it possible to capitalize on big data. The only question that remains is the benefit of a platform approach – why a platform versus individual products for individual needs? While it's true that the five pain points often lead to an individual product or capability in the big data platform, there is often a supporting set of capabilities needed right in phase one. But the real benefit of the platform comes in the second phase and beyond. It's the ability to leverage your investment for multiple purposes: for example, to reuse the text analytics developed for Hadoop in your stream computing deployment, to leverage integration points among the various aspects of the platform to reduce your overall development cost and time, or to build on a common integration and governance foundation across Hadoop, stream computing, and data warehousing. The real benefit of the platform is leverage – the ability to carry capabilities forward over your big data journey. IBM is the only vendor with so broad and balanced a view of big data and the needs of a platform – and the benefit is pre-integration of its components to reduce your implementation time and cost.
Frank Ketelaars & Robert Hartevelt, IBM - Big Data - BI Symposium 2012