Conceptual overview of Big Data, Flemming Bagger, IBM



Presentation from IBM Smarter Business 2012

  • Nothing illustrates the breakthrough of Twitter better than a simple comparison between two Olympic Games. To put it in perspective, look at the time between the Beijing Olympics in 2008 and the 2012 London Games. +CLICK+ As the London Olympics start, there are over 500 million active users on Twitter, pushing out over 400 million tweets a day. This is a massive increase from the six million Twitter users during the Beijing Olympics in 2008, who pushed out about 300,000 tweets per day. How big? +CLICK+ Between these two points in time, the number of Twitter users has increased by 83x and the number of tweets by a whopping 1,333x!
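The growth multiples quoted above are simple ratios of the two sets of figures; a quick back-of-the-envelope check confirms them:

```python
# Back-of-the-envelope check of the Twitter growth figures quoted above.
beijing_users, london_users = 6_000_000, 500_000_000
beijing_tweets, london_tweets = 300_000, 400_000_000

user_growth = london_users / beijing_users      # users grew ~83x
tweet_growth = london_tweets / beijing_tweets   # tweets grew ~1333x
print(round(user_growth), round(tweet_growth))  # 83 1333
```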
  • C&A is a Brazilian retailer that has ‘Smart Hangers’, where shoppers can Like a piece of clothing and see how many Likes an item has on Facebook. On the C&A Web site, each piece of clothing has its own post and the Likes just keep piling up. While folks can get click-happy, so it’s tough to tell just how popular something really is, the point here is where we are headed.
  • Obviously, there are many other forms of data. Let’s start with the hottest topic associated with Big Data today: social networks. Twitter generates about 12 terabytes of tweet data – every single day. Now, keep in mind, these numbers are hard to keep accurate, so the point is simply that they’re big, right? Don’t fixate on the actual numbers, because they change all the time; even if these numbers are two years out of date, they’re already too staggering to handle exclusively using traditional approaches. +CLICK+ Facebook over a year ago was generating 25 terabytes of log data every day (Facebook log data reference: http://www.datacenterknowledge.com/archives/2009/04/17/a-look-inside-facebooks-data-center/) and probably about 7 to 8 terabytes of data that goes up on the Internet. +CLICK+ Google, who knows? Look at Google Plus, YouTube, Google Maps, and all that kind of stuff. So that’s the left-hand side of this chart – the social network layer. +CLICK+ Now let’s get back to instrumentation: there is a massive proliferation of technologies that allow us to be more interconnected than at any point in history – and it isn’t just P2P (people to people) interconnections, it’s M2M (machine to machine) as well. Again, with these numbers, who cares what the current number is; I try to keep them updated, but the point is that even if they are out of date, it’s almost unimaginable how large these numbers are. Over 4.6 billion camera phones that leverage built-in GPS to tag your location or your photos, purpose-built GPS devices, smart meters. If you recall the bridge that collapsed in Minneapolis a number of years ago in the USA, it was rebuilt with smart sensors inside it that measure the contraction of the concrete based on weather conditions, ice build-up, and so much more. So I didn’t realize how true it was when Sam P launched Smart Planet: I thought it was a marketing play.
But truly the world is more instrumented, interconnected, and intelligent than it’s ever been before, and this capability allows us to address new problems and gain new insights never before thought possible – and that’s what the Big Data opportunity is all about!
  • Big data comes from many sources. It’s much more than traditional data sources, and in order to capitalize on the breakthrough opportunities we’ve discussed, you definitely need to look beyond traditional data sources. But at the same time, don’t forget that big data comes from those traditional sources too. Transactional data and application data are growing at a significant rate. Although it’s structured, that data is large and it is contained in many different structures. Big data includes machine data – logs, web logs, instrumentation data, network data. Data generated by machines is multiplying quickly, and it contains valuable insights that need to be discovered. Social data also needs to be incorporated. Most social data is really textual data, and the valuable insights remain locked within that text and its many possible meanings. And most of that data isn’t valuable, or has a very short expiry date during which it is valuable. That makes social data very challenging – extracting insight from largely textual content in very little time. And enterprise content must be amalgamated as well; that data comes in many forms, and also in significant volume.
  • Big data has 4 key characteristics. The first is volume. This may seem obvious, but it is more complex than you may think. Yes, the volume of data is growing: experts predict that the volume of data in the world will grow to 35 zettabytes by 2020. That same phenomenon affects every business – their data is growing at the same exponential rate too. But it isn’t just the volume of data that is growing; it’s the number of sources of that data. And that leads to the third characteristic of big data, variety, which we will cover later. The second characteristic is velocity: data is accelerating both in how fast it is created and how fast it must be integrated. We’ve moved from batch to a real-time business. Data comes at you at a record or byte level, not always in bulk. And the demands of the business have increased as well – from an answer next week to an answer in a minute. The world is also becoming more instrumented and interconnected, and the volume of data streaming off those instruments is exponentially larger than it was even 2 years ago. Variety presents an equally difficult challenge: the growth in data sources has fuelled the growth in data types. In fact, 80% of the world’s data is unstructured, yet most traditional methods apply analytics only to structured information. And finally we have veracity: how can you act upon information if you don’t trust it? Establishing trust in big data presents a huge challenge as the sources and the variety grow.
  • In this slide you can see a graph – it’s not to scale, but you get the point – and this graph shows that the amount of data available to an enterprise is growing enormously; you can see that in the top line. And as the amount of data available to an organization grows, the percentage of data that the organization can actually process is decreasing. It’s kind of like we’re getting “dumber” as organizations – in proportion to the data we are collecting, we understand less and less of it. +CLICK+ I call the shaded area between these opposite-trending lines “The Blind Spot”: it contains signals and noise. This area has got all this data in there, and perhaps it would make sense for us to ingest it into our traditional analytic systems, but we don’t know whether that data will yield value or not – it’s a blind spot. We have a hunch that there is value in there, but truly we have no idea what’s in the shaded area. Furthermore, while we suspect there is value in here, we know it’s not all going to be useful, so how do we sift through the noise to find the signals? Can we start ingesting 10 TB of data a day and ask the CIO for approval to triple OPEX and CAPEX costs on a hunch? We have to find the signals within all the noise in a cost-effective manner. Now, if we can leverage some new approach to find the value in the blind spot at a relatively low cost – if we could tie together things like Big Data social media around the core trusted information that we know about our customers, and drop the stuff that isn’t related to what the business is trying to accomplish – we could really start to monetize relationships and intent, not just transactions. And that’s the difference, right? Do we monetize intent and relationships? That’s a problem domain that includes Big Data.
In the previous paragraph I gave a ubiquitous example, since social media is so obviously tied to Big Data. But you can imagine this dichotomy in any industry. For example, think of Oil and Gas (O&G) drill well readings streaming in – and wanting to apply analytics to that with geological data that is unstructured, comes from other sources in various formats, and is likely often changing (from an attribute perspective). Harvesting wind energy, traffic patterns, and more.
  • Another reason that big data is a hot topic in the market today is the new technology that enables an organization to take advantage of the natural resource of big data. Big data itself isn’t new – it’s been here for a while and growing exponentially. What is new is the technology to process and analyze it. The purpose of big data technology is to cost-effectively manage and analyze all of the available data – any data, as-is. If you want to analyze structured data, then structure it. If you want to analyze an acoustic file, then analyze the acoustic file with appropriate analytics. You’ll see the wide variety of sources of big data: it comes from our traditional systems – billing systems, ERP systems, CRM systems. It also comes from machine data – RFID tags, sensors, network switches. And it comes from humans – website data, social media, etc.
  • Key point: many use cases require multiple technologies to address big data challenges.
    – Pre-processing: ingest multiple data types, structure the data, identify insights, then store those insights in a structured data warehouse.
    – Combined structured and unstructured: a structured data warehouse and an unstructured Hadoop system analyze data and share insights back and forth.
    – High velocity and historical: stream computing analyzes data in motion and stores insights in a structured data warehouse for deeper analysis and/or reporting.
    – Reuse structured data: unload structured data into Hadoop and experiment – some companies have found entirely new uses for data that could become new service offerings (e.g., a large bank discovered that it can profile its client base by financial profile and potentially offer a service telling customers how they rate vs. their profile – e.g., you have a 20% higher mortgage than clients in your financial profile).
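The bank example above boils down to comparing one customer against a peer-profile average. A minimal sketch of that rating, with all names and figures invented for illustration:

```python
from statistics import mean

def mortgage_vs_profile(customer_mortgage: float, peer_mortgages: list[float]) -> float:
    """Return how much higher (+) or lower (-) the customer's mortgage is
    than the average for peers in the same financial profile, as a fraction."""
    return customer_mortgage / mean(peer_mortgages) - 1.0

# A customer owing 240,000 against peers averaging 200,000 is 20% higher.
delta = mortgage_vs_profile(240_000, [180_000, 200_000, 220_000])
print(f"{delta:+.0%}")  # +20%
```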
  • Let’s first look at unlocking big data. The customer need is to understand existing data sources without moving any of the data – to discover, navigate, view, and search big data in a federated manner. One customer was able to get up and running in a few months to search and navigate big data across many existing sources. This type of implementation can yield significant business value – from cutting the manual effort to search and retrieve big data, to gaining a better understanding of existing sources of big data before further analysis. The payback period is often short. Customer example – Procter & Gamble …. The entry point in the big data platform is Vivisimo Velocity – it enables federated search and navigation.
  • Next we have a pain point around analyzing raw data. The primary need is to analyze unstructured, or semi-structured, data from one or multiple sources. Often the content is textual – and the meaning is hidden within the text. Another common need is to combine different data types – structured and unstructured – for combined analysis. Customers often gain significant value in this approach – they unlock insights that were previously unknown. Those insights can be the key to retaining a valuable customer, identifying previously undetected fraud, or discovering a game-changing efficiency in operational processes. One client, a financial services regulatory organization, analyzed a variety of new data sources and integrated the insights with their existing data warehouse to further enhance their risk modeling processes. The big data platform entry point is InfoSphere BigInsights, a Hadoop-based analytics system.
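A minimal sketch of the idea described above – deriving a simple insight from raw text and attaching it to a structured record. This is plain Python, not an actual BigInsights job, and the keyword list and customer record are invented for illustration:

```python
import re
from collections import Counter

# Hypothetical complaint vocabulary; a real deployment would use proper
# text analytics, not a keyword list.
COMPLAINT_TERMS = {"late", "broken", "refund", "cancel"}

def complaint_signals(messages: list[str]) -> Counter:
    """Count complaint-related words across free-text messages."""
    words = re.findall(r"[a-z]+", " ".join(messages).lower())
    return Counter(w for w in words if w in COMPLAINT_TERMS)

messages = ["My order arrived late and the item was broken",
            "Please cancel and issue a refund"]
signals = complaint_signals(messages)

# Join the unstructured insight back onto a structured customer row,
# as the warehouse-integration step describes.
customer = {"id": 42, "segment": "gold"}
customer["complaint_score"] = sum(signals.values())
print(customer["complaint_score"])  # 4
```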
  • Often data warehouse environments are anything but simple. Warehouses can become glutted with data and end up not well-suited to any one particular task. Organizations are often hampered by poor analytics performance – queries take hours or even days to run. And the cost of the data warehouse, and of improving its performance, can be prohibitively high. The value is striking: many organizations realize a 10 to 100 times performance boost on deep analytics. Queries that took hours now take minutes. So the cost and performance improvement is significant – and the efficiency of employees is boosted. It’s also extremely simple to install and administer, yielding significantly lower administration costs. One customer example is Catalina Marketing, which executes 10x the amount of predictive workloads with the same staffing level. The entry point for this pain point is IBM Netezza.
  • Hadoop is a cost-efficient platform with the ability to significantly lower the cost of certain workloads. Organizations may have particular pain around reducing the overall cost of their data warehouse. Certain groups of data may be seldom used and are possible candidates to offload to a lower-cost platform. Certain operations, such as transformations, can often be offloaded to a more cost-efficient platform. The primary area of value creation is cost savings: by pushing workloads and data sets onto a Hadoop platform, organizations are able to preserve their queries and take advantage of Hadoop’s cost-effective processing capabilities. One customer example, a financial services firm, moved processing of applications and reports from an operational data warehouse to Hadoop HBase; they were able to preserve their existing queries and reduce the operating costs of their data management platform. The entry point for this pain is InfoSphere BigInsights – IBM’s Hadoop-based product.
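The offload decision above starts with identifying seldom-used data. A sketch of that triage, assuming a simple last-access rule (the one-year cutoff and records are invented for illustration):

```python
from datetime import date, timedelta

# Records untouched for longer than this are candidates for a cheaper
# Hadoop tier; the cutoff is an assumed policy, not a product default.
CUTOFF = timedelta(days=365)

def split_by_access(records: list[dict], today: date) -> tuple[list[dict], list[dict]]:
    """Split records into hot (keep in the warehouse) and cold
    (offload candidates) by last-access date."""
    hot, cold = [], []
    for rec in records:
        (hot if today - rec["last_access"] <= CUTOFF else cold).append(rec)
    return hot, cold

records = [
    {"id": 1, "last_access": date(2012, 6, 1)},
    {"id": 2, "last_access": date(2009, 3, 15)},
]
hot, cold = split_by_access(records, today=date(2012, 9, 1))
print(len(hot), len(cold))  # 1 1
```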
  • Key points: Hadoop is not a product but an open source framework for more cost-effectively and efficiently analyzing large amounts of structured and unstructured data. However, using open source Hadoop requires the download, installation, configuration, and maintenance of a myriad of different software pieces (Hadoop, MapReduce, Hive, Pig, HBase, etc.). On the left, you see the characteristics that make Hadoop different and so valuable for analyzing big data. Some vendors try to simplify the installation and configuration of the Hadoop framework and projects by prepackaging all the components into a single “distribution” without providing any real added value. IBM’s approach to Hadoop is different. On the right, you see the innovations and enhancements we added to our BigInsights Hadoop analytics product, making it significantly better for enterprises than open source Hadoop. In the area of performance and reliability, we’ve added ground-breaking innovations:
    – Adaptive MapReduce, which speeds up MapReduce workloads by enabling dynamic changes to resource utilization (CPU, disk space, memory) without human intervention. Without this innovation, Hadoop users would need to monitor their MapReduce workloads and manually turn the configuration “knobs” to adjust resource utilization.
    – Compression enhancements that reduce storage needs/costs as well as query time.
    – Indexing that reduces the latency of text searches.
    – A Workload Scheduler capability that makes it easy to schedule and optimize Hadoop analytics runs.
    Other enhancements:
    – Accelerators of prepackaged content and knowledge (best practice patterns) to solve discrete big data problems.
    – UIs and tooling needed by data scientists, developers, and administrators to minimize the Hadoop learning curve.
    – Out-of-the-box integration connectors to access any data type and source.
    – Security to control data access – critical to maintain data privacy and protect confidential data.
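The MapReduce model that Hadoop implements can be illustrated with a toy word count: map emits (key, value) pairs, a shuffle groups them by key, and reduce aggregates each group. This is a single-process sketch of the programming model, not Hadoop itself:

```python
from collections import defaultdict

def map_phase(line: str):
    # Map: emit (word, 1) for every word in the input line.
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate each key's values into one result.
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big insight", "data in motion"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"], counts["data"])  # 2 2
```

In real Hadoop the map and reduce tasks run in parallel across a cluster and the shuffle moves data over the network; the logic per phase is the same.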
  • Customers often have many sources of streaming data, yet they are unable to take full advantage of them. Sometimes it’s because there is simply too much data to collect and store before analyzing it. Or it may be a matter of timing – by the time they store the data on disk, analyze it, and respond, it’s too late. They need a way to harness the natural resource of streaming data and turn it into actionable insight. The benefits of streaming analytics are immediately obvious: dramatic cost savings by analyzing data and only storing what is necessary, and the ability to detect and make real-time decisions – whether retaining a customer, detecting fraud, or cross-selling a product. One client, Ufone, analyzed Call Detail Records as data streamed off their network. By analyzing CDRs in real time, they were able to detect potential customer service issues and proactively respond, thereby reducing customer churn. The entry point to the big data platform is InfoSphere Streams, which is often accompanied by a system to persist insights and perform deeper analysis to adjust the streaming analytic models – either Netezza or InfoSphere BigInsights.
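A hedged sketch of the streaming idea above (plain Python, not InfoSphere Streams): keep a sliding window over incoming call detail records and flag a subscriber whose recent dropped-call rate crosses a threshold, a plausible churn-risk signal. Window size, threshold, and records are all invented for illustration:

```python
from collections import deque

class DroppedCallMonitor:
    def __init__(self, window_size: int = 5, threshold: float = 0.5):
        # Only the last `window_size` observations are kept - the stream
        # itself is never stored, which is the cost-saving point above.
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def observe(self, cdr: dict) -> bool:
        """Ingest one CDR; return True when the window is full and its
        dropped-call rate reaches the threshold."""
        self.window.append(cdr["dropped"])
        rate = sum(self.window) / len(self.window)
        return len(self.window) == self.window.maxlen and rate >= self.threshold

monitor = DroppedCallMonitor()
stream = [{"dropped": d} for d in (0, 1, 1, 0, 1, 1)]
alerts = [monitor.observe(cdr) for cdr in stream]
print(alerts)  # the alert fires once the window fills with enough drops
```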
  • There are many entry points to the big data platform. It isn’t a one-time, one-size-fits-all proposition – the entry points are illustrated on this slide and in the previous slides. <Read pains and entry points to re-iterate>. The key point is that clients will start with one pain and entry point, and adopt others over time. And there is a benefit to doing so – they may leverage reusable aspects of the platform as they adopt new capabilities, sharing analytics, accelerators, etc. from one implementation to the next. And that is the power of the platform: the ability to leverage work from one project to the next and to go faster.

    1. Insight to Action – Big Data – Challenge and Opportunity
    2. Smarter Business 2012 – Insight to Action – Big Data – Challenge and Opportunity. Tracks: Mobility (bring your own device), Smarter Analytics, Social Collaboration, Smarter Security, Smarter Cities, Smarter Commerce & Marketing, Smarter Product Innovation, Smarter Process Optimization, Smarter Infrastructure Management & Automation
    3. Agenda
       10:30 IBM Big Data Platform – Flemming Bagger, Big Data Analytics Leader, Nordic
       11:15 Break
       11:30 Achieve concrete results with Big Data Analytics – Lauren Walker, Big Data Analytics Leader, Europe
       12:15 Lunch
       13:30 Success or failure? How to handle Big Data in the financial sector – Keith Prince, EMEA Industry Solutions Executive, Financial Services, IBM
       14:15 Break
       14:30 Data collection and monitoring across social media – Ulrik Bo Larsen, Founder & CEO, FALCON Social
       15:10 Wrap-up
    5. Information Management – Highlight from the IBM CEO Study 2012 © 2012 IBM Corporation
    6. From 6,000,000 users on Twitter pushing out 300,000 tweets per day to 500,000,000 users pushing out 400,000,000 tweets per day – 83x the users, 1333x the tweets
    7. In 2005 there were 1.3 billion RFID tags in circulation…
    8. Where is big data coming from?
       – 12+ TBs of tweet data every day
       – 4.6 billion camera phones world wide
       – 30 billion RFID tags today (1.3B in 2005)
       – 100s of millions of GPS enabled devices sold annually
       – 25+ TBs of log data every day
       – ? TBs of data every day
       – 2+ billion people on the Web by end 2011
       – 76 million smart meters in 2009… 200M by 2014
    9. In Order to Realize New Opportunities, You Need to Think Beyond Traditional Sources of Data
       – Transactional and Application Data: Volume – Structured – Throughput
       – Machine Data: Velocity – Semi-structured – Ingestion
       – Social Data: Variety – Highly unstructured – Veracity
       – Enterprise Content: Variety – Highly unstructured – Volume
    10. The Characteristics of Big Data
       – Cost efficiently processing the growing Volume (50x growth: 35 ZB from 2010 to 2020)
       – Responding to the increasing Velocity (30 billion RFID sensors and counting)
       – Collectively analyzing the broadening Variety (80% of the world's data is unstructured)
       – Establishing the Veracity of big data sources (1 in 3 business leaders don't trust the information they use to make decisions)
    11. The Big Data Conundrum
       – The percentage of available data an enterprise can analyze is decreasing in proportion to the amount of data available to that enterprise – quite simply, this means that as enterprises we are getting "more naive" about our business over time
       – Just collecting and storing "Big Data" doesn't drive a cent of value to an organization's bottom line
       – Data AVAILABLE to an organization vs. data an organization can PROCESS
    12. Big Data is a Hot Topic Because Technology Makes it Possible to Analyze ALL Available Data
       – Cost effectively manage and analyze all available data in its native form: unstructured, structured, streaming… internal and external
       – Sources: Website, Social Media, Billing, ERP, CRM, RFID, Network Switches
    13. Most Client Use Cases Combine Multiple Technologies
       – Pre-processing: ingest and analyze unstructured data types and convert to structured data
       – Combine structured and unstructured analysis: augment the data warehouse with additional external sources, such as social media
       – Combine high velocity and historical analysis: analyze and react to data in motion; adjust models with deep historical analysis
       – Reuse structured data for exploratory analysis: experimentation and ad-hoc analysis with structured data
    14. Business-centric Big Data enables you to start with a critical business pain and expand the foundation for future requirements
       – "Big data" isn't just a technology – it's a business strategy for capitalizing on information resources
       – Getting started is crucial
       – Success at each entry point is accelerated by products within the Big Data platform
       – Build the foundation for future requirements by expanding further into the big data platform
    15. 1 – Unlock Big Data
       Customer Need:
       – Understand existing data sources
       – Expose the data within existing content management and file systems for new uses, without copying the data to a central location
       – Search and navigate big data from federated sources
       Value Statement:
       – Get up and running quickly; discover and retrieve relevant big data
       – Use big data sources in new information-centric applications
       Get started with: IBM Vivisimo Velocity
    16. Most Common Big Data Use Case = 360-Views – a single view of the information for the customer-facing professional/knowledge worker
    17. 2 – Analyze Raw Data
       Customer Need:
       – Ingest data as-is into Hadoop and derive insight from it
       – Process large volumes of diverse data within Hadoop
       – Combine insights with the data warehouse
       – Low-cost ad-hoc analysis with Hadoop to test new hypotheses
       Value Statement:
       – Gain new insights from a variety and combination of data sources
       – Overcome the prohibitively high cost of converting unstructured data sources to a structured format
       – Extend the value of the data warehouse by bringing in new types of data and driving new types of analysis
       – Experiment with analysis of different data combinations to modify the analytic models in the data warehouse
       Get started with: InfoSphere BigInsights
    18. 3 – Simplify your Warehouse
       Customer Need:
       – Business users are hampered by the poor performance of analytics in a general-purpose enterprise warehouse – queries take hours to run
       – The enterprise data warehouse is encumbered by too much data for too many purposes
       – Need to ingest huge volumes of structured data and run multiple concurrent deep analytic queries against it
       – IT needs to reduce the cost of maintaining the data warehouse
       Value Statement:
       – Speed and simplicity for deep analytics (Netezza)
       – 100s to 1000s of users/second for operational analytics (IBM Smart Analytics System)
       Get started with: IBM Netezza
    19. 4 – Reduce Costs with Hadoop
       Customer Need:
       – Reduce the overall cost to maintain data in the warehouse – often it's seldom used and kept 'just in case'
       – Lower costs as data grows within the data warehouse
       – Reduce expensive infrastructure used for processing and transformations
       Value Statement:
       – Support existing and new workloads on the most cost effective alternative, while preserving existing access and queries
       – Lower storage costs
       – Reduce processing costs by pushing processing onto commodity hardware and the parallel processing of Hadoop
       Get started with: IBM InfoSphere BigInsights
    20. IBM Significantly Enhances Hadoop
       Open source Hadoop:
       – Scalable: new nodes can be added on the fly
       – Affordable: massively parallel computing on commodity servers
       – Flexible: Hadoop is schema-less, and can absorb any type of data
       – Fault tolerant: through the MapReduce software framework
       IBM Innovation:
       – Performance & reliability: Adaptive MapReduce, compression, indexing, flexible scheduler
       – Analytic accelerators
       – Productivity accelerators: web-based UIs, tools to leverage existing skills, end-user visualization
       – Enterprise integration: to extend & enrich your information supply chain
    21. 5 – Analyze Streaming Data (streaming data sources → streams computing → action)
       Customer Need:
       – Harness and process streaming data sources
       – Select valuable data and insights to be stored for further processing
       – Quickly process and analyze perishable data, and take timely action
       Value Statement:
       – Significantly reduced processing time and cost – process first, then store what's valuable
       – React in real-time to capture opportunities before they expire
       Customer example: Ufone – telco Call Detail Record (CDR) analytics for customer churn prevention
       Get started with: InfoSphere Streams
    22. Entry points are accelerated by products within the big data platform
       – 1 – Unlock Big Data: IBM Vivisimo
       – 2 – Analyze Raw Data: InfoSphere BigInsights
       – 3 – Simplify your warehouse: Netezza
       – 4 – Reduce costs with Hadoop: InfoSphere BigInsights
       – 5 – Analyze Streaming Data: InfoSphere Streams
       The IBM Big Data Platform spans analytic applications (BI/reporting, exploration/visualization, functional and industry apps, predictive and content analytics) on top of visualization & discovery, application development, systems management, accelerators, a Hadoop system, stream computing, a data warehouse, and information integration & governance
    23. Is Big Data imperative?
    24. THINK
    26. Break