Big Data - Trends and Reality in Information Management.
This presentation was given at IBM Data Server Day on May 22 in Stockholm by Jacques Milman, Datawarehouse Architecture Leader, IBM
Big data comes from a wide variety of structured and unstructured (non-relational) sources, both relational and non-relational. The volume of this data is large, or can grow to be large. Though the data might be "noisy", it can contain significant insight, even if a large portion of it is not valuable. Data may also have an extremely short half-life.
Key Points: There are many varied use cases that may be addressed by Internet-scale analytics, across many industries.
Utilities: already mentioned the weather analysis for wind turbine placement. Mention the smart meter analysis: reading large volumes of smart meter data and combining it with other consumption data and weather data to understand the impact on consumption. Utilities can also store anomalies in meter reads (data that doesn't fit into a pre-defined structure), which aids in problem detection and repairs.
IT: log analysis is a popular use case. Bringing in log data from many systems to track a transaction that unfolds across those systems helps determine where an error occurred. Previously this was possible, but only through mapping many log files to one relational structure, which was costly and time-consuming.
E-commerce & multi-channel: both use cases have already been mentioned; no need to dwell on them here.
Transportation: taking in a huge variety of data (logistics, traffic data, weather patterns, fuel consumption) to optimize logistics. It contains all three elements of V3: significant variety; moderate velocity, though the time window to solve the problem is quite short and the data is very volatile; and volume.
Transmission monitoring: voltage levels and transient voltage fluctuations.
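The IT log-analysis pattern above, tracking one transaction across many systems without first forcing every log into a single relational schema, can be sketched as follows. The log formats, sample lines, and transaction-id conventions here are invented for illustration:

```python
# Sketch: correlating log lines from heterogeneous systems by a shared
# transaction id, without mapping them to one relational schema first.
# The regexes and sample log lines below are hypothetical.
import re

LOGS = [
    ("web", "2012-05-22 10:01:02 txn=ab12 GET /checkout 200"),
    ("app", "10:01:02.113 [ab12] order validated"),
    ("db",  "ts=10:01:03 tx:ab12 INSERT orders ... ERROR deadlock"),
    ("app", "10:01:05.410 [zz99] session expired"),
]

# One extraction pattern per log format.
TXN_PATTERNS = [re.compile(p) for p in (r"txn=(\w+)", r"\[(\w+)\]", r"tx:(\w+)")]

def txn_id(line: str):
    """Extract a transaction id from any of the known log formats."""
    for pat in TXN_PATTERNS:
        m = pat.search(line)
        if m:
            return m.group(1)
    return None

def trace(txn: str):
    """All log lines, in arrival order, that belong to one transaction."""
    return [(system, line) for system, line in LOGS if txn_id(line) == txn]

for system, line in trace("ab12"):
    print(f"{system:>4}: {line}")
```

Running the trace for transaction `ab12` surfaces the database deadlock as the step where the error occurred, without any of the three log formats having been normalized into a common table first.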
Here is another example of something the University of Southern California Annenberg School of Communication did with the IBM Big Data platform's BigSheets technology. USC Annenberg created the Film Forecaster tool and used it to correctly predict 2011's summer blockbusters by scraping Twitter and analyzing the tweets against a simple lexicon that indicated a positive or negative outlook for a movie. They made quite an impact, since this very solution was featured on ABC News (a national news network in the USA). More striking is the quote: the application was built by a communication master's student who learned BigSheets in a day.
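The lexicon approach behind Film Forecaster can be illustrated with a minimal sketch. The word lists and sample tweets here are hypothetical; the actual lexicon and pipeline are not described in the source:

```python
# Minimal sketch of lexicon-based sentiment scoring over tweets.
# Word lists and sample tweets are illustrative, not the real lexicon.

POSITIVE = {"awesome", "great", "hilarious", "must-see", "loved"}
NEGATIVE = {"boring", "awful", "flop", "terrible", "hated"}

def score_tweet(text: str) -> int:
    """+1 per positive lexicon word, -1 per negative lexicon word."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def forecast(tweets: list[str]) -> float:
    """Average sentiment across all tweets mentioning a film."""
    if not tweets:
        return 0.0
    return sum(score_tweet(t) for t in tweets) / len(tweets)

tweets = [
    "loved the trailer looks awesome",
    "another boring sequel total flop",
    "great cast must-see this summer",
]
print(forecast(tweets))  # a positive average suggests a strong opening
```

At Twitter scale this per-tweet scoring is embarrassingly parallel, which is why a spreadsheet-style tool over Hadoop such as BigSheets fits the job.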
This picture is a little simplistic, for two reasons. First, it gives pre-eminence to Netezza. That is because Netezza's simplicity, performance, and agile support for ad-hoc analysis make it the default proposition for an analytic warehouse in a greenfield situation (though this is not necessarily true if there is an existing commitment to Power or to DB2). Secondly, it does not recognise the differentiation between exploratory analysis and repeated analysis. If you are doing exploratory analysis of relational (i.e. structured) data, Netezza is the better platform; it thrives on ad-hoc analysis and has very rich tooling (INZA, SPSS, etc.) for analytics. Clearly, exploratory analysis on unstructured data belongs on BigInsights (BigI). Exploratory analysis on something in between (e.g. CDRs) could be done on Netezza, but if the data is not already being loaded (and even at a Netezza customer the raw XDRs are probably not loaded into the warehouse), then exploration in a low-cost Hadoop grid makes a great deal of sense. We have at least one customer use case of this, where, once the analysis was repeatable, it was implemented in Netezza. But there are also use cases where the repeated analysis remains in BigInsights, exploiting its differentiating enterprise readiness.
If it's data in motion (remember the babies being monitored), it has to be real-time; it has to be Streams. That's the easy one. If it's unstructured data at rest, the best place to start is BigInsights, though you may load data into the relational warehouse subsequently for further insight. If it's relational data, it's unlikely you are going to move it to Hadoop. If it's semi-structured, you have a choice, and you'll be influenced by these other development factors:
- It may be that an organization has already developed a MapReduce solution that delivers a high-value analysis for data that was unloaded from the corporate EDW. Is the right solution to say "great, now you know the solution, re-code it in SQL using in-database analytics and implement it on your warehouse"? Maybe a better solution is to implement BigInsights to enterprise-harden the Hadoop environment and run the application as is, but with production-application reliability and supportability.
- It may be that the volume is so huge that a data warehouse can't handle it, and certainly can't handle it economically (think Vestas); it may be better to go to the platform with more of the appropriate analytic skills or other development resources available.
- It may be that the customer wants to build their capability in Hadoop because they will have more challenging use cases later that will be clear-cut BigInsights use cases.
- It may be that the customer just wants to experiment cheaply and quickly (though actually that's more a BigInsights Basic Edition use case; we'll be looking to enterprise-harden it later).
But remember, these are influencers, not deciders. IBMers can adapt to whatever best matches the customer's needs, because of the comprehensive nature of our big data portfolio.
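The placement logic above can be condensed into a simple decision function. The category names and return strings below are invented for illustration; for semi-structured data the real answer depends on the influencing factors listed above, not on a lookup:

```python
# Sketch of the workload-placement decision described in the notes.
# Category names and platform strings are illustrative only.

def place_workload(data_kind: str) -> str:
    """Suggest a starting platform for a given kind of data.

    data_kind: "in_motion", "unstructured_at_rest",
               "relational", or "semi_structured".
    """
    if data_kind == "in_motion":
        # Real-time analysis of moving data: Streams. The easy one.
        return "InfoSphere Streams"
    if data_kind == "unstructured_at_rest":
        # Start in Hadoop; results may later feed the warehouse.
        return "BigInsights, possibly feeding the warehouse later"
    if data_kind == "relational":
        # Already structured data is unlikely to move to Hadoop.
        return "relational warehouse (e.g. Netezza)"
    if data_kind == "semi_structured":
        # A genuine choice: weigh existing MapReduce code, volume,
        # available skills, and future Hadoop plans.
        return "BigInsights or warehouse, depending on the factors above"
    raise ValueError(f"unknown data kind: {data_kind!r}")

print(place_workload("in_motion"))
```

The point of the function is the shape of the decision, not the answers: three of the four branches are near-automatic, and all of the discussion in this section concerns the fourth.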