Option to talk about EMC's acquisition of Greenplum here. Key points:

EMC were the first to start talking about "big data". If appropriate, tell the story of the term "Big Data": I remember when Gartner first started publishing articles and papers on what they called "Extreme Data" – this was in the days when everything was "extreme". There was extreme sports (people jumping out of airplanes on surfboards), extreme performance; in IT people were talking about "extreme programming" (which became Agile), and Gartner talked about Extreme Data. Joe Tucci of EMC had already been talking about what he called Big Data, and saw that Gartner were calling the same concept Extreme Data. Joe didn't like the term Extreme – it didn't accurately depict what we were talking about – so he took a gamble and stuck with the term Big Data. Eventually Gartner and the analysts stopped talking about Extreme Data and started using the EMC term, Big Data. [This story is useful as it carries the meta-message of EMC as thought leaders.]

EMC were the first movers in acquiring a Big Data Analytics platform. We looked at Netezza (runs on proprietary hardware, can't be virtualised); Teradata (too big, too expensive, too established to pivot their business model); Vertica (good for some use cases but not for unpredictable ad-hoc queries and deep atomic-level analysis); AsterData (only addresses part of the 'big data' challenge); and a number of others – but there were some key reasons why Greenplum was the standout selection for a serious Big Data Analytics strategy.
Simple definition of Big Data. Some analysts talk about a 4th attribute, "complexity" – however, the complexity is really about the more complex queries and analytics we run against these data sets, not an attribute of the data itself. We'll be looking at some examples of what each of these three terms means.
Big Data is massive new data volumes.

This is a typical Australian electricity bill. How often do I get one of these? (guess) – every 3 months. Why? Because a person has to physically walk up to the meter and read it – very manual, and it can only be done a few times a year. The electricity company has a data warehouse which captures all their billing data; they use it to analyse usage patterns across parts of the network (and not much else). Their data warehouse might be 3 terabytes. Not huge.

This is a smart meter. The smart meter provides readings directly to the utility via wireless or mobile phone networks – every 5 minutes. So instead of one reading per customer every three months, we can access a record per customer every 5 minutes. The data has just grown 3000 times. (That's big data.) So suddenly the utility, to retain the same depth of customer data, needs 9 petabytes of data warehouse. Most of it becomes exhaust data – but there is tremendous value in this data if you can keep it and analyse it:
Network load analysis over time
Better decisions on where to increase network capacity
Real-time alerts when a particular power node is approaching saturation
You can also provide the information back to your customers…
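The scaling arithmetic above can be sketched quickly (the 5-minute cadence comes from the slide; the 91-day quarter and the alternative half-hourly cadence are illustrative assumptions):

```python
def reading_growth(interval_minutes: int, days: int = 91) -> int:
    """Readings per customer over one ~3-month billing quarter at the
    given cadence, versus a single manual read per quarter."""
    return (24 * 60 // interval_minutes) * days

# At a 5-minute cadence, one quarterly read becomes tens of thousands:
print(reading_growth(5))    # 26208
# Even a half-hourly cadence multiplies the data by thousands:
print(reading_growth(30))   # 4368
```

The exact multiplier depends on the cadence and retention you assume, but the order of magnitude – thousands of times more rows per customer – is what pushes a small warehouse into petabyte territory.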
This is the SilverSpring web site, a screenshot taken about 2 weeks ago. Notice the promise to the consumer – "See your energy use in real time". You only want to promise that if your database can perform – can handle huge numbers of ad-hoc queries from consumers accessing your website 24x7. SilverSpring use Greenplum to capture all the readings from their smart meters into a database, and they make this data available to their customers.
An example of what consumer-facing real-time electricity usage looks like.
But it's not just for the consumers – the real value for the utility is what they can do with the data themselves:
Predictive maintenance
Usage trends over time, down to the suburb and street level
Geo-spatial mappings over streets, looking for weather-related incidents, maintenance cost anomalies and so on.
(I worked with a utility in Sydney where, using their data warehouse, they were able to identify some motors and pumps that were starting and stopping several times an hour, and others that only cycled once or twice a day or week – so instead of blindly sending around a maintenance crew every 3 months, they could maintain some pumps every month and other pumps only twice a year. Savings of several $M.)
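The Sydney pump story above is essentially grouping assets by observed cycle frequency; here is a minimal hypothetical sketch (the pump IDs, event data and banding thresholds are all made up for illustration):

```python
from collections import Counter

# Hypothetical start events over a 30-day month: (pump_id, hour_of_start)
events = [("pump_A", h) for h in range(0, 720, 6)]     # cycles every ~6 hours
events += [("pump_B", h) for h in range(0, 720, 168)]  # cycles roughly weekly

starts_per_month = Counter(pump for pump, _ in events)

def maintenance_interval(starts: int) -> str:
    """Illustrative banding rule: busy pumps get serviced more often."""
    if starts > 100:
        return "monthly"
    if starts > 10:
        return "quarterly"
    return "twice yearly"

for pump, starts in sorted(starts_per_month.items()):
    print(pump, starts, maintenance_interval(starts))
```

In practice the event data would come out of the warehouse itself and the bands would be set by the maintenance engineers, but the shape of the analysis is this simple.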
(an example of variety) Here's the web site of an Australian bank. Every single click on this website is captured in a Greenplum database. Based on how users navigate through the site – what works, what doesn't, which ads attract more clicks – the marketing team manage which ads appear where, moving things around to test how customers respond and looking for the best response. They do this every 24 hours. This example was presented in 2011 to the Australian Institute of Analytics Professionals as an example of real business use for web-click analytics.
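The daily rollup the marketing team works from can be sketched in a few lines (the column names and sample rows are hypothetical):

```python
from collections import defaultdict

# Hypothetical click-stream rows captured from the site: (day, slot, ad_id)
clicks = [
    ("2011-05-01", "home_top", "ad_42"),
    ("2011-05-01", "home_top", "ad_42"),
    ("2011-05-01", "home_side", "ad_17"),
    ("2011-05-02", "home_top", "ad_17"),
]

# Tally clicks per day, page slot and ad – the kind of 24-hourly rollup
# used to decide which ads appear where:
tally = defaultdict(int)
for day, slot, ad in clicks:
    tally[(day, slot, ad)] += 1

for key in sorted(tally):
    print(key, tally[key])
```

In production this would be a GROUP BY over billions of rows in the database rather than an in-memory loop, but the aggregation is the same.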
(an example of velocity) Every trade on the NYSE is captured in a Greenplum database – 300,000 transactions per second. Analysts then have algorithms running on this data in real time, looking for patterns that suggest fraud, such as insider trading. This analysis requires atomic-level data (no summarisation – every trade) and many months of history to find the patterns they are looking for. The market regulator, FINRA, is also a Greenplum customer. They aggregate the trade data from many bourses, including Arca, NYSE Amex, NASDAQ, Euronext, and the International Securities Exchange (ISE). All these sources are aggregated, and FINRA does the analysis across all of them, looking for more sophisticated insider trading and other fraudulent activity that may be hidden across several exchanges.
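A quick sketch of why this is a velocity problem – the volumes implied by 300,000 transactions per second (a sustained rate is assumed here purely for illustration; real exchange load is bursty):

```python
# One 6.5-hour NYSE trading session at the quoted rate:
TPS = 300_000
TRADING_SECONDS_PER_DAY = int(6.5 * 3600)

trades_per_day = TPS * TRADING_SECONDS_PER_DAY
print(f"{trades_per_day:,} trades per session")   # 7,020,000,000

# Keeping "many months" at atomic level, e.g. ~126 sessions (6 months):
retained = trades_per_day * 126
print(f"{retained:,} trades retained")            # 884,520,000,000
```

Hundreds of billions of atomic rows kept online for pattern analysis is exactly the workload a traditional warehouse struggles with.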
The purpose of this slide is to establish that it's not (just) EMC saying that data warehousing needs to change – this comes from Ralph Kimball, one of the fathers of data warehousing. Here are some examples of the kind of analytics that can be run against different types of data, and the kind of insights you can expect to gain. His point is that traditional data warehousing architectures don't cater for these types of analytics.
This is traditional data warehousing – it has basically not changed in 20 years. ("In the past, when I was consulting and advising in Information Management strategy, I used to use this fact to reassure the client – we are treading a well-worn path here, all the mistakes have been made, this is not new, it's been around for 20 years. However, the time has come for this approach to be revisited, as it's not flexible and agile enough to meet accelerating rates of business change and increasing data complexity and volumes.")

On the left we have the "source systems" – usually transaction systems, eg SAP, Oracle apps, Siebel, CRM. (In insurance you would have a policy system – or more than one – and one or more claims systems. In a bank you would have the core banking transaction system, plus mortgage systems, credit card systems, margin lending, and so on.) To get a "single view of customer" you have to bring all this data together into one data model.

Talk about Bill Inmon, "father of data warehousing", and the Corporate Information Factory – the ideal world in which all information assets across the organisation are brought together into a comprehensive Enterprise Data Model, and then any question about any aspect of the business can be answered by this magic data model.

So: (click) We transform the data and conform it all into a Consolidated Data Repository or Integrated Data Model (or Enterprise Data Model). These are very complex data models that can take years to design and build. Often a company will buy one from IBM or Teradata because it takes too long to design your own – cheaper to get one off the shelf. BUT then you have to integrate all your data sources into the data model – a lot of work. One bank in Singapore spent $10M to buy one of these data models, then had to spend another $14M integrating their data sources into it. (Big banks, insurers and other companies had good success with this approach – 20 years ago.
And the approach has remained pretty much unchanged in 20 years.)

But now the data model is so complex that you can't have the business people working with it directly – so you create a data mart layer with simpler data models, and then expose these data marts to the analysts through Business Intelligence tools like Cognos, Business Objects, MicroStrategy, Tableau, QlikView and so on. But the pace of business has changed – and data warehousing has not kept up. This creates a lot of pain in the world of information management. I think I can summarise the pain of data warehousing in a few points. (Note to presenter: depending on the audience and the time available, the following content has to be filtered.)

First, it's expensive. (We know that, but that fact drives certain behaviours that have a detrimental effect on the ability to get value from this asset.) A typical project in data warehousing starts from a few hundred thousand dollars and ranges up to the millions. (We used to joke: when the business comes to us with a question, the answer is always the same – 6 months and half a million dollars. That's how long it will take to enhance the EDW to answer the question.) It's expensive in terms of raw hardware costs and licensing – often a multi-million dollar investment to kick off, then ongoing annual licence fees for support. It's expensive to "feed and water" – the resources needed to troubleshoot data loads, create and manage partitions and indexes, tune queries and so on. And it's expensive to develop and enhance the data model – a typical BI project on an Enterprise Data Warehouse starts at a few hundred thousand dollars and 3+ months of design, development and testing, and in a large site can easily run to millions or tens of millions.
(Cite an example or two – we are talking with a bank at the moment about a risk project in response to new requirements from the regulator; they are already throwing around figures as high as $80 million – and this does not create any new data, it just reports on data they already have.)

Because of this cost, the usefulness of the DW is constrained. Often, data is summarised after a certain point because we can't use too much storage – it's too expensive. (When I worked at a telco, we would keep two months of atomic-level data, and then an end-of-month process would summarise the third month into a much smaller data set – meaning we lost a huge amount of valuable detail in the call data records. Why did we do this? The DW wasn't big enough to keep the data, and getting more TB was just too expensive.) So we compromise on the quality and depth of the data we keep and the analysis we can provide, because of this cost factor.

Analytics projects often need temporary space on the DW to work with subsets of data – in our case there were always projects competing for space. Marketing campaigns would have to reserve a few hundred GB between specified dates, and if another project ran over its end date there would be a fight over who gets the space. It often came down to raw dollar impact: if I kill the project now to give you the space, the impact is x million dollars; but if marketing don't run their campaign, they lose y dollars in sunk cost and z million dollars in opportunity cost… and so on. Why couldn't we just get more space? It was too expensive and took too long.

Finally, the cost of the DW meant that it was rigidly protected. Access to the data required certain authorisations – sometimes even just running a query needed certain permissions – and it slowed everyone down.
So if an analyst had a great idea about running some scenarios over a market segment, they had to decide whether it was worth the effort of getting permission to run the query (that could take half a day); then once they'd run it they would want to tweak it a little here and there, resubmit, tinker a bit, resubmit, and on and on. So the usefulness and value to the business of this great investment was constrained by the very fact that it was such a huge investment.

Second, enhancements and new projects take far too long. Time to value is just too great – often by the time a project is completed, the business has moved on and the requirements have changed. One of the reasons is the use of rigid, sometimes purchased, industry or enterprise data models. New data sources needed for a new reporting subject area (whether bringing risk data into the enterprise financial data model, or a new customer information source into the marketing data mart) have to be mapped into the enterprise data model; then ETL jobs need to be designed, developed and tested; then there needs to be impact analysis on existing reports to see if any are affected, and if so, another project to remediate those reports so that the business is not disrupted when the new project goes live. All this is not just expensive – it takes a long time.

Third, the barriers to easy access to the protected DW drive analysts and users to hive off and create their own data sets to which they have unrestricted access. When it's hard to get access to a dataset, or hard to get some space in the EDW for a sand-box, the user community will keep copies of the data that matters most to them in local data stores in Excel or Access, or maintain their own data marts in SQL Server or Oracle. These data sub-sets are irregularly refreshed, not subject to quality controls, and not auditable.
Yet they are often used to derive figures that go to the board or to regulatory authorities, or to manage a whole reporting area such as Risk – and the numbers from these data repositories are passed to report assemblers who put these figures into monthly board reports, annual reports or regulatory reports (for example). The whole purpose of the data warehouse was to retire these fragmented data repositories and keep the data centralised where its quality can be assured, and yet the difficulty of getting to the data in the controlled environment drives the user community to bypass those very safeguards. And… all of these problems are about to get worse! (transition to next slide)
(10 minute mark) (so where we were just managing or surviving with the data we already have, we’re about to be deluged with a tsunami of Big Data.)
Big Data will revolutionise DW and analytics. In summary, these are the challenges that traditional data warehousing faces from Big Data and Big Data Analytics.
There's a lot on this slide and you can't talk about all of it. The purpose of the slide is to impress on the audience that we (EMC) have done a lot of deep and comprehensive thinking on how Information Management is going to change as a result of Big Data. Big Data doesn't need to be a threat – there is a journey from where we are to where we need to go, there are companies at various stages along this journey, and we are helping many of them. It's a holistic journey that isn't just about technology – it's about all of these (4) layers. Big Data and Big Data Analytics have implications for all of these aspects of an Information Management strategy. An IM strategy has to address each of them – where we are now, where we believe we need to be, and how we are going to get there. I want to take some time to look at each of these aspects and highlight what that journey looks like in some detail.
Key points – talk down the first column, then the second (not row by row). Read up on the model-less (or transformation-less) warehouse, and on data vault. This is relatively new, and there will be old-school people whose whole careers and credibility are built on the old way of doing things. You can't dismiss that – it's a both-and, not an either-or (see later slides). If it gets confrontational, offer to take it "offline" and talk in person later. The defence is that it's not EMC saying this – Kimball is saying it, the thought leaders are saying it. Kimball has a paper on this – offer to send it to the audience. If you know the "end of science" story, you can tell that as well.
This slide you can talk to row by row.
There are other points that could be added – the slide does not have all of them. Eg:
Tightly controlled data dictionary -> wiki as data asset inventory
Classic waterfall report development processes -> iterative prototypes with frequent review cycles in the UAP itself
Defined / pre-canned reports -> many one-off queries for data exploration
The scope of the traditional data warehouse will shrink to those subject areas that require and justify the heavy governance that has traditionally been used to manage highly important information that needs to be accurate to the unit or to the cent. Eg reporting to the market, to the CFO, to regulators, to the board. (There will still need to be data integration eg if a bank has credit card holders who are also mortgage holders, they need to report on total exposure. Or if a telco has pre-paid and post-paid customers, they need to report to regulators on the size of their customer base) But much of what the data warehouse is used for does not require this rigour. And almost all of what Big Data is used for does not require this rigour. So we are seeing the emergence of a different kind of platform – the Analytical Platform for Big Data.
(this slide teases out the points from the last one) Don’t spend time on this slide – it’s just a transition to introduce the Unified Analytics Platform (It’s important here not to go into detail on describing the analytical platform because there are slides following to do that – otherwise the rest of the presentation becomes redundant/repetitive)
Here’s an example of a customer who used the Analytics Platform for some analysis that couldn’t run on the EDW (as it would have interrupted important BAU reports – can’t run full table scans on the entire call record history – everything else grinds to a halt.)
(Note, this slide needs to go, the story is too old – replace with your favourite big data value story)
This slide transitions into a Greenplum-specific conversation – much of this change is internal (and we can provide advice and consulting to assist), but in the technology layer we can provide solutions.
Is BCC using something like that? How are the travel time functions determined?
Travel time function for road connection 10491 to area 10784
The growth trajectory of data has already surpassed the capability of today's databases to adequately store and efficiently process it. We are seeing a fault line developing and widening rapidly between the demand for better technology solutions to handle the data explosion and the supply of them. Final pitch – organisations can already see the writing on the wall; this survey was taken in 2011. (TDWI = The Data Warehousing Institute)
So this is what the stack looks like at a high level. There's the Greenplum database for co-processing structured, semi-structured, and unstructured data with Greenplum Hadoop. These are overlaid with a unified data access and query layer that combines the programming languages of choice (SQL, MapReduce, etc.). Over the access layer comes our partner tool and services layer. We are not about locking customers into a single tool or stack; instead we work with the tool vendor of your choice, be it SAS or R, MicroStrategy or Informatica. And sitting atop all of those technologies is the Chorus layer, which provides productivity tools to facilitate collaboration between the different stakeholders.

What sets this diagram apart from a typical vendor diagram is the inclusion of people. That is not a mistake. We have introduced the Unified Analytics Platform, but there is more to the story than technology, and I will talk more about that in a few minutes. UAP is about enabling data analytics practitioners to access and manage datasets and projects much more easily. A typical team can include the data platform administrator, data scientists, analysts, engineers, BI teams and, most importantly, the line-of-business user, and UAP shapes how they participate on this data science team. We develop, package and support this as a unified software platform available on your favourite commodity hardware, cloud infrastructure, or from our modular Data Computing Appliance.
Key to the success of the new approach to Information Management is the ability to collaborate and share knowledge within the same environment that manages the sand-pits and datasets.