Information Management in the Age of Big Data

Mark Burnard, EMC Greenplum
Meetup #2, 27 Mar 2012 - http://sydney.bigdataaustralia.com.au/events/53934632/

  • Option to talk about EMC’s acquisition of Greenplum here. Key points: EMC were the first to start talking about “big data”. If appropriate, tell the story of the term “Big Data”: I remember when Gartner first started publishing articles and papers on what they called “Extreme Data” – this was in the days when everything was “extreme”. There was extreme sports (people jumping out of airplanes on surfboards), extreme performance; everything was becoming “extreme”. In IT, people were talking about “extreme programming” (which became Agile), and Gartner talked about Extreme Data. Joe Tucci of EMC had already been talking about what he called Big Data, and saw that Gartner were calling the same concept Extreme Data. Joe didn’t like the term Extreme – it didn’t accurately depict what we were talking about. So he took a gamble and stuck with the term Big Data. Eventually Gartner and the analysts stopped talking about Extreme Data and started using the EMC term, Big Data. [This story is useful as it carries the meta-message of EMC as thought leaders.] EMC were the first movers in acquiring a Big Data Analytics platform. We looked at Netezza (runs on proprietary hardware, can’t be virtualised); we looked at Teradata (too big, too expensive, too established to be able to pivot their business model); we looked at Vertica (good for some use cases but not for unpredictable ad-hoc queries and deep atomic-level analysis); we looked at AsterData (only addresses part of the ‘big data’ challenge); and we looked at a number of others, but there were some key reasons why Greenplum was the standout selection for a serious Big Data Analytics strategy.
  • Simple definition of Big Data. Some analysts talk about a 4th attribute called “complexity” – however, the complexity is more around the fact that we’re running more complex queries and analytics against these data sets; it’s not really an attribute of the data itself. We’ll be looking at some examples of what each of these three terms means.
  • Big Data is massive new data volumes. This is a typical Australian electricity bill. How often do I get one of these? (guess) – every 3 months. (Why? Because a person has to physically walk up to the meter and read it – very manual. It can only be done a few times a year.) Now, the electricity company has a data warehouse which captures all their billing data. They use it to analyse usage patterns across parts of the network (and not much else). Their data warehouse might be 3 terabytes. Not huge. This is a smart meter. The smart meter provides readings directly to the utility via wireless or mobile phone networks – every 5 minutes. So instead of one reading per customer every three months, we can access a record per customer every 5 minutes. So the data has just grown 3,000 times. (That’s big data.) So suddenly the utility, to retain the same level of customer data, needs a 9-petabyte data warehouse. Most of it becomes exhaust data – but there is tremendous value in this data if you can keep it and analyse it: network load analysis over time; better decisions on where to increase network capacity; real-time alerts when a particular power node is approaching saturation. You can also provide the information back to your customers…
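A quick sanity check of the reading-frequency arithmetic behind slide 5, as a minimal SQL sketch (the frequencies and the monthly baseline are assumptions taken from the chart, not from any real schema):

    -- Readings per meter per year at different reading frequencies,
    -- and the growth multiple relative to a monthly manual read.
    SELECT freq_label,
           reads_per_year,
           round(reads_per_year / 12.0) AS growth_vs_monthly
    FROM (VALUES ('monthly',         12),
                 ('daily',          365),
                 ('every 6 hours', 1460),
                 ('hourly',        8760),
                 ('every 15 min', 35040)
         ) AS f(freq_label, reads_per_year);

The 3,000x figure on the slide corresponds to 15-minute reads against a monthly baseline; the 5-minute example in the note would give an even larger multiple.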
  • This is the SilverSpring web site, a screenshot taken about 2 weeks ago. Notice the promise here to the consumer – “See your energy use in real time”. You only want to promise that if your database can perform – can handle huge numbers of ad-hoc queries from consumers accessing your website 24 x 7. SilverSpring use Greenplum to capture all the readings from their smart meters into a database, and they make this data available to their customers.
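For illustration only, this is the shape of the per-consumer query such a site would fire at the database around the clock (hypothetical table and column names, not SilverSpring’s actual schema):

    -- Today's energy use for one customer, rolled up by hour.
    SELECT date_trunc('hour', reading_time) AS reading_hour,
           sum(kwh)                         AS kwh_used
    FROM meter_readings
    WHERE meter_id = 12345
      AND reading_time >= date_trunc('day', now())
    GROUP BY 1
    ORDER BY 1;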
  • An example of what consumer-facing real-time electricity usage looks like.
  • But it’s not just for the consumers – the real value for the utility is what they can do with the data themselves: predictive maintenance; usage trends over time down to the suburb and street level; geo-spatial mappings over streets looking at weather-related incidents, maintenance cost anomalies and so on. (I worked with a utility in Sydney where, using their data warehouse, they were able to identify some motors and pumps that were starting and stopping several times an hour, and others that only cycled once or twice a day or week – so instead of blindly sending around a maintenance crew every 3 months, they could maintain some pumps every month and other pumps only twice a year. Savings of several $M.)
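A minimal sketch of the kind of query behind the pump-cycling insight, assuming a hypothetical table of pump start events (names are illustrative, not the utility’s actual schema):

    -- Average starts per hour for each pump over the last 30 days;
    -- frequently cycling pumps float to the top of the maintenance list.
    SELECT pump_id,
           count(*) / (30.0 * 24) AS avg_starts_per_hour
    FROM pump_start_events
    WHERE event_time >= now() - interval '30 days'
    GROUP BY pump_id
    ORDER BY avg_starts_per_hour DESC;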
  • (an example of variety) Here’s the web site of an Australian bank. Every single click on this website is captured in a Greenplum database. The marketing team can then look at how people navigate around the site, what works, what doesn’t work, and which ads attract more clicks, and then they can move things around to test how customers respond. Based on how users navigate through the site and which ads are clicked on, the marketing team manage which ads appear where, looking for the best response. They do this every 24 hours. This example was presented in 2011 to the Institute of Analytics Professionals of Australia as an example of real business use for web-click analytics.
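A hedged sketch of the sort of 24-hour ad-response query the marketing team might run; the clickstream table and its columns are invented for illustration (the bank’s real schema is not described in the deck):

    -- Click-through rate by ad and page position over the last 24 hours.
    SELECT ad_id,
           page_position,
           sum(CASE WHEN event_type = 'click'      THEN 1 ELSE 0 END)::numeric
             / NULLIF(sum(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END), 0)
             AS click_through_rate
    FROM web_events
    WHERE event_time >= now() - interval '24 hours'
    GROUP BY ad_id, page_position
    ORDER BY click_through_rate DESC;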
  • (an example of velocity) Every trade on the NYSE is captured in a Greenplum database – 300,000 transactions per second. Analysts then have algorithms running on this data in real time, looking for patterns that suggest fraud, such as insider trading. This analysis requires atomic-level data (no summarisation – every trade) and many months of data to find the patterns they are looking for. The market regulator, FINRA, is also a Greenplum customer. They actually aggregate the trade data from many bourses, including Arca, NYSE Amex, NASDAQ, Euronext, and the International Securities Exchange (ISE). All these sources are aggregated, and FINRA run the analysis across all of them, looking for more sophisticated insider trading and other fraudulent activity that may be hidden across several exchanges.
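To make “atomic-level data over many months” concrete, here is a deliberately simplified, hypothetical surveillance-style query over an aggregated trade feed (table names, columns and thresholds are invented for illustration; real surveillance models are far more sophisticated):

    -- Accounts that accumulated an unusually large position in a symbol
    -- during the 10 days before a large single-day price move.
    SELECT t.account_id,
           t.symbol,
           count(*)        AS trades_before_move,
           sum(t.quantity) AS shares_accumulated
    FROM trades t
    JOIN price_moves m
      ON m.symbol = t.symbol
     AND t.trade_time BETWEEN m.move_time - interval '10 days' AND m.move_time
    WHERE m.one_day_return > 0.15
    GROUP BY t.account_id, t.symbol
    HAVING sum(t.quantity) > 100000
    ORDER BY shares_accumulated DESC;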
  • The purpose of this slide is to establish that it’s not (just) EMC who are saying this about data warehousing needing to change – this comes from Ralph Kimball, one of the fathers of data warehousing. And here are some examples of the kind of analytics that can be run against different types of data, and the kind of insights you can expect to gain. His point is that traditional data warehousing architectures don’t cater for these types of analytics.
  • This is traditional data warehousing – it basically hasn’t changed in 20 years. (“In the past, when I was consulting and advising in Information Management Strategy, I used to use this fact to reassure the client – we are treading a well-worn path here, all the mistakes have been made, this is not new, it’s been around for 20 years. However, the time has come that this approach does need to be revisited, as it’s not flexible and agile enough to meet the accelerating rates of business change, and increasing data complexity and volumes.”) On the left we have the “source systems” – usually transaction systems, eg SAP, Oracle apps, Siebel, CRM… (In insurance you would have a policy system – or more than one – and one or more claims systems; in a bank you would have the core banking transaction system, plus mortgage systems, plus credit card systems, plus margin lending, and so on.) To get a “single view of customer” you have to bring all this data together into one data model. Talk about Bill Inmon, “father of data warehousing”, and the Corporate Information Factory – the ideal world in which all information assets across the organisation are brought together into a comprehensive Enterprise Data Model, and then any question about any aspect of the business can be answered by this magic data model. So: (click) we transform the data and conform it all into a Consolidated Data Repository or Integrated Data Model (or Enterprise Data Model). These are very complex data models. They can take years to design and build. Often a company will buy one from IBM or Teradata because it takes too long to design your own – cheaper to get one off the shelf. BUT then you have to integrate all your data sources into the data model – a lot of work. One bank in Singapore spent $10M to buy one of these data models, then had to spend another $14M integrating their data sources into it. (Big banks, insurers and other companies had good success with this approach – 20 years ago. And the approach has remained pretty much unchanged in 20 years.) But now the data model is so complex that you can’t have the business people working with it directly – so you create a data mart layer with simpler data models. And then you expose these data marts to the analysts through Business Intelligence tools like Cognos, Business Objects, MicroStrategy, Tableau, QlikView and so on. But the pace of business has changed – and data warehousing has not kept up. This creates a lot of pain in the world of information management. I think I can summarise the pain of data warehousing in a few points. (Note to presenter: depending on the audience and the time available, the following content has to be filtered.)
First, it’s expensive. (We know that, but that fact drives certain behaviours that have a detrimental effect on the ability to get value from this asset.) A typical project in data warehousing starts from a few hundred thousand dollars and ranges up to the millions. (We used to joke: when the business comes to us with a question, the answer is always the same – six months and half a million dollars. That’s how long it will take to enhance the EDW to answer the question.) It’s expensive just in terms of raw hardware costs and licensing – often a multi-million dollar investment to kick off, and then ongoing annual licence fees for support. It’s also expensive to “feed and water” – the resources that are needed to troubleshoot data loads, create or manage partitions and indexes, tune queries and so on. And it’s expensive to develop and enhance the data model – a typical BI project on an Enterprise Data Warehouse starts at a few hundred thousand dollars and 3+ months of design, development and testing, and in a large site can easily be in the millions or tens of millions. (Cite an example or two – we are talking with a bank at the moment about a risk project in response to new requirements from the regulator; they are already throwing around figures as high as $80 million, and this does not create any new data, it just reports on data they already have.) Because of this cost, the usefulness of the DW is constrained. Often, data is summarised after a certain point because we can’t use too much storage – it’s too expensive. (When I worked at a telco, we would keep two months of atomic-level data, and then an end-of-month process would summarise the third month into a much smaller data set – meaning we lost a huge amount of valuable detail in the call data records. Why did we do this? The DW wasn’t big enough to keep the data, and getting more TB was just too expensive.) So we compromise on the quality and depth of the data we can keep and the analysis we can provide, because of this cost factor. Analytics projects often need temporary space on the DW to work with subsets of data – in our case there were always projects competing for space. Marketing campaigns would have to reserve a few hundred GB between specified dates, and if another project ran over its end date, there would be a fight over who gets the space. It often came down to raw dollar impact: if I kill the project now to give you the space, the impact is x million dollars; but if marketing don’t run their campaign, they lose y dollars in sunk cost and z million dollars in opportunity cost… and so on. Why couldn’t we just get more space? It’s too expensive and takes too long. Finally, the cost of the DW meant that it was rigidly protected. Access to the data required certain authorisations – sometimes even just running a query needed certain permissions – and it slowed everyone down. So if an analyst had a great idea about running some scenarios over a market segment, they had to decide whether it was worth going through the effort of getting permission to run the query (that could take half a day); then once they’ve run it they want to tweak it a little here and there, resubmit, tinker a bit, resubmit, and on and on. So the usefulness and value to the business of this great investment was constrained by the fact that it was such a huge investment.
Second, enhancements and new projects take far too long. Time to value is just too great. Often, by the time a project is completed, the business has moved on and the requirements have changed. One of the reasons for this is the use of rigid, sometimes purchased, industry or enterprise data models. New data sources, if needed for a new reporting subject area (whether bringing risk data into the enterprise financial data model, or bringing a new customer information source into the marketing data mart), need to be mapped into the enterprise data model; then ETL jobs need to be designed, developed and tested; there needs to be impact analysis on existing reports to see if any are affected; and if so, there’s another project to remediate those reports so that the business is not disrupted when the new project goes live. All this is not just expensive but takes a long time.
Third, the barriers to easy access to the protected DW drive analysts and users to hive off and create their own data sets to which they have unrestricted access. When it’s hard to get access to a dataset, or hard to get some space in the EDW for a sandbox, the user community will preserve copies of the data that matter most to them in local data stores in Excel or Access, or keep their own data marts in SQL Server or Oracle. These data sub-sets are irregularly refreshed, not subject to quality controls, and not auditable. Yet they are often used to derive figures that go to the board or to regulatory authorities, or to manage a whole reporting area such as risk – and the numbers from these data repositories are passed to report assemblers who put these figures into monthly board reports, annual reports or regulatory reports (for example). The whole purpose of the data warehouse was to retire these fragmented data repositories and keep the data centralised where its quality can be assured, and yet the difficulty of getting to the data in the controlled environment drives the user community to bypass those very safeguards. And… all of these problems are about to get worse! (transition to next slide)
  • (10 minute mark) (So whereas we were just managing or surviving with the data we already have, we’re about to be deluged with a tsunami of Big Data.)
  • Big Data will revolutionise DW and analytics. In summary, these are the challenges that traditional data warehousing faces from Big Data and Big Data Analytics.
  • There’s a lot on this slide and you can’t talk about all of it. The purpose of the slide is to impress on the audience that we (EMC) have done a lot of deep and comprehensive thinking on how Information Management is going to change as the result of Big Data. Big Data doesn’t need to be a threat – there is a journey from where we are to where we need to go, there are companies at various stages along this journey, and we are helping many of them. It’s a holistic journey that isn’t just about technology – it’s about all of these (4) layers. Big Data and Big Data Analytics have implications for all of these aspects of an Information Management Strategy. An IM Strategy has to address each of these aspects – where we are now, where we believe we need to be, and then how we are going to get there. I want to take some time to look at each of these aspects and highlight what that journey looks like in some detail.
  • Key points – talk down the first column then the second (not row by row). Read up on the model-less warehouse or transformationless warehouse, and on data vault. This is relatively new and there will be old school people who have their whole careers and credibility built on the old way of doing things. You can’t dismiss that – it’s a both-and, not either-or (see later slides). If it gets confrontational, offer to take it “offline” and talk in person later. The defense is that it’s not EMC saying this, Kimball is saying it, the thought leaders are saying it. Kimball has a paper on this – offer to send it to the audience. If you know the “end of science” story you can tell that as well.
  • This slide you can talk to row by row.
  • There are other points that could be added – the slide does not have all of them. Eg: tightly controlled data dictionary -> wiki as data asset inventory; classic waterfall report development processes -> iterative prototypes with frequent review cycles in the UAP itself; defined / pre-canned reports -> many one-off queries for data exploration.
  • The scope of the traditional data warehouse will shrink to those subject areas that require and justify the heavy governance that has traditionally been used to manage highly important information that needs to be accurate to the unit or to the cent. Eg reporting to the market, to the CFO, to regulators, to the board. (There will still need to be data integration eg if a bank has credit card holders who are also mortgage holders, they need to report on total exposure. Or if a telco has pre-paid and post-paid customers, they need to report to regulators on the size of their customer base) But much of what the data warehouse is used for does not require this rigour. And almost all of what Big Data is used for does not require this rigour. So we are seeing the emergence of a different kind of platform – the Analytical Platform for Big Data.
  • (this slide teases out the points from the last one) Don’t spend time on this slide – it’s just a transition to introduce the Unified Analytics Platform (It’s important here not to go into detail on describing the analytical platform because there are slides following to do that – otherwise the rest of the presentation becomes redundant/repetitive)
  • Here’s an example of a customer who used the Analytics Platform for some analysis that couldn’t run on the EDW (as it would have interrupted important BAU reports – can’t run full table scans on the entire call record history – everything else grinds to a halt.)
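For context, this is the kind of full-scan aggregation involved; it mirrors the social-graph idea on the T-Mobile slides later in the deck, sketched over a hypothetical call-detail-record table (names are illustrative):

    -- Build an undirected call-graph edge list from raw call detail records:
    -- one row per pair of subscribers, weighted by call count and total minutes.
    SELECT least(caller_id, callee_id)    AS subscriber_a,
           greatest(caller_id, callee_id) AS subscriber_b,
           count(*)                       AS calls,
           sum(duration_sec) / 60.0       AS total_minutes
    FROM call_detail_records
    GROUP BY 1, 2;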
  • (Note, this slide needs to go, the story is too old – replace with your favourite big data value story)
  • This slide is to transition into a Greenplum-specific conversation – much of this change is internal (and we can provide advice and consulting to assist), but in the technology layer we can provide solutions.
  • Here’s an example of a customer who used the Analytics Platform for some analysis that couldn’t run on the EDW (as it would have interrupted important BAU reports – can’t run full table scans on the entire call record history – everything else grinds to a halt.)
  • Is BCC using something like that? How are the travel time functions determined?
  • Travel time function for road connection 10491 to area 10784
  • Travel time function for road connection 10491 to area 10784
  • The growth trajectory of data has already surpassed the capability of today’s databases to adequately store and efficiently process it. We are seeing a fault line developing and widening rapidly between the demand for and the supply of better technology solutions to handle the data explosion. Final pitch – organisations can already see the writing on the wall; this survey was taken in 2011. (TDWI = The Data Warehousing Institute)
  • So this is what the stack looks like from a high level. There’s the Greenplum database for co-processing structured, semi-structured, and unstructured data with Greenplum Hadoop. These are overlaid with a unified data access and query layer that combines the programming languages of choice (SQL, MapReduce, etc.). Over the access layer comes our partner tool and services layer. We are not about locking customers into a single tool or stack; instead we work with the tool vendor of your choice, be it SAS or R, MicroStrategy or Informatica. And sitting atop all of those technologies is the Chorus layer, which provides productivity tools to facilitate collaboration between the different stakeholders. What sets this diagram apart from a typical vendor example is the inclusion of people. That is not a mistake. We have introduced the Unified Analytics Platform, but there is more to the story than technology, and I will talk more about that in a few minutes. UAP is about enabling data analytics practitioners to access and manage datasets and projects much more easily. A typical team can include the data platform administrator, data scientist, analysts, engineers, BI teams, and most importantly the line-of-business user, and how they participate on this data science team. We develop, package, and support this as a unified software platform available on your favorite commodity hardware, cloud infrastructure, or from our modular Data Computing Appliance.
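As a sketch of what co-processing structured and Hadoop-resident data behind a unified query layer can look like, assuming the gphdfs external-table protocol Greenplum offered in that era (host, path, table and column names are illustrative):

    -- Expose click logs stored in Greenplum Hadoop (HDFS) as an external table,
    -- then join them with a relational customer table in a single SQL query.
    CREATE EXTERNAL TABLE clickstream_ext (
        user_id    bigint,
        url        text,
        event_time timestamp
    )
    LOCATION ('gphdfs://namenode:8020/data/clickstream/')
    FORMAT 'TEXT' (DELIMITER '|');

    SELECT c.segment,
           count(*) AS clicks
    FROM clickstream_ext e
    JOIN customers c ON c.user_id = e.user_id
    GROUP BY c.segment;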
  • Key to the success of the new approach to Information Management is the ability to collaborate and share knowledge within the same environment that manages the sand-pits and datasets.
  • Option to talk about EMC’s acquisition of Greenplum here, if not covered earlier (see the note for the opening slide – the “Extreme Data” story and the platform evaluation of Netezza, Teradata, Vertica and AsterData).
  • Information Management in the Age of Big Data

    1. Information Management in the Age of Big Data – Mark Burnard, EMC Greenplum, March 2012 – mark.burnard@emc.com
    2. So what is “Big Data”? Big Data is: massive new data volumes; and new data types; generated by many new devices. (Volume – Variety – Velocity)
    3. Volume
    4. (image slide)
    5. Meter Data is Growing Exponentially – chart of reads per year by meter-reading frequency: 12, 365, 1,460, 8,760 and 35,040 reads per year (30x, 120x, 700x and 3,000x).
    6. (image slide)
    7. Big Data use case: Smart Meter data (Consumer view)
    8. Big Data use case: Smart Meter data (Utility view)
    9. Variety
    10. (image slide)
    11. Velocity
    12. (image slide)
    13. Use Cases for Big Data Analytics from Ralph Kimball – “…systems to support big data analytics have to look very different than the classic relational database systems from the 1980s and 1990s. The original RDBMSs were not built to handle any of these requirements!” – Ralph Kimball. Source: Kimball, Ralph, “The Evolving Role of the Enterprise Data Warehouse in the Era of Big Data Analytics”.
    14. Traditional Data Warehousing (and Business Intelligence and Business Analytics): it’s expensive; enhancements and projects take too long; it drives people to create their own “data fiefdoms”.
    15. The challenges of Data Warehousing… are now exacerbated by the era of Big Data.
    16. Big Data will revolutionise Data Warehousing and analytics. New Realities, New Demands: do it faster (Volume: ingest more data; Velocity: ingest it faster); manage new data types (Variety: manage and allow queries across structured, semi-structured and unstructured data); be more flexible (unpredictable queries, rapidly evolving bespoke analytics; new tools: Hadoop, MapReduce, Hive, HBase, “R”); do it at a lower cost (and keep it unsummarised, and keep it for longer).
    17. Information Management Strategy for Big Data (Current State / Target State / Transition Plan):
        People & Skillsets – Current: assessment of current organisational structure and capabilities vs requirements of the future state. Target: required skillsets and organisational structure to support the future state. Transition: resource gap, training plan and insource/outsource/supplement model.
        Processes & Methodology – Current: review of current methodologies, processes & governance vs fit for purpose (future state). Target: sustainable approach to information management in light of the differing levels of governance needed. Transition: incremental approach to implement new processes, methodology and governance.
        Information Architecture – Current: review of requirements and fitness for purpose; map of datamart fiefdoms. Target: demarcation of subject areas by level of rigour in data management; new data models & data management frameworks. Transition: implementation plan for new platforms, models & frameworks, and absorption of datamart fiefdoms.
        Technology Architecture – Current: review of current platforms & capabilities vs business needs. Target: required future technology platforms, ecosystem and architecture. Transition: roadmap for implementing target-state technologies, prioritised by business benefit.
    18. Old School (Data Model-centric Information) vs New School (Business-centric Information):
        Old: driven by the Enterprise Data Model (Corporate Information Factory). New: driven by business need to turn data into information, and by business-led projects (long- and short-term).
        Old: huge effort and expense in transforming, cleansing and matching data (conformed dimensions etc). New: little or no transformation – business logic is pushed out to the business (eg the "Transformationless Warehouse", or "Data Vault").
        Old: big challenges and expenses in managing metadata, data lineage, MDM integration. New: simple data lineage, reduced need for metadata management; Master Data is just another data source.
        Old: data loads from multiple systems must be coordinated and inter-dependencies managed in the ETL scheduling tool and framework. New: different data sources can update the UAP at different intervals, from trickle-feed to hourly/nightly/weekly/monthly/ad-hoc, as long as the users know when the last refresh occurred; some datasets are "pointers" to external data sources – no replication.
        Old: structured data. New: structured, semi-structured and unstructured data.
        Old: often forced to work with subsets of data, or forced to summarise data older than n days/months/years. New: platform handles analytics on full datasets, unsummarised -> much richer insights (Wired Magazine: "The End of Science").
    19. Old School (Constrained by Technology) vs New School (Empowered by Technology):
        Old: high cost of space and performance means access/use is rationalised/restricted. New: low cost of space and performance means teams can cycle queries and investigations much faster -> different way of working: more cycles -> more accurate results.
        Old: adding new data sources or developing new data marts / subject areas typically takes months. New: adding a new data source to the platform takes minutes, and the logic to integrate the data source is applied by the business / analyst.
        Old: architecture is usually "scale up" – requires expensive offload-copy-restore when increasing capacity. New: architecture is "scale out" – add capacity without downtime; possible to use a "hybrid cloud" model to add capacity on demand during peak periods.
        Old: Dev, Test and DR environments require their own servers, maintenance etc. New: Dev, Test and DR can be virtual machines, provisioned and scaled on demand.
        Old: processing is in ETL servers, in database, and in BI application servers. New: processing is almost entirely in database; data movement is minimised.
        Old: many orphan data marts on PCs, laptops, servers. New: need for user-created marts is met on the Unified Analytics Platform – safety with flexibility.
    20. Old School (IT-centric and control-heavy Processes) vs New School (Trust and enablement):
        Old: safety is in IT control. New: safety is in knowledge management, collaboration and peer review.
        Old: precision needed – must reconcile and must be exact; the gold standard applies to all data in the enterprise data model. New: approximate results may be acceptable (depending on the business use case).
        Old: enforce simplicity – hide complexity from the business (dumb it down; drag and drop from a restricted semantic layer). New: expose complexity; trust the team; build and iterate reports from whatever data sources you need (and are authorised to access).
        Old: emphasis on process – fill out the form, submit the request. New: emphasis on self service.
        Old: information supports "rear view mirror" reporting on the past. New: information enables forward-looking insights -> supports innovation centres and business process re-engineering or tweaking.
        Old: analysts react to difficulty accessing data by creating copies of data in "off the radar" databases; the logic applied is unauditable. New: analytical sandpits are supported on the UAP – the logic applied can be peer reviewed in the platform.
    21. Old School (Information consumption – fixed reporting) vs New School (Information-led Innovation – flexible exploring):
        Old: focus is on standard reports for directors and managers (analysts get the leftovers). New: reporting is so BAU it is not the focus; analysts are empowered to get creative and add much more value.
        Old: business doesn't trust the warehouse (the logic applied in transformations is opaque). New: business has control of the logic and transformations (if you don’t trust it… fix it yourself – you built it!).
        Old: single platform, single RDBMS, with many "off the radar" data marts. New: multiple data types and repositories (RDBMS, Hadoop, text, logs) – must be accessible via an overlying single interface/platform (UAP).
        Old: LOBs working in silos. New: LOBs can collaborate using web 2.0/KM tools built into the UAP.
        Old: tightly controlled data dictionaries and metadata management to preserve the Single Source of Truth. New: wiki-style approach for a "data asset registry" allows collaborative and agile metadata management.
        Old: "power user" floats around training and troubleshooting. New: Data Scientist floats around educating and empowering.
    22. (image slide)
    23. (diagram) Old School vs New School – Analytic(s) Engines, Agile Process & Tools, Analytic Productivity Platform; Technology & Information; People & Processes.
    24. Unified Analytics Platform – Customer Example: T-Mobile. (Diagram: 100 TB Enterprise DW -> 1 Petabyte Analytic DW; Greenplum Database + EDC + Chorus.)
        Customer Challenge: 100TB EDW focused on operational reporting and financial consolidation; EDW is the single source of truth, under heavy governance and control; unable to support all of the critical initiatives around data surrounding the business; customer loyalty and churn the #1 business initiative from the CEO on down.
        Greenplum Database + Chorus: extracted data from the EDW and other source systems to quickly assemble a new analytic mart; generated a social graph from call detail records and subscriber data; within 2 weeks uncovered behavior where “connected” subscribers were 7X more likely to churn than the average user; deployed a 1PB production EDC with GP to power their analytic initiatives.
    25. T-Mobile Churn Analysis: extracted data from EDW and other source systems into a new analytic sandbox; generated a social graph from call detail records and subscriber data; within 2 weeks uncovered behavior where “connected” subscribers were seven times more likely to churn than the average user; T-Mobile valued this insight at $70 million (for a $1 million investment in Greenplum).
    26. Information Management in the age of Big Data – from / to: People & Skillsets – from Information consumption to Information-led Innovation; Processes & Methodology – from IT-centric and control-heavy to Business-centric, empowerment and trust; Information Architecture – from Data Model-centric to Business needs-centric; Technology Architecture – from Constrained by Technology to Empowered by Technology.
    27. (repeat of slide 24 – Unified Analytics Platform Customer Example: T-Mobile)
    28. (repeat of slide 25 – T-Mobile Churn Analysis)
    29. Traffic Network Modelling
    30. Parallel Model Learning – solving tens of thousands of statistical modelling problems, one for each road in the city, in parallel:
        SELECT origin, dest,
               madlib.linregr(travel_time,
                              array[peak_period(entry_time), …, origin_vol, dest_vol])
        FROM route_travel_info
        GROUP BY origin, dest;
        A model: t(x) = 466 + 7.72 peakPeriod(x) + 22.5 workDay(x) + 0.378 originVol(x) + 0.691 destVol(x)
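To show how one of these fitted models would be used, here is a hypothetical scoring query that plugs the coefficients from the slide into a prediction (the planned_trips table and its columns are invented for illustration):

        -- Predicted travel time for road connection 10491 to area 10784,
        -- using the fitted coefficients shown above.
        SELECT origin,
               dest,
               466
             + 7.72  * peak_period
             + 22.5  * work_day
             + 0.378 * origin_vol
             + 0.691 * dest_vol AS predicted_travel_time
        FROM planned_trips
        WHERE origin = 10491
          AND dest   = 10784;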
    31. Applications for a Traffic Network Model: compute the shortest path between any two nodes at a future time point; identify potential bottlenecks in the traffic network through betweenness centrality scores; identify phase transition points for massive traffic congestion using simulation techniques; study the likely impact of new roads and traffic policies like the proposed 40 km/hr speed limit in Sydney CBD.
    32. The Big Data writing is on the wall… The Data Warehousing Institute (TDWI): 50% of TDWI survey respondents will replace their DW platform in the next 3 years because (cannot do advanced analysis / cannot handle big data volumes): poor query response 45%; can’t support advanced analytics 40%; inadequate data load speed 39%; can’t scale up to large data volumes 37%; cost of scaling up is too expensive 33%; poorly suited to real-time or on-demand workloads 29%. Source: TDWI Next Gen Database Study, 2010.
    33. (image slide)
    34. The Greenplum Unified Analytics Platform (diagram):
        People: Data Science Team – Data Scientist, Data Engineer, Data Analyst, BI Analyst, LOB User; Data Platform Admin.
        Processes: Greenplum Chorus – Analytic Productivity Layer; 3rd Party/Partner Tools & Services.
        Information: Data Access & Query Layer.
        Technology: Greenplum Database and Greenplum Hadoop, on Private/Hybrid Cloud Infrastructure or Appliance.
    35. (image slide)
    36. Information Management in the Age of Big Data – Thank you.
