Big data and the data quality imperative

Published in: Technology

  1. TRILLIUM SOFTWARE 2013 CUSTOMER CONFERENCE
     (Who’s Afraid of…) The Big Bad Data Wolf?
     The Big Bad Data Challenge – Big Data & the Data Quality Imperative
     Presented by: Nigel Turner, VP Information Management Strategy
  2. The tale of the Three Little Pigs
     © Copyright 2013, Trillium Software, Inc. All rights reserved.
  3. Big Data – what is it?
     • Set of new concepts, practices & technologies to manage & exploit digital data
     • Can be defined as: “Data that exceeds the processing capability of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architecture” (Source: Edd Dumbill – O’Reilly Community)
     • Its key premise is that all data has potential value if it can be collected, analysed and used to generate actionable insight
  4. Where does Big Data come from?
     • Social media & social networks
     • Machine generated
     • Widely known sources
  5. What’s different about Big Data?
     • New technologies which enable distributed & highly scalable MPP (Massively Parallel Processing), e.g.
       – Apache Hadoop
       – MapReduce
       – NoSQL databases
     • Strong emphasis on analytical approaches
       – Emergence of “data science”
       – Predictive Analytics
       – Data Mining
     • The “democratisation” of data
       – Data made available to all (cf Cloud Computing)
       – Business and not IT led BI
  6. Big Data & Data Quality – parallel worlds?
     [Diagram: BIG DATA and DATA QUALITY shown as separate worlds]
  7. Parallel worlds… or are they (1)?
     Shared with 100,000+ others and counting…
  8. Parallel worlds… or are they (2)?
     “I spend the vast majority of my time cleaning data systems… cleaning and preparing data sets makes everything I do better… it’s the highest value activity I do”
     Josh Wills, Senior Director of Data Science, Cloudera
     (From “Training a new generation of Data Scientists” – Cloudera video)
  9. When Big Data & Data Quality worlds collide…
     • Big Data will expose Data Quality shortcomings
     • Poor Data Quality will undermine the value of Big Data investments
  10. Big Data – building on solid foundations
     [Diagram: BIG DATA / ANALYTICS resting on a DATA QUALITY FOUNDATION]
  11. The 3Vs and the DQ challenge
     • Exponential growth of data – predicted 40-60% per annum
     • 2.5 quintillion bytes of data are created every day
     • 90% of all digital data created in the last two years
     • Data generated more varied and complex than before:
       – Text, Audio, Images, Machine Generated etc.
     • Much of this data is semi-structured or unstructured
     • Traditional IT techniques ill equipped to process & analyse it
     • Data often generated in real time
     • Analysis and response needs to be rapid, often also real time
     • Traditional BI / DW environments cannot cope – new approaches are needed
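To put the 40-60% annual growth figure in perspective, the arithmetic can be sketched as below. This is an illustration only: it assumes a constant compound growth rate, which the slide does not state, and the function names are hypothetical.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed for a volume to double at a constant compound annual growth rate.

    Assumption (not from the slides): growth compounds once per year at a fixed rate.
    """
    return math.log(2) / math.log(1 + annual_growth)


def growth_multiplier(annual_growth: float, years: int) -> float:
    """Factor by which the volume grows after the given number of whole years."""
    return (1 + annual_growth) ** years


if __name__ == "__main__":
    for rate in (0.40, 0.60):
        print(
            f"{rate:.0%} per annum: doubles in about "
            f"{doubling_time_years(rate):.1f} years, "
            f"x{growth_multiplier(rate, 5):.1f} after 5 years"
        )
```

Under that assumption, data volumes double roughly every one and a half to two years, so any data quality process sized for today's volumes would need to scale at the same pace.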
  12. Big Data – Foundations of Success
     • Identifying the right data to solve the business problem or opportunity
     • The ability to integrate & match varied data from multiple data sources
       – structured, semi-structured, unstructured
     • Building the right IT infrastructure to support Big Data applications
     • Having the right capabilities & skills to exploit the data
  13. Big Data – some vertical applications
     • Retail: using point of sale & social media data to supplement & enrich traditional CRM / Marketing data
     • Insurance & Banking: fraud detection
     • Health: holistic patient analysis
     • Utilities: consumption peaks & troughs & capacity planning
     • Telcos: call routing optimisation & customer churn
     • Manufacturing: predictive fault identification & supply chain optimisation
     • Research: particle analysis, genomics etc.
  14. Example Big Data benefit: The Open Big Data Cloud
     Source: Linked Open Data (LOD) Community
  15. Big Data in practice – Volvo
     • Every Volvo vehicle has hundreds of microprocessors / sensors
     • Data generated is used within the car itself but also captured for analysis by Volvo and its dealers
     • All data is loaded into a centralised analysis hub & integrated with CRM, dealership, product & social network data
     • Used to optimise design & manufacturing, enhance customer interaction, improve safety & act on customer feedback
  16. Big Data – Barriers & Pitfalls
     • The sheer volume of data – what’s worth using?
     • Data extraction challenges
     • The ability to match data from disparate sources / formats / media
     • The time taken to integrate new data sources
     • The risks of mismatching and incorrect identification of individuals
     • Legal & regulatory pitfalls
     • Security concerns – corporate & individual
     • Lack of skills & expertise
  17. Big Data – the data integration challenge
     [Diagram]
     • External data sources: social media, sensors, open data, email, mobiles
     • Internal data sources: CRM, billing, ops, sales, prods
     • Analytics platforms 1, 2, 3 … n
     • Actionable insight & knowledge
  18. Big Data – the Data Quality Imperative (1)
     • Need to profile external and internal data sources
     • Need to classify data to define what data really matters
     • Need to assure the quality of internal (and some external) data sources for accuracy, completeness, consistency
     • Need to define & apply business rules & metadata management to how the data will be defined and used
     • Need for a data governance framework to ensure consistency & control
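As an illustration of what profiling a source for completeness and consistency can look like, here is a minimal sketch in plain Python. It is not taken from the presentation or from any Trillium product; the `profile_column` helper and the sample `customers` records are hypothetical.

```python
from collections import Counter


def profile_column(records, column):
    """Return simple completeness and value-frequency statistics for one column.

    Empty strings and None are both treated as missing (an assumption of this sketch).
    """
    values = [record.get(column) for record in records]
    non_empty = [value for value in values if value not in (None, "")]
    return {
        "rows": len(values),
        "completeness": len(non_empty) / len(values) if values else 0.0,
        "distinct": len(set(non_empty)),
        "top_values": Counter(non_empty).most_common(3),
    }


# Hypothetical sample records, invented for illustration only.
customers = [
    {"id": 1, "country": "UK", "email": "a@example.com"},
    {"id": 2, "country": "United Kingdom", "email": ""},
    {"id": 3, "country": "UK", "email": None},
    {"id": 4, "country": "uk", "email": "d@example.com"},
]

country_profile = profile_column(customers, "country")
email_profile = profile_column(customers, "email")

if __name__ == "__main__":
    print(country_profile)
    print(email_profile)
```

In this sample the `email` column is only half populated (a completeness issue), while `country` is fully populated but holds three spellings of the same value (a consistency issue that a business rule or standardisation step would need to resolve).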
  19. Big Data – the Data Quality Imperative (2)
     • Need processes & tools to enable:
       – Source data profiling
       – Data integration
       – Data parsing
       – Data standardisation
       – Business rule creation & management
       – Metadata management & a shared business / IT glossary
       – Data de-duplication
       – Data normalisation
       – Data matching
       – Data enrichment
       – Data audit
     • Many of these functions must be capable of being carried out in real time with zero lag
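Three of the functions listed above (parsing, standardisation, and matching for de-duplication) can be sketched with the Python standard library alone. This is a toy illustration under stated assumptions, not the method used by any particular data quality tool: the `ABBREVIATIONS` rules, the 0.9 similarity threshold, and the sample records are all hypothetical.

```python
import re
from difflib import SequenceMatcher

# Hypothetical standardisation rules; a real deployment would manage these as business rules.
ABBREVIATIONS = {"st": "street", "rd": "road", "ltd": "limited"}


def standardise(text):
    """Parse text into lowercase alphanumeric tokens and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def is_match(a, b, threshold=0.9):
    """Treat two records as the same entity if their standardised forms are similar enough."""
    ratio = SequenceMatcher(None, standardise(a), standardise(b)).ratio()
    return ratio >= threshold


def deduplicate(records, threshold=0.9):
    """Keep the first occurrence of each entity, dropping later records that match it."""
    kept = []
    for record in records:
        if not any(is_match(record, existing, threshold) for existing in kept):
            kept.append(record)
    return kept


# Hypothetical sample records, invented for illustration only.
records = [
    "Acme Ltd, 12 High St.",
    "ACME Limited 12 High Street",
    "Bolt Rd Garage",
]

if __name__ == "__main__":
    print(deduplicate(records))
```

The pairwise comparison here grows quadratically with the number of records, so it would not meet the real-time requirement on the slide at Big Data volumes; it is meant only to show how standardising before matching lets differently formatted records be recognised as the same entity.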
  20. Big Data – DQ as the key enabler
     [Diagram]
     • External data sources: social media, sensors, open data, email, mobiles
     • Internal data sources: CRM, billing, ops, sales, prods
     • Data Quality Platform: profile, parse, standardise, match, enrich
     • Analytics platforms 1, 2, 3 … n
     • Actionable insight & knowledge
  21. Big Data – some algorithms
     1. BIG DATA + POOR DATA QUALITY = BIG PROBLEMS
     2. DATA DEMOCRATISATION – DATA GOVERNANCE = ANARCHY
     3. DATA MASH UPS – DATA QUALITY = DATA MESS
     4. BIG DATA ANALYTICS + POOR DQ = WRONG RESULTS
     5. BIG DATA – DATA ASSURANCE = JAIL
     6. 3V + DATA QUALITY = 4V (VALIDITY)
  22. Big Data & Data Quality – summary
     • Big Data will depend on data quality to reap its claimed benefits – the GIGO truism
     • The democratisation of data will expose poor DQ
     • The need for Data Governance increases as data becomes more accessible
     • Data skills will become more valued for ‘data science’
     • Big Data will increase the 3Vs of data
     • Control of data becomes more difficult – scope and variety of use increases
     • Data standards & business rules become more complex
     • Potential legal & regulatory minefield
  23. What action should we take as data management / DQ professionals?
     • Identify and get involved in any current or planned Big Data initiatives within our organisations
     • Ensure that the Data Quality and Data Governance implications & imperatives of these initiatives are understood
     • Plan for the new Data Quality and Data Governance challenges that these trends will pose
  24. So who’s afraid of the Big Bad Data Wolf?
  25. Questions
     (Who’s Afraid of…) The Big Bad Data Wolf?
     The Big Bad Data Challenge – Big Data & the Data Quality Imperative
