Emerging Data Quality Trends for Governing and Analyzing Big Data


Business initiatives across industries are applying more data than ever to drive analytics and AI in the quest for new competitive insights. As the volume and variety of data gathered by organizations continue to escalate, both on-premises and in the cloud, traditional Data Quality methods are transforming to meet this Big Data challenge. This webinar looks at emerging trends in Data Quality that address Data Governance, entity resolution at scale, AI and machine learning, and the establishment of Data Quality as a core tenet of data literacy.

Published in: Data & Analytics

Emerging Data Quality Trends for Governing and Analyzing Big Data

  1. 1. Emerging Data Quality Trends for Governing and Analyzing Big Data August 1, 2019 Harald Smith
  2. 2. Speaker Harald Smith • Director of Product Marketing, Syncsort • 20+ years in Information Management with a focus on data quality, integration, and governance • Co-author of Patterns of Information Management • Author of two Redbooks on Information Governance and Data Integration • Blog author: “Data Democratized”
  3. 3. Agenda • Ongoing Data Challenges • Four Emerging Data Quality Trends • Approaches to addressing Data Quality needs • Questions
  4. 4. Why is Data Quality so important?
  5. 5. Data: the fuel of the future Data is to this century, what oil was to the last one: a driver of growth and change. The Economist: Fuel of the future - Data is giving rise to a new economy: 6th May 2017 Flows of data have created new infrastructures, new businesses, new monopolies, new politics and crucially new economics. Digital information is unlike any previous resource: it is extracted, refined, valued, bought and sold in different ways. It changes the rules for markets and it demands new approaches from regulators. Many a battle will be fought over who should own, and benefit from, data. 5 Emerging Data Quality Trends
  6. 6. Data impacts all areas of the business: Sales, Marketing, Finance, Legal, IT, Operations, Management. Affected activities include: Analysis Segmentation Data compliance Access Scheduling All reports! Competitor analysis Sales reports Single Customer / 360 View Data regulation Security Workloads Aggregations HR / recruitment Dashboards CRM Content Governance Capacity Management Performance planning Forecasting & modelling Overall business strategy! Performance metrics Campaign management Risk Optimization & SLA’s Route planning Cash flow Territory management ROI Disaster Recovery Inventory Contingency planning UX
  7. 7. Data Governance & Quality are top of mind • 3V’s of Big Data – Volume, variety, and velocity of data are growing • Ever more Analysis – New tools allow more granular data dissection and segmentation • Dichotomy in Outcomes – Expectations of data are increasing, yet confidence in data is falling • Governance Requirements – Broader and deeper compliance & regulation expectations • Trust & confidence
  8. 8. Use Case: 360 View of Customer “Get to Know Me”… • Design and deliver rich, individualized experiences that build customer loyalty • Increasingly broad spectrum of data sources involved in, and required for, effectively personalizing customer experiences and targeted marketing offers What Types of Data? • Internal sources – often many/overlapping • 3rd Party data – geospatial, demographics, firmographics • Suppression data – keeping customer information updated • New sources – mobile, social media What Data Challenges? • Incorporating and managing the expected exponential increase in digital demographic data • Tapping into customer technology histories to build and evolve an understanding of individual customers Internal Data ▪ Customer Master Data ▪ Point-of-Sale Data ▪ Contact Form Data ▪ Loyalty Program Data ▪ ecommerce Data ▪ Customer Service Data Suppression Data ▪ Change of Address ▪ Mortality ▪ Do Not Call Third-Party Data ▪ Age ▪ Occupation ▪ Education ▪ Gender ▪ Income ▪ Geospatial/Location Social Data ▪ Digital demographics ▪ Sentiment ▪ Opinions ▪ Interests ▪ Social handles
  9. 9. Use Case: Anti-Fraud/Anti-Money Laundering Protect Financial Assets and Ensure Compliance • Flag credit card fraud in real time • Identify and report on money laundering What Types of Data? • Internal sources – often many/overlapping • Suppression data – keeping customer information updated • Mobile data – devices, locations • New sources – social media, 3rd party data, … What Data Challenges? • Fraudulent transaction detection requires: • Huge volumes of customer profile data • Recent transaction activity with “last known” values • Device data with geolocation and time-based tagging • Data used to refine Machine Learning models (e.g., anomaly detection, implausible behavior analysis) to review new transactions in real time Internal Data ▪ Customer Master Data ▪ Point-of-Sale Data ▪ Contact Form Data ▪ Loyalty Program Data ▪ ecommerce Data ▪ Customer Service Data Mobile Data ▪ Device ▪ Location ▪ Wearables ▪ Mobile wallets Suppression Data ▪ Change of Address ▪ Mortality ▪ Do Not Call Social Data ▪ Digital Demographics ▪ Sentiment ▪ Opinions ▪ Interests ▪ Social handles
  10. 10. Big Data Needs Data Quality • Only 35% of senior executives have a high level of trust in the accuracy of their Big Data Analytics (KPMG 2016 Global CEO Outlook) • 92% of executives are concerned about the negative impact of data and analytics on corporate reputation (KPMG 2017 Global CEO Outlook) • 80% of AI/ML projects are stalling due to poor data quality (Dimensional Research, 2019) “Societal trust in business is arguably at an all-time low and, in a world increasingly driven by data and technology, reputations and brands are ever harder to protect.” EY “Trust in Data and Why it Matters”, 2017. The importance of data quality in the enterprise: • Decision making – Trust the data that drives your business • Customer centricity – Get a single, complete and accurate view of your customer for better sales, marketing and customer service • Compliance – Know your data, and ensure its accuracy to meet industry and government regulations • Machine learning & AI – High quality models require training on high quality, accurate data
  11. 11. Four Emerging Data Quality Trends
  12. 12. Four Emerging Data Quality Trends All the traditional DQ issues remain, but now consider: 1. New DQ considerations for new types of data 2. New application considerations (e.g. Machine learning) 3. Processing at scale/meeting SLAs 4. Data Democratization and resource/knowledge constraints 12 Emerging Data Quality Trends
  13. 13. 1. New Data, New Measures
  14. 14. Common Data Quality Problems All the traditional data quality issues remain, but now at greater scale and in more places • Many data records with different layouts • Inconsistent data formats (number formatting, measurements, languages, postal conventions, and dates) • Lack of standardization of the different fields • Names spelled differently, partially entered, or multiple names provided • Misspellings and keystroke errors • Data sourced from third parties does not contain all the necessary fields or is out-of-date • Invalid values: codes, reference data, out-of-range, future dates
  15. 15. Common Data Quality Measurements What measures can we take advantage of? • Completeness – Are the relevant fields populated? • Integrity – Does the data maintain an internal structural integrity or a relational integrity across sources? • Uniqueness – Are keys or records unique? • Validity – Does the data have the correct values? • Code and reference values • Valid ranges • Valid value combinations • Consistency – Is the data at consistent levels of aggregation or does it have consistent valid values over time? • Timeliness – Did the data arrive in a time period that makes it useful or usable?
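Three of the measures listed on this slide (completeness, uniqueness, validity) can be sketched as simple ratios. This is a minimal illustration, not the speaker's tooling: the function names, the sample records, and the allowed state codes are all assumptions made up for the example.

```python
def _populated(records, field):
    """Values of `field` that are present and non-empty."""
    return [r[field] for r in records if r.get(field) not in (None, "")]

def completeness(records, field):
    """Share of records where `field` is populated."""
    return len(_populated(records, field)) / len(records) if records else 0.0

def uniqueness(records, key):
    """Share of distinct values among the populated values of `key`."""
    values = _populated(records, key)
    return len(set(values)) / len(values) if values else 0.0

def validity(records, field, allowed):
    """Share of populated values that appear in the reference set `allowed`."""
    values = _populated(records, field)
    return sum(v in allowed for v in values) / len(values) if values else 0.0

# Hypothetical sample data: one duplicate id, one invalid code, two missing emails.
records = [
    {"id": 1, "state": "IL", "email": "a@example.com"},
    {"id": 2, "state": "XX", "email": ""},
    {"id": 2, "state": "KY", "email": "c@example.com"},
    {"id": 4, "state": "IN", "email": None},
]

report = {
    "email_completeness": completeness(records, "email"),
    "id_uniqueness": uniqueness(records, "id"),
    "state_validity": validity(records, "state", {"IL", "IN", "KY"}),
}
```

Each score is a fraction between 0 and 1, so the same report can be tracked over time or compared across sources.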
  16. 16. Example: Call Center Record Unique ✓ Integrity ✓ Complete ? Consistent ✓ Timely ✓ Valid ? Is Duration = 0 important? Is 01/01/20xx a defaulted date? And how will this be linked or connected with my other data? The file appears complete, but does it cover all call centers? 16 Emerging Data Quality Trends
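The two questions raised on this slide (is a zero duration meaningful, and is a January 1 date a system default?) can be turned into flags for human review. The sketch below assumes hypothetical field names (`call_id`, `call_date`, `duration_seconds`) and made-up records; it only marks values as suspect and does not decide whether they are wrong.

```python
from datetime import date

def flag_suspect_values(call):
    """Return review flags for one call record; flags are hints, not verdicts."""
    flags = []
    if call["duration_seconds"] == 0:
        flags.append("zero_duration")
    call_date = call["call_date"]
    # January 1 is a common system default, so it is worth a second look.
    if call_date.month == 1 and call_date.day == 1:
        flags.append("possible_default_date")
    return flags

# Hypothetical call center records.
calls = [
    {"call_id": "A1", "call_date": date(2019, 1, 1), "duration_seconds": 0},
    {"call_id": "A2", "call_date": date(2019, 3, 14), "duration_seconds": 245},
]

flagged = {c["call_id"]: flag_suspect_values(c) for c in calls}
```

Whether a flagged value is a defect depends on the business context: a zero-second call may be a legitimate abandoned call.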
  17. 17. Example: Social Media Feed Unique? Integrity? Complete? Consistent? Timely? Valid? 17 Emerging Data Quality Trends
  18. 18. New Data Quality Problems New data, new data quality challenges • 3rd Party and external data with unknown provenance or relevance • Bias in the data – whether in collection, extraction, or other processing • Data without standardized structure or formatting • Continuously streaming data • Disjointed data (e.g. gaps in receipt) • Consistency and verification of data sources • Changes and transformation applied to data (i.e. does it really represent the original input?) “34 percent of bankers in our survey report that their organization has been the target of adversarial AI at least once, and 78 percent believe automated systems create new risks, such as fake data, external data manipulation, and inherent bias.” Accenture Banking Technology Vision 2018
  19. 19. Additional Measures of Data Quality What else can we review or measure? Provenance – Where did the data originate, who gathered it, and what criteria were used to create it? • E.g. government agency, 3rd party provider, free or paid data Coverage (Relevance) – How well does the data source meet the defined needs? • E.g. does it cover the relevant geography? Is it biased (and if so, how)? Continuity – Are there data points for all intervals or expected intervals? • E.g. sensors, weather records, call data records Triangulation – What Gartner describes as ‘consistency of data across proximate data points’, i.e. consistent measurements from related points of reference. • E.g. if temperatures in Chicago and Louisville are 30° and 32°, then the temperature in Indianapolis for the same day is unlikely to be 70° Transformation from origin – How many layers and/or changes has the data passed through? • E.g. has the original data source already been merged with two other record sources? And is the result accurate? Repetition or duplication of data patterns – Are data points exactly the same across multiple recording intervals or across multiple sensors? • E.g. is there tampering with sensors or call data?
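Three of these additional measures lend themselves to small checks: continuity (gaps between readings), triangulation (agreement with nearby reference points), and repetition (suspiciously identical readings). The following is a sketch under stated assumptions: the function names, the tolerance of 15 degrees, and the sample readings are illustrative, with the temperatures taken from the slide's Chicago/Louisville/Indianapolis example.

```python
from statistics import mean

def find_gaps(timestamps, expected_step):
    """Continuity: return (previous, next) pairs separated by more than expected_step."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > expected_step]

def fails_triangulation(value, neighbor_values, tolerance):
    """Triangulation: True when value is further than tolerance from the neighbors' mean."""
    return abs(value - mean(neighbor_values)) > tolerance

def longest_repeat_run(values):
    """Repetition: length of the longest run of identical consecutive readings."""
    longest = current = 0
    previous = object()  # sentinel that never equals a real reading
    for v in values:
        current = current + 1 if v == previous else 1
        longest = max(longest, current)
        previous = v
    return longest

# Readings expected every 60 seconds; one interval is missing data.
gaps = find_gaps([0, 60, 120, 300, 360], expected_step=60)

# Neighboring cities report 30 and 32 degrees; 70 is implausible, 33 is not.
suspect = fails_triangulation(70, [30, 32], tolerance=15)
plausible = fails_triangulation(33, [30, 32], tolerance=15)

# A sensor that reports the same value three times in a row.
run = longest_repeat_run([21.5, 21.5, 21.5, 22.0])
```

Suitable thresholds (step size, tolerance, acceptable run length) are domain decisions and would need to be set per data source.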
  20. 20. Example: New Data Quality Measures applied • Provenance – Jane Doe pulled from Twitter based on #Blackberry • Continuity – All items for #Blackberry in relevant time interval appear to be included • Usage – Marketing confirms this data has high value • Triangulated – Good association with current product & sales data • Repeated patterns – All tweets appear unique within the date & vs. prior feeds • Coverage – This needed to include #BB and #Crackberry as well! • Transformation – No changes or merges of the data were applied
  21. 21. 2. Machine Learning & Data Quality
  22. 22. “The magic of machine learning is that you build a statistical model based on the most valid dataset for the domain of interest. If the data is junk, then you’ll be building a junk model that will not be able to do its job.” James Kobielus, SiliconANGLE Wikibon Lead Analyst for Data Science, Deep Learning, App Development, 2018
  23. 23. Common Machine Learning Applications Marketing • Targeted marketing • Recommendation engine • Next best action • Customer churn prevention Risk Management • Anti-money laundering • Fraud detection • Cybersecurity • Know your customer 23 Emerging Data Quality Trends
  24. 24. Data Challenges with Machine Learning Five Big Challenges of Enabling Machine Learning 1. Scattered and Difficult to Access Datasets – Much of the necessary data is trapped in mainframes or streams in from POS systems and ATMs in incompatible formats, making it difficult to gather and prepare the data for model training. 2. Data Cleansing at Scale – Data quality cleansing and preparation routines have to be reproduced at scale. Most data quality tools are not designed to work on that scale of data. 3. Entity Resolution and Customer Identification – Distinguishing matches across massive datasets that indicate a single specific entity requires sophisticated multi-field matching algorithms and a lot of compute power. Essentially everything has to be compared to everything. 4. Need for Near Real-Time Current Data – Tracking and detection need to happen very rapidly. Current transactions need to be constantly added to combined datasets, prepared, and presented to models as close to real time as possible. 5. Tracking Lineage from the Source – Data changes made to help train models have to be exactly duplicated in production, both so that models make accurate predictions on new data and to provide required audit trails. Capture of complete lineage, from source to end point, is needed.
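Challenge 3 notes that, in principle, every record has to be compared with every other record. One common way to reduce that cost is blocking: only records sharing a cheap key are compared, and a weighted multi-field similarity decides the match. The sketch below is an illustration of that idea, not the product's algorithm; the records, field weights, postcode blocking key, and 0.85 threshold are all assumptions, and similarity uses `difflib.SequenceMatcher` from the Python standard library.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Case-insensitive string similarity between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def candidate_pairs(records, block_key):
    """Blocking: yield only pairs of records that share the same block key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    for group in blocks.values():
        yield from combinations(group, 2)

def match_score(a, b, weights):
    """Weighted multi-field similarity; weights are assumed to sum to 1."""
    return sum(w * similarity(a[f], b[f]) for f, w in weights.items())

# Hypothetical customer records.
records = [
    {"id": 1, "name": "Jane Doe", "street": "12 Main St", "postcode": "60601"},
    {"id": 2, "name": "Jane Do", "street": "12 Main St", "postcode": "60601"},
    {"id": 3, "name": "John Smith", "street": "99 Oak Ave", "postcode": "60601"},
    {"id": 4, "name": "Maria Lopez", "street": "5 River Rd", "postcode": "40202"},
]

WEIGHTS = {"name": 0.6, "street": 0.4}  # illustrative weights
THRESHOLD = 0.85                        # illustrative cut-off

matches = [
    (a["id"], b["id"])
    for a, b in candidate_pairs(records, block_key=lambda r: r["postcode"])
    if match_score(a, b, WEIGHTS) >= THRESHOLD
]
```

Blocking trades recall for speed: here three pairs are scored instead of six, but two records for the same person with different postcodes would never be compared, which is one reason the slides mention multiple matching passes.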
  25. 25. Data Quality Challenges with Machine Learning Incorrect, Incomplete, Mis-Formatted, and Sparse “Dirty Data” – Mistakes and errors are almost never the patterns you’re looking for in a data set. Sparse data generates other issues. Correcting and standardizing will tend to boost the signal, but must account for bias. Missing context – Many data sources lack context around location or population segments. Unless enriched with other data sets (e.g. geospatial, demographics, or firmographics data), some ML algorithms will not be usable. Multiple copies – If your data comes from many sources, as it often does, it may contain multiple records of information about the same person, company, product or other entity. Removing duplicates and enhancing the overall depth and accuracy of knowledge about a single entity can make a huge difference. Spurious correlations – Just as missing context may hinder some ML algorithms, inclusion of already correlated data (e.g. city and postal code) may result in overfitting of ML algorithms. Traditional data quality processes are an effective method to remove defects. Correcting data problems vastly increases a data set’s usefulness for machine learning. However, traditional data quality software is designed to work on smaller data sets, and data analysts may not be aware of specific data quality issues that must be addressed to support machine learning.
  26. 26. Example: Missing segments of populations Event: Hurricane Sandy, 20 million tweets • Majority of tweets came from Manhattan, not from the hard-hit areas such as Seaside Heights and Midland Beach, due to power outages and diminishing cell phone batteries • Despite the millions of Spanish-speakers affected, very few Spanish-language tweets were collected • Assess % across and against all likely locations • Seek out disconfirming information Data: Boston Potholes, Street Bump App • Draws on accelerometer and GPS data to help passively detect potholes • Lower income groups in the US are less likely to have smartphones, particularly older residents – penetration as low as 16% • Result is underreporting of road problems in more elderly communities • Assess % across all likely locations • Add other sources • Utilize demographics for evaluations
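The recommendation to "assess % across all likely locations" can be sketched as a comparison between each segment's observed share of the data and its expected share of the population. The function name, the 0.5 cut-off, and the segment counts below are invented for illustration and are not figures from either example on the slide.

```python
def coverage_gaps(observed_counts, expected_share, min_ratio=0.5):
    """Return segments whose observed share is below min_ratio of the expected share.

    The value for each segment is observed_share / expected_share, rounded to 2 places.
    """
    total = sum(observed_counts.values())
    gaps = {}
    for segment, expected in expected_share.items():
        observed = observed_counts.get(segment, 0) / total if total else 0.0
        ratio = observed / expected if expected else 0.0
        if ratio < min_ratio:
            gaps[segment] = round(ratio, 2)
    return gaps

# Hypothetical numbers: records collected per area vs. each area's population share.
observed = {"area_a": 900, "area_b": 80, "area_c": 20}
expected = {"area_a": 0.5, "area_b": 0.3, "area_c": 0.2}

gaps = coverage_gaps(observed, expected)
```

A low ratio does not say why a segment is missing (power outages, device ownership, language), only that it is worth adding other sources before drawing conclusions.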
  27. 27. Example: Noise, or Inserted content “Bots are just a tool for making the numbers look how you want them to look.” Sam Woolley Researcher, Oxford University’s Project on Computational Propaganda Wired: Nov 8, 2016 “The Political Twitter Bots Will Rage This Election Day” Event: Election Bot tweets • ~400,000 bots tweeting on the election • ~20% of all election-related tweets came from an army of influential bots • 55-80% of Twitter activity—the likes, follows, and retweets —are from bots • It had been easier to identify earlier bots, but now it’s incredibly difficult for a human to make a determination • Evaluate patterns • Is there any real sentiment here? • How much repetitive content is there? • How much “influence” comes from a single or a few sources (negative or positive)? • Will it skew the analysis? 27 Emerging Data Quality Trends
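Two of the evaluation questions on this slide, how much content is repetitive and how much activity comes from a single or a few sources, can be estimated with simple counts. This is a rough sketch with made-up posts and hypothetical function names; it measures concentration and duplication only and is not a bot detector.

```python
from collections import Counter

def repetition_rate(texts):
    """Share of messages that duplicate an earlier message after normalization."""
    normalized = [" ".join(t.lower().split()) for t in texts]
    duplicates = sum(count - 1 for count in Counter(normalized).values())
    return duplicates / len(texts) if texts else 0.0

def top_source_share(authors, top_n=1):
    """Share of all messages produced by the top_n most active authors."""
    counts = Counter(authors)
    top = sum(count for _, count in counts.most_common(top_n))
    return top / len(authors) if authors else 0.0

# Hypothetical feed of (author, text) pairs.
posts = [
    ("bot_1", "Vote now!"),
    ("bot_1", "vote  NOW!"),
    ("user_a", "Long line at my polling place"),
    ("bot_1", "Vote now!"),
    ("user_b", "Just voted"),
]

repeat_share = repetition_rate([text for _, text in posts])
dominant_share = top_source_share([author for author, _ in posts])
```

High values on either measure suggest that sentiment or volume analysis on the feed may be skewed and should be reviewed before use.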
  28. 28. Example: Simple bias “The “black sheep problem” is that if you were to try to guess what color most sheep were by looking [at] language data, it would be very difficult for you to conclude that they weren't almost all black. In English, “black sheep” outnumbers “white sheep” about 25:1 (many "black sheeps” are movie references); in French it's 3:1; in German it's 12:1. Some languages get it right; in Korean it's 1:1.5 in favor of white sheep…” Hal Daumé Associate Professor, University of Maryland Blog: June 24, 2016 “Language bias and black sheep” and-black-sheep.html Data: Google Word2Vec data set Word2vec • Converts words into a vector space for analysis • “Numerous researchers have begun to use the data to better understand everything from machine translation to intelligent Web searching.” • Embeddings based on a group of 300 million words taken from Google News • Researchers from Boston University and Microsoft have found it is “blatantly sexist” • Impacts the ability to create personalized services • Evaluate % of words & associations • How do I interpret a sentiment? • Does this data set contain hidden and unexpressed bias? • Will I miss opportunities because of hidden assumptions? 28 Emerging Data Quality Trends
  29. 29. 3. Data Quality at Scale
  30. 30. Challenges To Ensuring Data Quality Many sources of data (70%) and volume of data (48%) are among the top 3 challenges companies face when ensuring high quality data. Applying governance processes to manage and measure data quality is second with 50%. What are the greatest challenges you face when ensuring high data quality? • Many sources of data: 70% • Applying governance processes to manage and measure data…: 50% • Volume of data: 48% • Inconsistent formats of data: 47% • Inconsistent definitions of data: 46% • Missing information: 43% • Connecting policies and rules to data: 32% • Misfielded data: 27% • Lack of skills/staff: 27% • Lack of tools (or inadequate tools): 25% • Not seen as an organizational priority: 15% * Syncsort, 2019 Enterprise Data Quality survey
  31. 31. Processing at Scale New Data Quality considerations • Handling data volumes and distributed data • Profiling data – assessing high volumes and streaming data • Standardizing and enriching data content • Matching entities – not just master data – e.g. transactions for fraud detection • Meeting Service Level Agreements (SLA’s) • Running consistently on new and regularly changing platforms (Hadoop, Spark, Cloud) 31 Emerging Data Quality Trends
  32. 32. Big Data at scale distributes data across many nodes – not necessarily with other relevant data! • Data Quality functions must be performed in a consistent manner, no matter where actual processing takes place, how the data is segmented, and what the data volume is • Cleansing, standardization, and data validation will generally scale linearly • Data Enrichment: Reference data, lookups must be readily accessible by any process wherever executed Handling distributed data volumes Source: HP Analyst Briefing 32 Emerging Data Quality Trends
  33. 33. • But particular implications for profiling, joining, sorting, and matching data • Profiling: Identification of outliers necessitates full volume views and need to aggregate statistics and frequencies of data distributed across cluster • Joins & sorts: Efficient shuffling of data stored across cluster is critical • Entity Resolution: Distinguishing matches that indicate a single specific entity across so much data requires multiple passes with sophisticated multi-field matching algorithms – with results that are understandable by business users in order to be meaningful Handling distributed data volumes 33 Emerging Data Quality Trends
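The profiling point above, that outliers can only be identified from a full-volume view, can be illustrated with mergeable per-partition summaries: each node profiles its own slice, and the partial results are combined before any outlier rule is applied. The class name, the sample partitions, and the two-standard-deviation rule are assumptions for this sketch, which runs in a single process and uses no cluster framework.

```python
from collections import Counter
from dataclasses import dataclass, field
from functools import reduce
from math import sqrt

@dataclass
class PartialProfile:
    """Mergeable summary of one partition of a numeric column."""
    count: int = 0
    total: float = 0.0
    total_sq: float = 0.0
    frequencies: Counter = field(default_factory=Counter)

    @classmethod
    def from_values(cls, values):
        profile = cls()
        for v in values:
            profile.count += 1
            profile.total += v
            profile.total_sq += v * v
            profile.frequencies[v] += 1
        return profile

    def merge(self, other):
        """Combine two partial profiles without revisiting the raw data."""
        return PartialProfile(
            self.count + other.count,
            self.total + other.total,
            self.total_sq + other.total_sq,
            self.frequencies + other.frequencies,
        )

    def mean(self):
        return self.total / self.count

    def stddev(self):
        """Population standard deviation from the running sums."""
        return sqrt(max(self.total_sq / self.count - self.mean() ** 2, 0.0))

# Hypothetical column split across three nodes.
partitions = [[10, 12, 11], [11, 13], [95]]
partials = [PartialProfile.from_values(p) for p in partitions]
combined = reduce(PartialProfile.merge, partials)

# Illustrative rule: values more than two standard deviations from the global mean.
outliers = sorted(
    v for v in combined.frequencies
    if abs(v - combined.mean()) > 2 * combined.stddev()
)
```

Seen in isolation, the third partition has a standard deviation of zero and nothing looks unusual; the value 95 only stands out once the partial profiles are merged. The sum-of-squares formula is kept for brevity and can lose precision on very large values, so a production profiler would likely use a more numerically stable method.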
  34. 34. Anti-Money Laundering on Hadoop at Global Bank • Must provide cluster-native data verification, enrichment, and demanding multi-field fuzzy matching for entity resolution to Golden Record • Massive data volumes • Scattered data – Mainframe, RDBMS, Cloud, … • Must be secure – Kerberos, LDAP • Must have lineage – data origin to end point • Must archive unaltered mainframe data Full Anti-Money Laundering regulatory compliance with financial crimes data lake – high performance results at massive scale. • Full end-to-end data lineage supplied to Apache Atlas and ASG Data Intelligence • Cluster-native data verification, enrichment, and demanding multi-field entity resolution on Spark • Unmodified mainframe “Golden Records” stored on Hadoop Bank must monitor transactions to detect Money Laundering for FCA compliance. Leverage Machine learning at scale to detect patterns, but … Requires large amounts of current, clean data. 34 Emerging Data Quality Trends
  35. 35. 4. Data Literacy / Democratization
  36. 36. Data Democratization Data Quality is a key component to user empowerment • Data Literacy - critical to understand: • Business context and language • Data (including data structures and data types) • Data access (how and where to find) • Data usage (how will the data be used by the business) • Basic Statistics • Data Quality dimensions • Data Quality techniques and tools • Resource constraints – in both Data Quality and technologies • What questions to ask? • Where to find answers? 36 Emerging Data Quality Trends
  37. 37. Approaches to Addressing Emerging Data Quality Trends
  38. 38. Approaches Data Literacy / Data Governance • Communicating Best Practices in Data Quality for everyone “Universal” Data Quality Best Practices • Establish scope: ask core questions • Identify data requirements • Address bias • Understand context • Address and resolve data quality issues • Apply data governance processes Solving “Big Data” Data Quality Challenges • Handle scale • Ensure consistent data quality application across platforms
  39. 39. Culture of Data Literacy • “Democratization of Data” requires cultural support • Empowered to ask questions about the data • Trained to understand and use data • Trained to understand approaching and evaluating data quality • Traditional data, new data, machine learning requirements, … • Understand the business context of the data Program of Data Governance • Provide the processes and practices necessary for success • Measure, monitor, and improve • Continuous iteration and development Center of Excellence/Knowledge Base • Where do you go to find answers? • Who can help show you how? Communicate! 39 Emerging Data Quality Trends
  40. 40. Data Literacy: challenges & best practices • Lack of Common Terminology • Organizational Barriers & Silos • Isolated or Unknown Work • Lack of Engagement Establish a Common Language • Define terminology – a ‘stake in the ground’ • Map information • Support with policies/standards Gain Broader Buy In • Bring stakeholders together • Build the structure, culture, ownership, steering groups, stewardship over time Enrich Information • Discover what you don’t know • Resolve differences • Enhance/annotate to increase insight Share Insights Regularly • Produce and share tangible outcomes • Highlight ‘wins’ • Demonstrate efficiencies & savings Copyright © Syncsort 2019
  41. 41. Universal Data Quality best practices “If you don’t know what you want to get out of the data, how can you know what data you need – and what insight you’re looking for?” Wolf Ruzicka, Chairman of the Board at EastBanc Technologies, blog post, June 1, 2017: “Grow A Data Tree Out Of The “Big Data” Swamp” Establish Scope • Understand the business objective and problem • Ask the “right questions” about your data (not just “what” and “how”) • Empower users (“Who”) to gain new clarity into the core problem (“Why”) • The definition of “high-quality data” will vary by business problem Identify Requirements & Processes • Do you have all the data required? • Do you understand the characteristics and context of the data? • How will data be matched, consolidated, or connected? • What’s needed to facilitate the matching, consolidation, or connection required? • Have you evaluated the sources? • What’s the Fitness for your Purpose?
  42. 42. Understand Context • What are the Critical Data Elements? • What qualities do we need to address, or leave alone? • When, and where, do we need to transform or enrich the data content? • How are we connecting, relating, or combining data? Develop, Test, and Deploy Corrective Measures • Consistent application of standardization, transformation, enrichment, and entity resolution • Common templates, rules, metrics, and processes that can be leveraged • Deploy into batch, real-time, or embedded services Apply Data Governance • Deploy and implement metrics and measures for ongoing assessment and evaluation Universal Data Quality best practices “Never lead with a data set; lead with a question.” Anthony Scriffignano Chief Data Scientist, Dun & Bradstreet Forbes Insights, May 31, 2017 “The Data Differentiator” 42 Emerging Data Quality Trends
  43. 43. Quantify: challenges & best practices • Hidden Activities • Money, Time and Resource Waste • Lack of Transparency and Trust • Disconnect Between Process and Measures Identify Baseline Measures • Keep a focus on lean and agile • Define value accurately for the business Link to Business Performance • Create and refine streams of value • Transform culture through action and empowerment Monitor, Report and Remediate Issues • Continuously review • Ensure issues are visible and understood • Understand root causes • Address/resolve issues Quantify Impact of Changes • Demonstrate through clearly understood measures • Establish value continuously • Finish early, finish often Copyright © Syncsort 2019
  44. 44. Leverage tools built for Big Data • Focus on the data quality challenges, not the Big Data ones • Connect to and process hundreds of millions of records of data • Standardize, enhance, and match international data sets with postal and country-code validation • Integrate, enrich, and match new and legacy customer data from multiple disparate sources • Deploy data quality workflows as native, parallel MapReduce or Spark processes for optimal efficiency on premises or in the Cloud • Increase processing efficiency by expanding cluster, not rebuilding processes • Support failover through fault-tolerant designs; during a node failure, processing is redirected to another node 44 Emerging Data Quality Trends
  45. 45. Simplify: Design Once, Deploy Anywhere Intelligent Execution - Insulate your organization from underlying complexities of Big Data Get excellent performance every time without tuning, load balancing, etc. Avoid re-design, re-compile, re-work • Future-proof job designs for emerging compute frameworks • Move from dev to test to production • Move from on-premises to Cloud • Move from one Cloud to another Use existing Data Quality skills • Focus on data quality problems, not technical ones Design Once in visual GUI Deploy Anywhere! On-Premises, Cloud MapReduce, Spark, Future Platforms Windows, Linux, Unix Batch, Streaming Single Node, Cluster
  46. 46. Data Quality remains Data Quality, even at scale “Data and analytics leaders need to understand the business priorities and challenges of their organization. Only then will they be in the right position to create compelling business cases that connect data quality improvement with key business priorities.” Ted Friedman, VP Distinguished Analyst, Gartner, Smarter with Gartner, June 12, 2018: “How to Create a Business Case for Data Quality Improvement” “Never lead with a data set; lead with a question.” Anthony Scriffignano, Chief Data Scientist, Dun & Bradstreet, Forbes Insights, May 31, 2017: “The Data Differentiator”
  47. 47. Q&A