Data Quality & Hadoop
  • Scott McNealy speaking about Jini, as quoted in Wired, 1999.
  • Target identified about 25 products that, when analyzed together, allowed it to assign each shopper a “pregnancy prediction” score and estimate a due date with reasonable accuracy. For example: a 23-year-old woman buys cocoa-butter lotion, a purse large enough to double as a diaper bag, zinc and magnesium supplements, and a bright blue rug.
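A minimal sketch of the idea in the note above: combine weak per-product purchase signals into a single score. The products and weights here are invented for illustration; Target’s actual ~25-product model is not public.

```python
# Hypothetical signal products and weights -- purely illustrative.
SIGNAL_WEIGHTS = {
    'cocoa-butter lotion': 0.3,
    'oversized purse': 0.2,
    'zinc supplement': 0.25,
    'magnesium supplement': 0.25,
    'bright blue rug': 0.1,
}

def pregnancy_score(basket):
    """Sum the weights of any signal products seen in a shopper's basket."""
    return sum(SIGNAL_WEIGHTS.get(item, 0.0) for item in basket)

basket = ['cocoa-butter lotion', 'zinc supplement', 'magnesium supplement']
score = pregnancy_score(basket)  # 0.8
```

The point is that no single purchase is meaningful; only the combination carries predictive weight.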
  • Look at two axes: the nature of the app, which dictates how the data is used, and the nature of the data, which determines how suited it is for cleansing. Key questions for the app center on the degree of mission-criticality, compliance exposure, and business impact the app carries. We’ll go through a few examples on the next chart. Then look at the nature of the data to see whether it is feasible and worthwhile to cleanse it.
    Volume – It won’t be practical to apply traditional record-by-record cleansing on ingestion when terabytes of data are involved.
    Variety – Data cleansing of text for ID or address matching is not new. But the varieties of data held by Hadoop (and other big data stores) can encompass anything from cryptic sensory instrument data, machine-generated log files, and human-readable text files (which in turn might be embedded in log files, messages, or embedded tags) to rich data (audio, video). There are also variations in the “density” of data. Text parsing is well-established technology, and if the data is to be fed to a SQL target, the logical strategy is some form of ETL or ELT following a MapReduce or Pig operation that does some of the initial heavy lifting. But if the goal is otherwise – digest machine data, trend customer sentiment, etc. – the aim is to paint the big picture rather than generate an exact snapshot.
    Velocity – Streaming data is typically not cleansed because the need is for immediacy. In most cases (e.g., operational processes), the streams are comprised of relatively uniform data that usually has some structure. For use cases where excerpts are persisted offline, such data is treated as “normal” data warehousing or ODS data.
    Value – This is where density of data comes in. In aggregate, the data and the analytics or operational decision-support processes should return value. In microcosm, the value of individual records varies: traditional structured data contains high value and has meaning on its own (high density), whereas log files individually have little meaning (low content, low density, negligible value) but in aggregate provide value. This data is typically not cleansed, and is kept in its raw state.
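The low-density point above can be sketched in a few lines: individual log records carry almost no meaning and are kept raw, but value emerges in aggregate. The log lines and fields here are simplified stand-ins, not a real pipeline.

```python
from collections import Counter

# Hypothetical raw web-server log records: each alone is low-density,
# but together they reveal an aggregate picture of traffic health.
raw_logs = [
    '10.0.0.1 GET /index.jsp 200',
    '10.0.0.2 GET /inventory/index.jsp 404',
    '10.0.0.1 GET /index.jsp 200',
    '10.0.0.3 GET /DeptLogo.gif 200',
]

# No record-by-record cleansing: leave the raw state, derive value in aggregate.
status_counts = Counter(line.rsplit(' ', 1)[-1] for line in raw_logs)
error_rate = status_counts['404'] / len(raw_logs)

print(status_counts)  # e.g. Counter({'200': 3, '404': 1})
print(error_rate)     # 0.25
```

No single record here is worth cleansing; the 25% error rate is the only number with business meaning.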
  • Web ad placement optimization: Mission-critical for B2C websites. High volume and variety. Individual record errors have scant impact. Requires the big picture, not an exact picture.
    Counter-party risk management for capital markets (e.g., HFT, derivatives, etc.): Mission-critical for supporting complex trading issues. High volume and velocity; variety increasing as new sources of data are introduced: transactional data/event streams, along with messages, rich media, log data, etc. Compliance impacts. Requires an exact picture – minimal tolerance for data errors.
    Customer sentiment analysis: Not mission-critical. Analytics will have material impact on the business, but are not dependent on the reliability of atomic data. High volume (from large sample sizes) and variable data from social networking sources. Big picture more important than exact picture. Outliers are likely to be overwhelmed by “good” data – although the definition of “good” data is very loose.
    Managing smart utility grids or urban infrastructure: Mission-critical for utilities that have made the investment. Regulatory compliance often a factor. Machine data – typically retained as raw data. Outliers will be winnowed out – but may provide useful KPIs that challenge what’s normal or may indicate potentially malfunctioning equipment.
    In many of these cases, outlier data may be valuable, even if it is “bad” data. Hold that thought – we’ll get back to it shortly.
  • Bad data may provide key indicators, not only of system malfunction, but that conditions are changing or that assumptions about the data are wrong. So track the incidence of outliers; don’t discard them. For sensory systems, out-of-range data could signify a need to recalibrate devices or to revise the logic for interpreting or processing the data. It could provide advance warning that environmental conditions are starting to change. For higher-level, human-readable data (e.g., social network postings), bad data could point to major misconceptions in the perception of reality: a mistake in a person’s name might actually represent a nickname or alternate identity that people use when they interact with different social groups. Alternately, names of people, places, or things that might be flagged as errors could in reality represent different entities.
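The sensory-system case above can be made concrete: instead of discarding out-of-range readings, track their incidence over a sliding window; a climbing rate suggests drift or a need to recalibrate. The valid range and window size below are illustrative assumptions.

```python
from collections import deque

LOW, HIGH = 0.0, 100.0  # assumed valid range for the instrument
WINDOW = 5              # readings per trend window

def outlier_rate(readings, low=LOW, high=HIGH, window=WINDOW):
    """Yield the fraction of out-of-range readings in each sliding window."""
    buf = deque(maxlen=window)
    rates = []
    for r in readings:
        buf.append(not (low <= r <= high))  # record incidence, not the value
        if len(buf) == window:
            rates.append(sum(buf) / window)
    return rates

readings = [20, 30, 25, 110, 28, 115, 120, 118, 119, 121]  # drifting sensor
rates = outlier_rate(readings)
print(rates)  # [0.2, 0.4, 0.6, 0.8, 0.8, 1.0] -- a steady climb
```

A monotonically rising rate is the recalibration (or “conditions have changed”) signal; deleting the outliers would erase it.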
  • Crowdsourcing: Drawing data from as diverse an array of sources as possible is likely to drown out the noise. This principle could apply to high-density and low-density data alike. As an added line of defense, conducting trend analyses on incoming data streams can further identify whether certain data sources are drifting off the norm.
    Data science: Rather than dwell on quality as a record-by-record process (which would be a resource drain when dealing with huge data sets), examine the data in aggregate and run sophisticated statistical analysis, pattern matching, and/or machine learning techniques to identify potential outliers. Analyses can then be adjusted to screen outliers. Additionally, outliers should be tracked to check for false positives or to spot emerging trends where today’s outliers become part of the “new norm” tomorrow.
    Semantic modeling: An enterprise-architecture strategy that identifies the data at runtime and in turn applies logic on demand to assert that the data is valid. One financial services firm is implementing a business semantic model with a metadata repository from Adaptive, using the OMG Common Warehouse Metamodel (CWM) to manage metadata interchange between data warehousing tools. With semantics established, quality checks on data are enforced at run time, when the analytics are executed.
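The “data science” approach above – statistical screening in aggregate rather than record-by-record cleansing – can be sketched with a robust modified z-score (median and MAD). The 3.5 cutoff is a common heuristic used here as an illustrative assumption, not a fixed rule.

```python
import statistics

def screen_outliers(values, cutoff=3.5):
    """Split values into inliers and outliers using a median/MAD z-score."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    inliers, outliers = [], []
    for v in values:
        z = 0.6745 * abs(v - med) / mad if mad else 0.0
        (outliers if z > cutoff else inliers).append(v)
    # Keep the outliers for tracking: today's outlier may be the "new norm".
    return inliers, outliers

values = [10, 11, 9, 10, 12, 10, 11, 500]
inliers, outliers = screen_outliers(values)
print(outliers)  # [500]
```

Median and MAD are used instead of mean and standard deviation because a single extreme value inflates the standard deviation enough to mask the very outlier being screened.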

Data Quality & Hadoop – Presentation Transcript

  • Hadoop: Do Data Warehousing rules apply? Tony Baer tony.baer@ovum.com Oct 24, 2012 © Copyright Ovum. All rights reserved. Ovum is a subsidiary of Informa plc.
  • Agenda  Challenges traditional data stewardship practice  Privacy – is all the world a stage?  Limits to data lifecycle?  Data quality: the big, the bad, the ugly – and it all might be good! © Copyright Ovum. All rights reserved. Ovum is an Informa business.
  • Data stewardship challenges – What’s old is new  Remember?  Back to undifferentiated ‘gobblobs’ of data  Programmatic access reigns  File systems, not (always) tables  Batch is back  But… volume, variety, velocity, and where’s the value??  Just because you can, should you?
    Sample log and code fragments from the slide:
    10.102.8.152 - - [05/Nov/2003:00:19:54 -0500] "GET /inventory/index.jsp HTTP/1.1" 200 4028 "http://www.mycompany.com/index.jsp" "Mozilla/4.08 [en] (Win98; I ;Nav)"
    192.168.114.201, -, 03/20/01, 7:55:20, W3SVC2, SALES1, 172.21.13.45, 4502, 163, 3223, 200, 0, GET, /DeptLogo.gif, -,
    172.16.255.255, anonymous, 03/20/01, 23:58:11, MSFTPSVC, SALES1, 172.16.255.255, 60, 275, 0, 0,
    if index(tempvalue,?) then tempvalue=scan(tempvalue,1,?); else if index(tempvalue,&)>1 then tempvalue=scan(tempvalue,1,&);
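The Apache-style access log shown on the slide is exactly the kind of “programmatic access” data the bullet describes. A minimal sketch of parsing it with a regex over the combined log format; the pattern covers the fields on the slide but not every escaping corner case.

```python
import re

# Named groups follow the Apache combined log format fields.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+)'
)

line = ('10.102.8.152 - - [05/Nov/2003:00:19:54 -0500] '
        '"GET /inventory/index.jsp HTTP/1.1" 200 4028 '
        '"http://www.mycompany.com/index.jsp" "Mozilla/4.08 [en] (Win98; I ;Nav)"')

m = LOG_PATTERN.match(line)
fields = m.groupdict()
print(fields['path'], fields['status'])  # /inventory/index.jsp 200
```

Text parsing like this is the well-established half of the problem; deciding whether the parsed records warrant cleansing at all is the question the rest of the deck takes up.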
  • Data stewardship questions for Big Data  Can we, should we “control” this data?  Are there limits to how much we should know?  Can we just keep piling up data forever?  Can we cleanse terabytes of data?  Do we still need “good” data?
  • Agenda  Challenges traditional data stewardship practice  Privacy – is all the world a stage?  Limits to data lifecycle?  Data quality: the big, the bad, the ugly – and it all might be good!
  • Privacy – the more things change… “You have zero privacy anyway…. Get over it” -- Scott McNealy, 1999  Facebook does not actually delete images… but instead merely removes the links – a fix “is in sight” -- ZDNet, 2/6/12  Facebook agrees to 20 years of federal privacy audits -- NY Times, 11/29/11
  • What privacy? Florida made $63m last year by selling DMV information (name, date of birth, type of vehicle driven) to companies like LexisNexis & ShadowSoft. -- Terence Craig & Mary Ludloff, Privacy and Big Data (O’Reilly Media, 2011)
  • Big Data privacy 101 – Don’t be creepy  Governance problem first, technology second  Understand the relationship with your customers & business partners  Keep communications in context  Don’t catch your customers by surprise  The law still trying to catch up
    “How Companies Learn Your Secrets”: “My daughter got this in the mail!” he said. “She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?” -- NY Times 2/16/12
  • Agenda  Challenges traditional data stewardship practice  Privacy – is all the world a stage?  Limits to data lifecycle?  Data quality: the big, the bad, the ugly – and it all might be good!
  • Data lifecycle – How long can this go on?  Google, Yahoo, Facebook, etc. don’t deprecate web data  Hadoop designed for economical scale-out  Moore’s Law, declining cost of storage  Is Hadoop Archive the answer?  Is Hadoop the new tape?  Management & skills will be the limit  (Aerial view of Quincy, WA data ctrs)
  • Agenda  Challenges traditional data stewardship practice  Privacy – is all the world a stage?  Limits to data lifecycle?  Data quality: the big, the bad, the ugly – and it all might be good!
  • Data Quality & Hadoop – Big Quality Questions  Can we cleanse terabytes of data?  Do we still need “good” data?  Are there new approaches to cleansing Big Data?
  • Framing the issue  “Garbage in, garbage out,” but DW forced the issue  Traditional approaches  Profiling, cleansing, MDM  DW vs. Hadoop data quality challenges  Known data sets & known criteria vs. vaguely known  Bounded vs. less bounded tasks  Limitations of MapReduce*  Cleansing & transformation within a single Map operation  Profiling & matching of unstructured data  Matching of data in operations without inter-process communications  *Source: David Loshin, "Hadoop and Data Quality, Data Integration, Data Analysis" at http://www.dataroundtable.com/?p=8841
  • Is data quality necessary for Hadoop?  The App  How mission-critical?  Regulatory compliance impacts?  What degree of business impact?  The Data  The 4V’s (volume, variety, velocity, value) determine what approaches to quality are feasible
  • Examples  Web ad placement optimization  Counter-party risk management for capital markets  Customer sentiment analysis  Managing smart utility grids or urban infrastructure
  • Bad data may be good  Sensory data  Outlier or drift?  Time to recalibrate devices?  Time to perform preventive maintenance?  Are new/unaccounted environmental factors skewing readings?  Human-readable data  Flawed concept of reality?  Flawed assumptions on data meaning?  Changes producing ‘new norm’
  • Big Data quality in Hadoop – Emergent approaches  Crowdsourcing data –  Collect data far & wide from as many diverse sources as possible. Torrents of data overcome the noise.  Comparative trend analysis of incoming streams to dynamically ID the norm or sweet spot of “good” data  Apply data science to “correct the dots”  Don’t go record by record. Statistically analyze the data set in aggregate.  Iteratively analyze & re-analyze nature of data, keep analyzing outliers  Apply off-the-wall approaches  Enterprise Architectural approach  Semantic (domain) model-driven  Apply cleansing logic at run time  Critical for sensitive, regulatory-driven apps
  • Summary  Challenges traditional data stewardship practice  Combination of old & new  Privacy – is all the world a stage?  Best practices, legal requirements still in flux  Don’t be creepy!  Limits to data lifecycle?  Few enterprises are Google or Facebook  Ability to manage large infrastructure will be major limit  Data quality  Strategy depends on type of app & data set(s)  A spectrum of approaches -- from none to classic ETL to aggregate statistical  No single silver bullet
  • Thank you  Tony Baer  Ovum  tony.baer@ovum.com