
Big Data Quality Panel : Diachron Workshop @EDBT


My contribution to the Diachron workshop at the EDBT'16 conference.

Transcript

  1. Big Data Quality Panel, Diachron Workshop @EDBT
     Panta Rhei (Heraclitus, through Plato)
     Paolo Missier, Newcastle University, UK
     Bordeaux, March 2016
     (*) Painting by Johannes Moreelse
  2. The “curse” of data and information quality
     • Quality requirements are often specific to the application that makes use of the data (“fitness for purpose”)
     • Quality assurance (the actions required to meet those requirements) is specific to the data types
     There are a few generic quality techniques (record linkage, blocking, …), but mostly ad hoc solutions (see the sketch below).
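The slide names record linkage and blocking among the few generic techniques. As an illustration (mine, not the author's), here is a minimal sketch of blocking for entity resolution; the records and the blocking key (last token of the name) are invented for the example.

```python
# Minimal sketch of blocking for entity resolution (illustrative only).
# Records are compared only within a block, which cuts the O(n^2)
# pairwise-comparison cost of naive record linkage.
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "name": "Paolo Missier", "affil": "Newcastle University"},
    {"id": 2, "name": "P. Missier", "affil": "Newcastle Univ."},
    {"id": 3, "name": "Kostas Stefanidis", "affil": "FORTH"},
]

def blocking_key(rec):
    # Assumption for this sketch: block on the last token of the name.
    return rec["name"].split()[-1].lower()

blocks = defaultdict(list)
for rec in records:
    blocks[blocking_key(rec)].append(rec)

# Candidate pairs are generated per block instead of over all records.
candidate_pairs = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
print(candidate_pairs)  # [(1, 2)] -- only same-block records are compared
```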
  3. V for “Veracity”?
     Q3. To what extent are traditional approaches for diagnosis, prevention and curation challenged by the Volume, Variety and Velocity characteristics of Big Data?
     • High Volume. Issues: scalability (what kinds of QC step can be parallelised?); human curation is not feasible. Example: parallel meta-blocking
     • High Velocity. Issues: statistics-based diagnosis is data-type specific; human curation is not feasible. Example: reliability of sensor readings
     • High Variety. Issues: heterogeneity is not a new issue! Example: data fusion for decision making
     Recent contributions on quality & Big Data (IEEE Big Data 2015):
     • Chung-Yi Li et al., Recommending missing sensor values
     • Yang Wang and Kwan-Liu Ma, Revealing the fog-of-war: A visualization-directed, uncertainty-aware approach for exploring high-dimensional data
     • S. Bonner et al., Data quality assessment and anomaly detection via map/reduce and linked data: A case study in the medical domain
     • V. Efthymiou, K. Stefanidis and V. Christophides, Big data entity resolution: From highly to somehow similar entity descriptions in the Web
     • V. Efthymiou, G. Papadakis, G. Papastefanatos, K. Stefanidis and T. Palpanas, Parallel meta-blocking: Realizing scalable entity resolution over large, heterogeneous data
  4. Can we ignore quality issues?
     Q4: How difficult is the evaluation of the threshold under which data quality can be ignored?
     • Some analytics algorithms may be tolerant to outliers, missing values, or implausible values in their input
     • But this “meta-knowledge” is specific to each algorithm, so it is hard to derive general models
     • Hence the importance, and the danger, of false positives / false negatives
     A possible incremental learning approach (sketched below):
     • Build a database of past analytics tasks: H = {<In, P, Out>}
     • Try to learn (In, Out) correlations over the growing collection H
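A minimal sketch of this incremental approach, with hypothetical names and invented data: each past task <In, P, Out> is summarised by an input quality profile and an output verdict, and a simple per-algorithm classifier is fitted on the growing history H to predict when input-quality issues can be ignored.

```python
# Sketch (hypothetical design): learn, from a history of past analytics
# tasks H = {<In, P, Out>}, when input-quality problems can be ignored.
from dataclasses import dataclass, field
from sklearn.linear_model import LogisticRegression

@dataclass
class TaskRecord:
    in_profile: list[float]  # e.g. [missing_rate, outlier_rate]
    process: str             # identifier of the analytics algorithm P
    out_ok: bool             # did the output meet its quality requirement?

@dataclass
class HistoryDB:
    records: list[TaskRecord] = field(default_factory=list)

    def add(self, rec: TaskRecord) -> None:
        self.records.append(rec)

    def fit_tolerance_model(self, process: str) -> LogisticRegression:
        # Learn (In -> Out) correlations for one algorithm; the slide notes
        # this meta-knowledge is algorithm-specific, hence the filter.
        rows = [r for r in self.records if r.process == process]
        X = [r.in_profile for r in rows]
        y = [r.out_ok for r in rows]
        return LogisticRegression().fit(X, y)

h = HistoryDB()
h.add(TaskRecord([0.01, 0.00], "kmeans", True))
h.add(TaskRecord([0.02, 0.01], "kmeans", True))
h.add(TaskRecord([0.30, 0.10], "kmeans", False))
h.add(TaskRecord([0.25, 0.20], "kmeans", False))
model = h.fit_tolerance_model("kmeans")
# Predict whether a new input's quality issues are safe to ignore:
print(model.predict([[0.05, 0.02]]))  # likely [ True]
```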
  5. Data to Knowledge
     The Data-to-Knowledge pattern of the Knowledge Economy:
     Big Data → The Big Analytics Machine → “Valuable Knowledge”
     The machine combines meta-knowledge: algorithms, tools, middleware, reference datasets
  6. The missing element: time
     Big Data → The Big Analytics Machine → “Valuable Knowledge” (versions V1, V2, V3)
     Both the data and the meta-knowledge (algorithms, tools, middleware, reference datasets) evolve over time t
     Change → data currency
  7. The ReComp decision support system
     • Observe change: in big data, in meta-knowledge
     • Assess and measure: knowledge decay
     • Estimate: cost and benefits of a refresh
     • Enact: reproduce the (analytics) processes
     Currency of data and of meta-knowledge raises the questions:
     • What knowledge should be refreshed?
     • When, and how?
     • At what cost, for what benefit?
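Read as a control loop, the four steps might look like the following sketch. All interfaces here are hypothetical illustrations, not actual ReComp code: a knowledge asset is refreshed only when its estimated decay (the benefit of refreshing) exceeds the cost of recomputation.

```python
# Sketch of the ReComp observe / assess / estimate / enact loop.
# All names and signatures below are invented for illustration.
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class ChangeEvent:
    source: str        # changed dataset or meta-knowledge item
    magnitude: float   # how large the change is

@dataclass
class KnowledgeAsset:
    name: str
    refresh: Callable[[], None]  # re-runs the analytics process

def recomp_loop(events: Iterable[ChangeEvent],
                affected: Callable[[ChangeEvent], list[KnowledgeAsset]],
                decay: Callable[[KnowledgeAsset, ChangeEvent], float],
                cost: Callable[[KnowledgeAsset], float]) -> None:
    for ev in events:                      # 1. observe change
        for ka in affected(ev):            # which assets depend on it?
            benefit = decay(ka, ev)        # 2. assess / measure decay
            if benefit > cost(ka):         # 3. estimate cost vs. benefit
                ka.refresh()               # 4. enact: reproduce the process
```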
  8. ReComp: 2016-18
     Inputs to the ReComp DSS: change events, diff(.,.) functions, “business rules”
     Outputs: prioritised KAs (Knowledge Assets), cost estimates, a reproducibility assessment
     The DSS cycles through: observe change → assess and measure → estimate → enact
     It draws on a History DB (META-K): past KAs and their metadata → provenance
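As an illustration of the diff(.,.) functions (hypothetical, not the project's actual interface), a diff over two versions of a reference dataset could report additions, removals, updates, and a scalar change magnitude for the business rules to act on:

```python
# Hypothetical sketch of a diff(.,.) function over two versions of a
# reference dataset, emitting a change event the ReComp DSS can consume.
def diff(old: dict, new: dict) -> dict:
    added   = new.keys() - old.keys()
    removed = old.keys() - new.keys()
    updated = {k for k in old.keys() & new.keys() if old[k] != new[k]}
    n = max(len(old | new), 1)
    return {
        "added": added, "removed": removed, "updated": updated,
        # one simple magnitude: the fraction of entries touched
        "magnitude": (len(added) + len(removed) + len(updated)) / n,
    }

v1 = {"geneA": "variant1", "geneB": "variant2"}
v2 = {"geneA": "variant1b", "geneC": "variant3"}
print(diff(v1, v2))
# {'added': {'geneC'}, 'removed': {'geneB'}, 'updated': {'geneA'},
#  'magnitude': 1.0}
```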
  9. Metadata + Analytics
     The knowledge is in the metadata!
     Research hypothesis: supporting the analysis can be achieved through analytical reasoning applied to a collection of metadata items which describe details of past computations.
     Meta-K holds logs, provenance, and dependencies. Driven by change events, it supports:
     • identifying recomputation candidates (→ large-scale recomputation)
     • estimating change impact, via a Change Impact Model
     • estimating reproducibility cost/effort, via a Cost Model
     Both models are continually updated as new computations are observed.
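One way to make the hypothesis concrete (my sketch, with invented data, not the actual Meta-K schema): provenance and dependency metadata form a graph from data items to knowledge assets, so identifying recomputation candidates reduces to a reachability query over that graph.

```python
# Sketch: identify recomputation candidates from provenance-derived
# dependencies (illustrative data only).
from collections import defaultdict, deque

# edges: data or meta-knowledge item -> artefacts derived from it
deps = defaultdict(set)
deps["refdb_v1"] = {"analysis_run_42"}
deps["analysis_run_42"] = {"KA_patient_report"}

def recomp_candidates(changed: str) -> set[str]:
    # BFS over the dependency graph: everything reachable from the
    # changed item is a candidate for recomputation.
    seen, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for nxt in deps[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(recomp_candidates("refdb_v1"))
# contains both 'analysis_run_42' and 'KA_patient_report'
```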
