Big Data Quality Panel: Diachron Workshop @EDBT

1) Traditional approaches to ensuring data quality such as quality assurance and curation face challenges from big data's volume, velocity, and variety characteristics. 2) It is difficult to determine general thresholds for when data quality issues can be ignored as the importance varies between different analytics algorithms. 3) The ReComp decision support system aims to use metadata about past analytics tasks to determine when knowledge needs to be refreshed due to changes in big data or models.

P.Missier-2016
Diachronworkshoppanel
Big Data Quality Panel
Diachron Workshop @EDBT
Panta Rhei (Heraclitus, through Plato)
Paolo Missier
Newcastle University, UK
Bordeaux, March 2016
(*) Painting by Johannes Moreelse

The “curse” of Data and Information Quality
• Quality requirements are often specific to the application that makes use of the data (“fitness for purpose”)
• Quality Assurance (the actions required to meet the requirements) is specific to the data types
A few generic quality techniques exist (linkage, blocking, …), but most solutions are ad hoc

V for “Veracity”?
Q3. To what extent are traditional approaches for diagnosis, prevention and curation challenged by the Volume, Variety and Velocity characteristics of Big Data?
V | Issues | Example
High Volume | Scalability: what kinds of QC steps can be parallelised? Human curation not feasible | Parallel meta-blocking
High Velocity | Statistics-based diagnosis, data-type specific. Human curation not feasible | Reliability of sensor readings
High Variety | Heterogeneity is not a new issue! | Data fusion for decision making
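As an illustration of the statistics-based diagnosis listed for high-velocity data, here is a minimal Python sketch that flags implausible sensor readings in a single pass. It is one possible approach, not a method taken from the panel: the choice of a running z-score (Welford's online algorithm), the names `RunningStats` and `flag_implausible`, the threshold and the sample readings are all assumptions made for the example.

```python
import math


class RunningStats:
    """Welford's online mean/variance: one pass, constant memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        # Sample standard deviation; 0.0 until at least two values are seen.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def flag_implausible(readings, z_threshold=3.0, warmup=5):
    """Return the indices of readings far from the running mean.

    z_threshold and warmup are illustrative defaults, not recommended values.
    """
    stats = RunningStats()
    flagged = []
    for i, x in enumerate(readings):
        if stats.n >= warmup:
            sd = stats.std()
            if sd > 0 and abs(x - stats.mean) / sd > z_threshold:
                flagged.append(i)
                continue  # keep the suspect reading out of the statistics
        stats.update(x)
    return flagged


# Hypothetical temperature readings with one obvious glitch at index 6.
readings = [20.1, 20.3, 19.9, 20.0, 20.2, 20.1, 85.0, 20.0, 19.8]
suspect = flag_implausible(readings)
print(suspect)
```

The check needs no human in the loop and keeps only three numbers of state per sensor, which is what makes it usable at high velocity. It is still data-type specific: a z-score suits roughly stationary numeric readings and little else.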
Recent contributions on Quality & Big Data (IEEE Big Data 2015)
Chung-Yi Li et al., Recommending missing sensor values
Yang Wang and Kwan-Liu Ma, Revealing the fog-of-war: A visualization-directed, uncertainty-aware approach for exploring high-dimensional data
S. Bonner et al., Data quality assessment and anomaly detection via map/reduce and linked data: A case study in the medical domain
V. Efthymiou, K. Stefanidis and V. Christophides, Big data entity resolution: From highly to somehow similar entity descriptions in the Web
V. Efthymiou, G. Papadakis, G. Papastefanatos, K. Stefanidis and T. Palpanas, Parallel meta-blocking: Realizing scalable entity resolution over large, heterogeneous data

Can we ignore quality issues?
Q4: How difficult is it to evaluate the threshold below which data quality issues can be ignored?
• Some analytics algorithms may be tolerant to {outliers, missing values, implausible values} in the input
• But this “meta-knowledge” is specific to each algorithm. Hard to derive general models
• i.e. the importance and danger of false positives (FP) / false negatives (FN)
A possible incremental learning approach:
Build a database of past analytics tasks:
H = {<In, P, Out>}
Try and learn (In, Out) correlations over a growing collection H
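The incremental approach can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the panel's implementation: each record in H summarises In and Out by a single number (here, fraction of missing values and accuracy), and the "correlation" is a plain Pearson coefficient computed per process P. The names `TaskRecord`, `History` and `sensitivity`, and all the data, are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class TaskRecord:
    """One element <In, P, Out> of the history H (fields are assumptions)."""
    process: str           # P: which analytics process was run
    input_quality: float   # summary of In, e.g. fraction of missing values
    output_quality: float  # summary of Out, e.g. accuracy of the result


def pearson(xs, ys):
    """Pearson correlation, or None when it is undefined."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


class History:
    """A growing collection H of past analytics tasks."""

    def __init__(self):
        self.records = []

    def add(self, record):
        self.records.append(record)

    def sensitivity(self, process):
        """(In, Out) correlation observed so far for one process."""
        rs = [r for r in self.records if r.process == process]
        return pearson([r.input_quality for r in rs],
                       [r.output_quality for r in rs])


h = History()
# Hypothetical runs: classifier_A degrades as missing values increase...
for missing, acc in [(0.0, 0.90), (0.1, 0.85), (0.2, 0.80), (0.3, 0.75)]:
    h.add(TaskRecord("classifier_A", missing, acc))
# ...while classifier_B appears tolerant to them.
for missing, acc in [(0.0, 0.88), (0.1, 0.87), (0.2, 0.87), (0.3, 0.88)]:
    h.add(TaskRecord("classifier_B", missing, acc))

print(h.sensitivity("classifier_A"), h.sensitivity("classifier_B"))
```

A strongly negative value for a process suggests its output tracks input quality closely, so quality issues cannot be ignored for it; a value near zero suggests tolerance. The estimate is per algorithm, consistent with the point that this meta-knowledge does not generalise, and it should sharpen as H grows.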

Data to Knowledge
The Data-to-Knowledge pattern of the Knowledge Economy:
[Figure: Big Data feeds The Big Analytics Machine, which produces “Valuable Knowledge”. Meta-knowledge (algorithms, tools, middleware, reference datasets) supports the machine.]

The missing element: time
[Figure: the Data-to-Knowledge pattern with time axes (t) added. Labels: Big Data; The Big Analytics Machine; “Valuable Knowledge”; versions V1, V2, V3; Meta-knowledge (algorithms, tools, middleware, reference datasets).]
Change → data currency

The ReComp decision support system
Observe change
• In big data
• In meta-knowledge
Assess and measure
• Knowledge decay
Estimate
• Cost and benefits of refresh
Enact
• Reproduce (analytics) processes
Currency of data and of meta-knowledge:
- What knowledge should be refreshed?
- When, how?
- Cost / benefits
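To make the "what should be refreshed, and at what cost / benefit" question concrete, here is a minimal Python sketch of the Estimate step. It is an assumption-laden illustration rather than the ReComp design: the scoring rule (benefit = decay × value, refresh only when benefit exceeds cost), the class `KnowledgeAsset`, the function `prioritise` and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KnowledgeAsset:
    name: str
    decay: float         # assessed decay: 0.0 (current) to 1.0 (obsolete)
    value: float         # assumed value of the asset when fully current
    refresh_cost: float  # estimated cost of re-running the analytics


def prioritise(assets):
    """Return the names of assets worth refreshing, best net benefit first."""
    scored = [(a.decay * a.value - a.refresh_cost, a) for a in assets]
    worthwhile = [(net, a) for net, a in scored if net > 0]
    worthwhile.sort(key=lambda pair: pair[0], reverse=True)
    return [a.name for _, a in worthwhile]


# Hypothetical assets observed after a change in the data or meta-knowledge.
assets = [
    KnowledgeAsset("asset_a", decay=0.8, value=100.0, refresh_cost=30.0),
    KnowledgeAsset("asset_b", decay=0.1, value=100.0, refresh_cost=30.0),
    KnowledgeAsset("asset_c", decay=0.5, value=200.0, refresh_cost=40.0),
]
to_refresh = prioritise(assets)
print(to_refresh)
```

Here `asset_b` has decayed too little to justify its refresh cost and is left alone, while the other two are queued for the Enact step in order of net benefit. In practice the decay and cost figures are the hard part; the ranking itself is trivial once they are available.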

ReComp: 2016-18
[Figure: ReComp DSS architecture. Labels: Change Events; Diff(.,.) functions; “business rules”; Prioritised KAs; Cost estimates; Reproducibility assessment; ReComp DSS (Observe change, Assess and measure, Estimate, Enact); History DB (past KAs and their metadata → provenance); META-K.]
KA: Knowledge Assets

Metadata + Analytics
The knowledge is in the metadata!
Research hypothesis: supporting the analysis can be achieved through analytical reasoning applied to a collection of metadata items, which describe details of past computations.
[Figure labels: identify recomp candidates; large-scale recomp; estimate change impact; estimate reproducibility cost/effort; Change Events; Change Impact Model; Cost Model; Model updates.]
Meta-K
• Logs
• Provenance
• Dependencies
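One way metadata alone can identify recomputation candidates is by walking recorded dependencies. The Python sketch below assumes provenance is available as a simple mapping from each derived item to the items it was derived from; that representation, the function name `recomp_candidates` and the item names are all hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict, deque


def recomp_candidates(derived_from, changed):
    """Return every item that depends, directly or not, on `changed`.

    derived_from maps a derived item to the list of items it was computed
    from (an assumed, simplified stand-in for real provenance records).
    """
    # Invert the dependency records: input -> items derived from it.
    used_by = defaultdict(set)
    for output, inputs in derived_from.items():
        for item in inputs:
            used_by[item].add(output)

    # Breadth-first traversal downstream of the changed item.
    affected = set()
    queue = deque([changed])
    while queue:
        item = queue.popleft()
        for output in used_by[item]:
            if output not in affected:
                affected.add(output)
                queue.append(output)
    return sorted(affected)


# Hypothetical dependency records for past computations.
derived_from = {
    "cleaned_data": ["raw_data"],
    "model": ["cleaned_data", "reference_dataset"],
    "report_1": ["model"],
    "report_2": ["other_raw_data"],
}
candidates = recomp_candidates(derived_from, "reference_dataset")
print(candidates)
```

A change event on `reference_dataset` marks `model` and `report_1` as candidates without touching the data itself, while `report_2` is unaffected. Whether a candidate is actually worth recomputing is a separate question for the change impact and cost models.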
