This document discusses managing uncertainty in data and data quality problems. It describes how most data quality problems can be modeled as uncertainty in the data. Probabilistic databases can store, query, and analyze data while accounting for this uncertainty. This allows for scalable and "good enough" initial data integration that can improve over time, avoiding excessive "data fiddling". Measuring expected precision and recall provides a way to quantitatively assess quality and know when cleaning efforts should stop.
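As a rough illustration of how expected precision and recall could be computed for an uncertain query answer, here is a minimal sketch. It assumes each answer item carries a probability of being correct and that a ground-truth set is available for evaluation. The function name `expected_precision_recall`, the record identifiers, and the choice to use a ratio of expectations are assumptions made for this example, not taken from the presentation.

```python
from typing import Dict, Set, Tuple


def expected_precision_recall(
    answer_probs: Dict[str, float], truth: Set[str]
) -> Tuple[float, float]:
    """Return (expected precision, expected recall) of a probabilistic answer.

    answer_probs maps each returned item to the probability that it belongs
    in the answer; truth is the set of items that really belong in it.
    """
    # Expected number of correct items: probability mass on true items.
    expected_correct = sum(p for item, p in answer_probs.items() if item in truth)
    # Expected answer size: total probability mass of everything returned.
    expected_size = sum(answer_probs.values())
    precision = expected_correct / expected_size if expected_size > 0 else 0.0
    recall = expected_correct / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical answer: three records with their probabilities.
answer = {"rec1": 0.9, "rec2": 0.6, "rec3": 0.2}
truth = {"rec1", "rec2", "rec4"}
precision, recall = expected_precision_recall(answer, truth)
print(f"expected precision={precision:.3f}, expected recall={recall:.3f}")
```

Tracking these two numbers after each cleaning step gives one possible stopping signal: once further effort no longer moves them noticeably, the data may be "good enough" for the analysis at hand.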
Managing uncertainty in data - Presentation at Data Science Northeast Netherlands Meetup 14 Jan 2016
1. MANAGING UNCERTAINTY IN DATA
THE KEY TO EFFECTIVE MANAGEMENT OF DATA QUALITY PROBLEMS
MAURICE VAN KEULEN
2. Paradigms of scientific method
Empiricism
Mathematical modeling
Simulation
A new paradigm: Data-intensive Scientific Discovery
Combining and analyzing data in novel ways is capable of tackling research questions that could not be answered before
14 Jan 2016Managing uncertainty in data: the key to effective management of data quality
problems
2
REVOLUTION IN SCIENTIFIC METHOD
Bio-Informatics professor:
“ PhD of 4 years, 3 years
devoted to ‘data fiddling’ ”
3. Research on pregnancy processes based on Electronic
Patient Dossiers (EPDs) of some population of women
Select consult & treatment records from their EPDs
from multiple sources
After first analysis one discovers many records not
related to pregnancy (e.g., dermatologist consult)
Assumption that all records that belong to a pregnant
woman are related to pregnancy is wrong, hence also
the selection criterion!
There is no objective means to ascertain this such as a
field ‘related to pregnancy’
A FIRST STORY: PREGNANCY RESEARCH
4. A painstaking process follows with specifying filter
rules and manually inspecting samples of results
Imperfect process so noisy records remain!
Wrong diagnoses cause more records to be
erroneously in or out, hence more noisy records
Then, one looks at a sample and notices something
strange in the times of consults: many appear close to
each other and in the evening
Modification time of EPD record (what is recorded)
does not reflect actual moment of activity (semantics),
hence sequence and duration noise
A FIRST STORY: PREGNANCY RESEARCH
5. GEO-SOCIAL RECOMMENDATION: GPS TRAJECTORIES
• Detect visits from trajectories
• GPS traces from mobile phones
• Point-Of-Interest (POI) data
harvested from the internet
• Purpose: construct profiles of
• Customers
• Products
• for recommendation
• Holiday homes
• Greeting cards
6. Substantial amount of money involved in fraud. Ease of
committing fraud incites otherwise decent people to do it
as well. Danger to society
Inspect where there is a high risk of fraud
Example ISZW: labor market, labor circumstances, etc.
But: government data represents paper reality!
Include traces from the internet (social media, web
forums): Customers, employees, and by-standers
leave behind observations and opinions
But natural language: about which company do they
talk?
DATA-DRIVEN FRAUD RISK ANALYSIS
7. Paris Hilton stayed in the Paris Hilton
Lady Gaga - Speechless live @ Helsinki 10/13/2010
http://www.youtube.com/watch?v=yREociHyijk . . .
@ladygaga also talks about her Grampa who died
recently
Laelith Demonia has just defeated liwanu Hird.
Career wins is 575, career losses is 966.
Adding Win7Beta, Win2008, and Vista x64 and x86
images to munin. #wds
history should show that bush jr should be in jail or at
least never should have been president
NATURAL LANGUAGE PROCESSING: AMBIGUITY ABOUNDS
8. Search (finding the needle in the haystack)
Information extraction from unstructured sources
Natural language processing
Web harvesting
(both produce lower quality structured data)
Data quality management
Responsible analytics is (among other things)
“Knowing how data quality problems in the source
data affect the analytical results”
TECHNOLOGY WE WORK ON
WE = DATABASES GROUP FROM UNIVERSITY OF TWENTE
Equally true for Business Analytics
9. CustID Sales Name
1234 6000 John
2345 5000 Mary
3456 12000 Bart
… … …
IMPACT OF DATA IMPERFECTIONS
SELECT SUM(Sales)
FROM CustSales
3423000
Sample of data looks fine
Result of analysis looks perfectly reasonable
If you don’t look hard enough, if you don’t properly pay attention to it
… you will be unaware
… that you are possibly looking at significantly erroneous figures!!!
10. CustID Sales Name
1234 6000 John
2345 5000 Mary
3456 12000 Bart
… … …
IMPACT OF DATA IMPERFECTIONS
SELECT SUM(Sales)
FROM CustSales
3423000
CustID Sales Name
6789 2 Tom
4567 6000 Jon
5678 NULL Nina
… … …
????
Wrong figures included
Missing figures
Double counting
etc.
Many more problems at value, record, schema, source, trust levels
11. Probabilistic database technology can store, query,
analyze, and reason with data, taking into account the possible
influence of data quality problems on the results
Treats data quality problems as a fact of life
Responsible analytics: know deficiencies of results
Generic and scalable approach and technology
Nice properties for application: postpone resolution/cleaning;
pay-as-you-go; good-is-good-enough; human-in-the-loop
PROBABILISTIC DATABASES TO THE RESCUE
12. Let’s go for an initial integration that can readily and meaningfully be used
“Good is good enough” for meaningful use in many applications
(can be achieved 10x earlier)
Let it improve during use
PROBABILISTIC DATA INTEGRATION
Initial integration: partial data integration; enumerate cases for remaining problems; store data with uncertainty in UDBMS
Continuous improvement: use (analytics); measure quality; improve data quality
Postpone problems
Stop earlier
Pay as you go
Human in the loop
13. COMBINING DATA …
Keulen, M. (2012) Managing Uncertainty: The Road
Towards Better Data Interoperability. IT - Information
Technology, 54 (3). pp. 138-146. ISSN 1611-2776
Car brand Sales
B.M.W. 25
Mercedes 32
Renault 10
Car brand Sales
BMW 72
Mercedes-Benz 39
Renault 20
Car brand Sales
Bayerische Motoren Werke 8
Mercedes 35
Renault 15
Car brand Sales
B.M.W. 25
Bayerische Motoren Werke 8
BMW 72
Mercedes 67
Mercedes-Benz 39
Renault 45
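The combined table is what you get when the three source tables are merged and sales are summed per brand, with brand names matched as exact strings. A minimal Python sketch of that naive combination follows; the variable and function names are illustrative assumptions, not something from the slides.

```python
from collections import Counter

# The three source tables from the slide, as brand -> sales mappings.
source_a = {"B.M.W.": 25, "Mercedes": 32, "Renault": 10}
source_b = {"BMW": 72, "Mercedes-Benz": 39, "Renault": 20}
source_c = {"Bayerische Motoren Werke": 8, "Mercedes": 35, "Renault": 15}


def combine(*sources):
    """Sum sales per brand, treating brand names as exact strings."""
    combined = Counter()
    for source in sources:
        combined.update(source)  # Counter.update adds values for equal keys
    return dict(combined)


combined = combine(source_a, source_b, source_c)
for brand in sorted(combined, key=str.lower):
    print(f"{brand:26}{combined[brand]}")
```

Only identically spelled names are merged ("Mercedes" and "Renault"), so the three spellings of BMW and the two spellings of Mercedes remain separate rows.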
14. … AND THE PROBLEM OF SEMANTIC DUPLICATES
Car brand Sales
B.M.W. 25
Bayerische Motoren Werke 8
BMW 72
Mercedes 67
Mercedes-Benz 39
Renault 45
Preferred customers …
SELECT SUM(Sales)
FROM CarSales
WHERE Sales>100
0
‘No preferred customers’
15. Database (records d1 to d6)
Car brand Sales
B.M.W. 25
Bayerische Motoren Werke 8
BMW 72
Mercedes 67
Mercedes-Benz 39
Renault 45
Real world (of car brands): objects o1 to o4
The mapping ω relates the database records to the real-world objects
SEMANTIC DUPLICATES
16. MOST DATA QUALITY PROBLEMS CAN BE MODELED AS UNCERTAINTY IN DATA
Row Car brand Sales Condition
1 B.M.W. 25
2 Bayerische Motoren Werke 8
3 BMW 72
4 Mercedes 67 X=0
5 Mercedes-Benz 39 X=0
6 Renault 45
Mercedes 106 X=1 Y=0
Mercedes-Benz 106 X=1 Y=1
Random variables and their probabilities:
X=0 4 and 5 different 0.2
X=1 4 and 5 the same 0.8
Y=0 “Mercedes” correct name 0.5
Y=1 “Mercedes-Benz” correct name 0.5
B.M.W. / BMW / Bayerische Motoren Werke analogously
Run some duplicate detection tool
17. Looks like ordinary database
Several “possible” answers or approximate answers to
queries
Important: Scalability (big data!)
Sales of “preferred customers”
SELECT SUM(sales)
FROM carsales
WHERE sales >= 100
IMPORTANT TOOL: PROBABILISTIC DATABASE
SUM(sales) P
0 14%
105 6%
106 56%
211 24%
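The answer distribution can be reproduced by enumerating the possible worlds and running the query in each. The sketch below is a brute-force illustration, not how a probabilistic database evaluates queries. It rests on two assumptions that are not on the slides: the probability 0.3 that the three BMW spellings denote one brand (the value implied by the result table), and modeling the BMW case as all-or-nothing. The variable Y is left out because it only affects which name is kept, not the sum.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# P(X=1): "Mercedes" and "Mercedes-Benz" are the same brand (from the slide).
P_MERCEDES_SAME = Fraction(8, 10)
# Assumption: probability that all three BMW spellings are one brand.
# Not stated on the slide; 0.3 is the value implied by the result table.
P_BMW_SAME = Fraction(3, 10)


def world_sales(mercedes_same, bmw_same):
    """Sales figures per brand in one possible world."""
    sales = [45]  # Renault
    sales += [67 + 39] if mercedes_same else [67, 39]
    sales += [25 + 8 + 72] if bmw_same else [25, 8, 72]
    return sales


def preferred_sum_distribution(threshold=100):
    """Distribution of SUM(sales) WHERE sales >= threshold over all worlds."""
    dist = defaultdict(Fraction)
    for mercedes_same, bmw_same in product([False, True], repeat=2):
        p_mercedes = P_MERCEDES_SAME if mercedes_same else 1 - P_MERCEDES_SAME
        p_bmw = P_BMW_SAME if bmw_same else 1 - P_BMW_SAME
        answer = sum(s for s in world_sales(mercedes_same, bmw_same) if s >= threshold)
        dist[answer] += p_mercedes * p_bmw  # worlds with equal answers add up
    return dict(dist)


distribution = preferred_sum_distribution()
for answer in sorted(distribution):
    print(f"{answer:>4}  {float(distribution[answer]):.0%}")
```

Enumerating worlds grows exponentially with the number of uncertain decisions, which is why scalability is listed as an important property of the database technology.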
18. Sales of “preferred customers”
SELECT SUM(sales)
FROM carsales
WHERE sales >= 100
Answer: 106
Risk = Probability * Impact
Analyst is only bothered with problems that matter
QUERYING AND RELIABILITY ASSESSMENT
SUM(sales) P
0 14%
105 6%
106 56%
211 24%
Second most likely answer at 24% with impact factor 2 in sales (211 vs 106)
Risk of substantially wrong answer
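A small sketch of how "Risk = Probability * Impact" could be used to rank the alternative answers. The impact measure here, the absolute deviation from the most likely answer, is an assumption chosen for illustration; the slide itself expresses impact as a factor (211 vs 106).

```python
# Answer distribution from the slide: SUM(sales) -> probability.
distribution = {0: 0.14, 105: 0.06, 106: 0.56, 211: 0.24}


def risk_ranking(dist):
    """Rank alternative answers by risk = probability * impact.

    Assumption: impact is the absolute deviation from the most likely answer.
    """
    most_likely = max(dist, key=dist.get)
    risks = {
        answer: p * abs(answer - most_likely)
        for answer, p in dist.items()
        if answer != most_likely
    }
    ranked = sorted(risks.items(), key=lambda item: item[1], reverse=True)
    return most_likely, ranked


most_likely, ranking = risk_ranking(distribution)
print("most likely answer:", most_likely)
for answer, risk in ranking:
    print(f"alternative {answer}: risk {risk:.2f}")
```

Under this measure the alternative 211 carries the highest risk, while 105 is almost harmless because it barely differs from 106. That matches the idea that the analyst only needs to look at the problems that matter.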
19. BACK TO GEO-SOCIAL RECOMMENDATION
HOW TO MODEL THE GPS TRAJECTORY PROBLEM?
Smoothing: any jumps and/or sudden sharp angles are suspicious and probably wrong
Points become estimated points
Some points are possibly suspicious
Some are more suspicious than others
Model the uncertainty explicitly in the data
20. Fraud risk analysis
about which company do they talk?
Indicators become possible indicators
Fraud risk analysis is statistics / probability theory!
Reasoning with possible indicators is very easy. It’s just more data
AMBIGUITY IN NATURAL LANGUAGE PROCESSING
AND ITS CONSEQUENCES FOR FRAUD RISK ANALYSIS
Paris Hilton stayed in the Paris Hilton
Phrase begin end type ref
Paris 1 1 City sws.geonames.org/2988507
Paris 1 1 Firstname
Hilton 1 1 Lastname
Paris Hilton 1 2 Person https://en.wikipedia.org/wiki/Paris_Hilton
Paris Hilton 1 2 Hotel www.hilton.com/Paris
… … … …
“belong together”
21. Inspired by information retrieval (search engine evaluation)
Precision = ratio of answers that are correct (3/5 = 60%)
Recall = ratio of correct answers given (3/4 = 75%)
Expected precision and recall
A correct answer is better if the system dares to claim that it is correct with a higher probability
Analogously, incorrect answers with a high probability are worse than incorrect answers with a low probability
Expected precision = (0.8+0.7+0.2) / 2.3 = 74%
Expected recall = (0.8+0.7+0.2) / 4 = 43%
KNOW WHEN TO STOP CLEANING: MEASURING QUALITY
Answers in the example: A, B, C, D, E, F, G
Probabilities attached to the given answers: 80%, 70%, 50%, 20%, 10%
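The arithmetic on this slide can be written out as follows. This is a minimal sketch: the function names are hypothetical, and which answers count as correct is read off the slide's own formula (the answers at 0.8, 0.7 and 0.2 are correct, those at 0.5 and 0.1 are not, and four correct answers exist in total).

```python
def precision_recall(answers, total_correct):
    """Classic precision and recall; every returned answer counts equally."""
    n_correct = sum(1 for _, is_correct in answers if is_correct)
    return n_correct / len(answers), n_correct / total_correct


def expected_precision_recall(answers, total_correct):
    """Probability-weighted variant: each answer counts by its probability."""
    correct_mass = sum(p for p, is_correct in answers if is_correct)
    total_mass = sum(p for p, _ in answers)
    return correct_mass / total_mass, correct_mass / total_correct


# (probability claimed by the system, whether the answer is actually correct)
answers = [(0.8, True), (0.7, True), (0.5, False), (0.2, True), (0.1, False)]
TOTAL_CORRECT = 4  # correct answers that exist, whether returned or not

precision, recall = precision_recall(answers, TOTAL_CORRECT)
exp_precision, exp_recall = expected_precision_recall(answers, TOTAL_CORRECT)
print(f"precision {precision:.3f}, recall {recall:.3f}")
print(f"expected precision {exp_precision:.3f}, expected recall {exp_recall:.3f}")
```

Expected precision is higher than plain precision here because the two incorrect answers carry relatively little probability mass. Expected recall is lower than plain recall because one correct answer is only claimed with probability 0.2.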
22. Data quality: intangible problem with unknown impact
The key to effective management of DQ problems
Model DQ problems as uncertainty *in* the data
Probabilistic database technology for scalability
Postpone resolution/cleaning: pay-as-you-go
Measure and know when to stop:
good-is-good-enough; human-in-the-loop
CONCLUSIONS
Bio-Informatics professor:
“ PhD of 4 years, 3 years
devoted to ‘data fiddling’ ”
If we can reduce the data fiddling
by 1 year (33%), we make the
scientist twice as productive!
Editor's Notes
Abstract
Business analytics and data science are significantly impaired by a wide variety of 'data handling' issues, especially when data from different sources are combined and when unstructured data is involved. The root cause of many such problems centers around data semantics and data quality. We have developed a generic method which is based on modeling such problems as uncertainty *in* the data. A recently conceived new kind of DBMS can store, manage, and query large volumes of uncertain data: the UDBMS or "Uncertain Database". Together, they allow one to, e.g., postpone the resolution of data problems, assess what their influence is on analytical results, etc. We furthermore develop technology for data cleansing, web harvesting, and natural language processing which uses this method to deal with ambiguity of natural language and many other problems encountered when using unstructured data
Explain Bio-Informatics
Of course there are others, e.g., BioInformatics
The first two examples showed data quality and semantic problems. If you do NLP, you are faced with the same!
Refer back to pregnancy and movie examples: all those issues can be modeled as uncertainty in data. Queries and analytics will then return all possible results, which gives a handle on the influence of these issues on the results
With OSINT data, this problem of semantic duplicates is enormous ...
Notice that all these are “tables”
TODO: make this slide somewhat more explicit / concrete