Three Laws of Trusted Data Sharing:
(Building a Better Business
Case for Data Sharing)
Tim Menzies (prof, cs)
tim.menzies@gmail.com
August 6, 2015
• Discussions about sharing
• Too much fear
• Not enough about benefits
• Can we learn more from sharing than hoarding?
• Yes (results from SE)
• Three laws of trusted data sharing:
• For SE quality prediction...
• Better models from shared privatized data than from all raw data
• Q: Does this work for other kinds of data?
• A: Don’t know… yet
Why We Care…
– Sebastian Elbaum et al. 2014
Sharing industrial datasets
with the research community
is extremely valuable, but
also extremely challenging as
it needs to balance the
usefulness of the dataset with
the industry’s concerns for
privacy and competition.
S. Elbaum, A. Mclaughlin, and J. Penix, “The google dataset of testing results,” june 2014. [Online].
Available: https://code.google.com/p/google-shared-dataset-of-test-suite-results
Cost of privacy
- Privacy Goals (conflicting)
• protect confidentiality of software defect data
with privacy preserving techniques...
• while data remains useful
- Not trivial
• With standard anonymization methods
• as privacy increases...
• data becomes less useful
[Figure: trade-off between Usefulness and Privacy]
J. Brickell and V. Shmatikov, “The cost of privacy: destruction of data-mining utility in anonymized data publishing,” in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’08.
M. Grechanik, C. Csallner, C. Fu, and Q. Xie, “Is data privacy always good for software testing?” in Proceedings of the 2010 IEEE 21st International Symposium on Software Reliability Engineering, ser. ISSRE ’10.
Building a business case
for data sharing
• Funded by NC Data Science and
Analytics Initiative
• Joint project with Prof. Bojan Cukic,
UNC Charlotte
• Applying the following to data from
– The smart cities initiative
– Community health care data
– Biometrics data
• Q1: What do you lose by not sharing?
– Compare conclusions reached via sharing with those reached via hoarding
• Q2: Does anonymization protect us?
– Using standard privatization algorithms:
– Can we violate privacy on data from Smart Cities, Community health, Biometrics?
• Q3: Are we protecting data too much?
– Using standard privatization algorithms:
– How much worse off are our models?
• Q4: Do the costs of sharing outweigh the benefits?
– Apply our novel “3 laws of data sharing” and see what can be learned
– Check if the learned models are not very useful, or interesting
About me: http://menzies.us
• Funding: $7 million
– NASA, DoD, National Science
Foundation, National Archives, etc
– Some STTR work
• Ph.D/masters students: dozens
• Papers: 200+
• Teaching:
– Grad SE + automated SE
• Service:
– Editorial boards: TSE, EMSE, ASE
– Conference org: ICSME’16, ASE,
– Many program committees
Recent books
Sharing data, Turkey to Texas:
Toasters to rocket ships
Q: Does this work for other kinds of data? E.g. anonymized privatized data?
A: Perhaps
Everyone else’s research question
Why does
software fail?
Sure, software sometimes fails
(and may do so at the worst time)
• E.g. software floating
point bug, Ariane 5, 1996
• Cost of vehicle: $500 million
• Development cost: $7 billion
• Loss of income due to loss of
client confidence: unknown
My research question
Why does software ever work?
According to the maths, software is too complex to understand
• 10^24 stars in the sky
• N^V states in software
– Consider 100 if statements
– Then N=2, V=100 and N^V = 2^100
– a million times more than 10^24
• The space inside our software
– is bigger than stars in the sky.
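The arithmetic on this slide can be checked directly. A minimal sketch, assuming each if statement contributes one independent binary choice; the names `STARS_IN_SKY` and `state_count` are illustrative only and not from the talk.

```python
# Compare the number of program states with the star count quoted on the slide.
STARS_IN_SKY = 10 ** 24  # figure quoted on the slide


def state_count(n_values: int, n_variables: int) -> int:
    """Number of distinct assignments of n_variables, each with n_values options."""
    return n_values ** n_variables


states = state_count(2, 100)   # 100 binary if statements
ratio = states / STARS_IN_SKY  # how many times larger than the star count

print(f"states = {states:.3e}")
print(f"ratio  = {ratio:.3e}")
```

The ratio lands a little above one million, which matches the "million times more" bullet.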
IEEE Computer, Jan 2007, pp. 54-60
http://menzies.us/pdf/07strange.pdf
Complex things should not work
N = number of tests required
C = odds of finding the bug
p = probability of the bug
C = 1 – (1-p)^N, so
N = log(1-C)/log(1-p)
Yet (often) they do
• Examples:
– Open source software
– The internet
– Electrical power grids
– Pacemakers
– International air traffic control systems
– Operating systems
– Etc.
Sure, software sometimes fails
• But the puzzle is this:
– These errors should be much more frequent
– So where is all that missing behavior?
When reasoning about complex things,
you don’t have to look at very much
• Narrows: Amarel, 1960s
• Prototypes: Chen, 1975
• Frames: Minsky, 1975
• Min environments: DeKleer, 1986
• Saturation: Horgan & Mathur, 1980
• Homogenous propagation: Michael, 1981
• Master variables: Crawford & Baker, 1995
• Clumps: Druzdzel, 1997
• Feature subset selection: Kohavi, 1997
• Back doors: Williams, 2002
• Active learning: many people (2000+)
Specifically, for “transfer learning”
(migrating conclusions from one project to another)
Q: How to transfer?
A: Ignore most of the data
• relevancy filtering: Turhan ESEj’09; Peters TSE’13
• variance filtering: Kocaguneli TSE’12, TSE’13
• performance similarities: He ESEM’13
Target domain: software quality prediction
Ignoring data = privacy?
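As one illustration of "ignore most of the data", here is a minimal nearest-neighbour relevancy filter. It is a sketch only, not the exact method of the cited papers: it assumes numeric features, Euclidean distance, and a hypothetical parameter `k`, and it keeps just the source-project rows that sit closest to some target-project row.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length numeric rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def relevancy_filter(source_rows, target_rows, k=1):
    """Keep source rows that are among the k nearest neighbours of any target row.

    source_rows: list of (features, label) pairs from other projects.
    target_rows: list of feature tuples from the local project (no labels needed).
    """
    keep = set()
    for t in target_rows:
        ranked = sorted(range(len(source_rows)),
                        key=lambda i: distance(source_rows[i][0], t))
        keep.update(ranked[:k])
    return [source_rows[i] for i in sorted(keep)]


source = [((1, 1), 0), ((1, 2), 0), ((9, 9), 1), ((50, 50), 1), ((60, 40), 0)]
target = [(1, 1.2), (8, 9)]
print(relevancy_filter(source, target, k=1))
```

Everything far from the target project is dropped before any model is built, so most of the source data is never used (or shared).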
[Figure: table of static code features (e.g. LOC per class, coupling, etc.) against defects per KLOC; each column is scored by how well it predicts defects, each row by its centrality count]
• Sort by column “worth”
• Sort by row “centrality”
• Prune the dull rows
• Prune the dull columns
• Data “corners”: 49/900 = 5.4% of the data
Too much pruning?
• For SE quality data, no
– Vasil 2013:
• Quality by extrapolating between the rows of the corners
• Just as good as using all the data
• The “corners” are the nub, the essence
– With all superfluous detail removed
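The column-then-row pruning that yields the "corners" can be sketched as follows. This is an illustration under loud assumptions, not the scoring used in the talk: column "worth" is taken to be the gap between class means scaled by the column's range, and row "centrality" is taken to be closeness to the row's own class centroid. The names `column_worth` and `corners` are hypothetical, and both classes must be present in the data.

```python
from statistics import mean


def column_worth(rows, labels, j):
    """Hypothetical 'worth' of column j: gap between class means, scaled by range."""
    col = [r[j] for r in rows]
    spread = (max(col) - min(col)) or 1.0  # avoid dividing by zero on constant columns
    pos = [r[j] for r, y in zip(rows, labels) if y == 1]
    neg = [r[j] for r, y in zip(rows, labels) if y == 0]
    return abs(mean(pos) - mean(neg)) / spread


def corners(rows, labels, keep_cols, keep_rows_per_class):
    """Prune dull columns, then dull rows; return (kept column indexes, kept rows)."""
    n_cols = len(rows[0])
    best_cols = sorted(range(n_cols),
                       key=lambda j: column_worth(rows, labels, j),
                       reverse=True)[:keep_cols]
    best_cols.sort()
    kept = []
    for cls in sorted(set(labels)):
        members = [[r[j] for j in best_cols]
                   for r, y in zip(rows, labels) if y == cls]
        centroid = [mean(c) for c in zip(*members)]
        # Assumed notion of centrality: smaller distance to the class centroid.
        members.sort(key=lambda r: sum((a - b) ** 2 for a, b in zip(r, centroid)))
        kept += [(r, cls) for r in members[:keep_rows_per_class]]
    return best_cols, kept


rows = [[1, 5, 10], [2, 5, 11], [1.5, 5, 30], [9, 5, 12], [10, 5, 9], [9.5, 5, 31]]
labels = [0, 0, 0, 1, 1, 1]
print(corners(rows, labels, keep_cols=1, keep_rows_per_class=1))
```

In this toy table, 18 cells shrink to 2, which mirrors the 49/900 reduction reported above.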
Three laws of data sharing
• First Law: don’t share everything; just the “corners”.
• Second Law: anonymize the data in the “corners”.
– Mutate data to some random nearby location
• Third Law: never mutate across the “decision boundary”.
[Figure: two panels, “All data” and “Just the corners”]
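The second and third laws can be combined in one small mutator. A sketch under stated assumptions, not the talk's actual algorithm: each row is nudged along the line to its nearest row of a different class (its "nearest unlike neighbour"), by a random fraction that stays below one half so the row remains on its own side of the midpoint. The bounds 0.15 and 0.35 and the names `nearest_unlike` and `privatize` are illustrative assumptions.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length numeric rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_unlike(row, label, data):
    """Closest row in data whose class label differs from label."""
    others = [r for r, y in data if y != label]
    return min(others, key=lambda r: distance(row, r))


def privatize(data, lo=0.15, hi=0.35, rng=None):
    """Mutate every row to a random nearby location without crossing the boundary.

    data: list of (features, label) pairs with at least two classes.
    lo, hi: assumed bounds on the step, as a fraction of the distance to the
            nearest unlike neighbour; keeping hi below 0.5 is what stops the
            mutated row from ending up closer to the other class.
    """
    rng = rng or random.Random()
    out = []
    for row, label in data:
        z = nearest_unlike(row, label, data)
        r = rng.uniform(lo, hi)
        sign = rng.choice((-1, 1))  # step towards or away from the neighbour
        out.append(([x + sign * r * (x - zi) for x, zi in zip(row, z)], label))
    return out


data = [([1.0, 1.0], 0), ([2.0, 1.5], 0), ([8.0, 9.0], 1), ([9.0, 8.0], 1)]
for before, after in zip(data, privatize(data, rng=random.Random(1))):
    print(before, "->", after)
```

Because every step is shorter than half the distance to the nearest unlike row, each mutated row stays closer to its original position than to any row of the other class.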
Better models from shared privatized data than from all raw data
• Simulated 20 data owners sharing privatized data
– “pass the parcel”
• Data owners incrementally added their data to a parcel of shared data
– but only data that was somehow outstandingly different to data already in the parcel
• Data was privatized
– using corners
– before leaving each data owner
• Shared parcel:
– just 5% of all data
• Software quality predictors built from this 5%
– performed better than predictors built from all the data.
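The “pass the parcel” loop can be sketched in a few lines. This is an illustration only: "outstandingly different" is assumed here to mean "farther than a fixed distance `threshold` from every row already in the parcel", and the `privatize` hook stands in for whatever per-owner anonymization is applied before data leaves. Both names are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length numeric rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pass_the_parcel(owners, threshold, privatize=lambda row: row):
    """Each owner in turn adds only rows unlike anything already in the parcel.

    owners: list of row lists, one list per data owner.
    threshold: assumed minimum distance for a row to count as 'different'.
    privatize: per-row anonymizer applied before the row leaves its owner.
    """
    parcel = []
    for rows in owners:
        for row in rows:
            if all(distance(row, p) > threshold for p in parcel):
                parcel.append(privatize(row))
    return parcel


owners = [
    [(0, 0), (0.1, 0), (5, 5)],
    [(0, 0.1), (5.1, 5), (10, 10)],
]
print(pass_the_parcel(owners, threshold=1.0))
```

Later owners contribute less and less, since most of what they hold is already represented, which is how the shared parcel can stay at a small fraction of all the data.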
Peters, F., Menzies, T., & Layman, L. (2015). LACE2: Better
Privacy-Preserving Data Sharing for Cross Project Defect
Prediction. In ICSE’15, Florence, Italy
http://menzies.us/pdf/15lace2.pdf
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
AJAYKUMARPUND1
 
digital fundamental by Thomas L.floydl.pdf
digital fundamental by Thomas L.floydl.pdfdigital fundamental by Thomas L.floydl.pdf
digital fundamental by Thomas L.floydl.pdf
drwaing
 
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
SamSarthak3
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
zwunae
 
Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
manasideore6
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
WENKENLI1
 

Recently uploaded (20)

Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
 
Literature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptxLiterature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptx
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
 
An Approach to Detecting Writing Styles Based on Clustering Techniques
An Approach to Detecting Writing Styles Based on Clustering TechniquesAn Approach to Detecting Writing Styles Based on Clustering Techniques
An Approach to Detecting Writing Styles Based on Clustering Techniques
 
Online aptitude test management system project report.pdf
Online aptitude test management system project report.pdfOnline aptitude test management system project report.pdf
Online aptitude test management system project report.pdf
 
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTSHeap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
 
14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
DfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributionsDfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributions
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
 
digital fundamental by Thomas L.floydl.pdf
digital fundamental by Thomas L.floydl.pdfdigital fundamental by Thomas L.floydl.pdf
digital fundamental by Thomas L.floydl.pdf
 
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
 
Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
 

Three Laws of Trusted Data Sharing: (Building a Better Business Case for Data Sharing)

  • 1. Three Laws of Trusted Data Sharing: (Building a Better Business Case for Data Sharing) Tim Menzies (prof, cs) tim.menzies@gmail.com August 6, 2015
  • 2. • Discussions about sharing • Too much fear • Not enough about benefits • Can we learn more from sharing than hoarding? • Yes (results from SE) • Three laws of trusted data sharing: • For SE quality prediction: • Better models from shared privatized data than from all raw data • Q: Does this work for other kinds of data? • A: Don’t know… yet
  • 3. Why We Care… “Sharing industrial datasets with the research community is extremely valuable, but also extremely challenging as it needs to balance the usefulness of the dataset with the industry’s concerns for privacy and competition.” – Sebastian Elbaum et al. 2014. S. Elbaum, A. Mclaughlin, and J. Penix, “The google dataset of testing results,” June 2014. [Online]. Available: https://code.google.com/p/google-shared-dataset-of-test-suite-results
  • 4. Cost of privacy - Privacy goals (conflicting) • protect confidentiality of software defect data with privacy-preserving techniques... • while the data remains useful - Not trivial • With standard anonymization methods • as privacy increases... • data becomes less useful [Figure: usefulness plotted against privacy] J. Brickell and V. Shmatikov, “The cost of privacy: destruction of data-mining utility in anonymized data publishing,” in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’08. M. Grechanik, C. Csallner, C. Fu, and Q. Xie, “Is data privacy always good for software testing?” in Proceedings of the 2010 IEEE 21st International Symposium on Software Reliability Engineering, ser. ISSRE ’10.
  • 5. Building a business case for data sharing • Funded by NC Data Science and Analytics Initiative • Joint project with Prof. Bojan Cukic, UNC Charlotte • Applying the following to data from – The smart cities initiative – Community health care data – Biometrics data • Q1: What do you lose by not sharing? – Compare conclusions reached via sharing with those reached via hoarding • Q2: Does anonymization protect us? – Using standard privatization algorithms: – Can we violate privacy on data from Smart Cities, Community health, Biometrics? • Q3: Are we protecting data too much? – Using standard privatization algorithms: – How much worse off are our models? • Q4: Do the costs of sharing outweigh the benefits? – Apply our novel “3 laws of data sharing” and see what can be learned – Check whether the learned models are useful and interesting
  • 6. About me: http://menzies.us • Funding: $7 million – NASA, DoD, National Science Foundation, National Archives, etc. – Some STTR work • Ph.D./masters students: dozens • Papers: 200+ • Teaching: – Grad SE + automated SE • Service: – Editorial boards: TSE, EMSE, ASE – Conference org: ICSME’16, ASE – Many program committees
  • 8. Sharing data, Turkey to Texas: Toasters to rocket ships
  • 9. Sharing data, Turkey to Texas: Toasters to rocket ships • Q: Does this work for other kinds of data? E.g. anonymized privatized data? • A: Perhaps
  • 10. Everyone else’s research question: Why does software fail?
  • 11. Sure, software sometimes fails (and may do so at the worst time) • E.g. software floating point bug, Ariane 5, 1996 • Cost of vehicle: $500 million • Development cost: $7 billion • Loss of income due to loss of client confidence: unknown
  • 12. Everyone else’s research question: Why does software fail?
  • 13. My research question: Why does software ever work? (rather than “Why does software fail?”)
  • 14. According to the maths, software is too complex to understand • $10^{24}$ stars in the sky • $N^V$ states in software – Consider 100 if statements – Then $N=2$, $V=100$ and $N^V = 2^{100}$ – a million times more than $10^{24}$ • The space inside our software – is bigger than the stars in the sky. IEEE Computer, Jan 2007, pp. 54-60, http://menzies.us/pdf/07strange.pdf
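The arithmetic on this slide can be checked directly. The short sketch below only restates the slide's numbers (2 outcomes per if statement, 100 if statements, $10^{24}$ stars); the variable names are illustrative and not from the talk.

```python
# Compare the slide's two numbers: program states vs. stars in the sky.
N = 2              # outcomes of one if statement (true or false)
V = 100            # number of independent if statements
states = N ** V    # 2^100, computed exactly with Python integers
stars = 10 ** 24   # the slide's estimate of stars in the sky

ratio = states / stars
print(f"states = {states:.3e}")
print(f"states / stars = {ratio:.3e}")
```

The ratio comes out a little above one million, which matches the slide's "a million times more".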
  • 15. Complex things should not work • $N$ = number of tests required • $C$ = odds that the bug is found • $p$ = probability of the bug • $C = 1 - (1-p)^N$, so $N = \log(1-C)/\log(1-p)$
  • 16. Yet (often) they do • Examples: – Open source software – The internet – Electrical power grids – Pacemakers – International air traffic control systems – Operating systems – etc.
  • 17. Sure, software sometimes fails (and may do so at the worst time) • E.g. software floating point bug, Ariane 5, 1996 • Cost of vehicle: $500 million • Development cost: $7 billion • Loss of income due to loss of client confidence: unknown • But the puzzle is this: – These errors should be much more frequent – So where is all that missing behavior?
  • 18. When reasoning about complex things, you don’t have to look at very much • Narrows: Amarel, 1960s • Prototypes: Chen, 1975 • Frames: Minsky, 1975 • Min environments: DeKleer, 1986 • Saturation: Horgan & Mathur, 1980 • Homogenous propagation: Michael, 1981 • Master variables: Crawford & Baker, 1995 • Clumps: Druzdzel, 1997 • Feature subset selection: Kohavi, 1997 • Back doors: Williams, 2002 • Active learning: many people (2000+)
  • 19. Specifically, for “transfer learning” (migrating conclusions from one project to another) • Target domain: software quality prediction • Q: How to transfer? • A: Ignore most of the data • relevancy filtering: Turhan ESEj’09; Peters TSE’13 • variance filtering: Kocaguneli TSE’12, TSE’13 • performance similarities: He ESEM’13
  • 20. Ignoring data = privacy? [Figure: data table. Labels: defects per KLOC; static code features (e.g. LOC per class, coupling, etc.); how well each column predicts for defects; centrality count]
  • 21. Sort by column “worth” [Figure: same table as slide 20]
  • 22. Sort by row “centrality” [Figure: same table as slide 20]
  • 23. Prune the dull rows [Figure: same table as slide 20]
  • 24. Prune the dull columns [Figure: same table as slide 20]
  • 25. Data “corners”: 49/900 = 5.4% of the data [Figure: same table as slide 20]
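The sort-then-prune steps on slides 21 to 25 can be sketched in a few lines. This is only an illustration under loud assumptions: the slides do not define column "worth" or row "centrality", so the sketch uses absolute correlation with the defect count for worth and small mean distance to the other rows for centrality. The function names and the random 30 x 30 table are hypothetical.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)

def corners(rows, target, keep_rows, keep_cols):
    """Return (pruned_rows, pruned_target, kept_column_indexes)."""
    n_cols = len(rows[0])
    # ASSUMPTION: a column's "worth" is |correlation| with the target.
    worth = [abs(pearson([r[c] for r in rows], target)) for c in range(n_cols)]
    cols = sorted(range(n_cols), key=lambda c: worth[c], reverse=True)[:keep_cols]
    slim = [[r[c] for c in cols] for r in rows]  # prune the dull columns

    # ASSUMPTION: a row is "central" if its mean distance to other rows is small.
    def mean_distance(i):
        others = (math.dist(slim[i], slim[j]) for j in range(len(slim)) if j != i)
        return sum(others) / (len(slim) - 1)

    keep = sorted(range(len(slim)), key=mean_distance)[:keep_rows]  # prune the dull rows
    return [slim[i] for i in keep], [target[i] for i in keep], cols

# Hypothetical 30 x 30 table; column 0 is built to predict the target perfectly.
rng = random.Random(1)
target = [rng.random() for _ in range(30)]
rows = [[2 * t] + [rng.random() for _ in range(29)] for t in target]

small, small_target, kept_cols = corners(rows, target, keep_rows=7, keep_cols=7)
fraction = (len(small) * len(small[0])) / (len(rows) * len(rows[0]))
print(f"kept {fraction:.1%} of the cells")  # 49 of 900 cells
```

Keeping 7 of 30 rows and 7 of 30 columns is one way to arrive at the 49/900 figure on slide 25; the talk does not say how its 49 cells were laid out, so that split is a guess.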
  • 26. Too much pruning? • For SE quality data, no – Vasil 213: • Quality predicted by extrapolating between the rows of the corners • Just as good as using all the data • The “corners” are the nub, the essence – With all superfluous detail removed
  • 27. Three laws of data sharing • First Law: don’t share everything; just the “corners”.
  • 28. Three laws of data sharing • First Law: don’t share everything; just the “corners”. • Second Law: anonymize the data in the “corners”.
  • 29. Three laws of data sharing • First Law: don’t share everything; just the “corners”. • Second Law: anonymize the data in the “corners”. [Figure: all data vs. just the corners]
  • 30. Three laws of data sharing • First Law: don’t share everything; just the “corners”. • Second Law: anonymize the data in the “corners”. [Figure: all data vs. just the corners; mutate data to some random nearby location]
  • 31–36. Three laws of data sharing • First Law: don’t share everything; just the “corners”. • Second Law: anonymize the data in the “corners”. • Third Law: never mutate across the “decision boundary”.
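The second and third laws can be illustrated with a small sketch. It is not the talk's algorithm: it assumes the "decision boundary" is the one drawn by a 1-nearest-neighbor classifier over the original rows, and it mutates each row by a random fraction of the gap to its nearest unlike neighbor. The step range (0.15 to 0.35) and all function names are hypothetical choices for this example.

```python
import math
import random

def nearest_unlike(row, label, rows, labels):
    """Closest row that carries a different class label."""
    unlike = [r for r, l in zip(rows, labels) if l != label]
    return min(unlike, key=lambda r: math.dist(row, r))

def predict_1nn(point, rows, labels):
    """Label of the original row closest to `point`."""
    i = min(range(len(rows)), key=lambda j: math.dist(point, rows[j]))
    return labels[i]

def privatize(rows, labels, lo=0.15, hi=0.35, rng=random):
    """Mutate each row to a random nearby location (second law),
    dropping any mutant that would change class (third law)."""
    out_rows, out_labels = [], []
    for row, label in zip(rows, labels):
        z = nearest_unlike(row, label, rows, labels)
        step = rng.uniform(lo, hi) * rng.choice([-1, 1])  # toward or away from z
        mutant = [x + step * (x - zi) for x, zi in zip(row, z)]
        # ASSUMPTION: the decision boundary is that of 1-NN on the raw rows.
        if predict_1nn(mutant, rows, labels) == label:
            out_rows.append(mutant)
            out_labels.append(label)
    return out_rows, out_labels

# Hypothetical data: two well-separated classes in two dimensions.
rng = random.Random(0)
rows = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(10)]
rows += [[rng.gauss(10, 1), rng.gauss(10, 1)] for _ in range(10)]
labels = [0] * 10 + [1] * 10

private_rows, private_labels = privatize(rows, labels, rng=rng)
print(len(private_rows), "rows shared, none identical to a raw row")
```

Because each step is less than half the distance to the nearest unlike neighbor, the mutant stays closer to its own raw row than to any row of another class, so the 1-NN check should always pass here. The explicit check is kept as a safety net for anyone who swaps in a different notion of decision boundary.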
  • 37. Better models from shared privatized data than from all raw data • Simulated 20 data owners sharing privatized data – “pass the parcel” • Data owners incrementally added their data to a parcel of shared data – but only data that was somehow outstandingly different to data already in the parcel • Data was privatized (using corners) before leaving each data owner • Shared parcel: – just 5% of all data • Software quality predictors built from this 5%: – predictors performed better than predictors built from all the data. Peters, F., Menzies, T., & Layman, L. (2015). LACE2: Better Privacy-Preserving Data Sharing for Cross Project Defect Prediction. In ICSE’15, Florence, Italy. http://menzies.us/pdf/15lace2.pdf
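The "pass the parcel" loop described above can be sketched as follows. This is a simplified stand-in, not the LACE2 implementation: it assumes rows arrive already privatized, and it reads "outstandingly different" as "farther than a fixed distance from everything already in the parcel". The `min_distance` threshold, the function name, and the toy data are all hypothetical.

```python
import math

def pass_the_parcel(owners, min_distance):
    """Each owner in turn adds only rows that are unlike the parcel so far.

    owners: list of row lists, one list per data owner (already privatized).
    min_distance: ASSUMED novelty threshold; a row is added only if it is
    farther than this from every row already in the parcel.
    """
    parcel = []
    for private_rows in owners:
        for row in private_rows:
            if all(math.dist(row, kept) > min_distance for kept in parcel):
                parcel.append(row)
    return parcel

# Hypothetical example: three owners whose data largely overlaps.
owners = [
    [(0.0, 0.0), (0.1, 0.0)],
    [(0.0, 0.1), (5.0, 5.0)],
    [(5.0, 5.1), (9.0, 0.0)],
]
parcel = pass_the_parcel(owners, min_distance=1.0)
total = sum(len(o) for o in owners)
print(f"parcel holds {len(parcel)} of {total} rows: {parcel}")
```

In this toy run half of the rows are never shared because a near-duplicate is already in the parcel, which is the mechanism that keeps the shared parcel small.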
  • 38. Building a business case for data sharing • Funded by NC Data Science and Analytics Initiative • Joint project with Prof. Bojan Cukic, UNC Charlotte • Applying the following to data from – The smart cities initiative – Community health care data – Biometrics data • Q1: What do you lose by not sharing? – Compare conclusions reached via sharing with those reached via hoarding • Q2: Does anonymization protect us? – Using standard privatization algorithms: – Can we violate privacy on data from Smart Cities, Community health, Biometrics? • Q3: Are we protecting data too much? – Using standard privatization algorithms: – How much worse off are our models? • Q4: Do the costs of sharing outweigh the benefits? – Apply our novel “3 laws of data sharing” and see what can be learned – Check whether the learned models are useful and interesting
  • 39. • Discussions about sharing • Too much fear • Not enough about benefits • Can we learn more from sharing than hoarding? • Yes (results from SE) • Three laws of trusted data sharing: • For SE quality prediction: • Better models from shared privatized data than from all raw data • Q: Does this work for other kinds of data? • A: Don’t know… yet