PROMISE keynote Juristo

  1. Use and Misuse of the Term Experiment in MSR Research
     Natalia Juristo
     University of Oulu & Technical University of Madrid
     PROMISE, September 7th 2016
  2. Motivation
     • Today empiricism is everywhere in SE
     • This does not mean SE is empirically mature
     • Conducting empirical studies does not imply that they are carried out and understood properly
     • Here I focus on a methodological issue in MSR research
        • The use of experiments in MSR
  3. Motivation
     • For several years I have been struggling to match MSR research with the more traditional SE empirical research (conducted over the last 35 years)
     • Very often I was shocked to hear MSR works call "experiments" empirical studies that I do not consider as such
     • Today I discuss research we are conducting to clarify this issue
  4. Collaboration
     • This research has been conducted in collaboration with
        • Claudia Ayala
        • Xavier Franch
        • Burak Turhan
  5. Evidence of Misuse
  6. Small-scale Literature Review
     • We conducted a literature review to double-check the use of the term experiment in MSR works
     • Papers from 2015 in MSR, ESEM and EMSE
        • MSR: 42 papers reviewed
        • ESEM: 36 papers
        • EMSE: 55 papers
  7. Findings
     Use of the term experiment in 2015 papers, by venue:
     • ESEM: 30.5% (11 out of 36)
        • MSR works: 72.72% (8 papers); traditional experiments: 27.28% (3 papers)
        • Use in MSR works: wrong 12.5%, proper 87.5%
     • MSR: 42.8% (18 out of 42)
        • MSR works: 100% (18 papers); traditional experiments: 0%
        • Use in MSR works: wrong 44.45%, proper 55.55%
     • EMSE: 52.72% (29 out of 55)
        • MSR works: 65.51% (19 papers); traditional experiments: 34.48% (10 papers)
        • Use in MSR works: wrong 52.63%, proper 47.36%
     …Let me elaborate on why the term is misused
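The percentages on this slide can be re-derived from the paper counts, as in the Python sketch below. The number of wrong uses per venue (1, 8 and 10 MSR works) is not stated on the slide; it is inferred here from the reported percentages (for example, 12.5% of 8 papers is 1 paper), so treat it as an assumption. Small differences from the slide (30.56% vs 30.5%, 44.44% vs 44.45%) come from rounding or truncation on the slide.

```python
# Counts from the slide; wrong_use is inferred from the reported percentages.
reviewed = {"ESEM": 36, "MSR": 42, "EMSE": 55}
uses_term = {"ESEM": 11, "MSR": 18, "EMSE": 29}
msr_works = {"ESEM": 8, "MSR": 18, "EMSE": 19}
wrong_use = {"ESEM": 1, "MSR": 8, "EMSE": 10}  # assumption, not stated on the slide


def pct(part: int, whole: int) -> float:
    """Percentage of part over whole, rounded to two decimals."""
    return round(100 * part / whole, 2)


for venue in reviewed:
    print(
        f"{venue}: term used in {pct(uses_term[venue], reviewed[venue])}% of papers; "
        f"{pct(msr_works[venue], uses_term[venue])}% of those are MSR works; "
        f"wrong use in {pct(wrong_use[venue], msr_works[venue])}% of MSR works"
    )
```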
  8. What Is an Experiment
  9. Experiment Definition
     • Empirical procedure in which key variables of a reality are manipulated to investigate the impact of such variations
  10. What Makes an Experiment
     • Manipulation of the variables under study
        • Treatments must be assigned to experimental units
     • Control of potential confounding variables impacting the results
        • Confounding is eliminated through random assignment of treatments to units
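As an illustration of the last point, complete random allocation of treatments to experimental units can be sketched in a few lines of Python. The unit names, the treatment labels and the `allocate` helper are hypothetical; the sketch only shows that the researcher, through a random mechanism, decides which unit receives which treatment.

```python
import random


def allocate(units, treatments, seed=None):
    """Randomly assign each experimental unit to one treatment, keeping
    group sizes as balanced as possible (complete randomization)."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    # Deal the shuffled units round-robin into the treatment groups.
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups


# Hypothetical units and treatments, for illustration only.
subjects = [f"subject{i}" for i in range(10)]
groups = allocate(subjects, ["technique_A", "technique_B"], seed=1)
for treatment, members in groups.items():
    print(treatment, members)
```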
  11. What Makes an Experiment: Intervention
     • Experimentation
        • Researchers purposely intervene
        • Researchers allocate treatments to units
        • Experimental groups (exposed and unexposed) are determined by the researcher
     • Observation
        • Researchers have a passive role and do not interfere with reality
        • Data are generated directly by reality and analyzed afterwards
        • Exposure status is not determined by the researcher
  12. What Makes an Experiment: Randomization
     • Experiments limit the potential for confounding factors (biases) by randomly assigning one participant pool to a treatment and another participant pool to the control or another treatment
     • Random allocation of treatments to subjects minimizes the chance that the incidence of confounding variables (particularly unknown ones) differs between the two groups
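A small simulation, under purely hypothetical assumptions, makes this concrete. Each unit has a confounder (think of developer experience, standardized); when units choose their own treatment, experienced units end up in the treated group, whereas a coin flip spreads experience evenly across both groups. The data are synthetic and only illustrate the mechanism.

```python
import random
from statistics import mean

rng = random.Random(42)
# Hypothetical confounder (e.g. standardized developer experience) for 10,000 units.
experience = [rng.gauss(0, 1) for _ in range(10_000)]

# Self-selection: more experienced units pick the treatment.
self_selected = [x > 0 for x in experience]
# Random allocation: a fair coin decides, ignoring experience.
randomized = [rng.random() < 0.5 for _ in experience]


def imbalance(assignment):
    """Difference in mean confounder value between treated and control units."""
    treated = [x for x, t in zip(experience, assignment) if t]
    control = [x for x, t in zip(experience, assignment) if not t]
    return mean(treated) - mean(control)


print(f"self-selected imbalance: {imbalance(self_selected):.3f}")
print(f"randomized imbalance:    {imbalance(randomized):.3f}")
```

With self-selection the groups differ by well over one standard deviation in the confounder; with random allocation the difference is close to zero.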
  13. What Makes an Experiment: Intervention + Randomization
     • Intervention guarantees causality
     • Inspiring example
        • In a quasi-experiment the allocation of treatments is not possible
           • Although it is run under controlled conditions
        • The case of psychology experiments
           • Personality traits
  14. What Does Not Make an Experiment
     • Randomization
     • Comparison
     • Analysis techniques
  15. What Does Not Make an Experiment: Randomization
     • Randomization is a strategy aimed at reducing confounding variables (bias)
     • It is mandatory in controlled experiments
     • It can be applied to other types of empirical studies
     • Inspiring example
        • Randomization in surveys
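The difference between randomizing "who is observed" (as in a survey) and randomizing "which treatment each unit receives" (as in an experiment) can be shown side by side. In this hypothetical sketch, `rng.sample` draws survey respondents from a sampling frame, which supports generalizing to that frame but involves no intervention; only the second step, which assigns treatments, is the kind of randomization an experiment requires. The sampling frame and group sizes are invented.

```python
import random

rng = random.Random(7)
# Hypothetical sampling frame of 500 developers.
population = [f"dev{i}" for i in range(500)]

# Survey: randomization decides WHO is observed; nobody receives a treatment.
respondents = rng.sample(population, k=50)

# Experiment: randomization decides WHICH treatment each unit receives.
shuffled = respondents[:]
rng.shuffle(shuffled)
assignment = {
    unit: ("treatment" if i < 25 else "control") for i, unit in enumerate(shuffled)
}
print(sum(v == "treatment" for v in assignment.values()), "units receive the treatment")
```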
  16. What Does Not Make an Experiment: Comparison
     • Comparing the impact of different values of a variable does not mean we will be able to reveal causality
     • Comparing data units with different values of a variable neither makes the study an experiment nor allows differences to be traced back to treatments
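The following simulation, built on hypothetical assumptions, shows why such a comparison cannot be traced back to a treatment. A hidden confounder (here labeled project size) makes projects both more likely to use a method and more defect-prone, while the method itself is given no effect at all; comparing projects that happen to use the method with those that do not still produces a clear difference.

```python
import random
from statistics import mean

rng = random.Random(3)
projects = []
for _ in range(5_000):
    size = rng.gauss(0, 1)                      # hidden confounder (hypothetical)
    uses_method = size + rng.gauss(0, 1) > 0    # larger projects adopt the method more often
    defects = 10 + 3 * size + rng.gauss(0, 1)   # the method has NO effect on defects
    projects.append((uses_method, defects))

with_method = mean(d for m, d in projects if m)
without_method = mean(d for m, d in projects if not m)
print(f"observed difference: {with_method - without_method:.2f} defects")
```

The observed difference (a few defects) is entirely produced by the confounder.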
  17. What Does Not Make an Experiment: Analysis
     • Analysis techniques do not differentiate experiments from other empirical studies
     • What reveals causality is not the type of analysis technique but the design of the study
     • Applying to a data set an analysis technique typically used in experiments neither makes the study an experiment nor detects causality
  18. What Does Not Make an Experiment
     • An MSR study
        • Applying ANOVA does not mean it is an experiment
        • Comparing pools of data differing in a variable's value does not imply it is an experiment
        • Even if MSR studies were randomized, they would not be experiments
     • Design guarantees
        • The removal of bias and confounding variables
        • That the differences observed in behavior are caused by the treatments
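To see that the analysis technique is blind to the study design, here is a minimal one-way ANOVA F statistic in plain Python (a textbook formula, not a specific library's implementation). The function receives only lists of numbers; it returns the same kind of result whether the groups came from a randomized experiment or from pools of mined data that differ in a variable's value, so the design, not the test, determines whether a causal reading is justified.

```python
from statistics import mean


def one_way_anova_f(*groups):
    """One-way ANOVA F statistic: between-group mean square divided by
    within-group mean square."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# The function has no way to know how these (hypothetical) groups were formed.
print(one_way_anova_f([1, 2, 3], [4, 5, 6]))  # 13.5
```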
  19. Impact of Randomization and Design
  20. Types of Experiments
     • Without intervention
        • Natural environment
           • Natural experiments
     • Intervention
        • Where?
           • Artificial controlled environment
              • Laboratory controlled experiments
           • Natural environment
              • Field experiments
  21. Laboratory experiments
        • Purposeful intervention
        • Randomized allocation of treatments
        • Artificial, highly controlled environment
     Field experiments
        • Purposeful intervention
        • Randomized allocation of treatments
        • Natural, uncontrolled environment
  22. Natural experiments
        • No intervention
        • In a natural, uncontrolled environment
  23. Mining Software Repositories
     • MSR research
        • Outcomes (such as quality and productivity) are studied in large samples of past data in order to
           • Apply statistical methods to test hypotheses
           • Build machine learning and mining methods on past data into tools that support programming tasks
     • The data stored in a repository have been obtained from reality (without intervention)
        • Therefore MSR works are observational studies
        • We could call them natural experiments, but that term is misleading
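The taxonomy on slides 20 to 23 can be summarized as a small decision function. This is a simplified reading of the slides, not a complete classification scheme: the quasi-experiment branch uses the usual definition (intervention without random allocation), consistent with slide 13, and the function labels every study without intervention as observational, even if randomization was used somewhere, in line with slides 18 and 23.

```python
def classify_study(intervention: bool, randomized: bool, natural_environment: bool) -> str:
    """Simplified reading of the study types on the preceding slides."""
    if not intervention:
        # Data are generated by reality; the researcher only observes.
        # Calling this a "natural experiment" is considered misleading.
        return "observational study"
    if not randomized:
        # Intervention without random allocation of treatments.
        return "quasi-experiment"
    # Purposeful intervention plus randomized allocation: where does it run?
    return "field experiment" if natural_environment else "laboratory experiment"


# A typical MSR study: past repository data from real projects, no intervention.
print(classify_study(intervention=False, randomized=False, natural_environment=True))
```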
  24. MSR and Epidemiology
  25. Empirical Studies in Medicine
     [Figure: method development; laboratory research or pre-clinical non-human experiments; field research with ill people and with ill & healthy people (from 20-100 volunteers to 1-2M patients); descriptive and analytic (retrospective, prospective) studies]
  26. Empirical Studies in Medicine
     • Analytical
        • Experimental
           • Clinical trial
           • Field trial
           • Group trial
        • Observational
           • Cohort studies
              • Prospective study; follow-up study; concurrent study; incidence study; longitudinal study
              • Historical cohort studies
           • Case-control studies
              • Retrospective study; case comparison study; case history study; case compeer study; case referent study; trohoc study
     • Descriptive
        • Individuals
           • Cross-sectional studies: prevalence study; disease frequency study; morbidity survey; health survey
           • Case series
           • Single case
        • Population
           • Ecological studies
  27. (Prospective) Cohort Study
     • Data are collected at regular intervals, over a period of time, from a group of people who do not have the disease, to see who develops the disease (new incidence)
     • Cohort
        • Group of people who share a common characteristic within a defined period
        • e.g., they are born, are exposed to a drug, vaccine or pollutant, or undergo a certain medical procedure
     • Comparison group
        • The general population from which the cohort is drawn
        • Another cohort of persons thought to have had little or no exposure to the substance under investigation, but otherwise similar
        • SE: projects/commits that have not applied the method under study
     • Example
        • Is exposure to X (smoking) associated with outcome Y (lung cancer)?
        • Such a study would recruit a group of smokers and a group of non-smokers (the unexposed group), follow them for a set period of time, and note differences in the incidence of lung cancer between the groups at the end of this time
        • SE: a passive follow-up of projects/commits, collecting data at regular intervals and noting the quality/productivity they achieve
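A prospective cohort analysis can be sketched as follows. Exposure is recorded, not assigned; each unit is observed at regular follow-up intervals; and the cumulative incidence of the outcome is compared between exposed and unexposed units, here via the relative risk (incidence in the exposed divided by incidence in the unexposed), the usual association measure for cohort designs. The `Unit` type, the SE reading (commits that applied a method, outcome = a reported defect) and the data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """A hypothetical cohort member, e.g. a commit or a project."""
    exposed: bool                 # observed exposure (e.g. applied the method); never assigned
    outcome_by_visit: list[bool]  # outcome status at each regular follow-up interval


def cumulative_incidence(units: list[Unit]) -> float:
    """Share of units that develop the outcome at any follow-up visit."""
    return sum(any(u.outcome_by_visit) for u in units) / len(units)


def relative_risk(cohort: list[Unit]) -> float:
    exposed = [u for u in cohort if u.exposed]
    unexposed = [u for u in cohort if not u.exposed]
    return cumulative_incidence(exposed) / cumulative_incidence(unexposed)


# Hypothetical follow-up data: three visits per unit, all outcome-free at the start.
cohort = [
    Unit(True, [False, True, True]),
    Unit(True, [False, False, False]),
    Unit(False, [False, False, True]),
    Unit(False, [False, False, False]),
    Unit(False, [False, False, False]),
    Unit(False, [False, False, False]),
]
print(relative_risk(cohort))  # 0.5 / 0.25 = 2.0
```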
  28. Retrospective Studies
     • The researcher collects data from past records and does not follow patients up, as is done in prospective studies
     • All the events (exposure, latent period, and subsequent outcome, i.e. development of disease) have already occurred in the past
     • Errors due to confounding and bias are more common in retrospective studies than in prospective studies
  29. Retrospective Studies: Threats to Validity
     • Some key data have not been measured
     • Biases may affect the selection of controls
     • Selection bias
        • Only patients with the necessary information are selected
     • Misclassification or information bias as a result of the retrospective aspect
        • Researchers cannot control exposure or outcome assessment and instead need to rely on others for accurate record keeping
     • It can be very difficult to make accurate comparisons between the exposed and the non-exposed
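The selection-bias threat can be illustrated with a synthetic example. Suppose, purely as an assumption, that defective projects are less likely to have complete records; restricting the analysis to records with the necessary information then underestimates the defect rate, even though every individual record is accurate. The rates below are invented.

```python
import random

rng = random.Random(11)
records = []
for _ in range(20_000):
    defective = rng.random() < 0.30  # hypothetical true defect rate: 30%
    # Assumption: defective projects are documented less carefully,
    # so their records are complete less often.
    complete = rng.random() < (0.5 if defective else 0.9)
    records.append((defective, complete))

true_rate = sum(d for d, _ in records) / len(records)
complete_only = [d for d, c in records if c]
observed_rate = sum(complete_only) / len(complete_only)
print(f"true defect rate:         {true_rate:.3f}")
print(f"rate in complete records: {observed_rate:.3f}")
```

Under these assumptions the complete-records estimate lands near 0.19 instead of 0.30.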
  30. Retrospective Cohort Study
     • Records of groups of individuals who are alike in many ways but differ in a certain characteristic are compared for a particular outcome
        • For example, female nurses who smoke and those who do not smoke
        • SE: use of past data in a repository to compare a certain outcome of projects with characteristic A and without A
     • The researcher collects data from past records and does not follow patients up, as is the case with a prospective study
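In code, a retrospective cohort comparison on repository data is simply a split of existing records by the characteristic, followed by a comparison of an outcome that has also already been recorded; nothing is assigned and nothing is followed forward. The record fields (`uses_A`, `defects`) and values below are hypothetical.

```python
from statistics import mean


def retrospective_cohort(records, characteristic, outcome):
    """Split past records by a recorded characteristic and return the mean
    outcome for units with and without it."""
    with_char = [r[outcome] for r in records if r[characteristic]]
    without_char = [r[outcome] for r in records if not r[characteristic]]
    return mean(with_char), mean(without_char)


# Hypothetical records mined from a repository.
past_projects = [
    {"project": "p1", "uses_A": True, "defects": 4},
    {"project": "p2", "uses_A": True, "defects": 6},
    {"project": "p3", "uses_A": False, "defects": 8},
    {"project": "p4", "uses_A": False, "defects": 10},
]
print(retrospective_cohort(past_projects, "uses_A", "defects"))  # (5, 9)
```

Any difference found this way is an association; as noted for retrospective studies, confounding and selection bias cannot be ruled out by the computation.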
  31. (Retrospective) Case-Control Study
     • Records of individuals are divided into two groups differing in outcome (disease or not) and compared on the basis of some supposed causal attribute
        • Case-control studies select subjects based on their disease status (the effect)
        • Cohort studies select subjects based on their exposure status (the cause)
     • SE: select projects/commits with a certain outcome level (e.g. a quality value) and trace back the project characteristics believed to contribute to quality
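Because a case-control study samples on the outcome, incidence cannot be computed from it; the usual measure of association is instead the odds ratio of exposure between cases and controls. The sketch below is hypothetical throughout: the outcome `low_quality`, the exposure `no_code_review` and the counts are invented for illustration.

```python
def odds_ratio(records, outcome, exposure):
    """Odds ratio for a case-control design: (a*d)/(b*c) in the usual 2x2 table."""
    cases = [r for r in records if r[outcome]]         # selected by the effect
    controls = [r for r in records if not r[outcome]]
    a = sum(1 for r in cases if r[exposure])     # exposed cases
    b = len(cases) - a                           # unexposed cases
    c = sum(1 for r in controls if r[exposure])  # exposed controls
    d = len(controls) - c                        # unexposed controls
    return (a * d) / (b * c)


def record(low_quality, no_code_review):
    return {"low_quality": low_quality, "no_code_review": no_code_review}


commits = (
    [record(True, True)] * 6 + [record(True, False)] * 4      # cases
    + [record(False, True)] * 3 + [record(False, False)] * 7  # controls
)
print(odds_ratio(commits, "low_quality", "no_code_review"))  # (6*7)/(4*3) = 3.5
```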
  32. Ecological Studies
     • Units of analysis are populations
     • Groups are compared rather than individuals
     • Explores correlations between group-level exposures and outcomes
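An ecological analysis works on group-level aggregates. The sketch computes a Pearson correlation across hypothetical teams (share of reviewed commits vs. mean defect density); the values are invented. A strong group-level correlation does not by itself say anything about individual commits or developers, which is the well-known ecological fallacy.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical per-team aggregates (the units of analysis are teams, not commits).
review_share = [0.2, 0.4, 0.6, 0.8]
defect_density = [5.0, 4.0, 3.0, 2.0]
print(round(pearson(review_share, defect_density), 3))  # -1.0
```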
  33. Hierarchies of Evidence
  34. Hierarchy of Evidence
     • It is critical to understand which empirical study you are conducting
        • To fully understand what the results are telling us
        • The type of results depends on the type of study!
     • Evidence hierarchies reflect the relative authority of various types of empirical studies
  35. Authority of Evidence
     [Figure: field experiments; observational analytical studies (prospective, retrospective); observational descriptive studies; laboratory experiments]
  36. Psychology Hierarchy of Evidence
  37. Two MSR Examples
  38. Example 1
     • MSR'15
     • The Uniqueness of Changes: Characteristics and Applications
     • Ray, Nagappan, Bird, Nagappan, Zimmermann
     • Why this paper
        • A very well-written paper
        • Several empirical studies of different types about the same issue
        • Prominent MSR authors
  39. Empirical Studies (Authors' Terms)
     • Topic
        • Some changes are unique while others are not
        • They propose a way to identify the uniqueness of changes
     • Empirical studies (in the authors' terms)
        • Analysis of the properties of unique and non-unique changes
           • What is the extent of unique changes; who introduces unique changes; where do unique changes take place
        • Applications
           • Experiment for risk analysis
              • Check whether U (unique) file commits have a higher defect rate than NU (non-unique) file commits
              • Use the Mann-Whitney test for the comparison
           • Recommendation systems
              • A system is embedded in the development environment to suggest changes to developers
              • Precision and recall of the recommendations are analyzed
  40. Type of Empirical Studies (Epidemiology Terms)
     • Analysis of the properties of unique and non-unique changes
        • What is the extent of unique changes; who introduces unique changes; where do unique changes take place
        • Ecological study
           • Descriptive; uses population-level aggregated data
     • Application: experiment for risk analysis
        • Check whether U file commits have a higher defect rate than NU file commits
        • Retrospective cohort study
           • Comparison of past data
     • Application: recommendation systems
        • A system is embedded in the development environment to suggest changes to developers; precision and recall of the recommendations are analyzed
        • Prospective observational study; ecological?
           • But no comparison is made (e.g., of the quality/productivity of developments that use the recommendations)
           • Could be conducted as a field trial or a (prospective) cohort study
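For reference, the Mann-Whitney U statistic used in Example 1 can be computed by counting, over all pairs, how often a value from one sample exceeds a value from the other (ties count one half). The sketch is a textbook version that returns only the statistic (for p-values, SciPy's `scipy.stats.mannwhitneyu` is a common choice), and the defect rates are hypothetical. The test only compares two sets of already-recorded values, which is why the study is classified above as a retrospective cohort study rather than an experiment.

```python
def mann_whitney_u(x, y):
    """U statistic for sample x: pairs (a, b) with a > b count 1, ties count 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


# Hypothetical defect rates for unique (U) and non-unique (NU) file commits.
u_rates = [0.30, 0.25, 0.40]
nu_rates = [0.10, 0.25, 0.20]
u_stat = mann_whitney_u(u_rates, nu_rates)
print(u_stat, mann_whitney_u(nu_rates, u_rates))  # 8.5 0.5 (they sum to 3 * 3 = 9)
```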
  41. Example 2
     • ESEM'15
     • How to make best use of cross-company data for web effort estimation
     • Minku, Sarro, Mendes, Ferrucci
     • Topic
        • Compares cross-company (CC) versus within-company (WC) datasets for web effort estimation
        • Compares Dycom against NN-filtering
           • Dycom: framework for learning software effort estimation models for a company by mapping CC models to the company's context
           • NN-filtering: nearest-neighbor filtering to make CC estimations
  42. Experiments in Effort Estimation Research
     • Intervention
        • The two (effort estimation) techniques compared
     • Allocation of treatments to units?
        • Yes
        • Every project belonging to the test data set is an experimental unit
        • Experimental groups are the test data set estimated with one or the other technique
        • Typical AB designs, but others could be tried
     • Control of confounding variables through randomization?
        • No
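Read this way, the typical design applies both treatments (estimation techniques) to every unit (test project) and compares an accuracy measure. The sketch uses mean absolute error as that measure, which is an assumption for illustration; `technique_a` and `technique_b` are hypothetical stand-ins, not implementations of Dycom or NN-filtering, and the test set is invented. As on the slide, there is no randomization step.

```python
from statistics import mean


def mean_absolute_error(estimate, projects):
    """Mean absolute difference between estimated and actual effort."""
    return mean(abs(estimate(p) - p["actual_effort"]) for p in projects)


# Hypothetical stand-ins for the two compared techniques.
def technique_a(project):
    return project["size"] * 10


def technique_b(project):
    return project["size"] * 12


# Every project in the (hypothetical) test set is an experimental unit that
# receives both treatments, i.e. is estimated with both techniques.
test_set = [
    {"size": 5, "actual_effort": 60},
    {"size": 10, "actual_effort": 100},
]
mae_a = mean_absolute_error(technique_a, test_set)
mae_b = mean_absolute_error(technique_b, test_set)
print(mae_a, mae_b)  # 5 10
```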
  43. Which Uses Were Right
     Use of the term experiment in 2015 papers, by venue:
     • ESEM: 30.5% (11 out of 36)
        • MSR works: 72.7% (8 papers); traditional experiments: 27.3% (3 papers)
        • MSR works by type: observational 12.5%; data experiments 87.5%
     • MSR: 42.8% (18 out of 42)
        • MSR works: 100% (18 papers); traditional experiments: 0%
        • MSR works by type: observational 44.4%; data experiments 55.5%
     • EMSE: 52.72% (29 out of 55)
        • MSR works: 65.5% (19 papers); traditional experiments: 34.5% (10 papers)
        • MSR works by type: observational 52.6%; data experiments 47.4%
  44. Conclusions
  45. Conclusions
     • MSR is a research method through which several types of empirical studies can be conducted
     • In any case, most research is
        • Observational
        • Retrospective
           • Unless data are mined from development tools prospectively
     • Therefore the evidence obtained is of lower quality than that from
        • Observational prospective studies
        • Field experimental studies
     • It shows correlation, but it is hard to prove causation
     • More powerful types of observational studies (case-control; cohort) could yield better evidence