Preetha Chatterjee
Extracting Archival-Quality Information
from Software-Related Chats
Use of online developer chat services is increasing
User1: hi guys, i was trying to write a dataframe to txt files
and managed to write 11 files successfully. Halfway
through, I encountered this error:
User2: Hello guys, how can I write this in python so it will
work as in console? I'm helpless
User3: @User1 Don't reinvent the wheel... have a look at
the `smtplib` library for sending mail.
@User2 The `"a"` means you're opening an existing file for
append. Does it exist? You also seem to be mixing forward
and backward slashes... better practice is to use
`os.path.join` to be OS-independent.
Chats Contain Valuable Knowledge
The functionality of the code
Indication of an error in the code
API functionalities
Cause of the error presented
earlier
Good programming practice
Similar Knowledge has been Mined from Other
Software Artifacts to Improve Software Tools
4
Emails and bug reports:
• Re-documenting source code [Panichella 2012]
• Recommending mentors in software projects [Canfora 2012]
Tutorials: API learning [Jiang 2017, Petrosyan 2015]
Q & A forums:
• IDE recommendation [Rahman 2014, Cordeiro 2012, Ponzanelli 2014, Bacchelli 2012]
• Learning and recommendation of API [Chen 2016, Rahman 2016, Wang 2013]
• Automatic generation of comments [Wong 2013, Rahman 2015]
• Building thesaurus of software-specific terms [Tian 2014, Chen 2017]
Developer Chats as a Knowledge Source
Why archive information from chats?
• Transient content
• Informal and unstructured nature
• Noise and useful information
Archiving information is beneficial only if we can preserve
knowledge that could be useful for future reference, and thus
serve as a good resource for software engineers and/or mining
tools. 5
Previous analyses of chats
• Learn developer behaviors [Elliot 2003, Shihab 2009, Yu 2011, Lin 2016]
• Extract rationales [Alkadhi 2017, 2018], feature requests [ Shi 2020]
• Build Chatbots [Lebeuf 2017, Paikari 2018, Wood 2018]
Developer Conversation - 1
6
Author Utterance
Alexia Hi, I have a file with following contents 1234 alphabet /vag/one/arun > 1454 bigdata
/home/two/ogra > 5684 apple /vinay/three/dire, , but i want the output to be like 1234 alphabet
one > 1454 bigdata two > 5684 apple three
Elaina sed −r ’s|(.+)/[ˆ/]+/([ˆ/]+)/.+|12|g’
Corina Even though I dont have anything to do with this question, could you explain the logic behind the
answer? The formatting sentence seem so random
Elaina ‘sed -r‘ is an extended mode, so that + is enabled (matches one or more characters, unlike * that
matches zero or more); s///g or s|||g or any symbol instead of | is how a basic replacing expression
is constructed. The first field is what to match, the second is what to replace it with.;(.+)
/[ˆ/]+/([ˆ/]+) /.+ (.+)/ matches anything from the start until the first / and puts found characters in
the first group (1); [ˆ/]+/ matches anything that is not a slash, and then a slash (‘vag/‘ or ‘home/‘);
([ˆ/]+)/ matches the same thing, but puts the stuff found in-between slashes in the second group 2;
and then .+ matches whatever comes next to the end of line; and the second field tells sed to
replace the line with 12, so our saved groups side-by-side: the first group was everything before
the first slash, and the second group was the stuff between 2nd and 3rd slashes
Corina Ah ok, thanks a lot for the explanation!
Developer Conversation - 2
7
Author Utterance
Cody Hello guys I got a huge problem
Holli Cody: ask away
Cody We’ve been ask as assignment the implementation of Dijkstra’s and Bellman Ford’s algorithm for
calculating the shortest path in a given graph
Holli So what’s the issue?; run into a problem?
Cody I don’t really know how to start and that’s my problem
… ….
Darrin Cody: how much experience do you have writing code?; for example, there are quite a number of
existing examples of the algorithms you’re talking about
Cody basic i’m just starting
Darrin ok; can you describe the steps on how you execute the algorithm?; and do you understand why those
steps are necessary?; if so, then the next step you take is translating your written description of the
process into pseudocode; once you have a reasonable sequence of actions, you then implement the
pseudocode in your language of choice; frankly, the first two items are always the most difficult;
because it requires you to understand the problem domain; once you understand it, making it work is
usually much less effort
Rachel Cody: oof graph theory for a beginner. do you understand how those algorithms work, ?
Cody Yes I understand how those work
Darrin just having trouble translating described steps to code?
Comparing Developer Conversations of Varying Quality
8
Author Utterance
Cody Hello guys I got a huge problem
Holli Cody: ask away
Cody We’ve been ask as assignment the implementation of Dijkstra’s and Bellman
Ford’s algorithm for calculating the shortest path in a given graph
Holli So what’s the issue?; run into a problem?
Cody I don’t really know how to start and that’s my problem
… ….
Darrin Cody: how much experience do you have writing code?; for example, there are
quite a number of existing examples of the algorithms you’re talking about
Cody basic i’m just starting
Darrin ok; can you describe the steps on how you execute the algorithm?; and do you
understand why those steps are necessary?; if so, then the next step you take is
translating your written description of the process into pseudocode; once you
have a reasonable sequence of actions, you then implement the pseudocode in
your language of choice; frankly, the first two items are always the most
difficult; because it requires you to understand the problem domain; once you
understand it, making it work is usually much less effort
Rachel Cody: oof graph theory for a beginner. do you understand how those algorithms
work, ?
Cody Yes I understand how those work
Darrin just having trouble translating described steps to code?
Author Utterance
Alexia Hi, I have a file with following contents 1234 alphabet /vag/one/arun
> 1454 bigdata /home/two/ogra > 5684 apple /vinay/three/dire, , but
i want the output to be like 1234 alphabet one > 1454 bigdata two >
5684 apple three
Elaina sed −r ’s|(.+)/[ˆ/]+/([ˆ/]+)/.+|12|g’
Corina Even though I dont have anything to do with this question, could you
explain the logic behind the answer? The formatting sentence seem
so random
Elaina ‘sed -r‘ is an extended mode, so that + is enabled (matches one or
more characters, unlike * that matches zero or more); s///g or s|||g
or any symbol instead of | is how a basic replacing expression is
constructed. The first field is what to match, the second is what to
replace it with.;(.+) /[ˆ/]+/([ˆ/]+) /.+ (.+)/ matches anything from the
start until the first / and puts found characters in the first group (1);
[ˆ/]+/ matches anything that is not a slash, and then a slash (‘vag/‘ or
‘home/‘); ([ˆ/]+)/ matches the same thing, but puts the stuff found
in-between slashes in the second group 2; and then .+ matches
whatever comes next to the end of line; and the second field tells sed
to replace the line with 12, so our saved groups side-by-side: the
first group was everything before the first slash, and the second
group was the stuff between 2nd and 3rd slashes
Corina Ah ok, thanks a lot for the explanation!
Good Archival-Quality Conversation Poor Archival-Quality Conversation
Focus of this Dissertation
9
• RQ0: How much archival-quality information exists in
developer chats?
• RQ1: How accurately can we automatically identify
conversations containing archival-quality information?
• RQ2: How much can we accurately reduce the granularity of
archival-quality information?
• RQ3: Case Study: How do extracted archival-quality
information from Slack chats compare to related Stack
Overflow posts?
10
Extracting Archival-Quality Information from Software-Related Chats
RQ0: How much
archival-quality
information
exists in
developer chats?
RQ2: How much can
we accurately reduce
the granularity of
archival-quality
information?
RQ3: Case Study: How
extracted archival-
quality information
from Slack compare
to Stack Overflow?
RQ1: How accurately
can we automatically
identify conversations
containing archival-
quality information?
Investigate
prevalence of
archival-quality
information
Investigate
characteristics
that inhibit
automatic mining
Disentangle developer
chat conversations
Determine archival-
quality conversations
Identify properties of
archival-quality info
Identify seeds of
archival-quality info
Expand seeds with
necessary context
Determine nuggets of
archival-quality info
Investigate duplicate
information
Investigate
complementary
information
Developer chat
history
Chat
Archive
Investigate contradictory
information
11
Extracting Archival-Quality Information from Software-Related Chats
RQ0: How much
archival-quality
information
exists in
developer chats?
RQ2: How much can
we accurately reduce
the granularity of
archival-quality
information?
RQ3: Case Study: How
extracted archival-
quality information
from Slack compare to
Stack Overflow?
RQ1: How accurately
can we automatically
identify conversations
containing archival-
quality information?
Investigate
prevalence of
archival-quality
information
Investigate
characteristics
that inhibit
automatic mining
Disentangle developer
chat conversations
Determine archival-
quality conversations
Identify properties of
archival-quality info
Identify seeds of
archival-quality info
Expand seeds with
necessary context
Determine nuggets of
archival-quality info
Developer chat
history
Chat
Archive
Investigate duplicate
information
Investigate
complementary
information
Investigate contradictory
information
Study Questions
12
How prevalent was the information that has been
successfully mined from Stack Overflow Q&A forum to
support software engineering tools in developer Q&A chats
such as Slack?
Did Slack Q&A chats have characteristics that might inhibit
automatic mining of information to support software
engineering tools?
RQ0: How much archival-quality information exists in developer chats?
P. Chatterjee, K. Damevski, L. Pollock, V. Augustine, and N. A. Kraft, " Exploratory Study of
Slack Q&A Chats as a Mining Source for Software Engineering Tools," MSR 2019
Comparing Prevalence of information
13
Q & A
Conversations
Q & A
Posts
Similar Intent of Learning and Sharing
Information among Developers
Much of the information mined from Stack Overflow is available on Slack channels.
API mentions are available in larger quantities on Slack Q&A channels.
Links are rarely available on both Slack and Stack Overflow Q&A.
RQ0: How much archival-quality information exists in developer chats?
Challenges in Automatic Mining of Chats
14
Measure Results
Participant frequency 1 < 2 < 34
Questions with no answer 15.75%
Answer frequency 0<1<5
Questions with no accepted answer 52.25%
NL text context per code snippet 0 < 2 < 13
Incomplete sentences 12.63%
Noise in document 10.5%
Knowledge construction
61.5% crowd; 38.5%
participatory
Accepted answers are available in conversations, but require more effort to discern.
RQ0: How much archival-quality information exists in developer chats?
Low percentage of noise and incomplete sentences.
15
Extracting Archival-Quality Information from Software-Related Chats
RQ0: How much
archival-quality
information exist
in developer
chats?
RQ2: How much can
we accurately reduce
the granularity of
archival-quality
information?
RQ3: Case Study: How
extracted archival-
quality information
from Slack compare to
Stack Overflow?
RQ1: How accurately
can we automatically
identify conversations
containing archival-
quality information?
Investigate
prevalence of
archival-quality
information
Investigate
characteristics
that inhibit
automatic mining
Disentangle developer
chat conversations
Determine archival-
quality conversations
Identify properties of
archival-quality info
Identify seeds of
archival-quality info
Expand seeds with
necessary context
Determine nuggets of
archival-quality info
Developer chat
history
Chat
Archive
Investigate duplicate
information
Investigate
complementary
information
Investigate contradictory
information
Author Time Utterance Utterance Type
Clara 12:51 pm Hi guys. How can we delete all line breaks from .docx file? I’m using
python-docx library. In docx - I store some Jinja2 template, which later
I’m rendering with some data.
Question
Cyrus 1:32 pm Not sure if I should ask here or job board, I wanted to expand my github
and use it as a portfolio of sorts, are there certain types of projects that
are good to have in there to show my comptency?
Question
Judith 1:49 pm no one has real time to browse through your repo i would think. So if
you want a position that uses django/react then do a project that does
so. If you’re trying to get into scraping, do a scraping project etc
Answer
Judith 1:53 pm Clara: yes just open the file and remove all the line breaks. They are
essentially the special character ‘n‘
Answer
Task1: Disentangle developer chat conversations
16
Elsner and Charniak 2008
• timestamp
• author information
• occurrence of similar words
• cue words (e.g., hello, yes, no)
• technical jargon
Micro-averaged F-measure: 0.66
Chatterjee et al. 2020
• larger window of messages
• last five messages regardless of
elapsed time
• emoji or code blocks
• technical words
Micro-averaged F-measure : 0.80
RQ1: How accurately can we automatically identify conversations
containing archival-quality information?
P. Chatterjee, K. Damevski,
N. A. Kraft , L. Pollock,
"Software-related Slack
Chats with Disentangled
Conversations," MSR 2020
Task2: Properties of Archival-Quality Information
17
Properties of archival-
worthy knowledge
Feature Extraction
Supervised
machine learning
ML algorithms
Determine archival
worthiness
Classification
RQ1: How accurately can we automatically identify conversations
containing archival-quality information?
Evaluation for RQ1
18
MEASURES: PRECISION,
RECALL AND F-MEASURE
EQ. How effective is my approach in
determining archival worthy conversations?
RQ1: How accurately can we automatically identify conversations
containing archival-quality information?
Creation of Gold Set - Human Judges
19
Extracting Archival-Quality Information from Software-Related Chats
RQ0: How much
archival-quality
information exist
in developer
chats?
RQ2: How much can
we accurately reduce
the granularity of
archival-quality
information?
RQ3: Case Study: How
extracted archival-
quality information
from Slack compare to
Stack Overflow?
RQ1: How accurately
can we automatically
identify conversations
containing archival-
quality information?
Investigate
prevalence of
archival-quality
information
Investigate
characteristics
that inhibit
automatic mining
Disentangle developer
chat conversations
Determine archival-
quality conversations
Identify properties of
archival-quality info
Identify seeds of
archival-quality info
Expand seeds with
necessary context
Determine nuggets of
archival-quality info
Developer chat
history
Chat
Archive
Investigate duplicate
information
Investigate
complementary
information
Investigate contradictory
information
Author Utterance
Alexia Hi, I have a file with following contents 1234 alphabet /vag/one/arun > 1454 bigdata
/home/two/ogra > 5684 apple /vinay/three/dire, , but i want the output to be like 1234 alphabet
one > 1454 bigdata two > 5684 apple three
Elaina sed −r ’s|(.+)/[ˆ/]+/([ˆ/]+)/.+|12|g’
Corina Even though I dont have anything to do with this question, could you explain the logic behind the
answer? The formatting sentence seem so random
Elaina ‘sed -r‘ is an extended mode, so that + is enabled (matches one or more characters, unlike * that
matches zero or more); s///g or s|||g or any symbol instead of | is how a basic replacing expression
is constructed. The first field is what to match, the second is what to replace it with.;(.+)
/[ˆ/]+/([ˆ/]+) /.+ (.+)/ matches anything from the start until the first / and puts found characters in
the first group (1); [ˆ/]+/ matches anything that is not a slash, and then a slash (‘vag/‘ or ‘home/‘);
([ˆ/]+)/ matches the same thing, but puts the stuff found in-between slashes in the second group 2;
and then .+ matches whatever comes next to the end of line; and the second field tells sed to
replace the line with 12, so our saved groups side-by-side: the first group was everything before
the first slash, and the second group was the stuff between 2nd and 3rd slashes
Corina Ah ok, thanks a lot for the explanation!
An Example
20
RQ2: How much can we accurately reduce the granularity of archival-
quality information?
Reduce Granularity of Archival Quality Information
21
Identify seeds of
archival-quality info
Seeds
Expand seeds with
necessary context
Context
Determine nuggets of
archival-quality info
Seeds + Context =
Nuggets
RQ2: How much can we accurately reduce the granularity of archival-
quality information?
Author Utterance
Alexia Hi, I have a file with following contents 1234 alphabet /vag/one/arun > 1454 bigdata
/home/two/ogra > 5684 apple /vinay/three/dire, , but i want the output to be like 1234 alphabet
one > 1454 bigdata two > 5684 apple three
Elaina sed −r ’s|(.+)/[ˆ/]+/([ˆ/]+)/.+|12|g’
Corina Even though I dont have anything to do with this question, could you explain the logic behind the
answer? The formatting sentence seem so random
Elaina ‘sed -r‘ is an extended mode, so that + is enabled (matches one or more characters, unlike * that
matches zero or more); s///g or s|||g or any symbol instead of | is how a basic replacing expression
is constructed. The first field is what to match, the second is what to replace it with.;(.+) /[ˆ/]+/([ˆ/]+)
/.+ (.+)/ matches anything from the start until the first / and puts found characters in the first group
(1); [ˆ/]+/ matches anything that is not a slash, and then a slash (‘vag/‘ or ‘home/‘); ([ˆ/]+)/ matches
the same thing, but puts the stuff found in-between slashes in the second group 2; and then .+
matches whatever comes next to the end of line; and the second field tells sed to replace the line
with 12, so our saved groups side-by-side: the first group was everything before the first slash, and
the second group was the stuff between 2nd and 3rd slashes
Corina Ah ok, thanks a lot for the explanation!
Example Revisited
22
RQ2: How much can we accurately reduce the granularity of archival-
quality information?
Seed
Seed
Context
Author Utterance
Alexia Hi, I have a file with following contents 1234 alphabet /vag/one/arun > 1454 bigdata
/home/two/ogra > 5684 apple /vinay/three/dire, , but i want the output to be like 1234 alphabet
one > 1454 bigdata two > 5684 apple three
Elaina sed −r ’s|(.+)/[ˆ/]+/([ˆ/]+)/.+|12|g’
Corina Even though I dont have anything to do with this question, could you explain the logic behind the
answer? The formatting sentence seem so random
Elaina ‘sed -r‘ is an extended mode, so that + is enabled (matches one or more characters, unlike * that
matches zero or more); s///g or s|||g or any symbol instead of | is how a basic replacing expression
is constructed. The first field is what to match, the second is what to replace it with.;(.+) /[ˆ/]+/([ˆ/]+)
/.+ (.+)/ matches anything from the start until the first / and puts found characters in the first group
(1); [ˆ/]+/ matches anything that is not a slash, and then a slash (‘vag/‘ or ‘home/‘); ([ˆ/]+)/ matches
the same thing, but puts the stuff found in-between slashes in the second group 2; and then .+
matches whatever comes next to the end of line; and the second field tells sed to replace the line
with 12, so our saved groups side-by-side: the first group was everything before the first slash, and
the second group was the stuff between 2nd and 3rd slashes
Corina Ah ok, thanks a lot for the explanation!
Example Revisited
23
RQ2: How much can we accurately reduce the granularity of archival-
quality information?
Evaluation for RQ2
24
MEASURES: PRECISION,
RECALL AND F-MEASURE
EQ. How accurate is my approach in reducing
granularity of archival-quality information in
developer chats?
RQ2: How much can we accurately reduce the granularity of archival-
quality information?
Creation of Gold Set - Human Judges
25
Extracting Archival-Quality Information from Software-Related Chats
RQ0: How much
archival-quality
information exist
in developer
chats?
RQ2: How much can
we accurately reduce
the granularity of
archival-quality
information?
RQ3: Case Study: How
extracted archival-
quality information
from Slack compare
to Stack Overflow?
RQ1: How accurately
can we automatically
identify conversations
containing archival-
quality information?
Investigate
prevalence of
archival-quality
information
Investigate
characteristics
that inhibit
automatic mining
Disentangle developer
chat conversations
Determine archival-
quality conversations
Identify properties of
archival-quality info
Identify seeds of
archival-quality info
Expand seeds with
necessary context
Determine nuggets of
archival-quality info
Developer chat
history
Chat
Archive
Investigate duplicate
information
Investigate
complementary
information
Investigate contradictory
information
Case Study Questions
26
CSQ1. How much information in the Slack archive and Stack Overflow is
duplicate? What kinds of information is duplicate?
CSQ2. How much information in the Slack archive and Stack Overflow is
complementary? What kinds of information is complementary?
CSQ3. Can we determine if information in the Slack archive and Stack
Overflow contradict each other?
QUALITATIVE
ANALYSIS
RQ3: Case Study: How extracted archival-quality information from Slack
compare to Stack Overflow?
27
Extracting Archival-Quality Information from Software-Related Chats
RQ0: How much
archival-quality
information exist
in developer
chats?
RQ2: How much can
we accurately reduce
the granularity of
archival-quality
information?
RQ3: Case Study: How
extracted archival-
quality information
from Slack compare
to Stack Overflow?
RQ1: How accurately
can we automatically
identify conversations
containing archival-
quality information?
Investigate
prevalence of
archival-quality
information
Investigate
characteristics
that inhibit
automatic mining
Disentangle developer
chat conversations
Determine archival-
quality conversations
Identify properties of
archival-quality info
Identify seeds of
archival-quality info
Expand seeds with
necessary context
Determine nuggets of
archival-quality info
Investigate duplicate
information
Investigate
complementary
information
Developer chat
history
Chat
Archive
Investigate contradictory
information
Foreseen Dissertation Contributions
28
A technique to automatically disentangle conversations in software
developer chats.
Identification of properties of archival quality information in developer
chat communications.
An approach based on text processing and machine learning to extract
archival quality information in chat conversations.
A knowledge archive of developer conversations, suitable for research
beyond this project.
A case study to compare the archived information from developer chat
conversations to another archival-based software artifact, Q&A forums.
29
preethac@udel.edu
@PreethaChatterj
P. Chatterjee, M. A. Nishi, K. Damevski, V. Augustine, L.
Pollock and N. A. Kraft, "What information about code
snippets is available in different software-related
documents? An exploratory study," SANER 2017
P. Chatterjee, K. Damevski, L. Pollock, V. Augustine, and N. A.
Kraft, " Exploratory Study of Slack Q&A Chats as a Mining
Source for Software Engineering Tools," MSR 2019
Questions?
Extracting Archival Knowledge from Software Related Chats
Technique to disentangle conversations.
Identification of properties of archival quality information.
Approach to extract archival quality information in chat conversations.
Knowledge archive of developer conversations.
Case study to compare archived information.
Disclaimer: Images used in these slides are used for educational
purposes only. Images are copyright to owners.
P. Chatterjee, K. Damevski, N. A. Kraft, and L. Pollock,
"Software-related Slack Chats with Disentangled
Conversations ," MSR 2020
Preprint:
https://tinyurl.com
/usu9n2k

Extracting Archival-Quality Information from Software-Related Chats

  • 1.
    Preetha Chatterjee Extracting Archival-QualityInformation from Software-Related Chats
  • 2.
    Use of onlinedeveloper chat services is increasing
  • 3.
    User1: hi guys,i was trying to write a dataframe to txt files and managed to write 11 files successfully. Halfway through, I encountered this error: User2: Hello guys, how can I write this in python so it will work as in console? I'm helpless User3: @User1 Don't reinvent the wheel... have a look at the `smtplib` library for sending mail. @User2 The `"a"` means you're opening an existing file for append. Does it exist? You also seem to be mixing forward and backward slashes... better practice is to use `os.path.join` to be OS-independent. Chats Contain Valuable Knowledge The functionality of the code Indication of an error in the code API functionalities Cause of the error presented earlier Good programming practice
  • 4.
    Similar Knowledge hasbeen Mined from Other Software Artifacts to Improve Software Tools 4 Emails and bug reports: • Re-documenting source code [Panichella 2012] • Recommending mentors in software projects [Canfora 2012] Tutorials: API learning [Jiang 2017, Petrosyan 2015] Q & A forums: • IDE recommendation [Rahman 2014, Cordeiro 2012, Ponzanelli 2014, Bacchelli 2012] • Learning and recommendation of API [Chen 2016, Rahman 2016, Wang 2013] • Automatic generation of comments [Wong 2013, Rahman 2015] • Building thesaurus of software-specific terms [Tian 2014, Chen 2017]
  • 5.
    Developer Chats asa Knowledge Source Why archive information from chats? • Transient content • Informal and unstructured nature • Noise and useful information Archiving information is beneficial only if we can preserve knowledge that could be useful for future reference, and thus serve as a good resource for software engineers and/or mining tools. 5 Previous analyses of chats • Learn developer behaviors [Elliot 2003, Shihab 2009, Yu 2011, Lin 2016] • Extract rationales [Alkadhi 2017, 2018], feature requests [ Shi 2020] • Build Chatbots [Lebeuf 2017, Paikari 2018, Wood 2018]
  • 6.
    Developer Conversation -1 6 Author Utterance Alexia Hi, I have a file with following contents 1234 alphabet /vag/one/arun > 1454 bigdata /home/two/ogra > 5684 apple /vinay/three/dire, , but i want the output to be like 1234 alphabet one > 1454 bigdata two > 5684 apple three Elaina sed −r ’s|(.+)/[ˆ/]+/([ˆ/]+)/.+|12|g’ Corina Even though I dont have anything to do with this question, could you explain the logic behind the answer? The formatting sentence seem so random Elaina ‘sed -r‘ is an extended mode, so that + is enabled (matches one or more characters, unlike * that matches zero or more); s///g or s|||g or any symbol instead of | is how a basic replacing expression is constructed. The first field is what to match, the second is what to replace it with.;(.+) /[ˆ/]+/([ˆ/]+) /.+ (.+)/ matches anything from the start until the first / and puts found characters in the first group (1); [ˆ/]+/ matches anything that is not a slash, and then a slash (‘vag/‘ or ‘home/‘); ([ˆ/]+)/ matches the same thing, but puts the stuff found in-between slashes in the second group 2; and then .+ matches whatever comes next to the end of line; and the second field tells sed to replace the line with 12, so our saved groups side-by-side: the first group was everything before the first slash, and the second group was the stuff between 2nd and 3rd slashes Corina Ah ok, thanks a lot for the explanation!
  • 7.
    Developer Conversation -2 7 Author Utterance Cody Hello guys I got a huge problem Holli Cody: ask away Cody We’ve been ask as assignment the implementation of Dijkstra’s and Bellman Ford’s algorithm for calculating the shortest path in a given graph Holli So what’s the issue?; run into a problem? Cody I don’t really know how to start and that’s my problem … …. Darrin Cody: how much experience do you have writing code?; for example, there are quite a number of existing examples of the algorithms you’re talking about Cody basic i’m just starting Darrin ok; can you describe the steps on how you execute the algorithm?; and do you understand why those steps are necessary?; if so, then the next step you take is translating your written description of the process into pseudocode; once you have a reasonable sequence of actions, you then implement the pseudocode in your language of choice; frankly, the first two items are always the most difficult; because it requires you to understand the problem domain; once you understand it, making it work is usually much less effort Rachel Cody: oof graph theory for a beginner. do you understand how those algorithms work, ? Cody Yes I understand how those work Darrin just having trouble translating described steps to code?
  • 8.
    Comparing Developer Conversationsof Varying Quality 8 Author Utterance Cody Hello guys I got a huge problem Holli Cody: ask away Cody We’ve been ask as assignment the implementation of Dijkstra’s and Bellman Ford’s algorithm for calculating the shortest path in a given graph Holli So what’s the issue?; run into a problem? Cody I don’t really know how to start and that’s my problem … …. Darrin Cody: how much experience do you have writing code?; for example, there are quite a number of existing examples of the algorithms you’re talking about Cody basic i’m just starting Darrin ok; can you describe the steps on how you execute the algorithm?; and do you understand why those steps are necessary?; if so, then the next step you take is translating your written description of the process into pseudocode; once you have a reasonable sequence of actions, you then implement the pseudocode in your language of choice; frankly, the first two items are always the most difficult; because it requires you to understand the problem domain; once you understand it, making it work is usually much less effort Rachel Cody: oof graph theory for a beginner. do you understand how those algorithms work, ? Cody Yes I understand how those work Darrin just having trouble translating described steps to code? Author Utterance Alexia Hi, I have a file with following contents 1234 alphabet /vag/one/arun > 1454 bigdata /home/two/ogra > 5684 apple /vinay/three/dire, , but i want the output to be like 1234 alphabet one > 1454 bigdata two > 5684 apple three Elaina sed −r ’s|(.+)/[ˆ/]+/([ˆ/]+)/.+|12|g’ Corina Even though I dont have anything to do with this question, could you explain the logic behind the answer? The formatting sentence seem so random Elaina ‘sed -r‘ is an extended mode, so that + is enabled (matches one or more characters, unlike * that matches zero or more); s///g or s|||g or any symbol instead of | is how a basic replacing expression is constructed. The first field is what to match, the second is what to replace it with.;(.+) /[ˆ/]+/([ˆ/]+) /.+ (.+)/ matches anything from the start until the first / and puts found characters in the first group (1); [ˆ/]+/ matches anything that is not a slash, and then a slash (‘vag/‘ or ‘home/‘); ([ˆ/]+)/ matches the same thing, but puts the stuff found in-between slashes in the second group 2; and then .+ matches whatever comes next to the end of line; and the second field tells sed to replace the line with 12, so our saved groups side-by-side: the first group was everything before the first slash, and the second group was the stuff between 2nd and 3rd slashes Corina Ah ok, thanks a lot for the explanation! Good Archival-Quality Conversation Poor Archival-Quality Conversation
  • 9.
    Focus of thisDissertation 9 • RQ0: How much archival-quality information exists in developer chats? • RQ1: How accurately can we automatically identify conversations containing archival-quality information? • RQ2: How much can we accurately reduce the granularity of archival-quality information? • RQ3: Case Study: How do extracted archival-quality information from Slack chats compare to related Stack Overflow posts?
  • 10.
    10 Extracting Archival-Quality Informationfrom Software-Related Chats RQ0: How much archival-quality information exists in developer chats? RQ2: How much can we accurately reduce the granularity of archival-quality information? RQ3: Case Study: How extracted archival- quality information from Slack compare to Stack Overflow? RQ1: How accurately can we automatically identify conversations containing archival- quality information? Investigate prevalence of archival-quality information Investigate characteristics that inhibit automatic mining Disentangle developer chat conversations Determine archival- quality conversations Identify properties of archival-quality info Identify seeds of archival-quality info Expand seeds with necessary context Determine nuggets of archival-quality info Investigate duplicate information Investigate complementary information Developer chat history Chat Archive Investigate contradictory information
  • 11.
    11 Extracting Archival-Quality Informationfrom Software-Related Chats RQ0: How much archival-quality information exists in developer chats? RQ2: How much can we accurately reduce the granularity of archival-quality information? RQ3: Case Study: How extracted archival- quality information from Slack compare to Stack Overflow? RQ1: How accurately can we automatically identify conversations containing archival- quality information? Investigate prevalence of archival-quality information Investigate characteristics that inhibit automatic mining Disentangle developer chat conversations Determine archival- quality conversations Identify properties of archival-quality info Identify seeds of archival-quality info Expand seeds with necessary context Determine nuggets of archival-quality info Developer chat history Chat Archive Investigate duplicate information Investigate complementary information Investigate contradictory information
  • 12.
    Study Questions 12 How prevalentwas the information that has been successfully mined from Stack Overflow Q&A forum to support software engineering tools in developer Q&A chats such as Slack? Did Slack Q&A chats have characteristics that might inhibit automatic mining of information to support software engineering tools? RQ0: How much archival-quality information exists in developer chats? P. Chatterjee, K. Damevski, L. Pollock, V. Augustine, and N. A. Kraft, " Exploratory Study of Slack Q&A Chats as a Mining Source for Software Engineering Tools," MSR 2019
  • 13.
    Comparing Prevalence ofinformation 13 Q & A Conversations Q & A Posts Similar Intent of Learning and Sharing Information among Developers Much of the information mined from Stack Overflow is available on Slack channels. API mentions are available in larger quantities on Slack Q&A channels. Links are rarely available on both Slack and Stack Overflow Q&A. RQ0: How much archival-quality information exists in developer chats?
  • 14.
    Challenges in AutomaticMining of Chats 14 Measure Results Participant frequency 1 < 2 < 34 Questions with no answer 15.75% Answer frequency 0<1<5 Questions with no accepted answer 52.25% NL text context per code snippet 0 < 2 < 13 Incomplete sentences 12.63% Noise in document 10.5% Knowledge construction 61.5% crowd; 38.5% participatory Accepted answers are available in conversations, but require more effort to discern. RQ0: How much archival-quality information exists in developer chats? Low percentage of noise and incomplete sentences.
  • 15.
    15 Extracting Archival-Quality Informationfrom Software-Related Chats RQ0: How much archival-quality information exist in developer chats? RQ2: How much can we accurately reduce the granularity of archival-quality information? RQ3: Case Study: How extracted archival- quality information from Slack compare to Stack Overflow? RQ1: How accurately can we automatically identify conversations containing archival- quality information? Investigate prevalence of archival-quality information Investigate characteristics that inhibit automatic mining Disentangle developer chat conversations Determine archival- quality conversations Identify properties of archival-quality info Identify seeds of archival-quality info Expand seeds with necessary context Determine nuggets of archival-quality info Developer chat history Chat Archive Investigate duplicate information Investigate complementary information Investigate contradictory information
  • 16.
    Author Time UtteranceUtterance Type Clara 12:51 pm Hi guys. How can we delete all line breaks from .docx file? I’m using python-docx library. In docx - I store some Jinja2 template, which later I’m rendering with some data. Question Cyrus 1:32 pm Not sure if I should ask here or job board, I wanted to expand my github and use it as a portfolio of sorts, are there certain types of projects that are good to have in there to show my comptency? Question Judith 1:49 pm no one has real time to browse through your repo i would think. So if you want a position that uses django/react then do a project that does so. If you’re trying to get into scraping, do a scraping project etc Answer Judith 1:53 pm Clara: yes just open the file and remove all the line breaks. They are essentially the special character ‘n‘ Answer Task1: Disentangle developer chat conversations 16 Elsner and Charniak 2008 • timestamp • author information • occurrence of similar words • cue words (e.g., hello, yes, no) • technical jargon Micro-averaged F-measure: 0.66 Chatterjee et al. 2020 • larger window of messages • last five messages regardless of elapsed time • emoji or code blocks • technical words Micro-averaged F-measure : 0.80 RQ1: How accurately can we automatically identify conversations containing archival-quality information? P. Chatterjee, K. Damevski, N. A. Kraft , L. Pollock, "Software-related Slack Chats with Disentangled Conversations," MSR 2020
  • 17.
    Task2: Properties ofArchival-Quality Information 17 Properties of archival- worthy knowledge Feature Extraction Supervised machine learning ML algorithms Determine archival worthiness Classification RQ1: How accurately can we automatically identify conversations containing archival-quality information?
  • 18.
    Evaluation for RQ1 18 MEASURES:PRECISION, RECALL AND F-MEASURE EQ. How effective is my approach in determining archival worthy conversations? RQ1: How accurately can we automatically identify conversations containing archival-quality information? Creation of Gold Set - Human Judges
  • 19.
    19 Extracting Archival-Quality Informationfrom Software-Related Chats RQ0: How much archival-quality information exist in developer chats? RQ2: How much can we accurately reduce the granularity of archival-quality information? RQ3: Case Study: How extracted archival- quality information from Slack compare to Stack Overflow? RQ1: How accurately can we automatically identify conversations containing archival- quality information? Investigate prevalence of archival-quality information Investigate characteristics that inhibit automatic mining Disentangle developer chat conversations Determine archival- quality conversations Identify properties of archival-quality info Identify seeds of archival-quality info Expand seeds with necessary context Determine nuggets of archival-quality info Developer chat history Chat Archive Investigate duplicate information Investigate complementary information Investigate contradictory information
  • 20.
    Author Utterance Alexia Hi,I have a file with following contents 1234 alphabet /vag/one/arun > 1454 bigdata /home/two/ogra > 5684 apple /vinay/three/dire, , but i want the output to be like 1234 alphabet one > 1454 bigdata two > 5684 apple three Elaina sed −r ’s|(.+)/[ˆ/]+/([ˆ/]+)/.+|12|g’ Corina Even though I dont have anything to do with this question, could you explain the logic behind the answer? The formatting sentence seem so random Elaina ‘sed -r‘ is an extended mode, so that + is enabled (matches one or more characters, unlike * that matches zero or more); s///g or s|||g or any symbol instead of | is how a basic replacing expression is constructed. The first field is what to match, the second is what to replace it with.;(.+) /[ˆ/]+/([ˆ/]+) /.+ (.+)/ matches anything from the start until the first / and puts found characters in the first group (1); [ˆ/]+/ matches anything that is not a slash, and then a slash (‘vag/‘ or ‘home/‘); ([ˆ/]+)/ matches the same thing, but puts the stuff found in-between slashes in the second group 2; and then .+ matches whatever comes next to the end of line; and the second field tells sed to replace the line with 12, so our saved groups side-by-side: the first group was everything before the first slash, and the second group was the stuff between 2nd and 3rd slashes Corina Ah ok, thanks a lot for the explanation! An Example 20 RQ2: How much can we accurately reduce the granularity of archival- quality information?
  • 21.
    Reduce Granularity ofArchival Quality Information 21 Identify seeds of archival-quality info Seeds Expand seeds with necessary context Context Determine nuggets of archival-quality info Seeds + Context = Nuggets RQ2: How much can we accurately reduce the granularity of archival- quality information?
  • 22.
    Author Utterance Alexia Hi,I have a file with following contents 1234 alphabet /vag/one/arun > 1454 bigdata /home/two/ogra > 5684 apple /vinay/three/dire, , but i want the output to be like 1234 alphabet one > 1454 bigdata two > 5684 apple three Elaina sed −r ’s|(.+)/[ˆ/]+/([ˆ/]+)/.+|12|g’ Corina Even though I dont have anything to do with this question, could you explain the logic behind the answer? The formatting sentence seem so random Elaina ‘sed -r‘ is an extended mode, so that + is enabled (matches one or more characters, unlike * that matches zero or more); s///g or s|||g or any symbol instead of | is how a basic replacing expression is constructed. The first field is what to match, the second is what to replace it with.;(.+) /[ˆ/]+/([ˆ/]+) /.+ (.+)/ matches anything from the start until the first / and puts found characters in the first group (1); [ˆ/]+/ matches anything that is not a slash, and then a slash (‘vag/‘ or ‘home/‘); ([ˆ/]+)/ matches the same thing, but puts the stuff found in-between slashes in the second group 2; and then .+ matches whatever comes next to the end of line; and the second field tells sed to replace the line with 12, so our saved groups side-by-side: the first group was everything before the first slash, and the second group was the stuff between 2nd and 3rd slashes Corina Ah ok, thanks a lot for the explanation! Example Revisited 22 RQ2: How much can we accurately reduce the granularity of archival- quality information? Seed Seed Context
  • 23.
    Author Utterance Alexia Hi,I have a file with following contents 1234 alphabet /vag/one/arun > 1454 bigdata /home/two/ogra > 5684 apple /vinay/three/dire, , but i want the output to be like 1234 alphabet one > 1454 bigdata two > 5684 apple three Elaina sed −r ’s|(.+)/[ˆ/]+/([ˆ/]+)/.+|12|g’ Corina Even though I dont have anything to do with this question, could you explain the logic behind the answer? The formatting sentence seem so random Elaina ‘sed -r‘ is an extended mode, so that + is enabled (matches one or more characters, unlike * that matches zero or more); s///g or s|||g or any symbol instead of | is how a basic replacing expression is constructed. The first field is what to match, the second is what to replace it with.;(.+) /[ˆ/]+/([ˆ/]+) /.+ (.+)/ matches anything from the start until the first / and puts found characters in the first group (1); [ˆ/]+/ matches anything that is not a slash, and then a slash (‘vag/‘ or ‘home/‘); ([ˆ/]+)/ matches the same thing, but puts the stuff found in-between slashes in the second group 2; and then .+ matches whatever comes next to the end of line; and the second field tells sed to replace the line with 12, so our saved groups side-by-side: the first group was everything before the first slash, and the second group was the stuff between 2nd and 3rd slashes Corina Ah ok, thanks a lot for the explanation! Example Revisited 23 RQ2: How much can we accurately reduce the granularity of archival- quality information?
  • 24.
    Evaluation for RQ2 24 MEASURES:PRECISION, RECALL AND F-MEASURE EQ. How accurate is my approach in reducing granularity of archival-quality information in developer chats? RQ2: How much can we accurately reduce the granularity of archival- quality information? Creation of Gold Set - Human Judges
  • 25.
    25 Extracting Archival-Quality Informationfrom Software-Related Chats RQ0: How much archival-quality information exist in developer chats? RQ2: How much can we accurately reduce the granularity of archival-quality information? RQ3: Case Study: How extracted archival- quality information from Slack compare to Stack Overflow? RQ1: How accurately can we automatically identify conversations containing archival- quality information? Investigate prevalence of archival-quality information Investigate characteristics that inhibit automatic mining Disentangle developer chat conversations Determine archival- quality conversations Identify properties of archival-quality info Identify seeds of archival-quality info Expand seeds with necessary context Determine nuggets of archival-quality info Developer chat history Chat Archive Investigate duplicate information Investigate complementary information Investigate contradictory information
  • 26.
    Case Study Questions 26 CSQ1.How much information in the Slack archive and Stack Overflow is duplicate? What kinds of information is duplicate? CSQ2. How much information in the Slack archive and Stack Overflow is complementary? What kinds of information is complementary? CSQ3. Can we determine if information in the Slack archive and Stack Overflow contradict each other? QUALITATIVE ANALYSIS RQ3: Case Study: How extracted archival-quality information from Slack compare to Stack Overflow?
  • 27.
    27 Extracting Archival-Quality Informationfrom Software-Related Chats RQ0: How much archival-quality information exist in developer chats? RQ2: How much can we accurately reduce the granularity of archival-quality information? RQ3: Case Study: How extracted archival- quality information from Slack compare to Stack Overflow? RQ1: How accurately can we automatically identify conversations containing archival- quality information? Investigate prevalence of archival-quality information Investigate characteristics that inhibit automatic mining Disentangle developer chat conversations Determine archival- quality conversations Identify properties of archival-quality info Identify seeds of archival-quality info Expand seeds with necessary context Determine nuggets of archival-quality info Investigate duplicate information Investigate complementary information Developer chat history Chat Archive Investigate contradictory information
  • 28.
    Foreseen Dissertation Contributions 28 Atechnique to automatically disentangle conversations in software developer chats. Identification of properties of archival quality information in developer chat communications. An approach based on text processing and machine learning to extract archival quality information in chat conversations. A knowledge archive of developer conversations, suitable for research beyond this project. A case study to compare the archived information from developer chat conversations to another archival-based software artifact, Q&A forums.
  • 29.
    29 preethac@udel.edu @PreethaChatterj P. Chatterjee, M.A. Nishi, K. Damevski, V. Augustine, L. Pollock and N. A. Kraft, "What information about code snippets is available in different software-related documents? An exploratory study," SANER 2017 P. Chatterjee, K. Damevski, L. Pollock, V. Augustine, and N. A. Kraft, " Exploratory Study of Slack Q&A Chats as a Mining Source for Software Engineering Tools," MSR 2019 Questions? Extracting Archival Knowledge from Software Related Chats Technique to disentangle conversations. Identification of properties of archival quality information. Approach to extract archival quality information in chat conversations. Knowledge archive of developer conversations. Case study to compare archived information. Disclaimer: Images used in these slides are used for educational purposes only. Images are copyright to owners. P. Chatterjee, K. Damevski, N. A. Kraft, and L. Pollock, "Software-related Slack Chats with Disentangled Conversations ," MSR 2020 Preprint: https://tinyurl.com /usu9n2k

Editor's Notes

  • #2 TA: Hi, I am Preetha Chatterjee, a PhD student working with Dr. Lori Pollock at UD. Today, I am going to present my research plan for my PhD proposal, which I presented last year. I have been working on this plan so far, and I plan to graduate Spring 2021. TS: The topic of my proposed thesis is “Extracting …”
  • #3 TA: With the increased online sharing, software developers are having conversations about software development via online chat services. (click) Developers use these communities to ask and answer specific development questions, with the aim of improving their own skills and helping others. TS: Now, let us take a look at what types of information are shared by developers on chat platforms.
  • #4 TA:. Without analyzing the code in this chat snippet, we can learn: (go through the information as animation goes.) Considerable information is gained from just a conversation. TS: Now let us take a look on how Similar Knowledge has been Mined from Other Software Artifacts to Improve Software Tools.
  • #5 TA: Emails and bug reports are mined by researchers for <read bullet>. Information mined from online tutorials is used in API learning. Q&A forums such as SO has been mined for a variety of software engineering tasks <read bullet>. Collectively, these prior works suggest that extracting these types of information embedded in all these software-related documents could be useful in building or improving software engineering tools. TS: However, less work has focused on mining unconventional resources such as developer chats.
  • #6 TA: Prior studies show that mining Slack chats could help in <read the bullet points> 1)Due to the transient and distributed nature of chat communications, important software related information and advice are lost over time. 2) Not easy to find information because (2) and (3) However, archiving information is beneficial only if we preserve knowledge that could be useful for future reference. TS: Let’s look at a couple of conversations to understand what we mean by archival quality information in chats.
  • #7 TA: This is a conversation where the questioner seeks help about automatically modifying the format of text in a file. The responder suggests a regular expression as the answer, and also explains why that would work. This Conversation is concise, contains details of the problem and the suggested solution, and shows indication of answer acceptance. Conciseness makes it easy to read and understand, and indicators of answer acceptance give the readers a sense of verification and confidence in the correctness of the information. TS: Now let’s look at another example.
  • #8 TA: This is a conversation where the questioner asks suggestions for how to implement a specific algorithm. This is a lengthy conversation, and this slide shows only a part of it due to space constraint. This conversation is lengthy, contains too much noise, and lacks relevant details about the problem. This makes it difficult to extract specific information for both software engineers and mining tools. TS: Click and continue
  • #9 TA: Therefore, conversation(a) is intuitively of good archival-quality, and thus a better conversation to archive than conversation(b).
  • #10 TA: Through this dissertation I aim to answer the RQs <>
  • #11 TA: This is an Outline of the 4 major RQs, top level ques and then all subproblems together. TS: I will explain each RQ in further detail as we proceed with the presentation.
  • #12 I will start with my preliminary research: RQ0.
  • #13 RQ0 is a part of our paper at MSR 2019. Here, we report an exploratory study to compare the content in Q \& A focused public chat communities (e.g. Slack) with Q \& A based discussion forums (e.g. Stack Overflow), to answer the SQs <>
  • #14 TA: To answer SQ1, we focused on information that has been commonly mined in other software artifacts. Specifically, we analyzed code snippets, links to code snippets, API mentions. We display the results primarily as box plots. <> TS: Next we move on to SQ2.
  • #15 TA: To answer SQ2, we focused on measures that could provide some insights into the form of conversations (participant count, questions with no answer, answer count) and measures that could indicate challenges in automation (rest) that suggest a need to filter. Results represented as percentages are reported directly, while other results, computed as simple counts, are reported as minimum < median < maximum. TS: Insights from this study motivate and guide my proposed research.
  • #16 TA: The first RQ I plan to solve is …which involves solving three research problems: (1) Disentangle developer chat conversations, (2) Identify properties of archival-quality information, and (3) Determine conversations containing archival-quality information. TS: First I will talk about the first RP of disentangling dev chats
  • #17 TA: The disentanglement problem has been studied before by researchers in the context of IRC and similar chat platforms. Elsner and Charniak proposed a supervised model (maximum-entropy classifier) based on a set of features such as <read> In a recent study at MSR20, we leveraged their technique to disentangle conversations on Slack. During the study, we observed that some Slack channels can become dormant for a few hours at a time and that participants can respond to each other with considerable delay. Therefore, we modified their algorithm to: <read>
  • #18 As a first step to assess quality of conversations, we need to identify the properties of archival worthy knowledge in chat forums. Since, there are no explicit built-in indicators of quality of information shared on chat forums, I plan to explore 1) how well adaptations of the quality indicators in archival-based Q&A forums (SO) can be used to identify the properties of archival-quality information in chat forums. 2) Additionally, I will explore a data-driven approach of analyzing Slack conversations to understand the characteristics of conversations containing archival-quality information. The properties of archival-quality information thus identified could potentially be used as features to build a machine learning based approach to automatically determine archival worthiness of a conversation.
  • #19 TA: The first step in evaluation will be to create a gold set of conversations with each conversation assigned a quality score. I plan to recruit human judges with prior experience in programming and using Slack, to participate in a study and create the gold set. Next, the evaluation study will focus on addressing the following evaluation question: <read> Results will be evaluated using precision, recall, and F-measure.
  • #20 Now we move on to the next RQ in the proposal, which is … Developer conversations can often contain extensive details, redundant information, and noise, along with archival-quality information. Redundant information is present even in shorter conversations. Let us look at such a short conversation of Python developers in the next slide.
  • #21 TA: Only 3 out of 5 utterances are highlighted, thereby suggesting that we could reduce the granularity of information to be archived from this conversation by almost 50%. Reducing the granularity of archival-quality information in chats could help in saving (a) archival space and (b) time and effort for mining specific information from the archive by both developers and mining tools.
  • #22 TA: To reduce granularity of…, I propose a strategy which involves 3 steps: (1) identify archival-quality information seeds, which would be utterances in a conversation that are directly related to an archival property through the structure, content or metadata of the utterance. (2) identify utterances related to the archival-quality information seeds that would be highly likely to also be containing archival-quality information, but not directly identifiable without the seed utterance. (3) Finally, design an approach to combine the seeds and context to identify archival-quality nuggets of information.
  • #23 TA: The same sample conversation that we saw before, would have the following the seeds and context.
  • #24 TA: The same sample conversation that we saw before, would have the following the seeds and context.
  • #25 TA: Similar to RQ1, The first step in evaluation will be to create a gold set of developer chats tagged with nuggets of archival- quality information. The next step will focus on addressing the evaluation question: <>: Precision, recall and F-measure will be used as evaluation measures. TS: This concludes my plan for investigating RQ2 in the proposed research.
  • #26 Now I discuss about RQ3, which is … TA: Borrowing from data triangulation used by qualitative researchers, it is possible to envision a system where Information in one social communication channel that complements or contradicts information in another channel is identified and used to provide feedback within the social communities. Thus, I plan to conduct a case study to explore this potential by comparing Slack archival quality information with another archival-based software artifact, specifically Q&A forums (e.g. Stack Overflow).
  • #27 To conduct this case study, I plan to curate a Slack-Stack Overflow comparison dataset, and perform a qualitative study. Through the qualitative study I plan to answer the questions: <> Understanding the different kinds of duplicate, complementary and contradictory information is important to understand the potential of improving the quality of shared information in developer communications.
  • #28 This concludes my discussion of the 4 RQs and the sub-problems that I investigate in my thesis.
  • #29 This dissertation aims to develop techniques to extract and archive useful information communicated in developers’ chat communication channels. The following list presents a summary of foreseen contributions: <>
  • #30 Thank you, this concludes my talk. I will be happy to answer questions now.