Extracting Archival-Quality Information from Software-Related Chats

Preetha Chatterjee
Extracting Archival-Quality Information
from Software-Related Chats

Use of online developer chat services is increasing

User1: hi guys, i was trying to write a dataframe to txt files
and managed to write 11 files successfully. Halfway
through, I encountered this error:
User2: Hello guys, how can I write this in python so it will
work as in console? I'm helpless
User3: @User1 Don't reinvent the wheel... have a look at
the `smtplib` library for sending mail.
@User2 The `"a"` means you're opening an existing file for
append. Does it exist? You also seem to be mixing forward
and backward slashes... better practice is to use
`os.path.join` to be OS-independent.
Chats Contain Valuable Knowledge
The functionality of the code
Indication of an error in the code
API functionalities
Cause of the error presented
earlier
Good programming practice

Similar Knowledge has been Mined from Other
Software Artifacts to Improve Software Tools
4
Emails and bug reports:
• Re-documenting source code [Panichella 2012]
• Recommending mentors in software projects [Canfora 2012]
Tutorials: API learning [Jiang 2017, Petrosyan 2015]
Q & A forums:
• IDE recommendation [Rahman 2014, Cordeiro 2012, Ponzanelli 2014, Bacchelli 2012]
• Learning and recommendation of API [Chen 2016, Rahman 2016, Wang 2013]
• Automatic generation of comments [Wong 2013, Rahman 2015]
• Building thesaurus of software-specific terms [Tian 2014, Chen 2017]

Developer Chats as a Knowledge Source
Why archive information from chats?
• Transient content
• Informal and unstructured nature
• Noise and useful information
Archiving information is beneficial only if we can preserve
knowledge that could be useful for future reference, and thus
serve as a good resource for software engineers and/or mining
tools. 5
Previous analyses of chats
• Learn developer behaviors [Elliot 2003, Shihab 2009, Yu 2011, Lin 2016]
• Extract rationales [Alkadhi 2017, 2018], feature requests [ Shi 2020]
• Build Chatbots [Lebeuf 2017, Paikari 2018, Wood 2018]

Developer Conversation - 1
6
Author Utterance
Alexia Hi, I have a file with following contents 1234 alphabet /vag/one/arun > 1454 bigdata
/home/two/ogra > 5684 apple /vinay/three/dire, , but i want the output to be like 1234 alphabet
one > 1454 bigdata two > 5684 apple three
Elaina sed −r ’s|(.+)/[ˆ/]+/([ˆ/]+)/.+|12|g’
Corina Even though I dont have anything to do with this question, could you explain the logic behind the
answer? The formatting sentence seem so random
Elaina ‘sed -r‘ is an extended mode, so that + is enabled (matches one or more characters, unlike * that
matches zero or more); s///g or s|||g or any symbol instead of | is how a basic replacing expression
is constructed. The first field is what to match, the second is what to replace it with.;(.+)
/[ˆ/]+/([ˆ/]+) /.+ (.+)/ matches anything from the start until the first / and puts found characters in
the first group (1); [ˆ/]+/ matches anything that is not a slash, and then a slash (‘vag/‘ or ‘home/‘);
([ˆ/]+)/ matches the same thing, but puts the stuff found in-between slashes in the second group 2;
and then .+ matches whatever comes next to the end of line; and the second field tells sed to
replace the line with 12, so our saved groups side-by-side: the first group was everything before
the first slash, and the second group was the stuff between 2nd and 3rd slashes
Corina Ah ok, thanks a lot for the explanation!

Developer Conversation - 2
7
Author Utterance
Cody Hello guys I got a huge problem
Holli Cody: ask away
Cody We’ve been ask as assignment the implementation of Dijkstra’s and Bellman Ford’s algorithm for
calculating the shortest path in a given graph
Holli So what’s the issue?; run into a problem?
Cody I don’t really know how to start and that’s my problem
… ….
Darrin Cody: how much experience do you have writing code?; for example, there are quite a number of
existing examples of the algorithms you’re talking about
Cody basic i’m just starting
Darrin ok; can you describe the steps on how you execute the algorithm?; and do you understand why those
steps are necessary?; if so, then the next step you take is translating your written description of the
process into pseudocode; once you have a reasonable sequence of actions, you then implement the
pseudocode in your language of choice; frankly, the first two items are always the most difficult;
because it requires you to understand the problem domain; once you understand it, making it work is
usually much less effort
Rachel Cody: oof graph theory for a beginner. do you understand how those algorithms work, ?
Cody Yes I understand how those work
Darrin just having trouble translating described steps to code?

Comparing Developer Conversations of Varying Quality
8
Author Utterance
Cody Hello guys I got a huge problem
Holli Cody: ask away
Cody We’ve been ask as assignment the implementation of Dijkstra’s and Bellman
Ford’s algorithm for calculating the shortest path in a given graph
Holli So what’s the issue?; run into a problem?
Cody I don’t really know how to start and that’s my problem
… ….
Darrin Cody: how much experience do you have writing code?; for example, there are
quite a number of existing examples of the algorithms you’re talking about
Cody basic i’m just starting
Darrin ok; can you describe the steps on how you execute the algorithm?; and do you
understand why those steps are necessary?; if so, then the next step you take is
translating your written description of the process into pseudocode; once you
have a reasonable sequence of actions, you then implement the pseudocode in
your language of choice; frankly, the first two items are always the most
difficult; because it requires you to understand the problem domain; once you
understand it, making it work is usually much less effort
Rachel Cody: oof graph theory for a beginner. do you understand how those algorithms
work, ?
Cody Yes I understand how those work
Darrin just having trouble translating described steps to code?
Author Utterance
Alexia Hi, I have a file with following contents 1234 alphabet /vag/one/arun
> 1454 bigdata /home/two/ogra > 5684 apple /vinay/three/dire, , but
i want the output to be like 1234 alphabet one > 1454 bigdata two >
5684 apple three
Elaina sed −r ’s|(.+)/[ˆ/]+/([ˆ/]+)/.+|12|g’
Corina Even though I dont have anything to do with this question, could you
explain the logic behind the answer? The formatting sentence seem
so random
Elaina ‘sed -r‘ is an extended mode, so that + is enabled (matches one or
more characters, unlike * that matches zero or more); s///g or s|||g
or any symbol instead of | is how a basic replacing expression is
constructed. The first field is what to match, the second is what to
replace it with.;(.+) /[ˆ/]+/([ˆ/]+) /.+ (.+)/ matches anything from the
start until the first / and puts found characters in the first group (1);
[ˆ/]+/ matches anything that is not a slash, and then a slash (‘vag/‘ or
‘home/‘); ([ˆ/]+)/ matches the same thing, but puts the stuff found
in-between slashes in the second group 2; and then .+ matches
whatever comes next to the end of line; and the second field tells sed
to replace the line with 12, so our saved groups side-by-side: the
first group was everything before the first slash, and the second
group was the stuff between 2nd and 3rd slashes
Good Archival-Quality Conversation Poor Archival-Quality Conversation

Focus of this Dissertation
9
• RQ0: How much archival-quality information exists in
developer chats?
• RQ1: How accurately can we automatically identify
conversations containing archival-quality information?
• RQ2: How much can we accurately reduce the granularity of
archival-quality information?
• RQ3: Case Study: How do extracted archival-quality
information from Slack chats compare to related Stack
Overflow posts?

10
Extracting Archival-Quality Information from Software-Related Chats
RQ0: How much
archival-quality
information
exists in
developer chats?
RQ2: How much can
we accurately reduce
the granularity of
archival-quality
information?
RQ3: Case Study: How
extracted archival-
quality information
from Slack compare
to Stack Overflow?
RQ1: How accurately
can we automatically
identify conversations
containing archival-
quality information?
Investigate
prevalence of
archival-quality
information
Investigate
characteristics
that inhibit
automatic mining
Disentangle developer
chat conversations
Determine archival-
quality conversations
Identify properties of
archival-quality info
Identify seeds of
Expand seeds with
necessary context
Determine nuggets of
Investigate duplicate
information
Investigate
complementary
information
Developer chat
history
Chat
Archive
Investigate contradictory
information

11
RQ0: How much
archival-quality
information
exists in
developer chats?
RQ2: How much can
the granularity of
archival-quality
information?
extracted archival-
quality information
from Slack compare to
Stack Overflow?
RQ1: How accurately
Investigate
prevalence of
archival-quality
information
Investigate
characteristics
that inhibit
automatic mining
chat conversations
Determine archival-
Identify seeds of
Expand seeds with
necessary context
Developer chat
history
Chat
Archive
information
Investigate
complementary
information
information

Study Questions
12
How prevalent was the information that has been
successfully mined from Stack Overflow Q&A forum to
support software engineering tools in developer Q&A chats
such as Slack?
Did Slack Q&A chats have characteristics that might inhibit
automatic mining of information to support software
engineering tools?
RQ0: How much archival-quality information exists in developer chats?
P. Chatterjee, K. Damevski, L. Pollock, V. Augustine, and N. A. Kraft, " Exploratory Study of
Slack Q&A Chats as a Mining Source for Software Engineering Tools," MSR 2019

Comparing Prevalence of information
13
Q & A
Conversations
Q & A
Posts
Similar Intent of Learning and Sharing
Information among Developers
Much of the information mined from Stack Overflow is available on Slack channels.
API mentions are available in larger quantities on Slack Q&A channels.
Links are rarely available on both Slack and Stack Overflow Q&A.

Challenges in Automatic Mining of Chats
14
Measure Results
Participant frequency 1 < 2 < 34
Questions with no answer 15.75%
Answer frequency 0<1<5
Questions with no accepted answer 52.25%
NL text context per code snippet 0 < 2 < 13
Incomplete sentences 12.63%
Noise in document 10.5%
Knowledge construction
61.5% crowd; 38.5%
participatory
Accepted answers are available in conversations, but require more effort to discern.
Low percentage of noise and incomplete sentences.

15
RQ0: How much
archival-quality
information exist
in developer
chats?
RQ2: How much can
the granularity of
archival-quality
information?
extracted archival-
quality information
Stack Overflow?
RQ1: How accurately
Investigate
prevalence of
archival-quality
information
Investigate
characteristics
that inhibit
automatic mining
chat conversations
Determine archival-
Identify seeds of
Expand seeds with
necessary context
Developer chat
history
Chat
Archive
information
Investigate
complementary
information
information

Author Time Utterance Utterance Type
Clara 12:51 pm Hi guys. How can we delete all line breaks from .docx file? I’m using
python-docx library. In docx - I store some Jinja2 template, which later
I’m rendering with some data.
Question
Cyrus 1:32 pm Not sure if I should ask here or job board, I wanted to expand my github
and use it as a portfolio of sorts, are there certain types of projects that
are good to have in there to show my comptency?
Question
Judith 1:49 pm no one has real time to browse through your repo i would think. So if
you want a position that uses django/react then do a project that does
so. If you’re trying to get into scraping, do a scraping project etc
Answer
Judith 1:53 pm Clara: yes just open the file and remove all the line breaks. They are
essentially the special character ‘n‘
Answer
Task1: Disentangle developer chat conversations
16
Elsner and Charniak 2008
• timestamp
• author information
• occurrence of similar words
• cue words (e.g., hello, yes, no)
• technical jargon
Micro-averaged F-measure: 0.66
Chatterjee et al. 2020
• larger window of messages
• last five messages regardless of
elapsed time
• emoji or code blocks
• technical words
Micro-averaged F-measure : 0.80
RQ1: How accurately can we automatically identify conversations
containing archival-quality information?
P. Chatterjee, K. Damevski,
N. A. Kraft , L. Pollock,
"Software-related Slack
Chats with Disentangled
Conversations," MSR 2020

Task2: Properties of Archival-Quality Information
17
Properties of archival-
worthy knowledge
Feature Extraction
Supervised
machine learning
ML algorithms
Determine archival
worthiness
Classification

Evaluation for RQ1
18
MEASURES: PRECISION,
RECALL AND F-MEASURE
EQ. How effective is my approach in
determining archival worthy conversations?
Creation of Gold Set - Human Judges

19
RQ0: How much
archival-quality
information exist
in developer
chats?
RQ2: How much can
the granularity of
archival-quality
information?
extracted archival-
quality information
Stack Overflow?
RQ1: How accurately
Investigate
prevalence of
archival-quality
information
Investigate
characteristics
that inhibit
automatic mining
chat conversations
Determine archival-
Identify seeds of
Expand seeds with
necessary context
Developer chat
history
Chat
Archive
information
Investigate
complementary
information
information

Author Utterance
Elaina sed −r ’s|(.+)/[ˆ/]+/([ˆ/]+)/.+|12|g’
is constructed. The first field is what to match, the second is what to replace it with.;(.+)
/[ˆ/]+/([ˆ/]+) /.+ (.+)/ matches anything from the start until the first / and puts found characters in
the first group (1); [ˆ/]+/ matches anything that is not a slash, and then a slash (‘vag/‘ or ‘home/‘);
([ˆ/]+)/ matches the same thing, but puts the stuff found in-between slashes in the second group 2;
and then .+ matches whatever comes next to the end of line; and the second field tells sed to
replace the line with 12, so our saved groups side-by-side: the first group was everything before
the first slash, and the second group was the stuff between 2nd and 3rd slashes
An Example
20
RQ2: How much can we accurately reduce the granularity of archival-

Reduce Granularity of Archival Quality Information
21
Identify seeds of
Seeds
Expand seeds with
necessary context
Context
Seeds + Context =
Nuggets

Author Utterance
Elaina sed −r ’s|(.+)/[ˆ/]+/([ˆ/]+)/.+|12|g’
is constructed. The first field is what to match, the second is what to replace it with.;(.+) /[ˆ/]+/([ˆ/]+)
/.+ (.+)/ matches anything from the start until the first / and puts found characters in the first group
(1); [ˆ/]+/ matches anything that is not a slash, and then a slash (‘vag/‘ or ‘home/‘); ([ˆ/]+)/ matches
the same thing, but puts the stuff found in-between slashes in the second group 2; and then .+
matches whatever comes next to the end of line; and the second field tells sed to replace the line
with 12, so our saved groups side-by-side: the first group was everything before the first slash, and
the second group was the stuff between 2nd and 3rd slashes
Example Revisited
22
Seed
Seed
Context

Author Utterance
Elaina sed −r ’s|(.+)/[ˆ/]+/([ˆ/]+)/.+|12|g’
is constructed. The first field is what to match, the second is what to replace it with.;(.+) /[ˆ/]+/([ˆ/]+)
/.+ (.+)/ matches anything from the start until the first / and puts found characters in the first group
(1); [ˆ/]+/ matches anything that is not a slash, and then a slash (‘vag/‘ or ‘home/‘); ([ˆ/]+)/ matches
the same thing, but puts the stuff found in-between slashes in the second group 2; and then .+
matches whatever comes next to the end of line; and the second field tells sed to replace the line
with 12, so our saved groups side-by-side: the first group was everything before the first slash, and
the second group was the stuff between 2nd and 3rd slashes
Example Revisited
23

Evaluation for RQ2
24
MEASURES: PRECISION,
RECALL AND F-MEASURE
EQ. How accurate is my approach in reducing
granularity of archival-quality information in
developer chats?
Creation of Gold Set - Human Judges

25
RQ0: How much
archival-quality
information exist
in developer
chats?
RQ2: How much can
the granularity of
archival-quality
information?
extracted archival-
quality information
from Slack compare
to Stack Overflow?
RQ1: How accurately
Investigate
prevalence of
archival-quality
information
Investigate
characteristics
that inhibit
automatic mining
chat conversations
Determine archival-
Identify seeds of
Expand seeds with
necessary context
Developer chat
history
Chat
Archive
information
Investigate
complementary
information
information

Case Study Questions
26
CSQ1. How much information in the Slack archive and Stack Overflow is
duplicate? What kinds of information is duplicate?
CSQ2. How much information in the Slack archive and Stack Overflow is
complementary? What kinds of information is complementary?
CSQ3. Can we determine if information in the Slack archive and Stack
Overflow contradict each other?
QUALITATIVE
ANALYSIS
RQ3: Case Study: How extracted archival-quality information from Slack
compare to Stack Overflow?

27
RQ0: How much
archival-quality
information exist
in developer
chats?
RQ2: How much can
the granularity of
archival-quality
information?
extracted archival-
quality information
from Slack compare
to Stack Overflow?
RQ1: How accurately
Investigate
prevalence of
archival-quality
information
Investigate
characteristics
that inhibit
automatic mining
chat conversations
Determine archival-
Identify seeds of
Expand seeds with
necessary context
information
Investigate
complementary
information
Developer chat
history
Chat
Archive
information

Foreseen Dissertation Contributions
28
A technique to automatically disentangle conversations in software
developer chats.
Identification of properties of archival quality information in developer
chat communications.
An approach based on text processing and machine learning to extract
archival quality information in chat conversations.
A knowledge archive of developer conversations, suitable for research
beyond this project.
A case study to compare the archived information from developer chat
conversations to another archival-based software artifact, Q&A forums.

29
preethac@udel.edu
@PreethaChatterj
P. Chatterjee, M. A. Nishi, K. Damevski, V. Augustine, L.
Pollock and N. A. Kraft, "What information about code
snippets is available in different software-related
documents? An exploratory study," SANER 2017
P. Chatterjee, K. Damevski, L. Pollock, V. Augustine, and N. A.
Kraft, " Exploratory Study of Slack Q&A Chats as a Mining
Source for Software Engineering Tools," MSR 2019
Questions?
Extracting Archival Knowledge from Software Related Chats
Technique to disentangle conversations.
Identification of properties of archival quality information.
Approach to extract archival quality information in chat conversations.
Knowledge archive of developer conversations.
Case study to compare archived information.
Disclaimer: Images used in these slides are used for educational
purposes only. Images are copyright to owners.
P. Chatterjee, K. Damevski, N. A. Kraft, and L. Pollock,
"Software-related Slack Chats with Disentangled
Conversations ," MSR 2020
Preprint:
https://tinyurl.com
/usu9n2k

Extracting Archival-Quality Information from Software-Related Chats

More Related Content

What's hot

Similar to Extracting Archival-Quality Information from Software-Related Chats

More from Preetha Chatterjee

Recently uploaded

Extracting Archival-Quality Information from Software-Related Chats

Editor's Notes