Modeling Reliability of Cloud Infrastructure Software
Washington Garcia, Florida Atlantic University
Supervisor: Dr. Theophilus Benson
Edmund T. Pratt School of Engineering, Duke University
Challenges
In the past, researchers have manually annotated bugs
and conducted statistical analysis to learn what types of bugs
infest cloud infrastructure. However, manual annotation is
time-consuming. Unlike most classification problems, bug
classification requires domain knowledge and familiarity with
the code base to be useful, so crowdsourced annotation through
services such as Amazon Mechanical Turk is out of the question.
An added obstacle to bug classification is the nature of
bug descriptions themselves, which are often unstructured
natural language written by developers. Common natural
language processing techniques can fall short because, unlike
news articles, bug descriptions contain many typos,
domain-specific synonyms, abbreviations, and inconsistencies.
References
1. Gunawi et al. 2014. What Bugs Live in the Cloud? A Study of 3000+
Issues in Cloud Systems. In Proceedings of the ACM Symposium on
Cloud Computing (SoCC '14). ACM, New York, NY, USA, Article 7,
14 pages. DOI: https://doi.org/10.1145/2670979.2670986
2. A. Medem, M. I. Akodjenou, and R. Teixeira. 2009. TroubleMiner:
Mining Network Trouble Tickets. In IFIP/IEEE International
Symposium on Integrated Network Management Workshops (IM '09).
New York, NY, 113-119. DOI: https://doi.org/10.1109/INMW.2009.5195946
3. Agnes Sandor, Nikolaos Lagos, Ngoc-Phuoc-An Vo, and Caroline
Brun. 2016. Identifying User Issues and Request Types in Forum
Question Posts Based on Discourse Analysis. In Proceedings of
the 25th International Conference Companion on World Wide
Web (WWW '16 Companion). Republic and Canton of Geneva,
Switzerland, 685-691. DOI: https://doi.org/10.1145/2872518.2890568
GitHub for this project:
https://github.com/w-garcia/BugClustering
Future Work
Since the system is mostly modular, different components can
be swapped out with better implementations to improve
performance. Some proposed additions include:
• The system currently takes around 15-30 minutes to
classify a dataset. This could be sped up by injecting more
than one ticket at a time into step 2. However, the chance
of unlabeled tickets being clustered together rises as more
are injected into the dataset, which makes the resulting
clusters less useful.
• Using discourse analysis [3] to replace the banned word
list, synonym list, and phrase filter. Instead, useful words
would be found by identifying linguistic features of each
description.
• If proven accurate, our system can be used to analyze
bugs in systems such as OpenStack, Spark, Quagga,
ONOS, and OpenDaylight. Such analysis would provide
insight into the current state of cloud infrastructure and
how cloud development has shifted in the past few years.
Initial Results
Extended analysis of accuracy is planned; preliminary results
are available for four systems. Category accuracy is how often
the system correctly predicted a ticket's Cloud Bug Study
category (such as aspect, software, or hardware). Class accuracy
reflects its success at predicting specific classes for tickets
(such as a-consistency, sw-logic, or hw-disk). The system
performed best when predicting categories using Cassandra's
large model of 1200+ tickets.
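As a concrete illustration of the two metrics (not the project's actual evaluation code), the sketch below computes category and class accuracy from paired prediction and label lists; the ticket labels are invented examples.

    # Hypothetical sketch of the two metrics; the predictions and
    # labels below are invented for illustration, not real results.
    def category_of(label):
        # Cloud Bug Study classes carry a category prefix,
        # e.g. "sw-logic" falls under the software ("sw") category.
        return label.split("-")[0]

    def accuracies(predicted, actual):
        n = len(actual)
        class_hits = sum(p == a for p, a in zip(predicted, actual))
        category_hits = sum(category_of(p) == category_of(a)
                            for p, a in zip(predicted, actual))
        return category_hits / n, class_hits / n

    predicted = ["sw-logic", "hw-disk", "a-consistency"]
    actual = ["sw-logic", "hw-memory", "a-consistency"]
    print(accuracies(predicted, actual))  # (1.0, 0.666...): category vs. class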
Motivation
An increasing number of popular services rely on
cloud infrastructure for its convenience, low cost, and
scalability. However, as more services turn to the cloud to
store and deliver data to consumers, the faults
of cloud infrastructure become more apparent. When cloud
infrastructure fails, the consequences are disastrous, with
failures making national headlines. Popular services such as
Amazon, Dropbox, Netflix, and many social media sites all
rely on cloud computing at their core.
Although new cloud infrastructures have sprouted in
recent years, little is known about what types of bugs
they contain and how these bugs affect the quality of service
delivered to other components. We propose a system that
automatically classifies bug tickets using the natural language
descriptions provided by developers. This system allows
taxonomies of bugs to be built for new cloud infrastructures,
which can be used to shift development focus and help stop
failures before they happen.
Method
The primary objective was to create a system that can
classify unlabeled cloud infrastructure bugs with minimal
human intervention. The problem boils down to building a
classifier that outputs bug classifications (hardware, software,
etc.) given an unlabeled bug description as input. Thanks to
past research in this field, there is a large repository of
classified bugs from previous cloud infrastructures [1].
Process
1. Input to our system consists of classified bug tickets taken from the Cloud Bug Study [1]. The
bundled bug descriptions are very short, so each ticket is pre-processed using the Python JIRA API
to find the full bug description that corresponds to the issue ID. Next, each ticket is passed through
a stemmer, which uses the NLTK library to strip descriptions down to only nouns and verbs, then
reduce each word to its stem. A banned word list, system synonym list, and phrase filter remove
specific words that only add noise to the clustering. Finally, low-frequency words are filtered out.
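A minimal sketch of this preprocessing stage using NLTK; the sample description and banned-word list are placeholders, and the real pipeline also applies the synonym list and phrase filter:

    # Sketch of step 1's noun/verb filtering and stemming with NLTK.
    # Requires nltk.download("punkt") and
    # nltk.download("averaged_perceptron_tagger").
    import nltk
    from nltk.stem import PorterStemmer

    BANNED = {"use", "run"}  # stand-in for the project's banned word list
    stemmer = PorterStemmer()

    def preprocess(description):
        tokens = nltk.word_tokenize(description.lower())
        tagged = nltk.pos_tag(tokens)
        # Keep only nouns (NN*) and verbs (VB*), then stem each survivor.
        kept = [word for word, tag in tagged if tag.startswith(("NN", "VB"))]
        stems = [stemmer.stem(word) for word in kept]
        return [s for s in stems if s not in BANNED]

    print(preprocess("Replica nodes crash when log compaction runs"))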
2. Each bug description is encoded as a
vector of keyword weights. At this point,
an unlabeled ticket is injected into the
dataset, and a weight is computed for each
keyword. We use Document Frequency
(DF): the weight of any word k is the
number of tickets in the dataset in which
k appears.
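A sketch of the DF weighting, assuming each ticket has already been reduced to a list of stems as in step 1 (the sample tickets are invented):

    # Document Frequency weighting: the weight of word k is the number
    # of tickets whose description contains k.
    from collections import Counter

    def df_weights(tickets):
        df = Counter()
        for words in tickets:
            df.update(set(words))  # count each ticket at most once per word
        return df

    def to_vector(words, vocabulary, df):
        # One ticket becomes a vector of DF weights; absent words weigh 0.
        present = set(words)
        return [df[k] if k in present else 0 for k in vocabulary]

    tickets = [["replica", "crash"], ["crash", "log"], ["log", "compact"]]
    df = df_weights(tickets)
    vocabulary = sorted(df)  # fixed keyword order shared by every vector
    matrix = [to_vector(t, vocabulary, df) for t in tickets]
    print(matrix)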
3. An n × m matrix is created from the generated vectors,
where n is the number of tickets and m is the number of unique
words in the dataset. This matrix is passed as input to a hierarchical
agglomerative clustering algorithm provided by the Python library
SciPy. The output of the clustering algorithm is a binary tree. Each
parent is labeled with the intersection of its children's keywords.
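A minimal example of this clustering step with SciPy; the random matrix stands in for the real n × m keyword-weight matrix, and the "average" linkage method is an assumption:

    # Hierarchical agglomerative clustering of the ticket vectors.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, to_tree

    rng = np.random.default_rng(0)
    X = rng.random((10, 25))  # placeholder: 10 tickets, 25 unique words

    Z = linkage(X, method="average")  # linkage criterion is an assumption
    root = to_tree(Z)  # binary tree of ClusterNode objects

    # Leaves are individual tickets; each internal node merges two clusters.
    print(root.get_count(), "tickets under the root")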
4. The binary tree is collapsed to an n-ary
tree using a modified version of the
algorithm presented by Medem et al. [2].
The previously unlabeled ticket is
marked, and its parent's label is used as
the classification.
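The project follows Medem et al. [2] here; the sketch below only illustrates the general idea of collapsing, absorbing a child into its parent when their keyword labels overlap strongly. The threshold and node layout are assumptions, not the paper's exact algorithm:

    # Simplified illustration of collapsing a binary tree into an n-ary
    # tree; NOT Medem et al.'s exact algorithm.
    THRESHOLD = 0.8  # assumed keyword-overlap ratio for merging

    class Node:
        def __init__(self, keywords, children=()):
            self.keywords = set(keywords)
            self.children = list(children)

    def collapse(node):
        merged = []
        for child in node.children:
            collapse(child)
            overlap = (len(child.keywords & node.keywords) / len(child.keywords)
                       if child.keywords else 0.0)
            if child.children and overlap >= THRESHOLD:
                merged.extend(child.children)  # lift grandchildren up a level
            else:
                merged.append(child)
        node.children = merged
        return node

    root = collapse(Node({"crash"},
                         [Node({"crash"}, [Node({"crash", "disk"}),
                                           Node({"crash", "log"})])]))
    print(len(root.children))  # 2: the redundant intermediate node is gone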
5. Steps 2-4 are repeated for each
unlabeled ticket in the desired
dataset, building a taxonomy of
classified bugs.
6. Steps 3 and 4 are repeated one more time with the
new taxonomy as input. An n-ary tree is generated,
giving a visual overview of the previously
unlabeled bugs and their predicted classifications.
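Tying the steps together, a driver loop might look like the runnable sketch below. classify_one() is a deliberately trivial stand-in for steps 2-4 (DF encoding, clustering, and tree collapse): it just borrows the label of the most word-similar known ticket, so only the loop structure reflects the real system.

    # Hypothetical end-to-end loop over unlabeled tickets (step 5).
    def classify_one(taxonomy, ticket):
        # Placeholder for steps 2-4: nearest neighbor by word overlap.
        def overlap(a, b):
            return len(set(a["words"]) & set(b["words"]))
        nearest = max(taxonomy, key=lambda known: overlap(known, ticket))
        return nearest["label"]

    taxonomy = [{"words": ["disk", "fail"], "label": "hw-disk"},
                {"words": ["logic", "loop"], "label": "sw-logic"}]
    unlabeled = [{"words": ["disk", "crash"], "label": None}]

    for ticket in unlabeled:      # one ticket at a time, as in step 5
        ticket["label"] = classify_one(taxonomy, ticket)
        taxonomy.append(ticket)   # the taxonomy of classified bugs grows
    print(unlabeled[0]["label"])  # -> hw-disk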
