Modeling Reliability of Cloud Infrastructure Software
Washington Garcia, Florida Atlantic University
Supervisor: Dr. Theophilus Benson
Edmund T. Pratt School of Engineering, Duke University
Motivation
Challenges
In the past, researchers have manually annotated bugs
and conducted statistical analysis to learn what types of bugs
infest cloud infrastructure. However, manual annotation is
time-consuming. Unlike most classification problems, bug
classification requires domain knowledge and familiarity with
the codebase to be useful, so crowdsourced annotation through
services such as Amazon Mechanical Turk is out of the
question.
An added obstacle to bug classification is the nature of
bug descriptions themselves, which are often unstructured and
consist of natural language written by human developers.
Common natural language processing techniques can fall
short because, unlike news articles, bug descriptions contain
many typos, domain-specific synonyms, abbreviations, and
inconsistencies.
Method
1. Gunawi et al. 2014. What Bugs Live in the Cloud? A Study of 3000+
Issues in Cloud Systems. In Proceedings of the ACM Symposium on
Cloud Computing (SOCC '14). ACM, New York, NY, USA, Article
7, 14 pages. DOI: http://dx.doi.org/10.1145/2670979.2670986
2. A. Medem, M. I. Akodjenou, and R. Teixeira. 2009. TroubleMiner:
Mining Network Trouble Tickets. In IFIP/IEEE International
Symposium on Integrated Network Management-Workshops
(IM '09). New York, NY, USA, 113-119.
DOI: 10.1109/INMW.2009.5195946
3. Agnes Sandor, Nikolaos Lagos, Ngoc-Phuoc-An Vo, and Caroline
Brun. 2016. Identifying User Issues and Request Types in Forum
Question Posts Based on Discourse Analysis. In Proceedings of
the 25th International Conference Companion on World Wide
Web (WWW '16 Companion). Republic and Canton of Geneva,
Switzerland, 685-691.
DOI: http://dx.doi.org/10.1145/2872518.2890568
GitHub for this project:
https://github.com/w-garcia/BugClustering
Future Work
Since the system is mostly modular, different components can
be swapped out with better implementations to improve
performance. Some proposed additions include:
• The system currently takes around 15-30 minutes to
classify a dataset. This time can be reduced by injecting more
than one ticket at a time into step 2. However, the chance
of unlabeled tickets being grouped into the same cluster
becomes higher as more are injected into the dataset,
which makes the cluster less useful.
• Using discourse analysis [3] in place of the banned word
list, synonym list, and phrase filter. Useful words would
instead be found by identifying linguistic features
of each description.
• If proven accurate, our system can be used for the analysis
of bugs in systems such as OpenStack, Spark, Quagga,
ONOS, and OpenDaylight. Such analysis will provide
insight into the current state of cloud infrastructure and
how cloud development has shifted in the past few years.
References
Process
Initial Results
Extended analysis of the accuracy is planned in the future,
with preliminary results available for four systems:
Category accuracy is how often the system correctly predicted
a ticket’s Cloud Bug Study category (such as aspect, software,
or hardware). Class accuracy reflects its success at predicting
specific classes for tickets (such as a-consistency, sw-logic,
or hw-disk). The system performed best when predicting
categories using Cassandra’s large model of 1200+ tickets.
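The two metrics can be illustrated with a minimal sketch. It assumes that a ticket's category is the prefix of its class label before the first hyphen (for example, "sw" for sw-logic); the function names and the sample labels are hypothetical, "sw-example" is a made-up placeholder class, and none of the numbers are results from this project.

```python
def category_of(bug_class):
    """Return the category prefix of a class label (assumed convention)."""
    return bug_class.split("-", 1)[0]


def accuracies(predicted, actual):
    """Return (category accuracy, class accuracy) for parallel label lists."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and equal in length")
    pairs = list(zip(predicted, actual))
    # A class hit requires the full label to match.
    class_hits = sum(p == a for p, a in pairs)
    # A category hit only requires the prefix to match.
    category_hits = sum(category_of(p) == category_of(a) for p, a in pairs)
    return category_hits / len(pairs), class_hits / len(pairs)


# Hypothetical labels for illustration only; "sw-example" is a placeholder class.
actual = ["sw-logic", "hw-disk", "sw-example", "a-consistency"]
predicted = ["sw-logic", "hw-disk", "sw-logic", "hw-disk"]
category_accuracy, class_accuracy = accuracies(predicted, actual)
print(category_accuracy, class_accuracy)
```

Category accuracy can never be lower than class accuracy under this convention, because every correct class prediction is also a correct category prediction.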
A growing number of popular services use
cloud infrastructure because of its convenience, low cost, and
scalability. However, as more services turn to the cloud as a
means of storing and delivering data to consumers, the faults
of cloud infrastructure become more apparent. When cloud
infrastructure fails, the consequences can be disastrous, with
failures making national headlines. Popular services such as
Amazon, Dropbox, Netflix, and many social media sites all
rely on cloud computing at their core.
Although new cloud infrastructures have sprouted in
recent years, little is known about what types of bugs
they contain and how these bugs affect quality of service to
other components. We propose a system that can
automatically classify bug tickets using the natural language
descriptions provided by developers. This system allows
taxonomies of bugs to be built for new cloud infrastructures,
which can be used to shift development focus and help stop
failures before they happen.
The primary objective was to create a system that could
classify unlabeled cloud infrastructure bugs with minimal
human intervention. The problem boils down to
building a classifier that outputs a bug classification
(hardware, software, etc.) given an unlabeled bug description
as input. Thanks to past research in this field, there is a large
repository of classified bugs from previous cloud
infrastructures [1].
1. Input to our system consists of classified bug tickets taken from the Cloud Bug Study [1]. The
bundled bug descriptions are very short, so each ticket is pre-processed using the Python JIRA API
to find the full bug description that corresponds to the issue ID. Next, each ticket is passed through
a stemmer, which uses the NLTK library to strip descriptions down to only nouns and verbs and then
reduces each word to its stem. A banned word list, system synonyms list, and phrase filter remove
specific words that only add noise to the clustering. Finally, low-frequency words are filtered out.
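The filtering stage at the end of step 1 can be sketched as follows. This is a minimal illustration, not the project's code: it assumes tokens have already been reduced to noun and verb stems, and the word lists, the `min_count` threshold, and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical example lists; the project's real lists are not reproduced here.
BANNED_WORDS = {"thank", "pleas", "patch"}
SYNONYMS = {"c*": "cassandra"}


def filter_tokens(tickets, banned=BANNED_WORDS, synonyms=SYNONYMS, min_count=2):
    """Apply the synonym map, banned word list, and low-frequency filter.

    `tickets` maps an issue ID to a list of already-stemmed tokens.
    """
    # Map system-specific synonyms to one canonical form, then drop banned words.
    cleaned = {
        issue_id: [synonyms.get(tok, tok) for tok in tokens
                   if synonyms.get(tok, tok) not in banned]
        for issue_id, tokens in tickets.items()
    }
    # Count how often each word occurs across the whole dataset.
    counts = Counter(tok for tokens in cleaned.values() for tok in tokens)
    # Keep only words that occur at least `min_count` times.
    return {
        issue_id: [tok for tok in tokens if counts[tok] >= min_count]
        for issue_id, tokens in cleaned.items()
    }


tickets = {
    "CASSANDRA-1": ["node", "crash", "c*", "thank"],
    "CASSANDRA-2": ["node", "crash", "cassandra", "timeout"],
}
filtered = filter_tokens(tickets)
print(filtered)
```

In this toy input, "thank" is removed by the banned list, "c*" is merged into "cassandra", and "timeout" is dropped because it occurs only once.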
2. Each bug description is encoded as a
vector of keyword weights. At this point,
an unlabeled ticket is injected into the
dataset, and weights for each keyword are
computed. We use Document Frequency
(DF): the weight for any word k is the
number of tickets in the dataset that
contain k.
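The DF weighting in step 2 can be sketched in a few lines. The encoding below assumes that a word absent from a ticket gets weight 0 and that the vocabulary is kept in sorted order; both choices, and the function names, are assumptions made for illustration.

```python
from collections import Counter


def document_frequency(tickets):
    """Return {word: number of tickets that contain the word}."""
    df = Counter()
    for tokens in tickets:
        df.update(set(tokens))  # count each word once per ticket
    return df


def encode(tickets):
    """Encode each ticket as a vector of DF weights over a sorted vocabulary."""
    df = document_frequency(tickets)
    vocabulary = sorted(df)
    vectors = [
        [df[word] if word in tokens else 0 for word in vocabulary]
        for tokens in map(set, tickets)
    ]
    return vocabulary, vectors


# Two labeled tickets plus one injected unlabeled ticket (toy data).
labeled = [["node", "crash"], ["disk", "crash"]]
unlabeled = ["node", "timeout", "crash"]
vocabulary, vectors = encode(labeled + [unlabeled])
print(vocabulary)
print(vectors)
```

Because the unlabeled ticket is part of the dataset when the weights are computed, its words contribute to every DF value, which matches the injection order described in the step.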
3. An n × m matrix is created from the generated vectors,
where n is the number of tickets and m is the number of unique
words in the dataset. This matrix is passed as input to a hierarchical
agglomerative clustering algorithm provided by the Python library
SciPy. The output of the clustering algorithm is a binary tree. Each
parent is labeled by the intersection of its children’s keywords.
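A minimal sketch of step 3 using SciPy's `scipy.cluster.hierarchy` module is shown below. The poster does not state which linkage method or distance metric was used, so average linkage with Euclidean distance is an assumption here, as are the function names and the toy data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree


def cluster(vectors):
    """Run agglomerative clustering and return the root of the binary tree."""
    matrix = np.asarray(vectors, dtype=float)  # shape: (n tickets, m words)
    # Average linkage and Euclidean distance are assumed choices.
    merges = linkage(matrix, method="average", metric="euclidean")
    return to_tree(merges)


def label(node, keywords):
    """Label a node with the intersection of its children's keywords."""
    if node.is_leaf():
        # Leaf IDs are the row indices of the input matrix.
        return set(keywords[node.get_id()])
    return label(node.get_left(), keywords) & label(node.get_right(), keywords)


# Toy data: DF-weighted vectors over the vocabulary crash, disk, node, timeout.
keywords = [{"crash", "node"}, {"crash", "disk"}, {"crash", "node", "timeout"}]
vectors = [[3, 0, 2, 0], [3, 1, 0, 0], [3, 0, 2, 1]]
root = cluster(vectors)
root_label = label(root, keywords)
print(root_label)
```

The root's label is the set of keywords shared by every ticket, while deeper parents carry more specific labels because fewer tickets sit beneath them.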
4. The binary tree is collapsed to an n-ary
tree using a modified version of the
algorithm presented by Medem et al. [2].
The previously unlabeled ticket is
marked, and its parent label is used as
the classification.
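One way to picture step 4 is the sketch below. It is not the modified algorithm from [2], whose details are not given here; it assumes a simple collapse rule in which an internal child that carries the same label as its parent is replaced by its own children. The `Node` class and both function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    label: frozenset
    children: list = field(default_factory=list)
    ticket: Optional[str] = None  # set on leaves only


def collapse(node):
    """Return a tree in which same-label internal children are merged upward."""
    if not node.children:
        return node
    merged = []
    for child in map(collapse, node.children):
        if child.children and child.label == node.label:
            merged.extend(child.children)  # lift grandchildren up one level
        else:
            merged.append(child)
    return Node(node.label, merged, node.ticket)


def parent_label(node, ticket):
    """Return the label of the parent of the leaf holding `ticket`, or None."""
    for child in node.children:
        if child.ticket == ticket:
            return node.label
        found = parent_label(child, ticket)
        if found is not None:
            return found
    return None


# Toy binary tree: the inner node and the root share the label {"crash"}.
inner = Node(frozenset({"crash"}), [
    Node(frozenset({"crash", "node"}), ticket="A"),
    Node(frozenset({"crash", "disk"}), ticket="C"),
])
root = Node(frozenset({"crash"}), [
    inner,
    Node(frozenset({"crash", "timeout"}), ticket="D"),
])
tree = collapse(root)
print(len(tree.children), parent_label(tree, "D"))
```

After collapsing, the root has three children instead of two, and the marked ticket "D" takes its classification from the root's label.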
5. Steps 2-4 are repeated for each
unlabeled ticket in the desired
dataset, building a taxonomy of
classified bugs.
6. Steps 3 and 4 are repeated one more time with the
new taxonomy as input. An n-ary tree is generated,
giving a visual overview of the previously
unlabeled bugs and their predicted classifications.

An n x m matrix is created using every vector that is generated, where n is the amount of tickets, and m is the amount of unique words in the dataset. This matrix is passed as input to a hierarchical agglomerative clustering algorithm provided by the Python library SciPy. The output of the clustering algorithm is a binary tree. Each parent is labelled by the intersection of its children’s keywords. 4. The binary tree is collapsed to an n-ary tree using a modified version of the algorithm presented by Medem et al [2]. The previously unlabeled ticket is marked, and its parent label is used as the classification. 5. Steps 2-5 are repeated for each unlabeled ticket in the desired dataset, building a taxonomy of classified bugs. 6. Steps 3 and 4 are repeated one more time with the new taxonomy as input. An n-ary tree is generated, giving a visual overview of the previously unlabeled bugs and their predicted classifications.
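
As an illustration of steps 2 and 3 of the process, the following is a minimal Python sketch and not the project's actual code. It assumes the tickets have already been reduced to stemmed keywords; the example tickets and the function names (document_frequency, encode, label_tree) are hypothetical. The poster does not say what weight an absent word receives or which linkage and distance SciPy is called with, so the sketch assumes a weight of 0 for absent words, average linkage, and Euclidean distance. The n-ary collapse of step 4 is not shown.

```python
from collections import Counter

import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree


def document_frequency(tickets):
    """Map each keyword k to the number of tickets that contain k."""
    df = Counter()
    for words in tickets:
        df.update(set(words))  # count each word once per ticket
    return df


def encode(tickets):
    """Build the n x m matrix: one row per ticket, one column per unique word."""
    df = document_frequency(tickets)
    vocab = sorted(df)
    column = {word: j for j, word in enumerate(vocab)}
    matrix = np.zeros((len(tickets), len(vocab)))
    for i, words in enumerate(tickets):
        for word in set(words):
            # Assumption: a present word gets its DF weight, an absent word 0.
            matrix[i, column[word]] = df[word]
    return matrix, vocab


def label_tree(node, tickets, labels):
    """Label every node with the intersection of its children's keywords."""
    if node.is_leaf():
        keywords = set(tickets[node.get_id()])
    else:
        keywords = (label_tree(node.get_left(), tickets, labels)
                    & label_tree(node.get_right(), tickets, labels))
    labels[node.get_id()] = keywords
    return keywords


# Hypothetical, already pre-processed tickets (stemmed keywords only).
tickets = [
    ["disk", "fail", "replica"],
    ["disk", "fail", "node"],
    ["lock", "race", "thread"],
    ["lock", "race", "timeout"],
]

matrix, vocab = encode(tickets)
# Assumption: average linkage with Euclidean distance.
root = to_tree(linkage(matrix, method="average", metric="euclidean"))
labels = {}
label_tree(root, tickets, labels)

for node_id in sorted(labels):
    print(node_id, sorted(labels[node_id]))
```

In this sketch, leaf nodes carry the ids 0 to n-1 of the input tickets and internal nodes receive higher ids, so the label of an injected ticket's parent can be looked up in labels once its parent node is known.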