"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
vodQA Bangalore 2019 - Orange
1. Squeeze your test suite using Orange
https://github.com/SudhaNadchal
sudhashettyn@gmail.com
Sudha Nadchal
2. The problem
• Same verification steps in several test cases
• Execution of redundant test cases may take several days
• Need to minimize time and cost of execution and maintenance
• Bulky test suites in legacy applications
4. What is Orange?
Open-source visual programming toolkit
Supports:
• Data visualization
• Machine learning
• Data mining
• Already in use in bioinformatics, space research, image analytics, geology, etc.
15. Limitations
Clusters can be hard to interpret if the corpus is too large
Spelling errors and other inconsistencies
Stemmers are not always accurate:
‘Caring’ -> Lemmatization -> ‘Care’
‘Caring’ -> Stemming -> ‘Car’
Redundancy is the repetition of the same content across different test cases.
As a solution to these problems, I propose a text-mining-based approach using an open-source tool called Orange that anyone can use.
This solution provides a visual representation of test case pairs in a test suite that are identical or functionally related. That output can be used to eliminate duplicate test cases and to functionally merge related ones, making for a leaner test suite.
A capability must be established to save the test cases from a test suite or test plan folder of interest as individual text files in a folder; this folder serves as input to the model. If a test case contains attachments, those need to be ignored. If the test cases are managed in HP ALM, this can be done using ALM's REST API.
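As a minimal sketch of that export step, the snippet below writes one text file per test case so the folder can be fed to Orange's "Import Documents" widget. It assumes the test cases have already been exported to a CSV with hypothetical id, name, and steps columns; with HP ALM, the same files could instead be produced by pulling test entities over its REST API.

```python
# Minimal sketch: write each test case as an individual .txt file so the
# folder can be fed to Orange's "Import Documents" widget.
# Assumes a prior export (e.g., from HP ALM) to a CSV with the
# hypothetical columns: id, name, steps.
import csv
from pathlib import Path

EXPORT_CSV = "test_cases.csv"   # hypothetical export file
OUT_DIR = Path("test_cases")    # folder Orange will import

OUT_DIR.mkdir(exist_ok=True)

with open(EXPORT_CSV, newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # One document per test case; attachments are simply not exported,
        # which satisfies the "ignore attachments" requirement.
        out_file = OUT_DIR / f"{row['id']}_{row['name']}.txt"
        out_file.write_text(f"{row['name']}\n{row['steps']}", encoding="utf-8")
```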
Download Orange. By default, Orange doesn't come with text-mining features; we need to install the "Text" add-on. Once the add-on is installed, we are good to start. We get a canvas where one can drag and drop components, aka widgets. This is how the workflow looks. I'm using the "Import Documents" widget to import the test case folder of choice.
Once the documents are imported, the Word Cloud and Corpus Viewer widgets are used to visualize the text. A corpus is a collection of documents. The Word Cloud widget displays word frequencies: the larger a word appears in the cloud, the more frequently it occurs in the corpus.
In fact, when we view the corpus without any preprocessing, the word cloud displays noise such as punctuation and uninformative words.
We'll use the Preprocess Text widget to get rid of these.
This widget carries out a series of operations on the corpus.
Lemmatization is the process of converting a word to its base form. The difference between stemming and lemmatization is that lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just removes the last few characters, often leading to incorrect meanings and spelling errors.
For example, lemmatization would correctly reduce ‘caring’ to its base form ‘care’, whereas stemming would cut off the ‘ing’ part and convert it to ‘car’.
‘Caring’ -> Lemmatization -> ‘Care’
‘Caring’ -> Stemming -> ‘Car’
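To see the difference outside the GUI, here is a small sketch using NLTK (which Orange's Text add-on builds on for these normalizers). Note that the choice of stemmer matters: the gentler Porter stemmer actually maps ‘caring’ to ‘care’, while the more aggressive Lancaster stemmer reproduces the ‘car’ clipping shown above.

```python
# Stemming vs. lemmatization on the slide's example word.
# Requires: pip install nltk, plus nltk.download('wordnet') for the lemmatizer.
from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer

word = "caring"

print(PorterStemmer().stem(word))     # 'care' -- Porter stems fairly gently
print(LancasterStemmer().stem(word))  # 'car'  -- aggressive clipping, as on the slide
# The lemmatizer needs the part of speech to find the meaningful base form.
print(WordNetLemmatizer().lemmatize(word, pos="v"))  # 'care'
```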
Normalization:
• Stemmer – Porter, Snowball
• Lemmatizer – WordNet
Filtering – Stop words
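Outside the GUI, the same preprocessing chain (lowercasing, tokenization, stop-word filtering, normalization) can be approximated in a few lines of NLTK; the test-case sentence below is hypothetical.

```python
# Approximation of Orange's "Preprocess Text" widget with NLTK:
# lowercase -> tokenize -> stop-word filter -> lemmatize.
# Requires: nltk.download('stopwords') and nltk.download('wordnet')
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOP = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> list[str]:
    tokens = re.findall(r"[a-z]+", text.lower())  # tokenize, drop punctuation
    return [lemmatizer.lemmatize(t, pos="v") for t in tokens if t not in STOP]

print(preprocess("Verify that the user is logged in and the cart is empty"))
# ['verify', 'user', 'log', 'cart', 'empty']
```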
BoW (bag of words) is a way of extracting features from test cases. BoW is only concerned with whether known words occur in a test case, not where in the test case they occur.
The intuition is that test cases are similar or related if they have similar content.
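As a minimal illustration of BoW (this sketch uses scikit-learn rather than Orange's widget; the two test-case texts are hypothetical):

```python
# Bag of words: each test case becomes a vector of word counts,
# discarding word order entirely.
from sklearn.feature_extraction.text import CountVectorizer

test_cases = [
    "verify user can log in with valid credentials",
    "verify user cannot log in with invalid credentials",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(test_cases)

print(vectorizer.get_feature_names_out())
print(bow.toarray())  # one row per test case, one column per known word
```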
Term frequency (tf) is basically the output of the bag of words. For a specific test case, it determines how important a word is by looking at how frequently it appears in that test case: term frequency measures the local importance of the word. If a word appears many times, the word is presumably important. For example, if our document is “I am a cat lover. I have a cat named Steve. I feed a cat outside my room regularly,” the words with the highest frequency are “I”, “a”, and “cat”. This agrees with our intuition that high term frequency means higher importance, since the document is all about my fascination with cats.
The second component of tf-idf is inverse document frequency (idf). For a word to be considered a signature word of a document, it shouldn't appear that often in the other documents. Thus, a signature word's document frequency must be low, meaning its inverse document frequency must be high.
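A short scikit-learn sketch makes the two components concrete; the three test-case texts are hypothetical.

```python
# tf-idf: term frequency scaled by inverse document frequency, so words
# that appear in every test case (low idf) are down-weighted.
from sklearn.feature_extraction.text import TfidfVectorizer

test_cases = [
    "verify user can log in with valid credentials",
    "verify user cannot log in with invalid credentials",
    "verify cart total updates when an item is added",
]

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(test_cases)

# With scikit-learn's default smoothing, idf(t) = ln((1 + n) / (1 + df(t))) + 1.
# 'verify' occurs in all three test cases, so its idf (and weight) is low;
# 'cart' is unique to the third test case, so its weight there is high.
for word in ("verify", "cart"):
    print(word, tfidf.idf_[tfidf.vocabulary_[word]])
```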
Euclidean distance is the ordinary straight-line distance between two points in space.
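Applied to tf-idf vectors, Euclidean distance gives a concrete similarity measure between test cases; a sketch with hypothetical test cases, again using scikit-learn:

```python
# Euclidean distance between tf-idf vectors: the closer two test cases are,
# the more similar their content -- which is what the clustering step exploits.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

test_cases = [
    "verify user can log in with valid credentials",
    "verify user cannot log in with invalid credentials",
    "verify cart total updates when an item is added",
]

matrix = TfidfVectorizer().fit_transform(test_cases)
dist = euclidean_distances(matrix)

print(dist.round(2))
# Rows/columns are test cases; the two log-in cases sit much closer to each
# other than to the cart case, flagging them as candidates for merging.
```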
Clusters can be hard to interpret if the corpus is too large. Spelling errors and other inconsistencies can result in inaccuracies. Stemmers are not always accurate:
‘Caring’ -> Lemmatization -> ‘Care’
‘Caring’ -> Stemming -> ‘Car’