1. Using Advanced Analytics to Classify
Electrical-Network-Related Incidents
Jessie Nghiem – Energy Safe Victoria
2. Let’s start with who we are…
• Energy Safe Victoria (ESV)
• Data and Analytics team
• About myself
• Current: part-time Senior Data Scientist and full-time new mom
• Past
• Lead Data Scientist at MLC Life Insurance
• Postdoc researcher at RMIT
• Data Scientist at InfoCentric
• Teaching Associate at Monash
• PhD at Monash
3. What have we used AI/ML for?
• Audit electrical products being sold online
• Find correlations between weather conditions and the rate of network-related ground fire incidents
• Classify electrical-network-related incidents – case study for today
• And more…
4. The classic free text problem…
• Distribution businesses are required to report critical network incidents to ESV through OSIRIS,
our reporting platform
5. Let’s take a closer look at the data
• 5 years of data
• 3761 rows and 33 columns
• 3 important free text columns
• Description
• Causes
• Actions taken
• A labelled column of 15 incident categories
• We use the IDEAR utility for data exploration
First we build models using only text columns…
• First iteration pipeline: Incident data → Data extraction and cleansing → Preprocessing → Feature engineering → Model building → Model evaluation
• Preprocessing: tokenisation, stop-word removal, stemming and lemmatisation
• Feature engineering: TF-IDF (word/character/n-gram), word embeddings, count vectors
• Model building: Naïve Bayes, Linear Classifier, SVM, bagging models (Random Forest, Extreme Gradient Boosting)
7. And the winner is…
Algorithm Accuracy
Naive Bayes (Count Vector) 0.68
Naive Bayes (Word Level TF-IDF) 0.63
Naive Bayes (N-gram level TF-IDF) 0.63
Naive Bayes (Character level TF-IDF) 0.55
Linear Classifier (Count Vector) 0.76
Linear Classifier (Word Level TF-IDF) 0.77
Linear Classifier (N-gram level TF-IDF) 0.68
Linear Classifier (Character level TF-IDF) 0.75
Xgb Classifier (Count Vector) 0.76
Xgb Classifier (Word Level TF-IDF) 0.76
Xgb Classifier (Character level TF-IDF) 0.76
… …
The Linear Classifier (Logistic Regression) on the Word Level TF-IDF vectors is chosen, with the highest accuracy (0.77)
8. This time we throw non-free-text columns in…
• Second iteration pipeline: Incident data → Data extraction and cleansing → Preprocessing → Feature engineering → Model building → Model evaluation
• Feature engineering: TF-IDF (word level)
• Model building: Linear Classifier (text columns only) vs Linear Classifier (all columns)
• Accuracy reaches 80%
9. Now we have a good model, so what’s next?
• The output of the model has been fed into the data process for
internal reporting and other advanced analytics
• Work with Deakin University to improve the accuracy of the model
Hello everyone, thanks for giving me this opportunity to share our experience of applying advanced analytics to classify electrical-network-related incidents.
Energy Safe Victoria is a technical and safety regulator responsible for the safe generation, supply and use of electricity, gas and pipelines. When we talk about electricity, for example, that includes everything from generation, transmission and distribution to installation and equipment. We also license and register electricians, and issue and audit Certificates of Electrical Safety, which is your guarantee that electrical work has been performed by a qualified electrician.
We have a small, relatively newly established Data and Analytics team of five people. We extend our capability and resources by collaborating with CSIRO, universities (such as Monash, Deakin and UTS) and external contractors.
I am currently a part-time Senior Data Scientist, leading the Advanced Analytics programs and managing R&D funding at ESV, and also a full-time mom of an always fully charged toddler. The pandemic hasn't helped my childcare plans, but I am grateful that ESV has always been very supportive in giving me flexible working arrangements, so I can complete my work and enjoy more quality time with my little one.
Prior to this role, I worked for MLC Life Insurance, RMIT and InfoCentric. Before becoming a mom, I enjoyed spending my spare time teaching students at Monash, where I did my PhD in Computer Science.
Our AI/ML learning journey has been quite challenging due to the lack of data readiness for advanced analytics at ESV. However, instead of sitting and waiting for the perfect setting, we rolled our sleeves up and worked with what we have. A few projects have been started and well received by our internal customers.
We built a proof-of-concept AI solution to audit electrical products sold on eBay/Amazon, using Azure Cognitive Services. We have secured funding from the Senior Committee of Officials (SCO) of the Council of Australian Governments to develop this PoC into a full-scale product.
Another project is to find which weather factors have a strong influence on the number of ground fire incidents on a given day. This indicates whether the occurrence of an incident can be explained by the weather conditions or by poor performance of the distribution businesses, which can result in further investigation. We are also working with UTS to develop a predictive solution that anticipates the rate of incidents based on weather conditions and other geospatial factors.
For today's case study, I am going to share our experience in developing a text classifier that puts network-related incidents into the right bucket, a data-enrichment task for the project I mentioned earlier.
The classic free-text problem… I believe you can find it in any organisation. As background, distribution businesses are legally required to report critical network incidents to ESV through OSIRIS, our reporting platform. Each report goes through a review and approval process, and an investigation if needed. This dataset feeds into our operational reports, annual network performance report and other analysis. The incidents can be classified into 15 categories, based on their causes. However, the data input is mostly free text and can be filled in by non-technical people. The causal-factors columns are multi-valued, which makes it hard to identify the root cause of the problem. At the moment we have a specialist who goes through the incidents one by one to classify them into the right bucket. This poses an opportunity to use machine learning to automate and optimise the process.
Let’s take a closer look at the data. We have 5 years of data, since the platform was launched. It is not a really big dataset, as the number of incidents is small (fewer than 4K rows, with 33 columns), and we strive to make it even smaller for our community's safety. There are 3 free-text columns that our staff have used to classify the incidents into 15 categories based on their empirical knowledge. Incidents related to vehicles, connections and trees are the most common. To explore the data, we use the IDEAR utility, a tool written in R/R Markdown by Microsoft.
First, we built models that use only the text columns.
As with any ML project, 70-80% of the time was spent on data processing and cleansing. The pre-processing steps include tokenisation, stop-word removal, lemmatisation and stemming. This is followed by feature extraction/engineering to prepare for the different types of models we want to try on this dataset. In particular, the generated features are TF-IDF vectors (at word, character and n-gram level), word embeddings and count vectors. For each set of features, we trained different models, including Naïve Bayes, a linear classifier (logistic regression), a support vector machine, and bagging models such as Random Forest and Extreme Gradient Boosting, to see which one performs best in terms of accuracy.
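The pre-processing steps above (tokenisation, stop-word removal, stemming/lemmatisation) can be sketched roughly as follows. This is a minimal, dependency-free illustration: the stop-word list, the crude suffix-stripping rules (a stand-in for a real stemmer such as NLTK's PorterStemmer) and the sample sentence are all invented for illustration, not taken from the OSIRIS data.

```python
# Minimal sketch of the pre-processing stage: tokenisation, stop-word
# removal, and crude suffix stripping as a stand-in for stemming.
import re

# Tiny illustrative stop-word list (a real build would use a full one).
STOP_WORDS = {"the", "a", "an", "was", "by", "of", "to", "and", "in", "on"}


def tokenise(text: str) -> list[str]:
    # Lowercase and split on runs of non-alphabetic characters.
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]


def strip_suffix(token: str) -> str:
    # Crude stemming: drop a few common English suffixes.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    return [strip_suffix(t) for t in tokenise(text) if t not in STOP_WORDS]


print(preprocess("The conductor was damaged by falling trees"))
# -> ['conductor', 'damag', 'fall', 'tree']
```

The stemmed tokens are then what the count/TF-IDF vectorisers consume in the feature-engineering stage.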
And the winner is the Linear Classifier (Logistic Regression) on word-level TF-IDF vectors, with 77% accuracy. That means we found a good model, but it can be better…
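In scikit-learn terms, the winning setup (word-level TF-IDF features feeding a logistic-regression classifier) could be sketched roughly as below; the incident texts and labels are invented toy examples, not the real OSIRIS incident reports.

```python
# Rough sketch of the winning combination: word-level TF-IDF features
# feeding a logistic-regression (linear) classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Invented toy incident descriptions and category labels.
texts = [
    "vehicle hit pole causing outage",
    "car collided with power pole",
    "tree branch contacted conductor",
    "fallen tree brought down the line",
]
labels = ["vehicle", "vehicle", "tree", "tree"]

model = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="word")),  # word-level TF-IDF vectors
    ("clf", LogisticRegression(max_iter=1000)),   # the linear classifier
])
model.fit(texts, labels)

print(model.predict(["truck crashed into a pole"]))
```

In practice the real comparison would wrap this pipeline in cross-validation and swap in the other vectorisers and classifiers from the first iteration.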
This time we throw more columns in, in combination with the features extracted in the previous build. These columns do not have to be free text; they can contain categorical or numerical values, such as network type, voltage, and whether the incident was caused by technical, work-practice or environmental factors. We ran the logistic regression model again, and the accuracy reached 80%, making the model a lot more useful.
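One way to combine the word-level TF-IDF text features with the non-free-text columns is scikit-learn's ColumnTransformer, sketched below. The column names (`description`, `network_type`, `voltage_kv`) and all values are invented for illustration and are not the actual OSIRIS schema.

```python
# Sketch of the second iteration: TF-IDF on the free-text column combined
# with categorical and numerical columns via a ColumnTransformer.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Invented toy data; column names are illustrative only.
df = pd.DataFrame({
    "description": [
        "vehicle hit pole causing outage",
        "car collided with power pole",
        "tree branch contacted conductor",
        "fallen tree brought down the line",
    ],
    "network_type": ["distribution", "distribution", "transmission", "distribution"],
    "voltage_kv": [22, 22, 66, 66],
    "category": ["vehicle", "vehicle", "tree", "tree"],
})

features = ColumnTransformer([
    ("text", TfidfVectorizer(analyzer="word"), "description"),          # free-text column
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["network_type"]),  # categorical column
    ("num", "passthrough", ["voltage_kv"]),                             # numerical column
])

model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(df.drop(columns="category"), df["category"])

print(model.predict(df.drop(columns="category")))
```

The ColumnTransformer keeps the per-column encodings in one fitted object, so the same pipeline can score new incident reports end to end.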
Now we have a good model, so what is next?
- The output of the model has been fed into the data process for internal reporting and other advanced analytics.
- We are working with Deakin University to improve the accuracy of the model.
And that concludes my talk.
Thank you for attending my session. I hope you enjoyed it. If you have any questions or are interested in collaborating with us, please feel free to reach out. My LinkedIn profile and work email address are included here. Thank you again!