
[DevDay2019] How do I test AI models? - By Minh Hoang, Senior QA Engineer at KMS

  1. How I Test AI Model. DevDay 2019, April 06, 2019
  2. I'm Minh Hoang: a tester, a member of the Technology team, fond of new technology, and a challenge-taker.
  3. Objectives: share the tools used and the key metrics in AI testing, and show how to evaluate an AI model.
  4. Agenda
     1. About A.I: what is machine learning • myths & facts about AI • myths & facts about chatbots
     2. How I test the AI model: the right metrics for evaluating the ML model • how we test the FAQ model • demo
     3. Take away
     4. References: tools & libraries
  5. About A.I
  6. What Is Machine Learning? Machine learning is the subfield of computer science that gives computers the ability to learn without being explicitly programmed.
  7. Myths And Facts About A.I
     • Myth: Artificial intelligence and machine learning will wipe out all the jobs. Fact: A.I is no different from other technological advances in that it helps humans become more effective and processes more efficient.
     • Myth: "Cognitive AI" technologies are able to understand and solve new problems the way the human brain can. Fact: "Cognitive" technologies can't solve problems they weren't designed to solve.
     • Myth: You need a Ph.D. to work in machine learning and data science. Fact: Nowadays, many documents and tutorials on the Internet can guide people step by step into the machine learning world.
  8. What Is a Chatbot? A computer program designed to simulate conversation with human users, especially over the Internet.
  9. Myths And Facts About Chatbots
     • Myth: Chatbots have only been around for a short while. Fact: ELIZA, one of the best-known chatbot therapists, was created about 50 years ago.
     • Myth: Text or voice is the only way to interact with bots. Fact: Chatbot platforms also let users interact through graphical interfaces or graphical widgets, and recent chatbot platforms follow this development approach.
     • Myth: All chatbot platforms use AI. Fact: Not all chatbot platforms use AI. Most are rule-based and follow a simple, automated process, something along the lines of a decision tree.
  10. How We Test The AI Model
  11. The Right Metric For Evaluating ML Models
     • Regression: MSPE, MSAE, R Square, Adjusted R Square
     • Classification: Precision-Recall, ROC-AUC, Accuracy, Log-Loss
     • Unsupervised models: Rand Index, Mutual Information
     • Others: CV Error, heuristic methods to find K, BLEU Score (NLP)
  12. Commonly Used Metrics In Classification: Confusion Matrix
                            Actual positive                   Actual negative
       Predicted positive   True positive                     False positive (Type I error)
       Predicted negative   False negative (Type II error)    True negative
  13. Commonly Used Metrics In Classification: Accuracy
     • Percentage of total items classified correctly
     • Formula: (TP + TN) / (TP + TN + FP + FN)
  14. Commonly Used Metrics In Classification: Recall / Sensitivity / TPR (True Positive Rate)
     • Number of items correctly identified as positive out of all actual positives
     • Formula: TP / (TP + FN)
  15. Commonly Used Metrics In Classification: Precision
     • Number of items correctly identified as positive out of all items identified as positive
     • Formula: TP / (TP + FP)
  16. Commonly Used Metrics In Classification: F1 Score
     • The harmonic mean of precision and recall
     • Formula: F1 = 2 × Precision × Recall / (Precision + Recall)
       Precision   Recall   F1
       1           1        1
       0.1         0.1      0.1
       0.5         0.5      0.5
       1           0.1      0.182
       0.3         0.8      0.436
       0.8         0.3      0.436
  17. What Is The FAQ Model?
  18. The Process To Test The FAQ Model
     • Prepare test data: crawl FAQ data; generate questions from FAQ data
     • Run test: train the model with FAQ data; run the test
     • Analyze result: pre-process the raw result; calculate classification metrics to evaluate the AI model; visualize the metrics
     • Model result: select the threshold value
  19. How We Define The Test Data Set
     • Collect FAQ question data (manually and automatically)
     • Use NLTK to generate new question data (NLG)
     • Self-defined question data
  20. How We Evaluate The AI Model: train with domain X and run the test defined for domain X.
  21. How We Analyze The Result
     • Pre-process the raw result.
     • Calculate classification metrics to evaluate the AI model.
     • Visualize the metrics.
  22. Demo
  23. Take Away
  24. Take Away
     • Know the main metrics for evaluating an ML model.
     • Know how to test a classification AI model.
     • It is up to your self-learning skills and adaptability to decide whether working on ___ projects (AI, blockchain, VR, etc.) is difficult.
     • Use automation to reduce the time and effort needed to prepare test data.
  25. Tools & Libraries
  26. Tools & Libraries
     • API: requests and Postman.
     • AI/ML: nltk, difflib, plot.ly, pandas and numpy.
  27. Question & Answer
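
The classification metrics on slides 12 to 16 can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the talk: the `ConfusionMatrix` class, the `f1_score` helper, and the example counts are all hypothetical. The final loop recomputes F1 for the precision/recall pairs listed on slide 16.

```python
from dataclasses import dataclass


@dataclass
class ConfusionMatrix:
    """Counts from a binary confusion matrix (hypothetical helper)."""
    tp: int  # true positives
    fp: int  # false positives (Type I errors)
    fn: int  # false negatives (Type II errors)
    tn: int  # true negatives

    def accuracy(self) -> float:
        # Share of all items classified correctly.
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total if total else 0.0

    def precision(self) -> float:
        # Correct positives out of everything predicted positive.
        predicted_positive = self.tp + self.fp
        return self.tp / predicted_positive if predicted_positive else 0.0

    def recall(self) -> float:
        # Correct positives out of everything actually positive.
        actual_positive = self.tp + self.fn
        return self.tp / actual_positive if actual_positive else 0.0


def f1_score(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall; 0.0 when both are zero.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up counts, for illustration only.
cm = ConfusionMatrix(tp=30, fp=10, fn=20, tn=40)
print("accuracy ", cm.accuracy())
print("precision", cm.precision())
print("recall   ", cm.recall())
print("f1       ", round(f1_score(cm.precision(), cm.recall()), 3))

# Precision/recall pairs taken from the F1 table on slide 16.
for p, r in [(1, 1), (0.1, 0.1), (0.5, 0.5), (1, 0.1), (0.3, 0.8), (0.8, 0.3)]:
    print(p, r, round(f1_score(p, r), 3))
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why a model with perfect precision but 0.1 recall still scores poorly.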

Editor's Notes

  • Artificial intelligence and machine learning will wipe out all the jobs:
    Technology has been threatening jobs and displacing jobs throughout history. Telephone switching technology replaced human operators. Automatic call directors replaced receptionists. Word processing and voicemail replaced secretaries, email replaced inter-office couriers. Call center technology innovation has added efficiency and effectiveness at various stages of standing up customer service capabilities—from recruiting new reps using machine learning to screen resumes, to selecting the right training program based on specific learning styles, to call routing based on sentiment of the caller and disposition of the rep, to integration of various information sources and channels of communication. In each of these processes, technology augmentation enhanced the capabilities of humans. Were some jobs replaced? Perhaps, but more jobs were created, albeit requiring different skills.

    The use of AI-driven chatbots and virtual assistants is another iteration of this ongoing evolution. It needs to be thought of as augmentation rather than complete automation and replacement. Humans engage, machines simplify. There will always be the need for humans in the loop to interact with humans at some level.

    Bots and digital workers will enable the “super CSR” of the future and enable increasing levels of service with declining costs. At the same time, the information complexity of our world is increasing and prompting the need for human judgment. Some jobs will be lost, but the need and desire for human interaction at critical decision points will increase, and the CSR’s role will change from answering rote questions to providing better customer service at a higher level, especially for interactions requiring emotional engagement and judgment.


    “Cognitive AI” technologies are able to understand and solve new problems the way the human brain can:

    Cognitive AI simulates how a human might deal with ambiguity and nuance; however, we are a long way from AI that can extend learning to new problem areas. AI is only as good as the data on which it is trained, and humans still need to define the scenarios and use cases under which it will operate. Within those scenarios, cognitive AI offers significant value, but AI cannot define new scenarios in which it can successfully operate. This capability is referred to as “general AI” and there is much debate about when, if ever, it will emerge. For computers to answer broad questions and approach problems the way that humans do will require technological breakthroughs that are not yet on the horizon.

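  • RMSE (Root Mean Square Error) is the square root of the average squared difference between the predicted values and the observed values.
    MAE (Mean Absolute Error) is the average of the absolute difference between the predicted values and the observed values.
    BLEU (Bilingual Evaluation Understudy) scores machine-generated text by comparing it with reference text.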


  • Recall or Sensitivity or TPR (True Positive Rate): number of items correctly identified as positive out of all actual positives, TP/(TP+FN).

    Specificity or TNR (True Negative Rate): number of items correctly identified as negative out of all actual negatives, TN/(TN+FP).

    Precision: number of items correctly identified as positive out of all items identified as positive, TP/(TP+FP).

    False Positive Rate (Type I error rate): number of items wrongly identified as positive out of all actual negatives, FP/(FP+TN).

    False Negative Rate (Type II error rate): number of items wrongly identified as negative out of all actual positives, FN/(FN+TP).
  • Recall or Sensitivity or TPR (True Positive Rate): number of items correctly identified as positive out of all actual positives, TP/(TP+FN). In other words, the share of actual positives that the model predicts correctly.

  • Precision: number of items correctly identified as positive out of all items identified as positive, TP/(TP+FP). In other words, how accurate the model is when it classifies an item as positive.

  • Precision reflects how accurate the model's positive classifications are; recall reflects how many actual positives it finds, so low recall means positives are being missed. The six rows of the precision/recall/F1 table can be read as six models:

    Model 1: ideal.
    Model 2: poor, because few of its positive predictions are correct and it also misses most of the positives.

    Model 3: balanced.
    Model 4: every positive prediction is correct, but few of the positives are found. Example: the data set has 100 positives, but the model flags only 1 item as positive, and that item really is positive.

    Model 5: few of the positive predictions are correct, but most of the positives are found. Example: the data set has 100 positives and the model finds 80 of them, but only about 30% of the items it labels positive are actually positive.

    Model 6: most of the positive predictions are correct, but few of the positives are found. Example: the data set has 100 positives; the model flags 30 items as positive, and 20 of them are actually positive.
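
The precision/recall trade-off in the model examples above is what the "select the threshold value" step on slide 18 has to balance. The sketch below shows one way to explore it, assuming the model returns a confidence score for each prediction (the slides do not say how the FAQ model reports results). The `SCORED` data and the `precision_recall_at` helper are hypothetical.

```python
from typing import List, Tuple

# Hypothetical test results: (confidence score, item is actually positive).
SCORED: List[Tuple[float, bool]] = [
    (0.95, True),
    (0.90, True),
    (0.80, False),
    (0.70, True),
    (0.60, True),
    (0.40, False),
    (0.30, True),
    (0.20, False),
]


def precision_recall_at(
    threshold: float, scored: List[Tuple[float, bool]]
) -> Tuple[float, float]:
    """Treat every item scoring at or above the threshold as predicted positive."""
    tp = sum(1 for score, positive in scored if score >= threshold and positive)
    fp = sum(1 for score, positive in scored if score >= threshold and not positive)
    fn = sum(1 for score, positive in scored if score < threshold and positive)
    # Assumption: report 0.0 when a ratio is undefined (nothing predicted or no positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for threshold in (0.9, 0.5, 0.1):
    p, r = precision_recall_at(threshold, SCORED)
    print(f"threshold={threshold:.1f} precision={p:.3f} recall={r:.3f}")
```

With these made-up scores, raising the threshold from 0.1 to 0.9 lifts precision from 0.625 to 1.0 while recall drops from 1.0 to 0.4, so the chosen threshold depends on which kind of error matters more for the product.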
