AnswerBot
Introduction
• Avkash Chauhan, H2O.ai
o Head of enterprise products and customers
o @avkashchauhan | https://www.linkedin.com/in/avkashchauhan
• Products
o H2O
o Sparkling Water
o Deep Water
• NN – Tensorflow, mxnet, Caffe
• GPU
• xgboost - Distributed
What is an AnswerBot?
• An AnswerBot is a standalone intelligent application
• AnswerBot uses machine learning to respond to user input
• Provide relevant knowledge base articles as answers
• Self-service customer base
• Raises awareness of knowledge base offerings
• Generate product feedback silently
AnswerBot – Client Interface
AnswerBot – Result Interface
[Screenshot: two "Possible Answers" panels, each listing candidate answers with match scores (60%, 42%) and a "More.." link per answer]
AnswerBot – Administrator Interface
[Screenshot: administrator view of a scored question with these fields]
• Question
• Tags
• Sentiment: Positive / Negative
• Priority: Low / Medium / High / Critical
• Sex: Male / Female
• Ratings
• NSFW: 3
• Top (n) Answers, shown as pairs such as 728 35%, 728 27%, 718 17%, 800 13%, 128 3%
AnswerBot in production - Teaser
[Diagram: an ML pipeline prototype to get the top N matching answers. Components shown:]
• Community, Stackoverflow, Reddit, Quora, Slack
• Bot
• AWS API Gateway
• AWS Lambda (Question Scoring)
• S3, DynamoDB
• AWS SQS, AWS SNS
• Scoring Pipeline
• Model Preparation Process
• Model Production
• Support Portal
Problems to solve
• Finding proper tags
• Finding & Removing NSFW words
• Sentiment in the question (positive or negative)
• Priority to find the answer (Low, medium, high, critical)
• Can we figure out if the questioner is male or female?
• Question rating (How was the question written?)
• Finding the best available answers
• Duplicate Questions
Problems to solve – Solutions (Part 1)
1. Finding proper tags:
1. Word embeddings
2. Matching words
2. Finding & Removing NSFW words
1. Brute Force Search
2. NLTK Stop Words
3. Sentiment in the question: (Positive or Negative)
1. Binomial (2 classes) classification
1. Tree Based Algorithms (GBM/RF/DRF) or NN
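The brute-force search and stop-word removal named above can be sketched in a few lines of plain Python. This is only an illustration: `BLOCKLIST` and `STOP_WORDS` are tiny hypothetical placeholders (a real system would load a curated blocklist and a full stop-word list such as the one NLTK ships), and `clean_question` is a made-up helper name, not part of the talk's code.

```python
import re

# Hypothetical placeholder lists, chosen only for this illustration.
BLOCKLIST = {"darn", "heck"}
STOP_WORDS = {"a", "an", "the", "is", "this", "to", "of"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def clean_question(text):
    """Return (kept_tokens, nsfw_count) using a brute-force set lookup."""
    kept, nsfw_count = [], 0
    for token in tokenize(text):
        if token in BLOCKLIST:
            nsfw_count += 1          # count the word, then drop it
        elif token not in STOP_WORDS:
            kept.append(token)       # keep only informative words
    return kept, nsfw_count


tokens, flagged = clean_question("This darn cluster is slow to start")
print(tokens, flagged)
```

The count of flagged words is returned separately so that it could feed something like the NSFW field of the administrator interface.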
Problems to solve – Solutions (Part 2)
1. Priority to find the answer (Low/Medium/High/Critical)
1. Multinomial (4 classes) classification
1. Tree based algorithms (GBM/RF/DRF) or NN
2. Can we figure out if the questioner is male or female?
1. Binomial (2 classes) classification
1. Tree based algorithms (GBM/RF/DRF) or NN
3. Question rating (How the question was written?)
1. Multinomial (N class – 1-5 star) classification
1. Tree Based algorithms or NN
Problems to solve – Solutions (Part 3)
1. Finding the best available answers
1. Looking for the tags and keywords – Clustering / Reduction
2. Creating tag & keyword weights for each question
3. Matching tags, keywords and their weights to find the top probabilities
2. Duplicate Questions
1. Quora posed the same problem as a Kaggle competition
1. https://www.kaggle.com/c/quora-question-pairs/data
2. https://www.kaggle.com/anokas/data-analysis-xgboost-starter-0-35460-lb
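One possible reading of the tag and keyword weight matching step is sketched below. The scoring rule (a product of question and answer weights over shared tags, normalised to sum to 1) is an assumption made for illustration, as are the answer IDs and weights; the slides do not specify the actual formula.

```python
def score_answers(question_weights, answers):
    """Score each answer by the weighted overlap of its tags with the question."""
    raw = {}
    for answer_id, answer_weights in answers.items():
        shared = set(question_weights) & set(answer_weights)
        raw[answer_id] = sum(
            question_weights[tag] * answer_weights[tag] for tag in shared
        )
    total = sum(raw.values())
    if total == 0:
        return {answer_id: 0.0 for answer_id in raw}
    # Normalise so the scores can be read as rough probabilities.
    return {answer_id: score / total for answer_id, score in raw.items()}


def top_n(question_weights, answers, n=3):
    """Return the n best (answer_id, score) pairs, highest score first."""
    scores = score_answers(question_weights, answers)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]


# Hypothetical tag weights for one question and three knowledge base articles.
question = {"gpu": 0.6, "install": 0.4}
answers = {
    "kb-101": {"gpu": 1.0, "driver": 0.5},
    "kb-202": {"install": 1.0, "python": 0.5},
    "kb-303": {"billing": 1.0},
}
ranked = top_n(question, answers, n=2)
print(ranked)
```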
Data Preparation
• Real Data
o Real Question/Answers
• StackOverflow, Community, Quora, Support System
• Experimental Data
o Yelp – 41M reviews in 1-5 stars category - Supervised
• Ratings: 1-5
o Twitter Sentiment – Search it OR Mine It - Supervised
• Positive/Negative
• Male/Female
Our Experimentation Today
• Classifying sentences to predict
o Ratings: Stars (1-5)
• Multinomial classification example
o Sentiments: Positive or Negative
• Binomial classification example
Demo
• Binomial & Multinomial Classification
$ python PredictNow.py
Why Keras?
• High-level API (Python) that runs on top of Tensorflow & Theano
• Great for quick experimentation
• Supports both CNN and RNN, and a combination of the two
• Runs on CPU & GPU
• Visit: https://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html
Word2vec
• Word2vec is a neural-network-based word embedding method.
• A neural network with only 1 linear hidden layer
o The hidden layer transforms the inputs into something that the output layer can use.
o Each hidden unit has a linear activation
• Represents words in a continuous, low-dimensional vector space (i.e., the embedding space)
o Semantically similar words are mapped to nearby points.
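"Nearby points" is usually measured with cosine similarity between word vectors. The sketch below shows the idea on three made-up 3-dimensional vectors; real embeddings have far more dimensions, and the numbers here are illustrative assumptions, not trained values.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, invented for this example.
EMBEDDINGS = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "slow": [-0.1, 0.9, 0.3],
}


def nearest(word):
    """Return the other word whose vector is closest to the given word."""
    others = (w for w in EMBEDDINGS if w != word)
    return max(
        others,
        key=lambda w: cosine_similarity(EMBEDDINGS[word], EMBEDDINGS[w]),
    )


print(nearest("good"))
```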
Understanding Dataset
• Ratings Analysis
o review,stars
o The food is WAAAAY overpriced and totally not worth it, they charged for the salsa and the service was ridiculously slow....The
guacamole was good though., 2
o Decent food at a great price. Unfortunately, the place is so jam packed it's almost an inconvenience to head back to the buffet
lines., 2
o Love getting my haircut here! It's only $25 for a women's haircut. I'm pretty picky about how much my hair is layered and I've
never had a problem here. Make sure to call in to schedule your appointment ahead of time during the school year because she's
usually booked two days in advance., 5
• Sentiment Analysis
o Text, Sentiment
o I lost $80 today I know I shouldn't put things in my back pocket but I was about to put in my bag when I realized it was gone., 0
o Just got back from Seattle. Lots of crowds. Nordstrom was nuts. But Taphouse Grill was practically empty. Found hardcover of
Mad Love!, 1
o Crunch week! This Friday, I'll be heading to Oddmall, my first major craft fair, in Hudson, Ohio! I'm tricking out the website., 1
o Another beautiful day out today!! Going to build some models first then go for running!! 1
o Tired. Just tired. Home time!! I'm weaksauce, I know , 0
Components & Experimentation
• Keras
• Tensorflow
o GPU
• NLTK
o Using Stop Words
• GloVe
o Pre-trained word vectors
o Small (400K words)
• Python
• Jupyter notebook
Experimentation – Part 1
1. Data Preparation
2. Creating word collection
1. Removing stop words
2. Collecting all words into a big list
3. Tokenization and uniform data collection
1. Using full words collection
2. Get unique words in our collection
3. Tokenize at sentence level
4. Final Dataset
1. Sentences [sentences_per_record, length] - X
2. Labels [label_per_record, length] – Y
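The steps above can be sketched without any library: build a word index from the collected words, turn each sentence into a fixed-length row of integer IDs (X), and one-hot encode the labels (Y). The helper names, the left padding with 0, and `max_len=4` are assumptions for this illustration; Keras offers its own preprocessing utilities for the same job.

```python
from collections import Counter


def build_word_index(sentences):
    """Map each unique word to an integer ID; 0 is reserved for padding."""
    counts = Counter(word for s in sentences for word in s.split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}


def to_padded_sequences(sentences, word_index, max_len):
    """Convert sentences to equal-length rows of word IDs (left-padded)."""
    rows = []
    for sentence in sentences:
        ids = [word_index[w] for w in sentence.split() if w in word_index]
        ids = ids[:max_len]                       # truncate long sentences
        rows.append([0] * (max_len - len(ids)) + ids)
    return rows


def one_hot(labels, num_classes):
    """Encode integer labels as one-hot rows."""
    return [[1 if i == label else 0 for i in range(num_classes)]
            for label in labels]


# Sentences are assumed to be lowercased with stop words already removed.
sentences = ["decent food great price", "service slow", "great food"]
labels = [1, 0, 1]

word_index = build_word_index(sentences)
X = to_padded_sequences(sentences, word_index, max_len=4)
Y = one_hot(labels, num_classes=2)
print(X)
print(Y)
```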
Experimentation – Part 2
4. Splitting dataset to training and validation
5. Creating Embedding Matrix
o Loading predefined word vector
o Finding matching words from our collection and creating the embedding matrix
6. Creating Embedding Layer/Configuration
7. Training
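Steps 4 and 5 can be sketched as follows. The code parses GloVe-style text lines (a word followed by its vector components), copies the vectors for words found in our collection into a matrix indexed by word ID, and splits records into training and validation sets. The in-memory `fake_glove` data, the 2-dimensional vectors, and the 80/20 split are assumptions for the example. In Keras, a matrix like this is commonly passed as the initial weights of an `Embedding` layer.

```python
import io
import random


def load_vectors(handle):
    """Parse GloVe-style lines: a word followed by its float components."""
    vectors = {}
    for line in handle:
        parts = line.split()
        if parts:
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector for word ID i; unknown words stay all zeros."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix


def split_train_validation(rows, validation_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split off a validation set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Stand-in for a pre-trained vector file (made-up 2-dimensional values).
fake_glove = io.StringIO("food 0.1 0.2\ngreat 0.3 0.4\n")
vectors = load_vectors(fake_glove)

word_index = {"food": 1, "great": 2, "weaksauce": 3}
matrix = build_embedding_matrix(word_index, vectors, dim=2)
train, val = split_train_validation(range(10))
print(matrix)
print(len(train), len(val))
```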
Experimentation – Part 3
8. Understanding results
o Layers connection
o Model configuration
o Model weights
9. Saving model configuration, weights, data-model
o HDF5 is a data model, library, and file format for
storing and managing data
Experimentation – Part 4
10. Model Metrics and Performance
o Getting Model Metrics
o Model Performance Graph
o Model Accuracy
• Training
• Validation
11. Prediction
o Validation Data
o User Input
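As a rough illustration of the prediction and accuracy steps, the sketch below takes per-class probabilities (the kind of output a softmax classifier produces), picks the most likely class for each record, and compares the result with known labels. The probability values and labels are invented for the example.

```python
def predict_labels(probabilities):
    """Pick the index of the highest probability in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical model output for three validation records and three classes.
probabilities = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]
actual = [1, 0, 1]

predicted = predict_labels(probabilities)
print(predicted, accuracy(predicted, actual))
```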
What if you hit the exact same prediction every time?
• Bad model: it could simply be a bad model. Retrain it.
• Rebalance your dataset:
o Either upsample the less frequent class
o Or downsample the more frequent one.
• Adjust class weights: setting a higher class weight for the less frequent class makes the network pay more attention to that class during training
• Increase the training time: after longer training, the network may start concentrating more on the less frequent classes.
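The rebalancing and class-weight options can be sketched as below. The weight formula (total records divided by number of classes times class count) is one common "balanced" heuristic, chosen here as an assumption; the slides do not say which weighting was used. A dictionary of this shape is what Keras accepts through the `class_weight` argument of `fit`.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def upsample(rows, labels, seed=0):
    """Duplicate random rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, label in zip(rows, labels) if label == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical imbalanced dataset: 8 positive records, 2 negative.
labels = [1] * 8 + [0] * 2
rows = [f"record-{i}" for i in range(10)]

weights = balanced_class_weights(labels)
new_rows, new_labels = upsample(rows, labels)
print(weights)
print(Counter(new_labels))
```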
Advanced Processing
• Engine:
o Doc2vec - https://radimrehurek.com/gensim/models/doc2vec.html
o Seq2seq - https://github.com/farizrahman4u/seq2seq
o Lda2vec - http://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec/
o RNN & LSTM - https://arxiv.org/pdf/1502.06922.pdf
• Training
o CPU vs GPU
o Checkpoints with training
AnswerBot production pipeline in cloud (AWS)
[Diagram: an ML pipeline prototype to get the top N matching answers. Components shown:]
• Community, Stackoverflow, Reddit, Quora, Slack
• Bot
• AWS API Gateway
• AWS Lambda (Question Scoring)
• S3, DynamoDB
• AWS SQS, AWS SNS
• Scoring Pipeline
• Model Preparation Process
• Model Production
• Support Portal
Content
• Github - https://github.com/Avkash/mldl/tree/master/tensorbeat-answerbot
• Dataset
o Sentiment : Search it or Mine it
o 5Star - https://www.yelp.com/dataset_challenge/dataset
• Python/Jupyter Notebook
o Sentiment:
• make-sentiment-model.py
• PositiveNegative.ipynb
o 5Star:
• make-5star-model.py
• 5StarReviews.ipynb
o Prediction – PredictNow.py
Thank you so much

Creating AnswerBot with Keras and TensorFlow (TensorBeat)
