The document discusses how machine learning can improve five features of communication apps: entity analysis, sentiment analysis, task-performing assistants, tackling toxicity, and image recognition. It provides examples of using machine learning libraries like TensorFlow.js with Twilio services to add these features to apps and enhance real-time communications. The presentation aims to educate developers on applying machine learning to communication products.
As Alanis Morissette almost said, isn’t it ironic that I work at a communications company and still think communication is hard? I develop and teach web, iOS, and ML solutions for Twilio, a communications company that enables developers to send and receive messages, phone calls, emails, and more with code. So what’s a developer-centric solution to the communication question?
(click)
How about machine learning? This talk will go over a few communication application features in which ML makes communication better or more fun. This is a fairly fast-paced talk that comes in just barely under 20 minutes, so buckle up.
#1: Analyze calls in real-time. Here we have an example of performing sentiment analysis and entity analysis on a real-time transcription of a phone call. This is great for summarizing phone calls and seeing how one is going in real-time. Is the customer happy? Sad? What is the general gist? See how it parses out “DNA” and “DNA test” as entities from “I just took a DNA test” in Lizzo’s “Truth Hurts.”
Here we have a ready-made, out-of-the-box demo app you can find on GitHub that already performs the real-time transcription for us using Twilio Programmable Voice, Twilio Media Streams, and the Google Cloud Speech and Language APIs with Node.js. Now all we need to do is perform the sentiment and entity analysis.
Log into your Google Cloud developer console and enable the Google Cloud Speech API for a newly-created project by clicking this button in the top left corner: it should either say Select a project or No organization. (click)
Select New Project and give it a title like analyze-call-transcriptions. (click)
Click Create Credentials. (click) Select the Cloud Speech-to-Text API from the dropdown menu, then select No when asked if you're planning to use this API with App Engine or Compute Engine. Next, click the blue What credentials do I need? button. (click)
Make a service account with the key type JSON and save it as google_creds.json in the root of your project in /node/realtime-transcriptions.
This async function accepts a parameter `transcription` and analyzes its sentiment. Instantiate the client and make a new document object containing content (the transcription to analyze) and type (here it's PLAIN_TEXT, but for a different project it could be HTML). The document could also take an optional language parameter to recognize entities in Chinese, Japanese, French, Portuguese, and more; otherwise, the Natural Language API auto-detects the language. The score is a normalized value ranging from -1 to 1 that represents the text's overall emotional leaning, whereas the magnitude is an unnormalized value ranging from zero to infinity to which each individual expression in the text contributes, so longer texts can have greater magnitudes.
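A minimal sketch of what that function could look like with the @google-cloud/language Node.js client, assuming GOOGLE_APPLICATION_CREDENTIALS points at google_creds.json (the function name here is illustrative):

```js
const language = require('@google-cloud/language');

// Reads credentials from GOOGLE_APPLICATION_CREDENTIALS (google_creds.json)
const client = new language.LanguageServiceClient();

async function analyzeTranscription(transcription) {
  const document = {
    content: transcription, // the transcription to analyze
    type: 'PLAIN_TEXT',     // could be 'HTML' for a different project
  };

  // Sentiment: score is in [-1, 1]; magnitude is >= 0 and unbounded
  const [sentimentResult] = await client.analyzeSentiment({ document });
  const { score, magnitude } = sentimentResult.documentSentiment;
  console.log(`score: ${score}, magnitude: ${magnitude}`);

  // Entities: nouns, phrases, key words like "DNA test"
  const [entityResult] = await client.analyzeEntities({ document });
  entityResult.entities.forEach(entity => {
    console.log(`${entity.name} (${entity.type})`);
  });
}
```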
Use case #2: Personal assistants, many of which are ML-based, like Siri and Alexa. They assist with information and answering questions, like providing directions or turning on lights, helping make life easier and more entertaining. Many chatbots and IVRs (click)
use a DTMF keypad (as shown on the left), but that’s limiting. (click) That’s annoying.
This conversational IVR (on the right) was made with Twilio Autopilot, a bot-building platform where you build once and then deploy across multiple channels like phone calls, SMS, Messenger, Alexa, and more.
(click)
#3: Recognition: image recognition, video recognition, and voice/audio recognition. These are used by Alexa and Siri, but also by businesses. You can search images for faces or objects, like this image-recognition-via-MMS hack detecting pizza here. Video chat, on the other hand, is more for real-time uses, like quick facial recognition, tennis shot detection, etc.
This uses Handtrack.js, a library for prototyping real-time hand detection (bounding boxes) directly in the browser.
The underlying convolutional neural network was trained with the TensorFlow Object Detection API on the EgoHands dataset. The model is converted to the TensorFlow.js format, wrapped into an npm package, and works better when the hands in an image are viewed from above (an egocentric view).
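A minimal sketch of Handtrack.js in the browser, assuming a video element with id "video" (the model parameters shown are optional tuning knobs):

```js
const video = document.getElementById('video');

// Optional parameters for the detector
const modelParams = {
  flipHorizontal: true, // flip for front-facing webcams
  maxNumBoxes: 3,       // max hands to detect per frame
  scoreThreshold: 0.6,  // minimum confidence for a detection
};

handTrack.load(modelParams).then(model => {
  handTrack.startVideo(video).then(status => {
    if (!status) return; // camera permission denied
    setInterval(() => {
      // Each prediction includes a bounding box and a confidence score
      model.detect(video).then(predictions => {
        console.log(predictions);
      });
    }, 100);
  });
});
```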
There are similar libraries for images and video, and you can make a model that recognizes yoga poses, tells you when you’re touching your face, and more.
#4: Text analysis. We can analyze sentiment (i.e., is something positive, negative, or neutral), entities (parse text for nouns, phrases, and important keywords), and more, all falling under Natural Language Processing (NLP). This could also similarly be applied to recommendations.
This would return “0.6535733…” for “let it go.”
This pre-trained TF.js model was trained on a set of 25,000 movie reviews from IMDB, each given either a positive or negative sentiment label, and it offers two model architectures: CNN (Convolutional Neural Network) or LSTM (Long Short-Term Memory network). This talk uses the CNN.
Require Twilio and TF.js: create a Twilio client to access all my messages to and from the Twilio numbers I’ve purchased, and use node-fetch to fetch metadata for the TF.js sentiment convolutional neural network. The Twilio client is needed because I wanted to analyze my year: was my year positive based on my text messages? (It was, but let me tell you, this past month has not been, lol)
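A minimal sketch of that setup, assuming TWILIO_ACCOUNT_SID and TWILIO_AUTH_TOKEN are set as environment variables:

```js
// Plain @tensorflow/tfjs also works in Node; tfjs-node is the faster native binding
const tf = require('@tensorflow/tfjs-node');
const fetch = require('node-fetch');

// Twilio client for reading messages to/from my Twilio numbers
const client = require('twilio')(
  process.env.TWILIO_ACCOUNT_SID,
  process.env.TWILIO_AUTH_TOKEN
);
```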
Fetch some metadata, which provides both the shape and type of the model's input: we'll use node-fetch to get the metadata hosted at a remote URL so we can convert raw text into the word indices the pre-trained model expects.
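Something like this, using the metadata URL hosted by the TF.js sentiment example:

```js
// Metadata for the pre-trained sentiment CNN from the TF.js examples
const METADATA_URL =
  'https://storage.googleapis.com/tfjs-models/tfjs/sentiment_cnn_v1/metadata.json';

async function getMetaData() {
  const metadata = await fetch(METADATA_URL);
  return metadata.json();
}
```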
Make the sequences the same length; pre-padding is the default (slightly modified from the GitHub TF.js examples).
We need to convert words to sequences of word indices based on the metadata, but first we need to equalize those sequences’ lengths and convert the strings of words to integers, or vectorize them. Sequences longer than the size of the last dimension of the returned tensor (metadata.max_len) are truncated, and sequences shorter than it are padded at the start of the sequence.
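Here’s that padding helper, lightly adapted from the TF.js examples on GitHub:

```js
const PAD_INDEX = 0; // index used for padding

function padSequences(
  sequences, maxLen, padding = 'pre', truncating = 'pre', value = PAD_INDEX
) {
  return sequences.map(seq => {
    // Truncate sequences longer than maxLen (from the start by default)
    if (seq.length > maxLen) {
      if (truncating === 'pre') {
        seq.splice(0, seq.length - maxLen);
      } else {
        seq.splice(maxLen, seq.length - maxLen);
      }
    }
    // Pad sequences shorter than maxLen (at the start by default)
    if (seq.length < maxLen) {
      const pad = [];
      for (let i = 0; i < maxLen - seq.length; ++i) {
        pad.push(value);
      }
      seq = padding === 'pre' ? pad.concat(seq) : seq.concat(pad);
    }
    return seq;
  });
}
```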
The predict function accepts three parameters: one text message, the model (loaded from a remote URL in the next function), and the metadata. In predict, the input text is first tokenized and trimmed with regular expressions to convert it to lower-case and remove punctuation.
Next, those trimmed words are converted to a sequence of word indices based on the metadata. Let's say a word is in the testing input but not in the training data or recognition vocabulary: this is called out-of-vocabulary, or OOV. With this conversion, even if a word is OOV, like a misspelling or emoji, it can still be embedded as a vector, or array of numbers, which the machine learning model needs.
Finally, the model predicts how positive the text is. We create a TensorFlow tensor with our sequence of word indices. Once our output data is retrieved and downloaded from the GPU to the CPU with the synchronous dataSync() function, we need to explicitly manage memory and free that tensor's memory with dispose() before returning a decimal showing how positive the model thinks the text is.
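Putting those steps together, a sketch of predict, assuming tf, the padSequences helper above, and the metadata fields from the TF.js sentiment example (word_index, index_from, max_len):

```js
const OOV_INDEX = 2; // index reserved for out-of-vocabulary words

function predict(text, model, metadata) {
  // Tokenize: lower-case, strip punctuation, split into words
  const trimmed = text
    .trim()
    .toLowerCase()
    .replace(/(\.|\,|\!)/g, '')
    .split(' ');

  // Convert words to word indices; OOV words can still be embedded
  const sequence = trimmed.map(word => {
    const wordIndex = metadata.word_index[word];
    return wordIndex === undefined ? OOV_INDEX : wordIndex + metadata.index_from;
  });

  // Pre-pad/truncate to the model's fixed input length
  const paddedSequence = padSequences([sequence], metadata.max_len);
  const input = tf.tensor2d(paddedSequence, [1, metadata.max_len]);

  const predictOut = model.predict(input);
  const score = predictOut.dataSync()[0]; // pull the result off the GPU
  predictOut.dispose(); // explicitly free the tensor's memory
  return score; // closer to 1 = more positive
}
```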
Then we make a helper function that compares each positivity score and determines whether that makes the text message positive, negative, or neutral. You can play around with the cutoff values.
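A hypothetical helper with arbitrary cutoffs; tune these values to taste:

```js
function getSentiment(score) {
  if (score > 0.66) {
    return 'positive';
  } else if (score > 0.4) {
    return 'neutral';
  }
  return 'negative';
}
```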
#5: Detecting toxicity. Train to tame. The model was trained on the Civil Comments dataset, containing roughly two million comments labeled for toxicity.
7 possible labels: identity_attack (identity-based hate), insult, obscene, sexual_explicit, threat, toxicity, and severe_toxicity.
This is what we’ll be creating: a chat app that detects toxic/not-nice language and prevents a message from being posted to the channel if its predicted toxicity probability is above, say, 50% for any of the 7 labels.
I’ve already done some setup: I have this Twilio chat demo app.
Clone it on the command line
You will need these credentials which you can find in the README of the chat demo app.
Import Libraries (click)
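Assuming a Node-style bundler, the imports could look like this (in a plain browser page, the same libraries can instead be loaded from a CDN via script tags):

```js
const tf = require('@tensorflow/tfjs');
const toxicity = require('@tensorflow-models/toxicity');
```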
Add a toxicity-indicator to your Twilio chat app (I used a ready-made demo app on GitHub) (click)
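A minimal sketch of the gating logic, assuming the imports above; sendMessage is a hypothetical stand-in for the chat demo app's send function:

```js
const threshold = 0.5; // minimum prediction confidence

toxicity.load(threshold).then(model => {
  async function checkAndSend(text) {
    const predictions = await model.classify([text]);
    // One prediction per label; results[0].match is true, false, or null
    // (null when the model isn't confident enough either way)
    const isToxic = predictions.some(p => p.results[0].match === true);
    if (isToxic) {
      console.log('Blocked: predicted toxic above the threshold');
    } else {
      sendMessage(text); // post to the Twilio chat channel
    }
  }
  // Wire checkAndSend into the chat app's send button handler here
});
```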
Decorate with some CSS (click)
Now you, too, can call yourself a Machine Learning researcher.