Real World NLP and ML
Devin Bost
Software Architect
devin.bost@imaginelearning.com
Questions welcome during the presentation
NLP
ML
Sentiment analysis
Automated essay scoring
Content summarization
Chatbots
Information retrieval
Cluster analysis
Language neural networks
Language translation
AI Big Data
http://eduardomagrani.com/en/we-are-big-data-new-technologies-and-personal-data-management/
Everything is
NLP
ML
Sentiment analysis
Automated essay scoring
Content summarization
Chatbots
Information retrieval
Cluster analysis
Language neural networks
Document categorization
AI Big Data
https://openi.nlm.nih.gov/detailedresult.php?img=PMC2841207_1471-2105-11-101-2&req=4
Penn Treebank example:
Meta-analysis of studies: Burns, G. A., Feng, D., & Hovy, E. (2008). Intelligent approaches to mining the primary research literature: techniques, systems, and examples. In Computational Intelligence in Medical Informatics (pp. 17-50). Springer, Berlin, Heidelberg. Retrieved from:
http://www.academia.edu/download/30797420/burns_feng_hovy_comp_intel-final.pdf
https://medium.com/@athif.shaffy/one-hot-encoding-of-text-b69124bef0a7
One-hot vectors:
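A minimal sketch of one-hot encoding, using only the Python standard library. The sentence and the helper names (`build_vocabulary`, `one_hot`) are made up for illustration and do not come from the slides.

```python
def build_vocabulary(tokens):
    """Map each distinct token to an index, in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def one_hot(token, vocab):
    """Return a vector with a single 1 at the token's vocabulary index."""
    vector = [0] * len(vocab)
    vector[vocab[token]] = 1
    return vector


# Hypothetical example sentence.
tokens = "the cat sat on the mat".split()
vocab = build_vocabulary(tokens)
cat_vector = one_hot("cat", vocab)
print(cat_vector)
```

Each vector is as long as the vocabulary, so every new word adds a dimension.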
https://www.kdnuggets.com/2017/04/must-know-curse-dimensionality.html
The curse of dimensionality:
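As a rough illustration of why one-hot vectors scale badly, the sketch below shows two properties: the fraction of non-zero entries shrinks as the vocabulary grows, and any two different words have a dot product of zero, so the representation carries no notion of similarity. The vocabulary sizes are arbitrary assumptions.

```python
def one_hot(index, size):
    """One-hot vector of the given size with a 1 at the given index."""
    vector = [0] * size
    vector[index] = 1
    return vector


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def nonzero_fraction(vocab_size):
    """Fraction of entries in a one-hot vector that are non-zero."""
    return 1 / vocab_size


# Arbitrary vocabulary sizes, chosen only to show the trend.
for vocab_size in (10, 10_000, 1_000_000):
    print(vocab_size, nonzero_fraction(vocab_size))

# Two different words in a 5-word vocabulary are always orthogonal.
cat, dog = one_hot(0, 5), one_hot(1, 5)
similarity = dot(cat, dog)
```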
Statistical word embeddings: Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111-3119). At:
https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
Cited by more than 9,804 papers according to Google Scholar, as of 10/22/2018
Based on statistical relationships between words:
https://www.coursera.org/lecture/intro-to-deep-learning/word-embeddings-dhzl5
Images from: https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
https://www.packtpub.com/mapt/book/big_data_and_business_intelligence/9781788398060/8/ch08lvl1sec56/mapping-with-word2vec-embeddings
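To make "statistical relationships between words" concrete, here is a simplified stand-in: plain co-occurrence counts compared with cosine similarity. This is not the word2vec algorithm from the cited paper, which learns dense vectors with a neural model. The four sentences are invented for the example.

```python
from collections import Counter, defaultdict
import math


def cooccurrence_vectors(sentences, window=1):
    """Count, for each word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented toy corpus.
sentences = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases mice",
    "the dog chases cats",
]
vectors = cooccurrence_vectors(sentences)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_milk = cosine(vectors["cat"], vectors["milk"])
```

In this toy corpus "cat" and "dog" appear next to the same words, so they come out more similar to each other than "cat" is to "milk".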
SwiftKey:
So what are bigrams?
Examples of less useful bigrams:
Of the
what is
they are
to the
way to
hey you
Examples of useful bigrams:
New York
West Virginia
Imagine Learning
Imagine Math
Microsoft Office
Neural network
Ping pong
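One common way to separate useful bigrams from less useful ones is pointwise mutual information (PMI), which rewards pairs that occur together more often than their individual frequencies would predict. The slides do not say which method was used, so treat this as one possible approach. The sample text is invented.

```python
from collections import Counter
import math


def bigram_pmi(tokens):
    """Score each adjacent word pair by pointwise mutual information (log base 2)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total_unigrams = len(tokens)
    total_bigrams = len(tokens) - 1
    scores = {}
    for (first, second), count in bigrams.items():
        p_pair = count / total_bigrams
        p_first = unigrams[first] / total_unigrams
        p_second = unigrams[second] / total_unigrams
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


# Invented sample text.
text = ("the road to new york is the way to the city "
        "and the way to new york is the road to take")
scores = bigram_pmi(text.split())
for pair in [("new", "york"), ("way", "to"), ("to", "the")]:
    print(pair, round(scores[pair], 2))
```

On a corpus this small the scores are only indicative. PMI also tends to overrate pairs seen only once, so it is usually combined with a minimum count.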
The problem with student chat data:
Top 10 bigrams:
1. need help
2. back need
3. nice day
4. help nice
5. click back
6. please come
7. hear voice
8. type please
9. problem ask
10. ask find
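Pairs such as "back need" suggest that stop words were removed before counting, although the slides do not say so. Under that assumption, a top-N bigram count could be sketched as follows. The stop word list and chat messages are hypothetical.

```python
from collections import Counter

# Hypothetical minimal stop word list; a real one would be much longer.
STOPWORDS = {"i", "a", "the", "if", "you", "with", "this", "have", "to"}


def top_bigrams(messages, stopwords, n=10):
    """Count adjacent word pairs after dropping stop words from each message."""
    counts = Counter()
    for message in messages:
        tokens = [t for t in message.lower().split() if t not in stopwords]
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(n)


# Hypothetical chat messages.
messages = [
    "I need help with this problem",
    "click back if you need help",
    "have a nice day",
    "come back if you need help",
]
top_two = top_bigrams(messages, STOPWORDS, n=2)
print(top_two)
```

Dropping "if you" makes "back" and "need" adjacent, which is how a pair like "back need" can rank highly without ever being typed.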
http://playground.tensorflow.org/
Chatbot:
• Luong, M. T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025. At: https://arxiv.org/pdf/1508.04025
• Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. At: https://arxiv.org/pdf/1409.0473
• Next attempt with chatbots.
• Added context.
• Validation score improved!...
• Problem: validation scores are not directly comparable across models, because a score can mean something different once the model changes.
• Key point: Design your application so that it tolerates imperfect model accuracy.
Neural networks: Very good at detecting patterns, but they don’t always beat less complex ML models (e.g., Naïve Bayes, XGBoost).
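For a sense of how small such a baseline can be, here is a sketch of a multinomial Naïve Bayes text classifier with add-one smoothing, written in plain Python. The class name, training messages, and labels are invented. In practice a library implementation would be the usual choice.

```python
from collections import Counter, defaultdict
import math


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training data.
texts = [
    "i need help with this problem",
    "please help me i am stuck",
    "have a nice day",
    "thank you that was great",
]
labels = ["help", "help", "chat", "chat"]
model = NaiveBayes().fit(texts, labels)
print(model.predict("i need help"), model.predict("nice day"))
```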
The data volume paradigm:
Most common cases
https://blog.easysol.net/building-ai-applications/
https://people.duke.edu/~ccc14/sta-663/CUDAPython.html
Question analysis project using NLTK named entity extraction . . .
• Without data, you have no machine learning!
• Should be obvious, right?
• You’d be surprised.
https://medium.com/data-ops/the-data-lake-is-a-design-pattern-888323323c66
Data pantry:
https://www.pinterest.com/pin/424956914822695344/?lp=true
Great software… but now what? (The problem LinkedIn experienced.)
The solution
How to make it cost effective:
[Architecture diagram, steps (1) to (10): Client devices/browsers; API Gateway; Lambda, Auth.; IoT Core; Kinesis data stream; Kinesis Analytics or Flink/Spark Streaming on EMR; Lambda, Proactive Intervention; S3 storage; CloudWatch logging]
Cost: $8/year per 1,000,000 events + cost of analytics
Real-time streaming predictive analytics
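As a sketch of the Lambda step that consumes the Kinesis stream: Lambda delivers Kinesis records in batches, with each payload base64-encoded under `record["kinesis"]["data"]`. The message fields (`student_id`, `failed_attempts`) and the threshold are hypothetical, chosen only to illustrate a proactive-intervention rule. They are not taken from the slides.

```python
import base64
import json

# Hypothetical rule: flag a student after this many failed attempts.
FAILED_ATTEMPT_THRESHOLD = 3


def handler(event, context):
    """Decode Kinesis records and collect students who may need intervention."""
    flagged = []
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"])
        message = json.loads(payload)
        if message.get("failed_attempts", 0) >= FAILED_ATTEMPT_THRESHOLD:
            flagged.append(message["student_id"])
    return {"flagged": flagged}


def make_record(message):
    """Build a fake Kinesis record for local testing."""
    data = base64.b64encode(json.dumps(message).encode("utf-8")).decode("ascii")
    return {"kinesis": {"data": data}}


# Local usage example with made-up events.
sample_event = {
    "Records": [
        make_record({"student_id": "s1", "failed_attempts": 1}),
        make_record({"student_id": "s2", "failed_attempts": 4}),
    ]
}
result = handler(sample_event, None)
print(result)
```

Because the function only runs when records arrive, there is no server to keep running between events.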
https://analyse.kmi.open.ac.uk/
Clues that you have an organizational or architectural problem:
Excuse #1: But all of our developers are so constantly busy that we will never get around to making those changes!
Implication: But we have so much technical debt, we spend all of our time putting out fires!
Image cropped from: https://www.flickr.com/photos/41284017@N08/9599182665
From: http://gis.nwcg.gov/gist_2004/logos/federal_logos.html
Excuse #2: We have all of the data that we need!
Implication: We are so unwilling to take a look at the reality of our problem that we have no idea how bad it really is.
Excuse #3: It’s really not that important. We have higher priorities.
Implication: We think we’re so right 100% of the time that no data could possibly ever tell us that we’re ever wrong. Or, we don’t make mistakes (only our developers do).
https://www.recruiter.com/i/does-a-worker%E2%80%99s-personal-life-affect-your-brand/fingers-pointing-blame-to-man/
Excuse #4: We make our decisions based on our instincts and gut feelings.
Implication: We’re so unwilling to have our assumptions challenged that we don’t want to think about the idea that additional data could make our instincts even better.
https://medium.com/@vaidoshia/building-my-own-design-gut-instinct-f7f773d6d608
Excuse #5: That’s nice, but that doesn’t apply to us.
Implication: I live in my own little world where truth doesn’t apply to me.
https://www.deviantart.com/bluejennybird/art/my-own-planet-159966933
Excuse #6: That would be too expensive.
Implication: We’re at least 5 years behind on what big data technologies and cloud services can offer.
What’s a serverless function?
What’s an event stream?
[picture of a person getting rained on by a cloud] http://i.telegraph.co.uk/multimedia/archive/01244/appleimac1984_1244597i.jpg
Excuse #7: We don’t have time for that.
Implication: We’re so busy chasing the carrot in front of our faces that we probably won’t notice if our competitors knock us out of the market until it’s too late.
https://www.derekhuether.com/blog/2010/11/12/chasing-the-carrot
https://forum.slowtwitch.com/forum/Slowtwitch_Forums_C1/Triathlon_Forum_F1/What%27s_the_average_first_year_out_of_pocket%3F_P5797700/
Excuse #8: We need to make use of our existing technologies.
Implication: We can’t bear the thought that we have been wasting our investments in outdated technologies. Or, we don’t think this effort is important enough to justify our investment. (See excuses 1-7.)
Excuse #9: It would be too hard to maintain.
Implication: I don’t know what “serverless” means. Is that part of “The Cloud”?
https://www.thoughtco.com/types-of-clouds-recognize-in-the-sky-4025569
Python libraries for exploring word embeddings include:
• Gensim: https://radimrehurek.com/gensim/tutorial.html
• SpaCy: https://spacy.io/usage/spacy-101
• NLTK: https://www.nltk.org
• CoreNLP: https://stanfordnlp.github.io/CoreNLP/

Real World NLP, ML, and Big Data

Editor's Notes

  1. Unfortunately, when people think of big data, they often think of this: Massive amounts of data.
  2. But the reality is that big data is everywhere. Everything that can potentially collect data should be considered. Data can still be considered Big Data if the variety is high, such as if many different data sources are involved.
  3. Considering that big data is all inclusive, where then does NLP fit into this landscape?
  4. Natural language processing (NLP) can be used to extract features from human language. Our goal is usually to gain deeper insight into what is actually being said by using a computational approach that allows us to detect patterns or gain insights in an automated manner.
  5. What are
  6. Extracted terms can be mapped to domain-specific ontologies. An ontology is like a word map. Ontologies can be industry specific or can be broad. Either way, they allow us to attach additional meaning to our original data. In Big Data, we call this enrichment.
  7. It is common to use what are called one-hot word vectors to represent the words in the data. They are very commonly used with neural network models, such as the models used for Neural Machine Translation (NMT).
  8. Unfortunately, this can result in what we call the curse of dimensionality: a problem caused by the very high number of dimensions needed to model a language. Each one-hot vector has one dimension per word in the vocabulary, so for neural machine translation (NMT) models it is common to have hundreds of thousands or even millions of dimensions, depending on the size of the dictionary used.
  9. A very influential method was published in 2013 by Mikolov and colleagues, who developed a dimensionality reduction technique that creates what we call “word embeddings.” These embeddings capture statistical relationships between words and the words that frequently occur near them. This method allows us to replace one-hot vectors that have millions of dimensions and no context with dense vectors that have a few hundred dimensions and very rich context.
  10. As a consequence, each word embedding is a point in the lower-dimensional vector space produced by this dimensionality reduction.
  11. Because the model is a linear space, it allows us to represent relationships like this:
  12. The linear features of word embeddings are particularly useful for building neural network models for languages.
  13. The latest version of SwiftKey uses a neural network to predict text to accelerate typing on a mobile device.
  14. Bigrams are pairs of adjacent words in a dataset. They are most useful when the combined pair carries a distinct meaning that the individual words do not.
  15. Any useful bigrams?… (Ignore the b character at the start of the string.)
  17. Here are some good libraries for experimenting with word embeddings and natural language processing.
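The linear relationships mentioned in note 11 are often illustrated with word arithmetic such as king - man + woman ≈ queen. The sketch below uses tiny hand-assigned two-dimensional vectors (roughly "royalty" and "gender") purely for illustration. Real embeddings are learned from data and have hundreds of dimensions.

```python
import math

# Hand-assigned toy vectors, not learned embeddings.
embeddings = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.8, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


result = analogy("king", "man", "woman", embeddings)
print(result)
```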