Text is the next frontier in big data. Human speech and writing, from newspaper articles and speeches to informal conversations on social media, comes in a constant stream, encoding information about the human experience (people, relationships, events, and ideas) that simply cannot be captured through traditional data sources. Leveraging natural language data is often the best way to curate and tailor a user’s experience, and natural language processing is rapidly becoming an industry standard.
However, the work of translating raw text and audio into valuable data requires special effort: not just data management and cleaning, but also the extraction and extrapolation of language patterns. Natural language, unlike strict rule-based formal languages, is contextual, constantly evolving, and sometimes unpredictable. This problem space calls for a flexible algorithm that can learn patterns from examples and make predictions about new data. Enter machine learning!
Machine learning algorithms and natural language processing tools are more widely available than ever, and in this talk, we’ll explore an approach for putting the pieces together to transform a raw text corpus into a data product with real business value. We’ll traverse an end-to-end pipeline for parsing many thousands or millions of documents, transforming them into feature vectors, comparing machine learning models, and finally applying the most performant model to new incoming text.
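The first stage of that pipeline, parsing documents and transforming them into feature vectors, can be sketched in miniature. A stdlib-only illustration; the tokenizer, bag-of-words scheme, and toy corpus below are assumptions made for the example, not code from the talk:

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split on non-letters: a deliberately simple tokenizer.
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(corpus):
    # Map each distinct token to a column index in the feature vector.
    vocab = {}
    for doc in corpus:
        for tok in tokenize(doc):
            vocab.setdefault(tok, len(vocab))
    return vocab

def vectorize(doc, vocab):
    # Bag-of-words: count occurrences of each known vocabulary token.
    counts = Counter(tokenize(doc))
    return [counts.get(tok, 0) for tok in vocab]

corpus = ["The food was great", "The service was slow"]
vocab = build_vocabulary(corpus)
vectors = [vectorize(doc, vocab) for doc in corpus]
```

In practice a library vectorizer would replace the hand-rolled one, but the shape of the transformation, raw text in, fixed-length numeric vectors out, is the same.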
Whether you’re a data scientist, a big data enthusiast, or a developer who wants to leverage natural language data in your application, this talk will provide a practical approach to text analysis and its real-world applications.
4. tl;dr
● Text is the next frontier in big data.
● Language-aware data products are:
○ Not academia, but informed by it.
○ Not automagic; they just feel that way.
● Machine learning is flexible; rules are not.
● Text comes with some unique requirements.
● Facilitate iteration with the model selection triple.
● Deployment is an opportunity to ingest more data.
● Pipelines are necessary for production.
5. “Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth”
6. Natural Language Understanding (AI) vs. Computational Linguistics (NLP)
Natural Language Understanding (AI): models for semantic understanding, reasoning, and generation of natural languages for human-computer interaction.
Computational Linguistics (NLP): approaches that demonstrate how humans interpret and understand language, and show how languages evolve.
9. “It sucks I didn't take pictures of the food I ordered here because I really
wanted to show it off.
The restaurant isn't the biggest. It's pretty small. I had people constantly run
into my bag that I hung on the edge of my chair. Quite annoying honestly but
it's my bad for carrying such a large bag.
It didn't take long for the food to come out. I've been disappointed with one
of New York's best rated brunch spots that I waited 2+ hours for before so I
decided not to have any expectations for this place at all. However, the food
here actually tastes great.”
- 9/6/2017 Yelp Review
12. Sample Sentiment Analysis Pipeline
Training: Training Data (Historic Reviews) + Training Labels (# Stars) → Feature Vectors → Classification Algorithm → Predictive Model
Prediction: New Data (New Review) → Feature Vector → Predictive Model → Predicted Label (# Stars)
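The training and prediction paths on this slide can be sketched end to end. A minimal, hedged illustration using a hand-rolled multinomial naive Bayes with Laplace smoothing; the toy reviews, star labels, and tokenizer are assumptions made for the example:

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

# Toy historic reviews with star labels, standing in for real training data.
training = [
    ("the food here actually tastes great", 5),
    ("amazing brunch and friendly staff", 5),
    ("pretty small and quite annoying", 2),
    ("waited two hours and left disappointed", 1),
]

# Fit: accumulate per-label token counts (a tiny multinomial naive Bayes).
token_counts = defaultdict(Counter)
label_counts = Counter()
for text, stars in training:
    label_counts[stars] += 1
    token_counts[stars].update(tokenize(text))

vocab = {tok for counter in token_counts.values() for tok in counter}
total_docs = sum(label_counts.values())

def predict(review):
    # Score each label: log prior plus Laplace-smoothed log likelihoods.
    best_label, best_score = None, float("-inf")
    for stars, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)
        total_tokens = sum(token_counts[stars].values())
        for tok in tokenize(review):
            score += math.log(
                (token_counts[stars][tok] + 1) / (total_tokens + len(vocab))
            )
        if score > best_score:
            best_label, best_score = stars, score
    return best_label
```

A new incoming review flows through the same feature extraction as the training data before the fitted model assigns it a star label, which is the essential symmetry of the slide's two paths.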
18. Data Management Layer
Data Management Layer: Raw Data → Instance Database → Model Storage
Model Selection Triples: Feature Engineering × Algorithm Selection × Hyperparameter Tuning → Model Family → Model Form
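One way to read the model selection triple is as a search space: each candidate model is one combination of a feature engineering choice, an algorithm, and a hyperparameter setting. A sketch of enumerating that grid; the candidate names and the scoring function are placeholder assumptions, not real estimators:

```python
from itertools import product

# Placeholder components of the triple; in practice these would be
# real vectorizers, estimators, and hyperparameter settings.
feature_choices = ["unigrams", "bigrams"]
algorithms = ["naive_bayes", "logistic_regression"]
hyperparams = [{"alpha": 0.1}, {"alpha": 1.0}]

def evaluate(features, algorithm, params):
    # Stand-in for cross-validated scoring of a fitted model.
    score = 0.5
    score += 0.2 if features == "bigrams" else 0.0
    score += 0.1 if algorithm == "logistic_regression" else 0.0
    score += 0.05 if params["alpha"] == 0.1 else 0.0
    return score

# Enumerate every model selection triple and keep the best scorer.
best = max(product(feature_choices, algorithms, hyperparams),
           key=lambda triple: evaluate(*triple))
```

Storing the instances and fitted models alongside each triple is what makes this iteration repeatable, which is the role of the data management layer on the slide.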
21. Partisan Discourse: Architecture
Start here: Debate Transcripts → Initial Model
Submit URL → Preprocessing → Feature Extraction → Classification → Feedback
Feedback → Corpus Storage → Corpus Monitoring
Model Selection: Fit Model → Evaluate Model → Model Storage → Model Monitoring
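The fit → evaluate → store → monitor loop implies persisting fitted models and treating user feedback as new corpus data. A minimal sketch of that idea using `pickle`; the dictionary-based model, the `classify` helper, and the corpus list are hypothetical stand-ins, not the application's actual code:

```python
import os
import pickle
import tempfile

# A stand-in "fitted model": maps known tokens to a class vote.
model = {"great": "positive", "annoying": "negative"}

def classify(model, text):
    # Majority vote over the tokens the model knows about.
    votes = [model[t] for t in text.lower().split() if t in model]
    return max(set(votes), key=votes.count) if votes else "unknown"

# Model storage: serialize the fitted model so the app can reload it later.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as fh:
    pickle.dump(model, fh)

with open(path, "rb") as fh:
    loaded = pickle.load(fh)

# Deployment doubles as ingestion: each classified document, plus any
# user feedback on the label, becomes a new training instance.
corpus_storage = []
label = classify(loaded, "the food here tastes great")
corpus_storage.append(("the food here tastes great", label))
```

The point of the sketch is the round trip: the stored model serves predictions, and the predictions (with feedback) flow back into corpus storage to retrain it.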
22. Partisan Discourse: New Documents
Users can:
- add new documents
- add labels to train the model
23. Partisan Discourse: User Model
Over time, models evolve:
- Global model
- Local models
- User models
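A sketch of how global and user models might coexist: per-user models override a shared global model, with a fallback when a user has not yet trained their own. The registry shape here is a hypothetical assumption for illustration:

```python
# Hypothetical model registry: the global model is shared by everyone,
# while users who have supplied labels get their own fitted model.
global_model = {"owner": "global"}
user_models = {"alice": {"owner": "alice"}}

def model_for(user):
    # Fall back to the global model until this user has trained one.
    return user_models.get(user, global_model)
```

As users label documents, their local models diverge from the global one, which is the evolution the slide describes.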
25. tl;dr
● Text is the next frontier in big data.
● Language-aware data products are:
○ Not academia, but informed by it.
○ Not automagic; they just feel that way.
● Machine learning is flexible; rules are not.
● Text comes with some unique requirements.
● Facilitate iteration with the model selection triple.
● Deployment is an opportunity to ingest more data.
● Pipelines are necessary for production.
26. Everyday NLP Applications
• Summarization
• Reference Resolution
• Machine Translation
• Language Generation
• Language Understanding
• Document Classification
• Author Identification
• Part of Speech Tagging
• Question Answering
• Information Extraction
• Information Retrieval
• Speech Recognition
• Sense Disambiguation
• Topic Recognition
• Relationship Detection
• Named Entity Recognition