SlideShare a Scribd company logo
Building a Gigaword Corpus
Data Ingestion, Management, and Processing for NLP
Rebecca Bilbro
PyCon 2017
● Me and my motivation
● Why make a custom corpus?
● Things likely to go wrong
○ Ingestion
○ Management
○ Loading
○ Preprocessing
○ Analysis
● Lessons we learned
● Open source tools we made
Rebecca Bilbro
Data Scientist
Yellowbrick
Natural language
processing
Everyone’s doing it
NLTK
So many great tools
The Natural Language Toolkit
import nltk
moby = nltk.text.Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
print(moby.similar("ahab"))
print(moby.common_contexts(["ahab", "starbuck"]))
print(moby.concordance("monstrous", 55, lines=10))
Gensim + Wikipedia
import bz2
import gensim
# Load id to word dictionary
id2word = gensim.corpora.Dictionary.load_from_text('wikipedia_wordids.txt')
# Instantiate iterator for corpus (which is ~24.14 GB on disk after compression!)
mm = gensim.corpora.MmCorpus(bz2.BZ2File('wikipedia_tfidf.mm.bz2'))
# Do latent Semantic Analysis and find 10 prominent topics
lsa = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=400)
lsa.print_topics(10)
A custom corpus
RSS
import os
import requests
import feedparser
feed = "http://feeds.washingtonpost.com/rss/national"
for entry in feedparser.parse(feed)['entries']:
r = requests.get(entry['link'])
path = entry['title'].lower().replace(" ", "-") + ".html"
with open(path, 'wb') as f:
f.write(r.content)
Ingestion
● Scheduling
● Adding new feeds
● Synchronizing feeds, finding duplicates
● Parsing different feeds/entries into a
standard form
● Monitoring
Storage
● Database choice
● Data representation, indexing, fetching
● Connection and configuration
● Error tracking and handling
● Exporting
And as the corpus began
to grow …
… new questions arose about costs
(storage, time) and surprising results
(videos?).
Post
+ title
+ url
+ content
+ hash()
+ htmlize()
Feed
+ title
+ link
+ active
OPML Reader
+ categories()
+ counts()
+ __iter__()
+ __len__()
ingest()
Configuration
+ logging
+ database
+ flags
Exporter
+ export()
+ readme()
Admin
+ ingest_feeds()
+ ingest_opml()
+ summary()
+ run()
+ export()
Utilities
+ timez
Logging
+ logger
+ mongolog
Ingest
+ feeds()
+ started()
+ finished()
+ process()
+ ingest()
Feed Sync
+ parse()
+ sync()
+ entries()
Post Wrangler
+ wrangle()
+ fetch()
connect()
<OPML />
Production-grade
ingestion:
Baleen
Raw corpus != Usable data
From each doc, extract html, identify paras/sents/words, tag with part-of-speech
Raw Corpus
HTML
corpus = [(‘How’, ’WRB’),
(‘long’, ‘RB’),
(‘will’, ‘MD’),
(‘this’, ‘DT’),
(‘go’, ‘VB’),
(‘on’, ‘IN’),
(‘?’, ‘.’),
...
]
Paras
Sents
Tokens
Tags
Streaming
Corpus Preprocessing
Tokenized
Corpus
CorpusReader for streaming access, preprocessing, and saving the tokenized version
HTML
Paras
Sents
Tokens
Tags
Raw
Corpus
Vectorization
...so many features
Visualize top tokens,
document distribution
& part-of-speech
tagging
Yellowbrick
Data Loader
Text
Normalization
Text
Vectorization
Feature
Transformation
Estimator
Data Loader
Feature Union Pipeline
Estimator
Text
Normalization
Document
Features
Text Extraction
Summary
Vectorization
Article
Vectorization
Concept Features
Metadata Features
Dict Vectorizer
Minke
Dynamic graph
analysis
1.5 M documents, 7,500 jobs, 524 GB (uncompressed)
Keyphrase Graph:
- 2.7 M nodes
- 47 M edges
- Average
degree of 35
Lessons learned
Meaningfully literate data products rely
on…
...a custom, domain-specific corpus.
Meaningfully literate data products rely
on…
...a data management layer for flexibility
and iteration during
modeling.
Feature
Analysis
Algorithm
Selection
Hyperparameter
Tuning
corpus
├── citation.bib
├── feeds.json
├── LICENSE.md
├── manifest.json
├── README.md
└── books
├── 56d629e7c1808113ffb87eaf.html
├── 56d629e7c1808113ffb87eb3.html
└── 56d629ebc1808113ffb87ed0.html
└── business
├── 56d625d5c1808113ffb87730.html
├── 56d625d6c1808113ffb87736.html
└── 56d625ddc1808113ffb87752.html
└── cinema
├── 56d629b5c1808113ffb87d8f.html
├── 56d629b5c1808113ffb87d93.html
└── 56d629b6c1808113ffb87d9a.html
└── cooking
├── 56d62af2c1808113ffb880ec.html
├── 56d62af2c1808113ffb880ee.html
└── 56d62af2c1808113ffb880fa.html
Preprocessing
Transformer
Raw
CorpusReader
Tokenized
Corpus
Post-processed
CorpusReader
Meaningfully literate data products rely
on…
...a custom CorpusReader for streaming,
and also intermediate storage.
Meaningfully literate data products rely
on…
...visual steering and
graph analysis for
interpretation.
Corpus
Processing
Extract noun
keyphrases
weighted by
TF-IDF.
Corpus
Ingestion
Routine
Document
Collection
Every Hour
Baleen & Minke
Yellowbrick
Getting data
● (Tutorial) “Fantastic Data and Where to Find Them” by Nicole
Donnelly
● (Poster) “On the Hour Data Ingestion” by Benjamin Bengfort and
Will Voorhees
Speed to insight
● (Talk) “Human-Machine Collaboration” by Tony Ojeda
● (Poster) “A Framework for Exploratory Data Analysis” by Tony
Ojeda and Sasan Bahadaran
Machine learning
● (Poster) “Model Management Systems” by Benjamin Bengfort
and Laura Lorenz
● (Poster) “Yellowbrick” by Benjamin Bengfort and Rebecca Bilbro
Also, sprints!
pycon.districtdatalabs.com
Thank you!
Rebecca Bilbro
Twitter: twitter.com/rebeccabilbro
Github: github.com/rebeccabilbro
Email: rebecca.bilbro@bytecubed.com

More Related Content

What's hot

Java development with MongoDB
Java development with MongoDBJava development with MongoDB
Java development with MongoDB
James Williams
 
MongoDB + Java - Everything you need to know
MongoDB + Java - Everything you need to know MongoDB + Java - Everything you need to know
MongoDB + Java - Everything you need to know
Norberto Leite
 
Building Your First Application with MongoDB
Building Your First Application with MongoDBBuilding Your First Application with MongoDB
Building Your First Application with MongoDBMongoDB
 
Back to Basics Webinar 5: Introduction to the Aggregation Framework
Back to Basics Webinar 5: Introduction to the Aggregation FrameworkBack to Basics Webinar 5: Introduction to the Aggregation Framework
Back to Basics Webinar 5: Introduction to the Aggregation Framework
MongoDB
 
Back to Basics Webinar 1: Introduction to NoSQL
Back to Basics Webinar 1: Introduction to NoSQLBack to Basics Webinar 1: Introduction to NoSQL
Back to Basics Webinar 1: Introduction to NoSQL
MongoDB
 
Back to Basics Webinar 2: Your First MongoDB Application
Back to Basics Webinar 2: Your First MongoDB ApplicationBack to Basics Webinar 2: Your First MongoDB Application
Back to Basics Webinar 2: Your First MongoDB Application
MongoDB
 
Introduction to MongoDB and Hadoop
Introduction to MongoDB and HadoopIntroduction to MongoDB and Hadoop
Introduction to MongoDB and Hadoop
Steven Francia
 
MongoDB Days Silicon Valley: Winning the Dreamforce Hackathon with MongoDB
MongoDB Days Silicon Valley: Winning the Dreamforce Hackathon with MongoDBMongoDB Days Silicon Valley: Winning the Dreamforce Hackathon with MongoDB
MongoDB Days Silicon Valley: Winning the Dreamforce Hackathon with MongoDB
MongoDB
 
PhpstudyTokyo MongoDB PHP CakePHP
PhpstudyTokyo MongoDB PHP CakePHPPhpstudyTokyo MongoDB PHP CakePHP
PhpstudyTokyo MongoDB PHP CakePHP
ichikaway
 
Building Your First MongoDB App
Building Your First MongoDB AppBuilding Your First MongoDB App
Building Your First MongoDB App
Henrik Ingo
 
MongoDB
MongoDBMongoDB
MongoDB: Intro & Application for Big Data
MongoDB: Intro & Application  for Big DataMongoDB: Intro & Application  for Big Data
MongoDB: Intro & Application for Big DataTakahiro Inoue
 
Whats new in mongoDB 2.4 at Copenhagen user group 2013-06-19
Whats new in mongoDB 2.4 at Copenhagen user group 2013-06-19Whats new in mongoDB 2.4 at Copenhagen user group 2013-06-19
Whats new in mongoDB 2.4 at Copenhagen user group 2013-06-19
Henrik Ingo
 
Mongo Presentation by Metatagg Solutions
Mongo Presentation by Metatagg SolutionsMongo Presentation by Metatagg Solutions
Mongo Presentation by Metatagg Solutions
Metatagg Solutions
 
Beyond the Basics 2: Aggregation Framework
Beyond the Basics 2: Aggregation Framework Beyond the Basics 2: Aggregation Framework
Beyond the Basics 2: Aggregation Framework
MongoDB
 
Whats new in MongoDB 24
Whats new in MongoDB 24Whats new in MongoDB 24
Whats new in MongoDB 24
MongoDB
 
Code decoupling from Symfony (and others frameworks) - PHP Conference Brasil ...
Code decoupling from Symfony (and others frameworks) - PHP Conference Brasil ...Code decoupling from Symfony (and others frameworks) - PHP Conference Brasil ...
Code decoupling from Symfony (and others frameworks) - PHP Conference Brasil ...
Miguel Gallardo
 
Building Apps with MongoDB
Building Apps with MongoDBBuilding Apps with MongoDB
Building Apps with MongoDB
Nate Abele
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
Nosh Petigara
 
Reverse on go
Reverse on goReverse on go
Reverse on go
Yi-Hsien Chen
 

What's hot (20)

Java development with MongoDB
Java development with MongoDBJava development with MongoDB
Java development with MongoDB
 
MongoDB + Java - Everything you need to know
MongoDB + Java - Everything you need to know MongoDB + Java - Everything you need to know
MongoDB + Java - Everything you need to know
 
Building Your First Application with MongoDB
Building Your First Application with MongoDBBuilding Your First Application with MongoDB
Building Your First Application with MongoDB
 
Back to Basics Webinar 5: Introduction to the Aggregation Framework
Back to Basics Webinar 5: Introduction to the Aggregation FrameworkBack to Basics Webinar 5: Introduction to the Aggregation Framework
Back to Basics Webinar 5: Introduction to the Aggregation Framework
 
Back to Basics Webinar 1: Introduction to NoSQL
Back to Basics Webinar 1: Introduction to NoSQLBack to Basics Webinar 1: Introduction to NoSQL
Back to Basics Webinar 1: Introduction to NoSQL
 
Back to Basics Webinar 2: Your First MongoDB Application
Back to Basics Webinar 2: Your First MongoDB ApplicationBack to Basics Webinar 2: Your First MongoDB Application
Back to Basics Webinar 2: Your First MongoDB Application
 
Introduction to MongoDB and Hadoop
Introduction to MongoDB and HadoopIntroduction to MongoDB and Hadoop
Introduction to MongoDB and Hadoop
 
MongoDB Days Silicon Valley: Winning the Dreamforce Hackathon with MongoDB
MongoDB Days Silicon Valley: Winning the Dreamforce Hackathon with MongoDBMongoDB Days Silicon Valley: Winning the Dreamforce Hackathon with MongoDB
MongoDB Days Silicon Valley: Winning the Dreamforce Hackathon with MongoDB
 
PhpstudyTokyo MongoDB PHP CakePHP
PhpstudyTokyo MongoDB PHP CakePHPPhpstudyTokyo MongoDB PHP CakePHP
PhpstudyTokyo MongoDB PHP CakePHP
 
Building Your First MongoDB App
Building Your First MongoDB AppBuilding Your First MongoDB App
Building Your First MongoDB App
 
MongoDB
MongoDBMongoDB
MongoDB
 
MongoDB: Intro & Application for Big Data
MongoDB: Intro & Application  for Big DataMongoDB: Intro & Application  for Big Data
MongoDB: Intro & Application for Big Data
 
Whats new in mongoDB 2.4 at Copenhagen user group 2013-06-19
Whats new in mongoDB 2.4 at Copenhagen user group 2013-06-19Whats new in mongoDB 2.4 at Copenhagen user group 2013-06-19
Whats new in mongoDB 2.4 at Copenhagen user group 2013-06-19
 
Mongo Presentation by Metatagg Solutions
Mongo Presentation by Metatagg SolutionsMongo Presentation by Metatagg Solutions
Mongo Presentation by Metatagg Solutions
 
Beyond the Basics 2: Aggregation Framework
Beyond the Basics 2: Aggregation Framework Beyond the Basics 2: Aggregation Framework
Beyond the Basics 2: Aggregation Framework
 
Whats new in MongoDB 24
Whats new in MongoDB 24Whats new in MongoDB 24
Whats new in MongoDB 24
 
Code decoupling from Symfony (and others frameworks) - PHP Conference Brasil ...
Code decoupling from Symfony (and others frameworks) - PHP Conference Brasil ...Code decoupling from Symfony (and others frameworks) - PHP Conference Brasil ...
Code decoupling from Symfony (and others frameworks) - PHP Conference Brasil ...
 
Building Apps with MongoDB
Building Apps with MongoDBBuilding Apps with MongoDB
Building Apps with MongoDB
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Reverse on go
Reverse on goReverse on go
Reverse on go
 

Similar to Building a Gigaword Corpus (PyCon 2017)

PDC Video on C# 4.0 Futures
PDC Video on C# 4.0 FuturesPDC Video on C# 4.0 Futures
PDC Video on C# 4.0 Futures
nithinmohantk
 
蔡学镛 Rebol漫谈
蔡学镛   Rebol漫谈蔡学镛   Rebol漫谈
蔡学镛 Rebol漫谈d0nn9n
 
What do you mean, Backwards Compatibility?
What do you mean, Backwards Compatibility?What do you mean, Backwards Compatibility?
What do you mean, Backwards Compatibility?
Trisha Gee
 
HTML5 Is the Future of Book Authorship
HTML5 Is the Future of Book AuthorshipHTML5 Is the Future of Book Authorship
HTML5 Is the Future of Book Authorship
Sanders Kleinfeld
 
Webinar: Getting Started with Ruby and MongoDB
Webinar: Getting Started with Ruby and MongoDBWebinar: Getting Started with Ruby and MongoDB
Webinar: Getting Started with Ruby and MongoDB
MongoDB
 
Dev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDBDev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDB
MongoDB
 
NLTK introduction
NLTK introductionNLTK introduction
NLTK introduction
Prakash Pimpale
 
Mongo db japan
Mongo db japanMongo db japan
Mongo db japan
rogerbodamer
 
Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...
source{d}
 
MongoDB at FrozenRails
MongoDB at FrozenRailsMongoDB at FrozenRails
MongoDB at FrozenRails
Mike Dirolf
 
MongoDB at ZPUGDC
MongoDB at ZPUGDCMongoDB at ZPUGDC
MongoDB at ZPUGDC
Mike Dirolf
 
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiNatural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
Databricks
 
Building your first app with MongoDB
Building your first app with MongoDBBuilding your first app with MongoDB
Building your first app with MongoDB
Norberto Leite
 
[SmartNews] Globally Scalable Web Document Classification Using Word2Vec
[SmartNews] Globally Scalable Web Document Classification Using Word2Vec[SmartNews] Globally Scalable Web Document Classification Using Word2Vec
[SmartNews] Globally Scalable Web Document Classification Using Word2Vec
Kouhei Nakaji
 
Software Language Evolution
Software Language EvolutionSoftware Language Evolution
Software Language EvolutionSander Vermolen
 
Functional (web) development with Clojure
Functional (web) development with ClojureFunctional (web) development with Clojure
Functional (web) development with Clojure
Henrik Eneroth
 
Natural Language Toolkit (NLTK), Basics
Natural Language Toolkit (NLTK), Basics Natural Language Toolkit (NLTK), Basics
Natural Language Toolkit (NLTK), Basics
Prakash Pimpale
 
支撐英雄聯盟戰績網的那條巨蟒
支撐英雄聯盟戰績網的那條巨蟒支撐英雄聯盟戰績網的那條巨蟒
支撐英雄聯盟戰績網的那條巨蟒Toki Kanno
 
Javascript dom event
Javascript dom eventJavascript dom event
Javascript dom event
Bunlong Van
 

Similar to Building a Gigaword Corpus (PyCon 2017) (20)

Oreilly
OreillyOreilly
Oreilly
 
PDC Video on C# 4.0 Futures
PDC Video on C# 4.0 FuturesPDC Video on C# 4.0 Futures
PDC Video on C# 4.0 Futures
 
蔡学镛 Rebol漫谈
蔡学镛   Rebol漫谈蔡学镛   Rebol漫谈
蔡学镛 Rebol漫谈
 
What do you mean, Backwards Compatibility?
What do you mean, Backwards Compatibility?What do you mean, Backwards Compatibility?
What do you mean, Backwards Compatibility?
 
HTML5 Is the Future of Book Authorship
HTML5 Is the Future of Book AuthorshipHTML5 Is the Future of Book Authorship
HTML5 Is the Future of Book Authorship
 
Webinar: Getting Started with Ruby and MongoDB
Webinar: Getting Started with Ruby and MongoDBWebinar: Getting Started with Ruby and MongoDB
Webinar: Getting Started with Ruby and MongoDB
 
Dev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDBDev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDB
 
NLTK introduction
NLTK introductionNLTK introduction
NLTK introduction
 
Mongo db japan
Mongo db japanMongo db japan
Mongo db japan
 
Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...
 
MongoDB at FrozenRails
MongoDB at FrozenRailsMongoDB at FrozenRails
MongoDB at FrozenRails
 
MongoDB at ZPUGDC
MongoDB at ZPUGDCMongoDB at ZPUGDC
MongoDB at ZPUGDC
 
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiNatural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
 
Building your first app with MongoDB
Building your first app with MongoDBBuilding your first app with MongoDB
Building your first app with MongoDB
 
[SmartNews] Globally Scalable Web Document Classification Using Word2Vec
[SmartNews] Globally Scalable Web Document Classification Using Word2Vec[SmartNews] Globally Scalable Web Document Classification Using Word2Vec
[SmartNews] Globally Scalable Web Document Classification Using Word2Vec
 
Software Language Evolution
Software Language EvolutionSoftware Language Evolution
Software Language Evolution
 
Functional (web) development with Clojure
Functional (web) development with ClojureFunctional (web) development with Clojure
Functional (web) development with Clojure
 
Natural Language Toolkit (NLTK), Basics
Natural Language Toolkit (NLTK), Basics Natural Language Toolkit (NLTK), Basics
Natural Language Toolkit (NLTK), Basics
 
支撐英雄聯盟戰績網的那條巨蟒
支撐英雄聯盟戰績網的那條巨蟒支撐英雄聯盟戰績網的那條巨蟒
支撐英雄聯盟戰績網的那條巨蟒
 
Javascript dom event
Javascript dom eventJavascript dom event
Javascript dom event
 

More from Rebecca Bilbro

Data Structures for Data Privacy: Lessons Learned in Production
Data Structures for Data Privacy: Lessons Learned in ProductionData Structures for Data Privacy: Lessons Learned in Production
Data Structures for Data Privacy: Lessons Learned in Production
Rebecca Bilbro
 
Conflict-Free Replicated Data Types (PyCon 2022)
Conflict-Free Replicated Data Types (PyCon 2022)Conflict-Free Replicated Data Types (PyCon 2022)
Conflict-Free Replicated Data Types (PyCon 2022)
Rebecca Bilbro
 
(Py)testing the Limits of Machine Learning
(Py)testing the Limits of Machine Learning(Py)testing the Limits of Machine Learning
(Py)testing the Limits of Machine Learning
Rebecca Bilbro
 
Anti-Entropy Replication for Cost-Effective Eventual Consistency
Anti-Entropy Replication for Cost-Effective Eventual ConsistencyAnti-Entropy Replication for Cost-Effective Eventual Consistency
Anti-Entropy Replication for Cost-Effective Eventual Consistency
Rebecca Bilbro
 
The Promise and Peril of Very Big Models
The Promise and Peril of Very Big ModelsThe Promise and Peril of Very Big Models
The Promise and Peril of Very Big Models
Rebecca Bilbro
 
Beyond Off the-Shelf Consensus
Beyond Off the-Shelf ConsensusBeyond Off the-Shelf Consensus
Beyond Off the-Shelf Consensus
Rebecca Bilbro
 
PyData Global: Thrifty Machine Learning
PyData Global: Thrifty Machine LearningPyData Global: Thrifty Machine Learning
PyData Global: Thrifty Machine Learning
Rebecca Bilbro
 
EuroSciPy 2019: Visual diagnostics at scale
EuroSciPy 2019: Visual diagnostics at scaleEuroSciPy 2019: Visual diagnostics at scale
EuroSciPy 2019: Visual diagnostics at scale
Rebecca Bilbro
 
Visual diagnostics at scale
Visual diagnostics at scaleVisual diagnostics at scale
Visual diagnostics at scale
Rebecca Bilbro
 
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019Steering Model Selection with Visual Diagnostics: Women in Analytics 2019
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019
Rebecca Bilbro
 
A Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and DistributionsA Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and Distributions
Rebecca Bilbro
 
Words in space
Words in spaceWords in space
Words in space
Rebecca Bilbro
 
The Incredible Disappearing Data Scientist
The Incredible Disappearing Data ScientistThe Incredible Disappearing Data Scientist
The Incredible Disappearing Data Scientist
Rebecca Bilbro
 
Camlis
CamlisCamlis
Learning machine learning with Yellowbrick
Learning machine learning with YellowbrickLearning machine learning with Yellowbrick
Learning machine learning with Yellowbrick
Rebecca Bilbro
 
Escaping the Black Box
Escaping the Black BoxEscaping the Black Box
Escaping the Black Box
Rebecca Bilbro
 
Yellowbrick: Steering machine learning with visual transformers
Yellowbrick: Steering machine learning with visual transformersYellowbrick: Steering machine learning with visual transformers
Yellowbrick: Steering machine learning with visual transformers
Rebecca Bilbro
 
Visualizing the model selection process
Visualizing the model selection processVisualizing the model selection process
Visualizing the model selection process
Rebecca Bilbro
 
NLP for Everyday People
NLP for Everyday PeopleNLP for Everyday People
NLP for Everyday People
Rebecca Bilbro
 
Commerce Data Usability Project
Commerce Data Usability ProjectCommerce Data Usability Project
Commerce Data Usability Project
Rebecca Bilbro
 

More from Rebecca Bilbro (20)

Data Structures for Data Privacy: Lessons Learned in Production
Data Structures for Data Privacy: Lessons Learned in ProductionData Structures for Data Privacy: Lessons Learned in Production
Data Structures for Data Privacy: Lessons Learned in Production
 
Conflict-Free Replicated Data Types (PyCon 2022)
Conflict-Free Replicated Data Types (PyCon 2022)Conflict-Free Replicated Data Types (PyCon 2022)
Conflict-Free Replicated Data Types (PyCon 2022)
 
(Py)testing the Limits of Machine Learning
(Py)testing the Limits of Machine Learning(Py)testing the Limits of Machine Learning
(Py)testing the Limits of Machine Learning
 
Anti-Entropy Replication for Cost-Effective Eventual Consistency
Anti-Entropy Replication for Cost-Effective Eventual ConsistencyAnti-Entropy Replication for Cost-Effective Eventual Consistency
Anti-Entropy Replication for Cost-Effective Eventual Consistency
 
The Promise and Peril of Very Big Models
The Promise and Peril of Very Big ModelsThe Promise and Peril of Very Big Models
The Promise and Peril of Very Big Models
 
Beyond Off the-Shelf Consensus
Beyond Off the-Shelf ConsensusBeyond Off the-Shelf Consensus
Beyond Off the-Shelf Consensus
 
PyData Global: Thrifty Machine Learning
PyData Global: Thrifty Machine LearningPyData Global: Thrifty Machine Learning
PyData Global: Thrifty Machine Learning
 
EuroSciPy 2019: Visual diagnostics at scale
EuroSciPy 2019: Visual diagnostics at scaleEuroSciPy 2019: Visual diagnostics at scale
EuroSciPy 2019: Visual diagnostics at scale
 
Visual diagnostics at scale
Visual diagnostics at scaleVisual diagnostics at scale
Visual diagnostics at scale
 
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019Steering Model Selection with Visual Diagnostics: Women in Analytics 2019
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019
 
A Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and DistributionsA Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and Distributions
 
Words in space
Words in spaceWords in space
Words in space
 
The Incredible Disappearing Data Scientist
The Incredible Disappearing Data ScientistThe Incredible Disappearing Data Scientist
The Incredible Disappearing Data Scientist
 
Camlis
CamlisCamlis
Camlis
 
Learning machine learning with Yellowbrick
Learning machine learning with YellowbrickLearning machine learning with Yellowbrick
Learning machine learning with Yellowbrick
 
Escaping the Black Box
Escaping the Black BoxEscaping the Black Box
Escaping the Black Box
 
Yellowbrick: Steering machine learning with visual transformers
Yellowbrick: Steering machine learning with visual transformersYellowbrick: Steering machine learning with visual transformers
Yellowbrick: Steering machine learning with visual transformers
 
Visualizing the model selection process
Visualizing the model selection processVisualizing the model selection process
Visualizing the model selection process
 
NLP for Everyday People
NLP for Everyday PeopleNLP for Everyday People
NLP for Everyday People
 
Commerce Data Usability Project
Commerce Data Usability ProjectCommerce Data Usability Project
Commerce Data Usability Project
 

Recently uploaded

做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
GetInData
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
u86oixdj
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfUnleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Enterprise Wired
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
eddie19851
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 

Recently uploaded (20)

做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfUnleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 

Building a Gigaword Corpus (PyCon 2017)

  • 1. Building a Gigaword Corpus Data Ingestion, Management, and Processing for NLP Rebecca Bilbro PyCon 2017
  • 2. ● Me and my motivation ● Why make a custom corpus? ● Things likely to go wrong ○ Ingestion ○ Management ○ Loading ○ Preprocessing ○ Analysis ● Lessons we learned ● Open source tools we made
  • 5.
  • 7. Everyone’s doing it NLTK So many great tools
  • 8. The Natural Language Toolkit import nltk moby = nltk.text.Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt')) print(moby.similar("ahab")) print(moby.common_contexts(["ahab", "starbuck"])) print(moby.concordance("monstrous", 55, lines=10))
  • 9. Gensim + Wikipedia import bz2 import gensim # Load id to word dictionary id2word = gensim.corpora.Dictionary.load_from_text('wikipedia_wordids.txt') # Instantiate iterator for corpus (which is ~24.14 GB on disk after compression!) mm = gensim.corpora.MmCorpus(bz2.BZ2File('wikipedia_tfidf.mm.bz2')) # Do latent Semantic Analysis and find 10 prominent topics lsa = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=400) lsa.print_topics(10)
  • 11. RSS import os import requests import feedparser feed = "http://feeds.washingtonpost.com/rss/national" for entry in feedparser.parse(feed)['entries']: r = requests.get(entry['link']) path = entry['title'].lower().replace(" ", "-") + ".html" with open(path, 'wb') as f: f.write(r.content)
  • 12. Ingestion ● Scheduling ● Adding new feeds ● Synchronizing feeds, finding duplicates ● Parsing different feeds/entries into a standard form ● Monitoring Storage ● Database choice ● Data representation, indexing, fetching ● Connection and configuration ● Error tracking and handling ● Exporting
  • 13. And as the corpus began to grow … … new questions arose about costs (storage, time) and surprising results (videos?).
  • 14. Post + title + url + content + hash() + htmlize() Feed + title + link + active OPML Reader + categories() + counts() + __iter__() + __len__() ingest() Configuration + logging + database + flags Exporter + export() + readme() Admin + ingest_feeds() + ingest_opml() + summary() + run() + export() Utilities + timez Logging + logger + mongolog Ingest + feeds() + started() + finished() + process() + ingest() Feed Sync + parse() + sync() + entries() Post Wrangler + wrangle() + fetch() connect() <OPML /> Production-grade ingestion: Baleen
  • 15. Raw corpus != Usable data
  • 16. From each doc, extract html, identify paras/sents/words, tag with part-of-speech Raw Corpus HTML corpus = [(‘How’, ’WRB’), (‘long’, ‘RB’), (‘will’, ‘MD’), (‘this’, ‘DT’), (‘go’, ‘VB’), (‘on’, ‘IN’), (‘?’, ‘.’), ... ] Paras Sents Tokens Tags
  • 17. Streaming Corpus Preprocessing Tokenized Corpus CorpusReader for streaming access, preprocessing, and saving the tokenized version HTML Paras Sents Tokens Tags Raw Corpus
  • 19. Visualize top tokens, document distribution & part-of-speech tagging Yellowbrick
  • 20. Data Loader Text Normalization Text Vectorization Feature Transformation Estimator Data Loader Feature Union Pipeline Estimator Text Normalization Document Features Text Extraction Summary Vectorization Article Vectorization Concept Features Metadata Features Dict Vectorizer Minke
  • 22. 1.5 M documents, 7,500 jobs, 524 GB (uncompressed) Keyphrase Graph: - 2.7 M nodes - 47 M edges - Average degree of 35
  • 24. Meaningfully literate data products rely on… ...a custom, domain-specific corpus.
  • 25. Meaningfully literate data products rely on… ...a data management layer for flexibility and iteration during modeling. Feature Analysis Algorithm Selection Hyperparameter Tuning
  • 26. corpus ├── citation.bib ├── feeds.json ├── LICENSE.md ├── manifest.json ├── README.md └── books ├── 56d629e7c1808113ffb87eaf.html ├── 56d629e7c1808113ffb87eb3.html └── 56d629ebc1808113ffb87ed0.html └── business ├── 56d625d5c1808113ffb87730.html ├── 56d625d6c1808113ffb87736.html └── 56d625ddc1808113ffb87752.html └── cinema ├── 56d629b5c1808113ffb87d8f.html ├── 56d629b5c1808113ffb87d93.html └── 56d629b6c1808113ffb87d9a.html └── cooking ├── 56d62af2c1808113ffb880ec.html ├── 56d62af2c1808113ffb880ee.html └── 56d62af2c1808113ffb880fa.html Preprocessing Transformer Raw CorpusReader Tokenized Corpus Post-processed CorpusReader Meaningfully literate data products rely on… ...a custom CorpusReader for streaming, and also intermediate storage.
  • 27. Meaningfully literate data products rely on… ...visual steering and graph analysis for interpretation.
  • 31. Getting data ● (Tutorial) “Fantastic Data and Where to Find Them” by Nicole Donnelly ● (Poster) “On the Hour Data Ingestion” by Benjamin Bengfort and Will Voorhees Speed to insight ● (Talk) “Human-Machine Collaboration” by Tony Ojeda ● (Poster) “A Framework for Exploratory Data Analysis” by Tony Ojeda and Sasan Bahadaran Machine learning ● (Poster) “Model Management Systems” by Benjamin Bengfort and Laura Lorenz ● (Poster) “Yellowbrick” by Benjamin Bengfort and Rebecca Bilbro Also, sprints! pycon.districtdatalabs.com
  • 32. Thank you! Rebecca Bilbro Twitter: twitter.com/rebeccabilbro Github: github.com/rebeccabilbro Email: rebecca.bilbro@bytecubed.com