This talk covers rapid prototyping of high-performance, scalable text processing pipelines in Python. We demonstrate how Python modules, in particular from the Rosetta library, can be used to analyze, clean, extract features from, and finally perform machine learning tasks such as classification or topic modeling on millions of documents. Our style is to build small and simple modules (each with a command line interface) that use very little memory and are parallelized with the multiprocessing library.
Daniel Krasner - High Performance Text Processing with Rosetta
1. High Performance Text Processing with Python's Rosetta
Daniel Krasner
KFit Solutions, Columbia University
Nov 22, 2014
(IDSE) 1 / 52
2. Outline
1 Introduction
2 Dealing with Limited Memory
3 Making Things Faster!
4 Stepping Outside of Python (time permitting)
4. Motivation: Current Text Processing Projects
The Declassification Engine
- Full stack Digital Archive.
  - Collections structuring/parsing.
  - Backend, API, UI.
  - Statistical analysis.
- Organizers
  - David Madigan - Statistics Department chair (CU)
  - Matthew Connelly - Professor of international and global history (CU)
- For more info see http://www.declassification-engine.org/.
5. Motivation: Current Projects Continued
eDiscovery
- The legal world is overwhelmed with documents, both in pre- and post-production review.
- Most technologies rely heavily on keyword search, which is not efficient.
- "Predictive coding" solutions are generally archaic and inaccurate.
Other
- Human - text/document interaction.
- Semantic filtering solutions.
8. Motivation: Network Analysis
Kissinger telcons can be analyzed for frequency. Even a simple version of
this type of analysis requires the data to be structured, or an information
extraction process in place.
9. Motivation: Semantic Modeling
State department cables from embassies can be analyzed for topics.
Moscow is predominantly topic 12
soviet 0.133910
moscow 0.128717
october 0.090400
joint 0.052875
ussr 0.044190
soviets 0.042493
ur 0.027686
mfa 0.025786
refs 0.023871
prague 0.021268
London is predominantly topic 13
london 0.114568
bonn 0.083748
rome 0.074385
uk 0.051367
frg 0.050235
berlin 0.035972
usmission 0.031757
british 0.029836
european 0.027203
brussels 0.025023
11. Feature Extraction
Figure: Metadata + body text => features
The metadata features can be used in any classifier
12. Tasks
What are some typical text processing goals?
Structuring/Information Extraction
- Metadata extraction
  - Geo-tagging
  - Named-entity identification/disambiguation
- Text cleaning/text body extraction
Machine Learning
- Classification
  - logistic regression, random forests, etc.
- Sentiment analysis
- Recommendation systems
- Anomaly detection
  - Understanding underlying semantic structure (e.g. LDA/LSI modeling)
  - Communication dynamics
Note: this is far from a complete list!
14. What’s so hard about text processing?
Must use sparse data structures
Data usually doesn’t fit in memory
It’s language . . .
Text structure can change from collection to collection
(parsing/feature extraction can be tricky)
Can be difficult to convert to nice machine-readable formats
HUGE data drives much of software development, leading to products too complicated for most applications
No simple off-the-shelf solution (no equivalent of the "sklearn/statsmodels/pandas/numpy" stack)
15. What’s so fun about text processing?
You get to care about memory and processing speed
It’s language
- spelling, stemming, etc.
- parsing
- tokenization
- domain knowledge
Unix plays nicely with text
Python plays nicely with text
You get to step outside the Python ecosystem
16. Outline
1 Introduction
2 Dealing with Limited Memory
3 Making Things Faster!
4 Stepping Outside of Python (time permitting)
17. Cluster? No.
One powerful machine and memory/CPU-conscious code can handle many tasks with 1TB of text.
System76 laptop, pre-loaded with Ubuntu: 16GB memory, 4 cores, 1TB SSD. $1823
- Can also get a MacBook Pro for about $3600
You can spend $6-7k and get 20 cores, 64GB memory (upgradeable to 256GB), and 1-2TB RAID SSD. Stick it in a closet with lots of fans or in a server center for $250/month.
Single machine on AWS ($1-2k per year) or on DigitalOcean (they have nice SSDs).
18. Back-of-the-envelope memory calculations
Text:
- Same as on disk if the file is large enough (e.g. 10MB)
- Can be much more if you load many small files and append to a list.
Numbers:
- 1 double = 8 bytes => 1,000,000 doubles ≈ 8 MB.
- You can save space by using dtypes "int8," "float16," "float32," etc. See the numpy dtype docs.
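These back-of-the-envelope numbers are easy to verify with numpy's `nbytes` attribute; a minimal check (assuming numpy is installed):

```python
import numpy as np

# 1,000,000 float64 values: 8 bytes each -> ~8 MB
a = np.zeros(10**6)  # dtype defaults to float64
print(a.nbytes)      # 8000000

# Halve the footprint by downcasting to float32
b = a.astype(np.float32)
print(b.nbytes)      # 4000000

# int8 is 1 byte per value
c = np.zeros(10**6, dtype=np.int8)
print(c.nbytes)      # 1000000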
19. Monitoring Memory with HTOP
Figure: htop output showing 4 cores (4 virtual cores), all in use; 2821/15946 MB of memory in use; processes listed below.
For Macs, htop doesn't necessarily work. If it doesn't, try iStat.
20. Don’t blow up memory, stream!
Process files (text) line-by-line:
with open(infile, "r") as f, open(outfile, "w") as g:
    for line in f:
        line_output = process_line(line)
        # Write output NOW
        g.write(line_output)
Process directories one file at a time:
from rosetta.text.filefilter import get_paths

my_paths_iter = get_paths(MYDIR, get_iter=True)
for path in my_paths_iter:
    output = process_file(path)
    # Now write the output to file
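A self-contained sketch of the line-by-line pattern above, using a throwaway temp file and a hypothetical `process_line` that just lowercases:

```python
import os
import tempfile

def process_line(line):
    # Hypothetical per-line transform: strip and lowercase
    return line.strip().lower() + "\n"

# Create a small input file to stream over
tmpdir = tempfile.mkdtemp()
infile = os.path.join(tmpdir, "in.txt")
outfile = os.path.join(tmpdir, "out.txt")
with open(infile, "w") as f:
    f.write("One\nTWO\nThree\n")

# Stream: only one line is ever held in memory at a time
with open(infile) as f, open(outfile, "w") as g:
    for line in f:
        g.write(process_line(line))

with open(outfile) as g:
    print(g.read())
```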
21. Rosetta Text File Streamer
Set up a text file streamer class for processing.
# stream from a file sys directory
stream = TextFileStreamer(text_base_path=MYDIR,
                          tokenizer=MyTokenizer)
# call info_stream, which returns a dict with
# doc_id, text, tokens, etc.
for item in stream.info_stream():
    # print the document text
    print(item["text"])
    # "This is my text."
    # print the tokens
    print(item["tokens"])
    # ["this", "is", "my", "text"]
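As a rough illustration of the idea, here is a minimal stand-in generator (not the real Rosetta API) that yields the same kind of doc_id/text/tokens dicts from a directory of text files:

```python
import os
import tempfile

def simple_tokenizer(text):
    return text.lower().split()

def info_stream(base_path, tokenizer):
    # Yield one dict per file: doc_id, raw text, tokens
    for name in sorted(os.listdir(base_path)):
        path = os.path.join(base_path, name)
        with open(path) as f:
            text = f.read()
        yield {"doc_id": os.path.splitext(name)[0],
               "text": text,
               "tokens": tokenizer(text)}

# Demo on a throwaway directory with one document
base = tempfile.mkdtemp()
with open(os.path.join(base, "doc1.txt"), "w") as f:
    f.write("This is my text")

items = list(info_stream(base, simple_tokenizer))
print(items[0]["tokens"])  # ['this', 'is', 'my', 'text']
```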
22. Rosetta MySQL Streamer
Set up a database streamer class for processing.
# stream from a DB
stream = MySQLStreamer(db_setup=DBCONFIG,
                       tokenizer=MyTokenizer)
# Convert to a scipy sparse matrix
# and cache some data along the way
sparse_mat = stream.to_scipysparse(
    cache_list=["doc_id", "date"])
# grab the cached doc_ids and dates
doc_ids = stream.__dict__["doc_id_cache"]
dates = stream.__dict__["date_cache"]
23. (Online) Stochastic Gradient Descent
Combine the above with an online learning algorithm.
To minimize the empirical loss
    sum_{i=1}^n |y_i - w · x_i|^2
update the coefficients w one training example at a time:
    w^(t+1) := w^(t) - eta_t * grad_w |y_t - w^(t) · x_t|^2
Note:
- The learning rate eta_t decays to 0
- We can cycle through the training examples, updating the weights more than n times
- We only need to load one single example into memory at a time
- Converges faster when there are many data points and many coefficients
See Bottou, scikit-learn SGD, and Vowpal Wabbit.
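A toy, pure-Python sketch of the update rule above, fitting a one-dimensional model y ≈ w·x; the decay schedule and learning rate here are illustrative choices, not from the talk:

```python
import random

def sgd_fit(examples, learning_rate=0.01, n_passes=50, seed=0):
    """Fit y ~ w*x (1-d) by minimizing squared loss one example at a time."""
    rng = random.Random(seed)
    w = 0.0
    t = 0
    for _ in range(n_passes):
        rng.shuffle(examples)          # cycle through examples repeatedly
        for x, y in examples:
            t += 1
            eta = learning_rate / (1 + 0.01 * t)  # decaying learning rate
            grad = -2 * (y - w * x) * x           # d/dw (y - w*x)^2
            w -= eta * grad
    return w

# Noise-free data generated from w = 3
data = [(x, 3.0 * x) for x in [-2, -1, 1, 2, 3]]
w_hat = sgd_fit(data)
print(w_hat)  # close to 3.0
```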
24. Dealing with limited memory: Summary
Monitor memory usage
Stream process
- and cache what you need along the way
Deal with huge feature counts by...
- using a sparse data structure and stochastic gradient descent, or...
- reducing the number of features to something that fits into a dense matrix
25. Outline
1 Introduction
2 Dealing with Limited Memory
3 Making Things Faster!
4 Stepping Outside of Python (time permitting)
26. Goal: Tokenization
# Steps: split on whitespace, set to lowercase,
# remove non-letters, punctuation, and stopwords
stopwords_list = "lets,do,a,the,and".split(",")  # And more
punct = ['|', "'", '[', ']', '{', '}', '(', ')']
# Here's the function call/result we want
text = "Let's do a deal: Trade 55 Euros for 75 euros"
tokens = tokenize(text, punct, stopwords_list)
print(tokens)
# ["deal", "trade", "euros", "euros"]
27. First hack: For loop
def tokenize_for(text, punct, stopwords):
    tokens = []
    # Split the text on whitespace.
    for token in text.lower().split():
        # Remove punctuation
        clean_token = ''
        for char in token:
            if char not in punct:
                clean_token += char
        # Remove stopwords.
        if (clean_token.isalpha() and len(clean_token) > 1
                and clean_token not in stopwords):
            tokens.append(clean_token)
    return tokens
28. Profile: Time your code
regex_test.py
def main():
    stopwords_list = "lets,do,a,the,and".split(",")  # And more
    punct = ['|', "'", '[', ']', '{', '}', '(', ')']
    text = ...  # Something pretty typical for your application
    for i in range(10000):
        tokens = tokenize_for(text, punct, stopwords_list)

if __name__ == "__main__":
    main()

time python regex_test.py
real 0m0.128s
user 0m0.120s
sys  0m0.008s
29. Profile: Switch to a regex
bad_char_pattern = r"\||'|\[|\]|\{|\}|\(|\)"

def tokenize_regex_1(text, bad_char_pattern, stopwords):
    # Substitute the empty string for the bad characters
    text = re.sub(bad_char_pattern, '', text).lower()
    # Split on whitespace, keeping runs of letters of length >= 1
    split_text = re.findall(r"[a-z]+", text)
    tokens = []
    for word in split_text:
        if word not in stopwords:
            tokens.append(word)
    return tokens
time python regex_test.py
real 0m0.091s
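Putting the regex version together as a runnable whole (the character pattern is written as an equivalent character class, reconstructed from the punctuation list on the goal slide, and "for" is added to the stopwords so the output matches that slide):

```python
import re

# Reconstructed pattern: strip the handful of "bad" punctuation characters
bad_char_pattern = r"[|'\[\]{}()]"
stopwords = set("lets,do,a,the,and,for".split(","))

def tokenize_regex(text, bad_char_pattern, stopwords):
    # Drop the bad characters, lowercase, then pull out runs of letters
    text = re.sub(bad_char_pattern, "", text).lower()
    split_text = re.findall(r"[a-z]+", text)
    return [w for w in split_text if w not in stopwords]

text = "Let's do a deal: Trade 55 Euros for 75 euros"
print(tokenize_regex(text, bad_char_pattern, stopwords))
# ['deal', 'trade', 'euros', 'euros']
```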
30. Profile: Line-by-line readout
Add an @profile decorator to your tokenize_regex function, then:
pip install line_profiler
kernprof.py -l regex_test.py
python -m line_profiler regex_test.py.lprof | less
Figure: line_profiler output shows the for loop and if statement are slow.
31. Profile: Use a set
regex_test.py
def main():
    stopwords_list = 'lets,do,a,the,and'.split(',')  # And more
    stopwords_set = set(stopwords_list)
    for i in range(1000):
        text = ...  # Something pretty typical for your application
        tokens = tokenize_regex_1(text, bad_char_pattern, stopwords_set)

if __name__ == '__main__':
    main()

Reduces time from 0.091 to 0.043
Testing item in my_set requires one hash computation.
Testing item in my_list requires looking at every item in my_list.
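The set-vs-list point is easy to demonstrate with timeit; a small benchmark (sizes are arbitrary):

```python
import timeit

items = list(range(100000))
as_set = set(items)
missing = -1  # worst case for the list: it must scan every element

t_list = timeit.timeit(lambda: missing in items, number=100)
t_set = timeit.timeit(lambda: missing in as_set, number=100)
print(t_set < t_list)  # True: set lookup wins comfortably
```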
32. Profile: IPython timeit
Figure: Set lookup is O(1). So don't ever test item in my_list for long lists.
33. Profile: Line-by-line profile again
Switching to a set sped up the if statement.
The for loop can still be made faster.
34. Profile: Switch to a list comprehension
Reduced time from 0.033 to 0.025s
A list comprehension is essentially a for loop with a fast append
Looks nicer in this case
Be sure to use time python regex_test.py for the total time!
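For comparison, the stopword filter written both as an explicit loop and as a list comprehension (same result, fewer attribute lookups in the comprehension):

```python
stopwords = {"lets", "do", "a", "the", "and"}
words = ["deal", "do", "trade", "a", "euros"]

# for loop with explicit append
tokens_loop = []
for w in words:
    if w not in stopwords:
        tokens_loop.append(w)

# list comprehension: the append is implicit and fast
tokens_comp = [w for w in words if w not in stopwords]

print(tokens_comp)  # ['deal', 'trade', 'euros']
```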
35. Data structures
Think about the data structures (and associated methods) you are using for the task at hand!
Many data-analysis-friendly languages (e.g. Python, R) have very convenient built-in data structures.
These can come with significant lookup and operation overhead.
- example: set lookup vs. list lookup, as above
- example: Python does not allocate a contiguous memory block for dictionaries, making them slower than a data structure which tells the interpreter how much space will be needed
You can (easily) create your own data structure for the task at hand.
See Saulius Lukauskas's nice post "Why Python Runs Slow, Part 1: Data Structures".
36. Parallelization
Much of text processing is embarrassingly parallel.
Figure: Word counts for individual documents can be computed independently.
37. Parallelization: Basic mapping
Serial mapping
>>> def func(x):
...     return 2 * x
>>> iterable = range(3)
>>> list(map(func, iterable))
[0, 2, 4]
Parallel mapping
>>> from multiprocessing import Pool
>>> def func(x):
...     return 2 * x
>>> my_pool = Pool(processes=4)
>>> iterable = range(3)
>>> my_pool.map(func, iterable)
[0, 2, 4]
1 Spawns 4 subprocesses
2 Pickles func and iterable and pipes them to the subprocesses
3 The subprocesses compute their results
4 The subprocesses pickle/pipe their results back to the mother process
38. Parallelization: Basic mapping issues
Serial mapping
>>> def func(x):
...     return 2 * x
>>> list(map(func, range(3)))
[0, 2, 4]
Parallel mapping
>>> from multiprocessing import Pool
>>> def func(x):
...     return 2 * x
>>> my_pool = Pool(processes=4)
>>> my_pool.map(func, range(3))
[0, 2, 4]
Issues:
What about functions of more than one variable?
Pickling not possible for every function
Can’t step a debugger into pool calls
Traceback is uninterpretable
Can’t exit with Ctrl-C
Entire result is computed at once => memory blow-up!
39. Parallelization: Mapping functions of more than one var
>>> from multiprocessing import Pool
>>> from functools import partial
>>> def func(a, x):
...     return 2 * a * x
>>> a = 3
>>> func_a = partial(func, a)
>>> # func_a(x) = func(a, x)
>>> my_pool = Pool(processes=4)
>>> my_pool.map(func_a, range(3))
[0, 6, 12]
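A runnable variant of the partial trick: using partial over operator.mul keeps the mapped function picklable (a lambda would fail to pickle for Pool). The helper name here is ours, not Rosetta's:

```python
import operator
from functools import partial
from multiprocessing import Pool

def parallel_double(xs, n_jobs=2):
    # partial(operator.mul, 2) fixes the first argument, yielding a
    # picklable one-argument callable: double(x) == 2 * x
    double = partial(operator.mul, 2)
    with Pool(processes=n_jobs) as pool:
        return pool.map(double, xs)

if __name__ == "__main__":
    print(parallel_double(range(3)))  # [0, 2, 4]
```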
40. Parallelization: Dealing with map issues
def map_easy(func, iterable, n_jobs):
    if n_jobs == 1:
        return map(func, iterable)
    else:
        _trypickle(func)
        pool = Pool(n_jobs)
        timeout = 1000000
        return pool.map_async(func, iterable).get(timeout)

_trypickle(func) tries to pickle func before mapping
n_jobs == 1 => serial (debuggable/traceable) execution
pool.map_async(func, iterable).get(timeout) allows exit with Ctrl-C
41. Parallelization: Limiting memory usage
Send out/return jobs in chunks
def imap_easy(func, iterable, n_jobs, chunksize,
              ordered=True):
    if n_jobs == 1:
        results_iter = itertools.imap(func, iterable)
    else:
        _trypickle(func)
        pool = Pool(n_jobs)
        if ordered:
            results_iter = pool.imap(func, iterable,
                                     chunksize=chunksize)
        else:
            results_iter = pool.imap_unordered(
                func, iterable, chunksize=chunksize)
    return results_iter

Note: Exit with Ctrl-C is more difficult. See rosetta.parallel
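A simplified, self-contained version of such an imap_easy (inspired by, but not identical to, rosetta.parallel; it omits _trypickle and pool cleanup, and uses Python 3's lazy map). The demo exercises only the serial path:

```python
from multiprocessing import Pool

def imap_easy(func, iterable, n_jobs, chunksize, ordered=True):
    """Lazy, chunked map; n_jobs=1 gives a plain serial iterator."""
    if n_jobs == 1:
        return map(func, iterable)  # Python 2 would use itertools.imap
    pool = Pool(n_jobs)
    if ordered:
        return pool.imap(func, iterable, chunksize=chunksize)
    return pool.imap_unordered(func, iterable, chunksize=chunksize)

# The serial path is handy for debugging: plain map, readable tracebacks
results = imap_easy(abs, [-3, -1, 2], n_jobs=1, chunksize=1)
print(list(results))  # [3, 1, 2]
```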
42. Making things faster: Summary
Use regular expressions
Use the right data structure
- Numpy/Pandas for numbers (use built-in functions/numba/cython, NOT for loops)
- sets if you will test item in my_set
Profile your code
- time python myscript.py
- timeit in IPython
- line_profiler (via kernprof.py)
Use multiprocessing.Pool
A number of the Rosetta streamer methods have multiprocessing built in (see rosetta.text.streamers)
NOTE: the above examples are in Python for convenience but are relevant in many (most) other scenarios
43. Outline
1 Introduction
2 Dealing with Limited Memory
3 Making Things Faster!
4 Stepping Outside of Python (time permitting)
44. LDA (in a slide)
Latent Dirichlet Allocation, by Blei, Ng and Jordan, is a hierarchical Bayesian model which describes the underlying semantic structure of a document corpus via a set of latent distributions over the vocabulary.
The latent semantic distributions are referred to as "topics."
Each document is modeled as a mixture of these topics.
- the number of topics is chosen a priori.
Words in a document are drawn by
- choosing a topic, given the document's mixture weights,
- sampling a word from that topic.
Hyperparameters:
- lda_alpha: prior which controls the per-document topic weights, e.g. with lda_alpha = 0.1, theta_d ~ Dirichlet(alpha)
- lda_rho: prior which controls the per-topic word probabilities, e.g. with lda_rho = 0.1, beta_k ~ Dirichlet(rho)
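The generative story can be poked at with numpy's Dirichlet sampler; a small illustration (topic count and alpha are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics = 5
alpha = 0.1  # small alpha pushes each document toward a few topics

# Per-document topic weights: a point on the probability simplex
theta = rng.dirichlet(alpha * np.ones(n_topics))
print(theta.sum())  # ~1.0: the weights form a distribution

# Drawing a word: first pick a topic according to theta...
topic = rng.choice(n_topics, p=theta)
# ...then a word would be sampled from that topic's word distribution
print(0 <= topic < n_topics)  # True
```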
46. Vowpal Wabbit: What/Why
Can you build a topic model with 1,000,000 documents using gensim?
Sure... if you have 10 hours or so to kill
Better solution: Vowpal Wabbit
Online stochastic gradient descent => memory independent, optimal for huge data sets
Highly optimized C++ => fast
However...
The interface is a CLI, and the input/output files are not very usable
47. Vowpal Wabbit: Python to the rescue
Principles:
Make getting data into/out of VW easy
Don't wrap the VW CLI (or if you do, use the subprocess module to make calls, not os.system)
# Convert text files in a directory structure to vw format
stream = TextFileStreamer(
    text_base_path='bodyfiles', tokenizer=my_tokenizer)
stream.to_vw('myfiles.vw', n_jobs=-1)
# Explore token counts and filter tokens in a DataFrame
sff = SFileFilter(VWFormatter()).load_sfile('myfiles.vw')
# Create a DataFrame representation
df = sff.to_frame()
48. Vowpal Wabbit: Python to the rescue
# Create a filtered version of your sparse file
sff.filter_extremes(doc_freq_min=5, doc_fraction_max=0.8)
sff.compactify().filter_sfile('myfiles.vw', 'myfiles_filtered.vw')
# Back to bash to run VW
vw --lda 5 --cache_file ddrs.cache --passes 10 \
   -p prediction.dat --readable_model topics.dat \
   --bit_precision 16 myfiles_filtered.vw
# Look at the results in DataFrames
lda = LDAResults(
    'topics.dat', 'prediction.dat', num_topics, sff)
lda.print_topics()
See rosetta/examples/vw_helpers.md
49. Steps with VW
Step 1: Convert files to VW input format
1 0000BC34| saying:1 antunes:4 goncalves:3 scientist:1 ...
1 0000C1AE| shot:1 help:1 september:2 luxembourg:1...
1 0000BBA7| raised:1 chinese:1 winston:1 authority:1...
Step 2: View the tokens in a DataFrame
doc_freq
tokens
war 58
china 77
...
Step 3: Filter tokens and hash them
1 0000BC34| 3423211:1 111:4 43454:3 989794:1 ...
1 0000C1AE| 338:1 3123:1 19393:2 3232321:1...
1 0000BBA7| 1191:1 69830:1 398:1 974949:1...
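The Step 1 line format can be sketched as a small helper; the function name and exact layout are reconstructed from the example lines above, so treat this as illustrative:

```python
def to_vw_line(doc_id, token_counts):
    """Build one line of VW-style LDA input: '1 <doc_id>| tok:count ...'"""
    pairs = " ".join("{}:{}".format(tok, cnt)
                     for tok, cnt in sorted(token_counts.items()))
    return "1 {}| {}".format(doc_id, pairs)

line = to_vw_line("0000BC34", {"saying": 1, "antunes": 4, "goncalves": 3})
print(line)  # 1 0000BC34| antunes:4 goncalves:3 saying:1
```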
50. Steps with VW
Step 4: Run VW: vw --lda 5 --cache_file ddrs.cache --passes 10 ...
Step 5: View the results
topic_0 topic_1
tokens
war 0.2 0.8
china 0.4 0.6
See rosetta/examples/vw_helpers.md
51. Summary
Pay attention to memory
Pay attention to data structures
Profile for performance
Parallelization is easy for many text-processing tasks
Use Python to make stepping outside the python world easier
Also, don’t forget CLI and UNIX
52. Bibliography
M. Connelly et al.
The Declassification Engine. Ongoing project at Columbia University.
http://www.declassification-engine.org/
https://github.com/declassengine/declass
D. Krasner and I. Langmore
Applied Data Science, lecture notes.
http://columbia-applied-data-science.github.io/appdatasci.pdf
The Rosetta team
Tools for data science with a focus on text processing.
https://github.com/columbia-applied-data-science/rosetta
Clone, submit issues on github, fork, contribute!
53. THANK YOU!
contact: daniel@kfitsolutions.com
Rosetta
https://github.com/columbia-applied-data-science/rosetta
Open Source Python Text Processing Library
Feel free to use, fork, submit issues and contribute!