(Easy) High Performance Text Processing with Python's Rosetta
Daniel Krasner 
KFit Solutions, Columbia University 
Nov 22, 2014 
(IDSE) 1 / 52
Outline 
1 Introduction 
2 Dealing with Limited Memory 
3 Making Things Faster! 
4 Stepping Outside of Python (time permitting) 
Motivation: Current Text Processing Projects 
The Declassification Engine
- Full stack digital archive:
  - Collections structuring/parsing
  - Backend, API, UI
  - Statistical analysis
- Organizers:
  - David Madigan, Statistics Department chair (CU)
  - Matthew Connelly, Professor of international and global history (CU)
- For more info see http://www.declassification-engine.org/
Motivation: Current Projects Continued 
eDiscovery
- The legal world is overwhelmed with documents, both in pre- and post-production review.
- Most technologies rely heavily on keyword search, which is not efficient.
- "Predictive coding" solutions are generally archaic and inaccurate.
Other
- Human-text/document interaction.
- Semantic filtering solutions.
Motivation: Data Structuring/Information Extraction 
Text can come in many formats, encodings, and degrees of structure.
Figure: raw xml 
Motivation: Data Structuring/Information Extraction 
Many tasks involve initial structuring of the data. 
Figure: structured api response 
Motivation: Network Analysis 
Kissinger telcons can be analyzed for frequency. Even a simple version of 
this type of analysis requires the data to be structured, or an information 
extraction process to be in place. 
Motivation: Semantic Modeling 
State department cables from embassies can be analyzed for topics. 
Moscow is predominantly topic 12 
soviet 0.133910 
moscow 0.128717 
october 0.090400 
joint 0.052875 
ussr 0.044190 
soviets 0.042493 
ur 0.027686 
mfa 0.025786 
refs 0.023871 
prague 0.021268 
London is predominantly topic 13 
london 0.114568 
bonn 0.083748 
rome 0.074385 
uk 0.051367 
frg 0.050235 
berlin 0.035972 
usmission 0.031757 
british 0.029836 
european 0.027203 
brussels 0.025023 
Motivation: Classification 
Determine which documents are relevant to a legal case. 
Feature Extraction 
Figure: Metadata + body text ⇒ features 
The metadata features can be used in any classifier 
Tasks 
What are some typical text processing goals? 
Structuring/Information Extraction
- Metadata extraction
  - Geo-tagging
  - Named-entity identification/disambiguation
- Text cleaning/text body extraction
Machine Learning
- Classification
  - logistic regression, random forests, etc.
- Sentiment analysis
- Recommendation systems
- Anomaly detection
  - Understanding underlying semantic structure (e.g. LDA/LSI modeling)
  - Communication dynamics
Note: this is far from a complete list!
General Flow 
Figure: General Flow 
What’s so hard about text processing? 
- Must use sparse data structures
- Data usually doesn't fit in memory
- It's language . . .
- Text structure can change from collection to collection (parsing/feature extraction can be tricky)
- Can be difficult to convert to nice machine-readable formats
- HUGE data drives much of software development, leading to products too complicated for most applications
- No simple solution (e.g. the "sklearn/statsmodels/pandas/numpy" stack)
What’s so fun about text processing? 
- You get to care about memory and processing speed
- It's language
  - spelling, stemming, etc.
  - parsing
  - tokenization
  - domain knowledge
- Unix plays nicely with text
- Python plays nicely with text
- You get to step outside the Python ecosystem
Cluster? No. 
One powerful machine and memory/CPU conscious code can handle many 
tasks with 1TB of text. 
- System76 laptop, pre-loaded with Ubuntu: 16GB memory, 4 cores, 1TB SSD. $1,823.
  - A comparable MacBook Pro runs about $3,600.
- For about $7,000 you can get 20 cores, 64GB memory (upgradeable to 256GB), and 1-2TB of RAID SSD. Stick it in a closet with lots of fans, or in a server center for $250/month.
- A single machine on AWS ($1-2k per year) or on Digital Ocean (they have nice SSDs).
Back-of-the-envelope memory calculations 
Text:
- Same as on disk if the file is large enough (e.g. 10MB).
- Can be much more if you load many small files and append to a list.
Numbers:
- 1 double = 8 bytes ⇒ 1,000,000 doubles ≈ 8MB.
- You can save space by using types "int8", "float16", "float32", etc. . . . see the dtype docs.
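Those numbers are easy to check with numpy (assuming it is installed); `nbytes` reports the size of the underlying buffer:

```python
import numpy as np

# One million doubles at 8 bytes each -> 8 MB.
a = np.ones(10**6, dtype=np.float64)
print(a.nbytes)           # 8000000 bytes

# Downcasting to float32 halves the footprint.
b = a.astype(np.float32)
print(b.nbytes)           # 4000000 bytes
```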
Monitoring Memory with HTOP 
- 4 cores, 4 virtual cores, all in use
- 2821/15946 MB of memory in use
- processes listed below
For Macs, htop doesn't always work; if it doesn't, try iStat.
Don’t blow up memory, stream! 
Process files (text) line-by-line: 
with open(infile, "r") as f, open(outfile, "w") as g:
    for line in f:
        line_output = process_line(line)
        # Write output NOW
        g.write(line_output)
Process directories one file at a time: 
from rosetta.text.filefilter import get_paths

my_paths_iter = get_paths(MYDIR, get_iter=True)
for path in my_paths_iter:
    output = process_file(path)
    # Now write the output to file
Rosetta Text File Streamer 
Set up a text file streamer class for processing. 
# Stream from a filesystem directory
stream = TextFileStreamer(text_base_path=MYDIR,
                          tokenizer=MyTokenizer)
# Call info_stream, which yields a dict with
# doc_id, text, tokens, etc.
for item in stream.info_stream():
    # print the document text
    print item["text"]    # "This is my text."
    # print the tokens
    print item["tokens"]  # ["this", "is", "my", "text"]
Rosetta MySQL Streamer 
Set up a database streamer class for processing. 
# Stream from a DB
stream = MySQLStreamer(db_setup=DBCONFIG,
                       tokenizer=MyTokenizer)
# Convert to a scipy sparse matrix,
# caching some data along the way
sparse_mat = stream.to_scipysparse(
    cache_list=["doc_id", "date"])
# Grab the cached doc_ids and dates
doc_ids = stream.__dict__["doc_id_cache"]
dates = stream.__dict__["date_cache"]
(Online) Stochastic Gradient Descent 
Combine the above with an online learning algorithm. 
To minimize the empirical loss

    sum_{i=1}^n |y_i − w · x_i|^2

update the coefficients w one training example at a time:

    w^(t+1) := w^(t) − η_t ∇_w |y_t − w^(t) · x_t|^2

Note:
- The learning rate η_t decays to 0
- We can cycle through the training examples, updating the weights more than n times
- We only need to load one single example into memory at a time
- Converges faster for cases of many data points and coefficients
See Bottou, scikit-learn's SGD module, and Vowpal Wabbit. 
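A toy pure-Python sketch of this update (not the scikit-learn or VW implementation; the η_t = η_0/t schedule and the `eta0` default are assumptions):

```python
def sgd_step(w, x, y, eta):
    # One online update for squared loss |y - w.x|^2:
    # w <- w - eta * 2 * (w.x - y) * x
    pred = sum(wj * xj for wj, xj in zip(w, x))
    grad_scale = 2.0 * (pred - y)
    return [wj - eta * grad_scale * xj for wj, xj in zip(w, x)]

def sgd_fit(stream, n_features, eta0=0.1):
    # Stream over (x, y) pairs: only one example in memory at a time.
    w = [0.0] * n_features
    for t, (x, y) in enumerate(stream, start=1):
        w = sgd_step(w, x, y, eta0 / t)  # learning rate decays to 0
    return w

# Cycling repeatedly over the single example x=[1.0], y=2.0 drives w to 2.
w = sgd_fit([([1.0], 2.0)] * 10, n_features=1, eta0=0.5)
print(w)  # [2.0]
```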
Dealing with limited memory: Summary 
- Monitor memory usage
- Stream process
  - and cache what you need along the way
- Deal with huge feature counts by . . .
  - using a sparse data structure and stochastic gradient descent, or . . .
  - reducing the number of features to something that fits into a dense matrix
Goal: Tokenization 
# Steps: split on whitespace, set to lowercase,
# remove non-letters, punctuation, and stopwords
stopwords_list = "lets,do,a,the,and".split(",")  # And more
punct = ['|', "'", '[', ']', '{', '}', '(', ')']

# Here's the function call/result we want
text = "Let's do a deal: Trade 55 Euros for 75 euros"
tokens = tokenize(text, punct, stopwords_list)
print tokens
["deal", "trade", "euros", "euros"]
First hack: For loop 
def tokenize_for(text, punct, stopwords):
    tokens = []
    # Split the text on whitespace.
    for token in text.lower().split():
        # Remove punctuation.
        clean_token = ''
        for char in token:
            if char not in punct:
                clean_token += char
        # Keep alphabetic, non-stopword tokens of length > 1.
        if clean_token.isalpha() and len(clean_token) > 1 \
                and clean_token not in stopwords:
            tokens.append(clean_token)
    return tokens
Profile: Time your code 
regex_test.py

def main():
    stopwords_list = "lets,do,a,the,and".split(",")  # And more
    text = # Something pretty typical for your application
    for i in range(10000):
        tokens = tokenize_for(text, punct, stopwords_list)

if __name__ == "__main__":
    main()
time python regex_test.py 
real 0m0.128s 
user 0m0.120s 
sys 0m0.008s 
Profile: Switch to a regex 
bad_char_pattern = r"\||'|\[|\]|\{|\}|\(|\)"

def tokenize_regex_1(text, bad_char_pattern, stopwords):
    # Substitute the empty string for the bad characters
    text = re.sub(bad_char_pattern, '', text).lower()
    # Split out runs of letters, keeping strings of length > 1
    split_text = re.findall(r"[a-z]{2,}", text)
    tokens = []
    for word in split_text:
        if word not in stopwords:
            tokens.append(word)
    return tokens
time python regex_test.py 
real 0m0.091s 
Profile: Line-by-line readout 
Add an @profile decorator to your tokenize_regex function, then:

pip install line_profiler
kernprof.py -l regex_test.py
python -m line_profiler regex_test.py.lprof | less
Figure: line profiler output shows the for loop and if statement are slow. 
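kernprof injects `profile` into builtins when it runs your script; a common shim (the decorated function here is a stand-in, not the slide's code) keeps the file runnable on its own:

```python
# kernprof defines `profile` when it runs the script; this no-op
# fallback keeps the script runnable outside of kernprof.
try:
    profile
except NameError:
    def profile(func):
        return func

@profile
def count_non_stopwords(text, stopwords):
    # A stand-in function to decorate; line_profiler times each line of it.
    n = 0
    for word in text.lower().split():
        if word not in stopwords:
            n += 1
    return n

print(count_non_stopwords("The quick brown fox", {"the"}))  # 3
```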
Profile: Use a set 
regex_test.py

def main():
    stopwords_list = 'lets,do,a,the,and'.split(',')  # And more
    stopwords_set = set(stopwords_list)
    for i in range(1000):
        text = # Something pretty typical for your application
        tokens = tokenize_regex_1(text, bad_char_pattern, stopwords_set)

if __name__ == '__main__':
    main()
- Reduces the time from 0.091s to 0.043s
- Testing item in my_set requires one hash function computation
- Testing item in my_list requires looking at every item in my_list
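The same comparison can be scripted with the stdlib timeit module (the sizes and repeat counts here are arbitrary):

```python
import timeit

# Membership test for an element at the far end of a 100k-item
# collection: the list scans every item, the set hashes once.
t_list = timeit.timeit("99999 in c",
                       setup="c = list(range(100000))", number=100)
t_set = timeit.timeit("99999 in c",
                      setup="c = set(range(100000))", number=100)
print(t_list > t_set)  # True: the list lookup is far slower
```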
Profile: IPython timeit 
Figure: Set lookup is O(1). So don't ever test item in my_list for long lists. 
Profile: Line-by-line profile again 
Switching to a set sped up the if statement. 
The for loop can still be made faster. 
Profile: Switch to a list comprehension 
- Reduced the time from 0.033s to 0.025s
- A list comprehension is essentially a for loop with a fast append
- Looks nicer in this case
- Be sure to use time python regex_test.py for the total time!
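The slide's final code isn't reproduced here, but a sketch of the set-plus-list-comprehension tokenizer could look like this (the character class mirrors the earlier `punct` list; the function name is an assumption):

```python
import re

def tokenize_regex_2(text, stopwords_set):
    # Drop the bad characters, lowercase, pull out runs of letters,
    # then filter stopwords with a list comprehension (fast append).
    text = re.sub(r"[|'\[\]{}()]", "", text).lower()
    return [w for w in re.findall(r"[a-z]{2,}", text)
            if w not in stopwords_set]

stopwords_set = set("lets,do,a,the,and".split(","))
print(tokenize_regex_2("Let's do a deal: Trade 55 Euros for 75 euros",
                       stopwords_set))
# ['deal', 'trade', 'euros', 'for', 'euros']
# ("for" survives because this abbreviated stopword list omits it)
```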
Data structures 
Think about the data structures (and associated methods) you are using 
for the task at hand! 
- Many data-analysis-friendly languages (e.g. Python, R) have very convenient built-in data structures.
- These can come with significant lookup and operation overhead.
  - Example: set lookup vs. list lookup, as above.
  - Example: Python does not allocate a contiguous memory block for dictionaries, making them slower than a data structure which tells the interpreter how much space will be needed.
- You can (easily) create your own data structure for the task at hand.
- See Saulius Lukauskas's nice post "Why Python Runs Slow. Part 1: Data Structures".
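The overhead is easy to see with the stdlib: a list of ints stores pointers to boxed objects, while an `array` is one contiguous block of C ints (the sizes and the 5x threshold below are illustrative, not exact):

```python
import sys
from array import array

n = 100000
as_list = list(range(n))          # pointers to boxed int objects
as_array = array("i", range(n))   # contiguous block of 4-byte C ints

# Rough accounting: sizeof the container plus sizeof each element
# (small ints are shared singletons, but the comparison still holds).
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
array_bytes = sys.getsizeof(as_array)
print(list_bytes > 5 * array_bytes)  # True: the list costs several times more
```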
Parallelization 
Much of text processing is embarrassingly parallel. 
Figure: Word counts for individual documents can be computed independently. 
Parallelization: Basic mapping 
Serial mapping

>>> def func(x):
...     return 2 * x
>>> iterable = range(3)
>>> map(func, iterable)
[0, 2, 4]

Parallel mapping

>>> from multiprocessing import Pool
>>> def func(x):
...     return 2 * x
>>> my_pool = Pool(processes=4)
>>> iterable = range(3)
>>> my_pool.map(func, iterable)
[0, 2, 4]
1 Spawns 4 subprocesses 
2 Pickles func and iterable and pipes them to the subprocesses 
3 The subprocesses compute their results 
4 The subprocesses pickle/pipe the results back to the mother process 
Parallelization: Basic mapping issues 
Serial mapping

>>> def func(x):
...     return 2 * x
>>> map(func, range(3))
[0, 2, 4]

Parallel mapping

>>> from multiprocessing import Pool
>>> def func(x):
...     return 2 * x
>>> my_pool = Pool(processes=4)
>>> iterable = range(3)
>>> my_pool.map(func, iterable)
[0, 2, 4]
Issues:
- What about functions of more than one variable?
- Pickling is not possible for every function
- Can't step a debugger into pool calls
- The traceback is uninterpretable
- Can't exit with Ctrl-C
- The entire result is computed at once ⇒ memory blow-up!
Parallelization: Mapping functions of more than one var 
>>> from multiprocessing import Pool
>>> from functools import partial
>>> def func(a, x):
...     return 2 * a * x
>>> a = 3
>>> func_a = partial(func, a)  # func_a(x) == func(a, x)
>>> my_pool = Pool(processes=4)
>>> my_pool.map(func_a, range(3))
[0, 6, 12]
Parallelization: Dealing with map issues 
def map_easy(func, iterable, n_jobs):
    if n_jobs == 1:
        return map(func, iterable)
    else:
        _trypickle(func)
        pool = Pool(n_jobs)
        timeout = 1000000
        return pool.map_async(func, iterable).get(timeout)
- _trypickle(func) tries to pickle the func before mapping
- n_jobs = 1 ⇒ serial (debuggable/traceable) execution
- pool.map_async(func, iterable).get(timeout) allows exit with Ctrl-C
Parallelization: Limiting memory usage 
Send out/return jobs in chunks 
def imap_easy(func, iterable, n_jobs, chunksize,
              ordered=True):
    if n_jobs == 1:
        results_iter = itertools.imap(func, iterable)
    else:
        _trypickle(func)
        pool = Pool(n_jobs)
        if ordered:
            results_iter = pool.imap(func, iterable,
                                     chunksize=chunksize)
        else:
            results_iter = pool.imap_unordered(
                func, iterable, chunksize=chunksize)
    return results_iter
Note: Exit with Ctrl-C is more difficult. See rosetta.parallel 
Making things faster: Summary 
- Use regular expressions
- Use the right data structure
  - Numpy/Pandas for numbers (use built-in functions/numba/cython, NOT for loops)
  - sets if you will test some_item in my_set
- Profile your code
  - time python myscript.py
  - timeit in IPython
  - line_profiler (using kernprof.py)
- Use multiprocessing.Pool
- A number of the Rosetta streamer methods have multiprocessing built in (see rosetta.text.streamers)
NOTE: the above examples are in Python for convenience, but the lessons apply in many (most) other settings
LDA (in a slide) 
Latent Dirichlet Allocation, by Blei, Ng and Jordan, is a hierarchical 
Bayesian model which describes the underlying semantic structure of a 
document corpus via a set of latent distributions over the vocabulary. 
- The latent semantic distributions are referred to as "topics."
- Each document is modeled as a mixture of these topics.
  - The number of topics is chosen a priori.
- Words in a document are drawn by
  - choosing a topic, given the document mixture weights, then
  - sampling from that topic.
- Hyperparameters:
  - lda_alpha (e.g. 0.1): prior which controls the topic probabilities/weights; θ_d ∼ Dirichlet(α).
  - lda_rho: prior which controls the word probabilities; β_k ∼ Dirichlet(ρ).
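The generative story above, as a toy sketch (the topics and mixture weights here are made-up fixed values; a real model would first draw them from the Dirichlet priors):

```python
import random

def generate_doc(topics, doc_topic_weights, n_words, seed=0):
    # topics: {topic_id: {word: probability}}
    # doc_topic_weights: {topic_id: mixture weight for this document}
    rng = random.Random(seed)
    doc = []
    for _ in range(n_words):
        # 1. choose a topic, given the document mixture weights
        topic = rng.choices(list(doc_topic_weights),
                            weights=list(doc_topic_weights.values()))[0]
        # 2. sample a word from that topic's distribution
        words = topics[topic]
        doc.append(rng.choices(list(words),
                               weights=list(words.values()))[0])
    return doc

topics = {0: {"soviet": 0.6, "moscow": 0.4},
          1: {"london": 0.7, "uk": 0.3}}
doc = generate_doc(topics, {0: 0.9, 1: 0.1}, n_words=20)
print(doc)
```

A document weighted 0.9 toward topic 0 will, unsurprisingly, be dominated by that topic's vocabulary, which is exactly the pattern in the Moscow/London cable topics earlier.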
Vowpal Wabbit: What/Why 
Can you build a topic model with 1,000,000 documents using gensim? 
Sure . . . if you have 10 hours or so to kill. 
Better solution: Vowpal Wabbit 
- Online stochastic gradient descent ⇒ memory independent, optimal for huge data sets
- Highly optimized C++ ⇒ fast
However . . . 
- The interface is a CLI, and the input/output files are not very usable
Vowpal Wabbit: Python to the rescue 
Principles: 
Make getting data into/out of VW easy 
Don't wrap the VW CLI (or if you do, use the subprocess module to 
make calls, not os.system) 
# Convert text files in a directory structure to VW format
stream = TextFileStreamer(
    text_base_path='bodyfiles', tokenizer=my_tokenizer)
stream.to_vw('myfiles.vw', n_jobs=-1)

# Explore token counts and filter tokens in a DataFrame
sff = SFileFilter(VWFormatter()).load_sfile('myfiles.vw')
# Create a DataFrame representation
df = sff.to_frame()
Vowpal Wabbit: Python to the rescue 
# Create a filtered version of your sparse file
sff.filter_extremes(doc_freq_min=5, doc_fraction_max=0.8)
sff.compactify().filter_sfile('myfiles.vw', 'myfiles_filtered.vw')

# Back to bash to run VW
vw --lda 5 --cache_file ddrs.cache --passes 10 \
    -p prediction.dat --readable_model topics.dat \
    --bit_precision 16 myfiles_filtered.vw

# Look at the results in DataFrames
lda = LDAResults(
    'topics.dat', 'prediction.dat', num_topics, sff)
lda.print_topics()
See rosetta/examples/vw_helpers.md 
Steps with VW 
Step 1: Convert files to VW input format 
1 0000BC34| saying:1 antunes:4 goncalves:3 scientist:1 ... 
1 0000C1AE| shot:1 help:1 september:2 luxembourg:1... 
1 0000BBA7| raised:1 chinese:1 winston:1 authority:1... 
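A sketch of how such a line can be produced from token counts (a hypothetical helper; rosetta's stream.to_vw does this for real, and the layout is inferred from the example lines above):

```python
def to_vw_line(doc_id, token_counts):
    # "<label> <doc_id>| token:count token:count ..."
    pairs = " ".join("%s:%d" % (tok, n)
                     for tok, n in sorted(token_counts.items()))
    return "1 %s| %s" % (doc_id, pairs)

print(to_vw_line("0000BC34", {"saying": 1, "antunes": 4, "goncalves": 3}))
# 1 0000BC34| antunes:4 goncalves:3 saying:1
```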
Step 2: View the tokens in a DataFrame 

        doc_freq
tokens
war           58
china         77
...
Step 3: Filter tokens and hash them 
1 0000BC34| 3423211:1 111:4 43454:3 989794:1 ... 
1 0000C1AE| 338:1 3123:1 19393:2 3232321:1... 
1 0000BBA7| 1191:1 69830:1 398:1 974949:1... 
Steps with VW 
Step 4: Run VW: 
vw --lda 5 --cache_file ddrs.cache --passes 10 ... 
Step 5: View the results 

        topic_0  topic_1
tokens
war         0.2      0.8
china       0.4      0.6
See rosetta/examples/vw_helpers.md 
Summary 
- Pay attention to memory
- Pay attention to data structures
- Profile for performance
- Parallelization is easy for many text-processing tasks
- Use Python to make stepping outside the Python world easier
- Also, don't forget the CLI and UNIX
Bibliography 
M. Connelly et al. 
The Declassification Engine. Ongoing project at Columbia University. 
http://www.declassification-engine.org/ 
https://github.com/declassengine/declass 

D. Krasner and I. Langmore 
Applied Data Science, lecture notes. 
http://columbia-applied-data-science.github.io/appdatasci.pdf 

The Rosetta team 
Tools for data science with a focus on text processing. 
https://github.com/columbia-applied-data-science/rosetta 
Clone, submit issues on github, fork, contribute! 

Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...PyData
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottPyData
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroPyData
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...PyData
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPyData
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...PyData
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydPyData
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverPyData
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldPyData
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...PyData
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardPyData
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...PyData
 

More from PyData (20)

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne Bauer
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica Puerto
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will Ayd
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen Hoover
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper Seabold
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
 

Recently uploaded

Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 

Recently uploaded (20)

Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 

Daniel Krasner - High Performance Text Processing with Rosetta

  • 1. (Easy), High Performance Text Processing with Python’s Rosetta Daniel Krasner KFit Solutions, Columbia University Nov 22, 2014 (IDSE) 1 / 52
  • 2. Outline 1 Introduction 2 Dealing with Limited Memory 3 Making Things Faster! 4 Stepping Outside of Python (time permitting) (IDSE) 2 / 52
  • 3. Outline 1 Introduction 2 Dealing with Limited Memory 3 Making Things Faster! 4 Stepping Outside of Python (time permitting) (IDSE) 3 / 52
  • 4. Motivation: Current Text Processing Projects The Declassification Engine I Full stack Digital Archive. F Collections structuring/parsing. F Backend, API, UI. F Statistical analysis. I Organizers F David Madigan - Statistics Department chair (CU) F Matthew Connelly - Professor of international and global history (CU) I For more info see http://www.declassification-engine.org/. (IDSE) 4 / 52
  • 5. Motivation: Current Projects Continued eDiscovery I The legal world is overwhelmed with documents, both in pre and post production review. I Most technologies heavily rely on keyword search which is not efficient. I “Predictive coding” solutions are generally archaic and inaccurate. Other I Human - text/document interaction. I Semantic filtering solutions. (IDSE) 5 / 52
  • 6. Motivation: Data Structuring/Information Extraction Text can come in many formats, encodings, and degree of structure. Figure: raw xml (IDSE) 6 / 52
  • 7. Motivation: Data Structuring/Information Extraction Many tasks involve initial structuring of the data. Figure: structured api response (IDSE) 7 / 52
  • 8. Motivation: Network Analysis Kissinger telcons can be analyzed for frequency. Even a simple version of this type of analysis requires the data to be structured, or an information extraction process in place. (IDSE) 8 / 52
  • 9. Motivation: Semantic Modeling State department cables from embassies can be analyzed for topics. Moscow is predominantly topic 12 soviet 0.133910 moscow 0.128717 october 0.090400 joint 0.052875 ussr 0.044190 soviets 0.042493 ur 0.027686 mfa 0.025786 refs 0.023871 prague 0.021268 London is predominantly topic 13 london 0.114568 bonn 0.083748 rome 0.074385 uk 0.051367 frg 0.050235 berlin 0.035972 usmission 0.031757 british 0.029836 european 0.027203 brussels 0.025023 (IDSE) 9 / 52
  • 10. Motivation: Classification Determine which documents are relevant to a legal case. (IDSE) 10 / 52
  • 11. Feature Extraction Figure: Metadata + body text ⇒ features. The metadata features can be used in any classifier (IDSE) 11 / 52
  • 12. Tasks What are some typical text processing goals? Structuring/Information Extraction I Metadata extraction F Geo-tagging F Named-entity identification/disambiguation I Text cleaning/text body extraction Machine Learning I Classification F logistic regression, random forests, etc I Sentiment analysis I Recommendation systems I Anomaly detection F Understanding underlying semantic structure (e.g. LDA/LSI modeling) F Communication dynamics Note: this is far from a complete list! (IDSE) 12 / 52
  • 13. General Flow Figure: General Flow (IDSE) 13 / 52
  • 14. What’s so hard about text processing? Must use sparse data structures Data usually doesn’t fit in memory It’s language . . . Text structure can change from collection to collection (parsing/feature extraction can be tricky) Can be difficult to convert to nice machine-readable formats HUGE data drives much of software development. Leads to products too complicated for most applications No simple solution (e.g. “sklearn/statsmodels/pandas/numpy” stack) (IDSE) 14 / 52
  • 15. What’s so fun about text processing? You get to care about memory and processing speed It’s language I spelling, stemming, etc I parsing I tokenization I domain knowledge Unix plays nicely with text Python plays nicely with text You get to step outside the Python ecosystem (IDSE) 15 / 52
  • 16. Outline 1 Introduction 2 Dealing with Limited Memory 3 Making Things Faster! 4 Stepping Outside of Python (time permitting) (IDSE) 16 / 52
  • 17. Cluster? No. One powerful machine and memory/CPU conscious code can handle many tasks with 1TB of text. System76 laptop, pre-loaded with Ubuntu: 16GB memory, 4 cores, 1TB SSD, $1,823. I Can also get a MacBook Pro for about $3,600. You can spend $6-7k and get 20 cores, 64GB memory (upgradeable to 256GB), and 1-2TB of RAID SSD. Stick it in a closet with lots of fans, or in a server center for $250/month. Or a single machine on AWS ($1-2k per year) or on Digital Ocean (they have nice SSDs). (IDSE) 17 / 52
  • 18. Back-of-the-envelope memory calculations Text: I Same as on disk if the file is large enough (e.g. 10M) I Can be much more if you load many small files and append to a list. Numbers: I 1 double = 8 bytes ⇒ 1,000,000 doubles ≈ 8MB. I You can save space by using type “int8,” “float16,” “float32,” etc.; see the dtype docs (IDSE) 18 / 52
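The dtype savings on this slide can be checked directly with numpy's `nbytes` attribute (a quick sketch; numpy is assumed available):

```python
import numpy as np

# 1,000,000 float64 values occupy ~8 MB; narrower dtypes shrink this proportionally.
n = 1_000_000
print(np.zeros(n, dtype=np.float64).nbytes)  # 8000000 bytes (~8 MB)
print(np.zeros(n, dtype=np.float32).nbytes)  # 4000000 bytes (~4 MB)
print(np.zeros(n, dtype=np.int8).nbytes)     # 1000000 bytes (~1 MB)
```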
  • 19. Monitoring Memory with HTOP 4 cores, 4 virtual cores, all in use 2821/15946 MB memory in use processes listed below For macs, htop doesn’t necessarily work. If it doesn’t, try iStat. (IDSE) 19 / 52
  • 20. Don’t blow up memory, stream! Process files (text) line-by-line: with open(infile, "r") as f, open(outfile, "w") as g: for line in f: line_output = process_line(line) # Write output NOW g.write(line_output) Process directories one file at a time: from rosetta.text.filefilter import get_paths my_paths_iter = get_paths(MYDIR, get_iter=True) for path in my_paths_iter: output = process_file(path) # Now write the output to file (IDSE) 20 / 52
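The line-by-line pattern above in self-contained, runnable form (a sketch using a temp file and a toy `process_line`; not Rosetta code):

```python
import os
import tempfile

def process_line(line):
    # toy per-line transform: collapse whitespace and uppercase
    return " ".join(line.split()).upper() + "\n"

# build a small input file to stream over
tmp = tempfile.NamedTemporaryFile("w", delete=False, suffix=".txt")
tmp.write("first line\nsecond   line\n")
tmp.close()
outfile = tmp.name + ".out"

# the streaming pattern: only one line is in memory at a time
with open(tmp.name) as f, open(outfile, "w") as g:
    for line in f:
        g.write(process_line(line))

with open(outfile) as g:
    result = g.read()
print(result)  # FIRST LINE / SECOND LINE

os.remove(tmp.name)
os.remove(outfile)
```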
  • 21. Rosetta Text File Streamer Set up a text file streamer class for processing. # stream from a file sys directory stream = TextFileStreamer(text_base_path=MYDIR, tokenizer=MyTokenizer) # call info_stream which will return a dict with # doc_id, text, tokens, etc for item in stream.info_stream(): # print the document text print item["text"] "This is my text." # print the tokens print item["tokens"] ["this", "is", "my", "text"] (IDSE) 21 / 52
  • 22. Rosetta MySQL Streamer Set up a database streamer class for processing. # stream from a DB stream = MySQLStreamer(db_setup=DBCONFIG, tokenizer=MyTokenizer) # Convert to scipy sparse matrix # and cache some data along the way sparse_mat = stream.to_scipysparse( cache_list=["doc_id", "date"]) # grab the cached doc_id and dates doc_ids = stream.__dict__["doc_id_cache"] dates = stream.__dict__["date_cache"] (IDSE) 22 / 52
  • 23. (Online) Stochastic Gradient Descent Combine the above with an online learning algorithm. To minimize empirical loss Σ_{i=1}^n |y_i − w · x_i|², update the coefficient w one training example at a time: w^{(t+1)} := w^{(t)} − η_t ∇_w |y_t − w^{(t)} · x_t|². Note: The learning rate η_t decays to 0. We can cycle through the training examples, updating the weights more than n times. We only need to load one single example into memory at a time. Converges faster for cases of many data points and coefficients. See Bottou, scikit-learn SGD, and Vowpal Wabbit. (IDSE) 23 / 52
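A minimal numpy sketch of the update rule on this slide (illustrative only, not Vowpal Wabbit or scikit-learn code): fit y = 3x one example at a time with a decaying learning rate.

```python
import numpy as np

# Online SGD for squared loss: one training example per update.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0]                     # true weight is 3.0, no noise

w = np.zeros(1)
for t, (x_t, y_t) in enumerate(zip(X, y), start=1):
    eta = 0.1 / np.sqrt(t)            # learning rate decays to 0
    grad = 2 * (w @ x_t - y_t) * x_t  # gradient of |y_t - w . x_t|^2
    w = w - eta * grad                # only one example in memory per update

print(w)  # approaches [3.0]
```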
  • 24. Dealing with limited memory: Summary Monitor memory usage Stream process I and cache what you need along the way Deal with huge feature counts by. . . I Use a sparse data structure, stochastic gradient descent. Or. . . I Reduce the number of features to something that fits into a dense matrix (IDSE) 24 / 52
  • 25. Outline 1 Introduction 2 Dealing with Limited Memory 3 Making Things Faster! 4 Stepping Outside of Python (time permitting) (IDSE) 25 / 52
  • 26. Goal: Tokenization # Steps: split on whitespace, set to lowercase # remove non-letters, punctuation, and stopwords stopwords_list = "lets,do,a,the,and".split(",") # And more punct = ['|', "'", '[', ']', '{', '}', '(', ')'] # Here's the function call/result we want text = "Let's do a deal: Trade 55 Euros for 75 euros" tokens = tokenize(text, punct, stopwords_list) print tokens ["deal", "trade", "euros", "euros"] (IDSE) 26 / 52
  • 27. First hack: For loop def tokenize_for(text, punct, stopwords): tokens = [] # Split the text on whitespace. for token in text.lower().split(): # Remove punctuation clean_token = '' for char in token: if char not in punct: clean_token += char # Remove stopwords. if clean_token.isalpha() and len(clean_token) > 1 and (clean_token not in stopwords): tokens.append(clean_token) return tokens (IDSE) 27 / 52
  • 28. Profile: Time your code regex_test.py def main(): stopwords_list = "lets,do,a,the,and".split(",") # And more text = # Something pretty typical for your application for i in range(10000): tokens = tokenize_for(text, punct, stopwords_list) if __name__ == "__main__": main() time python regex_test.py real 0m0.128s user 0m0.120s sys 0m0.008s (IDSE) 28 / 52
  • 29. Profile: Switch to a regex bad_char_pattern = r"\||'|\[|\]|{|}|\(|\)" def tokenize_regex_1(text, bad_char_pattern, stopwords): # Substitute empty string for the bad characters text = re.sub(bad_char_pattern, '', text).lower() # Split on whitespace, keeping alphabetic strings split_text = re.findall(r"[a-z]+", text) tokens = [] for word in split_text: if word not in stopwords: tokens.append(word) return tokens time python regex_test.py real 0m0.091s (IDSE) 29 / 52
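Putting the pieces together, a runnable assembly of this regex tokenizer (the stopword list is extended with "for" — an assumption — so that the goal output from the earlier tokenization slide is reproduced):

```python
import re

# Regex tokenizer: strip punctuation, lowercase, keep alphabetic words,
# drop stopwords and single characters.
STOPWORDS = set("lets,do,a,the,and,for".split(","))
BAD_CHAR_PATTERN = r"\||'|\[|\]|{|}|\(|\)"

def tokenize_regex(text, stopwords=STOPWORDS):
    text = re.sub(BAD_CHAR_PATTERN, "", text).lower()
    words = re.findall(r"[a-z]+", text)
    return [w for w in words if w not in stopwords and len(w) > 1]

tokens = tokenize_regex("Let's do a deal: Trade 55 Euros for 75 euros")
print(tokens)  # ['deal', 'trade', 'euros', 'euros']
```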
  • 30. Profile: Line-by-line readout Add an @profile decorator to your tokenize regex function, and pip install line_profiler kernprof.py -l regex_test.py python -m line_profiler regex_test.py.lprof |less Figure: line profiler output shows the for loop and if statement are slow. (IDSE) 30 / 52
  • 31. Profile: Use a set regex_test.py def main(): stopwords_list = 'lets,do,a,the,and'.split(',') # And more stopwords_set = set(stopwords_list) for i in range(1000): text = # Something pretty typical for your application tokens = tokenize_regex_1(text, bad_char_pattern, stopwords_set) if __name__ == '__main__': main() Reduces time from 0.091 to 0.043 Testing item in my_set requires hash function computation Testing item in my_list requires looking at every item in my_list. (IDSE) 31 / 52
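The set-vs-list speedup can be demonstrated in a few lines (a sketch with illustrative sizes, timing the worst case for the list):

```python
import timeit

# Membership tests: a list scans O(n) elements; a set hashes in O(1).
n = 100_000
data_list = list(range(n))
data_set = set(data_list)
target = n - 1                      # last element: worst case for the list

list_time = timeit.timeit(lambda: target in data_list, number=200)
set_time = timeit.timeit(lambda: target in data_set, number=200)

print(set_time < list_time)  # True: set lookup wins by orders of magnitude
```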
  • 32. Profile: IPython timeit Figure: Set lookup is O(1). So don’t ever test item in my list for long lists. (IDSE) 32 / 52
  • 33. Profile: Line-by-line profile again Switching to a set speed up the if statement. The for loop can still be faster (IDSE) 33 / 52
  • 34. Profile: Switch to a list comprehension Reduced time from 0.033 to 0.025s A list comprehension is essentially a for loop with a fast append Looks nicer in this case Be sure to use time python regex_test.py for the total time! (IDSE) 34 / 52
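The two equivalent forms side by side (a toy sketch; the word list and stopwords are illustrative):

```python
# Same filtering two ways: explicit for loop with append vs list comprehension.
stopwords = {"lets", "do", "a", "the", "and"}
words = ["deal", "the", "trade", "a", "euros", "euros"]

loop_tokens = []
for w in words:
    if w not in stopwords:
        loop_tokens.append(w)       # attribute lookup + method call on every hit

comp_tokens = [w for w in words if w not in stopwords]  # fast append built in

print(comp_tokens)  # ['deal', 'trade', 'euros', 'euros']
```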
  • 35. Data structures Think about the data structures (and associated methods) you are using for the task at hand! Many data analysis friendly languages (e.g. Python, R) have very convenient built-in data structures. These can come with significant lookup and operation overhead. I example: set lookup vs list lookup as above I example: python does not allocate a contiguous memory block for dictionaries, making them slower than a data structure which tells the interpreter how much space will be needed You can (easily) create your own data structure for the task at hand. See Saulius Lukuaskas' nice post Why Python Runs Slow. Part 1: Data structures. (IDSE) 35 / 52
  • 36. Parallelization Much of text processing is embarrassingly parallel. Figure: Word counts for individual documents can be computed independently. (IDSE) 36 / 52
  • 37. Parallelization: Basic mapping Serial mapping def func(x): ... return 2 * x iterable = range(3) map(func, iterable) [0, 2, 4] Parallel mapping from multiprocessing import Pool def func(x): ... return 2 * x my_pool = Pool(processes=4) iterable = range(3) my_pool.map(func, iterable) [0, 2, 4] 1 Spawns 4 subprocesses 2 Pickles func, iterable and pipes them to the subprocess 3 The subprocesses compute their results 4 Subprocesses pickle/pipe back the results to the mother process (IDSE) 37 / 52
  • 38. Parallelization: Basic mapping issues Serial mapping def func(x): ... return 2 * x map(func, range(3)) [0, 2, 4] Parallel mapping from multiprocessing import Pool def func(x): ... return 2 * x my_pool = Pool(processes=4) iterable = range(3) my_pool.map(func, iterable) [0, 2, 4] Issues: What about functions of more than one variable? Pickling not possible for every function Can’t step a debugger into pool calls Traceback is uninterpretable Can’t exit with Ctrl-C Entire result is computed at once ⇒ memory blow-up! (IDSE) 38 / 52
  • 39. Parallelization: Mapping functions of more than one var from multiprocessing import Pool from functools import partial def func(a, x): ... return 2 * a * x a = 3 func_a = partial(func, a) # func_a(x) = func(a, x) my_pool = Pool(processes=4) my_pool.map(func_a, range(3)) [0, 6, 12] (IDSE) 39 / 52
Parallelization: Dealing with map issues 
def map_easy(func, iterable, n_jobs): 
    if n_jobs == 1: 
        return map(func, iterable) 
    else: 
        _trypickle(func) 
        pool = Pool(n_jobs) 
        timeout = 1000000 
        return pool.map_async(func, iterable).get(timeout) 
_trypickle(func) tries to pickle func before mapping 
n_jobs = 1 ⇒ serial (debuggable/traceable) execution 
pool.map_async(func, iterable).get(timeout) allows exit with Ctrl-C 
(IDSE) 40 / 52
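A runnable, simplified sketch of this pattern (the pickle pre-check is omitted here; rosetta's real version also calls its `_trypickle` helper first):

```python
from multiprocessing import Pool

def map_easy(func, iterable, n_jobs):
    """Serial when n_jobs == 1 (so the debugger and traceback work);
    otherwise map via a pool.  Using map_async(...).get(timeout)
    instead of plain map(...) keeps Ctrl-C responsive."""
    if n_jobs == 1:
        return list(map(func, iterable))
    pool = Pool(n_jobs)
    try:
        return pool.map_async(func, iterable).get(timeout=1_000_000)
    finally:
        pool.close()
        pool.join()

def double(x):  # must be a top-level function so it pickles
    return 2 * x

if __name__ == "__main__":
    print(map_easy(double, range(3), n_jobs=1))  # [0, 2, 4]
    print(map_easy(double, range(3), n_jobs=2))  # [0, 2, 4]
```

Both paths return the same result; the serial path is the one to use while debugging.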
Parallelization: Limiting memory usage 
Send out/return jobs in chunks 
def imap_easy(func, iterable, n_jobs, chunksize, ordered=True): 
    if n_jobs == 1: 
        results_iter = itertools.imap(func, iterable) 
    else: 
        _trypickle(func) 
        pool = Pool(n_jobs) 
        if ordered: 
            results_iter = pool.imap(func, iterable, chunksize=chunksize) 
        else: 
            results_iter = pool.imap_unordered( 
                func, iterable, chunksize=chunksize) 
    return results_iter 
Note: exit with Ctrl-C is more difficult. See rosetta.parallel 
(IDSE) 41 / 52
Making things faster: Summary 
Use regular expressions 
Use the right data structure 
I NumPy/Pandas for numbers (use built-in functions/numba/cython, NOT for loops) 
I sets if you will test item in my_set 
Profile your code 
I time python myscript.py 
I %timeit in IPython 
I line_profiler (using kernprof.py) 
Use multiprocessing.Pool 
A number of the Rosetta streamer methods have multiprocessing built in (see rosetta.text.streamers) 
NOTE: the examples above are in Python for convenience but are relevant in many (most) other scenarios 
(IDSE) 42 / 52
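As a stdlib-only complement to line_profiler, whole-program profiling can be sketched with cProfile (the functions profiled here are toy examples of my own, not rosetta code):

```python
import cProfile
import io
import pstats

def tokenize(text):
    return text.lower().split()

def count_tokens(docs):
    counts = {}
    for doc in docs:
        for tok in tokenize(doc):
            counts[tok] = counts.get(tok, 0) + 1
    return counts

docs = ["Soviet Moscow October joint"] * 1000

profiler = cProfile.Profile()
profiler.enable()
counts = count_tokens(docs)
profiler.disable()

# Report the 5 most expensive calls by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

line_profiler then drills down to per-line cost inside whichever function cProfile flags as hot.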
Outline 
1 Introduction 
2 Dealing with Limited Memory 
3 Making Things Faster! 
4 Stepping Outside of Python (time permitting) 
(IDSE) 43 / 52
LDA (in a slide) 
Latent Dirichlet Allocation, by Blei, Ng and Jordan, is a hierarchical Bayesian model which describes the underlying semantic structure of a document corpus via a set of latent distributions over the vocabulary. 
The latent semantic distributions are referred to as “topics.” 
Each document is modeled as a mixture of these topics. 
I the number of topics is chosen a priori. 
Words in a document are drawn by 
I choosing a topic, given the document's mixture weights, 
I sampling from that topic. 
Hyperparameters: 
I lda_alpha: prior which controls the topic probabilities/weights. 
F e.g. lda_alpha = 0.1: θ_d ∼ Dirichlet(lda_alpha) 
I lda_rho: prior which controls the word probabilities. 
F e.g. lda_rho = 0.1: φ_k ∼ Dirichlet(lda_rho) 
(IDSE) 44 / 52
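The generative story above can be sketched with the standard library alone (the vocabulary and sizes are toy values of my own; symmetric Dirichlet draws are produced by normalizing Gamma samples):

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Length-k probability vector from a symmetric Dirichlet(alpha)."""
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(gammas)
    return [g / total for g in gammas]

def sample_index(probs, rng):
    """Draw an index according to the given probabilities."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

rng = random.Random(42)
vocab = ["soviet", "moscow", "october", "joint"]
n_topics = 3

# One word distribution per topic (prior lda_rho = 0.1) ...
topics = [sample_dirichlet(0.1, len(vocab), rng) for _ in range(n_topics)]
# ... and one topic mixture for a document (prior lda_alpha = 0.1).
theta = sample_dirichlet(0.1, n_topics, rng)

# Generate a 10-word document: choose a topic, then sample a word.
doc = [vocab[sample_index(topics[sample_index(theta, rng)], rng)]
       for _ in range(10)]
print(doc)
```

Small alpha values concentrate the Dirichlet mass, so each document leans on a few topics and each topic on a few words.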
Vowpal Wabbit: What/Why 
Can you build a topic model with 1,000,000 documents using gensim? 
Sure. . . if you have 10 hours or so to kill 
Better solution: Vowpal Wabbit 
Online stochastic gradient descent ⇒ memory independent, optimal for huge data sets 
Highly optimized C++ ⇒ fast 
However. . . the interface is CLI and the input/output files are not very usable 
(IDSE) 45 / 52
Vowpal Wabbit: Python to the rescue 
Principles: 
Make getting data into/out of VW easy 
Don’t wrap the VW CLI (or if you do, use the subprocess module to make calls, not os.system) 
# Convert text files in a directory structure to vw format 
stream = TextFileStreamer( 
    text_base_path='bodyfiles', tokenizer=my_tokenizer) 
stream.to_vw('myfiles.vw', n_jobs=-1) 
# Explore token counts and filter tokens in a DataFrame 
sff = SFileFilter(VWFormatter()).load_sfile('myfiles.vw') 
# Create a DataFrame representation 
df = sff.to_frame() 
(IDSE) 46 / 52
Vowpal Wabbit: Python to the rescue 
# Create a filtered version of your sparse file 
sff.filter_extremes(doc_freq_min=5, doc_fraction_max=0.8) 
sff.compactify().filter_sfile('myfiles.vw', 'myfiles_filtered.vw') 
# Back to bash to run VW 
vw --lda 5 --cache_file ddrs.cache --passes 10 -p prediction.dat \ 
   --readable_model topics.dat --bit_precision 16 myfiles_filtered.vw 
# Look at the results in DataFrames 
lda = LDAResults( 
    'topics.dat', 'prediction.dat', num_topics, sff) 
lda.print_topics() 
See rosetta/examples/vw_helpers.md 
(IDSE) 47 / 52
Steps with VW 
Step 1: Convert files to VW input format 
1 0000BC34| saying:1 antunes:4 goncalves:3 scientist:1 ... 
1 0000C1AE| shot:1 help:1 september:2 luxembourg:1 ... 
1 0000BBA7| raised:1 chinese:1 winston:1 authority:1 ... 
Step 2: View the tokens in a DataFrame 
        doc_freq 
tokens 
war           58 
china         77 
... 
Step 3: Filter tokens and hash them 
1 0000BC34| 3423211:1 111:4 43454:3 989794:1 ... 
1 0000C1AE| 338:1 3123:1 19393:2 3232321:1 ... 
1 0000BBA7| 1191:1 69830:1 398:1 974949:1 ... 
(IDSE) 48 / 52
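The Step 1 format is easy to produce by hand; a minimal sketch (the helper name is mine, not rosetta's API):

```python
from collections import Counter

def to_vw_line(doc_id, tokens):
    """One document per line: '<importance> <doc_id>| tok:count ...'"""
    counts = Counter(tokens)
    feats = " ".join("%s:%d" % (tok, n) for tok, n in sorted(counts.items()))
    return "1 %s| %s" % (doc_id, feats)

print(to_vw_line("0000BC34", ["saying", "antunes", "antunes", "scientist"]))
# 1 0000BC34| antunes:2 saying:1 scientist:1
```

Writing one such line per document, streamed from disk, is all `stream.to_vw` needs to do, which is why the conversion parallelizes so easily.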
Steps with VW 
Step 4: Run VW 
vw --lda 5 --cache_file ddrs.cache --passes 10 ... 
Step 5: View the results 
        topic_0  topic_1 
tokens 
war         0.2      0.8 
china       0.4      0.6 
See rosetta/examples/vw_helpers.md 
(IDSE) 49 / 52
Summary 
Pay attention to memory 
Pay attention to data structures 
Profile for performance 
Parallelization is easy for many text-processing tasks 
Use Python to make stepping outside the Python world easier 
Also, don’t forget the CLI and UNIX 
(IDSE) 50 / 52
Bibliography 
M. Connelly et al., Declassification Engine. Ongoing project at Columbia University. 
http://www.declassification-engine.org/ 
https://github.com/declassengine/declass 
D. Krasner and I. Langmore, Applied Data Science, lecture notes. 
http://columbia-applied-data-science.github.io/appdatasci.pdf 
The Rosetta team, Tools for data science with a focus on text processing. 
https://github.com/columbia-applied-data-science/rosetta 
Clone, submit issues on github, fork, contribute! 
(IDSE) 51 / 52
THANK YOU! 
contact: daniel@kfitsolutions.com 
Rosetta: https://github.com/columbia-applied-data-science/rosetta 
Open source Python text processing library. 
Feel free to use, fork, submit issues and contribute! 
(IDSE) 52 / 52