Duke COMPSCI 590.6
Final Project Report
Minwoo Kim, Lucy Zhang
Spring 2017
1 Introduction
Social networks contain an enormous amount of data that offers insight into a number of problems and questions. Facebook alone processes 2.5 billion pieces of content and more than 500 terabytes of data each day. Similarly, Twitter has millions of users that follow each other and form a large, complex network. We retrieve data from Twitter in real time, tokenize and tag the text, and summarize the results of the tweet analysis through visualization in order to concisely show what is happening online. The project consists of two central parts: crawling the web for the tweets and enough text to warrant the use of parallel algorithms, and the analysis of each tweet through a Hidden Markov Model (HMM). Currently there are only a select few applications of parallel text mining for large-scale text processing; one study created a parallel text mining framework implemented using the Message Passing Interface (MPI) [1].
2 Problem description
The purpose of this project is to scrape, crawl, and analyze tweets in real time and in parallel. The source of the tweets, and the main focus of the project, is major news outlets: the New York Times, the Washington Post, the BBC, Yahoo News, and the Wall Street Journal. In order to narrow the broad scope of tweet information and enable a more consistent comparison, only tweets from the official Twitter accounts of these sources are tracked.
Because the Twitter REST API is rate-limited to 180 calls every 15 minutes, the project instead relies on Twitter's public streaming API, which has no such limit. Each connection thus processes an ongoing stream of tweets in a separate process using the multicore programming paradigm. Because an official Twitter account's tweet often comes in conjunction with a link to the article itself, it was also important to take into account the external link that provided the article in full. Part of the problem was to analyze the bias of a tweet, so the expanded article provided critical comparison and insight into what the tweet expressed versus the article as a whole. In order to obtain the article text in full, another
HTTP request was required, as well as parsing of the HTML DOM structure in parallel, such that only the relevant article text (and not sidebar HTML, stray JavaScript, etc.) was recorded.
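As an illustration, one streaming connection per account can be opened with the Tweepy wrapper. The following is a minimal sketch assuming Tweepy's streaming interface of that era; the credentials, the handle_tweet routine, and the example user ID are placeholders, not the project's actual values.

```python
import tweepy

class NewsListener(tweepy.StreamListener):
    def on_status(self, status):
        # Hand off the tweet text and any expanded article URLs for analysis.
        urls = [u['expanded_url'] for u in status.entities.get('urls', [])]
        handle_tweet(status.text, urls)      # hypothetical handoff routine

    def on_error(self, status_code):
        return status_code != 420            # disconnect if rate-limited

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)   # placeholder keys
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
stream = tweepy.Stream(auth=auth, listener=NewsListener())
stream.filter(follow=['807095'])             # numeric user ID, e.g. @nytimes
```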
Once we have crawled the tweets, we use an HMM, a conventional model for part-of-speech (POS) tagging. An HMM is a statistical Markov model that can be conceptually described using the Bayesian network framework, as in the following figure.
Figure 1: First-Order Discrete-State, Discrete-Time Hidden Markov Model
In our specific context, we choose a first-order, discrete-state, homogeneous discrete-time Hidden Markov Model. To explain this, let $\Omega$ be a set of $T$ distinct discrete (and hidden) states $S_1, S_2, \ldots, S_T$ for a discrete-time Markov chain model, which are not observable. Also, let $\Sigma$ be a set of $M$ distinct possible observation symbols $U_1, U_2, \ldots, U_M$, which are directly observable and depend on the unknown states of the model. Due to the first-order Markov assumption and the time-homogeneity property, the transitions between states depend only on the previous state; mathematically,

$$P(t_i = S_{k(i)} \mid t_{i-1} = S_{k(i-1)}, t_{i-2} = S_{k(i-2)}, \ldots, t_0 = S_{k(0)}) = P(t_i = S_{k(i)} \mid t_{i-1} = S_{k(i-1)}) = P(t_1 = S_{k(i)} \mid t_0 = S_{k(i-1)})$$

for time index $i \geq 1$. Observations depend directly on the current state, and are conditionally independent of the other states and observations once that state is given:

$$P(w_i \mid w_{i-1}, \ldots, w_0, t_i = S_{k(i)}, \ldots, t_0 = S_{k(0)}) = P(w_i \mid t_i = S_{k(i)}).$$
The linearized representation of the HMM in the figure above is not intended to imply that there is only one possible transition out of a given state. Rather, it is a conceptual representation of the sequential realization of the states at each discrete time index. For our tweet analysis problem, we can regard each tweet as one realization of this HMM.
Now, suppose we are given a tweet consisting of a single sentence, "The building houses many people". Each word corresponds to a specific realization of the possible observation symbols, $w_0, w_1, \ldots, w_{n-1}$. The unknown grammatical tags for these words are identified with the hidden states, represented by the stochastic variables $t_0, t_1, \ldots, t_{n-1}$. Then, the goal of POS tagging is to predict the hidden states $t_0, t_1, \ldots, t_{n-1}$ in the maximum-likelihood
framework. That is, for the given tweet with word sequence $w_0, w_1, \ldots, w_{n-1}$, we want to maximize the conditional probability

$$\max_{(S_{k(0)}, \ldots, S_{k(n-1)}) \in \Omega^n} P(t_0 = S_{k(0)}, \ldots, t_{n-1} = S_{k(n-1)} \mid w_0, \ldots, w_{n-1}).$$

Since $P(w_0, \ldots, w_{n-1})$ does not depend on the tag sequence, this is equivalent to maximizing the joint probability

$$\max_{(S_{k(0)}, \ldots, S_{k(n-1)}) \in \Omega^n} P(t_0 = S_{k(0)}, \ldots, t_{n-1} = S_{k(n-1)}, w_0, \ldots, w_{n-1}).$$

The above optimization problem can be solved using the Viterbi algorithm [4, 5], an application of dynamic programming. To do this, first define the auxiliary functions $\delta_i : \Omega \to [0, 1]$, where

$$\delta_0(S) = P(t_0 = S, w_0) = P(t_0 = S)\, P(w_0 \mid t_0 = S),$$

$$\delta_i(S) = \max_{S_{k(0)}, \ldots, S_{k(i-1)}} P(t_i = S, t_0 = S_{k(0)}, \ldots, t_{i-1} = S_{k(i-1)}, w_0, \ldots, w_i) = P(w_i \mid t_i = S) \max_{S_{k(i-1)}} P(t_i = S \mid t_{i-1} = S_{k(i-1)})\, \delta_{i-1}(S_{k(i-1)}),$$

where the equalities hold due to our Markov assumptions. The optimization problem then becomes

$$\max_{(S_{k(0)}, \ldots, S_{k(n-1)}) \in \Omega^n} P(t_0 = S_{k(0)}, \ldots, t_{n-1} = S_{k(n-1)}, w_0, \ldots, w_{n-1}) = \max_{S \in \Omega} \delta_{n-1}(S),$$

and the algorithm iteratively computes the vector $\delta_i$ of length $T$. Thus, once we are given all the model parameters $\Pi, A, B$, where

$$\Pi(S) := P(t_0 = S), \quad S \in \Omega,$$
$$A(S_1, S_2) := P(t_i = S_2 \mid t_{i-1} = S_1) = P(t_1 = S_2 \mid t_0 = S_1), \quad S_1, S_2 \in \Omega,$$
$$B(S, U) := P(w_i = U \mid t_i = S), \quad S \in \Omega,\ U \in \Sigma,$$

we can solve the problem using the Viterbi algorithm. Once we have the maximum probability, we can easily retrieve the maximizing path (the sequence of grammatical tags) through backtracking, provided that for each state of the current word we store the maximizing state of the previous word, i.e., we create a path-tracking matrix $Q$ such that

$$Q(i, j) := \arg\max_{S \in \Omega} P(t_i = S_j \mid t_{i-1} = S)\, \delta_{i-1}(S).$$

As a result of the algorithm, we obtain a sequence of grammatical tags for the given tweet.
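For concreteness, the following is a minimal sketch of the Viterbi recursion and backtracking in Python/NumPy. It is an illustrative reimplementation, not the project's C++ code: pi, A, and B mirror the parameters $\Pi$, $A$, $B$ above, with tags and observation symbols encoded as integer indices. (In practice the recursion is usually run in log space to avoid underflow on long sentences.)

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely tag sequence for an integer-coded observation sequence.

    pi : (T,)   initial distribution, pi[s]    = P(t_0 = s)
    A  : (T, T) transition matrix,    A[s1,s2] = P(t_i = s2 | t_{i-1} = s1)
    B  : (T, M) emission matrix,      B[s, u]  = P(w_i = u | t_i = s)
    """
    n, T = len(obs), len(pi)
    delta = np.empty((n, T))          # delta[i, s]: best joint prob ending in tag s
    Q = np.zeros((n, T), dtype=int)   # Q[i, s]: maximizing previous tag (backpointers)
    delta[0] = pi * B[:, obs[0]]
    for i in range(1, n):
        scores = delta[i - 1][:, np.newaxis] * A   # (prev tag) x (current tag)
        Q[i] = scores.argmax(axis=0)               # vectorized max over all T prev tags
        delta[i] = scores.max(axis=0) * B[:, obs[i]]
    tags = np.empty(n, dtype=int)
    tags[-1] = delta[-1].argmax()                  # max_S delta_{n-1}(S)
    for i in range(n - 1, 0, -1):                  # backtrack through Q
        tags[i - 1] = Q[i, tags[i]]
    return tags
```

Note that the inner maximization over previous tags is a single vectorized operation across all $T$ states, which is the step the vectorized computation scheme discussed later exploits.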
After tagging all the tweets in the given batch, we then extract all the keywords in the tweets and record the occurrences of these words, which are visually summarized using a word cloud.
Note that this entire process occurs in real time: tweets are being crawled and text is being scraped while other tweets are being analyzed. The analysis is performed on each individual tweet and, in the case of the corresponding article, on each individual sentence. In order to optimize and parallelize this whole process, we used MPI as the parallel architecture on a cluster of 30 machines, each of which has 16 cores and is capable of using MPI.
In particular, the handoff between the crawling process and the analysis process uses MPI to pass the text. Furthermore, instead of one tweet or sentence being processed at a time, they are processed in batches in order to minimize the overhead of piping the text from the Python code (used for API calls) to the C++ code.
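A minimal sketch of this batched handoff using mpi4py is given below; the rank layout, the crawl_tweets generator, and the viterbi_tag routine are illustrative placeholders rather than the project's actual code.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
BATCH_SIZE = 200          # tweets per message, amortizing the MPI overhead

if rank == 0:
    # Crawler rank: accumulate streamed tweets and ship them in batches.
    batch = []
    for text in crawl_tweets():              # hypothetical tweet generator
        batch.append(text)
        if len(batch) == BATCH_SIZE:
            comm.send(batch, dest=1, tag=0)  # pickled list of strings
            batch = []
else:
    # Analysis rank: POS-tag whole batches while crawling continues.
    while True:
        batch = comm.recv(source=0, tag=0)
        for sentence in batch:
            tags = viterbi_tag(sentence)     # hypothetical tagging routine
```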
3 Expected achievements
The main strategy for tackling the problem is dividing the tasks and distributing them to machines in the Linux cluster using MPI. The overall running time is thus expected to be lower than that of a single-machine implementation by a factor of approximately 30 (in the ideal case), which is the number of machines in the Linux cluster. Also, multiprocessing the crawling and tagging within each node of the network should further improve the time performance. With a vectorized computation scheme in the Viterbi algorithm, although the algorithm itself is sequential by nature (due to the iterative updates of the auxiliary vectors $\delta_i$), we should be able to reduce the time for POS tagging by a factor of $T$ (where $T = |\Omega|$ is the number of different tag types) in the ideal case, where the overhead of worker scheduling is assumed to be negligible.
4 Bottleneck identiļ¬cation
The bottleneck in crawling the text comes from several sources. The first is the Twitter API call, which can only retrieve the most recent $N$ tweets; the next API call then requires the ID of the most recently retrieved tweet in order to fetch the next set of tweets. This data dependency makes it difficult to parallelize. Furthermore, the more tweets the program retrieves, the older the tweets are and the longer it takes to retrieve them from the Twitter servers. Each process must therefore wait after retrieving its first $N$ tweets before continuing to backtrack through older tweets.
The second bottleneck comes from the newspaper article web scraping, which issues HTTP requests against the media sources' own sites. The time taken by each request also depends on the servers and rate limits of the individual news sources. For tweets that do not carry links to external media sources, this bottleneck is not an issue. However, for tweets that do have external links to their respective media source, requesting the article text takes significantly longer than getting the tweet itself. As a result, the program must wait for the article to be processed if it exists. This bottleneck was minimized by incorporating multithreading, allowing the program to request multiple URLs at once while keeping the order of the URLs and responses stable and returning the same requests with the response variables filled.
In the analysis part, the bottleneck comes from the Viterbi algorithm for POS tagging. Let $T$ be the number of different tag types used for POS tagging, and let $N$ denote the number of words in a given sentence, which is random. Then the Viterbi algorithm takes $O(nT^2)$ time for given $N = n$.
5 Parallel solution
There are several parallel components in this project. The first lies in the crawling of tweets. We define a stream as a process in which the Twitter API is called for a given Twitter account. In order to scrape tweets for the desired media sources, multicore programming is used such that one core is responsible for each stream. Furthermore, within each stream we divide the work further among cores, batching the tweets returned by the API such that each core is responsible for one batch of tweets.

Figure 2: Overview of the crawling process.
Because many Twitter media accounts often post links to news articles within their tweets, we also crawled the news article text, which averages between 500 and 800 words, or 25 to 40 sentences. The text analysis algorithm parses text sentence by sentence, so these news articles and their HTML DOM structure had to be parsed into a readable format. This was accomplished through modification of a newspaper module that scrapes the main body of an article from the HTML; it works by downloading articles from news sources and "building" the paper. However, downloading articles one at a time is slow, and spamming a single news source such as the NY Times can trigger rate limiting and is overall not very courteous. We solve this problem by allocating one thread to each news source (NYTimes, BBC, etc.) to speed up the download time. We create a Worker thread that executes tasks from a given task queue. Then, in the thread pool, we can add tasks that are executed on separate threads that are later joined. We also create a NewsPool class that accepts any number of source or article objects in a list, allocates one thread to each source (i.e., 5 threads for 5 sources), and returns when all threads have joined.
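A minimal sketch of this Worker/NewsPool pattern is shown below; the class and method names mirror our description, and source.download_articles is a placeholder for the per-source download routine.

```python
import threading
from queue import Queue

class Worker(threading.Thread):
    """Executes tasks pulled from a shared queue until the queue is drained."""
    def __init__(self, tasks):
        super().__init__(daemon=True)
        self.tasks = tasks
        self.start()

    def run(self):
        while True:
            func, args = self.tasks.get()
            try:
                func(*args)
            finally:
                self.tasks.task_done()

class NewsPool:
    """Allocates one worker thread per news source (e.g. 5 sources, 5 threads)."""
    def __init__(self, num_threads):
        self.tasks = Queue()
        self.workers = [Worker(self.tasks) for _ in range(num_threads)]

    def set(self, sources):
        for source in sources:
            # download_articles is a placeholder for the routine that
            # fetches and parses every article of that outlet.
            self.tasks.put((source.download_articles, ()))

    def join(self):
        self.tasks.join()   # returns once every queued download has finished
```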
Figure 3: Simplified diagram of the tweet crawling process

An important distinction between these parallel implementations is the choice of processes versus threads. Intuitively, the threading module uses threads and the multiprocessing module uses processes. Threads run in the same memory
space, while processes have separate memory. Thus, the multiprocessing implementation is used for independent work, specifically the tweets for the individual media accounts, which have no dependence on one another. Furthermore, if one process crashes, perhaps due to an unpredictable error in the API call, it does not bring down the other processes, and tweets keep coming in. However, within each account, each call to the Twitter API can only fetch the most recent 200 tweets, so there is a dependence on the ID of the most recently retrieved tweet in order to determine which set of 200 tweets to retrieve next.
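A sketch of this process-per-account layout follows; the account handles are illustrative, and run_stream stands in for the actual streaming routine.

```python
import multiprocessing

# One stream process per tracked account.
ACCOUNTS = ['nytimes', 'washingtonpost', 'BBCWorld', 'YahooNews', 'WSJ']

def run_stream(screen_name):
    # Placeholder: open a streaming connection for this account and crawl
    # indefinitely. A crash here (e.g. an unpredictable API error) does not
    # bring down the sibling processes; their tweets keep coming in.
    print('streaming @%s' % screen_name)

if __name__ == '__main__':
    procs = [multiprocessing.Process(target=run_stream, args=(name,))
             for name in ACCOUNTS]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```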
For threading, we want to retrieve articles from the same news sources. Threading allows us to avoid downloading from the individual news sources such as the NY Times, the Washington Post, etc. one at a time; instead we compile them into a set with a thread allocated to each, so that a download call is issued for every article across all of the sources. Threading has the advantage of being lightweight, with a low memory footprint and shared memory.
6 Complexity
Let $T_C$ denote the time for crawling tweets, and $T_A$ the time for analyzing them. If we implemented the code on a single machine without using MPI, the time complexity per batch of tweets would be $T_1(n) = T_C(n) + T_A(n)$ given $N = n$, where $n$ is the total number of words in the batch and $T_A(n) = O(nT^2)$ is dominated by the Viterbi algorithm. Now, suppose we have $l$ cores in each machine, and we use 5 machines for crawling and $5 \times 5$ machines for analysis, as in Figure 2. Since in our parallel architecture the analysis is ongoing while the next batch of tweets is being crawled, the time complexity per batch is approximately

$$T_{30l}(n) \approx \max\left\{ \frac{1}{5l} T_C(n),\ \frac{1}{25l} T_A(n) \right\}.$$

Due to the API rate limit, an infinite amount of resources does not necessarily lead to an improvement in the time complexity.

Figure 4: Expanded version of individual tweet batches and articles
7 Experiments
# machines for crawling    # machines for analysis    Average time per line (sec)
1                          1                          0.0442
5                          25                         0.002110
5                          27                         0.001059
5                          23                         0.000978
5                          20                         0.00145

Table 1: Experimental data for time performance after running the program on the first group of tweets
We expected an imbalance between crawling and analysis when varying the number of machines used for each part of the program. For the single-machine experimental time, the program utilized only a single machine, and all of the parallelization was solely through multicore paradigms.
There was an improvement in the time per tweet as the number of machines used for analysis increased. However, once the number of machines reached the twenties, the results varied. One factor to take into account is communication overhead: every time one process communicates with others, it incurs the cost of creating and sending the message, and in the case of a synchronous communication routine there is also the cost of waiting for the other processes to receive the message. Another issue is load balancing. Task distribution may not have been optimal, given that the number of machines for crawling had to remain at 5, one for each news media source.
In the sequential version of the code, there is a large gap between the first and second records of data. This is likely because the first API call, which fetches a batch of around 200 tweets, has completed and a new call is being made to the API; the 27.5061 seconds between the two records is likely how long the GET request to the Twitter API takes. However, there is no such significant gap in the parallel version. While the API will always be a bottleneck, we are always receiving tweets: if one process is calling the API and hitting the bottleneck, another process is still crawling text, so there is no downtime in which nothing happens. Because the parallelism minimizes this backlog of waiting on API requests, and multiple processes are spawned for different streams, the parallel version is able to crawl orders of magnitude more text than the sequential version.
Figure 5: NYT word cloud
Figure 6: Washington Post word cloud
8 Discussion
8.1 Goals achieved
We were able to implement a tweet crawler and analyzer that communicate with each other. We were also able to develop a practical algorithm for distributing the work of tweet processing across different machines and cores. While there are still areas we could further optimize for speed, we managed to produce a functional product that yielded visible results. It also has practical applications in text and data analysis and shows a promising approach to dealing with the massive amounts of data that reside on Twitter (and social media in general) today.
8.2 Lessons
We realized that solving practical problems requires more than studying the theoretical background behind the scenes. One of the main difficulties we faced was gathering the data.
Figure 7: BBC word cloud
Figure 8: WSJ word cloud
In order to crawl a massive amount of live tweet data, we needed to circumvent the API rate limit by utilizing the streaming API, automating the generation of access tokens, and optimizing API calls so that no scraped tweet is wasted.
Meanwhile, analyzing the tweets to get the desired, meaningful results requires pre-trained classifiers of good quality, which means we also have to be able to gather a correct, large amount of available training data that is well suited to our problem and can guarantee the quality of our classifiers. This is one of the reasons we chose to use Python along with C++: Python has a good library called NLTK, which provides a large number of tagged sentences from different corpora. Still, this was not a perfect solution, because the data is inflexible when it comes to the choice of tag categories: the sentences are already tagged, and we cannot make the tag categories finer for our purposes. If we wanted to manipulate and redefine the tag categories, we would have to obtain other training data that follows our own definitions and rules for the categories, but this is generally impossible unless we ourselves spend a couple of years manually tagging all the sentences
for our research.

Figure 9: YahooNews word cloud
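For instance, the transition and emission counts needed for the parameters $\Pi$, $A$, and $B$ can be accumulated from NLTK's pre-tagged corpora; a minimal sketch using the Brown corpus:

```python
from collections import Counter
import nltk

nltk.download('brown')                       # one-time corpus download
from nltk.corpus import brown

transitions, emissions, initial = Counter(), Counter(), Counter()
for sent in brown.tagged_sents():            # list of (word, tag) pairs
    tags = [tag for _, tag in sent]
    initial[tags[0]] += 1
    transitions.update(zip(tags, tags[1:]))  # (t_{i-1}, t_i) pairs
    emissions.update((tag, word.lower()) for word, tag in sent)
# Normalizing these counts yields the Pi, A, B parameters used by Viterbi.
```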
Overhead was also a significant factor in performance. In the context of parallel computing, this generally refers to the unwanted time spent distributing and scheduling the work among different workers. Even if there is a sufficient amount of resources, it may be better to limit the number of workers, especially for simple tasks. In an MPI architecture, this can be interpreted as follows: minimizing the communication between nodes as much as possible is best in terms of time performance. MPI is an optimal choice for problems whose subproblems can be completely independent from each other.
When combining two or more different languages, specifically Python and C++ in our case, it is important to minimize the number of times text is piped between Python and C++. For each call to Python code from C++, we have to start a Python interpreter and shut it down again, which contributes a large amount of time to the total. In places where Python code calls C++ code to pipe information using the subprocess module, however, process creation does not typically create much overhead: the startup time involved in creating the process is likely an order of magnitude less than the time the new program takes to do the work. However, it is a different story when process creation occurs for a large number of parallel children.
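For example, piping one whole batch per invocation, rather than one tweet per invocation, amortizes that startup cost. A sketch, where ./tagger is a hypothetical stand-in for the compiled C++ analysis binary:

```python
import subprocess

def tag_batch(sentences):
    """Pipe a whole batch through one invocation of the C++ tagger.

    './tagger' is assumed to read one sentence per line on stdin and
    write one tagged line per sentence on stdout.
    """
    proc = subprocess.run(['./tagger'],
                          input='\n'.join(sentences),
                          stdout=subprocess.PIPE,
                          universal_newlines=True,
                          check=True)
    return proc.stdout.splitlines()
```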
However, within the scope of this project, it would be infeasible to reimplement in C++ many of the already optimized modules that Python provides. The code relies on the Tweepy Python wrapper for Twitter API calls, a number of NLTK APIs, and other important Python packages, which makes it infeasible to implement everything with an optimized runtime in C++.
8.3 Limitations
Natural language processing is a field of study where a subtle difference in approach and methodology makes a large difference in the result, so many delicate, elaborate final touches are needed to get a better result from the analysis. For example, when tagging words based on the predefined tag categories, if we put common nouns and proper nouns into the same category, we might end up with a model that does not work properly. Specifically, for a word such as "house", which can be used both as a verb and as a (common) noun, the resulting model may classify "house" exclusively as a noun. The reason is that finding a Viterbi path in the given Hidden Markov Model depends on the transition probabilities between tags: if we put common nouns and proper nouns together, the transition probability from noun to noun becomes higher than from verb to noun, because proper nouns are often used in conjunction with other proper nouns nearby. That is, the accuracy of the result depends heavily on how we define the tag categories.
Another issue related to result accuracy is training. Because the text analysis uses machine learning, the quality of the result also depends on the quality of the data used to train the models. It would be ideal if every tweet had the very regular structure of an English sentence, with standard grammar, correct placement of punctuation and capital letters, and no typographical errors, so that we would not have to worry about "noise", which in our context can be defined as all irregular text, including misused grammar and typographical errors. However, since tweets are irregular text, and we have only limited resources of "structured" text corpora from books that can be used in training (we used pre-tagged sentences from the NLTK library in Python), we cannot say that these limited training corpora are representative or instructive, considering that our model targets "unstructured" tweet data. That being said, in our project the accuracy of the result depends heavily on these kinds of "final touches" and on "the quality of the training data", beyond the ability and quality of the algorithm itself. With more time, the results could be further refined.
With regard to crawling, the limitations were largely API request rate limits and the time requests take. The time of an average HTTP request depends on many factors. For the Twitter API calls, the average response time is slower for older fetched data. According to Twitter's developer documentation API status page (Figure 10), the performance of the /1.1/search/tweets service at the time of writing was 1565 ms, with a status of "service disruption". The unpredictability of the Twitter servers plays a huge role in the number of tweets that can be analyzed.

Figure 10: Twitter Developer Documentation on API Status on 4-15-17
Another unanticipated limitation was Twitter's monitoring of unusual API usage: one Twitter account used for testing was terminated due to Twitter's detection of "suspicious activity".
8.4 Other Improvements
Ultimately, we could do a better job of keeping all processors and cores busy at all times by parallelizing more intensive tasks such as parsing, sorting, counting, etc., possibly with multithreading. However, it should be noted that increasing the number of threads beyond the number of cores will increase overhead. We would also need to keep in mind that accessing the hard disk from multiple threads in parallel can dramatically hurt performance, as random access is much slower than sequential access. There are also a few instances where the program needs to wait for resources (specifically, the API calls for article text as well
as batches of tweets). While a process is waiting for such a server call, even though other processes are still getting tweets and thereby ensuring that the entire program is doing something, we could employ asynchronous elements somewhat reminiscent of JavaScript, where API calls are asynchronous and use callbacks on the API response. There are several threaded and non-blocking I/O networking libraries that would be suitable for this purpose.
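For instance, with a non-blocking I/O library such as aiohttp, all article fetches could overlap on a single thread; a sketch (the URLs are placeholders):

```python
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()

async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        # Fire all requests at once; results come back in the original order.
        return await asyncio.gather(*(fetch(session, u) for u in urls))

articles = asyncio.get_event_loop().run_until_complete(
    fetch_all(['https://www.nytimes.com/...', 'https://www.bbc.com/...']))
```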
Another approach is to use MapReduce, which could easily perform a text analysis function on a large data set. A group at Stanford including Andrew Ng published a paper on MapReduce for machine learning on multicore [2], in which they demonstrated linear speedup of machine learning algorithms with the number of processors.
A less intuitive (but not necessarily less effective) approach to parallelizing the text mining is the combined use of GPU and CPU, a form of hybrid parallelization. A study at North Carolina State University [3] demonstrated term frequency calculations done by processing files in batches and assigning each block to process a document stream: the GPU generates document hash tables while the CPU prefetches the next batch of files from disk. For the Twitter project, with large enough datasets, the GPU could compute batched text files of tweets and their corresponding article text. The limitation, of course, of doing a computation like term frequency is that all required data must be copied into GPU memory.
References
[1] Tekiner, F., Tsuruoka, Y., Tsujii, J. I., Ananiadou, S., & Keane, J. (2010). Parallel text mining for large text processing. In Proceedings of IEEE CSNDSP 2010 (pp. 348-353).
[2] Chu, C. T., Kim, S. K., Lin, Y. A., Yu, Y., Bradski, G., Ng, A. Y., & Olukotun, K. (2006, December). Map-reduce for machine learning on multicore. In NIPS (Vol. 6, pp. 281-288).
[3] Zhang, Y., Mueller, F., Cui, X., & Potok, T. (2009, March). GPU-accelerated text mining. In Workshop on Exploiting Parallelism using GPUs and Other Hardware-Assisted Methods (pp. 1-6).
[4] Viterbi, A. J. (1967, April). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, IT-13, 260-269.
[5] Forney, G. D. (1973, March). The Viterbi algorithm. Proceedings of the IEEE, 61, 268-278.