Duke COMPSCI 590.6
Final Project Report
Minwoo Kim Lucy Zhang
Spring 2017
1 Introduction
Social networks contain an enormous amount of data that offers insights into a number of
problems and questions. Facebook alone processes 2.5 billion pieces of content and more
than 500 terabytes of data each day. Similarly, Twitter has millions of users that follow
each other and form a large and complex network. We plan to retrieve data from Twitter
in real time, tokenize and tag it, and summarize the results of the tweet analysis through
visualization in order to concisely show what is happening online. The project consists of
two central parts: crawling the web for the tweets and enough text to warrant the use of
parallel algorithms, and the analysis of each tweet through a Hidden Markov Model (HMM).
Currently there are only a select few applications of parallel text mining for large text
processing; one study created a parallel text mining framework implemented using the
Message Passing Interface (MPI) [1].
2 Problem description
The purpose of this project is to scrape, crawl, and analyze tweets in real time and in
parallel. The source of the tweets and the main focus of the project is major news outlets:
the New York Times, the Washington Post, the BBC, Yahoo News, and the Wall Street
Journal. In order to narrow down the broad scope of tweet information and enable a more
consistent comparison, only tweets from the official Twitter accounts of these sources
are tracked.
Because the Twitter REST API is rate-limited to 180 calls every 15 minutes, the project
instead relies on Twitter's public streaming API, which has no such limit. Each connection
thus processes an ongoing stream of tweets in a separate process using the multicore
programming paradigm. Because an official Twitter account's tweet often comes with a link
to the article itself, it was also important to take into account the external link that
provided the article in full. Part of the goal was to analyze the bias of a tweet, so the
expanded article provided critical comparison and insight into what the tweet expressed
versus the article as a whole. In order to obtain the article text in full, another
HTTP request was required, as well as parsing of the HTML DOM structure in parallel so
that only relevant article text (and not sidebar HTML, stray JavaScript, etc.) was recorded.
Once we crawl tweets, we use an HMM, a conventional model for Part-of-Speech (POS)
tagging. An HMM is a statistical Markov model that can be described conceptually using
the Bayesian network framework, as in the following figure.
Figure 1: First-Order Discrete-State, Discrete-Time Hidden Markov Model
In our specific context, we choose a first-order, discrete-state, homogeneous
discrete-time Hidden Markov Model. To explain this, let there be a set of T distinct
discrete (and hidden) states S_1, S_2, ..., S_T for a discrete-time Markov chain, which
are not observable. Also, let there be a set of M distinct possible observation symbols
U_1, U_2, ..., U_M, which are directly observable and depend on the unknown states of the
model. Due to the first-order Markov assumption and the time-homogeneity property, the
transition to a state depends only on the previous state; mathematically,
P(t_i = S_k(i) | t_{i-1} = S_k(i-1), t_{i-2} = S_k(i-2), ..., t_0 = S_k(0))
  = P(t_i = S_k(i) | t_{i-1} = S_k(i-1)) = P(t_1 = S_k(i) | t_0 = S_k(i-1))
for time index i ≥ 1. Observations depend directly on the current state, and are
independent of the other states and observations once that state is given:

P(w_i | w_{i-1}, ..., w_0, t_i = S_k(i), t_{i-1} = S_k(i-1), ..., t_0 = S_k(0)) = P(w_i | t_i = S_k(i))
The linearized representation of the HMM in the figure above is not intended to imply that
there is only one possible transition from a given state. Rather, it is a conceptual
representation of the sequential realization of the states at each discrete time index.
For our tweet analysis problem, we can regard each tweet as one realization of this HMM.
Now, suppose we are given a tweet consisting of a single sentence, "The building houses
many people". Each word corresponds to a specific realization of the possible observation
symbols, w_0, w_1, ..., w_{n-1}. The unknown grammatical tags for these words are
identified with the hidden states for these words, represented by the stochastic variables
t_0, t_1, ..., t_{n-1}. Then, the goal of POS tagging is to predict the hidden states
t_0, t_1, ..., t_{n-1} in the maximum likelihood
framework. That is, for a given tweet with word sequence w_0, w_1, ..., w_{n-1}, we want
to maximize the conditional probability:

max_{(S_k(0), ..., S_k(n-1)) ∈ Ω^n} P(t_0 = S_k(0), ..., t_{n-1} = S_k(n-1) | w_0, ..., w_{n-1})
  = max_{(S_k(0), ..., S_k(n-1)) ∈ Ω^n} P(t_0 = S_k(0), ..., t_{n-1} = S_k(n-1), w_0, ..., w_{n-1}),

where the equality holds because the marginal P(w_0, ..., w_{n-1}) does not depend on the
tag sequence.
The above optimization problem can be solved using the Viterbi algorithm, an application
of dynamic programming. To do this, first define the auxiliary functions δ_i : Ω → [0, 1],
where

δ_0(S) = P(t_0 = S, w_0) = P(t_0 = S) P(w_0 | t_0 = S)
δ_i(S) = max_{S_k(0), ..., S_k(i-1)} P(t_i = S, t_0 = S_k(0), ..., t_{i-1} = S_k(i-1), w_0, ..., w_i)
       = P(w_i | t_i = S) max_{S_k(i-1)} P(t_i = S | t_{i-1} = S_k(i-1)) δ_{i-1}(S_k(i-1)),

and the equalities hold due to our Markov assumptions. Then, our optimization problem
now becomes
max_{(S_k(0), ..., S_k(n-1)) ∈ Ω^n} P(t_0 = S_k(0), ..., t_{n-1} = S_k(n-1), w_0, ..., w_{n-1}) = max_{S ∈ Ω} δ_{n-1}(S),

and the algorithm iteratively computes the vector δ_i of length T. Thus, once we are given
all the model parameters Π, A, B, where

Π(S) := P(t_0 = S), S ∈ Ω
A(S_1, S_2) := P(t_i = S_2 | t_{i-1} = S_1) = P(t_1 = S_2 | t_0 = S_1), S_1, S_2 ∈ Ω
B(S, U) := P(w_i = U | t_i = S), S ∈ Ω, U ∈ Σ,

we can solve the problem using the Viterbi algorithm. Once we have the maximum probability,
we can easily retrieve the maximizing path (the sequence of grammatical tags) through
backtracking, provided we store, for each state of the current word, the maximizing state
of the previous word, i.e., create a path-tracking matrix Q such that

Q(i, j) := argmax_{S ∈ Ω} P(t_i = S_j | t_{i-1} = S) δ_{i-1}(S)

As a result of the algorithm, we obtain a sequence of grammatical tags for the given tweet.
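The recurrence above can be sketched directly in code. The following is a minimal Python illustration of the Viterbi recursion and backtracking; the two-tag parameters Π, A, B are toy numbers chosen for illustration, not values estimated from a real corpus.

```python
def viterbi(words, states, Pi, A, B):
    """Most likely tag sequence for `words` under an HMM.

    Pi[s]     : P(t_0 = s)
    A[s1][s2] : P(t_i = s2 | t_{i-1} = s1)
    B[s][w]   : P(w_i = w | t_i = s)
    """
    # delta[s] = probability of the best partial path ending in state s
    delta = {s: Pi[s] * B[s].get(words[0], 0.0) for s in states}
    backptr = []  # the path-tracking matrix Q

    for w in words[1:]:
        q, new_delta = {}, {}
        for s in states:
            # max over previous states of transition prob * previous delta
            prev = max(states, key=lambda p: A[p][s] * delta[p])
            q[s] = prev
            new_delta[s] = B[s].get(w, 0.0) * A[prev][s] * delta[prev]
        backptr.append(q)
        delta = new_delta

    # Backtrack from the best final state to recover the tag sequence.
    best = max(states, key=lambda s: delta[s])
    path = [best]
    for q in reversed(backptr):
        path.append(q[path[-1]])
    return list(reversed(path))

# Toy two-tag model (hypothetical probabilities, for illustration only).
states = ["NOUN", "VERB"]
Pi = {"NOUN": 0.6, "VERB": 0.4}
A = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
     "VERB": {"NOUN": 0.8, "VERB": 0.2}}
B = {"NOUN": {"building": 0.4, "houses": 0.1, "people": 0.5},
     "VERB": {"building": 0.1, "houses": 0.8, "people": 0.1}}

print(viterbi(["building", "houses", "people"], states, Pi, A, B))
```

A production version would work in log probabilities to avoid floating-point underflow on long sentences.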
After tagging all the tweets in the given batch, we then extract all the keywords in each
tweet and record the occurrences of these words, which are visually summarized using a
word cloud. Note that this entire process occurs in real time: tweets are being crawled
and text is being scraped while other tweets are being analyzed. The analysis is performed
on each individual tweet and, in the case of the corresponding article, on each individual
sentence. To parallelize this whole process, we used MPI on a cluster of 30 machines, each
of which has 16 cores. The handoff from the crawling process to the analysis process uses
MPI to pass the text. Furthermore, instead of one tweet or sentence being processed at a
time, they are processed in batches in order to minimize the overhead of piping the text
from Python code (used for API calls) to C++ code.
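The keyword-extraction step that feeds the word clouds can be sketched with the standard library; the noun-only filter and the tag names below are illustrative assumptions, not the report's exact rules.

```python
from collections import Counter

# (word, tag) pairs as produced by the POS tagger for a batch of tweets.
tagged = [
    [("The", "DET"), ("building", "NOUN"), ("houses", "VERB"),
     ("many", "ADJ"), ("people", "NOUN")],
    [("People", "NOUN"), ("crowd", "VERB"), ("the", "DET"),
     ("building", "NOUN")],
]

# Keep nouns only and count occurrences case-insensitively;
# these counts are what the word cloud renders, sized by frequency.
counts = Counter(
    word.lower()
    for sentence in tagged
    for word, tag in sentence
    if tag == "NOUN"
)
print(counts.most_common(2))
```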
3 Expected achievements
The main strategy for tackling the problem is dividing the tasks and distributing them to
machines in the Linux cluster using MPI. Thus, the overall running time is expected to be
lower than that of a single-machine implementation by a factor of approximately 30 (in the
ideal case), which is equal to the number of machines in the Linux cluster. Also,
multiprocessing the crawling and tagging within each node will further improve the time
performance. With a vectorized computation scheme in the Viterbi algorithm, although the
meta-algorithm is itself sequential by nature (due to the iterative updates of the
auxiliary vectors δ_i), we would be able to reduce the time for POS tagging to 1/T of the
original (where T = |Ω| is the number of different tag types) in the ideal case, assuming
the overhead of worker scheduling is negligible.
4 Bottleneck identification
The bottleneck in crawling the text comes from several sources. The first is the Twitter
API call, which can only retrieve the first N tweets. The next API call then requires the
ID of the most recently retrieved tweet in order to fetch the next set of tweets. This
data dependency makes the pagination difficult to parallelize. Furthermore, the more
tweets the program retrieves, the older the tweets are and the longer it takes to retrieve
them from the Twitter servers. Each process must therefore wait after retrieving its first
N tweets before continuing to backtrack through older tweets.
The second bottleneck comes from scraping the newspaper articles, which uses HTTP requests
against the media sources' own sites. The time taken by each request also depends on the
servers and rate limits of the individual news sources. For tweets that do not link to
external media sources, this bottleneck is not an issue. However, for tweets that do carry
external links, requesting the article text takes significantly longer than getting the
tweet itself, so the program must wait for the article to be processed if it exists. This
bottleneck was minimized by incorporating multithreading, allowing the program to request
multiple URLs at once while keeping the order of the URLs and responses stable and
returning the same requests with the response variables filled.
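The ordered multithreaded requests can be sketched with Python's concurrent.futures: `executor.map` returns results in the same order as the input URLs even though the requests run concurrently. The `fetch` function is a stand-in for the real HTTP request, so the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for the real GET request to a news site; the actual
    # crawler would return the article HTML here.
    return "response for " + url

urls = [
    "https://www.nytimes.com/article-a",
    "https://www.bbc.com/article-b",
    "https://www.wsj.com/article-c",
]

# One thread per request; map() keeps the order of URLs and
# responses stable, matching the behaviour described above.
with ThreadPoolExecutor(max_workers=len(urls)) as executor:
    responses = list(executor.map(fetch, urls))

for url, resp in zip(urls, responses):
    print(url, "->", resp)
```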
In the analysis part, the bottleneck comes from the Viterbi algorithm for POS tagging. Let
T be the number of different tag types used for POS tagging, and let N denote the number
of words in a given sentence, which is random. Then the Viterbi algorithm takes O(nT^2)
time for a given N = n.
5 Parallel solution
There are several parallel components in this project. The first lies in the crawling of
tweets. We define a stream as a process in which the Twitter API is called for a given
Twitter account. In order to scrape tweets for the desired media sources, multicore
Figure 2: Overview of crawling process.
programming is used such that one core is responsible for each stream. Furthermore, within
each stream, we further divide the work among cores, batching the tweets returned by the
API such that each core is responsible for one batch of tweets.
Because many Twitter media accounts often post news article links within the tweet, we
also crawled the news article text, which averages between 500 and 800 words, or 25 to 40
sentences. The text analysis algorithm parses text sentence by sentence, so these news
articles and their HTML DOM structure had to be parsed into a readable format. This was
accomplished through modification of a newspaper module that scrapes the main body of an
article from the HTML. It works by downloading articles from news sources and "building"
the paper. However, downloading articles one at a time is slow, and spamming a single news
source such as the NY Times can trigger rate limiting and is overall not very courteous.
We solve this problem by allocating a thread to each news source (NYTimes, BBC, etc.) to
speed up the download time. We create a Worker thread that executes tasks from a given
task queue. Then, in the thread pool, we can add tasks that are executed on separate
threads that are later joined. We also create a NewsPool class that accepts any number of
source or article objects in a list, allocates one thread to each source (i.e., 5 threads
for 5 sources), and returns when all threads have joined.
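A minimal sketch of the Worker and NewsPool classes described above, built on the standard queue and threading modules; the class layout and names are our reconstruction of the idea, not the project's actual code.

```python
import queue
import threading

class Worker(threading.Thread):
    """Thread that executes tasks pulled from a shared task queue."""
    def __init__(self, tasks):
        super().__init__(daemon=True)
        self.tasks = tasks
        self.start()

    def run(self):
        while True:
            func, args = self.tasks.get()
            try:
                func(*args)
            finally:
                self.tasks.task_done()

class NewsPool:
    """Allocates worker threads, one per news source, and joins them."""
    def __init__(self, num_threads):
        self.tasks = queue.Queue()
        for _ in range(num_threads):
            Worker(self.tasks)

    def add_task(self, func, *args):
        self.tasks.put((func, args))

    def join(self):
        # Returns when every queued download task has completed.
        self.tasks.join()

# Hypothetical usage: one task per source, e.g. downloading its articles.
results, lock = [], threading.Lock()

def download(source):
    with lock:  # guard the shared list across worker threads
        results.append(source + " downloaded")

pool = NewsPool(num_threads=3)
for source in ["NYTimes", "BBC", "WSJ"]:
    pool.add_task(download, source)
pool.join()
print(sorted(results))
```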
Figure 3: Simplified diagram of tweet crawling process

An important distinction between these parallel implementations is the choice of processes
versus threads. Intuitively, the threading module uses threads and the multiprocessing
module uses processes. Threads run in the same memory
space, while processes have separate memory. Thus, the multiprocessing implementations are
used for independent tasks, specifically the tweet streams for individual media accounts,
which have no dependence on one another. Furthermore, if one process crashes, perhaps due
to an unpredictable error in the API call, it does not bring down the other processes, and
tweets keep coming in. However, within each account, each call to the Twitter API can only
get the most recent 200 tweets, so there is a dependence on the most recent tweet ID
retrieved in order to determine which next set of 200 tweets to fetch.
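The tweet-ID dependence can be made concrete with a sketch of the pagination loop; `get_batch` stands in for the Tweepy `user_timeline` call, and the timeline data is synthetic.

```python
def get_batch(timeline, max_id=None, count=200):
    # Stand-in for the Twitter API call: returns up to `count` tweets
    # with id <= max_id, newest first (synthetic data, no network).
    tweets = [t for t in timeline if max_id is None or t["id"] <= max_id]
    return tweets[:count]

# Synthetic timeline: 999 tweets, ids descending from newest to oldest.
timeline = [{"id": i, "text": "tweet %d" % i} for i in range(999, 0, -1)]

collected, max_id = [], None
while True:
    batch = get_batch(timeline, max_id=max_id, count=200)
    if not batch:
        break
    collected.extend(batch)
    # The next call needs the oldest id seen so far: this is the
    # data dependency that keeps the pagination sequential.
    max_id = batch[-1]["id"] - 1

print(len(collected))
```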
For threading, we want to retrieve articles from the same news sources. Threading allows
us to avoid downloading from individual news sources such as the NY Times, the Washington
Post, etc. one at a time and instead compile them into a set with a thread allocated to
each, so a download call is issued for every article across all of the sources. Threading
has the advantage of being lightweight, with a low memory footprint and shared memory.
6 Complexity
Let T_C denote the time for crawling tweets, and T_A the time for analyzing them. If we
implemented the code on a single machine without using MPI, the time complexity per batch
of tweets would be T_1(n) = T_C(n) + T_A(n) given N = n, where n is the total number of
words in the batch and T_A(n) = O(nT^2) is dominated by the Viterbi algorithm. Now,
suppose we have l cores in each machine, and we use 5 machines for crawling and 5 × 5
machines for analysis, as in Figure 2.

Figure 4: Expanded version of individual tweet batches and articles

Since, in our parallel architecture, analysis is ongoing while the next batch of tweets is
being crawled, the time complexity per batch is approximately
T_30l(n) ≈ max{ (1/(5l)) T_C(n), (1/(25l)) T_A(n) }. Due to the API rate limit, an
unlimited number of resources does not necessarily improve the time complexity.
7 Experiments
# machines (crawling)    # machines (analysis)    Average time per line (sec)
1                         1                       0.0442
5                        25                       0.002110
5                        27                       0.001059
5                        23                       0.000978
5                        20                       0.00145

Table 1: Experimental data for time performance after running the program on the first
group of tweets
We expected an imbalance between crawling and analysis from varying the number of machines
used for each part of the program. For the single-machine experimental time, the program
utilized only a single machine, and all of the parallelization was solely through
multicore paradigms.
There was an improvement in the time per tweet as the number of machines used for analysis
increased. However, after the number of machines reached the twenties, the results varied.
One factor to take into account is communication overhead: every time one process
communicates with others, it pays the cost of creating and sending the message, and when
using a synchronous communication routine there is also the cost of waiting for the other
processes to receive it. Another issue is load balancing. Task distribution may not have
been optimal, given that the number of machines for crawling had to remain at 5, one per
news media source.
In the sequential version of the code, there is a large gap between the first and second
records of data. This is likely because the first API call, which gets a batch of around
200 tweets, has completed and a new call is being made to the API. The 27.5061 seconds
between the two records is likely how long the GET request to the Twitter API takes.
However, there is no such significant gap in the parallel version. While the API will
always be a bottleneck, we are always getting tweets: if one process is calling the API
and hitting the bottleneck, another process is still crawling text, so there is no down
time in which nothing happens. Because the parallelism minimizes this backlog of waiting
for an API request, and multiple processes are spawned for different streams, the parallel
version is able to crawl orders of magnitude more text than the sequential version.
Figure 5: NYT word cloud.
Figure 6: Washington Post word cloud
8 Discussion
8.1 Goals achieved
We were able to implement a tweet crawler and analyzer that communicate with each other.
We were also able to develop a practical algorithm to distribute the work of tweet
processing across different machines and cores. While there are still areas where speed
could be optimized further, we managed to produce a functional product that yielded
visible results. It also has practical applications in text and data analysis and shows a
promising approach to dealing with the massive amounts of data that live on Twitter (and
social media in general) today.
8.2 Lessons
We realized that solving practical problems requires more than studying the theoretical
background. One of the main difficulties we faced was gathering the data.

Figure 7: BBC word cloud

Figure 8: WSJ word cloud

In order to crawl a massive amount of live tweet data online, we needed to circumvent the
API rate limit by utilizing the streaming API, automating the generation of access tokens,
and optimizing API calls so that no scraped tweet is wasted.
Meanwhile, analyzing tweets to get the desired, meaningful results requires pre-trained
classifiers of good quality, which means we also have to be able to gather a large amount
of correct, available training data well suited to our problem, so as to guarantee the
quality of our classifiers. This is one of the reasons why we chose to use Python along
with C++: Python has a good library called NLTK, which provides a large number of tagged
sentences from different corpora. Still, this was not a perfect solution, because the data
is inflexible when it comes to the choice of tag categories: the sentences are already
tagged, and we cannot make the tag categories finer for our purposes. If we wanted to
manipulate and redefine the tag categories, we would need other training data that follows
our own definitions and rules for the categories, but this is generally impossible unless
we ourselves spend a couple of years manually tagging all the sentences for our research.

Figure 9: YahooNews word cloud
Overhead was also a significant factor in performance. In the context of parallel
computing, this generally refers to the unwanted time required for distributing and
scheduling work to different workers. Even if there is a sufficient amount of resources,
it may be better to limit the number of workers, especially for simple tasks. In an MPI
architecture, this can be interpreted as follows: minimizing the communication between
nodes as much as possible is best for time performance. MPI is an optimal choice for
problems whose subproblems can be completely independent of each other.
In the context of combining two or more different languages, specifically Python and C++
in our case, it is important to minimize the number of times we pipe between Python and
C++. For each call to Python code from C++, we have to start a Python interpreter and shut
it down again, which adds a large amount of time to the total running time. In places
where Python code calls C++ code and pipes information using the subprocess module,
however, process creation does not typically add much overhead: the startup time involved
in creating the process is likely an order of magnitude less than the time the new program
takes to do the work. It is a different story, though, when process creation occurs for a
large number of parallel children.
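The batching point can be illustrated with the subprocess module: a whole batch of sentences goes through one pipe and one child process, instead of paying the process-creation cost per sentence. The child command here is a stand-in (`wc -l`) rather than the project's actual C++ tagger binary.

```python
import subprocess

# A batch of sentences is joined into one payload for a single child
# process, rather than spawning one child per sentence.
batch = [
    "The building houses many people",
    "Stocks rose sharply today",
    "Rain is expected tomorrow",
]
payload = "\n".join(batch) + "\n"

# Stand-in child: counts input lines. The real pipeline would invoke
# the C++ tagger and read back one tagged line per sentence.
result = subprocess.run(
    ["wc", "-l"],
    input=payload,
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())
```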
However, for the scope of the project, it would be infeasible to reimplement in C++ many
of the already-optimized modules that Python provides. The code relies on the Tweepy
Python wrapper for Twitter API calls, a number of NLTK APIs, and other important Python
packages, which makes it infeasible to implement everything with optimized runtime in C++.
8.3 Limitations
Natural Language Processing is a field of study where a subtle difference in approach and
methodology makes a large difference in results, so many delicate, elaborate final touches
are needed to get a better result from the analysis. For example, when tagging words based
on the pre-defined tag categories, if we put common nouns and proper nouns into the same
category, then we might end up with a model that does not work properly. Specifically, the
resulting model might exclusively classify the word "house", which can be used as either a
verb or a (common) noun, as a noun. The reason is that finding a Viterbi path in the given
Hidden Markov Model depends on the transition probabilities between tags; if we put common
nouns and proper nouns together, the transition probability from noun to noun becomes
higher than from verb to noun, because proper nouns are often used in conjunction with
other proper nouns nearby. That is, the accuracy of the result depends heavily upon how we
define the tag categories.
Another issue related to result accuracy is training. Because the text analysis uses
machine learning, the quality of the result also depends on the quality of the data used
to train the models. It would be ideal if all tweets had the very regular structure of an
English sentence, with standard grammar, correct placement of punctuation and capital
letters, and no typographical errors, so that we would not have to worry about all the
"noise", which in our context can be defined as all irregular text, including misused
grammar and typographical errors. However, since tweets are irregular texts, and we have
only limited resources of "structured" text corpora from books to use in training (we used
pre-tagged sentences from the NLTK library in Python), we cannot say that these limited
training corpora are representative and instructive, considering that our model targets
"unstructured" tweet data. That being said, in our project the accuracy of the result
depends heavily on these kinds of "final touches" and on the quality of the training data,
which are beyond the ability and quality of the algorithm itself. With more time, the
results could be further refined.
In regard to crawling, the limitations were largely the API request rate limits and the
time taken. The time an average HTTP request takes depends on many factors. For the
Twitter API calls, the average response time is slower for older data. According to
Twitter's developer documentation API status page below, the current performance for the
/1.1/search/tweets service is 1565 ms, with a status of "service disruption". The
unpredictability of Twitter's servers plays a huge role in the number of tweets that can
be analyzed.

Another unanticipated limitation was Twitter's monitoring of unusual API usage. One
Twitter account used for testing was terminated due to Twitter's detection of "suspicious
activity".
8.4 Other Improvements
Ultimately we could do a better job of keeping all processors/cores busy at all times by
parallelizing more intensive tasks such as parsing, sorting, and counting, possibly with
multithreading. However, it should be noted that increasing the number of threads beyond
the number of cores will increase overhead. We would also need to keep in mind that
accessing the hard disk from multiple threads in parallel can dramatically impact
performance, as random access is much slower than sequential access. There are also a few
instances where the program needs to wait for resources (specifically, the API calls for
article text as well as batches of tweets).

Figure 10: Twitter Developer Documentation on API Status on 4-15-17

While a process is waiting for this server call, even though other processes are still
getting tweets and thereby ensuring that the entire program is still doing something, we
could employ asynchronous elements somewhat reminiscent of JavaScript, where API calls are
asynchronous and require callbacks to handle the API response. There are several threaded
and non-blocking I/O networking libraries that would be suitable for this purpose.
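The asynchronous, callback-style pattern mentioned above can be sketched with Python's asyncio; `fetch` sleeps instead of doing real network I/O, so the overlap is visible without a server, and the coroutine names are illustrative.

```python
import asyncio

async def fetch(url):
    # Stand-in for a non-blocking API call: while one request
    # "waits", the event loop runs the others.
    await asyncio.sleep(0.01)
    return "response for " + url

async def main(urls):
    # All requests are in flight concurrently; gather preserves
    # the input order in its results.
    return await asyncio.gather(*(fetch(u) for u in urls))

urls = ["https://api.twitter.com/a", "https://api.twitter.com/b"]
responses = asyncio.run(main(urls))
print(responses)
```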
Another approach is to use MapReduce, which could easily perform a text analysis function
on a large data set. Andrew Ng of Stanford published a paper on MapReduce for machine
learning on multicore, which demonstrated linear speedup of machine learning algorithms
with the number of processors [2].
A less intuitive (but not necessarily less effective) approach to parallelizing the text
mining is the combined use of GPU and CPU, a form of hybrid parallelization. A study at
North Carolina State University demonstrated term frequency calculations done by
processing files in batches and assigning each block to process a document stream: the GPU
generates document hash tables while the CPU prefetches the next batch of files from disk
[3]. For the Twitter project, with large enough data sets, the GPU could compute batched
text files of tweets and their corresponding article text. The limitation, of course, of a
computation like term frequency is that all required data must be copied into GPU memory.
References
[1] Tekiner, F., Tsuruoka, Y., Tsujii, J., Ananiadou, S., & Keane, J. Parallel text mining
for large text processing. In Proceedings of IEEE CSNDSP 2010, pp. 348-353.
[2] Chu, C. T., Kim, S. K., Lin, Y. A., Yu, Y., Bradski, G., Ng, A. Y., & Olukotun, K.
(2006, December). Map-Reduce for machine learning on multicore. In NIPS, Vol. 6, pp.
281-288.
[3] Zhang, Y., Mueller, F., Cui, X., & Potok, T. (2009, March). GPU-accelerated text
mining. In Workshop on Exploiting Parallelism Using GPUs and Other Hardware-Assisted
Methods, pp. 1-6.
[4] Viterbi, A. J. Error bounds for convolutional codes and an asymptotically optimum
decoding algorithm. IEEE Trans. Inform. Theory, vol. IT-13, pp. 260-269, Apr. 1967.
[5] Forney, G. D. The Viterbi algorithm. Proc. IEEE, vol. 61, pp. 268-278, Mar. 1973.
The Semantic Evolution of Online CommunitiesMatthew Rowe
Ā 
NLP Project: Machine Comprehension Using Attention-Based LSTM Encoder-Decoder...
NLP Project: Machine Comprehension Using Attention-Based LSTM Encoder-Decoder...NLP Project: Machine Comprehension Using Attention-Based LSTM Encoder-Decoder...
NLP Project: Machine Comprehension Using Attention-Based LSTM Encoder-Decoder...Eugene Nho
Ā 
cis97003
cis97003cis97003
cis97003perfj
Ā 
IRJET- Chatbot Using Gated End-to-End Memory Networks
IRJET-  	  Chatbot Using Gated End-to-End Memory NetworksIRJET-  	  Chatbot Using Gated End-to-End Memory Networks
IRJET- Chatbot Using Gated End-to-End Memory NetworksIRJET Journal
Ā 
NS-CUK Seminar: S.T.Nguyen, Review on "Continuous-Time Sequential Recommendat...
NS-CUK Seminar: S.T.Nguyen, Review on "Continuous-Time Sequential Recommendat...NS-CUK Seminar: S.T.Nguyen, Review on "Continuous-Time Sequential Recommendat...
NS-CUK Seminar: S.T.Nguyen, Review on "Continuous-Time Sequential Recommendat...ssuser4b1f48
Ā 
User_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docxUser_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docxdickonsondorris
Ā 
Hartmann im00
Hartmann im00Hartmann im00
Hartmann im00Irfan Khan
Ā 
14420-Article Text-17938-1-2-20201228.pdf
14420-Article Text-17938-1-2-20201228.pdf14420-Article Text-17938-1-2-20201228.pdf
14420-Article Text-17938-1-2-20201228.pdfMehwishKanwal14
Ā 
Extending Complex Event Processing to Graph-structured Information
Extending Complex Event Processing to Graph-structured InformationExtending Complex Event Processing to Graph-structured Information
Extending Complex Event Processing to Graph-structured InformationAntonio Vallecillo
Ā 
Technical Trends_Study of Quantum
Technical Trends_Study of QuantumTechnical Trends_Study of Quantum
Technical Trends_Study of QuantumHardik Gohel
Ā 
Comprehensive Performance Evaluation on Multiplication of Matrices using MPI
Comprehensive Performance Evaluation on Multiplication of Matrices using MPIComprehensive Performance Evaluation on Multiplication of Matrices using MPI
Comprehensive Performance Evaluation on Multiplication of Matrices using MPIijtsrd
Ā 
Tta protocolsfinalppt-140305235749-phpapp02
Tta protocolsfinalppt-140305235749-phpapp02Tta protocolsfinalppt-140305235749-phpapp02
Tta protocolsfinalppt-140305235749-phpapp02Hrudya Balachandran
Ā 
Summingbird: Streaming Portable, MapReduce
Summingbird: Streaming Portable, MapReduceSummingbird: Streaming Portable, MapReduce
Summingbird: Streaming Portable, MapReduceDataWorks Summit
Ā 
00b7d51ed81834e4d7000000
00b7d51ed81834e4d700000000b7d51ed81834e4d7000000
00b7d51ed81834e4d7000000Rahul Jain
Ā 

Similar to Tweet Cloud (20)

Topic Evolutionary Tweet Stream Clustering Algorithm and TCV Rank Summarization
Topic Evolutionary Tweet Stream Clustering Algorithm and TCV Rank SummarizationTopic Evolutionary Tweet Stream Clustering Algorithm and TCV Rank Summarization
Topic Evolutionary Tweet Stream Clustering Algorithm and TCV Rank Summarization
Ā 
Sensing Trending Topics in Twitter for Greater Jakarta Area
Sensing Trending Topics in Twitter for Greater Jakarta Area Sensing Trending Topics in Twitter for Greater Jakarta Area
Sensing Trending Topics in Twitter for Greater Jakarta Area
Ā 
Finding bursty topics from microblogs
Finding bursty topics from microblogsFinding bursty topics from microblogs
Finding bursty topics from microblogs
Ā 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
Ā 
The Semantic Evolution of Online Communities
The Semantic Evolution of Online CommunitiesThe Semantic Evolution of Online Communities
The Semantic Evolution of Online Communities
Ā 
NLP Project: Machine Comprehension Using Attention-Based LSTM Encoder-Decoder...
NLP Project: Machine Comprehension Using Attention-Based LSTM Encoder-Decoder...NLP Project: Machine Comprehension Using Attention-Based LSTM Encoder-Decoder...
NLP Project: Machine Comprehension Using Attention-Based LSTM Encoder-Decoder...
Ā 
cis97003
cis97003cis97003
cis97003
Ā 
IRJET- Chatbot Using Gated End-to-End Memory Networks
IRJET-  	  Chatbot Using Gated End-to-End Memory NetworksIRJET-  	  Chatbot Using Gated End-to-End Memory Networks
IRJET- Chatbot Using Gated End-to-End Memory Networks
Ā 
NS-CUK Seminar: S.T.Nguyen, Review on "Continuous-Time Sequential Recommendat...
NS-CUK Seminar: S.T.Nguyen, Review on "Continuous-Time Sequential Recommendat...NS-CUK Seminar: S.T.Nguyen, Review on "Continuous-Time Sequential Recommendat...
NS-CUK Seminar: S.T.Nguyen, Review on "Continuous-Time Sequential Recommendat...
Ā 
Fairness in Transfer Control Protocol for Congestion Control in Multiplicativ...
Fairness in Transfer Control Protocol for Congestion Control in Multiplicativ...Fairness in Transfer Control Protocol for Congestion Control in Multiplicativ...
Fairness in Transfer Control Protocol for Congestion Control in Multiplicativ...
Ā 
User_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docxUser_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docx
Ā 
Hartmann im00
Hartmann im00Hartmann im00
Hartmann im00
Ā 
An adaptive framework towards analyzing the parallel merge sort
An adaptive framework towards analyzing the parallel merge sortAn adaptive framework towards analyzing the parallel merge sort
An adaptive framework towards analyzing the parallel merge sort
Ā 
14420-Article Text-17938-1-2-20201228.pdf
14420-Article Text-17938-1-2-20201228.pdf14420-Article Text-17938-1-2-20201228.pdf
14420-Article Text-17938-1-2-20201228.pdf
Ā 
Extending Complex Event Processing to Graph-structured Information
Extending Complex Event Processing to Graph-structured InformationExtending Complex Event Processing to Graph-structured Information
Extending Complex Event Processing to Graph-structured Information
Ā 
Technical Trends_Study of Quantum
Technical Trends_Study of QuantumTechnical Trends_Study of Quantum
Technical Trends_Study of Quantum
Ā 
Comprehensive Performance Evaluation on Multiplication of Matrices using MPI
Comprehensive Performance Evaluation on Multiplication of Matrices using MPIComprehensive Performance Evaluation on Multiplication of Matrices using MPI
Comprehensive Performance Evaluation on Multiplication of Matrices using MPI
Ā 
Tta protocolsfinalppt-140305235749-phpapp02
Tta protocolsfinalppt-140305235749-phpapp02Tta protocolsfinalppt-140305235749-phpapp02
Tta protocolsfinalppt-140305235749-phpapp02
Ā 
Summingbird: Streaming Portable, MapReduce
Summingbird: Streaming Portable, MapReduceSummingbird: Streaming Portable, MapReduce
Summingbird: Streaming Portable, MapReduce
Ā 
00b7d51ed81834e4d7000000
00b7d51ed81834e4d700000000b7d51ed81834e4d7000000
00b7d51ed81834e4d7000000
Ā 

Recently uploaded

AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
Ā 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
Ā 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
Ā 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
Ā 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
Ā 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
Ā 
FULL ENJOY šŸ” 8264348440 šŸ” Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY šŸ” 8264348440 šŸ” Call Girls in Diplomatic Enclave | DelhiFULL ENJOY šŸ” 8264348440 šŸ” Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY šŸ” 8264348440 šŸ” Call Girls in Diplomatic Enclave | Delhisoniya singh
Ā 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
Ā 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
Ā 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
Ā 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
Ā 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
Ā 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
Ā 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
Ā 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
Ā 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
Ā 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
Ā 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
Ā 

Recently uploaded (20)

AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
Ā 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Ā 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
Ā 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Ā 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Ā 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
Ā 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
Ā 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
Ā 
FULL ENJOY šŸ” 8264348440 šŸ” Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY šŸ” 8264348440 šŸ” Call Girls in Diplomatic Enclave | DelhiFULL ENJOY šŸ” 8264348440 šŸ” Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY šŸ” 8264348440 šŸ” Call Girls in Diplomatic Enclave | Delhi
Ā 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
Ā 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
Ā 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
Ā 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
Ā 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
Ā 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
Ā 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
Ā 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
Ā 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
Ā 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Ā 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
Ā 

Tweet Cloud

Duke COMPSCI 590.6
Final Project Report

Minwoo Kim, Lucy Zhang
Spring 2017

1 Introduction

Social networks contain an enormous amount of data that offers insight into a wide range of problems and questions. Facebook alone processes 2.5 billion pieces of content and more than 500 terabytes of data each day. Similarly, Twitter has millions of users who follow each other and form a large, complex network. We retrieve data from Twitter in real time, tokenize and tag it, and summarize the results of the tweet analysis through visualization, in order to concisely show what is happening online. The project consists of two central parts: crawling the web for the tweets (and for enough text to warrant the use of parallel algorithms), and the analysis of each tweet with a hidden Markov model (HMM). There are currently only a select few applications of parallel text mining to large-scale text processing; one study created a parallel text-mining framework implemented with the Message Passing Interface (MPI).

2 Problem description

The purpose of this project is to scrape, crawl, and analyze tweets in real time and in parallel. The sources of the tweets, and the main focus of the project, are major news outlets: the New York Times, the Washington Post, the BBC, Yahoo News, and the Wall Street Journal. To narrow the broad scope of tweet information, and to enable a more consistent comparison, only tweets from the official Twitter accounts of these outlets are tracked.

Because the Twitter REST API is rate-limited to 180 calls every 15 minutes, the project instead relies on Twitter's public streaming API, which has no such limit. Each connection thus processes an ongoing stream of tweets in a separate process, using the multicore programming paradigm.
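The one-process-per-stream layout can be sketched with Python's multiprocessing module. This is a minimal sketch, not the project's actual crawler: the account handles and the `stream_tweets` stub are illustrative stand-ins for real streaming connections (e.g., tweepy listeners held open indefinitely).

```python
import multiprocessing as mp

ACCOUNTS = ["nytimes", "washingtonpost", "BBCWorld", "YahooNews", "WSJ"]

def stream_tweets(account):
    # stand-in for holding a streaming-API connection open;
    # a real worker would push tweets downstream indefinitely
    return ["%s-tweet-%d" % (account, i) for i in range(3)]

def crawl_account(account):
    # each account's stream is independent, so one crashed
    # process cannot take the other streams down with it
    return account, stream_tweets(account)

if __name__ == "__main__":
    # one worker process per official account
    with mp.Pool(processes=len(ACCOUNTS)) as pool:
        for account, tweets in pool.map(crawl_account, ACCOUNTS):
            print(account, len(tweets))
```

Running each account in its own process (rather than a thread) also gives crash isolation, which matters for long-lived streaming connections.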
Because an official Twitter account's tweet often comes in conjunction with a link to the article itself, it was also important to take into account the external link that provides the article in full. Part of the task was to analyze the bias of a tweet, so the expanded article provided critical comparison and insight into what the tweet expressed versus what the article as a whole did. To obtain the article text in full, another HTTP request was required, as well as parsing of the HTML DOM structure in parallel, so that only relevant article text (and not sidebar HTML, stray JavaScript, etc.) was recorded.

Once we have crawled tweets, we use an HMM, a conventional model for part-of-speech (POS) tagging. An HMM is a statistical Markov model that can be described conceptually as a Bayesian network, as in the following figure.

Figure 1: First-order discrete-state, discrete-time hidden Markov model.

In our specific context, we use a first-order, discrete-state, time-homogeneous, discrete-time hidden Markov model. Let Ω be a set of T distinct discrete (and hidden) states S_1, S_2, ..., S_T for a discrete-time Markov chain; these states are not observable. Also, let Σ be a set of M distinct possible observation symbols U_1, U_2, ..., U_M, which are directly observable and depend on the unknown states of the model. By the first-order Markov assumption and time homogeneity, the transition into a state depends only on the previous state; mathematically,

P(t_i = S_{k(i)} \mid t_{i-1} = S_{k(i-1)}, t_{i-2} = S_{k(i-2)}, \ldots, t_0 = S_{k(0)}) = P(t_i = S_{k(i)} \mid t_{i-1} = S_{k(i-1)}) = P(t_1 = S_{k(i)} \mid t_0 = S_{k(i-1)})

for time index i ≄ 1. Observations depend directly on the current state and, once that state is given, are independent of the other states and observations:

P(w_i \mid w_{i-1}, \ldots, w_0, t_i = S_{k(i)}, \ldots, t_0 = S_{k(0)}) = P(w_i \mid t_i = S_{k(i)}).

The linearized representation of the HMM in the figure above is not intended to imply a single possible transition out of a given state; rather, it is a conceptual picture of the sequential realization of the states at each discrete time index. For our tweet-analysis problem, we can regard each tweet as one realization of this HMM. Now, suppose we are given a tweet of a single sentence, "The building houses many people".
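Under these two assumptions, the joint probability of a tag sequence and a word sequence factorizes as P(t_0) P(w_0 | t_0) āˆ P(t_i | t_{i-1}) P(w_i | t_i). A minimal sketch with a toy three-tag model; the probabilities below are invented for illustration, not estimated from data:

```python
# toy tag set (D, N, V) and vocabulary for "the building houses";
# any pair not listed has probability zero
PI = {"D": 1.0, "N": 0.0, "V": 0.0}                      # P(t0 = S)
A = {("D", "N"): 1.0, ("N", "N"): 0.4, ("N", "V"): 0.6}  # P(ti = S2 | ti-1 = S1)
B = {("D", "the"): 1.0, ("N", "building"): 0.5,
     ("N", "houses"): 0.5, ("V", "houses"): 1.0}         # P(wi = U | ti = S)

def joint(tags, words):
    """P(t0..tn-1, w0..wn-1) under the first-order HMM factorization."""
    p = PI.get(tags[0], 0.0) * B.get((tags[0], words[0]), 0.0)
    for prev, cur, w in zip(tags, tags[1:], words[1:]):
        p *= A.get((prev, cur), 0.0) * B.get((cur, w), 0.0)
    return p

words = ["the", "building", "houses"]
# "houses" read as a verb is the more probable tagging under this toy model
print(joint(["D", "N", "V"], words))  # 1.0*1.0 * 1.0*0.5 * 0.6*1.0
print(joint(["D", "N", "N"], words))  # 1.0*1.0 * 1.0*0.5 * 0.4*0.5
```

The toy example mirrors the ambiguity in the sample tweet: "houses" can be tagged as a noun or a verb, and the factorized joint probability is what the tagger maximizes.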
Each word corresponds to a specific realization of the possible observation symbols w_0, w_1, ..., w_{n-1}. The unknown grammatical tags for these words are identified with their hidden states, represented by the random variables t_0, t_1, ..., t_{n-1}. The goal of POS tagging is then to predict the hidden states t_0, t_1, ..., t_{n-1} in the maximum-likelihood framework. That is, for the given tweet with word sequence w_0, w_1, ..., w_{n-1}, we want to maximize the conditional probability

\max_{(S_{k(0)}, \ldots, S_{k(n-1)}) \in \Omega^n} P(t_0 = S_{k(0)}, \ldots, t_{n-1} = S_{k(n-1)} \mid w_0, \ldots, w_{n-1}),

which, since P(w_0, ..., w_{n-1}) is fixed by the tweet, is equivalent to maximizing the joint probability

\max_{(S_{k(0)}, \ldots, S_{k(n-1)}) \in \Omega^n} P(t_0 = S_{k(0)}, \ldots, t_{n-1} = S_{k(n-1)}, w_0, \ldots, w_{n-1}).

The above optimization problem can be solved with the Viterbi algorithm, an application of dynamic programming. To do this, first define the auxiliary functions \Gamma_i : \Omega \to [0, 1] by

\Gamma_0(S) = P(t_0 = S, w_0) = P(t_0 = S)\, P(w_0 \mid t_0 = S),

\Gamma_i(S) = \max_{S_{k(0)}, \ldots, S_{k(i-1)}} P(t_i = S, t_0 = S_{k(0)}, \ldots, t_{i-1} = S_{k(i-1)}, w_0, \ldots, w_i) = P(w_i \mid t_i = S) \max_{S_{k(i-1)}} P(t_i = S \mid t_{i-1} = S_{k(i-1)})\, \Gamma_{i-1}(S_{k(i-1)}),

where the equalities hold due to our Markov assumptions. Our optimization problem then becomes

\max_{(S_{k(0)}, \ldots, S_{k(n-1)}) \in \Omega^n} P(t_0 = S_{k(0)}, \ldots, t_{n-1} = S_{k(n-1)}, w_0, \ldots, w_{n-1}) = \max_{S \in \Omega} \Gamma_{n-1}(S),

and the algorithm iteratively computes the vector \Gamma_i of length T. Thus, once we are given all the matrices of model parameters \Pi, A, B, where

\Pi(S) := P(t_0 = S), \quad S \in \Omega,

A(S_1, S_2) := P(t_i = S_2 \mid t_{i-1} = S_1) = P(t_1 = S_2 \mid t_0 = S_1), \quad S_1, S_2 \in \Omega,

B(S, U) := P(w_i = U \mid t_i = S), \quad S \in \Omega, \ U \in \Sigma,

we can solve the problem using the Viterbi algorithm. Once we have the maximum probability, we can easily retrieve the maximizing path (the sequence of grammatical tags) through backtracking, provided that, for each state of the current word, we store the maximizing state of the previous word, i.e., we build a path-tracking matrix Q such that

Q(i, j) := \arg\max_{S \in \Omega} A(S, S_j)\, \Gamma_{i-1}(S).

As a result of the algorithm, we obtain a sequence of grammatical tags for the given tweet. After tagging all the tweets in the given batch, we extract all the keywords in each tweet and record the occurrences of these words, which are visually summarized in a word cloud.
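The Γ recursion and the path-tracking matrix Q translate directly into code. Below is a sketch in Python/NumPy with the inner maximization over the previous state vectorized, as described in Section 3; the toy Ī , A, B matrices are invented for illustration (tags D, N, V over the words "the building houses"):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Maximum-likelihood state path for observation indices `obs`.

    pi : (T,) initial state probabilities Pi(S)
    A  : (T, T) transition matrix, A[s1, s2] = P(ti = s2 | ti-1 = s1)
    B  : (T, M) emission matrix, B[s, u] = P(wi = u | ti = s)
    """
    n, T = len(obs), len(pi)
    delta = np.zeros((n, T))             # delta[i, s]: the Gamma_i(S) values
    back = np.zeros((n, T), dtype=int)   # back[i, s]: the path-tracking matrix Q
    delta[0] = pi * B[:, obs[0]]
    for i in range(1, n):
        # scores[s_prev, s] = Gamma_{i-1}(s_prev) * A(s_prev, s),
        # maximized over s_prev in one vectorized step
        scores = delta[i - 1][:, None] * A
        back[i] = scores.argmax(axis=0)
        delta[i] = scores.max(axis=0) * B[:, obs[i]]
    # backtrack through Q from the best final state
    path = [int(delta[-1].argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return path[::-1], float(delta[-1].max())

# toy model: tags D=0, N=1, V=2; words "the"=0, "building"=1, "houses"=2
pi = np.array([1.0, 0.0, 0.0])
A = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.4, 0.6],
              [0.0, 1.0, 0.0]])
B = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
path, prob = viterbi([0, 1, 2], pi, A, B)   # "the building houses"
print(path, prob)   # picks the verb reading of "houses"
```

For production use, log probabilities would avoid underflow on long sentences, but the sketch keeps the notation of the derivation above.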
Note that this entire process occurs in real time: tweets are being crawled and text is being scraped while other tweets are being analyzed. The analysis is performed on each individual tweet and, in the case of the corresponding article, on each individual sentence. To optimize and parallelize this whole process, we used MPI on a cluster of 30 machines, each of which has 16 cores and is capable of running MPI. The handoff from the crawling processes to the analysis processes uses MPI to pass the text. Furthermore, instead of one tweet or sentence being processed at a time, they are processed in batches in order to minimize the overhead of piping the text from the Python code (used for the API calls) to the C++ code.
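The batching that amortizes the per-message handoff overhead can be sketched as a simple grouping generator; the MPI send itself is omitted here, and each yielded list stands for one message passed to an analysis worker:

```python
from itertools import islice

def batches(stream, size):
    """Group an iterable of tweets into lists of `size` (last may be smaller).

    Each list is handed off in a single message, instead of paying the
    piping overhead once per tweet or per sentence.
    """
    it = iter(stream)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# e.g. 7 tweets in batches of 3 -> two full batches and one partial one
for b in batches(["tweet-%d" % i for i in range(7)], 3):
    print(b)
```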
3 Expected achievements

The main strategy for tackling the problem is dividing the tasks and distributing them to the machines in the Linux cluster using MPI. The overall running time is therefore expected to be lower than that of a single-machine implementation by a factor of approximately 30 (in the ideal case), which is the number of machines in the Linux cluster. Multiprocessing the crawling and tagging within each node of the network can further improve the time performance. With a vectorized computation scheme in the Viterbi algorithm, although the meta-algorithm is itself sequential by nature (due to the iterative updates of the auxiliary vectors \Gamma_i), we should be able to reduce the time for POS tagging to roughly 1/T of the sequential time (where T = |Ω| is the number of distinct tag types) in the ideal case, where the overhead of worker scheduling is assumed to be negligible.

4 Bottleneck identification

The bottleneck for crawling the text comes from several sources. The first is that a Twitter API call can only retrieve the first N tweets; the next API call then requires the ID of the most recently retrieved tweet in order to fetch the next set. This data dependency makes the retrieval difficult to parallelize. Furthermore, the more tweets the program retrieves, the older the tweets are and the longer it takes to fetch them from the Twitter servers. Each process must therefore wait after retrieving its first N tweets before continuing to page backwards through the timeline. The second bottleneck comes from the newspaper-article web scraping, which issues HTTP requests to the media sources' own sites. The time taken by a request also depends on the servers and rate limits of the individual news sources. For tweets that do not carry links to external media sources, this bottleneck is not an issue.
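The data dependency on the most recently retrieved tweet ID forces the history crawl into a sequential loop. A sketch of that loop, with `fake_fetch` as an illustrative stand-in for the real timeline call (Twitter's REST endpoint takes an analogous `max_id` parameter):

```python
def crawl_history(fetch, batch_size=200, limit=600):
    """Page backwards through a timeline.

    Each call depends on the oldest ID returned by the previous one,
    so the calls cannot overlap; this is the dependency described above.
    """
    tweets, max_id = [], None
    while len(tweets) < limit:
        page = fetch(count=batch_size, max_id=max_id)
        if not page:
            break
        tweets.extend(page)
        max_id = page[-1]["id"] - 1   # IDs decrease going back in time
    return tweets[:limit]

# stand-in timeline: 1000 fake tweets, newest (highest ID) first
TIMELINE = [{"id": i} for i in range(999, -1, -1)]

def fake_fetch(count, max_id=None):
    start = 0 if max_id is None else next(
        (k for k, t in enumerate(TIMELINE) if t["id"] <= max_id), len(TIMELINE))
    return TIMELINE[start:start + count]

history = crawl_history(fake_fetch)
print(len(history), history[0]["id"], history[-1]["id"])
```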
However, for tweets that do have external links to their respective media source, requesting the article text takes significantly longer than getting the tweet itself. As a result, the program must wait for the article to be processed if it exists. This bottleneck was minimized by incorporating multithreading, allowing the program to request multiple URLs at once while keeping the order of the URLs and responses stable and returning the same request objects with their response fields filled.

In the analysis part, the bottleneck comes from the Viterbi algorithm for POS tagging. Let T be the number of distinct tag types used for POS tagging, and let N denote the (random) number of words in a given sentence. The Viterbi algorithm then has O(nT^2) time complexity given N = n.

5 Parallel solution

Figure 2: Overview of the crawling process.

There are several parallel components in this project. The first lies in the crawling of tweets. We define a stream as a process in which the Twitter API is called for a given Twitter account. To scrape tweets for the desired media sources, multicore programming is used such that one core is responsible for each stream. Furthermore, within each stream, we divide the work among cores, batching the tweets returned by the API so that each core is responsible for one batch of tweets. Because many Twitter media accounts often post news-article links within their tweets, we also crawled the news-article text, which averages 500 to 800 words, or 25 to 40 sentences. The text-analysis algorithm parses text sentence by sentence, so these news articles and their HTML DOM structure had to be parsed into a readable format. This was accomplished by modifying a newspaper module that scrapes the main body of an article from the HTML; it works by downloading articles from news sources and "building" the paper. However, downloading articles one at a time is slow, and spamming a single news source such as the NY Times can cause rate limiting and is, overall, not very courteous. We solve this problem by allocating threads to each news source (NYTimes, BBC, etc.) to speed up the download time. We create a Worker thread that executes tasks from a given task queue; in the thread pool, we can then add tasks that are executed on separate threads and later joined. We also create a NewsPool class that accepts any number of source or article objects in a list, allocates one thread to each source (i.e., five threads for five sources), and returns once all threads have joined.

An important distinction between these parallel implementations is the choice of processes versus threads. Intuitively, the threading module uses
  • 6. Duke COMPSCI 590.6 Final project report Spring 2017 Figure 3: Simpliļ¬ed diagram of tweet crawling process threads and the multiprocessing module uses threads. Threads run in the same memory space while processes have separate memory. Thus, the multiprocessing implementations are utilized for independent algorithmsā€“speciļ¬cally the tweets for individual media user accounts that have no dependence. Furthermore, if one process crashes perhaps due to an unpredictable error in the API call, it would not bring down other processes and tweets would still be coming in. However, within each account, each call to the Twitter API can only get the most recent 200 Tweets. Thus there is a dependence on the knowledge of the most recently tweet ID retrieved in order to determine which next set of 200 tweets to retrieve. For threading, we want to retrieve articles from the same news sources. Threading allows us to avoid downloading individual news sources such as NY Times, Washington Post, etc. and instead compile those into a set with a thread allocated to each. Thus a download call gets called on every article for all three of the sources. Threading has the advantage of being lightweight with a low memory footprint and shared memory. 6 Complexity Let TC denote the time for crawling tweets, and TA for analyzing tweets. If we implemented the code within a single machine without using MPI, the time complexity per one batch of tweets would have been T1(n) = TC(n) + TA(n) given N = n, where n is the number of total words in the given batch, and TA(n) = O(nT2 ) is dominated by Viterbi algorithm. Now, suppose we have l cores in each machine, and we use 5 machines for 6
Figure 4: Expanded version of individual tweet batches and articles.

crawling and 5 × 5 machines for analysis, as in Figure 2. Since, in our parallel architecture, analysis is ongoing while the next batch of tweets is being crawled, the time complexity per batch is approximately

T_30l(n) ≈ max{ T_C(n) / (5l), T_A(n) / (25l) }.

Due to the API rate limit, an unlimited amount of resources does not necessarily lead to an improvement in the time complexity.

7 Experiments

# machines for crawling   # machines for analysis   Average time per line (sec)
1                         1                         0.0442
5                         25                        0.002110
5                         27                        0.001059
5                         23                        0.000978
5                         20                        0.00145

Table 1: Experimental time performance for the first group of tweets.

We expected imbalance between crawling and analysis from varying the number of machines used for each part of the program. For the single-machine experiment, the program used only one machine, and all of the parallelization was solely through multicore paradigms. The time per tweet improved as the number of machines used for analysis increased. However, once the number of machines reached the 20s, the results varied. One factor to take into account is communication overhead: every time one process communicates with another, it pays the cost of creating and sending the message, and, with a synchronous communication routine, also the cost of waiting for the other processes to receive it. Another issue is load balancing: task distribution may not have been optimal, given that the number of machines for crawling had to remain at 5, one per news media source. In the sequential version of the code, there is a large gap between the first and second records of data. This is likely because the first API call, which gets a batch of around 200 tweets, has completed and a new call is being made to the API.
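The bottleneck model from Section 6 can be written out directly and compared with the observed speedups in Table 1. The per-stage timings below are hypothetical; only the 5/25 machine split and the Table 1 figures come from the text.

```python
def sequential_time(t_crawl, t_analyze):
    # T_1(n) = T_C(n) + T_A(n): a single machine does both stages in turn.
    return t_crawl + t_analyze

def pipelined_time(t_crawl, t_analyze, cores_per_machine):
    # T_30l(n) ≈ max{T_C(n)/(5l), T_A(n)/(25l)}: crawling and analysis
    # overlap, so the slower stage sets the per-batch rate.
    l = cores_per_machine
    return max(t_crawl / (5 * l), t_analyze / (25 * l))

# Hypothetical per-batch stage timings in seconds.
t_c, t_a = 1.0, 10.0
model_speedup = sequential_time(t_c, t_a) / pipelined_time(t_c, t_a, 4)

# Observed speedups from Table 1, relative to the single-machine row.
baseline = 0.0442
observed = [baseline / t for t in (0.002110, 0.001059, 0.000978, 0.00145)]
```

The max{} form also explains why adding analysis machines eventually stops helping: once the analysis term drops below the crawling term, crawling (and the API rate limit behind it) dominates.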
The 27.5061 seconds between the two records of data is likely how long the GET request to the Twitter API takes. However, there is no such significant gap in the parallel version. While the API will always be a bottleneck, we are always getting tweets: if one process is calling the API and hitting the bottleneck, another process is still crawling text, so there is no downtime in which nothing happens. Because the parallelism minimizes this backlog of waiting on API requests, and multiple processes are spawned for different streams, the parallel version is able to crawl an order of magnitude more text than the sequential version.
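The no-downtime effect can be illustrated with two workers, one blocked on a simulated API wait while the other keeps working; the sleeps are stand-ins for real network and processing time, not measurements from the project.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def api_call():
    time.sleep(0.3)          # simulated wait on the Twitter API (the bottleneck)
    return "batch of tweets"

def crawl_articles():
    time.sleep(0.3)          # useful work done while the other worker waits
    return "article text"

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(api_call), pool.submit(crawl_articles)]
    results = [f.result() for f in futures]
elapsed = time.perf_counter() - start
# Overlapped, the pair finishes in roughly 0.3 s rather than the 0.6 s
# a sequential version would need.
```

The same overlap happens between processes in the real crawler, where one stream's API wait hides behind another stream's work.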
Figure 5: NYT word cloud. Figure 6: Washington Post word cloud.

8 Discussion

8.1 Goals achieved

We were able to implement a tweet crawler and analyzer that communicate with each other, and to develop a practical algorithm for distributing the work of tweet processing across different machines and cores. While there are still areas where speed could be optimized, we managed to produce a functional product that yielded visible results. It also has practical applications in text and data analysis and shows a promising approach to dealing with the massive amounts of data on Twitter (and social media in general) today.

8.2 Lessons

We realized that solving practical problems requires more than studying the theoretical background. One of the main difficulties we faced was gathering the data.
Figure 7: BBC word cloud. Figure 8: WSJ word cloud.

In order to crawl a massive amount of live tweet data online, we needed to circumvent the API rate limit by utilizing the streaming API, automating the generation of access tokens, and optimizing API calls so that no scraped tweet is wasted. Meanwhile, analyzing tweets to get the desired, meaningful result requires pre-trained classifiers of good quality, which means we also had to gather a large amount of correct training data well suited to our problem, enough to guarantee the quality of our classifiers. This is one of the reasons we chose to use Python alongside C++: Python has a good library called NLTK, which provides a large amount of tagged sentences from different corpora. Still, this was not a perfect solution, because the data is inflexible with respect to the choice of tag categories: the sentences are already tagged, and we cannot make the tag categories finer for our purposes. If we wanted to manipulate and redefine the tag categories, we would need training data that follows our own definitions and rules, which is generally impossible unless we ourselves spent a couple of years manually tagging all the sentences
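Estimating HMM parameters from pre-tagged sentences works roughly as follows. The tiny hand-tagged sample below is an invented stand-in for NLTK's corpora (e.g. `nltk.corpus.brown.tagged_sents()`), and the tag names are simplified.

```python
from collections import Counter, defaultdict

# Stand-in for NLTK tagged corpora: sentences as lists of (word, tag) pairs.
tagged_sents = [
    [("the", "DET"), ("house", "NOUN"), ("burned", "VERB")],
    [("they", "PRON"), ("house", "VERB"), ("refugees", "NOUN")],
]

transitions = defaultdict(Counter)  # counts for P(tag_i | tag_{i-1})
emissions = defaultdict(Counter)    # counts for P(word | tag)
for sent in tagged_sents:
    prev = "<s>"                    # sentence-start pseudo-tag
    for word, tag in sent:
        transitions[prev][tag] += 1
        emissions[tag][word.lower()] += 1
        prev = tag

def transition_prob(prev, tag):
    total = sum(transitions[prev].values())
    return transitions[prev][tag] / total if total else 0.0
```

Note that "house" appears under both NOUN and VERB, which is exactly the ambiguity the tag-category choice has to resolve; the coarseness of the tag set is baked into these counts.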
Figure 9: YahooNews word cloud.

for our research.

Overhead was also a significant factor in performance. In the context of parallel computing, this generally refers to the unwanted time required for distributing and scheduling work among different workers. Even when there is a sufficient amount of resources, it may be better to limit the number of workers, especially for simple tasks. In an MPI architecture, this means that minimizing the communication between nodes is best for time performance; MPI is an optimal choice for problems whose subproblems can be completely independent of each other. In the context of combining two or more languages, in our case Python and C++, it is important to minimize the number of times data is piped between them. For each call to Python code from C++, we have to start a Python interpreter and shut it down again, which takes a large share of the total running time. Where Python code calls C++ code through the subprocess module, however, process creation does not typically add much overhead: the startup time of the new process is likely an order of magnitude less than the time the program spends doing the work, although it is a different story when a large number of parallel children are created. For the scope of this project, though, it would be infeasible to reimplement in C++ the many already-optimized modules that Python provides: the code relies on the Tweepy Python wrapper for Twitter API calls, a number of NLTK APIs, and other important Python packages that make it infeasible to implement everything with an optimized runtime in C++.
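The piping cost can be kept low by starting the analyzer once and streaming lines to it, instead of launching a fresh process per sentence. In this sketch a small Python child stands in for the long-running C++ analyzer; only the pattern, not the command, matches the project.

```python
import subprocess
import sys

# Stand-in child: echoes each input line upper-cased. In the project this
# role would be played by the C++ analyzer binary.
CHILD = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(line.upper())\n"
    "    sys.stdout.flush()\n"
)

proc = subprocess.Popen(
    [sys.executable, "-c", CHILD],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE,
    text=True, bufsize=1,  # line-buffered text pipes
)

def analyze(sentence):
    # One round trip over the already-open pipe: no per-call start-up cost.
    proc.stdin.write(sentence + "\n")
    proc.stdin.flush()
    return proc.stdout.readline().rstrip("\n")

outputs = [analyze(s) for s in ["first tweet", "second tweet"]]
proc.stdin.close()
proc.wait()
```

The process is created once, so the interpreter (or analyzer) start-up cost is amortized over every sentence rather than paid per call.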
8.3 Limitations

Natural Language Processing is a field of study in which subtle differences in approach and methodology make a large difference in the result, so many additional delicate, elaborate finishing touches are needed if we want a better result from the analysis. For example, when
tagging words based on predefined categories of tags, if we put common nouns and proper nouns into the same category, then we might end up with a model that does not work properly. Specifically, the word "house" can be used either as a verb or as a (common) noun, but the resulting model might exclusively classify "house" as a noun. The reason is that finding a Viterbi path through the given Hidden Markov Model depends on the transition probabilities between tags: if we put common nouns and proper nouns together, the noun-to-noun transition probability becomes higher than the verb-to-noun probability, because proper nouns are often used in conjunction with other proper nouns nearby. That is, the accuracy of the result depends heavily on how we define the categories of tags.

Another issue related to result accuracy is training. Because the text analysis uses machine learning, the quality of the result also depends on the quality of the data used to train the models. It would be ideal if every tweet had the regular structure of an English sentence, with standard grammar, correct punctuation and capitalization, and no typographical errors, so that we did not have to worry about the "noise", which in our context means all irregular text, including misused grammar and typographical errors. However, tweets are irregular text, and we have only limited resources of "structured" text corpora from books to use in training (we used pre-tagged sentences from the NLTK library in Python), so we cannot say that these limited training corpora are representative and instructive, given that our model targets "unstructured" tweet data.
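The dependence on transition probabilities can be seen in a minimal Viterbi decoder. The two-tag HMM below is invented purely for illustration; with these numbers, the ambiguous word "house" is decoded as a verb because the N-to-V transition outweighs its noun emission, mirroring the sensitivity to tag categories discussed above.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s]: probability of the best tag path ending in state s at step t.
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev, p = max(((r, V[t - 1][r] * trans_p[r][s]) for r in states),
                          key=lambda x: x[1])
            V[t][s] = p * emit_p[s][obs[t]]
            back[t][s] = prev
    # Trace the best final state backwards through the backpointers.
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Toy model: N = noun, V = verb, with made-up probabilities.
states = ("N", "V")
start_p = {"N": 0.7, "V": 0.3}
trans_p = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}}
emit_p = {"N": {"birds": 0.4, "house": 0.25, "cats": 0.35},
          "V": {"birds": 0.05, "house": 0.6, "cats": 0.35}}
tags = viterbi(["birds", "house", "cats"], states, start_p, trans_p, emit_p)
```

Changing the transition table (for instance, inflating noun-to-noun mass by merging proper and common nouns) flips such decisions, which is exactly the failure mode described in the text.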
That being said, in our project the accuracy of the result depends heavily on these kinds of "finishing touches" and on the quality of the training data, both of which are beyond the quality of the algorithm itself. With more time, the results could be further refined.

With regard to crawling, the limitations were largely the API request rate limits and the time taken. The time an average HTTP request takes depends on many factors. For the Twitter API calls, the average response time is slower for older fetched data. According to Twitter's developer documentation API status page below, the current performance for the /1.1/search/tweets service shows a 1565 ms response time and a status of "service disruption". The unpredictability of Twitter's servers plays a huge role in the number of tweets that can be analyzed. Another unanticipated limitation was Twitter's monitoring of unusual API usage: one Twitter account used for testing was terminated after Twitter detected "suspicious activity".

8.4 Other Improvements

Ultimately, we could do a better job of keeping all processors and cores busy at all times by parallelizing more intensive tasks such as parsing, sorting, and counting, possibly with multithreading. It should be noted, however, that increasing the number of threads beyond the number of cores increases overhead. We would also need to keep in mind that accessing the hard disk from multiple threads in parallel can dramatically hurt performance, since random access is much slower than sequential access. There are also a few instances where the program needs to wait for resources (specifically, the API calls for article text as well
Figure 10: Twitter Developer Documentation on API Status on 4-15-17.

as batches of tweets). While a process waits for such a server call, even though other processes are still getting tweets and the program as a whole keeps doing work, we could employ asynchronous elements somewhat reminiscent of JavaScript, where API calls are asynchronous and callbacks operate on the API response; several threading and non-blocking I/O networking libraries would be suitable for this purpose. Another approach is MapReduce, which could easily perform a text analysis function over a large data set. Andrew Ng of Stanford co-authored a paper on MapReduce for machine learning on multicore, demonstrating linear speedup of machine learning algorithms with the number of processors. A less intuitive (but not necessarily less effective) approach to parallelizing the text mining is a hybrid form of parallelization across GPU and CPU. A study at North Carolina State University demonstrated term frequency calculations done by processing files in batches and assigning each block to process a document stream: the GPU generates document hash tables while the CPU prefetches the next batch of files from disk. For the Twitter project, with large enough datasets, the GPU could compute batched text files of tweets and their corresponding article text. The limitation, of course, of a computation like term frequency is that all required data must be copied into GPU memory.
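A term-frequency computation in the map/reduce style mentioned above can be sketched as follows; threads stand in for the distributed (or GPU) workers, and the tweets are invented examples.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def map_chunk(tweets):
    # Map: term frequencies for one batch of tweets.
    return Counter(word for tweet in tweets for word in tweet.lower().split())

def reduce_counts(partials):
    # Reduce: merge the per-batch counters into a global frequency table.
    total = Counter()
    for c in partials:
        total += c
    return total

tweets = ["Breaking news today", "news update", "today in news"]
batches = [tweets[:2], tweets[2:]]
with ThreadPoolExecutor(max_workers=2) as pool:
    freq = reduce_counts(pool.map(map_chunk, batches))
```

The map phase is embarrassingly parallel across batches, which is what makes both the MapReduce and the GPU variants attractive; only the final merge is sequential.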
References

[1] Tekiner, F., Tsuruoka, Y., Tsujii, J., Ananiadou, S., & Keane, J. Parallel text mining for large text processing. Proceedings of IEEE CSNDSP 2010, 348-353.

[2] Chu, C. T., Kim, S. K., Lin, Y. A., Yu, Y., Bradski, G., Ng, A. Y., & Olukotun, K. (2006, December). Map-reduce for machine learning on multicore. In NIPS (Vol. 6, pp. 281-288).

[3] Zhang, Y., Mueller, F., Cui, X., & Potok, T. (2009, March). GPU-accelerated text mining. In Workshop on exploiting parallelism using GPUs and other hardware-assisted methods (pp. 1-6).

[4] Viterbi, A. J. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inform. Theory, vol. IT-13, pp. 260-269, Apr. 1967.

[5] Forney, G. D. The Viterbi algorithm. Proc. IEEE, vol. 61, pp. 268-278, Mar. 1973.