Optimizing Near-Synonym System
Siyuan Zhou and Zichang Feng
Carnegie Mellon University
Abstract
Phrasal near-synonym extraction is crucial to AI tasks such as natural language processing. The Near-Synonym System (NeSS) is a corpus-based model for finding near-synonym phrases, but it suffers from performance problems.
This report presents an optimized version of NeSS that
builds an index on the suffix array to reduce the complex-
ity dependency on corpus size and uses an efficient ap-
proach for parallel execution to improve the scalability.
We applied several other techniques along with the indexed suffix array to achieve an approximately 20x-40x speedup. We further conducted experiments to break down the speedup contributed by each optimization approach.
1 Introduction
Synonymy has various degrees ranging from complete
contextual substitutability to near-synonymy [4]. The
word length of the synonymy can also range from single-word synonyms to multi-word synonyms or phrasal near-synonyms. The latter has to consider the semantics of the combination of multiple words, instead of solely the meaning of each word in the phrase. For example, "it is fair to say" is a phrasal near-synonym of the phrase "we all understand". However, the individual components of the two phrases are not directly related to each other. Phrasal near-synonym extraction is very important in natural language processing, information retrieval, text summarization and other AI tasks [6].
Near-Synonym System (NeSS)[6], the system we aim
to optimize, is an unsupervised corpus-based model for
finding phrasal synonyms and near synonyms based on
a large corpus. It differs from other approaches in that it doesn't require parallel resources or use pre-determined sets of patterns. Instead of storing a mapping of near-synonyms in a database, NeSS generates the near-synonyms of a query phrase at runtime. NeSS selects near-synonym candidates by identifying common surrounding contexts based on an extension of Harris' Distributional Hypothesis [7], which states that words that occur in the same contexts tend to have similar meanings.
To be more specific, NeSS tokenizes the corpus and
constructs a suffix array at its initialization phase. Upon
receiving a query phrase, it searches for all occurrences of the query phrase in the corpus using the suffix array and collects the surrounding words as contexts. It then finds all near-synonym candidates by searching for these contexts in the corpus, again using the suffix array. The ranking of the candidates is based on the number of matching contexts between each candidate and the query phrase, and on how they are matched.
Since NeSS finds near-synonyms dynamically from the corpus, the performance (in terms of the latency of a single query phrase) becomes a big challenge. NeSS needs to be a real-time on-line service, since the huge number of possible queries makes it impossible to precompute results off-line. We address three performance problems in the original NeSS:
1. Part of the code is inefficient, leading to long latency for processing a user request. In the original system, it takes three to four minutes to pull the near-synonyms of a single query phrase on a 16-core machine. Clearly, this latency isn't acceptable for a real-time on-line service.
2. The system doesn't scale well with the number of cores. To be more specific, the original system achieves only 1.1x - 1.5x speedups with selected query phrases when running on 16 cores compared to one core. To make matters worse, the original NeSS takes longer with 36 cores than with a single core on some of the phrases.
3. The complexity of the original system scales poorly as the corpus size increases. However, a larger corpus leads to more accurate results for near-synonym searches, so the system takes much longer to pull a result if a user needs better accuracy.
We present an optimized version of the NeSS system that allows real-time interactive queries for near-synonym phrases. Our contributions can be summarized as follows. Firstly, we carefully optimized some of the implementation details to allow faster computation without losing the accuracy of the system. We built an index on top of the suffix array in the original system to allow O(L)-time search for all occurrences of a substring, where L is the length of the query string. We modified the algorithm for fetching candidates to improve efficiency. Besides these major modifications to the original design, we also made optimizations such as punctuation filtering. Secondly, we changed the way the system is parallelized to improve its scalability on multi-core machines. Our optimized system gets a 6x speedup when running with 16 cores compared to one core.
Thirdly, we avoided splitting the suffix array into multiple parts, so that the system has an overall view of the corpus. The original system splits the suffix array into multiple parts and lets each thread hold one part. However, this leads to different scoring and ranking of the candidates. We believe that the results from one whole suffix array are the most accurate, and thus we keep only one copy of the suffix array globally.
Finally, we reduced the dependency of the algorithm's complexity on the corpus size. We achieved this by building an index on the suffix array such that searching for a substring takes O(L) time instead of O(L + log(N)), where L is the length of the substring and N is the length of the corpus. Therefore, the system can achieve better accuracy by using a larger corpus as input without sacrificing too much performance.
In this report, we introduce background in Section 2. We describe our four optimizations in Section 3. The results and evaluations are presented in Section 4. We then discuss our results in Section 5 and summarize related work in Section 6. Finally, we conclude our work in Section 7.
2 Background
Near-Synonym System (NeSS)[6], the system we aim
to optimize, is an unsupervised corpus-based model for
finding phrasal synonyms and near synonyms based on
a large corpus. NeSS selects near-synonym candidates by identifying common surrounding contexts based on an extension of Harris' Distributional Hypothesis [7]. The idea of this hypothesis is that words that occur in the same contexts tend to have similar meanings. To identify the contexts in which the query phrase occurs, NeSS uses a suffix array [9] for look-up. In this section, we briefly describe NeSS and the suffix array.
2.1 Near-Synonym System
At initialization, NeSS accepts documents as input. The documents are preprocessed and concatenated to form a large corpus. NeSS converts the words in the documents into tokens, assigns each word a unique word id, and keeps a dictionary for the mapping between words and ids. The ids identify the words and avoid string comparisons when searching for a word. Viewing the corpus as a long string, NeSS builds a suffix array of the corpus. The suffix array supports the substring searches used to find near-synonyms of the query phrase.
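The tokenization step above can be sketched as follows; the class and method names are our own illustration, not taken from the NeSS source:

```java
import java.util.*;

// Minimal sketch of NeSS-style tokenization: each distinct word gets a
// unique integer id, so later searches compare integers instead of
// strings. Names are illustrative, not from the NeSS implementation.
public class Tokenizer {
    private final Map<String, Integer> wordToId = new HashMap<>();
    private final List<String> idToWord = new ArrayList<>();

    // Return the id for a word, assigning a fresh id on first sight.
    public int idOf(String word) {
        Integer id = wordToId.get(word);
        if (id == null) {
            id = idToWord.size();
            wordToId.put(word, id);
            idToWord.add(word);
        }
        return id;
    }

    public String wordOf(int id) { return idToWord.get(id); }

    // Convert a whitespace-tokenized document into an id array (the corpus).
    public int[] encode(String text) {
        String[] words = text.split("\\s+");
        int[] ids = new int[words.length];
        for (int i = 0; i < words.length; i++) ids[i] = idOf(words[i]);
        return ids;
    }

    public static void main(String[] args) {
        Tokenizer t = new Tokenizer();
        int[] corpus = t.encode("it is fair to say it is");
        // "it" and "is" reuse their ids on the second occurrence.
        System.out.println(Arrays.toString(corpus)); // [0, 1, 2, 3, 4, 0, 1]
    }
}
```

Viewing the resulting id array as one long string is what allows a single suffix array to cover the whole corpus.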
Given a query phrase, NeSS searches for its occurrences by finding the query phrase as a substring of the corpus, viewed as one long string. Since the suffix array is sorted lexicographically, all occurrences are returned as a range of indices of the suffix array. The surrounding words can then be fetched based on the positions of the query phrase in the corpus. The surrounding words are referred to as contexts, and can be classified into left context, right context and cradle, which is the combination of the left and right contexts.
After the contexts are filtered, they are used to search for candidates. Candidates are the phrases that share the same left context, right context or cradle with the query phrase. Each matching context contributes a score to the candidate it matches. For left and right contexts, this is done by again searching for the contexts in the corpus and fetching the words next to them. Finding candidates between cradles takes more effort. First, the occurrences of the left context of the cradle are searched using the suffix array. Next, for each occurrence and for each valid candidate phrase length, the words behind the occurrence of the left context are fetched. The right context of the cradle is then compared with the words behind the supposed candidate. If they match, the supposed candidate is a real candidate. Since there are plenty of contexts, finding candidates is one of the most time-consuming functions in the system, especially finding candidates between cradles.
The candidates are then ranked based on the aforementioned score. The top N candidates are returned to the user as the near-synonym phrases, where N is a parameter defined by the user. In addition, part of the top candidates are directed to the KL-divergence computation. The KL-divergence gives a more reliable ranking of the candidates. One thing to note is that, since the computation of KL-divergence involves a lot of mathematical operations, it is also one of the hot spots in terms of runtime.
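The mathematical core of this re-ranking step is the standard Kullback-Leibler divergence, D(P || Q) = Σᵢ pᵢ ln(pᵢ/qᵢ). The sketch below shows only this formula over toy distributions; NeSS's actual feature distributions are not reproduced here:

```java
// Standard KL divergence D(P || Q) over two discrete distributions.
// Terms with p_i = 0 contribute 0 by convention. This is a sketch of the
// mathematical core only, not the NeSS scoring code.
public class KL {
    public static double divergence(double[] p, double[] q) {
        double d = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > 0.0) {
                d += p[i] * Math.log(p[i] / q[i]);
            }
        }
        return d;
    }

    public static void main(String[] args) {
        double[] p = {0.5, 0.5};
        double[] q = {0.9, 0.1};
        System.out.println(divergence(p, p)); // 0.0: identical distributions
        System.out.println(divergence(p, q)); // positive: distributions differ
    }
}
```

The per-term logarithms are what makes this computation expensive when applied to many candidates.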
Figure 1: Overview of the process of searching near synonyms [6]

NeSS parallelizes on a multi-core machine by splitting the suffix array into multiple parts and letting each thread be responsible for searching substrings of one partial suffix array and for the subsequent operations associated with the matching substrings. For example, in the process of finding candidates from contexts, each thread finds contexts in its partial suffix array and then finds matching candidates from the contexts it just found. NeSS parallelizes this way because suffix array search is one of the most time-consuming functions in the system. However, this leads to only partial results when scoring and ranking the candidates. One piece of evidence is that NeSS generates different results on the same corpus with different numbers of threads, and thus different numbers of suffix array splits.
2.2 Suffix Array
A suffix array [9] is a data structure that allows substring search in O(P + log(N)) time, where P is the length of the substring and N is the length of the whole string. Compared to a suffix tree, a suffix array consumes much less memory in practice. A suffix array is a sorted array of all suffixes of the original array. Searching for a substring can be done by performing a binary search on the suffix array.
In NeSS, the suffix array is built from the whole corpus. Finding the occurrences of query phrases or contexts can then be cast as searching for a substring in a long string (the corpus). Since the words in the corpus are tokenized into ids, one word in the corpus is equivalent to one character in the suffix array. Because NeSS needs all occurrences of a phrase it searches for, the suffix array search in NeSS returns a range of suffix array indices. The indices of the occurrences are contiguous in the suffix array because the suffix array is sorted lexicographically.
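The lookup just described can be illustrated with a minimal sketch: a suffix array over an id-encoded corpus, and a pair of binary searches that return the contiguous range of all occurrences of a phrase. The construction here is a naive O(N² log N) sort for clarity; all names are illustrative:

```java
import java.util.*;

// Sketch of the suffix-array lookup: the corpus is an array of word ids,
// the suffix array holds every start position sorted by the suffix that
// begins there, and all occurrences of a phrase come back as one
// contiguous range of suffix-array indices.
public class SuffixArrayDemo {
    // Compare the suffix at 'start' against 'phrase' on the first
    // phrase.length words only (prefix comparison).
    static int compareSuffixToPhrase(int[] corpus, int start, int[] phrase) {
        for (int i = 0; i < phrase.length; i++) {
            if (start + i >= corpus.length) return -1;  // suffix too short
            if (corpus[start + i] != phrase[i])
                return Integer.compare(corpus[start + i], phrase[i]);
        }
        return 0; // phrase is a prefix of this suffix
    }

    static int[] buildSuffixArray(int[] corpus) {
        Integer[] sa = new Integer[corpus.length];
        for (int i = 0; i < sa.length; i++) sa[i] = i;
        Arrays.sort(sa, (a, b) -> {
            int i = a, j = b;
            while (i < corpus.length && j < corpus.length) {
                if (corpus[i] != corpus[j])
                    return Integer.compare(corpus[i], corpus[j]);
                i++; j++;
            }
            // One suffix is a prefix of the other: shorter sorts first.
            return Integer.compare(corpus.length - a, corpus.length - b);
        });
        int[] out = new int[sa.length];
        for (int i = 0; i < sa.length; i++) out[i] = sa[i];
        return out;
    }

    // Return [lo, hi) of suffix-array indices whose suffixes start with phrase.
    static int[] findRange(int[] corpus, int[] sa, int[] phrase) {
        int lo = 0, hi = sa.length;               // lower bound
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compareSuffixToPhrase(corpus, sa[mid], phrase) < 0) lo = mid + 1;
            else hi = mid;
        }
        int lo2 = lo, hi2 = sa.length;            // upper bound
        while (lo2 < hi2) {
            int mid = (lo2 + hi2) >>> 1;
            if (compareSuffixToPhrase(corpus, sa[mid], phrase) <= 0) lo2 = mid + 1;
            else hi2 = mid;
        }
        return new int[]{lo, lo2};
    }

    public static void main(String[] args) {
        int[] corpus = {1, 2, 3, 1, 2, 4};        // token ids
        int[] sa = buildSuffixArray(corpus);
        int[] r = findRange(corpus, sa, new int[]{1, 2});
        System.out.println(r[1] - r[0]);          // 2 occurrences of "1 2"
    }
}
```

Because the suffixes are sorted, the two binary searches bracket exactly the suffixes whose prefix equals the phrase, which is why the result is always one contiguous range.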
3 Optimizations
3.1 Index on Suffix Array
Since searching and counting are the core operations in NeSS, and both of them rely on the functionality provided by the suffix array, the performance of the suffix array greatly affects the performance of the entire system. Therefore, we first focus on accelerating searches on the suffix array.
The method we use is to create a multi-level index on
the suffix array. Its feasibility is based on the following
observations:
1. The only operation performed on the suffix array is to search for a phrase. There are no insertions or deletions at runtime. Thus the suffix array's structure won't change once it's created in the initialization phase.

Figure 2: The structure of an index
2. Although the number of words in the corpus is very
large and will increase quickly as the corpus be-
comes larger, the size of vocabulary is much smaller
and normally stays in a fixed scale.
3. The number of words contained in most queries, contexts and candidates is no more than 3.
The first observation indicates that we can build the index on the suffix array in advance and carefully organize its structure specifically for the search operation. The second and third observations tell us that a node in the index won't contain too many keys and that the index needs only a few levels, which means we can use a reasonable amount of memory to store the whole data structure.
As figure 2 shows, the index is organized as a multi-way tree. Each node in the tree contains an interval, and each edge is associated with a word. The construction of the tree is described in algorithm 1.
Theorem 3.1.1. The time complexity of algorithm 1 is O(N), where N is the length of the suffix array.
Proof. For each level of the tree, the algorithm needs to scan the whole suffix array and add at most O(N) edges, where each edge can be added in O(1) time by using a hashmap. Therefore it takes O(N) time to construct one level of the tree. Since the number of levels is fixed at 3 in our algorithm, the construction of the entire tree takes O(3 ∗ N) = O(N) time.
The worst-case space complexity of algorithm 1 is O(3 ∗ N). The space complexity in the average case is hard to estimate since it is highly related to the contents
Algorithm 1 Construct Indexed Tree
1: Input: A suffix array S
2: Output: An indexed tree T
3: Let L be the length of S
4: T ← CONSTRUCT(0,L,0)
5:
6: function CONSTRUCT(start,end,depth)
7: Create a new node R
8: R.left ← start
9: R.right ← end
10: if start = end or depth = 3 then
11: return R
12: end if
13: pw ← NULL
14: ps ← −1
15: for i = start to end do
16: p ← S[i]
17: w ← p[depth]
18: if w ≠ pw then
19: if pw ≠ NULL then
20: C ← CONSTRUCT(ps,i−1,depth+1)
21: R.addChild(pw,C)
22: end if
23: pw ← w
24: ps ← i
25: end if
26: end for
27: C ← CONSTRUCT(ps,end,depth+1)
28: R.addChild(pw,C)
29: return R
30: end function
of the corpus. Based on our experience, the memory consumption is usually much smaller than the worst-case value and is acceptable on a common commodity machine.
The tree produced by algorithm 1 has the following property:
Property: For a node u in the tree, let p(u) = w1w2...wn be the path from the root node to u, and (l, r) be the interval contained in u. Then all the suffixes whose positions in the suffix array are in (l, r) have p(u) as a prefix.
Since a phrase p in the corpus must be a prefix of some suffixes, we can get all its occurrences in the corpus by finding all the suffixes starting with p. Given the above property, this task can easily be done using algorithm 2.
Although for some long phrases algorithm 2 still needs to perform a binary search on the suffix array to further narrow down the range, most phrases are short, and their search can be done using only the index. Therefore, we can draw the following conclusion:
Algorithm 2 Search with Index
1: Input: An index I, a suffix array S and a phrase p
2: Output: An interval (s,e)
3: Let L be the number of words in p
4: Let R be the root node of I
5: u ← R
6: pos ← 0
7: while pos < L and u is not a leaf node do
8: v ← u.getChild(p[pos])
9: if v ≠ NULL then
10: u ← v
11: pos ← pos+1
12: else
13: return (−1,−2)
14: end if
15: end while
16: if pos = L then
17: return (u.left,u.right)
18: else
19: return BINARYSEARCH(S, p, pos,u.left,u.right)
20: end if
Theorem 3.1.2. The time complexity of algorithm 2 in searching for a phrase p with L (L ≤ 3) words is O(L).
Proof. The while loop from line 7 to line 15 executes at most L times. Since the operation of finding a child node by a given word in line 8 can be done with a hashmap, each execution of the loop takes O(1) time. After the loop, either pos is equal to L and the algorithm returns directly, or the last node is a leaf node and the binary search runs in O(L) time. Therefore, the algorithm takes at most O(L + L) = O(L) time.
For a phrase that doesn't appear in the corpus, algorithm 2 returns the pair (−1, −2), which indicates an empty interval. For a phrase that occurs at least once in the corpus, the algorithm returns an interval of the suffix array in which all the suffixes start with the given phrase. The system can then iterate over every position in the interval to do further computations for the phrase, such as finding contexts or extracting candidates.
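Algorithms 1 and 2 can be sketched together as follows, assuming the suffix array is given. This is an illustration of the technique, not the NeSS implementation, and it handles only phrases of at most three words; longer phrases would fall back to binary search as described above:

```java
import java.util.*;

// Sketch of Algorithms 1 and 2: a depth-limited multi-way tree over the
// suffix array. Each node stores an inclusive interval [left, right] of
// suffix-array indices, and the edge labels on the path from the root
// spell out the word-id prefix shared by every suffix in that interval.
public class SuffixIndex {
    static final int MAX_DEPTH = 3;

    static class Node {
        int left, right;                          // interval in the SA
        Map<Integer, Node> children = new HashMap<>();
        Node(int l, int r) { left = l; right = r; }
    }

    final int[] corpus, sa;
    final Node root;

    SuffixIndex(int[] corpus, int[] sa) {
        this.corpus = corpus;
        this.sa = sa;
        this.root = construct(0, sa.length - 1, 0);
    }

    // Algorithm 1: group the suffixes in [start, end] by the word at
    // 'depth' and recurse into each run. Runs are contiguous because the
    // suffix array is sorted lexicographically.
    Node construct(int start, int end, int depth) {
        Node r = new Node(start, end);
        if (depth == MAX_DEPTH) return r;
        int runStart = -1, prevWord = -1;
        for (int i = start; i <= end; i++) {
            int p = sa[i];
            if (p + depth >= corpus.length) continue; // suffix has no word here
            int w = corpus[p + depth];
            if (w != prevWord) {                      // a new run begins
                if (runStart != -1)
                    r.children.put(prevWord, construct(runStart, i - 1, depth + 1));
                prevWord = w;
                runStart = i;
            }
        }
        if (runStart != -1)                           // close the last run
            r.children.put(prevWord, construct(runStart, end, depth + 1));
        return r;
    }

    // Algorithm 2 for phrases of <= MAX_DEPTH words: walk one edge per
    // word. Returns {left, right} inclusive, or {-1, -2} if absent.
    int[] find(int[] phrase) {
        Node u = root;
        for (int w : phrase) {
            Node v = u.children.get(w);
            if (v == null) return new int[]{-1, -2};
            u = v;
        }
        return new int[]{u.left, u.right};
    }

    public static void main(String[] args) {
        int[] corpus = {1, 2, 3, 1, 2, 4};
        int[] sa = {0, 3, 1, 4, 2, 5};   // suffix array, sorted by hand
        SuffixIndex idx = new SuffixIndex(corpus, sa);
        int[] r = idx.find(new int[]{1, 2});
        System.out.println(r[1] - r[0] + 1); // 2 occurrences of "1 2"
    }
}
```

Each lookup touches one hashmap per word of the phrase, which is the O(L) bound of Theorem 3.1.2.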
3.2 Multi-threading
Since the search of different phrases in the corpus is in-
dependent, NeSS uses multiple threads to perform the
search in parallel to achieve a better performance. As
figure 3 demonstrates, the original system splits the suf-
fix array into several disjointed parts and assigns each
part to one of the threads. Each thread then iterates ev-
ery context of the input query and uses the part of suffix
array assigned to it to calculate the frequency of context
and extract candidates. The results produced by a thread
will be added into a global hashmap which is protected
by a global lock from being accessed by multiple threads
simultaneously. Although this method can improve the
performance of the system in some degree, the resulting
speedup and scalability are not good enough due to the
following reasons:
1. The time complexity of searching C contexts with average length L in a suffix array of length N is O(C ∗ log(N) ∗ L) using a single thread in the original system. When increasing the number of threads to T, the time complexity is reduced to O(C ∗ log(N/T) ∗ L). However, the latter is not much smaller than the former, since the log function decreases very slowly as its argument shrinks.
2. There is additional overhead from synchronization on the global hashmap. When the system uses more threads, this overhead also increases and restricts the parallelism.
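The first point can be made concrete with a small calculation for a hypothetical corpus of N = 10⁹ tokens split across T = 16 threads:

```java
// Illustration of why splitting the suffix array helps so little: the
// per-search log factor shrinks from log2(N) to log2(N/T) = log2(N) -
// log2(T), a subtraction of only log2(16) = 4 rather than a division by 16.
// N = 1e9 is a hypothetical corpus size chosen for the example.
public class LogScaling {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    public static void main(String[] args) {
        double n = 1e9;
        int t = 16;
        System.out.printf("log2(N)   = %.1f%n", log2(n));      // ~29.9
        System.out.printf("log2(N/T) = %.1f%n", log2(n / t));  // ~25.9
        System.out.printf("ratio     = %.2f%n", log2(n) / log2(n / t));
    }
}
```

A 16-way split thus cuts the search cost per context by only about 13%, not by a factor of 16.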
In addition, the multi-threaded design in the original system won't bring any benefit once we build an index on the suffix array, since the time to search for a phrase is no longer related to the size of the suffix array. To address the problems in the original design and take advantage of the index, we propose a new way to do the multi-threading.
Figure 3: The design of multi-threading in the original
system
As figure 4 shows, our method uses only one suffix array, which has the index built on it and is shared by all the threads. We split the contexts into several disjoint parts and assign each part to one of the threads. In this way, each thread is responsible only for its own contexts and uses the shared suffix array to do the searching. In addition, a thread first stores its results in its own hashmap and puts the data from the local hashmap into a global hashmap after finishing all of its computations.
Figure 4: The design of multi-threading in the new sys-
tem
When using T threads, our method reduces the time complexity of searching from O(C ∗ L) to O(C ∗ L/T), which is a much better speedup than the original method. Moreover, the use of per-thread hashmaps avoids most synchronization and is more cache friendly, since different cores can keep this data structure in their own caches without interfering with each other. Finally, the cost of synchronization on the global hashmap can be reduced by using Java's ConcurrentHashMap, which provides fine-grained locking that enables insertions of keys located in different buckets to be performed in parallel.
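A minimal sketch of this scheme, with candidate extraction reduced to simple counting and all names our own:

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of the revised multi-threading: the contexts are split across
// threads, each thread accumulates results in a private HashMap with no
// locking, and only one merge pass per thread touches the shared
// ConcurrentHashMap. The per-context work is stubbed out as counting.
public class ParallelMerge {
    public static Map<String, Integer> countInParallel(List<String> contexts,
                                                       int threads) {
        ConcurrentHashMap<String, Integer> global = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int chunk = (contexts.size() + threads - 1) / threads;
        List<Future<?>> futures = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            int from = t * chunk, to = Math.min(contexts.size(), from + chunk);
            if (from >= to) continue;
            List<String> slice = contexts.subList(from, to);
            futures.add(pool.submit(() -> {
                // Per-thread map: no locking, stays in this core's cache.
                Map<String, Integer> local = new HashMap<>();
                for (String c : slice) local.merge(c, 1, Integer::sum);
                // One merge pass per thread; ConcurrentHashMap's per-bin
                // locking lets distinct keys be merged in parallel.
                local.forEach((k, v) -> global.merge(k, v, Integer::sum));
            }));
        }
        try {
            for (Future<?> f : futures) f.get();   // wait for all workers
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
        pool.shutdown();
        return global;
    }

    public static void main(String[] args) {
        List<String> ctx = Arrays.asList("a", "b", "a", "c", "a", "b");
        Map<String, Integer> counts = countInParallel(ctx, 3);
        System.out.println(counts.get("a")); // 3
    }
}
```

The key design point is that synchronization happens once per thread at the end, rather than once per result.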
3.3 Candidate Search
As mentioned in Section 2.1, candidate search from contexts is one of the core and most time-consuming parts of the system. Among the three types of contexts (left contexts, right contexts and cradles, which are the combination of left and right contexts), candidate search from cradles takes most of the time. The reason is that a suffix array doesn't provide the functionality to search for two substrings a fixed distance apart in one search. Such cradle searches have to be done by searching for one side of the cradle and manually comparing the other side.
In the original system, the process of finding candidates from a cradle context can be described as follows. Denote a cradle context by L1L2L3QR1R2R3, where L1 to L3 are the left context of the cradle, R1 to R3 are the right context of the cradle and Q is the query phrase. For each cradle context found in the previous step, NeSS finds all the occurrences of L1 to L3. For each occurrence of L1 to L3, and for each valid candidate length, NeSS fetches the words with the length of the right context, beginning at L3 plus the candidate length. The fetched words are then compared with R1 to R3 to check whether this occurrence is a match of the whole cradle. In this example, if we represent one of the occurrences of L1 to L3 as L1L2L3W1W2W3W4W5, NeSS first fetches W2 to W4 for a supposed candidate length of one. W2 to W4 are then compared with R1 to R3. If W2 to W4 match R1 to R3, NeSS has found a match of the cradle context in the corpus, since both left and right contexts match. W1 is then compared with Q to see whether W1 is the query phrase. If not, W1 is regarded as a candidate and is added to the candidate table. In the next iteration, W3 to W5 are fetched in order to check supposed candidates of length two. This process keeps iterating until all valid candidate lengths are checked, before it moves on to the next occurrence of the left context of the cradle. An important detail of this process is that when W2 to W4 are fetched from the corpus for comparison with R1 to R3, a new Java array is allocated, and the contents are copied from the corpus array to the newly allocated array.
We address three problems in the process of finding
candidates from cradle contexts in the original imple-
mentation.
1. The loop over all valid candidate lengths fetches the same words and compares them with the right context several times. In the example, W4 is fetched three times while finding candidates of lengths one to three.
2. The allocation of a new array to return the words behind the left context is unnecessary. It involves an unnecessary heap memory allocation and sequential checks on its result. Also, copying the content from the corpus array to the new array introduces additional overhead.
3. Since heap memory is managed by the JVM instead of the programmer, the allocation of new arrays produces a large amount of garbage. If the heap size isn't large enough, this causes frequent garbage collections, degrading performance.
To solve these problems, we revised the algorithm used to find candidates from cradle contexts. We describe the revised algorithm below. For a cradle L1L2L3QR1R2R3, we find all occurrences of L1 to L3 using the suffix array. For each occurrence, we directly fetch all words after L1 to L3 in the corpus, which are W1 to W5. Then we perform a substring search of R1R2R3 in W1W2W3W4W5. For each substring match of R1R2R3, the words before the match are compared with the query phrase and, if different, regarded as a candidate, since both left and right contexts match. When fetching W1W2W3W4W5 from the corpus, instead of allocating a new array and copying the content, we directly pass the beginning and ending indices of the words.
Our algorithm differs from the original one in that we avoid fetching and comparing the same words from the corpus multiple times, thus reducing the number of operations needed. Also, we changed the way the supposed right context is fetched so that no unnecessary memory allocation or copying is needed. This reduces the overhead of allocating and copying the context, as well as the time spent on garbage collection.
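A sketch of the single-pass matching step, with hypothetical token ids and window bounds; the real system works on suffix-array positions rather than a toy array:

```java
import java.util.*;

// Sketch of the revised cradle matching: after locating an occurrence of
// the left context, we scan the following window once, looking for the
// right context as a subarray. The words before each match form the
// candidate. Only indices into the corpus are passed around; no arrays
// are allocated or copied per comparison.
public class CradleMatch {
    // Scan corpus[windowStart, windowEnd) for rightCtx; for each match at
    // position m, report the candidate span [windowStart, m). The match
    // must leave at least one candidate word, hence m > windowStart.
    static List<int[]> candidates(int[] corpus, int windowStart, int windowEnd,
                                  int[] rightCtx) {
        List<int[]> out = new ArrayList<>();  // {candStart, candEnd} pairs
        for (int m = windowStart + 1; m + rightCtx.length <= windowEnd; m++) {
            boolean match = true;
            for (int i = 0; i < rightCtx.length; i++)
                if (corpus[m + i] != rightCtx[i]) { match = false; break; }
            if (match) out.add(new int[]{windowStart, m});
        }
        return out;
    }

    public static void main(String[] args) {
        //               L1 L2 L3 | W1 W2 W3 W4 W5     right ctx = {8, 9}
        int[] corpus = {5, 6, 7,   1, 2, 8, 9, 3};
        // Window of words after the left-context occurrence at 0..2:
        List<int[]> c = candidates(corpus, 3, 8, new int[]{8, 9});
        for (int[] span : c)
            System.out.println("candidate = corpus[" + span[0] + ", " + span[1] + ")");
        // One candidate: corpus[3..4] = {1, 2}, followed by the right context.
    }
}
```

Each window position is examined once, regardless of how many candidate lengths it covers, which is where the repeated fetches of the original algorithm are saved.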
3.4 Punctuation Filter
After collecting contexts and candidates, NeSS filters out punctuation. The punctuation filter in the original implementation uses a regular expression match to determine whether a token is punctuation. However, the original code didn't take advantage of the fact that the regular expression for punctuation stays the same across the different contexts and candidates being filtered. The original implementation compiles the regular expression for each filtering operation, so the overhead of compiling the regular expression is incurred for each context and each candidate.
Our first optimization of the filter takes advantage of the unchanged regular expression and compiles a static pattern from it to filter out the punctuation.
Next, we further exploit the fact that the punctuation contexts and candidates are mostly one character long. We thus changed the code to eliminate the need for a regular expression match. We do this by first constructing an array of 255 boolean elements. Each boolean element stands for a character code in ASCII and represents whether that character is punctuation. At runtime, to check whether a character is punctuation, NeSS accesses the corresponding element of the boolean array. In this way, a regular expression match is replaced with an array lookup, which is much cheaper than the original implementation.
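Both filter optimizations can be sketched as follows. The actual regular expression used by NeSS is not reproduced, so `\p{Punct}` stands in for it, and the table here has 256 entries to cover every one-byte code:

```java
import java.util.regex.Pattern;

// Sketch of the two punctuation-filter optimizations: a Pattern compiled
// once and reused (instead of recompiled per token), and a boolean lookup
// table indexed by character code for the common one-character case.
public class PunctFilter {
    // Compiled once, shared by every filtering operation.
    private static final Pattern PUNCT = Pattern.compile("\\p{Punct}+");

    // One flag per one-byte character code; true means "is punctuation".
    private static final boolean[] IS_PUNCT = new boolean[256];
    static {
        for (char c = 0; c < 256; c++)
            IS_PUNCT[c] = PUNCT.matcher(String.valueOf(c)).matches();
    }

    static boolean isPunctuation(String token) {
        if (token.length() == 1) {               // fast path: array lookup
            char c = token.charAt(0);
            return c < 256 && IS_PUNCT[c];
        }
        return PUNCT.matcher(token).matches();   // rare multi-character case
    }

    public static void main(String[] args) {
        System.out.println(isPunctuation(","));     // true
        System.out.println(isPunctuation("word"));  // false
        System.out.println(isPunctuation("..."));   // true
    }
}
```

The regex cost is paid once, in the static initializer, after which the common case is a single bounds check and array read.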
4 Results
4.1 Environment
We tested our optimized NeSS on Elastic Compute Cloud (EC2) in Amazon Web Services. We launched a c4.8xlarge instance, which has 36 virtual CPUs and 60GB of memory. We used a 2.2GB document from the English Gigaword Fifth Edition [11], an archive of newswire text data, as our corpus input.
4.2 Overall Performance
Figure 5: The performance comparison between the orig-
inal system and our system with all the optimizations
We randomly selected six phrases from the test phrases in Gupta's paper [6]. We ran the optimized and the original systems with the query phrases several times to get an average runtime for pulling near-synonyms. Figure 5 shows the runtime comparison between the original system and our optimized one. The blue bars are the average runtimes in seconds of the original NeSS with the different query phrases, while the green bars represent the average runtimes of the optimized system. We marked the speedups, which are the runtimes of the original system divided by those of our optimized system. We also included error bars to represent the max and min values in our tests.
From Figure 5, we can observe a speedup ranging
from 17x to 41x, depending on the query phrase. The
average speedup across different query phrases is 30x.
The runtimes across multiple runs against the same query
phrase are quite stable, as shown by the error bars.
The speedup achieved by our optimization changed
the latency of searching a single query phrase from sev-
eral minutes to seconds, which allows a user to interac-
tively search near synonyms on-line.
4.3 Performance Impact of each Optimiza-
tion
In this section we analyze the performance impact of each single optimization we applied. We evaluate the change in performance by adding optimizations one at a time on top of the previously added optimizations, in the same order as in Section 3. For example, when evaluating the impact of optimizing the punctuation filter, we compare the final version with the version having the first three optimizations. The reason we evaluate in a cumulative way is that the impact of a later optimization is noticeable only if the previously dominant hotspot has been removed by earlier optimizations, so that the currently applied one addresses the hotspot. This process of evaluation mirrors our optimization process: once a hotspot is resolved by applying an improvement technique, we find the next hotspot and apply another optimization.
In the following figures, we refer to the index on the
suffix array as Optimization 1, the improvement on the
multi-threading as Optimization 2, the modifications in
candidate search as Optimization 3 and the changes in
punctuation filter as Optimization 4.
Figure 6: The performance comparison between the orig-
inal system and our system with optimization 1
The impact of building the index on the suffix array is shown in Figure 6. The notation of the figure is the same as in the previous section. The speedups vary from 3.5x to 6.5x, with an average of 4.65x. The speedup due to the index is highly dependent on the corpus size, since the index reduces the complexity of substring search from O(L + log(N)) to O(L). We expect to see a further speedup with a larger corpus.
Figure 7: The performance comparison between the sys-
tem with optimization 1 and the one with optimization
1,2
Figure 7 shows the performance impact of improving the multi-threading. As mentioned earlier, the comparison is between the version with the first two optimizations and the version with only the index optimization. The speedups vary from 2.8x to 5x, with an average of 3.3x. We ran the system with the same number of cores, 36, against the same set of query phrases. The speedup shows that the improved system scales better than the original implementation.
Figure 8: The performance comparison between the sys-
tem with optimization 1,2 and the one with optimization
1,2,3
As shown by Figure 8, the speedup due to the modification of the candidate search is small. This is because the candidate search dominates the runtime only in the shared-context ranking phase, while the KL-divergence computation takes most of the time. Users can choose to disable the KL-divergence computation, and in that case the speedup of this optimization is approximately 1.2x.
Figure 9: The performance comparison between the sys-
tem with optimization 1,2,3 and the one with optimiza-
tion 1,2,3,4
The impact of changing the punctuation filter is shown
in Figure 9. This optimization leads to approximately 2x
speedup on average.
4.4 Scalability
Figure 10: The scalability of our system under 36 cores
We evaluate the scalability of the optimized system with different core counts by computing speedups relative to the single-core runtime, as shown in Figure 10. Our improved system scales near-linearly with core counts of 4 or fewer. However, we observe little improvement when we keep increasing the number of cores above 16. We discuss the reasons in Section 5.2.
5 Discussions
5.1 Optimizations with Little Effect
The original system uses Java's HashMap to keep all the contexts and candidates, as well as their scores. In our optimized version, we also use HashMap to store each level of the suffix array index. We tried to replace the Java HashMap with the Trove hash map [1]. The idea of the Trove hash map is that it avoids the allocation of boxed objects for primitive types (Java Integer, for example), and thus saves memory. It also claims to have better hash functions and thus better performance. However, after switching to the Trove hash map, we didn't observe a significant performance gain. The reason might be that the hash maps we build have integer keys, so hash functions don't play a big role. Also, Java 7 improved the performance of HashMap.
We also tried to think like a compiler and optimize away some unnecessary code. For example, in the original NeSS, there was a while loop that only called a function inside it. The function contained an if statement whose condition was never true. We eliminated the while loop, but didn't see any performance gain. We suspect the reason is that the Java JIT compiler had already optimized away the code that would never be executed.
The third optimization that didn't work was parallelizing functions that take a very short time to execute. In such cases, the overhead of creating threads and context switching defeats the benefits of parallel execution.
Fine-grained locking in compute-intensive parts also had little effect. For example, when building the candidate map during candidate ranking, the computation time is much longer than the time to put the result into a shared HashMap. In this case, changing the HashMap to a ConcurrentHashMap doesn't help much, since threads rarely block to acquire the lock.
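The pattern the system relies on instead, thread-local maps merged once at the end, can be sketched as follows. The counting "score" here is a stand-in for the real candidate-scoring computation, and all names are ours:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Each thread scores its own slice of candidates into a private HashMap,
// then merges once into the shared map. The shared map is touched T times
// instead of once per candidate, so lock contention is negligible.
class LocalThenMerge {
    static Map<Integer, Double> scoreAll(int[] candidates, int threads) {
        ConcurrentHashMap<Integer, Double> global = new ConcurrentHashMap<>();
        Thread[] ts = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            ts[t] = new Thread(() -> {
                Map<Integer, Double> local = new HashMap<>();
                for (int i = id; i < candidates.length; i += threads) {
                    // stand-in for the expensive scoring computation
                    local.merge(candidates[i], 1.0, Double::sum);
                }
                // single synchronized merge per thread
                local.forEach((k, v) -> global.merge(k, v, Double::sum));
            });
            ts[t].start();
        }
        for (Thread th : ts) {
            try { th.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return global;
    }

    public static void main(String[] args) {
        int[] cands = {1, 2, 1, 3, 2, 1};
        Map<Integer, Double> scores = scoreAll(cands, 2);
        if (scores.get(1) != 3.0) throw new AssertionError();
        System.out.println("ok");
    }
}
```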
5.2 Scalability
Our improved NeSS scales well up to 4 cores, but shows little further improvement when the core count increases above 16. We suspect three reasons:
1. The sequential part of the system limits the speedup from multi-threading. Several parts of the algorithm must run sequentially, such as finding the contexts of the query phrase and ranking the candidates. According to Amdahl's law [2], the speedup from parallel execution is limited by the sequential fraction of the algorithm.
2. Contention on the locks of the shared data structure limits the benefits of multi-threading. The data structure that stores all candidates and their associated matching contexts is a two-level HashMap. The organization required by the two-level HashMap conflicts with the access pattern of the execution, which prevents fine-grained locking. Thus we see a bottleneck in candidate ranking as the core count increases.
3. Memory bandwidth limits the maximum throughput of the program. Since the system performs many memory accesses but little mathematical computation, we believe it becomes memory-bound once the number of cores is high enough to make the computation fast.
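The limit Amdahl's law imposes can be made concrete. The sketch below assumes a hypothetical 90% parallelizable fraction (an assumption for illustration, not a measured property of NeSS), which would already cap the 16-core speedup near 6.4x:

```java
// Amdahl's law: with parallelizable fraction p, the speedup on n cores
// is 1 / ((1 - p) + p / n). The fraction below is an assumed figure for
// illustration, not a measurement of NeSS.
class Amdahl {
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        double p = 0.9; // assume 90% of the work parallelizes
        System.out.printf("16 cores: %.2fx%n", speedup(p, 16));      // ~6.4x
        System.out.printf("limit:    %.2fx%n", speedup(p, 1 << 20)); // approaches 1/(1-p) = 10x
    }
}
```

Even with unlimited cores, the speedup can never exceed 1/(1 − p), so the sequential fraction dominates once the parallel portion is fast.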
6 Related Work
NeSS finds near-synonym phrases based on an unsupervised corpus-based model. Several papers address work related to finding near-synonyms and to our optimization techniques.
In Mitchell and Lapata's paper [10], a set of composition functions was proposed to combine the vectors of the words in a phrase into a single vector. Reddy et al. [13] argued that not all features are relevant to a phrase, and further presented ways to select the relevant senses of the words in a phrase. However, these papers decompose phrases into words and analyze the semantics at the word level. They ignore the case where a phrase as a whole has a completely different meaning from the meanings of its individual component words.
Some approaches find synonyms through parallel resources. Methods based on a monolingual text corpus, such as Discovery of Inference Rules from Text (DIRT) [8], spot paraphrases that share the same interpretation in a foreign language. However, this may also find phrases unrelated to the original phrase. Ganitkevitch et al. [5] used monolingual distributional similarity to rerank the extracted paraphrases. They built a webpage that responds to paraphrase queries by looking up interpretations in foreign languages in a database. However, this approach needs parallel resources and can only search for phrases that are present in the database.
In addition, Pasca [12] introduced an unsupervised method to retrieve near-synonyms from arbitrary web text using linguistically motivated text anchors identified in the context of documents. The quality of the paraphrases can be further improved by a filtering mechanism that uses a set of categorized names from online documents. However, this method requires predefined, document-dependent linguistic patterns. The documents also need language-specific resources such as part-of-speech taggers.
The Near-Synonym System (NeSS) [6], which we aim to optimize, differs from previous paraphrasing systems in that it doesn't need parallel resources like PPDB [5] or predefined patterns like the method introduced by Pasca [12]. The algorithm in NeSS selects near-synonymic candidates by identifying common surrounding context, based on an extension of Harris's Distributional Hypothesis [7]: words that occur in the same contexts tend to have similar meanings. To identify the contexts in which the query phrase occurs, NeSS uses a suffix array for lookup.
A suffix array [9] is an array of the suffixes of a string. The array is sorted, so searching for a substring in the original string takes O(P + log(N)) time, where P is the length of the query string and N is the size of the suffix array. In the context of NeSS, the minimum token is a word and the string is the whole corpus. As the corpus grows, the runtime of context and candidate searches also increases.
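A minimal sketch of suffix-array lookup over a tokenized corpus (our own illustration, not the NeSS code): two binary searches return all occurrences of a phrase as one contiguous interval of the sorted array.

```java
import java.util.Arrays;

// Minimal suffix-array search over a tokenized corpus of word ids.
// Two binary searches find the first and last suffixes that start with
// the query, so all occurrences come back as one contiguous interval.
class SuffixArraySearch {
    // Compare the query against the suffix starting at corpus position s.
    static int cmp(int[] corpus, int s, int[] q) {
        for (int i = 0; i < q.length; i++) {
            if (s + i >= corpus.length) return -1; // suffix is shorter than query
            if (corpus[s + i] != q[i]) return corpus[s + i] < q[i] ? -1 : 1;
        }
        return 0; // query is a prefix of this suffix
    }

    static int[] buildSuffixArray(int[] corpus) {
        // Naive O(N^2 log N) construction; fine for a sketch, not a 2.2GB corpus.
        Integer[] sa = new Integer[corpus.length];
        for (int i = 0; i < sa.length; i++) sa[i] = i;
        Arrays.sort(sa, (a, b) -> {
            int i = a, j = b;
            while (i < corpus.length && j < corpus.length) {
                if (corpus[i] != corpus[j]) return Integer.compare(corpus[i], corpus[j]);
                i++; j++;
            }
            return Integer.compare(corpus.length - a, corpus.length - b); // shorter first
        });
        int[] out = new int[sa.length];
        for (int i = 0; i < sa.length; i++) out[i] = sa[i];
        return out;
    }

    // Returns the interval [lo, hi) in the suffix array; empty if absent.
    static int[] search(int[] corpus, int[] sa, int[] q) {
        int lo = 0, hi = sa.length;
        while (lo < hi) { int m = (lo + hi) >>> 1; if (cmp(corpus, sa[m], q) < 0) lo = m + 1; else hi = m; }
        int start = lo;
        hi = sa.length;
        while (lo < hi) { int m = (lo + hi) >>> 1; if (cmp(corpus, sa[m], q) <= 0) lo = m + 1; else hi = m; }
        return new int[]{start, lo};
    }

    public static void main(String[] args) {
        int[] corpus = {1, 2, 3, 1, 2, 4, 1, 2, 3}; // word ids
        int[] sa = buildSuffixArray(corpus);
        int[] r = search(corpus, sa, new int[]{1, 2});
        System.out.println(r[1] - r[0]); // occurrences of the phrase "1 2"
    }
}
```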
Another approach to substring search is the suffix tree [14]. Intuitively, a suffix tree is a trie of the suffixes: the path to each node of the tree is a prefix of a suffix, and if the node is a leaf, the path is a suffix. If the lookup table in each node is maintained with hash maps, searching for a phrase takes O(P) time, where P is the number of words in the query. Although the suffix tree provides faster lookup, it consumes much more memory than a suffix array. Also, depending on the implementation, the data structure may not exploit caching as well as a suffix array does.
Hashing is another way to do quick searches. Locality-sensitive hashing [3] ensures that the probability of two strings hashing to the same bucket is proportional to their similarity. If applied to NeSS, the problem of finding a substring could be converted into finding a similar string. Although the algorithm provides constant-time string lookup on average, defining similarity is not trivial, and using locality-sensitive hashing loses some accuracy.
Our indexed suffix array differs from a standard suffix array in that the index can look up the first three words of the query phrase in O(3), i.e., constant, time. Since nearly all suffix array lookups involve no more than three words, the indexed suffix array achieves constant-time substring lookup. It also differs from the suffix tree in that it doesn't consume as much memory.
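The index idea can be sketched with a single level (the real system nests three): map each distinct first word id to its contiguous interval of suffix-array positions, built in one linear pass because suffixes sharing a first word are adjacent in the sorted array. All names here are ours:

```java
import java.util.HashMap;
import java.util.Map;

// One-level sketch of the suffix-array index: each distinct first word id
// maps to the contiguous interval of suffix-array positions whose suffixes
// start with it. The real system nests this three levels deep, so phrases
// of up to three words resolve in constant time with no binary search.
class IndexedSuffixArray {
    final int[] corpus, sa;
    final Map<Integer, int[]> firstWord = new HashMap<>();

    IndexedSuffixArray(int[] corpus, int[] sa) {
        this.corpus = corpus;
        this.sa = sa;
        // One linear scan: suffixes sharing a first word are adjacent
        // because the suffix array is sorted lexicographically.
        int start = 0;
        for (int i = 1; i <= sa.length; i++) {
            if (i == sa.length || corpus[sa[i]] != corpus[sa[start]]) {
                firstWord.put(corpus[sa[start]], new int[]{start, i});
                start = i;
            }
        }
    }

    // O(1) lookup of all suffixes beginning with the given word id;
    // (-1, -2) denotes an empty interval, as in our search routine.
    int[] lookup(int wordId) {
        return firstWord.getOrDefault(wordId, new int[]{-1, -2});
    }

    public static void main(String[] args) {
        int[] corpus = {1, 2, 3, 1, 2};
        int[] sa = {3, 0, 4, 1, 2}; // suffix array of corpus, computed by hand
        IndexedSuffixArray idx = new IndexedSuffixArray(corpus, sa);
        int[] r = idx.lookup(1);
        System.out.println(r[1] - r[0]); // occurrences of word 1
    }
}
```

Memory stays modest because the index only stores one interval per distinct word prefix, rather than a node per corpus position as a full suffix tree does.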
7 Conclusions
In this report, we introduced the Near-Synonym System (NeSS) and addressed several of its performance problems: long latency for user requests, unsatisfactory scalability, and a complexity highly dependent on corpus size. We presented an optimized version of NeSS that solves these problems by building an index on the suffix array, changing the approach to parallelism, improving the efficiency of candidate search, and optimizing the punctuation filter. The experiments showed a speedup of approximately 20x-40x compared to the original implementation. Our optimized NeSS demonstrated near-linear scalability with 8 cores or less.
References
[1] Trove high performance collections for Java. http://trove.starlight-systems.com/overview.
[2] AMDAHL, G. M. Validity of the single processor approach to
achieving large scale computing capabilities. In Proceedings of
the April 18-20, 1967, spring joint computer conference (1967),
ACM, pp. 483–485.
[3] CHARIKAR, M. S. Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing (2002), ACM, pp. 380–388.
[4] CURRAN, J. R. From distributional to semantic similarity.
[5] GANITKEVITCH, J., VAN DURME, B., AND CALLISON-
BURCH, C. Ppdb: The paraphrase database. In HLT-NAACL
(2013), pp. 758–764.
[6] GUPTA, D., CARBONELL, J., GERSHMAN, A., KLEIN, S., AND
MILLER, D. Unsupervised phrasal near-synonym generation
from text corpora.
[7] HARRIS, Z. S. Distributional structure. Springer, 1970.
[8] LIN, D., AND PANTEL, P. Discovery of inference rules from
text, Dec. 5 2006. US Patent 7,146,308.
[9] MANBER, U., AND MYERS, G. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing 22, 5 (1993), 935–948.
[10] MITCHELL, J., AND LAPATA, M. Vector-based models of se-
mantic composition. In ACL (2008), pp. 236–244.
[11] PARKER, R., GRAFF, D., KONG, J., CHEN, K., AND MAEDA, K. English Gigaword Fifth Edition, June. Linguistic Data Consortium, LDC2011T07 (2011).
[12] PAŞCA, M. Mining paraphrases from self-anchored web sentence fragments. In Knowledge Discovery in Databases: PKDD 2005. Springer, 2005, pp. 193–204.
[13] REDDY, S., KLAPAFTIS, I. P., MCCARTHY, D., AND MAN-
ANDHAR, S. Dynamic and static prototype vectors for semantic
composition. In IJCNLP (2011), pp. 705–713.
[14] WEINER, P. Linear pattern matching algorithms. In IEEE Conference Record of the 14th Annual Symposium on Switching and Automata Theory (SWAT 1973) (1973), IEEE, pp. 1–11.
11

Optimizing Near-Synonym System

  • 1.
    Optimizing Near-Synonym System SiyuanZhou and Zichang Feng Carnegie Mellon University Abstract Phrasal near-synonym extraction is crucial to AI tasks such as natural language processing. Near-Synonym System(NeSS) is a corpus-based model for finding near- synonym phrases, but suffers from performance prob- lems. This report presents an optimized version of NeSS that builds an index on the suffix array to reduce the complex- ity dependency on corpus size and uses an efficient ap- proach for parallel execution to improve the scalability. We applied several other techniques along with the in- dexed suffix array to achieve an approximately 20x-40x speedup. We further did experiments to break down the speedup brought by different optimization approaches. 1 Introduction Synonymy has various degrees ranging from complete contextual substitutability to near-synonymy [4]. The word length of the synonymy can also range from single- word synonyms to multi-word synonyms or phrasal near- synonyms. The later one has to consider the semantics of the combination of multiple words, instead of solely the meaning of each words in the phrase. For example, it is fair to say is a phrasal near-synonym to the phrase we all understand. However, the individual components of the two phrases are not directly related to each other. Phrasal near-synonym extraction is very important in natural lan- guage processing, information retrieval, text summariza- tion and other AI tasks[6]. Near-Synonym System (NeSS)[6], the system we aim to optimize, is an unsupervised corpus-based model for finding phrasal synonyms and near synonyms based on a large corpus. It differs from other approaches since it doesnt require parallel resources or use pre-determined sets of patterns. Instead of storing the mapping of near- synonyms in databases, given a query phrase, NeSS gen- erates the near synonyms at runtime. 
NeSS selects near- synonymic candidates by identifying common surround- ing context based on an extension of Harris Distribu- tional Hypothesis[7], which states that words that occur in the same contexts tend to have similar meanings. To be more specific, NeSS tokenizes the corpus and constructs a suffix array at its initialization phase. Upon receiving a query phrase, it searches all occurrences of the query phrase in the corpus as contexts using the suffix array. It then finds all the candidates of near-synonyms by searching the contexts in the corpus by suffix array. The ranking of the candidates is based on the number of matching contexts between candidates and the query phrase and how they are matched. Since NeSS finds near-synonyms dynamically from the corpus, the performance (in terms of latency of a sin- gle query phrase) becomes a big challenge. NeSS needs to be a real-time on-line service since the huge amount of possible queries make it impossible to do off-line pre- process. We address three performance problems in the original NeSS: 1. Part of the code has low efficiency, and thus leads to long latency to process a user request. In the orig- inal system, it takes three to four minutes to pull the near synonyms of a single query phrase on a 16 core machine. Apparently, this latency isnt accept- able for the purpose of a real-time on-line service. 2. The system doesnt scale well with number of cores. To be more specific, the original system only has only 1.1 - 1.5 speedups with selected query phrases when running on 16 cores compared to one cores. To make the problem worse, original NeSS takes longer time with 36 cores compared to with s a sin- gle core on part of the phrases. 3. The complexity of the original system doesn’t scale when the corpus size increases. However, a larger corpus size leads to more accurate results for near- synonym searches. Thus the system will take much
  • 2.
    longer to pulla result if a user needs better accuracy. We present an optimized version of NeSS system that allows real-time interactive query for near synonyms phrases. Our contributions can be concluded as the fol- lowing. Firstly, we carefully optimized some of the im- plementation details to allow faster computations with- out losing the accuracy of the system. We built an in- dex on top of the suffix array in the original system to allow O(L) time to search all the occurrences of a sub- string, where L is the length of the query string. We modified the algorithm in fetching candidates to improve efficiency. Besides major modifications to the original design, we also made optimizations such as punctuation filtering. Secondly, we changed the way that the system is parallelized to improve the scalability on multi-core machines. Our optimized system gets 6x speed-up when running with 16 cores compared to that with one core. Thirdly, we avoided splitting the suffix array into mul- tiple parts such that the system has an overall view of the corpus. The original system splits the suffix array into multiple parts and let each thread hold one part of the suffix arrays. However, this leads to different scoring and ranking of the candidates. We believe that the results from one whole piece of suffix is the most accurate one, and thus only keep one copy of the suffix array globally. Finally, we changed the complexity of the algorithm to be less dependent to the length of the corpus size. We achieved this by building an index on the suffix array such that searching a substring takes O(L) time instead of O(L + log(N)) where L is the length of the substring and N is the length of the corpus. Therefore, the system can achieve better accuracy by using a larger corpus as input without sacrificing too much performance. In this report, we introduce background in Section 2. We describe our four optimizations in Section 3. 
The results and evaluations will be presented in Section 4. We then discuss our results in 5 and summarize related works in Section 6. Finally, we conclude our work in 7. 2 Background Near-Synonym System (NeSS)[6], the system we aim to optimize, is an unsupervised corpus-based model for finding phrasal synonyms and near synonyms based on a large corpus. NeSS selects near-synonymic candidates by identifying common surrounding context based on an extension of Harris Distributional Hypothesis[7]. The idea of this hypothesis is that words that occur in the same contexts tend to have similar meanings. To iden- tify the contexts that the query phrase occurs, the NeSS system used a suffix array[9] for look-up. In this section, we briefly describe NeSS and the suffix array. 2.1 Near-Synonym System At the initialization of NeSS, it accepts documents as the input. The documents are preprocessed and concatenated to form a large corpus. NeSS converts the words in the document into tokens and assigns each word a unique word id, and keeps a dictionary for the mapping between words and ids. The ids identify the words and avoids string comparison when searching for a word. Viewing the corpus as a long string, NeSS builds a suffix array of the corpus. The suffix array helps substring search when finding near synonyms of the query phrase. Given the query phrase, NeSS searches the occurrence in the corpus by finding the query phrase as a substring in the corpus as a long string. Since the suffix array is a sorted lexicographically, all occurrences are returned as a range of indices of the suffix array. The surrounding words can be then fetched based on the positions of the query phrase in the corpus. The surrounding words are referred to as contexts, and can be classified into left con- text, right context and cradle, which is the left and right contexts. After the contexts are filtered, they are used to search candidates. 
Candidates are the phrases that share the same left context, right context or cradle with the query phrase. Given a found candidate, each matching candi- date will contribute a score to the candidate. For left and right context, this is done by again searching the con- texts in the corpus and fetching the words next to the contexts. Finding candidates between cradles takes more effort. First, the occurrences of the left context in the cradle is searched using the suffix array. Next, for each occurrence and for each valid candidate phrase length, the words behind the occurrence of the left context are fetched. The right context in the cradle is then compared with the words behind the supposed candidate. If they match, the supposed candidate is a real candidate. Since there are plenty of contexts, finding candidates is one of the most time consuming function in the system, espe- cially candidates between cradles. The candidates will then be ranked based on the afore- mentioned score. Top N candidates will be returned to the user as the near synonym phrases, where N is a pa- rameter defined by the user. In addition, part of the top candidates will then be directed for the KL-divergence computation. The KL-divergence gives a more reliable ranking of the candidates. One thing to note is that, since the computation of KL-divergence involves a lot of math- ematical operations, its also one of the hot spot in term of the runtime. NeSS parallelizes on a multi-core machine by splitting the suffix array into multiple parts, and let one thread be responsible for searching substrings of the partial suf- fix array and subsequent operations associated with the 2
  • 3.
    Figure 1: Overviewof the process of searching near synonyms[6] matching substrings. For example, in the process of find- ing candidates from contexts, each thread finds contexts from its partial suffix array and then finds matching can- didates the contexts it just found. The reason of why NeSS parallelizes this way is based on the fact that the suffix array search is one of the most time consuming function in the system. However, this leads to only par- tial results when scoring and ranking the candidates. One proof of this is that NeSS generates different results on the same corpus with different number of threads, and thus different number of suffix array splits. 2.2 Suffix Array Suffix array[9] is a data structure that allows substring search in O(P + log(N)) time, where P is the length is the length of the substring and N is the length of the whole string. Compared to suffix tree, suffix array consumes much less memory in proactive. A suffix array is a sorted array of all suffixes of the original array. Searching a substring can be done by performing a binary search on the suffix array. In NeSS, the suffix array is built from the whole cor- pus. The process of finding the occurrences of query phrases or contexts, can be converted to searching a sub- string in a long string (corpus). Since the words in the corpus are tokenized to ids, one word in the corpus is equivalent to a char in the suffix array. Since when NeSS searches a phrase, it needs all occurrences of it, the suf- fix array search in NeSS will return a range of suffix ar- ray indices. The indices of occurrences in suffix array are contiguous because the suffix array is sorted lexico- graphically. 3 Optimizations 3.1 Index on Suffix Array Since searching and counting are the core operations in Ness, and both of them rely on the functionality pro- vided by the suffix array, the performance of suffix array would greatly affect the performance of the entire sys- tem. 
Therefore, we first focus on accelerating the search on suffix array. The method we use is to create a multi-level index on the suffix array. Its feasibility is based on the following observations: 1. The only operation performed on the suffix array is to search a phrase. There’s no insertion or dele- tion in the runtime. Thus the suffix array’s structure 3
  • 4.
    Figure 2: Thestructure of an index won’t change once it’s created in the initialization phase. 2. Although the number of words in the corpus is very large and will increase quickly as the corpus be- comes larger, the size of vocabulary is much smaller and normally stays in a fixed scale. 3. The number of words contained by most of the queries, contexts and candidates are no more than 3. The first observation indicates we can build the index on suffix array in advance and carefully organize its struc- ture to specifically optimize for search operation. The second and third observations tell us a node in the index won’t contain too many keys and the index only needs few levels, which means we can use a reasonable amount of memory to store the whole data structure. As figure 2 shows, the index is organized as a multi- way tree. Each node in the tree contains an interval, and each edge is associated with a word. The construction of the tree is described in algorithm 1. Theorem 3.1.1. The time complexity of algorithm 1 is O(L) where L is the length of suffix array. Proof. For each level of the tree, the algorithm needs to scan the whole suffix array and add at most O(L) edges, where each edge can be added in O(1) time by using hashmap. Therefore it will take O(L) time to construct one level of the tree. Since the number of levels is fixed at 3 in our algorithm, the construction of the entire tree will take O(3∗L) = O(L) time. The worst case space complexity of algorithm 1 would be O(3 ∗ L). 
The space complexity in average cases is hard to estimate since it’s highly related to the contents Algorithm 1 Construct Indexed Tree 1: Input: A suffix array S 2: Output: An indexed tree T 3: Let L be the length of S 4: T ← CONSTRUCT(0,L,0) 5: 6: function CONSTRUCT(start,end,depth) 7: Create a new node R 8: R.left ← start 9: R.right ← end 10: if start = end or depth = 3 then 11: return R 12: end if 13: pw ← NULL 14: ps ← −1 15: for i = start to end do 16: p ← S[i] 17: w ← p[depth] 18: if w = pw then 19: if pw = NULL then 20: C ← CONSTRUCT(ps,i−1,depth+1) 21: R.addChild(pw,C) 22: end if 23: pw = w 24: ps = i 25: end if 26: end for 27: C ← CONSTRUCT(ps,end,depth+1) 28: R.addChild(pw,C) 29: return R 30: end function of corpus. Based on our experience, the memory con- sumption is mostly smaller than the value in worst case and is acceptable to a common commodity machine. The tree produced by algorithm 1 has the following property: Property: For a node u in the tree, let p(u) = w1w2...wn be the path from root node to u, and (l,r) be the interval contained in u, then all the suffixes whose po- sitions in the suffix array are in (l,r) have p(u) as their prefixes. Since a phrase p in the corpus must be the prefixes of some suffixes, we can get all its occurrences in the corpus by finding all the suffixes starting with p. Given the above property, this task can be easily done by using algorithm 2. Although for some long phrases, algorithm 2 still needs to perform a binary search on the suffix array to further narrow down the range, most of the phrases have a short length, and the search of them can be done by only using the index. Therefore, we can get the follow- ing conclusion: 4
  • 5.
    Algorithm 2 Searchwith Index 1: Input: An index I, a suffix array S and a phrase p 2: Output: An interval (s,e) 3: Let L be the number of words in p 4: Let R be the root node of I 5: u ← R 6: pos ← 0 7: while pos < L and u is not a leaf node do 8: v ← u.getChild(p[pos]) 9: if v = NULL then 10: u ← v 11: pos ← pos+1 12: else 13: return (−1,−2) 14: end if 15: end while 16: if pos = L then 17: return (u.left,u.right) 18: else 19: return BINARYSEARCH(S, p, pos,u.left,u.right) 20: end if Theorem 3.1.2. The time complexity of algorithm 2 in searching a phrase p with L(L ≤ 3) words is O(L). Proof. The while loop from line 7 to line 15 will execute at most L times. Since the operation of finding a child node by a given word in line 8 can be done with hashmap, each execution of the loop will take O(1) time. After the loop, either pos is equal to L thus the algorithm can directly return, or the last node is a leaf node thus the binary search would run in O(L) time. Therefore, the algorithm will take at most O(L+L) = O(L) time. For a phrase that doesn’t appear in the corpus, algo- rithm 2 will return a pair(−1,−2) which indicates an empty interval. For a phrase that occurs at least one time in the corpus, the algorithm will return an interval in the suffix array where all the suffixes starts with the given phrase. The system can then iterate every position in the interval to do further computations for a phrase such as finding contexts or extracting candidates. 3.2 Multi-threading Since the search of different phrases in the corpus is in- dependent, NeSS uses multiple threads to perform the search in parallel to achieve a better performance. As figure 3 demonstrates, the original system splits the suf- fix array into several disjointed parts and assigns each part to one of the threads. Each thread then iterates ev- ery context of the input query and uses the part of suffix array assigned to it to calculate the frequency of context and extract candidates. 
The results produced by a thread will be added into a global hashmap which is protected by a global lock from being accessed by multiple threads simultaneously. Although this method can improve the performance of the system in some degree, the resulting speedup and scalability are not good enough due to the following reasons: 1. The time complexity to search C contexts with av- erage length L in a suffix array with length N is O(C ∗ log(N) ∗ L) by using a single thread in the original system. When increasing the number of threads to T, the time complexity is reduced to O(C ∗log(N/T)∗L). However, the latter complex- ity is not much smaller than the former one since the value of log() function decreases very slowly with its parameter. 2. There’s additional overhead on the synchronization of the global hashmap. When the system uses more threads, the overhead will also increase and restricts the parallelism. In addition, the multi-threaded design in the original sys- tem won’t bring any benefit if we build an index on the suffix array since the time to search a phrase will no longer be related to the size of suffix array. To address the problems in the original design and take advantage of the index, we proposed a new way to do the multi-threading. Figure 3: The design of multi-threading in the original system As figure 4 shows, our method uses only one suffix array which has index built on it and is shared by all the threads. We split the contexts into several disjointed parts and assign each part to one of the thread. In this way, each thread is only responsible for its own contexts and uses the shared suffix array to do searching. In addition, a thread will first store the results in its own hashmap and put the data in local hashmap into a global hashmap after finishing all the computations. 5
  • 6.
    Figure 4: Thedesign of multi-threading in the new sys- tem When using T threads, our method can reduce the time complexity of searching from O(C ∗ L) to O(C ∗ L/T), which is a much better speedup compared to the original method. Moreover, the use of per-thread hashmap also avoids most of the synchronizations and is more cache friendly since different cores can store this data struc- ture in its own cache without interfering each other. Fi- nally, the cost of synchronization on the global hashmap can be reduced by using the ConcurrentHashmap in Java, which provides a fine-grained lock that enables the inser- tion of keys which are located in different buckets to be performed in parallel. 3.3 Candidate Search As mentioned in Section 2.1, candidate searches from context is one of the core and the most time consum- ing part in the system. Among three types of the con- texts: left contexts, right contexts and cradles, which are the combining of left and right contexts, the candidate searches from cradles takes most of the time. The reason for this is that suffix array doest provide such functional- ity to search two substrings apart from each other with a fixed length in one search. Such cradle searches have to be done by searching one side of the cradle and manually comparing the other side. In the original system, the process of finding candidate from a cradle context can be described as the following. If we note a cradle context by L1L2L3QR1R2R3, where L1 to L3 are the left context of the cradle, R1 to R3 are the fight context of the cradle and Q is the query phrase. For each cradle context found in the previous step, NeSS finds all the occurrences of L1 to L3. For each occurrence of L1 to L3, and for each valid candidate length, NeSS will fetch the words with the length of the right context, beginning at L3 plus the candidate length. The fetched words are then compared with R1 to R3 to check whether this occurrence is a match of the whole cradle. 
In this example, if we represent one of the occurrences of L1 to L3 to be L1L2L3W1W2W3W4W5, NeSS will first fetch W2 to W4 for a supposed candidate length of one. W2 to W4 will then be compared with R1 to R3. If W2 to W4 match R1 to R3, NeSS has found an match of the cradle context in the corpus, since both left and right contexts match. W1 will then be compared with Q to see if W1 is the query phrase. If not, W1 is regarded as a candidate and will be added to the candidate table. In the next iteration, W3 to W5 will be fetched in order to check supposed candidates with length of two. This process will keep iterating until all valid candidate lengths are checked before it moves on to the next occurrence of the left context in the cradle. An important detail of this process is that when W2 to W4 are fetched from the corpus for comparison with R1 to R3, a new Java array will be allocated, and the contents are copied from the corpus array to the newly-allocated array. We address three problems in the process of finding candidates from cradle contexts in the original imple- mentation. 1. The loop on all valid candidate lengths leads to fetches same chars and comparison of the same chars with the right context several times. In the example, W4 will be fetch three times across finding candidates with lengths of one to three. 2. The allocation of a new array when the words be- hind the left context are to be returned is unneces- sary. It involves an unnecessary system call to al- location memory in heap and sequential checks on the results of the system. Also, the copying of the content from the corpus array to the new array in- troduces additional overhead. 3. Since the heap memory are managed by the JVM instead of the programmer, the allocation of new ar- rays will lead to massive garbage. If the heap size isnt large enough, this will cause frequent garbage collects, degrading the performance. 
To solve these problems, we revised the algorithm used to find candidates from cradle contexts. We de- scribe the revised algorithm as below. For a cradle L1L2L3QR1R2R3, we find all occurrences of L1 to L3 us- ing suffix array. For each occurrence, we directly fetch all words after L1 to L3 in the corpus, which are W1 to W5. Then we perform a substring search of R1R2R3 in W1W2W3W4W5. For each substring match of R1R2R3, the words before the match are compared with the query phrase and regarded as a candidate, since both left and right contexts match. When fetching W1W2W3W4W5 from the corpus, instead of allocating and copying the content, 6
  • 7.
    we directly passthe beginning and ending indices of the words. Our algorithm differs from the original one in that we avoid fetching and comparing same chars from the cor- pus multiple times, thus reducing the operations needed. Also, we changed the way of fetching the supposed right context such that no unnecessary memory allocation and memory copying are needed. This reduces the overhead of allocating and copying the context, as well as the time spent for garbage collection. 3.4 Punctuation Filter After collecting contexts and candidates, NeSS filters out punctuations. The punctuation filter in the original im- plementation uses a regular expression match to deter- mine whether a token is a punctuation. However, the original code didnt take advantage of the fact that the regular expression for punctuations stays the same for different contexts and candidates to filter. The original implementation compiles the regular expression for each filtering operation. Thus the overhead of compiling the regular expression is incurred for each context and each candidate. Our first optimization on the filter takes advantage of the unchanged regular expression and compile a static pattern according to the regular expression to filter out the punctuations. Next, we further exploit the fact that the punctuation contexts and candidates are mostly one character long. We thus changed the code to eliminate the need to use a regular expression match. We do this by first construct an array of 255 boolean elements. Each of the boolean elements stands for a character in ASCII code and repre- sents whether the character is a punctuation. At runtime, to check whether a character is a punctuation, NeSS will access the corresponding element in the boolean array. In this way, a regular expression match is replaced with an array lookup, which is much cheaper than the original implementation. 
4 Results

4.1 Environment

We tested our optimized NeSS on the Elastic Compute Cloud (EC2) in Amazon Web Services. We launched a c4.8xlarge instance, which has 36 virtual CPUs and 60GB of memory. We used a 2.2GB document from the English Gigaword Fifth Edition[11], an archive of newswire text data, as our corpus input.

4.2 Overall Performance

Figure 5: The performance comparison between the original system and our system with all the optimizations

We randomly selected six phrases from the test phrases in Gupta's paper[6]. We ran the optimized and the original systems with the query phrases several times to get an average runtime of pulling near-synonyms. Figure 5 shows the runtime comparison between the original system and our optimized one. The blue bars are the average runtimes in seconds of the original NeSS with different query phrases, while the green bars represent the average runtimes of the optimized system. We marked the speedups, i.e., the runtimes of the original system divided by those of our optimized system. We also included error bars to represent the max and min values in our tests.

From Figure 5, we can observe a speedup ranging from 17x to 41x, depending on the query phrase. The average speedup across the query phrases is 30x. The runtimes across multiple runs of the same query phrase are quite stable, as shown by the error bars. The speedup achieved by our optimizations reduced the latency of searching a single query phrase from several minutes to seconds, which allows a user to interactively search near-synonyms on-line.

4.3 Performance Impact of each Optimization

In this section we analyze the performance impact of each single optimization we applied. We evaluate the change in performance by adding optimizations one at a time on top of the previously added optimizations, in the same order as in Section 3.
For example, when evaluating the impact of optimizing the punctuation filter, we compare the final version with the version having the first three optimizations. The reason we evaluate in a cumulative way is that the impact of a later optimization is noticeable only if the previously dominant hotspot is
removed by earlier optimizations, such that the currently applied one is the hotspot. This evaluation process mirrors our optimization process: once a hotspot is resolved by applying an improvement technique, we find the next hotspot and apply another optimization. In the following figures, we refer to the index on the suffix array as Optimization 1, the improvement on the multi-threading as Optimization 2, the modifications in candidate search as Optimization 3 and the changes in the punctuation filter as Optimization 4.

Figure 6: The performance comparison between the original system and our system with optimization 1

The impact of building the index on the suffix array is shown in Figure 6. The notation of the figure is the same as in the previous section. The speedups vary from 3.5x to 6.5x with an average of 4.65x. The speedup due to the index is highly dependent on the corpus size, since the index reduces the complexity of substring search from O(L+log(N)) to O(L). We expect to see a further speedup with a larger corpus.

Figure 7: The performance comparison between the system with optimization 1 and the one with optimizations 1,2

Figure 7 shows the performance impact of improving multi-threading. As mentioned earlier, the comparison is between the version with the first two optimizations and the version with only the index optimization. The speedups vary from 2.8x to 5x, with an average of 3.3x. We ran both systems with the same number of cores, 36, against the same set of query phrases. The speedup shows that the improved system scales better than the original implementation.

Figure 8: The performance comparison between the system with optimizations 1,2 and the one with optimizations 1,2,3

As shown by Figure 8, the speedup due to the modification of candidate search is small. This is because the candidate search dominates the runtime only in the shared context ranking phase.
However, most of the time is spent in the KL-Divergence computation. Users can choose to disable the KL-Divergence; in that case, the speedup of this optimization is approximately 1.2x.

Figure 9: The performance comparison between the system with optimizations 1,2,3 and the one with optimizations 1,2,3,4

The impact of changing the punctuation filter is shown in Figure 9. This optimization leads to an approximately 2x speedup on average.
4.4 Scalability

Figure 10: The scalability of our system under 36 cores

We evaluate the scalability of the optimized system with different core counts by comparing the speedups against the single-core runtime, as shown in Figure 10. Our improved system scales near-linearly with core counts of 4 or fewer. However, we observe little improvement when we increase the number of cores above 16. We discuss the reasons in Section 5.2.

5 Discussions

5.1 Optimizations with Little Effect

The original system uses the Java HashMap to keep all the contexts and candidates, as well as their scores. In our optimized version, we also used a HashMap to store each level of the suffix array index. We tried to replace the Java HashMap with the Trove hash map[1]. The idea of the Trove hash map is that it avoids allocating objects for primitive types (Java Integer, for example), and thus saves memory. It also claims to have better hash functions and thus better performance. However, after switching to the Trove hash map, we didn't observe a significant performance gain. The reason might be that the hash maps we build have integer keys, so the hash functions don't play a big role. Also, Java 7 improved the performance of HashMap.

We also tried to think like a compiler and optimize away some unnecessary code. For example, in the original NeSS, there was a while loop that only calls a function inside. The function contains an if statement that is never true. We eliminated the while loop, but didn't see any performance gain. We suspect the reason is that the Java compiler had already optimized out the code that would never be executed.

The third optimization that didn't work was parallelizing functions that take a very short time to execute. For such cases, the overhead of creating threads and context switching defeats the benefits of parallel execution.

Fine-grained locking in compute-intensive parts also had little effect in our optimization.
For example, in the process of ranking candidates, when building the candidate map, the computation time is much longer than the time to put the result in a shared HashMap. In this case, changing the HashMap to a ConcurrentHashMap doesn't help much, since threads rarely block to acquire the lock.

5.2 Scalability

Our improved NeSS scales well up to 4 cores, but doesn't improve much when the core count increases above 16. We suspect there are three reasons:

1. The sequential part of the system limits the speedup from multi-threading. There are multiple parts of the algorithm that must run sequentially, such as finding the contexts of the query phrase and ranking candidates. According to Amdahl's law[2], the speedup from parallel execution is limited by the sequential portions of the algorithm.

2. Contention for the locks of shared data structures limits the benefits of multi-threading. The data structure that stores all candidates and their associated matching contexts has two levels of HashMaps. The logic of the execution and the required organization of the two-level HashMap conflict with each other, preventing fine-grained locking. Thus we see a bottleneck in ranking candidates as we increase the core count.

3. The memory bandwidth limits the maximum performance of the program. Since the system performs plenty of memory accesses and little mathematical computation, we believe the system is memory-bound when the number of cores is high enough to make the computation fast.

6 Related Work

NeSS finds near-synonym phrases based on an unsupervised corpus-based model. There are several papers addressing work related to finding near-synonyms and to our optimization techniques.

In Mitchell and Lapata's paper [10], a set of composition functions was proposed to combine the vectors of the words in a phrase into a single one.
Reddy [13] argued that not all the features are relevant to the phrase, and thus further presented ways to select the relevant senses of the words in a phrase. However, these papers decompose phrases into words and analyze the semantics at the word
level. They ignored the case where a phrase as a whole may have a completely different meaning from the meanings of its individual component words.

Some approaches find synonyms using parallel resources. Methods based on a monolingual text corpus, like Discovery of Inference Rules from Text (DIRT) [8], spot paraphrases that share the same interpretation in a foreign language. However, this might also find phrases that are not related to the original phrase. Ganitkevitch [5] used monolingual distributional similarity to rerank the extracted paraphrases. They built a webpage that responds to paraphrase queries by looking up interpretations in foreign languages in a database. However, this approach needs parallel resources and can only search for phrases that are present in the database.

In addition, Pasca[12] introduced an unsupervised method to retrieve near-synonyms in arbitrary web text using linguistically-motivated text anchors identified in the context of documents. The quality of the paraphrases can be further improved by a filtering mechanism using a set of categorized names from online documents. However, this method requires document-dependent linguistic patterns to be defined. The documents also need language-specific resources such as part-of-speech taggers.

The Near-Synonym System (NeSS)[6], which we aim to optimize, differs from previous paraphrasing systems in that it doesn't need parallel resources like PPDB[5] or predefined patterns like the method introduced by Pasca[12]. The algorithm in NeSS selects near-synonymic candidates by identifying common surrounding context based on an extension of Harris' Distributional Hypothesis[7]. The idea of this hypothesis is that words that occur in the same contexts tend to have similar meanings. To identify the contexts in which the query phrase occurs, NeSS uses a suffix array for lookup. A suffix array[9] is an array of the suffixes of a string.
The array is sorted, so searching for a substring in the original string takes O(P + log(N)) time, where P is the length of the query string and N is the size of the suffix array. In the context of NeSS, the minimum token is a word and the string is the whole corpus. As the size of the corpus grows, the runtime of context and candidate searches also increases.

Another approach to substring searching is the suffix tree[14]. Intuitively, the suffix tree is a trie of the suffixes. The path to each node of the tree is a prefix of a suffix; if the node is a leaf, the path is a suffix. If the lookup table in each node is maintained using hash maps, searching for a phrase takes O(P) time, where P is the number of words in the query. Although the suffix tree provides faster lookup, it consumes much more memory than a suffix array. Also, depending on the implementation, the data structure may not take advantage of caching the way a suffix array does.

Hashing is another way to do quick searches. Locality-sensitive hashing [3] ensures that the probability that two strings are hashed to the same bucket is proportional to their similarity. If applied to NeSS, the problem of finding a substring can be converted to finding a similar string. Although the algorithm provides constant-time string lookup on average, defining similarity is not trivial, and using locality-sensitive hashing loses some accuracy.

Our indexed suffix array differs from a normal suffix array in that the index can look up the first three words of the query phrase in constant time. Since nearly all suffix array lookups have no more than three words, the indexed suffix array achieves a constant-time substring lookup. It also differs from the suffix tree in that it doesn't consume as much memory.
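The indexed lookup can be sketched as follows. This is a simplified, word-level illustration in which a single hash map sends every prefix of up to three words to its range of entries in the sorted suffix array; the actual NeSS index is stored as one HashMap per level (Section 5.1), and all names here are ours rather than taken from the source.

```java
import java.util.*;

// Simplified sketch of an index over a word-level suffix array for a small
// in-memory corpus. Names are illustrative, not from the NeSS source, and
// the index is flattened into one map for brevity.
public class IndexedSuffixArray {
    private final String[] corpus;     // the corpus as an array of words
    private final Integer[] suffixes;  // word positions, sorted by suffix
    // Maps each prefix of up to three words to the half-open range
    // [lo, hi) of suffix-array entries starting with that prefix.
    private final Map<String, int[]> index = new HashMap<>();

    public IndexedSuffixArray(String[] corpus) {
        this.corpus = corpus;
        this.suffixes = new Integer[corpus.length];
        for (int i = 0; i < corpus.length; i++) suffixes[i] = i;
        // O(N) comparisons per sort step: fine for a sketch, not for 2.2GB.
        Arrays.sort(suffixes, Comparator.comparing(this::suffixAt));
        // Build the index in one pass over the sorted suffix array.
        for (int rank = 0; rank < suffixes.length; rank++) {
            final int r = rank;
            for (int len = 1; len <= 3; len++) {
                String prefix = prefixAt(suffixes[r], len);
                if (prefix == null) break;  // suffix shorter than len words
                int[] range = index.computeIfAbsent(prefix, k -> new int[]{r, r});
                range[1] = r + 1;           // extend the range to this rank
            }
        }
    }

    private String suffixAt(int pos) {
        return String.join(" ", Arrays.copyOfRange(corpus, pos, corpus.length));
    }

    private String prefixAt(int pos, int len) {
        if (pos + len > corpus.length) return null;
        return String.join(" ", Arrays.copyOfRange(corpus, pos, pos + len));
    }

    // Constant-time lookup for phrases of up to three words: one hash probe
    // replaces the O(log N) binary search over the suffix array.
    public int[] lookup(String phrase) {
        int[] range = index.get(phrase);
        return range == null ? new int[]{0, 0} : range;
    }
}
```

A phrase longer than three words would fall back to the ordinary O(P + log(N)) binary search, restricted to the range returned for its first three words.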
7 Conclusions

In this report, we introduced the Near-Synonym System (NeSS) and addressed some of its performance problems: long latency for user requests, unsatisfactory scalability, and a complexity highly dependent on the size of the corpus. We presented an optimized version of NeSS that solves these problems by building an index on the suffix array, changing the approach to parallelism, improving the efficiency of candidate search and optimizing the punctuation filter. The experiments showed a speedup of approximately 20x-40x compared to the original implementation. Our optimized NeSS demonstrated near-linear scalability with 8 cores or less.

References

[1] Trove high performance collections for Java. http://trove.starlight-systems.com/overview.

[2] AMDAHL, G. M. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, spring joint computer conference (1967), ACM, pp. 483–485.

[3] CHARIKAR, M. S. Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing (2002), ACM, pp. 380–388.

[4] CURRAN, J. R. From distributional to semantic similarity.

[5] GANITKEVITCH, J., VAN DURME, B., AND CALLISON-BURCH, C. PPDB: The paraphrase database. In HLT-NAACL (2013), pp. 758–764.

[6] GUPTA, D., CARBONELL, J., GERSHMAN, A., KLEIN, S., AND MILLER, D. Unsupervised phrasal near-synonym generation from text corpora.

[7] HARRIS, Z. S. Distributional structure. Springer, 1970.

[8] LIN, D., AND PANTEL, P. Discovery of inference rules from text, Dec. 5 2006. US Patent 7,146,308.

[9] MANBER, U., AND MYERS, G. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing 22, 5 (1993), 935–948.
[10] MITCHELL, J., AND LAPATA, M. Vector-based models of semantic composition. In ACL (2008), pp. 236–244.

[11] PARKER, R., GRAFF, D., KONG, J., CHEN, K., AND MAEDA, K. English Gigaword Fifth Edition. Linguistic Data Consortium, LDC2011T07 (2011).

[12] PAŞCA, M. Mining paraphrases from self-anchored web sentence fragments. In Knowledge Discovery in Databases: PKDD 2005. Springer, 2005, pp. 193–204.

[13] REDDY, S., KLAPAFTIS, I. P., MCCARTHY, D., AND MANANDHAR, S. Dynamic and static prototype vectors for semantic composition. In IJCNLP (2011), pp. 705–713.

[14] WEINER, P. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory (1973), IEEE, pp. 1–11.