Tang poetry inspiration machine using seq2seq with attention
Chengeng Ma
Stony Brook University
2018/03/13
Tang Dynasty
• In ancient Chinese history, the Tang Dynasty or Tang Empire (618-907 A.D.) is definitely the best of all times.
• Its military and international political power dominated East Asia.
• With hundreds of years of advances in agriculture and without discrimination against traders and merchants, the empire reached unparalleled prosperity.
• Its capital Chang An was the largest and most populous metropolis in the world.
• On the streets of Chang An there were Japanese students learning Chinese culture, Arab merchants buying silk and tea and selling horses and ivory, exotic girls dancing to Persian music, and Indian missionaries teaching Buddhism ......
https://en.wikipedia.org/wiki/Tang_dynasty
https://en.wikipedia.org/wiki/Chang'an
https://en.wikipedia.org/wiki/Japanese_missions_to_Tang_China
https://en.wikipedia.org/wiki/Silk_Road#Tang_dynasty_reopens_the_route
Tang Poets
• Confident in its position as the “Central Land”, the Tang Dynasty showed an inclusive and tolerant atmosphere of cultural diversity and literary creation.
• In such a society, countless great poets created their masterpieces: Li Bai, Du Fu, Bai Juyi, Wang Changling, Meng Haoran, Li Shangyin, Cen Shen, Wang Zhihuan … Their names are tied to the zenith of Chinese culture.
• Some poems were composed at royal banquets, where the emperor led with the 1st sentence, the prime minister made up the 2nd, and the other ministers followed.
• However, more poems were written by lower-level officials struggling to realize their political ideals, by soldiers and officers defending the frontier for years or for a lifetime, and by talented youths depressed at not being selected to serve the government and missing their families. That is where the masterpieces usually come from.
• In the previous work, we showed poems created by a char-level RNN, which generates poems char by char.
• Those poems could fool 50% of people into believing they were created by a human instead of an algorithm.
• However, strictly speaking, the rhyme of those poems is quite loose, and their meaning is not very focused.
• In this work, I am going to use seq2seq with an attention mechanism and try to beat the char-level RNN, writing poems that really have literary value.
[Figure: one famous poem chosen from our dataset]
vanilla seq2seq
• Different from an RNN that works on a single word or char at a time, seq2seq works on whole sequences. It uses buckets & padding to handle sequences of different lengths.
• Seq2seq is a combination of two RNNs. The left one, the encoder RNN, gets the source sequence as input; the right one, the decoder RNN, gets the target sequence as input and generates the outputs. The two RNNs are connected by the context vectors (the last hidden vectors of the encoder).
• For example, in the task of translating English to French, the encoder gets an English sentence and the decoder gets the corresponding French sentence.
• In the task of writing poems, the encoder gets all previous (or the latest k) sentences, and the decoder gets the next sentence; a minimal sketch of this encoder-decoder wiring is shown below.
[Figure: encoder-decoder diagram]
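As a minimal sketch of this encoder-decoder wiring (written here in PyTorch, which is an assumption; the bucketed and padded batches used in the actual training pipeline are omitted), the encoder's final hidden state is handed to the decoder as the context:

    import torch
    import torch.nn as nn

    class VanillaSeq2Seq(nn.Module):
        """Encoder RNN -> context (last hidden vectors) -> decoder RNN."""
        def __init__(self, vocab_size, hidden_size=256, num_layers=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden_size)
            self.encoder = nn.LSTM(hidden_size, hidden_size, num_layers, batch_first=True)
            self.decoder = nn.LSTM(hidden_size, hidden_size, num_layers, batch_first=True)
            self.project = nn.Linear(hidden_size, vocab_size)

        def forward(self, source_ids, target_ids):
            # The encoder reads the source chars; its final (h, c) is the context.
            _, context = self.encoder(self.embed(source_ids))
            # The decoder starts from the context and reads the target chars (teacher forcing).
            decoded, _ = self.decoder(self.embed(target_ids), context)
            return self.project(decoded)   # logits over the char vocabulary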
Attention mechanism
• The vanilla seq2seq has an obvious weakness: all of the encoder’s information is summarized into the context vectors, so the dimension of the context vectors needs to be very large, otherwise information is lost.
• The attention mechanism was invented to get rid of this problem.
• The attention mechanism can learn weights from the current target word to all the source words, so that all the hidden vectors of the encoder are made use of, instead of only the last one (see the sketch below).
• This is very helpful in translation, poems, couplets …
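A minimal sketch of the idea (NumPy, using a simple dot-product score; the scoring function actually used in this work may differ):

    import numpy as np

    def attention_context(decoder_state, encoder_states):
        """Weight every encoder hidden vector by its similarity to the current
        decoder state, so all of them are used, not only the last one.

        decoder_state:  shape (hidden,)          current target-side hidden vector
        encoder_states: shape (src_len, hidden)  all source-side hidden vectors
        """
        scores = encoder_states @ decoder_state      # similarity score per source position
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                     # softmax -> attention weights
        context = weights @ encoder_states           # weighted sum of encoder hidden vectors
        return context, weights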
• Like the previous work, the dataset is the Complete Tang Poems 《全唐诗》, which contains about 43,000 poems with a vocabulary size of 6,000 chars.
• About 7,800 poems are not used for training because they are too long (more than 100 chars). They could be decomposed and used as a validation dataset, but this does not make a lot of sense, since the dataset is small and each poem is quite unique. Unlike the machine translation task, the dataset does not show much redundancy and repetition.
• About 500 poems are abandoned because of missing lines or noisy information.
• Basically, we create our training data like this: suppose a poem has 8 sentences, A B C D E F G H; then I generate 9 rows of data from this poem, as in the table below.
• Then I send these rows into different buckets and train on them in random order.
    Row   Source input       Target input
    1     <BEGIN>            A
    2     A                  B
    3     A B                C
    4     A B C              D
    5     A B C D            E
    6     A B C D E          F
    7     A B C D E F        G
    8     A B C D E F G      H
    9     A B C D E F G H    <END>
In the training data, each sentence is
separated by a space, e.g., “A B C “.
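A small sketch of this row construction (a hypothetical helper; it reproduces the 9-row table above for an 8-sentence poem):

    def poem_to_rows(sentences, begin="<BEGIN>", end="<END>"):
        """Turn one poem's sentences [A, B, ..., H] into (source, target) pairs:
        (<BEGIN>, A), (A, B), (A B, C), ..., (A ... G, H), (A ... H, <END>).
        Sentences in the source are joined by a single space."""
        rows = []
        for i, sentence in enumerate(sentences):
            source = begin if i == 0 else " ".join(sentences[:i])
            rows.append((source, sentence))
        rows.append((" ".join(sentences), end))      # final row predicts the end of the poem
        return rows

    # An 8-sentence poem A..H yields 9 rows, matching the table above.
    print(poem_to_rows(["A", "B", "C", "D", "E", "F", "G", "H"]))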
Pros and cons
• The advantage of this data format is that each sentence, with its particular rhyme and rhythm, “knows” its location in the poem.
• For example, some sentences serve better as the last (or first) sentence, or as the one before (after) the last (first) sentence, while others are better placed in the middle.
• At the model level, the sentence-to-sentence model should perform better in rhyme & rhythm, and the meaning of consecutive sentences should become more focused and consistent.
• The weakness of this model, as far as I know, is the generation of the 1st sentence.
• The encoder input is then just a <BEGIN> symbol. It is usually difficult to write a very meaningful 1st sentence based only on this symbol.
• Of course, we can provide the 1st sentence manually. It works much of the time; however, as you can see, it is not guaranteed to work every time.
• I will show some tricks and model variants that try to deal with this problem at the end.
Details of the model
• 2-layer LSTM, hidden size 256, batch size 64;
• Estimate the training perplexity every 200 steps and save the estimate into a record;
• Decrease the learning rate by 1% if the current perplexity is the largest of the latest 4 estimates (see the sketch below);
• Save a model every 5 epochs;
• Choose models based on the flatness of the learning curve and manually test the few candidate models.
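A sketch of the perplexity-based learning-rate rule above (the initial rate of 0.5 and the exact bookkeeping are assumptions, not taken from the slides):

    from collections import deque

    learning_rate = 0.5                          # assumed starting value
    recent_ppl = deque(maxlen=4)                 # rolling window of the latest 4 estimates

    def maybe_decay(learning_rate, current_ppl):
        """Record the estimated perplexity and cut the learning rate by 1%
        if the newest estimate is the largest of the latest 4."""
        recent_ppl.append(current_ppl)
        if len(recent_ppl) == 4 and current_ppl >= max(recent_ppl):
            learning_rate *= 0.99
        return learning_rate

    # usage inside the training loop, once per 200-step perplexity estimate:
    # learning_rate = maybe_decay(learning_rate, estimated_ppl)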
Training process
[Figure: training perplexity (log and linear axes) and learning rate vs. training step]
• The training perplexity finally decreases to less than 10.
• Usually one ancient Chinese char can have 3-5 different meanings, so a perplexity below 10 is already close to that intrinsic ambiguity.
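For reference, training perplexity here is taken to be the usual exponentiated average per-char cross-entropy (this standard definition is an assumption; the slides do not spell it out), so a value below 10 means the model is on average about as uncertain as choosing among fewer than 10 equally likely chars:

    \mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\left(c_i \mid c_{<i},\ \text{source}\right)\right)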
How do I decode?
• I use all previously generated sentences as the encoder input (with <BEGIN> at the start and a space as the separator) to generate the next sentence (sketched below).
• For the sake of rhyme, I use greedy decoding for most sentences.
• However, how do you create the 1st sentence?
• (1) One possible way is to sample chars from the output probability distribution, with <BEGIN> alone as the encoder input;
• (2) the other way is to provide the 1st sentence manually.
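A sketch of this decoding loop (the `encode` and `step` calls are hypothetical wrappers around the trained seq2seq model, not real API names):

    def greedy_next_sentence(model, previous_sentences, max_len=10):
        """Generate the next line greedily: the encoder input is <BEGIN> plus all
        previously generated sentences (space-separated); at each decoder step
        the most probable char is kept."""
        source = "<BEGIN>" if not previous_sentences else "<BEGIN> " + " ".join(previous_sentences)
        state = model.encode(source)                 # hypothetical: run the encoder
        token, sentence = "<GO>", []
        for _ in range(max_len):
            probs, state = model.step(token, state)  # hypothetical: one decoder step -> {char: prob}
            token = max(probs, key=probs.get)        # greedy choice
            if token == "<EOS>":
                break
            sentence.append(token)
        return "".join(sentence)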
The sampling method
• If we use sampling all through the poem, the quality of the seq2seq poems soon degrades to the RNN’s level, especially the rhyme.
• If we only use it for the 1st sentence, then the meaning of the 1st sentence will be more random than that of the following sentences.
• It also degrades the following sentences if the 1st one is not very appropriate.
• One possible remedy is a “top k” clip: at each position within the sentence, we only sample among the top k most probable chars. All other chars are not considered; their probability is cut down to zero (see the sketch below).
• This method combines the idea of sampling (which creates diversity) with the pursuit of good rhyme (which implies limiting the scope).
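A sketch of the “top k” clip (NumPy; k=5 is just an example value, not taken from the slides):

    import numpy as np

    def sample_top_k(probs, k=5, rng=None):
        """Sample the next char only among the k most probable ones: every other
        char has its probability cut to zero, the rest is renormalized, then sampled."""
        rng = rng or np.random.default_rng()
        probs = np.asarray(probs, dtype=float)
        clipped = np.zeros_like(probs)
        top = np.argsort(probs)[-k:]          # indices of the k most probable chars
        clipped[top] = probs[top]
        clipped /= clipped.sum()              # renormalize the surviving probability mass
        return rng.choice(len(probs), p=clipped)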
Manually providing the 1st sentence
• Even though this was the original goal that drove me to build the model, what I find is that it is very tricky to provide a good leading sentence.
• The seq2seq model is trained on the Complete Tang Poems data. Even if you create a sentence that looks like ancient Chinese, if it does not look like some combination of fragments of our training data, it will still be difficult for the model to write a good poem, because the model is not familiar with it.
• Different from the machine translation task (where we have big data and many words and phrases are shared across sentences), the poem data is small and does not show such redundancy or repetition. Each poem is unique and the utility matrix is sparse (the connection between words through common usage is relatively weak).
• So it becomes very tricky to provide a good 1st sentence manually. The best poems I have shown and the worst poems that I have not shown were all generated this way.
• Based on my experience, using a close-to-last sentence or simply the last sentence of an existing poem from the training data usually produces an acceptable poem.
Selected poem collection
Average level poems
Other variants: <1> spin-up model
• In this model, I do not use all the previous sentences to generate the next one; instead I use only the latest 4 sentences (4 lines is usually one period within Chinese poems), and there is no <END> symbol.
• At the decoding phase, I simply let the model write 20 or 50 lines, then manually select 4 or 8 lines as the poem (see the sketch below).
• This method is usually safe and interesting to try.
• The idea is that even if the model is not good at the 1st sentence, the following sentences get better and better as decoding goes on. Because the memory is kept to only the latest 4 sentences, the bad influence of the 1st sentence is diluted quickly.
• This is a good solution for randomly generating poems or a few interesting lines.
• The caveat is that, since the memory is only 4 lines, you should not choose too long a subset as your poem. Otherwise, the poem’s meaning and even its rhyme will become random as decoding goes on.
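A sketch of the spin-up decoding loop (`generate_next_line` stands for a hypothetical wrapper around the trained seq2seq decoder):

    def spin_up_poem(generate_next_line, total_lines=20, memory=4, seed_lines=()):
        """Generate many lines, conditioning each new line only on the latest
        `memory` lines (4 lines ~ one period), so a weak opening gets diluted fast.
        Afterwards, a 4- or 8-line excerpt is picked manually as the poem."""
        lines = list(seed_lines)
        while len(lines) < total_lines:
            context = lines[-memory:]          # only the latest 4 lines are remembered
            lines.append(generate_next_line(context))
        return lines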
Other variants: <2> char-by-char @ the 1st line
• Use a char-by-char approach for the 1st sentence, which sounds like something a decoder RNN could do equivalently.
• However, we are using seq2seq with an attention mechanism, so the attention weights can still make a difference.
• By decomposing the 1st line many times and appending the pieces to the training data, the 1st line gets 4-6 times more weight during training (maybe we could decrease the probability of these rows being selected in training).
• This method has been implemented. It does improve the quality of the 1st sentence; however, in the remaining sentences the poems tend to use some chars multiple times, which is supposed to be avoided in Chinese poem writing. Not a good idea after exploring!
Conclusion
• There is no doubt that the seq2seq model shows a significant improvement over the char-level RNN in writing poems: the probability of getting a good poem increases a lot.
• The improvement mostly comes from the reasonableness of the sentence-to-sentence method compared with the char-to-char method.
• The improvements show up mostly in the rhyme and in the focus of the poem’s topic and overall meaning.
• Presently I am planning to submit some of these poems to some competitions and see whether I can win something.
• On the other hand, providing an appropriate 1st sentence is still a challenge in writing a good poem.
• Manually providing the leading sentence seems tricky, but the best poems we have seen also come from this approach. The last line of an existing poem from the training data may be a good hint.
• The sampling method with a “top k” clip improves the plain random-sampling method a little.
• As a variant, the “spin-up model” seems able to generate acceptable poem excerpts much of the time.
• Another variant, which decomposes the 1st sentence into chars, tends to use the same chars multiple times, degrading the poems to low quality.
