Tang poetry inspiration machine using seq2seq with attention
Chengeng Ma
Stony Brook University
2018/03/13
Tang Dynasty
• In ancient Chinese history, the Tang Dynasty or Tang Empire (618-907 A.D.) is definitely the best of all times.
• Its military and international political power dominated East Asia.
• With hundreds of years of advances in agriculture and without discrimination against traders and merchants, the empire reached unparalleled prosperity.
• Its capital Chang An was the largest and most populous metropolis in the world.
• On the streets of Chang An there were Japanese students learning Chinese culture, Arab merchants buying silk and tea and selling horses and ivory, exotic girls dancing to Persian music, and Indian missionaries teaching Buddhism ......
https://en.wikipedia.org/wiki/Tang_dynasty
https://en.wikipedia.org/wiki/Chang'an
https://en.wikipedia.org/wiki/Japanese_missions_to_Tang_China
https://en.wikipedia.org/wiki/Silk_Road#Tang_dynasty_reopens_the_route
Tang Poets
• Confident in its position as the “Central Land”, the Tang Dynasty showed an inclusive and tolerant atmosphere of cultural diversity and literary creation.
• In such a society, countless great poets created their masterpieces: Li Bai, Du Fu, Bai Juyi, Wang Changling, Meng Haoran, Li Shangyin, Cen Shen, Wang Zhihuan … Their names are tied to the zenith of Chinese culture.
• Some poems were composed at royal banquets, where the emperor led with the 1st sentence, the prime minister made up the 2nd, and the other ministers followed.
• However, more poems were written by lower-level officials struggling to realize their political ideals, by soldiers and officers defending the frontier for years or for a lifetime, and by talented youths depressed at not being selected to serve the government and missing their families. That is where the masterpieces usually come from.
• In the previous work, we showed poems created by a char-level RNN, which generates poems char by char.
• Those poems could fool 50% of people into believing they were created by a human instead of an algorithm.
• However, strictly speaking, the rhyme of those poems is quite loose, and their meaning is not very focused.
• In this work, I am going to use seq2seq with an attention mechanism and try to beat the char-level RNN, writing poems that really have literary value.
[Figure: one famous poem chosen from our dataset]
vanilla seq2seq
• Different from an RNN that works on a single word or char at a time, seq2seq works on whole sequences. It uses buckets & padding to handle sequences of different lengths.
• Seq2seq is a combination of two RNNs. The left one, the encoder RNN, gets the source sequence as input; the right one, the decoder RNN, gets the target sequence as input and generates the outputs. The two RNNs are connected by the context vectors (the last hidden vectors of the encoder).
• For example, in the task of translating English to French, the encoder gets an English sentence and the decoder gets the corresponding French sentence.
• In the task of writing poems, the encoder gets all previous (or the latest k) sentences, and the decoder gets the next sentence; a minimal sketch of this encoder-decoder wiring is shown below.
[Figure: encoder-decoder diagram]
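As a minimal sketch of this encoder-decoder wiring (written here in PyTorch, which is an assumption; the bucketed and padded batches used in the actual training pipeline are omitted), the encoder's final hidden state is handed to the decoder as the context:

    import torch
    import torch.nn as nn

    class VanillaSeq2Seq(nn.Module):
        """Encoder RNN -> context (last hidden vectors) -> decoder RNN."""
        def __init__(self, vocab_size, hidden_size=256, num_layers=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden_size)
            self.encoder = nn.LSTM(hidden_size, hidden_size, num_layers, batch_first=True)
            self.decoder = nn.LSTM(hidden_size, hidden_size, num_layers, batch_first=True)
            self.project = nn.Linear(hidden_size, vocab_size)

        def forward(self, source_ids, target_ids):
            # The encoder reads the source chars; its final (h, c) is the context.
            _, context = self.encoder(self.embed(source_ids))
            # The decoder starts from the context and reads the target chars (teacher forcing).
            decoded, _ = self.decoder(self.embed(target_ids), context)
            return self.project(decoded)   # logits over the char vocabulary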
Attention mechanism
• The vanilla seq2seq has an obvious weakness: all of the encoder’s information is summarized into the context vectors, so the dimension of the context vectors needs to be very large, otherwise information is lost.
• The attention mechanism was invented to get rid of this problem.
• The attention mechanism can learn weights from the current target word to all the source words, so that all the hidden vectors of the encoder are made use of, instead of only the last one (see the sketch below).
• This is very helpful in translation, poems, couplets …
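A minimal sketch of the idea (NumPy, using a simple dot-product score; the scoring function actually used in this work may differ):

    import numpy as np

    def attention_context(decoder_state, encoder_states):
        """Weight every encoder hidden vector by its similarity to the current
        decoder state, so all of them are used, not only the last one.

        decoder_state:  shape (hidden,)          current target-side hidden vector
        encoder_states: shape (src_len, hidden)  all source-side hidden vectors
        """
        scores = encoder_states @ decoder_state      # similarity score per source position
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                     # softmax -> attention weights
        context = weights @ encoder_states           # weighted sum of encoder hidden vectors
        return context, weights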
• Like the previous work, the dataset is the Complete Tang Poems 《全唐诗》, which contains about 43,000 poems with a vocabulary size of 6,000 chars.
• About 7,800 poems are not used for training because they are too long (more than 100 chars). They could be decomposed and used as a validation dataset, but this does not make a lot of sense, since the dataset is small and each poem is quite unique. Unlike the machine translation task, the dataset does not show much redundancy and repetition.
• About 500 poems are abandoned because of missing lines or noisy information.
• Basically, we create our training data like this: suppose a poem has 8 sentences, A B C D E F G H; then I generate 9 rows of data from this poem, as in the table below.
• Then I send these rows into different buckets and train on them in random order.
    Row   Source input       Target input
    1     <BEGIN>            A
    2     A                  B
    3     A B                C
    4     A B C              D
    5     A B C D            E
    6     A B C D E          F
    7     A B C D E F        G
    8     A B C D E F G      H
    9     A B C D E F G H    <END>
In the training data, each sentence is
separated by a space, e.g., “A B C “.
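A small sketch of this row construction (a hypothetical helper; it reproduces the 9-row table above for an 8-sentence poem):

    def poem_to_rows(sentences, begin="<BEGIN>", end="<END>"):
        """Turn one poem's sentences [A, B, ..., H] into (source, target) pairs:
        (<BEGIN>, A), (A, B), (A B, C), ..., (A ... G, H), (A ... H, <END>).
        Sentences in the source are joined by a single space."""
        rows = []
        for i, sentence in enumerate(sentences):
            source = begin if i == 0 else " ".join(sentences[:i])
            rows.append((source, sentence))
        rows.append((" ".join(sentences), end))      # final row predicts the end of the poem
        return rows

    # An 8-sentence poem A..H yields 9 rows, matching the table above.
    print(poem_to_rows(["A", "B", "C", "D", "E", "F", "G", "H"]))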
Pros and cons
• The advantage of this data format is that each sentence, with its particular rhyme and rhythm, “knows” its location in the poem.
• For example, some sentences serve better as the last (or first) sentence, or as the one before (after) the last (first) sentence, while others are better placed in the middle.
• At the model level, the sentence-to-sentence model should perform better in rhyme & rhythm, and the meaning of consecutive sentences should become more focused and consistent.
• The weakness of this model, as far as I know, is the generation of the 1st sentence.
• The encoder input is then just a <BEGIN> symbol. It is usually difficult to write a very meaningful 1st sentence based only on this symbol.
• Of course, we can provide the 1st sentence manually. It works much of the time; however, as you can see, it is not guaranteed to work every time.
• I will show some tricks and model variants that try to deal with this problem at the end.
Details of the model
• 2-layer LSTM, hidden size 256, batch size 64;
• Estimate the training perplexity every 200 steps and save the estimate into a record;
• Decrease the learning rate by 1% if the current perplexity is the largest of the latest 4 estimates (see the sketch below);
• Save a model every 5 epochs;
• Choose models based on the flatness of the learning curve and manually test the few candidate models.
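A sketch of the perplexity-based learning-rate rule above (the initial rate of 0.5 and the exact bookkeeping are assumptions, not taken from the slides):

    from collections import deque

    learning_rate = 0.5                          # assumed starting value
    recent_ppl = deque(maxlen=4)                 # rolling window of the latest 4 estimates

    def maybe_decay(learning_rate, current_ppl):
        """Record the estimated perplexity and cut the learning rate by 1%
        if the newest estimate is the largest of the latest 4."""
        recent_ppl.append(current_ppl)
        if len(recent_ppl) == 4 and current_ppl >= max(recent_ppl):
            learning_rate *= 0.99
        return learning_rate

    # usage inside the training loop, once per 200-step perplexity estimate:
    # learning_rate = maybe_decay(learning_rate, estimated_ppl)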
Training process
[Figure: training perplexity (log and linear axes) and learning rate vs. training step]
• The training perplexity finally decreases to less than 10.
• Usually one ancient Chinese char can have 3-5 different meanings, so a perplexity below 10 is already close to that intrinsic ambiguity.
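For reference, training perplexity here is taken to be the usual exponentiated average per-char cross-entropy (this standard definition is an assumption; the slides do not spell it out), so a value below 10 means the model is on average about as uncertain as choosing among fewer than 10 equally likely chars:

    \mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\left(c_i \mid c_{<i},\ \text{source}\right)\right)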
How do I decode?
• I use all previously generated sentences as the encoder input (with <BEGIN> at the start and a space as the separator) to generate the next sentence (sketched below).
• For the sake of rhyme, I use greedy decoding for most sentences.
• However, how do you create the 1st sentence?
• (1) One possible way is to sample chars from the output probability distribution, with <BEGIN> alone as the encoder input;
• (2) the other way is to provide the 1st sentence manually.
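A sketch of this decoding loop (the `encode` and `step` calls are hypothetical wrappers around the trained seq2seq model, not real API names):

    def greedy_next_sentence(model, previous_sentences, max_len=10):
        """Generate the next line greedily: the encoder input is <BEGIN> plus all
        previously generated sentences (space-separated); at each decoder step
        the most probable char is kept."""
        source = "<BEGIN>" if not previous_sentences else "<BEGIN> " + " ".join(previous_sentences)
        state = model.encode(source)                 # hypothetical: run the encoder
        token, sentence = "<GO>", []
        for _ in range(max_len):
            probs, state = model.step(token, state)  # hypothetical: one decoder step -> {char: prob}
            token = max(probs, key=probs.get)        # greedy choice
            if token == "<EOS>":
                break
            sentence.append(token)
        return "".join(sentence)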
The sampling method
• If we use sampling all through the poem, the quality of the seq2seq poems soon degrades to the RNN’s level, especially the rhyme.
• If we only use it for the 1st sentence, then the meaning of the 1st sentence will be more random than that of the following sentences.
• It also degrades the following sentences if the 1st one is not very appropriate.
• One possible remedy is a “top k” clip: at each position within the sentence, we only sample among the top k most probable chars. All other chars are not considered; their probability is cut down to zero (see the sketch below).
• This method combines the idea of sampling (which creates diversity) with the pursuit of good rhyme (which implies limiting the scope).
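A sketch of the “top k” clip (NumPy; k=5 is just an example value, not taken from the slides):

    import numpy as np

    def sample_top_k(probs, k=5, rng=None):
        """Sample the next char only among the k most probable ones: every other
        char has its probability cut to zero, the rest is renormalized, then sampled."""
        rng = rng or np.random.default_rng()
        probs = np.asarray(probs, dtype=float)
        clipped = np.zeros_like(probs)
        top = np.argsort(probs)[-k:]          # indices of the k most probable chars
        clipped[top] = probs[top]
        clipped /= clipped.sum()              # renormalize the surviving probability mass
        return rng.choice(len(probs), p=clipped)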
Manually providing the 1st sentence
• Even though this was the original goal that drove me to build the model, what I find is that it is very tricky to provide a good leading sentence.
• The seq2seq model is trained on the Complete Tang Poems data. Even if you create a sentence that looks like ancient Chinese, if it does not look like some combination of fragments of our training data, it will still be difficult for the model to write a good poem, because the model is not familiar with it.
• Different from the machine translation task (where we have big data and many words and phrases are shared across sentences), the poem data is small and does not show such redundancy or repetition. Each poem is unique and the utility matrix is sparse (the connection between words through common usage is relatively weak).
• So it becomes very tricky to provide a good 1st sentence manually. The best poems I have shown and the worst poems that I have not shown were all generated this way.
• Based on my experience, using a close-to-last sentence or simply the last sentence of an existing poem from the training data usually produces an acceptable poem.
Selected poem collection
Average level poems
Other variants: <1> spin-up model
• In this model, I do not use all the previous sentences to generate the next one; instead I use only the latest 4 sentences (4 lines is usually one period within Chinese poems), and there is no <END> symbol.
• At the decoding phase, I simply let the model write 20 or 50 lines, then manually select 4 or 8 lines as the poem (see the sketch below).
• This method is usually safe and interesting to try.
• The idea is that even if the model is not good at the 1st sentence, the following sentences get better and better as decoding goes on. Because the memory is kept to only the latest 4 sentences, the bad influence of the 1st sentence is diluted quickly.
• This is a good solution for randomly generating poems or a few interesting lines.
• The caveat is that, since the memory is only 4 lines, you should not choose too long a subset as your poem. Otherwise, the poem’s meaning and even its rhyme will become random as decoding goes on.
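A sketch of the spin-up decoding loop (`generate_next_line` stands for a hypothetical wrapper around the trained seq2seq decoder):

    def spin_up_poem(generate_next_line, total_lines=20, memory=4, seed_lines=()):
        """Generate many lines, conditioning each new line only on the latest
        `memory` lines (4 lines ~ one period), so a weak opening gets diluted fast.
        Afterwards, a 4- or 8-line excerpt is picked manually as the poem."""
        lines = list(seed_lines)
        while len(lines) < total_lines:
            context = lines[-memory:]          # only the latest 4 lines are remembered
            lines.append(generate_next_line(context))
        return lines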
Other variants: <2> char-by-char @ the 1st line
• Use a char-by-char approach for the 1st sentence, which sounds like something a decoder RNN could do equivalently.
• However, we are using seq2seq with an attention mechanism, so the attention weights can still make a difference.
• By decomposing the 1st line many times and appending the pieces to the training data, the 1st line gets 4-6 times more weight during training (maybe we could decrease the probability of these rows being selected in training).
• This method has been implemented. It does improve the quality of the 1st sentence; however, in the remaining sentences the poems tend to use some chars multiple times, which is supposed to be avoided in Chinese poem writing. Not a good idea after exploring!
Conclusion
• There is no doubt that the seq2seq model shows a significant improvement over the char-level RNN in writing poems: the probability of getting a good poem increases a lot.
• The improvement mostly comes from the reasonableness of the sentence-to-sentence method compared with the char-to-char method.
• The improvements show up mostly in the rhyme and in the focus of the poem’s topic and overall meaning.
• Presently I am planning to submit some of these poems to some competitions and see whether I can win something.
• On the other hand, providing an appropriate 1st sentence is still a challenge in writing a good poem.
• Manually providing the leading sentence seems tricky, but the best poems we have seen also come from this approach. The last line of an existing poem from the training data may be a good hint.
• The sampling method with a “top k” clip improves the plain random-sampling method a little.
• As a variant, the “spin-up model” seems able to generate acceptable poem excerpts much of the time.
• Another variant, which decomposes the 1st sentence into chars, tends to use the same chars multiple times, degrading the poems to low quality.
