
LDA-Based Scoring of Sequences Generated by RNN for Automatic Tanka Composition

manuscript for review
(accepted as short paper for ICCS 2018)


LDA-Based Scoring of Sequences Generated by RNN for Automatic Tanka Composition

Tomonari Masada (1) and Atsuhiro Takasu (2)
(1) Nagasaki University, 1-14 Bunkyo-machi, Nagasaki, Japan. masada@nagasaki-u.ac.jp
(2) National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan. takasu@nii.ac.jp

Abstract. This paper proposes a method for scoring sequences generated by a recurrent neural network (RNN) for automatic Tanka composition. Our method scores a sequence based on the topic assignments provided by latent Dirichlet allocation (LDA). Generated sequences can also be scored with their RNN output probabilities. However, the sequences with large output probabilities tend to share many of the same subsequences and therefore lack diversity. In contrast, our method uses topic assignments given by LDA. We first perform inference for LDA over the same set of sequences as that used for training the RNN. The inferred per-topic word probabilities are then used to obtain topic assignments for the word tokens in the sequences generated by a well-trained RNN. When a sequence contains many word tokens assigned to the same topic, we give it a high score; this is an entropy-based scoring. The result of an experiment in which we scored Japanese Tanka poems generated by an RNN shows that the top-ranked sequences selected by our method contained a wider variety of subsequences than those selected by RNN output probability.

Keywords: topic modeling, sequence generation, recurrent neural network, automatic poetry composition

1 Introduction

Recurrent neural networks (RNNs) are a class of artificial neural networks that have produced remarkable results in many important applications. Sequence generation is one such application [11][6]. By adopting a special architecture such as LSTM [9] or GRU [3][4], an RNN can learn dependencies among word tokens appearing at distant positions. It is therefore possible to use an RNN to generate sequences that exhibit dependencies among distant word tokens.

When we use sequence generation in realistic situations, we may need to sift the sequences generated by the RNN to obtain only the useful ones. Such sifting can give each generated sequence a score representing its usefulness for the application under consideration. While the quality of the generated sequences can also be improved by modifying the RNN architecture [5][13],
we here consider a scoring method that can be tuned and applied separately from the RNN. In particular, this paper proposes a scoring method that aims at diversity among the subsequences appearing in highly scored sequences.

We could score the generated sequences by their output probabilities in the RNN. However, this approach tends to give high scores to sequences containing subsequences that are popular among the sequences used for training the RNN. Consequently, the highly scored sequences tend to look alike and show only a limited diversity. In contrast, our scoring method uses latent Dirichlet allocation (LDA) [2]. Topic models like LDA can extract diverse topics from training documents, where each topic is represented as a probability distribution defined over the vocabulary words. By using the per-topic word probabilities that LDA provides, we can assign high scores to sequences containing many words strongly relevant to some particular topic. This scoring is expected to produce top-ranked sequences that individually show a strong relevance to some particular topic and together cover a wide variety of topics.

We performed an evaluation experiment by generating Japanese Tanka poems with an RNN using LSTM or GRU modules. Tanka poems basically have a 5-7-5-7-7 syllabic structure and thus consist of five parts. Every Tanka poem used in the experiment was given in Hiragana characters with no voicing marks and was represented as a sequence of non-overlapping character bigrams, which were regarded as vocabulary words. After training the RNN under different settings, we chose the best setting in terms of validation perplexity. We then generated random sequences with the best-trained RNN. The generated sequences were scored either by their output probabilities in the RNN or by our LDA-based method. The result shows that our LDA-based method selected more diverse sequences, in the sense that a wider variety of subsequences appeared as the five parts of the top-ranked Tanka poems.

The rest of the paper is structured as follows. Section 2 describes the method in detail. The result of the evaluation experiment is given in Section 3. Section 4 discusses previous studies. Section 5 concludes the paper and gives future work.

2 Method

2.1 Preprocessing of Tanka Poems

Sequence generation is one of the important applications of RNNs [11][6]. This paper considers the generation of Japanese Tanka poems. We assume that all Tanka poems for training the RNN are given in Hiragana characters with no voicing marks. Tanka poems have a 5-7-5-7-7 syllabic structure and thus consist of five subsequences, which are called parts in this paper. We give an example of a Tanka poem taken from The Tale of Genji in Fig. 1.^1

Fig. 1. A Tanka poem from The Tale of Genji.

Note that the Tanka poem in Fig. 1 has a 6-7-5-7-7 syllabic structure: while the first part of the standard syllabic structure consists of five syllables, that of this example consists of six. Small deviations of this kind from the standard 5-7-5-7-7 structure are often observed.

^1 In this paper, a romanization called Kunrei-shiki is used for Hiragana.
When preprocessing the Tanka poems, we need to take this kind of deviation into account. The details of the preprocessing are given below.

First, we put a spacing character '_' between each neighboring pair of parts and also at the tail of the poem. Further, the first Hiragana character of each of the five parts is "uppercased," i.e., marked as distinct from the same character appearing at other positions. The above example is then converted to:

  MO no o mo hu ni _ TA ti ma hu he ku mo _ A ra nu mi no _ SO te u ti hu ri si _ KO ko ro si ri ki ya _

Second, we represent each Tanka poem as a sequence of non-overlapping character bigrams. Since we would like to regard character bigrams as the vocabulary words composing each Tanka poem, non-overlapping bigrams are used. However, for the parts of even length only, we apply an additional modification: the special bigram '(_ _)' is put at the tail of each part of even length in place of the single spacing character '_'. Consequently, the non-overlapping bigram sequence corresponding to the above example is obtained as follows:

  (MO no) (o mo) (hu ni) (_ _) (TA ti) (ma hu) (he ku) (mo _) (A ra) (nu mi) (no _) (SO te) (u ti) (hu ri) (si _) (KO ko) (ro si) (ri ki) (ya _)

Note that the even-length part 'MO no o mo hu ni' is converted into the bigram sequence '(MO no) (o mo) (hu ni) (_ _)' by this additional modification. Finally, we put tokens of the special words 'BOS' and 'EOS' at the head and the tail of each sequence, respectively. The final output of the preprocessing for the above example is as follows:

  BOS (MO no) (o mo) (hu ni) (_ _) (TA ti) (ma hu) (he ku) (mo _) (A ra) (nu mi) (no _) (SO te) (u ti) (hu ri) (si _) (KO ko) (ro si) (ri ki) (ya _) EOS

The sequences preprocessed in this manner were used both for training the RNN and for training LDA.
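The following Python sketch illustrates this preprocessing. It assumes each poem is given as a list of its five romanized parts, with one Hiragana character per whitespace-separated item; the function name and input format are illustrative and are not taken from the authors' code.

    def preprocess_tanka(parts):
        tokens = ["BOS"]
        for part in parts:
            chars = part.split()             # one romanized Hiragana character per item
            chars[0] = chars[0].upper()      # mark the part-initial character as distinct
            # Parts of even length get the special bigram (_ _) instead of a single '_'.
            chars += ["_", "_"] if len(chars) % 2 == 0 else ["_"]
            # Non-overlapping bigrams: pair up consecutive characters.
            tokens += ["({} {})".format(chars[i], chars[i + 1])
                       for i in range(0, len(chars), 2)]
        tokens.append("EOS")
        return tokens

    example = ["mo no o mo hu ni", "ta ti ma hu he ku mo", "a ra nu mi no",
               "so te u ti hu ri si", "ko ko ro si ri ki ya"]
    print(" ".join(preprocess_tanka(example)))
    # BOS (MO no) (o mo) (hu ni) (_ _) (TA ti) (ma hu) (he ku) (mo _) ... (ya _) EOS

Running this on the example above reproduces the bigram sequence shown in the text.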
2.2 Tanka Poem Generation by RNN

We downloaded 179,225 Tanka poems from the website of the International Research Center for Japanese Studies.^2 3,631 different non-overlapping bigrams appeared in this set, so the vocabulary size was 3,631. Among the 179,225 Tanka poems, 143,550 were used for training the RNN and also for training LDA, and 35,675 were used for validation, i.e., for tuning free parameters.

We implemented the RNN with PyTorch^3 using LSTM or GRU modules. The number of hidden layers was three, because larger numbers of layers did not give better results in terms of validation set perplexity. The dropout probability was 0.5 for all layers. The weights were clipped to [-0.25, 0.25]. RMSprop [12] was used for optimization with a learning rate of 0.002. The mini-batch size was 200.

The best RNN architecture was chosen by the following procedure. Fig. 2 presents the validation set perplexity computed in the course of training GRU-RNN with three hidden layers. The hidden layer size h was set to 400, 600, 800, or 1,000. The worst result was obtained when h = 400. The results for the three other cases showed no significant difference, and among these three cases the running time was shortest when h = 600. While we obtained similar results for LSTM-RNN, the results of GRU-RNN were slightly better. Therefore, we selected the GRU-RNN with three hidden layers of size 600 as the model for generating Tanka poems.

Fig. 2. Validation perplexities obtained by GRU-RNN with three hidden layers of various sizes.

^2 http://tois.nichibun.ac.jp/database/html2/waka/menu.html
^3 http://pytorch.org/
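As an illustration of this setup, the following is a minimal PyTorch sketch of a GRU language model with three hidden layers of size 600, dropout 0.5, RMSprop with learning rate 0.002, and weight clipping to [-0.25, 0.25]. The embedding size, the class name, and the training-loop details are assumptions, since the paper does not describe the implementation beyond the settings listed above.

    import torch
    import torch.nn as nn

    class TankaRNN(nn.Module):
        # Vocabulary: 3,631 bigrams plus BOS and EOS (the embedding size is assumed).
        def __init__(self, vocab_size=3633, emb_size=600, hidden_size=600, num_layers=3):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_size)
            self.gru = nn.GRU(emb_size, hidden_size, num_layers,
                              dropout=0.5, batch_first=True)
            self.drop = nn.Dropout(0.5)
            self.out = nn.Linear(hidden_size, vocab_size)

        def forward(self, x, hidden=None):
            h, hidden = self.gru(self.drop(self.embed(x)), hidden)
            return self.out(self.drop(h)), hidden

    model = TankaRNN()
    optimizer = torch.optim.RMSprop(model.parameters(), lr=0.002)
    criterion = nn.CrossEntropyLoss()

    def train_step(batch):                        # batch: LongTensor of shape (200, seq_len)
        optimizer.zero_grad()
        logits, _ = model(batch[:, :-1])          # predict the next bigram at each step
        loss = criterion(logits.reshape(-1, logits.size(-1)), batch[:, 1:].reshape(-1))
        loss.backward()
        optimizer.step()
        with torch.no_grad():                     # clip the weights to [-0.25, 0.25]
            for p in model.parameters():
                p.clamp_(-0.25, 0.25)
        return loss.item()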
2.3 Topic Modeling for Sequence Scoring

This paper proposes a new method for scoring the sequences generated by the RNN. We use latent Dirichlet allocation (LDA) [2], the best-known topic model, for the scoring. LDA is a Bayesian probabilistic model of documents; it models the difference in the semantic content of documents as a difference in their mixing proportions of topics, and each topic is in turn modeled as a probability distribution defined over the vocabulary words. The details of LDA are given below.

We first introduce notation. We denote the number of documents, the vocabulary size, and the number of topics by D, V, and K, respectively. By performing inference for LDA, e.g., variational Bayesian inference [2] or collapsed Gibbs sampling [7], over the training document set, we can estimate two groups of parameters: θ_dk and φ_kv, for d = 1, ..., D, v = 1, ..., V, and k = 1, ..., K.

The parameter θ_dk is the probability of the topic k in the document d. Intuitively, this group of parameters represents the importance of each topic in each document. Technically, the parameter vector θ_d = (θ_d1, ..., θ_dK) gives the mixing proportions of the K different topics in the document d. Each topic is in turn represented as a probability distribution over words. The parameter φ_kv is the probability of the word v in the topic k. Intuitively, φ_kv represents the relevance of the word v to the topic k. For example, in autumn, people talk about fallen leaves more often than about blooming flowers. Such topic relevance of the vocabulary words is represented by the parameter vector φ_k = (φ_k1, ..., φ_kV).

When inference for LDA estimates many word tokens in the document d as strongly relevant to the topic k, i.e., as having a large probability in the topic k, θ_dk is also estimated as large. If the estimated value of θ_dk is far larger than those of the θ_dk' for k' ≠ k, we can say that the document d is exclusively related to the topic k. We use LDA in this manner to judge whether a given sequence is exclusively related to some particular topic when we score the Tanka poems generated by the RNN.

In our experiment, the inference was performed by collapsed Gibbs sampling (CGS) [7]. CGS iteratively updates the topic assignments of all word tokens in every training document. This assignment can be viewed as a clustering of word tokens, where the number of clusters is K. To explain the details of CGS, we introduce some more notation. Let x_di denote the vocabulary word appearing as the i-th word token of the document d, and z_di the topic to which the i-th word token of the document d is assigned. CGS updates z_di based on the topic probabilities given as follows:

  p(z_di = k | X, Z^¬di) ∝ (n_dk^¬di + α) × (n_kv^¬di + β) / (n_k^¬di + V β),   for k = 1, ..., K.   (1)

This is the conditional probability that the i-th word token of the document d is assigned to the topic k, given all observed word tokens X and all topic assignments except the one we are about to update; the latter is denoted by Z^¬di in Eq. (1), where ¬di means the exclusion of the i-th word token of the document d. These probabilities are normalized so that Σ_k p(z_di = k | X, Z^¬di) = 1. In Eq. (1), n_dk is the number of word tokens in the document d that are assigned to the topic k, n_kv is the number of tokens of the word v that are assigned to the topic k in the entire document set, and n_k is defined as n_k ≡ Σ_{v=1}^{V} n_kv. α is the hyperparameter of the symmetric Dirichlet prior for the θ_d's, and β is that of the symmetric Dirichlet prior for the φ_k's [7][1]. In CGS, we compute the probabilities of the K topics by Eq. (1) at every word token and sample a topic based on these probabilities. The sampled topic is set as the new value of z_di. We repeat this sampling hundreds of times over the entire training document set.
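For concreteness, the following NumPy sketch implements one sweep of this sampler. Here docs is assumed to be a list of documents, each given as a list of word ids; the function and variable names are illustrative and not taken from the authors' implementation.

    import numpy as np

    def init_counts(docs, K, V, seed=0):
        # Randomly assign topics and build the count arrays used in Eq. (1).
        rng = np.random.default_rng(seed)
        z = [rng.integers(K, size=len(doc)) for doc in docs]
        n_dk = np.zeros((len(docs), K)); n_kv = np.zeros((K, V)); n_k = np.zeros(K)
        for d, doc in enumerate(docs):
            for i, v in enumerate(doc):
                n_dk[d, z[d][i]] += 1; n_kv[z[d][i], v] += 1; n_k[z[d][i]] += 1
        return z, n_dk, n_kv, n_k, rng

    def gibbs_pass(docs, z, n_dk, n_kv, n_k, K, V, alpha, beta, rng):
        # One CGS sweep over all word tokens; Eq. (1) gives the sampling weights.
        for d, doc in enumerate(docs):
            for i, v in enumerate(doc):
                k_old = z[d][i]
                n_dk[d, k_old] -= 1; n_kv[k_old, v] -= 1; n_k[k_old] -= 1   # the "not di" counts
                p = (n_dk[d] + alpha) * (n_kv[:, v] + beta) / (n_k + V * beta)
                k_new = rng.choice(K, p=p / p.sum())
                z[d][i] = k_new
                n_dk[d, k_new] += 1; n_kv[k_new, v] += 1; n_k[k_new] += 1

Repeating gibbs_pass for several hundred iterations corresponds to the sampling schedule described above.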
When performing CGS in our evaluation experiment, we used the same set of Tanka poems as that used for training the RNN. Therefore, D = 143,550 and V = 3,631, as given in Subsection 2.2. Since the ordering of word tokens is irrelevant to LDA, each non-overlapping bigram sequence was regarded as a bag of bigrams. K was set to 50, because other numbers gave no significant improvement. The hyperparameters α and β were tuned by grid search based on validation set perplexity. It should be noted that CGS for LDA can be run independently of the training of the RNN.

Table 1. An example of topic words (one topic per row)

  (HA na) (ha na) (HA ru) (no _) (hi su) (U ku) (ni _) (YA ma) (hu _) (sa to) (ru _) (NI ho) (U me) (ha ru) (ta ti) (HU ru) (SA ki) (U no)
  (tu ki) (no _) (NA ka) (ni _) (TU ki) (A ka) (KU mo) (te _) (ka ke) (wo _) (so ra) (ki yo) (ha _) (KA ke) (ka ri) (hi no) (yu ku) (KO yo)
  (su _) (to ki) (HO to) (ko ye) (NA ki) (NA ku) (KO ye) (HI to) (ho to) (ni _) (ni na) (su ka) (ku _) (MA tu) (HA tu) (ku ra) (MA ta) (ya ma)
  (ka se) (ku _) (A ki) (ni _) (no ha) (se _) (KO ro) (so hu) (no _) (HU ki) (O to) (HU ku) (tu ka) (sa mu) (WO ki) (MA tu) (mo u) (U ra)
  (KI mi) (no _) (KA yo) (ni _) (te _) (MA tu) (wo _) (ha _) (TI yo) (ma tu) (to se) (tu yo) (I ku) (YO ro) (mu _) (TI to) (no ma) (ka ta)

Table 1 gives an example of the top-ranked topic words in terms of φ_kv for five among the 50 topics. Each row corresponds to a different topic. The five topics represent the bloom of flowers, the autumn moon, singing birds, blowing wind, and thy reign, respectively, from top to bottom. For example, in the topic corresponding to the top row of Table 1, the words "ha na" (flowers), "ha ru" (spring), "ni ho" (the first two Hiragana characters of the word "ni ho hi," which means fragrance), and "u me" (plum blossom) have large probabilities. In the topic corresponding to the second row, the words "tu ki" (moon), "ku mo" (cloud), "ka ke" (waning of the moon), and "ko yo" (the first two Hiragana characters of the word "ko yo hi," which means tonight) have large probabilities.

2.4 Topic-Based Scoring

Our sequence scoring uses the φ_k's, i.e., the per-topic word probabilities, learned by CGS from the training data set. CGS provides φ_kv as φ_kv ≡ (n_kv + β) / (n_k + V β), where n_kv and n_k are defined in Subsection 2.3. Based on these φ_k's, we can assign each word token of unseen documents, i.e., documents not used for the inference, to one of the K topics. This procedure is called fold-in [1]. In our case, the bigram sequences generated by the RNN are the unseen documents. The fold-in procedure runs CGS over each sequence while keeping the φ_k's unchanged. We then compute topic probabilities by counting the assignments to each topic.
For example, when the length of a given bigram sequence is 18 and the fold-in procedure assigns 12 of the 18 bigrams to some particular topic, the probability of that topic is (12 + α) / (18 + Kα). In the experiment, we obtained eight topic assignments, one every 50 iterations after the 100th iteration of the fold-in procedure for each sequence. The mean of these eight probabilities was used to compute the entropy −Σ_{k=1}^{K} θ̄_0k log θ̄_0k, where θ̄_0k is the averaged probability of the topic k for the sequence under consideration. We call this the topic entropy. Smaller topic entropies are regarded as better, because they correspond to situations where more tokens are assigned to the same topic. In other words, we aim to select sequences showing topic consistency.
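A minimal sketch of this fold-in and topic-entropy computation is given below. It assumes that the per-topic word probabilities phi (a K x V array) have already been estimated by CGS and that each generated sequence is given as a list of bigram word ids; the sampling schedule follows the text (eight samples, one every 50 iterations after the 100th), and the remaining details are assumptions.

    import numpy as np

    def topic_entropy(seq, phi, K, alpha, n_iter=500, burn_in=100, lag=50, seed=0):
        # seq: list of bigram word ids of one generated sequence;
        # phi: K x V array of per-topic word probabilities learned by CGS.
        rng = np.random.default_rng(seed)
        z = rng.integers(K, size=len(seq))              # random initial assignments
        n_k = np.bincount(z, minlength=K).astype(float)
        samples = []
        for it in range(1, n_iter + 1):
            for i, v in enumerate(seq):
                n_k[z[i]] -= 1
                p = (n_k + alpha) * phi[:, v]           # phi is kept fixed (fold-in)
                z[i] = rng.choice(K, p=p / p.sum())
                n_k[z[i]] += 1
            if it > burn_in and (it - burn_in) % lag == 0:
                samples.append((n_k + alpha) / (len(seq) + K * alpha))
        theta = np.mean(samples, axis=0)                # averaged topic probabilities
        return -np.sum(theta * np.log(theta))           # smaller = more topic-consistent

Sequences are then ranked in increasing order of this topic entropy.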
3 Evaluation

The evaluation experiment compared our scoring method to the method based on RNN output probability. The output probability in the RNN is obtained as follows. We can generate a random sequence with the RNN by starting from the special word BOS and then randomly drawing words one by one until we draw the special word EOS. The output probability of the generated sequence is the product of the output probabilities of all its tokens, where the probability of each token is the output probability at the corresponding step of the sequence generation. Our LDA-based method was compared to this probability-based method.

We first investigate the difference between the top-ranked Tanka poems obtained by the two scoring methods. Table 2 presents an example of the top five Tanka poems selected by our method (upper block) and those selected by RNN output probability (lower block). To obtain these top-ranked poems, we first generated 100,000 Tanka poems with the GRU-RNN. Since the number of poems containing grammatically incorrect parts was large, a grammar check was required. However, we could not find any good grammar-checking tool for archaic Japanese. Therefore, as an approximation, we regarded a poem as grammatically incorrect if it contained at least one part appearing in no training Tanka poem.^4 Performing the grammar check in a more appropriate manner is left as future work. After removing the grammatically incorrect poems, we assigned each remaining poem a score based on each scoring method. Table 2 presents the resulting top five Tanka poems for each method.

^4 This grammar check eventually reduces our search space to the set of random combinations of the parts appearing in the training set. However, even when we generate Tanka poems simply by randomly shuffling the parts appearing in existing Tanka poems, we need some criterion to decide whether an obtained poem is good or not. To propose such a criterion is nothing but to propose a method for Tanka poem generation. The data set used in our experiment contains around 48,000 unique subsequences that can be used as the first or third part of a Tanka poem and around 215,000 unique subsequences that can be used as the second, fourth, or fifth part. It is far from a trivial task to select from all possible combinations of these parts (i.e., from about 48,000^2 × 215,000^3 combinations) only those that are good in some definite sense.

Table 2. Top five Tanka poems selected by the compared methods (each poem is shown as its five parts)

  Topic entropy (our method):
  1  si ku re yu ku / ka tu ra ki ya ma no / i ro hu ka ki / mo mi ti no i ro ni / si ku re hu ri ke ri
  2  ti ha ya hu ru / yu hu hi no ya ma no / ka mi na hi no / mi mu ro no ya ma no / mo mi ti wo so mi ru
  3  ka he ru ya ma / hu mo to no mi ti ha / ki ri ko me te / hu ka ku mo mu su hu / wo ti no ya ma ka se
  4  o ho ka ta mo / wa su ru ru ko to mo / ka yo he to mo / o mo hu ko ko ro ni / o mo hu ha ka ri so
  5  u ti na hi ku / ko ro mo te sa mu ki / o ku ya ma ni / u tu ro hi ni ke ri / a ki no ha tu ka se

  Output probability:
  1  o ho a ra ki no / mo ri no si ta ku sa / ka mi na tu ki / mo ri no si ta ku sa / ku ti ni ke ru ka na
  2  hi sa ka ta no / ku mo wi ni mi yu ru / tu ki ka ke no / tu ki ka ke wa ta ru / a ma no ka ku ya ma
  3  ti ha ya hu ru / ka mo no ka ha na mi / ta ti ka he ri / ki ru hi to mo na ki / a hu sa ka no se ki
  4  ta ka sa ko no / wo no he no sa ku ra / sa ki ni ke ri / mi ne no ma tu ya ma / yu ki hu ri ni ke ri
  5  ta ka sa ko no / wo no he no sa ku ra / na ka mu re ha / a ri a ke no tu ki ni / a ki ka se so hu ku

Table 2 shows that when we use RNN output probability (the lower block of Table 2), it is difficult to achieve topic consistency. The fourth poem contains the words "sa ku ra" (cherry blossom) and "yu ki" (snow). The fifth one contains the words "sa ku ra" (cherry blossom) and "a ki ka se" (autumn wind). The poems top-ranked by RNN output probability are thus likely to contain words expressing different seasons, which is a fatal error for Tanka composition. In contrast, the first poem selected by our method contains the words "si ku re" (drizzling rain) and "mo mi ti" (autumnal tints). Because drizzling rain is a shower observed in late autumn or early winter, the word "mo mi ti" fits well within this context. Further, the third poem contains the words "ya ma" and "hu mo to"; the former means mountain, and the latter the foot of a mountain. This poem also shows topic consistency. However, a slight weakness can be observed in the poems selected by our method: the same word tends to be used twice or more. For example, the word "o mo hu" (think, imagine) is repeated in the fourth poem. While a refrain is often observed in Tanka poems, future work may introduce an improvement here.

Next, we investigated the diversity of the Tanka poems selected by the two methods. We picked the 200 top-ranked poems given by each scoring method and split each poem into its five parts, obtaining 200 × 5 = 1,000 parts in total for each method.
Since the resulting set of 1,000 parts included duplicates, we grouped the 1,000 parts by their identity and counted the duplicates. Table 3 presents the ten most frequent parts for each scoring method.

Table 3. Ten most frequent parts in the top 200 Tanka poems

  Topic entropy                   Output probability
  hi sa ka ta no (9)              a ri a ke no tu ki no (13)
  ko ro mo te sa mu ki (8)        hi sa ka ta no (12)
  a ri a ke no tu ki ni (8)       ta ka sa ko no (12)
  ho to to ki su (7)              ho to to ki su (11)
  a ri a ke no tu ki (6)          a si hi ki no (10)
  a ki ka se so hu ku (6)         a ma no ka ku ya ma (8)
  sa ku ra ha na (6)              a ri a ke no tu ki ni (7)
  ka mi na tu ki (6)              na ri ni ke ri (7)
  ta na ha ta no (5)              sa ku ra ha na (7)
  a ki no yo no tu ki (5)         ya ma ho to to ki su (7)

When we used RNN output probability for scoring (right column), "a ri a ke no tu ki no" appeared 13 times, "hi sa ka ta no" 12 times, "ta ka sa ko no" 12 times, and so on, among the 1,000 parts coming from the 200 top-ranked poems. In contrast, when we used the LDA-based scoring (left column), "hi sa ka ta no" appeared nine times, "ko ro mo te sa mu ki" eight times, "a ri a ke no tu ki ni" eight times, and so on. That is, there were fewer duplicates for our method. Further, we also observed that while only 678 of the 1,000 parts were unique when RNN output probability was used for scoring, 806 were unique when our scoring method was used. It can be said that our method explores a larger diversity.
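The duplicate counting itself is straightforward; a small sketch is given below, assuming each of the 200 top-ranked poems is given as a list of its five parts.

    from collections import Counter

    def part_statistics(top_poems):
        # top_poems: 200 poems, each a list of its five parts (strings).
        parts = [part for poem in top_poems for part in poem]   # 200 x 5 = 1,000 parts
        counts = Counter(parts)
        print("ten most frequent parts:", counts.most_common(10))
        print("unique parts:", len(counts), "out of", len(parts))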
4 Previous Studies

There already exist many proposals for automatic poetry composition using RNNs. However, LDA is the key component of our method, so we focus on the proposals that use topic modeling. Yan et al. [14] utilize LDA for Chinese poetry composition. However, LDA is only used for obtaining word similarities, not for exploring topical diversity, and the poem generation is not performed with an RNN. Subsequent work by the same author [15] proposes a method called iterative polishing for recomposing the poems generated by a well-designed RNN architecture, but LDA is no longer used.

Combinations of topic modeling and RNNs can be found in proposals not related to automatic poetry composition. Dieng et al. [5] propose a combination of RNN and topic modeling. Their model, called TopicRNN, modifies the output word probabilities of the RNN with the long-range semantic information of each document captured by an LDA-like mechanism. However, when sequences are generated with TopicRNN, we need to choose one of the existing documents as a seed, which means that we can only generate a sequence similar to the chosen document. While TopicRNN has this limitation, it provides a valuable guide for future work. Our method detaches sequence selection from sequence generation: only after generating sequences with the RNN do we perform a selection. It may be better to directly generate sequences having some desirable character with regard to their topical content. Xing et al. [13] also suggest a similar research direction.

Hall et al. [8] utilize the entropy of topic probabilities for studying the history of ideas. However, their entropy measures how many different topics are covered by a particular conference held in a particular year, so larger entropies are better. Their evaluation direction is thus opposite to ours. Our scoring regards poems having topic consistency as important. Since LDA can extract diverse topics, Tanka poems can be highly scored by our method for a wide variety of reasons: some poems are topically consistent in describing the autumn moon, and others in describing blooming flowers. As a result, our method could eventually select Tanka poems showing diversity. RNN output probability could not address such topical diversity, because it is indifferent to the topic consistency within each Tanka poem.

5 Conclusions

This paper proposed a method for scoring the sequences generated by an RNN. The proposed method, based on topic entropy, was compared to scoring based on RNN output probability and provided more diverse Tanka poems. An immediate piece of future work is to devise a method for combining these two scores. Fig. 3 plots the per-token topic entropies (horizontal axis) and the per-token log output probabilities (vertical axis) of 10,000 sequences generated by the GRU-RNN. Since smaller topic entropies are better, the upper-left part of the plot contains the sequences that are better in terms of both scores. As depicted in Fig. 3, the two scores show only a moderate negative correlation; the correlation coefficient is around -0.41. It may therefore be a meaningful task to seek an effective combination of the two scores for better sequence selection.

Fig. 3. Correlation between the topic entropy and the log output probability.
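A small sketch of this comparison is given below, assuming the per-token topic entropies and per-token log output probabilities have already been computed for the generated sequences (e.g., with the sketches given earlier). The equal-weight combination at the end is shown only as one possible way to merge the two scores, not as the method of the paper.

    import numpy as np

    def correlation_and_combined_score(topic_entropies, log_probs):
        # Pearson correlation between the two per-token scores (the paper reports
        # a coefficient of about -0.41).
        r = np.corrcoef(topic_entropies, log_probs)[0, 1]
        # Lower entropy and higher log probability are both "better"; one naive,
        # hypothetical combination ranks sequences by standardized log probability
        # minus standardized entropy.
        z_e = (topic_entropies - topic_entropies.mean()) / topic_entropies.std()
        z_p = (log_probs - log_probs.mean()) / log_probs.std()
        return r, z_p - z_e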
In this paper, we only considered obtaining better sequences by screening the generated sequences. However, the same goal can also be pursued by modifying the architecture of the RNN. As discussed in Section 4, Dieng et al. [5] incorporate an idea from topic modeling into the RNN architecture. It is an interesting direction for future work to propose an RNN architecture that can generate sequences that are diverse in their topics. With respect to the evaluation, we only considered the diversity of the subsequences of the generated poems. It is also important future work to apply other evaluation metrics such as BLEU [10][15], or even human subjective evaluation.

Acknowledgments. This work was supported by Grant-in-Aid for Scientific Research (B) 15H02789.

References

1. Asuncion, A., Welling, M., Smyth, P., Teh, Y. W.: On smoothing and inference for topic models. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI '09), 27-34 (2009)
2. Blei, D. M., Ng, A. Y., Jordan, M. I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993-1022 (2003)
3. Cho, K., van Merrienboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259 (2014)
4. Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014)
5. Dieng, A. B., Wang, C., Gao, J., Paisley, J.: TopicRNN: A recurrent neural network with long-range semantic dependency. arXiv preprint arXiv:1611.01702 (2016)
6. Graves, A.: Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850 (2013)
7. Griffiths, T. L., Steyvers, M.: Finding scientific topics. Proc. Natl. Acad. Sci. U.S.A. 101, Suppl. 1, 5228-5235 (2004)
8. Hall, D., Jurafsky, D., Manning, C. D.: Studying the history of ideas using topic models. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08), 363-371 (2008)
9. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735-1780 (1997)
10. Papineni, K., Roukos, S., Ward, T., Zhu, W.-J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), 311-318 (2002)
11. Sutskever, I., Martens, J., Hinton, G.: Generating text with recurrent neural networks. In: Proceedings of the 28th International Conference on Machine Learning (ICML '11), 1017-1024 (2011)
12. Tieleman, T., Hinton, G.: Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4 (2012)
13. Xing, C., Wu, W., Wu, Y., Liu, J., Huang, Y., Zhou, M., Ma, W.-Y.: Topic aware neural response generation. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI '17), 3351-3357 (2017)
14. Yan, R., Jiang, H., Lapata, M., Lin, S.-D., Lv, X., Li, X.: i, Poet: Automatic Chinese poetry composition through a generative summarization framework under constrained optimization. In: Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI '13), 2197-2203 (2013)
15. Yan, R.: i, Poet: Automatic poetry composition through recurrent neural networks with iterative polishing schema. In: Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI '16), 2238-2244 (2016)
