Reading Kitty: Automating Prediction of Children’s Oral Reading Skill
Julie Medero, Alfredo Gomez, Alicia Ngo
Harvey Mudd College
Abstract
While affective outcomes are generally positive for the use of
eBooks and computer-based reading tutors in teaching chil-
dren to read, learning outcomes are often poorer (Korat and
Shamir 2004). We describe the first iteration of Reading Kitty,
an iOS application that uses NLP and speech processing to fo-
cus children’s time on close reading and prosody in oral read-
ing, while maintaining an emphasis on creativity and artifact
creation. We also share preliminary results demonstrating that
semitone range can be used to automatically predict readers’
skill level.
Introduction
Early elementary-aged children’s reading achievement is
aided by reading practice in general, especially when that
practice focuses on reading with appropriate prosody (Kim
and Petscher 2016). Because individualized instruction is
time-intensive for teachers, many eBook readers and com-
puterized reading programs are available for emerging read-
ers but, while those tools are popular, they result in poorer
reading interactions than paper books, which negatively im-
pacts learning outcomes.
By focusing students’ time on practicing reading with ap-
propriate prosody, with an end goal of an artifact creation,
the Reading with KIneTic TYpography (Reading Kitty) app
aims to keep the positive affective elements of eBooks
without sacrificing quality of interaction with texts. In the re-
maining sections, we outline first steps toward development
of Reading Kitty. We also share preliminary analysis demon-
strating that the prosodic analysis used in Reading Kitty can
provide accurate and useful feedback to children.
The Reading Kitty app
Kinetic typography is a category of animations that uses
moving text and images to emphasize the meaning and struc-
ture of text. Figure 1 shows a frame from a Kinetic Typog-
raphy animation of a scene from the film Toy Story. In the
animation, words are emphasized through text size, weight,
and motion to reflect the way that they’re emphasized by the
character that delivers the line.
Copyright © 2019, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
Figure 1: Kinetic typography frame (RebelX-Productions
2016).
By interacting with Reading Kitty, a child reader is able
to create their own kinetic typography animations that reflect
their interactions with a story. First, Reading Kitty guides a
child reader through a close reading by presenting a series
of questions customized to the story a child is reading. Those
questions correspond to specific oral reading fluency bench-
marks in the Common Core Standards Initiative (National
Governors Association Center for Best Practices, Council of
Chief State School Officers 2010). The child uses the answer
to each close reading question as the basis for adjusting the
appearance of words in the story.
Next, Reading Kitty has the child read aloud. The
prosody of the child’s reading is used with the text charac-
teristics selected by the child to generate a kinetic typogra-
phy artifact, with better prosody resulting in more dynamic
videos. Finally, feedback about the child’s oral reading is
shared with the child’s teacher.
Pitch Analysis
Predicting high pitch
We want Reading Kitty to reward children who read with
appropriate prosody with more interesting, dynamic anima-
tions of their readings. With this goal in mind, we first exam-
ine the feasibility of predicting which words a skilled reader
would read with a high pitch. Classification models have
been used previously to predict the locations of prosodic
boundaries (Medero and Ostendorf 2013). We adopt this
idea by building a model to predict which words should be
read with a high pitch. We train the model on the Boston
Directions Corpus because this corpus is hand-labeled with
ToBI pitch tones (Silverman et al. 1992). We define a word
to be high-pitched when the word’s labeled tone has an H*
in it. We use features that have proven successful for
predicting ToBI tones in Greek: part-of-speech (POS) tags,
POS bigrams, syntactic chunk tags, the number of syllables
in each word, and each word’s depth in an automatically
generated syntactic constituency tree (Zervas, Fakotakis,
and Kokkinakis 2004). For our data set, though, the resulting
error rate is too high to give reliable feedback to children.
We instead look at pitch range across an utterance.
Figure 2: Average semitone range per utterance for skilled
and struggling readers.
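The labeling rule and two of the lexical features described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the "<s>" start marker, and the sample labels and tags are all hypothetical, and the classifier itself is omitted.

```python
def is_high_pitched(tobi_tone):
    """Return True when a ToBI tone label contains an H* accent.

    Follows the definition in the text: a word counts as high-pitched
    when its labeled tone has an H* in it (for example "H*" or "L+H*").
    Unlabeled words (None or "") are treated as not high-pitched.
    """
    return bool(tobi_tone) and "H*" in tobi_tone


def pos_features(pos_tags):
    """Build one feature dict per word from a sentence's POS tags.

    Only the POS tag and POS bigram features are shown; "<s>" is an
    assumed marker for the start of the sentence.
    """
    features = []
    previous = "<s>"
    for tag in pos_tags:
        features.append({"pos": tag, "pos_bigram": previous + "_" + tag})
        previous = tag
    return features


# Hypothetical labels and tags for a three-word phrase.
tones = ["L*", "L+H*", None]
tags = ["DT", "NN", "VBZ"]
labels = [is_high_pitched(t) for t in tones]
rows = pos_features(tags)
```

The remaining features (chunk tags, syllable counts, tree depth) would be added to the same per-word dictionaries before training a classifier on the labels.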
Semitone range
We analyze the CMU Kids Corpus (Linguistic Data Consortium
and Carnegie Mellon University 1997), which contains 5,180
utterances (spoken sentences) from 76 children. The children
range in age from 6 to 11 and are in first through third grade.
These 76 children are split into two groups: skilled readers
(n = 44) and struggling readers (n = 32). Each utterance in the
corpus is accompanied by a transcript, along with word- and
phone-level forced alignments.
For each utterance, we extract the F0 values in Hz every
0.01 s using Praat’s pitch tracker (Boersma 2001). We
calculate the F0 maximum and F0 minimum for each utterance
in the corpus, and then calculate the semitone range as
described by Aoyama, Akbari, and Flege (2016). For each
utterance, we then average the semitone range separately
across all readers in the skilled group and all readers in
the struggling group.
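The computation above can be sketched as follows. The sketch assumes the common definition of semitone range, 12 · log2(F0max / F0min), which may differ in detail from the procedure in Aoyama, Akbari, and Flege (2016). It also assumes unvoiced frames appear in the F0 track as 0 or as missing values. The function names and the input layout are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean


def semitone_range(f0_hz):
    """Semitone span between the lowest and highest voiced F0 value.

    Frames that are None, NaN, or not positive are treated as unvoiced
    and skipped (an assumption about how the pitch track is exported).
    Returns None when the utterance has no voiced frames.
    """
    voiced = [f for f in f0_hz if f is not None and f > 0]
    if not voiced:
        return None
    # Assumed formula: 12 semitones per doubling of frequency.
    return 12.0 * math.log2(max(voiced) / min(voiced))


def mean_range_by_group(readings):
    """Average the semitone range per utterance within each reader group.

    readings: iterable of (utterance_id, group, f0_track) tuples.
    Returns {utterance_id: {group: mean semitone range}}.
    """
    ranges = defaultdict(lambda: defaultdict(list))
    for utterance_id, group, f0_track in readings:
        value = semitone_range(f0_track)
        if value is not None:
            ranges[utterance_id][group].append(value)
    return {
        utterance_id: {group: mean(values) for group, values in groups.items()}
        for utterance_id, groups in ranges.items()
    }


# Hypothetical F0 tracks in Hz; 0 marks an unvoiced frame.
readings = [
    ("u1", "skilled", [200, 0, 400]),
    ("u1", "skilled", [100, 400]),
    ("u1", "struggling", [0, 150, 300]),
]
averages = mean_range_by_group(readings)
```

Each entry of `averages` holds the two values behind one plotted point per utterance, one for the skilled group and one for the struggling group.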
Figure 2 compares the semitone range for the two groups.
Each point corresponds to an utterance. Points below the line
of equal semitone range correspond to utterances that
struggling readers read with a smaller semitone range than
skilled readers did. Likewise, points above the line
correspond to utterances that struggling readers read with a
larger semitone range than skilled readers did.
Skilled readers show a larger average semitone range than
struggling readers across utterances (Fig. 2), which is
promising for our goal of using semitone range as a metric
for feedback to young readers. Because the two datasets are
not matched in terms of which utterances are read by which
reader, however, we also compare readers who read the same
utterances.
Paired Comparison. Finally, we perform a paired t-test.
We pair a skilled reader with a struggling reader who read
the same utterances. Since the participants in the two groups
do not align exactly, we use the Hungarian method to find
the pairings that maximize the number of shared utterances
(Kuhn 1955), resulting in 13 pairs with 41–91 shared
utterances. For each pair, we ignore utterances that are not
shared. The skilled readers have an average semitone range of
17.03 (σ = 1.85) and the struggling readers have an average
semitone range of 14.83 (σ = 2.15), a difference that is
statistically significant (p < .05).
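The pairing and the test statistic can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the authors' code. The pairing function uses exhaustive search as a stand-in for the Hungarian method: it finds the same optimum but is only feasible for a handful of readers. It does not attempt to reproduce how the 13 reported pairs were selected. All names and sample values are hypothetical.

```python
import math
from itertools import permutations
from statistics import mean, stdev


def best_pairing(skilled, struggling):
    """Pair readers one-to-one to maximize the total shared utterances.

    skilled, struggling: {reader_id: set of utterance ids}.
    Returns a list of (skilled_id, struggling_id) pairs. Exhaustive
    search over assignments, so suitable for small inputs only.
    """
    small, large, flipped = skilled, struggling, False
    if len(skilled) > len(struggling):
        small, large, flipped = struggling, skilled, True
    small_ids = sorted(small)
    best, best_score = [], -1
    # Try every way of assigning distinct readers from the larger group.
    for chosen in permutations(sorted(large), len(small_ids)):
        score = sum(len(small[a] & large[b]) for a, b in zip(small_ids, chosen))
        if score > best_score:
            best, best_score = list(zip(small_ids, chosen)), score
    return [(b, a) if flipped else (a, b) for a, b in best]


def paired_t_statistic(xs, ys):
    """t statistic of a paired t-test on two equal-length samples."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical readers and the utterances each one read.
skilled = {"s1": {1, 2, 3}, "s2": {3, 4}}
struggling = {"t1": {3, 4, 5}, "t2": {1, 2}}
pairs = best_pairing(skilled, struggling)

# Hypothetical per-reader mean semitone ranges, ordered pair by pair.
t = paired_t_statistic([17.0, 18.0, 16.0], [15.0, 15.0, 15.0])
```

At corpus scale, `scipy.optimize.linear_sum_assignment` with `maximize=True` solves the same assignment problem efficiently. With 13 pairs the test has 12 degrees of freedom, so a two-sided test at the .05 level requires |t| above roughly 2.18.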
Conclusion
Reading Kitty represents a new way to deliver evidence-based
reading instruction in electronic form. Our initial analysis
provides promising evidence that semitone-based features can
be used to give feedback on the prosodic quality of oral
readings.
References
Aoyama, K.; Akbari, C.; and Flege, J. 2016. Prosodic char-
acteristics of American English in school-age children. In
Proc. Speech Prosody, 572–576.
Boersma, P. 2001. Praat, a system for doing phonetics by
computer. Glot International.
Linguistic Data Consortium, and Carnegie Mellon University.
1997. The CMU Kids Speech Corpus.
Kim, Y. S. G., and Petscher, Y. 2016. Prosodic Sensitiv-
ity and Reading: An Investigation of Pathways of Relations
Using a Latent Variable Approach. Journal of Educational
Psychology 108(5):630–645.
Korat, O., and Shamir, A. 2004. Do Hebrew electronic
books differ from Dutch electronic books? A replication of
a Dutch content analysis. Journal of Computer Assisted
Learning 20(4):257–268.
Kuhn, H. W. 1955. The Hungarian method for the assignment
problem. Naval Research Logistics Quarterly 2(1–2):83–97.
Medero, J., and Ostendorf, M. 2013. Atypical Prosodic
Structure as an Indicator of Reading Level and Text Diffi-
culty. In Proc. NAACL-HLT, 715–720.
National Governors Association Center for Best Practices,
Council of Chief State School Officers. 2010. Common
Core State Standards for English Language Arts. Techni-
cal report, National Governors Association Center for Best
Practices, Council of Chief State School Officers, Washing-
ton, D.C.
RebelX-Productions. 2016. Toy Story - “You Are A Toy” Kinetic
Typography.
https://youtu.be/_PU49-_qQoI.
Silverman, K.; Beckman, M.; Pitrelli, J.; Ostendorf,
M.; Wightman, C. C.; Price, P.; Pierrehumbert, J.; and
Hirschberg, J. 1992. TOBI: A Standard for Labeling En-
glish Prosody. In Second International Conference on Spo-
ken Language Processing.
Zervas, P.; Fakotakis, N.; and Kokkinakis, G. 2004. Pitch
accent prediction from ToBI annotated corpora based on
Bayesian learning. Lecture Notes in Artificial Intelligence
(Subseries of Lecture Notes in Computer Science)
3206:545–552.