Generating Audio-Visual Slideshows from Text Articles Using Word Concreteness

Mackenzie Leake, Hijung Valentina Shin, Joy O. Kim, Maneesh Agrawala
CHI 2020
Apr. 9th, 2021
Presenter: Seunghyeong Choe

Contents
• Overview of the paper
• Introduction
• Related Work
• Formative Study
• Methods
• Results
• Evaluation
• Limitation and Future Work

Overview of the paper
Automatically transform text articles into audio-visual slideshows
Evaluate the generated slideshows
Use word concreteness to select keywords
Select images based on word concreteness

Terminology
• Word Concreteness
 How strongly a word or phrase is related to some perceptible concept.
 Example: "AirPods" is concrete (easy to picture); "intuitive" is abstract (no obvious picture).

Introduction
Visual Content
• Enhances written information
• Photos add emphasis
• Diagrams make articles easier to understand
• Maps illustrate directions
Audio-visual Content
• Includes audio content
• Aids longer-term recall
• Higher preference
• Higher understandability
• Requires significant effort, time, and skill from the author

Introduction
1. Find most concrete words
2. Search image files
3. Speech generation

Related Work
Text-Based Video Editing Tools
• Article Video Robot
• Automatically arrange user-provided video clips
• Visual Transcripts
• SceneSkim
Automatic Visualization of Text
• Videolization: Wikipedia articles to video
• Text summarization using multiple images
• Multimodal summaries for complex sentences

Formative Study
• How does the format of an article affect a viewer’s understanding and preferences?
• Recruited 120 participants on Amazon Mechanical Turk
 Measured preference and understandability for three formats: text only, text with images, and audio-visual slideshow
 Articles were randomly assigned

Formative Study
• Survey Results
Viewers preferred the slideshow format over text only and text with images.
Viewers also found content presented in slideshows easiest to understand.

Methods
• Steps to generate an audio-visual slideshow from a text article
A. Segment the text article into sentences
B. Search for images using concrete words
C. Generate audio narration using Google Cloud Text-to-Speech
D. Time-align the audio narration and the images

Methods
Obtaining Images for Text Using Concreteness
• Computing the image search query for each sentence
 Sub-step 1: Concreteness
• 40k-word dataset
• Human-rated concreteness on a scale of 1 to 5
• Threshold τ = 4.5 was chosen empirically (selects farmers, wheat)
• spaCy dependency parser identifies noun phrases and compound nouns (common wheat)
They are often raised in Kansas, near where farmers also grow common wheat.
Concreteness per word: They 2.93, are 1.96, often 2.50, raised 2.86, in 3.0, Kansas x, near 2.79, where 1.66, farmers 4.54, also 1.83, grow 3.03, common 2.07, wheat 4.89
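The thresholding in sub-step 1 can be sketched as follows. This is a minimal illustration, not the authors' code: the ratings are the per-word values shown on the slide for the example sentence, the name `concrete_words` is hypothetical, and treating the threshold as inclusive (`>=`) is an assumption.

```python
# Concreteness ratings copied from the slide's example sentence.
# "Kansas" is marked "x" on the slide, so it is stored as None here.
RATINGS = {
    "they": 2.93, "are": 1.96, "often": 2.50, "raised": 2.86, "in": 3.0,
    "kansas": None, "near": 2.79, "where": 1.66, "farmers": 4.54,
    "also": 1.83, "grow": 3.03, "common": 2.07, "wheat": 4.89,
}

TAU = 4.5  # threshold reported on the slide


def concrete_words(sentence, ratings=RATINGS, tau=TAU):
    """Return the words of `sentence` whose rating is at least `tau` (hypothetical helper)."""
    selected = []
    for raw in sentence.split():
        word = raw.strip(".,;:!?").lower()
        score = ratings.get(word)
        # Words without a rating (None or missing) are skipped.
        if score is not None and score >= tau:
            selected.append(word)
    return selected


sentence = "They are often raised in Kansas, near where farmers also grow common wheat."
print(concrete_words(sentence))  # ['farmers', 'wheat']
```

The sketch leaves out the noun-phrase expansion (farmers, wheat → common wheat), which the slide attributes to the spaCy dependency parser.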

Methods
 Sub-step 2: Named Entities
• Use spaCy named entity recognition tags to identify words
• People, places, and organizations (Kansas)
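They are often raised in Kansas, near where farmers also grow common wheat.
Concreteness per word: They 2.93, are 1.96, often 2.50, raised 2.86, in 3.0, Kansas x, near 2.79, where 1.66, farmers 4.54, also 1.83, grow 3.03, common 2.07, wheat 4.89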

Methods
 Sub-step 3: Pronoun Replacement
• Use NeuralCoref
• Pronoun coreference resolution method
• Data-driven NLP approach
• They → Cows
Cows are often raised in Kansas, near where farmers also grow common wheat.
2.93 1.96 2.50 2.86 3.0 x 2.79 1.66 4.54 1.83 3.03 2.07 4.89

Methods
• Special Cases
 Duplicate words: keep only a single occurrence
 Single-word query for a sentence: add the article title to provide context and reduce ambiguity
 Empty search query: continue to show the image from the previous sentence
 Empty search query for the first sentence: pull the image from the nearest sentence (rare case)
• Image Selection
 Use Bing Image Search
 Minimum resolution 480x360, aspect ratio 4:3
 Filter out charts, diagrams, and images that contain text
 Exclude stock-image URLs to avoid watermarks
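The special cases listed on this slide can be sketched as a small post-processing pass over the per-sentence queries. This is an illustrative sketch under stated assumptions: `finalize_queries` is a hypothetical name, the "add the article" rule is read as appending the article title, and reusing a neighbouring sentence's query string stands in for reusing its image.

```python
def finalize_queries(queries, article_title):
    """Apply the slide's special cases to a list of per-sentence word lists.

    `queries` holds one list of query words per sentence; an empty list
    means no query word was found for that sentence.
    """
    cleaned = []
    for words in queries:
        # Duplicate words: keep only the first occurrence, preserving order.
        unique = list(dict.fromkeys(words))
        # Single-word query: add the article title for context (assumed reading).
        if len(unique) == 1:
            unique.append(article_title)
        cleaned.append(" ".join(unique))

    # Empty query: fall back to the previous sentence's query.
    for i in range(1, len(cleaned)):
        if not cleaned[i]:
            cleaned[i] = cleaned[i - 1]

    # Empty query at the start: borrow from the nearest later non-empty sentence.
    first_filled = next((q for q in cleaned if q), "")
    return [q or first_filled for q in cleaned]


# Hypothetical input: four sentences from an article titled "Cattle".
example = [[], ["wheat", "wheat", "farmers"], ["cows"], []]
print(finalize_queries(example, "Cattle"))
```

For the hypothetical input above, the first sentence borrows "wheat farmers" from the second, and the last sentence keeps "cows Cattle" from the third.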

Methods
Slideshow Composition
• Audio Narration
 Google Cloud Text-to-Speech
 Reprocess the output audio through Google Speech-to-Text
• Returns per-word timestamps
• Provides timing information
 Needleman-Wunsch algorithm
• Finds the optimal alignment between the input text article and the transcript
• Time-aligning images to the narration
 Keep showing the previous image if an image would be displayed for less than 2 seconds
• Composition and Effects
 Crop to 960x720 using Python-smart-crop
 Zoom if a face is detected and pan if a salient region exists, using OpenCV
 Add captions
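The alignment step on this slide can be illustrated with a word-level Needleman-Wunsch sketch. The scoring values (match +1, mismatch -1, gap -1), the function name, and the timestamps in the example are assumptions for illustration; the slides do not state the parameters the authors used.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align sequences a and b.

    Returns a list of (i, j) index pairs; None marks a gap on that side.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return pairs


# Article words vs. a hypothetical speech-to-text transcript that dropped "also".
article = ["farmers", "also", "grow", "common", "wheat"]
transcript = [("farmers", 0.0), ("grow", 0.5), ("common", 0.8), ("wheat", 1.2)]

pairs = needleman_wunsch(article, [word for word, _ in transcript])
# Start time for each article word, or None where the transcript has a gap.
times = [transcript[j][1] if j is not None else None
         for i, j in pairs if i is not None]
print(list(zip(article, times)))
```

Article words that align to a gap get no timestamp of their own; a real pipeline would need some fallback for them, for example the time of a neighbouring word.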

Results
• Created 13 slideshow videos from Wikipedia and HowStuffWorks articles.
• Takes 2 to 10 minutes to generate a slideshow

Results
• Sentences without concrete words
 Conversational sentences
 Pulls the image from the next sentence
 Holds the prior image on screen.

Evaluation
• Comparison of Automatic and Manual Search Queries
• How well does the system identify the appropriate image search query?
• Evaluate the overall quality of the generated slideshows
• 3 human annotators created search queries manually, without knowledge of the system
• Red text: commonly selected search queries
• Green text: manually selected search queries
• Blue text: automatically selected search queries
• Measure the F1 score to compare the words in the manually and automatically selected queries
• Images resulting from automatic and manual queries may not differ in meaningful ways
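The F1 comparison between manual and automatic queries can be sketched as a set-overlap score. This is one plausible reading, not the paper's exact protocol: treating each query as a set of words is an assumption, `query_f1` is a hypothetical name, and the word lists in the example are invented for illustration.

```python
def query_f1(manual_words, auto_words):
    """F1 between two queries, treating each as a set of words."""
    manual, auto = set(manual_words), set(auto_words)
    overlap = len(manual & auto)
    if overlap == 0:
        return 0.0
    precision = overlap / len(auto)   # share of automatic words an annotator also chose
    recall = overlap / len(manual)    # share of annotator words the system found
    return 2 * precision * recall / (precision + recall)


# Hypothetical queries for one sentence.
manual = ["cows", "kansas", "wheat"]
auto = ["kansas", "farmers", "wheat"]
print(round(query_f1(manual, auto), 3))  # 0.667
```

Here two of three words overlap on each side, so precision and recall are both 2/3 and so is the F1 score.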

Evaluation
• User Study and Feedback
 Assessing output slideshow video
 Compare 3 types of video
• Manually created video
• Keyword-search based approach video (Rapid Automatic Keyword Extraction, RAKE)
• Concreteness based video
 Recruit 120 participants from Amazon Mechanical Turk

Evaluation
Participants strongly preferred the concreteness-based slideshows over the keyword-based version.
No strong preference between the manually created and automatically created versions.

Evaluation
Images selected by the concreteness-based approach were more relevant than those from the keyword-based approach.

Limitation and Future Work
• Concreteness can be applied to a wide range of domains
 Poetry, classic literature
 Different grammatical structures in international articles
• Does not filter copyrighted images
• Cannot identify objects and people that are not famous
• Only uses static images
• Future work
 Using video clips (trimming, timing)
 Use imageability, specificity, and familiarity in addition to concreteness
