Television News Search and Analysis with Lucene/Solr
Kai Chan <kai@ssc.ucla.edu> 
Social Sciences Computing 
University of California, Los Angeles 
Lucene Revolution, May 10, 2012
Communication Studies Archive 
Background (1) 
• Continuation of analog recording of TV news 
– Thousands of tapes since Watergate/1970s 
– Hard to look for a particular news program or topic
Communication Studies Archive 
Background (2) 
• Digital recording since 2005 
• Capture news programs on computers 
– Video: can be streamed over the Web 
– Closed captioning (“subtitle text”): indexed and searchable
– Image snapshots 
– Search engine and analysis tools 
Communication Studies Archive 
Background (3) 
• Also download transcripts and web-streamed news programs
• 100 news programs and 600,000 words added each day
Communication Studies Archive 
Background (4) 
• January 2005 to present 
– 28 networks 
– 1,600 shows 
– 130,000 hours 
– 160,000 news programs 
– 50,000,000 images 
– 880,000,000 words 
Why This is Important (1) 
• Researchers 
– Large and unique collection of communication 
– Many modalities 
• Speech, facial expression, body gesture, etc. 
– Different conditions/settings 
– Different networks and communities 
– Allows study of TV news + communication in general in ways impossible before
Why This is Important (2) 
• Non-researchers 
– TV news about presentation and persuasion 
• Which happen in daily life also 
– TV main source of news for many/most 
– Greatly affects the public’s decisions 
– Learn about what we watch 
[Image-only slides: screenshots of the search interface and of search results in the list, table and chart formats]
Application in Research 
• Communication Studies 
– Amount of coverage for events over time 
• Linguistics
– Speech and language patterns 
• Computer Science 
– Object identification 
– Identify news anchors, public figures 
– Story segmentation 
Application in Teaching (1) 
• Chicano Studies: Representations of Latinos on the Television News
– May 1, 2007 immigration march 
– MacArthur Park, Los Angeles, CA 
– 2 days (May 1 & 2, 2007) 
– Framing, stereotyping, metaphor, silencing 
– Reports with screenshots and links to news stories
Application in Teaching (2) 
• Communication Studies: Presidential Communication
– 2008 presidential primary 
– 6 weeks (Dec 2007 to Feb 2008) 
– Coverage of sound bites 
• Amount of time given to candidate/party 
• Types of response (positive, neutral, negative) 
– Students created their own political ad. 
Workflow (1)
Capture/conversion machines 
• 2 groups, 2 machines per group 
– Keep the best recording 
– 6 TV tuners per machine 
• Capture video and CC to separate files in real-time
– MPEG-TS (~7 GB/hr) 
– Timestamp every 2-3 seconds 
• Generate image snapshots 
• Convert videos 
– MP4/H.264 (VGA, ~240 MB/hr) 
Workflow (2)
Storage/static file servers 
• Control server 
– Download TV schedules 
– Download web-streamed news programs
– Collect and check recordings 
– Push files to the other servers
• Video streaming server 
• Backup storage server 
• Image server 
Workflow (3)
Search server 
• Lucene index updated daily (indexing sketch below)
– Main text field tokenized
– Separate fields for date, network, show, etc.
– Binary fields for segment and time data
• Hosts search engine
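For a concrete picture, here is a minimal sketch of what the daily indexing step might look like with the Lucene 3.x API that was current at the time of this talk. The field names and the pre-packed byte[] arguments are hypothetical; the slide only specifies a tokenized main text field, exact-match fields such as date, network and show, and binary fields for segment and time data.

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class NewsProgramIndexer {
    // Field names and the pre-packed byte[] arguments are illustrative;
    // the slide only says which fields are tokenized, exact-match, or binary.
    static void addProgram(IndexWriter writer, String ccText, String network,
                           String show, String date,
                           byte[] segmentData, byte[] timeData) throws IOException {
        Document doc = new Document();
        // main closed-captioning text: tokenized, stored for context display
        doc.add(new Field("text", ccText, Field.Store.YES, Field.Index.ANALYZED));
        // exact-match fields for filtering and grouping
        doc.add(new Field("network", network, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("show", show, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("date", date, Field.Store.YES, Field.Index.NOT_ANALYZED));
        // binary stored fields: packed segment boundaries and timestamp data,
        // decoded at search time by the custom query code
        doc.add(new Field("segments", segmentData));
        doc.add(new Field("times", timeData));
        writer.addDocument(doc);
    }
}
```

Keeping the segment and timestamp data as opaque binary keeps the index compact; the segment-enclosed and time-enclosed queries described later unpack it per document.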
The search process 
Custom query type 
Segment-enclosed query (1) 
• Problem 1: search for “X near Z”
• Lucene: search for “X within Y words of Z” (sketch below)
– How to pick Y?
– Hard to pick a fixed number
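“X within Y words of Z” is what Lucene’s SpanNearQuery provides out of the box. A minimal sketch (field and term values are illustrative):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class ProximityQueryExample {
    // "X within Y words of Z": slop = y, inOrder = false means the two
    // terms may appear in either order, at most y positions apart.
    static SpanQuery withinYWords(String field, String x, String z, int y) {
        SpanQuery qx = new SpanTermQuery(new Term(field, x));
        SpanQuery qz = new SpanTermQuery(new Term(field, z));
        return new SpanNearQuery(new SpanQuery[] { qx, qz }, y, false);
    }
}
```

The slop parameter is exactly the fixed Y the slide says is hard to pick, which is what motivates the segment-enclosed query that follows.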
Custom query type 
Segment-enclosed query (2) 
• Problem 2: the matched search words might not all be about the same story
– E.g. “Obama AND visit AND Afghanistan”
– Might match a news program about Obama’s visit to Canada + violence in Afghanistan
Custom query type 
Segment-enclosed query (3) 
• A news program can contain several stories 
– E.g. Local, national, world, weather, sports 
Custom query type 
Segment-enclosed query (4) 
Custom query type 
Segment-enclosed query (5) 
• One solution: search for “X and Z within same story segment”
– Possible with Lucene + story segment info
• Bonus: enables searching/filtering for a particular story type
– E.g. Politics
Custom query type 
Segment-enclosed query (6) 
• How to mark segments 
– Automated 
• Computer Science researchers working on them 
• Word frequency 
• Scene change 
• Black frame and silence 
– Manual segmentation 
• Watch the video 
• Decide where a story starts and ends 
• Mark positions in semi-automated system 
Custom query type 
Segment-enclosed query (7) 
Custom query type 
Segment-enclosed query (8) 
• Idea (sketch below)
– Get spans from SpanNearQuery
– Filter and keep those fully within segments
• In production: segment info in stored fields
– As a list of <start position, end position>
– Simple to implement
– Reasonably fast searching
• Alternative: store segment info as terms
– Possible to find segments by themselves
– Appears to run much faster
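A minimal sketch of the idea on this slide, using the Lucene 3.x-era Spans API: enumerate the spans a SpanQuery matches, then keep only those that fall entirely inside one story segment. The decodeSegments helper is hypothetical; in production it would unpack the <start position, end position> list from the stored binary field.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.Spans;

public class SegmentEnclosedQuery {
    /** One surviving match: document id plus span start/end positions. */
    static class Hit {
        final int doc, start, end;
        Hit(int doc, int start, int end) {
            this.doc = doc; this.start = start; this.end = end;
        }
    }

    // Hypothetical helper: unpack the stored binary field of one document
    // into an array of {startPosition, endPosition} segment pairs.
    static int[][] decodeSegments(IndexReader reader, int doc) throws IOException {
        return new int[0][]; // placeholder for the real decoder
    }

    static List<Hit> enclosedSpans(SpanQuery query, IndexReader reader)
            throws IOException {
        List<Hit> hits = new ArrayList<Hit>();
        Spans spans = query.getSpans(reader); // Lucene 3.x-era Spans API
        while (spans.next()) {
            for (int[] seg : decodeSegments(reader, spans.doc())) {
                // keep the span only if one story segment fully encloses it
                if (spans.start() >= seg[0] && spans.end() <= seg[1]) {
                    hits.add(new Hit(spans.doc(), spans.start(), spans.end()));
                    break;
                }
            }
        }
        return hits;
    }
}
```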
Custom query type 
Time-enclosed query 
Custom query type 
Multi-term regular expression (1) 
• “here is _ _ _ with the (news|story|details|report)”
• Apply RegEx to a phrase or sentence
– Not just individual words
• Lucene core has regular expression query support
– Good starting point
– Not a complete solution for us
Custom query type 
Multi-term regular expression (2) 
• Problems
– Some analyzers do not work with RegEx
– Lucene’s RegEx query classes only apply RegEx to individual terms
• Want to match a pattern against a phrase/sentence
• Want placeholders for whole words (not just characters)
– Term(fieldName, “.*”) matches all terms, all documents, and all positions in the index
• Very slow
• Takes lots of memory
Custom query type 
Multi-term regular expression (3) 
• What we did (sketch below)
– Parse and translate multi-term RegEx into Lucene built-in queries (SpanNearQuery, RegexQuery)
• E.g. “here is _ _ _ with the” = “here is” followed by “with the” (with exactly 3 terms in between)
– Leading and trailing placeholders
• E.g. “_ _ is the _ _ _”
• Preserve for correctness
• Store word count for each document
• Expand each span on both sides
• Bounds checking
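A sketch of the translation for the running example, under the stated assumptions: an exact phrase becomes an ordered SpanNearQuery with slop 0, and the three placeholders become slop between the two phrases. Note that SpanNearQuery’s slop is an upper bound, so matching exactly three words in between still requires the span-length bounds checking mentioned above; the alternation (news|story|details|report) could similarly be mapped to a SpanOrQuery over terms or a contrib RegexQuery clause.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class MultiTermPattern {
    // An exact phrase is an ordered SpanNearQuery with slop 0.
    static SpanQuery phrase(String field, String... words) {
        SpanQuery[] terms = new SpanQuery[words.length];
        for (int i = 0; i < words.length; i++) {
            terms[i] = new SpanTermQuery(new Term(field, words[i]));
        }
        return new SpanNearQuery(terms, 0, true);
    }

    // "here is _ _ _ with the": "here is" followed by "with the".
    // slop = 3 allows *up to* 3 positions between the phrases; enforcing
    // exactly 3 needs an extra check that each matched span covers exactly
    // 7 positions (the bounds checking mentioned on the slide).
    static SpanQuery hereIsBlankBlankBlankWithThe(String field) {
        return new SpanNearQuery(new SpanQuery[] {
            phrase(field, "here", "is"),
            phrase(field, "with", "the")
        }, 3, true);
    }
}
```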
Custom query type 
Multi-term regular expression (4) 
• Regular expression libraries differ in 
– Syntax (e.g. Perl 5-compatible) 
– Capabilities (e.g. back-references) 
– Speed 
• Memory usage 
– Proportional to number of terms matched 
– Increasing available memory might help 
Custom result format 
Occurrence count 
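This slide presents the counting backend behind the table and chart formats; per the speaker notes, the counts come from walking the spans a query matches and bucketing them by group. A minimal sketch, assuming a stored grouping field such as “date” (in practice a field cache would avoid loading a stored document per span):

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.Spans;

public class OccurrenceCounter {
    /** Per-group totals: how many matches, and in how many distinct programs. */
    static class Counts {
        final Map<String, Integer> occurrences = new HashMap<String, Integer>();
        final Map<String, Set<Integer>> documents = new HashMap<String, Set<Integer>>();
    }

    static Counts count(SpanQuery query, IndexReader reader, String groupField)
            throws IOException {
        Counts c = new Counts();
        Spans spans = query.getSpans(reader);
        while (spans.next()) {
            // group key, e.g. the broadcast date stored with the document;
            // groups could also be week, month, network or show
            String group = reader.document(spans.doc()).get(groupField);
            Integer n = c.occurrences.get(group);
            c.occurrences.put(group, n == null ? 1 : n + 1);
            Set<Integer> docs = c.documents.get(group);
            if (docs == null) {
                docs = new HashSet<Integer>();
                c.documents.put(group, docs);
            }
            docs.add(spans.doc());
        }
        return c;
    }
}
```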
Future work 
Job queue (1) 
• Research front moving towards analysis of the whole database
– Want full search result set 
– Queries are intensive and take a long time 
• Solution will be beyond increasing timeout 
– Users might close their browsers 
– We might restart the search back-end 
Future work 
Job queue (2) 
• Features 
– Query runs in background 
– Notification when finished/failed 
– Restart queries with recoverable errors 
– Check and cancel jobs 
– Downloadable result 
– Schedule recurring queries 
– Manage job priority and quota 
Future work 
Multiple sources and languages (1) 
• Multilingual news programs 
– E.g. some have English + Spanish CC 
• Multiple text and timestamp sources 
– E.g. CNN transcript available from website 
– Applying speech-to-text to videos 
– Manual correction of text and timestamps 
• Multiple markets 
– E.g. Capture TV programs in Denmark and Norway 
Future work 
Multiple sources and languages (2) 
• Need language detection 
– Libraries exist 
• Search for specific channel 
– Search by language more useful 
– But no fixed channel -> language mapping 
• What will proximity search and occurrence counting mean when there are multiple channels/languages?
Future work 
Metadata 
• Types of metadata 
– Segment boundary, type and topic 
– Headline and description (from transcripts) 
– Website links 
– Syntactic tags (e.g. part of speech) 
– Generated annotation (e.g. object identification) 
– User annotation (e.g. scene description) 
– Screen text 
• Eventually: want them to be searchable 
Thank you for coming! 
• Any questions? 
• My e-mail: kai@ssc.ucla.edu 
• Slides available: http://ucla.in/IDJq2u 

Editor's Notes

  1. Good afternoon. I am Kai Chan from UCLA Social Sciences Computing. I am the lead programmer of the UCLA Communication Studies Archive. I am happy to be here with you and talk about this project, its background and setup, how it is being used, and how we have implemented certain things. I hope you will find this presentation interesting and helpful. The slides will be available for download after this presentation, so don’t worry about writing down all the details from the slides. If things are confusing or if I don’t speak loud enough, please let me know. Otherwise, I will be saving questions until the end.
  2. The project started as a continuation of analog recording of television news. Back in the Watergate years, a professor in what is now the Communication Studies department started recording television news. Over the years, thousands of tapes have been recorded. Obviously, it is not very convenient to look for a particular news program or topic in the collection.
  3. So, in 2005, two professors at the department started recording television news digitally, with computers and TV capture cards. This new system has several advantages. It stores videos that can be streamed over the Web. It records closed captioning, which is commonly called “subtitle text” and comes with most TV programs in the U.S. The recorded text is indexed and searchable, thanks to Lucene. The system captures image snapshots. Finally, we have built a search engine, as well as some analysis tools, for the archive.
  4. Besides recording news programs from television, we also get them from other sources. For example, we record CNN television news programs, but their website has the transcripts of many of their shows. So for these shows, we now have these transcripts along with the closed captioning. There are also television news networks (such as Democracy Now and Russia Today) that broadcast on the Web, and we download their shows. The digital archive grows at about 100 news programs and 600 thousand words each day.
  5. Right now, it contains television recordings from 28 networks and 16 hundred shows. It has 130 thousand hours, 160 thousand news programs, 50 million still images and 880 million words.
  6. Why is this project important? For researchers, the archive is a large and unique collection of human communication. Unlike many other collections, this one captures many communication modalities, such as speech, facial expression, body gesture, and so on. News content has different conditions and settings. Some are staged and scripted, and some (such as interviewing eyewitnesses) are not. There are single-speaker scenes, conversations, debates and so on. The news programs are also from different networks and communities. The archive allows the study of TV news like how people study written news with newspaper archives. It allows the study of TV news in particular (and communication in general) in ways that were impossible before. For example, to use our analog collection as well as other tape-based video collections, people often need to know which tapes they want, and there is often a non-trivial cost to check out particular tapes. It involves finding, copying and mailing each tape. Taking a quick look at (or extracting some details from) a large number of news programs is hard, if possible at all. In contrast, our digital collection is searchable and accessible from the Web, and it allows people to do what I just described easily. Researchers all over the world can access the archive’s material instantly and simultaneously.
  7. For non-researchers, myself included, why should they care? Some of them might say that there is little they can learn and conclude from TV news other than “they are all biased”. I disagree. TV news is about presentation and persuasion. Those who create news content have their own angles and opinions about the news story, and news programs are a medium in which they persuade us to think the same. However, presentation and persuasion happen in our daily lives too. The difference is that, unlike those that happen in our daily lives, those in TV news can be readily recorded and systematically studied, and they tell us something about communication in general too. Also, even in recent years, when the Internet and social networks have become more popular than ever, TV remains the main source of news for many or most people. What they see in TV news greatly affects the public’s opinion and decisions, in public policy and other areas. If what we watch is so important, we really should learn about (and be conscious about) what we watch. Now is a good time to mention that the archive’s content is copyright restricted and is to be used for academic research and teaching. However, for the researchers and students fortunate enough to have access, the archive will hopefully empower them and change their lives in some positive ways.
  8. Here is what the search engine’s interface looks like. You may notice that it is quite different from a regular document search engine. There are three display formats: list, table and chart. I will show you each of them, and later, also talk about the implementation side. Users can enter multiple words, phrases and regular expression patterns. And there are several criteria, such as: with all the words, with at least one of the words, without the words, and with all the words within a certain distance. Notice that in this screenshot, the distance says “the same segment”. I will talk about story segments later, but the distance can be in words, segments or seconds. For the list format, the user can select how many search results per page. In addition, the search results can be filtered by date, network, and show/series. The search result can be sorted by score or date.
  9. Here is how the search result looks in the list format. It looks closest to a typical document search, but we have gone beyond just showing which news programs matched. Since we have the closed captioning and images, the search result page shows where the matches are inside the news programs. The positions where the searched words, phrases or patterns match are highlighted, with surrounding text shown for context. At each matched position, there is a permalink for referring to that position from external bookmarks and reports, and there is an image snapshot of the video at that position. There is a video player on the same page. If you click on an image snapshot, it starts the video streaming and jumps to that position in the video.
  10. Here are some larger screenshots of the search result in the list format. Hopefully they show better how the context, the highlighting, and the image snapshots work. As you can see, highlighting works for words, phrases as well as regular expression patterns.
  11. Table and chart display formats are very useful for showing how the use of certain words, phrases or language patterns vary by time, network and show, as well as showing how a word is used together with (or instead of) another word. For these two display formats, the search results are not listed by news programs but are put into groups, for example, by day, week, month, year, network or show/series. Again, they can be filtered by date, network, and show/series.
  12. Here is how the search result looks in the table format. This sample query searches document and occurrence counts of two words (crisis and bailout) from September 11th to October 1st, 2008, which was when Lehman Brothers’ bankruptcy sparked a financial crisis. There is a column for each of the two words, a column to show the frequency of either word occurring, plus a column for the total number of news programs in that group. The rows are days in the date range. The cells show the number of news programs and occurrences, of the word indicated by the column heading, on the day indicated by the row heading.
  13. Here is the same search result in the chart format. The valleys represent weekends when we record fewer shows, but you can still see the general trend. Right after the Lehman Brothers bankruptcy, there was a sharp jump in mentions of the word “crisis”, but the word “bailout” was mentioned less until later in the month. By the end of the month, “bailout” greatly surpassed “crisis”.
  14. Here is the chart result for a different query: the number of times two presidential candidates’ names, Romney and Santorum, are mentioned in the course of their campaigns.
  15. The archive is being used in research in several fields. Communication Studies researchers analyze the amount of coverage of events over time. Examples are the second Iraq War, the 2008 South Ossetia war and the 2011 Norway attacks. Linguistics researchers use the archive to study speech and language patterns. Computer Science research topics include identifying objects and people in the news, as well as story segmentation. In fact, researchers from these three fields are working together to study “Visual and Verbal Interaction and Persuasion” with the archive, with the support of a grant from the National Science Foundation.
  16. The archive has been used in teaching several classes. An example is a Chicano Studies class on representation of Latinos on the television news. Its focus was on studying the May 1, 2007 immigration march, with an emphasis on the use of police force at MacArthur Park in Los Angeles, California. Students studied two days of news programs, in which they were to identify framing, stereotyping, metaphor and silencing. The archive allowed them to better show what they learned. For example, instead of just citing KCBS news at 11PM on May 1, 2007, they could point to 2 minutes 12 seconds into the program by including a permalink and a screenshot in their reports.
  17. Other examples are two Communication Studies classes on presidential communication. In one class, students studied television coverage of the 2008 presidential primary election through 6 weeks of selected news programs up to the primary election. Students analyzed coverage of sound bites from candidates. They collected the amount of time given to each candidate and party, and the types of responses given by the news-anchors or portrayed by the news programs. In another class, students created their own political TV advertisements, similar to the ads we see on TV about public officials, from the archive’s news footage.
  18. We have four video capture and conversion machines in production, divided into two groups. Each group covers half of the recording schedule and has two machines recording the same news programs at the same time for redundancy. The better recording out of two is kept. There are 6 TV tuners per machine. Video is captured in MPEG-TS format at about 7 GB/hr. Closed captioning text is recorded with a timestamp every 2 to 3 seconds. Afterwards, the system generates image snapshots from the videos, and converts the videos to MP4 H.264 format, in VGA resolution at about 240 MB/hr.
  19. The control server downloads electronic TV schedules as well as web-streamed news programs. It collects and checks recordings from the capture/conversion machines. It stores a copy of the files and also pushes them to the backup storage server, the image server, the video-streaming server and the search server.
  20. The search server updates the Lucene index daily. The indexing process stores and tokenizes the main closed captioning text field. It creates separate fields for date, network, show and so on. It also creates binary fields for segment and time data that I will talk about later. This server hosts the archive’s search engine.
  21. The search result page contains videos and images that come from separate servers. But the main result comes from the search server. It has a front end with a Web server and custom PHP code, which formats the search form and the search result. In the back end, there is Lucene, a Lucene index, and a MySQL database. There is custom Java code that extends what Lucene provides into what we want. It allows our search engine to handle query types and result formats beyond the Lucene built-in ones. A bridge allows the front and back ends to talk to each other. We are using PHP-Java Bridge in production but are replacing it with Solr. Our use of Solr will be different from many others’, as we will have a custom Solr request handler to handle queries. However, we can still make use of Solr’s thread management, caching, serialization and so on.
  22. Lucene is great because it gives us the end result, plus low-level data and building blocks, from which we have built custom solutions. One of them is segment-enclosed query. Let’s consider the problem of: how to search for “X near Z”. Lucene can search for “X within Y words of Z”, but how do we pick Y to solve our problem? No fixed number works everywhere.
  23. A related problem is that, even if a news program matches several words, these words might not all be used in the program to talk about the same story. For example, if you search for all these words: “Obama”, “visit” and “Afghanistan”, you might expect to find something about President Obama’s visit to Afghanistan. However, the query would also match a news program that talks about Obama’s visit to, say, Canada as well as violence in Afghanistan. This news program is probably not what you have in mind.
  24. A characteristic common to many of the news programs is that each often covers many stories. For example, an hour-long evening news program might start with a few local stories, then national and world stories, and then weather, sports and so on. If you draw a timeline for a news program, these stories show up as segments in the timeline.
  25. Let me show you what story segments look like. The left side shows a simplified view of the evening-news I just described. The right side shows actual story segments from an hour of CNN Newsroom. In this particular program, there are 26 story segments plus 6 commercial block segments.
  26. A solution, or an approximation, for finding “X near Z” is that we consider X and Z to be near each other if they are within the same story segment in the news program. This is possible with Lucene if we give it information about the story segments. A bonus from this solution is that it can also enable searching and filtering for a particular story type. For example, a Political Science researcher might be more interested in news programs (or sections of news programs) about politics, and less interested in the other news programs (or sections of them).
  27. Computer Science researchers have been working on automated ways to mark segments using, for example, word frequency, scene change, black frame and silence. On the other hand, some Communication Studies researchers are doing manual segmentation on some of the news programs. They watch a video, decide where a story starts and ends, and mark the positions plus the story’s type and topic in a semi-automated system.
  28. (A span is a section of a document that a query matches.) The big picture of segment-enclosed query is that a span it matches should be fully enclosed by a segment. In this diagram, span 1 is enclosed by segment 1. None of the other 4 spans are enclosed by any of the 3 segments.
  29. We implemented segment-enclosed query by getting spans from SpanNearQuery, which comes with Lucene, and then filtering and keeping those that are located fully within any segments. Right now, in production, segment information is stored as a list of start-position end-position pairs. This approach is relatively simple to implement. We just put the list in a stored field of the document. Searching is reasonably fast. An alternative we are developing is to store each segment starting or ending point as terms. This way, it is possible to search for a particular story topic. This approach also appears to run much faster.
  30. Time-enclosed query is similar to segment-enclosed query. The idea is that based on the timestamps and the position-time mapping we have, we can calculate the maximum length of each span in terms of seconds. Then, you can search for, let’s say, X within 15 seconds of Z. The search engine gets spans from SpanNearQuery (in Lucene), and then filters and keeps those we know are no longer than 15 seconds each. The implementation is similar to segment-enclosed query, except we have found that just storing the timestamp-position information as an array works better. Storing it as terms takes much more disk space, and unlike segments, there is less need to search for a particular timestamp.
  31. Multi-term regular expression is another type of custom query we have implemented recently. As I mentioned before, linguists want to find word patterns, for example, “here is _ _ _ with the (news|story|details|report)”. They want to use regular expression to look for a phrase or a sentence, not just individual words. In fact, they want to run the UNIX grep command to match a pattern against the whole text collection. Of course, that is very slow. We think we can make many of their use cases run much faster with Lucene. The good news is: Lucene has some built-in regular expression query support. It is a good starting point, but not a complete solution for us.
  32. Some analyzers do not work with RegEx and cannot give you a RegEx pattern as a term. But there is a bigger problem: Lucene’s RegEx query classes only match a pattern with individual terms, and a pattern only describes what one term should look like. However, that is not all our researchers want. They also want to mix words with placeholders for a specific number of arbitrary words. For example, we know that in RegEx, “…” means three consecutive characters of any kind, such as “the”. But our researchers also want something that matches three words of any kind, such as “with the details”. Another implementation issue is: you cannot just tell Lucene to find all documents and all positions where “any words” occur. It gets very slow and takes lots of memory.
  33. What we did was to parse and translate a multi-word RegEx pattern into a proximity query that Lucene understands. For example, the pattern “here is _ _ _ with the” is equivalent to the phrase “here is” followed by the phrase “with the”, with exactly 3 words in between, and Lucene understands the latter. Leading and trailing placeholders are an issue. An example is the pattern “_ _ is the _ _ _”. We cannot just discard these placeholders at the beginning and the end, and there are no equivalent Lucene built-in queries. Instead, we store the number of words in each document, expand each span on both sides (taking the placeholders into account), and make sure the span does not extend outside the document’s range.
  34. Lucene uses Java’s built-in regular expression library but also lets you pick your own. Several of them are available, and they differ in syntax, capabilities, and speed. For example, some have Perl-compatible RegEx syntaxes, some let you have back-references in your patterns, and some are much faster than others. For applications that use regular expressions a lot, it is worthwhile to experiment instead of just using the built-in one. Another concern is memory usage. The memory required to run a RegEx query is proportional to the number of terms matched. You might have no problem with very specific patterns but run into OutOfMemoryError for very broad ones, such as matching any words that end in “-ed”. For us, luckily, giving Lucene more memory solves the problem for now. If another system has many more terms than we do, they might need to make the memory requirements scale better.
  35. You have seen the table and chart display formats before. Here is how they work under the hood. There are sub-queries (generally words) to be counted and there are groups (which are days in this case). Then we have a table like this. Each cell contains the number of news programs and places, where the sub-query matches and the news program belongs to the specific group. For example, to get the counts for the highlighted cell, we get the query that corresponds to the word “meltdown”, get the spans it matches in the index, go through each span, and count those that are in news programs on September 15th, 2008.
  36. Let me talk about some of the unfinished challenges. First of all, our researchers are starting to work on analysis of the whole database. Instead of just seeing the first 10 or 100 results, they might want to see the whole search result set or export it into another program. Queries like this are intensive and take a long time. Just increasing the website’s timeout doesn’t solve other problems such as, what if the user closes the browser or if we restart the search back-end.
  37. Eventually we need a job queue system that can run queries in the background and let users come back later and retrieve results. The system can notify users if their queries have finished or failed. They can restart queries with recoverable errors, check and cancel their query jobs, and download the results. The system would allow users to define queries that are scheduled to run repeatedly. For example, they can monitor trends by running queries every day to look for particular words and phrases in the news programs added on that day. The system would also make it possible to have job priorities and quotas in place.
  38. Some news programs are bilingual. For example, some KNBC news programs have English and Spanish closed captioning. There is more than one way to get text and timestamps. For example, CNN’s website has transcripts for their programs. We can get another copy of the text from the video using speech-to-text. We can manually correct text and timestamps. We are also starting to get news programs in other countries, such as Denmark and Norway.
  39. As we add materials in English and other languages, we need a way to detect and label which language a piece of text is in. Fortunately, libraries to do that already exist. We readily know what channel or source a piece of text is from. Searching for a specific channel is easier to implement, but it might be less useful than saying “show me only the Spanish content”. However, there is no fixed channel-to-language mapping. For example, there are four closed captioning channels, and usually the first is for English and the third is for Spanish, but not always. Also, when there are multiple channels or languages, what proximity search and occurrence counting should return might also change.
  40. There are several types of metadata for the news programs, and the variety will only grow. For example, we have talked about story segments before. In addition, downloaded transcripts have headlines, descriptions and website links. Language researchers might add syntactic tags, such as noting the part of speech of each word. Other researchers can add annotations generated from their research. For example, when they identify an object or a speaker in a video, they can add annotations to our stored closed captioning. Other users can add arbitrary annotations, such as a description of a particular scene. There is also screen text, which is the “scrolled text” that appears in many of the CNN and Fox news programs. Right now we have a page to display a news program, its text and all metadata associated with it. It is enough for now, but eventually, we want most of the metadata to be searchable too.
  41. That’s all the slides I have. If you have any questions about the presentation or the project, feel free to let me know, and I am happy to answer them afterwards. Or, you can e-mail me at this address. The slides will be available at this URL. Thank you for coming.