Television News Search and Analysis with Lucene/Solr
Kai Chan <kai@ssc.ucla.edu> 
Social Sciences Computing 
University of California, Los Angeles 
Lucene Revolution, May 10, 2012
Communication Studies Archive 
Background (1) 
• Continuation of analog recording of TV news 
– Thousands of tapes since Watergate/1970s 
– Hard to look for a particular news program or topic
Communication Studies Archive 
Background (2) 
• Digital recording since 2005 
• Capture news programs on computers 
– Video: can be streamed over the Web 
– Closed captioning (“subtitle text”): indexed and searchable
– Image snapshots 
– Search engine and analysis tools 
Communication Studies Archive 
Background (3) 
• Also download transcripts and web-streamed news programs
• 100 news programs and 600,000 words added each day
Communication Studies Archive 
Background (4) 
• January 2005 to present 
– 28 networks 
– 1,600 shows 
– 130,000 hours 
– 160,000 news programs 
– 50,000,000 images 
– 880,000,000 words 
Why This is Important (1) 
• Researchers 
– Large and unique collection of communication 
– Many modalities 
• Speech, facial expression, body gesture, etc. 
– Different conditions/settings 
– Different networks and communities 
– Allows study of TV news + communication in general in ways impossible before
Why This is Important (2) 
• Non-researchers 
– TV news about presentation and persuasion 
• Which happen in daily life also 
– TV main source of news for many/most 
– Greatly affects the public’s decisions 
– Learn about what we watch 
[Image-only slides: screenshots of the search interface and of search results in the list, table and chart formats]
Application in Research 
• Communication Studies 
– Amount of coverage for events over time 
• Linguistics
– Speech and language patterns 
• Computer Science 
– Object identification 
– Identify news anchors, public figures 
– Story segmentation 
Application in Teaching (1) 
• Chicano Studies: Representations of Latinos on the Television News
– May 1, 2007 immigration march 
– MacArthur Park, Los Angeles, CA 
– 2 days (May 1 & 2, 2007) 
– Framing, stereotyping, metaphor, silencing 
– Reports with screenshots and links to news stories
Application in Teaching (2) 
• Communication Studies: Presidential Communication
– 2008 presidential primary 
– 6 weeks (Dec 2007 to Feb 2008) 
– Coverage of sound bites 
• Amount of time given to candidate/party 
• Types of response (positive, neutral, negative) 
– Students created their own political ad. 
Workflow (1)
Capture/conversion machines 
• 2 groups, 2 machines per group 
– Keep the best recording 
– 6 TV tuners per machine 
• Capture video and CC to separate files in real-time
– MPEG-TS (~7 GB/hr) 
– Timestamp every 2-3 seconds 
• Generate image snapshots 
• Convert videos 
– MP4/H.264 (VGA, ~240 MB/hr) 
Workflow (2)
Storage/static file servers 
• Control server 
– Download TV schedules 
– Download web-streamed news programs
– Collect and check recordings 
– Push files to the other servers
• Video streaming server 
• Backup storage server 
• Image server 
Workflow (3)
Search server 
• Lucene index updated daily (indexing sketch below)
– Main text field tokenized
– Separate fields for date, network, show, etc.
– Binary fields for segment and time data
• Hosts search engine
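For a concrete picture, here is a minimal sketch of what the daily indexing step might look like with the Lucene 3.x API that was current at the time of this talk. The field names and the pre-packed byte[] arguments are hypothetical; the slide only specifies a tokenized main text field, exact-match fields such as date, network and show, and binary fields for segment and time data.

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class NewsProgramIndexer {
    // Field names and the pre-packed byte[] arguments are illustrative;
    // the slide only says which fields are tokenized, exact-match, or binary.
    static void addProgram(IndexWriter writer, String ccText, String network,
                           String show, String date,
                           byte[] segmentData, byte[] timeData) throws IOException {
        Document doc = new Document();
        // main closed-captioning text: tokenized, stored for context display
        doc.add(new Field("text", ccText, Field.Store.YES, Field.Index.ANALYZED));
        // exact-match fields for filtering and grouping
        doc.add(new Field("network", network, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("show", show, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("date", date, Field.Store.YES, Field.Index.NOT_ANALYZED));
        // binary stored fields: packed segment boundaries and timestamp data,
        // decoded at search time by the custom query code
        doc.add(new Field("segments", segmentData));
        doc.add(new Field("times", timeData));
        writer.addDocument(doc);
    }
}
```

Keeping the segment and timestamp data as opaque binary keeps the index compact; the segment-enclosed and time-enclosed queries described later unpack it per document.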
The search process 
Custom query type 
Segment-enclosed query (1) 
• Problem 1: search for “X near Z”
• Lucene: search for “X within Y words of Z” (sketch below)
– How to pick Y?
– Hard to pick a fixed number
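“X within Y words of Z” is what Lucene’s SpanNearQuery provides out of the box. A minimal sketch (field and term values are illustrative):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class ProximityQueryExample {
    // "X within Y words of Z": slop = y, inOrder = false means the two
    // terms may appear in either order, at most y positions apart.
    static SpanQuery withinYWords(String field, String x, String z, int y) {
        SpanQuery qx = new SpanTermQuery(new Term(field, x));
        SpanQuery qz = new SpanTermQuery(new Term(field, z));
        return new SpanNearQuery(new SpanQuery[] { qx, qz }, y, false);
    }
}
```

The slop parameter is exactly the fixed Y the slide says is hard to pick, which is what motivates the segment-enclosed query that follows.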
Custom query type 
Segment-enclosed query (2) 
• Problem 2: the matched search words might not all be about the same story
– E.g. “Obama AND visit AND Afghanistan”
– Might match a news program about Obama’s visit to Canada + violence in Afghanistan
Custom query type 
Segment-enclosed query (3) 
• A news program can contain several stories 
– E.g. Local, national, world, weather, sports 
Custom query type 
Segment-enclosed query (4) 
Custom query type 
Segment-enclosed query (5) 
• One solution: search for “X and Z within same story segment”
– Possible with Lucene + story segment info
• Bonus: enables searching/filtering for a particular story type
– E.g. Politics
Custom query type 
Segment-enclosed query (6) 
• How to mark segments 
– Automated 
• Computer Science researchers working on them 
• Word frequency 
• Scene change 
• Black frame and silence 
– Manual segmentation 
• Watch the video 
• Decide where a story starts and ends 
• Mark positions in semi-automated system 
Custom query type 
Segment-enclosed query (7) 
Custom query type 
Segment-enclosed query (8) 
• Idea (sketch below)
– Get spans from SpanNearQuery
– Filter and keep those fully within segments
• In production: segment info in stored fields
– As a list of <start position, end position>
– Simple to implement
– Reasonably fast searching
• Alternative: store segment info as terms
– Possible to find segments by themselves
– Appears to run much faster
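A minimal sketch of the idea on this slide, using the Lucene 3.x-era Spans API: enumerate the spans a SpanQuery matches, then keep only those that fall entirely inside one story segment. The decodeSegments helper is hypothetical; in production it would unpack the <start position, end position> list from the stored binary field.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.Spans;

public class SegmentEnclosedQuery {
    /** One surviving match: document id plus span start/end positions. */
    static class Hit {
        final int doc, start, end;
        Hit(int doc, int start, int end) {
            this.doc = doc; this.start = start; this.end = end;
        }
    }

    // Hypothetical helper: unpack the stored binary field of one document
    // into an array of {startPosition, endPosition} segment pairs.
    static int[][] decodeSegments(IndexReader reader, int doc) throws IOException {
        return new int[0][]; // placeholder for the real decoder
    }

    static List<Hit> enclosedSpans(SpanQuery query, IndexReader reader)
            throws IOException {
        List<Hit> hits = new ArrayList<Hit>();
        Spans spans = query.getSpans(reader); // Lucene 3.x-era Spans API
        while (spans.next()) {
            for (int[] seg : decodeSegments(reader, spans.doc())) {
                // keep the span only if one story segment fully encloses it
                if (spans.start() >= seg[0] && spans.end() <= seg[1]) {
                    hits.add(new Hit(spans.doc(), spans.start(), spans.end()));
                    break;
                }
            }
        }
        return hits;
    }
}
```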
Custom query type 
Time-enclosed query 
Custom query type 
Multi-term regular expression (1) 
• “here is _ _ _ with the (news|story|details|report)”
• Apply RegEx to a phrase or sentence
– Not just individual words
• Lucene core has regular expression query support
– Good starting point
– Not a complete solution for us
Custom query type 
Multi-term regular expression (2) 
• Problems
– Some analyzers do not work with RegEx
– Lucene’s RegEx query classes only apply RegEx to individual terms
• Want to match a pattern against a phrase/sentence
• Want placeholders for whole words (not just characters)
– Term(fieldName, “.*”) matches all terms, all documents, and all positions in the index
• Very slow
• Takes lots of memory
Custom query type 
Multi-term regular expression (3) 
• What we did (sketch below)
– Parse and translate multi-term RegEx into Lucene built-in queries (SpanNearQuery, RegexQuery)
• E.g. “here is _ _ _ with the” = “here is” followed by “with the” (with exactly 3 terms in between)
– Leading and trailing placeholders
• E.g. “_ _ is the _ _ _”
• Preserve for correctness
• Store word count for each document
• Expand each span on both sides
• Bounds checking
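A sketch of the translation for the running example, under the stated assumptions: an exact phrase becomes an ordered SpanNearQuery with slop 0, and the three placeholders become slop between the two phrases. Note that SpanNearQuery’s slop is an upper bound, so matching exactly three words in between still requires the span-length bounds checking mentioned above; the alternation (news|story|details|report) could similarly be mapped to a SpanOrQuery over terms or a contrib RegexQuery clause.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class MultiTermPattern {
    // An exact phrase is an ordered SpanNearQuery with slop 0.
    static SpanQuery phrase(String field, String... words) {
        SpanQuery[] terms = new SpanQuery[words.length];
        for (int i = 0; i < words.length; i++) {
            terms[i] = new SpanTermQuery(new Term(field, words[i]));
        }
        return new SpanNearQuery(terms, 0, true);
    }

    // "here is _ _ _ with the": "here is" followed by "with the".
    // slop = 3 allows *up to* 3 positions between the phrases; enforcing
    // exactly 3 needs an extra check that each matched span covers exactly
    // 7 positions (the bounds checking mentioned on the slide).
    static SpanQuery hereIsBlankBlankBlankWithThe(String field) {
        return new SpanNearQuery(new SpanQuery[] {
            phrase(field, "here", "is"),
            phrase(field, "with", "the")
        }, 3, true);
    }
}
```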
Custom query type 
Multi-term regular expression (4) 
• Regular expression libraries differ in 
– Syntax (e.g. Perl 5-compatible) 
– Capabilities (e.g. back-references) 
– Speed 
• Memory usage 
– Proportional to number of terms matched 
– Increasing available memory might help 
Custom result format 
Occurrence count 
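This slide presents the counting backend behind the table and chart formats; per the speaker notes, the counts come from walking the spans a query matches and bucketing them by group. A minimal sketch, assuming a stored grouping field such as “date” (in practice a field cache would avoid loading a stored document per span):

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.Spans;

public class OccurrenceCounter {
    /** Per-group totals: how many matches, and in how many distinct programs. */
    static class Counts {
        final Map<String, Integer> occurrences = new HashMap<String, Integer>();
        final Map<String, Set<Integer>> documents = new HashMap<String, Set<Integer>>();
    }

    static Counts count(SpanQuery query, IndexReader reader, String groupField)
            throws IOException {
        Counts c = new Counts();
        Spans spans = query.getSpans(reader);
        while (spans.next()) {
            // group key, e.g. the broadcast date stored with the document;
            // groups could also be week, month, network or show
            String group = reader.document(spans.doc()).get(groupField);
            Integer n = c.occurrences.get(group);
            c.occurrences.put(group, n == null ? 1 : n + 1);
            Set<Integer> docs = c.documents.get(group);
            if (docs == null) {
                docs = new HashSet<Integer>();
                c.documents.put(group, docs);
            }
            docs.add(spans.doc());
        }
        return c;
    }
}
```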
Future work 
Job queue (1) 
• Research front moving towards analysis of the whole database
– Want full search result set 
– Queries are intensive and take a long time 
• Solution will be beyond increasing timeout 
– Users might close their browsers 
– We might restart the search back-end 
Future work 
Job queue (2) 
• Features 
– Query runs in background 
– Notification when finished/failed 
– Restart queries with recoverable errors 
– Check and cancel jobs 
– Downloadable result 
– Schedule recurring queries 
– Manage job priority and quota 
Future work 
Multiple sources and languages (1) 
• Multilingual news programs 
– E.g. some have English + Spanish CC 
• Multiple text and timestamp sources 
– E.g. CNN transcript available from website 
– Applying speech-to-text to videos 
– Manual correction of text and timestamps 
• Multiple markets 
– E.g. Capture TV programs in Denmark and Norway 
Future work 
Multiple sources and languages (2) 
• Need language detection 
– Libraries exist 
• Search for specific channel 
– Search by language more useful 
– But no fixed channel -> language mapping 
• What will proximity search and occurrence counting mean when there are multiple channels/languages?
Future work 
Metadata 
• Types of metadata 
– Segment boundary, type and topic 
– Headline and description (from transcripts) 
– Website links 
– Syntactic tags (e.g. part of speech) 
– Generated annotation (e.g. object identification) 
– User annotation (e.g. scene description) 
– Screen text 
• Eventually: want them to be searchable 
Thank you for coming! 
• Any questions? 
• My e-mail: kai@ssc.ucla.edu 
• Slides available: http://ucla.in/IDJq2u 

Editor's Notes

  1. Good afternoon. I am Kai Chan from UCLA Social Sciences Computing. I am the lead programmer of the UCLA Communication Studies Archive. I am happy to be here with you and talk about this project, its background and setup, how it is being used, and how we have implemented certain things. I hope you will find this presentation interesting and helpful. The slides will be available for download after this presentation, so don’t worry about writing down all the details from the slides. If things are confusing or if I don’t speak loud enough, please let me know. Otherwise, I will be saving questions until the end.
  2. The project started as a continuation of analog recording of television news. Back in the Watergate years, a professor in what is now the Communication Studies department started recording television news. Over the years, thousands of tapes have been recorded. Obviously, it is not very convenient to look for a particular news program or topic in the collection.
  3. So, in 2005, two professors at the department started recording television news digitally, with computers and TV capture cards. This new system has several advantages. It stores videos that can be streamed over the Web. It records closed captioning, which is commonly called “subtitle text” and comes with most TV programs in the U.S. The recorded text is indexed and searchable, thanks to Lucene. The system captures image snapshots. Finally, we have built a search engine, as well as some analysis tools, for the archive.
  4. Besides recording news programs from television, we also get them from other sources. For example, we record CNN television news programs, but their website has the transcripts of many of their shows. So for these shows, we now have these transcripts along with the closed captioning. There are also television news networks (such as Democracy Now and Russia Today) that broadcast on the Web, and we download their shows. The digital archive grows at about 100 news programs and 600 thousand words each day.
  5. Right now, it contains television recordings from 28 networks and 16 hundred shows. It has 130 thousand hours, 160 thousand news programs, 50 million still images and 880 million words.
  6. Why is this project important? For researchers, the archive is a large and unique collection of human communication. Unlike many other collections, this one captures many communication modalities, such as speech, facial expression, body gesture, and so on. News content has different conditions and settings. Some are staged and scripted, and some (such as interviewing eyewitnesses) are not. There are single-speaker scenes, conversations, debates and so on. The news programs are also from different networks and communities. The archive allows the study of TV news like how people study written news with newspaper archives. It allows the study of TV news in particular (and communication in general) in ways that were impossible before. For example, to use our analog collection as well as other tape-based video collections, people often need to know which tapes they want, and there is often a non-trivial cost to check out particular tapes. It involves finding, copying and mailing each tape. Taking a quick look at (or extracting some details from) a large number of news programs is hard, if possible at all. In contrast, our digital collection is searchable and accessible from the Web, and it allows people to do what I just described easily. Researchers all over the world can access the archive’s material instantly and simultaneously.
  7. For non-researchers, myself included, why should they care? Some of them might say that there is little they can learn and conclude from TV news other than “they are all biased”. I disagree. TV news is about presentation and persuasion. Those who create news content have their own angles and opinions about the news story, and news programs are a medium in which they persuade us to think the same. However, presentation and persuasion happen in our daily lives too. The difference is that, unlike those that happen in our daily lives, those in TV news can be readily recorded and systematically studied, and they tell us something about communication in general too. Also, even in recent years, when the Internet and social networks have become more popular than ever, TV remains the main source of news for many or most people. What they see in TV news greatly affects the public’s opinion and decisions, in public policy and other areas. If what we watch is so important, we really should learn about (and be conscious about) what we watch. Now is a good time to mention that the archive’s content is copyright restricted and is to be used for academic research and teaching. However, for the researchers and students fortunate enough to have access, the archive will hopefully empower them and change their lives in some positive ways.
  8. Here is what the search engine’s interface looks like. You may notice that it is quite different from a regular document search engine. There are three display formats: list, table and chart. I will show you each of them, and later, also talk about the implementation side. Users can enter multiple words, phrases and regular expression patterns. And there are several criteria, such as: with all the words, with at least one of the words, without the words, and with all the words within a certain distance. Notice that in this screenshot, the distance says “the same segment”. I will talk about story segments later, but the distance can be in words, segments or seconds. For the list format, the user can select how many search results per page. In addition, the search results can be filtered by date, network, and show/series. The search result can be sorted by score or date.
  9. Here is how the search result looks in the list format. It looks closest to a typical document search, but we have gone beyond just showing which news programs matched. Since we have the closed captioning and images, the search result page shows where the matches are inside the news programs. The positions where the searched words, phrases or patterns match are highlighted, with surrounding text shown for context. At each matched position, there is a permalink for referring to that position from external bookmarks and reports, and there is an image snapshot of the video at that position. There is a video player on the same page. If you click on an image snapshot, it starts the video streaming and jumps to that position in the video.
  10. Here are some larger screenshots of the search result in the list format. Hopefully they show better how the context, the highlighting, and the image snapshots work. As you can see, highlighting works for words, phrases as well as regular expression patterns.
  11. Table and chart display formats are very useful for showing how the use of certain words, phrases or language patterns vary by time, network and show, as well as showing how a word is used together with (or instead of) another word. For these two display formats, the search results are not listed by news programs but are put into groups, for example, by day, week, month, year, network or show/series. Again, they can be filtered by date, network, and show/series.
  12. Here is how the search result looks in the table format. This sample query searches document and occurrence counts of two words (crisis and bailout) from September 11th to October 1st, 2008, which was when Lehman Brothers’ bankruptcy sparked a financial crisis. There is a column for each of the two words, a column to show the frequency of either word occurring, plus a column for the total number of news programs in that group. The rows are days in the date range. The cells show the number of news programs and occurrences, of the word indicated by the column heading, on the day indicated by the row heading.
  13. Here is the same search result in the chart format. The valleys represent weekends when we record fewer shows, but you can still see the general trend. Right after the Lehman Brothers bankruptcy, there was a sharp jump in mentions of the word “crisis”, but the word “bailout” was mentioned less until later in the month. By the end of the month, “bailout” greatly surpassed “crisis”.
  14. Here is the chart result for a different query: the number of times two presidential candidates’ names, Romney and Santorum, are mentioned in the course of their campaigns.
  15. The archive is being used in research in several fields. Communication Studies researchers analyze the amount of coverage of events over time. Examples are the second Iraq War, the 2008 South Ossetia war and the 2011 Norway attacks. Linguistics researchers use the archive to study speech and language patterns. Computer Science research topics include identifying objects and people in the news, as well as story segmentation. In fact, researchers from these three fields are working together to study “Visual and Verbal Interaction and Persuasion” with the archive, with the support of a grant from the National Science Foundation.
  16. The archive has been used in teaching several classes. An example is a Chicano Studies class on representation of Latinos on the television news. Its focus was on studying the May 1, 2007 immigration march, with an emphasis on the use of police force at MacArthur Park in Los Angeles, California. Students studied two days of news programs, in which they were to identify framing, stereotyping, metaphor and silencing. The archive allowed them to better show what they learned. For example, instead of just citing KCBS news at 11PM on May 1, 2007, they could point to 2 minutes 12 seconds into the program by including a permalink and a screenshot in their reports.
  17. Other examples are two Communication Studies classes on presidential communication. In one class, students studied television coverage of the 2008 presidential primary election through 6 weeks of selected news programs up to the primary election. Students analyzed coverage of sound bites from candidates. They collected the amount of time given to each candidate and party, and the types of responses given by the news-anchors or portrayed by the news programs. In another class, students created their own political TV advertisements, similar to the ads we see on TV about public officials, from the archive’s news footage.
  18. We have four video capture and conversion machines in production, divided into two groups. Each group covers half of the recording schedule and has two machines recording the same news programs at the same time for redundancy. The better recording out of two is kept. There are 6 TV tuners per machine. Video is captured in MPEG-TS format at about 7 GB/hr. Closed captioning text is recorded with a timestamp every 2 to 3 seconds. Afterwards, the system generates image snapshots from the videos, and converts the videos to MP4 H.264 format, in VGA resolution at about 240 MB/hr.
  19. The control server downloads electronic TV schedules as well as web-streamed news programs. It collects and checks recordings from the capture/conversion machines. It stores a copy of the files and also pushes them to the backup storage server, the image server, the video-streaming server and the search server.
  20. The search server updates the Lucene index daily. The indexing process stores and tokenizes the main closed captioning text field. It creates separate fields for date, network, show and so on. It also creates binary fields for segment and time data that I will talk about later. This server hosts the archive’s search engine.
  21. The search result page contains videos and images that come from separate servers. But the main result comes from the search server. It has a front end with a Web server and custom PHP code, which formats the search form and the search result. In the back end, there is Lucene, a Lucene index, and a MySQL database. There is custom Java code that extends what Lucene provides into what we want. It allows our search engine to handle query types and result formats beyond the Lucene built-in ones. A bridge allows the front and back ends to talk to each other. We are using PHP-Java Bridge in production but are replacing it with Solr. Our use of Solr will be different from many others’, as we will have a custom Solr request handler to handle queries. However, we can still make use of Solr’s thread management, caching, serialization and so on.
  22. Lucene is great because it gives us the end result, plus low-level data and building blocks, from which we have built custom solutions. One of them is segment-enclosed query. Let’s consider the problem of: how to search for “X near Z”. Lucene can search for “X within Y words of Z”, but how do we pick Y to solve our problem? No fixed number works everywhere.
  23. A related problem is that, even if a news program matches several words, these words might not all be used in the program to talk about the same story. For example, if you search for all these words: “Obama”, “visit” and “Afghanistan”, you might expect to find something about President Obama’s visit to Afghanistan. However, the query would also match a news program that talks about Obama’s visit to, say, Canada as well as violence in Afghanistan. This news program is probably not what you have in mind.
  24. A characteristic common to many of the news programs is that each often covers many stories. For example, an hour-long evening news program might start with a few local stories, then national and world stories, and then weather, sports and so on. If you draw a timeline for a news program, these stories show up as segments in the timeline.
  25. Let me show you what story segments look like. The left side shows a simplified view of the evening-news I just described. The right side shows actual story segments from an hour of CNN Newsroom. In this particular program, there are 26 story segments plus 6 commercial block segments.
  26. A solution, or an approximation, for finding “X near Z” is that we consider X and Z to be near each other if they are within the same story segment in the news program. This is possible with Lucene if we give it information about the story segments. A bonus from this solution is that it can also enable searching and filtering for a particular story type. For example, a Political Science researcher might be more interested in news programs (or sections of news programs) about politics, and less interested in the other news programs (or sections of them).
  27. Computer Science researchers have been working on automated ways to mark segments using, for example, word frequency, scene change, black frame and silence. On the other hand, some Communication Studies researchers are doing manual segmentation on some of the news programs. They watch a video, decide where a story starts and ends, and mark the positions plus the story’s type and topic in a semi-automated system.
  28. (A span is a section of a document that a query matches.) The big picture of segment-enclosed query is that a span it matches should be fully enclosed by a segment. In this diagram, span 1 is enclosed by segment 1. None of the other 4 spans are enclosed by any of the 3 segments.
  29. We implemented segment-enclosed query by getting spans from SpanNearQuery, which comes with Lucene, and then filtering and keeping those that are located fully within any segments. Right now, in production, segment information is stored as a list of start-position end-position pairs. This approach is relatively simple to implement. We just put the list in a stored field of the document. Searching is reasonably fast. An alternative we are developing is to store each segment starting or ending point as terms. This way, it is possible to search for a particular story topic. This approach also appears to run much faster.
  30. Time-enclosed query is similar to segment-enclosed query. The idea is that based on the timestamps and the position-time mapping we have, we can calculate the maximum length of each span in terms of seconds. Then, you can search for, let’s say, X within 15 seconds of Z. The search engine gets spans from SpanNearQuery (in Lucene), and then filters and keeps those we know are no longer than 15 seconds each. The implementation is similar to segment-enclosed query, except we have found that just storing the timestamp-position information as an array works better. Storing it as terms takes much more disk space, and unlike segments, there is less need to search for a particular timestamp.
  31. Multi-term regular expression is another type of custom query we have implemented recently. As I mentioned before, linguists want to find word patterns, for example, “here is _ _ _ with the (news|story|details|report)”. They want to use regular expression to look for a phrase or a sentence, not just individual words. In fact, they want to run the UNIX grep command to match a pattern against the whole text collection. Of course, that is very slow. We think we can make many of their use cases run much faster with Lucene. The good news is: Lucene has some built-in regular expression query support. It is a good starting point, but not a complete solution for us.
  32. Some analyzers do not work with RegEx and cannot give you a RegEx pattern as a term. But there is a bigger problem: Lucene’s RegEx query classes only match a pattern with individual terms, and a pattern only describes what one term should look like. However, that is not all our researchers want. They also want to mix words with placeholders for a specific number of arbitrary words. For example, we know that in RegEx, “…” means three consecutive characters of any kind, such as “the”. But our researchers also want something that matches three words of any kind, such as “with the details”. Another implementation issue is: you cannot just tell Lucene to find all documents and all positions where “any words” occur. It gets very slow and takes lots of memory.
  33. What we did was to parse and translate a multi-word RegEx pattern into a proximity query that Lucene understands. For example, the pattern “here is _ _ _ with the” is equivalent to the phrase “here is” followed by the phrase “with the”, with exactly 3 words in between, and Lucene understands the latter. Leading and trailing placeholders are an issue. An example is the pattern “_ _ is the _ _ _”. We cannot just discard these placeholders at the beginning and the end, and there are no equivalent Lucene built-in queries. Instead, we store the number of words in each document, expand each span on both sides (taking the placeholders into account), and make sure the span does not extend outside the document’s range.
  34. Lucene uses Java’s built-in regular expression library but also lets you pick your own. Several of them are available, and they differ in syntax, capabilities, and speed. For example, some have Perl-compatible RegEx syntaxes, some let you have back-references in your patterns, and some are much faster than others. For applications that use regular expressions a lot, it is worthwhile to experiment instead of just using the built-in one. Another concern is memory usage. The memory required to run a RegEx query is proportional to the number of terms matched. You might have no problem with very specific patterns but run into OutOfMemoryError for very broad ones, such as matching any words that end in “-ed”. For us, luckily, giving Lucene more memory solves the problem for now. If another system has many more terms than we do, they might need to make the memory requirements scale better.
  35. You have seen the table and chart display formats before. Here is how they work under the hood. There are sub-queries (generally words) to be counted and there are groups (which are days in this case). Then we have a table like this. Each cell contains the number of news programs and places, where the sub-query matches and the news program belongs to the specific group. For example, to get the counts for the highlighted cell, we get the query that corresponds to the word “meltdown”, get the spans it matches in the index, go through each span, and count those that are in news programs on September 15th, 2008.
  36. Let me talk about some of the unfinished challenges. First of all, our researchers are starting to work on analysis of the whole database. Instead of just seeing the first 10 or 100 results, they might want to see the whole search result set or export it into another program. Queries like this are intensive and take a long time. Just increasing the website’s timeout doesn’t solve other problems such as, what if the user closes the browser or if we restart the search back-end.
  37. Eventually we need a job queue system that can run queries in the background and let users come back later and retrieve results. The system can notify users if their queries have finished or failed. They can restart queries with recoverable errors, check and cancel their query jobs, and download the results. The system would allow users to define queries that are scheduled to run repeatedly. For example, they can monitor trends by running queries every day to look for particular words and phrases in the news programs added on that day. The system would also make it possible to have job priorities and quotas in place.
  38. Some news programs are bilingual. For example, some KNBC news programs have English and Spanish closed captioning. There is more than one way to get text and timestamps. For example, CNN’s website has transcripts for their programs. We can get another copy of the text from the video using speech-to-text. We can manually correct text and timestamps. We are also starting to get news programs in other countries, such as Denmark and Norway.
  39. As we add materials in English and other languages, we need a way to detect and label which language a piece of text is in. Fortunately, libraries to do that already exist. We readily know what channel or source a piece of text is from. Searching for a specific channel is easier to implement, but it might be less useful than saying “show me only the Spanish content”. However, there is no fixed channel-to-language mapping. For example, there are four closed captioning channels, and usually the first is for English and the third is for Spanish, but not always. Also, when there are multiple channels or languages, what proximity search and occurrence counting should return might also change.
  40. There are several types of metadata for the news programs, and the variety will only grow. For example, we have talked about story segments before. In addition, downloaded transcripts have headlines, descriptions and website links. Language researchers might add syntactic tags, such as noting the part of speech of each word. Other researchers can add annotations generated from their research. For example, when they identify an object or a speaker in a video, they can add annotations to our stored closed captioning. Other users can add arbitrary annotations, such as a description of a particular scene. There is also screen text, which is the “scrolled text” that appears in many of the CNN and Fox news programs. Right now we have a page to display a news program, its text and all metadata associated with it. It is enough for now, but eventually, we want most of the metadata to be searchable too.
  41. That’s all the slides I have. If you have any questions about the presentation or the project, feel free to let me know, and I am happy to answer them afterwards. Or, you can e-mail me at this address. The slides will be available at this URL. Thank you for coming.