Television News Search and Analysis with Lucene/Solr
 

Presented by Kai Chan | UCLA - See complete conference videos - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012

The UCLA Communication Studies Archive hosts a collection of over 100,000 hours of digital television news, updated daily. Its search engine provides closed captioning search and online streaming of videos. The search engine allows researchers and students in various fields to study television news, images and language usage in ways that were not possible before. In this presentation, we will show the setup of our Lucene/Solr-powered search engine, as well as how it is being used. We will discuss our work on custom result formats, such as linking search result text to the video at particular timestamps, counting occurrences of words, phrases or patterns, grouping results by fields such as month or show, and creating interactive charts. We will also discuss our work on extending Lucene’s proximity searches and creating custom query types, such as segment-enclosed queries (two or more words, phrases or patterns occurring within a story-based text segment), time-enclosed queries (two or more words, phrases or patterns occurring within a certain time), and multi-word regular expression queries. Future goals will also be discussed, such as supporting multiple languages, multiple sources (speech-to-text alongside closed-captioning text), searching user-contributed and generated metadata (programs that identify story segments, objects in video, etc.), and syntactic tags (such as parts of speech).

    Presentation Transcript

    • Television News Search and Analysis with Lucene/Solr
      Kai Chan <kai@ssc.ucla.edu>
      Social Sciences Computing
      University of California, Los Angeles
      Lucene Revolution, May 10, 2012
    • Communication Studies Archive: Background (1)
      • Continuation of analog recording of TV news
        – Thousands of tapes since Watergate/1970s
        – Hard to look for a particular news program or topic
    • Communication Studies Archive: Background (2)
      • Digital recording since 2005
      • Capture news programs on computers
        – Video: can be streamed over the Web
        – Closed captioning (“subtitle text”): indexed and searchable
        – Image snapshots
        – Search engine and analysis tools
    • Communication Studies Archive: Background (3)
      • Also download transcripts and web-streamed news programs
      • 100 news programs and 600,000 words added each day
    • Communication Studies Archive: Background (4)
      • January 2005 to present
        – 28 networks
        – 1,600 shows
        – 130,000 hours
        – 160,000 news programs
        – 50,000,000 images
        – 880,000,000 words
    • Why This is Important (1)
      • Researchers
        – Large and unique collection of communication
        – Many modalities
          • Speech, facial expression, body gesture, etc.
        – Different conditions/settings
        – Different networks and communities
        – Allows study of TV news + communication in general in ways impossible before
    • Why This is Important (2)
      • Non-researchers
        – TV news about presentation and persuasion
          • Which happen in daily life also
        – TV main source of news for many/most
        – Greatly affects the public’s decisions
        – Learn about what we watch
    • Application in Research
      • Communication Studies
        – Amount of coverage for events over time
      • Linguistics
        – Speech and language patterns
      • Computer Science
        – Object identification
        – Identify news anchors, public figures
        – Story segmentation
    • Application in Teaching (1)
      • Chicano Studies: Representations of Latinos on the Television News
        – May 1, 2007 immigration march
        – MacArthur Park, Los Angeles, CA
        – 2 days (May 1 & 2, 2007)
        – Framing, stereotyping, metaphor, silencing
        – Reports with screenshots and links to news stories
    • Application in Teaching (2)
      • Communication Studies: Presidential Communication
        – 2008 presidential primary
        – 6 weeks (Dec 2007 to Feb 2008)
        – Coverage of sound bites
          • Amount of time given to candidate/party
          • Types of response (positive, neutral, negative)
        – Students created their own political ad.
    • Work flow (1): Capture/conversion machines
      [Diagram: capture/conversion machines, storage/control server, backup storage server, image server, video streaming server, search server]
      • 2 groups, 2 machines per group
        – Keep the best recording
        – 6 TV tuners per machine
      • Capture video and CC to separate files in real-time
        – MPEG-TS (~7 GB/hr)
        – Timestamp every 2-3 seconds
      • Generate image snapshots
      • Convert videos
        – MP4/H.264 (VGA, ~240 MB/hr)
    • Work flow (2): Storage/static file servers
      [Diagram: capture/conversion machines, storage/control server, backup storage server, image server, video streaming server, search server]
      • Control server
        – Download TV schedules
        – Download web-streamed news programs
        – Collect and check recordings
        – Push files to their destinations
      • Video streaming server
      • Backup storage server
      • Image server
    • Work flow (3): Search server
      [Diagram: capture/conversion machines, storage/control server, backup storage server, image server, video streaming server, search server]
      • Lucene index updated daily
        – Main text field tokenized
        – Separate fields for date, network, show, etc.
        – Binary fields for segment and time data
      • Hosts search engine
    • The search process
      [Diagram: the user interacts with three servers]
      • Video server (watch videos): video files, video streaming server (Wowza)
      • Image server (retrieve thumbnails and montages): web server (Apache), thumbnails & montages
      • Search server (perform searches): web server, custom code (PHP) front end, PHP-Java Bridge or Solr bridge, custom code (Java) back end, Lucene, MySQL database, Lucene index
    • Custom query type: Segment-enclosed query (1)
      • Problem 1: search for “X near Z”
      • Lucene: search for “X within Y words of Z”
        – How to pick Y?
        – Hard to pick a fixed number
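To make the difficulty of choosing Y concrete, here is a minimal, library-free Python sketch (it is not Lucene code, and it only approximates what a proximity query does). The function name `within_y_words` and the sample sentence are made up for illustration; "within Y words" is assumed here to mean at most Y other tokens between the two words.

```python
def within_y_words(tokens, x, z, y):
    """Return True if some occurrence of x and some occurrence of z
    have at most y other tokens between them (in either order)."""
    x_positions = [i for i, t in enumerate(tokens) if t == x]
    z_positions = [i for i, t in enumerate(tokens) if t == z]
    return any(abs(i - j) - 1 <= y for i in x_positions for j in z_positions)


# Hypothetical caption text: six tokens separate the two search words.
tokens = "obama will visit troops next week in afghanistan".split()

too_small = within_y_words(tokens, "obama", "afghanistan", 3)   # misses the sentence
large_enough = within_y_words(tokens, "obama", "afghanistan", 6)  # finds it
```

A small Y misses this sentence, while a large Y would also start matching words from unrelated neighbouring stories, which is the motivation for using story segments instead of a fixed window.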
    • Custom query type: Segment-enclosed query (2)
      • Problem 2: all matched search words might not be talking about same story
        – E.g. “Obama AND visit AND Afghanistan”
        – Might match a news program about Obama’s visit to Canada + violence in Afghanistan
    • Custom query type: Segment-enclosed query (3)
      • A news program can contain several stories
        – E.g. Local, national, world, weather, sports
    • Custom query type: Segment-enclosed query (4)
      [Diagram: example sequence of segments in one news program: local story 1, local story 2, commercials, national story 1, national story 2, weather 1, commercials, world story 1, world story 2, weather 2, commercials, health, entertainment, sports]
    • Custom query type: Segment-enclosed query (5)
      • One solution: search for “X and Z within same story segment”
        – Possible with Lucene + story segment info
      • Bonus: enables searching/filtering for a particular story type
        – E.g. Politics
    • Custom query type: Segment-enclosed query (6)
      • How to mark segments
        – Automated
          • Computer Science researchers working on them
          • Word frequency
          • Scene change
          • Black frame and silence
        – Manual segmentation
          • Watch the video
          • Decide where a story starts and ends
          • Mark positions in semi-automated system
    • Custom query type: Segment-enclosed query (7)
      [Diagram: three segments (seg. 1, seg. 2, seg. 3), each with begin and end markers, overlaid with five spans (span 1 to span 5)]
    • Custom query type: Segment-enclosed query (8)
      • Idea
        – Get spans from SpanNearQuery
        – Filter and keep those fully within segments
      • In production: segment info in stored fields
        – As a list of <start position, end position>
        – Simple to implement
        – Reasonably fast searching
      • Alternative: store segment info as terms
        – Possible to find segments by themselves
        – Appears to run much faster
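The filtering step on this slide can be sketched without any Lucene dependency. The Python below is only an illustration under stated assumptions, not the archive's implementation: spans and segments are modelled as (start, end) token positions with an exclusive end, segments are assumed sorted and non-overlapping, and the name `spans_within_segments` and the sample numbers are hypothetical.

```python
from bisect import bisect_right


def spans_within_segments(spans, segments):
    """Keep the spans (start, end) that lie fully inside a single segment.

    segments: sorted, non-overlapping list of (start, end) positions,
    end exclusive. Gaps between segments (e.g. commercials) are allowed.
    """
    starts = [seg_start for seg_start, _ in segments]
    kept = []
    for span_start, span_end in spans:
        # Index of the last segment that begins at or before the span start.
        i = bisect_right(starts, span_start) - 1
        if i >= 0 and span_end <= segments[i][1]:
            kept.append((span_start, span_end))
    return kept


# Made-up data: three segments and five candidate spans.
segments = [(0, 100), (100, 250), (250, 400)]
spans = [(10, 20), (90, 110), (120, 130), (240, 260), (300, 310)]
result = spans_within_segments(spans, segments)
```

In this made-up data, the two spans that cross a segment boundary are dropped. The binary search keeps the per-span cost logarithmic in the number of segments, which matters little for one program but is a reasonable default.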
    • Custom query type: Time-enclosed query
      [Diagram: timeline from 20 s to 60 s in 5 s steps, with five spans and time limits: span 1 (<= 20 s), span 2 (<= 15 s), span 3 (<= 10 s), span 4 (<= 35 s), span 5 (<= 25 s)]
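The slide gives no implementation details for time-enclosed queries, so the following Python is only a guess at the general shape: it assumes each caption timestamp is recorded with the token position where it starts (the capture slide mentions a timestamp every 2-3 seconds), approximates a span's duration from those sparse timestamps, and keeps spans within a limit. All names and numbers are hypothetical.

```python
from bisect import bisect_right


def span_duration(span, stamp_positions, stamp_seconds):
    """Approximate a span's duration from sparse caption timestamps.

    stamp_positions[i] is the token position at which the timestamp
    stamp_seconds[i] begins; both lists are sorted. span is (start, end)
    in token positions, end exclusive.
    """
    def time_at(pos):
        i = bisect_right(stamp_positions, pos) - 1
        return stamp_seconds[max(i, 0)]

    start, end = span
    return time_at(end - 1) - time_at(start)


def spans_within_seconds(spans, stamp_positions, stamp_seconds, max_seconds):
    """Keep the spans whose approximate duration is at most max_seconds."""
    return [s for s in spans
            if span_duration(s, stamp_positions, stamp_seconds) <= max_seconds]


# Made-up timestamps and spans.
stamp_positions = [0, 8, 15, 23, 30]
stamp_seconds = [20.0, 23.0, 25.0, 28.0, 30.0]
spans = [(2, 10), (5, 31), (16, 24)]
kept = spans_within_seconds(spans, stamp_positions, stamp_seconds, 5.0)
```

Because timestamps are sparse, the computed duration is only accurate to within one timestamp interval at each end of the span.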
    • Custom query type: Multi-term regular expression (1)
      • “here is _ _ _ with the (news|story|details|report)”
      • Apply RegEx to a phrase or sentence
        – Not just individual words
      • Lucene core has regular expression query support
        – Good starting point
        – Not a complete solution for us
    • Custom query type: Multi-term regular expression (2)
      • Problems
        – Some analyzers do not work with RegEx
        – Lucene’s RegEx query classes only apply RegEx to individual terms
          • Want to match a pattern against a phrase/sentence
          • Want placeholders for whole words (not just characters)
        – Term(fieldName, “.*”) matches all terms, and all documents, and all positions in the index
          • very slow
          • takes lots of memory
    • Custom query type: Multi-term regular expression (3)
      • What we did
        – Parse and translate multi-term RegEx into Lucene built-in queries (SpanNearQuery, RegexQuery)
          • E.g. “here is _ _ _ with the” = “here is” followed by “with the” (with exactly 3 terms in between)
        – Leading and trailing placeholders
          • E.g. “_ _ is the _ _ _”
          • Preserve for correctness
          • Store word count for each document
          • Expand each span on both sides
          • Bounds checking
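The intended matching semantics can be illustrated with a brute-force Python sketch over a plain token list. This is not the translation into SpanNearQuery/RegexQuery described on the slide; it only shows what a multi-term pattern is meant to match. It assumes pattern pieces are separated by whitespace, that `_` stands for exactly one word, and that every other piece is a regular expression applied to one whole term. The function names and the sample sentence are hypothetical.

```python
import re


def compile_pattern(pattern):
    """Split a multi-term pattern into per-term matchers.

    '_' becomes None (any single word); every other piece is compiled
    as a regular expression that must match one whole term.
    """
    return [None if piece == "_" else re.compile(piece)
            for piece in pattern.split()]


def find_matches(tokens, pattern):
    """Return (start, end) token spans, end exclusive, where the pattern matches."""
    matchers = compile_pattern(pattern)
    n = len(matchers)
    hits = []
    # Bounds check: the whole pattern, including leading and trailing
    # placeholders, has to fit inside the document.
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(m is None or m.fullmatch(t) for m, t in zip(matchers, window)):
            hits.append((start, start + n))
    return hits


# Made-up caption text.
tokens = "and here is our correspondent jane with the details from kabul".split()
hits = find_matches(tokens, "here is _ _ _ with the (news|story|details|report)")

# Leading/trailing placeholders cannot match where the document is too short.
short_hits = find_matches("this is the end".split(), "_ _ is the _ _ _")
```

Scanning every position like this is far too slow for a large index; it is shown only because it makes the role of placeholders and bounds checking easy to see.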
    • Custom query type: Multi-term regular expression (4)
      • Regular expression libraries differ in
        – Syntax (e.g. Perl 5-compatible)
        – Capabilities (e.g. back-references)
        – Speed
      • Memory usage
        – Proportional to number of terms matched
        – Increasing available memory might help
    • Custom result format: Occurrence count
      [Table: rows are dates (9/14/08, 9/15/08, 9/16/08, ...); columns are words (crisis, crash, meltdown, tsunami); each cell holds “X docs, Y occurrences”]
      • To fill a cell: go through every span generated by ... (SpanTermQuery(meltdown) filtered by date 9/15/08)
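A library-free Python sketch of the counting behind one column of that table follows. It assumes documents are available as (date, tokens) pairs, which stands in for iterating over Lucene spans filtered by date; the function name `count_by_date` and the sample documents are made up.

```python
from collections import defaultdict


def count_by_date(documents, word):
    """Count matching documents and total occurrences of word per date.

    documents: iterable of (date, tokens) pairs.
    Returns {date: (docs, occurrences)} for dates with at least one match.
    """
    counts = defaultdict(lambda: [0, 0])
    for date, tokens in documents:
        occurrences = sum(1 for t in tokens if t == word)
        if occurrences:
            counts[date][0] += 1            # one more matching document
            counts[date][1] += occurrences  # all occurrences in that document
    return {date: tuple(pair) for date, pair in counts.items()}


# Made-up documents.
documents = [
    ("2008-09-14", "markets fear a meltdown".split()),
    ("2008-09-15", "meltdown on wall street a financial meltdown".split()),
    ("2008-09-15", "more on the meltdown".split()),
    ("2008-09-16", "weather and sports".split()),
]
meltdown_counts = count_by_date(documents, "meltdown")
```

Repeating the call for each word (crisis, crash, meltdown, tsunami) would give the remaining columns.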
    • Future work: Job queue (1)
      • Research front moving towards analysis of whole database
        – Want full search result set
        – Queries are intensive and take a long time
      • Solution will be beyond increasing timeout
        – Users might close their browsers
        – We might restart the search back-end
    • Future work: Job queue (2)
      • Features
        – Query runs in background
        – Notification when finished/failed
        – Restart queries with recoverable errors
        – Check and cancel jobs
        – Downloadable result
        – Schedule recurring queries
        – Manage job priority and quota
    • Future work: Multiple sources and languages (1)
      • Multilingual news programs
        – E.g. some have English + Spanish CC
      • Multiple text and timestamp sources
        – E.g. CNN transcript available from website
        – Applying speech-to-text to videos
        – Manual correction of text and timestamps
      • Multiple markets
        – E.g. Capture TV programs in Denmark and Norway
    • Future work: Multiple sources and languages (2)
      • Need language detection
        – Libraries exist
      • Search for specific channel
        – Search by language more useful
        – But no fixed channel -> language mapping
      • What will proximity search and occurrence counting mean when there are multiple channels/languages?
    • Future work: Metadata
      • Types of metadata
        – Segment boundary, type and topic
        – Headline and description (from transcripts)
        – Website links
        – Syntactic tags (e.g. part of speech)
        – Generated annotation (e.g. object identification)
        – User annotation (e.g. scene description)
        – Screen text
      • Eventually: want them to be searchable
    • Thank you for coming!
      • Any questions?
      • My e-mail: kai@ssc.ucla.edu
      • Slides available: http://ucla.in/IDJq2u