Text analysis in transparency - a talk at Sunlight Labs






Video at: http://overview.ap.org/blog/2013/05/video-text-analysis-in-transparency/

How text analysis and natural language processing are being used in journalism, open government, and transparency generally. A survey of existing public projects and the algorithms behind them; then a demonstration of the Overview Project (overviewproject.org), a tool for automatically visualizing the topics in a large document set, designed for investigative journalists; then a discussion of where data-driven transparency is going now -- or, what should we work on next?





Text analysis in transparency - a talk at Sunlight Labs Presentation Transcript

  • 1. Text analysis in transparency. Jonathan Stray. Sunlight Labs, May 2, 2013.
  • 2. Text Analysis for Transparency in the Wild: cool projects, and the tech behind them. An Overview of Overview: the thing I've been working on. What's Next for Data-Driven Transparency: how does transparency work, anyway?
  • 3. What people are doing now. Transparency applications of text analysis, in the wild: document summarization; exploration of text collections; name standardization; plagiarism detection / text flow analysis; change surveillance / revision tracking; classification / automatic tagging.
  • 4. Text Analysis for Transparency in the Wild
  • 5. Algorithms: full text search; bag-of-words / TF-IDF; n-gram language models; document similarity functions (cosine distance); fuzzy string matching (shingles, edit distance, ...); text diff; clustering (k-means, hierarchical, ...); locality sensitive hashing (MinHash, ...); supervised classification (linear, SVM, ...); topic modeling (LSA, LDA, NMF, ...).
  • 6. State of the Union 2011 word cloud, Whitehouse.gov
  • 7. State of the Union by decade, Henry Williams
  • 8. State of the Union by Decade. Uses: bag of words, TF-IDF. Loads speeches from all years, applies TF-IDF, sums document vectors by decade, then picks the top 10 words. Not really a principled approach, but it seems to give reasonable results... better than word clouds?
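The pipeline the slide describes boils down to bag-of-words counting plus TF-IDF weighting. A minimal pure-Python sketch of that idea, on toy stand-in documents (no stemming or stopword removal; the actual project's code may differ):

```python
import math
from collections import Counter

def tf_idf_top_words(docs, k=3):
    """Rank each document's words by TF-IDF against the whole corpus."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter()                      # document frequency of each word
    for tokens in tokenized:
        df.update(set(tokens))
    top = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {w: (c / len(tokens)) * math.log(n / df[w])
                  for w, c in tf.items()}
        ranked = sorted(scores, key=scores.get, reverse=True)
        top.append(ranked[:k])
    return top

# Toy stand-ins for one speech per "decade".
docs = [
    "the union is strong the economy grows",
    "the union faces war the army marches",
    "the economy grows and jobs return",
]
print(tf_idf_top_words(docs, k=2))
```

Words that appear in every document (like "the") get an IDF of zero, which is exactly why this beats a raw word count.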
  • 9. First text summarization algorithm: H.P. Luhn, 1958
  • 10. First text summarization algorithm: H.P. Luhn, 1958
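Luhn's 1958 method extracts sentences that are dense in frequent "significant" words. A simplified sketch of the idea (the stopword list, threshold, and scoring here are illustrative, not Luhn's exact formulation, which scores clusters of significant words within a sentence):

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "it", "that", "for"}

def luhn_summarize(text, num_sentences=1, min_freq=2):
    """Simplified Luhn-style scoring: favor sentences dense in frequent content words."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    words = [w for s in sentences for w in s.lower().split()
             if w not in STOPWORDS]
    freq = Counter(words)
    significant = {w for w, c in freq.items() if c >= min_freq}

    def score(sentence):
        tokens = sentence.lower().split()
        hits = sum(1 for w in tokens if w in significant)
        return hits * hits / len(tokens) if tokens else 0.0

    return sorted(sentences, key=score, reverse=True)[:num_sentences]

text = ("The budget passed today. The budget cuts funding for parks. "
        "Parks advocates protested the cuts. It rained.")
print(luhn_summarize(text, num_sentences=1))
```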
  • 11. Many Bills, IBM
  • 12. Many Bills. Does: legislative text exploration. Using: machine classification via (best guess) bag-of-words, n-grams, TF-IDF. Classifies sections of bills by topic and displays them visually. Allows comparison of multiple bills. Intended application: surfacing obscure riders and "pork barrel" projects.
  • 13. Churnalism, Sunlight Labs
  • 14. Churnalism. Does: text reuse detection. Using (best guess): bag-of-words, n-grams, locality sensitive hashing, fuzzy string matching. Given some text, find all documents which contain a substantial section of that text. Allows for some difference between source and target. Highlights diffs.
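The "find documents containing a substantial section of this text" step can be sketched with word shingles and a containment score. This illustrates the general technique, not Churnalism's actual implementation (which adds locality sensitive hashing so it scales to a whole corpus):

```python
def shingles(text, n=4):
    """Set of overlapping word n-grams ("shingles") from a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def containment(source, target, n=4):
    """Fraction of the source's shingles that also appear in the target."""
    s, t = shingles(source, n), shingles(target, n)
    return len(s & t) / len(s)

press = "the mayor announced a bold new plan to rebuild the harbor district"
article = ("City Hall news: the mayor announced a bold new plan "
           "to rebuild the harbor district, critics said.")
print(containment(press, article))
```

A high containment score flags the article as likely copied from the source, even though surrounding words differ.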
  • 15. MemeTracker, by Jure Leskovec, Lars Backstrom and Jon Kleinberg
  • 16. MemeTracker. Does: web-scale text flow analysis on political quotes. Using: n-grams, fuzzy string matching via edit distance, phylogenetic tree concepts from bioinformatics. Given a quote, tracks its diffusion and mutation across news outlets and millions of blogs. Shows attention curves and phrase variations. Allows comparison of different types of media.
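Fuzzy matching of quote variants rests on edit distance. A standard Levenshtein implementation for illustration; the 0.2 threshold for "same meme" is an assumption, and MemeTracker's real pipeline is far more elaborate:

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))           # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]

# Treat two phrase variants as the same "meme" if the distance is
# small relative to their length (threshold is illustrative).
q1 = "lipstick on a pig"
q2 = "lipstick on a pig!"
same_meme = edit_distance(q1, q2) <= 0.2 * max(len(q1), len(q2))
print(same_meme)
```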
  • 17. Campaign finance donor name standardizer, Chase Davis
  • 18. FEC-Standardizer. Does: name standardization. Using: supervised classification via random forests, locality-sensitive hashing on 2-shingles. Standardizes donor identities; that is, finds clusters of donors who are the same person, even with typos, incomplete data, and other errors. 95-99% accurate, compared to Center for Responsive Politics reference data.
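The 2-shingle similarity underlying this kind of name matching can be sketched as follows. The normalization rules here are my own illustration; the real FEC-Standardizer adds LSH blocking and a random-forest classifier on top of features like these:

```python
def bigrams(name):
    """Character 2-shingles of a lightly normalized name string."""
    s = name.lower().replace(".", "").replace(",", " ")
    s = " ".join(s.split())                      # collapse whitespace
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    """Jaccard similarity of two names' 2-shingle sets."""
    sa, sb = bigrams(a), bigrams(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

print(jaccard("SMITH, JOHN A.", "SMITH JON A"))   # typo variant still scores high
```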
  • 19. NewsDiffs, Eric Price, Jenny 8 Lee, Greg Price
  • 20. NewsDiffs. Does: change detection. Using: text diff. Continuously scrapes nytimes.com, cnn.com, politico.com, and bbc.co.uk, looking for changes in published stories. Displays diffs in a visual format.
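The core text-diff step is what Python's standard difflib provides out of the box. A word-level sketch on a made-up before/after pair (NewsDiffs' actual implementation may differ):

```python
import difflib

old = "The mayor said the project would cost $2 million and finish in June."
new = "The mayor said the project would cost $3 million and finish in August."

# difflib compares any sequences, so split into words for a word-level diff.
diff = difflib.unified_diff(old.split(), new.split(), lineterm="")
changes = [d for d in diff
           if d.startswith(("+", "-")) and not d.startswith(("+++", "---"))]
print(changes)
```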
  • 21. Docket Wrench, Sunlight Labs
  • 22. Docket Wrench. Does: topic analysis / plagiarism detection. Using (best guess): bag-of-words, n-grams, locality-sensitive hashing, full text search. Analyzes comments on proposed Federal regulations and shows clusters which contain similar text. Continuously pulls from many different agencies (over 100k dockets!). Also visual display of docket activity, browsing, and search.
  • 23. The Battle for Bystanders: Information, Meaning Contests, and Collective Action in the Egyptian Revolution of 2011, Trey Causey
  • 24. The Battle for Bystanders. An analysis of media during the Egyptian revolution of 2011. Using: bag-of-words, topic modeling. Topic modeling across a database of three online news outlets (both state and non-state media) to detect and count stories with various frames, e.g. "danger and instability". Relies on interpretation of algorithmically generated "topics," which are really distributions over words. No ground truth / comparison to human raters.
  • 25. An Overview of Overview
  • 26. The Overview Project. A general purpose document mining system, meant to answer the question, "what's in there?" Better than search: find what you didn't know you were looking for.
  • 27. Overview, Associated Press
  • 28. Overview. Does: topic exploration. Using: bag-of-words, n-grams, TF-IDF, document similarity, k-means clustering, full text search. Uses the full text of each document to perform hierarchical clustering based on topic. Visual exploration and tagging, and (soon) integrated full-text search.
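The pipeline named here (document vectors, then k-means clustering) can be sketched end to end. This toy k-means with cosine affinity is an illustration only, not Overview's production code, which clusters hierarchically:

```python
import math
import random
from collections import Counter

def vectorize(docs, vocab):
    """Bag-of-words count vector for each document over a fixed vocabulary."""
    counts = [Counter(d.lower().split()) for d in docs]
    return [[c[w] for w in vocab] for c in counts]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def kmeans(vectors, k, iters=10, seed=0):
    """Plain k-means, using cosine similarity as the affinity (common for text)."""
    random.seed(seed)
    centroids = random.sample(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assign each vector to its most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels

docs = ["police report incident", "police incident downtown",
        "budget vote council", "council budget meeting"]
vocab = sorted({w for d in docs for w in d.split()})
labels = kmeans(vectorize(docs, vocab), k=2)
print(labels)
```

The two "police" documents and the two "budget" documents end up in separate clusters, which is the one-level version of Overview's topic tree.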
  • 29. Topic Tree. The computer sorts documents into folders and sub-folders, based on topic analysis.
  • 30. Duplicate / near-duplicate detection. 66 copies with different names.
  • 31. Automatic sorting + manual tagging. Deeper in the tree = narrower topic. When all docs are on the "same" topic, tag it.
  • 32. Extracted keywords for folders and docs
  • 33. Generate document vectors, just like a search engine. Then cluster the space. Visualization of "types" of search result.
  • 34. Stories done with Overview. 9,000 pages FOIA'd from 200 Federal agencies (Data Journalism Awards 2013 finalist). 4,500 pages of incident reports from the US Dept. of State, declassified after FOIA. 7,000 emails from the Tulsa Police Department: millions wasted on bad computers.
  • 35. Lessons learned. Import is the hardest part! Messy input formats, big uploads, many documents on paper... Usability is crucial; people will give up fast. #1 FAQ: "how is it sorting my documents?" #1 comment: "oh, you mean it's a search engine." How do we explain what we're doing to users? WORKFLOW beats ALGORITHM, every time.
  • 36. What's Next for Data-Driven Transparency
  • 37. What should we do next? Lots of stories we could do. Lots of tools we could build. Lots of data we could analyze. Are we starting from the right place?
  • 38. "Low Hanging Fruit." Work on the untouched data sets that have obvious interest and potential, like campaign contributions. Catalog available data. Push for opening more. Create interfaces to existing data sets. This is a data-driven approach. Risk is "looking for your keys under the street light."
  • 39. "Capacity Building." Data analysis is hard! Let's make it easier. Build better software. Reduce duplication of engineering efforts. Teach people to do data work, and improve training methods. This is a tool- and technique-driven approach. Risk is building capacity that doesn't matter (no one uses it, or it has no impact).
  • 40. "What happens if." Look for the work that will have the greatest positive effect. Impact is some combination of supply (we could do this), demand (people would want it), and effectiveness (it contributes to agency). This is an impact-driven approach. Can be very hard to predict or measure.
  • 41. How does transparency work? Deterrence: powerful people don't do bad things because they know someone is watching. Attention: focus a spotlight on things that shouldn't be (even if they're "known"). Understanding: just what is going on there, anyway? Secrets vs. mysteries. Influence mapping: who is actually making the rules?
  • 42. The anxiety of influence. This is why people care about campaign finance. This is why people care about text flow in lawmaking. This is why people care about political advertising.
  • 43. Detecting influence. But how does influence work? Influence over what? (It's a vector, not a scalar.) Algorithmically detectable? Campaign finance data seeks to quantify it. Social network analysis makes claims. Straight-up votes still count. But... do we really understand influence? Are we confusing inputs and results?
  • 44. Some analyses I'd like to see. Externalities of finance: how do banks make money? What effects does this have on the rest of us? Are internal justifications like "increasing liquidity" good or bad for everyone else? Is the industry actually competitive, or just an oligopoly? Connections to politics and other sources of power? Large-scale social network mapping: start with data from LittleSis.org. Can we actually learn anything about influence from this? Try to develop comparative metrics, and a typology of influence (break down by industry?). Look at revolving doors, hiring and appointments, money flows, etc.
  • 45. Transparency Grand Challenge. Illuminate for citizens how the decisions that affect them actually get made. (Which requires figuring that out.) Show them how to use their own influence.
  • 46. Grand Challenge questions. Is government the right focus? What types of influence are there (tribes, institutions, markets, networks)? What is the limiting factor to detecting influence? It could be data access, missing tools, lack of public attention, system complexity, or ...? Are we facing secrets (someone doesn't want us to know) or mysteries (it's complicated and no one knows)? Do we really know how data relates to influence? Who is affected by each type of influence? Who are we working for? Have we asked them what they want?