Text analysis in transparency
Jonathan Stray
Sunlight Labs, May 2 2013


Text analysis in transparency - a talk at Sunlight Labs

Video at: http://overview.ap.org/blog/2013/05/video-text-analysis-in-transparency/

How text analysis and natural language processing are being used in journalism, open government, and transparency generally. A survey of existing public projects and the algorithms behind them. Then a demonstration of the Overview Project (overviewproject.org), a tool for automatically visualizing the topics in a large document set, designed for investigative journalists. Finally, a discussion of where data-driven transparency is going now, or: what should we work on next?

Published in: Education, Technology

  1. Text analysis in transparency
     Jonathan Stray
     Sunlight Labs, May 2 2013
  2. Text Analysis for Transparency in the Wild: cool projects, and the tech behind them
     An Overview of Overview: the thing I've been working on
     What's Next for Data-Driven Transparency? How does transparency work, anyway?
  3. What people are doing now
     Transparency applications of text analysis, in the wild:
     • Document summarization
     • Exploration of text collections
     • Name standardization
     • Plagiarism detection / text flow analysis
     • Change surveillance / revision tracking
     • Classification / automatic tagging
  4. Text Analysis for Transparency: In the Wild
  5. Algorithms
     • Full text search
     • Bag-of-words / TF-IDF
     • N-gram language models
     • Document similarity functions (cosine distance)
     • Fuzzy string matching (shingles, edit distance, ...)
     • Text diff
     • Clustering (k-means, hierarchical, ...)
     • Locality sensitive hashing (MinHash, ...)
     • Supervised classification (linear, SVM, ...)
     • Topic modeling (LSA, LDA, NMF, ...)
  6. State of the Union 2011 word cloud, Whitehouse.gov
  7. State of the Union by decade, Henry Williams
  8. State of the Union by Decade
     Uses: bag of words, TF-IDF
     Loads speeches from all years, applies TF-IDF. Sums document vectors by decade, then picks top 10 words.
     Not really a principled approach, but seems to give reasonable results... better than word clouds?
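
The recipe on this slide (TF-IDF per speech, sum the vectors by decade, keep the top words) can be sketched in a few lines of Python. This is a minimal illustration, not Henry Williams's code: the tokenizer, the length-normalized TF, the plain log IDF, and the function names are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_words_by_decade(speeches, k=10):
    """speeches: list of (year, text) pairs. Returns {decade: top-k words}."""
    docs = [(year, Counter(tokenize(text))) for year, text in speeches]
    n_docs = len(docs)

    # Document frequency: in how many speeches does each word appear?
    df = Counter()
    for _, counts in docs:
        df.update(counts.keys())

    # Sum the TF-IDF vector of every speech into its decade's vector.
    decade_vectors = defaultdict(Counter)
    for year, counts in docs:
        total = sum(counts.values())
        decade = (year // 10) * 10
        for word, count in counts.items():
            tf = count / total                  # assumed: length-normalized TF
            idf = math.log(n_docs / df[word])   # assumed: unsmoothed log IDF
            decade_vectors[decade][word] += tf * idf

    return {decade: [word for word, _ in vector.most_common(k)]
            for decade, vector in decade_vectors.items()}
```

A word that appears in every speech gets an IDF of zero, so it never ranks, which is the main practical difference from a raw-count word cloud.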
  9. First text summarization algorithm: H.P. Luhn, 1958
  10. First text summarization algorithm: H.P. Luhn, 1958
  11. Many Bills, IBM
  12. Many Bills
      Does: legislative text exploration
      Using: machine classification via (best guess) bag-of-words, n-grams, TF-IDF
      Classifies sections of bill by topic, and displays visually.
      Allows comparison of multiple bills.
      Intended application: obscure riders and "pork barrel" projects
  13. Churnalism, Sunlight Labs
  14. Churnalism
      Does: bill content explorer
      Using: matching via (best guess) bag-of-words, n-grams, locality sensitive hashing, fuzzy string matching
      Given some text, find all documents which contain a substantial section of that text.
      Allows for some difference between source and target.
      Highlights diffs.
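
One simple way to "find all documents which contain a substantial section of that text" is to compare sets of word n-grams (shingles) and measure how much of the query is contained in each document. The sketch below illustrates that idea only. The slide itself marks Churnalism's techniques as a best guess, and the function names, the 3-word shingle size, and the 0.5 threshold are assumptions chosen for the example.

```python
import re


def shingles(text, n=3):
    """Return the set of overlapping word n-grams in the text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(query, document, n=3):
    """Fraction of the query's shingles that also occur in the document."""
    query_shingles = shingles(query, n)
    if not query_shingles:
        return 0.0
    return len(query_shingles & shingles(document, n)) / len(query_shingles)


def find_reuse(query, documents, n=3, threshold=0.5):
    """documents: {name: text}. Returns (name, score) pairs, best match first."""
    scored = ((name, containment(query, text, n)) for name, text in documents.items())
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: -pair[1])
```

Containment is asymmetric on purpose: a short press release copied into a long article still scores close to 1.0, and small edits only knock out the shingles that touch them.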
  15. MemeTracker, by Jure Leskovec, Lars Backstrom and Jon Kleinberg
  16. MemeTracker
      Does: web-scale text flow analysis on political quotes
      Using: n-grams, fuzzy string matching via edit distance, phylogenetic tree concepts from bioinformatics
      Given a quote, track its diffusion and mutation across news outlets and millions of blogs.
      Shows attention curves, phrase variations. Allows comparison of different types of media.
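
The fuzzy-matching building block named here, edit distance, is short enough to show in full. The sketch below computes Levenshtein distance over any two sequences and applies it at the word level to decide whether two quotes are variants of each other. It does not reproduce MemeTracker's pipeline; the `is_variant` helper and its `max_edits` threshold are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def is_variant(quote_a, quote_b, max_edits=2):
    """Treat two quotes as variants if they differ by at most max_edits words."""
    return edit_distance(quote_a.lower().split(), quote_b.lower().split()) <= max_edits
```

Working on words rather than characters keeps a swapped or dropped word at a cost of 1, which matches how quotes tend to mutate as they are repeated.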
  17. Campaign finance donor name standardizer, Chase Davis
  18. FEC-Standardizer
      Does: name standardization
      Using: supervised classification via random forests, locality-sensitive hashing on 2-shingles
      Standardizes donor identities. That is, finds clusters of donors who are the same person, even with typos, incomplete data, other errors.
      95-99% accurate, compared to Center for Responsive Politics reference data.
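
To make "locality-sensitive hashing on 2-shingles" concrete, here is a minimal MinHash sketch that buckets names by their character bigrams so that only likely matches need to be compared. It covers the candidate-generation step only, not the random forest classifier, and it is not the FEC-Standardizer code: the MD5-based hash family, the 32 hashes, the 16 bands, and all function names are assumptions.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def char_shingles(name, n=2):
    """Character n-grams of a name, after lowercasing and collapsing whitespace."""
    s = " ".join(name.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity, the quantity that MinHash approximates."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def _hash(seed, shingle):
    """One member of a seeded hash family (assumed: MD5 truncated to 64 bits)."""
    digest = hashlib.md5(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(shingle_set, num_hashes=32):
    """For each seeded hash function, keep the minimum hash over the set."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]


def candidate_pairs(names, num_hashes=32, bands=16):
    """Names that share at least one signature band become candidate pairs."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for name in names:
        shingle_set = char_shingles(name)
        if not shingle_set:
            continue
        signature = minhash_signature(shingle_set, num_hashes)
        for band in range(bands):
            key = (band, tuple(signature[band * rows:(band + 1) * rows]))
            buckets[key].add(name)
    pairs = set()
    for group in buckets.values():
        pairs.update(combinations(sorted(group), 2))
    return pairs
```

More bands with fewer rows each makes the filter more permissive. Candidate pairs would then go to a stricter check, such as the exact `jaccard` score or a trained classifier.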
  19. Newsdiffs, Eric Price, Jenny 8 Lee, Greg Price
  20. NewsDiffs
      Does: change detection
      Using: text diff
      Continuously scrapes nytimes.com, cnn.com, politico.com, bbc.co.uk, looking for changes in published stories.
      Displays diffs in visual format.
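
The text-diff step can be illustrated with Python's standard `difflib` module. This sketch compares two saved versions of a story line by line. It leaves out scraping and display, and the function name and labels are made up for the example rather than taken from NewsDiffs.

```python
import difflib


def story_diff(old, new, old_label="earlier", new_label="later"):
    """Return unified-diff lines describing how a story changed between versions."""
    return list(difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    ))
```

Lines starting with `-` were removed and lines starting with `+` were added. Two identical versions produce an empty list, so a scraper would only need to store a new version when the result is non-empty.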
  21. Docket Wrench, Sunlight Labs
  22. Docket Wrench
      Does: topic analysis / plagiarism detection
      Using: (best guess) bag-of-words, n-grams, locality-sensitive hashing, full text search
      Analyzes comments on proposed Federal regulation and shows clusters which contain similar text.
      Continuously pulls from many different agencies: over 100k dockets! Also visual display of docket activity, browsing, search.
  23. The Battle for Bystanders: Information, Meaning Contests, and Collective Action in the Egyptian Revolution of 2011, Trey Causey
  24. The Battle for Bystanders
      An analysis of media during the Egyptian revolution of 2011
      Using: bag-of-words, topic modeling
      Topic modeling across a database of three online news outlets, both state and non-state media, to detect and count stories with various frames, e.g. "danger and instability"
      Relies on interpretation of algorithmically generated "topics," which are really distributions over words. No ground-truth / comparison to human raters.
  25. An Overview of Overview
  26. The Overview Project
      A general purpose document mining system.
      Meant to answer the question, "what's in there?"
      Better than search: find what you didn't know you're looking for.
  27. Overview, Associated Press
  28. Overview
      Does: topic exploration
      Using: bag-of-words, n-grams, TF-IDF, document similarity, k-means clustering, full text search
      Uses the full text of each document to perform hierarchical clustering based on topic.
      Visual exploration and tagging, and (soon) integrated full-text search.
  29. Topic Tree
      Computer sorts documents into folders and sub-folders, based on topic analysis.
  30. Duplicate/near duplicate detection
      66 copies with different names
  31. Automatic sorting + manual tagging
      Deeper in the tree = narrower topic.
      When all docs are on "same" topic, tag it
  32. Extracted keywords for folders and docs
  33. Generate document vectors, just like a search engine. Then cluster the space. Visualization of "types" of search result.
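
"Generate document vectors, then cluster the space" can be sketched with unit-length TF-IDF vectors, cosine similarity, and a few rounds of k-means. This is a toy illustration of the general technique, not Overview's implementation: the tokenizer, the naive choice of the first `k` documents as starting centroids, and the function names are assumptions.

```python
import math
import re
from collections import Counter


def tfidf_vectors(texts):
    """Turn each text into a sparse, unit-length TF-IDF vector (dict: word -> weight)."""
    docs = [Counter(re.findall(r"[a-z]+", t.lower())) for t in texts]
    df = Counter(word for doc in docs for word in doc)
    n = len(docs)
    vectors = []
    for doc in docs:
        vec = {w: c * math.log(n / df[w]) for w, c in doc.items()}
        norm = math.sqrt(sum(x * x for x in vec.values())) or 1.0
        vectors.append({w: x / norm for w, x in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a plain dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(x * v.get(w, 0.0) for w, x in u.items())


def kmeans(vectors, k, iterations=10):
    """Cluster by cosine similarity. Assumed init: the first k documents."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each document to its most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        # Recompute each centroid as the normalized sum of its members.
        for c in range(k):
            total = Counter()
            for v, label in zip(vectors, labels):
                if label == c:
                    for w, x in v.items():
                        total[w] += x
            norm = math.sqrt(sum(x * x for x in total.values()))
            if norm > 0:
                centroids[c] = {w: x / norm for w, x in total.items()}
    return labels
```

Running k-means again inside each resulting cluster is one way to get a folder and sub-folder tree like the one shown earlier. A real system would also need a better initialization than "first k documents".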
  34. Stories done with Overview
      • 9000 pages FOIA'd from 200 Federal agencies. Data Journalism Awards 2013 finalist.
      • 4500 pages of incident reports from US Dept of State, declassified after FOIA
      • 7000 emails from Tulsa Police Department. Millions wasted on bad computers.
  35. Lessons learned
      • Import is the hardest part! Messy input formats, big uploads, many documents on paper...
      • Usability is crucial. People will give up fast.
      • #1 FAQ: "how is it sorting my documents?"
      • #1 comment: "oh, you mean it's a search engine."
      • How do we explain what we're doing to users?
      WORKFLOW beats ALGORITHM every time
  36. What's Next for Data-Driven Transparency
  37. What should we do next?
      Lots of stories we could do. Lots of tools we could build. Lots of data we could analyze.
      Are we starting from the right place?
  38. "Low Hanging Fruit"
      Work on the untouched data sets that have obvious interest and potential, like campaign contributions.
      Catalog available data. Push for opening more. Create interfaces to existing data sets.
      This is a data-driven approach. Risk is "looking for your keys under the street light."
  39. "Capacity Building"
      Data analysis is hard! Let's make it easier.
      Build better software. Reduce duplication of engineering efforts. Teach people to do data work, and improve training methods.
      This is a tool- and technique-driven approach. Risk is building capacity that doesn't matter (no one uses, or has no impact)
  40. "What happens if"
      Look for the work that will have the greatest positive effect.
      Impact is some combination of supply (we could do this) plus demand (people would want it) plus effectiveness (contributes to agency).
      This is an impact-driven approach. Can be very hard to predict or measure.
  41. How does transparency work?
      • Deterrence. Powerful people don't do bad things because they know someone is watching.
      • Attention. Focus spotlight on things that shouldn't be (even if they're "known")
      • Understanding. Just what is going on there anyway? Secrets vs. mysteries.
      • Influence mapping. Who is actually making the rules?
  42. The anxiety of influence
      This is why people care about campaign finance
      This is why people care about text flow in lawmaking
      This is why people care about political advertising
      ...
  43. Detecting influence
      But how does influence work?
      Influence over what? (It's a vector, not a scalar)
      Algorithmically detectable? Campaign finance data seeks to quantify it. Social network analysis makes claims. Straight up votes still count.
      But... do we really understand influence? Are we confusing inputs and results?
  44. Some analyses I'd like to see
      Externalities of Finance. How do banks make money? What effects does this have on the rest of us? Are internal justifications like "increasing liquidity" good or bad for everyone else? Is the industry actually competitive or just an oligopoly? Connections to politics and other sources of power?
      Large scale social network mapping. Start with data from LittleSis.org. Can we actually learn anything about influence from this? Try to develop comparative metrics, and typology of influence; break down by industry? Look at revolving doors, hiring and appointment, money flows, etc.
  45. Transparency Grand Challenge
      Illuminate for citizens how the decisions that affect them actually get made.
      (Which requires figuring that out.)
      Show them how to use their own influence.
  46. Grand Challenge Questions
      • Is government the right focus?
      • What types of influence are there? (tribes, institutions, markets, networks)
      • What is the limiting factor to detecting influence? Could be data access, missing tools, lack of public attention, system complexity, or ... ?
      • Are we facing secrets (someone doesn't want us to know) or mysteries (it's complicated and no one knows)?
      • Do we really know how data relates to influence?
      • Who is affected by each type of influence?
      • Who are we working for? Have we asked them what they want?
