
PyTextRank: Graph algorithms for enhanced natural language processing


Published on: DSSG'17 meetup, 2017-12-07
https://www.meetup.com/DataScience-SG-Singapore/events/244885954/

PyTextRank: Graph algorithms for enhanced natural language processing, by Paco Nathan (O'Reilly Media). In this talk, Paco illustrates PyTextRank use cases in media and learning: enabling semi-supervised word sense disambiguation, moving from natural language parsing to natural language understanding, and implementing AI-based video search plus approximation algorithms for content recommendation based on semantic similarity.



  1. PyTextRank: graph algorithms for enhanced NLP
     Paco Nathan @pacoid
     Dir, Learning Group @ O'Reilly Media
     DSSG'17, Singapore, 2017-12-06
  2. Background: helping people learn
  3. AI in Media
     ▪ content which can be represented as text can be parsed by NLP, then manipulated by available AI tooling
     ▪ labeled images get really interesting
     ▪ text or images within a context have inherent structure
     ▪ representation of that kind of structure is rare in the Media vertical – so far
  4. 4. {"graf": [[21, "let", "let", "VB", 1, 48], [0, "'s", "'s", "PRP", 0, 49], "take", "take", "VB", 1, 50], [0, "a", "a", "DT", 0, 51], [23, "look", "l "NN", 1, 52], [0, "at", "at", "IN", 0, 53], [0, "a", "a", "DT", 0, 54], [ "few", "few", "JJ", 1, 55], [25, "examples", "example", "NNS", 1, 56], [0 "often", "often", "RB", 0, 57], [0, "when", "when", "WRB", 0, 58], [11, "people", "people", "NNS", 1, 59], [2, "are", "be", "VBP", 1, 60], [26, " "first", "JJ", 1, 61], [27, "learning", "learn", "VBG", 1, 62], [0, "abou "about", "IN", 0, 63], [28, "Docker", "docker", "NNP", 1, 64], [0, "they" "they", "PRP", 0, 65], [29, "try", "try", "VBP", 1, 66], [0, "and", "and" 0, 67], [30, "put", "put", "VBP", 1, 68], [0, "it", "it", "PRP", 0, 69], "in", "in", "IN", 0, 70], [0, "one", "one", "CD", 0, 71], [0, "of", "of", 0, 72], [0, "a", "a", "DT", 0, 73], [24, "few", "few", "JJ", 1, 74], [31, "existing", "existing", "JJ", 1, 75], [18, "categories", "category", "NNS 76], [0, "sometimes", "sometimes", "RB", 0, 77], [11, "people", "people", 1, 78], [9, "think", "think", "VBP", 1, 79], [0, "it", "it", "PRP", 0, 80 "'s", "be", "VBZ", 1, 81], [0, "a", "a", "DT", 0, 82], [32, "virtualizati "virtualization", "NN", 1, 83], [19, "tool", "tool", "NN", 1, 84], [0, "l "like", "IN", 0, 85], [33, "VMware", "vmware", "NNP", 1, 86], [0, "or", " "CC", 0, 87], [34, "virtualbox", "virtualbox", "NNP", 1, 88], [0, "also", "also", "RB", 0, 89], [35, "known", "know", "VBN", 1, 90], [0, "as", "as" 0, 91], [0, "a", "a", "DT", 0, 92], [36, "hypervisor", "hypervisor", "NN" 93], [0, "these", "these", "DT", 0, 94], [2, "are", "be", "VBP", 1, 95], "tools", "tool", "NNS", 1, 96], [0, "which", "which", "WDT", 0, 97], [2, "be", "VBP", 1, 98], [37, "emulating", "emulate", "VBG", 1, 99], [38, "hardware", "hardware", "NN", 1, 100], [0, "for", "for", "IN", 0, 101], [ "virtual", "virtual", "JJ", 1, 102], [40, "software", "software", "NN", 1 103]], "id": "001.video197359", "sha1": "4b69cf60f0497887e3776619b922514f2e5b70a8"} Video  transcription 4 {"count": 2, "ids": [32, 19], "pos": "np", "rank": 0.0194, "text": "virtualization tool"} {"count": 2, "ids": [40, 69], "pos": "np", "rank": 0.0117, "text": "software applications"} {"count": 4, "ids": [38], "pos": "np", "rank": 0.0114, "text": "hardware"} {"count": 2, "ids": [33, 36], "pos": "np", "rank": 0.0099, "text": "vmware hypervisor"} {"count": 4, "ids": [28], "pos": "np", "rank": 0.0096, "text": "docker"} {"count": 4, "ids": [34], "pos": "np", "rank": 0.0094, "text": "virtualbox"} {"count": 10, "ids": [11], "pos": "np", "rank": 0.0049, "text": "people"} {"count": 4, "ids": [37], "pos": "vbg", "rank": 0.0026, "text": "emulating"} {"count": 2, "ids": [27], "pos": "vbg", "rank": 0.0016, "text": "learning"} Transcript: let's take a look at a few examples often when people are first learning about Docker they try and put it in one of a few existing categories sometimes people think it's a virtualization tool like VMware or virtualbox also known as a hypervisor these are tools which are emulating hardware for virtual software Confidence: 0.973419129848 39 KUBERNETES 0.8747 coreos 0.8624 etcd 0.8478 DOCKER CONTAINERS 0.8458 mesos 0.8406 DOCKER 0.8354 DOCKER CONTAINER 0.8260 KUBERNETES CLUSTER 0.8258 docker image 0.8252 EC2 0.8210 docker hub 0.8138 OPENSTACK orm:Docker a orm:Vendor; a orm:Container; a orm:Open_Source; a orm:Commercial_Software; owl:sameAs dbr:Docker_%28software%29; skos:prefLabel "Docker"@en;
  5. Knowledge Graph
     ▪ used to construct an ontology about technology, based on learning materials from 200+ publishers
     ▪ uses SKOS as a foundation, ties into US Library of Congress and DBpedia as upper ontologies
     ▪ primary structure is "human scale", used as control points
     ▪ majority (>90%) of the graph comes from machine-generated data products
  6. (image-only slide)
  7. UX for content discovery:
     ▪ partly generated + curated by humans
     ▪ partly generated + curated by AI apps
  8. Motivations: a new generation of NLP
  9. Statistical parsing
     Long, long ago, the field of NLP was concerned with transformational grammars, linguistic theory by Chomsky, etc. Machine learning techniques allowed simpler approaches, i.e., statistical parsing. With the rise of Big Data, those techniques grew more effective – Google won the NIST 2005 Machine Translation Evaluation and remains a leader. More recently, deep learning helps extend performance (read: fewer errors) for statistical parsing and downstream uses.
  10. What changed?
      Hardware: multicore and large memory spaces, plus wide availability of cloud computing resources. The same data + algorithms that required a large Hadoop cluster in 2007 run faster in Python on my laptop in 2017. NLP benefited as hardware advances eliminated the need for computational shortcuts (e.g., stemming) and made more advanced approaches (e.g., graph algorithms) feasible. Also, subtle shifts as approximation algorithms became more integrated, e.g., substituting n-grams with skip-grams. (A quick sketch of the difference follows below.)
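To make the n-gram vs. skip-gram point concrete, here is a minimal sketch; the helper functions and the toy sentence are illustrative additions, not from the talk:

    # Minimal illustration of n-grams vs. skip-grams; function names and
    # the toy sentence are illustrative, not taken from the talk.
    from itertools import combinations

    def ngrams(tokens, n):
        """Contiguous n-grams: every window of n adjacent tokens."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def skipgrams(tokens, n, k):
        """k-skip-n-grams: n tokens in order, allowing up to k skipped positions."""
        results = []
        for i in range(len(tokens)):
            window = range(i + 1, min(i + n + k, len(tokens)))
            for combo in combinations(window, n - 1):
                results.append((tokens[i],) + tuple(tokens[j] for j in combo))
        return results

    tokens = "graph algorithms for enhanced NLP".split()
    print(ngrams(tokens, 2))        # ('graph', 'algorithms'), ('algorithms', 'for'), ...
    print(skipgrams(tokens, 2, 1))  # adds gapped pairs such as ('graph', 'for')

Skip-grams capture co-occurrence across small gaps, which is what lets a word graph link related terms that are not strictly adjacent.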
  11. Bag-of-words is not your friend
  12. Advances: graph algorithms for NLP
  13. TextRank
      "TextRank: Bringing Order into Texts"
      Rada Mihalcea, Paul Tarau
      Conference on Empirical Methods in Natural Language Processing (2004-07-25)
      https://goo.gl/AJnA76
      http://web.eecs.umich.edu/~mihalcea/
      http://www.cse.unt.edu/~tarau/
  14. TextRank
      "In this paper, we introduced TextRank – a graph-based ranking model for text processing, and show how it can be successfully used for natural language applications. In particular, we proposed and evaluated two innovative unsupervised approaches for keyword and sentence extraction, and showed that the accuracy achieved by TextRank in these applications is competitive with that of previously proposed state-of-the-art algorithms. An important aspect of TextRank is that it does not require deep linguistic knowledge, nor domain or language specific annotated corpora, which makes it highly portable to other domains, genres, or languages."
  15. TextRank
      1. segment document into paragraphs and sentences
      2. parse each sentence, find PoS annotations, lemmas
      3. construct a graph where nouns, adjectives, verbs are vertices
         • links based on skip-grams
         • links based on repeated instances of the same root
      4. stochastic link analysis (e.g., PageRank) identifies noun phrases which have a high probability of inbound references
      (a minimal code sketch of these steps follows below)
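A minimal sketch of steps 2-4 using spaCy and NetworkX; the window size, PoS filter, and sample text are assumptions for illustration, not the actual PyTextRank internals:

    # Minimal TextRank sketch: illustrative only, not the PyTextRank internals.
    # Assumed: spaCy's small English model, a 3-token co-occurrence window,
    # and nouns/proper nouns/adjectives/verbs as graph vertices.
    import networkx as nx
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Graph algorithms enhance natural language processing. "
              "TextRank ranks phrases by analyzing a graph of lemmas.")

    KEEP = {"NOUN", "PROPN", "ADJ", "VERB"}
    graph = nx.Graph()

    # Vertices are lemmas of content words; repeated instances of the same
    # root share one vertex. Edges link lemmas co-occurring within the window.
    content = [t for t in doc if t.pos_ in KEEP]
    for i, tok in enumerate(content):
        for other in content[i + 1:i + 4]:
            graph.add_edge(tok.lemma_, other.lemma_)

    # Stochastic link analysis: PageRank estimates the probability of a
    # random walk landing on each vertex, i.e., its inbound-reference weight.
    ranks = nx.pagerank(graph)
    for lemma, rank in sorted(ranks.items(), key=lambda kv: -kv[1])[:5]:
        print(f"{rank:.4f}  {lemma}")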
  16. How does this work?
      Graph analysis, e.g., centrality, causes the most "central" phrases to be ranked higher.
  17. Unsupervised summarization
  18. Important outcomes
      Unsupervised methods which do not require subject domain info, but can use it when available.
      Extracting key phrases from a text, in lieu of keywords, provides significantly more information content. For example, consider "deep learning" vs. "deep" and "learning".
      Also, automated summarization can reduce human (editorial) time dramatically.
  19. Implementations: various trade-offs
  20. Other implementations
      Paco Nathan, 2008
      Java, Apache Hadoop / English+Spanish
      https://github.com/ceteri/textrank/

      Radim Řehůřek, 2009
      Python, NumPy, SciPy
      https://radimrehurek.com/gensim/

      Jeff Kubina, 2009
      Perl / English
      http://search.cpan.org/~kubina/Text-Categorize-Textrank-0.51/lib/Text/Categorize/Textrank/En.pm

      Karin Christiansen, 2014
      Java / Icelandic
      https://github.com/karchr/icetextsum

      Paco Nathan, 2014
      Scala, Apache Spark / English
      https://github.com/ceteri/spark-exercises

      Got any others to add to this list?
  21. PyTextRank
      Paco Nathan, 2016
      Python implementation atop spaCy, NetworkX, datasketch:
      ▪ https://pypi.python.org/pypi/pytextrank/
      ▪ https://github.com/ceteri/pytextrank/
      (a minimal usage sketch follows below)
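For orientation, a minimal usage sketch; note that this follows the spaCy-pipeline API of later PyTextRank releases, which postdates the package described in this talk, so treat it as illustrative:

    # Minimal PyTextRank usage sketch; this reflects the spaCy-pipeline API
    # of later releases (3.x), not necessarily the 2016-2017 package.
    import spacy
    import pytextrank  # registers the "textrank" pipeline component

    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("textrank")

    doc = nlp("Sometimes people think Docker is a virtualization tool "
              "like VMware or VirtualBox, also known as a hypervisor.")

    # Top-ranked noun phrases, analogous to the ranked JSON output on slide 4
    for phrase in doc._.phrases[:5]:
        print(f"{phrase.rank:.4f}  count={phrase.count}  {phrase.text}")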
  22. PyTextRank: motivations
      ▪ spaCy is moving fast and benefits greatly by having an abstraction layer above it
      ▪ NetworkX allows use of multicore + large memory spaces for graph algorithms, obviating the need in many NLP use cases for Hadoop, Spark, etc.
      ▪ datasketch leverages approximation algorithms which make key features feasible, given the environments described above (see the similarity sketch below)
      ▪ monolithic NLP frameworks often attempt to control the entire pipeline (instead of using modules or services) and are basically tech debt waiting to happen
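Since datasketch is credited with making similarity features feasible, here is a minimal sketch of MinHash-based approximate similarity; the sample token sets and the 0.25 threshold are illustrative assumptions:

    # Minimal sketch of approximate similarity with datasketch's MinHash;
    # sample token sets and the LSH threshold are illustrative assumptions.
    from datasketch import MinHash, MinHashLSH

    def minhash(words, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for w in words:
            m.update(w.encode("utf8"))
        return m

    a = minhash({"docker", "container", "virtualization", "tool"})
    b = minhash({"vmware", "virtualbox", "virtualization", "tool"})

    # Estimated Jaccard similarity, without comparing the raw sets directly
    print(f"estimated similarity: {a.jaccard(b):.2f}")  # true value here is ~0.33

    # LSH index answers "which stored items resemble this query?" in
    # sublinear time, which scales to recommendation over a large corpus.
    lsh = MinHashLSH(threshold=0.25, num_perm=128)
    lsh.insert("docker-phrase", a)
    print(lsh.query(b))  # likely ['docker-phrase'], since ~0.33 > 0.25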
  23. PyTextRank: innovations
      ▪ fixed a bug in the original paper's pseudocode; see the 2008 Java impl, used in production by ShareThis, Rolling Stone, etc.
      ▪ included verbs in the graph (though not in keyphrase output)
      ▪ added text scrubbing, unicode normalization
      ▪ replaced stemming with lemmatization (illustrated below)
      ▪ can leverage named entity recognition, noun phrase chunking
      ▪ provides extractive summarization
      ▪ makes use of approximation algorithms, where appropriate
      ▪ applies use of ontology: random walk of hypernyms/hyponyms helps enrich the graph, which in turn improves performance
      ▪ integrates vector embedding + other graph algorithms to help construct or augment an ontology from a corpus
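A quick illustration of the stemming-vs-lemmatization trade-off mentioned above; the NLTK Porter-stemmer comparison and the word list are my additions for contrast, not from the slides:

    # Contrast stemming with lemmatization; the Porter-stemmer comparison
    # and word list are illustrative additions, not from the slides.
    import spacy
    from nltk.stem import PorterStemmer

    nlp = spacy.load("en_core_web_sm")
    stemmer = PorterStemmer()

    for word in ["categories", "emulating", "studies", "tools"]:
        stem = stemmer.stem(word)
        lemma = nlp(word)[0].lemma_
        print(f"{word:12}  stem={stem:10}  lemma={lemma}")

    # Stems can be non-words ("categori", "emul", "studi"), while lemmas
    # remain dictionary forms ("category", "emulate", "study"), which keeps
    # graph vertices meaningful and mergeable.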
  24. PyTextRank: upcoming
      Package installation is currently integrated with PyPI; 2017-Q4: adding support for Anaconda.
      Also working on:
      ▪ integration with spaCy 2.x, e.g., spans
      ▪ more coding examples in the wiki, especially using Jupyter
      ▪ better handling for character encoding issues; see "Character Filtering" by Daniel Tunkelang, along with his entire series Query Understanding
      ▪ more integration into coursework; see Get Started with NLP in Python on O'Reilly
  25. Strata Data: SG, Dec 4-7; SJ, Mar 5-8; UK, May 21-24; CN, Jul 12-15
      The AI Conf: CN, Apr 10-13; NY, Apr 29-May 2; SF, Sep 4-7; UK, Oct 8-11
      JupyterCon: NY, Aug 21-24
      OSCON: PDX, Jul 16-19, 2018
  26. Learn Alongside Innovators; Just Enough Math; Building Data Science Teams; Hylbert-Speys; How Do You Learn?
      updates, reviews, conference summaries…
      liber118.com/pxn/
      @pacoid
