Franta Polach - Exploring Patent Data with Python

PyData Berlin 2014

Experiences from building a recommendation engine for patent search using pythonic NLP and topic modeling tools such as Gensim.

  1. Exploring patent space with Python. Franta Polach, @FrantaPolach, IPberry.com, PyData 2014
  2.–5. (image-only slides)
  6. Outline ● Why patents ● Data kung fu ● Topic modelling ● Future
  7. Why patents
  8. Why patents ● The system is broken ● Messy, slow & costly process ● USPTO data freely available ● Data structured, mostly consistent ● A chance to learn
  9. Data kung fu. Kung fu or Gung fu (/ˌkʌŋˈfuː/ or /ˌkʊŋˈfuː/; 功夫, Pinyin: gōngfu) – a Chinese term referring to any study, learning, or practice that requires patience, energy, and time to complete
  10. USPTO data ● XML, SGML key-value store ● 1975 – present ● eight different formats ● > 70GB (compressed) ● patent grants ● patent applications ● How to parse? (a parsing sketch follows below) ● Parsed data available? – Harvard Dataverse Network – Coleman Fung Institute for Engineering Leadership, UC Berkeley – PATENT SEARCH TOOL by Fung Institute – http://funginstitute.berkeley.edu/tools-and-data
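      The slides do not show the parsing step itself; below is a minimal sketch of streaming titles out of a weekly USPTO grant file. It assumes the post-2004 ICE XML format (one <us-patent-grant> element per grant with an <invention-title>); the older SGML/APS formats among the eight use different tags, and the weekly files concatenate many XML documents, which is why the sketch splits on the XML declaration. The file name is a made-up example.

          from lxml import etree

          def iter_grants(path):
              # A weekly USPTO grant file concatenates many XML documents, so
              # accumulate lines until the next XML declaration and parse each
              # grant document on its own; memory stays flat on large files.
              buf = []
              with open(path, 'rb') as f:
                  for line in f:
                      if line.startswith(b'<?xml') and buf:
                          yield etree.fromstring(b''.join(buf))
                          buf = []
                      elif not line.startswith((b'<?xml', b'<!DOCTYPE')):
                          buf.append(line)
                  if buf:
                      yield etree.fromstring(b''.join(buf))

          for grant in iter_grants('ipg140415.xml'):
              print(grant.findtext('.//invention-title'))
              break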
  11. Coleman Fung Institute for Engineering Leadership, UC Berkeley: patent data process flow. The code is in Python 2 on GitHub.
  12. Fung Institute SQL database schema
  13. Entity-relationship diagram: patents with citations, claims, applications and classes
  14. Descriptive statistics
  15. Topic modelling ● Goal: build a topic space of the patent documents, i.e. compute semantic similarity ● Tools: nltk, gensim ● Data: patent abstracts, claims, descriptions ● Usage: given an invention description, find semantically similar patents
  16. Text preprocessing ● Have: parsed data in a relational database ● Want: data ready for semantic analysis ● Do: lemmatization, stemming; collocations, Named Entity Recognition
  17. Text preprocessing
      Lemmatization, stemming: group together the different inflected forms of a word so they can be analysed as a single item.
          print(gensim.utils.lemmatize("Changing the way scientists, engineers, and analysts perceive big data"))
          ['change/VB', 'way/NN', 'scientist/NN', 'engineer/NN', 'analyst/NN', 'perceive/VB', 'big/JJ', 'datum/NN']
      Collocations, Named Entity Recognition: detect a sequence of words that co-occur more often than would be expected by chance, so that an entity such as "General Electric" stays a single token (a small usage sketch follows below).
          import nltk
          from nltk.collocations import TrigramCollocationFinder
          from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures
      Stopwords: generic words such as "six", "then", "be", "do", ...
          from gensim.parsing.preprocessing import STOPWORDS
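      The slide only shows the imports; here is a small sketch of how the nltk collocation finder could be used to pick up multi-word terms worth keeping as single tokens. The tokens list is a made-up stand-in for a stream of patent abstracts, and the frequency threshold is arbitrary.

          from nltk.collocations import BigramCollocationFinder
          from nltk.metrics import BigramAssocMeasures

          tokens = ("a semiconductor memory cell array coupled to a sense amplifier "
                    "and a memory cell array with a sense amplifier circuit").split()

          finder = BigramCollocationFinder.from_words(tokens)
          finder.apply_freq_filter(2)                       # keep pairs seen at least twice
          print(finder.nbest(BigramAssocMeasures.pmi, 5))   # rank by pointwise mutual information
          # e.g. [('cell', 'array'), ('memory', 'cell'), ('sense', 'amplifier'), ('a', 'sense')]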
  18. Data streaming. Why? The data is too large to fit into RAM; itertools are your friend.
          class PatCorpus(object):
              def __init__(self, fname):
                  self.fname = fname
              def __iter__(self):
                  for line in open(self.fname):
                      patent = line.lower().split('\t')
                      tokens = gensim.utils.tokenize(patent[5], lower=True)
                      title = patent[6]
                      yield title, list(tokens)

          corpus_tokenized = PatCorpus('in.tsv')
          print(list(itertools.islice(corpus_tokenized, 2)))
          [('easy wagon/easy cart/bicycle wheel mounting brackets system', [u'a', u'specific', u'wheel', u'mounting', u'bracket', u'and', u'a', u'versatile', u'method', u'of', u'using', u'these', u'brackets', u'or', u'similar', u'items', u'to', u'attach', u'bicycle', u'wheels', u'to', u'various', u'vehicle', u'frames', u'primarily', u'made', u'of', u'wood', u'and', u'a', u'general', u'vehicle', u'structure', u'or', u'frame', u'design', u'using', u'the', u'brackets', u'the', u'brackets', u'are', u'flat', …
  19. Vectorization
      ● First we create a dictionary, i.e. index the text tokens by integers (corpus_tokenized yields (title, tokens) pairs, so only the token lists are passed in):
          id2word = gensim.corpora.Dictionary(tokens for _, tokens in corpus_tokenized)
      ● Create bag-of-words vectors using a streamed corpus and the dictionary:
          from gensim.utils import simple_preprocess

          def tokenize(text):
              return [t for t in simple_preprocess(text) if t not in STOPWORDS]

          text = "A community for developers and users of Python data tools."
          bow = id2word.doc2bow(tokenize(text))
          print(bow)
          [(12832, 1), (28124, 1), (28301, 1), (32835, 1)]
      (A sketch of the streamed bag-of-words corpus used to train the model on the next slide follows below.)
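      The slides never show how the corpus passed to LdaModel on the next slide is built; a minimal sketch, assuming the corpus_tokenized stream and id2word dictionary from above (the class name and the serialization path are my own):

          class PatBowCorpus(object):
              """Stream sparse bag-of-words vectors, one patent at a time."""
              def __init__(self, tokenized_corpus, dictionary):
                  self.tokenized_corpus = tokenized_corpus
                  self.dictionary = dictionary
              def __iter__(self):
                  for title, tokens in self.tokenized_corpus:
                      yield self.dictionary.doc2bow(tokens)

          corpus = PatBowCorpus(corpus_tokenized, id2word)
          # optionally cache the vectors to disk so later passes do not re-tokenize
          gensim.corpora.MmCorpus.serialize('patents_bow.mm', corpus)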
  20. Semantic transformations
      ● A transformation takes a corpus and outputs another corpus
      ● Choice: Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), Random Projections (RP), etc.
          model = gensim.models.LdaModel(corpus, num_topics=100, id2word=id2word, passes=4, alpha=None)
          _ = model.print_topics(-1)
          INFO:gensim.models.ldamodel:topic #0 (0.010): 0.116*memory + 0.090*cell + 0.063*plurality + 0.054*array + 0.052*each + 0.044*bit + 0.039*cells + 0.032*address + 0.022*logic + 0.017*row
          INFO:gensim.models.ldamodel:topic #1 (0.010): 0.101*speed + 0.092*lines + 0.060*performance + 0.045*characteristic + 0.036*skin + 0.028*characteristics + 0.025*suspension + 0.024*enclosure + 0.023*transducer + 0.022*loss
          INFO:gensim.models.ldamodel:topic #2 (0.010): 0.141*portion + 0.049*housing + 0.031*portions + 0.028*end + 0.024*edge + 0.020*mounting + 0.018*has + 0.017*each + 0.016*formed + 0.016*arm
          INFO:gensim.models.ldamodel:topic #3 (0.010): 0.224*signal + 0.099*output + 0.075*input + 0.057*signals + 0.043*frequency + 0.034*phase + 0.024*clock + 0.020*circuit + 0.016*amplifier + 0.014*reference
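      Training over the full grant corpus takes a while, so it is worth persisting the trained model; a minimal sketch (file name is my own):

          model.save('patents_lda.model')                        # large arrays are written to companion files
          model = gensim.models.LdaModel.load('patents_lda.model')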
  21. Transforming unseen documents
          text = "A method of configuring the link maximum transmission unit (MTU) in a user equipment."
      1) transform the text into the bag-of-words space
          bow_vector = id2word.doc2bow(tokenize(text))
          print([(id2word[id], count) for id, count in bow_vector])
          [(u'method', 1), (u'configuring', 1), (u'link', 1), (u'maximum', 1), (u'transmission', 1), (u'unit', 1), (u'user', 1), (u'equipment', 1)]
      2) transform the text into our LDA space
          vector = model[bow_vector]
          [(0, 0.024384265946835323), (1, 0.78941547921042373), ...
      3) find the document's most significant LDA topic
          model.print_topic(max(vector, key=lambda item: item[1])[0])
          0.022*network + 0.021*performance + 0.018*protocol + 0.015*data + 0.009*system + 0.008*internet + ...
  22. Evaluation ● Topic modelling is an unsupervised task ->> evaluation is tricky ● Need to evaluate the improvement on the intended task ● Our goal is to retrieve semantically similar documents, so we tag a set of similar documents and compare them with the results of a given semantic model ● "word intrusion" method: for each trained topic, take its first ten words, substitute one of them with a randomly chosen word (the intruder!) and let a human detect the intruder ● Method without human intervention: split each document into two parts, and check that topics of the first half are similar to topics of the second, while halves of different documents are dissimilar (a rough sketch of this check follows below)
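      A rough sketch of that last check, assuming the model, id2word and tokenized documents from the earlier slides (the function name is my own; gensim.matutils.cossim computes cosine similarity between sparse vectors):

          from gensim import matutils

          def intra_document_similarity(model, id2word, docs_tokens):
              # Topics of the first half of a document should resemble topics of
              # the second half; an average cosine similarity close to 1 is good.
              sims = []
              for tokens in docs_tokens:
                  half = len(tokens) // 2
                  first = model[id2word.doc2bow(tokens[:half])]
                  second = model[id2word.doc2bow(tokens[half:])]
                  sims.append(matutils.cossim(first, second))
              return sum(sims) / len(sims)

          # For contrast, pair halves of *different* documents the same way;
          # that average should come out clearly lower.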
  23. The topic space ● a topic is a distribution over a fixed vocabulary of terms ● the idea behind Latent Dirichlet Allocation is to statistically model documents as containing multiple hidden semantic topics
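      In symbols (not on the slide), the modelling assumption is that each word w of a document d is drawn from a mixture of the K topics: p(w | d) = \sum_{k=1}^{K} p(w | z = k) \, p(z = k | d), where each p(w | z = k) is one of the topic distributions over the vocabulary.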
  24. Exploring topic space. Top terms (with counts) of four sample topics: memory 188, cell 146, plurality 102, array 86, bit 71, address 51 | speed 178, line 163, performance 107, characteristic 79, skin 63, suspension 45 | signal 324, output 142, input 108, frequency 62, phase 49, clock 35 | portion 310, housing 109, end 62, edge 53, mounting 43, form 35
  25. Topics distribution: many topics in total, but each document contains just a few of them ->> sparse model (a quick way to see this in gensim follows below)
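      A small check of that sparsity, assuming the trained model and the bow_vector from slide 21 (the 1% probability cut-off is arbitrary):

          doc_topics = model.get_document_topics(bow_vector, minimum_probability=0.01)
          print(len(doc_topics), "of", model.num_topics, "topics above 1% probability")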
  26. Semantic distance in topic space
      ● Semantic distance queries via a dense pairwise matrix do not scale:
          from scipy.spatial import distance
          pairwise = distance.squareform(distance.pdist(matrix))
          >> MemoryError
      ● Document indexing:
          from gensim.similarities import Similarity
          index = Similarity('tmp/index', corpus, num_features=corpus.num_terms)
      The Similarity class splits the index into several smaller sub-indexes ->> scales well
  27. Semantic distance queries
          query = "A method of configuring the link maximum transmission unit (MTU) in a user equipment."
      1) vectorize the text into bag-of-words space
          bow_vector = id2word.doc2bow(tokenize(query))
      2) transform the text into our LDA space
          query_lda = model[bow_vector]
      3) query the LDA index, get the top 3 most similar documents
          index.num_best = 3
          print(index[query_lda])
          [(2026, 0.91495784099521484), (32384, 0.8226358470916238), (11525, 0.80638835174553156)]
  28. Future ● Graph of USPTO data (Neo4j) ● Elasticsearch search and analytics ● Recommendation engine (for applications) ● Drawings analysis ● Blockchain based smart contracts ● Artificial patent lawyer
