Franta Polach - Exploring Patent Data with Python

PyData Berlin 2014

Experiences from building a recommendation engine for patent search using pythonic NLP and topic modeling tools such as Gensim.


  1. Exploring patent space with Python
     Franta Polach
     @FrantaPolach
     IPberry.com
     PyData 2014
  2.–5. (image-only slides, no text)
  6. Outline
     ● Why patents
     ● Data kung fu
     ● Topic modelling
     ● Future
  7. Why patents
  8. Why patents
     ● The system is broken
     ● Messy, slow & costly process
     ● USPTO data freely available
     ● Data structured, mostly consistent
     ● A chance to learn
  9. Data kung fu
     Kung fu or Gung fu (/ˌkʌŋˈfuː/ or /ˌkʊŋˈfuː/; 功夫, Pinyin: gōngfu) – a Chinese term referring to any study, learning, or practice that requires patience, energy, and time to complete.
  10. USPTO Data
      ● XML, SGML key-value store
      ● 1975 – present
      ● eight different formats
      ● > 70 GB (compressed)
      ● patent grants
      ● patent applications
      ● How to parse? Is parsed data already available?
        – Harvard Dataverse Network
        – Coleman Fung Institute for Engineering Leadership, UC Berkeley
        – PATENT SEARCH TOOL by Fung Institute
        – http://funginstitute.berkeley.edu/tools-and-data
  11. Coleman Fung Institute for Engineering Leadership, UC Berkeley: patent data process flow
      The code is in Python 2 on GitHub.
  12. Fung Institute SQL database schema
  13. Entity-relationship diagram: patents with citations, claims, applications and classes
  14. Descriptive statistics
  15. Topic modelling
      ● Goal: build a topic space of the patent documents, i.e. compute semantic similarity
      ● Tools: nltk, gensim
      ● Data: patent abstracts, claims, descriptions
      ● Usage: given an invention description, find semantically similar patents
  16. Text preprocessing
      ● Have: parsed data in a relational database
      ● Want: data ready for semantic analysis
      ● Do:
        – lemmatization, stemming
        – collocations, Named Entity Recognition
  17. Text preprocessing
      Lemmatization, stemming – group together the different inflected forms of a word so they can be analysed as a single item:

      print(gensim.utils.lemmatize("Changing the way scientists, engineers, and analysts perceive big data"))
      ['change/VB', 'way/NN', 'scientist/NN', 'engineer/NN', 'analyst/NN', 'perceive/VB', 'big/JJ', 'datum/NN']

      Collocations, Named Entity Recognition – detect sequences of words that co-occur more often than would be expected by chance, so that an entity such as "General Electric" stays a single token:

      import nltk
      from nltk.collocations import TrigramCollocationFinder
      from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures

      Stopwords – generic words, such as "six", "then", "be", "do", ...:

      from gensim.parsing.preprocessing import STOPWORDS
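The slide imports the nltk collocation tools but does not show them in use. A minimal sketch of the idea, assuming a plain list of tokens as input (the sample tokens are illustrative, not from the talk):

    # Find word pairs that co-occur unusually often, so they can later be
    # merged into single tokens (e.g. "general electric").
    from nltk.collocations import BigramCollocationFinder
    from nltk.metrics import BigramAssocMeasures

    tokens = ["general", "electric", "filed", "a", "patent",
              "assigned", "to", "general", "electric"]
    finder = BigramCollocationFinder.from_words(tokens)
    finder.apply_freq_filter(2)  # keep only pairs seen at least twice
    print(finder.nbest(BigramAssocMeasures.pmi, 3))  # -> [('general', 'electric')]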
  18. Data streaming
      Why? The data is too large to fit into RAM. itertools is your friend.

      class PatCorpus(object):
          def __init__(self, fname):
              self.fname = fname

          def __iter__(self):
              # one patent per line of a tab-separated file
              for line in open(self.fname):
                  patent = line.lower().split('\t')
                  tokens = gensim.utils.tokenize(patent[5], lower=True)
                  title = patent[6]
                  yield title, list(tokens)

      corpus_tokenized = PatCorpus('in.tsv')
      print(list(itertools.islice(corpus_tokenized, 2)))

      [('easy wagon/easy cart/bicycle wheel mounting brackets system', [u'a', u'specific', u'wheel', u'mounting', u'bracket', u'and', u'a', u'versatile', u'method', u'of', u'using', u'these', u'brackets', u'or', u'similar', u'items', u'to', u'attach', u'bicycle', u'wheels', u'to', u'various', u'vehicle', u'frames', u'primarily', u'made', u'of', u'wood', u'and', u'a', u'general', u'vehicle', u'structure', u'or', u'frame', u'design', u'using', u'the', u'brackets', u'the', u'brackets', u'are', u'flat', …
  19. Vectorization
      ● First we create a dictionary, i.e. index text tokens by integers (PatCorpus yields (title, tokens) pairs, so we pass just the token lists):

      id2word = gensim.corpora.Dictionary(tokens for title, tokens in corpus_tokenized)

      ● Create bag-of-words vectors using a streamed corpus and a dictionary:

      from gensim.utils import simple_preprocess

      def tokenize(text):
          return [t for t in simple_preprocess(text) if t not in STOPWORDS]

      text = "A community for developers and users of Python data tools."
      bow = id2word.doc2bow(tokenize(text))
      print(bow)
      [(12832, 1), (28124, 1), (28301, 1), (32835, 1)]
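The LDA training on the next slide consumes a corpus of bag-of-words vectors. A sketch of how such a streamed corpus could be built and serialized (the BowCorpus wrapper and file names are illustrative, not from the talk):

    import gensim

    class BowCorpus(object):
        # stream the tokenized patents as bag-of-words vectors
        def __init__(self, fname, id2word):
            self.fname = fname
            self.id2word = id2word

        def __iter__(self):
            for title, tokens in PatCorpus(self.fname):
                yield self.id2word.doc2bow(tokens)

    # serialize once, so training can make several passes without re-tokenizing
    gensim.corpora.MmCorpus.serialize('pat_bow.mm', BowCorpus('in.tsv', id2word))
    corpus = gensim.corpora.MmCorpus('pat_bow.mm')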
  20. Semantic transformations
      ● A transformation takes a corpus and outputs another corpus
      ● Choice: Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), Random Projections (RP), etc.

      model = gensim.models.LdaModel(corpus, num_topics=100, id2word=id2word, passes=4, alpha=None)
      _ = model.print_topics(-1)

      INFO:gensim.models.ldamodel:topic #0 (0.010): 0.116*memory + 0.090*cell + 0.063*plurality + 0.054*array + 0.052*each + 0.044*bit + 0.039*cells + 0.032*address + 0.022*logic + 0.017*row
      INFO:gensim.models.ldamodel:topic #1 (0.010): 0.101*speed + 0.092*lines + 0.060*performance + 0.045*characteristic + 0.036*skin + 0.028*characteristics + 0.025*suspension + 0.024*enclosure + 0.023*transducer + 0.022*loss
      INFO:gensim.models.ldamodel:topic #2 (0.010): 0.141*portion + 0.049*housing + 0.031*portions + 0.028*end + 0.024*edge + 0.020*mounting + 0.018*has + 0.017*each + 0.016*formed + 0.016*arm
      INFO:gensim.models.ldamodel:topic #3 (0.010): 0.224*signal + 0.099*output + 0.075*input + 0.057*signals + 0.043*frequency + 0.034*phase + 0.024*clock + 0.020*circuit + 0.016*amplifier + 0.014*reference
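As a sketch of one alternative the slide names, the same corpus run through TF-IDF weighting followed by LSI (the parameter values are illustrative, not from the talk):

    # TF-IDF weighting followed by Latent Semantic Indexing
    tfidf = gensim.models.TfidfModel(corpus, id2word=id2word)
    lsi = gensim.models.LsiModel(tfidf[corpus], id2word=id2word, num_topics=200)
    _ = lsi.print_topics(5)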
  21. Transforming unseen documents
      text = "A method of configuring the link maximum transmission unit (MTU) in a user equipment."

      1) transform the text into the bag-of-words space
      bow_vector = id2word.doc2bow(tokenize(text))
      print([(id2word[id], count) for id, count in bow_vector])
      [(u'method', 1), (u'configuring', 1), (u'link', 1), (u'maximum', 1), (u'transmission', 1), (u'unit', 1), (u'user', 1), (u'equipment', 1)]

      2) transform the text into our LDA space
      vector = model[bow_vector]
      [(0, 0.024384265946835323), (1, 0.78941547921042373), ...

      3) find the document's most significant LDA topic
      model.print_topic(max(vector, key=lambda item: item[1])[0])
      0.022*network + 0.021*performance + 0.018*protocol + 0.015*data + 0.009*system + 0.008*internet + ...
  22. Evaluation
      ● Topic modelling is an unsupervised task → evaluation is tricky
      ● Need to evaluate the improvement on the intended task
      ● Our goal is to retrieve semantically similar documents, so we tag a set of similar documents and compare them with the results of a given semantic model
      ● "Word intrusion" method: for each trained topic, take its first ten words, substitute one of them with a randomly chosen word (the intruder!) and let a human detect the intruder
      ● Method without human intervention: split each document into two parts, and check that the topics of the first half are similar to the topics of the second; halves of different documents should be dissimilar (see the sketch below)
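A minimal sketch of that split-document check, reusing model, id2word and corpus_tokenized from the earlier slides (the sample size and helpers are illustrative):

    import itertools
    import random
    from gensim.matutils import cossim

    def lda_vec(tokens):
        # sparse LDA topic vector: a list of (topic_id, weight) pairs
        return model[id2word.doc2bow(tokens)]

    docs = list(itertools.islice(corpus_tokenized, 100))
    intra, inter = [], []
    for title, tokens in docs:
        first, second = tokens[:len(tokens) // 2], tokens[len(tokens) // 2:]
        intra.append(cossim(lda_vec(first), lda_vec(second)))
        other = random.choice(docs)[1]
        inter.append(cossim(lda_vec(first), lda_vec(other[len(other) // 2:])))

    # a sane model scores same-document halves higher than cross-document halves
    print(sum(intra) / len(intra), sum(inter) / len(inter))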
  23. The topic space
      ● a topic is a distribution over a fixed vocabulary of terms
      ● the idea behind Latent Dirichlet Allocation is to statistically model documents as containing multiple hidden semantic topics
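To see one such distribution directly in gensim (note: depending on the gensim version, show_topic returns (term, probability) or (probability, term) pairs):

    # one trained topic, as a probability distribution over vocabulary terms
    print(model.show_topic(0, topn=10))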
  24. Exploring topic space
      (word counts for four example topics)
      memory 188, cell 146, plurality 102, array 86, bit 71, address 51
      speed 178, line 163, performance 107, characteristic 79, skin 63, suspension 45
      signal 324, output 142, input 108, frequency 62, phase 49, clock 35
      portion 310, housing 109, end 62, edge 53, mounting 43, form 35
  25. Topics distribution
      many topics in total, but each document contains just a few of them → sparse model
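That sparsity is easy to check on a single document, reusing model, id2word and tokenize from the earlier slides (the sample text is illustrative):

    # an LDA document vector keeps only topics above a minimum probability,
    # so most of the 100 trained topics are absent from any single document
    doc_topics = model[id2word.doc2bow(tokenize("memory cell array bit address"))]
    print(len(doc_topics), "of", model.num_topics, "topics active")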
  26. Semantic distance in topic space
      ● Semantic distance queries

      from scipy.spatial import distance
      pairwise = distance.squareform(distance.pdist(matrix))
      >> MemoryError

      ● Document indexing

      from gensim.similarities import Similarity
      index = Similarity('tmp/index', corpus, num_features=corpus.num_terms)

      The Similarity class splits the index into several smaller sub-indexes → scales well
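For the LDA-space queries on the next slide, the index presumably has to be built over the LDA-transformed corpus rather than over raw bag-of-words vectors; a sketch of that assumption:

    from gensim.similarities import Similarity

    # index the LDA-transformed corpus; the feature count is then the number
    # of topics rather than the vocabulary size
    index = Similarity('tmp/lda_index', model[corpus], num_features=model.num_topics)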
  27. Semantic distance queries
      query = "A method of configuring the link maximum transmission unit (MTU) in a user equipment."

      1) vectorize the text into bag-of-words space
      bow_vector = id2word.doc2bow(tokenize(query))

      2) transform the text into our LDA space
      query_lda = model[bow_vector]

      3) query the LDA index, get the top 3 most similar documents
      index.num_best = 3
      print(index[query_lda])
      [(2026, 0.91495784099521484), (32384, 0.8226358470916238), (11525, 0.80638835174553156)]
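The hits come back as (document position, cosine similarity) pairs. To show patents rather than ids, one could keep a parallel list of titles collected in the same order the corpus was indexed (this titles list is hypothetical, not from the talk):

    # hypothetical: titles in the same order as the indexed documents
    titles = [title for title, tokens in PatCorpus('in.tsv')]
    for doc_id, score in index[query_lda]:
        print(round(score, 3), titles[doc_id])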
  28. Future
      ● Graph of USPTO data (Neo4j)
      ● Elasticsearch search and analytics
      ● Recommendation engine (for applications)
      ● Drawings analysis
      ● Blockchain based smart contracts
      ● Artificial patent lawyer
