The director of the Fung Institute Patent Lab provides an overview of their work including disambiguating patent data, developing visualization tools, and future plans. Some of their key work includes disambiguating inventor, assignee, location and law firm data from US patent grants and applications dating back to 1975. This data is publicly available online. They are working to improve disambiguation algorithms and bring in additional data. The lab also develops tools to visualize patent data including maps of clean tech inventions and movies showing inventor mobility. Future plans include linking to other data sets, developing network analysis tools, and using blocking actions as a measure of patent impact. Support is needed to continue maintaining and expanding the publicly available database.
The Fung Institute Patent Lab: Products and Future Plans
1. The Fung Institute Patent Lab:
Products and Future Plans
Lee Fleming, Director of the Coleman Fung Institute
for Engineering Leadership
May 2015
With Gabe Fierro, Ben Balsmeier, Guan-Cheng Li, Kevin
Johnson, Aditya Kaulagi, Douglas O'Reagan, Bill Yeh
We gratefully acknowledge support from the National
Science Foundation Grant #1064182, the US Patent and
Trademark Office, and the American Institutes for Research
2. My objectives for today’s chat
• Give you an understanding of our work
– Disambiguation (upcoming JEMS paper)
– Visualization and tools
– Future plans (PAIR)
• Get your feedback on our research
• Help me understand bigger picture of data
efforts in innovation and entrepreneurship
– I want to get our stuff used
– and at the same time, aid replication and help our
field to stop re-inventing inferior wheels
3. Continuing opportunity w/ patent data
• Despite many papers, basic data remain
inaccessible
– Unstructured and dirty text difficult to aggregate across entities
– (Semi) manual and uncoordinated efforts to date for granted patents
• We provide parsing, dbase, auto disambig of grants + apps:
• inventors
• assignees
• patent lawyers’ firms
• location
• Everything made public and supportive of complementary
efforts (mainly AIR and USPTO)
7. Will the real Matt Marx please stand up?
Plainview NY Everett MA Mt View CA
Class 704
8. Disambiguation: a classifier problem
• Popular methods: we currently use last three
– Manual
– Linear weighting + manual tuning
– Naïve Bayes, supervised and semi-supervised
– String matching
– K-means intra and inter cluster optimization
– Look up (Google provided access to library)
• Active research topic in machine learning
• Julia Lane is planning a contest
• Had more complex approach (Li et al. 2014)
– latest is simpler, faster, supportable, improvable
• though not as accurate yet – tends to oversplit
9. Inventor disambiguation
• Start with (block on) exact name matches
• Euclidean distance for exact attribute matches
• Balance min intra cluster and max inter cluster distances
10. • Look for no further
improvement
– 4 in this case
11. • Re-label each column with a cluster
• Relax exact name match and merge
• Use correlation of co-authors as well
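A minimal sketch of the steps on slides 9-11, not the lab's actual code. It blocks on exact name, turns exact attribute matches into 0/1 columns, measures Euclidean distance, and raises the number of k-means clusters until a score stops improving. The field names, the records (hypothetical inventors), the seeding rule, and the score (smallest gap between cluster centers minus mean distance to own center) are all assumptions made for illustration. The later steps (relaxing the name match, co-author correlation) are not shown.

```python
from collections import defaultdict
from itertools import combinations
import math

FIELDS = ("city", "assignee", "cls")  # assumed attribute set, not from the slides

def block_by_name(records):
    """Group records that share an exact inventor name string."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["name"]].append(rec)
    return blocks

def one_hot(block):
    """One 0/1 column per (field, value): columns agree only on exact matches."""
    vocab = sorted({(f, r[f]) for r in block for f in FIELDS})
    return [[1.0 if r[f] == v else 0.0 for f, v in vocab] for r in block]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(p, centers):
    return min(range(len(centers)), key=lambda j: dist(p, centers[j]))

def kmeans(points, k, iters=20):
    centers = [points[0]]
    while len(centers) < k:  # deterministic farthest-point seeding (assumption)
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    for _ in range(iters):
        labels = [nearest(p, centers) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else centers[j])
        if updated == centers:
            break
        centers = updated
    return [nearest(p, centers) for p in points], centers

def score(points, labels, centers):
    """Assumed balance: smallest inter-center gap minus mean intra-cluster distance."""
    intra = sum(dist(p, centers[l]) for p, l in zip(points, labels)) / len(points)
    gaps = [dist(a, b) for a, b in combinations(centers, 2)]
    return (min(gaps) if gaps else 0.0) - intra

def cluster_block(points):
    """Add clusters until the score shows no further improvement."""
    best_labels, best_score = None, float("-inf")
    for k in range(1, len(points) + 1):
        labels, centers = kmeans(points, k)
        s = score(points, labels, centers)
        if s <= best_score:
            break
        best_labels, best_score = labels, s
    return best_labels

def disambiguate(records):
    ids = {}
    for name, block in block_by_name(records).items():
        for rec, label in zip(block, cluster_block(one_hot(block))):
            ids[rec["patent"]] = f"{name}#{label}"
    return ids

# Hypothetical records, invented for this example.
records = [
    {"patent": "P1", "name": "JANE DOE", "city": "Boston MA", "assignee": "Acme", "cls": "704"},
    {"patent": "P2", "name": "JANE DOE", "city": "Boston MA", "assignee": "Acme", "cls": "704"},
    {"patent": "P3", "name": "JANE DOE", "city": "Boston MA", "assignee": "Acme", "cls": "381"},
    {"patent": "P4", "name": "JANE DOE", "city": "Austin TX", "assignee": "Globex", "cls": "166"},
    {"patent": "P5", "name": "JANE DOE", "city": "Austin TX", "assignee": "Globex", "cls": "166"},
    {"patent": "P6", "name": "JOHN ROE", "city": "Boston MA", "assignee": "Acme", "cls": "704"},
]
ids = disambiguate(records)
print(ids)
```

On this toy data the two-cluster solution scores best for JANE DOE, so P1-P3 share one inventor ID and P4-P5 share another. A different score or seeding rule could split P3 off on its own, which is the kind of oversplitting noted on slide 8.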
12. Future of inventor disambiguation
• Relax strict matching
• Bring in additional data
– All tech fields
– Lexical overlap
– Law firms
– Prior art citations and non patent references
• New algorithms
• Make everything public and support AIR
tournament
13. Assignee disambiguation
• Jaro-Winkler after simple string cleaning
• Unique assignees from 6,700,000 to 507,000
• Identifier, raw and cleaned name available
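To make "Jaro-Winkler after simple string cleaning" concrete, here is a small self-contained sketch. The cleaning rules, the 0.9 threshold, the greedy merge order, and the company names are assumptions for illustration; the slides do not state the lab's actual settings.

```python
import re

def clean(name):
    """Simple string cleaning (assumed rules): uppercase, drop punctuation, squeeze spaces."""
    name = re.sub(r"[^A-Z0-9 ]", " ", name.upper())
    return " ".join(name.split())

def jaro(s, t):
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        # look for the first unused equal character inside the match window
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_matched = [c for c, hit in zip(s, s_hit) if hit]
    t_matched = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_matched, t_matched)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s, t, p=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix of up to max_prefix characters."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:max_prefix], t[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def merge_assignees(raw_names, threshold=0.9):
    """Greedy merge: map each raw name to the first canonical name that is close enough."""
    canonical, mapping = [], {}
    for raw in raw_names:
        name = clean(raw)
        match = next((c for c in canonical if jaro_winkler(name, c) >= threshold), None)
        if match is None:
            canonical.append(name)
            match = name
        mapping[raw] = match
    return mapping

# Hypothetical assignee strings, invented for this example.
raw = ["Acme Widgets, Inc.", "ACME WIDGETS INC", "Acme Widgest Inc.", "Globex Corp"]
mapping = merge_assignees(raw)
print(mapping)
```

The three Acme spellings collapse to one cleaned name while Globex stays separate. The threshold controls the trade-off: higher values merge fewer variants, lower values risk merging distinct firms.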
14. Future of assignee disambiguation
• Coordinate with NBER and HBS efforts
– The field needs to curate and maintain cumulative progress
• CONAME data from USPTO
• Normalize common affixes
• Train with manually developed NBER disambiguation
• Apply inventor algorithm
• Provide Compustat identifier
• Add subsidiary information
– BvD sample of 6,000 major U.S. firms revealed 50,000
subsidiaries under parental control (>50% in 2012)
– GE: 250 subsidiaries, ~98% of patents filed under GE
15. Law firms
• Similar algorithms to assignees
• Not aware of any applications yet
16. Locations
• Use Google’s geocoding API
• Unique cities from 333K to 66K
• City, region, country
– Lat and Long being developed
– Do not provide street level data
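A hedged sketch of how a raw location string might be sent to Google's Geocoding API and reduced to city, region, and country. The request URL and the `address_components` layout follow Google's public JSON format; the helper names, the placeholder key, and the hand-written sample response are assumptions, and no network call is made here.

```python
from urllib.parse import urlencode

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

# Map Google component types to the three levels listed on the slide.
WANTED = {
    "locality": "city",
    "administrative_area_level_1": "region",
    "country": "country",
}

def build_request(raw_location, api_key):
    """Build the request URL for one raw location string."""
    return GEOCODE_URL + "?" + urlencode({"address": raw_location, "key": api_key})

def parse_place(response):
    """Pull city, region, and country out of a decoded JSON response."""
    if response.get("status") != "OK" or not response.get("results"):
        return None
    place = {"city": None, "region": None, "country": None}
    for component in response["results"][0]["address_components"]:
        for kind in component["types"]:
            if kind in WANTED:
                place[WANTED[kind]] = component["long_name"]
    return place

# Hand-written sample in the shape of a Geocoding API response (illustrative only).
sample_response = {
    "status": "OK",
    "results": [{
        "address_components": [
            {"long_name": "Mountain View", "short_name": "Mountain View",
             "types": ["locality", "political"]},
            {"long_name": "California", "short_name": "CA",
             "types": ["administrative_area_level_1", "political"]},
            {"long_name": "United States", "short_name": "US",
             "types": ["country", "political"]},
        ]
    }],
}

url = build_request("Mt View CA", "YOUR_KEY")
place = parse_place(sample_response)
print(url)
print(place)
```

Because many raw spellings resolve to the same (city, region, country) triple, keying on the parsed result is one way the count of unique cities could shrink after geocoding.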
19. Tools and applications
• Look for this stuff and high level explanations at:
– http://www.funginstitute.berkeley.edu/blog-categories/faculty-directors-blog#
20. Visualizations
• Clean tech inventions mapped by type and source
• Inventor mobility movies
• Patent location in technology “space”
• The convergence and divergence, the coalescence and
reconfiguration of components – the flow of technology –
over time
• Visualizing the patent application process
21. Clean Tech Patent Mapper
• Li, G. and K. Paisner, “A List of Clean Tech Patents.”
• http://funglab.berkeley.edu/cleantechx/
• Energy: wind, solar, bio, hydro, geo, nuclear
• Assignee: VC backed, university, government, large and small incumbents, no assignee
22. VC patents 1990-1999
Innovation and Entrepreneurship
in Clean Energy: Nanda, Younge, Fleming
Note scale of funding activity 1990-1999
23. VC patents 2000-2009
See Nanda, R., K. Younge, and L. Fleming.
“Innovation and Entrepreneurship in Clean Energy,”
forthcoming in Rethinking Science and Innovation Policy, NBER.
Much greater funding activity 2000-2009
35. Cool pics – but what do they mean?
– Need to validate visualizations with ground truth
– Mixed visualization and historical study of
biggest semiconductor breakthrough of last
decade – the FinFET
36. Why FinFET?
• Study intended to explore/develop
breakthrough visualization tools
– tie to reality w/o conflating variables
• All patents Northern CA 1995-2000
• Ranked by future citations
• Tech distance
– from our brains, close but moldy
• Geographic distance
– about 40 yards
• Social distance
– head of search committee that hired me
– neighbor
40. The flow of technology
1) Words are components -> little differentiation, this is so incremental
2) No geographic localization of trajectories
3) How did university plop in and do this?
4) FinFET may have been only govt supported patent
41. Coming attractions
• Blocking actions – better than citations as
a measure of patent impact?
• Lexical novelty
– First appearance of new word in corpus
– First pair-wise combination of words
• Lexical distance between classes
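The two lexical novelty measures above can be sketched as a single pass over patents in date order. This is an illustration under assumptions: the tokenizer is a bare regex, the three-document corpus is invented, and a pair that includes a brand-new word is counted as a new pair (a design choice the slides do not specify).

```python
import re
from itertools import combinations

def lexical_novelty(patents):
    """patents: iterable of (patent_id, year, text). Returns first-seen words and word pairs."""
    seen_words, seen_pairs = set(), set()
    report = {}
    for pid, year, text in sorted(patents, key=lambda p: p[1]):
        words = set(re.findall(r"[a-z]+", text.lower()))
        pairs = set(combinations(sorted(words), 2))  # unordered pairs, stored sorted
        report[pid] = {
            "new_words": sorted(words - seen_words),
            "new_pairs": sorted(pairs - seen_pairs),
        }
        seen_words |= words
        seen_pairs |= pairs
    return report

# Invented toy corpus, not real patent text.
corpus = [
    ("A", 1995, "gate oxide transistor"),
    ("B", 1997, "fin transistor gate"),
    ("C", 1999, "fin gate oxide"),
]
report = lexical_novelty(corpus)
print(report["B"])
print(report["C"])
```

Patent B introduces the word "fin"; patent C introduces no new word but is the first to combine "fin" with "oxide", which is the kind of recombination the pair-wise measure is meant to catch.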
43. Claim Rejections – 35 USC 103
[Figure: scanned office-action excerpt, “3. The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all obviousness ...”, passed through Detail Enhancement, Noise Reduction, and OCR]
45. First results from 2012
• 2011 now complete as well
• Need to characterize each type of action
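One possible first step toward characterizing each type of action is to tag OCR'd office-action text by the statute section named in its rejection headings, such as "Claim Rejections – 35 USC 103". The sketch below is an assumption about how that could be done, not the lab's pipeline; the regex and the sample text are illustrative, and real OCR output would need more tolerant matching.

```python
import re
from collections import Counter

# Matches headings like "Claim Rejections – 35 USC 103" or "Claim Rejection - 35 U.S.C. § 112".
REJECTION_HEADING = re.compile(
    r"Claim\s+Rejections?\s*[-–]\s*35\s*U\.?\s*S\.?\s*C\.?\s*(?:§\s*)?(\d{3})",
    re.IGNORECASE,
)

def rejection_sections(ocr_text):
    """Count rejection headings by the 35 U.S.C. section they cite."""
    return Counter(m.group(1) for m in REJECTION_HEADING.finditer(ocr_text))

# Illustrative text modeled on the heading shown on the slide.
sample = (
    "Claim Rejections – 35 USC 103\n"
    "3. The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all obviousness\n"
    "Claim Rejections - 35 U.S.C. § 112\n"
)
counts = rejection_sections(sample)
print(counts)
```

Only the headings are counted, so the quoted statute text in the body of the action does not inflate the tally.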
46. I may come to you tin cup in hand…
• Download, parse, clean, disambiguate, store
and serve up > 300M data (and weekly updates)
– Julia Lane taking over part of this
• Blocking data: must OCR ~400M documents
• Disambiguation takes weeks, PAIR years
– ~$150K hardware alone past year
– database person in Si Valley (~$140K + Cal tax)
• Mention maintenance in NSF proposal => ding
• Public good (~50,000 downloads)
• Talking with firms and private philanthropy