1. The document describes matching companies and individuals from the Crunchbase database to patent data from the PATSTAT database.
2. It outlines the challenges of missing address data and entity disambiguation, and the solutions adopted: standardizing names, imputing country codes, and comparing inventors to company staff.
3. The final results identify almost 50,000 Crunchbase companies that own around 12 million patents in total; precision and recall against the benchmark improve after filtering on applicant and inventor matches between the two databases.
2. In short:
- We present here a match between PATSTAT
and Crunchbase entities (companies and staff);
- Name matching also benefits from the use of
other information (staff vs. inventors);
- The resulting database can be used in a
number of different domains (e.g. analysis of the role
of IP assets in securing venture capital; the
characterization of the IP portfolio of high-growth
patenting start-ups, of start-ups developing radical or
breakthrough innovations, and of inclusive start-ups).
3. First data source: PATSTAT
PATSTAT is the short name for the EPO
Worldwide PATent STATistical Database:
a single database covering 100 million
patents from 90 patent authorities,
developed by the European Patent Office
(EPO) in cooperation with WIPO, OECD
and Eurostat.
4. Second source: Crunchbase
Crunchbase is presented on its website as “the
premier destination for discovering industry
trends, investments, and news about
hundreds of thousands of companies
globally.”
In the version used for this note (January
2017), the database contains information on
more than 490,000 distinct entities
(companies and VC investors) located in 199
different countries.
5. Crunchbase tables
TABLE_NAME                 TABLE_ROWS  DESCRIPTION
acquisitions                    34667  Detail for each acquisition in the dataset
awards                            676
category_groups                   736  Mapping between categories and category groups
competitors                    519237  List of competitors for each organization
customers                      300303  List of customers for each organization
event_relationships            121271  Detail for each event participant in the dataset
events                          32277  Detail for each event in the dataset
funding_rounds                 152470  Detail for each funding round in the dataset
funds                            5519
investment_partners             44258  Partners who are responsible for their firm's investments
investments                    235868  Mapping between investors and investments
investors                       49935  Active investors including organizations and individuals
ipos                            11807  Detail for each IPO in the dataset
jobs                           991323  List of all job and advisory roles
org_parents                      6847  Parent-child mapping for each organization
organization_descriptions      306891  Long descriptions for organizations
organizations                  492960
people                         578694  All people in Crunchbase
people_descriptions            306481
school                          10893
9. Population selected for the match
PATSTAT: IP5 offices (EP, US, CN, KR, JP), priority
year >= 2000
Crunchbase: all entities, excluding VC investors
10. PATSTAT match issues
(A) Lack of comprehensive information about
applicants (only address information is available, and it is not
standardized and often partial or missing).
(B) Lack of entity disambiguation: the same entity
may have several separate database entries (different
spellings of a single organization or name changes
over time).
(C) The distribution of the number of patents per
assignee is skewed: a small number of applicants
hold thousands of patents, while the large majority hold fewer than
five patents.
11. Dealing with issues (A) address missing (I)
30% of PATSTAT and 25% of CB entries had no
valid country code.
For PATSTAT:
- Find a homonym in the same patent family.
- If more than one country code is found, the country of the
applicant with the higher number of patents (over the full
PATSTAT database) is assigned.
- If no homonym is found and the applicant belongs to a patent
family of only one patent (singleton), the country of the patent
office is assigned (this helps in particular with
SIPO-only and JPO-only applicants).
The algorithm leaves fewer than 1% of cases unresolved.
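The PATSTAT rules above can be sketched in a few lines of Python. This is only an illustration under assumed data structures: the record fields (`name`, `family`, `country`, `n_patents`) and the `families` lookup are hypothetical and do not correspond to actual PATSTAT column names.

```python
def impute_country(records, families):
    """Fill missing applicant country codes following the slide's rules.

    records:  list of dicts with hypothetical keys
              name, family, country (None if missing), n_patents
    families: dict family id -> {"size": int, "office": country of the patent office}
    """
    resolved = []
    for rec in records:
        country = rec["country"]
        if country is None:
            # Rule 1: look for a homonym with a known country in the same family.
            homonyms = [
                other for other in records
                if other is not rec
                and other["family"] == rec["family"]
                and other["name"] == rec["name"]
                and other["country"] is not None
            ]
            if homonyms:
                # Rule 2: several candidates -> keep the one with most patents.
                country = max(homonyms, key=lambda r: r["n_patents"])["country"]
            elif families[rec["family"]]["size"] == 1:
                # Rule 3: singleton family -> country of the patent office.
                country = families[rec["family"]]["office"]
        resolved.append({**rec, "country": country})
    return resolved


records = [
    {"name": "ACME", "family": 1, "country": None, "n_patents": 3},
    {"name": "ACME", "family": 1, "country": "US", "n_patents": 120},
    {"name": "ACME", "family": 1, "country": "DE", "n_patents": 4},
    {"name": "SOLO", "family": 2, "country": None, "n_patents": 1},
    {"name": "LOST", "family": 3, "country": None, "n_patents": 1},
]
families = {
    1: {"size": 5, "office": "EP"},
    2: {"size": 1, "office": "JP"},
    3: {"size": 4, "office": "US"},
}
imputed = [r["country"] for r in impute_country(records, families)]
```

The last record stays unresolved because it has no homonym and its family is not a singleton, which corresponds to the small residual share mentioned on the slide. The nested scan is quadratic and is meant for readability only.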
12. Dealing with issues (A) address missing (II)
For Crunchbase:
- the modal country code of the people
reported to work for the company.
- telephone country code, whenever
available and unambiguous.
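A minimal sketch of the two Crunchbase rules follows. The slide does not state which rule takes priority or how ties are handled, so the order used here (staff mode first, telephone code as fallback) and the tie handling are assumptions; `phone_country` is a hypothetical input that is expected to be `None` when the telephone prefix is missing or ambiguous.

```python
from collections import Counter


def crunchbase_country(staff_countries, phone_country=None):
    """Guess a company's country code from its staff, then from its phone prefix."""
    counts = Counter(code for code in staff_countries if code)
    if counts:
        ranked = counts.most_common(2)
        # Accept the mode only when it is unique (assumption: ties are ambiguous).
        if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
            return ranked[0][0]
    return phone_country
```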
13. Dealing with issues (B) entities disambiguation (I)
(a) Standardized names from the EEE-PPAT
database, now included in PATSTAT itself;
(b) Non-ASCII characters latinized;
(c) Further processing to remove the
remaining noise and most of the legal
designations.
Steps (b) and (c) are also applied to CB to
ensure compatibility in the match phase.
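Steps (b) and (c) can be illustrated with the Python standard library. This is a sketch, not the cleaning routine actually used: the list of legal designations is a short hypothetical sample, and characters without a Unicode decomposition (for example "ß" or "ø") are simply dropped here rather than transliterated.

```python
import re
import unicodedata

# Hypothetical, deliberately short list of legal designations.
LEGAL_DESIGNATIONS = {
    "INC", "LLC", "LTD", "GMBH", "KG", "SA", "SPA", "SRL", "CORP", "CO", "AG", "BV",
}


def latinize(name):
    """Step (b): strip diacritics and drop any remaining non-ASCII character."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.encode("ascii", "ignore").decode("ascii")


def clean_name(name):
    """Step (c): upper-case, remove punctuation noise and legal designations."""
    text = latinize(name).upper().replace(".", "")  # "S.A." -> "SA"
    text = re.sub(r"[^A-Z0-9 ]+", " ", text)        # other punctuation -> space
    tokens = [tok for tok in text.split() if tok not in LEGAL_DESIGNATIONS]
    return " ".join(tokens)
```

Applying the same function to both PATSTAT and Crunchbase names is what keeps the two sides comparable in the match phase.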
14. Match
In the name match phase, four criteria are combined, listed below
from the strictest to the loosest:
1. Perfect match: the names, after removing legal designations, are
exactly the same.
2. Alphanumeric match: the names, keeping only [A-Z]
and [0-9], are the same (e.g. I.B.M. = IBM = I B M).
3. Jaro-Winkler distance: names are broken into tokens and the
similarity score is computed from the number of tokens in common,
weighted by the inverse of their frequency.
4. Levenshtein distance (edit distance).
(Criterion 4 was dropped because it added a high number of false
positives.)
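The first three criteria can be sketched as a cascade. Criterion 3 is implemented here exactly as the slide describes it, as a token-overlap score weighted by inverse token frequency; this is not the character-based Jaro-Winkler formula found in string-matching libraries. The weighting formula and the `threshold` value are assumptions, and input names are assumed to be already cleaned and upper-cased.

```python
import math
import re
from collections import Counter


def alnum_key(name):
    """Criterion 2: keep only [A-Z] and [0-9]."""
    return re.sub(r"[^A-Z0-9]", "", name.upper())


def token_weights(names):
    """Inverse-frequency weight per token, so that rare tokens count more."""
    doc_freq = Counter(tok for name in names for tok in set(name.split()))
    total = len(names)
    return {tok: math.log(total / freq) + 1.0 for tok, freq in doc_freq.items()}


def weighted_sum(tokens, weights):
    return sum(weights.get(tok, 1.0) for tok in tokens)


def token_score(a, b, weights):
    """Criterion 3 as described: weighted share of tokens in common."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    if not union:
        return 0.0
    return weighted_sum(ta & tb, weights) / weighted_sum(union, weights)


def match_level(a, b, weights, threshold=0.7):
    """Return the strictest criterion satisfied by the pair, or None."""
    if a == b:
        return "perfect"
    if alnum_key(a) == alnum_key(b):
        return "alphanumeric"
    if token_score(a, b, weights) >= threshold:
        return "token"
    return None


names = [
    "ACME ROBOTICS",
    "ACME ROBOTICS SYSTEMS",
    "GLOBAL SYSTEMS",
    "GLOBAL TRADING",
    "GLOBAL SYSTEMS HOLDING",
]
weights = token_weights(names)
```

With these weights, a frequent token such as GLOBAL contributes less than a rare one such as TRADING, so two names that share only a common word do not reach the threshold.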
15. Benchmark against BVD Orbis
The comparison is based on a small
overlapping sample of 7,569 companies
that match exactly and unambiguously
on company name and country code.
The benchmark is also used for fine-tuning
the threshold of the Jaro-Winkler match.
First result: 89% precision, 87% recall.
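Precision and recall against a benchmark, and the use of the benchmark to choose a score threshold, can be sketched as follows. The pair identifiers are invented for illustration, and selecting the threshold by maximum F1 is an assumption: the slides do not say which criterion was used for the tuning.

```python
def precision_recall(predicted, benchmark):
    """Precision and recall of predicted pairs against benchmark pairs."""
    predicted, benchmark = set(predicted), set(benchmark)
    true_pos = len(predicted & benchmark)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(benchmark) if benchmark else 0.0
    return precision, recall


def best_threshold(scored_pairs, benchmark, candidates):
    """Pick the candidate threshold with the highest F1 (assumed criterion)."""
    def f1(threshold):
        kept = {pair for pair, score in scored_pairs.items() if score >= threshold}
        p, r = precision_recall(kept, benchmark)
        return 2 * p * r / (p + r) if p + r else 0.0
    return max(candidates, key=f1)


# Hypothetical (Crunchbase id, PATSTAT id) pairs with similarity scores.
scored_pairs = {
    ("cb1", "ps1"): 0.95,
    ("cb2", "ps2"): 0.85,
    ("cb3", "ps9"): 0.75,
    ("cb4", "ps4"): 0.65,
}
benchmark = {("cb1", "ps1"), ("cb2", "ps2"), ("cb5", "ps5")}
chosen = best_threshold(scored_pairs, benchmark, [0.9, 0.8, 0.7, 0.6])
```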
16. Filtering and fine-tuning (I)
The benchmark is also used for fine-tuning
the threshold of the Jaro-Winkler match;
The match is further improved by adding information
on inventors matched to CB company
staff.
17. Filtering and fine-tuning (II)
Inventor-staff match steps:
1) Name clustering based on string
matching [bigrams in common]: 300 million
pairs, corresponding to 9.8 million PATSTAT person IDs.
2) Entity disambiguation of
PATSTAT inventors (criteria: at least one
applicant in common; at least one common IPC4 tag;
having one applicant with fewer than 50 inventors; at
least one co-inventor in common; and being at
most three degrees of distance apart in patenting).
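Step 1 relies on bigrams in common. The slides do not say how the overlap is turned into a score, so the sketch below uses the Dice coefficient on character bigrams as one plausible choice; the removal of spaces and the `threshold` value are likewise assumptions.

```python
from itertools import combinations


def bigrams(name):
    """Set of character bigrams, ignoring case and whitespace."""
    text = "".join(name.upper().split())
    return {text[i:i + 2] for i in range(len(text) - 1)}


def dice(a, b):
    """Dice coefficient: twice the shared bigrams over the total bigrams."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))


def candidate_pairs(names, threshold=0.75):
    """All name pairs whose bigram similarity reaches the threshold."""
    return [(a, b) for a, b in combinations(names, 2) if dice(a, b) >= threshold]


pairs = candidate_pairs(["JOHN SMITH", "JON SMITH", "MARY JONES"])
```

Comparing every pair is only feasible on small samples. At the scale quoted on the slide, some form of blocking or indexing on bigrams would be needed before scoring, which is outside this sketch.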
18. Filtering and fine-tuning (III)
Disambiguation of PATSTAT inventors
produces 14.9 million possible matches
between the Crunchbase sample and
the sample of disambiguated PATSTAT
inventors.
3) Matches are filtered based on: at least
one applicant in common; at least one
common IPC4 tag; and having one
applicant with fewer than 50 inventors.
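The filter in step 3 can be expressed as a simple predicate. The slide does not state whether the three conditions must all hold or whether any one suffices; this sketch assumes all three are required. The dictionaries and their keys (`applicants`, `ipc4`) are hypothetical, as is the lookup of inventor counts per applicant.

```python
def passes_filters(inventor, staff, inventors_per_applicant, max_inventors=50):
    """Keep a candidate inventor-staff match only if all three conditions hold.

    inventor, staff: dicts with hypothetical keys
        applicants (set of applicant ids) and ipc4 (set of IPC4 tags);
        for staff, these describe the companies the person works for.
    inventors_per_applicant: dict applicant id -> number of distinct inventors.
    """
    common_applicant = bool(inventor["applicants"] & staff["applicants"])
    common_ipc4 = bool(inventor["ipc4"] & staff["ipc4"])
    # Applicants missing from the lookup are treated as large (never "small").
    small_applicant = any(
        inventors_per_applicant.get(app, float("inf")) < max_inventors
        for app in inventor["applicants"]
    )
    return common_applicant and common_ipc4 and small_applicant


inventor = {"applicants": {"A1", "A2"}, "ipc4": {"H04L", "G06F"}}
staff = {"applicants": {"A1"}, "ipc4": {"G06F"}}
```

The size condition reflects the intuition that a homonym is less likely inside a small applicant than inside one with thousands of inventors.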
19. Filtering and fine-tuning (IV)
Matched inventor-staff pairs help to resolve
double matches and to fine-tune the name match
criteria.
Final results vs. benchmark: 93%
precision, 92% recall
20. Final statistics
Almost 50 thousand companies, out of the 447 thousand listed in
CrunchBase (excluding venture capital companies), are found to
own one or more patents, for a total of around 12 million patents.
Around 220 thousand of those have been applied for by
companies created after 2005. The share of patentees for US
companies is 15%, but the share doubles for companies
reporting at least one funding round.
Regarding individuals, out of the 578 thousand professionals
listed in CrunchBase who could be potential patent inventors,
around 25 thousand are found to have a correspondent in
PATSTAT. These inventors account for 2.2 million patent
applications.