5. Synonyms: dimethyl sulfoxide, dimethylsulfoxide, Domoso, Infiltrina
Hierarchies: cancer, carcinoma, melanoma, lymphoma, glioblastoma…
Patterns: dates, citations, mail addresses…
Rule-based extraction of all different kinds of complex information
Persons, Locations, Genes, ….
Co-occurrences, Typed Relations, e.g. Genes / Diseases / Modification Type
TEXT MINING
Term Detection
Regular Expressions
Rule Engine
Named Entities
Relations
Syntax Detection: Sentences, Tokens, POS-Tags, Chunks, Paragraphs, Sections, Stemming, Decompounding…
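The term detection and pattern layers listed above can be illustrated with a minimal sketch. It assumes a tiny hand-made synonym dictionary (using the synonyms from this slide) and two simplified regular expressions; `SYNONYMS`, `PATTERNS`, and `detect_terms` are hypothetical names for illustration, not part of any product API.

```python
import re

# Hypothetical mini-terminology: preferred concept name -> synonyms (from the slide).
SYNONYMS = {
    "dimethyl sulfoxide": [
        "dimethyl sulfoxide", "dimethylsulfoxide", "Domoso", "Infiltrina",
    ],
}

# Simplified illustrative patterns; production date and e-mail patterns are more involved.
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def detect_terms(text):
    """Return (start, end, type, normalized value) annotations sorted by position."""
    hits = []
    # Dictionary lookup: every synonym maps back to its preferred concept name.
    for concept, variants in SYNONYMS.items():
        for variant in variants:
            pattern = r"\b" + re.escape(variant) + r"\b"
            for m in re.finditer(pattern, text, re.IGNORECASE):
                hits.append((m.start(), m.end(), "term", concept))
    # Pattern lookup: the matched string itself is the value.
    for name, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((m.start(), m.end(), name, m.group()))
    return sorted(hits)


annotations = detect_terms(
    "Domoso was dissolved on 2014-03-01; contact info@averbis.com."
)
```

Mapping each synonym to one preferred concept name is what makes concept-based search possible later: a query for one synonym can retrieve documents that use another.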
6. RULE ENGINE
1. NAME OF THE MEDICINAL PRODUCT
Desloratadine ratiopharm 5 mg film-coated tablets
Primary Field Name    Secondary Field Name          Field Value
MedicalProductName    coveredText                   Desloratadine ratiopharm 5 mg film-coated tablets
                      inventedPartName              DESLORATADINE
                      strengthPart                  5 mg
                      pharmaceuticalDoseFormPart    FILM-COATED TABLET
Text / Rule / Result
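A rule of this kind can be approximated with a single regular expression. The sketch below is only an illustration of the idea, not the rule engine's actual syntax: it assumes the product name follows the shape "name tokens, strength, dose form", that the first token is the invented name, and that the dose form can be normalized by upper-casing and dropping a trailing "s". `PRODUCT_NAME_RULE` and `parse_product_name` are hypothetical names.

```python
import re

# Hypothetical rule: <name tokens> <strength> <dose form>. The unit list is illustrative.
PRODUCT_NAME_RULE = re.compile(
    r"^(?P<name>.+?)\s+"
    r"(?P<strength>\d+(?:[.,]\d+)?\s*(?:mg|g|ml))\s+"
    r"(?P<form>.+)$",
    re.IGNORECASE,
)


def parse_product_name(covered_text):
    """Split a medicinal product name into the fields shown in the table above."""
    m = PRODUCT_NAME_RULE.match(covered_text.strip())
    if m is None:
        return None
    form = m.group("form").upper()
    if form.endswith("S"):  # naive singularization, an assumption of this sketch
        form = form[:-1]
    return {
        "coveredText": covered_text,
        # Assumption: the first token of the name is the invented name part.
        "inventedPartName": m.group("name").split()[0].upper(),
        "strengthPart": m.group("strength"),
        "pharmaceuticalDoseFormPart": form,
    }


result = parse_product_name("Desloratadine ratiopharm 5 mg film-coated tablets")
```

A real rule engine would typically combine such patterns with dictionary lookups (for example a controlled list of dose forms) rather than rely on string heuristics alone.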
7. SEARCH & NOSQL
Free-text + concept-based search
Text mining integration
Guided navigation / facets
NoSQL functionalities
Multi- & cross-lingual search
Related documents
Based on Apache Solr
• Extended Query Syntax
• JSON-API
• Scalability
…
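Since the search layer is based on Apache Solr, a faceted free-text query can be sketched with stock Solr request parameters (`q`, `rows`, `wt`, `facet`, `facet.field`). This does not show the extended query syntax or the JSON-API mentioned above; the host, core name (`documents`), and facet field names (`disease`, `gene`) are assumptions for illustration.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, text, facet_fields, rows=10):
    """Build a URL for Solr's standard /select handler with field facets."""
    params = [
        ("q", text),          # free-text query
        ("rows", rows),       # number of hits to return
        ("wt", "json"),       # ask for a JSON response
        ("facet", "true"),    # enable faceting for guided navigation
    ]
    # One facet.field parameter per field; Solr accepts the key repeatedly.
    params += [("facet.field", field) for field in facet_fields]
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical core and field names.
url = build_solr_query(
    "http://localhost:8983/solr/documents",
    "desloratadine",
    ["disease", "gene"],
)
```

If text mining results are indexed as separate fields, the facet counts returned for those fields can drive the guided navigation listed on this slide.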
10. INFORMATION DISCOVERY
Terminology Management
Text Mining
Search & Analytics
NoSQL
Categorization & Clustering
Delivery / Deployment / Runtime Environment
Integration Tests / Continuous Integration
Extensive Documentation
Common Architecture / Application Design
User & Role Management, Security
Communication Bus
Project Management
11. PATENT CLASSIFICATION AT EPO
Tender No. 1585
1) Pre-Classification of unpublished patents into departments
2) Re-Classification of published patents, if the category system changes
12. ABOUT EPO
• The European Patent Office (EPO)
grants European patents for the
Contracting States to the European
Patent Convention
• Second largest intergovernmental
institution in Europe
• Not an EU institution
• Self-financing, i.e. revenue
from fees covers operating
and capital expenditure
16. COOPERATIVE PATENT CLASSIFICATION
• Patent Classification System based on ECLA / IPC
• jointly developed by the European Patent Office (EPO)
and the United States Patent and Trademark Office
(USPTO)
• used by both the EPO and USPTO since 1 January 2013
• currently contains about 250,000 classes
22. PATENT CLASSIFICATION AT EPO
Tender No. 1585
1) Pre-Classification of unpublished patents into departments
Our Motivation:
• Great Classification Use-Case
– Big Data (80 million patents available)
– Large-scale category system: >250,000 CPC codes
– Tough constraints on classification quality and response time
• Text Mining Success Story
27. SOME FACTS
• about 650k training documents from 2005-2013
• supervised learning: light-weight and fast linear support
vector machine
• Training time (16 Cores, 128 GB RAM)
– Feature Extraction: ~1 hour
– Training of Classifiers: ~1 hour
– 90/10 tests with a look-ahead of 3 levels
and reporting the 3 best candidates: ~1 hour
• Prediction: 5 docs in 5 sec
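The evaluation setup mentioned above (a 90/10 split and reporting the 3 best candidates) can be sketched in plain Python. This is a minimal illustration under assumptions: the classifier itself is not shown, per-class scores are taken as given (for a linear SVM these would be the decision values), the look-ahead over hierarchy levels is not modeled, and `split_90_10`, `top_k`, and `hit_rate_at_k` are hypothetical helper names.

```python
import random


def split_90_10(documents, seed=0):
    """Shuffle the documents and split them into 90% training and 10% test data."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    cut = int(len(docs) * 0.9)
    return docs[:cut], docs[cut:]


def top_k(scores, k=3):
    """Return the k class labels with the highest scores, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [label for label, _ in ranked[:k]]


def hit_rate_at_k(test_items, k=3):
    """Fraction of test items whose true label is among the k best candidates.

    test_items: iterable of (true_label, {label: score}) pairs.
    """
    items = list(test_items)
    if not items:
        return 0.0
    hits = sum(1 for truth, scores in items if truth in top_k(scores, k))
    return hits / len(items)


# Made-up class labels and scores, for illustration only.
scores = {"A": 0.9, "B": 0.1, "C": 0.0, "D": -1.0}
train, test = split_90_10(range(100))
rate = hit_rate_at_k([("A", scores), ("D", scores)])
```

Counting a prediction as correct when the true class is among the three best candidates is a more lenient measure than top-1 accuracy, which suits a setting where a human makes the final assignment.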
29. STATUS & OUTLOOK
• Range-specific quality evaluation
• Going live with best ranges
• Continuous optimization
30. PATENT CLASSIFICATION AT EPO
Tender No. 1585
1) Re-Classification of published patents, if the category system changes
Challenges and Facts:
– 250,000 CPC codes, regular changes/refinements
– Several re-classification projects at any one time, great
variation in size, a class is split into 5-20(?) subclasses
– No training material available
31. NEW RE-CLASSIFICATION PROCESS
Training Data
• Human Annotator starts labeling about 20% of
the documents with new subclasses
Statistical Models
• are generated on-the-fly, and
• cross-validation tests are carried out
Threshold
• If cross-validation reaches a certain threshold
(e.g. 90%), the remaining documents are
classified fully automatically without further
review
• Otherwise, more training data is generated
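The process above can be sketched as a simple loop. This is an illustration under assumptions, not the deployed implementation: annotation, cross-validation, and classification are passed in as callables, labeling proceeds in batches of 20% of the documents, and the names `reclassify`, `annotate`, `cross_validate`, and `classify` are hypothetical.

```python
def reclassify(documents, annotate, cross_validate, classify,
               threshold=0.9, batch_fraction=0.2):
    """Label documents in batches until cross-validation reaches the threshold.

    annotate(doc) -> label assigned by a human annotator
    cross_validate(labeled) -> accuracy estimate in [0, 1] on the labeled data
    classify(labeled, doc) -> label predicted by a model trained on `labeled`
    Returns (manually labeled docs, automatically classified docs).
    """
    labeled = {}
    remaining = list(documents)
    batch = max(1, int(len(remaining) * batch_fraction))
    while remaining:
        # The human annotator labels the next batch with the new subclasses.
        for doc in remaining[:batch]:
            labeled[doc] = annotate(doc)
        remaining = remaining[batch:]
        # Model quality is estimated on the fly from the labeled data so far.
        if remaining and cross_validate(labeled) >= threshold:
            automatic = {doc: classify(labeled, doc) for doc in remaining}
            return labeled, automatic
    # The threshold was never reached: everything was labeled manually.
    return labeled, {}


# Toy stand-ins for the annotator and the statistical model.
parity = lambda doc: "even" if doc % 2 == 0 else "odd"
manual, automatic = reclassify(
    documents=list(range(10)),
    annotate=parity,
    cross_validate=lambda labeled: 0.95 if len(labeled) >= 4 else 0.5,
    classify=lambda labeled, doc: parity(doc),
)
```

In the toy run the threshold is reached after two batches, so four documents are labeled manually and the remaining six are classified automatically.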
32. STATUS & OUTLOOK
• Currently in evaluation phase
• Going live in the next weeks
33. …NOT ONLY PATENTS
Solutions
Libraries, Pharma, Patents, Healthcare, Social Media, Automotive
Terminology Management
Text Mining
Search & Analytics
NoSQL
Categorization & Clustering
34. For further questions, please contact:
David Baehrens
+49 (0)761 203 97690
info@averbis.com