AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of experience
AI-SDV 2022, Oct. 10/11 2022
Klaus Kater
Director, Research & Development
1. 2012-2013: Crawling metasearch results Yahoo.com / Bing.com
2. 2014: Prototyping a configurable crawler framework
[Screenshots: look and feel 2014 and look and feel 2022]
3. 2015: Rollout of Company SEARCHCORPUS
Having started in 2015 with just about 30,000 companies,
the company SEARCHCORPUS keeps growing and growing…
2015 2016 2017 2018 2019 2020 2021 2022

• 30,000 company websites
• Duration 2 weeks:
    • Crawling
    • Indexing
• 50 GB of web data

• 60,000 company websites
• Still 2 weeks:
    • Crawling
    • Indexing
• 500 GB of web data

• 290,000 company websites
• Link depth 5
• Still 2 weeks:
    • Crawling
    • Geolocation
    • Classification
    • Indexing
• 2 TB of web data
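The "link depth 5" setting above can be read as a limit on how many links a crawler follows away from a site's start page. The following is a minimal sketch of that idea, assuming a breadth-first traversal over an in-memory link graph. The `crawl` function, the `links` mapping and the example paths are hypothetical and not taken from the talk; a real crawler would fetch and parse pages instead.

```python
from collections import deque

def crawl(start, links, max_depth=5):
    """Return pages reachable from `start` within `max_depth` link hops.

    `links` maps a page to the pages it links to; a real crawler would
    fetch and parse each page instead of reading a dictionary.
    """
    depth_of = {start: 0}          # page -> depth at which it was first seen
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth_of[page] == max_depth:
            continue               # do not follow links beyond the limit
        for target in links.get(page, []):
            if target not in depth_of:
                depth_of[target] = depth_of[page] + 1
                queue.append(target)
    return depth_of

# Hypothetical site: a chain of ten pages, each linking to the next.
site = {f"/p{i}": [f"/p{i + 1}"] for i in range(10)}
reached = crawl("/p0", site, max_depth=5)
```

With a depth limit, the amount of data collected per site stays bounded even when a website is very deep, which may be one reason such a parameter is useful when the crawl window is fixed.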
4. 2016: Thesaurus management for domain-specific content selection
5. 2017: Establishing a process to roll out news trackers and other crawling solutions
Timeline: 2015, 2016, 2017
• Preparation of international rollout of domain-specific targeted news trackers and alerting
    • Animal Health Tracker
    • RBB Tracker
    • CRDI Tracker
    • BD&L Tracker
• Single Sign On with automatic user provisioning (SAML)
6. 2018: Clinical Trial Registry Tracker
Australia / New Zealand
China
Republic of Korea
USA
Germany
European Union
Hong Kong
International (Springer)
Japan
Philippines
University Hospital Medical Information Network
WHO
7. 2018: Machine-learning-based classification of company websites
1. To obtain a reasonably sized input vector (remember, we classify a whole website,
which may have several hundred MB of content), we convert the data into a vector
using a TF-IDF pre-processor trained on a corpus collected for the project.
2. A Support Vector Machine alone is not good enough; all input must therefore be
pre-processed with a custom thesaurus.
3. For all 6 real-world samples, we got an average recognition rate above 96%.
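The first two steps above (thesaurus pre-processing, then TF-IDF vectorization) can be sketched as follows. This is an illustrative sketch only, not the project's implementation: the thesaurus entries, the whitespace tokenizer, the smoothed IDF formula and the tiny corpus are all assumptions made for the example. The resulting vectors would then be passed to an SVM classifier, which is not shown here.

```python
import math
from collections import Counter

# Hypothetical thesaurus: maps variant terms to one preferred term.
THESAURUS = {
    "vet": "veterinary",
    "veterinarian": "veterinary",
    "pharma": "pharmaceutical",
}

def tokenize(text):
    """Lowercase, split on whitespace, and normalize terms via the thesaurus."""
    return [THESAURUS.get(tok, tok) for tok in text.lower().split()]

def fit_idf(corpus):
    """Compute smoothed inverse document frequencies from a training corpus."""
    docs = [set(tokenize(doc)) for doc in corpus]
    n = len(docs)
    vocab = sorted(set().union(*docs))
    return {
        term: math.log((1 + n) / (1 + sum(term in d for d in docs))) + 1
        for term in vocab
    }

def tfidf_vector(text, idf):
    """Map a text to a fixed-length vector, one entry per vocabulary term."""
    counts = Counter(tokenize(text))
    total = sum(counts.values()) or 1
    return [counts[term] / total * weight for term, weight in idf.items()]

# Hypothetical training corpus and input text.
corpus = [
    "vet clinic for pets",
    "pharma company news",
    "veterinarian products for cattle",
]
idf = fit_idf(corpus)
vector = tfidf_vector("pharma products for the vet", idf)
```

Whatever the size of the input text, the vector has one entry per vocabulary term, so even a website with several hundred MB of content is reduced to a fixed-length input. Because "vet" and "veterinarian" are mapped to the same preferred term before counting, both contribute to a single feature.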
8. 2019: Globally distributed, massively parallel crawling
Transferring 1 page from London: 82 ms; 500 pages: 41 seconds; 1,000 servers: 11.5 hours
Transferring 1 page from Tokyo: 1,200 ms; 500 pages: 10 minutes; 1,000 servers: 6 days 23 hours
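The figures above follow from multiplying the per-page transfer time by the number of pages per server and the number of servers, assuming every page is fetched strictly one after another. The sketch below reproduces that arithmetic and adds a hypothetical parallel case. The function names and the worker count of 100 are assumptions for illustration, not values from the talk. Straight multiplication gives roughly 11.4 hours for London and just under 6 days 23 hours for Tokyo, which matches the rounded figures on the slide.

```python
def sequential_ms(ms_per_page, pages_per_server, servers):
    """Total time in milliseconds when every page is fetched one after another."""
    return ms_per_page * pages_per_server * servers

def parallel_ms(ms_per_page, pages_per_server, servers, workers):
    """Wall-clock time if the servers are split evenly across `workers` crawlers."""
    servers_per_worker = -(-servers // workers)  # ceiling division
    return ms_per_page * pages_per_server * servers_per_worker

def hours(ms):
    """Convert milliseconds to hours."""
    return ms / 3_600_000

london = sequential_ms(82, 500, 1000)     # 41,000,000 ms
tokyo = sequential_ms(1200, 500, 1000)    # 600,000,000 ms
print(f"London: {hours(london):.1f} h, Tokyo: {hours(tokyo):.1f} h")

# Hypothetical: 100 crawlers working in parallel on the Tokyo case.
print(f"Tokyo, 100 workers: {hours(parallel_ms(1200, 500, 1000, 100)):.1f} h")
```

The sketch only models parallelism. Placing crawlers geographically closer to the target servers would additionally reduce the per-page time itself.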
Image credit: NASA’s Terra satellite for the MODIS imageries, combined by Meow. Credit: NASA Goddard Space Flight Center. Image by Reto Stöckli (land surface, shallow water, clouds). Enhancements by Robert Simmon (ocean color, compositing, 3D globes, animation). Data and technical support: MODIS Land Group; MODIS Science Data Support Team; MODIS Atmosphere Group; MODIS Ocean Group. Additional data: USGS EROS Data Center (topography); USGS Terrestrial Remote Sensing Flagstaff Field Center (Antarctica); Defense Meteorological Satellite Program (city lights). Public domain, via Wikimedia Commons.
9. 2020/21: Deep learning for automated news rating
[Diagram: news rating architecture. Labels recovered from the figure:]
• Sources: corporate websites, news portals, news feeds, 3rd party APIs, licensed 3rd party content
• Components: crawlers, feed readers, {APIs}, news archive with (un)rated news, consume via {APIs} / {standard API}
• Model + metadata
• Request news rating for selected model
• Return news rated with selected model
• Deploy selected model
• Crawl / retrieve news
• Rate news with selected model
• Extracted news
• Publish model + metadata
• Optimization of models, retraining
• Model rating
• Deploy and verify model
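The publish, deploy and rate steps named in the diagram can be illustrated with a small sketch. Everything here is hypothetical: the `Model` and `RatingService` classes, the model name and the example news items are invented for illustration, and a simple keyword score stands in for the deep learning model described in the talk.

```python
from dataclasses import dataclass, field

@dataclass
class Model:
    name: str
    metadata: dict
    # term -> weight; a stand-in for a trained network, not a real model
    keywords: dict = field(default_factory=dict)

    def rate(self, text):
        """Score a news item by summing the weights of the keywords it contains."""
        words = set(text.lower().split())
        return sum(w for term, w in self.keywords.items() if term in words)

class RatingService:
    def __init__(self):
        self.models = {}      # published models, by name
        self.deployed = None  # the currently selected model

    def publish(self, model):
        """Publish a model together with its metadata."""
        self.models[model.name] = model

    def deploy(self, name):
        """Select one published model for rating."""
        self.deployed = self.models[name]

    def rate_news(self, archive):
        """Return each news item with its rating from the deployed model."""
        return [(item, self.deployed.rate(item)) for item in archive]

service = RatingService()
service.publish(Model("animal-health-v1", {"version": 1},
                      {"vaccine": 2.0, "cattle": 1.0}))
service.deploy("animal-health-v1")
archive = ["New cattle vaccine approved", "Quarterly earnings released"]
rated = service.rate_news(archive)
```

Keeping published models separate from the deployed one means a retrained model can be published and verified before it replaces the model currently used for rating.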
10. 2022: Automating regulatory intelligence collection and classification
(will be integrated with intranet applications to manage regulatory events)
Klaus Kater
kkater@copyright.com

VIP Kolkata Call Girl Kestopur 👉 8250192130 Available With Room
 
Russian Call girls in Dubai +971563133746 Dubai Call girls
Russian  Call girls in Dubai +971563133746 Dubai  Call girlsRussian  Call girls in Dubai +971563133746 Dubai  Call girls
Russian Call girls in Dubai +971563133746 Dubai Call girls
 
Hot Service (+9316020077 ) Goa Call Girls Real Photos and Genuine Service
Hot Service (+9316020077 ) Goa  Call Girls Real Photos and Genuine ServiceHot Service (+9316020077 ) Goa  Call Girls Real Photos and Genuine Service
Hot Service (+9316020077 ) Goa Call Girls Real Photos and Genuine Service
 
Low Rate Call Girls Kolkata Avani 🤌 8250192130 🚀 Vip Call Girls Kolkata
Low Rate Call Girls Kolkata Avani 🤌  8250192130 🚀 Vip Call Girls KolkataLow Rate Call Girls Kolkata Avani 🤌  8250192130 🚀 Vip Call Girls Kolkata
Low Rate Call Girls Kolkata Avani 🤌 8250192130 🚀 Vip Call Girls Kolkata
 
Call Girls In South Ex 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICE
Call Girls In South Ex 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICECall Girls In South Ex 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICE
Call Girls In South Ex 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICE
 
VIP Kolkata Call Girl Salt Lake 👉 8250192130 Available With Room
VIP Kolkata Call Girl Salt Lake 👉 8250192130  Available With RoomVIP Kolkata Call Girl Salt Lake 👉 8250192130  Available With Room
VIP Kolkata Call Girl Salt Lake 👉 8250192130 Available With Room
 
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779Best VIP Call Girls Noida Sector 75 Call Me: 8448380779
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779
 

AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of experience. Klaus Kater (Copyright Clearance Center, DE)

  • 1. AI-SDV 2022, Oct. 10/11, 2022. Klaus Kater, Director, Research & Development
  • 2. 1. 2012-2013: Crawling metasearch results Yahoo.com / Bing.com
  • 3. 2. 2014: Prototyping a configurable crawler framework
  • 4. Look and feel 2014 / Look and feel 2022
  • 5. 3. 2015: Rollout of Company SEARCHCORPUS
  • 6. Having started in 2015 with just about 30,000 companies, the company SEARCHCORPUS keeps growing and growing… Timeline 2015 to 2022, three stages:
      • 30,000 company websites; duration 2 weeks (crawling, indexing); 50 GB of web data
      • 60,000 company websites; still 2 weeks (crawling, indexing); 500 GB of web data
      • 290,000 company websites; link depth 5; still 2 weeks (crawling, geolocation, classification, indexing); 2 TB of web data
  • 7. 4. 2016: Thesaurus management for domain specific content selection
  • 9. 5. 2017: Establishing a process to roll out news trackers and other crawling solutions
  • 10. Timeline 2015 to 2017:
      • Preparation of international rollout of domain specific targeted news trackers and alerting
      • Animal Health Tracker
      • RBB Tracker
      • CRDI Tracker
      • BD&L Tracker
      • Single Sign On with automatic user provisioning (SAML)
  • 11. 6. 2018: Clinical Trial Registry Tracker
  • 12. Regions and registries shown: Australia / New Zealand; China; Republic of Korea; USA; Germany; European Union; Hong Kong; International (Springer); Japan; Philippines; University Hospital Medical Information Network; WHO
  • 13. 7. 2018: Machine learning based classification of company websites
  • 14.
      1. To obtain a reasonably sized input vector (remember, we classify a whole website, which may hold several hundred MB of content), we convert the data into a vector using a TF-IDF pre-processor trained on a corpus collected for the project.
      2. A Support Vector Machine alone is not good enough, so all input must first be pre-processed with a custom thesaurus.
      3. For all 6 real-world samples we got an average recognition rate above 96%.
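The two pre-processing steps on slide 14 (thesaurus mapping, then TF-IDF vectorization) can be sketched in plain Python. This is only an illustration under assumptions: the `THESAURUS` entries, the sample corpus, and all function names are hypothetical, the IDF formula is one common smoothed variant, and the Support Vector Machine that would consume the resulting vectors is not shown.

```python
import math
import re
from collections import Counter

# Hypothetical thesaurus: maps surface terms to one preferred concept label.
THESAURUS = {
    "vet": "veterinary",
    "veterinarian": "veterinary",
    "pharma": "pharmaceutical",
    "drugs": "pharmaceutical",
}

def normalize(text):
    """Lowercase, tokenize, and map each token to its thesaurus concept."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [THESAURUS.get(tok, tok) for tok in tokens]

def fit_idf(corpus):
    """Learn smoothed inverse document frequencies from a training corpus."""
    doc_tokens = [set(normalize(doc)) for doc in corpus]
    n_docs = len(doc_tokens)
    df = Counter(tok for toks in doc_tokens for tok in toks)
    return {tok: math.log((1 + n_docs) / (1 + count)) + 1.0
            for tok, count in df.items()}

def tfidf_vector(text, idf):
    """Turn one website's text into an L2-normalized TF-IDF vector (a dict)."""
    counts = Counter(tok for tok in normalize(text) if tok in idf)
    weights = {tok: n * idf[tok] for tok, n in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {tok: w / norm for tok, w in weights.items()} if norm else {}

# Tiny made-up corpus standing in for the project corpus.
corpus = [
    "Vet clinic for cattle and pets",
    "Pharma company producing drugs",
    "Veterinarian supplies and animal feed",
]
idf = fit_idf(corpus)
vec = tfidf_vector("Our vet and veterinarian team", idf)
print(vec)
```

Because "vet" and "veterinarian" collapse to the same concept before counting, the vector carries one strong "veterinary" feature instead of two weak ones, which is the kind of help the thesaurus step is meant to give the classifier.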
  • 15. 8. 2019: Globally distributed massive parallel crawling
  • 16. Transferring 1 page from London: 82 ms; 500 pages: 41 seconds; 1,000 servers: 11.5 hours
        Transferring 1 page from Tokyo: 1,200 ms; 500 pages: 10 minutes; 1,000 servers: 6 days 23 hours
        Image credit: NASA’s Terra satellite for the MODIS imageries, combined by Meow. Credit: NASA Goddard Space Flight Center. Image by Reto Stöckli (land surface, shallow water, clouds). Enhancements by Robert Simmon (ocean color, compositing, 3D globes, animation). Data and technical support: MODIS Land Group; MODIS Science Data Support Team; MODIS Atmosphere Group; MODIS Ocean Group. Additional data: USGS EROS Data Center (topography); USGS Terrestrial Remote Sensing Flagstaff Field Center (Antarctica); Defense Meteorological Satellite Program (city lights). Public domain, via Wikimedia Commons
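The figures on slide 16 follow from multiplying per-page latency by page and server counts. A small sketch of that arithmetic is given here, assuming pages are fetched strictly one after another and that nothing but transfer latency costs time. The `workers` parameter is an added assumption (an ideal, even split of servers across parallel crawlers) and is not a number from the slide.

```python
def crawl_seconds(ms_per_page, pages_per_server=500, servers=1000, workers=1):
    """Total crawl time in seconds for sequential page transfers.

    Assumes an even split of servers across `workers` and ignores DNS
    lookups, parsing, and politeness delays.
    """
    total_ms = ms_per_page * pages_per_server * servers
    return total_ms / 1000 / workers

london = crawl_seconds(82)     # 82 ms per page, as on the slide
tokyo = crawl_seconds(1200)    # 1,200 ms per page, as on the slide

print(f"London: {london / 3600:.1f} hours")
print(f"Tokyo:  {tokyo / 86400:.2f} days")
# Hypothetical: the same Tokyo workload spread over 100 parallel workers.
print(f"Tokyo, 100 workers: {crawl_seconds(1200, workers=100) / 3600:.2f} hours")
```

The sequential totals come to 41,000 seconds for London and 600,000 seconds for Tokyo, in line with the rounded slide figures. The gap between the two shows why both crawling from a location near the target servers and crawling in parallel matter.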
  • 17. 9. 2020/21: Deep learning for automated news rating
  • 18. Architecture diagram (labels only):
      • Components: corporate websites; news portals; news feeds; crawlers; feed readers; 3rd party APIs; licensed 3rd party content; news archive with (un)rated news; model + metadata; {APIs}; { standard API }
      • Flows: crawl / retrieve news; extracted news; rate news with selected model; request news rating for selected model; return news rated with selected model; publish model + metadata; deploy selected model; deploy and verify model; model rating; optimization of models, retraining; consume
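The model-related flows named on slide 18 (publish model + metadata, deploy selected model, request and return news rated with the selected model) can be illustrated with a toy in-memory service. Everything here is hypothetical: the class names, method names, and the keyword-based `keyword_rater` are stand-ins chosen for the sketch, not the actual API or the deep learning models from the talk.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class RatingModel:
    name: str
    metadata: Dict[str, str]
    rate: Callable[[str], float]  # stand-in for a trained model's scoring call

@dataclass
class RatingService:
    published: Dict[str, RatingModel] = field(default_factory=dict)
    deployed: Dict[str, RatingModel] = field(default_factory=dict)

    def publish(self, model: RatingModel) -> None:
        """Register a model together with its metadata."""
        self.published[model.name] = model

    def deploy(self, name: str) -> None:
        """Make a previously published model available for rating."""
        self.deployed[name] = self.published[name]

    def rate_news(self, name: str, articles: List[str]) -> List[dict]:
        """Return each article rated with the selected, deployed model."""
        model = self.deployed[name]  # KeyError if the model is not deployed
        return [{"text": a, "model": name, "rating": model.rate(a)}
                for a in articles]

def keyword_rater(text: str) -> float:
    """Toy rating: 1.0 if the article mentions a trial, else 0.0."""
    return 1.0 if "trial" in text.lower() else 0.0

service = RatingService()
service.publish(RatingModel("demo-v1", {"note": "hypothetical"}, keyword_rater))
service.deploy("demo-v1")
rated = service.rate_news(
    "demo-v1", ["New clinical trial announced", "Quarterly results"]
)
print(rated)
```

Keeping publish and deploy as separate steps mirrors the diagram's distinction between making a model available and selecting it for rating, so a retrained model can be published and verified before it replaces the one in use.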
  • 19.
      1. 2012-2013: Crawling metasearch results Yahoo.com / Bing.com
      2. 2014: Prototyping a configurable crawler framework
      3. 2015: Rollout of Company SEARCHCORPUS
      4. 2016: Thesaurus management for domain specific content selection
      5. 2017: Establishing a process to roll out news trackers and other crawling solutions
      6. 2018: Clinical Trial Registry Tracker
      7. 2018: Machine learning based classification of company websites
      8. 2019: Globally distributed massive parallel crawling
      9. 2020/21: Deep learning for automated news rating
      10. 2022: Automating regulatory intelligence collection and classification (will be integrated with intranet applications to manage regulatory events)