II-SDV 2015, 20 - 21 April 2015 in Nice
1. Boehringer Ingelheim Pharma GmbH & Co. KG
Scientific Information Center – S.I.C.
WebCrawling / Internet Research
Emancipation from Public Search
Aleksandar Kapisoda & Klaus Kater (black swan)
2. Content
1. Intro: Why we need our own web crawler and search engine
2. Focus on competitive technology and startups:
Building proprietary SEARCHCORPORA to
• Find new technology, e.g. university spin-offs / licenses (search)
• Monitor activities of known competitors (alerting)
3. Scientific Information Center - Workflow
4. What S.I.C. Can Now Offer to the Customers
• Targeted SEARCHCORPORA
• Automatic alerting
5. Outlook: What we want to achieve in the next steps
• Ontology mapping
4. The Sea of Information
Our claim is to search all of the sea,
not just its surface!
5. The Sea of Information
Personal Web
Observation
(Browser with Google)
6. The Sea of Information
News Feeds
(RSS, Email-Alerts, Newsletters)
Personal Web
Observation
(Browser with Google)
7. The Sea of Information
Personal Web
Observation
(Browser with Google)
Social Media
News Feeds
(RSS, Email-Alerts, Newsletters)
8. The Sea of Information
Personal Web
Observation
(Browser with Google)
Internet of Things
(Patient Health Sensor Data)
Social Media
News Feeds
(RSS, Email-Alerts, Newsletters)
http://www.teleskop-austria.at/information/bino-coin-tl/Coin100-1.jpg
http://www.easymarmaris.com/uploaded_tour_files/1397573325jet_Ski_7.jpg
9. The Sea of Information
Personal Web
Observation
(Browser with Google)
Internet of Things
(Patient Health Sensor Data)
Internal Information
(Corporate Databases, Intranet)
Social Media
News Feeds
(RSS, Email-Alerts, Newsletters)
http://www.teleskop-austria.at/information/bino-coin-tl/Coin100-1.jpg
http://www.easymarmaris.com/uploaded_tour_files/1397573325jet_Ski_7.jpg
10. Our Lack of Information
Personal Web
Observation
Social Media
What we actually find using public search (Google)
11. Our Lack of Information
All other information is Deep Web information
that cannot be searched with Public Search.
12. The Lack of Information
[Diagram labels: Google repository; Google "rating magic" (Google Ads, surf behavior, user profile); array of Googlebots; WWW; google.com, google.de …; max. 1000 results]
Public search does not allow access to Deep Web information, and also:
• Number of results artificially limited
• Search hit filter logic is not revealed
• Single document content index
14. Focus on Competitive Technology and Startups
Building proprietary SEARCHCORPORA
Case Studies
15. Focus on Competitive Technology and Startups:
Building Proprietary SEARCHCORPORA
Find new technology, e.g. university spin-offs / licenses (PULL)
• Provide custom SEARCHCORPUS
• Start from technology transfer organizations / universities (spin-offs in 1st step)
1. Crawl information about spin-off companies (address, website)
2. Extract technology categories
3. Crawl and index websites
4. Build SEARCHCORPUS
• Customize SEARCHCORPUS Viewer 1)
• Publish SEARCHCORPUS Viewer in corporate intranet
1) In addition to common search queries we support fuzzy search, proximity search and phrases
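The indexing step and the three query types named in the footnote can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the engine behind the SEARCHCORPUS Viewer; the page texts, document ids, and function names are assumptions made up for the example.

```python
import difflib
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build a positional inverted index: token -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in pages.items():
        for pos, token in enumerate(tokenize(text)):
            index[token].setdefault(doc_id, []).append(pos)
    return index


def phrase_search(index, phrase):
    """Documents in which the tokens of `phrase` occur consecutively."""
    tokens = tokenize(phrase)
    if not tokens:
        return set()
    hits = set()
    for doc_id, positions in index.get(tokens[0], {}).items():
        for start in positions:
            if all(start + i in index.get(tok, {}).get(doc_id, [])
                   for i, tok in enumerate(tokens)):
                hits.add(doc_id)
                break
    return hits


def proximity_search(index, a, b, max_distance):
    """Documents where tokens a and b lie within max_distance positions."""
    hits = set()
    for doc_id, pos_a in index.get(a.lower(), {}).items():
        pos_b = index.get(b.lower(), {}).get(doc_id, [])
        if any(abs(i - j) <= max_distance for i in pos_a for j in pos_b):
            hits.add(doc_id)
    return hits


def fuzzy_search(index, term, cutoff=0.8):
    """Documents containing an indexed token that is similar to `term`."""
    similar = difflib.get_close_matches(term.lower(), list(index),
                                        n=5, cutoff=cutoff)
    return {doc_id for tok in similar for doc_id in index[tok]}


# Hypothetical crawled page texts, keyed by a made-up document id.
pages = {
    "spinoff-a": "Spin-off developing antibody drug conjugates for oncology",
    "spinoff-b": "University licence for a microfluidic drug screening platform",
}
index = build_index(pages)

phrase_hits = phrase_search(index, "antibody drug")
proximity_hits = proximity_search(index, "microfluidic", "platform", 3)
fuzzy_hits = fuzzy_search(index, "license")  # spelling variant of "licence"
```

Storing token positions rather than only document ids is what makes phrase and proximity queries possible on top of a plain keyword index.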
16. Side Note: Annotating target documents with topic
specific content to build searchable contexts
Surface Web
Deep Web
Corporate
Resources
We can find documents using search terms that appear in the context but not
necessarily in the document’s content.
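One way to picture this annotation idea is a document record that carries a separate context field which is searched together with the content. The sketch below is an assumption about how such a record could look, not the data model used by S.I.C.; the class name, URL, and texts are hypothetical.

```python
import re
from dataclasses import dataclass


def tokenize(text):
    """Return the set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class AnnotatedDocument:
    url: str
    content: str       # text crawled from the target page itself
    context: str = ""  # topic-specific annotation added during corpus building

    def searchable_terms(self):
        # Content and context are searched together.
        return tokenize(self.content) | tokenize(self.context)


def search(documents, query):
    """URLs of documents whose content or context contains all query terms."""
    terms = tokenize(query)
    return [d.url for d in documents if terms <= d.searchable_terms()]


# Hypothetical corpus entry: the page never mentions "spin-off" itself.
documents = [
    AnnotatedDocument(
        url="http://www.example.com/company-a",
        content="We build lab automation robots.",
        context="university spin-off, category: laboratory technology",
    ),
]

context_hits = search(documents, "spin-off laboratory")
content_hits = search(documents, "automation robots")
```

Here the query "spin-off laboratory" finds the document only because of its annotation, while "automation robots" matches the page content directly.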
17. Focus on Competitive Technology and Startups:
Building Proprietary SEARCHCORPORA - PULL
Find new technology, e.g. university spin-offs / licenses (PULL)
http://www.example_url.com
names and
data of targets
crawl
extract
crawl
Target SEARCHCORPUS
expressions to scrape data
from pages of published targets
SEARCHCORPUS Viewer
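The "expressions to scrape data" step in this chain might look roughly like the sketch below, which uses a regular expression to pull target names, websites, and categories out of a listing page. The HTML snippet, the pattern, and the field names are all hypothetical; a real listing page would need its own expression.

```python
import re

# Hypothetical listing page of a technology transfer office (not a real site).
LISTING_HTML = """
<ul class="spinoffs">
  <li><a href="http://www.example-spinoff-one.com">Spinoff One GmbH</a> <span>Diagnostics</span></li>
  <li><a href="http://www.example-spinoff-two.com">Spinoff Two AG</a> <span>Drug Delivery</span></li>
</ul>
"""

# Assumed expression: one link with the company name, followed by a category.
TARGET_PATTERN = re.compile(
    r'<a href="(?P<website>[^"]+)">(?P<name>[^<]+)</a>\s*'
    r'<span>(?P<category>[^<]+)</span>'
)


def extract_targets(html):
    """Return one dict (website, name, category) per target on the page."""
    return [m.groupdict() for m in TARGET_PATTERN.finditer(html)]


targets = extract_targets(LISTING_HTML)

# The extracted websites become the seed URLs for the second crawl.
seed_urls = [t["website"] for t in targets]
```

The first crawl yields names and data of targets, and the extracted websites then seed the second crawl that fills the target SEARCHCORPUS.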
18. Focus on Competitive Technology and Startups:
Building Proprietary SEARCHCORPORA - PUSH
Monitor activities of known competitors (PUSH)
• Weekly alerts
• Currently concentrating on public companies (3 different websites as sources)
1. Crawl and extract ticker symbols (>15,000 public companies)
2. Crawl and scrape company information (address, website, industry, sector)
3. Crawl and index company news
• For each topic of interest, we create targets as search queries 1),
e.g. “oncology AND acquisition” to find out who acquired oncology companies
• Alerts are automatically sent by email
1) In addition to common search queries we support fuzzy search, proximity search and phrases
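As a rough illustration of how a target query such as "oncology AND acquisition" could turn indexed news into an email alert, consider the sketch below. It handles only single-word terms joined by AND, builds the message without sending it, and uses made-up news items and a made-up recipient address; none of this is taken from the actual S.I.C. setup.

```python
import re
from email.message import EmailMessage


def tokenize(text):
    """Return the set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches(query, text):
    """True if every AND-joined single-word term of `query` occurs in `text`."""
    terms = [t.strip().lower() for t in query.split(" AND ")]
    tokens = tokenize(text)
    return all(t in tokens for t in terms)


def build_alert(recipient, query, news_items):
    """Build (but do not send) an alert email listing all matching news."""
    hits = [n for n in news_items
            if matches(query, n["title"] + " " + n["body"])]
    if not hits:
        return None  # nothing to report for this period
    msg = EmailMessage()
    msg["To"] = recipient
    msg["Subject"] = f"Weekly alert: {query}, hits: {len(hits)}"
    msg.set_content("\n".join(f"- {n['title']} ({n['url']})" for n in hits))
    return msg


# Hypothetical news items from the company news corpus.
news_items = [
    {"title": "Example Pharma completes acquisition of oncology start-up",
     "body": "The deal closed on Monday.",
     "url": "http://www.example.com/news/1"},
    {"title": "Example Biotech opens new plant",
     "body": "The oncology pipeline is unaffected.",
     "url": "http://www.example.com/news/2"},
]

alert = build_alert("analyst@example.com", "oncology AND acquisition", news_items)
```

Only the first item contains both terms, so the alert lists a single hit; the second item mentions oncology but no acquisition.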
19. Monitor activities of known competitors (PUSH)
Focus on Competitive Technology and Startups:
Building Proprietary SEARCHCORPORA - PUSH
http://www.example1.com…
http://www.finance.example.com - seed URLs
crawl
extract
crawl
Company data,
industry, sector
Description, …
expressions to extract
stock market ticker symbols
newspage.com seed urls
crawl
newspage.com
company news pages
crawl
newspage.com
company news
link
Company news
corpus
User profile
match
Matching news
send
alerts
On a monthly schedule
On a weekly schedule
Email alert
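The first part of this chain, extracting ticker symbols with an expression and deriving news seed URLs from them, could be sketched as follows. The listing text, the regular expression, and the URL template (including its path on the placeholder domain from the slide) are assumptions for illustration only.

```python
import re

# Hypothetical plain-text listing as it might come out of a finance site crawl.
LISTING_TEXT = """
Example Pharma Inc. (EXPH) Healthcare
Sample Biotech Corp. (SMPB) Healthcare
Demo Devices Ltd. (DMO) Industrials
"""

# Assumed expression: company name, ticker in parentheses, then the sector.
TICKER_PATTERN = re.compile(
    r"^(?P<company>.+?)\s+\((?P<ticker>[A-Z]{1,5})\)\s+(?P<sector>\w+)$",
    re.MULTILINE,
)


def extract_companies(text):
    """Map each ticker symbol to its company name and sector."""
    return {
        m["ticker"]: {"company": m["company"], "sector": m["sector"]}
        for m in TICKER_PATTERN.finditer(text)
    }


def news_seed_urls(tickers, template):
    """Derive one news seed URL per ticker from a URL template."""
    return [template.format(ticker=t) for t in sorted(tickers)]


companies = extract_companies(LISTING_TEXT)

# Hypothetical URL scheme; a real news source defines its own.
urls = news_seed_urls(companies, "http://www.finance.example.com/news/{ticker}")
```

Keeping the ticker as the key makes it straightforward to link crawled news pages back to the company data (industry, sector, description) collected earlier in the chain.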
21. Scientific Information Center
Workflow
Project Inquiry
Specify Scope
Setup Chains
Review
Research Department
Customer
Crawler
Crawling
Analyzing
Daily use
Scheduled Updates
possibly
in iterations
Information Scientist
22. Information Scientist Engineer
Scientific Information Center
Workflow
Viewer
…
SEARCHCORPUS Designer
Scheduler / Engine
Container
Tools
Report / XLS
…
Research Department
23. Scientific Information Center
Workflow
[Chart: cost and pay-off over time t, across the phases Development, Test, Actual Usage, and Ongoing Optimization; no predetermined end-of-life time]
The value of a SEARCHCORPUS increases over time.
24. What S.I.C. Can Now Offer to the Customers
Automatic alerting
Targeted SEARCHCORPORA
Email Client
SEARCHCORPUS Viewer
Blended into
BI Intranet Solution
Project
Alert Profile
(SearchTerms)
Scheduled Alerts
Push
Scheduled Updates
Project
SEARCH Profile
(Targets)
Scheduled Updates
Faceted SEARCH
Pull
Crawler
SIC Crawler
25. Outlook:
What We Want to Achieve in the Next Steps
Technology
User Perspective
GUI for defining Alert Profiles
• Broader project scopes
• Larger SEARCHCORPORA
• More sources
Ontology Mapping
• Map SEARCHCORPUS entries to Ontologies
• Faceting over Ontologies
• Ontology Management: Import AND build ontologies
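To make the ontology-mapping goal more concrete, the sketch below maps corpus entries to concepts through synonym lists and then counts entries per concept as facets. The mini-ontology, its structure, and the sample entries are entirely hypothetical; a production setup would presumably import an established ontology instead.

```python
import re
from collections import Counter

# Hypothetical mini-ontology: concept -> parent concept and synonyms.
ONTOLOGY = {
    "Oncology": {"parent": "Therapeutic Area",
                 "synonyms": {"oncology", "cancer", "tumor"}},
    "Diabetes": {"parent": "Therapeutic Area",
                 "synonyms": {"diabetes", "insulin"}},
    "Antibody": {"parent": "Modality",
                 "synonyms": {"antibody", "antibodies"}},
}


def map_to_concepts(text):
    """Return all ontology concepts with at least one synonym in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return {concept for concept, entry in ONTOLOGY.items()
            if tokens & entry["synonyms"]}


def facet_counts(entries):
    """Count corpus entries per (parent concept, concept) facet."""
    counts = Counter()
    for text in entries.values():
        for concept in map_to_concepts(text):
            counts[(ONTOLOGY[concept]["parent"], concept)] += 1
    return counts


# Hypothetical SEARCHCORPUS entries.
entries = {
    "doc1": "Novel antibodies against tumor targets",
    "doc2": "Insulin sensor for diabetes patients",
    "doc3": "Cancer immunotherapy spin-off",
}

facets = facet_counts(entries)
```

Because "tumor" and "cancer" map to the same concept, doc1 and doc3 fall under one facet even though they share no search term.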