II-SDV 2017: Approaches of Web Information Analysis in a Day to Day Work Environment

Web scraping, content filtering, tagging and feeding web data into the day-to-day work environment take many different shapes and require an additional software stack that blends well with existing big data analysis, text analysis and search technology.

Published in: Internet

  1. Deep SEARCH 9: Approaches of Web Information Analysis in a Day to Day Work Environment. II-SDV 2017, 24-25 April, Nice, France. Klaus Kater, Managing Partner, Deep SEARCH 9 GmbH. © 2017 Deep SEARCH 9 GmbH, http://www.deepsearchnine.com
  2. Web Information Analysis: Information Sources of Critical Importance. Many areas of critical importance call for research, continuous monitoring, analysis and distribution of collected intelligence within an organization to support immediate and competent decision making.
  3. Web Information Analysis: Information Sources of Critical Importance
       • Corporate news (e.g. acquisitions, pipeline information, …)
       • Regulatory documents (e.g. ISO 22301: BCM, PDUFA V, …)
       • Technology (e.g. research funding, patents, publications, …)
       • Corporate websites (e.g. competitors, start-ups, supply chain, …)
       • Newsletters (e.g. BioCentury, CafePharma, …)
       • Cyber threats (e.g. CERTs, ransom attacks, false news, …)
       • Portals (e.g. WebMD, NIH, Dr. Mercola, …)
       • Online databases (e.g. DrugBank, UniProt, WikiPathways, …)
       • Online registries (e.g. CTR, Edgar Online, …)
       • Social media (e.g. Twitter, Facebook, blogs, …)
       • Shops (e.g. product prices, counterfeiting, …)
       • Semantic Web (e.g. Wikidata, DBpedia, OBO, MeSH, …)
  4. Web Information Analysis: Information Sources of Critical Importance. The same source list as on slide 3, now shown against three layers: Surface Web, Deep Web and Dark Web.
  6. Web Information Analysis: Information Sources of Critical Importance (Surface Web, Deep Web, Dark Web; same source list as on slide 3). Who is doing this research? And what tools are used? Public search, e-mail client, browser, spreadsheets.
  7. Web Information Analysis. Diagram labels: Sources (Surface Web, Deep Web); Manual research; Decision makers; Decisions.
       • 100s of emails…
       • 1,000s of websites…
       • Once a week, daily, every other hour?
       • Keep sitting there, hitting F5 ;-)
     Reliable? Systematic? Structured? Repeatable? Fast enough?
  8. The Public Search Situation (e.g. Google). Diagram labels: World Wide Web (.com, .de, …); Army of Google Bots; Google Repository; Google Rating Magic; Google Ads; Surf Behavior; Search Profile; Public Search; max. 1000 results.
       • The “Rating Magic”, and therefore who gets access to which results, is determined solely by Google
       • The browser’s fingerprint tells Google who is interested in which topics
       • Restricting search to a specific context, e.g. “find me persons only”, is not possible
     (I’m not going to address the “Public Search Paradox” here…)
  11. Web Information Analysis: Expert Search. Diagram labels: Sources (Surface Web, Deep Web, Databases, Repositories); Information Scientists, Search Specialists, Knowledge Workers; Manual research; Decision makers; Decisions (Competitive Intelligence, Regulatory Affairs, Research & Development; there are many more…).
  12. Managed Intelligence: Search Competence Center. Diagram labels: Sources (Surface Web, Deep Web, Dark Web, Databases, Repositories); Manual research; Ontologies; Scheduled execution; Unattended updates; Content assessment; Automatic publication; Decisions (Competitive Intelligence, Regulatory Affairs, Research & Development; there are many more…).
     SEARCHCORPORA:
       • Start-ups
       • Competitors
       • Regulatory
       • New technology
       • …
     Managed Intelligence:
       • Known (trusted) sources
       • More complete
       • Faster
     Information Scientists in the Search Competence Center:
       • Information source selection
       • Content structuring
       • Linking of disparate sources
       • Ontology management
       • SEARCHCORPUS management
  13. Managed Intelligence: Search Competence Center. Same diagram as slide 12, with one addition: direct access for immediate answers within predefined scopes of interest.
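
The "scheduled execution" and "unattended updates" on slides 12 and 13 stand in for the manual "hitting F5" routine of slide 7. The slides do not say how updates are detected; a minimal sketch in Python, assuming a simple content-hash comparison, might look like this. The function names, the in-memory store and the example URL are hypothetical and not part of the Deep SEARCH 9 product.

```python
import hashlib


def fingerprint(content: str) -> str:
    """Return a stable hash of a page's text, ignoring whitespace differences."""
    normalized = " ".join(content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def detect_changes(previous: dict[str, str], snapshot: dict[str, str]) -> list[str]:
    """Compare stored fingerprints against a fresh snapshot.

    `previous` maps source URL -> fingerprint from the last run (updated in place).
    `snapshot` maps source URL -> freshly fetched text.
    Returns the URLs whose content is new or has changed.
    """
    changed = []
    for url, text in snapshot.items():
        digest = fingerprint(text)
        if previous.get(url) != digest:
            changed.append(url)
            previous[url] = digest
    return changed


# Hypothetical data; a real run would fetch the pages on a schedule.
seen: dict[str, str] = {}
first = detect_changes(seen, {"https://example.com/news": "Pipeline update A"})
second = detect_changes(seen, {"https://example.com/news": "Pipeline  update A"})
third = detect_changes(seen, {"https://example.com/news": "Pipeline update B"})
```

Only the first and third runs report the source as changed; the second differs from the first in whitespace only, which the normalization step ignores.
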
  14. Managed Intelligence Architecture. Diagram labels: Scheduler; Parallel Processing Engine (Job, Job, Job); Crawlers; Analyzers; Agents; Browser Farm; Whois Proxy (> whois); Connectors { APIs }; Azure Cognitive Services; Triple Store; Search Engine; Database. Sources: Corporate Websites; Portals; News; Proprietary Databases; Document Repositories; Registries (e.g. .com); > 2,000 IANA Top Level Domains. Trust!
  15. Managed Intelligence Architecture. Same diagram as slide 14, with one addition: Integrated Development Studio.
  17. Managed Intelligence Architecture. Same diagram as slide 15, now with the Information Scientists of the Search Competence Center, who are responsible for:
       • Information source selection
       • Content structuring
       • Linking of disparate sources
       • Ontology management
       • SEARCHCORPUS management
       • Viewer management
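
The architecture slides name a scheduler, a parallel processing engine that runs jobs, crawlers and analyzers, but give no implementation details. As a rough illustration only, the following Python sketch runs several crawl-and-analyze jobs in parallel with the standard library's `ThreadPoolExecutor`. The `Job` class, `run_jobs` and the stub crawler and analyzers are assumptions made for this example; a scheduler would call something like `run_jobs` at fixed intervals.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass
class Job:
    """One unit of work: crawl a source, then run analyzers on the result."""
    source: str
    crawler: Callable[[str], str]
    analyzers: list[Callable[[str], dict]]

    def run(self) -> dict:
        document = self.crawler(self.source)
        result = {"source": self.source}
        for analyze in self.analyzers:
            result.update(analyze(document))
        return result


def run_jobs(jobs: list[Job], workers: int = 4) -> list[dict]:
    """Run all jobs in parallel and return results in submission order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(Job.run, jobs))


# Stub crawler and analyzers stand in for real components (hypothetical).
def fake_crawler(source: str) -> str:
    return f"Content fetched from {source}"


def word_count(document: str) -> dict:
    return {"words": len(document.split())}


def mentions_news(document: str) -> dict:
    return {"is_news": "news" in document.lower()}


jobs = [Job(s, fake_crawler, [word_count, mentions_news])
        for s in ("corporate-website", "portal", "news")]
results = run_jobs(jobs)
```

Threads suit this sketch because crawling is typically I/O-bound; CPU-heavy analyzers might be better served by a process pool.
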
  18. Managed Intelligence Architecture: Development Studio. Information Scientists configure the system to:
       • Crawl selected sources
       • Extract data
       • Tag known entities in content
       • Filter based on data, content or tags
       • Link data into SEARCHCORPUS®
       • Automatically renew according to schedule
       • Provide interactive search to end users
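
The tagging and filtering steps on slide 18 can be sketched in a few lines of Python. This is an illustrative simplification, assuming a flat dictionary of known entities and whole-word matching; the entity list, the `Document` class and the function names are hypothetical, and a production system would more likely draw on an ontology such as MeSH.

```python
import re
from dataclasses import dataclass, field

# Hypothetical entity dictionary: lowercase surface form -> tag.
KNOWN_ENTITIES = {
    "crispr": "Technology",
    "diabetes": "Disease",
    "drugbank": "Database",
}


@dataclass
class Document:
    url: str
    text: str
    tags: set[str] = field(default_factory=set)


def tag_entities(doc: Document, entities: dict[str, str]) -> Document:
    """Tag the document with every known entity found as a whole word."""
    words = set(re.findall(r"[a-z0-9]+", doc.text.lower()))
    doc.tags = {tag for name, tag in entities.items() if name in words}
    return doc


def filter_by_tags(docs: list[Document], required: set[str]) -> list[Document]:
    """Keep only documents that carry all required tags."""
    return [d for d in docs if required <= d.tags]


# Hypothetical crawled content.
crawled = [
    Document("https://example.com/a", "CRISPR trial targets type 1 diabetes."),
    Document("https://example.com/b", "Quarterly results and acquisitions."),
]
tagged = [tag_entities(d, KNOWN_ENTITIES) for d in crawled]
corpus = filter_by_tags(tagged, {"Technology", "Disease"})
```

Here only the first document carries both required tags, so it alone would be linked into the corpus.
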
  20. Managed Intelligence Architecture: Information Scientists, Search Competence Center.
  21. Deep SEARCH 9: World Wide Web, Search Competence Center, Search Room.
     Search Competence Center:
       • Ontology Management: define ontology and rules (MeSH, proprietary taxonomies, …)
       • Address Room Management: select relevant addresses (bioscentury.com, clinicaltrials.gov, bloomberg.com, …)
     Diagram labels: Criteria; Ontology; Extracted Data; Websites of extracted companies / institutes, …; Publications; Patents.
     Search Room: CRISPR AND “diabetes type 1”~3
     Repeatable, targeted, systematic, structured, reliable, flexible, 24h x 7.
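
The Search Room query on slide 21 combines a Boolean AND with a quoted phrase followed by `~3`. The slide does not name the query engine; assuming the suffix follows the proximity ("slop") convention known from Lucene-style query syntax, the intent can be approximated as below. This Python sketch is a deliberate simplification: it requires the phrase terms in order with at most `slop` extra tokens in between, which is stricter than true slop matching. All function names are hypothetical.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_within(tokens: list[str], phrase: list[str], slop: int) -> bool:
    """True if the phrase terms occur in order with at most `slop` extra
    tokens in between (a simplification of slop-based phrase matching)."""

    def search(start: int, term_index: int, first_pos: int) -> bool:
        if term_index == len(phrase):
            return True
        for pos in range(start, len(tokens)):
            # Extra tokens between the first phrase term and this position.
            if term_index > 0 and (pos - first_pos) - term_index > slop:
                break
            if tokens[pos] == phrase[term_index]:
                anchor = pos if term_index == 0 else first_pos
                if search(pos + 1, term_index + 1, anchor):
                    return True
        return False

    return search(0, 0, 0)


def matches(text: str) -> bool:
    """Approximate the query: CRISPR AND "diabetes type 1"~3"""
    tokens = tokenize(text)
    return "crispr" in tokens and phrase_within(tokens, ["diabetes", "type", "1"], 3)


hit = matches("CRISPR therapy for diabetes mellitus of type 1 in mice")
miss = matches("CRISPR is unrelated here; diabetes was one of many conditions, and type 1 came last")
no_crispr = matches("A review of diabetes type 1 treatments")
```

The first text matches because only two extra words separate "diabetes" from "type 1"; the second fails the proximity limit, and the third lacks the term CRISPR.
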
