Arcomem training system-overview_beginner


This presentation on the ARCOMEM system is part of the ARCOMEM training curriculum. Feel free to roam around or contact us on Twitter via @arcomem to learn more about ARCOMEM training on archiving Social Media.

Published in: Technology, Design

  1. ARCOMEM System Overview (Beginner Level). Thomas Risse, L3S Research Center, Hannover, Germany, risse@L3s.de
  2. Overview. Beginner Level: • Approach of current crawlers • What's new in ARCOMEM? • The ARCOMEM approach – Overview of the phases – Overview of the processing levels • Handling of preservation in ARCOMEM. Advanced Level: • Overview of the system architecture • Possible ARCOMEM system configurations
  3. Standard Crawlers. Seedlist: http://www.economist.com/node/21534849, http://www.ekathimerini.com/ekathi/comment, http://www.bbc.co.uk/news/world-europe-15589568, http://www.bbc.co.uk/search/news/?q=Greek%20crisis, http://www.guardian.co.uk/business/blog, http://www.kathimerini.gr/, http://twitter.com/#!/EU_Commission. Web crawler: e.g. Heritrix, HTTrack. Step 1: A seedlist is specified as input for the crawler. This specification may also contain a few crawling parameters, such as the crawl depth or the maximum crawl time. Blacklists of domains can also be given to reduce spam.
  4. Standard Crawlers. Diagram labels: Seedlist, Web Crawler (e.g. Heritrix, HTTrack), Crawling. Step 2: The Web crawler collects the content from the Web and follows links up to the specified crawl depth.
  5. Standard Crawlers. Diagram labels: Seedlist, Web Crawler (e.g. Heritrix, HTTrack), Crawling, Storage, Archive. Step 3: The results of the crawl are stored directly in the Web archive, typically in WARC or ARC format.
  6. Standard Crawlers. Diagram labels: Seedlist, Web Crawler (e.g. Heritrix, HTTrack), Crawling, Storage, Archive, Quality Assurance. Step 4: Quality assurance is applied as the last step to ensure that all information has been collected and that the pages are fully stored in the archive. Missing URLs are handed back to the Web crawler for re-crawling.
  7. What's new in ARCOMEM? • Intelligent Crawler – Semantically enhanced crawl specification – "Understands" the crawl intention – Crawler guidance using social and semantic information – Stops crawling at irrelevant pages – Two-stage crawling strategy: Web → ARCOMEM Storage → Archive • Advanced Web Archive Enrichment – Semantic information: Entities, Topics, Opinions, Events (ETOE) – Social context: interlinking of Web and Social Web, trustworthiness of information and users • Archivist and End User Support – Archivist Tool – Searching and browsing Web archives with different facets
  8. ARCOMEM Phases: Crawl Specification. 1. Intelligent Crawl Specification (ICS): The ICS describes the intended crawl by specifying keywords, entities, topics, etc., together with reference pages and starting points. Reference pages match the intended crawl content 100% and are used by the crawler to learn more about the crawl. Example: Entities: Obama, Romney, Biden, Ryan, Republicans, Democrats, … Keywords: US Election, CommitToMitt, Teaparty, Budget deficit, … Reference Seedlist: https://twitter.com/whitehouse, https://twitter.com/blog44, https://twitter.com/BarackObama, ... Seedlist: http://news.bbc.co.uk/, http://telegraph.co.uk/, ...
  9. ARCOMEM Phases: Crawling & Online Processing. 2. Crawling & Online Processing: In this phase, web pages and social web content are collected and a first semantic analysis is applied. The analysis result is used to guide the crawler by ranking the extracted links by their importance. All information is stored in the ARCOMEM Storage. Diagram labels: Internet, Crawling, Online Processing, ARCOMEM Storage; input: Entities, Keywords, Reference Seedlist, Seedlist.
  10. ARCOMEM Phases: Offline Processing. 3. Offline Processing: The offline processing runs after the collection of content has finished. The aim of this phase is to enrich the crawled pages with meta-information extracted from the content. The enrichment helps in selecting content for the final Web archive and also eases searching and browsing within it. Diagram labels: Internet, Crawling, Online Processing, Offline Processing, ARCOMEM Storage; input: Entities, Keywords, Reference Seedlist, Seedlist.
  11. ARCOMEM Phases: Appraisal & Selection. 4. Appraisal & Selection: Based on the information given in the Intelligent Crawl Specification (ICS) and the enrichment of the content, the most interesting content items are selected to be stored in the final Web archive. The final Web archive consists of WARC files, which include the crawled pages and, in RDF format, all enrichments made during the offline processing. Diagram labels: Internet, Crawling, Online Processing, Offline Processing, ARCOMEM Storage, Appraisal, Selection, Archive; input: Entities, Keywords, Reference Seedlist, Seedlist.
  12. ARCOMEM Phases: Applications. 5. Applications: The Search and Retrieval Application (SARA) allows end users to search and browse the archive in different ways, e.g. based on keywords, entities, topics, or opinions. Diagram labels: Internet, Crawling, Online Processing, Offline Processing, ARCOMEM Storage, Appraisal, Selection, Archive, SARA for Broadcaster, Parliaments; input: Entities, Keywords, Reference Seedlist, Seedlist.
  13. ARCOMEM Phases: Cross Crawl Analytics. 6. Cross Crawl Analytics: The cross-crawl analysis allows content analytics across archives. This makes it possible to combine Web archives into a larger collection of documents or to study evolution over time, for example the evolution of language or of opinions. Diagram labels: Internet, Crawling, Online Processing, Offline Processing, Cross Crawl Processing, ARCOMEM Storage, Appraisal, Selection, Archive, SARA for Broadcaster, Parliaments; input: Entities, Keywords, Reference Seedlist, Seedlist.
  14. Preservation in ARCOMEM. Content Preservation in ARCOMEM: • Selection and appraisal of Web and Social Web content • Preparation of WARC files for preservation • Provides access to preserved Web content • Not part of ARCOMEM: – Long-term preservation of WARC files – Format handling, etc. Semantic Preservation in ARCOMEM: • Extraction of Entities, Events, Topics, Opinions • Enrichment with Linked Data • Created WARC files contain: – Raw Web data – RDF triples of the enrichment • Preservation of Linked Data: – Not part of ARCOMEM – See EU projects: DIACHRON (IP), PRELIDA (CA)
  15. THANK YOU. Contact details: Dr. Thomas Risse, +49 511 762 17764, risse@L3S.de, www.arcomem.eu
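
The guided crawl described on slides 7 to 9 (a crawl specification with keywords and entities, ranking of extracted links by importance, and stopping at irrelevant pages) can be illustrated with a small sketch. This is not the ARCOMEM implementation: the names `CrawlSpec`, `relevance`, `guided_crawl`, the `min_score` threshold, and the `fetch` callback are all hypothetical, and the simple term-matching score only stands in for the semantic and social analysis the slides refer to.

```python
import heapq
from dataclasses import dataclass


@dataclass
class CrawlSpec:
    """Hypothetical stand-in for an Intelligent Crawl Specification (ICS)."""
    keywords: list[str]
    entities: list[str]
    seeds: list[str]
    min_score: float = 0.1  # assumed cut-off below which a page counts as irrelevant


def relevance(text: str, spec: CrawlSpec) -> float:
    """Fraction of ICS terms found in the text (naive placeholder for semantic analysis)."""
    terms = [t.lower() for t in spec.keywords + spec.entities]
    if not terms:
        return 0.0
    lowered = text.lower()
    return sum(1 for t in terms if t in lowered) / len(terms)


def guided_crawl(spec: CrawlSpec, fetch, max_pages: int = 100):
    """Crawl best-first; fetch(url) must return (page_text, [(link_url, anchor_text), ...])."""
    seeds = set(spec.seeds)
    # heapq is a min-heap, so priorities are stored negated; seeds get top priority.
    frontier = [(-1.0, url) for url in spec.seeds]
    heapq.heapify(frontier)
    seen = set(spec.seeds)
    stored = []  # (url, score) pairs, standing in for the intermediate storage
    while frontier and len(stored) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = fetch(url)
        score = relevance(text, spec)
        stored.append((url, score))
        if score < spec.min_score and url not in seeds:
            continue  # irrelevant page: keep it, but do not follow its links
        for link, anchor in links:
            if link not in seen:
                seen.add(link)
                # Rank each extracted link by the relevance of its anchor text.
                heapq.heappush(frontier, (-relevance(anchor, spec), link))
    return stored


# Tiny in-memory "Web" so the sketch runs without network access.
FAKE_WEB = {
    "a": ("Obama and Romney debate the US Election",
          [("b", "Obama budget deficit"), ("c", "cat pictures")]),
    "b": ("Budget deficit plans of Obama", [("d", "Romney reply")]),
    "c": ("Funny cats", [("e", "more cats")]),
    "d": ("Romney replies", []),
    "e": ("cats", []),
}

spec = CrawlSpec(
    keywords=["US Election", "Budget deficit"],
    entities=["Obama", "Romney"],
    seeds=["a"],
)
visited = [url for url, _ in guided_crawl(spec, lambda u: FAKE_WEB[u])]
```

In this toy run, links whose anchor text mentions terms from the specification are fetched first, and the links on the off-topic page are never followed. A standard crawler as on slides 3 to 6 would instead follow every link up to a fixed depth.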
