Arcomem training system-overview_advanced


This presentation on the ARCOMEM System is part of the ARCOMEM training curriculum. Feel free to roam around or contact us on Twitter via @arcomem to learn more about ARCOMEM training on archiving Social Media.

Published in: Technology, Education
  • Entity evolution: Obama as senator, then as president
  • Transcript

    • 1. ARCOMEM System Overview (Advanced Level)
      Prerequisite: ARCOMEM System Overview (Beginner Level)
      Thomas Risse, L3S Research Center, Hannover, Germany
      risse@L3s.de
    • 2. Architecture Overview
      The levels in the architecture represent the phases described in the ARCOMEM overview. Details about the components can be found in the other courses.
      [Architecture diagram. Levels: Crawler, Online Processing, Offline Processing, Cross Crawl Analysis. Components shown: Queue Management, Application-Aware Helper, Resource Selection & Prioritization, Resource Fetching, Intelligent Crawl Definition, GATE Online Analysis, Social Web Analysis, Relevance Analysis & Prioritization, GATE Offline Analysis, Image/Video Analysis, Consolidation, Enrichment, Extracted Social Web Information, Named Entity Evolution Recognition, Twitter Dynamics, ARCOMEM Storage (HBase, H2RDF), Crawler Cockpit, WARC Export (WARC files), SARA (SOLR index), Broadcaster and Parliament applications.]
    • 3. Crawler and Online Analysis
      • The Intelligent Crawl Specification (ICS) specifies the crawl intention.
      • Resource Fetching
        – Heritrix or the IMF Large Scale Crawler can be used to collect Web pages.
        – API crawling supports the collection of Social Web content via the APIs of the sites.
      • The Application-Aware Helper extracts links from Web and Social Web content, taking application-specific functionality into account (e.g. for Twitter and YouTube).
      • Simple content analysis (e.g. keyword detection) in the online phase allows an efficient relevance ranking of extracted links.
      • All results are stored in the ARCOMEM Storage.
      • Crawler and Online Analysis are tightly coupled.
      [Diagram: the Crawler and Online Processing levels of the architecture, exchanging URLs and results through the ARCOMEM Storage (HBase, H2RDF).]
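The keyword-based relevance ranking of extracted links mentioned on slide 3 can be illustrated with a minimal sketch. This is not ARCOMEM code: the scoring function, the function names `relevance` and `prioritize`, and the example URLs are all assumptions made for illustration. It only shows the general idea of scoring each link's surrounding text against the ICS keywords and fetching the best-scoring links first.

```python
import heapq

def relevance(text, keywords):
    """Fraction of crawl-specification keywords that occur in the text (illustrative scoring)."""
    words = set(text.lower().split())
    hits = sum(1 for kw in keywords if kw.lower() in words)
    return hits / len(keywords) if keywords else 0.0

def prioritize(links, keywords):
    """Return URLs ordered from most to least relevant.

    `links` is a list of (url, context_text) pairs, e.g. a link and its anchor text.
    """
    # heapq is a min-heap, so the score is negated; the index breaks ties
    # in favour of links that were extracted earlier.
    heap = [(-relevance(text, keywords), i, url) for i, (url, text) in enumerate(links)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

# Hypothetical ICS keywords and extracted links.
ics_keywords = ["election", "parliament"]
extracted = [
    ("http://example.org/sports", "football results today"),
    ("http://example.org/vote", "parliament election coverage"),
    ("http://example.org/news", "election night live"),
]
queue = prioritize(extracted, ics_keywords)
```

A cheap score like this keeps the online phase fast; the more thorough analysis is left to the offline phase.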
    • 4. Offline Processing Level
      Thorough analysis of crawled Web objects
      • GATE-based extraction of entities, opinions and events from text
      • Topic extraction
      • Analysis of images and videos
        – Extraction of entities, locations, etc.
        – Identification of duplicates
      • Social Web Analysis
        – Identification of cultural differences in the Social Web
        – Domain expert detection
        – Social search
      Archive Enrichment
      • Enrichment of all crawled content items with semantic information about topics, entities and events
      • Interlinking of entities and events with Linked Data
      • Sentiments of user content in the Social Web
      Support for Appraisal and Selection
      • Learn more about the crawl intention
      • Feedback to the crawl specification
      • Ranking of content for WARC export
      [Diagram: the Offline Processing level (GATE Offline Analysis, Social Web Analysis, Image/Video Analysis, Consolidation, Enrichment, Extracted Social Web Information) on top of the ARCOMEM Storage (HBase, H2RDF).]
    • 5. Cross Crawl Analysis (CCA) Level
      Analyzing several crawls
      • Temporal analytics to get a better understanding of changes that occur over time
      • Combination of content to get a larger collection, e.g. combining several Twitter crawls
      Understanding the dynamics of Web content
      • Evolution of entities over time, e.g. Joseph Ratzinger → Pope Benedict XVI → Pope Emeritus Benedict XVI
      • Evolution of opinions
        – Better understanding of public perception
      • Dynamics of Twitter hashtags
      [Diagram: the Cross Crawl Analysis level (Named Entity Evolution Recognition, Twitter Dynamics, GATE CCA Analysis, Opinion Dynamics) on top of the ARCOMEM Storage (HBase, H2RDF).]
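As a rough illustration of "dynamics of Twitter hashtags" across several crawls, the sketch below counts hashtags per crawl and lines the counts up in chronological order. It is a toy example under stated assumptions, not the ARCOMEM Twitter Dynamics module: the function names, the crawl labels and the tweets are invented, and hashtags are matched with a simple regular expression.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")

def hashtag_counts(tweets):
    """Count case-insensitive hashtag occurrences in a list of tweet texts."""
    return Counter(tag.lower() for text in tweets for tag in HASHTAG.findall(text))

def hashtag_dynamics(crawls):
    """Map each hashtag to its count in every crawl.

    `crawls` is a chronologically ordered list of (label, tweets) pairs.
    """
    per_crawl = [hashtag_counts(tweets) for _, tweets in crawls]
    tags = sorted(set().union(*per_crawl))
    # A Counter returns 0 for missing keys, so absent hashtags show up as 0.
    return {tag: [counts[tag] for counts in per_crawl] for tag in tags}

# Hypothetical data: two Twitter crawls of the same topic.
crawls = [
    ("crawl-1", ["#Election debate tonight", "Watching the #election"]),
    ("crawl-2", ["#election results", "#Obama wins", "#obama #election"]),
]
dynamics = hashtag_dynamics(crawls)
```

Here `dynamics["obama"]` goes from 0 in the first crawl to 2 in the second, which is the kind of change over time that only becomes visible when several crawls are combined.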
    • 6. Technologies
      Technical Framework
      • Scalability is important due to the large amount of analysis.
      • An Apache Hadoop based environment serves as the framework for the implementation.
      ARCOMEM Storage
      • Central component of ARCOMEM
      • HBase for scalability
        – Web Object Store
        – RDF Store as knowledge base
      • The ARCOMEM data model has been specified.
      [Diagram: the full architecture overview with all levels, the ARCOMEM Storage, the Crawler Cockpit, WARC Export, and the Broadcaster and Parliament applications.]
    • 7. Crawler Cockpit & Applications
      • Crawler Cockpit
        – Adapted interfaces to the crawler
        – Allows the specification and refinement of the Intelligent Crawl Specification (ICS)
      • WARC Export
        – Semi-automatic selection of content to be preserved
        – Selection is based on the ICS and the extracted meta information
        – Raw content and RDF metadata are exported as WARC files
      • Search And Retrieval Application (SARA)
        – End user access to exported Web archives (incl. index)
        – One application, two scenarios
      [Diagram: Crawler Cockpit, ARCOMEM Storage (HBase, H2RDF), WARC Export (WARC files), and SARA (SOLR index) with the Broadcaster and Parliament applications.]
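One possible reading of "semi-automatic selection" is a triage step: content that clearly matches the crawl intention is exported automatically, borderline content is handed to a curator, and the rest is dropped. The sketch below shows that idea only. The `WebObject` class, the `relevance` score, the thresholds and the `triage` function are hypothetical; the slides do not describe how ARCOMEM combines the ICS with the extracted meta information.

```python
from dataclasses import dataclass

@dataclass
class WebObject:
    url: str
    relevance: float  # hypothetical score in [0, 1] derived from the ICS and extracted metadata

def triage(objects, accept_at=0.8, reject_below=0.3):
    """Split crawled objects into (export, review) URL lists, best-scoring first."""
    export, review = [], []
    for obj in sorted(objects, key=lambda o: o.relevance, reverse=True):
        if obj.relevance >= accept_at:
            export.append(obj.url)   # selected automatically for WARC export
        elif obj.relevance >= reject_below:
            review.append(obj.url)   # left for a human decision
        # anything below reject_below is not exported
    return export, review

# Hypothetical crawl results.
crawl = [
    WebObject("http://example.org/a", 0.95),
    WebObject("http://example.org/b", 0.50),
    WebObject("http://example.org/c", 0.10),
    WebObject("http://example.org/d", 0.85),
]
export, review = triage(crawl)
```

The two thresholds are the tuning knobs: moving them apart sends more content to manual review, moving them together makes the selection more automatic.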
    • 8. ARCOMEM System Configurations (1/3)
      • The ARCOMEM System is complex
        – The development aim was to be generic, to serve as many Web archiving goals as possible
        – Large number of phases and components
        – Complex handling and maintenance of the whole system
      • But not every user needs all functionalities
        – A subset is often enough
        – Phases can be used separately
    • 9. ARCOMEM System Configurations (2/3)
      Crawler Configurations
      • Heritrix + Online Analysis
        – Simple configuration
        – Completely open source
        – Runs on standard servers
        – Interesting for a broad group of organizations that do small to medium sized focused crawls
      • Large Scale Crawler + Online Analysis + Offline Analysis (+ Cross Crawl Analysis)
        – Complex high-throughput system
        – Requires big clusters or server farms
        – Analysis steps have to be selected carefully depending on user requirements, e.g. not every crawl requires video analysis
        – Mainly interesting for service providers (e.g. Internet Memory Foundation) or other organizations with large scale crawl requirements (e.g. national libraries)
    • 10. ARCOMEM System Configurations (3/3)
      Offline Analysis / Cross Crawl Processing
      • Analysis modules can be used independently of the crawler to analyze and enrich existing Web crawls
      • Analysis steps have to be selected carefully depending on user requirements
      • Depending on the analysis, this requires big clusters or server farms
      • Interesting for service providers (e.g. Internet Memory Foundation, university computing centers)
      Applications / User Interfaces
      • Crawler Cockpit
        – Easy user interface for crawler control
        – Interesting for all crawler users
      • SARA
        – Generic tool for content exploration
        – Interesting for all end users of Web archives
        – Interesting for service providers to deliver results
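The configurations described on slides 8 to 10 can be summarised as a small data structure: pick a crawler (or none, when enriching existing crawls) and switch the analysis phases on or off. The sketch below is purely illustrative and is not an ARCOMEM configuration format. The names `Crawler`, `SystemConfig` and `may_need_cluster` are assumptions, and the cluster check is a coarse rule of thumb taken from the slides, which say that the large scale setup requires clusters and that offline analysis does so "depending on the analysis".

```python
from dataclasses import dataclass
from enum import Enum

class Crawler(Enum):
    NONE = "none"                # analysis of already existing crawls only
    HERITRIX = "heritrix"
    LARGE_SCALE = "large-scale"  # e.g. the IMF Large Scale Crawler

@dataclass(frozen=True)
class SystemConfig:
    crawler: Crawler
    online_analysis: bool = False
    offline_analysis: bool = False
    cross_crawl_analysis: bool = False

    def may_need_cluster(self) -> bool:
        """Coarse heuristic: True if the setup may call for a cluster or server farm."""
        return (
            self.crawler is Crawler.LARGE_SCALE
            or self.offline_analysis
            or self.cross_crawl_analysis
        )

# The three setups named on the slides, expressed with the hypothetical types above.
focused_crawl = SystemConfig(Crawler.HERITRIX, online_analysis=True)
full_system = SystemConfig(Crawler.LARGE_SCALE, True, True, True)
enrich_existing = SystemConfig(Crawler.NONE, offline_analysis=True)
```

Making the phases independent flags mirrors the point of slide 8: a subset of the system is often enough, and phases can be used separately.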
    • 11. Thank You
      Contact details: Dr. Thomas Risse, +49 511 762 17764, risse@L3S.de, www.arcomem.eu
