The document provides an overview of how to specify crawls using the ARCOMEM crawler cockpit. It begins with an introduction to how the web works and key concepts like HTTP, HTML, URLs, and domains. It then describes the different components of the ARCOMEM crawler system, including the Memory Bot crawler, Adaptive Heritrix, the API Crawler, and the Application-Aware Helper. The document concludes by explaining the functionality of the ARCOMEM Crawler Cockpit user interface, which allows users to launch, monitor, and export crawls by focusing on keywords, entities, URLs or social media categories.
This presentation on Specifying Crawls is part of the ARCOMEM training curriculum. Feel free to roam around or contact us on Twitter via @arcomem to learn more about ARCOMEM training on archiving Social Media.
ARCOMEM training: Specifying Crawls (Beginners), by arcomem
2. Training Goals
➔ Help users specify a campaign properly
➔ Help users understand what goes on in the back end of the ARCOMEM platform
➔ Set up a campaign in the crawler cockpit
3. Plan
What is the Web? Challenges and SOA
ARCOMEM platform
Crawler
Set up a campaign in the ARCOMEM Crawler Cockpit
4. Introduction: How does the web work?
➔ The web is governed by protocols and standards:
• HTTP: Hypertext Transfer Protocol
• HTML: HyperText Markup Language
• URL: Uniform Resource Locator
• DNS: Domain Name System
➔ Each server has an address: an IP address
• Example: http://213.251.150.222/ -> http://collections.europarchive.org
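The hostname-to-IP mapping above can be sketched as a simple lookup table. This is purely illustrative: a real resolver queries DNS servers (in Python, `socket.gethostbyname` does this), but the toy table avoids needing network access and reuses the slide's own example address.

```python
# Toy DNS resolver: the mapping from the slide's example, sketched as
# a lookup table. A real resolver queries DNS servers over the network.
dns_table = {
    "collections.europarchive.org": "213.251.150.222",
}

def resolve(hostname):
    """Map a hostname to its IP address using the toy table."""
    return dns_table[hostname]

print(resolve("collections.europarchive.org"))  # 213.251.150.222
```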
5. WWW
The World Wide Web (WWW), commonly known as the Web, is a system of interlinked hypertext documents accessed via the Internet. It is a large space of communication and information:
• managed by servers that talk to each other by convention (protocols) through applications in a large network
• a naming space organized and controlled by ICANN
6. HTTP - Hypertext Transfer Protocol
➔ Client/server model
• A request-response protocol in the client-server computing model
➔ How does it work?
• The client asks for content
• The browser resolves the server's address via DNS, connects to the server and sends it a request
• The server hosts the content and delivers it
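The request-response exchange above can be made concrete with raw HTTP messages. This is a minimal sketch: the host `example.org`, the path, and the canned response are invented for illustration, and the parser handles only the simple case of one header block followed by a body.

```python
# A minimal HTTP exchange. The client sends a request line plus
# headers; the server answers with a status line, headers, and a body.
request = (
    "GET /index.html HTTP/1.1\r\n"
    "Host: example.org\r\n"
    "Connection: close\r\n"
    "\r\n"
)

# A typical (abridged) server response:
response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><body>Hello</body></html>"
)

def parse_response(raw: str):
    """Split a raw HTTP response into (status code, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    status = int(status_line.split()[1])
    headers = dict(h.split(": ", 1) for h in header_lines)
    return status, headers, body

status, headers, body = parse_response(response)
print(status, headers["Content-Type"])  # 200 text/html
```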
7. HTML - HyperText Markup Language
➔ Markup language for Web pages
➔ Written in the form of HTML elements
➔ Creates structured documents by denoting structural semantic elements for text, such as headings, paragraphs, titles, links, quotes, and other items
➔ Allows text and embedded objects such as images
➔ Example: http://www.w3.org/
8. URI - URL
➔ A URL (Uniform Resource Locator) specifies where an identified resource is available and the mechanism for retrieving it.
➔ Examples:
– http://host.domain.extension/path/pageORfile
– http://www.europarchive.org
– http://collections.europarchive.org/
– http://www.europarchive.org/about.php
Samos 2013 – Workshop: The ARCOMEM Platform
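The URL anatomy above (scheme, host, path) can be inspected with Python's standard `urllib.parse`, here applied to one of the slide's own example URLs:

```python
# Split a URL into the parts named on the slide: scheme, host, path.
from urllib.parse import urlparse

parts = urlparse("http://www.europarchive.org/about.php")
print(parts.scheme)   # http
print(parts.netloc)   # www.europarchive.org
print(parts.path)     # /about.php
```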
9. Domain name and extension
➔ Domain names are managed by the Internet Corporation for Assigned Names and Numbers (ICANN), a non-profit organization, and allocated by registrars.
• http://www.icann.org
➔ ICANN coordinates allocation and assignment to ensure the universal resolvability of:
• Domain names (forming a system referred to as "DNS")
• Internet protocol ("IP") addresses
• Protocol port and parameter numbers
➔ Several types of TLD:
• First-level TLDs: .com, .info, etc.
• gTLDs: .aero, .biz, .coop, .info, .museum, .name, and .pro
• ccTLDs (country code Top Level Domains): .fr
10. What kind of content?
➔ Different types of content: text, multimedia, video, images
➔ Different types of producers:
• public: institution, government, museum, TV...
• private: foundation, company, press, people, blog...
• Examples:
http://ec.europa.eu/index_fr.htm
http://iawebarchiving.wordpress.com/
http://www.nytimes.com/
➔ Each producer is in charge of its own content
• Information can disappear: fragility
• Size
11. Social web
➔ Focus on people's socialization and interaction
• Characteristics:
• A walled space in which users can interact
• Creation of social networks
➔ Web archiving challenges in terms of content, privacy and technique
• Examples:
• Sharing bookmarks (Del.icio.us, Digg), videos (Dailymotion, YouTube), photos (Flickr, Picasa)
• Communities (MySpace, Facebook)
12. Example of technical difficulties: videos
➔ Standard HTTP protocol
• obfuscated links to the video files
• dynamic playlists, channels or configuration files loaded by the player; several hops and redirects to the server hosting the video content
• e.g. YouTube
➔ Streaming protocols: RTSP, RTMP, MMS...
• real-time protocols implemented by the video players, suited for large video files (control commands) or live broadcasts
• sometimes proprietary protocols (e.g. RTMP - Adobe)
• available tools: MPlayer, FLVStreamer, VLC
13. Deep / Hidden Web
• Deep web: content accessible behind passwords, databases, payment... and hidden from search engines
http://c.asselin.free.fr/french/schema_webinvisible.htm (diagram based on the figure "Distribution of Deep Web sites by content type" from the Bright Planet study)
14. How do we archive it?
➔ Challenges for archiving:
– dynamic websites
➔ Technical barriers:
• some JavaScript
• Flash animations
• pop-ups
• streamed video and audio
• restricted access
➔ Traps: spam and loops
15. What does a user need to do web archiving?
➔ Define the target content (website, URL, topic...)
➔ A tool to manage the campaign
➔ An intelligent crawler to archive content
16. Management tools (1)
Several tools already exist, developed by libraries that do web archiving.
➔ NetarchiveSuite (http://netarchive.dk/suite/)
The NetarchiveSuite software was originally developed by the two national deposit libraries in Denmark, The Royal Library and The State and University Library, and has been running in production, harvesting the Danish world wide web, since 2005. The French National Library and the Austrian National Library joined the project in 2008.
➔ Web Curator Tool (http://webcurator.sourceforge.net)
Open-source workflow management application for selective web archiving, developed by the National Library of New Zealand and the British Library, initiated by the International Internet Preservation Consortium.
➔ Archive-It (http://www.archive-it.org/)
A subscription service by the Internet Archive to build and preserve collections: allows users to harvest, catalogue, manage and browse archived collections.
➔ Archivethe.net (http://archivethe.net/fr/)
Service provided by the Internet Memory Foundation.
➔ ARCOMEM Crawler Cockpit
17. How does a crawler work?
➔ A crawler is a bot that parses web pages in order to index or archive them. The robot navigates by following links.
➔ Links are at the center of the crawling problem:
• Explicit links: the source code is available and the full path is explicitly stated
• Variable links: the source code is available but uses variables to encode the path
• Opaque links: the source code is not available
Example: http://www.thetimes.co.uk/tto/news/
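The "explicit links" case above is the one a crawler can handle directly: parse the fetched page and collect the `href` targets to follow next. A minimal sketch using only Python's standard library; the HTML snippet and URLs are invented for illustration.

```python
# Extract explicit <a href="..."> links from a page and resolve
# relative paths against the page's own URL.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> elements."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative paths against the page's URL.
                    self.links.append(urljoin(self.base_url, value))

page = '<html><body><a href="/news/">News</a> <a href="http://example.org/">Ext</a></body></html>'
extractor = LinkExtractor("http://www.site.com/home.html")
extractor.feed(page)
print(extractor.links)  # ['http://www.site.com/news/', 'http://example.org/']
```

Variable and opaque links are exactly the cases this simple approach misses, which is why the ARCOMEM platform needs application-aware components.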
18. Parameters
➔ The scoping function defines how deep the crawl will go
• Complete or specific content of a website
• Discovery or focused crawl
➔ Politeness
• Follow the common rules of politeness
➔ Robots.txt
• Follow it
➔ Frequency
• How often do I want to launch a crawl on this target?
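Robots.txt compliance, one of the politeness rules above, is available in Python's standard `urllib.robotparser`. A short sketch; the robots.txt content, user agent name, and URLs are invented for illustration:

```python
# Check candidate URLs against a site's robots.txt before fetching.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The crawler consults the parser for each URL in its queue.
print(rp.can_fetch("MyCrawler", "http://www.site.com/news/"))      # True
print(rp.can_fetch("MyCrawler", "http://www.site.com/private/x"))  # False
print(rp.crawl_delay("MyCrawler"))  # 2 (seconds between requests)
```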
21. Memory Bot
• Component Name: IMF Large Scale Crawler
– The large scale crawler retrieves content from the web and stores it in an HBase repository. It aims at being scalable: crawling at a fast rate from the start and slowing down as little as possible as the number of visited URLs grows to hundreds of millions, all while observing politeness conventions (rate regulation, robots.txt compliance, etc.).
• Input:
– URLs with a score (seeds, then URLs output by the analysis process)
• Output:
– Web resources written to WARC files. We have also developed an importer to load these WARC files into HBase. Some metadata is also extracted: HTTP status code, identified outlinks, MIME type, etc.
23. Adaptive Heritrix
➔ Component Name: Adaptive Heritrix
➔ Description: Adaptive Heritrix is a modified version of the open source crawler Heritrix that allows dynamic reordering of queued URLs and receives URLs from the Online Analysis module.
24. How does Adaptive Heritrix work?
➔ The prioritization module communicates new scores to the crawler queue using JSON over HTTP: it sends a POST to http://QUEUE_SERVER/update. The request body is a JSON-encoded array of update objects:
➔ {"url": "http://google.com/", "score": 0.3, "parentUrl": "http://seed.tld/page"},
➔ {"url": "http://spam.net/", "blacklisted": true, "parentUrl": "http://seed.tld/page"}
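The POST described above can be sketched with Python's standard library. `QUEUE_SERVER` is the placeholder from the slide; the code builds the request but a live queue server would be needed to actually send it.

```python
# Build the JSON-over-HTTP priority update described on the slide.
import json
from urllib.request import Request

updates = [
    {"url": "http://google.com/", "score": 0.3,
     "parentUrl": "http://seed.tld/page"},
    {"url": "http://spam.net/", "blacklisted": True,
     "parentUrl": "http://seed.tld/page"},
]

body = json.dumps(updates).encode("utf-8")
req = Request(
    "http://QUEUE_SERVER/update",
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
print(req.method, req.full_url)
# urllib.request.urlopen(req) would deliver it to a live queue server.
```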
25. API Crawler
➔ Component Name: API Crawler
➔ Description:
• The API Crawler is a solution for managing keyword-based crawls of different social platforms using their Web APIs. It is controlled via a RESTful Web interface. Scalability and performance: 3000 requests per hour, millions of triples per hour, millions of links per hour.
➔ Input: a list of tuples (keyword, platform)
➔ Output: triples stored in the triple store and WARC files stored in HDFS
➔ Twitter restriction: 180 requests / 15 min; one request is one criterion, and each request returns 100 answers
26. How does the API Crawler work?
➔ Principles: a crawler runs crawls. Each crawl has a crawl ID assigned by the pipeline; the pipeline ensures crawl IDs are unique. A crawl has four states: running, stopped, being deleted, deleted. A crawl runs until it ends by itself or until a stop order is received. Only a stopped crawl can be deleted.
➔ The API Crawler produces three kinds of data:
– semi-structured data stored as triples in the triple store,
– outlinks sent to Heritrix or the IMF crawler,
– and WARC files saved in the file system, which may also be inserted into HBase.
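The crawl lifecycle described above (four states, stop orders, and the rule that only a stopped crawl can be deleted) can be sketched as a tiny state machine. The class and method names are invented for illustration, not part of the ARCOMEM codebase.

```python
# Toy model of the crawl lifecycle: running -> stopped -> being
# deleted -> deleted. Deleting a non-stopped crawl is rejected.
class Crawl:
    STATES = ("running", "stopped", "being deleted", "deleted")

    def __init__(self, crawl_id):
        # The pipeline assigns a unique crawl ID.
        self.crawl_id = crawl_id
        self.state = "running"

    def stop(self):
        """Handle a stop order (a crawl may also end by itself)."""
        if self.state == "running":
            self.state = "stopped"

    def delete(self):
        """Only a stopped crawl can be deleted."""
        if self.state != "stopped":
            raise ValueError(f"cannot delete a {self.state} crawl")
        self.state = "being deleted"
        self.state = "deleted"

crawl = Crawl("crawl-001")
crawl.stop()
crawl.delete()
print(crawl.crawl_id, crawl.state)  # crawl-001 deleted
```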
29. Application-Aware Helper
➔ Component Name: Application-Aware Helper
– The goal of this software component is to make the crawler aware of the particular kind of Web application being crawled, in terms of general classification of websites (wiki, social network, blog, web forum, etc.), technical implementation (MediaWiki, WordPress, etc.), and their specific instances (Twitter, CNN, etc.).
➔ Input:
– HTML content as a string, base URL, list of outlinks
➔ Output:
– An augmented document (the original text document plus structured objects extracted from the web page) and extracted links with scores, sent to the ARCOMEM framework module. Extracted semantic objects, crawling actions, and outlinks with scores are also stored in the ARCOMEM database.
31. How does the AAH work?
➔ The application-aware helper is assisted by a knowledge base that helps it recognize a specific web application and the related crawling actions.
➔ Since the knowledge base will grow and there will be several detection patterns for many web applications, we have to ensure that the web application detection module does not slow down the crawling process and affect overall performance.
➔ To ensure scalability after integrating the application-aware helper with the crawler, we use the YFilter system (an NFA-based filtering system) for efficient indexing of detection patterns in order to quickly find the relevant web application.
➔ Each state is represented by XPath expression patterns, and common steps of a path expression are represented only once in the structure. The introduction of YFilter in the web application detection module improves performance, and the system is now well synchronized with the other sub-modules of the crawling process.
33. ARCOMEM Crawler Cockpit
• Requirements described by the ARCOMEM user partners (SWR, DW)
• Designed and implemented by IMF
• A UI on top of the ARCOMEM system
• Demo: Crawler Cockpit
35. Crawler Cockpit: Functionality
• Launch crawls following scheduler specifications
• Monitor crawls and get real-time feedback on the progress of the
crawlers
• Run crawls with the HTML crawlers (Heritrix and the IMF crawler)
• Export the crawled content to a WARC file
• Set up a campaign by focusing on an event, keywords, entities and
URLs
• Focus on target content in a Social Media Category (blog, forum,
video, photo...)
• Run crawls using the API crawler (Twitter, Facebook, YouTube,
Flickr)
• Get a campaign overview with qualified statistics
• Refine the crawl at run time to better focus on the target content
• Decide what content to archive
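The WARC export mentioned above can be illustrated with a minimal hand-rolled WARC/1.0 response record. This is a sketch only; a production export would normally rely on a dedicated WARC library rather than assembling records by hand:

```python
from datetime import datetime, timezone
from uuid import uuid4

def warc_response_record(url: str, http_payload: bytes) -> bytes:
    # Build one minimal WARC/1.0 'response' record: named headers, a blank
    # line, the captured HTTP payload, then the record separator.
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid4()}>",
        "WARC-Date: " + datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        f"WARC-Target-URI: {url}",
        "Content-Type: application/http;msgtype=response",
        f"Content-Length: {len(http_payload)}",
    ]
    return ("\r\n".join(headers) + "\r\n\r\n").encode() + http_payload + b"\r\n\r\n"

record = warc_response_record("http://example.org/", b"HTTP/1.1 200 OK\r\n\r\nhello")
```

A WARC file is simply a concatenation of such records, so exporting a crawl amounts to appending one record per captured resource.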
36. Crawler Cockpit Navigation
• Set-up: a campaign is described by an intelligent crawl
definition, which associates the content target with crawl
parameters (schedule and technical parameters).
• Monitor: gives access to statistics provided by the crawler
at run time.
• Overview: a global dashboard on a campaign. The
information is organized by topic: general description of the
campaign, metadata, current status, crawl activity, statistics
and analysis.
• Inspector: a tool to access the content as it is stored in HBase.
• Report: specifications and parameters of a campaign.
37. CC: Overview Tab
Global dashboard on a campaign:
• General description of the campaign
• Crawl activity
• Keywords
• Statistics
• Refine mode: the user can give more or less weight to a
keyword.
38. CC Set-up Tab
• General description
• Distinct named entities (e.g. person, geographic location,
and organization), time period, free keywords and language
• A selection of up to nine SMC (Social Media Categories)
• Schedule: each campaign has a start and end date. The
frequency of the crawl is defined by choosing an interval.
39. Focus on the Scoping Function
Domain: the entire web site
http://www.site.com
Path: only a specific directory of a website
http://www.site.com/actu
Sub-domain:
http://sport.site.com
Page + context:
http://www.site.com/home.html
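These four scoping modes can be sketched as a simple hostname/path comparison. The function name and its logic are illustrative assumptions, not the ARCOMEM implementation (in particular, the "domain" case naively takes the last two host labels as the registered domain):

```python
from urllib.parse import urlsplit

def in_scope(url: str, scope: str, seed: str) -> bool:
    # Hypothetical scope check mirroring the four modes above.
    u, s = urlsplit(url), urlsplit(seed)
    if scope == "page":        # a single page (plus its context)
        return u.hostname == s.hostname and u.path == s.path
    if scope == "path":        # only a specific directory of a website
        return u.hostname == s.hostname and u.path.startswith(s.path)
    if scope == "subdomain":   # one sub-domain only, e.g. sport.site.com
        return u.hostname == s.hostname
    if scope == "domain":      # the entire web site, any sub-domain
        root = ".".join(s.hostname.split(".")[-2:])  # naive: last two labels
        return u.hostname == root or u.hostname.endswith("." + root)
    return False

in_scope("http://www.site.com/actu/a.html", "path", "http://www.site.com/actu")  # True
in_scope("http://sport.site.com/x", "domain", "http://www.site.com")             # True
```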
40. Focus on the Scheduler
Frequency: weekly, monthly, quarterly…
Interval: 1 to 9
Calendar: a campaign has a start date and
an end date.
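A sketch of how such a scheduler could enumerate launch dates from these three settings. This is illustrative only, and monthly/quarterly steps are approximated with fixed day counts:

```python
from datetime import date, timedelta

def crawl_dates(start: date, end: date, frequency: str, interval: int):
    # Yield launch dates between start and end; monthly and quarterly
    # frequencies are approximated as 30 and 91 days (a sketch).
    step = {"weekly": 7, "monthly": 30, "quarterly": 91}[frequency] * interval
    d = start
    while d <= end:
        yield d
        d += timedelta(days=step)

dates = list(crawl_dates(date(2013, 1, 1), date(2013, 2, 28), "weekly", 2))
# every 2 weeks: Jan 1, Jan 15, Jan 29, Feb 12, Feb 26
```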
41. CC Inspector Tab
The Inspector tab allows the user to:
• Check the quality of the content before indexing
• Access the content (from HBase), metadata and triples
directly related to a resource
• Browse a list of URLs ranked by online-analysis scores
42. CC Monitor Tab
The Monitor tab gives real time
statistics on the running crawl.
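The speaker notes describe the Monitor tab's progress bar as a ratio between seen URLs (already crawled) and unseen URLs (discovered but still queued); a minimal sketch of that estimate:

```python
def crawl_progress(seen: int, unseen: int) -> float:
    # seen = URLs already crawled; unseen = discovered but waiting to be
    # crawled. Progress is crawled URLs over all URLs recorded so far.
    total = seen + unseen
    return seen / total if total else 0.0

crawl_progress(300, 100)  # 0.75
```

Note that this estimate can go down as well as up, since every crawled page may add newly discovered URLs to the unseen set.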
{"38":"For each campaign, the archivist can select which SMC, he wants to focus on (blogs, video, discussion) and he does the same for the API crawler (Facebook, Twitter, Flickr, YouTube…).\n","27":"http://www.arcomem.eu/wp-content/uploads/2012/05/D5_2.pdf\n","16":"Netarchivesuite (http://netarchive.dk/suite/) developed by the two national deposit libraries in Denmark, The Royal Library and The State and University Library\nto plan, schedule and run web harvests for selective and broad crawl\nbuilt-in bit preservation functionality\nWeb curator tool: http://webcurator.sourceforge.net\nOpen-source workflow management application for selective web archiving developped by the National Library of New Zealand and the British Library, initiated by the International Internet Preservation Consortium\nArchive-it http://www.archive-it.org/\nA subscription service by Internet Archive to build and preserve collections: allows to\nharvest, catalog, manage and browse archived collections\nArcomem crawler cokpit\n","6":"There is several protocol : \nMai protocol as \nPOP3 (post office protocol version 3)\nSMTP (simple mail transfer protocol\nDNS Domain name service\nDHCP Dynamic Host configuration \nFTP File transfer Protocole\nIMAP Internet Message Access Protocole\n","34":"A crawl is guided by the crawl specifications defined by the user. The crawl specification contains URLs to start the discovery from seeds, keywords to look for in web pages, social web sites APIs to query (and with which keywords) and Social Media Categories (SMC) to focus the crawl on. The seeds get fetched, and the corresponding content and the social sites API query responses are inserted into the document store. The insertion triggers the online analysis process. The Web resources and the links extracted from them are analyzed and scored by the Online Analysis Modules. 
The links get sent to the crawler’s URL queue, where their score is used to determine the order in which they should be crawled, thereby guiding the crawler. The newly crawled content gets written to the document store, completing the loop. On top of the prototype, a UI allows the user to target topics to archive and offers some analyses of collected data.\n","23":"http://www.arcomem.eu/wp-content/uploads/2012/05/D5_2.pdf\n","24":"http://www.arcomem.eu/wp-content/uploads/2012/05/D5_2.pdf\n","13":"A lot of data are stored in DB hidden to search engine like google are not available for such engine,\nmoreover many pages are created dynamicaly to answer to queries so hey do not existbefor user requested information. \nThis enorme reservoir \nhttp://www.dailymotion.com/video/x9udyo_the-virtual-private-library-and-dee_news\n","8":"URI Uniform Resource Identifier (URI) is a string of characters used to identify a name or a resource on the Internet.\n","42":"On the top of the page, a progression bar gives an estimation of crawl progress until completion. It is a ratio between seen and unseen URL recorded by the crawler. Seen URLs are all the URLs which have been already crawled. Unseen are the URLs, which have been discovered but are waiting to be crawled \n","31":"http://www.arcomem.eu/wp-content/uploads/2012/05/D5_2.pdf\n","20":"Crawler cokpit send order to the crawler. An order is an « intelligent crawl specification ». It is created with the set-up of hte campaign. \nThis order is send to the crawler according to the scheduler. \n","37":". The information is organized following different topics: general description of the campaign, metadata, current status, crawl activity, statistics and analysis \n","4":"To find an information online, I have to know is address. \nLe système de nom de domaine (Domain Name System - DNS) aide les utilisateurs à naviguer sur Internet. 
Chaque ordinateur relié à Internet a une adresse unique appelée “adresse IP” (adresse de protocole Internet). Étant donné que les adresses IP (qui sont des séries de chiffres) sont difficiles à mémoriser, le DNS permet d’utiliser à la place une série de lettres familières (le “nom de domaine”). Par exemple, au lieu de taper “192.0.34.163,” vous pouvez taper “www.icann.org.”\n","21":"http://www.arcomem.eu/wp-content/uploads/2012/05/D5_2.pdf\n","10":"On line information heterogeneous\nthere is copy online \n"}