Arcomem training Specifying Crawls Beginners

This presentation on Specifying Crawls is part of the ARCOMEM training curriculum. Feel free to roam around or contact us on Twitter via @arcomem to learn more about ARCOMEM training on archiving Social Media.
2. Training Goals
➔ Help the user specify the campaign properly
➔ Help the user understand what is going on in the back end of the ARCOMEM platform
➔ Set up a campaign in the Crawler Cockpit
3. Plan
What is the Web? Challenges and state of the art
The ARCOMEM platform
The crawler
Setting up a campaign in the ARCOMEM Crawler Cockpit
4. Introduction: How does the Web work?
➔ The Web is managed by protocols and standards:
• HTTP: Hypertext Transfer Protocol
• HTML: HyperText Markup Language
• URL: Uniform Resource Locator
• DNS: Domain Name System
➔ Each server has an address: its IP address
• Example: http://213.251.150.222/ -> http://collections.europarchive.org
5. WWW
The Web is a large space of communication and information:
• managed by servers which talk to each other by convention (protocols) and through applications in a large network
• an organized and controlled naming space (ICANN)
The World Wide Web, abbreviated as WWW and commonly known as the Web, is a system of interlinked hypertext documents accessed via the Internet.
6. HTTP: Hypertext Transfer Protocol
➔ Client/server model
• a request-response protocol in the client-server computing model
➔ How does it work?
• The client asks for content
• The server hosts the content and delivers it
• The browser looks up the server via DNS, connects to the server and sends it a request
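The request-response exchange above can be sketched without touching the network. A minimal illustration in Python; the request built here and the canned server reply are hypothetical examples, not real traffic:

```python
# Sketch of the HTTP request-response model (no network needed).

def build_request(host: str, path: str) -> str:
    """Build a minimal HTTP/1.1 GET request for the given host and path."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )

def parse_response(raw: str) -> tuple[int, str]:
    """Split a raw HTTP response into (status code, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line = head.split("\r\n")[0]            # e.g. "HTTP/1.1 200 OK"
    status_code = int(status_line.split(" ")[1])
    return status_code, body

request = build_request("collections.europarchive.org", "/")
canned_reply = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>ok</html>"
status, body = parse_response(canned_reply)
print(request.splitlines()[0])  # GET / HTTP/1.1
print(status, body)             # 200 <html>ok</html>
```

In a real exchange the request string would be sent over a TCP socket to port 80 and the reply read back; the message shapes are exactly these.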
7. HTML: HyperText Markup Language
➔ Markup language for Web pages
➔ Written in the form of HTML elements
➔ Creates structured documents by denoting structural semantic elements for text such as headings, paragraphs, titles, links, quotes, and other items
➔ Allows text and embedded objects such as images
➔ Example: http://www.w3.org/
8. URI - URL
➔ A URL (Uniform Resource Locator) specifies where an identified resource is available and the mechanism for retrieving it.
➔ Examples:
– http://host.domain.extension/path/pageORfile
– http://www.europarchive.org
– http://collections.europarchive.org/
– http://www.europarchive.org/about.php
Samos 2013 – Workshop: The ARCOMEM Platform
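The host.domain.extension/path anatomy shown in the examples can be inspected with Python's standard urllib.parse:

```python
from urllib.parse import urlparse

# Decompose the example URLs from the slide into scheme / host / path.
for url in [
    "http://www.europarchive.org",
    "http://collections.europarchive.org/",
    "http://www.europarchive.org/about.php",
]:
    parts = urlparse(url)
    print(parts.scheme, parts.netloc, parts.path or "/")
# http www.europarchive.org /
# http collections.europarchive.org /
# http www.europarchive.org /about.php
```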
9. Domain names and extensions
➔ Managed by ICANN, the Internet Corporation for Assigned Names and Numbers, a non-profit organization; names are allocated by registrars.
• http://www.icann.org
➔ ICANN coordinates allocation and assignment to ensure the universal resolvability of:
• domain names (forming a system referred to as "DNS")
• Internet Protocol ("IP") addresses
• protocol port and parameter numbers
➔ Several types of TLD:
• first-level TLDs: .com, .info, etc.
• gTLDs: .aero, .biz, .coop, .info, .museum, .name, and .pro
• ccTLDs (country code Top Level Domains): .fr
10. What kind of content?
➔ Different types of content: text, video, images, multimedia
➔ Different types of producers:
• public: institutions, governments, museums, TV...
• private: foundations, companies, press, individuals, blogs...
http://ec.europa.eu/index_fr.htm
http://iawebarchiving.wordpress.com/
http://www.nytimes.com/
➔ Each producer is in charge of its own content
• Information can disappear: fragility
• Size
11. Social web
➔ Focus on people's socialization and interaction
• Characteristics:
• a walled space in which users can interact
• creation of social networks
➔ For web archives, this raises challenges in terms of content, privacy and technique.
• Examples:
• sharing bookmarks (Del.icio.us, Digg), videos (Dailymotion, YouTube), photos (Flickr, Picasa)
• communities (MySpace, Facebook)
12. Example of technical difficulties: videos
➔ Standard HTTP protocol
• obfuscated links to the video files
• dynamic playlists and channels, or configuration files loaded by the player; several hops and redirects to the server hosting the video content
• e.g.: YouTube
➔ Streaming protocols: RTSP, RTMP, MMS...
• real-time protocols implemented by video players, suited for large video files (control commands) or live broadcasts
• sometimes proprietary protocols (e.g.: RTMP, from Adobe)
• available tools: MPlayer, FLVStreamer, VLC
13. Deep / Hidden Web
• Deep web: content accessible behind passwords, databases, payment walls... and hidden from search engines
http://c.asselin.free.fr/french/schema_webinvisible.htm (diagram based on the figure "Distribution of Deep Web sites by content type" from the Bright Planet study)
14. How do we archive it?
➔ Challenges for archiving:
– dynamic websites
➔ Technical barriers:
• some JavaScript
• Flash animations
• pop-ups
• streamed video and audio
• restricted access
➔ Traps: spam and loops
15. What does a user need to do web archiving?
➔ A definition of the target content (website, URL, topic…)
➔ A tool to manage the campaign
➔ An intelligent crawler to archive the content
16. Management tools (1)
Several tools already exist, developed by libraries that do web archiving.
➔ NetarchiveSuite (http://netarchive.dk/suite/)
The NetarchiveSuite software was originally developed by the two national deposit libraries in Denmark, The Royal Library and The State and University Library, and has been running in production, harvesting the Danish world wide web, since 2005. The French National Library and the Austrian National Library joined the project in 2008.
➔ Web Curator Tool (http://webcurator.sourceforge.net)
Open-source workflow management application for selective web archiving, developed by the National Library of New Zealand and the British Library, initiated by the International Internet Preservation Consortium.
➔ Archive-It (http://www.archive-it.org/)
A subscription service by the Internet Archive to build and preserve collections: allows users to harvest, catalogue, manage and browse archived collections.
➔ Archivethe.net (http://archivethe.net/fr/)
A service provided by the Internet Memory Foundation.
➔ ARCOMEM Crawler Cockpit
17. How does a crawler work?
➔ A crawler is a bot that parses web pages in order to index and/or archive them. The robot navigates by following links.
➔ Links are at the heart of the crawling problem:
• explicit links: the source code is available and the full path is explicitly stated
• variable links: the source code is available but uses variables to encode the path
• opaque links: the source code is not available
Example: http://www.thetimes.co.uk/tto/news/
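The fetch-parse-follow loop for the "explicit links" case can be sketched against a hypothetical in-memory site; real crawlers such as Heritrix add scoping, politeness and frontier ranking on top of this skeleton:

```python
from html.parser import HTMLParser
from collections import deque

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag: the 'explicit links' case."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

# Hypothetical mini-web: URL -> HTML source (stands in for real fetches).
SITE = {
    "http://site.example/": '<a href="http://site.example/a">A</a>',
    "http://site.example/a": '<a href="http://site.example/">home</a>',
}

def crawl(seed: str) -> list[str]:
    """Breadth-first traversal following links; each URL is visited once."""
    seen, order, frontier = {seed}, [], deque([seed])
    while frontier:
        url = frontier.popleft()
        order.append(url)
        parser = LinkExtractor()
        parser.feed(SITE.get(url, ""))   # a real crawler would fetch here
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

print(crawl("http://site.example/"))
# ['http://site.example/', 'http://site.example/a']
```

The `seen` set is what prevents the loop trap mentioned on slide 14: page A links to B and B links back to A, but each is crawled only once.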
18. Parameters
➔ Scoping: defines how deep the crawl will go
• complete or specific content of a website
• discovery or focused crawl
➔ Politeness
• follow the common rules of politeness
➔ Robots.txt
• follow it
➔ Frequency
• How often do I want to launch a crawl on this target?
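The robots.txt rule above can be honored with the standard-library urllib.robotparser; the rules below are a made-up example for an imaginary site:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary target site.
rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A polite crawler checks every URL before fetching it...
print(rp.can_fetch("MyCrawler", "http://site.example/index.html"))  # True
print(rp.can_fetch("MyCrawler", "http://site.example/private/x"))   # False
# ...and waits Crawl-delay seconds between requests to the same host.
print(rp.crawl_delay("MyCrawler"))  # 2
```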
20. IMF Crawler
• Component name: IMF Large Scale Crawler
– The large scale crawler retrieves content from the web and stores it in an HBase repository. It aims at being scalable: crawling at a fast rate from the start and slowing down as little as possible as the number of visited URLs grows to hundreds of millions, all while observing politeness conventions (rate regulation, robots.txt compliance, etc.).
• Output:
– Web resources written to WARC files. We have also developed an importer to load these WARC files into HBase. Some metadata is also extracted: HTTP status code, identified outlinks, MIME type, etc.
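The per-resource metadata the slide lists (HTTP status code, outlinks, MIME type) can be sketched as a small extraction function over a canned response. This is an illustration of the idea, not the IMF crawler's actual code:

```python
import re
from email.parser import Parser

# Hypothetical crawled response; in the real system this comes off the wire.
RAW = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "\r\n"
    '<a href="http://example.org/next">next</a>'
)

def extract_metadata(raw: str) -> dict:
    """Extract the metadata named on the slide: status, MIME type, outlinks."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, _, header_text = head.partition("\r\n")
    headers = Parser().parsestr(header_text)      # HTTP headers parse like email headers
    return {
        "status": int(status_line.split()[1]),
        "mime": headers.get_content_type(),       # strips the charset parameter
        "outlinks": re.findall(r'href="([^"]+)"', body),
    }

print(extract_metadata(RAW))
# {'status': 200, 'mime': 'text/html', 'outlinks': ['http://example.org/next']}
```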
22. Adaptive Heritrix
➔ Component name: Adaptive Heritrix
➔ Description: Adaptive Heritrix is a modified version of the open source crawler Heritrix that allows the dynamic reordering of queued URLs
➔ Application Aware Helper
24. ARCOMEM Crawler Cockpit
• Requirements described by ARCOMEM user partners (SWR, DW)
• Designed and implemented by IMF
• A UI on top of the ARCOMEM system
• Demo: Crawler Cockpit
26. Crawler Cockpit: Functionality
• Launch crawls following scheduler specifications
• Set up a campaign by focus, event, keyword, entity and URL
• Monitor crawls and get real-time feedback on the progress of the crawlers
• Focus on target content by Social Media Category (blog, forum, video, photo...)
• Run crawls using API crawlers (Twitter, Facebook, YouTube, Flickr)
• Get a campaign overview with qualified statistics
• Refine at crawl time to better focus on the target content
• Decide what content to archive
• Run crawls with HTML crawlers (Heritrix and the IMF Crawler)
• Export the crawled content to WARC files
27. Crawler Cockpit Navigation
• Set-up: a campaign is described by an intelligent crawl definition, which associates the content target with crawl parameters (schedule and technical parameters).
• Monitor: gives access to statistics provided by the crawler at run time.
• Overview: a global dashboard for a campaign. The information is organized by topic: general description of the campaign, metadata, current status, crawl activity, statistics and analysis.
• Inspector: a tool to access the content as it is stored in HBase.
• Report: the specifications and parameters of a campaign.
28. Set up a campaign
• General description
• Distinct named entities (e.g. person, geographic location, and organization), time period, free keywords and language
• A selection of up to nine SMCs (Social Media Categories)
• Schedule: each campaign has a start and end date. The frequency of the crawl is defined by choosing an interval.
29. Focus on the scoping function
Domain: the entire web site
http://www.site.com
Path: only a specific directory of a website
http://www.site.com/actu
Subdomain:
http://sport.site.com
Page + context:
http://www.site.com/home.html
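The four scope types above can be expressed as URL predicates. A minimal sketch using the slide's hypothetical site.com examples; note that real systems resolve the registered domain with the public suffix list rather than the crude "last two labels" heuristic used here:

```python
from urllib.parse import urlparse

def in_scope(url: str, scope: str, target: str) -> bool:
    """Decide whether a candidate URL falls inside the chosen crawl scope.

    scope is one of 'domain', 'path', 'subdomain', 'page';
    target is the seed, e.g. 'http://www.site.com/actu'.
    """
    u, t = urlparse(url), urlparse(target)
    if scope == "page":        # a single page (plus its embedded context)
        return url == target
    if scope == "subdomain":   # one host only, e.g. sport.site.com
        return u.netloc == t.netloc
    if scope == "path":        # only a specific directory of the site
        return u.netloc == t.netloc and u.path.startswith(t.path)
    if scope == "domain":      # the whole site, any subdomain
        # Crude heuristic: treat the last two labels as the registered domain.
        registered = ".".join(t.netloc.split(".")[-2:])
        return u.netloc == registered or u.netloc.endswith("." + registered)
    raise ValueError(f"unknown scope: {scope}")

print(in_scope("http://www.site.com/actu/a.html", "path", "http://www.site.com/actu"))  # True
print(in_scope("http://sport.site.com/x", "domain", "http://www.site.com"))             # True
print(in_scope("http://other.com/", "domain", "http://www.site.com"))                   # False
```

In a crawler, this predicate is applied to every discovered link before it is queued, which is what keeps a campaign inside its declared scope.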
30. Focus on the scheduler
Frequency: weekly, monthly, quarterly…
Interval: 1 to 9
Calendar: a campaign has a start date and an end date.
32. CC Inspector Tab
The Inspector tab allows the user to:
• check the quality of the content before indexing
• access the content (from HBase), metadata and triples directly related to a resource
• browse a list of URLs ranked by online analysis scores
33. CC Monitor Tab
The Monitor tab gives real-time statistics on the running crawl.
Speaker notes

Slide 4: To find information online, you have to know its address. The Domain Name System (DNS) helps users navigate the Internet. Every computer connected to the Internet has a unique address called an "IP address" (Internet Protocol address). Since IP addresses (which are series of numbers) are hard to memorize, DNS lets you use a familiar series of letters (the "domain name") instead. For example, instead of typing "192.0.34.163", you can type "www.icann.org".

Slide 6: There are several protocols: mail protocols such as POP3 (Post Office Protocol version 3) and SMTP (Simple Mail Transfer Protocol), DNS (Domain Name System), DHCP (Dynamic Host Configuration Protocol), FTP (File Transfer Protocol), and IMAP (Internet Message Access Protocol).

Slide 8: A URI (Uniform Resource Identifier) is a string of characters used to identify a name or a resource on the Internet.

Slide 10: Online information is heterogeneous; there are copies online.

Slide 13: A lot of data stored in databases is hidden from search engines like Google; moreover, many pages are created dynamically in answer to queries, so they do not exist before a user requests the information. An enormous reservoir. http://www.dailymotion.com/video/x9udyo_the-virtual-private-library-and-dee_news

Slide 16: NetarchiveSuite (http://netarchive.dk/suite/), developed by the two national deposit libraries in Denmark, The Royal Library and The State and University Library, is used to plan, schedule and run web harvests for selective and broad crawls, with built-in bit preservation functionality. Web Curator Tool (http://webcurator.sourceforge.net) is an open-source workflow management application for selective web archiving, developed by the National Library of New Zealand and the British Library, initiated by the International Internet Preservation Consortium. Archive-It (http://www.archive-it.org/) is a subscription service by the Internet Archive to build and preserve collections: it allows users to harvest, catalog, manage and browse archived collections. ARCOMEM Crawler Cockpit.

Slide 20: http://www.arcomem.eu/wp-content/uploads/2012/05/D5_2.pdf

Slide 22: http://www.arcomem.eu/wp-content/uploads/2012/05/D5_2.pdf

Slide 25: A crawl is guided by the crawl specifications defined by the user. The crawl specification contains URLs (seeds) to start the discovery from, keywords to look for in web pages, social web site APIs to query (and with which keywords), and Social Media Categories (SMC) to focus the crawl on. The seeds get fetched, and the corresponding content and the social site API query responses are inserted into the document store. The insertion triggers the online analysis process. The web resources and the links extracted from them are analyzed and scored by the Online Analysis Modules. The links get sent to the crawler's URL queue, where their score is used to determine the order in which they should be crawled, thereby guiding the crawler. The newly crawled content gets written to the document store, completing the loop. On top of the prototype, a UI allows the user to target topics to archive and offers some analyses of collected data.

Slide 28: For each campaign, the archivist can select which SMCs to focus on (blogs, video, discussion) and does the same for the API crawlers (Facebook, Twitter, Flickr, YouTube…).

Slide 31: The information is organized by topic: general description of the campaign, metadata, current status, crawl activity, statistics and analysis.

Slide 33: At the top of the page, a progression bar gives an estimation of crawl progress until completion. It is the ratio between seen and unseen URLs recorded by the crawler. Seen URLs are those which have already been crawled; unseen URLs are those which have been discovered but are waiting to be crawled.