CrawlerLD - Distributed Crawler for Linked Data 
RAPHAEL DO VALE
Summary 
Introduction 
Until now 
Issues 
Large Memory Footprint 
Graphical Interface
Introduction 
How can we recommend linked data sources to a beginner user? 
◦ Data sources may not use popular ontologies. 
◦ There might be more than one ontology for the same domain. 
◦ The user may not know all (if any) of the ontologies. 
Introduction 
Our solution: 
◦ Create a recommender system that receives a small set of generic URI 
resources and returns a complete report of related resources (URIs, Datasets 
and Ontologies). 
◦ Why generic? Because our user is a beginner exploring Linked Data. They do not need 
to know about specific datasets or ontologies; they only need to know how to get started. 
◦ The recommender system would benefit from a Linked Data crawler, based 
on metadata. 
Introduction 
Metadata focused crawler 
◦ INPUT: 
◦ User should summarize the desired domain with a small set of related terms (URI Resources). 
◦ OUTPUT: 
◦ The tool returns a list of vocabulary terms, as well as provenance data indicating how the output 
was generated. 
◦ With the output results, the user should evaluate the most relevant 
vocabularies for the triplification or linkage process. 
◦ This step could be manual or use another tool (e.g., a recommender system). 
Introduction 
Our solution: 
◦ Executes several SPARQL queries over the entire LOD Cloud (Linked Open Data 
Cloud). 
◦ For each dataset, applies several queries to try to discover relationships 
between datasets and the resource being crawled. 
◦ A breadth-first algorithm is used to discover more data in cycles. 
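The breadth-first cycles described above can be sketched as follows. This is a minimal illustration, not CrawlerLD's code (which is Java-based): `find_related` is a hypothetical callback standing in for the SPARQL queries, and the toy graph and `ex:` URIs are invented for the example.

```python
def crawl_breadth_first(seeds, find_related, max_levels):
    """Visit resources level by level, starting from the seed URIs.

    find_related is a hypothetical callback standing in for the SPARQL
    queries run against each dataset; it returns URIs related to `uri`.
    """
    seen = set(seeds)
    levels = [list(seeds)]            # level 0 is the user's input
    for _ in range(max_levels):
        next_level = []
        for uri in levels[-1]:
            for related in find_related(uri):
                if related not in seen:   # never crawl a resource twice
                    seen.add(related)
                    next_level.append(related)
        if not next_level:            # nothing new was discovered
            break
        levels.append(next_level)
    return levels


# Toy in-memory graph, used only for illustration.
TOY_LINKS = {
    "ex:Music": ["ex:Artist", "ex:Album"],
    "ex:Artist": ["ex:Band", "ex:Music"],
    "ex:Album": ["ex:Track"],
}


def toy_find_related(uri):
    return TOY_LINKS.get(uri, [])
```

Each inner list is one cycle of the crawl, so a "2 level task" corresponds to `max_levels=2` in this sketch.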
Until now 
Simplified Workflow: 
List of Terms Processor 
Mediator
Until now 
Processors: 
◦ Each way of retrieving data from Linked Data is mapped to a processor. 
◦ Small pieces of code that can be plugged and unplugged. 
◦ Any user can create a new processor. 
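One way to picture pluggable processors is a small registry of objects that share a single interface. The sketch below is an assumption made for illustration: the `Processor` interface, `ProcessorRegistry`, and `EchoProcessor` are hypothetical names and do not describe CrawlerLD's actual API.

```python
from abc import ABC, abstractmethod


class Processor(ABC):
    """One strategy for retrieving data about a resource (hypothetical interface)."""

    @abstractmethod
    def process(self, uri):
        """Return a list of findings for `uri`."""


class ProcessorRegistry:
    """Holds the processors that are currently plugged in."""

    def __init__(self):
        self._processors = {}

    def plug(self, name, processor):
        self._processors[name] = processor

    def unplug(self, name):
        self._processors.pop(name, None)

    def run_all(self, uri):
        # Each plugged processor contributes its own findings.
        return {name: p.process(uri) for name, p in self._processors.items()}


class EchoProcessor(Processor):
    """Toy processor that only reports the URI it was given."""

    def process(self, uri):
        return [f"visited {uri}"]
```

A user-written processor would subclass `Processor` and be registered with `plug`, without changes to the rest of the crawler.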
Until now 
Crawling stages. 
◦ Challenge: based on generic terms, how can we discover more data? 
◦ Answer: using strong relationships (sameAs, subclassOf, seeAlso and 
instanceOf). 
◦ Diagram: Schema.org, DBpedia, WordNet, Music Ontology, BBC Music (axis label: more specific)
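A query for one of these strong relationships might be built as sketched below. The property URIs are the standard OWL, RDFS, and RDF ones; mapping the slide's "instanceOf" to `rdf:type`, searching in both directions, and the `build_relationship_query` helper itself are assumptions for illustration, not CrawlerLD's actual queries.

```python
# Standard property URIs; mapping "instanceOf" to rdf:type is an assumption.
STRONG_RELATIONSHIPS = {
    "sameAs": "http://www.w3.org/2002/07/owl#sameAs",
    "subclassOf": "http://www.w3.org/2000/01/rdf-schema#subClassOf",
    "seeAlso": "http://www.w3.org/2000/01/rdf-schema#seeAlso",
    "instanceOf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
}


def build_relationship_query(resource_uri, relationship, limit=100):
    """Build a SPARQL query for resources linked to `resource_uri`.

    The link is followed in both directions (an assumption of this sketch).
    """
    prop = STRONG_RELATIONSHIPS[relationship]
    return (
        "SELECT DISTINCT ?related WHERE { "
        f"{{ <{resource_uri}> <{prop}> ?related }} UNION "
        f"{{ ?related <{prop}> <{resource_uri}> }} "
        f"}} LIMIT {limit}"
    )
```

The same query text could then be sent to each dataset's SPARQL endpoint, one request per dataset.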
Issues 
Large Memory Footprint 
◦ A 2-level task with 20 concurrent threads consumes 40 GB of RAM (!!) 
Absence of Graphical Interface 
‘Locked code’ 
◦ Open source on roadmap 
Small number of processors
LARGE MEMORY FOOTPRINT
Identifying the issue 
◦ Diagram elements: Processor, ResultSets 
◦ One request for each dataset, over 500 distinct datasets 
◦ Flow labels in the diagram: Asynchronous, Synchronous 
◦ Several processors running at the same time 
◦ Each of them with an increasing result set 
◦ A Jena result set is far from small
Theoretical Solution 
◦ Diagram elements: Processor, ResultSets 
◦ One request for each dataset, over 500 distinct datasets 
◦ Flow labels in the diagram: Asynchronous, Asynchronous 
◦ Several processors running at the same time 
◦ The results are processed immediately 
◦ Even with bigger result sets, memory stays under control
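The difference between holding result sets and processing results as they arrive can be sketched as below. This is a simplified model under stated assumptions: `query_dataset` is a hypothetical stand-in for a SPARQL request, and "peak rows in memory" is counted by the sketch, not measured.

```python
def query_dataset(dataset, rows_per_dataset):
    """Hypothetical stand-in for one SPARQL request; yields rows lazily."""
    for i in range(rows_per_dataset):
        yield (dataset, i)


def collect_then_process(datasets, rows_per_dataset):
    # Original shape: every result set is held until all requests finish.
    held = [list(query_dataset(d, rows_per_dataset)) for d in datasets]
    peak_rows_in_memory = sum(len(rows) for rows in held)
    processed = sum(len(rows) for rows in held)
    return processed, peak_rows_in_memory


def process_as_they_arrive(datasets, rows_per_dataset):
    # Proposed shape: each row is handled and dropped immediately.
    processed = 0
    for d in datasets:
        for _row in query_dataset(d, rows_per_dataset):
            processed += 1
    peak_rows_in_memory = 1 if processed else 0
    return processed, peak_rows_in_memory
```

In the first function the held rows grow with the number of datasets and the size of each result set; in the second they do not, which is the property the slide describes.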
The reactive manifesto 
Reactive Systems are 
◦ Responsive 
◦ The system responds in a timely manner if at all possible 
◦ Resilient 
◦ The system stays responsive in the face of failure 
◦ Elastic 
◦ The system stays responsive under varying workload. 
◦ Message Driven 
◦ Reactive Systems rely on asynchronous message-passing to establish a boundary between 
components that ensures loose coupling, isolation, location transparency, and provides the 
means to delegate errors as messages 
◦ Essentially, reactive systems are event-driven applications where modules 
send events (messages) to other modules. Each module makes its requests to 
other modules asynchronously. 
http://www.reactivemanifesto.org/
Actor model 
The actor model in computer science is a mathematical model of 
concurrent computation that treats "actors" as the universal primitives of 
concurrent computation: in response to a message that it receives, an actor 
can make local decisions, create more actors, send more messages, and 
determine how to respond to the next message received. The actor model 
originated in 1973.[1] It has been used both as a framework for a 
theoretical understanding of computation, and as the theoretical basis for 
several practical implementations of concurrent systems. The relationship 
of the model to other work is discussed in Indeterminacy in concurrent 
computation and Actor model and process calculi. 
http://en.wikipedia.org/wiki/Actor_model 
1 - Carl Hewitt; Peter Bishop; Richard Steiger (1973). "A Universal Modular 
Actor Formalism for Artificial Intelligence". IJCAI. 
http://pt.slideshare.net/drorbr/the-actor-model-towards-better-concurrency
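The core of the model, a private mailbox processed one message at a time, can be sketched in a few lines. This is a teaching sketch using only the Python standard library; the `Actor` and `CountingActor` classes are invented for illustration and are far simpler than a real actor framework (no supervision, no remote actors).

```python
import queue
import threading


class Actor:
    """Minimal actor: a private mailbox drained by a single thread."""

    _STOP = object()

    def __init__(self):
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        # Asynchronous send: the caller never waits for processing.
        self._mailbox.put(message)

    def stop(self):
        # Messages queued before the stop marker are still processed.
        self._mailbox.put(self._STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is self._STOP:
                return
            self.receive(message)

    def receive(self, message):
        raise NotImplementedError


class CountingActor(Actor):
    def __init__(self):
        self.total = 0          # touched only by the actor's own thread
        super().__init__()

    def receive(self, message):
        self.total += message
```

Because only the actor's own thread touches `total`, no lock is needed even when many senders call `tell` concurrently.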
Actor model 
http://codermonkey65.blogspot.com.br/2012/09/actors-in-c-with-nact.html
Akka 
http://akka.io/ 
Java or Scala framework for the Actor Model
Akka 
Comparison with Java’s thread model 
◦ + Simpler 
◦ CrawlerLD worked with two thread pools: 
◦ One to run the system’s own algorithm 
◦ Another to make calls to datasets 
◦ Using a single thread pool could block all threads on I/O operations 
◦ + No thread blocking 
◦ No need to worry about shared resources 
◦ Each actor runs at most one task at a time 
◦ + Better performance 
◦ No blocking 
◦ Allows distributed computing 
◦ + Better error management 
◦ Actor hierarchy allows supervisor actors to manage errors and even repeat the failed tasks 
◦ Support for transactions (atomic operations between several actors, even if distributed over several 
machines) 
◦ + Configuration can change system behavior without code change 
◦ Change number of allocated threads, create thread pools for different actors, distribute over several 
machines, change message priority without touching the code.
Akka 
Comparison with Java’s thread model 
◦ - Much harder to learn 
◦ New paradigm 
◦ - Not native
Results 
◦ Actors in the diagram: CrawlerLDMainActor, LevelActor, ResourceActor 
◦ Processors in the diagram: DereferenceProcessor, NumberOfInstancesProcessor, PropertyQueryProcessor, Processor 
◦ Messages in the diagram: Calculate, CalculateResource, ResourceProcessed, ResourceProcessedFromLevel, LevelFinished
Results 
◦ Actors and components in the diagram: Processor, SparqlQuerierMasterActor, SparqlQuerierActor (one actor for each dataset), Jena (modified version) 
◦ Modules in the diagram: CrawlerLD, UtilitiesSemanticWeb 
◦ Messages in the diagram: Calculate, QueryFinishedMessage, ProcessSparqlOnDataset, SparqlResultset 
◦ Blocking calls are managed by another Akka Dispatcher 
◦ Critical message: must be processed immediately
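Keeping blocking calls on their own pool, so they cannot starve the rest of the system, can be sketched as below. This uses Python's standard `ThreadPoolExecutor` as a stand-in for a dedicated dispatcher; `blocking_query` and `run_queries` are hypothetical names, and the sketch does not reproduce Akka's dispatcher configuration.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def blocking_query(dataset):
    """Hypothetical stand-in for a blocking SPARQL call."""
    return f"result from {dataset}"


def run_queries(datasets, blocking_threads=4):
    finished = queue.Queue()    # plays the role of the requester's mailbox
    # Dedicated pool: only blocking calls ever run on these threads.
    with ThreadPoolExecutor(max_workers=blocking_threads) as blocking_pool:
        for dataset in datasets:
            future = blocking_pool.submit(blocking_query, dataset)
            # Deliver the result as a message instead of waiting on it.
            future.add_done_callback(lambda f: finished.put(f.result()))
    # Leaving the with-block waits for all submitted queries to complete.
    results = []
    while not finished.empty():
        results.append(finished.get())
    return sorted(results)
```

The submitting code never calls `future.result()` itself, so it is free to keep handling other messages while the queries run.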
Results 
Complete refactor of the code 
◦ Better organization 
◦ Better understanding 
◦ Bugs found and resolved 
◦ Almost two months to understand the paradigm, change the code, and test 
Better performance 
◦ Even under heavy workload, the system is always available 
◦ Each request is just another message to another actor 
◦ Distributed code made easy 
◦ Each SparqlQuerierActor could run on a separate machine 
◦ Not yet implemented / tested 
(Much) better memory footprint 
◦ A 3-level task ran with at most 1.5 GB of RAM (!!) 
◦ The number of levels, or any other parameter, does not seem to affect the memory 
footprint
Graphical Interface 
60% completed
Graphical Interface 
New actor messages to retrieve task status while running 
◦ Actor in the diagram: CrawlerLDMainActor 
◦ Messages in the diagram: Calculate, GetSimplifiedStatus, CrawlerLDSimplifiedStatus, GetFullStatus, CrawlerLDFullStatus
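A status request handled as one more message could look like the sketch below. The message names `Calculate`, `GetSimplifiedStatus`, and `CrawlerLDSimplifiedStatus` come from the slide; their fields, the `MainActorSketch` class, and the `work_one` step are assumptions made for illustration, and messages are handled synchronously here to keep the example deterministic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Calculate:
    resource: str               # hypothetical field


@dataclass(frozen=True)
class GetSimplifiedStatus:
    pass


@dataclass(frozen=True)
class CrawlerLDSimplifiedStatus:
    processed: int              # hypothetical field
    pending: int                # hypothetical field


class MainActorSketch:
    """Handles one message at a time, as an actor would."""

    def __init__(self):
        self._pending = []
        self._processed = 0

    def receive(self, message):
        if isinstance(message, Calculate):
            self._pending.append(message.resource)
            return None
        if isinstance(message, GetSimplifiedStatus):
            # The reply is a snapshot; the task keeps running afterwards.
            return CrawlerLDSimplifiedStatus(self._processed, len(self._pending))
        raise ValueError(f"unknown message: {message!r}")

    def work_one(self):
        # Stand-in for finishing the crawl of one queued resource.
        if self._pending:
            self._pending.pop(0)
            self._processed += 1
```

Because the status request goes through the same mailbox as the work, a graphical interface can poll it without touching the actor's internal state directly.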
Graphical Interface 
Allows creation and monitoring of the tasks 
Takes advantage of actor model 
Anyone will be able to create new tasks 
URL available soon
Questions?
