Mapping French Open Data actors on the web with Common Crawl
1.
Mapping French Open Data actors on the web with Common Crawl
guillaume.lebourgeois@data-publica.com
@glebourg
2.
Mining the Web at Data Publica
Different needs, different techniques
● Scraping
● Focused crawling
● Prospective crawling
3.
Mining the Web at Data Publica
Scraping
● Identified resources
● Configured extractors
● Structured content
● Not scalable
4.
Mining the Web at Data Publica
Focused crawling
● Identified entities
● Fuzzy extraction
● Structured content using text-mining
● Scalable
● Useful for gathering metadata about known entities
5.
Mining the Web at Data Publica
Prospective crawling
● No starting point
● Fuzzy extraction
● Structured content using text-mining
● Very hard to scale
● Heavy resources needed: CPU, RAM, HDD
Relying on a third-party crawl makes life easier!
6.
From a crawl to a map
Goal: build a map of the French Open Data actors on the web
● As a graph
● Showing websites
7.
From a crawl to a map
Using Common Crawl
● Large web crawl archives fully accessible
● Good coverage of the French web
● Easy access via AWS / MapReduce jobs
8.
From a crawl to a map
Working on the French web
● The .fr TLD is not a reliable way to detect French websites
● Detecting page language
● Giving websites a "frenchness" score
○ Sw = number of French pages / total number of pages
○ Cutoff chosen manually by testing on French websites
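The "frenchness" score above can be sketched in a few lines. This is only an illustration, not the code used at Data Publica: it assumes each page already carries a language code from some language detector, and the function name, the sample URLs and the 0.5 cutoff are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def frenchness_scores(pages):
    """pages: iterable of (url, language_code) pairs. Returns {host: Sw}."""
    french = defaultdict(int)
    total = defaultdict(int)
    for url, lang in pages:
        host = urlsplit(url).hostname
        if host is None:
            continue  # skip URLs without a host part
        total[host] += 1
        if lang == "fr":
            french[host] += 1
    # Sw = number of French pages / total number of pages
    return {host: french[host] / total[host] for host in total}


CUTOFF = 0.5  # hypothetical value; the slides say it was chosen manually

pages = [
    ("http://example.fr/a", "fr"),
    ("http://example.fr/b", "fr"),
    ("http://example.fr/c", "en"),
    ("http://example.com/x", "en"),
    ("http://example.org/y", "fr"),
]
scores = frenchness_scores(pages)
french_sites = sorted(host for host, s in scores.items() if s >= CUTOFF)
```

In this toy input, example.org is kept although it is not under .fr, which is the reason for scoring by page language rather than by TLD.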
9.
From a crawl to a map
Working on Open Data websites
● Building an Open Data "vocabulary"
● Detecting whether a page talks about Open Data
● Giving websites an "opendataness" score
○ Sw = number of Open Data pages / total number of pages
○ Cutoff chosen manually by testing on Open Data websites
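A minimal sketch of vocabulary-based scoring follows. The slides do not give the actual vocabulary or the page-level rule, so the term list, the "at least two distinct terms" rule and the function names are all assumptions made for illustration.

```python
# Hypothetical vocabulary; the real one is not given in the slides.
VOCABULARY = (
    "open data",
    "opendata",
    "données ouvertes",
    "données publiques",
    "réutilisation",
)
MIN_HITS = 2  # assumed rule: a page must contain at least two distinct terms


def is_open_data_page(text, vocabulary=VOCABULARY, min_hits=MIN_HITS):
    """Return True when enough vocabulary terms occur in the page text."""
    lowered = text.lower()
    hits = sum(1 for term in vocabulary if term in lowered)
    return hits >= min_hits


def opendataness(page_texts):
    """Sw = number of Open Data pages / total number of pages."""
    if not page_texts:
        return 0.0
    matching = sum(1 for text in page_texts if is_open_data_page(text))
    return matching / len(page_texts)


site_pages = [
    "Portail Open Data : réutilisation des données publiques",
    "Contact et mentions légales",
    "Les données ouvertes et l'opendata en France",
    "Recettes de cuisine",
]
score = opendataness(site_pages)
```

A site would then be kept when its score reaches the manually chosen cutoff.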
10.
From a crawl to a map
Building graph
● Inside our subset
○ Inlinks
○ Outlinks
● Generating two files
○ nodes.csv (list of websites with an id)
○ edges.csv (directed links between websites)
[Figure: Node A with two inlinks and one outlink]
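The two files could be produced along these lines. This is a sketch under assumptions: the column names, the choice to drop duplicate links and self-links, and the sample hosts are hypothetical, and the output goes to in-memory buffers instead of real nodes.csv and edges.csv files.

```python
import csv
import io


def build_graph(subset, links):
    """subset: hosts kept after scoring; links: (source_host, target_host) pairs."""
    ids = {host: i for i, host in enumerate(sorted(set(subset)))}
    # Keep only links whose two ends are inside the subset; drop duplicates
    # and self-links (an assumption, not stated in the slides).
    edges = sorted(
        {(ids[s], ids[t]) for s, t in links if s in ids and t in ids and s != t}
    )
    return ids, edges


def write_csv(handle, header, rows):
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


subset = ["a.example", "b.example", "c.example"]
links = [
    ("a.example", "b.example"),
    ("b.example", "a.example"),
    ("a.example", "b.example"),        # duplicate link
    ("a.example", "outside.example"),  # target outside the subset
    ("c.example", "c.example"),        # self-link
]
ids, edges = build_graph(subset, links)

nodes_out = io.StringIO()  # stands in for nodes.csv
edges_out = io.StringIO()  # stands in for edges.csv
write_csv(nodes_out, ["id", "website"], [(i, host) for host, i in ids.items()])
write_csv(edges_out, ["source", "target"], edges)
```

Each row of the edge file is a directed link, so an inlink of one node is the outlink of another.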
11.
From a crawl to a map
Building graph
● Links tell a lot about websites
○ Authorities
○ Hubs
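The slides do not say how authorities and hubs were identified. One standard way to score them is the HITS algorithm, sketched here in plain Python on a hypothetical edge list: a good hub links to good authorities, and a good authority is linked from good hubs.

```python
import math


def hits(edges, iterations=50):
    """edges: list of directed (source, target) links. Returns (hub, authority)."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the nodes linking in.
        auth = {node: 0.0 for node in nodes}
        for source, target in edges:
            auth[target] += hub[source]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {node: v / norm for node, v in auth.items()}
        # Hub score: sum of the authority scores of the nodes linked to.
        hub = {node: 0.0 for node in nodes}
        for source, target in edges:
            hub[source] += auth[target]
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {node: v / norm for node, v in hub.items()}
    return hub, auth


# Hypothetical toy graph: "portal" links to many sites, "a" is linked most.
edges = [("portal", "a"), ("portal", "b"), ("portal", "c"), ("blog", "a")]
hub, auth = hits(edges)
top_hub = max(hub, key=hub.get)
top_authority = max(auth, key=auth.get)
```

On this toy graph the site with the most outlinks comes out as the top hub and the most linked-to site as the top authority.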
12.
From a crawl to a map
Visualizing graph using Gephi
● Load graph
● Spatialize graph
○ Links between websites create "attraction", pulling linked sites close to each other
○ The more inlinks, the bigger the node (= authority)
○ Categorizing websites for better understanding (one color per category)
■ Companies, non-profits/blogs, government agencies
○ Communities can now appear!
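In Gephi the node size is set through the interface, but the rule "the more inlinks, the bigger the node" can be illustrated with a small sketch. The linear scaling and the size bounds below are assumptions, not Gephi's defaults.

```python
from collections import Counter


def node_sizes(nodes, edges, min_size=5.0, max_size=50.0):
    """Scale each node's size linearly with its number of inlinks."""
    indegree = Counter(target for _, target in edges)
    top = max((indegree[node] for node in nodes), default=0)
    sizes = {}
    for node in nodes:
        ratio = indegree[node] / top if top else 0.0
        sizes[node] = min_size + ratio * (max_size - min_size)
    return sizes


# Hypothetical toy graph: "a" has two inlinks, "b" one, "c" none.
nodes = ["a", "b", "c"]
edges = [("b", "a"), ("c", "a"), ("a", "b")]
sizes = node_sizes(nodes, edges)
```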
14.
From a crawl to a map
Visualizing graph on the web
● Sigma.js
● Uses Gephi files
● Gives better interactivity
15.
Analyze
● The final graph is a good way to understand
interactions between actors
○ Open Data was clearly initiated by a non-profit movement
○ Companies are beginning to work on the subject
○ The French state has only had sporadic initiatives so far
● This graph will be generated again in the near future, to track changes in this ecosystem
16.
Results
● Large scale crawl made easy
○ Easy to focus on mining the results instead of
finding/storing the data
● Nice workflow from raw data to an
understandable visualisation
● The final graph is a good way to understand
interactions between actors
17.
Feedback
● Common Crawl
○ Common Crawl does not yet have an exhaustive crawl of the French web
○ Data is not as fresh as it could be
○ It lacks an index giving O(1) access to domains at least, and possibly to pages
● Methodology
○ Opendataness scoring can exclude websites that are relevant but not sufficiently focused on Open Data
18.
Resources
● http://webatlas.fr/tempshare/OpenDataActeursTypes.pdf
○ Poster by Franck Ghitalla
● http://french-opendata.data-publica.com/index.html
○ Dynamic visualisation of the results, by Data Publica
● http://fr.slideshare.net/willounet/a-sneak-peek-into-the-web-presentation
○ A sneak peek into the web, by GL
● http://french-opendata.data-publica.com/
○ Project host page