Mapping French Open Data actors on the web with Common Crawl
guillaume.lebourgeois@data-publica.com
@glebourg
Mining the Web at Data Publica
Different needs, different techniques
● Scraping
● Focused crawling
● Prospective crawling
Scraping
● Identified resources
● Configured extractors
● Structured content
● Not scalable
Focused crawling
● Identified entities
● Fuzzy extraction
● Structured content using text-mining
● Scalable
● Useful for getting meta-information on known entities
Prospective crawling
● No starting point
● Fuzzy extraction
● Structured content using text-mining
● Very hard to scale
● Heavy resources needed: CPU, RAM, HDD
It is easier to rely on a third party!
From a crawl to a map
Goal: build a map of the French Open Data actors on the web
● As a graph
● Showing websites
Using Common Crawl
● Large web crawl archives fully accessible
● Good coverage of the French web
● Easy access via AWS / MapReduce jobs
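As a rough illustration of the MapReduce style of processing mentioned above, the sketch below counts crawled pages per website. It runs locally on a plain list of URLs; the function names and sample URLs are hypothetical, and the real Common Crawl record format and job setup are not shown.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_page(url):
    """Map step: emit a (hostname, 1) pair for one crawled page URL."""
    host = urlsplit(url).hostname
    if host:  # skip strings that do not contain a hostname
        yield host, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each hostname."""
    totals = defaultdict(int)
    for host, count in pairs:
        totals[host] += count
    return dict(totals)


def pages_per_site(urls):
    """Chain the map and reduce steps over an iterable of URLs."""
    pairs = (pair for url in urls for pair in map_page(url))
    return reduce_counts(pairs)


# Hypothetical sample input, standing in for URLs read from the crawl.
sample_urls = [
    "http://www.example.fr/a",
    "http://www.example.fr/b",
    "http://blog.example.org/",
]
site_counts = pages_per_site(sample_urls)
```

The per-site totals produced this way are the denominators needed by any "pages matching / total pages" score.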
Working on the French web
● The .fr TLD is not a relevant criterion for detection
● Detecting page language
● Giving websites a "frenchness" score
○ Sw = number of French pages / total number of pages
○ Cutoff chosen manually by testing on known French websites
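The frenchness score can be sketched as follows. This is a minimal example that assumes each page already carries a language code from some language detector (not shown); the site names and the 0.5 cutoff are placeholders, not the values used in the project.

```python
from collections import defaultdict


def frenchness_scores(pages):
    """Compute Sw = French pages / total pages for each website.

    pages: iterable of (site, language_code) pairs, one per crawled page.
    """
    french = defaultdict(int)
    total = defaultdict(int)
    for site, lang in pages:
        total[site] += 1
        if lang == "fr":
            french[site] += 1
    return {site: french[site] / total[site] for site in total}


def select_french_sites(scores, cutoff):
    """Keep the websites whose score reaches the manually chosen cutoff."""
    return sorted(site for site, score in scores.items() if score >= cutoff)


# Hypothetical detector output: (site, detected language) per page.
pages = [
    ("a.example", "fr"), ("a.example", "fr"), ("a.example", "en"),
    ("b.example", "en"), ("b.example", "fr"),
    ("b.example", "en"), ("b.example", "en"),
]
scores = frenchness_scores(pages)
french_sites = select_french_sites(scores, cutoff=0.5)  # placeholder cutoff
```

Here a.example scores 2/3 and is kept, while b.example scores 1/4 and is dropped.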
Working on Open Data websites
● Building an Open Data "vocabulary"
● Detecting whether a page talks about Open Data
● Giving websites an "opendataness" score
○ Sw = number of Open Data pages / total number of pages
○ Cutoff chosen manually by testing on known Open Data websites
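One simple way to detect Open Data pages from a vocabulary is plain term matching, sketched below. The vocabulary terms, the `min_hits` threshold and the sample texts are all assumptions made for illustration; the slides do not list the actual vocabulary or the text-mining method used.

```python
import re

# Hypothetical vocabulary; the real one is not given in the slides.
VOCABULARY = {"open data", "opendata", "données ouvertes", "données publiques"}


def speaks_about_open_data(text, vocabulary=VOCABULARY, min_hits=1):
    """Return True if the page text contains at least min_hits vocabulary terms."""
    normalized = re.sub(r"\s+", " ", text.lower())  # collapse whitespace runs
    hits = sum(1 for term in vocabulary if term in normalized)
    return hits >= min_hits


def opendataness(page_texts):
    """Compute Sw = Open Data pages / total pages for one website."""
    if not page_texts:
        return 0.0
    matching = sum(1 for text in page_texts if speaks_about_open_data(text))
    return matching / len(page_texts)


# Hypothetical page texts for a single website.
site_pages = [
    "Portail Open  Data de la ville",
    "Mentions légales",
    "Les données ouvertes en France",
    "Contact",
]
site_score = opendataness(site_pages)
```

Two of the four sample pages match the vocabulary, so this site scores 0.5.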
Building graph
● Inside our subset
○ Inlinks
○ Outlinks
● Generating two files
○ nodes.csv (list of websites with an id)
○ edges.csv (directed links between websites)
(Diagram: node A with its inlinks and its outlink)
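Producing the two files could look like the sketch below. It assumes the links have already been reduced to (source website, target website) pairs. The `Id`/`Label` and `Source`/`Target` headers, the removal of self-links and the deduplication of edges are assumptions, chosen so the files are easy to import into a graph tool.

```python
import csv
import io


def build_graph(links, subset):
    """Keep only links whose two ends belong to the selected subset.

    links: iterable of (source_site, target_site) pairs.
    subset: set of websites that passed the scoring steps.
    """
    node_ids = {site: i for i, site in enumerate(sorted(subset))}
    edges = set()
    for source, target in links:
        if source in node_ids and target in node_ids and source != target:
            edges.add((node_ids[source], node_ids[target]))
    return node_ids, sorted(edges)


def write_nodes(node_ids, out):
    """Write the list of websites with their id."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Id", "Label"])
    for site, node_id in sorted(node_ids.items(), key=lambda kv: kv[1]):
        writer.writerow([node_id, site])


def write_edges(edges, out):
    """Write the directed links between websites."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)


# Hypothetical input; x.example is outside the subset and is ignored.
subset = {"a.example", "b.example", "c.example"}
links = [
    ("a.example", "b.example"), ("a.example", "b.example"),
    ("b.example", "a.example"), ("a.example", "x.example"),
    ("c.example", "c.example"),
]
node_ids, edges = build_graph(links, subset)
nodes_csv, edges_csv = io.StringIO(), io.StringIO()
write_nodes(node_ids, nodes_csv)
write_edges(edges, edges_csv)
```

Replacing the `io.StringIO` buffers with files opened via `open("nodes.csv", "w", newline="")` and `open("edges.csv", "w", newline="")` writes the two files to disk.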
● Links tell a lot about websites
○ Authorities
○ Hubs
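Authorities and hubs are commonly computed with the HITS algorithm. The slides do not say which method was used, so the sketch below is only one possible reading: a plain-Python HITS iteration over a list of unique directed edges, with a hypothetical toy graph.

```python
import math


def _normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def hits(edges, iterations=50):
    """Return (hub, authority) scores for a directed graph.

    edges: list of unique (source, target) pairs.
    """
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the nodes linking to it.
        auth = {node: 0.0 for node in nodes}
        for source, target in edges:
            auth[target] += hub[source]
        auth = _normalize(auth)
        # Hub: sum of the authority scores of the nodes it links to.
        hub = {node: 0.0 for node in nodes}
        for source, target in edges:
            hub[source] += auth[target]
        hub = _normalize(hub)
    return hub, auth


# Toy graph: "a" links to "c" and "d", "b" links to "c".
toy_edges = [("a", "c"), ("b", "c"), ("a", "d")]
hub, auth = hits(toy_edges)
top_authority = max(auth, key=auth.get)
top_hub = max(hub, key=hub.get)
```

In the toy graph, "c" receives the most links and comes out as the top authority, while "a" links to both authorities and comes out as the top hub.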
Visualizing graph using Gephi
● Load graph
● Spatialize graph
○ Links between websites create "attraction", so that linked websites appear near each other
○ The more inlinks, the bigger the node (= authority)
○ Websites are categorized for better understanding (one color per category)
■ Companies, Non-profit/blogs, Government agencies
○ Communities can now appear!
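The size and color rules above are applied interactively in Gephi. The sketch below only illustrates the underlying arithmetic: node size grows linearly with the number of inlinks, and each category maps to one color. The size range, the color codes and the category labels are hypothetical.

```python
from collections import Counter

# Hypothetical palette, one color per category.
CATEGORY_COLORS = {
    "company": "#1f77b4",
    "nonprofit": "#2ca02c",
    "government": "#d62728",
}


def node_styles(edges, categories, min_size=5.0, max_size=50.0):
    """Derive a size and a color for each website.

    edges: list of (source, target) pairs between websites.
    categories: dict mapping each website to its category label.
    """
    in_degree = Counter(target for _, target in edges)
    top = max(in_degree.values(), default=0)
    styles = {}
    for site, category in categories.items():
        share = in_degree[site] / top if top else 0.0
        styles[site] = {
            "in_degree": in_degree[site],
            # Linear scaling between min_size and max_size.
            "size": min_size + share * (max_size - min_size),
            "color": CATEGORY_COLORS.get(category, "#999999"),
        }
    return styles


# Hypothetical graph and manual categorization.
example_edges = [("a", "c"), ("b", "c"), ("a", "b")]
example_categories = {"a": "company", "b": "nonprofit", "c": "government"}
styles = node_styles(example_edges, example_categories)
```

With this input, "c" has two inlinks and gets the largest size, "b" has one and lands halfway, and "a" has none and stays at the minimum.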
Visualizing graph on the web
● Sigma.js
● Uses Gephi files
● Gives better interactivity
Analysis
● The final graph is a good way to understand
interactions between actors
○ Open Data was definitely initiated by a non-profit movement
○ Companies are beginning to work on the subject
○ The French state has only had sporadic initiatives so far
● This graph will be generated again in the near future, to see changes in this ecosystem
Results
● Large scale crawl made easy
○ Easy to focus on mining the results instead of
finding/storing the data
● Nice workflow from raw data to an
understandable visualisation
● The final graph is a good way to understand
interactions between actors
Feedback
● Common Crawl
○ Common Crawl doesn't have an exhaustive crawl of the French web for now
○ Data is not as fresh as it could be
○ It is missing an index to access at least domains, and maybe pages, in O(1)
● Methodology
○ Opendataness scoring can exclude relevant websites that are not focused enough on Open Data
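The kind of index that the Common Crawl feedback asks for can be pictured as a hash map from domain to record locations, which gives average O(1) lookup per domain. The sketch below is purely illustrative: the record tuples, archive names and offsets are hypothetical and do not reflect any actual Common Crawl file layout.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def build_domain_index(records):
    """Build a hostname -> [(archive_name, byte_offset), ...] lookup table.

    records: iterable of (url, archive_name, byte_offset) tuples.
    """
    index = defaultdict(list)
    for url, archive, offset in records:
        host = urlsplit(url).hostname
        if host:
            index[host].append((archive, offset))
    return dict(index)


# Hypothetical records: where each page sits inside the crawl archives.
records = [
    ("http://www.example.fr/a", "segment-0.gz", 0),
    ("http://www.example.fr/b", "segment-0.gz", 2048),
    ("http://blog.example.org/", "segment-1.gz", 512),
]
domain_index = build_domain_index(records)
# Dictionary lookup: average O(1), no scan of the archives needed.
locations = domain_index.get("www.example.fr", [])
```

With such a table, a job that targets a known list of websites could read only the relevant parts of the archives instead of scanning the whole crawl.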
Resources
● http://webatlas.fr/tempshare/OpenDataActeursTypes.pdf
○ poster by Franck Ghitalla
● http://french-opendata.data-publica.com/index.html
○ dynamic visualisation of the results, by Data Publica
● http://fr.slideshare.net/willounet/a-sneak-peek-into-the-web-presentation
○ A sneak peek into the web, by GL
● http://french-opendata.data-publica.com/
○ Project host page