Mapping French Open
Data actors on the web
with Common Crawl
guillaume.lebourgeois@data-publica.com
@glebourg
Mining the Web at Data Publica
Different needs, different techniques
   ● Scraping
   ● Focused crawling
   ● Prospective crawling
Mining the Web at Data Publica
Scraping
  ● Identified resources
  ● Configured extractors
  ● Structured content
  ● Not scalable
Mining the Web at Data Publica
Focused crawling
  ● Identified entities
  ● Fuzzy extraction
  ● Structured content using text-mining
  ● Scalable
  ● Useful to get meta information on known
    entities
Mining the Web at Data Publica
Prospective crawling
  ● No starting point
  ● Fuzzy extraction
  ● Structured content using text-mining
  ● Very hard to scale
  ● Heavy resources needed: CPU, RAM,
    HDD

Using a third party makes your life easier!
From a crawl to a map
Goal: build a map of the French open data
actors on the web
  ● As a graph
  ● Showing websites
From a crawl to a map
Using Common Crawl
  ● Large web crawl archives fully accessible
  ● Good coverage of the French web
  ● Easy access via AWS / MapReduce jobs
From a crawl to a map
Working on the French web
 ● The .fr TLD is not a relevant criterion for detection
 ● Detecting page language
 ● Giving websites a "frenchness" score
     ○ Sw = number of French pages / total number of pages
     ○ Cutoff chosen manually by testing on French
       websites
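A minimal sketch of the "frenchness" score above, assuming each crawled page has already been labelled with a language code by some language detector. The function names, the sample sites, and the 0.5 cutoff are illustrative assumptions; the slides only say the cutoff was chosen manually.

```python
from collections import defaultdict


def frenchness_scores(pages):
    """Compute Sw = French pages / total pages for each website.

    pages: iterable of (site, language_code) pairs, one per crawled page.
    """
    french = defaultdict(int)
    total = defaultdict(int)
    for site, lang in pages:
        total[site] += 1
        if lang == "fr":
            french[site] += 1
    return {site: french[site] / total[site] for site in total}


def french_sites(pages, cutoff=0.5):
    """Keep the sites whose score reaches the cutoff (0.5 is an assumed value)."""
    scores = frenchness_scores(pages)
    return {site for site, score in scores.items() if score >= cutoff}


# Hypothetical sample: language codes would come from a language detector.
pages = [
    ("example.fr", "fr"), ("example.fr", "fr"), ("example.fr", "en"),
    ("example.com", "en"), ("example.com", "en"),
    ("exemple.org", "fr"),
]
print(frenchness_scores(pages))
print(french_sites(pages, cutoff=0.5))
```

Scoring per website rather than per page is what lets a French site hosted on a .com or .org domain be kept, which matches the point about the .fr TLD.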
From a crawl to a map
Working on Open Data websites
 ● Building an Open Data "vocabulary"
 ● Detecting whether a page is about Open
    Data
 ● Giving websites an "opendataness" score
     ○ Sw = number of Open Data pages / total number of pages
     ○ Cutoff chosen manually by testing on Open Data
       websites
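The "opendataness" score can be sketched the same way. The vocabulary below and the rule "at least one term present" are assumptions made for illustration; the slides do not list the actual terms or the detection rule.

```python
from collections import defaultdict

# Hypothetical vocabulary: the real term list is not given in the slides.
VOCABULARY = {"open data", "opendata", "données ouvertes", "données publiques"}


def is_open_data_page(text, vocabulary=VOCABULARY, min_hits=1):
    """Return True if at least min_hits vocabulary terms occur in the text."""
    lowered = text.lower()
    hits = sum(1 for term in vocabulary if term in lowered)
    return hits >= min_hits


def opendataness_scores(pages):
    """Compute Sw = Open Data pages / total pages for each website.

    pages: iterable of (site, page_text) pairs.
    """
    matching = defaultdict(int)
    total = defaultdict(int)
    for site, text in pages:
        total[site] += 1
        if is_open_data_page(text):
            matching[site] += 1
    return {site: matching[site] / total[site] for site in total}


# Hypothetical sample pages.
sample_pages = [
    ("portal.example", "Catalogue Open Data de la ville"),
    ("portal.example", "Les données ouvertes du mois"),
    ("news.example", "Résultats sportifs du week-end"),
    ("news.example", "Un article sur l'opendata"),
]
print(opendataness_scores(sample_pages))
```

A plain substring match is the simplest possible detector; it is used here only to make the ratio concrete.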
From a crawl to a map
Building graph
  ● Inside our subset
     ○ Inlinks
     ○ Outlinks
  ● Generating two files
     ○ nodes.csv (list of websites with an id)
     ○ edges.csv (directed links between websites)


[Diagram: Node A with two inlinks and one outlink]
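A sketch of how the two files could be produced from site-to-site links restricted to the retained subset. The column names (Id, Label, Source, Target, Type) follow the headers Gephi's spreadsheet import recognises; whether the original files used them is an assumption, as are the function names and sample sites.

```python
import csv


def build_graph(links, subset):
    """Keep only links whose both ends are in the subset.

    links: iterable of (source_site, target_site) pairs.
    subset: set of retained websites.
    Returns (node_ids, edges) where edges are (source_id, target_id) pairs.
    """
    node_ids = {site: i for i, site in enumerate(sorted(subset))}
    edges = sorted({
        (node_ids[src], node_ids[dst])
        for src, dst in links
        if src in node_ids and dst in node_ids and src != dst
    })
    return node_ids, edges


def write_graph(node_ids, edges, nodes_path="nodes.csv", edges_path="edges.csv"):
    """Write nodes.csv (id, website) and edges.csv (directed links)."""
    with open(nodes_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Label"])
        for site, node_id in sorted(node_ids.items(), key=lambda kv: kv[1]):
            writer.writerow([node_id, site])
    with open(edges_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Source", "Target", "Type"])
        for src, dst in edges:
            writer.writerow([src, dst, "Directed"])


# Hypothetical links; x.com is outside the subset, so its link is dropped.
links = [("a.fr", "b.fr"), ("b.fr", "a.fr"), ("a.fr", "x.com"),
         ("c.fr", "a.fr"), ("a.fr", "a.fr")]
node_ids, edges = build_graph(links, {"a.fr", "b.fr", "c.fr"})
print(node_ids, edges)
# write_graph(node_ids, edges) would then create the two CSV files.
```

Self-links and duplicate links are discarded here as a design choice, so that each edge stands for "site X links to site Y at least once".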
From a crawl to a map
Building graph
  ● Links tell a lot about websites
     ○ Authorities
     ○ Hubs
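Authorities and hubs are the two roles defined by the HITS algorithm: an authority is linked to by good hubs, and a hub links to good authorities. The slides only use inlink counts as the authority measure, so the following is a sketch of HITS given for illustration, not the method used in the talk. The sample edges are hypothetical.

```python
def hits(edges, iterations=50):
    """Iteratively compute hub and authority scores.

    edges: list of (source, target) directed links.
    Returns (hub, authority) dictionaries, each normalised to unit length.
    """
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the sites linking in.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        norm = sum(v * v for v in new_auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in new_auth.items()}
        # Hub score: sum of the authority scores of the sites linked to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        norm = sum(v * v for v in new_hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}
    return hub, auth


# Hypothetical links between three sites.
sample_edges = [("blog", "portal"), ("agency", "portal"), ("blog", "agency")]
hub, auth = hits(sample_edges)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```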
From a crawl to a map
Visualizing graph using Gephi
  ● Load graph
  ● Spatialize graph
     ○ links between websites create "attraction", pulling
       linked sites close to each other
     ○ the more inlinks, the bigger the node (= authority)
     ○ categorizing websites for better understanding (one
       color per category)
        ■ Companies, Non-profits/blogs, Government
           agencies
     ○ communities can now appear!
From a crawl to a map
From a crawl to a map
Visualizing graph on the web
  ● Sigma.js
  ● Uses Gephi files
  ● Gives better interactivity
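A sketch of exporting the graph to GEXF, the XML format Gephi reads and writes. That the "Gephi files" mentioned above were GEXF is an assumption, and this minimal export carries no layout positions, sizes, or colors, which Gephi would normally add after spatialization. The function name and sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"


def to_gexf(node_ids, edges):
    """Serialise a directed graph as a minimal GEXF 1.2 document.

    node_ids: dict mapping website -> integer id.
    edges: list of (source_id, target_id) pairs.
    """
    gexf = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(
        gexf, "graph", {"mode": "static", "defaultedgetype": "directed"}
    )
    nodes_el = ET.SubElement(graph, "nodes")
    for site, node_id in sorted(node_ids.items(), key=lambda kv: kv[1]):
        ET.SubElement(nodes_el, "node", {"id": str(node_id), "label": site})
    edges_el = ET.SubElement(graph, "edges")
    for edge_id, (src, dst) in enumerate(edges):
        ET.SubElement(
            edges_el, "edge",
            {"id": str(edge_id), "source": str(src), "target": str(dst)},
        )
    return ET.tostring(gexf, encoding="unicode")


# Hypothetical two-site graph.
gexf_xml = to_gexf({"a.fr": 0, "b.fr": 1}, [(0, 1)])
print(gexf_xml)
```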
Analyze
● The final graph is a good way to understand
  interactions between actors
  ○ Open Data was clearly initiated by a non-profit
    movement
  ○ Companies are beginning to work on the subject
  ○ The French state has had only sporadic initiatives so
    far
● This graph will be generated again in the near
  future, to see changes in this ecosystem
Results
● Large-scale crawl made easy
  ○ Easy to focus on mining the results instead of
    finding/storing the data
● Nice workflow from raw data to an
  understandable visualisation
● The final graph is a good way to understand
  interactions between actors
Feedback
● Common Crawl
  ○ Common Crawl does not yet have an exhaustive crawl of
    the French web
  ○ Data is not as fresh as it could be
  ○ It lacks an index giving O(1) access to at least
    domains, and maybe pages
● Methodology
  ○ Opendataness scoring can exclude websites that are
    relevant but not focused enough on open data
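To make the index remark concrete, here is a sketch of the kind of lookup structure meant: a hash map from domain to the locations of its pages, giving average O(1) access per domain. The record layout (url, archive file, byte offset) and the file names are hypothetical and do not describe Common Crawl's actual format.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def build_domain_index(records):
    """Map each domain to the archive locations of its pages.

    records: iterable of (url, archive_file, offset) tuples (assumed layout).
    """
    index = defaultdict(list)
    for url, archive_file, offset in records:
        domain = urlsplit(url).hostname
        if domain:
            index[domain].append((archive_file, offset))
    return dict(index)


# Hypothetical records.
records = [
    ("http://www.example.fr/a", "segment-0001.arc", 0),
    ("http://www.example.fr/b", "segment-0001.arc", 4096),
    ("http://data.example.org/", "segment-0002.arc", 128),
]
index = build_domain_index(records)
print(index["www.example.fr"])
```

With such an index, a job interested in a known list of domains could read only the matching archive locations instead of scanning the whole crawl.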
Resources
● http://webatlas.fr/tempshare/OpenDataActeursTypes.pdf
   ○ poster by Franck Ghitalla
● http://french-opendata.data-publica.com/index.html
   ○ dynamic visualisation of the results, by Data Publica
● http://fr.slideshare.net/willounet/a-sneak-peek-into-the-web-presentation
   ○ A sneak peek into the web, by GL
● http://french-opendata.data-publica.com/
   ○ Project host page
