Some insights into data curation processes at SpazioDati: how we use Big Data tools and Linked Data technologies to build our products, Dandelion API (dandelion.eu) and Atoka (atoka.io).
Text analytics for Google Spreadsheets using the Text Mining add-on (SpazioDati)
This add-on lets Google Spreadsheets users enrich the textual content of their spreadsheets by automatically extracting named entities (such as places, persons, events, or concepts) and linking them to Wikipedia using the Dandelion API.
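The entity-extraction call behind the add-on can be sketched as follows. The endpoint and parameter names follow Dandelion's Entity Extraction API (datatxt/nex) as documented; treat them as assumptions and check the current docs. The sketch builds the request URL and parses a sample response offline, so it runs without a network call or an API token.

```python
import json
from urllib.parse import urlencode

# Endpoint and parameter names assumed from Dandelion's datatxt/nex docs.
NEX_URL = "https://api.dandelion.eu/datatxt/nex/v1/"

def build_request(text, token):
    """Build the GET URL for an entity-extraction call."""
    return NEX_URL + "?" + urlencode({"text": text, "token": token})

def wikipedia_links(response_json):
    """Map each detected surface form ('spot') to its linked Wikipedia URI."""
    payload = json.loads(response_json)
    return {a["spot"]: a["uri"] for a in payload.get("annotations", [])}

# Sample response in the documented shape, so the sketch runs offline.
sample = json.dumps({
    "annotations": [
        {"spot": "Riva del Garda",
         "uri": "http://en.wikipedia.org/wiki/Riva_del_Garda"}
    ]
})

print(wikipedia_links(sample))
```

In the add-on, each cell's text would be sent this way and the returned Wikipedia links written into neighboring cells.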
ISWC 2014 - Dandelion: from raw data to dataGEMs for developers (SpazioDati)
This is the presentation shown during ISWC 2014 at Riva del Garda. The session, titled "Developers Workshop", focused on how practical problems are solved for Linked Data. We presented the Dandelion platform, our data curation workflow, and the overall idea of dataGEM APIs.
Central Pennsylvania Open Source Conference, October 17, 2015
Data is a hot topic in the tech sector, with big data, data processing, data science, linked open data, and data visualization to name only a few examples. Before data can be processed or analyzed, it often has to be cleaned. OpenRefine is an open source interactive data transformation tool for working with messy data. This presentation will begin with a short overview of the features of OpenRefine. To demonstrate the basic concepts of data cleaning, manipulation, faceting, and filtering with OpenRefine, the Pennsylvania Heritage magazine subject index data will be used as a case study.
The first workshop of the series "Services to support FAIR data" took place in Prague during the EOSC-hub week (on April 12, 2019).
Speaker: Maajke de Jong
It Don’t Mean a Thing If It Ain’t Got Semantics (Ontotext)
With enterprises awash in data and challenged to turn it into knowledge, meaning arguably lives in the systems of whoever holds the best database.
Turning pieces of data into actionable knowledge and data-driven decisions takes a good, reliable database. The RDF database is one such solution.
It captures and analyzes large volumes of diverse data while also managing and retrieving every connection that data ever enters into.
In our latest slides, you will find out why we believe RDF graph databases work wonders at serving information needs and handling the growing amounts of diverse data every organization faces today.
This is an informal overview of Linked Data and of how it is used in the project http://res.space (presented on August 11, 2016, during a team meeting).
Linking Open, Big Data Using Semantic Web Technologies - An Introduction (Ronald Ashri)
The Physics Department of the University of Cagliari and the Linkalab Group invited me to talk about the Semantic Web and Linked Data - this is simply an introduction to the technologies involved.
Using the Semantic Web Stack to Make Big Data Smarter (Matheus Mota)
This presentation will discuss how just a few layers of the Semantic Web Cake can already boost your analytics by making your (big) data smarter and even more connected.
The Bounties of Semantic Data Integration for the Enterprise (Ontotext)
If you are looking for solutions that allow you not only to manage all of your data (structured, semi-structured, and unstructured) but also to make the most of it, using a common language is critical.
Adding semantic technology to data integration provides the glue that holds together all your enterprise data and its relationships in a meaningful way.
Learn how you can quickly design data processing jobs and integrate massive amounts of data, and see what semantic integration can do for your data and your business.
www.ontotext.com
Drupal Day 2011 - Thinking spatially with your open data (DrupalDay)
Talk by Juan Arevalo & Marco Giacomassi | Drupal Day Roma 2011
The Open Data movement is now moving a step forward: many governments, institutions, and businesses have recently started making information available to citizens and customers. Data is now seen as a powerful instrument for increasing transparency in public administration and business policies. About 80% of this information has a spatial component that is not yet fully exploited. A range of open source solutions is now available to address this challenge; in this session we will explore their potential and possible applications. The so-called “data deluge” is here, but we can build good umbrellas. Please come to learn more!
Semantics for Big Data Integration and Analysis (Craig Knoblock)
Much of the focus on big data has been on the problem of processing very large sources. There is an equally hard problem of how to normalize, integrate, and transform the data from many sources into the format required to run large-scale analysis and visualization tools. We have previously developed an approach to semi-automatically mapping diverse sources into a shared domain ontology so that they can be quickly combined. In this paper we describe our approach to building and executing integration and restructuring plans to support analysis and visualization tools on very large and diverse datasets.
Knowledge graphs are what businesses everywhere are now on the lookout for. But what exactly is a knowledge graph and, more importantly, how do you get one? Do you get it as an out-of-the-box solution, or do you have to build it (or have someone else build it for you)? With the help of our knowledge graph technology experts, we have created a step-by-step guide to building a knowledge graph. A knowledge graph properly exposes and enforces the semantics of the semantic data model via inference, consistency checking, and validation, and thus offers organizations many more opportunities to transform and interlink data into coherent knowledge.
Elasticsearch: breakfast briefing of March 13, 2014 (ALTER WAY)
Elasticsearch is a very powerful open source search engine based on Apache Lucene. It enables millions of records to be indexed, searched, and analyzed in real time. Elasticsearch tools are already used by leading players such as FourSquare, GitHub, OpenDataSoft, and Dailymotion.
Alter Way and Elasticsearch invite you to come discover the Elasticsearch suite, finally available in version 1.0 and ready for production!
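The indexing and searching described above go through Elasticsearch's REST interface: documents are indexed as JSON (in bulk, as NDJSON) and searched with the query DSL. A minimal sketch of the payloads, using only the standard library; the index and field names are illustrative, and actually sending the payloads (e.g. with urllib or curl against a cluster) is left out so the sketch runs standalone.

```python
import json

def bulk_payload(index, doc_type, docs):
    """Build an NDJSON body for the /_bulk endpoint (action line + doc line)."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline

def match_query(field, text):
    """Build a full-text 'match' query for the /_search endpoint."""
    return {"query": {"match": {field: text}}}

docs = [{"title": "Real-time search"}, {"title": "Log analysis"}]
body = bulk_payload("talks", "talk", docs)
print(body)
print(json.dumps(match_query("title", "search")))
```

The same `match_query` dict would be POSTed to `/talks/_search` to retrieve analyzed, scored full-text matches.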
This slide deck was prepared for a workshop on Linked Data Publishing and Semantic Processing using the Redlink platform (http://redlink.co). The workshop, delivered at the Department of Information Engineering, Computer Science and Mathematics at Università degli Studi dell'Aquila, aimed to provide a general understanding of Semantic Web technologies and how they can be used in real-world use cases such as Salzburgerland Tourismus.
A brief introduction is also included to MICO (Media in Context), a European Union part-funded research project providing cross-media analysis solutions for online multimedia producers.
SPSToronto: SharePoint 2016 - Hybrid, right choice for you and your organizat... (Knut Relbe-Moe [MVP, MCT])
SharePoint 2016 is just around the corner, and it's time to start planning for an upgrade; more important still is preparing your organization for SharePoint 2016. Is SharePoint finally ready for hybrid environments? Join this webinar to learn more about the hybrid capabilities of SharePoint 2016 and why you should consider hybrid in your environment.
Learn about Hybrid capabilities in SharePoint 2016
Learn when to use hybrid, and learn whether SharePoint is finally hybrid-capable.
As part of the final BETTER Hackathon, project partners prepared 4 hackathon exercises. Fraunhofer IAIS organised this exercise in conjunction with external partner MKLab ITI-CERTH (EOPEN project). This step-by-step exercise featured the setup of local Docker images on Linux, using Docker Compose and (pre-installed) Python, SANSA, Hadoop, Apache Spark, and Apache Zeppelin. It featured semantic transformation and the use of SANSA (Scalable Semantic Analytics Stack - http://sansa-stack.net/) libraries on a sample of tweets ahead of geo-clustering.
Project website (Hackathon information): https://www.ec-better.eu/pages/2nd-hackathon
Github repository: https://github.com/ec-better/hackathon-2020-semanticgeoclustering
Data Preparation vs. Inline Data Wrangling in Data Science and Machine Learning (Kai Wähner)
Comparison of Data Preparation vs. Data Wrangling Programming Languages, Frameworks and Tools in Machine Learning / Deep Learning Projects.
A key task in creating appropriate analytic models in machine learning or deep learning is the integration and preparation of data sets from various sources such as files, databases, big data stores, sensors, or social networks. This step can take up to 80% of the whole project.
This session compares different alternative techniques to prepare data, including extract-transform-load (ETL) batch processing (like Talend, Pentaho), streaming analytics ingestion (like Apache Storm, Flink, Apex, TIBCO StreamBase, IBM Streams, Software AG Apama), and data wrangling (DataWrangler, Trifacta) within visual analytics. Various options and their trade-offs are shown in live demos using different advanced analytics technologies and open source frameworks such as R, Python, Apache Hadoop, Spark, KNIME or RapidMiner. The session also discusses how this is related to visual analytics tools (like TIBCO Spotfire), and best practices for how the data scientist and business user should work together to build good analytic models.
Key takeaways for the audience:
- Learn various options for preparing data sets to build analytic models
- Understand the pros and cons and the targeted persona for each option
- See different technologies and open source frameworks for data preparation
- Understand the relation to visual analytics and streaming analytics, and how these concepts are actually leveraged to build the analytic model after data preparation
Video Recording / Screencast of this Slide Deck: https://youtu.be/2MR5UynQocs
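To give a concrete flavor of the preparation step the session compares tools for, here is a minimal stdlib-only sketch of three typical tasks: trimming whitespace, normalizing mixed date formats, and deduplicating rows. Real projects would reach for the ETL, streaming, or wrangling tools named above; the column names and date formats here are made up for the example.

```python
import csv, io
from datetime import datetime

# Messy input: inconsistent whitespace, two date formats, a duplicate row.
raw = io.StringIO(
    "name,signup\n"
    "  Alice ,2015-03-01\n"
    "Bob,01/04/2015\n"
    "  Alice ,2015-03-01\n"
)

def parse_date(value):
    """Accept either ISO or day-first slash dates; normalize to ISO."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            pass
    raise ValueError("unrecognized date: %r" % value)

seen, cleaned = set(), []
for row in csv.DictReader(raw):
    record = (row["name"].strip(), parse_date(row["signup"]))
    if record not in seen:          # deduplicate normalized rows
        seen.add(record)
        cleaned.append(record)

print(cleaned)  # [('Alice', '2015-03-01'), ('Bob', '2015-04-01')]
```

Each of the tool families compared in the session automates some variant of exactly these steps, differing mainly in scale, latency, and who (engineer vs. analyst) drives the transformation.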
Linked Statistical Data: does it actually pay off? (Oscar Corcho)
Invited keynote at the ISWC2015 Workshop on Semantics and Statistics (SemStats 2015). http://semstats.github.io/2015/
The release of the W3C RDF Data Cube recommendation was a significant milestone towards improving the maturity of the area of Linked Statistical Data. Many Data Cube-based datasets have been released since then, and tools for the generation and exploitation of such datasets have also appeared. While the benefits of using the RDF Data Cube vocabulary and generating Linked Data in this area seem clear, there are still many challenges associated with the generation and exploitation of such data. In this talk we will reflect on them, based on our experience generating and exploiting this type of data, and hopefully provoke some discussion about what the next steps should be.
The Power of Semantic Technologies to Explore Linked Open Data (Ontotext)
Presentation by Atanas Kiryakov, Ontotext’s CEO, at the first edition of Graphorum (http://graphorum2017.dataversity.net/), a new forum that taps into the growing interest in graph databases and technologies. Graphorum is co-located with the Smart Data Conference, organized by the digital publishing platform Dataversity.
The presentation demonstrates the capabilities of Ontotext’s own approach to contributing to the discipline of more intelligent information gathering and analysis by:
- graphically exploring the connectivity patterns in big datasets;
- building new links between identical entities residing in different data silos;
- getting insights into what types of queries can be run against various linked data sets;
- reliably filtering information based on relationships, e.g., between people and organizations, in the news;
- demonstrating the conversion of tabular data into RDF.
Learn more at http://ontotext.com/.
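The last point above, converting tabular data into RDF, can be illustrated with a minimal stdlib-only sketch that turns CSV rows into N-Triples. The base namespace and predicate URIs are invented for the example; a real pipeline would map columns to an agreed ontology and handle datatypes and escaping properly.

```python
import csv, io

# Hypothetical namespace and predicates, for illustration only.
BASE = "http://example.org/company/"
PRED = {"name": "http://example.org/vocab/name",
        "city": "http://example.org/vocab/city"}

def row_to_ntriples(row):
    """Turn one CSV row into N-Triples, keyed by the 'id' column."""
    subject = "<%s%s>" % (BASE, row["id"])
    return ["%s <%s> \"%s\" ." % (subject, PRED[col], row[col])
            for col in ("name", "city")]

table = io.StringIO("id,name,city\n42,Ontotext,Sofia\n")
triples = [t for row in csv.DictReader(table) for t in row_to_ntriples(row)]
print("\n".join(triples))
```

Once loaded into an RDF store, rows from different tables sharing the same subject URI merge automatically into one graph, which is the payoff the slides argue for.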
War stories from building the Global Patent Search Network, and why Data folks need to think more about UX and Discovery, and UX folks need to think more about Data.
SharePoint Content and Usage Reports - a guide by SPDocKit (SysKit Ltd)
Check the SharePoint structure as well as content and feature usage to reveal how the company and its end users utilize SharePoint. Check the number of unique visitors for each site collection, or get a list of all visitors per selected site.
www.spdockit.com
How Data Virtualization Adds Value to Your Data Science Stack (Denodo)
Watch here: https://bit.ly/3cZGCxr
For their machine learning and data science projects to be successful, data scientists need access to all of the enterprise data delivered through their myriad data models. However, gaining access to all data, integrated into a central repository, has been a challenge; often 80% of project time is spent on these tasks. A virtual layer can help data scientists speed up some of the most tedious tasks, such as data exploration and analysis, while also integrating well with the data science ecosystem. There is no need to change tools or learn new languages. The data virtualization platform helps data scientists offload these data integration tasks, allowing them to focus on advanced analytics.
In this session, you will learn how data virtualization:
- Provides all of the enterprise data, in real-time, and without replication
- Enables data scientists to create and share multiple logical models using simple drag and drop
- Provides a catalog of all business definitions, lineage, and relationships
2017 01-11 intelligent search and intranet - chihuahuas vs muffins v1 (Don Miller)
This is a presentation for people looking to improve enterprise search and intranets. It provides details on Microsoft Search, Azure Search, and Elasticsearch, and on how to take a basic search platform and transform it into what Gartner calls Insight Engines and what Forrester calls Cognitive Search and Knowledge Discovery.
So you might have heard of Project Cortex, allowing you to auto-tag information in SharePoint and extract knowledge from your content. But what if you can't wait for the preview? Or you are in an on-premises scenario? You can use the Azure Cognitive services directly from your SharePoint on-premises environment! In this session, you will learn how you can extend your on-premises data in SharePoint with the different cognitive services Azure offers, including Azure Text Analytics and LUIS.
Dandelion API and Atoka: two useful tools for Data Journalism (SpazioDati)
Ideas on how to use structured and unstructured data in the work of the data journalist, through Dandelion API - https://dandelion.eu - and Atoka - https://atoka.io.
Lecture given at the course "Media digitali e Data Journalism", on 19/11/2015 - http://www.coris.uniroma1.it/node/9152.
News Fact-checking: One Practical Application of Linked Statistics (SpazioDati)
This is the poster for SemStats at ISWC 2014 in Riva del Garda. SemStats 2014 was the Second International Workshop on Semantic Statistics. Our poster presents a use case on fact-checking that exploits the potential of Linked Statistics.
Find the specific Wikipedia page you’re looking for, using Wikisearch API (SpazioDati)
Wikisearch is a new semantic search API that helps you find the specific Wikipedia page you’re looking for. It is designed to work even if you don’t remember the exact title, or have only a vague recollection that it relates to some specific topic.
Dandelion API and mobile payment: food for thought for H-ACK PAYMENT (SpazioDati)
Some ideas for using Dandelion API at H-ACK PAYMENT, a hackathon at H-FARM on mobile payment: managing unstructured data and content to improve the user experience in mobile payment scenarios, from more contextual information to contextual ads.
Linked STAT for the datalab event with ISTAT at the Smart City Exhibition 2013 (SpazioDati)
Presentation shown at the Open Census event within the Datalab curated by ISTAT at the Smart City Exhibition 2013. It describes SpazioDati's intention to continue working with ISTAT to transform statistical data into RDF, using the Data Cube vocabulary.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 (Albert Hoitingh)
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface of their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many of those features offer convenience and capability at the expense of security. This best practices guide outlines steps users can take to better protect their personal devices and information.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
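One of the tasks described above, enriching plain text with appropriate XML markup, can be illustrated with a toy sketch. A lookup table of known terms stands in for the AI (this is not an AI system, just the shape of its output), wrapping recognized terms in hypothetical `<term>` elements and checking that the result is well-formed XML.

```python
import xml.etree.ElementTree as ET

# A lookup table plays the role of the AI; element and attribute names
# are invented for the example.
KNOWN_TERMS = {"XSLT": "language", "Schematron": "schema"}

def enrich(text):
    """Wrap known terms in <term> elements and return a <p> fragment."""
    for term, kind in KNOWN_TERMS.items():
        text = text.replace(term, '<term type="%s">%s</term>' % (kind, term))
    return "<p>%s</p>" % text

marked_up = enrich("Rules written in Schematron complement XSLT.")
ET.fromstring(marked_up)  # raises ParseError if the markup is malformed
print(marked_up)
```

An AI-assisted workflow replaces the lookup table with model output, which makes the well-formedness and validation checks at the end all the more important.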
Communications Mining Series - Zero to Hero - Session 1 (DianaGray10)
This session provides an introduction to UiPath Communications Mining, its importance, and a platform overview. You will acquire a good understanding of the phases of Communications Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Climate Impact of Software Testing at Nordic Testing Days (Kari Kakkonen)
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Sustainability can be added to the quality characteristics and then measured continuously. Test environments can be used less, at a smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
UiPath Test Automation using UiPath Test Suite series, part 6 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series, part 6. In this session, we will cover test automation with generative AI and OpenAI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, as a test automation solution, with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs (Alex Pruden)
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Generative AI Deep Dive: Advancing from Proof of Concept to Production (Aggregage)
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0! (SOFTTECHHUB)
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
UiPath Test Automation using UiPath Test Suite series, part 5DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD with in UiPath
End-to-end overview of CI/CD pipeline with Azure devops
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
5. Data curation is the process of turning independently created data sources (structured and semi-structured data) into unified data sets ready for analytics, using domain experts to guide the process.
http://strataconf.com/stratany2014/public/schedule/detail/36021
6. A lot of things involved:
ETL (Extract-Transform-Load) tools
Data Science tools
Linked Data tools
Big Data tools
Domain Knowledge
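To make the ETL category above concrete, here is a minimal sketch of the Extract-Transform-Load pattern in Python. All field names, sample records, and helper functions are hypothetical illustrations, not SpazioDati's actual pipeline.

```python
# Minimal ETL sketch: unify independently created, semi-structured
# company records into one cleaned, indexed data set.

def extract(raw_rows):
    """Extract: parse semi-structured 'name;vat' strings into dicts."""
    return [dict(zip(("name", "vat"), row.split(";"))) for row in raw_rows]

def transform(records):
    """Transform: normalize casing and strip stray whitespace."""
    return [
        {"name": r["name"].strip().title(), "vat": r["vat"].strip()}
        for r in records
    ]

def load(records, store):
    """Load: index the cleaned records by VAT number."""
    for r in records:
        store[r["vat"]] = r
    return store

store = {}
rows = ["  spazioDati srl ;01234567890", "ACME CORP; 09876543210 "]
load(transform(extract(rows)), store)
print(store["01234567890"]["name"])  # "Spaziodati Srl"
```

Real pipelines add schema validation, deduplication, and the domain-expert review the slide's definition of data curation calls for; the shape, though, stays extract → transform → load.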
15. Our Entity Extraction API is based on a graph
[Graph diagram: entities such as Brussels, Paris, Berlin, Eiffel Tower, 2009 World Championships in Athletics, King Baudouin Stadium, and Champ de Mars, connected by edges carrying relatedness scores ranging from 0.42 to 0.80]
https://dandelion.eu/docs/api/datatxt/nex/v1/
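A hedged sketch of calling the entity-extraction endpoint referenced above. The endpoint path comes from the docs URL on the slide; the parameter names (`text`, `token`, `min_confidence`) and response fields (`annotations`, `title`, `confidence`) are assumptions based on the public documentation at the time, so check the current docs before relying on them. No network call is made here; the response is a hypothetical sample.

```python
# Sketch of a Dandelion datatxt/nex entity-extraction request.
# Parameter and response field names are assumptions from the public docs.
import json
from urllib.parse import urlencode

NEX_ENDPOINT = "https://api.dandelion.eu/datatxt/nex/v1/"

def build_nex_url(text, token, min_confidence=0.6):
    """Build the GET URL for an entity-extraction request."""
    params = {"text": text, "token": token, "min_confidence": min_confidence}
    return NEX_ENDPOINT + "?" + urlencode(params)

def extract_entities(response_body):
    """Pull (title, confidence) pairs out of a NEX JSON response."""
    data = json.loads(response_body)
    return [(a["title"], a["confidence"]) for a in data.get("annotations", [])]

# A hypothetical response, shaped like the docs' example:
sample = json.dumps({
    "annotations": [
        {"title": "Eiffel Tower", "confidence": 0.80},
        {"title": "Champ de Mars", "confidence": 0.63},
    ]
})
print(extract_entities(sample))  # [('Eiffel Tower', 0.8), ('Champ de Mars', 0.63)]
```

The confidence values play the same role as the edge weights in the graph diagram: they let a caller keep only entities the graph links strongly to the input text.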
28. Search: how it works
1. Direct search of one particular company through its name or "partita iva" (VAT number)
2. Content search into company websites [*]
3. Keyword search among extracted and refined entities from company resources [*]
[*] Dandelion API is the extraction engine!
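The third search mode above can be sketched as a small inverted index over previously extracted entities. The index structure, company names, and entity lists here are hypothetical illustrations; the real search stack behind Atoka is not shown.

```python
# Sketch of keyword search over entities extracted from company resources:
# an inverted index mapping each entity to the companies it appears in.
from collections import defaultdict

def build_index(company_entities):
    """Map each extracted entity (lowercased) to its set of companies."""
    index = defaultdict(set)
    for company, entities in company_entities.items():
        for entity in entities:
            index[entity.lower()].add(company)
    return index

# Hypothetical output of the extraction step:
company_entities = {
    "ACME Srl": ["machine learning", "logistics"],
    "Foo SpA": ["machine learning", "textiles"],
}
index = build_index(company_entities)
print(sorted(index["machine learning"]))  # ['ACME Srl', 'Foo SpA']
```

Because the keys are refined entities rather than raw tokens, a query like "machine learning" matches companies whose websites mention the concept, which is the advantage the slide claims over plain content search.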
38. References
1) From raw data to dataGEMs for developers - http://ceur-ws.org/Vol-1268/paper1.pdf
2) Knowledge Graph ovunque - http://www.slideshare.net/dagoneye/knowledge-graphs-ovunque-un-quadro-di-insieme-e-le-implicazioni-per-uno-sviluppo-condiviso-del-web-of-data
3) Linking Enterprise Data - https://www.springer.com/it/book/9781441976642
4) Using OpenRefine - https://www.packtpub.com/big-data-and-business-intelligence/using-openrefine
5) Why Your Business Needs A Customer Data Knowledge Graph - http://www.dataversity.net/business-needs-customer-data-knowledge-graph/
6) Enabling parallel processing for OpenRefine: Spark vs Akka - http://refinepro.com/blog/enabling-parallel-processing-for-openrefine-spark-vs-akka/