SoBigData. European Research Infrastructure for Big Data and Social Mining
1. Social Mining &
Big Data Analytics
H2020 - www.sobigdata.eu
September 2015 – August 2019
@SoBigData (https://twitter.com/SoBigData)
https://www.facebook.com/SoBigData
3. Delft 17 – 19 February 2016
Integrating national research
Infrastructures
4. SoBigData is…
A Multidisciplinary European Infrastructure on Big Data and Social
Data Mining providing an integrated ecosystem for ethic-sensitive
scientific discoveries and advanced applications of social data
mining on the various dimensions of social life, as recorded by “big
data”.
5. SMARTCATs
The GOAL of the Research Infrastructure
• to integrate key national infrastructures and centres of
excellence at European level in social mining and big data
analytics
• to enable cutting-edge, multi-disciplinary social mining &
responsible data science experiments leveraging the Research
Infrastructure assets: big data sets, analytical tools and
services, and data scientist skills
• to grant access (both online and on-site) to multidisciplinary
scientists, innovators, public bodies, citizen organizations,
SMEs, as well as data science students at any level of
education.
6. SMARTCATs
The pillars for reaching the goal
• a distributed data ecosystem for procurement, access and
curation of big social data
• a distributed platform of interoperable, social data mining
methods and associated “data scientist” skills for mining,
analysing, and visualising complex and massive datasets
• a community of multidisciplinary scientists, innovators,
public bodies, citizen organizations, SMEs, as well as data
science students at any level of education, brought
together by extensive networking and innovation actions
7. What are our researchers doing?
• any responsible data science experiment is
composed of:
– data acquisition (open data, crowdsourcing,
crowdsensing),
– model building (with a very complex validation phase),
– creation of an exploration scenario (what-if
analysis), with a different validation setting,
– … similar to many other data-driven science
processes, but here the data are produced by humans
9. Exploratories
Social Mining Research Environments
tailored to specific multidisciplinary
domains
• Promotes results sharing among scientists and
communities
• Promotes the use of RI through Virtual and
Transnational Access
10. Big Data for Societal Debates
Polarization, controversy and topic trends on societal debates through social media
Led by Aris Gionis and Dominic Rout
17. Sentiment Analysis
• Internal and external perception by country
– Index ρ: the ratio of users in favour of refugees to users against
refugees
– Red means a higher predominance of positive sentiment, higher ρ
– Yellow means a higher predominance of negative sentiment, lower ρ
(a) Overall. (b) Internal perception. (c) External perception. [Each panel: scale from − to +]
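The index ρ defined on this slide can be sketched in a few lines. This is only an illustration: the function name `rho_index`, the country labels and the user counts are hypothetical and do not come from the study, and treating a country with no "against" users as infinite ρ is an assumption made here.

```python
def rho_index(pro_users: int, against_users: int) -> float:
    """Return rho, the ratio of pro-refugee users to against-refugee users."""
    if against_users == 0:
        # The ratio is undefined here; returning infinity is an assumption of this sketch.
        return float("inf")
    return pro_users / against_users


# Hypothetical (pro, against) user counts per country, for illustration only.
counts = {"A": (300, 100), "B": (50, 200)}

rho_by_country = {
    country: rho_index(pro, against) for country, (pro, against) in counts.items()
}
```

A higher ρ corresponds to a predominance of positive sentiment (red on the slide), a lower ρ to a predominance of negative sentiment (yellow).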
18. SoBigData e-Infrastructure
• An Exploratory is also a Virtual Research Environment (VRE):
– VREs are web-based, community-oriented, comprehensive, flexible, and
secure working environments
– VREs are tailored to satisfy the needs of a designated community.
• services for data and methods discovery and access
• collaboration-oriented facilities for scientists
– DATA: sharing policies differ; a dataset may be shared or not
– METHODS: offered as web services (executed over a variety of data centers), or
as downloads (packages to be executed on the DATA side)
– WORKFLOW: a complex analytical process that may involve executions on
different sites (currently only described, or executed on-site on some
special analytical platforms)
Firenze, 14 Nov 2016
19. e-infra design
• SoBigData.eu portal
• SoBigData.eu Catalogue: a set of
functionalities to search, index and discover
all resources (data, models and workflows)
(powered by D4Science)
• Virtual Research Environments (Exploratories):
functionalities to create, update and
operate them (powered by D4Science)
22. SoBigData example: Resource Catalogue
[Screenshot callouts: Search for datasets and methods; Description; Recent Activities; Action Bar; Recent Products; Statistics]
23. VRE example: SoBigData VRE
[Screenshot callouts: Application; Posting messages to other VRE users; VRE Abstract; VRE Managers; News Feed; Top Topics; Recent Files]
24. The ethics of SoBigData
• Gathering large quantities of data may have serious
consequences:
– consequences range from personal harm,
– to issues of autonomy, injustice and inequality.
• Making Big Data accessible is a value for democracy
• SoBigData adheres to a value-sensitive design
approach:
– design solutions to overcome ethical dilemmas, in this
case the tension between the utility of the data gathered and
the protection of the individuals subject to the research.
25. Ethics in practice
• SoBigData has an ethical framework which
provides a broad overview of all the ethical
concerns of big data.
• But, as per the value-sensitive design (VSD) outlook,
data protection is not only the concern of the
ethicists. To make the ideals of SoBigData succeed,
scientific methods also need to be developed
in order to embed moral principles in practice.
26. The ethics of SoBigData
• How do we create an infrastructure in which such
methods can be disseminated and improved
upon?
• Data Management Plan plays a key role:
– Each dataset has its own privacy requirements, fact checks
and responsibilities
• Anonymization techniques are part of the
research
• Researchers will be trained in applying the
necessary procedural safeguards
28. Educating the responsible data
scientists
Based on a cooperation between ethicists and
computer scientists:
1. A Massive Open Online Course (MOOC) which
instructs all prospective researchers about the
legal and ethical dangers of big data research
and the steps they can take to minimise these;
2. A set of workflows that outline the steps
researchers can take when designing their
approach;
3. Information pop-ups which redirect researchers
to state-of-the-art ethical methods.
29. New challenges are coming
• One of the OECD ideals is algorithmic
transparency, and the GDPR also says that
decision-making algorithms should
be explainable.
• But what is enough to constitute an
explanation?
• We are working on a
template that would satisfy most people's
conceptions of what an explanation should be.
30. New challenges are coming
• What function should the terms of use have?
Currently, SoBigData is trying to defer legal
responsibility to the final user through the ToS,
but this is difficult.
• Example: how do we deal with Twitter's
intellectual property rights? May the scraper
violate Twitter's terms of use? The point above on
responsibility also holds here. How does SoBigData relate
to data collectors' terms of service?
31. SoBigData metadata structure
• A highly structured and detailed metadata
structure has been designed in order to provide
information about:
– Description of the dataset (to make it Findable)
– How the dataset has been produced
– Intellectual Property
– Privacy issues
– Who can access the data and how (terms of use,
NDA…)
• Mainly based on the DataCite standard
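As a rough illustration of the information groups listed above, a metadata record could be modelled as below. This is a sketch only: the class name, the field names and the sample values are hypothetical, and they are neither the actual SoBigData schema nor DataCite property names.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class DatasetMetadata:
    # Hypothetical field names mirroring the information groups on the slide.
    description: str            # description of the dataset (makes it findable)
    provenance: str             # how the dataset has been produced
    intellectual_property: str  # who holds rights on the dataset
    privacy_issues: str         # privacy concerns attached to the dataset
    # Who can access the data and how (terms of use, NDA, ...).
    access_conditions: List[str] = field(default_factory=list)


# Hypothetical record, for illustration only.
record = DatasetMetadata(
    description="Sample of public social media posts",
    provenance="Collected through a public API",
    intellectual_property="Held by the data provider",
    privacy_issues="Contains user identifiers",
    access_conditions=["terms of use", "NDA"],
)

# A plain dictionary is convenient for indexing the record in a catalogue.
record_dict = asdict(record)
```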
32. Managing data: disambiguating terms
• Copyright
Copyright is a legal right that grants the creator of an original work
exclusive rights for its use and distribution for a limited amount of time.
Copyright can exist on individual data as well as over a dataset or
database as a whole. Factual data and metadata are not eligible
for copyright protection.
• License
A license is a unilateral permission granted by the right holder (the licensor) to
the licensee to exercise certain rights. Licenses differ from
contracts in that granting a license does not require mutual
agreement.
• Terms of Use
The terms of use are rules that one must obey in order to use the data or
service. A terms of use agreement is mainly used for legal purposes by
data providers and databases that store data. A legitimate terms of use
agreement is legally binding and may be subject to change.
33. Managing data: types of data and their licenses
There are three broad types of data:
• Primary/raw data: data coming directly from the
source;
• Derivative data: data processed from primary/raw
data;
• Metadata: reference data describing either the
primary/raw data or derivative data.
Primary/raw data and derivative data may be licensed
under different conditions and by different stakeholders.
34. Managing data: dealing with complexity
The e-infrastructure offering discovery and access may be operated by a
different actor and offered under specific terms of use.
In this case three types of licenses may be involved:
1. the one agreed between the system operator and the primary data
owner,
2. the one selected for the derivative product, which may differ from the one
associated with the primary data, and
3. the one agreed between the system operator and the data
consumer.
All these licenses have to be captured by the “terms of use” of the
e-infrastructure, i.e., they are part of the rules a consumer must agree to
accept when using the system.
35. Managing data: VRE as an instrument to manage the complexity
A re-use license (to be specified in the ToU of the e-infrastructure) concerns at least
attribution, copyleft requirements, and control over commercial exploitation of the
dataset. Some form of access control must also be managed and applied.
Virtual Research Environments (VREs) offer flexible and secure web-based,
community-centric platforms, so researchers can work together on common
challenges.
VRE terms of use are automatically composed according to the combined data and
services selected at the time of VRE definition.
• Raw data are licensed according to the license expressed by the data
owner/custodian at the time the data content is registered in the
e-infrastructure.
• Derivative data products are instead licensed under a license compatible and legally
interoperable with the one associated with the primary data.
• It remains the responsibility of the individual user, as expressed in the VRE terms
of use, to confirm the license to associate with any derivative data produced.
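The automatic composition of VRE terms of use described on this slide could look roughly like the sketch below. It is not the D4Science implementation: the `Resource` class, both functions, the resource names, the license labels and the compatibility table are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Resource:
    name: str
    kind: str     # "data" or "service"
    license: str  # license expressed by the owner/custodian at registration time


# Hypothetical table: primary license -> licenses accepted for derivative data.
COMPATIBLE: Dict[str, Set[str]] = {
    "LICENSE-A": {"LICENSE-A", "LICENSE-B"},
    "LICENSE-B": {"LICENSE-B"},
}


def compose_terms_of_use(resources: List[Resource]) -> List[str]:
    """Build one clause per selected resource, in selection order, skipping duplicates."""
    clauses: List[str] = []
    for res in resources:
        clause = f"{res.kind} '{res.name}' is used under {res.license}"
        if clause not in clauses:
            clauses.append(clause)
    return clauses


def derivative_license_allowed(primary: str, derivative: str) -> bool:
    """Check whether the derivative license is listed as compatible with the primary one."""
    return derivative in COMPATIBLE.get(primary, set())


# Resources selected at VRE definition time (hypothetical).
selected = [
    Resource("tweets-sample", "data", "LICENSE-A"),
    Resource("sentiment-service", "service", "LICENSE-B"),
]
vre_terms = compose_terms_of_use(selected)
```

In this sketch the compatibility check only informs the user; confirming the license of a derivative product remains the user's responsibility, as the slide states.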
38. The SoBigData.it research laboratory is organising
the first Tuscan Big Data Challenge,
a free opportunity for companies to
improve their business. Through
advanced analysis of the data produced by
companies or extracted from the internet, it is possible
to obtain useful information on several fronts.