Social Mining & Big Data Ecosystem
Educating Data Scientists: the SoBigData master experience
www.sobigdata.eu
Fosca Giannotti, Valerio Grossi
ISTI-CNR Pisa
H2020-INFRAIA-2014-2015
Grant Agreement N. 654024
Modern science is data-intensive,
multidisciplinary, collaborative and global
– efficiency of data management (NoSQL paradigms and
cloud computing play an important role here) and of
curation, search, sharing and transfer.
– managing the complexity of the analytical process is a
key issue (scalable distributed analytical methods
and Visual Analytics are crucial here).
Firenze, 14 Nov 2016
Big Data Analytics process
[Figure: the process links Data (demographic, geographic, movement and transport data), Models (T-Clustering, T-Patterns), Forecasts and Validation]
Interdisciplinary and collaborative
• for sharing data, models, processes and results of
experiments (different levels of interoperability and semantic
enrichment)
• to realize experiments by combining resources (data, methods
and results) belonging to different communities.
– This calls for tools that facilitate the governance of complex
analytical processes in a workflow style or through mega-modeling.
– This also calls for sophisticated search that supports resource
discovery.
Data scientist
A new kind of professional
has emerged, the data
scientist, who combines the
skills of software
programmer, statistician and
storyteller/artist to extract
the nuggets of gold hidden
under mountains of data.
Four core points of a data scientist
• Data Procurement and Curation
• Making sense of Data
• Story-telling
• Answering, step by step, for technical correctness and
for legal and ethical issues
SoBigData is…
A Multidisciplinary European Infrastructure for Big Data and Social
Data Mining providing an integrated ecosystem for ethically
sensitive scientific discoveries and advanced applications of social
data mining on the various dimensions of social life, as recorded by
“big data”.
Social Mining - Answer to:
• Who will win the US elections? What are the electors' current
voting intentions? How reliable are they?
• What are the indicators of social well-being (beyond GDP)
and how can they be computed and monitored?
• How is the aging population effectively helped by social
participation in digital community services?
• What is the link between media ownership and media
content? Is there bias in news reporting? And in content
reviews?
• Is an infectious disease emerging? What is its diffusion model?
Estimating traffic flows on the road network with mobile phone
data
[Figure: diagram with elements labelled A, B, C, H and W]
Predicting Success
“Football is a simple game: 22 men chase a ball for 90
minutes and at the end, the Germans always win”
-- Gary Lineker (after Italy 1990 Final)
Managing Data: what does this mean?
Managing data does not just mean:
• Supporting discovery
• Providing access, verifying the quality of data, cleaning errors, outliers and anomalies
• Transforming data into a format suitable for specific data analytics tools
It must include support for
• legal interoperability
– copyright management,
– licensing of single and derivative products
– terms of use
• fine-grained policies
– attribution,
– citation policy,
– provenance management
• Ethics issues
Metadata in the SoBigData RI
experience
• Huge datasets often describe human activities, which implies
privacy and ethical issues
• As a Research Infrastructure, FAIRness is one of our main targets
– The success of the RI is directly connected to the fact that
datasets are Findable, Accessible, Interoperable and
Reusable
– The intellectual property has to be considered
– The design of a highly structured metadata schema allows
the RI to automatically grant or deny access to a dataset, to
force the acceptance of terms of use or signing NDAs…
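The last point can be illustrated with a minimal sketch of a metadata-driven access decision. The field names (`shareable`, `requires_terms_of_use`, `requires_nda`) and the `decide` function are hypothetical, introduced only for this example; the slides do not show the actual SoBigData schema or its access logic.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Decision(Enum):
    GRANT = "grant"
    DENY = "deny"


@dataclass(frozen=True)
class AccessPolicy:
    # Hypothetical access-related metadata fields (not the real SoBigData schema).
    shareable: bool
    requires_terms_of_use: bool = False
    requires_nda: bool = False


@dataclass(frozen=True)
class Request:
    # What the requesting researcher has already done.
    accepted_terms: bool = False
    signed_nda: bool = False


def decide(policy: AccessPolicy, request: Request) -> Tuple[Decision, List[str]]:
    """Return the decision and the steps the requester still has to complete."""
    if not policy.shareable:
        # The dataset cannot be shared at all, whatever the requester does.
        return Decision.DENY, []
    missing = []
    if policy.requires_terms_of_use and not request.accepted_terms:
        missing.append("accept terms of use")
    if policy.requires_nda and not request.signed_nda:
        missing.append("sign NDA")
    return (Decision.DENY if missing else Decision.GRANT), missing
```

Returning the list of outstanding steps alongside the decision lets a portal prompt the researcher to accept the terms of use or sign the NDA instead of simply refusing access.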
SoBigData metadata structure
• A highly structured and detailed metadata structure
has been designed in order to provide information
about:
– Description of the dataset (to make it Findable)
– How the dataset has been produced
– Intellectual Property
– Privacy issues
– Who can access the data and how (terms of use,
NDA…)
• Mainly based on the DataCite standard
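As a rough illustration of such a record, the sketch below groups fields by the five information areas listed above. The descriptive fields follow the mandatory DataCite properties (identifier, creator, title, publisher, publication year, resource type); every other field name, and the `missing_sections` helper, is an assumption made for this example and not the actual SoBigData schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetRecord:
    # Description (named after the mandatory DataCite properties).
    identifier: str
    creators: List[str]
    title: str
    publisher: str
    publication_year: int
    resource_type: str
    # Hypothetical fields for the remaining areas listed on the slide.
    provenance: str = ""                  # how the dataset has been produced
    license: str = ""                     # intellectual property
    contains_personal_data: bool = False  # privacy issues
    access_conditions: List[str] = field(default_factory=list)  # terms of use, NDA


def missing_sections(record: DatasetRecord) -> List[str]:
    """List the information areas that are still empty in a record."""
    missing = []
    described = all([record.identifier, record.creators, record.title,
                     record.publisher, record.resource_type])
    if not described:
        missing.append("description")
    if not record.provenance:
        missing.append("provenance")
    if not record.license:
        missing.append("intellectual property")
    # Assumption: personal data must always come with explicit access conditions.
    if record.contains_personal_data and not record.access_conditions:
        missing.append("access conditions")
    return missing
```

A check of this kind could run when a dataset is registered, so that incomplete records are flagged before they are published in the catalogue.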
The ethics of SoBigData
• Gathering large quantities of data has serious consequences
that SoBigData is trying to address. These consequences range
from personal harm, to issues of autonomy, injustice and
inequality.
• In order to deal with these problems, SoBigData adheres to a
value-sensitive design approach. This approach consists in using
design solutions to overcome ethical dilemmas, in this case
those between the utility of the data gathered and the
protection of the individuals subject to the research.
• In order to make the ideals of SoBigData successful, scientific
methods also need to be developed in order to embed moral
principles in practice.
Ethics: the challenge for SoBigData
• How do we create an infrastructure in which such data
and methods can be disseminated and improved
upon?
1. A Massive Open Online Course (MOOC) which instructs all
prospective researchers about the legal and ethical
dangers of big data research and the steps they can take to
minimise these;
2. A set of workflows that outline the steps researchers can
take when designing their approach;
3. Information pop-ups which redirect researchers to
state-of-the-art ethical methods.
Metadata definition: Ethics
Metadata definition: Intellectual Property
Master in Big Data Analytics & Social Mining
http://www.sobigdata.eu/master/bigdata
Education
• Big Data Sensing
• Big Data Mining
• Big Data Story Telling
• Big Data Technology
• Big Data for Social Good
• Big Data Ethics
Students: their studies
[Figure: bar chart of students' study backgrounds for 2015 and 2016; vertical axis from 0 to 8]
Gender distribution
[Figure: bar chart of male (M) and female (F) students for 2014-2015 and 2015-2016; vertical axis from 0 to 25]
