The talk describes the field of Data Science and gives examples of several multi-model applications (tables, text, and graphs) across multiple disciplines (biology, nursing, education).
Supervised Multi Attribute Gene Manipulation For Cancer (paperpublications3)
Abstract: Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviours, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems.
They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently, generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery.
Presentation on CIAT's IABIN tools project on threats to biodiversity in Latin America, presented in Costa Rica in February 2011. See http://dapa.ciat.cgiar.org for more information.
Zuur et al. 2010, Methods in Ecology and Evolution, a protocol for data explorat... (Lisiane Zanella)
This document provides a protocol for data exploration to avoid common statistical problems when analyzing ecological data. It discusses exploring data for outliers, heterogeneity, collinearity, dependence, and other issues. The protocol aims to identify potential problems before statistical analysis to reduce type I and II errors and ensure robust conclusions. Data exploration is presented as an essential first step, taking up to 50% of analysis time. Graphical tools are emphasized over tests for exploring data visually and identifying issues to address. The document provides examples and discusses handling outliers and other problems when they arise.
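One step of such a protocol, checking predictors for collinearity, can be sketched with variance inflation factors. This is a minimal illustration with simulated data, not code from the paper (which works in R); the cutoff of 10 is one common rule of thumb, and some authors use stricter values such as 3.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of design matrix X.

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on all the remaining columns. Large values (often
    VIF > 10) flag problematic collinearity.
    """
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # add intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)                    # independent covariate
vifs = vif(np.column_stack([x1, x2, x3]))
print([round(v, 1) for v in vifs])           # x1 and x2 get large VIFs, x3 near 1
```

Dropping or combining one of the offending covariates before modeling is the usual remedy the protocol points toward.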
DataONE Education Module 01: Why Data Management? (DataONE)
Lesson 1 in a set of 10 created by DataONE on Best Practices for Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/education-modules. Released under a CC0 license; attribution and citation requested.
Sustainable Development Indicators & Metrics (gaiametrics-sr)
John O'Connor opened remarks at the Bibliotheca Alexandrina by discussing frameworks for sustainable development and indicators to monitor progress. He covered topics such as capital stocks, multifactor productivity, intangible assets, and the need for concise indicator sets to track changes in access to resources for current and future generations. O'Connor advocated for overhauling information systems using modern technologies through public-private partnerships to support sustainable development goals.
IABIN Threat Assessment Project Presentation (Costa Rica) by Andy Jarvis from... (Hector)
The document discusses improving biodiversity data quality for South America. It describes assessing occurrence records from three databases to identify reliable coordinates, develop scripts for automated data cleaning, and georeference additional records. Approximately 19,000 species from 3,900 genera were modeled to analyze threats from accessibility, deforestation, and fires. Conservation status was evaluated by calculating protected areas within species ranges. A web-based tool to visualize the results is under development.
Approaches and Techniques for Managing Human-Elephant Conflicts in Western Se... (Isaac Yohana Chamba)
A research proposal submitted toward a Master of Science in Ecosystems Science and Management at Sokoine University of Agriculture (SUA), academic years 2016-2018. The research seeks new approaches to managing human-elephant conflicts for better and more sustainable management of socio-ecological systems in the Ikorongo-Grumeti Game Reserves, in other protected areas within Tanzania, and in areas outside the country facing similar problems. The project is funded by the Singita Grumeti Fund (SGF), 2017.
Statistics and machine learning for integrating data from bio... (tuxette)
This document summarizes a presentation on using statistics and machine learning for integrating high-throughput biological data. It discusses how biological data is large in volume, multi-scaled and heterogeneous in type, creating bottlenecks for analysis. It presents different methods for integrating multiple data tables, including multiple kernel learning to combine similarity matrices. An example application to TARA Oceans data is described, identifying Rhizaria abundance as structuring ocean differences. Interpretability of results is discussed along with prospects for deep learning and predicting phenotypes while understanding relationships.
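The kernel-combination idea mentioned above can be sketched as a convex combination of per-table similarity matrices. This is a toy illustration with invented data; a full multiple kernel learning method would also learn the weights from the data rather than fixing them, as here.

```python
import numpy as np

def combine_kernels(kernels, weights):
    """Convex combination of precomputed similarity (kernel) matrices.

    Each kernel summarizes one data table over the same samples;
    the weighted sum gives a single similarity matrix for downstream
    analysis (clustering, kernel PCA, ...).
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # normalize so weights sum to 1
    return sum(wi * K for wi, K in zip(w, kernels))

rng = np.random.default_rng(1)
# Two toy data tables describing the same 5 samples at different "scales"
X1 = rng.normal(size=(5, 3))
X2 = rng.normal(size=(5, 8))
K1 = X1 @ X1.T                           # linear kernel for table 1
K2 = X2 @ X2.T                           # linear kernel for table 2
K = combine_kernels([K1, K2], [0.7, 0.3])
print(K.shape)                           # a single 5x5 symmetric matrix
```

Working on kernels rather than raw tables is what lets heterogeneous data types (abundances, environmental measurements, networks) be merged into one analysis.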
The document discusses the opportunities for open data in agriculture and nutrition from increased data availability due to advances in life sciences, information technologies, and data analytics. It outlines the need for an infrastructure to support FAIR (findable, accessible, interoperable, reusable) data principles and addresses issues like data rights, infrastructure, interoperability, and gaps. Examples are given of commercial field data and projects in Europe that integrate data from various sources using shared semantics, standards, and services. The presentation encourages partnerships to advance open data goals through working groups addressing specific challenges.
This document provides an introduction to data mining concepts and techniques. It discusses why data mining has become important due to the massive growth of data from various sources. Data mining involves knowledge discovery from large datasets using techniques from machine learning, statistics, pattern recognition and databases. The document outlines common data mining tasks like classification, regression, clustering and discusses applications in domains like fraud detection, customer churn prediction, and sky survey cataloging.
Connecting USDA and NSF Terrestrial Observation Network to Science Policy (Brian Wee)
Large-scale environmental changes pose challenges that straddle environmental, economic, and social boundaries. As we design and implement climate adaptation strategies at the Federal, state, local, and tribal levels, accessible and usable data are essential for implementing actions that are informed by the best available information. Data-intensive science has been heralded as an enabler for scientific breakthroughs powered by advanced computing capabilities and interoperable data systems. Those same capabilities can be applied to data and information systems that facilitate the transformation of data into highly processed products.
At the interface of scientifically informed public policy and data-intensive science lies the potential for producers of credible, integrated, multi-scalar environmental data, such as the National Ecological Observatory Network (NEON) and its partners, to capitalize on data and informatics interoperability initiatives that enable the integration of environmental data from across credible data sources. NEON is designed to provide high-quality, long-term environmental data for research. These data are also meant to be repurposed for operational needs such as risk management, vulnerability assessments, and resource management. The proposed USDA Agricultural Research Service (ARS) Long Term Agro-ecosystem Research (LTAR) network is another example of such an environmental observatory that will produce credible data for environmental/agricultural forecasting and informing policy.
To facilitate data fusion across observatories like NEON and LTAR, there is a growing call for observation systems to more closely coordinate and standardize how variables are measured. Together with observation standards, cyberinfrastructure standards enable the proliferation of an ecosystem of applications that utilize diverse, high-quality, credible data. Interoperability facilitates the integration of data from multiple credible sources and enables the repurposing of data for use at different geographical scales. Metadata that captures the transformation of data into value-added products (“provenance”) lends reproducibility and transparency to the entire process. This way, the datasets and model code used to create any product can be examined by other parties.
This poster outlines a pathway for transforming environmental data into value-added products by various stakeholders to better inform sustainable agriculture using data from environmental observatories including NEON and LTAR.
Depression Detection in Tweets using Logistic Regression Model (ijtsrd)
In today's modernized world, mental health issues such as depression, anxiety, and stress are increasingly common, and social media platforms such as Facebook, Instagram, and Twitter have amplified them. Everything has its merits and drawbacks. During the pandemic, people became more likely to suffer from mental health issues: they are online 24/7 and cut off from the real world. Past studies have shown that individuals who spend more time on social media are more likely to be depressed. In this project, we identify people who are depressed based on their tweets, followers, following, and many other factors. For this, we trained and tested a text classifier that distinguishes between users who are depressed and those who are not. Rahul Kumar Sharma | Vijayakumar A "Depression Detection in Tweets using Logistic Regression Model" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5 | Issue-4, June 2021, URL: https://www.ijtsrd.com/papers/ijtsrd41284.pdf Paper URL: https://www.ijtsrd.com/computer-science/data-miining/41284/depression-detection-in-tweets-using-logistic-regression-model/rahul-kumar-sharma
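The kind of classifier described, logistic regression over word features, can be sketched in a few lines. The example tweets and labels below are invented for illustration and are not the paper's dataset or code; real work would use a labeled corpus and a proper feature pipeline (TF-IDF, held-out evaluation, etc.).

```python
import numpy as np

# Toy corpus: 1 = depressed-sounding, 0 = not (invented examples)
texts = ["i feel hopeless and sad", "so tired of everything sad",
         "great day with friends", "happy and excited today",
         "feeling empty and alone", "loved the sunny weather"]
labels = np.array([1, 1, 0, 0, 1, 0])

# Bag-of-words features: one count column per vocabulary word
vocab = sorted({w for t in texts for w in t.split()})
X = np.array([[t.split().count(w) for w in vocab] for t in texts], float)

# Logistic regression fitted by plain gradient descent on the log-loss
w = np.zeros(X.shape[1])
b = 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))        # sigmoid of linear score
    w -= 0.5 * X.T @ (p - labels) / len(labels)
    b -= 0.5 * (p - labels).mean()

pred = (1 / (1 + np.exp(-(X @ w + b))) > 0.5).astype(int)
print(pred.tolist())   # recovers the labels on this tiny training set
```

Words like "sad" or "alone" end up with positive weights, which is also what makes such a model inspectable: the learned coefficients show which terms drive a "depressed" prediction.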
The document discusses computational social science and the interactions-based approach taken by physicists to study collective phenomena emerging from interactions between individuals in complex socio-technological systems. It provides examples of case studies on networks and cooperation, experiments on non-human primates to study hierarchy and cooperation, and efforts to classify human behaviors into phenotypes based on actions in social dilemmas.
Mpict cloud computing and ict workforce 20110106 v8 (ISSIP)
The document discusses emerging trends in information and communication technologies (ICT) and their implications. It notes that ICT is becoming pervasive and networked, with tremendous impact on society, the ICT workforce, and technical education. It argues that demand will increase for local ICT talent with broader skill sets that combine both depth and breadth of knowledge across disciplines and systems.
Cultivation of Crops using Machine Learning and Deep Learning (YogeshIJTSRD)
To assist with the entire farming operation, we use cutting-edge machine learning and deep learning technologies, helping farmers make informed decisions about their area's conditions, the factors that influence their crops, and how to protect them for a good yield. With the rise of big data technology and high-performance computing, machine learning has opened up new possibilities for data-intensive research in the multidisciplinary agri-technology domain: (a) plant disease forecasting, (b) fertilizer recommendation, and (c) crop recommendation. The papers presented have been filtered and classified to show how machine learning can support agriculture. By applying machine learning to sensor data, farm management systems are evolving into real-time, AI-powered programmes that provide rich suggestions and insights for farmer decision support and action. Ms. A. Benazir Begum | Ajith Manoj | Nithya E | Anamika S S | Sneshna "Cultivation of Crops using Machine Learning and Deep Learning" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5 | Issue-3, April 2021, URL: https://www.ijtsrd.com/papers/ijtsrd39891.pdf Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/39891/cultivation-of-crops-using-machine-learning-and-deep-learning/ms-a-benazir-begum
Large datasets are not available for some diseases, such as brain tumors. This presentation and part 2 show how to find an actionable solution from a difficult cancer dataset.
The document summarizes Anita de Waard's presentation on Elsevier's experiments with big and small data. It discusses Elsevier's work with text mining and knowledge graphs to extract information from over 14 million articles. It also describes Elsevier's Medical Graph which predicts the probability of over 2,000 medical conditions occurring based on analysis of clinical data from 6 million patients. Finally, it reviews Elsevier's various tools and services to help researchers preserve, process, share, comprehend, access, and discover research data and publications.
Slides from ICWSM'17 workshop on Social Media for Demographic Research (Montreal, May 2017)
Overview of demography
How can demographers contribute to the analysis of big data (social media)? How can social media contribute to population studies?
Concerns over data quality.
Data Revolution and the SDGs: overview and value, huge challenges for attaining an economic-demographic-environment balance, and the urgent need for data scientists and demographers to work on these issues.
This document discusses challenges and opportunities for discovering and documenting biodiversity in the current information age. It argues that current taxonomic processes are too slow and that new approaches are needed to integrate distributed data sources and leverage community sourcing. Specifically, it advocates for:
1) Publishing new biodiversity data prior to formal documentation to accelerate discovery.
2) Developing automated workflows and online workspaces to integrate phylogenetic, distribution, and trait data.
3) Enabling community participation in annotating and improving global biodiversity models and maps.
4) Changing incentives to value data sharing over individual "kudos" and prioritize the collective good of the scientific community.
This document provides an overview of machine learning, data mining, and knowledge discovery. It discusses how technological advances have led to an explosion in the amount of data being generated. It then describes several common applications of data mining in business and science. Finally, it outlines some major data mining tasks like classification, clustering, and association rule mining.
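One of the tasks listed, association rule mining, rests on two simple measures that can be computed directly: the support of an itemset (how often it occurs) and the confidence of a rule (how often the consequent follows the antecedent). The transactions below are invented for illustration.

```python
# Toy market-basket transactions (invented); real mining would scan a
# purchase log and enumerate candidate itemsets, e.g. with Apriori.
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk", "butter"},
                {"bread", "milk", "butter"}]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Confidence of the rule lhs -> rhs: support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

print(round(support({"bread", "milk"}), 2))        # 3 of 5 baskets -> 0.6
print(round(confidence({"bread"}, {"milk"}), 2))   # 3 of 4 bread baskets -> 0.75
```

Algorithms like Apriori only add an efficient way to enumerate itemsets whose support exceeds a threshold; the measures themselves are exactly these ratios.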
This document discusses a study that examines the genetic basis of mouse mandible shape using 3D phenotyping and landmarks. The study aims to validate and improve upon previous QTL mapping studies of mouse mandible shape by applying 3D micro-CT imaging, 3D landmarks, and geometric morphometrics. The study compares results using different landmark configurations, including 2D versus 3D landmarks and manual landmarks versus semilandmarks. It finds that using a large set of semilandmarks coupled with manual landmarks identifies significantly more QTLs and maps them more precisely, suggesting that finer phenotypic characterization with 3D landmarks yields better insights into mandibular genetic architecture. However, most variation is still embedded in the natural 2D plane of the mandible.
Exposome data challenge - ISGlobal hub prez July 2022.pptx (LeaMaitre1)
The document summarizes an exposome data challenge event organized by ISGlobal. The event aimed to promote open science and interdisciplinary collaboration around analyzing exposome data. Participants were given a simulated exposome dataset based on real data from the HELIX project and asked to apply their statistical methods to analyze the data. Twenty-five teams were selected to present their approaches at the event. The goal was to accelerate innovation in exposome research through this collaborative data analysis challenge.
Thematic appreciation test is a psychological assessment tool used to measure an individual's appreciation and understanding of specific themes or topics. This test helps to evaluate an individual's ability to connect different ideas and concepts within a given theme, as well as their overall comprehension and interpretation skills. The results of the test can provide valuable insights into an individual's cognitive abilities, creativity, and critical thinking skills.
The document summarizes an exposome data challenge event organized by ISGlobal. The event aimed to promote open science and interdisciplinary collaboration around analyzing exposome data. Participants were given a simulated exposome dataset based on real data from the HELIX project and asked to apply their statistical methods to analyze the data. Twenty-five teams were selected to present their approaches at the event. The goal was to accelerate innovation in exposome research through this collaborative data analysis challenge.
Similar to Ciência de Dados: definição, desafios de modelagem e aplicações multidisciplinares (20)
hematic appreciation test is a psychological assessment tool used to measure an individual's appreciation and understanding of specific themes or topics. This test helps to evaluate an individual's ability to connect different ideas and concepts within a given theme, as well as their overall comprehension and interpretation skills. The results of the test can provide valuable insights into an individual's cognitive abilities, creativity, and critical thinking skills
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...AbdullaAlAsif1
The pygmy halfbeak Dermogenys colletei, is known for its viviparous nature, this presents an intriguing case of relatively low fecundity, raising questions about potential compensatory reproductive strategies employed by this species. Our study delves into the examination of fecundity and the Gonadosomatic Index (GSI) in the Pygmy Halfbeak, D. colletei (Meisner, 2001), an intriguing viviparous fish indigenous to Sarawak, Borneo. We hypothesize that the Pygmy halfbeak, D. colletei, may exhibit unique reproductive adaptations to offset its low fecundity, thus enhancing its survival and fitness. To address this, we conducted a comprehensive study utilizing 28 mature female specimens of D. colletei, carefully measuring fecundity and GSI to shed light on the reproductive adaptations of this species. Our findings reveal that D. colletei indeed exhibits low fecundity, with a mean of 16.76 ± 2.01, and a mean GSI of 12.83 ± 1.27, providing crucial insights into the reproductive mechanisms at play in this species. These results underscore the existence of unique reproductive strategies in D. colletei, enabling its adaptation and persistence in Borneo's diverse aquatic ecosystems, and call for further ecological research to elucidate these mechanisms. This study lends to a better understanding of viviparous fish in Borneo and contributes to the broader field of aquatic ecology, enhancing our knowledge of species adaptations to unique ecological challenges.
Or: Beyond linear.
Abstract: Equivariant neural networks are neural networks that incorporate symmetries. The nonlinear activation functions in these networks result in interesting nonlinear equivariant maps between simple representations, and motivate the key player of this talk: piecewise linear representation theory.
Disclaimer: No one is perfect, so please mind that there might be mistakes and typos.
dtubbenhauer@gmail.com
Corrected slides: dtubbenhauer.com/talks.html
Authoring a personal GPT for your research and practice: How we created the Q...Leonel Morgado
Thematic analysis in qualitative research is a time-consuming and systematic task, typically done using teams. Team members must ground their activities on common understandings of the major concepts underlying the thematic analysis, and define criteria for its development. However, conceptual misunderstandings, equivocations, and lack of adherence to criteria are challenges to the quality and speed of this process. Given the distributed and uncertain nature of this process, we wondered if the tasks in thematic analysis could be supported by readily available artificial intelligence chatbots. Our early efforts point to potential benefits: not just saving time in the coding process but better adherence to criteria and grounding, by increasing triangulation between humans and artificial intelligence. This tutorial will provide a description and demonstration of the process we followed, as two academic researchers, to develop a custom ChatGPT to assist with qualitative coding in the thematic data analysis process of immersive learning accounts in a survey of the academic literature: QUAL-E Immersive Learning Thematic Analysis Helper. In the hands-on time, participants will try out QUAL-E and develop their ideas for their own qualitative coding ChatGPT. Participants that have the paid ChatGPT Plus subscription can create a draft of their assistants. The organizers will provide course materials and slide deck that participants will be able to utilize to continue development of their custom GPT. The paid subscription to ChatGPT Plus is not required to participate in this workshop, just for trying out personal GPTs during it.
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...University of Maribor
Slides from talk:
Aleš Zamuda: Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems.
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Inter-Society Networking Panel GRSS/MTT-S/CIS Panel Session: Promoting Connection and Cooperation
https://www.etran.rs/2024/en/home-english/
The cost of acquiring information by natural selectionCarl Bergstrom
This is a short talk that I gave at the Banff International Research Station workshop on Modeling and Theory in Population Biology. The idea is to try to understand how the burden of natural selection relates to the amount of information that selection puts into the genome.
It's based on the first part of this research paper:
The cost of information acquisition by natural selection
Ryan Seamus McGee, Olivia Kosterlitz, Artem Kaznatcheev, Benjamin Kerr, Carl T. Bergstrom
bioRxiv 2022.07.02.498577; doi: https://doi.org/10.1101/2022.07.02.498577
ESPP presentation to EU Waste Water Network, 4th June 2024 “EU policies driving nutrient removal and recycling
and the revised UWWTD (Urban Waste Water Treatment Directive)”
ESR spectroscopy in liquid food and beverages.pptxPRIYANKA PATEL
With increasing population, people need to rely on packaged food stuffs. Packaging of food materials requires the preservation of food. There are various methods for the treatment of food to preserve them and irradiation treatment of food is one of them. It is the most common and the most harmless method for the food preservation as it does not alter the necessary micronutrients of food materials. Although irradiated food doesn’t cause any harm to the human health but still the quality assessment of food is required to provide consumers with necessary information about the food. ESR spectroscopy is the most sophisticated way to investigate the quality of the food and the free radicals induced during the processing of the food. ESR spin trapping technique is useful for the detection of highly unstable radicals in the food. The antioxidant capability of liquid food and beverages in mainly performed by spin trapping technique.
The debris of the ‘last major merger’ is dynamically youngSérgio Sacani
The Milky Way’s (MW) inner stellar halo contains an [Fe/H]-rich component with highly eccentric orbits, often referred to as the
‘last major merger.’ Hypotheses for the origin of this component include Gaia-Sausage/Enceladus (GSE), where the progenitor
collided with the MW proto-disc 8–11 Gyr ago, and the Virgo Radial Merger (VRM), where the progenitor collided with the
MW disc within the last 3 Gyr. These two scenarios make different predictions about observable structure in local phase space,
because the morphology of debris depends on how long it has had to phase mix. The recently identified phase-space folds in Gaia
DR3 have positive caustic velocities, making them fundamentally different than the phase-mixed chevrons found in simulations
at late times. Roughly 20 per cent of the stars in the prograde local stellar halo are associated with the observed caustics. Based
on a simple phase-mixing model, the observed number of caustics are consistent with a merger that occurred 1–2 Gyr ago.
We also compare the observed phase-space distribution to FIRE-2 Latte simulations of GSE-like mergers, using a quantitative
measurement of phase mixing (2D causticality). The observed local phase-space distribution best matches the simulated data
1–2 Gyr after collision, and certainly not later than 3 Gyr. This is further evidence that the progenitor of the ‘last major merger’
did not collide with the MW proto-disc at early times, as is thought for the GSE, but instead collided with the MW disc within
the last few Gyr, consistent with the body of work surrounding the VRM.
2. Agenda
● Data Science – history and definitions
● Data deluge and the information economy
● The spectrum of structure and data models
– Tabular data
– Graphs
– Text
● Examples of multi-model, multi-disciplinary research
3. About the Speaker
● Professor of Databases/Data Science – UTFPR
● Research interests: Big Data, NLP, IR, complex networks, ML, privacy
● Computer Science graduate, "frustrated biologist"
5. A World of Data
● Likes on social networks
● Web pages
● Student grades
● Instagram photos
● Pokémon locations
● Television signals
● Checking account balances
● Products for sale
● Satellite images
● Medical exams
● Methane-level measurements in the atmosphere of Mars
● Telemetry from an F1 car
8. Data Science – History
● Science has always been Data Science
● Tycho Brahe (1546-1601) and Johannes Kepler (1571-1630) discovered the laws of planetary motion by collecting and analysing a large volume of observational data
● What has changed now:
– The amount of data being generated
– The new methods and technologies for analysis
– Our society's dependency on the generated knowledge
11. Computers
● Digital production and processing of data
● Database Management Systems (DBMSs)
● Data analysis limited to large corporations
12. Internet, Cellphones, Sensors, Data Storage...
● Fast and cheap communication for everyone
● Massive data production and consumption
● Commercial drive for new data management technologies
● Data-driven economy
● Data-driven science
14. Information Deluge
● 1 billion users connected to Facebook (23/08/2015)
● 2 billion smartphones in the world, 1 billion websites
● 300 hours of video uploaded to YouTube every minute
● Google, Amazon, Microsoft and Facebook = 1,200 petabytes = 1,200,000,000,000,000,000 bytes = 5 stacks of CDs reaching the International Space Station
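The CD-stack comparison can be sanity-checked with a back-of-the-envelope calculation. The per-CD figures below (700 MB capacity, 1.2 mm thickness) and the ISS altitude (~400 km) are assumed values for illustration, not numbers from the slides:

```python
# Rough check of the "5 CD stacks up to the ISS" claim.
total_bytes = 1_200 * 10**15      # 1,200 petabytes
cd_bytes = 700 * 10**6            # assumed ~700 MB per CD
cd_thickness_m = 1.2e-3           # assumed ~1.2 mm per CD
iss_altitude_m = 400e3            # ISS orbits at roughly 400 km

n_cds = total_bytes / cd_bytes
stack_height_m = n_cds * cd_thickness_m
print(f"stacks to the ISS: {stack_height_m / iss_altitude_m:.1f}")
```

Under these assumptions the stack works out to roughly five ISS altitudes, consistent with the slide's figure.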
17. Big Data
● Data sets that are so large or complex that traditional data processing applications are inadequate
● Challenges: analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy
● Predictive analytics, user behavior analytics
18. Information Economy
● "The world's most valuable resource is no longer oil, but data"
● Data companies are the most valuable listed firms in the world
● The nature of data makes the antitrust remedies of the past less useful
(The Economist, May 6th 2017)
20. The Fourth Paradigm
● A thousand years ago: science was empirical, describing natural phenomena
● Last few hundred years: a theoretical branch, using models and generalizations
● Last few decades: a computational branch, simulating complex phenomena
● Today: data exploration (eScience), unifying theory, experiment, and simulation
(Jim Gray, 2007)
22. Data Science
Data Science is an interdisciplinary field about scientific methods, processes and systems to extract knowledge or insights from data in various forms [1].
Data science is a "concept to unify statistics, data analysis and their related methods" in order to "understand and analyze actual phenomena" with data [2].
[1] Dhar, V. (2013). "Data Science and Prediction"
[2] Hayashi, Chikio (1998). "What is Data Science?"
23. Data Science vs. Statistics
Dictionary definitions of statistical inference tend to equate it with the entire discipline. This has become less satisfactory in the "big data" era of immense computer-based processing algorithms. [...] Very broadly speaking, algorithms are what statisticians do while inference says why they do them. A particularly energetic brand of the statistical enterprise has flourished in the new century, data science, emphasizing algorithmic thinking rather than its inferential justification.
26. Tables
Red List of Threatened Species (International Union for Conservation of Nature)
27. On the Ecology of Human Carnivory
Zulmira Coimbra (advisor: Fernando Fernandez)
43. Text
Major threats to the species include cattle grazing, agriculture activities and mining activities throughout its range. A museum specimen collected from Reserva Forestal de Yotoco in 1996 tested positive for Batrachochytrium dendrobatidis (Velasquez et al. 2007). The presence of chytrid in this species in 1996 is consistent with the timing of the declines observed in the Yotoco subpopulations at the end of the 1990s, as well as the timing of other Bd declines in montane Andean species, suggesting it as a plausible, but unconfirmed cause. However, the species can still be found within Reserva Forestal de Yotoco (Velasquez et al. 2007).
44. NLP – Levels of Representation
Words
Morphology
Syntax
Explicit Semantics
Full Semantics
Higher levels of representation require the lower ones.
49. Research
● Analysis of the association between in-class social networks and academic performance
● Goal: understand how a student's circle of friends may influence their grades
50. Case 1: In-Class Social Networks and Academic Performance
Inputs: class social graph, grades spreadsheet
51. Results – Classes
● Average number of connections × average final grade: correlation 0.75 (p = 0.087)
● Average clustering × average assignment grade: correlation 0.64 (p = 0.167)
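Correlations with p-values like the ones on this slide can be computed with SciPy. The per-class averages below are toy values for illustration only, not the study's data:

```python
from scipy.stats import pearsonr

# Hypothetical per-class averages (toy data, not the study's):
# average number of connections and average final grade for six classes.
avg_connections = [3.2, 4.1, 5.0, 2.8, 4.6, 5.4]
avg_final_grade = [6.1, 6.8, 7.5, 5.9, 7.0, 7.9]

# pearsonr returns the correlation coefficient and a two-sided p-value.
r, p = pearsonr(avg_connections, avg_final_grade)
print(f"correlation = {r:.2f}, p-value = {p:.3f}")
```

With only a handful of classes, as in the slide, even a strong correlation can carry a p-value above 0.05, which is why both numbers are reported.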
56. Student Improvement
Students whose friends performed poorly on Exam 1 improved by only 0.5 points on average on Exam 2 (p < 0.01). Students whose friends performed well, in contrast, improved by 1.9 points on average, almost a 4-fold gain compared with the other group. This suggests that having friends with good academic performance has a direct impact on a student's grades.
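A group comparison of this kind is typically backed by a two-sample t-test. A minimal sketch with made-up improvement scores (not the study's data) follows:

```python
from scipy.stats import ttest_ind

# Toy Exam 2 minus Exam 1 improvements (hypothetical, for illustration):
# one group whose friends did poorly on Exam 1, one whose friends did well.
friends_poor = [0.2, 0.8, 0.4, 0.6, 0.5, 0.3, 0.7]   # mean = 0.5
friends_well = [1.5, 2.2, 1.8, 2.0, 1.7, 2.3, 1.8]   # mean = 1.9

# Independent two-sample t-test on the two groups' improvements.
t, p = ttest_ind(friends_well, friends_poor)
print(f"t = {t:.2f}, p = {p:.5f}")
```

A small p-value here indicates the difference in mean improvement between the two groups is unlikely to be due to chance.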
57. Results – Students
● Significant correlation between students' final grades and both eigenvector centrality (correlation: 0.48) and average neighbor degree (correlation: 0.40), suggesting the importance of the network topology
● Weak negative correlation between grades and betweenness centrality
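The network measures mentioned on this slide can be computed with networkx. The friendship graph and grades below are a toy example (hypothetical students), only meant to show the mechanics:

```python
import networkx as nx
from scipy.stats import pearsonr

# Toy in-class friendship network with hypothetical grades.
G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
                  ("D", "E"), ("E", "F"), ("B", "D")])
grades = {"A": 7.5, "B": 8.0, "C": 9.0, "D": 8.5, "E": 6.0, "F": 5.0}

# Eigenvector centrality: being connected to well-connected peers.
centrality = nx.eigenvector_centrality(G)
# Average degree of each student's neighbors.
avg_neigh_deg = nx.average_neighbor_degree(G)

students = sorted(G.nodes())
grade_list = [grades[s] for s in students]
r_c, _ = pearsonr([centrality[s] for s in students], grade_list)
r_d, _ = pearsonr([avg_neigh_deg[s] for s in students], grade_list)
print(f"grade vs eigenvector centrality: r = {r_c:.2f}")
print(f"grade vs average neighbor degree: r = {r_d:.2f}")
```

The study then interprets such correlations as evidence that where a student sits in the network topology relates to academic performance.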
59. Research
● Analysis of networks of entities cited in fake news
● Goal: understand how entities are mentioned and related in fake news, and how their prevalence correlates with political events
60. Text → Graph
"...Moro investigates Lula in the Lava-Jato operation..."
"...Lula meets with Dilma to discuss..."
Entities extracted from the sentences (Moro, Lula, Lava-Jato, Dilma) become nodes of the graph, and co-mentions become edges.
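A minimal sketch of this text-to-graph step follows. It assumes a fixed entity list for simplicity; a real pipeline would use named-entity recognition instead:

```python
import itertools
import networkx as nx

# Assumed entity list (a real system would run NER over the corpus).
ENTITIES = {"Moro", "Lula", "Lava-Jato", "Dilma"}
sentences = [
    "Moro investigates Lula in the Lava-Jato operation",
    "Lula meets with Dilma to discuss the campaign",
]

# Entities co-mentioned in the same sentence become connected nodes;
# edge weights count how often the pair is co-mentioned.
G = nx.Graph()
for sentence in sentences:
    found = [e for e in ENTITIES if e in sentence]
    for a, b in itertools.combinations(found, 2):
        w = G.get_edge_data(a, b, default={"weight": 0})["weight"]
        G.add_edge(a, b, weight=w + 1)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```

On these two sentences the first yields the triangle Moro, Lula, Lava-Jato and the second adds the edge Lula to Dilma.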
63. Topic Identification
● Graph clustering algorithm (modularity)
● Clusters represent frequently co-cited entities
● Clusters are used as topic representatives
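Modularity-based clustering of an entity graph can be sketched with networkx's greedy modularity algorithm. The edges below are toy co-citation data; the node names "Temer" and "Impeachment" are added here purely for illustration:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy co-citation graph: two tightly knit entity groups joined by one edge.
G = nx.Graph([("Moro", "Lula"), ("Moro", "Lava-Jato"), ("Lula", "Lava-Jato"),
              ("Dilma", "Temer"), ("Dilma", "Impeachment"),
              ("Temer", "Impeachment"), ("Lula", "Dilma")])

# Greedy modularity maximization (Clauset-Newman-Moore) groups nodes
# that are more densely connected to each other than to the rest.
communities = list(greedy_modularity_communities(G))
for i, community in enumerate(communities):
    print(f"topic {i}: {sorted(community)}")
```

Each resulting community is a set of frequently co-cited entities, which the talk then treats as a topic representative.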
68. Work in Progress
● Understand the intentional use of metaphors in fake news
● Use the text in the IUCN table for automatic classification of threats
● Compare the evolution of fake-news topics with political events
● Study the formation of student groups
● Assess the impact of endangered species on the trophic network
69. The "Ciência de Dados por uma Causa" (Data Science for a Cause) Project
Page: http://dainf.ct.utfpr.edu.br/umacausa
Facebook: https://fb.me/cienciadadoscausa/