From Data Mining to
Big Data & Data Science:
a Computational Perspective
Miquel Sànchez-Marrè
Knowledge Engineering & Machine Learning Group (KEMLG)
Dept. de Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya-BarcelonaTech
miquel@lsi.upc.edu
http://www.lsi.upc.edu/~miquel
19 July 2013
Content
 Introduction
 Knowledge Discovery in Databases
 Data Pre-processing
 Data Mining Techniques: Statistical & ML methods
 Data Post-Processing
 Big Data / Data Science
 Social Mining
 Scaling from Data Mining to Big Mining
 Computational Techniques
 Examples of Extrapolation of classical DM Techniques
 A look at Mahout
 Big Data Tools & Resources
 Big Data Trends
INTRODUCTION
Complex Real-world Systems/Domains
 They exist in the everyday life of human beings and are typically very complex to understand, analyse, manage or solve
 They involve several very complex and difficult decision-making tasks, which are usually handled by human experts
 In addition, some of them can have catastrophic consequences for human beings, for the environment, or for the economy of an organization
 Examples
 Environmental Systems/Domains
 Medical Systems/Domains
 Industrial Process Management Systems/Domains
 Business Administration & Management Systems/Domains
 Marketing
 Decisions on products and prices
 Decisions on human resources
 Decisions on strategies and company policy
 Internet
Need for Decision Making Support Tools
 Complexity of the decision making process
 Accurate Evaluation of multiple alternatives
 Need for forecasting capabilities
 Uncertainty
 Data Analysis and Exploitation
 Need for including experience and expertise (knowledge)
 Computational tools: Decision Support Systems (DSS)
 Intelligent Computational Tools: Intelligent Decision
Support Systems (IDSS)
Management Information Systems
[Diagram: MANAGEMENT INFORMATION SYSTEMS split into TRANSACTION PROCESSING SYSTEMS and DECISION SUPPORT SYSTEMS (DSS); DSS split into MODEL-DRIVEN DSS (top-down approach / confirmative models) and DATA-DRIVEN DSS (bottom-up approach / explorative models); AUTOMATIC MODEL-DRIVEN DSS, built with statistical / numerical methods or artificial intelligence / machine learning methods, lead to INTELLIGENT DECISION SUPPORT SYSTEMS (IDSS).]
Intelligent Decision Support Systems (IDSS)
[Sànchez-Marrè et al., 2000]
[Architecture diagram: on-line data from system/process sensors, off-line data and subjective information/analyses, and spatial/geographical data (GIS) feed a temporal database; successive layers perform data interpretation, data mining (knowledge acquisition / learning, producing AI, statistical and numerical models), diagnosis (reasoning / models' integration) and decision support (planning / prediction / supervision, alternatives evaluation, explanation), taking into account economical costs and laws & regulations; background/subjective knowledge enters the loop, the user interacts through a user interface, and the chosen decision/strategy is applied to the system through on-line / off-line actuators, with feedback closing the loop.]
Why is Data Mining important?
From observations to decisions
[Adapted from A.D. Witakker, 1993]
[Chart: as observations (data) are refined into interpretations (predictions, consequences), understanding (knowledge) and recommendations (decisions), the quantity of information decreases while its value and relevance to decisions increases.]
KNOWLEDGE DISCOVERY / DATA MINING
Intelligent Data Analysis
KDD Concepts
 KDD
 Knowledge Discovery in Databases is the non-trivial
process of identifying valid, novel, potentially useful, and
ultimately understandable patterns in data [Fayyad et al.,
1996]
 KDD is a multi-step process
 Data Mining
 Data Mining is a step in the KDD process consisting of the application of particular data mining algorithms that, under some acceptable computational efficiency limitations, produce a particular set of patterns over a set of examples, cases or data. [Fayyad et al., 1996]
The KDD process
[Fayyad, Piatetsky-Shapiro & Smyth, 1996]
Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Transformation) → Transformed Data → (Data Mining) → Patterns / Models → (Interpretation / Evaluation) → Knowledge
Terminology
Data Bases    Artificial Intelligence    Statistics
Table         Data Matrix                Data
Register      Instance/Example           Observation/Individual
Field         Attribute/Feature          Variable
Value         Value                      Data value
DATA PRE-PROCESSING
Problems with data
 Huge amount of data
 Corrupted Data
 Noisy Data
 Irrelevant Data / Attribute Relevance
 Attribute Extraction
 Numeric and Symbolic Data
 Scarce Data
 Missing Attributes
 Missing Values
 Fractioned Data
 Incompatible Data
 Different Source Data
 Different Granularity Data
Preparing Data
 Data Transformation
 Data Filtering
 Data Sorting
 Data Editing
 Noise Modelling
 Obtaining New Information
 Visualization
 Removal
 Data Selection
 Sampling
 New Information Generation
 Data Engineering
 Data Fusion
 Time Series Analysis
 Constructive Induction
Data Cleaning
 Duplicate Data Removal
 Data Inconsistency Resolution
 Outlier Detection and Management
 Error Detection and Management
Coding
 Codify some fields and generate a code table
 Address to region
 Phone prefix to province code
 Birth date to age, and age to decade
 To normalize huge magnitudes: millions to thousands (account balance, credits, ...)
 Sex, married/single, ... transformed to binary
 Weekly, monthly information, ... to week number, month number, ...
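A minimal pandas sketch (not part of the original slides) of some of these codings: birth date to age and to decade, sex to binary, and rescaling a huge magnitude. The column names and the reference year 2013 are invented for the example.

import pandas as pd

# Invented toy table, only to illustrate the codings listed above
df = pd.DataFrame({
    "birth_date": ["1975-04-02", "1990-11-30", "1958-07-15"],
    "sex": ["F", "M", "F"],
    "balance_eur": [1250000, 83000, 4700000],
})

# Birth date to age (reference year 2013), and age to decade
age = 2013 - pd.to_datetime(df["birth_date"]).dt.year
df["age_decade"] = (age // 10) * 10

# Sex transformed to binary
df["sex_bin"] = (df["sex"] == "F").astype(int)

# Normalize huge magnitudes: euros to thousands of euros
df["balance_keur"] = df["balance_eur"] / 1000

print(df)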
Flattening
 Transform an attribute of cardinality n into n binary attributes, to get a unique register per element

Before:                     After:
DNI       hobby             DNI       Spo  Mus  Read
45464748  sport             45464748  1    1    1
45464748  music             50000001  0    1    0
45464748  reading
50000001  music
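A hedged pandas sketch of the same flattening on the toy DNI/hobby table above; pd.crosstab builds one binary column per hobby value and one register per DNI.

import pandas as pd

# Toy table from the slide: one row per (person, hobby) pair
df = pd.DataFrame({
    "DNI": [45464748, 45464748, 45464748, 50000001],
    "hobby": ["sport", "music", "reading", "music"],
})

# Flattening: one binary attribute per hobby value, one register per DNI
flat = pd.crosstab(df["DNI"], df["hobby"]).clip(upper=1)
print(flat)   # one row per DNI, one 0/1 column per hobby (music, reading, sport)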
Data Selection
 Mechanism to reduce the size of the data matrix, by
means of the following techniques:
 Instance Selection
 Feature Selection
 Feature Weighting (Feature Relevance Determination)
Instance Selection
 Goal: to reduce the number of examples
 Methods [Riaño, 1997]:
 Exhaustive methods
 Greedy methods
 Heuristic methods
 Convergent methods
 Probabilistic methods
Feature Selection
 Goal: to reduce the number of features (dimensionality)
 Categories of methods:
 Feature ranking
 Feature ranking methods rank the features by a metric and eliminate all features that do not achieve an adequate score
 Subset selection
 Subset selection methods search the set of possible features for the optimal subset
 Methods
 Wrapper methods (Wrappers): utilize the ML model of interest as a black box to score subsets of features according to their predictive power
 Filter methods (Filters): select subsets of features as a preprocessing step, independently of the chosen predictor
 Embedded methods: perform feature selection in the process of training and are usually specific to given ML techniques
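As an illustration of a filter method (an addition, not from the slides), a minimal scikit-learn sketch that ranks features by their mutual information with the class, independently of any predictor, and keeps the best ones; the data set is synthetic.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic data: 200 examples, 10 features, only 3 of them informative
X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)

# Filter method: score each feature by mutual information with the class and keep the 3 best
selector = SelectKBest(score_func=mutual_info_classif, k=3)
X_reduced = selector.fit_transform(X, y)

print("score per feature:", selector.scores_.round(3))
print("selected feature indices:", selector.get_support(indices=True))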
 Goal: to determine the weight or relevance of all features in
an automatic way
 Framework for Feature Weighting methods [Wettschereck et
al., 1997]:
Feature Weighting
Dimension Possible Value
Bias {Feedback, Preset}
Weights Space {Continuous, Binary}
Representation {Given, Transformed}
Generality {Global, Local}
Knowledge {Poor, Intensive}
 Supervised methods
 Filter methods
 Global methods
 Mutual Information algorithm
 Cross-Category Feature Importance
 Projection of Attributes
 Information Gain measure
 Class Value method
 Local methods
 Value-Difference Metric
 Per Category Feature Importance
 Class Distribution Weighting algorithm
 Flexible Weighting
 Entropy-Based Local Weighting
Feature Weighting / Attribute Relevance (1)
Feature Weighting / Attribute Relevance (2)
 Wrapper methods
 RELIEF
 Contextual Information
 Introspective Learning
 Diet Algorithm
 Genetic algorithms
 Unsupervised methods
 Gradient Descent
 Entropy-based Feature Ranking
 UEB-1
 UEB-2
DATA MINING TECHNIQUES
Data Mining Techniques
 Statistical Techniques
 Linear Models: simple regression, multiple regression
 Time Series Models (AR, MA, ARIMA)
 Principal Component Analysis (PCA) / Discriminant Analysis (DA)
 Artificial Intelligence Techniques
 Decision Trees
 Classification Rules
 Association Rules
 Clustering
 Instance-Based Learning (IBL, CBR)
 Connectionist Approach (Artificial Neural Networks)
 Evolutionary Computation (Genetic Algorithms, Genetic Programming)
 AI&Stats Techniques
 Regression Trees
 Model Trees
 Probabilistic/Belief/Bayesian Networks
Model Classification (1)
[Gibert & Sànchez-Marrè, 2007]
[Chart: KNOWLEDGE MODELS are divided into models without response variable (DESCRIPTIVE MODELS and ASSOCIATIVE MODELS) and models with response variable (DISCRIMINANT MODELS and PREDICTIVE MODELS).
Descriptive: (AI) conceptual clustering techniques; (Stats) statistical clustering techniques, Principal Components Analysis (PCA), Simple/Multiple Correspondence Analysis (SCA/MCA); (AI&Stats) hybrid techniques: Classification Based on Rules (ClBR).
Associative: (AI) association rules, model-based reasoning, qualitative reasoning; (AI&Stats) Bayesian / belief networks (BNs).
Discriminant: (AI) decision trees, classification rules, instance-based learning (IBL); (Stats) discriminant analysis, naive Bayes classifier; (AI&Stats) box-plot based induction rules.
Predictive: (AI) connectionist models (ANNs), evolutionary computation (GAs), collaborative resolution ("swarm intelligence"); (Stats) simple/multiple linear regression, variance analysis, time series models; (AI&Stats) regression trees / model trees.
Also shown: rule-based reasoning, Bayesian reasoning, case-based reasoning.]
Model Classification (2)
[Gibert & Sànchez-Marrè, 2007]
[Chart: UNCERTAINTY MODELS are divided into PROBABILISTIC MODELS ((Stats) pure probabilistic model; (AI&Stats) Bayesian network model [Pearl]), NEAR-PROBABILISTIC MODELS ((AI) certainty factor method [MYCIN]; (AI) subjective Bayesian method [PROSPECTOR]), the EVIDENTIAL MODEL ((AI) evidence theory [Dempster-Shafer]) and the POSSIBILISTIC MODEL ((AI) possibility theory / fuzzy logic [Zadeh]).]
Descriptive Models
Clustering Techniques
Clustering / Conceptual Clustering
   A    B     C
1  0.1  0.8   0.3
2  0.1  0.3   100
3  0.7  0.3   0.45
4  0.3  0.38  0.42

We group {1,2} and {3,4} even though d(1,2) > d(1,3):
{1,2} is described by A = 0.1
{3,4} is described by B ∈ [0.3, 0.38] & C ∈ [0.42, 0.45]
Clustering Techniques
 Clustering with K determined clusters
 K-means method
 Clustering through the neighbours
 Nearest-Neighbour method (NN method)
 Hierarchical Clustering
 Probabilistic/Fuzzy Clustering
 Clustering based on rules
K-means Method
Input: X = {x1, ..., xn} // Data to be clustered
k // Number of clusters
Output: C= {c1, ..., ck} // Cluster centroids
Function K-means
initialize C // random selection from X
while C has changed
For each xi in X
cl(xi) = argminj distance (xi, cj)
endfor
For each Cj in C
cj = centroid ({xi | cl(xi) = j})
endfor
endwhile
return C
End function
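A minimal NumPy sketch of the pseudocode above (an addition, not from the slides); in practice a library implementation such as scikit-learn's KMeans would normally be preferred.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]   # random selection from X
    for _ in range(max_iter):
        # Assignment step: cl(xi) = index of the closest centroid
        dists = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                          for j in range(k)])
        if np.allclose(new_C, C):    # C has not changed: stop
            break
        C = new_C
    return C, labels

X = np.random.default_rng(1).normal(size=(300, 2))
centroids, labels = kmeans(X, k=3)
print(centroids)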
K-means method
[Charts extracted from “Mahout in action”, Owen et al., 2011]
Ascendant hierarchical clustering
[Figure: a small data set of points a-h and the corresponding dendrogram, obtained by successively merging the closest points/groups.]
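A hedged SciPy sketch of agglomerative (ascendant) hierarchical clustering on synthetic points; linkage builds the merge tree that a dendrogram visualizes.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Synthetic 2-D points playing the role of the points a..h in the slide
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, size=(4, 2)), rng.normal(2.0, 0.2, size=(4, 2))])

# Agglomerative (bottom-up) clustering: successively merge the closest groups
Z = linkage(X, method="average")                    # Z encodes the merge tree
labels = fcluster(Z, t=2, criterion="maxclust")     # cut the tree into 2 clusters
print(labels)

# dendrogram(Z) draws the merge tree (requires matplotlib)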
Discriminant Models
K-NN & Decision trees
K-NN algorithm
Input: T = {t1, ..., tn} // Training data points available
 D = {d1, ..., dm} // Data points to be classified
 k // Number of neighbours
Output: a class label for each data point in D

Function K-NN
 Foreach data point d in D
  neighbours = ∅
  Foreach training data point t in T
   dist = distance(d, t)
   If |neighbours| < k then
    insert(t, neighbours)
   else
    fartn = argmax over x in neighbours of distance(d, x) // current farthest neighbour
    If dist < distance(d, fartn) then
     insert(t, neighbours)
     remove(fartn, neighbours)
    endif
   endif
  endfor
  class(d) = majority vote of the classes of the k nearest neighbours
 endfor
End function
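A hedged scikit-learn equivalent of the pseudocode above; the iris data set and the train/test split are only for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k nearest neighbours with majority vote, as in the pseudocode above
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))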
Decision trees
 The nodes are qualitative attributes
 The branches are the possible values of a qualitative attribute
 The leaves of the tree hold the qualitative prediction of the attribute that acts as the class label
 They model the process of deciding to which class a new example of the domain belongs
Decision Tree: example
[Figure: decision tree for the contact lens data]
ID3 algorithm
 ID3: Induction Decision Tree [Quinlan, 1979], [Quinlan, 1986]
 Machine Learning Technique
 Decision Tree Induction
 Top-Down strategy
 From a set of examples/instances and the class to which they belong, it builds up the best decision tree which explains the instances
Criteria:
 Compactness: the more compact, the better (lower complexity)
 Goodness: predictive / discriminant ability
ID3: basic idea
 Select at each step the attribute that discriminates the most.
 The selection is done through maximizing a certain
function G (X, A).
ID3: selection criteria
 Select the attribute $A_k$ which maximizes the information gain
$$G(X, A_k) = I(X, C) - E(X, A_k), \qquad E(X, A_k) \ge 0$$
where
$$I(X, C) = -\sum_{c_i \in C} p(X, c_i)\,\log_2 p(X, c_i) \qquad \text{(information)}$$
$$p(X, c_i) = \frac{\# C_i}{\# X} \qquad \text{(probability that one example belongs to class } c_i\text{)}$$
$$E(X, A_k) = \sum_{v_l \in V(A_k)} p(X, v_l)\; I\big(A_k^{-1}(v_l), C\big) \qquad \text{(entropy: expected information after splitting on } A_k\text{)}$$
$$p(X, v_l) = \frac{\# A_k^{-1}(v_l)}{\# X} \qquad \text{(probability that one example has the value } v_l \text{ for attribute } A_k\text{)}$$
ID3: algorithm
Function ID3 (in X, A are sets) returns decision tree is
 var tree1, tree2 are decision tree endvar
 option
  case (∃ Ci : ∀ xj ∈ X → xj ∈ Ci) do   // all examples belong to the same class
   tree1 ← buildTree(Ci)
  case not (∃ Ci : ∀ xj ∈ X → xj ∈ Ci) do
   option
    case A ≠ ∅ do
     Amax ← argmax over Ak ∈ A of G(X, Ak)
     tree1 ← buildTree(Amax)
     for each v ∈ V(Amax) do
      tree2 ← ID3(Amax^-1(v), A - {Amax})
      tree1 ← addBranch(tree1, tree2, v)
     endforeach
    case A = ∅ do
     tree1 ← buildTree(majorityClass(X))
   endoption
 endoption
 returns tree1
endfunction
ID3: example

(1) Training data:
    A  B  C  D
1.  1  1  5  -
2.  2  1  5  -
3.  2  1  7  +
4.  1  1  7  +
5.  2  2  5  -
6.  2  2  7  -
7.  1  2  7  -
8.  2  3  7  +

(2) Entropy of each candidate split (p = positive examples, n = negative examples, h(p,n) = entropy, e = expected entropy of the attribute):
       p  n  h(p,n)  e
A=1    1  2  0,6365  0,6593
A=2    2  3  0,6730
B=1    2  2  0,6931  0,3465
B=2    0  3  0,0000
B=3    1  0  0,0000
C=5    0  3  0,0000  0,4206
C=7    3  2  0,6730
B has the lowest expected entropy (highest gain), so B is placed at the root, with branches =1, =2 and =3.

(3) Recursion on the branch B=1 (examples 1-4), with B removed:
    A  C  D
1.  1  5  -
2.  2  5  -
3.  2  7  +
4.  1  7  +

(4) Entropies within the B=1 branch:
       p  n  h(p,n)  e
A=1    1  1  0,6931  0,6931
A=2    1  1  0,6931
C=5    0  2  0,0000  0,0000
C=7    2  0  0,0000
C is selected for the B=1 branch.

(5) Resulting decision tree:
B: =1 → C (=5 → -, =7 → +); =2 → -; =3 → +
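The h(p, n) values above use the natural logarithm. As a check (not in the original slides), this short Python script reproduces the expected-entropy column e for attributes A, B and C on the toy table.

import math
from collections import Counter

# Toy data from the slide: tuples (A, B, C, class)
data = [(1, 1, 5, '-'), (2, 1, 5, '-'), (2, 1, 7, '+'), (1, 1, 7, '+'),
        (2, 2, 5, '-'), (2, 2, 7, '-'), (1, 2, 7, '-'), (2, 3, 7, '+')]

def h(p, n):
    # Entropy (natural log) of a set with p positive and n negative examples
    total = p + n
    return -sum((c / total) * math.log(c / total) for c in (p, n) if c > 0)

def expected_entropy(rows, attr):
    # E(X, Ak): entropy expected after splitting `rows` on the attribute at index `attr`
    e = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r for r in rows if r[attr] == value]
        counts = Counter(r[3] for r in subset)
        e += len(subset) / len(rows) * h(counts['+'], counts['-'])
    return e

for name, idx in (("A", 0), ("B", 1), ("C", 2)):
    print(name, round(expected_entropy(data, idx), 4))
# prints A 0.6593, B 0.3466, C 0.4206 -> B has the minimum expected entropy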
Data Post-Processing
Post-processing Techniques (1)
 Post-processing techniques are devoted to transforming the direct results of the data mining step into directly understandable and useful knowledge for later decision-making
 The techniques can be summarized in the following tasks [Bruha and Famili, 2000]:
 Knowledge filtering: the knowledge induced by data-driven models normally needs to be filtered.
 Knowledge consistency checking: the new knowledge may also be checked for potential conflicts with previously induced knowledge.
 Interpretation and explanation: the mined knowledge model could be used directly for prediction, but it is very advisable to document, interpret and provide explanations for the knowledge discovered.
Post-processing Techniques (2)
 Visualization: visualization of the knowledge (Cox et al., 1997) is a very useful technique to gain a deeper understanding of the newly discovered knowledge.
 Knowledge integration: traditional decision-making systems have depended on a single technique, strategy or model. New sophisticated decision-support systems combine or refine results obtained from several models, usually produced by different methods. This process increases accuracy and the likelihood of success.
 Evaluation: after a learning system induces concept hypotheses (models) from the training set, their evaluation (or testing) should take place.
Interpretation / Result Evaluation
 Tables summarising data
 Spatial Representation of Data
 Graphical Visualization of Models/Patterns of
Knowledge
 Decision Trees
 Discovered Clusters
 Induced Rules
 Learned Bayesian Networks
Result Representation (1)
 Direct Data through tables
0.343689 10000 5879.9875
0.467910 2345 98724.935
0.873493 34 92235.9620
Result Representation (2)
 Features through histograms
[Figure: example histograms (vertical and horizontal bar charts) of a feature over five categories.]
Result Representation (3)
 Proportions through pie chart graphics
[Figure: example pie charts showing the proportions of five categories.]
Two-dimensional Representation
Three-dimensional Representation
Validation Methods for Discriminant and Predictive Models
 Assessment of predictive/discriminant abilities of models
 Training Examples: building of the model
 Test Examples: assessment of the accuracy and generalization ability of the model
 Methods and tools for rate estimation
 Simple Validations / Cross Validations
 Random Validations / Stratified Validations
Precision/Accuracy Types
 Global Precision/Accuracy of classification
 Precision/Accuracy of classification by modalities
 Confusion Matrix [General use]
 ROC (Receiver /Relative Operating Characteristic)
Curves [for binary classifiers]
 Gini Index
Simple Validation [Stratified] (1)
Stratified: the class distribution C1, ..., CM in the training set and in the test set follows the same distribution as the original whole data set
[Figure: the data set is split into a training part, from which the model is induced, and a test part on which it is tested.]
Simple Validation [Stratified] (2)
Cross Validation [Stratified]
The initial data set is split into N subsets
N = [3…10], defined by the user
The final accuracy rate is computed as the average of accuracy rates
obtained for each one of the subsets
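A hedged scikit-learn sketch of stratified N-fold cross-validation (here N = 5); the data set and the classifier are placeholders chosen only for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# N stratified folds: each fold keeps the class distribution of the whole data set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

print("accuracy per fold:", scores.round(3))
print("final accuracy (average over the folds):", scores.mean().round(3))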
Global Precision of classification
Global Error or Misclassification Rate
Global Accuracy or Success or Classification Rate
MODEL        MISCLASSIF. RATE   VALIDATION MISCLASSIF. RATE   TEST MISCLASSIF. RATE
CART Tree    0,2593856655       0,2974079127                  0,2909836066
K-NN MBR     0,2894197952       0,2974079127                  0,3155737705
Regression   0,3071672355       0,3383356071                  0,3770491803
RBF          0,3051194539       0,3246930423                  0,3360655738
Precision by modalities (1)
 Confusion Matrix
LOGISTIC REGRESSION        Predicted 0   Predicted 1
Observed 0                 48,02         10,91
Observed 1                 22,92         18,14

CART CLASSIFICATION TREE   Predicted 0   Predicted 1
Observed 0                 43,52         15,42
Observed 1                 14,32         26,74

Error Type I / False Negatives (observed 1, predicted 0)
Error Type II / False Positives (observed 0, predicted 1)
Precision by modalities (2)
 Confusion Matrix

RBF NEURAL NETWORK   Predicted 0   Predicted 1
Observed 0           47,34         11,60
Observed 1           20,87         20,19

K-NN MBR             Predicted 0   Predicted 1
Observed 0           41,34         17,60
Observed 1           12,14         28,92
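Taking the K-NN MBR matrix above as an example (the four cells sum to 100, so they are read as percentages of cases), the global rates can be recomputed directly; the resulting error, 0,2974, matches the K-NN MBR validation misclassification rate reported in the earlier table. This snippet is illustrative, not part of the slides.

# Confusion matrix for K-NN MBR (rows = observed, columns = predicted), in % of cases
tn, fp = 41.34, 17.60    # observed 0
fn, tp = 12.14, 28.92    # observed 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total        # global success / classification rate
error = (fp + fn) / total           # global misclassification rate
sensitivity = tp / (tp + fn)        # true positive rate
specificity = tn / (tn + fp)        # true negative rate

print(round(accuracy, 4), round(error, 4))            # 0.7026 0.2974
print(round(sensitivity, 4), round(specificity, 4))   # 0.7043 0.7014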
ROC Curves
Sensitivity = True Positive rate = 1 - prob(error type I)
versus
1 - specificity = False Positive rate = prob(error type II)
[Figure: ROC curve]
Performance Gini Index
 Surface between the ROC curve and the 45° bisector

             Logistic Regression   RBF      CART Tree   K-NN MBR
Gini index   0,4375                0,4230   0,4445      0,5673
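The area between the ROC curve and the diagonal equals AUC - 0.5, so the Gini index is commonly reported in the normalized form Gini = 2·AUC - 1. A hedged scikit-learn sketch with synthetic scores (not part of the slides):

import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic ground truth and classifier scores, for illustration only
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
scores = y_true + 0.8 * rng.normal(size=1000)   # higher scores for the positive class, with noise

auc = roc_auc_score(y_true, scores)
gini = 2 * auc - 1                  # twice the area between the ROC curve and the diagonal
print(round(auc, 3), round(gini, 3))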
BIG DATA
Big data / “Small” data / Data Science / Social Data
Big Data
 Big Data is a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
 The trend towards larger data sets is due to:
 Increase of storage capacities
 Increase of processing power
 Availability of data
 Additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data
 As of 2012, limits on the size of data sets that are feasible to process in a reasonable amount of time were on the order of exabytes of data.
Big Data
[Chart from Marko Grobelnik]
Big Data
 Problems
 Big data always gives an answer, but it may not make sense
 More data can imply more errors
 With enough data anything can be proved (statistics)
 Advantages
 More tolerant to errors
 Discover prototypical uncommon cases
 ML algorithms work better
 Features
 Data Volume
 Data in Motion (streaming data)
 Data diversity (only 20% data is related)
 Complexity
 Philosophy: “collect first, then think”
Big Data
 Big Data is similar to “Small Data”, but bigger …
 Managing bigger data requires different approaches:
 Techniques
 Tools
 Architectures
 The challenges include (new and old problems):
 Capture
 Curation
 Storage
 Search
 Sharing
 Transfer
 Analysis
 Visualization
Data Science
 Data science incorporates varying elements and builds on techniques and theories from many fields, including math, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high performance computing, with the goal of extracting meaning from data and creating data products.
Data Science
[Extracted from “Big Data Course” D. Kossmann & N. Tatbul, 2012]
 New approach to do science
 Step 1: collect data
 Step 2: generate hypotheses
 Step 3: Validate Hypotheses
 Step 4: Goto Step 1 or 2
 It can be automated: no thinking, less error
 How do you test without a ground truth?
Big Data Jungle
[Diagram: two axes, "small" data vs. big data and technical/scientific vs. social/business/companies. Normal datasets are handled by data mining; social network datasets by social network mining; massive datasets by massive dataset mining; massive social connection datasets by massive social mining.]
Scientific Data Mining vs Social Mining
[Diagram: classical data mining / data analysis works on individual data, produces averages, indices and types/clusters, and reasons about prototypes; social mining works on individual data and interactions, analyses social connections, reasons about individuals and produces personal recommendations.]
Social Mining
 Analyse the context and the social connections
 Analyse what you do
 Analyse where you go
 Analyse what you choose
 Analyse what you buy
 Analyse what you spend your time on
 Analyse who is influencing you
Social Network Analysis and Big Data
Influence
Discovering influence based on social relationships
[http://www.tweetStimuli.com, A. Tejeda 2012]
FROM DATA MINING TO MASSIVE DATASET MINING
How to scale from “Small” to Big Data?
 High Volume Data does not fit in Memory
 Slow Algorithm: parallelize the algorithm + slicing data (individuals or attributes)
 Non-slow Algorithm: slicing data (individuals or attributes)
 High Volume Data fits in Memory
 Slow Algorithm: parallelize the algorithm
 Non-slow Algorithm: classical Data Mining techniques
[Diagram: (distributed) data mining algorithms are needed when the data volume is high and/or the algorithm is slow.]
Solutions for Scaling
 Parallelizing an existing sequential algorithm
 Exploit the intrinsic parallelism
 Implies slicing data
 Using OpenMP, Java, C#, TBB, etc.
 Slicing Data
 Data is split into chunks
 Each chunk is processed by a thread or computation node
 Slicing individuals
 Slicing attributes
 Recombining the results
Parallel K-means Method
Input: X = {x1, ..., xn} // Data to be clustered
k // Number of clusters
Output: C= {c1, ..., ck} // Cluster centroids
Function K-means
Initialize C // random selection from X
While C has changed
For each xi in X
cl(xi) = argminj distance (xi, cj)
endfor
For each Cj in C
cj = centroid ({xi | cl(xi) = j})
endfor
Endwhile
return C
End function
Parallelize the loop !
Parallelize the loop !
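A hedged Python sketch of the "parallelize the loop" idea (an addition, not the original course material): the assignment step runs over slices of the individuals in a standard-library multiprocessing pool, and the update step recombines the partial results.

import numpy as np
from multiprocessing import Pool

def assign_chunk(args):
    # Assignment step for one slice of the individuals: nearest centroid per point
    chunk, C = args
    dists = np.linalg.norm(chunk[:, None, :] - C[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

def parallel_kmeans_iteration(X, C, n_workers=4):
    chunks = np.array_split(X, n_workers)            # slicing the individuals
    with Pool(n_workers) as pool:                    # parallelize the assignment loop
        labels = np.concatenate(pool.map(assign_chunk, [(chunk, C) for chunk in chunks]))
    # Update step (recombination): recompute each centroid from its assigned points
    new_C = np.array([X[labels == j].mean(axis=0) for j in range(len(C))])
    return new_C, labels

if __name__ == "__main__":           # required for multiprocessing on some platforms
    X = np.random.default_rng(0).normal(size=(10000, 2))
    C = X[:3].copy()                 # naive initialization, for illustration only
    for _ in range(10):
        C, labels = parallel_kmeans_iteration(X, C)
    print(C)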
Parallel K-NN algorithm
Input: T = {t1, ..., tn} // Training data points available
 D = {d1, ..., dm} // Data points to be classified
 k // Number of neighbours
Output: a class label for each data point in D

Function Parallel K-NN
 Foreach data point d in D                    // Parallelize this loop!
  neighbours = ∅
  Foreach training data point t in T          // Parallelize this loop too!
   dist = distance(d, t)                      // In high dimensions, the distance computation can also benefit from parallelism!
   If |neighbours| < k then
    insert(t, neighbours)
   else
    fartn = argmax over x in neighbours of distance(d, x)
    If dist < distance(d, fartn) then
     insert(t, neighbours)
     remove(fartn, neighbours)
    endif
   endif
  endfor
  class(d) = majority vote of the classes of the k nearest neighbours
 endfor
End function
Parallel Decision Tree
Function Parallel Attribute Selection in Decision Tree building
 max = -infinity
 Foreach candidate attribute                  // Parallelize the loop! Some synchronization may be necessary!
  relevance = quality of split(attribute)
  If (relevance > max) then
   max = relevance
   selected attribute = attribute
  endif
 endfor
 return (selected attribute)
End function
MapReduce
[Adapted from “Big Data Course” D. Kossmann & N. Tatbul, 2012]
 A software framework first introduced by Google in 2004
to support parallel and fault-tolerant computations over
large data sets on clusters of computers
 Based on the map/reduce functions commonly used in
the functional programming world
 Given:
 A very large dataset
 A well-defined computation task to be performed on elements
of this dataset (preferably, in a parallel fashion on a large
cluster)
 MapReduce framework:
 Just express what you want to compute (map() & reduce()).
 Don’t worry about parallelization, fault tolerance, data
distribution, load balancing (MapReduce takes care of these).
 What changes from one application to another is the actual
computation; the programming structure stays similar.
MapReduce
 Here is the framework in simple terms:
 Read lots of data.
 Map: extract something that you care about from each
record.
 Shuffle and sort.
 Reduce: aggregate, summarize, filter, or transform.
 Write the results.
 One can use as many Maps and Reduces as needed
to model a given problem.
MapReduce Basic Programming Model
 Transform a set of input key-value pairs to a set of
output values:
 Map: (k1, v1) → list(k2, v2)
 MapReduce library groups all intermediate pairs with same
key together.
 Reduce: (k2, list(v2)) → list(v2)
 Implicit parallelism in map
 If the order of application of a function f to elements in a
list is commutative, then we can reorder or parallelize
execution.
 This is the “secret” that MapReduce exploits.
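A toy illustration (an addition, not from the slides) of the Map: (k1, v1) → list(k2, v2) and Reduce: (k2, list(v2)) → list(v2) signatures, using plain Python functions and the classic word-count example; a real deployment would run on Hadoop or a similar framework.

from collections import defaultdict

documents = {"doc1": "big data is big", "doc2": "data science and big data"}

def map_fn(doc_id, text):
    # Map: (k1, v1) -> list(k2, v2); emit (word, 1) for every word in the document
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # Reduce: (k2, list(v2)) -> list(v2); aggregate the partial counts of one word
    return [sum(counts)]

# Shuffle and sort: group the intermediate pairs by key
groups = defaultdict(list)
for doc_id, text in documents.items():
    for word, count in map_fn(doc_id, text):
        groups[word].append(count)

result = {word: reduce_fn(word, counts)[0] for word, counts in groups.items()}
print(result)    # {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'and': 1}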
MapReduce Parallelization
 Multiple map() functions run in parallel, creating
different intermediate values from different input data
sets.
 Multiple reduce() functions also run in parallel, each
working on a different output key.
 All values are processed independently.
 Bottleneck: The reduce phase can’t start until the map
phase is completely finished.
MapReduce Parallel Processing Model
[Chart from “Big Data Course” D. Kossmann & N. Tatbul, 2012]
MapReduce Execution Overview
[Chart from “Big Data Course” D. Kossmann & N. Tatbul, 2012]
K-means with MapReduce
Input: X = {x1, ..., xn} // Data to be clustered
 k // Number of clusters
Output: C = {c1, ..., ck} // Cluster centroids

Map                      Grouping                    Reduce
(x1, ?) → (x1, cl(x1))   (all xi with cl(xi) = 1)    (cl = 1, c1 = average of {xi | cl(xi) = 1})
...                      ...                         ...
(xn, ?) → (xn, cl(xn))   (all xi with cl(xi) = k)    (cl = k, ck = average of {xi | cl(xi) = k})
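The same iteration written as map and reduce functions in plain Python (a sketch of the table above, not Hadoop or Mahout code): map assigns each point to its nearest centroid, the grouping step collects the points of each cluster, and reduce averages them into the new centroid.

import numpy as np
from collections import defaultdict

def kmeans_map(x, C):
    # Map: (x, ?) -> (cl(x), x), where cl(x) is the index of the nearest centroid
    return int(np.argmin([np.linalg.norm(x - c) for c in C])), x

def kmeans_reduce(cluster_id, points):
    # Reduce: (cl, list of points) -> (cl, new centroid = average of the points)
    return cluster_id, np.mean(points, axis=0)

def kmeans_iteration(X, C):
    groups = defaultdict(list)                 # grouping step: shuffle by cluster id
    for x in X:
        cid, point = kmeans_map(x, C)
        groups[cid].append(point)
    new_C = C.copy()
    for cid, points in groups.items():
        _, new_C[cid] = kmeans_reduce(cid, points)
    return new_C

X = np.random.default_rng(0).normal(size=(200, 2))
C = X[:3].copy()
for _ in range(10):
    C = kmeans_iteration(X, C)
print(C)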
A look at MAHOUT
Mahout
 Mahout is an open source machine learning library from Apache
 The algorithms implemented come from the ML field. At the moment they are:
 Collaborative filtering / recommender engines
 Clustering
 Classification
 It is scalable: implemented in Java, with some code built upon Apache's Hadoop distributed computation framework.
 It is a Java library
 Mahout started in 2008 as a subproject of the Apache Lucene project
BIG DATA TOOLS & RESOURCES
Big Data Tools
 NoSQL Databases
 MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak, ZooKeeper
 MapReduce
 Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4,
MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum
 Storage
 S3, Hadoop Distributed File System
 Servers
 EC2, Google App Engine, Elastic, Beanstalk, Heroku
 Processing
 R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch,
Datameer, BigSheets, Tinkerpop
Big Data Literature (1)
Big Data Literature (2)
Big Data websites
 http://www.DataScienceCentral.com
 http://www.apache.org
 http://hadoop.apache.org
 http://mahout.apache.org
 http://bigml.com
© Miquel Sànchez i Marrè, 2013
Miquel Sànchez i Marrè
(miquel@lsi.upc.edu)
http://kemlg.upc.edu/

More Related Content

Similar to From DM to BD & DS

SC4 Workshop 1: Logistics and big data German herrero
SC4 Workshop 1: Logistics and big data  German herreroSC4 Workshop 1: Logistics and big data  German herrero
SC4 Workshop 1: Logistics and big data German herreroBigData_Europe
 
Quantitative Methods II (#SOC2031). Seminar #11: Secondary analysis. Big data...
Quantitative Methods II (#SOC2031). Seminar #11: Secondary analysis. Big data...Quantitative Methods II (#SOC2031). Seminar #11: Secondary analysis. Big data...
Quantitative Methods II (#SOC2031). Seminar #11: Secondary analysis. Big data...David Rozas
 
What's new with analytics in academia?
What's new with analytics in academia?What's new with analytics in academia?
What's new with analytics in academia?InfoTrust LLC
 
The Most Effective Method For Selecting Data Science Projects
The Most Effective Method For Selecting Data Science ProjectsThe Most Effective Method For Selecting Data Science Projects
The Most Effective Method For Selecting Data Science ProjectsGramener
 
DataBeers Malaga #20 especial datos y ciberseguridad- Plataforma de Análisis ...
DataBeers Malaga #20 especial datos y ciberseguridad- Plataforma de Análisis ...DataBeers Malaga #20 especial datos y ciberseguridad- Plataforma de Análisis ...
DataBeers Malaga #20 especial datos y ciberseguridad- Plataforma de Análisis ...Databeers Malaga
 
1.-DE-LECTURE-1-INTRO-TO-DATA-ENGG.pptx
1.-DE-LECTURE-1-INTRO-TO-DATA-ENGG.pptx1.-DE-LECTURE-1-INTRO-TO-DATA-ENGG.pptx
1.-DE-LECTURE-1-INTRO-TO-DATA-ENGG.pptxarpit206900
 
3 rd International Conference on Data Science and Machine Learning (DSML 2022)
3 rd International Conference on Data Science and Machine Learning (DSML 2022)3 rd International Conference on Data Science and Machine Learning (DSML 2022)
3 rd International Conference on Data Science and Machine Learning (DSML 2022)gerogepatton
 
Every angle jacques adriaansen
Every angle   jacques adriaansenEvery angle   jacques adriaansen
Every angle jacques adriaansenBigDataExpo
 
How Data Science Can Grow Your Business?
How Data Science Can Grow Your Business?How Data Science Can Grow Your Business?
How Data Science Can Grow Your Business?Noam Cohen
 
A New Paradigm on Analytic-Driven Information and Automation V2.pdf
A New Paradigm on Analytic-Driven Information and Automation V2.pdfA New Paradigm on Analytic-Driven Information and Automation V2.pdf
A New Paradigm on Analytic-Driven Information and Automation V2.pdfArmyTrilidiaDevegaSK
 
Sr. Jon Ander, Internet de las Cosas y Big Data: ¿hacia dónde va la Industria?
Sr. Jon Ander, Internet de las Cosas y Big Data: ¿hacia dónde va la Industria? Sr. Jon Ander, Internet de las Cosas y Big Data: ¿hacia dónde va la Industria?
Sr. Jon Ander, Internet de las Cosas y Big Data: ¿hacia dónde va la Industria? INACAP
 
Emerging technologies in computer science
Emerging technologies in computer scienceEmerging technologies in computer science
Emerging technologies in computer scienceSrinivas Narasegouda
 
Teaching big data
Teaching big dataTeaching big data
Teaching big dataRim Moussa
 
CDKP 2013
CDKP 2013CDKP 2013
CDKP 2013icaita
 

Similar to From DM to BD & DS (20)

SC4 Workshop 1: Logistics and big data German herrero
SC4 Workshop 1: Logistics and big data  German herreroSC4 Workshop 1: Logistics and big data  German herrero
SC4 Workshop 1: Logistics and big data German herrero
 
Quantitative Methods II (#SOC2031). Seminar #11: Secondary analysis. Big data...
Quantitative Methods II (#SOC2031). Seminar #11: Secondary analysis. Big data...Quantitative Methods II (#SOC2031). Seminar #11: Secondary analysis. Big data...
Quantitative Methods II (#SOC2031). Seminar #11: Secondary analysis. Big data...
 
Data science guide
Data science guideData science guide
Data science guide
 
isd314-01
isd314-01isd314-01
isd314-01
 
What's new with analytics in academia?
What's new with analytics in academia?What's new with analytics in academia?
What's new with analytics in academia?
 
The Most Effective Method For Selecting Data Science Projects
The Most Effective Method For Selecting Data Science ProjectsThe Most Effective Method For Selecting Data Science Projects
The Most Effective Method For Selecting Data Science Projects
 
DataBeers Malaga #20 especial datos y ciberseguridad- Plataforma de Análisis ...
DataBeers Malaga #20 especial datos y ciberseguridad- Plataforma de Análisis ...DataBeers Malaga #20 especial datos y ciberseguridad- Plataforma de Análisis ...
DataBeers Malaga #20 especial datos y ciberseguridad- Plataforma de Análisis ...
 
1.-DE-LECTURE-1-INTRO-TO-DATA-ENGG.pptx
1.-DE-LECTURE-1-INTRO-TO-DATA-ENGG.pptx1.-DE-LECTURE-1-INTRO-TO-DATA-ENGG.pptx
1.-DE-LECTURE-1-INTRO-TO-DATA-ENGG.pptx
 
Data mining
Data miningData mining
Data mining
 
3 rd International Conference on Data Science and Machine Learning (DSML 2022)
3 rd International Conference on Data Science and Machine Learning (DSML 2022)3 rd International Conference on Data Science and Machine Learning (DSML 2022)
3 rd International Conference on Data Science and Machine Learning (DSML 2022)
 
"Big Data Dreams"
"Big Data Dreams""Big Data Dreams"
"Big Data Dreams"
 
Every angle jacques adriaansen
Every angle   jacques adriaansenEvery angle   jacques adriaansen
Every angle jacques adriaansen
 
How Data Science Can Grow Your Business?
How Data Science Can Grow Your Business?How Data Science Can Grow Your Business?
How Data Science Can Grow Your Business?
 
A New Paradigm on Analytic-Driven Information and Automation V2.pdf
A New Paradigm on Analytic-Driven Information and Automation V2.pdfA New Paradigm on Analytic-Driven Information and Automation V2.pdf
A New Paradigm on Analytic-Driven Information and Automation V2.pdf
 
Sr. Jon Ander, Internet de las Cosas y Big Data: ¿hacia dónde va la Industria?
Sr. Jon Ander, Internet de las Cosas y Big Data: ¿hacia dónde va la Industria? Sr. Jon Ander, Internet de las Cosas y Big Data: ¿hacia dónde va la Industria?
Sr. Jon Ander, Internet de las Cosas y Big Data: ¿hacia dónde va la Industria?
 
Emerging technologies in computer science
Emerging technologies in computer scienceEmerging technologies in computer science
Emerging technologies in computer science
 
Teaching big data
Teaching big dataTeaching big data
Teaching big data
 
Data Driven Economy @CMU
Data Driven Economy @CMUData Driven Economy @CMU
Data Driven Economy @CMU
 
20191106 brasil it 2
20191106 brasil it 220191106 brasil it 2
20191106 brasil it 2
 
CDKP 2013
CDKP 2013CDKP 2013
CDKP 2013
 

Recently uploaded

Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknowmakika9823
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxTanveerAhmed817946
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 

Recently uploaded (20)

Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 

From DM to BD & DS

  • 8. FromDMtoBD&DS:aComputationalPerspective 8 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Why Data Mining is important? From observations to decisions [Adapted from A.D. Witakker, 1993]. [Figure: from Data and Observations, through Interpretation and Understanding, to Knowledge, Predictions, Consequences and Recommendations supporting a Decision; the quantity of information decreases (high to low) while the value and relevance to decisions increases (low to high).]
  • 10. FromDMtoBD&DS:aComputationalPerspective 10 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela KDD Concepts  KDD  Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [Fayyad et al., 1996]  KDD is a multi-step process  Data Mining  Data Mining is a step in the KDD process consisting of particular data mining algorithms that, under some acceptable computational efficiency limitations, produce a particular set of patterns over a set of examples, cases or data [Fayyad et al., 1996]
  • 11. FromDMtoBD&DS:aComputationalPerspective 11 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela The KDD process [Fayyad, Piatetsky-Shapiro & Smyth, 1996]: Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Transformation) → Transformed Data → (Data Mining) → Patterns / Models → (Interpretation / Evaluation) → Knowledge
  • 12. FromDMtoBD&DS:aComputationalPerspective 12 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Terminology (equivalent terms across fields):
    Data Bases   | Artificial Intelligence   | Statistics
    Table        | Data Matrix               | Data
    Register     | Instance / Example        | Observation / Individual
    Field        | Attribute / Feature       | Variable
    Value        | Data Value                | -
  • 14. FromDMtoBD&DS:aComputationalPerspective 14Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Problems with data  Huge amount of data  Corrupted Data  Noisy Data  Irrelevant Data / Attribute Relevance  Attribute Extraction  Numeric and Symbolic Data  Scarce Data  Missing Attributes  Missing Values  Fractioned Data  Incompatible Data  Different Source Data  Different Granularity Data
  • 15. FromDMtoBD&DS:aComputationalPerspective 15 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Preparing Data  Data Transformation  Data Filtering  Data Sorting  Data Edition  Noise Modelling  Obtaining New Information  Visualization  Removing  Data Selection  Sampling  New Information Generation  Data Engineering  Data Fusion  Time Series Analysis  Constructive Induction
  • 16. FromDMtoBD&DS:aComputationalPerspective 16 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Data Cleaning  Removal of duplicated data  Resolution of data inconsistencies  Outlier Detection and Management  Error Detection and Management
  • 17. FromDMtoBD&DS:aComputationalPerspective 17 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Coding  Codify some fields and generate a code table  Address to region  Phone prefix to province code  Birth date to age, and age to decade  To normalize huge magnitudes: millions to thousands (account balance, credits, ...)  Sex, married/single, ... transformed to binary  Weekly, monthly information, ... to week number, month number, ...
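As an illustration of these recodings, here is a minimal Python sketch; the field names, the decade cut-off and the F/M-to-1/0 convention are assumptions made for the example, not taken from the slides:

import datetime as dt

def birth_date_to_age(birth_date, today=None):
    # recode a birth date into an age in completed years
    today = today or dt.date.today()
    return today.year - birth_date.year - ((today.month, today.day) < (birth_date.month, birth_date.day))

def age_to_decade(age):
    return (age // 10) * 10              # e.g. 37 -> 30

def sex_to_binary(sex):
    return 1 if sex == "F" else 0        # assumed convention: F -> 1, M -> 0

record = {"sex": "F", "birth_date": dt.date(1980, 5, 17), "balance_eur": 12_500_000}
coded = {
    "sex_bin": sex_to_binary(record["sex"]),
    "age": birth_date_to_age(record["birth_date"]),
    "decade": age_to_decade(birth_date_to_age(record["birth_date"])),
    "balance_keur": record["balance_eur"] / 1000,   # normalize huge magnitudes: units to thousands
}
print(coded)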
  • 18. FromDMtoBD&DS:aComputationalPerspective 18 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Flattening  Transform an attribute of cardinality n into n binary attributes, to get a unique register per element (a minimal code sketch follows this example):
    Before flattening:          After flattening:
    DNI        hobby            DNI        Spo  Mus  Read
    45464748   sport            45464748   1    1    1
    45464748   music            50000001   0    1    0
    45464748   reading
    50000001   music
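The same flattening step can be sketched in a few lines of Python, assuming pandas is available; the DNI and hobby values are the ones from the slide's example:

import pandas as pd

# one row per (DNI, hobby) pair, as in the "before" table
df = pd.DataFrame({"DNI": [45464748, 45464748, 45464748, 50000001],
                   "hobby": ["sport", "music", "reading", "music"]})

# flatten: one binary attribute per hobby value, a single register per DNI
flat = pd.get_dummies(df["hobby"]).groupby(df["DNI"]).max().astype(int)
print(flat)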
  • 19. FromDMtoBD&DS:aComputationalPerspective 19Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Data Selection  Mechanism to reduce the size of the data matrix, by means of the following techniques:  Instance Selection  Feature Selection  Feature Weighting (Feature Relevance Determination)
  • 20. FromDMtoBD&DS:aComputationalPerspective 20Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Instance Selection  Goal: to reduce the number of examples  Methods [Riaño, 1997]:  Exhaustive methods  Greedy methods  Heuristic methods  Convergent methods  Probabilistic methods
  • 21. FromDMtoBD&DS:aComputationalPerspective 21 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Feature Selection  Goal: to reduce the number of features (dimensionality)  Categories of methods:  Feature ranking  Feature ranking methods rank the features by a metric and eliminate all features that do not achieve an adequate score  Subset selection  Subset selection methods search the space of possible feature subsets for the optimal one.  Methods  Wrapper methods (Wrappers): utilize the ML model of interest as a black box to score subsets of features according to their predictive power.  Filter methods (Filters): select subsets of features as a preprocessing step, independently of the chosen predictor.  Embedded methods: perform feature selection in the process of training and are usually specific to given ML techniques (a small filter-style sketch follows)
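A minimal sketch of a filter method, ranking discrete features by their information gain with respect to the class and keeping the top-scoring ones; the toy data are invented for the example:

import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, labels):
    # gain of one (discrete) feature with respect to the class labels
    total = entropy(labels)
    vals, counts = np.unique(feature, return_counts=True)
    cond = sum((c / len(labels)) * entropy(labels[feature == v]) for v, c in zip(vals, counts))
    return total - cond

# toy data: 3 discrete features, binary class
X = np.array([[1, 0, 2], [1, 1, 2], [0, 1, 1], [0, 0, 1], [1, 1, 0], [0, 0, 0]])
y = np.array([1, 1, 0, 0, 1, 0])
gains = [information_gain(X[:, j], y) for j in range(X.shape[1])]
ranking = np.argsort(gains)[::-1]
print("gains:", np.round(gains, 3), "-> keep top-2 features:", ranking[:2])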
  • 22. FromDMtoBD&DS:aComputationalPerspective 22 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Feature Weighting  Goal: to determine the weight or relevance of all features in an automatic way  Framework for Feature Weighting methods [Wettschereck et al., 1997]:
    Dimension        | Possible Values
    Bias             | {Feedback, Preset}
    Weights Space    | {Continuous, Binary}
    Representation   | {Given, Transformed}
    Generality       | {Global, Local}
    Knowledge        | {Poor, Intensive}
  • 23. FromDMtoBD&DS:aComputationalPerspective 23Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela  Supervised methods  Filter methods  Global methods  Mutual Information algorithm  Cross-Category Feature Importance  Projection of Attributes  Information Gain measure  Class Value method  Local methods  Value-Difference Metric  Per Category Feature Importance  Class Distribution Weighting algorithm  Flexible Weighting  Entropy-Based Local Weighting Feature Weighting / Attribute Relevance (1)
  • 24. FromDMtoBD&DS:aComputationalPerspective 24Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Feature Weighting / Attribute Relevance (2)  Wrapper methods  RELIEF  Contextual Information  Introspective Learning  Diet Algorithm  Genetic algorithms  Unsupervised methods  Gradient Descent  Entropy-based Feature Ranking  UEB-1  UEB-2
  • 26. FromDMtoBD&DS:aComputationalPerspective 26 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Data Mining Techniques  Statistical Techniques  Linear Models: simple regression, multiple regression  Time Series Models (AR, MA, ARIMA)  Principal Component Analysis (PCA) / Discriminant Analysis (DA)  Artificial Intelligence Techniques  Decision Trees  Classification Rules  Association Rules  Clustering  Instance-Based Learning (IBL, CBR)  Connectionist Approach (Artificial Neural Networks)  Evolutionary Computation (Genetic Algorithms, Genetic Programming)  AI&Stats Techniques  Regression Trees  Model Trees  Probabilistic/Belief/Bayesian Networks
  • 27. FromDMtoBD&DS:aComputationalPerspective 27 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Model Classification (1) [Gibert & Sànchez-Marrè, 2007] KNOWLEDGE MODELS DESCRIPTIVE MODELS ASSOCIATIVE MODELS DISCRIMINANT MODELS PREDICTIVE MODELS (AI) Conceptual Clustering Tech. (Stats) Statistical Clustering Tech. (AI&Stats) Hybrid Tech.: Classification Based on Rules (ClBR) (AI) Association Rules Model-Based Reasoning Qualitative Reasoning (AI&Stats) Bayesian / Belief Networks (BNs) (Stats) Principal Components Analysis (PCA) Simple / Multiple Correspondence Analysis (SCA / MCA) (AI) Decision Trees Classification Rules (Stats) Discriminant Analysis (AI&Stats) Box-Plot Based Induction Rules (AI) Instance-Based Learning (IBL) (AI) Connectionist Models (ANNs) Evolutionary Computation (GAs) Collaborative Resolution (“Swarm Intelligence“) (Stats) Simple/Multiple Linear Regression Variance Analysis Time Series Models Rule-Based Reasoning Bayesian Reasoning Case-Based Reasoning MODELS WITHOUT RESPONSE VARIABLE MODELS WITH RESPONSE VARIABLE (Stats) Naive Bayes Classifier (AI&Stats) Regression Trees / Model Trees
  • 28. FromDMtoBD&DS:aComputationalPerspective 28Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Model Classification (2) [Gibert & Sànchez-Marrè, 2007] UNCERTAINTY MODELS PROBABILISTIC MODELS NEAR-PROBABILISTIC MODELS EVIDENTIAL MODEL POSSIBILISTIC MODEL (AI) Certainty Factor Method[MYCIN] (Stats) Pure Probabilistic Model (AI) Subjective Bayesian Method [PROSPECTOR] (AI&Stats) Bayesian Network Model [Pearl] (AI) Evidence Theory [Dempster-Shafer] (AI) Possibility Theory Fuzzy Logic [Zadeh]
  • 30. FromDMtoBD&DS:aComputationalPerspective 30 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Clustering / Conceptual Clustering  We group {1,2} and {3,4} even though d(1,2) > d(1,3): {1,2} is described by A=0.1, while {3,4} is described by B in [0.3,0.38] & C in [0.42,0.45]
       A     B     C
    1  0.1   0.8   0.3
    2  0.1   0.3   100
    3  0.7   0.3   0.45
    4  0.3   0.38  0.42
  • 31. FromDMtoBD&DS:aComputationalPerspective 31Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Clustering Techniques  Clustering with K determined clusters  K-means method  Clustering through the neighbours  Nearest-Neighbour method (NN method)  Hierarchical Clustering  Probabilistic/Fuzzy Clustering  Clustering based on rules
  • 32. FromDMtoBD&DS:aComputationalPerspective 32 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela K-means Method
    Input:  X = {x1, ..., xn}   // Data to be clustered
            k                   // Number of clusters
    Output: C = {c1, ..., ck}   // Cluster centroids
    Function K-means
      initialize C                                // random selection from X
      while C has changed
        for each xi in X
          cl(xi) = argmin_j distance(xi, cj)      // assign xi to its closest centroid
        endfor
        for each cj in C
          cj = centroid({xi | cl(xi) = j})        // recompute each centroid
        endfor
      endwhile
      return C
    End function
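A minimal runnable counterpart of this pseudocode in Python/NumPy; it is only an illustrative sketch (fixed iteration cap, no empty-cluster handling, random initialization from X), not the implementation used in the course:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]   # initialize C: random selection from X
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assignment step: cl(xi) = argmin_j distance(xi, cj)
        labels = np.argmin(np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2), axis=1)
        # update step: each centroid becomes the mean of its assigned points
        new_C = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_C, C):                      # stop when the centroids no longer move
            break
        C = new_C
    return C, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centroids, labels = kmeans(X, k=2)
print(np.round(centroids, 2))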
  • 33. FromDMtoBD&DS:aComputationalPerspective K-means method [Charts extracted from “Mahout in action”, Owen et al., 2011] 33Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
  • 36. FromDMtoBD&DS:aComputationalPerspective K-NN algorithm
    Input:  T = {t1, ..., tn}   // Training data points available
            D = {d1, ..., dm}   // Data points to be classified
            k                   // Number of neighbours
    Output: neighbours          // the k nearest neighbours
    Function K-NN
      for each data point d in D
        neighbours = ∅
        for each training data point t in T
          dist = distance(d, t)
          if |neighbours| < k then
            insert(t, neighbours)
          else
            fartn = argmax_i distance(d, neighbours_i)   // current farthest neighbour
            if dist < distance(d, fartn) then
              insert(t, neighbours)
              remove(fartn, neighbours)
            endif
          endif
        endfor
        return majority vote of the k nearest neighbours
      endfor
    End function
    36 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
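A compact runnable counterpart in Python; instead of the incremental insert/remove bookkeeping of the pseudocode it simply sorts all distances, which yields the same k nearest neighbours:

import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    # distances from the query point to every training point
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(dists)[:k]                            # indices of the k nearest neighbours
    return Counter(train_y[nearest]).most_common(1)[0][0]      # majority vote

train_X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
train_y = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(train_X, train_y, np.array([0.5, 0.5]), k=3))   # -> "A"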
  • 37. FromDMtoBD&DS:aComputationalPerspective 37 Decision trees  The nodes are qualitative attributes  The branches are the possible values of a qualitative attribute  The leaves of the tree hold the qualitative prediction for the attribute that acts as a class label  They model the process of deciding to which class a new example of the domain belongs
  • 38. FromDMtoBD&DS:aComputationalPerspective 38 Decision tree for the contact lens data Decision Tree: example
  • 39. FromDMtoBD&DS:aComputationalPerspective 39 ID3 algorithm  ID3  Induction Decision Tree [Quinlan, 1979], [Quinlan, 1986]  Machine Learning Technique  Decision Tree Induction  Top-Down strategy  From a set of examples/instances and the class to which they belong, it builds up the best decision tree which explains the instances  Criteria: Compactness (the more compact the tree, the lower its complexity, the better) and Goodness (predictive/discriminant ability)
  • 40. FromDMtoBD&DS:aComputationalPerspective © Miquel Sànchez i Marrè & Karina Gibert, KEMLG, 2009 40 ID3: basic idea  Select at each step the attribute that discriminates the most.  The selection is done by maximizing a certain function G(X, A).
  • 41. FromDMtoBD&DS:aComputationalPerspective 41 ID3: selection criteria  Select the attribute Ak which maximizes the gain of information G(X, Ak) = I(X, C) - E(X, Ak), with E(X, Ak) >= 0, where
    I(X, C) = - Σ_{Ci ∈ C} p(X, Ci) · log2 p(X, Ci)        (information / entropy of the class distribution)
    p(X, Ci) = #Ci / #X                                     (probability that one example belongs to class Ci)
    E(X, Ak) = Σ_{vl ∈ V(Ak)} p(X, vl) · I(Ak^-1(vl), C)    (expected entropy after splitting on Ak)
    p(X, vl) = #Ak^-1(vl) / #X                              (probability that one example has the value vl for the attribute Ak)
  • 42. FromDMtoBD&DS:aComputationalPerspective 42 ID3: algorithm
    Function ID3 (in X, A are sets) returns decision tree is
      var tree1, tree2 are decision tree endvar
      option
        case (∃ Ci : ∀ xj ∈ X → xj ∈ Ci) do              // all examples belong to the same class Ci
          tree1 ← buildTree(Ci)
        case no (∃ Ci : ∀ xj ∈ X → xj ∈ Ci) do
          option
            case A ≠ ∅ do
              Amax ← argmax_{Ak ∈ A} G(X, Ak)
              tree1 ← buildTree(Amax)
              for each v ∈ V(Amax) do
                tree2 ← ID3(Amax^-1(v), A - {Amax})
                tree1 ← addBranch(tree1, tree2, v)
              endforeach
            case A = ∅ do
              tree1 ← buildTree(majorityClass(X))
          endoption
      endoption
      returns tree1
    endfunction
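An illustrative Python sketch of the same recursion, using the entropy-based gain of the previous slide (categorical attributes only; a simplified version, not the author's implementation). The toy data are the eight examples of slide 43:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    # G(X, Ak) = I(X, C) - E(X, Ak)
    total = entropy(labels)
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    expected = sum(len(ys) / len(labels) * entropy(ys) for ys in by_value.values())
    return total - expected

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:                        # all examples in the same class
        return labels[0]
    if not attrs:                                    # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, labels, a))
    tree = {}
    for v in {row[best] for row in rows}:            # one branch per value of the best attribute
        idx = [i for i, row in enumerate(rows) if row[best] == v]
        tree[v] = id3([rows[i] for i in idx], [labels[i] for i in idx], attrs - {best})
    return {best: tree}

# the eight examples of slide 43 (attributes A, B, C; class D)
rows = [{"A": 1, "B": 1, "C": 5}, {"A": 2, "B": 1, "C": 5}, {"A": 2, "B": 1, "C": 7},
        {"A": 1, "B": 1, "C": 7}, {"A": 2, "B": 2, "C": 5}, {"A": 2, "B": 2, "C": 7},
        {"A": 1, "B": 2, "C": 7}, {"A": 2, "B": 3, "C": 7}]
labels = ["-", "-", "+", "+", "-", "-", "-", "+"]
print(id3(rows, labels, {"A", "B", "C"}))            # splits on B first, then on C under B=1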
  • 43. FromDMtoBD&DS:aComputationalPerspective 43 ID3 worked example (attributes A, B, C; class D):
       A  B  C  D
    1. 1  1  5  -
    2. 2  1  5  -
    3. 2  1  7  +
    4. 1  1  7  +
    5. 2  2  5  -
    6. 2  2  7  -
    7. 1  2  7  -
    8. 2  3  7  +
    (1) First split: class counts p/n, entropies h(p,n) and expected entropy e per attribute
         p  n  h(p,n)   e
    A=1  1  2  0.6365   0.6593
    A=2  2  3  0.6730
    B=1  2  2  0.6931   0.3465
    B=2  0  3  0.0000
    B=3  1  0  0.0000
    C=5  0  3  0.0000   0.4206
    C=7  3  2  0.6730
    (2) B is selected (branches =1, =2, =3); the same 8 examples remain, without column B (attributes A, C and class D)
    (3) Within the branch B=1:
         p  n  h(p,n)   e
    A=1  1  1  0.6931   0.6931
    A=2  1  1  0.6931
    C=5  0  2  0.0000   0.0000
    C=7  2  0  0.0000
    (4)-(5) Resulting tree: B=1 → test C (C=5 → -, C=7 → +); B=2 → -; B=3 → +
  • 45. FromDMtoBD&DS:aComputationalPerspective 45 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Post-processing Techniques (1)  Post-processing techniques are devoted to transforming the direct results of the data mining step into directly understandable, useful knowledge for later decision-making  Techniques could be summarized in the following tasks [Bruha and Famili, 2000]:  Knowledge filtering: the knowledge induced by data-driven models normally needs to be filtered.  Knowledge consistency checking: we may also check the new knowledge for potential conflicts with previously induced knowledge.  Interpretation and explanation: the mined knowledge model could be directly used for prediction, but it is very advisable to document, interpret and provide explanations for the knowledge discovered.
  • 46. FromDMtoBD&DS:aComputationalPerspective 46 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Post-processing Techniques (2)  Visualization. Visualization of the knowledge (Cox et al., 1997) is a very useful technique to gain a deeper understanding of the newly discovered knowledge.  Knowledge integration. The traditional decision-making systems have been dependent on a single technique, strategy or model. New sophisticated decision-support systems combine or refine results obtained from several models, usually produced by different methods. This process increases accuracy and the likelihood of success.  Evaluation. After a learning system induces concept hypotheses (models) from the training set, their evaluation (or testing) should take place.
  • 47. FromDMtoBD&DS:aComputationalPerspective 47Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Interpretation / Result Evaluation  Tables summarising data  Spatial Representation of Data  Graphical Visualization of Models/Patterns of Knowledge  Decision Trees  Discovered Clusters  Induced Rules  Bayesian Network Learned
  • 48. FromDMtoBD&DS:aComputationalPerspective 48 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Result Representation (1)  Direct data through tables:
    0.343689   10000   5879.9875
    0.467910    2345   98724.935
    0.873493      34   92235.9620
  • 49. FromDMtoBD&DS:aComputationalPerspective 49 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Result Representation (2)  Features through histograms. [Figure: two example histograms of a series ("Serie1") over 5 categories.]
  • 50. FromDMtoBD&DS:aComputationalPerspective 50 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Result Representation (3)  Proportions through pie chart graphics. [Figure: two example pie charts over 5 categories.]
  • 51. FromDMtoBD&DS:aComputationalPerspective 51Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Two-dimensional Representation
  • 52. FromDMtoBD&DS:aComputationalPerspective 52Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Three-dimensional Representation
  • 53. FromDMtoBD&DS:aComputationalPerspective 53 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Validation Methods for Discriminant and Predictive Models  Assessment of predictive/discriminant abilities of models  Training Examples  Obtaining the model  Test Examples  Assessment of the accuracy and generalization ability of the model  Methods and tools for rate estimation  Simple Validations / Cross Validations  Random Validations / Stratified Validations
  • 54. FromDMtoBD&DS:aComputationalPerspective 54Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Precision/Accuracy Types  Global Precision/Accuracy of classification  Precision/Accuracy of classification by modalities  Confusion Matrix [General use]  ROC (Receiver /Relative Operating Characteristic) Curves [for binary classifiers]  Gini Index
  • 55. FromDMtoBD&DS:aComputationalPerspective 55 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Simple Validation [Stratified] (1) Stratified: the class distribution C1, …, CM in the training set and in the test set follows the same distribution as the original whole data set. [Figure: the data set is split into a training part, used to induce the model, and a test part, used to test it.]
  • 56. FromDMtoBD&DS:aComputationalPerspective 56Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Simple Validation [Stratified] (2)
  • 57. FromDMtoBD&DS:aComputationalPerspective 57 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Cross Validation [Stratified] The initial data set is split into N subsets (N = 3…10, defined by the user) The final accuracy rate is computed as the average of the accuracy rates obtained for each one of the subsets
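A short sketch of stratified 10-fold cross validation, assuming scikit-learn is available; the iris data set and the decision tree learner are only placeholders for the example:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accuracies = []
for train_idx, test_idx in skf.split(X, y):
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])   # induce the model on the training folds
    accuracies.append(model.score(X[test_idx], y[test_idx]))           # test it on the held-out fold
print("mean accuracy over 10 stratified folds:", round(float(np.mean(accuracies)), 3))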
  • 58. FromDMtoBD&DS:aComputationalPerspective 58 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Global Precision of classification  Global Error or Misclassification Rate  Global Accuracy or Success or Classification Rate
    MODEL        MISCLAS. RATE   VALIDATION MISCLAS. RATE   TEST MISCLAS. RATE
    CART Tree    0.2593856655    0.2974079127               0.2909836066
    K-NN MBR     0.2894197952    0.2974079127               0.3155737705
    Regression   0.3071672355    0.3383356071               0.3770491803
    RBF          0.3051194539    0.3246930423               0.3360655738
  • 59. FromDMtoBD&DS:aComputationalPerspective 59 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Precision by modalities (1)  Confusion Matrix
    LOGISTIC REGRESSION          Predicted 0   Predicted 1
    Observed 0                   48.02         10.91
    Observed 1                   22.92         18.14
    CART CLASSIFICATION TREE     Predicted 0   Predicted 1
    Observed 0                   43.52         15.42
    Observed 1                   14.32         26.74
    (the off-diagonal cells correspond to Error Type I / False Negatives and Error Type II / False Positives)
  • 60. FromDMtoBD&DS:aComputationalPerspective 60 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Precision by modalities (2)  Confusion Matrix
    RBF neural network           Predicted 0   Predicted 1
    Observed 0                   47.34         11.60
    Observed 1                   20.87         20.19
    K-NN MBR                     Predicted 0   Predicted 1
    Observed 0                   41.34         17.60
    Observed 1                   12.14         28.92
  • 61. FromDMtoBD&DS:aComputationalPerspective 61 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela ROC Curves  Sensitivity, i.e. the True Positive rate (1 - prob(error type I)), plotted versus 1 - specificity, i.e. the False Positive rate (prob(error type II)):
  • 62. FromDMtoBD&DS:aComputationalPerspective 62 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Performance Gini Index  Surface between the ROC curve and the 45º bisector
                 Logistic Regression   RBF      CART Tree   K-NN MBR
    Gini index   0.4375                0.4230   0.4445      0.5673
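A small numeric sketch relating the ROC area and the Gini index through the usual identity Gini = 2*AUC - 1 (twice the surface between the ROC curve and the bisector); the scores below are synthetic and only for illustration:

import numpy as np

def auc(scores, labels):
    # probability that a random positive is scored above a random negative (Mann-Whitney statistic)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
scores = labels * 0.4 + rng.normal(0, 0.5, size=1000)   # noisy scores correlated with the class
a = auc(scores, labels)
print("AUC =", round(a, 3), " Gini =", round(2 * a - 1, 3))   # Gini index = 2*AUC - 1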
  • 64. FromDMtoBD&DS:aComputationalPerspective Big Data  Big Data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.  The trend to larger data sets is due to:  Increase of storage capacities  Increase of processing power  Availability of data  Additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data  As of 2012, limits on the size of data sets that are feasible to process in a reasonable amount of time were on the order of exabytes of data. 64 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
  • 65. FromDMtoBD&DS:aComputationalPerspective Big Data [Chart from Marko Grobelnik] 65Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
  • 66. FromDMtoBD&DS:aComputationalPerspective Big Data  Problems  Big data always gives an answer, but it may not make sense  More data could imply more error  With enough data anything can be proved (statistics)  Advantages  More tolerant to errors  Discover prototypical uncommon cases  ML algorithms work better  Features  Data Volume  Data in Motion (streaming data)  Data diversity (only 20% of data is related)  Complexity  Philosophy: “collect first, then think” 66 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
  • 67. FromDMtoBD&DS:aComputationalPerspective Big Data  Big Data is similar to “Small Data”, but bigger …  Managing bigger data requires different approaches:  Techniques  Tools  Architectures  The challenges include (new and old problems):  Capture  Curation  Storage  Search  Sharing  Transfer  Analysis  Visualization 67Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
  • 68. FromDMtoBD&DS:aComputationalPerspective Data Science  Data science incorporates varying elements and builds on techniques and theories from many fields, including math, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products. 68Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
  • 69. FromDMtoBD&DS:aComputationalPerspective Data Science [Extracted from “Big Data Course” D. Kossmann & N. Tatbul, 2012]  New approach to do science  Step 1: collect data  Step 2: generate hypotheses  Step 3: Validate Hypotheses  Step 4: Goto Step 1 or 2  It can be automated: no thinking, less error  How do you test without a ground truth? 69Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
  • 70. FromDMtoBD&DS:aComputationalPerspective Big Data Jungle 70 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela [Figure: the "Big Data jungle", spanning from "small" data to Big Data and from technical/scientific to social/business/companies settings; normal datasets lead to data mining, social network datasets to social network mining, massive datasets to massive dataset mining, and massive social connection datasets to massive social mining.]
  • 71. FromDMtoBD&DS:aComputationalPerspective Scientific Data Mining vs Social Mining 71 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela [Figure: classical data analysis and scientific data mining work on individual data, producing averages, indices, types/clusters and reasoning about prototypes; social mining works on individual data & interactions, analysing social connections, reasoning about individuals and producing personal recommendations.]
  • 72. FromDMtoBD&DS:aComputationalPerspective Social Mining  Analyse the context and the social connections  Analyse what you do  Analyse where you go  Analyse what you choose  Analyse what you buy  Analyse what you spend your time on  Analyse who is influencing you 72 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
  • 73. FromDMtoBD&DS:aComputationalPerspective 73 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Social Networks Analysis and Big Data
  • 74. FromDMtoBD&DS:aComputationalPerspective 74Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Influence
  • 75. FromDMtoBD&DS:aComputationalPerspective 75Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Discovering influence based on social relationships [http://www.tweetStimuli.com, A.Tejeda 2012]
  • 76. FromDMtoBD&DS:aComputationalPerspective 76Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Discovering influence based on social relationships [http://www.tweetStimuli.com, A.Tejeda 2012]
  • 77. FromDMtoBD&DS:aComputationalPerspective 77Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Social Network Analysis and Big Data
  • 79. FromDMtoBD&DS:aComputationalPerspective How to scale from “Small” to Big Data?  High Volume Data does not fit in Memory  Slow Algorithm  Parallelize the algorithm + Slicing Data (Individuals or attributes)  Non Slow Algorithm  Slicing Data (Individuals or attributes)  High Volume Data Fits in Memory  Slow Algorithm  Parallelize the algorithm  Non Slow Algorithm  Classical Data Mining Techniques 79 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela  (Distributed) Data Mining Algorithms are needed when the volume is high and/or the algorithm is slow
  • 80. FromDMtoBD&DS:aComputationalPerspective Solutions for Scaling  Parallelizing an existing sequential algorithm  Exploit the intrinsic parallelism  Implies slicing data  Using OpenMP, Java, C#, TBB, etc.  Slicing Data  Data is split into chunks of data  Each chunk is processed by a thread or node of computation  Slicing individuals  Slicing attributes  Recombining the results 80 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
  • 81. FromDMtoBD&DS:aComputationalPerspective 81 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela Parallel K-means Method
    Input:  X = {x1, ..., xn}   // Data to be clustered
            k                   // Number of clusters
    Output: C = {c1, ..., ck}   // Cluster centroids
    Function K-means
      initialize C                                // random selection from X
      while C has changed
        for each xi in X                          // <- Parallelize the loop !
          cl(xi) = argmin_j distance(xi, cj)
        endfor
        for each cj in C                          // <- Parallelize the loop !
          cj = centroid({xi | cl(xi) = j})
        endfor
      endwhile
      return C
    End function
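A sketch of parallelizing the assignment loop by slicing the individuals across worker processes (Python multiprocessing; an illustrative pattern, not the author's implementation):

import numpy as np
from multiprocessing import Pool

def assign_chunk(args):
    chunk, C = args
    # assignment step for one slice of the data: nearest centroid for every point in the chunk
    return np.argmin(np.linalg.norm(chunk[:, None, :] - C[None, :, :], axis=2), axis=1)

def parallel_assign(X, C, n_workers=4):
    chunks = np.array_split(X, n_workers)                  # slice the individuals into chunks
    with Pool(n_workers) as pool:
        parts = pool.map(assign_chunk, [(ch, C) for ch in chunks])
    return np.concatenate(parts)                           # recombine the results

if __name__ == "__main__":
    X = np.random.randn(10_000, 2)
    C = np.array([[-1.0, -1.0], [1.0, 1.0]])
    labels = parallel_assign(X, C)
    # centroid update from the gathered assignments, as in the sequential version
    new_C = np.array([X[labels == j].mean(axis=0) for j in range(len(C))])
    print(np.round(new_C, 3))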
  • 82. FromDMtoBD&DS:aComputationalPerspective Parallel K-NN algorithm
    Input:  T = {t1, ..., tn}   // Training data points available
            D = {d1, ..., dm}   // Data points to be classified
            k                   // Number of neighbours
    Output: neighbours          // the k nearest neighbours
    Function Parallel K-NN
      for each data point d in D                       // <- Parallelize the loop !
        neighbours = ∅
        for each training data point t in T            // <- Parallelize this loop too !
          dist = distance(d, t)                        // in high dimensions, the distance computation can also benefit from parallelism !
          if |neighbours| < k then
            insert(t, neighbours)
          else
            fartn = argmax_i distance(d, neighbours_i)
            if dist < distance(d, fartn) then
              insert(t, neighbours)
              remove(fartn, neighbours)
            endif
          endif
        endfor
        return majority vote of the k nearest neighbours
      endfor
    End function
    82 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
  • 83. FromDMtoBD&DS:aComputationalPerspective Parallel Decision Tree
    Function Parallel Attribute Selection in Decision Tree building
      max = -infinite
      for each candidate attribute                     // <- Parallelize the loop ! Some synchronization may be necessary !
        relevance = quality of split(attribute)
        if (relevance > max) then
          max = relevance
          selected attribute = attribute
        endif
      endfor
      return (selected attribute)
    End function
    83 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
  • 84. FromDMtoBD&DS:aComputationalPerspective MapReduce [Adapted from “Big Data Course” D. Kossmann & N. Tatbul, 2012]  A software framework first introduced by Google in 2004 to support parallel and fault-tolerant computations over large data sets on clusters of computers  Based on the map/reduce functions commonly used in the functional programming world  Given:  A very large dataset  A well-defined computation task to be performed on elements of this dataset (preferably, in a parallel fashion on a large cluster)  MapReduce framework:  Just express what you want to compute (map() & reduce()).  Don’t worry about parallelization, fault tolerance, data distribution, load balancing (MapReduce takes care of these).  What changes from one application to another is the actual computation; the programming structure stays similar. 84Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
  • 85. FromDMtoBD&DS:aComputationalPerspective MapReduce  Here is the framework in simple terms:  Read lots of data.  Map: extract something that you care about from each record.  Shuffle and sort.  Reduce: aggregate, summarize, filter, or transform.  Write the results.  One can use as many Maps and Reduces as needed to model a given problem. 85Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
  • 86. FromDMtoBD&DS:aComputationalPerspective MapReduce Basic Programming Model  Transform a set of input key-value pairs to a set of output values:  Map: (k1, v1) → list(k2, v2)  MapReduce library groups all intermediate pairs with same key together.  Reduce: (k2, list(v2)) → list(v2)  Implicit parallelism in map  If the order of application of a function f to elements in a list is commutative, then we can reorder or parallelize execution.  This is the “secret” that MapReduce exploits. 86Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
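A tiny self-contained illustration of this model: the classic word count, with map, shuffle/group and reduce simulated in plain Python rather than on a real MapReduce runtime:

from collections import defaultdict

def map_fn(_, line):                      # Map: (k1, v1) -> list(k2, v2)
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):              # Reduce: (k2, list(v2)) -> list(v2)
    return [sum(counts)]

documents = {1: "big data is big", 2: "data science uses big data"}

# map phase
intermediate = []
for k1, v1 in documents.items():
    intermediate.extend(map_fn(k1, v1))

# shuffle & sort: group all intermediate pairs with the same key together
groups = defaultdict(list)
for k2, v2 in intermediate:
    groups[k2].append(v2)

# reduce phase
result = {k2: reduce_fn(k2, v2s) for k2, v2s in sorted(groups.items())}
print(result)    # e.g. {'big': [3], 'data': [3], ...}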
  • 87. FromDMtoBD&DS:aComputationalPerspective MapReduce Parallelization  Multiple map() functions run in parallel, creating different intermediate values from different input data sets.  Multiple reduce() functions also run in parallel, each working on a different output key.  All values are processed independently.  Bottleneck: The reduce phase can’t start until the map phase is completely finished. 87Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
  • 88. FromDMtoBD&DS:aComputationalPerspective MapReduce Parallel Processing Model [Chart from “Big Data Course” D. Kossmann & N. Tatbul, 2012] 88Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
  • 89. FromDMtoBD&DS:aComputationalPerspective MapReduce Execution Overview [Chart from “Big Data Course” D. Kossmann & N. Tatbul, 2012] 89Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
  • 90. FromDMtoBD&DS:aComputationalPerspective K-means with MapReduce
    Input:  X = {x1, ..., xn}   // Data to be clustered
            k                   // Number of clusters
    Output: C = {c1, ..., ck}   // Cluster centroids
    Map                       Grouping                             Reduce
    (x1, ?) → (x1, cl(x1))    (cl(xi)=1, all xi with cl(xi)=1)     (cl(xi)=1, c1 = average(xi | cl(xi) = 1))
    ...                       ...                                  ...
    (xn, ?) → (xn, cl(xn))    (cl(xi)=k, all xi with cl(xi)=k)     (cl(xi)=k, ck = average(xi | cl(xi) = k))
    90 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
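One K-means iteration written with the same map / group / reduce structure, simulated in plain Python; C_current (the centroids from the previous iteration) and the toy data are assumptions made for the sketch:

import numpy as np
from collections import defaultdict

def kmeans_map(x, C):
    # Map: (x, ?) -> (cluster id, x)
    j = int(np.argmin(np.linalg.norm(C - x, axis=1)))
    return (j, x)

def kmeans_reduce(j, points):
    # Reduce: (cluster id, list of assigned points) -> (cluster id, new centroid = average)
    return (j, np.mean(points, axis=0))

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
C_current = np.array([[0.0, 0.0], [4.0, 4.0]])

groups = defaultdict(list)
for x in X:                                   # map + shuffle/group by cluster id
    j, point = kmeans_map(x, C_current)
    groups[j].append(point)

C_next = dict(kmeans_reduce(j, pts) for j, pts in groups.items())   # reduce: one centroid per key
print({j: np.round(c, 2) for j, c in C_next.items()})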
  • 92. FromDMtoBD&DS:aComputationalPerspective Mahout  Mahout is an open source machine learning library from Apache  The algorithms implemented are from the ML field. At the moment they are:  Collaborative filtering / recommender engines  Clustering  Classification  It is scalable. Implemented in Java, with some code built upon Apache’s Hadoop distributed computation.  It is a Java Library  Mahout started in 2008, as a subproject of Apache’s Lucene project 92 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
  • 94. FromDMtoBD&DS:aComputationalPerspective Big Data Tools  NoSQL Databases  MongoDB, CouchDB, Cassandra, Redis, BigTable, Hbase, Hypertable, Voldemort, Riak, ZooKeeper  MapReduce  Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum  Storage  S3, Hadoop Distributed File System  Servers  EC2, Google App Engine, Elastic Beanstalk, Heroku  Processing  R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets, Tinkerpop 94 Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
  • 95. FromDMtoBD&DS:aComputationalPerspective Big Data Literature (1) 95Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
  • 96. FromDMtoBD&DS:aComputationalPerspective Big Data Literature (2) 96Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
  • 97. FromDMtoBD&DS:aComputationalPerspective Big Data websites  http://www.DataScienceCentral.com  http://www.apache.org  http://hadoop.apache.org  http://mahout.apache.org  http://bigml.com 97Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
  • 98. FromDMtoBD&DS:aComputationalPerspective © Miquel Sànchez i Marrè, 2013 98 Miquel Sànchez i Marrè (miquel@lsi.upc.edu) http://kemlg.upc.edu/