( Big ) Data Management - Data Mining and Machine Learning - Global concepts in 10 slides
1. ( Big ) Data Management
Data Mining & Machine Learning
Global Concepts in 10 slides
2016
Nicolas SARRAMAGNA
https://fr.linkedin.com/pub/nicolas-sarramagna/19/941/587
3. COMPAGNIE PLASTIC OMNIUM
CONFIDENTIAL
Data Mining / Machine Learning in Data Management
Collect
Storage
Data Mining / Machine Learning
Data Viz
Governance
Security
Master Data
Data quality
DATA MANAGEMENT
Multiple modules
BIG DATA
Velocity, Volume, Variety, Veracity, Value
Data Mining / Machine Learning – What / Why
DATA MINING - VALUE
Explore and understand data to find relations, new properties and inferences
Descriptive approach
MACHINE LEARNING - VALUE
Build a predictive model to answer a question
Predictive approach
20 TO 30 YEARS OLD, BUT IN A NEW CONTEXT
CPU, database and RAM capacities
more data and more features
Internet
Big data
Overview - Data Mining
EXPLORE DATA
use of statistics
needs data visualization for interpretation and insights
CLUSTERING, ASSOCIATION
use of machine learning
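The "explore data" step can be sketched with a few descriptive statistics and a text histogram. This is only a minimal illustration using the Python standard library: the sample values, the `describe` and `histogram` helpers and the bin width are assumptions made up for the example, not part of the original slides.

```python
import statistics
from collections import Counter

# Hypothetical sample of one numeric feature (values invented for illustration).
values = [12.0, 15.5, 14.2, 13.8, 95.0, 14.9, 13.1, 15.0]

def describe(xs):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(xs),
        "min": min(xs),
        "max": max(xs),
        "mean": statistics.mean(xs),
        "median": statistics.median(xs),
        "stdev": statistics.stdev(xs),
    }

def histogram(xs, bin_width):
    """Count values per bin; each key is the lower bound of its bin."""
    return Counter(int(x // bin_width) * bin_width for x in xs)

BIN_WIDTH = 10
summary = describe(values)
bins = histogram(values, BIN_WIDTH)

# Text histogram: one '#' per value falling in the bin.
for lower in sorted(bins):
    print(f"[{lower:>3}, {lower + BIN_WIDTH:>3}) {'#' * bins[lower]}")
```

A single isolated bin far from the others (here the value 95.0) is the kind of outlier that a chart makes visible before any modeling starts.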
Overview - Machine Learning
PREDICTION
predict a category : classification
predict a number : regression
clustering, association
use of data mining
Data Mining / Machine Learning - How
PROCESS
Define the objective, the answer and the success criteria -> ML Canvas
Data understanding : collect data (one or more data sources), explore (min, max, histogram, charts)
Data preparation : data quality (outliers, missing values), normalization, dimension reduction, noise, new features, labeled data, text, dates, shuffling
Data modeling : baseline (random, mean), split data into train & test, select, combine and apply algorithms
Data evaluation : interpretation, evaluation (confusion matrix : recall, precision, formula), validation
Data deployment : deploy and monitor the model (integration, performance : latency, throughput), A/B testing, scalability, sustainability
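The preparation, modeling and evaluation steps above can be sketched end to end in a few lines. This is a minimal illustration, not the deck's method: the synthetic data, the helper names (`train_test_split`, `fit_line`, `mean_absolute_error`), the 25% test ratio and the choice of a least-squares line as "the model" are all assumptions made for the example. The point is the order of operations: shuffle and split first, normalize using training data only, then compare the model against a mean baseline on the test set.

```python
import random
import statistics

# Hypothetical labeled data: (feature, target) pairs invented for illustration.
rows = [(float(x), 3.0 * x + 2.0) for x in range(20)]

def train_test_split(data, test_ratio=0.25, seed=42):
    """Shuffle a copy of the data, then split it into train and test parts."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

def fit_line(points):
    """Least-squares fit of y = a * x + b on (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    a = sum((x - mx) * (y - my) for x, y in points) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def mean_absolute_error(predictions, targets):
    """Average absolute difference between predictions and targets."""
    return statistics.mean([abs(p - t) for p, t in zip(predictions, targets)])

# Data modeling: split into train and test sets.
train, test = train_test_split(rows)

# Data preparation: min-max normalization, fitted on the training set only.
lo = min(x for x, _ in train)
hi = max(x for x, _ in train)
train_scaled = [((x - lo) / (hi - lo), y) for x, y in train]
test_scaled = [((x - lo) / (hi - lo), y) for x, y in test]
test_targets = [y for _, y in test_scaled]

# Baseline: always predict the mean of the training targets.
baseline = statistics.mean([y for _, y in train_scaled])
baseline_mae = mean_absolute_error([baseline] * len(test_scaled), test_targets)

# Model: a fitted line, evaluated on the held-out test set.
a, b = fit_line(train_scaled)
model_mae = mean_absolute_error([a * x + b for x, _ in test_scaled], test_targets)

print(f"baseline MAE: {baseline_mae:.3f}  model MAE: {model_mae:.3f}")
```

A model is only worth deploying if it clearly beats the baseline on data it has not seen during training.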
WARNING
Need the business : domain knowledge
Need data and relevant features : at least 10 examples per feature, 100 is better
Data preparation is crucial : garbage in -> garbage out
Stay rigorous in the modeling and evaluation phases : overfitting (train, test, cross-validation), models can fail
Use web development best practices : continuous integration, deployment, evaluation, monitoring, packaging
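Cross-validation, mentioned above as a guard against overfitting, can be sketched as a k-fold index split: every sample is used for testing exactly once and for training in the other folds. The `k_fold_indices` helper below is a hypothetical name written for this example, and it assumes the data has already been shuffled.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Training and scoring a model once per fold, then averaging the scores, gives a more stable estimate than a single train/test split.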
IN PRACTICE, DIFFERENT LEVELS OF ABSTRACTION
Dev/lib (R, Python scikit-learn, Spark) < generic (MLaaS : BigML, AWS) < problem-specific and / or dedicated software
Prefer a data-driven approach to a model-driven one : adding new features, adding more input data, trying different models as-is and searching combinations of parameters automatically gives a better ROI than creating and hand-tuning models without an automated parameter search
Data Mining / Machine Learning - How
EXAMPLE OF MACHINE LEARNING CANVAS ~ BUSINESS MODEL CANVAS
https://github.com/louisdorard/machinelearningcanvas
Data Mining / Machine Learning – How
DATA MODELING DEV/LIB LEVEL MODE (SEE LINKS IN LAST SLIDE)
DATA MODELING GENERIC LEVEL MODE : 1-CLICK (AND SOME OPTIONS)
Data Mining / Machine Learning - How
EVALUATION (TRAIN, TEST) WITH CONFUSION MATRIX :
Recall -> quantity of results, TP / (TP + FN) : False Negative = 0 -> recall 100%
Precision -> quality of results, TP / (TP + FP) : False Positive = 0 -> precision 100%
Other metric : TP x costTP + TN x costTN + FP x costFP + FN x costFN = value of the model
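The three metrics above can be computed directly from the confusion matrix counts. The sketch below assumes a binary problem with labels 1 (positive) and 0 (negative); the label lists and the per-outcome costs are invented for illustration, and the helper names are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 1:
            fp += 1
        elif a == 1 and p == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def recall(tp, fn):
    """Share of actual positives that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision(tp, fp):
    """Share of predicted positives that were correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def model_value(tp, tn, fp, fn, cost_tp, cost_tn, cost_fp, cost_fn):
    """Cost-weighted value: TP x costTP + TN x costTN + FP x costFP + FN x costFN."""
    return tp * cost_tp + tn * cost_tn + fp * cost_fp + fn * cost_fn

# Hypothetical test-set labels and predictions.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(actual, predicted)
r = recall(tp, fn)
p = precision(tp, fp)
# Assumed costs: a true positive earns 10, a false positive costs 2, a false negative costs 5.
value = model_value(tp, tn, fp, fn, cost_tp=10, cost_tn=0, cost_fp=-2, cost_fn=-5)

print(f"TP={tp} FP={fp} TN={tn} FN={fn} recall={r:.2f} precision={p:.2f} value={value}")
```

The cost-weighted value makes two models comparable in business terms even when one has better recall and the other better precision.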
Data Mining / Machine Learning - How
ACTORS ON THE MARKET : LIBS, GENERIC, PROBLEM SPECIFIC