BIG DATA & DATA SCIENCE
HOW TO BECOME A DATA SCIENTIST
FIRST TIPS FOR STUDENTS AND BEGINNERS
Paolo Pellegrini, Senior Consultant
June 2016
AGENDA
Why become a Data Scientist?
Who is a Data Scientist?
How is a Data Scientist selected?
What is an algorithm?
An example of a Data Science problem
Some Data Science tools
Where to keep up to date and train?
56% OF ITALIAN COMPANIES NAME BIG DATA AND DATA SCIENCE AS A PRIMARY AREA OF STRATEGIC DEVELOPMENT FOR 2016/17
Area of strategic development: share of companies
Smart Manufacturing: 3%
Internet of Things: 5%
Smart Working: 6%
Web and social commercial projects: 7%
Cyber Security: 10%
Compliance and Risk Management: 10%
Collaboration: 17%
Storage and virtualization: 17%
Mobile and eCommerce: 18%
Data Center: 18%
Mobile Marketing and CRM: 25%
Public and private cloud: 25%
Application consolidation: 31%
CRM systems: 31%
Mobile devices and mobile apps: 40%
ERP systems: 48%
Dematerialization: 53%
Big Data and Analytics: 56%
YEARS AGO, HARVARD ALREADY CALLED IT THE SEXIEST JOB OF OUR CENTURY… AND IT IS ALSO WELL PAID!
Google Trends: «Data Scientist»
Average salary: $123,000
A DATA SCIENTIST CAN DO ANYTHING!
Nate Silver is the person who changed the concept of "psephology" by using Big Data & Data Science to predict the results of American elections.
Today he is one of the most famous Data Scientists in the world.
A DATA SCIENTIST IS A STRONGLY INTERDISCIPLINARY FIGURE WHO COMBINES STATISTICS, PROGRAMMING AND BUSINESS LOGIC
«On any given day a team member might author a
multistage processing pipeline in Python, design a
hypothesis test, perform a regression analysis over
data samples with R, design and implement an
algorithm for some data-intensive product or service
in Hadoop, or communicate the results of an
analysis to other members of the organization in a
clear and concise fashion»
2009 – Jeff Hammerbacher | Data Scientist @ Facebook
ADOPTION AROUND THE WORLD KEEPS GROWING, WITH ITALY AMONG THE LEADERS
Source: RJMetrics, based on 11,400 Data Scientist profiles on LinkedIn
THE SECTORS THAT EMPLOY THE MOST DATA SCIENTISTS ARE THE MOST IT-ORIENTED ONES, BUT ADOPTION IS INCREASINGLY WIDESPREAD
THE MOST COMMON SKILLS CONCERN LANGUAGES AND TOOLS SUCH AS «R» AND «PYTHON»
MODELING AND PROGRAMMING SKILLS ARE ESSENTIAL FOR A JUNIOR HIRE
A DATA SCIENTIST CAN COME FROM ANY KIND OF BACKGROUND: ALL THAT COUNTS IS THE DESIRE AND APTITUDE TO WORK WITH DATA
SOON EVERY COMPANY WILL HAVE A DATA SCIENTIST
[Chart: presence of the Data Scientist role in companies. Legend: Present, with a well-defined role; Present, but without a well-defined role; Introduction planned for 2016; Possible introduction in the future. Values shown: 2%, 11%, 14%, 73%.]
EVALUATION PROCESS
Job posting
Test development
CV selection
Test assignment
Test evaluation
Technical interview
Data Scientist evaluation
EVALUATION PROCESS: JUNIOR DATA SCIENTIST SELECTION
1) You have two tables in an existing RDBMS. One contains information about the products you sell (name, size, color, etc.). The other contains images of the products in JPEG format. These tables are frequently joined in queries to your database. You would like to move this data into HBase. What is the most efficient schema design for this scenario?
• Create a single table, with two column families
• Create a single table, with one column family
• Create two tables, with one column family
2) A sandwich shop studies the number of men and women that enter the shop during the lunch hour, from noon to 1pm each day. They find that the number of men that enter can be modeled as a random variable with distribution Poisson(M), and likewise the number of women that enter as Poisson(W). What is likely to be the best model of the total number of customers that enter during the lunch hour?
• Poisson(M+W)
• Poisson(M/W)
• Poisson(M*W)
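One way to reason about the second question is to simulate it. The sketch below assumes that the two arrival counts are independent (the question does not say so explicitly) and uses hypothetical rates M = 12 and W = 8. It checks that the simulated daily total has mean and variance both close to M + W, which is the signature of a Poisson distribution with rate M + W.

```python
import math
import random

def sample_poisson(rate, rng):
    """Draw one Poisson(rate) value with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def simulate_totals(m, w, days, seed=0):
    """Simulate the daily total of two independent Poisson arrival counts."""
    rng = random.Random(seed)
    return [sample_poisson(m, rng) + sample_poisson(w, rng) for _ in range(days)]

def mean_and_variance(values):
    """Return the mean and the population variance of a list of numbers."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance

# Hypothetical rates: on average 12 men and 8 women per lunch hour.
totals = simulate_totals(m=12, w=8, days=20000)
mean, variance = mean_and_variance(totals)
print(f"mean={mean:.2f} variance={variance:.2f} (a Poisson(20) has both equal to 20)")
```

A simulation only suggests the answer; the exact result follows from the fact that the sum of independent Poisson variables is again Poisson, with the rates added.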
EVALUATION PROCESS: SENIOR DATA SCIENTIST SELECTION
A data set is delivered, by e-mail or through platforms such as University2Business, and candidates must analyze it in order to develop a predictive model.
General questions on model building follow, or detailed discussion of what was done in the test. For example:
• Data cleaning
• Model building
• Algorithm development
• …
EVALUATION PROCESS: DATA SCIENTIST EVALUATION
[Radar chart with axes: Socio-economic, Statistics, Business, Math, Computer Science, Soft (role specific), Soft (business generic)]
DESCRIPTION OF THE SKILLS

TECHNICAL SKILLS
• Socio-economic: ability to read the social context and how it affects the economic context
• Business (sector): knowledge of processes and of the market, and anticipation of the impact of exogenous variables on the specific sector
• Math: ability to systematize reality through classifications and models that take into account the interactions among the elements
• Computer science: ability to process information by developing automated procedures (e.g. algorithms), with HW/SW support
• Statistics: ability to draw logical deductions and extract knowledge from the study of a particular non-deterministic phenomenon

SOFT SKILLS, ROLE SPECIFIC
• Hacking: ability to use creativity and imagination in the search for knowledge
• Storytelling: inventiveness in creating scenarios to explore, and ability to place information within a framework that eases its transmission and understanding by outsiders, also through synthesis and presentation skills
• Ethics: ability to use data conscientiously, also when in possession of sensitive data

SOFT SKILLS, BUSINESS GENERIC
• Management: ability to lead and coordinate a group of people and to make decisions that ensure business results are achieved
• Teamwork: ability to work in a group, dividing roles and pooling skills in order to reach a common goal
• Coaching/Mentoring: ability to train less experienced people in order to improve their potential, starting from the uniqueness of the individual
• Interpersonal relations: ability to relate to other people in the appropriate way depending on status, hierarchical relationships, circumstances, etc.
TYPICAL PROFILES
[Four radar charts, one per profile: Junior Data Scientist, Senior Data Scientist, Chief Data Scientist, Business Manager. Axes: Socio-Economic, Statistics, Business, Math, Computer Science, Soft Role Specific, Soft Business Generic]
FROM DATA TO ALGORITHM
“Data is inherently dumb - Algorithms are where the real value lies. Algorithms define action”
Peter Sondergaard, Senior Vice President, Gartner Research
An algorithm is a self-contained, step-by-step set of operations to be performed.
[Figure: a graphical expression of Euclid's algorithm to find the greatest common divisor of 1599 and 650]
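The Euclid example from the figure can be written out as a few lines of code. This is a minimal sketch that records each division step, so the "step-by-step set of operations" is visible; the function name `gcd_steps` is chosen here for illustration only.

```python
def gcd_steps(a, b):
    """Euclid's algorithm; returns the gcd and the list of division steps."""
    steps = []
    while b != 0:
        quotient, remainder = divmod(a, b)
        steps.append((a, b, quotient, remainder))
        # The gcd of (a, b) equals the gcd of (b, remainder).
        a, b = b, remainder
    return a, steps

gcd, steps = gcd_steps(1599, 650)
for a, b, q, r in steps:
    print(f"{a} = {q} * {b} + {r}")
print("gcd =", gcd)  # prints "gcd = 13"
```

The loop always terminates because the remainder shrinks at every step and eventually reaches zero.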
HOW ALGORITHMS SUPPORT THE BUSINESS
INFORMATION → INSIGHTS → DECISION → ACTION
• DESCRIPTIVE: What happened?
• DIAGNOSTIC: Why did it happen?
• PREDICTIVE: What will happen?
• PRESCRIPTIVE: How to react to recent events?
• PREEMPTIVE: How to avoid bad events?
[Diagram labels: DATA-DRIVEN STRATEGY, Decisional Support, OPTIMIZATION STRATEGY, ANALYTICS STRATEGY, OLD-STYLE STRATEGY]
HOW «MACHINE LEARNING» IS DEFINED
• Machine learning is a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence
• In 1959, Arthur Samuel defined machine learning as a "field of study that gives computers the ability to learn without being explicitly programmed"
• Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from an example training set of input observations in order to make data-driven predictions or decisions expressed as outputs, rather than following strictly static program instructions
• Machine learning is closely related to (and often overlaps with) computational statistics, a discipline which also focuses on prediction-making through the use of computers. It has strong ties to mathematical optimization, which delivers methods, theory and application domains to the field
SOME EXAMPLES OF ALGORITHMS
• C4.5 - Constructs a classifier in the form of a decision tree. In order to do this, C4.5 is given a set of data representing things that are already classified. This is supervised learning, since the training dataset is labeled with classes
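C4.5 chooses which attribute to split on by comparing gain ratios (information gain divided by split information). The sketch below shows only that criterion on a hypothetical toy dataset; it is not a full C4.5 implementation (no recursion, pruning or handling of numeric attributes), and the function and variable names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain_ratio(feature_values, labels):
    """Information gain of splitting on a categorical feature, divided by split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the labels that remains after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(feature_values)
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical labeled training data: does a customer buy?
weather = ["sun", "sun", "rain", "rain", "sun", "rain"]
day = ["week", "weekend", "week", "weekend", "week", "weekend"]
buys = ["yes", "yes", "no", "no", "yes", "no"]

print("weather:", round(gain_ratio(weather, buys), 3))
print("day:", round(gain_ratio(day, buys), 3))
```

In this toy data the weather attribute separates the classes perfectly, so a tree builder using this criterion would split on it first.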
• k-means - Creates k groups from a set of objects so that the members of a group are more similar to each other. It is a popular cluster analysis technique for exploring a dataset. Most would classify k-means as unsupervised: other than specifying the number of clusters, k-means “learns” the clusters on its own, without any information about which cluster an observation belongs to. k-means can also be semi-supervised
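As an illustration of how k-means "learns" the clusters on its own, here is a minimal sketch of the standard Lloyd iteration (assign each point to the nearest centroid, then move each centroid to the mean of its members). The 2-D points are hypothetical, and the initialization by random sampling is one simple choice among several.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm on 2-D points; returns (centroids, cluster index per point)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, assignment

# Hypothetical data: two visibly separated groups, no labels given.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignment = kmeans(points, k=2)
print(assignment)
```

Only the number of clusters is supplied; which index each group receives depends on the random initialization.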
• Support vector machines - An SVM learns a hyperplane to classify data into 2 classes. At a high level, an SVM performs a task similar to C4.5, except that it doesn’t use decision trees at all. This is supervised learning, since a dataset is used to first teach the SVM about the classes
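To make the idea of "learning a hyperplane" concrete, the sketch below trains a linear SVM by stochastic subgradient descent on the regularized hinge loss. This is one simple training method chosen for illustration; dedicated libraries use more sophisticated solvers. The data, learning rate, regularization strength and function names are all assumptions.

```python
import random

def train_linear_svm(samples, labels, epochs=50, learning_rate=0.01, reg=0.01, seed=0):
    """Fit w, b for the hyperplane w.x + b = 0 by subgradient descent on the hinge loss."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    b = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = samples[i], labels[i]  # y must be +1 or -1
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # The regularization term shrinks w slightly at every step.
            w = [wj - learning_rate * reg * wj for wj in w]
            if margin < 1:
                # Points inside the margin or misclassified push the hyperplane away.
                w = [wj + learning_rate * y * xj for wj, xj in zip(w, x)]
                b += learning_rate * y
    return w, b

def predict(w, b, x):
    """Classify x by the side of the hyperplane it falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical labeled training set with two linearly separable classes.
samples = [(-2.0, -1.0), (-1.0, -2.0), (-2.0, -2.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
labels = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(samples, labels)
print(predict(w, b, (3.0, 3.0)), predict(w, b, (-3.0, -3.0)))
```

The labeled training set is what makes this supervised: the hyperplane is adjusted only in response to points whose class is already known.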
• Naive Bayes - it is not a single algorithm, but a family of classification algorithms that
share one common assumption: every feature of the data being classified is independent
of all other features given the class. This is supervised learning, since Naive Bayes is
provided a labeled training dataset in order to construct the tables
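The "tables" mentioned above are frequency tables of feature values per class. A minimal sketch for categorical features follows; the add-one (Laplace) smoothing, the toy data and all names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Build the class counts and per-feature frequency tables from labeled rows."""
    class_counts = Counter(labels)
    tables = defaultdict(Counter)  # tables[(feature_index, label)][value] = count
    values_seen = defaultdict(set)  # distinct values observed for each feature
    for row, label in zip(rows, labels):
        for index, value in enumerate(row):
            tables[(index, label)][value] += 1
            values_seen[index].add(value)
    return class_counts, tables, values_seen

def classify(model, row):
    """Return the class with the highest log-posterior, using add-one smoothing."""
    class_counts, tables, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        # Independence assumption: per-feature likelihoods are simply multiplied.
        for index, value in enumerate(row):
            numerator = tables[(index, label)][value] + 1
            denominator = count + len(values_seen[index])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled rows: (time of day, area) with an incident type as class.
rows = [("night", "downtown"), ("night", "downtown"), ("night", "suburb"),
        ("day", "suburb"), ("day", "suburb"), ("day", "downtown")]
labels = ["theft", "theft", "theft", "fraud", "fraud", "fraud"]
model = train_naive_bayes(rows, labels)
print(classify(model, ("night", "downtown")), classify(model, ("day", "suburb")))
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow when there are many features.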
• PCA - Principal component analysis uses an orthogonal transformation to convert a set of
observations of possibly correlated variables into a set of values of linearly uncorrelated
variables called principal components. The number of principal components is less than or
equal to the number of original variables. The first principal component has the largest
possible variance, and each succeeding component in turn has the highest variance
possible under the constraint that it is orthogonal to the preceding components. The
resulting vectors are an uncorrelated orthogonal basis set. The principal components are
orthogonal because they are the eigenvectors of the covariance matrix, which is
symmetric. PCA is sensitive to the relative scaling of the original variables. This is
unsupervised learning
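The link between principal components and the eigenvectors of the covariance matrix can be sketched in a few lines. The example below computes only the first principal component, using power iteration on the sample covariance matrix; the data points are hypothetical, and a real analysis would normally use a linear algebra library and standardize the variables first, since PCA is sensitive to scaling.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of rows (a list of equal-length numeric tuples)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_principal_component(cov, iterations=200):
    """Dominant eigenvector and eigenvalue of a symmetric matrix by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The Rayleigh quotient gives the variance captured along v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, eigenvalue

# Hypothetical data: two strongly correlated variables.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
cov = covariance_matrix(data)
component, variance = first_principal_component(cov)
explained = variance / (cov[0][0] + cov[1][1])
print(component, round(explained, 3))
```

Because the two variables move almost together, nearly all of the variance lies along a single direction close to the diagonal.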
LOGICAL PROCESS FOR USING ALGORITHMS
1. Receipt of the dataset
2. Exploratory data analysis
3. Data cleaning
4. Use of algorithms to find the most predictive variables
5. Construction of the logical model
   • Random Forest
   • Decision Tree
   • SVM
   • …
6. Testing & Tuning
7. Development of an ad-hoc algorithm to support the business
DATA SCIENTIST: MAGICIAN OR SUPERHERO?
Can a Data Scientist predict crimes in San Francisco?
Can a Data Scientist help the city become safer?
THE SAN FRANCISCO CHALLENGE
From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz.
Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay.
• The dataset provided covers about 12 years of incidents from across all of San Francisco's neighborhoods, from 1/1/2003 to 13/05/2015.
• The dataset has been divided into two parts: a training set, to be used for model development, and a test set, used to verify the predictive algorithm.
Given time and location, you must predict the category of crime that occurred.
TRAINING SET STRUCTURE
878,049 INCIDENTS WITH 39 CATEGORIES OF CRIME
For every incident the following are provided:
• Date and time
• Category
• Description
• Day of week
• Pd District
• Resolution
• Address
• Latitude
• Longitude
AN EXAMPLE OF DATA VISUALIZATION
Q1 – HOW TO HANDLE THE DATASET?
• Datasets are handled as CSV files, or as JSON. No XLS!
• 800,000 records is Big Data!
• You can use only variables that are known when the model is applied
Variables: Date and time, Day of week, Pd District, Address, Latitude, Longitude
Variables not to be included: Description, Resolution
Target: Category
STEP 1: DATA CLEANSING
Prepare the dataset so that every valid variable can be used by a predictive model:
• Generate an ID for every record
• Verify the structure of every variable and search for data that needs to be cleaned up (e.g. empty records, double spaces, etc.)
• Split “Date” (13/05/2015 23:53:00) into single variables (Month, Year, Hour)
• Merge “Latitude” and “Longitude” to verify the presence of unique places
Verify the distribution of every variable to find non-normal distributions or other kinds of problems to be fixed
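The cleansing steps above can be sketched as follows. The field names (`Date`, `Address`, `Latitude`, `Longitude`), the sample rows and the decision to drop records with an empty address are assumptions made for illustration; the actual files may use different column names and call for different handling of incomplete records.

```python
from datetime import datetime

# Hypothetical sample rows, as they might be read from a CSV file.
raw_rows = [
    {"Date": "13/05/2015 23:53:00", "Address": "OAK ST  / LAGUNA ST",
     "Latitude": "37.7745986", "Longitude": "-122.4258917"},
    {"Date": "13/05/2015 23:30:00", "Address": "   ",
     "Latitude": "37.7745986", "Longitude": "-122.4258917"},
    {"Date": "01/01/2003 00:01:00", "Address": "OAK ST / LAGUNA ST",
     "Latitude": "37.7745986", "Longitude": "-122.4258917"},
]

def clean_rows(rows):
    """Assign IDs, tidy whitespace, split the date and build a place key."""
    cleaned = []
    for record_id, row in enumerate(rows, start=1):
        address = " ".join(row["Address"].split())  # collapses double spaces
        if not address:
            continue  # assumed policy: drop records with an empty address
        moment = datetime.strptime(row["Date"], "%d/%m/%Y %H:%M:%S")
        cleaned.append({
            "Id": record_id,
            "Year": moment.year,
            "Month": moment.month,
            "Hour": moment.hour,
            "Address": address,
            # Latitude and longitude merged into one key that identifies a place.
            "Place": (round(float(row["Latitude"]), 4), round(float(row["Longitude"]), 4)),
        })
    return cleaned

cleaned = clean_rows(raw_rows)
unique_places = {row["Place"] for row in cleaned}
print(len(cleaned), "records,", len(unique_places), "unique place(s)")
```

Checking the distribution of each variable is left out of this sketch; a frequency count per column would be a natural next step.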
STEP 2: LAUNCH THE FIRST EXPLORATORY MODEL
You can use professional (and free) tools like RapidMiner, Weka, KNIME, etc.
• Tool: IBM Watson
• Algorithm: Decision Tree CHAID
• Predictive Strength: 17%, so fewer than 1 in 4 crimes has its category correctly predicted
[Figures: Top Predictors; Decision Tree for Arson]
Q1 – HOW TO HANDLE THE DATASET?
• Imagine how to transform the Date variable
• Imagine how to transform the Time variable
• Imagine how to transform the Address variable
• Imagine other external information to be included in the model
• Try to select sources from which to import this information
Possible derived variables:
• Date: Weekend&HolidayDummy [using historical calendar holidays]
• Time: NightDummy, Cold/RainDummy, HotDummy, WorkingTimeDummy [using weather, sunrise/sunset, etc. information]
• Address: StreetType [managing strings]
• Other: UnemploymentRateByMonth, VisitorsRateByMonth, PopulationDensityByDistrict, HouseCostByDistrict, EducationLevelByDistrict [desk analysis]
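A few of these derived variables need nothing beyond the timestamp and the address string, so they can be sketched directly. In the example below, the night window (22:00 to 06:00), the office hours (09:00 to 18:00 on weekdays), the address formats and the `INTERSECTION` label are assumptions; holiday, weather and socio-economic variables are left out because they depend on external sources.

```python
from datetime import datetime

def engineer_features(timestamp, address):
    """Derive calendar, time-of-day and street-type features from raw fields."""
    moment = datetime.strptime(timestamp, "%d/%m/%Y %H:%M:%S")
    hour = moment.hour
    is_weekend = moment.weekday() >= 5  # Monday is 0, so Saturday is 5 and Sunday is 6
    words = address.upper().split()
    if "/" in words:
        street_type = "INTERSECTION"  # assumed format: "STREET A / STREET B"
    else:
        street_type = words[-1] if words else "UNKNOWN"  # e.g. ST, AV
    return {
        "WeekendDummy": int(is_weekend),
        "NightDummy": int(hour >= 22 or hour < 6),  # assumed night window
        "WorkingTimeDummy": int(not is_weekend and 9 <= hour < 18),  # assumed office hours
        "StreetType": street_type,
    }

# Hypothetical inputs in the date format shown in the cleansing step.
late_wednesday = engineer_features("13/05/2015 23:53:00", "OAK ST / LAGUNA ST")
saturday_morning = engineer_features("16/05/2015 10:00:00", "800 Block of BRYANT ST")
print(late_wednesday)
print(saturday_morning)
```

Each dummy is a 0/1 column, which lets tree-based tools such as the CHAID model used in this example split on it directly.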
STEP 3: LAUNCH THE FINAL EXPLORATORY MODEL
• Tool: IBM Watson
• Algorithm: Decision Tree CHAID
• Predictive Strength: 32%, so about 1 in 3 crimes has its category correctly predicted
[Figure: Top Predictors]
BIG DATA = A HUGE NUMBER OF TECHNOLOGIES
CHOOSING A DATA SCIENCE TOOL
Languages
• R
• Python
Other Tools
• HP Vertica
• Weka
• Tableau
• Neo4j
[Figure: Gartner Magic Quadrant 2016. Axes: Completeness of Vision, Ability to Execute. Quadrants: Challengers, Leaders, Niche Players, Visionaries. Vendors shown: SAS, IBM, KNIME, RapidMiner, Microsoft, Alteryx, Predixion Software, Alpine Data, FICO, Lavastorm, Megaputer, Prognoz, Accenture, Dell, SAP, Angoss. Legend: Pay, Free]
RAPIDMINER: A VISUAL LABORATORY
IBM WATSON: ONE OF THE MOST FAMOUS TOOLS, AT LESS THAN €50/MONTH PER USER
MOOC


Sarah Lahm In Media Res Media Component
Sarah Lahm  In Media Res Media ComponentSarah Lahm  In Media Res Media Component
Sarah Lahm In Media Res Media Component
 
Transdisciplinary Pathways for Urban Resilience [Work in Progress].pptx
Transdisciplinary Pathways for Urban Resilience [Work in Progress].pptxTransdisciplinary Pathways for Urban Resilience [Work in Progress].pptx
Transdisciplinary Pathways for Urban Resilience [Work in Progress].pptx
 
PART 1 - CHAPTER 1 - CELL THE FUNDAMENTAL UNIT OF LIFE
PART 1 - CHAPTER 1 - CELL THE FUNDAMENTAL UNIT OF LIFEPART 1 - CHAPTER 1 - CELL THE FUNDAMENTAL UNIT OF LIFE
PART 1 - CHAPTER 1 - CELL THE FUNDAMENTAL UNIT OF LIFE
 
Farrington HS Streamlines Guest Entrance
Farrington HS Streamlines Guest EntranceFarrington HS Streamlines Guest Entrance
Farrington HS Streamlines Guest Entrance
 
BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...
BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...
BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...
 
16. Discovery, function and commercial uses of different PGRS.pptx
16. Discovery, function and commercial uses of different PGRS.pptx16. Discovery, function and commercial uses of different PGRS.pptx
16. Discovery, function and commercial uses of different PGRS.pptx
 
6 ways Samsung’s Interactive Display powered by Android changes the classroom
6 ways Samsung’s Interactive Display powered by Android changes the classroom6 ways Samsung’s Interactive Display powered by Android changes the classroom
6 ways Samsung’s Interactive Display powered by Android changes the classroom
 
Sulphonamides, mechanisms and their uses
Sulphonamides, mechanisms and their usesSulphonamides, mechanisms and their uses
Sulphonamides, mechanisms and their uses
 
Introduction to Research ,Need for research, Need for design of Experiments, ...
Introduction to Research ,Need for research, Need for design of Experiments, ...Introduction to Research ,Need for research, Need for design of Experiments, ...
Introduction to Research ,Need for research, Need for design of Experiments, ...
 
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQ-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
 
4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptx4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptx
 
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptxCLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
 
Healthy Minds, Flourishing Lives: A Philosophical Approach to Mental Health a...
Healthy Minds, Flourishing Lives: A Philosophical Approach to Mental Health a...Healthy Minds, Flourishing Lives: A Philosophical Approach to Mental Health a...
Healthy Minds, Flourishing Lives: A Philosophical Approach to Mental Health a...
 
The Emergence of Legislative Behavior in the Colombian Congress
The Emergence of Legislative Behavior in the Colombian CongressThe Emergence of Legislative Behavior in the Colombian Congress
The Emergence of Legislative Behavior in the Colombian Congress
 
DBMSArchitecture_QueryProcessingandOptimization.pdf
DBMSArchitecture_QueryProcessingandOptimization.pdfDBMSArchitecture_QueryProcessingandOptimization.pdf
DBMSArchitecture_QueryProcessingandOptimization.pdf
 
BÀI TẬP BỔ TRỢ 4 KĨ NĂNG TIẾNG ANH LỚP 11 (CẢ NĂM) - FRIENDS GLOBAL - NĂM HỌC...
BÀI TẬP BỔ TRỢ 4 KĨ NĂNG TIẾNG ANH LỚP 11 (CẢ NĂM) - FRIENDS GLOBAL - NĂM HỌC...BÀI TẬP BỔ TRỢ 4 KĨ NĂNG TIẾNG ANH LỚP 11 (CẢ NĂM) - FRIENDS GLOBAL - NĂM HỌC...
BÀI TẬP BỔ TRỢ 4 KĨ NĂNG TIẾNG ANH LỚP 11 (CẢ NĂM) - FRIENDS GLOBAL - NĂM HỌC...
 
Scientific Writing :Research Discourse
Scientific  Writing :Research  DiscourseScientific  Writing :Research  Discourse
Scientific Writing :Research Discourse
 
How to Manage Buy 3 Get 1 Free in Odoo 17
How to Manage Buy 3 Get 1 Free in Odoo 17How to Manage Buy 3 Get 1 Free in Odoo 17
How to Manage Buy 3 Get 1 Free in Odoo 17
 
Teaching Critical AI Literacies - Maha Bali
Teaching Critical AI Literacies - Maha BaliTeaching Critical AI Literacies - Maha Bali
Teaching Critical AI Literacies - Maha Bali
 
DiskStorage_BasicFileStructuresandHashing.pdf
DiskStorage_BasicFileStructuresandHashing.pdfDiskStorage_BasicFileStructuresandHashing.pdf
DiskStorage_BasicFileStructuresandHashing.pdf
 

How to become a data scientist - Paolo Pellegrini

  • 1. BIG DATA & DATA SCIENCE: HOW TO BECOME A DATA SCIENTIST. FIRST TIPS FOR STUDENTS AND BEGINNERS. Paolo Pellegrini, Senior Consultant, June 2016
  • 2. AGENDA: Why become a Data Scientist? Who is a Data Scientist? How is a Data Scientist selected? What is an algorithm? An example of a Data Science problem. Some Data Science tools. Where to keep up to date and get trained?
  • 3. 56% OF ITALIAN COMPANIES NAME BIG DATA AND DATA SCIENCE AS THEIR PRIMARY STRATEGIC DEVELOPMENT FOR 2016/17. [Bar chart] Smart Manufacturing 3%; Internet of Things 5%; Smart Working 6%; Web and social commercial projects 7%; Cyber Security 10%; Compliance and Risk Management 10%; Collaboration 17%; Storage and virtualization 17%; Mobile and eCommerce 18%; Data Center 18%; Mobile Marketing and CRM 25%; Public and private cloud 25%; Application consolidation 31%; CRM systems 31%; Mobile devices and mobile apps 40%; ERP systems 48%; Dematerialization 53%; Big Data and Analytics 56%
  • 4. HARVARD, YEARS AGO, ALREADY CALLED IT THE SEXIEST JOB OF OUR CENTURY... AND IT IS ALSO WELL PAID! [Figures: Google Trends for «Data Scientist»; average salary $123,000]
  • 5. AGENDA: Why become a Data Scientist? Who is a Data Scientist? How is a Data Scientist selected? What is an algorithm? An example of a Data Science problem. Some Data Science tools. Where to keep up to date and get trained?
  • 6. A DATA SCIENTIST CAN DO ANYTHING! Nate Silver is the person who changed the concept of “psephology”, using Big Data & Data Science to predict the results of American elections. Today he is one of the most famous Data Scientists in the world
  • 7. A DATA SCIENTIST IS A STRONGLY INTERDISCIPLINARY FIGURE WHO COMBINES STATISTICS, PROGRAMMING AND BUSINESS LOGIC. «On any given day a team member might author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of an analysis to other members of the organization in a clear and concise fashion» (Jeff Hammerbacher, Data Scientist @ Facebook, 2009)
  • 8. ADOPTION AROUND THE WORLD KEEPS GROWING, AND ITALY IS A LEADING PLAYER. [Chart source: RJ Metrics, on 11,400 Data Scientist profiles on LinkedIn]
  • 9. THE SECTORS THAT EMPLOY THE MOST DATA SCIENTISTS ARE THE MOST IT-ORIENTED ONES, BUT ADOPTION IS INCREASINGLY WIDESPREAD. [Chart source: RJ Metrics, on 11,400 Data Scientist profiles on LinkedIn]
  • 10. THE MOST COMMON SKILLS CONCERN LANGUAGES AND TOOLS SUCH AS «R» AND «PYTHON». [Chart source: RJ Metrics, on 11,400 Data Scientist profiles on LinkedIn]
  • 11. MODELLING AND PROGRAMMING SKILLS ARE FUNDAMENTAL FOR A JUNIOR HIRE. [Chart source: RJ Metrics, on 11,400 Data Scientist profiles on LinkedIn]
  • 12. A DATA SCIENTIST CAN HAVE ANY KIND OF BACKGROUND: ALL THAT COUNTS IS THE DESIRE AND APTITUDE TO WORK WITH DATA. [Chart source: RJ Metrics, on 11,400 Data Scientist profiles on LinkedIn]
  • 13. SOON EVERY COMPANY WILL HAVE A DATA SCIENTIST. [Chart: presence of a Data Scientist in the company. Categories: present, with a well-defined role; present, but without a well-defined role; introduction planned for 2016; possible introduction in the future. Values shown: 2%, 11%, 14%, 73%]
  • 14. AGENDA: Why become a Data Scientist? Who is a Data Scientist? How is a Data Scientist selected? What is an algorithm? An example of a Data Science problem. Some Data Science tools. Where to keep up to date and get trained?
  • 15. EVALUATION PATH: Job posting → Test development → CV selection → Test assignment → Test evaluation → Technical interview → Data Scientist evaluation
  • 16. EVALUATION PATH: Junior Data Scientist selection. Sample test questions:
      1) You have two tables in an existing RDBMS. One contains information about the products you sell (name, size, color, etc.). The other contains images of the products in JPEG format. These tables are frequently joined in queries to your database. You would like to move this data into HBase. What is the most efficient schema design for this scenario?
         • Create a single table with two column families
         • Create a single table with one column family
         • Create two tables, each with one column family
      2) A sandwich shop studies the number of men and women that enter the shop during the lunch hour, from noon to 1pm each day. They find that the number of men that enter can be modeled as a random variable with distribution Poisson(M), and likewise the number of women that enter as Poisson(W). What is likely to be the best model of the total number of customers that enter during the lunch hour?
         • Poisson(M+W)
         • Poisson(M/W)
         • Poisson(M*W)
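For the second sample question, a standard result is that the sum of two independent Poisson variables is again Poisson, with the rates added. So Poisson(M+W) is the expected answer, provided the two arrival counts are independent (an assumption the question does not state explicitly). The sketch below checks this numerically by convolving the two probability mass functions; the rates M = 3 and W = 5 are made up for illustration.

```python
import math

def poisson_pmf(k, rate):
    """P(X = k) for X ~ Poisson(rate)."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

def pmf_of_sum(k, rate_a, rate_b):
    """P(A + B = k) for independent A ~ Poisson(rate_a), B ~ Poisson(rate_b)."""
    return sum(poisson_pmf(i, rate_a) * poisson_pmf(k - i, rate_b)
               for i in range(k + 1))

M, W = 3.0, 5.0  # illustrative rates, not taken from the slides

# Compare the convolution with the Poisson(M + W) pmf for k = 0..29
max_gap = max(abs(pmf_of_sum(k, M, W) - poisson_pmf(k, M + W))
              for k in range(30))
print(f"largest difference over k = 0..29: {max_gap:.2e}")
```

The two distributions agree up to floating-point rounding, which is what the closed-form result predicts.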
  • 17. EVALUATION PATH: Senior Data Scientist selection. A data set is delivered, by email or through platforms such as University2Business, which candidates must analyze in order to develop a predictive model
  • 18. EVALUATION PATH: Senior Data Scientist selection. General questions follow on how models are built, or detailed discussion of what was done in the test. For example: • Data cleaning • Model building • Algorithm development • …
  • 19. EVALUATION PATH: Data Scientist evaluation. [Radar chart axes: Socio-economic, Statistics, Business, Math, Soft (role specific), Soft (business generic), Computer Science]
  • 20. DESCRIPTION OF THE COMPETENCIES
      TECHNICAL SKILLS
      • Socio-economic: ability to read the social context and how it affects the economic context
      • Sector-specific: knowledge of processes and the market, and anticipation of the impact of exogenous variables on the specific sector
      • Mathematical: ability to systematize reality through classifications and models that take into account the interactions between elements
      • Computer science: ability to process information by developing automated procedures (e.g. algorithms) with HW/SW support
      • Statistical: ability to draw logical inferences and extract knowledge from the study of a particular non-deterministic phenomenon
      SOFT, ROLE SPECIFIC
      • Hacking: ability to use creativity and imagination in the search for knowledge
      • Storytelling: inventiveness in creating scenarios to explore, and ability to place information within a framework that makes it easier to convey and understand externally, including through synthesis and presentation skills
      • Ethics: ability to use data conscientiously, including when holding sensitive data
      SOFT, BUSINESS GENERIC
      • Management: ability to lead and coordinate a group of people, and to take decisions that ensure business results are achieved
      • Teamwork: ability to work in a group, dividing roles and pooling skills, in order to reach a common goal
      • Coaching/Mentoring: ability to train less experienced people in order to improve their potential, starting from the uniqueness of the individual
      • Interpersonal relations: ability to relate to others appropriately according to status, hierarchical relationships, circumstances, etc.
  • 21. TYPICAL PROFILES: Junior Data Scientist, Senior Data Scientist, Chief Data Scientist, Business Manager. [Each profile is shown as a radar chart over: Socio-Economic, Statistics, Business, Math, Soft Role Specific, Soft Business Generic, Computer Science]
  • 22. AGENDA: Why become a Data Scientist? Who is a Data Scientist? How is a Data Scientist selected? What is an algorithm? An example of a Data Science problem. Some Data Science tools. Where to keep up to date and get trained?
  • 23. FROM DATA TO ALGORITHM. “Data is inherently dumb - Algorithms are where the real value lies. Algorithms define action” (Peter Sondergaard, Senior Vice President, Gartner Research). An algorithm is a self-contained, step-by-step set of operations to be performed. [Figure: a graphical expression of Euclid's algorithm to find the greatest common divisor of 1599 and 650]
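The figure's example can be reproduced in a few lines. This is a minimal sketch of Euclid's algorithm in its division form, applied to the same pair (1599, 650); the function name `gcd_steps` is our own choice.

```python
def gcd_steps(a, b):
    """Euclid's algorithm; returns the gcd and the list of division steps."""
    steps = []
    while b != 0:
        q, r = divmod(a, b)
        steps.append((a, b, q, r))  # records a = q * b + r
        a, b = b, r
    return a, steps

g, steps = gcd_steps(1599, 650)
for a, b, q, r in steps:
    print(f"{a} = {q} * {b} + {r}")
print("gcd =", g)
```

Each step replaces the pair (a, b) with (b, a mod b) until the remainder is zero; the last non-zero remainder is the greatest common divisor.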
  • 24. HOW ALGORITHMS SUPPORT BUSINESS. INFORMATION → INSIGHTS → DECISION → ACTION. DESCRIPTIVE: What happened? DIAGNOSTIC: Why did it happen? PREDICTIVE: What will happen? PRESCRIPTIVE: How to react to recent events? PREEMPTIVE: How to avoid bad events? [Diagram labels: data-driven strategy, decision support, optimization strategy, analytics strategy, old-style strategy]
  • 25. HOW «MACHINE LEARNING» IS DEFINED • Machine learning is a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence • In 1959, Arthur Samuel defined machine learning as a "field of study that gives computers the ability to learn without being explicitly programmed" • Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from an example training set of input observations in order to make data-driven predictions or decisions expressed as outputs, rather than following strictly static program instructions • Machine learning is closely related to (and often overlaps with) computational statistics, a discipline which also focuses on prediction-making through the use of computers. It has strong ties to mathematical optimization, which delivers methods, theory and application domains to the field
  • 26. SOME EXAMPLES OF ALGORITHMS • C4.5 - constructs a classifier in the form of a decision tree. To do this, C4.5 is given a set of data representing things that are already classified. This is supervised learning, since the training dataset is labeled with classes
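To give a feel for how such a tree is grown: C4.5 picks, at each node, the attribute with the best gain ratio (information gain divided by split information). The sketch below computes that criterion for categorical attributes only; the tiny dataset and its column names are invented for illustration, and a full C4.5 implementation also handles numeric attributes, missing values and pruning.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain_ratio(rows, attr, target):
    """Information gain of splitting on `attr`, divided by the split information."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    split_info = entropy([r[attr] for r in rows])
    return (base - remainder) / split_info if split_info > 0 else 0.0

# Hypothetical labeled training data
rows = [
    {"night": "yes", "weekend": "yes", "category": "ASSAULT"},
    {"night": "yes", "weekend": "no",  "category": "ASSAULT"},
    {"night": "no",  "weekend": "yes", "category": "THEFT"},
    {"night": "no",  "weekend": "no",  "category": "THEFT"},
]
best = max(["night", "weekend"], key=lambda a: gain_ratio(rows, a, "category"))
print("attribute chosen for the root split:", best)
```

Here `night` separates the two classes perfectly while `weekend` carries no information, so `night` is chosen for the first split.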
  • 27. SOME EXAMPLES OF ALGORITHMS • k-means - creates k groups from a set of objects so that the members of a group are more similar to each other. It is a popular cluster analysis technique for exploring a dataset. Most would classify k-means as unsupervised: other than specifying the number of clusters, k-means “learns” the clusters on its own, without any information about which cluster an observation belongs to. k-means can also be semi-supervised
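A minimal sketch of the usual iterative procedure for k-means (often called Lloyd's algorithm) follows: assign each point to its nearest centroid, then move each centroid to the mean of its points. The points and the hand-picked starting centroids are illustrative assumptions; real implementations normally choose the starting centroids at random or with a seeding scheme.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to centroids and updating centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty)
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
final_centroids, clusters = kmeans(points, [points[0], points[3]])
print(final_centroids)
```

No labels are passed in: the only input besides the data is the number of clusters, here implied by the two starting centroids.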
  • 28. SOME EXAMPLES OF ALGORITHMS • Support vector machines - an SVM learns a hyperplane to classify data into 2 classes. At a high level, an SVM performs a task similar to C4.5, except that it does not use decision trees at all. This is supervised learning, since a dataset is first used to teach the SVM about the classes
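As a rough illustration of "learning a hyperplane", the sketch below trains a linear SVM by sub-gradient descent on the regularised hinge loss. This is one simple way to fit a linear SVM, not necessarily what a production library does; the toy data, learning rate, regularisation strength and epoch count are all assumptions chosen for the example.

```python
def train_linear_svm(xs, ys, lam=0.01, lr=0.1, epochs=50):
    """Sub-gradient descent on the L2-regularised hinge loss; labels are -1 or +1."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularisation term shrinks w a little at every step
            w = [wi - lr * lam * wi for wi in w]
            if margin < 1:
                # Point is inside the margin (or misclassified): push the hyperplane away
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    """Side of the hyperplane w.x + b = 0 on which x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

xs = [(-2.0, -1.0), (-1.0, -2.0), (-2.0, -2.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
ys = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(xs, ys)
print([predict(w, b, x) for x in xs])
```

The learned pair (w, b) defines the separating hyperplane; new points are classified by the sign of w·x + b.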
  • 29. SOME EXAMPLES OF ALGORITHMS • Naive Bayes - not a single algorithm, but a family of classification algorithms that share one common assumption: every feature of the data being classified is independent of all other features, given the class. This is supervised learning, since Naive Bayes is provided with a labeled training dataset in order to construct the tables
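The "tables" are simply counts of how often each feature value occurs within each class. A minimal sketch for categorical features, with add-one (Laplace) smoothing, could look like this; the training rows and labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Build the count tables: class counts and per-feature value counts by class."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, class) -> value counts
    vocab = defaultdict(set)              # feature index -> set of values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab

def predict(model, row):
    """Return the class with the highest log-posterior under the independence assumption."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = math.log(n / total)
        for i, value in enumerate(row):
            count = value_counts[(i, label)][value]
            # Add-one smoothing avoids zero probabilities for unseen combinations
            score += math.log((count + 1) / (n + len(vocab[i])))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical labeled data: (time of day, day type) -> crime category
rows = [("night", "weekend"), ("night", "weekday"), ("day", "weekend"),
        ("day", "weekday"), ("night", "weekend")]
labels = ["ASSAULT", "ASSAULT", "THEFT", "THEFT", "ASSAULT"]
model = train_naive_bayes(rows, labels)
print(predict(model, ("night", "weekday")), predict(model, ("day", "weekend")))
```

Because features are treated as independent given the class, the per-feature probabilities are simply multiplied (added in log space).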
  • 30. SOME EXAMPLES OF ALGORITHMS • PCA - principal component analysis uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. The first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors are an uncorrelated orthogonal basis set. The principal components are orthogonal because they are the eigenvectors of the covariance matrix, which is symmetric. PCA is sensitive to the relative scaling of the original variables. This is unsupervised learning
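For two variables the eigenvectors of the covariance matrix can be written in closed form, which makes the description above easy to check by hand. The sketch below is restricted to the 2-D case for that reason; the sample points are made up, and in practice one would use a linear-algebra library for higher dimensions.

```python
import math

def pca_2d(points):
    """Principal components of 2-D data via the eigenvectors of the covariance matrix."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Sample covariance matrix [[a, b], [b, c]] (symmetric)
    a = sum(x * x for x, _ in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix, largest first
    mid = (a + c) / 2
    spread = math.hypot((a - c) / 2, b)
    eigenvalues = (mid + spread, mid - spread)
    if b == 0:
        # Variables already uncorrelated: the axes are the components
        components = [(1.0, 0.0), (0.0, 1.0)] if a >= c else [(0.0, 1.0), (1.0, 0.0)]
    else:
        components = []
        for lam in eigenvalues:
            norm = math.hypot(b, lam - a)   # (b, lam - a) is an eigenvector for lam
            components.append((b / norm, (lam - a) / norm))
    return eigenvalues, components, (a, c)

points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
eigenvalues, components, variances = pca_2d(points)
print(eigenvalues, components)
```

The eigenvalues are the variances along each component, so their sum equals the total variance of the original variables.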
  • 31. LOGICAL PROCESS FOR USING ALGORITHMS: Dataset received → Exploratory data analysis → Data cleaning → Use of algorithms to find the most predictive variables → Construction of the logical model [algorithms listed: Random Forest, Decision Tree, SVM, …] → Testing & Tuning → Development of an ad-hoc algorithm to support the business
  • 32. AGENDA: Why become a Data Scientist? Who is a Data Scientist? How is a Data Scientist selected? What is an algorithm? An example of a Data Science problem. Some Data Science tools. Where to keep up to date and get trained?
  • 33. DATA SCIENTIST: MAGICIAN OR SUPERHERO? Can a Data Scientist predict crimes in San Francisco? Can a Data Scientist help the city become safer?
  • 34. THE SAN FRANCISCO CHALLENGE. From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz. Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay. • A dataset is provided covering 12 years of incidents from across all of San Francisco's neighborhoods, from 01/01/2003 to 13/05/2015. • The dataset has been divided into two parts: a training set, to be used for model development, and a test set, used to verify the predictive algorithm. Given time and location, you must predict the category of crime that can occur
  • 35. TRAINING SET STRUCTURE: 878,049 INCIDENTS WITH 39 CATEGORIES OF CRIME. For every incident the following are provided: • Date and time • Category • Description • Day of week • PD district • Resolution • Address • Latitude • Longitude
  • 36. AN EXAMPLE OF DATA VISUALIZATION [figure]
  • 37. Q1 – HOW TO WORK WITH THE DATASET? • Datasets are handled as CSV files, or also as JSON. No xls! • 800,000 records is Big Data! • You can use only variables that are known at the time the model is applied. Variables: Date and time, Day of week, PD district, Address, Latitude, Longitude. Variables not to be included: Description, Resolution. Target: Category
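The rule about using only variables known at prediction time can be enforced in code by whitelisting the input columns. The sketch below does that for one record; the column names (`Date`, `DayOfWeek`, `PdDistrict`, ...) and the sample values are hypothetical spellings of the fields listed on the slide, not the actual headers of the dataset.

```python
FEATURES = ["Date", "DayOfWeek", "PdDistrict", "Address", "Latitude", "Longitude"]
EXCLUDED = ["Description", "Resolution"]  # only known after the incident is handled
TARGET = "Category"

def split_features_target(record):
    """Keep only the whitelisted inputs and return them with the target label."""
    features = {name: record[name] for name in FEATURES}
    return features, record[TARGET]

# Hypothetical incident record
record = {
    "Date": "13/05/2015 23:53:00", "Category": "ARSON", "Description": "example text",
    "DayOfWeek": "Wednesday", "PdDistrict": "NORTHERN", "Resolution": "NONE",
    "Address": "OAK ST / LAGUNA ST", "Latitude": "37.7745", "Longitude": "-122.4258",
}
features, target = split_features_target(record)
print(sorted(features), target)
```

Dropping `Description` and `Resolution` avoids leaking information into the model that would not be available when a prediction is actually requested.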
  • 38. STEP 1: DATA CLEANSING. Prepare the dataset so that every valid variable can be used by a predictive model: • Generate an ID for every record • Verify the structure of every variable and look for data that needs to be cleaned up (e.g. empty records, double spaces, etc.) • Split “Date” (13/05/2015 23:53:00) into single variables (Month, Year, Hour) • Merge “Latitude” and “Longitude” to check for unique places. Then verify the distribution of every variable to detect non-normal distributions or other kinds of problems to be fixed
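The cleansing steps above can be sketched with the standard library alone. The field names and sample rows are hypothetical; the date format string follows the example on the slide (day/month/year, 24-hour time), and rounding coordinates to four decimals before merging them is our own assumption.

```python
import re
from datetime import datetime

def clean_record(record_id, raw):
    """Return a cleaned copy of one incident with an ID and split date fields."""
    when = datetime.strptime(raw["Date"].strip(), "%d/%m/%Y %H:%M:%S")
    return {
        "Id": record_id,
        "Year": when.year,
        "Month": when.month,
        "Hour": when.hour,
        # Collapse runs of whitespace such as double spaces
        "Address": re.sub(r"\s+", " ", raw["Address"]).strip(),
        # Merge latitude and longitude into a single "place" key
        "Place": (round(float(raw["Latitude"]), 4), round(float(raw["Longitude"]), 4)),
    }

raw_rows = [
    {"Date": "13/05/2015 23:53:00", "Address": "OAK ST  / LAGUNA ST",
     "Latitude": "37.7745", "Longitude": "-122.4258"},
    {"Date": "13/05/2015 23:33:00", "Address": "   ",
     "Latitude": "37.7745", "Longitude": "-122.4258"},
    {"Date": "01/01/2003 00:01:00", "Address": "800 Block of BRYANT ST",
     "Latitude": "37.7753", "Longitude": "-122.4034"},
]
# Drop records with an empty address, then number the remaining ones
valid = [r for r in raw_rows if r["Address"].strip()]
cleaned = [clean_record(i, r) for i, r in enumerate(valid, start=1)]
unique_places = {row["Place"] for row in cleaned}
print(cleaned[0], len(unique_places))
```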
  • 39. STEP 2: LAUNCH THE FIRST EXPLORATORY MODEL. You can use professional (and free) tools like RapidMiner, Weka, KNIME, etc.
  • 40. STEP 2: LAUNCH THE FIRST EXPLORATORY MODEL • Tool: IBM Watson • Algorithm: Decision Tree CHAID • Predictive strength: 17% - the crime category is correctly predicted in fewer than 1 in 4 cases. [Figures: Top Predictors; Decision Tree for Arson]
  • 41. Q1 – HOW TO WORK WITH THE DATASET? • Think about how to transform the Date variable • Think about how to transform the Time variable • Think about how to transform the Address variable • Think about other external information to include in the model • Try to select sources from which to import this information. Possible answers: • Date → Weekend&HolidayDummy [using historical calendar holidays] • Time → NightDummy, Cold/RainDummy, HotDummy, WorkingTimeDummy [using weather, sunrise/sunset, etc. information] • Address → StreetType [string handling] • Other → UnemploymentRateByMonth, VisitorsRateByMonth, PopulationDensityByDistrict, HouseCostByDistrict, EducationLevelByDistrict [desk analysis]
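A few of these derived variables need nothing beyond the timestamp and the address string. The sketch below builds them under explicit assumptions: "night" is taken as 22:00 to 06:00, "working time" as weekdays 09:00 to 18:00, and the street type as the last word of the address, with intersections flagged by a slash. Holiday, weather and socio-economic variables would require external sources and are not covered here.

```python
from datetime import datetime

def engineer_features(when, address):
    """Derive dummy variables from a timestamp and a street type from an address."""
    words = address.strip().split()
    suffix = words[-1].upper() if words else "UNKNOWN"
    return {
        "WeekendDummy": int(when.weekday() >= 5),            # Saturday or Sunday
        "NightDummy": int(when.hour >= 22 or when.hour < 6),  # assumed night hours
        "WorkingTimeDummy": int(when.weekday() < 5 and 9 <= when.hour < 18),
        "StreetType": "INTERSECTION" if "/" in address else suffix,
    }

late_wednesday = datetime(2015, 5, 13, 23, 53)
saturday_morning = datetime(2015, 5, 16, 10, 0)
print(engineer_features(late_wednesday, "OAK ST / LAGUNA ST"))
print(engineer_features(saturday_morning, "800 Block of BRYANT ST"))
```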
  • 42. STEP 3: LAUNCH THE FINAL EXPLORATORY MODEL • Tool: IBM Watson • Algorithm: Decision Tree CHAID • Predictive strength: 32% - the crime category is correctly predicted in about 1 in 3 cases. [Figure: Top Predictors]
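If "predictive strength" is read as the share of incidents whose category is predicted correctly (an interpretation suggested by the slide's wording, not a documented definition of the tool's metric), it can be computed as plain accuracy and compared with a naive baseline that always predicts the most frequent category. The labels below are invented for illustration.

```python
from collections import Counter

def accuracy(predicted, actual):
    """Fraction of cases in which the predicted category matches the actual one."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)

# Hypothetical test-set labels and model output
actual = ["THEFT", "THEFT", "ASSAULT", "ARSON", "THEFT", "ASSAULT"]
model_output = ["THEFT", "ASSAULT", "ASSAULT", "THEFT", "THEFT", "ASSAULT"]

# Baseline: always predict the most frequent category
majority = Counter(actual).most_common(1)[0][0]
baseline = accuracy([majority] * len(actual), actual)
score = accuracy(model_output, actual)
print(f"baseline {baseline:.0%}, model {score:.0%}")
```

With 39 categories, even a modest-looking score can be meaningful if it clearly beats this kind of baseline.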
  • 43. AGENDA: Why become a Data Scientist? Who is a Data Scientist? How is a Data Scientist selected? What is an algorithm? An example of a Data Science problem. Some Data Science tools. Where to keep up to date and get trained?
  • 44. BIG DATA = A HUGE NUMBER OF TECHNOLOGIES [figure]
  • 45. CHOOSING A DATA SCIENCE TOOL. Languages: • R • Python. Other tools: • HP Vertica • Weka • Tableau • Neo4j. [Figure: Gartner Magic Quadrant 2016. Axes: completeness of vision, ability to execute. Quadrants: challengers, niche players, leaders, visionaries. Vendors shown: SAS, IBM, KNIME, RapidMiner, Microsoft, Alteryx, Predixion Software, Alpine Data, FICO, Lavastorm, Megaputer, Prognoz, Accenture, Dell, SAP, Angoss. Legend: pay / free]
  • 46. RAPIDMINER: A VISUAL LABORATORY [figure]
  • 47. IBM WATSON: ONE OF THE MOST FAMOUS TOOLS, FOR LESS THAN €50/MONTH PER USER [figure]
  • 48. AGENDA: Why become a Data Scientist? Who is a Data Scientist? How is a Data Scientist selected? What is an algorithm? An example of a Data Science problem. Some Data Science tools. Where to keep up to date and get trained?
  • 49. MOOC [figure]
  • 50. BIG DATA & DATA SCIENCE: HOW TO BECOME A DATA SCIENTIST. FIRST TIPS FOR STUDENTS AND BEGINNERS. Paolo Pellegrini, Senior Consultant, June 2016