INTRODUCTION TO
DATA SCIENCE
Gabriel Moreira
Lead Data Scientist
@gspmoreira
Why so much buzz?
Source: https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
Commoditized Hardware
Advances in AI research
WHERE IS DATA SCIENCE BEING USED?
Source: http://www.kdnuggets.com/2014/12/where-analytics-data-mining-data-science-applied.html
WHAT IS DATA SCIENCE?
http://drewconway.com
WHAT IS A DATA SCIENTIST?
A Data Scientist is someone with deliberate dual personality who can first build a
curious business case defined with a telescopic vision and can then dive deep with
microscopic lens to sift through DATA to reach the goal while defining and
executing all the intermittent tasks.
http://www.datasciencecentral.com/profiles/blogs/are-you-a-data-scientist
Source: http://nirvacana.com/thoughts/becoming-a-data-scientist/
Data Science MetroMap Curriculum
Source: https://www.kdnuggets.com/2018/05/poll-tools-analytics-data-science-machine-learning-results.html
Top Data Science Tools
Source: https://www.kaggle.com/surveys/2017
What language would you recommend new data scientists learn first?
Data Visualization
DATA PRODUCTS
“If information has context and the context is interactive, insights are not predictable.”
[Agile Data Science, O’Reilly, 2014]
SENTIMENT ANALYSIS
bit.ly/eleicoes2014debatesbt
Analytical Dashboard
NETWORK ANALYSIS
https://linkedjazz.org/network/
Machine Learning
“Gives computers the ability to learn without being explicitly programmed”
- Arthur Samuel, 1959
Machine Learning
Artificial Intelligence
“Creation of intelligent machines that work and react like humans”
- John McCarthy, 1956
Some definitions...
Supervised
Learning
● Data with clearly defined
output is given
● Direct feedback is given
● Predicts outcome/future
● Resolves classification
and regression problems
Unsupervised
Learning
● Machine understands the
data (clustering,
association rules)
● Evaluation is qualitative or
indirect
● Does not predict/find
anything specific
Reinforcement
Learning
● Intelligent agent that
learns how to act in a
certain environment,
based on maximizing
rewards
● Used to optimize goals
Types of Machine Learning
[Diagram: Training dataset, Algorithms, Parameters → Training → Model → Prediction; example: “Real” data → 95% daisy]
Supervised Learning - Classification
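A minimal sketch of the train-then-predict flow for classification, using a nearest-centroid rule in plain Python. The features, the labels, and the `train`/`predict` helper names are hypothetical and chosen only for illustration; real projects would normally use a library and far more data.

```python
from math import dist

def train(samples, labels):
    """Training step: compute one centroid (mean point) per class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(model, x):
    """Prediction step: return the class whose centroid is closest to x."""
    return min(model, key=lambda y: dist(model[y], x))

# Hypothetical training dataset: (petal length, petal width) -> flower label.
samples = [(1.0, 0.2), (1.2, 0.3), (4.5, 1.5), (4.8, 1.6)]
labels = ["daisy", "daisy", "rose", "rose"]

model = train(samples, labels)          # the "model" is just the class centroids
prediction = predict(model, (1.1, 0.25))  # classify a new, unseen example
```

Here the labelled examples play the role of the training dataset, and the returned label is the prediction for new data.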
Supervised Learning - Regression
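As a rough illustration of regression, the sketch below fits a straight line by ordinary least squares and uses it to forecast a numeric value. The monthly sales figures and the `fit_line` helper are made up for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: sales observed in months 1..5.
months = [1, 2, 3, 4, 5]
sales = [100.0, 120.0, 140.0, 160.0, 180.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # predicted sales for month 6
```

Unlike classification, the output is a continuous number rather than a category.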
Supervised Learning - Recommender System
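One simple way to sketch a recommender is "users who share items with you also liked...". The example below scores unseen items by how much the other users overlap with the target user. The purchase history, user names, and the `recommend` function are hypothetical; production recommenders are considerably more elaborate.

```python
from collections import Counter

def recommend(history, user, top_n=2):
    """Rank items the user has not seen by overlap with similar users."""
    seen = history[user]
    scores = Counter()
    for other, items in history.items():
        if other == user:
            continue
        overlap = len(seen & items)  # shared items = crude similarity
        if overlap == 0:
            continue
        for item in items - seen:
            scores[item] += overlap
    # Sort by score (descending), then by name for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

# Hypothetical purchase history per user.
history = {
    "ana": {"book", "lamp"},
    "bob": {"book", "lamp", "desk"},
    "cai": {"book", "chair"},
    "dee": {"sofa"},
}

recs = recommend(history, "ana")
```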
Unsupervised Learning - Association Rules
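Association rules are usually judged by support (how often an itemset appears) and confidence (how often the consequent appears when the antecedent does). The sketch below computes both on a tiny, made-up set of market baskets; the function names are illustrative only.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]

s = support(baskets, {"bread", "butter"})       # 2 of 4 baskets
c = confidence(baskets, {"bread"}, {"butter"})  # 2 of the 3 bread baskets
```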
Unsupervised Learning - Clustering
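Clustering can be sketched with k-means (Lloyd's algorithm), shown here on one-dimensional data to keep it short. The spending values, the starting centers, and the `kmeans_1d` helper are assumptions made for illustration; no labels are given, and the groups emerge from the data alone.

```python
def kmeans_1d(values, centers, iterations=10):
    """Lloyd's algorithm on 1-D data, starting from the given centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical monthly spend per customer.
spend = [10, 12, 11, 95, 100, 105]
centers, clusters = kmeans_1d(spend, centers=[0, 50])
```

The two resulting groups could be read as low-spend and high-spend customer segments.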
Machine Learning Quiz
Supervised Learning
● Classification (C)
● Regression (R)
● Recommender
Systems (RS)
Unsupervised Learning
● Clustering (CL)
● Association Rules (A)
1. Which products should I offer to a customer?
2. What will sales be next month?
3. Which customers are prone to churn?
4. Which products are commonly bought together (market basket)?
5. Is a transaction fraudulent?
6. How can my customers be segmented for targeting?
7. How can I personalize search results for user context?
8. Which product is this (based on a picture)?
9. What are the main topics of messages from a chatbot?
10. What will a company’s stock price be at the end of the day?
11. To which customers should I offer a product?
Machine Learning Quiz - Answers
1. Which products should I offer to a customer? (RS)
2. What will sales be next month? (R)
3. Which customers are prone to churn? (C)
4. Which products are commonly bought together (market basket)? (A)
5. Is a transaction fraudulent? (C)
6. How can my customers be segmented for targeting? (CL)
7. How can I personalize search results for user context? (RS)
8. Which product is this (based on a picture)? (C)
9. What are the main topics of messages from a chatbot? (CL)
10. What will a company’s stock price be at the end of the day? (R)
11. To which customers should I offer a product? (RS)
Deep Learning
Feature extraction from unstructured data using:
● Convolutional Neural Networks (CNNs)
● Recurrent Neural Networks (RNNs)
Images Text Audio/Music
ML for Business
Cognitive & Advanced Analytics
UX
Business
Machine Learning
Big Data
Advanced Analytics
Customer Centric
UNDERSTAND YOUR CUSTOMER
Company Centric
CREATE PROACTIVE EXPERIENCES
Cognitive
OPTIMIZE YOUR PROCESSES
AIRBNB
Types of Analytics
Cognitive & Adv. Analytics - Quiz
● Cognitive (C)
● Data Science / Advanced Analytics
○ Descriptive (ADes)
○ Diagnostic (ADia)
○ Predictive (APred)
○ Prescriptive (APres)
1. How many products were sold last month? (ADes)
2. Which products were commonly bought together? (ADia)
3. How can customers be segmented based on purchases? (ADia)
4. How many products will I sell next month to a customer segment? (APred)
5. Which products with cross-sell opportunity should I offer to each customer segment? (APres)
6. During a user journey, which products could I recommend automatically, based on the user's historical behaviour? (C)
E-commerce
IS THE DATA SCIENTIST THE NEW WEBMASTER?
[Doing Data Science, O’Reilly, 2014]
Data Science roles
CRISP-DM
Cognitive Lifecycle Overview
1. Identify Opportunity: Objectives & pain points to build the use case
2. Data Ingestion and Cleansing: Implements the ETL pipeline from multiple data sources
3. Data Exploration: Data availability and analysis to support use case
4. Modeling: Select algorithms, features, train and evaluate models
5. Offline evaluation: Value demonstration based on results of best model
6. Experiments Design: Design of the experiments to evaluate with real users
7. Development & Deployment: Development of related systems and deployment of systems and model
8. Monitor Model: Evaluate results and retrain or rebuild when performance degrades
9. Online evaluation: A/B testing results analysis
10. Human in the Loop: Acquire new data based on feedback loops from users, reviewers, etc.
Roles shown: Data Scientist / Business, Data Engineer, Data Scientist, ML Engineer
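The online evaluation step of the lifecycle relies on analysing A/B test results. One common way to do that, sketched here under assumed numbers, is a two-proportion z-test comparing conversion rates of a control group (A) and a group served by the new model (B). The counts and the function name are hypothetical, and the slides do not say which test the author uses.

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value

# Hypothetical experiment: 1000 users per variant.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
significant = p_value < 0.05  # conventional threshold, an assumption here
```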
Source: https://www.kaggle.com/surveys/2017
What barriers are faced at work?
AI in industries
Tourism
Transportation
Health
Customer Service
Law
Marketing
Education
Financial Services
Manufacturing
Retail
Insurance
Product Personalization
Loyalty programs (redemption recommendations)
Next Best Offer
Multi-channel attribution
Customer Churn
Chatbots
Fraud Detection
Credit Scoring
Product Recommendations
Hospital Readmission Risks
Diagnostic Imaging
Disease Propensity
Risk scoring
Pricing
Fraudulent Claims
Delivery Optimization
Understanding or generating legal documents
Automation
Personalized learning
Government
Planning
Security
Disaster prevention
How to learn more
kaggle.com
DATA SCIENCE COURSES
• Fundamentos de AI e Machine Learning (Udacity)
• Data Science at Scale (Univ. of Washington)
• Data Science specialization (Johns Hopkins)
• Machine Learning (Stanford)
• Statistical Learning (Stanford)
BOOKS
TALKS
https://www.infoq.com/br/presentations/python-for-data-science
https://www.infoq.com/br/presentations/feature-engineering-extraindo-o-potencial-maximo-dos-dados
https://www.youtube.com/watch?v=IPMwdk8qHMI
Happy data geeking!