Data Science With Python
Mosky
Data Science
➤ = Extract knowledge or insights from data.
➤ Data science includes:
➤ Visualization
➤ Statistics
➤ Machine learning
➤ Deep learning
➤ Big data
➤ And related methods
➤ ≈ Data mining
We will introduce.
➤ It's kind of outdated, but still contains a lot of keywords.
➤ MrMimic/data-scientist-roadmap – GitHub
➤ Becoming a Data Scientist – Curriculum via Metromap
Statistics vs. Machine Learning
➤ Machine learning = statistics - checking of assumptions 😆
➤ But it does solve more problems.
➤ Statistics constructs more solid inferences.
➤ Machine learning constructs more interesting predictions.
Probability, Descriptive Statistics, and Inferential Statistics
[Figure: a population and a sample drawn from it, labeled with probability, descriptive statistics, and inferential statistics.]
Machine Learning vs. Deep Learning
➤ Deep learning is the most renowned part of machine learning.
➤ A.k.a. the “AI”.
➤ Deep learning uses artificial neural networks (NNs).
➤ Which are especially good at:
➤ Computer vision (CV) 👀
➤ Natural language processing (NLP) 📖
➤ Machine translation
➤ Speech recognition
➤ Too costly for simple problems.
Big Data
➤ The “size” is constantly moving.
➤ As of 2012, it ranges from 10n TB to n PB, a 100x span.
➤ Has high-3Vs:
➤ Volume, amount of data.
➤ Velocity, speed of data in and out.
➤ Variety, range of data types and sources.
➤ A practical definition:
➤ A single computer can't process it in a reasonable time.
➤ Distributed computing is a big deal.
Today,
➤ “Models” are the mathematical models.
➤ “Statistical models” emphasize inferences.
➤ “Machine learning models” emphasize predictions.
➤ “Deep learning” and “big data” are gigantic subfields.
➤ We won't introduce them.
➤ But the learning resources are listed at the end.
Mosky
➤ Python Charmer at Pinkoi.
➤ Has spoken at
➤ PyCons in TW, MY, KR, JP, SG, HK, COSCUPs, and TEDx, etc.
➤ Countless hours on teaching Python.
➤ Owns the Python packages:
➤ ZIPCodeTW, MoSQL, Clime, etc.
➤ http://mosky.tw/
The Outline
➤ “Data”
➤ The Analysis Steps
➤ Visualization
➤ Preprocessing
➤ Dimensionality Reduction
➤ Statistical Models
➤ Machine Learning Models
➤ Keep Learning
The Packages
➤ $ pip3 install jupyter numpy scipy sympy matplotlib ipython pandas seaborn statsmodels scikit-learn
➤ Or
➤ > conda install jupyter numpy scipy sympy matplotlib ipython pandas seaborn statsmodels scikit-learn
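After installing, a quick check can confirm that the packages are importable. The sketch below is not from the slides; the helper name `missing_packages` is made up for illustration, and it assumes the usual import names (note that scikit-learn is imported as `sklearn` and ipython as `IPython`). The jupyter package is left out because it is normally run as a command rather than imported.

```python
from importlib.util import find_spec

# Import names for the packages installed above (assumed, not from the slides).
# scikit-learn is imported as `sklearn`; ipython is imported as `IPython`.
IMPORT_NAMES = [
    "numpy", "scipy", "sympy", "matplotlib", "IPython",
    "pandas", "seaborn", "statsmodels", "sklearn",
]


def missing_packages(import_names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in import_names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(IMPORT_NAMES)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All packages are importable.")
```

An empty result means every listed package can be imported from the current interpreter, which is also the interpreter the notebook kernel should use.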
Common Jupyter Notebook Shortcuts
Esc           Edit mode → command mode.
Ctrl-Enter    Run the cell.
B             Insert cell below.
D, D          Delete the current cell.
M             To Markdown cell.
Cmd-/         Comment the code.
H             Show keyboard shortcuts.
P             Open the command palette.
Checkpoint: The Packages
➤ Open 00_preface_the_packages.ipynb up.
➤ Run it.
➤ The notebooks are available on https://github.com/moskytw/data-science-with-python.
“Data”
➤ = Variables
➤ = Dimensions
➤ = Labels + Features
Data in Different Types
Discrete
    Nominal                                           {male, female}
    Ordinal (Ranked)   ↑ & can be ordered.            {great > good > fair}
Continuous
    Interval           ↑ & distance is meaningful.    temperatures
    Ratio              ↑ & 0 is meaningful.           weights
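The types above can be mirrored in pandas. This is a minimal sketch rather than anything taken from the notebooks: the variable names and values are made up, and it assumes pandas is installed. An unordered `Categorical` plays the role of a nominal variable, and `ordered=True` makes it ordinal so that comparisons work.

```python
import pandas as pd

# Nominal: categories without any order.
gender = pd.Series(pd.Categorical(
    ["male", "female", "female"],
    categories=["male", "female"],
))

# Ordinal: categories with a declared order, so comparisons and max() work.
rating = pd.Series(pd.Categorical(
    ["good", "fair", "great"],
    categories=["fair", "good", "great"],
    ordered=True,
))

best = rating.max()                       # largest category under the declared order
above_fair = (rating > "fair").tolist()   # element-wise comparison against a category

# Ratio: 0 means "none", so a ratio of two values is meaningful.
weights_kg = pd.Series([50.0, 100.0])
ratio = weights_kg[1] / weights_kg[0]

print(best, above_fair, ratio)
```

Interval data such as temperatures in °C would be plain floats as well; the difference is in interpretation, since 20 °C is not "twice" 10 °C.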
Data in the X-Y Form
y                              x
dependent variable             independent variable
response variable              explanatory variable
regressand                     regressor
endogenous variable | endog    exogenous variable | exog
outcome                        design
label                          feature
➤ Confounding variables:
➤ Variables that may affect y but are not among the x's.
➤ May lead to erroneous conclusions, “garbage in, garbage out”.
➤ Controlling, e.g., fix the environment.
➤ Randomizing, e.g., choose by computer.
➤ Matching, e.g., order by gender and then assign group.
➤ Statistical control, e.g., BMI to remove height effect.
➤ Double-blind, even triple-blind trials.
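"Randomizing, e.g., choose by computer" can be as small as a seeded shuffle. The sketch below is an illustration only: the `randomize` helper and the subject IDs are hypothetical, and a real experiment may need stratification or other constraints.

```python
import random


def randomize(subjects, seed=None):
    """Shuffle the subjects and split them into two groups of (nearly) equal size."""
    rng = random.Random(seed)      # a dedicated generator keeps the split reproducible
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


subjects = [f"user_{i}" for i in range(10)]   # hypothetical subject IDs
control, treatment = randomize(subjects, seed=42)

print("control:  ", control)
print("treatment:", treatment)
```

Because the computer, not the analyst, decides who lands in which group, confounding variables tend to be spread evenly across both groups.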
Get the Data
➤ Logs
➤ Existing datasets
➤ The Datasets Package – StatsModels
➤ Kaggle
➤ Experiments
The Analysis Steps
The Three Steps
1. Define Assumption
2. Validate Assumption
3. Validated Assumption?
1. Define Assumption
➤ Specify a feasible objective.
➤ Not “Use AI to get the moon!”
➤ Write a formal assumption.
➤ “The users will buy 1% of the items from our recommendations.” rather than “The users will love our recommendations!”
➤ Note the dangerous gaps.
➤ “All the items from the recommendation are free!”
➤ “Correlation does not imply causation.”
➤ Consider the next actions.
➤ “Release to 100% of users.” rather than “So great!”
2. Validate Assumption
➤ Collect potential data.
➤ List possible methods.
➤ A plot, a median, or even a mean may be good enough.
➤ Selecting Statistical Tests – Bates College
➤ Choosing a statistical test – HBS
➤ Choosing the right estimator – Scikit-Learn
➤ Evaluate the metrics of methods with data.
3. Validated Assumption?
➤ Yes → Congrats! Report fully and take the actions! 🎉
➤ No → Check:
➤ The hypotheses of methods.
➤ The confounding variables in data.
➤ The formality of assumption.
➤ The feasibility of objective.

Iterate Fast While Industry Changes Rapidly
➤ Resolve the small problems first.
➤ Resolve the problems with a high impact-to-effort ratio first.
➤ One week to get a quick result and improve rather than one year to get the maybe-best result.
➤ Fail fast!
Checkpoint: Pick a Method
➤ Think of an interesting problem.
➤ E.g., revenue is higher, but is it random?
➤ Pick one method from the cheatsheets.
➤ Selecting Statistical Tests – Bates College
➤ Choosing a statistical test – HBS
➤ Choosing the right estimator – Scikit-Learn
➤ Remember the three analysis steps.
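For the example question above, "revenue is higher, but is it random?", one simple way to validate the assumption is a permutation test. This is a swapped-in technique for illustration, not necessarily what the cheatsheets would recommend (they may point to a t-test instead); the revenue numbers are made up and the function name is hypothetical.

```python
import random
from statistics import mean


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the fraction of random relabelings whose absolute mean
    difference is at least as large as the observed one (the p-value).
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return hits / n_permutations


# Made-up daily revenue, for illustration only.
before = [100, 98, 103, 97, 101, 99, 102]
after = [108, 110, 105, 111, 107, 109, 106]

p_value = permutation_test(before, after)
print(f"p-value: {p_value:.4f}")
```

A small p-value says that a difference this large rarely appears when the "before" and "after" labels are shuffled at random, which is step 2 (Validate Assumption) for the formal assumption "revenue after the change is different".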
Visualization
➤ Make Data Colorful – Plotting
➤ 01_1_visualization_plotting.ipynb
➤ In a Statistical Way – Descriptive Statistics
➤ 01_2_visualization_descriptive_statistics.ipynb
Checkpoint: Plot the Variables
➤ Assuming import statsmodels.api as sm and import pandas as pd:
➤ Star98
➤ star98_df = sm.datasets.star98.load_pandas().data
➤ Fair
➤ fair_df = sm.datasets.fair.load_pandas().data
➤ Howell1
➤ howell1_df = pd.read_csv('dataset_howell1.csv', sep=';')
➤ Or your own datasets.
➤ Plot the variables that interest you.
Preprocessing
Feed the Data That Models Like
➤ Preprocess data for:
➤ Hard requirements, e.g.,
➤ corpus → vectors
➤ “What kind of news will be voted down on PTT?”
➤ Soft requirements (hypotheses), e.g.,
➤ t-test: better when samples are normally distributed.
➤ SVM: better when features range from -1 to 1.
➤ More representative features, e.g., total price / units.
➤ Note that different models have different tastes.
Preprocessing
➤ The Dishes – Containers
➤ 02_1_preprocessing_containers.ipynb
➤ A Cooking Method – Standardization
➤ 02_2_preprocessing_standardization.ipynb
➤ Watch Out for Poisonous Data Points – Removing Outliers
➤ 02_3_preprocessing_removing_outliers.ipynb
Checkpoint: Preprocess the Variables
➤ Try to standardize and compare.
➤ Try to trim the outliers.
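Both checkpoint tasks fit in a few lines of NumPy. The sketch below assumes z-score standardization and Tukey's 1.5 × IQR rule for outliers; the notebooks may use different methods, and the height values are made up.

```python
import numpy as np


def standardize(x):
    """Z-score: subtract the mean and divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()


def trim_outliers(x, k=1.5):
    """Keep the points within [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return x[(x >= q1 - k * iqr) & (x <= q3 + k * iqr)]


# Made-up heights in cm; 300 is an obvious outlier.
heights = np.array([150.0, 160.0, 165.0, 170.0, 175.0, 180.0, 300.0])

trimmed = trim_outliers(heights)
z = standardize(trimmed)

print(trimmed)
print(z.round(2))
```

Trimming before standardizing matters here: a single extreme point inflates the standard deviation and squeezes all the other z-scores toward zero.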
Dimensionality Reduction
The Model Sicks Up!
➤ Let's reduce the variables.
➤ Feed a subset → feature selection.
➤ Feature selection using SelectFromModel – Scikit-Learn
➤ Feed a transformation → feature extraction.
➤ PCA, FA, etc.
➤ Another definition: non-numbers → numbers.
Dimensionality Reduction
➤ Principal Component Analysis
➤ 03_1_dimensionality_reduction_principal_component_analysis.ipynb
➤ Factor Analysis
➤ 03_2_dimensionality_reduction_factor_analysis.ipynb
Checkpoint: Reduce the Variables
➤ Try PCA (or FA) on all the variables to get better components.
➤ Then plot the n-dimensional data on a 2-dimensional plane.
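To make the checkpoint concrete, here is a minimal PCA sketch built on NumPy's SVD instead of a library PCA class; the notebooks may well use scikit-learn. The data is synthetic, and the `pca` helper is a hypothetical name.

```python
import numpy as np


def pca(X, n_components=2):
    """Project the rows of X onto the top principal components via SVD.

    Returns the projected points and the share of variance each kept
    component explains.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)
    explained_ratio = singular_values**2 / np.sum(singular_values**2)
    return centered @ Vt[:n_components].T, explained_ratio[:n_components]


rng = np.random.default_rng(0)
# Synthetic 5-dimensional data that mostly varies along 2 hidden directions.
hidden = rng.normal(size=(200, 2))
X = hidden @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(200, 5))

points_2d, ratio = pca(X, n_components=2)
print(points_2d.shape, ratio.round(3))
```

`points_2d` can be passed straight to a scatter plot, which is the "n-dimensional data on a 2-dimensional plane" step.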
Statistical Models
➤ Identify Boring or Interesting – Hypothesis Testing
➤ 04_1_statistical_models_hypothesis_testings.ipynb
➤ “Hypothesis Testing With Python”
➤ Identify X-Y Relationships – Regression
➤ 04_2_statistical_models_regression_anova.ipynb
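As a rough picture of "identify X-Y relationships", the sketch below fits a straight line by ordinary least squares using only NumPy. It is not the notebook's code; a statistical package would add standard errors and p-values on top of the same coefficients. The data is synthetic, with a known intercept and slope.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=100)   # synthetic: intercept 2, slope 3

# Design matrix with a constant column for the intercept.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

# R-squared: the share of the variance of y explained by the fitted line.
residuals = y - X @ coef
r_squared = 1 - residuals.var() / y.var()

print(f"y ≈ {intercept:.2f} + {slope:.2f} x, R² = {r_squared:.3f}")
```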
More Regression Models
➤ If y is not linear,
➤ Logit or Poisson Regression | Generalized Linear Models, GLMs
➤ If y is correlated,
➤ Linear Mixed Models, LMMs | Generalized Estimating Equation, GEE
➤ If x has multicollinearity,
➤ Lasso or Ridge Regression
➤ If error term is heteroscedastic,
➤ Weighted Least Squares, WLS | Generalized Least Squares, GLS
➤ If x is time series – predict x₀ from x₋₁, not predict y from x,
➤ Autoregressive Integrated Moving Average, ARIMA
Checkpoint: Apply a Statistical Method
➤ Try to apply the analysis steps with a statistical method.
1. Define Assumption
2. Validate Assumption
3. Validated Assumption?
Machine Learning Models
➤ Apple or Orange? – Classification
➤ 05_1_machine_learning_models_classification.ipynb
➤ Without Labels – Clustering
➤ 05_2_machine_learning_models_clustering.ipynb
➤ Predict the Values – Regression
➤ Who Are the Best? – Model Selection
➤ sklearn.model_selection.GridSearchCV
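A short sketch of model selection with `sklearn.model_selection.GridSearchCV`, assuming scikit-learn is installed. The choice of the iris dataset, the `SVC` classifier, and the parameter grid are illustrative assumptions, not taken from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Every combination in the grid is scored with 5-fold cross-validation.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)             # the best-scoring combination
print(round(search.best_score_, 3))    # its mean cross-validated accuracy
```

After fitting, `search` itself behaves like the best model, so `search.predict(...)` uses the winning parameters.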
Confusion matrix, where A = 00₂ = C[0, 0]
                 predicted - (AC)    predicted + (BD)
actual - (AB)    true - (A)          false + (B)
actual + (CD)    false - (C)         true + (D)
Two letters denote a sum, e.g., BD = B + D.
Common “rates” in confusion matrix
➤ precision = D / BD
➤ recall = D / CD
➤ sensitivity = D / CD = recall = observed power
➤ specificity = A / AB = observed confidence level
➤ false positive rate = B / AB = observed α
➤ false negative rate = C / CD = observed β
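The rates above are plain arithmetic on the four cells. The sketch below spells them out with the slide's letters; the `rates` helper and the counts are made up for illustration.

```python
def rates(A, B, C, D):
    """Common rates from a 2x2 confusion matrix.

    A = true negatives, B = false positives,
    C = false negatives, D = true positives.
    """
    return {
        "precision": D / (B + D),
        "recall": D / (C + D),              # = sensitivity
        "specificity": A / (A + B),
        "false_positive_rate": B / (A + B),
        "false_negative_rate": C / (C + D),
    }


# Made-up counts, for illustration only.
r = rates(A=50, B=10, C=5, D=35)
for name, value in r.items():
    print(f"{name}: {value:.3f}")
```

Note that specificity and the false positive rate always sum to 1, as do recall and the false negative rate, because each pair shares a denominator.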
Ensemble Models
➤ Bagging
➤ Train N independent models and average their outputs.
➤ e.g., the random forest models.
➤ Boosting
➤ Train N models sequentially; the n-th model learns from the (n-1)-th model's errors.
➤ e.g., gradient tree boosting.
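The two ideas can be contrasted with the smallest possible base model, a one-split regression "stump". This is a toy sketch under stated assumptions, not how random forests or gradient tree boosting are implemented in practice: the data is synthetic, bagging here uses bootstrap samples (its usual meaning), boosting fits each stump to the remaining residual, and all function names are hypothetical.

```python
import numpy as np


def fit_stump(x, y):
    """Fit a one-split regression stump: (threshold, left mean, right mean)."""
    best = None
    for t in np.unique(x)[:-1]:               # keeps both sides non-empty
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return np.where(x <= threshold, left_mean, right_mean)


def bagging(x, y, n_models, rng):
    """Fit each stump on its own bootstrap sample; average the outputs."""
    stumps = []
    for _ in range(n_models):
        idx = rng.integers(0, len(x), size=len(x))
        stumps.append(fit_stump(x[idx], y[idx]))
    return lambda q: np.mean([predict_stump(s, q) for s in stumps], axis=0)


def boosting(x, y, n_models, learning_rate=0.5):
    """Fit each stump on the residual error left by the previous ones."""
    stumps, residual = [], y.copy()
    for _ in range(n_models):
        stump = fit_stump(x, residual)
        stumps.append(stump)
        residual = residual - learning_rate * predict_stump(stump, x)
    return lambda q: learning_rate * np.sum([predict_stump(s, q) for s in stumps], axis=0)


rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=200)   # synthetic data


def mse(model):
    """Mean squared error of a model on the training data."""
    return float(np.mean((model(x) - y) ** 2))


single = mse(boosting(x, y, n_models=1, learning_rate=1.0))   # one plain stump
bagged = mse(bagging(x, y, n_models=25, rng=rng))
boosted = mse(boosting(x, y, n_models=25))

print(f"single: {single:.3f}, bagged: {bagged:.3f}, boosted: {boosted:.3f}")
```

Bagging mainly stabilizes a model by averaging away the noise of individual fits, while boosting keeps reducing the training error step by step, so it needs a stopping rule or validation data to avoid overfitting.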
Checkpoint: Apply a Machine Learning Method
➤ Try to apply the analysis steps with a ML method.
1. Define Assumption
2. Validate Assumption
3. Validated Assumption?
Keep Learning
➤ Statistics
➤ Seeing Theory
➤ Biological Statistics
➤ scipy.stats + StatsModels
➤ Research Methods
➤ Machine Learning
➤ Scikit-learn Tutorials
➤ Stanford CS229
➤ Hsuan-Tien Lin
➤ Deep Learning
➤ TensorFlow | PyTorch
➤ Stanford CS231n
➤ Stanford CS224n
➤ Big Data
➤ Dask
➤ Hive
➤ Spark
➤ HBase
➤ AWS
The Facts
➤ ∵
➤ You can't learn everything in data science!
➤ ∴
➤ “Let's learn to do” ❌
➤ “Let's do to learn” ✅
The Learning Flow
1. Ask a question.
➤ “How to tell the differences confidently?”
2. Explore the references.
➤ “T-test, ANOVA, ...”
3. Digest into an answer.
➤ Explore breadth-first.
➤ Write the code.
➤ Make it work, make it right, finally make it fast.
Recap
➤ Let's do to learn, not learn to do.
➤ What is your objective?
➤ For the objective, what is your assumption?
➤ For the assumption, what method may validate it?
➤ For the method, how will you evaluate it with data?
➤ Q & A
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
 
社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .
 
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
 
How To Control IO Usage using Resource Manager
How To Control IO Usage using Resource ManagerHow To Control IO Usage using Resource Manager
How To Control IO Usage using Resource Manager
 
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
 

Data Science With Python

  • 1. Data Science With Python Mosky
  • 2. Data Science ➤ = Extract knowledge or insights from data. ➤ Data science includes: ➤ Visualization ➤ Statistics ➤ Machine learning ➤ Deep learning ➤ Big data ➤ And related methods ➤ ≈ Data mining
  • 3. Data Science ➤ = Extract knowledge or insights from data. ➤ Data science includes: ➤ Visualization ➤ Statistics ➤ Machine learning ➤ Deep learning ➤ Big data ➤ And related methods ➤ ≈ Data mining ➤ We will introduce.
  • 4.
  • 5. ➤ It's kind of outdated, but still contains a lot of keywords. ➤ MrMimic/data-scientist-roadmap – GitHub ➤ Becoming a Data Scientist – Curriculum via Metromap
  • 6. Statistics vs. Machine Learning ➤ Machine learning = statistics - checking of assumptions 😆 ➤ But it does resolve more problems. ➤ Statistics constructs more solid inferences. ➤ Machine learning constructs more interesting predictions.
  • 7. Probability, Descriptive Statistics, and Inferential Statistics [Figure labels: Population, Sample, Probability, Descriptive Statistics, Inferential Statistics]
  • 8. Machine Learning vs. Deep Learning ➤ Deep learning is the most renowned part of machine learning. ➤ A.k.a. the “AI”. ➤ Deep learning uses artificial neural networks (NNs). ➤ Which are especially good at: ➤ Computer vision (CV) 👀 ➤ Natural language processing (NLP) 📖 ➤ Machine translation ➤ Speech recognition ➤ Too costly for simple problems.
  • 9. Big Data ➤ The “size” is constantly moving. ➤ As of 2012, ranges from 10n TB to n PB, which is 100x. ➤ Has high-3Vs: ➤ Volume, amount of data. ➤ Velocity, speed of data in and out. ➤ Variety, range of data types and sources. ➤ A practical definition: ➤ A single computer can't process it in a reasonable time. ➤ Distributed computing is a big deal.
  • 10. Today, ➤ “Models” are the math models. ➤ “Statistical models” emphasize inferences. ➤ “Machine learning models” emphasize predictions. ➤ “Deep learning” and “big data” are gigantic subfields. ➤ We won't introduce them. ➤ But the learning resources are listed at the end.
  • 11. Mosky ➤ Python Charmer at Pinkoi. ➤ Has spoken at PyCons in TW, MY, KR, JP, SG, HK, COSCUPs, TEDx, etc. ➤ Countless hours of teaching Python. ➤ Owns the Python packages: ➤ ZIPCodeTW, MoSQL, Clime, etc. ➤ http://mosky.tw/
  • 12. The Outline ➤ “Data” ➤ The Analysis Steps ➤ Visualization ➤ Preprocessing ➤ Dimensionality Reduction ➤ Statistical Models ➤ Machine Learning Models ➤ Keep Learning
  • 13. The Packages ➤ $ pip3 install jupyter numpy scipy sympy matplotlib ipython pandas seaborn statsmodels scikit-learn ➤ Or ➤ > conda install jupyter numpy scipy sympy matplotlib ipython pandas seaborn statsmodels scikit-learn
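A quick way to confirm that the packages above are installed is to probe for them without importing them. The sketch below is only an illustration: the helper name `check_packages` is hypothetical, and the mapping assumes the usual import names, which differ from the pip names in a few cases (`scikit-learn` imports as `sklearn`, `ipython` as `IPython`, and `jupyter` is probed through `jupyter_core`, which it depends on).

```python
from importlib.util import find_spec

# pip name -> import name (assumed; they differ for a few packages)
PACKAGES = {
    "jupyter": "jupyter_core",
    "numpy": "numpy",
    "scipy": "scipy",
    "sympy": "sympy",
    "matplotlib": "matplotlib",
    "ipython": "IPython",
    "pandas": "pandas",
    "seaborn": "seaborn",
    "statsmodels": "statsmodels",
    "scikit-learn": "sklearn",
}


def check_packages(packages=PACKAGES):
    """Return {pip name: True/False}, probing each module without importing it."""
    return {
        pip_name: find_spec(module) is not None
        for pip_name, module in packages.items()
    }


if __name__ == "__main__":
    for name, ok in check_packages().items():
        print(f"{name:<14}{'ok' if ok else 'MISSING'}")
```

Any package reported as missing can be installed with the `pip3` or `conda` command from the slide.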
  • 14. Common Jupyter Notebook Shortcuts ➤ Esc: edit mode → command mode. ➤ Ctrl-Enter: run the cell. ➤ B: insert cell below. ➤ D, D: delete the current cell. ➤ M: to Markdown cell. ➤ Cmd-/: comment the code. ➤ H: show keyboard shortcuts. ➤ P: open the command palette.
  • 15. Checkpoint: The Packages ➤ Open 00_preface_the_packages.ipynb up. ➤ Run it. ➤ The notebooks are available on https://github.com/moskytw/data-science-with-python.
  • 17. “Data” ➤ = Variables ➤ = Dimensions ➤ = Labels + Features
  • 18. Data in Different Types ➤ Discrete: ➤ Nominal, e.g., {male, female}. ➤ Ordinal (ranked): ↑ & can be ordered, e.g., {great > good > fair}. ➤ Continuous: ➤ Interval: ↑ & distance is meaningful, e.g., temperatures. ➤ Ratio: ↑ & 0 is meaningful, e.g., weights.
  • 19. Data in the X-Y Form ➤ y: dependent variable, response variable, regressand, endogenous variable (endog), outcome, label. ➤ x: independent variable, explanatory variable, regressor, exogenous variable (exog), design, feature.
  • 20. ➤ Confounding variables: ➤ May affect y, but not x. ➤ May lead to erroneous conclusions, “garbage in, garbage out”. ➤ Controlling, e.g., fix the environment. ➤ Randomizing, e.g., choose by computer. ➤ Matching, e.g., order by gender and then assign group. ➤ Statistical control, e.g., BMI to remove height effect. ➤ Double-blind, even triple-blind trials.
  • 21. Get the Data ➤ Logs ➤ Existing datasets ➤ The Datasets Package – StatsModels ➤ Kaggle ➤ Experiments
  • 23. The Three Steps 1. Define Assumption 2. Validate Assumption 3. Validated Assumption?
  • 24. 1. Define Assumption ➤ Specify a feasible objective. ➤ “Use AI to get the moon!” ➤ Write a formal assumption. ➤ “The users will buy 1% of the items from our recommendation.” rather than “The users will love our recommendation!” ➤ Note the dangerous gaps. ➤ “All the items from the recommendation are free!” ➤ “Correlation does not imply causation.” ➤ Consider the next actions. ➤ “Release to 100% of users.” rather than “So great!”
  • 25. 2. Validate Assumption ➤ Collect potential data. ➤ List possible methods. ➤ A plot, a median, or even a mean may be good enough. ➤ Selecting Statistical Tests – Bates College ➤ Choosing a statistical test – HBS ➤ Choosing the right estimator – Scikit-Learn ➤ Evaluate the metrics of methods with data.
  • 26. 3. Validated Assumption? ➤ Yes → Congrats! Report fully and take the actions! 🎉 ➤ No → Check: ➤ The hypotheses of methods. ➤ The confounding variables in data. ➤ The formality of assumption. ➤ The feasibility of objective.
  • 27. Iterate Fast While Industry Changes Rapidly ➤ Resolve the small problems first. ➤ Resolve the high impact/effort problems first. ➤ One week to get a quick result and improve, rather than one year to get a maybe-the-best result. ➤ Fail fast!
  • 28. Checkpoint: Pick up a Method ➤ Think of an interesting problem. ➤ E.g., revenue is higher, but is it random? ➤ Pick one method from the cheatsheets. ➤ Selecting Statistical Tests – Bates College ➤ Choosing a statistical test – HBS ➤ Choosing the right estimator – Scikit-Learn ➤ Remember the three analysis steps.
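The “revenue is higher, but is it random?” question can be sketched with a permutation test. This is a swapped-in technique chosen because it needs only the standard library; the cheatsheets would more likely point to a t-test. The revenue numbers are made up, and the function name `permutation_test` is hypothetical.

```python
import random
from statistics import mean


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """One-sided p-value for mean(b) > mean(a) under random relabeling."""
    rng = random.Random(seed)
    observed = mean(b) - mean(a)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        # Shuffle the labels: if the groups do not differ, any split is as likely.
        rng.shuffle(pooled)
        diff = mean(pooled[len(a):]) - mean(pooled[:len(a)])
        if diff >= observed:
            hits += 1
    # The add-one correction keeps the estimate strictly above 0.
    return (hits + 1) / (n_permutations + 1)


# Made-up daily revenue before and after a change (illustrative only).
before = [100, 98, 103, 97, 101, 99, 102]
after = [108, 110, 105, 109, 107, 111, 106]

p_value = permutation_test(before, after)
print(f"p = {p_value:.4f}")
```

A small p-value means that a difference this large rarely appears when the labels are shuffled, which is the formal version of “it is probably not random”.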
  • 30. Visualization ➤ Make Data Colorful – Plotting ➤ 01_1_visualization_plotting.ipynb ➤ In a Statistical Way – Descriptive Statistics ➤ 01_2_visualization_descriptive_statistics.ipynb
  • 31. Checkpoint: Plot the Variables ➤ Star98 ➤ star98_df = sm.datasets.star98.load_pandas().data ➤ Fair ➤ fair_df = sm.datasets.fair.load_pandas().data ➤ Howell1 ➤ howell1_df = pd.read_csv('dataset_howell1.csv', sep=';') ➤ Or your own datasets. ➤ Plot the variables that interest you.
  • 33. Feed the Data That Models Like ➤ Preprocess data for: ➤ Hard requirements, e.g., ➤ corpus → vectors ➤ “What kind of news will be voted down on PTT?” ➤ Soft requirements (hypotheses), e.g., ➤ t-test: better when samples are normally distributed. ➤ SVM: better when features range from -1 to 1. ➤ More representative features, e.g., total price / units. ➤ Note that different models have different tastes.
  • 34. Preprocessing ➤ The Dishes – Containers ➤ 02_1_preprocessing_containers.ipynb ➤ A Cooking Method – Standardization ➤ 02_2_preprocessing_standardization.ipynb ➤ Watch Out for Poisonous Data Points – Removing Outliers ➤ 02_3_preprocessing_removing_outliers.ipynb
  • 35. Checkpoint: Preprocess the Variables ➤ Try to standardize and compare. ➤ Try to trim the outliers.
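As a rough illustration of this checkpoint, the sketch below standardizes a variable to z-scores and trims outliers with Tukey's 1.5 × IQR fences. Both choices are assumptions made for the example, not necessarily what the notebooks use, and the heights are made up.

```python
from statistics import mean, quantiles, stdev


def standardize(values):
    """Rescale to zero mean and unit (sample) standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def trim_outliers(values, k=1.5):
    """Keep points within [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Made-up heights in cm; 250 is a deliberately implausible data point.
heights_cm = [150, 152, 155, 158, 160, 161, 163, 165, 168, 250]

trimmed = trim_outliers(heights_cm)
z_before = standardize(heights_cm)
z_after = standardize(trimmed)

print("kept:", trimmed)
print("largest z-score before trimming:", round(max(z_before), 2))
print("largest z-score after trimming:", round(max(z_after), 2))
```

Comparing the z-scores before and after trimming shows how much a single extreme point stretches the scale for everything else.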
  • 37. The Model Sicks Up! ➤ Let's reduce the variables. ➤ Feed a subset → feature selection. ➤ Feature selection using SelectFromModel – Scikit-Learn ➤ Feed a transformation → feature extraction. ➤ PCA, FA, etc. ➤ Another definition: non-numbers → numbers.
  • 38. Dimensionality Reduction ➤ Principal Component Analysis ➤ 03_1_dimensionality_reduction_principal_component_analysis.ipynb ➤ Factor Analysis ➤ 03_2_dimensionality_reduction_factor_analysis.ipynb
  • 39. Checkpoint: Reduce the Variables ➤ Try to apply PCA (or FA) to all variables to get better components. ➤ And then plot the n-dimensional data onto a 2-dimensional plane.
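To make the PCA step concrete, here is a minimal sketch that projects n-dimensional data onto its top two principal components using NumPy's SVD. It assumes NumPy is installed (it is in the package list above); the data is randomly generated for illustration, and the function name `pca` is hypothetical. In practice, `sklearn.decomposition.PCA` is the more convenient route.

```python
import numpy as np


def pca(X, n_components=2):
    """Project the rows of X onto the top principal components via SVD."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = S[:n_components] ** 2 / (len(X) - 1)
    return X_centered @ components.T, components, explained_variance


rng = np.random.default_rng(0)
# Made-up 5-dimensional data whose variation mostly lies in 2 directions.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

projected, components, explained_variance = pca(X, n_components=2)
ratio = explained_variance.sum() / X.var(axis=0, ddof=1).sum()
print(projected.shape, f"variance kept: {ratio:.3f}")
```

The two columns of `projected` are what would go on the x and y axes of the 2-dimensional plot.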
  • 41. Statistical Models ➤ Identify Boring or Interesting – Hypothesis Testing ➤ 04_1_statistical_models_hypothesis_testings.ipynb ➤ “Hypothesis Testing With Python” ➤ Identify X-Y Relationships – Regression ➤ 04_2_statistical_models_regression_anova.ipynb
  • 42. More Regression Models ➤ If y is not linear, ➤ Logit or Poisson Regression | Generalized Linear Models, GLMs ➤ If y is correlated, ➤ Linear Mixed Models, LMMs | Generalized Estimating Equation, GEE ➤ If x has multicollinearity, ➤ Lasso or Ridge Regression ➤ If the error term is heteroscedastic, ➤ Weighted Least Squares, WLS | Generalized Least Squares, GLS ➤ If x is a time series (predict x_0 from x_{-1}, rather than y from x), ➤ Autoregressive Integrated Moving Average, ARIMA
  • 43. Checkpoint: Apply a Statistical Method ➤ Try to apply the analysis steps with a statistical method. 1. Define Assumption 2. Validate Assumption 3. Validated Assumption?
  • 45. Machine Learning Models ➤ Apple or Orange? – Classification ➤ 05_1_machine_learning_models_classification.ipynb ➤ Without Labels – Clustering ➤ 05_2_machine_learning_models_clustering.ipynb ➤ Predict the Values – Regression ➤ Who Are the Best? – Model Selection ➤ sklearn.model_selection.GridSearchCV
  • 46. Confusion matrix, where A = 00 (binary) = C[0, 0]
                   predicted - (AC)   predicted + (BD)
   actual - (AB)   true - A           false + B
   actual + (CD)   false - C          true + D
  • 47. Common “rates” in confusion matrix, where AB = A + B, BD = B + D, and CD = C + D ➤ precision = D / BD ➤ recall = D / CD ➤ sensitivity = D / CD = recall = observed power ➤ specificity = A / AB = observed confidence level ➤ false positive rate = B / AB = observed α ➤ false negative rate = C / CD = observed β
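The rates above can be checked by hand with a few lines of Python. This is a sketch with made-up labels; the helper names are hypothetical, and the matrix layout follows the slide's `C[actual][predicted]` convention, so that A = C[0][0], B = C[0][1], C = C[1][0], and D = C[1][1].

```python
def confusion_matrix(actual, predicted):
    """2x2 matrix c where c[i][j] counts samples of actual class i predicted as j."""
    c = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        c[a][p] += 1
    return c


def rates(c):
    """Common rates, using the slide's A, B, C, D cell names."""
    (A, B), (C, D) = c
    return {
        "precision": D / (B + D),
        "recall": D / (C + D),  # a.k.a. sensitivity
        "specificity": A / (A + B),
        "false_positive_rate": B / (A + B),
        "false_negative_rate": C / (C + D),
    }


# Made-up labels: 0 is the negative class, 1 is the positive class.
actual = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

matrix = confusion_matrix(actual, predicted)
result = rates(matrix)
print(matrix)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Note that specificity and the false positive rate always sum to 1, as do recall and the false negative rate, because each pair splits the same row of the matrix.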
  • 48. Ensemble Models ➤ Bagging ➤ N independent models and average their output. ➤ e.g., the random forest models. ➤ Boosting ➤ N sequential models, where the n-th model learns from the (n-1)-th model's error. ➤ e.g., gradient tree boosting.
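The difference between the two ensemble styles can be sketched in pure Python with a deliberately weak learner. Everything here is an illustrative assumption: one-split “decision stumps” stand in for real trees, the data is a noiseless sine curve, and the function names are hypothetical. Real projects would use `sklearn.ensemble` instead.

```python
import math
import random
from statistics import mean


def fit_stump(xs, ys):
    """Weak learner: one threshold, predicting the mean of ys on each side."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lo, hi = mean(left), mean(right)
        sse = sum((y - lo) ** 2 for y in left) + sum((y - hi) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lo, hi)
    if best is None:  # all xs identical: fall back to a constant
        m = mean(ys)
        return lambda x: m
    _, t, lo, hi = best
    return lambda x: lo if x < t else hi


def bagging(xs, ys, n_models=25, seed=0):
    """N independent models on bootstrap samples; average their output."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(xs)) for _ in xs]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean(m(x) for m in models)


def boosting(xs, ys, n_models=25, learning_rate=0.5):
    """N sequential models; each one fits the errors left by the previous ones."""
    base = mean(ys)
    residuals = [y - base for y in ys]
    models = []
    for _ in range(n_models):
        stump = fit_stump(xs, residuals)
        models.append(stump)
        residuals = [r - learning_rate * stump(x) for x, r in zip(xs, residuals)]
    return lambda x: base + learning_rate * sum(m(x) for m in models)


def mse(model, xs, ys):
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))


xs = [i / 10 for i in range(40)]
ys = [math.sin(x) for x in xs]  # made-up target

baseline_mse = mse(lambda x: mean(ys), xs, ys)
stump_mse = mse(fit_stump(xs, ys), xs, ys)
bagged_mse = mse(bagging(xs, ys), xs, ys)
boosted_mse = mse(boosting(xs, ys), xs, ys)
print(f"constant {baseline_mse:.4f}, one stump {stump_mse:.4f}, "
      f"bagged {bagged_mse:.4f}, boosted {boosted_mse:.4f}")
```

These are training errors only, so they show how each ensemble combines its models rather than how well it generalizes.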
  • 49. Checkpoint: Apply a Machine Learning Method ➤ Try to apply the analysis steps with an ML method. 1. Define Assumption 2. Validate Assumption 3. Validated Assumption?
  • 51. Keep Learning ➤ Statistics ➤ Seeing Theory ➤ Biological Statistics ➤ scipy.stats + StatsModels ➤ Research Methods ➤ Machine Learning ➤ Scikit-learn Tutorials ➤ Stanford CS229 ➤ Hsuan-Tien Lin ➤ Deep Learning ➤ TensorFlow | PyTorch ➤ Stanford CS231n ➤ Stanford CS224n ➤ Big Data ➤ Dask ➤ Hive ➤ Spark ➤ HBase ➤ AWS
  • 52. The Facts ➤ ∵ ➤ You can't learn everything in data science! ➤ ∴ ➤ “Let's learn to do” ❌ ➤ “Let's do to learn” ✅
  • 53. The Learning Flow 1. Ask a question. ➤ “How to tell the differences confidently?” 2. Explore the references. ➤ “T-test, ANOVA, ...” 3. Digest into an answer. ➤ Explore breadth-first. ➤ Write the code. ➤ Make it work, make it right, finally make it fast.
  • 54. Recap ➤ Let's do to learn, not learn to do. ➤ What is your objective? ➤ For the objective, what is your assumption? ➤ For the assumption, what method may validate it? ➤ For the method, how will you evaluate it with data? ➤ Q & A