Learn in 30 Minutes:
Hands-On Python Feature Selection
James CC Huang
Disclaimer
• Implementation only
• No math
• No statistics
Source: Internet
Warming Up
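• Word is that nobody will ask questions at this session (leaving the speaker pinned on stage)
• The original session ran only 40 minutes, but today's sharing was given 2 hours
• A test of my memory and comprehension
• The speaker covered a lot of terminology but no implementation (there was no way to fit it in)
• I implemented the examples in Python
• I hope that those of you who, like me, don't do theory, math, or statistics can go home and use scikit-learn for feature selection just by copying and pasting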
Reinventing the Wheel?
Source: P.60 http://www.slideshare.net/tw_dsconf/ss-62245351
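Doing Machine Learning and Deep Learning…
• Do you actually need to understand the math, statistics, and theory behind them?
• Promoting and popularizing Machine Learning / Deep Learning
• Ease of use of the tools and rapid development
• There are opinions on both sides
• Pro example: on committing to the Master Algorithm, "… you would think this requires heavy mathematics and rigorous theoretical work. On the contrary, what it requires is stepping back from the abstruse mathematical theory to see the overall pattern of learning phenomena." (translated from the Chinese edition, 大演算 The Master Algorithm, p. 40)
• Con example: Deep Neural Networks - A Developmental Perspective (slides, video)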
2014 – 2016
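Taiwan Data Science "Enthusiasts" Conference
What I'm sharing
1. My experience of eating the boxed lunches 3 years in a row
2. What I dreamed up after hearing the 2016 talk "Feature Engineering in Machine Learning"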
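Three Years of Evolution
• More and more attendees
• [Unscientific eyeball estimate] The average age of attendees keeps rising XD
• More content, more sessions
• A shift in who the speakers are: more professors and people from research institutes
• The term "Deep Learning" shows up far more often
• $$ Ticket prices keep climbing
• Moving toward a user-pays model
• Some paid courses will continue to be offered
• The boxed lunches have not evolved (always the same few vendors)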
http://datasci.tw/agenda.php
Feature Engineering in Machine Learning
Session (Speaker: 李俊良)
Source: http://www.slideshare.net/tw_dsconf/feature-engineering-in-machine-learning
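Can Feature Engineering Identify Writing Style?
• Rowling wrote a novel under a pseudonym; sales soared after she was exposed
http://www.bbc.com/zhongwen/trad/uk_study/2013/07/130714_rowling_novel
• "One reviewer called the new book, The Cuckoo's Calling, a 'brilliant debut', and another praised the male author for describing women's clothing so skillfully."
• "… the novel, published (3 months earlier), had sold 1,500 copies. But Amazon reported that after 12 noon on Sunday its sales surged, with a growth rate as high as 500,000%."
• Original slides, p. 14 (Source: http://www.slideshare.net/tw_dsconf/feature-engineering-in-machine-learning)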
Find Word / Doc Similarity with
Deep Learning
Using word2vec and Gensim (Python)
Goal (or Problem to Solve)
• Problem: Tech Support engineers (TS) want to “precisely” categorize
support cases. The task is being performed manually by TS engineers.
• Goal: Automatically categorize support cases.
• What I have:
• 156 classified cases (with “so-called” correct issue categories)
• Support cases in database
• Challenges:
• Based on the data currently available, supervised classification algorithms can't be applied.
• Clustering may not 100% achieve the goal.
• What about Deep Learning?
Gensim (word2vec implementation in Python)
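from os import listdir
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Note: written against gensim 4.x, where TaggedDocument(words, tags)
# replaces the LabeledSentence(words, labels) used on the original slide.
CORPUS_DIR = "../corpora/2016/"
docLabels = [f for f in listdir(CORPUS_DIR) if f.endswith(".txt")]

# Read each file once up front: an open file handle can only be consumed
# one time, but the corpus is iterated again on every training pass.
data = []
for doc in docLabels:
    with open(CORPUS_DIR + doc, "r") as fh:
        data.append(fh.read())

class LabeledLineSentence(object):
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list

    def __iter__(self):
        # Yield one tagged document per file, tagged with its file name
        for idx, doc in enumerate(self.doc_list):
            yield TaggedDocument(words=doc.split(),
                                 tags=[self.labels_list[idx]])
Gensim (Cont’d)
it = LabeledLineSentence(data, docLabels)
model = Doc2Vec(alpha=0.025, min_alpha=0.025)
model.build_vocab(it)
# Ten passes with a manually decayed learning rate, as on the original slide
for epoch in range(10):
    model.train(it, total_examples=model.corpus_count, epochs=1)
    model.alpha -= 0.002
    model.min_alpha = model.alpha
# Find the most similar support cases. Tags are file names, so the key
# includes the ".txt" extension.
print(model.dv.most_similar("00111105.txt"))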
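gensim's most_similar ranks candidates by cosine similarity between vectors. As a rough illustration of that ranking step only (this is not gensim's own implementation), here is a small NumPy sketch. The three case IDs and their 3-dimensional vectors are made up for illustration; real document vectors would come from the trained model.

```python
import numpy as np

# Hypothetical document vectors keyed by made-up case IDs
vectors = {
    "case_a": np.array([1.0, 0.0, 0.0]),
    "case_b": np.array([0.9, 0.1, 0.0]),
    "case_c": np.array([0.0, 1.0, 0.0]),
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(key, vectors, topn=2):
    """Rank every other key by cosine similarity to `key`, highest first."""
    scores = [(other, cosine(vectors[key], vec))
              for other, vec in vectors.items() if other != key]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

result = most_similar("case_a", vectors)
print(result)
```

Here case_b points in almost the same direction as case_a, so it ranks first, while case_c is orthogonal to case_a and scores 0.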
Rumor Has It
• With Deep Learning you don't need to do feature selection, because deep learning decides for you automatically
• From Wikipedia (https://en.wikipedia.org/wiki/Deep_learning):
• “One of the promises of deep learning is replacing handcrafted features with
efficient algorithms for unsupervised or semi-supervised feature learning and
hierarchical feature extraction.”
• Is it really that magical?
Feature selection for Iris Dataset as Example
• Iris dataset attributes
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica
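For anyone who wants to look at these attributes before selecting features, here is a minimal sketch that loads scikit-learn's bundled copy of the dataset. The variable names are my own choice; note that scikit-learn spells the second class "versicolor".

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Four numeric features, one row per flower
print(X.shape)                    # (150, 4)
print(iris.feature_names)         # sepal/petal length and width, in cm
print(list(iris.target_names))    # the three classes, encoded in y as 0, 1, 2
```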
Feature Selection - LASSO
>>> from sklearn.linear_model import Lasso
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> print(X.shape)
(150, 4)
>>> clf = Lasso(alpha=0.01)
>>> # Keep features whose absolute coefficient is at least 0.25
>>> sfm = SelectFromModel(clf, threshold=0.25)
>>> sfm = sfm.fit(X, y)
>>> n_features = sfm.transform(X).shape[1]
>>> print(n_features)
2
Selected features: petal width and petal length
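The slide reports only how many features survive. To see which ones, SelectFromModel exposes a boolean mask through get_support(). A minimal sketch with the same settings as above follows; the exact features selected may depend on the scikit-learn version, so no output is shown.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

iris = load_iris()
X, y = iris.data, iris.target

# Same settings as the slide: keep features whose |coefficient| >= 0.25
sfm = SelectFromModel(Lasso(alpha=0.01), threshold=0.25)
sfm.fit(X, y)

# get_support() returns a boolean mask over the input columns
mask = sfm.get_support()
selected = [name for name, keep in zip(iris.feature_names, mask) if keep]
print(selected)
```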
Feature Selection - LASSO (Cont’d)
>>> from sklearn.preprocessing import StandardScaler
>>> # Standardize so the coefficient magnitudes are comparable across features
>>> scaler = StandardScaler()
>>> X = scaler.fit_transform(X)
>>> names = iris["feature_names"]
>>> lasso = Lasso(alpha=0.01, positive=True)
>>> lasso = lasso.fit(X, y)
>>> print(sorted(zip(map(lambda x: round(float(x), 4),
...                      lasso.coef_), names), reverse=True))
[(0.472, 'petal width (cm)'), (0.3105, 'petal length (cm)'),
 (0.0, 'sepal width (cm)'), (0.0, 'sepal length (cm)')]
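How many coefficients LASSO zeroes out depends on alpha: the larger the penalty, the more features are dropped. The sketch below counts the non-zero coefficients at a few alpha values. The values are arbitrary picks of mine, and unlike the slide it does not set positive=True.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)
y = iris.target

# Count how many coefficients survive at each (arbitrarily chosen) alpha
nonzero_counts = {}
for alpha in (0.001, 0.01, 0.1, 1.0):
    lasso = Lasso(alpha=alpha).fit(X, y)
    nonzero_counts[alpha] = int(np.sum(lasso.coef_ != 0))
    print(alpha, nonzero_counts[alpha])
```

At alpha=1.0 the penalty outweighs every feature's correlation with the target on this data, so all four coefficients end up at zero.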
Feature Selection – Random Forest
>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import RandomForestRegressor
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> print(X.shape)
(150, 4)
>>> names = iris["feature_names"]
>>> # No random_state is set, so the importances vary slightly between runs
>>> rf = RandomForestRegressor()
>>> rf = rf.fit(X, y)
>>> print(sorted(zip(map(lambda x: round(float(x), 4),
...                      rf.feature_importances_), names), reverse=True))
[(0.5073, 'petal width (cm)'), (0.4787, 'petal length (cm)'),
 (0.0091, 'sepal width (cm)'), (0.0049, 'sepal length (cm)')]
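The slide fits a regressor on the class labels. Since the Iris target is categorical, a variant worth trying is RandomForestClassifier with a fixed random_state so the ranking is reproducible. This is a substitution of mine rather than what the slide ran, and the n_estimators and random_state values are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X, y = iris.data, iris.target

# random_state pins the randomness so repeated runs give the same ranking
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# Pair each importance with its feature name and sort, highest first
ranking = sorted(zip(rf.feature_importances_, iris.feature_names), reverse=True)
for importance, name in ranking:
    print(round(float(importance), 4), name)
```

The importances sum to 1, so each value can be read as that feature's share of the forest's total impurity reduction.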
Dimension Reduction - PCA
>>> from sklearn.datasets import load_iris
>>> from sklearn.decomposition import PCA
>>> from sklearn.preprocessing import StandardScaler
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X = StandardScaler().fit_transform(X)
>>> sklearn_pca = PCA(n_components=2)
>>> X_pca = sklearn_pca.fit_transform(X)
>>> # Each row holds one principal component's weights on the four features
>>> print(sklearn_pca.components_)
[[ 0.52237162 -0.26335492  0.58125401  0.56561105]
 [-0.37231836 -0.92555649 -0.02109478 -0.06541577]]
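Unlike the selection methods above, PCA does not pick original columns; it builds new axes as weighted combinations of all four features. To judge whether 2 components are enough, one can check how much variance each one captures. A minimal sketch follows, with variable names of my own.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)   # shape (150, 2): each flower projected onto 2 axes

# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```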
There are many others…
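This session simply implements, hands-on, the methods the original speaker mentioned
I've done the easy ones; the hard ones are left for you to explore~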
Reference
scikit-learn
• Feature selection http://scikit-learn.org/stable/modules/feature_selection.html
• sklearn.linear_model.Lasso http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
• sklearn.decomposition.PCA http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
Gensim
• https://radimrehurek.com/gensim/index.html
HoG (Histogram of Oriented Gradients)
• Python code example http://scikit-image.org/docs/dev/auto_examples/plot_hog.html
