SlideShare a Scribd company logo
1 of 71
The Growth of a Data Scientist
我的資料科學之路
李 育 杰
Data Science and Machine Intelligence Lab
國立交通大學應用數學系
台灣資料科學年會
July 14-17, 2016
Big Data 3 V
台灣資料科學年會演講三要
要有料
要有趣
要有用
 From Data Mining to Big Data
 Some my experiences in Data Science
 Breast Cancer Diagnosis and Prognosis
 Malicious URLs Detection
 露天拍賣詐騙商品偵測
 …
 Final Remarks
Nuclear feature extraction for breast tumor diagnosis,WN Street,WHWolberg, OL Mangasarian
IS&T/SPIE's Symposium on Electronic Imaging: Science andTechnology, 861-870
253 Patients
(113 NoChemo, 140 Chemo)
Cluster 113 NoChemo Patients
Use k-Median Algorithm with Initial Centers:
Medians of Good1 & Poor1
69 NoChemo Good 44 NoChemo Poor 67 Chemo Good 73 Chemo Poor
Good PoorIntermediate
Cluster 140 Chemo Patients
Use k-Median Algorithm with Initial Centers:
Medians of Good1 & Poor1
Good1:
Lymph=0 AND Tumor<2
Compute Median Using 6 Features
Poor1:
Lymph>=5 OR Tumor>=4
Compute Median Using 6 Features
Compute Initial
Cluster Centers
 Privacy is an issue
 Working with domain experts is EXTREMELY important
 Y2K
 少也賤,故能多鄙事
Breast cancer survival and chemotherapy: a support vector machine analysis
YJ Lee, OL Mangasarian,WHWolberg, Discrete Math. Problem with MedicalApplication,
DIMACSWorkshop
Survival-time classification of breast cancer patients
YJ Lee, OL Mangasarian,WHWolberg
ComputationalOptimization andApplications 25 (1-3), 151-166
Can you filter out the benign URLs ONLY based on
the URL stream?
09/10/2013 Lab of Data Science & Machine Intelligence 18
• Malicious websites have become tools for spreading criminal activity
on theWeb
• Phishing
• Malware
09/10/2013 19(a) paypal.com (b) paypal.com-us.cgi-bin-webscr...
• Blacklist service
• PhishTank
• Spam and Open Relay Blocking System (SORBS)
• Real-time URI Blacklist (URIBL)
• Malicious URLs detection
• Google
• Google safe browsing
• Trend Micro
• Web Reputation Service
09/10/2013 Lab of Data Science & Machine Intelligence 20
• URL requests received from users all over the world
• 3,000,000,000 ~ 7,000,000,000 per day
• 200,000,000 ~ 800,000,000 need to be analyzed
• Only 0.01% are malicious URLs
09/10/2013 Lab of Data Science & Machine Intelligence 21
Filtering Mechanism
• No page content need for prioritization
• Prioritization means to return the most suspicious URLs
• No host based information is allowed
• Effectiveness
• Filtering (Download) Rate = Filtered URLs/Total URLs < 25%
• MaliciousCoverage = Filtered Malicious URLs/Total Malicious URLs > 75%
• Performance – Filtering
• > 2000 URLs per second for 1 dual-coreVM with 4GB memory.
• Scalability
• One hour data should be consumed in one hour
09/10/2013 Lab of Data Science & Machine Intelligence 22
 Large scale data streaming
 One million URLs will be received per hour in average
 High dimension and sparse presentation
 Lexical information makes feature vector to become very sparse
 Extremely imbalanced data set
 Only contains about 0.01% malicious URLs
• Malicious URLs usually have very short life time
09/10/2013 Lab of Data Science & Machine Intelligence 23
09/10/2013 Lab of Data Science & Machine Intelligence 24
09/10/2013 Lab of Data Science & Machine Intelligence 25
Filtering Rate ≈ False Positive Rate
MaliciousURLsCoveringRate
25%, if uniformly random
75%, requirement
90%
• Limitations: Can not use
• Host-based information
• Web page content information
• Two types of feature sets are proposed
• Lexical features
• Descriptive features
09/10/2013 Lab of Data Science & Machine Intelligence 26
• The words in a URL string are translated into a Boolean vector
• Each Boolean value represents the occurrence of specific word
• Each URL component is split by specific delimiters and the words
will be saved in a dictionary
09/10/2013 Lab of Data Science & Machine Intelligence 27
• Three character length sliding window on the domain name
• For the malicious websites which slightly modify its domain name
• For reducing memory usage of dictionary:
• Remove zero-weight words
• Remove word form argument value
• Replace IP with AS number (Using static mapping table)
• Replace the digits in word with regular expression
• Example: replace cool567 to cool[0-9]+
• Keep the words generated in the last 24 hours only
09/10/2013 Lab of Data Science & Machine Intelligence 28
 Descriptive features observed from malicious websites
 For detecting the phishing websites
 A Digit between Two Letters (LDL)
 Examp1e
 A Letter between Two Digits (DLD)
 award2o12
 For detecting malware website
 Executable File or Not
09/10/2013 Lab of Data Science & Machine Intelligence 29
 Fraction of domain name
 Categorizing characters to letters, digits and symbols
 Splitting domain name by the connection of different categories
 Summing of longest token length of each category and divides by the domain
name length
09/10/2013 Lab of Data Science & Machine Intelligence 30
 For detecting randomly generated string
 Alphabet Entropy
 Number Rate
 For detecting abnormal phenomenon in URL string
 Length
 Length Ratio
 Letter, digit and symbol count.
 For detecting the common way on URL
 Using IP as Domain Name
 Default Port Number
 For covering the Sparse Features
 Delimiter Count
 The Length of Longest Word
09/10/2013 Lab of Data Science & Machine Intelligence 31
 We choose two online learning algorithms to update the model for
 Saving processing time and memory usage
 Adjusting model from concept drift of data streaming
 Two prospects of features to build two filters
 For descriptive features
 Passive-aggressive algorithm
 For lexical features
 Confident weighted algorithm
 Over-sampling technique for extremely imbalanced data set
09/10/2013 Lab of Data Science & Machine Intelligence 32
•
09/10/2013 Lab of Data Science & Machine Intelligence 33
•
09/10/2013 Lab of Data Science & Machine Intelligence 34
• Data Set
• Measure
• Download Rate (DR)
• (TP + FP) / # of instances
• Missing Malicious Rate (MMR)
• FN / (TP + FN)
09/10/2013 Lab of Data Science & Machine Intelligence 35
• Environment
 CPU : Dual-core (3.00GHz)
 Memory : 4GB OS : Cent OS 64 bit
• Results
09/10/2013 Lab of Data Science & Machine Intelligence 36
 Settings
 Use one hour data for training/updating
 Predict next hour and record the results
 Compute the daily average of DR & MMR
09/10/2013 Lab of Data Science & Machine Intelligence 37
Apr.
09/10/2013 38
Sep.
Nov.-Dec.
 Both of two filters have a certain accuracy
 However, their results are different
 Set the download rate for each filter around 10%
09/10/2013 Lab of Data Science & Machine Intelligence 39
Apr.
Average MMR of
Descriptive Filter
Average MMR of Lexical
Filter
 How to convert URLs stream into n-dimensional vector space
 How to deal with extremely unbalanced data
 Brain storming to define and extract features
 Choose a right learning algorithm
 How to deal with industry
Malicious URL filtering -A big data application
MS Lin, CY Chiu,YJ Lee, HK Pao, Big Data, 2013 IEEE InternationalConference on, 589-596
露天個案研究
利用機器學習輔助審查人員偵測詐騙商品
2016/7/17 Lab of Data Science & Machine Intelligence 42
露天 is the Top 1 in Shopping Category
詐騙事件層出不窮
104年1至12月前10名高風險賣場 165警政署統計資料
排名 賣場名稱 總件數
1 露天拍賣 1418
2 86 小舖 824
3 SHOPPING99 804
4 奇摩拍賣 674
5 HITO本舖 592
6 小三美日 462
7 金石堂網路書店 368
8 衣芙日系 349
9 奇摩超級商城 317
10 樂天 217
165 警政署高風險場排行榜
網路電商都要打假、防詐騙
露天有甚麼機制防範詐騙?
檢測專家遇到的挑戰
• 上架商品數量百萬等級
• 如果24小時全年無休一個人
一秒看一個商品也要十個人
才能消化每天全部上架的商
品
如何聰明地篩選需檢查的商品?
Machine Learning
Major Challenges
• 極端不平衡的資料
– 好人佔大多數極少數壞人
– 但壞人會上傳大量詐騙商品
• 詐騙集團策略會改變
– 詐騙商品的多樣性
– 詐騙商品會隨季節流行改變
• 收集與前處理資料困難
– 釐清需要甚麼資料
– 需要的資料散佈在不同資料來源
– 露天傳統資料儲存系統無法滿足
所需操作
• 資料量龐大
– 每天有百萬級商品上傳
– 需要即時性處理
Get your Hands Dirty
• 訪談審核人員、確認SOP
• 確認所需資料來源
• 確認資料流
• 從不同資料來源中萃取有用資料
• 從審核人員累積的 (Label) 資料中建模
審核專家經驗 – 帳號
• 以使用者帳號為例,詐騙集團假冒正當賣家的方式
– 當露天帳號申請門檻不高時,詐騙集團會大量申
請殭屍帳號,帳號通常會像亂數,由程式產生
– 經由各種手段盜用正常使用者帳號,被盜用的使
用者通常為長久無使用的帳號,所以可比對最後
一次登入日期與上架日期判定
帳號
re9n6LG0
4JEMxCRFwW
jiJo4tpK
7FCXfrgO
2Ntla
5YsJSafSg8
jcshgx
12pm1c
審核專家經驗 – 付款方式與地點
• 以商品所在地為例詐騙集團
避免面交所以常常把商品所
在設在較偏遠地區
• 也不喜歡貨到付款這種付款
方式
• ip 該如何使用?
確認資料來源與資料流
系統架構
露天主要資料
庫
Log檔案
原始資料庫鏡
像
用戶畫像
用戶特徵、商品特徵
額外應用 預測模型
Labeling
檢測專家
如何看一件商品?
Data Set
• Training set: (251827, 20)
• Testing set: (109122, 20)
上架日期 商品類別 商品數目 付款方式 商品價格 商品地點
商品運送
方式
上架方式
使用者帳
號
有無認證 ip
上次上架
日期
是否第一
次上架
1467043200 生活、居家 999 信用卡 520 台南市 7-11取貨 單品上架 xxxxx 有
xxx.xx.xxx.x
xx
2016-06-27
23:59:59
N
1467043200 手機、通訊 999 支付連 1461 台北市 7-11取貨 單品上架 xxxxxxx 無
xxx.xx.xxx.x
xx
2016-06-27
23:59:59
N
1467043200 休閒旅遊 50 貨到付款 19 台南市 貨到付款 單品上架 xxx 有
xx.xxx.xxx.x
xx
2016-06-27
23:59:23
N
Leading Board
AUC
Competition Results
• Model
– Logistic Regression
– SVM
– Gradient Boost Decision Tree
– Naïve Bayes
• Training time: 5~10 min
• Testing time: 1~2 min
從工人智慧到人工智慧
• 審核人員僅能檢查相對少數的資料,即使焚膏繼晷
• Sampling scheme 造成漏網之魚
• 數個月下來,累積的Labelled Data,應該發揮功用!
• We used these labelled data as the “ground truth”
• Our model can achieve 0.983 AUC
• 1,000 items can be examined in a second
• There is a hope to examine EVERY item
Keep the Experts in the Loop
• 審核人員對我們依然重要!
• We use our model to rank the “suspicious level” of item
• We can let 審核人員 examine the “gray” items
• Get the new labelled items from 審核人員
• We can re-train our model with new labelled data
• Tango with bad guys
Data Science Team in Ruten
We are hiring!
I am watching you!
但是…

More Related Content

What's hot

Deep Learning with Python (PyData Seattle 2015)
Deep Learning with Python (PyData Seattle 2015)Deep Learning with Python (PyData Seattle 2015)
Deep Learning with Python (PyData Seattle 2015)Alexander Korbonits
 
A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs)
 A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs) A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs)
A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs)Thomas da Silva Paula
 
Machine learning on Hadoop data lakes
Machine learning on Hadoop data lakesMachine learning on Hadoop data lakes
Machine learning on Hadoop data lakesDataWorks Summit
 
Deep Learning @ ZHAW Datalab (with Mark Cieliebak & Yves Pauchard)
Deep Learning @ ZHAW Datalab (with Mark Cieliebak & Yves Pauchard)Deep Learning @ ZHAW Datalab (with Mark Cieliebak & Yves Pauchard)
Deep Learning @ ZHAW Datalab (with Mark Cieliebak & Yves Pauchard)Thilo Stadelmann
 
Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"NUS-ISS
 
Capitalico / Chart Pattern Matching in Financial Trading Using RNN
Capitalico / Chart Pattern Matching in Financial Trading Using RNNCapitalico / Chart Pattern Matching in Financial Trading Using RNN
Capitalico / Chart Pattern Matching in Financial Trading Using RNNAlpaca
 
[台灣人工智慧學校] 主題演講 - 張智威總經理 (President of HTC DeepQ)
[台灣人工智慧學校] 主題演講 - 張智威總經理 (President of HTC DeepQ)[台灣人工智慧學校] 主題演講 - 張智威總經理 (President of HTC DeepQ)
[台灣人工智慧學校] 主題演講 - 張智威總經理 (President of HTC DeepQ)台灣資料科學年會
 
Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Krishna Sankar
 
Machine Learning for Everyone
Machine Learning for EveryoneMachine Learning for Everyone
Machine Learning for EveryoneDhiana Deva
 
IoT with Azure Machine Learning and InfluxDB
IoT with Azure Machine Learning and InfluxDBIoT with Azure Machine Learning and InfluxDB
IoT with Azure Machine Learning and InfluxDBIvo Andreev
 
Usage of Generative Adversarial Networks (GANs) in Healthcare
Usage of Generative Adversarial Networks (GANs) in HealthcareUsage of Generative Adversarial Networks (GANs) in Healthcare
Usage of Generative Adversarial Networks (GANs) in HealthcareGlobalLogic Ukraine
 
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017MLconf
 
Generative Adversarial Networks and Their Applications
Generative Adversarial Networks and Their ApplicationsGenerative Adversarial Networks and Their Applications
Generative Adversarial Networks and Their ApplicationsArtifacia
 
Neural Networks and Deep Learning for Physicists
Neural Networks and Deep Learning for PhysicistsNeural Networks and Deep Learning for Physicists
Neural Networks and Deep Learning for PhysicistsHéloïse Nonne
 
Webinar: Deep Learning with H2O
Webinar: Deep Learning with H2OWebinar: Deep Learning with H2O
Webinar: Deep Learning with H2OSri Ambati
 
ESM Machine learning 5주차 Review by Mario Cho
ESM Machine learning 5주차 Review by Mario ChoESM Machine learning 5주차 Review by Mario Cho
ESM Machine learning 5주차 Review by Mario ChoMario Cho
 
Tutorial on Theory and Application of Generative Adversarial Networks
Tutorial on Theory and Application of Generative Adversarial NetworksTutorial on Theory and Application of Generative Adversarial Networks
Tutorial on Theory and Application of Generative Adversarial NetworksMLReview
 
Azure Machine Learning and ML on Premises
Azure Machine Learning and ML on PremisesAzure Machine Learning and ML on Premises
Azure Machine Learning and ML on PremisesIvo Andreev
 

What's hot (20)

Deep Learning with Python (PyData Seattle 2015)
Deep Learning with Python (PyData Seattle 2015)Deep Learning with Python (PyData Seattle 2015)
Deep Learning with Python (PyData Seattle 2015)
 
A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs)
 A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs) A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs)
A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs)
 
Learning End to End
Learning End to EndLearning End to End
Learning End to End
 
Machine learning on Hadoop data lakes
Machine learning on Hadoop data lakesMachine learning on Hadoop data lakes
Machine learning on Hadoop data lakes
 
Deep Learning @ ZHAW Datalab (with Mark Cieliebak & Yves Pauchard)
Deep Learning @ ZHAW Datalab (with Mark Cieliebak & Yves Pauchard)Deep Learning @ ZHAW Datalab (with Mark Cieliebak & Yves Pauchard)
Deep Learning @ ZHAW Datalab (with Mark Cieliebak & Yves Pauchard)
 
Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"
 
Capitalico / Chart Pattern Matching in Financial Trading Using RNN
Capitalico / Chart Pattern Matching in Financial Trading Using RNNCapitalico / Chart Pattern Matching in Financial Trading Using RNN
Capitalico / Chart Pattern Matching in Financial Trading Using RNN
 
[台灣人工智慧學校] 主題演講 - 張智威總經理 (President of HTC DeepQ)
[台灣人工智慧學校] 主題演講 - 張智威總經理 (President of HTC DeepQ)[台灣人工智慧學校] 主題演講 - 張智威總經理 (President of HTC DeepQ)
[台灣人工智慧學校] 主題演講 - 張智威總經理 (President of HTC DeepQ)
 
Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)
 
Dato Keynote
Dato KeynoteDato Keynote
Dato Keynote
 
Machine Learning for Everyone
Machine Learning for EveryoneMachine Learning for Everyone
Machine Learning for Everyone
 
IoT with Azure Machine Learning and InfluxDB
IoT with Azure Machine Learning and InfluxDBIoT with Azure Machine Learning and InfluxDB
IoT with Azure Machine Learning and InfluxDB
 
Usage of Generative Adversarial Networks (GANs) in Healthcare
Usage of Generative Adversarial Networks (GANs) in HealthcareUsage of Generative Adversarial Networks (GANs) in Healthcare
Usage of Generative Adversarial Networks (GANs) in Healthcare
 
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
 
Generative Adversarial Networks and Their Applications
Generative Adversarial Networks and Their ApplicationsGenerative Adversarial Networks and Their Applications
Generative Adversarial Networks and Their Applications
 
Neural Networks and Deep Learning for Physicists
Neural Networks and Deep Learning for PhysicistsNeural Networks and Deep Learning for Physicists
Neural Networks and Deep Learning for Physicists
 
Webinar: Deep Learning with H2O
Webinar: Deep Learning with H2OWebinar: Deep Learning with H2O
Webinar: Deep Learning with H2O
 
ESM Machine learning 5주차 Review by Mario Cho
ESM Machine learning 5주차 Review by Mario ChoESM Machine learning 5주차 Review by Mario Cho
ESM Machine learning 5주차 Review by Mario Cho
 
Tutorial on Theory and Application of Generative Adversarial Networks
Tutorial on Theory and Application of Generative Adversarial NetworksTutorial on Theory and Application of Generative Adversarial Networks
Tutorial on Theory and Application of Generative Adversarial Networks
 
Azure Machine Learning and ML on Premises
Azure Machine Learning and ML on PremisesAzure Machine Learning and ML on Premises
Azure Machine Learning and ML on Premises
 

Viewers also liked

「資料視覺化」有志一同場次 at 2016 台灣資料科學年會
「資料視覺化」有志一同場次 at 2016 台灣資料科學年會「資料視覺化」有志一同場次 at 2016 台灣資料科學年會
「資料視覺化」有志一同場次 at 2016 台灣資料科學年會台灣資料科學年會
 
[系列活動] 資料探勘速遊 - Session4 case-studies
[系列活動] 資料探勘速遊 - Session4 case-studies[系列活動] 資料探勘速遊 - Session4 case-studies
[系列活動] 資料探勘速遊 - Session4 case-studies台灣資料科學年會
 
[系列活動] Machine Learning 機器學習課程
[系列活動] Machine Learning 機器學習課程[系列活動] Machine Learning 機器學習課程
[系列活動] Machine Learning 機器學習課程台灣資料科學年會
 
[系列活動] 手把手教你R語言資料分析實務
[系列活動] 手把手教你R語言資料分析實務[系列活動] 手把手教你R語言資料分析實務
[系列活動] 手把手教你R語言資料分析實務台灣資料科學年會
 
手把手教你 R 語言資料分析實務/張毓倫&陳柏亨
手把手教你 R 語言資料分析實務/張毓倫&陳柏亨手把手教你 R 語言資料分析實務/張毓倫&陳柏亨
手把手教你 R 語言資料分析實務/張毓倫&陳柏亨台灣資料科學年會
 
[系列活動] Data exploration with modern R
[系列活動] Data exploration with modern R[系列活動] Data exploration with modern R
[系列活動] Data exploration with modern R台灣資料科學年會
 
[系列活動] 給工程師的統計學及資料分析 123
[系列活動] 給工程師的統計學及資料分析 123[系列活動] 給工程師的統計學及資料分析 123
[系列活動] 給工程師的統計學及資料分析 123台灣資料科學年會
 
[系列活動] 智慧製造與生產線上的資料科學 (製造資料科學:從預測性思維到處方性決策)
[系列活動] 智慧製造與生產線上的資料科學 (製造資料科學:從預測性思維到處方性決策)[系列活動] 智慧製造與生產線上的資料科學 (製造資料科學:從預測性思維到處方性決策)
[系列活動] 智慧製造與生產線上的資料科學 (製造資料科學:從預測性思維到處方性決策)台灣資料科學年會
 
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹台灣資料科學年會
 
不會寫程式的人友善上手機器學習-淺談 Azure machine learning studio
不會寫程式的人友善上手機器學習-淺談 Azure machine learning studio不會寫程式的人友善上手機器學習-淺談 Azure machine learning studio
不會寫程式的人友善上手機器學習-淺談 Azure machine learning studioR Ladies Taipei
 
Big-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunitiesBig-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunities台灣資料科學年會
 
資料科學的第一堂課 Data Science Orientation
資料科學的第一堂課 Data Science Orientation資料科學的第一堂課 Data Science Orientation
資料科學的第一堂課 Data Science OrientationRyan Chung
 
吳齊軒/漫談 R 的學習挑戰與 R 語言翻轉教室
吳齊軒/漫談 R 的學習挑戰與 R 語言翻轉教室吳齊軒/漫談 R 的學習挑戰與 R 語言翻轉教室
吳齊軒/漫談 R 的學習挑戰與 R 語言翻轉教室台灣資料科學年會
 
機率統計 -- 使用 R 軟體
機率統計 -- 使用 R 軟體機率統計 -- 使用 R 軟體
機率統計 -- 使用 R 軟體鍾誠 陳鍾誠
 
R統計軟體 -安裝與使用
R統計軟體 -安裝與使用R統計軟體 -安裝與使用
R統計軟體 -安裝與使用Person Lin
 
R統計軟體簡介
R統計軟體簡介R統計軟體簡介
R統計軟體簡介Person Lin
 
[DSC 2016] 系列活動:許懷中 / R 語言資料探勘實務
[DSC 2016] 系列活動:許懷中 / R 語言資料探勘實務[DSC 2016] 系列活動:許懷中 / R 語言資料探勘實務
[DSC 2016] 系列活動:許懷中 / R 語言資料探勘實務台灣資料科學年會
 

Viewers also liked (20)

「資料視覺化」有志一同場次 at 2016 台灣資料科學年會
「資料視覺化」有志一同場次 at 2016 台灣資料科學年會「資料視覺化」有志一同場次 at 2016 台灣資料科學年會
「資料視覺化」有志一同場次 at 2016 台灣資料科學年會
 
[系列活動] 資料探勘速遊 - Session4 case-studies
[系列活動] 資料探勘速遊 - Session4 case-studies[系列活動] 資料探勘速遊 - Session4 case-studies
[系列活動] 資料探勘速遊 - Session4 case-studies
 
[系列活動] Machine Learning 機器學習課程
[系列活動] Machine Learning 機器學習課程[系列活動] Machine Learning 機器學習課程
[系列活動] Machine Learning 機器學習課程
 
[系列活動] 手把手教你R語言資料分析實務
[系列活動] 手把手教你R語言資料分析實務[系列活動] 手把手教你R語言資料分析實務
[系列活動] 手把手教你R語言資料分析實務
 
手把手教你 R 語言資料分析實務/張毓倫&陳柏亨
手把手教你 R 語言資料分析實務/張毓倫&陳柏亨手把手教你 R 語言資料分析實務/張毓倫&陳柏亨
手把手教你 R 語言資料分析實務/張毓倫&陳柏亨
 
[系列活動] Data exploration with modern R
[系列活動] Data exploration with modern R[系列活動] Data exploration with modern R
[系列活動] Data exploration with modern R
 
[系列活動] 給工程師的統計學及資料分析 123
[系列活動] 給工程師的統計學及資料分析 123[系列活動] 給工程師的統計學及資料分析 123
[系列活動] 給工程師的統計學及資料分析 123
 
[系列活動] 智慧製造與生產線上的資料科學 (製造資料科學:從預測性思維到處方性決策)
[系列活動] 智慧製造與生產線上的資料科學 (製造資料科學:從預測性思維到處方性決策)[系列活動] 智慧製造與生產線上的資料科學 (製造資料科學:從預測性思維到處方性決策)
[系列活動] 智慧製造與生產線上的資料科學 (製造資料科學:從預測性思維到處方性決策)
 
[系列活動] 機器學習速遊
[系列活動] 機器學習速遊[系列活動] 機器學習速遊
[系列活動] 機器學習速遊
 
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹
 
不會寫程式的人友善上手機器學習-淺談 Azure machine learning studio
不會寫程式的人友善上手機器學習-淺談 Azure machine learning studio不會寫程式的人友善上手機器學習-淺談 Azure machine learning studio
不會寫程式的人友善上手機器學習-淺談 Azure machine learning studio
 
Big-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunitiesBig-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunities
 
資料科學的第一堂課 Data Science Orientation
資料科學的第一堂課 Data Science Orientation資料科學的第一堂課 Data Science Orientation
資料科學的第一堂課 Data Science Orientation
 
吳齊軒/漫談 R 的學習挑戰與 R 語言翻轉教室
吳齊軒/漫談 R 的學習挑戰與 R 語言翻轉教室吳齊軒/漫談 R 的學習挑戰與 R 語言翻轉教室
吳齊軒/漫談 R 的學習挑戰與 R 語言翻轉教室
 
新手村-資料探索
新手村-資料探索新手村-資料探索
新手村-資料探索
 
機率統計 -- 使用 R 軟體
機率統計 -- 使用 R 軟體機率統計 -- 使用 R 軟體
機率統計 -- 使用 R 軟體
 
第一場預測
第一場預測第一場預測
第一場預測
 
R統計軟體 -安裝與使用
R統計軟體 -安裝與使用R統計軟體 -安裝與使用
R統計軟體 -安裝與使用
 
R統計軟體簡介
R統計軟體簡介R統計軟體簡介
R統計軟體簡介
 
[DSC 2016] 系列活動:許懷中 / R 語言資料探勘實務
[DSC 2016] 系列活動:許懷中 / R 語言資料探勘實務[DSC 2016] 系列活動:許懷中 / R 語言資料探勘實務
[DSC 2016] 系列活動:許懷中 / R 語言資料探勘實務
 

Similar to Growth of a Data Scientist

陸永祥/全球網路攝影機帶來的機會與挑戰
陸永祥/全球網路攝影機帶來的機會與挑戰陸永祥/全球網路攝影機帶來的機會與挑戰
陸永祥/全球網路攝影機帶來的機會與挑戰台灣資料科學年會
 
Mehr und schneller ist nicht automatisch besser - data2day, 06.10.16
Mehr und schneller ist nicht automatisch besser - data2day, 06.10.16Mehr und schneller ist nicht automatisch besser - data2day, 06.10.16
Mehr und schneller ist nicht automatisch besser - data2day, 06.10.16Boris Adryan
 
BsidesLVPresso2016_JZeditsv6
BsidesLVPresso2016_JZeditsv6BsidesLVPresso2016_JZeditsv6
BsidesLVPresso2016_JZeditsv6Rod Soto
 
Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist SoftServe
 
Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Raja Chiky
 
#ITsubbotnik Spring 2017: Dmitrii Nikitko "Deep learning for understanding of...
#ITsubbotnik Spring 2017: Dmitrii Nikitko "Deep learning for understanding of...#ITsubbotnik Spring 2017: Dmitrii Nikitko "Deep learning for understanding of...
#ITsubbotnik Spring 2017: Dmitrii Nikitko "Deep learning for understanding of...epamspb
 
Titles with Abstracts_2023-2024_Data Mining.pdf
Titles with Abstracts_2023-2024_Data Mining.pdfTitles with Abstracts_2023-2024_Data Mining.pdf
Titles with Abstracts_2023-2024_Data Mining.pdfinfo751436
 
Tsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaTsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaDataStax Academy
 
Scaling Document Clustering in the Cloud
Scaling Document Clustering in the CloudScaling Document Clustering in the Cloud
Scaling Document Clustering in the CloudRob Gillen
 
Fast Data Mining: Real Time Knowledge Discovery for Predictive Decision Making
Fast Data Mining: Real Time Knowledge Discovery for Predictive Decision MakingFast Data Mining: Real Time Knowledge Discovery for Predictive Decision Making
Fast Data Mining: Real Time Knowledge Discovery for Predictive Decision MakingCodemotion
 
Democratization - New Wave of Data Science (홍운표 상무, DataRobot) :: AWS Techfor...
Democratization - New Wave of Data Science (홍운표 상무, DataRobot) :: AWS Techfor...Democratization - New Wave of Data Science (홍운표 상무, DataRobot) :: AWS Techfor...
Democratization - New Wave of Data Science (홍운표 상무, DataRobot) :: AWS Techfor...Amazon Web Services Korea
 
Fraud Analytics with Machine Learning and Big Data Engineering for Telecom
Fraud Analytics with Machine Learning and Big Data Engineering for TelecomFraud Analytics with Machine Learning and Big Data Engineering for Telecom
Fraud Analytics with Machine Learning and Big Data Engineering for TelecomSudarson Roy Pratihar
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data SciencePouria Amirian
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data SciencePouria Amirian
 
Rise of the machines -- Owasp israel -- June 2014 meetup
Rise of the machines -- Owasp israel -- June 2014 meetupRise of the machines -- Owasp israel -- June 2014 meetup
Rise of the machines -- Owasp israel -- June 2014 meetupShlomo Yona
 
Delivering Security Insights with Data Analytics and Visualization
Delivering Security Insights with Data Analytics and VisualizationDelivering Security Insights with Data Analytics and Visualization
Delivering Security Insights with Data Analytics and VisualizationRaffael Marty
 
Big data and computing grid
Big data and computing gridBig data and computing grid
Big data and computing gridThang Nguyen
 
High-performance database technology for rock-solid IoT solutions
High-performance database technology for rock-solid IoT solutionsHigh-performance database technology for rock-solid IoT solutions
High-performance database technology for rock-solid IoT solutionsClusterpoint
 
Towards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsTowards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsMarina Santini
 

Similar to Growth of a Data Scientist (20)

陸永祥/全球網路攝影機帶來的機會與挑戰
陸永祥/全球網路攝影機帶來的機會與挑戰陸永祥/全球網路攝影機帶來的機會與挑戰
陸永祥/全球網路攝影機帶來的機會與挑戰
 
Mehr und schneller ist nicht automatisch besser - data2day, 06.10.16
Mehr und schneller ist nicht automatisch besser - data2day, 06.10.16Mehr und schneller ist nicht automatisch besser - data2day, 06.10.16
Mehr und schneller ist nicht automatisch besser - data2day, 06.10.16
 
BsidesLVPresso2016_JZeditsv6
BsidesLVPresso2016_JZeditsv6BsidesLVPresso2016_JZeditsv6
BsidesLVPresso2016_JZeditsv6
 
Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist
 
Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014
 
#ITsubbotnik Spring 2017: Dmitrii Nikitko "Deep learning for understanding of...
#ITsubbotnik Spring 2017: Dmitrii Nikitko "Deep learning for understanding of...#ITsubbotnik Spring 2017: Dmitrii Nikitko "Deep learning for understanding of...
#ITsubbotnik Spring 2017: Dmitrii Nikitko "Deep learning for understanding of...
 
Titles with Abstracts_2023-2024_Data Mining.pdf
Titles with Abstracts_2023-2024_Data Mining.pdfTitles with Abstracts_2023-2024_Data Mining.pdf
Titles with Abstracts_2023-2024_Data Mining.pdf
 
Tsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaTsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in China
 
Scaling Document Clustering in the Cloud
Scaling Document Clustering in the CloudScaling Document Clustering in the Cloud
Scaling Document Clustering in the Cloud
 
Fast Data Mining: Real Time Knowledge Discovery for Predictive Decision Making
Fast Data Mining: Real Time Knowledge Discovery for Predictive Decision MakingFast Data Mining: Real Time Knowledge Discovery for Predictive Decision Making
Fast Data Mining: Real Time Knowledge Discovery for Predictive Decision Making
 
Democratization - New Wave of Data Science (홍운표 상무, DataRobot) :: AWS Techfor...
Democratization - New Wave of Data Science (홍운표 상무, DataRobot) :: AWS Techfor...Democratization - New Wave of Data Science (홍운표 상무, DataRobot) :: AWS Techfor...
Democratization - New Wave of Data Science (홍운표 상무, DataRobot) :: AWS Techfor...
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
Fraud Analytics with Machine Learning and Big Data Engineering for Telecom
Fraud Analytics with Machine Learning and Big Data Engineering for TelecomFraud Analytics with Machine Learning and Big Data Engineering for Telecom
Fraud Analytics with Machine Learning and Big Data Engineering for Telecom
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data Science
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data Science
 
Rise of the machines -- Owasp israel -- June 2014 meetup
Rise of the machines -- Owasp israel -- June 2014 meetupRise of the machines -- Owasp israel -- June 2014 meetup
Rise of the machines -- Owasp israel -- June 2014 meetup
 
Delivering Security Insights with Data Analytics and Visualization
Delivering Security Insights with Data Analytics and VisualizationDelivering Security Insights with Data Analytics and Visualization
Delivering Security Insights with Data Analytics and Visualization
 
Big data and computing grid
Big data and computing gridBig data and computing grid
Big data and computing grid
 
High-performance database technology for rock-solid IoT solutions
High-performance database technology for rock-solid IoT solutionsHigh-performance database technology for rock-solid IoT solutions
High-performance database technology for rock-solid IoT solutions
 
Towards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsTowards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology Applications
 

More from 台灣資料科學年會

[台灣人工智慧學校] 人工智慧技術發展與應用
[台灣人工智慧學校] 人工智慧技術發展與應用[台灣人工智慧學校] 人工智慧技術發展與應用
[台灣人工智慧學校] 人工智慧技術發展與應用台灣資料科學年會
 
[台灣人工智慧學校] 執行長報告
[台灣人工智慧學校] 執行長報告[台灣人工智慧學校] 執行長報告
[台灣人工智慧學校] 執行長報告台灣資料科學年會
 
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰台灣資料科學年會
 
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機台灣資料科學年會
 
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機台灣資料科學年會
 
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話台灣資料科學年會
 
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇台灣資料科學年會
 
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察 [TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察 台灣資料科學年會
 
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵台灣資料科學年會
 
[台灣人工智慧學校] 從經濟學看人工智慧產業應用
[台灣人工智慧學校] 從經濟學看人工智慧產業應用[台灣人工智慧學校] 從經濟學看人工智慧產業應用
[台灣人工智慧學校] 從經濟學看人工智慧產業應用台灣資料科學年會
 
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告台灣資料科學年會
 
[台中分校] 第一期結業典禮 - 執行長談話
[台中分校] 第一期結業典禮 - 執行長談話[台中分校] 第一期結業典禮 - 執行長談話
[台中分校] 第一期結業典禮 - 執行長談話台灣資料科學年會
 
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人台灣資料科學年會
 
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維台灣資料科學年會
 
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察台灣資料科學年會
 
[TOxAIA新竹分校] 深度學習與Kaggle實戰
[TOxAIA新竹分校] 深度學習與Kaggle實戰[TOxAIA新竹分校] 深度學習與Kaggle實戰
[TOxAIA新竹分校] 深度學習與Kaggle實戰台灣資料科學年會
 
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT台灣資料科學年會
 
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達台灣資料科學年會
 
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳台灣資料科學年會
 

More from 台灣資料科學年會 (20)

[台灣人工智慧學校] 人工智慧技術發展與應用
[台灣人工智慧學校] 人工智慧技術發展與應用[台灣人工智慧學校] 人工智慧技術發展與應用
[台灣人工智慧學校] 人工智慧技術發展與應用
 
[台灣人工智慧學校] 執行長報告
[台灣人工智慧學校] 執行長報告[台灣人工智慧學校] 執行長報告
[台灣人工智慧學校] 執行長報告
 
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰
 
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
 
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
 
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話
 
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇
 
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察 [TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察
 
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵
 
[台灣人工智慧學校] 從經濟學看人工智慧產業應用
[台灣人工智慧學校] 從經濟學看人工智慧產業應用[台灣人工智慧學校] 從經濟學看人工智慧產業應用
[台灣人工智慧學校] 從經濟學看人工智慧產業應用
 
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告
 
台灣人工智慧學校成果發表會
台灣人工智慧學校成果發表會台灣人工智慧學校成果發表會
台灣人工智慧學校成果發表會
 
[台中分校] 第一期結業典禮 - 執行長談話
[台中分校] 第一期結業典禮 - 執行長談話[台中分校] 第一期結業典禮 - 執行長談話
[台中分校] 第一期結業典禮 - 執行長談話
 
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人
 
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維
 
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察
 
[TOxAIA新竹分校] 深度學習與Kaggle實戰
[TOxAIA新竹分校] 深度學習與Kaggle實戰[TOxAIA新竹分校] 深度學習與Kaggle實戰
[TOxAIA新竹分校] 深度學習與Kaggle實戰
 
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT
 
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達
 
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳
 

Recently uploaded

RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookvip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookmanojkuma9823
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 

Recently uploaded (20)

RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookvip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 

Growth of a Data Scientist

  • 1. The Growth of a Data Scientist 我的資料科學之路 李 育 杰 Data Science and Machine Intelligence Lab 國立交通大學應用數學系 台灣資料科學年會 July 14-17, 2016
  • 4.  From Data Mining to Big Data  Some my experiences in Data Science  Breast Cancer Diagnosis and Prognosis  Malicious URLs Detection  露天拍賣詐騙商品偵測  …  Final Remarks
  • 5.
  • 6.
  • 7.
  • 8.
  • 9. Nuclear feature extraction for breast tumor diagnosis,WN Street,WHWolberg, OL Mangasarian IS&T/SPIE's Symposium on Electronic Imaging: Science andTechnology, 861-870
  • 10.
  • 11.
  • 12.
  • 13.
  • 14. 253 Patients (113 NoChemo, 140 Chemo) Cluster 113 NoChemo Patients Use k-Median Algorithm with Initial Centers: Medians of Good1 & Poor1 69 NoChemo Good 44 NoChemo Poor 67 Chemo Good 73 Chemo Poor Good PoorIntermediate Cluster 140 Chemo Patients Use k-Median Algorithm with Initial Centers: Medians of Good1 & Poor1 Good1: Lymph=0 AND Tumor<2 Compute Median Using 6 Features Poor1: Lymph>=5 OR Tumor>=4 Compute Median Using 6 Features Compute Initial Cluster Centers
  • 15.
  • 16.
  • 17.  Privacy is an issue  Working with domain experts is EXTREMELY important  Y2K  少也賤,故能多鄙事 Breast cancer survival and chemotherapy: a support vector machine analysis YJ Lee, OL Mangasarian,WHWolberg, Discrete Math. Problem with MedicalApplication, DIMACSWorkshop Survival-time classification of breast cancer patients YJ Lee, OL Mangasarian,WHWolberg ComputationalOptimization andApplications 25 (1-3), 151-166
  • 18. Can you filter out the benign URLs ONLY based on the URL stream? 09/10/2013 Lab of Data Science & Machine Intelligence 18
  • 19. • Malicious websites have become tools for spreading criminal activity on theWeb • Phishing • Malware 09/10/2013 19(a) paypal.com (b) paypal.com-us.cgi-bin-webscr...
  • 20. • Blacklist service • PhishTank • Spam and Open Relay Blocking System (SORBS) • Real-time URI Blacklist (URIBL) • Malicious URLs detection • Google • Google safe browsing • Trend Micro • Web Reputation Service 09/10/2013 Lab of Data Science & Machine Intelligence 20
  • 21. • URL requests received from users all over the world • 3,000,000,000 ~ 7,000,000,000 per day • 200,000,000 ~ 800,000,000 need to be analyzed • Only 0.01% are malicious URLs 09/10/2013 Lab of Data Science & Machine Intelligence 21 Filtering Mechanism
  • 22. • No page content need for prioritization • Prioritization means to return the most suspicious URLs • No host based information is allowed • Effectiveness • Filtering (Download) Rate = Filtered URLs/Total URLs < 25% • MaliciousCoverage = Filtered Malicious URLs/Total Malicious URLs > 75% • Performance – Filtering • > 2000 URLs per second for 1 dual-coreVM with 4GB memory. • Scalability • One hour data should be consumed in one hour 09/10/2013 Lab of Data Science & Machine Intelligence 22
  • 23.  Large scale data streaming  One million URLs will be received per hour in average  High dimension and sparse presentation  Lexical information makes feature vector to become very sparse  Extremely imbalanced data set  Only contains about 0.01% malicious URLs • Malicious URLs usually have very short life time 09/10/2013 Lab of Data Science & Machine Intelligence 23
  • 24. 09/10/2013 Lab of Data Science & Machine Intelligence 24
  • 25. 09/10/2013 Lab of Data Science & Machine Intelligence 25 Filtering Rate ≈ False Positive Rate MaliciousURLsCoveringRate 25%, if uniformly random 75%, requirement 90%
  • 26. • Limitations: Can not use • Host-based information • Web page content information • Two types of feature sets are proposed • Lexical features • Descriptive features 09/10/2013 Lab of Data Science & Machine Intelligence 26
  • 27. • The words in a URL string are translated into a Boolean vector • Each Boolean value represents the occurrence of specific word • Each URL component is split by specific delimiters and the words will be saved in a dictionary 09/10/2013 Lab of Data Science & Machine Intelligence 27
  • 28. • Three character length sliding window on the domain name • For the malicious websites which slightly modify its domain name • For reducing memory usage of dictionary: • Remove zero-weight words • Remove word form argument value • Replace IP with AS number (Using static mapping table) • Replace the digits in word with regular expression • Example: replace cool567 to cool[0-9]+ • Keep the words generated in the last 24 hours only 09/10/2013 Lab of Data Science & Machine Intelligence 28
  • 29.  Descriptive features observed from malicious websites  For detecting the phishing websites  A Digit between Two Letters (LDL)  Examp1e  A Letter between Two Digits (DLD)  award2o12  For detecting malware website  Executable File or Not 09/10/2013 Lab of Data Science & Machine Intelligence 29
  • 30.  Fraction of domain name  Categorizing characters to letters, digits and symbols  Splitting domain name by the connection of different categories  Summing of longest token length of each category and divides by the domain name length 09/10/2013 Lab of Data Science & Machine Intelligence 30
  • 31.  For detecting randomly generated string  Alphabet Entropy  Number Rate  For detecting abnormal phenomenon in URL string  Length  Length Ratio  Letter, digit and symbol count.  For detecting the common way on URL  Using IP as Domain Name  Default Port Number  For covering the Sparse Features  Delimiter Count  The Length of Longest Word 09/10/2013 Lab of Data Science & Machine Intelligence 31
  • 32.  We choose two online learning algorithms to update the model for  Saving processing time and memory usage  Adjusting model from concept drift of data streaming  Two prospects of features to build two filters  For descriptive features  Passive-aggressive algorithm  For lexical features  Confident weighted algorithm  Over-sampling technique for extremely imbalanced data set 09/10/2013 Lab of Data Science & Machine Intelligence 32
  • 33. • 09/10/2013 Lab of Data Science & Machine Intelligence 33
  • 34. • 09/10/2013 Lab of Data Science & Machine Intelligence 34
  • 35. • Data Set • Measure • Download Rate (DR) • (TP + FP) / # of instances • Missing Malicious Rate (MMR) • FN / (TP + FN) 09/10/2013 Lab of Data Science & Machine Intelligence 35
  • 36. • Environment  CPU : Dual-core (3.00GHz)  Memory : 4GB OS : Cent OS 64 bit • Results 09/10/2013 Lab of Data Science & Machine Intelligence 36
  • 37.  Settings  Use one hour data for training/updating  Predict next hour and record the results  Compute the daily average of DR & MMR 09/10/2013 Lab of Data Science & Machine Intelligence 37 Apr.
  • 39.  Both of two filters have a certain accuracy  However, their results are different  Set the download rate for each filter around 10% 09/10/2013 Lab of Data Science & Machine Intelligence 39 Apr. Average MMR of Descriptive Filter Average MMR of Lexical Filter
  • 40.  How to convert URLs stream into n-dimensional vector space  How to deal with extremely unbalanced data  Brain storming to define and extract features  Choose a right learning algorithm  How to deal with industry Malicious URL filtering -A big data application MS Lin, CY Chiu,YJ Lee, HK Pao, Big Data, 2013 IEEE InternationalConference on, 589-596
  • 41.
  • 43. 露天 is the Top 1 in Shopping Category
  • 44.
  • 46. 104年1至12月前10名高風險賣場 165警政署統計資料 排名 賣場名稱 總件數 1 露天拍賣 1418 2 86 小舖 824 3 SHOPPING99 804 4 奇摩拍賣 674 5 HITO本舖 592 6 小三美日 462 7 金石堂網路書店 368 8 衣芙日系 349 9 奇摩超級商城 317 10 樂天 217 165 警政署高風險場排行榜
  • 49.
  • 52. Major Challenges • 極端不平衡的資料 – 好人佔大多數極少數壞人 – 但壞人會上傳大量詐騙商品 • 詐騙集團策略會改變 – 詐騙商品的多樣性 – 詐騙商品會隨季節流行改變 • 收集與前處理資料困難 – 釐清需要甚麼資料 – 需要的資料散佈在不同資料來源 – 露天傳統資料儲存系統無法滿足 所需操作 • 資料量龐大 – 每天有百萬級商品上傳 – 需要即時性處理
  • 53. Get your Hands Dirty • 訪談審核人員、確認SOP • 確認所需資料來源 • 確認資料流 • 從不同資料來源中萃取有用資料 • 從審核人員累積的 (Label) 資料中建模
  • 54. 審核專家經驗 – 帳號 • 以使用者帳號為例,詐騙集團假冒正當賣家的方式 – 當露天帳號申請門檻不高時,詐騙集團會大量申 請殭屍帳號,帳號通常會像亂數,由程式產生 – 經由各種手段盜用正常使用者帳號,被盜用的使 用者通常為長久無使用的帳號,所以可比對最後 一次登入日期與上架日期判定 帳號 re9n6LG0 4JEMxCRFwW jiJo4tpK 7FCXfrgO 2Ntla 5YsJSafSg8 jcshgx 12pm1c
  • 55. 審核專家經驗 – 付款方式與地點 • 以商品所在地為例詐騙集團 避免面交所以常常把商品所 在設在較偏遠地區 • 也不喜歡貨到付款這種付款 方式 • ip 該如何使用?
  • 59. Data Set • Training set: (251827, 20) • Testing set: (109122, 20) 上架日期 商品類別 商品數目 付款方式 商品價格 商品地點 商品運送 方式 上架方式 使用者帳 號 有無認證 ip 上次上架 日期 是否第一 次上架 1467043200 生活、居家 999 信用卡 520 台南市 7-11取貨 單品上架 xxxxx 有 xxx.xx.xxx.x xx 2016-06-27 23:59:59 N 1467043200 手機、通訊 999 支付連 1461 台北市 7-11取貨 單品上架 xxxxxxx 無 xxx.xx.xxx.x xx 2016-06-27 23:59:59 N 1467043200 休閒旅遊 50 貨到付款 19 台南市 貨到付款 單品上架 xxx 有 xx.xxx.xxx.x xx 2016-06-27 23:59:23 N
  • 61. Competition Results • Model – Logistic Regression – SVM – Gradient Boost Decision Tree – Naïve Bayes • Training time: 5~10 min • Testing time: 1~2 min
  • 62. 從工人智慧到人工智慧 • 審核人員僅能檢查相對少數的資料,即使焚膏繼晷 • Sampling scheme 造成漏網之魚 • 數個月下來,累積的Labelled Data,應該發揮功用! • We used these labelled data as the “ground truth” • Our model can achieve 0.983 AUC • 1,000 items can be examined in a second • There is a hope to examine EVERY item
  • 63. Keep the Experts in the Loop • 審核人員對我們依然重要! • We use our model to rank the “suspicious level” of item • We can let 審核人員 examine the “gray” items • Get the new labelled items from 審核人員 • We can re-train our model with new labelled data • Tango with bad guys
  • 64. Data Science Team in Ruten We are hiring!
  • 65.
  • 66.
  • 67.
  • 68.
  • 69.