Catch Me If You Can:
Detecting Pickpocket Suspects
from Large-scale Transit Records
2019.3.11 Youngmi Huang
Sharing Topic
Agenda
• What problem does the paper solve
• How it solves it
• How to evaluate the solution
• Follow-up applications
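Identify thieves in the public transit system
• Challenges:
(1) The mobility patterns of regular passengers and thieves overlap heavily
(2) The data is imbalanced (1:600 ≈ 0.0017)
• 2016 KDD paper: KDD is a top conference in knowledge discovery
• Contributions of the paper:
(1) Feature construction (transit behavior data + geographic functional zones + thief ground truth collected from social media)
(2) A two-step approach that first detects anomalous behavior and then identifies which of the anomalies are thieves
(3) Beyond analyzing the behavior of potential thieves, it builds a monitoring + early-warning system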
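To make the imbalance concrete, the short sketch below works through the 1:600 ratio from the slide. The trivial "label everyone as regular" classifier is a hypothetical baseline added for illustration, not something taken from the paper; it shows why plain accuracy says little on data this skewed.

```python
# Hypothetical illustration of the 1:600 imbalance on the slide.
thieves = 1
regular = 600
total = thieves + regular

# A trivial baseline (an assumption for illustration) that labels
# every passenger as "regular" and never flags anyone.
true_positives = 0
true_negatives = regular
false_negatives = thieves

accuracy = (true_positives + true_negatives) / total
recall = true_positives / (true_positives + false_negatives)

print(f"thief-to-regular ratio: {thieves / regular:.4f}")  # 0.0017
print(f"accuracy: {accuracy:.4f}, recall: {recall:.1f}")   # 0.9983, 0.0
```

The baseline is right about 99.8% of the time while catching no thieves at all, which is why the later slides judge the model by precision, recall, and F1-score.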
Framework Overview
1. Data sources
2. Feature construction and suspicious behavior analysis
3. Two-stage model
4. Visualization
Framework: Data Source (1/3)
1. Data sources: map data
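Framework: Data Source (2/3)
1. Data sources: transit data (alongside map data)
• Each trip consists of multiple records
• A gap of more than 30 minutes is treated as the start of a new trip
(Figure: one passenger's records grouped into trip1, trip2, trip3)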
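The 30-minute rule can be sketched as a simple grouping pass. This is a minimal sketch, assuming the 30 minutes is measured between consecutive records of the same passenger; the function name `split_into_trips` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

GAP = timedelta(minutes=30)  # threshold from the slide


def split_into_trips(timestamps, gap=GAP):
    """Group one passenger's record timestamps into trips.

    Assumption: a new trip starts whenever the time since the
    previous record exceeds `gap`.
    """
    trips = []
    for ts in sorted(timestamps):
        if trips and ts - trips[-1][-1] <= gap:
            trips[-1].append(ts)   # still within the current trip
        else:
            trips.append([ts])     # gap exceeded: start a new trip
    return trips


# Hypothetical records for a single passenger on one day.
records = [
    datetime(2019, 3, 11, 8, 0),
    datetime(2019, 3, 11, 8, 20),
    datetime(2019, 3, 11, 9, 5),
    datetime(2019, 3, 11, 9, 15),
    datetime(2019, 3, 11, 18, 0),
]
trips = split_into_trips(records)
print([len(t) for t in trips])  # [2, 2, 1]
```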
Framework: Data Source (3/3)
1. Data sources: violation reports (alongside transit data and map data)
• Who • Where the theft took place • When
• Weibo (official posts + disclosures by the public)
• Used as ground truth
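Framework: Mobility Characteristics
2. Feature construction and suspicious behavior analysis
1. More than 80% of passengers ride for less than 2 hours per day, with 2 ride records
2. Regular passengers tend to take as few short rides as possible, but the distributions for fewer than 7 rides vs. fewer than 19 rides are quite different
3. Define the main purpose of each trip (e.g. sightseeing, work, ...)
4. Identify suspicious wandering behavior (via the frequency of appearances in each region)
5. Reasonable travel time between origin and destination (via the standard deviation relative to the general population)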
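As an illustration of the travel-time feature in item 5, the sketch below scores each ride by how many population standard deviations its duration lies from the mean for the same origin-destination pair. The record layout, the function name `travel_time_zscores`, and the sample rides are all hypothetical; the paper's exact feature definition may differ.

```python
from collections import defaultdict
from statistics import mean, pstdev


def travel_time_zscores(rides):
    """Score rides against the population for the same origin-destination pair.

    rides: list of (passenger_id, origin, destination, minutes).
    Returns a list of (passenger_id, origin, destination, z_score).
    """
    by_od = defaultdict(list)
    for _, origin, dest, minutes in rides:
        by_od[(origin, dest)].append(minutes)

    # Mean and population standard deviation per origin-destination pair.
    stats = {od: (mean(v), pstdev(v)) for od, v in by_od.items()}

    scores = []
    for pid, origin, dest, minutes in rides:
        mu, sigma = stats[(origin, dest)]
        z = 0.0 if sigma == 0 else (minutes - mu) / sigma
        scores.append((pid, origin, dest, z))
    return scores


# Hypothetical rides: p5 takes far longer than everyone else from A to B.
rides = [
    ("p1", "A", "B", 20), ("p2", "A", "B", 22), ("p3", "A", "B", 21),
    ("p4", "A", "B", 19), ("p5", "A", "B", 58),
]
scores = travel_time_zscores(rides)
for pid, origin, dest, z in scores:
    print(pid, origin, dest, round(z, 2))
```

A large positive score marks a ride that took unusually long for its route, which is one possible signal of lingering in the system rather than travelling.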
Framework: Two Step Approach (1/2)
3. Two-stage model
step1: Anomaly Detection (One-Class SVM)
The anomalies include both real thieves and regular passengers misjudged as thieves (false positives, FP)
step2: Supervised Classification (SVM)
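The flow of the two steps can be sketched with standard-library stand-ins. Note the swap: the slides use a One-Class SVM for step 1 and a supervised SVM for step 2, whereas this sketch substitutes a z-score threshold and a nearest-centroid classifier so that it runs without extra dependencies. The feature columns (short rides per day, distinct regions visited), the labels, and all numbers are hypothetical.

```python
from statistics import mean, pstdev


def zscore_detector(features, threshold=2.0):
    """Step 1 stand-in (the slides use a One-Class SVM).

    Flags a row if any feature lies more than `threshold` population
    standard deviations from that feature's mean.
    """
    columns = list(zip(*features))
    stats = [(mean(c), pstdev(c) or 1.0) for c in columns]
    return [
        any(abs(x - mu) / sd > threshold for x, (mu, sd) in zip(row, stats))
        for row in features
    ]


def nearest_centroid_classifier(train_rows, train_labels):
    """Step 2 stand-in (the slides use a supervised SVM)."""
    centroids = {}
    for label in set(train_labels):
        rows = [r for r, l in zip(train_rows, train_labels) if l == label]
        centroids[label] = [mean(col) for col in zip(*rows)]

    def predict(row):
        def sq_dist(label):
            return sum((a - b) ** 2 for a, b in zip(row, centroids[label]))
        return min(centroids, key=sq_dist)

    return predict


# Hypothetical features: [short rides per day, distinct regions visited].
regular = [[2, 2], [3, 2], [2, 3], [1, 2]] * 5
unusual = [[15, 9], [14, 10], [12, 2], [2, 11]]
features = regular + unusual

# Step 1: unsupervised pass over all passengers.
flags = zscore_detector(features)
anomalies = [row for row, flagged in zip(features, flags) if flagged]

# Step 2: supervised pass over the anomalies only, using hypothetical
# ground-truth labels (the last two are false positives from step 1).
labels = ["thief", "thief", "regular", "regular"]
predict = nearest_centroid_classifier(anomalies, labels)
print(len(anomalies), predict([13, 8]))
```

The point of the structure is that step 2 only ever sees the small set of flagged passengers, where the thief-to-regular ratio is far less skewed than in the full population.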
Framework: Two Step Approach (2/2)
step1: Anomaly Detection (One-Class SVM)
step2: Supervised Classification (SVM)
(Figure labels: margin, decision plane, non-linear decision boundaries)
Objective function:
\[
\min_{w,\rho} \; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{N} \epsilon_i - \rho
\]
\[
\text{s.t.} \quad \hat{g}(X_i) = \langle w, \phi(X_i) \rangle + \rho \le \epsilon_i \ \text{ and } \ \epsilon_i \ge 0, \quad \text{for all collected passengers } i = 1, 2, \dots, N
\]
and then
\[
\hat{g}(x) = \langle w, \phi(x) \rangle + \rho, \qquad
g(x) =
\begin{cases}
1, & \hat{g}(x) \ge 0 \\
0, & \hat{g}(x) < 0
\end{cases}
\]
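Experiments & Discussion (1/2)
With very few ground-truth thief samples:
(1) Single model: anomaly detection is more effective than a classification model
(2) Two-stage model: first identify anomalies, then classify them with a binary classifier; this improves recall, precision, and F1-score
• Data used: after data cleaning (excluding extreme values), about 1.6 billion ride records and about 6 million passengers
• Model performance: 10-fold cross validation, frac = 0.2
• Is a precision of 7% good?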
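Experiments & Discussion (2/2)
• Is a precision of 7% good? Among the misclassifications, FP is high and FN is low, which means the model would rather flag an innocent passenger than let a thief slip through
• Catching thieves vs. catching high-risk individuals? The data are similar in nature (few samples, diverse patterns of potential default), so later anomaly detection for risk quantification can draw on this paper's approach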
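The trade-off behind a 7% precision can be made concrete with a small calculation. The confusion counts below are hypothetical, chosen only so that precision comes out at the 7% quoted on the slide; they are not results from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: many false positives, very few false negatives.
tp, fp, fn = 7, 93, 1
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With these counts most flagged passengers are innocent (low precision), yet almost no thief is missed (high recall), which is the "rather flag than miss" behavior the slide describes.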
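Application: Prototype & Pattern Discovery
• All passengers / valid group / outliers / suspected thieves
• Real-time traffic heat map
• Locations where suspected thieves appear
• List of suspected thieves
• Interactive movement paths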
THANK YOU
