Hong Kong District Council (Disco)
● Power of each Camp:
● Power of major parties:
● Guide for 2015 election:
Meet the data
● From 1999 to 2015
● # of Candidates: 4392
○ Name, occupation, party, camp, votes
● # of Constituencies: 2039
○ Total votes, voting rate,
count of voters,
○ Copy-and-paste a few tables from the website
○ Data cleaning by hand
● Manual input from books
○ Labour intensive
Meet the sources
# of unique participants: 8
Data collection/ cleaning: 720 man-hours (3 months)
Data validation: 24 man-hours (3 days)
Data analysis: 50 man-hours (6 days)
Project span: 5 months
Manpower overview of the large data collection campaign
Challenge: Hard to Collect
Database open sourced:
1999 2003 2007 2011 2015
Research/ investigation consumes significantly more time
Online accessible/ (semi-) formatted data saves time
Importance of open data and knowledge sharing
Challenge: Efficiency & Quality
“Manual input is only a problem of labour; not a problem of science”
- How to use semi-automatic tools to improve efficiency?
- How to track data pipeline/ dependency graph?
- How many points should you sample for data validation?
- How to maximize the performance of a group of data collectors?
- In terms of project span?
- In terms of throughput?
- How to set up an incentive mechanism to ensure quality?
All of these are active research directions.
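As one possible starting point for the sampling question above, the sketch below uses Cochran's sample-size formula with a finite population correction. This is a textbook technique swapped in for illustration, not a method described in the slides. The function name `validation_sample_size` and the defaults (95% confidence, 5% margin of error, worst-case error proportion p = 0.5) are assumptions; only the population of 4392 candidate records comes from the deck.

```python
import math
from statistics import NormalDist


def validation_sample_size(population, confidence=0.95, margin=0.05, p=0.5):
    """Hypothetical helper: Cochran's formula with finite population correction.

    population: number of manually entered records to be validated
    confidence: desired confidence level (assumed default)
    margin:     tolerated margin of error on the estimated error rate (assumed)
    p:          expected error proportion; 0.5 is the most conservative choice
    """
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Shrink the sample for a finite population of records.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


# 4392 is the number of candidate records reported in the deck.
print(validation_sample_size(4392))  # 354
```

Under these assumptions, spot-checking about 354 of the 4392 candidate records would estimate the data-entry error rate to within 5 percentage points. A tighter margin or a higher confidence level raises the required sample quickly.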
HK Legislative Council Voting
SOPA 2016 Excellence Award Winner
Hong Kong Legislative Council (Legco)
Hong Kong Legislative Council
● Current term: 17/10/2012 ~ 18/06/2015
● 70 members
● 12 government departments
● 2921 motions
Structured data set
Focus on mining
Video: Legco Voting on Youtube (English)
Video: Legco Voting on Youtube (Cantonese)
Other output of Legco Analysis project
● Chinese report + Cantonese animation:
● Interactive Web:
● English animation: https://www.youtube.com/watch?
Two cases to be shared today
Challenge: Insights? Impact? Value?
Challenge: “Value” of Data
(e.g. Disco -- Time Series)
(e.g. Legco -- Starry Lee against
(e.g. Legco -- Heatmap)
Technically Deep Analysis
(e.g. Legco -- Member ordering)
Challenge: Data Pipelining
○ Google Analytics
○ Fabric/ Crashlytics
○ Server log
○ … Many third party stats
A combination of
and auto integration
A lot of room for improvement
Usually deferred until it becomes a must
Only useful after successful articulation of your findings