20160607 wids 2016 pub

WIDS 2016 - Seminar Speaker - Pili Hu, Initium Media


  1. 1. Data in A New Media. Pili Hu, CTO, Initium Media. June 7, 2016
  2. 2. Initium Media
  3. 3. A New Media/ An Internet-Native Media
  4. 4. https://theinitium.com/misc/about/
  5. 5. A Glimpse into the NewsRoom
  6. 6. A Glimpse into the Geek Corner
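  7. 7. Sample: Multi-regional Collaborative Reporting. For the 70th anniversary of the War of Resistance, we brought together for the first time veterans of the Nationalist army, the Communist army, Taiwanese who served in the Japanese army, and Chinese who served in the British army... "Six People's 1945" (《六個人的一九四五》)
  8. 8. Sample: In-depth Cross-continental Investigation. We spent 3 months across 4 countries on an exclusive investigation that matters greatly to both China and international relations, yet that no other Chinese-language media could undertake: a factual record of Uyghurs fleeing to the Islamic State. "Exclusive: Uyghurs Fleeing to the Islamic State" (《獨家:維吾爾人外逃伊斯蘭國》)
  9. 9. Sample: Beyond Politics. Amid political events and conflicts in which neither side gives way, we focus on the people beneath the grand narratives. "After the Occupation: A Feature on Movement Trauma" (《佔領之後,運動創傷專題》); "程翔 and 彭泓基: Classmates for Forty Years, Parting Ways Overnight over Their Alma Mater" (《程翔、彭泓基:四十載同窗,為母校一朝割席》)
  10. 10. Sample: Infographics. We make heavy use of original, mobile-friendly infographics to tell stories.
  11. 11. Sample: Macau Gambling Industry. We turned an in-depth, cover-story-style feature into an interactive mobile page combining text, graphics, audio, and video. "The Great Shift in Macau's Gambling Industry" (《澳門賭業大變奏》)
  12. 12. Sample: Digital Presentation. We turned a 4,000-character article on the history of 廢青 ("useless youth") into a rap, which won wide local acclaim. "The Evolution of Hong Kong's 'Useless Youth'" (《香港廢青進化論》)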
  13. 13. Technology
  14. 14. iOS/ Android/ Website
  15. 15. Data-Driven Journalism
  16. 16. Beyond Interview: Story via Data Mining
  17. 17. Beyond Texts: Visualisation to Uncover Trends
  18. 18. Beyond Graphics: Auralization
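  19. 19. Beyond One-way Communication: News Game
      ● 9 questions to test which Taiwan presidential candidate you would vote for
      ● Hong Kong's economic turning point: when "positive non-interventionism" meets "appropriately proactive"
      ● Salary 360 (built on open census data)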
  20. 20. Beyond News: Data at Core + Various Derivatives
  21. 21. Two Case Studies: Sample project & Challenges
  22. 22. The Workflow
  23. 23. Two cases to be shared today: HK Disco, HK Legco
  24. 24. HK District Council Election
  25. 25. Hong Kong District Council (Disco)
      Final output:
      ● Power of each camp:
        ○ https://theinitium.com/project/20151012-hk-district-council-elections/
      ● Power of major parties:
        ○ https://theinitium.com/project/20151019-hk-district-council-elections-2/
      ● Guide for 2015 election:
        ○ https://theinitium.com/project/20151029-hk-district-council-elections-3/desktop.html
  26. 26. Meet the data
      Data:
      ● From 1999 to 2015
      ● # of Candidates: 4392
        ○ Name, occupation, party, camp, votes
      ● # of Constituencies: 2039
        ○ Total votes, voting rate, count of voters, population
  27. 27. Meet the sources
      Methodology:
      ● Automatic
        ○ Scraping
      ● Semi-automatic:
        ○ Copy-and-paste a few tables from the website
        ○ Data cleaning by hand
      ● Manual input from books
        ○ Labour intensive
      ● Investigation
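The semi-automatic step on slide 27 (pulling a few tables off a website, then cleaning them) can be sketched with the Python standard library alone. This is a minimal illustration, not the project's actual scraper: the `SAMPLE` markup, the column names, and the `TableRows` / `parse_table` helpers are all hypothetical, and a real election page would need its own cleaning rules.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return all table rows found in an HTML string."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical markup standing in for a results table copied from a website.
SAMPLE = """
<table>
  <tr><th>Name</th><th>Votes</th></tr>
  <tr><td>Candidate A</td><td>1,234</td></tr>
  <tr><td>Candidate B</td><td>987</td></tr>
</table>
"""

header, *records = parse_table(SAMPLE)
# Cleaning step: strip thousands separators so vote counts become integers.
candidates = [{"name": n, "votes": int(v.replace(",", ""))} for n, v in records]
```

Keeping extraction (`parse_table`) separate from cleaning (the list comprehension) makes it easier to re-run the cleaning when a human reviewer spots a formatting quirk.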
  28. 28. Manpower overview
      | Metric | Value |
      | # of unique participants | 8 |
      | Data collection/ cleaning | 720 man-hours (3 months) |
      | Data validation | 24 man-hours (3 days) |
      | Data analysis | 50 man-hours (6 days) |
      | Project span | 5 months |
      Manpower overview of the large data collection campaign
  29. 29. 16 Years HK District Council Election
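  30. 30. Challenge: Hard to Collect
      Database open sourced: http://initiumlab.com/#database
      | Data | 1999 | 2003 | 2007 | 2011 | 2015 |
      | Personal info (age) | Manual copying from books (3) | Manual copying from books (3) | Manual copying from books (3) | Manual copying from books (3) | Automatic scraping 睇嘢 (0.5) |
      | Personal info (gender, occupation) | Manual copying from books (6) | Manual copying from books (6) | Manual copying from books (6) | Election website/ manual (2) | Election website/ automatic (1) |
      | Political affiliation (party) | Manual copying from books (3) | Manual copying from books (3) | Manual copying from books (3) | Election website/ manual (1) | Election website/ automatic (0.5) |
      | Political affiliation (pan-democratic/ pro-establishment/ other) | Background research + labelling (130) | Background research + labelling (130) | Background research + labelling (130) | Background research + labelling (130) | Background research + labelling (130) |
      | Constituency info (residents, voters, turnout) | Election website/ manual (2.1) | Election website/ manual (2.1) | Election website/ manual (2.1) | Election website/ manual (2.1) | Election website/ automatic (1.1) |
      | Election result (vote share) | Manual copying from books (3) | Manual copying from books (3) | Manual copying from books (3) | Manual copying from books (3) | missing (0) |
      ● Research/ investigation consumes significantly more time
      ● Online accessible/ (semi-) formatted data saves time
      ● Importance of open data and knowledge sharing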
  31. 31. Challenge: Efficiency & Quality
      Misconception: “Manual input is only a problem of labour, not a problem of science”
      - How to use semi-automatic tools to improve efficiency?
      - How to track the data pipeline/ dependency graph?
      - How many points should you sample for data validation?
      - How to maximize the performance of a group of data collectors?
        - In terms of project span?
        - In terms of throughput?
      - How to set up an incentive mechanism to ensure quality?
      - …
      All of these are active research directions.
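One simple way to reason about the sampling question on slide 31 is to ask how many records must be checked so that, if errors occur at a given rate, the sample catches at least one with a chosen probability. The sketch below is only a back-of-the-envelope model, not the method used in the project: it assumes errors are independent and the sample is drawn uniformly at random from a large dataset, and `sample_size_to_detect` is a hypothetical helper name. Under those assumptions the smallest n satisfies (1 - p)^n <= 1 - c.

```python
import math


def sample_size_to_detect(error_rate, confidence):
    """Smallest n such that a random sample of n records contains at least
    one erroneous record with probability >= confidence, assuming each
    record is wrong independently with probability error_rate."""
    if not (0 < error_rate < 1 and 0 < confidence < 1):
        raise ValueError("error_rate and confidence must lie strictly between 0 and 1")
    # Solve (1 - error_rate) ** n <= 1 - confidence for the smallest integer n.
    return math.ceil(math.log(1 - confidence) / math.log(1 - error_rate))


# Example: how many records to check to be 95% sure of seeing at least one
# error if 1% of the records are wrong.
n = sample_size_to_detect(error_rate=0.01, confidence=0.95)
```

The required sample grows quickly as the tolerated error rate shrinks, which is one reason validation effort is worth budgeting explicitly rather than treating as an afterthought.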
  32. 32. HK Legislative Council Voting. SOPA 2016 Excellence Award Winner. More: https://theinitium.com/article/20160615-sopa-awards-2016/
  33. 33. Hong Kong Legislative Council (Legco)
      ● Current term: 17/10/2012 ~ 18/06/2015
      ● 70 members
      ● 12 government departments
      ● 2921 motions
      Structured data set; focus on mining
  34. 34. Video: Legco Voting on Youtube (English) https://www.youtube.com/watch?v=0evK3PtLaUo
  35. 35. Video: Legco Voting on Youtube (Cantonese) https://www.youtube.com/watch?v=KYa-ygjqaV4
  36. 36. Other output of Legco Analysis project
      ● Chinese report + Cantonese animation: https://theinitium.com/article/20150812-hongkong-legcoanalysis/
      ● Interactive Web: http://legco.initiumlab.com
      ● English animation: https://www.youtube.com/watch?v=CExoTvKuXSw
  37. 37. Two cases to be shared today: HK Disco, HK Legco
  38. 38. The Data Analysis Iteration
  39. 39. Core Tech: Dimensionality Reduction (PCA)
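To make the idea on slide 39 concrete, here is a minimal sketch of PCA applied to a voting matrix, written with the standard library only. The slides do not say how votes were encoded or which implementation was used, so everything specific here is an assumption: votes are coded +1 (yes), -1 (no), 0 (abstain/absent), the four members and four motions are made up, and `first_component_scores` is a hypothetical helper that finds the first principal component by power iteration.

```python
import math


def first_component_scores(votes, iterations=200):
    """Project each row (member) onto the first principal component
    of a members-by-motions matrix."""
    n, m = len(votes), len(votes[0])
    # Center each column (motion) so PCA measures deviation from the mean vote.
    means = [sum(row[j] for row in votes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in votes]

    # Power iteration on the scatter matrix X^T X, applied implicitly as
    # w <- X^T (X w), then normalised. The start vector is deterministic.
    w = [1.0 / (j + 1) for j in range(m)]
    for _ in range(iterations):
        scores = [sum(r[j] * w[j] for j in range(m)) for r in centered]
        w = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:  # all rows identical: there is no variance to explain
            return [0.0] * n
        w = [x / norm for x in w]
    return [sum(r[j] * w[j] for j in range(m)) for r in centered]


# Made-up example: members A and B vote alike on the first three motions,
# C and D vote the opposite way; the fourth motion splits both pairs.
members = ["A", "B", "C", "D"]
votes = [
    [1, 1, -1, 1],
    [1, 1, -1, -1],
    [-1, -1, 1, 1],
    [-1, -1, 1, -1],
]
scores = first_component_scores(votes)
# Sorting by score gives one possible one-dimensional ordering of members.
ordering = [name for _, name in sorted(zip(scores, members))]
```

The sign of a principal component is arbitrary, so only the relative positions of members are meaningful: members who vote together land close to each other, and opposing blocs land at opposite ends.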
  40. 40. Challenge: Insights? Impact? Value?
      ● “Value”: High, Difficulty: Low
      ● “Value”: Med, Difficulty: Med
      ● “Value”: Low, Difficulty: High
  41. 41. Challenge: “Value” of Data
      ● Unique source (e.g. Disco -- Time Series)
      ● Interesting data point (e.g. Legco -- Starry Lee against herself)
      ● Good interpretation/ visualisation (e.g. Legco -- Heatmap)
      ● Technically deep analysis (e.g. Legco -- Member ordering)
  42. 42. Data Empowering Production
  43. 43. Challenge: Data Pipelining
      ● Integration:
        ○ Google Analytics
        ○ UMeng
        ○ Fabric/ Crashlytics
        ○ Database
        ○ Server log
        ○ …
      ● Processing:
        ○ Extraction
        ○ Transformation
        ○ Aggregation
        ○ Visualisation
      ● Presentation:
        ○ Visualisation
        ○ Formatting
        ○ Articulation
      Many third-party stats
      A combination of manual, semi-auto and auto integration
      A lot of room for improvement
      Usually deferred until it becomes a must
      Only useful after successful articulation of your findings
  44. 44. https://whatsthebigdata.com/2016/05/01/data-scientists-spend-most-of-their-time-cleaning-data/
  45. 45. Challenge: Environment
      ● Libraries/ Modules
      ● Processing Power
      ● Quick Deployment/ Interactive Exploration
      ● Data-urchin:
        ○ https://github.com/initiumlab/data-urchin
        ○ Python 3
        ○ Selected common libs for data
        ○ BCPs
        ○ Docker Compose
        ○ Samples (still building)
        ○ Still immature; ideas are welcome
  46. 46. Life is short Enjoy Initium...
  47. 47. Enjoy Living https://www.youtube.com/watch?v=WEkFyLzj6yg
  48. 48. Enjoy Reading https://www.youtube.com/watch?v=zFeSh2W1_C8
  49. 49. Enjoy Hacking https://www.youtube.com/watch?v=mZF7_dyuP6Q
  50. 50. Enjoy Sharing
  51. 51. More… http://initiumlab.com/events/
  52. 52. Q/A: Initium? New Media? Data? Lab? We are looking for talent. Check openings: join.init.im. Send CV to: join@init.im
