2. About me
• Education
• NCU (MIS)、NCCU (CS)
• Work Experience
• Telecom big data Innovation
• AI projects
• Retail marketing technology
• User Group
• TW Spark User Group
• TW Hadoop User Group
• Taiwan Data Engineer Association Director
• Research
• Big Data/ ML/ AIOT/ AI Columnist
2
5. 建立環境
• Anaconda on Windows
• https://www.anaconda.com/products/distribution
• Miniconda on Windows
• https://docs.conda.io/en/latest/miniconda.html
5
記得打勾
11. EDA process
• 當我們拿到資料集,如何進行下一步? EDA 就是第一步
• EDA 有助於我們了解資料樣貌
• 總資料筆數、遺缺值比例、遺缺值處理方式、欄位值分布、欄位值合理
性(business domain)
• EDA 有助於事後模型預測
• 進行處理 (normalization與standardization)
11
EDA is an approach to analyzing datasets to summarize their main characteristics,
often with visual methods (wikipedia)
18. Data visualization (圖表類型: relplot)
• Visualizing statistical relationships
• Statistical analysis is a process of understanding how variables in a dataset
relate to each other and how those relationships depend on other variables.
• Visualization can be a core component of this process because, when data are
visualized properly, the human visual system can see trends and patterns that
indicate a relationship.
18
參考: https://www.cntofu.com/book/172/docs/10.md
27. Data visualization (圖表類型: catplot)
• How to use different visual representations to show the relationship
between multiple variables in a dataset.
• We focused on cases where the main relationship was between two
numerical variables. If one of the main variables is categorical
(divided into discrete groups) it may be helpful to use a more
specialized approach to visualization
27
參考: https://www.cntofu.com/book/172/docs/13.md
38. Data visualization (圖表類型: displot)
• What range do the observations cover?
• Are they heavily skewness/kurtosis?
• Is there evidence for bimodality (雙峰)?
38
參考: https://www.cntofu.com/book/172/docs/24.md
雙峰 skewness 與 kurtosis 的計算?