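This course guides those of you who are new to data analysis yet curious about it through a complete workflow in R: collecting data, interpreting it through exploratory analysis, and applying text mining to uncover meaning that is hidden beneath the data and invisible to the naked eye. The course is designed mainly for people who have a basic familiarity with R and want more hands-on analysis practice; we hope that by the end you will be more at home with R as a rich analysis tool. Using the Apple Daily charity donation dataset, you will learn how to parse web pages from scratch and write a crawler that collects information automatically; once the data is in hand, how to handle it flexibly by cleaning, integrating, and exploring it; and how to use existing packages for text mining and text parsing. Step by step, we will walk through a real data analysis process, handling, observing, and deconstructing the data, to see which factors influence people's donation decisions and how those findings are dug out of the data.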
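In an age when data science is all the rage, an engineer curious about new technology will naturally want to expand their arsenal and learn new data analysis tools. R, a scripting language developed by statisticians specifically for data exploration and analysis, is backed by a large open-source community and a dazzling array of tens of thousands of packages, making it a top choice for learning data science tooling today.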
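However, R's design logic differs from that of general-purpose programming languages, and engineers' prior experience with other languages often becomes an obstacle to learning it. This course starts from the fundamentals of R; through lectures and interactive hands-on lab sessions, students will gain a thorough understanding of R's core concepts and essentials, learn how to use R to ask questions of data, and write R code that is both efficient and highly readable from a data analysis perspective.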
Exploratory data analysis is the process of quickly looking at data, formulating hypotheses, and testing those hypotheses. In practice, two of the most important components of this process are transforming data and visualizing it. This tutorial will be a hands-on, practical introduction to using R for data exploration, with an emphasis on data transformation and visualization. I will focus on using modern R packages like ggplot2, dplyr, and tidyr for this tutorial.
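A PhD student at the Graduate Institute of Electrical Engineering, National Taiwan University, who actively promotes R, has organized many R outreach talks, and frequently shares experience with R at the Taiwan R User Group. Has extensive practical experience with R, from data collection and cleaning through analysis to report preparation. Specializes in building R data analysis systems tailored to project requirements and in writing high-performance algorithms in R and C++.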
Overview of a few ways to group and summarize data in R using sample airfare data from DOT/BTS's O&D Survey.
Starts with naive approach with subset() & loops, shows base R's tapply() & aggregate(), highlights doBy and plyr packages.
Presented at the March 2011 meeting of the Greater Boston useR Group.
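The progression described above (loop with subset(), then tapply() and aggregate()) can be sketched in base R as follows. The `fares` data frame and its values are invented for illustration and are not taken from the DOT/BTS survey.

```r
# Hypothetical airfare sample; the values are invented for illustration.
fares <- data.frame(
  origin = c("BOS", "BOS", "BOS", "JFK", "JFK"),
  fare   = c(200, 300, 250, 400, 500),
  stringsAsFactors = FALSE
)

# Naive approach: loop over the groups and subset() each one.
loop_means <- c()
for (o in unique(fares$origin)) {
  loop_means[o] <- mean(subset(fares, origin == o)$fare)
}

# tapply(): apply a function to a vector split by a grouping variable.
tapply_means <- tapply(fares$fare, fares$origin, mean)

# aggregate(): the same summary, returned as a data frame.
agg_means <- aggregate(fare ~ origin, data = fares, FUN = mean)

print(loop_means)
print(tapply_means)
print(agg_means)
```

All three give the same group means; the vectorized forms are shorter and avoid growing a result inside a loop.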
Köhler, Sven, Bertram Ludäscher, and Yannis Smaragdakis. 2012. “Declarative Datalog Debugging for Mere Mortals.” In Datalog in Academia and Industry, edited by Pablo Barceló and Reinhard Pichler, 111–22. Lecture Notes in Computer Science 7494. Springer Berlin Heidelberg. doi:10.1007/978-3-642-32925-8_12.
Abstract. Tracing why a “faulty” fact A is in the model M = P(I) of program P on input I quickly gets tedious, even for small examples. We propose a simple method for debugging and “logically profiling” P by generating a provenance-enriched rewriting P̂, which records rule firings according to the logical semantics. The resulting provenance graph can be easily queried and analyzed using a set of predefined and ad-hoc queries. We have prototypically implemented our approach for two different Datalog engines (DLV and LogicBlox), demonstrating the simplicity, effectiveness, and system-independent nature of our method.
Our fall 12-week Data Science bootcamp starts on Sept 21st, 2015. Apply now to get a spot!
If you are hiring Data Scientists, call us at (1)888-752-7585 or reach info@nycdatascience.com to share your openings and set up interviews with our excellent students.
---------------------------------------------------------------
Come join our meet-up and learn how easily you can use R for advanced machine learning. In this meet-up, we will demonstrate how to understand and use XGBoost for Kaggle competitions. Tong is in Canada and will present remotely via Google Hangout.
---------------------------------------------------------------
Speaker Bio:
Tong is a data scientist at Supstat Inc and also a master's student in Data Mining. He has been an active R programmer and developer for 5 years. He is the author of the XGBoost R package, one of the most popular and contest-winning tools on kaggle.com nowadays.
Prerequisites (if any): R, Calculus
Preparation: A laptop with R installed. Windows users might need to have RTools installed as well.
Agenda:
Introduction to XGBoost
Real World Application
Model Specification
Parameter Introduction
Advanced Features
Kaggle Winning Solution
Event arrangement:
6:45pm Doors open. Come early to network, grab a beer and settle in.
7:00-9:00pm XGBoost Demo
Reference:
https://github.com/dmlc/xgboost
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
40. Summary Functions in R
Function Name      Description
names()            Functions to get or set the names of an object
head(), tail()     Returns the first or last parts of a vector, matrix, table, data frame or function
str()              Compactly display the internal structure of an R object
summary()          Produce result summaries
dim()              Retrieve or set the dimension of an object
length()           Get or set the length of vectors
complete.cases()   Return a logical vector indicating which cases are complete, i.e., have no missing values
as.Date()          Convert between character representations and objects of class "Date" representing calendar dates
EDA - Lecture B-01
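A minimal sketch of the summary functions listed above, applied to a small made-up data frame. The column names and values are assumptions for illustration and are not the course dataset.

```r
# Small made-up data frame standing in for the course data.
d <- data.frame(
  title    = c("a", "b", "c"),
  donation = c(1000, NA, 2500),
  date     = c("2015-01-03", "2015-02-14", "2015-03-21"),
  stringsAsFactors = FALSE
)

print(dim(d))        # number of rows and columns
print(names(d))      # column names
print(head(d, 2))    # first two rows
str(d)               # compact structure of each column
print(summary(d$donation))

# complete.cases() flags the rows that have no missing values.
ok <- complete.cases(d)
d_complete <- d[ok, ]

# as.Date() converts character strings into Date objects.
d$date <- as.Date(d$date)
print(class(d$date))
```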
41. Visualization Functions in R
Function Name   Description
plot()          Generic function for plotting of R objects
boxplot()       Produce box-and-whisker plot(s) of the given (grouped) values
hist()          Computes a histogram of the given data values
barplot()       Creates a bar plot with vertical or horizontal bars
arrows()        Draw arrows between pairs of points
abline()        Add a straight line y = a + bx, where a and b are the intercept and slope (single values)
lines()         Join the corresponding points with line segments
Abbreviations in function and parameter names explained:
http://jeromyanglim.blogspot.tw/2010/05/abbreviations-of-r-commands-explained.html
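The base graphics functions in the table above might be combined as in the sketch below. The data are simulated for illustration, and `pdf(NULL)` is used only so the code runs without a display; in an interactive session that line and `dev.off()` can be dropped.

```r
# Simulated data for illustration only.
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20)
g <- rep(c("A", "B"), each = 10)

pdf(NULL)  # draw to a null device so the sketch runs without a display

plot(x, y)                                   # scatter plot
fit <- lm(y ~ x)
abline(a = coef(fit)[1], b = coef(fit)[2])   # straight line y = a + b * x
lines(x, fitted(fit), lty = 2)               # join the fitted points with segments
arrows(x[1], y[1], x[20], y[20])             # arrow from the first to the last point

boxplot(y ~ g)                               # box-and-whisker plot per group
h <- hist(y)                                 # histogram; also returns the bin counts
barplot(table(g))                            # bar plot of group sizes

invisible(dev.off())
```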
42. session_B_eda.R
# load the Apple Daily article data
> d <- read.csv("df_article.csv", fileEncoding = "UTF-8")
# use dim() to know data frame dimension
> dim(d)
[1] 3784 17
# check the column names
> names(d)
[1] "aid" "case.closed" "circulation"
[4] "date.funded" "date.published" "donation"
[7] "donor" "journalist" "n.fb.comment"
[10] "n.fb.like" "n.fb.share" "n.fb.total"
[13] "n.image" "n.word" "title"
[16] "url.article" "url.detail"
Read in the data and take a look at the variables
43. # use str() to get a brief summary of the data
> str(d)
Use str() to quickly understand the data format
90. > library(jiebaR)
# initialize the segmentation engine
> cutter = worker(bylines = TRUE)
# cooler way to do segmentation: jiebaR overloads `<=` for workers
> article_words = lapply(article_txt, function(x) cutter <= x)
# traditional way
> article_words = lapply(article_txt, function(x) segment(x, cutter))
# check that every article got segmented
> print(length(article_words))
# adjust to the format expected by text2vec::itoken
> article_words = lapply(article_words, '[[', 1)
> save(article_words, file = 'data/list_article_words(jieba).RData')
Word segmentation with jiebaR
Data Miner - Lecture C-01
91. Word vectorization
text2vec
Author: Dmitriy Selivanov (Russian)
Supports state-of-the-art word embeddings (GloVe)
Count-based Model
https://cran.r-project.org/web/packages/text2vec/index.html
word2vec
Developed by the Google Brain Team research group led by Tomas Mikolov
Predictive Model
Data Miner - Lecture C-02
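To make "count-based" concrete: a count-based model starts from a table of how often words occur together, and GloVe then fits word vectors to such counts, whereas a predictive model learns vectors by predicting nearby words. The sketch below builds a tiny co-occurrence table in base R. It is a simplification, not text2vec's implementation: the toy documents are invented, and co-occurrence is counted per document rather than within a sliding window.

```r
# Toy tokenized documents (hypothetical stand-ins for article_words).
docs <- list(
  c("donate", "help", "child"),
  c("help", "child", "school"),
  c("donate", "school")
)

vocab <- sort(unique(unlist(docs)))
tcm <- matrix(0, length(vocab), length(vocab), dimnames = list(vocab, vocab))

# Count how often two different words appear in the same document.
for (doc in docs) {
  w <- unique(doc)
  if (length(w) < 2) next
  pairs <- combn(w, 2)
  for (j in seq_len(ncol(pairs))) {
    a <- pairs[1, j]
    b <- pairs[2, j]
    tcm[a, b] <- tcm[a, b] + 1
    tcm[b, a] <- tcm[b, a] + 1
  }
}

print(tcm)
```

Each row of the table is already a crude word vector; embedding methods compress such rows into far fewer dimensions.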
107. # calculate the ratio of word clusters per article
>
Compute each article's word-cluster ratios
Data Miner - Exercise C-04
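The code for this exercise is not shown on the slide. One possible way to compute such ratios in base R is sketched below; it is not the author's solution. The inputs `article_words` and `word_cluster`, the number of clusters `k`, and the column names `k1`..`k3` are all assumptions made for illustration.

```r
# Hypothetical inputs: segmented words per article and a word -> cluster map.
article_words <- list(
  c("donate", "help", "child", "help"),
  c("school", "child")
)
word_cluster <- c(donate = 1, help = 1, child = 2, school = 3)
k <- 3

# For each article, the share of its words that fall into each cluster.
cluster_ratio <- t(sapply(article_words, function(words) {
  cl <- factor(word_cluster[words], levels = seq_len(k))
  as.vector(table(cl)) / length(words)
}))
colnames(cluster_ratio) <- paste0("k", seq_len(k))

print(cluster_ratio)
```

Words that are missing from the map are counted in the denominator only, so a row can sum to less than 1 when an article contains unclustered words.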
108. # check the correlation between word clusters and
# the variables we care about
> i <- grep('^k|donation|donor|log|n.fb|ttl', names(d))
> View(cor(d[,i]))
Use correlation to observe influence
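A self-contained sketch of the same idea on simulated data. The data frame below is invented, with column names chosen only to mimic the pattern on the slide. Two small differences from the slide are deliberate: the dot in `n\\.fb` is escaped so it matches a literal dot, and the matrix is printed instead of opened with View(), which needs an interactive session.

```r
# Simulated data frame; column names mimic the pattern used on the slide.
set.seed(42)
d <- data.frame(
  k1        = runif(50),
  k2        = runif(50),
  n.fb.like = rpois(50, 20),
  title     = "x",
  stringsAsFactors = FALSE
)
# Make donation depend on k1 so that a clear correlation shows up.
d$donation <- 1000 + 5000 * d$k1 + rnorm(50, sd = 200)

# Select the cluster columns and the variables of interest by name.
i <- grep("^k|donation|donor|log|n\\.fb|ttl", names(d))
m <- cor(d[, i], use = "pairwise.complete.obs")

print(round(m, 2))
```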