Data Analysis - Making Big Data Work
David Chiu 
2014/11/24
About Me 
Founder of LargitData 
Ex-Trend Micro Engineer 
ywchiu.com
Big Data & Data Science
US Election Prediction 
World Cup Prediction
Hurricane Prediction
Data Science 
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
As a Data Scientist, You Need to Know That Much? Seriously?
Statistics
Univariate, Multivariate, ANOVA
Data Munging 
Data Extraction, Transformation, Loading 
Data Visualization 
Figure, Business Intelligence 
Required Skills
What You Probably Need Is A Team 
Business Analyst: knows how to use different tools under different circumstances
Statistician: how to process big data?
DBA: how to deal with unstructured data?
Software Engineer: knows how to use statistics
Four Dimensions
Single Machine: Memory, R, Local File
Cloud: Distributed, Hadoop, HDFS
Statistics: Analysis, Linear Algebra
Architect: Management, Standard
Concept: MapReduce, Linear Algebra, Logistic Regression
Tool: Hadoop, PostgreSQL, R
Analyst: How to use these tools
Hackers: R, Python, Java
“80% are doing summing and averaging” 
Content 
1. Data Munging
2. Data Analysis
3. Interpret Results
What Do Data Scientists Do?
Application of Data Analysis 
Text Mining 
Classify Spam Mail 
Build Index 
Data Search Engine 
Social Network Analysis 
Finding Opinion Leaders
Recommendation System
What do users like?
Opinion Mining 
Positive/Negative Opinion 
Fraud Analysis 
Credit Card Fraud
Feed Data to the Computer
Make the Computer Do the Analysis
Let the Computer Predict for You
Predictive Analysis 
Learn from experience (data) to predict future behavior
What to Predict?
e.g. Who is likely to click on that ad?
For What?
e.g. Decide which ad to show based on click probability and revenue.
Predictive Analysis
Do customers who buy beer also buy Pampers?
People who browse telephone rate plans are likely to switch vendors
People in the same group tend to use the same telecom vendor
Surprising Conclusions
Based on personal behavior, a predictive model uses personal characteristics to generate a probability score: the higher the score, the more likely the behavior.
Predictive Model
Linear Model 
e.g. For a cosmetics ad, we can give a 90% weight to female customers and a 10% weight to male customers. Multiplying by the click probability (15%) gives the score (or probability) for each group:
Female 13.5%, Male 1.5%
Rule Model
e.g.
If the user is female
And income is over 30k
And the user has not seen the ad yet
Then the click rate is 11%
Simple Predictive Model
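The two simple models above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation: the function names, the dictionary layout, and the `None` result for users the rule does not cover are assumptions. Only the weights (90%/10%), the 15% click probability, the 30k income threshold, and the 11% click rate come from the slide.

```python
# Linear model: score = segment weight * overall click probability.
SEGMENT_WEIGHTS = {"female": 0.90, "male": 0.10}  # weights from the slide
CLICK_PROBABILITY = 0.15                          # click probability from the slide


def linear_score(gender):
    """Return the click score for a customer segment."""
    return SEGMENT_WEIGHTS[gender] * CLICK_PROBABILITY


def rule_score(gender, income, has_seen_ad):
    """Return 0.11 when every condition of the rule holds, else None.

    Returning None for users the rule does not cover is an assumption;
    the slide only states the click rate for the matching case.
    """
    if gender == "female" and income > 30_000 and not has_seen_ad:
        return 0.11
    return None


if __name__ == "__main__":
    for gender in ("female", "male"):
        print(f"{gender}: {linear_score(gender):.1%}")
    print(rule_score("female", 45_000, has_seen_ad=False))
```

The linear model scores everyone, while the rule model only speaks for the segment it describes, which is why real systems usually combine many rules or weights.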
Induction 
From the specific to the general
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E 
-- Tom Mitchell (1998) 
Discover an effective model 
Start from a simple model 
Update the model as new data is fed in
Keep improving predictive power
Machine Learning
Statistical Analysis
Regression Analysis 
Clustering 
Classification 
Recommendation 
Text Mining 
Application 
Image recognition
Decision Tree 
Rate > 1,299/Month 
Probability to switch vendor 15% 
Probability to switch vendor 3% 
Yes 
No
Decision Tree 
Rate > 1,299/Month 
Probability to switch vendor 3% 
Yes 
No 
Probability to switch vendor 10% 
Probability to switch vendor 22% 
Income>22,000 
Yes 
No
Decision Tree 
Rate > 1,299/Month 
Yes 
No 
Probability to switch vendor 10% 
Probability to switch vendor 22% 
Income>22,000 
Yes 
No 
Probability to switch vendor 1% 
Probability to switch vendor 7% 
Free for intranet 
Yes 
No
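A decision tree like the one built up over the last three slides is a set of nested rules. The sketch below writes the final tree as a Python function. The split tests and the four leaf probabilities come from the slides, but the flattened transcript does not show which branch leads to which leaf, so the branch assignments here are assumptions, as are the function and parameter names.

```python
def switch_probability(monthly_rate, income, free_intranet):
    """Estimate the probability that a customer switches vendor.

    Branch-to-leaf assignments are assumed for illustration; only the
    thresholds and the leaf values (10%, 22%, 1%, 7%) are from the slides.
    """
    if monthly_rate > 1299:
        # Assumed: high-rate customers are split further by income.
        return 0.10 if income > 22000 else 0.22
    # Assumed: the remaining customers are split by the "Free for intranet" test.
    return 0.01 if free_intranet else 0.07


if __name__ == "__main__":
    print(switch_probability(monthly_rate=1500, income=20000, free_intranet=False))
```

Each added split replaces one coarse leaf (15% or 3% on the first slide) with two more specific ones, which is how the tree grows from slide to slide.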
Supervised Learning 
Regression 
Classification 
Unsupervised Learning 
Dimension Reduction 
Clustering 
Machine Learning
Supervised Learning
Classification 
e.g. Predicting a bull or bear stock market
Regression 
e.g. Price prediction 
Supervised Learning
Dimension Reduction 
e.g. Making a new index 
Clustering 
e.g. Customer Segmentation 
Unsupervised Learning
Lift 
The better the lift, the greater the cost?
The more decision rules, the more campaigns?
Design strategies for different personas?
The lift for 4 campaigns?
The lift for 20 campaigns?
Lift
Can we use the production rate of butter to predict stock market? 
Overfitting
Treating noise as information
Over-assumption
Over-interpretation
What an overfitted model learns is not the truth
Like memorizing all the answers to a single test
Overfitting
Testing Model 
Use external data or a held-out part of the data as the testing dataset
Traditional Analysis Tool
Statistics On The Fly 
Built-in Math and Graphics Functions
Free and Open Source 
http://cran.r-project.org/src/base/ 
R Language 
Functional Programming 
Use Function Definition To Retrieve Answer 
Interpreted Language 
Statistics On the Fly 
Object Oriented Language 
S3 and S4 Methods
R Language
Most Used Analytic Language 
The most popular languages are R, Python (39%), SQL (37%), and SAS (20%).
By Gregory Piatetsky, Aug 27, 2013.
Kaggle 
http://www.kaggle.com/ 
The most often used language in Kaggle competitions
Data Scientists at Google and Apple Use R
What is your programming language of choice, R, Python or something else? 
“I use R, and occasionally matlab, for data analysis. There is a large, active and extremely knowledgeable R community at Google.” 
http://simplystatistics.org/2013/02/15/interview-with-nick-chamandy-statistician-at-google/ 
“Expert knowledge of SAS (With Enterprise Guide/Miner) required and candidates with strong knowledge of R will be preferred” 
http://www.kdnuggets.com/jobs/13/03-29-apple-sr-data-scientist.html?utm_source=twitterfeed&utm_medium=facebook&utm_campaign=tfb&utm_content=FaceBook&utm_term=analytics#.UVXibgXOpfc.facebook
Which customers are likely to churn?
Customer Churn Analysis
Account Information 
state 
account length
area code 
phone number 
User Behavior 
international plan 
voice mail plan, number vmail messages 
total day minutes, total day calls, total day charge 
total eve minutes, total eve calls, total eve charge 
total night minutes, total night calls, total night charge 
total intl minutes, total intl calls, total intl charge 
number customer service calls 
Target 
Churn (Yes/No) 
Data Description
install.packages("C50")
library(C50)
data(churn)
str(churnTrain)
# Drop columns that should not be used as predictors
churnTrain <- churnTrain[, !names(churnTrain) %in% c("state", "area_code", "account_length")]
set.seed(2)
# Randomly assign each row to group 1 (about 70%) or group 2 (about 30%)
ind <- sample(2, nrow(churnTrain), replace = TRUE, prob = c(0.7, 0.3))
trainset <- churnTrain[ind == 1, ]
testset <- churnTrain[ind == 2, ]
Split data into training and testing datasets
70% as training dataset
30% as testing dataset
library(rpart)
# Fit a classification tree that predicts churn from all other columns
churn.rp <- rpart(churn ~ ., data = trainset)
plot(churn.rp, margin = 0.1)
text(churn.rp, all = TRUE, use.n = TRUE)
Build Classifier
Classification
# Predict the class label for each row of the testing dataset
predictions <- predict(churn.rp, testset, type = "class")
# Rows are the actual churn values, columns are the predicted values
table(testset$churn, predictions)
Prediction Result
       pred
       no    yes
no     859   18
yes    41    100
library(caret)
confusionMatrix(table(predictions, testset$churn))
Confusion Matrix and Statistics
predictions  yes   no
        yes  100   18
        no    41  859
Accuracy : 0.942
95% CI : (0.9259, 0.9556)
No Information Rate : 0.8615
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7393
Mcnemar's Test P-Value : 0.004181
Sensitivity : 0.70922
Specificity : 0.97948
Pos Pred Value : 0.84746
Neg Pred Value : 0.95444
Prevalence : 0.13851
Detection Rate : 0.09823
Detection Prevalence : 0.11591
Balanced Accuracy : 0.84435
'Positive' Class : yes
Use Confusion Matrix
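The headline statistics in the confusion matrix output follow directly from the four cell counts. The sketch below recomputes them in plain Python, with "yes" (churn) as the positive class, so that 100 is the true positives, 18 the false positives, 41 the false negatives, and 859 the true negatives. The function name and dictionary keys are illustrative and are not part of caret.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Derive common classification metrics from confusion matrix counts."""
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)  # share of actual positives that were found
    specificity = tn / (tn + fp)  # share of actual negatives that were found
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),
        "neg_pred_value": tn / (tn + fn),
        "prevalence": (tp + fn) / total,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


if __name__ == "__main__":
    # Counts from the slide, with "yes" (churn) as the positive class.
    metrics = confusion_metrics(tp=100, fp=18, fn=41, tn=859)
    for name, value in metrics.items():
        print(f"{name}: {value:.5f}")
```

Accuracy is high (about 94%) largely because most customers do not churn; the sensitivity of about 71% shows that roughly three in ten churners are still missed.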
Use Testing Data to Validate Result
library(ROCR)
# Class probabilities; column 1 is the probability of churn = "yes"
predictions <- predict(churn.rp, testset, type = "prob")
pred.to.roc <- predictions[, 1]
# The last column of testset is the churn label
pred.rocr <- prediction(pred.to.roc, as.factor(testset[, (dim(testset)[[2]])]))
perf.rocr <- performance(pred.rocr, measure = "auc", x.measure = "cutoff")
perf.tpr.rocr <- performance(pred.rocr, "tpr", "fpr")
# Plot the ROC curve and show the AUC in the title
plot(perf.tpr.rocr, colorize = TRUE, main = paste("AUC:", (perf.rocr@y.values)))
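The AUC shown in the plot title can be read as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below computes it that way in plain Python. The function name and the tiny score list are hypothetical and are not taken from the churn data.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. labels are True for the positive class.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical churn scores; True marks customers who actually churned.
    scores = [0.9, 0.8, 0.4, 0.3, 0.2]
    labels = [True, False, True, False, False]
    print(auc(scores, labels))
```

An AUC of 0.5 means the scores rank churners no better than chance, and 1.0 means every churner is ranked above every non-churner. The pairwise loop is quadratic, which is fine for a small illustration but not for large test sets.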
Finding Most Important Variable
library(rminer)
# Fit an SVM, then rank the input variables by sensitivity analysis
model <- fit(churn ~ ., trainset, model = "svm")
VariableImportance <- Importance(model, trainset, method = "sensv")
L <- list(runs = 1, sen = t(VariableImportance$imp), sresponses = VariableImportance$sresponses)
mgraph(L, graph = "IMP", leg = names(trainset), col = "gray", Grid = 10)
Dynamic Language 
Execution at runtime 
Dynamic Type 
Interpreted Language 
See the result after execution 
OOP 
Python Language 
Cross-Platform (Python VM)
Third-Party Resources
(Data Analysis, Graphics, Website Development)
Simple and easy to learn
Benefits of Python
Data Analysis 
SciPy
NumPy
scikit-learn
pandas
Companies That Use Python
Use InfoLite Tool To Extract DOM
Use Python To Build Up Dashboard
Monitor Social Media and News 
Monitor posts on social media
Configure keywords and alerts
Use a line plot to show daily post statistics
Apple Daily, nownews, udn, Central News Agency, Storm Media, and other financial media
Daily Statistics Report 
Examine Associated Articles
Configure Alert and Keyword 
Configure Monitor Channel 
Track Specific Article 
Have You Learned Big Data? 
The 3Vs of Big Data
Product Centric 
Customer Centric 
Product Centric vs. Customer Centric
Customer Centric? 
http://goo.gl/iuy4lY
Personal Recommendation
Knowing Who You Are? 
Personal recommendation 
Customer relation management 
Knowing What the Future Looks Like?
From history, we can see the future
Predictive analysis 
Knowing What is Hidden Beneath? 
Correlation, Correlation, Correlation 
So… What is Big Data?
So… How To Analyze?
Apache Project – From Yahoo 
Feature 
Extensible 
Cost Effective 
Flexible 
Highly Fault Tolerant
Hadoop
Hadoop Ecosystem
HDFS 
MR 
IMPALA 
HBASE 
PIG 
HIVE 
SQOOP FLUME 
HUE, Oozie, Mahout
Tools for Different Scales
Size | Classification | Tools
Lines | Sample Data | Analysis and Visualisation: Whiteboard, Bash, ...
KBs – low MBs | Prototype Data | Analysis and Visualisation: Matlab, Octave, R, Processing, Bash, ...
MBs – low GBs | Online Data | Storage: MySQL (DBs), ...; Analysis: NumPy, SciPy, Pandas, Weka, ...; Visualisation: Flare, AmCharts, Raphael
GBs – TBs – PBs | Big Data | Storage: HDFS, HBase, Cassandra, ...; Analysis: Hive, Giraph, Hama, Mahout
Amazon
Facebook
Recommendation System 
Javascript 
Flume 
HDFS 
HBase 
Pig 
Mahout
Item-Based
User-Based
Monitor User Rating
Send User Behavior to Backend
Use Flume To Collect Streaming Data 
From /tmp/postlog.txt To /user/cloudera/flume
JSON sample data 
{"food":"Tacos", "person":"Alice", "amount":3} 
{"food":"Tomato Soup", "person":"Sarah", "amount":2} 
{"food":"Grilled Cheese", "person":"Alex", "amount":5} 
Demo Code 
second_table = LOAD 'second_table.json' 
USING JsonLoader('food:chararray, person:chararray, amount:int'); 
Use Pig To Load JSON
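For comparison, the same one-object-per-line JSON records can be read outside Hadoop with Python's standard `json` module. This sketch mirrors the Pig schema (food, person, amount). The function name and the in-memory sample are assumptions made for illustration, and it is not a replacement for the Pig step in the pipeline.

```python
import json

# The three sample records from the slide, one JSON object per line.
SAMPLE = """\
{"food":"Tacos", "person":"Alice", "amount":3}
{"food":"Tomato Soup", "person":"Sarah", "amount":2}
{"food":"Grilled Cheese", "person":"Alex", "amount":5}
"""


def load_records(text):
    """Parse one JSON object per line into (food, person, amount) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        # Cast each field to the type declared in the Pig schema.
        records.append((str(row["food"]), str(row["person"]), int(row["amount"])))
    return records


if __name__ == "__main__":
    for record in load_records(SAMPLE):
        print(record)
```

Declaring the field types up front, as the Pig `JsonLoader` schema string does, keeps later steps from having to guess whether `amount` is text or a number.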
Build Recommendation Model
$ hbase shell
> create 'mydata', 'mycf'
Build Table In HBase
Examine Data In HDFS
Use Pig To Transfer Data Into HBase
Examine Data In HBase
Build API
Recommendation System
Focus on algorithms
Divide and Conquer, Trie, Collaborative Filtering
Be an expert in a single programming language
But know which tools and algorithms you can use to solve your problem
Define your role 
Statistician 
Software engineer 
What You Should Do
Website: 
largitdata.com 
ywchiu.com 
Email: 
david@largitdata.com 
tr.ywchiu@gmail.com 
Contacts
 
ML Times: Mainframe Machine Learning Initiative- June newsletter (2018)
ML Times: Mainframe Machine Learning Initiative- June newsletter (2018)ML Times: Mainframe Machine Learning Initiative- June newsletter (2018)
ML Times: Mainframe Machine Learning Initiative- June newsletter (2018)
 
Machine Learning and Remarketing
Machine Learning and RemarketingMachine Learning and Remarketing
Machine Learning and Remarketing
 
What's New in Predictive Analytics IBM SPSS
What's New in Predictive Analytics IBM SPSSWhat's New in Predictive Analytics IBM SPSS
What's New in Predictive Analytics IBM SPSS
 
What's New in Predictive Analytics IBM SPSS - Apr 2016
What's New in Predictive Analytics IBM SPSS - Apr 2016What's New in Predictive Analytics IBM SPSS - Apr 2016
What's New in Predictive Analytics IBM SPSS - Apr 2016
 
Intro to Data Analytics with Oscar's Director of Product
 Intro to Data Analytics with Oscar's Director of Product Intro to Data Analytics with Oscar's Director of Product
Intro to Data Analytics with Oscar's Director of Product
 
Mixed Methods Research in the Age of Big Data: A Primer for UX Researchers
Mixed Methods Research in the Age of Big Data: A Primer for UX ResearchersMixed Methods Research in the Age of Big Data: A Primer for UX Researchers
Mixed Methods Research in the Age of Big Data: A Primer for UX Researchers
 
UXPA 2016: Mixed Methods Research in the Age of Big Data
UXPA 2016: Mixed Methods Research in the Age of Big DataUXPA 2016: Mixed Methods Research in the Age of Big Data
UXPA 2016: Mixed Methods Research in the Age of Big Data
 
Data science tutorial
Data science tutorialData science tutorial
Data science tutorial
 
Big data and Marketing by Edward Chenard
Big data and Marketing by Edward ChenardBig data and Marketing by Edward Chenard
Big data and Marketing by Edward Chenard
 

Recently uploaded

Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 

Recently uploaded (20)

Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 

Data Analysis - Making Big Data Work

  • 1. Data Analysis Making Big Data Work David Chiu 2014/11/24
  • 2. About Me Founder of LargitData Ex-Trend Micro Engineer ywchiu.com
  • 3. Big Data & Data Science
  • 8.
  • 9. Being A Data Scientist, You Need to Know That Much? Seriously?
  • 10. Statistic Single Variable、Multi Variable、ANOVA Data Munging Data Extraction, Transformation, Loading Data Visualization Figure, Business Intelligence Required Skills
  • 11. What You Probably Need Is A Team Business Analyst Knowing how to use different tools under different circumstance Statistician How to process big data? DBA How to deal with unstructured data Software Engineer Knowing how to user statistics
  • 12. Four Dimension 12 Single Machine Memory R Local File Cloud Distributed Hadoop HDFS Statistics Analysis Linear Algebra Architect Management Standard Concept MapReduce Linear Algebra Logistic Regression Tool Hadoop PostgreSQL R Analyst How to use these tools Hackers R Python Java
  • 13. “80% are doing summing and averaging” Content 1. Data Munging 2. Data Analysis 3. Interpret Results What Do Data Scientists Do?
  • 14. Applications of Data Analysis. Text Mining: Classify Spam Mail. Build Index: Data Search Engine. Social Network Analysis: Finding Opinion Leaders. Recommendation System: What do users like? Opinion Mining: Positive/Negative Opinion. Fraud Analysis: Credit Card Fraud.
  • 15. Feed data to the computer. Make the Computer Do the Analysis
  • 17. Predictive Analysis: learn from experience (data) to predict future behavior. What to predict? e.g. Who is likely to click on that ad? For what? e.g. To decide which ad to show, based on click probability and revenue.
  • 18. Will customers who buy beer also buy Pampers? People who browse telephone rate plans are likely to switch vendors. People who belong to the same group tend to use the same telecom vendor. Surprising Conclusions
  • 19. Based on personal behavior, a predictive model uses personal characteristics to generate a probabilistic score: the higher the score, the more likely the behavior. Predictive Model
  • 20. Linear Model e.g. For a cosmetics ad, we can give a 90% weight to female customers and a 10% weight to male customers. Multiplying each weight by the click probability (15%) gives the probability score: female 13.5%, male 1.5%. Rule Model e.g. If the user is female, and her income is over 30k, and she has not seen the ad yet, the click rate is 11%. Simple Predictive Model
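
The two simple models on slide 20 can be sketched in a few lines of Python. This is only an illustration: the function and constant names are hypothetical, and the rule model returns `None` when the rule does not match because the slide gives no click rate for that case.

```python
# Linear model: segment weight multiplied by the base click probability.
BASE_CLICK_PROBABILITY = 0.15                     # 15% from the slide
SEGMENT_WEIGHTS = {"female": 0.90, "male": 0.10}  # weights from the slide


def linear_score(segment):
    """Return the probability score for a customer segment."""
    return SEGMENT_WEIGHTS[segment] * BASE_CLICK_PROBABILITY


def rule_score(gender, income, has_seen_ad):
    """Rule model: 11% click rate when every condition of the rule holds.

    Returns None otherwise, since the slide states no rate for other users.
    """
    if gender == "female" and income > 30_000 and not has_seen_ad:
        return 0.11
    return None


print(f"female: {linear_score('female'):.1%}")  # female: 13.5%
print(f"male: {linear_score('male'):.1%}")      # male: 1.5%
```

The linear model spreads one base rate across segments, while the rule model assigns a rate only to users who satisfy every condition.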
  • 21. Induction From detail to general A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E -- Tom Mitchell (1998) Discover an effective model Start from a simple model Update the model based on feeding data Keep on improving prediction power Machine Learning
  • 22. Statistical Analysis Regression Analysis Clustering Classification Recommendation Text Mining Application
  • 24. Decision Tree Rate > 1,299/Month Probability to switch vendor 15% Probability to switch vendor 3% Yes No
  • 25. Decision Tree Rate > 1,299/Month Probability to switch vendor 3% Yes No Probability to switch vendor 10% Probability to switch vendor 22% Income>22,000 Yes No
  • 26. Decision Tree Rate > 1,299/Month Yes No Probability to switch vendor 10% Probability to switch vendor 22% Income>22,000 Yes No Probability to switch vendor 1% Probability to switch vendor 7% Free for intranet Yes No
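
The final tree on slide 26 amounts to nested conditions. The sketch below assumes one reading of the flattened diagram: in each pair of leaves, the first probability belongs to the "Yes" branch and the second to the "No" branch. The function and parameter names are hypothetical.

```python
def switch_probability(monthly_rate, income, free_intranet):
    """Probability that a customer switches vendor, following slide 26.

    Assumed leaf assignment (not confirmed by the transcript):
      rate > 1,299 and income > 22,000   -> 10%
      rate > 1,299 and income <= 22,000  -> 22%
      rate <= 1,299 and free intranet    -> 1%
      rate <= 1,299 and no free intranet -> 7%
    """
    if monthly_rate > 1299:
        return 0.10 if income > 22000 else 0.22
    return 0.01 if free_intranet else 0.07


print(switch_probability(1500, 20000, False))  # 0.22
print(switch_probability(900, 50000, True))    # 0.01
```

Each extra split refines one leaf of the previous tree: slide 24's single 15% leaf becomes the 10% and 22% leaves once income is taken into account.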
  • 27. Supervised Learning Regression Classification Unsupervised Learning Dimension Reduction Clustering Machine Learning
  • 29. Classification e.g. Stock prediction on bull/bear market Regression e.g. Price prediction Supervised Learning
  • 30. Dimension Reduction e.g. Making a new index Clustering e.g. Customer Segmentation Unsupervised Learning
  • 31. Lift The better the lift, the greater the cost? The more decision rule, the more campaign? Design strategy for different persona? The lift for 4 campaign? The lift for 20 ampaign? Lift
  • 32. Can we use the production rate of butter to predict stock market? Overfitting
  • 33. Using noise as information. Over-assumption. Over-interpretation. What an overfitted model learns is not the truth; it is like memorizing all the answers to a single test. Overfitting
  • 34. Testing Model Use external data or partial data as testing dataset
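
Holding out part of the data for testing can be sketched as follows. This is a minimal illustration with hypothetical names; the 70/30 ratio and the seed value 2 mirror the R example later in the deck, but unlike R's `sample()` with probabilities this version produces an exact split.

```python
import random


def holdout_split(rows, train_fraction=0.7, seed=2):
    """Shuffle the rows reproducibly and split them into train and test sets."""
    rng = random.Random(seed)      # fixed seed so the split can be repeated
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


train, test = holdout_split(range(10))
print(len(train), len(test))  # 7 3
```

The model is fitted only on `train`; evaluating it on `test` shows whether it has learned a pattern or merely memorized the training rows.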
  • 36. Statistics On The Fly Built-in Math and Graphics Functions Free and Open Source http://cran.r-project.org/src/base/ R Language
  • 37. Functional Programming Use Function Definition To Retrieve Answer Interpreted Language Statistics On the Fly Object Oriented Language S3 and S4 Method R Language
  • 38. Most Used Analytic Language: the most popular languages are R, Python (39%), SQL (37%), and SAS (20%). By Gregory Piatetsky, Aug 27, 2013.
  • 39. Kaggle http://www.kaggle.com/ Most often used language in Kaggle competition
  • 40. Data Scientists at Google and Apple Use R What is your programming language of choice, R, Python or something else? “I use R, and occasionally matlab, for data analysis. There is a large, active and extremely knowledgeable R community at Google.” http://simplystatistics.org/2013/02/15/interview-with-nick-chamandy-statistician-at-google/ “Expert knowledge of SAS (With Enterprise Guide/Miner) required and candidates with strong knowledge of R will be preferred” http://www.kdnuggets.com/jobs/13/03-29-apple-sr-data-scientist.html?utm_source=twitterfeed&utm_medium=facebook&utm_campaign=tfb&utm_content=FaceBook&utm_term=analytics#.UVXibgXOpfc.facebook
  • 41. Discover which customers are likely to churn. Customer Churn Analysis
  • 42. Account Information: state, account length, area code, phone number. User Behavior: international plan, voice mail plan, number vmail messages, total day minutes, total day calls, total day charge, total eve minutes, total eve calls, total eve charge, total night minutes, total night calls, total night charge, total intl minutes, total intl calls, total intl charge, number customer service calls. Target: Churn (Yes/No). Data Description
  • 43. > install.packages("C50") > library(C50) > data(churn) > str(churnTrain) > churnTrain = churnTrain[,! names(churnTrain) %in% c("state", "area_code", "account_length") ] > set.seed(2) > ind <- sample(2, nrow(churnTrain), replace = TRUE, prob=c(0.7, 0.3)) > trainset = churnTrain[ind == 1,] > testset = churnTrain[ind == 2,] Split data into training and testing dataset 70% as training dataset 30% as testing dataset
  • 44. library(rpart) churn.rp <- rpart(churn ~ ., data=trainset) plot(churn.rp, margin=0.1) text(churn.rp, all=TRUE, use.n=TRUE) Build Classifier Classification
  • 45. > predictions <- predict(churn.rp, testset, type="class") > table(testset$churn, predictions) Prediction Result pred no yes no 859 18 yes 41 100
  • 46. > library(caret) > confusionMatrix(table(predictions, testset$churn)) Confusion Matrix and Statistics predictions yes no yes 100 18 no 41 859 Accuracy : 0.942 95% CI : (0.9259, 0.9556) No Information Rate : 0.8615 P-Value [Acc > NIR] : < 2.2e-16 Kappa : 0.7393 Mcnemar's Test P-Value : 0.004181 Sensitivity : 0.70922 Specificity : 0.97948 Pos Pred Value : 0.84746 Neg Pred Value : 0.95444 Prevalence : 0.13851 Detection Rate : 0.09823 Detection Prevalence : 0.11591 Balanced Accuracy : 0.84435 'Positive' Class : yes Use Confusion Matrix
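
Several of the statistics on slide 46 follow directly from the four cells of the confusion matrix. The sketch below recomputes them in plain Python from the counts shown on the slide, with "yes" (churn) as the positive class; the variable names are illustrative.

```python
# Cell counts from the confusion matrix on the slide (positive class: yes).
tp = 100  # predicted yes, actually yes
fp = 18   # predicted yes, actually no
fn = 41   # predicted no, actually yes
tn = 859  # predicted no, actually no

total = tp + fp + fn + tn
accuracy = (tp + tn) / total            # share of all predictions that are correct
sensitivity = tp / (tp + fn)            # share of real churners that were caught
specificity = tn / (tn + fp)            # share of non-churners correctly kept
pos_pred_value = tp / (tp + fp)         # precision of the "yes" predictions
neg_pred_value = tn / (tn + fn)         # precision of the "no" predictions
balanced_accuracy = (sensitivity + specificity) / 2

print(f"Accuracy          : {accuracy:.3f}")           # 0.942
print(f"Sensitivity       : {sensitivity:.5f}")        # 0.70922
print(f"Specificity       : {specificity:.5f}")        # 0.97948
print(f"Balanced Accuracy : {balanced_accuracy:.5f}")  # 0.84435
```

The high accuracy is driven largely by the non-churners; the sensitivity of about 0.71 shows that roughly three in ten churners are still missed.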
  • 47. Use Testing Data to Validate Result library(ROCR) predictions <- predict(churn.rp, testset, type="prob") pred.to.roc <- predictions[, 1] pred.rocr <- prediction(pred.to.roc, as.factor(testset[,(dim(testset)[[2]])])) perf.rocr <- performance(pred.rocr, measure = "auc", x.measure = "cutoff") perf.tpr.rocr <- performance(pred.rocr, "tpr","fpr") plot(perf.tpr.rocr, colorize=TRUE, main=paste("AUC:",(perf.rocr@y.values)))
  • 48. Finding Most Important Variable library(rminer) model=fit(churn~.,trainset,model="svm") VariableImportance=Importance(model,trainset,method="sensv") L=list(runs=1,sen=t(VariableImportance$imp),sresponses=VariableImportance$sresponses) mgraph(L,graph="IMP",leg=names(trainset),col="gray",Grid=10)
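
The AUC reported by ROCR can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it that way on a tiny made-up example; the scores and labels are invented for illustration and are not taken from the churn data.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the share of (positive, negative) pairs in which the positive
    case scores higher; ties count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities and true labels (1 = churn).
example_scores = [0.9, 0.8, 0.3, 0.1]
example_labels = [1, 0, 1, 0]
print(auc(example_scores, example_labels))  # 0.75
```

An AUC of 0.5 means the scores rank cases no better than chance, and 1.0 means every positive outranks every negative.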
  • 49. Dynamic Language Execution at runtime Dynamic Type Interpreted Language See the result after execution OOP Python Language
  • 50. Cross Platform (Python VM). Third-Party Resources (Data Analysis, Graphics, Website Development). Simple and easy to learn. Benefits of Python
  • 51. Data Analysis: SciPy, NumPy, scikit-learn, pandas
  • 52. Companies That Use Python
  • 53. Use InfoLite Tool To Extract DOM
  • 54. Use Python To Build Up Dashboard
  • 55. Monitor Social Media and News Monitor posts on social media Configure keywords and alerts Use a line plot to show daily post statistics Apple Daily, NOWnews, udn, CNA and Storm Media, plus other financial media
  • 58. Configure Alert and Keyword
  • 61. Have You Learned Big Data?
  • 62.
  • 63. The 3Vs of Big Data
  • 64.
  • 65. Product Centric vs. Customer Centric
  • 68. Knowing Who You Are? Personal recommendation. Customer relationship management. Knowing What the Future Looks Like? From history, we can see the future. Predictive analysis. Knowing What Is Hidden Beneath? Correlation, Correlation, Correlation. So… What is Big Data?
  • 69. So… How To Analyze?
  • 70. Apache Project – From Yahoo Features: Extensible, Cost Effective, Flexible, Highly Fault Tolerant Hadoop
  • 71. Hadoop Eco System HDFS MR IMPALA HBASE PIG HIVE SQOOP FLUME HUE, Oozie, Mahout
  • 72. Tools for Different Scales (Size | Classification | Tools)
      Lines | Sample Data | Analysis and Visualisation: Whiteboard, Bash, ...
      KBs – low MBs | Prototype Data | Analysis and Visualisation: Matlab, Octave, R, Processing, Bash, ...
      MBs – low GBs | Online Data | Storage: MySQL (DBs), ...; Analysis: NumPy, SciPy, Pandas, Weka, ...; Visualisation: Flare, AmCharts, Raphael
      GBs – TBs – PBs | Big Data | Storage: HDFS, HBase, Cassandra, ...; Analysis: Hive, Giraph, Hama, Mahout
  • 75. Recommendation System Javascript Flume HDFS HBase Pig Mahout
  • 79. Send User Behavior to Backend
  • 80. Use Flume To Collect Streaming Data From /tmp/postlog.txt To /user/cloudera/flume
  • 81. JSON sample data {"food":"Tacos", "person":"Alice", "amount":3} {"food":"Tomato Soup", "person":"Sarah", "amount":2} {"food":"Grilled Cheese", "person":"Alex", "amount":5} Demo Code second_table = LOAD 'second_table.json' USING JsonLoader('food:chararray, person:chararray, amount:int'); Use Pig To Load JSON
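
For comparison with the Pig `JsonLoader` call, the same three sample records can be read in Python with the standard `json` module. This is a sketch only: the sample is embedded as a string rather than read from `second_table.json`, and the helper name is hypothetical.

```python
import json

# The three records from the slide, one JSON object per line.
SAMPLE = """{"food":"Tacos", "person":"Alice", "amount":3}
{"food":"Tomato Soup", "person":"Sarah", "amount":2}
{"food":"Grilled Cheese", "person":"Alex", "amount":5}"""


def load_json_lines(text):
    """Parse newline-delimited JSON into a list of dictionaries."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = load_json_lines(SAMPLE)
total_amount = sum(record["amount"] for record in records)

print(records[0]["person"], records[0]["food"])  # Alice Tacos
print(total_amount)                              # 10
```

Each line is parsed independently, which is the same one-record-per-line layout that the Pig loader expects.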
  • 83. $ hbase shell > create 'mydata', 'mycf' Build Table In HBase
  • 85. Use Pig To Transfer Data Into HBase
  • 89. Focus on algorithms: Divide and Conquer, Trie, Collaborative Filtering. Be an expert in a single programming language, but know which tools and algorithms you can use to solve your problem. Define your role: statistician or software engineer. What You Should Do
  • 90. Website: largitdata.com ywchiu.com Email: david@largitdata.com tr.ywchiu@gmail.com Contacts