Financial and Risk
Applications of InfoQ
Prof. Ron S. Kenett
KPA Ltd., Raanana, Israel
Universita degli Studi di Torino, Turin, Italy
NYU Poly, New York, USA
ron@kpa-group.com
Three case studies (1/3)
1. Predicting Changes in Quarterly Corporate
Earnings Using Economic Indicators
http://galitshmueli.com/content/predicting-changes-quarterly-corporate-earnings-using-economic-indicators
This study looks at corporate earnings in relation
to an existing theory of business forecasting
developed by Joseph H. Ellis (former research
analyst at Goldman Sachs).
Three case studies (2/3)
2. Predicting ZILLOW.com’s Zestimate accuracy
http://galitshmueli.com/content/predicting-zillowcom-s-zestimate-accuracy
Zillow.com is a free real estate service that
calculates an estimated home valuation
("Zestimate") as a starting point for anyone to see
for most homes in the U.S. The study looks at the
accuracy of Zestimates.
Three case studies (3/3)
3. Predicting First Day Returns for Japanese IPOs
http://galitshmueli.com/content/predicting-first-day-returns-japanese-ipos
An Initial Public Offering (IPO) is the first sale of
stock by a company to the public. The study looks
at the first-day returns on IPOs of Japanese
companies.
InfoQ(f,X,g) = U( f(X|g) )
Depends on quality of g, X, f, U and relationship between them
The potential of a particular dataset
to achieve a particular goal using a
given empirical analysis method
g A specific analysis goal
X The available dataset
f An empirical analysis method
U A utility measure
Information Quality
Kenett, R.S. and Shmueli, G. (2013) On Information Quality, http://ssrn.com/abstract=1464444
Journal of the Royal Statistical Society, Series A (with discussion), 176(4).
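The definition InfoQ(f,X,g) = U( f(X|g) ) can be read as a function composition: apply the method f to the data X for the goal g, then score the result with U. A minimal Python sketch of that reading follows; the toy dataset, method, goal and utility are hypothetical illustrations, not taken from the talk.

```python
from statistics import mean

def info_q(f, X, g, U):
    """Apply analysis method f to dataset X for goal g, then score the result with utility U."""
    return U(f(X, g))

# Hypothetical toy components, for illustration only.
X = [2.0, 4.0, 9.0]                      # the available dataset
g = {"task": "estimate", "target": 4.0}  # a specific analysis goal

def f(data, goal):
    # Empirical analysis method: here, simply the sample mean.
    return mean(data)

def U(result):
    # Utility measure: negative absolute error against the goal's target.
    return -abs(result - g["target"])

score = info_q(f, X, g, U)
```

Changing any one of the four components (a different goal, dataset, method or utility) changes the score, which is the sense in which InfoQ depends on all of them and on the relationship between them.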
Analysis goal
g
Explain, predict, describe
enumerative, analytic,
exploratory, confirmatory
Goal Specification
• “error of the third kind” - giving the right answer to the wrong
question – Kimball
• “Far better an approximate answer to the right question, which
is often vague, than an exact answer to the wrong question,
which can always be made precise” - Tukey
Analysis goal
g
Goal 1. Decide where to launch improvement initiatives
Goal 2. Highlight drivers of overall satisfaction
Goal 3. Detect positive or negative trends in customer satisfaction
Goal 4. Identify best practices by comparing products
Goal 5. Determine strengths and weaknesses
Goal 6. Set up improvement goals
Goal 7. Design a balanced scorecard with customer inputs
Goal 8. Communicate the results using graphics
Goal 9. Assess the reliability of the questionnaire
Goal 10. Improve the questionnaire for future use
Typical Goals of Customer Surveys
X
Available data
Data Source
• Primary, secondary
• Observational, experiment
• Single, multiple sources
• Collection instrument, protocol
Data Type
• Continuous, categorical, semantic
• Structured, un-, semi-structured
• Cross-sectional, time series, panel,
network, geographical
Data Quality
• “Zeroth Problem - How do the data relate to the problem, and
what other data might be relevant?” - Mallows
• Quality of Statistical Data (IMF, OECD) - usefulness of summary
statistics for a particular goal (7 dimensions)
Data Size and
Dimension
• # observations
• # variables
f
Data analysis
method
Analysis Quality
• “poor models and poor analysis techniques, or even analyzing the
data in a totally incorrect way.” - Godfrey
• Analyst expertise
• Software availability
• The focus of statistics education
Statistical models and methods
• Parametric, semi-, non-parametric
• Classic, Bayesian
Data mining algorithms
Graphical methods
Operations research methods
Utility measure
U
Utility Measure
• Adequate metric from analysis standpoint (R², holdout data)
• Adequate metric from domain standpoint
• Predictive accuracy, lift
• Goodness-of-fit
• Statistical power, statistical significance
• Strength-of-fit
• Expected costs, gains
• Bias reduction, bias-variance tradeoff
Goal of study:
1. Predict the final price of an eBay auction at start of auction
2. Predict price during ongoing auction
3. Predict the auctions with the highest prices (ranking)
4. Identify factors that determine the final price of an eBay auction
“Pennies from eBay: The determinants of price in online auctions”
Lucking-Reiley D., Bryan D., Prasad N. & Reeves D.
Journal of Industrial Economics, 2007
An example….
X
Available data
Analysis goal
g
• 461 eBay coin auctions (Indian Head pennies)
• Auction characteristics
  – Duration
  – Open and close prices
  – Number of bids and bidders
  – Secret reserve price
  – Weekday/weekend ending
• Seller characteristics
  – Seller rating
• Item characteristics
  – Year and grade of coin
X
Available data
Dimension Reduction
f
Data analysis
method
An example….
Prediction error:
• Holdout data
• Metrics such as MAPE
and RMSE
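As a rough illustration of the two metrics named above, the sketch below computes RMSE and MAPE on a holdout set. The prices and predictions are hypothetical and are not from the eBay study.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Assumes no actual value is zero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical holdout set: final auction prices and model predictions.
holdout_actual = [10.0, 20.0, 40.0]
holdout_predicted = [12.0, 18.0, 40.0]

holdout_rmse = rmse(holdout_actual, holdout_predicted)
holdout_mape = mape(holdout_actual, holdout_predicted)
```

RMSE is in the units of the price, while MAPE is scale-free, so the two can rank models differently when auction prices vary widely.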
f
Data analysis
method
Utility measure
U
An example….
Statistical Approaches for Increasing InfoQ
Study Design (Pre-Data)
• DOE
• Clinical trials
• Survey sampling
• Computer experiments
Post-Data-Collection
• Data cleaning and
preprocessing
• Re-weighting, bias
adjustment
• Meta analysis
Randomization, Stratification,
Blinding, Placebo, Blocking,
Replication, Sampling frame,
Link data collection protocol
with appropriate design
Recovering “real data” vs.
“cleaning for the goal”
Handling missing values,
outlier detection, re-weighting,
combining results
Assessing InfoQ
“Quality of Statistical Data”
(Eurostat, OECD, NCSES,…)
• Relevance
• Accuracy
• Timeliness and punctuality
• Accessibility
• Interpretability
• Coherence
• Credibility
InfoQ dimensions
1. Data resolution
2. Data structure
3. Data integration
4. Temporal relevance
5. Chronology of data and goal
6. Generalizability
7. Operationalization
8. Communication
3 V’s of Big Data
• Volume
• Variety
• Velocity
Marketing Research
• Recency
• Accuracy
• Availability
• Relevance
4 V’s of Big Data
• Volume
• Variety
• Velocity
• Veracity
#1 Data Resolution
#2 Data Structure
Data Types
• Time series, cross-sectional, panel
• Structured, semi-, non-structured
• Geographic, spatial, network
• Text, audio, video, semantic
• Discrete, continuous
Data Characteristics
Corrupted and missing
values due to study design
or data collection
mechanism
www.riscoss.eu
Managing Risk and Costs in OSS Adoption
#2 Data Structure
Who talks
to whom?
IRC chat archives: http://dev.xwiki.org/xwiki/bin/view/IRC/WebHome
XWiki Community
#2 Data Structure
XWiki Community
Use association rules to characterize the content of the clusters (R packages tm, arules)
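The slide refers to the R packages tm and arules. As a language-neutral illustration of what an association rule measures, here is a small Python sketch of support and confidence over term sets; the messages and terms are hypothetical, not taken from the XWiki archives.

```python
def support(transactions, itemset):
    """Share of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / support(transactions, antecedent)

# Hypothetical term sets extracted from chat messages in one cluster.
messages = [
    {"build", "error", "maven"},
    {"build", "maven"},
    {"wiki", "macro"},
    {"build", "error"},
]
s = support(messages, {"build", "maven"})
c = confidence(messages, {"build"}, {"maven"})
```

Rules with high support and confidence inside one cluster, but not in the others, give a compact description of what that cluster talks about.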
#2 Data Structure
#3 Data Integration
Linkage, privacy-preserving
methods: Increase or
decrease InfoQ?
#4 Temporal Relevance
Analysis Timeliness
(solving the right
problem too late)
Data
Collection
Data
Analysis
Study
Deployment
t1 t2 t3 t4 t5 t6
Collection Timeliness
(relevance to g)
g: Prospective vs. retrospective; longitudinal vs. snapshot
Nature of X, complexity of f
forecast
#5 Chronology of Data & Goal
Data: Daily AQI in a city
g1: Reverse-engineer AQI
g2: Forecast AQI
Retrospective/prospective
Ex-post availability
Endogeneity
http://www.airnow.gov/?action=aqibasics.aqi
#6 Generalizability
Statistical
generalizability
Scientific
generalizability
Definition of g
Choice of X, f, U
#7 (Construct) Operationalization
χ construct
X = θ(χ) operationalization (measurable)
• Causal explanation vs.
prediction, description
• Theory vs. data
• Data: Questionnaire,
physio measurement
#7 (Action) Operationalization
http://www.spcpress.com/pdf/DJW187.pdf
#7 Operationalization
National Education Goals Panel (NEGP) recommended that states answer four questions on their student reports:
1. How did my child do?
2. What types of skills or knowledge does his or her performance reflect?
3. How did my child perform in comparison to other students in the school, district, state, and, if available, the nation?
4. What can I do to help my child improve?
#7 Operationalization
http://sat.collegeboard.org/practice/sat-skills-insight/writing/band/200
When asked what the 18% in line 1 meant,
53% of the policy makers responded incorrectly
1992 NAEP
Executive
Summary Report
#8 Communication
#8 Communication
http://nces.ed.gov/nationsreportcard/itemmaps/index.asp
The Israeli version……
#8 Communication
http://rama.education.gov.il
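[Table; row and column labels lost in extraction. Each block starts with N, and the first N is the sum of the other two.]
N       ·    ·    | N       ·    ·   | N      ·    ·
18,684  501  102  | 13,182  521  87  | 5,502  454  118
21,407  500  100  | 14,466  524  84  | 6,941  444  111
20,644  524  91   | 14,787  536  80  | 5,857  496  106
19,165  524  86   | 13,379  532  77  | 5,786  506  101
19,631  532  76   | 13,961  537  73  | 5,670  519  81
20,222* 528  77   | 13,957  541  70  | 6,265  498  82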
http://www.madlan.co.il/education/schools
#8 Communication
The Israeli version……
Assessing InfoQ in Practice
Rating-based assessment
1-5 scale on each dimension:
InfoQ Score = [d1(Y1) × d2(Y2) × … × d8(Y8)]^(1/8)
Experience from two research methods courses
– Preparing a PhD research proposal (U Ljubljana, 50
students, goo.gl/f6bIA)
– Post-hoc evaluation of five completed studies (CMU,
16 students, goo.gl/erNPF)
#  Dimension                    Note              Value  Index
1  Data resolution                                5      1.0000
2  Data structure                                 4      0.7500
3  Data integration                               5      1.0000
4  Temporal relevance                             5      1.0000
5  Generalizability                               3      0.5000
6  Chronology of data and goal                    5      1.0000
7  Concept operationalization                     2      0.2500
8  Communication                                  3      0.5000
InfoQ Score = 0.68
InfoQ = 68%
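The score in the table can be reproduced as the geometric mean of the eight Index values. The sketch below assumes the linear mapping from the 1-5 rating to the Index that the table rows imply (5 gives 1.00, 4 gives 0.75, 3 gives 0.50, 2 gives 0.25); the function names are illustrative.

```python
import math

def desirability(rating, low=1, high=5):
    """Map a 1-5 rating linearly onto [0, 1], matching the Index column (5 -> 1.00, 3 -> 0.50)."""
    return (rating - low) / (high - low)

def infoq_score(ratings):
    """Geometric mean of the desirability indices of the dimension ratings."""
    indices = [desirability(r) for r in ratings]
    return math.prod(indices) ** (1.0 / len(indices))

# Ratings from the example table, in row order.
ratings = [5, 4, 5, 5, 3, 5, 2, 3]
score = infoq_score(ratings)
print(f"InfoQ Score = {score:.2f}")  # prints "InfoQ Score = 0.68"
```

Because the mean is geometric, one very low rating pulls the whole score down, and under this mapping a rating of 1 on any dimension gives an InfoQ score of 0.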
InfoQ: Strengths and Challenges
InfoQ approach streamlines questioning of data value
• “Why should we invest in data?” – management
• Compare value of potential datasets, analyses
• Prioritize/rank projects
• Strengthen functional – analytical relationship
Multiple goals:
• Goals can change during study: Reevaluate InfoQ
• Multiple goals: Prioritize.
– clinical trials: effect of new drug, adverse effects
To Do:
• Improve InfoQ assessment
• Alternative InfoQ assessment approaches (pilot study, EDA, other)
• Further dimensions (data privacy, human subject compliance and risk)
• Effect of technological advances on InfoQ
Primary Data Secondary Data
- Experimental - Experimental
- Observational - Observational
Data
Quality
Information
Quality
Analysis
Quality
Knowledge
g A specific analysis goal
X The available dataset
f An empirical analysis method
U A utility measure
1. Data resolution
2. Data structure
3. Data integration
4. Temporal relevance
5. Chronology of data and goal
6. Generalizability
7. Operationalization
8. Communication
What
How
Goals InfoQ(f,X,g) = U(f(X|g))
Information Quality
Russom, P., Big Data Analytics, TDWI Best Practices Report, Q4 2011
Massive data sets
1. Data resolution
2. Data structure
3. Data integration
4. Temporal relevance
5. Chronology of data and goal
6. Generalizability
7. Operationalization
8. Communication
Big data Analytics
Three case studies (1/3)
1. Predicting Changes in Quarterly Corporate
Earnings Using Economic Indicators
Stages in economic downturn: 1) the peak, 2) modest slowing, 3) intensifying worry among investors (a lot of panic selling occurs in this stage), and 4) the advent of recession. Can we predict a slowdown in corporate earnings (S&P 500 EPS) well in advance?
Ellis claims (based on observations) that there is a 0-9 month lag between changes in wages and their effect on consumer spending, 0-6 months until changes in consumer spending affect industrial production, another 6-12 months between industrial production and capital spending, and finally another 6-12 months between capital spending and its effect on corporate profits.
Ellis model:
The data: i) 180 quarters, 6 economic x variables; ii) y variable = change in S&P EPS; iii) all variables transformed to year-over-year % change; iv) all data publicly available from the websites of US agencies (BEA, BLS, FED) and S&P.
The analysis: XLMiner was run on different versions of the dataset after partitioning. Methods applied: ACF plots, MLR, and regression trees (full and pruned).
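The year-over-year transformation of a quarterly series compares each quarter with the same quarter one year earlier. A minimal sketch, assuming plain lists of quarterly levels; the EPS figures are hypothetical and the function name is illustrative.

```python
def year_over_year_pct(series, periods_per_year=4):
    """Percent change of each value against the value one year (four quarters) earlier."""
    return [
        100.0 * (series[t] - series[t - periods_per_year]) / series[t - periods_per_year]
        for t in range(periods_per_year, len(series))
    ]

# Hypothetical quarterly EPS levels (not data from the study).
quarterly_eps = [20.0, 21.0, 22.0, 23.0, 22.0, 23.1, 24.2, 25.3]
eps_yy = year_over_year_pct(quarterly_eps)
```

The transformed series is four observations shorter than the input, since the first year has no earlier year to compare against.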
Autocorrelation chart. Based on this, Lag_1 was taken as one of the predictors: Lag_1 = QEPS_YY(Q-1)
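An ACF chart plots the sample autocorrelation at each lag; a large value at lag 1 is the usual reason to add the previous quarter as a predictor. The sketch below computes that quantity and builds the lagged pairs, using a hypothetical series rather than the study's data.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation at the given lag (the quantity plotted in an ACF chart)."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    numer = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return numer / denom

def with_lag(series, lag=1):
    """Pair each value with its lagged predecessor: (y_t, y_{t-lag})."""
    return [(series[t], series[t - lag]) for t in range(lag, len(series))]

# Hypothetical year-over-year % changes in quarterly EPS.
qeps_yy = [1.0, 2.0, 3.0, 4.0, 5.0]
r1 = autocorrelation(qeps_yy, 1)
rows = with_lag(qeps_yy, 1)
```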
QEPS_YY%(t) = 0.0486 + 0.747*QEPS_YY%(t-1) -0.517*QRCAP_YY%(t-2)
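Applying the fitted equation is a one-line calculation. In the sketch below the coefficients are those shown above; the input values are hypothetical, and whether the study coded % changes as fractions or as percentage points is not stated, so the inputs are only placeholders.

```python
def predict_qeps_yy(qeps_yy_lag1, qrcap_yy_lag2):
    """Prediction for quarter t from QEPS_YY% at t-1 and QRCAP_YY% at t-2."""
    return 0.0486 + 0.747 * qeps_yy_lag1 - 0.517 * qrcap_yy_lag2

# Hypothetical inputs; the units used in the study are an assumption here.
forecast = predict_qeps_yy(qeps_yy_lag1=0.10, qrcap_yy_lag2=0.05)
```

The positive coefficient on the lagged EPS change carries momentum forward, while the negative coefficient on QRCAP_YY% two quarters back lowers the forecast.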
#  Dimension                    Note              Value  Index
1  Data resolution              quarterly data    2      0.2500
2  Data structure               no externalities  3      0.5000
3  Data integration                               4      0.7500
4  Temporal relevance                             5      1.0000
5  Generalizability                               5      1.0000
6  Chronology of data and goal  quarterly data    3      0.5000
7  Concept operationalization                     5      1.0000
8  Communication                                  4      0.7500
InfoQ Score = 0.66
InfoQ = 66%
Three case studies (2/3)
2. Predicting ZILLOW.com’s Zestimate accuracy
• “Zillow.com” is a real estate service launched in 2006
• It calculates a Zestimate home valuation for most homes in the U.S.
For MD and VA, only about 26% of its predictions fall within the +/-5% range.
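The "within +/-5%" figure is a hit rate. A minimal sketch, assuming the range is measured relative to the actual sale price; the dollar values are hypothetical, not data from the study.

```python
def share_within(estimates, sale_prices, tolerance=0.05):
    """Fraction of estimates whose relative error against the sale price is within the tolerance."""
    hits = sum(
        1 for est, price in zip(estimates, sale_prices)
        if abs(est - price) <= tolerance * price
    )
    return hits / len(sale_prices)

# Hypothetical Zestimates and sale prices in dollars.
zestimates = [310_000, 255_000, 480_000, 199_000]
sale_prices = [300_000, 250_000, 400_000, 200_000]
share = share_within(zestimates, sale_prices)
```

Labelling each sale as a hit or a miss in this way yields the binary outcome (is the Zestimate correct, Y/N) that the case study then tries to predict.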
1. Home Type (Single Family, Condo, etc.)
2. No. of Bedrooms
3. No. of Bathrooms
4. Total Area (sq ft)
5. Lot Size (sq ft)
6. No. of Stories
7. Total Rooms
8. Distance from Metro
9. Primary School Rank
10. Middle School Rank
11. High School Rank
12. Age of House at Sale
13. Sale Season (Fall, Winter, etc.)
14. Recession Period (Y/N)
15. Sales Volume
• Data collected, cleansed and merged from 4 sources: Zillow, Redfin, School Digger and Google Maps
• 17 counties (29 ZIP codes) in Northern VA
House sales data
• Before data clean-up: 3500+
• After data clean-up: 1416
• Y: Is Zestimate correct (Y/N), 37.6%/62.43%
• X: 15 variables (5+ variables were discarded from the initial set)
#  Dimension                    Note                 Value  Index
1  Data resolution              by individual house  5      1.0000
2  Data structure               no externalities     4      0.7500
3  Data integration                                  5      1.0000
4  Temporal relevance                                5      1.0000
5  Generalizability             only VA counties     3      0.5000
6  Chronology of data and goal                       5      1.0000
7  Concept operationalization                        4      0.7500
8  Communication                                     4      0.7500
InfoQ Score = 0.82
InfoQ = 82%
http://www.madlan.co.il/education/schools
The Israeli version……
Three case studies (3/3)
3. Predicting First Day Returns for Japanese IPOs
Goal: To predict the first-day returns on Japanese IPOs (based on first-day closing price), using public information available prior to the offer
The data:
i) Japanese IPO data from 1997-2009*
ii) 1561 IPOs
iii) Industry (categorical): 35 industries, 3 of which were spelling errors and were corrected
– Removed the Air Trans (1) and Fishery & Forestry (2) industries
– Removed first 128 entries (1997-1999) as they had no data for 2 columns: Underwriter’s fees & Allocation to BRLM
– New columns: Minimum bid size, Secondary offering percentage
– Creation of dummy variables:
  BRLMs: 3, on the basis of gross proceeds of IPO
  Industry: 4, binned by average return
  Market: whether the IPO was OTC or not
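Binning a categorical variable by average return and then dummy-coding the bins might look like the sketch below. The study's actual binning rule is not given, so splitting the ranked industries into groups of near-equal size is an assumption, and the industries and returns are hypothetical.

```python
from collections import defaultdict

def bin_by_average(categories, returns, n_bins=4):
    """Rank categories by mean return and split the ranking into n_bins groups of near-equal size."""
    totals = defaultdict(lambda: [0.0, 0])
    for cat, ret in zip(categories, returns):
        totals[cat][0] += ret
        totals[cat][1] += 1
    ranked = sorted(totals, key=lambda c: totals[c][0] / totals[c][1])
    return {cat: rank * n_bins // len(ranked) for rank, cat in enumerate(ranked)}

def one_hot(value, n_levels):
    """Dummy-code a bin index, dropping level 0 as the reference category."""
    return [1 if value == level else 0 for level in range(1, n_levels)]

# Hypothetical industries and first-day returns (not values from KP-JIPO).
industries = ["Retail", "Software", "Retail", "Banking", "Software", "Foods"]
first_day_returns = [0.10, 0.80, 0.20, 0.05, 0.60, 0.30]
bins = bin_by_average(industries, first_day_returns)
dummies = [one_hot(bins[ind], 4) for ind in industries]
```

Since the bins are built from the outcome itself, a predictive study would normally derive them from the training partition only, so that holdout returns do not leak into the predictors.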
*Kaneko and Pettway’s Japanese IPO Database (KP-JIPO)
http://www.fbc.keio.ac.jp/~kaneko/KP-JIPO/top.htm
1) Age of company at time of IPO
2) Gross Proceeds (size of IPO)
3) Minimum Bid Amount
4) IS_OTC listing
5) Secondary offering as percentage of total
6) Percentage shares allocated to Lead Manager 1
7) Underwriter’s Gross Spread (fees as percentage of size of IPO)
8) Industry_Type (binned categorical variable – 4 categories)
9) Lead_Manager (binned categorical variable – 3 categories)
#  Dimension                    Note               Value  Index
1  Data resolution                                 5      1.0000
2  Data structure                                  4      0.7500
3  Data integration             no externalities   2      0.2500
4  Temporal relevance                              5      1.0000
5  Generalizability             no theory          3      0.5000
6  Chronology of data and goal  should be ex ante  3      0.5000
7  Concept operationalization                      5      1.0000
8  Communication                                   4      0.7500
InfoQ Score = 0.66
Prediction algorithms do not give a reasonable prediction of
IPO returns from public information. (High RMSE: 90%)
InfoQ=66%
Thank you for your attention
56

More Related Content

What's hot

To Explain, To Predict, or To Describe?
To Explain, To Predict, or To Describe?To Explain, To Predict, or To Describe?
To Explain, To Predict, or To Describe?
Galit Shmueli
 
Machine Learning and Causal Inference
Machine Learning and Causal InferenceMachine Learning and Causal Inference
Machine Learning and Causal Inference
NBER
 
Repurposing Classification & Regression Trees for Causal Research with High-D...
Repurposing Classification & Regression Trees for Causal Research with High-D...Repurposing Classification & Regression Trees for Causal Research with High-D...
Repurposing Classification & Regression Trees for Causal Research with High-D...
Galit Shmueli
 
Different Classification Technique for Data mining in Insurance Industry usin...
Different Classification Technique for Data mining in Insurance Industry usin...Different Classification Technique for Data mining in Insurance Industry usin...
Different Classification Technique for Data mining in Insurance Industry usin...
IOSRjournaljce
 
To explain or to predict
To explain or to predictTo explain or to predict
To explain or to predict
Galit Shmueli
 
A REVIEW ON PREDICTIVE ANALYTICS IN DATA MINING
A REVIEW ON PREDICTIVE ANALYTICS IN DATA MININGA REVIEW ON PREDICTIVE ANALYTICS IN DATA MINING
A REVIEW ON PREDICTIVE ANALYTICS IN DATA MINING
ijccmsjournal
 
Nuts and bolts
Nuts and boltsNuts and bolts
Nuts and bolts
NBER
 
Big Data - To Explain or To Predict? Talk at U Toronto's Rotman School of Ma...
Big Data - To Explain or To Predict?  Talk at U Toronto's Rotman School of Ma...Big Data - To Explain or To Predict?  Talk at U Toronto's Rotman School of Ma...
Big Data - To Explain or To Predict? Talk at U Toronto's Rotman School of Ma...
Galit Shmueli
 
WebSite Visit Forecasting Using Data Mining Techniques
WebSite Visit Forecasting Using Data Mining  TechniquesWebSite Visit Forecasting Using Data Mining  Techniques
WebSite Visit Forecasting Using Data Mining Techniques
Chandana Napagoda
 
R programming for data science
R programming for data scienceR programming for data science
R programming for data science
AbhishekKumarSingh260
 
Advancing Foundation and Practice of Software Analytics
Advancing Foundation and Practice of Software AnalyticsAdvancing Foundation and Practice of Software Analytics
Advancing Foundation and Practice of Software Analytics
Tao Xie
 
Collaboration with Statistician? 矩陣視覺化於探索式資料分析
Collaboration with Statistician? 矩陣視覺化於探索式資料分析Collaboration with Statistician? 矩陣視覺化於探索式資料分析
Collaboration with Statistician? 矩陣視覺化於探索式資料分析
台灣資料科學年會
 
1030 track 2 barrett_using our laptop
1030 track 2 barrett_using our laptop1030 track 2 barrett_using our laptop
1030 track 2 barrett_using our laptop
Rising Media, Inc.
 
Predictive Model Selection in PLS-PM (SCECR 2015)
Predictive Model Selection in PLS-PM (SCECR 2015)Predictive Model Selection in PLS-PM (SCECR 2015)
Predictive Model Selection in PLS-PM (SCECR 2015)
Galit Shmueli
 
Repurposing predictive tools for causal research
Repurposing predictive tools for causal researchRepurposing predictive tools for causal research
Repurposing predictive tools for causal research
Galit Shmueli
 
V2 i9 ijertv2is90699-1
V2 i9 ijertv2is90699-1V2 i9 ijertv2is90699-1
V2 i9 ijertv2is90699-1
warishali570
 
Big Data Quality Panel : Diachron Workshop @EDBT
Big Data Quality Panel: Diachron Workshop @EDBTBig Data Quality Panel: Diachron Workshop @EDBT
Big Data Quality Panel : Diachron Workshop @EDBT
Paolo Missier
 
Contextual Information Elicitation in Travel Recommender Systems
Contextual Information Elicitation in Travel Recommender SystemsContextual Information Elicitation in Travel Recommender Systems
Contextual Information Elicitation in Travel Recommender Systems
Matthias Braunhofer
 
Business research (1)
Business research (1)Business research (1)
Business research (1)
007donmj
 

What's hot (19)

To Explain, To Predict, or To Describe?
To Explain, To Predict, or To Describe?To Explain, To Predict, or To Describe?
To Explain, To Predict, or To Describe?
 
Machine Learning and Causal Inference
Machine Learning and Causal InferenceMachine Learning and Causal Inference
Machine Learning and Causal Inference
 
Repurposing Classification & Regression Trees for Causal Research with High-D...
Repurposing Classification & Regression Trees for Causal Research with High-D...Repurposing Classification & Regression Trees for Causal Research with High-D...
Repurposing Classification & Regression Trees for Causal Research with High-D...
 
Different Classification Technique for Data mining in Insurance Industry usin...
Different Classification Technique for Data mining in Insurance Industry usin...Different Classification Technique for Data mining in Insurance Industry usin...
Different Classification Technique for Data mining in Insurance Industry usin...
 
To explain or to predict
To explain or to predictTo explain or to predict
To explain or to predict
 
A REVIEW ON PREDICTIVE ANALYTICS IN DATA MINING
A REVIEW ON PREDICTIVE ANALYTICS IN DATA MININGA REVIEW ON PREDICTIVE ANALYTICS IN DATA MINING
A REVIEW ON PREDICTIVE ANALYTICS IN DATA MINING
 
Nuts and bolts
Nuts and boltsNuts and bolts
Nuts and bolts
 
Big Data - To Explain or To Predict? Talk at U Toronto's Rotman School of Ma...
Big Data - To Explain or To Predict?  Talk at U Toronto's Rotman School of Ma...Big Data - To Explain or To Predict?  Talk at U Toronto's Rotman School of Ma...
Big Data - To Explain or To Predict? Talk at U Toronto's Rotman School of Ma...
 
WebSite Visit Forecasting Using Data Mining Techniques
WebSite Visit Forecasting Using Data Mining  TechniquesWebSite Visit Forecasting Using Data Mining  Techniques
WebSite Visit Forecasting Using Data Mining Techniques
 
R programming for data science
R programming for data scienceR programming for data science
R programming for data science
 
Advancing Foundation and Practice of Software Analytics
Advancing Foundation and Practice of Software AnalyticsAdvancing Foundation and Practice of Software Analytics
Advancing Foundation and Practice of Software Analytics
 
Collaboration with Statistician? 矩陣視覺化於探索式資料分析
Collaboration with Statistician? 矩陣視覺化於探索式資料分析Collaboration with Statistician? 矩陣視覺化於探索式資料分析
Collaboration with Statistician? 矩陣視覺化於探索式資料分析
 
1030 track 2 barrett_using our laptop
1030 track 2 barrett_using our laptop1030 track 2 barrett_using our laptop
1030 track 2 barrett_using our laptop
 
Predictive Model Selection in PLS-PM (SCECR 2015)
Predictive Model Selection in PLS-PM (SCECR 2015)Predictive Model Selection in PLS-PM (SCECR 2015)
Predictive Model Selection in PLS-PM (SCECR 2015)
 
Repurposing predictive tools for causal research
Repurposing predictive tools for causal researchRepurposing predictive tools for causal research
Repurposing predictive tools for causal research
 
V2 i9 ijertv2is90699-1
V2 i9 ijertv2is90699-1V2 i9 ijertv2is90699-1
V2 i9 ijertv2is90699-1
 
Big Data Quality Panel : Diachron Workshop @EDBT
Big Data Quality Panel: Diachron Workshop @EDBTBig Data Quality Panel: Diachron Workshop @EDBT
Big Data Quality Panel : Diachron Workshop @EDBT
 
Contextual Information Elicitation in Travel Recommender Systems
Contextual Information Elicitation in Travel Recommender SystemsContextual Information Elicitation in Travel Recommender Systems
Contextual Information Elicitation in Travel Recommender Systems
 
Business research (1)
Business research (1)Business research (1)
Business research (1)
 

Viewers also liked

Вы управляете проектом или проект управляет вами?
Вы управляете проектом или проект управляет вами?Вы управляете проектом или проект управляет вами?
Вы управляете проектом или проект управляет вами?
КоммандКор
 
Lembar asistensi laporan pengukuran besaran listrik
Lembar asistensi laporan pengukuran besaran listrikLembar asistensi laporan pengukuran besaran listrik
Lembar asistensi laporan pengukuran besaran listrik
Ady Purnomo
 
Viral marketing. How it works.
Viral marketing. How it works.Viral marketing. How it works.
Viral marketing. How it works.
Viktor Kharchevskyi
 
Воронка продаж Web 3.0
Воронка продаж Web 3.0Воронка продаж Web 3.0
Воронка продаж Web 3.0Viktor Kharchevskyi
 
ABA Life Sciences
ABA Life Sciences ABA Life Sciences
ABA Life Sciences
ABA - Invest in Austria
 
Financial regulation sept2010
Financial regulation sept2010Financial regulation sept2010
Financial regulation sept2010EuclidNetwork
 
3-D Конструктор управления
3-D Конструктор управления3-D Конструктор управления
3-D Конструктор управления
КоммандКор
 
The Giveaway Cafe
The Giveaway CafeThe Giveaway Cafe
The Giveaway Cafe
The Giveaway Cafe
 
1. conversion clinic with dr coleman
1. conversion clinic with dr coleman1. conversion clinic with dr coleman
1. conversion clinic with dr coleman
MoreNiche
 
MDFF_Guidelines_Print version_FINAL_Low Res
MDFF_Guidelines_Print version_FINAL_Low ResMDFF_Guidelines_Print version_FINAL_Low Res
MDFF_Guidelines_Print version_FINAL_Low Resivanidrovo
 
ABA Localita Austria
ABA Localita AustriaABA Localita Austria
ABA Localita Austria
ABA - Invest in Austria
 
7. mastering wordpress
7. mastering wordpress7. mastering wordpress
7. mastering wordpress
MoreNiche
 
Afstudeeronderzoek Annemiek Van Den Bosch
Afstudeeronderzoek Annemiek Van Den BoschAfstudeeronderzoek Annemiek Van Den Bosch
Afstudeeronderzoek Annemiek Van Den Bosch
AnnemiekvdBosch
 

Viewers also liked (16)

Вы управляете проектом или проект управляет вами?
Вы управляете проектом или проект управляет вами?Вы управляете проектом или проект управляет вами?
Вы управляете проектом или проект управляет вами?
 
Lembar asistensi laporan pengukuran besaran listrik
Lembar asistensi laporan pengukuran besaran listrikLembar asistensi laporan pengukuran besaran listrik
Lembar asistensi laporan pengukuran besaran listrik
 
Viral marketing. How it works.
Viral marketing. How it works.Viral marketing. How it works.
Viral marketing. How it works.
 
Воронка продаж Web 3.0
Воронка продаж Web 3.0Воронка продаж Web 3.0
Воронка продаж Web 3.0
 
ABA Life Sciences
ABA Life Sciences ABA Life Sciences
ABA Life Sciences
 
Financial regulation sept2010
Financial regulation sept2010Financial regulation sept2010
Financial regulation sept2010
 
3-D Конструктор управления
3-D Конструктор управления3-D Конструктор управления
3-D Конструктор управления
 
The Giveaway Cafe
The Giveaway CafeThe Giveaway Cafe
The Giveaway Cafe
 
1. conversion clinic with dr coleman
1. conversion clinic with dr coleman1. conversion clinic with dr coleman
1. conversion clinic with dr coleman
 
MDFF_Guidelines_Print version_FINAL_Low Res
MDFF_Guidelines_Print version_FINAL_Low ResMDFF_Guidelines_Print version_FINAL_Low Res
MDFF_Guidelines_Print version_FINAL_Low Res
 
ABA Localita Austria
ABA Localita AustriaABA Localita Austria
ABA Localita Austria
 
7. mastering wordpress
7. mastering wordpress7. mastering wordpress
7. mastering wordpress
 
Afstudeeronderzoek Annemiek Van Den Bosch
Afstudeeronderzoek Annemiek Van Den BoschAfstudeeronderzoek Annemiek Van Den Bosch
Afstudeeronderzoek Annemiek Van Den Bosch
 
iSell - beckend of eSexshop
iSell - beckend of eSexshopiSell - beckend of eSexshop
iSell - beckend of eSexshop
 
Antriksh forest
Antriksh forestAntriksh forest
Antriksh forest
 
Vision_UkrainanLaw_2022_UA
Vision_UkrainanLaw_2022_UAVision_UkrainanLaw_2022_UA
Vision_UkrainanLaw_2022_UA
 

Similar to Kenett On Information NYU-Poly 2013

On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)
On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)
On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)
Galit Shmueli
 
How Data Scientists Make Reliable Decisions with Data
How Data Scientists Make Reliable Decisions with DataHow Data Scientists Make Reliable Decisions with Data
How Data Scientists Make Reliable Decisions with Data
Ta-Wei (David) Huang
 
Research design decisions and be competent in the process of reliable data co...
Research design decisions and be competent in the process of reliable data co...Research design decisions and be competent in the process of reliable data co...
Research design decisions and be competent in the process of reliable data co...
Stats Statswork
 
Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysis
Yabebal Ayalew
 
Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Predictive Model and Record Description with Segmented Sensitivity Analysis (...Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Greg Makowski
 
KDD22_tutorial_slides_final_sharing.pptx
KDD22_tutorial_slides_final_sharing.pptxKDD22_tutorial_slides_final_sharing.pptx
KDD22_tutorial_slides_final_sharing.pptx
mattmcknight4
 
351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx
XanGwaps
 
Qualitative and Quantitative Research Plans By Malik Muhammad Mehran
Kenett On Information NYU-Poly 2013

  • 1. Financial and Risk Applications of InfoQ Prof. Ron S. Kenett KPA Ltd., Raanana, Israel Universita degli Studi di Torino, Turin, Italy NYU Poly, New York, USA ron@kpa-group.com
  • 2. Three case studies (1/3) 1. Predicting Changes in Quarterly Corporate Earnings Using Economic Indicators http://galitshmueli.com/content/predicting-changes-quarterly-corporate-earnings-using-economic-indicators This study looks at corporate earnings in relation to an existing theory of business forecasting developed by Joseph H. Ellis (former research analyst at Goldman Sachs).
  • 3. Three case studies (2/3) 2. Predicting ZILLOW.com’s Zestimate accuracy http://galitshmueli.com/content/predicting-zillowcom-s-zestimate-accuracy Zillow.com is a free real estate service that calculates an estimated home valuation ("Zestimate") as a starting point for anyone to see for most homes in the U.S. The study looks at the accuracy of Zestimates.
  • 4. Three case studies (3/3) 3. Predicting First Day Returns for Japanese IPOs http://galitshmueli.com/content/predicting-first-day-returns-japanese-ipos An Initial Public Offering (IPO) is the first sale of stock by a company to the public. The study looks at the first-day returns on IPOs of Japanese companies.
  • 5. Information Quality: InfoQ(f,X,g) = U( f(X|g) ). The potential of a particular dataset to achieve a particular goal using a given empirical analysis method. Depends on the quality of g, X, f, U and the relationship between them.
      g: a specific analysis goal
      X: the available dataset
      f: an empirical analysis method
      U: a utility measure
      Kenett, R.S. and Shmueli, G. (2013) On Information Quality, Journal of the Royal Statistical Society, Series A (with discussion), 176(4). http://ssrn.com/abstract=1464444
  • 6. Analysis goal g: explain, predict, describe; enumerative, analytic; exploratory, confirmatory. Goal Specification • “error of the third kind” - giving the right answer to the wrong question - Kimball • “Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise” - Tukey
  • 7. Analysis goal g: Typical Goals of Customer Surveys
      Goal 1. Decide where to launch improvement initiatives
      Goal 2. Highlight drivers of overall satisfaction
      Goal 3. Detect positive or negative trends in customer satisfaction
      Goal 4. Identify best practices by comparing products
      Goal 5. Determine strengths and weaknesses
      Goal 6. Set up improvement goals
      Goal 7. Design a balanced scorecard with customer inputs
      Goal 8. Communicate the results using graphics
      Goal 9. Assess the reliability of the questionnaire
      Goal 10. Improve the questionnaire for future use
  • 8. X Available data
      Data Source: primary, secondary; observational, experiment; single, multiple sources; collection instrument, protocol
      Data Type: continuous, categorical, semantic; structured, un-, semi-structured; cross-sectional, time series, panel, network, geographical
      Data Size and Dimension: # observations; # variables
      Data Quality: “Zeroth Problem - How do the data relate to the problem, and what other data might be relevant?” - Mallows; Quality of Statistical Data (IMF, OECD) - usefulness of summary statistics for a particular goal (7 dimensions)
  • 9. f Data analysis method
      Statistical models and methods: parametric, semi-, non-parametric; classic, Bayesian
      Data mining algorithms
      Graphical methods
      Operations research methods
      Analysis Quality: “poor models and poor analysis techniques, or even analyzing the data in a totally incorrect way.” - Godfrey; analyst expertise; software availability; the focus of statistics education
  • 10. Utility measure U
      Adequate metric from analysis standpoint (R², holdout data)
      Adequate metric from domain standpoint
      Predictive accuracy, lift; goodness-of-fit; statistical power, statistical significance; strength-of-fit; expected costs, gains; bias reduction, bias-variance tradeoff
  • 11. An example: Analysis goal g. Goal of study: 1. Predict the final price of an eBay auction at start of auction 2. Predict price during ongoing auction 3. Predict the auctions with the highest prices (ranking) 4. Identify factors that determine the final price of an eBay auction. X Available data: “Pennies from eBay: The determinants of price in online auctions”, Lucking-Reiley D., Bryan D., Prasad N. & Reeves D., Journal of Indust. Econ., 2007
  • 12. X Available data: 461 eBay coin auctions (Indian Head pennies)
      Auction characteristics: duration; open and close prices; number of bids and bidders; secret reserve price; weekday/weekend ending
      Seller characteristics: seller rating
      Item characteristics: year and grade of coin
      Source: “Pennies from eBay: The determinants of price in online auctions”, Lucking-Reiley D., Bryan D., Prasad N. & Reeves D., Journal of Indust. Econ., 2007
  • 14. An example: f Data analysis method and Utility measure U. Prediction error: • Holdout data • Metrics such as MAPE and RMSE
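As an illustration of the two metrics named on this slide, here is a minimal Python sketch that computes RMSE and MAPE on a holdout set. The function names and the sample prices are hypothetical and are not taken from the eBay study; MAPE is undefined when an actual value is zero, which the sketch rejects explicitly.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical holdout prices (not from the study).
holdout_actual = [10.0, 20.0, 40.0]
holdout_predicted = [12.0, 18.0, 44.0]

print(rmse(holdout_actual, holdout_predicted))
print(mape(holdout_actual, holdout_predicted))
```

RMSE is expressed in the units of the price, while MAPE is scale-free, so the two can rank models differently when auction prices vary widely.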
  • 15. Statistical Approaches for Increasing InfoQ
      Study Design (Pre-Data): DOE; clinical trials; survey sampling; computer experiments. Randomization, stratification, blinding, placebo, blocking, replication, sampling frame. Link data collection protocol with appropriate design.
      Post-Data-Collection: data cleaning and preprocessing; re-weighting, bias adjustment; meta analysis. Recovering “real data” vs. “cleaning for the goal”. Handling missing values, outlier detection, re-weighting, combining results.
  • 16. Assessing InfoQ
      InfoQ dimensions: 1. Data resolution 2. Data structure 3. Data integration 4. Temporal relevance 5. Chronology of data and goal 6. Generalizability 7. Operationalization 8. Communication
      “Quality of Statistical Data” (Eurostat, OECD, NCSES,…): Relevance; Accuracy; Timeliness and punctuality; Accessibility; Interpretability; Coherence; Credibility
      Marketing Research: Recency; Accuracy; Availability; Relevance
      3 V’s of Big Data: Volume; Variety; Velocity
      4 V’s of Big Data: Volume; Variety; Velocity; Veracity
  • 18. #2 Data Structure. Data Types • Time series, cross-sectional, panel • Structured, semi-, non-structured • Geographic, spatial, network • Text, audio, video, semantic • Discrete, continuous. Data Characteristics: corrupted and missing values due to study design or data collection mechanism
  • 19. #2 Data Structure: www.riscoss.eu, Managing Risk and Costs in OSS Adoption
  • 21. #2 Data Structure: XWiki Community. Who talks to whom? IRC chat archives: http://dev.xwiki.org/xwiki/bin/view/IRC/WebHome
  • 22. #2 Data Structure: XWiki Community. Use association rules to characterize the content of the clusters (tm, arules)
  • 24. #3 Data Integration. Linkage, privacy-preserving methods: Increase or decrease InfoQ?
  • 25. #4 Temporal Relevance. Timeline (t1 to t6): Data Collection, Data Analysis, Study Deployment. Collection Timeliness (relevance to g). Analysis Timeliness (solving the right problem too late). g: Prospective vs. retrospective; longitudinal vs. snapshot. Nature of X, complexity of f, forecast
  • 26. #5 Chronology of Data & Goal. Data: Daily AQI in a city (http://www.airnow.gov/?action=aqibasics.aqi). g1: Reverse-engineer AQI. g2: Forecast AQI. Retrospective/prospective; ex-post availability; endogeneity
  • 28. #7 (Construct) Operationalization. χ: construct; X = θ(χ): operationalization (measurable) • Causal explanation vs. prediction, description • Theory vs. data • Data: Questionnaire, physio measurement
  • 30. #7 Operationalization. National Education Goals Panel (NEGP) recommended that states answer four questions on their student reports: 1. How did my child do? 2. What types of skills or knowledge does his or her performance reflect? 3. How did my child perform in comparison to other students in the school, district, state, and, if available, the nation? 4. What can I do to help my child improve?
  • 33. #8 Communication. 1992 NAEP Executive Summary Report: when asked what the 18% in line 1 meant, 53% of the policy makers responded incorrectly
  • 40. #8 Communication. The Israeli version… http://rama.education.gov.il
  • 42. Assessing InfoQ in Practice
      Rating-based assessment: 1-5 scale on each dimension; InfoQ Score = [d1(Y1) × d2(Y2) × … × d8(Y8)]^(1/8)
      Experience from two research methods courses:
      – Preparing a PhD research proposal (U Ljubljana, 50 students, goo.gl/f6bIA)
      – Post-hoc evaluation of five completed studies (CMU, 16 students, goo.gl/erNPF)
      #  Dimension                    Note  Value  Index
      1  Data resolution                    5      1.0000
      2  Data structure                     4      0.7500
      3  Data integration                   5      1.0000
      4  Temporal relevance                 5      1.0000
      5  Generalizability                   3      0.5000
      6  Chronology of data and goal        5      1.0000
      7  Concept operationalization         2      0.2500
      8  Communication                      3      0.5000
      InfoQ Score = 0.68 (InfoQ = 68%)
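The score on this slide is the geometric mean of the eight index values. The following Python sketch reproduces it under one assumption: the mapping from a 1-5 rating to its index is taken to be (rating - 1) / 4, which is inferred from the Value and Index columns of the table (5 gives 1.0, 4 gives 0.75, 3 gives 0.5, 2 gives 0.25) and is not stated explicitly in the slides. The function names are illustrative.

```python
import math


def desirability(rating, low=1, high=5):
    """Map a rating on the low..high scale to an index in [0, 1].

    Assumed linear mapping, inferred from the table on the slide.
    """
    if not low <= rating <= high:
        raise ValueError("rating outside the rating scale")
    return (rating - low) / (high - low)


def infoq_score(ratings):
    """Geometric mean of the desirability indices of all dimensions."""
    indices = [desirability(r) for r in ratings]
    return math.prod(indices) ** (1.0 / len(indices))


# Ratings from the table, dimensions 1 to 8.
slide_ratings = [5, 4, 5, 5, 3, 5, 2, 3]
print(round(infoq_score(slide_ratings), 2))  # 0.68
```

Because the mean is geometric, a single dimension rated 1 (index 0) drives the whole score to zero, so a very weak dimension cannot be compensated by strong ones.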
  • 43. InfoQ: Strengths and Challenges
      InfoQ approach streamlines questioning of data value: • “Why should we invest in data?” (management) • Compare value of potential datasets, analyses • Prioritize/rank projects • Strengthen functional-analytical relationship
      Multiple goals: • Goals can change during study: reevaluate InfoQ • Multiple goals: prioritize (e.g. clinical trials: effect of new drug, adverse effects)
      To Do: • Improve InfoQ assessment • Alternative InfoQ assessment approaches (pilot study, EDA, other) • Further dimensions (data privacy, human subject compliance and risk) • Effect of technological advances on InfoQ
  • 44. Information Quality: InfoQ(f,X,g) = U(f(X|g)). Diagram labels: Primary Data (experimental, observational); Secondary Data (experimental, observational); Data Quality; Analysis Quality; Information Quality; Knowledge; Goals, What, How. g: a specific analysis goal; X: the available dataset; f: an empirical analysis method; U: a utility measure. Dimensions: 1. Data resolution 2. Data structure 3. Data integration 4. Temporal relevance 5. Chronology of data and goal 6. Generalizability 7. Operationalization 8. Communication
  • 45. Big data Analytics: massive data sets. InfoQ dimensions: 1. Data resolution 2. Data structure 3. Data integration 4. Temporal relevance 5. Chronology of data and goal 6. Generalizability 7. Operationalization 8. Communication. Source: Russom, P., Big Data Analytics, TDWI Best Practices Report, Q4 2011
  • 46. Three case studies (1/3) 1. Predicting Changes in Quarterly Corporate Earnings Using Economic Indicators. Stages in an economic downturn: 1) the peak, 2) modest slowing, 3) intensifying worry among investors (a lot of panic selling occurs in this stage), and 4) the advent of recession. Can we predict the slowdown in corporate earnings (S&P 500 EPS) well in advance? Ellis claims (based on observations) that there is a 0-9 month lag between changes in wages and their effect on consumer spending, 0-6 months until changes in consumer spending affect industrial production, another 6-12 months between industrial production and capital spending, and finally another 6-12 months between capital spending and its effect on corporate profits.
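To see how far ahead the chain described on this slide could look, the stage lags can be added up. This Python sketch assumes the four lags simply accumulate along the chain, which is a simplification made here for illustration and not a claim from the slides; the variable and function names are hypothetical.

```python
# Stage lags in months, as quoted on the slide: (stage, minimum, maximum).
ELLIS_LAGS_MONTHS = [
    ("wages -> consumer spending", 0, 9),
    ("consumer spending -> industrial production", 0, 6),
    ("industrial production -> capital spending", 6, 12),
    ("capital spending -> corporate profits", 6, 12),
]


def cumulative_lag(stages):
    """Return (min, max) total lag, assuming stage lags add up in sequence."""
    total_min = sum(low for _, low, _ in stages)
    total_max = sum(high for _, _, high in stages)
    return total_min, total_max


print(cumulative_lag(ELLIS_LAGS_MONTHS))  # (12, 39)
```

Under that additive assumption, a change in wages would reach corporate profits between 12 and 39 months later, which is wide enough that the lag structure itself becomes something to estimate from the data.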
  • 47. Three case studies (1/3) 1. Predicting Changes in Quarterly Corporate Earnings Using Economic Indicators. Ellis model: [figure]
  • 48. Three case studies (1/3) 1. Predicting Changes in Quarterly Corporate Earnings Using Economic Indicators. The data: i) 180 quarters, 6 economic x variables; ii) change in S&P EPS = y variable; iii) all variables transformed to year-over-year % change; iv) all data used are publicly available via websites of US agencies: BEA, BLS, FED, and S&P. The analysis: ran XLMiner on the different versions of the datasets; partitioned the data; ran predictor applications: ACF plots, MLR, regression tree (full and pruned). Autocorrelation chart: based on this, Lag_1 was taken as one of the predictors, where Lag_1 = QEPS_YY(Q-1).
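The two transformations mentioned on this slide, year-over-year % change and a one-quarter lag, can be sketched in plain Python as follows. The study itself used XLMiner; this is only an illustration, the helper names are hypothetical, the sample EPS values are made up, and the change is expressed as a fraction rather than a percentage.

```python
def yoy_change(series, periods_per_year=4):
    """Year-over-year change as a fraction; None where no prior year exists."""
    result = [None] * min(periods_per_year, len(series))
    for i in range(periods_per_year, len(series)):
        previous = series[i - periods_per_year]
        result.append((series[i] - previous) / previous)
    return result


def lag(series, k=1):
    """Shift a series forward by k periods, padding the start with None."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return [None] * min(k, len(series)) + list(series[:-k])


# Hypothetical quarterly EPS values (two years).
quarterly_eps = [1.0, 1.1, 1.2, 1.3, 1.1, 1.21, 1.5, 1.3]

qeps_yy = yoy_change(quarterly_eps)  # plays the role of QEPS_YY
lag_1 = lag(qeps_yy, 1)              # plays the role of Lag_1 = QEPS_YY(Q-1)
print(qeps_yy)
print(lag_1)
```

Each transformation shortens the usable sample: with quarterly data the year-over-year step loses the first four quarters and the lag loses one more.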
  • 49. Three case studies (1/3) 1. Predicting Changes in Quarterly Corporate Earnings Using Economic Indicators
      QEPS_YY%(t) = 0.0486 + 0.747*QEPS_YY%(t-1) - 0.517*QRCAP_YY%(t-2)
      #  Dimension                    Note              Value  Index
      1  Data resolution              quarterly data    2      0.2500
      2  Data structure               no externalities  3      0.5000
      3  Data integration                               4      0.7500
      4  Temporal relevance                             5      1.0000
      5  Generalizability                               5      1.0000
      6  Chronology of data and goal  quarterly data    3      0.5000
      7  Concept operationalization                     5      1.0000
      8  Communication                                  4      0.7500
      InfoQ Score = 0.66 (InfoQ = 66%)
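The fitted equation on this slide can be applied as a one-step-ahead forecast. The Python sketch below only evaluates the published coefficients; the function name and the input values are hypothetical, the slides do not define QRCAP_YY% beyond its name, and it is an assumption here that both inputs are expressed as fractions (0.10 for 10%).

```python
def predict_qeps_yy(qeps_yy_prev, qrcap_yy_lag2):
    """One-step forecast of QEPS_YY%(t) from the regression on the slide.

    qeps_yy_prev:  QEPS_YY% at quarter t-1
    qrcap_yy_lag2: QRCAP_YY% at quarter t-2
    Units are assumed to be fractions; the slides do not state them.
    """
    return 0.0486 + 0.747 * qeps_yy_prev - 0.517 * qrcap_yy_lag2


# Hypothetical inputs, not taken from the study's data.
forecast = predict_qeps_yy(qeps_yy_prev=0.10, qrcap_yy_lag2=0.02)
print(round(forecast, 5))  # 0.11296
```

The positive coefficient on the lagged earnings change carries most of the forecast, which matches the slide's use of Lag_1 as a predictor.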
  • 50. Three case studies (2/3) 2. Predicting ZILLOW.com’s Zestimate accuracy
      “Zillow.com” is a real estate service launched in 2006. It calculates a Zestimate home valuation for most homes in the U.S. For MD and VA, only about 26% of its predictions fall within the +/-5% range.
      Variables: 1. Home Type (Single Family, Condo, etc.) 2. No. of Bedrooms 3. No. of Bathrooms 4. Total Area (sq ft) 5. Lot Size (sq ft) 6. No. of Stories 7. Total Rooms 8. Distance from Metro 9. Primary School Rank 10. Middle School Rank 11. High School Rank 12. Age of House at Sale 13. Sale Season (Fall, Winter, etc.) 14. Recession Period (Y/N) 15. Sales Volume
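The "within +/-5%" figure quoted on this slide can be read as a hit rate over sold homes. The Python sketch below assumes that an estimate counts as a hit when it lies within 5% of the sale price; the slides do not spell out the exact definition, and the function names and sample prices are hypothetical.

```python
def within_tolerance(zestimate, sale_price, tolerance=0.05):
    """True if the estimate is within +/- tolerance of the sale price."""
    return abs(zestimate - sale_price) <= tolerance * sale_price


def hit_rate(pairs, tolerance=0.05):
    """Share of (zestimate, sale_price) pairs within the tolerance."""
    if not pairs:
        raise ValueError("no observations")
    hits = sum(1 for z, s in pairs if within_tolerance(z, s, tolerance))
    return hits / len(pairs)


# Hypothetical (zestimate, sale price) pairs in dollars.
sales = [
    (310_000, 300_000),  # off by 3.3%: hit
    (250_000, 280_000),  # off by 10.7%: miss
    (402_000, 400_000),  # off by 0.5%: hit
    (500_000, 450_000),  # off by 11.1%: miss
]
print(hit_rate(sales))  # 0.5
```

The same flag, computed per house, would serve as a binary Y/N target for a classifier built on the variables listed above.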
  • 51. Three case studies (2/3) 2. Predicting ZILLOW.com’s Zestimate accuracy • Data collected, cleansed and merged from 4 sources: Zillow, Redfin, School Digger and Google Maps • House sales data from 17 counties (29 ZIP codes) in Northern VA • Before data cleanup: 3500+ • After data cleanup: 1416 • Y: Is Zestimate correct (Y/N), 37.6%/62.43% • X: 15 variables (5+ variables were discarded from the initial set)
  • 52. Three case studies (2/3) 2. Predicting ZILLOW.com’s Zestimate accuracy
      #  Dimension                    Note                 Value  Index
      1  Data resolution              by individual house  5      1.0000
      2  Data structure               no externalities     4      0.7500
      3  Data integration                                  5      1.0000
      4  Temporal relevance                                5      1.0000
      5  Generalizability             only VA counties     3      0.5000
      6  Chronology of data and goal                       5      1.0000
      7  Concept operationalization                        4      0.7500
      8  Communication                                     4      0.7500
      InfoQ Score = 0.82 (InfoQ = 82%)
  • 53. Three case studies (2/3) 2. Predicting ZILLOW.com’s Zestimate accuracy. The Israeli version… http://www.madlan.co.il/education/schools
  • 54. Three case studies (3/3) 3. Predicting First Day Returns for Japanese IPOs
      Goal: To predict the first-day returns on Japanese IPOs (based on first-day closing price), using public information available prior to the offer.
      The data: i) Japanese IPO data from 1997-2009*; ii) 1561 IPOs; iii) Industry (categorical): 35 industries, of which 3 were spelling errors and were corrected.
      Data preparation:
      – Removed the Air Trans (1) and Fishery & Forestry (2) industries
      – Removed first 128 entries (1997-1999), as they had no data for 2 columns: Underwriter’s fees and Allocation to BRLM
      – New columns: Minimum bid size; Secondary offering percentage
      – Creation of dummy variables: BRLMs (3, on the basis of gross proceeds of IPO); Industry (4, binned by average return); Market (whether the IPO was OTC or not)
      *Kaneko and Pettway’s Japanese IPO Database (KP-JIPO) http://www.fbc.keio.ac.jp/~kaneko/KP-JIPO/top.htm
  • 55. Three case studies (3/3) 3. Predicting First Day Returns for Japanese IPOs
      Predictors: 1) Age of company at time of IPO 2) Gross Proceeds (size of IPO) 3) Minimum Bid Amount 4) IS_OTC listing 5) Secondary offering as percentage of total 6) Percentage of shares allocated to Lead Manager 1 7) Underwriter’s Gross Spread (fees as percentage of size of IPO) 8) Industry_Type (binned categorical variable, 4 categories) 9) Lead_Manager (binned categorical variable, 3 categories)
      #  Dimension                    Note               Value  Index
      1  Data resolution                                 5      1.0000
      2  Data structure                                  4      0.7500
      3  Data integration             no externalities   2      0.2500
      4  Temporal relevance                              5      1.0000
      5  Generalizability             no theory          3      0.5000
      6  Chronology of data and goal  should be ex ante  3      0.5000
      7  Concept operationalization                      5      1.0000
      8  Communication                                   4      0.7500
      InfoQ Score = 0.66 (InfoQ = 66%)
      Prediction algorithms do not give a reasonable prediction of IPO returns from public information (high RMSE: 90%).
  • 56. Thank you for your attention