This document discusses three case studies that use data analysis methods to address financial and risk-related questions. The first case study looks at predicting changes in corporate earnings using economic indicators. The second predicts the accuracy of Zillow home valuation estimates. The third examines factors that influence returns on initial public offerings of Japanese companies. The document then discusses dimensions of information quality that can impact the ability of a given dataset and analysis method to achieve a specified goal.
Automated Hypothesis Testing with Large Scale Scientific Workflows (dgarijo)
(Credit to Varun Ratnakar and Yolanda Gil).
The automation of important aspects of scientific data analysis would significantly accelerate the pace of science and innovation. Although important aspects of data analysis can be automated, the hypothesize-test-evaluate discovery cycle is largely carried out by hand by researchers. This introduces a significant human bottleneck, which is inefficient and can lead to erroneous and incomplete explorations. We introduce a novel approach to automate the hypothesize-test-evaluate discovery cycle with an intelligent system that a scientist can task to test hypotheses of interest in a data repository. Our approach captures three types of data analytics knowledge: 1) common data analytic methods represented as semantic workflows; 2) meta-analysis methods that aggregate those results, represented as meta-workflows; and 3) data analysis strategies that specify for a type of hypothesis what data and methods to use, represented as lines of inquiry. Given a hypothesis specified by a scientist, appropriate lines of inquiry are triggered, which lead to retrieving relevant datasets, running relevant workflows on that data, and finally running meta-workflows on workflow results. The scientist is then presented with a level of confidence on the initial hypothesis (or a revised hypothesis) based on the data and methods applied. We have implemented this approach in the DISK system, and applied it to multi-omics data analysis.
We estimate that nearly one third of news articles contain references to future events. While this information can prove crucial to understanding news stories and how events will develop for a given topic, there is currently no easy way to access it. We propose a new task to address the problem of retrieving and ranking sentences that contain mentions of future events, which we call ranking related news predictions. In this paper, we formally define this task and propose a learning-to-rank approach based on four classes of features: term similarity, entity-based similarity, topic similarity, and temporal similarity. Through extensive evaluations using a corpus of 1.8 million news articles and 6,000 manually judged relevance pairs, we show that our approach is able to retrieve a significant number of relevant predictions related to a given topic.
Machine Learning for Forecasting: From Data to Deployment (Anant Agarwal)
Forecasting is everywhere. This talk covers:
• Fundamental concepts of time series
• Data preprocessing (imputation and outlier analysis)
• Feature engineering and EDA for time series
• Statistical and machine learning algorithms
• Model evaluation through backtesting
• Model explanation using SHAP
• Model monitoring and deployment considerations
Artificial Intelligence and Stock Marketing (ijsrd.com)
Business intelligence is becoming a significant trend in the financial world. One such area is stock market intelligence, which makes use of data mining techniques such as association, clustering, artificial neural networks, decision trees, genetic algorithms, expert systems, and fuzzy logic. These techniques can be used to predict stock prices or trading signals automatically with reasonable accuracy. Although a great deal of research has been done in this area, many issues remain unexplored, and it is not clear to new researchers where and how to begin. Data mining can be applied to past and present financial data to generate patterns and decision-making systems. This paper gives a brief review of several attempts made by researchers at stock prediction, focusing on stock market analysis, and defines a new research area, stock market intelligence: developing data mining techniques to support all aspects of algorithmic trading, together with a number of research problems in stock intelligence related to forecasting and its accuracy.
Repurposing Classification & Regression Trees for Causal Research with High-D... (Galit Shmueli)
Keynote at WOMBAT 2019 (Monash University) https://www.monash.edu/business/wombat2019
Abstract:
Studying causal effects and structures is central to research in management, social science, economics, and other areas, yet typical analysis methods are designed for low-dimensional data. Classification & Regression Trees ("trees") and their variants are popular predictive tools used in many machine learning applications and predictive research, as they are powerful in high-dimensional predictive scenarios. Yet trees are not commonly used in causal-explanatory research. In this talk I will describe adaptations of trees that we developed for tackling two causal-explanatory issues: self-selection and confounder detection. For self-selection, we developed a novel tree-based approach adjusting for observable self-selection bias in intervention studies, thereby creating a useful tool for analysis of observational impact studies as well as post-analysis of experimental data, one which scales for big data. For tackling confounders, we repurpose trees for automated detection of potential Simpson's paradoxes in data with few or many potential confounding variables, even with very large samples. I'll also show insights revealed when applying these trees to applications in eGov, labor economics, and healthcare.
Different Classification Technique for Data mining in Insurance Industry usin... (IOSRjournaljce)
This paper addresses the issues and techniques for Property/Casualty actuaries applying data mining methods. Data mining means the effective discovery of unknown patterns from a large database. It is an interactive knowledge discovery procedure which includes data acquisition, data integration, data exploration, model building, and model validation. The paper provides an overview of the data discovery method and introduces some important data mining methods for application to insurance, concluding with cluster discovery approaches.
Presentation at special event "To Explain or To Predict?" at Tel Aviv University, July 9, 2012. Event co-organized by the Israel Statistical Association and Tel Aviv University's Department of Statistics and OR.
A REVIEW ON PREDICTIVE ANALYTICS IN DATA MINING (ijccmsjournal)
The main process of data mining is to collect, extract, and store valuable information, and nowadays many enterprises do this actively. Within advanced analytics, predictive analytics is the branch mainly used to make predictions about unknown future events. Predictive analytics uses techniques from machine learning, statistics, data mining, modeling, and artificial intelligence to analyze current data and make predictions about the future. Its two main tasks are regression and classification. It is composed of various analytical and statistical techniques used for developing models that predict future occurrences, probabilities, or events. Predictive analytics deals with both continuous and discontinuous changes. It provides a predictive score for each individual (healthcare patient, product SKU, customer, component, machine, or other organizational unit) to determine or influence organizational processes that span huge numbers of individuals, as in fraud detection, manufacturing, credit risk assessment, marketing, and government operations including law enforcement.
Big Data - To Explain or To Predict? Talk at U Toronto's Rotman School of Ma... (Galit Shmueli)
Slide from Prof. Galit Shmueli's talk at University of Toronto's Rotman School of Management, March 4, 2016. This talk is part of Rotman's Big Data Expert Speaker Series.
https://www.rotman.utoronto.ca/ProfessionalDevelopment/Events/UpcomingEvents/20160304GalitShmueli.aspx
WebSite Visit Forecasting Using Data Mining Techniques (Chandana Napagoda)
Data mining is a technique used for identifying relationships within large amounts of data in many areas, including scientific research, business planning, traffic analysis, and clinical trial data mining. This research investigates the applicability of data mining techniques to the website visit prediction domain. We concentrate on time series regression techniques, which are used to analyse and forecast time-dependent data points, and then explain how those techniques can be applied to forecast website visits.
Advancing Foundation and Practice of Software Analytics (Tao Xie)
Vision Statement Presentation on "Advancing Foundation & Practice of Software Analytics" at the 2nd International NSF sponsored Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE 2013) http://promisedata.org/raise/2013/
Contextual Information Elicitation in Travel Recommender Systems (Matthias Braunhofer)
Context-Aware Recommender Systems are advisory applications that exploit users’ preference knowledge contained in datasets of context-dependent user ratings, i.e., ratings augmented with the description of the contextual situation detected when the user experienced the item and rated it. Since the space of context-dependent ratings increases exponentially in size with the number of contextual factors, and because certain contextual information is still hard to acquire automatically (e.g., the user’s mood or the travellers’ group composition), it is fundamental to identify and acquire only those factors that truly influence the user preferences and consequently the ratings and the recommendations. In this paper, we propose a novel method that estimates the impact of a contextual factor on rating predictions and adaptively elicits from the users only the relevant ones. Our experimental evaluation, on two travel-related datasets, shows that our method compares favorably to other state-of-the-art context selection methods.
Do You Manage the Project, or Does the Project Manage You? (КоммандКор)
Do you manage the project, or does the project manage you? We are sure that we manage the project. Yet often what happens during management simply runs its course... Could it be that at some point the project began to manage us, and everything we see is merely a reflection and proof of its independent behaviour?
What do a project, a process, and a system have in common? Each of them can act as an object of management, and the management of these objects is built on comparable sets of typical management situations and decisions. This unique set can be called the DNA of the managed object. The proposed 3D model supports high-quality managerial decision-making by constructing this DNA and specifying the individual characteristics of the managed objects...
On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote) (Galit Shmueli)
Slides by Galit Shmueli for keynote presentation at 2015 Statistical Challenges in eCommerce Research (SCECR) symposium, Addis Ababa, Ethiopia (www.scecr.org)
Research design decisions and be competent in the process of reliable data co... (Statswork)
Research design may be described as the researcher's scheme for outlining the flow of a project. It is on the basis of the research design that the researcher goes about gathering data to answer the research question. It enables the researcher to prioritize work, create better questionnaires, and arrive at conclusions with greater clarity.
Predictive Model and Record Description with Segmented Sensitivity Analysis (... (Greg Makowski)
Describing a predictive data mining model can provide a competitive advantage when solving business problems with that model. The SSA approach can also provide reasons for the forecast for each record. This can help drive investigations into fields and interactions during a data mining project, as well as identify "data drift" between the original training data and the current scoring data. I am working on an open source version of SSA, first in R.
Presentation from a workshop given at ACRL 2011 conference, Data-Driven Library Web Design: Making Usability Testing Work with Collaborative Partnerships
PERFORMANCE ANALYSIS OF HYBRID FORECASTING MODEL IN STOCK MARKET FORECASTING (IJMIT JOURNAL)
This paper presents a performance analysis of a hybrid model, comprising concordance and Genetic Programming (GP), for forecasting financial markets, compared with some existing models. This scheme can be used for in-depth analysis of the stock market. Different measures of concordance, such as Kendall's Tau, Gini's Mean Difference, Spearman's Rho, and a weak interpretation of concordance, are used to search for patterns in the past that look similar to the present. Genetic Programming is then used to match the past trend to the present trend as closely as possible, and the Genetic Program estimates what will happen next based on what happened next in the past. The concept is validated using financial time series data (S&P 500 and NASDAQ indices) as sample data sets. The forecast results are then compared with the standard ARIMA model and other models to analyse performance.
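The concordance measures this abstract leans on can be made concrete. Below is a minimal pure-Python sketch of Kendall's tau (the simple tau-a variant, without tie corrections) applied to two hypothetical price windows; the data values are invented for illustration.

```python
from itertools import combinations

def kendalls_tau(x, y):
    """Kendall's tau (tau-a): (concordant - discordant) / all pairs.

    A pair of time points is concordant when x and y move in the same
    direction between them, discordant when they move oppositely;
    tied pairs contribute zero."""
    assert len(x) == len(y) and len(x) > 1
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Compare a recent index window with a candidate historical window:
recent = [101, 103, 102, 105, 107]
past = [55, 57, 56, 60, 61]          # moves in step with `recent`
print(kendalls_tau(recent, past))    # 1.0: all pairs concordant
```

A high tau between the present window and a historical window flags that window as a candidate "similar pattern in the past" in the scheme described above.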
Introduction to Data Analysis Course Notes.pdf (GraceOkeke3)
Embark on a journey into data analysis with our Introduction to Data Analysis slides. Uncover the fundamentals and prerequisites for effective analysis, explore types of data, and discover essential tools and methodologies. Equip yourself with the skills to unlock valuable insights.
Similar to Kenett On Information NYU-Poly 2013 (20)
1. Financial and Risk Applications of InfoQ
Prof. Ron S. Kenett
KPA Ltd., Raanana, Israel
Università degli Studi di Torino, Turin, Italy
NYU Poly, New York, USA
ron@kpa-group.com
2. Three case studies (1/3)
1. Predicting Changes in Quarterly Corporate Earnings Using Economic Indicators
http://galitshmueli.com/content/predicting-changes-quarterly-corporate-earnings-using-economic-indicators
This study looks at corporate earnings in relation to an existing theory of business forecasting developed by Joseph H. Ellis (former research analyst at Goldman Sachs).
3. Three case studies (2/3)
2. Predicting ZILLOW.com's Zestimate accuracy
http://galitshmueli.com/content/predicting-zillowcom-s-zestimate-accuracy
Zillow.com is a free real estate service that calculates an estimated home valuation ("Zestimate") as a starting point for anyone to see for most homes in the U.S. The study looks at the accuracy of Zestimates.
4. Three case studies (3/3)
3. Predicting First Day Returns for Japanese IPOs
http://galitshmueli.com/content/predicting-first-day-returns-japanese-ipos
An Initial Public Offering (IPO) is the first sale of stock by a company to the public. The study looks at the first-day returns on IPOs of Japanese companies.
5. Information Quality: InfoQ(f,X,g) = U( f(X|g) )
The potential of a particular dataset to achieve a particular goal using a given empirical analysis method. Depends on the quality of g, X, f, U and the relationships between them.
g: A specific analysis goal
X: The available dataset
f: An empirical analysis method
U: A utility measure
Kenett, R.S. and Shmueli, G. (2013) On Information Quality (with discussion), Journal of the Royal Statistical Society, Series A, 176(4). http://ssrn.com/abstract=1464444
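The InfoQ definition reads as a composition of functions: apply the analysis method f to the data X, conditioned on the goal g, and score the result with the utility U. A toy sketch follows; the concrete goal, method, and utility below are invented for illustration and are not prescribed by the InfoQ framework.

```python
# A minimal sketch of InfoQ(f, X, g) = U( f(X|g) ).
# The concrete goal, method, and utility here are illustrative choices.

def f(X, g):
    """Empirical analysis method: a naive one-step-ahead forecast of
    the column named by the goal."""
    series = [row[g["target"]] for row in X]
    actual = series[1:]
    predicted = series[:-1]   # naive forecast: carry the last value forward
    return actual, predicted

def U(result):
    """Utility measure: 1 / (1 + mean absolute error), in (0, 1]."""
    actual, predicted = result
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
    return 1.0 / (1.0 + mae)

g = {"target": "eps"}                                 # analysis goal
X = [{"eps": v} for v in [1.0, 1.1, 1.2, 1.2, 1.3]]   # available dataset

print(round(U(f(X, g)), 2))   # prints 0.93
```

Swapping in a different g, f, or U changes the resulting InfoQ value, which is precisely the point of the framework: information quality depends on all four components together.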
6. Analysis goal g
Explain, predict, describe; enumerative, analytic, exploratory, confirmatory
Goal Specification
• "Error of the third kind": giving the right answer to the wrong question - Kimball
• "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise" - Tukey
7. Analysis goal g
Typical Goals of Customer Surveys
Goal 1. Decide where to launch improvement initiatives
Goal 2. Highlight drivers of overall satisfaction
Goal 3. Detect positive or negative trends in customer satisfaction
Goal 4. Identify best practices by comparing products
Goal 5. Determine strengths and weaknesses
Goal 6. Set up improvement goals
Goal 7. Design a balanced scorecard with customer inputs
Goal 8. Communicate the results using graphics
Goal 9. Assess the reliability of the questionnaire
Goal 10. Improve the questionnaire for future use
8. Available data X
Data Source
• Primary, secondary
• Observational, experiment
• Single, multiple sources
• Collection instrument, protocol
Data Type
• Continuous, categorical, semantic
• Structured, un-, semi-structured
• Cross-sectional, time series, panel, network, geographical
Data Quality
• "Zeroth Problem: How do the data relate to the problem, and what other data might be relevant?" - Mallows
• Quality of Statistical Data (IMF, OECD): usefulness of summary statistics for a particular goal (7 dimensions)
Data Size and Dimension
• # observations
• # variables
9. Data analysis method f
Analysis Quality
• "Poor models and poor analysis techniques, or even analyzing the data in a totally incorrect way." - Godfrey
• Analyst expertise
• Software availability
• The focus of statistics education
Statistical models and methods
• Parametric, semi-, non-parametric
• Classic, Bayesian
Data mining algorithms
Graphical methods
Operations research methods
11. An example: analysis goal g and available data X
"Pennies from eBay: The determinants of price in online auctions", Lucking-Reiley D., Bryan D., Prasad N. & Reeves D., Journal of Industrial Economics, 2007
Goals of the study:
1. Predict the final price of an eBay auction at the start of the auction
2. Predict price during an ongoing auction
3. Predict the auctions with the highest prices (ranking)
4. Identify factors that determine the final price of an eBay auction
12. Available data X
461 eBay coin auctions (Indian Head pennies)
Auction characteristics: duration; open and close prices; number of bids and bidders; secret reserve price; weekday/weekend ending
Seller characteristics: seller rating
Item characteristics: year and grade of coin
"Pennies from eBay: The determinants of price in online auctions", Lucking-Reiley D., Bryan D., Prasad N. & Reeves D., Journal of Industrial Economics, 2007
18. #2 Data Structure
Data Types
• Time series, cross-sectional, panel
• Structured, semi-, non-structured
• Geographic, spatial, network
• Text, audio, video, semantic
• Discrete, continuous
Data Characteristics
• Corrupted and missing values due to study design or data collection mechanism
25. #4 Temporal Relevance
[Timeline: Data Collection (t1-t2), Data Analysis (t3-t4), Study Deployment (t5-t6), leading to the forecast]
Collection Timeliness: relevance to g
Analysis Timeliness: solving the right problem too late
g: Prospective vs. retrospective; longitudinal vs. snapshot
Nature of X, complexity of f
26. #5 Chronology of Data & Goal
Data: Daily AQI in a city (http://www.airnow.gov/?action=aqibasics.aqi)
g1: Reverse-engineer AQI
g2: Forecast AQI
Retrospective/prospective; ex-post availability; endogeneity
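For the forecasting goal g2, chronology is the binding constraint: a model may only use values that were available before the day being forecast. A small sketch (with made-up AQI values) of building a lagged design matrix that respects this:

```python
# The forecasting goal g2 only allows predictors that were available
# before the day being forecast. This sketch (with hypothetical AQI
# values) builds lagged rows so no same-day or future value leaks in.

aqi = [52, 60, 58, 71, 80, 75, 68]   # hypothetical daily AQI readings

def lagged_design(series, n_lags):
    """Return (predictors, target) rows where the predictors for day t
    are series[t-n_lags:t] and the target is series[t]."""
    rows = []
    for t in range(n_lags, len(series)):
        rows.append((series[t - n_lags:t], series[t]))
    return rows

for predictors, target in lagged_design(aqi, n_lags=2):
    print(predictors, "->", target)
# first row: [52, 60] -> 58
```

The reverse-engineering goal g1 has no such restriction: it may use same-day, ex-post available pollutant readings, which is exactly why the same dataset can have different InfoQ for g1 and g2.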
30. #7 Operationalization
The National Education Goals Panel (NEGP) recommended that states answer four questions on their student reports:
1. How did my child do?
2. What types of skills or knowledge does his or her performance reflect?
3. How did my child perform in comparison to other students in the school, district, state, and, if available, the nation?
4. What can I do to help my child improve?
33. #8 Communication
1992 NAEP Executive Summary Report: when asked what the 18% in line 1 meant, 53% of the policy makers responded incorrectly.
42. Assessing InfoQ in Practice
Rating-based assessment: rate each of the eight dimensions on a 1-5 scale:
InfoQ Score = [d1(Y1) x d2(Y2) x ... x d8(Y8)]^(1/8)
Experience from two research methods courses:
– Preparing a PhD research proposal (U Ljubljana, 50 students, goo.gl/f6bIA)
– Post-hoc evaluation of five completed studies (CMU, 16 students, goo.gl/erNPF)
#  Dimension                     Value  Index
1  Data resolution               5      1.0000
2  Data structure                4      0.7500
3  Data integration              5      1.0000
4  Temporal relevance            5      1.0000
5  Generalizability              3      0.5000
6  Chronology of data and goal   5      1.0000
7  Concept operationalization    2      0.2500
8  Communication                 3      0.5000
InfoQ Score = 0.68 (InfoQ = 68%)
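The slide's arithmetic can be reproduced directly. The Value-to-Index mapping shown in the table is consistent with index = (rating - 1)/4, which this sketch assumes; the InfoQ score is then the geometric mean of the eight indexes.

```python
# Reproducing the slide's InfoQ score. The Value -> Index mapping in
# the table is consistent with index = (rating - 1) / 4 (an assumption
# of this sketch); the score is the geometric mean of the 8 indexes.

ratings = {
    "Data resolution": 5,
    "Data structure": 4,
    "Data integration": 5,
    "Temporal relevance": 5,
    "Generalizability": 3,
    "Chronology of data and goal": 5,
    "Concept operationalization": 2,
    "Communication": 3,
}

def infoq_score(ratings):
    indexes = [(r - 1) / 4 for r in ratings.values()]
    product = 1.0
    for d in indexes:
        product *= d
    return product ** (1 / len(indexes))

print(round(infoq_score(ratings), 2))   # 0.68, matching the slide
```

Because the score is multiplicative, any dimension rated 1 (index 0) drives the whole score to zero: a study that fails completely on one dimension has no information quality, regardless of the other seven.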
43. InfoQ: Strengths and Challenges
The InfoQ approach streamlines questioning of data value:
• "Why should we invest in data?" – management
• Compare the value of potential datasets and analyses
• Prioritize/rank projects
• Strengthen the functional-analytical relationship
Multiple goals:
• Goals can change during a study: reevaluate InfoQ
• Multiple goals: prioritize (e.g., clinical trials: effect of a new drug vs. adverse effects)
To do:
• Improve InfoQ assessment
• Alternative InfoQ assessment approaches (pilot study, EDA, other)
• Further dimensions (data privacy, human subject compliance and risk)
• Effect of technological advances on InfoQ
44. Information Quality
[Diagram: data (primary or secondary; experimental or observational) passes through Data Quality, Information Quality, and Analysis Quality on the way to Knowledge]
Components (the “what”):
g  - a specific analysis goal
X  - the available dataset
f  - an empirical analysis method
U  - a utility measure
InfoQ(f, X, g) = U(f(X|g))
Eight dimensions (the “how”):
1. Data resolution
2. Data structure
3. Data integration
4. Temporal relevance
5. Chronology of data and goal
6. Generalizability
7. Operationalization
8. Communication
45. Big Data Analytics
(Russom, P., Big Data Analytics, TDWI Best Practices Report, Q4 2011)
The eight InfoQ dimensions apply to massive data sets as well:
1. Data resolution
2. Data structure
3. Data integration
4. Temporal relevance
5. Chronology of data and goal
6. Generalizability
7. Operationalization
8. Communication
46. Three case studies (1/3)
1. Predicting Changes in Quarterly Corporate Earnings Using Economic Indicators
Stages in an economic downturn: 1) the peak, 2) modest slowing, 3) intensifying worry among investors (much panic selling occurs in this stage), and 4) the advent of recession. Can we predict the slowdown in corporate earnings (S&P 500 EPS) well in advance?
Ellis claims (based on observations) that there is a 0-9 month lag between wages and their effect on consumer spending; 0-6 months until changes in consumer spending affect industrial production; another 6-12 months between industrial production and capital spending; and finally, another 6-12 months between capital spending and its effect on corporate profits.
47. Three case studies (1/3)
1. Predicting Changes in Quarterly Corporate Earnings Using Economic Indicators
Ellis model: [figure showing the lag chain from wages through consumer spending, industrial production, and capital spending to corporate profits]
48. Three case studies (1/3)
1. Predicting Changes in Quarterly Corporate Earnings Using Economic Indicators
The data: i) 180 quarters with 6 economic x variables; ii) change in S&P EPS as the y variable; iii) all variables transformed to year-over-year % change; iv) all data publicly available via websites of US agencies: BEA, BLS, FED, and S&P.
The analysis: XLMiner was run on different versions of the dataset. The data were partitioned, and several predictors were applied: ACF plots, multiple linear regression (MLR), and regression trees (full and pruned).
[Autocorrelation chart: based on it, Lag_1 = QEPS_YY(Q-1) was taken as one of the predictors]
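The lag-selection step can be sketched as follows: compute the sample autocorrelation of the EPS %-change series and keep strong lags as predictors. The series below is a synthetic stand-in (the real one comes from the S&P data), and the function name is my own:

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation of a 1-D series for lags 1..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom
                     for k in range(1, max_lag + 1)])

# Synthetic stand-in for 180 quarters of year-over-year EPS % change
rng = np.random.default_rng(0)
qeps_yy = np.cumsum(rng.normal(size=180)) * 0.01  # strongly autocorrelated toy series
r = sample_acf(qeps_yy, max_lag=8)
# A large lag-1 autocorrelation (r[0]) motivates using Lag_1 = QEPS_YY(Q-1)
```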
49. Three case studies (1/3)
1. Predicting Changes in Quarterly Corporate Earnings Using Economic Indicators
Fitted model:
QEPS_YY%(t) = 0.0486 + 0.747*QEPS_YY%(t-1) - 0.517*QRCAP_YY%(t-2)
#  Dimension                    Note              Value  Index
1  Data resolution              quarterly data    2      0.2500
2  Data structure               no externalities  3      0.5000
3  Data integration                               4      0.7500
4  Temporal relevance                             5      1.0000
5  Generalizability                               5      1.0000
6  Chronology of data and goal  quarterly data    3      0.5000
7  Concept operationalization                     5      1.0000
8  Communication                                  4      0.7500
InfoQ Score = 0.66 (66%)
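The fitted equation can be packaged as a small helper (coefficients copied from the slide; the function name is my own):

```python
def predict_qeps_yy(qeps_yy_lag1, qrcap_yy_lag2):
    """Fitted MLR from the case study: predicts the year-over-year %
    change in quarterly S&P 500 EPS from its own one-quarter lag and
    the two-quarter lag of the capital-spending %-change series."""
    return 0.0486 + 0.747 * qeps_yy_lag1 - 0.517 * qrcap_yy_lag2
```

For example, with last quarter's EPS change at +10% (0.10) and capital spending two quarters back at +5% (0.05), the model predicts roughly +9.7% (0.09745).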
50. Three case studies (2/3)
2. Predicting ZILLOW.com’s Zestimate accuracy
Zillow.com is a real estate service launched in 2006. It calculates a “Zestimate” home valuation for most homes in the U.S. For MD and VA, only about 26% of its predictions fall within the +/-5% range.
Candidate predictors:
1. Home type (single family, condo, etc.)
2. No. of bedrooms
3. No. of bathrooms
4. Total area (sq ft)
5. Lot size (sq ft)
6. No. of stories
7. Total rooms
8. Distance from Metro
9. Primary school rank
10. Middle school rank
11. High school rank
12. Age of house at sale
13. Sale season (fall, winter, etc.)
14. Recession period (Y/N)
15. Sales volume
51. Three case studies (2/3)
2. Predicting ZILLOW.com’s Zestimate accuracy
• Data collected, cleansed, and merged from 4 sources: Zillow, Redfin, School Digger, and Google Maps
• 17 counties (29 zip codes) in Northern VA
House sales data:
• Before data cleanup: 3500+ records
• After data cleanup: 1416 records
• Y: Is the Zestimate correct (Y/N)? Class split 37.6% / 62.4%
• X: 15 variables (5+ variables were discarded from the initial set)
52. Three case studies (2/3)
2. Predicting ZILLOW.com’s Zestimate accuracy
#  Dimension                    Note                 Value  Index
1  Data resolution              by individual house  5      1.0000
2  Data structure               no externalities     4      0.7500
3  Data integration                                  5      1.0000
4  Temporal relevance                                5      1.0000
5  Generalizability             only VA counties     3      0.5000
6  Chronology of data and goal                       5      1.0000
7  Concept operationalization                        4      0.7500
8  Communication                                     4      0.7500
InfoQ Score = 0.82 (82%)
53. Three case studies (2/3)
2. Predicting ZILLOW.com’s Zestimate accuracy
The Israeli version: http://www.madlan.co.il/education/schools
54. Three case studies (3/3)
3. Predicting First Day Returns for Japanese IPOs
Goal: predict first-day returns on Japanese IPOs (based on the first-day closing price), using public information available prior to the offer.
The data: i) Japanese IPO data from 1997-2009*; ii) 1561 IPOs; iii) industry (categorical): 35 industries, of which 3 were spelling errors (corrected).
Preprocessing:
– Removed the Air Transport (1) and Fishery & Forestry (2) industries
– Removed the first 128 entries (1997-1999), which had no data for 2 columns: underwriter’s fees and allocation to BRLM
– New columns: minimum bid size; secondary offering percentage
– Created dummy variables:
  BRLMs - 3, on the basis of gross proceeds of the IPO
  Industry - 4, binned by average return
  Market - whether the IPO was OTC or not
* Kaneko and Pettway’s Japanese IPO Database (KP-JIPO): http://www.fbc.keio.ac.jp/~kaneko/KP-JIPO/top.htm
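The dummy-variable step above can be sketched in plain Python (the function name, column names, and category values are illustrative, not from the KP-JIPO data dictionary):

```python
def one_hot(values, categories):
    """Encode a binned categorical column as 0/1 dummy columns,
    one per category, in the order given."""
    return [[1 if v == c else 0 for c in categories] for v in values]

# e.g. the Market dummy: whether the IPO was OTC or not
markets = ["OTC", "Main", "OTC"]          # illustrative values
dummies = one_hot(markets, ["OTC", "Main"])
# dummies == [[1, 0], [0, 1], [1, 0]]
```

The same encoder applies to the binned Industry (4 categories) and Lead Manager (3 categories) columns.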
55. Three case studies (3/3)
3. Predicting First Day Returns for Japanese IPOs
Predictors:
1) Age of company at time of IPO
2) Gross proceeds (size of IPO)
3) Minimum bid amount
4) IS_OTC listing
5) Secondary offering as % of total
6) Percentage of shares allocated to Lead Manager 1
7) Underwriter’s gross spread (fees as % of IPO size)
8) Industry_Type (binned categorical variable, 4 categories)
9) Lead_Manager (binned categorical variable, 3 categories)
#  Dimension                    Note               Value  Index
1  Data resolution                                 5      1.0000
2  Data structure                                  4      0.7500
3  Data integration             no externalities   2      0.2500
4  Temporal relevance                              5      1.0000
5  Generalizability             no theory          3      0.5000
6  Chronology of data and goal  should be ex ante  3      0.5000
7  Concept operationalization                      5      1.0000
8  Communication                                   4      0.7500
InfoQ Score = 0.66 (66%)
Conclusion: prediction algorithms do not give a reasonable prediction of IPO returns from public information (high RMSE: 90%).