Searching for Anomalies, by Thomas Dietterich, Distinguished Professor Emeritus in the School of EECS at Oregon State University and Chief Scientist of BigML.
*MLSEV 2020: Virtual Conference.
Alleviating Privacy Attacks Using Causal Models, by Amit Sharma
Machine learning models, especially deep neural networks, have been shown to reveal membership information about inputs in the training data. Such membership inference attacks are a serious privacy concern: for example, patients providing medical records to build a model that detects HIV would not want their identity to be leaked. Further, we show that the attack accuracy amplifies when the model is used to predict samples that come from a different distribution than the training set, which is often the case in real-world applications. Therefore, we propose the use of causal learning approaches, where a model learns the causal relationship between the input features and the outcome. An ideal causal model is known to be invariant to the training distribution and hence generalizes well both to shifts within the same distribution and across different distributions. First, we prove that models learned using causal structure provide stronger differential privacy guarantees than associational models under reasonable assumptions. Next, we show that causal models trained on sufficiently large samples are robust to membership inference attacks across different distributions of datasets, and that those trained on smaller sample sizes always have lower attack accuracy than corresponding associational models. Finally, we confirm our theoretical claims with an experimental evaluation on four moderately complex Bayesian network datasets and a colored MNIST image dataset. Associational models exhibit up to 80% attack accuracy under different test distributions and sample sizes, whereas causal models exhibit attack accuracy close to a random guess. Our results confirm the value of the generalizability of causal models in reducing susceptibility to privacy attacks. Paper available at https://arxiv.org/abs/1909.12732
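The membership inference threat can be sketched with a simple confidence-thresholding attack (a common baseline in this literature, not the paper's exact construction): the adversary guesses "member" whenever the model is unusually confident on the true label. All data below are synthetic.

```python
import numpy as np

def confidence_attack(model_confidences, threshold=0.9):
    """Guess 'member' when the model's confidence on the true label
    exceeds a threshold. A hypothetical baseline attack for illustration."""
    return model_confidences >= threshold

# Toy illustration: overfit models tend to be more confident on
# training (member) points than on unseen (non-member) points.
rng = np.random.default_rng(0)
member_conf = rng.uniform(0.85, 1.0, size=1000)    # training points
nonmember_conf = rng.uniform(0.5, 1.0, size=1000)  # held-out points

guesses = np.concatenate([confidence_attack(member_conf),
                          confidence_attack(nonmember_conf)])
truth = np.concatenate([np.ones(1000, bool), np.zeros(1000, bool)])
accuracy = (guesses == truth).mean()  # above 0.5 means membership leaks
```

A well-generalizing model, causal or otherwise, narrows the confidence gap between the two groups, which is exactly what pushes this attack's accuracy back toward a random guess.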
R - what do the numbers mean? #RStats This is the presentation for my demo at Orlando Live60 AILIve. We go through statistics interpretation with examples.
Module 4: Model Selection and Evaluation, by Sara Hooker
Delta Analytics is a 501(c)3 non-profit in the Bay Area. We believe that data is powerful, and that anybody should be able to harness it for change. Our teaching fellows partner with schools and organizations worldwide to work with students excited about the power of data to do good.
Welcome to the course! These modules will teach you the fundamental building blocks and the theory necessary to be a responsible machine learning practitioner in your own community. Each module focuses on accessible examples designed to teach you about good practices and the powerful (yet surprisingly simple) algorithms we use to model data.
To learn more about our mission or provide feedback, take a look at www.deltanalytics.org.
State of the Art in Machine Learning, by Thomas Dietterich, Distinguished Professor Emeritus in the School of EECS at Oregon State University and Chief Scientist of BigML.
*MLSEV 2020: Virtual Conference.
An introduction to causal graphical models with examples of causality in practice from different fields of science. More focused discussion of causal inference in online ads and recommender systems.
Traditional randomized experiments allow us to determine the overall causal impact of a treatment program (e.g., marketing, medical, social, educational, or political). Uplift modeling (also known as true lift, net lift, or incremental lift) goes a step further, using data mining and machine learning to identify the individuals who are truly positively influenced by a treatment. This technique allows us to identify the "persuadables" and thus optimize target selection to maximize treatment benefits. This important subfield of data mining, data science, and business analytics has gained significant attention in areas such as personalized marketing, personalized medicine, and political elections, with many publications and presentations appearing in recent years from both industry practitioners and academics.
In this workshop, I will introduce the concept of uplift, review existing methods, contrast them with the traditional approach, and introduce a new method that can be implemented with standard software. A method and metrics for model assessment will be recommended. Our discussion will include new approaches to handling the general situation where only observational data are available, i.e., without randomized experiments, using techniques from causal inference. Additionally, an integrated modeling approach for uplift and direct response (where we can identify who actually responded, e.g., via click-through or coupon scanning) will be discussed. Last but not least, extensions to the multiple-treatment situation, with solutions for optimizing treatments at the individual level, will also be discussed. While the talk is geared towards marketing applications ("personalized marketing"), the same methodologies can be readily applied in other fields such as insurance, medicine, education, and political and social programs. Examples from the retail and non-profit industries will be used to illustrate the methodologies.
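The "persuadables" idea above can be sketched with the simplest uplift approach, the two-model ("T-learner") method: fit separate response models for treated and control groups, then score each individual by the difference in predicted response probability. The data and feature construction below are synthetic, and this is an illustration of the baseline method, not the workshop's new method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 5000
x = rng.normal(size=(n, 3))                        # customer features
treated = rng.integers(0, 2, size=n).astype(bool)  # randomized treatment
# Response depends on the features, plus a treatment effect that only
# some customers (the "persuadables") experience.
base = 1 / (1 + np.exp(-x[:, 0]))
effect = 0.2 * (x[:, 1] > 0)
y = rng.uniform(size=n) < base + effect * treated

# Two-model approach: one response model per arm.
m_t = LogisticRegression().fit(x[treated], y[treated])
m_c = LogisticRegression().fit(x[~treated], y[~treated])

uplift = m_t.predict_proba(x)[:, 1] - m_c.predict_proba(x)[:, 1]
# Target the customers with the highest predicted uplift.
persuadables = np.argsort(uplift)[::-1][:500]
```

Note that the two-model approach is known to be noisy when the two models' errors do not cancel; the dedicated uplift methods surveyed in the workshop exist precisely to address this.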
Uplift Modelling as a Tool for Making Causal Inferences at Shopify - Mojan Hamed, Rising Media Ltd.
For many businesses, it is not enough to model the probability of an outcome; the real question is, "given a predictive model, what can we do to change the probability of this outcome?" The goal of this talk is to present how uplift modelling is used to make causal inferences that guide acquisition strategy at Shopify. Mojan will walk through a case study focused on the statistics and experimental design behind uplift modelling, in addition to the lessons learned from bringing this model to production. The Python implementation of this presentation will be made available to attendees.
Predictability of popularity on online social media: Gaps between prediction ..., by Amit Sharma
Can we predict the future popularity of a song, movie or tweet? Recent work suggests that although it may be hard to predict an item's popularity when it is first introduced, peeking into its early adopters and properties of their social network makes the problem easier. We test the robustness of such claims by using data from social networks spanning music, books, photos, and URLs. We find a stronger result: not only do predictive models with peeking achieve high accuracy on all datasets, they also generalize well, so much so that models trained on any one dataset perform with comparable accuracy on items from other datasets. Though practically useful, our models (and those in other work) are intellectually unsatisfying because common formulations of the problem, which involve peeking at the first small-k adopters and predicting whether items end up in the top half of popular items, are both too sensitive to the speed of early adoption and too easy. Most of the predictive power comes from looking at how quickly items reach their first few adopters, while for other features of early adopters and their networks, even the direction of correlation with popularity is not consistent across domains. Problem formulations that examine items that reach k adopters in about the same amount of time reduce the importance of temporal features, but also overall accuracy, highlighting that we understand little about why items become popular while providing a context in which we might build that understanding.
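The "peeking" formulation above can be sketched as follows, with synthetic data standing in for the real social-network datasets. It illustrates the abstract's central point: a single temporal feature, how quickly an item reaches its first k adopters, already carries most of the predictive signal.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_items, k = 2000, 10
# Synthetic world: items with faster intrinsic adoption speed both
# reach their first k adopters sooner and end up more popular.
speed = rng.exponential(1.0, size=n_items)
time_to_k = rng.exponential(1.0 / (0.5 + speed), size=n_items)
final_pop = speed * rng.lognormal(sigma=0.5, size=n_items)
top_half = final_pop > np.median(final_pop)   # the usual target

# Predict "top half" from the single temporal peeking feature.
X = np.log(time_to_k).reshape(-1, 1)
clf = LogisticRegression().fit(X, top_half)
acc = clf.score(X, top_half)   # beats the 0.5 random-guess baseline
```

Because faster adoption means a smaller time-to-k, the fitted coefficient is negative: early speed alone separates the popular items, which is exactly why the paper argues this formulation is "too easy".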
Recent news about the pending shortage of data scientists prompts speculation about automation: will machines replace human analysts? We propose a model of automation, and briefly review progress in automated machine learning over the past twenty years. Summarizing the current state of the art, we look at some of the remaining challenges, and the implications for practicing data scientists.
Simulating data to gain insights into power and p-hacking, by Dorothy Bishop
Very basic introduction to simulating data to illustrate issues affecting reproducibility. Uses Excel and R, but assumes no prior knowledge of R. Please let me know of errors or things that need better explanation.
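The same kind of simulation can be written in a few lines (the slides themselves use Excel and R; the logic carries over directly). Here is a sketch estimating the power of a two-sample t-test for a medium effect size, with parameters chosen for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_sims, n, d = 2000, 30, 0.5   # simulations, per-group n, true effect size

significant = 0
for _ in range(n_sims):
    a = rng.normal(0.0, 1.0, n)   # control group
    b = rng.normal(d, 1.0, n)     # group with a real effect of d SDs
    _, p = stats.ttest_ind(a, b)
    significant += p < 0.05

power = significant / n_sims      # about 0.48 for these settings
```

Running the same loop with d = 0 shows the flip side: about 5% of tests come out "significant" with no real effect, which is the raw material of p-hacking.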
Delta Analytics is a 501(c)3 non-profit in the Bay Area. We believe that data is powerful, and that anybody should be able to harness it for change. Our teaching fellows partner with schools and organizations worldwide to work with students excited about the power of data to do good.
Welcome to the course! These modules will teach you the fundamental building blocks and the theory necessary to be a responsible machine learning practitioner in your own community. Each module focuses on accessible examples designed to teach you about good practices and the powerful (yet surprisingly simple) algorithms we use to model data.
To learn more about our mission or provide feedback, take a look at www.deltanalytics.org.
Econometrics or machine learning? I explain when each tool is appropriate, and survey the issues and tools involved in establishing causal relationships.
Distributed tracing is still finding its footing in many organizations today. One challenge to overcome is data volume: keeping 100% of your traces is expensive and unnecessary. Enter sampling. Head-based or tail-based: how do you decide? Let's review the tradeoffs associated with different types of sampling strategies and how they can be mixed and matched.
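A minimal sketch of the two strategies, with made-up field names rather than any specific tracing system's API: head sampling decides at trace start (cheap, but blind to outcomes), while tail sampling decides after the trace completes (requires buffering, but can keep every error and every slow trace).

```python
import random

def head_sample(trace_id: int, rate: float = 0.01) -> bool:
    # Head-based: decide up front, deterministically per trace id,
    # so every service in the call chain makes the same decision.
    return (hash(trace_id) % 10_000) < rate * 10_000

def tail_sample(trace: dict, latency_slo_ms: float = 500.0) -> bool:
    # Tail-based: decide once the whole trace is known. Keep all
    # errors and SLO-violating traces, plus a tiny slice of the rest.
    if trace["error"]:
        return True
    if trace["duration_ms"] > latency_slo_ms:
        return True
    return random.random() < 0.001
```

Mixing them is common: head-sample aggressively to bound collection cost, then tail-sample the surviving traces to guarantee the interesting ones are retained.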
Through every change in marketing technology, the email newsletter has remained one of the most effective tools in the SMB marketer’s kit. Because of its importance, we surveyed 500 U.S. SMB principals to better understand the role email newsletters play in today’s dynamic marketing environment.
In this SlideShare you’ll learn:
• How SMBs rate their business outlook and challenges
• The formats and topics SMBs are most interested in
• Which industries SMBs most want email newsletters from, and from which they’re already subscribed
• The content mix they prefer
• Where SMBs are most likely to subscribe to an email newsletter
• What gets SMBs to forward an email newsletter to colleagues
• The effect of an email newsletter program on awareness, brand perception and purchase propensity
You’ll get valuable insights to put to work right away in your SMB email newsletter program.
Neotys organized its first Performance Advisory Council in Scotland, on the 14th and 15th of November.
With 15 load testing experts from several countries (UK, France, New Zealand, Germany, USA, Australia, India, and more), we explored several themes around load testing, such as DevOps, shift right, and AI.
By discussing their experience, the methods they used, their data analysis, and their interpretation, we created a lot of high-value content that you can use to discover the future of load testing.
Want to know more about this event? https://www.neotys.com/performance-advisory-council
How well do you know your pixels? Join this session to learn everything from basic information on how we display colors, all the way through using advanced calculations to prove that a device has a retina display. Whether you design interfaces for watches, phones, tablets, desktops, or 10-foot UI’s, you will gain some great insight into the fundamentals of how your work is displayed. This session will give you the foundation to come up with the next great concepts in digital interfaces!
dxDOE: Design of Experiments for students, by tenadrementees
A text on statistics that can be used by students and professionals. It covers additional topics relevant to professionals in the field who need the knowledge.
Statistics is not just a subject confined to textbooks; it's a powerful tool that permeates every aspect of our lives. Whether you're a student embarking on your academic journey or a seasoned professional navigating the complexities of your field, a solid understanding of statistics is indispensable. That's where this comprehensive text comes in.
From the foundational principles to advanced techniques, this text is designed to equip both students and professionals with the knowledge and skills necessary to harness the full potential of statistics. We start by laying the groundwork with essential concepts such as probability theory, random variables, and descriptive statistics. Through clear explanations and illustrative examples, we ensure that readers grasp these fundamental building blocks with ease.
But statistics is not just about crunching numbers; it's about making sense of data and drawing meaningful insights. That's why we delve into inferential statistics, exploring hypothesis testing, confidence intervals, and regression analysis. By learning how to infer conclusions from sample data, readers gain the ability to make informed decisions and predictions based on statistical evidence.
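As a small worked example of the inferential step described above, here is a 95% confidence interval for a mean, using only the Python standard library (the data values are made up):

```python
import math
import statistics

sample = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9]
mean = statistics.mean(sample)                       # 12.05
sem = statistics.stdev(sample) / math.sqrt(len(sample))
t_crit = 2.365   # t-distribution critical value, df = 7, two-sided 95%
ci = (mean - t_crit * sem, mean + t_crit * sem)
```

Reading the result is the "informed decision" part: any hypothesized population mean outside the interval would be rejected at the 5% level by the corresponding t-test.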
But the journey doesn't stop there. We go beyond the basics to cover advanced topics that are crucial for professionals in today's data-driven world. Multivariate analysis, time series analysis, and Bayesian statistics are just a few of the advanced techniques that readers will master, providing them with the tools to tackle complex problems and extract deeper insights from data.
What sets this text apart is its emphasis on real-world relevance. Each chapter is carefully crafted to bridge the gap between theory and practice, with practical examples and case studies drawn from a wide range of industries and disciplines. Whether you're working in finance, healthcare, marketing, or any other field, you'll find that the principles and techniques covered in this text are directly applicable to your day-to-day work.
Moreover, we recognize that proficiency in statistical software is essential for modern professionals. That's why we include discussions on popular tools such as R, Python, and SPSS, empowering readers to analyze data efficiently and effectively. With hands-on exercises and tutorials, readers can develop their skills in data analysis and visualization, gaining practical experience that will serve them well in their careers.
In sum, this text is more than just a book; it's a comprehensive guide to mastering the art and science of statistics. Whether you're a student seeking to build a strong foundation or a professional looking to expand your analytical toolkit, this text has everything you need to succeed in today's data-driven world, with clear explanations and practical examples throughout.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Opendatabay - Open Data Marketplace, by Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
The first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. It leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) helps avoid duplicate computations and can also reduce iteration time. Road networks often have chains which can be short-circuited before the PageRank computation to improve performance; the final ranks of chain nodes can then be calculated directly. This can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
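As a concrete illustration of the first optimization, here is a minimal power-iteration PageRank that stops updating vertices once their rank has settled. This is a sketch of the convergence-skipping heuristic only, not the STICD implementation, and it assumes every vertex has at least one out-link (no dangling nodes).

```python
def pagerank(out_links, d=0.85, tol=1e-10, max_iter=100):
    """out_links[u] is the list of vertices u links to."""
    n = len(out_links)
    rank = [1.0 / n] * n
    converged = [False] * n
    # Precompute in-links so each vertex pulls rank from its sources.
    in_links = [[] for _ in range(n)]
    for u, outs in enumerate(out_links):
        for v in outs:
            in_links[v].append(u)
    for _ in range(max_iter):
        new, active = list(rank), 0
        for v in range(n):
            if converged[v]:
                continue   # skip work for settled vertices (heuristic)
            s = sum(rank[u] / len(out_links[u]) for u in in_links[v])
            new[v] = (1 - d) / n + d * s
            if abs(new[v] - rank[v]) < tol:
                converged[v] = True
            else:
                active += 1
        rank = new
        if active == 0:
            break
    return rank

# 3-node cycle: perfectly symmetric, so every rank settles at 1/3.
ranks = pagerank([[1], [2], [0]])
```

The other optimizations in the paragraph layer onto this loop: in-identical vertices share one computed value, chain ranks are solved in closed form before iterating, and components are processed in topological order so each converges against already-final inputs.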
Webinar slides: DIY Market Mapping Using Correspondence Analysis
1. TIM BOCK PRESENTS

If you have any questions, enter them into the Questions field. Questions will be answered at the end. If we do not have time to get to your question, we will email you.

We will email you a link to the video, slides, and data.

Get a free one-month trial of Q from www.q-researchsoftware.com

DIY Market Mapping Using Correspondence Analysis

IF YOU HAVE ANY TECHNICAL ISSUES VIEWING THIS WEBINAR, YOU CAN CATCH UP ON THE FULL RECORDING ON OUR WEBSITE
2. Overview
Introduction: Overview; Visualizing a big table; Software
Interpretation: Proximities, angles, and lengths; Quality
Make it better: Removing 'outliers'; Rotation; Supplementary points
Data & algorithms: Appropriate data for correspondence analysis; Correspondence analysis of square tables; Choice of statistic; Multiple correspondence analysis; Composite tables
Interpretation again: Normalization
Visualization: Moonplots; Logos; Bubble charts; Comparing groups; Trends
End: Resources; Q&A
3. Typical input data: Brand association table (% of respondents associating each attribute with each brand)

Brand | Fun | Worth what you pay for | Innovative | Good customer service | Stylish | Easy to use | High quality | High performance | Low prices
Apple | 64% | 49% | 75% | 51% | 69% | 59% | 72% | 66% | 7%
Microsoft | 22% | 39% | 43% | 21% | 20% | 38% | 46% | 45% | 7%
IBM | 3% | 6% | 15% | 4% | 5% | 7% | 21% | 23% | 4%
Google | 63% | 40% | 59% | 27% | 32% | 58% | 40% | 42% | 17%
Intel | 4% | 15% | 19% | 8% | 5% | 10% | 21% | 23% | 3%
Hewlett-Packard | 5% | 21% | 15% | 13% | 15% | 19% | 31% | 25% | 12%
Sony | 25% | 36% | 28% | 18% | 36% | 34% | 48% | 36% | 12%
Dell | 6% | 15% | 10% | 12% | 11% | 17% | 21% | 18% | 22%
Yahoo | 14% | 7% | 9% | 6% | 3% | 14% | 7% | 7% | 11%
Nokia | 5% | 16% | 11% | 12% | 12% | 25% | 22% | 12% | 25%
Samsung | 29% | 43% | 50% | 30% | 52% | 51% | 49% | 46% | 21%
LG | 16% | 36% | 28% | 18% | 31% | 35% | 38% | 29% | 34%
Panasonic | 10% | 27% | 20% | 13% | 23% | 27% | 35% | 24% | 22%
None of these | 14% | 9% | 5% | 21% | 10% | 5% | 4% | 6% | 31%
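A correspondence analysis map of a table like this can be computed directly via singular value decomposition of the standardized residuals. The sketch below applies the textbook algorithm to a small subset of the table (four brands, four attributes, counts used as-is); it is an illustration, not the code behind the webinar's package.

```python
import numpy as np

# Subset of the brand association table: rows Apple, Microsoft, IBM,
# Google; columns Fun, Worth what you pay for, Innovative, Good
# customer service.
counts = np.array([[64, 49, 75, 51],
                   [22, 39, 43, 21],
                   [ 3,  6, 15,  4],
                   [63, 40, 59, 27]], dtype=float)

P = counts / counts.sum()            # correspondence matrix
r = P.sum(axis=1)                    # row masses
c = P.sum(axis=0)                    # column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # std. residuals
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates: scale singular vectors by singular values
# and by 1/sqrt(mass). Plot the first two columns of each for the map.
row_coords = (U * sv) / np.sqrt(r)[:, None]
col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]
```

The squared singular values are the inertias of the dimensions; the "quality" numbers discussed later in the webinar are the share of each point's inertia captured by the two plotted dimensions.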
6. Software

Everything in this webinar can be done using R, with our flipDimensionReduction package on GitHub: https://github.com/Displayr/flipDimensionReduction

Everything we do today can also be done using Displayr: Insert > More > Dimension Reduction.

Everything in this webinar is demonstrated using Q (www.q-researchsoftware.com)
8. Interpretation 1: More similar brands (rows) are usually close together
9. Interpretation 2: The further a brand is from the origin, the more differentiated it is (usually)
10. Interpretation 3: More similar attributes (columns) are usually close together
11. Interpretation 4: The further an attribute is from the origin, the more differentiating it is (usually)
12. Interpretation 5: Relationships between brands and attributes are not determined by proximity

There is not a strong association between Easy to use, Stylish, and Samsung (Samsung is not differentiated; Easy to use is not differentiating).
13. Interpretation 6: The direction of association between brands and attributes is usually
determined by angle of the lines joining the brand and the attribute to the origin – example 1
There is a positive association between Low prices and Nokia, as the lines connecting them to the origin form a small (acute) angle.
14. Interpretation 6: The direction of association between brands and attributes is usually
determined by angle of the lines joining the brand and the attribute to the origin – example 2
There is no association between High quality and Nokia, as the angle formed by the lines connecting the brand and the attribute to the origin (0,0) is approximately 90 degrees.
15. Interpretation 6: The direction of association between brands and attributes is usually
determined by angle of the lines joining the brand and the attribute to the origin
There is a negative association between Innovative and Nokia, as they are on opposite sides of the origin.
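The angle rule above can be sketched numerically: the direction of association is given by the sign of the cosine of the angle between the two vectors from the origin. A minimal Python sketch; the coordinates and cosine thresholds below are illustrative assumptions, not values read off this deck:

```python
import math

def association_direction(brand_xy, attr_xy):
    """Classify a brand-attribute association by the angle at the origin:
    acute angle (cos > 0) -> positive, near 90 degrees (cos ~ 0) -> none,
    obtuse angle (cos < 0) -> negative."""
    dot = brand_xy[0] * attr_xy[0] + brand_xy[1] * attr_xy[1]
    cos_angle = dot / (math.hypot(*brand_xy) * math.hypot(*attr_xy))
    if cos_angle > 0.26:    # angle below roughly 75 degrees
        return "positive"
    if cos_angle < -0.26:   # angle above roughly 105 degrees
        return "negative"
    return "none"

# Hypothetical map coordinates echoing the examples above
nokia = (-1.2, 0.4)
low_prices = (-1.5, 0.6)   # small angle to Nokia -> positive association
innovative = (1.0, -0.2)   # opposite side of the origin -> negative
```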
16. Interpretation 7: The strength of association is usually proportional to the product of the
cosine of the angle, and the lengths of the lines from brand and attribute to origin – example 1
There is a strong positive association between Low prices and Nokia.
17. Interpretation 7: The strength of association is usually proportional to the product of the cosine
of the angle, and the lengths of the two lines from brand and attribute to origin – example 2
There is perhaps a very weak association between Easy to use and Samsung:
• The line to Easy to use from the origin is short
• The line to Samsung from the origin is moderate
• The angle at the origin is irrelevant because the lines are so short.
18. Interpretation 7: The strength of association is usually proportional to the product of the cosine
of the angle, and the lengths of the two lines from brand and attribute to origin – example 3
Nokia’s negative association with Innovative is stronger than LG’s.
19. Interpretation 7: The strength of association is usually proportional to the product of the cosine
of the angle, and the lengths of the two lines from brand and attribute to origin – example 4
There is a negative association between Low prices and Apple:
• The line to Low prices from the origin is long
• The line to Apple from the origin is moderate
• The angle at the origin is obtuse (this means negative).
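The rule for interpretation 7, cosine of the angle times the two line lengths, is algebraically just the dot product of the two coordinate vectors. A hedged Python sketch with hypothetical coordinates (not values from this map):

```python
def association_strength(brand_xy, attr_xy):
    """Strength of association ~ cos(angle) * |brand line| * |attribute line|.
    Expanding the cosine shows this is the dot product of the two vectors."""
    return brand_xy[0] * attr_xy[0] + brand_xy[1] * attr_xy[1]

# Hypothetical coordinates: Nokia's line is long, LG's is short,
# and both sit at an obtuse angle to Innovative.
innovative = (1.0, -0.2)
nokia = (-1.2, 0.4)
lg = (-0.3, 0.1)

# Nokia's negative association with Innovative is stronger (more negative)
# than LG's, because Nokia's line is longer at a similar angle.
```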
22. Interpretation 9: The biggest number in the raw data table will usually not be the biggest
indexed residual (i.e., strongest association)
Retail sales (millions) Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Food retailing 10,245 9,557 10,354 9,728 9,815 9,517 9,929 10,042 10,006 10,483 10,436 12,230
Household goods retailing 4,377 3,980 4,097 4,065 4,093 4,357 4,225 4,239 4,469 4,697 4,874 5,782
Clothes/Accessories 1,876 1,599 1,781 1,925 1,927 1,967 1,876 1,806 1,897 1,938 2,057 3,331
Department stores 1,519 1,156 1,452 1,451 1,450 1,596 1,468 1,294 1,394 1,497 1,684 2,850
Other retailing 3,305 3,257 3,399 3,356 3,429 3,414 3,493 3,562 3,602 3,643 4,051 4,860
Food service 3,432 3,187 3,435 3,452 3,431 3,314 3,573 3,648 3,696 3,717 3,679 4,047
23. Interpretation 10: Review the variance explained
(Select the Map: Create > Dimension Reduction > Diagnostic > Quality)
100% - 54.3% - 25.9% = 19.8% of the variance in the indexed residuals is not shown on the map. The map will be misleading in some ways.
24. Interpretation 11: Review the quality of the map for each brand (row)
(set Output to Diagnostics)
The map only explains 16% of the variance relating to Samsung.
25. Interpretation 12: Review the quality of the map for each attribute (column)
(set Output to Diagnostics)
The map largely ignores the attributes Easy to use and Stylish.
26. Interpretation 13: Check interesting results using the raw data
So, Samsung and Easy to use may still be related. (Note that all the earlier slides had the caveat “usually” regarding interpretation.)
27. Interpretation 14: Check interesting results using the raw data
% | Fun | Worth what you pay for | Innovative | Good customer service | Stylish | Easy to use | High quality | High performance | Low prices
Apple 64% 49% 75% 51% 69% 59% 72% 66% 7%
Microsoft 22% 39% 43% 21% 20% 38% 46% 45% 7%
IBM 3% 6% 15% 4% 5% 7% 21% 23% 4%
Google 63% 40% 59% 27% 32% 58% 40% 42% 17%
Intel 4% 15% 19% 8% 5% 10% 21% 23% 3%
Hewlett-Packard 5% 21% 15% 13% 15% 19% 31% 25% 12%
Sony 25% 36% 28% 18% 36% 34% 48% 36% 12%
Dell 6% 15% 10% 12% 11% 17% 21% 18% 22%
Yahoo 14% 7% 9% 6% 3% 14% 7% 7% 11%
Nokia 5% 16% 11% 12% 12% 25% 22% 12% 25%
Samsung 29% 43% 50% 30% 52% 51% 49% 46% 21%
LG 16% 36% 28% 18% 31% 35% 38% 29% 34%
Panasonic 10% 27% 20% 13% 23% 27% 35% 24% 22%
None of these 14% 9% 5% 21% 10% 5% 4% 6% 31%
28. Interpretation 15: Use standardized residuals to help interpret the raw data
(in Q, the arrows and colors are based on the standardized residuals)
% | Fun | Worth what you pay for | Innovative | Good customer service | Stylish | Easy to use | High quality | High performance | Low prices
Apple 64% 49% 75% 51% 69% 59% 72% 66% 7%
Microsoft 22% 39% 43% 21% 20% 38% 46% 45% 7%
IBM 3% 6% 15% 4% 5% 7% 21% 23% 4%
Google 63% 40% 59% 27% 32% 58% 40% 42% 17%
Intel 4% 15% 19% 8% 5% 10% 21% 23% 3%
Hewlett-Packard 5% 21% 15% 13% 15% 19% 31% 25% 12%
Sony 25% 36% 28% 18% 36% 34% 48% 36% 12%
Dell 6% 15% 10% 12% 11% 17% 21% 18% 22%
Yahoo 14% 7% 9% 6% 3% 14% 7% 7% 11%
Nokia 5% 16% 11% 12% 12% 25% 22% 12% 25%
Samsung 29% 43% 50% 30% 52% 51% 49% 46% 21%
LG 16% 36% 28% 18% 31% 35% 38% 29% 34%
Panasonic 10% 27% 20% 13% 23% 27% 35% 24% 22%
None of these 14% 9% 5% 21% 10% 5% 4% 6% 31%
Low prices and Fun are the most differentiating attributes. As they are not correlated with each other, they make up the first two dimensions, squeezing Stylish off the map.
Samsung is not well differentiated on most of the attributes.
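Standardized (Pearson) residuals divide the gap between observed and expected counts by the square root of the expected count. A small self-contained sketch with a made-up counts table (not the survey data above):

```python
import math

def pearson_residuals(table):
    """Standardized (Pearson) residuals: (observed - expected) / sqrt(expected),
    with expected counts computed from the table's margins."""
    n = sum(map(sum, table))
    row_tot = [sum(r) for r in table]
    col_tot = [sum(r[j] for r in table) for j in range(len(table[0]))]
    out = []
    for i, row in enumerate(table):
        expected = [row_tot[i] * col_tot[j] / n for j in range(len(row))]
        out.append([(o - e) / math.sqrt(e) for o, e in zip(row, expected)])
    return out

# Made-up 2x2 counts where the diagonal is over-represented:
res = pearson_residuals([[30, 10],
                         [10, 30]])
```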
29. Interpretation 16: The aspect ratio needs to be 1 for correct interpretation
[Chart: Google NPS by Age map, drawn with unequal axis scales (vertical gridlines .008 apart, horizontal gridlines .100 apart)]
This map has an aspect ratio of 12.5 (.100 / .008). This means that vertical distances are shown to be 12.5 times bigger than is appropriate.
[Chart: the same Google NPS by Age map, redrawn with an aspect ratio of 1]
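The distortion is easy to quantify. Using the principal coordinates from the correspondence analysis output reproduced in the Editor’s Notes, a pure-Python sketch (the function and `y_stretch` parameter are mine, for illustration):

```python
import math

# Principal coordinates for the Google NPS x Age map
# (from the correspondence analysis output in the Editor's Notes)
coords = {
    "Detractor": (-0.09, 0.01), "Passive": (-0.19, -0.01), "Promoter": (0.20, 0.00),
    "18 to 34": (0.15, 0.01), "35 to 49": (0.06, -0.01), "50 over": (-0.27, 0.00),
}

def apparent_distance(a, b, y_stretch=1.0):
    """Distance as the eye reads it when the vertical axis is stretched
    by y_stretch relative to the horizontal axis."""
    (x1, y1), (x2, y2) = coords[a], coords[b]
    return math.hypot(x2 - x1, (y2 - y1) * y_stretch)

true_d = apparent_distance("Promoter", "18 to 34")          # aspect ratio 1
warped_d = apparent_distance("Promoter", "18 to 34", 12.5)  # the distorted map
# Stretching the vertical axis makes the same pair look much further apart.
```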
30. Overview
31. Overview
32. When to use correspondence analysis
• When we have a table with:
• At least two rows
• At least two columns
• No missing values
• No negative values
• Data on the same scale: Does the table cease to make sense if it is sorted
by any of its rows or columns?
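The first four checks are mechanical, so they can be coded. A minimal sketch (the function name is mine, not an API from the packages mentioned earlier):

```python
def suitable_for_ca(table):
    """Screen a table against the requirements above: at least two rows,
    at least two columns, no missing values, no negative values."""
    if len(table) < 2 or any(len(row) < 2 for row in table):
        return False
    for row in table:
        for cell in row:
            if cell is None or cell < 0:
                return False
    return True

# The "data on the same scale" question is a judgment call,
# so it is left to the analyst rather than automated here.
```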
33. Overview
34. Interpretation 17: The default normalization settings of most correspondence analysis plots
misrepresent the associations between the brands and attributes
Normalization | Brand (row) relationships | Attribute (column) relationships | Brand-attribute associations
Principal | Proximity | Proximity | Angles and lengths (but angles and lengths are misrepresented)
Row principal | Proximity | Proximity, adjusting for variance explained | Angles and lengths
Row principal (scaled) | Proximity | Proximity, adjusting for variance explained | Angles and lengths
Column principal | Proximity, adjusting for variance explained | Proximity | Angles and lengths
Column principal (scaled) | Proximity, adjusting for variance explained | Proximity | Angles and lengths
Symmetrical | ½ Proximity, ½ adjusting for variance explained | ½ Proximity, ½ adjusting for variance explained | Angles and lengths
35. Overview
36. Resources
• Correspondence Analysis in Practice, 3rd Edition, by Michael Greenacre (Chapman & Hall/CRC Interdisciplinary Statistics, 2017)
• 18 posts on various aspects of correspondence analysis on our blog:
www.displayr.com/blog
• The Q wiki: http://wiki.q-researchsoftware.com/wiki/Main_Page
• All the source code: https://github.com/Displayr/flipDimensionReduction
37. TIM BOCK PRESENTS
Q&A
www.q-researchsoftware.com/webinars
DIY Market Mapping
Using Correspondence Analysis
Editor's Notes
Hello and welcome to Automate or Die.
My name is Matt Steele and I’m part of Q’s London-based team.
Today, we’re exploring the topic of automation in quantitative research.
We’ll be looking at theoretical as well as practical expressions of automation.
As noted on the screen, you can submit questions as we go along.
I’ll do my best to answer them at the end, but if I don’t get to answer all of them, we’ll be collating and posting the Q&A’s on our website.
We’ll also be posting a recording of this webinar so you can rewatch any of the material later.
OK with that continuum in mind, let’s look at the first of 8 ways we can automate our work in quant research
%, Top 2 Boxes, Means
Correspondence Analysis (Traditional)
Inertia(s):
Canonical Correlation Inertia Proportion
Dimension 1 .169 .029 .997
Dimension 2 .010 .000 .003
Standard Coordinates: Google NPS
Dimension 1 Dimension 2
Detractor -.52 1.44
Passive -1.11 -1.09
Promoter 1.17 -.27
Principal Coordinates: Google NPS
Dimension 1 Dimension 2
Detractor -.09 .01
Passive -.19 -.01
Promoter .20 .00
Standard Coordinates: Age
Dimension 1 Dimension 2
18 to 34 .87 1.12
35 to 49 .37 -1.18
50 over -1.60 .35
Principal Coordinates: Age
Dimension 1 Dimension 2
18 to 34 .15 .01
35 to 49 .06 -.01
50 over -.27 .00
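A relationship worth noting in the output above: each principal coordinate is the standard coordinate scaled by that dimension’s canonical correlation (the singular value). This pure-Python check reproduces the printed principal coordinates (to the two decimals shown) from the standard ones:

```python
sv = [0.169, 0.010]  # canonical correlations per dimension, from the output above

standard = {  # standard coordinates from the output above
    "Detractor": (-0.52, 1.44), "Passive": (-1.11, -1.09), "Promoter": (1.17, -0.27),
    "18 to 34": (0.87, 1.12), "35 to 49": (0.37, -1.18), "50 over": (-1.60, 0.35),
}

# Principal coordinate = standard coordinate * singular value of the dimension
principal = {label: [round(x * s, 2) for x, s in zip(xy, sv)]
             for label, xy in standard.items()}
```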
OK so now time for Q&A.
Again, if I don’t get to answer all of these, we’ll put them on the website.