Probabilistic Latent Factor Induction and
 Statistical Factor Analysis

It is not surprising that the new Bayesian network paradigm prompts comparisons to more conventional methods. In the field of market research, for instance, long-established methods, such as factor analysis, remain in daily use today. Given that there exists a direct counterpart to factor analysis in the Bayesian network framework, we want to highlight similarities as well as fundamental differences. The objective of this paper is to present both methods side by side and thus help researchers to correctly compare and interpret the respective results. More specifically, we want to establish the semantic equivalents between the traditional statistical factor analysis approach and BayesiaLab’s method based on Bayesian networks, which we refer to as Probabilistic Latent Factor Induction.




A Comparison of Methods

Stefan Conrady, stefan.conrady@conradyscience.com
Dr. Lionel Jouffe, jouffe@bayesia.com
April 7, 2011

Conrady Applied Science, LLC - Bayesia’s North American Partner for Sales and Consulting
Table of Contents

Introduction
  About the Authors
    Stefan Conrady
    Lionel Jouffe
  Key Concepts from Information Theory
    Entropy
    Chain Rule Theorem
    Conditional Entropy
    Mutual Information
    Relative Entropy (Kullback-Leibler Divergence)
    Example 1
    Example 2
Comparison of Methods
  Approach
  Notation
  Key Terminology
  Data Set
  Probabilistic Latent Factor Induction with BayesiaLab
    Data Import
    Variable Clustering
    Latent Factor Induction
  Statistical Factor Analysis
    Factor Analysis with STATISTICA
  Conclusion
  References
  Contact Information
    Conrady Applied Science, LLC
    Bayesia SAS
  Copyright
Introduction

Bayesian networks have been gaining prominence among scientists over the recent decade, and the new insights generated by this powerful research approach can now be found in studies that circulate well beyond the academic communities. As a result, many practitioners and managerial decision-makers see more and more references to Bayesian networks in all kinds of scientific and business research, ranging from biostatistics to marketing analytics.

It is not surprising that the new Bayesian network paradigm prompts comparisons to more conventional methods. In the field of market research, for instance, long-established methods, such as factor analysis, remain in daily use today. Given that there exists a direct counterpart to factor analysis in the Bayesian network framework, we want to highlight similarities as well as fundamental differences. The objective of this paper is to present both methods side by side and thus help researchers to correctly compare and interpret the respective results. More specifically, we want to establish the semantic equivalents between the traditional statistical factor analysis approach and BayesiaLab’s method based on Bayesian networks, which we refer to as Probabilistic Latent Factor Induction.

Factor Analysis is a statistical method used to describe variability among observed variables in terms of a potentially lower number of unobserved variables called factors. It is possible, for example, that variations in three or four observed variables mainly reflect the variations in a single unobserved variable, or in a reduced number of unobserved variables. The observed variables can be seen as manifestations of abstract underlying (and unobserved) dimensions or (latent) factors.

Factor analysis originated in psychometrics, and is used in behavioral sciences, social sciences, marketing, product management, operations research, and other applied sciences that deal with a large number of variables in their data.

Probabilistic Latent Factor Induction is a workflow within the BayesiaLab software package, which has the same objective as traditional factor analysis, i.e. variable reduction, but works entirely within the framework of Bayesian networks and is based on principles derived from information theory.

It is important to point out that this comparison is not meant to favor one approach over the other (and to declare a winner and loser), although it is clearly in the authors’ interest to promote Bayesian networks in general and BayesiaLab in particular. Rather, this paper should serve as a reference for research practitioners and those who use research results in their decision-making processes, so they can correctly interpret insights generated with either method.
About the Authors

Stefan Conrady

Stefan Conrady is the cofounder and managing partner of Conrady Applied Science, LLC, a privately held consulting firm specializing in knowledge discovery and probabilistic reasoning with Bayesian networks. In 2010, Conrady Applied Science was appointed the authorized sales and consulting partner of Bayesia SAS for North America.

Stefan Conrady studied Electrical Engineering and has extensive management experience in the fields of product planning, marketing and analytics, working at Daimler and BMW Group in Europe, North America and Asia. Prior to establishing his own firm, he headed the Analytics & Forecasting group at Nissan North America.

Lionel Jouffe

Dr. Lionel Jouffe is cofounder and CEO of France-based Bayesia SAS. Lionel Jouffe holds a Ph.D. in Computer Science and has been working in the field of Artificial Intelligence since the early 1990s. He and his team have been developing BayesiaLab since 1999 and it has emerged as the leading software package for knowledge discovery, data mining and knowledge modeling using Bayesian networks. BayesiaLab enjoys broad acceptance in academic communities as well as in business and industry. The relevance of Bayesian networks, especially in the context of consumer research, is highlighted by Bayesia’s strategic partnership with Procter & Gamble, which has deployed BayesiaLab globally since …
Information Theory Background

Key Concepts from Information Theory

Before we proceed to the direct comparison of methods, it is important to establish several key concepts relating to the knowledge representation in Bayesian networks.

Entropy

The concept of entropy provides the underpinning for all structural learning and analysis algorithms in BayesiaLab. Entropy measures the uncertainty inherent in the distribution of a random variable.

The entropy H(X) of a random variable X is defined as:

$$H(X) = -\sum_{x \in X} p(x) \log_2 p(x),$$

where x stands for the states that variable X can take. Note that the logarithm is to base 2, so the value of entropy is expressed in bits (0/1).

An example can perhaps illustrate this: If variable X represents the outcome of a coin toss, X can have one of two states, Heads and Tails, i.e. the set of potential outcomes is X={Heads, Tails}. Given that the coin toss is fair, the probability of Heads and of Tails will be 0.5, i.e. p(Heads)=0.5 and p(Tails)=0.5.

We can now compute the entropy H(X_fair), based on these values:

$$
\begin{aligned}
H(X_{fair}) &= -p(\text{Heads}) \log_2 p(\text{Heads}) - p(\text{Tails}) \log_2 p(\text{Tails}) \\
&= -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 0.5 + 0.5 = 1 \text{ bit}
\end{aligned}
$$

This means our uncertainty prior to a fair coin toss is equivalent to an entropy value of 1 bit, which is the maximum entropy due to the uniform distribution of the variable with two states.

If we had a biased coin instead, with p(Heads)=0.7 and p(Tails)=0.3, it is intuitive to think that the uncertainty would be lower, as one state of the coin toss will be more probable. Indeed, computing the entropy H(X_biased) yields a lower value:

$$H(X_{biased}) = -0.7 \log_2 0.7 - 0.3 \log_2 0.3 = 0.881$$

To complete this idea, we can also plot H(X) as a function of the bias, p(Heads)=1-p(Tails), with p(Heads)∈[0, 1], i.e. ranging from impossible, p(Heads)=0, to certain, p(Heads)=1.
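The coin-toss entropies above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `entropy_bits` and the grid of bias values are assumptions made for this example and are not part of BayesiaLab.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution given as probabilities."""
    # States with p = 0 are skipped: p * log2(1/p) tends to 0 as p tends to 0.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

h_fair = entropy_bits([0.5, 0.5])    # fair coin from the text
h_biased = entropy_bits([0.7, 0.3])  # biased coin from the text

print(f"H(X_fair)   = {h_fair:.3f} bits")
print(f"H(X_biased) = {h_biased:.3f} bits")

# Entropy as a function of the bias p(Heads), on a coarse illustrative grid.
for p_heads in (0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0):
    h = entropy_bits([p_heads, 1.0 - p_heads])
    print(f"p(Heads) = {p_heads:.1f}  ->  H(X) = {h:.3f} bits")
```

The loop traces the same curve as the plot described in the text: the entropy is zero at both extremes and peaks for the fair coin.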
[Figure: entropy H(X) of a coin toss plotted against p(Heads)]

Clearly, anything other than a perfectly fair coin reduces the entropy and thus our uncertainty regarding the outcome of the coin toss.

Chain Rule Theorem

The chain rule for joint entropy states that the total uncertainty about the value of X and Y is equal to the uncertainty about X plus the (average) uncertainty about Y once you know X.

$$H(X,Y) = H(X) + H(Y \mid X)$$

The proof of this theorem follows:

$$
\begin{aligned}
H(X,Y) &= -\sum_{y \in Y} \sum_{x \in X} p(x,y) \log_2 p(x,y) \\
&= -\sum_{y \in Y} \sum_{x \in X} p(x,y) \log_2 \big( p(y \mid x)\, p(x) \big) \\
&= -\sum_{y \in Y} \sum_{x \in X} p(x,y) \log_2 p(y \mid x) - \sum_{y \in Y} \sum_{x \in X} p(x,y) \log_2 p(x) \\
&= -\sum_{y \in Y} \sum_{x \in X} p(x,y) \log_2 p(y \mid x) - \sum_{x \in X} p(x) \log_2 p(x) \\
&= H(Y \mid X) + H(X)
\end{aligned}
$$

Conditional Entropy

Perhaps the single most important concept for computations in BayesiaLab is conditional entropy. Conditional entropy refers to the entropy of a random variable when we have information on another variable.

The conditional entropy H(Y|X) is defined as:
$$
\begin{aligned}
H(Y \mid X) &= \sum_{x \in X} p(x)\, H(Y \mid X = x) \\
&= -\sum_{x \in X} p(x) \sum_{y \in Y} p(y \mid x) \log_2 p(y \mid x) \\
&= -\sum_{x \in X} \sum_{y \in Y} p(x,y) \log_2 p(y \mid x)
\end{aligned}
$$

The conditional entropy of Y conditional on X refers to the expected entropy of Y conditional on the value of X.

Mutual Information

The mutual information I(X,Y) measures how much (on average) the observation of random variable Y tells us about the uncertainty of X, i.e. by how much the entropy of X is reduced if we have information on Y.

$$I(X,Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$$

Note that the mutual information is a symmetric metric, which reflects the uncertainty reduction of X by knowing Y as well as of Y by knowing X.

Relative Entropy (Kullback-Leibler Divergence)

A closely related concept is the relative entropy, also referred to as the Kullback-Leibler Divergence (D_KL) or sometimes cross entropy. The Kullback-Leibler Divergence is a measure of the difference between two probability distributions p and q.

For probability distributions p and q of a discrete random variable X, their K-L divergence is defined to be

$$D_{KL}\big(p(X) \,\|\, q(X)\big) = \sum_{x \in X} p(x) \log_2 \frac{p(x)}{q(x)}$$

In words, it is the expected value of the logarithmic difference between the probability distributions p(X) and q(X). In contrast to the mutual information, the relative entropy is non-symmetric.

Example 1

We once again use tossing coins as an example. By default, we would expect that any given coin is fair and assume a model q(Heads)=q(Tails)=0.5. As it turns out, in repeated coin tosses, we observe a probability of p(Heads)=0.75 and of p(Tails)=0.25. We can now use the Kullback-Leibler Divergence to establish the “distance” or “distortion” between the originally assumed distribution q(x) and the observed distribution p(x).

$$
\begin{aligned}
D_{KL}\big(p(X) \,\|\, q(X)\big) &= \sum_{x \in X} p(x) \log_2 \frac{p(x)}{q(x)} \\
&= p(\text{Heads}) \log_2 \frac{p(\text{Heads})}{q(\text{Heads})} + p(\text{Tails}) \log_2 \frac{p(\text{Tails})}{q(\text{Tails})} \\
&= 0.75 \log_2 \frac{0.75}{0.5} + 0.25 \log_2 \frac{0.25}{0.5} = 0.188722
\end{aligned}
$$
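The chain rule, conditional entropy and mutual information defined earlier in this chapter can be checked numerically on a small joint distribution. The sketch below is purely illustrative: the joint probabilities are hypothetical values chosen for this example (they do not come from the paper), and the helper names are assumptions.

```python
import math

# Hypothetical joint distribution p(x, y) of two binary variables
# (illustrative values only, not taken from the paper).
joint = {
    ("x0", "y0"): 0.4, ("x0", "y1"): 0.1,
    ("x1", "y0"): 0.2, ("x1", "y1"): 0.3,
}

def marginal(joint, axis):
    """Marginal distribution of the variable at position `axis` of each key."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def entropy(dist):
    """Shannon entropy in bits of a distribution given as {state: probability}."""
    return sum(p * math.log2(1.0 / p) for p in dist.values() if p > 0)

def conditional_entropy(joint, given_axis):
    """Entropy of the other variable given the one at `given_axis`:
    -sum p(x, y) log2 p(other | given), with p(other | given) = p(x, y) / p(given)."""
    p_given = marginal(joint, given_axis)
    return sum(p * math.log2(p_given[key[given_axis]] / p)
               for key, p in joint.items() if p > 0)

h_x = entropy(marginal(joint, 0))
h_y = entropy(marginal(joint, 1))
h_xy = entropy(joint)
h_y_given_x = conditional_entropy(joint, given_axis=0)
h_x_given_y = conditional_entropy(joint, given_axis=1)

# Mutual information computed both ways; the two values should agree.
mi_from_x = h_x - h_x_given_y
mi_from_y = h_y - h_y_given_x

print(f"H(X,Y) = {h_xy:.4f}, H(X) + H(Y|X) = {h_x + h_y_given_x:.4f}")
print(f"I(X,Y) = {mi_from_x:.4f} = {mi_from_y:.4f}")
```

Both printed lines should show matching values, which is the chain rule and the symmetry of the mutual information, respectively.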
Example 2

For another illustration we use an example from the field of meteorology. More specifically, we look at the rainfall in two cities in the state of Victoria, Australia. We used daily rainfall data measured at Geelong Airport and at Melbourne Tullamarine Airport, which are approximately 80 kilometers apart, over the entire calendar year of 2010. Given the proximity of those locations, one would generally expect similar weather. Perhaps the Geelong weather isn’t reported in the Melbourne newspapers and so a traveler wants to use the Melbourne weather as a proxy. However, the actual weather station observations tell us that there is rain in Melbourne on 40.3% of the days, whereas Geelong sees rainfall on 47.4% of the days in the year.

We can now compute the Kullback-Leibler Divergence for these two distributions, where p_Geelong(x) stands for the Geelong and p_Melbourne(x) for the Melbourne rain probability distribution.

$$
\begin{aligned}
D_{KL}\big(p_{Geelong}(X) \,\|\, p_{Melbourne}(X)\big) &= \sum_{x \in X} p_{Geelong}(x) \log_2 \frac{p_{Geelong}(x)}{p_{Melbourne}(x)} \\
&= p_{Geelong}(\text{No Rain}) \log_2 \frac{p_{Geelong}(\text{No Rain})}{p_{Melbourne}(\text{No Rain})} + p_{Geelong}(\text{Rain}) \log_2 \frac{p_{Geelong}(\text{Rain})}{p_{Melbourne}(\text{Rain})} \\
&= 0.526 \log_2 \frac{0.526}{0.597} + 0.474 \log_2 \frac{0.474}{0.403} = 0.0148958 \text{ bits}
\end{aligned}
$$

$$
\begin{aligned}
D_{KL}\big(p_{Melbourne}(X) \,\|\, p_{Geelong}(X)\big) &= \sum_{x \in X} p_{Melbourne}(x) \log_2 \frac{p_{Melbourne}(x)}{p_{Geelong}(x)} \\
&= p_{Melbourne}(\text{Rain}) \log_2 \frac{p_{Melbourne}(\text{Rain})}{p_{Geelong}(\text{Rain})} + p_{Melbourne}(\text{No Rain}) \log_2 \frac{p_{Melbourne}(\text{No Rain})}{p_{Geelong}(\text{No Rain})} \\
&= 0.403 \log_2 \frac{0.403}{0.474} + 0.597 \log_2 \frac{0.597}{0.526} = 0.0147077 \text{ bits}
\end{aligned}
$$

BayesiaLab’s primary metric, the Arc Force, is directly proportional to the relative entropy and describes the strength of the directional link between two variables. More specifically, it describes the difference between the joint probability distributions with and without the particular arc.
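Both Kullback-Leibler examples can be recomputed with a short Python sketch. The function name `kl_divergence_bits` is an assumption for this illustration; the probabilities are the rounded values quoted in the text, so the last digits may differ slightly from the published figures.

```python
import math

def kl_divergence_bits(p, q):
    """D_KL(p || q) in bits for two distributions over the same set of states."""
    # States with p(x) = 0 contribute nothing to the sum.
    return sum(p[s] * math.log2(p[s] / q[s]) for s in p if p[s] > 0)

# Example 1: observed coin distribution p versus the assumed fair coin q.
p_coin = {"Heads": 0.75, "Tails": 0.25}
q_coin = {"Heads": 0.5, "Tails": 0.5}
d_coin = kl_divergence_bits(p_coin, q_coin)

# Example 2: daily rain probabilities for Geelong and Melbourne, as quoted above.
geelong = {"Rain": 0.474, "No Rain": 0.526}
melbourne = {"Rain": 0.403, "No Rain": 0.597}
d_geelong_melbourne = kl_divergence_bits(geelong, melbourne)
d_melbourne_geelong = kl_divergence_bits(melbourne, geelong)

print(f"Coin:                  {d_coin:.6f} bits")
print(f"Geelong || Melbourne:  {d_geelong_melbourne:.6f} bits")
print(f"Melbourne || Geelong:  {d_melbourne_geelong:.6f} bits")
```

The two rainfall values differ from each other, which illustrates that the relative entropy is not symmetric.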
Probabilistic Latent Factor Induction vs. Statistical Factor Analysis

Comparison of Methods

Approach

We believe that we can best facilitate a comparison of statistical factor analysis and latent factor induction by working through an example. We draw upon the familiar dataset from the previously presented case study from the perfume industry, hereafter referred to as the “Perfume Study.” [1]

We begin our tutorial with the Data Import process for BayesiaLab, although it is not meant to be at the core of the comparison. It is important, though, to spell out the data pre-processing steps in BayesiaLab, as they highlight some of the fundamental differences between probabilistic and statistical approaches.

Once the data preparation is complete, we first present the probabilistic latent factor induction workflow with BayesiaLab and then provide an example of a statistical factor analysis. For the statistical factor analysis, we will use STATISTICA 10 as the software platform, although most steps are fairly generic and could be reproduced with a number of other statistical software packages as well.

Notation

To clearly distinguish between natural language, software-specific functions and study-specific variable names, the following notation is used:

• BayesiaLab-specific functions, keywords, commands, etc., are capitalized and shown in bold type.
• Names of attributes, variables, nodes and factors are italicized.
• At appropriate points in the text, grey boxes highlight parallels between the two presented methods, with Probabilistic Latent Factor Induction on one side and Statistical Factor Analysis on the other.

Key Terminology

• “Observed” and “manifest” are used interchangeably and describe those random variables that have been measured by the researcher. Each variable measure…
• The terms “latent” and “unobserved” are used interchangeably in the context of hidden concepts or factors, which cannot be measured, but can potentially be extracted or induced. In our context, the term factor stands exclusively for latent variables. Consequently, the terms “factor”, “factor variable”, “latent variable” and “unobserved variable” are equivalent.

[1] Conrady and Jouffe (2010)
Data Set

The Perfume Study is based on a monadic consumer survey about a range of fragrances, which was conducted in France. In this example we use survey responses from 1,321 women, who have evaluated a total of 11 fragrances on a wide range of attributes:

• 27 ratings on fragrance-related attributes, such as “sweet”, “flowery”, “feminine”, etc., measured on a 1-to-10 scale.
• 12 ratings on projected imagery related to someone who would be wearing the respective fragrance, e.g. “is sexy”, “is modern”, measured on a 1-to-10 scale.
• 1 variable for Intensity, a measure reflecting the level of intensity, measured on a 1-to-5 scale.
• 1 variable for Purchase Intent, measured on a 1-to-6 scale.
• 1 nominal variable, Product, for product identification.
Probabilistic Latent Factor Induction with BayesiaLab

Data Import

To start the process with BayesiaLab, we first import the data set, which is formatted as a CSV file. [2] With Data>Open Data Source>Text File, we start the Data Import wizard, which immediately provides a preview of the data file.

The table displayed in the Data Import wizard shows the individual variables as columns and the survey responses as rows. There are a number of options available, e.g. for sampling. However, this is not necessary in our example given the relatively small size of the database.

Clicking the Next button prompts a data type analysis, which provides BayesiaLab’s best guess regarding the data type of each variable.

Furthermore, the Information box provides a brief summary regarding the number of records, the number of missing values, filtered states, etc. [3]

[2] CSV stands for “comma-separated values”, a common format for text-based data files.
[3] There are no missing values in our database and filtered states are not applicable in this example.
For this example, we will need to override the default data type for the Product variable, as each value is a nominal product identifier rather than a numerical scale value. We can change the data type by highlighting the Product variable and clicking the Discrete check box, which changes the color of the Product column to red.

We will also define Purchase Intent and Intensity as discrete variables, as the default number of states of these variables is already adequate for our purposes. [4]

The next screen provides options as to how to treat any missing values. In our case, there are no missing values, so the corresponding panel is grayed out.

Clicking the small upside-down triangle next to the variable names brings up a window with key statistics of the selected variable, in this case Fresh.

[4] The desired number of variable states is largely a function of the analyst’s …
The next step is the Discretization and Aggregation dialogue, which allows the analyst to determine the type of discretization that must be performed on all continuous variables. [5] For this survey, and given the number of observations, it is appropriate to reduce the number of states from the original 10 states (1 through 10) to a smaller number. One could, for instance, bin the 1-10 rating into low, mid and high, or apply any other arbitrary method deemed appropriate by the analyst.

The screenshot shows the dialogue for the Manual selection of discretization steps, which permits selecting binning thresholds by point-and-click.

[5] BayesiaLab requires discrete distributions for all …
Note: For choosing discretization algorithms beyond this example, the following rule of thumb may be helpful:

• For supervised learning, choose Decision Tree.
• For unsupervised learning, choose, in order of priority, K-Means, Equal Distances or Equal Frequencies.

For this particular example, we select Equal Distances with 5 intervals for all continuous variables. This was the analyst’s choice in order to be consistent with prior research.

Clicking Select All Continuous followed by Finish completes the import process, and the 49 variables (columns) from our database are now shown as blue nodes in the Graph Panel, which is the main window for network editing. By default, all variables are represented as nodes. This initial view represents a fully unconnected Bayesian network.
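To make the Equal Distances choice above concrete, the following Python sketch bins a 1-to-10 rating scale into 5 intervals of equal width. It illustrates the general idea of equal-width binning only; it is not BayesiaLab’s implementation, and the function name `equal_distance_bins` is an assumption for this example.

```python
def equal_distance_bins(values, n_bins=5):
    """Assign each value to one of n_bins equal-width intervals over [min, max].

    Returns a list of bin indices in the range 0 .. n_bins - 1.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate case: every value is identical, so one bin suffices.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # min(...) keeps the maximum value in the last bin instead of opening a new one.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ratings = list(range(1, 11))  # the 1-to-10 rating scale
bins = equal_distance_bins(ratings, n_bins=5)

for rating, b in zip(ratings, bins):
    print(f"rating {rating:2d} -> interval {b}")
```

With 10 equally spaced ratings and 5 intervals, each interval collects two adjacent rating values.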
In the above graph, two variables play a fundamentally different role. The values of Product represent categories and Purchase Intent is the overall target variable, i.e. the dependent variable of the Perfume Study. Thus both will be excluded from the factor generation process.

While correlation and covariance are the central measures for statistical factor analysis, learning Bayesian networks with BayesiaLab (and thus probabilistic factor induction) is based on measures from information theory, such as the Kullback-Leibler Divergence, which was introduced in the first chapter.

The Kullback-Leibler Divergence can be obtained after learning an initial Bayesian network with one of BayesiaLab’s unsupervised learning algorithms. “Unsupervised” implies that the learning algorithm searches for an overall representation of the joint distribution of the underlying data rather than the characterization of an individual target variable.

In our example, we use BayesiaLab’s EQ algorithm to obtain a Bayesian network.
As this view of the network is not easily readable, BayesiaLab has numerous built-in layout algorithms, of which the Force Directed Layout is perhaps the most commonly used. It can be invoked by View>Automatic Layout>Force Directed Layout or alternatively through the keyboard shortcut “p”.

The resulting network will look similar to the following graph.
[Figure: Completed Bayesian Network upon EQ Learning]

With the network established, we can now further examine the probabilistic relationships between the nodes, which are represented as arcs. [6] By selecting Analysis>Graphic>Arc Force, we can show the probabilistic strength of the arcs, which is visualized by the thickness of the arcs.

[6] “Arcs” are directed links or edges between nodes, which appear as arrows in the graph.
[Figure: Network with Arc Force]

The numeric values of the Arc Force can be shown by selecting View>Display Arc Comments. In the network shown below, the Arc Force values are presented in yellow boxes attached to each arc.
[Figure: Network with Arc Force]

Arc Force | Covariance
In BayesiaLab, Arc Force, a probabilistic measure based on the Kullback-Leibler Divergence, is the central measure for latent factor induction. In statistical factor analysis, covariance, correlation and, in particular, the covariance matrix play the equivalent role.
Variable Clustering

With Arc Force established as the key measure across the entire network, BayesiaLab can determine clusters of variables that are “close” in a probabilistic sense. This can be initiated from the menu via Analysis>Graphic>Variable Clustering.

The clustering algorithm is iterative and starts with those two variables whose connecting arc has the strongest Arc Force. The following sequence of screenshots illustrates this algorithm conceptually in “slow motion,” as the analyst would not see these individual steps in the actual workflow.

As a starting point, every manifest variable is treated as a distinct cluster and so we have 47 clusters. Using the Kullback-Leibler Divergence as a measure, the “closest” variables are then merged into one concept. As a result, we first obtain 46 clusters, then 45, etc., as shown in the array of dendrograms below. BayesiaLab proposes to conclude this algorithm upon finding 15 clusters. However, the analyst has the ability to override this automatic selection. As the choice of clusters appears to be generally compatible with our interpretation of the variable names, we accept this …
[Figure: Sequence of Dendrograms, from 47 clusters (47, 46, 45, 44, ...) down to 16 and 15 clusters]

Because of the importance of this process, we will also show it from another angle, i.e. by looking at sequential views of the network.
[Figure: Step 0 - 47 Clusters]

[Figure: Step 1 - 46 Clusters: Pleasure merged with Corresponds]

The strongest Arc Force exists between Pleasure and Corresponds and BayesiaLab will form an interim concept from them. The next-highest Arc Force then determines whether another variable is merged with the first concept or whether a new concept is created. In our case, Radiant and In Love are combined as a new concept.
[Figure: Step 2 - 45 Clusters: Radiant merged with In Love]

In the third step, we see Sensual and Romantic merged into a new latent concept, and so on.

[Figure: Step 3 - 44 Clusters: Sensual merged with Romantic]

Upon completion of this process, BayesiaLab forms variable/node clusters from these common concepts and color-codes them.
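The merging sequence illustrated above can be mimicked with a simple greedy procedure: sort the links by strength, and merge the clusters at either end of the strongest remaining link until the desired number of clusters is reached. The sketch below is an assumption-laden illustration, not BayesiaLab’s actual algorithm: the linkage rule (a single strong link is enough to merge two clusters) is a simplifying assumption, the function name is hypothetical, and the strength values are invented stand-ins for Arc Force. Only the variable names are taken from the text.

```python
def greedy_variable_clusters(variables, strengths, n_clusters):
    """Merge the most strongly linked variables first until n_clusters remain.

    `strengths` maps a pair (a, b) to a link strength (a stand-in for Arc Force).
    """
    parent = {v: v for v in variables}  # union-find forest, one tree per cluster

    def find(v):
        # Follow parent pointers to the cluster representative (with path halving).
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    remaining = len(variables)
    ranked = sorted(strengths.items(), key=lambda item: item[1], reverse=True)
    for (a, b), _strength in ranked:
        if remaining <= n_clusters:
            break
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the link joins two different clusters: merge them
            parent[root_b] = root_a
            remaining -= 1

    clusters = {}
    for v in variables:
        clusters.setdefault(find(v), []).append(v)
    return sorted(clusters.values())

variables = ["Pleasure", "Corresponds", "Radiant", "In Love", "Sensual", "Romantic"]
# Hypothetical link strengths, ordered to mirror the first three merge steps.
strengths = {
    ("Pleasure", "Corresponds"): 0.9,
    ("Radiant", "In Love"): 0.8,
    ("Sensual", "Romantic"): 0.7,
    ("Pleasure", "Radiant"): 0.3,
    ("In Love", "Romantic"): 0.2,
}

clusters = greedy_variable_clusters(variables, strengths, n_clusters=3)
for cluster in clusters:
    print(cluster)
```

Stopping at a target number of clusters corresponds to the analyst accepting or overriding the proposed cluster count; how BayesiaLab arrives at its own proposal is not described in the text and is not modeled here.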
[Figure: Network with Color-Coded Variable Clusters]

By clicking the Validate Clustering button, we can now formally fixate the new latent factor variables. The new latent factors are shown in the following table with their associated observed variables. By default, they are given the name “Factor” plus a numeric suffix.
Latent Factor Induction

Upon definition of the new latent factor variables, we now want to make them available for modeling purposes. Although these latent factors exist as new concepts and are conceptually linked to the manifest variables, the factors do not yet have any values or states. This will now happen in the Multiple Clustering process, which creates discrete states for each latent factor variable by performing data clustering over the linked manifest variables. More specifically, the states of each latent factor will be created in such a way that they best summarize the joint probability distribution defined by the manifest variables. Factor 0 and its linked manifest variables are shown below.

[Figure: Subnetwork for Factor 0]
The following Monitors display the marginal probability distributions of the variables associated with Factor 1, plus, highlighted in red, Factor 1 itself and its states. We can see that 5 states were created for Factor 1, labelled C1 through C5, and they each have an expected value, which is shown in parentheses. For instance, state C2 has an expected value of 9.21. That means, given that C2 is observed, the mean value of the manifest variables, weighted by their relation with C2, is equal to 9.21. In other words, C2 corresponds to high ratings with regard to those 5 dimensions.

By selecting specific states of Factor 0 in the Monitor Panel, we can see the conditional distributions of the manifest variables. The states C2 and C3 are displayed for reference below. They can be easily interpreted by looking at the associated values, e.g. state C2 appears to reflect high ratings of the manifest variables, whereas state C3 captures very low ratings.

A more general analysis of the relationships between manifest variables and latent factors can be obtained through Analysis>Reports>Relationship Analysis.

This chart summarizes the values of key clustering measures, such as the Kullback-Leibler Divergence, for every manifest variable associated with Factor 0. For reference only, it also includes Pearson’s Correlation Coefficient.
Relationship Analysis | Factor Loadings
This summary of clustering measures in the Relationship Analysis allows an interpretation that is very similar to what is provided with factor loadings.

It is also possible to visualize the mean values of the manifest variables (x-axis) along with the Mutual Information (y-axis, left panel) and the Standardized Total Effect (y-axis, right panel) for the latent factor variable.

Although we have now defined new factor variables, we have not yet seen the original matrix of survey responses in terms of the new factor variables. For instance, every respondent record has a value for Active, Fulfilled, Trust, etc., as these variables were observed and recorded in the survey, but how do we find the values (or states) of the new latent factors for each respondent record?

Actually, at the conclusion of the Multiple Clustering process, BayesiaLab has introduced the new factors into the original network. By using BayesiaLab’s imputation process, which is based on maximum likelihood, they were added as new nodes to the graph and also saved as new columns (or fields) to the database, …
[Figure: Latent Factors Introduced into Network]

Factor Induction | Saving Factor Scores
Introducing the new latent factors into the network is equivalent to adding the factor scores to the original observation matrix.

We can easily verify that each new factor has a value for each respondent record. We start Inference>Interactive Inference, which allows us to scroll through the survey records and view the values of any variable, including the values of the new latent factors.
For instance, survey record #0 is expressed as state C4 in terms of Factor 0. The states of the manifest variables are shown for reference.

Record #8, for example, is assigned to state C3.

Now we have the entire set of respondent records re-expressed in terms of 15 latent factors, which allows us to use them for all kinds of modeling.
Given the importance of latent factors for interpretation, we will assign descriptive labels to each of them. BayesiaLab can visually aid in this process by showing the latent factors and their relationships to the original manifest variables. This means we will simply learn a new network, which includes both factor variables and manifest variables.
[Figure: Network including Latent Factors and Manifest Variables]

The emerging network structure clearly lends itself to defining descriptive labels, which are applied to the factors in the following graph. [7]

[7] See Conrady and Jouffe (2010) for a more detailed explanation of the interpretation.
[Figure: Network including Latent Factors and Manifest Variables plus Factor Labels]

It is important to reiterate that the latent factors generated here are not orthogonal, which means that probabilistic relationships exist between the factors. For illustration purposes, we can highlight the latent factors and exclude the manifest variables from being displayed. In addition, the following graph also displays the Arc Force between each pair of connected latent factors, providing further confirmation that the latent factors are not orthogonal.
[Figure: Network with Latent Factors and Arc Force]
Statistical Factor Analysis

Perhaps the most common approach for extracting factors from a set of observed variables is Principal Components Analysis (PCA), and it is frequently considered a synonym for factor analysis. [8] For our purpose, we look at PCA as a prototypical tool for factor extraction, which lends itself to being compared to the latent factor induction with BayesiaLab presented earlier.

Principal Component Analysis (PCA) is a mathematical procedure that uses an orthogonal transformation to convert a set of observations, represented by matrix X, of possibly correlated variables into a set of values of uncorrelated variables called principal components, to be represented by a new matrix Y. The goal of this transformation is to minimize redundancy (measured by covariance) and to maximize the signal (measured by variance).

This transformation is defined in such a way that the first principal component has the highest possible variance, i.e. it accounts for as much of the variability in the data as possible. In turn, each succeeding component has the next-highest variance while being orthogonal to (uncorrelated with) the preceding components.

[Figure: Conceptual Illustration of Principal Component Vectors]

More formally, PCA creates a re-expression of the original data set on the basis of a new set of orthonormal vectors, replacing the original set of “naive” basis vectors, which resulted from the choice of measurements. [9]

In matrix notation, this can be expressed as follows:

$$PX = Y$$

[8] There are differences between PCA and the more general concept of factor analysis, but explaining those goes beyond the scope of this paper.
[9] Any observed variable automatically establishes a basis vector. Measuring 47 variables would thus result in a 47-dimensional coordinate system.
    • In the expression $PX = Y$, X is the matrix of original observations and P is a yet-to-be-determined orthonormal matrix that transforms X into Y. Geometrically, a general linear transformation rotates and stretches X; because P is orthonormal, it only rotates (or reflects) the data without stretching it. The rows of P, $\{p_1, \ldots, p_m\}$, are the new set of basis vectors for expressing the columns of X. Writing out the explicit dot products may better illustrate this.
      $$PX = \begin{pmatrix} p_1 \\ \vdots \\ p_m \end{pmatrix} \begin{pmatrix} x_1 & \cdots & x_n \end{pmatrix}$$
      $$Y = \begin{pmatrix} p_1 \cdot x_1 & \cdots & p_1 \cdot x_n \\ \vdots & \ddots & \vdots \\ p_m \cdot x_1 & \cdots & p_m \cdot x_n \end{pmatrix}$$
      This provides us with the general framework, but we have yet to determine what matrix P should be.
      This is the point where we need to introduce the concept of the covariance matrix $C_X$. With each variable (row) of X centered on its mean, it is defined as
      $$C_X = \frac{1}{n-1} X X^T$$
      • $C_X$ is a square and symmetric $m \times m$ matrix.
      • The elements on the diagonal of $C_X$ represent the variance of the observed variables.
      • The off-diagonal elements of $C_X$ represent the covariance between observed variables.
      As a result, $C_X$ captures the correlations between all possible pairs of observed variables.
      This directly relates to our objective of minimizing redundancy (measured by covariance) and maximizing the signal (measured by variance) of the target matrix Y. The optimum achievement of these goals would imply a diagonal covariance matrix of Y, i.e. one with all off-diagonal elements being zero. Our objective thus translates into stipulating that $C_Y$ must be diagonal. Fortunately, linear algebra provides several tools for diagonalizing a matrix.
      More formally, the objective becomes finding some orthonormal matrix P, where $Y = PX$, such that $C_Y$ is diagonal. The rows of P are then the principal components.
      Without providing further detail, the solution is:
      • The principal components of X are the eigenvectors of $XX^T$ (equivalently, of $C_X$), and they form the rows of P.
      • The ith diagonal value of $C_Y$ is the variance of X along $p_i$.
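The solution stated above can be checked numerically. The following is a minimal NumPy sketch under stated assumptions, not the routine used by STATISTICA: the function name `pca_eig` and the toy data are hypothetical and chosen only for illustration. It builds $C_X$ from a centered matrix X (variables in rows, observations in columns), takes the eigenvectors of $C_X$ as the rows of P, and confirms that the covariance matrix of $Y = PX$ is diagonal.

```python
import numpy as np

def pca_eig(X):
    """PCA via eigendecomposition for an m x n matrix X
    (m variables in rows, n observations in columns)."""
    Xc = X - X.mean(axis=1, keepdims=True)   # center each variable (row)
    n = Xc.shape[1]
    C_X = Xc @ Xc.T / (n - 1)                # m x m covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C_X)   # eigh is meant for symmetric matrices
    order = np.argsort(eigvals)[::-1]        # sort by variance, largest first
    eigvals = eigvals[order]
    P = eigvecs[:, order].T                  # rows of P are the principal components
    Y = P @ Xc                               # data re-expressed in the new basis
    return P, Y, eigvals

# Hypothetical toy data: 3 correlated variables, 200 observations.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.5, 0.2]])
X = mixing @ rng.standard_normal((3, 200))

P, Y, eigvals = pca_eig(X)
C_Y = Y @ Y.T / (Y.shape[1] - 1)             # covariance matrix of Y

print(np.round(C_Y, 6))                      # off-diagonal entries vanish
print(np.round(eigvals, 6))                  # diagonal of C_Y, i.e. variance along each p_i
```

The sketch uses `numpy.linalg.eigh` rather than a general eigenvalue solver because $C_X$ is symmetric. Other computational routes, such as a singular value decomposition of X, lead to the same components up to sign.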
    • Factor Analysis with STATISTICA
      Upon loading the survey data into STATISTICA, the respondent records are presented as a data table, with the variable names shown as column headers and case numbers shown as row headers.10 This represents our observation matrix X (in the notation used above, where variables are arranged as rows, X is the transpose of this table).
      [Figure: Observation Matrix X]
      As a starting point of the PCA process, we can display $C_X$, the covariance matrix of X:
      10 We will skip a detailed description of the data import steps, as they are fairly generic and we assume that readers use a wide array of statistical software.
    • [Figure: Covariance Matrix]
      Arc Force / Covariance: In BayesiaLab, Arc Force, a probabilistic measure based on the Kullback-Leibler Divergence, is the central measure for latent factor induction. In statistical factor analysis, covariance, correlation and, in particular, the covariance matrix play the equivalent role.
      As expected, there is a high amount of covariance, i.e. redundancy, between many of the observed variables. To get a better sense of the magnitude of these pairwise relationships, it helps to display the correlation matrix.
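The step from the covariance matrix to the correlation matrix is a simple rescaling: each covariance is divided by the product of the two standard deviations. A minimal sketch, assuming NumPy and a small hypothetical covariance matrix (the numbers below are illustrative and are not taken from the survey data):

```python
import numpy as np

def cov_to_corr(C):
    """Rescale a covariance matrix into a correlation matrix."""
    std = np.sqrt(np.diag(C))        # standard deviation of each variable
    return C / np.outer(std, std)    # r_ij = c_ij / (s_i * s_j)

# Hypothetical 3 x 3 covariance matrix, for illustration only.
C_X = np.array([[4.0,  2.4,  0.6],
                [2.4,  9.0, -1.5],
                [0.6, -1.5,  1.0]])

R = cov_to_corr(C_X)
print(np.round(R, 3))
```

Because correlations are bounded between -1 and 1, they are comparable across pairs of variables regardless of the measurement scale, which is what makes the magnitude of the pairwise relationships easier to judge.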
    • [Figure: Correlation Matrix]
      STATISTICA, like many other statistical software packages, has built-in routines that compute the matrix P of principal components automatically. There are several methods available for solving the PCA, including the approach using the eigenvectors of the covariance matrix, which was shown earlier.
      Regardless of the computational method used, the solution of the PCA provides as many eigenvalues as there are observed variables. When the analysis is based on the correlation matrix, i.e. on standardized variables, the sum of all eigenvalues equals the number of observed variables, in our case 47. This allows us to determine the share of variance attributable to each factor. For instance, the first factor has an eigenvalue of 29.6, which means that it accounts for 29.6/47 = 62.98% of the variance. Proceeding down the list, the eigenvalues decline in value, and so does their contribution to the total variance.
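The arithmetic behind the variance shares can be reproduced in a few lines. The sketch below assumes NumPy. Only the eigenvalue 29.6 and the count of 47 variables come from the text; the helper `variance_shares` and the random toy data are hypothetical. The second part illustrates why the eigenvalues sum to the number of variables when the correlation matrix is analyzed: its diagonal consists of ones, so its trace equals the number of variables.

```python
import numpy as np

def variance_shares(eigenvalues):
    """Share of the total variance attributable to each factor."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    return eigenvalues / eigenvalues.sum()

# Figures quoted in the text: first eigenvalue 29.6, 47 observed variables.
share_first = 29.6 / 47
print(f"{share_first:.2%}")

# Hypothetical toy data: 5 variables (rows), 100 cases (columns).
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 100))
X[1] += X[0]                              # make two variables correlated
R = np.corrcoef(X)                        # rows are treated as variables
eigvals = np.linalg.eigvalsh(R)[::-1]     # eigenvalues, largest first
shares = variance_shares(eigvals)

print(eigvals.sum())                      # equals the number of variables
print(np.round(shares, 3))
```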
    • [Figure: List of Eigenvalues]
      Now that we have a measure of how much variance each successive factor extracts, we can return to the question of how many factors to retain, as the overall objective of this exercise is variable reduction. The precise number of factors to be retained is ultimately a judgment call by the analyst, but factors with eigenvalues greater than 1 are typically considered candidates. A scree plot11 is typically used to illustrate the eigenvalues of the extracted factors. Sometimes this provides a visual indication of a natural cutoff point between higher and lower eigenvalues. Here such a distinction cannot be made easily, so we defer to the rule of thumb and retain the factors with eigenvalues greater than 1.
      11 The name “scree plot” is a metaphorical expression, as “scree” is the term for the accumulation of broken rock at the base of mountain cliffs. In the scree plot we want to distinguish the substantial eigenvalues from the “rubble” at the bottom.
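The rule of thumb described above is easy to express in code. A minimal sketch, assuming NumPy: the function name `retain_by_eigenvalue` is hypothetical, and the eigenvalue list is illustrative. Only the first value, 29.6, is quoted in the paper; the remaining values are invented so that six factors pass the threshold, matching the six factors retained later in the text.

```python
import numpy as np

def retain_by_eigenvalue(eigenvalues, threshold=1.0):
    """Count the factors whose eigenvalue exceeds the threshold."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    return int(np.sum(eigenvalues > threshold))

# Partial, hypothetical list of leading eigenvalues (only 29.6 is from the text).
leading_eigenvalues = [29.6, 3.1, 2.0, 1.6, 1.3, 1.1, 0.9, 0.7]

n_factors = retain_by_eigenvalue(leading_eigenvalues)
print(n_factors)
```

With standardized variables, each original variable contributes a variance of 1, so a factor with an eigenvalue below 1 explains less than a single observed variable does. This is the usual motivation for the threshold.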
    • [Figure: Scree Plot]
      In the next step we turn to the interpretation of the extracted factors. The table below shows the factor loadings, which are the correlations of each observed variable with the extracted factors.
      [Table: Factor Loadings]
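The definition of factor loadings as correlations can be verified directly. The sketch below assumes NumPy and unrotated principal components extracted from the correlation matrix; the toy data and variable names are hypothetical. It computes the loadings as eigenvectors scaled by the square roots of their eigenvalues and compares them with the correlations between each standardized variable and each component.

```python
import numpy as np

# Hypothetical toy data: 4 variables (rows) driven by 2 underlying signals.
rng = np.random.default_rng(1)
n = 500
base = rng.standard_normal((2, n))
X = np.vstack([base[0] + 0.3 * rng.standard_normal(n),
               base[0] + 0.3 * rng.standard_normal(n),
               base[1] + 0.3 * rng.standard_normal(n),
               base[1] + 0.3 * rng.standard_normal(n)])

# Standardize each variable, then form the correlation matrix.
Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)
R = Z @ Z.T / (n - 1)

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]                 # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Loadings: column j of the eigenvector matrix scaled by sqrt(eigenvalue j).
loadings = eigvecs * np.sqrt(eigvals)

# Direct check: correlation of variable i with the scores of component j.
scores = eigvecs.T @ Z
direct = np.array([[np.corrcoef(Z[i], scores[j])[0, 1] for j in range(4)]
                   for i in range(4)])

print(np.round(loadings, 3))
print(np.allclose(loadings, direct))
```

The sign of each column is arbitrary, since an eigenvector and its negative are equally valid. This is one reason why mostly negative loadings on a factor carry no substantive meaning by themselves.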
    • Given the high eigenvalue of factor 1, it is not surprising that many variables are highly correlated with it. In our particular case, however, this correlation is mostly negative, which may be counterintuitive for interpretation purposes.
      It is common practice to rotate factors in order to aid in the interpretation process. Intuitively speaking, the rotation is typically chosen in such a way that the principal factor, i.e. factor 1, aligns with what is commonly understood as the “positive x-axis.”
      Such a factor rotation, for which several methods exist, was also performed with STATISTICA, and the results appear in the table below. In addition, factor loadings higher than 0.7 are highlighted.
      [Table: Loadings on Rotated Factors]
      Relationship Analysis / Factor Loadings: The summary of clustering measures in BayesiaLab’s Relationship Analysis allows an interpretation that is very similar to what is provided by factor loadings.
      The analyst can now use these factor loadings to assign meaningful names to each factor. Some are quite obvious in their characterization, such as factor 3, which could be called “pleasant,” or factor 4, which is quite obviously “classical.” It is also interesting to see that only one variable, i.e. Intensity, has a high loading on factor 2. This implies that Intensity is perhaps a standalone concept, which has little redundancy. At the other extreme, many variables have high loadings on factor 1, which makes identifying a distinct concept more elusive.
      Without completing this interpretation process, we turn to the “reduction” part by introducing the extracted factors as variables into the original data set, i.e. replacing 47 variables with 6 variables. This is often referred to as “saving factor scores,” with the factor scores being the values of the original observations in the new coordinate system created by the extracted factors. Our observations now have new coordinates in a 6-dimensional coordinate system rather than in one with 47 dimensions.
      [Table: Factor Scores]
      Latent Factor Induction / Saving Factor Scores: Introducing the latent factors into the network is equivalent to adding the factor scores to the original observation matrix.
      We now have the ability to create a wide range of models, for instance, modeling Purchase Intent as a function of the 6 new factors. This will undoubtedly be easier to interpret than a model that includes all of the 47 original variables.
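The reduction step can be sketched as follows, assuming NumPy. This is an illustration under stated assumptions rather than the STATISTICA procedure: it uses unrotated principal component scores, whereas the paper saves scores for rotated factors, and both the random 47-variable data set and the `purchase_intent` vector are hypothetical stand-ins for the survey data. The helper name `factor_scores` is likewise an assumption.

```python
import numpy as np

def factor_scores(X, k):
    """Scores on the first k principal components.
    X is m x n (variables in rows, observations in columns); returns k x n."""
    Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)
    R = Z @ Z.T / (Z.shape[1] - 1)               # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1][:k]        # indices of the k largest eigenvalues
    P_k = eigvecs[:, order].T                    # k x m, rows are the retained components
    return P_k @ Z                               # new coordinates of each observation

# Hypothetical stand-in for the survey: 47 variables, 300 respondents.
rng = np.random.default_rng(2)
X = rng.standard_normal((47, 300))
scores = factor_scores(X, k=6)                   # 6 coordinates per respondent

# Hypothetical target variable, modeled as a linear function of the 6 scores.
purchase_intent = rng.standard_normal(300)
design = np.column_stack([np.ones(300), scores.T])   # intercept plus 6 factors
coef, *_ = np.linalg.lstsq(design, purchase_intent, rcond=None)

print(scores.shape)    # 6 rows instead of 47
print(coef.shape)      # intercept plus one coefficient per factor
```

A linear least-squares fit is used here only as the simplest possible example of a downstream model. Because the component scores are uncorrelated with each other, the coefficients can be interpreted one factor at a time.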
    • Probabilistic Factor Induction and Statistical Factor Analysis
      Conclusion
      Although fundamentally different in their framework, statistical factor analysis and probabilistic latent factor induction have many parallels, which lend themselves to direct comparative interpretation. Given these parallels, analysts familiar with either domain should find it easy to translate their research workflow from one framework into the other. Equally, end users of research results, who may be less familiar with the underlying computations, should be in a position to interpret the findings from both methods in a very similar way.
    • References
      Conrady, Stefan, and Lionel Jouffe. “Driver Analysis & Product Optimization, A Case Study from the Perfume Industry,” December 1, 2010.
      Cover, T. M., and J. A. Thomas. “Entropy, Relative Entropy and Mutual Information.” Elements of Information Theory (1991): 12–49.
      Kachigan, Sam Kash. Multivariate Statistical Analysis: A Conceptual Introduction. 2nd ed. Radius Press, 1991.
      MacKay, David J. C. Information Theory, Inference and Learning Algorithms. 1st ed. Cambridge University Press, 2003.
      Shlens, J. “A Tutorial on Principal Component Analysis.” Systems Neurobiology Laboratory, University of California at San Diego (2005).
      StatSoft, Inc. “Electronic Statistics Textbook,” 2011.
    • Contact Information
      Conrady Applied Science, LLC
      312 Hamlet’s End Way
      Franklin, TN 37067
      USA
      +1 888-386-8383
      info@conradyscience.com
      www.conradyscience.com
      Bayesia SAS
      6, rue Léonard de Vinci
      BP 119
      53001 Laval Cedex
      France
      +33(0)2 43 49 75 69
      info@bayesia.com
      www.bayesia.com
      Copyright
      © 2011 Conrady Applied Science, LLC and Bayesia SAS. All rights reserved.
      Any redistribution or reproduction of part or all of the contents in any form is prohibited other than the following:
      • You may print or download this document for your personal and noncommercial use only.
      • You may copy the content to individual third parties for their personal use, but only if you acknowledge Conrady Applied Science, LLC and Bayesia SAS as the source of the material.
      • You may not, except with our express written permission, distribute or commercially exploit the content. Nor may you transmit it or store it in any other website or other form of electronic retrieval system.