Sparsity Control for Robustness and Social Data Analysis
Georgios Giannakis, Professor and ADC Chair in Wireless Telecommunications, University of Minnesota, Department of Electrical & Computer Engineering (IEEE/EURASIP Fellow, IEEE SPS DL), Sparsity Control for Robustness and Social Data Analysis

Usage Rights

© All Rights Reserved

  • Good afternoon. Thanks for attending my talk on “Sparsity control….” Before starting, I would like to thank my advisor and the professors who kindly agreed to serve on my committee, as well as the grants that financially support our research.
  • The advancement of technology has led to the current `era of information’; over the years, data has transitioned from being scarce to superabundant. The ability to mine valuable information from these unprecedented volumes of data is envisioned to facilitate critical tasks such as limiting the spread of diseases, fighting crime, and identifying trends in social behavior and in financial markets. This information explosion phenomenon is also known as `Big Data’, and has been nicely characterized by Kenneth Cukier, an expert on the topic who writes for The Economist. The way he puts it, Big Data is clearly `BIG’ and fast. It is everywhere: this map was generated by drawing points at the locations where a picture was taken and uploaded to Flickr. It is revealing: when the city of Oakland publicly released police data, they did not expect that it would reveal confidential tactics. It can be productive, and it is smart, since we can do clever things with it like digitizing books while we go through security protocols. But unfortunately it is messy: data is unstructured, inconsistent, and noisy, and dealing with these so-termed outliers while learning from data is the main topic of this work.
  • A major source of data are social-computational systems (SoCS), that is, complex systems of people and computers interacting with one another. Contemporary examples include financial markets, online role-playing games, and Wikipedia, to name a few. SoCS involve human and computer “actors” whose individual capabilities, values, and preferences determine the resulting modes of social engagement. Hence, our vision is that preference measurement holds the keys to understanding and engineering socially intelligent computational systems. For that purpose, we will leverage the dual role that sparsity has to play in terms of complexity control through variable selection, and robustness to outliers (grossly inconsistent data). The preference measurement workhorse is conjoint analysis…
  • …which has been widely applied in marketing, healthcare, and psychology. In the context of marketing, an important question is how to optimally design and position new products. To answer it, conjoint analysis adopts the strategy of describing products by a set of attributes, their `parts’. The goal is to learn the consumer’s linear utility function from collected preference data. This way, CA seeks to understand how much each part is worth to the consumer in valuing a given product. Why conjoint analysis? Because the method studies the effect of all attributes jointly. A CA success story is that of Marriott’s Courtyard hotels: they performed a CA study to launch a new hotel chain for business travelers, and the product attributes considered were, e.g., room size, TV options, airport transportation, etc.
  • Let’s formalize the CA problem mathematically. There are I respondents (consumers) indexed by i, each of them rating J_i product profiles x_ij. The x_ij are p-dimensional vectors, where each entry corresponds to an attribute and can take one out of several values (e.g., the TV options attribute can take 0 = cable only, 1 = cable + on-demand movies). Different combinations of these attribute values define the profiles. Assuming linear utilities, the goal is to estimate a vector of partworths given conjoint data. Data collection formats typically fall into two classes: (M1) metric ratings, where the respondent provides a rating y_ij for each of the profiles presented to him/her; and (M2) choice-based data, where he/she picks among two options per question. Note that under (M1) CA boils down to a parameter (function) estimation problem, whereas under (M2) we deal with a binary classification problem. Traditionally, CA studies relied on few questions and attributes, and data was collected in controlled environments. But with the explosion of online SoCS-based preference data, one has to deal with inconsistent, corrupted, and irrelevant data.
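As a sketch (my transcription of the slides’ notation, with e_{ij} denoting nominal noise), the metric-ratings model (M1) reads:

```latex
y_{ij} \;=\; \mathbf{x}_{ij}^{\top}\mathbf{w} \;+\; e_{ij},
\qquad j = 1,\ldots,J_i,\quad i = 1,\ldots,I,
```

where \mathbf{w}\in\mathbb{R}^p collects the partworths to be estimated.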
  • Towards robustifying PM, consider for instance metric data, and focus on a specific consumer or class of homogeneous consumers so that the index i can be dropped. Our goal is to develop a robust partworth estimator that requires minimal assumptions on the outlier model. One such universal and very powerful estimator is least-trimmed squares (LTS), given by this optimization problem. Here r_[j]^2(w) is the j-th order statistic among the squared residuals, so the cost entails the sum of the \nu smallest squared residuals for each feasible w. Note that the largest J-\nu residuals are effectively not considered in the LTS cost. Given that LTS is nonconvex, we may wonder whether a minimizer exists. The answer is positive, and it is conceptually very simple to find the solution: just try all subsamples of size \nu, solve the LS problem for each, and choose the candidate solution that yields the smallest cost. Albeit simple, this solution procedure is intractable except for small problems, though near-optimal solvers are available.
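As a concrete illustration of the combinatorial procedure just described, here is a minimal brute-force LTS sketch (toy scale only; the function and variable names are mine, not from the talk):

```python
from itertools import combinations

import numpy as np

def lts_bruteforce(X, y, nu):
    """Exhaustive LTS: fit LS on every size-nu subsample, then keep the
    candidate w whose nu smallest squared residuals (over all data) sum lowest."""
    best_w, best_cost = None, np.inf
    for subset in combinations(range(len(y)), nu):
        idx = list(subset)
        w, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        r2 = np.sort((y - X @ w) ** 2)   # squared residuals, ordered
        cost = r2[:nu].sum()             # LTS cost: sum of the nu smallest
        if cost < best_cost:
            best_cost, best_w = cost, w
    return best_w
```

The number of subsamples grows combinatorially in J, which is exactly why this is intractable beyond small problems and motivates the sparsity-based reformulation that follows.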
  • An alternative to discarding large residuals is to introduce auxiliary outlier variables o_j, one per rated profile, such that o_j is nonzero only when rating j is an outlier. We arrive at an outlier-aware preference model, where nominal ratings obey (M1), and outliers something else. In this model, both w and o are unknown and have to be estimated. Also note that the model is inherently underdetermined. However, if outliers are sporadic, o is sparse (only a few of its entries are nonzero). A natural sparsity-aware estimator is given here, with an LS cost and an l_0-norm constraint on the outlier vector. Unfortunately, this problem is NP-hard.
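In symbols (my transcription of the slide; \tau caps the number of nonzero outlier variables), the outlier-aware model and its sparsity-constrained estimator read:

```latex
y_j = \mathbf{x}_j^{\top}\mathbf{w} + o_j + e_j, \qquad
\min_{\mathbf{w},\,\mathbf{o}}\ \sum_{j=1}^{J}\bigl(y_j - \mathbf{x}_j^{\top}\mathbf{w} - o_j\bigr)^2
\quad \text{s.t.}\quad \|\mathbf{o}\|_0 \le \tau .
```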
  • Up here, I’m just rewriting the last problem in Lagrangian form (P0), where the tuning parameter \lambda_0 controls the sparsity in \hat o. If \lambda_0 = 0, no sparsity is enforced, while for sufficiently large \lambda_0 we get the all-zero solution; certainly, we are interested in something in between. A key observation is that controlling the sparsity in \hat o amounts to controlling the number of outliers. Interestingly, we have established the following result, which connects (LTS) and (P0). Read statement and punch line.
  • Now, (P0) is NP-hard. As in compressive sampling, let’s relax |o|_0 with its closest convex approximant |o|_1. We arrive at (P1), which also encourages sparsity in o, now controlled by \lambda_1. After the relaxation, one may ponder whether (P1) still provides robust partworth estimates w. The answer is positive, since (P1) is equivalent to the problem down here after eliminating the o variables, of which Huber’s optimal estimator is a special case for a specific choice of \lambda_1. Different from Huber, though, we do not assume a contamination model. Now, how do we solve (P1)?
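The Huber connection can be checked numerically: for a fixed residual r, minimizing (r - o)^2 + \lambda_1 |o| over o (solved by soft-thresholding) reproduces the Huber loss with threshold \lambda_1/2. A small sketch, with function names of my choosing:

```python
import numpy as np

def p1_marginal(r, lam):
    """min over o of (r - o)^2 + lam*|o|; the minimizer is o = soft(r, lam/2)."""
    o = np.sign(r) * max(abs(r) - lam / 2, 0.0)
    return (r - o) ** 2 + lam * abs(o)

def huber(r, t):
    """Classical Huber loss with threshold t."""
    return r ** 2 if abs(r) <= t else 2 * t * abs(r) - t ** 2

# The two losses coincide for t = lam/2 (up to float rounding).
lam = 1.5
gap = max(abs(p1_marginal(r, lam) - huber(r, lam / 2))
          for r in np.linspace(-3.0, 3.0, 121))
```

Small residuals are fitted quadratically while large ones incur only a linear penalty, which is what makes the relaxed estimator robust.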
  • The following result will prove very useful; it says that in order to solve (P1), it suffices to solve a single instance of Lasso (with o as the optimization variable). This allows for data-driven methods to select \lambda_1, which require, e.g., a rough estimate of the number of outliers, or robust estimates of the nominal noise variance. These methods leverage Lasso solvers that can compute the whole path of solutions, which in this context we have baptized “robustification paths”. Typically, for learning tasks \lambda_1 is chosen via cross-validation, but such methods are challenged in the presence of outliers.
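To give a feel for a robustification path, here is a toy sweep over a \lambda_1 grid. For simplicity I solve each instance by alternating LS (over w) and soft-thresholding (over o) rather than a dedicated Lasso path solver; all names and parameter choices are mine:

```python
import numpy as np

def robustification_path(X, y, lams, n_iter=100):
    """For each lambda_1 on the grid, solve (P1) by block-coordinate descent
    (LS over w, soft-thresholding over o) and record the outlier estimates."""
    o = np.zeros(len(y))
    path = []
    for lam in lams:                     # warm-start from the previous lambda
        for _ in range(n_iter):
            w, *_ = np.linalg.lstsq(X, y - o, rcond=None)
            r = y - X @ w
            o = np.sign(r) * np.maximum(np.abs(r) - lam / 2, 0.0)
        path.append(o.copy())
    return np.array(path)
```

Sweeping \lambda_1 from large to small, ratings enter the outlier support one by one, which is the path a practitioner inspects when choosing the tuning parameter.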
  • Lasso estimates are biased towards 0, so how can we improve our outlier estimates? Well, if you remember, we started with (P0) and replaced |o|_0 with its closest convex approximation. But it is conceivable that replacing |o|_0 with a non-convex surrogate may yield a tighter approximation. Some non-convex regularizers were proposed to this end, such as…. It turns out that the right way to tackle the non-convex problem over here is to solve an iteratively-reweighted version of the Lasso in the previous slide. What we do in practice is to solve (P1) first, and use this solution as initialization for the reweighted algorithm, which we run for a few iterations. Through this two-step refinement, we have empirically observed significant improvements in terms of generalization capability, which is somewhat expected since weighted l_1-norm regularized estimators are known to attain the so-termed “oracle properties”.
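The two-step refinement could be sketched as follows: run (P1) first (unit weights), then rerun with weights from a log-sum-style surrogate, shrinking the penalty on entries already flagged as outliers. A hedged sketch, where \epsilon and the iteration counts are my choices:

```python
import numpy as np

def reweighted_outlier_reg(X, y, lam, n_reweight=3, n_inner=100, eps=1e-3):
    """(P1) via alternating LS / soft-thresholding, then iteratively
    reweighted l1: weight_j = 1 / (|o_j| + eps), cf. sum-of-logs surrogates."""
    o = np.zeros(len(y))
    weights = np.ones(len(y))               # first pass is plain (P1)
    for _ in range(n_reweight):
        for _ in range(n_inner):
            w, *_ = np.linalg.lstsq(X, y - o, rcond=None)
            r = y - X @ w
            o = np.sign(r) * np.maximum(np.abs(r) - lam * weights / 2, 0.0)
        weights = 1.0 / (np.abs(o) + eps)   # cheap penalty on confirmed outliers
    return w, o
```

Entries flagged in the first pass get a small weight, so their estimates are debiased toward the full residual, while unflagged entries get a huge weight and stay at zero.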
  • Here we show a numerical example comparing the performance of the sparsity-controlling estimator against RANSAC. For J=100, the 10-dimensional profiles are drawn from a standard normal distribution, whereas nominal data adheres to a linear Gaussian model with known variance. Outliers are Laplacian distributed, and contamination levels from 0 to 80% are examined. For RANSAC, the number of iterations is fixed to either 1000 or 10000. The plot compares both algorithms in terms of RMSE for different levels of contamination. Both methods yield accurate results for small percentages of outliers. As the number of outliers increases, RANSAC breaks down, resulting in large RMSEs with high variability. Note how the performance of the proposed scheme degrades gracefully beyond 40% contamination.
  • So far we have considered linear utilities. What about modeling interactions among the attributes within the partworth vector (e.g., price and brand)? Clearly these are not captured by the linear model. Still, these interactions are driven by complex mechanisms, and it may be tough to pick a good model. In this case, it may be prudent to let the data dictate the form of the utility function sought, adopting a nonparametric regression approach where the unknown u is only assumed to live in a space of smooth functions H. Without additional constraints, estimating u from finite data is ill-posed, and people have resorted to regularization techniques to control the complexity of the family of functions. Endowing H with the additional structure of a reproducing kernel Hilbert space, the variational estimator we have proposed towards robustifying nonparametric regression is shown down here, and…
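My transcription of such a variational estimator (the smoothness weight \mu and sparsity weight \lambda_1 are the tuning parameters):

```latex
\min_{u \in \mathcal{H},\, \mathbf{o}}\ \sum_{j=1}^{J}\bigl(y_j - u(\mathbf{x}_j) - o_j\bigr)^2
\;+\; \mu \,\|u\|_{\mathcal{H}}^2 \;+\; \lambda_1 \|\mathbf{o}\|_1 .
```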
  • … allow me to show you some results obtained after using thin-plate splines for robust Gaussian mixture estimation. The training data comprises 80 noisy samples of the true function up here, plus 20 outliers. The nonrobust estimate is severely distorted by the outliers, whereas the robust predictions look much better.
  • Another application we have explored is load curve data cleansing for power systems engineering and monitoring. The term load curve refers to the electric power consumption recorded periodically at points of interest (residential units, substations). Accurate load profiles are critical assets aiding operational decisions in the envisioned smart grid system. However, it is not uncommon to encounter data deviating from nominal models due to faulty meters, communication errors, and so on. Sport events may seem out of place in this list, but take a look at this plot. I would also like to point out that spline models were adopted recently for load curve prediction and denoising.
  • We have applied our robust nonparametric regression algorithm to cleanse real load curve data from a government building. We analyzed a 470-hour period (roughly 3 weeks), and you can see the original data on the left: energy consumption in kWh versus time in hour intervals. On the right we show the cleansed load profile superimposed on the original one. We indicate in red the samples identified as outliers, which in most cases correspond to the so-termed “Building operational…”. Different from [Chen et al ’10], no user intervention is required here to label the outliers.
  • For the remainder of the talk I will focus on principal component analysis (PCA), the workhorse of statistical learning from high-dimensional data. PCA provides LS optimal linear approximants in R^q to a data set in R^p, for q<p (dimensionality reduction). It is due to the LS criterion that PCA is non-robust to outliers, and the goal here is to robustify PCA by controlling outlier sparsity.
  • Let me provide some context, by mentioning some application domains tied to SoCS that have benefitted from PCA and low-rank modelling . First we have unveiling anomalies in IP networks, and intruders from video surveillance feeds. Broadly, both deal with the separation of a low-rank `background’ and a `foreground’ . In this image the `foreground’ corresponds to the people, possibly intruders. In addition, low rank preference models have been adopted for matrix completion in the context of collaborative filtering. With regards to robustifying PCA, early approaches considered robust estimation of the data covariance matrix, while M-estimators were adopted in computer vision applications as well. Recently, remarkable performance guarantees were obtained for the related problem of low-rank matrix recovery in the presence of sparse errors.
  • To have everyone on the same page and fix notation, allow me to give you a brief overview of the PCA formulations. We are given I p-dimensional vectors y_i, assumed centered, and we let \hat \Sigma denote the sample covariance matrix. For the minimum reconstruction error formulation of PCA, introduce a fat compression operator B (note that q&lt;p) and a tall reconstruction operator C. The objective is to find the best C, B such that the reconstruction error is minimized. There is also an equivalent maximum variance formulation, but let me focus on the last one, which is based on a generative model for the data. It is postulated that each y_i is equal to a tall matrix C times a vector of principal components w_i, up to noise. In words, each datum y_i is assumed to live in a low-dimensional subspace spanned by the columns of C. PCA thus seeks the best subspace C and the corresponding principal components by minimizing the LS approximation error. For all these equivalent formulations, the solution is given by….
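For the record, the standard PCA solution (dominant singular/eigen-directions of the centered data) takes only a few lines; names here are mine:

```python
import numpy as np

def pca(Y, q):
    """PCA via SVD of the centered data: returns the mean m, a p x q basis C
    of the principal subspace, and the I x q principal components W."""
    m = Y.mean(axis=0)
    Yc = Y - m
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    C = Vt[:q].T          # top-q right singular vectors span the subspace
    W = Yc @ C            # coordinates of each datum in that subspace
    return m, C, W
```

For data that truly lives in a q-dimensional subspace, m + W C^T reconstructs Y exactly; otherwise it is the LS-optimal rank-q approximation.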
  • Towards robustifying PCA, we start from the low-rank component analysis model and include outlier variables. We also estimate the mean vector m, since centering the data in the presence of outliers could be tricky. Note that this could be seen as a blind preference model with latent profiles (rows of C). The natural robust PCA estimator is (P2); note that sparsity is enforced at the group level through an l_2-norm regularization. This allows us to tell whether y_i (as a vector) is deemed an outlier or not. Some quick remarks: first, the l_0-norm counterpart of (P2) is related to (LTS PCA)….
  • Here is a batch alternating-minimization algorithm to solve problem (P2), which for convenience we write in matrix form. Matrix Y is I x p (it has the vectors y_i stacked as rows), and the 2,r norm denotes the sum of the row-wise l_2 norms. The detail of the iterations is not important; I just want to highlight that the C update is obtained….. In practice, typically 10 iterations suffice to converge to a stationary point of (P2).
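Since the update details were skipped, here is a hedged numerical sketch of one such batch scheme: a truncated SVD of the outlier-compensated data for the low-rank part, and row-wise vector soft-thresholding for O. The regularization value and iteration count are illustrative choices of mine:

```python
import numpy as np

def robust_pca(Y, q, lam, n_iter=10):
    """Alternating minimization in the spirit of (P2): the SVD of the
    outlier-compensated data gives the rank-q fit; rows of O then follow
    by vector (group) soft-thresholding of the residual rows."""
    I, p = Y.shape
    O = np.zeros((I, p))
    for _ in range(n_iter):
        D = Y - O                              # outlier-compensated data
        m = D.mean(axis=0)
        U, s, Vt = np.linalg.svd(D - m, full_matrices=False)
        L = (U[:, :q] * s[:q]) @ Vt[:q]        # rank-q component W C^T
        R = Y - m - L                          # residual rows
        norms = np.linalg.norm(R, axis=1)
        scale = np.maximum(1.0 - lam / (2.0 * np.maximum(norms, 1e-12)), 0.0)
        O = scale[:, None] * R                 # row-wise soft-thresholding
    return m, L, O
```

Rows whose residual norm stays below \lambda/2 are set exactly to zero, which is how entire respondents (rows) get declared nominal or outlying.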
  • Here are some results we obtained when robust PCA is used to separate the background from the foreground of a sequence of video frames . The training set consists of 520 images of size p=120x160. For q=10, both standard and robust PCA (with l_1-norm regularization on O) were adopted to build a low-rank background model of the scene captured by the camera . The first column shows three representative images, and the second one shows the PCA reconstruction. The presence of `ghostly’ artifacts is apparent, since PCA was unable to separate the people (foreground) from the background. The third column shows the robust PCA reconstruction of the background, while the fourth one shows the estimated outliers which capture the people.
  • Now, allow me to switch to a completely different application, which has to do with the robust measurement of the Big Five personality factors. The Big Five are five broad dimensions of personality traits, discovered through factor analysis and originally based on samples of WEIRD subjects, i.e., people who are Western, Educated, Industrialized, Rich, and Democratic. The Big Five are extraversion,… A widely utilized questionnaire to measure the Big Five is the Big Five Inventory (BFI). The BFI is short (only 44 questions), in each of which the subject is asked to rate, on a 1-to-5 scale, statements of the form….
  • We have tested our robust PCA algorithm on real BFI data from the Eugene-Springfield community sample. The dataset includes responses from 437 subjects, with p=44 items and q=5 (five dimensions). We solved (P2) over a grid of values of \lambda_2; the results are summarized in the leftmost plot, which shows the \lambda_2 index versus the row support of matrix O (black indicates subject n is not an outlier, white indicates it is). For large \lambda_2, as expected, all o_i are identically zero (no outliers). For example, subjects 418 and 204 are strong outlier candidates due to random responding, since they enter the model for relatively large values of \lambda_2. On the other hand, the responses of subjects 63 (all ‘3’s) and 249 (all ‘3’s except for five ‘4’s) are well modeled by the mean vector in the low-rank model, so a very small value of \lambda_2 is required for them to be flagged as outliers. We corroborated these observations by running robust PCA again on a corrupted data set, obtained from the original one by overwriting rows 151-160 with random item responses, and rows 301-310 with constant responses of value ‘3’. To reveal outliers, we select \lambda_2 such that the number of nonzero rows of \hat O is 100, and we plot the norms of the largest 40 outliers. One sees a clear break and declares the top 8 as outliers. These results were validated via suitable ‘inconsistency’ scores, which I can explain offline if someone is interested.
  • Real-time data and memory limitations motivate an online (adaptive) counterpart of the batch estimator (P2). We have thus proposed an exponentially-weighted subspace tracker, where the parameter \beta is a forgetting factor and n denotes a time index. For \beta&lt;1, past data are exponentially discarded, allowing operation in nonstationary environments. To obtain an online robust PCA algorithm, we adopt an alternating-minimization scheme whereby iterations coincide with the time instants of data acquisition. Leaving details aside, I just want to point out that the current outlier is estimated via thresholding, whereas the subspace C(n) is efficiently updated via recursive LS.
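A minimal sketch of such a tracker, assuming a zero-mean stream and a PAST-style recursive-LS subspace update; all parameter choices and names are mine, not the talk’s:

```python
import numpy as np

def online_robust_pca(Y, q, lam, beta=0.99):
    """Exponentially-weighted online sketch: per datum, components by LS,
    the outlier by vector soft-thresholding of the residual, and the
    subspace by recursive LS on the outlier-compensated stream."""
    p = Y.shape[1]
    rng = np.random.default_rng(0)
    C, _ = np.linalg.qr(rng.standard_normal((p, q)))   # random initial subspace
    R = 1e-2 * np.eye(q)                               # weighted w-correlation
    S = np.zeros((p, q))
    for y in Y:                                        # one pass over the stream
        w, *_ = np.linalg.lstsq(C, y, rcond=None)
        r = y - C @ w
        nr = max(np.linalg.norm(r), 1e-12)
        o = max(1.0 - lam / (2.0 * nr), 0.0) * r       # thresholded outlier
        R = beta * R + np.outer(w, w)
        S = beta * S + np.outer(y - o, w)
        C = S @ np.linalg.inv(R)                       # recursive-LS update
    return C
```

With \beta&lt;1, an isolated gross outlier perturbs the tracked subspace only briefly before the exponential window forgets it.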
  • We have performed computer simulations to corroborate the convergence of OR-PCA, and to compare its performance with a couple of non-robust subspace trackers in the literature. Nominal data was generated according to the stationary low-rank model described so far, and uniformly distributed outliers are introduced in the interval n=[1001-1005]. On the left we show the time evolution of the angle between the estimated and the true subspace (for the three schemes), and it is apparent that OR-PCA markedly outperforms the non-robust alternatives. A similar trend can be observed in the right plot, which depicts the reconstruction error as the figure of merit.
  • Under the same framework, we have also looked at the problem of robustifying kernel PCA. KPCA is a generalization of linear PCA, seeking principal components in a (possibly infinite-dimensional) feature space. Letting \phi be a nonlinear function mapping vectors y_i in input space to H, one can fit this outlier-aware model to the transformed data using the robust PCA estimator (P2). However, the challenge is that the quantities … are infinite-dimensional, and one cannot store them or operate with them directly. Interestingly, it turns out that one can overcome this hurdle through the kernel trick, and obtain a kernelized version of the robust PCA algorithm. Exploiting the connection between KPCA and spectral clustering, we obtained a robust version of the latter as a byproduct. Here is a toy example, where we want to cluster points in three concentric circles, in the presence of 5 outliers. On the right-hand side you can see a plot of the two dominant eigenvectors of the kernel matrix, and also the outliers that were identified exactly. After removing the outliers, K-means can easily recover the cluster structure. Note that a non-robust spectral clustering approach would have assigned the outliers to the green cluster.
  • Robust KPCA is used here to identify communities and outliers in a network of 115 NCAA football teams, by capitalizing again on its connection with spectral clustering. An edge connects two nodes (teams) if the teams played each other during the Fall ’00 season. The adopted kernel matrix is detailed up here, where A and D are respectively the adjacency and degree matrices of the network graph. We added a diagonal loading to ensure that K is SPD. Running robust kernel PCA yields robust estimates of the dominant eigenvectors and identifies outliers. The partitioned graph is shown on the left (nodes belonging to the same community (conference) are colored identically; outliers are shown as diamonds). In the right plot, the kernel matrix is shown after permuting its rows and columns to reveal the clustering structure found. The community structure of the traditional conferences was identified exactly. Five of the eight teams deemed outliers are `independent’ teams not belonging to any conference. Before concluding, I would like to give you a super quick overview of other work I have done during my PhD.
  • I have presented a sparsity-controlling outlier rejection framework for robust learning from high-dimensional data. A main contribution is to show that controlling sparsity in model residuals can be tantamount to controlling the number of outliers rejected. We have addressed several research issues…. We have explored diverse application domains… In terms of future work, we look forward to further experimental validation with the GPIPP personality data, which comprises on the order of 6 million BFI test cases. This data is WIRED (collected over the Internet) but not WEIRD, and it is certainly messy, so we envision fertile ground for the application of our robust algorithms.

Presentation Transcript

  • S p a r s it y C o n t r o l f o rR o b u s t n e s s a n d S o c ia l D ataAn a ly s is Georgios B. Giannakis D e p t. o f E C E , a n d D ig ita l T e c h . C e n t e r , U n iv. o f M in n e s o ta Ac k n o w l e d g m e n t s : D r . G o n z a l o M at e o s AF O S R M U R I FA9 5 5 0 - 0 - -0 5 6 7 1 1 2 n d G r e e k S P Ja m M a y 1 7 ,2 0 1 2 1
  • L e a r n in g f r o m “B ig D a t a ”D ata are widely availabl what is scarce is the abil to extract wisdom from them’` e, ity H a l V a r ia n , G o o g l e ’s c h ie f e c o n o m is t Fast B IG U b iq u it o u s P r o d u c t iv e R e v e a l in g K . C u k ie r , ``H a r n e s s in g t h e d a t a d e l u g e , N o v . Sm art Me s s y 2
  • S o c ia l -C o m p u t a t io n a l So y p s e t s e s m m s o f p e o p l e a n d c o m p u t e r s C m l x y te s T h e v is io n : p r e f e r e n c e m e a s u r e m e n t (P M ), a n a l y s is , man agemen t  U n d e r s t a n d a n d e n g in e e r S o C S T h e m e a n s : l e v e r a g e d u a l r o l e o f s p a r s it y  Complexity controlt h r o u g h v a r ia b l e s e l e c t io n  Robustness t o o u t l ie r s 3 View slide
  • C o n j o in t a n a l y s is M a r k e t in g , h e a l t h c a r e , p s y c h o l o g y [G r e e n - S r in iv a s a n ‘7 8 ] O p t im a l d e s ig n a n d p o s it io n in g o f n e w p r o d u c t s  S t r a t e g y : d e s c r ib e p r o d u c t s b y a s e t o f a t t r ib u t e s , `p a r t s ’ G o a l : l e a r n c o n s u m e r ’s u t il it y f u n c t io n f r o m preferen ce d ata  L in e a r u t il it ie s : ` ow much is each part worth?’ H S u c c e s s s t o r y [W in d e t a l ’8 9 ]  At t r ib u t e s : r o o m s iz e , T V o p t io n s , r e s t a u r a n t , t r a n s p o r t a t io n 4 View slide
  • M o d e l in g p r e l im in a r ie s R e s p o n d e n t s (e .g ., c o n s u m e r s )  R a t e p r o f il e s Ea c h c o m p r is e s a t t r ib u t e s L in e a r u t il it y : e s t im a t e v e c t o r o f partworths C o n j o in t d a t a c o l l e c t io n f o r m a t s ( M1) M e t r ic r a t in g s : ( M2) C h o ic e -b a s e d c o n j o in t d a t a : O n l in e S o C S -b a s e d p r e f e r e n c e d a t a e x p o n e n t ia l l y in c r e a s e s  In c o n s is t e n t /c o r r u p t e d /ir r e l e v a n t d a t a O u t l ie r s 5
  • R o b u s t if y in g P M  L e a s t -t r im m e d s q u a r e s [R o u s s e e u w ’8 7 ] (L T S )  is t h e -t h o r d e r s t a t is t ic a m o n g  r e s id u a l s d is c a r d e d Q: H o w s h o u l d w e g o a b o u t m in im iz in g n o n c o n v e x (L T S )? A: T r y a l l s u b s e t s o f s iz e , s o l v e , a n d p ic k t h e b e s t  S im p l e b u t in t r a c t a b l e b e y o n d s m a l l p r o b l e m s  N e a r o p t im a l s o l v e r s [R o u s s e e u w ’0 6 ], R AN S AC [F is c h l e r - B o l l e s ’8 1 ]G . M a t e o s a n d G . B . G ia n n a k is , ``R o b u s t c o n j o in t a n a l y s is b y c o n t r o l l in g 6o u t l ie r s p a r s it y , Proc.. of EUSIPCO , Au g ./S e p . 2 0 1 1 .
  • M o d e l in g o u t l ie r s O u t l ie r v a r ia b l e s s .t . o u t l ie r otherwi se  N o m in a l r a t in g s o b e y ( M1) ; o u t l ie r s s o m e t h in g e l s e -c o n t a m in a t io n [F u c h s ’9 9 ], B a y e s ia n m o d e l [Jin -R a o ’1 0 ]  Bo t h a n d u n k n o w n , t y p ic a l l y s p a r s e ! N a t u r a l (b u t in t r a c t a b l e )n o n c o n v e x e s t im a t o r 7
  • L T S a s s p a r s e r e g r e s s io n L a g r a n g ia n f o r m (P 0 )  T u n in g p a r a m e t e r c o n t r o l s s p a r s it y in number of o u t l ie r s Proposition 1: If s o l v e s (P 0 )w it h c h o s e n s .t . , then in (L T S ).  F o r m a l l y j u s t if ie s t h e p r e f e r e n c e m o d e l a n d it s e s t im a t o r (P 0 )  T ie s s p a r s e r e g r e s s io n w it h r o b u s t e s t im a t io n 8
  • Ju s t r e l a x ! (P 0 )is N P -h a r d relax e .g ., [T r o p p ’0 6 ] (P 1 )  (P 1 )c o n v e x , a n d t h u s e f f ic ie n t l y s o l v e d  R o l e o f s p a r s it y -c o n t r o l l in g is c e n t r a l Q: D o e s (P 1 )y ie l d r o b u s t e s t im a t e s ? A: Y a p ! H u b e r e s t im a t o r is a s p e c ia l c a s e wher e 9
  • L a s s o in g o u t l ie r s S u f f ic e s t o s o l v e L a s s o [T ib s h ir a n i’9 4 ] Proposition 2: M in im iz e r s o f (P 1 )a r e , D a t a -d r iv e n m e t h o d s t o s e l e c t  L a s s o s o l v e r s r e t u r n e n t ir e robustification path (RP) Co e f f s . D e c r e a s in g 10
  • N o n c o n v e x r e g u l a r iz a t io n N o n c o n v e x p e n a l t y t e r m s a p p r o x im a t e b e t t e r in (P 0 ) O p t io n s : S C AD [F a n -L i’0 1 ], o r s u m -o f -l o g s [C a n d e s e t a l ’0 8 ] It e r a t iv e l in e a r iz a t io n -m in im iz a t io n o f aro un d  In it ia l iz e w it h ,u s e  B ia s r e d u c t io n (c f . a d a p t iv e L a s s o [Z o u ’0 6 ]) 11
  • C o m p a r is o n w it h R AN S AC , i.i.d .N o m in a l :O u t l ie r s: 12
  • N o n p a r a m e t r ic In t e r a c t io n s a m o n g a t t r ib u t e s ? r N o tgc ar t er sd sy io n  e p u e b  D r iv e n b y c o m p l e x m e c h a n is m s h ard to mo d el If o n e t r u s t s d a t a m o r e t h a n a n y p a r a m e t r ic m o d e l  G o n o n p a r a m e t r ic r e g r e s s io n :  l iv e s in a s p a c e o f “s m o o t h ’’ f u n c t io n s Il l -p o s e d p r o b l e m  W o r k a r o u n d : r e g u l a r iz a t io n [T ik h o n o v ’7 7 ], [W a h b a ’9 0 ]  R K H S w it h k e r n e l an d n o rm 13
  • N o n p a r a m e t r ic b a s is Tr u e No n r o b u s t p u r su nuc tit n f io p r e d ic t io n s R o b u s t p r e d ic t io n s R e f in e d p r e d ic t io n s  E f f e c t iv e n e s s in r e j e c t in g o u t l ie r s is a p p a r e n tG . M a t e o s a n d G . B . G ia n n a k is , ``R o b u s t n o n p a r a m e t r ic r e g r e s s io n v ia s p a r s it yc o n t r o l w it h a p p l ic a t io n t o l o a d c u r v e d a t a c l e a n s in g , IEEE Trans. Signal Process.,14
  • Lo a d c u r v e d a t a co l d e u r vn : e l ein rg p o w e r c o n s u m p t io n r e c o r d e d L a c a e s c t ic p e r io d ic a l l y  R a l ia t y e d a e r s k c o m m u n ic l iz e s me a r t g r id v is io n [H a u s e r ’0 9 ] F e u l b l m e t t a : , e y t o r e a a t io n r r o r s  U n s c h e d u l e d m a in t e n a n c e , s t r ik e s , s p o r t e v e n t s B -s p l in e s f o r l o a d c u r v e p r e d ic t io n a n d d e n o is in g [C h e n e t a l ’1 0 ] U r u g u a y ’s p o w e r c o n s u m p t io n (M W ) 15
  • N o r t h W r it e d a t a E n e r g y c o n s u m p t io n o f a g o v e r n m e n t b u il d in g (’0 5 -’1 0 )  R o b u s t s m o o t h in g s p l in e e s t im a t o r , hours  O u t l ie r s : “B u il d in g o p e r a t io n a l t r a n s it io n s h o u l d e r p e r io d s ”  N o m a n u a l l a b e l in g o f o u t l ie r s [C h e n e t a l ’1 0 ] 16 Data: c o u r t e s y o f N o r t h W r it e E n e r g y G r o u p , p r o v id e d b y P r o f . V .
  • P r in c ip a l C o m p o n e n t An a l t a ts t ic a l )l e a r n in g f r o m h ig h -d im e n s io n a l Motivation: (s y is is d ata D N A m ic r o a r r a y T r a f f ic s u r v e il l a n c e P r in c ip a l c o m p o n e n t a n a l y s is (P C A)[P e a r s o n ’1 9 0 1 ]  E x t r a c t io n o f l o w -d im e n s io n a l d a t a s t r u c t u r e  D a t a c o m p r e s s io n a n d r e c o n s t r u c t io n  P C A is n o n -r o b u s t t o o u t l ie r s [Jo l l if f e ’8 6 ] O u r g o a l : r o b u s t if y P C A b y c o n t r o l l in g o u t l ie r s p a r s it y 17
  • O u r w o r k in c o n t e x t C o n t e m p o r a r y a p p l ic a t io n s t ie d t o S o C S  An o m a l y d e t e c t io n in IP n e t w o r k s [H u a n g e t a l ’0 7 ], [K im e t a l ’0 9 ]  V id e o s u r v e il l a n c e , e .g ., [O l iv e r e t a l ’9 9 ]  M a t r ix c o m p l e t io n f o r c o l l a b o r a t iv e f il t e r in g , e .g ., [C a n d e s e t a l ’0 9 ] Ro b u s t PCA  R o b u s t c o v a r ia n c e m a t r ix e s t im a t o r s [C a m p b e l l ’8 0 ], [H u b e r ’8 1 ]  C o m p u t e r v is io n [X u -Y u il l e ’9 5 ], [D e l a T o r r e -B l a c k ’0 3 ]  L o w -r a n k m a t r ix r e c o v e r y f r o m s p a r s e e r r o r s , e .g ., [W r ig h t e t a l ’0 9 ] 18
  • PCA formulations
    Training data
    Minimum reconstruction error
     Compression operator
     Reconstruction operator
    Maximum variance
    Component analysis model
    Solution:
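Both PCA criteria above (minimum reconstruction error and maximum variance) are solved by the dominant singular vectors of the centered data. A minimal NumPy sketch; all names are illustrative assumptions, not the slides' notation:

```python
import numpy as np

def pca(Y, q):
    """Return q principal directions U (p x q) and scores S (n x q)."""
    Yc = Y - Y.mean(axis=0)                    # center the data first
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    U = Vt[:q].T                               # max-variance directions
    S = Yc @ U                                 # compressed representation
    return U, S

# Reconstruction from the compressed scores: Y.mean(axis=0) + S @ U.T
```

The compression operator is `Yc @ U` and the reconstruction operator is `S @ U.T`; with orthonormal `U` these are the optimal linear pair for the minimum-reconstruction-error criterion.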
  • Robustifying PCA
    Outlier-aware model
     Interpretation: blind preference model with latent profiles (P2)
     -norm counterpart tied to (LTS PCA)
     (P2) subsumes optimal (vector) Huber
     -norm regularization for entry-wise outliers
    G. Mateos and G. B. Giannakis, "Robust PCA as bilinear decomposition with outlier sparsity regularization," IEEE Trans. Signal Process., 2012 (to appear).
  • Alternating minimization (P2)
     Subspace update: SVD of the outlier-compensated data
     Outlier update: row-wise vector soft-thresholding
    Proposition 3: Alg. 1's iterates converge to a stationary point of (P2).
  • Video surveillance
     Figure panels: Original | PCA | Robust PCA | `Outliers'
    Data: http://www.cs.cmu.edu/~ftorre/
  • Big Five personality traits
    Five dimensions of personality traits [Goldberg '93], [Costa-McRae '92]
     Discovered through factor analysis
     WEIRD subjects
    Big Five Inventory (BFI)
     Measures the Big Five
     Short questionnaire (44 items)
     Rate 1-5, e.g., `I see myself as someone who… …is talkative', `…is full of energy'
    Handbook of personality: Theory and research, O. P. John, R. W. Robins, and L. A. Pervin, Eds. New York, NY: Guilford Press, 2008.
  • BFI data
    Eugene-Springfield community sample [Goldberg '08]
     subjects, item responses, factors
     Robust PCA identifies 8 outlying subjects
     Validated via `inconsistency' scores, e.g., VRIN [Tellegen '88]
    Data: courtesy of Prof. L. Goldberg, provided by Prof. N. Waller
  • Online robust PCA
    Motivation: real-time data and memory limitations
    Exponentially-weighted robust PCA
     At time , do not re-estimate
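One way the exponentially-weighted idea can be sketched: keep a forgetting-factor covariance, outlier-compensate only the current sample against the running subspace, and never revisit past estimates. This is a hypothetical minimal variant for illustration, not the talk's exact algorithm; `beta` is the forgetting factor and `lam` the sparsity threshold:

```python
import numpy as np

def online_robust_pca(stream, q, lam, beta=0.95):
    """Track a q-dim subspace from streaming samples with forgetting."""
    C, U = None, None
    for y in stream:
        if C is None:
            C = np.outer(y, y)                   # initialize covariance
        else:
            # Compensate only the current sample; past ones are final
            r = y - U @ (U.T @ y)                # residual off the subspace
            nr = max(np.linalg.norm(r), 1e-12)
            o = r * max(0.0, 1.0 - lam / (2.0 * nr))
            yc = y - o
            C = beta * C + np.outer(yc, yc)      # exponential forgetting
        w, V = np.linalg.eigh(C)                 # refresh the subspace
        U = V[:, -q:]                            # top-q eigenvectors
    return U
```

Memory cost is a single covariance matrix, independent of the stream length, which is the point of the exponentially-weighted formulation.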
  • Online PCA in action
     Nominal:
     Outliers:
  • Robust kernel PCA
    Input space → feature space
    Kernel (K)PCA [Scholkopf '97]
     Challenge: feature space is -dimensional
     Kernel trick:
    Related to spectral clustering
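The kernel trick lets (K)PCA operate on the n x n Gram matrix instead of the possibly huge (or infinite-dimensional) feature vectors: center the Gram matrix in feature space and eigendecompose it. A minimal sketch; the RBF kernel and all names are assumptions for illustration:

```python
import numpy as np

def kernel_pca(X, q, gamma=1.0):
    """Project the n training points onto the top-q kernel PCs."""
    n = X.shape[0]
    sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq)                 # RBF Gram matrix (n x n)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                          # centering in feature space
    w, V = np.linalg.eigh(Kc)
    idx = np.argsort(w)[::-1][:q]           # top-q eigenpairs
    alphas = V[:, idx] / np.sqrt(np.maximum(w[idx], 1e-12))
    return Kc @ alphas                      # projections of training points
```

The link to spectral clustering is visible here: both methods extract leading eigenvectors of a (centered or normalized) affinity matrix.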
  • Unveiling communities
    Network: NCAA football teams (nodes), Fall '00 games (edges)
     teams, kernel ; ARI = 0.8967
     Identified exactly: Big 10, Big 12, ACC, SEC, Big East
     Outliers: independent teams
    Data: http://www-personal.umich.edu/~mejn/netdata/
  • Concluding summary
     Control sparsity in model residuals for robust learning
    Research issues addressed
     Sparsity control for robust metric and choice-based PM
     Kernel-based nonparametric utility estimation
     Robust (kernel) principal component analysis
     Scalable distributed real-time implementations
    Application domains
     Preference measurement and conjoint analysis
     Psychometrics, personality assessment
     Video surveillance
     Social and power networks
     Experimental validation with GPIPP personality ratings (~6M)
    Gosling-Potter Internet Personality Project (GPIPP) - http://www.outofservice.com
  • Robustification paths
    Lasso path of solutions is piecewise linear
     LARS returns the whole RP [Efron '03]
     Same cost as a single LS fit
    Lasso is simple in the scalar case
     Coordinate descent is fast! [Friedman '07]
     Exploits warm starts, sparsity
     Other solvers: SpaRSA [Wright et al '09], SPAMS [Mairal et al '10]
    Leverage these solvers: consider a grid of values of the tuning parameter
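The point about the scalar case is that each one-dimensional lasso update has a closed-form soft-threshold solution, which makes cyclic coordinate descent cheap, and warm starts along a decreasing grid of penalties give the whole path. A minimal sketch with illustrative names:

```python
import numpy as np

def soft(z, t):
    """Scalar lasso in closed form: the soft-threshold operator."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, beta=None, n_iter=100):
    """min_b 0.5*||y - X b||^2 + lam*||b||_1 via coordinate descent."""
    b = np.zeros(X.shape[1]) if beta is None else beta.copy()
    col_sq = np.sum(X ** 2, axis=0)
    r = y - X @ b
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r += X[:, j] * b[j]               # remove coordinate j
            b[j] = soft(X[:, j] @ r, lam) / col_sq[j]
            r -= X[:, j] * b[j]               # add it back
    return b

def lasso_path(X, y, lams):
    """Warm-started path over a decreasing grid of penalty values."""
    b, path = None, []
    for lam in sorted(lams, reverse=True):
        b = lasso_cd(X, y, lam, beta=b)       # warm start from previous lam
        path.append((lam, b.copy()))
    return path
```

Warm starts matter because consecutive grid points have nearby solutions, so only a few sweeps are needed per penalty value.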
  • Selecting the tuning parameter
    Relies on the RP and knowledge of the data model
     Number of outliers known: from the RP, obtain the range of values yielding that many nonzero outliers; discard the (known) outliers, and use CV to determine the final value
     Variance of the nominal noise known: from the RP, for each value on the grid, find the sample variance of the residuals; the best value is the one matching the known noise variance
     Variance of the nominal noise unknown: replace the above with a robust estimate, e.g., the median absolute deviation (MAD)
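The MAD-based robust scale estimate mentioned above is a one-liner; the 1.4826 factor makes it a consistent estimator of the standard deviation under Gaussian nominal noise, and it barely moves when a small fraction of residuals are gross outliers:

```python
import numpy as np

def mad_std(r):
    """Robust estimate of the nominal noise std from residuals r."""
    return 1.4826 * np.median(np.abs(r - np.median(r)))
```

This is what makes the third selection rule usable when neither the outlier count nor the noise variance is known a priori.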