Your SlideShare is downloading. ×
Statistically adaptive learning for a general class of                                                                    ...
very different from that acquired previously or situations      respect to a fixed (optimal) parameter θ∗ (in practice wewh...
receive a convex loss function φt as in (2). Typically,         Assuming invertibility of X T X, it is well known that the...
Up to time t, the standard solution to the least squaresproblem on the data {X[0,t], Y[0,t] } is then                    b...
tistically compared so as to make sure that the model is       mention that the number of features generated is on thenot ...
Table 1: Model 1 Results For Different Learning Mech-        Table 2: Model 2 Results For Different Learning Mech-anisms (...
fields of machine learning, political micro-targeting and     as MSNBC, The New York Times, The Washington Postpredictive a...
Upcoming SlideShare
Loading in...5
×

Statistically adaptive learning for a general class of..

670

Published on

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
670
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
12
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Transcript of "Statistically adaptive learning for a general class of.."

  1. 1. Statistically adaptive learning for a general class of cost functions (SA L-BFGS)∗ Stephen Purpura †§ Dustin Hillard § Mark Hubenthal ‡§ spurpura@contextrelevant.com dhillard@contextrelevant.com mhubenthal@contextrelevant.com § § Jim Walsh Scott Golder Scott Smith § jwalsh@contextrelevant.com sgolder@contextrelevant.com ssmith@contextrelevant.comarXiv:1209.0029v3 [cs.LG] 5 Sep 2012 Abstract where x(i) ∈ Rl is the feature vector of the ith exam- ple, y (i) ∈ {0, 1} is the label, θ ∈ Rl is the vector We present a system that enables rapid model of fitting parameters, l is a smooth convex loss function experimentation for tera-scale machine learn- and S a regularizer. Some of the more popular methods ing with trillions of non-zero features, billions for determining θ include linear and logistic regression, of training examples, and millions of param- respectively. The optimal such θ corresponds to a linear eters. Our contribution to the literature is a predictor function pθ (x) = θT x that best fits the data new method (SA L-BFGS) for changing batch in some appropriate sense, depending on l and S. We L-BFGS to perform in near real-time by using remark that such cost functions in (1) have a structure statistical tools to balance the contributions of which is naturally decomposable over the given training previous weights, old training examples, and examples, so that all computations can potentially be run new training examples to achieve fast conver- in parallel over a distributed environment. gence with few iterations. The result is, to our knowledge, the most scalable and flexible However, in practice it is often the case that the model linear learning system reported in the literature, must be updated accordingly as new data is acquired. beating standard practice with the current best That is, we want to answer the question of how θ changes system (Vowpal Wabbit and AllReduce). Using in the presence of new training examples. One naive the KDD Cup 2012 data set from Tencent, Inc. approach would be to completely redo the entire data we provide experimental results to verify the analysis process from scratch on the larger data set. The performance of this method. current fastest method in such a case utilizes the L-BFGS quasi-Newton minimization algorithm with AllReduce 1 Introduction along a distributed cluster, (Agarwal et al., 2012). The other extreme is to apply the method of online learning, The demand for analysis and predictive modeling derived which considers one data point at a time and updates the from very large data sets has grown immensely in recent parameters θ according to some form of gradient descent, years. One of the big problems in meeting this demand is see (Langford et al., 2009), (Duchi et al., 2010). How- the fact that data has grown faster than the availability of ever, on the tera-scale, neither approach is as appealing or raw computational speed. As such, it has been necessary as fast as we can achieve with our method. We describe in to use intelligent and efficient approaches when tackling a bit more detail these recent approaches to solving (1) in the data training process. Specifically, there has been Section 3. For completeness, we also refer the reader to much focus on problems of the form recent work relating to large-scale optimization contained m in (Schraudolph et al., 2007) and (Bottou, 2010). min l(θT x(i) ; y (i) ) + λS(θ), (1) θ∈Rl i=1 Our approach in simple terms lies somewhere be- ∗ tween pure L-BFGS minimization (widely accepted as The material of this work is patent pending. † the fastest brute force optimization algorithm whenever in absentia at Department of Information Science, Cornell University, Ithaca, NY, 14850 the function is convex and smooth) and online learning. ‡ recent graduate of Department of Mathematics, University While L-BFGS offers accuracy and robustness with a of Washington, Seattle, WA, 98195 relatively small number of iterations, it fails to take di- § Context Relevant, Inc., Seattle, WA, 98121 rect advantage of situations where the new data is not
  2. 2. very different from that acquired previously or situations respect to a fixed (optimal) parameter θ∗ (in practice wewhere the new data is extremely different than the old don’t know the true value of θ∗ ) bydata. Certainly, one can initiate a new optimization job ton the larger data set with the parameter θ initialized to Rφ (t) = rφ (θ∗ , t) := [φs (θs ) − φs (θ∗ )] (3)the previous result. But we are left with the problem of s=0optimizing over increasingly larger training sets at one ttime. Similarly, online learning methods only consider = [fs (θs ) + λS(θs ) − fs (θ∗ ) − λS(θ∗ )].one data point at a time and cannot reasonably change s=0the parameter θ by too much at any given step without An effective algorithm is then one in which the sequencerisk of severely increasing the regret. It also cannot typ- tf {θt }t=0 suffers sub-linear regret, i.e., Rφ (t) = o(t).ically reach as small of an error count as that of a global As mentioned earlier, there has been much work donegradient descent approach. On the other hand, we will regarding how to solve (1) with a variety of meth-show that it is possible to combine the advantages of ods. Before proceeding with a basic overview of theboth methods: in particular the small number of iterations two most popular approaches to large-scale machineand speed of L-BFGS when applied to reasonably sized learning in Section 3, it is important to understand thebatches, and the ability of online learning to “forget” pre- underlying assumptions and implications of the pre-vious data when the new data has changed significantly. vious body of work. In particular, we mention the The outline of the paper is as follows. In Section 2 work of L´ on Bottou in (Bottou, 2010) regarding large- ewe describe the general problem of interest. In Sec- scale optimization with stochastic gradient descent, andtion 3 we briefly mention current widely used methods (Schraudolph et al., 2007) regarding stochastic online L-of solving (1). In Section 4 we outline the statistically BFGS (oL-BFGS) optimization. Such works and othersadaptive learning algorithm. Finally, in Section 5 we have demonstrated that with a lot of randomly shuffledbenchmark the performance of our two related methods data, a variety of methods (oL-BFGS, 2nd order stochas-(Context Relevant FAST L-BFGS and SA L-BFGS, re- tic gradient descent and averaged stochastic gradient de-spectively) against Vowpal Wabbit - one of the fastest scent) can work in fewer iterations than L-BFGS because:currently available routines which incorporates the workof (Agarwal et al., 2012). We also include the associated (a) Small data learning problems are fundamentally dif-Area Under Curve (AUC) rating, which roughly speak- ferent from large data learning problems;ing, is a number in [0, 1], where a value of 1 indicates (b) The cost functions as framed in the literature haveperfect prediction by the model. well suited curvature near the global minimum.2 Background and Problem Setup We remark that the key problem for all quasi-Newton based optimization methods (including L-BFGS) hasIn this paper the underlying problem is as follows. Sup- been that noise associated with the approximation pro-pose we have a sequence of time-indexed data sets cess – with specific properties dependent on each learning (i) (i){Xt , Yt } where Xt = {xt }mt , Yt = {yt }mt , t = i=1 i=1 problem – causes adverse conditions which can make L-0, 1, . . . , tf is the time index, and mt ∈ N is the batch BFGS (and its variants) fail. However, the problem ofsize (typically independent of t). Such data is given se- noise leading to non-positive curvature near the minimumquentially as it is acquired (e.g. t could represent days), can be averted if the data is appropriately shaped (i.e. fea-so that at t = 0 one only has possession of {X0 , Y0 }. Al- ture selection plus proper data transformations). For nowternatively, if we are given a large data set all at once, we though, we ignore the issue and assume we already havecould divide it into batches indexed sequentially by t. In a methodology for “feature shaping” that assures undergeneral, we use the notation xt with subscript t to denote operational conditions that the curvature of the resultinga time-dependent vector at time t, and we write xt,j to learning problem is well-suited to the algorithm that wedenote the jth component of xt . For each t = 0, . . . , tf describe.we define mt 3 Previous Work (i) (i) ft (θ) = l(θT xt ; yt ) 3.1 Online Updates i=1 φt (θ) = ft (θ) + λS(θ), (2) In online learning, the problem of storage is completely averted as each data point is discarded once it is read.where as before, l is a given smooth convex loss function We remark that one can essentially view this approachand S is a regularizer. Also let θt be the parameter vector as a special case of the statistically adaptive method de-obtained at time t, which in practice will approximately scribed in Section 4 with a batch size of 1. Such algo- tminimize λS(θ)+ s=0 fs (θ). We define the regret with rithms iteratively make a prediction θt ∈ Rl and then
  3. 3. receive a convex loss function φt as in (2). Typically, Assuming invertibility of X T X, it is well known that theφt (θ) = l(θT xt ; yt ) + λR(θ), where (xt , yt ) is the data solution is given bypoint read at time t. We then make an update to obtain θ = (X T X)−1 X T Y. (5)θt+1 using a rule that is typically based on the gradi-ent of l(θT xt , yt ) in θ. Indeed, the simplest approach Now, suppose that we have time indexed data(with no regularization term) would be the update rule {Xs , Ys }T with Xs ∈ Rms ×l and Ys ∈ Rms . In order s=0 Tθt+1 = θt − ηt ∇θ l(θt xt , yt ). to update θt given {Xt+1 , Yt+1 }, first we must check how However, there currently exist more sophisticated up- well θt fits the newly augmented data set. We do this bydate schemes which can achieve better regret bounds for evaluating(3). In particular, the work of Duchi, Hazan, and Singer t+1 2is a type of subgradient method with adaptive proximal Xs θ − Ys 2 (6)functions. It is proven that their ADAGRAD algorithm s=0can achieve theoretical regret bounds of the form with θ = θt . Depending on the result, we choose a parameter λ ∈ [0, 1] that determines how much weight 1/2 to give the previous data when computing θt+1 . That Rφ (t) = O( θ∗ 2 tr(Gt )) and (4) Å ã is, λ represents how much we would like to “forget” the 1/2 Rφ (t) = O max θs − θ∗ 2 tr(Gt ) , previous data (or emphasize the new data), with a value of s≤t λ = 1 indicating that all previous data has been thrown t T out. Similarly, the case λ = 1 corresponds to the casewhere in general, Gt = s=0 gs gs is an outer prod- 2 when past and present are weighed equally, and the caseuct matrix generated by the sequence of gradients gs = λ = 1 corresponds to the case when θt fits the new data∇θ fs (θs ) (Duchi et al., 2010). We remark that since the perfectly (i.e. (6) is equal to zero).loss function gradients converge to zero under ideal con- tditions, the estimate (4) is indeed sublinear, because the Let X[0,t] be the s=0 ms × l matrix T T T T [X0 , X1 , . . . , Xt ] obtained by concatenation,decay of the gradients, however slow, counters the linear 1/2 and similarly define the length- t ms vector s=0growth in the size of Gt . Y[0,t] := [Y0T , Y1T , . . . , YtT ]T . Then (6) is equivalent to3.2 Vowpal Wabbit with Gradient Descent X[0,t+1] θ − Y[0,t+1] 2 . Now, when using a particular 2 second order Newton method for minimizing a smoothVowpal Wabbit is a freely available software pack- convex function, the computation of the inverse Hessianage which implements the method described briefly in matrix is analogous to computing (X[0,t] X[0,t] )−1 T(Agarwal et al., 2012). In particular, it combines online above. As t grows large, the cumulative normal matrixlearning and brute force gradient-descent optimization in T X[0,t] X[0,t] becomes increasingly costly to computea slightly different way. First, one does a single pass from scratch, as does its inverse. Fortunately, we observeover the whole data set using online learning to obtain thata rough choice of parameter θ. Then, L-BFGS optimiza- T T Ttion of the cost function is initiated with the data split X[0,t+1] X[0,t+1] = X[0,t] X[0,t] + Xt+1 Xt+1 . (7)across a cluster. The cost function and its gradient are However, if we want to incorporate the flexibility tocomputed locally and AllReduce is utilized to collect the weigh current data differently relative to previous data,global function values and gradients in order to update θ. we need to abandon the exact computation of (7). In-The main improvement of this algorithm over previous stead, letting At denote the approximate analogue ofmethods is the use of AllReduce with the Hadoop file T X[0,t] X[0,t] , we introduce the updatestructure, which significantly cuts down on communi-cation time as is the case with MapReduce. Moreover, 2 At+1 ← µ2 At + Xt+1 Xt+1 T (8)the baseline online learning step is done with a learn- 1 + µ2ing rate chosen in an optimal manner as discussed in µ 2(Karampatziakis and Langford, 2011). where µ satisfies λ = 1+µ2 . The actual update of θt is performed as follows. Define4 Our Approach  Y0   Y1 4.1 Least Squares Digression ‹ T T T  Y[0,t] := X[0,t] Y[0,t] = [X0 , . . . , Xt ]  .  Before we describe the statistically adaptive approach for  .  .minimizing a generic cost function, consider the follow- Yting simpler scenario in the context of least squares regres- t Tsion. Given data {X, Y } with X ∈ Rm×l and Y ∈ Rm , = Xs Ys .we want to choose θ that solves minθ∈Rl Xθ − Y 2 . 2 s=0
  4. 4. Up to time t, the standard solution to the least squaresproblem on the data {X[0,t], Y[0,t] } is then batch data points selected 1,2,...,t-1 from previous batches T ‹ θ = (X[0,t] X[0,t] )−1 Y[0,t] . (9) ‹Now let Bt be an approximation to Y[0,t] . We define Bt+1by the update data points selected 2 batch t from current batch Bt+1 = µ2 Bt + Xt+1 Yt+1 . T 1 + µ2 Figure 1: Subsampling of the partitioned data stream atFinally, we set time t and times 0, 1, . . . , t − 1, respectively. −1 θt+1 := At+1 Bt+1 . (10) is used to update θ. From the subsample chosen, weIt is easily verified that (10) coincides with the standard then apply a gradient descent optimization routine whereupdate (9) when µ = 1. the initialization of the associated starting parameters is4.2 Statistically Adaptive Learning generated from those stored from the previous iteration. In the case of L-BFGS, the rank 1 matrices used to ap-Returning to our original problem, we start with the pa- proximate the inverse Hessian stored from the previousrameter θ0 obtained from some initial pass through of iteration are used to initialize the new descent routine.{X0 , Y0 }, typically using a particular gradient descent al- We summarize the process in Algorithm 1.gorithm. In what follows, we will need to define an easilyevaluated error function to be applied at each iteration, Algorithm 1 Statistically Adaptive Learning Method (SAmildly related to the cumulative regret (3): L-BFGS) t ms (i) (i) Require: Error checking function I(t, θ) s=0 i=1 |pθ (xt ) − yt | tf I(t, θ) := t (11) Given data {Xs , Ys }s=0 s=0 |Xs | Run gradient descent optimization on {X0 , Y0 } toWe remark that I(t, θ) represents the relative number of compute θ0incorrect predictions associated with the parameter θ over for t = 1 to tf doall data points from time s = 0 to s = t. Moreover, if I(t + 1, θt ) − I(t, θt ) > σ(t) thenbecause pθ is a linear function of x, I is very fast to Choose Mold and Mnewevaluate (essentially O(m) where m = t ms ). Subsample data s=0 Given θt , we compute I(t + 1, θt ). There are two Run L-BFGS with initial parameter θt to ob-extremal possibilities: tain θt+1 else θt+1 ← θt 1. I(t + 1, θt ) is significantly larger than I(t, θt ). More end if precisely, we mean that I(t + 1, θt ) − I(t, θt ) > end for σ(t), where σ(t) is the standard deviation of {I(s, θs )}t . In this case the data has significantly s=0 As a typical example, at some time t we might have changed, and so θ must be modified. tf Mold = 1000, Mnew = 100, 000, and t=0 mt = 1·109 . 2. Otherwise, there is no need to change θ and we set This would be indicative of a batch {Xt+1 , Yt+1 } with θt+1 = θt . significantly higher error using the current parameter θt than for previously analyzed batches.In the former case, we use the magnitude of I(t+ 1, θt )− We remark that when learning on each new batch ofI(t, θt ) to determine a subsample of the old and new data data, there are two main aspects that can be parallelized.with Mold and Mnew points chosen, respectively (see First, the batch itself can be partitioned and distributedFigure 1). Roughly speaking, the larger the difference the among nodes in a cluster via AllReduce to significantlymore weight will be given to the most recent data points. speed up the evaluation of the cost function and its gra-The sampling of previous data points serves to anchor dient as is done in (Agarwal et al., 2012). Furthermore,the model so that the parameters do not over fit to the one can run multiple independent optimization routinesnew batch at the expense of significantly increasing the in parallel where the distribution used to subsample fromglobal regret. This is a generalization of online learning Xt+1 and ∪t Xs is varied. The resulting parameters s=0methods where only the most recent single data point θ obtained from each separate instance can then be sta-
  5. 5. tistically compared so as to make sure that the model is mention that the number of features generated is on thenot overly sensitive to the choice of sampling distribu- order of 1000.tion. Otherwise, having θ be too highly dependent on thechoice of subsample would invalidate using a stochas- 5.2 Model 1 Resultstic gradient descent-based approach. A bi-product of For our first set of experiments, we compare the perfor-this ability to simultaneously experiment with different mance of Vowpal Wabbit (VW) using its L-BFGS im-samplings is that it provides a quick means to check the plementation and the Context Relevant Flexible Analyt-consistency of the data. ics and Statistics TechnologyTM L-BFGS implementation Finally, we remark that the SA L-BFGS method can be running on 10 Amazon m1.xlarge instances. The timereasonably adapted to account for changes in the selected measured (in seconds) is only the time required to trainfeatures as new data is acquired. Indeed, it is very ap- the models. The features are generated and cached forpealing within the industry to be able to experiment with each implementation prior to training.different choices of features in order to find those that Performance was measured using the Area Undermatter most, while still being able to use the previously Curve (AUC) metric because this was the methodologycomputed parameters θt to speed up the optimization on used in (Tencent, 2012). In short, the AUC is equal to thethe new data. Of course, it is possible to directly ap- probability that a classifier will rank a randomly chosenply an online learning approach in this situation, since positive instance higher than a randomly chosen negativeprevious data points have already been discarded. But one. More precisely, it is computed via Algorithm 3 intypical gradient descent algorithms do not a priori have (Fawcett, 2004). We compute our AUC results over athe flexibility to be directly applied in such cases and they portion of the public section of the test set that has abouttypically perform worse than batch methods such as L- 2 million examples.BFGS(Agarwal et al., 2012). Model 1 includes only the basic id features, with no conjunction features, and achieves an AUC of 0.748 as5 Experiments shown in Table 1. A simple baseline performance, which can be generated by predicting the mean ctr for each5.1 Description of Dataset and Features ad id would perform at approximately an AUC of 0.71.We consider data used to predict the click-through-rate The winner of (Tencent, 2012) performed at an AUC of(pCTR) of online ads. An accurate model is necessary 0.80. However, the winning model was substantiallyin the search advertising market in order to appropriately more complicated and used many additional features thatrank ads and price clicks. The data contains 11 variables were excluded from this simple demonstration. In futureand 1 output, corresponding to the number of times a work, we will explore more sophisticated feature sets.given ad was clicked by the user among the number of The Context Relevant and VW models both achieve thetimes it was displayed. In order to reduce the data size, same AUC on the test set, which validates that the basicinstances with the same user id, ad id, query, and setting gradient descent and L-BFGS implementations are func-are combined, so that the output may take on any posi- tionally equivalent. The Context Relevant model com-tive integer value. For each instance (training example), pletes learning between four and five times more quickly.the input variables serve to classify various properties of Our implementation is heavily optimized to reduce com-the ad displayed, in addition to the specific search query putation time as well as memory footprint. In addition,entered. This data was acquired from sessions of the Ten- our implementation utilizes an underlying MapReducecent proprietary search engine and was posted publicly on implementation that provides robustness to job and nodewww.kddcup.2012.org (Tencent, 2012). failures1 . For these experiments we build a basic model thatlearns from the identifiers provided in the training set. 5.3 Model 2 ResultsThese include unique identifiers for each query, ad, key- Context Relevant’s implementation of SA L-BFGS isword, advertiser, title, description, display url, user, ad designed to accelerate and simplify learning iterativeposition, and ad depth (further details available in the changes to models. Using information gleaned from theKDD documentation). We compute a position and depth initial L-BFGS pass, SA L-BFGS develops a samplingnormalized click through rate for each identifier, as well 1as combinations (conjunctions) of these identifiers. Then Context Relevant had to re-write the AllReduce networkat training and testing time we annotate each example implementation to add error checking so that the AllReduce sys-with these normalized click through rates. Additionally, tem was robust to errors that were encountered during normal execution of experiments on Amazon’s EC2 systems. Withoutbefore running the optimization, it is necessary to build these changes, we could not keep AllReduce from hanging dur-appropriate feature vectors (i.e. shape the data). We will ing the experiments. There is no graceful recovery from the lossnot go into detail regarding how this is done, except to of a single node.
  6. 6. Table 1: Model 1 Results For Different Learning Mech- Table 2: Model 2 Results For Different Learning Mech-anisms (VW = Vowpal Wabbit; CR = Context Relevant anisms (VW = Vowpal Wabbit; CR = Context RelevantFAST L-BFGS FAST L-BFGS; SA = Context Relevant FAST SA L- VW CR BFGS) L-BFGS L-BFGS VW CR CR seconds 490 114 L-BFGS L-BFGS SA L-BFGS AUC .748 .748 seconds 515 115 9 AUC .750 .752 .751strategy to minimize sampling induced noise when learn-ing new models that are derived from previous models. system uses a new version of L-BFGS to combine theThe larger the divergence in the models, the less speed- robustness and accuracy of second order gradient descentup is likely. For this set of experiments, we add a con- optimization methods with the memory advantages ofjunction feature that captures the interaction between a online learning. This provides a model building envi-query id and an ad id, which has frequently been an ronment that significantly lowers the time and computeimportant feature in well known advertising systems. We cost of asking new questions. The ability to quickly askthen compare the speed and accuracy of Vowpal Wab- and answer experimental questions vastly expands to setbit (VW) using its L-BFGS implementation; the Context of questions that can be asked, and therefore the spaceRelevant Flexible Analytics and Statistics TechnologyTM of models that can be explored to discover the optimalL-BFGS implementation, and the Context Relevant Flex- solution.ible Analytics and Statistics TechnologyTM SA L-BFGS SA L-BFGS is also well suited to environments whereimplementation (SA) running on 10 Amazon X1.Large the underlying distribution of the data provided to a learn-instances. Here the baseline L-BFGS models are trained ing algorithm is shifting. Whether the shift is causedwith the standard practice for adding a new feature, the by changes in user behavior, changes in market pricing,models are retrained on the entire dataset. The time mea- or changes in term usage, SA L-BFGS can be empiri-sured (in seconds) is only the time required to train the cally tuned to dynamically adjust to the changing con-models. The features are generated and cached for each ditions. One can utilize the parallelized approach inimplementation prior to training. (Agarwal et al., 2012) on each batch of data in the time Again, performance was measured using the Area Un- variable, with the additional freedom to choose the batchder Curve (AUC) metric. Table 2 lists the results for each size. Furthermore, the statistical aspects of the algorithmalgorithm, and shows that AUC improves in comparison provide a useful way to check the consistency of the datato Model 1 when the new feature is added. As with Model in real time. However, like all L-BFGS implementa-1, the VW and CR models achieve similar AUC and the tions that rely on small, reduced, or sampled data sets,basic L-BFGS CR learning time is significantly faster. increased sampling noise from the L-BFGS estimationFurthermore, we show that SA L-BFGS also achieves process affects the quality of the resulting learning al-similar AUC, but in less than one tenth of the time (which gorithm. Users of this new algorithm must take care tolikewise implies one tenth of the compute cost required). provide smooth convex loss functions for optimization.This speed up can enable a large increase in the number The Context Relevant Flexible Analytics and Statisticsof experiments, without requiring additional compute or TechnologyTM is designed to provide such functions fortime. It is important to note that the speed of L-BFGS optimization.and SA L-BFGS is essentially tied to the number offeatures and the number of examples for each iteration. 7 About Context Relevant and the AuthorsThe primary performance gains that can be found are: (a)reducing the number of iterations; (b) reducing the num- Context Relevant was founded in March 2012 by Stephenber of examples; or (c) reducing the number of features Purpura and Christian Metcalfe. The company was ini-with non-zero weights. SA L-BFGS adopts the former tially funded by friends, family, Seattle angel investors,two strategies. A reduction in the number of features and Madrona Venture Group.with non-zero weights can be forced through aggressive Stephen Purpura – CEO & Co-Founder – Stephenregularization, but at the expense of specificity. works as the CEO and CTO of the company. He has more than 20 years of experience, is listed as an inventor of6 Conclusions five issued United States patents and served as program manager of Microsoft Windows and the Chief SecurityWe have presented a new tera-scale machine learning Officer of MSFDC, one of the first Internet bill paymentsystem that enables rapid model experimentation. The systems. Stephen is recognized as a leading expert in the
  7. 7. fields of machine learning, political micro-targeting and as MSNBC, The New York Times, The Washington Postpredictive analytics. and National Public Radio. He has also been profiled by Stephen received a bachelor’s degree from the Uni- LiveScience’s “ScienceLives”.versity of Washington, a master’s degree from Harvard, He has worked as a research scientist in the Socialwhere he was part of the Program on Networked Gover- Computing Lab at HP Labs and has interned at Google,nance at the John F. Kennedy School of Government, and IBM and Microsoft. Scott holds a master’s degree fromhe is set to be granted a PhD in information science from MIT, where he worked with the Media Laboratory’s So-Cornell. ciable Media Group, and graduated from Harvard Univer- Jim Walsh – VP Engineering – Jim brings 23 years sity, where he studied Linguistics and Computer Science.of experience in technology innovation and engineering Scott is currently on leave from the PhD program in So-management to Context Relevant. He founded, built and ciology at Cornell University.led the Cosmos team – Microsoft’s massive scale dis- Mark Hubenthal – Member of the Technical Staff –tributed data storage and analysis environment, under- Mark recently received his PhD in Mathematics from thepinning virtually all Microsoft products including Bing University of Washington. He works on inverse problems– and founded the Bing multimedia search team. In his applicable to medical imaging and geophysics.final position at Microsoft, Jim served as the principal Scott Smith - Principal Engineer and Architect - Scottarchitect for Microsoft advertising platforms. has experience with distributed computing, compiler de- Prior to joining Microsoft, Jim created two software sign, and performance optimization. At Akamai, hestartups, wrote the second-ever Windows application and helped design and implement the load balancing andcreated custom computer animation software for the tele- failover logic for the first CDN. At Clustrix, he built thevision broadcast industry. He is recognized as one of SQL optimizer and compiler for a distributed RDBMS.the world’s leading network performance experts and has He holds a masters degree in computer science from MIT.been granted twenty two technology patents that rangefrom performance to user interface design. Jim earned abachelor of science degree in computing science from the ReferencesUniversity of Alberta. [Agarwal et al.2012] A. Agarwal, O. Chapelle, M. Dudik, Dustin Rigg Hillard – Director of Engineering and and J. Langford. 2012. A reliable effective terascaleData Scientist – Dustin is a recognized data science and linear learning system. Feb. arXiv:1110.4198v2.machine learning expert who has published more than 30papers in these areas. Previously at Microsoft and Ya- [Bottou2010] L. Bottou. 2010. Large-scale machine learning with stochastic gradient descent. pages 177–hoo!, he spent the last decade building systems that sig- 187.nificantly improve large-scale processing and machine-learning for advertising, natural language and speech. [Duchi et al.2010] J. Duchi, E. Hazan, and Y. Singer. Prior to joining Context Relevant, Dustin Hillard 2010. Adaptive subgradient methods for online learn-worked for Microsoft, where he worked to improve ing and stochastic optimization.speech understanding for mobile applications and XBox [Fawcett2004] T. Fawcett. 2004. ROC graphs: Notes andKinect. Before that he was at Yahoo! for three practical considerations for researchers.years, where he focused on improving ad relevance.His research in graduate school focused on automatic [Karampatziakis and Langford2011] N. Karampatziakisspeech recognition and statistical translation. Dustin and J. Langford. 2011. Online importance weightincorporates approaches from these and other fields to aware updates.learn from massive amounts of data with supervised, [Langford et al.2009] J. Langford, L. Li, and T. Zhang.semi-supervised, and unsupervised machine learning ap- 2009. Sparse online learning via truncated gradient.proaches. Journal of Machine Learning Research, 10:777–801. Dustin holds a bachelor’s and master’s degree as wellas a PhD in Electrical Engineering from the University of [Schraudolph et al.2007] N. Schraudolph, J. Yu, and S. G¨ nter. 2007. A stochastic quasi-newton method uWashington. for online convex optimization. Scott Golder – Data Scientist & Staff Sociologist –Scott mines social networking data to investigate broad [Tencent2012] Tencent. 2012.questions such as when people are happiest (mornings KDD Cup 2012, Track 2 Data.and weekends) and how Twitter users form new social http://www.kddcup2012.org/c/kddcup2012-track2/datties. His work has been published in the journal Scienceas well as top computer science conferences by the ACMand IEEE, and has been covered in media outlets such

×