Slides from my talk at the 2015 Hadoop Summit.
Topic modeling – the task of discovering hidden thematic structure in a set of documents – is an important problem in modern machine learning. Despite great progress in recent years, scaling it to large numbers of topics and massive corpora remains a challenge. We share our experience in building a high-performance distributed system for topic modeling using the Apache REEF framework. We start with an introduction to topic modeling via Latent Dirichlet Allocation, then describe our new LDA inference algorithm based on tensor factorization. Finally, we demonstrate how the Apache REEF framework helped us implement the system in a few lines of clean, readable code.
2. Furong Huang
UC Irvine
Anima Anandkumar
UC Irvine
Nikos Karampatziakis
Microsoft CISL
Paul Mineiro + 𝜀
Microsoft CISL
Sergiy Matusevych
Microsoft CISL
Shravan Narayanamurthy
Microsoft CISL
Markus Weimer
Microsoft CISL
Apache REEF Contributors
Worldwide
3.
4.
5. /pos/cv107_24319.txt
is evil dead ii a bad movie ?
it's full of terrible acting ,
pointless violence , and plot
holes yet it remains a cult
classic nearly fifteen years
after its release ...
/pos/cv108_15571.txt
it's rather strange too have
two computer animated talking
ant movies come out in a single
year , but that is what disney
and pixar animation ; s latest
film represents ...
http://www.cs.cornell.edu/People/pabo/movie-review-data
22. Easy to reason about
Centralized control flow
• Evaluator allocation and configuration
• Task configuration and submission
Centralized error handling
• Task exceptions thrown to the Driver
• Evaluator failures reported to the Driver
Scalable
Event-based programming
• Driver sends requests as events to REEF
• REEF sends events to the Driver
Mostly stateless design
• REEF maintains minimal state
• Majority of state keeping (e.g. work queues) is maintained by the Driver
23. // Submit task to the newly created context
public class ContextActiveHandler implements EventHandler<ActiveContext> {
@Override
public void onNext(final ActiveContext context) {
taskGroups.submitNext(context);
}
}
// Submit next task to current context
public class TaskCompletedHandler implements EventHandler<CompletedTask> {
@Override
public void onNext(final CompletedTask task) {
final ActiveContext context = task.getActiveContext();
taskGroups.submitNext(context);
}
}
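The `taskGroups` helper used by both handlers is not shown on the slides. A minimal sketch of its queue logic in plain Java (the class name, the use of `String` task ids, and the `Object` context parameter are assumptions for illustration; the real code would build a REEF Task `Configuration` and call `context.submitTask(...)`):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch: a queue of pending tasks, drained one task
// per available Evaluator context as the Driver receives events.
final class TaskGroups {

  private final Queue<String> pending = new ArrayDeque<>();

  TaskGroups(final Iterable<String> taskIds) {
    for (final String id : taskIds) {
      this.pending.add(id);
    }
  }

  /**
   * Submit the next pending task to the given context.
   * Here we simply dequeue and return the task id;
   * returns null when there is no more work.
   */
  String submitNext(final Object context) {
    return this.pending.poll();
  }
}
```

This keeps all work-queue state in the Driver, matching the "mostly stateless" REEF design described on the previous slide.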
24.
25. @Inject
public WhitenTask(
final @Parameter(TaskConfigurationOptions.Identifier.class) String taskId,
final @Parameter(Launch.DimD.class) int dimD,
final @Parameter(Launch.DimK.class) int dimK,
final GroupCommClient groupCommClient,
final InputData data,
final TaskEnvironment env) {
// ...
}
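The `DimD` and `DimK` parameters injected above are Tang named parameters. The actual `Launch` class is not shown on the slides; a sketch of how such configuration declarations might look (doc strings and short names are assumptions):

```java
import org.apache.reef.tang.annotations.Name;
import org.apache.reef.tang.annotations.NamedParameter;

// Hypothetical sketch of the named-parameter declarations
// referenced by @Parameter(Launch.DimD.class) etc.
final class Launch {

  @NamedParameter(doc = "Dimension of the input data", short_name = "dimD")
  public static final class DimD implements Name<Integer> {
  }

  @NamedParameter(doc = "Number of topics", short_name = "dimK")
  public static final class DimK implements Name<Integer> {
  }
}
```

Because each parameter is a distinct class implementing `Name<Integer>`, Tang rejects mismatched bindings at configuration time – this is what "use the Java type system to validate the configuration" refers to.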
“Use the Java type system to validate the configuration”
26.
27.
28.
29. // We can send and receive any Java serializable data, e.g. JBLAS matrices
private final Broadcast.Sender<DoubleMatrix> modelSender;
private final Reduce.Receiver<DoubleMatrix[]> resultReceiver;
// Broadcast the model, collect the results, repeat.
do {
this.modelSender.send(sliceA);
// ...
final DoubleMatrix[] result = this.resultReceiver.reduce();
} while (notConverged(sliceA, prevSliceA));
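The `notConverged` check used in the loop above is not shown on the slides. A minimal sketch of one plausible implementation, using plain double arrays in place of JBLAS `DoubleMatrix` (the tolerance value and the max-absolute-change criterion are assumptions):

```java
// Hypothetical sketch of the convergence test from the Driver loop:
// stop once the largest element-wise change falls below a tolerance.
final class Convergence {

  private static final double TOLERANCE = 1e-6;  // assumed threshold

  /** True while the model is still changing by more than the tolerance. */
  static boolean notConverged(final double[] current, final double[] previous) {
    double maxDelta = 0.0;
    for (int i = 0; i < current.length; i++) {
      maxDelta = Math.max(maxDelta, Math.abs(current[i] - previous[i]));
    }
    return maxDelta > TOLERANCE;
  }
}
```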
39. • Find whitening matrix W s.t. W^T M2 W = I (W makes M2 orthogonal)
• Use SVD M2 = U Σ U^T to find W = U Σ^(-1/2) s.t. W^T M2 W = I
• Whiten M3: T = M3(W, W, W)
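The whitening step on this slide can be written out as a short derivation (standard moment whitening, consistent with the M2/M3 notation used earlier in the deck; the rank-k truncation is an assumption):

```latex
M_2 = U \Sigma U^\top \quad \text{(rank-}k\text{ eigendecomposition)} \\
W = U \Sigma^{-1/2} \\
W^\top M_2 W = \Sigma^{-1/2} U^\top \left( U \Sigma U^\top \right) U \Sigma^{-1/2} = I \\
T = M_3(W, W, W) \quad \text{(whitened third moment, a } k \times k \times k \text{ tensor)}
```

After whitening, the tensor T has an orthogonal decomposition, which is what makes the subsequent factorization tractable.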
Editor's Notes
We are hiring!
What is the problem we are solving, why it’s important, and what are state-of-the-art solutions.
New approach and our algorithm
etc
In general, given data (e.g. a corpus of text, a social graph, user pageview/click logs), reveal the latent parameters that influence the distribution – communities, user preferences, text topics. We’ll talk about text because it’s easy to demo and reason about, even on a small dataset
Top 10 topics. Each document has a mixture of topics; some topics are common, e.g. film/movie/time. Words appear in many topics, e.g. action/crime/cop and action/Jackie Chan. Topics are sparse
Start 3:20
It’s all bag of words to me
Nikolai Ge, Portrait of Leo Tolstoy, 1884 Tretyakov gallery, Moscow
Writing what I believe
Start 10
Introduced by Karl Pearson in 1894; everything new is well-forgotten old. So M1 is a vector, M2 a matrix; M2 is not enough for topics (there is spectral clustering – will talk about it later if asked). Need to capture triplets – a cube of data…
It was shown that with these shifted terms M1..M3 are sufficient to reveal not only clusters, but mixtures of latent parameters. In fact, if you squint right, M2 is a covariance matrix, and a0 is a Dirichlet hyperprior. Similarly, M3 is (shifted) skewness. I will give more details later. So this is the information that we collect… How to get the topics?
8:25
We can factorize the tensor into a cross product of eigenvectors that reveal the topics. i.e. each vector beta_i contains probabilities of words in topic i.
it’s linear
Need resource manager, e.g. YARN, and distributed FS.
Master node checks for convergence
Markus gave a talk at Hadoop Summit 2014 – see on YouTube
Much nicer in C#
REEF itself has very little state; all state is in the driver
Form a communication tree – nodes pass data along. On the reduce stage we also specify the aggregation operator
Future work: community detection, larger datasets (pubmed), compare with LightLDA; in general: need better support for tensors (libraries, CUDA, parameter server)
End: 20 min sharp
Total ~24 min with questions
The model (LDA) is independent of the inference algorithm (variational Bayes, MCMC, tensors)