
Streaming Topic Model Training and Inference with Apache Flink

Analyzing streams of text data to extract topics is an important task for deriving insights that can be leveraged in subsequent workflows. For example, extracting topics from text that is continuously ingested into a search engine can be used to tag documents with important keywords or concepts at search time. Another use case is analyzing support tickets to get insights into the most common problems customers face.

In this talk we illustrate how to use Apache Flink's stateful stream-processing capabilities to continuously train topic models from unlabelled text and use those models to extract topics from the data itself. The topic models are built on distributed representations (embeddings) of words and documents. We'll see how all of this can be done in a pure streaming fashion, without resorting to a Lambda-Architecture-style setup. An earlier version of this talk was presented at Flink Forward Berlin 2018.


Streaming Topic Model Training and Inference with Apache Flink

  1. Streaming Topic Model Training and Inference
     Suneel Marthi, March 21, 2019
     DataWorks Summit 2019, Barcelona
  2. $WhoAreWe
     Suneel Marthi (@suneelmarthi)
     - Member of Apache Software Foundation
     - Committer and PMC on Apache Mahout, Apache OpenNLP, Apache Streams
  3. Agenda
     - Motivation for Topic Modeling
     - Existing Approaches for Topic Modeling
     - Topic Modeling on Streams
  4. Motivation for Topic Modeling
  5. Topic Models
     Automatically discovering the main topics in collections of documents.
     Overall:
     - What are the main themes discussed on a mailing list or a discussion forum?
     - What are the most recurring topics discussed in research papers?
  6. Topic Models (Contd)
     Per document:
     - What are the main topics discussed in a certain (single) newspaper article?
     - What is the most distinctive topic that makes a certain research article more interesting than the others?
  7. Uses of Topic Models
     - Search engines, to organize text
     - Annotating documents according to these topics
     - Uncovering hidden topical patterns across a document collection
  8. Common Approaches to Topic Modeling
  9. Latent Dirichlet Allocation (LDA)
     - Topics are probability distributions over words
     - Documents are probability distributions over topics
     - Batch-oriented approach
  10. LDA Model
      - N: vocabulary size
      - M/D: number of documents
      - alpha: prior on per-document topic dist.
      - beta: prior on topic-word dist.
      - theta: per-document topic dist.
      - phi: per-topic word dist.
      - z: topic for word w
      Intuition: documents are a mixture of topics, and topics are a mixture of words, so documents can be generated by drawing a sample of topics and then drawing a sample of words.
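The notation on this slide corresponds to the usual generative story for smoothed LDA. As a reference, here it is written out in the slide's notation (a standard formulation, not reproduced from the deck; K denotes the number of topics, introduced on a later slide):

```latex
% Generative process for smoothed LDA:
% alpha, beta are Dirichlet priors; theta, phi the sampled distributions.
\begin{align}
  \phi_k &\sim \mathrm{Dirichlet}(\beta)      && k = 1,\dots,K \ \text{(topics)} \\
  \theta_d &\sim \mathrm{Dirichlet}(\alpha)   && d = 1,\dots,D \ \text{(documents)} \\
  z_{d,i} &\sim \mathrm{Multinomial}(\theta_d) && \text{topic for word position } i \text{ in } d \\
  w_{d,i} &\sim \mathrm{Multinomial}(\phi_{z_{d,i}}) && \text{observed word}
\end{align}
```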
  11. Latent Semantic Analysis (LSA)
      - Truncated SVD of the TF-IDF document-term matrix
      - Batch-oriented approach
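To make the one-line recipe concrete, here is a minimal scikit-learn sketch of LSA (the toy corpus and parameter values are invented for illustration; scikit-learn 1.x API):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "flink streaming topic model training",
    "yarn kerberos integration for flink",
    "table format factory for sql sources",
]

# Build the TF-IDF document-term matrix ...
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)          # shape: (n_docs, n_terms)

# ... and take its truncated SVD; each row of doc_topics is a
# document expressed in the latent "topic" space.
svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(X)

# Top terms per latent dimension, i.e. a crude topic description.
terms = tfidf.get_feature_names_out()
for k, comp in enumerate(svd.components_):
    top = comp.argsort()[::-1][:3]
    print(f"topic {k}:", [terms[i] for i in top])
```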
  12. Drawbacks of the Traditional Topic Modeling Approach
      Problem: labelling topics is often a manual task.
      E.g. the LDA topic "30% basket, 15% ball, 10% drunk, ..." can be tagged as basketball.
  13. Common Approaches: Deep Learning (of course)
      - Learning to perform LDA: supervised DNN training against LDA output (inference speed-up), https://arxiv.org/abs/1508.01011
      - Deep Belief Networks for Topic Modelling, https://arxiv.org/abs/1501.04325
  14. What are Embeddings?
      - Represent a document as a point in space; semantically similar docs are close together when plotted
      - Such a representation is learned via a (shallow) neural network algorithm
  15. Embeddings for Topic Modeling
      - LDA2Vec: mixing LDA and word2vec word embeddings, https://arxiv.org/abs/1605.02019
      - Navigating embeddings: geometric navigation in word- and doc-embedding space, looking for topics
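One way to read the "navigating embeddings" bullet: learn document vectors, then treat nearest neighbours in that space as topical groupings. A minimal gensim sketch of that idea (gensim 4.x API; the corpus, vector size and epoch count are invented, and this is not the jtm implementation):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw_docs = [
    "add a single value table format factory",
    "fix flink yarn kerberos integration",
    "kerberos ticket renewal fails on yarn",
]
corpus = [TaggedDocument(words=d.split(), tags=[i])
          for i, d in enumerate(raw_docs)]

# Shallow neural network that places each document in a vector space.
model = Doc2Vec(corpus, vector_size=32, min_count=1, epochs=50)

# Nearby points in the space are semantically related documents;
# clusters of such neighbours act as "topics".
vec = model.infer_vector("yarn kerberos integration".split())
print(model.dv.most_similar([vec], topn=2))
```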
  16. Static vs Dynamic Corpora
      Learning topic models over a fixed set of documents:
      - Fetch the corpus
      - Train and fit a topic model
      - Extract topics
      - ...other downstream tasks
  17. Static vs Dynamic Corpora (Contd)
      What to do with corpus updates?
      - Document collections are not static in the real world
      - There may be no document collection to begin with
      - Memory limits with large corpora
      - Reprocessing the entire corpus whenever new documents are added is expensive
  18. Static vs Dynamic Corpora (Contd)
      Streams, streams, streams:
      - Twitter stream
      - Reddit posts
      - Papers on ResearchGate and arXiv
      - ...
  19. Questions for the Audience
      (1) Who's doing inference or scoring on streaming data?
      (2) Who's doing online learning and optimization on streaming data?
      (2) is hard, sometimes because the algorithms aren't there, but also because people don't know the algorithms are there.
  20. Topic Modeling on Streams
  21. Two Streaming Approaches
      - Learning topics from Jira issues
      - Online LDA
  22. Learning Topics on Jira Issues
  23. Workflow
  24. (Workflow diagram)
  25. FLINK-9963: Add a single value table format factory
  26. Topics extracted for FLINK-9963
      - format factory
      - table schema
      - table source
      - sink factory
      - table sink
  27. FLINK-8286: Fix Flink-Yarn-Kerberos integration for FLIP-6
  28. (Figure)
  29. Topics extracted for FLINK-8286
      - yarn kerberos integration
      - kerberos integrations
      - yarn kerberos
  30. Bayesian Inference
      Goal: calculate the posterior distribution of a probabilistic model given data.
      - Gibbs sampling/MCMC: repeatedly sample from conditional distribution(s) to approximate the posterior of the joint distribution; typically a batch process
      - Variational Bayes: optimize the parameters of a function that approximates the joint distribution
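The two bullets target the same object. Written out (a standard formulation, not from the deck): both approximate the posterior p(z | x), but variational Bayes turns inference into optimization of a lower bound:

```latex
% Target of inference: the posterior over latent variables z given data x,
% p(z | x) = p(x | z) p(z) / p(x), where the evidence p(x) is intractable.
% Variational Bayes: pick a family q_nu(z) and maximize the evidence
% lower bound (ELBO), equivalent to minimizing KL(q || posterior).
\begin{align}
  \log p(x) &\ge \mathbb{E}_{q_\nu}\!\left[\log p(x, z)\right]
              - \mathbb{E}_{q_\nu}\!\left[\log q_\nu(z)\right]
              = \mathcal{L}(\nu) \\
  \log p(x) - \mathcal{L}(\nu) &= \mathrm{KL}\!\left(q_\nu(z) \,\|\, p(z \mid x)\right)
\end{align}
```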
  31. Variational Bayes
      - Expectation-Maximization-like optimization
      - Faster, and often more accurate, than sampling
      - Can often be computed online (on windows over streams)
  32. Online LDA (1/2)
      - Variational Bayes for LDA; batch or online
      - Iterate between optimizing per-word and per-document topics (the "E" step) and topic-word distributions (the "M" step)
      - The default topic model optimization in Gensim, scikit-learn, and Apache Spark MLlib -- not because it's amenable to streaming per se, but because it's memory efficient and can be applied to large corpora
      Hoffman, Blei & Bach. Online Learning for Latent Dirichlet Allocation. Proceedings of Neural Information Processing Systems, 2010.
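scikit-learn, one of the libraries named above, exposes this optimizer directly. A minimal sketch of feeding it mini-batches via partial_fit, which is the same usage pattern as consuming windows off a stream (toy corpus and parameter values invented):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Fix the vocabulary up front so every mini-batch maps to the same
# document-term columns (a real stream needs a stable vocab or hashing).
vocab = ["flink", "yarn", "kerberos", "table", "format", "factory", "sink"]
vectorizer = CountVectorizer(vocabulary=vocab)

lda = LatentDirichletAllocation(
    n_components=2,
    learning_method="online",
    learning_offset=10.0,   # tau0 in Hoffman et al.
    learning_decay=0.7,     # kappa in Hoffman et al.
    random_state=0,
)

mini_batches = [
    ["flink table format factory", "table format factory sink"],
    ["flink yarn kerberos", "yarn kerberos kerberos"],
]

# One partial_fit call per window of documents: the "E" step is
# batch-local, the "M" step folds the batch into the global model.
for batch in mini_batches:
    lda.partial_fit(vectorizer.transform(batch))

print(lda.components_.shape)  # (n_topics, vocab_size): the lambda matrix
```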
  33. Online LDA (2/2)
      Goal: optimize the topic-word dist. parameter lambda; it tells us what words belong to the topics.
      Job-constant:
      - K: number of topics
      - alpha: topic-document dist. prior
      - eta: topic-word dist. prior
      - tau0: learning rate
      - kappa: decay rate
      (Mini)batch-local:
      - gamma: theta prior
      - theta: param. for per-document topic dist.
      - phi: param. for per-topic word dist.
      - rho: update weight
      Job-global:
      - lambda: parameter for topic-word dist.
      - t: documents/batches seen (index)
      Job-global parameters can be tracked in a state store. Batch-local parameters can be discarded.
      [Image source: Hoffman, Blei & Bach, 2010.]
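The payoff of this parameter split is that the job-global update is tiny. In the slide's notation, for a mini-batch of S documents out of (an estimated) D total, the update from Hoffman, Blei & Bach (2010) is:

```latex
% Step size decays with the number of batches seen so far (t):
\rho_t = (\tau_0 + t)^{-\kappa}
% Candidate lambda from the batch's "E"-step sufficient statistics,
% scaled as if the whole corpus of D documents resembled this batch:
\tilde{\lambda}_{kw} = \eta + \frac{D}{S} \sum_{d \in \text{batch}} n_{dw}\,\phi_{dwk}
% Blend the candidate into the job-global state:
\lambda \leftarrow (1 - \rho_t)\,\lambda + \rho_t\,\tilde{\lambda}
```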
  34. Online LDA on Beam
      Job-constant (K, alpha, eta, tau0, kappa): set globally with setGlobalJobParameters() on the execution environment, or pass as arguments to operators.
      (Mini)batch-local (gamma, theta, phi, rho): operator/task-internal; don't need to be accessible by other tasks.
      Job-global (lambda, t): several options:
      - External key-value store
      - Accumulators: lambda update as increment, accumulator + broadcast variables
      - Broadcast state
      And (mini)batches are selected by window size.
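Framework APIs aside, the parameter split above is easy to see in plain Python. The following is a hand-rolled simulation of the streaming loop, not Beam or Flink code (all names are invented, and the "E" step is faked with raw counts): job-constants are plain constants, batch-locals die with each window, and only lambda and t survive in the state store.

```python
import numpy as np

# Job-constant parameters (would come from global job parameters).
K, VOCAB, TAU0, KAPPA, ETA, D_ESTIMATE = 2, 7, 10.0, 0.7, 0.01, 1000

state = {"lambda": np.random.gamma(1.0, 1.0, (K, VOCAB)),  # job-global
         "t": 0}                                           # job-global

def process_window(batch_counts, state):
    """One window of document-term count rows (S x VOCAB)."""
    S = batch_counts.shape[0]
    # Batch-local "E"-step stand-in: fake per-topic sufficient stats
    # by giving every topic the same raw counts. (A real E step would
    # iterate gamma/phi to convergence here.)
    suff_stats = batch_counts.sum(axis=0, keepdims=True).repeat(K, axis=0)
    lam_tilde = ETA + (D_ESTIMATE / S) * suff_stats
    # Job-global "M" step: blend and write back to the state store.
    rho = (TAU0 + state["t"]) ** -KAPPA
    state["lambda"] = (1 - rho) * state["lambda"] + rho * lam_tilde
    state["t"] += 1
    # gamma/theta/phi/rho go out of scope: nothing else needs them.

for _ in range(3):  # three windows off the "stream"
    process_window(np.random.poisson(1.0, (4, VOCAB)), state)
print(state["t"], state["lambda"].shape)
```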
  35. OnlineLDA Pipeline
      (Pipeline diagram)
  36. So what?
  37. For many tasks, the default is batch training/optimization, often even when good online training and optimization algorithms exist (why?). So if your algorithm only requires one step of parameter state, why not just do streaming? You can get instant, always up-to-date results by putting the effort into using a different class of optimization algorithms and a stream processing engine, so just do that.
  38. Links
      - Online Learning for Latent Dirichlet Allocation: http://papers.nips.cc/paper/3902-online-learning-for-latent-dirichlet-allocation
      - Learning from LDA using Deep Neural Networks: https://arxiv.org/pdf/1508.01011.pdf
      - Deep Belief Nets for Topic Modeling: https://arxiv.org/pdf/1501.04325.pdf
      - Topic Modeling on Jira issues: https://github.com/tteofili/jtm
  39. Credits
      - Tommaso Teofili, Simone Tripodi (Adobe, Rome)
      - Joern Kottmann, Bruno Kinoshita (Apache OpenNLP)
      - Fabian Hueske (Apache Flink, Data Artisans)
  40. Questions?
