
Text mining meets neural nets


Text mining with word embeddings (word2vec) and deep learning, particularly convolution networks compared to TF-IDF classifiers.


  1. 1. Dan Sullivan October 21, 2015 Portland, OR
  2. 2. * * Introduction to Natural Language Processing and Text Mining * Linguistic and Statistical Approaches *Critiquing Classifier Results * A New Dawn: Deep Learning * What’s Next
  3. 3. * * Enterprise Architect, Big Data and Analytics * Former Research Scientist, bioinformatics institute * Completing PhD in Computational Biology with focus on text mining *Author *Contact *dan@dsapptech.com *@dsapptech *Linkedin.com/in/dansullivanpdx
  4. 4. *
  5. 5. *
  6. 6. *
  7. 7. * Manual procedures are time-consuming and costly * Volume of literature continues to grow * Commonly used search techniques (keyword, similarity searching, metadata filtering, etc.) can still yield volumes of literature that are difficult to analyze manually * Some success with popular tools, but they have limitations
  8. 8. * * Linguistic (from 1960s) * Focus on syntax * Transformational Grammar * Sentence parsing *Statistical (from 1990s) * Focus on words, ngrams, etc. * Statistics and Probability * Related work in Information Retrieval * Topic Modeling and Classification * Deep Learning (from ~2006) * Focus on multi-layered neural net computing non-linear functions * Light on theory, heavy on engineering * Multiple NLP tasks
  9. 9. * VS.
  10. 10. * http://www.slideshare.net/DanSullivan10/text-mining-meets-neural-nets
  11. 11. *
  12. 12. * Image: http://www.nltk.org/book_1ed/ch08.html
  13. 13. * Stephen H. Chen et al. Physiol. Genomics 2005;22:257-267
  14. 14. *
  15. 15. * * Technique for identifying dominant themes in a document * Does not require training * Multiple Algorithms * Probabilistic Latent Semantic Indexing (PLSI) * Latent Dirichlet Allocation (LDA) *Assumptions *Documents are about a mixture of topics *Words used in a document are attributable to a topic Source: http://www.keepcalm-o-matic.co.uk/p/keep-calm-theres-no-training-today/
  16. 16. Debt, Law, Graduation Debt, EU, Greece, Euro Source: http://www.nytimes.com/pages/business/index.html April 27, 2015 EU, Greece, Negotiations, Varoufakis
  17. 17. * * Topics represented by words; documents about a set of topics *Doc 1: 50% politics, 50% presidential *Doc 2: 25% CPU, 30% memory, 45% I/O *Doc 3: 30% cholesterol, 40% arteries, 30% heart * Learning Topics *Assign each word to a topic *For each word and topic, compute * Probability of topic given a document P(topic|doc) * Probability of word given a topic P(word|topic) * Reassign word to new topic with probability P(topic|doc) * P(word|topic) * Reassignment based on probability that topic T generated use of word W TOPICS
  18. 18. Image Source: David Blei, “Probabilistic Topic Models” http://yosinski.com/mlss12/MLSS-2012-Blei-Probabilistic-Topic-Models/
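A minimal sketch of the topic-modeling workflow described on the preceding slides, using gensim's LDA implementation. The toy corpus (echoing the newspaper topics from slide 16) and the number of topics are assumptions for illustration, not the deck's actual data.

    # Minimal LDA sketch with gensim; the toy corpus and num_topics are
    # illustrative assumptions.
    from gensim import corpora, models

    docs = [
        ["debt", "law", "graduation", "loans"],
        ["debt", "eu", "greece", "euro"],
        ["eu", "greece", "negotiations", "varoufakis"],
    ]

    dictionary = corpora.Dictionary(docs)                 # map words to integer ids
    bow_corpus = [dictionary.doc2bow(d) for d in docs]    # bag-of-words counts

    lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)

    # P(word | topic): top words for each learned topic
    for topic_id in range(lda.num_topics):
        print(lda.print_topic(topic_id))

    # P(topic | doc): topic mixture for the first document
    print(lda.get_document_topics(bow_corpus[0]))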
  19. 19. * 3 Key Components * Data * Representation scheme * Algorithms * Data * Positive examples – Examples from representative corpus * Negative examples – Randomly selected from same publications * Representation * TF-IDF * Vector space representation * Cosine of vectors measure of similarity * Algorithms * Supervised learning * SVMs * Ridge Classifier * Perceptrons * kNN * SGD Classifier * Naïve Bayes * Random Forest * AdaBoost *
  20. 20. * Source: Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with Natural Language Toolkit. http://www.nltk.org/book/
  21. 21. *Term Frequency (TF) tf(t,d) = # of occurrences of t in d, where t is a term and d is a document *Inverse Document Frequency (IDF) idf(t,D) = log(N / |{d in D : t in d}|), where D is the set of documents and N is the number of documents *TF-IDF = tf(t,d) * idf(t,D) *TF-IDF is *large when the term is frequent in the document but rare across the corpus *small when the term appears in many documents (see the worked sketch below) *
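A small worked example of the TF-IDF formulas above in plain Python; the toy corpus and pre-tokenized documents are assumptions for illustration.

    # Direct implementation of the tf, idf, and tf-idf formulas above.
    import math

    def tf(t, d):
        """# of occurrences of term t in document d (a list of tokens)."""
        return d.count(t)

    def idf(t, D):
        """log(N / |{d in D : t in d}|); D is a list of tokenized documents."""
        N = len(D)
        n_t = sum(1 for d in D if t in d)
        return math.log(N / n_t) if n_t else 0.0

    def tf_idf(t, d, D):
        return tf(t, d) * idf(t, D)

    # Toy corpus: "esp8" occurs in only one document, so it gets a high weight;
    # "the" occurs in every document, so its idf (and tf-idf) is zero.
    D = [["the", "esp8", "gene", "is", "known"],
         ["the", "host", "cell"],
         ["the", "reduced", "levels"]]
    print(tf_idf("esp8", D[0], D))  # high weight
    print(tf_idf("the", D[0], D))   # 0.0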
  22. 22. One-Hot Representation (each word maps to its own unit vector):
          The        1 0 0 0 0 0 0
          Esp8       0 1 0 0 0 0 0
          gene       0 0 1 0 0 0 0
          is         0 0 0 1 0 0 0
          a          0 0 0 0 1 0 0
          known      0 0 0 0 0 1 0
          virulence  0 0 0 0 0 0 1
          TF-IDF Representation (columns: translocates, reduced, levels, of, Esp8, host, cell):
          Sentence 1  0.193  0.2828  0.078   0.0001  0.389  0.0144  0.011
          Sentence 2  0      0.0091  0.0621  0       0      0       0
          Sentence 3  0      0       0       0       0.028  0.0113  0
          Sentence 4  0.021  0       0       0       0      0       0
          *
  23. 23. * Bag of words model * Ignores structure (syntax) and meaning (semantics) of sentences * Representation vector length is the size of set of unique words in corpus * Stemming used to remove morphological differences * Each word is assigned an index in the representation vector, V * The value V[i] is non-zero if word appears in sentence represented by vector * The non-zero value is a function of the frequency of the word in the sentence and the frequency of the term in the corpus *
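A minimal sketch of the bag-of-words and TF-IDF representations just described, using scikit-learn's vectorizers; the example sentences are assumptions.

    # Bag-of-words vs. TF-IDF vectors with scikit-learn; sentences are illustrative.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    sentences = [
        "EPEC translocates reduced levels of EspB into the host cell",
        "the Esp8 gene is a known virulence factor",
    ]

    # One index per unique word in the corpus; the value marks occurrence
    onehot = CountVectorizer(binary=True)
    print(onehot.fit_transform(sentences).toarray())
    print(onehot.get_feature_names_out())

    # Same indices, but each value combines term frequency and inverse document frequency
    tfidf = TfidfVectorizer()
    print(tfidf.fit_transform(sentences).toarray())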
  24. 24. Support Vector Machine (SVM) is a large-margin classifier, commonly used in text classification. Initial results are based on a life-sciences sentence classifier (see the pipeline sketch below). Image Source: http://en.wikipedia.org/wiki/File:Svm_max_sep_hyperplane_with_margin.png *
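A sketch of the TF-IDF-plus-SVM setup described on the preceding slides, written as a scikit-learn pipeline; the tiny labeled corpus is a placeholder for the positive/negative sentence examples described earlier.

    # TF-IDF features feeding a linear SVM, as a scikit-learn pipeline.
    # The two labeled sentences are placeholders, not the real training data.
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    X_train = [
        "EspP influences the intestinal colonization of calves",    # virulence factor
        "data were log-transformed to correct for heterogeneity",   # not a virulence factor
    ]
    y_train = [1, 0]

    clf = Pipeline([
        ("tfidf", TfidfVectorizer(stop_words="english")),
        ("svm", LinearSVC()),   # large-margin linear classifier
    ])
    clf.fit(X_train, y_train)
    print(clf.predict(["the type V-secreted serine protease EspP"]))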
  25. 25. *
  26. 26. Non-VF, Predicted VF: * “Collectively, these data suggest that EPEC 30-5-1(3) translocates reduced levels of EspB into the host cell.” * “Data were log-transformed to correct for heterogeneity of the variances where necessary.” * “Subsequently, the kanamycin resistance cassette from pVK4 was cloned into the PstI site of pMP3, and the resulting plasmid pMP4 was used to target a disruption in the cesF region of EHEC strain 85-170.” VF, Predicted Non-VF: * “Here, it is reported that the pO157-encoded Type V-secreted serine protease EspP influences the intestinal colonization of calves.” * “Here, we report that intragastric inoculation of a Shiga toxin 2 (Stx2)-producing E. coli O157:H7 clinical isolate into infant rabbits led to severe diarrhea and intestinal inflammation but no signs of HUS.” * “The DsbLI system also comprises a functional redox pair”
  27. 27. Adding additional examples is not likely to substantially improve results, as seen by the error curve (learning-curve plot of training error vs. validation error as the number of training examples grows to 10,000).
  28. 28. 8 alternative algorithms; select the 10,000 most important features using chi-square (see the sketch below)
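A sketch of chi-square feature selection in front of several of the alternative classifiers listed on slide 19, assuming scikit-learn implementations; the 10,000-feature cutoff follows the slide, and the commented-out training data is a placeholder.

    # Chi-square feature selection (keep the 10,000 highest-scoring features)
    # ahead of several alternative classifiers.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import RidgeClassifier, SGDClassifier, Perceptron
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.ensemble import RandomForestClassifier

    classifiers = {
        "ridge": RidgeClassifier(),
        "sgd": SGDClassifier(),
        "perceptron": Perceptron(),
        "naive_bayes": MultinomialNB(),
        "random_forest": RandomForestClassifier(),
    }

    def make_pipeline(clf, k=10000):
        return Pipeline([
            ("tfidf", TfidfVectorizer()),
            ("chi2", SelectKBest(chi2, k=k)),   # 10,000 most informative features
            ("clf", clf),
        ])

    # Placeholder usage; train_texts / train_labels are hypothetical names:
    # for name, clf in classifiers.items():
    #     make_pipeline(clf).fit(train_texts, train_labels)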
  29. 29. * Increase quantity of data (not always helpful; see error curves) * Improve quality of data * Utilize multiple supervised algorithms, ensemble and non-ensemble * Use unlabeled data and semi-supervised techniques * Feature Selection * Parameter Tuning * Feature Engineering * Given: * High quality data in sufficient quantity * State of the art machine learning algorithms * How to improve results: Change Representation? *
  30. 30. *TF-IDF *Loss of syntactic and semantic information *No relation between term index and meaning *No support for disambiguation *Feature engineering extends the vector representation or substitutes specific terms for more general ones – a crude way to capture semantic properties * Ideal Representation ◦ Captures semantic similarity of words ◦ Does not require feature engineering ◦ Minimal pre-processing, e.g. no mapping to ontologies ◦ Improves precision and recall
  31. 31. *
  32. 32. * *Dense vector representation (n = 50 … 300 or more) *Captures semantics – similar words are close by the cosine measure *Captures language features *Syntactic relations *Semantic relations
  33. 33. * [0.160610 -0.547976 -0.444522 -0.037896 0.044305 0.245423 -0.261498 0.000294 -0.275621 -0.021201 -0.432955 0.388905 0.106494 0.405797 -0.159357 -0.073897 0.177182 0.043535 0.600987 0.064762 -0.348964 0.189289 0.650318 0.112554 0.374456 -0.227780 0.208623 0.065362 0.235401 -0.118003 0.032858 -0.309767 0.024085 -0.055148 0.158807 0.171749 -0.153825 0.090301 0.033275 0.089936 0.187864 -0.044472 0.421533 0.209217 -0.142092 0.153070 -0.168291 -0.052823 -0.090984 0.018695 -0.265503 -0.055572 -0.212252 -0.326411 -0.083590 -0.009575 -0.125065 0.376738 0.059734 -0.005585 -0.085654 0.111499 -0.099688 0.147020 -0.419087 -0.042069 -0.241274 0.154339 -0.008625 -0.298928 0.060612 0.216670 -0.080013 -0.218985 -0.805539 0.298797 0.089364 0.071044 0.390878 0.167600 -0.101478 -0.017312 -0.260500 0.392749 0.184021 -0.258466 -0.222133 0.357018 -0.244508 0.221385 -0.012634 -0.073752 -0.409362 0.113296 0.048397 0.000424 0.146018 -0.060891 -0.139045 -0.180432 0.014984 0.023384 -0.032300 -0.161608 -0.188434 0.018036 0.023236 0.060335 -0.173066 0.053327 0.523037 -0.330135 -0.014888 -0.124564 0.046332 -0.124301 0.029865 0.144504 0.163142 -0.018653 -0.140519 0.060562 0.098858 -0.128970 0.762193 -0.230067 -0.226374 0.100086 0.367147 0.160035 0.148644 -0.087583 0.248333 -0.033163 -0.312134 0.162414 0.047267 0.383573 -0.271765 -0.019852 -0.033213 0.340789 0.151498 -0.195642 -0.105429 -0.172337 0.115681 0.033890 -0.026444 -0.048083 -0.039565 -0.159685 -0.211830 0.191293 0.049531 -0.008248 0.119094 0.091608 -0.077601 -0.050206 0.147080 -0.217278 -0.039298 -0.303386 0.543094 -0.198962 -0.122825 -0.135449 0.190148 0.262060 0.146498 -0.236863 0.140620 0.128250 -0.157921 -0.119241 0.059280 -0.003679 0.091986 0.105117 0.117597 -0.187521 -0.388895 0.166485 0.149918 0.066284 0.210502 0.484910 0.396106 -0.118060 -0.076609 -0.326138 -0.305618 -0.297695 -0.078404 -0.210814 0.423335 -0.377239 -0.323599 0.282586] immune_system
  34. 34. *Large volume of data *Billions of words in context *Multiple passes over data *Algorithms *Word2Vec *CBOW *Skip-gram *GloVe *Linguistic terms with similar distributions have similar meaning (see the training sketch below) * T. Mikolov, et al. “Efficient Estimation of Word Representations in Vector Space.” 2013. http://arxiv.org/pdf/1301.3781.pdf
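A minimal sketch of training word2vec embeddings with gensim (parameter names follow the gensim 4.x API); the toy corpus, vector size, and window are illustrative assumptions.

    # Training word2vec embeddings with gensim; in practice the corpus would be
    # billions of words of domain text, not three sentences.
    from gensim.models import Word2Vec

    sentences = [
        ["salmonella", "is", "a", "genus", "of", "proteobacteria"],
        ["staphylococcus", "colonizes", "the", "host", "cell"],
        ["the", "esp8", "gene", "is", "a", "known", "virulence", "factor"],
    ]

    model = Word2Vec(
        sentences,
        vector_size=300,   # dense vectors, typically 50-300 dimensions
        window=5,          # context window around each target word
        sg=1,              # 1 = skip-gram, 0 = CBOW
        min_count=1,
        epochs=5,          # multiple passes over the data
    )

    print(model.wv["salmonella"][:10])                         # first 10 dimensions
    print(model.wv.similarity("salmonella", "staphylococcus"))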
  35. 35. * Image: https://drive.google.com/file/d/0B7XkCwpI5KDYRWRnd1RzWXQ2TWc
  36. 36. * Image: https://drive.google.com/file/d/0B7XkCwpI5KDYRWRnd1RzWXQ2TWc
  37. 37. *
  38. 38. *
  39. 39. *
  40. 40. *
  41. 41. * Heart : Cardiovascular as Kidney : ?
  42. 42. * Salmonella : Proteobacteria as Staphylococcus : ?
  43. 43. * Salmonella : Enterobacteriaceae as Staphylococcus : Staphylococcaceae
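A sketch of how such analogy queries are answered with vector arithmetic over the embeddings. It assumes `model` is a gensim word2vec model (as in the earlier training sketch) trained on a corpus large enough that these taxonomy terms are in its vocabulary; the toy corpus above is not.

    # Analogy query: Salmonella : Enterobacteriaceae as Staphylococcus : ?
    # Implemented as vector arithmetic plus cosine similarity.
    result = model.wv.most_similar(
        positive=["enterobacteriaceae", "staphylococcus"],
        negative=["salmonella"],
        topn=3,
    )
    print(result)  # with suitable training data, "staphylococcaceae" should rank highly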
  44. 44. *
  45. 45. * Image: http://u.cs.biu.ac.il/~yogo/nnlp.pdf
  46. 46. * https://en.wikibooks.org/wiki/Artificial_Neural_Networks/Activation_Functions
  47. 47. * * Non-linear Activation Function *Sigmoid *Hyperbolic tangent (tanh) *Rectifier (ReLU) * Word embeddings * Window size * Loss function *Binary *Multiclass *Cross-entropy (see the network sketch below)
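A minimal sketch of a convolutional sentence classifier wiring together these design choices (word embeddings, a convolution window, ReLU activation, binary cross-entropy loss). It is written against the modern Keras API rather than the Theano-era libraries listed later, and all sizes are illustrative assumptions.

    # Kim (2014)-style convolutional sentence classifier, minimal sketch.
    from tensorflow.keras import layers, models

    vocab_size = 20000   # assumed vocabulary size
    embed_dim = 300      # word-embedding dimensionality (e.g. word2vec vectors)

    model = models.Sequential([
        layers.Embedding(vocab_size, embed_dim),                        # word embeddings
        layers.Conv1D(filters=100, kernel_size=5, activation="relu"),   # window size 5, ReLU
        layers.GlobalMaxPooling1D(),                                    # max-over-time pooling
        layers.Dense(1, activation="sigmoid"),                          # binary decision
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])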
  48. 48. * Images: http://u.cs.biu.ac.il/~yogo/nnlp.pdf; http://blog.datumbox.com/tuning- the-learning-rate-in-gradient-descent/
  49. 49. * Image: https://aclweb.org/anthology/P/P14/P14-2105.xhtml
  50. 50. *
  51. 51. *
  52. 52. * Image: http://greg.org/archive/2010/07/05/the_planck_all-sky_survey.html
  53. 53. * http://riotwire.com/column/immigrants-socialists-and-semantics-oh-my/
  54. 54. *
  55. 55. * * Word2Vec – command-line tool * Gensim – Python topic modeling tool with a word2vec module * GloVe (Global Vectors for Word Representation) – command-line tool (see the loading sketch below)
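A sketch of loading pre-trained embeddings through gensim's word2vec module; the file name is a placeholder for whatever word2vec-format vectors are available (the GoogleNews binary is one widely distributed example).

    # Loading pre-trained word2vec-format vectors with gensim.
    from gensim.models import KeyedVectors

    # Placeholder path; any word2vec-format file works (binary or text).
    wv = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True
    )
    print(wv.most_similar("kidney", topn=5))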
  56. 56. * * Theano: Python CPU/GPU symbolic expression compiler * Torch: scientific computing framework for LuaJIT * PyLearn2: Python deep learning platform * Lasagne: lightweight framework on Theano * Keras: Python library for working with Theano * DeepDist: deep learning on Spark * Deeplearning4J: Java and Scala, integrated with Hadoop and Spark
  57. 57. * *Deep Learning Bibliography – http://memkite.com/deep-learning-bibliography/ * Deep Learning Reading List – http://deeplearning.net/reading-list/ * Kim, Yoon. “Convolutional neural networks for sentence classification.” arXiv preprint arXiv:1408.5882 (2014). * Goldberg, Yoav. “A Primer on Neural Network Models for Natural Language Processing.” http://u.cs.biu.ac.il/~yogo/nnlp.pdf
  58. 58. *
