Michael Alcorn, Sr. Software Engineer, Red Hat Inc. at MLconf SF 2017

Representation Learning @ Red Hat:
For many companies, the vast majority of their data is unstructured and unlabeled; however, that data often contains information that could be useful in a variety of scenarios. Representation learning is the process of extracting meaningful features from unlabeled data so that they can be used in other tasks. In this talk, you'll hear about how Red Hat is using deep learning to discover meaningful entity representations in a number of different settings, including: (1) identifying duplicate documents on the Customer Portal, (2) finding contextually similar URLs with word2vec, and (3) clustering behaviorally similar customers with doc2vec. To close, we will walk through an example demonstrating how representation learning can be applied to Major League Baseball players.

Bio: Michael first developed his data-crunching chops as an undergraduate at Auburn University (War Eagle!), where he used a number of different statistical techniques to investigate various aspects of salamander biology (work that led to several publications). He then went on to earn an M.S. in evolutionary biology from The University of Chicago (where he wrote a thesis on frog ecomorphology) before changing directions and earning a second M.S. in computer science (with a focus on intelligent systems) from The University of Texas at Dallas. As a Machine Learning Engineer – Information Retrieval at Red Hat, Michael is constantly looking for ways to use the latest and greatest machine learning technology to improve search.

  1. REPRESENTATION LEARNING @ RED HAT. Michael A. Alcorn (malcorn@redhat.com), Machine Learning Engineer - Information Retrieval. https://sites.google.com/view/michaelaalcorn/
  2. Outline: Background; word2vec/url2vec; doc2vec/account2vec; Duplicate Detection; (batter|pitcher)2vec; MLconf Blog
  3. Background. Why? Small amount (zero?) of labeled data for the task, but lots of unlabeled data (or labeled data for a different task?). Can we use large amounts of unlabeled data to make better predictions? Not the same as traditional unsupervised learning! See the excellent chapter on representation learning in Goodfellow et al.'s Deep Learning textbook and the article on transfer learning by Bengio et al.
  4. word2vec (figure from NVIDIA - "Introduction to Neural Machine Translation with GPUs (Part 2)")
  5. word2vec (figure from Deeplearning4j - "Word2vec"; Mikolov et al. (2013))
  6. word2vec analogies: "x is to y as ? is to z", solved as ? = x - y + z. Examples: bash - shellshock + heartbleed = openssl; firefox - linux + windows = internet_explorer; openshift - cloud + storage = gluster; rhn_register - rhn + rhsm = subscription-manager. (See the sketch below.)
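
A minimal sketch of how such an analogy query can be run with gensim's word2vec implementation; the model path and the presence of these terms in the vocabulary are assumptions for illustration:

```python
# Analogy arithmetic over word2vec embeddings with gensim.
# The model file and vocabulary terms here are hypothetical.
from gensim.models import Word2Vec

model = Word2Vec.load("redhat_word2vec.model")  # hypothetical path

# "bash is to shellshock as ? is to heartbleed" -> ? = bash - shellshock + heartbleed
print(model.wv.most_similar(positive=["bash", "heartbleed"],
                            negative=["shellshock"],
                            topn=1))
# e.g., [('openssl', 0.87)]
```
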
  7. Naming Colors. Blog post by Janelle Shane on mapping RGB values to color names; the results are pretty underwhelming for those in the know. Can word embeddings improve things (GitHub)? (See the sketch below.)
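
One plausible way to bring word embeddings into color naming (a sketch under assumed data, not the approach from the linked repo): regress from RGB values into a pretrained word-embedding space, then name a color by its nearest word vectors.

```python
# Sketch: predict a color name's embedding from RGB, then name new
# colors by nearest-neighbor lookup in the word-vector space.
# The vectors file and training pairs below are hypothetical.
import numpy as np
from gensim.models import KeyedVectors
from sklearn.neural_network import MLPRegressor

word_vectors = KeyedVectors.load("word_vectors.kv")  # pretrained embeddings
rgb = np.array([[255, 0, 0], [0, 0, 255], [0, 128, 0]]) / 255.0
names = ["red", "blue", "green"]

# Fit a small regressor from RGB to the embedding of the color's name.
targets = np.stack([word_vectors[name] for name in names])
regressor = MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000)
regressor.fit(rgb, targets)

# Name an unseen color by the words closest to its predicted embedding.
pred = regressor.predict(np.array([[250, 10, 10]]) / 255.0)[0]
print(word_vectors.similar_by_vector(pred, topn=3))
```
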
  8. url2vec. Tasks concerning URLs: search (returning relevant content) and troubleshooting (recommending related articles). Obvious method: look at the text. Alternative/enhanced method: use customer browsing behavior as additional contextual clues.
  9. url2vec. How? Treat each day of browsing activity as a "sentence", treat each URL as a "word", and run word2vec! (See the sketch below.)
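
A minimal sketch of that recipe with gensim; the session data is made up, and in practice the "sentences" would come from Customer Portal access logs:

```python
# url2vec: each customer-day of browsing is a "sentence" whose "words"
# are URLs; word2vec then embeds URLs by their browsing context.
from gensim.models import Word2Vec

# Hypothetical sessions: one list of visited URLs per customer-day.
daily_sessions = [
    ["/solutions/25190", "/solutions/10107", "/articles/1234"],
    ["/solutions/10107", "/solutions/25190"],
]

model = Word2Vec(
    sentences=daily_sessions,
    vector_size=100,  # embedding dimensionality
    window=5,         # context window, measured in URLs
    min_count=1,      # keep rare URLs in this toy example
    sg=1,             # skip-gram
)

# URLs browsed in similar contexts end up with nearby vectors.
print(model.wv.most_similar("/solutions/25190"))
```
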
  10. url2vec examples: https://access.redhat.com/solutions/25190 and https://access.redhat.com/solutions/10107. Application: ScatterPlot3D.
  11. doc2vec. Le and Mikolov (2014); "NLP 05: From Word2vec to Doc2vec: a simple example with Gensim".
  12. customer2vec. Why? Data-driven segmentation. Same idea as url2vec, except now we treat each account as a "document" of many "sentences" (different browsing days). (See the sketch below.)
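
A minimal account2vec sketch with gensim's doc2vec, assuming hypothetical data; each account's browsing days are concatenated into one tagged "document" for simplicity:

```python
# customer2vec: each account is a tagged "document" whose words are the
# URLs from all of its browsing days.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical account histories.
account_histories = {
    "account_1": ["/solutions/25190", "/solutions/10107"],
    "account_2": ["/articles/1234", "/solutions/25190"],
}

documents = [
    TaggedDocument(words=urls, tags=[account_id])
    for account_id, urls in account_histories.items()
]

model = Doc2Vec(documents, vector_size=100, min_count=1, epochs=40)

# Behaviorally similar accounts get nearby vectors, ready for
# data-driven segmentation (e.g., clustering).
print(model.dv.most_similar("account_1"))
```
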
  13. customer2vec (figure)
  14. Duplicate Detection. There are a number of "duplicate" KCS solutions on the Customer Portal, which muddy search results. How can we identify candidate duplicate documents? Obvious approach: compare text (e.g., tf-idf), but bag-of-words loses any structural meaning behind the text. Can we learn better representations? The title is essentially a summary of the solution content, so learn representations of the body that are similar to the title representations (like the DSSM; my code).
  15. Deep Semantic Similarity Model. Jianfeng Gao - "Deep Learning for Web Search and Natural Language Processing". (See the sketch below.)
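
A DSSM-flavored sketch of the title/body idea in PyTorch (an illustrative reconstruction, not the linked code): encode titles and bodies into a shared space and train so each body's vector is most similar to its own title's, using the rest of the batch as negatives.

```python
# Twin-tower encoders trained with a cosine-similarity softmax over
# in-batch negatives, so a body lands near its own title.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps a bag-of-words vector to a unit-length semantic vector."""
    def __init__(self, vocab_size, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vocab_size, 300), nn.Tanh(),
            nn.Linear(300, dim), nn.Tanh(),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit norm -> dot = cosine

vocab_size, batch = 5000, 32
title_enc, body_enc = Encoder(vocab_size), Encoder(vocab_size)
optimizer = torch.optim.Adam(
    list(title_enc.parameters()) + list(body_enc.parameters())
)

# Hypothetical batch of bag-of-words vectors; row i of `bodies`
# belongs to the same solution as row i of `titles`.
titles = torch.rand(batch, vocab_size)
bodies = torch.rand(batch, vocab_size)

scores = body_enc(bodies) @ title_enc(titles).T      # cosine similarities
loss = F.cross_entropy(scores, torch.arange(batch))  # match body i to title i
loss.backward()
optimizer.step()
```
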
  16. (batter|pitcher)2vec (GitHub). Can we learn meaningful representations of MLB players? Accurate representations could be used to simulate games and inform trades, e.g., to find undervalued/overvalued players.
  17. (batter|pitcher)2vec (GitHub): player analogy examples (figure; player images from SI.com and NBCSports.com). (See the sketch below.)
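
A sketch of the core (batter|pitcher)2vec idea (an illustrative reconstruction, not the repo's code): embed each batter and each pitcher, and learn the embeddings by predicting at-bat outcomes from the pair.

```python
# Batter and pitcher embeddings trained to predict at-bat outcomes.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_batters, n_pitchers, n_outcomes, dim = 1000, 500, 20, 32

class AtBatModel(nn.Module):
    """Predicts an at-bat outcome from batter and pitcher embeddings."""
    def __init__(self):
        super().__init__()
        self.batters = nn.Embedding(n_batters, dim)
        self.pitchers = nn.Embedding(n_pitchers, dim)
        self.out = nn.Linear(2 * dim, n_outcomes)

    def forward(self, batter_ids, pitcher_ids):
        pair = torch.cat(
            [self.batters(batter_ids), self.pitchers(pitcher_ids)], dim=-1
        )
        return self.out(pair)  # logits over outcome classes

model = AtBatModel()
# Hypothetical batch of at-bats: (batter, pitcher, outcome) triples.
batter_ids = torch.randint(0, n_batters, (64,))
pitcher_ids = torch.randint(0, n_pitchers, (64,))
outcomes = torch.randint(0, n_outcomes, (64,))

loss = F.cross_entropy(model(batter_ids, pitcher_ids), outcomes)
loss.backward()  # gradients shape embeddings to reflect player tendencies
```
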
  18. (batter|pitcher)2vec. "Learning to Coach Football", Wang and Zemel (2016).
  19. THANK YOU!
