
Scaling automated quality text generation for enterprise sites

Writing quality content and metadata at scale is a big problem for most enterprise sites. In this webinar we explore what is possible given the latest advances in deep learning and natural language processing. Our main focus is generating metadata: titles, meta descriptions, H1s, and other elements critical for technical SEO performance. We will cover full article generation as well.

Published in: Marketing


  1. @hamletbatista SCALING AUTOMATED QUALITY TEXT GENERATION FOR ENTERPRISE SITES HAMLET BATISTA
  2. LET’S CRAWL A FEW PAGES FROM THIS SITE
  3. WE ARE MISSING KEY META TAGS!
  4. AND SOME PAGES ARE LACKING CONTENT
  5. LET’S FIX THAT WITH AUTOMATION! We want to address four scenarios common in enterprise websites. For large ecommerce sites, we will focus on: 1. Pages with large images and no text. 2. Pages with large images and some text. For large web publishers, we will focus on: 1. Pages with a lot of quality text and no metadata. 2. Pages with very little text.
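As a rough sketch, the triage of crawled pages into these four scenarios can be expressed as a simple decision function. The field names and word-count thresholds below are illustrative assumptions, not values from the webinar:

```python
# Hypothetical triage of crawled pages into the four automation scenarios.
# Thresholds and page attributes are illustrative, not from the webinar.

def classify_page(site_type, word_count, has_large_images, has_metadata):
    """Return which text-generation treatment applies to a crawled page."""
    if site_type == "ecommerce":
        if has_large_images and word_count == 0:
            return "caption images to create text"       # scenario 1
        if has_large_images and word_count < 100:
            return "caption images to enrich thin text"  # scenario 2
    elif site_type == "publisher":
        if word_count >= 300 and not has_metadata:
            return "summarize body text into metadata"   # scenario 3
        if word_count < 300:
            return "generate long-form text"             # scenario 4
    return "no automation needed"

print(classify_page("publisher", 1200, False, False))
# → summarize body text into metadata
```

In practice these flags would come straight from crawl data (word counts, image sizes, presence of title/description tags), so the same crawl that surfaces the problem can route each page to a generator.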
  6. AGENDA We are going to explore different text generation strategies and recommend the best one for each problem. Specifically, we will cover: 1. Image captioning 2. Visual question answering 3. Text summarization 4. Question answering from text (short answers) 5. Long-form question answering 6. Full article generation
  7. AGENDA We are going to build two models from scratch: 1. An image captioning and visual question answering model 2. A state-of-the-art text summarization model At the end, I will share some resources to learn more about these topics.
  8. HOW TO FIND RELEVANT RESEARCH AND CODE Papers with Code
  9. HOW TO FIND RELEVANT RESEARCH AND CODE Papers with Code (SOTA)
  10. TEXT GENERATION FOR ECOMMERCE SITES
  11. IMAGE CAPTIONING AND VISUAL QUESTION ANSWERING Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
  12. THE PYTHIA MODULAR FRAMEWORK Pythia Github
  13. THE PYTHIA MODULAR FRAMEWORK Pythia
  14. IMAGE CAPTIONING AND VISUAL QUESTION ANSWERING RESULTS Pythia
  15. LET’S BUILD A CAPTIONING MODEL! Pythia captioning demo
  16. RUN ALL CELLS Pythia captioning demo
  17. USE THE LAST SECTION TO TEST IMAGES Pythia captioning demo
  18. USE THE LAST SECTION TO CAPTION IMAGES “a group of people in a boat on the water”
  19. THERE IS MORE TEXT HIDDEN IN THE OTHER IMAGES!
  20. TRY THE NEW TITLES FROM CAPTIONS IN CLOUDFLARE FIRST Cloudflare App
  21. YOU CAN ALSO ASK QUESTIONS ABOUT IMAGES “how many people?” “4 with 62.9 confidence”
  22. YOU CAN ALSO ASK QUESTIONS ABOUT IMAGES “what are these people riding?” “boat with 99.96 confidence”
  23. CAPTIONING AND VISUAL QUESTION ANSWERING PAPER Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
  24. CAPTIONING AND VISUAL QUESTION ANSWERING RESULTS Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
  25. WHERE DID I LEARN ABOUT THIS? Advanced Machine Learning Specialization
  26. WHERE DID I LEARN ABOUT THIS? Introduction to Deep Learning
  27. TEXT GENERATION FOR WEB PUBLISHERS
  28. AI TEXT GENERATOR: TALKTOTRANSFORMER.COM Talk to Transformer
  29. TEXT SUMMARIZATION Papers with Code (Text Summarization)
  30. TEXT SUMMARIZATION PAPER (EXTRACTIVE) Papers with Code (Extractive Text Summarization)
  31. TEXT SUMMARIZATION RESULTS (EXTRACTIVE) Fine-tune BERT for Extractive Summarization
  32. TEXT SUMMARIZATION PAPER (ABSTRACTIVE) Papers with Code (Abstractive Text Summarization)
  33. TEXT SUMMARIZATION RESULTS (ABSTRACTIVE) MASS: Masked Sequence to Sequence Pre-training for Language Generation
  34. LET’S BUILD AN EXTRACTIVE TEXT SUMMARIZATION MODEL! https://github.com/nlpyang/BertSum
  35. LET’S BUILD AN EXTRACTIVE TEXT SUMMARIZATION MODEL! https://github.com/nlpyang/BertSum
  36. BERTSUM MODEL OVERVIEW "Meanwhile, although BERT has segmentation embeddings for indicating different sentences, it only has two labels (sentence A or sentence B), instead of multiple sentences as in extractive summarization. Therefore, we modify the input sequence and embeddings of BERT to make it possible […]"
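The modification the quote describes can be sketched in a few lines: BertSum inserts a [CLS] token before every sentence and assigns "interval" segment ids that alternate per sentence, so sentence boundaries survive into the encoder. This is a simplified stdlib illustration, not the repository's actual preprocessing code:

```python
# Simplified sketch of BertSum-style input preparation: insert a [CLS]
# token before each sentence and alternate segment ids (0/1) per sentence,
# instead of BERT's usual single sentence-A/sentence-B split.

def bertsum_inputs(sentences):
    tokens, segment_ids = [], []
    for i, sent in enumerate(sentences):
        sent_tokens = ["[CLS]"] + sent.split() + ["[SEP]"]
        tokens.extend(sent_tokens)
        segment_ids.extend([i % 2] * len(sent_tokens))  # alternate E_A / E_B
    return tokens, segment_ids

toks, segs = bertsum_inputs(["the cat sat", "it purred"])
print(toks)  # ['[CLS]', 'the', 'cat', 'sat', '[SEP]', '[CLS]', 'it', 'purred', '[SEP]']
print(segs)  # [0, 0, 0, 0, 0, 1, 1, 1, 1]
```

Each per-sentence [CLS] vector is then what the summarization layers score to decide whether that sentence belongs in the extract.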
  37. BERTSUM DOWNLOAD AND SETUP BERTSUM Colab notebook
  38. BERTSUM TRAINING BERT+TRANSFORMER MODEL BERTSUM Colab notebook
  39. BERTSUM TRAINING BERT+TRANSFORMER MODEL BERTSUM Colab notebook
      # first run
      # Change -visible_gpus 0,1,2 -gpu_ranks 0,1,2 -world_size 3 to -visible_gpus 0 -gpu_ranks 0 -world_size 1;
      # after downloading, you can kill the process and rerun the code with multiple GPUs.
      # BERT+Transformer model
      !python train.py -mode train -encoder transformer -dropout 0.1 -bert_data_path ../bert_data/cnndm -model_path ../models/bert_transformer -lr 2e-3 -visible_gpus 0 -gpu_ranks 0 -world_size 1 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -decay_method noam -train_steps 50000 -accum_count 2 -log_file ../logs/bert_transformer -use_interval true -warmup_steps 10000 -ff_size 2048
  40. BERTSUM TRAINING BERT+TRANSFORMER MODEL We are simply following the instructions in the GitHub repository.
  41. BERTSUM TRAINING BERT+TRANSFORMER MODEL 1. Training takes two days on Colab (with interruptions) 2. Saving progress and resuming is critical
  42. BERTSUM TRAINING BERT+TRANSFORMER MODEL BERTSUM Colab notebook
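Because Colab sessions get interrupted, the save-and-resume point above is worth spelling out. BertSum itself handles this with -save_checkpoint_steps and -train_from; the loop below is only a generic, stdlib illustration of the same pattern, with a hypothetical checkpoint file:

```python
# Generic checkpoint/resume pattern (illustrative; BertSum does this via
# -save_checkpoint_steps and -train_from). File name and state are hypothetical.
import json
import os

CKPT = "checkpoint.json"

def save_checkpoint(step, state, path=CKPT):
    with open(path, "w") as f:
        json.dump({"step": step, "state": state}, f)

def load_checkpoint(path=CKPT):
    if os.path.exists(path):
        with open(path) as f:
            ckpt = json.load(f)
        return ckpt["step"], ckpt["state"]
    return 0, {}  # fresh start when no checkpoint exists

start_step, state = load_checkpoint()
for step in range(start_step, start_step + 3):
    state["loss"] = 1.0 / (step + 1)  # stand-in for a real training step
    save_checkpoint(step + 1, state)  # next run resumes from here
```

Writing the checkpoint to mounted Google Drive (as the resume command on the next slide does) is what lets training survive a disconnected Colab runtime.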
  43. BERTSUM RESUMING TRAINING BERTSUM Colab notebook
      # resume run
      # Change -visible_gpus 0,1,2 -gpu_ranks 0,1,2 -world_size 3 to -visible_gpus 0 -gpu_ranks 0 -world_size 1;
      # after downloading, you can kill the process and rerun the code with multiple GPUs.
      # BERT+Transformer model (Drive paths contain spaces, so they are quoted)
      !python train.py -mode train -encoder transformer -dropout 0.1 -train_from "../../drive/My Drive/Presentations/DeepCrawl Webinar/models/bert_transformer/model_step_49000.pt" -bert_data_path ../bert_data/cnndm -model_path "../../drive/My Drive/Presentations/DeepCrawl Webinar/models/bert_transformer" -lr 2e-3 -visible_gpus 0 -gpu_ranks 0 -world_size 1 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -decay_method noam -train_steps 50000 -accum_count 2 -log_file "../../drive/My Drive/Presentations/DeepCrawl Webinar/logs/bert_transformer" -use_interval true -warmup_steps 10000 -ff_size 2048
  44. SIMPLER: JUST GET A TRAINED MODEL FROM THE INVENTOR! BERTSUM Colab notebook
  45. BERTSUM TESTING BERT+TRANSFORMER MODEL BERTSUM Colab notebook
  46. BERTSUM TESTING RESULTS BERTSUM Colab notebook Gold summary: "click on the brilliant interactive graphic below for details on each hole of the masters 2015 course" Candidate summary after 50,000 training steps: "click on the graphic below to get a closer look at what the biggest names in the game will face when they tee off on thursday ."
  47. TEXT GENERATION FOR WEB PUBLISHERS
  48. QUESTION ANSWERING Papers with Code (Question Answering)
  49. QUESTION ANSWERING PAPER: XLNET Papers with Code (Question Answering)
  50. QUESTION ANSWERING RESULTS: XLNET XLNet: Generalized Autoregressive Pretraining for Language Understanding
  51. XLNET CODE https://github.com/zihangdai/xlnet
  52. LONG-FORM QUESTION ANSWERING Introducing long-form question answering
  53. LONG-FORM QUESTION ANSWERING Subreddit: Explain Like I'm Five
  54. LONG-FORM QUESTION ANSWERING Scripts and links to recreate the ELI5 dataset
  55. LONG-FORM QUESTION ANSWERING BASELINE
  56. FINALLY, LET’S GO FOR SOMETHING MORE AMBITIOUS Generating Wikipedia by Summarizing Long Sequences
  57. GENERATING WIKIPEDIA BY SUMMARIZING LONG SEQUENCES Generating Wikipedia by Summarizing Long Sequences
  58. GENERATING WIKIPEDIA BY SUMMARIZING LONG SEQUENCES CONCLUSION “We have shown that generating Wikipedia can be approached as a multi-document summarization problem with a large, parallel dataset, and demonstrated a two-stage extractive-abstractive framework for carrying it out. The coarse extraction method used in the first stage appears to have a significant effect on final performance, suggesting further research on improving it would be fruitful. We introduce a new, decoder-only sequence transduction model for the abstractive stage, capable of handling very long input-output examples. This model significantly outperforms traditional encoder/decoder architectures on long sequences, allowing us to condition on many reference documents and to generate coherent and informative Wikipedia articles.” Generating Wikipedia by Summarizing Long Sequences
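The coarse extractive first stage the conclusion mentions can be approximated very simply: rank source paragraphs against the target title and keep only the top few as input to the abstractive model. The paper evaluates several extractors; the tf-idf-style ranker below is just one illustrative, stdlib-only sketch:

```python
# Sketch of a coarse extractive stage: rank source paragraphs by tf-idf
# overlap with the article title and keep the top-k. Illustrative only;
# the paper compares several extraction methods.
import math
from collections import Counter

def rank_paragraphs(title, paragraphs, k=2):
    docs = [p.lower().split() for p in paragraphs]
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequency

    def score(doc):
        tf = Counter(doc)
        return sum(tf[w] * math.log(n / df[w])
                   for w in title.lower().split() if w in tf)

    ranked = sorted(paragraphs,
                    key=lambda p: score(p.lower().split()),
                    reverse=True)
    return ranked[:k]
```

The surviving paragraphs are then concatenated and fed to the decoder-only abstractive model, which is what lets the pipeline condition on many reference documents without attending over all of them.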
  59. CAN WE HAVE THE SOURCE CODE? YES! Github link
  60. HERE ARE SOME TRAINING COST ESTIMATES Github link
  61. RESOURCES TO LEARN MORE
      Faster Data Science Education: https://www.kaggle.com/learn/overview
      Data Scientist’s Guide to Summarization: https://towardsdatascience.com/data-scientists-guide-to-summarization-fc0db952e363
      An open source neural machine translation system: http://opennmt.net/
      Bottom-Up Abstractive Summarization: http://opennmt.net/OpenNMT-py/Summarization.html
      Abstractive Text Summarization (Tutorial 2), Text Representation Made Very Easy: https://hackernoon.com/abstractive-text-summarization-tutorial-2-text-representation-made-very-easy-ef4511a1a46
  62. RESOURCES TO LEARN MORE
      Build an Abstractive Text Summarizer in 94 Lines of Tensorflow (Tutorial 6): https://hackernoon.com/build-an-abstractive-text-summarizer-in-94-lines-of-tensorflow-tutorial-6-f0e1b4d88b55
      What Is ROUGE And How It Works For Evaluation Of Summarization Tasks?: https://rxnlp.com/how-rouge-works-for-evaluation-of-summarization-tasks/
      Introducing Eli5: How Facebook is Tackling Long-Form Question-Answering Conversations: https://towardsdatascience.com/introducing-eli5-how-facebook-is-tackling-long-form-question-answering-conversations-4f8e59374717
      Pythia’s Documentation: https://learnpythia.readthedocs.io/en/latest/
