
Generating Qualitative Content with GPT-2 in All Languages


Vincent Terrasi's slides from TechSEO Boost 2019



  1. #TechSEOBoost | @CatalystSEM | THANK YOU TO OUR SPONSORS. Generating Qualitative Content with GPT-2 in All Languages. Vincent Terrasi, OnCrawl
  2. Generating Qualitative Content in All Languages
  3. SEO use cases:
     • Image captioning with Pythia
     • Visual question answering
     • Abstractive summarization with BERTSum
     • Full article generation with GPT-2
  4. Text spinners are bad
  5. Google, what is bad generated content in 2016?
     • Text translated by an automated tool without human review or curation before publishing
     • Text generated through automated processes, such as Markov chains
     • Text generated using automated synonymizing or obfuscation techniques
     • Text generated from scraping Atom/RSS feeds or search results
     • Stitching or combining content from different web pages without adding sufficient value
     https://web.archive.org/web/20160222004700/https://support.google.com/webmasters/answer/2721306?hl=en
  6. Google, what is bad generated content in 2019?
     • Text that makes no sense to the reader but which may contain search keywords
     • Text translated by an automated tool without human review or curation before publishing
     • Text generated through automated processes, such as Markov chains
     • Text generated using automated synonymizing or obfuscation techniques
     • Text generated from scraping Atom/RSS feeds or search results
     • Stitching or combining content from different web pages without adding sufficient value
     https://support.google.com/webmasters/answer/2721306?hl=en
  7. Surprise!
  8. 2019, the best year for using AI for text generation
  9.-10. [Diagram: GPT-2, BERT, ELMo, ULM-FiT (J. Howard)]
  11. Transformer and Attention Model
  12.-17. Patterns for the attention model (built up one pattern per slide):
     Pattern 1: Attention to the next word
     Pattern 2: Attention to the previous word
     Pattern 3: Attention to identical/related words
     Pattern 4: Attention to identical/related words in the other sentence
     Pattern 5: Attention to other words predictive of the word
     Pattern 6: Attention to delimiter tokens
  18. State of the art:
     • All models exist for English
     • Documentation is good
     • So we just need to translate
  19. There are a lot of biases:
     • Small talk
     • Idioms
     • Local named entities
     • Rare verbs
     • Uncommon tenses
     • Gender rules
  20. How to scale? Create your own model in your language.
  21. Objectives: use only qualitative methods to improve the quality of content created by humans, and extract the knowledge learned by the deep learning model.
  22. Why have other attempts failed?
     • Quantitative: you need a lot of data, more than 100,000 texts of at least 500 words each
     • Qualitative: you need high-quality texts
  23. GPT-2 Recipe
  24. Step 1: Training the model. Training without a pretrained model requires significant computing power: you need GPUs! It took 3 days to get a first result with one GPU.
  25. Step 2: Generating the compressed training dataset (1/2). GPT-2 learns from text in the Byte Pair Encoding (BPE) format, a simple form of data compression. Why?
     • Predicting the next character is too imprecise
     • Predicting the next word is too precise and takes a lot of computing power
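To make the BPE idea concrete, here is a toy illustration of a single merge step (my own sketch, not GPT-2's actual implementation): the most frequent adjacent pair of symbols is fused into a new symbol, yielding units somewhere between characters and words.

```python
# Toy BPE merge step: fuse the most frequent adjacent pair into one symbol.
from collections import Counter

tokens = list("abababc")                  # start from raw characters
pairs = Counter(zip(tokens, tokens[1:]))  # count adjacent pairs
best = max(pairs, key=pairs.get)          # ('a', 'b') occurs most often

merged = []
i = 0
while i < len(tokens):
    if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
        merged.append(tokens[i] + tokens[i + 1])  # merge "a" + "b" -> "ab"
        i += 2
    else:
        merged.append(tokens[i])
        i += 1

print(merged)  # ['ab', 'ab', 'ab', 'c']
```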
  26. Step 2: Generating the compressed training dataset (2/2). Use SentencePiece to generate the BPE files. Why?
     • It is an unsupervised text tokenizer and detokenizer
     • It is a purely end-to-end system that does not depend on language-specific pre/postprocessing
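A minimal sketch of training a BPE model with SentencePiece and tokenizing text with it; the file names are placeholders, and the vocabulary size is simply kept consistent with the n_vocab value used in the next step.

```python
# Minimal SentencePiece BPE training sketch; file names are placeholders.
import sentencepiece as spm

# Train a BPE model on a plain-text corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="corpus_fr.txt",   # hypothetical training corpus
    model_prefix="bpe_fr",   # writes bpe_fr.model and bpe_fr.vocab
    vocab_size=50257,        # kept consistent with n_vocab below
    model_type="bpe",
)

# Round-trip a sentence through the learned tokenizer.
sp = spm.SentencePieceProcessor(model_file="bpe_fr.model")
pieces = sp.encode("Le contenu SEO de qualité", out_type=str)
print(pieces)             # BPE pieces
print(sp.decode(pieces))  # original sentence restored
```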
  27.-30. Step 3: Fine-tuning the model (parameters built up one per slide):
     • Vocabulary size: depends on the language (n_vocab: 50257)
     • Embedding size: default value recommended by the OpenAI team (n_embd: 768)
     • Attention heads: no greater accuracy if you increase this value (n_head: 12)
     • Number of layers: no greater accuracy if you increase this value (n_layer: 12)
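Collected in one place, these are the settings from the slides written as a GPT-2-style hyperparameter dictionary; n_ctx is my assumption (OpenAI's published default), not a value from the talk.

```python
# GPT-2 hyperparameters from the slides, as an hparams-style dict.
hparams = {
    "n_vocab": 50257,  # vocabulary size; depends on the language / BPE model
    "n_ctx": 1024,     # context window; assumed OpenAI default, not from the talk
    "n_embd": 768,     # embedding size; OpenAI's recommended default
    "n_head": 12,      # attention heads; increasing did not improve accuracy
    "n_layer": 12,     # transformer layers; increasing did not improve accuracy
}
```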
  31. Step 4: Generating article text. Once the model has been trained, the gpt-2-gen command is used to generate text. The first parameter is the path to the model; the second is the beginning of the sentence. There are two optional parameters:
     • --tokens-to-generate: number of tokens to generate, default 42
     • --top-k: number of candidate tokens at each step, default 8
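Driving that command from Python might look like the following sketch; the model path and prompt are placeholders, and the flag values shown are just the documented defaults.

```python
# Hypothetical gpt-2-gen invocation from Python; path and prompt are placeholders.
import subprocess

subprocess.run(
    [
        "gpt-2-gen",
        "run-root/",                   # first parameter: path to the trained model
        "Le contenu SEO de qualité",   # second parameter: beginning of the sentence
        "--tokens-to-generate", "42",  # optional: number of tokens to generate
        "--top-k", "8",                # optional: candidate tokens at each step
    ],
    check=True,
)
```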
  32. Results and quality. The output was evaluated subjectively by a native reader. The pyLanguageTool API was used to quantitatively confirm the quality of the results; it did not find any errors in the generated text. https://github.com/Findus23/pyLanguagetool
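A minimal sketch of that check with pyLanguageTool; the sample sentence is illustrative, and the public LanguageTool endpoint is used here for simplicity.

```python
# Check a generated French sentence for grammar and spelling issues.
from pylanguagetool import api

result = api.check(
    "Le contenu généré doit être relu attentivement.",  # hypothetical model output
    api_url="https://api.languagetool.org/v2/",
    lang="fr",
)

# Each match is a suspected error; an empty list means no errors were found.
print(result["matches"])
```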
  33. You can find my Google Colab notebook for French here: https://colab.research.google.com/drive/13Lbk1TYmTjoQFO6qbw_f1TJgoD5ulJwV Warning: it is just an example using limited data. Now it is your turn.
  34. Going further: two parameter regimes, their objectives, and their use cases:
     • top-k < 10, tokens < 10 (high performance): very high-quality content closely related to your original training content. Use cases: anchors for internal linking, title variants, meta variants.
     • top-k > 50, tokens > 400 (low performance): lower-quality content because the model is weak, but it successfully extracts all the concepts that GPT-2 learned from your dataset. Use case: guides that help you write for a given query, with the stated purpose of saving you time.
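These two regimes could be captured as presets like the following; the preset names and exact values are mine, chosen only to satisfy the thresholds above.

```python
# Hypothetical generation presets matching the two regimes from the slide.
PRESETS = {
    "short_and_precise": {"top_k": 8, "tokens_to_generate": 8},      # anchors, titles, metas
    "concept_extraction": {"top_k": 64, "tokens_to_generate": 512},  # writing guides
}
```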
  35. Thank you! vincent@oncrawl.com
  36. Catalyst | @CatalystSEM | #TechSEOBoost. Thanks for viewing the SlideShare! Watch the recording: https://youtube.com/session-example or contact us today to discover how Catalyst can deliver unparalleled SEO results for your business. https://www.catalystdigital.com/
