#TechSEOBoost | @CatalystSEM
THANK YOU TO OUR SPONSORS
Generating Qualitative Content with GPT-2
in All Languages
Vincent Terrasi, OnCrawl
Vincent Terrasi | @vincentterrasi | #TechSEOBoost
SEO Use-cases
• Image captioning with Pythia
• Visual question & Answering
• Abstractive Summarization with BERTsum
• Full Article generation with GPT-2
Text Spinners are bad
Google, What is bad generated content in 2016?
• Text translated by an automated tool without human review or curation before
publishing
• Text generated through automated processes, such as Markov chains
• Text generated using automated synonymizing or obfuscation techniques
• Text generated from scraping Atom/RSS feeds or search results
• Stitching or combining content from different web pages without adding sufficient value
https://web.archive.org/web/20160222004700/https://support.google.com/webmasters/answer/2721306?hl=en
Google, What is bad generated content in 2019?
• Text that makes no sense to the reader but which may contain search keywords.
• Text translated by an automated tool without human review or curation before
publishing
• Text generated through automated processes, such as Markov chains
• Text generated using automated synonymizing or obfuscation techniques
• Text generated from scraping Atom/RSS feeds or search results
• Stitching or combining content from different web pages without adding sufficient value
https://support.google.com/webmasters/answer/2721306?hl=en
Surprise!
2019, the best year for
using AI for text
generation
GPT-2, BERT, ELMo, ULMFiT (J. Howard)
Transformer and Attention Model
Patterns for Attention Model
Pattern 1: Attention to next word
Pattern 2: Attention to previous word
Pattern 3: Attention to identical/related words
Pattern 4: Attention to identical/related words in other sentence
Pattern 5: Attention to other words predictive (next word) of word
Pattern 6: Attention to delimiter tokens
State of the Art
⚫ All models exist for English
⚫ Documentation is good
⚫ So we just need to translate
There are a lot of biases:
◦ Small Talk
◦ Idioms
◦ Local Named Entities
◦ Rarest Verbs
◦ Uncommon Tenses
◦ Gender rules
How to scale?
Create your own model
in your language
Objectives
Use only qualitative methods to improve
the quality of content created by humans
Extract the knowledge learned by the deep
learning model.
Why have other attempts failed?
Quantitative:
You need a lot of data: more than 100,000
texts of at least 500 words each
Qualitative:
You need high-quality texts
GPT-2
Recipe
Step 1: Training the model
This method, which does not start from a pretrained model, requires significant computing power.
You need GPUs! It took 3 days to get my first result with one GPU.
Step 2: Generating the compressed training dataset - 1/2
GPT-2 learns from text encoded with Byte Pair Encoding (BPE), which is a simple form of
data compression.
Why?
- Predicting the next character is too imprecise.
- Predicting the next word is too precise and takes a lot of computing power.
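To make the middle ground between characters and words concrete, here is a minimal sketch of the BPE idea in Python: start from single characters and repeatedly merge the most frequent adjacent pair into one new symbol. The toy corpus, the function names, and the number of merges are illustrative assumptions, not taken from the talk; production tokenizers work on far larger corpora.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-separated toy corpus."""
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


# Toy corpus (hypothetical): after two merges the frequent word "low" is one symbol.
merges, vocab = learn_bpe("low low low lower lowest", num_merges=2)
```

After two merges, the frequent word "low" is a single symbol, while the rarer "lower" and "lowest" remain split into "low" plus individual characters.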
Step 2: Generating the compressed training dataset - 2/2
Use SentencePiece to generate my BPE files.
Why?
- Unsupervised text tokenizer and detokenizer
- Purely end-to-end system that does not depend on language-specific
pre/postprocessing.
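A hedged sketch of what this step could look like with the `sentencepiece` Python package follows. The slides do not show the exact invocation, so the file name `corpus_fr.txt`, the model prefix `bpe_fr`, and the helper names are assumptions for illustration. The trainer is only imported inside the function that uses it, so the options helper can be read and run on its own.

```python
def bpe_training_options(corpus_path, model_prefix, vocab_size):
    """Build keyword arguments for SentencePieceTrainer.train in BPE mode."""
    return {
        "input": corpus_path,          # raw text file, one sentence per line
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,
        "model_type": "bpe",
    }


def train_and_encode(corpus_path, model_prefix, vocab_size, sample_text):
    """Train a BPE model on the corpus, then encode a sample into token ids."""
    import sentencepiece as spm  # third-party package: pip install sentencepiece

    options = bpe_training_options(corpus_path, model_prefix, vocab_size)
    spm.SentencePieceTrainer.train(**options)
    processor = spm.SentencePieceProcessor(model_file=f"{model_prefix}.model")
    return processor.encode(sample_text, out_type=int)


# Hypothetical file names; 50257 is the vocabulary size quoted in the deck.
options = bpe_training_options("corpus_fr.txt", "bpe_fr", 50257)
```

Note that SentencePiece may refuse a vocabulary size that is too large for a small corpus, so a smaller value is likely needed when experimenting with limited data.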
Step 3: Fine-tuning the model
Vocabulary size: depends on the language
- n_vocab:50257
Embedding size: default value recommended by the OpenAI team
- n_embd:768
Size of attention: no greater accuracy if you increase this value
- n_head:12
Number of layers: no greater accuracy if you increase this value
- n_layer:12
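As a rough illustration of what these four values imply, the sketch below collects them in a small Python dataclass, derives the size of each attention head, and estimates the number of trainable parameters for a GPT-2-style decoder. The context length `n_ctx = 1024` and the counting formula (tied input/output embeddings, MLP four times wider than the embedding) are assumptions about the standard GPT-2 architecture, not something stated on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HParams:
    n_vocab: int = 50257  # vocabulary size (from the slide)
    n_embd: int = 768     # embedding size (from the slide)
    n_head: int = 12      # attention heads (from the slide)
    n_layer: int = 12     # transformer layers (from the slide)
    n_ctx: int = 1024     # context length: ASSUMPTION, not stated on the slide


def head_size(hp):
    """Each attention head works on an equal slice of the embedding."""
    if hp.n_embd % hp.n_head != 0:
        raise ValueError("n_embd must be divisible by n_head")
    return hp.n_embd // hp.n_head


def parameter_count(hp):
    """Rough count for a GPT-2-style decoder with tied input/output embeddings."""
    d = hp.n_embd
    embeddings = hp.n_vocab * d + hp.n_ctx * d  # token + position embeddings
    per_layer = 12 * d * d + 13 * d  # attention + 4x-wide MLP + two layer norms
    final_norm = 2 * d
    return embeddings + hp.n_layer * per_layer + final_norm


hp = HParams()
print(head_size(hp))        # 64
print(parameter_count(hp))  # 124439808
```

Under these assumptions, `n_head` does not appear in the parameter count: adding heads only slices the same embedding more finely. The vocabulary size, by contrast, directly scales the embedding table, which is one reason it matters when adapting the model to another language.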
Step 4: Generating article text
Once the model has been trained, the gpt-2-gen command is used to generate a text.
The first parameter is the path to the model.
The second is the beginning of the sentence.
Then there are two optional parameters:
◦ --tokens-to-generate: number of tokens to generate (default: 42)
◦ --top-k: number of candidate tokens considered at each step (default: 8)
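To show what a top-k setting controls, here is a minimal, generic sketch of top-k sampling in Python. It is not the implementation behind the gpt-2-gen command, which the slides do not show; the function name and the toy scores are assumptions for illustration.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample a token id from the k highest-scoring candidates."""
    # Keep only the ids of the k best-scoring tokens.
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept candidates (shifted by the max for numerical stability).
    peak = max(logits[i] for i in top_ids)
    weights = [math.exp(logits[i] - peak) for i in top_ids]
    return rng.choices(top_ids, weights=weights, k=1)[0]


rng = random.Random(0)
toy_logits = [2.0, 0.5, 1.5, -1.0, 0.0]  # hypothetical scores for a 5-token vocabulary
samples = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(20)]
```

With `k=2`, only token ids 0 and 2 (the two highest scores) can ever be drawn. A small k keeps the output close to the most likely continuation, while a large k admits less likely tokens and produces more varied text.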
Results & Quality
Evaluated subjectively by a native reader.
The pyLanguagetool API was used to check the
results quantitatively; it did not find
any errors in the generated text.
https://github.com/Findus23/pyLanguagetool
You can find my Google Colab notebook
for French here:
https://colab.research.google.com/drive/13Lbk1TYmTjoQFO6qbw_f1TJgoD5ulJwV
Warning: it is just an example using limited
data.
NOW it is your turn.
Going further?
Parameters: top-k < 10, token < 10
Objectives: High performance. Very high-quality content related to your original training content.
Use cases: Anchors for internal linking; variants of titles; variants of meta descriptions.
Parameters: top-k > 50, token > 400
Objectives: Low performance. Low-quality content because the model is weak, but the model successfully extracts all concepts that GPT-2 learnt about your dataset.
Use cases: Guides to help you write, compared to a query, with the stated purpose of saving you time.
Thank You
vincent@oncrawl.com
Catalyst | @CatalystSEM | #TechSEOBoost
Thanks for Viewing the Slideshare!
–
Watch the Recording: https://youtube.com/session-example
Or
Contact us today to discover how Catalyst can deliver unparalleled SEO
results for your business. https://www.catalystdigital.com/

More Related Content

What's hot

What's hot (20)

BERT introduction
BERT introductionBERT introduction
BERT introduction
 
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
 
Nlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniquesNlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniques
 
Slash n near real time indexing
Slash n   near real time indexingSlash n   near real time indexing
Slash n near real time indexing
 
BERT
BERTBERT
BERT
 
Brighton SEO 2021 - A Deep Dive into the Depths of DevTools
Brighton SEO 2021 - A Deep Dive into the Depths of DevToolsBrighton SEO 2021 - A Deep Dive into the Depths of DevTools
Brighton SEO 2021 - A Deep Dive into the Depths of DevTools
 
Introduction to Named Entity Recognition
Introduction to Named Entity RecognitionIntroduction to Named Entity Recognition
Introduction to Named Entity Recognition
 
How to Automatically Subcategorise Your Website Automatically With Python
How to Automatically Subcategorise Your Website Automatically With PythonHow to Automatically Subcategorise Your Website Automatically With Python
How to Automatically Subcategorise Your Website Automatically With Python
 
NLP State of the Art | BERT
NLP State of the Art | BERTNLP State of the Art | BERT
NLP State of the Art | BERT
 
BERT
BERTBERT
BERT
 
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
 
Deep Natural Language Processing for Search Systems (sigir 2019 tutorial)
Deep Natural Language Processing for Search Systems (sigir 2019 tutorial)Deep Natural Language Processing for Search Systems (sigir 2019 tutorial)
Deep Natural Language Processing for Search Systems (sigir 2019 tutorial)
 
Brighton SEO 2023 - ML Lessons For Total Search.pdf
Brighton SEO 2023 - ML Lessons For Total Search.pdfBrighton SEO 2023 - ML Lessons For Total Search.pdf
Brighton SEO 2023 - ML Lessons For Total Search.pdf
 
How to Combat SERP Volatility - Adriana Stein - BrightonSEO Slides 2023pdf
How to Combat SERP Volatility - Adriana Stein - BrightonSEO Slides 2023pdfHow to Combat SERP Volatility - Adriana Stein - BrightonSEO Slides 2023pdf
How to Combat SERP Volatility - Adriana Stein - BrightonSEO Slides 2023pdf
 
BrightonSEO - Amanda Jordan.pptx
BrightonSEO - Amanda Jordan.pptxBrightonSEO - Amanda Jordan.pptx
BrightonSEO - Amanda Jordan.pptx
 
E-Commerce search with Elasticsearch
E-Commerce search with ElasticsearchE-Commerce search with Elasticsearch
E-Commerce search with Elasticsearch
 
Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet Allocation
 
An introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERTAn introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERT
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
 
Natural Language Processing and Search Intent Understanding C3 Conductor 2019...
Natural Language Processing and Search Intent Understanding C3 Conductor 2019...Natural Language Processing and Search Intent Understanding C3 Conductor 2019...
Natural Language Processing and Search Intent Understanding C3 Conductor 2019...
 

Similar to Generating Qualitative Content with GPT-2 in All Languages

Similar to Generating Qualitative Content with GPT-2 in All Languages (20)

Automate, Create Tools, & Test Ideas Quickly with Google Apps Script
Automate, Create Tools, & Test Ideas Quickly with Google Apps ScriptAutomate, Create Tools, & Test Ideas Quickly with Google Apps Script
Automate, Create Tools, & Test Ideas Quickly with Google Apps Script
 
ChatGPT and OpenAI.pdf
ChatGPT and OpenAI.pdfChatGPT and OpenAI.pdf
ChatGPT and OpenAI.pdf
 
TechSEO Boost 2019: Research Competition
TechSEO Boost 2019: Research CompetitionTechSEO Boost 2019: Research Competition
TechSEO Boost 2019: Research Competition
 
TechSEO Boost - Apps script for SEOs
TechSEO Boost - Apps script for SEOsTechSEO Boost - Apps script for SEOs
TechSEO Boost - Apps script for SEOs
 
Analyzing Real Time News
Analyzing Real Time NewsAnalyzing Real Time News
Analyzing Real Time News
 
Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter Data
 
BTech Final Project (1).pptx
BTech Final Project (1).pptxBTech Final Project (1).pptx
BTech Final Project (1).pptx
 
Machine Learning for Designers
Machine Learning for DesignersMachine Learning for Designers
Machine Learning for Designers
 
MOVIE RATING PREDICTION BASED ON TWITTER SENTIMENT ANALYSIS
MOVIE RATING PREDICTION BASED ON TWITTER SENTIMENT ANALYSISMOVIE RATING PREDICTION BASED ON TWITTER SENTIMENT ANALYSIS
MOVIE RATING PREDICTION BASED ON TWITTER SENTIMENT ANALYSIS
 
How to build your in-house ChatGPT
How to build your in-house ChatGPT How to build your in-house ChatGPT
How to build your in-house ChatGPT
 
Improve existing code with confidence, supported by unit tests
Improve existing code with confidence, supported by unit testsImprove existing code with confidence, supported by unit tests
Improve existing code with confidence, supported by unit tests
 
Deep Learning using Tensorflow and Data Science Experience
Deep Learning using Tensorflow and Data Science ExperienceDeep Learning using Tensorflow and Data Science Experience
Deep Learning using Tensorflow and Data Science Experience
 
Python For Technical SEO | Women In Tech SEO Festival March 2020 | Ruth Everett
Python For Technical SEO | Women In Tech SEO Festival March 2020 | Ruth Everett Python For Technical SEO | Women In Tech SEO Festival March 2020 | Ruth Everett
Python For Technical SEO | Women In Tech SEO Festival March 2020 | Ruth Everett
 
Five steps to search and store tweets by keywords
Five steps to search and store tweets by keywordsFive steps to search and store tweets by keywords
Five steps to search and store tweets by keywords
 
MmIT webinar 2018 - Essential tools and technologies for the library and info...
MmIT webinar 2018 - Essential tools and technologies for the library and info...MmIT webinar 2018 - Essential tools and technologies for the library and info...
MmIT webinar 2018 - Essential tools and technologies for the library and info...
 
Intent Classifier with Facebook fastText
Intent Classifier with Facebook fastTextIntent Classifier with Facebook fastText
Intent Classifier with Facebook fastText
 
Machine Learning and Python For Marketing Automation | MKGO October 2019 | Ru...
Machine Learning and Python For Marketing Automation | MKGO October 2019 | Ru...Machine Learning and Python For Marketing Automation | MKGO October 2019 | Ru...
Machine Learning and Python For Marketing Automation | MKGO October 2019 | Ru...
 
Thesis Presentation V4
Thesis Presentation V4Thesis Presentation V4
Thesis Presentation V4
 
How can AI be a creative partner for PR & marketing?
How can AI be a creative partner for PR & marketing?How can AI be a creative partner for PR & marketing?
How can AI be a creative partner for PR & marketing?
 
Sentiment analysis on demonetisation
Sentiment analysis on demonetisationSentiment analysis on demonetisation
Sentiment analysis on demonetisation
 

More from Catalyst

New Commerce Commerce: All Things Instacart
New Commerce Commerce: All Things InstacartNew Commerce Commerce: All Things Instacart
New Commerce Commerce: All Things Instacart
Catalyst
 
Reignite Your Business with Performance Marketing: 4 Ways to Fuel Your Reopening
Reignite Your Business with Performance Marketing: 4 Ways to Fuel Your ReopeningReignite Your Business with Performance Marketing: 4 Ways to Fuel Your Reopening
Reignite Your Business with Performance Marketing: 4 Ways to Fuel Your Reopening
Catalyst
 
Reignite Your Business with Performance Marketing: 4 Ways to Dial-Up Brand In...
Reignite Your Business with Performance Marketing: 4 Ways to Dial-Up Brand In...Reignite Your Business with Performance Marketing: 4 Ways to Dial-Up Brand In...
Reignite Your Business with Performance Marketing: 4 Ways to Dial-Up Brand In...
Catalyst
 

More from Catalyst (20)

Closing the Gap: Adopting Omnichannel Strategies for Stronger Brand-Consumer ...
Closing the Gap: Adopting Omnichannel Strategies for Stronger Brand-Consumer ...Closing the Gap: Adopting Omnichannel Strategies for Stronger Brand-Consumer ...
Closing the Gap: Adopting Omnichannel Strategies for Stronger Brand-Consumer ...
 
TechSEO Boost 2021 - Cultivating a Product Mindset for Success
TechSEO Boost 2021 - Cultivating a Product Mindset for SuccessTechSEO Boost 2021 - Cultivating a Product Mindset for Success
TechSEO Boost 2021 - Cultivating a Product Mindset for Success
 
TechSEO Boost 2021 - SEO Experimentation
TechSEO Boost 2021 - SEO ExperimentationTechSEO Boost 2021 - SEO Experimentation
TechSEO Boost 2021 - SEO Experimentation
 
TechSEO Boost 2021 - Rendering Strategies: Measuring the Devil’s Details in C...
TechSEO Boost 2021 - Rendering Strategies: Measuring the Devil’s Details in C...TechSEO Boost 2021 - Rendering Strategies: Measuring the Devil’s Details in C...
TechSEO Boost 2021 - Rendering Strategies: Measuring the Devil’s Details in C...
 
TechSEO Boost 2021 - The Future Is The Past: Tagging And Tracking Through The...
TechSEO Boost 2021 - The Future Is The Past: Tagging And Tracking Through The...TechSEO Boost 2021 - The Future Is The Past: Tagging And Tracking Through The...
TechSEO Boost 2021 - The Future Is The Past: Tagging And Tracking Through The...
 
10 Trends Changing Programmatic
10 Trends Changing Programmatic10 Trends Changing Programmatic
10 Trends Changing Programmatic
 
New Commerce Conference: Charting a Course to Success with Your Retail Media ...
New Commerce Conference: Charting a Course to Success with Your Retail Media ...New Commerce Conference: Charting a Course to Success with Your Retail Media ...
New Commerce Conference: Charting a Course to Success with Your Retail Media ...
 
The New Commerce Conference: The Omni-channel Imperative
The New Commerce Conference: The Omni-channel ImperativeThe New Commerce Conference: The Omni-channel Imperative
The New Commerce Conference: The Omni-channel Imperative
 
New Commerce Commerce: All Things Instacart
New Commerce Commerce: All Things InstacartNew Commerce Commerce: All Things Instacart
New Commerce Commerce: All Things Instacart
 
The Power of SEO: Protect Your Bottom Line & Future Proof Your Brand
The Power of SEO: Protect Your Bottom Line & Future Proof Your BrandThe Power of SEO: Protect Your Bottom Line & Future Proof Your Brand
The Power of SEO: Protect Your Bottom Line & Future Proof Your Brand
 
The Era of Omni-Commerce: New Insights for Dominating the Digital Shelf and B...
The Era of Omni-Commerce: New Insights for Dominating the Digital Shelf and B...The Era of Omni-Commerce: New Insights for Dominating the Digital Shelf and B...
The Era of Omni-Commerce: New Insights for Dominating the Digital Shelf and B...
 
Reignite Your Business with Performance Marketing: 4 Ways to Fuel Your Reopening
Reignite Your Business with Performance Marketing: 4 Ways to Fuel Your ReopeningReignite Your Business with Performance Marketing: 4 Ways to Fuel Your Reopening
Reignite Your Business with Performance Marketing: 4 Ways to Fuel Your Reopening
 
Reignite Your Business with Performance Marketing: 4 Ways to Dial-Up Brand In...
Reignite Your Business with Performance Marketing: 4 Ways to Dial-Up Brand In...Reignite Your Business with Performance Marketing: 4 Ways to Dial-Up Brand In...
Reignite Your Business with Performance Marketing: 4 Ways to Dial-Up Brand In...
 
Evolve Your Social Commerce Strategy: Thinking Beyond Facebook
Evolve Your Social Commerce Strategy: Thinking Beyond FacebookEvolve Your Social Commerce Strategy: Thinking Beyond Facebook
Evolve Your Social Commerce Strategy: Thinking Beyond Facebook
 
B2B SEO: Increase Traffic & Leads in 2020
B2B SEO: Increase Traffic & Leads in 2020B2B SEO: Increase Traffic & Leads in 2020
B2B SEO: Increase Traffic & Leads in 2020
 
Keynote: Bias in Search and Recommender Systems
Keynote: Bias in Search and Recommender SystemsKeynote: Bias in Search and Recommender Systems
Keynote: Bias in Search and Recommender Systems
 
NLP Powered Outreach Link Building
NLP Powered Outreach Link BuildingNLP Powered Outreach Link Building
NLP Powered Outreach Link Building
 
NLP for SEO
NLP for SEONLP for SEO
NLP for SEO
 
What I Learned Building a Toy Example to Crawl & Render like Google
What I Learned Building a Toy Example to Crawl & Render like GoogleWhat I Learned Building a Toy Example to Crawl & Render like Google
What I Learned Building a Toy Example to Crawl & Render like Google
 
The User is The Query: The Rise of Predictive Proactive Search
The User is The Query: The Rise of Predictive Proactive SearchThe User is The Query: The Rise of Predictive Proactive Search
The User is The Query: The Rise of Predictive Proactive Search
 

Recently uploaded

obat pelancar haid di apotik dan harganya
obat pelancar haid di apotik dan harganyaobat pelancar haid di apotik dan harganya
obat pelancar haid di apotik dan harganya
infoobataborsi24
 
Top 10 Recommended Fragrances for Father's Day
Top 10 Recommended Fragrances for Father's DayTop 10 Recommended Fragrances for Father's Day
Top 10 Recommended Fragrances for Father's Day
disenylurial
 
How To Structure Your Web3 Website For Max Visibility In The Bull Market🚀
How To Structure Your Web3 Website For Max Visibility In The Bull Market🚀How To Structure Your Web3 Website For Max Visibility In The Bull Market🚀
How To Structure Your Web3 Website For Max Visibility In The Bull Market🚀
Victoria Olsina
 
Klinik Jual Obat Aborsi Di Bandung wa 0851/7541/5434 Misoprostol 200mcg Pfize...
Klinik Jual Obat Aborsi Di Bandung wa 0851/7541/5434 Misoprostol 200mcg Pfize...Klinik Jual Obat Aborsi Di Bandung wa 0851/7541/5434 Misoprostol 200mcg Pfize...
Klinik Jual Obat Aborsi Di Bandung wa 0851/7541/5434 Misoprostol 200mcg Pfize...
Spesialis Kandungan Resmi BPOM
 

Recently uploaded (20)

Being a PMM with a multi-product portfolio - Product Marketing Summit
Being a PMM with a multi-product portfolio - Product Marketing SummitBeing a PMM with a multi-product portfolio - Product Marketing Summit
Being a PMM with a multi-product portfolio - Product Marketing Summit
 
How to Scale Your Digital Marketing Services in 2024
How to Scale Your Digital Marketing Services in 2024How to Scale Your Digital Marketing Services in 2024
How to Scale Your Digital Marketing Services in 2024
 
The BoF Brand Magic Index Volume Two — Preview.pdf
The BoF Brand Magic Index Volume Two — Preview.pdfThe BoF Brand Magic Index Volume Two — Preview.pdf
The BoF Brand Magic Index Volume Two — Preview.pdf
 
Fantasy Cricket Apps: A New Viewpoint for Online Cricket Betting Apps
Fantasy Cricket Apps: A New Viewpoint for Online Cricket Betting AppsFantasy Cricket Apps: A New Viewpoint for Online Cricket Betting Apps
Fantasy Cricket Apps: A New Viewpoint for Online Cricket Betting Apps
 
Top tips for effective SEO copywriting.pdf
Top tips for effective SEO copywriting.pdfTop tips for effective SEO copywriting.pdf
Top tips for effective SEO copywriting.pdf
 
Webinar: What the Hell is Legitimate Interest?
Webinar: What the Hell is Legitimate Interest?Webinar: What the Hell is Legitimate Interest?
Webinar: What the Hell is Legitimate Interest?
 
Digital PR & Content Marketing Lecture for Advanced Digital & Social Media St...
Digital PR & Content Marketing Lecture for Advanced Digital & Social Media St...Digital PR & Content Marketing Lecture for Advanced Digital & Social Media St...
Digital PR & Content Marketing Lecture for Advanced Digital & Social Media St...
 
SES London 2009 Beyond Linkbait Greg Jarboe.ppt
SES London 2009 Beyond Linkbait Greg Jarboe.pptSES London 2009 Beyond Linkbait Greg Jarboe.ppt
SES London 2009 Beyond Linkbait Greg Jarboe.ppt
 
youtube_marketing_partner_vling_service_introduction
youtube_marketing_partner_vling_service_introductionyoutube_marketing_partner_vling_service_introduction
youtube_marketing_partner_vling_service_introduction
 
obat pelancar haid di apotik dan harganya
obat pelancar haid di apotik dan harganyaobat pelancar haid di apotik dan harganya
obat pelancar haid di apotik dan harganya
 
Top 10 Recommended Fragrances for Father's Day
Top 10 Recommended Fragrances for Father's DayTop 10 Recommended Fragrances for Father's Day
Top 10 Recommended Fragrances for Father's Day
 
The Future Normal - DIGGIT - Henry Coutinho-Mason.pdf
The Future Normal - DIGGIT - Henry Coutinho-Mason.pdfThe Future Normal - DIGGIT - Henry Coutinho-Mason.pdf
The Future Normal - DIGGIT - Henry Coutinho-Mason.pdf
 
Influencer Marekting Trends- Where the creator economy is going in in 2024
Influencer Marekting Trends- Where the creator economy is going in in 2024Influencer Marekting Trends- Where the creator economy is going in in 2024
Influencer Marekting Trends- Where the creator economy is going in in 2024
 
Using GA 4 to to Prove Value - Greg Jarboe - Aug 8, 2023.pptx
Using GA 4 to to Prove Value - Greg Jarboe - Aug 8, 2023.pptxUsing GA 4 to to Prove Value - Greg Jarboe - Aug 8, 2023.pptx
Using GA 4 to to Prove Value - Greg Jarboe - Aug 8, 2023.pptx
 
Impacts Of Smart Watch & Wearable Technology On Daily Life
Impacts Of Smart Watch & Wearable Technology On Daily LifeImpacts Of Smart Watch & Wearable Technology On Daily Life
Impacts Of Smart Watch & Wearable Technology On Daily Life
 
Beyond the Basics: Enhanced Strategies for Next-Level Advertising
Beyond the Basics: Enhanced Strategies for Next-Level AdvertisingBeyond the Basics: Enhanced Strategies for Next-Level Advertising
Beyond the Basics: Enhanced Strategies for Next-Level Advertising
 
The Vital Role of Keyword Density in Crafting SEO-Optimized Content
The Vital Role of Keyword Density in Crafting SEO-Optimized ContentThe Vital Role of Keyword Density in Crafting SEO-Optimized Content
The Vital Role of Keyword Density in Crafting SEO-Optimized Content
 
How To Structure Your Web3 Website For Max Visibility In The Bull Market🚀
How To Structure Your Web3 Website For Max Visibility In The Bull Market🚀How To Structure Your Web3 Website For Max Visibility In The Bull Market🚀
How To Structure Your Web3 Website For Max Visibility In The Bull Market🚀
 
Klinik Jual Obat Aborsi Di Bandung wa 0851/7541/5434 Misoprostol 200mcg Pfize...
Klinik Jual Obat Aborsi Di Bandung wa 0851/7541/5434 Misoprostol 200mcg Pfize...Klinik Jual Obat Aborsi Di Bandung wa 0851/7541/5434 Misoprostol 200mcg Pfize...
Klinik Jual Obat Aborsi Di Bandung wa 0851/7541/5434 Misoprostol 200mcg Pfize...
 
Niche Analysis for Client Outreach Outside Marketplace.pptx
Niche Analysis for Client Outreach Outside Marketplace.pptxNiche Analysis for Client Outreach Outside Marketplace.pptx
Niche Analysis for Client Outreach Outside Marketplace.pptx
 

Generating Qualitative Content with GPT-2 in All Languages

  • 1. #TechSEOBoost | @CatalystSEM THANK YOU TO OUR SPONSORS Generating Qualitative Content with GPT-2 in All Languages Vincent Terrasi, OnCrawl
  • 2. Vincent Terrasi | @vincentterrasi | #TechSEOBoost In All Languages Generating Qualitative Content
  • 3. Vincent Terrasi | @vincentterrasi | #TechSEOBoost SEO Use-cases • Image captioning with Pythia • Visual question & Answering • Abstractive Summarization with BERTsum • Full Article generation with GPT-2
  • 4. Vincent Terrasi | @vincentterrasi | #TechSEOBoost Text Spinners are bad
  • 5. Vincent Terrasi | @vincentterrasi | #TechSEOBoost Google, What is bad generated content in 2016? • Text translated by an automated tool without human review or curation before publishing • Text generated through automated processes, such as Markov chains • Text generated using automated synonymizing or obfuscation techniques • Text generated from scraping Atom/RSS feeds or search results • Stitching or combining content from different web pages without adding sufficient value https://web.archive.org/web/20160222004700/https://support.google.com/webmasters/answer/2721306?hl=en
  • 6. Vincent Terrasi | @vincentterrasi | #TechSEOBoost Google, What is bad generated content in 2019? • Text that makes no sense to the reader but which may contain search keywords. • Text translated by an automated tool without human review or curation before publishing • Text generated through automated processes, such as Markov chains • Text generated using automated synonymizing or obfuscation techniques • Text generated from scraping Atom/RSS feeds or search results • Stitching or combining content from different web pages without adding sufficient value https://support.google.com/webmasters/answer/2721306?hl=en
  • 7. Vincent Terrasi | @vincentterrasi | #TechSEOBoost Surprise!
  • 8. Vincent Terrasi | @vincentterrasi | #TechSEOBoost 2019, the best year for using AI for text generation
  • 9. Vincent Terrasi | @vincentterrasi | #TechSEOBoost GPT-2BERT ELMO ULM-FIT J Howard
  • 10. Vincent Terrasi | @vincentterrasi | #TechSEOBoost GPT-2BERT ELMO ULM-FIT J Howard
  • 11. Vincent Terrasi | @vincentterrasi | #TechSEOBoost Transformer and Attention Model
  • 12. Vincent Terrasi | @vincentterrasi | #TechSEOBoost Patterns for Attention Model Pattern 1: Attention to next word
  • 13. Vincent Terrasi | @vincentterrasi | #TechSEOBoost Patterns for Attention Model Pattern 1: Attention to next word Pattern 2: Attention to previous word
  • 14. Vincent Terrasi | @vincentterrasi | #TechSEOBoost Patterns for Attention Model Pattern 1: Attention to next word Pattern 2: Attention to previous word Pattern 3: Attention to identical/related words
  • 15. Vincent Terrasi | @vincentterrasi | #TechSEOBoost Patterns for Attention Model Pattern 1: Attention to next word Pattern 2: Attention to previous word Pattern 3: Attention to identical/related words Pattern 4: Attention to identical/related words in other sentence
  • 16. Vincent Terrasi | @vincentterrasi | #TechSEOBoost Patterns for Attention Model Pattern 1: Attention to next word Pattern 2: Attention to previous word Pattern 3: Attention to identical/related words Pattern 4: Attention to identical/related words in other sentence Pattern 5: Attention to other words predictive (next word) of word
  • 17. Vincent Terrasi | @vincentterrasi | #TechSEOBoost Patterns for Attention Model Pattern 1: Attention to next word Pattern 2: Attention to previous word Pattern 3: Attention to identical/related words Pattern 4: Attention to identical/related words in other sentence Pattern 5: Attention to other words predictive (next word) of word Pattern 6: Attention to delimiter tokens
  • 18. Vincent Terrasi | @vincentterrasi | #TechSEOBoost State of the Art ⚫ All models exist for English ⚫ Documentation is good ⚫ So we just need to translate
  • 19. Vincent Terrasi | @vincentterrasi | #TechSEOBoost There are a lot of biases: ◦ Small Talk ◦ Idioms ◦ Local Named Entities ◦ Rarest Verbs ◦ Uncommon Tenses ◦ Gender rules
  • 20. Vincent Terrasi | @vincentterrasi | #TechSEOBoost How to scale? Create your own model in your language
  • 21. Vincent Terrasi | @vincentterrasi | #TechSEOBoost Objectives Use only qualitative methods to improve the quality of content created by humans Extract the knowledge learnt by the Deep Learning.
  • 22. Why have other attempts failed? Quantitative: You need a lot of data: more than 100,000 texts with a minimum of 500 words each. Qualitative: You need qualitative texts.
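The quantitative requirement on slide 22 can be checked before any training starts. The following is a minimal sketch, not the author's tooling: the function names are hypothetical, and counting words by splitting on whitespace is an assumption that does not hold for languages written without spaces.

```python
def count_words(text: str) -> int:
    """Approximate word count by splitting on whitespace (an assumption)."""
    return len(text.split())


def select_training_texts(texts, min_words=500):
    """Keep only the texts that reach the minimum length from the slide."""
    return [t for t in texts if count_words(t) >= min_words]


def corpus_is_large_enough(texts, min_texts=100_000):
    """The slide asks for more than 100,000 texts, hence the strict comparison."""
    return len(texts) > min_texts


if __name__ == "__main__":
    # Tiny illustrative corpus: one long text and one short text.
    corpus = ["mot " * 600, "un texte trop court"]
    kept = select_training_texts(corpus)
    print(len(kept), "text(s) kept; large enough:", corpus_is_large_enough(kept))
```

The qualitative requirement (well-written source texts) cannot be checked this way and still needs human judgment.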
  • 23. GPT-2 Recipe
  • 24. Step 1: Training the model. Training from scratch, without a pretrained model, requires significant computing power. You need GPUs! It took 3 days to get my first result with one GPU.
  • 25. Step 2: Generating the compressed training dataset - 1/2. GPT-2 needs to learn from text in the Byte Pair Encoding (BPE) format, which is a simple form of data compression. Why? - Predicting the next character is too imprecise - Predicting the next word is too precise and takes a lot of computing power.
  • 26. Step 2: Generating the compressed training dataset - 2/2. I use SentencePiece to generate the BPE files. Why? - Unsupervised text tokenizer and detokenizer - Purely end-to-end system that does not depend on language-specific pre/postprocessing.
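To make the idea behind Byte Pair Encoding in Step 2 concrete, here is a minimal pure-Python sketch of the classic merge-learning loop: start from characters and repeatedly merge the most frequent adjacent pair. This is an illustration of the general technique only, not SentencePiece's implementation and not the author's code; the function names and the toy word list are hypothetical.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges, starting from single characters."""
    vocab = dict(Counter(tuple(w) for w in words))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab


if __name__ == "__main__":
    merges, vocab = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
    print(merges)
    print(vocab)
```

The resulting units sit between characters and whole words, which is the compromise the slides describe: frequent words end up as single tokens while rare words are split into smaller pieces.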
  • 27. Step 3: Fine-tuning the model. Vocabulary size: depends on the language - n_vocab:50257
  • 28. Step 3: Fine-tuning the model. Vocabulary size: depends on the language - n_vocab:50257; Embedding size: default value recommended by the OpenAI team - n_embd:768
  • 29. Step 3: Fine-tuning the model. Vocabulary size: depends on the language - n_vocab:50257; Embedding size: default value recommended by the OpenAI team - n_embd:768; Number of attention heads: no greater accuracy if you increase this value - n_head:12
  • 30. Step 3: Fine-tuning the model. Vocabulary size: depends on the language - n_vocab:50257; Embedding size: default value recommended by the OpenAI team - n_embd:768; Number of attention heads: no greater accuracy if you increase this value - n_head:12; Number of layers: no greater accuracy if you increase this value - n_layer:12
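The four hyperparameters from Step 3 can be collected in one small configuration object. This is a sketch under assumptions: the class name `GPT2HParams` and the JSON layout are illustrative, not taken from the talk, and any other settings a training tool may require are left out because the slides do not list them. The one check included reflects how multi-head attention works: the embedding is split evenly across heads, so `n_embd` must be divisible by `n_head`.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GPT2HParams:
    """Hypothetical container for the values shown on the slides."""
    n_vocab: int = 50257  # vocabulary size; depends on the language
    n_embd: int = 768     # embedding size
    n_head: int = 12      # number of attention heads
    n_layer: int = 12     # number of transformer layers

    def head_dim(self) -> int:
        """Width of each attention head (the embedding is split across heads)."""
        if self.n_embd % self.n_head != 0:
            raise ValueError("n_embd must be divisible by n_head")
        return self.n_embd // self.n_head


if __name__ == "__main__":
    hparams = GPT2HParams()
    print(json.dumps(asdict(hparams), indent=2))
    print("per-head dimension:", hparams.head_dim())
```

With the slide values, each of the 12 heads works on a 64-dimensional slice of the 768-dimensional embedding.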
  • 31. Step 4: Generating article text. Once the model has been trained, the gpt-2-gen command is used to generate a text. The first parameter is the path to the model. The second is the beginning of the sentence. Then there are two optional parameters: ◦ --tokens-to-generate: number of tokens to generate, default 42 ◦ --top-k: number of candidate tokens considered at each step, default 8.
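The slides do not show how gpt-2-gen applies --top-k internally, so the following is only a sketch of standard top-k sampling, the technique that option name usually refers to: keep the k highest-scoring candidate tokens, turn their scores into probabilities with a softmax, and draw one of them. The function name `top_k_sample` and the toy scores are hypothetical.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample one token index from the k highest-scoring candidates."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Indices of the k largest scores.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept scores; subtracting the max keeps exp() stable.
    highest = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - highest) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    toy_logits = [2.0, 1.0, 0.1, -1.0, 3.0]  # made-up scores for 5 tokens
    # k=8 mirrors the default mentioned on the slide; with only 5 candidates
    # it simply keeps all of them.
    print([top_k_sample(toy_logits, 8, rng) for _ in range(10)])
    # k=1 always returns the single best-scoring token (greedy decoding).
    print(top_k_sample(toy_logits, 1, rng))
```

A small k keeps the output close to the most likely continuations, while a large k lets less likely tokens through and produces more varied text.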
  • 32. Results & Quality. Evaluated subjectively by a native reader. The pyLanguageTool API was used to quantitatively confirm the quality of the results, and it did not find any errors in the generated text. https://github.com/Findus23/pyLanguagetool
  • 33. You can find my Google Colab notebook for French here: https://colab.research.google.com/drive/13Lbk1TYmTjoQFO6qbw_f1TJgoD5ulJwV Warning: it is just an example using limited data. Now it is your turn.
  • 34. Further? ◦ Parameters: top-k < 10, token < 10. Objectives: High Performance. Very high qualitative content related to your original training content. Use Cases: Anchors for Internal Linking; Variant of Title; Variant of Meta. ◦ Parameters: top-k > 50, token > 400. Objectives: Low Performance. Low qualitative content because the model is weak, but the model successfully extracts all concepts that GPT-2 learnt about your dataset. Use Cases: Guides to help you write, compared to a query, with the stated purpose of saving you time.
  • 35. Thank You vincent@oncrawl.com
  • 36. Catalyst | @CatalystSEM | #TechSEOBoost Thanks for Viewing the Slideshare! – Watch the Recording: https://youtube.com/session-example Or Contact us today to discover how Catalyst can deliver unparalleled SEO results for your business. https://www.catalystdigital.com/