How to automate all
your SEO projects
@VincentTerrasi
OVH
Planning
• Each Day:
• Advanced Reporting
• Anomaly Detection
• Log Analysis
• Webperf with SiteSpeed.io
• Each Week:
• Ranking Monitoring
• Opportunity Detection
• Hot Topic Detection
• Each Quarter:
• Semantic Analysis
Time is precious
Automate
everything
1. RStudio Server
2. Shiny Server
3. Jupyter Notebook
4. Dataiku
5. OpenSource
[Architecture diagram: a Docker-based stack (RStudio Server, Shiny Server, Notebook, Dataiku) on top of a DataLake fed by searchConsoleR, ATinternetR and oncrawlR, producing scheduled emails, a Data API, Shiny Apps, DataViz and reports]
1. RStudio Server
Automate all your SEO projects
Why R ?
Scriptable
Big Community
Mac / PC / Unix
Open Source
Free
 10 000 packages
Rgui
WheRe? How?
Rstudio
https://www.cran.r-project.org
RStudio Server
OVH – Instance Cloud
• Docker on Ubuntu 16.04 Server
• From a terminal on the server, run:
• sudo docker run -d -p 8787:8787 rocker/rstudio
• e.g. http://yourIP:8787, and you should be greeted by the RStudio
welcome screen.
Log in using:
• username: rstudio
• password: rstudio
RStudio Server - Install
• install.packages("httr")
• install.packages("RCurl")
• install.packages("stringr")
• install.packages("stringi")
• install.packages("openssl")
• install.packages("Rmpi")
• install.packages("doMPI")
R – Scraper – Packages
R – Scraper – RCurl
seocrawler <- function( url ) {
useragent <- "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko)
Version/6.0 Mobile/10A5376e Safari/8536.25"
h <- basicTextGatherer()
html <- getURL(url
,followlocation = TRUE
,ssl.verifypeer = FALSE
,httpheader = c('User-Agent' = useragent)
,headerfunction = h$update
)
return(html)
}
R – Scraper – Header
# find each header line, keep the last occurrence (after redirects), strip the field name
ind0 <- grep("HTTP/", h$value(NULL))
df$StatusCode <- tail(h$value(NULL)[ind0], 1)
ind1 <- grep("^Content-Type", h$value(NULL))
df$ContentType <- trimws(gsub("Content-Type:", "", tail(h$value(NULL)[ind1], 1)))
ind2 <- grep("^Last-Modified", h$value(NULL))
df$LastModified <- trimws(gsub("Last-Modified:", "", tail(h$value(NULL)[ind2], 1)))
ind3 <- grep("^Content-Language", h$value(NULL))
df$ContentLanguage <- trimws(gsub("Content-Language:", "", tail(h$value(NULL)[ind3], 1)))
ind4 <- grep("^Location", h$value(NULL))
df$Location <- trimws(gsub("Location:", "", tail(h$value(NULL)[ind4], 1)))
R – Scraper – Xpath
doc <- htmlParse(html, asText=TRUE,encoding="UTF-8")
• H1 <- head(xpathSApply(doc, "//h1", xmlValue),1)
• H2 <- head(xpathSApply(doc, "//h2", xmlValue),1)
• robots <- head(xpathSApply(doc, '//meta[@name="robots"]', xmlGetAttr, 'content'),1)
• canonical <- head(xpathSApply(doc, '//link[@rel="canonical"]', xmlGetAttr, 'href'),1)
• DF_a <- xpathSApply(doc, "//a", xmlGetAttr, 'href')
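The same XPath calls can be checked on a small self-contained example with the XML package (the HTML snippet and its values are invented):

```r
library(XML)

# toy HTML standing in for a scraped page
html <- '<html><head>
<link rel="canonical" href="https://example.com/"/>
<meta name="robots" content="index,follow"/>
</head><body><h1>Hello</h1><a href="/about">About</a></body></html>'

doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")
H1        <- head(xpathSApply(doc, "//h1", xmlValue), 1)
robots    <- head(xpathSApply(doc, '//meta[@name="robots"]', xmlGetAttr, 'content'), 1)
canonical <- head(xpathSApply(doc, '//link[@rel="canonical"]', xmlGetAttr, 'href'), 1)
links     <- xpathSApply(doc, "//a", xmlGetAttr, "href")
```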
How to go parallel in R
R – Scraper – OpenMpi
• MPI : Message Passing Interface is a specification for an API for passing
messages between different computers.
• Programming with MPI
• Difficult, because the Rmpi package defines about 110 R functions
• Needs a parallel programming system to do the actual work in parallel
• The doMPI package acts as an adaptor to the Rmpi package, which in
turn is an R interface to an implementation of MPI
• Very easy to install Open MPI and Rmpi on Debian / Ubuntu
• You can test with a single computer
R – Scraper – Install OpenMPI
sudo yum install openmpi openmpi-devel openmpi-libs
sudo ldconfig /usr/lib64/openmpi/lib/
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}${LD_LIBRARY_PATH:+:}/usr/lib64/openmpi/lib/"
install.packages("Rmpi",
configure.args =
c("--with-Rmpi-include=/usr/include/openmpi-x86_64/",
"--with-Rmpi-libpath=/usr/lib64/openmpi/lib/",
"--with-Rmpi-type=OPENMPI"))
install.packages("doMPI")
R – Scraper – Test doMpi
library(doMPI)
#start your cluster
cl <- startMPIcluster(count=20)
registerDoMPI(cl)
# crawl every row of the dataset in parallel
max <- dim(mydataset)[1]
x <- foreach(i=1:max, .combine="rbind") %dopar% seocrawlerThread(mydataset,i)
#close your cluster
closeCluster(cl)
• Venn Matrix :
http://blog.mrbioinfo.com/
R – Semantic Analysis – Intro
R – Semantic Analysis – Data
R – Semantic Analysis – eVenn
evenn(pathRes="./eVenn/", matLists=all.the.data, annot=FALSE, CompName="croisiere")
R – Semantic Analysis – Filter
fichierVenn <- "./eVenn/Venn_croisiere/VennMatrixBin.txt"
#read csv
DF <- read.csv(fichierVenn, sep = "\t", encoding="CP1252", stringsAsFactors=FALSE)
#find
DF_PotentialKeywords <- subset(DF, DF$Total_lists >= 4 & DF$planete.croisiere.com==0 )
R – Semantic Analysis – nGram
library(text2vec)
library(dplyr)
it <- itoken( DF_PotentialKeywords[['Keywords']],
preprocess_function = tolower,
tokenizer = word_tokenizer,
progressbar = FALSE )
# 2- and 3-grams
vocab <- create_vocabulary(it, ngram = c(2L, 3L))
DF_SEO_vocab <- data.frame(vocab$vocab)
DF_SEO_select <- data.frame(word=DF_SEO_vocab$terms,
freq=DF_SEO_vocab$terms_counts) %>%
arrange(-freq) %>%
top_n(30)
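To eyeball the result, the selected n-grams can go straight into a base-R bar chart; a sketch with toy rows standing in for the DF_SEO_select data frame built above:

```r
# toy stand-in for DF_SEO_select (word/freq columns as built above)
DF_SEO_select <- data.frame(
  word = c("croisiere_pas_cher", "croisiere_derniere_minute", "croisiere_mediterranee"),
  freq = c(42, 17, 9)
)

# horizontal bar chart of n-gram frequencies, biggest opportunity on top
barplot(rev(DF_SEO_select$freq),
        names.arg = rev(DF_SEO_select$word),
        horiz = TRUE, las = 1,
        main = "Top keyword n-grams")
```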
• Dplyr
• Readxl
• SearchConsoleR
• googleAuthR
• googleAnalyticsR
R – Packages SEO
Thanks to Mark Edmondson
R – SearchConsoleR
library(googleAuthR)
library(searchConsoleR)
# get your client id and secret from the Google API Console
options("searchConsoleR.client_id" = "41078866233615q3i3uXXXX.apps.googleusercontent.com")
options("searchConsoleR.client_secret" = "GO0m0XXXXXXXXXX")
## change this to the website you want to download data for. Include http
website <- "https://data-seo.fr"
## Search Console data is reliably available about 3 days late, so we download from then
## today - 3 days
start <- Sys.Date() - 3
## one day's data, but change it as needed
end <- Sys.Date() - 3
R – SearchConsoleR
## what to download, choose between date, query, page, device, country
download_dimensions <- c('date','query')
## what type of Google search, choose between 'web', 'video' or 'image'
type <- c('web')
## Authorize the script with Search Console, using an account that has access to the website.
## The first time you will need to log in to Google; the token should auto-refresh after that.
googleAuthR::gar_auth()
## the first time, stop here and wait for authorisation
## get the search analytics data
data <- search_analytics(siteURL = website, startDate = start, endDate = end,
dimensions = download_dimensions, searchType = type)
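For the daily scheduled report, one option is to persist each pull as a dated CSV so the cron job builds a history; a sketch with toy rows standing in for the search_analytics() output (the file-name pattern is an assumption):

```r
# toy rows standing in for the search_analytics() result
data <- data.frame(date   = as.character(Sys.Date() - 3),
                   query  = c("data seo", "automate seo"),
                   clicks = c(12, 7))

# one dated file per day, so the scheduled job accumulates a history
outfile <- sprintf("gsc_%s.csv", Sys.Date() - 3)
write.csv(data, outfile, row.names = FALSE)
```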
• Table: Crontab Fields and Allowed Ranges (Linux Crontab Syntax)
• MIN — minute field, 0-59
• HOUR — hour field, 0-23
• DOM — day of month, 1-31
• MON — month, 1-12
• DOW — day of week, 0-6 (Sunday = 0)
• CMD — the command to execute
• $ crontab -e
• Run the R script filePath.R every day at 23:15:
15 23 * * * Rscript filePath.R
R – CronTab – Method 1
• R Package : https://github.com/bnosac/cronR
R – Cron – Method 2
library(cronR)
# build the Rscript command for the job
cmd <- cron_rscript("filePath.R")
cron_add(cmd, frequency = 'hourly', id = 'job4', at = '00:20',
days_of_week = c(1, 2))
cron_add(cmd, frequency = 'daily', id = 'job5', at = '14:20')
cron_add(cmd, frequency = 'daily', id = 'job6', at = '14:20',
days_of_week = c(0, 3, 5))
OR
Automated Reports
2. Shiny Server
Creating webapps with R
Shiny Server - Why
Shiny Server – Where and How
• ShinyApps.io
• A local server
• Hosted on your server
• docker run --rm -p 3838:3838
-v /srv/shinyapps/:/srv/shiny-server/
-v /srv/shinylog/:/var/log/
rocker/shiny
• If you have an app in /srv/shinyapps/appdir, you can run the app
by visiting http://yourIP:3838/appdir/.
Shiny Server - Install
Shiny – ui.R
fluidPage(
titlePanel("Compute your internal pagerank"),
sidebarLayout(
sidebarPanel(
a("data-seo.com", href="https://data-seo.com"),
tags$hr(),
p('Step 1 : Export your outlinks data from ScreamingFrog'),
fileInput('file1', 'Choose file to upload (e.g. all_outlinks.csv)',
accept = c('text/csv'), multiple = FALSE
),
tags$hr(),
downloadButton('downloadData', 'Download CSV')
),
mainPanel(
h3(textOutput("caption")),
tags$hr(),
tableOutput('contents')
)
)
)
Shiny – server.R
function(input, output, session) {
....
output$contents <- renderTable({
if (!is.null(input$file1)) {
inFile <- input$file1
logsSummary <- importLogs(inFile$datapath)
logsSummary
}
})
output$downloadData <- downloadHandler(
filename = "extract.csv",
content = function(file) {
if (!is.null(input$file1)) {
inFile <- input$file1
logsSummary <- importLogs(inFile$datapath)
write.csv2(logsSummary,file, row.names = FALSE)
}
}
)
}
https://mark.shinyapps.io/GA-dashboard-demo
Code on Github: https://github.com/MarkEdmondson1234/ga-dashboard-demo
• Interactive trend graphs.
• Auto-updating Google Analytics data.
• Zoomable day-of-week heatmaps.
• Top Level Trends via Year on Year, Month on Month
and Last Month vs Month Last Year data modules.
• A MySQL connection for data blending your own data with GA data.
• An easy upload option to update a MySQL database.
• Analysis of the impact of marketing events via Google's CausalImpact.
• Detection of unusual time-points using Twitter's Anomaly Detection.
Shiny – Use case
Automated KPI reporting
3. Jupyter Notebook
Sharing source code with your SEO team
Jupyter Notebook Example
• Reproducibility
• Quality
• Discoverability
• Learning
Jupyter Notebook – Why ?
Step 1 — Installing Python 2.7 and Pip
$ sudo apt-get update
$ sudo apt-get -y install python2.7 python-pip python-dev
Step 2 — Installing IPython and Jupyter Notebook
$ sudo apt-get -y install ipython ipython-notebook
$ sudo -H pip install jupyter
Step 3 — Running Jupyter Notebook
$ jupyter notebook
Jupyter Notebook Install
Notebook Example
• https://github.com/voltek62/RNotebook-SEO
• Semantic Analysis for SEO
• Scraper for SEO
Jupyter Notebook Examples
Process Validation
Documentation
4. Dataiku
Use AML to find the best algorithm
Automated Machine Learning
• Benchmarking
• Detecting Target Leakage
• Diagnostics
• Automation
$ adduser vincent sudo
$ sudo apt-get install default-jre
$ wget https://downloads.dataiku.com/public/studio/4.0.1/dataiku-dss-4.0.1.tar.gz
$ tar xzf dataiku-dss-4.0.1.tar.gz
$ cd dataiku-dss-4.0.1
>> install all prerequisites
$ sudo -i "/home/dataiku-dss-4.0.1/scripts/install/install-deps.sh" -without-java
>> install dataiku
$ ./installer.sh -d DATA_DIR -p 11000
$ DATA_DIR/bin/dss start
http://<your server address>:11000.
Dataiku- Install on Instance Cloud
Go to the DSS data dir
$ cd DATA_DIR
Stop DSS
$ ./bin/dss stop
Run the installation script
$ ./bin/dssadmin install-R-integration
$ ./bin/dss start
Dataiku- Install R
Install R Package
Use-Case :
Detect Featured
Snippet
• Get all your featured snippets with Ranxplorer
• Get the SERP for each keyword with Ranxplorer
• Use a homemade scraper to enrich the data:
• 'Keyword' 'Domain' 'StatusCode' 'ContentType' 'LastModified' 'Location'
• 'Title' 'TitleLength' 'TitleDist' 'TitleIsQuestion'
• 'noSnippet' 'isJsonLD' 'isItemType' 'isItemProp'
• 'Wordcount' 'Size' 'ResponseTime'
• 'H1' 'H1Length' 'H1Dist' 'H1IsQuestion'
• 'H2' 'H2Length' 'H2Dist' 'H2IsQuestion'
• Use AML to find the most important features
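The deck does not show the scraper's feature code; a hedged sketch of how features like TitleLength, TitleIsQuestion and TitleDist might be derived (the names reuse the list above, the logic and values are invented):

```r
# invented example values
title   <- "How do you get a featured snippet?"
keyword <- "featured snippet"

TitleLength     <- nchar(title)
# treat as a question if it ends with "?" or starts with an interrogative word
TitleIsQuestion <- grepl("\\?\\s*$", title) ||
                   grepl("^(how|what|why|when|where|who)\\b", tolower(title))
# edit distance between title and keyword as a crude relevance proxy
TitleDist       <- adist(tolower(title), tolower(keyword))[1, 1]
```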
Dataiku : Featured Snippet
Dataiku : Flow
Dataiku : Input / Output
Dataiku : Code Recipe
Dataiku : Visual Recipes
Dataiku : Plugin recipes
Dataiku : My Plugins
• SEMrush
• SearchConsole
• Majestic
• Visiblis [ongoing]
A DSS plugin is a zip file.
Inside DSS, click the top right gear → Administration → Plugins → Store.
https://github.com/voltek62/Dataiku-SEO-Plugins
Dataiku : AML
Dataiku : Import a project
• Learn from the success of others with AML
• Use all methods at your disposal to show Google you are the
answer to the question. ( Title, H1, H2, … )
Dataiku : Results
Automated Machine Learning
• Yes, you can, because:
• Great advertising
• It brings customers for specific features and training
Open Source & SEO ?
• Showing your work
• Attract talent
• Teaching the next generation
• Automated Reports with Rstudio Server
• Automated KPI reporting with Shiny Server
• Process Validation Documentation with Jupyter Notebook
• Automated Machine Learning with Dataiku
Take away
Now that machines can learn and adapt,
it is time to seize the opportunity
to create new jobs.
Data-SEO, Data-Doctor, Data-Journalist …
Thank you!
Vincent Terrasi
@vincentterrasi
Get all my last discoveries and updates


Editor's Notes

  • #4 HOW?
  • #7 R is a programming language dedicated to statistics and data science. The best-known implementation of R is the GNU R software.
  • #13 HTTP response header: collect the contents of the header of an HTTP response
  • #26 itoken: this function creates iterators over input objects for building vocabularies, corpora, or DTM and TCM matrices. This iterator is usually used in the following functions: create_vocabulary, create_corpus, create_dtm, vectorizers, create_tcm. See them for details. create_vocabulary: this function collects unique terms and the corresponding statistics. See below for details.
  • #34 Email, …
  • #36 Shiny is a toolkit from RStudio that makes creating web applications much easier (HTML, CSS, Java, JavaScript and jQuery). Shiny is licensed GPLv3, and the source is available on GitHub.
  • #37 Shiny is a toolkit from RStudio that makes creating web applications much easier (HTML, CSS, Java, JavaScript and jQuery). Shiny is licensed GPLv3, and the source is available on GitHub.
  • #38 One-line install
  • #39 Two files: ui.R and server.R
  • #48 Change "crawler" to "scraper"
  • #52 Benchmarking : AML can quickly present a lot of models using the same training set Detecting Target Leakage: AML builds candidate models extremely fast in an automated way Diagnostics: Diagnostics can be automatically generated such as learning curves, feature importances, etc. Automation : Tasks like exploratory data analysis, pre-processing of data, model selection and putting models into production can be automated.