
Analyse your SEO Data with R and Kibana


Using Screaming Frog to crawl a website
Using R for SEO Analysis
Using PaasLogs to centralize logs
Using Kibana to build fancy dashboards

Tutorial : www.data-seo.com



  1. 1. Analyse your SEO Data with R and Kibana June 10th, 2016 Vincent Terrasi
  2. 2. Vincent Terrasi -- SEO Director - Groupe M6Web CuisineAZ, PasseportSanté, MeteoCity, … -- Join the OVH adventure in July 2016 Blog : data-seo.com
  3. 3. Agenda - Mission : build a Real-Time Log Analysis Tool 1. Using Screaming Frog to crawl a website 2. Using R for SEO Analysis 3. Using PaasLogs to centralize logs 4. Using Kibana to build fancy dashboards 5. Test! “The world is full of obvious things which nobody by any chance ever observes.” (Sherlock Holmes)
  4. 4. Real-Time Log Analysis Tool - Crawler : • Screaming Frog • Google Analytics • R / Logs : • IIS Logs • Apache Logs • Nginx Logs
  5. 5. Using Screaming Frog
  6. 6. Screaming Frog : Export Data - Add your URL and click the Start button. When the crawl is finished, click the Export button and save the XLSX file.
  7. 7. Screaming Frog : Data ! "Address" "Content" "Status Code" "Status" "Title 1" "Title 1 Length" "Title 1 Pixel Width" "Title 2" "Title 2 Length" "Title 2 Pixel Width" "Meta Description 1" "Meta Description 1 Length" "Meta Description 1 Pixel Width" "Meta Keyword 1" "Meta Keywords 1 Length" "H1-1" "H1-1 length" "H2-1" "H2-1 length" "H2-2" "H2-2 length" "Meta Robots 1" "Meta Refresh 1" "Canonical Link Element 1" "Size" "Word Count" "Level" "Inlinks" "Outlinks" "External Outlinks" "Hash" "Response Time" "Last Modified" "Redirect URI" "GA Sessions" "GA % New Sessions" "GA New Users" "GA Bounce Rate" "GA Page Views Per Session" "GA Avg Session Duration" "GA Page Value" "GA Goal Conversion Rate All" "GA Goal Completions All" "GA Goal Value All" "Clicks" "Impressions" "CTR" "Position" "H1-2" "H1-2 length"
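These column names contain spaces, so the R snippets later in the deck wrap them in backticks. A minimal sketch, reusing the export file name from the read-files slide, for loading the export and inspecting which columns are present:

    library(readxl)   # read_excel()

    # Load the Screaming Frog export; skip = 1 drops the report title row above the header
    urls <- read_excel("internal_html_blog.xlsx", sheet = 1, col_names = TRUE, skip = 1)

    # List the exported columns; names with spaces are accessed with backticks
    names(urls)
    summary(urls$`Status Code`)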
  8. 8. Using R
  9. 9. Why R ? Scriptable • Big Community • Mac / PC / Unix • Open Source • 7500+ packages • Documentation
  10. 10. WheRe ? How ? https://www.cran.r-project.org/ - Rgui or RStudio
  11. 11. Using R : Step 1 - Export all URLs as "request";"section";"active";"speed";"compliant";"depth";"inlinks" - Packages : stringr, ggplot2, dplyr, readxl
  12. 12. R Examples - Crawl via Screaming Frog - Classify URLs by : Section, Load Time, Number of Inlinks - Detect Active Pages (min. 1 visit per month) - Detect Compliant Pages (canonical not equal, meta noindex, bad HTTP status code) - Detect Duplicate Meta
  13. 13. R : read files
     library(readxl)   # read_excel()

     # Read the xlsx export (skip = 1 drops the report title row above the header)
     urls <- read_excel("internal_html_blog.xlsx", sheet = 1, col_names = TRUE, skip = 1)

     # Or read a csv export (semicolon-separated)
     urls <- read.csv2("internal_html_blog.csv", sep = ";", header = TRUE)
  14. 14. Detect Active Pages
     # urls_select is the crawl data frame built from the export (e.g. urls_select <- urls)
     # Default : every URL is considered inactive
     urls_select$Active <- FALSE

     # A page is active if it received at least one GA session
     urls_select$Active[ which(urls_select$`GA Sessions` > 0) ] <- TRUE

     # Convert to factor
     urls_select$Active <- as.factor(urls_select$Active)
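As an optional aside (not in the original deck), a quick way to check the split:

    # Count and share of active vs. inactive pages
    table(urls_select$Active)
    prop.table(table(urls_select$Active))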
  15. 15. Classify URLs by Section
     library(stringi)   # stri_detect_fixed()

     # conf.csv holds one URL pattern per line, e.g. :
     #   /agenda/sorties-cinema/
     #   /agenda/parutions/
     #   /agenda/evenements/
     #   /agenda/programme-tv/
     #   /encyclopedie/
     schemas <- read.csv("conf.csv", header = FALSE, col.names = "schema", stringsAsFactors = FALSE)

     urls_select$Cat <- "no match"
     for (j in 1:nrow(schemas)) {
       urls_select$Cat[ which(stri_detect_fixed(urls_select$Address, schemas$schema[j])) ] <- schemas$schema[j]
     }
  16. 16. Classify URLs By Load Time
     urls_select$Speed <- NA
     urls_select$Speed[ which(urls_select$`Response Time` < 0.501) ] <- "Fast"
     urls_select$Speed[ which(urls_select$`Response Time` >= 0.501 & urls_select$`Response Time` < 1.001) ] <- "Medium"
     urls_select$Speed[ which(urls_select$`Response Time` >= 1.001 & urls_select$`Response Time` < 2.001) ] <- "Slow"
     urls_select$Speed[ which(urls_select$`Response Time` >= 2.001) ] <- "Slowest"
     urls_select$Speed <- as.factor(urls_select$Speed)
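The same bucketing can be written more compactly with cut(); this alternative is not in the deck but keeps the exact same thresholds:

    # Equivalent classification using cut() with the same breakpoints (already returns a factor)
    urls_select$Speed <- cut(urls_select$`Response Time`,
                             breaks = c(-Inf, 0.501, 1.001, 2.001, Inf),
                             labels = c("Fast", "Medium", "Slow", "Slowest"),
                             right  = FALSE)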
  17. 17. Classify URLs By Number of Inlinks
     urls_select$`Group Inlinks` <- "URLs with No Follow Inlinks"
     urls_select$`Group Inlinks`[ which(urls_select$`Inlinks` < 1) ] <- "URLs with No Follow Inlinks"
     urls_select$`Group Inlinks`[ which(urls_select$`Inlinks` == 1) ] <- "URLs with 1 Follow Inlink"
     urls_select$`Group Inlinks`[ which(urls_select$`Inlinks` > 1 & urls_select$`Inlinks` < 6) ] <- "URLs with 2 to 5 Follow Inlinks"
     urls_select$`Group Inlinks`[ which(urls_select$`Inlinks` >= 6 & urls_select$`Inlinks` < 11) ] <- "URLs with 6 to 10 Follow Inlinks"
     urls_select$`Group Inlinks`[ which(urls_select$`Inlinks` >= 11) ] <- "URLs with more than 10 Follow Inlinks"
     urls_select$`Group Inlinks` <- as.factor(urls_select$`Group Inlinks`)
  18. 18. Detect Compliant Pages
     # A page is not compliant if it has a bad HTTP status code,
     # a canonical not equal to its URL, or a meta robots noindex
     urls_select$Compliant <- TRUE
     urls_select$Compliant[ which(urls_select$`Status Code` != 200 |
                                  urls_select$`Canonical Link Element 1` != urls_select$Address |
                                  urls_select$Status != "OK" |
                                  grepl("noindex", urls_select$`Meta Robots 1`)) ] <- FALSE
     urls_select$Compliant <- as.factor(urls_select$Compliant)
  19. 19. Detect Duplicate Meta
     # Flag missing titles, descriptions and H1s
     urls_select$`Status Title` <- "Unique"
     urls_select$`Status Title`[ which(urls_select$`Title 1 Length` == 0) ] <- "Not Set"
     urls_select$`Status Description` <- "Unique"
     urls_select$`Status Description`[ which(urls_select$`Meta Description 1 Length` == 0) ] <- "Not Set"
     urls_select$`Status H1` <- "Unique"
     urls_select$`Status H1`[ which(urls_select$`H1-1 length` == 0) ] <- "Not Set"

     # Flag duplicates
     urls_select$`Status Title`[ which(duplicated(urls_select$`Title 1`)) ] <- "Duplicate"
     urls_select$`Status Description`[ which(duplicated(urls_select$`Meta Description 1`)) ] <- "Duplicate"
     urls_select$`Status H1`[ which(duplicated(urls_select$`H1-1`)) ] <- "Duplicate"

     urls_select$`Status Title` <- as.factor(urls_select$`Status Title`)
     urls_select$`Status Description` <- as.factor(urls_select$`Status Description`)
     urls_select$`Status H1` <- as.factor(urls_select$`Status H1`)
  20. 20. Generate CSV
     library(dplyr)   # select() and mutate()

     # Keep the useful columns and strip the domain from the URLs
     urls_light <- select(urls_select, Address, Cat, Active, Speed, Compliant, Level, Inlinks) %>%
       mutate(Address = gsub("http://moniste.fr", "", Address))

     # Rename the columns to match the Logstash csv filter
     colnames(urls_light) <- c("request", "section", "active", "speed", "compliant", "depth", "inlinks")

     # Write a semicolon-separated CSV
     write.csv2(urls_light, file = "file.csv", row.names = FALSE)
  21. 21. R : ggplot2 command - DATA : create the ggplot object and populate it with data (always a data frame), e.g. ggplot(mydata, aes(x=section, y=count, fill=active)). LAYERS : add layer(s), e.g. + geom_point(). FACET : used for conditioning on variable(s), e.g. + facet_grid(~rescode).
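As a minimal, self-contained illustration of the DATA + LAYERS idea above (the toy data frame is invented for this sketch, it is not from the deck):

    library(ggplot2)

    # Toy data : URL counts per section, split by active status
    mydata <- data.frame(
      section = rep(c("agenda", "encyclopedie"), each = 2),
      active  = rep(c(TRUE, FALSE), times = 2),
      count   = c(120, 30, 80, 60)
    )

    # DATA + aesthetic mapping, then one bar layer stacked by active status
    p <- ggplot(mydata, aes(x = section, y = count, fill = active)) +
      geom_bar(stat = "identity", position = "stack")
    print(p)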
  22. 22. ggplot2 : Geometry
  23. 23. R Chart : Active Pages
     # Crawled URLs per depth level, split by Active (depth < 12)
     urls_level_active <- group_by(urls_select, Level, Active) %>%
       summarise(count = n()) %>%
       filter(Level < 12)

     # Aesthetic + geometry
     p <- ggplot(urls_level_active, aes(x = Level, y = count, fill = Active)) +
       geom_bar(stat = "identity", position = "stack") +
       scale_fill_manual(values = c("#e5e500", "#4DBD33")) +
       labs(x = "Depth", y = "Crawled URLs")

     # Display
     print(p)
     # Save to file
     ggsave(file = "chart.png")
  24. 24. R Chart : GA Sessions
     # Sum of GA Sessions per category and compliance status
     urls_cat_gasessions <- aggregate(urls_select$`GA Sessions`,
                                      by = list(Cat = urls_select$Cat, urls_select$Compliant),
                                      FUN = sum, na.rm = TRUE)
     colnames(urls_cat_gasessions) <- c("Category", "Compliant", "GA Sessions")

     p <- ggplot(urls_cat_gasessions, aes(x = Category, y = `GA Sessions`, fill = Compliant)) +
       geom_bar(stat = "identity", position = "stack") +
       theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
       labs(x = "Section", y = "Sessions") +
       scale_fill_manual(values = c("#e5e500", "#4DBD33"))

     # Display
     print(p)
     # Save to file
     ggsave(file = "chart.png")
  25. 25. R Chart : Compliant
     # URLs per category, compliance status and HTTP status code (200 and 301 only)
     urls_cat_compliant_statuscode <- group_by(urls_select, Cat, Compliant, `Status Code`) %>%
       summarise(count = n()) %>%
       filter(grepl(200, `Status Code`) | grepl(301, `Status Code`))

     p <- ggplot(urls_cat_compliant_statuscode, aes(x = Cat, y = count, fill = Compliant)) +
       geom_bar(stat = "identity", position = "stack") +
       theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
       facet_grid(`Status Code` ~ .) +
       labs(x = "Section", y = "Crawled URLs") +
       scale_fill_manual(values = c("#e5e500", "#4DBD33"))
  26. 26. R : SEO Cheat Sheet - Package dplyr : select() lets you rapidly zoom in on a useful subset of columns; mutate() adds new columns to a data frame or replaces existing ones; filter() selects a subset of rows. Package ggplot2 : aes(), geom_*, ggsave(). Package readxl : read_excel(). Base R : read.csv2(), write.csv2()
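Putting the cheat sheet together, a minimal end-to-end sketch (the file names are assumptions, reusing the export name from the earlier slides):

    library(readxl)
    library(dplyr)
    library(ggplot2)

    # Read the crawl, keep a light subset of columns, derive an Active flag, keep shallow pages
    urls <- read_excel("internal_html_blog.xlsx", sheet = 1, col_names = TRUE, skip = 1)
    urls_light <- urls %>%
      select(Address, Level, Inlinks, `GA Sessions`) %>%
      mutate(Active = `GA Sessions` > 0) %>%
      filter(Level < 12)

    # Export the table and save a quick bar chart of pages per depth level
    write.csv2(urls_light, file = "urls_light.csv", row.names = FALSE)
    p <- ggplot(urls_light, aes(x = Level, fill = Active)) + geom_bar()
    ggsave("active_by_depth.png", plot = p)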
  27. 27. ELK
  28. 28. Architecture - Hard to monitor and optimize host server performance
  29. 29. Architecture
  30. 30. Using PaasLogs
  31. 31. PaasLogs
  32. 32. PaasLogs - 164 nodes in the Elasticsearch cluster, 180 connected machines, between 100,000 and 300,000 logs processed per second, 12 billion logs flowing through every day, 211 billion documents stored. 8 clicks and 3 copy/pastes to use it !
  33. 33. PaasLogs : Step 1
  34. 34. PaasLogs : Step 2
  35. 35. PaasLogs
  36. 36. PaasLogs : Streams - The Streams are the recipients of your logs. When you send a log with the right stream token, it automatically arrives in your stream, inside an awesome piece of software named Graylog.
  37. 37. PaasLogs : Dashboards - The Dashboard is the global view of your logs. It is an efficient way to exploit your logs and to see global information such as metrics and trends about your data without being overwhelmed by log details.
  38. 38. PaasLogs : Aliases - The Aliases allow you to access your data directly from Kibana or through an Elasticsearch query. DON'T FORGET TO ENABLE KIBANA INDICES AND SET YOUR USER PASSWORD.
  39. 39. PaasLogs : Inputs - The Inputs allow you to ask OVH to host your own dedicated collector, such as Logstash or Flowgger.
  40. 40. PaasLogs : Network Configuration
  41. 41. PaasLogs : Logstash Plugins (custom grok patterns)
     OVHCOMMONAPACHELOG %{IPORHOST:clientip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion_num:float})?|%{DATA:rawrequest})" %{NUMBER:response_int:int} (?:%{NUMBER:bytes_int:int}|-)
     OVHCOMBINEDAPACHELOG %{OVHCOMMONAPACHELOG} "%{NOTSPACE:referrer}" %{QS:agent}
  42. 42. PaasLogs : Logstash Config
     if [type] == "apache" {
       grok {
         match        => [ "message", "%{OVHCOMBINEDAPACHELOG}" ]
         patterns_dir => "/opt/logstash/patterns"
       }
     }
     if [type] == "csv_infos" {
       csv {
         columns   => ["request", "section", "active", "speed", "compliant", "depth", "inlinks"]
         separator => ";"
       }
     }
  43. 43. How to send Logs to PaasLogs ?
  44. 44. Use Filebeat
  45. 45. Filebeat : Install
     Install Filebeat :
     curl -L -O https://download.elastic.co/beats/filebeat/filebeat_1.2.1_amd64.deb
     sudo dpkg -i filebeat_1.2.1_amd64.deb
     Documentation : https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-installation.html
  46. 46. Filebeat : Edit filebeat.yml
     filebeat:
       prospectors:
         - paths:
             - /home/ubuntu/lib/apache2/log/access.log
           input_type: log
           fields_under_root: true
           document_type: apache
         - paths:
             - /home/ubuntu/workspace/csv/crawled-urls-filebeat-*.csv
           input_type: csv
           fields_under_root: true
           document_type: csv_infos
     output:
       logstash:
         hosts: ["c002-5717e1b5d2ee5e00095cea38.in.laas.runabove.com:5044"]
         worker: 1
         tls:
           certificate_authorities: ["/home/ubuntu/workspace/certificat/key.crt"]
  47. 47. Filebeat : Start
     Copy / paste the certificate into key.crt :
     -----BEGIN CERTIFICATE-----
     MIIDozCCAougAwIBAgIJALxR4fTZlzQMMA0GCSqGSIb3DQEBCwUAMGgxCzAJBgNVBAYTAkZSMQ8wDQYDVQQIDAZGcmFuY2UxDjAMBgNVBAcMBVBhcmlzMQwwCgYDVQQKDANPVkgxCzAJBgNVBAYTAkZSMR0wGwYDVQQDDBRpbi5sYWFzLnJ1bmFib3ZlLmNvbTAeFw0xNjAzMTAxNTEzMDNaFw0xNzAzMTAxNTEzMDNaMGgxCzAJBgNVBAYTAkZSMQ8wDQYDVQQIDAZGcmFuY2UxDjAMBgNVBAcMBVBhcmlzMQwwCgYDVQQKDANPVkgx
     -----END CERTIFICATE-----
     Start / stop Filebeat :
     sudo /etc/init.d/filebeat start
     sudo /etc/init.d/filebeat stop
  48. 48. How to combine multiple sources ?
  49. 49. PaasLogs : Logstash Elasticsearch filter plugin
     Description : copies fields from previous log events in Elasticsearch to current events.
     if [type] == "apache" {
       elasticsearch {
         hosts  => "laas.runabove.com"
         index  => "logsDataSEO"   # alias
         ssl    => true
         query  => 'type:csv_infos AND request: "%{[request]}"'
         fields => [["speed","speed"], ["compliant","compliant"],
                    ["section","section"], ["active","active"],
                    ["depth","depth"], ["inlinks","inlinks"]]
       }
     }
     # TIP : fields => [[src,dest],[src,dest]]
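The same enrichment can be checked offline in R (not part of the deck) : once the access logs are parsed into a data frame with a request column, dplyr::left_join attaches the crawl attributes to each log line. The two toy data frames below are invented for the sketch:

    library(dplyr)

    # Toy parsed access logs (stand-in for the Apache logs shipped by Filebeat)
    access_logs <- data.frame(
      request   = c("/agenda/parutions/", "/encyclopedie/", "/inconnu/"),
      useragent = c("Googlebot", "Googlebot", "Mozilla"),
      stringsAsFactors = FALSE
    )

    # Toy crawl classification, mirroring the CSV produced on the "Generate CSV" slide
    urls_light <- data.frame(
      request = c("/agenda/parutions/", "/encyclopedie/"),
      section = c("/agenda/", "/encyclopedie/"),
      active  = c(TRUE, FALSE),
      stringsAsFactors = FALSE
    )

    # Enrich each log line with the crawl attributes, keyed on the request path
    logs_enriched <- left_join(access_logs, urls_light, by = "request")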
  50. 50. Using Kibana
  51. 51. Kibana : Install - Download Kibana 4.1 : • Download and unzip Kibana 4 • Extract your archive • Open config/kibana.yml in an editor • Set elasticsearch.url to point at your Elasticsearch instance • Run ./bin/kibana (or bin\kibana.bat on Windows) • Point your browser at http://yourhost.com:5601
  52. 52. Kibana : Edit kibana.yml
     server.port: 8080
     server.host: "0.0.0.0"
     elasticsearch.url: "https://laas.runabove.com:9200"
     elasticsearch.preserveHost: true
     kibana.index: "ra-logs-33078"
     kibana.defaultAppId: "discover"
     elasticsearch.username: "ra-logs-33078"
     elasticsearch.password: "rHftest6APlolNcc6"
  53. 53. Kibana : Line Chart - Number of active pages crawled by Google over a period of time
  54. 54. Kibana : Vertical Bar Chart
  55. 55. Kibana : Pie Chart
  56. 56. How to compare two periods ?
  57. 57. Kibana : Use Date Range
  58. 58. Final Architecture - Filebeat ships the IIS, Apache, Nginx and HAProxy logs to PaasLogs, with Kibana on top (soft real-time as well as old logs)
  59. 59. Test yourself - Use the Screaming Frog Spider Tool : www.screamingfrog.co.uk - Learn R : www.datacamp.com, www.data-seo.com, www.moise-le-geek.fr/push-your-hands-in-the-r-introduction/ - Test PaasLogs : www.runabove.com - Install Kibana : www.elastic.co/downloads/kibana
  60. 60. TODO List - Create a GitHub repository with all the source code - Add a Logstash plugin to do a reverse DNS lookup - Schedule a crawl from the command line - Upload the Screaming Frog file to a web server
  61. 61. Thank you Keep in touch June 10th, 2016 @vincentterrasi Vincent Terrasi
