Elasticsearch + Oncrawl = <3
A SaaS SEO Monitoring solution by
Presentation by Tanguy Moal
@tuxnco
Meetup Elasticsearch Paris #12
2015/01/22
22/01/15 Oncrawl · Elasticsearch Meetup France #12 2
[tuxnco@hal]:/opt$ whoami
- age: 0x20
- kids: 0x02
- hobbies:
- tech founder & cto at cogniteev
- search, natural language processing, datamining
- misc.
- history:
- r&d engineer @ exalead
- r&d engineer @ jobijoba
Presentation plan
Introduction to Oncrawl
Oncrawl technical overview
hadoop-elasticsearch within Oncrawl
Oncrawl API
Scaling Oncrawl infrastructure with Saltstack.
Conclusion / Questions
Introduction
Oncrawl: SEO Monitoring
- The SEO game has changed:
- Websites are getting bigger and harder to maintain
- Several indicators to monitor
- SaaS to the rescue (Moz, Ranks, Majestic SEO, Botify, Deepcrawl, …)
Oncrawl: SEO Monitoring
- Analysis performed through crawl reports
- SEO monitoring follows 5 axes:
- Performance
- HTML quality
- Inlinks
- Outlinks
- Content
- Interactive Analysis (URL explorer)
- Planned: spotting crawl-over-crawl trends
Oncrawl: Pricing
Oncrawl: technical overview
Oncrawl: application architecture
Application scenario
- User has a plan and configured projects
- Plan grants privileges
- Used to allow project creation and triggering of crawls
- Each project may have associated crawls
- Each crawl contains a report
What data are involved in a crawl report?
Links
- Important piece in serious SEO campaigns
- Key fields:
- origin, origin_domain, origin_depth
- target, target_domain, target_depth
- context:
- position in origin page
- anchor text
- wraps significant tags (hn, img, …)
- Use cases:
- list outlinks (resp. inlinks) of a given page
- distinguish links used to go up (resp. down) the site’s tree
- anchor text analysis, …
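The link fields and use cases above can be illustrated with a minimal Python sketch. Only the field names come from the slide; the `Link` class, the field types, the helper functions and the sample URLs are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One crawled hyperlink; field names follow the slide, types are assumed."""
    origin: str
    origin_domain: str
    origin_depth: int
    target: str
    target_domain: str
    target_depth: int
    anchor_text: str = ""


def outlinks(links, url):
    """Links leaving the given page."""
    return [link for link in links if link.origin == url]


def inlinks(links, url):
    """Links pointing to the given page."""
    return [link for link in links if link.target == url]


def direction(link):
    """Classify a link as going 'up', 'down' or staying 'level' in the site's tree."""
    if link.target_depth < link.origin_depth:
        return "up"
    if link.target_depth > link.origin_depth:
        return "down"
    return "level"


# Hypothetical sample data.
links = [
    Link("http://example.com/", "example.com", 0,
         "http://example.com/blog", "example.com", 1, "Blog"),
    Link("http://example.com/blog", "example.com", 1,
         "http://example.com/", "example.com", 0, "Home"),
]
```

Comparing `origin_depth` with `target_depth` is one simple way to tell upward links from downward ones; the slides do not say how Oncrawl makes that distinction.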
Page model
- Key fields
- url
- domain
- hash
- fetch
- date, size, time
- HTTP headers
- HTTP status code | ignored (robots.txt|settings)
- parse
- title, hn, metas
- canonical
- seo
- depth, popularity, total inlinks
- outlinks breakdown (internal vs external, follow vs nofollow)
- word count, text to code ratio, duplicated fields, simhash
- Use cases
- stats on size/fetch time/status code, by depth or for pages matching any combination of criteria
- find pages with highest similarity to a given one
- find pages with duplicated properties (title, hn, …)
- The central piece of the puzzle. Wraps all metadata
relating to a given URL
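The "highest similarity" use case is commonly served by comparing simhash fingerprints: near-duplicate pages tend to have fingerprints that differ in only a few bits. The sketch below assumes integer fingerprints and made-up values; the slides do not say how Oncrawl computes or compares them, and both function names are hypothetical.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def most_similar(target: int, candidates: dict) -> list:
    """Return (url, distance) pairs sorted from closest to farthest."""
    scored = [(url, hamming_distance(target, fp)) for url, fp in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical 8-bit fingerprints, for illustration only.
pages = {
    "/a": 0b10110000,
    "/b": 0b10110001,  # differs from /a in one bit
    "/c": 0b01001111,  # differs from /a in every bit
}
others = {url: fp for url, fp in pages.items() if url != "/a"}
ranking = most_similar(pages["/a"], others)
```

A linear scan like this is only reasonable for small sets; it is shown to make the comparison explicit, not as a scalable lookup strategy.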
Hadoop & Elasticsearch.
Elasticsearch for Hadoop
- references
- overview http://www.elasticsearch.org/overview/hadoop/
- online documentation
http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/index.html
- github
- repo https://github.com/elasticsearch/elasticsearch-hadoop
- author https://github.com/costin
- features
- compatibility
- simplicity
- low footprint
- flexible
Oncrawl: hadoop-elasticsearch
- Apache Nutch (v1.x) uses HDFS (v2.x supports several storages through Apache Gora -- including elasticsearch -- but…)
- Stacked different custom hadoop jobs to compute Oncrawl’s custom attributes (duplicates, …)
- What about Apache Nutch’s ESIndexer?
- hadoop-elasticsearch does the job pretty well
- Relies on job’s configuration:
- es.resource(.read|.write)? : « index/type » (supports “late”
type routing from fields in collected output, e.g.
« my_index/{some_field} »)
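To make the "late" type routing concrete, the following Python sketch mimics how a write resource pattern such as `my_index/{some_field}` could be resolved against each output document. It only illustrates the idea and is not the connector's implementation; `resolve_resource` and the sample documents are hypothetical.

```python
import re

# Matches a {field} placeholder inside a resource pattern.
_PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def resolve_resource(pattern: str, doc: dict) -> str:
    """Replace each {field} in the pattern with the document's value for that field."""
    def substitute(match):
        field = match.group(1)
        if field not in doc:
            raise KeyError(f"document has no field {field!r} required by {pattern!r}")
        return str(doc[field])
    return _PLACEHOLDER.sub(substitute, pattern)


# Hypothetical documents: the value of some_field decides the target type.
docs = [
    {"url": "http://example.com/", "some_field": "page"},
    {"origin": "http://example.com/", "some_field": "link"},
]
targets = [resolve_resource("my_index/{some_field}", doc) for doc in docs]
```

With routing of this kind, a single job output can feed several types of the same index, for example pages and links.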
Oncrawl: hadoop-elasticsearch
• Reading from elasticsearch
– job.setInputFormat(EsInputFormat.class);
• Writing to elasticsearch
– job.setOutputFormat(EsOutputFormat.class);
– Map<Object, Object> value = new LinkedHashMap<Object, Object>();
– collector.collect(key, WritableUtils.toWritable(value));
Read \ Write     HDFS      Elasticsearch
HDFS             builtin   yes
Elasticsearch    yes       yes
Elasticsearch & Python
Oncrawl API
• Python / Flask:
– Lightweight
– Easy to deploy / mirror
– Clean syntax
• elasticsearch python client:
– simple API
– allows for fine tuning of the client (HTTP connection parameters, …)
• API’s mission: populate the graphs of the application’s reports
Oncrawl API
- Each graph on the app has a dedicated API endpoint
- Binds graph semantics to an elasticsearch query. Returns json data ready for rendering (d3.js, …)
- Example: summary of page load times
- 4 buckets :
- perfect (under 500ms)
- medium (between 500ms and 1000ms)
- slow (between 1000ms and 2000ms)
- too slow (beyond 2000ms)
- Expected output by plotting library:
Oncrawl API
- Queries are easy to compose using python
- Write & test it in Marvel
- Integrate in Flask API
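As an illustration of composing such a query in Python, the sketch below builds the page load time summary as an Elasticsearch range aggregation and flattens the response for a plotting library. The field name `fetch.time`, the aggregation name `load_time`, the label/value output shape and the sample counts are assumptions; only the four bucket boundaries come from the slides.

```python
# Bucket boundaries in milliseconds, as listed on the slide.
# In a range aggregation, "from" is inclusive and "to" is exclusive.
LOAD_TIME_RANGES = [
    {"key": "perfect", "to": 500},
    {"key": "medium", "from": 500, "to": 1000},
    {"key": "slow", "from": 1000, "to": 2000},
    {"key": "too slow", "from": 2000},
]


def load_time_query(field="fetch.time"):
    """Build a search body with a range aggregation; `field` is an assumed name."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "aggs": {
            "load_time": {"range": {"field": field, "ranges": LOAD_TIME_RANGES}}
        },
    }


def to_graph_data(response):
    """Flatten the aggregation buckets into label/value pairs for plotting."""
    buckets = response["aggregations"]["load_time"]["buckets"]
    return [{"label": b["key"], "value": b["doc_count"]} for b in buckets]


# Hand-written response in the shape of a range aggregation result (made-up counts).
sample_response = {
    "aggregations": {"load_time": {"buckets": [
        {"key": "perfect", "to": 500, "doc_count": 120},
        {"key": "medium", "from": 500, "to": 1000, "doc_count": 45},
        {"key": "slow", "from": 1000, "to": 2000, "doc_count": 20},
        {"key": "too slow", "from": 2000, "doc_count": 5},
    ]}}
}
graph_data = to_graph_data(sample_response)
```

Keeping the query builder and the response formatter as plain functions makes them easy to test without a running cluster; a Flask endpoint would then only need to run the search and return the formatted result as JSON.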
Elastic: Scale it
May I have the salt, please ?
Oncrawl scalability constraints
- 1 index per crawl
- size of indices ? S-M-L-XL
- sharding policy:
- S: 1 shard
- M: 3 shards
- L: 5 shards
- XL: 10 shards
- Hadoop cluster management
- Provisioned for a given number of concurrent crawl cycles
- HDFS grows with total clients
- Elasticsearch cluster management
- Build: same provision as hadoop cluster
- Storage / service:
- provisioned for 3 months of subscription
- Old indices:
- close & snapshot
- reopen on demand
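The sharding policy above can be expressed as a small lookup used when creating the index for a new crawl. This is a sketch under assumptions: the `crawl-<id>` naming scheme and the function names are hypothetical, and how a crawl is assigned a size class is not covered in the slides.

```python
# Shard count per index size class, as listed on the slide.
SHARDS_BY_SIZE = {"S": 1, "M": 3, "L": 5, "XL": 10}


def index_name(crawl_id: str) -> str:
    """One index per crawl; the naming scheme is an assumption."""
    return f"crawl-{crawl_id}"


def index_body(size_class: str) -> dict:
    """Index-creation body carrying the shard count for the given size class."""
    try:
        shards = SHARDS_BY_SIZE[size_class]
    except KeyError:
        raise ValueError(f"unknown size class: {size_class!r}") from None
    return {"settings": {"number_of_shards": shards}}


# Example: settings for a large crawl with the hypothetical id "42".
name = index_name("42")
body = index_body("L")
```

Because the number of primary shards is fixed when an index is created, the size class has to be chosen before the crawl is indexed.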
Saltstack
• Cluster with members having roles: master vs minions
• Each minion can be fully administered through the master
• Minions ask the master for enrollment
• The administrator on the master can either accept or decline minions
• Once a minion is accepted, it can be fully operated remotely
Saltstack
• A set of « recipes » defines what states are made of, and how to get there
• Recipes can use « jinja » templating so variable parts of configuration files can be rendered at deployment time
• Minions can have their role defined by several means:
– grains defined on the minion
– deployment specific rules, defined in « the pillar »
• Within Oncrawl, saltstack is used:
– To maintain index templates (config/templates/*json)
– To maintain elasticsearch clusters, nodes and shards allocation (config/settings.yml)
– To deploy the elasticsearch cluster, the hadoop cluster, staging and prod servers
• Deploy anything, anywhere (Droplets @ Digital Ocean, VMs @ Vultr, Instances @ AWS, dedicated servers @ OVH)
Thank you!
Follow us:
@tuxnco (me)
@cogniteev (company)
@oncrawl (product)
Part of the gang
Any questions?

Oncrawl elasticsearch meetup france #12

  • 1. Elasticsearch + Oncrawl = <3
       A SaaS SEO Monitoring solution by
       Presentation by Tanguy Moal (@tuxnco)
       Meetup Elasticsearch Paris #12, 2015/01/22
  • 2. [tuxnco@hal]:/opt$ whoami
       - age: 0x20
       - kids: 0x02
       - hobbies:
         - tech founder & cto at cogniteev
         - search, natural language processing, datamining
         - misc.
       - history:
         - r&d engineer @ exalead
         - r&d engineer @ jobijoba
  • 3. Presentation plan
       - Introduction to Oncrawl
       - Oncrawl technical overview
       - hadoop-elasticsearch within Oncrawl
       - Oncrawl API
       - Scaling Oncrawl infrastructure with Saltstack
       - Conclusion / Questions
  • 5. Oncrawl: SEO Monitoring
       - SEO Game has changed:
         - Websites are getting bigger, harder to maintain
         - Several indicators to monitor
       - SaaS to the rescue (Moz, Ranks, Majestic SEO, Botify, Deepcrawl, …)
  • 6. Oncrawl: SEO Monitoring
       - Analysis performed through crawl reports
       - SEO monitoring follows 5 axes:
         - Performance
         - HTML quality
         - Inlinks
         - Outlinks
         - Content
       - Interactive Analysis (URL explorer)
       - Planned: crawl-over-crawl trend spotting
  • 7. Oncrawl: Pricing
  • 9. Oncrawl: application architecture
  • 10. Boom. Boom2.
  • 11. Application scenario
        - User has a plan and configured projects
        - Plan grants privileges
          - Used to: allow project creation and triggering of crawls
        - Each project may have associated crawls
        - Each crawl contains a report
        What data are involved in a crawl report?
  • 12. Links
        - Important piece in serious SEO campaigns
        - Key fields:
          - origin, origin_domain, origin_depth
          - target, target_domain, target_depth
          - context:
            - position in origin page
            - anchor text
            - wraps significant tags (hn, img, …)
        - Use cases:
          - list outlinks (resp. inlinks) of a given page
          - distinguish links used to go up (resp. down) the site’s tree
          - anchor text analysis, …
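The two link use cases above can be sketched in a few lines of Python. This is only an illustration: the field names (`origin`, `target`, `origin_depth`, `target_depth`) come from the slide, but the query shape assumes the URL fields are indexed as exact, non-analyzed values, and the helper names `links_query` and `goes_up` are hypothetical.

```python
import json


def links_query(url, direction="in", size=100):
    """Build a search body listing the inlinks or outlinks of `url`.

    Assumes `origin` and `target` hold exact (non-analyzed) URLs.
    """
    if direction not in ("in", "out"):
        raise ValueError("direction must be 'in' or 'out'")
    # Inlinks point to the page (match on `target`);
    # outlinks start from the page (match on `origin`).
    field = "target" if direction == "in" else "origin"
    return {"size": size, "query": {"term": {field: url}}}


def goes_up(link):
    """True when the link leads closer to the root of the site's tree."""
    return link["target_depth"] < link["origin_depth"]


if __name__ == "__main__":
    print(json.dumps(links_query("http://example.com/", "in"), indent=2))
```

The body returned by `links_query` is a plain dictionary, so it can be handed to any Elasticsearch client as a search request.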
  • 13. Page model
        - Key fields
          - url
          - domain
          - hash
          - fetch
            - date, size, time
            - HTTP headers
            - HTTP status code | ignored (robots.txt|settings)
          - parse
            - title, hn, metas
            - canonical
          - seo
            - depth, popularity, total inlinks
            - outlinks breakdown (internal vs external, follow vs nofollow)
            - word count, text to code ratio, duplicated fields, simhash
        - Use cases
          - stats on size/fetch time/status code, by depth or for pages matching any combination of criteria
          - find pages with highest similarity to a given one
          - find pages with duplicated properties (title, hn, …)
        - The central piece of the puzzle. Wraps all metadata relating to a given URL
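The slide lists a `simhash` field and a "highest similarity" use case without saying how either is computed. As a hedged illustration, here is the textbook simhash scheme (word tokens, one 64-bit hash per token, majority vote per bit), with similarity measured as the Hamming distance between fingerprints. The tokenization and the choice of BLAKE2 as the token hash are assumptions, not Oncrawl's actual implementation.

```python
import hashlib

BITS = 64  # fingerprint width used in this sketch


def simhash(text):
    """Return a 64-bit simhash fingerprint of `text` (whitespace tokens)."""
    weights = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        token_hash = int.from_bytes(digest, "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of differing bits; smaller means more similar pages."""
    return bin(a ^ b).count("1")
```

Pages whose fingerprints differ by only a few bits are candidates for near-duplicate content; the threshold to use is a tuning decision that the slides do not specify.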
  • 15. Elasticsearch for Hadoop
        - references
          - overview: http://www.elasticsearch.org/overview/hadoop/
          - online documentation: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/index.html
          - github
            - repo: https://github.com/elasticsearch/elasticsearch-hadoop
            - author: https://github.com/costin
        - features
          - compatibility
          - simplicity
          - low footprint
          - flexible
  • 16. Oncrawl: hadoop-elasticsearch
        - Apache Nutch (v1.x) uses HDFS (v2.x supports several storages through Apache Gora -- including elasticsearch -- but…)
        - Several custom hadoop jobs are stacked to compute Oncrawl’s custom attributes (duplicates, …)
        - What about Apache Nutch’s ESIndexer?
          - hadoop-elasticsearch does the job pretty well
        - Relies on job’s configuration:
          - es.resource(.read|.write)?: « index/type » (supports “late” type routing from fields in collected output, e.g. « my_index/{some_field} »)
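To make the "late" routing pattern concrete, the sketch below mimics in plain Python how a resource such as `my_index/{some_field}` resolves against each collected document. It does not call the connector: `resolve_resource` is a hypothetical helper, and the `es.nodes` value is a placeholder.

```python
import re

# Job configuration as key/value pairs; the host is a placeholder.
job_conf = {
    "es.nodes": "localhost:9200",
    "es.resource.write": "my_index/{some_field}",
}


def resolve_resource(pattern, document):
    """Replace each {field} placeholder with the document's value."""

    def substitute(match):
        field = match.group(1)
        if field not in document:
            raise KeyError("document has no field %r" % field)
        return str(document[field])

    return re.sub(r"\{([^{}]+)\}", substitute, pattern)


if __name__ == "__main__":
    doc = {"some_field": "page", "url": "http://example.com/"}
    print(resolve_resource(job_conf["es.resource.write"], doc))
```

With this kind of pattern, a single job can route documents to different types depending on a field value instead of running one job per type.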
  • 17. Oncrawl: hadoop-elasticsearch
        • Reading from elasticsearch
          – job.setInputFormat(EsInputFormat.class);
        • Writing to elasticsearch
          – job.setOutputFormat(EsOutputFormat.class);
          – Map<Object, Object> value = new LinkedHashMap<Object, Object>();
          – collector.collect(key, WritableUtils.toWritable(value));

        Read \ Write   | HDFS    | Elasticsearch
        HDFS           | builtin | yes
        Elasticsearch  | yes     | yes
  • 19. Oncrawl API
        • Python / Flask:
          – Lightweight
          – Easy to deploy / mirror
          – Clean syntax
        • elasticsearch python client:
          – simple API
          – allows for fine tuning of the client (HTTP connection parameters, …)
        • API’s mission: populate the graphs of the application’s reports
  • 20. Oncrawl API
        - Each graph on the app has a dedicated API endpoint
        - Binds graph semantics to an elasticsearch query. Returns json data ready for the rendering (d3.js, …)
        - Example: Summary of page load times
          - 4 buckets:
            - perfect (under 500ms)
            - medium (between 500ms and 1000ms)
            - slow (between 1000ms and 2000ms)
            - too slow (beyond 2000ms)
          - Expected output by plotting library:
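One plausible way to express the four load-time buckets is an Elasticsearch `range` aggregation, where `from` is inclusive and `to` is exclusive. The sketch below builds such a request body and reshapes the response for a chart. The field name `fetch_time_ms`, the output shape, and the sample response are all assumptions made for illustration; the slide does not show the real query or the real payload.

```python
def load_time_summary_query(field="fetch_time_ms"):
    """Request body bucketing pages by load time (field name is assumed)."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "aggs": {
            "load_time": {
                "range": {
                    "field": field,
                    "ranges": [
                        {"key": "perfect", "to": 500},
                        {"key": "medium", "from": 500, "to": 1000},
                        {"key": "slow", "from": 1000, "to": 2000},
                        {"key": "too slow", "from": 2000},
                    ],
                }
            }
        },
    }


def to_chart_data(response):
    """Flatten the aggregation buckets into label/count pairs."""
    buckets = response["aggregations"]["load_time"]["buckets"]
    return [{"label": b["key"], "count": b["doc_count"]} for b in buckets]


# Hand-written response used only to exercise to_chart_data.
SAMPLE_RESPONSE = {
    "aggregations": {
        "load_time": {
            "buckets": [
                {"key": "perfect", "to": 500.0, "doc_count": 3},
                {"key": "medium", "from": 500.0, "to": 1000.0, "doc_count": 2},
                {"key": "slow", "from": 1000.0, "to": 2000.0, "doc_count": 1},
                {"key": "too slow", "from": 2000.0, "doc_count": 0},
            ]
        }
    }
}
```

Keeping the query builder and the reshaping step as separate functions matches the "one endpoint per graph" idea: each endpoint only has to pair a query with the output shape its chart expects.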
  • 21. Oncrawl API
        - Queries are easy to compose using python
        - Write & test it in Marvel
        - Integrate in Flask API
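Because Elasticsearch queries are nested JSON, composing them in Python amounts to combining small functions that return dictionaries. The sketch below is one hypothetical way to do it. The helper names and the fields `status_code`, `size` and `depth` are illustrative, and the `bool`/`filter` syntax shown belongs to later Elasticsearch releases (1.x clusters of the talk's era would wrap filters in a `filtered` query instead).

```python
def term(field, value):
    """Exact-match filter on one field."""
    return {"term": {field: value}}


def at_least(field, value):
    """Range filter keeping documents where field >= value."""
    return {"range": {field: {"gte": value}}}


def all_of(*filters):
    """Combine filters so that every one of them must match."""
    return {"bool": {"filter": list(filters)}}


def count_by(field, query):
    """Count matching documents per distinct value of `field`."""
    return {
        "size": 0,
        "query": query,
        "aggs": {"by_" + field: {"terms": {"field": field}}},
    }


# Pages answering 404 and weighing at least 1 KiB, counted by depth.
body = count_by("depth", all_of(term("status_code", 404), at_least("size", 1024)))
```

A body built this way can be pasted as JSON into Marvel's console for testing, then returned from the corresponding Flask endpoint once it behaves as expected.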
  • 22. Elastic: Scale it
        May I have the salt, please?
  • 23. Oncrawl scalability constraints
        - 1 index per crawl
          - size of indices? S-M-L-XL
          - sharding policy:
            - S: 1 shard
            - M: 3 shards
            - L: 5 shards
            - XL: 10 shards
        - Hadoop cluster management
          - Provisioned for a given number of concurrent crawl cycles
          - HDFS grows with total clients
        - Elasticsearch cluster management
          - Build: same provision as hadoop cluster
          - Storage / service:
            - provisioned for 3 months of subscription
          - Old indices:
            - close & snapshot
            - reopen on demand
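The sharding policy translates directly into index creation settings. The sketch below maps each size class from the slide to its shard count. The index naming scheme and the replica count are assumptions, and how a crawl gets classified as S, M, L or XL is not given in the slides, so it is left to the caller.

```python
# Shard counts per size class, as listed on the slide.
SHARDS_BY_SIZE = {"S": 1, "M": 3, "L": 5, "XL": 10}


def index_name(crawl_id):
    """One index per crawl; the 'crawl-' prefix is a made-up convention."""
    return "crawl-{}".format(crawl_id)


def index_settings(size_class, replicas=1):
    """Index creation body for a size class (replica count is assumed)."""
    if size_class not in SHARDS_BY_SIZE:
        raise ValueError("unknown size class: %r" % (size_class,))
    return {
        "settings": {
            "number_of_shards": SHARDS_BY_SIZE[size_class],
            "number_of_replicas": replicas,
        }
    }
```

Since the number of primary shards is fixed when an index is created, choosing the size class before the crawl is indexed is what makes a per-crawl policy like this workable.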
  • 24. Saltstack
        • Cluster with members having roles: master vs minions
        • Each minion can be fully administered through the master
        • Minions ask master for enrollment
        • Administrator on master can either accept or decline minions
        • Once a minion is accepted, it can be fully operated remotely
  • 25. Saltstack
        • A set of « recipes » defines what states are made of, and how to get there
        • Recipes can use « jinja » templating so variable parts of configuration files can be rendered at deployment time
        • Minions can have their role defined by several means:
          – grains defined on the minion
          – deployment specific rules, defined in « the pillar »
        • Within Oncrawl, saltstack is used:
          – To maintain indices templates (config/templates/*json)
          – To maintain elasticsearch clusters, nodes and shards allocation (config/settings.yml)
          – To deploy the elasticsearch cluster, the hadoop cluster, staging and prod servers
        • Deploy anything, anywhere (Droplets @ Digital Ocean, VMs @ Vultr, Instances @ AWS, dedicated servers @ OVH)
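The idea of rendering the variable parts of a configuration file at deployment time can be illustrated without Salt itself. The sketch below uses Python's standard `string.Template` as a stand-in for Jinja (Salt really uses Jinja; the substitution here is only to keep the example dependency-free). The pillar key `cluster_name` and the sample values are hypothetical; `id` is the grain Salt uses for the minion identifier.

```python
from string import Template

# Fragment of an elasticsearch.yml with two variable parts.
ELASTICSEARCH_YML = Template(
    "cluster.name: $cluster_name\n"
    "node.name: $node_name\n"
)


def render_config(pillar, grains):
    """Fill the template from deployment-wide and per-minion values."""
    return ELASTICSEARCH_YML.substitute(
        cluster_name=pillar["cluster_name"],  # shared, defined in the pillar
        node_name=grains["id"],               # specific to each minion
    )


if __name__ == "__main__":
    print(render_config({"cluster_name": "oncrawl-staging"}, {"id": "es-node-1"}))
```

The split mirrors the slide: values shared by a deployment live in the pillar, values describing one machine live in its grains, and the same template serves every node.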
  • 26. Thank you!
        Follow us: @tuxnco (me), @cogniteev (company), @oncrawl (product)
        Part of the gang
        Any questions?