My weekend startup
project
SEOCRAWLER.CO
TEAM

Goran Čandrlić
Conversion, Google AdWords &
Internet Marketing Specialist
Webiny Cofounder

Hrvoje Hudoletnjak
Software developer
Microsoft ASP.NET/IIS MVP
WHY?
Target market
Webmasters, site owners
Marketers

Usage scenarios
Get broken pages, redirects, noindex/nofollow pages, ...
On-site SEO quality
Crawl competitor pages and find out what they are doing

Business model
Free
Pay as you go
Share and get credits
THE PLAN
Let’s build a crawler
MVP version: download CSV file of all pages
Public launch: browsing crawled pages online, payments
Let’s spread the word
Use social channels to attract more users
Let’s see what we’re missing, what can be done better
Find out what people would like to pay
Iterate, find new niche markets, ask and listen to people
GETTING HANDS DIRTY
ENGINE DEV
Basic engine: 2 days
Production ready (horizontal scalability, disaster recovery, ...): 60+ days
Find edge cases (broken HTML), keep crawler running for days/weeks without crashing
Analysis (tags and content)
Store reports for user filtering and browsing

WEB APP
Landing page + admin UI (Themeforest)
Communication with crawlers
Browse reports, filters
Payment gateway integration (Paypal)
Ticketing support system
CURRENT STATUS
2.5M pages crawled
150 GB transferred
800 registered users
Most important things:
we (think we) know what we should do next
polished some edge cases, made the service more stable
got the word spread
got a speaking slot at WebCampZg!!
[Architecture diagram: USER ⇄ FRONT END WEB APP (HTML, CSS, AJAX/WebSockets) ⇄ RabbitMQ ⇄ CRAWLERS, backed by DB and cloud storage]
FRONT END / ADMIN UI
Landing page + admin theme from Themeforest
ASP.NET MVC 4
Entity Framework 5 (POCO, EF migrations)
DotNetOpenAuth for Social login
EasyNetQ for RabbitMQ (pub/sub), CQS pattern for in-process messaging
SignalR (full duplex: WebSockets with Ajax polling fallback)
KnockoutJS, jQuery, Toastr
StructureMap IoC/DI, AutoMapper (db entities <> DTO)
[Diagram: ADMIN UI CONTROLLER → COMMAND/QUERY BUS (CQS) → ADO.NET/EF, LOG; RabbitMQ connecting to the CRAWLER service, which runs multiple CRAWLER WORKER threads ...]
CRAWLER SERVICE
Multi-threaded Crawler (vs evented crawler)
Entity Framework 5 LINQ + raw SQL queries with EF + ADO.NET bulk insert
EasyNetQ, RabbitMQ, CQS pattern
StructureMap, HtmlAgilityPack, NLog
Protobuf
CRAWLER WORKER PROCESS
Start or Resume
Resume: load state (SQL, serialized)

Get next page from queue (RabbitMQ, durable store)
Download HTML (200ms – 5sec delay), HEAD requests for external links
Check statuses, canonical, redirects
Run page analysers, extract data for report, prepare for bulk insert
Find links
Check for duplicates and blacklisted URLs
Check robots.txt
Check if visited – cache & db
Normalize & store to queue (RabbitMQ)

Save state every N pages (Serialize with Protobuf, store byte[] to Db)
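Two of the loop steps above (link normalization and the visited-check hash) can be sketched in a few lines of C#. This is an illustration only, not the actual SEOCrawler code; the `CrawlSteps` class, method names, and the SHA1 hash choice are assumptions:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical sketch of two worker-loop steps: normalizing a link
// before queueing it, and hashing it for the "check if visited" lookup.
public static class CrawlSteps
{
    // Resolve a (possibly relative) href against the page URL and drop
    // the fragment, so "page#a" and "page#b" collapse to one queue entry.
    // System.Uri also lower-cases the scheme/host and strips default ports.
    public static string Normalize(string pageUrl, string href)
    {
        var absolute = new Uri(new Uri(pageUrl), href);
        return absolute.GetLeftPart(UriPartial.Query);
    }

    // Stable hash of the normalized URL, compact enough to index in the db.
    public static string UrlHash(string normalizedUrl)
    {
        using (var sha1 = SHA1.Create())
        {
            byte[] hash = sha1.ComputeHash(Encoding.UTF8.GetBytes(normalizedUrl));
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }
}
```

A worker would run `Normalize` on every extracted link, then check `UrlHash` against the in-memory cache first and the database second before publishing the URL to RabbitMQ.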
RABBITMQ + EASYNETQ
ADMIN UI
rabbitBus.OpenChannel(c => c.Publish(new RecreateReportMessage(id)));

SERVICE
rabbitBus.Subscribe<RecreateReportMessage>("crawlerservice", message =>
{
    _commandBus.Execute(new MakeReportCommand(message.ProjectId));
});
COMMAND BUS (MEDIATOR)
bool alreadyVisited =
    _bus.Request<bool>(new VisitedPageQuery.Input(projectId, urlHash));

_bus.Execute(new SavePageCommand(pageData, webPage));

public class SavePageReportHandler : IHandle<SavePageCommand>
{
    // implementation
}

Encapsulate commands / queries into classes
IoC / DI for finding and matching handler with command/query types
Easy unit testing
AOP: intercept query or command, pre/post execution (logging, auth, caching, ...)
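A stripped-down version of such a mediator fits in a few lines. This is a sketch under assumptions: the real project resolves handlers through StructureMap, while here a plain dictionary stands in for the container, and the example command/handler pair is illustrative:

```csharp
using System;
using System.Collections.Generic;

// Marker interface for commands and the handler contract shown above.
public interface ICommand { }
public interface IHandle<TCommand> where TCommand : ICommand
{
    void Execute(TCommand command);
}

// Minimal mediator: maps a command type to its single handler.
// A real bus would resolve handlers via IoC and could wrap them with
// interceptors (logging, auth, caching) for the AOP point above.
public class CommandBus
{
    private readonly Dictionary<Type, object> _handlers = new Dictionary<Type, object>();

    public void Register<TCommand>(IHandle<TCommand> handler) where TCommand : ICommand
    {
        _handlers[typeof(TCommand)] = handler;
    }

    public void Execute<TCommand>(TCommand command) where TCommand : ICommand
    {
        ((IHandle<TCommand>)_handlers[typeof(TCommand)]).Execute(command);
    }
}

// Example pair mirroring SavePageCommand / SavePageReportHandler
// from the slide; the Url payload is invented for the demo.
public class SavePageCommand : ICommand
{
    public readonly string Url;
    public SavePageCommand(string url) { Url = url; }
}

public class SavePageReportHandler : IHandle<SavePageCommand>
{
    public readonly List<string> Saved = new List<string>();
    public void Execute(SavePageCommand command) { Saved.Add(command.Url); }
}
```

Because each command is a plain class and each handler a small unit behind an interface, handlers can be unit-tested in isolation by constructing the command and asserting on the handler's state.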
ISSUES
Everything will crash: net connection, db, thread, VM, ...
Resuming / saving states
Memory issues/leaks with some frameworks
Don’t optimize before profiling (memory, db)
Log everything
DB indexes: how to store for fast filtering, paging
DB as queueing system (don’t)
CQS: command / query separation
Broken HTML, crazy links
Cloud services: connections fail
LEARNED
ORM
Go low level (raw SQL, bulk insert, SP) if needed
Profile: memory, SQL queries
Watch for 1st level cache (ORM unit of work or session)
NoSQL?

Caching
in process – in memory
Plan moving to separate service (Redis, ...)
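The planned move from in-process caching to a separate service is easier behind a small interface. A minimal sketch, assuming a hypothetical `ICache` abstraction (only the in-memory variant is shown; the Redis-backed one would implement the same interface):

```csharp
using System.Collections.Generic;

// Callers depend on this interface, so the in-process dictionary can
// later be swapped for a Redis-backed implementation without touching
// crawler code.
public interface ICache
{
    bool TryGet(string key, out string value);
    void Set(string key, string value);
}

// In-process, in-memory variant used first. Not thread-safe as written;
// a real crawler would use ConcurrentDictionary or lock around access.
public class InMemoryCache : ICache
{
    private readonly Dictionary<string, string> _store = new Dictionary<string, string>();

    public bool TryGet(string key, out string value)
    {
        return _store.TryGetValue(key, out value);
    }

    public void Set(string key, string value)
    {
        _store[key] = value;
    }
}
```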

SOA
Pipeline design
Pub/Sub, CQS pattern (Mediator)
Unit testing
Cloud resilience
HOSTING
Hosting:
All on one server for now
Started with EC2
Migrated to Azure VM (higher HDD IO, faster CPU), BizSpark (free VM), free inbound traffic!
Now on Hetzner (dedicated, i7, 32GB RAM, 2xSSD, Win2012 = 60€/m)

Stack: Win 2012, SQL Server 2012, .NET 4.5, ASP.NET MVC 4
Load & stress testing (crawl 500k URLs)
Goal: 100 parallel crawlers on VM 2CPU 4GB RAM (OS, DB)

Will scale when needed
FUTURE PLANS
Fancy reports
Brand new web user interface
Integration with 3rd party services (MajesticSEO, ...)
Special page analysis
NoSQL (RavenDb or Redis) for caching
Warehouse Db for browsing crawled pages
Lucene for full text search (RavenDb)
Refactor crawler, pipeline design, async evented design
THANK YOU! QUESTIONS?
Hrvoje Hudoletnjak
m: hrvoje@hudoletnjak.com
t: twitter.com/hhrvoje

Goran Čandrlić
m: gorancandrlic@gmail.com
t: twitter.com/chande

My weekend startup: seocrawler.co
