My weekend startup
project
SEOCRAWLER.CO
TEAM

Goran Čandrlić
Conversion, Google AdWords &
Internet Marketing Specialist
Webiny Cofounder

Hrvoje Hudoletnjak
Software developer
Microsoft ASP.NET/IIS MVP
WHY?
Target market
Webmasters, site owners
Marketers

Usage scenarios
Get broken pages, redirects, noindex/nofollow pages, ...
On-site SEO quality
Crawl competitor pages and find out what they are doing

Business model
Free
Pay as you go
Share and get credits
THE PLAN
Let’s build a crawler
MVP version: download CSV file of all pages
Public launch: browsing crawled pages online, payments
Let’s spread the word
Use social channels to attract more users
Let’s see what we’re missing, what can be done better
Find out what people would like to pay
Iterate, find new niche markets, ask and listen to people
GETTING HANDS DIRTY
ENGINE DEV
Basic engine: 2 days
Production ready (horizontal scalability, disaster recovery, ...): 60+ days
Find edge cases (broken HTML), keep crawler running for days/weeks without crashing
Analysis (tags and content)
Store reports for user filtering and browsing

WEB APP
Landing page + admin UI (Themeforest)
Communication with crawlers
Browse reports, filters
Payment gateway integration (Paypal)
Ticketing support system
CURRENT STATUS
2.5M pages crawled
150 GB transferred
800 registered users
Most important things:
we (think we) know what we should do next
polished some edge cases, made the service more stable
got the word spread
got a speaking slot at WebCampZg!!
[Architecture diagram: USER ⇄ FRONT END WEB APP (HTML, CSS, AJAX/WebSockets) ⇄ RabbitMQ ⇄ CRAWLERS, backed by DB and cloud storage]
FRONT END / ADMIN UI
Landing page + admin theme from Themeforest
ASP.NET MVC 4
Entity Framework 5 (POCO, EF migrations)
DotNetOpenAuth for Social login
EasyNetQ for RabbitMQ (pub/sub), CQS pattern for in-process messaging
SignalR (full duplex: WebSockets with Ajax polling fallback)
KnockoutJS, jQuery, Toastr
StructureMap IoC/DI, AutoMapper (db entities <> DTO)
[Diagram: ADMIN UI CONTROLLER → COMMAND/QUERY BUS (CQS) → ADO.NET/EF, LOG; RabbitMQ connecting to the CRAWLER service, which runs multiple CRAWLER WORKER threads ...]
CRAWLER SERVICE
Multi-threaded Crawler (vs evented crawler)
Entity Framework 5 LINQ + raw SQL queries with EF + ADO.NET bulk insert
EasyNetQ, RabbitMQ, CQS pattern
StructureMap, HtmlAgilityPack, NLog
Protobuf
CRAWLER WORKER PROCESS
Start or Resume
Resume: load state (SQL, serialized)

Get next page from queue (RabbitMQ, durable store)
Download HTML (200ms – 5sec delay), HEAD requests for external links
Check statuses, canonical, redirects
Run page analysers, extract data for report, prepare for bulk insert
Find links
Check for duplicates and blacklisted URLs
Check robots.txt
Check if visited – cache & db
Normalize & store to queue (RabbitMQ)

Save state every N pages (Serialize with Protobuf, store byte[] to Db)
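Two of the loop steps above (link normalization and the visited-check hash) can be sketched in a few lines of C#. This is an illustration only, not the actual SEOCrawler code; the `CrawlSteps` class, method names, and the SHA1 hash choice are assumptions:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical sketch of two worker-loop steps: normalizing a link
// before queueing it, and hashing it for the "check if visited" lookup.
public static class CrawlSteps
{
    // Resolve a (possibly relative) href against the page URL and drop
    // the fragment, so "page#a" and "page#b" collapse to one queue entry.
    // System.Uri also lower-cases the scheme/host and strips default ports.
    public static string Normalize(string pageUrl, string href)
    {
        var absolute = new Uri(new Uri(pageUrl), href);
        return absolute.GetLeftPart(UriPartial.Query);
    }

    // Stable hash of the normalized URL, compact enough to index in the db.
    public static string UrlHash(string normalizedUrl)
    {
        using (var sha1 = SHA1.Create())
        {
            byte[] hash = sha1.ComputeHash(Encoding.UTF8.GetBytes(normalizedUrl));
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }
}
```

A worker would run `Normalize` on every extracted link, then check `UrlHash` against the in-memory cache first and the database second before publishing the URL to RabbitMQ.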
RABBITMQ + EASYNETQ
ADMIN UI
rabbitBus.OpenChannel(c => c.Publish(new RecreateReportMessage(id)));

SERVICE
rabbitBus.Subscribe<RecreateReportMessage>("crawlerservice", message =>
{
    _commandBus.Execute(new MakeReportCommand(message.ProjectId));
});
COMMAND BUS (MEDIATOR)
bool alreadyVisited =
    _bus.Request<bool>(new VisitedPageQuery.Input(projectId, urlHash));

_bus.Execute(new SavePageCommand(pageData, webPage));

public class SavePageReportHandler : IHandle<SavePageCommand>
{
    // implementation
}

Encapsulate commands / queries into classes
IoC / DI for finding and matching handler with command/query types
Easy unit testing
AOP: intercept query or command, pre/post execution (logging, auth, caching, ...)
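A stripped-down version of such a mediator fits in a few lines. This is a sketch under assumptions: the real project resolves handlers through StructureMap, while here a plain dictionary stands in for the container, and the example command/handler pair is illustrative:

```csharp
using System;
using System.Collections.Generic;

// Marker interface for commands and the handler contract shown above.
public interface ICommand { }
public interface IHandle<TCommand> where TCommand : ICommand
{
    void Execute(TCommand command);
}

// Minimal mediator: maps a command type to its single handler.
// A real bus would resolve handlers via IoC and could wrap them with
// interceptors (logging, auth, caching) for the AOP point above.
public class CommandBus
{
    private readonly Dictionary<Type, object> _handlers = new Dictionary<Type, object>();

    public void Register<TCommand>(IHandle<TCommand> handler) where TCommand : ICommand
    {
        _handlers[typeof(TCommand)] = handler;
    }

    public void Execute<TCommand>(TCommand command) where TCommand : ICommand
    {
        ((IHandle<TCommand>)_handlers[typeof(TCommand)]).Execute(command);
    }
}

// Example pair mirroring SavePageCommand / SavePageReportHandler
// from the slide; the Url payload is invented for the demo.
public class SavePageCommand : ICommand
{
    public readonly string Url;
    public SavePageCommand(string url) { Url = url; }
}

public class SavePageReportHandler : IHandle<SavePageCommand>
{
    public readonly List<string> Saved = new List<string>();
    public void Execute(SavePageCommand command) { Saved.Add(command.Url); }
}
```

Because each command is a plain class and each handler a small unit behind an interface, handlers can be unit-tested in isolation by constructing the command and asserting on the handler's state.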
ISSUES
Everything will crash: net connection, db, thread, VM, ...
Resuming / saving states
Memory issues/leaks with some frameworks
Don’t optimize before profiling (memory, db)
Log everything
DB indexes: how to store for fast filtering, paging
DB as queueing system (don’t)
CQS: command / query separation
Broken HTML, crazy links
Cloud services: connections fail
LEARNED
ORM
Go low level (raw SQL, bulk insert, SP) if needed
Profile: memory, SQL queries
Watch for 1st level cache (ORM unit of work or session)
NoSQL?

Caching
in process – in memory
Plan moving to separate service (Redis, ...)
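The planned move from in-process caching to a separate service is easier behind a small interface. A minimal sketch, assuming a hypothetical `ICache` abstraction (only the in-memory variant is shown; the Redis-backed one would implement the same interface):

```csharp
using System.Collections.Generic;

// Callers depend on this interface, so the in-process dictionary can
// later be swapped for a Redis-backed implementation without touching
// crawler code.
public interface ICache
{
    bool TryGet(string key, out string value);
    void Set(string key, string value);
}

// In-process, in-memory variant used first. Not thread-safe as written;
// a real crawler would use ConcurrentDictionary or lock around access.
public class InMemoryCache : ICache
{
    private readonly Dictionary<string, string> _store = new Dictionary<string, string>();

    public bool TryGet(string key, out string value)
    {
        return _store.TryGetValue(key, out value);
    }

    public void Set(string key, string value)
    {
        _store[key] = value;
    }
}
```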

SOA
Pipeline design
Pub/Sub, CQS pattern (Mediator)
Unit testing
Cloud resilience
HOSTING
Hosting:
All on one server for now
Started with EC2
Migrated to Azure VM (higher HDD IO, faster CPU), BizSpark (free VM), free inbound traffic!
Now on Hetzner (dedicated, i7, 32GB RAM, 2xSSD, Win2012 = 60€/m)

Stack: Win 2012, SQL Server 2012, .NET 4.5, ASP.NET MVC 4
Load & stress testing (crawl 500k URLs)
Goal: 100 parallel crawlers on VM 2CPU 4GB RAM (OS, DB)

Will scale when needed
FUTURE PLANS
Fancy reports
Brand new web user interface
Integration with 3rd party services (MajesticSEO, ...)
Special page analysis
NoSQL (RavenDb or Redis) for caching
Warehouse Db for browsing crawled pages
Lucene for full text search (RavenDb)
Refactor crawler, pipeline design, async evented design
THANK YOU! QUESTIONS?
Hrvoje Hudoletnjak
m: hrvoje@hudoletnjak.com
t: twitter.com/hhrvoje

Goran Čandrlić
m: gorancandrlic@gmail.com
t: twitter.com/chande

My weekend startup: seocrawler.co
