My weekend startup: seocrawler.co
Why and how Seocrawler.co was built: a talk for the WebCamp Zagreb 2013 conference. It presents the technical side of the project, with development advice for building a crawler/spider.

    Presentation Transcript

    • My weekend startup project SEOCRAWLER.CO
    • TEAM
        • Goran Čandrlić: Conversion, Google AdWords & Internet Marketing Specialist; Webiny cofounder
        • Hrvoje Hudoletnjak: Software developer; Microsoft ASP.NET/IIS MVP
    • WHY?
        • Target market: webmasters, site owners, marketers
        • Usage scenarios:
            • Get broken pages, redirects, noindex, nofollow, ...
            • On-site SEO quality
            • Crawl competitor pages and find out what they are doing
        • Business model: free, pay as you go, share and get credits
    • THE PLAN
        • Let's build a crawler
            • MVP version: download a CSV file of all pages
            • Public launch: browsing crawled pages online, payments
        • Let's spread the word
            • Use social channels to attract more users
        • Let's see what we're missing and what can be done better
            • Find out what people would be willing to pay for
            • Iterate, find new niche markets, ask and listen to people
    • GETTING HANDS DIRTY
        • ENGINE DEV
            • Basic engine: 2 days
            • Production ready (horizontal scalability, disaster recovery, ...): 60+ days
            • Find edge cases (broken HTML), keep crawler running for days/weeks without crashing
            • Analysis (tags and content)
            • Store reports for user filtering and browsing
        • WEB APP
            • Landing page + admin UI (Themeforest)
            • Communication with crawlers
            • Browse reports, filters
            • Payment gateway integration (PayPal)
            • Ticketing support system
    • CURRENT STATUS
        • 2.5M pages crawled
        • 150 GB transferred
        • 800 registered users
        • Most important things:
            • we (think we) know what we should do next
            • polished some edge cases, made the service more stable
            • got the word spread
            • got a speaking slot at WebCampZg!!
    • [Architecture diagram; labels: CLOUD STORAGE, RABBIT MQ, HTML, CSS, AJAX / WEBSOCKETS, USER, FRONT END, WEB APP, CRAWLERS, DB]
    • FRONT END / ADMIN UI
        • Landing page + admin theme from Themeforest
        • ASP.NET MVC 4
        • Entity Framework 5 (POCO, EF migrations)
        • DotNetOpenAuth for social login
        • EasyNetQ for RabbitMQ (pub/sub), CQS pattern for in-process messages
        • SignalR (full duplex: WebSockets or Ajax polling)
        • KnockoutJS, jQuery, Toastr
        • StructureMap IoC/DI, AutoMapper (DB entities <> DTO)
    • [Crawler service diagram; labels: RABBIT MQ, ADO.NET / EF, LOG, CONTROLLER, COMMAND/QUERY BUS (CQS), CRAWLER, CRAWLER WORKER (x3), ...]
    • CRAWLER SERVICE
        • Multi-threaded crawler (vs evented crawler)
        • Entity Framework 5: LINQ + raw SQL queries with EF + ADO.NET bulk insert
        • EasyNetQ, RabbitMQ, CQS pattern
        • StructureMap, HTMLAgilityPack, NLog
        • Protobuf
    • CRAWLER WORKER PROCESS
        • Start or resume
        • Resume: load state (SQL, serialized)
        • Get next page from queue (RabbitMQ, durable store)
        • Download HTML (200ms – 5sec delay), HEAD request for external links
        • Check statuses, canonical, redirects
        • Run page analysers, extract data for report, prepare for bulk insert
        • Find links
        • Check duplicated, blacklisted
        • Check robots.txt
        • Check if visited: cache & DB
        • Normalize & store to queue (RabbitMQ)
        • Save state every N pages (serialize with Protobuf, store byte[] to DB)
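The worker steps on this slide (take the next URL, find and normalize links, skip visited ones, checkpoint every N pages, resume from a checkpoint) can be illustrated with a minimal sketch. It is written in Python with only the standard library, not the project's C#/.NET code. The names `normalize`, `url_hash`, `crawl` and the `fetch_links` callback are hypothetical, an in-memory deque stands in for the RabbitMQ queue, and `pickle` stands in for Protobuf serialization.

```python
import hashlib
import pickle
from collections import deque
from urllib.parse import urldefrag, urljoin, urlsplit, urlunsplit


def normalize(base_url, href):
    """Resolve href against base_url, drop the fragment, lowercase scheme and host."""
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    parts = urlsplit(absolute)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def url_hash(url):
    """Fixed-size key for the visited set."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def crawl(start_url, fetch_links, state=None, save_state=None, save_every=2, max_pages=100):
    """Breadth-first crawl of a single host.

    fetch_links(url) returns the raw hrefs found on that page (hypothetical callback).
    state is a bytes snapshot from an earlier run, or None to start fresh.
    save_state(blob) receives a resumable snapshot every save_every pages.
    """
    if state is None:
        start = normalize(start_url, "")
        queue, visited, done = deque([start]), {url_hash(start)}, []
    else:
        queue, visited, done = pickle.loads(state)  # resume: load serialized state
    host = urlsplit(start_url).netloc.lower()

    while queue and len(done) < max_pages:
        url = queue.popleft()
        for href in fetch_links(url):
            link = normalize(url, href)
            parts = urlsplit(link)
            if parts.scheme not in ("http", "https") or parts.netloc != host:
                continue  # external or non-HTTP link: not followed in this sketch
            key = url_hash(link)
            if key not in visited:
                visited.add(key)
                queue.append(link)
        done.append(url)
        if save_state and len(done) % save_every == 0:
            save_state(pickle.dumps((queue, visited, done)))
    return done
```

Passing the last saved snapshot back in as `state` continues the crawl where it stopped. Robots.txt checks, blacklists, download delays and page analysis are left out to keep the sketch short, and `pickle` should only be used on snapshots the crawler wrote itself.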
    • RABBITMQ + EASYNETQ
        ADMIN UI:
            rabbitBus.OpenChannel(c => c.Publish(new RecreateReportMessage(id)));
        SERVICE:
            rabbitbus.Subscribe<RecreateReportMessage>("crawlerservice", message =>
            {
                _commandBus.Execute(new MakeReportCommand(message.ProjectId));
            });
    • COMMAND BUS (MEDIATOR)
            bool alreadyVisited = _bus.Request<bool>(new VisitedPageQuery.Input(projectId, urlHash));
            _bus.Execute(new SavePageCommand(pageData, webPage));

            public class SavePageReportHandler : IHandle<SavePageCommand>
            {
                // implementation
            }
        • Encapsulate command / query into classes
        • IoC / DI for finding and matching handler with command/query types
        • Easy unit testing
        • AOP: intercept query or command, pre/post execution (logging, auth, caching, ...)
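To make the mediator idea concrete, here is a minimal in-process command/query bus. It is a Python illustration of the pattern, not the project's C# bus: the `Bus` class and its methods are hypothetical, and the fields of `VisitedPageQuery` and `SavePageCommand` are assumptions loosely modeled on the names in the slide. A plain dictionary stands in for the IoC container that matches handlers to message types.

```python
from dataclasses import dataclass


class Bus:
    """Minimal in-process mediator: one handler per command or query type."""

    def __init__(self):
        self._handlers = {}
        self._interceptors = []

    def register(self, message_type, handler):
        self._handlers[message_type] = handler

    def intercept(self, hook):
        """hook(message) runs before every handler (logging, auth, ...)."""
        self._interceptors.append(hook)

    def _dispatch(self, message):
        for hook in self._interceptors:
            hook(message)
        return self._handlers[type(message)](message)

    def execute(self, command):
        """Commands change state and return nothing."""
        self._dispatch(command)

    def request(self, query):
        """Queries return a value and should not change state."""
        return self._dispatch(query)


@dataclass(frozen=True)
class VisitedPageQuery:
    project_id: int
    url_hash: str


@dataclass(frozen=True)
class SavePageCommand:
    project_id: int
    url_hash: str


def build_bus(visited, log):
    """Wire handlers to a shared visited set; log collects intercepted message names."""
    bus = Bus()
    bus.register(VisitedPageQuery, lambda q: (q.project_id, q.url_hash) in visited)
    bus.register(SavePageCommand, lambda c: visited.add((c.project_id, c.url_hash)))
    bus.intercept(lambda message: log.append(type(message).__name__))
    return bus
```

Because callers only know the bus and the message classes, a handler can be swapped for a fake in a unit test, and cross-cutting concerns such as logging live in the interceptor rather than in each handler.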
    • ISSUES
        • Everything will crash: net connection, DB, thread, VM, ...
        • Resuming / saving states
        • Memory issues/leaks with some frameworks
        • Don't optimize before profiling (memory, DB)
        • Log everything
        • DB indexes: how to store for fast filtering, paging
        • DB as queueing system (don't)
        • CQS: command / query separation
        • Broken HTML, crazy links
        • Cloud services: connections fail
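One common way to cope with "connections fail" is to retry transient errors with exponential backoff. The slides do not say how the project handles this, so the following is only a generic Python sketch; the `retry` helper and its parameters are hypothetical.

```python
import time


def retry(operation, attempts=5, base_delay=0.5,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(); on a transient error wait base_delay * 2**n and try again.

    The last failure is re-raised so the caller can log it and save state.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so the delays can be checked in a unit test without actually waiting.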
    • LEARNED
        • ORM
            • Go low level (raw SQL, bulk insert, SP) if needed
            • Profile: memory, SQL queries
            • Watch for 1st level cache (ORM unit of work or session)
        • NoSQL?
        • Caching in process, in memory
            • Plan moving to a separate service (Redis, ...)
        • SOA
        • Pipeline design
        • Pub/Sub, CQS pattern (Mediator)
        • Unit testing
        • Cloud resilience
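The in-process cache in front of the database can be sketched as a small bounded LRU cache of visited URL hashes. This is a Python illustration, not the project's code: the `VisitedStore` class is hypothetical, and `sqlite3` stands in for SQL Server so the example runs anywhere.

```python
import sqlite3
from collections import OrderedDict


class VisitedStore:
    """Bounded in-memory cache in front of a database table of visited URL hashes."""

    def __init__(self, connection, capacity=10000):
        self._db = connection
        self._db.execute("CREATE TABLE IF NOT EXISTS visited (hash TEXT PRIMARY KEY)")
        self._cache = OrderedDict()  # hash -> None, least recently used first
        self._capacity = capacity
        self.db_lookups = 0  # exposed so the cache's effect can be measured

    def _remember(self, url_hash):
        self._cache[url_hash] = None
        self._cache.move_to_end(url_hash)
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict the least recently used hash

    def add(self, url_hash):
        self._db.execute("INSERT OR IGNORE INTO visited (hash) VALUES (?)", (url_hash,))
        self._remember(url_hash)

    def contains(self, url_hash):
        if url_hash in self._cache:
            self._cache.move_to_end(url_hash)
            return True
        self.db_lookups += 1  # cache miss: fall back to the database
        row = self._db.execute(
            "SELECT 1 FROM visited WHERE hash = ?", (url_hash,)
        ).fetchone()
        if row is not None:
            self._remember(url_hash)
        return row is not None
```

Keeping the cache behind a small interface like this is what makes a later move to a separate service such as Redis a local change: only the class internals need replacing.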
    • HOSTING
        • Hosting: all on one server for now
            • Started with EC2
            • Migrated to Azure VM (higher HDD IO, faster CPU), BizSpark (free VM), free inbound traffic!
            • Now on Hetzner (dedicated, i7, 32GB RAM, 2xSSD, Win2012 = 60€/m)
        • Stack: Win 2012, SQL Server 2012, .NET 4.5, ASP.NET MVC 4
        • Load & stress testing (crawl 500k URLs)
        • Goal: 100 parallel crawlers on VM 2CPU 4GB RAM (OS, DB)
        • Will scale when needed
    • FUTURE PLANS
        • Fancy reports
        • Brand new web user interface
        • Integration with third-party services (MajesticSEO, ...)
        • Special page analysis
        • NoSQL (RavenDb or Redis) for caching
        • Warehouse DB for browsing crawled pages
        • Lucene for full text search (RavenDb)
        • Refactor crawler, pipeline design, async evented design
    • THANK YOU! QUESTIONS?
        • Hrvoje Hudoletnjak, m: hrvoje@hudoletnjak.com, t: twitter.com/hhrvoje
        • Goran Čandrlić, m: gorancandrlic@gmail.com, t: twitter.com/chande