Mozscape: NoSQL at Terabyte Scale

Given at http://www.meetup.com/Seattle-Hadoop-HBase-NoSQL-Meetup/events/62000472/

    1. Mozscape: NoSQL at Terabyte Scale. Phil Smith, Software Engineer.
    2. What We Do: SEO & Inbound Marketing Metrics (www.opensiteexplorer.org).
    3. What We Do (www.opensiteexplorer.org): Collect back links across the web.
    4. What We Do (www.opensiteexplorer.org): Collect back links across the web. Compute metrics estimating value.
    5. What We Do (www.opensiteexplorer.org): Collect back links across the web. Compute metrics estimating value. Serve links and metrics with the API and OSE.
    6. How We Do: Crawl the web, ~25-30 billion pages per month.
    7. How We Do: Crawl the web, ~25-30 billion pages per month, on 20 crawler machines.
    8. How We Do: Crawl the web, ~25-30 billion pages per month, on 20 crawler machines, at a ~256 MB/sec aggregate download rate.
    9. How We Do: Compute aggregates and metrics. 1:5 to 1:50 compression ratios.
    10. How We Do: Compute aggregates and metrics. 1:5 to 1:50 compression ratios. Aggregates are parallelized linear scans.
    11. How We Do: Compute aggregates and metrics. 1:5 to 1:50 compression ratios. Aggregates are parallelized linear scans. Communication is avoided where possible (see the aggregation sketch after the transcript).
    12. How We Do: Surface with a read-only API. ~12 TB per release in Amazon S3.
    13. How We Do: Surface with a read-only API. ~12 TB per release in Amazon S3. 6 m2.4xlarge instances for cache.
    14. How We Do: Surface with a read-only API. ~12 TB per release in Amazon S3. 6 m2.4xlarge instances for cache. ~28k requests per minute.
    15. Observations and Strategy: Billions of small, similar records. De-normalization avoids complex joins. Batch-style processing emphasizes spatial locality.
    16. Data Layout: Column orientation exploits locality. Data is broken into 5 GB chunks for S3, with ~64 KB compression runs within (see the layout sketch after the transcript).
    17. Compression: Tuned to overcome the disk-read bound. By-column run and gap encoding on top of LZO, with customized pipelines per column (see the gap-encoding sketch after the transcript).
    18. Job Control: Each stage consists of parallel, idempotent tasks. Tasks are procs with easy command lines; stdout and exit code are logged to track state (see the task-runner sketch after the transcript).
    19. Checkpoints: S3 acts as a barrier between table scans. (Diagram: table scan, checkpoint to S3 at the barrier, then the next table scan, along a time axis; see the barrier sketch after the transcript.)
    20. Indexing: Each column has a BDB indexing by ID. A subset of IDs map to compression runs; decompress the run and scan to find the record (see the lookup sketch after the transcript).
    21. Physical Deployment: Crawlers run in a colo for white-listed IPs. Batch processing and the API layer run in EC2. The API might be in a colo too, but ELB + Autoscaling are nice.
    22. Questions? We're hiring!
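
Slides 10-11 describe the aggregate computations as parallelized linear scans with communication avoided where possible. Below is a minimal sketch of that pattern, assuming each worker owns one on-disk partition and only the small per-partition totals are merged at the end; the record fields (link count, value) are hypothetical, not Mozscape's actual schema.

```python
from concurrent.futures import ProcessPoolExecutor

def scan_partition(records):
    """Linear scan over one partition. Each worker touches only its own
    slice of the data, so nothing crosses worker boundaries until the
    tiny per-partition totals are merged at the end."""
    links = 0
    value = 0.0
    for _target_id, link_count, link_value in records:
        links += link_count
        value += link_value
    return links, value

def aggregate(partitions):
    # Fan partitions out to worker processes, then fold the small
    # partial results together in the driver.
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(scan_partition, partitions))
    return (sum(p[0] for p in partials), sum(p[1] for p in partials))

if __name__ == "__main__":
    # Toy partitions standing in for on-disk column chunks.
    parts = [[(1, 10, 0.5), (2, 3, 0.1)], [(3, 7, 0.9)]]
    print(aggregate(parts))  # (20, 1.5)
```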
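
Slide 16 describes the data layout: column-oriented files split into 5 GB chunks for S3, with roughly 64 KB compression runs inside each chunk. Here is a sketch of packing one column into fixed-size runs, assuming 64-bit integer values; zlib stands in for the LZO named in the deck, and the 5 GB chunk boundary is noted but not enforced here.

```python
import struct
import zlib

RUN_TARGET = 64 * 1024        # ~64 KB of raw bytes per compression run
# CHUNK_TARGET = 5 * 2**30    # runs would be grouped into ~5 GB S3 objects

def pack_column(values):
    """Pack a column of 64-bit ints into independently compressed runs.

    Because each run is small and self-contained, a reader can decompress
    a single run instead of the whole column, which is what makes the
    sparse index described in slide 20 cheap to follow."""
    runs, buf = [], bytearray()
    for v in values:
        buf += struct.pack("<Q", v)
        if len(buf) >= RUN_TARGET:
            runs.append(zlib.compress(bytes(buf)))  # LZO in the real pipeline
            buf = bytearray()
    if buf:
        runs.append(zlib.compress(bytes(buf)))
    return runs

runs = pack_column(range(100_000))
print(len(runs), "runs,", sum(map(len, runs)), "compressed bytes")
```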

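
Slide 17 mentions run and gap encoding layered on LZO, with a customized pipeline per column. The sketch below shows gap (delta) encoding of a sorted ID column into varints before general-purpose compression; zlib again stands in for LZO, and the varint framing is an assumption about how such a pipeline might be wired, not Mozscape's actual format.

```python
import zlib

def varint(n):
    """Encode a non-negative int as a little-endian base-128 varint."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def gap_encode(sorted_ids):
    """Store deltas between consecutive IDs instead of the IDs themselves."""
    out, prev = bytearray(), 0
    for i in sorted_ids:
        out += varint(i - prev)
        prev = i
    return bytes(out)

ids = list(range(0, 1_000_000, 3))        # toy sorted ID column
raw = b"".join(i.to_bytes(8, "little") for i in ids)
gapped = gap_encode(ids)
print(len(zlib.compress(raw)), "vs", len(zlib.compress(gapped)))
```

On a sorted, dense ID column the deltas are tiny and highly repetitive, so the gap-encoded stream compresses much further than the raw 64-bit values, which is the point of doing the by-column pre-encoding before the general-purpose compressor.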
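
Slide 18 treats every stage as a set of parallel, idempotent tasks: each task is just a process with an easy command line, and its stdout and exit code are logged to track state. Below is a sketch of such a task runner; the JSON log format and file name are assumptions.

```python
import json
import subprocess
import sys
import time

def run_task(cmd, log_path):
    """Run one task as a child process and append its outcome to a log.

    Tasks are assumed idempotent, so a failed or interrupted task can
    simply be launched again; the log is only used to decide what still
    needs to run."""
    started = time.time()
    proc = subprocess.run(cmd, capture_output=True, text=True)
    record = {
        "cmd": cmd,
        "exit_code": proc.returncode,
        "stdout": proc.stdout[-4096:],   # keep the tail for debugging
        "seconds": round(time.time() - started, 2),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return proc.returncode == 0

# A trivially idempotent example task.
ok = run_task([sys.executable, "-c", "print('scanned 1 chunk')"], "tasks.log")
print("success" if ok else "retry later")
```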
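
Slide 19's diagram shows S3 acting as a barrier between table scans: the next scan starts only once every checkpoint written by the previous one exists. A minimal barrier sketch follows; the key names are made up, and `key_exists` is a stub for what would be a HEAD request against S3 in practice.

```python
import time

def wait_for_barrier(expected_keys, key_exists, poll_seconds=30):
    """Block until every checkpoint object from the previous stage exists.

    `key_exists` abstracts the storage check; against S3 it would be a
    HEAD request on the checkpoint key. Combined with idempotent tasks
    (slide 18), a missing key just means that task gets re-run."""
    remaining = set(expected_keys)
    while remaining:
        remaining = {k for k in remaining if not key_exists(k)}
        if remaining:
            time.sleep(poll_seconds)

# Toy stand-in for S3: the checkpoints already "exist" in a local set.
written = {"scan-1/part-000.done", "scan-1/part-001.done"}
wait_for_barrier(["scan-1/part-000.done", "scan-1/part-001.done"],
                 key_exists=lambda k: k in written)
print("barrier cleared; the next table scan may start")
```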
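
Slide 20 describes lookup: each column has a BDB indexed by ID, only a subset of IDs are indexed, and each indexed ID points at a compression run; a read decompresses that one run and scans it for the record. The sketch below shows that two-step lookup, using a sorted in-memory list in place of the BDB, zlib in place of LZO, and a made-up fixed-width (id, payload) record format.

```python
import bisect
import struct
import zlib

def build(values, per_run=16):
    """Compress fixed-width (id, payload) records, sorted by id, into runs
    and keep a sparse index of the first id in each run (a BDB on disk in
    the real system)."""
    runs, first_ids = [], []
    for start in range(0, len(values), per_run):
        block = values[start:start + per_run]
        first_ids.append(block[0][0])
        raw = b"".join(struct.pack("<QQ", i, p) for i, p in block)
        runs.append(zlib.compress(raw))
    return first_ids, runs

def lookup(record_id, first_ids, runs):
    """Find the run that could hold record_id, decompress it, scan it."""
    run_no = bisect.bisect_right(first_ids, record_id) - 1
    if run_no < 0:
        return None
    raw = zlib.decompress(runs[run_no])
    for off in range(0, len(raw), 16):
        rid, payload = struct.unpack_from("<QQ", raw, off)
        if rid == record_id:
            return payload
    return None

first_ids, runs = build([(i, i * 10) for i in range(0, 1000, 3)])
print(lookup(300, first_ids, runs))   # -> 3000
```

Keeping the index sparse (one entry per run rather than per record) keeps it small; the cost is decompressing and scanning one small run per lookup, which the ~64 KB run size from slide 16 keeps cheap.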