Indexing Big Data in the Cloud
Amazon Web Services offers a quick and easy way to build a scalable search platform, a flexibility that is especially useful when an initial data load requires hardware that is no longer needed for day-to-day searching and adding new documents. This presentation covers one such approach, capable of enlisting hundreds of worker nodes to ingest data, tracking their progress, and relinquishing them back to the cloud when the job is done. The data set discussed is the collection of published patent grants available through Google Patents. A single Solr instance can easily handle searching the roughly 1 million patents issued between 2005 and 2010, but up to 50 worker nodes were necessary to load that data in a reasonable amount of time. The same basic approach was also used to make three sizes of PNG thumbnails of the patent grant TIFF images; in that case 150 worker nodes were used to generate 1.6 Tb of data over the course of three days. In this session, attendees will learn how to leverage EC2 as a scalable indexer, along with tricks for using XSLT on very large XML documents.

Published in: Technology
  • Co-Founder; Solr and Lucene consulting; ~6 years; order of attack (obvious later); partner
  • My partner Eric; his book
  • Hands: who's got a big data problem? Andrea Gail -- Perfect Storm
  • Our goal: saddle bag full of data wranglin' tools; herd of data from horizon to horizon!
  • How we'll do it: pair program; start small; we'll go step-by-step
  • 1M documents; .5 Tb index (storing, analyzing, term vectors, etc.)
  • …In other words…
  • Complex == hard to debug
  • Command-line history becomes shell script
  • Pull down a random sample (10) from Google; 2005 to 2010 (ish); uniform DTD; no file sizes
  • Billing is per instance-hour: $0.32 for a Large
  • Review xargs; Ole Tange
  • Only when logged in; no error recovery; one Solr index
  • CloudFormation: useful if we have a set of box types we need to start. Puppet / Chef: ditto, may be better at verifying box state before we start. Multiple cores/shards: relevancy loss; merge a la David Griffin @ Etsy. Hadoop: next time! Merge on the reduce; does sacrifice a little time (data movement)
  • Record everything; time everything; start with the bare minimum; identify and fix bottlenecks / parallelize; scale test
  • Reusable; 9 pages per patent; 60 M pages; 60 Tb; 3 days
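The "xargs -P" and "record everything, time everything" notes above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the talk: the function name run_batch, the file list argument, and the no-op placeholder standing in for the real per-file work (unzip, transform, post to Solr) are all hypothetical.

```shell
# run_batch LIST WORKERS
# Sketch: read one file name per line from LIST and process up to
# WORKERS of them at the same time with xargs -P.
# The per-file work here is a hypothetical placeholder.
run_batch() {
    list="$1"
    workers="$2"
    xargs -P "$workers" -I{} sh -c '
        start=$(date +%s)
        : "placeholder for the real per-file work on $1"
        end=$(date +%s)
        # Record everything: one log line per file with elapsed seconds.
        echo "$1 $((end - start))s"
    ' _ {} < "$list"
}
```

Calling run_batch with a list of zip files and a worker count of 4 would keep four jobs running until the list is exhausted. Lines may arrive in any order because the jobs finish independently. GNU parallel, also named on the Scaling slide, covers the same ground with more control over output and retries.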

    1. Indexing Big Data in the Cloud
    2. Me: Scott Stults, Co-Founder of OpenSource Connections. Solr / Lucene; Bash / Python / Java
    3. Eric
    4. Big Data
    5. Big Data Wrangler
    6. How? Address a Real Project; Be Agile; Make Small Mistaeks Fast; Succeed BIG
    7. USPTO Goals: Prototype Search UX; Prove Solr: Scales, Integrates, Excels
    8. Scale?
    9. Our Approach: KISS; YAGNI (This space intentionally left blank)
    10. Minimal Flair
    11. Record Everything!
    12. Some Numbers:
          Doc Count            1.1 Million
          Zip Files            313
          Docs per Zip File    4,000
          Zip File Size        75M
          File Size            300M
    13. Testing: Start some servers; Process a batch; Check the clock
    14. start_nodes
          start_nodes() {
              ec2-run-instances ami-1b814f72 \
                  --block-device-mapping /dev/sdb=snap-48adde35::true \
                  --block-device-mapping /dev/sdi1=:10:false \
                  --block-device-mapping /dev/sdi2=:10:false \
                  --block-device-mapping /dev/sdi3=:20:false \
                  --instance-type m1.large \
                  --key uspto-proto \
                  --instance-count $MAX_NODES \
                  --group default > ~/run-output
          }
    15. Gut Check: How fast can we do this? What can we do in parallel?
    16. Scaling: Raise our instance limit; xargs -P; GNU parallel
    17. Shortcomings: SSH? Error recovery; One Solr
    18. Alternatives: CloudFormation; Puppet / Chef; Multiple Cores / Shards; Hadoop
    19. Success
    20. Victory Lap
    21. Instances / Time
    22. Thank You. https://github.com/sstults/patent-indexing @scottstults #o19s
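The start_nodes function on slide 14 saves the launch output to ~/run-output, and the description talks about relinquishing the nodes when the job is done. A minimal sketch of that last step might look like the following. The function names instance_ids and stop_nodes are hypothetical, and the parsing assumes the classic EC2 API tools output format in which each launched instance is reported on a line whose first field is INSTANCE and whose second field is the instance ID.

```shell
# Print the instance IDs found in saved ec2-run-instances output.
# Assumption: each instance appears on a line starting with INSTANCE,
# followed by its ID as the second whitespace-separated field.
instance_ids() {
    awk '$1 == "INSTANCE" { print $2 }' "$1"
}

# Hypothetical counterpart to start_nodes: terminate every node
# recorded in the given run-output file.
stop_nodes() {
    ids=$(instance_ids "$1")
    if [ -n "$ids" ]; then
        # Word splitting is intended here: one argument per instance ID.
        ec2-terminate-instances $ids
    fi
}
```

Running stop_nodes ~/run-output after the batch completes would hand the workers back. Since billing is per instance-hour, doing this promptly keeps the cost of a large run in check.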