Indexing Big Data in the Cloud
Me
             Scott Stults
Co-Founder of OpenSource Connections

            Solr / Lucene

        Bash / Python / Java


Eric




Big Data




Big Data Wrangler




How?
 Address a Real Project
       Be Agile
Make Small Mistaeks Fast
     Succeed BIG




USPTO Goals
Prototype Search UX

    Prove Solr:
      Scales
    Integrates
      Excels


Scale?




Our Approach

             KISS
            YAGNI

(This space intentionally left blank)




Minimal Flair




Record Everything!




Some Numbers

Doc Count                   1.1 Million
Zip Files                   313
Docs per Zip File           4,000
Zip File Size               75 MB
File Size                   300 MB




Testing
Start some servers
 Process a batch
 Check the clock
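
The three steps above can be sketched as a small timing wrapper. This is a minimal sketch, not the project's actual script: `time_batch` is a hypothetical helper, and `true` stands in for whatever command processes a real batch.

```shell
#!/usr/bin/env bash
# time_batch is a hypothetical helper: it runs any command and reports
# how many wall-clock seconds it took, plus the command's exit status.
time_batch() {
    local label="$1"
    shift
    local start end status
    start=$(date +%s)      # seconds since the epoch, before the batch
    "$@"                   # run the batch command with its arguments
    status=$?
    end=$(date +%s)        # seconds since the epoch, after the batch
    echo "$label took $((end - start))s (exit $status)"
    return "$status"
}

# Usage: 'true' is a placeholder for the real batch command.
time_batch "batch-001" true
```

Appending the output to a file (for example with `>> timings.log`) keeps a record of every run.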




start_nodes
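start_nodes() {
    # Launch $MAX_NODES m1.large instances. Each option line ends with a
    # backslash so the whole invocation stays one command.
    # /dev/sdb is created from a snapshot and deleted on termination;
    # /dev/sdi1-3 are empty 10, 10 and 20 GiB volumes that persist.
    ec2-run-instances ami-1b814f72 \
        --block-device-mapping '/dev/sdb=snap-48adde35::true' \
        --block-device-mapping '/dev/sdi1=:10:false' \
        --block-device-mapping '/dev/sdi2=:10:false' \
        --block-device-mapping '/dev/sdi3=:20:false' \
        --instance-type m1.large \
        --key uspto-proto \
        --instance-count "$MAX_NODES" \
        --group default > ~/run-output
}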




Gut Check

 How fast can we do this?

What can we do in parallel?




Scaling

Raise our instance limit

xargs -P / GNU parallel
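
As a rough illustration of the `xargs -P` approach, the sketch below fans a list of file names out to a fixed number of worker processes. `run_parallel` and the `echo` worker are hypothetical stand-ins; a real worker would unzip each file and send its documents to Solr.

```shell
#!/usr/bin/env bash
# run_parallel is a hypothetical helper: the first argument is the number
# of concurrent workers, the remaining arguments are the items to process.
run_parallel() {
    local jobs="$1"
    shift
    # NUL-delimit the items so odd file names survive, hand one item to
    # each worker (-n 1), and keep up to $jobs workers running (-P).
    printf '%s\0' "$@" |
        xargs -0 -n 1 -P "$jobs" sh -c 'echo "indexed $1"' _
}

# Usage: three placeholder zip names, two workers at a time.
# Output order may vary from run to run because workers finish independently.
run_parallel 2 a.zip b.zip c.zip
```

GNU parallel covers the same ground with `parallel -j 2 some_command ::: a.zip b.zip c.zip`, where `some_command` is again a placeholder.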




Shortcomings

     SSH?
 Error recovery
    One Solr




Alternatives
   CloudFormation
     Puppet / Chef
Multiple Cores / Shards
        Hadoop




Success




Victory Lap




Instances / Time




Thank You

https://github.com/sstults/patent-indexing

              @scottstults
                #o19s





Editor's Notes

  • #3 Co-founder; Solr and Lucene consulting; ~6 years; order of attack (obvious later); partner
  • #4 My partner Eric; his book
  • #5 Hands: who’s got a big data problem? Andrea Gail (Perfect Storm)
  • #6 Our goal; saddle bag full of data wranglin’ tools; herd of data from horizon to horizon!
  • #7 How we’ll do it; pair program; start small; we’ll go step-by-step
  • #9 1M documents; 0.5 TB index (storing, analyzing, term vectors, etc.)
  • #10 …In other words…
  • #11 Complex == hard to debug
  • #12 Command-line history becomes shell script
  • #13 Pull down a random sample (10) from Google; 2010 – 2005 (ish); uniform DTD; no file sizes
  • #15 Billing is per instance-hour; $0.32 for a Large
  • #17 Review xargs; Ole Tange
  • #18 Only when logged in; no error recovery; one Solr index
  • #19 CloudFormation: useful if we have a set of box types we need to start. Puppet / Chef: ditto; may be better at verifying box state before we start. Multiple cores/shards: relevancy loss; merge à la David Griffin @ Etsy. Hadoop: next time! Merge on the reduce; does sacrifice a little time (data movement)
  • #20 Record everything; time everything; start with the bare minimum; identify and fix bottlenecks / parallelize; scale test
  • #21 Reusable; 9 pages per patent; 60 M pages; 60 TB; 3 days
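
Note #15 gives the billing rule: per instance-hour, $0.32 for a Large. A back-of-the-envelope cost estimate might look like the sketch below. `estimate_cost` is a hypothetical helper, the instance count and duration are made-up inputs, and rounding partial hours up to a full hour is an assumption about how instance-hours are billed.

```shell
#!/usr/bin/env bash
# Rate from note #15: $0.32 per instance-hour for a Large, held in cents
# so the arithmetic stays in integers.
RATE_CENTS=32

# estimate_cost is a hypothetical helper.
# Arguments: number of instances, run time in minutes.
estimate_cost() {
    local instances="$1" minutes="$2"
    # Assumption: a partial hour is billed as a full hour, so round up.
    local hours=$(( (minutes + 59) / 60 ))
    local cents=$(( instances * hours * RATE_CENTS ))
    printf '$%d.%02d\n' $(( cents / 100 )) $(( cents % 100 ))
}

# Usage: 20 instances for 90 minutes (illustrative figures only).
# 90 minutes rounds up to 2 billable hours: 20 * 2 * $0.32 = $12.80.
estimate_cost 20 90
```

Under that rounding assumption, a run that finishes just after an hour boundary costs the same as one that uses the whole hour, which is worth keeping in mind when sizing batches.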