Indexing Big Data in the Cloud

Me
Scott Stults
Co-Founder of OpenSource Connections

Solr / Lucene

Bash / Python / Java

Eric


Big Data


Big Data Wrangler


How?
Address a Real Project
Be Agile
Make Small Mistakes Fast
Succeed BIG


USPTO Goals
Prototype Search UX

Prove Solr:
Scales
Integrates
Excels


Scale?


Our Approach

KISS
YAGNI

(This space intentionally left blank)


Minimal Flair


Record Everything!


Some Numbers

Doc Count: 1.1 million
Zip Files: 313
Docs per Zip File: 4,000
Zip File Size: 75 MB
File Size: 300 MB
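As a quick sanity check on these figures, the totals can be worked out with shell arithmetic. This is only a back-of-the-envelope sketch: the variable names are mine, and it assumes every zip file is about the quoted size and holds about the quoted number of documents.

```shell
#!/usr/bin/env bash
# Figures taken from the slide; treating them as uniform per-file values is an assumption.
zip_files=313
docs_per_zip=4000
zip_size_mb=75

# Upper estimate of the document count if every zip held 4,000 docs.
est_docs=$(( zip_files * docs_per_zip ))

# Total compressed data to download and unpack, in MB.
total_zip_mb=$(( zip_files * zip_size_mb ))

echo "estimated docs: ${est_docs}"
echo "total compressed size: ${total_zip_mb} MB"
```

The estimate comes out somewhat above the 1.1 million documents quoted, which suggests the 4,000 figure is a rounded per-file value rather than an exact count.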


Testing
Start some servers
Process a batch
Check the clock


start_nodes
start_nodes() {
  # Launch $MAX_NODES instances in one call and save the launch report to ~/run-output.
  ec2-run-instances ami-1b814f72 \
    --block-device-mapping '/dev/sdb=snap-48adde35::true' \
    --block-device-mapping '/dev/sdi1=:10:false' \
    --instance-type m1.large \
    --key uspto-proto \
    --instance-count "$MAX_NODES" \
    --group default > ~/run-output
}


Gut Check

How fast can we do this?

What can we do in parallel?


Scaling

Raise our instance limit

xargs -P
GNU parallel
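To show what fanning work out with `xargs -P` looks like, here is a small self-contained sketch. The worker script, the `batch-N.zip` file names, and the choice of four parallel jobs are all hypothetical; the worker only echoes, where a real one would unpack the zip and send its documents to Solr.

```shell
#!/usr/bin/env bash
# Scratch directory with empty placeholder "zip" files (hypothetical names).
workdir=$(mktemp -d)
for i in 1 2 3 4 5 6; do
  : > "$workdir/batch-$i.zip"
done

# Hypothetical worker: a real one would unzip "$1" and index its contents.
cat > "$workdir/index_zip.sh" <<'EOF'
#!/bin/sh
echo "indexed $(basename "$1")"
EOF

# Hand one file to each worker invocation, running up to 4 workers at a time.
# Completion order is not deterministic, so the output is sorted afterwards.
results=$(find "$workdir" -name '*.zip' -print0 \
  | xargs -0 -n 1 -P 4 sh "$workdir/index_zip.sh" \
  | sort)

echo "$results"
```

With GNU parallel installed, a roughly equivalent invocation would be `parallel -j 4 sh index_zip.sh ::: *.zip`. Either way, the per-file worker stays a plain script and the parallelism is handled entirely by the calling tool.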


Shortcomings

SSH?
Error recovery
One Solr


Alternatives
CloudFormation
Puppet / Chef
Multiple Cores / Shards
Hadoop


Success


Victory Lap


Instances / Time


Thank You

https://github.com/sstults/patent-indexing

@scottstults
#o19s

