Indexing Big Data on Amazon AWS

Presented by Scott Stults | OpenSource Connections - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012

Amazon Web Services offers a quick and easy way to build a scalable search platform, a flexibility that is especially useful when an initial data load needs hardware that is no longer required for day-to-day searching and adding new documents. This presentation covers one such approach, capable of enlisting hundreds of worker nodes to ingest data, tracking their progress, and relinquishing them back to the cloud when the job is done. The data set discussed is the collection of published patent grants available through Google Patents. A single Solr instance can easily handle searching the roughly 1 million patents issued between 2005 and 2010, but up to 50 worker nodes were necessary to load that data in a reasonable amount of time. The same basic approach was also used to make three sizes of PNG thumbnails of the patent grant TIFF images. In that case, 150 worker nodes generated 1.6 TB of data over the course of three days. In this session, attendees will learn how to leverage EC2 as a scalable indexer, along with tricks for using XSLT on very large XML documents.
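
Spreading an ingest job over many worker nodes implies some way of dividing the input files among them. The abstract does not say how that was done, so the following is only a minimal sketch under the assumption of a simple round-robin split; `assign_files` and the file names are hypothetical and are not taken from the talk.

```shell
# assign_files WORKER_COUNT FILE...
# Hypothetical helper (not from the talk): deals files out round-robin
# and prints one "worker_index file" pair per line.
assign_files() {
  workers=$1
  shift
  i=0
  for f in "$@"; do
    echo "$((i % workers)) $f"
    i=$((i + 1))
  done
}

# Example: three zip files shared between two workers.
assign_files 2 a.zip b.zip c.zip
```

Each worker would then process only the lines carrying its own index. Round-robin is one simple choice that keeps the per-worker file counts within one of each other.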

Transcript

  • 1. Indexing Big Data in the Cloud
  • 2. Me: Scott Stults, Co-Founder of OpenSource Connections. Solr / Lucene. Bash / Python / Java.
  • 3. Eric
  • 4. Big Data
  • 5. Big Data Wrangler
  • 6. How? Address a Real Project. Be Agile. Make Small Mistaeks Fast. Succeed BIG.
  • 7. USPTO Goals: Prototype Search UX. Prove Solr: Scales, Integrates, Excels.
  • 8. Scale?
  • 9. Our Approach: KISS. YAGNI. (This space intentionally left blank)
  • 10. Minimal Flair
  • 11. Record Everything!
  • 12. Some Numbers: Doc Count 1.1 Million; Zip Files 313; Docs per Zip File 4,000; Zip File Size 75M; File Size 300M.
  • 13. Testing: Start some servers. Process a batch. Check the clock.
  • 14. start_nodes

        # Launch $MAX_NODES m1.large instances from the worker AMI.
        # /dev/sdb is created from a snapshot and deleted on termination;
        # /dev/sdi1-3 are empty 10, 10 and 20 GiB volumes that are kept.
        start_nodes() {
          ec2-run-instances ami-1b814f72 \
            --block-device-mapping /dev/sdb=snap-48adde35::true \
            --block-device-mapping /dev/sdi1=:10:false \
            --block-device-mapping /dev/sdi2=:10:false \
            --block-device-mapping /dev/sdi3=:20:false \
            --instance-type m1.large \
            --key uspto-proto \
            --instance-count "$MAX_NODES" \
            --group default > ~/run-output
        }
  • 15. Gut Check: How fast can we do this? What can we do in parallel?
  • 16. Scaling: Raise our instance limit. xargs -P. GNU parallel.
  • 17. Shortcomings: SSH? Error recovery. One Solr.
  • 18. Alternatives: CloudFormation. Puppet / Chef. Multiple Cores / Shards. Hadoop.
  • 19. Success
  • 20. Victory Lap
  • 21. Instances / Time
  • 22. Thank You. https://github.com/sstults/patent-indexing @scottstults #o19s
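
Slide 16 names `xargs -P` as one way to run work in parallel. The transcript does not show how it was invoked, so this is a minimal sketch rather than the presenter's script: `run_batch` is a hypothetical name, the `echo` stands in for the real per-file work, and file names are assumed to contain no whitespace. Note that `-P` is an extension found in GNU and BSD `xargs`, not part of POSIX.

```shell
# run_batch MAX_PROCS FILE...
# Hypothetical helper (not from the talk): runs one placeholder job per
# file, with at most MAX_PROCS jobs running at the same time.
run_batch() {
  max_procs=$1
  shift
  # One file name per line in; xargs passes one name per job (-n 1)
  # and keeps up to $max_procs jobs running in parallel (-P).
  printf '%s\n' "$@" |
    xargs -P "$max_procs" -n 1 sh -c 'echo "processed $1"' _
}

# Example: process three files, two at a time. Output order may vary.
run_batch 2 a.zip b.zip c.zip
```

Because jobs finish in whatever order they complete, anything that depends on ordering should sort or key the output rather than rely on the input sequence.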
