Sunbirst


Introducing Sunbirst, a distributed pipeline processing architecture for Apache Solr.

Sunbirst Presentation Transcript

  • Sunbirst: A distributed worker model for Apache Solr (@sleepyfox for sourcesense)
  • What’s in the box
      • Context
      • Problem definition
      • One possible solution
      • Discussion
      • ...
  • Where we are now
  • Existing system
      • Usual Solr production configuration:
          • High-volume search
          • Low-volume indexing
      • Our customer:
          • High-volume indexing
          • Low-volume search
  • Volumes
      • 3m new docs indexed/day
      • 60-day archive
      • = 180m docs indexed
      • 10k searches/day
      • = about 1 search every 9 seconds
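The arithmetic behind the Volumes slide can be checked with a short sketch. The input figures come from the slide; the steady-state assumption (every day brings the same load, spread evenly over 24 hours) and the variable names are ours.

```python
# Input figures from the slide; even spread over the day is an assumption.
DOCS_PER_DAY = 3_000_000
ARCHIVE_DAYS = 60
SEARCHES_PER_DAY = 10_000
SECONDS_PER_DAY = 24 * 60 * 60

# Documents held in the index once the 60-day window is full.
docs_in_index = DOCS_PER_DAY * ARCHIVE_DAYS

# Average gap between two searches, in seconds.
seconds_between_searches = SECONDS_PER_DAY / SEARCHES_PER_DAY

# Average indexing rate, for comparison with the search rate.
docs_indexed_per_second = DOCS_PER_DAY / SECONDS_PER_DAY

print(f"{docs_in_index:,} docs in the archive window")
print(f"one search every {seconds_between_searches:.2f} s on average")
print(f"{docs_indexed_per_second:.1f} docs indexed per second on average")
```

Under these assumptions the system indexes several hundred documents for every search it serves, which is the inversion of the usual Solr workload that the deck describes.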
  • Existing architecture
  • How it works
      • 2 rows, each 20 shards + coordinator
      • Partitioning algorithm = (id % 20)
      • Each shard has:
          • Solr instance
          • Indexer
          • Optimizer
          • Committer
          • Purger
  • How it works
      • Documents retrieved by coordinator in blocks of 500
      • These are allocated by id to shards according to the partitioning scheme
      • Shards poll metabases for their content
      • Shards index content
      • Coordinator archives content
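A minimal sketch of the allocation step described on the two slides above, assuming integer document ids. Only the block size of 500 and the `id % 20` rule come from the slides; the function names `shard_for`, `blocks` and `allocate` are hypothetical, and polling the metabases, indexing and archiving are left out.

```python
from collections import defaultdict
from itertools import islice

NUM_SHARDS = 20   # from the slide: 20 shards per row
BLOCK_SIZE = 500  # from the slide: coordinator retrieves blocks of 500


def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Partitioning scheme from the slide: id % 20."""
    return doc_id % num_shards


def blocks(doc_ids, size=BLOCK_SIZE):
    """Yield successive lists of at most `size` ids (hypothetical helper)."""
    it = iter(doc_ids)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def allocate(block):
    """Group one block of ids by the shard that should index them."""
    by_shard = defaultdict(list)
    for doc_id in block:
        by_shard[shard_for(doc_id)].append(doc_id)
    return dict(by_shard)


if __name__ == "__main__":
    for block in blocks(range(1, 1201)):
        allocation = allocate(block)
        print(len(block), "ids spread over", len(allocation), "shards")
```

With sequential ids, each block of 500 spreads evenly at 25 documents per shard; real ids may be less uniform.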
  • Challenges
      • Coordinator responsible for 2 things:
          • Archiving content
          • Routing searches
      • Redundant data flow from metabases
      • Partitioning scheme means about ((n-1)/n)*100 percent of docs move when a shard is added (n = shard count after the addition)
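The resharding cost in the last bullet can be illustrated by counting how many ids land on a different shard when the modulus changes. This is a sketch assuming uniformly distributed integer ids; `moved_fraction` is a hypothetical helper, not part of the system described.

```python
def moved_fraction(doc_ids, old_shards, new_shards):
    """Fraction of ids whose shard changes when the modulus changes."""
    ids = list(doc_ids)
    moved = sum(1 for i in ids if i % old_shards != i % new_shards)
    return moved / len(ids)


# Growing a 20-shard row to 21 shards. 420,000 is a multiple of
# lcm(20, 21) = 420, so the sample covers whole cycles of both moduli.
frac = moved_fraction(range(420_000), 20, 21)
print(f"{frac:.1%} of docs change shard")
```

For the 20-to-21 case the fraction is 20/21, so nearly every document would have to be re-indexed on a different shard.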
  • One possible future
  • Distributed workflow
      • Different worker pools:
          • Indexer
          • Searcher
          • Archiver
          • Coordinator
          • Content enricher...
  • Ingest Pipeline (diagram). Components: Coordinator, Ingester, Enricher, Archiver, Indexer. Labels: Ingest queue, Ref. data, Disk Archive, Solr Disk
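One way to picture the ingest pipeline is as a chain of stages fed from a queue. The sketch below is a single-process stand-in and not the Sunbirst implementation: the stage order (Ingester, Enricher, Archiver, Indexer) is read off the diagram labels and is an assumption, `REFERENCE_DATA`, `enrich` and `run_pipeline` are hypothetical names, and a plain list and dict stand in for the disk archive and Solr.

```python
from collections import deque

# Hypothetical reference data used by the enricher.
REFERENCE_DATA = {"en": "English"}


def enrich(doc):
    """Enricher stage: return a copy of the doc with a looked-up field."""
    doc = dict(doc)
    doc["language_name"] = REFERENCE_DATA.get(doc.get("lang"), "unknown")
    return doc


def run_pipeline(ingest_queue):
    """Drain the ingest queue through each stage in turn.

    In a distributed deployment each stage would be its own worker pool
    connected by messaging; here the stages run inline for clarity.
    """
    archive = []  # stand-in for the disk archive
    index = {}    # stand-in for Solr
    while ingest_queue:
        doc = ingest_queue.popleft()  # Ingester: take the next document
        doc = enrich(doc)             # Enricher: add reference data
        archive.append(doc)           # Archiver: persist the document
        index[doc["id"]] = doc        # Indexer: make it searchable
    return archive, index


queue = deque([{"id": 1, "lang": "en"}, {"id": 2, "lang": "xx"}])
archive, index = run_pipeline(queue)
print(len(archive), "archived;", len(index), "indexed")
```

Separating the stages this way means archiving and indexing no longer both hang off a single coordinator, which is the first challenge listed for the existing system.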
  • ESB
      • Orchestration, workflow and enterprise integration (EI) patterns by Apache ServiceMix
      • Messaging by Apache ActiveMQ
      • REST by Apache CXF
      • Runtime container by Apache Karaf
      • 100% Open Source Software
  • Call to arms
      • Designed to be more generic than the initial itch that needed scratching
      • Have Solr/Lucene committers
      • Happy to accept outside contributors
      • May eventually become an Apache Incubator project
      • Contact: Nigel Runnels-Moss
          • @sleepyfox on Twitter
          • n.runnels-moss@sourcesense.com
  • Questions