Apache SolrCloud

  1. 1. Searching in the Cloud (Arkadiusz Masiakiewicz, Michał Warecki)
  2. 2. About us Arkadiusz Masiakiewicz ● NoSQL databases ● Full-text search engines ● Neural networks ● Scuba diving Michał Warecki ● GC ● JIT compilers ● Concurrency ● Non-blocking algorithms ● Programming language runtimes ● Astronomy
  3. 3. Agenda ● Introduction to Apache Solr ● Data in PayU ● About SolrCloud ● Performance
  4. 4. About Apache Solr ● Full-text search server based on Apache Lucene ● Platform independent ● Buzzword compatible (sharding, replication, clustering, cloud, scaling, big data, ...)
  5. 5. Where to use Solr?
  6. 6. Who uses Solr?
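  7. 7. About Apache Solr (architecture diagram: a client application sends a query over HTTP to the Solr server, which runs in a web container, and receives a response; an indexer / scheduler sends documents and update queries; the Solr server uses its schema and config and stores data in a Lucene index)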
  8. 8. Platform independent (XML|JSON|PHP...)/HTTP
  9. 9. HTTP-based queries ● http://localhost:8080/solr/select?q=query – &start=50 – &rows=25 – &fq=field:ala – &facet=on&facet.field=category – &sort=dist(2, point1, point2) desc
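The parameters listed on the HTTP-based queries slide can also be assembled in code. The following is a minimal Java sketch that uses only the standard library to URL-encode each parameter and append it to the `/select` endpoint. The class `SolrQueryUrl` and its `build` and `encode` helpers are hypothetical names chosen for illustration; they are not part of Solr or SolrJ.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper (not part of Solr or SolrJ): builds a select URL by hand.
class SolrQueryUrl {

    // Joins URL-encoded key=value pairs onto <baseUrl>/select in insertion order.
    static String build(String baseUrl, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> p : params.entrySet()) {
            query.add(encode(p.getKey()) + "=" + encode(p.getValue()));
        }
        return baseUrl + "/select?" + query;
    }

    // Encodes a single key or value as UTF-8 form data.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "query");
        params.put("start", "50");
        params.put("rows", "25");
        params.put("fq", "field:ala");
        System.out.println(build("http://localhost:8080/solr", params));
    }
}
```

A `Map` keeps one value per key, so repeating a parameter (for example several `fq` filters) would need a list of pairs instead.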
  10. 10. Solr schema ● name ● type ● indexed ● stored ● multiValued ● required
  11. 11. Solr schema ● StrField – String (UTF-8 encoded string or Unicode). ● TextField – Text, usually multiple words or tokens. ● TrieDateField – Date field accessible for Lucene TrieRange processing. ● TrieDoubleField, TrieFloatField, TrieIntField, TrieLongField ● UUIDField – Universally Unique Identifier (UUID). Pass in a value of "NEW" and Solr will create a new UUID. ● CurrencyField – Supports currencies and exchange rates.
  12. 12. Data import (diagram: data feeders send XML documents to Solr) ● XML ● CSV ● File system ● Web spiders ● RSS/Atom ● POJO ● E-mail ● DB
  13. 13. Data configuration ● data source ● document – entity(query, delta query) ● field (diagram: the Data Import Handler (DIH) reads from the DB via SQL and sends update queries to Solr)
  14. 14. Plugins ● Search components ● Request handlers ● Process factory
  15. 15. Data in PayU ● ~100 GB index size ● ~400 million documents to search ● More than 30 fields in schema
  16. 16. Data flow in PayU ● 1: the client application sends search params to Solr ● 2: Solr returns DB IDs ● 3: the client application sends the DB IDs to the DB ● 4: the DB returns the data
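The data flow slide can be read as a two-step lookup: the search engine returns only database IDs, and the application then loads the full records from the database. The Java sketch below illustrates that reading with in-memory maps standing in for both Solr and the DB. The class `SearchThenLoad`, its fields, and the sample data are hypothetical, and the step order is an assumption based on the numbered arrows on the slide.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Toy stand-ins for the search index and the database; both are hypothetical.
class SearchThenLoad {
    private final Map<String, List<Long>> index; // search term -> matching DB ids
    private final Map<Long, String> database;    // DB id -> full record

    SearchThenLoad(Map<String, List<Long>> index, Map<Long, String> database) {
        this.index = index;
        this.database = database;
    }

    List<String> find(String term) {
        // Steps 1-2: send the search params, get back only DB ids.
        List<Long> ids = index.getOrDefault(term, Collections.<Long>emptyList());
        // Steps 3-4: load the full records for those ids from the database.
        List<String> records = new ArrayList<>();
        for (Long id : ids) {
            String record = database.get(id);
            if (record != null) {
                records.add(record);
            }
        }
        return records;
    }
}
```

Keeping only IDs in the search result leaves the database as the single source of the record contents, at the cost of a second round trip per search.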
  17. 17. Connecting to your old app (diagram: Hibernate criteria, JPA criteria, or custom criteria are translated into a Solr query such as q=field1:testvalue&wt=javabin)
  18. 18. Traditional Solr architecture (diagram: separate Solr instances for shard1, shard2, and their replicas, each with its own config and schema) ✗ manually copy config ✗ manually split index ✗ manually shard queries ✗ add replica ✗ manually copy config ✗ set up replication ✗ does not provide fail-over ✗ separate monitoring
  19. 19. SolrCloud architecture (diagram: a ZooKeeper cluster of three nodes, Zookeeper1 to Zookeeper3, each holding the configs and the cluster state; Solr instance1 hosts the Shard1 leader and a Shard2 replica; Solr instance2 hosts a Shard1 replica and the Shard2 leader; an "Add document" request enters the cluster)
  20. 20. Assigning machines (diagram: three ZooKeeper nodes) Number of shards: 3 ● Shard1
  21. 21. Assigning machines (diagram: three ZooKeeper nodes) Number of shards: 3 ● Shard1 ● Shard2
  22. 22. Assigning machines (diagram: three ZooKeeper nodes) Number of shards: 3 ● Shard1 ● Shard2 ● Shard3 Now you can search and index
  23. 23. Assigning machines (diagram: three ZooKeeper nodes) ● Shard1 ● Shard2 ● Shard3 ● Replica shard1
  24. 24. Assigning machines (diagram: three ZooKeeper nodes) ● Shard1 ● Shard2 ● Shard3 ● Replica shard1 ● Replica shard2
  25. 25. Assigning machines (diagram: three ZooKeeper nodes) ● Shard1 ● Shard2 ● Shard3 ● Replica shard1 ● Replica shard2 ● Replica shard3
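The machine assignment sequence on the preceding slides follows a simple pattern: with the number of shards fixed at 3, the first three machines to join become Shard1, Shard2, and Shard3, and each later machine becomes a replica of the next shard in turn. The Java sketch below expresses that round-robin rule. The class `ShardAssignment` and the zero-based `joinOrder` parameter are hypothetical names for illustration, and the rule is inferred from the slides rather than taken from Solr's code.

```java
// Hypothetical illustration of the join order shown on the slides.
class ShardAssignment {

    // joinOrder is zero-based: 0 for the first machine that joins the cluster.
    static String assign(int joinOrder, int numShards) {
        // Machines cycle through the shards in the order they join.
        int shard = joinOrder % numShards + 1;
        // The first numShards machines start the shards; later ones replicate them.
        String role = joinOrder < numShards ? "leader" : "replica";
        return "shard" + shard + " " + role;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 6; i++) {
            System.out.println("machine " + (i + 1) + " -> " + assign(i, 3));
        }
    }
}
```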
  26. 26. ZooKeeper configuration {"collection1":{ "shards":{ "shard1":{ "range":"80000000-8443ffff", "state":"active", "replicas":{ "10.205.33.92:8080_solr_collection1_shard1_replica2":{ "state":"active", "core":"collection1_shard1_replica2", "node_name":"10.205.33.92:8080_solr", "base_url":"http://10.205.33.92:8080/solr", "leader":"true"}, "10.205.33.93:8080_solr_collection1_shard1_replica1":{ "state":"active", "core":"collection1_shard1_replica1", "node_name":"10.205.33.93:8080_solr", "base_url":"http://10.205.33.93:8080/solr"}}},
  27. 27. SolrCloud ● Central Config in Zookeeper ● Automatic Fail-Over ● Near-Realtime ● Leader Election ● Optimistic Locking ● Durable writes
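Of the features listed on the SolrCloud slide, optimistic locking is the easiest to show in a few lines: an update carries the version the writer last saw and is rejected if the stored version has moved on. Solr exposes this idea through a per-document `_version_` field. The Java sketch below is an in-memory illustration of the general technique only; the class `VersionedStore` and its methods are hypothetical and do not reflect Solr's implementation.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory store illustrating version-checked updates.
class VersionedStore {

    private static final class Versioned {
        final long version;
        final String value;

        Versioned(long version, String value) {
            this.version = version;
            this.value = value;
        }
    }

    private final ConcurrentHashMap<String, Versioned> docs = new ConcurrentHashMap<>();

    // Creates the document at version 1; returns false if the id already exists.
    boolean create(String id, String value) {
        return docs.putIfAbsent(id, new Versioned(1L, value)) == null;
    }

    // Applies the update only if the caller saw the current version.
    boolean update(String id, long expectedVersion, String value) {
        Versioned current = docs.get(id);
        if (current == null || current.version != expectedVersion) {
            return false; // stale or missing: the caller must re-read and retry
        }
        // replace() succeeds only if no other writer swapped the entry meanwhile.
        return docs.replace(id, current, new Versioned(expectedVersion + 1, value));
    }

    // Returns the stored version, or -1 if the document does not exist.
    long versionOf(String id) {
        Versioned current = docs.get(id);
        return current == null ? -1L : current.version;
    }
}
```

No lock is held between reading and writing; a writer that loses the race simply gets `false` back and decides whether to retry.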
  28. 28. Sharding in SolrCloud (diagram: a collection, e.g. books, split into Shard1 with books part 1, Shard2 with books part 2, and Shard3 with books part 3, each shard with two replicas)
  29. 29. Sharding in SolrCloud “How many shards do we set up? 72?” “Eh, make it 100, a nice round number :-)”
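  30. 30. Sharding in SolrCloud ● Implicit document distribution – uniqueId.hashCode() % numServers. ● Composite key – groupId!uniqueId – groupId.hashCode() falls in a shard's hash range (diagram: four shards with hash ranges 1-10, 11-20, 21-30, 31-40; documents with hashCode=5, 10, 15, and 35 go to the shard whose range contains the value; a query with ?shard.keys=xxx! whose key has hashCode=35 goes only to the 31-40 shard)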
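The two routing rules on the sharding slide can be sketched in Java as follows. A plain ID is hashed as a whole, while a composite ID of the form `groupId!uniqueId` is routed by its group prefix, so all documents of one group land on the same shard. The class `ShardRouter` and its methods are hypothetical, and the sketch follows the slide's simplified `hashCode() % n` formula rather than Solr's actual router, which uses a different hash function and per-shard hash ranges.

```java
// Hypothetical router following the slide's simplified hashCode-modulo rule.
class ShardRouter {

    // Returns a zero-based shard index for the given document id.
    static int shardFor(String docId, int numShards) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(routingKey(docId).hashCode(), numShards);
    }

    // For "groupId!uniqueId" the group prefix decides the shard; otherwise the whole id does.
    static String routingKey(String docId) {
        int bang = docId.indexOf('!');
        return bang > 0 ? docId.substring(0, bang) : docId;
    }

    public static void main(String[] args) {
        System.out.println(shardFor("payu!order-1", 4));
        System.out.println(shardFor("payu!order-2", 4)); // same shard as the line above
        System.out.println(shardFor("order-3", 4));
    }
}
```

Because the group prefix alone picks the shard, a query that is limited to one group only needs to visit that shard.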
  31. 31. SolrJ ● HTTP Solr Server ● Embedded Solr Server – Does not require HTTP ● Cloud Solr Server – Pass ZooKeeper hosts
  32. 32. SolrJ POJO public class Item { @Field String id; @Field("cat") String[] categories; @Field List<String> features; } /* ... */ Item item = new Item(); item.id = "one"; item.categories = new String[] { "aaa", "bbb", "ccc" }; server.addBean(item);
  33. 33. Cache ● Filter cache ● Field value cache ● Query result cache ● Document cache ● User/Generic Caches
  34. 34. Cache implementations ● Least recently used (LRU) ● Fast LRU ● Least frequently used (LFU)
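The least-recently-used policy named on the cache implementations slide can be illustrated with the Java standard library alone. The sketch below builds a small LRU cache on top of `LinkedHashMap` in access order. The class `LruCache` is a hypothetical illustration of the eviction policy, not Solr's cache implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache: evicts the entry that was accessed longest ago.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxSize;

    LruCache(int maxSize) {
        // accessOrder = true makes get() move an entry to the most-recent end.
        super(16, 0.75f, true);
        this.maxSize = maxSize;
    }

    // Called by LinkedHashMap after each insertion; true drops the eldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxSize;
    }
}
```

With a capacity of 2, inserting `a` and `b`, reading `a`, and then inserting `c` evicts `b`, because `a` was touched more recently.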
  35. 35. Warming queries <lst> <str name="q">solr</str> <str name="sort">testDate desc</str> </lst>
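  36. 36. FQ vs Q ● A filter query (fq) only restricts the result set and does not take part in document scoring, so it is cheaper to evaluate than the main query (q), and its result can be reused from the filter cache ● Example: q=description:Potter&fq=type:book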
  37. 37. Questions?
  38. 38. Thank you!
