An Introduction to Distributed Search with DataStax Enterprise Search


This is an overview of distributed search with Cassandra and Solr.

Published in: Technology

  1. 1. TOO BIG TO FAIL • An Introduction to Distributed Search with Cassandra and Solr • Open Source
  2. 2. ABOUT ME • Systems Analyst • Programming • Information Retrieval
  3. 3. CASSANDRA • Created at Facebook to power inbox search • Distributed data store run on commodity servers • Highly available • No single point of failure
  5. 5. SEARCH + CASSANDRA, 1 • First implementation: Solandra (originally Lucandra) • Replaced Lucene index with Cassandra column families
  6. 6. SEARCH + CASSANDRA, 2 • DataStax Enterprise Search • Uses native Lucene index • All data is retrieved from Cassandra
  7. 7. DataStax Enterprise Search Cluster • Distributed • Linearly Scalable • Highly Available • Eventually Consistent • Full-text search • Aggregation
  8. 8. SETTING UP THE SCHEMA
     <fields>
       <field name="id" type="string" indexed="true" stored="true"/>
       <field name="name" type="text" indexed="true" stored="true"/>
       <field name="body" type="text" indexed="true" stored="true"/>
       <field name="title" type="text" indexed="true" stored="true"/>
       <field name="date" type="string" indexed="true" stored="true"/>
     </fields>
  9. 9. WRITING TO CLUSTER • Write through either Cassandra clients or the Solr API • Write process is the same • True atomic updates to Cassandra
  10. 10. Cassandra nodes are set up according to row-key hash.
  11. 11. Data can be written directly to Cassandra
  12. 12. Data is distributed according to row-key hash and replication factor
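The placement rule on this slide can be sketched in a few lines of Python. This is a simplified illustration, not how Cassandra or DSE is implemented: the `Ring` class, the node names, and the use of MD5 as the hash are all assumptions made for the example, and each node owns a single token.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Hash a string onto the ring (MD5 is only a stand-in for the partitioner)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy token ring: one token per node, replicas chosen by walking clockwise."""

    def __init__(self, nodes, replication_factor):
        if not 1 <= replication_factor <= len(nodes):
            raise ValueError("replication factor must be between 1 and the node count")
        self.replication_factor = replication_factor
        # Sort nodes by token so the ring can be searched with bisect.
        self._ring = sorted((token(name), name) for name in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, row_key: str):
        """Return the nodes that would hold a copy of the row."""
        # First node whose token is >= the row key's token, wrapping past the end.
        start = bisect.bisect_left(self._tokens, token(row_key)) % len(self._ring)
        return [
            self._ring[(start + i) % len(self._ring)][1]
            for i in range(self.replication_factor)
        ]


if __name__ == "__main__":
    ring = Ring(["node1", "node2", "node3", "node4"], replication_factor=3)
    print(ring.replicas("article-42"))
```

The same row key always maps to the same replicas, which is what lets any node route a request without a central lookup table.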
  13. 13. DSE first writes to Cassandra
  14. 14. And then updates the secondary index in Solr
  15. 15. The quorum responds with success / failure
  16. 16. Data is now distributed evenly
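The write sequence on the preceding slides (store the row, update the search index, report success once a quorum has acknowledged) can be modelled roughly as follows. This is a toy sketch under stated assumptions: `Node`, `write_row`, and the in-memory dictionaries are hypothetical stand-ins for the example, not DSE APIs, and the model ignores updates to existing rows. The quorum size uses the usual majority rule, floor(RF / 2) + 1.

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


class Node:
    """Hypothetical replica holding a row store and a local search index."""

    def __init__(self, name: str, up: bool = True):
        self.name = name
        self.up = up
        self.rows = {}   # stand-in for the Cassandra column family
        self.index = {}  # stand-in for the local Solr index: term -> row keys

    def write(self, row_key: str, columns: dict) -> bool:
        """Store the row first, then index it. Returns True as an acknowledgement."""
        if not self.up:
            return False
        self.rows[row_key] = dict(columns)
        for value in columns.values():
            for term in str(value).lower().split():
                self.index.setdefault(term, set()).add(row_key)
        return True


def write_row(replicas, row_key: str, columns: dict) -> bool:
    """Send the write to every replica; succeed if a quorum acknowledges."""
    acks = sum(node.write(row_key, columns) for node in replicas)
    return acks >= quorum(len(replicas))


if __name__ == "__main__":
    replicas = [Node("a"), Node("b"), Node("c", up=False)]
    ok = write_row(replicas, "doc1", {"title": "Distributed Search"})
    print("write succeeded:", ok)
```

With three replicas and one of them down, two acknowledgements still meet the quorum of two, so the write is reported as successful.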
  17. 17. READING FROM CLUSTER • Read either Cassandra-side or through the Solr API • Cassandra: fast reads* • Solr: full-text search • Read direction affects performance • Data is stored in Cassandra
  18. 18. Query is sent to node
  19. 19. Node uses gossip to find who has the information
  20. 20. QUERYING CASSANDRA • Can query Solr or Cassandra directly • Limited syntax with CQL, can use solr_query parameter
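As an illustration of the `solr_query` parameter mentioned on this slide, the helper below builds a CQL statement that passes a Solr query string through CQL. It is a minimal sketch: the function name `solr_query_cql`, the keyspace `wiki`, and the table `articles` are hypothetical, and only the `title` field comes from the schema slide. It only constructs the statement text; executing it would require a driver session connected to a search-enabled cluster.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def solr_query_cql(keyspace: str, table: str, solr_query: str) -> str:
    """Build a CQL SELECT that forwards a Solr query via the solr_query column."""
    for name in (keyspace, table):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    # CQL string literals escape a single quote by doubling it.
    escaped = solr_query.replace("'", "''")
    return f"SELECT * FROM {keyspace}.{table} WHERE solr_query = '{escaped}';"


if __name__ == "__main__":
    # Hypothetical keyspace and table; 'title' is a field from the schema slide.
    print(solr_query_cql("wiki", "articles", "title:cassandra"))
```

In application code, a bound parameter supplied through the driver is usually preferable to string formatting; the string version is shown here only to make the shape of the query visible.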
  21. 21. Querying Cassandra directly
  22. 22. Cassandra retrieves information from the column family
  23. 23. Querying Solr index
  24. 24. Row-key hashes are stored in Solr, and Cassandra is queried for stored data
  25. 25. Cassandra node sends request to node with the corresponding hash, returns information
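The two-step read described on these slides (the index yields matching row keys, and the row data is then fetched from the store) might look like this in miniature. It is a toy model, not the DSE read path: `build_index`, `search`, and the sample rows are invented for the example, and both the index and the row store are plain dictionaries on one machine.

```python
def build_index(rows: dict) -> dict:
    """Map (field, term) -> set of row keys. The index holds keys only, not row data."""
    index = {}
    for row_key, columns in rows.items():
        for field, value in columns.items():
            for term in value.lower().split():
                index.setdefault((field, term), set()).add(row_key)
    return index


def search(index: dict, rows: dict, field: str, term: str) -> list:
    """Step 1: look up matching row keys. Step 2: fetch the stored rows by key."""
    matching_keys = sorted(index.get((field, term.lower()), set()))
    return [rows[key] for key in matching_keys if key in rows]


if __name__ == "__main__":
    # Made-up sample rows using field names from the schema slide.
    rows = {
        "doc1": {"title": "Distributed Search", "body": "Solr with Cassandra"},
        "doc2": {"title": "Column Families", "body": "Cassandra storage"},
    }
    index = build_index(rows)
    for row in search(index, rows, "body", "cassandra"):
        print(row["title"])
```

Keeping only keys in the index is what makes the row store the single source of truth for stored data, at the cost of a second lookup per hit.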
  26. 26. Data is always synced
  27. 27. Both nodes respond with information
  28. 28. Updates can be committed and searched over in real time
  29. 29. PRODUCTION USE • Will want a mix of analytics and search nodes
  30. 30. An OLTP - OLAP integrated solution
  31. 31. TRADEOFFS • Changing the Solr schema requires a reindex (standard for Solr) • No multi-valued fields or composite columns
  32. 32. Q&A