©2012 DataStax 1DataStax Enterprise 3.xRealtime Analytics with SolrJason Rutherglen
©2012 DataStax 2•  Big Data Engineer at DataStax•  Co-author of ‘Programming Hive’and ‘Introduction to Solr’ fromO’ReillyA...
©2012 DataStax 3•  The company behind Cassandra•  Sells DataStax EnterpriseAbout DataStax
©2012 DataStax 4DataStax Enterprise 3.x
©2012 DataStax 5•  Single stack•  Cassandra•  Solr•  Hadoop•  Consulting•  SupportDataStax Enterprise
©2012 DataStax 6•  ZDNet article: “The biggest cloudapp of all: Netflix”•  http://zd.net/ZHtrmW•  Built on Cassandra•  “Ac...
©2012 DataStax 7•  Petabytes of growing data•  Hadoop is for batch work•  What are the solutions for realtime?What is Big ...
©2012 DataStax 8•  Near realtime•  1000 millisecond latencyWhat is realtime?
©2012 DataStax 9•  Cost of scaling to petabytes•  Physical limitationsWhy not relational databases?
©2012 DataStax 10•  Hadoop for batch•  Solr and Cassandra for realtime•  Gives most of relational capability at1/10 the co...
©2012 DataStax 11•  Distributed database heavy lifting•  Simple dynamo model•  Executes replication tasks extremelywellWhy...
©2012 DataStax 12•  Cassandra is easier, code is readable•  Fewer moving parts•  Multi-datacenter replication•  Enables lo...
©2012 DataStax 13•  HBase runs on HDFS•  HDFS is not designed for randomaccess IO•  Multiple hacks / products to performra...
©2012 DataStax 14•  Cassandra is peer to peer, there is nosingle point of failure (SPOF)•  The HDFS name node is a singlep...
©2012 DataStax 15•  Most of Facebook runs on MySQL•  Memcache front ends the readsHBase at Facebook
©2012 DataStax 16•  Hive with Hadoop•  A vague dialect of SQL•  Requires Java for UDFs•  Relational JoinsBatch Analytics
©2012 DataStax 17•  Solr•  SQL features except relational joins•  Use Hive for relational joins•  CEP (Complex Event Proce...
©2012 DataStax 18•  Storm, computes results onstreaming dataCEP (Complex Event Processing)
©2012 DataStax 19•  Java inverted indexing library•  Text analytics is raw computationover linear sets of data•  High spee...
©2012 DataStax 20•  Terms dictionary points to listdocument ids (integers)•  Tokenizes text•  Complete variety of computat...
©2012 DataStax 21•  Search server built around Lucene•  Adds faceting, distributed search•  Missed the cloud environmentfe...
©2012 DataStax 22•  Solr Cloud is a Zookeeper basedsystem•  New and probably not productionready•  Playing catch upSolr Cl...
©2012 DataStax 23•  High overlap with Solr•  More mature than Solr Cloud•  Less distributed features thanCassandraElastic ...
©2012 DataStax 24•  Columns, column families, keyspaces•  Peer to peer•  Eventual consistency•  Implements basic Google Bi...
©2012 DataStax 25•  Both implement a log structuredmerge tree file architectureLucene and Cassandra
©2012 DataStax 26•  Data is stored in Cassandra•  Data placement controlled byCassandra•  Solr is a secondary index (only)...
©2012 DataStax 27•  Separation of church and state, eg,data and indexDataStax Enterprise with Solr
©2012 DataStax 28•  Indexing is a CPU intensive task•  Not IO bound because of multi-threading•  When a thread is flushing...
©2012 DataStax 29•  IO bound, index needs to fit in RAM,then CPU bound•  Lucene enables multithreadingqueries•  Solr does ...
©2012 DataStax 30•  Eventual consistency, each node hasit’s own Lucene index•  Lucene segment files are notreplicated (lik...
©2012 DataStax 31•  Query requests are round robin’dacross nodes automaticallyDistributed Search Architecture
©2012 DataStax 32•  3.0.1 is the current release ofDataStax EnterpriseDSE 3.0.1
©2012 DataStax 33•  Ease of re-indexing•  Re-index the entire cluster or per-node•  Re-indexing occurs when the Solrschema...
©2012 DataStax 34•  Solr Cloud requires re-indexing froman external data source such as arelational databaseNew Features i...
©2012 DataStax 35•  DSE re-indexes directly fromCassandra•  No custom code is required for re-indexingNew Features in DSE ...
©2012 DataStax 36•  View the heap memory usage of thefield caches•  Perform capacity planningNew Features in DSE 3.0.1
©2012 DataStax 37•  Multithreaded re-indexing and repair•  Adding a new Solr node is fastNew Features in DSE 3.0.1
©2012 DataStax 38•  Kerberos and SSL security•  Security audit loggingNew Features in DSE 3.0.1
©2012 DataStax 39•  Near realtime: per-segment filters,facets, multivalue facets•  Solr 4.3DataStax Enterprise 3.1
©2012 DataStax 40•  vNodes•  Composite keysDataStax Enterprise 3.1
©2012 DataStax 41•  Multi datacenter live Solr schemaupdates and re-indexing•  CQL -> Solr queries, makes portingSQL appli...
©2012 DataStax 42Demo of Wikipedia
©2012 DataStax 43•  Details about every trade•  Tick data generated real time and isquantitatively query-able•  Too big to...
©2012 DataStax 44•  Computing the moving stock priceaverage in real time•  Comparing multiple moving averagesfor different...
©2012 DataStax 45•  Read latest ticks for a given company•  Query ticks for companies in specificverticals during large ev...
©2012 DataStax 46Real Time Stocks Demo
©2012 DataStax 47General
©2012 DataStax 48•  Like an SQL CREATE TABLEstatement•  Defines field types•  Defines fieldsSchema
©2012 DataStax 49•  XML based configuration options forSolrSolr Config
©2012 DataStax 50•  Commits new index segment to RAM•  Avoids ‘hard’ commit fsyncSoft Commit
©2012 DataStax 51Auto Soft Commit<!-- The default high-performance update handler --> !<updateHandler class="solr.DirectUp...
©2012 DataStax 52•  Loaded for sort and facet queries•  Uses heap spaceField Cache
©2012 DataStax 53•  Java based API for interacting with aSolr server•  DSE supports SolrJ/HTTP with nochangesSolrJ / HTTP
©2012 DataStax 54•  Auto data type mapping•  Copy fields•  Dynamic fieldsInsert data with CQL
©2012 DataStax 55•  Exists however is mainly useful fordebugging•  Limited functionality, queries a singlenodeCQL with Sol...
©2012 DataStax 56CQL Insert ExampleINSERT INTO wikipedia (key, text) !VALUES (1, when in rome)!
©2012 DataStax 57How to convert applications
©2012 DataStax 58•  Common to convert existing SQLapplications to Big Data•  Focus on the application functionalitySQL to ...
©2012 DataStax 59•  Cassandra makes all distributedoperations easySQL to Solr
©2012 DataStax 60•  SELECT * FROM wikipedia WHEREtype = ‘pdf’•  q=type:pdfSELECT WHERE
©2012 DataStax 61•  SELECT title,text FROM wikipedia•  q=*:*•  fl=title,textSELECT columns
©2012 DataStax 62•  SELECT COUNT(*) FROM wikipediaWHERE type = ‘pdf’•  q=type:pdf•  Get the num foundSELECT COUNT
©2012 DataStax 63•  SELECT * FROM stocks ORDER BYprice ASC•  q=*:*•  sort=price ascSELECT ORDER BY
©2012 DataStax 64•  SELECT AVG(price) FROM stocks•  q=*:*•  stats=true•  stats.field=price•  The average is called ‘mean’ ...
©2012 DataStax 65•  SELECT AVG(price) FROM stocksGROUP BY symbol•  q=*:*•  stats=true•  stats.field=price•  stats.facet=sy...
©2012 DataStax 66•  SELECT * FROM wikipedia WHEREtext LIKE ‘rom%’•  q=text:rom*SELECT WHERE LIKE
©2012 DataStax 67
Upcoming SlideShare
Loading in …5
×

C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

2,786 views

Published on

The presentation demonstrates how Solr may be used to create real-time analytics applications. In addition, Datastax Enterprise 3.0 will be showcased, which offers Solr version 4.0 with a number of improvements over the previous DSE release. A realtime financial application will run for the audience, and then a detailed look at how the application was built. An overview of Datastax Enterprise Solr features will be given, and how the many enhancements in DSE make it unique in the marketplace.

Published in: Technology
0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
2,786
On SlideShare
0
From Embeds
0
Number of Embeds
4
Actions
Shares
0
Downloads
68
Comments
0
Likes
3
Embeds 0
No embeds

No notes for slide

C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

  1. 1. ©2012 DataStax 1DataStax Enterprise 3.xRealtime Analytics with SolrJason Rutherglen
  2. 2. ©2012 DataStax 2•  Big Data Engineer at DataStax•  Co-author of ‘Programming Hive’and ‘Introduction to Solr’ fromO’ReillyAbout the Presenter
  3. 3. ©2012 DataStax 3•  The company behind Cassandra•  Sells DataStax EnterpriseAbout DataStax
  4. 4. ©2012 DataStax 4DataStax Enterprise 3.x
  5. 5. ©2012 DataStax 5•  Single stack•  Cassandra•  Solr•  Hadoop•  Consulting•  SupportDataStax Enterprise
  6. 6. ©2012 DataStax 6•  ZDNet article: “The biggest cloudapp of all: Netflix”•  http://zd.net/ZHtrmW•  Built on Cassandra•  “According to Cockcroft, if somethinggoes wrong, Netflix can continue torun the entire service on two out ofthree zones”Cassandra at Netflix
  7. 7. ©2012 DataStax 7•  Petabytes of growing data•  Hadoop is for batch work•  What are the solutions for realtime?What is Big Data?
  8. 8. ©2012 DataStax 8•  Near realtime•  1000 millisecond latencyWhat is realtime?
  9. 9. ©2012 DataStax 9•  Cost of scaling to petabytes•  Physical limitationsWhy not relational databases?
  10. 10. ©2012 DataStax 10•  Hadoop for batch•  Solr and Cassandra for realtime•  Gives most of relational capability at1/10 the cost, scales linearlyRelational to Big Data
  11. 11. ©2012 DataStax 11•  Distributed database heavy lifting•  Simple dynamo model•  Executes replication tasks extremelywellWhy Cassandra?
  12. 12. ©2012 DataStax 12•  Cassandra is easier, code is readable•  Fewer moving parts•  Multi-datacenter replication•  Enables low level IO tuningCassandra vs. HBase
  13. 13. ©2012 DataStax 13•  HBase runs on HDFS•  HDFS is not designed for randomaccess IO•  Multiple hacks / products to performrandom access (MapR, HDFS Jiras)Cassandra vs. HBase
  14. 14. ©2012 DataStax 14•  Cassandra is peer to peer, there is nosingle point of failure (SPOF)•  The HDFS name node is a singlepoint of failureCassandra vs. HBase
  15. 15. ©2012 DataStax 15•  Most of Facebook runs on MySQL•  Memcache front ends the readsHBase at Facebook
  16. 16. ©2012 DataStax 16•  Hive with Hadoop•  A vague dialect of SQL•  Requires Java for UDFs•  Relational JoinsBatch Analytics
  17. 17. ©2012 DataStax 17•  Solr•  SQL features except relational joins•  Use Hive for relational joins•  CEP (Complex Event Processing)•  StormRealtime Analytics
  18. 18. ©2012 DataStax 18•  Storm, computes results onstreaming dataCEP (Complex Event Processing)
  19. 19. ©2012 DataStax 19•  Java inverted indexing library•  Text analytics is raw computationover linear sets of data•  High speed computation engineLucene
  20. 20. ©2012 DataStax 20•  Terms dictionary points to listdocument ids (integers)•  Tokenizes text•  Complete variety of computation onvectors of dataInverted Indexes
  21. 21. ©2012 DataStax 21•  Search server built around Lucene•  Adds faceting, distributed search•  Missed the cloud environmentfeatures of NoSQL systems for manyyearsSolr
  22. 22. ©2012 DataStax 22•  Solr Cloud is a Zookeeper basedsystem•  New and probably not productionready•  Playing catch upSolr Cloud
  23. 23. ©2012 DataStax 23•  High overlap with Solr•  More mature than Solr Cloud•  Less distributed features thanCassandraElastic Search
  24. 24. ©2012 DataStax 24•  Columns, column families, keyspaces•  Peer to peer•  Eventual consistency•  Implements basic Google BigTablemodelCassandra Concepts
  25. 25. ©2012 DataStax 25•  Both implement a log structuredmerge tree file architectureLucene and Cassandra
  26. 26. ©2012 DataStax 26•  Data is stored in Cassandra•  Data placement controlled byCassandra•  Solr is a secondary index (only)DataStax Enterprise with Solr
  27. 27. ©2012 DataStax 27•  Separation of church and state, eg,data and indexDataStax Enterprise with Solr
  28. 28. ©2012 DataStax 28•  Indexing is a CPU intensive task•  Not IO bound because of multi-threading•  When a thread is flushing, otherthreads are indexing, CPU issaturated at all timesIndexing
  29. 29. ©2012 DataStax 29•  IO bound, index needs to fit in RAM,then CPU bound•  Lucene enables multithreadingqueries•  Solr does not multithread queriesQueries
  30. 30. ©2012 DataStax 30•  Eventual consistency, each node hasit’s own Lucene index•  Lucene segment files are notreplicated (like Solr Cloud andElasticSearch)DataStax Enterprise with Solr
  31. 31. ©2012 DataStax 31•  Query requests are round robin’dacross nodes automaticallyDistributed Search Architecture
  32. 32. ©2012 DataStax 32•  3.0.1 is the current release ofDataStax EnterpriseDSE 3.0.1
  33. 33. ©2012 DataStax 33•  Ease of re-indexing•  Re-index the entire cluster or per-node•  Re-indexing occurs when the Solrschema changesNew Features in DSE 3.0.1
  34. 34. ©2012 DataStax 34•  Solr Cloud requires re-indexing froman external data source such as arelational databaseNew Features in DSE 3.0.1
  35. 35. ©2012 DataStax 35•  DSE re-indexes directly fromCassandra•  No custom code is required for re-indexingNew Features in DSE 3.0.1
  36. 36. ©2012 DataStax 36•  View the heap memory usage of thefield caches•  Perform capacity planningNew Features in DSE 3.0.1
  37. 37. ©2012 DataStax 37•  Multithreaded re-indexing and repair•  Adding a new Solr node is fastNew Features in DSE 3.0.1
  38. 38. ©2012 DataStax 38•  Kerberos and SSL security•  Security audit loggingNew Features in DSE 3.0.1
  39. 39. ©2012 DataStax 39•  Near realtime: per-segment filters,facets, multivalue facets•  Solr 4.3DataStax Enterprise 3.1
  40. 40. ©2012 DataStax 40•  vNodes•  Composite keysDataStax Enterprise 3.1
  41. 41. ©2012 DataStax 41•  Multi datacenter live Solr schemaupdates and re-indexing•  CQL -> Solr queries, makes portingSQL applications easy for SQLdevelopersFuture
  42. 42. ©2012 DataStax 42Demo of Wikipedia
  43. 43. ©2012 DataStax 43•  Details about every trade•  Tick data generated real time and isquantitatively query-able•  Too big to query on in real time? Notanymore!Real World Example: Tick Data
  44. 44. ©2012 DataStax 44•  Computing the moving stock priceaverage in real time•  Comparing multiple moving averagesfor different stock_symbols•  Requires statistical analysis, groupby companies, and faceting featuresTick Data - Moving Average
  45. 45. ©2012 DataStax 45•  Read latest ticks for a given company•  Query ticks for companies in specificverticals during large events such aspress releases•  Compute deviation of stock data over5 years for groups of companiesTick Data Analytics - Ad HocSearches
  46. 46. ©2012 DataStax 46Real Time Stocks Demo
  47. 47. ©2012 DataStax 47General
  48. 48. ©2012 DataStax 48•  Like an SQL CREATE TABLEstatement•  Defines field types•  Defines fieldsSchema
  49. 49. ©2012 DataStax 49•  XML based configuration options forSolrSolr Config
  50. 50. ©2012 DataStax 50•  Commits new index segment to RAM•  Avoids ‘hard’ commit fsyncSoft Commit
  51. 51. ©2012 DataStax 51Auto Soft Commit<!-- The default high-performance update handler --> !<updateHandler class="solr.DirectUpdateHandler2”>!<autoSoftCommit> !<maxTime>1000</maxTime> <!– Near Realtime of 1 second -->!</autoSoftCommit> !</updateHandler>!
  52. 52. ©2012 DataStax 52•  Loaded for sort and facet queries•  Uses heap spaceField Cache
  53. 53. ©2012 DataStax 53•  Java based API for interacting with aSolr server•  DSE supports SolrJ/HTTP with nochangesSolrJ / HTTP
  54. 54. ©2012 DataStax 54•  Auto data type mapping•  Copy fields•  Dynamic fieldsInsert data with CQL
  55. 55. ©2012 DataStax 55•  Exists however is mainly useful fordebugging•  Limited functionality, queries a singlenodeCQL with Solr Query
  56. 56. ©2012 DataStax 56CQL Insert ExampleINSERT INTO wikipedia (key, text) !VALUES (1, when in rome)!
  57. 57. ©2012 DataStax 57How to convert applications
  58. 58. ©2012 DataStax 58•  Common to convert existing SQLapplications to Big Data•  Focus on the application functionalitySQL to Solr
  59. 59. ©2012 DataStax 59•  Cassandra makes all distributedoperations easySQL to Solr
  60. 60. ©2012 DataStax 60•  SELECT * FROM wikipedia WHEREtype = ‘pdf’•  q=type:pdfSELECT WHERE
  61. 61. ©2012 DataStax 61•  SELECT title,text FROM wikipedia•  q=*:*•  fl=title,textSELECT columns
  62. 62. ©2012 DataStax 62•  SELECT COUNT(*) FROM wikipediaWHERE type = ‘pdf’•  q=type:pdf•  Get the num foundSELECT COUNT
  63. 63. ©2012 DataStax 63•  SELECT * FROM stocks ORDER BYprice ASC•  q=*:*•  sort=price ascSELECT ORDER BY
  64. 64. ©2012 DataStax 64•  SELECT AVG(price) FROM stocks•  q=*:*•  stats=true•  stats.field=price•  The average is called ‘mean’ in theSolr resultsSELECT AVG
  65. 65. ©2012 DataStax 65•  SELECT AVG(price) FROM stocksGROUP BY symbol•  q=*:*•  stats=true•  stats.field=price•  stats.facet=symbolSELECT AVG GROUP BY
  66. 66. ©2012 DataStax 66•  SELECT * FROM wikipedia WHEREtext LIKE ‘rom%’•  q=text:rom*SELECT WHERE LIKE
  67. 67. ©2012 DataStax 67

×