©2012 DataStax 1
DataStax Enterprise 3.x
Realtime Analytics with Solr
Jason Rutherglen
©2012 DataStax 2
•  Big Data Engineer at DataStax
•  Co-author of ‘Programming Hive’
and ‘Introduction to Solr’ from
O’Reilly
About the Presenter
©2012 DataStax 3
•  The company behind Cassandra
•  Sells DataStax Enterprise
About DataStax
©2012 DataStax 4
DataStax Enterprise 3.x
©2012 DataStax 5
•  Single stack
•  Cassandra
•  Solr
•  Hadoop
•  Consulting
•  Support
DataStax Enterprise
©2012 DataStax 6
•  ZDNet article: “The biggest cloud
app of all: Netflix”
•  http://zd.net/ZHtrmW
•  Built on Cassandra
•  “According to Cockcroft, if something
goes wrong, Netflix can continue to
run the entire service on two out of
three zones”
Cassandra at Netflix
©2012 DataStax 7
•  Petabytes of growing data
•  Hadoop is for batch work
•  What are the solutions for realtime?
What is Big Data?
©2012 DataStax 8
•  Near realtime
•  1000 millisecond latency
What is realtime?
©2012 DataStax 9
•  Cost of scaling to petabytes
•  Physical limitations
Why not relational databases?
©2012 DataStax 10
•  Hadoop for batch
•  Solr and Cassandra for realtime
•  Gives most of relational capability at
1/10 the cost, scales linearly
Relational to Big Data
©2012 DataStax 11
•  Distributed database heavy lifting
•  Simple dynamo model
•  Executes replication tasks extremely
well
Why Cassandra?
©2012 DataStax 12
•  Cassandra is easier, code is readable
•  Fewer moving parts
•  Multi-datacenter replication
•  Enables low level IO tuning
Cassandra vs. HBase
©2012 DataStax 13
•  HBase runs on HDFS
•  HDFS is not designed for random
access IO
•  Multiple hacks / products to perform
random access (MapR, HDFS Jiras)
Cassandra vs. HBase
©2012 DataStax 14
•  Cassandra is peer to peer, there is no
single point of failure (SPOF)
•  The HDFS name node is a single
point of failure
Cassandra vs. HBase
©2012 DataStax 15
•  Most of Facebook runs on MySQL
•  Memcache front ends the reads
HBase at Facebook
©2012 DataStax 16
•  Hive with Hadoop
•  A vague dialect of SQL
•  Requires Java for UDFs
•  Relational Joins
Batch Analytics
©2012 DataStax 17
•  Solr
•  SQL features except relational joins
•  Use Hive for relational joins
•  CEP (Complex Event Processing)
•  Storm
Realtime Analytics
©2012 DataStax 18
•  Storm, computes results on
streaming data
CEP (Complex Event Processing)
©2012 DataStax 19
•  Java inverted indexing library
•  Text analytics is raw computation
over linear sets of data
•  High speed computation engine
Lucene
©2012 DataStax 20
•  Terms dictionary points to a list of
document IDs (integers)
•  Tokenizes text
•  Complete variety of computation on
vectors of data
Inverted Indexes
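
As a rough illustration of the idea on this slide, the sketch below builds a toy inverted index: a dictionary of terms, each pointing to a sorted list of integer document IDs. The whitespace tokenizer and the sample documents are assumptions for illustration only; Lucene's analyzers and on-disk postings format are far more sophisticated.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real analyzer."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each term to the sorted list of integer IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical sample documents; the list position is used as the document ID.
docs = ["when in rome", "rome was not built in a day", "a day in the life"]
index = build_inverted_index(docs)
print(index["rome"])
print(index["day"])
```

A term lookup is then a single dictionary access, and multi-term queries reduce to intersecting or merging the ID lists.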
©2012 DataStax 21
•  Search server built around Lucene
•  Adds faceting, distributed search
•  Missed the cloud environment
features of NoSQL systems for many
years
Solr
©2012 DataStax 22
•  Solr Cloud is a ZooKeeper-based
system
•  New and probably not production
ready
•  Playing catch up
Solr Cloud
©2012 DataStax 23
•  High overlap with Solr
•  More mature than Solr Cloud
•  Fewer distributed features than
Cassandra
Elastic Search
©2012 DataStax 24
•  Columns, column families, keyspaces
•  Peer to peer
•  Eventual consistency
•  Implements basic Google BigTable
model
Cassandra Concepts
©2012 DataStax 25
•  Both implement a log structured
merge tree file architecture
Lucene and Cassandra
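
To make the log-structured merge idea concrete, here is a minimal in-memory sketch, not how either Lucene or Cassandra is actually implemented. Writes go to a mutable buffer, the buffer is flushed into immutable sorted segments, reads check the newest data first, and compaction merges segments. The class name `TinyLSM` and the `memtable_limit` parameter are hypothetical.

```python
class TinyLSM:
    """Toy log-structured merge store: a write buffer plus immutable sorted segments."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}          # mutable in-memory write buffer
        self.segments = []          # immutable sorted (key, value) lists, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Freeze the write buffer into a new sorted segment."""
        if self.memtable:
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        """Check the buffer first, then segments from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            for k, v in segment:
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all segments into one; newer values overwrite older ones."""
        merged = {}
        for segment in self.segments:
            merged.update(segment)
        self.segments = [sorted(merged.items())] if merged else []

store = TinyLSM(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    store.put(key, value)
print(store.get("a"), len(store.segments))
store.compact()
print(store.segments)
```

Because segments are never modified in place, writes stay sequential and the cost of reconciling versions is deferred to the merge step.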
©2012 DataStax 26
•  Data is stored in Cassandra
•  Data placement controlled by
Cassandra
•  Solr is a secondary index (only)
DataStax Enterprise with Solr
©2012 DataStax 27
•  Separation of church and state, i.e.,
data and index
DataStax Enterprise with Solr
©2012 DataStax 28
•  Indexing is a CPU intensive task
•  Not IO bound because of multithreading
•  When a thread is flushing, other
threads are indexing, CPU is
saturated at all times
Indexing
©2012 DataStax 29
•  IO bound, index needs to fit in RAM,
then CPU bound
•  Lucene enables multithreaded
queries
•  Solr does not multithread queries
Queries
©2012 DataStax 30
•  Eventual consistency, each node has
its own Lucene index
•  Lucene segment files are not
replicated (like Solr Cloud and
ElasticSearch)
DataStax Enterprise with Solr
©2012 DataStax 31
•  Query requests are distributed
round-robin across nodes automatically
Distributed Search Architecture
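
The slide says the distribution happens automatically inside the cluster, so no client code is needed. Purely to illustrate what round-robin selection means, here is a small sketch; the `RoundRobinRouter` class and the node names are hypothetical and are not part of any DataStax API.

```python
from itertools import cycle

class RoundRobinRouter:
    """Hand out nodes in a fixed rotating order, one per request."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("at least one node is required")
        self._nodes = cycle(list(nodes))

    def next_node(self):
        return next(self._nodes)

# Hypothetical node names, for illustration only.
router = RoundRobinRouter(["node1", "node2", "node3"])
picked = [router.next_node() for _ in range(5)]
print(picked)
```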
©2012 DataStax 32
•  3.0.1 is the current release of
DataStax Enterprise
DSE 3.0.1
©2012 DataStax 33
•  Ease of re-indexing
•  Re-index the entire cluster or per node
•  Re-indexing occurs when the Solr
schema changes
New Features in DSE 3.0.1
©2012 DataStax 34
•  Solr Cloud requires re-indexing from
an external data source such as a
relational database
New Features in DSE 3.0.1
©2012 DataStax 35
•  DSE re-indexes directly from
Cassandra
•  No custom code is required for
re-indexing
New Features in DSE 3.0.1
©2012 DataStax 36
•  View the heap memory usage of the
field caches
•  Perform capacity planning
New Features in DSE 3.0.1
©2012 DataStax 37
•  Multithreaded re-indexing and repair
•  Adding a new Solr node is fast
New Features in DSE 3.0.1
©2012 DataStax 38
•  Kerberos and SSL security
•  Security audit logging
New Features in DSE 3.0.1
©2012 DataStax 39
•  Near realtime: per-segment filters,
facets, multivalue facets
•  Solr 4.3
DataStax Enterprise 3.1
©2012 DataStax 40
•  vNodes
•  Composite keys
DataStax Enterprise 3.1
©2012 DataStax 41
•  Multi datacenter live Solr schema
updates and re-indexing
•  CQL -> Solr queries, makes porting
SQL applications easy for SQL
developers
Future
©2012 DataStax 42
Demo of Wikipedia
©2012 DataStax 43
•  Details about every trade
•  Tick data is generated in real time and
is quantitatively queryable
•  Too big to query on in real time? Not
anymore!
Real World Example: Tick Data
©2012 DataStax 44
•  Computing the moving stock price
average in real time
•  Comparing multiple moving averages
for different stock_symbols
•  Requires statistical analysis, group
by companies, and faceting features
Tick Data - Moving Average
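
For reference, a simple moving average over a stream of prices can be sketched as below. This is a client-side illustration of the calculation itself, not the demo's implementation; the function name and the sample prices are assumptions. Early values are averaged over however many ticks have arrived so far.

```python
from collections import deque

def moving_average(prices, window):
    """Return the running simple moving average over the last `window` prices."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)
    total = 0.0
    averages = []
    for price in prices:
        if len(recent) == window:
            total -= recent[0]      # the oldest price is about to be evicted
        recent.append(price)
        total += price
        averages.append(total / len(recent))
    return averages

# Hypothetical tick prices for one symbol.
print(moving_average([10, 12, 14, 16], window=2))
```

Comparing symbols then amounts to running the same calculation per symbol and comparing the resulting series.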
©2012 DataStax 45
•  Read latest ticks for a given company
•  Query ticks for companies in specific
verticals during large events such as
press releases
•  Compute deviation of stock data over
5 years for groups of companies
Tick Data Analytics - Ad Hoc
Searches
©2012 DataStax 46
Real Time Stocks Demo
©2012 DataStax 47
General
©2012 DataStax 48
•  Like an SQL CREATE TABLE
statement
•  Defines field types
•  Defines fields
Schema
©2012 DataStax 49
•  XML based configuration options for
Solr
Solr Config
©2012 DataStax 50
•  Commits new index segment to RAM
•  Avoids ‘hard’ commit fsync
Soft Commit
©2012 DataStax 51
Auto Soft Commit
<!-- The default high-performance update handler -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoSoftCommit>
    <maxTime>1000</maxTime> <!-- Near Realtime of 1 second -->
  </autoSoftCommit>
</updateHandler>
©2012 DataStax 52
•  Loaded for sort and facet queries
•  Uses heap space
Field Cache
©2012 DataStax 53
•  Java based API for interacting with a
Solr server
•  DSE supports SolrJ/HTTP with no
changes
SolrJ / HTTP
©2012 DataStax 54
•  Auto data type mapping
•  Copy fields
•  Dynamic fields
Insert data with CQL
©2012 DataStax 55
•  Exists, but is mainly useful for
debugging
•  Limited functionality, queries a single
node
CQL with Solr Query
©2012 DataStax 56
CQL Insert Example
INSERT INTO wikipedia (key, text)
VALUES ('1', 'when in rome');
©2012 DataStax 57
How to convert applications
©2012 DataStax 58
•  Common to convert existing SQL
applications to Big Data
•  Focus on the application functionality
SQL to Solr
©2012 DataStax 59
•  Cassandra makes all distributed
operations easy
SQL to Solr
©2012 DataStax 60
•  SELECT * FROM wikipedia WHERE
type = 'pdf'
•  q=type:pdf
SELECT WHERE
©2012 DataStax 61
•  SELECT title,text FROM wikipedia
•  q=*:*
•  fl=title,text
SELECT columns
©2012 DataStax 62
•  SELECT COUNT(*) FROM wikipedia
WHERE type = 'pdf'
•  q=type:pdf
•  Read the count from numFound in the response
SELECT COUNT
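
A minimal sketch of the COUNT translation, assuming the JSON response writer (`wt=json`). Setting `rows=0` is an addition not shown on the slide; it asks Solr to return no documents since only the count is needed. The sample response body is hand-written for illustration, not captured from a server.

```python
import json
from urllib.parse import urlencode

def count_query(field, value):
    """Build the query string for SELECT COUNT(*) ... WHERE field = value."""
    return urlencode({"q": f"{field}:{value}", "rows": 0, "wt": "json"})

def num_found(response_body):
    """Extract the total hit count from a Solr JSON response body."""
    return json.loads(response_body)["response"]["numFound"]

print(count_query("type", "pdf"))

# Hand-written sample response, for illustration only.
sample = '{"response": {"numFound": 42, "start": 0, "docs": []}}'
print(num_found(sample))
```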
©2012 DataStax 63
•  SELECT * FROM stocks ORDER BY
price ASC
•  q=*:*
•  sort=price asc
SELECT ORDER BY
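
The WHERE, column list, and ORDER BY translations so far all come down to setting the `q`, `fl`, and `sort` parameters. The helper below is one way to assemble them into a request URL. The `select_url` function and the endpoint in `BASE_URL` (host, port, and core name) are assumptions for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint; the host, port, and core name are assumptions.
BASE_URL = "http://localhost:8983/solr/demo.stocks/select"

def select_url(where=None, columns=None, order_by=None):
    """Translate the parts of a simple SELECT into Solr request parameters."""
    params = {"q": where or "*:*", "wt": "json"}   # no WHERE clause means match all
    if columns:
        params["fl"] = ",".join(columns)           # SELECT col1,col2
    if order_by:
        params["sort"] = order_by                  # ORDER BY price ASC -> "price asc"
    return BASE_URL + "?" + urlencode(params)

url = select_url(order_by="price asc")
print(url)
print(parse_qs(urlsplit(url).query))
```

Using `urlencode` avoids hand-escaping characters such as `*`, `:`, and spaces in the parameter values.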
©2012 DataStax 64
•  SELECT AVG(price) FROM stocks
•  q=*:*
•  stats=true
•  stats.field=price
•  The average is called ‘mean’ in the
Solr results
SELECT AVG
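
A short sketch of pulling the average out of a stats response, assuming the JSON response writer and the usual `stats` / `stats_fields` layout of the Solr stats component; check the layout against the Solr version in use. The sample body is hand-written for illustration.

```python
import json

# Request parameters from the slide, plus rows=0 (an addition) to skip documents.
STATS_PARAMS = {"q": "*:*", "rows": 0, "stats": "true", "stats.field": "price"}

def mean_of(response_body, field):
    """Return the 'mean' statistic for `field` from a Solr JSON stats response."""
    return json.loads(response_body)["stats"]["stats_fields"][field]["mean"]

# Hand-written sample response, for illustration only.
sample_stats = json.dumps({
    "stats": {"stats_fields": {"price": {
        "min": 10.0, "max": 30.0, "count": 3, "sum": 60.0, "mean": 20.0,
    }}}
})
print(mean_of(sample_stats, "price"))
```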
©2012 DataStax 65
•  SELECT AVG(price) FROM stocks
GROUP BY symbol
•  q=*:*
•  stats=true
•  stats.field=price
•  stats.facet=symbol
SELECT AVG GROUP BY
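
With `stats.facet`, the per-group statistics are nested under the stats field. The sketch below assumes they appear under a `facets` key keyed by the facet field and then by each facet value, which should be verified against the Solr version in use. The symbols and numbers are made up for illustration.

```python
def group_means(response, field, facet):
    """Return {facet value: mean} from a parsed Solr stats response with stats.facet."""
    buckets = response["stats"]["stats_fields"][field]["facets"][facet]
    return {value: stats["mean"] for value, stats in buckets.items()}

# Hand-written sample of a parsed response, for illustration only.
sample_grouped = {
    "stats": {"stats_fields": {"price": {
        "mean": 20.0,
        "facets": {"symbol": {
            "AAA": {"count": 2, "mean": 15.0},
            "BBB": {"count": 1, "mean": 30.0},
        }},
    }}}
}
print(group_means(sample_grouped, "price", "symbol"))
```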
©2012 DataStax 66
•  SELECT * FROM wikipedia WHERE
text LIKE 'rom%'
•  q=text:rom*
SELECT WHERE LIKE
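
The LIKE translation can be sketched as a small helper that handles only the trailing-wildcard case shown on the slide. The function name is hypothetical. Note that on a tokenized text field the Solr prefix query matches any individual term starting with the prefix, which is not identical to SQL LIKE matching against the whole column value.

```python
def like_to_solr(field, pattern):
    """Convert a SQL LIKE pattern of the form 'prefix%' into a Solr prefix query."""
    prefix = pattern[:-1]
    if not pattern.endswith("%") or not prefix or "%" in prefix or "_" in prefix:
        raise ValueError("only simple 'prefix%' patterns are handled in this sketch")
    return f"{field}:{prefix}*"

print(like_to_solr("text", "rom%"))
```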
©2012 DataStax 67

C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
