SlideShare a Scribd company logo
©2012 DataStax 1
DataStax Enterprise 3.x
Realtime Analytics with Solr
Jason Rutherglen
©2012 DataStax 2
•  Big Data Engineer at DataStax
•  Co-author of ‘Programming Hive’
and ‘Introduction to Solr’ from
O’Reilly
About the Presenter
©2012 DataStax 3
•  The company behind Cassandra
•  Sells DataStax Enterprise
About DataStax
©2012 DataStax 4
DataStax Enterprise 3.x
©2012 DataStax 5
•  Single stack
•  Cassandra
•  Solr
•  Hadoop
•  Consulting
•  Support
DataStax Enterprise
©2012 DataStax 6
•  ZDNet article: “The biggest cloud
app of all: Netflix”
•  http://zd.net/ZHtrmW
•  Built on Cassandra
•  “According to Cockcroft, if something
goes wrong, Netflix can continue to
run the entire service on two out of
three zones”
Cassandra at Netflix
©2012 DataStax 7
•  Petabytes of growing data
•  Hadoop is for batch work
•  What are the solutions for realtime?
What is Big Data?
©2012 DataStax 8
•  Near realtime
•  1000 millisecond latency
What is realtime?
©2012 DataStax 9
•  Cost of scaling to petabytes
•  Physical limitations
Why not relational databases?
©2012 DataStax 10
•  Hadoop for batch
•  Solr and Cassandra for realtime
•  Gives most of relational capability at
1/10 the cost, scales linearly
Relational to Big Data
©2012 DataStax 11
•  Distributed database heavy lifting
•  Simple dynamo model
•  Executes replication tasks extremely
well
Why Cassandra?
©2012 DataStax 12
•  Cassandra is easier, code is readable
•  Fewer moving parts
•  Multi-datacenter replication
•  Enables low level IO tuning
Cassandra vs. HBase
©2012 DataStax 13
•  HBase runs on HDFS
•  HDFS is not designed for random
access IO
•  Multiple hacks / products to perform
random access (MapR, HDFS Jiras)
Cassandra vs. HBase
©2012 DataStax 14
•  Cassandra is peer to peer, there is no
single point of failure (SPOF)
•  The HDFS name node is a single
point of failure
Cassandra vs. HBase
©2012 DataStax 15
•  Most of Facebook runs on MySQL
•  Memcache front ends the reads
HBase at Facebook
©2012 DataStax 16
•  Hive with Hadoop
•  A vague dialect of SQL
•  Requires Java for UDFs
•  Relational Joins
Batch Analytics
©2012 DataStax 17
•  Solr
•  SQL features except relational joins
•  Use Hive for relational joins
•  CEP (Complex Event Processing)
•  Storm
Realtime Analytics
©2012 DataStax 18
•  Storm, computes results on
streaming data
CEP (Complex Event Processing)
©2012 DataStax 19
•  Java inverted indexing library
•  Text analytics is raw computation
over linear sets of data
•  High speed computation engine
Lucene
©2012 DataStax 20
•  Terms dictionary points to list
document ids (integers)
•  Tokenizes text
•  Complete variety of computation on
vectors of data
Inverted Indexes
©2012 DataStax 21
•  Search server built around Lucene
•  Adds faceting, distributed search
•  Missed the cloud environment
features of NoSQL systems for many
years
Solr
©2012 DataStax 22
•  Solr Cloud is a Zookeeper based
system
•  New and probably not production
ready
•  Playing catch up
Solr Cloud
©2012 DataStax 23
•  High overlap with Solr
•  More mature than Solr Cloud
•  Less distributed features than
Cassandra
Elastic Search
©2012 DataStax 24
•  Columns, column families, keyspaces
•  Peer to peer
•  Eventual consistency
•  Implements basic Google BigTable
model
Cassandra Concepts
©2012 DataStax 25
•  Both implement a log structured
merge tree file architecture
Lucene and Cassandra
©2012 DataStax 26
•  Data is stored in Cassandra
•  Data placement controlled by
Cassandra
•  Solr is a secondary index (only)
DataStax Enterprise with Solr
©2012 DataStax 27
•  Separation of church and state, eg,
data and index
DataStax Enterprise with Solr
©2012 DataStax 28
•  Indexing is a CPU intensive task
•  Not IO bound because of multi-
threading
•  When a thread is flushing, other
threads are indexing, CPU is
saturated at all times
Indexing
©2012 DataStax 29
•  IO bound, index needs to fit in RAM,
then CPU bound
•  Lucene enables multithreading
queries
•  Solr does not multithread queries
Queries
©2012 DataStax 30
•  Eventual consistency, each node has
it’s own Lucene index
•  Lucene segment files are not
replicated (like Solr Cloud and
ElasticSearch)
DataStax Enterprise with Solr
©2012 DataStax 31
•  Query requests are round robin’d
across nodes automatically
Distributed Search Architecture
©2012 DataStax 32
•  3.0.1 is the current release of
DataStax Enterprise
DSE 3.0.1
©2012 DataStax 33
•  Ease of re-indexing
•  Re-index the entire cluster or per-
node
•  Re-indexing occurs when the Solr
schema changes
New Features in DSE 3.0.1
©2012 DataStax 34
•  Solr Cloud requires re-indexing from
an external data source such as a
relational database
New Features in DSE 3.0.1
©2012 DataStax 35
•  DSE re-indexes directly from
Cassandra
•  No custom code is required for re-
indexing
New Features in DSE 3.0.1
©2012 DataStax 36
•  View the heap memory usage of the
field caches
•  Perform capacity planning
New Features in DSE 3.0.1
©2012 DataStax 37
•  Multithreaded re-indexing and repair
•  Adding a new Solr node is fast
New Features in DSE 3.0.1
©2012 DataStax 38
•  Kerberos and SSL security
•  Security audit logging
New Features in DSE 3.0.1
©2012 DataStax 39
•  Near realtime: per-segment filters,
facets, multivalue facets
•  Solr 4.3
DataStax Enterprise 3.1
©2012 DataStax 40
•  vNodes
•  Composite keys
DataStax Enterprise 3.1
©2012 DataStax 41
•  Multi datacenter live Solr schema
updates and re-indexing
•  CQL -> Solr queries, makes porting
SQL applications easy for SQL
developers
Future
©2012 DataStax 42
Demo of Wikipedia
©2012 DataStax 43
•  Details about every trade
•  Tick data generated real time and is
quantitatively query-able
•  Too big to query on in real time? Not
anymore!
Real World Example: Tick Data
©2012 DataStax 44
•  Computing the moving stock price
average in real time
•  Comparing multiple moving averages
for different stock_symbols
•  Requires statistical analysis, group
by companies, and faceting features
Tick Data - Moving Average
©2012 DataStax 45
•  Read latest ticks for a given company
•  Query ticks for companies in specific
verticals during large events such as
press releases
•  Compute deviation of stock data over
5 years for groups of companies
Tick Data Analytics - Ad Hoc
Searches
©2012 DataStax 46
Real Time Stocks Demo
©2012 DataStax 47
General
©2012 DataStax 48
•  Like an SQL CREATE TABLE
statement
•  Defines field types
•  Defines fields
Schema
©2012 DataStax 49
•  XML based configuration options for
Solr
Solr Config
©2012 DataStax 50
•  Commits new index segment to RAM
•  Avoids ‘hard’ commit fsync
Soft Commit
©2012 DataStax 51
Auto Soft Commit
<!-- The default high-performance update handler --> !
<updateHandler class="solr.DirectUpdateHandler2”>!
<autoSoftCommit> !
<maxTime>1000</maxTime> <!– Near Realtime of 1 second -->!
</autoSoftCommit> !
</updateHandler>!
©2012 DataStax 52
•  Loaded for sort and facet queries
•  Uses heap space
Field Cache
©2012 DataStax 53
•  Java based API for interacting with a
Solr server
•  DSE supports SolrJ/HTTP with no
changes
SolrJ / HTTP
©2012 DataStax 54
•  Auto data type mapping
•  Copy fields
•  Dynamic fields
Insert data with CQL
©2012 DataStax 55
•  Exists however is mainly useful for
debugging
•  Limited functionality, queries a single
node
CQL with Solr Query
©2012 DataStax 56
CQL Insert Example
INSERT INTO wikipedia (key, text) !
VALUES ('1', 'when in rome')!
©2012 DataStax 57
How to convert applications
©2012 DataStax 58
•  Common to convert existing SQL
applications to Big Data
•  Focus on the application functionality
SQL to Solr
©2012 DataStax 59
•  Cassandra makes all distributed
operations easy
SQL to Solr
©2012 DataStax 60
•  SELECT * FROM wikipedia WHERE
type = ‘pdf’
•  q=type:pdf
SELECT WHERE
©2012 DataStax 61
•  SELECT title,text FROM wikipedia
•  q=*:*
•  fl=title,text
SELECT columns
©2012 DataStax 62
•  SELECT COUNT(*) FROM wikipedia
WHERE type = ‘pdf’
•  q=type:pdf
•  Get the num found
SELECT COUNT
©2012 DataStax 63
•  SELECT * FROM stocks ORDER BY
price ASC
•  q=*:*
•  sort=price asc
SELECT ORDER BY
©2012 DataStax 64
•  SELECT AVG(price) FROM stocks
•  q=*:*
•  stats=true
•  stats.field=price
•  The average is called ‘mean’ in the
Solr results
SELECT AVG
©2012 DataStax 65
•  SELECT AVG(price) FROM stocks
GROUP BY symbol
•  q=*:*
•  stats=true
•  stats.field=price
•  stats.facet=symbol
SELECT AVG GROUP BY
©2012 DataStax 66
•  SELECT * FROM wikipedia WHERE
text LIKE ‘rom%’
•  q=text:rom*
SELECT WHERE LIKE
©2012 DataStax 67

More Related Content

What's hot

Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart Labs
DataStax Academy
 

What's hot (20)

Hindsight is 20/20: MySQL to Cassandra
Hindsight is 20/20: MySQL to CassandraHindsight is 20/20: MySQL to Cassandra
Hindsight is 20/20: MySQL to Cassandra
 
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd KnownCassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
 
Cassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
Cassandra Community Webinar: From Mongo to Cassandra, Architectural LessonsCassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
Cassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
 
Introducing DataStax Enterprise 4.7
Introducing DataStax Enterprise 4.7Introducing DataStax Enterprise 4.7
Introducing DataStax Enterprise 4.7
 
Cassandra NoSQL Tutorial
Cassandra NoSQL TutorialCassandra NoSQL Tutorial
Cassandra NoSQL Tutorial
 
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetupDataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
 
Managing (Schema) Migrations in Cassandra
Managing (Schema) Migrations in CassandraManaging (Schema) Migrations in Cassandra
Managing (Schema) Migrations in Cassandra
 
mParticle's Journey to Scylla from Cassandra
mParticle's Journey to Scylla from CassandramParticle's Journey to Scylla from Cassandra
mParticle's Journey to Scylla from Cassandra
 
Cisco: Cassandra adoption on Cisco UCS & OpenStack
Cisco: Cassandra adoption on Cisco UCS & OpenStackCisco: Cassandra adoption on Cisco UCS & OpenStack
Cisco: Cassandra adoption on Cisco UCS & OpenStack
 
DataStax C*ollege Credit: What and Why NoSQL?
DataStax C*ollege Credit: What and Why NoSQL?DataStax C*ollege Credit: What and Why NoSQL?
DataStax C*ollege Credit: What and Why NoSQL?
 
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart Labs
 
How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)
 
Migration Best Practices: From RDBMS to Cassandra without a Hitch
Migration Best Practices: From RDBMS to Cassandra without a HitchMigration Best Practices: From RDBMS to Cassandra without a Hitch
Migration Best Practices: From RDBMS to Cassandra without a Hitch
 
Webinar | Introducing DataStax Enterprise 4.6
Webinar | Introducing DataStax Enterprise 4.6Webinar | Introducing DataStax Enterprise 4.6
Webinar | Introducing DataStax Enterprise 4.6
 
Cassandra in e-commerce
Cassandra in e-commerceCassandra in e-commerce
Cassandra in e-commerce
 
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User StoreAzure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
 
Nyc summit intro_to_cassandra
Nyc summit intro_to_cassandraNyc summit intro_to_cassandra
Nyc summit intro_to_cassandra
 
Azure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User StoreAzure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User Store
 
Workshop - How to benchmark your database
Workshop - How to benchmark your databaseWorkshop - How to benchmark your database
Workshop - How to benchmark your database
 

Similar to C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

State of Cassandra 2012
State of Cassandra 2012State of Cassandra 2012
State of Cassandra 2012
jbellis
 
Toronto jaspersoft meetup
Toronto jaspersoft meetupToronto jaspersoft meetup
Toronto jaspersoft meetup
Patrick McFadin
 

Similar to C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen (20)

State of Cassandra 2012
State of Cassandra 2012State of Cassandra 2012
State of Cassandra 2012
 
Demystifying Data Warehouse as a Service (DWaaS)
Demystifying Data Warehouse as a Service (DWaaS)Demystifying Data Warehouse as a Service (DWaaS)
Demystifying Data Warehouse as a Service (DWaaS)
 
Whats new in Oracle Database 12c release 12.1.0.2
Whats new in Oracle Database 12c release 12.1.0.2Whats new in Oracle Database 12c release 12.1.0.2
Whats new in Oracle Database 12c release 12.1.0.2
 
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
Sa introduction to big data pipelining with cassandra &amp; spark   west mins...Sa introduction to big data pipelining with cassandra &amp; spark   west mins...
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
 
Oracle big data appliance and solutions
Oracle big data appliance and solutionsOracle big data appliance and solutions
Oracle big data appliance and solutions
 
Presentation big dataappliance-overview_oow_v3
Presentation   big dataappliance-overview_oow_v3Presentation   big dataappliance-overview_oow_v3
Presentation big dataappliance-overview_oow_v3
 
Toronto jaspersoft meetup
Toronto jaspersoft meetupToronto jaspersoft meetup
Toronto jaspersoft meetup
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's new in SQL Server Integration Services 2012?
What's new in SQL Server Integration Services 2012?What's new in SQL Server Integration Services 2012?
What's new in SQL Server Integration Services 2012?
 
Apache Cassandra and The Multi-Cloud by Amanda Moran
Apache Cassandra and The Multi-Cloud by Amanda MoranApache Cassandra and The Multi-Cloud by Amanda Moran
Apache Cassandra and The Multi-Cloud by Amanda Moran
 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
 
Changing the game with cloud dw
Changing the game with cloud dwChanging the game with cloud dw
Changing the game with cloud dw
 
The Perils and Triumphs of using Cassandra at a .NET/Microsoft Shop
The Perils and Triumphs of using Cassandra at a .NET/Microsoft ShopThe Perils and Triumphs of using Cassandra at a .NET/Microsoft Shop
The Perils and Triumphs of using Cassandra at a .NET/Microsoft Shop
 
Sql pass summit
Sql pass summitSql pass summit
Sql pass summit
 
On Cassandra Development: Past, Present and Future
On Cassandra Development: Past, Present and FutureOn Cassandra Development: Past, Present and Future
On Cassandra Development: Past, Present and Future
 
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
 
Ibm_IoT_Architecture_and_Capabilities
Ibm_IoT_Architecture_and_CapabilitiesIbm_IoT_Architecture_and_Capabilities
Ibm_IoT_Architecture_and_Capabilities
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
 
Dynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the flyDynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the fly
 

More from DataStax Academy

Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stack
DataStax Academy
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
DataStax Academy
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First Cluster
DataStax Academy
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache Cassandra
DataStax Academy
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
DataStax Academy
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache Cassandra
DataStax Academy
 

More from DataStax Academy (20)

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph Database
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data Modeling
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stack
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache Cassandra
 
Coursera Cassandra Driver
Coursera Cassandra DriverCoursera Cassandra Driver
Coursera Cassandra Driver
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready Cassandra
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First Cluster
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache Cassandra
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
 
Bad Habits Die Hard
Bad Habits Die Hard Bad Habits Die Hard
Bad Habits Die Hard
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache Cassandra
 
Advanced Cassandra
Advanced CassandraAdvanced Cassandra
Advanced Cassandra
 
Apache Cassandra and Drivers
Apache Cassandra and DriversApache Cassandra and Drivers
Apache Cassandra and Drivers
 
Getting Started with Graph Databases
Getting Started with Graph DatabasesGetting Started with Graph Databases
Getting Started with Graph Databases
 
Cassandra Data Maintenance with Spark
Cassandra Data Maintenance with SparkCassandra Data Maintenance with Spark
Cassandra Data Maintenance with Spark
 

Recently uploaded

Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 

Recently uploaded (20)

In-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsIn-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT Professionals
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 

C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

  • 1. ©2012 DataStax 1 DataStax Enterprise 3.x Realtime Analytics with Solr Jason Rutherglen
  • 2. ©2012 DataStax 2 •  Big Data Engineer at DataStax •  Co-author of ‘Programming Hive’ and ‘Introduction to Solr’ from O’Reilly About the Presenter
  • 3. ©2012 DataStax 3 •  The company behind Cassandra •  Sells DataStax Enterprise About DataStax
  • 4. ©2012 DataStax 4 DataStax Enterprise 3.x
  • 5. ©2012 DataStax 5 •  Single stack •  Cassandra •  Solr •  Hadoop •  Consulting •  Support DataStax Enterprise
  • 6. ©2012 DataStax 6 •  ZDNet article: “The biggest cloud app of all: Netflix” •  http://zd.net/ZHtrmW •  Built on Cassandra •  “According to Cockcroft, if something goes wrong, Netflix can continue to run the entire service on two out of three zones” Cassandra at Netflix
  • 7. ©2012 DataStax 7 •  Petabytes of growing data •  Hadoop is for batch work •  What are the solutions for realtime? What is Big Data?
  • 8. ©2012 DataStax 8 •  Near realtime •  1000 millisecond latency What is realtime?
  • 9. ©2012 DataStax 9 •  Cost of scaling to petabytes •  Physical limitations Why not relational databases?
  • 10. ©2012 DataStax 10 •  Hadoop for batch •  Solr and Cassandra for realtime •  Gives most of relational capability at 1/10 the cost, scales linearly Relational to Big Data
  • 11. ©2012 DataStax 11 •  Distributed database heavy lifting •  Simple dynamo model •  Executes replication tasks extremely well Why Cassandra?
  • 12. ©2012 DataStax 12 •  Cassandra is easier, code is readable •  Fewer moving parts •  Multi-datacenter replication •  Enables low level IO tuning Cassandra vs. HBase
  • 13. ©2012 DataStax 13 •  HBase runs on HDFS •  HDFS is not designed for random access IO •  Multiple hacks / products to perform random access (MapR, HDFS Jiras) Cassandra vs. HBase
  • 14. ©2012 DataStax 14 •  Cassandra is peer to peer, there is no single point of failure (SPOF) •  The HDFS name node is a single point of failure Cassandra vs. HBase
  • 15. ©2012 DataStax 15 •  Most of Facebook runs on MySQL •  Memcache front ends the reads HBase at Facebook
  • 16. ©2012 DataStax 16 •  Hive with Hadoop •  A vague dialect of SQL •  Requires Java for UDFs •  Relational Joins Batch Analytics
  • 17. ©2012 DataStax 17 •  Solr •  SQL features except relational joins •  Use Hive for relational joins •  CEP (Complex Event Processing) •  Storm Realtime Analytics
  • 18. ©2012 DataStax 18 •  Storm, computes results on streaming data CEP (Complex Event Processing)
  • 19. ©2012 DataStax 19 •  Java inverted indexing library •  Text analytics is raw computation over linear sets of data •  High speed computation engine Lucene
  • 20. ©2012 DataStax 20 •  Terms dictionary points to list document ids (integers) •  Tokenizes text •  Complete variety of computation on vectors of data Inverted Indexes
  • 21. ©2012 DataStax 21 •  Search server built around Lucene •  Adds faceting, distributed search •  Missed the cloud environment features of NoSQL systems for many years Solr
  • 22. ©2012 DataStax 22 •  Solr Cloud is a Zookeeper based system •  New and probably not production ready •  Playing catch up Solr Cloud
  • 23. ©2012 DataStax 23 •  High overlap with Solr •  More mature than Solr Cloud •  Less distributed features than Cassandra Elastic Search
  • 24. ©2012 DataStax 24 •  Columns, column families, keyspaces •  Peer to peer •  Eventual consistency •  Implements basic Google BigTable model Cassandra Concepts
  • 25. ©2012 DataStax 25 •  Both implement a log structured merge tree file architecture Lucene and Cassandra
  • 26. ©2012 DataStax 26 •  Data is stored in Cassandra •  Data placement controlled by Cassandra •  Solr is a secondary index (only) DataStax Enterprise with Solr
  • 27. ©2012 DataStax 27 •  Separation of church and state, eg, data and index DataStax Enterprise with Solr
  • 28. ©2012 DataStax 28 •  Indexing is a CPU intensive task •  Not IO bound because of multi- threading •  When a thread is flushing, other threads are indexing, CPU is saturated at all times Indexing
  • 29. ©2012 DataStax 29 •  IO bound, index needs to fit in RAM, then CPU bound •  Lucene enables multithreading queries •  Solr does not multithread queries Queries
  • 30. ©2012 DataStax 30 •  Eventual consistency, each node has it’s own Lucene index •  Lucene segment files are not replicated (like Solr Cloud and ElasticSearch) DataStax Enterprise with Solr
  • 31. ©2012 DataStax 31 •  Query requests are round robin’d across nodes automatically Distributed Search Architecture
  • 32. ©2012 DataStax 32 •  3.0.1 is the current release of DataStax Enterprise DSE 3.0.1
  • 33. ©2012 DataStax 33 •  Ease of re-indexing •  Re-index the entire cluster or per- node •  Re-indexing occurs when the Solr schema changes New Features in DSE 3.0.1
  • 34. ©2012 DataStax 34 •  Solr Cloud requires re-indexing from an external data source such as a relational database New Features in DSE 3.0.1
  • 35. ©2012 DataStax 35 •  DSE re-indexes directly from Cassandra •  No custom code is required for re- indexing New Features in DSE 3.0.1
  • 36. ©2012 DataStax 36 •  View the heap memory usage of the field caches •  Perform capacity planning New Features in DSE 3.0.1
  • 37. ©2012 DataStax 37 •  Multithreaded re-indexing and repair •  Adding a new Solr node is fast New Features in DSE 3.0.1
  • 38. ©2012 DataStax 38 •  Kerberos and SSL security •  Security audit logging New Features in DSE 3.0.1
  • 39. ©2012 DataStax 39 •  Near realtime: per-segment filters, facets, multivalue facets •  Solr 4.3 DataStax Enterprise 3.1
  • 40. ©2012 DataStax 40 •  vNodes •  Composite keys DataStax Enterprise 3.1
  • 41. ©2012 DataStax 41 •  Multi datacenter live Solr schema updates and re-indexing •  CQL -> Solr queries, makes porting SQL applications easy for SQL developers Future
  • 42. ©2012 DataStax 42 Demo of Wikipedia
  • 43. ©2012 DataStax 43 •  Details about every trade •  Tick data generated real time and is quantitatively query-able •  Too big to query on in real time? Not anymore! Real World Example: Tick Data
  • 44. ©2012 DataStax 44 •  Computing the moving stock price average in real time •  Comparing multiple moving averages for different stock_symbols •  Requires statistical analysis, group by companies, and faceting features Tick Data - Moving Average
  • 45. ©2012 DataStax 45 •  Read latest ticks for a given company •  Query ticks for companies in specific verticals during large events such as press releases •  Compute deviation of stock data over 5 years for groups of companies Tick Data Analytics - Ad Hoc Searches
  • 46. ©2012 DataStax 46 Real Time Stocks Demo
  • 48. ©2012 DataStax 48 •  Like an SQL CREATE TABLE statement •  Defines field types •  Defines fields Schema
  • 49. ©2012 DataStax 49 •  XML based configuration options for Solr Solr Config
  • 50. ©2012 DataStax 50 •  Commits new index segment to RAM •  Avoids ‘hard’ commit fsync Soft Commit
  • 51. ©2012 DataStax 51 Auto Soft Commit <!-- The default high-performance update handler --> ! <updateHandler class="solr.DirectUpdateHandler2”>! <autoSoftCommit> ! <maxTime>1000</maxTime> <!– Near Realtime of 1 second -->! </autoSoftCommit> ! </updateHandler>!
  • 52. ©2012 DataStax 52 •  Loaded for sort and facet queries •  Uses heap space Field Cache
  • 53. ©2012 DataStax 53 •  Java based API for interacting with a Solr server •  DSE supports SolrJ/HTTP with no changes SolrJ / HTTP
  • 54. ©2012 DataStax 54 •  Auto data type mapping •  Copy fields •  Dynamic fields Insert data with CQL
  • 55. ©2012 DataStax 55 •  Exists however is mainly useful for debugging •  Limited functionality, queries a single node CQL with Solr Query
  • 56. ©2012 DataStax 56 CQL Insert Example INSERT INTO wikipedia (key, text) ! VALUES ('1', 'when in rome')!
  • 57. ©2012 DataStax 57 How to convert applications
  • 58. ©2012 DataStax 58 •  Common to convert existing SQL applications to Big Data •  Focus on the application functionality SQL to Solr
  • 59. ©2012 DataStax 59 •  Cassandra makes all distributed operations easy SQL to Solr
  • 60. ©2012 DataStax 60 •  SELECT * FROM wikipedia WHERE type = ‘pdf’ •  q=type:pdf SELECT WHERE
  • 61. ©2012 DataStax 61 •  SELECT title,text FROM wikipedia •  q=*:* •  fl=title,text SELECT columns
  • 62. ©2012 DataStax 62 •  SELECT COUNT(*) FROM wikipedia WHERE type = ‘pdf’ •  q=type:pdf •  Get the num found SELECT COUNT
  • 63. ©2012 DataStax 63 •  SELECT * FROM stocks ORDER BY price ASC •  q=*:* •  sort=price asc SELECT ORDER BY
  • 64. ©2012 DataStax 64 •  SELECT AVG(price) FROM stocks •  q=*:* •  stats=true •  stats.field=price •  The average is called ‘mean’ in the Solr results SELECT AVG
  • 65. ©2012 DataStax 65 •  SELECT AVG(price) FROM stocks GROUP BY symbol •  q=*:* •  stats=true •  stats.field=price •  stats.facet=symbol SELECT AVG GROUP BY
  • 66. ©2012 DataStax 66 •  SELECT * FROM wikipedia WHERE text LIKE ‘rom%’ •  q=text:rom* SELECT WHERE LIKE