Your SlideShare is downloading. ×
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

1,682

Published on

The presentation demonstrates how Solr may be used to create real-time analytics applications. In addition, Datastax Enterprise 3.0 will be showcased, which offers Solr version 4.0 with a number of …

The presentation demonstrates how Solr may be used to create real-time analytics applications. In addition, Datastax Enterprise 3.0 will be showcased, which offers Solr version 4.0 with a number of improvements over the previous DSE release. A realtime financial application will run for the audience, and then a detailed look at how the application was built. An overview of Datastax Enterprise Solr features will be given, and how the many enhancements in DSE make it unique in the marketplace.

Published in: Technology
0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,682
On Slideshare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
65
Comments
0
Likes
3
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. ©2012 DataStax 1DataStax Enterprise 3.xRealtime Analytics with SolrJason Rutherglen
  • 2. ©2012 DataStax 2•  Big Data Engineer at DataStax•  Co-author of ‘Programming Hive’and ‘Introduction to Solr’ fromO’ReillyAbout the Presenter
  • 3. ©2012 DataStax 3•  The company behind Cassandra•  Sells DataStax EnterpriseAbout DataStax
  • 4. ©2012 DataStax 4DataStax Enterprise 3.x
  • 5. ©2012 DataStax 5•  Single stack•  Cassandra•  Solr•  Hadoop•  Consulting•  SupportDataStax Enterprise
  • 6. ©2012 DataStax 6•  ZDNet article: “The biggest cloudapp of all: Netflix”•  http://zd.net/ZHtrmW•  Built on Cassandra•  “According to Cockcroft, if somethinggoes wrong, Netflix can continue torun the entire service on two out ofthree zones”Cassandra at Netflix
  • 7. ©2012 DataStax 7•  Petabytes of growing data•  Hadoop is for batch work•  What are the solutions for realtime?What is Big Data?
  • 8. ©2012 DataStax 8•  Near realtime•  1000 millisecond latencyWhat is realtime?
  • 9. ©2012 DataStax 9•  Cost of scaling to petabytes•  Physical limitationsWhy not relational databases?
  • 10. ©2012 DataStax 10•  Hadoop for batch•  Solr and Cassandra for realtime•  Gives most of relational capability at1/10 the cost, scales linearlyRelational to Big Data
  • 11. ©2012 DataStax 11•  Distributed database heavy lifting•  Simple dynamo model•  Executes replication tasks extremelywellWhy Cassandra?
  • 12. ©2012 DataStax 12•  Cassandra is easier, code is readable•  Fewer moving parts•  Multi-datacenter replication•  Enables low level IO tuningCassandra vs. HBase
  • 13. ©2012 DataStax 13•  HBase runs on HDFS•  HDFS is not designed for randomaccess IO•  Multiple hacks / products to performrandom access (MapR, HDFS Jiras)Cassandra vs. HBase
  • 14. ©2012 DataStax 14•  Cassandra is peer to peer, there is nosingle point of failure (SPOF)•  The HDFS name node is a singlepoint of failureCassandra vs. HBase
  • 15. ©2012 DataStax 15•  Most of Facebook runs on MySQL•  Memcache front ends the readsHBase at Facebook
  • 16. ©2012 DataStax 16•  Hive with Hadoop•  A vague dialect of SQL•  Requires Java for UDFs•  Relational JoinsBatch Analytics
  • 17. ©2012 DataStax 17•  Solr•  SQL features except relational joins•  Use Hive for relational joins•  CEP (Complex Event Processing)•  StormRealtime Analytics
  • 18. ©2012 DataStax 18•  Storm, computes results onstreaming dataCEP (Complex Event Processing)
  • 19. ©2012 DataStax 19•  Java inverted indexing library•  Text analytics is raw computationover linear sets of data•  High speed computation engineLucene
  • 20. ©2012 DataStax 20•  Terms dictionary points to listdocument ids (integers)•  Tokenizes text•  Complete variety of computation onvectors of dataInverted Indexes
  • 21. ©2012 DataStax 21•  Search server built around Lucene•  Adds faceting, distributed search•  Missed the cloud environmentfeatures of NoSQL systems for manyyearsSolr
  • 22. ©2012 DataStax 22•  Solr Cloud is a Zookeeper basedsystem•  New and probably not productionready•  Playing catch upSolr Cloud
  • 23. ©2012 DataStax 23•  High overlap with Solr•  More mature than Solr Cloud•  Less distributed features thanCassandraElastic Search
  • 24. ©2012 DataStax 24•  Columns, column families, keyspaces•  Peer to peer•  Eventual consistency•  Implements basic Google BigTablemodelCassandra Concepts
  • 25. ©2012 DataStax 25•  Both implement a log structuredmerge tree file architectureLucene and Cassandra
  • 26. ©2012 DataStax 26•  Data is stored in Cassandra•  Data placement controlled byCassandra•  Solr is a secondary index (only)DataStax Enterprise with Solr
  • 27. ©2012 DataStax 27•  Separation of church and state, eg,data and indexDataStax Enterprise with Solr
  • 28. ©2012 DataStax 28•  Indexing is a CPU intensive task•  Not IO bound because of multi-threading•  When a thread is flushing, otherthreads are indexing, CPU issaturated at all timesIndexing
  • 29. ©2012 DataStax 29•  IO bound, index needs to fit in RAM,then CPU bound•  Lucene enables multithreadingqueries•  Solr does not multithread queriesQueries
  • 30. ©2012 DataStax 30•  Eventual consistency, each node hasit’s own Lucene index•  Lucene segment files are notreplicated (like Solr Cloud andElasticSearch)DataStax Enterprise with Solr
  • 31. ©2012 DataStax 31•  Query requests are round robin’dacross nodes automaticallyDistributed Search Architecture
  • 32. ©2012 DataStax 32•  3.0.1 is the current release ofDataStax EnterpriseDSE 3.0.1
  • 33. ©2012 DataStax 33•  Ease of re-indexing•  Re-index the entire cluster or per-node•  Re-indexing occurs when the Solrschema changesNew Features in DSE 3.0.1
  • 34. ©2012 DataStax 34•  Solr Cloud requires re-indexing froman external data source such as arelational databaseNew Features in DSE 3.0.1
  • 35. ©2012 DataStax 35•  DSE re-indexes directly fromCassandra•  No custom code is required for re-indexingNew Features in DSE 3.0.1
  • 36. ©2012 DataStax 36•  View the heap memory usage of thefield caches•  Perform capacity planningNew Features in DSE 3.0.1
  • 37. ©2012 DataStax 37•  Multithreaded re-indexing and repair•  Adding a new Solr node is fastNew Features in DSE 3.0.1
  • 38. ©2012 DataStax 38•  Kerberos and SSL security•  Security audit loggingNew Features in DSE 3.0.1
  • 39. ©2012 DataStax 39•  Near realtime: per-segment filters,facets, multivalue facets•  Solr 4.3DataStax Enterprise 3.1
  • 40. ©2012 DataStax 40•  vNodes•  Composite keysDataStax Enterprise 3.1
  • 41. ©2012 DataStax 41•  Multi datacenter live Solr schemaupdates and re-indexing•  CQL -> Solr queries, makes portingSQL applications easy for SQLdevelopersFuture
  • 42. ©2012 DataStax 42Demo of Wikipedia
  • 43. ©2012 DataStax 43•  Details about every trade•  Tick data generated real time and isquantitatively query-able•  Too big to query on in real time? Notanymore!Real World Example: Tick Data
  • 44. ©2012 DataStax 44•  Computing the moving stock priceaverage in real time•  Comparing multiple moving averagesfor different stock_symbols•  Requires statistical analysis, groupby companies, and faceting featuresTick Data - Moving Average
  • 45. ©2012 DataStax 45•  Read latest ticks for a given company•  Query ticks for companies in specificverticals during large events such aspress releases•  Compute deviation of stock data over5 years for groups of companiesTick Data Analytics - Ad HocSearches
  • 46. ©2012 DataStax 46Real Time Stocks Demo
  • 47. ©2012 DataStax 47General
  • 48. ©2012 DataStax 48•  Like an SQL CREATE TABLEstatement•  Defines field types•  Defines fieldsSchema
  • 49. ©2012 DataStax 49•  XML based configuration options forSolrSolr Config
  • 50. ©2012 DataStax 50•  Commits new index segment to RAM•  Avoids ‘hard’ commit fsyncSoft Commit
  • 51. ©2012 DataStax 51Auto Soft Commit<!-- The default high-performance update handler --> !<updateHandler class="solr.DirectUpdateHandler2”>!<autoSoftCommit> !<maxTime>1000</maxTime> <!– Near Realtime of 1 second -->!</autoSoftCommit> !</updateHandler>!
  • 52. ©2012 DataStax 52•  Loaded for sort and facet queries•  Uses heap spaceField Cache
  • 53. ©2012 DataStax 53•  Java based API for interacting with aSolr server•  DSE supports SolrJ/HTTP with nochangesSolrJ / HTTP
  • 54. ©2012 DataStax 54•  Auto data type mapping•  Copy fields•  Dynamic fieldsInsert data with CQL
  • 55. ©2012 DataStax 55•  Exists however is mainly useful fordebugging•  Limited functionality, queries a singlenodeCQL with Solr Query
  • 56. ©2012 DataStax 56CQL Insert ExampleINSERT INTO wikipedia (key, text) !VALUES (1, when in rome)!
  • 57. ©2012 DataStax 57How to convert applications
  • 58. ©2012 DataStax 58•  Common to convert existing SQLapplications to Big Data•  Focus on the application functionalitySQL to Solr
  • 59. ©2012 DataStax 59•  Cassandra makes all distributedoperations easySQL to Solr
  • 60. ©2012 DataStax 60•  SELECT * FROM wikipedia WHEREtype = ‘pdf’•  q=type:pdfSELECT WHERE
  • 61. ©2012 DataStax 61•  SELECT title,text FROM wikipedia•  q=*:*•  fl=title,textSELECT columns
  • 62. ©2012 DataStax 62•  SELECT COUNT(*) FROM wikipediaWHERE type = ‘pdf’•  q=type:pdf•  Get the num foundSELECT COUNT
  • 63. ©2012 DataStax 63•  SELECT * FROM stocks ORDER BYprice ASC•  q=*:*•  sort=price ascSELECT ORDER BY
  • 64. ©2012 DataStax 64•  SELECT AVG(price) FROM stocks•  q=*:*•  stats=true•  stats.field=price•  The average is called ‘mean’ in theSolr resultsSELECT AVG
  • 65. ©2012 DataStax 65•  SELECT AVG(price) FROM stocksGROUP BY symbol•  q=*:*•  stats=true•  stats.field=price•  stats.facet=symbolSELECT AVG GROUP BY
  • 66. ©2012 DataStax 66•  SELECT * FROM wikipedia WHEREtext LIKE ‘rom%’•  q=text:rom*SELECT WHERE LIKE
  • 67. ©2012 DataStax 67

×