Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
When to NoSQL and When !
to Know SQL
Simon Elliston Ball
Head of Big Data
!

@sireb
!#noSQLknowSQL
!
!
!

http://nosqlknow...
what is NoSQL?
SQL
NoSQL
Not only SQL
No, SQL
Many many things
before SQL
files
multi-value
ur… hash maps?
after SQL
everything is relational
ORMs fill in the other data structures
scale up rules
data first design
and now NoSQL
datastores that suit applications
polyglot persistence: the right tools
scale out rules
APIs not EDWs
why should you care?
data growth
rapid development
fewer migration headaches… maybe
machine learning
social
big bucks.

number of tools. Median base salary is constant at $100k for those
9
using up to 10 tools, but increases with ...
So many NoSQLs…
document databases
document databases
rapid development
JSON docs
complex, variable models

known access pattern
document databases
learn a very new query language

denormalize
document form

joins?

JUST DON’T

http://www.sarahmei.com...
document vs SQL
what can SQL do?
query all the angles
sure, you can use blobs…
… but you can’t get into them
documents in SQL
SQL xml fields
mapping xquery paths is painful
native JSON
query everything: search
class of database database
full-text indexing
you know google, right…
range query
span query
keyword query
you know the score
scores

"query": {
"function_score": {
"query": {
"match": { "title": "NoSQL"}
},
"functions": [
"boost...
SQL knows the score too
scores

declare @origin float = 0;
declare @delay_weeks float = 4;
!

select top 10 * from (
selec...
you know google, right…
more like this: instant tf-idf
{
"more_like_this" : {
"fields" : ["name.first", "name.last"],
"like_...
Facets
Facets
SQL:
SELECT a.name, count(p.id) FROM
people p
JOIN industry a on a.id = p.industry_id
JOIN people_keywords pk on pk.person...
Elastic search:
{

}

"query": {
“query_string": {
“default_field”: “content”,
“query”: “keywords”
}
},
"facets": {
“myTer...
logs
untyped free-text documents
timestamped
semi-structured
discovery
aggregation and statistics
key: value
close to your programming model
distributed map | list | set
keys can be objects
SQL extensions
hash types
hstore
SQL and polymorphism
inheritance
ORMs hide the horror
turning round the rows
columnar databases
physical layout matters
turning round the rows
key!

value

type

1

A

Home

2

B

Work

3

C

Work

4

D

Work

Row storage
00001

1

A

Home

0...
teaching an old SQL new tricks
MySQL InfoBright
SQL Server Columnar Indexes
CREATE NONCLUSTERED COLUMNSTORE INDEX idx_col
...
millions of columns
eventually consistent
CQL
set | list | map types
http://cassandra.apache.org/
http://www.datastax.com/
column for hadoop and other animals
Parquet http://parquet.io
ORC files
cell level security
SQL: so many views, so much confusion
accumulo https://accumulo.apache.org/
Tim
e

ser

ies
time
retrieving time series and graphs
window functions
SELECT business_date, ticker,
close,
close /
LAG(close,1) OVER (PA...
Queues
queues in SQL
CREATE procedure [dbo].[Dequeue]
AS

!

set nocount on

!

declare @BatchSize int
set @BatchSize = 10

!

de...
queues in SQL
index fragmentation is a problem
but built in logs of a sort
message queues
specialised apis
capabilities like fan-out
routing
acknowledgement
relationships count
Graph databases
relationships count
trees and hierarchies
overloaded relationships
fancy algorithms
hierarchies with SQL
adjacency lists

CONSTRAIN fk_parent_id_id
FOREIGN KEY parent_id REFERENCES some_table.id

materialis...
graph demo
Velocity

https://www.flickr.com/photos/jiteshjagadish
when locks attack…
Don’t get ACID on the cuts
Big Data
SQL on Hadoop
System issues, !
Speed issues,!
Soft issues
the ACID, BASE litmus
Atomic
Consistent
Isolated

Basically Available
Soft-state
Eventually consistent

Durable

what matt...
Consistency

CAP it all

Partition

Availability
is it web scale?
most noSQL scales well
but clusters still need management
are you facebook? one machine is easier than n
write fast, ask questions later
SQL writes cost a lot
mainly write workload: NoSQL

low latency write workload: NoSQL
who is going to use it?
analysts: they want SQL
developers: they want applications
data scientists: the want access
choose the !
right tool

Photo: http://www.homespothq.com/
Questions
Simon Elliston Ball
simon@simonellistonball.com
!

@sireb
#noSQLknowSQL
http://bit.ly/knowSQLnoSQL
When to NoSQL and when to know SQL
Upcoming SlideShare
Loading in …5
×

When to NoSQL and when to know SQL

2,725 views

Published on

A quick guide and overview for a range of NoSQL technologies, as delivered at Code PaLOUsa 2014

Published in: Technology

When to NoSQL and when to know SQL

  1. 1. When to NoSQL and When ! to Know SQL Simon Elliston Ball Head of Big Data ! @sireb !#noSQLknowSQL ! ! ! http://nosqlknowsql.io
  2. 2. what is NoSQL? SQL NoSQL Not only SQL No, SQL Many many things
  3. 3. before SQL files multi-value ur… hash maps?
  4. 4. after SQL everything is relational ORMs fill in the other data structures scale up rules data first design
  5. 5. and now NoSQL datastores that suit applications polyglot persistence: the right tools scale out rules APIs not EDWs
  6. 6. why should you care? data growth rapid development fewer migration headaches… maybe machine learning social
  7. 7. big bucks. number of tools. Median base salary is constant at $100k for those 9 using up to 10 tools, but increases with new tools after that. O’Reilly 2013 Data Science Salary Survey Given the two patterns we have just examined—the relationships be‐
  8. 8. So many NoSQLs…
  9. 9. document databases
  10. 10. document databases rapid development JSON docs complex, variable models known access pattern
  11. 11. document databases learn a very new query language denormalize document form joins? JUST DON’T http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/
  12. 12. document vs SQL what can SQL do? query all the angles sure, you can use blobs… … but you can’t get into them
  13. 13. documents in SQL SQL xml fields mapping xquery paths is painful native JSON
  14. 14. query everything: search class of database database full-text indexing
  15. 15. you know google, right… range query span query keyword query
  16. 16. you know the score scores "query": { "function_score": { "query": { "match": { "title": "NoSQL"} }, "functions": [ "boost": 1, "gauss": { "timestamp": { "scale": "4w" } }, "script_score" : { "script" : "_score * doc['important_document'].value ? 2 : 1" } ], "score_mode": "sum" } }
  17. 17. SQL knows the score too scores declare @origin float = 0; declare @delay_weeks float = 4; ! select top 10 * from ( select title, score * case when p.important = 1 then 2.0 when p.important = 0 then 1.0 end * exp(-power(timestamp-@origin,2)/(2*@delay*7*24*3600)) + 1 as score from posts p where title like '%NoSQL%' ) as found order by score
  18. 18. you know google, right… more like this: instant tf-idf { "more_like_this" : { "fields" : ["name.first", "name.last"], "like_text" : "text like this one", "min_term_freq" : 1, "max_query_terms" : 12 } }
  19. 19. Facets
  20. 20. Facets
  21. 21. SQL: SELECT a.name, count(p.id) FROM people p JOIN industry a on a.id = p.industry_id JOIN people_keywords pk on pk.person_id = p.id JOIN keywords k on k.id = pk.keyword_id WHERE CONTAINS(p.description, 'NoSQL') OR k.name = 'NoSQL' ... GROUP BY a.name SELECT a.name, count(p.id) FROM people p JOIN area a on a.id = p.area_id JOIN people_keywords pk on pk.person_id = p.id JOIN keywords k on k.id = pk.keyword_id WHERE CONTAINS(p.description, 'NoSQL') OR k.name = 'NoSQL' ... GROUP BY a.name Facets x lots
  22. 22. Elastic search: { } "query": { “query_string": { “default_field”: “content”, “query”: “keywords” } }, "facets": { “myTerms": { "terms": { "field" : "lang", "all_terms" : true } } } Facets
  23. 23. logs untyped free-text documents timestamped semi-structured discovery aggregation and statistics
  24. 24. key: value close to your programming model distributed map | list | set keys can be objects
  25. 25. SQL extensions hash types hstore
  26. 26. SQL and polymorphism inheritance ORMs hide the horror
  27. 27. turning round the rows columnar databases physical layout matters
  28. 28. turning round the rows key! value type 1 A Home 2 B Work 3 C Work 4 D Work Row storage 00001 1 A Home 00002 2 B Work 00003 3 C Work Work … Column storage A B C D Home Work Work …
  29. 29. teaching an old SQL new tricks MySQL InfoBright SQL Server Columnar Indexes CREATE NONCLUSTERED COLUMNSTORE INDEX idx_col ON Orders (OrderDate, DueDate, ShipDate) Great for your data warehouse, but no use for OLTP
  30. 30. millions of columns eventually consistent CQL set | list | map types http://cassandra.apache.org/ http://www.datastax.com/
  31. 31. column for hadoop and other animals Parquet http://parquet.io ORC files
  32. 32. cell level security SQL: so many views, so much confusion accumulo https://accumulo.apache.org/
  33. 33. Tim e ser ies
  34. 34. time retrieving time series and graphs window functions SELECT business_date, ticker, close, close / LAG(close,1) OVER (PARTITION BY ticker ORDER BY business_date ASC) - 1 AS ret FROM sp500
  35. 35. Queues
  36. 36. queues in SQL CREATE procedure [dbo].[Dequeue] AS ! set nocount on ! declare @BatchSize int set @BatchSize = 10 ! declare @Batch table (QueueID int, QueueDateTime datetime, Title nvarchar(255)) ! begin tran ! insert into @Batch select Top (@BatchSize) QueueID, QueueDateTime, Title from QueueMeta WITH (UPDLOCK, HOLDLOCK) where Status = 0 order by QueueDateTime ASC ! declare @ItemsToUpdate int set @ItemsToUpdate = @@ROWCOUNT ! update QueueMeta SET Status = 1 WHERE QueueID IN (select QueueID from @Batch) AND Status = 0 ! if @@ROWCOUNT = @ItemsToUpdate begin commit tran select b.*, q.TextData from @Batch b inner join QueueData q on q.QueueID = b.QueueID print 'SUCCESS' end else begin rollback tran print 'FAILED' end
  37. 37. queues in SQL index fragmentation is a problem but built in logs of a sort
  38. 38. message queues specialised apis capabilities like fan-out routing acknowledgement
  39. 39. relationships count Graph databases
  40. 40. relationships count trees and hierarchies overloaded relationships fancy algorithms
  41. 41. hierarchies with SQL adjacency lists CONSTRAIN fk_parent_id_id FOREIGN KEY parent_id REFERENCES some_table.id materialised path nested sets (MPTT) path = 1.2.23.55.786.33425 Node Left Right Depth A 1 22 1 B 2 9 2 C 10 21 2 D 3 4 3 E 5 6 3 F 7 8 3 G 11 18 3 H 19 20 3 I 12 13 4 J 14 15 4 K 16 17 4
  42. 42. graph demo
  43. 43. Velocity https://www.flickr.com/photos/jiteshjagadish
  44. 44. when locks attack… Don’t get ACID on the cuts
  45. 45. Big Data
  46. 46. SQL on Hadoop
  47. 47. System issues, ! Speed issues,! Soft issues
  48. 48. the ACID, BASE litmus Atomic Consistent Isolated Basically Available Soft-state Eventually consistent Durable what matters to you?
  49. 49. Consistency CAP it all Partition Availability
  50. 50. is it web scale? most noSQL scales well but clusters still need management are you facebook? one machine is easier than n
  51. 51. write fast, ask questions later SQL writes cost a lot mainly write workload: NoSQL low latency write workload: NoSQL
  52. 52. who is going to use it? analysts: they want SQL developers: they want applications data scientists: the want access
  53. 53. choose the ! right tool Photo: http://www.homespothq.com/
  54. 54. Questions Simon Elliston Ball simon@simonellistonball.com ! @sireb #noSQLknowSQL http://bit.ly/knowSQLnoSQL

×