SQL++ FOR BIG DATA
Same Language, More Power
Date
Matthew D. Groves
2
SQL, for the win
https://insights.stackoverflow.com/survey/2019
AGENDA
01/ SQL & Relational
02/ NoSQL
03/ Analytics & Reporting
04/ Demo
05/ Summary & More Resources
4
Where am I?
• All Things Open
• https://allthingsopen.org/
5
Who am I?
• Matthew D. Groves
• Developer Advocate for Couchbase
• @mgroves on Twitter
• Podcast and blog: https://crosscuttingconcerns.com
• "I am not an expert, but I am an enthusiast." – Alan Stevens
by @natelovett
01/ SQL & RELATIONAL
9
Before SQL: Relational Databases
• E.F. Codd invented the relational model
• Alpha
• "Although it is logically unnecessary to store both a relation and some permutation of it, performance considerations could make it advisable."
10
SQL
• Created by Don Chamberlin & Raymond Boyce
• Designed to be English-friendly
• BCNF (Boyce-Codd Normal Form)
• "SQL" and "relational" are now synonyms
11
Criticisms/tradeoffs of SQL/relational
• Impedance mismatch
• Scaling
• Inflexibility
• Performance
12
Impedance mismatch

ShoppingCart
ID  Username  DateCreated
1   mgroves   2019-06-13
2   agroves   2019-06-14
.   .         .
.   .         .

ShoppingCartItems
CartID  Item     Price  Qty
1       hat      12.99  1
1       socks    11.99  1
2       t-shirt  15.99  1
.       .        .      .
.       .        .      .

public class ShoppingCart
{
    public int Id;
    public string Username;
    public List<Items> Items;
}
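To make the mismatch concrete, here is a minimal Python sketch (not the C# from the slide) of the stitching work an OR/M does for you. The row data comes from the two tables above; the `Item` class and the `load_carts` helper are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass, field

# Rows as they sit in the two relational tables on the slide.
shopping_cart_rows = [
    (1, "mgroves", "2019-06-13"),
    (2, "agroves", "2019-06-14"),
]
shopping_cart_item_rows = [
    (1, "hat", 12.99, 1),
    (1, "socks", 11.99, 1),
    (2, "t-shirt", 15.99, 1),
]


@dataclass
class Item:  # hypothetical element type for the cart's item list
    name: str
    price: float
    qty: int


@dataclass
class ShoppingCart:
    id: int
    username: str
    items: list = field(default_factory=list)


def load_carts(cart_rows, item_rows):
    """Hand-rolled stand-in for an OR/M: stitch flat rows into nested objects."""
    carts = {cart_id: ShoppingCart(cart_id, user) for cart_id, user, _created in cart_rows}
    for cart_id, name, price, qty in item_rows:
        carts[cart_id].items.append(Item(name, price, qty))
    return list(carts.values())


carts = load_carts(shopping_cart_rows, shopping_cart_item_rows)
row_count = len(shopping_cart_rows) + len(shopping_cart_item_rows)
print(f"{row_count} rows became {len(carts)} carts")
```

Five stored rows collapse into two application objects, which is the gap the tooling has to bridge on every read and write.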
13
Scaling
Vertical Horizontal
14
Scaling
The Free Lunch is Over
(by Herb Sutter)
http://www.gotw.ca/publications/concurrency-ddj.htm
15
Inflexibility
Billing
Connections
Purchases
Contacts
Customer
16
Disclaimer!
• A relational database may be…
02/ NOSQL / SQL++
18
JSON data is NoSQL data
19
Example 1
{
"callsign": "UNITED",
"country": "United States",
"name": "United Airlines",
"type": "airline"
}
document key: airline_5209
20
Example 2
document key: route_55758
{
"airlineid": "airline_5209",
"destinationairport": "ORD",
"distance": 1050.394306634423,
"equipment": "ER4 ERJ",
"schedule": [
{ "day": 0, "flight": "UA479", "utc": "15:05:00" },
{ "day": 1, "flight": "UA842", "utc": "02:27:00" },
{ "day": 1, "flight": "UA252", "utc": "03:00:00" },
// ... etc ...
],
"sourceairport": "CMH",
"stops": 0,
"type": "route"
}
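The `airlineid` field in the route document is a reference to another document's key rather than an embedded copy. A minimal sketch of following that reference, assuming a plain Python dict as a stand-in for the document store (no real database client is used, and only the three schedule entries visible on the slide are included):

```python
# A tiny in-memory stand-in for a document database: document key -> JSON value.
documents = {
    "airline_5209": {
        "callsign": "UNITED",
        "country": "United States",
        "name": "United Airlines",
        "type": "airline",
    },
    "route_55758": {
        "airlineid": "airline_5209",
        "destinationairport": "ORD",
        "sourceairport": "CMH",
        "stops": 0,
        "type": "route",
        # Only the schedule entries shown on the slide; the rest is elided there.
        "schedule": [
            {"day": 0, "flight": "UA479", "utc": "15:05:00"},
            {"day": 1, "flight": "UA842", "utc": "02:27:00"},
            {"day": 1, "flight": "UA252", "utc": "03:00:00"},
        ],
    },
}

route = documents["route_55758"]
# Follow the reference by key instead of joining on a foreign-key column.
airline = documents[route["airlineid"]]
# The nested schedule is read in place; no second table is involved.
day_one_flights = [s["flight"] for s in route["schedule"] if s["day"] == 1]

print(airline["name"], day_one_flights)  # United Airlines ['UA842', 'UA252']
```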
21
NoSQL basic operations
• Get by key
• Set by key
• Delete by key
• Map/reduce / other "operational" query
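The three key-based operations can be sketched in a few lines. `DocumentStore` below is a hypothetical in-memory class written for this illustration; it is not the API of Couchbase or any other database.

```python
class DocumentStore:
    """Hypothetical in-memory key/value store; not a real database client."""

    def __init__(self):
        self._docs = {}

    def set(self, key, doc):
        # Create or overwrite the document stored under this key.
        self._docs[key] = doc

    def get(self, key):
        # Return the document, or None when the key does not exist.
        return self._docs.get(key)

    def delete(self, key):
        # Remove the document; deleting a missing key is a no-op.
        self._docs.pop(key, None)


store = DocumentStore()
store.set("airline_5209", {"name": "United Airlines", "type": "airline"})
fetched = store.get("airline_5209")
store.delete("airline_5209")
missing = store.get("airline_5209")
```

Each call touches exactly one key, which is why these operations are cheap and easy to distribute. Anything that has to look inside many documents needs a query mechanism on top.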
22
What about reporting and analytics?
• Problems:
  • Large amounts of data
  • Queries against the data could impact operations
  • Have to learn a new query language
03/ ANALYTICS & REPORTING
24
Operational vs Operational Analytics
25
What is Operational?
26
What is Operational Analytics?
27
Operational workload
• Many concurrent queries
• Well-defined
• Simple (generally)
• Performance is vital
28
Operational Analytics workload
• Low concurrency
• Ad hoc
• Could be complex
• Low latency is nice-to-have
29
How are operational analytics done?
30
Answer 1
¯\_(ツ)_/¯
31
Answer 2: Export to relational
Data
ETL
SQL
32
Answer 3: Hadoop?
http://bit.ly/hadoop_ecosystem
33
Answer 4: SQL++
34
SQL Example

mytable
ID  foo   bar     baz
1   matt  groves  qux
2   ali   groves  notqux
3   emma  groves  notqux

SELECT foo, bar
FROM mytable
WHERE baz = 'qux'
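To try the relational version of this query, here is a small sketch using Python's built-in `sqlite3` module. The column types are an assumption, since the slide shows only the values.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER PRIMARY KEY, foo TEXT, bar TEXT, baz TEXT)"
)
conn.executemany(
    "INSERT INTO mytable (id, foo, bar, baz) VALUES (?, ?, ?, ?)",
    [
        (1, "matt", "groves", "qux"),
        (2, "ali", "groves", "notqux"),
        (3, "emma", "groves", "notqux"),
    ],
)

# The query from the slide, unchanged.
rows = conn.execute("SELECT foo, bar FROM mytable WHERE baz = 'qux'").fetchall()
conn.close()

print(rows)  # [('matt', 'groves')]
```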
35
SQL++ Example
key: 1
{
"foo" : "matt",
"bar" : "groves",
"baz" : "qux"
}
key: 2
{
"foo" : "ali",
"bar" : "groves",
"baz" : "notqux"
}
key: 3
{
"foo" : "emma",
"bar" : "groves",
"baz" : "notqux"
}
mybucket
SELECT foo, bar
FROM mybucket
WHERE baz = 'qux'
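The same query text runs against JSON documents instead of rows. As a rough sketch of what it computes, the Python below stands in for the query engine; it is not a Couchbase SDK call, and the dict is only a model of the `mybucket` collection.

```python
# Document key -> JSON document, mirroring the three documents on the slide.
mybucket = {
    "1": {"foo": "matt", "bar": "groves", "baz": "qux"},
    "2": {"foo": "ali", "bar": "groves", "baz": "notqux"},
    "3": {"foo": "emma", "bar": "groves", "baz": "notqux"},
}

# SELECT foo, bar FROM mybucket WHERE baz = 'qux'
result = [
    {"foo": doc["foo"], "bar": doc["bar"]}  # projection (SELECT)
    for doc in mybucket.values()            # source collection (FROM)
    if doc.get("baz") == "qux"              # filter (WHERE)
]

print(result)  # [{'foo': 'matt', 'bar': 'groves'}]
```

The result is a list of JSON objects rather than a set of tuples, but the SELECT / FROM / WHERE reading is the same.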
36
SQL++ Research Project
37
SQL++ is backwards compatible
• JOIN
• UNION
• aggregation / GROUP BY
• SELECT
• LET
• LIMIT
• ORDER BY
• etc…
38
SQL++ has superpowers
39
Superpower: Nested Objects
key 1
{
"name" : "matt",
"address" : {
"street" : "White Rd",
"city" : "Grove City",
"state" : "OH"
}
}
key 2
{
"name" : "emma",
"address" : {
"street" : "High St",
"city" : "Columbus",
"state" : "OH"
}
}
SELECT address.city
FROM myusers
myusers
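A dotted path such as `address.city` simply walks into the nested object. The Python below is a sketch of that behaviour over the two `myusers` documents, not an actual SQL++ engine.

```python
myusers = {
    "1": {
        "name": "matt",
        "address": {"street": "White Rd", "city": "Grove City", "state": "OH"},
    },
    "2": {
        "name": "emma",
        "address": {"street": "High St", "city": "Columbus", "state": "OH"},
    },
}

# SELECT address.city FROM myusers
cities = [{"city": doc["address"]["city"]} for doc in myusers.values()]

print(cities)  # [{'city': 'Grove City'}, {'city': 'Columbus'}]
```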
40
Superpower: arrays
key 1
{
"name" : "matt",
"favoriteFoods" : [
"pizza",
"cheesecake",
"donuts"
]
}
key 2
{
"name" : "emma",
"favoriteFoods" : [
"donuts",
"Lucky Charms",
"chicken"
]
}
SELECT favoriteFoods[1]
FROM myusers
myusers
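Square brackets address an array element by position. The sketch below assumes zero-based indexing, as in Couchbase's N1QL, so `[1]` is the second favorite food. Python is again only a stand-in for the query engine.

```python
myusers = {
    "1": {"name": "matt", "favoriteFoods": ["pizza", "cheesecake", "donuts"]},
    "2": {"name": "emma", "favoriteFoods": ["donuts", "Lucky Charms", "chicken"]},
}

# SELECT favoriteFoods[1] FROM myusers  (index 1 = second element, zero-based)
second_favorites = [doc["favoriteFoods"][1] for doc in myusers.values()]

print(second_favorites)  # ['cheesecake', 'Lucky Charms']
```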
42
Superpower: Quantification
key 1
{
"name" : "matt",
"favoriteFoods" : [
"pizza",
"cheesecake",
"donuts"
]
}
key 2
{
"name" : "emma",
"favoriteFoods" : [
"donuts",
"Lucky Charms",
"chicken"
]
}
SELECT u.name
FROM myusers u
WHERE ANY f
IN u.favoriteFoods
SATISFIES f == 'pizza'
END;
myusers
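`ANY ... IN ... SATISFIES ... END` asks whether at least one array element passes a test, which maps closely onto Python's `any()`. A sketch of the query above, with Python standing in for the query engine:

```python
myusers = {
    "1": {"name": "matt", "favoriteFoods": ["pizza", "cheesecake", "donuts"]},
    "2": {"name": "emma", "favoriteFoods": ["donuts", "Lucky Charms", "chicken"]},
}

# SELECT u.name FROM myusers u
# WHERE ANY f IN u.favoriteFoods SATISFIES f == 'pizza' END;
pizza_fans = [
    u["name"]
    for u in myusers.values()
    if any(f == "pizza" for f in u["favoriteFoods"])
]

print(pizza_fans)  # ['matt']
```

No array index appears anywhere, so the query keeps working wherever 'pizza' sits in the list.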
43
Implementations
44
Implementation 1: Couchbase
SQL++
45
Implementation 2: AsterixDB
46
Implementation 3: Apache Drill
47
Implementation 4: PartiQL
04/ DEMO
05/ SUMMARY
51
NoSQL doesn't mean NoSQL anymore
52
SQL++ is SQL with JSON Superpowers
53
Minimize your ETL, maximize your SQL skills
ETL 👎
SQL 👍
54
Resources: SQL/scaling
• E.F. Codd original research paper
  • http://db.dobo.sk/wp-content/uploads/2015/11/Codd_1970_A_relational_model.pdf
• The Free Lunch is Over
  • http://www.gotw.ca/publications/concurrency-ddj.htm
• Original SEQUEL paper
  • https://dl.acm.org/citation.cfm?id=811515
55
Resources: UCSD Research
• UCSD
  • http://forward.ucsd.edu/sqlpp.html
• The SQL++ Query Language
  • https://arxiv.org/abs/1405.3631
56
Resources: Don Chamberlin
• Book: SQL++ for SQL Users
  • Amazon: https://www.amazon.com/SQL-Users-Tutorial-Don-Chamberlin/dp/0692184503/
  • Free PDF: https://resources.couchbase.com/sql_tutorial
• Videos
  • NoSQL and SQL++, two sides of the same coin: https://www.youtube.com/watch?v=KGKiSyJa0-k
  • Tech Panel on Query Language Evolution: https://www.youtube.com/watch?v=LAlDe1w7wxc
57
Resources: Me!
@mgroves
twitch.tv/matthewdgroves
Find me after this session!
matthew.groves@couchbase.com
58
Next Steps
• 💻 Install Couchbase: https://couchbase.com/downloads
• 👩🏽‍🏫 Free training: https://learn.couchbase.com
• 📅 Upcoming events: https://couchbase.com/resources/events
• 📝 Blogs: https://blog.couchbase.com/category/analytics/
• ❔ Forums: https://forums.couchbase.com/c/analytics
Frequently Asked Questions
59
1. How is Couchbase different than Mongo?
2. Is Couchbase the same thing as CouchDB?
3. How tall are you? Do you play basketball?
4. What is the Couchbase licensing situation?
5. Is Couchbase a Managed Cloud Service (DBaaS)?
Managed Cloud Server (DBaaS)
60
MongoDB vs Couchbase
61
• Architecture
• Memory first architecture
• Master-master architecture
• Auto-sharding
• Features
• SQL (N1QL)
• Full Text Search
• Analytics (NoETL)
Licensing
62
Couchbase Server Community
• Source code is Open Source (Apache 2)
• Binary release is one release behind Enterprise (except major versions)
• Free to use anywhere
• Forum support only
Couchbase Server Enterprise
• Source code is mostly Open Source (Apache 2)
• Some features not available on Community (XDCR TLS, MDS, Rack Zone, etc)
• Free to use in dev/test/qa
• Need commercial license for prod
• Paid support provided
CouchDB and Couchbase
63
memcached

Introduction to SQL++ for Big Data: Same Language, More Power

Editor's Notes

  • #3 Show that SQL is popular, using the Stack Overflow survey 2019. About the same as it was last year, in the 55-60% range. Popular doesn't necessarily equal good, of course, but if you look at the top 3, they are all in the "lingua franca" category. SQL rules data. https://insights.stackoverflow.com/survey/2019
  • #9 Image: http://www.sentientdevelopments.com/2011/06/primal-transhumanism.html
  • #10 E.F. Codd did a lot of great theoretical work and research, including the invention of the relational database. Interesting quote from his original paper that describes one of the fundamental tradeoffs between relational and non-relational data, which we'll explore today. After his initial paper, he designed a language called "Alpha", which was never implemented but was influential.
  • #11 SQL was created to make data querying more accessible to people
  • #12 EF Codd even points out a trade-off between disk space and performance considerations
  • #13 In the database we have 5 pieces of data stored for what is actually 2 shopping carts as they exist in the application. We have tools to attempt to deal with this, mainly OR/Ms, and they mostly do a good job… mostly.
  • #14 The easiest way to scale a relational database is vertically, but this can get expensive and eventually hits a ceiling. Horizontal scaling can be cheaper and can scale bigger, but it is difficult to do with relational.
  • #15 http://www.gotw.ca/publications/concurrency-ddj.htm This is about concurrency, which leads to distributed systems, which leads to distributed databases
  • #16 Rise of agile methodologies: "we value responding to change over following a plan". Schema changes: a simple change like moving the "credit card number" field from customer to a new "billing" table with a foreign key. That's a simple example, but even that, with a large enough database, could have a huge impact. The more complex the schema change and the bigger the database, the more impact it has, which means the more expensive/risky this change will be.
  • #17 I'm not here to convince you that relational is dead! You are working with small data sets (for some definition of small). You are working with simple/rarely changing data structures (for some definition of simple/rarely). You aren't feeling performance/scaling pain (yet). But don't turn off your mind yet. You aren't facing these problems now, but you may face them in the future.
  • #18 So what if it's not fine?
  • #19 Isolated pieces of data: "documents". They can be sharded / split between any number of nodes. (For some reason, when I think of "shards" I think of the crystals that Superman has in the Fortress of Solitude.)
  • #20 This is a simple example. Flat data; you could easily imagine this as a row in a table. Notice the document KEY. A document database is basically a key/value store: the value is the JSON and the key is some string. This may look slightly different from database to database, but they all have a key somewhere.
  • #21 More complex example. The 'schedule' element in relational would be at least one separate table with foreign keys. It's all domestic data here. No mismatch, easy to scale, no joining required. No schema to follow, so I could add other fields TO JUST THIS ONE DOCUMENT if necessary. Don't ALWAYS denormalize; notice the 'airlineid' field.
  • #22 Map/reduce can be parallelized, but it's not great for ad hoc queries. Mongo has a JavaScript-like query language. Couchbase actually uses SQL for operational queries.
  • #23 Suppose your database is used for the backend of an ecommerce site. Everything is humming along nicely: customers are adding items to shopping carts, making purchases, and browsing the catalog with well-known, well-indexed queries. Suddenly I come along trying to create a report. I run a complicated or ad hoc query that I don't have proper indexing, sizing, or tuning for, and my query impacts customers: it slows them down or, worse, causes timeouts.
  • #25 Define these terms. Talk more about the differences later, and when to use each one. Operational: the moment-to-moment data operations and queries that your website needs to function in order to serve customers. Operational analytics: queries and reporting that are close to real time, perhaps analyzing only the last 6 months or maybe even the last hour of data (dashboards/reporting/trend analysis). Analytical: the operations and queries that you need to serve customers in the extreme long run and over extreme history (data science, etc.).
  • #26 Website, mobile app, anything where there are a lot of end users/customers/public. Examples: query for a buyer to get a list of possible makes/models; query for a buyer to get a list of cars for sale within a search radius; query for a seller to get a list of their cars on sale; etc. These are going to run often and concurrently with many other users, they are well-defined, and they should run very quickly.
  • #27 Answer enterprise-wide questions in close to real time. I really have no idea what cars.com is interested in, but some examples: how many cars were sold today, this week, this month; how many were sold YoY; real-time rankings. Drill down on each of these: group by manufacturer, group by car type, etc. Could be a large number of permutations / complexity. Low concurrency: only a handful of people have access to an admin dashboard, for instance.
  • #28 Lots of concurrent queries. Queries are well defined, well indexed, optimized. Queries are *generally* simple. Performance is very important: how long is a customer willing to wait until they ditch your site?
  • #29 Fewer queries, and fewer queries running at the same time. Queries are more ad hoc in nature. Queries might be VERY complex. Performance is always nice, but latency is less important in this environment since it's not directly impacting, for instance, web page load times.
  • #30 In my experience, I've seen 3-4 approaches
  • #31 I dunno? We don't really have a plan for this; we don't think about it. We have a bunch of Access databases? We copy the operational data when we want to? Or just link to it directly and hope no one screws it up?
  • #32 Export it to a relational database and use SQL, like a data warehouse. Create/maintain or buy an ETL. Impedance mismatch (again!). Size/performance.
  • #33 Hadoop is designed for massive scale, not massive speed. It's analytics, but it's not operational analytics. Using Hadoop and the Hadoop ecosystem is a whole other topic. This may be too big of a hammer or too slow of a hammer for operational analytics. Answer 3: Hadoop or something. Still an ETL problem: Kafka, Sqoop, Flume, etc. How do we actually create queries? Pig, Hive, Spark, etc. Designed for petabytes+. Two types of analytics: this is the data lake, analyzing data from the entire history of the company. https://medium.com/@ylashin/big-data-using-hdinsight-a-journey-in-the-zoo-ecosystem-c78b913a5ed9
  • #34 You already know how to write SQL. Designed to work with richly structured data. Minimal or no ETL required. This is the cover of a book, and notice the author.
  • #36 As Don Chamberlin says, JSON kinda looks like tables "if you squint hard enough"
  • #37 SQL++ was a research project from UCSD in 2015 (https://arxiv.org/abs/1405.3631). Couchbase's N1QL (operational) is the first implementation of this research paper.
  • #38 The language itself is backwards compatible. The underlying data is different: it's not tables and rows, it's collections of JSON documents.
  • #39 SQL is made for flat relational data. SQL++ takes it a step further to deal with structured data, and therefore it has some superpowers.
  • #40 In JSON you can have nested objects: objects within objects, like address here. How do I select that, project that, etc.? The answer is: dotted syntax.
  • #41 Addressing arrays with square brackets
  • #42 If this was relational, it would be a separate table and would require a join.
  • #43 I mentioned the array syntax before, but I'd have to know the exact index of an item. What if I wanted to ask "who are the users who have pizza as a favorite food"?
  • #44 There are 4 implementations of SQL++ that you can start using today
  • #45 This is probably the most production-ready one that anyone can start using today. Analytics; workload isolation; "shadow copy" created with two commands. It technically IS an ETL, but it is real time, it's created with two simple commands, and it's otherwise completely automated. I'll show you a demo of this later. Workload isolation, read only.
  • #46 "Big data management system". Data ingestion (ETL), with a variety of built-in adapters (local filesystem, HDFS, socket, Twitter, RSS), and it's extensible. Couchbase is essentially using a customized version of AsterixDB under the hood.
  • #47 No ETL required. Seems to access data directly ("in-place analytics"), which could be a workload isolation problem (operational vs analytics). Can connect to a wide variety of databases.
  • #48 This is an Amazon-backed effort. It seems a lot like Apache Drill to me. This one is brand new to me; I haven't used it much. It's a work in progress: it's version 0.1 now. It claims to be a SQL++ implementation, but they've made some choices, specifically with JOIN, that don't quite line up (at least in my eyes). It seems somewhat experimental, but there are some AWS customers using it, apparently. It works on a variety of data formats, including JSON but also others.
  • #49 SQL++ supports indexing. However, much of the time you don't need to worry about it. Why not? The short answer is MPP, parallelism, and examining metadata to pick the best execution plan. For the long answer, check out the video at this link. Drill supports indexing only for MapR (it will use indexes in the other databases, of course).
  • #52 They say you only remember 3 things from any presentation, so here they are
  • #55 Codd research paper - http://db.dobo.sk/wp-content/uploads/2015/11/Codd_1970_A_relational_model.pdf (may not be a good link in the long run, but it's free) - The Free Lunch is Over - http://www.gotw.ca/publications/concurrency-ddj.htm - SEQUEL paper - https://dl.acm.org/citation.cfm?id=811515 (I couldn't find a free copy)
  • #56 - http://forward.ucsd.edu/sqlpp.html (SQL++ part of the FORWARD project) - https://arxiv.org/abs/1405.3631 (paper posted on arXiv, which is hosted by Cornell)
  • #57  - book - https://www.amazon.com/SQL-Users-Tutorial-Don-Chamberlin/dp/0692184503/ - free pdf: https://resources.couchbase.com/sql_tutorial - videos - https://www.youtube.com/watch?v=KGKiSyJa0-k - https://www.youtube.com/watch?v=LAlDe1w7wxc
  • #58 If anything looks interesting to you, or you have questions or feedback, come talk to me afterwards. I want to hear from you! My boss says I have to listen to you, it's my job. So now's your chance :)
  • #59 If you want to check out more about analytics on Couchbase, here are some free resources for you (except events, I can't promise all events will be free)
  • #61 Rackspace partnership. Couchbase IS in the Azure and AWS marketplaces, and there are some wizards to make config easy, but it runs on your VMs. A full DBaaS will be coming soon.
  • #62 Memory first: integrated cache, so you don't need to put Redis on top of Couchbase. Master-master: easier scaling, better scaling. Auto-sharding: into what we call vBuckets; you don't have to come up with a sharding scheme, it's done by CRC32. N1QL: SQL; Mongo has a more limited query language and it's not SQL-like. Full Text Search: uses the bleve search engine, with language-aware FTS capabilities built in. Mobile & sync: Mongo has nothing like the offline-first and sync capabilities Couchbase offers. Mongo DOES have a DBaaS cloud provider.
  • #63 Everything I've shown you today is available in Community edition. The only N1QL features I can think of that are not in Community are INFER and the Query Plan Visualizer. You probably don't need the Enterprise features unless you are an enterprise developer.
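The #62 note says auto-sharding is done by CRC32, so nobody has to design a sharding scheme. A simplified sketch of hash-based placement follows. The shard count and the plain modulo step are assumptions for illustration and are not Couchbase's exact vBucket mapping; `shard_for` is a hypothetical helper.

```python
import zlib

NUM_SHARDS = 1024  # assumed shard count, for illustration only


def shard_for(key: str) -> int:
    """Map a document key to a shard number using a CRC32 hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


for key in ("airline_5209", "route_55758"):
    print(key, "->", shard_for(key))
```

Because the shard is a pure function of the key, any client can work out where a document lives without consulting a lookup table.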