• Relational
• E.F. Codd invented the relational model
• Alpha
• SQL
• Created by Don Chamberlin & Raymond Boyce
• Designed to be English-friendly
• "SQL" and "relational" are now synonyms
Relational and SQL
• E.F. Codd's original research paper
• http://db.dobo.sk/wp-content/uploads/2015/11/Codd_1970_A_relational_model.pdf
• The Free Lunch is Over
• http://www.gotw.ca/publications/concurrency-ddj.htm
• Original SEQUEL paper
• https://dl.acm.org/citation.cfm?id=811515
Resources: SQL/scaling
• Book: SQL++ for SQL Users
• Amazon: https://www.amazon.com/SQL-Users-Tutorial-Don-Chamberlin/dp/0692184503/
• Free PDF: https://resources.couchbase.com/sql_tutorial
• Videos
• NoSQL and SQL++, two sides of the same coin:
https://www.youtube.com/watch?v=KGKiSyJa0-k
• Tech Panel on Query Language Evolution:
https://www.youtube.com/watch?v=LAlDe1w7wxc
Resources: Don Chamberlin
Show that SQL is popular, using the Stack Overflow survey 2019
About the same as it was last year, in the 55-60% range
Popular doesn't necessarily equal good, of course, but if you look at the top 3, they are all in the "lingua franca" category
SQL rules data
https://insights.stackoverflow.com/survey/2019
E.F. Codd did a lot of great theoretical work and research, including the invention of the relational model
Interesting quote from his original paper that describes one of the fundamental tradeoffs between relational and non-relational data, which we'll explore today
After his initial paper, he designed a language called "Alpha", which was never implemented but was influential
In the database we have 5 pieces of data stored
For what is actually 2 shopping carts as they exist in the application
We have tools to attempt to deal with this, mainly O/RMs
And they mostly do a good job… mostly
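A minimal sketch of that mismatch, in Python. The table layout and field names are assumptions for illustration (the slide only says 5 pieces of data for 2 carts); here the 5 pieces are 2 cart rows plus 3 item rows, and `assemble_carts` does by hand roughly what an O/RM does for you.

```python
from collections import defaultdict

# Hypothetical rows: 2 cart rows + 3 item rows = 5 pieces of stored data.
cart_rows = [
    {"cart_id": 1, "customer": "alice"},
    {"cart_id": 2, "customer": "bob"},
]
item_rows = [
    {"cart_id": 1, "product": "keyboard", "qty": 1},
    {"cart_id": 1, "product": "mouse", "qty": 2},
    {"cart_id": 2, "product": "monitor", "qty": 1},
]


def assemble_carts(cart_rows, item_rows):
    """Stitch flat rows back into the 2 cart objects the application uses."""
    items_by_cart = defaultdict(list)
    for row in item_rows:
        items_by_cart[row["cart_id"]].append(
            {"product": row["product"], "qty": row["qty"]}
        )
    return [
        {"customer": cart["customer"], "items": items_by_cart[cart["cart_id"]]}
        for cart in cart_rows
    ]


carts = assemble_carts(cart_rows, item_rows)
```

Every read and every write has to cross this translation layer, which is where the "mostly" comes from.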
The easiest way to scale a relational database is vertically
But this can get expensive and eventually hits a ceiling (The Free Lunch is Over)
Horizontal scaling can be cheaper and can scale bigger, but is difficult to do with a relational database
Rise of agile methodologies
"we value responding to change over following a plan"
Schema changes
A simple change: moving the "credit card number" field from the customer table to a new "billing" table with a foreign key
That's a simple example, but even that could have a huge impact on a large enough database
The more complex the schema change and the bigger the database, the more impact it has
Which means the more expensive/risky this change will be
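Here is a rough sketch of what that "simple" change involves, using Python's built-in `sqlite3` module. The table and column names (`customer`, `billing`, `credit_card_number`) and the sample rows are assumptions for illustration, and a production migration on another database engine would look different.

```python
import sqlite3


def migrate_credit_card_to_billing(conn):
    """Move credit_card_number out of customer into a new billing table."""
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE billing ("
        " id INTEGER PRIMARY KEY,"
        " customer_id INTEGER NOT NULL REFERENCES customer(id),"
        " credit_card_number TEXT)"
    )
    # This copy touches every customer row, so its cost grows with table size.
    cur.execute(
        "INSERT INTO billing (customer_id, credit_card_number) "
        "SELECT id, credit_card_number FROM customer"
    )
    # Rebuild customer without the old column (a second full-table copy).
    cur.execute("CREATE TABLE customer_new (id INTEGER PRIMARY KEY, name TEXT)")
    cur.execute("INSERT INTO customer_new SELECT id, name FROM customer")
    cur.execute("DROP TABLE customer")
    cur.execute("ALTER TABLE customer_new RENAME TO customer")
    conn.commit()


# Tiny in-memory demo with hypothetical data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    " id INTEGER PRIMARY KEY, name TEXT, credit_card_number TEXT)"
)
conn.executemany(
    "INSERT INTO customer (name, credit_card_number) VALUES (?, ?)",
    [("alice", "4111-0000-0000-0001"), ("bob", "4111-0000-0000-0002")],
)
migrate_credit_card_to_billing(conn)

customer_columns = [row[1] for row in conn.execute("PRAGMA table_info(customer)")]
billing_rows = conn.execute(
    "SELECT customer_id, credit_card_number FROM billing ORDER BY customer_id"
).fetchall()
```

With two rows this is instant. With a very large table, both full-table copies (and every query and O/RM mapping that referenced the old column) become part of the cost of the change.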
I'm not here to convince you that relational is dead!
You are working with small data sets (for some definition of small)
You are working with simple/rarely changing data structures (for some definition of simple/rarely)
You aren't feeling performance / scaling pain (yet)
But don't turn off your mind yet. You aren't facing these problems now, but you may face them in the future.
So what if it's not fine?
Isolated pieces of data
"Documents"
Can be sharded / split between any number of nodes
(for some reason when I think of "shards" I think of the crystals that Superman has in the fortress of solitude)
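One way to picture sharding: because each document is isolated, a hash of its key is enough to decide which node it lives on. This is only a sketch; the key format and the plain modulo placement are assumptions, and real databases use more elaborate schemes so that adding a node does not reshuffle everything.

```python
import hashlib
from collections import defaultdict


def node_for_key(key, num_nodes):
    """Pick a node for a document key with a stable hash (illustrative only)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def shard(documents, num_nodes):
    """Group documents by the node their key hashes to."""
    nodes = defaultdict(dict)
    for key, doc in documents.items():
        nodes[node_for_key(key, num_nodes)][key] = doc
    return dict(nodes)


# Hypothetical documents keyed as "cart::<n>".
documents = {f"cart::{i}": {"items": i} for i in range(100)}
placement = shard(documents, num_nodes=4)
```

No document needs to know about any other document to be placed, which is what makes this kind of split hard to do with joined relational tables and easy to do here.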
This is a simple example
Flat data, you could easily imagine this as a row in a table
Notice the document KEY
A document database is basically a key/value store: the value is the JSON and the key is some string
This may look slightly different from database to database, but they all have a key somewhere.
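To make the key/value idea concrete, here is a toy in-memory store in Python. The class name, the key `airline_10`, and the document fields are all hypothetical; no real database client is being modeled.

```python
import json


class DocumentStore:
    """Toy document store: string key -> JSON text."""

    def __init__(self):
        self._data = {}

    def upsert(self, key, document):
        # The value is stored as JSON; the key lives outside the document.
        self._data[key] = json.dumps(document)

    def get(self, key):
        raw = self._data.get(key)
        return None if raw is None else json.loads(raw)


store = DocumentStore()
store.upsert(
    "airline_10",
    {"type": "airline", "name": "Example Air", "country": "United States"},
)
airline = store.get("airline_10")
```

A lookup by key never needs to inspect the JSON itself, which is part of why key-based reads are so cheap.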
More complex example
In a relational database, the 'schedule' element would be at least one separate table with foreign keys
It's all domestic data here
No mismatch, easy to scale, no joining required
No schema to follow, so I could add other fields TO JUST THIS ONE DOCUMENT if necessary
Don't ALWAYS normalize, notice the 'airlineid' field
Other operational query options: map/reduce; Mongo has a JavaScript-like query language; Couchbase uses SQL for operational queries
Suppose your database is used for the backend of an ecommerce site
Everything is humming along nicely, customers are adding items to shopping cart
They're making purchases, browsing the catalog with well-known, well-indexed queries
Suddenly I come along trying to create a report
I run a complicated or ad hoc query that the database isn't properly indexed, sized, or tuned for
And my query impacts customers: it slows them down or, worse, causes timeouts
Define these terms
Talk more about the differences later, when to use each one
Operational: means the moment-to-moment data operations and queries that your website needs to function in order to serve customers
Analytical: the operations and queries over the very long run and the full history of the data, which you need in order to keep serving customers in the long term (data science, etc.)
Operational analytics: sits between them, closer to real time, perhaps analyzing only the last 6 months or maybe even the last hour of data - dashboards/reporting/trend analysis
- far fewer analysts than customers (hopefully?)
- queries are more ad hoc in nature
- queries might be VERY complex
- performance is still nice, but less important.
There are 4 methods that I'm aware of
I have experience with most of these
I dunno?
We don't really have a plan for this, we don't think about it
We have a bunch of Access databases?
We copy the operational data when we want to?
Or just link to it directly and hope no one screws it up?
Export it to a relational database and use SQL
- Create/maintain or buy an ETL
Impedance mismatch (again!)
Size/performance
Hadoop is designed for massive scale, not massive speed.
It's analytics, but it's not operational analytics.
Using Hadoop and the Hadoop ecosystem is a whole other topic
This may be too big of a hammer or too slow of a hammer for operational analytics
Answer 3: Hadoop or something
- still an ETL problem: Kafka, Sqoop, Flume, etc.
- how do we actually create queries? Pig, Hive, Spark, etc.
Designed for petabytes and up
Two types of analytics:
This is the data lake: analyze data across the entire history of the company
https://medium.com/@ylashin/big-data-using-hdinsight-a-journey-in-the-zoo-ecosystem-c78b913a5ed9
you already know how to write SQL
Designed to work with richly structured data
minimal or no ETL required
This is the cover of a book, and notice the author
As Don Chamberlin says, JSON kinda looks like tables "if you squint hard enough"
SQL++ was a research project from UCSD in 2015
- https://arxiv.org/abs/1405.3631
- Couchbase's N1QL (operational) is the first implementation of this research paper
The language itself
The underlying data is different, it's not tables and rows
It's collections of JSON documents
SQL is made for flat relational data
SQL++ takes it a step further to deal with structured data, and therefore it has some superpowers
In JSON you can have nested objects
Objects within objects, like address here
How do I select that, project that, etc
The answer is: dotted syntax
Addressing arrays with square brackets
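To show what dotted syntax and square brackets are doing, here is a small Python path resolver. The `user` document, its field names, and the `resolve` helper are hypothetical; in SQL++ the query engine does this for you, roughly in the spirit of `SELECT u.address.city, u.favoriteFoods[0] FROM users u` (shown only as an illustration, not taken from the slides).

```python
import re

# Hypothetical document with a nested object and an array.
user = {
    "name": "alice",
    "address": {"city": "Springfield", "state": "OH"},
    "favoriteFoods": ["pizza", "tacos"],
}

# A path step is either a field name or an [index].
_STEP = re.compile(r"([A-Za-z_]\w*)|\[(\d+)\]")


def resolve(document, path):
    """Follow a path like 'address.city' or 'favoriteFoods[0]' into a document."""
    value = document
    for name, index in _STEP.findall(path):
        value = value[name] if name else value[int(index)]
    return value


city = resolve(user, "address.city")
first_food = resolve(user, "favoriteFoods[0]")
```

Dots walk into nested objects and brackets pick an array position, so one expression can reach data that would sit in another table in a relational schema.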
We may want to flatten that array in order to filter on the values
Consider "favoriteFoods": in a relational database it would be a separate table
In JSON, it's not, but we might want to do an "intra-document" join
Unnest will flatten out the array and basically join each array value to its parent document
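The flattening can be sketched in plain Python. The `users` documents and the `unnest` helper are hypothetical, and the sketch assumes inner-join behavior (a document with an empty array produces no rows).

```python
# Hypothetical user documents.
users = [
    {"name": "alice", "favoriteFoods": ["pizza", "tacos"]},
    {"name": "bob", "favoriteFoods": ["sushi"]},
    {"name": "carol", "favoriteFoods": []},
]


def unnest(documents, array_field, alias):
    """Yield one row per array element, joined to its parent document."""
    for doc in documents:
        for element in doc.get(array_field, []):
            yield {**doc, alias: element}


rows = list(unnest(users, "favoriteFoods", "food"))

# Once flattened, the array values can be filtered like ordinary columns.
taco_fans = [row["name"] for row in rows if row["food"] == "tacos"]
```

Each output row carries both the parent's fields and one array value, which is the "intra-document join" described above.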
Quantification means that I want to perform some filtering of an array
To see if any or all items in an array satisfy some criteria
For instance, I want to find all users who have "pizza" as a favorite food
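The same idea in Python terms, as a sketch: "any" and "every" checks over an array inside each document. The `users` documents and helper names are hypothetical stand-ins for what the query language's quantifiers express.

```python
# Hypothetical user documents.
users = [
    {"name": "alice", "favoriteFoods": ["pizza", "tacos"]},
    {"name": "bob", "favoriteFoods": ["sushi"]},
    {"name": "carol", "favoriteFoods": []},
]


def any_satisfies(items, predicate):
    """True if at least one array item matches the predicate."""
    return any(predicate(item) for item in items)


def every_satisfies(items, predicate):
    """True if all array items match (Python treats an empty array as True)."""
    return all(predicate(item) for item in items)


# Users with "pizza" among their favorite foods.
pizza_lovers = [
    u["name"]
    for u in users
    if any_satisfies(u["favoriteFoods"], lambda food: food == "pizza")
]

# Users for whom no favorite food is "pizza".
no_pizza = [
    u["name"]
    for u in users
    if every_satisfies(u["favoriteFoods"], lambda food: food != "pizza")
]
```

Unlike the flattening approach, quantification keeps one result per document, so nobody shows up twice just because they have several matching array items.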
- Analytics
- Workload isolation
- "Shadow copy" created with two commands
It technically IS an ETL, but it is real time, and it's created with two simple commands
And it's otherwise completely automated
I'll show you a demo of this later
Workload isolation, read only
- "big data management system"
Data ingestion (ETL): a variety of built-in adapters (local filesystem, HDFS, socket, Twitter, RSS), and it's extensible
Couchbase is essentially using a customized version of AsterixDB under the hood
- No ETL required
- Seems to access data directly, which could be a workload isolation problem (operational vs analytics)
"in-place analytics"
Can connect to a wide variety of databases
They say you only remember 3 things from any presentation, so here they are
- Codd research paper - http://db.dobo.sk/wp-content/uploads/2015/11/Codd_1970_A_relational_model.pdf
(may not be a good link in the long run, but it's free)
- The Free Lunch is Over - http://www.gotw.ca/publications/concurrency-ddj.htm
- SEQUEL paper - https://dl.acm.org/citation.cfm?id=811515 (I couldn't find a free copy)
- http://forward.ucsd.edu/sqlpp.html (SQL++ is part of the FORWARD project)
- https://arxiv.org/abs/1405.3631 (paper hosted on arXiv, which is run by Cornell)
If anything looks interesting to you, you have questions or feedback, come talk to me afterwards
I want to hear from you!
My boss says I have to listen to you, it's my job. So now's your chance :)