SQL For Programmers is an introduction to SQL concepts, when SQL is a better choice, and a look at the future of databases. Presented April 27th, 2015 at Big Data Techcon Boston
2. 2
Data Is the New Middle Manager – WSJ April 20th
Startups are keeping head counts low, and even
eliminating management positions, by replacing them with
a surprising substitute for leaders and decision-makers:
data.
“Every time people come to me and ask for new bodies it
turns out so much of that can be answered by asking the
right questions of our data and getting that in front of the
decision-makers,” says James Reinhart, CEO of online
secondhand clothing store thredUP. “I think frankly it’s
eliminated four to five people who would [otherwise] pull
data and crunch it,” he adds.
3. 3
Data Is the New Middle Manager – WSJ April 20th
In the past, says Mr. Bien, companies were beset by
“data bread lines,” in which managers had all the data
they needed, but their staffers had to get in line to get
the information they needed to make decisions. In the
world of just a few years ago, in which databases were
massively expensive and “business intelligence” software
cost millions of dollars and could take months to install, it
made sense to limit access to these services to
managers. But no more.
The result isn’t really “big data,” just more data, more
readily available
4. 4
What We Will Cover
– SQL: the 5 W's
– NoSQL: the 5 W's
– When to pick one of the above over the other
– Data
What We Will Not Cover
– Installation and operations of a database
– Coding
– Data analysis, architecture, dashboards
8. 8
The Problem with Programmers
You are up to date on the latest version of your language
of choice
The latest version of Javascript – no problemo!
Frameworks – you know two or three or more – plus the
ones you wrote yourself
But only roughly 2-3% have had any training in Structured
Query Language (SQL)
9. 9
So what is SQL?!?!??!?!??!??!
http://en.wikipedia.org/wiki/SQL
SQL (/ˈɛs kjuː ˈɛl/ or /ˈsiːkwəl/; Structured Query
Language) is a special-purpose programming language
designed for managing data held in a relational database
management system (RDBMS), or for stream processing
in a relational data stream management system
(RDSMS).
Originally based upon relational algebra and tuple
relational calculus, SQL consists of a data definition
language and a data manipulation language. The
scope of SQL includes data insert, query, update and
delete, schema creation and modification, and data
access control.
12. 12
Relational algebra
http://en.wikipedia.org/wiki/Relational_algebra
Relational algebra is a family of algebras with a
well-founded semantics used for modelling the data stored in
relational databases, and defining queries on it.
To organize the data, redundant data and repeating
groups of data are first removed; this process is called
normalization. Its first step organizes the data into what
is called first normal form (1NF).
Typically a logical data model documents and
standardizes the relationships between data entities (with
its elements). A primary key uniquely identifies an
instance of an entity, also known as a record.
13. 13
Relational Algebra Continued
Once the data is normalized and in sets of data (entities
and tables), the main operations of the relational algebra
can be performed: the set operations (such as
union, intersection, and cartesian product), selection
(keeping only some rows of a table), and projection
(keeping only some columns). In SQL, selection is
expressed in the WHERE clause, projection in the SELECT
list, and one set of data is related to another through
JOIN conditions.
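The operations named above can be tried out in a few lines. This is a minimal sketch using Python's built-in sqlite3 module; the tables `a` and `b` and their rows are made up for illustration and are not from the talk.

```python
import sqlite3

# Hypothetical tables a and b with the same shape (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, name TEXT);
    CREATE TABLE b (id INTEGER, name TEXT);
    INSERT INTO a VALUES (1, 'Heather'), (2, 'Eli');
    INSERT INTO b VALUES (2, 'Eli'), (3, 'Oscar');
""")

# Selection: keep only some rows.
selection = conn.execute("SELECT id, name FROM a WHERE id = 1").fetchall()

# Projection: keep only some columns.
projection = conn.execute("SELECT name FROM a ORDER BY id").fetchall()

# Set operations on rows of the same shape.
union = conn.execute(
    "SELECT id, name FROM a UNION SELECT id, name FROM b ORDER BY id"
).fetchall()
intersection = conn.execute(
    "SELECT id, name FROM a INTERSECT SELECT id, name FROM b"
).fetchall()

# Cartesian product: every row of a paired with every row of b.
product = conn.execute("SELECT a.id, b.id FROM a CROSS JOIN b").fetchall()

print(selection, projection, union, intersection, len(product))
```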
16. 16
So Why was SQL Developed
Very Expensive to store data
SQL minimal redundancies
Ancient disks, memory, systems
Easy logic
AND, OR, XOR
Give me all the customers who paid off their balance
last month AND those customers with a balance of
less than $10
Procedural
17. 17
Data Architecture at the heart of SQL
Minimal duplications
Logical layout
Relations between tables
Data Normalization
18. 18
Database Normalization Forms
1nf
– No columns with repeated or similar data
– Each data item cannot be broken down further
– Each row is unique (has a primary key)
– Each field has a unique name
2nf
– Move non-key attributes that only depend on part of the
key to a new table
● Ignore tables with simple keys or no non-key attributes
3nf
– Move any non-key attributes that are more dependent
on other non-key attributes than the table key to a
new table.
● Ignore tables with zero or only one non-key attribute
19. 19
In more better English, por favor!
3NF means there are no transitive dependencies.
A transitive dependency is when two columnar
relationships imply another relationship. For example,
person -> phone# and phone# -> ringtone, so person ->
ringtone
– A → B
– It is not the case that B → A
– B → C
20. 20
And the rarely seen 4nf & 5nf
You can break the information down further but very rarely
do you need to go to 4nf or 5nf
21. 21
So why do all this normalization?
http://databases.about.com/od/specificproducts/a/normalization.htm
Normalization is the process of efficiently
organizing data in a database. There are two
goals of the normalization process: eliminating
redundant data (for example, storing the same
data in more than one table ) and ensuring
data dependencies make sense (only storing
related data in a table). Both of these are
worthy goals as they reduce the amount of
space a database consumes and ensure
that data is logically stored.
22. 22
Example – Cars
Name Gender Color Model
Heather F Blue Mustang
Heather F White Challenger
Eli M Blue F-type
Oscar M Blue 911
Dave M Blue Mustang
There is redundant information
across multiple rows but each
row is unique
23. 23
2nf – split into tables
Name Gender
Heather F
Eli M
Oscar M
Dave M
Color Model Owner
Blue Mustang Heather
White Challenger Heather
Blue F-type Eli
Blue 911 Oscar
Blue Mustang Dave
Split data into
two tables –
one for owner
data and one
for car data
24. 24
3nf – split owner and car info into different tables
Car_ID Color Model Owner_ID
1 Blue Mustang 1
2 White Challenger 1
3 Blue F-type 2
4 Blue 911 3
5 Blue Mustang 4
The owner info is
separated from the
car info. Note that
the car table has a
column for the
owner's ID from the
owner table.
Owner_ID Name Gender
1 Heather F
2 Eli M
3 Oscar M
4 Dave M
25. 25
But what if the Blue Mustang is shared? (4nf)
Owner_ID Name Gender
1 Heather F
2 Eli M
3 Oscar M
4 Dave M
Car_id Model Color
1 Mustang Blue
2 Challenger White
3 F-type Blue
4 911 Blue
Car_id Owner_id
1 1
2 1
3 2
4 3
1 4
Tables for Owner,
Car, & Ownership
data
Now we have a flexible way to
search data about owners, cars, and
their relations.
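The three tables above can be sketched with Python's built-in sqlite3 module. The rows come from the slide; the lowercase table names, column types, and key declarations are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Column types and constraints are assumed; the slide only shows the data.
    CREATE TABLE owner (owner_id INTEGER PRIMARY KEY, name TEXT, gender TEXT);
    CREATE TABLE car (car_id INTEGER PRIMARY KEY, model TEXT, color TEXT);
    CREATE TABLE ownership (
        car_id   INTEGER REFERENCES car(car_id),
        owner_id INTEGER REFERENCES owner(owner_id),
        PRIMARY KEY (car_id, owner_id)
    );
    INSERT INTO owner VALUES
        (1, 'Heather', 'F'), (2, 'Eli', 'M'), (3, 'Oscar', 'M'), (4, 'Dave', 'M');
    INSERT INTO car VALUES
        (1, 'Mustang', 'Blue'), (2, 'Challenger', 'White'),
        (3, 'F-type', 'Blue'), (4, '911', 'Blue');
    INSERT INTO ownership VALUES (1, 1), (2, 1), (3, 2), (4, 3), (1, 4);
""")

# Who owns (or shares) the Mustang? Each fact is stored once; the join
# reassembles it on demand.
mustang_owners = conn.execute("""
    SELECT owner.name
    FROM owner
    JOIN ownership ON ownership.owner_id = owner.owner_id
    JOIN car ON car.car_id = ownership.car_id
    WHERE car.model = 'Mustang'
    ORDER BY owner.name
""").fetchall()

print(mustang_owners)
```

Sharing a car is now a single extra row in the ownership table, with no change to the owner or car rows.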
26. 26
So now what!!!
By normalizing to 3nf (or 4th), we are storing the data with
no redundancies (or very, very few)
Now we need a way to define how the data is stored
And a way to manipulate it.
27. 27
SQL
SQL is a declarative language made up of
– DDL – Data Definition Language
– DML – Data Manipulation Language
SQL was one of the first commercial languages for Edgar
F. Codd's relational model, as described in his influential
1970 paper, "A Relational Model of Data for Large Shared
Data Banks." --Wikipedia
– Codd, Edgar F (June 1970). "A Relational Model of
Data for Large Shared Data Banks". Communications
of the ACM (Association for Computing Machinery)
13 (6): 377–87. doi:10.1145/362384.362685.
Retrieved 2007-06-09.
29. 29
SQL is declarative
Describe what you want, not how to process
Hard to tell whether a query is efficient just by looking at it
Optimizer picks GPS-like best route
– Can pick wrong – traffic, new construction, washed out
roads, and road kill! Oh my!!
You can not examine a
syntactically correct SQL query
by itself to determine if it is a
good query.
30. 30
SQL is made up of two parts
Data Definition Language (DDL)
– For defining data structures
● CREATE, DROP, ALTER, and
RENAME
Data Manipulation Language (DML)
For using data
● Used to SELECT, INSERT,
DELETE, and UPDATE data
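A minimal sketch of the DDL/DML split, run through Python's built-in sqlite3 module. The `customer` table and its rows are hypothetical. Note that SQLite spells the rename as ALTER TABLE ... RENAME TO; other servers may offer a separate RENAME statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define and change data structures.
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("ALTER TABLE customer ADD COLUMN balance REAL DEFAULT 0")
conn.execute("ALTER TABLE customer RENAME TO client")

# DML: work with the data held in those structures.
conn.execute("INSERT INTO client (name, balance) VALUES ('Heather', 25.0)")
conn.execute("INSERT INTO client (name, balance) VALUES ('Eli', 5.0)")
conn.execute("UPDATE client SET balance = balance + 10 WHERE name = 'Eli'")
conn.execute("DELETE FROM client WHERE name = 'Heather'")
rows = conn.execute("SELECT name, balance FROM client").fetchall()

# DDL again: remove the structure entirely.
conn.execute("DROP TABLE client")
remaining_tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()

print(rows, remaining_tables)
```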
32. 32
The stuff in the parentheses
CHAR(30) or VARCHAR(30) will hold strings up to 30
characters long.
– SQL MODE (more later) tells the server to truncate or
return an error if the value is longer than 30 characters
INT(5) tells the server to show five digits of data
DECIMAL(5,3) stores five digits with three decimals, i.e.
-99.999 to 99.999
FLOAT(7,4) -999.9999 to 999.9999
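The ranges quoted above follow from simple arithmetic: with p total digits and s of them after the decimal point, the largest value is (10^p - 1) / 10^s. The helper below is a hypothetical illustration of that arithmetic in Python, not a function of any database server.

```python
from decimal import Decimal


def digit_range(precision, scale):
    """Return (lowest, highest) for `precision` total digits, `scale` after the point."""
    top = Decimal(10 ** precision - 1) / Decimal(10 ** scale)
    return -top, top


decimal_5_3 = digit_range(5, 3)   # the DECIMAL(5,3) case
float_7_4 = digit_range(7, 4)     # the FLOAT(7,4) case
print(decimal_5_3, float_7_4)
```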
34. 34
NULL No Value
Null is used to indicate a lack of value or no data
– Gender : Male, Female, NULL
Nulls are very messy in B-tree Indexing, try to avoid
Math with NULLs is best avoided
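A short sketch of why NULLs call for care, using Python's built-in sqlite3 module (NULL comes back as Python's None). The `person` table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, gender TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [("Heather", "F", 30), ("Eli", "M", None), ("Sam", None, 40)],
)

# Arithmetic and comparisons involving NULL yield NULL, not a number or TRUE.
null_plus_one, null_equals_null = conn.execute(
    "SELECT NULL + 1, NULL = NULL"
).fetchone()

# Aggregates skip NULLs: COUNT(age) and AVG(age) ignore Eli's missing age.
total_rows, known_ages, avg_age = conn.execute(
    "SELECT COUNT(*), COUNT(age), AVG(age) FROM person"
).fetchone()

# A row whose gender is NULL matches neither = 'M' nor <> 'M'.
not_male = conn.execute("SELECT name FROM person WHERE gender <> 'M'").fetchall()

print(null_plus_one, null_equals_null, total_rows, known_ages, avg_age, not_male)
```

Sam is missing from the last result even though Sam is not recorded as 'M'; finding such rows needs an explicit IS NULL test.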
35. 35
DESC City in detail
Describe table tells us the names of the columns (Fields),
the data type, if the column is NULLABLE, Keys, any
default value, and Extras.
40. 40
Join two tables
To get a query that provides the names of the cities and the
names of their countries, JOIN the two tables on a column
the two tables have in common (columns that are hopefully
indexed!)
42. 42
Simple JOIN – Data from more than one table
Select City.Name, Country.Name, City.Population
FROM City
LEFT JOIN Country ON
(City.CountryCode = Country.Code)
LIMIT 7
For every City in the City table
Print the
City Name
The Matching Country Name from Country table
The City Population
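The query above can be run against a toy copy of the two tables with Python's built-in sqlite3 module. The rows, the population figures, and the unmatched country code are invented for this sketch; an ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Illustrative rows only; not the data used in the talk.
    CREATE TABLE Country (Code TEXT PRIMARY KEY, Name TEXT);
    CREATE TABLE City (
        ID INTEGER PRIMARY KEY, Name TEXT, CountryCode TEXT, Population INTEGER
    );
    INSERT INTO Country VALUES ('USA', 'United States'), ('FRA', 'France');
    INSERT INTO City VALUES
        (1, 'Boston', 'USA', 600000),
        (2, 'Paris', 'FRA', 2000000),
        (3, 'Atlantis', 'XXX', 0);
""")

rows = conn.execute("""
    SELECT City.Name, Country.Name, City.Population
    FROM City
    LEFT JOIN Country ON (City.CountryCode = Country.Code)
    ORDER BY City.ID
    LIMIT 7
""").fetchall()

for city, country, population in rows:
    print(city, country, population)
```

Because this is a LEFT JOIN, the city with no matching country still appears, with NULL (None in Python) for the country name.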
43. 43
The Optimizer has to figure out
Select City.Name, Country.Name, City.Population
FROM City
LEFT JOIN Country ON
(City.CountryCode = Country.Code)
LIMIT 7
Permission to access database/tables/columns
Six options for getting data (3 factorial)
And process the limit of 7 records
44. 44
Simple join
Both City and Country
have columns that
can be used for JOINs,
a RELATION!
– Country.Code
– City.CountryCode
45. 45
Transactions – ACID
Transactions are often
needed to ensure the
quality of data
Bank Account
Example
The ability to roll back
actions
Set check points and
roll back to these
intermediate points
ACID
Atomicity
Consistency
Isolation
Durability
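The bank account example and the intermediate roll-back points can be sketched with Python's built-in sqlite3 module. The `account` table, the balances, and the `transfer` helper are assumptions for this illustration; the CHECK constraint stands in for the bank's rule that a balance may not go negative.

```python
import sqlite3

# isolation_level=None: the module stays out of the way and we issue
# BEGIN / COMMIT / ROLLBACK ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES ('checking', 100), ('savings', 50)")


def transfer(amount):
    """Move money from checking to savings; both updates happen or neither does."""
    conn.execute("BEGIN")
    try:
        conn.execute(
            "UPDATE account SET balance = balance + ? WHERE name = 'savings'",
            (amount,),
        )
        conn.execute(
            "UPDATE account SET balance = balance - ? WHERE name = 'checking'",
            (amount,),
        )
        conn.execute("COMMIT")
        return True
    except sqlite3.IntegrityError:
        # The overdraft violated the CHECK constraint: undo the credit too.
        conn.execute("ROLLBACK")
        return False


first = transfer(30)     # succeeds
second = transfer(500)   # would overdraw checking, so it is rolled back

# Intermediate points: roll back to a savepoint, keep the earlier work.
conn.execute("BEGIN")
conn.execute("UPDATE account SET balance = balance - 10 WHERE name = 'checking'")
conn.execute("SAVEPOINT before_fee")
conn.execute("UPDATE account SET balance = balance - 20 WHERE name = 'checking'")
conn.execute("ROLLBACK TO before_fee")
conn.execute("COMMIT")

balances = dict(conn.execute("SELECT name, balance FROM account"))
print(first, second, balances)
```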
46. 46
What happens when you send a query?
Server receives the query
The user is authenticated for permissions
– Database, table, and/or column level
Syntax
Optimizer
– Statistics on data
– Cost model
● Pick cheapest option (DISK I/O)
– Changing in the future
Get the data
Sorting/Grouping/etc
Data returned
50. 50
Indexes in a Nutshell
Indexes provide direct
access to the desired record
Primary Key
Moose example
Cardinality is the measure
of number of variants in
indexed column(s)
Higher the better
General rules
Keep as short as
possible
Index any column
used for joins
Overhead
Inserts, updates,
deletes need
processing
Maintenance
Too many is worse than
too few
Can be used to 'cover'
multiple columns
Can sometimes 'drag'
result with index
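One way to see an index at work is to ask the server for its plan before and after creating the index. The sketch below uses SQLite through Python's built-in sqlite3 module; the `customer` table, its data, and the index name are hypothetical, and the plan wording is SQLite's own and differs on other servers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, state TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO customer (state, name) VALUES (?, ?)",
    [(("MA", "TX", "CA")[i % 3], f"customer{i}") for i in range(300)],
)


def plan(sql):
    """Join the human-readable detail column of SQLite's EXPLAIN QUERY PLAN."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


query = "SELECT name FROM customer WHERE state = 'MA'"

before = plan(query)  # no index on state yet: the whole table is scanned
conn.execute("CREATE INDEX idx_customer_state ON customer (state)")
after = plan(query)   # the plan now names the index

# Cardinality: the number of distinct values in the indexed column.
cardinality = conn.execute("SELECT COUNT(DISTINCT state) FROM customer").fetchone()[0]

print(before)
print(after)
print(cardinality)
```

With only three distinct states the cardinality here is low, which is the kind of column where an index buys the least.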
53. 53
Where SQL Gets Hard to Optimize
Goal:
A list of the top 100
customers (from 10
million) by sales
'All I want is 100
rows, why does the
server look at all
customers?'
It has to perform a full
table scan before it can
sort the top 100!
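A small sketch of the top-100 query using SQLite through Python's built-in sqlite3 module. It assumes, purely for illustration, that sales is a stored column on a hypothetical `customer` table; if sales must first be summed from order rows, an index like this one would not apply directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, sales INTEGER)")
# 1,000 made-up customers whose sales are the values 0..999 in shuffled order.
conn.executemany(
    "INSERT INTO customer (id, sales) VALUES (?, ?)",
    [(i, (i * 37) % 1000) for i in range(1, 1001)],
)


def plan(sql):
    """Join the detail column of SQLite's EXPLAIN QUERY PLAN output."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


top_query = "SELECT id, sales FROM customer ORDER BY sales DESC LIMIT 100"

# Without an index the server reads every row and sorts before applying LIMIT.
plan_without_index = plan(top_query)

# With an index on the sort column it can walk the index and stop early.
conn.execute("CREATE INDEX idx_customer_sales ON customer (sales)")
plan_with_index = plan(top_query)

top = conn.execute(top_query).fetchall()
print(plan_without_index)
print(plan_with_index)
print(len(top), top[0][1], top[-1][1])
```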
55. 55
N+1 Problem
Goal:
List of all customers
with positive
account balances
and their credit
ratings
Usually the query is
written to get all
customers with
balances and then
separate queries to
each credit rating
→ each customer
query also
generates three
more queries
Better query
Reduce the number
of queries
Data Architecture
Keep credit data in
customer record
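The difference between the N+1 pattern and a single better query can be counted directly. This sketch uses Python's built-in sqlite3 module; the tables, rows, and the `run` helper that counts statements are all hypothetical, and it issues one follow-up query per customer rather than the three mentioned on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, balance INTEGER);
    CREATE TABLE credit_rating (customer_id INTEGER, agency TEXT, score INTEGER);
    INSERT INTO customer VALUES (1, 'Heather', 120), (2, 'Eli', 0), (3, 'Oscar', 45);
    INSERT INTO credit_rating VALUES (1, 'A', 700), (1, 'B', 710), (3, 'A', 650);
""")

counter = {"queries": 0}


def run(sql, params=()):
    """Execute a query and count it, so the two approaches can be compared."""
    counter["queries"] += 1
    return conn.execute(sql, params).fetchall()


# N+1: one query for the customers, then one more per customer.
n_plus_one = []
for customer_id, name in run(
    "SELECT id, name FROM customer WHERE balance > 0 ORDER BY id"
):
    for (score,) in run(
        "SELECT score FROM credit_rating WHERE customer_id = ? ORDER BY agency",
        (customer_id,),
    ):
        n_plus_one.append((name, score))
n_plus_one_queries = counter["queries"]

# Better query: let the server relate the two tables in a single statement.
counter["queries"] = 0
joined = run("""
    SELECT customer.name, credit_rating.score
    FROM customer
    JOIN credit_rating ON credit_rating.customer_id = customer.id
    WHERE customer.balance > 0
    ORDER BY customer.id, credit_rating.agency
""")
join_queries = counter["queries"]

print(n_plus_one_queries, join_queries, joined)
```

Both approaches return the same rows; the first costs a round trip per customer, the second a single round trip regardless of how many customers qualify.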
57. 57
Ordering/Sorting → temporary tables
Goal:
List of all customers
ordered by state,
by zip code, area-
code, last name,
first name, and
customer id
number
All the sorting takes time,
memory, maybe temporary
disk space
Indexes on data may help
58. 58
Object-Relational Mapper (ORM)
ORMs try to map SQL, a declarative language, to Objects.
May produce less than optimal SQL
object-relational impedance mismatch
Extra layer to maintain/debug/support
Most use Prepared Statements to facilitate queries
Often easier to write SQL from the beginning
59. 59
NoSQL
So if SQL is so great, why
is there NoSQL???
Not all data is relational
Friend of Friend of
your friends
'Facebook query'
Graphical data
Structure of data
Document databases
No Structure in the data
May have no idea of
data structure but
want to capture all
of it
Amorphous data
60. 60
http://en.wikipedia.org/wiki/NoSQL
A NoSQL (often interpreted as Not only SQL) database
provides a mechanism for storage and retrieval of
data that is modeled in means other than the tabular
relations used in relational databases. Motivations for
this approach include simplicity of design, horizontal
scaling, and finer control over availability. The data
structures used by NoSQL databases (e.g. key-value,
graph, or document) differ from those used in relational
databases, making some operations faster in NoSQL
and others faster in relational databases. The particular
suitability of a given NoSQL database depends on
the problem it must solve.
61. 61
http://en.wikipedia.org/wiki/NoSQL
NoSQL databases are increasingly used in
big data and real-time web applications.
NoSQL systems are also called "Not only
SQL" to emphasize that they may also
support SQL-like query languages. Many
NoSQL stores compromise consistency (in
the sense of the CAP theorem) in favor of
availability and partition tolerance. Barriers to
the greater adoption of NoSQL stores include
the use of low-level query languages, the lack
of standardized interfaces, and huge
investments in existing SQL. Most NoSQL
stores lack true ACID transactions
62. 62
http://en.wikipedia.org/wiki/Big_data
Big data is a broad term for data sets
so large or complex that traditional
data processing applications are
inadequate. Challenges include
analysis, capture, curation, search,
sharing, storage, transfer, visualization,
and information privacy. The term often
refers simply to the use of predictive
analytics or other certain advanced
methods to extract value from data, and
seldom to a particular size of data set.
63. 63
http://en.wikipedia.org/wiki/Big_data 2
Big data usually includes data sets with
sizes beyond the ability of commonly used
software tools to capture, curate, manage,
and process data within a tolerable
elapsed time. Big data "size" is a constantly
moving target, as of 2012 ranging from a few
dozen terabytes to many petabytes of data.
Big data is a set of techniques and
technologies that require new forms of
integration to uncover large hidden values
from large datasets that are diverse, complex,
and of a massive scale
64. 64
Is NoSQL or Big Data New?
NoSQL
No:
Key/Value pair
BDB
Hash
Yes:
Map/Reduce
Graphical
Big Data
No
Data
Warehouses
Large data sets
Yes
'Never throw
any data
away'
Increased data
feeds
“better”
analytics
65. 65
When Are You At The Limits of SQL?
Working Set No Longer Fits in Memory
Drinking From 'Fire Hose'
Not really OLTP Work Load
Possible Data Warehouse Alternative
Analysis Tools Do Not Need Transactions
Speed, lack thereof …
Disk Space Reaching Maximum
Segregation of Data
Transactions needed versus No Transaction
Security/SOX/Regulatory needs
Project Needs
Exploration
66. 66
NoSQL Database Types
Key-Value
Data stored as a blob
Riak, Memcached,
Redis, Berkeley
DB, Couchbase …
Document
Data stored as XML,
JSON, BSON; Self
describing tree
structure
Mongo, Couchbase
Column
Stores data vertically,
easy to compress
Cassandra, Hbase,
Hypertable …
Sometimes available
in RDBMS
Graph
Stores Relations
between entities
67. 67
You have your choice of hammers ...
What ARE you trying to
solve??
Fully Describe
Problem
Feeds
Outputs
Processes
Identify Constraints
Time
$
Staff
Sanity
Political
Intangibles
71. 71
http://en.wikipedia.org/wiki/Apache_Hadoop
Framework written in Java
Distributed Storage on
commodity hardware for
distributed processing
HDFS (Hadoop Distributed
File System)
Data split into blocks,
distributed, blocks of data
processed in parallel (fast)
Lots of components,
please start with Apache
Bigtop for first time users
(pieces will play nicely with
each other)
Bigtop.Apache.Org
72. 72
Map/Reduce http://en.wikipedia.org/wiki/MapReduce
MapReduce is a programming model for processing and
generating large data sets with a parallel, distributed
algorithm on a cluster …
A MapReduce program is composed of a Map()
procedure that performs filtering and sorting (such as
sorting students by first name into queues, one queue
for each name) and a Reduce() procedure that performs
a summary operation (such as counting the number of
students in each queue, yielding name frequencies).
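The student example in the quote can be mimicked in a single process. This is only a sketch of the programming model in plain Python, with made-up student names; a real MapReduce framework would run the map and reduce steps in parallel across a cluster.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical input records.
students = ["Ann Lee", "Bob Ray", "Ann Kim", "Cy Fox", "Bob Orr", "Ann Poe"]


def map_phase(records):
    """Map: sort students into queues, one queue per first name."""
    queues = defaultdict(list)
    for full_name in records:
        first_name = full_name.split()[0]
        queues[first_name].append(full_name)
    return queues


def reduce_phase(queues):
    """Reduce: summarize each queue, here by counting its members."""
    return {
        first_name: reduce(lambda count, _student: count + 1, queue, 0)
        for first_name, queue in queues.items()
    }


name_frequencies = reduce_phase(map_phase(students))
print(name_frequencies)
```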
74. 74
http://en.wikipedia.org/wiki/Graph_database
Graph databases are based on graph
theory. Graph databases employ
nodes, properties, and edges.
Nodes represent entities
such as people,
businesses, accounts.
Properties are pertinent
information that relate to
nodes.
Edges are the lines that
connect nodes to nodes
or nodes to properties and
they represent the
relationship between the
two. Most of the important
information is really stored
in the edges. Meaningful
patterns emerge when
one examines the
connections and
interconnections of nodes,
properties, and edges.
75. 75
Neo4j http://en.wikipedia.org/wiki/Neo4j
Neo4j is an open-source graph database, implemented in
Java. The developers describe Neo4j as "embedded,
disk-based, fully transactional Java persistence engine
that stores data structured in graphs rather than in tables".
Neo4j is the most popular graph database.
Great for studying relationships
Six Degrees of Kevin Bacon like questions
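A "degrees of separation" question is a shortest-path search over edges. The sketch below is plain Python with a breadth-first search over an invented set of links; it illustrates the kind of question a graph database answers and does not use Neo4j or its query language.

```python
from collections import deque

# Hypothetical "worked together" edges between people.
edges = [("Kevin", "Ann"), ("Ann", "Bob"), ("Bob", "Cy"), ("Kevin", "Dee")]

# Build an undirected adjacency map: node -> set of neighbouring nodes.
graph = {}
for left, right in edges:
    graph.setdefault(left, set()).add(right)
    graph.setdefault(right, set()).add(left)


def degrees(start, goal):
    """Breadth-first search: number of edges on the shortest path, or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None  # no chain of relationships connects the two


print(degrees("Kevin", "Cy"), degrees("Kevin", "Dee"), degrees("Kevin", "Zed"))
```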
77. 77
Document Databases -
http://en.wikipedia.org/wiki/MongoDB
MongoDB is one of many cross-platform document-
oriented databases. Classified as a NoSQL database,
MongoDB eschews the traditional table-based relational
database structure in favor of JSON-like documents with
dynamic schemas (MongoDB calls the format BSON),
making the integration of data in certain types of
applications easier and faster
Instead of taking a business subject and breaking it up
into multiple relational structures, MongoDB can store the
business subject in the minimal number of documents. For
example, instead of storing title and author information in
two distinct relational structures, title, author, and other
title-related information can all be stored in a single
document called Book, which is much more intuitive and
usually easier to work with.
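The Book example might look like the following as a JSON-style document. This sketch uses only Python's standard json module, not a MongoDB driver, and every field name and value in it is invented for illustration.

```python
import json

# One self-describing document holds the title, its authors, and related
# details that a relational design would spread across several tables.
book = {
    "title": "An Example Book",
    "authors": [
        {"name": "A. Writer", "role": "author"},
        {"name": "E. Ditor", "role": "editor"},
    ],
    "published": 2015,
    "tags": ["sql", "nosql"],
}

# A second document in the same collection may carry different fields;
# nothing forces the two to share a schema.
pamphlet = {"title": "An Example Pamphlet", "pages": 12}

collection = [book, pamphlet]
serialized = json.dumps(collection)
restored = json.loads(serialized)

author_names = [author["name"] for author in restored[0]["authors"]]
print(author_names)
```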
79. 79
Vertical Data
Compress the heck out of columns
Very quick to read
Great for low cardinality data
Fantastic for analytic queries, i.e. MIN, MAX, AVG...
May react like more conventional RDBMS and be more
'comfortable'
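A rough sketch of why vertical storage suits low-cardinality data and analytic queries, in plain Python with made-up rows. Run-length encoding is used here as one simple compression scheme; actual column stores use a variety of encodings.

```python
from itertools import groupby

# Row-oriented view: each tuple is (state, amount). Illustrative data only.
rows = [("MA", 10), ("MA", 20), ("MA", 5), ("TX", 7), ("TX", 3)]

# Column-oriented view: one list per column.
columns = {
    "state": [state for state, _ in rows],
    "amount": [amount for _, amount in rows],
}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


# Few distinct values means long runs, so the column compresses well.
state_rle = run_length_encode(columns["state"])

# An analytic query touches only the column it needs.
amounts = columns["amount"]
summary = (min(amounts), max(amounts), sum(amounts) / len(amounts))

print(state_rle, summary)
```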
82. 82
Hybrids – Best of Both
MySQL NoSQL plug-in for InnoDB
JSON data manipulation
PostgreSQL
MySQL
SQL on top of Hadoop
Columnar Databases from RDBMS Vendors
83. 83
LinkedIn's Espresso
Espresso provides a hierarchical data model. The hierarchy is
database->table->collection->document. Conceptually, databases and tables are exactly the same as in any RDBMS.
Database and table schemas are defined in JSON. Document schemas are defined in Avro.
84. 84
Functional Programming As An Alternative
Functional programming is a programming paradigm, a
style of building the structure and elements of computer
programs, that treats computation as the evaluation of
mathematical functions and avoids changing-state and
mutable data. It is a declarative programming paradigm,
which means programming is done with expressions. In
functional code, the output value of a function depends
only on the arguments that are input to the function, so
calling a function f twice with the same value for an
argument x will produce the same result f(x) each time.
Eliminating side effects
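The contrast between a pure function and one with a side effect can be shown in a few lines of Python. Both functions and the shared `state` dictionary are hypothetical examples written for this sketch.

```python
def f(x):
    """Pure: the result depends only on the argument."""
    return x * x + 1


state = {"total": 0}


def add_to_total(x):
    """Impure: reads and changes state outside the function."""
    state["total"] += x
    return state["total"]


pure_results = (f(3), f(3))                          # same input, same output
impure_results = (add_to_total(2), add_to_total(2))  # same input, different output

print(pure_results, impure_results)
```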
88. 88
Summary & Q&A
Define WHAT you are trying to accomplish
Set goals
Performance
Functionality
Skills
Do a lot of reading/testing/questioning
David.Stokes@Oracle.com @Stoker
Slideshare.net/davestokes or conference web site