Unstructure
:: smashing the boundaries of data ::
2014-03-07 - SxSWi Workshop
Ian Varley - @thefutureian
Chapter 0: Intro & Logistics
Hi. I’m Ian Varley.
I live in Austin, TX.
I work for Salesforce.com, doing data engineering.
(Note: this presentation is entirely my own work and opinions, and doesn’t imply anything about Salesforce’s products.)
→ @thefutureian, ianvarley.com
About me:
- BA in Philosophy
- MS in Software Engineering
- 15+ years database experience
Not really an authority on data structure, but
"You teach what you want to learn".
Why are we here?
To grok the structure of data,
and then smash it.
Some logistics:
- 150 minutes, no breaks.
(Feel free to get up, use the facilities, or leave if you're bored.)
- This will be dense.
(Lots to cover, so we'll move fast.)
- But! Do interrupt at any time with questions.
(If you’re lost, you’re not the only one.)
All materials are available:
- This presentation:
http://tiny.cc/unstructure-sxsw14-slides
- Live notes:
http://tiny.cc/unstructure-sxsw14-notes
- Code & samples:
https://github.com/ivarley/unstructure-sxsw14
There’s some setup required,
but we won’t need it right away.
Probably best if everyone starts trying to
download & install stuff now, and do it in the
background as I’m talking.
Setup, part 1: git
- Download & Install Git
http://git-scm.com/book/en/Getting-Started-Installing-Git
- Clone my repo:
$ cd ~
$ git clone https://github.com/ivarley/unstructure-sxsw14.git
Setup, part 2: heroku
- Download & Install Heroku Toolbelt
https://toolbelt.heroku.com/
- Create Heroku Account
https://id.heroku.com/signup
Setup, part 3: CouchDB
- Download & Install CouchDB
http://couchdb.apache.org/
Setup, part 4: miscellaneous
- Google Chrome
https://www.google.com/intl/en/chrome/browser/
- JSONView Plugin
http://goo.gl/K07fFs
Quick Survey:
- occupation: coders? designers? other?
- technical skill: low / medium / high
- know what a relational database is?
- know SQL?
- know what NoSQL means?
- have used a NoSQL database?
- have read Aristotle? :)
There’s a spectrum in an audience like this ...
The Hacker: Wants
hands on, doesn’t care
about theory
The Academic: Wants
heady concepts, not
comfortable with code.
What I’m aiming for:
The “Hackademic”: Wants enough theory
to be grounded, and enough hacking to
know when something is bullshit.
Where we’re headed:
Chapter 1 - Warming Up
Chapter 2 - Hierarchy
Chapter 3 - Relation
Chapter 4 - Mutation
Chapter 5 - Conclusion
Any questions before we get started?
Chapter 1: Warming Up
Talking about data can be a little …
dry.
So, we’re going to use an example that
most people can relate to easily.
Music.
Photo credit: Josh Haner/The New York Times
Why music?
● It's a domain we all know about.
○ But, informally (not usually for work or study)
● Lots of meaty concepts to think about:
○ Recording, Performances, Compositions, Bands, Venues, etc ...
● There are lots of music sites with data APIs:
○ Do512, EchoNest, Songkick, Sched.org, MusicBrainz, 7Digital, etc.
● It's on everyone's mind during SxSW.
● Also, I'm a musician and I felt like it.
Running Example:
● listen.up - Internet company for “all things music”.
● Duh, this is fake, there’s no “.up” TLD
● What do we do? Everything!
● Including:
● Recorded music catalogs, streaming, purchase...
● Live music performance, booking & tickets...
● Licensing, royalties, compositions, lyrics …
● Instruments, lessons, repairs, classifieds …
● Anything else you can think of.
Is this a good business model? No, but who cares!
You are my team of
professional ontologists,
you’re going to figure out
what music data is out there
in the world for us to store.
(I’ll pay you in stock. It’ll be worth a fortune, trust me.)
Normally, this is where we might
do a big group brainstorm.
But it turns out those don’t work.
In fact, they hurt more than they help.
(https://en.wikipedia.org/wiki/Brainstorming#Challenges_to_Effective_Brainstorming)
So, we’ll do 3 steps:
1. Take 60 seconds and type as many music-related
concepts as you can.
- concepts, not instances (e.g. "band", not "Radiohead"; "composer",
not "Beethoven"; "instrument", not "guitar")
- make them singular ("band", not "bands")
- not sure if it’s music-related? put it anyway.
2. Dump them into this google doc:
http://tiny.cc/unstructure-sxsw14-terms
3. I’ll lowercase, dedupe, and publish here.
Go! I’ll add a few prompts in case you are getting stuck ...
Now, your job is to make some sense of this.
1. Break into groups of ~3 neighbors.
a. Introduce yourselves like civilized human beings.
2. Organize this list however you want to!
a. Group things together
b. Indent things
c. Draw lines in a drawing program
d. etc.
Go! We’ll take about 10 minutes for this.
Let’s discuss a few sample things people did.
(Goal is to "sample", not for everyone to share! We don’t have all day.)
● How did you sort or group the terms?
● Did you end up with a flat list, or hierarchy?
● Did anything not fit in?
● Any higher level organization of terms?
(Note: there’s no right answer here ... yet.)
Good job.
listen.up is off to a great start.
Now it’s time to get into the meat.
What is structure? What is data?
This is a story in three parts:
Hierarchy
Relation
Mutation
These are 3 successive viewpoints that
will gradually open our eyes to the
deep structure of data.
“Structure? Isn’t this workshop supposed
to be about unstructured data?”
Poppycock. You want to see some
real unstructured data?
What most people mean when they say
unstructured data is:
“flexibly structured data”
Or possibly:
“data we don’t know the structure of yet”
(We’ll get to both of those; hold your horses.)
There are lots of boundaries to
smash in the world of data.
But we have to learn to structure
before we can unstructure.
Image credit: Rodrigo Diaz Aravena
Our minds are very fluid, and can
connect concepts in subtle ways.
Our computers aren't. They need concrete
instructions to structure & connect data.
The next three chapters will be a deep
dive into how concepts are combined to
let us represent the world in computers.
This is usually called “modeling”.
Modeling gets a bad rap.
Models cut away the accidental
and leave the essential.
Model != diagram, drawing
Model == Skeleton, Essence, Abstraction
Most of the time, it’s good to remember:
“the map is not the territory”.
- Alfred Korzybski
But for the next couple hours, for us, the
map is precisely the territory.
We are data cartographers.
So let’s dive in!
photo credit: Michael Mundy
Chapter 2: Hierarchy
We all have a pretty good intuitive
understanding of “data”. What’s yours?
data = plural of datum
datum = Latin for “given”
so ... data is “givens”?
aka “facts”?
What’s the simplest fact? A bit:
1 / 0
on / off
yes / no
true / false
“Are the lights on in this room?”
When you want more complex facts, you
chunk together simpler ones.
In one dimension, that's a "list":
● Byte = list of bits → 01000010 = 66 = “B”
● Word = list of characters → [B,e,a,t,l,e,s]
● Phrase = list of words -> “The Beatles are a band”
● And so forth ...
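Here's a quick sketch of that chunking in Python, just to make it concrete. The variable names are mine, and the only assumption is ASCII as the character encoding.

```python
# A byte is a list of eight bits. Read as a binary number it is 66,
# and in ASCII the number 66 stands for the letter "B".
bits = [0, 1, 0, 0, 0, 0, 1, 0]

value = 0
for bit in bits:
    value = value * 2 + bit  # shift what we have left, then add the next bit

letter = chr(value)

# A word is a list of characters, and a phrase is a list of words.
word = list("Beatles")
phrase = "The Beatles are a band".split()

print(value, letter)
print(word)
print(phrase)
```

Same trick at every level: a list of simpler things becomes one more complex thing.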
Of course, a 1-dimensional list
is just one (very simplistic) way
to chunk things together.
Photo credit: http://thekingoflimbspart2.com/radiohead-setlists/radiohead-2012-setlists/radiohead-setlist-houston-texas-3032012/
So you just chunk datums together,
and you get information,
knowledge, wisdom … !?
Photo By Karora (Own work) [Public
domain], via Wikimedia Commons
No. It’s not quite that easy.
f∆ƒ˙73f˚=£ƒ••XMbritneysp3ars-giraffe
is a complex structure, but it lacks
something: “meaning”
Getting a little more haughty,
we could say that the “givens” of
structured data are really
pointers to concepts.
Without at least some concept,
it’s not data: it’s noise.
But that raises all kinds of questions.
What are concepts?
What can we say about them?
What kinds of concepts are there?
What’s the difference between a
concept and the thing it points at?
Who could answer such questions?
When the epistemological going gets tough,
the tough call ...
Aristotle, 384–322 BC
Aristotle wrote a treatise called
The Praedicamenta
(The Categories)
It’s not clear exactly what he was
categorizing (he didn’t say), but the list
stands to this day as a pretty damn
sensible way to, well, categorize.
Aristotle’s 10 "categories":
substance      Stuff, essence; matter, but also universal concepts
quantity       How much? How many?
quality        What kind? Of what nature?
relation       More, less, double, half, stronger, weaker, etc.
place          Where?
time           When?
position       Being situated on, in, next to, sitting, touching, etc.
having         Possession, state like “clothed” or “armed”
causing        What did it do, make happen?
being caused   What happened to it, what did it undergo?
You don’t have to agree with Aristotle’s
categories (you’d be in good company).
(But you’re also unlikely to have a sudden inspiration about it
that hasn’t already been the subject of 12 papers and a dissertation.)
But having some version of Aristotle’s list
is hard to argue with, experientially. We
sort the world into a hierarchy of
concepts; everything in its right place.
Without concepts,
we don’t have data.
We just have noise.
Of course, you can’t just store a concept
on a disk, or send it over a network.
So, not long after we had computing
machines, folks set to work figuring out
how to map and store our concepts in
the unforgiving realm of silicon.
Let’s take a trip back in time.
Photo By NASA Ames Research Center (NASA-ARC) (NIX A-
28284) [Public domain], via Wikimedia Commons
The year is 1966.
NASA is working on Saturn V and Apollo rockets,
and they can’t figure out how to store this
ginormous bill of materials. They ask:
Could these new “computers” help?
IBM: “Hey NASA! We made a
system for you to manage
information!”
NASA: “Groovy! What’s it called?”
IBM: …
...
“Information Management System.”
Thus began the not-extremely-exciting era of
hierarchical databases.
● Data is stored in records, which can have sub-records.
● There's a single strict hierarchical arrangement.
● To access data, you need to know the hierarchy.
For example:
Show             Venue           Date
House Of Vans    The Mohawk      3/13/14
Chaos in Tejas   Iron And Lace   3/14/14

Show            Band         Time
House Of Vans   Charli XCX   4:15
House Of Vans   Pusha T      5:00

Band         Song         Order
Charli XCX   You          1
Charli XCX   Super Love   2
To get to the set list, you have to
navigate through the show, to the
band, to the song.
And, you only get to choose one
hierarchy to store things in.
That sounds arbitrarily restrictive.
Why did they make it like that?
To understand,
we have to
talk about
dimensions.
We already talked about zero dimensions:
point = bit = on/off = true/false
And about one dimension:
line = list = array
It’s pretty straightforward to see
how you’d represent these,
with a series of zeros and ones.
What about 2 dimensions?
It’s a plane! (aka table, grid, matrix, spreadsheet, etc)
Tabular data is everywhere.
You're certainly familiar with the
world's most ubiquitous 2-
dimensional data tool ...
X dimension
Y dimension
Excel!
Photo credit: http://decentralist.wordpress.com/2012/10/01/libreoffice-vs-openoffice-not-always-simple/
(Or, its successor,
Google Spreadsheets,
which is way better, IMHO.)
((Especially in combination with Google Forms, which make it extremely useful.))
Spreadsheets are totally flexible.
This is a blessing and a curse.
They can be used well ...
Or poorly ...
(Fun read: http://www.epmchannel.com/2013/02/22/is-excel-the-most-dangerous-piece-of-software-in-the-world/)
Or awesomely ...
Image credit: http://gadgetose.com/excel-stop-motion-music-video/
But you get the point. 2-dimensional data is
everywhere.
2-dimensional data
doesn’t always look
like a table or grid.
We just mean that it’s
“conceptually planar”: two axes,
each w/ a set coordinate system.
(Here, rows = “web results”, and
columns = “link name”, “url”,
“description”, and “image”. No link
has two URLs, for example.)
However! We are now faced with a choice,
because we can still only actually store things
in linear form (a single stream of bits).
So do we put rows inside columns,
or columns inside rows?
To linearize the two dimensions in
a table, I can either ...
<table>
<row>
<col>Creep</col>
<col>1993</col>
<col>Pablo Honey</col>
</row>
<row>
<col>No Surprises</col>
<col>1997</col>
<col>OK Computer</col>
</row>
<row>
<col>Lucky</col>
<col>1997</col>
<col>OK Computer</col>
</row>
<row>
<col>Karma Police</col>
...
</row>
</table>
Go row-wise ...
<table>
<col>
<row>Creep</row>
<row>No Surprises</row>
<row>Lucky</row>
<row>Karma Police</row>
<row>Fake Plastic Trees</row>
</col>
<col>
<row>1993</row>
<row>1997</row>
<row>1997</row>
<row>1997</row>
<row>1995</row>
</col>
<col>
<row>Pablo Honey</row>
...
</col>
</table>
Or column wise ...
But I can’t have it both ways.
(Unless I store it twice.)
Because we read left-to-right and top-to-
bottom, most of our systems store tables that
way too (row-wise).
But it’s not mandatory, of course.
Whichever way you choose, you can’t
get around the fact that you have to
choose an ordering of dimensions.
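To make the row-wise vs. column-wise choice concrete, here's a minimal Python sketch. It flattens the same small table both ways and shows that finding a cell depends on which ordering you picked. The helper names (`cell_row_major`, `cell_col_major`) are made up for this illustration.

```python
rows = [
    ("Creep", 1993, "Pablo Honey"),
    ("No Surprises", 1997, "OK Computer"),
    ("Lucky", 1997, "OK Computer"),
]
n_rows, n_cols = len(rows), len(rows[0])

# Row-wise: walk each row, emitting its columns one after another.
row_major = [cell for row in rows for cell in row]

# Column-wise: walk each column, emitting its rows one after another.
col_major = [row[c] for c in range(n_cols) for row in rows]

def cell_row_major(r, c):
    """Find cell (r, c) in the row-wise stream."""
    return row_major[r * n_cols + c]

def cell_col_major(r, c):
    """Find cell (r, c) in the column-wise stream."""
    return col_major[c * n_rows + r]

print(row_major)
print(col_major)
print(cell_row_major(1, 2), "==", cell_col_major(1, 2))
```

Both streams hold the same nine values; only the order (and therefore the arithmetic to find a cell) differs. Reading a whole row is one contiguous slice in the first layout, and reading a whole column is one contiguous slice in the second.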
That makes sense
for 2 dimensions.
But what about …
3+ dimensions?
Same thing.
Computer science has had the idea of
multi-dimensional arrays
since the beginning.
It's a straightforward extension to a table, conceptually. It's
very hard to visualize more than 3 unless you’re on dope*.
* - This is a verrrry funny joke because multidimensional arrays use locators called dope vectors. Ha ha ha hmm.
But you’re still making it linear when
you store it.
Imagine storing the venues, shows,
bands, and songs in one file.
That’s 4 dimensions.
(Each venue has many shows;
each show has many bands;
each band has many songs; etc.)
By venue → show → band → song:
Venue: The Mohawk
  Show: Vans Day Party, 3/13/14
    Band: Eagulls (12:30 p.m.)
      Song: Nerve Endings
      Song: Tough Luck
    Band: DJ Rashad (1:15 p.m.)
      Song: Holiday
      Song: I Can Feel It
    Band: Kelela (2:15 p.m.)
      Song: ...
    Band: Charli XCX (3:15 p.m.)
      Song: ...
    Band: Dum Dum Girls (4:15 p.m.)
      Song: ...
    Band: Pusha T (5:00 p.m.)
  Show: ...
Venue: ...
  Show: ...
    Band: ...
      Song: ...
etc.

By song → band → show → venue:
Song: Nerve Endings
  Band: Eagulls
    Show: Vans Day Party, 3/13/14
      Venue: The Mohawk
    Show: Official Showcase, 3/14/14
      Venue: The Mohawk
Song: Holiday
  Band: DJ Rashad
    Show: Vans Day Party, 3/13/14
      Venue: The Mohawk
  Band: Cattle Decapitation
    Show: Chaos In Tejas, 3/10/14
      Venue: Iron And Lace
Song: ...
  Band: ...
    Show: ...
etc.
No matter how many dimensions,
storing the data still requires that you
pick a single primary orientation.
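A rough Python sketch of what "picking an orientation" costs you: the same facts nested venue-first, then rebuilt song-first. The dictionary layout is just one way I might model it, not how IMS stores anything.

```python
# Venue -> show -> band -> list of songs.
by_venue = {
    "The Mohawk": {
        "Vans Day Party, 3/13/14": {
            "Eagulls": ["Nerve Endings", "Tough Luck"],
            "DJ Rashad": ["Holiday", "I Can Feel It"],
        }
    }
}

# Flatten the nesting into one (venue, show, band, song) tuple per leaf.
facts = [
    (venue, show, band, song)
    for venue, shows in by_venue.items()
    for show, bands in shows.items()
    for band, songs in bands.items()
    for song in songs
]

# Rebuild the same facts in the opposite order: song -> band -> show -> venues.
by_song = {}
for venue, show, band, song in facts:
    by_song.setdefault(song, {}).setdefault(band, {}).setdefault(show, []).append(venue)

print(by_song["Holiday"])
```

Note that answering "where was Holiday played?" from `by_venue` means walking the entire tree, while `by_song` answers it with one lookup. To get both, you store it twice.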
So Big Blue’s design choice makes a
little more sense now, right?
BTW, If you've used an
ATM recently…
you're an IMS user.
(It’s not as obsolete as it sounds.)
How do you actually store the linearized data?
You put it in a format.
Creep•••••••••••••1993Pablo•Honey
No Surprises••••••1997OK•Computer
Lucky•••••••••••••1997OK•Computer
Karma Police••••••1997OK•Computer
Fake Plastic Trees1995The•Bends••
Fixed-width files were all the rage in the 1960s.
Row delimiter is a line break; column delimiter is a pre-set
agreement about how many characters are in each column.
This is wasteful, brittle, and hard to read.
"Creep","1993","Pablo Honey"
"No Surprises","1997","OK Computer"
"Lucky","1997","OK Computer"
"Karma Police","1997","OK Computer"
"Fake Plastic Trees","1995","The Bends"
Delimited files (e.g. CSV, comma separated values):
Row delimiter is still a line break; column delimiter is variable (a
comma, in this case). Optionally, also “qualifiers” (quotes, here).
This is a decent format (but, Microsoft’s version really screwed things up for everyone.)
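If you want to see why fixed-width is brittle and why qualifiers matter, here's a small Python sketch of both parsers. The column widths (18, 4, 11) are read off the sample above, I pad with spaces rather than dots, and the comma-containing title in the CSV is a made-up example.

```python
import csv
import io

WIDTHS = (18, 4, 11)  # song, year, album: the "pre-set agreement"

def parse_fixed(line, widths=WIDTHS):
    """Slice one fixed-width line into fields, trimming the padding."""
    fields, start = [], 0
    for width in widths:
        fields.append(line[start:start + width].rstrip())
        start += width
    return fields

fixed_line = "Creep".ljust(18) + "1993" + "Pablo Honey".ljust(11)
fixed_row = parse_fixed(fixed_line)

# A title longer than 18 characters silently shifts every later field.
long_line = "Everything In Its Right Place" + "2000" + "Kid A".ljust(11)
broken_row = parse_fixed(long_line)

# Delimited: the qualifier (quotes) lets a value contain the delimiter.
csv_text = (
    '"Creep","1993","Pablo Honey"\n'
    '"Song, With A Comma","1997","OK Computer"\n'  # hypothetical title
)
csv_rows = list(csv.reader(io.StringIO(csv_text)))

print(fixed_row)
print(broken_row)
print(csv_rows)
```

The fixed-width parser has no way to notice the overlong title; it just hands back a garbage year. The CSV reader copes with the embedded comma because the quotes tell it where the value really ends.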
<row>
<col>Creep</col>
<col>1993</col>
<col>Pablo Honey</col>
</row>
<row>
<col>No Surprises</col>
<col>1997</col>
<col>OK Computer</col>
</row>
<row>
<col>Lucky</col>
<col>1997</col>
<col>OK Computer</col>
</row>
There’s also markup (e.g. HTML)
“Tags” (<tag></tag>) give you the
start and end of rows, and the
start and end of columns within
those rows.
SGML, HTML, XML, all follow this
approach.
Song: Creep
Year: 1993
Album: Pablo Honey
%
Song: No Surprises
Album: OK Computer
Year: 1997
%
Year: 1997
Song: Lucky
Album: OK Computer
...
Or even YAML (“YAML Ain’t Markup Language”)
(Like email headers.)
Column pointers (names) are
inline with the values; rows have
many lines, and are delimited by
another character (e.g. “%”).
This is obviously more flexible,
but still inherently hierarchical.
And there are a couple modern
technologies that are hierarchical all
the way down.
In practice, the only form of markup
people use for storing data is XML.
And the most prevalent subset of YAML
people use for storing data is JSON.
(Technically JSON isn’t a subset of YAML but you shut up.)
XML was the early, obvious choice, because
we were used to HTML, so we all "got" it.
<?xml version="1.0"?>
<venue name="The Mohawk">
<show title="Vans Day Party, 3/13/14">
<band name="Eagulls" time="12:30 p.m.">
<song order="1">Nerve Endings</song>
<song title="Tough Luck" order="2" />
</band>
<band name="DJ Rashad" time="1:15 p.m.">
<song title="Holiday" order="1" />
<song title="I Can Feel It" order="2" />
</band>
</show>
</venue>
But XML isn’t very human-friendly.
● It’s verbose
● The wrong things grab your eye
● It’s somewhat complicated to parse
● Distinction between attributes and tag contents
is confusing.
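To see the attribute-vs-contents confusion in action, here's a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The document is a trimmed copy of the venue example; the fallback logic is my own workaround, not a recommended pattern.

```python
import xml.etree.ElementTree as ET

doc = """
<venue name="The Mohawk">
  <show title="Vans Day Party, 3/13/14">
    <band name="Eagulls" time="12:30 p.m.">
      <song order="1">Nerve Endings</song>
      <song title="Tough Luck" order="2" />
    </band>
  </show>
</venue>
"""

root = ET.fromstring(doc)

titles = []
for song in root.iter("song"):
    # The title may live in an attribute or in the tag's text content,
    # so the reader has to check both places.
    titles.append(song.get("title") or (song.text or "").strip())

print(root.get("name"), titles)
```

Two songs, two different places to put the same kind of fact, and the parsing code has to know about both.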
“XML is crap. Really. There are no excuses. XML is nasty to parse for humans, and it's a disaster to parse
even for computers. There's just no reason for that horrible crap to exist.” - Linus Torvalds, Yesterday
(2014-03-06), on Google+
Now JSON is winning ...
{
"venue": {
"name": "The Mohawk",
"show" : {
"title": "Vans Day Party, 3/13/14",
"bands": [{
"name": "Eagulls", "time": "12:30 pm",
"songs": [
{"title": "Nerve Endings", "order": 1},
{"title": "Tough Luck", "order": 2}
]},
{
"name": "DJ Rashad", "time": "1:15 pm",
"songs": [
{"title": "Holiday", "order": 1},
{"title": "I Can Feel It", "order": 2}
]}
]
}
}
}
JSON:
● Where XML looks like a mass of text, JSON
looks spacious (little clutter)
● Fast to parse, for humans and computers
● Self-describing, flexible format
● Extremely simple syntax (one page)
from http://www.json.org/
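For comparison, here's roughly what reading that JSON looks like in Python with the standard `json` module. The text is a shortened copy of the venue example, and `set_list` is just a name I picked.

```python
import json

text = """
{
  "venue": {
    "name": "The Mohawk",
    "show": {
      "title": "Vans Day Party, 3/13/14",
      "bands": [
        {"name": "Eagulls", "time": "12:30 pm",
         "songs": [{"title": "Nerve Endings", "order": 1},
                   {"title": "Tough Luck", "order": 2}]},
        {"name": "DJ Rashad", "time": "1:15 pm",
         "songs": [{"title": "Holiday", "order": 1},
                   {"title": "I Can Feel It", "order": 2}]}
      ]
    }
  }
}
"""

data = json.loads(text)

# Navigate the hierarchy: venue -> show -> bands -> songs.
set_list = [
    (band["name"], song["order"], song["title"])
    for band in data["venue"]["show"]["bands"]
    for song in sorted(band["songs"], key=lambda s: s["order"])
]

print(set_list)
```

The parsed result is plain dictionaries and lists, so you still have to know the hierarchy to get anywhere: venue, then show, then bands, then songs.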
One more thing: so far, we’ve talked
about a certain kind of hierarchical
relationship: containment.
But there’s another kind worth
mentioning: generalization.
This kind of relationship is common in
programming (it’s called superclassing).
But it’s uncommon (at least, explicitly) in
database systems.
We’ll come back to it later.
Hands on!
You’re doing some corporate espionage for
listen.up. You notice that do512.com seems to
have a good way to organize data in their API.
What can you steal, er, learn?
Note: the guys at do512 are friends of mine and I am in no way encouraging anyone to perform any
actual corporate espionage, no matter how cool that sounds. Listen.up is a made up company, do512’s
json API is open, and this is an exercise for learning; no stealing anything. :)
Hands on!
● Make sure you’ve got the JSONView extension in Chrome
● Go to: http://2014.do512.com/events.json
● Explore the hierarchical data that comes up
● Try “-” to collapse all, click “+” signs to unfold sections
● Also try:
○ http://2014.do512.com/venues.json
○ http://2014.do512.com/artists.json
Hands on!
● What’s the hierarchical structure of this data?
● In the same way we talked about nesting bands, venues,
shows, etc, … what are the objects being nested here?
● Would you store it differently?
● Are there any superclass / subclass relationships?
Questions?
Chapter 3: Relations
So, hierarchy! Pretty great, right?
Actually, no; it’s quite problematic if you
use it as the method for storing data.
● Lots of stuff isn't naturally hierarchical.
● You can't change the organization without changing all the
code that accesses data.
● Above 3 dimensions, the number of possible access paths
goes up dramatically! (Exponentially, in fact.)
○ The academic literature of the 60s and 70s is full of
papers describing how to do this better or faster.
But, what are you gonna do? Computers
are just boring & hierarchical, so you’d
better learn to deal with it.
Meanwhile, outside the offices of IBM,
the revolution of the ‘60s was happening.
The desirability and feasibility of
hierarchy, especially a single universal
hierarchy, seemed less and less
certain.
Reconciliation seemed impossible.
And then ...
something wonderful happened
THE MONOLITH IN KUBRICK'S 2001: A SPACE ODYSSEY (1968)
Specifically,
the math nerds
beat the business jocks.
This is Dr. Edgar F. Codd (1923-2003)
He worked for IBM in the 1960s, and
couldn’t stand the thought of our rich,
interconnected world being subjugated to
storage in hierarchical databases.
So he came up with a radical theory.
image from wikipedia:
http://en.wikipedia.org/wiki/File:Edgar_F_Codd.jpg
Two intellectual ancestors:
Set Theory Graph Theory
Set Theory:
Sets are collections of
objects. You can precisely
describe operations on sets:
● Union
● Intersection
● Difference
● Cartesian Product
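Those four operations map directly onto Python's built-in sets, so here's a tiny sketch. The two lineups are hypothetical; I just borrowed band names from earlier slides.

```python
from itertools import product

# Hypothetical lineups for two days.
thursday = {"Eagulls", "DJ Rashad", "Kelela"}
friday = {"Kelela", "Pusha T"}

everyone = thursday | friday        # union
both_days = thursday & friday       # intersection
thursday_only = thursday - friday   # difference

# Cartesian product: every possible (venue, date) pairing.
slots = set(product({"The Mohawk", "Beerland"}, {"3/13", "3/14"}))

print(sorted(everyone))
print(both_days, thursday_only)
print(sorted(slots))
```

Notice that none of these say how to loop over anything. You describe the set you want and let the machinery figure out the rest, which is the idea Codd ran with.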
Graph Theory:
Graphs are collections of nodes, connected by
edges.
Not this:
This:
Think: a social network where
the nodes are people and the
edges are friend relationships.
Codd’s genius was combining these and
proposing a declarative (rather than
imperative) access model.
● The relational model is a graph of sets
● Relations (tables) are sets of tuples (rows).
● Some attributes (columns) are edges that let you
connect the sets in interesting ways.
● You never specify “how” to get to data, just “what”
data to get, based on sets.
When he proposed this as a
way to actually store data in 1969,
people thought he was from Mars.
Everyone said “It’ll never work, computers are too slow.”
His employer, IBM, said
“Thanks but no thanks; we’ll just keep selling IMS.”
But the haters didn’t bank on two things.
1: Moore’s Law
That graph is exponential,
not linear.
We can have plenty of CPU.
2: Programmer Time
As programmer time became
more valuable than computer
time, ease of representing the
problem domain became a
dominating factor.
Due to some tomfoolery at IBM,
Codd’s “Alpha” never made it.
But another group at IBM created a quasi-
relational version called SEQUEL, which
looked kind of like COBOL.
Then in 1979, Larry Ellison copied the design
to create Oracle and SQL (SEQUEL was
trademarked). And the rest is history.
For his part, Codd waged a
decades-long battle to get a more
true representation of the relational
model adopted.
But, it never was. SQL was king.
So what is the relational model, then?
First, one quick PSA ...
Occupational Hazard:
Abstraction Vertigo
Concrete Data:
Band name: Radiohead
Year formed: 1985
Metadata:
Entity: Band
Attributes: Name, Year Formed
The Structure of Metadata:
Concepts → Entities, Attributes
Safety first.
If you feel dizzy, just ask a question.
There are 3 foundational concepts:
Entity
Attribute
Relationship

Entity (Attribute, Attribute) -- relationship -- Entity (Attribute, Attribute)
Show (Start time, Cover $) -- is at -- Venue (Address, Age Req.)
Vans Day Party (12pm, Free) -- is at -- The Mohawk (123 Red River, All Ages)
Notice the subtle shift here.
Neither of those entities is “inside” the
other. They’re both “first class” entities,
and they’re in a relationship.
The relationships are described at the level of sets,
not ad hoc. Shows can be at venues, categorically.
Entities look pretty much exactly like 2-
dimensional tables, except for the
concept of a “key”.
That’s the attribute (or set of attributes)
that distinguishes this row from that row.
Show (Start time, Cover $) -- is at -- Venue (Address, Age Req.)

Show (key)   Start time   Cover $
1234         8pm          $5
5678         11pm         $9

Venue (key)   Address        Age Req.
The Mohawk    123 Red Riv.   21+
Beerland      456 Red Riv.   All Ages
Keys describe an entity’s identity.
In practice, most systems today use surrogate
keys (i.e. IDs) to establish unambiguous identity.
eg: integers (123456), codes (X74-UUA2),
GUIDs
Show (Start time, Cover $) -- is at -- Venue (Address, Age Req.)

Show   Is At        Start time   Cover $
1234   The Mohawk   8pm          $5
5678   Stubb’s      11pm         $9

Venue        Address        Age Req.
The Mohawk   123 Red Riv.   21+
Beerland     456 Red Riv.   All Ages
Then, to get connected information out,
you traverse the relationships with
something called a JOIN.
Show   Is At        Start time   Cover $
1234   The Mohawk   8pm          $5
5678   Stubb’s      11pm         $9

+

Venue        Address        Age Req.
The Mohawk   123 Red Riv.   21+
Beerland     456 Red Riv.   All Ages

=

Show   Is At        Address        Age Req.   Start time   Cover $
1234   The Mohawk   123 Red Riv.   21+        8pm          $5
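If you've never seen a JOIN before, here's roughly what it does, sketched in plain Python with tuples standing in for rows. A real database would do this far more cleverly; `venue_by_name` is just my stand-in for an index on the key.

```python
shows = [
    (1234, "The Mohawk", "8pm", "$5"),
    (5678, "Stubb's", "11pm", "$9"),
]
venues = [
    ("The Mohawk", "123 Red Riv.", "21+"),
    ("Beerland", "456 Red Riv.", "All Ages"),
]

# Look venues up by their key (the venue name).
venue_by_name = {venue[0]: venue for venue in venues}

# Inner join: keep a show only if its "is at" value matches a venue key,
# then glue the venue's other attributes onto the show's row.
joined = [
    (show_id, is_at) + venue_by_name[is_at][1:] + (start, cover)
    for show_id, is_at, start, cover in shows
    if is_at in venue_by_name
]

print(joined)
```

Show 5678 drops out because there's no Stubb's row in the venue table to match it against. That's what the "inner" in an inner join means.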
Tuple is a fancy word for “row”.
(It comes from abstracting … septuple, octuple, N-tuple …)
Cardinality means, how many tuples of
one relation can match each tuple in
another relation?
(In English: a show is at one venue,
but a show can have many bands.)
Show (Start time, Cover $) -- played by -- Band (Name, Genre)
many to many?

Show ID   Start time   Cover $
1234      8pm          $5
5678      11pm         $9

Band ID   Name        Genre
RDOHD     Radiohead   Rock
EGLS      Eagulls     Indie Rock

Show ID   Band ID   Start Time
1234      RDOHD     9:30pm
5678      EGLS      9pm
5678      RDOHD     11pm
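Here's a sketch of that many-to-many setup using Python's built-in `sqlite3` module, so you can poke at it without installing anything. The table and column names (`show_band`, `band_id`, and so on) are my own choices, and SQLite here is standing in for whatever relational database you'd actually use.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE show (show_id INTEGER PRIMARY KEY, start_time TEXT, cover TEXT);
CREATE TABLE band (band_id TEXT PRIMARY KEY, name TEXT, genre TEXT);
-- The third table holds one row per (show, band) pairing.
CREATE TABLE show_band (
    show_id INTEGER REFERENCES show(show_id),
    band_id TEXT REFERENCES band(band_id),
    start_time TEXT,
    PRIMARY KEY (show_id, band_id)
);
INSERT INTO show VALUES (1234, '8pm', '$5'), (5678, '11pm', '$9');
INSERT INTO band VALUES ('RDOHD', 'Radiohead', 'Rock'), ('EGLS', 'Eagulls', 'Indie Rock');
INSERT INTO show_band VALUES
    (1234, 'RDOHD', '9:30pm'), (5678, 'EGLS', '9pm'), (5678, 'RDOHD', '11pm');
""")

# One show, many bands ...
bands_at_5678 = [row[0] for row in con.execute(
    "SELECT b.name FROM show_band sb JOIN band b ON b.band_id = sb.band_id "
    "WHERE sb.show_id = ? ORDER BY b.name", (5678,))]

# ... and one band, many shows.
shows_for_radiohead = [row[0] for row in con.execute(
    "SELECT show_id FROM show_band WHERE band_id = ? ORDER BY show_id", ("RDOHD",))]

print(bands_at_5678, shows_for_radiohead)
```

The pairing table is an entity in its own right: it even gets its own attribute (the band's start time at that show).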
Normalization is the idea that if
something repeats (i.e. has a cardinality
of more than 1), it should be expressed
as another entity, not as repeating data.
No:
Show                       Date   Venue    Street          City     State
Austinist Party! - Day 1   3/13   Mohawk   123 Red River   Austin   TX
Austinist Party! - Day 2   3/14   Mohawk   123 Red River   Austin   TX
Austinist Party! - Day 3   3/15   Mohawk   123 Red River   Austin   TX

Yes:
Show                       Date   Venue
Austinist Party! - Day 1   3/13   Mohawk
Austinist Party! - Day 2   3/14   Mohawk
Austinist Party! - Day 3   3/15   Mohawk

Venue    Street          City     State
Mohawk   123 Red River   Austin   TX
Basically, normalization dictates that
the same fact is never repeated
in more than one place.
Show Date Venue Street City State
Austinist Party! - Day 1 3/13 Mohawk 123 Red River Austin TX
Austinist Party! - Day 2 3/14 Mohawk 123 Red River Austin TX
Austinist Party! - Day 3 3/15 Mohawk 123 Red River Austin TX
There’s nothing wrong with this data per se, but we all know
that the final 4 columns refer to the same thing.
Show Date Venue Street City State
Austinist Party! - Day 1 3/13 Mohawk 123 Red River Austin TX
Austinist Party! - Day 2 3/14 Mohawk 985 Congress Austin TX
Austinist Party! - Day 3 3/15 Mohawk 123 Red River Austin TX
There’s nothing wrong with this data per se, but we all know
that the final 4 columns refer to the same thing.
What would it mean if the street address were different in
one of these? Are there two Mohawks?
Normalization gives you consistency,
also known as relational integrity.
Certain kinds of problems (like that one)
just can’t happen, because there’s
literally only one place where you store
the address of the Mohawk.
Show Date Venue
Austinist Party! - Day 1 3/13 Mohawk
Austinist Party! - Day 2 3/14 Mohawk
Austinist Party! - Day 3 3/15 Mohawk
Venue Street City State
Mohawk 123 Red River Austin TX
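A small Python sketch of the difference, using dictionaries in place of tables. It isn't a database, just an illustration of why the two-Mohawks problem has nowhere to hide once the address lives in one place.

```python
# Denormalized: the venue's street is repeated on every show row,
# so nothing stops one copy from drifting.
denormalized = [
    {"show": "Austinist Party! - Day 1", "venue": "Mohawk", "street": "123 Red River"},
    {"show": "Austinist Party! - Day 2", "venue": "Mohawk", "street": "985 Congress"},
    {"show": "Austinist Party! - Day 3", "venue": "Mohawk", "street": "123 Red River"},
]
streets_seen = {row["street"] for row in denormalized if row["venue"] == "Mohawk"}

# Normalized: shows point at a venue key; the address is stored once.
venues = {"Mohawk": {"street": "123 Red River", "city": "Austin", "state": "TX"}}
shows = [
    ("Austinist Party! - Day 1", "3/13", "Mohawk"),
    ("Austinist Party! - Day 2", "3/14", "Mohawk"),
    ("Austinist Party! - Day 3", "3/15", "Mohawk"),
]
streets_via_key = {venues[venue]["street"] for _, _, venue in shows}

print(len(streets_seen), "addresses for the Mohawk when denormalized")
print(len(streets_via_key), "address when normalized")
```

In the normalized version there is simply no second copy of the street to get wrong.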
Normalization also means that the
attributes of an entity are non-repeating.
(i.e. there are no repeating columns, or groups of columns)
No:
Show             Start time   Cover $   Band #1     Band #2        Band #3
House of Vans    8pm          $5        Eagulls     Cyndi Lauper   Radiohead
Chaos in Tejas   11am         $10       Metalface   CRUD           Decapitation

Yes:
Show             Start time   Cover $
House of Vans    8pm          $5
Chaos in Tejas   11am         $10

Show             Band           Order
House of Vans    Eagulls        1
House of Vans    Cyndi Lauper   2
House of Vans    Radiohead      3
Chaos in Tejas   Metalface      1
Chaos in Tejas   CRUD           2
Chaos in Tejas   Decapitation   3
This makes sense, because …
what if you had 4 bands? Or 400?
Do people break these rules?
All the time!
(We’ll get to valid reasons why they might want to do that.)
One last thing:
Schema is the structure of the database
itself. It is stored as ... data!
VenueID Name Address
123 Mohawk 123 Red River
456 Stubbs 456 Red River
BandID Name # Members
4321 Eagulls 4
8765 CRUD 17
Table Column Type
Venue VenueID ID
Venue Name String
Venue Address String
Band BandID ID
Band Name String
Band # Members Integer
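You can see "schema is data" for yourself with Python's built-in `sqlite3`. This is a sketch against SQLite specifically: `sqlite_master` and `PRAGMA table_info` are SQLite's way of exposing the catalog, other databases have their own, and I renamed "# Members" to `Members` to keep the column name simple.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE venue (VenueID INTEGER PRIMARY KEY, Name TEXT, Address TEXT);
CREATE TABLE band (BandID INTEGER PRIMARY KEY, Name TEXT, Members INTEGER);
""")

# The list of tables is itself stored in a table you can query.
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

# Each table_info row is (cid, name, type, notnull, dflt_value, pk).
venue_columns = [(row[1], row[2]) for row in con.execute("PRAGMA table_info(venue)")]

print(tables)
print(venue_columns)
```

The answer comes back as rows and columns, exactly like any other query result.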
The final piece of the relational database puzzle:
SQL (Structured Query Language)
which is about how you get stuff
out of this graph of sets.
We obviously don’t have time to really
learn SQL, but here’s the gist:
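SELECT
S.cover, B.band
FROM
Shows S INNER JOIN ShowBands B
ON B.show = S.show
WHERE
S.start_time > '20:00' -- after 8pm, assuming times are stored as 24-hour text
AND B.band LIKE '%face%'
ORDER BY
B."order" ASC -- ORDER is a reserved word, so the column name is quoted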
Show Start time Cover
House of Vans 8pm $5
Chaos in Tejas 9pm $10
Show Band Order
House of Vans Eagulls 1
House of Vans Cyndi Lauper 2
Chaos in Tejas Metalface 1
Chaos in Tejas CRUD 2
$10 Metalface
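If you'd like to run that query yourself, here's a self-contained sketch using Python's built-in `sqlite3`. A few things are my assumptions rather than anything on the slides: start times are stored as 24-hour text (so "after 8pm" becomes `> '20:00'`), the `order` column is quoted because ORDER is a reserved word, and SQLite stands in for a full database server.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Shows (show TEXT PRIMARY KEY, start_time TEXT, cover TEXT);
CREATE TABLE ShowBands (show TEXT, band TEXT, "order" INTEGER);
INSERT INTO Shows VALUES
    ('House of Vans', '20:00', '$5'),
    ('Chaos in Tejas', '21:00', '$10');
INSERT INTO ShowBands VALUES
    ('House of Vans', 'Eagulls', 1),
    ('House of Vans', 'Cyndi Lauper', 2),
    ('Chaos in Tejas', 'Metalface', 1),
    ('Chaos in Tejas', 'CRUD', 2);
""")

rows = con.execute("""
SELECT S.cover, B.band
FROM Shows S INNER JOIN ShowBands B ON B.show = S.show
WHERE S.start_time > '20:00'
  AND B.band LIKE '%face%'
ORDER BY B."order" ASC
""").fetchall()

print(rows)
```

The join pairs each band with its show, the WHERE clause throws out House of Vans (it starts at 8pm, not after) and every band without "face" in its name, and one row is left.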
So that’s relational databases and SQL,
in a very small nutshell.
You can imagine,
relational databases
get very complicated
for non-trivial things.
But you now know
almost all there is to
know: relational
databases are graphs
of sets, navigated via
declarative language.
Source: http://wiki.musicbrainz.org/-/images/5/52/ngs.png
Also, remember generalization?
Theoretically, that’s just another type of
relationship between entities.
Band | Genre
Radiohead | Rock
Metalface | Jazz
Entertainer | Since
Radiohead | 1985
Metalface | 2005
Gonzo the Incredible | 1968
Gob | 2002
Penn & Teller | 1990
Magician | Style
Gonzo the Incredible | Sorcery
Gob | Fail
Penn & Teller | Conjuring
I say theoretically because nobody
actually does this in practice, since
(unlike in object oriented programming)
doing this carries a performance and
complexity burden in databases.
(But, you could do it, theoretically.)
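For the curious, here is roughly what "doing it anyway" could look like: a sketch in Python with sqlite3, where Entertainer is the general table and Band and Magician each hold only their own extra attribute. Using the name as the shared key is my simplification for illustration; a real model would more likely use an ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The general entity: every band and magician is also an entertainer.
CREATE TABLE entertainer (name TEXT PRIMARY KEY, since INTEGER);
-- The specialized entities point back at their general row.
CREATE TABLE band (name TEXT PRIMARY KEY REFERENCES entertainer(name), genre TEXT);
CREATE TABLE magician (name TEXT PRIMARY KEY REFERENCES entertainer(name), style TEXT);
INSERT INTO entertainer VALUES
  ('Radiohead', 1985), ('Metalface', 2005),
  ('Gonzo the Incredible', 1968), ('Gob', 2002), ('Penn & Teller', 1990);
INSERT INTO band VALUES ('Radiohead', 'Rock'), ('Metalface', 'Jazz');
INSERT INTO magician VALUES
  ('Gonzo the Incredible', 'Sorcery'), ('Gob', 'Fail'), ('Penn & Teller', 'Conjuring');
""")

# Getting a "whole" band means joining the specific row to its general row.
bands = conn.execute("""
SELECT b.name, b.genre, e.since
FROM band b INNER JOIN entertainer e ON e.name = b.name
ORDER BY b.name
""").fetchall()

print(bands)
```

Every read of a band now costs a join, which is the performance and complexity burden in a nutshell.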
Hands on!
We’re going to spin up a relational
database using Heroku, create and
populate some tables, and show the data
on a web site.
Hands on!
About Heroku:
● Super simple hosted engine for running any code online
● Has built-in relational database capabilities with Postgres
● Uses a version control system called git
● Uses Amazon AWS to host the code
● It's owned by Salesforce.com (my company) but run
separately; I'm not an expert
● Also note that this example is written in Ruby and I’m
really not an expert on that. Hope it works!
Hands on!
Prerequisites:
1. You should already have git installed; do now if not.
2. You should have heroku toolbelt installed, and have
created a heroku login. Do now if you haven’t.
3. You should have already cloned my repo, but if not do it
now.
$ cd ~
$ git clone https://github.com/ivarley/unstructure-sxsw14
Hands on!
Steps:
1. Go to the directory you cloned the repo into:
$ cd unstructure-sxsw14
2. Create and publish the app:
$ heroku login
$ heroku create
$ git push heroku master
$ heroku open
Hands on!
Congrats! You just published a web app.
Not convinced? Edit some of the visible text in the file:
app/views/welcome/index.html.erb
and republish:
$ git commit -am 'made an edit'
$ git push heroku master
$ heroku open
Hands on!
Now let’s add some data!
Heroku automatically created a Postgres database for you.
$ heroku pg:psql
opens a database prompt. Create a table and insert data:
CREATE TABLE band (bandid INT PRIMARY KEY, name VARCHAR, genre VARCHAR);
INSERT INTO band (bandid, name, genre) VALUES (1, 'Radiohead', 'Rock');
SELECT * FROM band;
Hands on!
Add some more tables and data:
CREATE TABLE venue (venueid INT PRIMARY KEY, name VARCHAR, address VARCHAR);
CREATE TABLE show (venueid INT, bandid INT, date VARCHAR, PRIMARY KEY (venueid, bandid, date));
INSERT INTO venue (venueid, name, address) VALUES (1, 'The Mohawk', '123 Red River, Austin, TX');
INSERT INTO show (venueid, bandid, date) VALUES (1, 1, '2014-03-07');
And create a view (basically, a saved SQL statement):
CREATE VIEW shows AS SELECT b.name AS band_name, v.name AS venue_name, s.date
FROM band b INNER JOIN show s ON b.bandid = s.bandid
INNER JOIN venue v ON s.venueid = v.venueid;
Hands on!
Now you can SELECT data:
SELECT * FROM shows;
And you’ll get:
band_name | venue_name | date
-----------+------------+------------
Radiohead | The Mohawk | 2014-03-07
Note if you’re into nit-picking: I'm taking a shortcut here and saying that each record in "show" is a band/venue
combination, with a date. If "show" were a proper entity (for example, if the show had a name, a promoter, etc.) then
the proper “normalized” way to model it would be to create a show entity with an ID and a venue ID, show name,
promoter, etc; and then have intersection tables between bands and shows (like a band_show table). But this is fine
for now.
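For the nit-pickers, the "proper" version from that note might look something like this sketch (Python with sqlite3 rather than the workshop's Postgres, so you can run it anywhere). The show name and promoter values are made up for illustration, and the second band is borrowed from the earlier slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE band (bandid INTEGER PRIMARY KEY, name TEXT, genre TEXT);
CREATE TABLE venue (venueid INTEGER PRIMARY KEY, name TEXT, address TEXT);
-- A show is now its own entity with an ID, a venue, and its own attributes.
CREATE TABLE show (
  showid INTEGER PRIMARY KEY,
  venueid INTEGER REFERENCES venue(venueid),
  name TEXT,
  promoter TEXT,
  date TEXT
);
-- Intersection table: one row per band playing a given show.
CREATE TABLE band_show (
  bandid INTEGER REFERENCES band(bandid),
  showid INTEGER REFERENCES show(showid),
  PRIMARY KEY (bandid, showid)
);
INSERT INTO band VALUES (1, 'Radiohead', 'Rock'), (2, 'Metalface', 'Jazz');
INSERT INTO venue VALUES (1, 'The Mohawk', '123 Red River, Austin, TX');
-- Hypothetical show name and promoter, just to have something to store.
INSERT INTO show VALUES (1, 1, 'Example Showcase', 'Example Promoter', '2014-03-07');
INSERT INTO band_show VALUES (1, 1), (2, 1);
""")

# One show, many bands: walk show -> venue and show -> band_show -> band.
lineup = conn.execute("""
SELECT s.name, v.name, b.name
FROM show s
INNER JOIN venue v ON v.venueid = s.venueid
INNER JOIN band_show bs ON bs.showid = s.showid
INNER JOIN band b ON b.bandid = bs.bandid
ORDER BY b.name
""").fetchall()

print(lineup)
```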
Hands on!
Leave the SQL prompt (type “\q”) and reopen the app:
$ heroku open
And (hopefully) voila! You see the results of:
SELECT * FROM shows;
Feel free to mess around now--add more data, to see how it’ll
show up. If you’re really advanced, try changing the ruby code
to show different SQL statements, etc. We’ll take about 10
minutes to play around, ask questions, etc.
Questions?
Chapter 4: Mutation
So, hierarchical databases were lame.
But relational databases are awesome!
SQL Rules!! Right … ?
Hrm. As awesome as they are, relational
databases have a few warts.
And some of them have only become
apparent recently.
There’s a standard list of grievances:
● It’s really hard to work with recursive / graph relationships
● Results of SELECT queries are always flat tables, which
means you have to reassemble nested structures yourself
● They don’t play well with object oriented programs
● They don’t support inheritance & superclassing!
● Relational modeling tools are generally quite sucky.
● They don’t scale well to extremely large data sets,
because they promise things you can only do on a single
(non-distributed) system.
These are all interesting points; if we had
a whole semester, we could spend
weeks on any one of them.
But I want to spend the rest of our time
on what is (IMO) the essential problem:
mutability (i.e. change, being mutated).
Let’s zoom way out for a minute.
By NASA [Public domain], via Wikimedia Commons
We talked about hierarchies.
● Old school, classical
● Single top down view of reality
We talked about relations.
● Modern, networks, connections
● No single privileged access path or view
But both hierarchies and relations
tend towards a commitment to a
fixed representation of reality.
You’re modeling static concepts that exactly match the real
world. Doing this with high fidelity is the crowning
achievement of the relational model.
But reality doesn’t hold still.
It changes, all the time.
Any fixed way of representing the
world is doomed to become outdated.
Darwin knew what was up.
The nature of reality is to be dynamic, evolving.
A species seems like a fixed thing,
but that's just a label we attach.
They are fuzzy around the edges, and always changing.
Change is a fact of life at all levels
of the abstraction ladder…
● our understanding of structure itself
● the structure of our software apps
● facts about things in the real world
And it’s only getting
faster.
We can either hide our
heads in the sand, or we can
figure out what to do about
it.
So let’s talk about 3 ways to
smash the boundaries of data,
in the mutation dimension:
Attribute Flexibility
Non-Destructive Mutability
Model Agility
Non-Destructive Mutability
This is an UPDATE statement in SQL:
UPDATE Band
SET name = 'Jaydiohead'
WHERE "Band ID" = 5678
AND name = 'Radiohead';
Result:
Before:
Band ID | Name
1234 | Bjork
5678 | Radiohead
After:
Band ID | Name
1234 | Bjork
5678 | Jaydiohead
You know what you can’t do?
Undo.
Edits in traditional relational databases
are destructive. If you want to keep the
old version, you have to do it yourself.
Note for data nerds: yes, databases keep transaction logs so you can undo and redo edits as part
of transactions. But this is (a) implementation dependent, (b) not typically exposed to users in the
relational model, and (c) not guaranteed to persist beyond the transaction itself.
This means that everyone has to either:
● Add extra complexity to their data model, or
● Accept that changes are destructive.
Both are pretty crappy options. Most people just
do #2.
But, sometimes you can’t do that.
Consider compliance.
● Who changed my salary?
● Who deleted this opportunity from the
pipeline?
● Who took my stapler?
You might not care a bunch about that, but Sarbanes-Oxley
sure does.
And if you don’t know what that is, count yourself lucky and go back to making pretty things.
So how do people store history now?
Option #1: audit columns (partial solution)
So this …
Band ID | Name
1234 | Bjork
5678 | Radiohead
… becomes this:
Band ID | Name | Create Date | Created By | Modify Date | Modified By | Deleted?
1234 | Bjork | 3/6/14 2pm | Ian Varley | 3/6/14 3pm | Jan Jones | false
5678 | Jaydiohead | 3/5/14 1pm | John Smith | 3/7/14 1pm | Ian Varley | false
Option #2: History tables
So this …
Band ID | Name
1234 | Bjork
5678 | Radiohead
… becomes this:
Band ID | Name | Version | Date | By
1234 | Bjork | 1 | 3/6/14 | Ian V
5678 | Radiohead | 1 | 3/6/14 | Ian V
5678 | Jaydiohead | 2 | 3/7/14 | Ian V
1234 | Fjork | 2 | 3/7/14 | Ian V
Neither of these is ideal.
They ugly up your data model, which is
not just an inconvenience; it makes it
harder to see the “real” stuff.
It also violates the spirit of normalization: if something means
the same thing, don’t repeat it all over the place.
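To make the history-table option concrete, here is a sketch of the do-it-yourself version using triggers, written in Python with sqlite3. The table names, the trigger names, and the fixed 'app' value for who made the change are all assumptions for illustration. Notice how much machinery it takes to track one column of one table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE band (band_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE band_history (
  band_id INTEGER, name TEXT, version INTEGER, changed_on TEXT, changed_by TEXT
);

-- Version 1 is recorded when the band is first inserted.
CREATE TRIGGER band_created AFTER INSERT ON band
BEGIN
  INSERT INTO band_history (band_id, name, version, changed_on, changed_by)
  VALUES (NEW.band_id, NEW.name, 1, date('now'), 'app');
END;

-- Every rename appends a new version instead of losing the old name.
CREATE TRIGGER band_renamed AFTER UPDATE OF name ON band
BEGIN
  INSERT INTO band_history (band_id, name, version, changed_on, changed_by)
  VALUES (
    NEW.band_id,
    NEW.name,
    (SELECT COALESCE(MAX(version), 0) + 1
       FROM band_history WHERE band_id = NEW.band_id),
    date('now'),
    'app'
  );
END;
""")

conn.execute("INSERT INTO band VALUES (1234, 'Bjork')")
conn.execute("INSERT INTO band VALUES (5678, 'Radiohead')")
conn.execute("UPDATE band SET name = 'Jaydiohead' WHERE band_id = 5678")

current = conn.execute("SELECT name FROM band WHERE band_id = 5678").fetchone()
history = conn.execute(
    "SELECT band_id, name, version FROM band_history ORDER BY band_id, version"
).fetchall()

print(current)
print(history)
```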
But what if … the database took care of
this for you?
Band ID Name
1234 Bjork
5678 Radiohead
Some newer ones do:
● Google pioneered this with BigTable - time is a
privileged dimension, part of the model, stored
with every datum
● NoSQL Stores like HBase follow suit
● Salesforce offers “field history” out of the box,
as a meta-feature on any entity, expressed as
a history table
● Document stores can store older versions
Does this use a lot of space? Yes.
But guess what? We have a lot of space.
And, it turns out that for a lot of
computing applications, making it
immutable makes the problem way
easier to reason about.
(For more on that, see Pat Helland’s talk,
Immutability Changes Everything.)
Here’s the crazy part. If we do this
across the board, we end up with ...
data time travel.
“K-9, show me my accounts
receivable as of last May, and
compare it with today.”
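Here is a toy sketch of what "time as a privileged dimension" could mean: an append-only store where every write carries a timestamp and every read can ask for the value "as of" some moment. This is plain Python for illustration, not how BigTable or HBase are actually implemented, and the class name, key, and numbers are all hypothetical.

```python
import bisect

class VersionedStore:
    """Append-only key/value store: writes never overwrite, they add a version."""

    def __init__(self):
        # key -> list of (timestamp, value) pairs, kept sorted by timestamp
        self._versions = {}

    def put(self, key, value, timestamp):
        versions = self._versions.setdefault(key, [])
        bisect.insort(versions, (timestamp, value))

    def get(self, key, as_of=None):
        """Return the latest value, or the value in effect at `as_of`."""
        versions = self._versions.get(key, [])
        if not versions:
            return None
        if as_of is None:
            return versions[-1][1]
        timestamps = [ts for ts, _ in versions]
        # Count how many versions were written at or before `as_of`.
        count = bisect.bisect_right(timestamps, as_of)
        return versions[count - 1][1] if count else None

store = VersionedStore()
# ISO date strings sort chronologically, so they work as timestamps here.
store.put("accounts_receivable", 1200, "2013-05-01")
store.put("accounts_receivable", 1850, "2014-03-07")

last_may = store.get("accounts_receivable", as_of="2013-05-31")
today = store.get("accounts_receivable")
before_anything = store.get("accounts_receivable", as_of="2012-01-01")

print(last_may, today, before_anything)
```

The design choice worth noticing: there is no update or delete method at all, so "undo" is just a read with an earlier timestamp.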
Plain old “undo” is a subset of this. Ever
wondered why Google web products all
have “undo” and nobody else’s do?
(Full revision history is the fancy version. They have that too.)
My pronouncement:
Henceforth, all databases should
make time a privileged dimension,
and retain older versions of data in
a way that supports time travel.
Attribute Flexibility
Historically, databases have been
something of a schoolmarm.
Image: Public domain. From the 1894 Laughable Lyrics: A Fourth Book of
Nonsense Poems, Songs, Botany, Music, etc. by Edward Lear.
Some things are decided and enforced
directly at the database level:
- what attributes an entity can have
- how many attributes there are
- what type of data to store in each attribute
(string, number, date, currency, etc.)
- other constraints (max & min values, etc.)
But why is this the database's job?
It’s partially a historical accident.
Disk & memory used to be scarce and
highly optimized, so record formats had
to be prescriptive and fixed.
It’s also partially a mindset.
Centralizing decisions about structure
is very tempting; it’s easy to overestimate
your ability to “get it right” the first time,
or find the “one true model” for all.
But …
What if a database's job is really
to store whatever fields I give it?
What if I want to say, "Let me store any
additional facts I can think of about
bands, venues, shows"?
If your model imposes less, then there's
less to change when the world changes.
This is analogous to the difference between
statically typed languages (like C and Java)
and dynamically typed languages (like Ruby and Python).
Sometimes you want to trade safety for flexibility.
As an example, Thom Yorke’s
booking agent might use
address book software.
But what if they had to update
the database schema for
each new kind of social media
service he decides to use?
You can picture just adding more
attributes, like key-value pairs in JSON:
{
  "id": "ARP5KQF1187B9B4DD1",
  "name": "Explosions in the Sky",
  "genres": [{"name": "post rock"}],
  "years_active": [{"start": 1999}],
  "artist_location": {
    "location": "Austin, TX, US",
    "city": "Austin",
    "region": "Texas",
    "country": "United States"
  }
}
... and then a couple more:
{
  "id": "ARP5KQF1187B9B4DD1",
  "name": "Explosions in the Sky",
  "genres": [{"name": "post rock"}],
  "years_active": [{"start": 1999}],
  "artist_location": {
    "location": "Austin, TX, US",
    "city": "Austin",
    "region": "Texas",
    "country": "United States"
  },
  "familiarity": 0.687572,
  "favorite_color": "Blue"
}
“Tagging” is the simplest version of this.
Each tag is a boolean
(true/false), and there
can be any number of
them.
You can already model attribute flexibility
at a meta-level in a relational database...
Band ID | Name
1234 | Bjork
5678 | Radiohead
Band ID | Attribute | Value
5678 | City | Austin
5678 | State | TX
5678 | Fav. Color | Blue
5678 | Familiarity | 0.687572
After all, it’s just another degree of cardinality, right?
But writing queries against it is a
massive PITA (trust me).
And, it’s the same meta problem: if you did this
for every entity in your model, your model would
be impossible to comprehend.
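To see why those queries hurt, here is a sketch of the attribute/value layout in Python with sqlite3. The table and column names are assumptions. Just getting one band back with two of its attributes as ordinary columns takes one extra join per attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE band (band_id INTEGER PRIMARY KEY, name TEXT);
-- One row per (band, attribute) pair; every value is stored as text.
CREATE TABLE band_attribute (band_id INTEGER, attribute TEXT, value TEXT);
INSERT INTO band VALUES (1234, 'Bjork'), (5678, 'Radiohead');
INSERT INTO band_attribute VALUES
  (5678, 'City', 'Austin'),
  (5678, 'State', 'TX'),
  (5678, 'Fav. Color', 'Blue'),
  (5678, 'Familiarity', '0.687572');
""")

# Each attribute you want as a column needs its own join back to the same table.
rows = conn.execute("""
SELECT b.name, city.value, color.value
FROM band b
LEFT JOIN band_attribute city
  ON city.band_id = b.band_id AND city.attribute = 'City'
LEFT JOIN band_attribute color
  ON color.band_id = b.band_id AND color.attribute = 'Fav. Color'
ORDER BY b.band_id
""").fetchall()

print(rows)
```

The database also no longer knows that Familiarity is a number: it is just the text '0.687572', so type checking has quietly moved into your application.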
Attribute flexibility is “table stakes” for
new DBs:
● Google did it with BigTable - the columns for
a row are totally flexible at run time, and the
values are simple byte arrays
● Most other NoSQL stores offer this too
● Some services make it the backbone of
what they offer (e.g. keen.io - 1 entity, but
any set of attributes you want to send)
My second pronouncement:
Henceforth, most databases should
really think about offering attribute
flexibility, accepting writes and
reads of “columns” that haven’t
been declared in advance.
Aside: if columns aren’t predefined,
how do we know their data type?
● Strings, Numbers, Dates, etc?
● JSON has the right idea:
○ If it’s in quotes, it’s a string
○ If it’s not, it’s either:
■ a number
■ true or false
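You can watch this happen with any JSON parser. A quick sketch using Python's standard json module follows. The "active" and "label" fields are made up for the example, and "label" shows one more unquoted case the bullets skip: null.

```python
import json

raw = """
{
  "name": "Explosions in the Sky",
  "familiarity": 0.687572,
  "start": 1999,
  "active": true,
  "label": null
}
"""

doc = json.loads(raw)

# No schema anywhere: the parser picks each type from the syntax alone.
types = {key: type(value).__name__ for key, value in doc.items()}
print(types)
```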
And … while we’re at it, why not just
get rid of attributes altogether,
and say that entities store JSON blobs,
with nesting intact?
(We’ll come back to that …)
Model Agility
"Schemas usually remain relatively stable
over the lifetime of a database for most
applications."
- S Navathe, 1992
"No one will need more than 637 kB of
memory for a personal computer."
- B. Gates, 1979
It’s true, though. Database schemas do tend
to remain stable … because it sucks so bad
to change them!
● Honestly, databases have always been the
least “agile” part of software development.
● It's the final frontier of "BDUF" (big design
up front)
But, wait. Aren’t schemas stored as data?
Can’t you just change them?
ALTER TABLE Band ADD COLUMN favorite_color VARCHAR;
In theory, that works. In practice, it doesn’t.
● You can change the schema, but you can only ever
have one schema at a time
● Some changes might require downtime
● For big tables, that could mean you’re offline for hours,
days, or even weeks.
● And, by the way, if you follow my first pronouncement
about data and time travel, what happens when you
change the schema?
What to do instead?
One option is to generalize: make your model
so generic, you never have to alter a table!
This is what “architecture astronauts” do, and
it doesn’t work, because you eventually end
up with this model:
(You can sometimes find a sweet spot, but more likely you’re just
pushing the essentialism and brittleness to another layer.)
What we really need is a system where
structure changes are also non-destructive.
(We’d need that anyway if we want to time
travel with the data, right?)
There’s actually an example of this!
http://couchdb.apache.org/
Apache CouchDB is a NoSQL database.
● The database is just a flat collection of JSON documents
● There’s no schema! Put in whatever you want!
● You create views (using javascript) that “materialize”
certain access patterns across your documents.
It’s not perfect (scaling is tricky, etc.) but as an illustration
of these points, it’s spot on.
CouchDB hits all 3 of our mutability desires:
● All documents are versioned (at least until old revisions are compacted away).
○ Non-destructive Mutation!
● Document JSON can have any structure
○ Attribute Flexibility!
● Schema-On-Read using views
○ Model Agility!
It also happens to be wicked easy to
get started with, so let’s do one more
hands-on exercise.
Hands on!
Let’s load some music data into CouchDB!
You should have already installed CouchDB, but
if not, do so now.
Hands on!
Earlier, we used the Do512 API. This time we’ll
use The Echo Nest, a huge music data
repository.
(Which was, incidentally, just bought by Spotify yesterday, March 6th, 2014! Too
bad we didn’t get on this listen.up thing a little sooner amiright?)
They require setting up an API key for access, so
as a shortcut I’ve done that part for you.
Hands on!
The file mutation/echonest-artists-austin-v1.json
is data on the top 100 “most familiar” artists with
a hometown of Austin, TX, from this API call:
http://developer.echonest.com/api/v4/artist/search?api_key={MY_API_KEY}&format=json&results=100&start=0&bucket=familiarity&bucket=genre&bucket=artist_location&bucket=years_active&artist_location=austin
Full disclosure, I modified the result slightly, so it’d work immediately with CouchDB bulk load:
● removed the outer "response" wrapper from the API
● changed the name of the array from "artists" to "docs"
● changed all the "id" fields to "_id" so CouchDB would use them
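If you want to redo those three tweaks yourself on a fresh API result, something like this Python sketch would do it. The function name is mine, and the sample input is a cut-down guess at the response shape based on the bullets above (a "response" wrapper holding an "artists" array). I am only renaming the top-level "id" of each artist.

```python
import json

def to_bulk_docs(api_result):
    """Reshape an Echo Nest artist search result for CouchDB's _bulk_docs."""
    artists = api_result["response"]["artists"]  # drop the outer "response" wrapper
    docs = []
    for artist in artists:
        doc = dict(artist)           # copy, so the input is left untouched
        doc["_id"] = doc.pop("id")   # CouchDB uses "_id" as the document key
        docs.append(doc)
    return {"docs": docs}            # the array is now called "docs"

# Hypothetical, heavily trimmed API result.
sample = {
    "response": {
        "artists": [
            {"id": "ARP5KQF1187B9B4DD1", "name": "Explosions in the Sky"}
        ]
    }
}

bulk = to_bulk_docs(sample)
print(json.dumps(bulk, indent=2))
```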
Hands on!
Ensure CouchDB is installed and running:
http://127.0.0.1:5984/_utils/index.html
Create a new
database called
“listenup”
Hands on!
Upload a bunch of data:
$ cd ~/unstructure-sxsw14/mutation/
$ curl -H "Content-Type:application/json" -d @echonest-arists-austin_v1.json -X POST http://127.0.0.1:5984/listenup/_bulk_docs
Reload the database web page:
http://127.0.0.1:5984/_utils/database.html?listenup
Hands on!
See what happens when you change a value.
Let’s pick a band at random:
http://127.0.0.1:5984/_utils/document.html?listenup/ARJXDMJ11FF10D84F9
Double click, and change to:
[{"name": "garage soul"}]
Then click “Save Document”.
Hands on!
Now there are two versions of this document:
Non-destructive mutability, in the flesh!
Hands on!
Now add a field that never existed before:
http://127.0.0.1:5984/_utils/document.html?listenup/ARJXDMJ11FF10D84F9
Click “Add Field”.
Add “Influences” as the
Field name, and click
“Save Document”.
Double click the “null” next to “Influences”,
and change the value to
[{"name": "Michael McDonald"}]
Click “Save Document”.
Hands on!
Attribute flexibility!
Notice that so far,
we haven’t once
had to specify a
schema.
Hands on!
Now we query the data by writing code to
implement materialized views. This is actually a
little complicated ...
Hands on!
(Yes, I know this is actually about Riak, not CouchDB. Why do you hate laughing?)
Comic by John Muellerleile / http://thinkdifferent.ly/fault-tolerance.png
Hands on!
As a simple example, create a new document:
{
  "_id": "_design/application",
  "views": {
    "genre-view": {
      "map": "function(artist) {if(artist.name && artist.genres) {artist.genres.forEach(function(genre) {emit(genre, artist.name);});}}"
    }
  }
}
Then visit: http://127.0.0.1:5984/listenup/_design/application/_view/genre-view
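If the JavaScript map function looks cryptic, here is a rough Python analogue of what the view does: run the function over every document, collect whatever it emits, and keep the rows sorted by key. This is a sketch for intuition only, not CouchDB's implementation. Sorting by the genre name is my stand-in for CouchDB's own key ordering, and "Example Band" is a made-up document.

```python
def genre_view_map(artist):
    """Python analogue of the JavaScript map function: emit (genre, name) pairs."""
    if artist.get("name") and artist.get("genres"):
        for genre in artist["genres"]:
            yield (genre, artist["name"])

def run_view(docs, map_fn):
    """Apply map_fn to every document and return the emitted rows, sorted by key."""
    rows = [row for doc in docs for row in map_fn(doc)]
    # Stand-in ordering: sort by the genre's name field.
    return sorted(rows, key=lambda row: row[0]["name"])

docs = [
    {"_id": "ARP5KQF1187B9B4DD1", "name": "Explosions in the Sky",
     "genres": [{"name": "post rock"}]},
    {"_id": "example-1", "name": "Example Band",
     "genres": [{"name": "garage soul"}]},
    # No "genres" field at all: the map function just skips this document.
    {"_id": "example-2", "name": "Genreless Band"},
]

rows = run_view(docs, genre_view_map)
print(rows)
```

This is the schema-on-read idea: documents with the wrong shape are not rejected at write time, they simply do not show up in views that cannot use them.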
Hands on!
Try making your own view. Some examples are
here:
http://guide.couchdb.org/draft/cookbook.html
Questions?
Chapter 5: Conclusion & Future Directions
So where have we been?
Hierarchy
Relation
Mutation
Before we go off to our subsequent
debauchery, if you’ll indulge me, we
actually have a tiny bit more
smashing to do.
Now it’s time to go through the looking glass.
Image from Disney Wikia: http://disney.wikia.com/wiki/Alice
(Some of these thoughts are adapted from my previous
presentation, I’ve Always Wanted To Data Model)
This is a technical
book from the 1970s.
It’s a philosophy book.
It opened my eyes to
some of the real
underlying questions.
Is data “true”?
Of course not, not categorically.
Official Radiohead Band Members
Member | Year Joined
Thom Yorke | 1985
Jonny Greenwood | 1985
Ed O'Brien | 1985
Colin Greenwood | 1985
Philip Selway | 1985
Ian Varley | 2014
Is data “real”?
Depends on what you mean.
Robin Hood’s Band Of Merry Men
Merry Man | Merry-ness
Robin Hood | High
Little John | Medium
Much The Miller’s Son | Medium
Friar Tuck | High
Arthur a Bland | Low
Maid Marian | Unknown
My personal theory:
Data is “existential claims”.
The fact that this data, and not some other, is stored
implies that someone (or something) is making a claim
about the existential state of something. This claim
may or may not correspond to the actual existential
state of that something.
As cool as the boundary-smashing tools we
looked at today are, it’s also wise to
remember that we don’t really know much
of what’s going on. For example ...
Nobody actually knows
what an “entity” really is.
“Entity” is another word for Category,
in linguistics terms.
And an important property of linguistic
categories is that they are slippery.
See:
● Steven Pinker: The Stuff Of Thought
● Douglas Hofstadter: Surfaces & Essences
● George Lakoff: Women, Fire, and Dangerous Things
part: an abstract definition of a
connected set of physical
materials that serve some
purpose
part: one instance of a
manufactured item, which
either does or does not meet
quality standards
Images: (1) Atwood Hydraulic Surge Brake Actuator, http://www.pacifictrailers.com/Atwood-Hydraulic-Brake-Actuator-Parts-List-and-Schematic/; (2) Ford Motor Company flywheel magneto assembly line 1913, source unknown
And if you think you can “solve” the
problem, I’ve got some World Trade
Center insurance policies to sell you.
That said, there are a couple techniques
we could adopt that would help:
● First-class Sub- / Super-Typing
● First-class Scoping and Aliasing
(Not that there aren’t ways to do this in relational models, but
they’re unobvious and not widely used.)
Discrete models encourage
black & white thinking in a
gray world
Sometimes the deep structure is
actually ambiguous.
Image credit: By Chire (Own work) [Public domain or Public domain], via Wikimedia Commons
Our current tools are s#!t.
Boxes & lines aren’t how we actually
think; our spatial processing of
diagrams doesn’t map well to our
temporal, spatial, and causal
comprehension of data structure.
F*** THAT NOISE.
If we had the right tools, what would
they look like?
They’d have native support for ...
● My mutability requirements
○ Non-destructive mutation
○ Attribute Flexibility
○ Model Agility
● The 3 Ps:
○ Provenance, Provability, Probability
They’d have native support for ...
● Supertyping
● Extensible Meta-Metadata
● Semantic Zoom
● Prototype Generation
● Model Versioning and Diffing
Interesting direction: RAML
● Textual (YAML-based)
● Strong, simple syntax
● Generates useful models
● Communication focussed
Could there be something
like this for data models?
Who knows ...
Thanks!
@thefutureian
ianvarley.com
(If you enjoyed this, please rate it a 5 … sxsw.com/rate)

More Related Content

What's hot

Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDatabricks
 
Graphql Intro (Tutorial and Example)
Graphql Intro (Tutorial and Example)Graphql Intro (Tutorial and Example)
Graphql Intro (Tutorial and Example)Rafael Wilber Kerr
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Trees In The Database - Advanced data structures
Trees In The Database - Advanced data structuresTrees In The Database - Advanced data structures
Trees In The Database - Advanced data structuresLorenzo Alberton
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotXiang Fu
 
Using Queryable State for Fun and Profit
Using Queryable State for Fun and ProfitUsing Queryable State for Fun and Profit
Using Queryable State for Fun and ProfitFlink Forward
 
GraphQL Introduction
GraphQL IntroductionGraphQL Introduction
GraphQL IntroductionSerge Huber
 
[Defcon] Hardware backdooring is practical
[Defcon] Hardware backdooring is practical[Defcon] Hardware backdooring is practical
[Defcon] Hardware backdooring is practicalMoabi.com
 
Dynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisationDynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisationOri Reshef
 
Informatica Powercenter Architecture
Informatica Powercenter ArchitectureInformatica Powercenter Architecture
Informatica Powercenter ArchitectureBigClasses Com
 
Strongly Consistent Global Indexes for Apache Phoenix
Strongly Consistent Global Indexes for Apache PhoenixStrongly Consistent Global Indexes for Apache Phoenix
Strongly Consistent Global Indexes for Apache PhoenixYugabyteDB
 
Introduction to GraphQL
Introduction to GraphQLIntroduction to GraphQL
Introduction to GraphQLRodrigo Prates
 
HBase in Practice
HBase in PracticeHBase in Practice
HBase in Practicelarsgeorge
 
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...StreamNative
 
Netflix - Enabling a Culture of Analytics
Netflix - Enabling a Culture of AnalyticsNetflix - Enabling a Culture of Analytics
Netflix - Enabling a Culture of AnalyticsBlake Irvine
 

What's hot (20)

Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 
Graphql Intro (Tutorial and Example)
Graphql Intro (Tutorial and Example)Graphql Intro (Tutorial and Example)
Graphql Intro (Tutorial and Example)
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Modern Data Pipelines
Modern Data PipelinesModern Data Pipelines
Modern Data Pipelines
 
Trees In The Database - Advanced data structures
Trees In The Database - Advanced data structuresTrees In The Database - Advanced data structures
Trees In The Database - Advanced data structures
 
Rds data lake @ Robinhood
Rds data lake @ Robinhood Rds data lake @ Robinhood
Rds data lake @ Robinhood
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
 
Using Queryable State for Fun and Profit
Using Queryable State for Fun and ProfitUsing Queryable State for Fun and Profit
Using Queryable State for Fun and Profit
 
GraphQL Introduction
GraphQL IntroductionGraphQL Introduction
GraphQL Introduction
 
[Defcon] Hardware backdooring is practical
[Defcon] Hardware backdooring is practical[Defcon] Hardware backdooring is practical
[Defcon] Hardware backdooring is practical
 
Dynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisationDynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisation
 
Informatica Powercenter Architecture
Informatica Powercenter ArchitectureInformatica Powercenter Architecture
Informatica Powercenter Architecture
 
Druid
DruidDruid
Druid
 
Deep Dive - DynamoDB
Deep Dive - DynamoDBDeep Dive - DynamoDB
Deep Dive - DynamoDB
 
Strongly Consistent Global Indexes for Apache Phoenix
Strongly Consistent Global Indexes for Apache PhoenixStrongly Consistent Global Indexes for Apache Phoenix
Strongly Consistent Global Indexes for Apache Phoenix
 
Introduction to GraphQL
Introduction to GraphQLIntroduction to GraphQL
Introduction to GraphQL
 
HBase in Practice
HBase in PracticeHBase in Practice
HBase in Practice
 
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
 
Netflix - Enabling a Culture of Analytics
Netflix - Enabling a Culture of AnalyticsNetflix - Enabling a Culture of Analytics
Netflix - Enabling a Culture of Analytics
 

Similar to Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Extending the Mind with Cognitive Prosthetics?
Extending the Mind with Cognitive Prosthetics? Extending the Mind with Cognitive Prosthetics?
Extending the Mind with Cognitive Prosthetics? PhiloWeb
 
Normal Considered Harmful
Normal Considered HarmfulNormal Considered Harmful
Normal Considered Harmfulgreenwop
 
BL Labs 2014 Symposium: The Mechanical Curator
BL Labs 2014 Symposium: The Mechanical CuratorBL Labs 2014 Symposium: The Mechanical Curator
BL Labs 2014 Symposium: The Mechanical Curatorbenosteen
 
Core Methods In Educational Data Mining
Core Methods In Educational Data MiningCore Methods In Educational Data Mining
Core Methods In Educational Data Miningebelani
 
Talk0 representations
Talk0 representationsTalk0 representations
Talk0 representationsSanket Patil
 
LASTconf 2018 - System Mapping: Discover, Communicate and Explore the Real Co...
LASTconf 2018 - System Mapping: Discover, Communicate and Explore the Real Co...LASTconf 2018 - System Mapping: Discover, Communicate and Explore the Real Co...
LASTconf 2018 - System Mapping: Discover, Communicate and Explore the Real Co...Colin Panisset
 
Artificial intelligence
Artificial intelligenceArtificial intelligence
Artificial intelligenceUmesh Meher
 
Why I became Interested in Mind Mapping
Why I became Interested in Mind MappingWhy I became Interested in Mind Mapping
Why I became Interested in Mind MappingBill Miller
 
Deep Learning from Scratch - Building with Python from First Principles.pdf
Deep Learning from Scratch - Building with Python from First Principles.pdfDeep Learning from Scratch - Building with Python from First Principles.pdf
Deep Learning from Scratch - Building with Python from First Principles.pdfYungSang1
 
The Kipling-Zachman lens
The Kipling-Zachman lensThe Kipling-Zachman lens
The Kipling-Zachman lensRichard Veryard
 
Mit6870 orsu lecture12
Mit6870 orsu lecture12Mit6870 orsu lecture12
Mit6870 orsu lecture12zukun
 
Eduserv Digital Identities Workshop
Eduserv Digital Identities WorkshopEduserv Digital Identities Workshop
Eduserv Digital Identities WorkshopYishay Mor
 
Digital Rhetoric, September 9th, 2014
Digital Rhetoric, September 9th, 2014Digital Rhetoric, September 9th, 2014
Digital Rhetoric, September 9th, 2014Miami University
 
Presenting: structure story and support
Presenting: structure story and supportPresenting: structure story and support
Presenting: structure story and supportFelienne Hermans
 
Who owns our work? (notes)
Who owns our work? (notes)Who owns our work? (notes)
Who owns our work? (notes)Dorothea Salo
 

Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

  • 1. Unstructure :: smashing the boundaries of data :: 2014-03-07 - SxSWi Workshop Ian Varley - @thefutureian
  • 2. Chapter 0: Intro & Logistics
  • 3. Hi. I’m Ian Varley. I live in Austin, TX. I work for Salesforce.com, doing data engineering. (Note: this presentation is entirely my own work and opinions, and doesn’t imply anything about Salesforce’s products.) → @thefutureian, ianvarley.com
  • 4. About me: - BA in Philosophy - MS in Software Engineering - 15+ years database experience Not really an authority on data structure, but "You teach what you want to learn".
  • 5. Why are we here? To grok the structure of data, and then smash it.
  • 6. Some logistics: - 150 minutes, no breaks. (Feel free to get up, use the facilities, or leave if you're bored.) - This will be dense. (Lots to cover, so we'll move fast.) - But! Do interrupt at any time with questions. (If you’re lost, you’re not the only one.)
  • 7. All materials are available: - This presentation: http://tiny.cc/unstructure-sxsw14-slides - Live notes: http://tiny.cc/unstructure-sxsw14-notes - Code & samples: https://github.com/ivarley/unstructure-sxsw14
  • 8. There’s some setup required, but we won’t need it right away. Probably best if everyone starts trying to download & install stuff now, and do it in the background as I’m talking.
  • 9. Setup, part 1: git - Download & Install Git http://git-scm.com/book/en/Getting-Started-Installing-Git - Clone my repo: $ cd ~ $ git clone https://github.com/ivarley/unstructure-sxsw14.git
  • 10. Setup, part 2: heroku - Download & Install Heroku Toolbelt https://toolbelt.heroku.com/ - Create Heroku Account https://id.heroku.com/signup
  • 11. Setup, part 3: CouchDB - Download & Install CouchDB http://couchdb.apache.org/
  • 12. Setup, part 4: miscellaneous - Google Chrome https://www.google.com/intl/en/chrome/browser/ - JSONView Plugin http://goo.gl/K07fFs
  • 13. Quick Survey: - occupation: coders? designers? other? - technical skill: low / medium / high - know what a relational database is? - know SQL? - know what NoSQL means? - have used a NoSQL database? - have read Aristotle? :)
  • 14. There’s a spectrum in an audience like this ... The Hacker: Wants hands on, doesn’t care about theory The Academic: Wants heady concepts, not comfortable with code. The “Hackademic”: Wants enough theory to be grounded, and enough hacking to know when something is bullshit. What I’m aiming for:
  • 15. Where we’re headed: Chapter 1 - Warming Up Chapter 2 - Hierarchy Chapter 3 - Relation Chapter 4 - Mutation Chapter 5 - Conclusion
  • 16. Any questions before we get started?
  • 18. Talking about data can be a little … dry. So, we’re going to use an example that most people can relate to easily.
  • 19. Music. Photo credit: Josh Haner/The New York Times
  • 20. ● It's a domain we all know about. ○ But, informally (not usually for work or study) ● Lots of meaty concepts to think about: ○ Recording, Performances, Compositions, Bands, Venues, etc ... ● There are lots of music sites with data APIs: ○ Do512, EchoNest, Songkick, Sched.org, MusicBrainz, 7Digital, etc. ● It's on everyone's mind during SxSW. ● Also, I'm a musician and I felt like it. Why music?
  • 21. Running Example: ● listen.up - Internet company for “all things music”. ● Duh, this is fake, there’s no “.up” TLD ● What do we do? Everything! ● Including: ● Recorded music catalogs, streaming, purchase... ● Live music performance, booking & tickets... ● Licensing, royalties, compositions, lyrics … ● Instruments, lessons, repairs, classifieds … ● Anything else you can think of.
  • 22. Running Example: ● listen.up - Internet company for “all things music”. ● Duh, this is fake, there’s no “.up” TLD ● What do we do? Everything! ● Including: ● Recorded music catalogs, streaming, purchase... ● Live music performance, booking & tickets... ● Licensing, royalties, compositions, lyrics … ● Instruments, lessons, repairs, classifieds … ● Anything else you can think of. Is this a good business model? No, but who cares!
  • 23. You are my team of professional ontologists, you’re going to figure out what music data is out there in the world for us to store. (I’ll pay you in stock. It’ll be worth a fortune, trust me.)
  • 24. Normally, this is where we might do a big group brainstorm. But it turns out those don’t work. In fact, they hurt more than they help. (https://en.wikipedia.org/wiki/Brainstorming#Challenges_to_Effective_Brainstorming)
  • 25. So, we’ll do 3 steps: 1. Take 60 seconds and type as many music- related concepts as you can. - concepts, not proper nouns (i.e. "band", "composer", "instrument"; not "Radiohead", "Beethoven", "guitar", etc.) - make them singular ("band", not "bands") - not sure if it’s music-related? put it anyway. 2. Dump them into a shared google doc. 3. I’ll lowercase, dedupe, and publish.
  • 26. Go! I’ll add a few prompts in case you are getting stuck ...
  • 27. So, we’ll do 3 steps: 1. Take 60 seconds and type as many music- related concepts as you can. - concepts, not proper nouns (i.e. not "Radiohead", but "band"; not "Beethoven" but "composer", not "guitar" but "instrument", etc.) - make them singular ("band", not "bands") - not sure if it’s music-related? put it anyway. 2. Dump them into this google doc: http://tiny.cc/unstructure-sxsw14-terms 3. I’ll lowercase, dedupe, and publish here.
  • 28. Now, your job is to make some sense of this. 1. Break into groups of ~3 neighbors. a. Introduce yourselves like civilized human beings. 2. Organize this list however you want to! a. Group things together b. Indent things c. Draw lines in a drawing program d. etc.
  • 29. Go! We’ll take about 10 minutes for this.
  • 30. Let’s discuss a few sample things people did. (Goal is to "sample", not for everyone to share! We don’t have all day.) ● How did you sort or group the terms? ● Did you end up with a flat list, or hierarchy? ● Did anything not fit in? ● Any higher level organization of terms? (Note: there’s no right answer here ... yet.)
  • 31. Good job. listen.up is off to a great start. Now it’s time to get into the meat. What is structure? What is data?
  • 32. This is a story in three parts: Hierarchy Relation Mutation
  • 33. These are 3 successive viewpoints that will gradually open our eyes to the deep structure of data.
  • 34. “Structure? Isn’t this workshop supposed to be about unstructured data?”
  • 35. Poppycock. You want to see some real unstructured data?
  • 36.
  • 37. What most people mean when they say unstructured data is: “flexibly structured data” Or possibly: “data we don’t know the structure of yet” (We’ll get to both of those; hold your horses.)
  • 38. There are lots of boundaries to smash in the world of data. But we have to learn to structure before we can unstructure. Image credit: Rodrigo Diaz Aravena
  • 39. Our minds are very fluid, and can connect concepts in subtle ways. Our computers aren't. They need concrete instructions to structure & connect data.
  • 40. The next three chapters will be a deep dive into how concepts are combined to let us represent the world in computers. This is usually called “modeling”. Modeling gets a bad rap.
  • 41. Models cut away the accidental and leave the essential. Model != diagram, drawing Model == Skeleton, Essence, Abstraction
  • 42. Most of the time, it’s good to remember: “the map is not the territory” (Alfred Korzybski). But for the next couple hours, for us, the map is precisely the territory. We are data cartographers.
  • 43. So let’s dive in! photo credit: Michael Mundy
  • 45. We all have a pretty good intuitive understanding of “data”. What’s yours?
  • 46. data = plural of datum datum = Latin for “given” so ... data is “givens”? aka “facts”?
  • 47. What’s the simplest fact? A bit: 1 / 0 on / off yes / no true / false “Are the lights on in this room?”
  • 48. When you want more complex facts, you chunk together simpler ones. In one dimension, that's a "list": ● Byte = list of bits → 01000010 = 66 = “B” ● Word = list of characters → [B,e,a,t,l,e,s] ● Phrase = list of words → “The Beatles are a band” ● And so forth ...
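The chunking on this slide can be sketched in a few lines of Python. This is only an illustration of the idea (bits to byte to character, characters to word, words to phrase); the variable names are made up for the example.

```python
# A byte is a list of bits; read as a binary number it is 66, which is "B" in ASCII.
bits = "01000010"
value = int(bits, 2)      # interpret the bit string as a base-2 number
letter = chr(value)       # map the number to its character

# A word is a list of characters; a phrase is a list of words.
word = list("Beatles")
phrase = "The Beatles are a band".split()

print(value, letter)
print(word)
print(phrase)
```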
  • 49. Of course, a 1-dimensional list is just one (very simplistic) way to chunk things together.
  • 51. So you just chunk datums together, and you get information, knowledge, wisdom … !? Photo By Karora (Own work) [Public domain], via Wikimedia Commons
  • 52. No. It’s not quite that easy. f∆ƒ˙73f˚=£ƒ••XMbritneysp3ars-giraffe is a complex structure, but it lacks something: “meaning”
  • 53. Getting a little more haughty, we could say that the “givens” of structured data are really pointers to concepts. Without at least some concept, it’s not data: it’s noise.
  • 54. But that raises all kinds of questions. What are concepts? What can we say about them? What kinds of concepts are there? What’s the difference between a concept and the thing it points at? Who could answer such questions?
  • 55. When the epistemological going gets tough, the tough call ... Aristotle, 384 –322 BC
  • 56. Aristotle wrote a treatise called The Praedicamenta (The Categories) It’s not clear exactly what he was categorizing (he didn’t say), but the list stands to this day as a pretty damn sensible way to, well, categorize.
  • 57. Aristotle’s 10 "categories": substance Stuff, essence; matter, but also universal concepts quantity How much? How many? quality What kind? Of what nature? relation More, less, double, half, stronger, weaker, etc. place Where? time When? position Being situated on, in, next to, sitting, touching, etc. having Possession, state like “clothed” or “armed” causing What did it do, make happen? being caused What happened to it, what did it undergo?
  • 58. You don’t have to agree with Aristotle’s categories (you’d be in good company). (But you’re also unlikely to have a sudden inspiration about it that hasn’t already been the subject of 12 papers and a dissertation.)
  • 59. But having some version of Aristotle’s list is hard to argue with, experientially. We sort the world into a hierarchy of concepts; everything in its right place.
  • 60. Without concepts, we don’t have data. We just have noise.
  • 61. Of course, you can’t just store a concept on a disk, or send it over a network. So, not long after we had computing machines, folks set to work figuring out how to map and store our concepts in the unforgiving realm of silicon.
  • 62. Let’s take a trip back in time. Photo By NASA Ames Research Center (NASA-ARC) (NIX A- 28284) [Public domain], via Wikimedia Commons
  • 63. The year is 1966. NASA is working on Saturn V and Apollo rockets, and they can’t figure out how to store this ginormous bill of materials. They ask: Could these new “computers” help?
  • 64. IBM: “Hey NASA! We made a system for you to manage information!” NASA: “Groovy! What’s it called?” IBM: … ... “Information Management System.”
  • 65. Thus began the not-extremely-exciting era of hierarchical databases. ● Data is stored in records, which can have sub-records. ● There's a single strict hierarchical arrangement. ● To access data, you need to know the hierarchy.
  • 66. For example: Show / Band / Time: (House Of Vans, Charli XCX, 4:15), (House Of Vans, Pusha T, 5:00). Show / Venue / Date: (House Of Vans, The Mohawk, 3/13/14), (Chaos in Tejas, Iron And Lace, 3/14/14). Band / Song / Order: (Charli XCX, You, 1), (Charli XCX, Super Love, 2). To get to the set list, you have to navigate through the show, to the band, to the song. And, you only get to choose one hierarchy to store things in.
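To get a feel for what "you need to know the hierarchy" means, here is a rough Python sketch that uses nested dictionaries as a stand-in for hierarchical records. It is not how IMS works internally; the `shows` structure and the two helper functions are assumptions made up for illustration.

```python
# Records with sub-records: show -> bands -> songs (one fixed hierarchy).
shows = {
    "House Of Vans": {
        "venue": "The Mohawk",
        "date": "3/13/14",
        "bands": {
            "Charli XCX": {"time": "4:15", "songs": ["You", "Super Love"]},
            "Pusha T": {"time": "5:00", "songs": []},  # set list not given on the slide
        },
    },
    "Chaos in Tejas": {"venue": "Iron And Lace", "date": "3/14/14", "bands": {}},
}


def set_list(show, band):
    """Following the stored hierarchy is a direct walk: show, then band, then songs."""
    return shows[show]["bands"][band]["songs"]


def shows_featuring(band):
    """Asking 'against the grain' means scanning every show record."""
    return [name for name, show in shows.items() if band in show["bands"]]


print(set_list("House Of Vans", "Charli XCX"))
print(shows_featuring("Pusha T"))
```

The first question is cheap because it matches the stored nesting; the second has to visit every top-level record, which is the cost of picking a single hierarchy.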
  • 67. That sounds arbitrarily restrictive. Why did they make it like that?
  • 68. To understand, we have to talk about dimensions.
  • 69. We already talked about zero dimensions: And about one dimension: point = bit = on/off = true/false line = list = array
  • 70. It’s pretty straightforward to see how you’d represent these, with a series of zeros and ones. What about 2 dimensions?
  • 71. It’s a plane! (aka table, grid, matrix, spreadsheet, etc)
  • 72. Tabular data is everywhere.
  • 73.
  • 74. You're certainly familiar with the world's most ubiquitous 2- dimensional data tool ...
  • 75. X dimension Y dimension Excel! Photo credit: http://decentralist.wordpress.com/2012/10/01/libreoffice-vs-openoffice-not-always-simple/
  • 76. (Or, its successor, Google Spreadsheets, which is way better, IMHO.) ((Especially in combination with Google Forms, which make it extremely useful.))
  • 77. Spreadsheets are totally flexible. This is a blessing and a curse.
  • 78. They can be used well ...
  • 79. Or poorly ... (Fun read: http://www.epmchannel.com/2013/02/22/is-excel-the-most-dangerous-piece-of-software-in-the-world/)
  • 80. Or awesomely ... Image credit: http://gadgetose.com/excel-stop-motion-music-video/
  • 81. But you get the point. 2-dimensional data is everywhere.
  • 82. 2-dimensional data doesn’t always look like a table or grid. We just mean that it’s “conceptually planar”: two axes, each w/ a set coordinate system. (Here, rows = “web results”, and columns = “link name”, “url”, “description”, and “image”. No link has two URLs, for example.)
  • 83. However! We are now faced with a choice, because we can still only actually store things in linear form (a single stream of bits). So do we put rows inside columns, or columns inside rows?
  • 84. To linearize the two dimensions in a table, I can either ...
  • 85. <table> <row> <col>Creep</col> <col>1993</col> <col>Pablo Honey</col> </row> <row> <col>No Surprises</col> <col>1997</col> <col>OK Computer</col> </row> <row> <col>Lucky</col> <col>1997</col> <col>OK Computer</col> </row> <row> <col>Karma Police</col> ... </row> </table> Go row-wise ...
  • 86. <table> <col> <row>Creep</row> <row>No Surprises</row> <row>Lucky</row> <row>Karma Police</row> <row>Fake Plastic Trees</row> </col> <col> <row>1993</row> <row>1997</row> <row>1997</row> <row>1997</row> <row>1995</row> </col> <col> <row>Pablo Honey</row> ... </col> </table> Or column wise ...
  • 87. <table> <row> <col>Creep</col> <col>1993</col> <col>Pablo Honey</col> </row> <row> <col>No Surprises</col> <col>1997</col> <col>OK Computer</col> </row> <row> <col>Lucky</col> <col>1997</col> <col>OK Computer</col> </row> <row> <col>Karma Police</col> ... </row> </table> But I can’t have it both ways. <table> <col> <row>Creep</row> <row>No Surprises</row> <row>Lucky</row> <row>Karma Police</row> <row>Fake Plastic Trees</row> </col> <col> <row>1993</row> <row>1997</row> <row>1997</row> <row>1997</row> <row>1995</row> </col> <col> <row>Pablo Honey</row> ... </col> </table> ? (Unless I store it twice.)
  • 88. Because we read left-to-right and top-to- bottom, most of our systems store tables that way too (row-wise). But it’s not mandatory, of course.
  • 89. Whichever way you choose, you can’t get around the fact that you have to choose an ordering of dimensions.
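Here is the same choice as a small Python sketch: one table, linearized two ways. The data comes from the slides; the variable names are just for illustration.

```python
# The table as a list of rows: (song, year, album).
rows = [
    ("Creep", 1993, "Pablo Honey"),
    ("No Surprises", 1997, "OK Computer"),
    ("Lucky", 1997, "OK Computer"),
]

# Row-wise: all cells of row 0, then all cells of row 1, and so on.
row_wise = [cell for row in rows for cell in row]

# Column-wise: transpose first, then emit all songs, then all years, then all albums.
columns = list(zip(*rows))
column_wise = [cell for column in columns for cell in column]

print(row_wise)
print(column_wise)
```

Both lists hold the same nine cells; only the order differs, and that order decides which kind of read (one whole row, or one whole column) touches neighbouring values.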
  • 90. That makes sense for 2 dimensions. But what about … 3+ dimensions?
  • 92. Computer science has had the idea of multi-dimensional arrays since the beginning. It's a straightforward extension to a table, conceptually. It's very hard to visualize more than 3 unless you’re on dope*. * - This is a verrrry funny joke because multidimensional arrays use locators called dope vectors. Ha ha ha hmm.
  • 93. But you’re still making it linear when you store it.
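As a sketch of what "making it linear" looks like for more than two dimensions, the function below computes where a cell lands in a flat list when the last dimension varies fastest (row-major order). The function name and the example shape are assumptions for illustration, not anything from the workshop repo.

```python
def flat_index(coords, shape):
    """Map N-dimensional coordinates to a position in a flat, row-major list."""
    index = 0
    for coord, size in zip(coords, shape):
        if not 0 <= coord < size:
            raise IndexError(f"coordinate {coord} out of range for size {size}")
        # Shift everything seen so far over by one full "row" of this dimension.
        index = index * size + coord
    return index


shape = (2, 3, 4)  # e.g. 2 venues x 3 shows x 4 bands
print(flat_index((0, 0, 0), shape))
print(flat_index((1, 0, 0), shape))
print(flat_index((1, 2, 3), shape))
```

Changing the order of `shape` (and of the coordinates) gives a different linear layout for exactly the same cells.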
  • 94. Imagine storing the venues, shows, bands, and songs in one file. That’s 4 dimensions. (Each venue has many shows; each show has many bands; each band has many songs; etc.)
  • 95. Venue: The Mohawk Show: Vans Day Party, 3/13/14 Band: Eagulls (12:30 p.m.) Song: Nerve Endings Song: Tough Luck Band: DJ Rashad (1:15 p.m.) Song: Holiday Song: I Can Feel It Band: Kelela (2:15 p.m.) Song: ... Band: Charli XCX (3:15 p.m.) Song: ... Band: Dum Dum Girls (4:15 p.m.) Song: ... Band: Pusha T (5:00 p.m.) Show: Venue: ... Show: ... Band: ... Song: ... etc. Song: Nerve Endings Band: Eagulls Show: Vans Day Party, 3/13/14 Venue: The Mohawk Song: Nerve Endings Band: Eagulls Show: Vans Day Party, 3/13/14 Venue: The Mohawk Show: Official Showcase, 3/14/14 Venue: The Mohawk Song: Holiday Band: DJ Rashad Show: Vans Day Party, 3/13/14 Venue: The Mohawk Band: Cattle Decapitation Show: Chaos In Tejas, 3/10/14 Venue: Iron And Lace Song: ... Band ... Show: ... etc. By venue → show → band → song By song → band → show → venue
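The two orientations on this slide hold the same facts, just nested in a different order. A rough Python sketch, assuming nested dictionaries as the storage and two made-up helpers (`flatten` and `nest`), shows how one can be rebuilt into the other:

```python
# Stored by venue -> show -> band -> songs (a subset of the slide's data).
by_venue = {
    "The Mohawk": {
        "Vans Day Party, 3/13/14": {
            "Eagulls": ["Nerve Endings", "Tough Luck"],
            "DJ Rashad": ["Holiday", "I Can Feel It"],
        }
    }
}


def flatten(tree):
    """Yield one (venue, show, band, song) record per song."""
    for venue, shows in tree.items():
        for show, bands in shows.items():
            for band, songs in bands.items():
                for song in songs:
                    yield venue, show, band, song


def nest(records, order):
    """Rebuild a nested dict, using the given order of record positions."""
    root = {}
    for record in records:
        node = root
        for position in order:
            node = node.setdefault(record[position], {})
    return root


# Re-nest as song -> band -> show -> venue.
by_song = nest(flatten(by_venue), order=(3, 2, 1, 0))
print(by_song["Holiday"])
```

Getting from one orientation to the other means touching every record, which is why the primary orientation is a real design decision and not a cosmetic one.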
  • 96. No matter how many dimensions, storing the data still requires that you pick a single primary orientation. So Big Blue’s design choice makes a little more sense now, right?
  • 97. BTW, If you've used an ATM recently… you're an IMS user. (It’s not as obsolete as it sounds.)
  • 98. How do you actually store the linearized data? You put it in a format.
  • 99. Creep•••••••••••••1993Pablo•Honey No Surprises••••••1997OK•Computer Lucky•••••••••••••1997OK•Computer Karma Police••••••1997OK•Computer Fake Plastic Trees1995The•Bends•• Fixed-width files were all the rage in the 1960s. Row delimiter is a line break; column delimiter is a pre-set agreement about how many characters are in each column. This is wasteful, brittle, and hard to read.
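A minimal Python sketch of reading and writing this layout, assuming column widths of 18, 4 and 11 characters (which is what the sample rows appear to use) and plain spaces as padding. The helper names are made up, and the last row is an extra example added here to show the brittleness.

```python
WIDTHS = (18, 4, 11)  # song, year, album: the "pre-set agreement"


def format_fixed(values, widths=WIDTHS):
    """Pad each value out to its column width and glue the columns together."""
    return "".join(str(value).ljust(width) for value, width in zip(values, widths))


def parse_fixed(line, widths=WIDTHS):
    """Slice a line back into fields using the same agreed widths."""
    fields, start = [], 0
    for width in widths:
        fields.append(line[start:start + width].rstrip())
        start += width
    return fields


line = format_fixed(["Creep", 1993, "Pablo Honey"])
print(repr(line))
print(parse_fixed(line))

# Brittle: a title longer than 18 characters silently shifts every later column.
broken = format_fixed(["Everything In Its Right Place", 2000, "Kid A"])
print(parse_fixed(broken))
```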
  • 100. "Creep","1993","Pablo Honey" "No Surprises","1997","OK Computer" "Lucky","1997","OK Computer" "Karma Police","1997","OK Computer" "Fake Plastic Trees","1995","The Bends" Delimited files (e.g. CSV, comma separated values): Row delimiter is still a line break; column delimiter is variable (a comma, in this case). Optionally, also “qualifiers” (quotes, here). This is a decent format (but, Microsoft’s version really screwed things up for everyone.)
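Python's standard `csv` module handles the delimiter and the qualifiers for you. A small sketch, using the rows from the slide plus one made-up title containing a comma to show why qualifiers exist:

```python
import csv
import io

text = (
    '"Creep","1993","Pablo Honey"\n'
    '"No Surprises","1997","OK Computer"\n'
    '"Lucky","1997","OK Computer"\n'
)

# Reading: the quotes are stripped, every field comes back as a string.
rows = list(csv.reader(io.StringIO(text)))
print(rows)

# Writing: QUOTE_ALL wraps every field in qualifiers, so a comma inside a
# value (hypothetical title below) does not get mistaken for a delimiter.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
writer.writerow(["Song, With Comma", "1997", "OK Computer"])
round_trip = next(csv.reader(io.StringIO(buffer.getvalue())))
print(round_trip)
```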
  • 101. <row> <col>Creep</col> <col>1993</col> <col>Pablo Honey</col> </row> <row> <col>No Surprises</col> <col>1997</col> <col>OK Computer</col> </row> <row> <col>Lucky</col> <col>1997</col> <col>OK Computer</col> </row> There’s also markup (e.g. HTML) “Tags” (<tag></tag>) give you the start and end of rows, and the start and end of columns within those rows. SGML, HTML, XML, all follow this approach.
  • 102. Song: Creep Year: 1993 Album: Pablo Honey % Song: No Surprises Album: OK Computer Year: 1997 % Year: 1997 Song: Lucky Album: OK Computer ... Or even YAML (“YAML Ain’t Markup Language”) (Like email headers.) Column pointers (names) are inline with the values; rows have many lines, and are delimited by another character (e.g. “%”). This is obviously more flexible, but still inherently hierarchical.
  • 103. And there are a couple modern technologies that are hierarchical all the way down.
  • 104. In practice, the only form of markup people use for storing data is XML. And the most prevalent subset of YAML people use for storing data is JSON. (Technically JSON isn’t a subset of YAML but you shut up.)
  • 105. XML was the early obvious choice, because we were used to HTML, so we all "got" it. <?xml version="1.0"?> <venue name="The Mohawk"> <show title="Vans Day Party, 3/13/14"> <band name="Eagulls" time="12:30 p.m."> <song order="1">Nerve Endings</song> <song title="Tough Luck" order="2" /> </band> <band name="DJ Rashad" time="1:15 p.m."> <song title="Holiday" order="1" /> <song title="I Can Feel It" order="2" /> </band> </show> </venue>
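For the curious, here is a sketch of reading that document with Python's standard `xml.etree.ElementTree`. The XML declaration is left out for brevity, and `song_title` is a made-up helper. Notice that the first song keeps its title in the tag contents while the others use a `title` attribute, so the code has to check both places.

```python
import xml.etree.ElementTree as ET

XML = """
<venue name="The Mohawk">
  <show title="Vans Day Party, 3/13/14">
    <band name="Eagulls" time="12:30 p.m.">
      <song order="1">Nerve Endings</song>
      <song title="Tough Luck" order="2" />
    </band>
    <band name="DJ Rashad" time="1:15 p.m.">
      <song title="Holiday" order="1" />
      <song title="I Can Feel It" order="2" />
    </band>
  </show>
</venue>
"""

root = ET.fromstring(XML.strip())


def song_title(song):
    """The title may live in an attribute or in the element's text."""
    return song.get("title") or (song.text or "").strip()


set_lists = {
    band.get("name"): [song_title(song) for song in band.findall("song")]
    for band in root.iter("band")
}
print(root.get("name"))
print(set_lists)
```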
  • 106. But XML isn’t very human-friendly. ● It’s verbose ● The wrong things grab your eye ● It’s somewhat complicated to parse ● Distinction between attributes and tag contents is confusing. “XML is crap. Really. There are no excuses. XML is nasty to parse for humans, and it's a disaster to parse even for computers. There's just no reason for that horrible crap to exist.” - Linus Torvalds, Yesterday (2014-03-06), on Google+
  • 107. Now JSON is winning ... { "venue": { "name": "The Mohawk", "show" : { "title": "Vans Day Party, 3/13/14", "bands": [{ "name": "Eagulls", "time": "12:30 pm", "songs": [ {"title": "Nerve Endings", "order": 1}, {"title": "Tough Luck", "order": 2} ]}, { "name": "DJ Rashad", "time": "1:15 pm", "songs": [ {"title": "Holiday", "order": 1}, {"title": "I Can Feel It", "order": 2} ]} ] } } }
  • 108. JSON: ● Where XML looks like a mass of text, JSON looks spacious (little clutter) ● Fast to parse, for humans and computers ● Self-describing, flexible format ● Extremely simple syntax (one page)
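To make "fast to parse" concrete, here is a sketch that loads the venue document from the JSON slide with Python's standard `json` module. The parsed result is ordinary dicts and lists, so walking the hierarchy is plain indexing.

```python
import json

TEXT = """
{ "venue": {
    "name": "The Mohawk",
    "show": {
      "title": "Vans Day Party, 3/13/14",
      "bands": [
        { "name": "Eagulls", "time": "12:30 pm",
          "songs": [ {"title": "Nerve Endings", "order": 1},
                     {"title": "Tough Luck", "order": 2} ] },
        { "name": "DJ Rashad", "time": "1:15 pm",
          "songs": [ {"title": "Holiday", "order": 1},
                     {"title": "I Can Feel It", "order": 2} ] }
      ]
    }
} }
"""

doc = json.loads(TEXT)
show = doc["venue"]["show"]

# venue -> show -> bands -> songs, exactly as nested in the text.
titles = [song["title"] for band in show["bands"] for song in band["songs"]]
print(show["title"])
print(titles)
```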
  • 110. One more thing: so far, we’ve talked about a certain kind of hierarchical relationship: containment. But there’s another kind worth mentioning: generalization.
  • 113. This kind of relationship is common in programming (it’s called superclassing). But it’s uncommon (at least, explicitly) in database systems. We’ll come back to it later.
  • 114. Hands on! You’re doing some corporate espionage for listen.up. You notice that do512.com seems to have a good way to organize data in their API. What can you steal, er, learn? Note: the guys at do512 are friends of mine and I am in no way encouraging anyone to perform any actual corporate espionage, no matter how cool that sounds. Listen.up is a made-up company, do512’s JSON API is open, and this is an exercise for learning; no stealing anything. :)
  • 115. Hands on! ● Make sure you’ve got the JSONView extension in Chrome ● Go to: http://2014.do512.com/events.json ● Explore the hierarchical data that comes up ● Try “-” to collapse all, click “+” signs to unfold sections ● Also try: ○ http://2014.do512.com/venues.json ○ http://2014.do512.com/artists.json
  • 116. Hands on! ● What’s the hierarchical structure of this data? ● In the same way we talked about nesting bands, venues, shows, etc, … what are the objects being nested here? ● Would you store it differently? ● Are there any superclass / subclass relationships?
  • 119. So, hierarchy! Pretty great, right?
  • 120. Actually, no; it’s quite problematic if you use it as the method for storing data. ● Lots of stuff isn't naturally hierarchical. ● You can't change the organization without changing all the code that accesses data. ● Above 3 dimensions, the number of possible access paths goes up dramatically! (Exponentially, in fact.) ○ The academic literature of the 60s and 70s is full of papers describing how to do this better or faster.
  • 121. But, what are you gonna do? Computers are just boring & hierarchical, so you’d better learn to deal with it.
  • 122. Meanwhile, outside the offices of IBM, the revolution of the ‘60s was happening.
  • 123. Hierarchy, especially a single universal hierarchy, was seeming less and less desirable, and less and less feasible.
  • 125. something wonderful happened THE MONOLITH IN KUBRICK'S 2001: A SPACE ODYSSEY (1968)
  • 126. Specifically, the math nerds beat the business jocks.
  • 127. This is Dr. Edgar F. Codd (1923-2003). He worked for IBM in the 1960s, and couldn’t stand the thought of our rich, interconnected world being subjugated to storage in hierarchical databases. So he came up with a radical theory. image from wikipedia: http://en.wikipedia.org/wiki/File:Edgar_F_Codd.jpg
  • 128. Two intellectual ancestors: Set Theory Graph Theory
  • 129. Set Theory: Sets are collections of objects. You can precisely describe operations on sets: ● Union ● Intersection ● Difference ● Cartesian Product
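The four operations map directly onto Python's built-in `set` type, so here is a small sketch. The lineups are hypothetical (they just reuse band and venue names from other slides); only the operators and `itertools.product` are real.

```python
from itertools import product

# Hypothetical lineups, reusing band names from other slides.
friday = {"Eagulls", "DJ Rashad", "Radiohead"}
saturday = {"Radiohead", "Metalface"}

union = friday | saturday          # bands playing on either night
intersection = friday & saturday   # bands playing on both nights
difference = friday - saturday     # bands playing on Friday only

# Cartesian product: every possible (band, venue) pairing.
venues = {"The Mohawk", "Beerland"}
pairings = set(product(saturday, venues))

print(sorted(union), sorted(intersection), sorted(difference), len(pairings))
```

Keep the Cartesian product in mind: a relational JOIN is essentially a product of two sets followed by a filter.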
  • 130. Graph Theory: Graphs are collections of nodes, connected by edges. Not this:
  • 131. Graph Theory: Graphs are collections of nodes, connected by edges. This: Think: a social network where the nodes are people and the edges are friend relationships.
  • 132. Codd’s genius was combining these and proposing a declarative (rather than imperative) access model. ● The relational model is a graph of sets ● Relations (tables) are sets of tuples (rows). ● Some attributes (columns) are edges that let you connect the sets in interesting ways. ● You never specify “how” to get to data, just “what” data to get, based on sets.
  • 133. When he proposed this as a way to actually store data in 1969, people thought he was from Mars. Everyone said “It’ll never work, computers are too slow.” His employer, IBM, said “Thanks but no thanks; we’ll just keep selling IMS.”
  • 134. But the haters didn’t bank on two things.
  • 135. 1: Moore’s Law That graph is exponential, not linear. We can have plenty of CPU.
  • 136. 2: Programmer Time As programmer time became more valuable than computer time, ease of representing the problem domain became a dominating factor.
  • 137. Due to some tomfoolery at IBM, Codd’s “Alpha” never made it. But another group at IBM created a quasi-relational version called SEQUEL, which looked kind of like COBOL (it was later renamed SQL, because SEQUEL was trademarked). Then in 1979, Larry Ellison copied the design to create Oracle, the first commercial SQL database. And the rest is history.
  • 138. For his part, Codd waged a decades-long battle to get a truer representation of the relational model adopted. But it never was. SQL was king.
  • 139. So what is the relational model, then?
  • 140. First, one quick PSA ...
  • 141. Occupational Hazard: Abstraction Vertigo Concrete Data Metadata The Structure of Metadata Band name: Radiohead Year formed: 1985 Entity: Band Attributes: Name, Year Formed Concepts → Entities, Attributes
  • 142. Safety first. If you feel dizzy, just ask a question.
  • 143. Entity Attribute Relationship There are 3 foundational concepts:
  • 145. Show Start time Cover $ is at Venue Address Age Req.
  • 146. Vans Day Party 12pm Free is at The Mohawk 123 Red River All Ages
  • 147. Notice the subtle shift here. Neither of those entities is “inside” the other. They’re both “first class” entities, and they’re in a relationship. The relationships are described at the level of sets, not ad hoc. Shows can be at venues, categorically.
  • 148. Entities look pretty much exactly like 2-dimensional tables, except for the concept of a “key”. That’s the attribute (or set of attributes) that distinguishes this row from that row.
  • 149. Show Start time Cover $ is at Venue Address Age Req. Show Start time Cover $ 1234 8pm $5 5678 11pm $9 Venue Address Age Req. The Mohawk 123 Red Riv. 21+ Beerland 456 Red Riv. All Ages key key
  • 150. Keys describe an entity’s identity. In practice, most systems today use surrogate keys (i.e. IDs) to establish unambiguous identity. eg: integers (123456), codes (X74-UUA2), GUIDs
  • 151. Show Start time Cover $ is at Venue Address Age Req. Show Is At Start time Cover $ 1234 The Mohawk 8pm $5 5678 Stubb’s 11pm $9 Venue Address Age Req. The Mohawk 123 Red Riv. 21+ Beerland 456 Red Riv. All Ages
  • 152. Then, to get connected information out, you traverse the relationships with something called a JOIN.
  • 153. Show Start time Cover $ is at Venue Address Age Req. Show Is At Start time Cover $ 1234 The Mohawk 8pm $5 5678 Stubb’s 11pm $9 Venue Address Age Req. The Mohawk 123 Red Riv. 21+ Beerland 456 Red Riv. All Ages
  • 154. Show Is At Start time Cover $ 1234 The Mohawk 8pm $5 5678 Stubb’s 11pm $9 Venue Address Age Req. The Mohawk 123 Red Riv. 21+ Beerland 456 Red Riv. All Ages Show Is At Address Age Req. Start time Cover $ 1234 The Mohawk 123 Red Riv. 21+ 8pm $5 + =
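Here is a runnable sketch of that JOIN, assuming an in-memory SQLite database via Python's standard `sqlite3` module (not the Postgres setup used in the hands-on section). The table and column names (`shows`, `venues`, `is_at`, and so on) are my own labels for the slide's columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE venues (name TEXT PRIMARY KEY, address TEXT, age_req TEXT);
    CREATE TABLE shows  (show_id INTEGER PRIMARY KEY, is_at TEXT,
                         start_time TEXT, cover TEXT);
    INSERT INTO venues VALUES ('The Mohawk', '123 Red Riv.', '21+'),
                              ('Beerland',   '456 Red Riv.', 'All Ages');
    INSERT INTO shows  VALUES (1234, 'The Mohawk', '8pm',  '$5'),
                              (5678, 'Stubb''s',   '11pm', '$9');
""")

# Declarative: we say WHAT we want (shows plus their venue's details),
# not HOW to walk from one table to the other.
rows = conn.execute("""
    SELECT s.show_id, s.is_at, v.address, v.age_req, s.start_time, s.cover
    FROM shows AS s
    INNER JOIN venues AS v ON v.name = s.is_at
""").fetchall()
print(rows)
```

Only show 1234 comes back. Show 5678 points at Stubb's, which has no row in the venue table, so an inner join drops it, just like in the slide.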
  • 155. Tuple is a fancy word for “row”. (It comes from abstracting … septuple, octuple, N-tuple …)
  • 156. Cardinality means, how many tuples of one relation can match each tuple in another relation? (In English: a show is at one venue, but a show can have many bands.)
  • 157. Show Start time Cover $ played by Band Name Genre Show Start time Cover $ 1234 8pm $5 5678 11pm $9 Band ID Name Genre RDOHD Radiohead Rock EGLS Eagulls Indie Rock many to many?
  • 158. Show Start time Cover $ played by Band Name Genre Show ID Start time Cover $ 1234 8pm $5 5678 11pm $9 Band ID Name Genre RDOHD Radiohead Rock EGLS Eagulls Indie Rock Show ID Band ID Start Time 1234 RDOHD 9:30pm 5678 EGLS 9pm 5678 RDOHD 11pm
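A sketch of the many-to-many pattern from the slide, again assuming in-memory SQLite through Python's `sqlite3`. The third table is the intersection table; the names `show_bands`, `bands_for_show`, and `shows_for_band` are hypothetical helpers for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shows (show_id INTEGER PRIMARY KEY, start_time TEXT, cover TEXT);
    CREATE TABLE bands (band_id TEXT PRIMARY KEY, name TEXT, genre TEXT);
    -- Intersection table: one row per (show, band) pairing.
    CREATE TABLE show_bands (
        show_id    INTEGER REFERENCES shows(show_id),
        band_id    TEXT    REFERENCES bands(band_id),
        start_time TEXT,
        PRIMARY KEY (show_id, band_id)
    );
    INSERT INTO shows VALUES (1234, '8pm', '$5'), (5678, '11pm', '$9');
    INSERT INTO bands VALUES ('RDOHD', 'Radiohead', 'Rock'),
                             ('EGLS',  'Eagulls',   'Indie Rock');
    INSERT INTO show_bands VALUES (1234, 'RDOHD', '9:30pm'),
                                  (5678, 'EGLS',  '9pm'),
                                  (5678, 'RDOHD', '11pm');
""")

def bands_for_show(show_id):
    """One show, many bands."""
    rows = conn.execute("""
        SELECT b.name FROM show_bands AS sb
        JOIN bands AS b ON b.band_id = sb.band_id
        WHERE sb.show_id = ? ORDER BY b.name
    """, (show_id,))
    return [name for (name,) in rows]

def shows_for_band(band_id):
    """One band, many shows."""
    rows = conn.execute(
        "SELECT show_id FROM show_bands WHERE band_id = ? ORDER BY show_id",
        (band_id,))
    return [show_id for (show_id,) in rows]

print(bands_for_show(5678), shows_for_band("RDOHD"))
```

The same table answers the question in both directions, which is exactly what a nested hierarchy can't do without picking a side.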
  • 159. Normalization is the idea that if something repeats (i.e. has a cardinality of more than 1), it should be expressed as another entity, not as repeating data.
  • 160. Show Date Venue Street City State Austinist Party! - Day 1 3/13 Mohawk 123 Red River Austin TX Austinist Party! - Day 2 3/14 Mohawk 123 Red River Austin TX Austinist Party! - Day 3 3/15 Mohawk 123 Red River Austin TX No: Yes: Show Date Venue Austinist Party! - Day 1 3/13 Mohawk Austinist Party! - Day 2 3/14 Mohawk Austinist Party! - Day 3 3/15 Mohawk Venue Street City State Mohawk 123 Red River Austin TX
  • 161. Basically, normalization dictates that the same fact is never repeated in more than one place.
  • 162. Show Date Venue Street City State Austinist Party! - Day 1 3/13 Mohawk 123 Red River Austin TX Austinist Party! - Day 2 3/14 Mohawk 123 Red River Austin TX Austinist Party! - Day 3 3/15 Mohawk 123 Red River Austin TX There’s nothing wrong with this data per se, but we all know that the final 4 columns refer to the same thing.
  • 163. Show Date Venue Street City State Austinist Party! - Day 1 3/13 Mohawk 123 Red River Austin TX Austinist Party! - Day 2 3/14 Mohawk 985 Congress Austin TX Austinist Party! - Day 3 3/15 Mohawk 123 Red River Austin TX There’s nothing wrong with this data per se, but we all know that the final 4 columns refer to the same thing. What would it mean if the street address were different in one of these? Are there two Mohawks?
  • 164. Normalization gives you consistency, also known as relational integrity. Certain kinds of problems (like that one) just can’t happen, because there’s literally only one place where you store the address of the Mohawk.
  • 165. Show Date Venue Austinist Party! - Day 1 3/13 Mohawk Austinist Party! - Day 2 3/14 Mohawk Austinist Party! - Day 3 3/15 Mohawk Venue Street City State Mohawk 123 Red River Austin TX
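To see the consistency payoff, here is a sketch (in-memory SQLite via Python's `sqlite3`; table and column names are assumptions) where the Mohawk's address lives in exactly one row. Changing it is one UPDATE, and every show sees the new value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE venues (venue TEXT PRIMARY KEY, street TEXT, city TEXT, state TEXT);
    CREATE TABLE shows  (title TEXT PRIMARY KEY, show_date TEXT,
                         venue TEXT REFERENCES venues(venue));
    INSERT INTO venues VALUES ('Mohawk', '123 Red River', 'Austin', 'TX');
    INSERT INTO shows VALUES ('Austinist Party! - Day 1', '3/13', 'Mohawk'),
                             ('Austinist Party! - Day 2', '3/14', 'Mohawk'),
                             ('Austinist Party! - Day 3', '3/15', 'Mohawk');
""")

# Suppose the venue moves: one UPDATE, in the one place the address lives.
conn.execute("UPDATE venues SET street = '985 Congress' WHERE venue = 'Mohawk'")

streets = conn.execute("""
    SELECT s.title, v.street
    FROM shows AS s JOIN venues AS v ON v.venue = s.venue
    ORDER BY s.title
""").fetchall()
print(streets)
```

There is no way for Day 2 to disagree with Day 1 and Day 3 about where the Mohawk is, because none of them stores the address.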
  • 166. Normalization also means that the attributes of an entity are non-repeating. (i.e. there are no repeating columns, or groups of columns)
  • 167. Show Start time Cover $ Band #1 Band #2 Band #3 House of Vans 8pm $5 Eagulls Cyndi Lauper Radiohead Chaos in Tejas 11am $10 Metalface CRUD Decapitation Show Start time Cover $ House of Vans 8pm $5 Chaos in Tejas 11am $10 Show Band Order House of Vans Eagulls 1 House of Vans Cyndi Lauper 2 House of Vans Radiohead 3 Chaos in Tejas Metalface 1 Chaos in Tejas CRUD 2 Chaos in Tejas Decapitation 3 No: Yes:
  • 168. This makes sense, because … what if you had 4 bands? Or 400?
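A small plain-Python sketch of the "No" to "Yes" transformation above. The `band_1`, `band_2`, `band_3` keys are my stand-ins for the slide's "Band #1" style columns, and `normalize` is a hypothetical helper, not anything from a library.

```python
# The "No" shape: one column per band slot.
wide = [
    {"show": "House of Vans", "band_1": "Eagulls",
     "band_2": "Cyndi Lauper", "band_3": "Radiohead"},
    {"show": "Chaos in Tejas", "band_1": "Metalface",
     "band_2": "CRUD", "band_3": "Decapitation"},
]

def normalize(rows):
    """Turn repeating band_N columns into (show, band, order) rows."""
    out = []
    for row in rows:
        slots = sorted(
            (key for key in row if key.startswith("band_")),
            key=lambda key: int(key.split("_")[1]),
        )
        for slot in slots:
            if row[slot] is not None:  # skip empty slots
                out.append((row["show"], row[slot], int(slot.split("_")[1])))
    return out

show_bands = normalize(wide)
for line in show_bands:
    print(line)
```

With the "Yes" shape, a show with 4 bands (or 400) is just more rows; nobody has to add a `band_400` column.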
  • 169. Do people break these rules? All the time! (We’ll get to valid reasons why they might want to do that.)
  • 170. One last thing: Schema is the structure of the database itself. It is stored as ... data! VenueID Name Address 123 Mohawk 123 Red River 456 Stubbs 456 Red River BandID Name # Members 4321 Eagulls 4 8765 CRUD 17 Table Column Type Venue VenueID ID Venue Name String Venue Address String Band BandID ID Band Name String Band # Members Integer
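You can see "schema is data" for yourself. This sketch assumes SQLite (via Python's `sqlite3`), where the catalog is the `sqlite_master` table plus `PRAGMA table_info`; Postgres exposes the same idea through `information_schema.columns`. The table layouts mirror the slide, with column names of my choosing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE venue (venue_id INTEGER PRIMARY KEY, name TEXT, address TEXT);
    CREATE TABLE band  (band_id INTEGER PRIMARY KEY, name TEXT, members INTEGER);
""")

# The list of tables is itself stored in a table.
tables = [name for (name,) in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

# And each table's columns can be queried like any other data.
schema = []
for table in tables:
    for _cid, column, col_type, *_rest in conn.execute(f"PRAGMA table_info({table})"):
        schema.append((table, column, col_type))

for line in schema:
    print(line)
```

The output has the same Table / Column / Type shape as the last table on the slide.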
  • 171. The final piece of the relational database puzzle: SQL (Structured Query Language) which is about how you get stuff out of this graph of sets.
  • 172. We obviously don’t have time to really learn SQL, but here’s the gist:
  • 173. SELECT S.cover, B.Band FROM Shows S INNER JOIN ShowBands B ON B.Show = S.Show WHERE S.start_time > 8pm AND B.band like ‘%face%’ ORDER BY order ASC
  • 174. Show Start time Cover House of Vans 8pm $5 Chaos in Tejas 9pm $10 Show Band Order House of Vans Eagulls 1 House of Vans Cyndi Lauper 2 Chaos in Tejas Metalface 1 Chaos in Tejas CRUD 2 SELECT S.cover, B.Band FROM Shows S INNER JOIN ShowBands B ON B.Show = S.Show WHERE S.start_time > 8pm AND B.band like ‘%face%’ ORDER BY order ASC
  • 176. SELECT S.cover, B.Band FROM Shows S INNER JOIN ShowBands B ON B.Show = S.Show WHERE S.start_time > 8pm AND B.band like ‘%face%’ ORDER BY order ASC Show Band Order House of Vans Eagulls 1 House of Vans Cyndi Lauper 2 Chaos in Tejas Metalface 1 Chaos in Tejas CRUD 2 Show Start time Cover House of Vans 8pm $5 Chaos in Tejas 9pm $10
  • 177. SELECT S.cover, B.Band FROM Shows S INNER JOIN ShowBands B ON B.Show = S.Show WHERE S.start_time > 8pm AND B.band like ‘%face%’ ORDER BY order ASC Show Band Order House of Vans Eagulls 1 House of Vans Cyndi Lauper 2 Chaos in Tejas Metalface 1 Chaos in Tejas CRUD 2 Show Start time Cover House of Vans 8pm $5 Chaos in Tejas 9pm $10 $10 Metalface
  • 178. SELECT S.cover, B.Band FROM Shows S INNER JOIN ShowBands B ON B.Show = S.Show WHERE S.start_time > 8pm AND B.band like ‘%face%’ ORDER BY order ASC Show Band Order House of Vans Eagulls 1 House of Vans Cyndi Lauper 2 Chaos in Tejas Metalface 1 Chaos in Tejas CRUD 2 Show Start time Cover House of Vans 8pm $5 Chaos in Tejas 9pm $10 $10 Metalface 1
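The query on these slides is slide-ware (`8pm` isn't a real literal, and `order` is a reserved word), so here is a runnable approximation, assuming in-memory SQLite via Python's `sqlite3`. My changes, all assumptions for the sake of running it: times are stored as 24-hour strings, the order column is called `band_order`, and the show column is called `show_name`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shows (show_name TEXT PRIMARY KEY, start_time TEXT, cover TEXT);
    CREATE TABLE show_bands (show_name TEXT, band TEXT, band_order INTEGER);
    INSERT INTO shows VALUES ('House of Vans',  '20:00', '$5'),
                             ('Chaos in Tejas', '21:00', '$10');
    INSERT INTO show_bands VALUES ('House of Vans',  'Eagulls',      1),
                                  ('House of Vans',  'Cyndi Lauper', 2),
                                  ('Chaos in Tejas', 'Metalface',    1),
                                  ('Chaos in Tejas', 'CRUD',         2);
""")

rows = conn.execute("""
    SELECT s.cover, b.band
    FROM shows AS s
    INNER JOIN show_bands AS b ON b.show_name = s.show_name
    WHERE s.start_time > '20:00'      -- later than 8pm
      AND b.band LIKE '%face%'        -- band name contains "face"
    ORDER BY b.band_order ASC
""").fetchall()
print(rows)
```

Same answer as the slides: the only show after 8pm is Chaos in Tejas, and the only band there with "face" in its name is Metalface.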
  • 179. So that’s relational databases and SQL, in a very small nutshell.
  • 180. You can imagine, relational databases get very complicated for non-trivial things. But you now know almost all there is to know: relational databases are graphs of sets, navigated via a declarative language. Source: http://wiki.musicbrainz.org/-/images/5/52/ngs.png
  • 181. Also, remember generalization? Theoretically, that’s just another type of relationship between entities.
  • 182. Band Genre Radiohead Rock Metalface Jazz Entertainer Since Radiohead 1985 Metalface 2005 Gonzo the Incredible 1968 Gob 2002 Penn & Teller 1990 Magician Style Gonzo the Incredible Sorcery Gob Fail Penn & Teller Conjuring
  • 183. I say theoretically because hardly anybody actually does this in practice, since (unlike in object oriented programming) doing this carries a performance and complexity burden in databases. (But, you could do it, theoretically.)
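For the curious, here is roughly what "doing it anyway" looks like: one table for the general thing, one per specialization, sharing a key. This is a sketch assuming in-memory SQLite via Python's `sqlite3`, with table names of my choosing and the data from the slide above. The JOIN you need just to read one magician is the complexity burden in question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The general entity...
    CREATE TABLE entertainer (name TEXT PRIMARY KEY, since INTEGER);
    -- ...and two specializations that share its key.
    CREATE TABLE band     (name TEXT PRIMARY KEY REFERENCES entertainer(name), genre TEXT);
    CREATE TABLE magician (name TEXT PRIMARY KEY REFERENCES entertainer(name), style TEXT);

    INSERT INTO entertainer VALUES ('Radiohead', 1985), ('Metalface', 2005),
                                   ('Gonzo the Incredible', 1968), ('Gob', 2002),
                                   ('Penn & Teller', 1990);
    INSERT INTO band     VALUES ('Radiohead', 'Rock'), ('Metalface', 'Jazz');
    INSERT INTO magician VALUES ('Gonzo the Incredible', 'Sorcery'), ('Gob', 'Fail'),
                                ('Penn & Teller', 'Conjuring');
""")

# A "full" magician = its own attributes + the ones inherited from entertainer.
magicians = conn.execute("""
    SELECT e.name, e.since, m.style
    FROM magician AS m
    JOIN entertainer AS e ON e.name = m.name
    ORDER BY e.name
""").fetchall()
print(magicians)
```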
  • 184. Hands on! We’re going to spin up a relational database using Heroku, create and populate some tables, and show the data on a web site.
  • 185. Hands on! About Heroku: ● Super simple hosted engine for running any code online ● Has built-in relational database capabilities with Postgres ● Uses a version control system called git ● Uses Amazon AWS to host the code ● It's owned by Salesforce.com (my company) but run separately; I'm not an expert ● Also note that this example is written in Ruby and I’m really not an expert on that. Hope it works!
  • 186. Hands on! Prerequisites: 1. You should already have git installed; do now if not. 2. You should have heroku toolbelt installed, and have created a heroku login. Do now if you haven’t. 3. You should have already cloned my repo, but if not do it now. $ cd ~ $ git clone https://github.com/ivarley/unstructure-sxsw14
  • 187. Hands on! Steps: 1. Go to the directory you cloned the repo into: $ cd unstructure-sxsw14 2. Create and publish the app: $ heroku login $ heroku create $ git push heroku master $ heroku open
  • 188. Hands on! Congrats! You just published a web app. Not convinced? Edit some of the visible text in the file: app/views/welcome/index.html.erb and republish: $ git commit -am 'made an edit' $ git push heroku master $ heroku open
  • 189. Hands on! Now let’s add some data! Heroku automatically created a Postgres database for you. $ heroku pg:psql opens a database prompt. Create a table and insert data: CREATE TABLE band (bandid INT PRIMARY KEY, name VARCHAR, genre VARCHAR); INSERT INTO band (bandid, name, genre) VALUES (1, 'Radiohead', 'Rock'); SELECT * FROM band;
  • 190. Hands on! Add some more tables and data: CREATE TABLE venue (venueid INT PRIMARY KEY, name VARCHAR, address VARCHAR); CREATE TABLE show (venueid INT, bandid INT, date VARCHAR, PRIMARY KEY (venueid, bandid, date)); INSERT INTO venue (venueid, name, address) VALUES (1, 'The Mohawk', '123 Red River, Austin, TX'); INSERT INTO show (venueid, bandid, date) VALUES (1, 1, '2014-03-07'); And create a view (basically, a saved SQL statement): CREATE VIEW shows AS SELECT b.name AS band_name, v.name AS venue_name, s.date FROM band b INNER JOIN show s ON b.bandid = s.bandid INNER JOIN venue v ON s.venueid = v.venueid;
  • 192. Hands on! Now you can SELECT data: SELECT * FROM shows; And you’ll get: band_name | venue_name | date -----------+------------+------------ Radiohead | The Mohawk | 2014-03-07 Note if you’re into nit-picking: I'm taking a shortcut here and saying that each record in "show" is a band/venue combination, with a date. If "show" were a proper entity (for example, if the show had a name, a promoter, etc.) then the proper “normalized” way to model it would be to create a show entity with an ID and a venue ID, show name, promoter, etc; and then have intersection tables between bands and shows (like a band_show table). But this is fine for now.
  • 193. Hands on! Leave the SQL prompt (type “\q”) and reopen the app: $ heroku open And (hopefully) voila! You see the results of: SELECT * FROM shows; Feel free to mess around now--add more data, to see how it’ll show up. If you’re really advanced, try changing the ruby code to show different SQL statements, etc. We’ll take about 10 minutes to play around, ask questions, etc.
  • 196. So, hierarchical databases were lame. But relational databases are awesome! SQL Rules!! Right … ?
  • 197. Hrm. As awesome as they are, relational databases have a few warts. And some of them have only become apparent recently.
  • 198. There’s a standard list of grievances: ● It’s really hard to work with recursive / graph relationships ● Results of SELECT queries are always flat tables, which means you have to reassemble nested structures yourself ● They don’t play well with object oriented programs ● They don’t support inheritance & superclassing! ● Relational modeling tools are generally quite sucky. ● They don’t scale well to extremely large data sets, because they promise things you can only do on a single (non-distributed) system.
  • 199. These are all interesting points; if we had a whole semester, we could spend weeks on any one of them.
  • 200. But I want to spend the rest of our time on what is (IMO) the essential problem: mutability. (i.e. change, being mutated)
  • 201. Let’s zoom way out for a minute. By NASA [Public domain], via Wikimedia Commons
  • 202. We talked about hierarchies. ● Old school, classical ● Single top down view of reality We talked about relations. ● Modern, networks, connections ● No single privileged access path or view
  • 203. But both hierarchies and relations tend towards a commitment to a fixed representation of reality. You’re modeling static concepts that exactly match the real world. Doing this with high fidelity is the crowning achievement of the relational model.
  • 204. But reality doesn’t hold still. It changes, all the time. Any fixed way of representing the world is doomed to become outdated.
  • 205. Darwin knew what was up. The nature of reality is to be dynamic, evolving.
  • 206. A species seems like a fixed thing, but that's just a label we attach. They are fuzzy around the edges, and always changing.
  • 207. Change is a fact of life at all levels of the abstraction ladder… our understanding of structure itself the structure of our software apps facts about things in the real world
  • 208. And it’s only getting faster. We can either hide our heads in the sand, or we can figure out what to do about it.
  • 209. So let’s talk about 3 ways to smash the boundaries of data, in the mutation dimension: Attribute Flexibility Non-Destructive Mutability Model Agility
  • 211. This is an UPDATE statement in SQL: UPDATE Band SET name = 'Jaydiohead' WHERE BandID = 5678 AND name = 'Radiohead'; Before: Band ID Name 1234 Bjork 5678 Radiohead After: Band ID Name 1234 Bjork 5678 Jaydiohead
  • 212. You know what you can’t do? Undo. Edits in traditional relational databases are destructive. If you want to keep the old version, you have to do it yourself. Note for data nerds: yes, databases keep transaction logs so you can undo and redo edits as part of transactions. But this is (a) implementation dependent, (b) not typically exposed to users in the relational model, and (c) not guaranteed to persist beyond the transaction itself.
  • 213. This means that everyone has to either: ● Add extra complexity to their data model, or ● Accept that changes are destructive. Both are pretty crappy options. Most people just do #2.
  • 214. But, sometimes you can’t do that. Consider compliance. ● Who changed my salary? ● Who deleted this opportunity from the pipeline? ● Who took my stapler? You might not care a bunch about that, but Sarbanes-Oxley sure does. And if you don’t know what that is, count yourself lucky and go back to making pretty things.
  • 215. So how do people store history now? Option #1: audit columns (partial solution) So this … Band ID Name 1234 Bjork 5678 Radiohead
  • 216. So how do people store history now? Option #1: audit columns (partial solution) So this … Band ID Name 1234 Bjork 5678 Radiohead … becomes this. Band ID Name Create Date Created By Modify Date Modified By Deleted? 1234 Bjork 3/6/14 2pm Ian Varley 3/6/14 3pm Jan Jones false 5678 Jaydiohead 3/5/14 1pm John Smith 3/7/14 1pm Ian Varley false
  • 217. So how do people store history now? Option #2: History tables So this … Band ID Name 1234 Bjork 5678 Radiohead
  • 218. So how do people store history now? Option #2: History tables So this … Band ID Name 1234 Bjork 5678 Radiohead … becomes this. Band ID Name Version Date By 1234 Bjork 1 3/6/14 Ian V 5678 Radiohead 1 3/6/14 Ian V 5678 Jaydiohead 2 3/7/14 Ian V 1234 Fjork 2 3/7/14 Ian V
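A do-it-yourself history table usually means triggers. Here is a minimal sketch, assuming in-memory SQLite via Python's `sqlite3`; the table and trigger names are hypothetical, and I've left out the Date and By columns to keep it short. Every insert or rename on `band` appends a row to `band_history` instead of losing the old value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE band (band_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE band_history (band_id INTEGER, name TEXT, version INTEGER);

    -- First version of every new band.
    CREATE TRIGGER band_insert AFTER INSERT ON band BEGIN
        INSERT INTO band_history VALUES (NEW.band_id, NEW.name, 1);
    END;

    -- Every rename appends the next version instead of overwriting.
    CREATE TRIGGER band_update AFTER UPDATE OF name ON band BEGIN
        INSERT INTO band_history
        SELECT NEW.band_id, NEW.name, MAX(version) + 1
        FROM band_history WHERE band_id = NEW.band_id;
    END;

    INSERT INTO band VALUES (1234, 'Bjork');
    INSERT INTO band VALUES (5678, 'Radiohead');
    UPDATE band SET name = 'Jaydiohead' WHERE band_id = 5678;
""")

current = conn.execute("SELECT band_id, name FROM band ORDER BY band_id").fetchall()
history = conn.execute(
    "SELECT band_id, name, version FROM band_history ORDER BY band_id, version"
).fetchall()
print(current)
print(history)
```

It works, but look at how much of that script is plumbing rather than bands. Now multiply by every table you own.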
  • 219. Neither of these is ideal. They ugly up your data model, which is not just an inconvenience; it makes it harder to see the “real” stuff. It also violates the spirit of normalization: if something means the same thing, don’t repeat it all over the place.
  • 220. But what if … the database took care of this for you? Band ID Name 1234 Bjork 5678 Radiohead (the same table, shown as a stack of versions)
  • 221. Some newer ones do: ● Google pioneered this with BigTable - time is a privileged dimension, part of the model, stored with every datum ● NoSQL Stores like HBase follow suit ● Salesforce offers “field history” out of the box, as a meta-feature on any entity, expressed as a history table ● Document stores can store older versions
  • 222. Does this use a lot of space? Yes. But guess what? We have a lot of space.
  • 223. And, it turns out that for a lot of computing applications, making it immutable makes the problem way easier to reason about. (For more on that, see Pat Helland’s talk, Immutability Changes Everything.)
  • 224. Here’s the crazy part. If we do this across the board, we end up with ... data time travel. “K-9, show me my accounts receivable as of last May, and compare it with today.”
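A toy illustration of "time as a privileged dimension", in plain Python. `VersionedStore` is a hypothetical class I made up for this sketch (it is not how Bigtable or HBase are implemented): writes only ever append a timestamped version, and reads can ask for the value "as of" any moment.

```python
import bisect

class VersionedStore:
    """Append-only key/value store: writes add versions, never overwrite."""

    def __init__(self):
        # key -> list of (timestamp, value), kept in timestamp order
        self._history = {}

    def put(self, key, value, timestamp):
        versions = self._history.setdefault(key, [])
        if versions and timestamp < versions[-1][0]:
            raise ValueError("timestamps must not go backwards")
        versions.append((timestamp, value))

    def get(self, key, as_of=None):
        """Latest value, or the value that was current at time `as_of`."""
        versions = self._history.get(key, [])
        if as_of is None:
            return versions[-1][1] if versions else None
        timestamps = [ts for ts, _ in versions]
        idx = bisect.bisect_right(timestamps, as_of)
        return versions[idx - 1][1] if idx else None

store = VersionedStore()
store.put("band:5678", "Radiohead", timestamp=1)
store.put("band:5678", "Jaydiohead", timestamp=5)

print(store.get("band:5678"))            # today
print(store.get("band:5678", as_of=3))   # "as of last May"
```

"Undo" falls out for free: it's just reading as of one version ago and writing that back as a new version.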
  • 225. Plain old “undo” is a subset of this. Ever wondered why Google web products all have “undo” and nobody else’s do? (Full revision history is the fancy version. They have that too.)
  • 226. My pronouncement: Henceforth, all databases should make time a privileged dimension, and retain older versions of data in a way that supports time travel.
  • 228. Historically, databases have been something of a schoolmarm. Image: Public domain. From the 1894 Laughable Lyrics: A Fourth Book of Nonsense Poems, Songs, Botany, Music, etc. by Edward Lear.
  • 229. Some things are decided and enforced directly at the database level: - what attributes an entity can have - how many attributes there are - what type of data to store in each attribute - (string, number, date, currency, etc.) - other constraints (max & min values, etc)
  • 230. But why is this the database's job? It’s partially a historical accident. Disk & memory used to be scarce and highly optimized, so record formats had to be prescriptive and fixed.
  • 231. It’s also partially a mindset. Centralizing decisions about structure is very tempting; it’s easy to overestimate your ability to “get it right” the first time, or find the “one true model” for all.
  • 232. But … What if a database's job is really to store whatever fields I give it? What if I want to say, "Let me store any additional facts I can think of about bands, venues, shows"?
  • 233. If your model imposes less, then there's less to change when the world changes. This is analogous to the difference between statically typed languages (like C and Java) and dynamically typed languages (like Ruby and Python). Sometimes you want to trade safety for flexibility.
  • 234. As an example, Thom Yorke’s booking agent might use address book software. But what if they had to update the database schema for each new kind of social media service he decides to use?
  • 235. You can picture just adding more attributes, like key-value pairs in JSON: { "id": "ARP5KQF1187B9B4DD1", "name": "Explosions in the Sky", "genres": [{"name": "post rock"}], "years_active": [{"start": 1999}], "artist_location": { "location": "Austin, TX, US", "city": "Austin", "region": "Texas", "country": "United States" } }
  • 236. You can picture just adding more attributes, like key-value pairs in JSON: { "id": "ARP5KQF1187B9B4DD1", "name": "Explosions in the Sky", "genres": [{"name": "post rock"}], "years_active": [{"start": 1999}], "artist_location": { "location": "Austin, TX, US", "city": "Austin", "region": "Texas", "country": "United States" }, "familiarity": 0.687572, "favorite_color": "Blue" }
  • 237. “Tagging” is the simplest version of this. Each tag is a boolean (true/false), and there can be any number of them.
  • 238. You can already model attribute flexibility at a meta-level in a relational database... Band ID Name 1234 Bjork 5678 Radiohead Band ID Attribute Value 5678 City Austin 5678 State TX 5678 Fav. Color Blue 5678 Familiarity 0.687572 After all, it’s just another degree of cardinality, right?
  • 239. But writing queries against it is a massive PITA (trust me). And, it’s the same meta problem: if you did this for every entity in your model, your model would be impossible to comprehend.
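To make the PITA concrete, here is a sketch of that attribute/value layout, assuming in-memory SQLite via Python's `sqlite3` (the table name `band_attribute` is my own). Just getting city and state back as ordinary columns takes a CASE expression per attribute plus a GROUP BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE band (band_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE band_attribute (
        band_id   INTEGER REFERENCES band(band_id),
        attribute TEXT,
        value     TEXT,              -- everything is a string now
        PRIMARY KEY (band_id, attribute)
    );
    INSERT INTO band VALUES (1234, 'Bjork'), (5678, 'Radiohead');
    INSERT INTO band_attribute VALUES
        (5678, 'City', 'Austin'), (5678, 'State', 'TX'),
        (5678, 'Fav. Color', 'Blue'), (5678, 'Familiarity', '0.687572');
""")

# One CASE per attribute you want to see as a column.
rows = conn.execute("""
    SELECT b.name,
           MAX(CASE WHEN a.attribute = 'City'  THEN a.value END) AS city,
           MAX(CASE WHEN a.attribute = 'State' THEN a.value END) AS state
    FROM band AS b
    LEFT JOIN band_attribute AS a ON a.band_id = b.band_id
    GROUP BY b.band_id, b.name
    ORDER BY b.band_id
""").fetchall()
print(rows)
```

Also notice what you gave up: the familiarity score is stored as the string '0.687572', so the database can no longer check or compare it as a number.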
  • 240. Attribute flexibility is “table stakes” for new DBs: ● Google did it with BigTable - the columns for a row are totally flexible at run time, and the values are simple byte arrays ● Most other NoSQL stores offer this too ● Some services make it the backbone of what they offer (e.g. keen.io - 1 entity, but any set of attributes you want to send)
  • 241. My second pronouncement: Henceforth, most databases should really think about offering attribute flexibility, accepting writes and reads of “columns” that haven’t been declared in advance.
  • 242. Aside: if columns aren’t predefined, how do we know their data type? ● Strings, Numbers, Dates, etc? ● JSON has the right idea: ○ If it’s in quotes, it’s a string ○ If it’s not, it’s either: ■ a number ■ true or false ■ null (or a nested object or array)
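A quick way to see that the value itself carries its type: parse some JSON with Python's standard `json` module and ask each value what it is. The record below is a made-up mix (the `active` and `label` fields are hypothetical, added to show booleans and null).

```python
import json

record = json.loads("""
{
  "name": "Explosions in the Sky",
  "familiarity": 0.687572,
  "start": 1999,
  "active": true,
  "label": null
}
""")

# No schema anywhere: the syntax of each value determines its type.
types = {key: type(value).__name__ for key, value in record.items()}
print(types)
```

Note what's missing: JSON has no date type, so dates end up as strings (or numbers) by convention.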
  • 243. And … while we’re at it, why not just get rid of attributes altogether, and say that entities store JSON blobs, with nesting intact? (We’ll come back to that …)
  • 245. "Schemas usually remain relatively stable over the lifetime of a database for most applications." - S Navathe, 1992 "No one will need more than 637 kB of memory for a personal computer." - B. Gates, 1979
  • 246. It’s true, though. Database schemas do tend to remain stable … because it sucks so bad to change them! ● Honestly, databases have always been the least “agile” part of software development. ● It's the final frontier of "BDUF" (big design up front)
  • 247. But, wait. Aren’t schemas stored as data? Can’t you just change them? ALTER TABLE band ADD COLUMN favorite_color VARCHAR;
  • 248. In theory, that works. In practice, it doesn’t. ● You can change the schema, but you can only ever have one schema at a time ● Some changes might require downtime ● For big tables, that could mean you’re offline for hours, days, or even weeks. ● And, by the way, if you follow my first pronouncement about data and time travel, what happens when you change the schema?
  • 249. What to do instead? One option is to generalize: make your model so generic, you never have to alter a table!
  • 250. This is what “architecture astronauts” do, and it doesn’t work, because you eventually end up with this model: (You can sometimes find a sweet spot, but more likely you’re just pushing the essentialism and brittleness to another layer.)
  • 251. What we really need is a system where structure changes are also non-destructive. (We’d need that anyway if we want to time travel with the data, right?)
  • 252. There’s actually an example of this! http://couchdb.apache.org/
  • 253. Apache CouchDB is a NoSQL database. ● The database is just a flat collection of JSON documents ● There’s no schema! Put in whatever you want! ● You create views (using JavaScript) that “materialize” certain access patterns across your documents. It’s not perfect (scaling is tricky, etc.) but as an illustration of these points, it’s spot on.
  • 254. CouchDB hits all 3 of our mutability desires: ● All documents are versioned. ○ Non-destructive Mutation! ● Document JSON can have any structure ○ Attribute Flexibility! ● Schema-On-Read using views ○ Model Agility!
  • 255. It also happens to be wicked easy to get started with, so let’s do one more hands-on exercise.
  • 256. Hands on! Let’s load some music data into CouchDB! You should have already installed CouchDB, but if not, do so now.
  • 257. Hands on! Earlier, we used the Do512 API. This time we’ll use The Echo Nest, a huge music data repository. (Which was, incidentally, just bought by Spotify yesterday, March 6th, 2014! Too bad we didn’t get on this listen.up thing a little sooner amiright?) They require setting up an API key for access, so as a shortcut I’ve done that part for you.
  • 258. Hands on! The file mutation/echonest-artists-austin-v1.json is data on the top 100 “most familiar” artists with a hometown of Austin, TX, from this API call: http://developer.echonest.com/api/v4/artist/search?api_key={MY_API_KEY}&format=json&results=100&start=0&bucket=familiarity&bucket=genre&bucket=artist_location&bucket=years_active&artist_location=austin Full disclosure, I modified the result slightly, so it’d work immediately with CouchDB bulk load: ● removed the outer "response" wrapper from the API ● changed the name of the array from "artists" to "docs" ● changed all the "id" fields to "_id" so CouchDB would use them
  • 259. Hands on! Ensure CouchDB is installed and running: http://127.0.0.1:5984/_utils/index.html Create a new database called “listenup”
  • 260. Hands on! Upload a bunch of data: $ cd ~/unstructure-sxsw14/mutation/ $ curl -H "Content-Type:application/json" -d @echonest-arists-austin_v1.json -X POST http://127.0.0.1:5984/listenup/_bulk_docs Reload the database web page: http://127.0.0.1:5984/_utils/database.html?listenup
  • 261. Hands on! See what happens when you change a value. Let’s pick a band at random: http://127.0.0.1:5984/_utils/document.html?listenup/ARJXDMJ11FF10D84F9 Double click, and change to: [{"name": "garage soul"}] Then click “Save Document”.
  • 262. Hands on! Now there are two versions of this document: Non-destructive mutability, in the flesh!
  • 263. Hands on! Now add a field that never existed before: http://127.0.0.1:5984/_utils/document.html?listenup/ARJXDMJ11FF10D84F9 Click “Add Field”. Add “Influences” as the Field name, and click “Save Document”. Double click the “null” next to “Influences”, and change the value to [{"name": "Michael McDonald"}] Click “Save Document”.
  • 264. Hands on! Attribute flexibility! Notice that so far, we haven’t once had to specify a schema.
  • 265. Hands on! Now we query the data by writing code to implement materialized views. This is actually a little complicated ...
  • 266. Hands on! (Yes, I know this is actually about Riak, not CouchDB. Why do you hate laughing?) Comic by John Muellerleile / http://thinkdifferent.ly/fault-tolerance.png
  • 267. Hands on! As a simple example, create a new document: { "_id": "_design/application", "views": { "genre-view": { "map": "function(artist) {if(artist.name && artist.genres) {artist.genres.forEach(function(genre) {emit(genre, artist.name);});}}" } } } Then visit: http://127.0.0.1:5984/listenup/_design/application/_view/genre-view
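The map function in that design document is hard to read as a one-line JSON string, so here is the same logic laid out as ordinary JavaScript, plus a tiny harness that mimics what a view does with it. The harness is a rough approximation for illustration: the name `runMap`, the sample documents, and passing `emit` as a parameter are assumptions (in CouchDB, `emit` is provided to the function as a global). Note that if each genre is an object such as {"name": "garage soul"}, as in the earlier edit, the emitted key is that whole object.

```javascript
// Same logic as the "genre-view" map function, formatted for reading.
// Here emit is a parameter so the function can run outside CouchDB.
function genreMap(artist, emit) {
  if (artist.name && artist.genres) {
    artist.genres.forEach(function (genre) {
      emit(genre, artist.name); // one row per (genre, artist) pair
    });
  }
}

// Rough stand-in for a view: run the map function over every document
// and collect the emitted rows, tagging each with its source document ID.
function runMap(docs, mapFn) {
  const rows = [];
  docs.forEach(function (doc) {
    mapFn(doc, function (key, value) {
      rows.push({ id: doc._id, key: key, value: value });
    });
  });
  return rows;
}

// Hypothetical documents; IDs and names are invented.
const sampleDocs = [
  { _id: "a1", name: "Band One", genres: [{ name: "garage soul" }, { name: "rock" }] },
  { _id: "a2", name: "Band Two", genres: [{ name: "rock" }] },
  { _id: "a3", name: "No Genres Yet" } // skipped: has no genres field
];

console.log(JSON.stringify(runMap(sampleDocs, genreMap), null, 2));
```

Documents without a name or a genres field produce no rows, which is why the view keeps working even though no schema was ever declared.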
  • 268. Hands on! Try making your own view. Some examples are here: http://guide.couchdb.org/draft/cookbook.html
  • 270. Chapter 5: Conclusion & Future Directions
  • 271. So where have we been?
  • 273. Before we go off to our subsequent debauchery, if you’ll indulge me, we actually have a tiny bit more smashing to do.
  • 274. Now it’s time to go through the looking glass. Image from Disney Wikia: http://disney.wikia.com/wiki/Alice
  • 275. (Some of these thoughts are adapted from my previous presentation, I’ve Always Wanted To Data Model)
  • 276. This is a technical book from the 1970s. It’s a philosophy book. It opened my eyes to some of the real underlying questions.
  • 278. Of course not, not categorically. Official Radiohead Band Members (Member, Year Joined): Thom Yorke, 1985; Jonny Greenwood, 1985; Ed O'Brien, 1985; Colin Greenwood, 1985; Philip Selway, 1985; Ian Varley, 2014
  • 280. Depends on what you mean. Robin Hood’s Band Of Merry Men (Merry Man, Merry-ness): Robin Hood, High; Little John, Medium; Much The Miller’s Son, Medium; Friar Tuck, High; Arthur a Bland, Low; Maid Marian, Unknown
  • 281. My personal theory: Data is “existential claims”. The fact that this data, and not some other, is stored implies that someone (or something) is making a claim about the existential state of something. This claim may or may not correspond to the actual existential state of that something.
  • 282. As cool as the boundary-smashing tools we looked at today are, it’s also wise to remember that we don’t really know much of what’s going on. For example ...
  • 283. Nobody actually knows what an “entity” really is.
  • 284. “Entity” is another word for Category, in linguistics terms. And an important property of linguistic categories is that they are slippery. See: ● Steven Pinker: The Stuff Of Thought ● Douglas Hofstadter: Surfaces & Essences ● George Lakoff: Women, Fire, and Dangerous Things
  • 285. part: an abstract definition of a connected set of physical materials that serve some purpose part: one instance of a manufactured item, which either does or does not meet quality standards Images: (1) Atwood Hydraulic Surge Brake Actuator, http://www.pacifictrailers.com/Atwood-Hydraulic-Brake-Actuator-Parts-List-and-Schematic/; (2) Ford Motor Company flywheel magneto assembly line 1913, source unknown
  • 286. And if you think you can “solve” the problem, I’ve got some World Trade Center insurance policies to sell you.
  • 287. That said, there are a couple techniques we could adopt that would help: ● First-class Sub- / Super-Typing ● First-class Scoping and Aliasing (Not that there aren’t ways to do this in relational models, but they’re unobvious and not widely used.)
  • 288. Discrete models encourage black & white thinking in a gray world
  • 289. Sometimes the deep structure is actually ambiguous. Image credit: By Chire (Own work) [Public domain or Public domain], via Wikimedia Commons
  • 290. Our current tools are s#!t. Boxes & lines aren’t how we actually think; our spatial processing of diagrams doesn’t map well to our temporal, spatial, and causal comprehension of data structure.
  • 292. If we had the right tools, what would they look like?
  • 293. They’d have native support for ... ● My mutability requirements ○ Non-destructive mutation ○ Attribute Flexibility ○ Model Agility ● The 3 Ps: ○ Provenance, Provability, Probability
  • 294. They’d have native support for ... ● Supertyping ● Extensible Meta-Metadata ● Semantic Zoom ● Prototype Generation ● Model Versioning and Diffing
  • 295. Interesting direction: RAML ● Textual (YAML-based) ● Strong, simple syntax ● Generates useful models ● Communication focussed Could there be something like this for data models?
  • 297. Thanks! @thefutureian ianvarley.com (If you enjoyed this, please rate it a 5 … sxsw.com/rate)