All Things Open 2014 - Day 2
Thursday, October 23rd, 2014
Doug Turnbull
Search & Big Data Architect for OpenSource Connections
Databases
Stop Worrying & Love the SQL - A Case Study
3. OpenSource Connections
Most Importantly we do...
Make my search results more relevant!
“Search Relevancy”
What database works best for problem X?
“(No)SQL Architect/Trusted Advisor”
4. OpenSource Connections
How products actually get built
Rena: Doug, John can you come by this afternoon?
One of our Solr-based products needs some urgent relevancy
work
It's Friday, it needs to get done today!
Us: Sure!
The Client
(Rena!)
smart
cookie!
5. OpenSource Connections
A few hours later
Us: we’ve made a bit of progress!
image frustration-1081 by jseliger2
Rena: but every time we fix something, we break an
existing search!
Us: yeah! we’re stuck in a whack-a-mole-game
other image: whack a mole by jencu
7. OpenSource Connections
I HAVE AN IDEA
● Middle of the afternoon, I stop doing search
work and start throwing together some
python:
from flask import Flask
app = Flask(__name__)
Everyone: Doug, stop that, you have important search work to do!
Me: We’re not making any progress!
WE NEED A WAY TO REGRESSION TEST OUR RELEVANCY AS WE TUNE!
Everyone: You’re nuts!
8. OpenSource Connections
What did I make?
Focus on gathering stakeholder (i.e. Rena)
feedback on search, coupled w/ workbench
tuning against that feedback
Today we have customers...
… forget that, tell me about your failures!
9. OpenSource Connections
Our war story
My mistakes:
● Building a product
● Selling a product
● As a user experience engineer
● As an Angular developer
● At choosing databases
10. OpenSource Connections
Quepid 0.0.0.0.0.0.1
Track multiple user searches
for this query (hdmi cables) Rena rates this document
as a good/bad search result
need to store:
<search> -> <id for search result> -> <rating 1-10>
“hdmi cables” -> “doc1234” -> “10”
*Actual UI may have been much uglier
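A minimal sketch of what regression testing against this feedback could look like, assuming the ratings live in a plain dict shaped like the mapping above. The name `score_query` and the average-rating rule are assumptions for illustration, not how Quepid actually scores a search.

```python
# Hypothetical sketch: function name and scoring rule are illustrative only.

# <search> -> <id for search result> -> <rating 1-10>
ratings = {
    "hdmi cables": {"doc1234": 10, "doc532": 5},
}

def score_query(query, result_ids, ratings, default=0):
    """Average the stakeholder's ratings over the documents a search returned.

    Unrated documents count as `default`.
    """
    if not result_ids:
        return 0.0
    judged = ratings.get(query, {})
    total = sum(judged.get(doc_id, default) for doc_id in result_ids)
    return total / float(len(result_ids))

# Score the same query before and after a tuning change.
before = score_query("hdmi cables", ["doc1234", "doc532"], ratings)
after = score_query("hdmi cables", ["doc532", "doc999"], ratings)

# A drop in score flags a relevancy regression for this query.
regressed = after < before
```

Running every rated query through a check like this after each tuning change is one way out of the whack-a-mole game.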
11. OpenSource Connections
Data structure selection under duress
● What’s simple, easy, and will persist our
data?
● What plays well with python?
● What can I get working now in Rena’s office?
12. OpenSource Connections
Redis
● In memory “Data Structure Server”
○ hashes, lists, simple key-> value storage
● Persistent -- write to disk every X minutes
13. OpenSource Connections
Redis
Easy to install and go!
$ pip install redis
from redis import Redis
redis = Redis()
redis.set("foo", "bar")
redis.get("foo")  # gets 'bar' (b'bar' on Python 3 unless decode_responses=True)
Specific to our problem:
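from redis import Redis
redis = Redis()
ratings = {"doc1234": "10",
           "doc532": "5"}
searchQuery = "hdmi cables"
# redis-py has no hsetall(); hset(..., mapping=...) stores the whole dict
# (older clients used hmset() for this)
redis.hset(searchQuery, mapping=ratings)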
Store a hash table
at “hdmi cables”
with:
“doc1234” -> “10”
“doc532” -> “5”
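To try the "hash per query" idea without a running Redis server, here is a small dict-backed stand-in. `InMemoryHashStore` is a hypothetical class, loosely modelled on redis-py's `hset(name, mapping=...)` and `hgetall(name)`; it is not a Redis client and skips persistence entirely.

```python
class InMemoryHashStore(object):
    """Hypothetical stand-in for a Redis client: one dict of hashes."""

    def __init__(self):
        self._hashes = {}

    def hset(self, name, mapping):
        # Merge the given fields into the hash stored at `name`.
        self._hashes.setdefault(name, {}).update(mapping)

    def hgetall(self, name):
        # Return a copy of the hash, or an empty dict if the key is missing.
        return dict(self._hashes.get(name, {}))


store = InMemoryHashStore()
ratings = {"doc1234": "10", "doc532": "5"}
searchQuery = "hdmi cables"

# Store a hash table at "hdmi cables"
store.hset(searchQuery, mapping=ratings)

stored = store.hgetall(searchQuery)
```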
14. OpenSource Connections
Success!
● My insanity paid off that afternoon
● Now we’re left with a pile of hacked together
(terrible) code -- now what?
15. OpenSource Connections
Adding some features
● Would like to add multiple “cases”
(different search projects that solve different problems)
● Would like to add user accounts
● Still a one-off for Silverchair
17. OpenSource Connections
Cases in Redis?
from redis import Redis
redis = Redis()
ratings = {"doc1234": "10",
           "doc532": "5"}
searchQuery = "hdmi cables"
# hset(..., mapping=...) stores the whole dict as one hash
redis.hset(searchQuery, mapping=ratings)
Recall our existing implementation
“data model”
Out of the box, redis can deal with 2 levels deep:
{
“hdmi cables”: {
“doc1234”: “10”,
“doc532”: “5”
},
“ethernet cables”
...
}
Can’t add extra layer (redis hash only one layer)
{“cable site”: {
“hdmi cables”: {...},
“ethernet cables”: {...}
},
“laws site”: {...}}
18. OpenSource Connections
Time to give up Redis?
“All problems in computer science can be solved by another level of indirection” -- David Wheeler
Crazy Idea: Add dynamic prefix to query keys to indicate case, i.e.:
{
“case_cablestore_hdmi cables”: {
“doc1234”: “10”,
“doc532”: “5”
},
“case_cablestore_ethernet cables”: {
… },
“case_statelaws_car tax”: {
…}
}
Queries for “Cable Store” case
Query for “State Laws” case
To fetch:
redis.keys("case_cablestore*")
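A rough sketch of the prefix trick, using a plain dict in place of Redis and `fnmatch` in place of the glob matching that `redis.keys()` performs. The helpers `case_key` and `keys` are hypothetical names for illustration.

```python
from fnmatch import fnmatchcase

store = {}  # stands in for Redis: key -> hash of ratings

def case_key(case_name, query):
    """Build the flat key that encodes which case a query belongs to."""
    return "case_%s_%s" % (case_name, query)

store[case_key("cablestore", "hdmi cables")] = {"doc1234": "10", "doc532": "5"}
store[case_key("cablestore", "ethernet cables")] = {}
store[case_key("statelaws", "car tax")] = {}

def keys(pattern):
    """Return the stored keys matching a glob pattern, sorted for stable output."""
    return sorted(k for k in store if fnmatchcase(k, pattern))

# All queries for the "Cable Store" case
cable_keys = keys("case_cablestore*")
```

Two caveats worth keeping in mind: a pattern like `case_cablestore*` would also match a case named `cablestore2`, and on a real server the Redis documentation warns that `KEYS` walks the whole keyspace.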
19. OpenSource Connections
Store other info about cases?
New problem: we need to store some information about cases: case name, etc.
{
“case_cablestore_hdmi cables”: {
“doc1234”: “10”,
“doc532”: “5”
},
“case_cablestore_ethernet cables”: {
… },
“case_statelaws_car tax”: {
…}
}
Where would it go here?
{
“case_cablestore”: {
“name”: “cablestore”,
“created”: “20140101”
},
“case_cablestore_query_hdmi cables”: {
“doc1234”: “10”,
“doc532”: “5”
},
“case_cablestore_query_ethernet cables”:
{
… },
“case_statelaws_query_car tax”: {
…}
}
20. OpenSource Connections
Oh but let’s add users
Extrapolating on past patterns
{
“user_doug”: {
“name”: “Doug”,
“created_date”: “20140101”
},
“user_doug_case_cablestore”: {
“name”: “cablestore”,
“created_date”: “20140101”
},
“user_doug_case_cablestore_query_hdmi cables”: {
“doc1234”: “10”,
“doc532”: “5”
},
“user_doug_case_cablestore_query_ethernet cables”:
{
… },
“user_tom_case_statelaws_query_car tax”: {
…}
}
image: Rage Wallpaper from Flickr user Thoth God of Knowledge
You right now!
21. OpenSource Connections
Step Back
We ask ourselves: Is this tool a
product? Is it useful outside of this
customer?
What level of software engineering helps us move forward?
● Migrate to RDBMS?
● “NoSQL” options?
● Clean up use of Redis somehow?
22. OpenSource Connections
SubRedis
Operationalizes hierarchy inside of redis
https://github.com/softwaredoug/subredis
from redis import Redis
from subredis import SubRedis
redis = Redis()
sr = SubRedis("case_%s" % caseId, redis)
ratings = {"doc1234": "10",
           "doc532": "5"}
searchQuery = "hdmi cables"
sr.hsetall(searchQuery, ratings)
Create a redis sandbox for this case
Interact with this case’s queries with redis
sandbox specific to that case
Behind the scenes, subredis
prepends the case_1 prefix to
every key it reads or writes
23. OpenSource Connections
SubRedis == composable
userSr = SubRedis("user_%s" % userId, redis)
caseSr = SubRedis("case_%s" % caseId, userSr)
# Sandbox redis for queries about user
ratings = {"doc1234": "10",
           "doc532": "5"}
searchQuery = "hdmi cables"
caseSr.hsetall(searchQuery, ratings)
SubRedis takes any Redis like
thing, and works safely in that
sandbox
Now working on sandbox, within a sandbox
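The sandbox-within-a-sandbox idea can be sketched with a tiny prefixing wrapper. `DictStore` and `PrefixedStore` are hypothetical classes written for this illustration; this is not SubRedis's actual implementation, just the general technique of prepending a prefix to every key and delegating to whatever store sits underneath.

```python
class DictStore(object):
    """Hypothetical bottom layer: a dict standing in for Redis."""

    def __init__(self):
        self.data = {}

    def set(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


class PrefixedStore(object):
    """Hypothetical wrapper: prepends a prefix, then delegates to `backend`.

    `backend` can be a DictStore or another PrefixedStore, which is what
    makes the wrappers composable.
    """

    def __init__(self, prefix, backend):
        self.prefix = prefix + "_"
        self.backend = backend

    def set(self, key, value):
        self.backend.set(self.prefix + key, value)

    def get(self, key):
        return self.backend.get(self.prefix + key)


root = DictStore()
user_store = PrefixedStore("user_1", root)
case_store = PrefixedStore("case_1", user_store)  # sandbox within a sandbox

case_store.set("hdmi cables", {"doc1234": "10", "doc532": "5"})
flat_keys = list(root.data)
```

The caller only ever sees "hdmi cables"; the flat key that lands in the bottom store carries both prefixes.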
24. OpenSource Connections
Does something reasonable under the hood
{
“user_1_name”: “Doug”,
“user_1_created_date”: “20140101”,
“user_1_case_1_name”: “cablestore”,
“user_1_case_1_hdmi cables”: {
“doc1234”: “10”,
“doc532”: “5”
},
“user_2_name”: “Rena”,
...
}
Diagram: all of Redis contains the user_1 subredis, which contains the case_1 subredis
25. OpenSource Connections
We reflect again
● Ok we tried this out as a product. Launched.
● Paid off *some* tech debt, but wtf are we
doing
● Works well enough, we’ve got a bunch of
new features, forge ahead
26. OpenSource Connections
We reflect again
● We have real customers
● Our backend is evolving away from simple
key-value storage
○ user accounts? users that share cases? stored
search snapshots? etc etc
27. OpenSource Connections
Attack of the relational
Given our current set of tools, how would we solve the problem
“case X can be shared between multiple users”?
{
“user_1_name”: “Doug”,
“user_1_created_date”: “20140101”,
“user_1_case_1_name”: “cablestore”,
“user_1_case_1_hdmi cables”: {
“doc1234”: “10”,
“doc532”: “5”
},
“user_2_name”: “Rena”,
“user_2_case_1_name”: “cablestore”,
“user_2_case_1_hdmi cables”: {
“doc1234”: “10”,
“doc532”: “5”
},
}
Could duplicate the data?
This stinks!
● Updates require visiting many (every?)
user, looking for this case
● Bloated database
Duplicate the data?
28. OpenSource Connections
Attack of the relational
Given our current set of tools, how would we solve the problem
“case X can be shared between multiple users”?
{
“user_1_name”: “Doug”,
“user_1_created_date”: “20140101”,
“user_1_cases”: [1, ...],
“case_1_name”: “cablestore”,
“case_1_hdmi cables”: {
“doc1234”: “10”,
“doc532”: “5”
},
“user_2_name”: “Rena”,
“user_2_cases”: [1, ...],
...
}
Diagram: User 1 and User 2 each store a list of owned cases pointing at Case 1
Break out cases to a top-level record?
29. OpenSource Connections
SubRedisRelational?
{
“user_1_name”: “Doug”,
“user_1_created_date”: “20140101”,
“user_1_cases”: [1, ...],
“case_1_name”: “cablestore”,
“case_1_hdmi cables”: {
“doc1234”: “10”,
“doc532”: “5”
},
“user_2_name”: “Rena”,
“user_2_cases”: [1, ...],
...
}
We’ve actually just normalized our data.
Why was this good?
● We want to update case 1 in isolation
without anomalies
● We don’t want to visit every user to
update case 1!
● We want to avoid duplication
We just made our “NoSQL” database a bit relational
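Why the normalized layout helps, sketched with a plain dict standing in for Redis. The helper `case_names_for` is a hypothetical name; the key layout follows the slide.

```python
store = {
    "user_1_name": "Doug",
    "user_1_cases": [1],          # list of owned case ids
    "user_2_name": "Rena",
    "user_2_cases": [1],          # same case, shared
    "case_1_name": "cablestore",  # the case lives at the top level, once
    "case_1_hdmi cables": {"doc1234": "10", "doc532": "5"},
}

def case_names_for(user_id):
    """Follow a user's list of case ids to the top-level case records."""
    case_ids = store["user_%d_cases" % user_id]
    return [store["case_%d_name" % case_id] for case_id in case_ids]

# Update case 1 in exactly one place...
store["case_1_name"] = "cable store (renamed)"

# ...and every user who shares it sees the change.
doug_cases = case_names_for(1)
rena_cases = case_names_for(2)
```

The price is that `case_names_for` is a hand-written join: the lookup a relational database would do for us.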
30. OpenSource Connections
Other Problems
● Simple CRUD tasks like “delete a case”
need to be coded up
● We’re managing our own record ids
● Is any of this atomic? does it occur in
isolation?
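What "delete a case" looks like when it has to be coded by hand, again with a dict standing in for Redis. `delete_case` is a hypothetical helper written for this illustration, assuming the key layout from the earlier slides.

```python
def delete_case(store, case_id):
    """Remove a case's own keys, then unlink it from every user's case list."""
    prefix = "case_%d_" % case_id
    # Step 1: drop every key that belongs to the case.
    for key in [k for k in store if k.startswith(prefix)]:
        del store[key]
    # Step 2: visit every user and remove the dangling case id.
    for key in list(store):
        if key.startswith("user_") and key.endswith("_cases"):
            store[key] = [cid for cid in store[key] if cid != case_id]


store = {
    "user_1_cases": [1, 2],
    "user_2_cases": [1],
    "case_1_name": "cablestore",
    "case_1_hdmi cables": {"doc1234": "10"},
    "case_2_name": "statelaws",
}

delete_case(store, 1)
```

Nothing ties step 1 and step 2 together: a crash in between leaves users pointing at a case that no longer exists, which is exactly the atomicity question on the slide.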
32. OpenSource Connections
Irony
● This is the exact situation we warn clients
about in our (No)SQL Architect Roles.
○ Relational == General Purpose
○ Many-many, many-one, one-many, etc
○ Relational == consistent tooling
○ NoSQL == solve specific problems well
33. OpenSource Connections
So we went relational!
● Took advantage of great tooling: MySQL,
Sqlalchemy (ORM), Alembic (migrations)
● Modeled our data relationships exactly like
we needed them to be modeled
34. OpenSource Connections
Map db to Python classes
from sqlalchemy import Column, ForeignKey, Integer, String
# SQLAlchemy 1.4+; older versions import from sqlalchemy.ext.declarative
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class SearchQuery(Base):
    __tablename__ = 'query'
    id = Column(Integer, primary_key=True)
    search_string = Column(String(255))  # MySQL needs a VARCHAR length
    ratings = relationship("QueryRating")

class QueryRating(Base):
    __tablename__ = 'rating'
    id = Column(Integer, primary_key=True)
    # foreign key so relationship() can join ratings to their query
    query_id = Column(Integer, ForeignKey('query.id'))
    doc_id = Column(String(255))
    rating = Column(Integer)
Can model my domain in coder-friendly classes
36. OpenSource Connections
Migrations are good
How do you upgrade your database to add/move/reorganize data?
● With Redis this was always done manually/scripted
● Migrations with an RDBMS are a very robust/well-understood way to
handle this
SQLAlchemy has “alembic” to help:
alembic revision --autogenerate -m "name for tries"
alembic upgrade head
alembic downgrade 0ab51c25c
37. OpenSource Connections
Modeling Users ←→ Cases
from sqlalchemy import Column, ForeignKey, Integer, Table
from sqlalchemy.orm import relationship

# Base is the same declarative base the other models use
association_table = Table('case2users', Base.metadata,
    Column('case_id', Integer, ForeignKey('case.id')),
    Column('user_id', Integer, ForeignKey('user.id'))
)

class User(Base):
    __tablename__ = 'user'
    id = Column(Integer, primary_key=True)
    cases = relationship("Case",
                         secondary=association_table)

class Case(Base):
    __tablename__ = 'case'
    id = Column(Integer, primary_key=True)
Can model many-many relationships
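The same many-many shape in plain SQL, using Python's built-in `sqlite3` so it runs without MySQL or SQLAlchemy (a swap made only for this sketch). Table names are pluralized here to avoid quoting the SQL keyword CASE, and the `name` and `case_name` columns are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE cases (id INTEGER PRIMARY KEY, case_name TEXT);
-- association table: one row per (case, user) pairing
CREATE TABLE case2users (
    case_id INTEGER REFERENCES cases(id),
    user_id INTEGER REFERENCES users(id)
);
INSERT INTO users VALUES (1, 'Doug'), (2, 'Rena');
INSERT INTO cases VALUES (1, 'cablestore');
INSERT INTO case2users VALUES (1, 1), (1, 2);  -- case 1 shared by both users
""")

# Update the shared case once; no need to visit each user.
conn.execute("UPDATE cases SET case_name = ? WHERE id = ?", ("cable store", 1))

rows = conn.execute("""
    SELECT users.name, cases.case_name
    FROM users
    JOIN case2users ON case2users.user_id = users.id
    JOIN cases ON cases.id = case2users.case_id
    ORDER BY users.name
""").fetchall()
```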
38. OpenSource Connections
Ultimate Query Flexibility
Print all cases:
for user in User.query.all():
    for case in user.cases:
        print(case.caseName)
Cases from paying members:
for user in User.query.filter(User.isPaying == True):
    for case in user.cases:
        print(case.caseName)
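For comparison, the "cases from paying members" question as a single SQL query, again on `sqlite3` so it runs standalone. The schema, the `is_paying` column and the sample rows are assumptions for illustration, not the real Quepid tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, is_paying INTEGER);
CREATE TABLE cases (id INTEGER PRIMARY KEY, case_name TEXT);
CREATE TABLE case2users (case_id INTEGER, user_id INTEGER);
INSERT INTO users VALUES (1, 'Doug', 1), (2, 'Tom', 0);
INSERT INTO cases VALUES (1, 'cablestore'), (2, 'statelaws');
INSERT INTO case2users VALUES (1, 1), (2, 2);
""")

# One declarative query replaces the nested loops and any hand-built index.
paying_cases = [row[0] for row in conn.execute("""
    SELECT DISTINCT cases.case_name
    FROM users
    JOIN case2users ON case2users.user_id = users.id
    JOIN cases ON cases.id = case2users.case_id
    WHERE users.is_paying = 1
    ORDER BY cases.case_name
""")]
```

In the key-value version, answering a new question like this would have meant another scan over every user key.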
39. OpenSource Connections
Lots of things easier
● backups
● robust hosting services (RDS)
● industrial strength ACID with flexible
querying
● 3rd-party tooling (e.g. VividCortex for MySQL)
40. OpenSource Connections
When NoSQL?
● Solve specific problems well
○ Optimize for specific query patterns
○ Full-Text Search (Elasticsearch, Solr)
○ Caching, shared data structure (Redis)
● Optimize for specific scaling problems
○ Provide a denormalized “view” of your data for
specific task