Advanced Django ORM
     techniques
 Daniel Roseman   http://blog.roseman.org.uk
About Me
• Python user for five years
• Discovered Django four years ago
• Worked full-time with Python/Django since
  2008.
• Top Django answerer on StackOverflow!
• Occasionally blog on Django, concentrating
  on efficient use of the ORM.
Contents

• Behind the scenes: models and fields
• How model relationships work
• More efficient relationships
• Other optimising techniques
Django ORM
efficiency: a story
414 queries!
How can you stop this
 happening to you?


             http://www.flickr.com/photos/m0n0/4479450696
Behind the scenes:
models and fields


               http://www.flickr.com/photos/spacesuitcatalyst/847530840
Defining a model

• Model structure initialised via metaclass
• Called when model is first defined
• Resulting model class stored in cache to
  use when instantiated
Fields

• Fields have contribute_to_class
• Adds methods, eg get_FOO_display()
• Enables use of descriptors for field access
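The metaclass and contribute_to_class steps can be imitated in plain Python. This is only a rough sketch of the idea, not Django's code: Field, ModelBase and Article below are simplified, hypothetical stand-ins.

```python
class Field(object):
    """Simplified stand-in for a model field (hypothetical, not Django's)."""

    def __init__(self, choices=None):
        self.choices = dict(choices or [])
        self.name = None

    def contribute_to_class(self, cls, name):
        # The field learns its own name and registers itself on the class.
        self.name = name
        cls._fields.append(self)
        if self.choices:
            # Add a get_FOO_display() method for fields that have choices.
            def get_display(instance, field=self):
                value = getattr(instance, field.name)
                return field.choices.get(value, value)
            setattr(cls, 'get_%s_display' % name, get_display)


class ModelBase(type):
    """Metaclass: runs once, when the class statement is executed."""

    def __new__(mcs, name, bases, attrs):
        fields = {k: v for k, v in attrs.items() if isinstance(v, Field)}
        plain = {k: v for k, v in attrs.items() if k not in fields}
        cls = super(ModelBase, mcs).__new__(mcs, name, bases, plain)
        cls._fields = []
        for field_name, field in fields.items():
            field.contribute_to_class(cls, field_name)
        return cls


class Article(ModelBase('Model', (object,), {})):
    status = Field(choices=[('d', 'Draft'), ('p', 'Published')])

    def __init__(self, status):
        self.status = status


article = Article(status='p')
print(article.get_status_display())
print([f.name for f in Article._fields])
```

The point of the sketch is that get_status_display is never written on Article by hand: the field adds it while the class is being built.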
Model metadata

•   Model._meta

•   .fields

•   .get_field(fieldname)

•   .get_all_related_objects()
Model instantiation

• Instance is populated from database initially
• Has no subsequent relationship with db
  until save
• No identity between models
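The lack of identity can be shown without Django. The sketch below uses sqlite3 and a hypothetical hand-written Foo class (not a real model) to load the same row twice; the two instances are separate objects and do not see each other's changes.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE foo (id INTEGER PRIMARY KEY, name TEXT)')
conn.execute("INSERT INTO foo (id, name) VALUES (1, 'original')")


class Foo(object):
    """Hypothetical stand-in for a model instance."""

    def __init__(self, id, name):
        self.id = id
        self.name = name

    @classmethod
    def get(cls, pk):
        # Populate a brand new instance from the database row.
        row = conn.execute(
            'SELECT id, name FROM foo WHERE id = ?', (pk,)).fetchone()
        return cls(*row)

    def save(self):
        conn.execute(
            'UPDATE foo SET name = ? WHERE id = ?', (self.name, self.id))


a = Foo.get(1)
b = Foo.get(1)      # same row, different Python object
a.name = 'changed'
a.save()
reloaded = Foo.get(1)
```

After a.save(), b still holds the old value; only an instance loaded afterwards (reloaded) sees the change.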
Querysets
• Model manager returns a queryset:
  foos = Foo.objects.all()

• Queryset is an ordered list of instances
  of a single model
• No database access yet
• Slice: foos[0]
• Iterate: {% for foo in foos %}
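Lazy evaluation can be sketched in a few lines of plain Python. LazyQuerySet and fetch_foos are hypothetical stand-ins, and only iteration is modelled here; in Django, slicing an unevaluated queryset runs its own limited query instead of filling the cache.

```python
class LazyQuerySet(object):
    """Hypothetical stand-in for a queryset; not Django's implementation."""

    def __init__(self, fetch):
        self._fetch = fetch          # callable that "hits the database"
        self._result_cache = None    # filled on first evaluation

    def __iter__(self):
        if self._result_cache is None:
            self._result_cache = list(self._fetch())
        return iter(self._result_cache)


db_hits = []


def fetch_foos():
    # Record each simulated database access.
    db_hits.append('SELECT ...')
    return ['foo1', 'foo2', 'foo3']


foos = LazyQuerySet(fetch_foos)
hits_after_creation = len(db_hits)      # nothing has run yet
first_pass = [foo for foo in foos]      # first iteration runs the fetch
second_pass = [foo for foo in foos]     # served from the cache
hits_after_iteration = len(db_hits)
```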
Where do all those
  queries come from?
• Repeated queries
• Lack of caching
• Relational lookup
• Templates as well as views
Repeated queries
    def get_absolute_url(self):
      return "%s/%s" % (
         self.category.slug,
         self.slug
      )


    Same category, but query is
    repeated for each article
Repeated queries
• Same link on every
  page

• Dynamic, so can't
  go in urlconf

• Could be cached
  or memoized
Relationships




        http://www.flickr.com/photos/katietegtmeyer/124315322
Relational lookups

• Forwards:
  bar.foo.field



• Backwards:
  foo.bar_set.all()
Example models
from django.db import models


class Foo(models.Model):
    name = models.CharField(max_length=10)


class Bar(models.Model):
    name = models.CharField(max_length=10)
    foo = models.ForeignKey(Foo)
Forwards relationship

>>> bar = Bar.objects.all()[0]
>>> bar.__dict__
{'id': 1, 'foo_id': 1, 'name': u'item1'}
Forwards relationship
>>> bar.foo.name
u'item1'
>>> bar.__dict__
{'_foo_cache': <Foo: Foo object>, 'id': 1,
'foo_id': 1, 'name': u'item1'}
Forwards relationships
• Relational access implemented via a
    descriptor:
    django.db.models.fields.related.
    ReverseSingleRelatedObjectDescriptor

•   __get__ tries to access _foo_cache

• If it doesn't exist, does lookup and creates
    cache
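The descriptor behaviour can be sketched in plain Python. ForwardDescriptor, load_foo and the Foo and Bar classes below are hypothetical stand-ins, not Django's implementation; they only mirror the _foo_cache idea.

```python
class ForwardDescriptor(object):
    """Hypothetical, simplified forward-relation descriptor."""

    def __init__(self, name, loader):
        self.id_attr = '%s_id' % name         # e.g. foo_id
        self.cache_attr = '_%s_cache' % name  # e.g. _foo_cache
        self.loader = loader

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        try:
            # Use the cached related object if it is already there.
            return getattr(instance, self.cache_attr)
        except AttributeError:
            # Otherwise do the lookup and create the cache.
            related = self.loader(getattr(instance, self.id_attr))
            setattr(instance, self.cache_attr, related)
            return related


class Foo(object):
    def __init__(self, id, name):
        self.id = id
        self.name = name


FOO_ROWS = {1: 'item1'}
lookups = []


def load_foo(pk):
    # Stands in for a database query; records every call.
    lookups.append(pk)
    return Foo(pk, FOO_ROWS[pk])


class Bar(object):
    foo = ForwardDescriptor('foo', load_foo)

    def __init__(self, id, name, foo_id):
        self.id = id
        self.name = name
        self.foo_id = foo_id


bar = Bar(1, 'item1', 1)
cached_before = '_foo_cache' in bar.__dict__
first = bar.foo
second = bar.foo
cached_after = '_foo_cache' in bar.__dict__
```

The second access does no lookup. A different Bar instance pointing at the same foo_id would still trigger its own lookup, because the cache lives on the instance.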
select_related
• Automatically follows foreign keys in SQL
  query
• Prepopulates _foo_cache
• Doesn't follow null=True relationships by
  default
• Makes query more expensive, so be sure
  you need it
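What select_related does at the SQL level can be approximated with sqlite3. This is a sketch under assumptions: the foo and bar tables and the run helper are made up for illustration, and the JOIN is written by hand rather than generated by Django (in Django itself the call would be along the lines of Bar.objects.select_related('foo')).

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE foo (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE bar (id INTEGER PRIMARY KEY, name TEXT,
                      foo_id INTEGER REFERENCES foo (id));
    INSERT INTO foo VALUES (1, 'first'), (2, 'second');
    INSERT INTO bar VALUES (1, 'a', 1), (2, 'b', 1), (3, 'c', 2);
""")

queries = []


def run(sql, params=()):
    # Hypothetical helper: record each statement, then execute it.
    queries.append(sql)
    return conn.execute(sql, params).fetchall()


# Without the join: one query for the bars, then one more per bar.
naive = []
for bar_id, bar_name, foo_id in run(
        'SELECT id, name, foo_id FROM bar ORDER BY id'):
    foo_name = run('SELECT name FROM foo WHERE id = ?', (foo_id,))[0][0]
    naive.append((bar_name, foo_name))
naive_count = len(queries)

# With the join: the foreign key is followed inside a single query.
del queries[:]
joined = run('SELECT bar.name, foo.name FROM bar '
             'INNER JOIN foo ON bar.foo_id = foo.id ORDER BY bar.id')
joined_count = len(queries)
```

Both approaches return the same pairs; the joined version trades several small queries for one larger one.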
Backwards relationships
{% for foo in my_foos %}
 {% for bar in foo.bar_set.all %}
  {{ bar.name }}
 {% endfor %}
{% endfor %}
Backwards relationships
• One query per foo
• If you iterate over bar_set again, you
  generate a new set of db hits
• No _foo_cache
• select_related does not work here
Optimising backwards
    relationships

• Get all related objects at once
• Sort by ID of parent object
• Then cache in hidden attribute as with
  select_related
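# Query 1: the parent objects (placeholder filter).
qs = Foo.objects.filter(criteria=whatever)
obj_dict = dict((obj.id, obj) for obj in qs)
# Query 2: every Bar belonging to any of those parents.
objects = Bar.objects.filter(foo__in=qs)
# Group the children by the ID of their parent.
relation_dict = {}
for obj in objects:
    relation_dict.setdefault(obj.foo_id, []).append(obj)
# Cache each group on its parent; parents without children get [].
for id, obj in obj_dict.items():
    obj._related = relation_dict.get(id, [])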
Optimising backwards
[{'time': '0.000', 'sql': u'SELECT
"foobar_foo"."id", "foobar_foo"."name" FROM
"foobar_foo"'},
{'time': '0.000', 'sql': u'SELECT
"foobar_bar"."id", "foobar_bar"."name",
"foobar_bar"."foo_id" FROM "foobar_bar"
WHERE "foobar_bar"."foo_id" IN (SELECT
U0."id" FROM "foobar_foo" U0)'}]
Optimising backwards

• Still quite expensive, as can mean large
  dependent subquery – MySQL in particular
  very bad at these
• But now just two queries instead of n
• Not automatic – need to remember to use
  _related attribute
Generic relations
• Foreign key to ContentType, object_id
• Descriptor to enable direct access
• Iterating through creates n+m queries
  (n = number of source objects,
  m = number of different content types)
• ContentType objects automatically cached
• Forwards relationship creates _foo_cache
• but select_related doesn't work
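from django.contrib.contenttypes.models import ContentType

# Collect the object IDs needed for each content type.
generics = {}
for item in queryset:
    generics.setdefault(item.content_type_id, set()).add(item.object_id)
# One query for all the content types involved.
content_types = ContentType.objects.in_bulk(generics.keys())
# One query per content type for the related objects.
relations = {}
for ct, fk_list in generics.items():
    ct_model = content_types[ct].model_class()
    relations[ct] = ct_model.objects.in_bulk(list(fk_list))
# Prime the cache (assumes the generic key is named content_object).
for item in queryset:
    setattr(item, '_content_object_cache',
            relations[item.content_type_id][item.object_id])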
Other optimising
  techniques
Memoizing
• Cache property on first access
• Can cache within instance, if multiple
  accesses within same request
def get_expensive_items(self):
    if not hasattr(self, '_cache'):
        self._cache = self.expensive_op()
    return self._cache
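To see the memoizing in action, the method can be dropped into a small class. Report and its call counter are hypothetical, added only to show that the expensive operation runs once per instance.

```python
class Report(object):
    """Hypothetical class used only to demonstrate the pattern."""

    def __init__(self):
        self.calls = 0

    def expensive_op(self):
        # Stands in for a slow query or calculation.
        self.calls += 1
        return [n * n for n in range(5)]

    def get_expensive_items(self):
        if not hasattr(self, '_cache'):
            self._cache = self.expensive_op()
        return self._cache


report = Report()
first = report.get_expensive_items()
second = report.get_expensive_items()
```

The cache lives on the instance, so it lasts only as long as that object does, which typically means a single request.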
DB Indexes

• Pay attention to slow query log and
  debug toolbar output
• Add extra indexes where necessary -
  especially for multiple-column lookup
• Use EXPLAIN
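The effect of a multiple-column index can be checked with the database's own query planner. The sketch below uses SQLite's EXPLAIN QUERY PLAN because it ships with Python; the article table, the index name and the plan helper are assumptions for illustration, and MySQL's EXPLAIN output looks different.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE article (id INTEGER PRIMARY KEY, '
    'category_id INTEGER, slug TEXT, title TEXT)')


def plan(sql):
    # Hypothetical helper: return the planner's description of a query.
    rows = conn.execute('EXPLAIN QUERY PLAN ' + sql).fetchall()
    return ' '.join(row[-1] for row in rows)


lookup = "SELECT title FROM article WHERE category_id = 3 AND slug = 'orm'"
before = plan(lookup)

# Add an index covering both columns used in the lookup.
conn.execute(
    'CREATE INDEX article_category_slug ON article (category_id, slug)')
after = plan(lookup)

print(before)
print(after)
```

Before the index exists the plan describes a scan of the whole table; afterwards it names the new index.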
Outsourcing

• Does all the logic need to go in the web
  app?
• Services - via eg Piston
• Message queues
• Distributed tasks, eg Celery
Summary

• Understand where queries are coming
  from
• Optimise where necessary, within Django
  or in the database
• and...
PROFILE
Daniel Roseman

http://blog.roseman.org.uk


Editor's Notes

  • #3 (background: montage of Limmud, rosemanblog, Capital, Classic, Heart, GlassesDirect)
  • #4 Some of same ideas in Guido's Appstats talk this morning
  • #9 It's a model, in a field, geddit?
  • #10 For more, see Marty Alchin, Pro Django (Apress)
  • #11 descriptors used especially in related objects - see later
  • #12 Very useful for introspection and working out what's going on
  • #13 explain identity: multiple instances relating to same model row aren't the same object, changes made to one don't reflect the other; even saving one with new values won't be reflected in others.
  • #14 Update, Aggregates, Q, F
  • #17 Find repeated queries with my branch of the django-debug-toolbar, or SimonW's original query debug middleware
  • #21 Actually in 1.2 there's an extra _state object in __dict__, which is used for the multiple DB support (which I'm not covering here).
  • #23 Lack of model identity means that accessing the related item on one instance does not cause cache to be created on other instances that might reference the same db row
  • #26 Note: backwards cache does work on OneToOne as of 1.2
  • #38 MySQL EXPLAIN output:
    +----+--------------------+-----------+-----------------+---------------+---------+---------+------+------+-------------+
    | id | select_type        | table     | type            | possible_keys | key     | key_len | ref  | rows | Extra       |
    +----+--------------------+-----------+-----------------+---------------+---------+---------+------+------+-------------+
    |  1 | PRIMARY            | sandy_bar | ALL             | NULL          | NULL    | NULL    | NULL |  100 | Using where |
    |  2 | DEPENDENT SUBQUERY | U0        | unique_subquery | PRIMARY       | PRIMARY | 4       | func |    1 | Using where |
    +----+--------------------+-----------+-----------------+---------------+---------+---------+------+------+-------------+