...and then it crashed

Programming the web with Python, Django and Javascript.

Using Django querysets effectively

Object Relational Mapping (ORM) systems make interacting with an SQL database much easier, but have a reputation of being inefficient and slower than raw SQL.

Using ORM effectively means understanding a little about how it queries the database. In this post, I’ll highlight ways of efficiently using the Django ORM system for medium and huge datasets.

Django querysets are lazy

A queryset in Django represents a number of rows in the database, optionally filtered by a query. For example, the following code represents all people in the database whose first name is ‘Dave’:

1
person_set = Person.objects.filter(first_name="Dave")

The above code doesn’t run any database queries. You can can take the person_set and apply additional filters, or pass it to a function, and nothing will be sent to the database. This is good, because querying the database is one of the things that significantly slows down web applications.

To fetch the data from the database, you need to iterate over the queryset:

1
2
for person in person_set:
    print(person.last_name)

Django querysets have a cache

The moment you start iterating over a queryset, all the rows matched by the queryset are fetched from the database and converted into Django models. This is called evaluation. These models are then stored by the queryset’s built-in cache, so that if you iterate over the queryset again, you don’t end up running the same query twice.

For example, the following code will only execute one database query:

1
2
3
4
5
6
7
pet_set = Pet.objects.filter(species="Dog")
# The query is executed and cached.
for pet in pet_set:
    print(pet.first_name)
# The cache is used for subsequent iteration.
for pet in pet_set:
    print(pet.last_name)

if statements trigger queryset evaluation

The most useful thing about the queryset cache is that it allows you to efficiently test if your queryset contains rows, and then only iterate over them if at least one row was found:

1
2
3
4
5
6
restaurant_set = Restaurant.objects.filter(cuisine="Indian")
# The `if` statement evaluates the queryset.
if restaurant_set:
    # The cache is used for subsequent iteration.
    for restaurant in restaurant_set:
        print(restaurant.name)

The queryset cache is a problem if you don’t need all the results

Sometimes, rather than iterating over results, you just want to see if at least one result exists. In that case, simply using an if statement on the queryset will still fully evaluate the queryset and populate it’s cache, even if you never plan on using those results!

1
2
3
4
5
6
city_set = City.objects.filter(name="Cambridge")
# The `if` statement evaluates the queryset.
if city_set:
    # We don't need the results of the queryset here, but the
    # ORM still fetched all the rows!
    print("At least one city called Cambridge still stands!")

To avoid this, use the exists() method to check whether at least one matching row was found:

1
2
3
4
5
6
tree_set = Tree.objects.filter(type="deciduous")
# The `exists()` check avoids populating the queryset cache.
if tree_set.exists():
    # No rows were fetched from the database, so we save on
    # bandwidth and memory.
    print("There are still hardwood trees in the world!")

The queryset cache is a problem if your queryset is huge

If you’re dealing with thousands of rows of data, fetching them all into memory at once can be very wasteful. Even worse, huge querysets can lock up server processes, causing your entire web application to grind to a halt.

To avoid populating the queryset cache, but to still iterate over all your results, use the iterator() method to fetch the data in chunks, and throw away old rows when they’ve been processed.

1
2
3
4
5
star_set = Star.objects.all()
# The `iterator()` method ensures only a few rows are fetched from
# the database at a time, saving memory.
for star in star_set.iterator():
    print(star.name)

Of course, using the iterator() method to avoid populating the queryset cache means that iterating over the same queryset again will execute another query. So use iterator() with caution, and make sure that your code is organised to avoid repeated evaluation of the same huge queryset.

if statements are a problem if your queryset is huge

As shown previously, the queryset cache is great for combining an if statement with a for statement, allowing conditional iteration over a queryset. For huge querysets, however, populating the queryset cache is not an option.

The simplest solution is to combine exists() with iterator(), avoiding populating the queryset cache at the expense of running two database queries.

1
2
3
4
5
6
molecule_set = Molecule.objects.all()
# One database query to test if any rows exist.
if molecule_set.exists():
    # Another database query to start fetching the rows in batches.
    for molecule in molecule_set.iterator():
        print(molecule.velocity)

A more complex solution is to make use of Python’s advanced iteration methods to take a peek at the first item in the iterator() before deciding whether to continue iteration.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
atom_set = Atom.objects.all()
# One database query to start fetching the rows in batches.
atom_iterator = atom_set.iterator()
# Peek at the first item in the iterator.
try:
    first_atom = next(atom_iterator)
except StopIteration:
    # No rows were found, so do nothing.
    pass
else:
    # At least one row was found, so iterate over
    # all the rows, including the first one.
    from itertools import chain
    for atom in chain([first_atom], atom_iterator):
        print(atom.mass)

Beware of naive optimisation

The queryset cache exists in order to reduce the number of database queries made by your application, and under normal usage will ensure that your database is only queried when necessary.

Using the exists() and iterator() methods allow you to optimize the memory usage of your application. However, because they don’t populate the queryset cache, they can lead to extra database queries.

So code carefully, and if things start to slow down, take a look at the bottlenecks in your code, and see if a little queryset optimisation might help things along.

Comments