Object Relational Mapping (ORM) systems make interacting with an SQL database much easier, but have a reputation for being inefficient and slower than raw SQL.
Using an ORM effectively means understanding a little about how it queries the database. In this post, I’ll highlight ways of using the Django ORM efficiently for medium-sized and huge datasets.
Django querysets are lazy
A queryset in Django represents a number of rows in the database, optionally filtered by a query. For example, the following code represents all people in the database whose first name is ‘Dave’:
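A minimal sketch of such a queryset, assuming a Person model with a first_name field:

```python
# No query is run here; this only builds a description of the rows we want.
person_set = Person.objects.filter(first_name="Dave")
```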
The above code doesn’t run any database queries. You can take the
person_set and apply additional filters, or pass it to a function, and nothing will be sent to the database. This is good, because querying the database is one of the things that significantly slows down web applications.
To fetch the data from the database, you need to iterate over the queryset:
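For example, with the person_set queryset from above (last_name is an assumed field on Person):

```python
# Iteration triggers the query; each row becomes a Person instance.
for person in person_set:
    print(person.last_name)
```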
Django querysets have a cache
The moment you start iterating over a queryset, all the rows matched by the queryset are fetched from the database and converted into Django models. This is called evaluation. These models are then stored in the queryset’s built-in cache, so that if you iterate over the queryset again, you don’t end up running the same query twice.
For example, the following code will only execute one database query:
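A sketch, again assuming a Person model with first_name and last_name fields:

```python
person_set = Person.objects.filter(first_name="Dave")

# The database is hit here, and the results are cached...
for person in person_set:
    print(person.last_name)

# ...so this second loop is served entirely from the cache.
for person in person_set:
    print(person.last_name)
```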
if statements trigger queryset evaluation
The most useful thing about the queryset cache is that it allows you to efficiently test if your queryset contains rows, and then only iterate over them if at least one row was found:
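A sketch of the pattern, with the same assumed Person model:

```python
person_set = Person.objects.filter(first_name="Dave")

if person_set:
    # The if statement above evaluated the queryset and populated its
    # cache, so this loop runs no further queries.
    for person in person_set:
        print(person.last_name)
else:
    print("No Daves found")
```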
The queryset cache is a problem if you don’t need all the results
Sometimes, rather than iterating over results, you just want to see if at least one result exists. In that case, simply using an
if statement on the queryset will still fully evaluate the queryset and populate its cache, even if you never plan on using those results!
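For example (Person model assumed as before):

```python
person_set = Person.objects.filter(first_name="Dave")

# This fetches *every* matching row into memory,
# just to check whether one exists.
if person_set:
    print("There is at least one Dave")
```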
To avoid this, use the
exists() method to check whether at least one matching row was found:
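A sketch using exists(), with the same assumed model:

```python
person_set = Person.objects.filter(first_name="Dave")

# exists() issues a cheap query (roughly SELECT ... LIMIT 1)
# and fetches no model instances into memory.
if person_set.exists():
    print("There is at least one Dave")
```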
The queryset cache is a problem if your queryset is huge
If you’re dealing with thousands of rows of data, fetching them all into memory at once can be very wasteful. Even worse, huge querysets can lock up server processes, causing your entire web application to grind to a halt.
To avoid populating the queryset cache, but to still iterate over all your results, use the
iterator() method to fetch the data in chunks, and throw away old rows when they’ve been processed.
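A sketch of the pattern, again assuming a Person model:

```python
person_set = Person.objects.all()

# iterator() streams rows from the database without
# populating the queryset cache.
for person in person_set.iterator():
    print(person.last_name)
```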
Of course, using the
iterator() method to avoid populating the queryset cache means that iterating over the same queryset again will execute another query. So use
iterator() with caution, and make sure that your code is organised to avoid repeated evaluation of the same huge queryset.
if statements are a problem if your queryset is huge
As shown previously, the queryset cache is great for combining an
if statement with a
for statement, allowing conditional iteration over a queryset. For huge querysets, however, populating the queryset cache is not an option.
The simplest solution is to combine
exists() and iterator(), avoiding the queryset cache at the expense of running two database queries.
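A sketch of this two-query approach, with the same assumed Person model:

```python
person_set = Person.objects.all()

if person_set.exists():
    # First query: cheap existence check.
    # Second query: stream the rows without caching them.
    for person in person_set.iterator():
        print(person.last_name)
else:
    print("No people found")
```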
A more complex solution is to make use of Python’s advanced iteration methods to take a peek at the first item in the
iterator() before deciding whether to continue iteration.
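One way to sketch this peek-ahead pattern is with itertools.chain, again assuming a Person model:

```python
import itertools

person_set = Person.objects.all()
iterator = person_set.iterator()

try:
    # Fetch the first row, if there is one. Only a single query is run.
    first_person = next(iterator)
except StopIteration:
    print("No people found")
else:
    print("There is at least one person")
    # Chain the first row back onto the iterator,
    # so the loop still sees every row exactly once.
    for person in itertools.chain([first_person], iterator):
        print(person.last_name)
```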
Beware of naive optimisation
The queryset cache exists in order to reduce the number of database queries made by your application, and under normal usage will ensure that your database is only queried when necessary.
The exists() and iterator() methods allow you to optimise the database and memory usage of your application. However, because they don’t populate the queryset cache, they can lead to extra database queries.
So code carefully, and if things start to slow down, take a look at the bottlenecks in your code, and see if a little queryset optimisation might help things along.