Unfortunately, the process model of Gunicorn makes it unsuitable for running production Python sites on Heroku.
Gunicorn uses a pre-forking process model by default. This means that network requests are handed off to a pool of worker processes, and that these worker processes take care of reading and writing the entire HTTP request to the client. If the client has a fast network connection, the entire request/response cycle takes a fraction of a second. However, if the client is slow (or deliberately misbehaving), the request can take much longer to complete.
Because Gunicorn has a relatively small (2x CPU cores) pool of workers, it can only handle a small number of concurrent requests. If all the worker processes become tied up waiting for network traffic, the entire server will become unresponsive. To the outside world, your web application will cease to exist.
For this reason, the Gunicorn docs strongly recommend running it behind a buffering reverse proxy, like Nginx. This means that the entire request and response will be buffered, protecting Gunicorn from delays caused by a slow network.
However, while Heroku does provide limited request/response buffering, large file uploads/downloads can still bypass the buffer. This means that your site is still trivially vulnerable to accidental (or deliberate) Denial of Service (DoS) attacks.
Waitress is a pure-Python HTTP server that supports request and response buffering, using in-memory and temporary file buffers to completely shield your Python application from slow network clients.
Waitress can be installed in your Heroku app using pip:
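Something like this will do the trick (remember to freeze Waitress into your requirements.txt, so that Heroku installs it for you):

```
pip install waitress
pip freeze > requirements.txt
```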
And then added to your Procfile like this:
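A Procfile entry along these lines should work (the myproject.wsgi module name is a placeholder for your own WSGI module):

```
web: waitress-serve --port=$PORT myproject.wsgi:application
```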
The Gunicorn docs suggest using an alternative async worker class when serving requests directly to the internet. This avoids the problem of slow network clients by allowing thousands of asynchronous HTTP requests to be processed in parallel.
Unfortunately, this approach introduces a different problem. The Django ORM will open a separate database connection for each request, quickly leading to thousands of simultaneous database connections being created. On the cheaper Heroku Postgres plans, this can easily cause requests to fail due to refused database connections.
By using a fixed pool of worker processes, Waitress makes it much easier to control the number of database connections being opened by Django, while still protecting you against slow network traffic.
For an easy quickstart, and a more in-depth guide to running Django apps on Heroku, please check out the django-herokuapp project on GitHub.
Python 2 provides the str.decode() and unicode.encode() methods to convert whole strings between the builtin unicode and str types.
As an example, here’s a simple way to load the contents of a utf-16 file, remove all vertical tab codepoints, and write it out as utf-8. (This can be important when working with broken XML.)
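A minimal sketch of that approach (the file paths are placeholders):

```python
def convert(in_path, out_path):
    # Read the entire file into memory and decode it from utf-16.
    with open(in_path, "rb") as in_file:
        text = in_file.read().decode("utf-16")
    # Remove all vertical tab codepoints (U+000B).
    text = text.replace(u"\u000b", u"")
    # Encode the cleaned-up text as utf-8 and write it back out.
    with open(out_path, "wb") as out_file:
        out_file.write(text.encode("utf-8"))
```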
This approach works just fine unless you have to deal with really big files. At that point, loading all the data into RAM becomes a problem.
The Python standard library includes the codecs module, which allows you to incrementally move through a file, loading only a small chunk of unicode data into memory at a time.
The simplest way is to modify the above example to use the codecs.open() helper.
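A sketch of the same conversion using codecs.open() (the chunk size is arbitrary):

```python
import codecs

def convert(in_path, out_path, chunk_size=1024):
    # codecs.open() returns file objects that read and write unicode,
    # decoding and encoding incrementally behind the scenes.
    with codecs.open(in_path, "rb", encoding="utf-16") as in_file:
        with codecs.open(out_path, "wb", encoding="utf-8") as out_file:
            while True:
                chunk = in_file.read(chunk_size)
                if not chunk:
                    break
                out_file.write(chunk.replace(u"\u000b", u""))
```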
Dealing with files can get tedious. For complex processing tasks, it can be nice to just deal with iterators of unicode data.
Here’s an efficient way to read an iterator of unicode chunks from a file using iterdecode().
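One way to write it (the helper name and chunk size are my own):

```python
import codecs
import functools

def iter_unicode_chunks(path, encoding, chunk_size=1024):
    with open(path, "rb") as binary_file:
        # Read the file as an iterator of raw byte chunks...
        byte_chunks = iter(functools.partial(binary_file.read, chunk_size), b"")
        # ...and let iterdecode() turn it into an iterator of unicode chunks,
        # taking care of multi-byte codepoints that span chunk boundaries.
        for unicode_chunk in codecs.iterdecode(byte_chunks, encoding):
            yield unicode_chunk
```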
Here’s how to write an iterator of unicode chunks to a file using iterencode().
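A matching writer might look like this (again, the helper name is my own):

```python
import codecs

def write_unicode_chunks(path, unicode_chunks, encoding):
    # iterencode() lazily converts an iterator of unicode chunks
    # into an iterator of encoded byte chunks.
    with open(path, "wb") as binary_file:
        for byte_chunk in codecs.iterencode(unicode_chunks, encoding):
            binary_file.write(byte_chunk)
```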
Using these two functions, removing all vertical tab codepoints from a stream of unicode data just becomes a case of plumbing everything together.
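Plumbed together, it might look like this (helper names are illustrative):

```python
import codecs
import functools

def remove_vertical_tabs(unicode_chunks):
    # A simple unicode-to-unicode transformation step.
    for chunk in unicode_chunks:
        yield chunk.replace(u"\u000b", u"")

def convert(in_path, out_path, chunk_size=1024):
    with open(in_path, "rb") as in_file, open(out_path, "wb") as out_file:
        byte_chunks = iter(functools.partial(in_file.read, chunk_size), b"")
        unicode_chunks = codecs.iterdecode(byte_chunks, "utf-16")
        cleaned_chunks = remove_vertical_tabs(unicode_chunks)
        for byte_chunk in codecs.iterencode(cleaned_chunks, "utf-8"):
            out_file.write(byte_chunk)
```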
Why use the codecs module?

It might seem simpler to just read binary chunks from a regular file object, encoding and decoding each chunk using the standard str.decode() and unicode.encode() methods like this:
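A sketch of that tempting approach, with the bug deliberately left in:

```python
def iter_unicode_chunks(path, encoding, chunk_size=1024):
    # WARNING: this version is broken! If a multi-byte codepoint is
    # split across two chunks, decode() will raise UnicodeDecodeError.
    with open(path, "rb") as binary_file:
        while True:
            chunk = binary_file.read(chunk_size)
            if not chunk:
                break
            yield chunk.decode(encoding)
```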
Unfortunately, some unicode codepoints are encoded as more than one byte of binary data. Simply reading a chunk of bytes from a file and passing it to decode() can result in an unexpected UnicodeDecodeError if your chunk happens to split up a multi-byte codepoint.
Using the tools in codecs will help keep you safe from unpredictable crashes in production!
Python 3 makes working with unicode files a lot easier. The builtin open() function contains all the functionality you need to easily modify unicode data and switch between encodings.
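For example, a Python 3 version of the vertical-tab filter might look like this:

```python
def convert(in_path, out_path, chunk_size=1024):
    # In Python 3, the builtin open() decodes and encodes incrementally,
    # so slurping the whole file into memory is never necessary.
    with open(in_path, "r", encoding="utf-16") as in_file:
        with open(out_path, "w", encoding="utf-8") as out_file:
            for chunk in iter(lambda: in_file.read(chunk_size), ""):
                out_file.write(chunk.replace("\u000b", ""))
```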
Python 3 rules! Happy coding!
Using the Django ORM effectively means understanding a little about how it queries the database. In this post, I’ll highlight ways of efficiently using the Django ORM system for medium and huge datasets.
A queryset in Django represents a number of rows in the database, optionally filtered by a query. For example, the following code represents all people in the database whose first name is ‘Dave’:
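Assuming a simple Person model with a first_name field, it would look something like this:

```python
person_set = Person.objects.filter(first_name="Dave")
```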
The above code doesn’t run any database queries. You can take the person_set and apply additional filters, or pass it to a function, and nothing will be sent to the database. This is good, because querying the database is one of the things that significantly slows down web applications.
To fetch the data from the database, you need to iterate over the queryset:
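For example (still assuming the hypothetical Person model):

```python
for person in person_set:
    print(person.last_name)
```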
The moment you start iterating over a queryset, all the rows matched by the queryset are fetched from the database and converted into Django models. This is called evaluation. These models are then stored by the queryset’s built-in cache, so that if you iterate over the queryset again, you don’t end up running the same query twice.
For example, the following code will only execute one database query:
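Something along these lines:

```python
person_set = Person.objects.filter(first_name="Dave")

# The queryset is evaluated (one database query) on first iteration...
for person in person_set:
    print(person.last_name)

# ...and the cached results are reused here, with no second query.
for person in person_set:
    print(person.first_name)
```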
if statements trigger queryset evaluation

The most useful thing about the queryset cache is that it allows you to efficiently test if your queryset contains rows, and then only iterate over them if at least one row was found:
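For example:

```python
if person_set:
    # The if statement evaluated the queryset and populated its cache...
    print("Found some people called Dave!")
    # ...so this loop reuses the cache instead of querying again.
    for person in person_set:
        print(person.last_name)
```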
Sometimes, rather than iterating over results, you just want to see if at least one result exists. In that case, simply using an if statement on the queryset will still fully evaluate the queryset and populate its cache, even if you never plan on using those results!
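Something like this:

```python
# This fetches every matching row, just to make a boolean check!
if person_set:
    print("Found at least one person called Dave!")
```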
To avoid this, use the exists() method to check whether at least one matching row was found:
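Like so:

```python
# exists() runs a cheap "SELECT ... LIMIT 1" style query instead.
if person_set.exists():
    print("Found at least one person called Dave!")
```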
If you’re dealing with thousands of rows of data, fetching them all into memory at once can be very wasteful. Even worse, huge querysets can lock up server processes, causing your entire web application to grind to a halt.
To avoid populating the queryset cache while still iterating over all your results, use the iterator() method to fetch the data in chunks, throwing away each row once it has been processed.
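For example:

```python
# Rows are fetched in chunks and discarded once processed,
# so the queryset cache is never populated.
for person in person_set.iterator():
    print(person.last_name)
```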
Of course, using the iterator() method to avoid populating the queryset cache means that iterating over the same queryset again will execute another query. So use iterator() with caution, and make sure that your code is organised to avoid repeated evaluation of the same huge queryset.
if statements are a problem if your queryset is huge

As shown previously, the queryset cache is great for combining an if statement with a for statement, allowing conditional iteration over a queryset. For huge querysets, however, populating the queryset cache is not an option.
The simplest solution is to combine exists() with iterator(), avoiding populating the queryset cache at the expense of running two database queries.
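Like so:

```python
# Two queries: one cheap existence check, then a streaming iteration.
if person_set.exists():
    print("Found some people called Dave!")
    for person in person_set.iterator():
        print(person.last_name)
```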
A more complex solution is to make use of Python’s advanced iteration methods to take a peek at the first item in the iterator() before deciding whether to continue iteration.
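A sketch of the peeking trick as a standalone helper (the name is my own):

```python
import itertools

def peekable(iterable):
    """Return the iterator with its first item restored, or None if empty."""
    iterator = iter(iterable)
    try:
        first_item = next(iterator)
    except StopIteration:
        return None
    # Stitch the first item back onto the front of the iterator.
    return itertools.chain([first_item], iterator)
```

With a Django queryset, you would call peekable(person_set.iterator()) and only enter your loop if the result is not None, all with a single database query.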
The queryset cache exists in order to reduce the number of database queries made by your application, and under normal usage will ensure that your database is only queried when necessary.
Using the exists() and iterator() methods allows you to optimize the memory usage of your application. However, because they don’t populate the queryset cache, they can lead to extra database queries.
So code carefully, and if things start to slow down, take a look at the bottlenecks in your code, and see if a little queryset optimisation might help things along.
Javascript has no built-in notion of classical inheritance. Instead, it uses a prototype inheritance pattern, which is a little different. To confuse matters, many frameworks attempt to “fix” Javascript inheritance by making it work more like classical inheritance. The end result is a mess.
Thankfully, Javascript inheritance is actually pretty easy!
Let’s define a new class, Animal. Animals have a name, an age, and can make a noise.
You can instantiate an Animal
like this:
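A sketch of both steps together (the exact noise an Animal makes is my own invention):

```javascript
// A constructor function, plus methods defined on its prototype.
function Animal(name, age) {
    this.name = name;
    this.age = age;
}

Animal.prototype.makeNoise = function() {
    return this.name + " says grunt!";
};

// Instantiating an Animal with the new keyword.
var animal = new Animal("George", 3);
console.log(animal.makeNoise());  // prints "George says grunt!"
```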
Let’s now make a Dog, which is like an animal, but also has a breed, and makes a different noise.
You can then instantiate a Dog
like this:
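A self-contained sketch of the subclass and its instantiation (names and noises are illustrative):

```javascript
function Animal(name, age) {
    this.name = name;
    this.age = age;
}
Animal.prototype.makeNoise = function() {
    return this.name + " says grunt!";
};

function Dog(name, age, breed) {
    // Run the parent constructor on the new object.
    Animal.call(this, name, age);
    this.breed = breed;
}
// Hook up the prototype chain, then override makeNoise().
Dog.prototype = Object.create(Animal.prototype);
Dog.prototype.makeNoise = function() {
    return this.name + " says woof!";
};

var dog = new Dog("Fido", 2, "terrier");
console.log(dog.makeNoise());  // prints "Fido says woof!"
```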
While it’s possible to implement something similar to private methods and properties, it probably isn’t worth the time and effort (and performance penalty) of using them. Simply prefixing non-public methods and properties with an underscore is a good way of indicating that they’re not part of the public API.
Interfaces can be useful, but Javascript doesn’t support them. In any case, they would add to the download size of your code.
What about prototype.constructor?

If your code depends on prototype.constructor being set, then you can use the following helper method instead of calling Object.create() directly.
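A minimal sketch of such a helper (the name is my own):

```javascript
function inherits(child, parent) {
    child.prototype = Object.create(parent.prototype);
    // Object.create() leaves constructor pointing at the parent,
    // so restore it to the child constructor.
    child.prototype.constructor = child;
    return child;
}
```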
In reality, prototype.constructor isn’t very useful, so it’s probably best to just call Object.create() directly.
Javascript code has an unfortunate habit of turning into a mess of nested callbacks and copy-and-paste logic. Defining a hierarchy of helper classes is just one of the many techniques that allow you to write modular, maintainable code.
For example, consider building:
- A Validator class that implements basic checks for required fields. Subclasses could provide integer validation, date validation, password length validation, etc.
- A Model class that contains the HTTP synchronization logic.
- A Service class that provides logic for updating data within bound scopes.

A few years ago, I wrote a shim for the HTML5 <video> and <audio> tags. It was called html5media.js.
At the time, I assumed that the shim would be obsolete within a few years, just as soon as major browsers adopted a common standard and video codec. Unfortunately, the shim is still used by hundreds of thousands of people each day, and embedding video is just as confusing as ever.
Please, just save yourself a headache, and host your video on YouTube, Vimeo, or some other third party service. They employ some very clever people who’ve solved all the problems with embedding video.
Take a deep breath. In order to embed video in your site, there are four major groups of people you need to keep happy:
For the rest of this post, I’ll take you through the steps required to allow an increasing number of people to watch your video.
The simplest video embed code you can possibly use is as follows:
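Something along these lines (the file name and dimensions are placeholders):

```
<video src="video.mp4" width="640" height="360" controls preload>
```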
Congratulations! Your video will now play in:
In order to make your video work in legacy browsers, you need to add a script tag to the <head> of your document. This script, the venerable html5media.js, will provide a Flash video player fallback for legacy browsers.
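A sketch of the markup (the script URL and version number reflect the html5media CDN of the time, so check for the current one):

```
<head>
  <script src="//api.html5media.info/1.1.8/html5media.min.js"></script>
</head>

...

<video src="video.mp4" width="640" height="360" controls preload></video>
```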
Note: The syntax of the <video> tag has changed to include an explicit closing tag, to avoid confusing older browsers.
Fantastic! Your video will now play in:
At this point, the vast majority of internet users will be able to play your video. The only people who’ll be left out will be:
To allow Firefox and Opera users to view your video using their native players, you need to transcode your video into an open-source format, and embed both files in your page. I’d recommend using the free Miro Video Encoder to transcode your video to WebM format. You can then embed it using the following code:
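Something like this (file names are placeholders):

```
<video width="640" height="360" controls preload>
  <source src="video.mp4" type="video/mp4"></source>
  <source src="video.webm" type="video/webm"></source>
</video>
```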
Note: We’re adding explicit closing tags to <source> elements to avoid confusing legacy browsers.
Unbelievable! Now your video will play in:
It’s just the owners of under-powered mobile devices who’ll struggle to play your video now.
The latest mobile devices support high-resolution video, but cheap Android phones and iPhone 3GS will refuse to play anything higher-resolution than about 320 x 180 pixels. To keep these devices happy, you need to transcode your video to this lower resolution. Miro Video Encoder has a built-in iPhone 3GS setting, so just use that.
Now you can embed your video using the following code:
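A sketch with the extra low-resolution source added (the file names, and the ordering of the sources, are assumptions):

```
<video width="640" height="360" controls preload>
  <source src="video.mp4" type="video/mp4"></source>
  <source src="video.webm" type="video/webm"></source>
  <source src="video-mobile.mp4" type="video/mp4"></source>
</video>
```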
OMG! What a monster! But now everyone will be able to play your video!
The most common causes of problems are:
There’s a page full of troubleshooting information on the html5media video hosting wiki. Your problem is almost certainly covered there.
Ahahahahahahahaha!
Ahahahaha!
No.
By the end of this post, I hope you’ll be in a better position to use Git on all your software projects, and understand the benefits of doing so.
Installing Git on your system is simply a case of selecting the correct Git installer, and downloading the software onto your computer. Once you’ve got Git installed, setting up a Git repository for your software project is as simple as typing the following commands into a terminal:
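Something like the following (the project folder name is a placeholder):

```
cd your-project
git init
```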
The more frequently you commit, the easier it is to go back at a later date and understand the work you’ve been doing (and maybe even undo some of that work). Committing is easy!
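For example:

```
git add -A
git commit -m "Describe your changes here"
```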
This is why it’s a good idea to commit your work frequently. If you make a stupid coding mistake, and want to revert your code back to how it was before you broke everything, then just run the following command:
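One command that does this (it discards all uncommitted changes to tracked files, so use it deliberately):

```
git reset --hard
```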
Just type the following command to display a history of all your past commits:
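Like so:

```
git log
```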
To revert your codebase to some time in the past, simply copy the corresponding commit hash to your clipboard, close the Git log by pressing q, then type the following command:
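For example:

```
git checkout 814c219a338006492bf6f751d958461dd3e8b775
```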
(A commit hash looks a bit like this: 814c219a338006492bf6f751d958461dd3e8b775)
Once you’ve finished with the older version of your code, you can go back to the latest version by running the following command:
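Like so (assuming your main branch is called master, as was the default at the time):

```
git checkout master
```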
Alternatively, if you want to keep this older version of your code (and discard any changes you’ve made since then), simply commit it using the following commands, and keep working:
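Something like this should work (note that committing from an older checkout leaves you on a detached HEAD, so you may prefer to create a branch first with git checkout -b):

```
git add -A
git commit -m "Reverted to an older version of the project"
```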
To protect against hard drive failure, it’s a good idea to back up your code. You can either set up your own code hosting service, or save yourself some effort and get a free BitBucket account.
Once you’ve created a remote repository, you can connect your local codebase to it using the following commands:
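For example (the repository URL is a placeholder for your own):

```
git remote add origin https://bitbucket.org/your-username/your-project.git
git push -u origin master
```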
Then, whenever you’ve made a few commits that you want to push to the server, just run the following command:
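Like so:

```
git push
```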
Once you’ve put your code online, you can invite other people to work on it too. In order to get a copy of your code, they just need to run the following command:
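For example (again, with a placeholder repository URL):

```
git clone https://bitbucket.org/your-username/your-project.git
cd your-project
```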
They can then make changes, commit them, and push them to the server. In order for you to see the changes that they have made, just run the following command on your machine:
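Like so:

```
git pull
```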
Dealing with errors when you push

If you try to push your code, and you get an error message saying this:
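The exact wording varies between Git versions, but it will look something like this:

```
 ! [rejected]        master -> master (non-fast-forward)
error: failed to push some refs to 'origin'
```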
Don’t worry, just run git pull, and try pushing again.
Dealing with errors when you pull

If you try to pull some code, and get an error message saying this:
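Again, the wording varies, but the message will look something like this (the file name is illustrative):

```
CONFLICT (content): Merge conflict in some/file.txt
Automatic merge failed; fix conflicts and then commit the result.
```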
Don’t worry, this just means that two people have tried to edit the same file. Just open the conflicting file in your editor, fix the contents, and run the following commands:
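For example:

```
git add -A
git commit -m "Fixed a merge conflict"
```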
Dropbox is pretty good. However, for coding projects, Git has some key advantages:
Most Python tutorials parse XML using xml.dom.minidom, but there is a much better way…
For this example, we’ll be attempting to process a 43MB document containing 4000 books. The test data can be downloaded here. The two methods shall be tested by providing implementations of iter_authors_and_descriptions.
The minidom method is both awkward and inefficient.
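A sketch of the minidom version (the element names book, author and description are assumptions about the test data):

```python
from xml.dom import minidom

def iter_authors_and_descriptions(path):
    # minidom parses the *entire* document into memory up-front.
    doc = minidom.parse(path)
    for book in doc.getElementsByTagName("book"):
        author = book.getElementsByTagName("author")[0]
        description = book.getElementsByTagName("description")[0]
        yield author.firstChild.nodeValue, description.firstChild.nodeValue
```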
Due to loading the entire document in one chunk, minidom takes a long time to run, and uses a lot of memory.
The cElementTree method is also awkward (this is XML, after all). However, by using the iterparse() method to avoid loading the whole document into memory, a great deal more efficiency can be achieved.
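A sketch of the streaming version (again, the element names are assumptions; the import fallback lets the same code run on modern Python, where cElementTree has been folded into ElementTree):

```python
try:
    # cElementTree was the C-accelerated parser in Python 2...
    from xml.etree import cElementTree as ElementTree
except ImportError:
    # ...and in modern Python, ElementTree is accelerated by default.
    from xml.etree import ElementTree

def iter_authors_and_descriptions(path):
    author = description = None
    # iterparse() streams the document, firing an event as each element closes.
    for event, element in ElementTree.iterparse(path):
        if element.tag == "author":
            author = element.text
        elif element.tag == "description":
            description = element.text
        elif element.tag == "book":
            yield author, description
            element.clear()  # Throw away the processed subtree.
```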
The results speak for themselves. By using cElementTree, you can process XML in half the time and only use 10% of the memory.
These tests are hardly scientific, so feel free to download the code and see how it runs in your own environment. In any case, the next time your servers get melted by an XML document, consider giving cElementTree a spin.
With version 1.1 of its API, Twitter now requires authentication on every request. The result of these changes is that it is now impossible to access the Twitter API directly from the browser.
A simple solution is to write your own proxy server, which can then run on your own domain. The minimum features for a useful Twitter API proxy are:
Writing a Python/Ruby/PHP script to handle this is easy, but it’s a waste of valuable server resources. Far better to let nginx, the best caching reverse proxy server in the world, do the hard work instead.
Creating a Twitter application allows you to authenticate with the API. Just visit https://dev.twitter.com/apps and register your application with Twitter.
Once your new app is created, head over to its detail page and make a note of the consumer key and consumer secret. You’ll need these for the next step.
The easiest way to authenticate with the Twitter API is to obtain a bearer token for your proxy server, which is a simple code that can be sent as an HTTP header with every request.
To obtain your bearer token, run the following shell commands, substituting your own consumer key and consumer secret.
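The token exchange looks something like this (this was the documented OAuth2 client-credentials endpoint at the time; the key and secret values are placeholders):

```
CONSUMER_KEY="your-consumer-key"
CONSUMER_SECRET="your-consumer-secret"
curl -u "$CONSUMER_KEY:$CONSUMER_SECRET" --data "grant_type=client_credentials" "https://api.twitter.com/oauth2/token"
```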
After a few seconds, your terminal will print out a JSON string containing your bearer token. It will look something like this:
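Along these lines (the token itself is elided here):

```
{"token_type":"bearer","access_token":"AAAAAAAAAAAAAAAAAAAAA..."}
```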
Make a note of the access_token field. You’ll need this for the next step.
Simply place the following settings in your nginx configuration, adjusting paths as necessary. In particular, make sure that proxy_cache_path, server_name and root are all correct. Most important of all, replace the INSERT_YOUR_BEARER_TOKEN placeholder with the bearer token you obtained in step 2.
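A trimmed-down sketch of such a configuration (the paths, domain names and cache settings are illustrative; the original configuration was considerably more complete):

```
# Cache successful API responses on disk for a short time.
proxy_cache_path /var/cache/nginx/twitter levels=1:2 keys_zone=twitter:10m inactive=10m max_size=100m;

server {
    listen 80;
    server_name api.twitter.yourdomain.com;
    root /var/www/api.twitter.yourdomain.com;

    location / {
        # Forward all requests to the real Twitter API...
        proxy_pass https://api.twitter.com;
        proxy_set_header Host api.twitter.com;
        # ...attaching the bearer token to authenticate them.
        proxy_set_header Authorization "Bearer INSERT_YOUR_BEARER_TOKEN";

        # Cache responses briefly to stay within the API rate limits.
        proxy_cache twitter;
        proxy_cache_valid 200 1m;

        # Allow browsers on other domains to call the proxy.
        add_header Access-Control-Allow-Origin "*";
    }
}
```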
Phew! That’s it! Simply restart nginx and hit the following URL in your browser to make sure that everything is working:
https://api.twitter.yourdomain.com/1.1/search/tweets.json?q=cats