...and then it crashed

Programming the web with Python, Django and Javascript.

Don’t use Gunicorn to host your Django sites on Heroku

Gunicorn is a pure-Python HTTP server that's widely used for deploying Django (and other Python) sites in production. Heroku is an excellent Platform as a Service (PaaS) provider that will host any Python HTTP application, and recommends using Gunicorn to power your apps.

Unfortunately, the process model of Gunicorn makes it unsuitable for running production Python sites on Heroku.

Gunicorn is designed to be used behind a buffering reverse proxy

Gunicorn uses a pre-forking process model by default. This means that network requests are handed off to a pool of worker processes, and that these worker processes take care of reading the entire HTTP request from the client and writing the entire response back. If the client has a fast network connection, the entire request/response cycle takes a fraction of a second. However, if the client is slow (or deliberately misbehaving), the request can take much longer to complete.

Because Gunicorn has a relatively small pool of workers (the docs suggest 2 × CPU cores + 1), it can only handle a small number of concurrent requests. If all the worker processes become tied up waiting for network traffic, the entire server becomes unresponsive. To the outside world, your web application will cease to exist.
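A sketch of that sizing arithmetic (the Gunicorn docs suggest (2 × cores) + 1 workers as a starting point):

```python
import multiprocessing

# Gunicorn's documented rule of thumb: (2 x CPU cores) + 1 workers.
# Even on an 8-core box that's only 17 workers, so a handful of slow
# clients can tie up the whole pool.
workers = multiprocessing.cpu_count() * 2 + 1
print(workers)
```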

For this reason, Gunicorn strongly recommends that it be used behind a buffering reverse proxy, such as Nginx. The proxy buffers the entire request and response, protecting Gunicorn from delays caused by a slow network.

However, while Heroku does provide limited request/response buffering, large file uploads/downloads can still bypass the buffer. This means that your site is still trivially vulnerable to accidental (or deliberate) Denial of Service (DoS) attacks.

The Waitress HTTP server protects you from slow network clients

Waitress is a pure-Python HTTP server that supports request and response buffering, using in-memory and temporary file buffers to completely shield your Python application from slow network clients.

Waitress can be installed in your Heroku app using pip:

$ pip install waitress
$ pip freeze > requirements.txt

And then added to your Procfile like this:

web: waitress-serve --port=$PORT {project_name}.wsgi:application

Why not use Gunicorn async workers?

The Gunicorn docs suggest using an alternative async worker class when serving requests directly to the internet. This avoids the problem of slow network clients by allowing thousands of asynchronous HTTP requests to be processed in parallel.

Unfortunately, this approach introduces a different problem. The Django ORM will open a separate database connection for each request, quickly leading to thousands of simultaneous database connections being created. On the cheaper Heroku Postgres plans, this can easily cause requests to fail due to refused database connections.

By using a fixed pool of worker processes, Waitress makes it much easier to control the number of database connections being opened by Django, while still protecting you against slow network traffic.

Check out django-herokuapp on GitHub

For an easy quickstart, and a more in-depth guide to running Django apps on Heroku, please check out the django-herokuapp project on GitHub.

Working with unicode streams in Python

When working with unicode in Python, the standard approach is to use the str.decode() and unicode.encode() methods to convert whole strings between the builtin unicode and str types.

As an example, here’s a simple way to load the contents of a utf-16 file, remove all vertical tab codepoints, and write it out as utf-8. (This can be important when working with broken XML.)

# Load the file contents.
with open("input.txt", "rb") as input:
    data = input.read()

# Decode binary data as utf-16.
data = data.decode("utf-16")

# Remove vertical tabs.
data = data.replace(u"\u000B", u"")

# Encode unicode data as utf-8.
data = data.encode("utf-8")

# Write the data as utf-8.
with open("output.txt", "wb") as output:
    output.write(data)

This approach works just fine unless you have to deal with really big files. At that point, loading all the data into RAM becomes a problem.

Using a streaming encoder/decoder

The Python standard library includes the codecs module, which allows you to incrementally move through a file, loading only a small chunk of unicode data into memory at a time.

The simplest way is to modify the above example to use the codecs.open() helper.

import codecs

# Open both input and output streams.
input = codecs.open("input.txt", "rb", encoding="utf-16")
output = codecs.open("output.txt", "wb", encoding="utf-8")

# Stream chunks of unicode data.
with input, output:
    while True:
        # Read a chunk of data.
        chunk = input.read(4096)
        if not chunk:
            break
        # Remove vertical tabs.
        chunk = chunk.replace(u"\u000B", u"")
        # Write the chunk of data.
        output.write(chunk)

Files are horrible… let’s use iterators!

Dealing with files can get tedious. For complex processing tasks, it can be nice to just deal with iterators of unicode data.

Here’s an efficient way to read an iterator of unicode chunks from a file using iterdecode().

from functools import partial
from codecs import iterdecode

# Returns an iterator of unicode chunks from the given path.
def iter_unicode_chunks(path, encoding):
    # Open the input file.
    with open(path, "rb") as input:
        # Convert the binary file into binary chunks.
        binary_chunks = iter(partial(input.read, 4096), b"")
        # Convert the binary chunks into unicode chunks.
        for unicode_chunk in iterdecode(binary_chunks, encoding):
            yield unicode_chunk

Here’s how to write an iterator of unicode chunks to a file using iterencode().

from codecs import iterencode

# Writes an iterator of unicode chunks to the given path.
def write_unicode_chunks(path, unicode_chunks, encoding):
    # Open the output file.
    with open(path, "wb") as output:
        # Convert the unicode chunks to binary.
        for binary_chunk in iterencode(unicode_chunks, encoding):
            output.write(binary_chunk)

Using these two functions, removing all vertical tab codepoints from a stream of unicode data just becomes a case of plumbing everything together.

# Load the unicode chunks from the file.
unicode_chunks = iter_unicode_chunks("input.txt", encoding="utf-16")

# Modify the unicode chunks.
unicode_chunks = (
    chunk.replace(u"\u000B", u"")
    for chunk
    in unicode_chunks
)

# Write the chunks to a file.
write_unicode_chunks("output.txt", unicode_chunks, encoding="utf-8")

Why even bother with the codecs module?

It might seem simpler to just read binary chunks from a regular file object, decoding and encoding each chunk using the standard str.decode() and unicode.encode() methods like this:

# BAD IDEA! DON'T DO IT THIS WAY!

# Open both input and output streams.
with open("input.txt", "rb") as input, open("output.txt", "wb") as output:
    # Iterate over chunks of binary data.
    while True:
        # Read a chunk of data.
        chunk = input.read(4096)
        if not chunk:
            break
        # UNSAFE: Decode binary data as utf-16.
        chunk = chunk.decode("utf-16")
        # Remove vertical tabs.
        chunk = chunk.replace(u"\u000B", u"")
        # Encode unicode data as utf-8.
        chunk = chunk.encode("utf-8")
        # Write the chunk of data.
        output.write(chunk)

Unfortunately, some unicode codepoints are encoded as more than one byte of binary data. Simply reading a chunk of bytes from a file and passing it to decode() can result in an unexpected UnicodeDecodeError if your chunk happens to split up a multi-byte codepoint.
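You can see this failure mode in miniature: a two-byte UTF-8 codepoint split across a chunk boundary fails to decode.

```python
# "é" encodes to two bytes in UTF-8, so this string is five bytes long.
data = u"café".encode("utf-8")

# Cut the data in the middle of the multi-byte codepoint.
broken_chunk = data[:4]

try:
    broken_chunk.decode("utf-8")
except UnicodeDecodeError:
    print("UnicodeDecodeError: chunk split a multi-byte codepoint")
```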

Using the tools in codecs will help keep you safe from unpredictable crashes in production!
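A minimal sketch of what the codecs machinery is doing for you: an incremental decoder buffers partial codepoints between calls, so even feeding it one byte at a time produces correct output.

```python
import codecs

# An incremental decoder remembers partial codepoints between calls.
decoder = codecs.getincrementaldecoder("utf-8")()
data = u"café".encode("utf-8")

# Feed the bytes one at a time; the decoder holds back the first half
# of the two-byte "é" until the second half arrives.
result = u"".join(decoder.decode(data[i:i + 1]) for i in range(len(data)))
print(result)  # café
```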

What about Python 3?

Python 3 makes working with unicode files a lot easier. The builtin open() function contains all the functionality you need to easily modify unicode data and switch between encodings.

# Open both input and output streams.
input = open("input.txt", "rt", encoding="utf-16")
output = open("output.txt", "wt", encoding="utf-8")

# Stream chunks of unicode data.
with input, output:
    while True:
        # Read a chunk of data.
        chunk = input.read(4096)
        if not chunk:
            break
        # Remove vertical tabs.
        chunk = chunk.replace("\u000B", "")
        # Write the chunk of data.
        output.write(chunk)

Python 3 rules! Happy coding!

Using Django querysets effectively

Object Relational Mapping (ORM) systems make interacting with an SQL database much easier, but they have a reputation for being inefficient and slower than raw SQL.

Using an ORM effectively means understanding a little about how it queries the database. In this post, I'll highlight ways of using the Django ORM system efficiently for medium and huge datasets.

Django querysets are lazy

A queryset in Django represents a number of rows in the database, optionally filtered by a query. For example, the following code represents all people in the database whose first name is ‘Dave’:

person_set = Person.objects.filter(first_name="Dave")

The above code doesn’t run any database queries. You can can take the person_set and apply additional filters, or pass it to a function, and nothing will be sent to the database. This is good, because querying the database is one of the things that significantly slows down web applications.

To fetch the data from the database, you need to iterate over the queryset:

for person in person_set:
    print(person.last_name)

Django querysets have a cache

The moment you start iterating over a queryset, all the rows matched by the queryset are fetched from the database and converted into Django models. This is called evaluation. These models are then stored by the queryset’s built-in cache, so that if you iterate over the queryset again, you don’t end up running the same query twice.

For example, the following code will only execute one database query:

pet_set = Pet.objects.filter(species="Dog")
# The query is executed and cached.
for pet in pet_set:
    print(pet.first_name)
# The cache is used for subsequent iteration.
for pet in pet_set:
    print(pet.last_name)

if statements trigger queryset evaluation

The most useful thing about the queryset cache is that it allows you to efficiently test if your queryset contains rows, and then only iterate over them if at least one row was found:

restaurant_set = Restaurant.objects.filter(cuisine="Indian")
# The `if` statement evaluates the queryset.
if restaurant_set:
    # The cache is used for subsequent iteration.
    for restaurant in restaurant_set:
        print(restaurant.name)

The queryset cache is a problem if you don’t need all the results

Sometimes, rather than iterating over results, you just want to see if at least one result exists. In that case, simply using an if statement on the queryset will still fully evaluate the queryset and populate its cache, even if you never plan on using those results!

city_set = City.objects.filter(name="Cambridge")
# The `if` statement evaluates the queryset.
if city_set:
    # We don't need the results of the queryset here, but the
    # ORM still fetched all the rows!
    print("At least one city called Cambridge still stands!")

To avoid this, use the exists() method to check whether at least one matching row was found:

tree_set = Tree.objects.filter(type="deciduous")
# The `exists()` check avoids populating the queryset cache.
if tree_set.exists():
    # No rows were fetched from the database, so we save on
    # bandwidth and memory.
    print("There are still hardwood trees in the world!")

The queryset cache is a problem if your queryset is huge

If you’re dealing with thousands of rows of data, fetching them all into memory at once can be very wasteful. Even worse, huge querysets can lock up server processes, causing your entire web application to grind to a halt.

To avoid populating the queryset cache, but to still iterate over all your results, use the iterator() method to fetch the data in chunks, and throw away old rows when they’ve been processed.

star_set = Star.objects.all()
# The `iterator()` method ensures only a few rows are fetched from
# the database at a time, saving memory.
for star in star_set.iterator():
    print(star.name)

Of course, using the iterator() method to avoid populating the queryset cache means that iterating over the same queryset again will execute another query. So use iterator() with caution, and make sure that your code is organised to avoid repeated evaluation of the same huge queryset.
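As a rough analogy in plain Python (not Django code), iterator() behaves like a generator: results stream through once and are then spent, whereas a cached list can be iterated repeatedly. A sketch, using an invented fetch_rows() stand-in for a database cursor:

```python
def fetch_rows():
    # Hypothetical stand-in for a database cursor yielding rows one at a time.
    for n in range(5):
        yield {"name": "Star {}".format(n)}

# Like a cached queryset: results are stored and can be re-used.
cached = list(fetch_rows())
first_names = [row["name"] for row in cached]
second_names = [row["name"] for row in cached]

# Like iterator(): rows stream through once, then the iterator is spent.
streamed = fetch_rows()
first_pass = [row["name"] for row in streamed]
second_pass = [row["name"] for row in streamed]
print("{} {}".format(len(first_pass), len(second_pass)))  # 5 0
```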

if statements are a problem if your queryset is huge

As shown previously, the queryset cache is great for combining an if statement with a for statement, allowing conditional iteration over a queryset. For huge querysets, however, populating the queryset cache is not an option.

The simplest solution is to combine exists() with iterator(), avoiding populating the queryset cache at the expense of running two database queries.

molecule_set = Molecule.objects.all()
# One database query to test if any rows exist.
if molecule_set.exists():
    # Another database query to start fetching the rows in batches.
    for molecule in molecule_set.iterator():
        print(molecule.velocity)

A more complex solution is to make use of Python’s advanced iteration methods to take a peek at the first item in the iterator() before deciding whether to continue iteration.

atom_set = Atom.objects.all()
# One database query to start fetching the rows in batches.
atom_iterator = atom_set.iterator()
# Peek at the first item in the iterator.
try:
    first_atom = next(atom_iterator)
except StopIteration:
    # No rows were found, so do nothing.
    pass
else:
    # At least one row was found, so iterate over
    # all the rows, including the first one.
    from itertools import chain
    for atom in chain([first_atom], atom_iterator):
        print(atom.mass)

Beware of naive optimisation

The queryset cache exists in order to reduce the number of database queries made by your application, and under normal usage will ensure that your database is only queried when necessary.

Using the exists() and iterator() methods allows you to optimise the memory usage of your application. However, because they don't populate the queryset cache, they can lead to extra database queries.

So code carefully, and if things start to slow down, take a look at the bottlenecks in your code, and see if a little queryset optimisation might help things along.

Javascript object inheritance isn’t complicated

Javascript’s method of object inheritance causes a lot of confusion, even amongst experienced programmers. This is largely because it doesn’t follow the classical inheritance pattern found in many other popular programming languages, such as Java, PHP, Python and Ruby.

Instead, Javascript uses a prototype inheritance pattern, which is a little different. To confuse matters, many frameworks attempt to “fix” Javascript inheritance by making it work more like classical inheritance. The end result is a mess.

Thankfully, Javascript inheritance is actually pretty easy!

Defining a new class

Let’s define a new class, Animal. Animals have a name, an age, and can make a noise.

// Define the Animal constructor.
function Animal(name, age) {
    this.name = name;
    this.age = age;
}

// Define the makeNoise method.
Animal.prototype.makeNoise = function() {
    return "Snuffle";
}

You can instantiate an Animal like this:

var animal = new Animal("Fluffy", 5);
animal.makeNoise();  // => "Snuffle"

Inheriting from this class

Let’s now make a Dog, which is like an animal, but also has a breed, and makes a different noise.

function Dog(name, age, breed) {
    // Call the parent constructor.
    Animal.call(this, name, age);
    // Add dog-specific constructor logic.
    this.breed = breed;
}

// Extend the Animal class.
Dog.prototype = Object.create(Animal.prototype);

// Extend the makeNoise method.
Dog.prototype.makeNoise = function() {
    var parentNoise = Animal.prototype.makeNoise.call(this);
    return parentNoise + "... Woof!";
}

You can then instantiate a Dog like this:

var dog = new Dog("Spot", 5, "Golden Retriever");
dog.makeNoise();  // => "Snuffle... Woof!"

What about private methods and properties?

While it’s possible to implement something similar to private methods and properties, it probably isn’t worth the time and effort (and performance penalty) of using them. Simply prefixing non-public methods and properties with an underscore is a good way of indicating that they’re not part of the public API.

What about interfaces?

Interfaces can be useful, but Javascript doesn’t support them. In any case, they would add to the download size of your code.

What about prototype.constructor?

If your code depends on prototype.constructor being set, then you can use the following helper method instead of calling Object.create() directly.

function inherits(child, parent) {
    child.prototype = Object.create(parent.prototype, {
        constructor: {
            value: child,
            enumerable: false,
            writable: true,
            configurable: true
        }
    });
}

In reality, prototype.constructor isn’t very useful, so it’s probably best to just call Object.create() directly.

Using inheritance in your code

Javascript code has an unfortunate habit of turning into a mess of nested callbacks and copy-and-paste logic. Defining a hierarchy of helper classes is just one of the many techniques that allow you to write modular, maintainable code.

For example, consider building:

  • A hierarchy of Javascript form validation classes. These could all inherit from a base Validator class that implements basic checks for required fields. Subclasses could provide integer validation, date validation, password length validation, etc.
  • A set of models that map to server-side database objects. These could all inherit from a common Model class that contains the HTTP synchronization logic.
  • A related set of AngularJS services. Data-driven services could inherit from a base Service class that provides logic for updating data within bound scopes.

Embedding HTML5 video is still pretty hard

In the early days of the HTML5 movement, I wrote the first major cross-browser compatibility shim for HTML5 <video> and <audio> tags. It was called html5media.js.

At the time, I assumed that the shim would be obsolete within a few years, just as soon as major browsers adopted a common standard and video codec. Unfortunately, the shim is still used by hundreds of thousands of people each day, and embedding video is just as confusing as ever.

So how do I embed video in my site?

Please, just save yourself a headache, and host your video on YouTube, Vimeo, or some other third party service. They employ some very clever people who’ve solved all the problems with embedding video.

Haha… no, really. How do I embed video in my site?

Take a deep breath. In order to embed video in your site, there are four major groups of people you need to keep happy:

  1. Modern browsers using commercial codecs (Chrome, Safari, IE9+)
  2. Modern browsers using open-source codecs (Firefox, Opera)
  3. Legacy browsers (IE8)
  4. Under-powered mobile devices (iPhone 3GS, cheap Android)

For the rest of this post, I’ll take you through the steps required to allow an increasing number of people to watch your video.

Embedding video for modern browsers with commercial codecs

The simplest video embed code you can possibly use is as follows:

<!DOCTYPE html>
<html>
    <body>
        <video src="video.mp4" width=640 height=360 controls>
    </body>
</html>

Congratulations! Your video will now play in:

  • Chrome
  • Safari (inc. Mobile Safari on iPhone 4+)
  • IE9+

Adding support for legacy browsers

In order to make your video work in legacy browsers, you need to add a script tag to the <head> of your document. This script, the venerable html5media.js, will provide a Flash video player fallback for legacy browsers.

<!DOCTYPE html>
<html>
    <head>
        <script src="http://api.html5media.info/1.1.5/html5media.min.js"></script>
    </head>
    <body>
        <video src="video.mp4" width=640 height=360 controls></video>
    </body>
</html>

Note: The syntax of the <video> tag has changed to include an explicit closing tag, to avoid confusing older browsers.

Fantastic! Your video will now play in:

  • Chrome
  • Safari (inc. Mobile Safari on iPhone 4+)
  • IE9+
  • IE8 (via Flash)
  • Firefox (via Flash)
  • Opera (via Flash)

At this point, the vast majority of internet users will be able to play your video. The only people who’ll be left out will be:

  • Firefox or Opera users without Flash
  • Owners of under-powered mobile devices.

Adding Flash-free support for modern browsers with open-source codecs

To allow Firefox and Opera users to view your video using their native players, you need to transcode your video into an open-source format, and embed both files in your page. I’d recommend using the free Miro Video Encoder to transcode your video to WebM format. You can then embed it using the following code:

<!DOCTYPE html>
<html>
    <head>
        <script src="http://api.html5media.info/1.1.5/html5media.min.js"></script>
    </head>
    <body>
        <video width=640 height=360 controls>
            <source src="video.mp4"></source>
            <source src="video.webm"></source>
        </video>
    </body>
</html>

Note: We’re adding explicit closing tags to <source> elements to avoid confusing legacy browsers.

Unbelievable! Now your video will play in:

  • Chrome
  • Safari (inc. Mobile Safari on iPhone 4+)
  • IE9+
  • IE8 (via Flash)
  • Firefox (via Flash)
  • Opera (via Flash)

It’s just the owners of under-powered mobile devices who’ll struggle to play your video now.

Adding support for under-powered mobile devices

The latest mobile devices support high-resolution video, but cheap Android phones and iPhone 3GS will refuse to play anything higher-resolution than about 320 x 180 pixels. To keep these devices happy, you need to transcode your video to this lower resolution. Miro Video Encoder has a built-in iPhone 3GS setting, so just use that.

Now you can embed your video using the following code:

<!DOCTYPE html>
<html>
    <head>
        <script src="http://api.html5media.info/1.1.5/html5media.min.js"></script>
    </head>
    <body>
        <video width=640 height=360 controls>
            <source src="video.mp4" media="only screen and (min-device-width: 568px)"></source>
            <source src="video-low.mp4" media="only screen and (max-device-width: 568px)"></source>
            <source src="video.webm"></source>
        </video>
    </body>
</html>

OMG! What a monster! But now everyone will be able to play your video!

  • Chrome
  • Safari (inc. Mobile Safari on iPhone 4+)
  • IE9+
  • IE8 (via Flash)
  • Firefox (via Flash)
  • Opera (via Flash)
  • Mobile Safari (iPhone 3GS)
  • Android Browser (inc. cheap Android phones)

Help! My video still isn’t playing!

The most common causes of problems are:

  • Video encoding errors.
  • Incorrect server configuration.

There’s a page full of troubleshooting information on the html5media video hosting wiki. Your problem is almost certainly covered there.

I want to customize the player UI, and make it look consistent across all browsers!

Ahahahahahahahaha!

Ahahahaha!

No.

Git makes writing code easier!

Git is a distributed version control system that’s quite tricky to get into, and it can be hard to justify spending time getting to know it.

By the end of this post, I hope you’ll be in a better position to use Git on all your software projects, and understand the benefits of doing so.

Git is easy to start using.

Installing Git on your system is simply a case of selecting the correct Git installer and downloading the software onto your computer. Once you’ve got Git installed, setting up a Git repository for your software project is as simple as typing the following commands into a terminal:

$ cd /path/to/your/project
$ git init

Commit your code to Git as often as possible

The more frequently you commit, the easier it is to go back at a later date and understand the work you’ve been doing (and maybe even undo some of that work). Committing is easy!

$ git add .
$ git commit -a -m "Description of commit"

If you’ve made a terrible mistake, simply roll back to the last commit

This is why it’s a good idea to commit your work frequently. If you make a stupid coding mistake, and want to revert your code back to how it was before you broke everything, then just run the following command:

$ git reset --hard

Go back in time and see how your code used to look

Just type the following command to display a history of all your past commits:

$ git log

To revert your codebase to some time in the past, simply copy the corresponding commit hash to your clipboard, close the Git log by pressing q, then type the following command:

$ git checkout YOUR_COMMIT_HASH -- .

(A commit hash looks a bit like this: 814c219a338006492bf6f751d958461dd3e8b775)

Once you’ve finished with the older version of your code, you can go back to the latest version by running the following command:

$ git reset --hard

Alternatively, if you want to keep this older version of your code (and discard any changes you’ve made since then), simply commit it using the following commands, and keep working:

$ git add .
$ git commit -a -m "Rolled back to previous version of code"

Back up your code on a server

To protect against hard drive failure, it’s a good idea to back up your code. You can either set up your own code hosting service, or save yourself some effort and get a free BitBucket account.

Once you’ve created a remote repository, you can connect your local codebase to it using the following commands:

$ git remote add origin ssh://git@bitbucket.org/your-username/your-repo-name.git
$ git push origin master -u

Then, whenever you’ve made a few commits that you want to push to the server, just run the following command:

$ git push

Work with other people

Once you’ve put your code online, you can invite other people to work on it too. In order to get a copy of your code, they just need to run the following command:

$ git clone ssh://git@bitbucket.org/your-username/your-repo-name.git
$ cd your-repo-name

They can then make changes, commit them, and push them to the server. In order for you to see the changes that they have made, just run the following command on your machine:

$ git pull

Sometimes things go wrong when you push

If you try to push your code, and you get an error message saying this:

! [rejected] master -> master (non-fast-forward)

Don’t worry, just run git pull, and try pushing again.

Sometimes things go wrong when you pull

If you try to pull some code, and get an error message saying this:

CONFLICT (content): Merge conflict in some_file

Don’t worry, this just means that two people have tried to edit the same file. Just open the conflicting file in your editor, fix the contents, and run the following commands:

$ git add .
$ git commit -a -m "Merging in changes"

Can’t I just use Dropbox?

Dropbox is pretty good. However, for coding projects, Git has some key advantages:

  1. With Git, you choose when to save a version, and those versions get meaningful descriptions.
  2. Git is extremely good at merging changes made by multiple people to the same file.
  3. Git can save versions, and rollback to previous versions, even when offline.
  4. When you understand Git, and know some of its more complex features, it can do magic.

Processing XML with Python - you’re probably doing it wrong

If you think that processing XML in Python sucks, and your code is eating up hundreds of megabytes of RAM just to process a simple document, then don’t worry. You’re probably using xml.dom.minidom, and there is a much better way…

The test – books.xml

For this example, we’ll be attempting to process a 43MB document containing 4000 books. The test data can be downloaded here. The two methods will be tested by providing implementations of iter_authors_and_descriptions.

(xml_test.py)
import resource, time

def main():
    authors = set()
    max_description = 0
    start_time = time.time()
    with open("books.xml") as handle:
        for author, description in iter_authors_and_descriptions(handle):
            authors.add(author)
            max_description = max(max_description, len(description))
    # Print out the report.
    end_time = time.time()
    report = resource.getrusage(resource.RUSAGE_SELF)
    print("Unique authors: {}".format(len(authors)))
    print("Longest description: {}".format(max_description))
    print("Time taken: {} ms".format(int((end_time - start_time) * 1000)))
    print("Max memory: {} MB".format(report.ru_maxrss / 1024 / 1024))

if __name__ == "__main__":
    main()
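If the download link above stops working, a small script can generate a structurally similar test file. This is only a sketch: the root element name and text content are invented, but each <book> contains the <author> and <description> child elements that the two implementations below read.

```python
from xml.etree import ElementTree

def write_books_xml(path, book_count=100):
    # Build a catalog of <book> elements with the child elements that
    # iter_authors_and_descriptions() expects. The "catalog" root name
    # and the text values are made up for illustration.
    root = ElementTree.Element("catalog")
    for n in range(book_count):
        book = ElementTree.SubElement(root, "book")
        ElementTree.SubElement(book, "author").text = "Author {}".format(n % 10)
        ElementTree.SubElement(book, "description").text = "Book {} description".format(n)
    ElementTree.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

write_books_xml("books.xml", book_count=100)
```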

The wrong way – minidom

The minidom method is both awkward and inefficient.

(minidom.py)
from xml.dom import minidom

def get_child_text(parent, child_name):
    child = parent.getElementsByTagName(child_name)[0]
    return "".join(
        grandchild.data
        for grandchild
        in child.childNodes
        if grandchild.nodeType in (grandchild.TEXT_NODE, grandchild.CDATA_SECTION_NODE)
    )

def iter_authors_and_descriptions(handle):
    document = minidom.parse(handle)
    for book in document.getElementsByTagName("book"):
        yield (
            get_child_text(book, "author"),
            get_child_text(book, "description"),
        )

Because it loads the entire document into memory in one go, minidom takes a long time to run and uses a lot of memory.

Unique authors: 999
Longest description: 63577
Time taken: 368 ms
Max memory: 107 MB

The right way – cElementTree

The cElementTree method is also awkward (this is XML, after all). However, by using the iterparse() method to avoid loading the whole document into memory, a great deal more efficiency can be achieved.

(cElementTree.py)
from xml.etree import cElementTree

def iter_elements_by_name(handle, name):
    events = cElementTree.iterparse(handle, events=("start", "end",))
    _, root = next(events)  # Grab the root element.
    for event, elem in events:
        if event == "end" and elem.tag == name:
            yield elem
            root.clear()  # Free up memory by clearing the root element.

def iter_authors_and_descriptions(handle):
    for book in iter_elements_by_name(handle, "book"):
        yield (
            book.find("author").text,
            book.find("description").text,
        )

The results speak for themselves. By using cElementTree, you can process XML in half the time and only use 10% of the memory.

Unique authors: 999
Longest description: 63577
Time taken: 192 ms
Max memory: 8 MB

Conclusion

These tests are hardly scientific, so feel free to download the code and see how it runs in your own environment. In any case, the next time your servers get melted by an XML document, consider giving cElementTree a spin.

How to turn nginx into a caching, authenticated Twitter API proxy

Very soon, the old Twitter 1.0 API will be turned off, making a switch to the 1.1 API essential. Unfortunately, the new API has a couple of restrictions that can make the transition very difficult.

  • Mandatory authentication HTTP headers – Using JSONP is now impossible.
  • Restrictive crossdomain.xml – Using CORS is now impossible.

The result of these changes is that it is now impossible to access the Twitter API directly from the browser.

So, just use a proxy, right?

A simple solution is to write your own proxy server, which can then run on your own domain. The minimum features for a useful Twitter API proxy are:

  • Adds the required authentication HTTP headers to your request.
  • Caches the results to avoid exceeding API rate limits.

Writing a Python/Ruby/PHP script to handle this is easy, but it’s a waste of valuable server resources. Far better to let nginx, the best caching reverse proxy server in the world, do the hard work instead.
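To see why it's easy, the core of such a proxy boils down to a cached, authenticated fetch. A rough sketch of the logic (the fetch_cached() name and the in-memory dict cache are illustrative only; nginx provides all of this, plus a file-backed cache, without tying up an application process):

```python
import time

def fetch_cached(path, bearer_token, cache, fetch, ttl=300, now=time.time):
    # Serve from the cache while the entry is fresh (mirrors a 5 minute
    # proxy_cache_valid setting).
    entry = cache.get(path)
    if entry is not None and now() - entry[0] < ttl:
        return entry[1]
    # Otherwise hit the API, adding the mandatory authentication header.
    body = fetch(path, {"Authorization": "Bearer " + bearer_token})
    cache[path] = (now(), body)
    return body
```

Every request that misses the cache occupies a worker in your application server for the full round trip to Twitter, which is exactly the resource drain the nginx approach avoids.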

Step 1 – Create a Twitter application

Creating a Twitter application allows you to authenticate with the API. Just visit https://dev.twitter.com/apps and register your application with Twitter.

Once your new app is created, head over to its detail page and make a note of the consumer key and consumer secret. You'll need these for the next step.

Step 2 – Obtain a bearer token for the application

The easiest way to authenticate with the Twitter API is to obtain a bearer token for your proxy server, which is a simple string that can be sent as an HTTP header with every request.

To obtain your bearer token, run the following shell commands, substituting your own consumer key and consumer secret.

$ export CONSUMER_KEY=XXXXXXXXXXXXXXXXXXXXX
$ export CONSUMER_SECRET=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
$ curl -H "Authorization: Basic `echo -ne "$CONSUMER_KEY:$CONSUMER_SECRET" | base64`" -d "grant_type=client_credentials" https://api.twitter.com/oauth2/token

After a few seconds, your terminal will print out a JSON string containing your bearer token. It will look something like this:

{"token_type":"bearer","access_token":"AAAAAAAAAAAAAAAAAAAAAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"}

Make a note of the access_token field. You’ll need this for the next step.
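If you'd rather not shell out to curl, the same token request can be made from Python. This is a sketch using only the standard library (the function names are illustrative; the endpoint is the same one the curl command hits):

```python
import base64
import json
from urllib import request

TOKEN_URL = "https://api.twitter.com/oauth2/token"

def basic_credentials(consumer_key, consumer_secret):
    # Base64-encode "key:secret", exactly as the backticked shell snippet does.
    pair = "{}:{}".format(consumer_key, consumer_secret).encode("ascii")
    return base64.b64encode(pair).decode("ascii")

def fetch_bearer_token(consumer_key, consumer_secret):
    # POST grant_type=client_credentials with a Basic auth header, then
    # pull the access_token field out of the JSON response.
    req = request.Request(
        TOKEN_URL,
        data=b"grant_type=client_credentials",
        headers={"Authorization": "Basic " + basic_credentials(consumer_key, consumer_secret)},
    )
    with request.urlopen(req) as response:
        return json.load(response)["access_token"]
```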

Step 3 – Update your nginx configuration file

Simply place the following settings in your nginx configuration, adjusting paths as necessary. In particular, make sure that proxy_cache_path, server_name and root are all correct. Most important of all, replace the INSERT_YOUR_BEARER_TOKEN placeholder with the bearer token you obtained in step 2.

(nginx.conf)
# This defines a 10 megabyte cache for the proxy service, and needs to live
# outside of the virtual host configuration. Adjust the path according to
# your environment.
proxy_cache_path  /var/cache/nginx/twitter_api_proxy levels=1:2 keys_zone=twitter_api_proxy:10m;

# The virtual host configuration.
server {

  # If you want to secure your proxy with SSL, replace this with the appropriate SSL configuration.
  listen 80;

  # Replace this with the name of the domain you wish to run your proxy on.
  server_name api.twitter.yourdomain.com;

  # Replace this with your own document root.
  root /var/www;

  # This setting attempts to use files in the document root before
  # hitting the Twitter proxy. This allows you to put a permissive
  # crossdomain.xml file in your document root, and have it show up
  # in the browser.
  location / {
    try_files $uri $uri/index.html @twitter;
  }

  # The Twitter proxy code!
  location @twitter {

    # Caching settings, to avoid rate limits on the API service.
    proxy_cache twitter_api_proxy;
    proxy_cache_use_stale error updating timeout;
    proxy_cache_valid 200 302 404 5m;  # The server cache expires after 5 minutes - adjust as required.
    proxy_ignore_headers X-Accel-Expires Expires Cache-Control Set-Cookie;

    # Hide Twitter's own caching headers - we're applying our own.
    proxy_hide_header X-Accel-Expires;
    proxy_hide_header Expires;
    proxy_hide_header Cache-Control;
    proxy_hide_header pragma;
    proxy_hide_header set-cookie;
    expires 5m;  # The browser cache expires after 5 minutes - adjust as required.

    # Set the correct host name to connect to the Twitter API.
    proxy_set_header Host api.twitter.com;

    # Add authentication headers - edit and add in your own bearer token.
    proxy_set_header Authorization "Bearer INSERT_YOUR_BEARER_TOKEN";

    # Actually proxy the request to Twitter API!
    proxy_pass https://api.twitter.com;
  }

}

Phew! That's it. Simply restart nginx and hit the following URL in your browser to make sure that everything is working:

http://api.twitter.yourdomain.com/1.1/search/tweets.json?q=cats