Working with unicode streams in Python

When working with unicode in Python 2, the standard approach is to use the str.decode() and unicode.encode() methods to convert whole strings between the builtin str and unicode types.

As an example, here’s a simple way to load the contents of a utf-16 file, remove all vertical tab codepoints, and write it out as utf-8. (This can be important when working with broken XML.)

# Load the file contents.
with open("input.txt", "rb") as input:
    data = input.read()

# Decode binary data as utf-16.
data = data.decode("utf-16")

# Remove vertical tabs.
data = data.replace(u"\u000B", u"")

# Encode unicode data as utf-8.
data = data.encode("utf-8")

# Write the data as utf-8.
with open("output.txt", "wb") as output:
    output.write(data)

This approach works just fine unless you have to deal with really big files. At that point, loading all the data into RAM becomes a problem.

Using a streaming encoder/decoder

The Python standard library includes the codecs module, which allows you to move through a file incrementally, loading only a small chunk of unicode data into memory at a time.

The simplest way is to modify the above example to use the codecs.open() helper.

import codecs

# Open both input and output streams.
input = codecs.open("input.txt", "rb", encoding="utf-16")
output = codecs.open("output.txt", "wb", encoding="utf-8")

# Stream chunks of unicode data.
with input, output:
    while True:
        # Read a chunk of data.
        chunk = input.read(4096)
        if not chunk:
            break
        # Remove vertical tabs.
        chunk = chunk.replace(u"\u000B", u"")
        # Write the chunk of data.
        output.write(chunk)

Files are horrible… let’s use iterators!

Dealing with files can get tedious. For complex processing tasks, it can be nice to just deal with iterators of unicode data.

Here’s an efficient way to read an iterator of unicode chunks from a file using iterdecode().

from functools import partial
from codecs import iterdecode

# Returns an iterator of unicode chunks from the given path.
def iter_unicode_chunks(path, encoding):
    # Open the input file.
    with open(path, "rb") as input:
        # Convert the binary file into binary chunks.
        binary_chunks = iter(partial(input.read, 4096), b"")
        # Convert the binary chunks into unicode chunks.
        for unicode_chunk in iterdecode(binary_chunks, encoding):
            yield unicode_chunk

Here’s how to write an iterator of unicode chunks to a file using iterencode().

from codecs import iterencode

# Writes an iterator of unicode chunks to the given path.
def write_unicode_chunks(path, unicode_chunks, encoding):
    # Open the output file.
    with open(path, "wb") as output:
        # Convert the unicode chunks to binary.
        for binary_chunk in iterencode(unicode_chunks, encoding):
            output.write(binary_chunk)

Using these two functions, removing all vertical tab codepoints from a stream of unicode data just becomes a case of plumbing everything together.

# Load the unicode chunks from the file.
unicode_chunks = iter_unicode_chunks("input.txt", encoding="utf-16")

# Modify the unicode chunks.
unicode_chunks = (
    chunk.replace(u"\u000B", u"")
    for chunk
    in unicode_chunks
)

# Write the chunks to a file.
write_unicode_chunks("output.txt", unicode_chunks, encoding="utf-8")
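
To check that the plumbing really survives awkward chunk boundaries, here is a small self-contained sketch that runs the same pipeline against a temporary file. It is written to run on Python 3, so the end-of-file sentinel is b"". The chunk_size parameter, the strip_vertical_tabs() and demo() helpers, and the source/target variable names are my own additions for illustration. A chunk size of 3 bytes is deliberately silly: it guarantees that utf-16 code units get split between reads.

```python
import os
import shutil
import tempfile
from codecs import iterdecode, iterencode
from functools import partial


def iter_unicode_chunks(path, encoding, chunk_size=4096):
    # Yield decoded text chunks; iterdecode() buffers incomplete byte sequences.
    with open(path, "rb") as source:
        binary_chunks = iter(partial(source.read, chunk_size), b"")
        for unicode_chunk in iterdecode(binary_chunks, encoding):
            yield unicode_chunk


def write_unicode_chunks(path, unicode_chunks, encoding):
    # Encode each text chunk and write the resulting bytes.
    with open(path, "wb") as target:
        for binary_chunk in iterencode(unicode_chunks, encoding):
            target.write(binary_chunk)


def strip_vertical_tabs(src_path, dst_path, chunk_size=4096):
    # Hypothetical helper: utf-16 in, vertical tabs removed, utf-8 out.
    chunks = iter_unicode_chunks(src_path, "utf-16", chunk_size)
    cleaned = (chunk.replace(u"\u000B", u"") for chunk in chunks)
    write_unicode_chunks(dst_path, cleaned, "utf-8")


def demo():
    # Round-trip a small sample through temporary files and return the result.
    workdir = tempfile.mkdtemp()
    try:
        src_path = os.path.join(workdir, "input.txt")
        dst_path = os.path.join(workdir, "output.txt")
        sample = u"caf\u00e9\u000B \U0001F600\u000B done"
        with open(src_path, "wb") as handle:
            handle.write(sample.encode("utf-16"))
        # 3-byte reads force splits inside 2-byte and 4-byte sequences.
        strip_vertical_tabs(src_path, dst_path, chunk_size=3)
        with open(dst_path, "rb") as handle:
            return handle.read().decode("utf-8")
    finally:
        shutil.rmtree(workdir)
```

Calling demo() should hand back the sample text with both vertical tabs gone and every other character intact, including the one outside the Basic Multilingual Plane.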

Why even bother with the `codecs` module?

It might seem simpler to just read binary chunks from a regular file object, encoding and decoding that chunk using the standard str.decode() and unicode.encode() methods like this:

# BAD IDEA! DON'T DO IT THIS WAY!

# Open both input and output streams.
with open("input.txt", "rb") as input, open("output.txt", "wb") as output:
    # Iterate over chunks of binary data.
    while True:
        # Read a chunk of data.
        chunk = input.read(4096)
        if not chunk:
            break
        # UNSAFE: Decode binary data as utf-16.
        chunk = chunk.decode("utf-16")
        # Remove vertical tabs.
        chunk = chunk.replace(u"\u000B", u"")
        # Encode unicode data as utf-8.
        chunk = chunk.encode("utf-8")
        # Write the chunk of data.
        output.write(chunk)

Unfortunately, some unicode codepoints are encoded as more than one byte of binary data. Simply reading a chunk of bytes from a file and passing it to decode() can raise an unexpected UnicodeDecodeError if a chunk boundary happens to split a multi-byte codepoint.

Using the tools in codecs will help keep you safe from unpredictable crashes in production!
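
The failure is easy to reproduce on a tiny scale. The sketch below uses utf-8 rather than utf-16 because the byte layout is easier to see, and the decode_naively() and decode_incrementally() helpers are hypothetical names of my own. The incremental version uses codecs.getincrementaldecoder(), which holds on to incomplete byte sequences between calls until the rest arrives.

```python
import codecs

# The final character of this string takes two bytes in utf-8.
data = u"caf\u00e9".encode("utf-8")
# Split the data in the middle of that two-byte sequence.
first, second = data[:4], data[4:]


def decode_naively(chunks, encoding):
    # Decodes each chunk in isolation, so a split sequence raises an error.
    return u"".join(chunk.decode(encoding) for chunk in chunks)


def decode_incrementally(chunks, encoding):
    # The decoder object remembers leftover bytes from the previous chunk.
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = [decoder.decode(chunk) for chunk in chunks]
    # final=True flushes the decoder and reports truncated input, if any.
    pieces.append(decoder.decode(b"", final=True))
    return u"".join(pieces)
```

Passing [first, second] to decode_naively() raises UnicodeDecodeError on the first chunk, while decode_incrementally() returns the original string.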

What about Python 3?

Python 3 makes working with unicode files a lot easier. The builtin function open() contains all the functionality you need to easily modify unicode data and switch between encodings.

# Open both input and output streams.
input = open("input.txt", "rt", encoding="utf-16")
output = open("output.txt", "wt", encoding="utf-8")

# Stream chunks of unicode data.
with input, output:
    while True:
        # Read a chunk of data.
        chunk = input.read(4096)
        if not chunk:
            break
        # Remove vertical tabs.
        chunk = chunk.replace("\u000B", "")
        # Write the chunk of data.
        output.write(chunk)
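
If you liked the iterator style from earlier, the same idea carries over to Python 3 without touching codecs at all. This is only a sketch: iter_text_chunks() and write_text_chunks() are hypothetical helpers of my own, and passing newline="" is an assumption on my part that you want line endings passed through untranslated instead of being normalised by text mode.

```python
from functools import partial


def iter_text_chunks(path, encoding, chunk_size=4096):
    # Yield str chunks of up to chunk_size characters; open() decodes for us.
    with open(path, "rt", encoding=encoding, newline="") as source:
        for chunk in iter(partial(source.read, chunk_size), ""):
            yield chunk


def write_text_chunks(path, chunks, encoding):
    # Write str chunks; open() encodes them on the way out.
    with open(path, "wt", encoding=encoding, newline="") as target:
        for chunk in chunks:
            target.write(chunk)
```

Removing vertical tabs is then a generator expression between the two calls, exactly as in the Python 2 iterator example.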

Python 3 rules! Happy coding!

...and then it crashed

Programming the web with Python, Django and Javascript.
