Programming the web with Python, Django and Javascript.
Working with unicode streams in Python
When working with unicode in Python 2, the standard approach is to use the str.decode() and unicode.encode() methods to convert whole strings between the built-in unicode and str types.
As an example, here’s a simple way to load the contents of a utf-16 file, remove all vertical tab codepoints, and write it out as utf-8. (This can be important when working with broken XML.)
    # Load the file contents.
    with open("input.txt", "rb") as input:
        data = input.read()
    # Decode binary data as utf-16.
    data = data.decode("utf-16")
    # Remove vertical tabs.
    data = data.replace(u"\u000B", u"")
    # Encode unicode data as utf-8.
    data = data.encode("utf-8")
    # Write the data as utf-8.
    with open("output.txt", "wb") as output:
        output.write(data)
This approach works just fine unless you have to deal with really big files. At that point, loading all the data into RAM becomes a problem.
Using a streaming encoder/decoder
The Python standard library includes the codecs module, which allows you to move through a file incrementally, loading only a small chunk of unicode data into memory at a time.
The simplest way is to modify the above example to use the codecs.open() helper.
    import codecs

    # Open both input and output streams.
    input = codecs.open("input.txt", "rb", encoding="utf-16")
    output = codecs.open("output.txt", "wb", encoding="utf-8")
    # Stream chunks of unicode data.
    with input, output:
        while True:
            # Read a chunk of data.
            chunk = input.read(4096)
            if not chunk:
                break
            # Remove vertical tabs.
            chunk = chunk.replace(u"\u000B", u"")
            # Write the chunk of data.
            output.write(chunk)
Files are horrible… let’s use iterators!
Dealing with files can get tedious. For complex processing tasks, it can be nice to just deal with iterators of unicode data.
Here’s an efficient way to read an iterator of unicode chunks from a file using iterdecode().
    from functools import partial
    from codecs import iterdecode

    # Returns an iterator of unicode chunks from the given path.
    def iter_unicode_chunks(path, encoding):
        # Open the input file.
        with open(path, "rb") as input:
            # Convert the binary file into binary chunks.
            binary_chunks = iter(partial(input.read, 1), "")
            # Convert the binary chunks into unicode chunks.
            for unicode_chunk in iterdecode(binary_chunks, encoding):
                yield unicode_chunk
Here’s how to write an iterator of unicode chunks to a file using iterencode().
    from codecs import iterencode

    # Writes an iterator of unicode chunks to the given path.
    def write_unicode_chunks(path, unicode_chunks, encoding):
        # Open the output file.
        with open(path, "wb") as output:
            # Convert the unicode chunks to binary.
            for binary_chunk in iterencode(unicode_chunks, encoding):
                output.write(binary_chunk)
Using these two functions, removing all vertical tab codepoints from a stream of unicode data just becomes a case of plumbing everything together.
    # Load the unicode chunks from the file.
    unicode_chunks = iter_unicode_chunks("input.txt", encoding="utf-16")
    # Modify the unicode chunks.
    unicode_chunks = (
        chunk.replace(u"\u000B", u"")
        for chunk in unicode_chunks
    )
    # Write the chunks to a file.
    write_unicode_chunks("output.txt", unicode_chunks, encoding="utf-8")
Why even bother with the codecs module?
It might seem simpler to just read binary chunks from a regular file object, encoding and decoding that chunk using the standard str.decode() and unicode.encode() methods like this:
    # BAD IDEA! DON'T DO IT THIS WAY!
    # Open both input and output streams.
    with open("input.txt", "rb") as input, open("output.txt", "wb") as output:
        # Iterate over chunks of binary data.
        while True:
            # Read a chunk of data.
            chunk = input.read(4096)
            if not chunk:
                break
            # UNSAFE: Decode binary data as utf-16.
            chunk = chunk.decode("utf-16")
            # Remove vertical tabs.
            chunk = chunk.replace(u"\u000B", u"")
            # Encode unicode data as utf-8.
            chunk = chunk.encode("utf-8")
            # Write the chunk of data.
            output.write(chunk)
Unfortunately, some unicode codepoints are encoded as more than one byte of binary data. Simply reading a chunk of bytes from a file and passing it to decode() can result in an unexpected UnicodeDecodeError if your chunk happens to split up a multi-byte codepoint.
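To make the failure concrete, here's a tiny sketch that splits a utf-8 string in the middle of a two-byte codepoint. The sample string and the split position are made up for illustration; the same thing can happen with utf-16 whenever a chunk boundary lands inside a character.

```python
# "é" (U+00E9) is encoded as two bytes in utf-8, so this string is 6 bytes long.
data = u"h\u00e9llo".encode("utf-8")

# Pretend a chunked read returned only the first two bytes:
# "h" plus the first half of "é".
head, tail = data[:2], data[2:]

try:
    head.decode("utf-8")
except UnicodeDecodeError as error:
    # The chunk ends halfway through a codepoint, so decoding fails.
    failure = error
else:
    failure = None
```

Decoding the whole of data in one go works fine. Only the partial chunk fails, and that's exactly why this bug tends to hide until a large file happens to split in the wrong place.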
Using the tools in codecs will help keep you safe from unpredictable crashes in production!
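For the curious, here's a rough sketch of how an incremental decoder from codecs copes with a codepoint that is split across two chunks. The decode_chunks() helper and the sample bytes are illustrative names of my own, not part of the standard library, and the snippet is written for Python 3.

```python
import codecs


def decode_chunks(binary_chunks, encoding):
    """Decode an iterable of byte chunks, tolerating split codepoints."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in binary_chunks:
        # Incomplete trailing bytes are buffered until the next call.
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True flushes the buffer and raises if the input ended mid-codepoint.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# The two bytes of "é" (0xC3 0xA9) are deliberately split across the chunks.
pieces = list(decode_chunks([b"h\xc3", b"\xa9llo"], "utf-8"))
```

The decoder holds on to the dangling 0xC3 byte and only emits "é" once the second chunk arrives. This is roughly what iterdecode() does on your behalf.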
What about Python 3?
Python 3 makes working with unicode files a lot easier. The built-in open() function contains all the functionality you need to easily modify unicode data and switch between encodings.
    # Open both input and output streams.
    input = open("input.txt", "rt", encoding="utf-16")
    output = open("output.txt", "wt", encoding="utf-8")
    # Stream chunks of unicode data.
    with input, output:
        while True:
            # Read a chunk of data.
            chunk = input.read(4096)
            if not chunk:
                break
            # Remove vertical tabs.
            chunk = chunk.replace("\u000B", "")
            # Write the chunk of data.
            output.write(chunk)
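If you'd still rather work with iterators in Python 3, the two helper functions from earlier need one small change. A binary read() returns bytes in Python 3, so the end-of-file sentinel has to be b"" rather than "", otherwise the chunk iterator never stops. Here's a sketch of a possible port. The chunk_size parameter is my own addition and isn't part of the original helpers.

```python
import codecs
from functools import partial


def iter_unicode_chunks(path, encoding, chunk_size=4096):
    """Yield decoded text chunks from the file at the given path."""
    with open(path, "rb") as source:
        # read() returns b"" at end of file, which stops the iterator.
        binary_chunks = iter(partial(source.read, chunk_size), b"")
        # iterdecode() buffers any codepoint split across two chunks.
        yield from codecs.iterdecode(binary_chunks, encoding)


def write_unicode_chunks(path, unicode_chunks, encoding):
    """Encode text chunks incrementally and write them to the given path."""
    with open(path, "wb") as target:
        for binary_chunk in codecs.iterencode(unicode_chunks, encoding):
            target.write(binary_chunk)
```

The plumbing is the same as before: wrap the result of iter_unicode_chunks() in a generator expression that strips the vertical tabs, then hand it to write_unicode_chunks().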