Monday, September 1, 2014

Setting LANG and LC_ALL when using mod_wsgi.

So I am at DjangoCon US 2014, and one of the first pain points for using mod_wsgi that came up in discussion was the lang and locale settings. These settings influence what the default encoding is for Python when implicitly converting Unicode to byte strings. In other words, they dictate what is going on at the Unicode/bytes boundary.

Now this should not really be an issue with WSGI at least, because you should always be explicitly specifying the encoding of the response content when it is returned from the WSGI application, and that encoding should match the 'charset' attribute specified in the 'Content-Type' response header. There are however lots of other cases where the problem can still present itself.
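As a minimal sketch of what being explicit looks like, the following hypothetical WSGI application encodes its Unicode response itself and declares the same charset in the 'Content-Type' header. The names 'CHARSET' and the response text are only illustrative assumptions, not anything mod_wsgi requires.

```python
# Hypothetical minimal WSGI application; names are illustrative only.
CHARSET = 'utf-8'

def application(environ, start_response):
    text = u'\u292e\n'

    # Encode explicitly rather than relying on any default encoding
    # derived from the process lang and locale settings.
    body = text.encode(CHARSET)

    headers = [
        # The declared charset matches the encoding used above.
        ('Content-Type', 'text/plain; charset=%s' % CHARSET),
        ('Content-Length', str(len(body))),
    ]
    start_response('200 OK', headers)
    return [body]
```

Because the encoding is named at the point of conversion, the response body should come out the same whatever 'LANG' and 'LC_ALL' happen to be.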

Take for example the simple case of printing out the value of a Unicode string. It is shown here in a command line interpreter, but under mod_wsgi the same output would end up in the Apache error log:

>>> print(u'\u292e')

On a system with a sane configuration, this would display as you expect. The reason for this is that your login shell environment would typically set an environment variable such as 'LANG'. On my MacOS X system for example I have:

LANG=en_AU.UTF-8

When I use the 'locale' module to see what Python sees, we get:

>>> import locale
>>> locale.getdefaultlocale()
('en_AU', 'UTF-8')
>>> locale.getpreferredencoding()
'UTF-8'

UTF-8 is generally the magic value that solves all problems. With that you should generally be okay.

The problem now is that when using Apache/mod_wsgi on many Linux systems, the Apache process doesn't inherit any environment variables which override the default locale or language settings. So what the Python code running under Apache/mod_wsgi sees is:

>>> import locale
>>> locale.getdefaultlocale()
(None, None)
>>> locale.getpreferredencoding()
'US-ASCII'

With the Python interpreter now using these values, if we try and print out a Unicode value, we can encounter problems.

>>> print(u'\u292e')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u292e' in position 0: ordinal not in range(128)

And this is the trap that people encounter when using Apache/mod_wsgi. They will run up their Python WSGI application with a development server, such as that provided by Django, and everything will work fine. Host it with Apache/mod_wsgi on a Linux system though, and if they have not ensured that encodings are always being used explicitly when converting from Unicode to bytes, they can start to get 'UnicodeEncodeError' exceptions.
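The failure can be reproduced independently of the process environment by encoding with the 'ascii' codec directly, which is in effect what happens when ASCII ends up as the default. This is only a sketch of the mechanism, with variable names chosen for illustration.

```python
value = u'\u292e'

# Roughly what happens implicitly when the default encoding is ASCII.
try:
    value.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Naming the encoding explicitly does not depend on LANG or LC_ALL.
encoded = value.encode('utf-8')
```

The explicit conversion gives the same bytes under a development server and under Apache/mod_wsgi, which is why it is the safer habit.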

What is the solution then?

As is detailed in the Django documentation, you can set the environment variables yourself.

export LANG='en_US.UTF-8'
export LC_ALL='en_US.UTF-8'

The problem now is where to set them, because they need to be set in the environment under which the initial Apache process starts up.

For a standard Apache Software Foundation layout for an Apache httpd installation, the file this needs to be put in is the 'envvars' file. This file would exist in the same directory as where 'apachectl' resides.

The location means though that you are modifying what is a system file. If you upgrade Apache it is quite possible, unless the package update looks for an existing file, that it will be overwritten.

A further problem is that Linux systems generally do not use the Apache Software Foundation installation layout and do away with the 'envvars' file completely. In these cases you have to find the correct system init script file to add the settings in. Which file this is differs between Linux distributions.

So being able to override the value for the lang and locale can be a pain. It can differ between Linux distributions and even changing the file may not survive system package upgrades.

Is there a better solution?

If you are using embedded mode of mod_wsgi to run your WSGI application with Apache, the answer is currently no.

This isn't because it is technically impossible, but because Apache can be used to host more than just Python web applications at the same time and I didn't think it would be particularly friendly to provide a mod_wsgi configuration directive which modified the lang and locale for the whole of Apache, thereby affecting other Apache modules.

Was this a wise choice? I am not sure. I am probably still open to providing such an option to override them for the whole of Apache, but I would certainly need to document the dangers of using it.

The thing is though, that unless you really know what you are doing, you shouldn't be using embedded mode of mod_wsgi on UNIX systems anyway, even though it is the default. The preferred and better configuration is to use mod_wsgi daemon mode.

For mod_wsgi daemon mode then, what can be done?

For this way of using mod_wsgi, what you can do is set the 'lang' and 'locale' options of the 'WSGIDaemonProcess' directive.

WSGIDaemonProcess my-django-site lang='en_US.UTF-8' locale='en_US.UTF-8'

For daemon mode of mod_wsgi at least, that is the solution for this particular pain point.

Do note though that you must be using mod_wsgi 3.4 or later to have access to these options of the 'WSGIDaemonProcess' directive.
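To check what the daemon process actually ends up with, a small diagnostic WSGI application along the following lines could be used temporarily. It is a hypothetical sketch: I am assuming the preferred encoding is the value of most interest, and whether the 'LANG' environment variable is visible in 'os.environ' may depend on how the settings were applied.

```python
import locale
import os

def application(environ, start_response):
    # Report what the Python interpreter in this process sees.
    lines = [
        u'LANG=%r' % os.environ.get('LANG'),
        u'preferred encoding=%r' % locale.getpreferredencoding(),
    ]

    # Encode explicitly so the report itself cannot trip over the
    # very settings it is trying to display.
    body = u'\n'.join(lines).encode('utf-8')

    start_response('200 OK', [
        ('Content-Type', 'text/plain; charset=utf-8'),
        ('Content-Length', str(len(body))),
    ])
    return [body]
```

If the reported preferred encoding is still an ASCII variant, the options are probably not being applied to the process handling the request.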
