Wednesday, September 3, 2014

Hosting PHP web applications in conjunction with mod_wsgi.

Yes, yes, I know. You can stop shaking your head now. It is a sad fact of life though that some people need to mix PHP web application code with Python web application code on the same Apache instance. One instance is where those PHP developers have seen the light and want to migrate their existing legacy PHP web application to Python, but are not able to do it all in one go, instead needing to do it piecemeal, with the Python web application code progressively taking over from the PHP web application.

Ask around on the Internet and once you get past the 'why on earth do you want to do that' type of reactions, you will often be told it is either not possible or too hard and that you should just ditch the PHP web application entirely and just use Python. This isn't particularly helpful and is also very misleading, as it is actually quite simple to allow both a PHP web application and a Python web application to run concurrently on the same Apache instance.

In going this path though, there is one very important detail that you must first appreciate. That is that the typical Apache server MPM configuration for a PHP web application under Apache is generally not favourable to a Python web application. Because of this, never run your Python web application in embedded mode if you are also running a PHP web application on the same Apache server. If you do, then the performance of your overall Apache instance will be affected, having an impact on both the PHP and Python web applications.

What you want to do to mitigate such problems is run your Python web application in daemon mode of mod_wsgi. This means that the Python web application will run in its own process and the Apache child worker process will merely act as a proxy for requests being sent to the Python web application. This ensures that the Python web application processes are not subject to the dynamic process management features of Apache for child worker processes, which is where a lot of the problems arise when running with embedded mode.

Because it is so important that embedded mode not be used, you should disable embedded mode entirely. That way you cannot accidentally still end up running your Python web application in embedded mode.

The configuration for mod_wsgi in the Apache configuration when running a single Python web application should therefore include something like:

# Define a mod_wsgi daemon process group.
WSGIDaemonProcess my-python-web-application display-name=%{GROUP}
# Force the Python web application to run in the mod_wsgi daemon process group.
WSGIProcessGroup my-python-web-application
WSGIApplicationGroup %{GLOBAL}
# Disable embedded mode of mod_wsgi.
WSGIRestrictEmbedded On 

Obviously if running more than one Python web application then you may need to use a more complicated configuration. Either way, ensure you aren't using embedded mode and that any Python web applications are running in daemon mode instead. All the following discussion will assume that you have got this in place.

Having dealt with that, we can now move on to setting up the Apache configuration to serve both the PHP web application and the Python web application.

For this we now need to delve into the typical ways that each is hosted by Apache.

In the case of PHP, the typical approach involves having Apache handle the primary URL routing by matching a URL to actual files in the file system. So if the default Apache web server document directory contains the files:

favicon.ico
index.php
page-1.php
page-2.php
robots.txt 

then if a request arrives which uses a URL of '/robots.txt', then Apache will return the contents of that file. If however a URL of '/page-1.php' arrives, then Apache will actually load the code in the file called 'page-1.php' and execute it as PHP code. That PHP code will then be responsible for generating the actual response content.

The 'index.php' file is generally a special file and although one could make a request against it using the URL '/index.php', what is more generally done is to tell Apache that if a request comes in for '/', which notionally maps to the directory itself, that it instead be routed to 'index.php'. 

The way things typically work for PHP then is that any PHP code files are simply dropped in the existing directory which Apache is serving up static files from. Apache does the URL routing, mapping a URL to an actual physical file on the file system. When it finds a file corresponding to a URL, it will return the contents of that file, or if the file type represents a special case, the handler for that file type will be invoked instead. For the case of PHP code files, this will result in the code being executed to generate the response.

This is all achieved by using an Apache configuration of:

DocumentRoot /var/www/html
<Directory /var/www/html>
DirectoryIndex index.php
AddHandler application/x-httpd-php .php
</Directory>

In this you can start to see why people say PHP is so easy to use as all you need to do is drop the PHP code files in the right directory and they work. In this simple configuration, there is no need for users to worry about URL routing as that is done for them by the web server.

Now you can actually do a similar thing with mod_wsgi for Python script files by extending this to:

DocumentRoot /var/www/html
<Directory /var/www/html>
Options ExecCGI
DirectoryIndex index.py index.php
AddHandler application/x-httpd-php .php
AddHandler wsgi-script .py
</Directory>

That is, you can now simply drop Python code files with a '.py' extension into the directory and they will be executed as Python code when a URL maps to that file. So if instead of 'index.php' you had 'index.py', then when the URL for the directory is accessed, Apache, seeing that 'index.py' now exists, would use that to serve the request rather than 'index.php'. If the URL instead explicitly referenced a '.py' file by name, then that would be executed to handle the request instead.

Reality is that no one does things this way for Python web applications and there are a few reasons why.

The first reason is that Python web applications interact with an underlying server using the web server gateway interface (WSGI). This is a very low level interface and quite unfriendly to new users.

This is in contrast to PHP where what is in the PHP file is nowhere near as low level and instead comes from the direction of being HTML with PHP code snippets interspersed. Those PHP code snippets can then access details of the request and any request content through a high level interface.

For WSGI however, there is no high level interface and you are effectively left having to work at the lowest level and process the request and any request content yourself.

WSGI therefore steers you towards needing to use a separate Python web framework or toolkit to do all that hard work and provide a simpler high level interface onto the request and for generating a response.
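To give a sense of how low level the interface is, the following is a minimal sketch of a raw WSGI application written against nothing but the Python standard library. The greeting and the 'name' query parameter are made up for illustration. Note that even reading a single query string parameter has to be done by hand.

```python
from urllib.parse import parse_qs


def application(environ, start_response):
    # Query string parameters are not parsed for us; we only get the raw string.
    params = parse_qs(environ.get('QUERY_STRING', ''))
    name = params.get('name', ['world'])[0]

    # The response body must be bytes, and the headers are built by hand.
    body = ('Hello %s!' % name).encode('utf-8')
    headers = [
        ('Content-Type', 'text/plain; charset=utf-8'),
        ('Content-Length', str(len(body))),
    ]
    start_response('200 OK', headers)
    return [body]
```

Saved with a '.py' extension in the directory configured above, a file like this is the sort of thing that would be executed when a URL mapped to it.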

At this level where Apache is allowed to handle all the URL routing, the two Python packages which would be most useful are Werkzeug and Paste. These packages focus mainly on encapsulating the request and response to make your life easier as far as processing the request and generating a response. What they don't do is dictate a URL routing mechanism, which is why they are a good match when using Apache in the way above.

There is therefore no reason why you can't use this approach, similar to PHP, of simply dropping Python code files into a directory, but you are going to have to do more work.

A bigger problem and the second reason why people don't write Python web applications in this way is that of code reloading.

When writing a web application in PHP, every time you modify a PHP code file it will be automatically reloaded and the new code read and used. This is because ultimately, nothing is persistent for a PHP web application and everything is read in again for every request.

Well, that isn't quite true, but as far as you can tell as a user that is the case.

The reason it isn't strictly true is that all the PHP extensions you may want to use in your web application, and a lot more you don't, are all preloaded into the process where the PHP code is to be executed. The code for these stays persistent across requests. What does get thrown away though is all the code for your application and the corresponding data.

This is in contrast to Python where all code for separate Python code modules is loaded dynamically on demand the first time it is required. Further, the Python code objects are intermingled with other data for your application. There is also no ready distinction between your application code and unchanging code from a separate third party package or a module from the Python standard library.

It is therefore not possible to throw away just your application code and data at the end of each request. Instead, what occurs for Python web applications is that all this application code and data stays persistent in the memory of the process between requests.

As far as code reloading goes this makes it much more difficult. This is because even for a trivial code change you need to kill off the persistent process and start over. The greater startup cost associated with Python web applications, due to the fact that nothing is preloaded, means that such a restart is expensive. If this were done on every request, performance would drop dramatically.
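The persistence is easy to observe. In the following sketch, where the counter is purely illustrative, the module level variable is initialised once when the code is first loaded into the process and then survives from one request to the next for as long as that process lives.

```python
import threading

# Module level state: created once when the module is loaded into the
# process, not once per request.
_lock = threading.Lock()
_request_count = 0


def application(environ, start_response):
    global _request_count
    # Guard the update, as requests may be handled by multiple threads.
    with _lock:
        _request_count += 1
        count = _request_count

    body = ('Request %d handled by this process\n' % count).encode('utf-8')
    start_response('200 OK', [('Content-Type', 'text/plain'),
                              ('Content-Length', str(len(body)))])
    return [body]
```

The count only goes back to zero when the process is restarted, which is also the only point at which modified code would be picked up.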

Python doesn't therefore lend itself very well to what PHP users are used to of simply being able to drop code files in a directory and for all code changes to be picked up automatically.

The preferred approach in Python is therefore to use a much higher level framework providing simpler and more structured interfaces. These web frameworks provide the high level request and response object which make handling a request easier, but they also take over URL routing as well. This means that instead of relying on Apache to perform URL routing right down to the level of a resource or handler, it only needs to route down to the top level entry point for the whole WSGI application. After that point, the frameworks themselves will handle URL routing.
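As a rough sketch of that division of labour, the following shows a single WSGI entry point doing its own URL routing. The handlers and the routing table are hypothetical and are far simpler than what a real framework provides.

```python
def _respond(start_response, status, text):
    body = text.encode('utf-8')
    start_response(status, [('Content-Type', 'text/plain'),
                            ('Content-Length', str(len(body)))])
    return [body]


def index(environ, start_response):
    return _respond(start_response, '200 OK', 'index page')


def about(environ, start_response):
    return _respond(start_response, '200 OK', 'about page')


# Hypothetical routing table mapping URL paths to handler functions.
ROUTES = {
    '/': index,
    '/about': about,
}


def application(environ, start_response):
    # Single entry point: Apache only needs to route requests this far.
    handler = ROUTES.get(environ.get('PATH_INFO') or '/')
    if handler is None:
        return _respond(start_response, '404 Not Found', 'not found')
    return handler(environ, start_response)
```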

One can still use the earlier method of dropping a '.py' file into the document directory as the gateway into a WSGI application using a high level Python web framework, but it doesn't quite work properly when you want to take over the root of the web site.

To get things to work properly, for a Python web application we can use a different type of configuration.

Alias / /var/www/wsgi/main.py
<Directory /var/www/wsgi>
Options ExecCGI
AddHandler wsgi-script .py
Order allow,deny
Allow from all
</Directory>

Specifically, the 'Alias' directive allows us to say that all requests that fall under the URL starting with '/', in this case the whole site, will be routed to the resource specified. As that resource maps to a Python code file, it will then be executed as Python code, thus providing the gateway into our WSGI application, with it being able to then perform the actual URL routing required to map a request to a specific handler function.

Because for Python web applications this will be a common idiom, mod_wsgi provides a simpler way of doing the same thing:

WSGIScriptAlias / /var/www/wsgi/main.py
<Directory /var/www/wsgi>
Order allow,deny
Allow from all
</Directory>

Using the 'WSGIScriptAlias' directive from mod_wsgi in this case means that we do not need to worry about setting the 'ExecCGI' option, or about mapping files with a '.py' extension so that they are executed as WSGI scripts.

Even when using 'WSGIScriptAlias', you do still need to work in conjunction with Apache access controls. It doesn't provide a back door for avoiding the access controls, which ensures you are always using best security practices.

We now have what is the more typical Apache configuration for a Python web application, but how then do we use this in conjunction with an existing PHP application that may be hosted on the same site?

The primary problem if it isn't obvious is that using 'WSGIScriptAlias' for '/' means that all requests to the site are being hijacked and sent into the Python web application. In other words, it would shadow any existing PHP web application that may be hosted out of the document directory for the web server.

The simplest thing which can be done at this point is to host the Python web application at a sub URL instead of the root of the site.

WSGIScriptAlias /suburl /var/www/wsgi/main.py

The result will be that all requests prefixed with that sub URL will then go to that Python web application. Anything else will be mapped against the document directory of the server and thus potentially to the PHP web application.

Using a sub URL however isn't always practical. It may be fine where the Python web application is actually a sub site, but if you are intending to replace the existing PHP web application, it is likely preferable that the Python web application give the appearance of being hosted at the root of the site at the same time as the PHP web application is also being hosted at the root of the site.

Is this even possible? If possible, how do we do it?

The answer is that it is possible, but we have to rely on a little magic. This magic comes in the form of the mod_rewrite module for Apache.

Our starting point in this case will be the prior example we had whereby we could drop both PHP and Python code files in the document directory for the server. To that we are going to add our mod_rewrite rules.

DocumentRoot /var/www/html
<Directory /var/www/html>
Options ExecCGI
DirectoryIndex index.php
AddHandler application/x-httpd-php .php
AddHandler wsgi-script .py
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ /main.py/$1 [QSA,PT,L]
</Directory>

What this magic rewrite rule will do is look at each request as it comes in and determine if Apache was able to map the URL to a file within the document directory. If Apache was able to successfully map the URL to a file, then the request will be processed normally.

If however the URL could not be mapped to an actual physical file in the document directory, the request will be rewritten such that the request will be redirected to the resource 'main.py'.

Because though 'main.py' is being mapped to mod_wsgi as a Python code file, the result will be that the Python web application in that file will instead be used to handle the request.
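The decision made by the two 'RewriteCond' lines and the 'RewriteRule' can be modelled in a few lines of Python. This is only a sketch of the logic, not how Apache implements it, and the function name 'resolve' is made up for illustration.

```python
import os


def resolve(document_root, url_path, fallback='/main.py'):
    """Return the URL path which would go on to be processed."""
    filename = os.path.join(document_root, url_path.lstrip('/'))

    # RewriteCond %{REQUEST_FILENAME} !-f
    # RewriteCond %{REQUEST_FILENAME} !-d
    if os.path.isfile(filename) or os.path.isdir(filename):
        # The URL maps to something real, so leave the request alone.
        return url_path

    # RewriteRule ^(.*)$ /main.py/$1
    return fallback + '/' + url_path.lstrip('/')
```

With only 'page-1.php' in the document directory, a request for '/page-1.php' is left untouched, whereas a request for '/some/url' comes back as '/main.py/some/url'.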

All that remains now is to create 'main.py', which will normally be the existing WSGI script file you use as an entry point to your WSGI application.

In copying in the 'main.py' file, ensure that is all you copy in from your existing Python web application. Do not go placing all the source code for your existing Python web application under the server document directory. This is because with this Apache configuration, a URL would then be able to be mapped to those source code files even though it isn't intended that they be so accessible.

So keep your actual Python web application code separate. It is even better in some respects that 'main.py' not be your original WSGI script file. Preferably all it should do is import the WSGI application entry point from the original in the separate source code directory for your Python web application. This limits the danger from having source code in the server document directory, because even if you later stuff up the configuration and accidentally make it so someone can download the actual contents of 'main.py', they haven't got hold of any sensitive data.

Making 'main.py' be a simple wrapper implementing a level of indirection is actually better for another reason.

This is because when we use the mod_rewrite rules above to trigger the internal redirect within Apache, the adjustments it makes to the URL can stuff up what URLs are then subsequently exposed to a user of your site.

This comes about because normally where your Python web application would see a URL as:

/some/url

it will instead see it as:

/main.py/some/url

Or more specifically, the 'SCRIPT_NAME' variable will be passed into the WSGI environ dictionary as:

/main.py

rather than an empty string.

The consequence of this is that when your Python web application creates a full URL for the purposes of redirection, that URL will then also have '/main.py' as part of it.
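The effect can be seen using 'wsgiref.util.application_uri()' from the Python standard library, which rebuilds the base URL of an application from the WSGI environ in much the same way a framework would. The host name here is a placeholder and the helper function 'base_url' is made up for illustration.

```python
from wsgiref.util import application_uri


def base_url(script_name):
    # Minimal environ with just the keys needed to reconstruct a URL.
    environ = {
        'wsgi.url_scheme': 'http',
        'HTTP_HOST': 'www.example.com',
        'SCRIPT_NAME': script_name,
    }
    return application_uri(environ)


# What the application sees after the rewrite rule has been applied.
print(base_url('/main.py'))   # http://www.example.com/main.py

# What we want it to see when hosted at the root of the site.
print(base_url(''))           # http://www.example.com/
```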

Exposing this internal detail of how we are hosting the Python web application part of the site isn't what we want to do, so we want to strip that out. That way any full URLs which are constructed will make it appear that the Python web application is still hosted at the root of the site and a user will be none the wiser.

import posixpath
def _application(environ, start_response):
    # The original application entry point.
    ...
def application(environ, start_response):
    # Wrapper to set SCRIPT_NAME to actual mount point.
    environ['SCRIPT_NAME'] = posixpath.dirname(environ['SCRIPT_NAME'])
    if environ['SCRIPT_NAME'] == '/':
        environ['SCRIPT_NAME'] = ''
    return _application(environ, start_response)

Because we are hosting at the root of the site, we could have just set 'SCRIPT_NAME' to an empty string and be done with it. I use here though a more durable solution in case the rewrite URLs were being used for a sub directory of the server document directory.
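To see why 'posixpath.dirname()' is the more durable choice, consider what it yields for the root case compared to a hypothetical case where the rewrite rules were applied for a sub directory called '/blog'. The helper name 'mount_point' is made up for this sketch.

```python
import posixpath


def mount_point(script_name):
    # Strip the trailing '/main.py' component added by the rewrite rule.
    mount = posixpath.dirname(script_name)
    # The root of the site is represented in WSGI by an empty string.
    return '' if mount == '/' else mount


print(repr(mount_point('/main.py')))        # ''
print(repr(mount_point('/blog/main.py')))   # '/blog'
```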

And we are done, the result being that we have one site which has both a PHP web application and a Python web application which believe they are both hosted at the root of the site. When a request comes in, Apache will map the URL to file based resources in the server document directory. If that file is a static file the contents of that file will be served immediately. If instead the URL mapped to a PHP code file, then PHP will handle the request. Finally, if the request doesn't map to any file based resource, then the request will be passed through to the Python web application, which will perform its own routing based on the URL to work out how the request should be handled.

This mechanism enables you to add a Python web application to the site and then progressively transfer the functionality of the existing PHP web application across to the Python web application. If URLs aren't changing as part of the transition, then it is a simple matter of removing the PHP code file for a specific URL and that URL will then be handled by the Python web application instead.

Otherwise, you would implement the new URL handlers in the Python web application and then change the existing PHP web application to send requests off to the new URLs.

The key URL for the root of the site will with the above configuration be handled by the 'index.php' file. When you are finally ready to cut it over, then you just need to remove the 'index.php' file, plus the second 'RewriteCond' for '%{REQUEST_FILENAME} !-d' and the URL requests for the root of the site will also be sent through to the Python web application.

So summarising, there are two things that need to be done.

The first step is changing the Apache configuration to use mod_rewrite rules to fall back to sending requests through to the Python web application.

# Define a mod_wsgi daemon process group.
WSGIDaemonProcess my-python-web-application display-name=%{GROUP}
# Force the Python web application to run in the mod_wsgi daemon process group.
WSGIProcessGroup my-python-web-application
WSGIApplicationGroup %{GLOBAL}
# Disable embedded mode of mod_wsgi.
WSGIRestrictEmbedded On
# Set document root and rules for access.
DocumentRoot /var/www/html
<Directory /var/www/html>
Options ExecCGI
DirectoryIndex index.php
AddHandler application/x-httpd-php .php
AddHandler wsgi-script .py
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ /main.py/$1 [QSA,PT,L]
</Directory>

The second step is setting up the 'main.py' file for the entry point to the Python web application, and implementing the fix up for 'SCRIPT_NAME'.

import posixpath
def _application(environ, start_response):
    # The original application entry point.
    ...
def application(environ, start_response):
    # Wrapper to set SCRIPT_NAME to actual mount point.
    environ['SCRIPT_NAME'] = posixpath.dirname(environ['SCRIPT_NAME'])
    if environ['SCRIPT_NAME'] == '/':
        environ['SCRIPT_NAME'] = ''
    return _application(environ, start_response)

Overall the concept is simple, it is just the detail of the implementation which may not be obvious and why some may think it is not possible.

What was the DjangoCon US 2014 angle in all this?

The issue of how to do this came up as Collin Anderson will be presenting a talk at DjangoCon called 'Integrating Django and Wordpress can be simple'. His talk is on a much broader topic, but I thought I would add a bit to explain in more detail how one can do PHP and Python site merging with Apache.

So if you are at DjangoCon and have to deal with PHP applications still, maybe drop in and watch Collin's talk.

Tuesday, September 2, 2014

Python module search path and mod_wsgi.

When you run the Python interpreter on the command line as an interactive console, if a module to be imported resides in the same directory as you ran the 'python' executable, then it will be found no problems.

How can this be though, as we haven't done anything to add the current working directory into 'sys.path', nor has the Python interpreter itself done so?

>>> import os, sys
>>> os.getcwd() in sys.path
False

What does get put in 'sys.path' then and does that give us a clue to why it is being found?

>>> sys.path
['', '/Library/Frameworks/Python.framework/Versions/3.4/lib/python34.zip',
'/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4',
'/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/plat-darwin',
'/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/lib-dynload',
'/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages']

As you can see we have a whole bunch of directories related to our actual Python installation.

We can also see that 'sys.path' includes an empty string. What is that about?

On the assumption that it is there for a reason, in order to work out what it might be for, let's try deleting that entry and see if it affects our attempt to import a module.

$ touch example.py
$ python3.4
Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import example
$ python3.4
Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> del sys.path[0]
>>> import example
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named 'example'

As you can see it is significant. When we delete the empty string from 'sys.path' we can no longer import any modules from the current working directory.

As it turns out, what this magic value of an empty string does is tell Python that when performing a module import, it should look in the current working directory. That is, the directory that would be returned by:

>>> import os
>>> os.getcwd()
'/private/tmp'

Initially this would be the directory in which you ran the 'python' executable, but if you so happened to use 'os.chdir()' to change the working directory, the current working directory will change and thus where Python looks for modules when imported will also change, instead now using the directory you changed to.
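This tracking of the working directory can be demonstrated with a short script. It is a sketch only: the two temporary directories and the module names 'only_in_a' and 'only_in_b' are made up for illustration, and the empty string is added to 'sys.path' explicitly so the result doesn't depend on how the script is run.

```python
import importlib
import os
import sys
import tempfile


def can_import(name):
    """Return True if the named module can be imported right now."""
    importlib.invalidate_caches()
    sys.modules.pop(name, None)
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


def demo():
    saved_cwd, saved_path = os.getcwd(), list(sys.path)
    results = []
    try:
        with tempfile.TemporaryDirectory() as dir_a, \
                tempfile.TemporaryDirectory() as dir_b:
            # One empty module per directory.
            open(os.path.join(dir_a, 'only_in_a.py'), 'w').close()
            open(os.path.join(dir_b, 'only_in_b.py'), 'w').close()

            # The magic empty string meaning 'current working directory'.
            sys.path.insert(0, '')

            os.chdir(dir_a)
            results.append((can_import('only_in_a'), can_import('only_in_b')))

            os.chdir(dir_b)
            results.append((can_import('only_in_a'), can_import('only_in_b')))

            # Leave the temporary directories before they are removed.
            os.chdir(saved_cwd)
    finally:
        os.chdir(saved_cwd)
        sys.path[:] = saved_path
    return results


print(demo())   # [(True, False), (False, True)]
```

Each tuple records whether 'only_in_a' and 'only_in_b' could be imported, first with the working directory set to the first directory and then the second.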

What about when executing Python against a script file instead of running an interactive console?

$ python3.4 -i example.py
>>> import sys
>>> sys.path
['/private/tmp', '/Library/Frameworks/Python.framework/Versions/3.4/lib/python34.zip',
'/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4',
'/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/plat-darwin',
'/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/lib-dynload',
'/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages']

This time there is no empty string. Instead of the empty string, Python has calculated the name of the directory which the script is located in and added that to 'sys.path'.

Any module in the same directory will still be found when imported, but importantly, if the current working directory of the application changes, that same directory will still be searched and the directory to be searched will not change.
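The difference an absolute directory entry makes can be sketched in the same way. Here a temporary directory stands in for the directory containing the script, and the module name 'helper_beside_script' is hypothetical.

```python
import importlib
import os
import sys
import tempfile


def demo():
    saved_cwd, saved_path = os.getcwd(), list(sys.path)
    try:
        with tempfile.TemporaryDirectory() as script_dir, \
                tempfile.TemporaryDirectory() as elsewhere:
            # An empty module living alongside the (imaginary) script.
            open(os.path.join(script_dir, 'helper_beside_script.py'), 'w').close()

            # Mimic what running a script does: an absolute directory entry.
            sys.path.insert(0, script_dir)

            # Change the working directory before the import happens.
            os.chdir(elsewhere)
            importlib.invalidate_caches()
            module = importlib.import_module('helper_beside_script')
            found_in = os.path.dirname(os.path.abspath(module.__file__))

            # Tidy up and leave the temporary directories before removal.
            sys.modules.pop('helper_beside_script', None)
            os.chdir(saved_cwd)
            return os.path.realpath(found_in) == os.path.realpath(script_dir)
    finally:
        os.chdir(saved_cwd)
        sys.path[:] = saved_path


print(demo())   # True
```

The import succeeds even though the working directory was changed first, because the directory entry in 'sys.path' is fixed.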

So what does this all have to do with mod_wsgi?

Well under mod_wsgi, because it is an embedded system and it uses the C APIs directly for initialising the Python interpreter, neither the empty string nor the directory containing any script file will be added to 'sys.path'.

This means that only the module directories which form part of the standard Python module search path will actually ever be searched. Any directory where the WSGI application may reside will not be automatically searched. In fact, at the time of initialisation of the Python interpreter, it will not generally be known what WSGI applications will even be run within a specific Python interpreter as resolving of which WSGI script file to load will only be done lazily as actual web requests arrive.

The consequence of this is that if your WSGI application is not all contained in a single WSGI script file, then you will need to explicitly set up additional directories that Python should search for modules.

For a Django application, that means adding the project base directory and this is what was touched on in the discussion at DjangoCon US 2014 I had, with me saying again 'but there is a better way'.

If using embedded mode what you would need to do to have it search the base directory for the Django projects is:

WSGIPythonPath /path/to/mysite.com

If using daemon mode, you would instead use:

WSGIDaemonProcess example python-path=/path/to/mysite.com

If you leave it at that, then although your project modules will be found, the current working directory of your application is a bit of an unknown. What it will actually be set to is at the mercy of Apache, with it usually being set to the '/' directory.

This is okay because in a Python web application you should always be referring to any files you need by an absolute path name and not a relative path name. If you didn't and had been using the Django development server, you might then find that lots of things break when you go to deploy on a real WSGI server such as Apache/mod_wsgi.

This is because all the attempts to access files by a relative path name will fail, as the current working directory isn't where you expected it to be.
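One common way of getting absolute path names without hard coding them is to calculate them relative to the location of the module doing the file access. The following is a minimal sketch, where the helper name 'resource_path' and the file names are made up for illustration.

```python
import os


def resource_path(module_file, *parts):
    """Build an absolute path relative to the directory of a module."""
    base_dir = os.path.dirname(os.path.abspath(module_file))
    return os.path.join(base_dir, *parts)


# Typical use inside a module would be along the lines of:
#     TEMPLATE = resource_path(__file__, 'templates', 'index.html')
```

Because the result doesn't depend on the current working directory, it behaves the same under a development server and under Apache/mod_wsgi.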

Although it is still preferable to always use absolute path names, if for some reason that cannot be done, then with mod_wsgi daemon mode at least, you can also tell mod_wsgi to use a specific directory as the current working directory. This can be done using the 'home' option to the 'WSGIDaemonProcess' directive.

WSGIDaemonProcess example home=/path/to/mysite.com python-path=/path/to/mysite.com

As has occurred before, this shows up embedded mode to be a bit of a second class citizen, as there is no equivalent configuration for when using embedded mode.

Like with lang and locale settings, the issue is that when using embedded mode of mod_wsgi, your Python WSGI application is potentially running in the same processes as other Apache module code. You can't therefore simply hijack the current working directory for yourself. Worst case from this if you did, is that something else had made its own assumptions about what the current working directory should be and so you break it in the process.

The semi isolation offered by mod_wsgi daemon mode therefore allows you to safely change the current working directory, or at least if you are running the one WSGI application in each mod_wsgi daemon process group. If you therefore must have the current working directory be a certain directory, you can use daemon mode and the 'home' option.

In using the 'home' option the way it has worked in the past is that it only set the current working directory. This changed though in mod_wsgi 4.1.0 such that modules will be searched for automatically in that directory as well.

This means that from mod_wsgi 4.1.0 onwards you can actually simplify the options for daemon mode to:

WSGIDaemonProcess example home=/path/to/mysite.com

Combining this with a Python virtual environment which you want to use just for that daemon process group, you would use:

WSGIDaemonProcess example python-home=/path/to/venv home=/path/to/mysite.com

We therefore have a simpler way to set up the current working directory of the WSGI application so that relative paths do still work if you have not managed to ensure they are all absolute. The need to add that working directory separately to the Python module search path is gone as it will be done automatically. And finally we don't have to dig down to the 'site-packages' directory and can just specify the root of the Python virtual environment.

All well and good, but I did realise when writing this post that I probably made a bad decision at the time of changing how 'home' worked in mod_wsgi 4.1.0.

What happened was that I completely forgot that when running the 'python' executable against a script file, rather than use an empty string, it adds the directory the script is contained in.

I am not sure what I was thinking of at the time but what I did was to add an empty string into 'sys.path' when the 'home' option was used.

This still produces the desired result as the current working directory is the directory where we want to load the Python modules from, but problems would arise if for some reason your code decided to change the current working directory during the life of the process. I even warn about this in the release notes for the change, so as I said, I truly don't know what I was thinking of at the time to allow that one through.

Now I know you are all sensible programmers and would not go and change the current working directory from your WSGI application code, especially in a multi threaded configuration where it would likely be quite unsafe, so for now it is probably okay. I will though now change the behaviour in mod_wsgi 4.3.0 to use the more logical mechanism as shown right back at the start, of using the actual directory path for the 'home' option in 'sys.path' so that changing the working directory will not affect where modules are imported from.

Debugging with pdb when using mod_wsgi.

In the early days of mod_wsgi I made a decision to impose a restriction on the use of stdin and stdout by Python WSGI web applications. My reasoning around this was that if you want to make a WSGI application portable to any WSGI deployment mechanism, then you should not be attempting to use stdin/stdout. This includes either reading or writing to these file objects, or even performing a check on them to try and determine if code is running in a process attached to a TTY device.

The restriction was generally driven by the fact that WSGI adapters for CGI relied on stdin and stdout to communicate with the web server the script was running in. Although such CGI/WSGI adapters could have saved away the original stdin and stdout for their own use and then replaced the original 'sys.stdin' and 'sys.stdout' with working alternatives so users code didn't care, the original example CGI/WSGI adapter in the WSGI specification never did that, so no one as a result thought about the issue and did something about it themselves when implementing CGI/WSGI adapters.

As to what the problem was, the issue was that if any user code decided to use 'print()' to dump out debugging information so it appeared in a WSGI server log, when that WSGI application was hosted using a CGI/WSGI adapter, that debug output would end up in the HTTP response sent back to the client, as stdout is used by a CGI script to communicate with the web server.
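The problem can be sketched with a very reduced CGI style adapter. This is not the adapter from the WSGI specification; it is a simplified stand-in in which an 'io.StringIO' object takes the place of the real stdout so the result can be inspected, and the names 'noisy_application' and 'run_as_cgi' are made up for illustration.

```python
import contextlib
import io


def noisy_application(environ, start_response):
    # Debug output written with print() goes to sys.stdout.
    print('DEBUG: handling request')
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'Hello world!\n']


def run_as_cgi(application, environ):
    """Run the application with the response being whatever is written
    to stdout, returning that output as text."""
    response = io.StringIO()
    with contextlib.redirect_stdout(response):
        def start_response(status, headers):
            # A CGI script sends the status and headers via stdout.
            print('Status: %s' % status)
            for name, value in headers:
                print('%s: %s' % (name, value))
            print()

        for chunk in application(environ, start_response):
            print(chunk.decode('utf-8'), end='')
    return response.getvalue()


output = run_as_cgi(noisy_application, {})
print(output)
```

The debug line ends up as the very first line of the output, ahead of the 'Status' line and headers, which is exactly the sort of corruption of the response described above.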

So all well and good, and I thought I was doing a good thing by encouraging people to write portable WSGI application code. This isn't how users saw things though. They didn't care about such things, and because they got an exception when they tried to use stdin or stdout, they blamed mod_wsgi rather than the fact that what they were doing wasn't portable.

What happened therefore is that documentation for some Python web frameworks and various blog posts started to say that mod_wsgi has these restrictions and/or was broken and here is how you work around it. The Flask documentation even today still carries such a warning even though it isn't relevant to more recent mod_wsgi versions, with the restriction removed back in mod_wsgi 3.0, which was released on 21st November 2009, almost five years ago.

For some more background on this issue you can read my prior blog post back in 2009 about it. In short though, if you are using:

WSGIRestrictStdout Off

in the Apache configuration file, or using:

import sys
sys.stdout = sys.stderr

in the WSGI script file, you do not need to if using mod_wsgi version 3.0 or later. 
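If you want to decide this at runtime rather than by reading package versions, one possible approach is to look at the 'mod_wsgi.version' key that mod_wsgi places in the WSGI environ as a tuple. The following is a minimal sketch on that assumption. The helper name 'needs_stdout_workaround' is made up for this example.

```python
def needs_stdout_workaround(environ):
    """Return True only when running under a mod_wsgi older than 3.0."""
    version = environ.get('mod_wsgi.version')
    if version is None:
        # Not running under mod_wsgi, or the key is absent: nothing to do.
        return False
    # Compare as tuples so (3, 0), (3, 4) and (4, 2, 8) all count as new enough.
    return tuple(version) < (3, 0)
```

A WSGI application could call this with its own 'environ' and log a reminder to remove the old workaround when it returns False.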

The reason this issue came up in my discussions with people during the hallway track of DjangoCon was because we were discussing the Django debug toolbar and Python debuggers such as pdb.

In the case of pdb, in order for it to work, it needs to have access to the original stdin and stdout attached to your console in order to provide you with an interactive session.

When you remap 'sys.stdout' to 'sys.stderr' in your WSGI script file you are replacing the original stdout with stderr where stderr is always going to be connected to the Apache error log. Any output from pdb would therefore end up in the Apache error log and would not show in your interactive console.

But wait you say, Apache/mod_wsgi runs all the processes which run your actual WSGI application as background processes, so how could it work anyway? There is no way at that point that stdin and stdout would still be connected to any console shell, and since Apache is generally started as root on system startup, how is that even helpful?

What is little known is that it is in fact possible to run Apache with mod_wsgi in a single process mode where Apache is run in the foreground and where stdin and stdout are attached to your console, allowing you to potentially interact with the process.

If using a standard Apache setup, the steps required are admittedly a bit fiddly to get this running.

The first thing you need to do, if you are using mod_wsgi daemon mode, is comment out the mod_wsgi directives which set that up. This then defaults your WSGI application back to running in embedded mode.

The next thing you need to do, if you are using the worker or event MPMs of Apache, is change the MPM configuration to only create a single worker thread per process.

Finally, you then need to manually start the Apache server from a shell, giving it the '-DONE_PROCESS' or '-X' option.

/usr/sbin/httpd -X

If you are on a Linux system, it is possible you will also need to set the 'APACHE_RUN_USER' and 'APACHE_RUN_GROUP' environment variables. This is because on some Linux systems, the standard Apache configuration is dependent on these environment variables having been set by the 'apachectl' script. If needing to set them, they should be set to the user and group of the standard Apache user.

Do all that and you can now place in your code:

import pdb; pdb.set_trace()

and when that code is executed you will be thrown into an interactive pdb session where you can interact with your WSGI application. To exit out of the pdb session enter 'cont' and it will continue with the request.
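Since a forgotten 'pdb.set_trace()' will hang a request in a normal deployment, one option is to guard the breakpoint. The sketch below assumes a hypothetical 'WSGI_PDB' environment variable, a name invented for this example and not something mod_wsgi knows about, which you would set in the shell before starting Apache with '-X'.

```python
import os
import pdb


def debugging_enabled(environ=os.environ):
    # WSGI_PDB is a hypothetical variable name chosen for this example.
    return environ.get('WSGI_PDB') == '1'


def application(environ, start_response):
    if debugging_enabled():
        # Only useful when Apache was started in the foreground with -X,
        # so that stdin and stdout are attached to your console.
        pdb.set_trace()
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'Hello World!']
```

With the variable unset the application behaves normally, so the guard can stay in place while you are developing.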

You can find further information about all this in the mod_wsgi documentation about pdb. Do be warned that the WSGI middleware described there isn't strictly correct and only intercepts an exception which occurs when creating the iterable to be returned, which for a generator is even before your code gets executed. It may therefore be best to stick with 'pdb.set_trace()' for now until I fix that WSGI middleware.
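To make the limitation concrete, here is a rough sketch of a middleware that also covers errors raised while the response is being iterated. It is not the middleware from the mod_wsgi documentation, nor a preview of any fix to it. The class name 'PostMortemMiddleware' and the 'hook' parameter are inventions for this example.

```python
import pdb
import sys


class PostMortemMiddleware(object):
    """Call a debugger hook for errors raised at call time or while iterating."""

    def __init__(self, application, hook=None):
        self.application = application
        # Default hook opens pdb on the traceback; a different hook can be
        # supplied, for example to record the exception instead.
        self.hook = hook or (lambda exc_info: pdb.post_mortem(exc_info[2]))

    def __call__(self, environ, start_response):
        iterable = None
        try:
            # Because this method is a generator, the wrapped application is
            # called, and then consumed, inside the same try block.
            iterable = self.application(environ, start_response)
            for chunk in iterable:
                yield chunk
        except Exception:
            self.hook(sys.exc_info())
            raise
        finally:
            # WSGI requires close() to be called on the iterable if present.
            if hasattr(iterable, 'close'):
                iterable.close()
```

As with 'pdb.set_trace()', the default hook only makes sense when Apache is running in single process mode with a console attached.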

So it is possible to use pdb with WSGI applications hosted using Apache/mod_wsgi, but the steps do make it a bit onerous.

This is the point where some of the more recent work I am doing on mod_wsgi makes this more practical.

With the newer mod_wsgi express variant you don't have to worry about the Apache configuration, making it an ideal way to run up Apache/mod_wsgi in a development environment.

For this specific use case of wanting to run pdb, the next version of mod_wsgi (4.3.0) supports a new option for mod_wsgi express which allows it to be run in this single process mode for you automatically, thus making it easier to use pdb to debug a WSGI application running under Apache/mod_wsgi.

What is the current version of mod_wsgi?

If you pick up any Linux distribution, you will most likely come to the conclusion that the newest version of mod_wsgi available is 3.3 or 3.4. Check when those versions were released and you will find:

  • mod_wsgi version 3.3 - released 25th July 2010
  • mod_wsgi version 3.4 - released 22nd August 2012

The problem is that people look at that and, seeing only infrequent releases and nothing recent, they think that mod_wsgi is no longer being developed or supported. I have even seen comments to the effect of 'mod_wsgi is dead, use XYZ instead'.

From the perspective of someone who develops Open Source, the inclusion of out of date package versions on Linux distributions has always been a right pain in the neck.

Because of these older versions present in Linux LTS versions, you can't as easily update the documentation to ignore older ways of doing things and instead focus on just the new and better ways.

You are constantly having to deal with users who refuse to upgrade because they worship at the altar of these DEB or RPM gods.

That and users who believe you only exist to help them is why I got so miffed about Open Source and why the releases became fewer and farther between in the first place. What was the point of providing updated versions with lots of new stuff if they weren't going to be used anyway?

For many Open Source packages you might not actually get away with such infrequent releases, but because mod_wsgi served one specific purpose and did it very well, there wasn't a great need to be updating it anyway. Where updates were necessary, it tended to revolve around changes in newer versions of Python and Apache. The mod_wsgi code itself was stable and reliable enough that it just kept on working with no issues.

In my hallway track discussions at DjangoCon US 2014, this issue of mod_wsgi versions came up. There was some surprise that there were actually newer versions than those listed above.

I will not list all the newer versions than 3.4, but some key versions are:

  • mod_wsgi version 3.5 - released 20th May 2014 (security release)
  • mod_wsgi version 4.0 - never released
  • mod_wsgi version 4.1.0 - released 23rd May 2014
  • mod_wsgi version 4.1.3 - released 3rd June 2014
  • mod_wsgi version 4.2.0 - released 8th June 2014
  • mod_wsgi version 4.2.8 - released 22nd August 2014
  • mod_wsgi version 4.3.0 - working on it

When you look back at the full release history, from the first official version ever released on the 5th September 2007 up to when version 3.5 was released, but not including version 3.5, you get 21 releases in a period of 6 1/2 years.

Not exactly a lot of releases and it definitely slowed down in the last 4 years of that with only 2 releases.

Contrast that to the period of time from version 3.5 up till now. In that short time of a bit over 3 months there have been 14 releases. I even switched to a 3 digit version numbering scheme from a 2 digit number scheme precisely so I felt no impediment to making new releases.

So mod_wsgi was resting there for quite a while, but it isn't the dead parrot from the Monty Python skit, it is definitely alive and kicking.

Given there have been so many versions released of late, you might think you would have seen more said about them. I have made a few noises on Twitter and the mod_wsgi mailing list, but in the main I have indeed been rather quiet about them.

The reason I have been quiet is that I have returned to working on mod_wsgi for the enjoyment of it. I don't really care if no one wants to use the newer features right now and am more than happy to just use it as a means to explore my ideas for what makes a good deployment system for Python WSGI applications.

If you want to try and work out what I have been working on, for now I would suggest you read the documentation on the mod_wsgi package entry on PyPI. For lower level details of specific changes, see the release notes. Otherwise, keep an eye here on my blog as what I am working on has been a topic of discussion at DjangoCon as well, which means I will have to explain something about the newer versions if I am to keep reporting on my DjangoCon hallway track discussions as planned.

Using Python virtual environments with mod_wsgi.

You should be using Python virtual environments and if you don't know why you should, maybe you should find out.

That said, the use of Python virtual environments was the next topic that came up in my hallway track discussions at DjangoCon US 2014. The pain point here is in part actually of my own creation. This is because although there are better ways of using Python virtual environments with mod_wsgi available today than there used to be, I have never actually gone back and properly fixed up the documentation to reflect the changes.

When using mod_wsgi embedded mode, one would use the 'WSGIPythonHome' directive, setting it to be the top level directory of the Python virtual environment you wish to use. If you don't know what that is supposed to be, then you can interrogate it using the command line Python interpreter:

>>> import sys
>>> sys.prefix
'/Users/graham/Projects/mod_wsgi/venv'

Most important is that this should refer to a directory. It is an all too common mistake that I see that people set the 'WSGIPythonHome' directive to be the path to the 'python' executable from the virtual environment. That is plain wrong, so please do not do it. Doing so will see the setting be ignored completely and the default algorithm for finding what Python installation to use will be used instead.
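A quick way to catch this mistake before touching the Apache configuration is to check the candidate value from Python. The helper below is a small sketch. 'check_python_home' is a name made up for this example and it only tests whether the path is a directory, nothing more.

```python
import os
import sys


def check_python_home(path):
    """Return None if path looks usable as WSGIPythonHome, else a reason."""
    if os.path.isfile(path):
        return 'file, not a directory (is this the python executable?)'
    if not os.path.isdir(path):
        return 'does not exist'
    return None


if __name__ == '__main__':
    # Run with the python from the virtual environment you intend to use.
    print('%s: %s' % (sys.prefix, check_python_home(sys.prefix) or 'ok'))
```

Running the script with the virtual environment's own interpreter prints the value of 'sys.prefix', which is the directory to give to 'WSGIPythonHome'.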

If using daemon mode of mod_wsgi and you are hosting only the one Python WSGI application, then you can again just rely on the 'WSGIPythonHome' directive, pointing it at the Python virtual environment you want to use. If you are hosting more than one WSGI application however, and you want each to use a different Python virtual environment, then you need to do a bit more work.

The mod_wsgi documentation on this steers you towards a convoluted bit of code to include in your WSGI application to do this, explaining in part why this is the safest option.

ALLDIRS = ['/usr/local/pythonenv/PYLONS-1/lib/python2.5/site-packages']
import sys
import site
# Remember original sys.path.
prev_sys_path = list(sys.path)
# Add each new site-packages directory.
for directory in ALLDIRS:
    site.addsitedir(directory)
# Reorder sys.path so new directories are at the front.
new_sys_path = []
for item in list(sys.path):
    if item not in prev_sys_path:
        new_sys_path.append(item)
        sys.path.remove(item)
sys.path[:0] = new_sys_path

Part of the reasoning behind giving that as the recipe was a distrust of the 'activate_this.py' script that is included in a Python virtual environment and advertised as the solution to use for embedded Python environments such as mod_wsgi.

The reason I was cool on 'activate_this.py' was that it stomped on the value of 'sys.prefix'. In the context of mod_wsgi, because the Python installation that mod_wsgi was actually compiled against or using may be at a different location, I was worried about whether modifying 'sys.prefix' would cause something to break.

I therefore gave only guarded approval to using 'activate_this.py'.

In the many years mod_wsgi has been available though, I have to admit that no issues ever came up around 'sys.prefix' being overridden.

So, if you do not have access to make changes in the Apache configuration files for some reason, then the easiest way to activate a Python virtual environment in your WSGI script file is:

activate_this = '/usr/local/pythonenv/PYLONS-1/bin/activate_this.py'
execfile(activate_this, dict(__file__=activate_this))
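Note that 'execfile()' only exists in Python 2. Under Python 3 the same effect can be had with 'exec()', as in the minimal sketch below. The helper name 'run_script' is made up for this example, and it assumes the virtual environment actually ships an 'activate_this.py' script, which not every tool that creates virtual environments does.

```python
def run_script(path):
    """Execute a script file much as execfile() did, returning its namespace."""
    namespace = dict(__file__=path)
    with open(path) as handle:
        # Compile with the real filename so tracebacks point at the script.
        code = compile(handle.read(), path, 'exec')
    exec(code, namespace)
    return namespace
```

In a WSGI script file this would be used as 'run_script(activate_this)', with 'activate_this' set to the same path as in the Python 2 version.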

This is still a pain to have to include because you are adding to the WSGI script file knowledge of the execution environment it is being run in, which is notionally a bad idea.

The alternative to modifying the WSGI script file was to add just the 'site-packages' directory from the Python virtual environment in the Apache configuration.

For embedded mode of mod_wsgi you would do this by using the 'WSGIPythonPath' directive:

WSGIPythonPath /usr/local/pythonenv/PYLONS-1/lib/python2.5/site-packages

If using daemon mode of mod_wsgi you would use the 'python-path' option to the WSGIDaemonProcess directive.

WSGIDaemonProcess pylons python-path=/usr/local/pythonenv/PYLONS-1/lib/python2.5/site-packages

What was ugly about this was that you had to refer to the 'site-packages' directory where it existed down in the Python virtual environment. That directory name also included the Python version, so if you ever changed what Python version you were using, you had to remember to go change the configuration.
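The version dependence is easy to see if you build the path yourself. This is only an illustrative sketch, assuming the usual POSIX virtual environment layout of 'lib/pythonX.Y/site-packages'. The function name 'site_packages_dir' is made up for this example.

```python
import posixpath
import sys


def site_packages_dir(venv, version_info=sys.version_info):
    """Build the POSIX-style site-packages path for a virtual environment."""
    # The major and minor version numbers are baked into the directory name.
    version = 'python%d.%d' % (version_info[0], version_info[1])
    return posixpath.join(venv, 'lib', version, 'site-packages')
```

Passing '(2, 5)' for 'version_info' gives the path used in the configuration above, and any other version gives a different path for the same virtual environment, which is why the configuration had to be edited on every Python upgrade.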

The good news is that as of mod_wsgi version 3.4 there is a better way.

Rather than fiddling with what goes into 'sys.path' using the 'WSGIPythonPath' directive or the 'python-path' option to 'WSGIDaemonProcess', you can use the 'python-home' option on the 'WSGIDaemonProcess' directive itself.

WSGIDaemonProcess pylons python-home=/usr/local/pythonenv/PYLONS-1

As when using the 'WSGIPythonHome' directive, this should be the top level directory of the Python virtual environment you wish to use. In this case the value will only be used for this specific mod_wsgi daemon process group.

If you are therefore using a new enough mod_wsgi version, and using mod_wsgi daemon mode, then switch to the 'python-home' option of 'WSGIDaemonProcess'.

Monday, September 1, 2014

Setting LANG and LC_ALL when using mod_wsgi.

So I am at DjangoCon US 2014 and one of the first pain points for using mod_wsgi that came up in discussion was the lang and locale settings. These settings influence what the default encoding is for Python when implicitly converting Unicode to byte strings. In other words, they dictate what is going on at the Unicode/bytes boundary.

Now this should not really be an issue with WSGI at least, because you should always be explicitly specifying the encoding you want the response content to be when it is being returned from the WSGI application and that should match the 'charset' attribute specified in the 'Content-Type' response header. There are however lots of other cases where the problem can still present itself.
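As a minimal sketch of what being explicit looks like in a WSGI application, the example below encodes the response body itself and declares the same encoding in the 'Content-Type' header. The choice of UTF-8 and the sample text are assumptions made for this example.

```python
def application(environ, start_response):
    text = u'Symbol: \u292e\n'
    # Encode explicitly rather than relying on the process default encoding.
    body = text.encode('utf-8')
    start_response('200 OK', [
        # The charset here must match the encoding used for the body.
        ('Content-Type', 'text/plain; charset=utf-8'),
        ('Content-Length', str(len(body))),
    ])
    return [body]
```

Because nothing here depends on the default encoding, the response is the same whatever 'LANG' and 'LC_ALL' happen to be.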

Take for example the simple case of printing out the value of a Unicode string, shown here in a command line interpreter but no different to printing it into the Apache error log:

>>> print(u'\u292e')

On a system with a sane configuration, this would display as you expect. The reason for this is that your login shell environment would typically set an environment variable such as 'LANG'. On my MacOS X system for example I have:

LANG=en_AU.UTF-8

When I use the 'locale' module to see what Python sees, we get:

>>> import locale
>>> locale.getdefaultlocale()
('en_AU', 'UTF-8')
>>> locale.getpreferredencoding()
'UTF-8'

UTF-8 is generally the magic value that solves all problems. With that you should generally be okay.

The problem now is that when using Apache/mod_wsgi on many Linux systems, the Apache process doesn't inherit any environment variables which override the default locale or language settings. So what the Python code running under Apache/mod_wsgi sees is:

>>> import locale
>>> locale.getdefaultlocale()
(None, None)
>>> locale.getpreferredencoding()
'US-ASCII'

With the Python interpreter now using these values, if we try and print out a Unicode value, we can encounter problems.

>>> print(u'\u292e')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u292e' in position 0: ordinal not in range(128)

And this is the trap that people encounter when using Apache/mod_wsgi. They will run up their Python WSGI application with a development server, such as that provided by Django, and everything will work fine. Host it with Apache/mod_wsgi on a Linux system though, and if they have not ensured that encodings are always being used explicitly when converting from Unicode to bytes, they can start to get 'UnicodeEncodeError' exceptions.
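The failure can be reproduced regardless of the locale by naming the 'ascii' codec explicitly, which is effectively what the implicit conversion ends up doing when no locale is set. The sketch below shows that, along with one hedged way of avoiding it when writing log output. Both function names are made up for this example.

```python
def ascii_encoding_fails(text):
    """Return True if text cannot be represented using the ASCII codec."""
    try:
        text.encode('ascii')
    except UnicodeEncodeError:
        return True
    return False


def encode_for_log(text, encoding='utf-8', errors='backslashreplace'):
    # Name the encoding explicitly, and escape rather than fail on anything
    # that the chosen encoding cannot represent.
    return text.encode(encoding, errors)
```

Here 'ascii_encoding_fails(u'\u292e')' is true, while 'encode_for_log(u'\u292e')' yields UTF-8 bytes and never raises for valid text.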

What is the solution then?

As is detailed in the Django documentation, you can set the environment variables yourself.

export LANG='en_US.UTF-8'
export LC_ALL='en_US.UTF-8'

The problem now is where do you set them. This is because they need to be set in the environment under which the initial Apache process starts up.

For a standard Apache Software Foundation layout for an Apache httpd installation, the file this needs to be put in is the 'envvars' file. This file would exist in the same directory as where 'apachectl' resides.

The location means though that you are modifying what is a system file. If you upgrade Apache it is quite possible, unless the package update looks for an existing file, that it will be overwritten.

A further problem is that Linux systems generally do not use the Apache Software Foundation installation layout and do away with the 'envvars' file completely. In these cases you have to find the correct system init script file to add the settings in. What file this is, differs between Linux distributions.

So being able to override the value for the lang and locale can be a pain. It can differ between Linux distributions and even changing the file may not survive system package upgrades.

Is there a better solution?

If using embedded mode of mod_wsgi to run your WSGI application with Apache the answer is currently no.

This isn't because it is technically impossible, but because Apache can be used to host more than just Python web applications at the same time and I didn't think it would be particularly friendly to provide a mod_wsgi configuration directive which modified the lang and locale for the whole of Apache, thereby affecting other Apache modules.

Was this a wise choice? I am not sure. I am probably still open to providing such an option to override them for the whole of Apache, but certainly would need to document the dangers of using it.

The thing is though, that unless you really know what you are doing, you shouldn't be using embedded mode of mod_wsgi on UNIX systems anyway, even though it is the default. The preferred and better configuration is to use mod_wsgi daemon mode.

For mod_wsgi daemon mode then, what can be done?

For this way of using mod_wsgi what you can do is set the 'lang' and 'locale' options to the 'WSGIDaemonProcess' directive.

WSGIDaemonProcess my-django-site lang='en_US.UTF-8' locale='en_US.UTF-8'

For daemon mode of mod_wsgi at least, that is the solution for this particular pain point.

Do note though that you must be using mod_wsgi 3.4 or later to have access to these options of the 'WSGIDaemonProcess' directive.
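To confirm what a given daemon process group actually ends up with, you can temporarily deploy a small diagnostic application. The one below is only a sketch. The 'locale_report' helper is a name invented for this example, and what it reports will depend on your platform and configuration.

```python
import locale
import os


def locale_report(environ=os.environ):
    """Collect the settings that influence Python's default encoding."""
    return {
        'LANG': environ.get('LANG'),
        'LC_ALL': environ.get('LC_ALL'),
        # Pass False so the check itself does not change the locale.
        'preferred_encoding': locale.getpreferredencoding(False),
    }


def application(environ, start_response):
    report = locale_report()
    lines = ['%s=%s' % item for item in sorted(report.items())]
    body = '\n'.join(lines).encode('utf-8')
    start_response('200 OK', [
        ('Content-Type', 'text/plain; charset=utf-8'),
        ('Content-Length', str(len(body))),
    ])
    return [body]
```

Comparing the output with and without the 'lang' and 'locale' options set on 'WSGIDaemonProcess' shows whether they are taking effect.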

Reporting on the DjangoCon US 2014 hallway track.

I have only been in Portland for a few hours for DjangoCon, and despite some lack of sleep, I already feel that being here is recharging my enthusiasm for working on Open Source, something that has still been sagging a bit lately. I don't wish to return to that dark abyss I was in, so this is definitely what I need.

Now lots of people write up reports on conferences including live noting them, but I thought I might see if I can manage something a bit different for DjangoCon US this year. That is, rather than simply live note about the talks or give a general round up of how I felt about the conference, I am going to try and do some quick fire blog posts on specific topics that have come up in discussions with others in the hallways or during social drinks.

Even during the first evening's drinks, graciously hosted by Steve Holden, a whole bunch of topics related to mod_wsgi came up. In most cases these revolved around someone experiencing a pain point over how to do something, but where my response was, "oh, there is a better way of doing that".

I will therefore endeavour to write up some mini blog posts on the topics that come up and expand on them as I can.

Since the hallway track can be a learning experience in itself, I thought this might be of interest to others as well.

And if you are at DjangoCon, do make sure you make good use of the hallway track. Reach out and talk to people you don't know. Hunt down those experts who might be able to help you with a specific problem. Such discussions need not be one way and I know I learn lots about how people use mod_wsgi from such conversations.

If you do want to specifically get hold of me, you will find I will not be there on all days. I will at least be there on the first day of talks and for the sprints. I also aim to get to anything happening in the evenings. See you there.