Tuesday, September 2, 2014

Python module search path and mod_wsgi.

When you run the Python interpreter on the command line as an interactive console, a module that resides in the directory from which you ran the 'python' executable will be found and imported without any problems.

How can this be, though, when we haven't done anything to add the current working directory to 'sys.path', and the Python interpreter hasn't added it either?

>>> import os, sys
>>> os.getcwd() in sys.path
False

What does get put in 'sys.path' then, and does that give us a clue as to why the module is being found?

>>> sys.path
['', '/Library/Frameworks/Python.framework/Versions/3.4/lib/python34.zip',

As you can see we have a whole bunch of directories related to our actual Python installation.

We can also see that 'sys.path' includes an empty string. What is that about?

On the assumption that it is there for a reason, let's try deleting that entry to work out what it is for, and see whether that affects our attempt to import a module.

$ touch example.py
$ python3.4
Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import example
$ python3.4
Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> del sys.path[0]
>>> import example
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'example'

As you can see, it is significant. Once we delete the empty string from 'sys.path', we can no longer import modules from the current working directory.

As it turns out, what this magic value of an empty string does is tell Python that when performing a module import, it should look in the current working directory. That is, the directory that would be returned by:

>>> import os
>>> os.getcwd()

Initially this would be the directory in which you ran the 'python' executable. If you then use 'os.chdir()' to change the working directory, the location Python searches for imported modules changes with it, becoming the directory you changed to.
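
To make that concrete, here is a minimal sketch of the effect, assuming a throwaway module called 'cwd_demo_module' (a hypothetical name used only for this illustration) written into a temporary directory. With an empty string in 'sys.path', the import only succeeds once the working directory has been changed to where the module lives.

```python
import importlib
import os
import sys
import tempfile


def import_follows_cwd(module_name="cwd_demo_module"):
    """Return (found_before_chdir, found_after_chdir) for a temp module.

    The module name is hypothetical and exists only for this demo.
    """
    original_cwd = os.getcwd()
    original_path = list(sys.path)
    with tempfile.TemporaryDirectory() as target:
        with open(os.path.join(target, module_name + ".py"), "w") as handle:
            handle.write("VALUE = 42\n")
        # Mimic the interactive console: '' means "the current directory".
        sys.path.insert(0, "")
        try:
            try:
                importlib.import_module(module_name)
                found_before = True
            except ImportError:
                found_before = False
            # Changing directory changes where the '' entry points.
            os.chdir(target)
            importlib.invalidate_caches()
            module = importlib.import_module(module_name)
            found_after = module.VALUE == 42
        finally:
            # Restore process state before the temp directory is removed.
            os.chdir(original_cwd)
            sys.path[:] = original_path
            sys.modules.pop(module_name, None)
    return found_before, found_after


if __name__ == "__main__":
    print(import_follows_cwd())
```

The function restores the working directory and 'sys.path' afterwards, so it should be safe to try from a scratch interpreter session.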

What about when executing Python against a script file instead of running an interactive console?

$ python3.4 -i example.py
>>> import sys
>>> sys.path
['/private/tmp', '/Library/Frameworks/Python.framework/Versions/3.4/lib/python34.zip',

This time there is no empty string. Instead of the empty string, Python has calculated the name of the directory which the script is located in and added that to 'sys.path'.

Any module in the same directory as the script will still be found when imported. Importantly, though, if the current working directory of the application changes, the directory searched does not change with it: the script's directory is still the one searched.
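
A rough way to check this for yourself is sketched below. It assumes two hypothetical files, 'main_demo.py' and 'helper_demo.py', created side by side in a temporary directory. The script changes its working directory to somewhere else entirely before importing its sibling module, and the import should still work because the script's directory was added to 'sys.path'.

```python
import os
import subprocess
import sys
import tempfile


def script_dir_survives_chdir():
    """Run a script that calls chdir() before importing a sibling module."""
    with tempfile.TemporaryDirectory() as project, \
            tempfile.TemporaryDirectory() as elsewhere:
        # Hypothetical sibling module living next to the script.
        with open(os.path.join(project, "helper_demo.py"), "w") as handle:
            handle.write("VALUE = 'found'\n")
        script = os.path.join(project, "main_demo.py")
        with open(script, "w") as handle:
            handle.write(
                "import os, sys\n"
                "os.chdir(sys.argv[1])\n"
                "import helper_demo\n"
                "print(helper_demo.VALUE)\n"
            )
        env = dict(os.environ)
        # Newer Pythons can be told not to add the script directory at all;
        # make sure that setting is not inherited by the child process.
        env.pop("PYTHONSAFEPATH", None)
        result = subprocess.run(
            [sys.executable, script, elsewhere],
            capture_output=True, text=True, check=True,
            cwd=elsewhere, env=env,
        )
        return result.stdout.strip()


if __name__ == "__main__":
    print(script_dir_survives_chdir())
```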

So what does this all have to do with mod_wsgi?

Well, under mod_wsgi, because it is an embedded system and it uses the C APIs directly for initialising the Python interpreter, neither the empty string nor the directory containing any script file will be added to 'sys.path'.

This means that only the module directories which form part of the standard Python module search path will actually ever be searched. Any directory where the WSGI application may reside will not be automatically searched. In fact, at the time the Python interpreter is initialised, it will not generally be known what WSGI applications will even be run within a specific Python interpreter, since working out which WSGI script file to load is only done lazily as actual web requests arrive.

The consequence of this is that if your WSGI application is not all contained in a single WSGI script file, then you will need to explicitly set up additional directories that Python should search for modules.

For a Django application, that means adding the project base directory. This is what was touched on in the discussion I had at DjangoCon US 2014, with me saying again 'but there is a better way'.

If using embedded mode, what you need to do to have it search the base directory of the Django project is:

WSGIPythonPath /path/to/mysite.com

If using daemon mode, you would instead use:

WSGIDaemonProcess example python-path=/path/to/mysite.com

If you leave it at that, then although your project modules will be found, the current working directory of your application is a bit of an unknown. What it will actually be set to is at the mercy of Apache, with it usually being set to the '/' directory.

This is okay because in a Python web application you should always be referring to any files you need by an absolute path name and not a relative path name. If you didn't and had been using the Django development server, you might then find that lots of things break when you go to deploy on a real WSGI server such as Apache/mod_wsgi.

This is because all the attempts to access files by a relative path name will fail, as the current working directory isn't where you expected it to be.
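
One common way to get absolute path names without hard coding them is to anchor them to the location of a source file. The helper below is only a sketch, and the name 'path_beside' is made up for this example. In real code you would normally pass it '__file__' from the module that owns the data files.

```python
import os


def path_beside(anchor_file, *parts):
    """Build an absolute path relative to the directory of anchor_file.

    Hypothetical helper: in a module you would typically call it as
    path_beside(__file__, "templates", "index.html").
    """
    base_dir = os.path.dirname(os.path.abspath(anchor_file))
    return os.path.join(base_dir, *parts)


if __name__ == "__main__":
    # The result does not depend on later calls to os.chdir(), provided
    # the anchor itself is an absolute path.
    print(path_beside(os.path.abspath(__file__), "data", "settings.ini"))
```

Because the base directory is resolved once from an absolute anchor, code using the result no longer cares what Apache chose as the working directory.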

Although it is still preferable to always use absolute path names, if for some reason that cannot be done, then with mod_wsgi daemon mode at least, you can also tell mod_wsgi to use a specific directory as the current working directory. This can be done using the 'home' option to the 'WSGIDaemonProcess' directive.

WSGIDaemonProcess example home=/path/to/mysite.com python-path=/path/to/mysite.com

As has occurred before, this shows up embedded mode as a bit of a second class citizen, as there is no equivalent configuration when using embedded mode.

As with the lang and locale settings, the issue is that when using embedded mode of mod_wsgi, your Python WSGI application is potentially running in the same processes as other Apache module code. You therefore can't simply hijack the current working directory for yourself. The worst case if you did is that something else has made its own assumptions about what the current working directory should be, and you break it in the process.

The semi isolation offered by mod_wsgi daemon mode therefore allows you to safely change the current working directory, at least if you are running only the one WSGI application in each mod_wsgi daemon process group. If you must have the current working directory be a certain directory, you can use daemon mode and the 'home' option.

In the past, the 'home' option only set the current working directory. This changed in mod_wsgi 4.1.0, such that modules are also searched for automatically in that directory.
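
If you want to see what a particular configuration actually ends up with, a tiny diagnostic WSGI application can report it. This is just a sketch of the idea and not anything mod_wsgi ships. It returns the current working directory and the contents of 'sys.path' as plain text, so it should only ever be mounted temporarily on a private URL.

```python
import os
import sys


def application(environ, start_response):
    """Report the working directory and module search path as plain text."""
    lines = [
        "cwd: " + os.getcwd(),
        "empty string in sys.path: " + str("" in sys.path),
        "sys.path:",
    ]
    lines.extend("  " + repr(entry) for entry in sys.path)
    body = "\n".join(lines).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Loading this once with and once without the 'home' option makes the difference in working directory easy to compare.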

This means that from mod_wsgi 4.1.0 onwards you can actually simplify the options for daemon mode to:

WSGIDaemonProcess example home=/path/to/mysite.com

Combining this with a Python virtual environment which you want to use just for that daemon process group, you would use:

WSGIDaemonProcess example python-home=/path/to/venv home=/path/to/mysite.com

We therefore have a simpler way to set up the current working directory of the WSGI application so that relative paths do still work if you have not managed to ensure they are all absolute. The need to add that working directory separately to the Python module search path is gone, as it will be done automatically. And finally, we don't have to dig down to the 'site-packages' directory and can just specify the root of the Python virtual environment.

All well and good, but I did realise when writing this post that I probably made a bad decision at the time of changing how 'home' worked in mod_wsgi 4.1.0.

I completely forgot that when the 'python' executable is run against a script file, it adds the directory containing the script to 'sys.path' rather than an empty string.

I am not sure what I was thinking at the time, but what I did was add an empty string to 'sys.path' when the 'home' option was used.

This still produces the desired result, as the current working directory is the directory where we want to load the Python modules from, but problems would arise if for some reason your code decided to change the current working directory during the life of the process. I even warn about this in the release notes for the change, so as I said, I truly don't know what I was thinking at the time to allow that one through.

Now I know you are all sensible programmers and would not go and change the current working directory from your WSGI application code, especially in a multithreaded configuration where it would likely be quite unsafe, so for now it is probably okay. I will though now change the behaviour in mod_wsgi 4.3.0 to use the more logical mechanism shown right back at the start, of using the actual directory path for the 'home' option in 'sys.path', so that changing the working directory will not affect where modules are imported from.
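
The difference between the two approaches can be illustrated in pure Python. The function below is a hypothetical helper written for this post, not mod_wsgi's own code. It swaps any empty string entry in a search path for a fixed absolute directory, which is the kind of entry that keeps pointing at the same place no matter what 'os.chdir()' does later.

```python
import os


def pin_empty_entries(search_path, directory=None):
    """Return a copy of search_path with '' replaced by an absolute path.

    Hypothetical helper for illustration only. If no directory is given,
    the current working directory at call time is used.
    """
    pinned = os.path.abspath(directory if directory else os.getcwd())
    return [pinned if entry == "" else entry for entry in search_path]


if __name__ == "__main__":
    import sys
    # Show what sys.path would look like with the '' entry pinned.
    print(pin_empty_entries(sys.path))
```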
