Friday, April 10, 2015

Automatic patching of Python applications.

In my previous posts on monkey patching I discussed the ordering problem: the ability to monkey patch properly depends on whether we can get in before any other code has imported the module we want to patch. The specific issue is where other code has imported a reference to a function within a module by name and stored it in its own namespace. In other words, where it has used:

from module import function

If we can’t get in early enough, then it becomes necessary to monkey patch all such uses of a target function as well, which in the general case is impossible as we will not know where the function has been imported.
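
To make this concrete, here is a minimal runnable sketch, using 'socket.gethostname()' purely as a stand-in target for patching:

from __future__ import print_function

import socket
from socket import gethostname as local_gethostname  # reference taken by name

def patched_gethostname():
    return 'patched'

# Patch the function on the module itself.
socket.gethostname = patched_gethostname

print(socket.gethostname())  # prints 'patched'
print(local_gethostname())   # still the original, the patch is not seen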

Part of the solution I described for this was to use a post import hook mechanism to give us access to a module for monkey patching before the module is even returned to any code importing it. This technique is, though, still dependent on the post import hook mechanism itself being installed before any other code is effectively run. That means having to manually modify the main Python script file for an application, something which isn't always practical.

The point of this post is to look at how we can avoid the need to even modify that main Python script file. For this there are a few techniques that could be used. I am going to look at the most evil of those techniques first and then talk about others in a subsequent post.

Executable code in .pth files

As part of the Python import system and how it determines which directories are searched for Python modules, there is a mechanism whereby a package can install a file with a '.pth' extension into the Python 'site-packages' directory. The code for the package itself might then be installed in a different location not actually on the Python module search path, most often in a versioned subdirectory of the 'site-packages' directory. The purpose of the '.pth' file is to act as a pointer to where the actual code for the Python package lives.

In the simple case the '.pth' file will contain a relative or absolute path name for the directory containing the code for the Python package. In the case of a relative path name, it will be taken relative to the directory in which the '.pth' file is located.
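
For example, a hypothetical 'mypackage.pth' file installed into 'site-packages' might contain nothing more than:

mypackage-1.0

with the result that the 'mypackage-1.0' subdirectory of 'site-packages' is added to the list of directories searched for modules.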

With such ‘.pth’ files in place, when the Python interpreter is initialising itself and setting up the Python module search path, after it has added in all the default directories to be searched, it will look through the ‘site-packages’ directory and parse each ‘.pth’ file, adding to the final list of directories to be searched any directories specified within the ‘.pth’ files.

Now at one point in the history of Python this '.pth' mechanism was enhanced to allow for a special case. The special case was that if a line in the '.pth' file started with 'import', the line would be executed as Python code, instead of being added as a directory to the list of directories to be searched for modules.

I am told this was originally to allow special startup code to be executed for a module, to allow registration of a non standard codec for Unicode. It has since also been used in the implementation of 'easy_install'. If you have ever run 'easy_install' and looked at the 'easy-install.pth' file in the 'site-packages' directory, you will find some code which looks like:

import sys; sys.__plen = len(sys.path)
./antigravity-0.1-py2.7.egg
import sys; new=sys.path[sys.__plen:]; del sys.path[sys.__plen:]; p=getattr(sys,'__egginsert',0); sys.path[p:p]=new; sys.__egginsert = p+len(new)

So as long as you can fit the code on one line, you can potentially do some quite nasty stuff inside of a ‘.pth’ file every time that the Python interpreter is run.

Personally I find the concept of executable code inside of a ‘.pth’ file really dangerous and up until now have avoided relying on this feature of ‘.pth’ files.

My concern over executable code in '.pth' files is the fact that it is always run. This means that even if you had installed a pre-built RPM/DEB package or a Python wheel into a system wide Python installation, with the idea that this was somehow much safer because you were avoiding running the 'setup.py' file for a package as the 'root' user, the '.pth' file means that the package can still subsequently run code without you realising it, and without you even having imported the module into any application.

If one wanted to be paranoid about security, Python should really have a whitelisting mechanism for which '.pth' files you want to trust and allow code to be executed from every time the Python interpreter is run, especially as the 'root' user.

I will leave that discussion to others if anyone cares to be concerned. For now at least I will show how this feature of '.pth' files can be used (abused) to implement a mechanism for automated monkey patching of any Python application being run.

Adding Python import hooks

In the previous post where I talked about the post import hook mechanism, the code I gave as needing to be able to be manually added at the start of any Python application script file was:

import os
from wrapt import discover_post_import_hooks
patches = os.environ.get('WRAPT_PATCHES')
if patches:
    for name in patches.split(','):
        name = name.strip()
        if name:
            print 'discover', name
            discover_post_import_hooks(name)

What this was doing was using an environment variable as the source of names for any packages registered using ‘setuptools’ entry points that contained monkey patches we wanted to have applied.
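
With that snippet added at the start of the application script, and monkey patches registered under a hypothetical 'setuptools' entry point group called 'mypatches', the application would then be run as:

WRAPT_PATCHES=mypatches python app.py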

Knowing about the ability to have executable code in '.pth' files, let's now work out how we can use that to instead have this code executed automatically every time the Python interpreter is run, thereby avoiding the need to manually modify every Python application we want to have monkey patches applied to.

In practice, however, the code we need is going to be slightly more complicated than this, and as a result not something that can readily be added directly to a '.pth' file, due to the limitation of the code all needing to be on one line. What we will therefore do is put all our code in a separate module and execute it from there. We don't want to be too nasty and import that module every time though, perhaps scaring users when they see it imported even if not used, so we will gate even that on the presence of the environment variable.

What we can therefore use in our '.pth' file is:

import os, sys; os.environ.get('AUTOWRAPT_BOOTSTRAP') and __import__('autowrapt.bootstrap') and sys.modules['autowrapt.bootstrap'].bootstrap()

That is, only if the environment variable is set to a non-empty value do we import the module containing our bootstrap code and execute it.

As to the bootstrap code, this is where things get a bit messy. We can’t just use the code we had used before when manually modifying the Python application script file. This is because of where in the Python interpreter initialisation the parsing of ‘.pth’ files is done.

The problems are twofold. The first issue with executing the discovery of the import hooks directly when the '.pth' file is processed is that the order in which '.pth' files are processed is unknown, so at the point our code is run the final Python module search path may not have been set up. The second issue is that '.pth' file processing is done before any 'sitecustomize.py' or 'usercustomize.py' processing has been done. The Python interpreter therefore may not yet be in its final configured state. We therefore have to be a little bit careful about what we do.

What we really want is to defer any actions until the Python interpreter initialisation has been completed. The problem is how we achieve that.

Python interpreter ‘site’ module

The actual final parts of Python interpreter initialisation are performed from the 'main()' function of the 'site' module:

def main():
    global ENABLE_USER_SITE
    abs__file__()
    known_paths = removeduppaths()
    if ENABLE_USER_SITE is None:
        ENABLE_USER_SITE = check_enableusersite()
    known_paths = addusersitepackages(known_paths)
    known_paths = addsitepackages(known_paths)
    if sys.platform == 'os2emx':
        setBEGINLIBPATH()
    setquit()
    setcopyright()
    sethelper()
    aliasmbcs()
    setencoding()
    execsitecustomize()
    if ENABLE_USER_SITE:
        execusercustomize()
    # Remove sys.setdefaultencoding() so that users cannot change the
    # encoding after initialization. The test for presence is needed when
    # this module is run as a script, because this code is executed twice.
    if hasattr(sys, "setdefaultencoding"):
        del sys.setdefaultencoding

The ‘.pth’ parsing and code execution we want to rely upon is done within the ‘addsitepackages()’ function.

What we really want therefore is to defer any execution of our code until after the functions ‘execsitecustomize()’ or ‘execusercustomize()’ are run. The way to achieve that is to monkey patch those two functions and trigger our code when they have completed.

We have to monkey patch both because the 'usercustomize.py' processing is optional, dependent on whether 'ENABLE_USER_SITE' is true. Our 'bootstrap()' function therefore needs to look like:

import site

def _execsitecustomize_wrapper(wrapped):
    def _execsitecustomize(*args, **kwargs):
        try:
            return wrapped(*args, **kwargs)
        finally:
            # Only bootstrap here if 'usercustomize' support is disabled,
            # as otherwise execusercustomize() will still be run later.
            if not site.ENABLE_USER_SITE:
                _register_bootstrap_functions()
    return _execsitecustomize

def _execusercustomize_wrapper(wrapped):
    def _execusercustomize(*args, **kwargs):
        try:
            return wrapped(*args, **kwargs)
        finally:
            _register_bootstrap_functions()
    return _execusercustomize

def bootstrap():
    site.execsitecustomize = _execsitecustomize_wrapper(site.execsitecustomize)
    site.execusercustomize = _execusercustomize_wrapper(site.execusercustomize)

Despite everything I have ever said about how manually constructed monkey patches are bad and that the 'wrapt' module should be used for doing monkey patching, we can't actually use the 'wrapt' module in this case. This is because, technically, as a user installed package, the 'wrapt' package may not be usable at this point. This could occur where 'wrapt' was installed in such a way that the ability to import it was itself dependent on the processing of '.pth' files. As a result we drop down to using a simple wrapper implemented with a function closure.

In the actual wrappers, you can see that which of the two ends up calling '_register_bootstrap_functions()' depends on whether 'ENABLE_USER_SITE' is true, with 'execsitecustomize()' only making the call if support for 'usercustomize' was disabled.

Finally we now have our '_register_bootstrap_functions()' defined as:

import os

_registered = False

def _register_bootstrap_functions():
    global _registered
    if _registered:
        return
    _registered = True

    from wrapt import discover_post_import_hooks

    for name in os.environ.get('AUTOWRAPT_BOOTSTRAP', '').split(','):
        discover_post_import_hooks(name)

Bundling it up as a package

We have worked out the various bits we require, but how do we get this installed? In particular, how do we get the custom '.pth' file installed? For that we use a 'setup.py' file of:

import sys
import os
from setuptools import setup
from distutils.sysconfig import get_python_lib
setup_kwargs = dict(
    name = 'autowrapt',
    packages = ['autowrapt'],
    package_dir = {'autowrapt': 'src'},
    data_files = [(get_python_lib(prefix=''), ['autowrapt-init.pth'])],
    entry_points = {'autowrapt.examples': ['this = autowrapt.examples:autowrapt_this']},
    install_requires = ['wrapt>=1.10.4'],
)
setup(**setup_kwargs)

To get that '.pth' file installed we have used the 'data_files' argument to the 'setup()' call. The actual location for installing the file is determined using the 'get_python_lib()' function from the 'distutils.sysconfig' module. The 'prefix' argument of an empty string ensures that a relative path for the 'site-packages' directory where Python packages should be installed is used, rather than an absolute path.
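
As a quick illustration of the difference the 'prefix' argument makes (the exact paths shown will vary with the Python installation):

>>> from distutils.sysconfig import get_python_lib
>>> get_python_lib()
'/usr/lib/python2.7/site-packages'
>>> get_python_lib(prefix='')
'lib/python2.7/site-packages'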

Very important when installing this package, though, is that you cannot use 'easy_install' or 'python setup.py install'. One can only install this package using 'pip'.

The reason for this is that if not using ‘pip’, then the package installation tool can install the package as an egg. In this situation the custom ‘.pth’ file will actually be installed within the egg directory and not actually within the ‘site-packages’ directory.

The only '.pth' file added to the 'site-packages' directory will be the one which maps the 'autowrapt' package into its subdirectory. The 'addsitepackages()' function called from the 'site' module doesn't in turn process '.pth' files contained in a directory added by another '.pth' file, so our custom '.pth' file would be skipped.

When using 'pip', eggs are not used by default and so we are okay.

Also do be aware that this package will not work with ‘buildout’ as it will always install packages as eggs and explicitly sets up the Python module search path itself in any Python scripts installed into the Python installation.

Trying out an example

The actual complete source code for this package can be found at:

The package has also been released on PyPi as ‘autowrapt’ so you can actually try it, and use it if you really want to.

To allow for a quick and easy test that it works, the 'autowrapt' package bundles an example monkey patch. In the above 'setup.py' this was set up by:

entry_points = {'autowrapt.examples': ['this = autowrapt.examples:autowrapt_this']},

This entry point definition names the monkey patch 'autowrapt.examples'. The definition says that when the 'this' module is imported, the monkey patch function 'autowrapt_this()' in the module 'autowrapt.examples' will be called.

So to run the test do:

pip install autowrapt

This should also install the ‘wrapt’ module if you don’t have the required minimum version.

Now run the command line interpreter as normal and at the prompt do:

import this

This should result in the Zen of Python being displayed.

Exit the Python interpreter and now instead run:

AUTOWRAPT_BOOTSTRAP=autowrapt.examples python

This runs the Python interpreter again, but also sets the environment variable 'AUTOWRAPT_BOOTSTRAP' to the value 'autowrapt.examples', matching the name of the entry point defined in the 'setup.py' file for 'autowrapt'.

The actual code for the ‘autowrapt_this()’ function was:

from __future__ import print_function
def autowrapt_this(module):
    print('The wrapt package is absolutely amazing and you should use it.')

so if we now again run:

import this

we should now see an extended version of the Zen of Python.

We didn’t actually monkey patch any code in the target module in this case, but it shows that the monkey patch function was actually triggered when expected.

Other bootstrapping mechanisms

Although this mechanism is reasonably clean and only requires the setting of an environment variable, it cannot be used with 'buildout' as mentioned. For 'buildout' we need to investigate other approaches we could use to achieve the same effect. I will cover such other options in the next blog post on this topic.

Monday, April 6, 2015

Integrating mod_wsgi-express as a Django admin command.

To follow up on my first post introducing mod_wsgi-express, I posted about how to use mod_wsgi-express with Django. This provided us with a workable solution, but did result in us needing to duplicate on the command line some information which was already available within the Django settings file.

The next step therefore was to avoid this requirement by integrating mod_wsgi-express into the Django site itself so that it can be executed as a Django management command. By integrating mod_wsgi-express in this way it can directly interrogate the Django settings module for your Django project to obtain the information it wants.

In short, instead of having to run:

mod_wsgi-express start-server --url-alias /static static --application-type module mysite.wsgi

our goal is to be able to run:

python manage.py runmodwsgi

Updating installed applications

The integration of mod_wsgi-express into the Django project is actually quite a simple matter. This is because the mod_wsgi package implements the Django application abstraction. We therefore need only list the appropriate mod_wsgi sub package in 'INSTALLED_APPS' in the Django settings file.

INSTALLED_APPS = (
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'mod_wsgi.server',
)

In particular we have added ‘mod_wsgi.server’ to ‘INSTALLED_APPS’.

The result of doing this is to extend the Django project such that the ‘runmodwsgi’ command is now available as a management command when running ‘python manage.py’.

When ‘runmodwsgi’ is now run a number of pieces of information are extracted from the Django settings file to allow it to run correctly.

The first is that the value of ‘BASE_DIR’ from the Django settings file is used to determine the working directory for mod_wsgi-express when it is run. This directory is automatically added into the Python module search path so that the Python package corresponding to the site can be found, no matter where ‘manage.py’ was run from.

Next is that the ‘WSGI_APPLICATION’ setting is used to determine the actual true location of the WSGI application entry point.

In our initial use of mod_wsgi-express we relied on the WSGI application entry point being stored in the 'wsgi.py' module file within the Python package for the site. Further, we relied on the WSGI application entry point still being called 'application', which happens to coincide with the default that mod_wsgi-express looks for.

In practice neither of these need be true, as a user may decide to ignore the 'wsgi.py' generated by the 'startproject' management command and either use a different module within the package or change the name of the WSGI application entry point function. If a user had done this and were intending to still use the Django development server, it would have been necessary to update the 'WSGI_APPLICATION' setting. Just as the Django development server uses 'WSGI_APPLICATION', so does the 'runmodwsgi' management command.

Finally, we use the values of 'STATIC_URL' and 'STATIC_ROOT' to determine where any static media files will exist, subsequent to 'collectstatic' having been run, and at what sub URL they should be made available.
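
Pulling that together, the settings consulted by 'runmodwsgi' would, for a project called 'mysite' as generated by 'startproject' (with 'STATIC_ROOT' added as described in the previous post, and relying on the 'import os' already at the top of the generated settings file), look like:

BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

WSGI_APPLICATION = 'mysite.wsgi.application'

STATIC_URL = '/static/'
STATIC_ROOT = os.path.join(BASE_DIR, 'static')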

With this information being automatically picked up, we can now say:

python manage.py runmodwsgi

So we have achieved our goal, but there is at least one more change that you might make to improve the experience when using mod_wsgi-express with the Django site. Or at least when using it in a development environment.

Logging of Python exceptions

When using Django, by default the details of any exceptions occurring in your code when handling a request are not logged in any way. To know about any exceptions occurring you need to update the Django settings file and configure one of a number of different possible ways of capturing them. In a production environment, which one you would use depends on the strategy you are using for monitoring your applications.

The two primary mechanisms often used are to configure Django to send you an email for each exception which occurred, or to simply log the details in the error log for your WSGI server. It is this latter mechanism which we are going to set up here. For this we are going to add to the Django settings file:

LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
        },
    },
    'loggers': {
        'django': {
            'handlers': ['console'],
            'level': os.getenv('DJANGO_LOG_LEVEL', 'INFO'),
        },
    },
}

With this in place the details of any exceptions will be logged to the Apache error log for mod_wsgi-express. The path to the error log will be displayed to the terminal when running mod_wsgi-express so you know where it is.

As the Apache error log is a distinct file, when using mod_wsgi-express during development, the need to consult the separate error log file may be inconvenient. In this case one can instead tell mod_wsgi-express to log all information to the terminal instead.

python manage.py runmodwsgi --log-to-terminal

If you are using mod_wsgi-express in a development environment but are using a different WSGI server in production, even if a normal Apache/mod_wsgi installation, having the exceptions logged to the error log may not always be appropriate. If this is the case and you would only want the exception details logged to the error log when using mod_wsgi-express, you can check for the existence of an environment variable which tells you if mod_wsgi-express is being used.

if os.environ.get('MOD_WSGI_EXPRESS'):
    LOGGING = {
        …
    } 

Use as a development server

Now I keep mentioning things you might want to do in a development environment, but mod_wsgi-express isn’t just about development environments. As Apache/mod_wsgi is used under the hood, it is more than suitable for use in production environments. As is always the case when deploying to a production server, you want to ensure that you do tune the server configuration properly for that environment.

By default mod_wsgi-express will create a single process to run your Django site in, with multithreading being used to handle concurrent requests. The default number of request handler threads used by the process is 5. So when moving to production consideration should be given to adjusting both the number of processes as well as the number of threads.

As well as tuning the capacity of the server in this way, there are various other considerations to be made. One of the primary things is that mod_wsgi-express, when run directly as shown, will actually run the server in the foreground and will not run it as a daemon. This is so that it is convenient to use in a development environment, where shutting down the server is as simple as typing CTRL-C in the terminal window where you ran it.

For production you wouldn't want to use mod_wsgi-express in exactly this way. Instead there is a way of having mod_wsgi-express generate the configuration and a set of management scripts, but not actually run the server. This captures all the information about the site, and later one can use the management script to start/stop the site as necessary from any operating system startup scripts or process management system, as sketched below.
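
A rough sketch of what that looks like (the exact options may differ between versions, so check the output of 'python manage.py runmodwsgi --help') is to generate the configuration into a nominated server root and then use the generated 'apachectl' script to manage the server:

python manage.py runmodwsgi --setup-only --port 80 \
    --user www-data --group www-data \
    --server-root /etc/mysite-80

/etc/mysite-80/apachectl start
/etc/mysite-80/apachectl stop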

Anyway, it is not my aim to talk about moving mod_wsgi-express into production at this point, instead I want to focus more about using it in a development environment and why it can be a better solution for that than the builtin Django development server.

The first reason as noted above is that mod_wsgi-express runs in a multithreaded configuration. In contrast when you use the builtin development server it runs with a single process and a single thread. This means it is too easy to develop code which works fine in the Django development server but then fails in production when using a multithreaded WSGI server.

By using a production capable WSGI server such as mod_wsgi-express during development, where both multi process and multithreaded configurations can be setup, you are able to better replicate the environment you will expect to run under in production. So although mod_wsgi-express does use multithreading by default, to encourage the creation of thread safe code, if you truly did want to avoid multithreading and wanted to use single threaded processes, you could use:

python manage.py runmodwsgi --processes 3 --threads 1

The benefits of using mod_wsgi-express also extend to the serving up of static media files as these are now also being handled by a proper web server. You can therefore get a better idea of how your web application is going to perform in production because you can do load testing against mod_wsgi-express and know that it isn’t going to be too different to what you would expect in a production environment, ignoring of course differences in hardware or operating system.

Reloading of source code

You might be saying at this point though that one of the benefits of using the Django development server is that it offers automatic source code reloading. Well, mod_wsgi-express has that as well.

python manage.py runmodwsgi --reload-on-changes

With the '--reload-on-changes' option to mod_wsgi-express, any code files associated with modules that have been loaded into the running Python web application will be monitored for changes. If any of those files on disk change between the time they were loaded and when the check is made, the web application process(es) will be automatically shutdown and restarted. In this way, when the next request is made it will see the more recent code changes.

Do note though that this doesn’t extend to any changes made to the static media files in their original locations. This is because the running of ‘collectstatic’ resulted in the original files being copied into the common directory so they could be served. As a result, for changes to CSS stylesheets, Javascript and images to be picked up, you will need to run ‘collectstatic’ again. You will not however have to restart mod_wsgi-express, as once ‘collectstatic’ is run, it should then pick up the latest versions.

Even with just the ability to automatically reload on code changes we have already covered what the Django development server is principally relied upon for, but mod_wsgi-express implements a whole bunch of other features as well which go beyond what the builtin development server provides.

These additional features are part of a special debug mode and include builtin support for debugging exceptions using pdb, profiling of code, code coverage and recording of requests, including headers and content. I will cover debug mode of mod_wsgi-express in the next blog post.

Sunday, April 5, 2015

Using mod_wsgi-express with Django.

In my last post I finally officially introduced mod_wsgi-express, an extension to mod_wsgi that I have been working on over the past year and a half. The purpose of mod_wsgi-express is to radically simplify the steps required to deploy a Python WSGI application using Apache and mod_wsgi. In that post I introduced some of the basic functionality of mod_wsgi-express. As Django is the most popular Python web framework, in this post I want to explain what is involved in using mod_wsgi-express with Django.

The structure of a Django project

When starting a new Django project the 'startproject' command is supplied to the 'django-admin' script to create the initial project structure. The directories and files created in the directory where the command is run are:

mysite/manage.py
mysite/mysite/__init__.py
mysite/mysite/settings.py
mysite/mysite/urls.py
mysite/mysite/wsgi.py

The immediate subdirectory created below where the ‘startproject’ command was run, and which contains the ‘manage.py’ script, is referred to as the base directory. The name of the base directory is whatever you called the name of your project.

The ‘manage.py’ file contained in the base directory is a site specific variant of the ‘django-admin’ script, which provides a means of executing various builtin Django admin commands against your site, as well as any admin commands which may later be added to the site by associating add-on applications with it.

In addition to the ‘manage.py’ script a further subdirectory, with the same name as the base directory, will have been created. This directory will contain the actual code for your site. Worth noting is that this directory actually contains an ‘__init__.py’ file. This is important as it actually marks the directory as a Python package, rather than just a collection of Python modules.

This directory also contains two other files which are important and come into play when aiming to deploy your site. These are the 'settings.py' file, which is used to customise the capabilities of your site and how it may be made available, plus the 'wsgi.py' file, which contains the WSGI application entry point which any WSGI server needs to know about. The details of each request received by the WSGI server are passed to the function or other callable object specified as that entry point, to have your Django application serve the request.
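
For reference, the 'wsgi.py' file generated by 'startproject' for a project called 'mysite' looks something like:

import os
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mysite.settings")

from django.core.wsgi import get_wsgi_application
application = get_wsgi_application()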

As far as now using mod_wsgi-express with such a Django project, there are two main things that mod_wsgi-express needs to be told. These are the location of the base directory and the full name of the module containing the WSGI application entry point.

The location of the base directory is important because it is the parent directory for the Python package containing the code for your site. This directory will need to be added to the Python module search path so the Python package containing your site code can be found. In particular we need it to be able to find the ‘wsgi.py’ file when referring to it via its full module name.

So if we were still located in the top level directory where the ‘startproject’ command was run, one above the base directory, we would run ‘mod_wsgi-express’ as:

mod_wsgi-express start-server --working-directory mysite --application-type module mysite.wsgi

If we had instead run mod_wsgi-express inside of the base directory, we could have used just:

mod_wsgi-express start-server --application-type module mysite.wsgi

The latter still works because the working directory will be set to the current directory if none is explicitly supplied.

For a simple hello world program that returns plain text, this is all that is required. If however you also have static media files for CSS stylesheets, Javascript and images, then this is not sufficient. This is readily apparent if you were to visit the ‘/admin’ URL of your site as no styling will be applied to what is displayed in the browser.

To have such styling applied for the ‘/admin’ URL, we also need to tell mod_wsgi-express about the location of any static media files and at what sub URL they need to be made available so that they can be found by any HTTP client.

Hosting of static media files

When using Django’s own builtin development server, it will automatically make available any static media files at the required sub URL. When you use a separate WSGI server however, that automatic mapping and hosting of the static media files will not occur.

It is possible to configure Django to host these static media files itself even in a production setting, but this is suboptimal and not recommended. Instead such static media files should be hosted by a proper web server.

In the case of mod_wsgi-express, because it is actually using Apache underneath, then we already have a true web server available which can be used to host the static media files in a production setting.

Before we can make use of that capability though, we first need to set up the Django project to allow us to easily collect all the static media files together in one location. This is necessary as normally such static media files are spread out across different locations, including as part of Django itself, or any installed add-on applications.

The first thing we therefore need to do is modify the ‘settings.py’ file and add the setting:

STATIC_ROOT = os.path.join(BASE_DIR, 'static')

This is best placed at the end of the ‘settings.py’ file immediately after the existing setting for ‘STATIC_URL’.

What the ‘STATIC_ROOT’ setting does is say that all static media files are to be placed into a directory called ‘static’ located within the base directory.

To actually get the files copied into that location we now need to run:

python manage.py collectstatic

Note that although we run this initially, this command must be run again every time any update is made to static media files located within any add-on applications, or when Django itself is updated to a newer version.

After having run this command, this will now leave us with:

mysite/manage.py
mysite/mysite/__init__.py
mysite/mysite/settings.py
mysite/mysite/urls.py
mysite/mysite/wsgi.py
mysite/static/admin/css/...
mysite/static/admin/img/...
mysite/static/admin/js/...

So the ‘static’ subdirectory has been created, with a further subdirectory containing the static media files for the admin component of Django implementing the ‘/admin’ sub URL.

Now running mod_wsgi-express from within the base directory we would use:

mod_wsgi-express start-server --url-alias /static static --application-type module mysite.wsgi

The new option in this command is '--url-alias'. This option takes two arguments. The first is the sub URL where the static media files are to be made available. This should match the value which the 'STATIC_URL' setting had been set to. The second argument is the location of the directory where the static media files were copied. As we are running this command in the base directory and the location of the static media files is an immediate subdirectory, we can specify this as just the name of the subdirectory.

We should now have a working Django site using Apache/mod_wsgi. We have though had to duplicate certain information on the command line which is actually available in the Django settings file. The next step therefore is to eliminate that requirement by integrating mod_wsgi-express into the Django site itself as a Django admin management command. I will cover how that is done in the next blog post about this topic.

Thursday, April 2, 2015

Introducing mod_wsgi-express.

The Apache/mod_wsgi project is now over 8 years old. Long gone are the days when it was viewed as being the new cool thing to use. These days people seeking a hosting mechanism for Python WSGI applications tend to gravitate to other solutions.

That mod_wsgi is associated with the Apache project doesn't particularly help as Apache is seen as being old and stale. Truth is that the Apache httpd server has never stopped being improved on and is quite a lot better now than it was 8 years ago around the time when mod_wsgi was started.

Even though the Apache httpd server itself has an even longer history going back almost 20 years, it is still the workhorse of the Internet and provides a rock solid platform for hosting web sites. It can still hold its own against competing solutions and for hosting Python WSGI applications using mod_wsgi, is a proven reliable solution.

Now of those 8 years since mod_wsgi was started, there were actually about 3 years where very little development was done on it. This was because personally I got burnt out over the whole WSGI on Python 3 saga. I finally got myself out of that hole about a year and a half ago, and have been working away since on quite a significant number of changes to mod_wsgi that I haven't publicly said much about to date, let alone documented.

It is therefore long overdue to formally introduce one of the projects I have been working on. This project is mod_wsgi-express.

Setting up of mod_wsgi

One of the major bugbears of mod_wsgi has been the perception that it is too hard to setup, especially if building from source code yourself. The task of getting it installed was only slightly easier if you used a pre-built binary package provided by your operating system, but using such a pre-built package could in itself result in a whole host of other problems when it wasn't compiled for the particular version of Python you wanted to use.

With the module at least installed, configuring Apache was no less of a problem, especially on Linux systems which come with a default configuration tailored for static file hosting or PHP.

The end result is that most people walked away with a bad experience and a production system which was operating at a level nowhere near what it was actually capable of. For the case of using Apache/mod_wsgi for development, the need for rapid iteration on changes in an application, and the consequent need to be constantly restarting the web server, made the use of Apache/mod_wsgi seem all too hard.

A large part of what I have been working on for the past year and a half has therefore been about improving that experience. Key was coming up with a system which provided an out of box configuration which was much better suited for Python web applications than the standard Linux defaults, yet was still customisable as necessary to further tune it to suit the specifics of your particular Python web application.

Installation from PyPi using pip

The first major difference with mod_wsgi-express over the traditional path of installing mod_wsgi is that you can install it like any other Python package. In other words, you can 'pip install' it directly from PyPi. You can even list it in a 'requirements.txt' file for 'pip'.

pip install mod_wsgi

If you have a complete Apache httpd server installation on your system then that is all that is required. The resulting mod_wsgi module for Apache will have been compiled against and will be installed as part of your Python installation or virtual environment.

There is more though to mod_wsgi-express than just the ability to easily compile the module for Apache. In addition to compiling the module, a separate script called 'mod_wsgi-express' is installed. It is in this script that all the magic actually occurs.

Before I get onto what exactly the 'mod_wsgi-express' script does, I do want to point out that if for some reason you don't have a complete Apache installation, perhaps because you are missing the development header files required to build Apache modules, or the installed Apache is not the latest recommended version, then that is also covered.

For this case where you also need to be able to install a fresh version of the Apache httpd server itself, you can do:

pip install mod_wsgi-httpd
pip install mod_wsgi 

In this case we are installing two packages. We are first installing 'mod_wsgi-httpd' and then 'mod_wsgi'.

What installation of the 'mod_wsgi-httpd' package from PyPi will do is pull down the source code for the Apache httpd server, as well as other libraries it requires, and automatically compile and install it.

The Apache httpd server is quite a big project and so this will take a little while, but it allows you to ignore the system Apache installation, with the 'mod_wsgi' package, when subsequently installed, detecting the version of Apache installed by 'mod_wsgi-httpd' and using it instead.

Important to note is that installing Apache using 'mod_wsgi-httpd' will not interfere with any existing Apache installation you may have. Like the 'mod_wsgi' package, it will be installed as part of your Python installation or virtual environment.

Hosting the WSGI application

So we have the Apache httpd server installed and the 'mod_wsgi' module for Apache also compiled and installed. We haven't yet, though, configured Apache.

This is where the 'mod_wsgi-express' script comes into play.
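
As a concrete example, a minimal 'hello.wsgi' WSGI script file might contain:

def application(environ, start_response):
    status = '200 OK'
    output = 'Hello World!'

    response_headers = [('Content-Type', 'text/plain'),
                        ('Content-Length', str(len(output)))]
    start_response(status, response_headers)

    return [output]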

With our WSGI application defined in the 'hello.wsgi' script file, all we now need to do is run:

mod_wsgi-express start-server hello.wsgi

Doing this will yield something like:

Server URL       : http://localhost:8000/
Server Root      : /tmp/mod_wsgi-localhost:8000:502
Server Conf      : /tmp/mod_wsgi-localhost:8000:502/httpd.conf
Error Log File   : /tmp/mod_wsgi-localhost:8000:502/error_log (warn)
Request Capacity : 5 (1 process * 5 threads)
Request Timeout  : 60 (seconds)
Queue Backlog    : 100 (connections)
Queue Timeout    : 45 (seconds)
Server Capacity  : 20 (event/worker), 20 (prefork)
Server Backlog   : 500 (connections)
Locale Setting   : en_AU.UTF-8

You can then access the WSGI application at the specified URL, that by default being port 8000 on localhost.
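
For example, from another terminal:

curl http://localhost:8000/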

As to the configuration of Apache, there actually wasn't any.

The key benefit of the 'mod_wsgi-express' script is that it does all the configuration for you, setting up a configuration purpose built for running your specific WSGI application right there on the command line.

Running Apache/mod_wsgi has therefore become as easy as running other pure Python WSGI servers such as gunicorn.

Alternatives to a WSGI script file

Like when using mod_wsgi in Apache in the more traditional approach, the 'mod_wsgi-express' script defaults to requiring a WSGI script file. There are specific reasons, deriving from how Apache works, that a script file path is used rather than a Python module name. There are however also some benefits to how a WSGI script file is used which are lacking when a module name is used.

I'll try to explain those reasons and the benefits another time, but if you really want to use a module name instead, then that is also possible. So if instead of 'hello.wsgi' you actually had 'hello.py', making it a Python module, you could instead run:

mod_wsgi-express start-server --application-type module hello

It is also even possible to provide a Paste 'ini' file as input by specifying the 'paste' application type.

mod_wsgi-express start-server --application-type paste hello.ini

Hosting static file assets

Python web applications are rarely just dynamically generated pages. Instead they are generally accompanied by a bunch of static files for CSS stylesheets, Javascript and images.

This is where 'mod_wsgi-express' being based around mod_wsgi running under Apache brings additional value. That is, the Apache httpd server was primarily intended for serving static files. Even though we are hosting a dynamic Python web application, we can still make use of that capability. This can be done in a few ways.

First up, if all static file assets are to exist at a sub URL of the site, then they can be readily mapped into place using the '--url-alias' option. The arguments to this are the sub URL and then the path to the directory containing the static files.

mod_wsgi-express start-server --url-alias /static ./htdocs/static hello.wsgi

For any site though, there are often special static files which need to exist at the root of the site. These are files such as 'robots.txt' and 'favicon.ico'.

These could be mapped individually using '--url-alias', as it also allows the file system path to be that of a file:

mod_wsgi-express start-server --url-alias /static ./htdocs/static \
--url-alias /favicon.ico ./htdocs/favicon.ico \
--url-alias /robots.txt ./htdocs/robots.txt hello.wsgi

A better alternative though is to simply place all the files in the one directory, here called 'htdocs', with their locations matching the URLs they should appear at, and declare that as the document root.

mod_wsgi-express start-server --document-root ./htdocs hello.wsgi

If you are a long time mod_wsgi user you may be familiar with the problem that mounting a WSGI application at the root of the site actually hides any static files that exist in the document root for the server. In the case of mod_wsgi-express though, specific Apache configuration is used such that any static files in the directory will actually overlay and take precedence over the WSGI application.

Thus if a URL matches a static file in the document directory the static file will be served up, otherwise the request will be passed on as normal to the WSGI application. Addition of new static file assets is therefore as simple as dropping them into the document directory with a path matching the URL it is to be available at.

By using Apache/mod_wsgi we therefore get the best of both worlds. A performant way of serving up static file assets as well as the dynamic Python web application.

This is something you don't get from a pure Python WSGI server such as gunicorn. For gunicorn you would have to use a Python WSGI middleware to intercept requests and map them to any static files. This is in contrast to using Apache where handling of static file assets is all done in C code by Apache below the level that the Python interpreter would even be involved.

Hosting just static files

Since mod_wsgi-express actually provides such a convenient way of hosting static files, there is even a mode which allows you to say that you aren't actually wanting to run a Python web application at all, and only want to host static files.

Thus, instead of the quick command often used by Python users to run up a server to temporarily host some static files:

python -m SimpleHTTPServer

you can with mod_wsgi-express do:

mod_wsgi-express start-server --application-type static --document-root .

You are therefore running a production grade server for the task rather than the Python SimpleHTTPServer implementation.

This may not seem a big deal, but can be very convenient where you also need to be able to use a secure HTTP connection, or even use client certificates to control access to the files. These are things that you cannot do with SimpleHTTPServer, but can do with mod_wsgi-express.

And much much more

This only starts to scratch the surface of what one can do with mod_wsgi-express and what sort of configurability it provides. In future posts I will talk about other features of mod_wsgi-express, including using it to run a secure HTTP server, using it as a development server, as well as how to set it up for use in production environments, taking over from the normal Apache installation.

If you want to play with mod_wsgi-express and get a head start on what some of its other bundled capabilities are, then you can run the command:

mod_wsgi-express start-server --help

Also check out the PyPi page for 'mod_wsgi' at:

If you have any questions about mod_wsgi-express, use the mod_wsgi mailing list to get help.

Saturday, March 21, 2015

Generating full stack traces for exceptions in Python.

My last few blog posts have been about monkey patching. I will be continuing with more posts on that topic, but in this post I am going to take a slight detour to talk about a problem related to capturing stack trace information for exceptions in Python applications.

This does still have some relevance to monkey patching, in as much as one of the reasons you may want to monkey patch a Python application is to add in wrappers which will intercept the details of an exception raised deep within an application. You might want to do this for the case where the exception would otherwise be captured by an application and translated into a different exception, with loss of information about the inner exception, or in the case of a web application, result in a generic HTTP 500 web page response with no useful information captured. So monkey patching can be a useful debugging tool where it may not be convenient to modify the original source code of the application.

Tracebacks for exceptions

To get started, let's consider the following Python script:

def function1():
    raise RuntimeError('xxx')
def function2():
    function1()
def function3():
    function2()
def function4():
    function3()
def function5():
    function4()
function5()

If we run this script we will get:

Traceback (most recent call last):
  File "generate-1.py", line 16, in <module>
    function5()
  File "generate-1.py", line 14, in function5
    function4()
  File "generate-1.py", line 11, in function4
    function3()
  File "generate-1.py", line 8, in function3
    function2()
  File "generate-1.py", line 5, in function2
    function1()
  File "generate-1.py", line 2, in function1
    raise RuntimeError('xxx')
RuntimeError: xxx

In this case we have an exception occurring which was never actually caught within the script itself and is propagated all the way up to the top level, causing the script to be terminated and the traceback printed.

When the traceback was printed, it showed all stack frames from the top level all the way down to the point where the exception occurred.

Now consider the Python script:

import traceback
def function1():
    raise RuntimeError('xxx')
def function2():
    function1()
def function3():
    try:
      function2()
    except Exception:
      traceback.print_exc()
def function4():
    function3()
def function5():
    function4()
function5()

In this script we have a 'try/except' block half way down the sequence of calls. The call to 'function2()' is made within the 'try' block and when the exception is raised, it is handled within the 'except' block. At that point we use the 'traceback.print_exc()' function to output the details of the exception, but then let the script continue on normally to completion.

For this Python script the output is:

Traceback (most recent call last):
  File "generate-2.py", line 11, in function3
    function2()
  File "generate-2.py", line 7, in function2
    function1()
  File "generate-2.py", line 4, in function1
    raise RuntimeError('xxx')
RuntimeError: xxx

What you see here though is that we lose information about the outer stack frames for the sequence of calls that led down to the point where the 'try/except' block existed.

When we want to capture details of an exception for logging purposes so as to later debug an issue, this loss of information can make it harder to debug a problem if the function containing the 'try/except' block could be called from multiple places.

How then can we capture the outer stack frames so we have that additional context?

Capturing the current stack

There are a number of ways of obtaining information about the current stack. If we are just wanting to dump out the current stack to a log then we can use 'traceback.print_stack()'.

import traceback
def function1():
    raise RuntimeError('xxx')
def function2():
    function1()
def function3():
    try:
        function2()
    except Exception:
        traceback.print_stack()
        print '--------------'
        traceback.print_exc()
def function4():
    function3()
def function5():
    function4()
function5()

Run this variant of the Python script and we now get:

  File "generate-3.py", line 23, in <module>
    function5()
  File "generate-3.py", line 21, in function5
    function4()
  File "generate-3.py", line 18, in function4
    function3()
  File "generate-3.py", line 13, in function3
    traceback.print_stack()
--------------
Traceback (most recent call last):
  File "generate-3.py", line 11, in function3
    function2()
  File "generate-3.py", line 7, in function2
    function1()
  File "generate-3.py", line 4, in function1
    raise RuntimeError('xxx')
RuntimeError: xxx

So we now have the inner stack frames corresponding to the exception traceback, as well as those outer stack frames corresponding to the current stack. From this we can presumably now join these two sets of stack frames together and get a complete stack trace for where the exception occurred.

If you look closely though you may notice something: there is actually an overlap in the stack frames which are shown for each, plus the function we called to print the current stack is itself shown.

In the case of the overlap the issue is that in the inner stack frames from the traceback, it shows an execution point in 'function3()' of line 11. This corresponds to the point where 'function2()' was called within the 'try' block and in which the exception occurred.

At the same time, the outer stack frames from the current execution stack show line 13 in 'function3()', which is the point within the 'except' block where we called 'traceback.print_stack()' to display the current stack.

So the top most stack frame from the traceback is actually what we want, and we would need to ignore the bottom most two stack frames from the current stack if we were to join these together.

Now although the output of these two functions can be directed to any file like object, and thus an instance of 'StringIO' could be used to capture the output, we would still need to break apart the formatted text output, drop certain parts and rearrange others to get the final desired result.
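
A minimal sketch of that capture approach, using the Python 2 'StringIO' module to match the other examples here, would be:

import StringIO
import traceback

try:
    raise RuntimeError('xxx')
except Exception:
    buf = StringIO.StringIO()
    traceback.print_exc(file=buf)
    details = buf.getvalue()  # formatted text, which still needs parsing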

Dealing with such pre-formatted output could therefore be a pain, especially if what we really wanted was the raw information about the filename, line number, function and potentially the code snippet. What other options therefore exist for getting such raw information?

Using the inspect module

When needing to perform introspection or otherwise derive information about Python objects, the module you want to use is the 'inspect' module. For the case of getting information about the current exception and current stack, the two functions you can use are 'inspect.trace()' and 'inspect.stack()'. Using these we can rewrite our Python script as:

import inspect
def function1():
    raise RuntimeError('xxx')
def function2():
    function1()
def function3():
    try:
        function2()
    except Exception:
        for item in reversed(inspect.stack()):
            print item[1:]
        print '--------------'
        for item in inspect.trace():
            print item[1:]
def function4():
    function3()
def function5():
    function4()
function5()

This time we get:

('generate-4.py', 25, '<module>', ['function5()\n'], 0)
('generate-4.py', 23, 'function5', [' function4()\n'], 0)
('generate-4.py', 20, 'function4', [' function3()\n'], 0)
('generate-4.py', 13, 'function3', [' for item in reversed(inspect.stack()):\n'], 0)
--------------
('generate-4.py', 11, 'function3', [' function2()\n'], 0)
('generate-4.py', 7, 'function2', [' function1()\n'], 0)
('generate-4.py', 4, 'function1', [" raise RuntimeError('xxx')\n"], 0)

So these functions provide us with the raw information rather than pre formatted text, thus making it easier to process. For each stack frame we also get a reference to the frame object itself, but since we didn't care about that we skipped it when displaying each frame.

Because we might want to generate such a combined stack trace in multiple places, we obviously separate this out into a function of its own.

import inspect

def print_full_stack():
    print 'Traceback (most recent call last):'
    for item in reversed(inspect.stack()[2:]):
        print ' File "{1}", line {2}, in {3}\n'.format(*item),
        for line in item[4]:
            print ' ' + line.lstrip(),
    for item in inspect.trace():
        print ' File "{1}", line {2}, in {3}\n'.format(*item),
        for line in item[4]:
            print ' ' + line.lstrip(),
def function1():
    raise RuntimeError('xxx')
def function2():
    function1()
def function3():
    try:
        function2()
    except Exception:
        print_full_stack()
def function4():
    function3()
def function5():
    function4()
function5()

The final result would now be:

Traceback (most recent call last):
  File "generate-5.py", line 32, in <module>
    function5()
  File "generate-5.py", line 30, in function5
    function4()
  File "generate-5.py", line 27, in function4
    function3()
  File "generate-5.py", line 22, in function3
    function2()
  File "generate-5.py", line 18, in function2
    function1()
  File "generate-5.py", line 15, in function1
    raise RuntimeError('xxx')

Using the exception traceback

We are done, right? No.

In this case we have used functions from the 'inspect' module that rely on being called directly from within the 'except' block.

That is, for generating the outer stack frames for the current stack we always assume that we need to drop two stack frames from the result of calling 'inspect.stack()'.

For the inner stack frames from the exception, the 'inspect.trace()' function relies on there being an exception which is currently being handled.

That we are assuming we should skip two stack frames for the current stack is a little bit fragile. For example, consider the case where we don't actually call 'print_full_stack()' within the 'except' block itself.

def function1():
    raise RuntimeError('xxx')
def function2():
    function1()
def function3a():
    print_full_stack()
def function3():
    try:
        function2()
    except Exception:
        function3a()
def function4():
    function3()
def function5():
    function4()
function5()

The result here is:

Traceback (most recent call last):
  File "generate-6.py", line 35, in <module>
    function5()
  File "generate-6.py", line 33, in function5
    function4()
  File "generate-6.py", line 30, in function4
    function3()
  File "generate-6.py", line 27, in function3
    function3a()
  File "generate-6.py", line 25, in function3
    function2()
  File "generate-6.py", line 18, in function2
    function1()
  File "generate-6.py", line 15, in function1
    raise RuntimeError('xxx')

As can be seen, we actually end up with an additional stack frame being inserted corresponding to 'function3a()' which we called within the 'except' block and which in turn called 'print_full_stack()'.

To ensure we do the right thing here we need to look at what 'inspect.stack()' and 'inspect.trace()' actually do.

def stack(context=1):
    """Return a list of records for the stack above the caller's frame."""
    return getouterframes(sys._getframe(1), context)
def trace(context=1):
    """Return a list of records for the stack below the current exception."""
    return getinnerframes(sys.exc_info()[2], context)

So the problem we have with the extra stack frame is that 'inspect.stack()' uses 'sys._getframe()' to grab the current stack. This is correct and what it is intended to do, but not really what we want. What we instead want is the outer stack frames corresponding to where the exception was caught.

As it turns out this is available as an attribute on the traceback object for the exception called 'tb_frame'. Learning from how these two functions are implemented, we can therefore change our function to print the full stack.

import sys
import inspect
def print_full_stack(tb=None):
    if tb is None:
        tb = sys.exc_info()[2]
    print 'Traceback (most recent call last):'
    for item in reversed(inspect.getouterframes(tb.tb_frame)[1:]):
        print '  File "{1}", line {2}, in {3}\n'.format(*item),
        for line in item[4]:
            print '    ' + line.lstrip(),
    for item in inspect.getinnerframes(tb):
        print '  File "{1}", line {2}, in {3}\n'.format(*item),
        for line in item[4]:
            print '    ' + line.lstrip(),
def function1():
    raise RuntimeError('xxx')
def function2():
    function1()
def function3a():
    print_full_stack()
def function3():
    try:
        function2()
    except Exception:
        function3a()
def function4():
    function3()
def function5():
    function4()
function5()

We are now back to the desired result.

Traceback (most recent call last):
  File "generate-7.py", line 39, in <module>
    function5()
  File "generate-7.py", line 37, in function5
    function4()
  File "generate-7.py", line 34, in function4
    function3()
  File "generate-7.py", line 29, in function3
    function2()
  File "generate-7.py", line 22, in function2
    function1()
  File "generate-7.py", line 19, in function1
    raise RuntimeError('xxx')

Using a saved traceback

In making this last modification we actually implemented 'print_full_stack()' to optionally accept an existing traceback. If none was supplied then we would instead use the traceback for the current exception being handled.

It is likely a rare situation where it would be required, but this allows one to pass in a traceback object which had been saved away and retained beyond the life of the 'try/except' block which generated it.

Be aware though that doing this can generate some surprising results.

def function1():
    raise RuntimeError('xxx')
def function2():
    function1()
def function3():
    try:
        function2()
    except Exception:
        return sys.exc_info()[2]
def function4():
    tb = function3()
    print_full_stack(tb)
def function5():
    function4()
function5()

In this case we return the traceback to an outer scope and only within that outer function attempt to print out the full stack for the exception.

Traceback (most recent call last):
  File "generate-8.py", line 37, in <module>
    function5()
  File "generate-8.py", line 35, in function5
    function4()
  File "generate-8.py", line 32, in function4
    print_full_stack(tb)
  File "generate-8.py", line 26, in function3
    function2()
  File "generate-8.py", line 22, in function2
    function1()
  File "generate-8.py", line 19, in function1
    raise RuntimeError('xxx')

The problem here is that in 'function4()', rather than seeing the line where the call to 'function3()' was made, we see the line where the call to 'print_full_stack()' was made.

The reason for this is that although the traceback contains a snapshot of information from the current stack at the time of the exception, this only extends back as far as the 'try/except' block.

When we are accessing 'tb.tb_frame' and getting the outer frames, it is still accessing potentially active stack frames for any currently executing code.

So what has happened is that in looking at the stack frame for 'function4()' it is picking up the current execution line number at that point in time, which has shifted from when the original exception occurred. That is, control returned back into 'function4()' and execution progressed to the next line, the call to 'print_full_stack()'.
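
The shifting of the line number can be seen in isolation with a few lines of code. This is just a minimal sketch to illustrate the behaviour: the traceback remembers the line on which the exception passed through the frame, but the frame object it references is live and its current line number keeps moving.

import sys
def demo():
    try:
        raise RuntimeError('xxx')
    except Exception:
        tb = sys.exc_info()[2]
    # The traceback records the line at which the exception passed
    # through this frame, which is the line of the 'raise' above.
    print tb.tb_lineno
    # The frame referenced by the traceback is still executing, so
    # its current line number is that of this 'print' statement.
    print tb.tb_frame.f_lineno
demo()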

Although the ability to print the full stack trace for the exception is useful, it is only reliable if called within the same function as the 'try/except' block. If you return the traceback object to an outer function and then try to produce the full stack, the line number information in the outer stack frames can be wrong, due to the execution point within those functions shifting if subsequent code in them has been executed since the exception occurred.

If anyone knows a way around this, beyond creating a snapshot of the full stack within the same function as where the 'try/except' occurred, I would be interested to hear how. My current understanding is that there isn't any way and it is just a limitation one has to live with in that case.
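
For reference, the closest thing to such a snapshot that I can suggest is a sketch along the following lines, where instead of returning the traceback object itself, we use the standard 'traceback' module to format the full stack as text while still inside the 'except' block, and pass that string around instead.

import sys
import traceback
def capture_full_stack():
    # Must be called directly from within the 'except' block, while
    # the outer stack frames still reflect the point of the exception.
    tb = sys.exc_info()[2]
    # Frames outside of the 'try/except', oldest first. Start at
    # 'f_back' as the traceback already covers the current frame.
    outer = traceback.format_stack(tb.tb_frame.f_back)
    # Frames from the 'try/except' down to where the exception was
    # raised.
    inner = traceback.format_tb(tb)
    return 'Traceback (most recent call last):\n' + ''.join(outer + inner)
def function1():
    raise RuntimeError('xxx')
def function2():
    function1()
def function3():
    try:
        function2()
    except Exception:
        return capture_full_stack()
def function4():
    full_stack = function3()
    print full_stack,
def function5():
    function4()
function5()

Because the text is generated at the time of the exception, it can safely be passed to an outer function, or logged much later, without the line numbers shifting.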

Wednesday, March 18, 2015

Ordering issues when monkey patching in Python.

In my recent post about safely applying monkey patches in Python, I mentioned how one of the issues that arises is when a monkey patch is applied. Specifically, if the module you need to monkey patch has already been imported and was being used by other code, it could have created a local reference to a target function you wish to wrap in its own namespace. So although your monkey patch would work fine where the original function was used directly from the module, it would not cover where it was used via a local reference.

Coincidentally, Ned Batchelder recently posted about using monkey patching to debug an issue where temporary directories were not being cleaned up properly. Ned described this exact issue in relation to wanting to monkey patch the 'mkdtemp()' function from the 'tempfile' module. In that case he was able to find an alternate place within the private implementation for the module to patch so as to avoid the problem. Using some internal function like this may not always be possible however.

What I want to start discussing with this post is mechanisms one can use from wrapt to deal with this issue of ordering. A major part of the solution is what are called post import hooks. This is a mechanism which was described in PEP 369 and although it never made it into the Python core, it is still possible to graft this ability into Python using existing APIs. From this we can then add additional capabilities for discovering monkey patching code and automatically apply it when modules are imported, before other modules get the module and so before they can create a reference to a function in their own namespace.

Post import hook mechanism

In PEP 369, a primary use case presented was illustrated by the example:

import imp

@imp.when_imported('decimal')
def register(decimal):
    Inexact.register(decimal.Decimal)

The basic idea is that when this code was seen, it would cause a callback to be registered within the Python import system such that when the 'decimal' module was imported, the 'register()' function which the decorator had been applied to would be called. The argument to the 'register()' function would be the reference to the module the registration had been made against. The function could then perform some action against the module before it was returned to whatever code originally requested the import.

Instead of using the '@imp.when_imported' decorator, one could also explicitly use the 'imp.register_post_import_hook()' function to register a post import hook.

import imp

def register(decimal):
    Inexact.register(decimal.Decimal)

imp.register_post_import_hook(register, 'decimal')

Although PEP 369 was never incorporated into Python, the wrapt module provides implementations for both the decorator and the function, but within the 'wrapt' module rather than 'imp'.
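
With wrapt then, the same example reads (the names are unchanged, just imported from 'wrapt'):

from wrapt import when_imported

@when_imported('decimal')
def register(decimal):
    Inexact.register(decimal.Decimal)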

Now what neither the decorator nor the function alone really solved was the ordering issue. That is, you still had the problem that these could be triggered after the target module had already been imported. In that case the post import hook function would still be called, albeit, for our purposes, too late to get in before the reference to the function we want to monkey patch had been created in a different namespace.

The simplest solution to this problem is to modify the main Python script for your application and set up all the post import hook registrations you need as the absolute first thing that is done. That is, before any other modules are imported from your application, or even modules from the standard library used to parse any command line arguments.
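
For example, the very first lines of the application's main script might look like the following, where 'my_patches' and its 'apply_patch()' function are hypothetical stand-ins for whatever module actually contains your monkey patching code.

# main.py
# Register all post import hooks before importing anything else, even
# the standard library modules used to parse command line arguments.
from wrapt import register_post_import_hook
from my_patches import apply_patch
register_post_import_hook(apply_patch, 'tempfile')

# Only now do we import everything else and start the application.
import optparse
def main():
    parser = optparse.OptionParser()
    options, args = parser.parse_args()
if __name__ == '__main__':
    main()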

Even if you are able to do this, because the registration functions require an actual callable, it does mean you are preloading all of the code to perform the monkey patches. This could be a problem if that code in turn had to import further modules, as the state of your application may not yet have been set up such that those imports would succeed.

They say though that one level of indirection can solve all problems, and this is an example of where that principle can be applied. That is, rather than import the monkey patching code up front, you can set up a registration which will lazily load the monkey patching code itself only if the module to be patched is imported, and then execute it.

import sys
from wrapt import register_post_import_hook
def load_and_execute(name):
    def _load_and_execute(target_module):
        __import__(name)
        patch_module = sys.modules[name]
        getattr(patch_module, 'apply_patch')(target_module)
    return _load_and_execute
register_post_import_hook(load_and_execute('patch_tempfile'), 'tempfile')

In the module file 'patch_tempfile.py' we would now have:

from wrapt import wrap_function_wrapper
def _mkdtemp_wrapper(wrapped, instance, args, kwargs):
    print 'calling', wrapped.__name__
    return wrapped(*args, **kwargs)
def apply_patch(module):
    print 'patching', module.__name__
    wrap_function_wrapper(module, 'mkdtemp', _mkdtemp_wrapper)

Running the first script with the interactive interpreter, so as to be left in the interpreter afterwards, we can then show what happens when we import the 'tempfile' module and execute the 'mkdtemp()' function.

$ python -i lazyloader.py
>>> import tempfile
patching tempfile
>>> tempfile.mkdtemp()
calling mkdtemp
'/var/folders/0p/4vcv19pj5d72m_bx0h40sw340000gp/T/tmpfB8r20'

In other words, unlike how most monkey patching is done, we aren't forcibly importing a module in order to apply the monkey patches on the basis it might be used. Instead the monkey patching code stays dormant and unused until the target module is later imported. If the target module is never imported, the monkey patch code for that module is itself not even imported.

Discovery of post import hooks

Post import hooks as described provide a slightly better way of setting up monkey patches so they are applied. This is because they are only activated if the target module containing the function to be patched is actually imported. This avoids unnecessarily importing modules you may not even use, which would otherwise increase the memory usage of your application.

Ordering is still important though, so you need to ensure that any post import hook registrations are set up before any other modules are imported. You also need to modify your application code every time you want to change what monkey patches are applied. This latter point could be inconvenient if you only want to add monkey patches infrequently for the purposes of debugging issues.

A solution to the latter issue is to separate out monkey patches into separately installed modules and use a registration mechanism to announce their availability. Python applications could then have common boilerplate code executed at the very start which discovers, based on supplied configuration, what monkey patches should be applied. The registration mechanism would then allow the monkey patch modules to be discovered at runtime.

One particular registration mechanism which can be used here is 'setuptools' entry points. Using this we can package up monkey patches so they could be separately installed ready for use. The structure of such a package would be:

setup.py
src/__init__.py
src/tempfile_debugging.py

The 'setup.py' file for this package will be:

from setuptools import setup
NAME = 'wrapt_patches.tempfile_debugging'
def patch_module(module, function=None):
    function = function or 'patch_%s' % module.replace('.', '_')
    return '%s = %s:%s' % (module, NAME, function)
ENTRY_POINTS = [
    patch_module('tempfile'),
]
setup_kwargs = dict(
    name = NAME,
    version = '0.1',
    packages = ['wrapt_patches'],
    package_dir = {'wrapt_patches': 'src'},
    entry_points = { NAME: ENTRY_POINTS },
)
setup(**setup_kwargs)

As a convention, so that our monkey patch modules are easily identifiable, we use a namespace package. The parent package in this case will be 'wrapt_patches' since we are working with wrapt specifically.

The name for this specific package will be 'wrapt_patches.tempfile_debugging' as the theoretical intent is that we are going to create some monkey patches to help us debug use of the 'tempfile' module, along the lines of what Ned described in his blog post.

The key part of the 'setup.py' file is the definition of the 'entry_points'. This will be set to a dictionary mapping the package name to a list of definitions listing what Python modules this package contains monkey patches for.
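
For the single 'tempfile' entry above, the 'patch_module()' helper therefore generates the entry point definition:

tempfile = wrapt_patches.tempfile_debugging:patch_tempfile

That is, when the 'tempfile' module is imported, the 'patch_tempfile()' function of the 'wrapt_patches.tempfile_debugging' module is the hook to be called for it.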

The 'src/__init__.py' file will then contain:

import pkgutil
__path__ = pkgutil.extend_path(__path__, __name__)

as is required when creating a namespace package.

Finally, the monkey patches will actually be contained in 'src/tempfile_debugging.py' and for now is much like what we had before.

from wrapt import wrap_function_wrapper
def _mkdtemp_wrapper(wrapped, instance, args, kwargs):
    print 'calling', wrapped.__name__
    return wrapped(*args, **kwargs)
def patch_tempfile(module):
    print 'patching', module.__name__
    wrap_function_wrapper(module, 'mkdtemp', _mkdtemp_wrapper)

With the package defined we would install it into the Python installation or virtual environment being used.
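
For example, from within the top level directory of the package source:

$ python setup.py install

or equivalently 'pip install .' if using pip, either into the main Python installation or into an activated virtual environment.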

In place now of the explicit registrations which we previously added at the very start of the Python application main script file, we would instead add:

import os
from wrapt import discover_post_import_hooks
patches = os.environ.get('WRAPT_PATCHES')
if patches:
    for name in patches.split(','):
        name = name.strip()
        if name:
            print 'discover', name
            discover_post_import_hooks(name)

If we were to run the application with no specific configuration to enable the monkey patches then nothing would happen. If however they were enabled, then they would be automatically discovered and applied as necessary.

$ WRAPT_PATCHES=wrapt_patches.tempfile_debugging python -i entrypoints.py
discover wrapt_patches.tempfile_debugging
>>> import tempfile
patching tempfile

What would be ideal is if, were PEP 369 ever to make it into the core of Python, a similar bootstrapping mechanism were incorporated into Python itself, so that it was possible to force registration of monkey patches very early during interpreter initialisation. With that in place we would have a guaranteed way of addressing the ordering issue when doing monkey patching.

As that doesn't exist right now, what we did in this case was modify our Python application to add the bootstrap code ourselves. This is fine where you control the Python application you want to potentially apply monkey patches to, but what if you wanted to monkey patch a third party application and didn't want to have to modify its code? What are the options in that case?

As it turns out there are some tricks that can be used in that case. I will discuss such options for monkey patching a Python application you can't actually modify in my next blog post on this topic of monkey patching.

Thursday, March 12, 2015

Using wrapt to support testing of software.

When talking about unit testing in Python, one of the more popular packages used to assist in that task is the Mock package. I will no doubt be labelled a heretic, but when I have tried to use it, it just doesn't seem to sit right with my way of thinking.

It may also just be that what I am trying to apply it to isn't a good fit. In what I want to test it usually isn't so much that I want to mock out lower layers, but more that I simply want to validate data being passed through to the next layer or otherwise modify results. In other words I usually still need the system as a whole to function end to end and possibly over an extended time.

So for the more complex testing I need to do I actually keep falling back on the monkey patching capabilities of wrapt. It may well just be that since I wrote wrapt that I am more familiar with its paradigm, or that I prefer the more explicit way that wrapt requires you to do things. Either way, for me at least wrapt helps me to get the job done quicker.

To explain a bit more about the monkey patching capabilities of wrapt, I am in this blog post going to show how some of the things you can do in Mock you can do with wrapt. Just keep in mind that I am an absolute novice when it comes to Mock and so I could also just be too dumb to understand how to use it properly for what I want to do easily.

Return values and side effects

If you are using Mock and want to temporarily override the value returned by a method of a class when called, one way is to use:

from mock import Mock, patch
class ProductionClass(object):
    def method(self, a, b, c, key):
        print a, b, c, key
@patch(__name__+'.ProductionClass.method', return_value=3)
def test_method(mock_method):
    real = ProductionClass()
    result = real.method(3, 4, 5, key='value')
    mock_method.assert_called_with(3, 4, 5, key='value')
    assert result == 3

With what I have presented so far of the wrapt package, an equivalent way of doing this would be:

from wrapt import patch_function_wrapper
class ProductionClass(object):
    def method(self, a, b, c, key):
        print a, b, c, key
@patch_function_wrapper(__name__, 'ProductionClass.method')
def wrapper(wrapped, instance, args, kwargs):
    assert args == (3, 4, 5) and kwargs.get('key') == 'value'
    return 3
def test_method():
    real = ProductionClass()
    result = real.method(3, 4, 5, key='value')
    assert result == 3

An issue with this though is that the 'wrapt.patch_function_wrapper()' function I previously described applies a permanent patch. This is okay where it does need to survive for the life of the process, but in the case of testing we usually want to only have a patch apply to the single unit test function being run at that time. So the patch should be removed at the end of that test and before the next function is called.

For that scenario, the wrapt package provides an alternate decorator, '@wrapt.transient_function_wrapper'. This creates a wrapper that will be applied only for the duration of any call to the function the resulting decorator is applied to. We can therefore write the above as:

from wrapt import transient_function_wrapper
class ProductionClass(object):
    def method(self, a, b, c, key):
        print a, b, c, key
@transient_function_wrapper(__name__, 'ProductionClass.method')
def apply_ProductionClass_method_wrapper(wrapped, instance, args, kwargs):
    assert args == (3, 4, 5) and kwargs.get('key') == 'value'
    return 3
@apply_ProductionClass_method_wrapper
def test_method():
    real = ProductionClass()
    result = real.method(3, 4, 5, key='value')
    assert result == 3

Although this example shows how to return a substitute for the method being called, the more typical case is that I still want to call the original wrapped function, perhaps validating the arguments being passed in, or the return value being passed back from the lower layers.

For this blog post when I tried to work out how to do that with Mock the general approach I came up with was the following.

from mock import Mock, patch
class ProductionClass(object):
    def method(self, a, b, c, key):
        print a, b, c, key
def wrapper(wrapped):
    def _wrapper(self, *args, **kwargs):
        assert args == (3, 4, 5) and kwargs.get('key') == 'value'
        return wrapped(self, *args, **kwargs)
    return _wrapper
@patch(__name__+'.ProductionClass.method', autospec=True,
        side_effect=wrapper(ProductionClass.method))
def test_method(mock_method):
    real = ProductionClass()
    result = real.method(3, 4, 5, key='value')

There were two tricks here. The first is the 'autospec=True' argument to 'mock.patch()' to have it perform method binding, and the second is the need to capture the original method from 'ProductionClass' before any mock had been applied to it, so I could in turn call it when the side effect function for the mock was called.

No doubt someone will tell me that I am doing this all wrong and there is a simpler way, but that is the best I could come up with after 10 minutes of reading the Mock documentation.

When using wrapt to do the same thing, what is used is little different from what was used when mocking the return value. This is because the wrapt function wrappers work with both normal functions and methods, and so nothing special has to be done when wrapping methods. Further, when the wrapt wrapper function is called, it is always passed the original function which was wrapped, so no magic is needed to stash that away.

from wrapt import transient_function_wrapper
class ProductionClass(object):
    def method(self, a, b, c, key):
        print a, b, c, key
@transient_function_wrapper(__name__, 'ProductionClass.method')
def apply_ProductionClass_method_wrapper(wrapped, instance, args, kwargs):
    assert args == (3, 4, 5) and kwargs.get('key') == 'value'
    return wrapped(*args, **kwargs)
@apply_ProductionClass_method_wrapper
def test_method():
    real = ProductionClass()
    result = real.method(3, 4, 5, key='value')

Using this ability to easily intercept a call to perform validation of data being passed, but still call the original, I can relatively easily create a whole bunch of decorators for performing validation on data as it is passed through different parts of the system. I can then stack up these decorators on any test function that needs them, as sketched below.
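
As a sketch of that idea, where the validator names and the checks they perform are simply made up for illustration, each validation rule lives in its own transient wrapper and any number of them can then be stacked on a test:

from wrapt import transient_function_wrapper
class ProductionClass(object):
    def method(self, a, b, c, key):
        print a, b, c, key
@transient_function_wrapper(__name__, 'ProductionClass.method')
def validate_method_args(wrapped, instance, args, kwargs):
    # Validate the data being passed in, then still call the original.
    assert args == (3, 4, 5) and kwargs.get('key') == 'value'
    return wrapped(*args, **kwargs)
@transient_function_wrapper(__name__, 'ProductionClass.method')
def validate_method_result(wrapped, instance, args, kwargs):
    # Validate what the lower layer returns before passing it back.
    result = wrapped(*args, **kwargs)
    assert result is None
    return result
@validate_method_args
@validate_method_result
def test_method():
    real = ProductionClass()
    result = real.method(3, 4, 5, key='value')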

Wrapping of return values

The above recipes cover being able to return a fake return value, returning the original, or some slight modification of the original where it is some primitive data type or collection. In some cases though I actually want to put a wrapper around the return value to modify how subsequent code interacts with it. 

The first example of this is where the wrapped function returns another function which would then be called by something higher up the call chain. Here I may want to put a wrapper around the returned function to allow me to then intercept when it is called.

In the case of using Mock I would do something like:

from mock import Mock, patch
def function():
    pass
class ProductionClass(object):
    def method(self, a, b, c, key):
        return function
def wrapper2(wrapped):
    def _wrapper2(*args, **kwargs):
        return wrapped(*args, **kwargs)
    return _wrapper2
def wrapper1(wrapped):
    def _wrapper1(self, *args, **kwargs):
        func = wrapped(self, *args, **kwargs)
        return Mock(side_effect=wrapper2(func))
    return _wrapper1
@patch(__name__+'.ProductionClass.method', autospec=True,
        side_effect=wrapper1(ProductionClass.method))
def test_method(mock_method):
    real = ProductionClass()
    func = real.method(3, 4, 5, key='value')
    result = func()

And with wrapt I would instead do:

from wrapt import transient_function_wrapper, function_wrapper
def function():
    pass
class ProductionClass(object):
    def method(self, a, b, c, key):
        return function
@function_wrapper
def result_function_wrapper(wrapped, instance, args, kwargs):
    return wrapped(*args, **kwargs)
@transient_function_wrapper(__name__, 'ProductionClass.method')
def apply_ProductionClass_method_wrapper(wrapped, instance, args, kwargs):
    return result_function_wrapper(wrapped(*args, **kwargs))
@apply_ProductionClass_method_wrapper
def test_method():
    real = ProductionClass()
    func = real.method(3, 4, 5, key='value')
    result = func()

In this example I have used a new decorator called '@wrapt.function_wrapper'. I could also have used '@wrapt.decorator' in this example. The '@wrapt.function_wrapper' decorator is actually just a cut down version of '@wrapt.decorator', lacking some of the bells and whistles that one doesn't generally need when doing explicit monkey patching, but otherwise it can be used in the same way.
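
For comparison, the result wrapper from the example above written using '@wrapt.decorator' would be:

import wrapt
@wrapt.decorator
def result_function_wrapper(wrapped, instance, args, kwargs):
    return wrapped(*args, **kwargs)

and it would be applied to the returned function in exactly the same way.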

I can therefore apply a wrapper around a function returned as a result. I could even apply the same principle where a function is being passed in as an argument to some other function.

A different scenario to a function being returned is where an instance of a class is returned. In this case I may want to apply a wrapper around a specific method of just that instance of the class.

With the Mock library it again comes down to using its 'Mock' class and having to apply it in different ways to achieve the result you want. I am going to step back from Mock now though and just focus on how one can do things using wrapt.

So, depending on the requirements there are a couple of ways one could do this with wrapt.

The first approach is to replace the method on the instance directly with a wrapper which encapsulates the original method.

from wrapt import transient_function_wrapper, function_wrapper
class StorageClass(object):
    def run(self):
        pass
storage = StorageClass()
class ProductionClass(object):
    def method(self, a, b, c, key):
        return storage
@function_wrapper
def run_method_wrapper(wrapped, instance, args, kwargs):
    return wrapped(*args, **kwargs)
@transient_function_wrapper(__name__, 'ProductionClass.method')
def apply_ProductionClass_method_wrapper(wrapped, instance, args, kwargs):
    storage = wrapped(*args, **kwargs)
    storage.run = run_method_wrapper(storage.run)
    return storage
@apply_ProductionClass_method_wrapper
def test_method():
    real = ProductionClass()
    data = real.method(3, 4, 5, key='value')
    result = data.run()

This will create the desired result but in this example actually turns out to be a bad way of doing it.

The problem in this case is that the object being returned is one which has a lifetime beyond the test. That is, we are modifying an object stored at global scope, one which might also be used by a different test. By simply replacing the method on the instance, we have made a permanent change.

This would be okay if it was a temporary instance of a class created on demand just for that one call, but not where it is persistent like in this case.

We can't therefore modify the instance itself, but need to wrap the instance in some other way to intercept the method call.

To do this we make use of what is called an object proxy. This is a special object type which we can create an instance of to wrap another object. When accessing the proxy object, any attempts to access attributes will actually return the attribute from the wrapped object. Similarly, calling a method on the proxy will call the method on the wrapped object.

Having a distinct proxy object though allows us to change the behaviour on the proxy object and so change how code interacts with the wrapped object. We can therefore avoid needing to change the original object itself.

For this example what we can therefore do is:

from wrapt import transient_function_wrapper, ObjectProxy
class StorageClass(object):
    def run(self):
        pass
storage = StorageClass()
class ProductionClass(object):
    def method(self, a, b, c, key):
        return storage
class StorageClassProxy(ObjectProxy):
    def run(self):
        return self.__wrapped__.run()
@transient_function_wrapper(__name__, 'ProductionClass.method')
def apply_ProductionClass_method_wrapper(wrapped, instance, args, kwargs):
    storage = wrapped(*args, **kwargs)
    return StorageClassProxy(storage)
@apply_ProductionClass_method_wrapper
def test_method():
    real = ProductionClass()
    data = real.method(3, 4, 5, key='value')
    result = data.run()

That is, we define the 'run()' method on the proxy object to intercept the call of the same method on the original object. We can then proceed to return fake values, validate arguments or results, or modify them as necessary.

With the proxy we can even intercept access to an attribute of the original object by adding a property to the proxy object.

from wrapt import transient_function_wrapper, ObjectProxy
class StorageClass(object):
    def __init__(self):
        self.name = 'name'
storage = StorageClass()
class ProductionClass(object):
    def method(self, a, b, c, key):
        return storage
class StorageClassProxy(ObjectProxy):
    @property
    def name(self):
        return self.__wrapped__.name
@transient_function_wrapper(__name__, 'ProductionClass.method')
def apply_ProductionClass_method_wrapper(wrapped, instance, args, kwargs):
    storage = wrapped(*args, **kwargs)
    return StorageClassProxy(storage)
@apply_ProductionClass_method_wrapper
def test_method():
    real = ProductionClass()
    data = real.method(3, 4, 5, key='value')
    assert data.name == 'name'

Building a better Mock

You might be saying at this point that Mock does a lot more than this. You might even want to point out how Mock can save away details about a call which can be checked later at the level of the test harness, rather than having to resort to raising assertion errors down in the wrappers themselves, which can be an issue if code catches the exceptions before you see them.

This is all true, but the goal at this point for wrapt has been to provide monkey patching mechanisms which do respect introspection, the descriptor protocol and other things besides. That I can use it for the type of testing I do is a bonus. 

You aren't limited to using just the basic building blocks themselves though and personally I think wrapt could be a great base on which to build a better Mock library for testing.

I therefore leave you with one final example to get you thinking about the ways this might be done if you are partial to the way that Mock does things.

from wrapt import transient_function_wrapper
class ProductionClass(object):
    def method(self, a, b, c, key):
        pass
def patch(module, name):
    def _decorator(wrapped):
        class Wrapper(object):
            @transient_function_wrapper(module, name)
            def __call__(self, wrapped, instance, args, kwargs):
                self.args = args
                self.kwargs = kwargs
                return wrapped(*args, **kwargs)
        wrapper = Wrapper()
        @wrapper
        def _wrapper():
            return wrapped(wrapper)
        return _wrapper
    return _decorator
@patch(__name__, 'ProductionClass.method')
def test_method(mock_method):
    real = ProductionClass()
    result = real.method(3, 4, 5, key='value')
    assert real.method.__name__ == 'method'
    assert mock_method.args == (3, 4, 5)
    assert mock_method.kwargs.get('key') == 'value'

So that is a quick run down of the main parts of the functionality provided by wrapt for doing monkey patching. There are a few other things, but that is in the main all you usually require. I use monkey patching to add instrumentation into existing code to support performance monitoring, but I have shown here how the same techniques can be used in writing tests for your code as an alternative to a package like Mock.

As I mentioned in my previous post though, one of the big problems with monkey patching is the order in which modules get imported relative to when the monkey patching is done. I will talk more about that issue in the next post.