Thursday, December 18, 2014

Apache/mod_wsgi for Windows on life support.

The critics will say that Apache/mod_wsgi as a whole is on life support and not worth saving. I have a history of creating Open Source projects that no one much wants to use, or that get bypassed over time, but I am again having lots of fun with Apache/mod_wsgi so I don't particularly care what they may have to say right now.

I do have to say though, that right now it looks like continued support for the Windows platform is in dire straits unless I can get some help.

A bit of history is in order to understand how things got to this point.

First up, I hate Windows with an absolute passion. Always have and always will. Whenever I even try and use a Windows box it is like my whole body is screaming "stop, don't do it, this is not natural". My patience in having to use a Windows system is therefore very low, even when I can load it up with a decent shell and set of command line tools.

Even with my distaste for Windows, I did at one point manage to get mod_wsgi ported to it, although that was possibly more luck than anything else. A few issues have come up with the Windows support over the years as new versions of Python came out, but luckily very few were Windows specific and I could in the main blissfully ignore Windows; whatever I was doing on more familiar UNIX systems just worked. The Apache ecosystem and runtime libraries helped a lot here, as they hid nearly everything about the differences between Windows and UNIX. It was more with Python that I had to deal with the little differences.

All my development back then for mod_wsgi on Windows was done with a 32 bit version of Windows XP. I did for a time make available precompiled binaries for two different Python versions, but only 32 bit and only for the binary distribution of Apache that the Apache Software Foundation made available at the time.

As 64 bit platforms came along I couldn't do much to help people, and I never tried third party distributions of Apache for Windows either. Luckily though, someone who I don't even know stepped in and started making available precompiled Windows binaries for a range of Python and Apache versions, for both 32 bit and 64 bit platforms. I am quite grateful for the effort they put in, as it meant I could in the main ignore Windows and only have to deal with it when something broke because a new version of Python had come out.

Some years ago I got into extreme burnout over my Open Source work and mod_wsgi was neglected for a number of years. This was no big deal for users, because mod_wsgi had proven itself so stable over the years that it could be neglected without causing anyone any problems.

When I finally managed to dig myself out of that hole earlier this year, I well and truly ripped into mod_wsgi and started making quite a lot of changes. I made no attempt though to ensure that it kept working on Windows. Even the makefiles I used for Windows would no longer work, as I finally started breaking up the large single monolithic source code file that was mod_wsgi into many smaller code files.

To put in context how much I have been working on mod_wsgi this year, in the 6 1/2 years up to the start of this year there had been about 20 separate releases of mod_wsgi. In just this year alone I am already nearly up to 20 releases. The most recent release at this time is version 4.4.2.

So I haven't been idle, doing quite a lot of work on mod_wsgi and doing much more frequent releases. Now if only the Linux distributions would actually bother to catch up, with some still shipping versions of mod_wsgi from a number of years ago.

What this all means is that Windows support has been neglected for more than 20 versions, and obviously a lot could have changed in the mod_wsgi source code in that time. There have also been a number of new Python releases since the last time I even tried to build mod_wsgi myself on Windows.

With the last version of mod_wsgi available for Windows being version 3.5, the calls for updated binaries have been growing. My benefactor who builds the Windows binaries again stepped in and tried to get some newer builds compiled himself. He did manage this and started making them available.

Unfortunately, reports of a range of problems started to come in. These ranged from Apache crashing on startup, or after a number of requests, through to cases where the Apache configuration for having requests sent through to the WSGI application wasn't even working.

Eventually I relented and figured I had better try and sort it out. Just getting myself to a point where I could start debugging things was a huge drama. First up, my trusty old copy of Windows XP was too old to even be usable with the current Microsoft tools required to compile code for Python. I therefore had to stoop to actually going out and buying a copy of Windows 8.1. That was a sad day for me.

Windows 8.1 being the most recent version of Windows, I would have thought the experience of setting things up would have improved over the years. Not so; it turned out to be even more aggravating. Just getting Windows 8.1 installed into a VM and applying all of the operating system patches took over 6 hours. Did I mention I hate Windows?

Next I needed to get the Microsoft compilers installed. Which version of these you need for Python these days appears to be documented in some dusty filing cabinet in a distant galaxy. At least I couldn't find it in an obvious place in the documentation, which was frustrating. The result was that I wasted time installing the wrong version of Visual Studio Express, only to find it wouldn't work.

So I worked out that I needed to use Visual Studio Express 2008, but is there an obvious link in the documentation for where to get that from? Of course not, or not that I could find, so you have to go searching with Google and weed out the dead links until you finally find the right one.

I had been at this for a few days now.

Rather stupidly I thought I would ignore my existing makefiles for Windows and try and build mod_wsgi using a Python 'setup.py' file like I have been doing with the 'pip' installable version for UNIX systems. That failed immediately because I had installed a 64 bit version of Python and distutils can only build 32 bit binaries using the Express edition of Visual Studio 2008. So I had to remove the 64 bit versions of Python and Apache and install 32 bit versions.

So I got distutils compiling the code, but then found I had to create a dummy Python module initialisation function that wasn't required on UNIX systems. I finally did though manage to get it all compiled and the one 'setup.py' file would work for different Python versions.

The joy was short lived though as Apache would crash on startup with every single Python version I tried.

My first suspicion was that it was the dreaded DLL manifest problem that has occasionally cropped up over the years.

The problem here is that long ago when building Python modules using distutils, it would attach a manifest to the resulting Python extension module which matched that used for the main Python binary itself. This meant that any DLL runtimes such as 'msvcr90.dll' would be explicitly referenced by the extension module DLL.

Some bright spark decided though that this was redundant and changed distutils to stop doing it. This would be fine so long as, when embedding Python, the main executable, Apache in this instance, was compiled with the same compiler and so linked against that runtime DLL.

When Apache builds started to use newer versions of the Microsoft compilers though, Apache stopped linking that DLL. So right now the most recent Apache builds don't link 'msvcr90.dll'. The result of distutils no longer referencing that DLL was that 'msvcr90.dll' was missing when mod_wsgi was loaded into Apache, and it crashed.

I therefore went back to my previous makefiles. This actually turned out to be less painful than I was expecting, and I even managed to clean up the makefiles, separating out the key rules into a common file, with the version specific makefiles having just the locations of Apache and Python. I even managed to avoid needing to define where the Microsoft compilers were located, which varied depending on various factors. It did mean that to do the compilation you had to do it from the special Windows command shell you can start up from the application menus, with all the compiler specific shell variables already set, but I could live with that.

Great, all things were compiling again. I even managed to use a DLL dependency walker to verify that 'msvcr90.dll' was being linked in properly so it would be found okay when mod_wsgi was loaded into Apache.

First up I tried Python 2.7 and was happy to see that it actually appeared to work. When I tried Python 2.6, 3.2, 3.3 and 3.4 in turn though, they either crashed immediately when initialising Python, or when handling the first request.

I have tried for a while to narrow down where the problems are by adding extra logging, but the main culprit is deep down in Python somewhere. Although I did actually manage to get the Microsoft debugger attached to the process before it crashed, because neither of the precompiled binaries for Apache and Python have debugging information, what one got was absolutely useless.

So that is where I am at. I have already wasted quite a lot of time on this and lack the patience, and also the skills, for debugging stuff on Windows. The most likely next step I guess would be to try and build versions of Apache and Python from scratch with debugging symbols, so that the exact point of the crashes can be determined. I can easily see that being an even larger time suck, and so am not prepared to go there.

The end result is that unless someone else can come in and rescue the situation, that is likely the end of Windows support for Apache/mod_wsgi.

So, summarising where things are at:

1. I have multiple versions of Python installed on a Windows 8.1 system. These are Python 2.6, 2.7, 3.2, 3.3 and 3.4. They are all 32 bit, because getting Visual Studio Express to compile 64 bit binaries is apparently a drama in itself when using Python distutils, as one has to modify the installed version of distutils. I decided to skip that problem and just use 32 bit versions. Whether having many versions of Python installed at the same time is a problem in itself I have no idea.

2. I have both Visual Studio Express 2008 and 2012 installed. Again, I don't know whether having both installed at the same time will cause conflicts of some sort.

3. I am using Apache 2.4 32 bit binaries from Apache Lounge. I haven't bothered to try and find older Apache 2.2 versions at this point.

4. The latest mod_wsgi version compiled for Python 2.7 does actually appear to work, or at least it did for the few requests I gave it.

5. Use Python version 2.6, 3.2, 3.3 or 3.4 and mod_wsgi will always crash on Apache startup or when handling the first request.

Now I know from experience that it is a pretty rare person who even considers whether they might be able to contribute to mod_wsgi. The combination of Apache, embedding Python using the Python C APIs, non trivial use of Python sub interpreters, and multithreading seems to scare people away. I can't understand why; such complexity can actually be fun.

Anyway, I throw all this information out here in case there is anyone brave enough to want to help me out.

If you are, then the latest mod_wsgi source code can be found on GitHub at:

https://github.com/GrahamDumpleton/mod_wsgi

The source code has a 'win32' subdirectory. You need to create a Visual Studio 2008 Command Prompt window and get yourself into that directory. You can then build specific Apache/Python versions by running:

nmake -f ap24py33.mk clean
nmake -f ap24py33.mk
nmake -f ap24py33.mk install

Just point at the appropriate makefile.
 
The 'install' target will tell you what to add into the Apache configuration file to load the mod_wsgi module after it has copied it to the Apache modules directory.
 
You can then follow normal steps for configuring a WSGI application to run with Apache, start up Apache and see what happens.
 
If you have success or learn anything, then jump on the mod_wsgi mailing list and let me know. We can see where we go from there.
 
Much thanks and kudos in advance to anyone who does try.

Tuesday, December 16, 2014

Launching applications in Docker containers.

So far in this current series of blog posts I introduced the Docker image I have created for hosting Python WSGI applications using Apache/mod_wsgi. I then went on to explain what happens when you build your own image derived from it which incorporates your specific Python web application. In this blog post I am going to explain what happens when you run the image and how your Python web application gets started.

In the previous blog posts I gave an example of the Dockerfile you would use for a simple WSGI hello world application:

FROM grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild
CMD [ "wsgi.py" ]

I also presented a more complicated example for a Django site. The Dockerfile for that was still only:

FROM grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild
CMD [ "--working-directory", "example", \
"--url-alias", "/static", "example/htdocs", \
"--application-type", "module", "example.wsgi" ]

In neither of these though is there anything that looks like a command to start up Apache/mod_wsgi or any other WSGI server. So first up let's explore how an application is started up inside of a Docker container.

Containers as an application

There are actually two different approaches one can take as to how to start up an application inside of a Docker container.

The first is to create a base image which contains the application you want to run. When you now start up the Docker container, you provide the full application command line as part of the command to launch the container.

For example, if you had created an image containing the Python interpreter and a Django web application with all dependencies installed, you could start up the Django web application using the Django development server using:

docker run my-python-env python example/manage.py runserver

If you wanted to start up an interactive shell so you could explore the environment of the container and/or manually run up your application, you could start it with:

docker run -it my-python-env bash

The second approach entails viewing the container itself as an application. If set up in this way, the image would be hardwired to start up a specific command when the container is launched.

For example, if when building your container you specified in the Dockerfile:

CMD [ "python", "example/manage.py", "runserver" ]

when you now start the container, you do not need to provide the command to run. In other words, running:

docker run my-python-env

would automatically start the Django development server.

You could if you wish still provide a command, but it will override the default specified by the 'CMD' instruction and cannot be used to extend the existing command with additional options.

If you therefore wanted to change for some reason the port the Django development server was listening on within the container, you would have to duplicate the whole command:

docker run my-python-env python example/manage.py runserver 80

The 'CMD' instruction, although it allows you to supply a default command, doesn't therefore go so far as to make the container behave like it is an application in its own right, which can accept arbitrary command line arguments when run.

To have a container behave like that, we use an alternative instruction to 'CMD' called 'ENTRYPOINT'.

So we swap the 'CMD' instruction with 'ENTRYPOINT', setting it to the default command for the container.

ENTRYPOINT [ "python", "example/manage.py", "runserver" ]

When we now run the container as:

docker run my-python-env

the Django development server will again be run, but we can now supply additional command line options which will be appended to the default command.

docker run my-python-env 80

Although we changed to the 'ENTRYPOINT' instruction, it and the 'CMD' instruction are not exclusive. You could actually write in the Dockerfile:

ENTRYPOINT [ "python", "example/manage.py", "runserver" ]
CMD [ "80" ]

In this case when you start up the container, the combined command line of:

python example/manage.py runserver 80

would be run.

Supply any command line options when starting the container in this case, and they will override those specified by the 'CMD' instruction, but the 'ENTRYPOINT' instruction will be left as is. Running:

docker run my-python-env 8080

would therefore result in the combined command line of:

python example/manage.py runserver 8080

Those therefore are the basic principles around how to launch an application within a container when it is started. Now let's look at what the Docker image I have created for running a Python WSGI application is using.

Inheritance of an ENTRYPOINT

In the prior blog post I explained how the Docker image I provide isn't just one image but a pair of images. The base image packaged up all the required tools, including Python, Apache and mod_wsgi. The derived image was what was called an 'onbuild' image and controlled how your specific Python web application would be built and combined with the base image.

The other thing that the 'onbuild' image did was to define the process for how the web application would then be started up when the container was run. The complete Dockerfile for the 'onbuild' image was:

FROM grahamdumpleton/mod-wsgi-docker:python-2.7

WORKDIR /app

ONBUILD COPY . /app
ONBUILD RUN mod_wsgi-docker-build

EXPOSE 80

ENTRYPOINT [ "mod_wsgi-docker-start" ]

As can be seen, this specified an 'ENTRYPOINT' instruction, which as we now know from above, will result in the 'mod_wsgi-docker-start' command being run automatically when the container is started.

Remember though that to create an image with your specific Python web application you actually had to create a further image deriving from this 'onbuild' image. For our Django web application that was:

FROM grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild

CMD [ "--working-directory", "example", \
"--url-alias", "/static", "example/htdocs", \
"--application-type", "module", "example.wsgi" ]

This final Dockerfile doesn't itself specify an 'ENTRYPOINT' instruction, but it does define a 'CMD' instruction.

This highlights an important point. That is that an 'ENTRYPOINT' instruction will be inherited from a base image and will also be applied to the derived image. Thus when we start up the container corresponding to the final image, that command will still be run.

As it turns out, a 'CMD' instruction will also be inherited by a derived image, but in this case the final image specified its own 'CMD' instruction. The final result of all this was that when the container was started up, the full command that was executed was:

mod_wsgi-docker-start --working-directory example \
    --url-alias /static example/htdocs \
    --application-type module example.wsgi

As was the case during the build of the final image, when 'mod_wsgi-docker-build' was being run, there is various magic going on inside 'mod_wsgi-docker-start', so time to delve into what it is doing.

Preparing the environment on start up

When the image containing your Python web application was being built, the 'mod_wsgi-docker-build' script allowed you to provide special hook scripts which would be run during the process of building the image. These were the 'pre-build' and 'build' hooks. They allowed you to perform special actions prior to 'pip' being run to install any Python packages, as well as afterwards, to perform any subsequent application specific setup.

When deploying the web application and starting it up, it is common practice to again provide the ability to have special hook scripts to allow you to perform additional steps. These are sometimes called the 'deploy' and 'post-deploy' hooks.

One of the first things that the 'mod_wsgi-docker-start' script therefore does is execute any 'deploy' hook.

if [ -x .docker/action_hooks/deploy ]; then
    echo " -----> Running .docker/action_hooks/deploy"
    .docker/action_hooks/deploy
fi

As with the hooks for the build process, this should reside in the '.docker/action_hooks' directory and the script needs to be executable.
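For example, if the hook script wasn't already marked as executable before the image was built, you would do so with:

chmod +x .docker/action_hooks/deploy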

Now normally for a web site, if you have a backend database it would be persistent, and the data stored in it would have a life independent of the lifetime of the individual containers running your web application.

If however you were using Docker to create a throw away instance of a Python web application and it was paired to a transient database instance that only existed for the life of the Python web application, then one of the steps you would need to perform is that of preparing the database, creating any tables and possibly populating it with data. This would need to be done before starting the actual Python web application.

One use of the 'deploy' hook when running a Django web application is therefore to run the 'migrate' Django management command from inside the hook script.

#!/usr/bin/env bash
python example/manage.py migrate

There is one important thing to note here which is different to what happens when hook scripts are run during the build phase. That is that during the build phase the 'mod_wsgi-docker-build' script itself and any hook scripts would use:

set -eo pipefail

The intent of this was to cause a hard failure when building the image rather than leaving the image in an inconsistent state.

What to do about errors during the actual deployment phase is a bit more problematic. Causing a hard failure would shut down the container immediately. Taking such drastic action may be undesirable, as it robs you of the possibility of dealing with any errors, or of alerting someone to the problem and running in a degraded state while they look at and rectify the issue.

I don't know what the best answer is right now, so am open to suggestions on how to deal with it. For now, if an error does occur in a 'deploy' hook, the actual Python web application will be started up anyway.

Starting up the Python web application

Having run any 'deploy' hooks we are now ready to start up the actual Python web application. The part of the 'mod_wsgi-docker-start' script which does this is:

SERVER_ARGS="--log-to-terminal --startup-log --port 80"
exec mod_wsgi-express start-server ${SERVER_ARGS} "$@"

The key step in this is the execution of the 'mod_wsgi-express' command. I will defer to a subsequent blog post to talk more about what 'mod_wsgi-express' is doing, but it is enough to know right now that it is what is actually starting up Apache and loading up your Python WSGI application so that it can then handle any web requests.

In running 'mod_wsgi-express' we by default supply it with the '--log-to-terminal' option to have it log to stdout/stderr, so that Docker can collect the logs automatically, making them available to commands such as 'docker logs'.
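For example, if the container had been started with '--name my-running-app', as in the earlier introductory post, the logs could be viewed with:

docker logs my-running-app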

When we do run 'mod_wsgi-express', it is important to note that we do it via an 'exec'. This is done so that Apache will replace the current script process and so inherit process ID 1. Docker treats process ID 1 as being special; it is that process to which it delivers any signals injected into the container from outside. Having Apache run as process ID 1 therefore ensures that it receives shutdown signals for the container properly and can attempt to shut down the Python web application in an orderly manner.

What should run as process ID 1 in a Docker container is actually something that sees a bit of debate. In one corner, the Docker philosophy holds that a container should only run one application, and so it is sufficient for that application process to run as process ID 1.

In the other corner you have others who argue that in the limited environment of a Docker container you are missing many of the things that the normal init process running as process ID 1 on a UNIX system would do, such as cleaning up zombie processes and the like. You also lack the ability for an application process to be restarted if it crashes.

You therefore will see people who will run something like 'runit' or 'supervisord' inside of the container, with those starting up and managing the actual application processes.

For the image I am providing, I am relying on the fact that Apache is its own process supervisor for its managed child processes and has demonstrated its stability; although its child processes may crash, the Apache parent process is rock solid and doesn't crash.

I did contemplate the use of 'supervisord' for other reasons, but a big problem with 'supervisord' is that it still has not been ported to Python 3. This is an issue because when using Docker it is common to provide separate images for different Python versions, to cut down on the amount of fat in the images. This means that in a Python 3 image there only exists Python 3. Having to install a copy of Python 2 as well, just so one can run 'supervisord', is therefore somewhat annoying.

The problem of 'post-deploy' actions

Now I mentioned previously that it is common to have both a 'deploy' and a 'post-deploy' hook script, with the 'deploy' script, as shown, being run prior to starting the actual Python web application. The idea with the 'post-deploy' script is that it would be run after the Python web application has been started.

Such 'post-deploy' scripts cause a number of complications.

The first is that to satisfy the requirement that the Apache process be process ID 1, it was necessary to 'exec' the 'mod_wsgi-express' command when run. Because an 'exec' was done, nothing else can be done after that point within the 'mod_wsgi-docker-start' script, as it is no longer running.

One might be able to support a 'post-deploy' script when using something like 'runit' or 'supervisord', if they allow ordering relationships to be defined for when commands are started up, and allow for one off commands to be run rather than attempting to rerun a command which exits straight away.

Even so, this doesn't solve all the problems that exist with what might be run from a 'post-deploy' script.

To understand what these issues are we need to look at a couple of examples of what might be done in a 'post-deploy' script.

The first example is that you might want to use a 'post-deploy' script to hit your Python web application with requests against certain URLs, to forcibly preload parts of the application code base before it starts handling web requests. This might be desirable with Django in particular, because of the way it lazily loads view handlers, in most cases only when they are first accessed.

The second example is where you need to poll the Python web application to ensure that it has actually started up properly and is serving requests okay. When all is okay, you might then use some sort of notification system to announce availability of this instance of the Python web application, resulting in it being added into a server group of a front end load balancer.

In both these cases, such a script has to be tolerant of the web application being slow to start and cannot assume that it will actually be ready when the script runs. Since they would need to poll for availability of the Python web application anyway, the need for a separate 'post-deploy' phase is somewhat diminished. What one can instead do is start these actions as background processes during the 'deploy' phase.

So, due to the requirement that Apache needs to run as process ID 1, there is at this point no distinct 'post-deploy' phase. To achieve the same result, such actions should be run as background tasks from the 'deploy' phase instead, polling for availability of the Python web application before then running whatever it is intended they should do.
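To make that concrete, here is a rough sketch of what such a background task could look like. This is not something the image itself provides; the script name, URL and retry limit are my own assumptions, and it sticks to the Python 2 standard library to match the 'python-2.7' image used in these examples. It might be started from the 'deploy' hook with something like '.docker/action_hooks/preload &'.

#!/usr/bin/env python

# Hypothetical script run as a background task from the 'deploy' hook.
# It polls the local web application until it starts accepting
# requests, then hits a URL to forcibly preload application code.

import time
import urllib2

URL = 'http://localhost/'  # assumed URL to preload

for _ in range(120):
    try:
        urllib2.urlopen(URL, timeout=5).read()
        break
    except Exception:
        # Web application not ready yet, so wait and try again.
        time.sleep(1)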

Starting up an interactive shell

As previously described, one of the things that one can do when 'ENTRYPOINT' is used is override the actual command that would be run by a container, even when the Dockerfile for the image defined a 'CMD' instruction. It was thus possible to run:

docker run -it my-python-env bash

This gives us access to an interactive shell to explore the container or run any commands manually.

When we use 'ENTRYPOINT' it would appear that we lose this ability.

All is not lost though; it just means we need to handle this a bit differently.

Before I show how that is done, one feature of Docker that is worth pointing out is that it is actually possible to get an interactive shell into an already running container. This is something that originally would have required running 'sshd' inside of the container, or using a mechanism called 'nsenter'. Because this was a common requirement though, Docker has for a while now provided the 'exec' command.

What you can therefore do if you already have a running container you want to inspect is run:

docker exec -it hungry_bardeen bash

where the name is the auto generated name for the container, or the one you assigned when the container was started.

If we don’t already have a running container though, what do we do?

Normally we can supply 'bash' as the command and it will work, because there is an implicit 'ENTRYPOINT' for a container of 'bash -c'. This is what allows us to specify any command.

In our case the implicit 'ENTRYPOINT' has been replaced with 'mod_wsgi-docker-start', and anything we supply on the command line when the container is run will be passed as options to it instead.

The first thing we have to do is work out how we can reset the value of the 'ENTRYPOINT' from the command line. Luckily this can be done using the '--entrypoint' option.

So we can try running:

docker run -it --entrypoint bash my-python-app

This will have the desired effect of running bash for us, but will generally fail.

The reason it will fail is that any 'CMD' defined within the image will be passed to the 'bash' command as arguments when it is run as the entry point.

Trying to wipe out the options specified by 'CMD' using:

docker run -it --entrypoint bash my-python-app ''

doesn't help, as 'bash' will then look for and try to execute a script with an empty name.

To get around this problem and make things a little easier, the image I supply for hosting Python WSGI applications provides an additional command called 'mod_wsgi-docker-shell'. What one would therefore run is:

docker run -it --entrypoint mod_wsgi-docker-shell my-python-app

In this case the 'mod_wsgi-docker-shell' script would be run, and although what is defined by the 'CMD' instruction is still passed as arguments to it, they will be ignored and 'mod_wsgi-docker-shell' will 'exec' the 'bash' command with no arguments.
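Conceptually that is all the script needs to do. A minimal sketch, and not the actual implementation shipped in the image, might be nothing more than:

#!/usr/bin/env python

# Ignore any arguments passed through from the 'CMD' instruction and
# replace the current process with an interactive bash shell.

import os

os.execlp('bash', 'bash')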

That therefore is my little backdoor. The only other way I found of getting around the issue is the non obvious command of:

docker run -it --entrypoint nice my-python-app bash

In this case we are relying on the fact that the 'nice' command will in turn execute any arguments it is supplied. I thought that 'mod_wsgi-docker-shell' may be more obvious though.

What is 'mod_wsgi-express'?

So what exactly is 'mod_wsgi-express' then?

The 'mod_wsgi-express' program is something that I have been working on for over a year now and it has been available since early this year. Although I have commented a bit about it on Twitter, and in a conference talk I did, I haven't as yet written any blog posts about it. My next blog post will therefore be the first proper public debut of 'mod_wsgi-express' and what it can do.

Wednesday, December 10, 2014

Deferred build actions for Docker images.

In my last blog post I introduced what I have been doing on creating a production quality Docker image for hosting Python WSGI applications using Apache/mod_wsgi.

In that post I gave an example of the Dockerfile you would use for a simple WSGI hello world application:

FROM grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild

CMD [ "wsgi.py" ]

I also presented a more complicated example for a Django site. The Dockerfile for that was still only:

FROM grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild

CMD [ "--working-directory", "example", \
"--url-alias", "/static", "example/htdocs", \
"--application-type", "module", "example.wsgi" ]

In the case of the Django site there are actually going to be quite a number of files that are required, including the Python code and static file assets. There was also a list of Python packages needing to be installed defined in a pip 'requirements.txt' file.

The big question therefore is how with only that small Dockerfile were all those required files getting copied across and how were all the Python packages getting installed?

Using an ONBUILD Docker image

The name of the Docker image which the local Dockerfile derived from was in this case called:

grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild

The clue as to what is going on here is the 'onbuild' qualifier in the name of the image.

Looking at the Dockerfile used to build that 'onbuild' image we find:

FROM grahamdumpleton/mod-wsgi-docker:python-2.7
WORKDIR /app
ONBUILD COPY . /app
ONBUILD RUN mod_wsgi-docker-build
EXPOSE 80
ENTRYPOINT [ "mod_wsgi-docker-start" ]

If you were expecting to see here a whole lot of instructions for creating a Docker image consisting of Python, Apache and mod_wsgi you will be sadly disappointed. This is because all that more complicated stuff is actually contained in a further base image called:

grahamdumpleton/mod-wsgi-docker:python-2.7

So what has been done is to create a pair of Docker images. An underlying base image is what provides all the different software packages we need. The derived 'onbuild' image is what defines how a specific user's application is combined with the base image to form the final deployable application image.

This type of split is a common pattern one sees for Docker images which are trying to provide a base level image for deploying applications written in a specific programming language.

Using a derived image to define how the application is combined with the base image means that if someone doesn't agree with the specific way that is being done, they can ignore the 'onbuild' image and simply derive directly from the base image and define things their own way.

Deferring execution using ONBUILD

When creating a Dockerfile, you will often use an instruction such as:

COPY . /app

What this does is, at the point the Docker image is being built, copy the contents of the directory containing the Dockerfile into the Docker image. In this case everything will be copied into the '/app' directory of the final Docker image.

If you are doing everything in one Dockerfile this is fine. In this case though, I want to include that sort of boilerplate, which is always required to be run, in the base image, but have the instruction only be executed when a user is creating their own derived image with their specific application code.

To achieve this I used the 'ONBUILD' instruction, prefixing the original 'COPY' instruction.

The effect of using 'ONBUILD' is that the 'COPY' instruction will not actually be run at this point. All that will happen is that the 'COPY' instruction will be recorded. Any instruction prefixed with 'ONBUILD' will then only be replayed when a further image is created which derives from this image.

Specifically, any such instructions will be run as the first steps after the 'FROM' instruction in a derived image, with:

FROM grahamdumpleton/mod-wsgi-docker:python-2.7
WORKDIR /app
ONBUILD COPY . /app
ONBUILD RUN mod_wsgi-docker-build
EXPOSE 80
ENTRYPOINT [ "mod_wsgi-docker-start" ]

and:

FROM grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild
CMD [ "wsgi.py" ]

being the same as if we had written:

FROM grahamdumpleton/mod-wsgi-docker:python-2.7
WORKDIR /app
EXPOSE 80
ENTRYPOINT [ "mod_wsgi-docker-start" ]

and:

FROM grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild
COPY . /app
RUN mod_wsgi-docker-build
CMD [ "wsgi.py" ]

You can determine whether a specific image you want to use is using 'ONBUILD' instructions by inspecting the image:

$ docker inspect grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild

Look in the output for the 'OnBuild' section and you should find all the 'ONBUILD' instructions listed.

"OnBuild": [
"COPY . /app",
"RUN mod_wsgi-docker-build"
],

Anyway, as noted before, the point here is to make what the final user does as simple as possible and avoid the need to always include specific boilerplate, thus the reason for using 'ONBUILD'.

Installing extra system packages

For this specific Docker image, the deferred steps triggered by the 'ONBUILD' instructions will perform a number of tasks.

The first and most obvious action is that performed by the instruction:

COPY . /app

As explained before, this will copy the contents of the application directory into the Docker image.

The next step which is performed is:

RUN mod_wsgi-docker-build

This will execute the program 'mod_wsgi-docker-build'. This magic program is what is going to do all the hard work. The actual program was originally included into the image by way of the base image that the 'onbuild' image derived from. That is, in:

grahamdumpleton/mod-wsgi-docker:python-2.7

The key thing that the 'mod_wsgi-docker-build' script will do is run 'pip' to install any Python packages which are listed in the 'requirements.txt' file. Before 'pip' is run though there are a few other things we also want to do.

The first thing that the 'mod_wsgi-docker-build' script does is:

if [ -x .docker/action_hooks/pre-build ]; then
    echo " -----> Running .docker/action_hooks/pre-build"
    .docker/action_hooks/pre-build
fi

This executes an optional 'pre-build' hook script which can be supplied by the user and allows a user to perform additional steps prior to any Python packages being installed using 'pip'. Such a hook script would be placed into the '.docker/action_hooks' directory.

The main purpose of the 'pre-build' hook script would be to install additional system packages that may be required to be present when installing the Python packages. This would be necessary due to the fact that the base image is a minimal image that only installs the minimum system packages required for Python and Apache to run. It does not try and anticipate every single common package that may be required to cover the majority of use cases.

If one did try and guess what additional packages people might need, then the size of the base image would end up being significantly larger. As such, installation of additional system packages is up to the user, based on their specific requirements. Being Docker though, there are no restrictions on installing additional system packages. This is in contrast to your typical PaaS providers, where you are limited to the packages they decided you might need.

As an example, if the Python web application to be run needed to talk to a Postgres database using the Python 'psycopg2' module, then the Postgres client libraries will need to be first installed. To ensure that they are installed a 'pre-build' script containing the following would be used.

#!/usr/bin/env bash

set -eo pipefail

apt-get update
apt-get install -y libpq-dev

rm -r /var/lib/apt/lists/* 

The line:

set -eo pipefail

in this script is important as it ensures that the shell script will exit immediately on any error. The 'mod_wsgi-docker-build' script does the same, with the reason being that you want the build of the image to fail on any error, rather than errors being silently ignored and an incomplete image created.

The line:

rm -r /var/lib/apt/lists/*

is an attempt to clean up any temporary files so that the image isn't bloated with files that aren't required at runtime.

Do note that the script itself must also be executable, else it will be ignored. It would be nice to not have this requirement and to have the script made executable automatically if it wasn't, but doing so seems to trip a bug in Docker.

Installing required Python packages

The next thing that the 'mod_wsgi-docker-build' script does is:

if [ ! -f requirements.txt ]; then
    if [ -f setup.py ]; then
        echo "-e ." > requirements.txt
    fi
fi

In this case we are actually checking to see whether there even is a 'requirements.txt' file. Not having one is okay and we simply wouldn't install any Python packages.

There is though a special case we check for if there is no 'requirements.txt' file. That is if the directory for the application contains a 'setup.py' file. If it does, then we assume that the application directory is itself a Python package that needs to first be installed, or at least, that the 'setup.py' has defined a set of dependencies that need to be installed.

Now normally when you have a 'setup.py' file you would run:

python setup.py install

We can though achieve the same result by listing the current directory as '.' in the 'requirements.txt' file, preceded by the special '-e' option.

This approach of using a 'setup.py' file as the means of installing a set of Python packages was often used before 'pip' became available and the preferred option. It is still actually the only option that some PaaS providers allow for installing packages, with them not supporting the use of a 'requirements.txt' file and automatic installation of packages using 'pip'.
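For illustration, a minimal hypothetical 'setup.py' which would work with this mechanism might look like the following, where running 'pip' against the generated 'requirements.txt' containing '-e .' installs the package along with anything listed in 'install_requires'. The package name and dependencies here are made up.

from setuptools import setup

# Hypothetical minimal setup script; 'pip install -e .' installs this
# package along with the dependencies listed in install_requires.

setup(
    name='example',
    version='1.0',
    packages=['example'],
    install_requires=['Django'],
)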

Interestingly, this ability to use a 'setup.py' file and installing the application as a package is something that the official Docker images for Python can't do at this time. This is because they use:

RUN mkdir -p /usr/src/app
WORKDIR /usr/src/app
ONBUILD COPY requirements.txt /usr/src/app/
ONBUILD RUN pip install -r requirements.txt
ONBUILD COPY . /usr/src/app

The problem they have is that they only copy the 'requirements.txt' file into the image before running 'pip'. Because the current working directory when running 'pip' only contains that 'requirements.txt' file, the 'requirements.txt' file cannot refer to anything contained in the application directory.

So the order in which they do things prohibits it from working. I am not sure I fathom why they have chosen to do it the way they have, and whether they have a specific reason. It is perhaps something they should change, but I don't rule out that they may have a valid reason for doing it that way, so haven't reported it as an issue at this point.

After working out whether the set of packages to install may be governed by a 'setup.py' file, the next steps run are:

if [ -f requirements.txt ]; then
    if (grep -Fiq "git+" requirements.txt); then
        echo " -----> Installing git"
        apt-get update && \
            apt-get install -y git --no-install-recommends && \
            rm -r /var/lib/apt/lists/*
    fi
fi

if [ -f requirements.txt ]; then
    if (grep -Fiq "hg+" requirements.txt); then
        echo " -----> Installing mercurial"
        pip install -U mercurial
    fi
fi

These steps determine whether the 'requirements.txt' file lists packages which are actually managed under a version control system such as git or mercurial. If it does, then since the base image provides only a minimal set of packages, it will be necessary to install git or mercurial as appropriate.

Finally we can now run 'pip' to install any packages listed in the 'requirements.txt' file:

if [ -f requirements.txt ]; then
    echo " -----> Installing dependencies with pip"
    pip install -r requirements.txt -U --allow-all-external \
        --exists-action=w --src=.docker/tmp
fi

Performing final application setup

All the system packages and Python packages are now installed, and all the Python application code has also been copied into the image. There is one more thing to do though. That is to provide the ability for a user to supply a 'build' hook of their own to perform any final application setup.

if [ -x .docker/action_hooks/build ]; then
    echo " -----> Running .docker/action_hooks/build"
    .docker/action_hooks/build
fi

As with the 'pre-build' script, the 'build' script needs to be placed into the '.docker/action_hooks' directory and needs to be executable.

The purpose of this script is to allow any steps to be carried out that require all the system packages and Python packages to have first been installed.

An example of such a script would be if running Django and you wanted the building of the Docker image to collect together all the static file assets, rather than having to do it prior to attempting to build the image.

#!/usr/bin/env bash

set -eo pipefail

python example/manage.py collectstatic --noinput

Some PaaS providers in their build systems will try and automatically detect if Django is being used and run this step for you without you knowing. Although I have no problem with using special magic, I do have an issue with the system having to guess whether it was your intent that such a step be run. I therefore believe in this case it is better that it be explicit and that the user define the step themselves. This avoids any unexpected problems and eliminates the need to provide special options to disable such magic steps when they aren't desired and go wrong.

Starting the Python web application

The above steps complete the build process for the Docker image. The next phase would be to deploy the Docker image and get it running. This is where the final part of the 'onbuild' Dockerfile comes into play:

EXPOSE 80
ENTRYPOINT [ "mod_wsgi-docker-start" ]

I will explain what happens with that in my next post on this topic.

Tuesday, December 2, 2014

Hosting Python WSGI applications using Docker.

As I mentioned in my previous blog post, I see a lot of promise for Docker. The key thing that I personally see myself gaining from Docker, as a provider of a hosting solution for Python WSGI applications, is that I can get back some control over the hosting experience that developers will have.

Right now things can quickly become a bit of a mess, because the experience that developers have of Apache/mod_wsgi is going to be dictated by how a Linux distribution or hosting provider has set up Apache, and how easy they have made customising it in order to add the ability to host Python WSGI applications and then tune the Apache server. The less than optimal experience that developers usually have means they do not get to appreciate how well Apache/mod_wsgi can work, and they simply desert it for other options.

In the case of Docker, I can provide a pre packaged image for hosting Python WSGI applications which uses my knowledge of how to set up Apache and mod_wsgi properly to give the best experience. I can therefore hope that Docker may help me to win back some of those who otherwise never really understood the strengths of Apache and mod_wsgi.

Current offerings for Python and Docker

Although Docker is still young, to be frank, the bulk of the information around about running Python WSGI applications with Docker is pretty woeful. The instructions provided focus more on how to use Docker itself, rather than on how to create a production capable hosting solution for Python WSGI applications within the container. Nearly all explanations I have found describe the use of the builtin development servers for Python web frameworks such as Flask and Django. Using builtin development servers for production is generally a very bad idea.

In some cases they will suggest the use of the gunicorn or CherryPy WSGI servers, but these by themselves cannot handle hosting of static files. How exactly you are meant to host the static files they don't really provide details on, at most perhaps suggesting the use of S3 as a completely separate mechanism for hosting them.

There are available some Docker images for using uWSGI, but they are generally set up with the specific requirements of that user in mind, rather than trying to provide a good reusable image that can be applied across many use cases without you having to do some measure of re-configuration. Again, they aren't exactly transparent as far as handling static files goes, and leave that mostly up to you to work out how to solve.

The final problem with the uWSGI Docker images is that they are effectively unsupported efforts and haven't been updated in some time. They therefore are not keeping up to date with any security fixes or general bug fixes in the packages they are using.

Using Apache/mod_wsgi and Docker

To date I have not seen anyone attempt to describe how to use Apache and mod_wsgi with Docker. It isn't something that I am going to do exactly either, in as much as rather than describe how you yourself could create an image for using Apache and mod_wsgi with Docker, I am simply going to provide a pre packaged image instead. What I will describe therefore is how to use that image and how best to use it in its pre packaged form.

This blog post is therefore the first introduction to this endeavour. I will show you how to use the Docker image with a couple of different WSGI applications and then in subsequent blog posts I will start peeling apart the layers and explain the different parts that go into it and what capabilities it has. Provided I don't get too carried away with doing more coding, which is obviously the fun bit, I will back things up by finally starting to upgrade the mod_wsgi documentation to cover it and all the other new features that are available in mod_wsgi these days.

Running a Hello World WSGI application

Let's start out therefore with the canonical WSGI hello world application.

def application(environ, start_response):
    status = '200 OK'
    output = 'Hello World!'
    response_headers = [('Content-type', 'text/plain'),
                        ('Content-Length', str(len(output)))]
    start_response(status, response_headers)
    return [output]

Create a new empty directory and place this in a file called 'wsgi.py'.
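Note that this example, like the Dockerfile below, targets the 'python-2.7' image. If you were instead using one of the Python 3 variants of the image, the response body returned by the WSGI application would need to be bytes rather than a native string, that is:

output = b'Hello World!'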

This Hello World program has no associated static files, nor does it require any additional Python modules to be installed. Even though no separate modules are required at this point, we will still create a 'requirements.txt' file in the same directory. This 'requirements.txt' file will be left empty for this example.

The next step is to create a 'Dockerfile' to build up our Docker image. As we are going to use the pre packaged Docker image I am providing and it embeds various magic, all that the 'Dockerfile' needs to contain is:

FROM grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild
CMD [ "wsgi.py" ]

For the image being derived from, an 'ENTRYPOINT' is already defined which will run up Apache/mod_wsgi. The 'CMD' instruction therefore only needs to provide any options, which at this point consists only of the path to the WSGI script file, which we had called 'wsgi.py'.

We can now build the Docker image for the Hello World example:

docker build -t my-python-app .

and then run it:

docker run -it --rm -p 8000:80 --name my-running-app my-python-app

The Hello World WSGI application will now be accessible by pointing your browser at port 8000 on the Docker host.
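Alternatively, assuming the Docker host is the local machine, you could test it from the command line with:

curl http://localhost:8000/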

Running a Django based web site

We don't run Hello World applications as our web sites, so we also need to be able to run whole Python web sites constructed using web frameworks such as Django. It is with more complicated web applications that we start to also have static files that need to be hosted at the same time, so we need to deal with that somehow. The Python module search path can also require special setup so that the Python interpreter can actually find the Python code for the web application.

So imagine that you have a Django web application constructed using the standard layout. From the top directory of this we would therefore have something like:

example/
example/example/
example/example/__init__.py
example/example/settings.py
example/example/urls.py
example/example/views.py
example/example/wsgi.py
example/htdocs/
example/htdocs/admin
example/htdocs/admin/...
example/manage.py
requirements.txt

The 'requirements.txt' which was used to create any local virtual environment used during development would already exist, and at the minimum would contain:

Django

Within the directory would then be the actual project directory which was created using the Django admin 'startproject' command.

As this example requires static files, we setup the Django settings file to define the location of a directory to keep the static files:

STATIC_ROOT = os.path.join(BASE_DIR, 'htdocs')
STATIC_URL = '/static/'

and then run the Django admin 'collectstatic' command. The 'collectstatic' command copies all the static file assets from any Django applications into the common 'htdocs' directory. This directory will then need to be mounted at the '/static' URL when we run Apache/mod_wsgi.
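From the top level directory where the 'requirements.txt' file is located, that would be:

python example/manage.py collectstatic --noinput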

What we are going to do now is create a 'Dockerfile' in the same directory as the 'requirements.txt' file. This will be the root of our application when copied across to the Docker image.

Now normally, when Apache/mod_wsgi is run with the pre packaged image, the root directory of the application would be the current working directory for the application and would also be added to the Python module search path. For a Django site, what we really want is for the top level 'example' directory to be the current working directory and for it to be searched for Python modules. This is necessary so that the correct directory is searched for the Django settings module, which for this example has the module path 'example.settings'.

With the way Django lays out the project and creates the 'wsgi.py' file such that it is importable as 'example.wsgi', it can be preferable to use it as a module rather than as a WSGI script file. I'll get into the distinction another time, but importing it as a module does allow me to show off that it is possible to use a WSGI script file, a module or even a Paste style ini configuration file as the application entry point.

With all that said, we now actually create the 'Dockerfile' and in it we place:

FROM grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild
CMD [ "--working-directory", "example", \
"--url-alias", "/static", "example/htdocs", \
"--application-type", "module", "example.wsgi" ]

The options to the 'CMD' instruction in this case serve the following purposes.

The '--working-directory' option says that the 'example' directory should actually be set to be the current working directory for the WSGI application when run. That directory will also be added automatically to the Python module search path so that the 'example' package which contains all the code can be found.

The '--url-alias' option says that the static files in the 'example/htdocs' directory should be mounted at the '/static' URL, as was specified by the 'STATIC_URL' setting in the Django settings module.

The '--application-type' option says that rather than the WSGI application entry point being specified as a WSGI script file, it is defined by the listed module path. The default for this would have been 'script', with another possible value being 'paste' for a Paste style ini configuration file.

Finally, the 'example.wsgi' option is the Python module path for the 'wsgi.py' sub module in the project 'example' package.

As before we build the Docker image and then run it.
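For example, picking an arbitrary image name this time:

docker build -t my-django-site .
docker run -it --rm -p 8000:80 --name my-running-site my-django-site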

In a real Django site we would normally also have a database and possibly a key/value cache of some sort. Setting these up is beyond the scope of this post but would follow normal Docker practices. That or you might use a tool such as Fig to manage the linking and spin up of all the containers.

Setting up the Apache configuration file

In short there isn't one and you do not have to concern yourself with it.

This is a key point with being able to supply a pre packaged image for hosting using Apache/mod_wsgi. Since users often don't want to learn how to set up Apache properly, and it causes so much grief, I can completely remove the need for a developer to worry about it.

Instead I can provide a simplified set of command line options which implement the basic features that most sites would want to use when setting up Apache/mod_wsgi. The scripts underpinning the pre packaged Docker image can then dynamically generate the Apache configuration on the fly, based on the specific options provided.

This is where I can apply my knowledge of how to set up Apache/mod_wsgi and ensure things are done correctly, securely, and in a way that gives a good level of performance out of the box.

This doesn't mean you will never need to tune the settings to get Apache/mod_wsgi to run well for your specific site, but the number of knobs you have to worry about is greatly reduced, as everything else is handled automatically.

But how does this all actually work?

So this is an initial introduction to just one of a number of new things I have been working on related to mod_wsgi. As I start to peel back some of the layers to explain how all this works I will start to introduce some of the other things I have been cooking up over the last year, including alluding to other things that hopefully I will get to down the track.

If you can't wait and want to play around with these docker images they can be found, along with some basic setup information, on the Docker hub.

If you need help in working out how to use them or have any other questions, then rather than try and post questions against this blog, go and ask your questions on the mod_wsgi mailing list. Please do not use StackOverflow or related sites as I don't answer questions there any more and no one there will know anything anyway since this is all so new.

Friday, November 28, 2014

Managing the developer experience using docker.

I have always found hosting Python web applications using a PaaS an exercise in frustration. The problem is that they aren't generally about providing options and flexibility. Instead they are about pigeonholing you into working in one particular way and often with quite specific technologies they control. If you want to use something different you are usually out of luck, that or they make it so painful that it isn't worth the trouble.

Sometimes you even have to question whether the people who built the services understood what options were available, or whether they let their own personal biases dictate what technologies you can use.

For that reason I believe docker has a lot going for it. Specifically, as a developer you can now more readily choose the technologies you wish to work with. Which hosting service you choose is now more likely to come down to how well their service orchestration works, the general quality of their service and its price, rather than whether you can use a specific technology to run the actual Python web application.

From the perspective of being a third party provider of a specific way to run Python web applications, I also see lots of benefits in the docker way of doing things.

This is because I can provide an appliance like image for running a Python web application which incorporates best practices as far as the tools used and how it is configured. I can do this without being subject to the whims of the PaaS provider and any artificial gating mechanism they may impose which restricts me from providing a solution which a developer can choose to use.

I am also no longer reliant on a developer actually going to the effort of having to work out for themselves the best way to set up and configure a solution. This is because with how docker images work, I can provide an out of the box experience which is going to be a lot better than what a developer may have been able to achieve with their own limited knowledge of the best way to set things up.

In some respects what it means is that I am able to better manage or control the developer experience in running a specific solution.

Because I can provide a superior prepackaged solution, you avoid the situation which occurs now, where the issues developers encounter in setting things up are of their own making and not that of the tool being used, yet they quite readily go off and blame the tool rather than their own lack of expertise in setting it up properly.

The root problem here is that developers these days are after quick solutions because they simply don't have the time to work out how to use something properly. They also have a tendency to bad mouth a solution when, in their rush to get something working, they hit some issue. Developers are also like sheep and will readily believe what others say, even when it is totally wrong, and will then tell everyone else the same thing, woefully ignorant of the reality.

This mindless herd-like mentality is why we go through these cycles of some tool being seen as the new 'cool' thing. There is generally nothing wrong with the old tools if used properly, but the misinformation that gets out there takes on a life of its own and so in the end can doom a particular solution, even though it may still be just as capable as the newcomer.

Docker and how it works therefore opens up this dream scenario for a provider of a solution whereby they can choose what technologies they want to use, bundling it up in a way that they can also control how it is used, but in doing so provide a superior experience for the developer who uses it.

The analogy here is any product from Apple. Sure, you will get that percentage who rail against it because they feel it is locked down and they don't like that they lose some measure of control, but the majority who don't care simply judge it on its merits as far as getting the job done with a minimal amount of effort. So the provider is able to provide a better product and the end user developer is also happier.

Wednesday, September 3, 2014

Hosting PHP web applications in conjunction with mod_wsgi.

Yes, yes, I know. You can stop shaking your head now. It is a sad fact of life though that the need to mix both PHP web application code and Python web application code on the same Apache instance is something that some people need to do. One instance is where those PHP developers have seen the light and want to migrate their existing legacy PHP web application to Python, but are not able to do it all in one go, instead needing to do it piecemeal, with the Python web application code progressively taking over from the PHP web application.

Ask around on the Internet and once you get past the 'why on earth do you want to do that' type of reactions, you will often be told it is either not possible or too hard, and that you should just ditch the PHP web application entirely and use Python alone. This isn't particularly helpful and is also very misleading, as it is actually quite simple to allow both a PHP web application and a Python web application to run concurrently on the same Apache instance.

In going down this path though, there is one very important detail that you must first appreciate. That is that the typical Apache MPM configuration for a PHP web application is generally not favourable to a Python web application. Because of this, never run your Python web application in embedded mode if you are also running a PHP web application on the same Apache server. If you do, the performance of your overall Apache instance will be affected, having an impact on both the PHP and Python web applications.

What you want to do to mitigate such problems is run your Python web application in daemon mode of mod_wsgi. This means that the Python web application will run in its own process and the Apache child worker process will merely act as a proxy for requests being sent to the Python web application. This ensures that the Python web application processes are not subject to the dynamic process management features of Apache for child worker processes, which is where a lot of the problems arise when running with embedded mode.

Because it is so important that embedded mode not be used, the safest course, to ensure you get this right and don't accidentally still run your Python web application in embedded mode, is to disable embedded mode entirely.

Where running a single Python web application, the mod_wsgi part of the Apache configuration should therefore include something like:

# Define a mod_wsgi daemon process group.
WSGIDaemonProcess my-python-web-application display-name=%{GROUP}
# Force the Python web application to run in the mod_wsgi daemon process group.
WSGIProcessGroup my-python-web-application
WSGIApplicationGroup %{GLOBAL}
# Disable embedded mode of mod_wsgi.
WSGIRestrictEmbedded On 

Obviously if running more than one Python web application then you may need to use a more complicated configuration. Either way, ensure you aren't using embedded mode and that any Python web applications are running in daemon mode instead. All the following discussion will assume that you have got this in place.

Having dealt with that, we can now move on to setting up the Apache configuration to serve both the PHP web application and the Python web application.

For this we now need to delve into the typical ways that each is hosted by Apache.

In the case of PHP, the typical approach involves having Apache handle the primary URL routing by matching a URL to actual files in the file system. So if the default Apache web server document directory contains the files:

favicon.ico
index.php
page-1.php
page-2.php
robots.txt 

then if a request arrives for the URL '/robots.txt', Apache will return the contents of that file. If however a URL of '/page-1.php' arrives, Apache will actually load the code in the file called 'page-1.php' and execute it as PHP code. That PHP code will then be responsible for generating the actual response content.

The 'index.php' file is generally a special file and although one could make a request against it using the URL '/index.php', what is more generally done is to tell Apache that if a request comes in for '/', which notionally maps to the directory itself, that it instead be routed to 'index.php'. 

The way things typically work for PHP then is that any PHP code files are simply dropped in the existing directory which Apache is serving up static files from. Apache does the URL routing, mapping a URL to an actual physical file on the file system. When it finds a file corresponding to a URL, it will return the contents of that file, or if the file type represents a special case, the handler for that file type will be invoked instead. For the case of PHP code files, this will result in the code being executed to generate the response.

This is all achieved by using an Apache configuration of:

DocumentRoot /var/www/html
<Directory /var/www/html>
DirectoryIndex index.php
AddHandler application/x-httpd-php .php
</Directory>

In this you can start to see why people say PHP is so easy to use as all you need to do is drop the PHP code files in the right directory and they work. In this simple configuration, there is no need for users to worry about URL routing as that is done for them by the web server.

Now you can actually do a similar thing with mod_wsgi for Python script files by extending this to:

DocumentRoot /var/www/html
<Directory /var/www/html>
Options ExecCGI
DirectoryIndex index.py index.php
AddHandler application/x-httpd-php .php
AddHandler wsgi-script .py
</Directory>

That is, you can now simply drop Python code files with a '.py' extension into the directory and they will be executed as Python code when a URL maps to that file. So if instead of 'index.php' you had 'index.py', then when accessing the URL for the directory, Apache, seeing that 'index.py' exists, would use that to serve the request rather than 'index.php'. If the URL instead explicitly referenced a '.py' file by name, then that would be executed to handle the request.

Reality is that no one does things this way for Python web applications and there are a few reasons why.

The first reason is that Python web applications interact with an underlying server using the Web Server Gateway Interface (WSGI). This is a very low level interface and quite unfriendly to new users.

This is in contrast to PHP, where what is in the PHP file is nowhere near as low level, instead coming from the direction of being HTML with PHP code snippets interspersed. Those PHP code snippets can then access details of the request and any request content through a high level interface.

For WSGI however, there is no high level interface and you are effectively left having to work at the lowest level and process the request and any request content yourself.
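To make that concrete, a complete WSGI 'hello world' which handles a request at this lowest level looks like:

def application(environ, start_response):
    status = '200 OK'
    output = b'Hello World!'

    # Response headers must be constructed by hand, including the
    # content length.
    response_headers = [('Content-Type', 'text/plain'),
                        ('Content-Length', str(len(output)))]

    start_response(status, response_headers)

    return [output]

Anything beyond this, such as decoding form data out of the 'environ' dictionary, you would have to implement yourself.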

WSGI therefore steers you towards needing to use a separate Python web framework or toolkit to do all that hard work and provide a simpler high level interface onto the request and for generating a response.

At this level, where Apache is allowed to handle all the URL routing, the two Python packages which would be most useful are Werkzeug and Paste. These packages focus mainly on encapsulating the request and response to make your life easier as far as processing the request and generating a response goes. What they don't do is dictate a URL routing mechanism, which is why they are a good match when using Apache in the way above.

There is therefore no reason why you can't use this PHP-like approach of simply dropping Python code files in a directory, but you are going to have to do more work.
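As a rough sketch of what that might look like using Werkzeug, a single Python code file dropped into the directory could contain something like the following, where the 'name' query parameter is purely illustrative:

from werkzeug.wrappers import Request, Response

@Request.application
def application(request):
    # Werkzeug parses the query string for us and wraps the raw
    # WSGI interface with request/response objects.
    name = request.args.get('name', 'World')
    return Response('Hello %s!' % name)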

A bigger problem and the second reason why people don't write Python web applications in this way is that of code reloading.

When writing a web application in PHP, every time you modify a PHP code file it will be automatically reloaded and the new code read and used. This is because ultimately, nothing is persistent for a PHP web application and everything is read in again for every request.

Well, that isn't quite true, but as far as you can tell as a user though that is the case.

The reason it isn't strictly true is that all the PHP extensions you may want to use in your web application, and a lot more you don't, are all preloaded into the process where the PHP code is to be executed. The code for these stays persistent across requests. What does get thrown away though is all the code for your application and the corresponding data.

This is in contrast to Python where all code for separate Python code modules is loaded dynamically on demand the first time it is required. Further, the Python code objects are intermingled with other data for your application. There is also no ready distinction between your application code and unchanging code from a separate third party package or a module from the Python standard library.

It is therefore not possible to throw away just your application code and data at the end of each request. Instead, what occurs for Python web applications is that all this application code and data stays persistent in the memory of the process between requests.

As far as code reloading goes this makes things much more difficult, because even for a trivial code change you need to kill off the persistent process and start over. The greater cost associated with starting a Python web application, due to the fact that nothing is preloaded, means that such a restart is expensive. If this were done on every request, performance would drop dramatically.

Python therefore doesn't lend itself very well to what PHP users are used to, of simply being able to drop code files in a directory and have all code changes picked up automatically.

The preferred approach in Python is therefore to use a much higher level framework providing simpler and more structured interfaces. These web frameworks provide the high level request and response object which make handling a request easier, but they also take over URL routing as well. This means that instead of relying on Apache to perform URL routing right down to the level of a resource or handler, it only needs to route down to the top level entry point for the whole WSGI application. After that point, the frameworks themselves will handle URL routing.

One can still use the above method as the gateway into a WSGI application using a high level Python web framework, but it doesn't quite work properly when you want to take over the root of the web site.

To get things to work properly, for a Python web application we can use a different type of configuration.

Alias / /var/www/wsgi/main.py
<Directory /var/www/wsgi>
Options ExecCGI
AddHandler wsgi-script .py
Order allow,deny
Allow from all
</Directory>

Specifically, the 'Alias' directive allows us to say that all requests falling under the URL starting with '/', in this case the whole site, will be routed to the resource specified. As that resource maps to a Python code file, it will be executed as Python code, thus providing the gateway into our WSGI application, which is then able to perform the actual URL routing required to map a request to a specific handler function.

Because for Python web applications this will be a common idiom, mod_wsgi provides a simpler way of doing the same thing:

WSGIScriptAlias / /var/www/wsgi/main.py
<Directory /var/www/wsgi>
Order allow,deny
Allow from all
</Directory>

Using the 'WSGIScriptAlias' directive from mod_wsgi in this case means that we do not need to worry about setting the 'ExecCGI' option, or about adding a handler mapping saying that files with a '.py' extension should be executed as WSGI scripts.

Even when using 'WSGIScriptAlias' you do still need to work in conjunction with the Apache access controls; it doesn't provide a back door for avoiding them, which ensures you are always following best security practices.

We now have the more typical Apache configuration for a Python web application, but how then do we use this in conjunction with an existing PHP application that may be hosted on the same site?

The primary problem if it isn't obvious is that using 'WSGIScriptAlias' for '/' means that all requests to the site are being hijacked and sent into the Python web application. In other words, it would shadow any existing PHP web application that may be hosted out of the document directory for the web server.

The simplest thing which can be done at this point is to host the Python web application at a sub URL instead of the root of the site.

WSGIScriptAlias /suburl /var/www/wsgi/main.py

The result will be that all requests prefixed with that sub URL will then go to that Python web application. Anything else will be mapped against the document directory of the server and thus potentially to the PHP web application.

Using a sub URL however isn't always practical. It may be fine where the Python web application is actually a sub site, but if you are intending to replace the existing PHP web application, it is likely preferable that the Python web application give the appearance of being hosted at the root of the site at the same time as the PHP web application is also being hosted at the root of the site.

Is this even possible? If possible, how do we do it?

The answer is that it is possible, but we have to rely on a little magic. This magic comes in the form of the mod_rewrite module for Apache.

Our starting point in this case will be the prior example we had whereby we could drop both PHP and Python code files in the document directory for the server. To that we are going to add our mod_rewrite rules.

DocumentRoot /var/www/html
<Directory /var/www/html>
Options ExecCGI
DirectoryIndex index.php
AddHandler application/x-httpd-php .php
AddHandler wsgi-script .py
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ /main.py/$1 [QSA,PT,L]
</Directory>

What this magic rewrite rule will do is look at each request as it comes in and determine if Apache was able to map the URL to a file within the document directory. If Apache was able to successfully map the URL to a file, then the request will be processed normally.

If however the URL could not be mapped to an actual physical file in the document directory, the request will be rewritten such that the request will be redirected to the resource 'main.py'.

Because 'main.py' is mapped to mod_wsgi as a Python code file though, the result will be that the Python web application in that file will instead be used to handle the request.

All that remains now is to create 'main.py', which will normally be the existing WSGI script file you use as the entry point to your WSGI application.

In copying in the 'main.py' file, ensure that it is all you copy in from your existing Python web application. Do not go placing all the source code for your existing Python web application under the server document directory. This is because with this Apache configuration, a URL could then be mapped to those source code files even though it isn't intended that they be accessible.

So keep your actual Python web application code separate. It is even better in some respects that 'main.py' not be your original WSGI script file. Preferably all it should do is import the WSGI application entry point from the original in the separate source code directory for your Python web application. This limits the danger from having source code in the server document directory, because even if you later stuff up the configuration and accidentally make it so someone can download the actual contents of 'main.py', they haven't got hold of any sensitive data.
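As a minimal sketch, with the directory and module path here being hypothetical stand ins for your own application, such a 'main.py' need contain nothing more than:

# Thin wrapper around the real WSGI application, which lives outside
# of the server document directory. The path and module name below
# are hypothetical; substitute those of your own application.
import sys
sys.path.insert(0, '/var/www/project')

from mysite.wsgi import application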

Making 'main.py' be a simple wrapper implementing a level of indirection is actually better for another reason.

This is because when we use the mod_rewrite rules above to trigger the internal redirect within Apache, the adjustments it makes to the URL can stuff up what URLs are then subsequently exposed to a user of your site.

This comes about because normally where your Python web application would see a URL as:

/some/url

it will instead see it as:

/main.py/some/url

Or more specifically, the 'SCRIPT_NAME' variable will be passed into the WSGI environ dictionary as:

/main.py

rather than an empty string.

The consequence of this is that when your Python web application creates a full URL for the purposes of redirection, that URL will then also have '/main.py' as part of it.

Exposing this internal detail of how we are hosting the Python web application part of the site isn't what we want to do, so we want to strip that out. That way any full URLs which are constructed will make it appear that the Python web application is still hosted at the root of the site and a user will be none the wiser.

def _application(environ, start_response):
    # The original application entry point.
    ...

import posixpath

def application(environ, start_response):
    # Wrapper to set SCRIPT_NAME to actual mount point.
    environ['SCRIPT_NAME'] = posixpath.dirname(environ['SCRIPT_NAME'])
    if environ['SCRIPT_NAME'] == '/':
        environ['SCRIPT_NAME'] = ''
    return _application(environ, start_response)

Because we are hosting at the root of the site, we could have just set 'SCRIPT_NAME' to an empty string and been done with it. I use a more durable solution here though, in case the rewrite rules were being used for a sub directory of the server document directory.

And we are done, the result being one site which has both a PHP web application and a Python web application, each believing it is hosted at the root of the site. When a request comes in, Apache will map the URL to file based resources in the server document directory. If that file is a static file, its contents will be served immediately. If instead the URL maps to a PHP code file, then PHP will handle the request. Finally, if the request doesn't map to any file based resource, the request will be passed through to the Python web application, which will perform its own routing based on the URL to work out how the request should be handled.

This mechanism enables you to add a Python web application to the site and then progressively transfer the functionality of the existing PHP web application across to the Python web application. If URLs aren't changing as part of the transition, then it is a simple matter of removing the PHP code file for a specific URL and that URL will then be handled by the Python web application instead.

Otherwise, you would implement the new URL handlers in the Python web application and then change the existing PHP web application to send requests off to the new URLs.

With the above configuration, the key URL for the root of the site will be handled by the 'index.php' file. When you are finally ready to cut it over, you just need to remove the 'index.php' file, plus the second 'RewriteCond' for '%{REQUEST_FILENAME} !-d', and requests for the root of the site will then also be sent through to the Python web application.
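The directory configuration at that final point of the cut over would therefore end up looking something like:

DocumentRoot /var/www/html
<Directory /var/www/html>
Options ExecCGI
AddHandler application/x-httpd-php .php
AddHandler wsgi-script .py
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.*)$ /main.py/$1 [QSA,PT,L]
</Directory>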

So summarising, there are two things that need to be done.

The first step is changing the Apache configuration to use mod_rewrite rules to fall back to sending requests through to the Python web application.

# Define a mod_wsgi daemon process group.
WSGIDaemonProcess my-python-web-application display-name=%{GROUP}
# Force the Python web application to run in the mod_wsgi daemon process group.
WSGIProcessGroup my-python-web-application
WSGIApplicationGroup %{GLOBAL}
# Disable embedded mode of mod_wsgi.
WSGIRestrictEmbedded On
# Set document root and rules for access.
DocumentRoot /var/www/html
<Directory /var/www/html>
Options ExecCGI
DirectoryIndex index.php
AddHandler application/x-httpd-php .php
AddHandler wsgi-script .py
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ /main.py/$1 [QSA,PT,L]
</Directory>

The second step is setting up the 'main.py' file as the entry point to the Python web application, and implementing the fix up for 'SCRIPT_NAME'.

def _application(environ, start_response):
    # The original application entry point.
    ...

import posixpath

def application(environ, start_response):
    # Wrapper to set SCRIPT_NAME to actual mount point.
    environ['SCRIPT_NAME'] = posixpath.dirname(environ['SCRIPT_NAME'])
    if environ['SCRIPT_NAME'] == '/':
        environ['SCRIPT_NAME'] = ''
    return _application(environ, start_response)

Overall the concept is simple; it is just the detail of the implementation which may not be obvious, and which is why some may think it is not possible.

What was the DjangoCon US 2014 angle in all this?

The issue of how to do this came up because Collin Anderson will be presenting a talk at DjangoCon called 'Integrating Django and Wordpress can be simple'. His talk is on a much broader topic, but I thought I would add a bit to explain in more detail how one can do PHP and Python site merging with Apache.

So if you are at DjangoCon and still have to deal with PHP applications, maybe drop in and watch Collin's talk.

Tuesday, September 2, 2014

Python module search path and mod_wsgi.

When you run the Python interpreter on the command line as an interactive console, if a module to be imported resides in the same directory as where you ran the 'python' executable, then it will be found without problems.

How can this be though, as we haven't done anything to add the current working directory into 'sys.path', nor has the Python interpreter itself done so?

>>> import os, sys
>>> os.getcwd() in sys.path
False

What does get put in 'sys.path' then and does that give us a clue to why it is being found?

>>> sys.path
['', '/Library/Frameworks/Python.framework/Versions/3.4/lib/python34.zip',
'/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4',
'/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/plat-darwin',
'/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/lib-dynload',
'/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages']

As you can see we have a whole bunch of directories related to our actual Python installation.

We can also see that 'sys.path' includes an empty string. What is that about?

On the assumption that it is there for a reason, and in order to work out what it might be for, let's try deleting that entry and see if it affects our attempt to import a module.

$ touch example.py
$ python3.4
Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import example
$ python3.4
Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> del sys.path[0]
>>> import example
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named 'example'

As you can see it is significant. When we delete the empty string from 'sys.path' we can no longer import any modules from the current working directory.

As it turns out, what this magic value of an empty string does is tell Python that when performing a module import, it should look in the current working directory. That is, the directory that would be returned by:

>>> import os
>>> os.getcwd()
'/private/tmp'

Initially this would be the directory in which you ran the 'python' executable, but if you happened to use 'os.chdir()' to change the working directory, then where Python looks for modules when they are imported will also change, now using the directory you changed to.

What about when executing Python against a script file instead of running an interactive console?

$ python3.4 -i example.py
>>> import sys
>>> sys.path
['/private/tmp', '/Library/Frameworks/Python.framework/Versions/3.4/lib/python34.zip',
'/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4',
'/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/plat-darwin',
'/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/lib-dynload',
'/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages']

This time there is no empty string. Instead of the empty string, Python has calculated the name of the directory which the script is located in and added that to 'sys.path'.

Any module in the same directory will still be found when imported, but importantly, if the current working directory of the application changes, that same directory will still be searched and the directory to be searched will not change.

So what does this all have to do with mod_wsgi?

Well, under mod_wsgi, because Python is embedded and the C APIs are used directly to initialise the Python interpreter, neither the empty string nor the directory containing any script file will be added to 'sys.path'.

This means that only the module directories which form part of the standard Python module search path will ever be searched. Any directory where the WSGI application resides will not be automatically searched. In fact, at the time the Python interpreter is initialised, it will not generally even be known which WSGI applications will be run within a specific Python interpreter, as resolving which WSGI script file to load is only done lazily as actual web requests arrive.

The consequence of this is that if your WSGI application is not all contained in a single WSGI script file, then you will need to explicitly set up additional directories that Python should search for modules.

For a Django application, that means adding the project base directory, and this is what was touched on in the discussion I had at DjangoCon US 2014, with me again saying 'but there is a better way'.

If using embedded mode, what you would need to do to have the base directory for the Django project searched is:

WSGIPythonPath /path/to/mysite.com

If using daemon mode, you would instead use:

WSGIDaemonProcess example python-path=/path/to/mysite.com
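An alternative, which works for both modes, is to add the directory to 'sys.path' at the top of the WSGI script file itself. A sketch, using the same hypothetical path as above:

# At the top of the WSGI script file. The path is the hypothetical
# project directory used in the examples above.
import sys

if '/path/to/mysite.com' not in sys.path:
    sys.path.insert(0, '/path/to/mysite.com')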

If you leave it at that, then although your project modules will be found, the current working directory of your application is a bit of an unknown. What it will actually be set to is at the mercy of Apache, with it usually being set to the '/' directory.

This is okay because in a Python web application you should always be referring to any files you need by an absolute path name and not a relative path name. If you didn't and had been using the Django development server, you might then find that lots of things break when you go to deploy on a real WSGI server such as Apache/mod_wsgi.

This is because all the attempts to access files by a relative path name will fail, as the current working directory isn't where you expected it to be.
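A defensive pattern which avoids this, shown here as a sketch with a hypothetical data file, is to anchor any file access to the directory containing the module itself:

import os

# Calculate an absolute path relative to this module's own location,
# rather than relying on the process working directory. The data file
# name is hypothetical.
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
DATA_FILE = os.path.join(BASE_DIR, 'data', 'example.json')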

Although it is still preferable to always use absolute path names, if for some reason that cannot be done, then with mod_wsgi daemon mode at least, you can also tell mod_wsgi to use a specific directory as the current working directory. This can be done using the 'home' option to the 'WSGIDaemonProcess' directive.

WSGIDaemonProcess example home=/path/to/mysite.com python-path=/path/to/mysite.com

As has occurred before, this shows embedded mode to be a bit of a second class citizen, as there is no equivalent configuration for when using embedded mode.

Like with the lang and locale settings, the issue is that when using embedded mode of mod_wsgi, your Python WSGI application is potentially running in the same process as other Apache module code. You can't therefore simply hijack the current working directory for yourself. Worst case, if you did, something else may have made its own assumptions about what the current working directory should be and you would break it in the process.

The semi isolation offered by mod_wsgi daemon mode therefore allows you to safely change the current working directory, at least if you are running only the one WSGI application in each mod_wsgi daemon process group. If you must have the current working directory be a certain directory, you can therefore use daemon mode and the 'home' option.

The way the 'home' option worked in the past is that it only set the current working directory. This changed in mod_wsgi 4.1.0 though, such that modules will automatically be searched for in that directory as well.

This means that from mod_wsgi 4.1.0 onwards you can actually simplify the options for daemon mode to:

WSGIDaemonProcess example home=/path/to/mysite.com

Combining this with a Python virtual environment which you want to use just for that daemon process group, you would use:

WSGIDaemonProcess example python-home=/path/to/venv home=/path/to/mysite.com

We therefore have a simpler way to setup the current working directory of the WSGI application so that relative paths do still work if you have managed not to ensure they are all absolute. The need to add that working directory separately to the Python module search is gone as it will be done automatically. And finally we don't have to dig down to the 'site-packages' directory and can just specify the root of the Python virtual environment.

All well and good, but I did realise when writing this post that I probably made a bad decision at the time of changing how 'home' worked in mod_wsgi 4.1.0.

What happened was that I completely forgot that rather than using an empty string when running the 'python' executable against a script file, Python adds the directory the script is contained in.

I am not sure what I was thinking of at the time but what I did was to add an empty string into 'sys.path' when the 'home' option was used.

This still produces the desired result, as the current working directory is the directory we want to load Python modules from, but problems would arise if for some reason your code decided to change the current working directory during the life of the process. I even warn about this in the release notes for the change, so as I said, I truly don't know what I was thinking of at the time to allow that one through.

Now I know you are all sensible programmers and would not go and change the current working directory from your WSGI application code, especially in a multi threaded configuration where it would likely be quite unsafe, so for now it is probably okay. I will though change the behaviour in mod_wsgi 4.3.0 to use the more logical mechanism shown right back at the start: adding the actual directory path given by the 'home' option to 'sys.path', so that changing the working directory will not affect where modules are imported from.