Showing posts with label mod_wsgi. Show all posts

Friday, March 21, 2008

Version 2.0 of mod_wsgi is now available.

Due to the arrival of a baby 1.0, version 2.0 of mod_wsgi has been a bit slower in coming than originally planned, but the wait is now over and it is available for download. The major improvements in version 2.0 of mod_wsgi are detailed below, but there are various other little goodies as well, so check out the change notes on the wiki.

Process Reloading Mechanism

When using daemon mode of mod_wsgi and the WSGI script file for your application is changed, the default behaviour is now to restart the daemon processes automatically upon receipt of the next request against that application. Thus, when making changes to any code or configuration data for your application, all you now need to do is touch the WSGI script file and the daemon processes for just that application will be automatically restarted and the application reloaded. It is therefore no longer necessary to send signals explicitly to the daemon processes, or to restart the whole of Apache. As a result, users do not require elevated privileges, and applications owned by other users in a shared hosting environment will not be affected when one user's application is restarted.
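
As a minimal sketch of the "touch" step, assuming the change is being triggered from a deployment script written in Python, the modification time of the WSGI script file can be updated with 'os.utime()'. The temporary file below is only a stand-in for a real WSGI script file so that the sketch can be run anywhere.

```python
import os
import tempfile


def touch(path):
    """Update the modification time of an existing file, like `touch`."""
    os.utime(path, None)  # None means "use the current time"


# Hypothetical stand-in for the real WSGI script file.
with tempfile.NamedTemporaryFile(suffix=".wsgi", delete=False) as handle:
    script_path = handle.name

os.utime(script_path, (0, 0))  # pretend the file was last changed long ago
before = os.path.getmtime(script_path)
touch(script_path)
after = os.path.getmtime(script_path)
os.remove(script_path)
```

In a real deployment only the 'touch()' call against the actual script path would be needed; the reload itself is then left to mod_wsgi on the next request.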

Apache Authentication Provider

When using Apache 2.2, mod_wsgi provides the means to implement Apache authentication providers in Python. This means that password authentication for HTTP Basic and Digest authentication, plus other custom authentication mechanisms implemented by other Apache modules, can be delegated to your Python application. This can be used, for example, to implement HTTP authentication for a Trac instance against a user database maintained within a Django instance running on the same site. If using Apache 2.0 the mechanism is also available, but only in support of standard Apache HTTP Basic authentication.
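
A minimal sketch of what such an authentication script might contain for HTTP Basic authentication follows. It assumes the 'check_password' entry point described in the mod_wsgi documentation, and the in-memory 'USERS' table is purely hypothetical, standing in for a lookup against a real user database.

```python
# Hypothetical user table; a real script would query the application's
# own user database instead of a hard-coded dictionary.
USERS = {"alice": "wonderland"}


def check_password(environ, user, password):
    """Return True or False for a known user, and None for an unknown user."""
    if user not in USERS:
        return None
    return USERS[user] == password
```

Returning None rather than False for an unknown user is intended to let Apache distinguish "wrong password" from "no such user".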

Python Virtual Environments

More integrated support for Python virtual environments such as 'virtualenv' is now provided. These changes make it possible for different daemon process groups to be easily associated with distinct Python virtual environments. Where daemon process groups are being set up for different users, or to separate different applications, the use of Python virtual environments means that each can use different versions of modules or packages and not interfere with each other.

WSGI File Wrapper Extension

Support for the 'wsgi.file_wrapper' extension has been added, with operating system mechanisms such as sendfile() and mmap() being used when possible to speed up sending of any data back to a client. Provided an application is written to use this optional extension, the serving of static files by the application should be greatly improved.
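
To illustrate what "written to use this optional extension" means, here is a small sketch of a WSGI application that uses 'wsgi.file_wrapper' when the server supplies it and falls back to plain iteration otherwise, as the WSGI specification suggests. The in-memory 'io.BytesIO' object is an assumption made so that the example is self-contained; any operating system level optimisation would only be possible with a real file opened from disk.

```python
import io


def application(environ, start_response):
    """Serve a file-like object, using 'wsgi.file_wrapper' when available."""
    # Stand-in for an open static file; a real application would use
    # open(path, "rb") on a file from disk.
    filelike = io.BytesIO(b"hello from a static file\n")
    block_size = 8192

    start_response("200 OK", [("Content-Type", "text/plain")])

    wrapper = environ.get("wsgi.file_wrapper")
    if wrapper is not None:
        # The server is free to substitute a faster way of sending the data.
        return wrapper(filelike, block_size)

    # Fallback for servers which do not provide the extension.
    return iter(lambda: filelike.read(block_size), b"")
```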

Daemon Mode Now Even Faster

Some underperforming code related to the socket used to communicate between the Apache child processes and the daemon processes has been replaced. This has resulted in a 40% improvement in base level performance for a simple hello world program. This means that daemon mode now performs even faster relative to competing solutions. Do remember though that the network level is rarely the bottleneck and it is the Python application and database queries where things slow down. Thus, although it is quicker, in the grander scheme of things the improvement wouldn't be noticed in most applications.

Sunday, November 18, 2007

Version 1.3 of mod_wsgi is now available.

Version 1.3 of mod_wsgi is a bug fix only release, addressing issues with mod_wsgi daemon processes hanging under certain conditions. It is highly recommended that users of mod_wsgi who use daemon mode of mod_wsgi upgrade to this new version. A third release candidate for version 2.0 is also now being made available incorporating the same fix and adding an additional feature to detect errant Python C extension modules that don't release the Python GIL when running long, potentially blocking, operations.

Sunday, November 11, 2007

Poor man's Python virtual environment.

There have now been a number of attempts at implementing virtual environments for Python. That is, providing a means of having multiple isolated environments for the one Python installation on a system, such that it would be possible to run different applications on the same system, but using different sets of installed Python packages. The prime standalone examples of these are virtual-python, workingenv and virtualenv.

It may well just be that I choose to use MacOS X with the older Python 2.3.5 that comes with the operating system, but even with the more recent virtualenv, it just doesn't seem to always want to work properly. Even when I have ensured that PATH includes first the 'bin' directory for the virtual environment, such that the environment specific versions of tools such as 'easy_install' are found first, for whatever reason, some packages will still want to write back into the operating system '/Library/Python' directory when I wouldn't expect them to. I also don't seem to be alone in having such problems as evidenced by comments to Ian Bicking's blog where he originally announced virtualenv as a replacement for workingenv.

Now, I will admit that I still haven't found time to properly dig into the internals of Python eggs and so may be missing something, but for creating Python virtual environments, what I don't understand is why simply setting the PYTHONHOME environment variable isn't sufficient to get it all working for the typical case. Yes it means that the environment variable has to always be set, as well as PATH including the 'bin' directory for the virtual environment, but it avoids the idiosyncrasies arising from the way that Python tries to work out where the installed Python library directory is.

To understand what the PYTHONHOME environment variable is all about, one has to consult the source code comments in the file 'Modules/getpath.c' of the Python source code, as any other online documentation seems to be rather lacking. Read the source code as well as the comments and you will find that Python goes through a number of steps to try and determine where the Python lib directory is located when it is run. These can be summarised as:
  1. Look relative to argv[0] to determine if being run out of Python source code build directory.
  2. Consult the PYTHONHOME environment variable for directory prefix corresponding to where Python was installed.
  3. Look relative to argv[0] to determine if being run out of Python installation directories. If argv[0] isn't an absolute path, search PATH for the executable which was used and look relative to that instead.
  4. Look relative to directory prefix where Python was supposed to have been installed.

The way virtualenv appears to work is that it tries to set things up so that step 3 still applies.

On MacOS X things get a bit tricky though, as it doesn't use the method exactly as described. Instead it seems that an absolute path to the Python framework encoded into the executable itself is somehow used. This means that to get virtualenv to work, it is necessary for it to copy the Python executable and then use the MacOS X 'install_name_tool' program to change where the Python executable is picking up the Python framework from, else it will continue to use the Python lib directory from the original Python installation.

Windows doesn't follow the rules either, with the location of the Python DLL somehow determining where the Python lib directory needs to be.

Either way, once Python has found what it believes is the location of the Python lib directory it will use that and will skip the subsequent steps.
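
One way to see which location Python actually settled on is to ask the interpreter itself. The following is only a diagnostic sketch: the helper name 'describe_python_home' is made up for illustration, and it reports the outcome of the search rather than which of the steps above produced it.

```python
import os
import sys
import sysconfig


def describe_python_home():
    """Collect the values Python settled on when locating its lib directory."""
    return {
        "PYTHONHOME": os.environ.get("PYTHONHOME"),  # None when not set
        "executable": sys.executable,
        "prefix": sys.prefix,
        "exec_prefix": sys.exec_prefix,
        "stdlib": sysconfig.get_paths()["stdlib"],
    }


for name, value in describe_python_home().items():
    print("%-12s %s" % (name, value))
```

Running this with and without PYTHONHOME set is a quick way to check whether the variable is having the intended effect.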

Now, although how step 3 works is different based on what platform you are running on, step 2 is the same. As such, setting the PYTHONHOME environment variable seems to be a simpler and more deterministic way of specifying where the Python lib directory is located, avoiding the need to perform fixups to the Python executable on MacOS X.

As to how to set up a Python virtual environment based on using the PYTHONHOME environment variable, for UNIX based systems it is just a matter of creating a parallel copy of the installed Python installation using symlinks. In some respects it is therefore quite similar to the original virtual-python, except that the Python executable itself is also just a symlink and not a copy.

On MacOS X with Python 2.3 the required steps would therefore be:

mkdir $HOME/pythonenv
cd $HOME/pythonenv

mkdir -p ENV1/bin
mkdir -p ENV1/include
mkdir -p ENV1/lib/python2.3

ln -s /usr/bin/python2.3 ENV1/bin/
ln -s python2.3 ENV1/bin/python

ln -s /usr/include/python2.3 ENV1/include/

for i in /usr/lib/python2.3/*; do ln -s "$i" ENV1/lib/python2.3/; done

rm ENV1/lib/python2.3/site-packages
mkdir ENV1/lib/python2.3/site-packages

To use the virtual environment the 'bin' directory would be added to the head of your PATH and the PYTHONHOME environment variable set.

PATH="$HOME/pythonenv/ENV1/bin:$PATH"
export PATH

PYTHONHOME="$HOME/pythonenv/ENV1"
export PYTHONHOME

Note that it is the specific intent here that the 'site-packages' directory from the original Python installation is ignored. It would therefore be necessary to reinstall all required packages, including 'setuptools', once the PATH and PYTHONHOME variables had been set up.
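
For anyone who would rather script the symlink layout than type the shell commands, the same steps could be expressed in Python roughly as follows. This is a sketch only: the function name 'make_symlink_env' and its parameters are invented for illustration, it assumes a UNIX style 'bin', 'include' and 'lib/pythonX.Y' layout under the given prefix, and rather than linking 'site-packages' and then removing the link it simply skips that entry and creates an empty directory.

```python
import os


def make_symlink_env(prefix, env_dir, version):
    """Mirror an installed Python under env_dir using symlinks.

    prefix  -- directory prefix of the existing installation, e.g. '/usr'
    env_dir -- directory to create for the new environment
    version -- version suffix such as '2.3'
    """
    name = "python" + version
    lib_src = os.path.join(prefix, "lib", name)
    lib_dst = os.path.join(env_dir, "lib", name)

    for sub in ("bin", "include", os.path.join("lib", name)):
        os.makedirs(os.path.join(env_dir, sub))

    # The interpreter and headers are links back to the original installation.
    os.symlink(os.path.join(prefix, "bin", name),
               os.path.join(env_dir, "bin", name))
    os.symlink(name, os.path.join(env_dir, "bin", "python"))
    os.symlink(os.path.join(prefix, "include", name),
               os.path.join(env_dir, "include", name))

    # Link every entry of the library directory except 'site-packages',
    # which becomes a new, empty directory private to this environment.
    for entry in os.listdir(lib_src):
        if entry != "site-packages":
            os.symlink(os.path.join(lib_src, entry),
                       os.path.join(lib_dst, entry))
    os.mkdir(os.path.join(lib_dst, "site-packages"))
    return env_dir
```

A call such as make_symlink_env('/usr', os.path.expanduser('~/pythonenv/ENV1'), '2.3') would correspond to the shell commands shown earlier; PATH and PYTHONHOME still have to be set separately.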

Obviously, setting PYTHONHOME has implications if you want to run scripts from one Python application which are themselves standalone Python scripts which refer to a different Python virtual environment. Other issues come up if trying to run scripts which use a completely different version of Python. As such, this poor man's version of Python virtual environments isn't going to work for everyone, but for what I am doing with web applications and mod_wsgi it works fine, not giving me the problems that virtualenv does on MacOS X.

How exactly Python virtual environments, of any variety, can be used with mod_wsgi and how mod_wsgi version 2.0 has been enhanced to make it all reasonably simple to manage I'll cover in a subsequent blog entry. If you can't wait, then you can also check out a non sanitised version of a description about it on the mod_wsgi user group.

Wednesday, October 31, 2007

Version 1.2 of mod_wsgi is now available.

Mark beat me to the punch again and got word out about mod_wsgi 1.2 before I myself got a chance to sit down and blog about it. I'll have to start paying him as my publicist soon.

Version 1.2 of mod_wsgi is a bug fix only release, addressing issues with WSGI specification compliance, sub process invocation from Python in a mod_wsgi daemon process and most importantly of all, an issue whereby a second sub interpreter instance could be created for each WSGI application group when targeted by a specifically formed URL.

This latter issue of a second sub interpreter being created only affects users of Apache 1.3 and 2.0. Because it can have the effect of doubling the memory in use by the application, it is highly recommended that users of these Apache versions upgrade to mod_wsgi 1.2, given that in a memory constrained environment the bug could be exploited as a form of remote denial of service attack.

At the same time as mod_wsgi 1.2 has been released, the first release candidate for mod_wsgi 2.0 has also been released. This version provides a number of new features, including integration with Apache authentication and authorisation mechanisms in Apache 2.2, a new process reloading option for mod_wsgi daemon processes which makes reloading a Python application when changes are made trivial, and direct support for Python virtual environments such as workingenv and virtualenv. I'll blog about these and other new features in mod_wsgi 2.0 in the coming weeks.

If you want to discuss any of the new mod_wsgi 2.0 features in the meantime, check out the change notes or pop on over to the mod_wsgi Google Group.

Sunday, October 14, 2007

Google hates mod_wsgi.

According to Google Analytics, 99% of all search engine traffic landing at the mod_wsgi site is via Google. As a consequence, the results are going to be pretty significant when Google, for no good reason that I can see, recently dropped the mod_wsgi site home page from its search results. What makes even less sense is that mod_wsgi is hosted on the Google Code hosting service.

Hopefully the famed Google page rank algorithm will wake from its slumber at some point and start listing it again, otherwise it is going to be hard for people to find the site. More worrying is that it looks like Google might be starting to drop individual pages on the site as well, as searches for quite specific terms which appear within the site pages no longer turn them up, although they did previously.

Is this Google's way of getting retribution on me for grumbling so much about how the search on Google groups has been stuffed up so often of late and is always quite far behind with its search results, or is it just symptomatic of the rot starting to set in at Google? :-(

Monday, October 1, 2007

Version 1.1 of mod_wsgi is now available.

Version 1.1 of mod_wsgi is now available and can be downloaded from http://www.modwsgi.org. This is a bug fix release only and no new features are included. The two main problems addressed are the possibility of processes crashing when multiple threads hit a race condition on sending output via sys.stdout/sys.stderr, and a conflict with the Apache mod_logio module which would result in mod_wsgi daemon processes crashing. A description of all changes in this version can be found in the change notes. Updating to this version is recommended for all users.

Wednesday, September 12, 2007

Parallel Python discussion and mod_wsgi.

The sad fact is that many high profile Python developers like to ignore what has been done in relation to the use of Python in conjunction with the Apache web server. I'm not sure whether this is because of a bias towards pure Python solutions, whether they just can't be bothered, or that they simply don't have the time to look properly at what is being done by others. Anyway, the latest comment which shows a lack of understanding of what already exists comes from Guido himself in his blog entry made in response to Bruce Eckel's blog on Python 3K or Python 2.9.

In his blog entry Guido says the following in relation to concurrency in Python:

"Another route I'd like to see explored is integrating one such solution into an existing web framework (or perhaps as WSGI middleware) so that web applications have an easy way out without redesigning their architecture."

Reality is that this ability to spread a Python based web application across multiple processes already exists with Apache when using mod_python or mod_wsgi. This is because Apache itself (on UNIX at least) is implemented as a multi process web server. As such, incoming requests are distributed across the numerous Apache child processes dealing with requests. When using Apache at least, there is therefore no problem with properly utilising multiple core processors. As I have also blogged before in 'Web hosting landscape and mod_wsgi', the fact that a lot of other stuff is also going on in Apache at the same time, which is not implemented using Python, adds to the fact that for solutions embedding Python into Apache the GIL is not the big issue people think it is.

In addition to solutions such as mod_python and mod_wsgi, which embed Python into the Apache child processes, there are also other solutions such as mod_fastcgi and daemon mode of mod_wsgi, which are able to create multiple distinct daemon processes to which requests are proxied. This again results in requests being distributed across multiple processors.

The WSGI specification even takes into consideration that such multi process web servers exist through the existence of the 'wsgi.multiprocess' flag in the WSGI environment passed to a WSGI application.
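
As a small illustration, a WSGI application can simply report what the server tells it about its process and thread model. The sketch below relies only on the 'wsgi.multiprocess' and 'wsgi.multithread' keys defined by the WSGI specification; what values appear depends entirely on how the hosting server is configured.

```python
def application(environ, start_response):
    """Report the process/thread model flags supplied by the WSGI server."""
    lines = [
        "wsgi.multiprocess: %r" % environ.get("wsgi.multiprocess"),
        "wsgi.multithread: %r" % environ.get("wsgi.multithread"),
    ]
    body = "\n".join(lines).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```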

Now, it may be the case that Guido more had in mind the ability within a WSGI application, using WSGI middleware, to forward some subset of URLs on to another process. But then, even this can already be achieved using existing WSGI middleware for proxying requests to another web server. To use such a feature though means making a conscious decision and changing the code of your application, although using something like Paste Deploy may at least limit that to being a configuration change.
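
To make the middleware idea concrete, here is a rough sketch of WSGI middleware that hands requests under one URL prefix to a different WSGI callable. The class name 'PrefixDispatcher' and the helper 'make_app' are invented for this example, and the second application is only a placeholder: in practice it would be proxy middleware that forwards the request to a web server running in another process.

```python
class PrefixDispatcher(object):
    """Route requests under a URL prefix to a different WSGI application."""

    def __init__(self, default_app, prefix, other_app):
        self.default_app = default_app
        self.prefix = prefix.rstrip("/")
        self.other_app = other_app

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        if path == self.prefix or path.startswith(self.prefix + "/"):
            return self.other_app(environ, start_response)
        return self.default_app(environ, start_response)


def make_app(text):
    """Build a trivial WSGI application; stands in for real applications."""
    def app(environ, start_response):
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [text]
    return app


# b"proxy" marks the placeholder for middleware forwarding to another server.
application = PrefixDispatcher(make_app(b"main"), "/admin", make_app(b"proxy"))
```

Note that this requires wrapping the application in code, which is exactly the kind of change to the application that the paragraph above refers to.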

In addition to proxy middleware, mod_wsgi also has an ability to divide up an existing monolithic application to run across multiple processes. In the case of mod_wsgi no changes at all need to be made to the structure of the WSGI application. Instead, the mapping of a particular subset of URLs to a distinct process is handled by mod_wsgi even before the specific WSGI application is invoked.

As an example, imagine that one was running Django and wanted all the '/admin' pages to be executed within the context of their own process. To achieve this, all that is required is for the following Apache configuration to be used:

WSGIDaemonProcess django processes=3 threads=10
WSGIDaemonProcess django-admin processes=1 threads=10

WSGIProcessGroup django

WSGIScriptAlias / /usr/local/django/mysite/apache/django.wsgi

<Location /admin>
WSGIProcessGroup django-admin
</Location>

This results in the bulk of the Django application being distributed across three multithreaded processes. Using a combination of the 'Location' and 'WSGIProcessGroup' directives, the process group to be used for the '/admin' URL is then overridden. The result is that any handlers related to '/admin', and URLs underneath that point, are instead executed by a different process.

So, the ability to distribute execution of a Python web application across multiple processes, and thereby reduce the impact of the Python GIL, already exists. Future changes to mod_wsgi should make this even more flexible, with the introduction of transient daemon processes and an ability to anchor a user session to a specific daemon process using cookies where required.

Wednesday, September 5, 2007

Version 1.0 of mod_wsgi is now available.

Version 1.0 of mod_wsgi is now available and can be downloaded from http://www.modwsgi.org. The package is regarded as being quite stable and therefore suitable for use in production environments.

For those not familiar with mod_wsgi, the aim in developing it is to implement a simple to use Apache module which can host any Python application which supports the Python WSGI interface. In addition, the module should be suitable for use in hosting high performance production web sites, as well as your average personal sites running on commodity web hosting services.

This initial version of mod_wsgi is suitable for use on dedicated systems or virtual private servers. With suitable configuration it could also be used by web hosting companies specialising in providing hosting for Python web applications.

With a bit more work and encouragement, future versions of mod_wsgi will include additional features which should help it to also break into the truly low cost commodity web hosting market where Python is currently sadly lacking as an option. So, stay tuned for more updates.

Sunday, September 2, 2007

Relative popularity of Python web frameworks.

The battle between the different Python web frameworks over which is technically best is always interesting to watch. In watching these battles and monitoring the various discussion forums for each, one gets a pretty good feel as to which at least is winning the popularity contest. All the same it would be nice to see some actual figures to back up the assumptions one makes. One way that I can see of doing this is to look at the number of unique visits to the mod_wsgi documentation describing how to host each framework on top of mod_wsgi. Although there are lots of caveats, the result of doing such an analysis is pretty well what I expected, with Django coming out on top.

The web frameworks (or non frameworks as some like to call themselves) for which instructions are currently provided for mod_wsgi are CherryPy, Django, Karrigell, Pylons, TurboGears and web.py. Instructions for each have all been up for more than a month on the mod_wsgi web site, so for the analysis I have taken the statistics for the month of August. For that period, the number of unique page views against each was as follows:


Package       Count
Django        332
Pylons        96
TurboGears    89
web.py        75
CherryPy      71
Karrigell     34

FWIW, the mod_wsgi documentation also provides instructions for using Trac and MoinMoin on top of mod_wsgi. The number of unique page views for these packages was:


Package       Count
Trac          324
MoinMoin      44

Although interesting, these results cannot tell the whole story for a variety of reasons. These include whether or not respective packages actually reference mod_wsgi (or how prominently) as a hosting solution in their own documentation and how often I have personally referred to mod_wsgi on each package's mailing list as an alternate solution to a particular person's problem.

Beyond those issues, there are actually a number of technically related reasons as to why for a particular package there may not have been as much traffic to the mod_wsgi web site.

The main issue is that although all are capable of being hosted using Apache and mod_wsgi, of the web frameworks only Django promotes strongly the idea that for production sites one should use Apache. At the moment the recommendation in that respect is mod_python, but at least the idea of using Apache is not a foreign concept. Thus for Django, the builtin web server is only seen as being a practical hosting solution for a development instance of Django.

Most of the other packages instead see the builtin web server they provide as being capable enough to support a production site. Thus, although they may describe or reference other ways of hosting a site developed using the package, the only way that Apache generally factors into the equation is as a proxy to their own web server and as a means of hosting static files.

As far as using a web server implemented in pure Python as opposed to hosting on top of Apache, there does also seem to be a reasonable amount of bias against using Apache. In part this appears to be due to some ignorance as to the pros and cons of using Apache and how to set it up properly, but also partly because of Python zealotry. In other words, just like with every programming language, some are so strongly enamoured of Python that they simply cannot accept that there are other ways of doing things.

Such a pro Python only stance could actually be seen as being detrimental to the chances of Python being accepted within commodity web hosting. This is because commodity web hosting companies will not find it acceptable that they would have to set up and support pure Python back end web server applications to which they merely proxy requests.

Instead, commodity web hosting companies want a system that can be easily integrated with their existing Apache installations (normally set up for PHP), yet doesn't place undue memory requirements and overhead on Apache. Above all, the ability to provide hosting for Python web applications must be very simple to configure and fit in with the large scale automated systems they have for configuring the many sites they would host using one Apache installation.

Thus in some respects, packages which try to steer developers to always using the builtin web server are only going to make it harder for that package to be accepted by web hosting companies. Some thought must be given to ensuring that packages are easy to deploy and set up under Apache in a web hosting environment. If this is not done, then you will not see those packages being supported by web hosting companies and as a result people will simply move to those packages which have put in the effort to make it easy to deploy under Apache.

Tuesday, August 14, 2007

Version 1.0 of mod_wsgi real close now.

Am now onto the third release candidate for version 1.0 of mod_wsgi. If no more problems come up before this weekend is out, I will release the final version 1.0 on Sunday, or shortly thereafter. So, this is the last chance to give it a go and provide feedback on any problems. Many thanks to all who have given feedback so far.

Monday, July 30, 2007

Commodity shared hosting and mod_wsgi.

The first release candidate for mod_wsgi has been out a little while now and so far there has not been a single complaint about its stability. The one and only bug reported has been a rather silly mistake related to decoding the optional umask for daemon processes. Although the signs are good, will web hosting companies want to use it anyway?

In answering this question one has to look beyond stability concerns and look at whether the functionality provided by mod_wsgi fits the requirements of a web hosting environment. To work this out one has to consider the different types of commercially available web hosting environments on offer.

The most full featured web hosting environment is a dedicated server or virtual private server with root access. In this case the user has full access to the operating system and can install whatever they want. Because there are no real restrictions, a user would be free to install mod_wsgi and configure it so as to run applications in embedded mode, daemon mode or a combination of both. The only issue here is whether the user feels that mod_wsgi offers a better solution than mod_python, mod_fastcgi or other such solutions. The web hosting company in this case doesn't affect the actual user's decision.

Cheaper options than such a full featured web hosting environment are available, but as the cost comes down, the restrictions on what the user can do also increase. The next step down of note is what I'll call advanced shared hosting. This is where many users share the same system but each user is provided with their own Apache instance running on a dedicated port. All these distinct Apache instances then sit behind a proxy of some form listening on the standard HTTP port. Because each user gets their own Apache instance, it can run as that user and therefore the user still has a reasonable measure of control over the server. As such the user could once again most likely choose to use mod_wsgi and configure it so as to run applications in embedded mode, daemon mode or a combination of both.

At the lower end of the scale is commodity shared hosting. This is where applications created by many different users are hosted together using a common Apache instance. The main problem with sharing a single Apache instance in this way is that all code runs as the user that Apache runs as. The consequence of running all code as a single user is that different users' applications can interfere with each other. To avoid the problems that can result from this, any hosting solution for dynamic applications must be able to separate applications so that they run in distinct processes, running as the user that is the owner of the application.

When using mod_wsgi for hosting Python applications, daemon mode can be used to create daemon processes running as distinct users, with WSGI applications being delegated to an appropriate daemon process group. Although this feature is available, in the initial version of mod_wsgi the configuration of the daemon processes is effectively static, with the number of daemon processes and the user they would run as needing to be determined in advance. This would be okay for small web hosting companies who specialise in Python web hosting and who manually change the Apache configuration when each new site is added, but it becomes a problem where site configuration is more automated and restarts of Apache are avoided at all costs.

One way around this problem would be to configure some number of spare daemon processes in advance, with each such spare daemon process running as a user which hasn't as yet been allocated to a specific customer. When a new customer arrives who wants to be able to run a Python site, they would be given a user id corresponding to one of the spare daemon processes. Actual mapping of any WSGI applications running under the new user's site would then be performed using a rewrite map which draws data from some form of database that can be dynamically updated without restarting Apache.

Although this may be viewed as a bit of a kludge, this approach would probably be quite acceptable for web hosting companies who specialise in providing hosting for Python applications. This is because it is likely that they would already have worked out in advance how many different Python application instances a machine could likely accommodate. This may be based on a general rule of thumb, or on strict quotas on the amount of memory that any one Python application can use, with any application process being killed off and restarted when the set memory limit is reached. That there may be some number of spare daemon processes running at any one time wouldn't be an issue as the amount of memory they use would be quite small, much less than the limit which would be imposed on a user's application.

Unfortunately, not all web hosting companies are going to want to specialise in providing web hosting for Python applications, nor are they going to want to dedicate specific machines to be used for just that purpose. Instead, they will want to take their existing infrastructure, most likely designed to support PHP applications, and try to use it concurrently for hosting Python applications. Their goal in doing this will be to maintain their existing site density, thereby still retaining their existing cost structure.

What such web hosting companies will want to avoid as much as possible is the need to run long lived daemon processes. This is because even if only a small percentage of the possibly thousands of sites they may host want to use Python, the overall memory requirements will increase much more significantly than if they were PHP applications. This would likely result in them having to reduce the number of sites they can host on the same hardware and increase their costs.

In addition to not wanting to run long lived daemon processes, the fact that the number of daemon processes to be run, and the users they run as, currently has to be predefined would make it too hard to manage. This is because such large scale hosting will load balance a large number of sites across many machines in a cluster. As a result it would be impractical to have to update and restart every Apache instance in the cluster when more daemon processes need to be added.

As a result, for such large scale web hosting a simpler configuration mechanism is required where additional daemon processes can be added dynamically without changes needing to be made to the static configuration. Further, such daemon processes need to be able to be set up to be transient in nature. That is, they need to be able to shut down automatically if they are idle for some, usually quite short, predefined period. By doing this, the memory used by the daemon process will be released, reducing the possibility that all of physical memory will be used up and that the operating system will need to swap memory to disk.

Thus, so far as commodity web hosting goes, mod_wsgi would probably be a reasonable solution for web hosting companies who specialise in providing hosting for Python applications and who dedicate machines for this purpose. It is as yet not suitable for large web hosting companies who aren't specialists in Python web hosting, who want to maintain one homogeneous machine configuration across all their machines which is suitable for hosting varied web application frameworks and languages, and who wish to run with the highest site density possible so as to reduce costs to the maximum.

Although adding support for transient daemon processes to mod_wsgi may entice this latter type of web hosting company to use mod_wsgi, this goes contrary to the preferred option with Python applications of maintaining a long running daemon process. As such, if it were adopted, it would be to the detriment of the user experience due to the possibly long startup times which may be encountered if an application is infrequently used and is always having to be restarted due to being killed off.

So, although the addition of support for transient daemon processes is being looked at for a future version of mod_wsgi, it is more being added to provide additional flexibility for users of dedicated systems. If configuration can be made quite simple then large scale web hosting might well adopt it, but frankly, you may well get what you pay for as far as the user experience that people interacting with your site will have.

If you are serious about providing a high quality site, you are probably better off spending a bit more money each month and getting a site from a web hosting company that specialises in Python web hosting and provides true long lived daemon processes for running your site. This statement would apply equally whether it is mod_wsgi that is used or other solutions such as mod_fastcgi, as large scale web hosting isn't really designed to accommodate the additional demands of Python web applications, and for the time being at least, is always going to be tailored more to PHP applications.

Monday, July 2, 2007

Web hosting landscape and mod_wsgi.

At the end of last year I described on the mod_python mailing list various ideas I had for how one could improve the situation with Python web hosting. These ideas were detailed in:
A subsequent discussion at the first SyPy meetup in January gave me the drive to follow up on the ideas and since then I have been furiously hacking away, with the result being the mod_wsgi module I spoke of in those posts.

As I described in those posts I saw mod_wsgi as only being a first step. Before considering again what one might do beyond mod_wsgi though, it is worthwhile to look at what mod_wsgi has become and how the result fits into the web hosting landscape. In particular, does it actually have the potential to improve the lot of Python developers by providing a compelling solution which will be attractive to companies providing commodity web hosting?

To understand this, one needs to look at what features mod_wsgi provides and specifically the two different modes of operation that have been implemented.

The first mode of operation I tend to refer to as 'embedded' mode. This is where your Python web application runs in the context of the standard Apache child processes. At least in terms of how Python sub interpreters are used, this is the same as how things work with mod_python. Thus, if you have both mod_python and mod_wsgi loaded, applications running under each will share the same process, although they generally would at least run in distinct Python sub interpreters. As far as sharing goes, the process may also be host to PHP or mod_perl applications.

Running applications in the Apache child processes would generally always result in the best performance possible when compared to other alternatives available for using Python with Apache such as mod_fastcgi and mod_scgi or even a second web server behind mod_proxy. This is because the Python application is running in the same process that is accepting and performing the initial parsing of the request from a client. In other words, overhead is as low as it can be as everything is done together in the one process.

In addition to the low overhead, there are also other positive benefits deriving from how Apache works when using this mode. The first is that Apache uses multiple child processes to handle requests. As a result, any contention for the Python GIL within the context of a single process is not an issue, as each process will be independent. Thus there is no impediment when using multi processor systems.

That said, the GIL is not as big a deal as some people make out, even when using Apache with only one multi-threaded child process for accepting requests. This is because the code which handles accepting of requests, determines which Apache handler should process the request, along with the code for reading the request content and writing out the response content, is all written in C and is in no way linked to Python. As a consequence there are large sections of code where the GIL is not being held. On top of that, the same web server may also be serving up static files where again the GIL doesn't even come into the picture. So, more than enough opportunity for making good use of those multiple processors.

The second major benefit comes from Apache's ability to scale up to meet increases in load. The way this works is that Apache will only initially create a certain number of child processes to handle requests. If however the number of requests builds up to the point that the processes wouldn't be able to keep pace, Apache will create additional child processes to meet the demand. It will keep doing this as needs be, although eventually it will stop based on whatever the maximum number of child processes is set to, so as not to totally overload your system.

When the number of requests finally starts to drop down once more, to recover resources Apache will start to kill off any child processes now deemed as unnecessary, eventually getting back to the starting level. So it is that Apache is able to comfortably deal with the ebb and flow of demand without unduly choking.
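The ebb and flow described above can be sketched as a toy model in Python. To be clear, this is not Apache's actual algorithm, which has its own spawn rates and timing. The function name next_pool_size and all of its parameters are assumptions made up for the example, and the caller is assumed to pass a busy count no larger than the current pool size.

```python
def next_pool_size(current, busy, min_spare, max_spare,
                   max_processes, start_processes):
    """Return the pool size for the next tick of a toy pre-forking server.

    Illustrative model only; the names and policy are hypothetical.
    """
    idle = current - busy
    if idle < min_spare:
        # Too few idle processes: grow, but never beyond the hard ceiling.
        return min(current + (min_spare - idle), max_processes)
    if idle > max_spare:
        # Too many idle processes: shrink by one per tick, down to the floor.
        return max(current - 1, start_processes)
    # Idle count is within bounds, so leave the pool alone.
    return current
```

Calling this once per tick with the current number of busy processes makes the pool grow quickly under load, stop at the configured maximum, and then drift slowly back to the starting level once the load subsides.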

So there is a lot of good to be had from how Apache works when using mod_wsgi in this mode. At the same time however a number of issues also arise.

The first is that the child processes generally run as a special non privileged user. This means that this user needs to be given access to the files which make up an application or which the application in turn needs to read. This user will also need to be given special access to files or directories the application needs to write any data to. Because Apache may be used to host a number of different applications, it means however that all applications can read files making up any other application and make changes to any writable directories or files used by those other applications which are writable to the user.

The second problem is that although in mod_wsgi distinct Python sub interpreters are used to keep different applications separate, this isn't foolproof. Problems can arise where different applications attempt to use different versions of a particular C extension module, as Python only loads C extensions once for the whole process and not separately for each sub interpreter. Thus, whichever application gets to load its version first wins out, and when subsequent applications load the module they will get the correct version of any Python wrappers, but that code may not match the API provided by the C extension module itself.

A third, more serious problem, however, is that since Python supports C extension modules, it would be possible for someone with nefarious intent to load a module which gives them access to other sub interpreters' data and code, thereby bypassing the firewalls put in place by mod_wsgi. Such a module would thus allow them to spy into another application, change how it works or steal private information. A very wily hacker may take this even further and poke into the internals of Apache, possibly inserting special handler code into various phases of the request processing cycle, or modifying configuration data used by other modules.

All up, what this means is that although mod_wsgi goes to great lengths to try and ensure that applications can't interfere with each other, it can't be made completely bullet proof. As a result, 'embedded' mode of mod_wsgi would only be suitable in situations where the owners of the web servers are also the owners of the applications running under it. At no time would it ever be recommended that 'embedded' mode would be suitable as a basis for running applications owned by different users in a web hosting environment.

Do note that these problems aren't the fault of mod_wsgi specifically. Some derive from the way Apache works and others from how Python works. Using mod_python as an alternative will not offer anything better. In fact mod_python actually has more problems due to the open nature of how it hooks into Apache, thus making it easier to modify the behaviour of Apache and potentially access into other applications or steal private information.

Originally the intent in writing mod_wsgi was to only target users who also controlled the web server they were using. As a consequence, these issues weren't specifically seen as being a problem that needed to be countered. During the development of mod_wsgi however, its existence seemed to be raising the hopes of many that a suitably simple solution for commodity Python web hosting might not be far away, which meant that it was necessary to look at how one could address the problems. The end result of this was the addition of 'daemon' mode to mod_wsgi.

The main difference between 'daemon' and 'embedded' mode is that in 'daemon' mode the actual application code is not run within the context of the Apache child processes, but within separate daemon processes able to be run as a distinct user. Although there is a performance penalty resulting from having to proxy the request through to the distinct daemon process which is to handle the request, because the application is now isolated into a separate process the problems described above for 'embedded' mode are eliminated.

In the first instance, because the daemon process runs as a distinct user, only that user and not the user that the Apache child processes run as will need access to the Python code files that make up the application. The same applies to writable directories or files with them only needing to be modifiable by the user that the daemon process runs as. Thus, any actual Python code or private data pertaining to the application is protected and safe from access by other users of the system.

The only files which would still need to be readable to the user that the Apache child process runs as are any static files such as HTML pages, graphics or media files. This is because the main Apache child process would still provide the service of serving up these files.

The problem with C extension modules being global to a process is also eliminated with 'daemon' mode by the fact that multiple daemon processes can be created and each application assigned to its own process. This ability to isolate an application from others by assigning them to different processes also prevents hackers from interfering with another user's running application.

As a consequence, although 'embedded' mode would not be suitable for a server environment where applications owned by different users need to be hosted together, 'daemon' mode has the necessary protections available to make it safe to use in such a hostile environment and thus it would be suitable for shared web hosting environments.

When one looks at mod_wsgi as a whole, the result is a package which is suitable both for building high performance web sites and for commodity web hosting. In both cases configuration is simple, with the one application script file being suitable for use in both modes. A complex Python web application may even make use of both modes at the same time. For example, application components requiring better performance could be run in 'embedded' mode, but with other application components requiring special access privileges, or which are memory hungry or processor intensive, being delegated off to distinct daemon processes.
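As an illustration of how small such a script file can be, the following is a minimal WSGI application of the kind that could be used unchanged in either mode. The response text is made up for the example. The only convention relied on is that mod_wsgi by default looks for a callable named application in the script file.

```python
def application(environ, start_response):
    """Minimal WSGI application; the greeting text is just an example."""
    body = b"Hello from a WSGI application\n"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    # Report the status line and headers, then return the body as an
    # iterable of byte strings, as the WSGI specification requires.
    start_response("200 OK", headers)
    return [body]
```

Nothing in the script says which mode it will run under. That decision is left entirely to the Apache configuration.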

In the end, this combination of abilities makes mod_wsgi a somewhat more flexible platform than other available solutions for developing WSGI applications using Apache. At the same time, because everything is in a single package all managed through Apache, configuration is much simpler and there is no need to install or manage any distinct back end infrastructure.

So, although my original plans didn't envision incorporating a 'daemon' mode, the effort in adding it has been quite worthwhile, with the elusive goal of a way of providing commodity web hosting for Python applications now perhaps being achievable after all. :-)

Friday, March 30, 2007

Reloading of Python code into web applications.

One of the major complaints with Python web frameworks is the need to restart the application whenever changes are made to the code. To try and avoid restarts or to make it easier to manage, different Python web frameworks and hosting technologies employ a number of different techniques. These include reloading Python modules into the existing running application, using a supervisor process to monitor for code changes and restart the actual server process automatically, or simply providing a means for a normal user, as opposed to a super user, to completely restart the web server.
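The monitoring half of the supervisor approach can be sketched quite simply. The following is a minimal example, assuming that comparing file modification times is good enough to detect a code change. The function names snapshot and changed_files are hypothetical and not taken from any particular framework.

```python
import os


def snapshot(paths):
    """Record the modification time of each file that currently exists."""
    return {p: os.stat(p).st_mtime_ns for p in paths if os.path.exists(p)}


def changed_files(before, after):
    """Return paths that were added, removed or modified between snapshots."""
    return {p for p in before.keys() | after.keys()
            if before.get(p) != after.get(p)}
```

A supervisor would take a snapshot when it starts the server process, take another every second or so, and restart the server process whenever changed_files() returns a non empty set.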

As far as reloading Python modules into the existing running application goes, the most developed example of this technique is the module importer present in mod_python 3.3. The module importer in mod_python is different to other more simplistic module importers as it tracks the parent/child relationships between imported modules. This information means the module importer can determine that it needs to reload a top level request handler module even though the module itself hasn't changed but where some other module it is dependent on has changed.
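To show why tracking the parent/child relationships matters, here is a small sketch of the propagation step. It is not the mod_python implementation, just one way the idea could be expressed. The function name modules_to_reload and the shape of the imports mapping are assumptions made for the example.

```python
from collections import deque


def modules_to_reload(changed, imports):
    """Return the changed modules plus everything that depends on them.

    imports maps each module name to the names of the modules it imports
    (its children). Hypothetical helper for illustration only.
    """
    # Invert the graph: for each child, record which parents import it.
    parents = {}
    for parent, children in imports.items():
        for child in children:
            parents.setdefault(child, set()).add(parent)

    # Walk upwards from every changed module, marking each ancestor stale.
    stale = set(changed)
    queue = deque(changed)
    while queue:
        module = queue.popleft()
        for parent in parents.get(module, ()):
            if parent not in stale:
                stale.add(parent)
                queue.append(parent)
    return stale
```

With a mapping such as {"handler": ["views"], "views": ["models"], "models": []}, a change to only "models" marks "views" and "handler" as needing a reload as well, even though their own code files are untouched.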

Whereas module importers normally keep modules in sys.modules and they must all be uniquely named, the mod_python module importer also avoids a whole host of problems by not holding the web application's modules in sys.modules but in a separate caching system whereby they are identified by the absolute path name of the module's code file. This means it is possible for the same name to be used for a code file in multiple directories without the need to artificially hold modules in a Python package to avoid name space collisions. When a module is reloaded it is also not reloaded on top of the existing module as is done for modules in sys.modules, therefore eliminating problems with module pollution when attributes are deleted from the code but still exist in the loaded module, as well as various multi-threading problems which can arise due to reloading new code and data on top of existing code and data that may be getting used at the time.
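The two ideas in that paragraph, caching by absolute path and never reloading on top of an existing module, can be sketched with the standard importlib machinery of current Python versions. This is only an approximation of the technique and not the mod_python code. The names import_by_path and _cache, and the "_bypath_" prefix, are invented for the example.

```python
import importlib.util
import os

_cache = {}  # absolute path -> (mtime at load time, module object)


def import_by_path(path):
    """Load a module from a file path, reloading it if the file has changed.

    Hypothetical helper for illustration; expects a path ending in .py.
    """
    path = os.path.abspath(path)
    mtime = os.stat(path).st_mtime_ns
    cached = _cache.get(path)
    if cached is not None and cached[0] == mtime:
        return cached[1]

    # Build a brand new module object rather than executing the new code on
    # top of the old one, so attributes deleted from the source disappear.
    name = "_bypath_" + os.path.splitext(os.path.basename(path))[0]
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # the module is never put in sys.modules
    _cache[path] = (mtime, module)
    return module
```

Because the cache key is the path, two files called utils.py in different directories do not collide, and any code still holding the old module object keeps working with the old code and data until it is finished with it.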

All these features do come with some cost in performance. More of an issue though is that the importer is bound to mod_python and thus is only of use in web applications which themselves bind closely to the mod_python API. As such, although mod_python has this quite sophisticated module importer, it is absolutely useless if you are running some WSGI based application on top of mod_python as by the very nature of what WSGI is, such an application will only use features that can be hosted on top of multiple hosting technologies so can't make use of it.

One could separate the module importer from mod_python and make it a standalone package, but even then you are unlikely to see it adopted by any of the existing Python web frameworks. This is because these Python web frameworks already have their own way of doing things, and even if the module importer may be a better solution, to use the module importer would more than likely break compatibility of older applications and require users to perform some restructuring of their code. Use of the module importer may also not be able to be made totally transparent and thus one would be forcing new concepts on the user that they have to deal with. Finally, as good as the mod_python module importer may be, it still isn't going to be suitable in all situations, and thus you will still end up with some subset of modules that cannot be safely reloaded into an existing running application.

So, all in all it is very unlikely that one will see any form of sophisticated module importer system for reloading modules on the fly that works properly and is used as some sort of standard across the various Python web frameworks. Instead one will continue to see half baked solutions which don't really work. This will not necessarily be for lack of effort on the part of the people implementing them, it is just that reloading code safely into Python is hard for the general case, if not impossible.

Given the above, the only real practical solution that will work with all Python web frameworks is to throw away the interpreter contents and start over whenever one needs to pick up any code changes. To date this has always meant killing off the whole process and restarting it. This brute force approach may be fine where you manage and control your own web hosting environment, but isn't really practical for all those who rely on shared web hosting implemented using Apache for their web services. This type of service is problematic because the company managing the web server is hardly likely to be amenable to a constant stream of user requests to restart the web server every time they make a change to their code.

Packages for Apache such as mod_fastcgi, mod_scgi and mod_proxy_ajp have tried to address this problem by moving the actual web application into a specialised back end process and merely using Apache as a proxy for requests. Again using proxying, one could even use another web server as the back end process.

By using a back end process in this way a number of problems can be solved. The first is that because the back end process can be dedicated to a specific user and run as that user, control for restarting it can then be handed to the user. The second problem that is solved is that any code is no longer executing as the user that Apache would run as. This eliminates problems with user code accessing parts of the file system they potentially shouldn't and with user code interfering with a different user's code, as each user can be given their own dedicated file system space to write to.

Although these provide the control that a user needs, a solution that doesn't just use another Apache server instance as the back end process is going to be foreign to most web hosting companies and can as a result be hard for them to set up. This is because not only does one need to build and install the required Apache module, there are multiple choices as to how to implement the back end parts of the system and it may not be obvious as to why one should be used over another. This isn't made any better by virtue of a lack of good solid documentation and less than adequate support for running and managing such systems. For a web hosting company that wants something that is quick and cheap to get working and requires little management they currently appear not to be particularly attractive solutions.

In terms of how else one can solve this problem, the only other alternative for Python is that instead of killing off the whole process which is hosting a web application, one could just destroy the particular interpreter instance within the running process. If one is to pursue this approach, what is required though is the ability to create and control additional Python sub interpreters and to run the whole web application, or parts of it, inside of a sub interpreter rather than the main Python interpreter. Having that, it would then be possible to kill off a particular sub interpreter, thereby destroying that part of the web application, and recreate it in a new sub interpreter using the new code base.

At present your standard Python runtime doesn't support such manipulation of Python sub interpreters. Using Python sub interpreters is not new though, with mod_python using sub interpreters to provide separation between web applications or parts of them. What mod_python doesn't allow though is for new sub interpreters to be created and used from a web application itself in some way.

That there is currently no way of manipulating Python sub interpreters from within a Python application doesn't mean it can't be done though. All that is required is a C extension module that provides a means of creating the sub interpreters and then mediating a way of making a call from one sub interpreter to another. Although it sounds simple, in practice there are various gotchas in getting it to work correctly. It also potentially opens up a big can of worms due to issues that can arise with sharing objects between sub interpreters as well as safely managing the destruction of interpreters.

That is it for now though. In the next blog installment I'll go more into this idea of using disposable sub interpreters within Python web applications, explaining the various problems and also showing examples of some working code. I will also discuss how one could constrain the idea so as to make it a moderately safe technique to make use of in mod_wsgi and possibly other WSGI application stacks.