Monday, July 30, 2007

Commodity shared hosting and mod_wsgi.

The first release candidate for mod_wsgi has been out a little while now and so far there has not been a single complaint about its stability. The one and only bug reported has been a rather silly mistake related to decoding the optional umask for daemon processes. Although the signs are good, will web hosting companies want to use it anyway?

In answering this question one has to look beyond stability concerns and look at whether the functionality provided by mod_wsgi fits the requirements of a web hosting environment. To work this out one has to consider the different types of commercially available web hosting environments on offer.

The most full featured web hosting environment is a dedicated server or virtual private server with root access. In this case the user has full access to the operating system and can install whatever they want. Because there are no real restrictions, a user would be free to install mod_wsgi and configure it so as to run applications in embedded mode, daemon mode or a combination of both. The only issue here is whether the user feels that mod_wsgi offers a better solution than mod_python, mod_fastcgi or other such solutions. The web hosting company in this case doesn't affect the actual user's decision.

Cheaper options than such a full featured web hosting environment are available, but as the cost comes down, the restrictions on what the user can do also increase. The next step down of note is what I'll call advanced shared hosting. This is where many users share the same system but each user is provided with their own Apache instance running on a dedicated port. All these distinct Apache instances then sit behind a proxy of some form listening on the standard HTTP port. Because each user gets their own Apache instance, it can run as that user and therefore the user still has a reasonable measure of control over the server. As such the user could once again most likely choose to use mod_wsgi and configure it so as to run applications in embedded mode, daemon mode or a combination of both.

At the lower end of the scale is commodity shared hosting. This is where applications created by many different users are hosted together using a common Apache instance. The main problem with sharing a single Apache instance in this way is that all code runs as the user that Apache runs as. The consequence of running all code as a single user is that different users' applications can interfere with each other. To avoid the problems that can result from this, any hosting solution for dynamic applications must be able to separate applications so that they run in distinct processes, each running as the user that is the owner of the application.

When using mod_wsgi for hosting Python applications, daemon mode can be used to create daemon processes running as distinct users, with WSGI applications being delegated to an appropriate daemon process group. Although this feature is available, in the initial version of mod_wsgi the configuration of the daemon processes is effectively static, with the number of daemon processes and the user they would run as needing to be fixed in advance. This would be okay for small web hosting companies who specialise in Python web hosting and who manually change the Apache configuration when each new site is added, but it becomes a problem where site configuration is more automated and restarts of Apache are avoided at all costs.

One way around this problem would be to configure some number of spare daemon processes in advance, with each such spare daemon process running as a user which hasn't as yet been allocated to a specific customer. When a new customer arrives who wants to be able to run a Python site, they would be given a user id corresponding to one of the spare daemon processes. Actual mapping of any WSGI applications running under the new user's site would then be performed using a rewrite map which draws data from some form of database that can be dynamically updated without restarting Apache.
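
The bookkeeping behind such a scheme can be sketched in a few lines of Python. This is only an illustration of the idea, not something mod_wsgi provides: the class name, the process group names and the host name are all made up, and in a real deployment the lookup would be done by the rewrite map against whatever database the hosting company uses.

```python
class SparePool:
    """Tracks pre-created daemon process groups and hands them out to new sites.

    Hypothetical helper: the names used here are illustrative only.
    """

    def __init__(self, spare_groups):
        self._spare = list(spare_groups)  # groups not yet given to a customer
        self._assigned = {}               # site host name -> process group name

    def assign(self, site):
        """Give the site one of the spare process groups (no restart needed)."""
        if site in self._assigned:
            return self._assigned[site]
        if not self._spare:
            raise RuntimeError("no spare daemon process groups left")
        group = self._spare.pop(0)
        self._assigned[site] = group
        return group

    def lookup(self, site):
        """Return the group a request for this site maps to, or None."""
        return self._assigned.get(site)

    def spare_count(self):
        """Number of pre-created groups still waiting for a customer."""
        return len(self._spare)


# Three spare groups are configured up front; a new customer takes the first.
pool = SparePool(["spare-01", "spare-02", "spare-03"])
pool.assign("www.example.com")
```

The point of the sketch is that only the mapping changes when a customer is added. The set of daemon processes, and hence the static Apache configuration, stays the same until the pool of spares runs out.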

Although this may be viewed as a bit of a kludge, this approach would probably be quite acceptable for web hosting companies who specialise in providing hosting for Python applications. This is because it is likely that they would already have worked out in advance how many different Python application instances a machine could likely accommodate. This may be based on a general rule of thumb, or by applying strict quotas on the amount of memory that any one Python application can use, with any application process being killed off and restarted when the set memory limit is reached. That there may be some number of spare daemon processes running at any one time wouldn't be an issue as the amount of memory they use would be quite small, much less than the limit which would be imposed on a user's application.

Unfortunately, not all web hosting companies are going to want to specialise in providing web hosting for Python applications, nor are they going to want to dedicate specific machines to be used for just that purpose. Instead, they will want to take their existing infrastructure, most likely designed to support PHP applications, and try to use it concurrently for hosting Python applications. Their goal in doing this will be to maintain their existing site density, thereby still retaining their existing cost structure.

What such web hosting companies will want to avoid as much as possible is the need to run long lived daemon processes. This is because even if only a small percentage of the possibly thousands of sites they may host want to use Python, the overall memory requirements will increase much more significantly than if they were PHP applications. This would likely result in them having to reduce the number of sites they can host on the same hardware and increase their costs.

In addition to not wanting to run long lived daemon processes, the fact that the number of daemon processes to be run, and the users they run as, currently has to be predefined would make it too hard to manage. This is because such large scale hosting will load balance a large number of sites across many machines in a cluster. As a result it would be impractical to have to update and restart every Apache instance in the cluster when more daemon processes need to be added.

As a result, for such large scale web hosting a simpler configuration mechanism is required where additional daemon processes can be added dynamically without changes needing to be made to the static configuration. Further, such daemon processes need to be able to be set up to be transient in nature. That is, they need to be able to shut down automatically if they are idle for some, usually quite short, predefined period. By doing this, the memory used by the daemon process will be released, reducing the possibility that all of physical memory will be used up and the operating system will need to swap memory to disk.
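
The idle shutdown rule itself is simple, as the following Python sketch suggests. It is not how mod_wsgi implements anything, since the feature does not exist in the current version. The class and its `idle_timeout` parameter are hypothetical, and the clock is passed in only so that the behaviour can be checked without waiting.

```python
import time


class IdleMonitor:
    """Decides when a transient daemon process has been idle for too long.

    Hypothetical helper for illustration; idle_timeout is in seconds.
    """

    def __init__(self, idle_timeout, clock=time.monotonic):
        self.idle_timeout = idle_timeout
        self._clock = clock
        self._last_request = clock()  # treat process start as the last activity

    def request_handled(self):
        """Record that a request has just been served."""
        self._last_request = self._clock()

    def should_shut_down(self):
        """True once no request has been seen for idle_timeout seconds."""
        return self._clock() - self._last_request >= self.idle_timeout
```

A daemon process following this rule would call `request_handled()` after each request and exit once a periodic check finds `should_shut_down()` to be true, with the process being started again on the next request for that site.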

Thus, so far as commodity web hosting goes, mod_wsgi would probably be a reasonable solution for web hosting companies who specialise in providing hosting for Python applications and who dedicate machines for this purpose. It is as yet not suitable for large web hosting companies who aren't specialists in Python web hosting, who want to maintain one homogeneous machine configuration across all their machines which is suitable for hosting varied web application frameworks and languages, and who wish to run with the highest site density possible so as to reduce costs to the maximum.

Although adding support for transient daemon processes to mod_wsgi may entice this latter type of web hosting company to use mod_wsgi, this goes contrary to the preferred option with Python applications of maintaining a long running daemon process. As such, if it were adopted, it would be to the detriment of the user experience due to the possibly long startup times which may be encountered if an application is infrequently used and is always having to be restarted due to being killed off.

So, although the addition of support for transient daemon processes is being looked at for a future version of mod_wsgi, it is more being added to provide additional flexibility for users of dedicated systems. If configuration can be made quite simple then large scale web hosting might well adopt it, but frankly, you may well get what you pay for as far as the user experience that people interacting with your site will have.

If you are serious about providing a high quality site, you are probably better off spending a bit more money each month and getting a site from a web hosting company that specialises in Python web hosting and provides true long lived daemon processes for running your site. This statement would apply equally whether it is mod_wsgi that is used or other solutions such as mod_fastcgi, as large scale web hosting isn't really designed to accommodate the additional demands of Python web applications, and for the time being at least, is always going to be tailored more to PHP applications.

Monday, July 2, 2007

Web hosting landscape and mod_wsgi.

At the end of last year I described on the mod_python mailing list various ideas I had for how one could improve the situation with Python web hosting. These ideas were detailed in:
A subsequent discussion at the first SyPy meetup in January gave me the drive to follow up on the ideas and since then I have been furiously hacking away, with the result being the mod_wsgi module I spoke of in those posts.

As I described in those posts I saw mod_wsgi as only being a first step. Before considering again what one might do beyond mod_wsgi though, it is worthwhile to look at what mod_wsgi has become and how the result fits into the web hosting landscape. In particular, does it actually have the potential to improve the lot of Python developers by providing a compelling solution which will be attractive to companies providing commodity web hosting?

To understand this, one needs to look at what features mod_wsgi provides and specifically the two different modes of operation that have been implemented.

The first mode of operation I tend to refer to as 'embedded' mode. This is where your Python web application runs in the context of the standard Apache child processes. At least in terms of how Python sub interpreters are used, this is the same as how things work with mod_python. Thus, if you have both mod_python and mod_wsgi loaded, applications running under each will share the same process, although they generally would at least run in distinct Python sub interpreters. As far as sharing goes, the process may also be host to PHP or mod_perl applications as well.

Running applications in the Apache child processes would generally always result in the best performance possible when compared to other alternatives available for using Python with Apache such as mod_fastcgi and mod_scgi or even a second web server behind mod_proxy. This is because the Python application is running in the same process that is accepting and performing the initial parsing of the request from a client. In other words, overhead is as low as it can be as everything is done together in the one process.

In addition to the low overhead, there are also other positive benefits deriving from how Apache works when using this mode. The first is that Apache uses multiple child processes to handle requests. As a result, any contention for the Python GIL within the context of a single process is not an issue, as each process will be independent. Thus there is no impediment when using multi processor systems.

That said, the GIL is not as big a deal as some people make out, even when using Apache with only one multi-threaded child process for accepting requests. This is because the code which handles accepting of requests, determines which Apache handler should process the request, along with the code for reading the request content and writing out the response content, is all written in C and is in no way linked to Python. As a consequence there are large sections of code where the GIL is not being held. On top of that, the same web server may also be serving up static files where again the GIL doesn't even come into the picture. So, more than enough opportunity for making good use of those multiple processors.

The second major benefit comes from Apache's ability to scale up to meet increases in load. The way this works is that Apache will only initially create a certain number of child processes to handle requests. If however the number of requests builds up to the point that the processes wouldn't be able to keep pace, Apache will create additional child processes to meet the demand. It will keep doing this as needs be, although eventually it will stop based on whatever the maximum number of child processes is set to, so as not to totally overload your system.

When the number of requests finally starts to drop down once more, to recover resources Apache will start to kill off any child processes now deemed as unnecessary, eventually getting back to the starting level. So it is that Apache is able to comfortably deal with the ebb and flow of demand without unduly choking.
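
As a very simplified model of this behaviour, the following Python function picks a new child process count from the current load. It is only a sketch: real Apache adds and removes processes gradually over time, and the parameter names here are illustrative and are not Apache directive names.

```python
def next_child_count(current, busy, min_spare, max_spare, max_children):
    """Return how many child processes a simplified server would aim for.

    current      -- child processes running now
    busy         -- child processes currently handling a request
    min_spare    -- fewest idle processes to keep in reserve
    max_spare    -- most idle processes tolerated before some are killed off
    max_children -- hard upper limit on the number of processes
    """
    idle = current - busy
    if idle < min_spare:
        # Load is building up: add processes, but never beyond the limit.
        return min(max_children, busy + min_spare)
    if idle > max_spare:
        # Load has dropped away: release the surplus idle processes.
        return busy + max_spare
    return current
```

With a reserve of 2 to 4 idle processes and a limit of 10, five fully busy processes would grow to 7, nine fully busy processes would be capped at 10, and ten processes with only one busy would shrink back to 5.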

So there is a lot of good to be had from how Apache works when using mod_wsgi in this mode. At the same time however a number of issues also arise.

The first is that the child processes generally run as a special non privileged user. This means that this user needs to be given access to the files which make up an application or which the application in turn needs to read. This user will also need to be given special access to files or directories the application needs to write any data to. Because Apache may be used to host a number of different applications, it means however that all applications can read files making up any other application and make changes to any writable directories or files used by those other applications which are writable to the user.

The second problem is that although in mod_wsgi distinct Python sub interpreters are used to keep different applications separate, this isn't foolproof. Problems can arise where different applications attempt to use different versions of a particular C extension module, as Python only loads C extensions once for the whole process and not separately for each sub interpreter. Thus, whichever application gets to load its version first wins out, and when subsequent applications load it, they will get the correct version of any Python wrappers, but that code may not match the API provided by the C extension module itself.

A third more serious problem however, is that since Python supports C extension modules, it would be possible for someone with nefarious intent to load a module which gives them access to other sub interpreters' data and code, thereby bypassing the firewalls put in place by mod_wsgi. Such a module would thus allow them to spy into another application, change how it works or steal private information. A very wily hacker may take this even further and poke into the internals of Apache, possibly inserting special handler code into various phases of the request processing cycle, or modifying configuration data used by other modules.

All up, what this means is that although mod_wsgi goes to great lengths to try and ensure that applications can't interfere with each other, it can't be made completely bullet proof. As a result, 'embedded' mode of mod_wsgi would only be suitable in situations where the owners of the web servers are also the owners of the applications running under it. At no time would it ever be recommended that 'embedded' mode would be suitable as a basis for running applications owned by different users in a web hosting environment.

Do note that these problems aren't the fault of mod_wsgi specifically. Some derive from the way Apache works and others from how Python works. Using mod_python as an alternative will not offer anything better. In fact mod_python actually has more problems due to the open nature of how it hooks into Apache, thus making it easier to modify the behaviour of Apache and potentially access into other applications or steal private information.

Originally the intent in writing mod_wsgi was to only target users who also controlled the web server they were using. As a consequence, these issues weren't specifically seen as being a problem that needed to be countered. During the development of mod_wsgi however, its existence seemed to be raising the hopes of many that a suitably simple solution for commodity Python web hosting might not be far away, which meant that it was necessary to look at how one could address the problems. The end result of this was the addition of 'daemon' mode to mod_wsgi.

The main difference between 'daemon' and 'embedded' mode is that in 'daemon' mode the actual application code is not run within the context of the Apache child processes, but within separate daemon processes able to be run as a distinct user. Although there is a performance penalty resulting from having to proxy the request through to the distinct daemon process which is to handle the request, because the application is now isolated into a separate process the problems described above for 'embedded' mode are eliminated.

In the first instance, because the daemon process runs as a distinct user, only that user and not the user that the Apache child processes run as will need access to the Python code files that make up the application. The same applies to writable directories or files with them only needing to be modifiable by the user that the daemon process runs as. Thus, any actual Python code or private data pertaining to the application is protected and safe from access by other users of the system.
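
One way to check that application files really are private to their owner is to inspect the permission bits, as in this Python sketch. It assumes a POSIX system, and the function name is made up for illustration. The temporary file only stands in for a file belonging to an application.

```python
import os
import stat
import tempfile


def exposed_to_other_users(path):
    """Return True if anyone besides the owner can read or write the path."""
    mode = os.stat(path).st_mode
    shared_bits = stat.S_IRGRP | stat.S_IWGRP | stat.S_IROTH | stat.S_IWOTH
    return bool(mode & shared_bits)


# Demonstration on a temporary file standing in for application code.
with tempfile.NamedTemporaryFile() as handle:
    os.chmod(handle.name, 0o600)  # owner only, as daemon mode permits
    private_result = exposed_to_other_users(handle.name)
    os.chmod(handle.name, 0o644)  # world readable, as embedded mode needs
    shared_result = exposed_to_other_users(handle.name)
```

In 'embedded' mode the second, more open, arrangement is what the Apache user would generally require. In 'daemon' mode the first is sufficient, because the daemon process runs as the owner of the files.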

The only files which would still need to be readable to the user that the Apache child process runs as are any static files such as HTML pages, graphics or media files. This is because the main Apache child process would still provide the service of serving up these files.

The problem with C extension modules being global to a process is also eliminated with 'daemon' mode by the fact that multiple daemon processes can be created and each application assigned to its own process. This ability to isolate an application from others by assigning them to different processes also prevents hackers from interfering with another user's running application.

As a consequence, although 'embedded' mode would not be suitable for a server environment where applications owned by different users need to be hosted together, 'daemon' mode has the necessary protections available to make it safe to use in such a hostile environment and thus it would be suitable for shared web hosting environments.

When one looks at mod_wsgi as a whole, the result is a package which is suitable both for building high performance web sites and for commodity web hosting. In both cases configuration is simple, with the one application script file being suitable for use in both modes. A complex Python web application may even make use of both modes at the same time. For example, application components requiring better performance could be run in 'embedded' mode, but with other application components requiring special access privileges, or which are memory hungry or processor intensive, being delegated off to distinct daemon processes.
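
For reference, an application script file need be nothing more than a Python file defining a WSGI callable, which mod_wsgi by default expects to be named `application`. The following is a minimal sketch written for current Python versions, where the response body is bytes. The greeting text is arbitrary.

```python
def application(environ, start_response):
    """Minimal WSGI application returning a plain text greeting."""
    body = b"Hello from a WSGI application\n"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

Nothing in the file refers to either mode. Whether it runs in the Apache child processes or in a separate daemon process is decided entirely by the Apache configuration.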

In the end, this combination of abilities makes mod_wsgi a somewhat more flexible platform than other available solutions for developing WSGI applications using Apache. At the same time, because everything is in a single package all managed through Apache, configuration is much simpler and there is no need to install or manage any distinct back end infrastructure.

So, although my original plans didn't envision incorporating a 'daemon' mode, the effort in adding it has been quite worthwhile, with the elusive goal of a way of providing commodity web hosting for Python applications now perhaps being achievable after all. :-)