Showing posts with label mod_python. Show all posts
Showing posts with label mod_python. Show all posts

Monday, July 2, 2007

Web hosting landscape and mod_wsgi.

At the end of last year I described on the mod_python mailing list various ideas I had for how one could improve the situation with Python web hosting. These ideas were detailed in:
A subsequent discussion at the first SyPy meetup in January gave me the drive to follow up on the ideas and since then I have been furiously hacking away, with the result being the mod_wsgi module I spoke of in those posts.

As I described in those posts I saw mod_wsgi as only being a first step. Before considering again what one might do beyond mod_wsgi though, it is worthwhile to look at what mod_wsgi has become and how the result fits into the web hosting landscape. In particular, does it actually have the potential to improve the lot of Python developers by providing a compelling solution which will be attractive to companies providing commodity web hosting.

To understand this, one needs to look at what features mod_wsgi provides and specifically the two different modes of operation that have been implemented.

The first mode of operation I tend to refer to as 'embedded' mode. This is where your Python web application runs in the context of the standard Apache child processes. At least in terms of how Python sub interpreters are used, this is the same as how things work with mod_python. Thus, if you have both mod_python and mod_wsgi loaded, applications running under each will share the same process, although they generally would at least run in distinct Python sub interpreters. As far as sharing goes, the process may also be host to PHP or mod_perl applications as well.

Running applications in the Apache child processes would generally always result in the best performance possible when compared to other alternatives available for using Python with Apache such as mod_fastcgi and mod_scgi or even a second web server behind mod_proxy. This is because the Python application is running in the same process that is accepting and performing the initial parsing of the request from a client. In other words, overhead is as low as it can be as everything is done together in the one process.

In addition to the low overhead, there are also other positive benefits deriving from how Apache works when using this mode. The first is that Apache uses multiple child processes to handle requests. As a result, any contention for the Python GIL within the context of a single process is not an issue, as each process will be independent. Thus there is no impediment when using multi processor systems.

That said, the GIL is not as big a deal as some people make out, even when using Apache with only one multi-threaded child process for accepting requests. This is because the code which handles accepting of requests, determines which Apache handler should process the request, along with the code for reading the request content and writing out the response content, is all written in C and is in no way linked to Python. As a consequence there are large sections of code where the GIL is not being held. On top of that, the same web server may also be serving up static files where again the GIL doesn't even come into the picture. So, more than enough opportunity for making good use of those multiple processors.

The second major benefit comes from Apache's ability to scale up to meet increases in load. The way this works is that Apache will only initially create a certain number of child processes to handle requests. If however the number of requests builds up to the point that the processes wouldn't be able to keep pace, Apache will create additional child processes to meet the demand. It will keep doing this as needs be, although eventually it will stop based on whatever the maximum number of child process is set to, so as not to totally overload your system.

When the number of requests finally starts to drop down once more, to recover resources Apache will start to kill off any child processes now deemed as unnecessary, eventually getting back to the starting level. So it is that Apache is able to comfortably deal with the ebb and flow of demand without unduly choking.

So there is a lot of good to be had from how Apache works when using mod_wsgi in this mode. At the same time however a number of issues also arise.

The first is that the child processes generally run as a special non privileged user. This means that this user needs to be given access to the files which make up an application or which the application in turn needs to read. This user will also need to be given special access to files or directories the application needs to write any data to. Because Apache may be used to host a number of different applications, it means however that all applications can read files making up any other application and make changes to any writable directories or files used by those other applications which are writable to the user.

The second problem is that although in mod_wsgi distinct Python sub interpreters are used to keep different applications separate, this isn't fool proof. Problems can arise where different applications attempt to use different versions of a particular C extension module, as Python only loads C extensions once for the whole process and not separately for each sub interpreter. Thus, which application gets to load their version first wins out and when subsequent applications load it, they will get the correct version of any Python wrappers, but that code may not match the API provided by the C extension module itself.

A third more serious problem however, is that since Python supports C extension modules, it would be possible for someone with nefarious intent to load a module which gives them access to other sub interpreters data and code thereby bypassing the firewalls put in place by mod_wsgi. Such a module would thus allow them to spy into another application, change how it works or steal private information. A very wily hacker may take this even further and poke into the internals of Apache, possibly inserting special handler code into various phases of the request processing cycle, or modifying configuration data used by other modules.

All up, what this means is that although mod_wsgi goes to great lengths to try and ensure that applications can't interfere with each other, it can't be made completely bullet proof. As a result, 'embedded' mode of mod_wsgi would only be suitable in situations where the owners of the web servers are also the owners of the applications running under it. At no time would it ever be recommended that 'embedded' mode would be suitable as a basis for running applications owned by different users in a web hosting environment.

Do note that these problems aren't the fault of mod_wsgi specifically. Some derive from the way Apache works and others from how Python works. Using mod_python as an alternative will not offer anything better. In fact mod_python actually has more problems due to the open nature of how it hooks into Apache, thus making it easier to modify the behaviour of Apache and potentially access into other applications or steal private information.

Originally the intent in writing mod_wsgi was to only target users who also controlled the web server they were using. As a consequence, these issues weren't specifically seen as being a problem that needed to be countered. During the development of mod_wsgi however, that the existence of mod_wsgi seemed to be raising the hopes of many that a suitable simple solution for commodity Python web hosting might not be far away, meant that it was necessary to look at how one could address the problems. The end result of this was the addition of 'daemon' mode to mod_wsgi.

The main difference between 'daemon' and 'embedded' mode is that in 'daemon' mode the actual application code is not run within the context of the Apache child processes, but within separate daemon processes able to be run as a distinct user. Although there is a performance penalty resulting from having to proxy the request through to the distinct daemon process which is to handle the request, because the application is now isolated into a separate process the problems described above for 'embedded' mode are eliminated.

In the first instance, because the daemon process runs as a distinct user, only that user and not the user that the Apache child processes run as will need access to the Python code files that make up the application. The same applies to writable directories or files with them only needing to be modifiable by the user that the daemon process runs as. Thus, any actual Python code or private data pertaining to the application is protected and safe from access by other users of the system.

The only files which would still need to be readable to the user that the Apache child process runs as are any static files such as HTML pages, graphics or media files. This is because the main Apache child process would still provide the service of serving up these files.

The problem with C extension modules being global to a process is also eliminated with 'daemon' mode by the fact that multiple daemon processes can be created and each application assigned to their own process. This ability to isolate an application from others by assigning them to different processes, also prevents hackers from interfering with another users running application.

As a consequence, although 'embedded' mode would not be suitable for a server environment where applications owned by different users need to be hosted together, 'daemon' mode has the necessary protections available to make it safe to use in such a hostile environment and thus it would be suitable for shared web hosting environments.

When one looks at mod_wsgi a whole, the result is a package which is suitable both for building both high performance web sites and for commodity web hosting. In both cases configuration is simple, with the one application script file being suitable for use in both modes. A complex Python web application may even make use of both modes at the same time. For example, application components requiring better performance could be run in 'embedded' mode, but with other application components requiring special access privileges, which are memory hungry or processor intensive, being delegated off to distinct daemon processes.

In the end, this combination of abilities makes mod_wsgi a somewhat more flexible platform than other available solutions for developing WSGI applications using Apache. At the same time, because everything is in a single package all managed through Apache, configuration is much simpler and there is no need to install or manage any distinct back end infrastructure.

So, although my original plans didn't envision incorporating a 'daemon' mode, the effort in adding it has been quite worthwhile, with the elusive goal of a way of providing commodity web hosting for Python applications now perhaps being achievable after all. :-)

Friday, March 30, 2007

Reloading of Python code into web applications.

One of the major complaints with Python web frameworks is the need to restart the application whenever changes are made to the code. To try and avoid restarts or to make it easier to manage, different Python web frameworks and hosting technologies employ a number of different techniques. These include reloading Python modules into the existing running application, using a supervisor process to monitor for code changes and restart the actual server process automatically, or simply providing a means for a normal user, as opposed to a super user, to completely restart the web server.

As far as reloading Python modules into the existing running application goes, the most developed example of this technique is the module importer present in mod_python 3.3. The module importer in mod_python is different to other more simplistic module importers as it tracks the parent/child relationships between imported modules. This information means the module importer can determine that it needs to reload a top level request handler module even though the module itself hasn't changed but where some other module it is dependent on has changed.

Whereas module importers normally keep modules in sys.modules and they must all be uniquely named, the mod_python module importer also avoids a whole host of problems by not holding the web applications modules in sys.modules but in a separate caching system whereby they are identified by the absolute path name of the modules code file. This means it is possible for the same name to be used for a code file in multiple directories without the need to artificially hold modules in a Python package to avoid name space collisions. When a module is reloaded it is also not reloaded on top of the existing module as is done for modules in sys.modules, therefore eliminating problems with module pollution when attributes are deleted from the code but still exist in the loaded module, as well as various multi-threading problems which can arise due to reloading new code and data on top of existing code and data that may be getting used at the time.

All these features do come with some cost in performance. More of an issue though is that the importer is bound to mod_python and thus is only of use in web applications which themselves bind closely to the mod_python API. As such, although mod_python has this quite sophisticated module importer, it is absolutely useless if you are running some WSGI based application on top of mod_python as by the very nature of what WSGI is, such an application will only use features that can be hosted on top of multiple hosting technologies so can't make use of it.

One could separate the module importer from mod_python and make it a standalone package, but even then you are unlikely to see it adopted by any of the existing Python web frameworks. This is because these Python web frameworks already have their own way of doing things, and even if the module importer may be a better solution, to use the module importer would more than likely break compatibility of older applications and require users to perform some restructuring of their code. Use of the module importer may also not be able to be made totally transparent and thus one would be forcing new concepts on the user that they have to deal with. Finally, as good as the mod_python module importer may be, it still isn't going to be suitable in all situations, and thus you will still end up with some subset of modules that cannot be safely reloaded into an existing running application.

So, all in all it is very unlikely that one will see any form of sophisticated module importer system for reloading modules on the fly that works properly and is used as some sort of standard across the various Python web frameworks. Instead one will continue to see half baked solutions which don't really work. This will not necessarily be from lack of trying on the efforts of the people implementing them, it is just that reloading code safely into Python is hard for the general case, if not impossible.

Given the above, the only real practical solution that will work with all Python web frameworks is to throw away the interpreter contents and start over whenever one needs to pick up any code changes. To date this has always meant killing off the whole process and restarting it. This brute force approach may be fine where you manage and control you own web hosting environment, but isn't really practical for all those who rely on shared web hosting implemented using Apache for their web services. This type of service is problematic because the company managing the web server is hardly likely to be amendable to a constant stream of user requests to restart the web server every time they make a change to their code.

Packages for Apache such as mod_fastcgi, mod_scgi and mod_proxy_ajp have tried to address this problem by moving the actual web application into a specialised back end process and merely using Apache as a proxy for requests. Again using proxying, one could even use another web server as the back end process.

By using a back end process in this way a number of problems can be solved. The first is that because the back end process can be dedicated to a specific user and run as that user, control for restarting it can then be handed to the user. The second problem that is solved is that any code is no longer executing as the user that Apache would run as. This eliminates problems with user code accessing parts of the file system they potentially shouldn't and with user code interfering with a different users code as they can be given their own dedicated file system space to write to.

Although these provide the control that a user needs, a solution that doesn't just use another Apache server instance as the back end process is going to be foreign to most web hosting companies and can as a result be be hard for them to setup. This is because not only does one need to build and install the required Apache module, there are multiple choices as to how to implement the back end parts of the system and it may not be obvious as to why one should be used over another. This isn't made any better by virtue of a lack of good solid documentation and less than adequate support for running and managing such systems. For a web hosting company that wants something that is quick and cheap to get working and requires little management they currently appear not to be particularly attractive solutions.

In terms of how else one can solve this problem, the only other alternative for Python is that instead of killing off the whole process which is hosting a web application, one could just destroy the particular interpreter instance within the running process. If one is to pursue this approach, what is required though is the ability to be able to create and control additional Python sub interpreters and be able to run the whole web application or parts of it, inside of a sub interpreter rather than the main Python interpreter. Having that, it would then be possible to kill off a particular sub interpreter thereby destroying that part of the web application and recreate it in a new sub interpreter using the new code base.

At present your standard Python runtime doesn't support such manipulation of Python sub interpreters. Using Python sub interpreters is not new though, with mod_python using sub interpreters to provide separation between web applications or parts of them. What mod_python doesn't allow though is for new sub interpreters to be able to be created and used from a web application itself in some way.

That there is currently no way of manipulating Python sub interpreters from within a Python application doesn't mean it can't be done though. All that is required is a C extension module that provides a means of creating the sub interpreters and then mediating a way of making a call from one sub interpreter to another. Although it sounds simple in practice, there are various gotchas in getting it to work correctly. It also potentially opens up a big can of worms due to issues that can arise with sharing objects between sub interpreters as well as safely managing the destruction of interpreters.

That is it for now though. In the next blog installment I'll go more into this idea of using disposable sub interpreters within Python web applications, explaining the various problems and also showing examples of some working code. Will also discuss how one could constrain the idea so as to make it a moderately safe technique to make use of in mod_wsgi and possibly other WSGI application stacks.