Every so often one will see posts from people trying to encourage others to contribute back to open source communities, such as that for Python. Over the years I feel I have done my fair share, and having done that, I can't help but feel a bit jaded these days about where things are currently at. To explain why, let me chronicle how I see things as having evolved over the years.
When Python first came out I recollect sucking down the source distribution as separate little shar files that were posted onto the alt.sources USENET news group. There was a bit of documentation with it, but not much. If you had a problem you hoped you could work it out yourself as there weren't really any forums back then where you could go and ask for help about it.
A few years later the comp.lang.python news group was created and there was at least then somewhere you could go. You could try and trawl through the archives for the news group if you had access, but history was generally limited by how much your news server kept and it wasn't exactly the best resource for trying to find out much about what others were trying to use Python for. Because help was hard to get, one really appreciated it when you did get an answer to your problem and people showed that appreciation.
The Python news group has always been a quite civilised place in that respect, but the same can't be said of all news groups at the time. People often only had small pipes down which news feeds came, so the last thing you wanted was people being idiots in the way they dealt with others on the news group, or as far as what expectations they had over what you would do for them. News group etiquette evolved, including how you could help yourself by asking good questions so as to get the best response. Such good practices were collated in posts like How To Ask Questions The Smart Way.
Some books eventually became available and these were the first really broad resource where you could learn about the language, but also about the myriad of things you could do with it. If you were like me, you would buy up every book you could find on the topic in the hope that you would find that one nugget of information in it that would help you with the specific problems you were working on.
Next up came what we know today as the Internet and that is when people started posting online their own resources, be it articles on how to do something, or their own code libraries. Initially you still had to know where to go to find this information, but by asking on news groups and looking through hand crafted meta index sites you could be lucky. This was all made a lot easier when search engines such as AltaVista and eventually Google came along, albeit that you still had to actually do the research to find what you wanted.
News groups and also mailing lists were still important forums when you had questions, but these partly started to be replaced by question and answer sites. Right now the most prominent question and answer site for developers is Stack Overflow. Unfortunately, such sites have also been the start of the rot that is setting in.
In the past when people freely gave up their time to help others by answering questions, or posting useful information online, it was genuinely because they did want to contribute back. Sure some may have done it just to make a name for themselves, but I would say that most would do it because it felt good to be able to help. Quite often the people who were answering were more than just knowledgeable people, but the actual people who wrote the software you were trying to use.
The question and answer sites have changed that now. These days it is a game, where the goal is to win as many badges and accrue as many points as possible. This is more and more devaluing the responses one does get. This is because those who are answering are increasingly not those knowledgeable people who actually understand the problem and know the answers, but the people who are best at using Google to find the answers from elsewhere and then cut and paste them as the answer.
A lot of the time this does actually provide the answer someone wants, but at this point the person who is asking the question is no longer dealing with someone who genuinely cares about understanding their problem properly and coming up with the best solution. No longer can one see the person answering as a trusted advisor with whom you might form an ongoing relationship, and this I believe is having the effect of changing the behaviour of the people asking questions. This attitude isn't limited just to question and answer sites like Stack Overflow, but is starting to leach out into other more traditional forums such as news groups, mailing lists and IRC.
Increasingly, those who have the questions are just treating all these forums like a help desk. They no longer expect to encounter the real experts, nor try and form any relationship with those who may be trying to help them. They are just in there to get what they want, leave and get on with their work.
More and more they cannot even be bothered to try and research the problem themselves by using Google, nor even explain what their problem is properly. The idea therefore of asking questions in a smart way so as to get the best response is vanishing. Instead people will just throw a question out there in the hope they get an answer back by the time they have come back from getting their coffee.
To me this is slowly destroying the relationship building which actually creates a good community. People asking questions just don't appreciate any more the people who are trying to help, and those knowledgeable people who were once willing to help can no longer be bothered because it more and more comes across as a thankless task. What is enjoyable in replying with an answer which is the same as saying Let Me Google That For You?
So it is all well and good to try and get in and contribute back to a community, but if you want to feel you are getting something out of it and feel that you are building those good relationships which are the foundation of any good community, you should be careful about how you go about contributing back.
Personally I wouldn't bother with Stack Overflow unless you want to play the game. Commenting on posts in Reddit isn't much better. IRC has a better level of community interaction, but sometimes channels are dominated by a small group of individuals who can be quite biased in their opinions, which can be a turn off and not really that productive when someone wants to find out about something that those dominant people don't like.
To me therefore, mailing lists or news groups with a large diverse following are still the best forums available if you really want to interact with and become a part of the community. Local Python user groups and attending conferences can also be worthwhile, but you have to work a lot harder by reaching out rather than just sitting back and watching.
Whether things will get worse or better I don't know. Right now though things definitely seem out of balance to me. I just hope that it doesn't become the new status quo. The Python community has always been an open and accepting one, and I wouldn't like to see it dragged down because of changing attitudes brought on by quick fix question and answer sites.
Graham Dumpleton
Wednesday, December 5, 2012
Sunday, October 14, 2012
WSGI middleware and the hidden write() callable.
When I posted recently about the obligations of a WSGI server or middleware to call close() on the iterable returned from a WSGI application, I posted a pattern for a WSGI middleware of:
class Middleware(object):
    def __init__(self, application):
        self.application = application
    def __call__(self, environ, start_response):
        iterable = None
        try:
            iterable = self.application(environ, start_response)
            for data in iterable:
                yield data
        finally:
            if hasattr(iterable, 'close'):
                iterable.close()
In this example, as each block of data was consumed from the iterable of the WSGI application, it was immediately yielded. The use of yield in turn caused the result returned from the '__call__()' method to be a generator satisfying the requirement that a WSGI application or middleware return an iterable.
This WSGI middleware will work for the specific case of response content being passed through unmodified, but things get a bit more complicated if the middleware needs to actually modify the response content.
Consider for example the following.
class Middleware(object):
    def __init__(self, application):
        self.application = application
    def transform(self, data):
        return data.upper()
    def __call__(self, environ, start_response):
        iterable = None
        try:
            iterable = self.application(environ, start_response)
            for data in iterable:
                yield self.transform(data)
        finally:
            if hasattr(iterable, 'close'):
                iterable.close()
Although it looks entirely reasonable, this WSGI middleware is not correct.
What is missing from this WSGI middleware is handling of the 'write()' callable that is returned by 'start_response()' and which can be used as an alternative to returning data via the iterable returned from the WSGI application.
If the above WSGI middleware were used to wrap a WSGI application which used that 'write()' callable, then any data passed to 'write()' would bypass the middleware and so wouldn't actually be transformed as intended.
Having to deal with the 'write()' callable in WSGI middleware is a pain and makes writing WSGI middleware unduly complicated, as you need to track two data paths through the WSGI middleware. It is even possible that response content may be provided by both 'write()' and via the iterable for the same request.
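To see the problem concretely, here is a minimal sketch. The names 'NaiveMiddleware', 'legacy_app' and 'run' are hypothetical and invented purely for illustration, and 'run' is only a crude stand-in for a WSGI server, not any real server's API. It assumes Python 3, where response data is bytes.

```python
class NaiveMiddleware(object):
    # Same shape as the middleware above: only the iterable is transformed.
    def __init__(self, application):
        self.application = application

    def transform(self, data):
        return data.upper()

    def __call__(self, environ, start_response):
        iterable = None
        try:
            iterable = self.application(environ, start_response)
            for data in iterable:
                yield self.transform(data)
        finally:
            if hasattr(iterable, 'close'):
                iterable.close()


def legacy_app(environ, start_response):
    # Hypothetical application sending part of the response via write().
    write = start_response('200 OK', [('Content-Type', 'text/plain')])
    write(b'written ')
    return [b'returned']


def run(application):
    # Hypothetical stand-in for a WSGI server: collects everything sent.
    chunks = []

    def start_response(status, response_headers, exc_info=None):
        # The write() callable handed back simply records the data.
        return chunks.append

    result = application({}, start_response)
    try:
        for data in result:
            chunks.append(data)
    finally:
        if hasattr(result, 'close'):
            result.close()
    return b''.join(chunks)


output = run(NaiveMiddleware(legacy_app))
print(output)  # b'written RETURNED'
```

Only the data returned via the iterable has been upper cased. The data passed to 'write()' went straight to the server untouched.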
A revised version of the WSGI middleware which supports the 'write()' callable is:
class Middleware(object):
    def __init__(self, application):
        self.application = application
    def transform(self, data):
        return data.upper()
    def __call__(self, environ, start_response):
        def _start_response(status, response_headers, *args):
            write = start_response(status, response_headers, *args)
            def _write(data):
                write(self.transform(data))
            return _write
        iterable = None
        try:
            iterable = self.application(environ, _start_response)
            for data in iterable:
                yield self.transform(data)
        finally:
            if hasattr(iterable, 'close'):
                iterable.close()
That 'write()' exists, and how to deal with it in WSGI middleware which transforms a response, is often glossed over in tutorials on WSGI. It is somewhat lucky then that most people resort to using web frameworks, because this is a point which is easy to get wrong, with the WSGI middleware being non compliant if 'write()' is not supported.
Note that in this example the transformation being done does not modify the amount of data returned for the response. If the amount of data being returned is being modified then additional steps need to be taken to ensure a correct response. Options around what needs to be done where a WSGI middleware is changing the content length will be the subject of a future post.
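As a rough check of the revised middleware, the following sketch drives it with an application that uses both data paths. The 'mixed_app' application and the 'collect' driver are hypothetical names made up for this illustration, with 'collect' standing in for a real WSGI server. Python 3 and bytes response data are assumed.

```python
class Middleware(object):
    def __init__(self, application):
        self.application = application

    def transform(self, data):
        return data.upper()

    def __call__(self, environ, start_response):
        def _start_response(status, response_headers, *args):
            write = start_response(status, response_headers, *args)

            def _write(data):
                # Data sent via write() goes through the same transform.
                write(self.transform(data))
            return _write

        iterable = None
        try:
            iterable = self.application(environ, _start_response)
            for data in iterable:
                yield self.transform(data)
        finally:
            if hasattr(iterable, 'close'):
                iterable.close()


def mixed_app(environ, start_response):
    # Hypothetical application using write() and the iterable together.
    write = start_response('200 OK', [('Content-Type', 'text/plain')])
    write(b'written ')
    return [b'returned']


def collect(application):
    # Hypothetical stand-in for a WSGI server: gathers all response data.
    chunks = []

    def start_response(status, response_headers, exc_info=None):
        return chunks.append

    result = application({}, start_response)
    try:
        for data in result:
            chunks.append(data)
    finally:
        if hasattr(result, 'close'):
            result.close()
    return b''.join(chunks)


output = collect(Middleware(mixed_app))
print(output)  # b'WRITTEN RETURNED'
```

Both the data passed to 'write()' and the data returned via the iterable now pass through 'transform()'.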
Saturday, October 13, 2012
Why are you using embedded mode of mod_wsgi?
If you are using Apache/mod_wsgi and when you set it up all you cared about was getting something working, but didn't care much for understanding how things worked and how you should set it up, chances are you are running your WSGI application in embedded mode of mod_wsgi. If you don't understand how to set up Apache, using embedded mode is nearly always a bad idea. Much more so if the Apache installation you are using is the default installation offered up by many Linux distributions, because in that case you are likely running the Apache prefork MPM, a choice which only compounds the problems you can experience.
The preferred mode of running a WSGI application under mod_wsgi, if you have no idea of how to set up and tune Apache to match your specific web application, is daemon mode. Even if you reckon you do know how to set up Apache, it is still safer to use daemon mode.
If daemon mode is a better option, why isn't it the default?
This unfortunate situation whereby embedded mode is the default came about because in the very first incarnation of mod_wsgi it was designed to mimic what mod_python did. As a result, it only supported the concept of embedded mode. This is where the WSGI application runs within the actual Apache child processes, the same processes which are also handling serving of static file requests.
Although daemon mode, which is more akin to how FastCGI works with the WSGI application running in separate dedicated processes, was added later, embedded mode was already the default and it was hard to change at that point. Daemon mode also needed additional configuration whereas with embedded mode, things would at least run out of the box. Under Windows only embedded mode is supported, so having daemon mode be the default on UNIX systems but embedded mode the default on Windows was also seen as confusing.
Why is running in embedded mode so bad?
The problems with embedded mode aren't so much due to the fact that the WSGI application is running in the actual Apache child processes, but that management of the processes is done by Apache and as such is subject to the general MPM settings of Apache. For the typical default Apache configuration the MPM settings are set up for serving of static files. The settings are not necessarily going to work very well for a dynamic web application with a large memory footprint that performs better when kept persistent in memory, as is the case for the majority of Python web applications.
PHP gets away with running okay within the Apache child processes because of how PHP was designed to work. Specifically in PHP, any application code is effectively reloaded on each request and so it has been optimised in various ways to perform adequately under that scenario. Python, being a general purpose programming language adapted to run web applications, has a much larger startup cost for both the interpreter and for loading up a web application. Certain aspects of how the Python interpreter is implemented, and how loading of Python modules is managed, also mean it is not possible to use some of the techniques that PHP uses around preloading of the interpreter and code modules prior to forking of the Apache child processes. Python in Apache can simply therefore never match PHP's efficiency in this respect.
The big problem therefore is simply the default configuration. If the MPM settings are properly set up and tuned for your specific Python web application running under embedded mode, then embedded mode will perform better than daemon mode. Use the default settings or don't configure it properly and you risk setting yourself up for a world of hurt.
How should embedded mode be configured?
How best to configure the MPM settings of Apache when running a WSGI application in embedded mode is not the point of this post. I will deal with that another time. The point of this post is to help you identify when you are running embedded mode and show you how to set up daemon mode in its basic configuration instead. You can at least then shift away from using embedded mode if you didn't even realise you were using it and avoid causing problems for yourself.
Determining if embedded mode is being used.
To determine if your WSGI application is running in embedded mode, replace its WSGI script with the test WSGI script as follows:
def application(environ, start_response):
    status = '200 OK'
    name = repr(environ['mod_wsgi.process_group'])
    output = 'mod_wsgi.process_group = %s' % name
    # WSGI response bodies must be byte strings, so encode for Python 3.
    output = output.encode('utf-8')
    response_headers = [('Content-type', 'text/plain'),
                        ('Content-Length', str(len(output)))]
    start_response(status, response_headers)
    return [output]
If the configuration is such that the WSGI application is running in embedded mode, then you will see:
mod_wsgi.process_group = ''
That is, the process group definition will be an empty string.
If instead you are already running in the preferred daemon mode, you would see a non empty string giving the name of the daemon process group.
Identifying which Apache MPM you are using.
Even if your WSGI application is running in daemon mode, if the only thing you are using Apache for is to host the WSGI application and serve static files, then it is also recommended that you use worker MPM rather than prefork MPM as worker MPM will cut down on memory use by the Apache child processes.
To determine if you are using prefork MPM or worker MPM, you could try and work it out by looking at what operating system packages are installed, but the definitive way of doing it is to run the Apache binary with the '-V' option.
$ /usr/sbin/httpd -V
Server version: Apache/2.2.14 (Unix)
Server built: Feb 10 2010 22:22:39
Server's Module Magic Number: 20051115:23
Server loaded: APR 1.3.8, APR-Util 1.3.9
Compiled using: APR 1.3.8, APR-Util 1.3.9
Architecture: 64-bit
Server MPM: Prefork
threaded: no
forked: yes (variable process count)
Server compiled with....
-D APACHE_MPM_DIR="server/mpm/prefork"
-D APR_HAS_SENDFILE
-D APR_HAS_MMAP
-D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
-D APR_USE_FLOCK_SERIALIZE
-D APR_USE_PTHREAD_SERIALIZE
-D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
-D APR_HAS_OTHER_CHILD
-D AP_HAVE_RELIABLE_PIPED_LOGS
-D DYNAMIC_MODULE_LIMIT=128
-D HTTPD_ROOT="/usr"
-D SUEXEC_BIN="/usr/bin/suexec"
-D DEFAULT_PIDLOG="/private/var/run/httpd.pid"
-D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
-D DEFAULT_LOCKFILE="/private/var/run/accept.lock"
-D DEFAULT_ERRORLOG="logs/error_log"
-D AP_TYPES_CONFIG_FILE="/private/etc/apache2/mime.types"
-D SERVER_CONFIG_FILE="/private/etc/apache2/httpd.conf"
The 'Server MPM' field will tell you which MPM your Apache has been compiled for.
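If you want to pull that field out programmatically, a small sketch along the following lines may help. The function name 'parse_server_mpm' is made up for this illustration, and the sample text is just a shortened copy of the output shown above. It assumes the 'Server MPM' line has the same layout as in that output.

```python
import re


def parse_server_mpm(version_output):
    # Return the value of the 'Server MPM' line, or None if it is absent.
    match = re.search(r'^\s*Server MPM:\s*(\S+)', version_output,
                      re.MULTILINE)
    return match.group(1) if match else None


# Shortened copy of the 'httpd -V' output shown above.
sample = """Server version: Apache/2.2.14 (Unix)
Architecture: 64-bit
Server MPM: Prefork
threaded: no
forked: yes (variable process count)
"""

mpm = parse_server_mpm(sample)
print(mpm)  # Prefork
```

To feed it real output you could capture the result of running the Apache binary with something like subprocess.check_output(), substituting whatever path the binary has on your system.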
If for some reason you can't work out which is the Apache binary, because your Linux distribution calls it something other than 'httpd', or they have modified it so it will not run unless some magic environment variables are set, then you can also guess what is running by using the following WSGI script.
def application(environ, start_response):
    status = '200 OK'
    output = 'wsgi.multithread = %s' % repr(environ['wsgi.multithread'])
    # WSGI response bodies must be byte strings, so encode for Python 3.
    output = output.encode('utf-8')
    response_headers = [('Content-type', 'text/plain'),
                        ('Content-Length', str(len(output)))]
    start_response(status, response_headers)
    return [output]
If you get the output:
wsgi.multithread = True
you are likely running worker MPM and otherwise are running prefork MPM.
Running a WSGI application in daemon mode.
To force a WSGI application to run in daemon mode, the WSGIDaemonProcess and WSGIProcessGroup directives need to be defined. For example, to set up a daemon process group containing two multithreaded processes one could use:
WSGIDaemonProcess example.com processes=2 threads=15
WSGIProcessGroup example.com
The WSGIDaemonProcess directive specifies the details of the daemon process group. The WSGIProcessGroup indicates that any WSGI application specified within the same context is to be delegated to run in that daemon process group.
A complete virtual host configuration for this type of setup would therefore be something like:
<VirtualHost *:80>
    ServerName www.example.com
    ServerAlias example.com
    ServerAdmin webmaster@example.com
    DocumentRoot /usr/local/www/documents
    Alias /robots.txt /usr/local/www/documents/robots.txt
    Alias /favicon.ico /usr/local/www/documents/favicon.ico
    Alias /media/ /usr/local/www/documents/media/
    <Directory /usr/local/www/documents>
        Order allow,deny
        Allow from all
    </Directory>
    WSGIDaemonProcess example.com processes=2 threads=15 display-name=%{GROUP}
    WSGIProcessGroup example.com
    WSGIScriptAlias / /usr/local/www/wsgi-scripts/myapp.wsgi
    <Directory /usr/local/www/wsgi-scripts>
        Order allow,deny
        Allow from all
    </Directory>
</VirtualHost>
After appropriate changes have been made Apache will need to be restarted. For this example, the URL 'http://www.example.com/' would then be used to access the WSGI application.
Note that you obviously should substitute the paths and hostname with values appropriate for your system.
After making the changes, use the test WSGI script described before to verify the WSGI application is in fact running in daemon mode. It is a common mistake to see people use WSGIDaemonProcess but then not use WSGIProcessGroup or other configuration mechanisms to ensure that the WSGI application is in fact delegated to the daemon process group, so double check you got the configuration correct.
Immediate benefits of using daemon mode.
When you use daemon mode, the number of processes and threads is static. This is one of the immediate benefits of using daemon mode: process management is more predictable. One of the big problems with using embedded mode is that Apache can decide to create additional processes or kill off existing ones. For a web application with large startup costs this is not a good idea, as you could suddenly see increased CPU usage due to more processes being started right at the time you don't need it, such as when a throughput spike occurs. This can actually cause performance to degrade in the short term rather than improve.
If using embedded mode and you need to update the code for your Python web application, you have no choice but to restart the whole of Apache. If using daemon mode, you can avoid restarting the whole of Apache and can instead simply touch the WSGI script file to update its modification date. This will have the side effect of causing the daemon processes to restart on the next request. This is also convenient when using Apache/mod_wsgi as a development environment to ensure parity with your production environment.
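Touching the WSGI script file can be done from the shell, or from Python as part of a deployment script. The following is a minimal sketch using only the standard library. The 'touch' helper is a made up name, and a temporary file stands in for the real WSGI script so that the example can be run anywhere.

```python
import os
import tempfile


def touch(path):
    # Set the access and modification times of the file to the current time.
    os.utime(path, None)


# A temporary file stands in for the real WSGI script file.
with tempfile.NamedTemporaryFile(suffix='.wsgi', delete=False) as handle:
    script_path = handle.name

os.utime(script_path, (0, 0))  # pretend the file was last modified long ago
before = os.path.getmtime(script_path)

touch(script_path)
after = os.path.getmtime(script_path)

os.remove(script_path)
print(after > before)  # True
```

In a real deployment you would call 'touch()' with the path of your own WSGI script file after updating the application code.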
Additional reference documentation.
The information above gives a quick heads up on how to check whether you are running in embedded mode and how instead to get your WSGI application running in daemon mode. For additional information also read:
Related blog posts which would be worthwhile reading are:
The most up to date version of mod_wsgi is 3.4. This was only released recently and the majority of distributions will not as yet have it as a packaged binary. You should at least aim to use mod_wsgi 3.3. If you are on a Linux distribution which is still only supplying mod_wsgi 2.8 or older, you really should think about upgrading to a more modern operating system distribution as 2.8 was released almost 3 years ago.
Thursday, October 11, 2012
Obligations for calling close() on the iterable returned by a WSGI application.
Despite the WSGI specification having been around for so long, one keeps seeing instances where it is implemented wrongly. At New Relic where I work on the Python agent, we have been caught by this a number of times. The problem is that if people don't implement a WSGI server or WSGI middleware per the requirements of the specification, our Python agent for web application performance monitoring will not always work as intended.
This is especially the case in relation to the obligations of a WSGI server or WSGI middleware to call the close() method, if it exists, on the iterable object returned by the WSGI application. This caused us ongoing issues for a while with uWSGI, but we worked with Roberto, the uWSGI author, and managed to get the issues resolved, finally being completely fixed (we hope) in uWSGI 1.2.6.
The next notable example we ran up against where a major package isn't implementing the WSGI specification correctly is Raven, the Python client for Sentry. We have reported this to David, the Sentry author, but since there is some confusion about what the code should be doing, I am going to use it as an example to explain what the obligations of a WSGI middleware are in relation to the close() method, hopefully so I don't have to keep explaining it.
The current code for the Sentry client in Raven is:
class Sentry(object):
    """
    A WSGI middleware which will attempt to capture any
    uncaught exceptions and send them to Sentry.
    >>> from raven.base import Client
    >>> application = Sentry(application, Client())
    """
    def __init__(self, application, client):
        self.application = application
        self.client = client
    def __call__(self, environ, start_response):
        try:
            for event in self.application(environ, start_response):
                yield event
        except Exception:
            exc_info = sys.exc_info()
            self.handle_exception(exc_info, environ)
            exc_info = None
            raise
    def handle_exception(self, exc_info, environ):
        event_id = self.client.capture('Exception',
            exc_info=exc_info,
            data={
                'sentry.interfaces.Http': {
                    'method': environ.get('REQUEST_METHOD'),
                    'url': get_current_url(environ, strip_querystring=True),
                    'query_string': environ.get('QUERY_STRING'),
                    # TODO
                    # 'data': environ.get('wsgi.input'),
                    'headers': dict(get_headers(environ)),
                    'env': dict(get_environ(environ)),
                }
            },
        )
        return event_id
The reason this code is wrong is because it does not satisfy the following requirement from the WSGI specification:
In other words, although a server or gateway must ensure that close() is called on the iterable returned by the application, a WSGI middleware must also ensure that it does the same for any iterable it may consume from a wrapped WSGI application or component.
The code:
for event in self.application(environ, start_response):
yield event
is therefore incomplete, because although it consumes the iterable returned by the wrapped WSGI application, it does not ensure close() is called on it upon completion, or in the event of any errors.
There are numerous ways one can structure a WSGI middleware, but following the general pattern used by Raven, a WSGI middleware that does ensure that close() is called would be implemented as follows.
class Middleware1(object):
def __init__(self, application):
self.application = application
def __call__(self, environ, start_response):
iterable = None
try:
iterable = self.application(environ, start_response)
for data in iterable:
yield data
finally:
if hasattr(iterable, 'close'):
iterable.close()
Important to note here is that the act of calling the wrapped application to obtain the iterable has been separated from the process of iterating over it. This is necessary in order that we have a reference to the iterable to call close() in the three cases necessary, they being an exception occurring when the actual iterator object itself is being created from the iterable object, if an exception occurs when getting the next item from the iterator object and finally upon the last item being yielded from the iterator.
Now because WSGI middleware weren't specifically mentioned in the requirement, it does open up a slight grey area. The problem is the statement:
For a WSGI middleware written as above, that isn't necessarily the point at which it would be called. Instead the close() method would get called when the 'for' loop has completed. We have therefore lost any direction association between when close() is called by the WSGI server itself upon actual completion of the request and when it is called by the WSGI middleware.
Overall, for WSGI middleware it probably isn't a critical issue and having it called immediately the 'for' loop exits is fine. If it were important that close() be directly chained, then it would be necessary to implement it differently, instead using.
class Iterable2(object):
def __init__(self, iterable):
self.iterable = iterable
if hasattr(iterable, 'close'):
self.close = iterable.close
def __iter__(self):
for data in self.iterable:
yield data
class Middleware2(object):
def __init__(self, application):
self.application = application
def __call__(self, environ, start_response):
return Iterable2(self.application(environ, start_response))
That way the close() method is only called for the iterable returned from the wrapped application when close() is called by the WSGI server, or any further WSGI middleware that wraps this one.
Requiring two classes like this does complicate the implementation of the WSGI middleware however, because both may need to track state for the current request as the iterable is consumed.
Assuming the first pattern for implementing the WSGI middleware is okay, the existing Sentry client in Raven would be rewritten as follows.
class Sentry(object):
def __init__(self, application, client):
self.application = application
self.client = client
def __call__(self, environ, start_response):
iterable = None
try:
iterable = self.application(environ, start_response)
for event in iterable:
yield event
except Exception:
exc_info = sys.exc_info()
self.handle_exception(exc_info, environ)
exc_info = None
raise
finally:
if hasattr(iterable, 'close'):
iterable.close()
def handle_exception(self, exc_info, environ):
...
This provides the same functionality as it originally performed, but also ensures the close() method is called correctly if the iterable provides one.
We are not done though, because technically an exception could be raised by the close() method when it is called. Presumably it would be desirable for this also to be captured and reported to Sentry. The more complete solution which does this is:
class Sentry(object):
def __init__(self, application, client):
self.application = application
self.client = client
def __call__(self, environ, start_response):
iterable = None
try:
iterable = self.application(environ, start_response)
for event in iterable:
yield event
except Exception:
exc_info = sys.exc_info()
self.handle_exception(exc_info, environ)
exc_info = None
raise
finally:
if hasattr(iterable, 'close'):
try:
iterable.close()
except Exception:
exc_info = sys.exc_info()
self.handle_exception(exc_info, environ)
exc_info = None
raise
def handle_exception(self, exc_info, environ):
...
So are you implementing your WSGI middleware correctly?
This is especially the case in relation to the obligations of a WSGI server or WSGI middleware to call the close() method, if it exists, on the iterable object returned by the WSGI application. This caused us ongoing issues for a while with uWSGI, but we worked with Roberto, the uWSGI author, and managed to get the issues resolved, finally being completely fixed (we hope) in uWSGI 1.2.6.
The next notable example we ran up against where a major package isn't implementing the WSGI specification correctly is Raven, the Python client for Sentry. We have reported this to David, the Sentry author, but since there is some confusion about what the code should be doing, I am going to use it as an example to explain what the obligations of a WSGI middleware are in relation to the close() method, hopefully so I don't have to keep explaining it.
What Raven currently does wrong.
The current code for the Sentry client in Raven is:
class Sentry(object):
    """
    A WSGI middleware which will attempt to capture any
    uncaught exceptions and send them to Sentry.

    >>> from raven.base import Client
    >>> application = Sentry(application, Client())
    """

    def __init__(self, application, client):
        self.application = application
        self.client = client

    def __call__(self, environ, start_response):
        try:
            for event in self.application(environ, start_response):
                yield event
        except Exception:
            exc_info = sys.exc_info()
            self.handle_exception(exc_info, environ)
            exc_info = None
            raise

    def handle_exception(self, exc_info, environ):
        event_id = self.client.capture('Exception',
            exc_info=exc_info,
            data={
                'sentry.interfaces.Http': {
                    'method': environ.get('REQUEST_METHOD'),
                    'url': get_current_url(environ, strip_querystring=True),
                    'query_string': environ.get('QUERY_STRING'),
                    # TODO
                    # 'data': environ.get('wsgi.input'),
                    'headers': dict(get_headers(environ)),
                    'env': dict(get_environ(environ)),
                }
            },
        )
        return event_id
The reason this code is wrong is that it does not satisfy the following requirement from the WSGI specification:
"""If the iterable returned by the application has a close() method, the server or gateway must call that method upon completion of the current request, whether the request was completed normally, or terminated early due to an application error during iteration or an early disconnect of the browser."""
What is not entirely clear in the language used here, specifically there only being a reference to 'the server or gateway', is that this obligation extends to WSGI middleware as well.
In other words, although a server or gateway must ensure that close() is called on the iterable returned by the application, a WSGI middleware must also ensure that it does the same for any iterable it may consume from a wrapped WSGI application or component.
The code:
for event in self.application(environ, start_response):
    yield event
is therefore incomplete, because although it consumes the iterable returned by the wrapped WSGI application, it does not ensure close() is called on it upon completion, or in the event of any errors.
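To see the omission in action, here is a minimal sketch. It is not taken from Raven: TrackedIterable, demo_app and passthrough are hypothetical names invented for the illustration, and passthrough copies nothing more than the loop shown above. The stand-in server does its part by calling close() on what it was handed, yet the iterable from the wrapped application is never closed.

```python
class TrackedIterable(object):
    """Hypothetical response iterable that records whether close() ran."""

    def __init__(self, chunks):
        self.chunks = chunks
        self.closed = False

    def __iter__(self):
        return iter(self.chunks)

    def close(self):
        self.closed = True


def demo_app(environ, start_response):
    # Hypothetical WSGI application returning an iterable with close().
    start_response('200 OK', [('Content-Type', 'text/plain')])
    response = TrackedIterable([b'hello', b'world'])
    environ['demo.response'] = response  # stashed so it can be inspected
    return response


def passthrough(application, environ, start_response):
    # Same shape as the loop above: consume and re-yield, nothing else.
    for event in application(environ, start_response):
        yield event


environ = {}
outer = passthrough(demo_app, environ, lambda status, headers: None)
body = list(outer)

# Stand-in for the WSGI server: it closes the generator it was given.
outer.close()

# The iterable returned by the wrapped application was never closed.
inner_closed = environ['demo.response'].closed
print(body, inner_closed)
```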
Pattern for a generic WSGI middleware.
There are numerous ways one can structure a WSGI middleware, but following the general pattern used by Raven, a WSGI middleware that does ensure that close() is called would be implemented as follows.
class Middleware1(object):
    def __init__(self, application):
        self.application = application

    def __call__(self, environ, start_response):
        iterable = None
        try:
            iterable = self.application(environ, start_response)
            for data in iterable:
                yield data
        finally:
            if hasattr(iterable, 'close'):
                iterable.close()
Important to note here is that the act of calling the wrapped application to obtain the iterable has been separated from the process of iterating over it. This is necessary so that we hold a reference to the iterable and can call close() in all three cases where it is required: when an exception occurs while the actual iterator object is being created from the iterable object, when an exception occurs while getting the next item from the iterator object, and finally upon the last item being yielded from the iterator.
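The three cases can be exercised with a small sketch. Middleware1 is repeated so the example is self-contained, while Recorder and run are hypothetical helpers made up for this illustration only. Each run reports whether the request completed or raised, and whether close() was called on the iterable.

```python
class Middleware1(object):
    # Same pattern as above, repeated so the sketch is self-contained.
    def __init__(self, application):
        self.application = application

    def __call__(self, environ, start_response):
        iterable = None
        try:
            iterable = self.application(environ, start_response)
            for data in iterable:
                yield data
        finally:
            if hasattr(iterable, 'close'):
                iterable.close()


class Recorder(object):
    """Hypothetical iterable that can fail at a chosen point."""

    def __init__(self, chunks, fail_on_iter=False, fail_on_next=False):
        self.chunks = chunks
        self.fail_on_iter = fail_on_iter
        self.fail_on_next = fail_on_next
        self.closed = False

    def __iter__(self):
        if self.fail_on_iter:
            raise RuntimeError('cannot create iterator')
        return self._generate()

    def _generate(self):
        for chunk in self.chunks:
            yield chunk
            if self.fail_on_next:
                raise RuntimeError('failed getting next item')

    def close(self):
        self.closed = True


def run(recorder):
    # Drive the middleware roughly as a server would and report the outcome.
    app = lambda environ, start_response: recorder
    try:
        list(Middleware1(app)({}, None))
        outcome = 'completed'
    except RuntimeError:
        outcome = 'raised'
    return outcome, recorder.closed


results = [
    run(Recorder([b'a', b'b'])),                     # normal completion
    run(Recorder([b'a', b'b'], fail_on_iter=True)),  # error creating iterator
    run(Recorder([b'a', b'b'], fail_on_next=True)),  # error getting next item
]
print(results)
```

In all three runs the second element of the result is True, which is the property the pattern is meant to guarantee.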
Now because WSGI middleware weren't specifically mentioned in the requirement, it does open up a slight grey area. The problem is the statement:
"""must call that method upon completion of the current request"""
The intent of that statement, at least from the perspective of the WSGI server, is that close() only be called once all content has been consumed from the iterable and that content has been sent to the client.
For a WSGI middleware written as above, that isn't necessarily the point at which it would be called. Instead the close() method would get called when the 'for' loop has completed. We have therefore lost any direct association between when close() is called by the WSGI server itself upon actual completion of the request and when it is called by the WSGI middleware.
Overall, for WSGI middleware it probably isn't a critical issue and having it called immediately after the 'for' loop exits is fine. If it were important that close() be directly chained, then it would be necessary to implement it differently, instead using:
class Iterable2(object):
    def __init__(self, iterable):
        self.iterable = iterable
        if hasattr(iterable, 'close'):
            self.close = iterable.close

    def __iter__(self):
        for data in self.iterable:
            yield data


class Middleware2(object):
    def __init__(self, application):
        self.application = application

    def __call__(self, environ, start_response):
        return Iterable2(self.application(environ, start_response))
That way the close() method is only called for the iterable returned from the wrapped application when close() is called by the WSGI server, or any further WSGI middleware that wraps this one.
Requiring two classes like this does complicate the implementation of the WSGI middleware however, because both may need to track state for the current request as the iterable is consumed.
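The difference in timing can be seen with a short sketch. Iterable2 and Middleware2 are repeated from above so the example stands alone, and Recorder is a hypothetical iterable invented for the illustration. The explicit call to close() at the end is a stand-in for what the WSGI server would do once the request has finished.

```python
class Iterable2(object):
    def __init__(self, iterable):
        self.iterable = iterable
        if hasattr(iterable, 'close'):
            self.close = iterable.close

    def __iter__(self):
        for data in self.iterable:
            yield data


class Middleware2(object):
    def __init__(self, application):
        self.application = application

    def __call__(self, environ, start_response):
        return Iterable2(self.application(environ, start_response))


class Recorder(object):
    """Hypothetical iterable that records whether close() ran."""

    def __init__(self, chunks):
        self.chunks = chunks
        self.closed = False

    def __iter__(self):
        return iter(self.chunks)

    def close(self):
        self.closed = True


recorder = Recorder([b'a', b'b'])
response = Middleware2(lambda environ, start_response: recorder)({}, None)

body = list(response)
# All content has been consumed, but close() has not been called yet.
closed_after_iteration = recorder.closed

# Stand-in for the WSGI server finishing the request.
if hasattr(response, 'close'):
    response.close()
closed_after_server_close = recorder.closed

print(closed_after_iteration, closed_after_server_close)
```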
Correcting the Sentry client in Raven.
Assuming the first pattern for implementing the WSGI middleware is okay, the existing Sentry client in Raven would be rewritten as follows.
class Sentry(object):
    def __init__(self, application, client):
        self.application = application
        self.client = client

    def __call__(self, environ, start_response):
        iterable = None
        try:
            iterable = self.application(environ, start_response)
            for event in iterable:
                yield event
        except Exception:
            exc_info = sys.exc_info()
            self.handle_exception(exc_info, environ)
            exc_info = None
            raise
        finally:
            if hasattr(iterable, 'close'):
                iterable.close()

    def handle_exception(self, exc_info, environ):
        ...
This provides the same functionality as it originally performed, but also ensures the close() method is called correctly if the iterable provides one.
We are not done though, because technically an exception could be raised by the close() method when it is called. Presumably it would be desirable for this also to be captured and reported to Sentry. The more complete solution which does this is:
class Sentry(object):
    def __init__(self, application, client):
        self.application = application
        self.client = client

    def __call__(self, environ, start_response):
        iterable = None
        try:
            iterable = self.application(environ, start_response)
            for event in iterable:
                yield event
        except Exception:
            exc_info = sys.exc_info()
            self.handle_exception(exc_info, environ)
            exc_info = None
            raise
        finally:
            if hasattr(iterable, 'close'):
                try:
                    iterable.close()
                except Exception:
                    exc_info = sys.exc_info()
                    self.handle_exception(exc_info, environ)
                    exc_info = None
                    raise

    def handle_exception(self, exc_info, environ):
        ...
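As a rough check of that control flow, the following sketch uses the same try/except/finally structure but swaps the Sentry client for a plain callable. ReportingMiddleware, FailingClose and record are all hypothetical names for this illustration and are not part of Raven's API. The wrapped iterable yields its content without error and then fails inside close().

```python
import sys


class ReportingMiddleware(object):
    """Same control flow as above, with a plain callable standing in
    for the Sentry client (hypothetical, not Raven's API)."""

    def __init__(self, application, report):
        self.application = application
        self.report = report

    def __call__(self, environ, start_response):
        iterable = None
        try:
            iterable = self.application(environ, start_response)
            for event in iterable:
                yield event
        except Exception:
            self.report(sys.exc_info(), environ)
            raise
        finally:
            if hasattr(iterable, 'close'):
                try:
                    iterable.close()
                except Exception:
                    self.report(sys.exc_info(), environ)
                    raise


class FailingClose(object):
    """Hypothetical iterable whose close() method raises."""

    def __iter__(self):
        return iter([b'ok'])

    def close(self):
        raise ValueError('close failed')


reported = []


def record(exc_info, environ):
    # Stand-in for sending the exception details somewhere.
    reported.append(exc_info[0].__name__)


middleware = ReportingMiddleware(
    lambda environ, start_response: FailingClose(), record)

chunks = []
try:
    for chunk in middleware({}, None):
        chunks.append(chunk)
    raised = False
except ValueError:
    raised = True

print(chunks, reported, raised)
```

The content is still delivered, the failure in close() is reported once, and the exception continues to propagate to whatever is consuming the middleware.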
Check what your WSGI middleware does.
So are you implementing your WSGI middleware correctly?
Monday, October 8, 2012
Requests running in wrong Django instance under Apache/mod_wsgi.
Configuring Apache/mod_wsgi to host multiple Django instances has always been a bit tricky for some. In practice though it should be quite straightforward. For a single Django instance mounted at the root of the web site, the WSGIScriptAlias line would be something like:
WSGIScriptAlias / /some/path/project-1/wsgi.py
If wanting to host a second Django instance under the same host name but at a sub URL, you would use:
WSGIScriptAlias /suburl /some/path/project-2/wsgi.py
WSGIScriptAlias / /some/path/project-1/wsgi.py
If the Django instances are not under the same host, then it would instead simply be a matter of adding them to the respective VirtualHost.
<VirtualHost *:80>
    ServerName site-1.example.com
    WSGIScriptAlias / /some/path/project-1/wsgi.py
    ...
</VirtualHost>
<VirtualHost *:80>
    ServerName site-2.example.com
    WSGIScriptAlias / /some/path/project-2/wsgi.py
    ...
</VirtualHost>
In both cases, whether under the same host name or different ones, both Django instances would run in the same process. Separation would be maintained however by virtue of mod_wsgi running each WSGI application mounted using WSGIScriptAlias in a distinct Python sub interpreter within the processes they are running in.
The directive which controls which named Python sub interpreter within the process is used is WSGIApplicationGroup. The default for this directive is %{RESOURCE}.
For this default value of %{RESOURCE} the sub interpreter name will be constructed from the host name (as specified by the ServerName directive), the port (if not port 80/443) and the value of the WSGI environment variable SCRIPT_NAME as deduced from the URL mount point set by the WSGIScriptAlias directive.
So in the first instance above where both Django instances run under the same host name, the distinct named sub interpreters within the process would be called:
site-1.example.com:/
site-1.example.com:/suburl
In the second instance where they run under separate host names they would be:
site-1.example.com:/
site-2.example.com:/
So as long as you don't fiddle with which sub interpreter is used by specifying the WSGIApplicationGroup directive, mod_wsgi should maintain separation between the multiple Django instances.
What therefore can go wrong and why would requests get routed to the wrong Django instance?
Ordering of WSGIScriptAlias directives.
The first scenario where one may see requests being handled by the wrong Django instance is where the multiple Django instances are running under the same host name and the ordering of the WSGIScriptAlias directives is wrong.
When using the WSGIScriptAlias multiple times under the same host name, it is important that the WSGIScriptAlias for sub URLs comes first.
In other words, the ordering is such that the most deeply nested URLs must come first. If you don't do that, then the shorter URL will match first and take precedence, thereby swallowing up all requests for both Django instances.
For example, if the directives above were instead written as:
WSGIScriptAlias / /some/path/project-1/wsgi.py
WSGIScriptAlias /suburl /some/path/project-2/wsgi.py
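then all requests would get routed into the first Django instance, even those for '/suburl', as the shorter URL of '/' specified with the first WSGIScriptAlias directive would always match, even before attempting to match against '/suburl'.
The solution if this is the cause is to reorder the WSGIScriptAlias directives as appropriate to ensure the longest URLs come first.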
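The effect of the ordering can be modelled with a few lines of Python. This is only a simplified sketch of first-match-wins prefix matching, not how Apache actually implements it, and route is a hypothetical helper made up for the illustration.

```python
def route(path, aliases):
    """Return the script for the first alias whose mount point matches.

    Simplified model of the directives being tried in the order they
    are listed; not Apache's actual implementation.
    """
    for mount, script in aliases:
        prefix = mount.rstrip('/')
        if path == prefix or path.startswith(prefix + '/'):
            return script
    return None


wrong_order = [
    ('/', '/some/path/project-1/wsgi.py'),
    ('/suburl', '/some/path/project-2/wsgi.py'),
]
right_order = list(reversed(wrong_order))

# With '/' listed first, it swallows requests meant for '/suburl'.
print(route('/suburl/page', wrong_order))

# With the most deeply nested URL first, both instances are reachable.
print(route('/suburl/page', right_order))
print(route('/other', right_order))
```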
Leaking of process environment variables.
In order to specify the module that the Django application's settings are contained in, it is necessary to set the process environment variable DJANGO_SETTINGS_MODULE.
import os
os.environ['DJANGO_SETTINGS_MODULE'] = 'mysite.settings'
import django.core.handlers.wsgi
application = django.core.handlers.wsgi.WSGIHandler()
When using Apache/mod_wsgi, this is done in the WSGI script file and the environment variable would be set in os.environ when the WSGI script file is loaded.
As much as process environment variables and global variables have their limitations and are arguably a bad idea for specifying configuration, this has still worked okay until recently.
Problems started when Django 1.4 was released however. In Django 1.4 the content of the WSGI script file was changed from what was described previously for Django 1.3 and older versions to:
import os
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'mysite.settings')
from django.core.wsgi import get_wsgi_application
application = get_wsgi_application()
Although the differences at first glance appear to be fine, they aren't, and the WSGI script file in Django 1.4 will break Apache/mod_wsgi for hosting multiple Django instances in the same process.
The key problem is what the setdefault() method does when setting the environment variable DJANGO_SETTINGS_MODULE compared to using assignment as previously. In the case of assignment the environment variable is always updated. For setdefault(), it is only updated if it is not already set.
You might ask why would this be a problem. It is a problem because although os.environ looks like a normal dictionary, it isn't. Instead it is actually a custom class which only looks like a dictionary. When a key/value is set in os.environ, it is also setting it at the C level by calling putenv().
What now happens with our multiple Django instances running in the same process is that for the first one to be loaded, DJANGO_SETTINGS_MODULE will not be set and so setdefault() will actually set it, including it being set globally to the process. When the second Django instance is loaded, at the point the sub interpreter is created, os.environ is populated from the C level environ, thereby picking up the value of DJANGO_SETTINGS_MODULE set when the first Django instance was loaded. In this case setdefault() will not override the value as it already sees it as being set.
Technically this leakage of environment variables between Python sub interpreters within the one process would always happen, but use of setdefault() instead of assignment means DJANGO_SETTINGS_MODULE will not get overridden to be the correct value for the second Django instance to be loaded.
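The difference between the two approaches can be shown in a single interpreter. This sketch only models the end result, where the second WSGI script file is loaded with the variable already present in the environment; it does not create sub interpreters. The variable name DEMO_SETTINGS_MODULE and the settings module names are hypothetical, chosen so the real DJANGO_SETTINGS_MODULE is left alone.

```python
import os

# Hypothetical variable name, used so the real setting is not touched.
VARIABLE = 'DEMO_SETTINGS_MODULE'
os.environ.pop(VARIABLE, None)


def load_with_setdefault(settings):
    # What the Django 1.4 style WSGI script file does.
    os.environ.setdefault(VARIABLE, settings)
    return os.environ[VARIABLE]


def load_with_assignment(settings):
    # What the Django 1.3 style WSGI script file does.
    os.environ[VARIABLE] = settings
    return os.environ[VARIABLE]


# First instance loads, then the second sees the value already set.
first = load_with_setdefault('site1.settings')
second = load_with_setdefault('site2.settings')

os.environ.pop(VARIABLE, None)
first_assigned = load_with_assignment('site1.settings')
second_assigned = load_with_assignment('site2.settings')
os.environ.pop(VARIABLE, None)

print(first, second)
print(first_assigned, second_assigned)
```

With setdefault() the second load ends up with the settings of the first, whereas assignment always yields the value the script file asked for.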
The end result of the leakage can be one of two things. If the name of the Django settings module used for the first Django instance doesn't exist in the context of the second Django instance, then an import failure will occur and the Django instance will fail to be initialised. In this first case an actual failure will occur with nothing working and so it will be fairly obvious.
A more problematic case though is where you are using one code base and multiple Django settings modules for each of the distinct Django instances being run. In this case the Django settings module may well be found, with the result being that when attempting to load up the second Django instance, a duplicate of the first instance would be loaded instead. Where URLs are now meant to be routed to the second instance, they would instead be handled as if being sent to the first instance as the configuration for the first is still being used.
There are two solutions if this is the cause. The quickest is to replace the use of setdefault() to set the environment variable in the WSGI script file with more usual assignment.
os.environ['DJANGO_SETTINGS_MODULE'] = 'mysite.settings'
An alternative which involves a bit more work, but can have other benefits, is to switch to using daemon mode of mod_wsgi to run the Django instances and delegate each to a separate set of processes. By running the Django instances in separate processes there can be no possibility of environment variables leaking from one to the other.
WSGIDaemonProcess project-2
WSGIScriptAlias /suburl /some/path/project-2/wsgi.py process-group=project-2
WSGIDaemonProcess project-1
WSGIScriptAlias / /some/path/project-1/wsgi.py process-group=project-1
Fallback to default VirtualHost definition.
Apache supports hosting of sites under multiple host names by way of name based virtual hosts. These are set up with the VirtualHost directive as previously shown.
<VirtualHost *:80>
    ServerName site-1.example.com
    ServerAlias www.site-1.example.com
    WSGIScriptAlias / /some/path/project-1/wsgi.py
    ...
</VirtualHost>
<VirtualHost *:80>
    ServerName site-2.example.com
    WSGIScriptAlias / /some/path/project-2/wsgi.py
    ...
</VirtualHost>
The ServerName directive specified within the VirtualHost gives the primary host name by which the site is identified. If the same site can be accessed by other names, the ServerAlias directive can be used to list them explicitly, or by using a wildcard pattern.
What is not obvious however is that if there are any host names by which the server IP is addressable that are not covered by a ServerName or ServerAlias directive, then rather than giving an error, Apache will route the request to the first VirtualHost definition it found when it read its configuration.
In the above example, if the host name 'www.site-2.example.com' also existed and mapped to the server IP, because that host name wasn't covered by a ServerAlias directive for the second VirtualHost, the request would actually end up being handled by the Django instance running 'site-1.example.com'.
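That fallback can be modelled roughly as follows. It is a simplified sketch of the selection rule just described and not Apache's actual matching code; select_virtual_host and the dictionary layout are hypothetical, with fnmatchcase used to approximate wildcard patterns in ServerAlias.

```python
from fnmatch import fnmatchcase


def select_virtual_host(host, virtual_hosts):
    """Pick a VirtualHost by ServerName/ServerAlias, else fall back to the first.

    Simplified model of name based virtual host selection.
    """
    host = host.lower()
    for vhost in virtual_hosts:
        names = [vhost['server_name']] + vhost.get('server_aliases', [])
        if any(fnmatchcase(host, name.lower()) for name in names):
            return vhost['server_name']
    # No match: the first definition read from the configuration wins.
    return virtual_hosts[0]['server_name']


virtual_hosts = [
    {'server_name': 'site-1.example.com',
     'server_aliases': ['www.site-1.example.com']},
    {'server_name': 'site-2.example.com'},
]

print(select_virtual_host('site-2.example.com', virtual_hosts))
print(select_virtual_host('www.site-1.example.com', virtual_hosts))

# Not covered by any ServerName or ServerAlias, so it lands on site-1.
print(select_virtual_host('www.site-2.example.com', virtual_hosts))
```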
To address this problem you simply need to be diligent in ensuring that you have correctly mapped all host names you wish to have directed at a site. Using mod_wsgi daemon mode will make absolutely no difference in this situation as it is Apache that is routing the request to the wrong VirtualHost before the request even gets passed off to mod_wsgi.
As a failsafe to pick up such issues and ensure that requests don't unintentionally go to the wrong site, a good practice may be to ensure that the first VirtualHost that Apache encounters when reading its configuration is actually a dummy definition which doesn't equate to an actual site.
<VirtualHost _default_:*>
    <Location />
        Deny from all
    </Location>
</VirtualHost>
This could then be set up to fail all requests by way of forbidding access. A custom error document could also be used to customise the error response if necessary.