Friday, October 2, 2009

Details on WSGI 1.0 amendments/clarifications.

In my last gargantuan post I tried to lay out a possible roadmap for the future of WSGI. That generated another large discussion on the Python WEB-SIG list. I don't want to go into any outcomes of that discussion just yet though. Instead, I want to focus on one part of that roadmap and go into it in more detail. That part was the set of amendments/clarifications I listed towards the end of the roadmap. Although these were thrown in towards the end of the post, they actually stand independent of the bigger issue of how WSGI should look for Python 3.X.

So much so that even if nothing is worked out about Python 3.X and how URI variables should be represented, we could still bring out a revision of WSGI 1.0, called WSGI 1.1, which contains just those amendments. In fact, as with changes related to Python 3.X support, getting out a revision of the WSGI specification with some of those changes has been attempted in the past and nothing has ever come of it.

To better understand what those changes are about, I will list each of them and explain the reason behind the suggested change.

1. The 'readline()' function of 'wsgi.input' may optionally take a size hint.

The WSGI specification as it is written states 'The optional "size" argument to readline() is not supported, as it may be complex for server authors to implement, and is not often used in practice'.

Unfortunately, since that was written the 'cgi.FieldStorage' class in the Python standard library was modified to make use of the ability of 'readline()' of a file like object to accept a size argument. Because many WSGI frameworks make use of 'cgi.FieldStorage' for the purposes of parsing the content of POST requests, this has resulted in those WSGI frameworks being in conflict with the WSGI specification, as they indirectly rely on a feature which the specification says a complying WSGI server doesn't need to support.
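To illustrate why a caller might want the size argument, here is a minimal sketch of a line reader that bounds how much a single 'readline()' call can return. The function name 'iter_lines' and the 65536 byte default are assumptions made for illustration, not something taken from the WSGI specification. It is written for Python 3, so request content is bytes.

```python
import io


def iter_lines(stream, max_line=1 << 16):
    """Yield lines from a file like object, never asking for more than
    max_line bytes in one call (hypothetical helper for illustration)."""
    while True:
        # Passing a size means one very long line cannot force the
        # whole request body to be buffered in memory at once.
        line = stream.readline(max_line)
        if not line:
            break
        yield line
```

A 'wsgi.input' implementation that rejects the size argument would fail on the first call made by code written in this style.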

2. The 'wsgi.input' must provide an empty string as end of input stream marker.

The WSGI specification as it is written states 'The server is not required to read past the client's specified Content-Length, and is allowed to simulate an end-of-file condition if the application attempts to read past that point. The application should not attempt to read more data than is specified by the CONTENT_LENGTH variable.'

In other words the WSGI server can provide an end of file indicator in the way of an empty string, but is not required to. On the basis that such an end of file indicator is not guaranteed to be present, the WSGI application is not allowed to read more content from 'wsgi.input' than is specified by CONTENT_LENGTH. In the case where CONTENT_LENGTH was not defined or is an empty string, then CONTENT_LENGTH is supposed to be interpreted as being '0'.

As far as WSGI application code goes, this means that any WSGI application has to keep track of how much data it has read from 'wsgi.input' in a variable, incrementing that by the amount of data returned by each read. It then has to compare that to CONTENT_LENGTH, ensuring it doesn't read more than it should. This may mean that it has to deal with a partial read on the last iteration where CONTENT_LENGTH is not a multiple of the block size used by the code when performing reads.

There are a few problems with this. The first is that although the WSGI application notionally should not rely on there being an end of file indicator in the way of a read returning an empty string, it still has to cater for that situation. This is because premature closure of the client connection may not always result in an exception being raised, indicating a failed read. Instead it may simply result in the return of an empty string. This could still occur in cases where an implementation doesn't return an empty string when all the request content has been consumed. If a WSGI application does not deal with an empty string being returned for this case and assumes it can just keep reading until the amount of data specified by CONTENT_LENGTH is returned or an error occurs, then it can find itself in a tight loop when the read returns an empty string instead.
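As a rough sketch of what this bookkeeping looks like in application code under the specification as it stands, the following reads no more than CONTENT_LENGTH and also stops on an empty string. The function name 'read_body' and the block size are assumptions for illustration. It is written for Python 3, so request content is bytes.

```python
import io


def read_body(environ, block_size=8192):
    """Read at most CONTENT_LENGTH bytes from wsgi.input."""
    try:
        # A missing or empty CONTENT_LENGTH is treated as zero.
        remaining = int(environ.get('CONTENT_LENGTH') or 0)
    except ValueError:
        remaining = 0
    stream = environ['wsgi.input']
    chunks = []
    while remaining > 0:
        # The last read may be a partial block.
        data = stream.read(min(block_size, remaining))
        if not data:
            # Empty string: the client may have closed the connection
            # early. Without this check the loop would spin forever.
            break
        chunks.append(data)
        remaining -= len(data)
    return b''.join(chunks)
```

Note that the empty string check is needed even though the application is not supposed to rely on an end of file indicator being present.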

Short story is that the WSGI application has to deal with an empty string anyway, so why not make it mandatory for a WSGI adapter to ensure one is always supplied?

The scenario where this may be problematic for a WSGI adapter is where HTTP/1.1 and keep alive connections are supported. In this case, if 'wsgi.input' corresponds to the socket connection from the client, one can't rely on the send half of the connection being closed when all request content is sent. If a WSGI application were to read more data than specified by CONTENT_LENGTH in this case, it would potentially cause the WSGI application to hang forever, or at least until the client timed out and closed the whole connection. This is effectively why the WSGI specification says that a WSGI application should not read more data than is specified by CONTENT_LENGTH. That is, it relies on WSGI applications being well behaved: once all data specified by CONTENT_LENGTH is read, the application processes the data and returns any response. That same socket connection could then be used for a subsequent request.

Note that for correct operation, not only is it relied upon that the WSGI application not read more data than specified by CONTENT_LENGTH, it is actually necessary that the WSGI application read exactly that amount of data when available. If the WSGI application doesn't consume the exact amount of data, then it will still be present in the buffers of the socket connection and could be erroneously interpreted as being the start of a subsequent request over the same socket connection.

A new requirement making it mandatory for a WSGI adapter to ensure an empty string is returned as end of file sentinel means that a raw socket can no longer be supplied as 'wsgi.input'. Instead 'wsgi.input' will need to be supplied as an object that wraps the socket connection. This wrapper object will need to count how much data has been read, and when the amount of data reaches that defined by CONTENT_LENGTH, any subsequent reads should return an empty string instead. Having the WSGI adapter do this is good though, as it ensures that a WSGI application can't hang. Furthermore, the WSGI adapter can at the end of a request ensure that any data supplied with the request which wasn't consumed is read and discarded, so that if the socket connection is reused for a subsequent request, any residual data doesn't get wrongly interpreted as being part of that subsequent request.
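A minimal sketch of such a wrapper object might look like the following. The class name 'LimitedInput' and the 'discard()' method are hypothetical names chosen for illustration, and a real adapter would also need to deal with socket errors and timeouts. It is written for Python 3, so request content is bytes.

```python
import io


class LimitedInput:
    """Hypothetical wsgi.input wrapper which returns an empty string
    once CONTENT_LENGTH bytes have been consumed."""

    def __init__(self, stream, content_length):
        self._stream = stream
        self._remaining = content_length

    def _clamp(self, size):
        # Never ask the underlying stream for more than is left.
        if size is None or size < 0 or size > self._remaining:
            return self._remaining
        return size

    def read(self, size=-1):
        if self._remaining <= 0:
            return b''  # end of file sentinel
        data = self._stream.read(self._clamp(size))
        self._remaining -= len(data)
        return data

    def readline(self, size=-1):
        if self._remaining <= 0:
            return b''  # end of file sentinel
        line = self._stream.readline(self._clamp(size))
        self._remaining -= len(line)
        return line

    def discard(self, block_size=8192):
        """Read and drop unconsumed request content so it cannot be
        mistaken for the start of the next request."""
        while self.read(block_size):
            pass
```

The adapter would call something like 'discard()' at the end of each request before reusing a keep alive connection.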

Forcing the requirement for an empty string to be returned as end of file sentinel also resolves a potential problem when 'readline()' is used without a size argument. The issue here is that since a size argument doesn't have to be supplied, it is possible to inadvertently attempt to read more data than specified by CONTENT_LENGTH. For code such as 'cgi.FieldStorage', the only reason a problem doesn't occur is that the request content has its own end marker within the input stream to represent the end of data and this always ends with a newline. Thus 'readline()' will always supply the end marker on any final read. If for some reason that newline wasn't supplied, or the request content was malformed in some way, it again is possible that a WSGI application could hang due to attempting to read more data than it should.

As you can see, mandating that a WSGI adapter be required to return an empty string as end of file sentinel avoids a number of potential problems. Such a change also has the potential to simplify WSGI application code. This is because at present a WSGI application has to maintain a variable tracking how much data it has read so as to ensure it doesn't read more than CONTENT_LENGTH. If an empty string as end of file sentinel is guaranteed, it doesn't necessarily need to do this any more. Although, that said, robust WSGI applications may want to still do validation based on amount of data read in case request content was truncated due to client closing connection prematurely.

Depending on an empty string as end sentinel also opens up the possibility of a few new features which at present aren't strictly allowed by the WSGI specification.

The first of these is the concept of mutating input filters. These are web server level input filters or WSGI middleware components, which modify the request content as it is being read in.

The best example of this is the mod_deflate module for Apache, which implements the ability to handle compressed data in the request content. In this scenario the CONTENT_LENGTH would actually be the original size of the compressed data sent by the client. When that is uncompressed the actual amount of data would normally be greater than that specified by CONTENT_LENGTH. If the current WSGI specification is followed, an application is only allowed to read up to CONTENT_LENGTH so it would end up truncating the input, but possibly think that all was fine. Further, because it hadn't read all the actual request content, the residual when HTTP/1.1 is used could be wrongly interpreted as start of next request.

End result is that an Apache module such as mod_deflate which mutates the input stream, turning it into something which is of a different length, can't be used with WSGI applications which only read up to length specified by CONTENT_LENGTH. If instead the WSGI application just kept reading data until an empty string was returned, then it would all work fine.
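The difference can be sketched as follows. Here 'zlib' merely stands in for what an input filter such as mod_deflate would do at the web server level, and the function name 'read_until_eof' is an assumption for illustration. The sketch also assumes the server guarantees an empty string at end of input. It is written for Python 3, so request content is bytes.

```python
import io
import zlib


def read_until_eof(stream, block_size=8192):
    """Yield blocks of request content until the empty string sentinel."""
    while True:
        data = stream.read(block_size)
        if not data:
            break
        yield data


original = b'x' * 1000
compressed = zlib.compress(original)

# The filter hands the application the uncompressed content, but
# CONTENT_LENGTH still describes the compressed data from the client.
environ = {
    'CONTENT_LENGTH': str(len(compressed)),
    'wsgi.input': io.BytesIO(original),
}

body = b''.join(read_until_eof(environ['wsgi.input']))
```

An application that stopped after CONTENT_LENGTH bytes would see only the first few bytes of the uncompressed content, whereas reading to the sentinel recovers all of it.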

Amusingly, because 'cgi.FieldStorage' uses 'readline()' and so effectively ignores CONTENT_LENGTH, the mod_deflate module can actually be used in the case where it is known that 'cgi.FieldStorage' would always be used to interpret the request content.

That such inconsistencies exist gives further weight to there being a requirement that an empty string always be used as end sentinel and that WSGI applications give greater weight to that, than the value of CONTENT_LENGTH. The result is that CONTENT_LENGTH would be mostly ignored, perhaps only being used where a WSGI application wants to block a request before reading anything if the original amount of request content was greater than a certain amount, in that case returning a 413 error response.
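A small sketch of that remaining use of CONTENT_LENGTH: the application checks the declared size up front and otherwise reads until the sentinel. The limit 'MAX_BODY' is a hypothetical value, and the sketch assumes the server guarantees an empty string at end of input. It is written for Python 3, so request content is bytes.

```python
import io

MAX_BODY = 1024  # hypothetical limit, in bytes


def application(environ, start_response):
    try:
        declared = int(environ.get('CONTENT_LENGTH') or 0)
    except ValueError:
        declared = 0
    if declared > MAX_BODY:
        # Reject before reading anything from wsgi.input.
        start_response('413 Request Entity Too Large',
                       [('Content-Type', 'text/plain'),
                        ('Content-Length', '0')])
        return []
    # From here on CONTENT_LENGTH is ignored and the empty string
    # sentinel decides where the request content ends.
    stream = environ['wsgi.input']
    body = b''.join(iter(lambda: stream.read(8192), b''))
    output = str(len(body)).encode('ascii')
    start_response('200 OK',
                   [('Content-Type', 'text/plain'),
                    ('Content-Length', str(len(output)))])
    return [output]
```

A robust application might also cap the amount it actually reads, since the content after a mutating input filter can be larger than the declared size.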

The other feature that could be supported is the ability of a request to use chunked data. That is, where the HTTP request uses 'chunked' as the transfer encoding.

In this type of request there is no content length and as the specification is now, a WSGI application should treat that as zero length and would believe there is no data at all to read. Some WSGI adapters work around this by reading in all the request content, calculating the length and setting CONTENT_LENGTH before the request is passed through to the WSGI application. To do this though means that all the request content has to be held in memory or possibly subsequently written to disk if it turns out to be a large amount of data.

Having to do this complicates the WSGI adapter and denies the WSGI application the ability to stream the request content and process it as it arrives as the WSGI adapter will already have read it in.

By having an empty string be required as an end sentinel, the WSGI application itself seeing that CONTENT_LENGTH wasn't specified, but seeing that HTTP_TRANSFER_ENCODING was 'chunked' could just go ahead and read the request content, processing it as it goes, until the empty string as end sentinel was reached.
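In application code this could be sketched roughly as below. The function name 'digest_body' is hypothetical, hashing is used only as an example of processing content as it arrives, and the sketch assumes the WSGI adapter has already removed the chunk framing and guarantees an empty string at end of input. It is written for Python 3, so request content is bytes.

```python
import hashlib
import io


def digest_body(environ, block_size=8192):
    """Hash the request content as it streams in, or return None when
    the request carries no content."""
    chunked = environ.get('HTTP_TRANSFER_ENCODING', '').lower() == 'chunked'
    if not environ.get('CONTENT_LENGTH') and not chunked:
        return None
    stream = environ['wsgi.input']
    digest = hashlib.sha256()
    while True:
        data = stream.read(block_size)
        if not data:
            # Empty string as end sentinel: no more request content.
            break
        digest.update(data)
    return digest.hexdigest()
```

Nothing here needs the whole request content to be held in memory, which is the point of letting the application read the stream itself.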

The ability to handle chunked request content is likely to grow in importance, as some of the smart phones now available fall back to using chunked request content when the amount of data being sent is greater than some predefined value. That predefined value is part of the software on the phone, however, and is not controllable, nor can the use of chunked requests be turned off.

If people want to be able to use Python to implement web applications that support such devices, then WSGI specification needs to be able to handle chunked request content.

The only issue with this is that existing WSGI applications are going to interpret a missing CONTENT_LENGTH as indicating there is no data at all. If chunked request content is now made permissible under the WSGI specification by virtue of an empty string as end sentinel being mandatory, then existing applications will appear to fail and it will not be obvious what is going wrong.

As such, although WSGI should be able to handle chunked request content, it may be wise for WSGI adapters to make the ability to pass through chunked requests optional, with the server needing to be configured explicitly to allow them.

Apache/mod_wsgi currently takes this stance with mod_wsgi 3.0 because in prior versions it always blocked chunked requests anyway, as it was known that they weren't strictly supported by the WSGI specification. But then, the wsgiref server has never treated chunked requests any differently, so it already tried to pass them through, but with WSGI applications always treating them as requests where no content was supplied.

So, although we would want to be able to handle chunked request content, it is not obvious whether such requests should always be passed on, with the potential for silent failures, or whether passing them on should have to be explicitly enabled in the server.

3. The size argument to the 'read()' function of 'wsgi.input' would be optional and if not supplied the function would return all available request content. This would make 'wsgi.input' more file like, as the WSGI specification suggests it is, but isn't really per the original definition.

This one is reasonably self explanatory. The WSGI specification says that 'wsgi.input' is supposed to be file like, yet it doesn't support this basic feature of such objects. In practice it may not be overly useful except for quick and dirty scripts, but if we are going to clean up 'wsgi.input' so it is more file like in other areas, we should align this as well.

4. The 'wsgi.file_wrapper' supplied by the WSGI adapter must honour the Content-Length response header and must only return from the file that amount of content. This would guarantee that using wsgi.file_wrapper to return part of a file for byte range requests would work.

The WSGI specification currently says in respect of 'wsgi.file_wrapper' that 'transmission should begin at the current position within the "file" at the time that transmission begins, and continue until the end is reached.' In other words, all data within the file from that point on should be returned.

The problem with this arises where the 'Content-Length' response header as specified by the WSGI application defines a value which is actually less than the amount which could be read from the file.

Strictly speaking a WSGI application shouldn't be returning more data than specified by the 'Content-Length' response header, yet the WSGI specification is effectively saying the WSGI adapter should do exactly that.

The WSGI specification should be more precise in the area of how much data is allowed to be returned in relation to what is defined by the 'Content-Length' response header. One benefit of doing that would be that 'wsgi.file_wrapper' could then reliably be used as a means of returning a range of a file in an optimised way to satisfy byte range requests.

Right now it is a bit of an unknown whether a WSGI adapter which supports 'wsgi.file_wrapper' is ignoring 'Content-Length' as the WSGI specification suggests should be done, or whether it sees common sense prevailing and does not allow more data than specified by 'Content-Length' to be returned.

Because of HTTP/1.1 and pipelining of requests, allowing more data to be returned than specified by 'Content-Length' could cause problems, with any additional data returned being interpreted as the start of the response corresponding to any subsequent request.
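To make the byte range case concrete, here is a minimal sketch of a file wrapper that stops after a fixed number of bytes. The class name and the 'limit' parameter are hypothetical. The 'wsgi.file_wrapper' callable in the specification only takes the file like object and an optional block size, so a real adapter would have to derive the limit itself from the 'Content-Length' response header. It is written for Python 3, so content is bytes.

```python
import io


class LengthLimitedFileWrapper:
    """Hypothetical file wrapper which returns at most 'limit' bytes
    starting from the current position within the file."""

    def __init__(self, filelike, blksize=8192, limit=None):
        self.filelike = filelike
        self.blksize = blksize
        self.remaining = limit  # None means read to end of file

    def __iter__(self):
        return self

    def __next__(self):
        size = self.blksize
        if self.remaining is not None:
            if self.remaining <= 0:
                raise StopIteration
            size = min(size, self.remaining)
        data = self.filelike.read(size)
        if not data:
            raise StopIteration
        if self.remaining is not None:
            self.remaining -= len(data)
        return data

    def close(self):
        if hasattr(self.filelike, 'close'):
            self.filelike.close()
```

For a range request covering bytes 100 to 149, the application would seek to offset 100, set 'Content-Length' to 50, and rely on the wrapper stopping after 50 bytes rather than running on to the end of the file.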

5. Any WSGI application or middleware should not return more data than specified by the Content-Length response header if defined.

This is just applying the rule about 'Content-Length' above to WSGI middleware, albeit that it perhaps only needs to apply where the WSGI middleware is inserting its own wrapper around the generator returned, in order to perform a transformation on the returned data or to calculate a response header based on the returned content.

6. The WSGI adapter must not pass on to the server any data above what the Content-Length response header defines if supplied.

Again, same rule, but this time as a fallback, saying that the WSGI adapter shouldn't pass on more data than 'Content-Length' allows. This is to catch the case where WSGI middleware or an application isn't implemented correctly and is returning the wrong amount of data.
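A rough sketch of that fallback inside an adapter is shown below. The function name 'limit_response' is an assumption for illustration, and a real adapter would still need to call 'close()' on the iterable returned by the application where that method exists. It is written for Python 3, so response content is bytes.

```python
def limit_response(chunks, content_length):
    """Yield response data from 'chunks', stopping once content_length
    bytes have been passed on."""
    remaining = content_length
    for chunk in chunks:
        if remaining <= 0:
            # Anything beyond Content-Length is dropped.
            break
        if len(chunk) > remaining:
            # Truncate the block which crosses the limit.
            chunk = chunk[:remaining]
        remaining -= len(chunk)
        yield chunk
```

An adapter may also want to log when it has had to discard data, since that indicates a bug in the application or middleware.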

Anyway, that is what the amendments were about and what issues they were trying to address.

In addition to these, there have been a couple of suggestions made by others as well.

The first of these was that a mutating input filter should override CONTENT_LENGTH and set it to the value -1. Whether using such a value would cause problems is a bit of an unknown, given that the value is no longer positive. For Apache at least, it also isn't possible to reliably detect when an Apache input filter mutates the input and changes its length, so there is no way to even do that.

The second suggestion was that 'wsgi.input' should supply the additional method 'tell()'. This would allow one to find out how much data had already been read. Technically this wouldn't be too hard, as many WSGI adapters would be explicitly counting the amount of data read anyway in order to simulate the end of file sentinel, so the value would be available. Introducing this method though may signal to some that 'wsgi.input' should also support 'seek()', but rewinding input isn't generally going to be able to be implemented. Also, with using an empty string as end sentinel, there would be a movement away from needing to know how much data was read as it is being read. Either way, it certainly isn't a requirement for things to work, so perhaps best to not add it.

So, as far as those other two suggestions go, I am not convinced they are worth including.

Now, I always take time to get blog posts out these days, but I hope to follow this post up soon with another which details how the above and other discussions about WSGI and Python 3.X fit into my mod_wsgi version 3.0 release plans. I was never going to hang around waiting forever until decisions are made, so have decided that I am going to make some of my own. :-)
