Saturday, January 22, 2011

Testing a wsgi.file_wrapper implementation.

Not many WSGI servers or gateways have attempted to provide an implementation of the wsgi.file_wrapper extension. Of those, some provide the callable for generating the file wrapper, but then do not go on to implement any high performance optimisation. Part of the reason for this is that the Python standard library does not expose the UNIX sendfile() function. As such, a full implementation of wsgi.file_wrapper is usually restricted to WSGI servers or gateways written directly in C code.

Despite the UNIX sendfile() call not being present in the Python standard library, it still is possible in a pure Python WSGI server or gateway to fully implement wsgi.file_wrapper with optimisations. This can be done by using the ‘ctypes’ module to access and call the underlying UNIX sendfile() function. I will explain how to do that in a subsequent blog post, but before doing that I am going to describe a series of tests which can be used to validate the operation of a wsgi.file_wrapper implementation.

The first series of tests checks aspects of wsgi.file_wrapper which would potentially be exercised under normal use. The remainder of the tests try to break implementations by doing unexpected things.

The tests can be used to validate an existing WSGI server or gateway that provides an implementation. They are also a good indicator of what needs to be taken into consideration when actually implementing wsgi.file_wrapper, which is why I am presenting them before an actual implementation.

Note that the tests assume that wsgi.file_wrapper exists and the code doesn’t supply its own alternative if it doesn’t. Thus if your WSGI server or gateway does not implement wsgi.file_wrapper at all, the tests will fail in the lookup of wsgi.file_wrapper in the first place.

File Objects

The standard use case is to open a file, wrap it using wsgi.file_wrapper and return it.

import os
import string

def application(environ, start_response):
    status = '200 OK'

    response_headers = [('Content-type', 'text/plain')]
    start_response(status, response_headers)

    filelike = file('/tmp/filetest.txt', 'w')
    filelike.write(string.ascii_lowercase)
    filelike.close()

    filelike = file('/tmp/filetest.txt', 'r')

    return environ['wsgi.file_wrapper'](filelike)

In this case we have not supplied a Content-Length response header. As such, the expectation would be that the complete file should be returned.

The WSGI specification given in PEP 333 does not say whether a WSGI server or gateway needs to add a Content-Length if none is supplied. The revised PEP 3333 does, however, stating:

"""If the application doesn't supply a Content-Length, the server may generate one from the file using its knowledge of the underlying file implementation."""

Best practice would be that the Content-Length header is added, especially in the case that sendfile() is used to send the file in one big chunk. This is because there isn’t going to be an opportunity to apply HTTP/1.1 chunked encoding on the response content.

If an implementation isn’t attempting to use sendfile(), but instead is writing the data itself in chunks, then it could, if the client protocol was HTTP/1.1, choose to use chunked transfer encoding rather than setting the Content-Length response header. As the content length can be readily calculated though, it would generally be simpler just to set the response header and send the content as is.

Do note though that if the WSGI server or gateway is setting the Content-Length response header, then it should ensure that it only sends that amount of data. This would be important where the file to be sent is being appended to, e.g. a log file. In other words, once the WSGI server or gateway indicates it will send a certain amount of data, it shouldn’t then just stream until the end of file, in case the size of the file has since changed.
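
As a minimal sketch of that rule, assuming Python 3 rather than the Python 2 used for the tests in this post: the length is fixed once from the file size and the current position, and the reader then stops after that many bytes even if the file has grown. The names remaining_length, iter_limited and demo are hypothetical helpers for illustration, not part of any WSGI server.

```python
import os
import tempfile

def remaining_length(filelike):
    """Bytes between the current position and the end of file, as of now."""
    size = os.fstat(filelike.fileno()).st_size
    return max(size - filelike.tell(), 0)

def iter_limited(filelike, length, blksize=8192):
    """Yield at most `length` bytes from filelike, one block at a time."""
    while length > 0:
        data = filelike.read(min(blksize, length))
        if not data:
            break  # the file turned out shorter than promised
        length -= len(data)
        yield data

def demo():
    """Promise a length, let the 'log file' grow, then send the body."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, 'app.log')
        with open(path, 'wb') as log:
            log.write(b'first entry\n')
            log.flush()
            with open(path, 'rb') as reader:
                promised = remaining_length(reader)
                # Simulate another process appending to the log.
                log.write(b'second entry\n')
                log.flush()
                body = b''.join(iter_limited(reader, promised))
    return promised, body
```

Calling demo() should give back only the first entry as the body, with the promised length matching it, even though the second entry was already on disk by the time the body was read.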

File Like Objects

Using a standard Python file object would be the typical use case, but any file like object can technically be used. For this PEP 3333 says:

"""Note that even if the object is not suitable for the platform API, the wsgi.file_wrapper must still return an iterable that wraps read() and close(), so that applications using file wrappers are portable across platforms."""

This actually applies in two ways. The first is if the platform itself doesn’t provide a way of dealing with a file object in a performant way, or if a file like object doesn’t provide the attributes which would allow a specific mechanism to be used.

The Windows operating system is an example of the first, whereby the sendfile() call is not available. An example of the latter is where a file like object such as a StringIO object is supplied rather than a file object.

On a UNIX system where an implementation is using sendfile(), the fallback can therefore be tested using an instance of StringIO.

import StringIO
import string

def application(environ, start_response):
    status = '200 OK'

    response_headers = [('Content-type', 'text/plain')]
    start_response(status, response_headers)

    filelike = StringIO.StringIO(string.ascii_lowercase)

    return environ['wsgi.file_wrapper'](filelike)

Either way, the content accessible should still be written back to the client and the wsgi.file_wrapper implementation should not fail.

Using StringIO in particular is a good test because it does not provide a fileno() method, and if an implementation blindly assumes that an object will always be provided that has a fileno() method, it will fail. For the case where the implementation is written in C and the PyObject_AsFileDescriptor() function is used to get the file descriptor, the implementation needs to check whether ‘-1’ is returned and, if so, fall back to processing the iterable instead. If this isn’t done, then the implementation may try to access an invalid file descriptor.
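
The same check can be sketched in pure Python. This is only an illustration under stated assumptions: it is written for Python 3, the names get_descriptor, FallbackFileWrapper and choose_strategy are hypothetical, and the wrapper class is modelled on the example FileWrapper class given in PEP 3333.

```python
import io
import tempfile

def get_descriptor(filelike):
    """Return a usable file descriptor, or None if there is not one."""
    try:
        return filelike.fileno()
    except (AttributeError, ValueError, OSError):
        # No fileno() method at all, a closed file, or an in-memory
        # object such as io.BytesIO which refuses the operation.
        return None

class FallbackFileWrapper:
    """Iterable wrapping read() and close(), used when no descriptor exists."""

    def __init__(self, filelike, blksize=8192):
        self.filelike = filelike
        self.blksize = blksize
        if hasattr(filelike, 'close'):
            self.close = filelike.close

    def __iter__(self):
        return self

    def __next__(self):
        data = self.filelike.read(self.blksize)
        if not data:
            raise StopIteration
        return data

def choose_strategy(filelike):
    """Report which path a server would take for this object."""
    if get_descriptor(filelike) is not None:
        return 'descriptor'
    return 'iterate'
```

With this, an io.BytesIO instance is routed to the 'iterate' path while a real file is routed to the 'descriptor' path, and neither case raises an error.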

For this specific case, we again haven’t supplied a Content-Length and it wouldn’t necessarily be possible to deduce it. As such, if the client protocol was HTTP/1.1, then chunked transfer encoding might be used if the HTTP/WSGI server supports it.
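
For reference, the framing used by chunked transfer encoding is simple enough to sketch. The generator below is a hypothetical helper written for Python 3, not taken from any particular server. A server using it would also need to send a 'Transfer-Encoding: chunked' response header and must not send a Content-Length.

```python
def iter_chunked(blocks):
    """Frame each non-empty block of bytes as an HTTP/1.1 chunk."""
    for data in blocks:
        if data:  # a zero-length chunk would terminate the body early
            size_line = ('%x\r\n' % len(data)).encode('ascii')
            yield size_line + data + b'\r\n'
    yield b'0\r\n\r\n'  # last chunk, with no trailers
```

The blocks passed in could be whatever iterating over the file wrapper yields, so the total length never needs to be known in advance.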

Content Length

No matter what type of file like object is used, be it an actual file object or otherwise, a WSGI application can optionally set a Content-Length response header itself. This header is meant to be the authoritative indicator as to how much response data is to be returned.

If the Content-Length header is supplied and the value it is given equates to the actual amount of content then there is no problem. If however the given content length is less than the actual amount of data accessible via the file like object, then only that amount of data should be returned and no more.

The WSGI specification as outlined in PEP 333 is actually wrong in this respect in that it says:

"""Apart from the handling of close(), the semantics of returning a file wrapper from the application should be the same as if the application had returned iter(filelike.read, ''). In other words, transmission should begin at the current position within the "file" at the time that transmission begins, and continue until the end is reached."""

In other words, it doesn’t indicate that the Content-Length header should be taken into consideration. PEP 3333 corrects this and states:

"""Apart from the handling of close(), the semantics of returning a file wrapper from the application should be the same as if the application had returned iter(filelike.read, ''). In other words, transmission should begin at the current position within the "file" at the time that transmission begins, and continue until the end is reached, or until Content-Length bytes have been written."""

This scenario needs to be tested for both a file object and the fallback case for any file like object. Thus one would want to perform the tests:

import os
import string

def application(environ, start_response):
    status = '200 OK'

    response_headers = [('Content-type', 'text/plain'),
                        ('Content-length', str(len(string.ascii_lowercase)/2))]
    start_response(status, response_headers)

    filelike = file('/tmp/filetest.txt', 'w+')
    filelike.write(string.ascii_lowercase)
    filelike.close()

    filelike = file('/tmp/filetest.txt', 'r')

    return environ['wsgi.file_wrapper'](filelike)

and:

import StringIO
import string

def application(environ, start_response):
    status = '200 OK'

    response_headers = [('Content-type', 'text/plain'),
                        ('Content-length', str(len(string.ascii_lowercase)/2))]
    start_response(status, response_headers)

    filelike = StringIO.StringIO(string.ascii_lowercase)

    return environ['wsgi.file_wrapper'](filelike)

Note that one can’t rely on using a web browser to validate the output in these cases. This is because a web browser will normally not display any additional data that has been sent beyond what the Content-Length indicated should exist. It is therefore necessary to use a network snooping tool or even telnet directly to the HTTP server port and enter the request and view the raw details of the HTTP response.
$ telnet localhost 8000
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
GET / HTTP/1.0

HTTP/1.1 200 OK
Content-type: text/plain
Content-length: 13

abcdefghijklmnopqrstuvwxyzConnection closed by foreign host.
A valid implementation should only return the amount of data specified by the Content-Length header and no more, thus not what the above output shows.
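
If typing the request into telnet becomes tedious, the same check can be scripted. The following is a rough sketch for Python 3, with fetch_raw and check_raw_response being hypothetical helper names. It sends an HTTP/1.0 request so that the server closes the connection at the end of the response, and it assumes the response is not using chunked transfer encoding.

```python
import socket

def fetch_raw(host, port, path='/'):
    """Send a bare HTTP/1.0 request and return every byte sent back."""
    request = 'GET %s HTTP/1.0\r\n\r\n' % path
    chunks = []
    with socket.create_connection((host, port)) as sock:
        sock.sendall(request.encode('ascii'))
        while True:
            data = sock.recv(4096)
            if not data:
                break  # server closed the connection
            chunks.append(data)
    return b''.join(chunks)

def check_raw_response(raw):
    """Return (declared Content-Length or None, actual body length)."""
    head, _, body = raw.partition(b'\r\n\r\n')
    declared = None
    for line in head.split(b'\r\n')[1:]:  # skip the status line
        name, _, value = line.partition(b':')
        if name.strip().lower() == b'content-length':
            declared = int(value.strip())
    return declared, len(body)
```

For example, check_raw_response(fetch_raw('localhost', 8000)) against a server behaving as in the telnet session above would report a declared length of 13 but a body of 26 bytes, which is the failure being looked for.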

Current Position

One of the sections previously quoted from PEP 3333 states:

"""Apart from the handling of close(), the semantics of returning a file wrapper from the application should be the same as if the application had returned iter(filelike.read, ''). In other words, transmission should begin at the current position within the "file" at the time that transmission begins, and continue until the end is reached, or until Content-Length bytes have been written."""

Key to note here is the phrase ‘transmission should begin at the current position within the "file"’.

Normally when a file is opened the current seek position or file pointer will be at the start of the file. This means that where no Content-Length is specified the complete contents of the file would be returned. If however the current position within the file was not at the start of the file, then the complete file should not be returned. Instead only the content from the current position up to the end of the file should be returned, or if Content-Length is specified, only that many bytes from the current position.

Tests which validate the correct behaviour for these situations are:

import os
import string

def application(environ, start_response):
    status = '200 OK'

    response_headers = [('Content-type', 'text/plain')]
    start_response(status, response_headers)

    filelike = file('/tmp/filetest.txt', 'w+')
    filelike.write(string.ascii_lowercase)
    filelike.flush()

    filelike.seek(len(string.ascii_lowercase)/2, os.SEEK_SET)

    return environ['wsgi.file_wrapper'](filelike)

and:

import os
import string

def application(environ, start_response):
    status = '200 OK'

    response_headers = [('Content-type', 'text/plain'),
                        ('Content-length', str(len(string.ascii_lowercase)/4))]
    start_response(status, response_headers)

    filelike = file('/tmp/filetest.txt', 'w+')
    filelike.write(string.ascii_lowercase)
    filelike.flush()

    filelike.seek(len(string.ascii_lowercase)/2, os.SEEK_SET)

    return environ['wsgi.file_wrapper'](filelike)

For the first test, the content from the middle of the file to the end of the file should be returned (nopqrstuvwxyz), and if the Content-Length header is set automatically, it should correctly reflect that reduced length of 13. The second test should again start from the middle of the file, but should only return the 6 characters which follow that position (nopqrs).

As with the prior test setting Content-Length, you can’t rely on the browser for the latter test as it will only show as much data as indicated by Content-Length even if more data had actually been returned. A check should therefore be made of the raw HTTP response.

There is no strict need to perform this test where a file like object such as StringIO is used as the way data is consumed in that case means that reading can only start from the current position. An actual file object is different because the optimisation to return the file contents would usually work directly with the file descriptor and not via any high level interface.
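
To make the expected results for the two tests in this section concrete, the rule from PEP 3333 can be modelled as a slice. The function expected_body is a hypothetical name used only for this illustration, and the integer division is written as // so the sketch runs under Python 3 as well.

```python
import string

def expected_body(content, position=0, content_length=None):
    """Model of what a conforming server should send back."""
    remainder = content[position:]  # start at the current position
    if content_length is None:
        return remainder  # continue until the end is reached
    return remainder[:content_length]  # or stop at Content-Length

half = len(string.ascii_lowercase) // 2
quarter = len(string.ascii_lowercase) // 4

first_test = expected_body(string.ascii_lowercase, half)
second_test = expected_body(string.ascii_lowercase, half, quarter)
```

The raw response body for each test can then be compared against first_test and second_test respectively.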

Multiple Instances

Now it is time to try and start breaking the wsgi.file_wrapper implementation by doing abnormal things. Granted that these things wouldn’t normally be done, but by testing them it tests the robustness of the implementation of wsgi.file_wrapper. If an implementation takes short cuts or uses a bad design, then it could result in incorrect behaviour, or in worst case for a C based implementation, cause the process to crash.

The first test that can be done is to use wsgi.file_wrapper multiple times within the context of the same request. Obviously only the result of one invocation of wsgi.file_wrapper can actually be returned by the WSGI application. This need not even be the result of the last call. No matter which is returned, the contents of the file that was wrapped by the returned instance should be what is written back to the HTTP client.

We therefore first check for the case where the last instance of wsgi.file_wrapper created is returned.

import os
import string

def application(environ, start_response):
    status = '200 OK'

    response_headers = [('Content-type', 'text/plain'),]
    start_response(status, response_headers)

    filelike1 = file('/tmp/filetest-a.txt', 'w+')
    filelike1.write(string.ascii_lowercase)
    filelike1.flush()
    filelike1.seek(0, os.SEEK_SET)

    file_wrapper1 = environ['wsgi.file_wrapper'](filelike1)

    filelike2 = file('/tmp/filetest-b.txt', 'w+')
    filelike2.write(string.ascii_uppercase)
    filelike2.flush()
    filelike2.seek(0, os.SEEK_SET)

    file_wrapper2 = environ['wsgi.file_wrapper'](filelike2)

    return file_wrapper2

and then for the case of returning an instance of wsgi.file_wrapper which wasn’t the last created.

import os
import string

def application(environ, start_response):
    status = '200 OK'

    response_headers = [('Content-type', 'text/plain'),]
    start_response(status, response_headers)

    filelike1 = file('/tmp/filetest-a.txt', 'w+')
    filelike1.write(string.ascii_lowercase)
    filelike1.flush()
    filelike1.seek(0, os.SEEK_SET)

    file_wrapper1 = environ['wsgi.file_wrapper'](filelike1)

    filelike2 = file('/tmp/filetest-b.txt', 'w+')
    filelike2.write(string.ascii_uppercase)
    filelike2.flush()
    filelike2.seek(0, os.SEEK_SET)

    file_wrapper2 = environ['wsgi.file_wrapper'](filelike2)

    return file_wrapper1

What is being checked for here is that the implementation isn’t just caching the last inputs given to a call of wsgi.file_wrapper and using that. The details must correspond to that used in the instance of wsgi.file_wrapper actually returned.
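
One way to try this pair of tests without deploying to a server is a small in-process harness. The sketch below is written for Python 3 and assumes the standard library's wsgiref.util.FileWrapper as the wsgi.file_wrapper implementation, which simply iterates over read() and performs no sendfile() optimisation. The name run_application is a hypothetical helper, and io.BytesIO objects stand in for the two files.

```python
import io
import string
from wsgiref.util import FileWrapper, setup_testing_defaults

def run_application(application):
    """Call a WSGI application once and collect what it returns."""
    environ = {}
    setup_testing_defaults(environ)
    environ['wsgi.file_wrapper'] = FileWrapper
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured['status'] = status
        captured['headers'] = headers

    result = application(environ, start_response)
    try:
        body = b''.join(result)
    finally:
        if hasattr(result, 'close'):
            result.close()
    return captured['status'], captured['headers'], body

def application(environ, start_response):
    start_response('200 OK', [('Content-type', 'text/plain')])
    lower = io.BytesIO(string.ascii_lowercase.encode('ascii'))
    upper = io.BytesIO(string.ascii_uppercase.encode('ascii'))
    file_wrapper1 = environ['wsgi.file_wrapper'](lower)
    file_wrapper2 = environ['wsgi.file_wrapper'](upper)
    # Return the instance which was NOT the last one created.
    return file_wrapper1
```

Because each FileWrapper instance holds its own reference to the object it was given, the body collected by run_application(application) should be the lowercase alphabet. An implementation which cached only the most recent inputs would hand back the uppercase one instead.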

Closed File Object

A file object in Python is a wrapper around a C FILE pointer. In turn the FILE pointer is a wrapper around an actual file descriptor. When performing optimisations to transmit a file using the UNIX sendfile() call it is necessary to use the file descriptor. That this is what occurs introduces a couple of complications that need to be catered for.

The first is that the file descriptor and FILE pointer can technically differ as to their understanding of the current seek position within the file. This is because a FILE pointer implements a level of buffering and adjustments to the seek position of the FILE pointer, as well as file contents, may not be reflected in the file descriptor until a flush is performed, writing the data to disk.

The prior test relating to the current position within the file object is intended to try to check for this disparity, although in practice whether it will capture a problem may be dependent on how a specific operating system implements FILE pointers. The important thing to note is that the current position within the file object should be determined by using the ‘tell()’ method of the file object and not by interrogating the seek position from the file descriptor. If an implementation gets the information directly from the file descriptor, then ultimately it will likely fail at some point where code is performing seeks on the file object.
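
The disparity is easy to observe. The sketch below is an assumption-laden illustration for Python 3, where the buffering is done by the io module rather than a C FILE pointer, but the effect is the same in kind. The function name compare_positions is hypothetical.

```python
import os
import tempfile

def compare_positions():
    """Read a little via the file object, then ask both layers where we are."""
    with tempfile.TemporaryFile() as filelike:
        filelike.write(b'abcdefghijklmnopqrstuvwxyz')
        filelike.flush()
        filelike.seek(0)
        filelike.read(5)
        # Position as the file object understands it.
        file_position = filelike.tell()
        # Position of the underlying descriptor, queried without moving it.
        descriptor_position = os.lseek(filelike.fileno(), 0, os.SEEK_CUR)
    return file_position, descriptor_position
```

The file object reports a position of 5, whereas the descriptor will generally have advanced further because of read-ahead into the buffer. An implementation passing the descriptor to sendfile() therefore needs to supply the offset from tell() explicitly.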

The second complication, and which the following test will check, is the fact that there are multiple handles to the actual file. When dealing with a file object, if one closes the file object and then subsequently performs an operation on it a Python exception would be raised. If instead you held a reference to the file descriptor only and the file object was closed, you may not get an error back. This would be the case where the file descriptor had since been reused.

As a result, suppose an implementation of wsgi.file_wrapper caches a reference to the file descriptor up front when first called, and then uses that when writing the file contents back as the response content. If the file object had been closed in between and the file descriptor had since been reused, then the wrong file contents could be returned.
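
The hazard can be demonstrated outside of any WSGI server. This is a sketch for Python 3 on a UNIX like system, with show_descriptor_reuse being a hypothetical name. Whether the descriptor number is actually reused depends on the operating system handing out the lowest free descriptor, so the code checks for that rather than assuming it.

```python
import os
import tempfile

def show_descriptor_reuse():
    """Cache a descriptor too early, close the file, then open another."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path_a = os.path.join(tmpdir, 'a.txt')
        path_b = os.path.join(tmpdir, 'b.txt')
        with open(path_a, 'wb') as out:
            out.write(b'lowercase')
        with open(path_b, 'wb') as out:
            out.write(b'UPPERCASE')

        first = open(path_a, 'rb')
        cached_fd = first.fileno()  # cached up front, as a bad design might
        first.close()

        # Asking the closed file object again is refused by Python.
        try:
            first.fileno()
            closed_detected = False
        except ValueError:
            closed_detected = True

        second = open(path_b, 'rb')
        try:
            reused = (second.fileno() == cached_fd)
            # If the number was reused, the stale descriptor now silently
            # reads from the second file instead of failing.
            leaked = os.read(cached_fd, 9) if reused else None
        finally:
            second.close()
    return closed_detected, reused, leaked
```

When reuse does occur, the data read through the cached descriptor comes from the second file, with no error raised anywhere. Going back to the file object at the last moment is what allows the closed file to be detected.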

A correct implementation should delay obtaining a reference to the file descriptor until the last moment. If the file object had been closed in the interim, this should result in a Python exception if the implementation of wsgi.file_wrapper is done in Python code, or ‘-1’ being returned by ‘PyObject_AsFileDescriptor()’ if implemented in C code. If in C code, it should technically then fall back to attempting to stream the file object, since a non file object would also result in ‘-1’ being returned, in which case it should fail when trying to read data from the closed file object.

The test for this then is to use wsgi.file_wrapper to create the result to return and before that is actually returned close the file object.

import os
import string

def application(environ, start_response):
    status = '200 OK'

    response_headers = [('Content-type', 'text/plain'),]
    start_response(status, response_headers)

    filelike = file('/tmp/filetest-a.txt', 'w+')
    filelike.write(string.ascii_lowercase)
    filelike.flush()
    filelike.seek(0, os.SEEK_SET)

    file_wrapper = environ['wsgi.file_wrapper'](filelike)

    filelike.close()

    return file_wrapper

One should also test the variation where the file object is closed before wsgi.file_wrapper is invoked to create the return value for the WSGI application.

import os
import string

def application(environ, start_response):
    status = '200 OK'

    response_headers = [('Content-type', 'text/plain'),]
    start_response(status, response_headers)

    filelike = file('/tmp/filetest.txt', 'w+')
    filelike.write(string.ascii_lowercase)
    filelike.flush()
    filelike.seek(0, os.SEEK_SET)

    filelike.close()

    file_wrapper = environ['wsgi.file_wrapper'](filelike)

    return file_wrapper

However wsgi.file_wrapper is implemented, it should result in the error being detected. Because processing of the result is being done after having returned from the WSGI application, the fact that an error occurred would normally be logged in some way and the request terminated, with the connection to the HTTP client being abruptly closed. That is, there isn’t a way of raising an exception in the context of the original WSGI application.

So, these are the tests. In a future blog post I will show how wsgi.file_wrapper should be implemented.

3 comments:

Marius Gedminas said...

Excellent!

Do you plan to build an automated test suite out of this?

Graham Dumpleton said...

There are two sides to automated testing. The client side, which is easy, and the server side, which isn't. The server side is a problem if the intention is to make it easy to apply the tests across multiple WSGI servers as how you deploy a WSGI application with each is different. Even if you focus on just Apache/mod_wsgi, it can get complicated if you have to adjust the Apache configuration between requests and restart the server. The old mod_python package had a system for doing just this, but it was quite messy. Time is better spent elsewhere at the moment.

Graham Dumpleton said...

For reference, it appears that there is a move to have sendfile() as part of the Python libraries. See http://bugs.python.org/issue10882