Tuesday, May 19, 2015

Effects of yielding multiple blocks in a WSGI application response.

In my last post I introduced a Python decorator that can be used for measuring the overall time taken by a WSGI application to process a request and for the response to then be sent on its way back to the client.
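
For those who haven't read that post, the general idea is illustrated by the following simplified sketch. This is not the exact code from that post; it only shows the key trick of wrapping the iterable returned by the WSGI application so that the elapsed time can be recorded when the WSGI server calls close() on it, which happens once the complete response has been sent.

import functools
import time

def timed_wsgi_application1(application):
    # Simplified sketch only: wrap the iterable returned by the WSGI
    # application so the elapsed time is recorded when close() is
    # called, that is, once the response has been fully consumed.
    class TimedIterable(object):
        def __init__(self, iterable, start):
            self.iterable = iterable
            self.start = start
        def __iter__(self):
            return iter(self.iterable)
        def close(self):
            try:
                if hasattr(self.iterable, 'close'):
                    self.iterable.close()
            finally:
                duration = time.time() - self.start
                print('application %.3fms' % (1000.0 * duration))
    @functools.wraps(application)
    def wrapper(environ, start_response):
        start = time.time()
        return TimedIterable(application(environ, start_response), start)
    return wrapper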

That prior post showed an example where the complete response was generated and returned as one string. Returning the response as one complete string is not actually a requirement of the WSGI specification, although for many Python web frameworks it is the more typical scenario, with the response being generated as the output of a page template system.

When needing to return a very large response, generating it and returning it as one complete string wouldn't necessarily be practical. This is because doing so would result in excessive memory usage due to the need to keep the complete response in memory. The problem would be further exacerbated in a multithreaded configuration where concurrent requests could all be trying to return very large responses at the same time.

In this sort of situation it will be necessary to instead return the response a piece at a time, by returning from the WSGI application an iterable that can generate the response as it is being sent.

Although this may well not be the primary way in which responses are generated from a WSGI application, it is still an important use case. It is, however, a use case which I have never seen covered in any of the benchmarks that people like to run when comparing WSGI servers. Instead, benchmarks focus only on the case where the complete response is returned in one go as a single string.

In this post I am therefore going to look at the use case where the response content is generated as multiple blocks, see how differently configured WSGI servers perform, and look at what is going on under the covers to affect the response times seen.

Returning the contents of a file

The example I am going to use here is returning the contents of a file. There is actually an optional extension that WSGI servers can implement to optimise the case of returning a file, but I am going to bypass that extension at this time and instead handle returning the contents of the file myself.
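
For reference, when that extension is available it is exposed as 'wsgi.file_wrapper' in the WSGI environ dictionary. A minimal sketch of how an application could make use of it, were we not bypassing it, is:

def application(environ, start_response):
    status = '200 OK'
    response_headers = [('Content-type', 'text/plain')]
    start_response(status, response_headers)

    filelike = open('/usr/share/dict/words', 'rb')

    # If the WSGI server provides the optional extension, hand the file
    # over and let the server send it in the most efficient way it can,
    # possibly using an operating system facility such as sendfile().
    file_wrapper = environ.get('wsgi.file_wrapper')

    if file_wrapper is not None:
        return file_wrapper(filelike, 8192)

    # Otherwise fall back to reading and yielding blocks ourselves, as
    # is done in the example used for the tests below.
    return iter(lambda: filelike.read(8192), b'')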

The WSGI application being used in this case is:

from timer1 import timed_wsgi_application1

@timed_wsgi_application1
def application(environ, start_response):
    status = '200 OK'
    response_headers = [('Content-type', 'text/plain')]
    start_response(status, response_headers)

    def file_wrapper(filelike, block_size=8192):
        # Read and yield the contents of the file-like object in
        # blocks of block_size bytes, making sure the file is closed
        # once the response is complete.
        try:
            data = filelike.read(block_size)
            while data:
                yield data
                data = filelike.read(block_size)
        finally:
            try:
                filelike.close()
            except Exception:
                pass

    return file_wrapper(open('/usr/share/dict/words'), 128)

On Mac OS X the '/usr/share/dict/words' file is about 2.5MB in size. In this example we are going to return the data in 128 byte blocks, which works out at roughly 20,000 separate blocks for each request, so as to better highlight the impact of many separate blocks being returned.

Running this example with the three most popular WSGI servers we get from a typical run:

  • gunicorn app:application # 714.012ms
  • mod_wsgi-express start-server app.py # 159.944ms
  • uwsgi --http 127.0.0.1:8000 --module app:application  # 388.556ms

In all configurations only the WSGI server itself was used, with no front end, and each was accepting requests directly via HTTP on port 8000.

What is notable from this test is the widely differing times taken by each of the WSGI servers to deliver up the same response. It highlights why one cannot rely purely on simple 'Hello World!' benchmarks. Instead you have to be cognisant of how your WSGI application delivers up its responses.

So if your WSGI application has a heavy requirement for delivering up large responses broken into many separate chunks, which WSGI server you use and how you have it configured may well be significant.

Flushing of data blocks

Having presented these results, let's now delve deeper into the possible reasons for the large disparity between the different WSGI servers.

The first step in working out why there may be a difference is to understand what actually happens when you return an iterable from a WSGI application which yields more than one data block.

The relevant part of the WSGI specification is the section on buffering and streaming. In this section it states:

WSGI servers, gateways, and middleware must not delay the transmission of any block; they must either fully transmit the block to the client, or guarantee that they will continue transmission even while the application is producing its next block. A server/gateway or middleware may provide this guarantee in one of three ways:

1. Send the entire block to the operating system (and request that any O/S buffers be flushed) before returning control to the application, OR
2. Use a different thread to ensure that the block continues to be transmitted while the application produces the next block.
3. (Middleware only) send the entire block to its parent gateway/server.

In simple terms this means that a WSGI server is not allowed to buffer the response content, and must ensure that each block is actually sent back to the HTTP client immediately, or at least in parallel with the application producing its next data block.

In general WSGI servers adopt option (1) and will immediately write any response content onto the socket connection back to the HTTP client, blocking until the operating system has accepted the complete data and will ensure it is sent. All three WSGI servers tested above are implemented using option (1).
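
To make option (1) more concrete, the loop within such a WSGI server which sends the response content amounts, in very simplified form, to something along the lines of the following sketch, ignoring details such as chunked transfer encoding and error handling:

def send_response(sock, result):
    # Consume the iterable returned by the WSGI application, writing
    # each block straight out onto the socket before going back to the
    # application for the next block. sendall() only returns once the
    # operating system has accepted all of the data for sending.
    try:
        for block in result:
            if block:
                sock.sendall(block)
    finally:
        # The WSGI specification requires close() be called on the
        # iterable if it provides one.
        if hasattr(result, 'close'):
            result.close()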

For the case where there are many blocks of data, especially smaller blocks of data, there can thus be a considerable amount of overhead in having to write each block out to the socket as it is produced.

In this case though, the same example was used for all the WSGI servers, so the number and size of the blocks was always the same. If all the WSGI servers are writing data out to the socket immediately, there must be more to the difference than just this.

As the WSGI servers are all implementing option (1), the only other apparent difference would be the overhead of the code implementing the WSGI servers themselves.

For example, gunicorn is implemented in pure Python code and so as a result could show greater overhead than mod_wsgi and uWSGI which are both implemented in C code.

Are there though other considerations, especially since mod_wsgi and uWSGI are both implemented in C code yet still showed quite different results?

INET versus UNIX sockets

In all the above cases the WSGI servers were configured to accept connections directly via HTTP over an INET socket connection.

For the case of mod_wsgi-express though there is a slight difference. This is because mod_wsgi-express runs up mod_wsgi under Apache in daemon mode. That is, the WSGI application is not actually running inside of the main Apache child worker processes, which are what handle the INET socket connection back to the HTTP client.

Instead the WSGI application runs in a separate daemon process managed by mod_wsgi, with the Apache child worker processes acting as a proxy and communicating with it over a UNIX socket connection.

To explore whether this may account for why mod_wsgi-express shows a markedly better response time, what we can do is run mod_wsgi-express in debug mode. This is a special mode which forces Apache and mod_wsgi to run as one process, rather than the normal situation where there is an Apache parent process, Apache child worker processes and the mod_wsgi daemon processes.

This debug mode is normally used when wishing to be able to interact with the WSGI application running under mod_wsgi, such as if using the Python debugger pdb or some other interactive debugging tool which exposes a console prompt direct from the WSGI application process.

The side effect of using debug mode though is that the WSGI application is effectively running in a similar way to mod_wsgi embedded mode, meaning that when writing back a response, the data blocks will be written directly onto the INET socket connection back to the HTTP client.

Running with this configuration we get:

  • mod_wsgi-express start-server --debug-mode app.py # 470.487ms

As it turns out there is indeed a difference between daemon mode and embedded mode of mod_wsgi. Now let's also consider uWSGI.

Although uWSGI is being used here to accept HTTP connections directly over an INET connection, the more typical arrangement for uWSGI is to use it behind nginx. Obviously using nginx as an HTTP proxy isn't really going to help, as one would see similar results to those shown, but uWSGI also supports its own internal wire protocol for talking to nginx called 'uwsgi', so let's try that instead and see whether it makes a difference.

When using the 'uwsgi' wire protocol though, we still have two possible choices for configuring it. The first is to use an INET socket connection between nginx and uWSGI, and the second is to use a UNIX socket connection instead.
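
As a rough illustration, the corresponding nginx configuration for proxying to uWSGI over the 'uwsgi' wire protocol would be something along the lines of the following, with the address and socket path simply matching those used in the uWSGI commands below:

location / {
    include uwsgi_params;

    # INET socket connection between nginx and uWSGI.
    uwsgi_pass 127.0.0.1:9000;

    # Or alternatively, a UNIX socket connection.
    # uwsgi_pass unix:/tmp/uwsgi.sock;
}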

  • uwsgi --socket 127.0.0.1:9000 --module app:application # 284.802ms
  • uwsgi --socket /tmp/uwsgi.sock --module app:application # 143.614ms

From this test we see two things.

The first is that even when using an INET socket connection between nginx and uWSGI, the time spent in the WSGI application is improved. This is most likely because of the more efficient 'uwsgi' wire protocol being used in place of the HTTP protocol. The uWSGI process is thus able to offload the response more quickly.

The second is that switching to a UNIX socket connection reduces the time spent in the WSGI application even more due to the lower overheads of writing to a UNIX socket connection compared to an INET socket connection.
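
If you want a rough feel for this difference in isolation from any WSGI server, a throwaway script along the lines of the following can be used to compare writing the same file in 128 byte blocks over a local INET socket versus a UNIX socket. It is only a sketch for experimentation, not a rigorous benchmark, and the socket path used is arbitrary.

import os
import socket
import threading
import time

DATA = open('/usr/share/dict/words', 'rb').read()
BLOCK_SIZE = 128

def drain(server):
    # Accept a single connection and read until the sender closes it.
    conn, addr = server.accept()
    while conn.recv(65536):
        pass
    conn.close()

def timed_send(family, address):
    server = socket.socket(family, socket.SOCK_STREAM)
    server.bind(address)
    server.listen(1)

    thread = threading.Thread(target=drain, args=(server,))
    thread.start()

    client = socket.socket(family, socket.SOCK_STREAM)
    client.connect(server.getsockname())

    # Time how long it takes to push all the blocks into the socket.
    start = time.time()
    for i in range(0, len(DATA), BLOCK_SIZE):
        client.sendall(DATA[i:i+BLOCK_SIZE])
    client.close()

    thread.join()
    server.close()

    return 1000.0 * (time.time() - start)

print('INET %.3fms' % timed_send(socket.AF_INET, ('127.0.0.1', 0)))

path = '/tmp/socket-test.sock'
if os.path.exists(path):
    os.unlink(path)
print('UNIX %.3fms' % timed_send(socket.AF_UNIX, path))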

Although the time spent in the WSGI application is reduced in both cases, it is vitally important to understand that this need not translate into an equivalent reduction in the overall response time as seen by the HTTP client.

This applies equally to mod_wsgi when run in daemon mode. In both the case of mod_wsgi in daemon mode and uWSGI behind nginx, the front end process is only allowing the backend process running the WSGI application to offload the response more quickly. It doesn't eliminate the fact that the front end represents an extra hop in the communications with the HTTP client.

In other words, some of where the time is being spent has simply been shifted out of the WSGI application and into the front end proxy.

This doesn't mean that the effort is entirely wasted though. This is because WSGI applications have a constrained set of resources, in the form of processes and threads, for handling web requests. Thus the quicker you can offload the response from the WSGI application, the quicker the process or thread is freed up to handle the next request.

Use of a front end proxy, as exists with mod_wsgi in daemon mode or where uWSGI is run behind nginx, actually allows both WSGI servers to perform more efficiently, and so they can handle a greater load than they otherwise would be able to if dealing directly with HTTP clients.

Serving up of static files

Although we can use the WSGI application code used for this test to serve up static files, in general, serving up static files from a WSGI application is a bad idea. This is because the overheads will still be significantly more than serving up the static files from a proper web server.

To illustrate the difference, we can make use of the fact that mod_wsgi-express is actually Apache running mod_wsgi and have Apache serve up our file instead. We can do this using the command:

mod_wsgi-express start-server app.py --document-root /usr/share/dict/

What will happen is that if the URL maps to a physical file in '/usr/share/dict', then it will be served up directly by Apache. If the URL doesn't map to a file, then the request will fall through to the WSGI application, which will serve it up as before.

As we can't readily time within Apache, to sufficient resolution, how long a static file request takes, we will simply time the result of using 'curl' to make the request.

$ time curl -s -o /dev/null http://localhost:8000/
real 0m0.161s
user 0m0.018s
sys 0m0.062s
$ time curl -s -o /dev/null http://localhost:8000/words
real 0m0.013s
user 0m0.005s
sys 0m0.005s

Whereas the file took 161ms to serve up via the WSGI application, it took only 13ms when served up as a static file directly by Apache.

The uWSGI WSGI server has a similar option for overlaying static files on top of a WSGI application.

uwsgi --http 127.0.0.1:8000 --module app:application --check-static /usr/share/dict/

Comparing the two methods using uWSGI we get:

$ time curl -s -o /dev/null http://localhost:8000/
real 0m0.381s
user 0m0.029s
sys 0m0.092s
$ time curl -s -o /dev/null http://localhost:8000/words
real 0m0.025s
user 0m0.006s
sys 0m0.009s

As with mod_wsgi-express, one sees a similar level of improvement.

If using nginx in front of uWSGI, you could even go one step further and offload the serving of static files to nginx, with a likely further improvement due to the elimination of one extra hop and nginx's reputation as a high performance web server.

Using uWSGI's ability to serve static files is still though a reasonable solution where it would be difficult or impossible to install nginx, such as on a PaaS.

The static file serving capabilities of mod_wsgi-express and uWSGI would in general certainly be better than pure Python options for serving static files, although how much better will depend on whether such Python based solutions make use of WSGI server extensions for serving static files in a performant way. Such extensions and how they work will be considered in a future post. 

What is to be learned

The key takeaways from the analysis in this post are:

  1. A pure Python WSGI server such as gunicorn will not perform as well as C based WSGI servers such as mod_wsgi and uWSGI when a WSGI application is streaming response content as many separate data blocks.
  2. Any WSGI server which is interacting directly with an INET socket connection back to a HTTP client will suffer from the overheads of an INET socket connection when the response consists of many separate data blocks. Use of a front end proxy which allows a UNIX socket connection to be used for the proxy connection will improve performance and allow the WSGI server to offload responses quicker, freeing up the worker process or thread sooner to handle a subsequent request.
  3. Serving of static files is better offloaded to a separate web server, or separate features of a WSGI server designed specifically for handling of static files.

Note that in this post I only focused on the three WSGI servers of gunicorn, mod_wsgi and uWSGI. Other WSGI servers do exist and I intend to revisit the Tornado and Twisted WSGI containers and the Waitress WSGI server in future posts.

I am going to deal with those WSGI servers separately as they are all implemented on top of a core which makes use of asynchronous rather than blocking communications. Use of an asynchronous layer has impacts on the ability to properly time how long the Python process is busy handling a specific web request. These WSGI servers also have other gotchas related to their use due to the asynchronous layer. They thus require special treatment.
