Thursday, February 20, 2014

Vertically partitioning Python web applications.

If you have managed to miss them, I have since the start of the year put out quite a number of blog posts related to decorators. These posts covered material I would have covered in a talk I had submitted for PyCon US, but that talk was not accepted. There was actually a second talk I also proposed for PyCon US, but it also was not accepted. This post covers some of what that talk would have covered.

It is unlikely I will be turning the topic of that second talk into another series of posts as I have done for my decorator talk, but some of the underlying issues came up yet again in discussions in the mod_wsgi mailing list recently, so figured I may as well turn parts of those mailing list posts into a blog post. I am going to try and be more diligent this year in converting more interesting mailing list discussions into blog posts, since mailing lists are generally the last place people go these days to actually try and find information.

Not all URLs are equal

Now the premise of my talk was that because web application developers will simply deploy their complete web site as a single monolithic application, they are actually limiting their ability to have their web application perform optimally. This is because the overall configuration of the web server or container is ultimately going to be dictated by the worst case requirements of small parts of their web application. These parts may not even be the most frequently visited parts of the web application, but parts which are only infrequently used.

The best example of this is where the handler for a specific URL in your web application has a large transient memory requirement.

As an example, the admin pages in a Django application may not be frequently used, but they may have a requirement to process a lot of data. This could create a large transient memory requirement just for the one request, but since memory allocated by a process is generally never given back to the operating system, this one infrequent request will blow out memory usage for the whole application. That memory, once allocated, will be retained by the process until the process is subsequently restarted.

Because of this, you could have the stupid situation whereby a request which is only run once every fifteen minutes could, over the course of a few hours, progressively be handled by a different process each time in a multiprocess web server configuration. Thus your overall memory usage will seem to jump up for no good reason until all the processes have finally hit a plateau, having each allocated the maximum amount of memory they require to handle the worst case transient memory usage of individual requests.

It can get worse though if you also have multithreading being used in each process. As the response time for a memory hungry URL gets longer and longer, you raise the odds that you could have two such memory hungry requests needing to be handled concurrently within the same process in different threads. What this means is that your worst case memory usage isn't actually just the worst case memory requirement for handling one request against that specific URL, but that multiplied by the number of request threads in the process.

As well as Django admin pages, further examples I have seen in the past where people have been hit by this are the generation of a sitemap, PDF generation and in some cases even RSS feeds where a significant amount of content is returned with each item rather than it just being a summary.

The big problem in all of this is identifying which URL has the large transient memory requirement. The tools available for this aren't good and you generally have to fall back on ad hoc solutions to try and work it out. I'll leave such techniques for another time.

Reducing the impacts

As to solving the problem once you have identified which URLs are the culprits, ideally you would change how the code works to avoid the large transient memory requirement. If you cannot do that, or not straight away, then for some Python WSGI servers you can fall back on a number of different techniques to at least lessen the impact, by configuring the WSGI server or container differently.

For example, if using mod_wsgi daemon mode, you can start using the 'inactivity-timeout' and 'maximum-requests' options to the WSGIDaemonProcess directive for the mod_wsgi daemon process group your application is running in.

What the 'maximum-requests' option does is automatically restart a specific mod_wsgi daemon process after that process has handled a set number of requests. If the WSGI application doesn't actually handle a steady stream of requests all the time, then the 'inactivity-timeout' can be used at the other extreme, causing a mod_wsgi daemon process to be restarted if it is idle for more than a set period of time.

In both cases the process is reset, with the Python interpreter being reinitialised. It will now stay at that base level of memory usage until a new web request comes in for the WSGI application. When that occurs the WSGI application will once again be loaded and start to process web requests.

So the result of the two options is to allow you to reset the memory usage back to a more sensible level if you have an issue with memory growth within the process due to incorrect management of resources, or simply because of specific URLs which have high transient memory requirements which are blowing out the overall memory requirements of the whole web application.

The problem with using such options as a solution is that the requirements for a small set of URLs are still dictating the configuration for the whole application. Previously the problem was that the memory usage requirements for a small number of URLs would end up dictating how many instances of your web application you could afford to deploy within the memory constraints of the host you were using. In using these options you are just changing the sorts of problems that can occur because of the performance impacts these options can cause.

In the case of setting a maximum on the number of requests handled by each process, the problem is that you can introduce a significant amount of process churn if this is set too low relative to the overall throughput. That is, the processes will get restarted on a frequent basis, with a consequent increase in CPU load from having to constantly reload the web application. The need to restart the process and reload the web application also means that you are temporarily reducing the overall capacity of your site, potentially resulting in longer overall response times as requests are delayed in being able to be processed.

I talked about this issue of process churn last year in my PyCon US talk 'Making Apache suck less for hosting Python web applications'. That talk didn't deal though with the problem that a web application is actually a bunch of parts that can have quite different requirements. There was an implied assumption that it was safe to treat the web application as a black box with more or less uniform behaviour across all URLs. Taking this view though is always going to end up with a suboptimal outcome.

Partitioning your application

The better solution to this problem of not all URLs being equal and having different resource requirements is to vertically partition your web application and spread it across multiple processes, where each process only handles a subset of the URLs. Each distinct set of processes can then be optimally configured to suit the requirements of the code which is being run in it.

Performing such partitioning with most WSGI deployment mechanisms can be quite complicated. This is because the WSGI servers themselves have no means of adequately partitioning a web application while still having it managed by the one WSGI container. Instead, what would be required is to run multiple distinct WSGI servers on separate ports and have them all sitting behind a front end web server such as nginx, with nginx matching URLs and performing routing as required.

If however you are using Apache/mod_wsgi, such partitioning can be quite easily managed with only minor changes to the existing configuration. This is the case because of mod_wsgi's ability to handle multiple daemon process groups in which WSGI applications can be run. It is then possible to use the builtin features of Apache for matching URLs to dynamically delegate a subset of URLs to a daemon process group separate from the one used for the main part of the application.

Take for example admin URLs in Django. If these are indeed infrequently used but can have a large transient memory requirement, what it is possible to do is:

WSGIDaemonProcess main processes=5 threads=5
WSGIDaemonProcess admin processes=1 threads=3 \
inactivity-timeout=30 maximum-requests=20
WSGIApplicationGroup %{GLOBAL}
WSGIProcessGroup main
WSGIScriptAlias / /some/path/wsgi.py
<Location /admin>
WSGIProcessGroup admin
</Location>

So what we have done is create two daemon process groups and specified that the admin pages should be handled in a distinct daemon process group of their own, where we can be more aggressive with the configuration and use the inactivity timeout and maximum requests options to combat excessive memory use. In doing this we have left things alone for the bulk of the web application and so would not be impacting on it as far as process churn is concerned.
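Note that both daemon process groups are backed by the same WSGI script file; it is purely Apache's URL matching which decides which group a given request is handed to. For a stock Django project that script would be nothing more than the usual entry point, along the lines of the following sketch (the settings module name here is purely illustrative):

# /some/path/wsgi.py

import os

# Point Django at the project settings before creating the application object.
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'mysite.settings')

from django.core.wsgi import get_wsgi_application

application = get_wsgi_application()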

The end result is that we can tailor configuration settings for different parts of the application. The only requirement is that we can reasonably easily separate the different parts of the application out based on URL, by matching them with a Location or LocationMatch directive in Apache.

Now in this example we have done this specifically to separate out the misbehaving parts of an application, but the converse can also be done.

Quite often most of the traffic for a site will hit only a small number of URLs. The performance of these few, but very frequently visited, URLs could be impeded by having to use a more general configuration for the WSGI container in order to satisfy the requirements of everything else running in the web application.

What may work better is to delegate the very highly trafficked URLs into their own daemon process groups with a processes/threads mix tuned for that scenario.

Because that daemon process group is only going to handle a small number of URLs, the actual amount of code from your application that would ever be executed within those processes could potentially be much smaller. So long as your code base is set up such that it only lazily imports the code for specific handlers when first needed, you can keep this optimised process quite lean as far as memory usage goes.
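As a rough sketch of the idea, here is a plain WSGI handler which defers its imports until it is first called, using a standard library module as a stand-in for whatever heavyweight dependency the real handler might need:

# handlers.py

def generate_sitemap(environ, start_response):
    # Deferring the import means processes which never serve this URL
    # never pay the memory cost of loading the module.
    from xml.etree import ElementTree

    urlset = ElementTree.Element('urlset')
    ElementTree.SubElement(urlset, 'loc').text = 'https://example.com/'
    body = ElementTree.tostring(urlset)

    start_response('200 OK', [('Content-Type', 'application/xml')])
    return [body]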

So instead of every process having to be very fat and eventually load up all parts of your application code, you can leave that to a smaller number of processes which, although they are going to serve up a greater number of different URLs, wouldn't necessarily get as much traffic and so don't need as much capacity.

You might therefore have the following:

WSGIDaemonProcess main processes=1 threads=5
WSGIDaemonProcess volume processes=3 threads=5
WSGIDaemonProcess admin processes=1 threads=3 \
inactivity-timeout=30 maximum-requests=20
WSGIApplicationGroup %{GLOBAL}
WSGIProcessGroup main
WSGIScriptAlias / /some/path/wsgi.py
<Location ~ "^/$">
WSGIProcessGroup volume
</Location>
<Location /publications/article/>
WSGIProcessGroup volume
</Location>
<Location /admin>
WSGIProcessGroup admin
</Location>

In this case we are delegating just the URLs corresponding to the home page for the site and one further sub URL into one daemon process group. As less code within the application would be required to service these requests, the process should have a lower memory footprint and so we can afford to spread the requests across a greater number of processes, each with a small number of threads, to avoid as much as possible any adverse effects of the GIL from running a high number of threads.

The admin pages would still be separated out due to our original issue with the transient memory requirements for those. Everything else would run in our main daemon process group. Because that now handles a much lower volume of traffic though, we can get away with fewer processes. Given that each such process will still be quite fat, as it would still need to load most of the code for the web application, fewer processes means less memory usage overall.

So by juggling things like this, handling as special cases both the worst case URLs for transient memory usage and your high traffic URLs, one can often quite dramatically reduce the amount of memory used as well as improve the response times for those URLs which are hit the most.

Monitoring for success

Now the one requirement that must be satisfied for all this to be able to be done successfully is that you must have monitoring. If you have no monitoring at all then you are going to have no idea about the traffic which is hitting each part of your web application. It will as a result be impossible to tune the processes/threads mix for each daemon process group you are running to optimise the capacity utilisation and response times.

Any such monitoring will at least have to allow you to monitor separately the memory usage of each distinct daemon process group. Monitoring memory from outside of the processes can be a problem though, as most tools will simply see the processes as all belonging to Apache.

It is possible to define a 'display-name' option against each WSGIDaemonProcess directive to set a name for processes in that daemon process group. You can even ask mod_wsgi to automatically name them based on the name of the daemon process group. For example:

WSGIDaemonProcess main display-name=%{GROUP} processes=1 threads=5
WSGIDaemonProcess volume display-name=%{GROUP} processes=3 threads=5
WSGIDaemonProcess admin display-name=%{GROUP} processes=1 threads=3 \
inactivity-timeout=30 maximum-requests=20

This would result in the processes being labelled as '(wsgi:main)', '(wsgi:volume)' and '(wsgi:admin)'.

Such labels though are only respected by tools such as BSD derived 'ps' or 'htop'. The label may not necessarily be respected by all monitoring systems.

Trying to track request throughput and response times using a separate tool can also have its own problems. Apache can be set up to log response times in addition to the URLs accessed, but if needing live monitoring then it means that the access log has to be constantly processed to derive the information. Any such tool may not though, unless sufficiently customisable, allow you to break out throughput and response time based on the subsets of URLs being matched and delegated to the different daemon process groups.

A better solution here can be to use a monitoring system that runs within the web application processes themselves.

I am possibly a little bit biased in this opinion, since I work at New Relic and wrote its Python agent, but by far the best single package available for monitoring production Python web applications is New Relic.

In using New Relic though, we do have to make some additional configuration changes. This is because normally when using New Relic, the one web application would report under one application in the New Relic UI.

For the above configuration however, what we want is to be able to view the data collected separately for each daemon process group. At the same time though, it would still be useful to be able to view it collectively under one application.

One could do some tricks in the WSGI script involving selecting different environment sections from the New Relic Python agent configuration file, but the easier way is to use the Python agent's ability to dynamically set the application name being reported to for each request. The application name can be set dynamically via a key/value pair passed in through the WSGI environ dictionary to the WSGI application.
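For WSGI servers where you cannot easily inject such a key from the server configuration, a small piece of WSGI middleware could set it instead. A minimal sketch, with the URL prefix being purely illustrative:

class AppNameOverride(object):
    def __init__(self, application):
        self.application = application

    def __call__(self, environ, start_response):
        # Report admin traffic under its own application name as well as
        # the site wide rollup name.
        if environ.get('PATH_INFO', '').startswith('/admin'):
            environ['newrelic.app_name'] = 'My Site (admin);My Site'
        return self.application(environ, start_response)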

In the case of Apache/mod_wsgi, WSGI environ key/value pairs can be set using the SetEnv directive in the Apache configuration file itself. What we can therefore do is:

WSGIDaemonProcess main processes=1 threads=5
WSGIDaemonProcess volume processes=3 threads=5
WSGIDaemonProcess admin processes=1 threads=3 \
inactivity-timeout=30 maximum-requests=20
WSGIApplicationGroup %{GLOBAL}
WSGIProcessGroup main
SetEnv newrelic.app_name 'My Site (main);My Site'
WSGIScriptAlias / /some/path/wsgi.py
<Location ~ "^/$">
WSGIProcessGroup volume
SetEnv newrelic.app_name 'My Site (volume);My Site'
</Location>
<Location /publications/article/>
WSGIProcessGroup volume
SetEnv newrelic.app_name 'My Site (volume);My Site'
</Location>
<Location /admin>
WSGIProcessGroup admin
SetEnv newrelic.app_name 'My Site (admin);My Site'
</Location>

We are again using specialisation via the Location directive. In this case we are using it to override the application name which the New Relic Python agent reports to for the different URLs.

We are also in this case using a semicolon separated list of names.

The result is that each daemon process group logs under a separate application in the New Relic UI of the form 'My Site (XXX)' but at the same time they also all report to 'My Site'.

This way you can still have a combined view, but you can also look at each daemon process group in isolation.

The ability to see each daemon process group in isolation is important, because you can then do the following separately for each daemon process group.

  • View response times.
  • View throughput.
  • View memory usage.
  • View CPU usage.
  • View capacity analysis report.
  • Trigger thread profiler.

If the parts were separated but all reporting only to the same application, the data presented would be all mixed up and, for the last four items especially, could be confusing.

Future posts and talks

Okay, so that is probably going to be a lot to digest but represents just a part of what I would have presented at PyCon US if my talk had been accepted.

Other things I would have talked about would have included dealing with the request backlog when overloaded due to increased traffic for certain URLs, dealing with the danger of malicious POST requests with a large content size, and various other topics deriving from the fact that not all URLs in a web application are equal.

As I said, I can't see myself turning that talk into a series of blog posts due to already having a lot on my plate, but we can always see what happens. One of the things I do have to do, which could be viewed as a sort of consolation prize for not getting any talks accepted for PyCon US, is present a vendor workshop on behalf of New Relic at PyCon US this year.

Since I don't like doing straight marketing, you can be assured this will be packed full of lots of useful technical information about monitoring performance of applications, metrics collection, instrumentation and data visualisation. Further information on this will be forthcoming, but hopefully you can reserve some time for late in the second day of tutorials before the conference proper. I'll likely post some details here later, but also keep an eye out on the New Relic blog for more information on that workshop.


Use of threading in mod_wsgi daemon mode.

Every so often I get asked why when examining a mod_wsgi daemon process do there appear to be more threads running than what have been configured to handle requests. Since I can never be bothered trying to find what I wrote the previous time, I end up writing the explanation from scratch each time. Figured it may be time to simply throw it up here on my blog so I can at least refer to it here.

Consider therefore the Apache/mod_wsgi configuration of:
WSGIDaemonProcess mysite processes=3 threads=2 display-name=mod_wsgi
WSGIProcessGroup mysite
WSGIScriptAlias / /some/path/wsgi.py
What this results in is mod_wsgi creating a set of three processes distinct from the Apache child worker processes. Multiple instances of the WSGI application will then be run, one in each process. The Apache child worker processes will automatically proxy the requests, with a request being picked up by one of the daemon processes and handed off to the WSGI application to handle. With the number of threads specified as two, the maximum number of concurrent requests each process can handle is two. With a total of three processes, that is a total of six concurrent requests able to be handled across the whole WSGI application.

If we were to now look at the resulting process tree using something like htop tree view, we would see:
20179 www-data 20 0 147M 76408 5680 S 28.0 0.9 10:17.85 │ ├─ mod-wsgi -k start
20240 www-data 20 0 147M 76408 5680 S  1.0 0.9  0:14.99 │ │ ├─ mod-wsgi -k start
20215 www-data 20 0 147M 76408 5680 S 12.0 0.9  5:05.16 │ │ ├─ mod-wsgi -k start
20214 www-data 20 0 147M 76408 5680 S 14.0 0.9  4:53.99 │ │ ├─ mod-wsgi -k start
20213 www-data 20 0 147M 76408 5680 S  0.0 0.9  0:00.63 │ │ ├─ mod-wsgi -k start
20212 www-data 20 0 147M 76408 5680 S  0.0 0.9  0:00.00 │ │ └─ mod-wsgi -k start

20178 www-data 20 0 138M 67680 5212 S 52.0 0.8 11:01.62 │ ├─ mod-wsgi -k start
20241 www-data 20 0 138M 67680 5212 S  0.0 0.8  0:15.45 │ │ ├─ mod-wsgi -k start
20230 www-data 20 0 138M 67680 5212 S 15.0 0.8  5:17.81 │ │ ├─ mod-wsgi -k start
20229 www-data 20 0 138M 67680 5212 S 35.0 0.8  5:24.63 │ │ ├─ mod-wsgi -k start
20228 www-data 20 0 138M 67680 5212 S  0.0 0.8  0:00.71 │ │ ├─ mod-wsgi -k start
20227 www-data 20 0 138M 67680 5212 S  0.0 0.8  0:00.00 │ │ └─ mod-wsgi -k start

20177 www-data 20 0 137M 67764 5428 S  7.0 0.8 10:47.27 │ ├─ mod-wsgi -k start
20207 www-data 20 0 137M 67764 5428 S  0.0 0.8  0:15.18 │ │ ├─ mod-wsgi -k start
20206 www-data 20 0 137M 67764 5428 S  7.0 0.8  5:16.82 │ │ ├─ mod-wsgi -k start
20205 www-data 20 0 137M 67764 5428 S  0.0 0.8  5:11.55 │ │ ├─ mod-wsgi -k start
20204 www-data 20 0 137M 67764 5428 S  0.0 0.8  0:00.69 │ │ ├─ mod-wsgi -k start
20203 www-data 20 0 137M 67764 5428 S  0.0 0.8  0:00.00 │ │ └─ mod-wsgi -k start
The question is, since we have configured the daemon process group to only have 2 threads per process, why are we seeing a total of 5 threads in each process?

The answer is that for a configuration of:
WSGIDaemonProcess mysite threads=num ...
there will be num+3 threads, where 'num' is the number of request threads indicated by the 'threads' option to the WSGIDaemonProcess directive. If no 'threads' option was specified, then the number of request threads will be 15.

What are these three extra threads then? They are as follows:

1. The main thread which was left running after the daemon process forked from Apache. It is from this thread that the request threads are created initially. It will also create the other two additional threads, the purpose of which is described below. After it has done this, the main thread becomes a caretaker for the whole process. It will wait on a special socketpair, to which a signal handler will write a character as a flag that the process should shut down. In other words, this main thread just sits there and stops the process from exiting until told to.

2. The second thread is a monitoring thread. What it does is manage things like the process inactivity timeout and shutdown timeout. If either of those timeouts occurs it will send a signal to the same process (i.e., itself), to trigger shutdown of the process.

3. The third thread is yet another monitoring thread, but one which specifically detects whether the whole Python interpreter itself gets into a complete deadlock and stops doing anything. If this is detected it will again send a signal to the same process to trigger a shutdown.

So the additional threads are to manage process shutdown and ensure the process is still alive and doing stuff.

Now under normal circumstances there will be one further additional thread created, but this is a transient thread which is only created at the time that the main thread has detected that the process is due to shutdown. This thread is what is called the reaper thread. All it will do is sleep for a specified period and if the process is still running at that point, it will forcibly kill the process.

The reaper thread is needed because the main thread will provide a short grace period to allow requests to complete and then destroy the Python interpreter, including triggering of any 'atexit' registered callbacks in each sub interpreter context. Because pending requests and destruction of the interpreters could take an unknown amount of time, or even hang, the reaper thread ensures the process is still killed if shutdown takes longer than allowed.

And there we have it, all intentional and nothing to worry about.

Monday, February 17, 2014

Performance overhead when applying decorators to methods.

This is the tenth post in my series of blog posts about Python decorators and how I believe they are generally poorly implemented. It follows on from the previous post titled 'Performance overhead of using decorators', with the very first post in the series being 'How you implemented your Python decorator is wrong'.

In the previous post I started looking at the performance implications of using decorators. In that post I started out by looking at the overheads when applying a decorator to a normal function, comparing a decorator implemented as a function closure to the more robust decorator implementation which has been the subject of this series of posts.

For a 2012 model MacBook Pro the tests yielded for a straight function call:
10000000 loops, best of 3: 0.132 usec per loop
When using a decorator implemented as a function closure the result was:
1000000 loops, best of 3: 0.326 usec per loop
And finally with the decorator factory described in this series of blog posts:
1000000 loops, best of 3: 0.771 usec per loop
This final figure was based on a pure Python implementation. When however the object proxy and function wrapper were implemented as a C extension, it was possible to get this down to:
1000000 loops, best of 3: 0.382 usec per loop
This result was not much different to when using a decorator implemented as a function closure.
What now for when decorators are applied to methods of a class?

Overhead of having to bind functions


The issue with applying decorators to methods of a class is that if you are going to honour the Python execution model, the decorator needs to be implemented as a descriptor and correctly bind methods to a class or class instance when accessed. In the decorator described in this series of posts we actually made use of that mechanism so as to be able to determine when a decorator was being applied to a normal function, instance method or class method.
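As a quick reminder of what that binding involves, accessing a function via a class or an instance triggers the function's __get__() method, yielding a bound method object. A small illustrative snippet:

class Example(object):
    def method(self):
        pass

instance = Example()

# The raw function as stored in the class dictionary.
print(Example.__dict__['method'])

# Binding it explicitly via the descriptor protocol.
print(Example.__dict__['method'].__get__(instance, Example))

# Normal attribute access performs the same binding implicitly.
print(instance.method)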

Although this process of binding ensures correct operation, it comes with additional overhead compared to a decorator implemented as a function closure, which makes no attempt to preserve the Python execution model.

In order to see what extra steps occur, we can again use the Python profile hooks mechanism to trace execution of the call of our decorated function. In this case the execution of an instance method.
First let's check again what we would get for a decorator implemented as a function closure.
def my_function_wrapper(wrapped):
    def _my_function_wrapper(*args, **kwargs):
        return wrapped(*args, **kwargs)
    return _my_function_wrapper 
class Class(object):
    @my_function_wrapper
    def method(self):
        pass
instance = Class()
import sys
def tracer(frame, event, arg):
    print(frame.f_code.co_name, event)
sys.setprofile(tracer)
instance.method()
The result in running this is effectively the same as when decorating a normal function.
_my_function_wrapper call
    method call
    method return
_my_function_wrapper return
We should therefore expect that the overhead will not be substantially different when we perform actual timing tests.

Now for when using our decorator factory. To provide context this time we need to present the complete recipe for the implementation.
import functools
import inspect 
class object_proxy(object):
    def __init__(self, wrapped):
        self.wrapped = wrapped
        try:
            self.__name__ = wrapped.__name__
        except AttributeError:
            pass
    @property
    def __class__(self):
        return self.wrapped.__class__
    def __getattr__(self, name):
        return getattr(self.wrapped, name)
class bound_function_wrapper(object_proxy):
    def __init__(self, wrapped, instance, wrapper, binding, parent):
        super(bound_function_wrapper, self).__init__(wrapped)
        self.instance = instance
        self.wrapper = wrapper
        self.binding = binding
        self.parent = parent
    def __call__(self, *args, **kwargs):
        if self.binding == 'function':
            if self.instance is None:
                instance, args = args[0], args[1:]
                wrapped = functools.partial(self.wrapped, instance)
                return self.wrapper(wrapped, instance, args, kwargs)
            else:
                return self.wrapper(self.wrapped, self.instance, args, kwargs)
        else:
            instance = getattr(self.wrapped, '__self__', None)
            return self.wrapper(self.wrapped, instance, args, kwargs)
    def __get__(self, instance, owner):
        if self.instance is None and self.binding == 'function':
            descriptor = self.parent.wrapped.__get__(instance, owner)
            return bound_function_wrapper(descriptor, instance, self.wrapper,
                    self.binding, self.parent)
        return self 
class function_wrapper(object_proxy):
    def __init__(self, wrapped, wrapper):
        super(function_wrapper, self).__init__(wrapped)
        self.wrapper = wrapper
        if isinstance(wrapped, classmethod):
            self.binding = 'classmethod'
        elif isinstance(wrapped, staticmethod):
            self.binding = 'staticmethod'
        else:
            self.binding = 'function' 
    def __get__(self, instance, owner):
        wrapped = self.wrapped.__get__(instance, owner)
        return bound_function_wrapper(wrapped, instance, self.wrapper,
                self.binding, self) 
    def __call__(self, *args, **kwargs):
        return self.wrapper(self.wrapped, None, args, kwargs) 
def decorator(wrapper):
    def _wrapper(wrapped, instance, args, kwargs):
        def _execute(wrapped):
            if instance is None:
                return function_wrapper(wrapped, wrapper)
            elif inspect.isclass(instance):
                return function_wrapper(wrapped,
                        wrapper.__get__(None, instance))
            else:
                return function_wrapper(wrapped,
                        wrapper.__get__(instance, type(instance)))
        return _execute(*args, **kwargs)
    return function_wrapper(wrapper, _wrapper)
 With our decorator implementation now being:
@decorator
def my_function_wrapper(wrapped, instance, args, kwargs):
    return wrapped(*args, **kwargs)
the result we get when executing the decorated instance method of the class is:
('__get__', 'call') # function_wrapper
    ('__init__', 'call') # bound_function_wrapper
        ('__init__', 'call') # object_proxy
        ('__init__', 'return')
    ('__init__', 'return')
('__get__', 'return')
('__call__', 'call') # bound_function_wrapper
    ('my_function_wrapper', 'call')
        ('method', 'call')
        ('method', 'return')
    ('my_function_wrapper', 'return')
('__call__', 'return')
As can be seen, due to the binding of the method to the instance of the class which occurs in __get__(), a lot more is now happening. The overhead can therefore be expected to be significantly more also.

Timing execution of the method call


As before, rather than use the implementation above, the actual implementation from the wrapt library will again be used.
This time our test is run as:
$ python -m timeit -s 'import benchmarks; c=benchmarks.Class()' 'c.method()'
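For these tests the benchmarks module would contain something along the lines of the following, with the relevant decorator applied to the method for each of the decorated cases (a sketch only, not the exact file used to produce the numbers below):

# benchmarks.py

class Class(object):
    # For the decorated cases the decorator under test, for example
    # @my_function_wrapper, would be applied here.
    def method(self):
        pass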
For the case of no decorator being used on the instance method, the result is:
10000000 loops, best of 3: 0.143 usec per loop
This is a bit more than for the case of a normal function call due to the binding of the function to the instance which is occurring.

Next up is using the decorator implemented as a function closure. For this we get:
1000000 loops, best of 3: 0.382 usec per loop
Again, somewhat more than the undecorated case, but not a great deal more than when the decorator implemented as a function closure was applied to a normal function. The overhead of this decorator when applied to a normal function vs an instance method is therefore not significantly different.

Now for the case of our decorator factory and function wrapper which honours the Python execution model, by ensuring that binding of the function to the instance of the class is done correctly.
First up is where a pure Python implementation is used.
100000 loops, best of 3: 6.67 usec per loop
Ouch. Compared to when using a function closure to implement the decorator, this is quite an additional hit in runtime overhead.

Although this is only about an extra 6 usec per call, you do need to think about this in context. In particular, if such a decorator is applied to a function which is called 1000 times in the process of handling a web request, that is an extra 6 ms added on top of the response time for that web request.

This is the point where many will no doubt argue that being correct is not worth it if the overhead is simply too much. But then, it is unlikely that the decorated function, or the decorator itself, is going to do nothing, so the additional overhead incurred may still only be a small percentage of the run time cost of those and thus not noticeable in practice.

All the same, can the use of a C extension improve things?

For the case of the object proxy and function wrapper being implemented as a C extension, the result is:
1000000 loops, best of 3: 0.836 usec per loop
So instead of 6 ms, that is less than 1 ms of additional overhead if the decorated function was called 1000 times.

It is still somewhat more than when using a decorator implemented as a function closure, but reiterating again, the use of a function closure when decorating a method of a class is technically broken by design as it does not honour the Python execution model.

Who cares if it isn't quite correct


Am I splitting hairs and being overly pedantic in wanting things to be done properly?

Sure, for what you are using decorators for, you may well get away with using a decorator implemented as a function closure. When you start moving into the area of using function wrappers to perform monkey patching of arbitrary code though, you cannot afford to do things in a sloppy way.

If you do not honour the Python execution model when doing monkey patching, you can too easily break in very subtle and obscure ways the third party code you are monkey patching. Customers don't really like it when what you do crashes their web application. So for what I need to do at least, it does matter and it matters a lot.

Now in this post I have only considered the overhead when decorating instance methods of a class. I did not cover what the overheads are when decorating static methods and class methods. If you are curious about those and how they may be different, you can check out the benchmarks for the full range of cases in the wrapt documentation.

In the next post I will touch once again on issues of performance overhead, but also a bit on alternative ways of implementing a decorator so as to try and address the problems raised in my very first post. This will be as part of a comparison between the approach described in this series of posts and the way in which the 'decorator' module available from PyPI implements its variant of a decorator factory.

Saturday, February 8, 2014

Performance overhead of using decorators.

This is the ninth post in my series of blog posts about Python decorators and how I believe they are generally poorly implemented. It follows on from the previous post titled 'The @synchronized decorator as context manager', with the very first post in the series being 'How you implemented your Python decorator is wrong'.

The posts so far in this series were bashed out in quick succession in a bit over a week. Because that was quite draining on the brain and due to other commitments I took a bit of a break. Hopefully I can get through another burst of posts, initially about performance considerations when implementing decorators and then start a dive into how to implement the object proxy which underlies the function wrapper the decorator mechanism described relies on.

Overhead in decorating a normal function


In this post I am only going to look at the overhead of decorating a normal function with the decorator mechanism which has been described. The relevant part of the decorator mechanism which comes into play in this case is:
class function_wrapper(object_proxy):  
    def __init__(self, wrapped, wrapper):
        super(function_wrapper, self).__init__(wrapped)
        self.wrapper = wrapper
        ...
    def __get__(self, instance, owner):
        ...
    def __call__(self, *args, **kwargs):
        return self.wrapper(self.wrapped, None, args, kwargs)  
def decorator(wrapper):
    def _wrapper(wrapped, instance, args, kwargs):
        def _execute(wrapped):
            if instance is None:
                return function_wrapper(wrapped, wrapper)
            elif inspect.isclass(instance):
                return function_wrapper(wrapped, wrapper.__get__(None, instance))
            else:
                return function_wrapper(wrapped, wrapper.__get__(instance, type(instance)))
        return _execute(*args, **kwargs)
    return function_wrapper(wrapper, _wrapper)
If you want to refresh your memory of the complete code that was previously presented you can check back to the last post where it was described in full.

With our decorator factory, when creating a decorator and then decorating a normal function with it we would use:
@decorator
def my_function_wrapper(wrapped, instance, args, kwargs):
    return wrapped(*args, **kwargs)  
@my_function_wrapper
def function():
    pass
This is in contrast to the same decorator created in the more traditional way using a function closure.
def my_function_wrapper(wrapped):
    def _my_function_wrapper(*args, **kwargs):
        return wrapped(*args, **kwargs)
    return _my_function_wrapper 
@my_function_wrapper
def function():
    pass
Now what actually occurs in these two different cases when we make the call:
function()

Tracing the execution of the function


In order to trace the execution of our code we can use Python's profile hooks mechanism.
import sys 
def tracer(frame, event, arg):
    print(frame.f_code.co_name, event) 
sys.setprofile(tracer) 
function()
The purpose of the profile hook is to allow you to register a callback function which is called on the entry and exit of all functions. Using this we can trace the sequence of function calls being made.

For the case of a decorator implemented as a function closure this yields:
_my_function_wrapper call
    function call
    function return
_my_function_wrapper return
So what we see here is that the nested function of our function closure is called. This is because the decorator, in the case of using a function closure, replaces 'function' with a reference to that nested function. When that nested function is called, it then in turn calls the original wrapped function.

For our implementation using our decorator factory, when we do the same thing we instead get:
__call__ call
    my_function_wrapper call
        function call
        function return
    my_function_wrapper return
__call__ return
The difference here is that our decorator replaces 'function' with an instance of our function wrapper class. Being a class, when it is called as if it was a function, the __call__() method is invoked on the instance of the class. The __call__() method is then invoking the user supplied wrapper function, which in turn calls the original wrapped function.

The result therefore is that we have introduced an extra level of indirection, or in other words an extra function call into the execution path.

Keep in mind that __call__() is actually a method though and not just a normal function. Being a method means there is actually a lot more work going on behind the scenes than for a normal function call. In particular, the unbound method needs to be bound to the instance of our function wrapper class before it can be called. This doesn't appear in the trace of the calls, but it is occurring and it will incur additional overhead.

Timing the execution of the function


By performing the trace above we know that our solution incurs an additional method call overhead. How much actual extra overhead is this resulting in though?

To try and measure the increase in overhead in each solution we can use the 'timeit' module to time the execution of our function call. As a baseline, we first want to time the call of a function without any decorator applied.
# benchmarks.py 
def function():
    pass 
To time this we use the command:
$ python -m timeit -s 'import benchmarks' 'benchmarks.function()'
The 'timeit' module when used in this way will perform a suitably large number of iterations of calling the function, divide the resulting total time for all the calls by the number of iterations, and end up with a time value for a single call.
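If you prefer, the same measurement can also be made from Python code using the timeit module API directly; a rough equivalent of the command line invocation above would be:

import timeit

number = 10000000
duration = timeit.timeit('benchmarks.function()',
        setup='import benchmarks', number=number)
print('%.3f usec per loop' % (1e6 * duration / number))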

For a 2012 model MacBook Pro this yields:
10000000 loops, best of 3: 0.132 usec per loop
Next up is to try with a decorator implemented as a function closure. For this we get:
1000000 loops, best of 3: 0.326 usec per loop
And finally with our decorator factory:
1000000 loops, best of 3: 0.771 usec per loop
In this final case, rather than use the exact code as has been presented so far in this series of blog posts, I have used the 'wrapt' module implementation of what has been described. That implementation works slightly differently as it has a few extra capabilities over what has been described and the design is also a little bit different. The overhead will still be roughly equivalent and if anything will make things look slightly worse than they would be for the more minimal implementation.
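For reference, the equivalent decorator expressed against the wrapt package looks like the following (assuming wrapt has been installed from PyPI):

import wrapt

@wrapt.decorator
def my_function_wrapper(wrapped, instance, args, kwargs):
    # Same wrapper signature as used throughout this series of posts.
    return wrapped(*args, **kwargs)

@my_function_wrapper
def function():
    pass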

Speeding up execution of the wrapper


At this point no doubt there will be people wanting to point out that this so called better way of implementing a decorator is too slow to be practical to use, even if it is more correct as far as properly honouring things such as the descriptor protocol for method invocation.

Is there therefore anything that can be done to speed up the implementation?

That is of course a stupid question for me to be asking because you should realise by now that I would find a way. :-)

The path that can be taken at this point is to implement everything that has been described for the function wrapper and object proxy as a Python C extension module. For simplicity we can keep the decorator factory itself implemented as pure Python code, since its execution is not time critical, being invoked only once when the decorator is applied to the function and not on every call of the decorated function.

One thing I am definitely not going to do is blog about how to go about implementing the function wrapper and object proxy as a Python C extension module. Rest assured though that it works in the same way as the parallel pure Python implementation. It does obviously though run a lot quicker due to being implemented as C code using the Python C APIs rather than as pure Python code.

What is the result then by implementing the function wrapper and object proxy as a Python C extension module? It is:
1000000 loops, best of 3: 0.382 usec per loop
So although a lot more effort was required in actually implementing the function wrapper and object proxy as a Python C extension module, the effort was well worth it, with the results now being very close to the implementation of the decorator that used a function closure.


Normal functions vs methods of classes


So far we have only considered the case of decorating a normal function. As expected, due to the introduction of an extra level of indirection, as well as the function wrapper being implemented as a class, the overhead was notably more, albeit still only in the order of half a microsecond.

All the same, we were able to speed things up to a point, by implementing our function wrapper and object proxy as C code, where the overhead above that of a decorator implemented as a function closure was negligible.

What now about when we decorate methods of a class? That is, instance methods, class methods and static methods. For that you will need to wait until the next blog post in this series on decorators.