Tuesday, August 11, 2015

Running ASYNC web applications under OpenShift.

In a previous post I explained how one could run custom WSGI servers under OpenShift. This can get a little tricky, as by default the OpenShift Python cartridges are set up to use Apache and mod_wsgi to host your WSGI application. In order to be able to run an alternate WSGI server such as gunicorn, uWSGI or Waitress, you need to provide a special ‘app.py’ file which exec’s a shell script, which in turn executes the target WSGI server.

This level of indirection was needed to allow the alternate WSGI server to be started up while still using the same process name as the original process which the OpenShift platform created. If this wasn’t done, OpenShift wouldn’t be able to correctly identify the WSGI server process, would think that it may not have started up correctly, and would force the gear into an error state.

Although the ‘app.py’ file was used to allow us to run an alternate WSGI server, what wasn’t fully explained in that previous post was how the ‘app.py’ file would normally be used to directly run a web server which was embedded inside of the Python web application itself.

There are actually two options here. The first is to use a WSGI server which can be run in an embedded way, instead of running a standalone WSGI server. The second is not to use a WSGI server or framework at all, and instead use an ASYNC framework, such as the Tornado web server and framework.

The purpose of this blog post is to discuss that second option, of running an ASYNC web application implemented using the Tornado web server and framework on OpenShift. Although the new OpenShift 3 using Docker and Kubernetes was recently officially released for Enterprise customers, this post will focus on the existing OpenShift 2 and so is applicable to the current OpenShift Online offering.

Embedded web server

First up, let’s explain a bit better what is meant by an embedded web server.

When we talk about WSGI servers, the more typical thing to do is to use a standalone WSGI server and point it at a WSGI script file or application module. It is the WSGI server’s job to then load the WSGI application, handle the web requests and forward them on to the WSGI application whose entry point was given in the WSGI script file or application module.

If for example one was using ‘mod_wsgi-express’ and had a WSGI script file, one would simply run:

mod_wsgi-express start-server /some/path/hello.wsgi

The contents of the ‘hello.wsgi’ file might then be:

def application(environ, start_response):
    status = '200 OK'
    output = 'Hello World!'

    response_headers = [('Content-type', 'text/plain'),
                        ('Content-Length', str(len(output)))]

    start_response(status, response_headers)

    return [output]

Key here is that the only thing in this file is the WSGI application; there is nothing about any specific WSGI server being used. As such, you must use a separate WSGI server to host this WSGI application.

The alternative is to embed the WSGI server in the Python web application code file itself. Thus instead you might have a file called ‘app.py’ which contains:

def application(environ, start_response):
    status = '200 OK'
    output = 'Hello World!'

    response_headers = [('Content-type', 'text/plain'),
                        ('Content-Length', str(len(output)))]

    start_response(status, response_headers)

    return [output]

if __name__ == '__main__':
    from wsgiref.simple_server import make_server
    httpd = make_server('', 8000, application)
    httpd.serve_forever()

When this ‘app.py’ is run as:

python app.py

it will start up the WSGI server which is contained within the ‘wsgiref’ module of the Python standard library. In other words, the Python application script file is self contained.

One warning that should be made here is that most WSGI servers which allow for embedding in this way are quite simplistic. Be very careful about using them and check their capabilities to ensure that they are indeed capable of being used in a production setting.

One thing that they generally do not support is a multi process configuration. That is, they only run with the one web application process. This can be an issue for CPU bound web applications as the runtime characteristics of the Python global interpreter lock will limit how much work can be done within the one process. The only solution to that is to have a web server that uses multiple processes to handle requests.

Also be careful that any embeddable WSGI server isn’t just single threaded, as this means that only one request can be handled at a time, limiting the amount of traffic it can support. This is especially the case when web requests aren’t all handled quickly, as a single long running request can start causing backlogging, delaying all subsequent requests.
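
If you do want a self contained ‘app.py’, but with something more capable than ‘wsgiref’, one option is to embed one of the alternate WSGI servers mentioned earlier. The following is only a rough sketch; it assumes Waitress, which is multithreaded, has been added as a dependency, and the thread count chosen is arbitrary.

def application(environ, start_response):
    status = '200 OK'
    output = 'Hello World!'

    response_headers = [('Content-type', 'text/plain'),
                        ('Content-Length', str(len(output)))]

    start_response(status, response_headers)

    return [output]

if __name__ == '__main__':
    # Waitress is a multithreaded WSGI server which can be embedded in
    # the same way as the 'wsgiref' server above.
    from waitress import serve
    serve(application, host='0.0.0.0', port=8000, threads=8)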

ASYNC web servers

One solution for long running requests, or at least those which are principally I/O bound, is to use an ASYNC web server.

In looking at ASYNC as an alternative, just be aware that ASYNC web servers have a completely different API for implementing Python web applications than WSGI does. The API for WSGI applications relies on a blocking process/thread model for handling web requests. It is not readily possible to marry up a WSGI application with an ASYNC server in such a way that the WSGI application can benefit from the characteristics an ASYNC web server and framework brings.

This means that you would have to convert your existing WSGI application to be ASYNC. More specifically, you would need to convert it to the API for the ASYNC web framework you chose to use. This is because there is no standardised API for ASYNC as there is with the WSGI API specification for synchronous or blocking web servers.

Before going down that path, also consider whether you really need to convert completely over to ASYNC. Writing and maintaining ASYNC web applications can be a lot more work than if using WSGI. You might therefore consider separating out just certain parts of an existing WSGI application and convert it to ASYNC. Only do it though where it really makes sense. ASYNC is not a magic solution to all problems.

That all said, in the Python world one of the most popular ASYNC web frameworks for implementing Python web applications is Tornado.

A simple ASYNC Tornado web application would be written as:

import tornado.ioloop
import tornado.web

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("Hello World!")

application = tornado.web.Application([
    (r"/", MainHandler),
])

if __name__ == "__main__":
    application.listen(8000)
    tornado.ioloop.IOLoop.current().start()

In this simple example you aren’t going to see much if any benefit over a WSGI application and server as it isn’t doing anything.

The benefits of an ASYNC framework come into play when the web application needs to make calls out to backend services such as databases and other web applications. Provided that the clients for such backend services are ASYNC aware and integrate with the ASYNC framework event loop, then rather than blocking, the web application can give up control and allow other requests to be handled while it is waiting.

The different web request handlers therefore effectively cooperate, explicitly yielding up control at points where they would otherwise block. This is done without the use of multithreading, so you aren’t encumbered with the overheads of running extra threads.
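
As a rough sketch of what that looks like with Tornado, a request handler can use the ASYNC HTTP client from within a coroutine, yielding control while it waits on the backend service. The backend URL used here is just a placeholder.

import tornado.gen
import tornado.httpclient
import tornado.ioloop
import tornado.web

class ProxyHandler(tornado.web.RequestHandler):
    @tornado.gen.coroutine
    def get(self):
        client = tornado.httpclient.AsyncHTTPClient()

        # The 'yield' gives up control to the event loop, so other
        # requests can be handled while waiting on the backend service.
        response = yield client.fetch("http://backend.example.com/api")

        self.write(response.body)

application = tornado.web.Application([
    (r"/", ProxyHandler),
])

if __name__ == "__main__":
    application.listen(8000)
    tornado.ioloop.IOLoop.current().start()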

The theory is that with fewer resources being used, it is then possible to handle a much higher number of concurrent requests than might be possible if using a multithreaded server.

You do have to be careful though, as this will break down where code run by a request handler does block or where you run CPU intensive tasks. ASYNC frameworks are therefore not the answer for everything and you must be very careful in how you implement ASYNC web applications.
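
One way of being careful, where a small amount of blocking or CPU intensive work in a handler cannot be avoided, is to push that work out to a thread pool so the event loop isn’t stalled. This is only a sketch; it assumes the ‘futures’ backport package is available if running on Python 2.7, and the calculation is just a stand in for real work.

from concurrent.futures import ThreadPoolExecutor

import tornado.gen
import tornado.ioloop
import tornado.web

executor = ThreadPoolExecutor(4)

def slow_calculation(count):
    # Stand in for CPU intensive or blocking work which would otherwise
    # stall the event loop and delay all other requests.
    return sum(i * i for i in range(count))

class CalculateHandler(tornado.web.RequestHandler):
    @tornado.gen.coroutine
    def get(self):
        # Yielding the future returned by the executor lets other
        # requests be processed while the work runs in a separate thread.
        result = yield executor.submit(slow_calculation, 1000000)
        self.write(str(result))

application = tornado.web.Application([
    (r"/", CalculateHandler),
])

if __name__ == "__main__":
    application.listen(8000)
    tornado.ioloop.IOLoop.current().start()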

Running on OpenShift

In order to run this Tornado application on OpenShift, you will need to make some changes. This is necessary as when running under OpenShift, the OpenShift platform will indicate what IP address and port the web server should be listening on. You cannot use port 80 and should avoid arbitrarily selecting a port.

If you do not use the port that OpenShift allocates for you, then your web traffic will not be routed to your web application and your web application may not even start up properly.

Although your own web application will not be running on port 80, OpenShift will still forward the HTTP requests received on either port 80 or port 443 for your externally visible host name, through to your web application. So do the right thing and all should work okay.

As to the IP address and port to use, these will be passed to your web application via the environment variables:

  • OPENSHIFT_PYTHON_PORT
  • OPENSHIFT_PYTHON_IP

The modified Tornado web application which you add to the ‘app.py’ file you push up to OpenShift would therefore be:

import os

import tornado.ioloop
import tornado.web

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("Hello World!")

application = tornado.web.Application([
    (r"/", MainHandler),
])

port = int(os.environ.get('OPENSHIFT_PYTHON_PORT', '8000'))
ip = os.environ.get('OPENSHIFT_PYTHON_IP', 'localhost')

if __name__ == "__main__":
    application.listen(port, ip)
    tornado.ioloop.IOLoop.current().start()

So that the Tornado web framework code will be available, you also need to ensure you add ‘tornado’ to the ‘requirements.txt’ file. For the full details of creating a Python web application on OpenShift, I defer to the online documentation.
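
If Tornado is the only package your web application requires, the ‘requirements.txt’ file need contain nothing more than the single entry:

tornado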

In the remainder of this post I want to start a discussion about a few things that you need to be careful about when using Tornado on OpenShift. Some of these issues apply to Tornado in general, but have some added relevance when running on a PaaS where the amount of memory may be less than when running on your own hardware.

Limiting content buffering

It doesn’t matter what PaaS provider you use; the typical base offering provides only 512MB of memory. You may get more where you have specifically paid for extra memory, or where using a larger instance size.

Normally memory would only come in as a factor when considering how much memory your own web application consumes. This is especially significant where using a web server which is capable of running with multiple processes, as each process can use up to the nominal maximum amount of memory your web application requires. How much memory your web application uses therefore dictates how many separate processes you can run in a multi process configuration before you run out of system memory.

When using the Tornado web server though, there is a hidden trap which I don’t believe many would appreciate exists, let alone have done something about.

The issue in this case is that when the Tornado web server receives a request which contains request content, it will by default read all that request content into memory before even passing the request to the handler for that request. This is noted in the Tornado documentation by the statement:

By default uploaded files are fully buffered in memory; if you need to handle files that are too large to comfortably keep in memory see the stream_request_body class decorator.

It is a little detail, but its potential impact is significant. Because Tornado can in theory process so many requests at the same time, each of those concurrent requests can be buffering up to 100MB (the default ‘max_buffer_size’) at the same time, blowing out memory usage. In fact, I am not even sure if there is a hard limit.

Even if there is a hard limit, it is likely to be set quite high. Such a limit isn’t therefore generally going to help, as Tornado will only block requests automatically when the request content length is declared as being greater than 100MB in size. If chunked request content is being sent, it can’t even block the request up front, as the amount of request content will not be known in advance; it still has to read and buffer the request content to work out the size and whether the limit has been reached.

With such a high default limit and there being no effective limit that can be applied at the outset for chunked request content, it is actually relatively easy, as a result of the buffering, to cause a Tornado web application to use up huge amounts of memory. This doesn’t even need to be the result of a concerted denial of service attack by a malicious actor. Instead, if you are using Tornado for handling large file uploads and need to deal with slow clients such as mobiles, then many concurrent requests could quite easily cause a Tornado web application to use quite a lot of memory just during the phase of initially reading the request in.

This is all in contrast to WSGI applications where no request content is read until the WSGI application itself decides to read in the content, allowing the WSGI application itself to decide how it is handled. This is possible with WSGI because of the blocking model and use of processes/threads for concurrency. Things get much harder in ASYNC systems.

Tornado 4.0+ does now offer a solution to avoid these problems, but it is not the default and is an opt-in mechanism for which you have to add specific code to each request handler in your web application.

With this newer mechanism, rather than the request content being buffered up and passed complete to your handler as part of the request, it is passed to your handler via a ‘data_received()’ method as the data arrives. This will be the raw data though, which means that if handling a form post or file upload, you will need to parse and decode the raw data yourself.
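
A rough sketch of what using that mechanism might look like is shown below. The handler streams uploaded data straight out to a temporary file rather than holding it in memory, and the per request size limit set in ‘prepare()’ is an arbitrary figure used for illustration only.

import tempfile

import tornado.ioloop
import tornado.web

@tornado.web.stream_request_body
class UploadHandler(tornado.web.RequestHandler):
    def prepare(self):
        # Optionally adjust the limit applied to this handler rather
        # than relying on the server wide default.
        self.request.connection.set_max_body_size(16 * 1024 * 1024)
        self.upload = tempfile.NamedTemporaryFile()

    def data_received(self, chunk):
        # Called repeatedly as raw request data arrives. Write it
        # straight out to disk instead of buffering it all in memory.
        self.upload.write(chunk)

    def post(self):
        self.upload.flush()
        self.write("upload received")

application = tornado.web.Application([
    (r"/upload", UploadHandler),
])

if __name__ == "__main__":
    application.listen(8000)
    tornado.ioloop.IOLoop.current().start()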

Anyway, the point of raising these issues is to highlight the need to pay close attention to the Tornado web server configuration and how request content is handled. This is because the default configuration and way things are handled is, as I understand it, susceptible to memory issues, not just through the normal operation of your web application but also through deliberate attacks. In the memory constrained environment of a PaaS, the last thing you want is to run out of memory.

What the overall best practices are for Tornado for handling this I don’t know and I welcome anyone pointing out any resources where it clearly explains how best to design a Tornado web application to avoid problems with large request content.

From my limited knowledge of using Tornado, I would at least suggest looking at doing the following:

  • If you are not handling large uploads, set the ‘max_buffer_size’ value to be something a lot smaller than 100MB (a sketch of doing this is shown after this list). It needs to be just large enough to handle any encoded form POST data or other file uploads you need to handle.
  • Look at the new request content streaming API in Tornado 4.0+ and consider implementing a common ‘data_received()’ method for all your handlers such that more restrictive per handler limits can be placed on the ‘max_buffer_size’. This could be handled by way of a decorator on the method. With this in place, set the limit to ‘0’ for all handlers which would never receive any request content. Even if handling a form post, set a limit commensurate with what you expect. You would though also need new common code for parsing form post data received via ‘data_received()’.
  • For large file uploads, use ‘data_received()’ to process the data straight away, or save the data straight out to a temporary file rather than buffering it up in memory, until you are ready to process it.
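
As a sketch of the first of those suggestions, the limits can be passed when creating the HTTP server for the application. The 1MB figure here is purely illustrative, with ‘max_body_size’ lowered alongside ‘max_buffer_size’.

import os

import tornado.httpserver
import tornado.ioloop
import tornado.web

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("Hello World!")

application = tornado.web.Application([
    (r"/", MainHandler),
])

if __name__ == "__main__":
    port = int(os.environ.get('OPENSHIFT_PYTHON_PORT', '8000'))
    ip = os.environ.get('OPENSHIFT_PYTHON_IP', 'localhost')

    # Cap in memory buffering and the overall request body size at 1MB
    # rather than relying on the 100MB default.
    server = tornado.httpserver.HTTPServer(application,
        max_buffer_size=1024*1024, max_body_size=1024*1024)
    server.listen(port, ip)

    tornado.ioloop.IOLoop.current().start()
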
Although I have looked specifically at request content, as that presents the more serious problem due to possibly being used as an attack vector, also be mindful of the degree to which Tornado may buffer up response content when it cannot be written out to a client in a timely manner. It has been a while since I have looked at that side of Tornado so I can’t remember exactly how that works.
 
In closing on this issue, it needs to be stressed that this isn’t an OpenShift specific issue. It can happen in any environment. It is raised in relation to OpenShift because of it being common that PaaS offerings generally have less memory available per instance for your web application to use.
 
For those who understand the inner workings of Tornado better than I, which wouldn’t be hard, if I have misrepresented anything about how Tornado works then please let me know, providing an explanation of what does happen.

Automatic scaling of gears

Another issue which needs some attention when using ASYNC applications on OpenShift is how automatic scaling for gears works.

The issue is that a main selling point of ASYNC web applications is that they can handle a much larger number of concurrent requests. Because automatic scaling in OpenShift uses fixed thresholds to decide when scaling occurs, you may be in for some surprises if you enable auto scaling for an ASYNC web application handling a large number of concurrent requests.

As this is a more complicated issue I will look at that in a subsequent post.

Friday, July 3, 2015

Using Apache to start and manage Docker containers.

In the last couple of posts (1, 2) I described what needed to be done when migrating a Python web site running under Apache/mod_wsgi to running inside of a Docker container. This included the steps necessary to have the existing Apache instance proxy requests for the original site through to the appropriate port on the Docker host and deal with any fix ups necessary to ensure that the backend Python web site understood what the public facing URL was.

In changing to running the Python web site under Docker, I didn’t cover the issue of how the instance of the Docker container itself would be started up and managed. All I gave was an example command line for manually starting the container.

docker run --rm -p 8002:80 blog.example.com

The assumption here was that you already had the necessary infrastructure in place to start such Docker containers when the system started, and restart them automatically if for some reason they stopped running.

There are various ways one could manage service orchestration under Docker. These all come with their own infrastructure which has to be set up and managed.

If instead you are just after something simple to keep the Python web site you migrated into a Docker container running, and also manage it in conjunction with the front end Apache instance, then there is actually a trick one can do using mod_wsgi on the front end Apache instance.

Daemon process groups

When using mod_wsgi, by default any hosted WSGI application will run in what is called embedded mode. Although this is the default, if you are running on a UNIX system it is highly recommended you do not use embedded mode and instead use what is called daemon mode.

The difference is that with embedded mode, the WSGI application runs inside of the Apache child worker processes. These are the same processes which handle any requests received by Apache for serving up static files. Using embedded mode can result in various issues due to the way Apache manages those processes. The best solution is simply not to use embedded mode and use daemon mode instead.

For daemon mode, what happens is that a group of one or more separate daemon processes are created by mod_wsgi and the WSGI application is instead run within those. All that the Apache child worker processes do in this case is transparently proxy the requests through to the WSGI application running in those separate daemon processes. Being a separate set of processes, mod_wsgi is able to better control how those processes are managed.

In the initial post the example given was using daemon mode, but the aim was to move the WSGI application out of the front end Apache altogether and run it using a Docker container instead. This necessitated the manual configuration to proxy the requests through to that now entirely separate web application instance running under Docker.

Now an important aspect of how mod_wsgi daemon process groups work is that the step of setting up a daemon process group is separate to the step of saying what WSGI application should actually run in that daemon process group. What this means is that it is possible to tell mod_wsgi to create a daemon process group, but then never actually run a WSGI application in it.

Combining that with the ability of mod_wsgi to load and run a specific Python script in the context of the processes making up a daemon process group when those processes are started, it is actually possible to use a daemon process group to run other Python based services instead and have Apache manage that service. This could for example be used to implement a mini background task execution service in Python allowing you to offload work from the WSGI application processes, with it all managed as part of the Apache instance.
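
As a rough sketch of that idea, the Python script loaded into such a dummy daemon process group might do nothing more than start a background thread which periodically polls for work, with the actual work here being just a placeholder.

import threading
import time

def do_pending_work():
    # Placeholder for the actual background work, such as processing
    # entries from a job queue.
    pass

def worker():
    while True:
        do_pending_work()
        time.sleep(60)

# Use a daemon thread so that loading of the script completes and the
# process can still be shut down by Apache without waiting on the loop.
thread = threading.Thread(target=worker)
thread.daemon = True
thread.start()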

As far as mod_wsgi is concerned, it doesn’t really care what the process does; it will simply create the process and trigger the loading of the initial Python script. It doesn’t even really care if that Python script performs an ‘exec()’ to run a completely different program, thus replacing the Python process with something else. It is this latter trick of being able to run a separate program that we can use to have Apache manage the life of the Docker instance created from our container image.

Running the Docker image

In the prior posts, the basic configuration we ended up with for proxying the requests through to the Python web site running under Docker was:

# blog.example.com
<VirtualHost *:80>
ServerName blog.example.com
ProxyPass / http://docker.example.com:8002/
ProxyPassReverse / http://docker.example.com:8002/

RequestHeader set X-Forwarded-Port 80
</VirtualHost>

This was after we had removed the configuration which had created a mod_wsgi daemon process group and delegated the WSGI application to run in it. We are now going to add back the daemon process group, but we will not set up any WSGI application to run in it. Instead we will set up a Python script to be loaded in the process when it starts, using the ‘WSGIImportScript’ directive.

# blog.example.com
<VirtualHost *:80>
ServerName blog.example.com

ProxyPass / http://docker.example.com:8002/
ProxyPassReverse / http://docker.example.com:8002/

RequestHeader set X-Forwarded-Port 80

WSGIDaemonProcess blog.example.com threads=1
WSGIImportScript /some/path/blog.example.com/docker-admin.py \
    process-group=blog.example.com application-group=%{GLOBAL}
</VirtualHost>

In the ‘docker-admin.py’ file we now add:

import os
os.execl('/usr/local/bin/docker', '(docker:blog.example.com)', 'run',
    '--rm', '-p', '8002:80', 'blog.example.com')

With this in place, when Apache is started, mod_wsgi will create a daemon process group with a single process. It will then immediately load and execute the ‘docker-admin.py’ script, which in turn will execute the ‘docker’ program to start up a Docker container using the image created for the backend WSGI application.

The resulting process tree would look like:

-+= 00001 root /sbin/launchd
 \-+= 64263 root /usr/sbin/httpd -D FOREGROUND
   |--- 64265 _www /usr/sbin/httpd -D FOREGROUND
   \--- 64270 _www (docker:blog.example.com) run --rm -p 8002:80 blog.example.com

Of note, the ‘docker’ program is left running in foreground mode, waiting for the Docker container to exit. Because the container is running the Python web application, that will not occur unless it is explicitly shut down.

If the container exits because the Apache instance run by mod_wsgi-express inside it crashed for some reason, the ‘docker’ program will also exit. Being a managed daemon process created by mod_wsgi, its exit will be detected and a new mod_wsgi daemon process created to replace it, thereby executing the ‘docker-admin.py’ script again and so restarting the WSGI application running under Docker.

Killing the backend WSGI application explicitly by running ‘docker kill’ on the Docker instance will also cause it to exit, but again it will be replaced automatically.

The backend WSGI application would only be shut down completely by shutting down the front end Apache itself.

Using this configuration, Apache with mod_wsgi is therefore effectively being used as a simple process manager to start up and keep alive the backend WSGI application running under Docker. If the Docker instance exits it will be replaced. If Apache is shut down, then so will be the Docker instance.

Managing other services

Although the example here showed starting up of the WSGI application which was shifted out of the front end Apache, there is no reason that a similar thing couldn’t be done for other services being run under Docker. For example, you could create separate dummy mod_wsgi daemon process groups and corresponding scripts, to start up Redis or even a database.
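
As a sketch of what that might look like for Redis, with the paths, process group name and published port being purely illustrative, you could add:

WSGIDaemonProcess redis.example.com threads=1
WSGIImportScript /some/path/redis.example.com/docker-admin.py \
    process-group=redis.example.com application-group=%{GLOBAL}

with the corresponding ‘docker-admin.py’ script mirroring the earlier one, but running the stock ‘redis’ image instead:

import os
os.execl('/usr/local/bin/docker', '(docker:redis.example.com)', 'run',
    '--rm', '-p', '6379:6379', 'redis')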

Because the front end Apache is usually already going to be integrated into the operating system startup scripts, we have managed to get management of Docker containers without needing to setup a separate system to create and manage them. If you are only playing or do not have a complicated set of services running under Docker, then this could save a bit of effort and be just as effective.

With whatever the service is though, the one thing you may want to look at carefully is how a service is shut down.

The issue here is how Apache signals the shutdown of any managed process and what happens if it doesn’t shut down promptly.

Unfortunately how Apache does this cannot be overridden, so you do have to be mindful of it in case it would cause an issue.

Specifically, when Apache is shut down or a restart triggered, Apache will send the ‘SIGINT’ signal to each managed child process. If that process has not shut down after one second, it will send the signal again. The same will occur if after a total of two seconds the process hasn't shut down. Finally, if three seconds have elapsed in total, then Apache will send a ‘SIGKILL’ signal.

Realistically any service should be tolerant of being killed abruptly, but if you have a service which can take a long time to shutdown and is susceptible to problems if forcibly killed, that could be an issue and this may not be a suitable way of managing them.

Wednesday, July 1, 2015

Redirection problems when proxying to Apache running in Docker.

In my prior post I described various issues which can arise when moving a Python web application hosted using Apache/mod_wsgi into a Docker container and then using the existing Apache instance to proxy requests through to the Docker instance.

The issues arose due to the backend Python web application running under Docker not knowing what the true URL being used to access the site was. The backend would only know about the URL used to identify the Docker host and the port on which the Python web application was being exposed. It did not know what the original URL used by the HTTP client was, nor specifically whether a HTTP or HTTPS connection was being used by the client.

To remedy this situation, the original Apache instance which was proxying the requests through to the backend, was configured to pass through extra request headers giving details about the remote client, public host name of the site, port accessed and the protocol scheme. An ability of mod_wsgi to interpret these headers and fix up the WSGI environ dictionary passed with each request with correct values was then enabled. The end result was that the WSGI application was able to correctly determine what URL the site was originally accessed as and so generate correct URLs for the same site in HTML responses and redirection headers such as ‘Location’.

Because it was a feature of mod_wsgi being used here to fix up the information passed into the WSGI application, the fix ups wouldn’t be applied where the same backend Apache instance was also being used to host static files. As only static files are being served up, one might expect that this wouldn’t be an issue, but there is one situation where it is.

In this blog post I will describe how when using mod_wsgi-express inside of a Docker container you can host static files at the same time as your WSGI application. I will then illustrate the specific problem still left unresolved by the prior method used to fix up requests bound for the WSGI application. Finally I will look at various solutions to the problem.

Hosting static files

Our final Dockerfile from the prior post was:

FROM grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild
CMD [ "--trust-proxy-header", "X-Forwarded-Host", \
"--trust-proxy-header", "X-Forwarded-Port", \
"--trust-proxy-header", "X-Forwarded-For", \
"--trust-proxy-header", "X-Forwarded-Scheme", \
"site.wsgi" ]

The configuration for the front end Apache which proxied requests through to our WSGI application running in the Docker container was:

# blog.example.com
<VirtualHost *:80>
ServerName blog.example.com
ProxyPass / http://docker.example.com:8002/

RequestHeader set X-Forwarded-Port 80
</VirtualHost>

The Docker image was being run as:

docker run --rm -p 8002:80 blog.example.com

This was all making use of mod_wsgi-express shipped as part of the ‘mod_wsgi-docker’ image from Docker Hub.

The arguments to ‘CMD’ were the extra options we had been passing to mod_wsgi-express.

In addition to being able to host a Python WSGI application with no need for a user to configure Apache themselves, mod_wsgi-express can still be used to also host static files using Apache.

The simplest way of doing this would be to create a ‘htdocs’ directory within your project and populate it with any static files you have. You then tell mod_wsgi-express to use that directory as the root directory for any static document files.

FROM grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild
CMD [ "--trust-proxy-header", "X-Forwarded-Host", \
"--trust-proxy-header", "X-Forwarded-Port", \
"--trust-proxy-header", "X-Forwarded-For", \
"--trust-proxy-header", "X-Forwarded-Scheme", \
"--document-root", "htdocs", "site.wsgi" ]

Normally when manually setting up mod_wsgi, the WSGI application, if mounted at the root of the web site, would hide any static files which may reside in the Apache ‘DocumentRoot’ directory. However, mod_wsgi-express uses a method which allows any static files to be overlaid on top of the WSGI application.

That is, if a static file is matched by a URL, then the static file will be served up, otherwise if no static file is found, the request will be passed through to the WSGI application instead.

The benefit of this approach is that you do not need to fiddle with mounting selected directories at a sub URL to the site, or even individual files if wishing to return files such as ‘robots.txt’ or ‘favicon.ico’. You just dump the static files in the ‘htdocs’ directory with a path corresponding to the URL path they should be accessed as.

The redirection problem

Imagine now that the ‘htdocs’ directory contained:

htdocs/robots.txt
htdocs/static/
htdocs/static/files.txt

That is, in the top level directory is contained a ‘robots.txt’ file as well as a sub directory. In the sub directory we then had a further file called ‘files.txt’.

If we now access either of the actual files using the URLs:

http://blog.example.com/robots.txt
http://blog.example.com/static/files.txt

then all works as expected and the contents of those files will be returned.

So long as the URLs always target the actual files in this way then all is good.

Where a problem arises though is where the URL corresponding to the ‘static’ subdirectory is accessed, and specifically where no trailing slash was added to the URL.

http://blog.example.com/static

Presuming that the URL is accessed from the public Internet, where the Docker host is not going to be accessible, the request from the browser will fail, indicating that the location is not accessible.

The reason for this is that when a URL is accessed which maps to a directory on the file system, and no trailing slash was added, then Apache will force a redirection back to the same directory, but using a URL with a trailing slash.

Looking at this using ‘curl’ we would see response headers coming back of:

$ curl -v http://blog.example.com/static
* Hostname was NOT found in DNS cache
* Trying 1.2.3.4...
* Connected to blog.example.com (1.2.3.4) port 80 (#0)
> GET /static HTTP/1.1
> User-Agent: curl/7.37.1
> Host: blog.example.com
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Date: Wed, 01 Jul 2015 01:37:46 GMT
* Server Apache is not blacklisted
< Server: Apache
< Location: http://docker.example.com:8002/static/
< Content-Length: 246
< Content-Type: text/html; charset=iso-8859-1

Apache has therefore responded with a HTTP status code of 301, indicating via the ‘Location’ response header that the resource being requested is actually located at 'http://docker.example.com:8002/static/'.

When the web browser now accesses the URL given in the ‘Location’ response header, it will fail, as ‘docker.example.com’ is an internal site and not accessible on the public Internet.

Fixing response headers

When using Apache to host just the WSGI application we didn’t have this issue as we relied on mod_wsgi to fix the details related to host/port/scheme as it was being passed into the WSGI application. Thus any redirection URL that may have been generated by the WSGI application would have been correct to start with.

For response headers in this case, where what mod_wsgi was doing doesn’t apply, we can use another technique to fix up the URL: fixing the response headers in the front end proxy as the response passes back through it. This is done using the ‘ProxyPassReverse’ directive.

# blog.example.com
<VirtualHost *:80>
ServerName blog.example.com
ProxyPass / http://docker.example.com:8002/
ProxyPassReverse / http://docker.example.com:8002/

RequestHeader set X-Forwarded-Port 80
</VirtualHost>

When the ‘ProxyPassReverse’ directive is specified, the front end will adjust any URL in the ‘Location’, ‘Content-Location’ and ‘URI’ headers of HTTP redirect responses. It will replace the URL prefix given to the directive with the corresponding URL for the front facing Apache instance.

With this in place we now get the URL we would expect to see in the ‘Location’ response header.

$ curl -v http://blog.example.com/static
* Hostname was NOT found in DNS cache
* Trying 1.2.3.4...
* Connected to blog.example.com (1.2.3.4) port 80 (#0)
> GET /static HTTP/1.1
> User-Agent: curl/7.37.1
> Host: blog.example.com
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Date: Wed, 01 Jul 2015 02:10:51 GMT
* Server Apache is not blacklisted
< Server: Apache
< Location: http://blog.example.com/static/
< Content-Length: 246
< Content-Type: text/html; charset=iso-8859-1
<
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="http://docker.example.com:8002/static/">here</a>.</p>
</body></html>
* Connection #0 to host blog.example.com left intact

We aren’t done though as this time the actual body of the response is also being shown. Although the ‘Location’ header is now correct, the URL as it appears in the response content is still wrong.

Fixing response content

Fixing up any incidences of the incorrect URL in the response content is a bit more complicated. If using Apache 2.4 for the front end though, one can however use the ‘mod_proxy_html’ module. For our example here, after having enabled ‘mod_proxy_html’, we can modify the proxy setup to be:

# blog.example.com
<VirtualHost *:80>
ServerName blog.example.com
ProxyPass / http://docker.example.com:8002/
ProxyPassReverse / http://docker.example.com:8002/

ProxyHTMLEnable On
ProxyHTMLURLMap http://docker.example.com:8002 http://blog.example.com 

RequestHeader set X-Forwarded-Port 80
</VirtualHost>

What this will do is cause any response from the backend of type 'text/html' or 'application/xhtml+xml’ to be processed as it is returned back via the proxy, with any text matching 'http://docker.example.com:8002' being replaced with 'http://blog.example.com'. The result then from our redirection would be:

$ curl -v http://blog.example.com/static
* Hostname was NOT found in DNS cache
* Trying 1.2.3.4...
* Connected to blog.example.com (1.2.3.4) port 80 (#0)
> GET /static HTTP/1.1
> User-Agent: curl/7.37.1
> Host: blog.example.com
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Date: Wed, 01 Jul 2015 02:56:47 GMT
* Server Apache is not blacklisted
< Server: Apache
< Location: http://blog.example.com/static/
< Content-Type: text/html;charset=utf-8
< Content-Length: 185
<
<html><head><title>301 Moved Permanently</title></head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="http://blog.example.com/static/">here</a>.</p>
* Connection #0 to host blog.example.com left intact

As it turns out, with the configuration being used in the backend, in following the now correct URL we will get a response of:

$ curl -v http://blog.example.com/static/
* Hostname was NOT found in DNS cache
* Trying 1.2.3.4...
* Connected to blog.example.com (1.2.3.4) port 80 (#0)
> GET /static/ HTTP/1.1
> User-Agent: curl/7.37.1
> Host: blog.example.com
> Accept: */*
>
< HTTP/1.1 404 Not Found
< Date: Wed, 01 Jul 2015 02:57:55 GMT
* Server Apache is not blacklisted
< Server: Apache
< Content-Type: text/html;charset=utf-8
< Content-Length: 151
<
<html><head><title>404 Not Found</title></head><body>
<h1>Not Found</h1>
<p>The requested URL /static/ was not found on this server.</p>
* Connection #0 to host blog.example.com left intact

So in some respects the example may have been a bit pointless as it subsequently led to a HTTP 404 response, but it did illustrate the redirection problem that exists for static files and how to deal with it.

When using mod_wsgi-express it is possible to enable directory listings:

FROM grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild
CMD [ "--trust-proxy-header", "X-Forwarded-Host", \
"--trust-proxy-header", "X-Forwarded-Port", \
"--trust-proxy-header", "X-Forwarded-For", \
"--trust-proxy-header", "X-Forwarded-Scheme", \
"--document-root", "htdocs", \
"—directory-listing”, \
"site.wsgi" ]

or even directory index files:

FROM grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild
CMD [ "--trust-proxy-header", "X-Forwarded-Host", \
"--trust-proxy-header", "X-Forwarded-Port", \
"--trust-proxy-header", "X-Forwarded-For", \
"--trust-proxy-header", "X-Forwarded-Scheme", \
"--document-root", "htdocs", \
"—directory-index", "index.html”, \
"site.wsgi" ]

so the redirection need not have ended up as a HTTP 404 response.

Either way, the problem with using ‘mod_proxy_html’ in such a situation is that all HTML responses will be processed. In this case the only HTML response which was an issue was the redirection response created by Apache itself. The additional processing overheads of ‘mod_proxy_html’ may not therefore be warranted.

Error response content

Due to the potential of ‘mod_proxy_html’ causing undue overhead if applied to all HTML responses, a simpler way to address the problem may be to change the content of the error responses generated by Apache.

For such redirection responses, it is unlikely these days to encounter a situation where the actual page content would be displayed up on a browser and where a user would need to manually follow the redirection link. Instead the ‘Location’ response header would usually be picked up automatically and the browser would go to the redirection URL immediately.

As the URL in the response content is unlikely to be used, one could therefore change the response simply not to include it. This can be done with Apache by overriding the error document content in the back end.

FROM grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild
CMD [ "--trust-proxy-header", "X-Forwarded-Host", \
"--trust-proxy-header", "X-Forwarded-Port", \
"--trust-proxy-header", "X-Forwarded-For", \
"--trust-proxy-header", "X-Forwarded-Scheme", \
"--document-root", "htdocs", \
"--error-document", "301", "/301.html", \
"site.wsgi" ]

The actual content you do want to respond with would then be placed as a static file in the ‘htdocs’ directory called ‘301.html’.

<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved.</p>
</body></html>

Trying ‘curl’ once again, we now get:

$ curl -v http://blog.example.com/static
* Hostname was NOT found in DNS cache
* Trying 1.2.3.4...
* Connected to blog.example.com (1.2.3.4) port 80 (#0)
> GET /static HTTP/1.1
> User-Agent: curl/7.37.1
> Host: blog.example.com
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Date: Wed, 01 Jul 2015 03:31:48 GMT
* Server Apache is not blacklisted
< Server: Apache
< Location: http://blog.example.com/static/
< Last-Modified: Wed, 01 Jul 2015 03:27:24 GMT
< ETag: "89-519c7e6bebf00"
< Accept-Ranges: bytes
< Content-Length: 137
< Content-Type: text/html
<
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved.</p>
</body></html>
* Connection #0 to host blog.example.com left intact

For this particular case of also using mod_wsgi-express to serve up static files, this is the only situation of this kind that I know of. If however you were using other Apache modules on the backend which were resulting in other inbuilt Apache error responses being generated, then you may want to review those and also override the error documents for them if necessary.

Other options available

For the original problem described when proxying to a WSGI application, and for this subsequent issue with static file hosting, I have endeavoured to resolve the problems using only features available in mod_wsgi or Apache 2.4. These aren’t necessarily the only options available for dealing with these issues though.

Another solution that exists is the third party Apache module called mod_rpaf. This module supports doing something similar to what mod_wsgi was doing for WSGI applications, but such that it can be applied to everything handled by the Apache server. Thus it should be able to cover the issues with both a WSGI application and serving of static files.

I am not going to cover ‘mod_rpaf’ here, but may do so in a future post.

One issue with using ‘mod_rpaf’, and which I would cover in any future post, is that it isn’t a module which is bundled with Apache. As a result one needs to build it from source code and incorporate it into the Apache installation within Docker to be able to use it.

I am also a bit concerned about what mechanisms it may have for where multiple headers can be used to indicate the same information. In the mod_wsgi module there are specific safe guards for that and although ‘mod_rpaf’ does have the ability to denote the IPs of trusted proxies, I don’t believe it provides a way of saying which header is specifically trusted when multiple options exist to pass information. It will need to be something I research further before posting about it. 

Tuesday, June 30, 2015

Proxying to a Python web application running in Docker.

I have seen a few questions of late being posed about how to setup Apache to proxy to a Python web application running in a Docker container. The questions seem to be the result of people who have in the past previously run Apache/mod_wsgi on the host, with potentially multiple Python web applications, now making use of Docker to break out the individual web applications into separate containers.

Because the Apache running on the host was making use of name based virtual hosts though, so that only one IP address was required across all the different sites, they need to retain that Apache instance as the termination point. The front end Apache as a result then merely becomes a proxy for the sites which have been separated out into separate Docker containers.

In this post I am going to explain how name based virtual hosting works and how proxying would be set up when moving the sites into a separate Docker container. I will also explain how to pass through to the sites now running in the containers details about how the site was being accessed to ensure that URLs generated by the web applications, which need to refer back to the same site via its external host name, are correct.

Name based virtual hosts

Name based virtual hosting under the HTTP protocol was a mechanism developed to allow many different web sites, accessed via different host names, to be made available on a single machine using only one IP address. The ability to use name based virtual hosts eliminated the need to allocate many IP addresses to the same physical host when wanting to share the resources of the host across many sites.

Because all the sites would have the same IP address, the mechanism relies on the HTTP request sent by a client including a special ‘Host’ header which gives the name of the site the request is targeted at.

To understand this better, let’s first look at how the host names themselves would be mapped in DNS.

Imagine to start with that you had acquired a VPS from a service provider which you then set up to run a web server. This VPS would have its own specific IP address allocated to it. In DNS this would have a name associated with it using what is called an A record.

host-1234.webhostingservice.com. A 1.2.3.4

The host name here could be a generic one which the hosting service would have set up to map to the IP address; it wouldn’t be related to the specific name of the site you want to run.

For the multiple web sites you now want to host, you need to create a name alias, which says that if your site name is accessed, the request should actually be sent to the above host on that IP address. This is done in DNS using a CNAME record.

www.example.com. CNAME host-1234.webhostingservice.com.
blog.example.com. CNAME host-1234.webhostingservice.com.
wiki.example.com. CNAME host-1234.webhostingservice.com.

So it doesn’t now matter whether the site ‘www.example.com’, ‘blog.example.com’ or ‘wiki.example.com’ is accessed, the requests are all sent to the host with name ‘host-1234.webhostingservice.com’, listening on IP address ‘1.2.3.4’.

As we are going to access all these sites using HTTP, the requests will all arrive at port 80 on that host, that being the default port on which the web server will listen for HTTP requests.

In order that the web server can distinguish between requests for the three different sites, the HTTP request needs to include the ‘Host’ header. Thus, if the web request was destined for ‘blog.example.com’, the HTTP request headers would include:

Host: blog.example.com

Apache virtual hosts

The next part of the puzzle is how Apache deals with these requests and knows which of the multiple sites it needs to be routed to.

This is managed in Apache by setting up multiple ‘VirtualHost’ definitions. Within each of the ‘VirtualHost’ definitions would be placed the configuration specific to that site. In the case of using mod_wsgi, you would then end up with something like:

<VirtualHost _default_:80>
Require all denied
</VirtualHost>

# www.example.com
<VirtualHost *:80>
ServerName www.example.com
WSGIDaemonProcess www.example.com
WSGIScriptAlias / /some/path/www.example.com/site.wsgi \
process-group=www.example.com application-group=%{GLOBAL}
<Directory /some/path/www.example.com>
<Files site.wsgi>
Require all granted
</Files>
</Directory>
</VirtualHost>

# blog.example.com
<VirtualHost *:80>
ServerName blog.example.com
WSGIDaemonProcess blog.example.com
WSGIScriptAlias / /some/path/blog.example.com/site.wsgi \
process-group=blog.example.com application-group=%{GLOBAL}
<Directory /some/path/blog.example.com>
<Files site.wsgi>
Require all granted
</Files>
</Directory>
</VirtualHost>

# wiki.example.com
<VirtualHost *:80>
ServerName wiki.example.com
WSGIDaemonProcess wiki.example.com
WSGIScriptAlias / /some/path/wiki.example.com/site.wsgi \
process-group=wiki.example.com application-group=%{GLOBAL}
<Directory /some/path/wiki.example.com>
<Files site.wsgi>
Require all granted
</Files>
</Directory>
</VirtualHost>

That is, we have three ‘VirtualHost’ definitions, one corresponding to each site, with all of them being set up against port 80, the default HTTP port.

What differs in each ‘VirtualHost’ is the value of the ‘ServerName’ directive. It is this directive which specifies what the site host name is for that specific ‘VirtualHost’.

When a request is now received on port 80, the ‘Host’ header is inspected, and the host name listed there is matched against each ‘VirtualHost’ based on the value of the ‘ServerName’ directive. If it matches then the request is routed to that specific WSGI application running under mod_wsgi.

If the host name listed by the ‘Host’ header doesn’t match the value of ‘ServerName’ in any ‘VirtualHost’, then what happens is that Apache will fall back to sending the request to whatever was the first ‘VirtualHost’ defined in the configuration files.

As it is likely undesirable to have requests for an arbitrary host name we don’t know about being sent to ‘www.example.com', a default ‘VirtualHost’ was instead created as the first one. Any request which doesn’t match on the host name will now go to it, and since 'Require all denied’ was all that was contained in that ‘VirtualHost’ it will result in Apache sending back a ‘403 Forbidden’ response.

The use of such a default ‘VirtualHost’ as a fallback ensures that we get a hard error in the case of where we might mess up the configuration, rather than the request being unexpectedly handled by a different site.

Moving sites to Docker

At this point we have our three sites implemented as separate WSGI applications running in separate daemon process groups using mod_wsgi. The WSGI applications for each site are already separated at this point and shouldn’t interfere with each other. All the same, separating each to run in its own Docker container does bring various benefits including better levels of isolation from the other sites and the host operating system, plus the ready ability to run up the sites separately during development and testing, using the same setup as would be used in production.

For a production deployment of Docker, Apache with mod_wsgi is still going to be better than or as good as other alternatives and mod_wsgi-express makes deploying a WSGI application in a Docker container even easier than the traditional way of deploying WSGI applications with Apache. What you definitely shouldn’t do is switch to using any builtin development server provided by a web framework. Even with Docker, such development servers are still not suitable for production use even though they are often what is used in documentation or blog posts related to running Python with Docker.

For Apache/mod_wsgi, the Docker image you should use as a base is the ‘grahamdumpleton/mod-wsgi-docker’ image available from Docker Hub.

There are multiple versions of the image for Python 2.7, 3.3 and 3.4. There is also a base image if you need to take full control over how a derived image is built, otherwise the ‘onbuild’ images supplied provide a convenient way of deploying a WSGI application in a container. Even if using the ‘onbuild’ image, hook scripts can still be supplied as part of your application code to perform special build actions or pre deploy steps within the container when started.

For our above sites, the ‘Dockerfile’ to build our Docker image would be as simple as:

FROM grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild
CMD [ "site.wsgi" ]

If the WSGI application had a list of Python modules that needed to be available, they would be listed in the ‘requirements.txt’ file in the root of the application directory along side the ‘Dockerfile’. The image can then be built by running:

docker build -t blog.example.com .

We can then run the image manually using:

docker run --rm -p 8002:80 blog.example.com

When using the ‘mod_wsgi-docker’ image, within the Docker container the Apache instance will listen on port 80. As we are going to be running a container for each site, they can’t all be exported as port 80 on the Docker host. As a result we need to map the internal port 80 to a different external port in each case. Thus we map ‘www.example.com’ to port 8001, ‘blog.example.com’ to port 8002 and ‘wiki.example.com’ to port 8003.

With the sites now running inside of a Docker container on our Docker host, we need to change the front end Apache configuration to proxy requests for the site through to the appropriate Docker container, rather than running the WSGI application for the site on the front end Apache instance.

Previously the Apache configuration was:

# blog.example.com
<VirtualHost *:80>
ServerName blog.example.com
WSGIDaemonProcess blog.example.com
WSGIScriptAlias / /some/path/blog.example.com/site.wsgi \
process-group=blog.example.com application-group=%{GLOBAL}
<Directory /some/path/blog.example.com>
<Files site.wsgi>
Require all granted
</Files>
</Directory>
</VirtualHost>

This now needs to be changed to:

# blog.example.com
<VirtualHost *:80>
ServerName blog.example.com
ProxyPass / http://docker.example.com:8002/
</VirtualHost>

We have therefore removed the mod_wsgi related configuration from the ‘VirtualHost’ and replaced it with a configuration which instead proxies any request for that specific ‘VirtualHost’, through to the port exported for that site on the Docker host.

URL reconstruction

The proxy changes above are enough to at least have HTTP requests originally intended for ‘blog.example.com’, to be proxied through to the new home for the site running under the Docker container. The use of a proxy in this way will however cause a few problems. This is because the WSGI application will now instead think it is running at ‘http://docker.example.com:8002'. This is reflected in the values passed through to the WSGI application for each request in the WSGI environ dictionary.

HTTP_HOST: 'docker.example.com:8002'
PATH_INFO: '/'
QUERY_STRING: ''
SERVER_NAME: 'docker.example.com'
SERVER_PORT: '8002'
SCRIPT_NAME: ''
wsgi.url_scheme: 'http'

We can see the effects of the problem by creating a test WSGI application which implements the algorithm for URL reconstruction as outlined in the WSGI specification (PEP 3333).

from urllib import quote

def reconstruct_url(environ):
    url = environ['wsgi.url_scheme']+'://'

    if environ.get('HTTP_HOST'):
        url += environ['HTTP_HOST']
    else:
        url += environ['SERVER_NAME']

        if environ['wsgi.url_scheme'] == 'https':
            if environ['SERVER_PORT'] != '443':
                url += ':' + environ['SERVER_PORT']
        else:
            if environ['SERVER_PORT'] != '80':
                url += ':' + environ['SERVER_PORT']

    url += quote(environ.get('SCRIPT_NAME', ''))
    url += quote(environ.get('PATH_INFO', ''))

    if environ.get('QUERY_STRING'):
        url += '?' + environ['QUERY_STRING']

    return url

def application(environ, start_response):
    status = '200 OK'

    output = reconstruct_url(environ)

    response_headers = [('Content-type', 'text/plain'),
                        ('Content-Length', str(len(output)))]

    start_response(status, response_headers)

    return [output]

The result from this will be:

http://docker.example.com:8002/

What we want to see here though is the public facing URL and not the URL used internally by the proxy. Thus we want:

http://blog.example.com/

The consequences of this are that when a WSGI application constructs a URL for use in links within a HTML page response, or in any HTTP response header such as the ‘Location’ header, the URL returned will be wrong.

If the internal site is not publicly accessible then the user will not be able to then reach the site when the URL is followed.

If the internal site is actually publicly accessible via the URL, then it will work, but then they will be accessing it using what should notionally be a private URL and not the public URL. Such leakage of an internal URL is undesirable.

The solution to this problem is in what additional HTTP request headers the front end Apache will send along with the request when the ‘ProxyPass’ directive is used. These headers appear in the WSGI environ dictionary passed with each request as:

HTTP_X_FORWARDED_FOR: '127.0.0.1'
HTTP_X_FORWARDED_HOST: 'blog.example.com'
HTTP_X_FORWARDED_SERVER: 'blog.example.com'

Some Python web frameworks provide inbuilt support or extensions which can be used to take such request headers and use them to override the values set for key values in the WSGI environ dictionary.

For the above code to reconstruct the URL to work properly, the two key values we need to override are ‘HTTP_HOST’ and ‘SERVER_PORT’. These need to be overridden with what the values would have been as if seen by the front end Apache server and not the backend.

For 'HTTP_HOST', we can get what would have been the value as seen by the front end by taking the value of 'HTTP_X_FORWARDED_HOST'. This reflects the 'X-Forwarded-Host' request header set by the 'ProxyPass' directive when the request was being proxied.

What the 'ProxyPass' directive doesn't provide us with, though, is the front end port on which the original request was accepted. There are two ways we can ensure this information is passed through. In both cases it involves setting the 'X-Forwarded-Port' request header when proxying the request. This header name is the accepted convention for passing across the port on which the front end web server accepted the request.

The first way is to use the Apache ‘mod_headers’ module to set the request header and hard wire it to be port 80.

# blog.example.com
<VirtualHost *:80>
ServerName blog.example.com
ProxyPass / http://docker.example.com:8002/

RequestHeader set X-Forwarded-Port 80
</VirtualHost>

The second also uses ‘mod_headers’, but with the port number being calculated dynamically using ‘mod_rewrite’.

# blog.example.com
<VirtualHost *:80>
ServerName blog.example.com
ProxyPass / http://docker.example.com:8002/

RewriteEngine On
RewriteRule .* - [E=SERVER_PORT:%{SERVER_PORT},NE]

RequestHeader set X-Forwarded-Port %{SERVER_PORT}e
</VirtualHost>

In either case, the request header appears in the WSGI environ dictionary of the back end WSGI application as 'HTTP_X_FORWARDED_PORT' and, similar to how 'HTTP_X_FORWARDED_HOST' was used to override 'HTTP_HOST', can be used to override 'SERVER_PORT'.

Having overridden ‘HTTP_HOST’ and ‘SERVER_PORT’, our reconstructed URL would now be correct.
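
Where a framework doesn't handle this for you, a minimal WSGI middleware along the following lines can apply the overrides. This is only an illustration, not the mechanism mod_wsgi itself uses, and the 'proxy_headers_fixup' name is made up; note also that trusting the headers blindly like this has the security implications discussed in the next section.

def proxy_headers_fixup(app):
    def wrapper(environ, start_response):
        # Host name as seen by the front end proxy.
        host = environ.get('HTTP_X_FORWARDED_HOST')
        if host:
            environ['HTTP_HOST'] = host

        # Port the front end proxy accepted the request on.
        port = environ.get('HTTP_X_FORWARDED_PORT')
        if port:
            environ['SERVER_PORT'] = port

        return app(environ, start_response)

    return wrapper

# Wrap the WSGI application entry point defined earlier, for example:
#
#   application = proxy_headers_fixup(application)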

Who do you really trust

As mentioned, some Python web frameworks do provide features to allow the WSGI environ values to be overridden based on these special headers. Unfortunately not all Python web frameworks do, and even though it is possible to find separate WSGI middleware which may do it, such middleware may not support overriding the port as well as the host. Some don't touch the host and port at all, and are actually only concerned with the entirely separate problem of overriding 'REMOTE_ADDR' in the WSGI environ dictionary based on information passed in the 'X-Forwarded-For' request header by a proxy.

Because Python web frameworks may not provide a feature to handle such special request headers, and because adding in a WSGI middleware can be fiddly, more recent versions of mod_wsgi have inbuilt support for handling them.

In this case where we are using the ‘mod_wsgi-docker’ image, we would change:

FROM grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild
CMD [ "site.wsgi" ]

to:

FROM grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild
CMD [ "--trust-proxy-header", "X-Forwarded-Host", \
"--trust-proxy-header", "X-Forwarded-Port", \
"--trust-proxy-header", "X-Forwarded-For", \
"site.wsgi" ]

Built in to mod_wsgi is knowledge of what the different headers are used for; when they are marked as trusted, the appropriate override will be done. In this case, as a result of the headers listed as trusted, mod_wsgi will update 'HTTP_HOST', 'SERVER_PORT' and 'REMOTE_ADDR'.

Something unique to mod_wsgi, which other Python based solutions don't provide, is that it understands that for some of the values to be overridden there are multiple conventions as to which request headers are used.

For example, for ‘REMOTE_ADDR’, request headers which might be used are ‘X-Forwarded-For’, ‘X-Client-IP’ and ‘X-Real-IP’. In the face of there being multiple request headers that can represent the same thing, it is important that only the one marked as trusted be used and that the other request headers that can mean the same thing be wiped out and not passed along with the request.

This is to avoid the sort of situation that can arise where the Apache front end is only setting the 'X-Forwarded-For' request header. If a malicious client itself set the 'X-Real-IP' header, and some WSGI middleware was also being used which took the value of 'HTTP_X_REAL_IP' and used it to override 'REMOTE_ADDR', then you have a potential security issue, with a client able to dictate what IP address the web application thinks the request is coming from.

To avoid such problems, when mod_wsgi is told to only trust the ‘X-Forwarded-For’ header, it will ensure that the other headers of ‘X-Client-IP’ and ‘X-Real-IP’ are cleared and not passed through to then be incorrectly interpreted by a WSGI middleware.

This avoids one scenario that can arise as to whether we trust such headers, but we also have the problem of whether the request even came from the proxy which we expected it to, and whose headers we trust.

If a client was able to bypass the proxy and send a request direct to the backend, then it could set the headers itself to anything it wanted and the backend wouldn’t know.

If this scenario is a concern, then mod_wsgi supports an additional feature where the IP address of the trusted proxy itself can be specified. Such so-called trusted headers will then only be honoured if they come from that proxy. If they don't, then all the equivalent headers for the value we are trying to override will be cleared from the request and ignored.

Imagining therefore that the IP address of the host where the front end Apache was running was '192.168.59.3', we would use the '--trust-proxy' option to 'mod_wsgi-express' as:

FROM grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild
CMD [ "--trust-proxy-header", "X-Forwarded-Host", \
"--trust-proxy-header", "X-Forwarded-Port", \
"--trust-proxy-header", "X-Forwarded-For", \
 "--trust-proxy", "192.168.59.3", "site.wsgi" ]

If multiple proxies were being used, then the '--trust-proxy' option can be supplied more than once. Alternatively, an IP subnet description can be used.

FROM grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild
CMD [ "--trust-proxy-header", "X-Forwarded-Host", \
"--trust-proxy-header", "X-Forwarded-Port", \
"--trust-proxy-header", "X-Forwarded-For", \
 "--trust-proxy", "192.168.59.0/24", "site.wsgi" ]

In addition to being able to list an immediate proxy, you can supply IP addresses of other proxies if the request needs to traverse through more than one. These come into play in determining how far back in the proxy chain the value of the 'X-Forwarded-For' header can be trusted when determining the true client IP address. In other words, the client IP address will be interpreted as being that immediately before the forward most trusted host.

Although I have shown here how this trust mechanism is setup when using mod_wsgi-express, the same can be done if configuring mod_wsgi in the Apache configuration itself. In that case the configuration directives are ‘WSGITrustedProxyHeaders’ and ‘WSGITrustedProxies’.
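
As a rough sketch, assuming the same headers and proxy address as in the Dockerfile examples above, the equivalent when configuring mod_wsgi directly in the Apache configuration might look like:

# Within the Apache configuration for the backend (sketch only).
WSGITrustedProxyHeaders X-Forwarded-Host X-Forwarded-Port X-Forwarded-For
WSGITrustedProxies 192.168.59.3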

Secure connections

The changes described above cover the situation of a normal HTTP connection. What though of the case where the original front end Apache was also accepting secure connections?

The original Apache configuration in this case for each site would have been something like:

# blog.example.com
<VirtualHost *:80>
ServerName blog.example.com

WSGIDaemonProcess blog.example.com

WSGIScriptAlias / /some/path/blog.example.com/site.wsgi \
    process-group=blog.example.com application-group=%{GLOBAL}

<Directory /some/path/blog.example.com>
<Files site.wsgi>
Require all granted
</Files>
</Directory>
</VirtualHost>

<VirtualHost *:443>
ServerName blog.example.com

SSLEngine On
SSLCertificateFile /some/path/blog.example.com/server.crt
SSLCertificateKeyFile /some/path/blog.example.com/server.key

WSGIScriptAlias / /some/path/blog.example.com/site.wsgi \
   process-group=blog.example.com application-group=%{GLOBAL}
<Directory /some/path/blog.example.com>
<Files site.wsgi>
Require all granted
</Files>
</Directory>
</VirtualHost>

In separating out the site into a separate Docker container, as the front end Apache is still the initial termination point, it will still be necessary for it to handle the secure connections. So long as connections through to the site running under Docker travel over a secure, non public network, we can stick with using a normal HTTP connection between the front end and the site running under Docker.

The updated Apache configuration when it is changed to proxy the site through to the Docker instance will be:

# blog.example.com
<VirtualHost *:80>
ServerName blog.example.com
ProxyPass / http://docker.example.com:8002/

RequestHeader set X-Forwarded-Port 80
</VirtualHost>

<VirtualHost *:443>
ServerName blog.example.com

SSLEngine On
SSLCertificateFile /some/path/blog.example.com/server.crt
SSLCertificateKeyFile /some/path/blog.example.com/server.key

ProxyPass / http://docker.example.com:8002/

RequestHeader set X-Forwarded-Port 443
</VirtualHost>

Checking access to our site over the secure connection with our test script which reconstructs the URL, we find though that we are not getting the desired result. Specifically, the reconstructed URL is using 'http' as the protocol scheme and not 'https'. This would mean any URLs generated in HTML responses, or in a response header such as 'Location', would cause users to start using an insecure connection.

The problem in this case is that the ‘wsgi.url_scheme’ value in the WSGI environ dictionary is being passed as ‘http’. This is due to the web server running in the Docker instance only accepting HTTP connections. We somehow now need to pass across from the front end that the initial connection to the front end Apache instance was in fact a secure connection.

For passing information about whether a secure connection was used, there is no one HTTP header which is universally accepted as a de-facto standard. There are actually about five different headers which different software packages have supported. In our example here we will use the ‘X-Forwarded-Scheme’ header, setting it to be ‘https’ when that protocol scheme is being used.

# blog.example.com
<VirtualHost *:80>
ServerName blog.example.com
ProxyPass / http://docker.example.com:8002/

RequestHeader set X-Forwarded-Port 80
</VirtualHost>

<VirtualHost *:443>
ServerName blog.example.com

SSLEngine On
SSLCertificateFile /some/path/blog.example.com/server.crt
SSLCertificateKeyFile /some/path/blog.example.com/server.key

ProxyPass / http://docker.example.com:8002/

RequestHeader set X-Forwarded-Port 443
RequestHeader set X-Forwarded-Scheme https
</VirtualHost>

Determining the value for this header could also have been done using mod_rewrite, but we will stick with hard wiring the value. We only need to do this for the 'VirtualHost' corresponding to the secure connection, i.e., port 443, as the lack of the header will be taken as meaning an HTTP connection was used.

In the Dockerfile where the options to mod_wsgi-express are being supplied, we now add this header as an additional trusted proxy header and mod_wsgi will ensure that the correct thing is done.

FROM grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild
CMD [ "--trust-proxy-header", "X-Forwarded-Host", \
"--trust-proxy-header", "X-Forwarded-Port", \
"--trust-proxy-header", "X-Forwarded-For", \
"--trust-proxy-header", "X-Forwarded-Scheme", \
"site.wsgi" ]

The end result will be that where a secure connection was used at the front end Apache, then mod_wsgi running in the Docker instance will override ‘wsgi.url_scheme’ to be ‘https’ and URL reconstruction will then be correct.
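
For completeness, where mod_wsgi's trusted header support isn't available, the same override could be sketched as WSGI middleware in the style of the earlier hypothetical example, with the usual caveat that the header must only be honoured when it can only have been set by the trusted proxy:

def scheme_fixup(app):
    def wrapper(environ, start_response):
        # 'X-Forwarded-Scheme' is the header name chosen above; other
        # front ends use different header names for the same purpose.
        if environ.get('HTTP_X_FORWARDED_SCHEME', '').lower() == 'https':
            environ['wsgi.url_scheme'] = 'https'
        return app(environ, start_response)
    return wrapper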

Non WSGI handlers

The above configurations, and the fixes to the WSGI environ dictionary done by mod_wsgi running in the Docker instance, will ensure that the construction of any URLs within the WSGI application will work correctly.

If also hosting static files using the Apache instance running in the Docker instance, or running a dynamic web application using some other Apache module, the fixes will not apply to those scenarios.

Since we are running a Python web application using mod_wsgi, the only concern for us may be whether the lack of these fixes when serving up static files using the same Apache running in the Docker instance could be an issue. You might be thinking that it wouldn't be, since what is being served up is static and so not affected by what URL it was accessed as.

Unfortunately there is one scenario where it can be an issue even when serving up static files. I will describe in a future post where it can be a problem and what can be done about it when it is.

Friday, June 26, 2015

Installing a custom Python version into a Docker image.

It is a growing problem with Linux distributions that many packages they ship quickly become out of date, and due to the policies of how the Linux distributions are managed, the packages do not get updates. The only hope for getting a newer version of a package is to wait for the next version of the Linux distribution and hope that they include the version of the package you do want.

In practice though, waiting for the next version of a Linux distribution is not something you can usually do, you have to go with what is available at the time, or even perhaps use an older version due to maturity or support concerns. Worse, once you do settle on a specific version of a Linux distribution, you are generally going to be stuck with it for many years to come.

The end result is that although you can be lucky and the Linux distribution may at least update a package you need with security fixes, if popular enough, you will be out of luck when it comes to getting bug fixes or general improvements to the package.

This has been a particular problem with major Python versions, the Apache web server, and also my own mod_wsgi package for Apache.

At this time, many so called long term support (LTS) versions of Linux ship a version of mod_wsgi which is in practice about 5 years old and well over 20 releases behind. That older version, although it may have received one security fix which was made available, has not had other bugs fixed which might have security implications, or which can result in excessive memory use, especially with older Apache versions.

The brave new world of Docker offers a solution to this because it makes it easier for users to install their own newer versions of packages which are important to them. It is therefore possible for example to even find official Docker images which provide Python 2.7, 3.2, 3.3 or 3.4, many more than the base Linux version itself offers.

The problem with any Docker image which builds its own version of Python, however, is whether the installation follows the best practices which have been developed over the years for the official base Linux versions of the package. There is also the problem of whether all the libraries required by modules in the Python standard library were installed. If such libraries aren't present, then the modules which require them will simply not be built when compiling Python from source code; the installation of Python itself will not be aborted.

In this blog post I am going to cover some of the key requirements and configuration options which should be used when installing Python in a Docker image so as to make it align with general practice as to what is done by base Linux distributions.

Although this relates mainly to installing a custom Python version into a Docker image, what is described here is also relevant to service providers which provide hosting services for Python, as well as any specialist packages which provide a means to make installation of Python easier as part of some tool for managing Python versions and virtual environments.

I have encountered a number of service providers over the years which have had inferior Python installations that exclude certain modules or prevent the installation of certain third party modules, including the inability to install mod_wsgi. Unfortunately not all service providers seem to care about offering good options for users; some simply want to make anything available so they can tick Python off some list, without really caring how good an experience they provide for Python users.

Required system packages

Python is often referred to as 'batteries included'. This means that it provides a large number of Python modules in the standard library for a range of tasks. A number of these modules have a dependency on certain system packages being installed, otherwise those Python modules will not be able to be built and installed.

This is further complicated by the fact that Linux distributions will usually split up packages into a runtime package and a developer package. For example, the base Linux system may supply the package allowing you to create a SQLite database and interact with it through a CLI, but it will not by default install the developer package which would allow you to build the 'sqlite3' module included in the Python standard library.

What the names of these required system packages are can vary based on the Linux distribution. Often people arrive at the list of the minimum packages which need to be installed by a process of trial and error, after seeing which Python modules from the standard library weren't installed when compiling Python from source code. A better way is to try and learn from what the Python version provided with the Linux distribution does.

On Debian we can do this by using the 'apt-cache show' command to list the dependencies for the Python packages. When we dig into the packages this way we find two key packages.

The first of these is the 'libpython2.7-stdlib' package. This lists the dependencies:

Depends: libpython2.7-minimal (= 2.7.9-2),
mime-support,
libbz2-1.0,
libc6 (>= 2.15),
libdb5.3,
libexpat1 (>= 2.1~beta3),
libffi6 (>= 3.0.4),
libncursesw5 (>= 5.6+20070908),
libreadline6 (>= 6.0),
libsqlite3-0 (>= 3.5.9),
libssl1.0.0 (>= 1.0.1),
libtinfo5

Within the 'python2.7-minimal' package we also find:

Depends: libpython2.7-minimal (= 2.7.9-2),
zlib1g (>= 1:1.2.0)
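
For reference, the two dependency listings above can be reproduced with commands along these lines, assuming the Debian package names 'libpython2.7-stdlib' and 'python2.7-minimal':

apt-cache show libpython2.7-stdlib | grep '^Depends:'
apt-cache show python2.7-minimal | grep '^Depends:'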

In these two lists it is the library packages which we are concerned with, as it is for those that we need to ensure that the corresponding developer package is installed so header files are available when compiling any Python modules which require that library.

The command we can next use to try and determine what the developer packages are is the 'apt-cache search' command. Take for example the 'zlib1g' package:

# apt-cache search --names-only zlib1g
zlib1g - compression library - runtime
zlib1g-dbg - compression library - development
zlib1g-dev - compression library - development

The developer package we are interested in here is 'zlib1g-dev', which will include the header files we are looking for. We are not interested in 'zlib1g-dbg', as that only provides the debugging symbols used when debugging with a C debugger, which we do not need.

We can therefore go through each of the library packages and see what we can find. For Debian at least, the developer packages we are after have a '-dev' suffix added to the package name in some form.

Do note though that the developer packages for some libraries may not have the version number in the package name. This is the case for the SSL libraries for example:

# apt-cache search --names-only libssl
libssl-ocaml - OCaml bindings for OpenSSL (runtime)
libssl-ocaml-dev - OCaml bindings for OpenSSL
libssl-dev - Secure Sockets Layer toolkit - development files
libssl-doc - Secure Sockets Layer toolkit - development documentation
libssl1.0.0 - Secure Sockets Layer toolkit - shared libraries
libssl1.0.0-dbg - Secure Sockets Layer toolkit - debug information

For this we would use just 'libssl-dev'.

Running through all these packages, the list of developer packages we likely need to have installed, in order to be able to build all the modules included as part of the Python standard library, is:

libbz2-1.0 ==> libbz2-dev
libc6 ==> libc6-dev
libdb5.3 ==> libdb-dev
libexpat1 ==> libexpat1-dev
libffi6 ==> libffi-dev
libncursesw5 ==> libncursesw5-dev
libreadline6 ==> libreadline-dev
libsqlite3-0 ==> libsqlite3-dev
libssl1.0.0 ==> libssl-dev
libtinfo5 ==> libtinfo-dev
zlib1g ==> zlib1g-dev

Having worked out what developer packages we will likely need for all the possible libraries that modules in the Python standard library may require, we can construct the appropriate command to install them.

apt-get install -y libbz2-dev libc6-dev libdb-dev libexpat1-dev \
    libffi-dev libncursesw5-dev libreadline-dev libsqlite3-dev libssl-dev \
    libtinfo-dev zlib1g-dev --no-install-recommends

Note that we only need to list the developer packages. If the base Docker image we used for Debian didn't provide the runtime variant of a package, the developer package expresses a dependency on the runtime package and so it will also be installed. Although we want such hard dependencies, we don't want suggested related packages being installed, so we use the '--no-install-recommends' option to 'apt-get install'. This cuts down on the number of unnecessary packages being installed.

Now it may be the case that not all of these are strictly necessary, as the Python modules requiring them might never be used in the types of applications we want to run inside of a Docker container. Once you install Python, however, you can't add in any extra Python module from the Python standard library after the fact; the only solution would be to reinstall Python again. So it is better to err on the side of caution and install everything that the Python package provided with the Linux distribution lists as a dependency.

If you wanted to try and double check whether they are required by working out what Python modules in the standard library actually required them, you can consult the 'Modules/Setup.dist' file in the Python source code. This file lists the C based Python extension modules and what libraries they require to be available and linked to the extension module when compiled.

For example, the entry in the 'Setup.dist' file for the 'zlib' Python module, which necessitates the availability of the 'zlib1g-dev' package, is:

#zlib zlibmodule.c -I$(prefix)/include -L$(exec_prefix)/lib -lz
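
As a quick sanity check after building Python, you can also simply try importing a few of the modules which depend on these libraries; an import failure indicates the corresponding developer package was missing at build time. This is only a spot check, not an exhaustive list:

python2.7 -c 'import bz2, ctypes, curses, readline, sqlite3, ssl, zlib'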

Configure script options

Having worked out what packages we need to install into the Linux operating system itself, we next need to look at what options should be supplied to the Python 'configure' script when building it from source code. For this, we could go search out where the specific Linux operating system maintains their packaging scripts for Python and look at those, but there is actually an easier way.

This is because Python itself will save away the options supplied to the 'configure' script and keep them in a file as part of the Python installation. We can either go in and look at that file, or use the 'distutils' module to interrogate the file and tell us what the options were.

You will obviously need to have Python installed in the target Linux operating system to work it out. You will also generally need to have both the runtime and developer variants of the Python packages. For Debian for example, you will need to have run:

apt-get install python2.7 python2.7-dev

The developer package for Python is required as it is that package that contains the file in which the 'configure' args are saved away.

With both packages installed, we can now from the Python interpreter do:

# python2.7
Python 2.7.9 (default, Mar 1 2015, 12:57:24)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from distutils.sysconfig import get_config_var
>>> print get_config_var('CONFIG_ARGS')

On Debian and Python 2.7 this yields:

'--enable-shared' '--prefix=/usr' '--enable-ipv6' '--enable-unicode=ucs4'
'--with-dbmliborder=bdb:gdbm' '--with-system-expat' '--with-system-ffi'
'--with-fpectl' 'CC=x86_64-linux-gnu-gcc' 'CFLAGS=-D_FORTIFY_SOURCE=2 -g
-fstack-protector-strong -Wformat -Werror=format-security ' 'LDFLAGS=-Wl,-z,relro'
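
The same value can also be extracted non-interactively, which can be handy when comparing installations or scripting checks inside a Docker image build:

python2.7 -c 'from distutils.sysconfig import get_config_var; print get_config_var("CONFIG_ARGS")'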

There are a couple of key options here to highlight and which need to be separately discussed. These along with their help descriptions from the 'configure' script are:

  --enable-shared ==>  Disable/enable building shared python library
  --enable-unicode[=ucs[24]] ==> Enable Unicode strings (default is ucs2) 

Shared Python library

When you compile Python from source code, there are three primary by-products.

There are all the Python modules which are part of the standard library. These may be pure Python modules, or may be implemented as or using C extension modules. The C extension modules are dynamically loadable object files which would only be loaded into a Python application if required.

There is then the Python library, which contains all the code which makes up the core of the Python interpreter itself.

Finally, there is the Python executable itself, which is run on the main script file for your Python application or when running up an interactive interpreter.

For the majority of users, that there is a Python library is irrelevant, as that library would also be statically linked into the Python executable. That the Python library exists is only due to the needs of the subset of users who want to embed the Python interpreter into an existing application.

Now there are two ways that embedding of Python may be done.

The first is that the Python library would be linked directly with the separate application executable when it is being compiled. The second is that the Python library would be linked with a dynamically loadable object, which would then in turn be loaded dynamically into the separate application.

For the case of linking the Python library directly with the separate application, static linking of the library can be used. When creating a dynamically loadable object which needs the Python library, things get a bit trickier, as trying to link a library statically into a dynamically loadable object will not always work, or can cause problems at runtime.

This is a problem which used to plague the mod_python module for Apache many years ago. All Linux distributions would only ship a static variant of the Python library. Back then everything in the Linux world was 32 bit. For the 32 bit architecture at the time, static linking of the Python library into the dynamically loadable mod_python module for Apache would work and it would run okay, but linking statically had impacts on the memory use of the Python web application process.

The issue in this case was that because it was a static library being embedded within the module and the object code also wasn't being compiled as position independent code, the linker had to do a whole lot of fix ups to allow the static code to run at whatever location it was being loaded. This had the consequence of effectively creating a separate copy of the library in memory for each process.

Even back then the static Python library was about 5-10MB in size, the result being that the web application processes were about that much bigger in their memory usage than they needed to be. This resulted in mod_python getting a bit of a reputation of being a memory hog, when part of the problem was that the Python installation was only providing a static library.

I will grant that the memory issues with mod_python weren't just due to this. The mod_python module did have other design problems which caused excessive memory usage as well, plus Apache itself was causing some of it through how it was designed at the time or how some of Apache's internal APIs used by mod_python worked.

On the latter point, mod_wsgi as a replacement for mod_python has learnt from all the problems mod_python experienced around excessive memory usage and so doesn't suffer the memory usage issues that mod_python did.

If using mod_wsgi however, do make sure you are using the latest mod_wsgi version. Those 5 year old versions of mod_wsgi that some LTS variants of Linux ship, especially if Apache 2.2 is used, do have some of those same memory issues that mod_python was affected by in certain corner cases. In short, no one should be using mod_wsgi 3.X any more; use instead the most recent versions of mod_wsgi 4.X and you will be much better off.

Alas, various hosting providers still use mod_wsgi 3.X and don't offer a more modern version. If you can't make the hosting provider provide a newer version, then you really should consider moving to one of the newer Docker based deployment options where you can control what version of mod_wsgi is installed as well as how it is configured.

Now although one could still manage with a static library back when 32 bit architectures were what was being used, this became impossible when 64 bit architectures were introduced.

I can't say I remember or understand the exact reason, but when 64 bit Linux was introduced, attempting to link a static Python library into a dynamically loadable object would fail at compilation link time. The cryptic error message you would get, suggesting some issue related to mixing of 32 and 64 bit code, would be along the lines of:

libpython2.4.a(abstract.o): relocation R_X86_64_32 against `a local
symbol' can not be used when making a shared object; recompile with -fPIC
/usr/local/lib/python2.4/config/libpython2.4.a: could not read symbols: Bad value

This error derives from those fix ups I mentioned before to allow the static code to run in a dynamically loadable object. What was previously possible for just 32 bit object code, was now no longer possible under the 64 bit Linux systems of the time.

In more recent times with some 64 bit Linux systems, it seems that static linking of libraries into a dynamically loadable object may again be possible, or at least the linker will not complain. Even so, where I have seen it being done with 64 bit systems, the user was experiencing strange runtime crashes which went away when steps were taken to avoid static linking of the Python library.

So static linking of the Python library into a dynamically loadable object is a bad idea, causing either additional memory usage, failing at link time, or potentially crashing at run time. What therefore is the solution?

The solution here is to generate a shared version of the Python library and link that into the dynamically loadable object.

In this case all the object code in the Python library will be what is called position independent to begin with, and so no fix ups are needed which would cause the object code from the library to become process local. Being a proper shared library also means that there will only be one copy of the code from the Python library in memory across the whole operating system. That is, all processes within the one Python web application will share the library's object code as common memory.

That isn't all though, as any separate Python applications you start would also share that same code from the Python library in memory. The end result is a reduction in the amount of overall system memory used.

Use of a shared library for Python therefore enables applications which want to embed Python via a dynamically loadable object to actually work and has the benefits of cutting down memory usage by applications that use the Python shared library.

Although a better solution, when you compile and install Python from source code the creation of a shared version of the Python library isn't the default; only a static Python library will be created.

In order to force the creation of a shared Python library you must supply the '--enable-shared' option to the 'configure' script for Python when it is being built. This therefore is the reason why that option was appearing in the 'CONFIG_ARGS' variable saved away and extracted using 'distutils'.

You would think that since providing a shared library for Python enables the widest set of use cases, be it running the Python executable directly or embedding Python, this would be the best solution. Even though it does work, you will find some who will deride the use of shared libraries and say it is a really bad idea.

The two main excuses I have heard from people pushing back on the suggestion of using '--enable-shared' when Python is being built are:

  • That shared libraries introduce all sorts of obscure problems and bugs for users.
  • That position independent object code from a shared library when run is slower.

The first excuse I find perplexing and actually indicates to a degree a level of ignorance about how shared libraries are used and also how to manage such things as the ability of an application to find the shared library at runtime.

I do acknowledge that if an application using a shared library isn't built properly that it may fail to find that shared library at runtime. This can come about as the shared library will only actually be linked into the application when the application is run. To do that it first needs to find the shared library.

Under normal circumstances shared libraries would be placed into special ordained directories that the operating system knows are valid locations for shared libraries. So long as this is done then the shared library will be found okay.

The problem is when a shared library is installed into a non standard directory and the application, when compiled, wasn't embedded with the knowledge of where that directory is, or if it was, the whole application and library were installed at a different location to where they were originally intended to reside in the file system.

Various options exist for managing this if you are trying to install Python into a non standard location, so it isn't a hard problem. Some still seem to want to make a bigger issue out of it than it is though.
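
One common option is to embed the location of the shared library into the Python executable itself at build time using an rpath. A sketch only, where '/usr/local/python-2.7' is just an arbitrary example prefix:

./configure --prefix=/usr/local/python-2.7 --enable-shared \
    LDFLAGS=-Wl,-rpath=/usr/local/python-2.7/lib
make
make install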

As to the general complaint of shared libraries causing other obscure problems and bugs, as often as this was raised with me, no one offered up concrete examples to support the claim.

For the second claim, there is some technical basis to this criticism, as position independent code will indeed run ever so slightly slower where it involves calling C code functions compiled as position independent code. In general the difference is going to be so minimal as not to be noticeable, only perhaps affecting heavily CPU bound code.

To also put things in context, all the Python modules in the standard library which use a C extension module will be affected by this overhead regardless, as they must be compiled as position independent code in order to be able to be dynamically loaded on demand. The only place therefore where this can be seen as an issue is in the Python interpreter core, which is what the code in the Python library implements. Thus the CPU bound code would also need to be principally pure Python code.

When one talks about something like a Python web application however, there is going to be a fair bit of blocking I/O and the potential for code to be running in C extension modules, or even the underlying C code of an underlying WSGI server or web server, such as the case with Apache and mod_wsgi. The difference in execution time between the position independent code of a shared library and that of a static library, in something like a web application, is going to be at the level of background noise and not noticeable.

Whether this is really an issue or just a case of premature optimisation will really depend on the use case. Either way, if you want to use Python in an embedded system where Python needs to be linked into a dynamically loadable object, you don't have a choice, you have to have a shared library available for Python.

What the issue really comes down to is what the command line Python executable does and whether it is linked with a shared or static Python library.

In the default 'configure' options for Python, it will only generate a static library so in that case everything will be static. When you do use '--enable-shared', that will generate a shared library, but it will also result in the Python executable linking to that shared library. This therefore is the contentious issue that some like to complain about.

Possibly to placate these arguments, what some Linux distributions do is try to satisfy both requirements. That is, they will provide a shared Python library, but still also provide a static Python library and link the static Python library with the Python executable.

On a Linux system you can verify whether the Python installation you use is using a static or shared library for the Python executable by looking at the size of the executable, but also by running 'ldd' on the executable to see what shared libraries it is dependent on. If statically linked, the Python executable will be a few MB in size and will not have a dependency on a shared version of the Python library.

# ls -las /usr/bin/python2.7
3700 -rwxr-xr-x 1 root root 3785928 Mar 1 13:58 /usr/bin/python2.7
# ldd /usr/bin/python2.7
linux-vdso.so.1 (0x00007fff84fe5000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f309388d000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f3093689000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f3093486000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f309326b000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f3092f6a000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3092bc1000)
/lib64/ld-linux-x86-64.so.2 (0x00007f3093aaa000)

Jumping into the system library directory on a Debian system using the system Python installation, we see however that a shared Python library does still exist, even if the Python executable isn't using it.

# ls -o libpython2.7.*
lrwxrwxrwx 1 root 51 Mar 1 13:58 libpython2.7.a -> ../python2.7/config-x86_64-linux-gnu/libpython2.7.a
lrwxrwxrwx 1 root 17 Mar 1 13:58 libpython2.7.so -> libpython2.7.so.1
lrwxrwxrwx 1 root 19 Mar 1 13:58 libpython2.7.so.1 -> libpython2.7.so.1.0
-rw-r--r-- 1 root 3614896 Mar 1 13:58 libpython2.7.so.1.0

Hopefully in this case then both sides of the argument are happy. The command line Python executable will run as fast as it can, yet the existence of the shared library still allows embedding.

Now in the case of Debian, they are doing all sorts of strange things to ensure that libraries are located in the specific directories they require. This is done in the Debian specific packaging scripts. The question then is whether providing both variants of the Python library can be easily done by someone compiling directly from the Python source code.

The answer is that it is possible, albeit that you will need to build Python twice. When it comes to installing Python after it has been built, you just need to be selective about what is installed.

Normally the process of building and installing Python would be to run the following in the Python source code.

./configure --enable-shared
make
make install 

Note that I have left out most of the 'configure' arguments just to show the steps. I have also ignored the issue of whether you have rights to install to the target location.

These commands will build and install Python with a shared library, and with the Python executable linked against that shared library.

If we want to have both a static and shared library and for the Python executable to use the static library, we can instead do:

./configure --enable-shared
make
make install

make distclean

./configure
make
make altbininstall

What we are doing here is building Python twice, first with the shared library enabled and then with just the static library. In the first case we install everything and set up a fully functional Python installation.

In the second case however, we will only trigger the 'altbininstall' target and not the 'install' target.

When the 'altbininstall' target is used, all that will be installed is the static library for Python and the Python executable linked with the static library. In doing this, the existing Python executable using the shared library will be overwritten by the statically linked one.

The end result is a Python installation which is a combination of two installs. A shared library for Python for embedding, but also a statically linked Python executable for those who believe that the time difference in the execution of position independent code in the interpreter core is significant enough to be of a concern and so desire speed over anything else.

Unicode character sets

The next option to 'configure' which needs closer inspection is the '--enable-unicode' option.

The name of the option is a bit misleading as Unicode support is these days always compiled into Python. What is being configured with this option is the number of bytes in memory which are to be used for each Unicode character. By default 2 bytes will be used for each Unicode character.

Although 2 bytes is the default, traditionally the Python installations shipped by Linux distributions will always enable the use of 4 bytes per Unicode character. This is why the option to 'configure' is actually '--enable-unicode=ucs4'.

Since Unicode will always be enabled by default, this option actually got renamed in Python 3.0 and is replaced by the '--with-wide-unicode' option. After the rename, the supplying of the option enables the use of 4 bytes, the same as if '--enable-unicode=ucs4' had been used.

The option disappears entirely in Python 3.3, as Python itself from that version will internally determine the appropriate Unicode character width to use. You can read more about that change in PEP 393.

Although Python can quite happily be built for either 2 or 4 byte Unicode characters prior to Python 3.3, the reason for using the same Unicode character width as the Python version supplied by the Linux distribution is that, prior to Python 3.3, the width chosen affected the Python ABI.

Specifically, functions related to Unicode at the C code level in the Python library would be named with the character width embedded within the name. That is, the function names would embed the string 'UCS2' or 'UCS4'. Any code in an extension module would use a generic name, where the mapping to the specific function name was achieved through the generic name actually being a C preprocessor macro.

The result of this was that C extension modules had references to functions in the Python library that actually embedded the trait of how many bytes were being used for a Unicode character.

Where this can create a problem is where a binary Python wheel is created on one Python installation and then an attempt made to install it within another Python installation where the configured Unicode character width was different. The consequence of doing this would be that the Unicode functions the extension module required would not be able to be found, as they would not exist in the Python library which had been compiled with the different Unicode character width.

It is therefore important when installing Python to always define the Unicode character width to be the same as what would traditionally be used on that system for Python installations on that brand of Linux and architecture. By doing that you ensure that a binary Python wheel compiled with one Python installation, should always be able to be installed into a different Python installation of the same major/minor version on a similar Linux system.

For Linux systems, as evidenced by the '--enable-unicode=ucs4' option being used here with Python 2.7, wide Unicode characters are always used. This isn't the default for the 'configure' script though, so the appropriate option does always need to be passed to 'configure' when it is run.
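
If you want to check what an existing Python 2 installation was built with, the simplest way is to look at 'sys.maxunicode', which will be 1114111 for a 4 byte (UCS-4) build and 65535 for a 2 byte (UCS-2) build:

python2.7 -c 'import sys; print sys.maxunicode'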

Optional system libraries

Which system packages for libraries needed to be installed was determined earlier by looking at what packages were listed as dependencies of the system Python package. Of these there are two which technically are optional. These are the packages for the 'expat' and 'ffi' libraries.

The reason these are optional is that the Python source contains its own copies of the source code for these libraries. Unless you tell the 'configure' script by way of the '--with-system-expat' and '--with-system-ffi' options to actually use the versions of these libraries installed by the system packages, then the builtin copies will instead be compiled and used.

Once upon a time, using the copy of 'expat' bundled with the Python source code could cause a lot of problems when trying to use Python embedded within another application such as Apache. This was because Apache would link to and use the system package for the 'expat' library while Python used its own copy. Where this caused a problem was when the two copies of the 'expat' library were incompatible. The functions in the copy loaded by Apache could in some cases be used in preference to that built in to Python, with a crash of the process occurring when expected structure layouts were different between the two versions of 'expat'.

This problem came about because Python did not originally namespace all the functions exported by the 'expat' library in its copy. You therefore had two copies of the same function and which was used was dependent on how the linker resolved the symbols when everything was loaded.

This was eventually solved by way of Python adding a name prefix on all the functions exported by its copy of the 'expat' library so that it would only be used by the Python module for 'expat' which wrapped it. Apache would at the same time use the system version of the 'expat' library.

These days it is therefore safe to use the copy of the 'expat' library bundled with Python and the '--with-system-expat' option can be left out, but as the system version of 'expat' is likely going to be more up to date than that bundled with Python, using the system version would still be preferred.

The situation with the 'ffi' library is similar to the 'expat' library, in that you can either use the bundled version or the system version. I don't actually know whether the 'ffi' library has to contend with the same sorts of namespace issues. It does worry me that on a quick glance I couldn't see anything in the bundled version where an attempt was made to add a name prefix to exported functions. Even though it may not be an issue, it would still be a good idea to use the system version of the library to make sure no conflicts arise where Python is embedded in another application.

Other remaining options

The options relating to the generation of a shared library, Unicode character width and system versions of libraries are the key options which you want to pay attention to. What other options should be used can depend a bit on what Linux variant is being used. With more recent versions of Docker now supporting IPV6, including the '--enable-ipv6' option when running 'configure' is also a good bet in case a user has a need for IPV6.

Other options may relate more to the specific compiler tool chain or hardware being used. The '--with-fpectl' option falls into this category. In cases where you don't specifically know what an option does, it is probably best to include it.

Beyond getting the installation of Python itself right, being a Docker image, where space consumed by the Docker image itself is often a concern, one could also consider steps to trim a little fat from the Python installation.

If you want to go to such lengths there are two things you can consider removing from the Python installation.

The first of these is all the 'test' and 'tests' subdirectories of the Python standard library. These contain the unit test code for testing the Python standard library. It is highly unlikely you will ever need these in a production environment.

The second is the compiled byte code files with '.pyc' and '.pyo' extensions. The intent of these files is to speed up application loading, but given that a Docker image is usually going to be used to run a persistent service which stays running for the life of the container, these files only come into play once at startup. You may well feel that the reduction in image size is more beneficial than the very minor overhead incurred by the application needing to parse the source code on startup, rather than being able to load the code as compiled byte code.

The removal of the '.pyc' and '.pyo' files will no doubt be a contentious issue, but for some types of Python applications, such as a web service, it may be quite a reasonable thing to do.
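
If going down this path, the cleanup can be done towards the end of the Dockerfile build with commands along these lines, where '/usr/local/lib/python2.7' is an assumed install location that would need adjusting to match your own layout:

# Remove unit test directories from the standard library.
find /usr/local/lib/python2.7 -depth -type d \
    \( -name test -o -name tests \) -exec rm -rf {} +

# Remove compiled byte code files.
find /usr/local/lib/python2.7 \
    \( -name '*.pyc' -o -name '*.pyo' \) -exec rm -f {} +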