Tuesday, June 30, 2015

Proxying to a Python web application running in Docker.

I have recently seen a few questions about how to set up Apache to proxy to a Python web application running in a Docker container. The questions seem to come from people who have previously run Apache/mod_wsgi on the host, potentially with multiple Python web applications, and who are now making use of Docker to break out the individual web applications into separate containers.

Because Apache running on the host was making use of name based virtual hosts, so that only one IP address was required across all the different sites, they need to retain that Apache instance as the termination point. The front end Apache then merely becomes a proxy for the sites which have been separated out into separate Docker containers.

In this post I am going to explain how name based virtual hosting works and how proxying would be set up when moving the sites into separate Docker containers. I will also explain how to pass details about how a site was accessed through to the site now running in a container, to ensure that URLs generated by the web application, which need to refer back to the same site via its external host name, are correct.

Name based virtual hosts

Name based virtual hosting under the HTTP protocol is a mechanism developed to allow many different web sites, accessed by different host names, to be made available on a single machine using only one IP address. The ability to use name based virtual hosts eliminated the need to allocate many IP addresses to the same physical host when sharing the resources of the host between many sites.

Because all the sites would have the same IP address, the mechanism relies on the HTTP request sent by a client including a special ‘Host’ header which gives the name of the site the request is targeted at.

To understand this better, let’s first look at how the host names themselves would be mapped in DNS.

Imagine to start with that you had acquired a VPS from a service provider which you then set up to run a web server. This VPS would have its own specific IP address allocated to it. In DNS this would have a name associated with it using what is called an A record.

host-1234.webhostingservice.com. A 1.2.3.4

The host name here could be a generic one which the hosting service would have set up to map to the IP address; it wouldn’t be related to the specific name of the site you want to run.

For the multiple web sites you now want to host, you need to create a name alias for each, which says that if your site name is accessed, the request should actually be sent to the above host on that IP address. This is done in DNS using a CNAME record.

www.example.com. CNAME host-1234.webhostingservice.com.
blog.example.com. CNAME host-1234.webhostingservice.com.
wiki.example.com. CNAME host-1234.webhostingservice.com.

So it doesn’t now matter whether the site ‘www.example.com’, ‘blog.example.com’ or ‘wiki.example.com’ is accessed, the requests are all sent to the host with name ‘host-1234.webhostingservice.com’, listening on IP address ‘1.2.3.4’.

As we are going to access all these sites using HTTP, the requests will all arrive at port 80 on that host, which is the default port on which the web server listens for HTTP requests.

In order that the web server can distinguish between requests for the three different sites, the HTTP request needs to include the ‘Host’ header. Thus, if the web request was destined for ‘blog.example.com’, the HTTP request headers would include:

Host: blog.example.com
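To make the role of the header concrete, here is a minimal sketch in Python of the text of a request a client might send. The function name ‘build_request’ is my own invention for illustration, and the request is deliberately stripped down to the bare essentials.

```python
def build_request(host, path="/"):
    """Return the text of a minimal HTTP/1.1 GET request for the given site."""
    lines = [
        "GET {} HTTP/1.1".format(path),
        # The only thing which differs between requests for the three
        # sites is this header, as they all arrive at the same IP address.
        "Host: {}".format(host),
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines)


request = build_request("blog.example.com")
```

The same function called with ‘www.example.com’ or ‘wiki.example.com’ would produce a request sent to the very same IP address and port, differing only in the ‘Host’ line.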

Apache virtual hosts

The next part of the puzzle is how Apache deals with these requests and knows which of the multiple sites it needs to be routed to.

This is managed in Apache by setting up multiple ‘VirtualHost’ definitions. Within each of the ‘VirtualHost’ definitions would be placed the configuration specific to that site. In the case of using mod_wsgi, you would then end up with something like:

<VirtualHost _default_:80>
Require all denied
</VirtualHost>

# www.example.com
<VirtualHost *:80>
ServerName www.example.com
WSGIDaemonProcess www.example.com
WSGIScriptAlias / /some/path/www.example.com/site.wsgi \
process-group=www.example.com application-group=%{GLOBAL}
<Directory /some/path/www.example.com>
<Files site.wsgi>
Require all granted
</Files>
</Directory>
</VirtualHost>

# blog.example.com
<VirtualHost *:80>
ServerName blog.example.com
WSGIDaemonProcess blog.example.com
WSGIScriptAlias / /some/path/blog.example.com/site.wsgi \
process-group=blog.example.com application-group=%{GLOBAL}
<Directory /some/path/blog.example.com>
<Files site.wsgi>
Require all granted
</Files>
</Directory>
</VirtualHost>

# wiki.example.com
<VirtualHost *:80>
ServerName wiki.example.com
WSGIDaemonProcess wiki.example.com
WSGIScriptAlias / /some/path/wiki.example.com/site.wsgi \
process-group=wiki.example.com application-group=%{GLOBAL}
<Directory /some/path/wiki.example.com>
<Files site.wsgi>
Require all granted
</Files>
</Directory>
</VirtualHost>

That is, we have three ‘VirtualHost’ definitions, one for each site, with all of them set up to listen on port 80, the default HTTP port.

What differs in each ‘VirtualHost’ is the value of the ‘ServerName’ directive. It is this directive which specifies what the site host name is for that specific ‘VirtualHost’.

When a request is now received on port 80, the ‘Host’ header is inspected, and the host name listed there is matched against each ‘VirtualHost’ based on the value of the ‘ServerName’ directive. If it matches then the request is routed to that specific WSGI application running under mod_wsgi.

If the host name listed by the ‘Host’ header doesn’t match the value of ‘ServerName’ in any ‘VirtualHost’ then Apache will fall back to sending the request to whatever was the first ‘VirtualHost’ defined in the configuration files.

As it is likely undesirable to have requests for an arbitrary host name we don’t know about being sent to ‘www.example.com’, a default ‘VirtualHost’ was instead created as the first one. Any request which doesn’t match on the host name will now go to it, and since ‘Require all denied’ was all that was contained in that ‘VirtualHost’, it will result in Apache sending back a ‘403 Forbidden’ response.

The use of such a default ‘VirtualHost’ as a fallback ensures that we get a hard error in the case of where we might mess up the configuration, rather than the request being unexpectedly handled by a different site.
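The matching and fallback behaviour described above can be sketched in a few lines of Python. This is only a simplified model of what I have described here, not how Apache is implemented. It ignores things such as ‘ServerAlias’, matching on IP address and IPv6 literals in the ‘Host’ header, and the dictionaries are hypothetical stand-ins for the ‘VirtualHost’ definitions.

```python
def select_virtual_host(host_header, virtual_hosts):
    """Match on the server name, else fall back to the first virtual host."""
    # The Host header may carry a port ("blog.example.com:80"), so drop it.
    hostname = host_header.split(":", 1)[0].lower()
    for vhost in virtual_hosts:
        if vhost["server_name"] == hostname:
            return vhost
    # Nothing matched, so the first definition in the configuration wins.
    return virtual_hosts[0]


# Stand-ins for the four VirtualHost definitions, in configuration order.
VIRTUAL_HOSTS = [
    {"server_name": None, "handler": "403 Forbidden"},
    {"server_name": "www.example.com", "handler": "www WSGI application"},
    {"server_name": "blog.example.com", "handler": "blog WSGI application"},
    {"server_name": "wiki.example.com", "handler": "wiki WSGI application"},
]

matched = select_virtual_host("blog.example.com", VIRTUAL_HOSTS)
unknown = select_virtual_host("other.example.org", VIRTUAL_HOSTS)
```

Were the default entry removed from the front of the list, the request for the unknown host name would instead land on ‘www.example.com’, which is exactly the situation the default ‘VirtualHost’ is there to prevent.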

Moving sites to Docker

At this point we have our three sites implemented as separate WSGI applications running in separate daemon process groups using mod_wsgi. The WSGI applications for each site are already separated at this point and shouldn’t interfere with each other. All the same, separating each to run in its own Docker container does bring various benefits including better levels of isolation from the other sites and the host operating system, plus the ready ability to run up the sites separately during development and testing, using the same setup as would be used in production.

For a production deployment under Docker, Apache with mod_wsgi is still going to be as good as or better than the alternatives, and mod_wsgi-express makes deploying a WSGI application in a Docker container even easier than the traditional way of deploying WSGI applications with Apache. What you definitely shouldn’t do is switch to using any builtin development server provided by a web framework. Even with Docker, such development servers are still not suitable for production use, even though they are often what is used in documentation or blog posts related to running Python with Docker.

For Apache/mod_wsgi, the Docker image you should use as a base is ‘grahamdumpleton/mod-wsgi-docker’.

There are multiple versions of the image for Python 2.7, 3.3 and 3.4. There is also a base image if you need to take full control over how a derived image is built, otherwise the ‘onbuild’ images supplied provide a convenient way of deploying a WSGI application in a container. Even if using the ‘onbuild’ image, hook scripts can still be supplied as part of your application code to perform special build actions or pre deploy steps within the container when started.

For our above sites, the ‘Dockerfile’ to build our Docker image would be as simple as:

FROM grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild
CMD [ "site.wsgi" ]

If the WSGI application had a list of Python modules that needed to be available, they would be listed in the ‘requirements.txt’ file in the root of the application directory alongside the ‘Dockerfile’. The image can then be built by running:

docker build -t blog.example.com .

We can then run the image manually using:

docker run --rm -p 8002:80 blog.example.com

When using the ‘mod_wsgi-docker’ image, within the Docker container the Apache instance will listen on port 80. As we are going to be running a container for each site, they can’t all be exported as port 80 on the Docker host. As a result we need to map the internal port 80 to a different external port in each case. Thus we map ‘www.example.com’ to port 8001, ‘blog.example.com’ to port 8002 and ‘wiki.example.com’ to port 8003.
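As a rough sketch, the port mapping for all three sites could be captured in one place and the ‘docker run’ command for each derived from it. The names ‘SITE_PORTS’ and ‘docker_run_command’ are made up for this illustration, and I am assuming the images for the other two sites were tagged with the site name in the same way as was done for ‘blog.example.com’.

```python
# External port on the Docker host for each site. Every container
# listens on port 80 internally, so only the external port differs.
SITE_PORTS = {
    "www.example.com": 8001,
    "blog.example.com": 8002,
    "wiki.example.com": 8003,
}


def docker_run_command(site, ports=SITE_PORTS):
    """Return the argument list for running the container for a site."""
    return ["docker", "run", "--rm", "-p", "{}:80".format(ports[site]), site]


blog_command = docker_run_command("blog.example.com")
```

The resulting list could be passed to something like ‘subprocess.call()’, or joined with spaces to get the same command line as shown above.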

With the sites now running inside of a Docker container on our Docker host, we need to change the front end Apache configuration to proxy requests for the site through to the appropriate Docker container, rather than running the WSGI application for the site on the front end Apache instance.

Previously the Apache configuration was:

# blog.example.com

<VirtualHost *:80>
ServerName blog.example.com
WSGIDaemonProcess blog.example.com
WSGIScriptAlias / /some/path/blog.example.com/site.wsgi \
process-group=blog.example.com application-group=%{GLOBAL}
<Directory /some/path/blog.example.com>
<Files site.wsgi>
Require all granted
</Files>
</Directory>
</VirtualHost>

This now needs to be changed to:

# blog.example.com
<VirtualHost *:80>
ServerName blog.example.com
ProxyPass / http://docker.example.com:8002/
</VirtualHost>

We have therefore removed the mod_wsgi related configuration from the ‘VirtualHost’ and replaced it with a configuration which instead proxies any request for that specific ‘VirtualHost’, through to the port exported for that site on the Docker host.
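Since the same change has to be made for each of the three sites, with only the host name and port differing, the pattern can be sketched as a small Python template. The helper name ‘proxy_virtual_host’ is hypothetical, and the backend host name defaults to the ‘docker.example.com’ used in this post.

```python
PROXY_TEMPLATE = """# {site}
<VirtualHost *:80>
ServerName {site}
ProxyPass / http://{backend}:{port}/
</VirtualHost>
"""


def proxy_virtual_host(site, port, backend="docker.example.com"):
    """Return the proxying VirtualHost definition for one site."""
    return PROXY_TEMPLATE.format(site=site, backend=backend, port=port)


config = "\n".join(
    proxy_virtual_host(site, port)
    for site, port in [
        ("www.example.com", 8001),
        ("blog.example.com", 8002),
        ("wiki.example.com", 8003),
    ]
)
```

The default ‘VirtualHost’ which denies access would still need to come first in the configuration, ahead of the generated definitions.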

URL reconstruction

The proxy changes above are enough to have HTTP requests originally intended for ‘blog.example.com’ proxied through to the new home for the site running under the Docker container. The use of a proxy in this way will however cause a few problems. This is because the WSGI application will now think it is running at ‘http://docker.example.com:8002’. This is reflected in the values passed through to the WSGI application for each request in the WSGI environ dictionary.

HTTP_HOST: 'docker.example.com:8002'
PATH_INFO: '/'
QUERY_STRING: ''
SERVER_NAME: 'docker.example.com'
SERVER_PORT: '8002'
SCRIPT_NAME: ''
wsgi.url_scheme: 'http'

We can see the effects of the problem by creating a test WSGI application which implements the algorithm for URL reconstruction as outlined in the WSGI specification (PEP 3333).

try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2

def reconstruct_url(environ):
    # URL reconstruction as outlined in the WSGI specification (PEP 3333).
    url = environ['wsgi.url_scheme'] + '://'

    if environ.get('HTTP_HOST'):
        # The Host header already includes any non default port.
        url += environ['HTTP_HOST']
    else:
        url += environ['SERVER_NAME']

        # Only add the port if it isn't the default for the scheme.
        if environ['wsgi.url_scheme'] == 'https':
            if environ['SERVER_PORT'] != '443':
                url += ':' + environ['SERVER_PORT']
        else:
            if environ['SERVER_PORT'] != '80':
                url += ':' + environ['SERVER_PORT']

    url += quote(environ.get('SCRIPT_NAME', ''))
    url += quote(environ.get('PATH_INFO', ''))

    if environ.get('QUERY_STRING'):
        url += '?' + environ['QUERY_STRING']

    return url

def application(environ, start_response):
    status = '200 OK'
    # The response body must be a byte string.
    output = reconstruct_url(environ).encode('UTF-8')

    response_headers = [('Content-type', 'text/plain'),
                        ('Content-Length', str(len(output)))]
    start_response(status, response_headers)

    return [output]

The result from this will be:

http://docker.example.com:8002/

What we want to see here though is the public facing URL and not the URL used internally by the proxy. Thus we want:

http://blog.example.com/

The consequence of this is that when a WSGI application constructs a URL for use in links within an HTML page response, or in any HTTP response header such as the ‘Location’ header, the URL returned will be wrong.

If the internal site is not publicly accessible then the user will not be able to reach the site when the URL is followed.

If the internal site is actually publicly accessible via the URL, then it will work, but then they will be accessing it using what should notionally be a private URL and not the public URL. Such leakage of an internal URL is undesirable.

The solution to this problem is in what additional HTTP request headers the front end Apache will send along with the request when the ‘ProxyPass’ directive is used. These headers appear in the WSGI environ dictionary passed with each request as:

HTTP_X_FORWARDED_FOR: '127.0.0.1'
HTTP_X_FORWARDED_HOST: 'blog.example.com'
HTTP_X_FORWARDED_SERVER: 'blog.example.com'

Some Python web frameworks provide inbuilt support or extensions which can be used to take such request headers and use them to override the values set for key values in the WSGI environ dictionary.

For the above code to reconstruct the URL to work properly, the two key values we need to override are ‘HTTP_HOST’ and ‘SERVER_PORT’. These need to be overridden with what the values would have been as if seen by the front end Apache server and not the backend.

For ‘HTTP_HOST’, we can get what would have been the value as seen by the front end by taking the value of ‘HTTP_X_FORWARDED_HOST’. This reflects the value of the HTTP request header ‘X-Forwarded-Host’ as set by the ‘ProxyPass’ directive when the request was being proxied.

What the use of the ‘ProxyPass’ directive doesn’t provide us with though is what the front end port was that the original request was accepted on. We have two ways though that we could use to ensure this information is passed through. In both cases it involves setting the ‘X-Forwarded-Port’ request header when proxying the request. This header name is the accepted convention for passing across the port that the front end web server accepted the request on.

The first way is to use the Apache ‘mod_headers’ module to set the request header and hard wire it to be port 80.

# blog.example.com
<VirtualHost *:80>
ServerName blog.example.com
ProxyPass / http://docker.example.com:8002/

RequestHeader set X-Forwarded-Port 80
</VirtualHost>

The second also uses ‘mod_headers’, but with the port number being calculated dynamically using ‘mod_rewrite’.

# blog.example.com
<VirtualHost *:80>
ServerName blog.example.com
ProxyPass / http://docker.example.com:8002/

RewriteEngine On
RewriteRule .* - [E=SERVER_PORT:%{SERVER_PORT},NE]

RequestHeader set X-Forwarded-Port %{SERVER_PORT}e
</VirtualHost>

In either case, the request header is passed in the WSGI environ dictionary of the back end WSGI application as ‘HTTP_X_FORWARDED_PORT’ and, similar to how ‘HTTP_X_FORWARDED_HOST’ was used to override ‘HTTP_HOST’, can be used to override ‘SERVER_PORT’.

Having overridden ‘HTTP_HOST’ and ‘SERVER_PORT’, our reconstructed URL would now be correct.
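For a framework which has no support for this, the override could be done with a small WSGI middleware along the following lines. This is only a minimal sketch. The class name ‘ProxyHeadersMiddleware’ is my own, it assumes a single proxy which sets a single value for each header, and it blindly trusts whatever headers arrive, which is a problem I come back to in the next section.

```python
class ProxyHeadersMiddleware(object):
    """Override HTTP_HOST and SERVER_PORT from the X-Forwarded-* headers."""

    def __init__(self, application):
        self.application = application

    def __call__(self, environ, start_response):
        # Host name the front end saw, as set by ProxyPass.
        forwarded_host = environ.get("HTTP_X_FORWARDED_HOST")
        if forwarded_host:
            environ["HTTP_HOST"] = forwarded_host

        # Port the front end accepted the request on, as set using
        # RequestHeader in the front end configuration.
        forwarded_port = environ.get("HTTP_X_FORWARDED_PORT")
        if forwarded_port:
            environ["SERVER_PORT"] = forwarded_port

        return self.application(environ, start_response)
```

It would be used by wrapping the WSGI application entry point in the WSGI script file, as in ‘application = ProxyHeadersMiddleware(application)’.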

Who do you really trust

As mentioned, some Python web frameworks do provide features to allow the WSGI environ values to be overridden based on these special headers. Unfortunately not all Python web frameworks do, and even though it is possible to find separate WSGI middleware which may do it, they may not support overriding the port as well as the host. Some don’t even touch the host and port at all, and are actually only concerned with an entirely separate problem of overriding ‘REMOTE_ADDR’ in the WSGI environ dictionary based on information passed in the ‘X-Forwarded-For’ request header by a proxy.

Because of the fact that Python web frameworks may not provide a feature to handle such special request headers and because adding in a WSGI middleware can be fiddly, more recent versions of mod_wsgi have inbuilt support for handling such headers.

In this case where we are using the ‘mod_wsgi-docker’ image, we would change:

FROM grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild
CMD [ "site.wsgi" ]

to:

FROM grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild
CMD [ "--trust-proxy-header", "X-Forwarded-Host", \
"--trust-proxy-header", "X-Forwarded-Port", \
"--trust-proxy-header", "X-Forwarded-For", \
"site.wsgi" ]

Built in to mod_wsgi is knowledge of what the different headers are used for and when they are marked as being trusted, the appropriate override will be done. In this case, as a result of the headers which were listed as trusted, we update ‘HTTP_HOST’, ‘SERVER_PORT’ and ‘REMOTE_ADDR’.

Something unique to mod_wsgi, which other Python based solutions don’t provide, is that it understands that for some of the values to be overridden, there are multiple conventions as to which request headers are used.

For example, for ‘REMOTE_ADDR’, request headers which might be used are ‘X-Forwarded-For’, ‘X-Client-IP’ and ‘X-Real-IP’. In the face of there being multiple request headers that can represent the same thing, it is important that only the one marked as trusted be used and that the other request headers that can mean the same thing be wiped out and not passed along with the request.

This is to avoid the sort of situation that can arise where the Apache front end is only setting the ‘X-Forwarded-For’ request header. If a malicious client itself set the ‘X-Real-IP’ header and some WSGI middleware was also being used which took the value of ‘HTTP_X_REAL_IP’ and used it to override ‘REMOTE_ADDR’, then you have a potential security issue through a client dictating what IP address the web application thinks it is coming from.

To avoid such problems, when mod_wsgi is told to only trust the ‘X-Forwarded-For’ header, it will ensure that the other headers of ‘X-Client-IP’ and ‘X-Real-IP’ are cleared and not passed through to then be incorrectly interpreted by a WSGI middleware.
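The idea can be illustrated with a few lines of Python. This is not the mod_wsgi implementation, which is done inside the Apache module itself, but a sketch of the principle. The function name ‘keep_only_trusted’ is invented for the example.

```python
# WSGI environ keys for the request headers which can all be used to
# convey the client IP address.
CLIENT_IP_HEADERS = (
    "HTTP_X_FORWARDED_FOR",
    "HTTP_X_CLIENT_IP",
    "HTTP_X_REAL_IP",
)


def keep_only_trusted(environ, trusted_key, equivalent_keys=CLIENT_IP_HEADERS):
    """Return a copy of environ with the untrusted equivalents removed."""
    cleaned = dict(environ)
    for key in equivalent_keys:
        if key != trusted_key:
            # A client could have set this itself, so don't pass it on.
            cleaned.pop(key, None)
    return cleaned


environ = {
    "HTTP_X_FORWARDED_FOR": "203.0.113.7",  # set by the front end Apache
    "HTTP_X_REAL_IP": "10.0.0.1",           # set by a malicious client
}
cleaned = keep_only_trusted(environ, "HTTP_X_FORWARDED_FOR")
```

Any WSGI middleware further along which looks at ‘HTTP_X_REAL_IP’ will then find nothing there to act upon.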

This avoids one scenario that can arise as to whether we trust such headers, but we also have the problem of whether the request even came from the proxy which we expected it to, and whose headers we trust.

If a client was able to bypass the proxy and send a request direct to the backend, then it could set the headers itself to anything it wanted and the backend wouldn’t know.

If this scenario is a concern, then mod_wsgi supports an additional feature where the IP address of the trusted proxy itself can be specified. Such so called trusted headers will then only be honoured if coming from that proxy. If they aren’t from that proxy, then all the equivalent headers for the value we are trying to override will be cleared from the request and ignored.

Imagining therefore that the IP address of the host where the front end Apache was running was ‘192.168.59.3’, we would use the ‘--trust-proxy’ option to ‘mod_wsgi-express’ as:

FROM grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild
CMD [ "--trust-proxy-header", "X-Forwarded-Host", \
"--trust-proxy-header", "X-Forwarded-Port", \
"--trust-proxy-header", "X-Forwarded-For", \
 "--trust-proxy", "192.168.59.3", "site.wsgi" ]

If multiple proxies were being used, then the ‘--trust-proxy’ option can be supplied more than once. Alternatively, an IP subnet description can be used.

FROM grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild
CMD [ "--trust-proxy-header", "X-Forwarded-Host", \
"--trust-proxy-header", "X-Forwarded-Port", \
"--trust-proxy-header", "X-Forwarded-For", \
 "--trust-proxy", "192.168.59.0/24", "site.wsgi" ]

In addition to being able to list an immediate proxy, you can supply IP addresses of other proxies if the request needs to traverse through more than one. These would come into play with determining how far back in the proxy chain the value of the ‘X-Forwarded-For’ header can be trusted in determining the true client IP address. In other words, the client IP address will be interpreted as being that immediately before the forward most trusted host.
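To illustrate how the chain is walked, here is a rough Python sketch of the logic. It is a model of the behaviour described, not the code from mod_wsgi. The function name ‘client_address’ is hypothetical, it relies on the ‘ipaddress’ module from the Python 3 standard library, and it assumes that every entry in the header is a well formed IP address.

```python
import ipaddress


def client_address(remote_addr, forwarded_for, trusted_proxies):
    """Work out the client IP given the trusted proxies and subnets."""
    networks = [ipaddress.ip_network(proxy) for proxy in trusted_proxies]

    def is_trusted(address):
        ip = ipaddress.ip_address(address)
        return any(ip in network for network in networks)

    # If the request didn't come from a trusted proxy, the header could
    # have been set to anything by the client, so ignore it.
    if not is_trusted(remote_addr):
        return remote_addr

    hops = [hop.strip() for hop in forwarded_for.split(",") if hop.strip()]

    # Walk back from the nearest hop until one is not a trusted proxy.
    for hop in reversed(hops):
        if not is_trusted(hop):
            return hop

    # Every hop was a trusted proxy, so use the one furthest away.
    return hops[0] if hops else remote_addr


direct = client_address("192.168.59.3", "203.0.113.7", ["192.168.59.3"])
chained = client_address(
    "192.168.59.3",
    "1.1.1.1, 203.0.113.7, 192.168.59.4",
    ["192.168.59.0/24"],
)
```

In the second case the leading ‘1.1.1.1’ is ignored, as it appears before the first address which is not a trusted proxy and so could have been supplied by the client itself.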

Although I have shown here how this trust mechanism is setup when using mod_wsgi-express, the same can be done if configuring mod_wsgi in the Apache configuration itself. In that case the configuration directives are ‘WSGITrustedProxyHeaders’ and ‘WSGITrustedProxies’.

Secure connections

The changes described above cover the situation of a normal HTTP connection. What about where the original front end Apache was also accepting secure connections?

The original Apache configuration in this case for each site would have been something like:

# blog.example.com
<VirtualHost *:80>
ServerName blog.example.com

WSGIDaemonProcess blog.example.com

WSGIScriptAlias / /some/path/blog.example.com/site.wsgi \
    process-group=blog.example.com application-group=%{GLOBAL}

<Directory /some/path/blog.example.com>
<Files site.wsgi>
Require all granted
</Files>
</Directory>
</VirtualHost>

<VirtualHost *:443>
ServerName blog.example.com

SSLEngine On
SSLCertificateFile /some/path/blog.example.com/server.crt
SSLCertificateKeyFile /some/path/blog.example.com/server.key

WSGIScriptAlias / /some/path/blog.example.com/site.wsgi \
   process-group=blog.example.com application-group=%{GLOBAL}
<Directory /some/path/blog.example.com>
<Files site.wsgi>
Require all granted
</Files>
</Directory>
</VirtualHost>

In separating out the site into a separate Docker container, as the front end Apache is still the initial termination point, it will still be necessary for it to also handle the secure connections. So long as connections through to the site running under Docker are in a secure non public network, then we can stick with just using a normal HTTP connection between the front end and the site running under Docker.

The updated Apache configuration when it is changed to proxy the site through to the Docker instance will be:

# blog.example.com
<VirtualHost *:80>
ServerName blog.example.com
ProxyPass / http://docker.example.com:8002/

RequestHeader set X-Forwarded-Port 80
</VirtualHost>

<VirtualHost *:443>
ServerName blog.example.com

SSLEngine On
SSLCertificateFile /some/path/blog.example.com/server.crt
SSLCertificateKeyFile /some/path/blog.example.com/server.key

ProxyPass / http://docker.example.com:8002/

RequestHeader set X-Forwarded-Port 443
</VirtualHost>

Checking access to our site over the secure connection with our test script which reconstructs the URL, we find that we are still not getting the desired result. Specifically, the reconstructed URL is using ‘http’ as the protocol scheme and not ‘https’. This would mean any URLs generated in HTML responses or in a response header such as ‘Location’ would cause users to start using an insecure connection.

The problem in this case is that the ‘wsgi.url_scheme’ value in the WSGI environ dictionary is being passed as ‘http’. This is due to the web server running in the Docker instance only accepting HTTP connections. We somehow now need to pass across from the front end that the initial connection to the front end Apache instance was in fact a secure connection.

For passing information about whether a secure connection was used, there is no one HTTP header which is universally accepted as a de-facto standard. There are actually about five different headers which different software packages have supported. In our example here we will use the ‘X-Forwarded-Scheme’ header, setting it to be ‘https’ when that protocol scheme is being used.

# blog.example.com
<VirtualHost *:80>
ServerName blog.example.com
ProxyPass / http://docker.example.com:8002/

RequestHeader set X-Forwarded-Port 80
</VirtualHost>

<VirtualHost *:443>
ServerName blog.example.com

SSLEngine On
SSLCertificateFile /some/path/blog.example.com/server.crt
SSLCertificateKeyFile /some/path/blog.example.com/server.key

ProxyPass / http://docker.example.com:8002/

RequestHeader set X-Forwarded-Port 443
RequestHeader set X-Forwarded-Scheme https
</VirtualHost>

Determining the value for this header could also have been done using mod_rewrite, but we will stick with hard wiring the value. We only need to do this for the ‘VirtualHost’ corresponding to the secure connection, i.e., port 443, as the lack of the header will be taken as meaning a HTTP connection was used.

In the Dockerfile where the options to mod_wsgi-express are being supplied, we now add this header as an additional trusted proxy header and mod_wsgi will ensure that the correct thing is done.

FROM grahamdumpleton/mod-wsgi-docker:python-2.7-onbuild
CMD [ "--trust-proxy-header", "X-Forwarded-Host", \
"--trust-proxy-header", "X-Forwarded-Port", \
"--trust-proxy-header", "X-Forwarded-For", \
"--trust-proxy-header", "X-Forwarded-Scheme", \
"site.wsgi" ]

The end result will be that where a secure connection was used at the front end Apache, then mod_wsgi running in the Docker instance will override ‘wsgi.url_scheme’ to be ‘https’ and URL reconstruction will then be correct.
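In terms of the WSGI environ dictionary, the effect of the override is equivalent to something like the following Python sketch. The function name ‘apply_forwarded_scheme’ is made up for illustration, and it assumes the header has already been established as coming from a trusted proxy.

```python
def apply_forwarded_scheme(environ):
    """Override wsgi.url_scheme from the X-Forwarded-Scheme header."""
    scheme = environ.get("HTTP_X_FORWARDED_SCHEME", "").lower()
    # Only accept the two values which make sense for a URL scheme. When
    # the header is absent the value is left as set by the backend.
    if scheme in ("http", "https"):
        environ["wsgi.url_scheme"] = scheme
    return environ


secure = apply_forwarded_scheme({
    "wsgi.url_scheme": "http",
    "HTTP_X_FORWARDED_SCHEME": "https",
})
plain = apply_forwarded_scheme({"wsgi.url_scheme": "http"})
```

For a request proxied from the port 443 ‘VirtualHost’ the scheme becomes ‘https’, whereas for one proxied from port 80, where the header is not set, it stays as ‘http’.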

Non WSGI handlers

The above configurations and the fixes to the WSGI environ dictionary done by mod_wsgi, running in the Docker instance, will ensure that the construction of any URLs when being done within a WSGI application will work correctly.

If also hosting static files using the Apache instance running in the Docker instance, or running a dynamic web application using some other Apache module, the fixes will not apply to those scenarios.

Given that we are running a Python web application using mod_wsgi, the only concern for us may be whether the lack of the fixes when serving up static files using the same Apache running in the Docker instance could be an issue. You might be thinking that it wouldn’t, since what is being served up is static and so not affected by what URL it was accessed as.

Unfortunately there is one scenario where it can be an issue even when serving up static files. I will describe that issue in a future post, where it can be a problem, and what can be done about it when there is.

3 comments:

Graham Dumpleton said...

Followup post that covers the further issues with redirections when hosting static files can be found at:

http://blog.dscpl.com.au/2015/07/redirection-problems-when-proxying-to.html

Jonathan Barratt said...

Minor typo, but it made me do a double-take nonetheless: "we need to map the internal port 80 to a different external port in each case. Thus we map ... ‘blog.example.com’ to port 8002 and ‘wiki.example.com’ to port 8002"

Clearly the intent was for one of those 8002's to be 8003. Thanks for mod-wsgi!

Graham Dumpleton said...

Fixed. Thanks.