Tuesday, December 29, 2015

Issues with running as PID 1 in a Docker container.

We are getting close to the end of this initial series of posts on getting IPython to work with Docker and OpenShift. In the last post we finally got everything working in plain Docker when a random user ID was used and consequently also under OpenShift.

Although we covered various issues and had to make changes to the existing ‘Dockerfile' used with the ‘jupyter/notebook' image to get it all working correctly, there was one issue the Docker image for ‘jupyter/notebook' had already addressed which needs a bit of explanation. This relates to the existing ‘ENTRYPOINT' statement used in the ‘Dockerfile' for ‘jupyter/notebook'.

ENTRYPOINT ["tini", "--"]
CMD ["jupyter", "notebook"]

Specifically, the ‘Dockerfile’ was wrapping the running of the ‘jupyter notebook’ command with the ‘tini’ command.

Orphaned child processes

For a broader discussion on the problem that the use of ‘tini’ is trying to solve you can read the post ‘Docker and the PID 1 zombie reaping problem’.

In short though, process ID 1, which is normally the UNIX ‘init' process, has a special role in the operating system. When the parent of a process exits before its child processes do, those child processes become orphans, and the orphaned child processes have their parent process remapped to be process ID 1. When those orphaned processes finally exit and their exit status becomes available, it is the job of the process with process ID 1 to acknowledge the exit of the child processes, so that their process state can be correctly cleaned up and removed from the system kernel process table.

If this cleanup of orphaned processes does not occur, then the system kernel process table will over time fill up with entries corresponding to the orphaned processes which have exited. Any processes which persist in the system kernel process table in this way are what are called zombie processes. They will remain there so long as no process performs the equivalent of a system ‘waitpid()’ call on that specific process to retrieve its exit status and so acknowledge that the process has terminated.
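To make this concrete, here is a minimal sketch in Python of how a zombie process comes about and how it is reaped. If you run it and look at ‘ps' during the sleep, the child will show with a status of ‘Z'.

import os
import time

pid = os.fork()

if pid == 0:
    # Child exits immediately.
    os._exit(0)

# Until 'waitpid()' is called, the child remains in the kernel process
# table as a zombie; 'ps' will show it in state 'Z' during this sleep.
time.sleep(30)

# Acknowledge the exit, allowing the kernel to remove the table entry.
os.waitpid(pid, 0)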

Process ID 1 under Docker

Now you may be thinking, what does this have to do with Docker? After all, aren’t processes running in a Docker container just ordinary processes in the operating system, simply walled off from the rest of the operating system?

This is true. If you were to run a Docker container which executed a simple single process Python web server, and then looked at the process tree on the Docker host using ‘top', you would see:

[Image: ‘top' on the Docker host, wsgiref server idle]

Process ID ‘26196’ here actually corresponds to the process created from the command that we used as the ‘CMD’ in the ‘Dockerfile’ for the Docker image.

Our process isn’t therefore running as process ID 1, so why is the way that orphaned processes are handled even an issue?

The reason is that if we were to instead look at what processes are running inside of our container, we can only see those which are actually started within the context of the container.

Further, rather than those processes using the same process ID as they are really running as when viewed from outside of the container, the process IDs have been remapped. In particular, processes created inside of the container, when viewed from within the container, have process IDs starting at 1.

[Image: ‘top' inside the Docker container, wsgiref server idle]

Thus the very first process created from the execution of what is given by ‘CMD' will be identified as having process ID 1. This is though still the same process identified by process ID ‘26196' when viewed from the Docker host.

More importantly, what you cannot see from inside of the container is the process which has process ID ‘1' outside of the container. That is, you cannot see the system wide ‘init' process.

Logically it isn’t therefore possible to reparent an orphaned process created within the container to a process not even visible inside of the container. As such, orphaned processes are reparented to the process with process ID of ‘1’ within the container. The obligation of reaping the resulting zombie processes therefore falls to this process and not the system wide ‘init’ process.
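You can observe the reparenting for yourself with a small Python script run as the container command. This is a sketch only; run inside a Docker container, the orphan should report a parent process ID of ‘1'.

import os
import time

pid = os.fork()

if pid == 0:
    # Intermediate child: fork a grandchild and exit straight away,
    # leaving the grandchild an orphan.
    if os.fork() == 0:
        time.sleep(1)
        # The intermediate parent has exited by now, so inside a
        # Docker container this should report a parent of 1.
        print('orphan: pid=%d ppid=%d' % (os.getpid(), os.getppid()))
        os._exit(0)
    os._exit(0)

os.waitpid(pid, 0)  # reap the intermediate child
time.sleep(2)       # give the orphan time to be reparented and print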

Testing for process reaping

In order to delve more into this issue, and in particular its relevance when running a Python web server, as a next step let’s create a simple Python WSGI application which can be used to trigger orphan processes. Initially we will use the WSGI server implemented by the ‘wsgiref' module in the Python standard library, but we can also run it up with other WSGI servers to see how they behave as well.

from __future__ import print_function

import os

def orphan():
    print('orphan: %d' % os.getpid())
    os._exit(0)

def child():
    print('child: %d' % os.getpid())
    newpid = os.fork()
    if newpid == 0:
        orphan()
    else:
        pids = (os.getpid(), newpid)
        print("child: %d, orphan: %d" % pids)
    os._exit(0)

def parent():
    newpid = os.fork()
    if newpid == 0:
        child()
    else:
        pids = (os.getpid(), newpid)
        print("parent: %d, child: %d" % pids)
        os.waitpid(newpid, 0)

def application(environ, start_response):
    status = '200 OK'
    output = b'Hello World!'
    response_headers = [('Content-type', 'text/plain'),
                        ('Content-Length', str(len(output)))]
    start_response(status, response_headers)
    parent()
    return [output]

from wsgiref.simple_server import make_server

httpd = make_server('', 8000, application)
httpd.serve_forever()

The way the test runs is that each time a web request is received, the web application process will fork twice. The web application process itself will be made to wait on the exit of the child process it created. That child process though will not wait on the further child process it had created, thus creating an orphaned process as a result.
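For reference, a minimal ‘Dockerfile' for building this test application might look something like the following. The file name ‘server_wsgiref.py' matches what is used later in this post; the base image is an assumption for illustration only.

FROM python:2.7
WORKDIR /app
COPY server_wsgiref.py /app/
EXPOSE 8000
CMD [ "python", "server_wsgiref.py" ]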

Building this test application into a Docker image, with no ‘ENTRYPOINT’ defined and only a ‘CMD’ which runs the Python test file application, when we hit it with half a dozen requests, what we then see from inside of the Docker container is:

[Image: ‘top' inside the Docker container, wsgiref server with accumulated zombie child processes]

For a WSGI server implemented using the ‘wsgiref’ module from the Python standard library, this indicates that no reaping of the zombie process is occurring. Specifically, you can see how our web application process running as process ID ‘1’ now has various child processes associated with it where the status of each process is ‘Z’ indicating it is a zombie process waiting to be reaped. Even if we wait some time, these zombie processes never go away.

If we look at the processes from the Docker host we see the same thing.

[Image: ‘top' on the Docker host, wsgiref server with accumulated zombie child processes]

This therefore confirms what was described, which is that the orphaned processes will be reparented against what is process ID ‘1’ within the container, rather than what is process ID ‘1’ outside of the container.

One thing that is hopefully obvious is that a WSGI server based off the ‘wsgiref’ module sample server in the Python standard library doesn’t do the right thing, and running it as the initial process in a Docker container would not be recommended.

Behaviour of WSGI servers

If a WSGI server based on the ‘wsgiref' module sample server isn’t okay, what about other WSGI servers? Also, what about ASYNC web servers for Python such as Tornado?

The outcome from running the test WSGI application on the most commonly used WSGI servers, and also equivalent tests specifically for the Tornado ASYNC web server, Django and Flask builtin servers, yields the following results.

  • django (runserver) - FAIL
  • flask (builtin) - FAIL
  • gunicorn - PASS
  • Apache/mod_wsgi - PASS
  • tornado (async) - FAIL
  • tornado (wsgi) - FAIL
  • uWSGI - FAIL
  • uWSGI (master) - PASS
  • waitress - FAIL
  • wsgiref - FAIL

The general result here is that any Python web server that runs as a single process would usually not do what is required of a process running as process ID ‘1’. This is because they aren’t in any way designed to manage child processes. As a result, there isn’t even the chance that they may look for exiting zombie processes and reap them.
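For what it is worth, the amount of code needed to reap reparented orphans is small. A sketch of a ‘SIGCHLD' based approach a single process server could adopt might look like this:

import errno
import os
import signal

def reap_children(signum, frame):
    # Collect the exit status of any children which have terminated,
    # so they do not linger in the kernel process table as zombies.
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except OSError as e:
            if e.errno == errno.ECHILD:
                return  # no child processes at all
            raise
        if pid == 0:
            return  # children exist, but none have exited yet

signal.signal(signal.SIGCHLD, reap_children)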

Of note though, uWSGI when used with its default options, although it can run in a multi process configuration, has a process management model which is arguably broken. The philosophy with uWSGI though is seemingly to never correct what it gets wrong, but to instead add an option which enables the correct behaviour. Thus users have to opt into the correct or better behaviour. For the case of uWSGI, the more robust process management model is only enabled by using the ‘--master' option. If using uWSGI you should always use that option, regardless of whether you are running it in Docker or not.

Both uWSGI in master mode and mod_wsgi, although they pass and will reap zombie processes when run as process ID ‘1’, work in a way that can be surprising.

The issue with uWSGI in master mode and mod_wsgi is that each only looks for exiting child processes on a periodic basis. That is, they will wake up about once a second and look for any child processes that have exited, collecting their exit status and so causing any zombie processes to be reaped.

This means that during the one second interval, some number of zombie processes still could accumulate, the number depending on request throughput and how often a specific request does something that would trigger the creation of a zombie process. The number of zombie processes will therefore build up and then be brought back to zero each second.

Although this occurs for uWSGI in master mode and mod_wsgi, it shouldn’t in general cause an issue as no other significant code runs in the parent or master process which is managing all the child processes. Thus the presence of the zombie process as a child for a period will not cause any confusion. Further, zombie processes should still be reaped at an adequate rate, so temporary increases shouldn’t matter.
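A sketch of the sort of periodic reaping loop described, as distinct from the signal driven approach shown earlier, might be:

import errno
import os
import time

def reap_children():
    # Collect the exit status of any children which have already exited.
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except OSError as e:
            if e.errno == errno.ECHILD:
                return  # no child processes at all
            raise
        if pid == 0:
            return  # children remain, but none have exited yet

while True:
    time.sleep(1.0)   # wake up about once a second
    reap_children()   # zombies can accumulate for at most this interval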

Problems which can arise

As to what problems can actually arise due to this issue, there are a few at least.

The first is that if the process running as process ID ‘1’ does not reap zombie processes, then they will accumulate over time. If the container is for a long running service, then eventually the available slots in the system kernel process table could be used up. If this were to occur, the system as a whole would be unable to create any new processes.

How this plays out in practice within a Docker container I am not sure. If the upper bound on the number of such zombie processes that could be created within a Docker container were the system kernel process table size, then technically the creation of zombie processes could be used as an attack vector against the Docker host. I sort of expect therefore that Docker containers likely have some lower limit on the number of processes that can be created within the container, although things get complicated if a specific user has multiple containers. Hopefully someone can clarify this specific point for me.

The second issue is that the reparenting of processes against the application process running as process ID ‘1’ could confuse any process management mechanism running within that process. This could cause issues in a couple of ways.

For example, if the application process were using the ‘wait()' system call to wait for any child process exiting, but the reported process ID wasn’t one it was expecting and it didn’t handle that gracefully, it could cause the application process to fail in some way. Especially in the case where the ‘wait()' call indicated that an exiting zombie process had a non-zero exit status, it may cause the application process to think its directly managed child processes were having problems and failing in some way. Alternatively, if the orphaned processes weren't themselves exiting straight away, and the now parent process operated by monitoring the set of child processes it had, then their mere presence could confuse the parent process.
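A hypothetical process manager which tolerates this would need to track the children it created itself and discard exit statuses for any other process ID. A sketch only; the recovery action here is invented for illustration.

import os

managed = set()  # process IDs of the children we started ourselves

def restart_worker(pid, status):
    # Hypothetical recovery action for a managed child which exited.
    print('worker %d exited with status %d, restarting' % (pid, status))

def handle_child_exit():
    pid, status = os.wait()
    if pid in managed:
        managed.discard(pid)
        restart_worker(pid, status)
    # Otherwise it was a reparented orphan; its exit status should be
    # discarded rather than treated as the failure of a worker.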

Finally getting back to the IPython example we have been working with, it has been found that when running the ‘jupyter notebook' application as process ID ‘1', it fails to properly start up the kernel processes used to run individual notebook instances. The logged messages in this case are:

[I 10:19:33.566 NotebookApp] Kernel started: 1ac58cd9-c717-44ef-b0bd-80a377177918
[I 10:19:36.566 NotebookApp] KernelRestarter: restarting kernel (1/5)
[I 10:19:39.573 NotebookApp] KernelRestarter: restarting kernel (2/5)
[I 10:19:42.582 NotebookApp] KernelRestarter: restarting kernel (3/5)
[W 10:19:43.578 NotebookApp] Timeout waiting for kernel_info reply from 1ac58cd9-c717-44ef-b0bd-80a377177918
[I 10:19:45.589 NotebookApp] KernelRestarter: restarting kernel (4/5)
WARNING:root:kernel 1ac58cd9-c717-44ef-b0bd-80a377177918 restarted
[W 10:19:48.596 NotebookApp] KernelRestarter: restart failed
[W 10:19:48.597 NotebookApp] Kernel 1ac58cd9-c717-44ef-b0bd-80a377177918 died, removing from map.
ERROR:root:kernel 1ac58cd9-c717-44ef-b0bd-80a377177918 restarted failed!
[W 10:19:48.610 NotebookApp] Kernel deleted before session

I have been unable to find anyone who has worked out the specific cause, but I suspect it is falling foul of the second issue above. That is, the exit statuses from those orphaned processes are confusing the code managing the startup of the kernel processes, making it think the kernel processes are in fact failing, causing it to attempt to restart them repeatedly.

Whatever the specific reason, not running the ‘jupyter notebook’ as process ID ‘1’ avoids the problem, so it does at least appear to be related to the orphaned processes being reparented against the main ‘jupyter notebook’ process.

Now although for IPython it seems to relate to the second issue, whereby process management mechanisms are failing, as shown above even generic Python WSGI servers or web servers don’t necessarily do the right thing either. So even though they might not have process management issues, since they don’t perform any such management of processes to implement a multi process configuration for the server itself, the accumulation of zombie processes could still eventually cause the maximum number of allowed processes to be exceeded.

Shell as parent process

Ultimately the solution is to never run as process ID ‘1' inside of the container any application process which was not also designed to perform reaping of child processes.

There are two ways to avoid this. The first is a quick hack and one which is often seen used in Docker containers, although perhaps not intentionally. Although it avoids the zombie reaping problem, it causes its own issues.

The second way is to run as process ID ‘1’ a minimal process whose only role is to execute as a child process the real application process and then subsequently reap the zombie processes.

This minimal init process of the second approach has one other important role as well though and it is this role where the quick hack solution fails.

As to the quick or inadvertent hack that some rely on, let’s look at how a ‘CMD' in a ‘Dockerfile' is specified.

The recommended way of using ‘CMD’ in a ‘Dockerfile’ would be to write:

CMD [ "python", "server_wsgiref.py" ]

This is what was used above, where we saw the following within the Docker container.

[Image: ‘top' inside the Docker container, application running as process ID 1]

As has already been explained, this results in our application running as process ID ‘1’.

Another way of using ‘CMD’ in a ‘Dockerfile’ is to write:

CMD python server_wsgiref.py

Our application still runs, but this isn’t doing the same thing as when we supplied a list of arguments to ‘CMD’.

The result in this case is:

[Image: ‘top' inside the Docker container, ‘/bin/sh' running as process ID 1]

With this way of specifying the ‘CMD’ our application is no longer running as process ID ‘1’. Instead process ID ‘1’ is occupied by an instance of ‘/bin/sh’.

This has occurred because supplying the plain command line to ‘CMD’ actually results in the equivalent of:

CMD [ "sh", "-c", "python server_wsgiref.py" ]

This is the reason a shell process is introduced into the process hierarchy as process ID ‘1'.

With our application now no longer running as process ID ‘1’, the responsibility of reaping zombie processes falls instead to the instance of ‘/bin/sh’ running as process ID ‘1’.

As it turns out, ‘/bin/sh’ will reap any child processes associated with it, so we do not have the problem of zombie processes accumulating.

Now this isn’t the only way you might end up with an instance of ‘/bin/sh’ being process ID ‘1’.

Another common scenario where this ends up occurring is where someone using Docker uses a shell script with the ‘CMD’ statement so that they can do special setup prior to actually running their application. You thus can often find something like:

CMD [ "/app/start.sh" ]

The contents of the ‘start.sh' script might then be:

#!/bin/sh
python server_wsgiref.py

Using this approach, what we end up with is:

[Image: ‘top' inside the Docker container, shell script running as process ID 1]

Our script is listed as process ID ‘1’, although it is in reality still an instance of ‘/bin/sh’.

The reason our application didn’t end up as process ID ‘1’ in this case is that the final line of the script simply said ‘python server_wsgiref.py’.

Whenever using a shell script as a ‘CMD' like this, you should always ensure that when running your actual application from the shell script, you do so using ‘exec'. That is:

#!/bin/sh
exec python server_wsgiref.py

By using ‘exec’ you ensure that your application process takes over and replaces the script process, thus resulting in it running as process ID ‘1’.
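The same process replacement behaviour is available from Python itself through the ‘os.exec*' family of functions, which is what the shell ‘exec' builtin maps onto:

import os

# Replace the current process image with the server process. If the
# exec succeeds this call never returns, and the new program inherits
# our process ID.
os.execvp('python', ['python', 'server_wsgiref.py'])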

But wait. If having process ID ‘1' be an instance of ‘/bin/sh', with our application being a child process of it, solves the zombie reaping problem, why not always do that?

The reason for this is that although ‘/bin/sh’ will reap zombie processes for us, it will not propagate signals properly.

For our example, what this means is that with ‘/bin/sh' as process ID ‘1', if we were to use the command ‘docker stop', the application process will not actually shut down. Instead the default timeout for ‘docker stop' will expire and it will then do the equivalent of ‘docker kill', which will force kill the application and the container.

This occurs because although the instance of ‘/bin/sh’ will receive the signal to terminate the application which is sent by ‘docker stop', it ignores it and doesn’t pass it on to the actual application.

This in turn means that your application is denied the ability to be notified properly that the container is being shut down, and so cannot ensure that it performs any required finalisation of in progress operations. For some applications, the lack of an ability to perform a clean shutdown could leave persistent data in an inconsistent state, causing problems when the application is restarted.

It is therefore important that signals always be received by the main application process in a Docker container, but an intermediary shell process will not ensure that.
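As a point of comparison, this is roughly what a well behaved application does with the ‘TERM' signal sent by ‘docker stop', and what it is denied the chance to do when a shell swallows the signal. A sketch only:

import signal
import sys

def shutdown(signum, frame):
    # Finalise any in progress operations, flush data to disk and so
    # on, then exit cleanly before 'docker stop' escalates to a kill.
    sys.exit(0)

signal.signal(signal.SIGTERM, shutdown)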

One can attempt to catch signals in the shell script and forward them on, but this does get a bit tricky, as you also have to ensure that you wait for the wrapped application process to shut down properly when it is passed a signal that would cause it to exit. As I have previously shown in an earlier post for other reasons, you might be able to use in such circumstances the shell script:

#!/bin/sh

# Forward TERM and INT to the application process.
trap 'kill -TERM $PID' TERM INT

# Run the application in the background, remembering its process ID.
python server_wsgiref.py &
PID=$!

# Wait for the application; this 'wait' returns early if a signal
# arrives and the trap fires.
wait $PID

# Reset the traps and wait again to collect the real exit status.
trap - TERM INT
wait $PID
STATUS=$?

exit $STATUS

To be frank though, rather than hoping this will work reliably, you are better off using a purpose built monitoring process for this particular task.

Minimal init process

Coming from the Python world, one solution that Python developers like to use for managing processes is ‘supervisord'. This should work, but is a relatively heavyweight solution. At this time, ‘supervisord' is also still only usable with Python 2. If you wanted to run an application using Python 3, this means you wouldn’t be able to use it, unless you were okay with also having to add Python 2 to your image, resulting in a much fatter Docker image.

The folks at Phusion in that blog post I referenced earlier do provide a minimal ‘init’ like process which is implemented as a Python script, but if not using Python at all in your image, that means pulling in Python 2 once again when you perhaps don’t want that.

Because of the overhead of bringing in additional packages where you don’t necessarily want them, my preferred solution for a minimal ‘init' process for handling the reaping of zombies and the propagation of signals to the managed process is the ‘tini' program. This is the same program that the ‘jupyter/notebook' image makes use of, and which we saw mentioned in the ‘ENTRYPOINT' statement of the ‘Dockerfile'.

ENTRYPOINT ["tini", "--"]

All ‘tini' does is spawn your application and wait for it to exit, all the while reaping zombies and performing signal forwarding. In other words, it is specifically built for this task, relieving you of worrying about whether your own application is going to do the correct thing in relation to reaping of zombie processes.

Even if you believe your application may handle this task okay, I would still recommend that a tool like ‘tini’ be used as it gives you one less thing to worry about.

If you are using a shell script with ‘CMD’ in a ‘Dockerfile’ and subsequently running your application from it, you can still do that, but remember to use ‘exec’ when running your application to ensure that signals will get to your application. Don’t use ‘exec’ and your shell script will still swallow them up.

IPython and cloud services

We are finally done with improving how IPython can be run with Docker so that it will work with cloud services using Docker. The main issue we faced was the additional security restrictions that can be in place when running Docker images on such a service.

In short, running Docker images as ‘root’ is a bad idea. Even if you are running your own Docker service it is something you should avoid if at all possible. Because of the increased risk you can understand why a hosting service is not going to allow you to do it.

With the introduction of user namespace support in Docker the restriction on what user a Docker image can run as should hopefully be able to be relaxed, but in the interim you would be wise to design Docker images so that they can run as an unprivileged user.

Now since there were actually a few things we needed to change to achieve this, and the description of the changes was spread over multiple blog posts, I will summarise the changes in the next post. I will also start to outline what else I believe could be done to make the use of IPython with Docker, and especially cloud services, even better.

Thursday, December 24, 2015

Unknown user when running Docker container.

In the last post we covered how to setup a Docker image to cope with the prospect of a random user ID being used when the Docker container was started. The discussion so far has though only dealt with the issue of ensuring file system access permissions were set correctly to allow the original default user, as well as the random user ID being used, to update files.

A remaining issue of concern was the fact that when a random user ID is used which doesn’t correspond to an actual user account, that UNIX tools such as ‘whoami’ will not return valid results.

I have no name!@5a72c002aefb:/notebooks$ whoami
whoami: cannot find name for user ID 10000

Up to this point this didn’t actually appear to prevent our IPython Notebook application working, but it does leave the prospect that subtle problems could arise when we start actually using IPython to do more serious work.

Let’s dig in and see what this failure equates to in the context of a Python application.

Accessing user information

If we are writing Python code, there are a couple of ways using the Python standard library that we could determine the login name for the current user.

The first way is to use the ‘getuser()’ function found in the ‘getpass’ module.

import getpass
name = getpass.getuser()

If we use this from an IPython notebook when a random user ID has been assigned to the Docker container, then just as ‘whoami' fails, this will also fail.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-3-3a0a5fbe1d4e> in <module>()
      1 import getpass
----> 2 name = getpass.getuser()

/usr/lib/python2.7/getpass.pyc in getuser()
    156     # If this fails, the exception will "explain" why
    157     import pwd
--> 158     return pwd.getpwuid(os.getuid())[0]
    159
    160 # Bind the name getpass to the appropriate function

KeyError: 'getpwuid(): uid not found: 10000'

The error details and traceback displayed here actually indicate the second way of getting access to the login name. In fact the ‘getuser()’ function is just a high level wrapper around a lower level function for accessing user information from the system user database.

We could therefore also have written:

import pwd, os
name = pwd.getpwuid(os.getuid())[0]

Or being more verbose to make it more obvious what is going on:

import pwd, os
name = pwd.getpwuid(os.getuid()).pw_name

Either way, this is still going to fail where the current user ID doesn’t match a valid user in the system user database.
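If you control the code doing the lookup, one defensive pattern, a sketch only, is to catch the ‘KeyError' and fall back to the environment or a default:

import os
import pwd

def safe_username(default='unknown'):
    # Look up the login name for the current user ID, falling back to
    # environment variables, then a default, when there is no entry in
    # the system user database.
    try:
        return pwd.getpwuid(os.getuid()).pw_name
    except KeyError:
        return os.environ.get('USER') or os.environ.get('LOGNAME') or default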

Environment variable overrides

You may be thinking, why bother with the ‘getuser()' function if one could use ‘pwd.getpwuid()' directly? Well, it turns out that ‘getuser()' does a bit more than just act as a proxy for calling ‘pwd.getpwuid()'. What it actually does is first consult various environment variables which identify the login name for the current user.

def getuser():
    """Get the username from the environment or password database.

    First try various environment variables, then the password
    database.  This works on Windows as long as USERNAME is set.

    """

    import os

    for name in ('LOGNAME', 'USER', 'LNAME', 'USERNAME'):
        user = os.environ.get(name)
        if user:
            return user

    # If this fails, the exception will "explain" why
    import pwd
    return pwd.getpwuid(os.getuid())[0]

These environment variables such as ‘LOGNAME’ and ‘USER’ would normally be set by the login shell for a user. When using Docker though, a login shell isn’t used and so they are not set.

For the ‘getuser()' function at least, we can therefore get it working by ensuring that as part of the Docker image build, we set one or more of these environment variables. Typically both the ‘LOGNAME' and ‘USER' environment variables are set, so let’s do that.

ENV LOGNAME=ipython
ENV USER=ipython 

Rebuilding our Docker image with this addition to the ‘Dockerfile' and trying ‘getuser()' again from within an IPython notebook, it does indeed now work.

Overriding user system wide

This change may allow more code to execute without problems, but if code directly accesses the system user database using ‘pwd.getpwuid()' and doesn’t catch the ‘KeyError' exception to handle missing user information, you will still have problems.

So although this is still a worthwhile change in its own right, just in case something consults the ‘LOGNAME' and ‘USER' environment variables which would normally be set by the login shell, as ‘getuser()' does, it does not help with ‘pwd.getpwuid()' nor UNIX tools such as ‘whoami'.

To be able to implement a solution for this wider use case gets a bit more tricky as we need to solve the issue for UNIX tools, or for that matter, any C level application code which uses the ‘getpwuid()’ function in the system C libraries.

The only way one can achieve this is through substituting the system C libraries, or at least overriding the behaviour of key C library functions. This may sound impossible, but by using the Linux ability to forcibly preload a shared library into executing processes, it is actually possible, and someone has even written a package we can use for this purpose.

The nss_wrapper library

The package in question is one called ‘nss_wrapper'. The library provides a wrapper for the user, group and hosts NSS API. Using nss_wrapper it is possible to define your own ‘passwd' and ‘group' files which will then be consulted when needing to look up user information.

One way in which this package is normally used is when doing testing and you need to run applications using a dynamic set of users and you don’t want to have to create real user accounts for them. This mirrors the situation we have where when using a random user ID we will not actually have a real user account.

The idea behind the library is that prior to starting up your application you would make copies of the system user and group database files and then edit any existing entries or add additional users as necessary. When starting your application you would then force it to preload a shared library which overrides the NSS API functions in the standard system libraries such that they consult the copies of the user and group database files.

The general steps therefore are something like:

ipython@3d0c5ea773a3:/tmp$ whoami
ipython
ipython@3d0c5ea773a3:/tmp$ id
uid=1001(ipython) gid=0(root) groups=0(root)
ipython@3d0c5ea773a3:/tmp$ echo "magic:x:1001:0:magic gecos:/home/ipython:/bin/bash" > passwd
ipython@3d0c5ea773a3:/tmp$ LD_PRELOAD=/usr/local/lib64/libnss_wrapper.so NSS_WRAPPER_PASSWD=passwd NSS_WRAPPER_GROUP=/etc/group id
uid=1001(magic) gid=0(root) groups=0(root)
ipython@3d0c5ea773a3:/tmp$ LD_PRELOAD=/usr/local/lib64/libnss_wrapper.so NSS_WRAPPER_PASSWD=passwd NSS_WRAPPER_GROUP=/etc/group whoami
magic

To integrate the use of the ‘nss_wrapper’ package we need to do two things. The first is install the package and the second is to add a Docker entrypoint script which can generate a modified password database file and then ensure that the ‘libnss_wrapper.so’ shared library is forcibly preloaded for all processes subsequently run.

Installing the nss_wrapper library

At this point in time the ‘nss_wrapper’ library is not available in the stable Debian package repository, still only being available in the testing repository. As we do not want in general to be pulling packages from the Debian testing repository, we are going to have to install the ’nss_wrapper’ library from source code ourselves.

To be able to do this, we need to ensure that the system packages for ‘make’ and ‘cmake’ are available. We therefore need to add these to the list of system packages being installed.

# Python binary and source dependencies
RUN apt-get update -qq && \
    DEBIAN_FRONTEND=noninteractive apt-get install -yq --no-install-recommends \
    build-essential \
    ca-certificates \
    cmake \
    curl \
    git \
    make \
    language-pack-en \
    libcurl4-openssl-dev \
    libffi-dev \
    libsqlite3-dev \
    libzmq3-dev \
    pandoc \
    python \
    python3 \
    python-dev \
    python3-dev \
    sqlite3 \
    texlive-fonts-recommended \
    texlive-latex-base \
    texlive-latex-extra \
    zlib1g-dev && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

We can then later on download the source package for ‘nss_wrapper’ and install it.

# Install nss_wrapper.
RUN curl -SL -o nss_wrapper.tar.gz https://ftp.samba.org/pub/cwrap/nss_wrapper-1.1.2.tar.gz && \
    mkdir nss_wrapper && \
    tar -xC nss_wrapper --strip-components=1 -f nss_wrapper.tar.gz && \
    rm nss_wrapper.tar.gz && \
    mkdir nss_wrapper/obj && \
    (cd nss_wrapper/obj && \
        cmake -DCMAKE_INSTALL_PREFIX=/usr/local -DLIB_SUFFIX=64 .. && \
        make && \
        make install) && \
    rm -rf nss_wrapper

Updating the Docker entrypoint

At present the Docker ‘ENTRYPOINT’ and ‘CMD’ are specified in the ‘Dockerfile’ as:

ENTRYPOINT ["tini", "--"]
CMD ["jupyter", "notebook"]

The ‘CMD’ statement in this case is the actual command we want to run to start the Jupyter Notebook application.

We haven’t said anything about what the ‘tini’ program specified by the ‘ENTRYPOINT' is all about as yet, but it is actually quite important. If you do not use ‘tini’ as a wrapper for IPython Notebook then it will not work properly. We will cover what ‘tini’ is and why it is necessary for running IPython Notebook in a subsequent post.

Now because we do require ‘tini', but we also want to do some other work prior to actually running the ‘jupyter notebook' command, we are going to substitute an entrypoint script in place of ‘tini'. We will call this ‘entrypoint.sh', make it executable, and place it in the top level directory of the repository. After it’s copied into place, the ‘ENTRYPOINT' specified in the ‘Dockerfile' will then need to be:

ENTRYPOINT ["/usr/src/jupyter-notebook/entrypoint.sh"]

The actual ‘entrypoint.sh’ we will specify as:

#!/bin/sh

# Override user ID lookup to cope with being randomly assigned IDs using
# the -u option to 'docker run'.

USER_ID=$(id -u)

if [ x"$USER_ID" != x"0" -a x"$USER_ID" != x"1001" ]; then
    NSS_WRAPPER_PASSWD=/tmp/passwd.nss_wrapper
    NSS_WRAPPER_GROUP=/etc/group

    cat /etc/passwd | sed -e 's/^ipython:/builder:/' > $NSS_WRAPPER_PASSWD
    echo "ipython:x:$USER_ID:0:IPython,,,:/home/ipython:/bin/bash" >> $NSS_WRAPPER_PASSWD

    export NSS_WRAPPER_PASSWD
    export NSS_WRAPPER_GROUP

    LD_PRELOAD=/usr/local/lib64/libnss_wrapper.so
    export LD_PRELOAD
fi

exec tini -- "$@"

Note that we still execute ‘tini’ as the last step. We do this using ‘exec’ so that its process will replace the entrypoint script and take over as process ID 1, ensuring that signals get propagated properly, as well as to ensure some details related to process management are handled correctly. We will also pass on all command line arguments given to the entrypoint script to ‘tini’. The double quotes around the arguments reference ensure that argument quoting is handled properly when passing through arguments.

What is now new compared to what was being done before is the enabling of the ‘nss_wrapper' library. We do not do this when running as ‘root', that being the case where the Docker image has been deliberately run as ‘root' even though the aim is that it run as a non ‘root' user. We also do not need to do it when run with the default user ID.

When run as a random user ID we do two things with the password database file that we will use with ‘nss_wrapper’.

The first is that we change the login name corresponding to the existing user ID of ‘1001’. This is the default ‘ipython’ user account we created previously. We do this by simply replacing the ‘ipython’ login name in the password file when we copy it, with the name ‘builder’ instead.

The second is that we add a new password database file entry corresponding to the current user ID, that being whatever is the random user ID allocated to run the Docker container. In this case we use the login name of ‘ipython’.

The reason for swapping the login names so the current user ID uses ‘ipython’ rather than the original user ID of ‘1001’, is so that the application when run will still think it is the ‘ipython’ user. What we therefore end up with in our copy of the password database file is:

$ docker run -it --rm -u 10000 -p 8888:8888 jupyter-notebook bash
ipython@0ff73693d433:/notebooks$ tail -2 /tmp/passwd.nss_wrapper
builder:x:1001:0:IPython,,,:/home/ipython:/bin/bash
ipython:x:10000:0:IPython,,,:/home/ipython:/bin/bash

Immediately you can already see that the shell prompt now looks correct. Going back and running our checks from before, we now see:

ipython@0ff73693d433:/notebooks$ whoami
ipython
ipython@0ff73693d433:/notebooks$ id
uid=10000(ipython) gid=0(root) groups=0(root)
ipython@0ff73693d433:/notebooks$ env | grep HOME
HOME=/home/ipython
ipython@0ff73693d433:/notebooks$ touch $HOME/magic
ipython@0ff73693d433:/notebooks$ touch /notebooks/magic
ipython@0ff73693d433:/notebooks$ ls -las $HOME
total 24
4 drwxrwxr-x 4 builder root 4096 Dec 24 10:22 .
4 drwxr-xr-x 6 root root 4096 Dec 24 10:22 ..
4 -rw-rw-r-- 1 builder root 220 Dec 24 10:08 .bash_logout
4 -rw-rw-r-- 1 builder root 3637 Dec 24 10:08 .bashrc
4 drwxrwxr-x 2 builder root 4096 Dec 24 10:08 .jupyter
0 -rw-r--r-- 1 ipython root 0 Dec 24 10:22 magic
4 -rw-rw-r-- 1 builder root 675 Dec 24 10:08 .profile

So even though the random user ID didn’t have an entry in the original system password database file, by using ‘nss_wrapper’ we can trick any applications to use our modified password database file for user information. This means we can dynamically generate a valid password database file entry for the random user ID which was used.

With the way we swapped the login name for the default user ID of ‘1001’, with the random user ID, as far as any application is concerned it is still running as the ‘ipython’ user.

So that we can distinguish them, any files that were created during the image build as the original ‘ipython' user will now instead show as being owned by ‘builder', which if we look it up maps to the user ID of ‘1001'.

ipython@0ff73693d433:/notebooks$ id builder
uid=1001(builder) gid=0(root) groups=0(root)
ipython@0ff73693d433:/notebooks$ getent passwd builder
builder:x:1001:0:IPython,,,:/home/ipython:/bin/bash

Running as another named user

Not that there should strictly be a reason for doing so, but it is also possible to force the Docker container to run as some other user ID which has an entry in the password database file. Because such users have their own distinct primary group assignments though, you do have to override the group to be ‘0' so that the user can still update any required directories.

$ docker run -it --rm -u 5 -p 8888:8888 jupyter-notebook bash
games@36ec17b1d9c1:/notebooks$ whoami
games
games@36ec17b1d9c1:/notebooks$ id
uid=5(games) gid=60(games) groups=60(games)
games@36ec17b1d9c1:/notebooks$ env | grep HOME
HOME=/home/ipython
games@36ec17b1d9c1:/notebooks$ touch $HOME/magic
touch: cannot touch ‘/home/ipython/magic’: Permission denied
games@36ec17b1d9c1:/notebooks$ touch /notebooks/magic
touch: cannot touch ‘/notebooks/magic’: Permission denied

$ docker run -it --rm -u 5:0 -p 8888:8888 jupyter-notebook bash
games@e2ecabedab47:/notebooks$ whoami
games
games@e2ecabedab47:/notebooks$ id
uid=5(games) gid=0(root) groups=60(games)
games@e2ecabedab47:/notebooks$ env | grep HOME
HOME=/home/ipython
games@e2ecabedab47:/notebooks$ touch $HOME/magic
games@e2ecabedab47:/notebooks$ touch /notebooks/magic
games@e2ecabedab47:/notebooks$ ls -las $HOME
total 24
4 drwxrwxr-x 4 builder root 4096 Dec 24 10:41 .
4 drwxr-xr-x 6 root root 4096 Dec 24 10:41 ..
4 -rw-rw-r-- 1 builder root 220 Dec 24 10:39 .bash_logout
4 -rw-rw-r-- 1 builder root 3637 Dec 24 10:39 .bashrc
4 drwxrwxr-x 2 builder root 4096 Dec 24 10:39 .jupyter
0 -rw-r--r-- 1 games root 0 Dec 24 10:41 magic
4 -rw-rw-r-- 1 builder root 675 Dec 24 10:39 .profile

Running as process ID 1

Finally, if we start up the IPython Notebook application locally with Docker, or on OpenShift, then everything still works okay. Further, as well as the ‘getpass.getuser()' function working, use of ‘pwd.getpwuid(os.getuid())' also works, this being due to the use of the ‘nss_wrapper' library.

So everything is now good and we shouldn’t have any issues. There was though something already present in the way that the ‘jupyter/notebook' Docker image was set up that is worth looking at. This was the use of the ‘tini' program as the ‘ENTRYPOINT' in the ‘Dockerfile'. This relates to problems that can arise when running an application as process ID 1. I will look at what this is all about in the next post.

Wednesday, December 23, 2015

Random user IDs when running Docker containers.

At this point in our exploration of getting IPython to work on OpenShift we have deduced that we cannot, and should not, have our Docker container be dependent on running as the 'root' user. Simply setting up the Docker container to run as a specific non ‘root’ user wasn’t enough however. This is because in pursuit of a more secure environment, OpenShift actually uses a different user ID for each project when running Docker containers.

As I keep noting, user namespaces, when available in Docker, should be able to transparently hide any underlying mapping to a special user ID required by an underlying platform, allowing the Docker container to use whatever user ID it wants. We aren’t there yet, and given that user namespaces were first talked about as coming soon well over a year ago, we could well be waiting some time yet for all the necessary pieces to fall into place to enable that.

In the meantime, the best thing you can do to ensure Docker images are portable to different hosting environments, and are as secure as possible, is design your Docker containers to run as a non ‘root' user, but at the same time be tolerant of running as an arbitrary user ID specified at the time the Docker container is started.

File system access permissions

In our prior post, where we got to was that when running our IPython Docker container as a random user ID, it would fail even when running some basic checks.

$ docker run --rm -it -u 100000 -p 8888:8888 jupyter-notebook bash
I have no name!@78bdfa8dba92:/notebooks$ whoami
whoami: cannot find name for user ID 100000
I have no name!@78bdfa8dba92:/notebooks$ id
uid=100000 gid=0(root)
I have no name!@78bdfa8dba92:/notebooks$ pwd
/notebooks
I have no name!@78bdfa8dba92:/notebooks$ env | grep HOME
HOME=/
I have no name!@78bdfa8dba92:/notebooks$ touch $HOME/magic
touch: cannot touch ‘//magic’: Permission denied
I have no name!@78bdfa8dba92:/notebooks$ touch /notebooks/magic
touch: cannot touch ‘/notebooks/magic’: Permission denied

The problems basically boiled down to file system access permissions, this being caused by the fact that we were running as a different user ID to what we expected.

The first specific problem was that the ‘HOME’ directory environment variable wasn’t set to what was expected for the user we anticipated everything to run as. This meant that instead of the home directory ‘/home/ipython’ being used, it was trying to use ‘/‘ as the home directory.

As a first step, let’s simply try overriding the ‘HOME' directory and forcing it to be what we desire by adding to the ‘Dockerfile':

ENV HOME=/home/ipython

Starting the Docker container with an interactive shell we now get:

$ docker run --rm -it -u 100000 -p 8888:8888 jupyter-notebook bash
I have no name!@e40f5e18f666:/notebooks$ whoami
whoami: cannot find name for user ID 100000
I have no name!@e40f5e18f666:/notebooks$ id
uid=100000 gid=0(root)
I have no name!@e40f5e18f666:/notebooks$ pwd
/notebooks
I have no name!@e40f5e18f666:/notebooks$ env | grep HOME
HOME=/home/ipython
I have no name!@e40f5e18f666:/notebooks$ touch $HOME/magic
touch: cannot touch ‘/home/ipython/magic’: Permission denied

The ‘HOME’ directory environment variable is now correct, but we still cannot create files due to the fact that the home directory is owned by the ‘ipython’ user and we are running with a different user ID.

$ ls -las $HOME
total 24
4 drwxr-xr-x 3 ipython ipython 4096 Dec 22 21:53 .
4 drwxr-xr-x 4 root root 4096 Dec 22 21:53 ..
4 -rw-r--r-- 1 ipython ipython 220 Dec 22 21:53 .bash_logout
4 -rw-r--r-- 1 ipython ipython 3637 Dec 22 21:53 .bashrc
4 drwx------ 2 ipython ipython 4096 Dec 22 21:53 .jupyter
4 -rw-r--r-- 1 ipython ipython 675 Dec 22 21:53 .profile

Using group access permissions

The solution to file system access permission problems one often sees in Docker containers which try to run as a non ‘root’ user is to simply make files and directories world writable. That is, after setting up everything in the ‘Dockerfile’ as the ‘root’ user and before switching the user using a ‘USER’ statement, the ‘chmod’ command is run recursively on any directories and files which the running application might need to update. 
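In a ‘Dockerfile' that pattern typically looks something like the following, where the paths are illustrative:

RUN chmod -R a+rwX /notebooks /home/ipython

USER ipython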

I personally don’t like this approach of making everything world writable at all. To me it falls into that category of bad practices you wouldn’t use if you were installing an application directly to a host without Docker, so why start now? But what are the alternatives?

The more secure alternative that would normally be used to allow multiple users to update the same directories or files are UNIX groups. The big question is whether they are going to be useful in this case or not.

As it is, when the home directory for the ‘ipython’ user was created, the directories and files were created with the group ‘ipython’, being a personal group created for the ‘ipython’ user when the ‘adduser’ command was used to create the user account.

The problem with the use of a personal group as the primary group for the user and thus the directories and files created, is that it is impossible to know what the random user ID will be and so add it into the personal group in advance. Having the group of the directories and files be a personal group is therefore not going to work.

The question now is if the group would normally be set to whatever the primary group is for a named user, what group is actually going to be used when the user ID is being overridden for the container at run time.

Let’s first look at the case where we override the user ID, but still use one which does have a user defined for it.

$ docker run --rm -it -u 5 -p 8888:8888 jupyter-notebook bash
games@d0e1f5776ccb:/notebooks$ id
uid=5(games) gid=60(games) groups=60(games)

Here we specify the user ID ‘5’, which corresponds to the ‘games’ user. That user happens to have a corresponding primary group which maps to its own personal group of ‘games’. In overriding the user ID, the primary group for the user is still picked and used as the effective group. Thus the ‘id’ command shows the ‘gid’ being ’60’, corresponding to the ‘games’ group.

Do note that this is only the case where only the user ID was overridden. It so happens that the ‘-u’ option to ‘docker run’ can also be used to override the effective group used as well.

$ docker run --rm -it -u 5:1 -p 8888:8888 jupyter-notebook bash
games@58d9074c872c:/notebooks$ id
uid=5(games) gid=1(daemon) groups=60(games)

Here we have overridden the effective group to be group ID of ‘1’, corresponding to the ‘daemon’ group.

Back to our random user ID, when we select a user ID which doesn’t have a corresponding user account we see:

$ docker run --rm -it -u 10000 -p 8888:8888 jupyter-notebook bash
I have no name!@f4050457c1ee:/notebooks$ id
uid=10000 gid=0(root)

That is, the effective group is set as the ‘gid’ of ‘0’, corresponding to the group for ‘root’.

The end result is that provided that we do not override the effective group as well using the ‘-u’ option, if the user ID specified corresponds to a user account, then the primary group for that user would be used. If instead a random user ID were used for which there did not exist a corresponding user account, then the effective group would be that for the ‘gid’ of ‘0’, which is reserved for the ‘root’ user group.

Note that in a hosting service which is effectively using a randomly assigned user ID, it is assumed that it will never select one which overlaps with an existing user ID. This can’t be completely guaranteed, although so long as a hosting service uses user IDs starting at a very large number, it is a good bet it will not clash with an existing user. For OpenShift at least, it appears to allocate user IDs starting somewhere above ‘1000000000’.

As to overriding the group as well as the user ID, it is also assumed that a hosting service would not do that. Again, OpenShift at least doesn’t override the group, and this is probably the most sensible thing that could be done here. Overriding the group to be some random ID as well would make the use of UNIX groups inside of the container impossible, as nothing would be predictable. I would suggest that any hosting service going down this path of allocating user IDs follow OpenShift’s lead and not override the group ID, as doing so would likely just cause a world of hurt.

Using a user with effective GID of 0

What now is going to be the most workable solution if we wish to rely on group access permissions?

In light of the behaviour observed above, what seems like it might work is to have the special user we created, and which would be the default user specified by the ‘USER' statement of the ‘Dockerfile', have a primary group with ‘gid' of ‘0'. That is, we match what would be the primary group used if a random user ID had been used which does not correspond to a user account.

By making such a choice for the effective group, it means that the group will be the same for both cases and we can now set up file system permissions correspondingly.

Updating our ‘Dockerfile’ based on this, we end up with:

RUN adduser --disabled-password --gid 0 --gecos "IPython" ipython

RUN mkdir -m 0775 /notebooks && chown ipython:root /notebooks

VOLUME /notebooks
WORKDIR /notebooks

USER ipython

# Add a notebook profile.
RUN mkdir -p -m 0775 ~ipython/.jupyter/ && \
    echo "c.NotebookApp.ip = '*'" >> ~ipython/.jupyter/jupyter_notebook_config.py

RUN chmod -R u+w,g+w /home/ipython

ENV HOME=/home/ipython

The key changes are:

  • Add the ‘--gid 0' option to ‘adduser' so that the primary group for the user is ‘root'.
  • Create the ‘/notebooks’ directory with mode ‘0775’ so writable by group.
  • Move creation of ‘jupyter_notebook_config.py’ down to where we are the non ‘root’ user.
  • Change permissions on all files and directories in home directory so writable by group.

Let’s now check what happens for each of the use cases we expect.

For the case where the Docker container runs as the default user as specified by the ‘USER’ statement we now get:

$ docker run --rm -it -p 8888:8888 jupyter-notebook bash
ipython@68d5a31bcc03:/notebooks$ whoami
ipython
ipython@68d5a31bcc03:/notebooks$ id
uid=1000(ipython) gid=0(root) groups=0(root)
ipython@68d5a31bcc03:/notebooks$ pwd
/notebooks
ipython@68d5a31bcc03:/notebooks$ env | grep HOME
HOME=/home/ipython
ipython@68d5a31bcc03:/notebooks$ touch $HOME/magic
ipython@68d5a31bcc03:/notebooks$ touch /notebooks/magic
ipython@68d5a31bcc03:/notebooks$ ls -las $HOME
total 24
4 drwxrwxr-x 4 ipython root 4096 Dec 23 02:26 .
4 drwxr-xr-x 6 root root 4096 Dec 23 02:26 ..
4 -rw-rw-r-- 1 ipython root 220 Dec 23 02:15 .bash_logout
4 -rw-rw-r-- 1 ipython root 3637 Dec 23 02:15 .bashrc
4 drwxrwxr-x 2 ipython root 4096 Dec 23 02:15 .jupyter
0 -rw-r--r-- 1 ipython root 0 Dec 23 02:26 magic
4 -rw-rw-r-- 1 ipython root 675 Dec 23 02:15 .profile

Everything in our checks still works okay and running up the actual Jupyter Notebook application also works fine, with us being able to create and save new notebooks.

This is what we would expect as the directories and files are owned by the ‘ipython’ user and we are also running as that user.

Of note is that you will now see that the effective group of the user is a ‘gid’ of ‘0’. All the directories and files also have that group.

If we use the ‘-u ipython’ or ‘-u 1000’ option, where ‘1000’ was the user ID allocated by the ‘adduser’ command in the ‘Dockerfile’, that all works fine as well.

For the case of overriding the user with a random user ID, we get:

$ docker run --rm -it -u 10000 -p 8888:8888 jupyter-notebook bash
I have no name!@dbe290496d44:/notebooks$ whoami
whoami: cannot find name for user ID 10000
I have no name!@dbe290496d44:/notebooks$ id
uid=10000 gid=0(root)
I have no name!@dbe290496d44:/notebooks$ pwd
/notebooks
I have no name!@dbe290496d44:/notebooks$ env | grep HOME
HOME=/home/ipython
I have no name!@dbe290496d44:/notebooks$ touch $HOME/magic
I have no name!@dbe290496d44:/notebooks$ touch /notebooks/magic
I have no name!@dbe290496d44:/notebooks$ ls -las $HOME
total 24
4 drwxrwxr-x 4 ipython root 4096 Dec 23 02:32 .
4 drwxr-xr-x 6 root root 4096 Dec 23 02:32 ..
4 -rw-rw-r-- 1 ipython root 220 Dec 23 02:31 .bash_logout
4 -rw-rw-r-- 1 ipython root 3637 Dec 23 02:31 .bashrc
4 drwxrwxr-x 2 ipython root 4096 Dec 23 02:31 .jupyter
0 -rw-r--r-- 1 10000 root 0 Dec 23 02:32 magic
4 -rw-rw-r-- 1 ipython root 675 Dec 23 02:31 .profile

Unlike before, when overriding with a random user ID with no corresponding user account, the attempts to create files in the file system now work okay.

What you will note though is that the file created is in this case owned by the user with user ID of ‘10000'. This worked because the effective group of the random user ID was ‘root', matching the group of the directory, along with the fact that the group permissions of the directory allowed updates by anyone in the same group. Thus it didn’t matter that the user ID was different to the owner of the directory.

One thing you may note is that when the file ‘magic' was created, the resulting file wasn’t itself writable to the group. This was the case as the default ‘umask' set up by Docker when a container is run is ‘0022'. This particular ‘umask' disables the setting of the ‘w' flag on the group.
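You can see the effect of the ‘umask' with a few lines of Python; the file path here is illustrative.

import os

os.umask(0o022)  # the default umask used by Docker

# Request mode 0666; the umask masks off the group and other 'w' bits.
fd = os.open('/tmp/example', os.O_WRONLY | os.O_CREAT, 0o666)
os.close(fd)

print(oct(os.stat('/tmp/example').st_mode & 0o777))  # 0644 (0o644 on Python 3)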

Even though this is the case, this is not a problem because from this point on any code that would run, such as the actual Jupyter Notebook application, would only ever run as the same allocated user ID. There is therefore no expectation of any processes running as the original ‘ipython’ user needing to be able to update the file.

In other words, that directories and files are fixed up to be writable to group only matters for the original directories and files created as part of the Docker build as the ‘ipython’ user. What happens after that and what the ‘umask’ may be is not important.

One final check to go, will this updated version of the ‘jupyter/notebook’ Docker image work on OpenShift, and the answer is that it does indeed now start up okay and does not error out due to the problems with access permissions we had before.

If we access the running container on OpenShift we can perform the same checks as above okay.

$ oc rsh ipython-3-c7oit
I have no name!@ipython-3-c7oit:/notebooks$ whoami
whoami: cannot find name for user ID 1000210000
I have no name!@ipython-3-c7oit:/notebooks$ id
uid=1000210000 gid=0(root)
I have no name!@ipython-3-c7oit:/notebooks$ pwd
/notebooks
I have no name!@ipython-3-c7oit:/notebooks$ env | grep HOME
HOME=/home/ipython
I have no name!@ipython-3-c7oit:/notebooks$ touch $HOME/magic
I have no name!@ipython-3-c7oit:/notebooks$ touch /notebooks/magic
I have no name!@ipython-3-c7oit:/notebooks$ ls -las $HOME
total 20
4 drwxrwxr-x. 5 ipython root 4096 Dec 23 03:20 .
0 drwxr-xr-x. 3 root root 20 Dec 23 03:13 ..
4 -rw-------. 1 1000210000 root 31 Dec 23 03:20 .bash_history
4 -rw-rw-r--. 1 ipython root 220 Dec 23 03:13 .bash_logout
4 -rw-rw-r--. 1 ipython root 3637 Dec 23 03:13 .bashrc
0 drwxr-xr-x. 5 1000210000 root 64 Dec 23 03:19 .ipython
0 drwxrwxr-x. 2 ipython root 39 Dec 23 03:14 .jupyter
0 drwx------. 3 1000210000 root 18 Dec 23 03:18 .local
0 -rw-r--r--. 1 1000210000 root 0 Dec 23 03:20 magic
4 -rw-rw-r--. 1 ipython root 675 Dec 23 03:13 .profile

Named user vs numeric user ID

Before we go on to further verify whether the updated Docker image does in fact work properly on OpenShift, I want to revisit the use of the ‘USER’ statement in the ‘Dockerfile’.

Right now the ‘USER’ statement is specifying a default user. This user would be used if you were running the Docker image directly with Docker yourself. As we have seen, if used with OpenShift, the user given by the ‘USER’ statement is actually ignored.

One reason a hosting service such as OpenShift ignores the user specified by the ‘USER' statement is that it cannot trust that the user is a non ‘root' user when the user is specified by way of a name. Another reason is that where a hosting service provides the ability to mount shared persistent volumes into containers, it may want to ensure that containers owned by a specific service account, or a project within a service account, have different user IDs. This ensures that if a volume was ever mounted against the wrong container, there would be no way an application could see data stored on the shared volume by a different user.

Now one of the possibilities I did describe in a prior post was that if a hosting service only supported 12 factor applications and didn’t support persistent data volumes, although it should really still prohibit running a container as ‘root', it may allow a container to run as the user specified by the ‘USER' statement so long as it knows that user isn’t ‘root'. It can only know this though if a numeric user ID was defined with the ‘USER' statement.

To cater for this possibility, rather than use a user name with the ‘USER' statement, let’s use its numeric user ID instead.

Now from the above tests we saw that the numeric user ID for the user ‘ipython' created by ‘adduser' was ‘1000'. We could therefore use that with the ‘USER' statement. However, what ‘adduser' will use for the user ID is not technically deterministic, as it can depend on what other user accounts have already been created, and also on what operating system is used. We are therefore better off being explicit and telling ‘adduser' what user ID to use.

The lowest recommended user ID for normal user accounts looks to be 500 on POSIX and Red Hat systems, and 1000 on OpenSuSE and Debian. Let’s therefore go with a number of 1000 or above, but just in case an operating system image already includes a default user account, let’s skip 1000 and use 1001 instead.

Making this change we now end up with the ‘Dockerfile’ being:

# Create the 'ipython' user with the fixed user ID of 1001, with 'root'
# as its primary group.
RUN adduser --disabled-password --uid 1001 --gid 0 --gecos "IPython" ipython
RUN mkdir -m 0775 /notebooks && chown ipython:root /notebooks
VOLUME /notebooks
WORKDIR /notebooks
USER 1001
# Add a notebook profile.
RUN mkdir -p -m 0775 ~ipython/.jupyter/ && \
    echo "c.NotebookApp.ip = '*'" >> ~ipython/.jupyter/jupyter_notebook_config.py
RUN chmod -R u+w,g+w /home/ipython
ENV HOME=/home/ipython

All up this should give us the most portable solution: working where the Docker container is hosted directly on Docker, but also working on a hosting service such as OpenShift, which uses Docker under the covers but overrides the user ID containers run as. Using a numeric user ID for ‘USER’ also allows a hosting service which does not want to allow containers to run as ‘root’ to still use our preferred user, as it will know it can trust that the container will run as the user ID indicated.
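
As a quick sanity check, once the image has been rebuilt with this change, inspecting it should report the numeric user ID rather than a name. A rough sketch, assuming the updated image was rebuilt under the ‘jupyter-notebook’ tag used elsewhere in this series:

$ docker inspect --format='{{.ContainerConfig.User}}' jupyter-notebook
1001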

Cannot find name for user ID

It would be great to say at this point that we are done and everything works fine. That is however not the case as I will go into in the next post.

The remaining problem relates to what happens when we run the ‘whoami’ command:

$ docker run --rm -it -u 10000 -p 8888:8888 jupyter-notebook bash
I have no name!@dbe290496d44:/notebooks$ whoami
whoami: cannot find name for user ID 10000

As we can see, ‘whoami’ isn’t able to return a valid value because the user ID everything runs as doesn’t actually match any user account.

When initially running up the updated Docker image this didn’t appear to prevent the IPython Notebook application from running, but as we delve deeper we will see that it can actually cause problems.

Tuesday, December 22, 2015

Overriding the user Docker containers run as.

In the first post of this series looking at how to get IPython running on OpenShift I showed how taking the ‘jupyter/notebook’ Docker image and trying to use it results in failure. The error we encountered was:

$ oc logs --previous notebook-1-718ce
/usr/local/lib/python3.4/dist-packages/IPython/paths.py:69: UserWarning: IPython parent '/' is not a writable location, using a temp directory.
" using a temp directory.".format(parent))
Traceback (most recent call last):
  ...
File "/usr/lib/python3.4/os.py", line 237, in makedirs
mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/.jupyter'

The problem occurred because the ‘jupyter/notebook’ image expects to run as the ‘root’ user, but OpenShift doesn’t permit that by default, due to the increased security risks of allowing it with how Docker currently works.

Changes are supposedly coming for Docker, in the way of support for user namespaces, which would reduce the security risks. But right now, and perhaps even when support for user namespaces is available, it is simply better that you do not run Docker containers as ‘root’.

Let’s now dig more into the ways that a Docker container can be made to not run as the ‘root’ user.

Specifying the user in the Dockerfile

If you are building a Docker image yourself, you can specify that it should run as a particular user by including the ‘USER’ statement in the ‘Dockerfile’. Normally you would place this towards the end of the ‘Dockerfile’, so that prior ‘RUN’ steps within the ‘Dockerfile’ can still run with the default ‘root’ privileges.
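
As a minimal sketch of the typical layout, with the package and command names here being placeholders only:

FROM debian:jessie

# Steps requiring 'root' privileges are performed first.
RUN apt-get update && apt-get install -y some-package

# Only once all system changes are done do we switch away from 'root'.
USER 1001

CMD ["some-application"]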

Unfortunately many images do not close out the ‘Dockerfile’ by specifying a ‘USER’ statement for a non ‘root’ user. This is either through ignorance of the fact that one shouldn’t really run Docker containers as ‘root’ unless genuinely necessary, or because the author anticipates that the Docker image may later be used as a base image and doesn’t want to make it too difficult to use in that way.

Specifically, if the base image were finished up with a ‘USER’ statement for a non ‘root’ user, when creating a derived image the first thing that anyone would need to do if they wanted to make system changes would be to use ‘USER root’ to switch back to being the ‘root’ user.
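
For example, a derived image would need to start out something like the following, with the image name here being purely illustrative:

FROM example/base-image

# The base image finished up with a non 'root' USER, so switch back first.
USER root

RUN apt-get update && apt-get install -y another-package

# Revert to a non 'root' user before the image is used.
USER 1001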

One can easily see how people might think this is annoying and so not specify ‘USER’ in the base image. The problem is that your typical users of the base image are even less likely to understand the consequences of running as ‘root’ and why you shouldn’t, and so aren’t going to revert to a non ‘root’ user in their derived image either, if you haven’t provided some pointer to what best practice is.

What user to run a container as

So if it is better to run as a non ‘root’ user, what user should that be?

The simplest course one might choose is to look at what system users an operating system predefines in the ‘/etc/passwd’ file.

On the ‘busybox’ image, if we do that we find:

root:x:0:0:root:/root:/bin/sh
daemon:x:1:1:daemon:/usr/sbin:/bin/false
bin:x:2:2:bin:/bin:/bin/false
sys:x:3:3:sys:/dev:/bin/false
sync:x:4:100:sync:/bin:/bin/sync
mail:x:8:8:mail:/var/spool/mail:/bin/false
www-data:x:33:33:www-data:/var/www:/bin/false
operator:x:37:37:Operator:/var:/bin/false
ftp:x:83:83:ftp:/home/ftp:/bin/false
nobody:x:99:99:nobody:/home:/bin/false

Of these, the ‘www-data’ user looks like a good candidate, being the user that would normally be used by a web server such as Apache. The ‘www-data’ user is also typically present on all Linux operating system variants.

The problem with the ‘www-data’ user is that although the ‘/etc/passwd’ file usually defines a home directory, that home directory, depending on the Linux operating system variant, may not actually exist.

For example, on ‘busybox’ the home directory of ‘/var/www’ does exist, but on a Debian based image it may not.

$ docker run --rm -it busybox sh
/ # echo ~www-data
/var/www
/ # touch ~www-data/magic
/ # exit
$ docker run --rm -it debian:jessie sh
# echo ~www-data
/var/www
# touch ~www-data/magic
touch: cannot touch '/var/www/magic': No such file or directory

The lack of a home directory means that even if we update the IPython Docker image to run as the ‘www-data’ user, it will still fail on startup.

/usr/local/lib/python3.4/dist-packages/IPython/paths.py:69: UserWarning: IPython parent '/var/www' is not a writable location, using a temp directory.
" using a temp directory.".format(parent))
Traceback (most recent call last):
...
File "/usr/lib/python3.4/os.py", line 237, in makedirs
mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/var/www'

What happened this time is that, since the home directory didn’t even exist, the Python code for the application tried to create it. The ‘www-data’ user, though, didn’t have permission to create a directory under ‘/var’.

More typically application code trying to write files to a home directory would assume that the home directory at least would exist, so instead it would fail when trying to create a file or subdirectory under the non existent home directory. It is unusual that the application code in this case tried to create the home directory first.

Adding a new user account

One could, with the ‘www-data’ account, simply ensure that a home directory does exist, creating it with the appropriate permissions, a sketch of which is shown below, but at that point it is probably easier and better to add a new user account to the system specifically to be used by the container when it is run.
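
A minimal sketch of what sticking with ‘www-data’ would entail is the following, which simply ensures the home directory exists and is owned by ‘www-data’ before switching to that user:

RUN mkdir -p /var/www && chown www-data:www-data /var/www
USER www-data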

The exact command you use to add a new user account is going to depend on the Linux operating system variant being used. If using a Debian based system you would use ‘adduser’. If using Red Hat you would use ‘useradd’.

The IPython Docker image for ‘jupyter/notebook’ is derived from Ubuntu and so is Debian based. To create a special user account called ‘ipython’ we would therefore use:

adduser --disabled-password --gecos "IPython" ipython

The ‘--disabled-password’ option ensures that the ‘adduser’ command doesn’t attempt to prompt for a user password.
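
For reference, if you were instead working from a Red Hat based image, a rough equivalent using ‘useradd’ would be the following, where ‘-m’ ensures a home directory is created and the account is left locked rather than prompting for a password:

useradd -m -c "IPython" ipython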

Having added our own user account, what changes would we need to make to the ‘Dockerfile’ for the ‘jupyter/notebook’ Docker image to have it use this?

Looking through it, the bulk of the commands in the ‘Dockerfile’ relate to installing system or Python packages. It is only when we get to the end of the ‘Dockerfile’ that we come across anything that potentially needs to change.

# Add a notebook profile.
RUN mkdir -p -m 700 /root/.jupyter/ && \
    echo "c.NotebookApp.ip = '*'" >> /root/.jupyter/jupyter_notebook_config.py
VOLUME /notebooks
WORKDIR /notebooks
EXPOSE 8888
ENTRYPOINT ["tini", "--"]
CMD ["jupyter", "notebook"]

The first command here creates a ‘.jupyter’ subdirectory in what would be the home directory for the user, and adds to it a user configuration file for the Jupyter Notebook application. As it stands, it assumes that the user will be ‘root’, but we don’t want it to run as ‘root’; we want it to run as the ‘ipython’ user we have created.

When we used the ‘adduser’ command it automatically created a home directory for the ‘ipython’ user at ‘/home/ipython’. We therefore need to use the home directory for the ‘ipython’ user instead of ‘/root’, which was the home directory for the ‘root’ user.

# Add a notebook profile.
RUN mkdir -p -m 700 ~ipython/.jupyter/ && \
    echo "c.NotebookApp.ip = '*'" >> ~ipython/.jupyter/jupyter_notebook_config.py && \
    chown -R ipython:ipython ~ipython/.jupyter

Also note that since commands are being run as root at this point in the ‘Dockerfile’, we also need to change the ownership on the ‘.jupyter’ subdirectory and the file created inside so they are owned by the ‘ipython’ user. If we do not do this and the IPython Notebook application wants to create additional files in that directory, it will fail, as the directory and files would still be owned by the ‘root’ user.

For good measure and to be consistent with normal permissions one would find on directories and files created by a user, we also ensure that the group is updated to also be that corresponding to the ‘ipython’ user.

Directory permissions also become an issue with the ‘/notebooks’ directory. This is the directory which is setup as the working directory for the IPython Notebook application and which will be where any IPython notebooks will be created.

The ‘/notebooks’ directory is actually created as a side effect of the ‘VOLUME’ statement. A directory created in that way will be created with ‘root’ as the owner. We therefore need to manually create the ‘/notebooks’ directory ourselves and set the permissions appropriately, before marking it as being a volume mount point.

RUN mkdir /notebooks && chown ipython:ipython /notebooks

Finally, we can use the ‘USER’ statement to mark that the Docker container when run should be run as the ‘ipython’ user. The final result we end up with is:

# Add a notebook profile.
RUN mkdir -p -m 700 ~ipython/.jupyter/ && \
    echo "c.NotebookApp.ip = '*'" >> ~ipython/.jupyter/jupyter_notebook_config.py && \
    chown -R ipython:ipython ~ipython/.jupyter
RUN mkdir /notebooks && chown ipython:ipython /notebooks
VOLUME /notebooks
WORKDIR /notebooks
EXPOSE 8888
USER ipython
ENTRYPOINT ["tini", "--"]
CMD ["jupyter", "notebook"]

Running the Docker image after having made these changes and executing an interactive shell within the container, we can check the environment to see if it is what we expect.

ipython@f4665e7d63b7:/notebooks$ whoami
ipython
ipython@f4665e7d63b7:/notebooks$ id
uid=1000(ipython) gid=1000(ipython) groups=1000(ipython)
ipython@f4665e7d63b7:/notebooks$ pwd
/notebooks
ipython@f4665e7d63b7:/notebooks$ env | grep HOME
HOME=/home/ipython
ipython@f4665e7d63b7:/notebooks$ ls -las
total 8
4 drwxr-xr-x 2 ipython ipython 4096 Dec 21 03:54 .
4 drwxr-xr-x 58 root root 4096 Dec 21 03:54 ..

All looks good and from the web interface for the IPython Notebook application we can create and save notebooks.

Alas, even though we have now converted the IPython Docker image to not run as ‘root’, it still will not run on OpenShift, again yielding an error message in the log output.

$ oc logs --previous ipython-2-yo557
/usr/local/lib/python3.4/dist-packages/IPython/paths.py:69: UserWarning: IPython parent '/' is not a writable location, using a temp directory.
" using a temp directory.".format(parent))
Traceback (most recent call last):
...
File "/usr/lib/python3.4/os.py", line 237, in makedirs
mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/.jupyter'

Why the user ID is overridden

The reason that the updated Docker image failed on OpenShift is that even though a ‘USER’ statement was included to indicate that a specific non ‘root’ user should be used to run the Docker image, this was still ignored.

When we look at a hosting service which wants to prohibit Docker images from running as ‘root’, there are a couple of issues that come into play in respect of what user ID a Docker image is allowed to run as, or forced to run as.

The first issue is what the ‘USER’ was actually set to, which any hosting service can determine by inspecting the Docker image to be deployed.

$ docker inspect --format='{{.ContainerConfig.User}}' jupyter-notebook
ipython

When inspecting our updated Docker image, you can see that it has been set up to run as the ‘ipython’ user.

The problem with this is that because a user name was supplied, it is not actually possible for the hosting service to readily determine what user ID that user name maps to.

Although you might think that so long as the name isn’t ‘root’ you are good, that isn’t the case. This is because there is nothing to stop someone constructing a Docker image which has an ‘/etc/passwd’ file containing:

ipython:x:0:0:root:/root:/bin/sh

In short, a hosting service cannot trust the user configured into a Docker image if it is not an integer user ID.
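
To make that concrete, here is a contrived sketch of an image which would pass a naive check on the user name, yet still run as ‘root’:

FROM busybox
RUN echo "ipython:x:0:0:root:/root:/bin/sh" >> /etc/passwd
USER ipython
CMD ["id"]

Inspecting this image would show the user as ‘ipython’, yet when run, the ‘id’ command would report a uid of ‘0’, that of ‘root’.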

In order to try and ensure that a hosting service will actually run a Docker image as the ‘USER’ you defined, at the least, you need to provide an integer user ID for ‘USER’ and not a user name.

Unfortunately, this will probably still only work for a hosting service which only supports 12 factor applications and does not offer any persistent storage.

The second issue, and a further reason why the ‘USER’ may be ignored even when numeric, is whether persistent storage is being offered.

The problem here is that, for extra security when providing persistent data storage volumes to applications, you do not really want the storage volumes of every user to be accessed using the same user ID. As a result, where persistent storage is being offered, a separate user ID will generally be given to each user, or possibly to each project.

This is the case with OpenShift, where applications running under distinct projects, even for the same OpenShift user, are allocated different user IDs that any Docker images are then run as.

As a result, no matter what you set ‘USER’ to in the ‘Dockerfile’, OpenShift will instead force the Docker container to run as the user ID allocated to the project the Docker container is run in, this being to allow for better security when persistent volumes or other external resources are being used.
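
If you are curious what user ID range was allocated to a project, OpenShift records it as an annotation on the project. A rough sketch, with the project name here being illustrative:

$ oc get project myproject -o yaml | grep uid-range
    openshift.io/sa.scc.uid-range: 1000210000/10000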

Now, before you start thinking this is a limitation of just OpenShift, it isn’t. Any hosting service that wants to use best practice, including not running Docker images as ‘root’ and providing applications running in different contexts with different user identities for accessing external resources, will be in the same boat.

This situation will likely only change when user namespace support is added to Docker and hosting services can transparently map any user ID within the container to a distinct user ID outside of the container, of the hosting service’s choosing.

How the user ID is overridden

So how does OpenShift override the ‘USER’ which a Docker image is configured with?

Overriding the user that a Docker container is marked to run as can be done by using the ‘-u’ option to ‘docker run’.

Without even using OpenShift we can therefore very quickly see if a Docker image might fail to start up where the user is being overridden by any hosting service. All we need to do is pick some random user ID which doesn’t have a corresponding user account inside of the Docker container.

Doing this for our IPython Docker image, we get the same failure as before:

$ docker run --rm -u 100000 -p 8888:8888 jupyter-notebook
/usr/local/lib/python3.4/dist-packages/IPython/paths.py:69: UserWarning: IPython parent '/' is not a writable location, using a temp directory.
" using a temp directory.".format(parent))
Traceback (most recent call last):
...
File "/usr/lib/python3.4/os.py", line 237, in makedirs
mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/.jupyter'

Randomly assigned user ID

So the fact that the IPython Docker image was set up to run as ‘root’ wasn’t the only problem. As shown, even when we changed the ‘Dockerfile’ to have the Jupyter Notebook application run as a non ‘root’ user, it would still fail to start up.

This was because even when configured to run as a non ‘root’ user, that was ignored and a random user ID was being allocated and used to run the Docker container.

The primary issue that arises from this is that there is not going to be a user account defined within the container corresponding to the user ID which is used. Further, because the user ID is not going to be known in advance, it isn’t possible to add a user account with that user ID into the image at the time it is built.

The flow-on consequence of this was that the ‘HOME’ environment variable defaulted back to being ‘/’. The application wants to be able to write files to the home directory though, and since that wasn’t the directory it expected, it failed.

In addition, if our Docker image had been constructed such that the intended ‘USER’ had special access to write to other parts of the file system, the random user ID wouldn’t be able to write to those either.

This can be seen when we override the command that is run when starting the Docker container to get access to an interactive shell and running some manual checks.

$ docker run --rm -it -u 100000 -p 8888:8888 jupyter-notebook bash
I have no name!@78bdfa8dba92:/notebooks$ whoami
whoami: cannot find name for user ID 100000
I have no name!@78bdfa8dba92:/notebooks$ id
uid=100000 gid=0(root)
I have no name!@78bdfa8dba92:/notebooks$ pwd
/notebooks
I have no name!@78bdfa8dba92:/notebooks$ env | grep HOME
HOME=/
I have no name!@78bdfa8dba92:/notebooks$ touch $HOME/magic
touch: cannot touch ‘//magic’: Permission denied
I have no name!@78bdfa8dba92:/notebooks$ touch /notebooks/magic
touch: cannot touch ‘/notebooks/magic’: Permission denied

We still therefore have some work to do before we can get this working.

In the next post I will start going into how to accommodate a Docker container running as a random user ID which you aren’t going to know in advance.

Friday, December 18, 2015

Don't run as root inside of Docker containers.

When we run any applications directly on our own computers we avoid doing so as the ‘root’ user unless absolutely necessary. Not running applications as the ‘root’ user is viewed as being a best security practice. Yet when we run applications inside of a Docker container, most people seem to have no qualms at all about running everything as the ‘root’ user.

Yes, it can be argued that the application is isolated within a container and shouldn’t be able to break out, but it is still running with the privileges of ‘root’ regardless. If for some reason, be it a bug in Docker or the operating system itself, a misconfigured Docker installation, or incorrect access rights granted to some resource when running a specific container, an application running as ‘root’ has an increased risk of being able to access things outside of the container, directly or indirectly, with elevated privileges.

Although there is a minor note in the Docker best practices guide about not running as ‘root’, it doesn’t get emphasised enough. This will no doubt change over time, with it coming to be seen as an important best practice to follow. It is therefore better to start designing your Docker images now so that they can be run as a non ‘root’ user.

Even now some hosting services based around Docker are restricting applications running inside of a Docker container from running as the ‘root’ user and forcing them to run as a non privileged user. This is the case with OpenShift 3, but as similar services around Docker seek to limit their exposure to the risk of running as the ‘root’ user, even though inside of a container, you can expect them to do the same.

It is recognised as a big enough issue that I believe changes are in train for the introduction of new features in Docker to reduce the risk, in the form of support for user namespaces. With user namespaces, although within a container an application will appear to run as the ‘root’ user, with the privileges of ‘root’ inside the container, when seen from outside of the container it will in reality be running as a non privileged user.

So when this becomes a part of Docker and is regarded as stable, hosting services using Docker will be able to make use of it. In the interim, and even with that coming down the pipeline, it is still preferable to simply not require your application to run as ‘root’ inside of a Docker container.

Root access via the file system

Although I am sure people will suggest this is a contrived example with limited application to only certain Docker installation types, I am going to show one example where running as ‘root’ in the Docker container can lead to being able to compromise the host that the Docker service is running on.

To do this I am going to rely on the use of volume mounting. What I am going to do may well not be something that any sane person would do, but it highlights the fact that when running as the ‘root’ user within a container, it is currently truly the ‘root’ user even when seen from outside of the container.

For this demonstration I am going to use Docker Toolbox on Mac OS X. The ‘docker version’ command indicates I have:

$ docker version
Client:
Version: 1.9.1
API version: 1.21
Go version: go1.4.3
Git commit: a34a1d5
Built: Fri Nov 20 17:56:04 UTC 2015
OS/Arch: darwin/amd64
Server:
Version: 1.9.1
API version: 1.21
Go version: go1.4.3
Git commit: a34a1d5
Built: Fri Nov 20 17:56:04 UTC 2015
OS/Arch: linux/amd64

Let’s start out by using the official ‘busybox’ image to fire up a shell inside of a container.

$ docker run --rm -it busybox sh

/ # ls -las
total 44
4 drwxr-xr-x 19 root root 4096 Dec 18 07:55 .
4 drwxr-xr-x 19 root root 4096 Dec 18 07:55 ..
0 -rwxr-xr-x 1 root root 0 Dec 18 07:55 .dockerenv
0 -rwxr-xr-x 1 root root 0 Dec 18 07:55 .dockerinit
12 drwxr-xr-x 2 root root 12288 Oct 31 17:14 bin
0 drwxr-xr-x 5 root root 380 Dec 18 07:55 dev
4 drwxr-xr-x 2 root root 4096 Dec 18 07:55 etc
4 drwxr-xr-x 3 root root 4096 Oct 31 17:15 home
0 dr-xr-xr-x 129 root root 0 Dec 18 07:55 proc
4 drwxr-xr-x 2 root root 4096 Dec 18 07:55 root
0 dr-xr-xr-x 13 root root 0 Dec 18 07:55 sys
4 drwxrwxrwt 2 root root 4096 Oct 31 17:15 tmp
4 drwxr-xr-x 3 root root 4096 Oct 31 17:15 usr
4 drwxr-xr-x 4 root root 4096 Oct 31 17:15 var

Now let’s do that again, but this time mount the file system of the Docker host inside of the container.

$ docker run --rm -it -v /:/rootfs busybox sh

/ # ls -las
total 44
4 drwxr-xr-x 20 root root 4096 Dec 18 07:56 .
4 drwxr-xr-x 20 root root 4096 Dec 18 07:56 ..
0 -rwxr-xr-x 1 root root 0 Dec 18 07:56 .dockerenv
0 -rwxr-xr-x 1 root root 0 Dec 18 07:56 .dockerinit
12 drwxr-xr-x 2 root root 12288 Oct 31 17:14 bin
0 drwxr-xr-x 5 root root 380 Dec 18 07:56 dev
4 drwxr-xr-x 2 root root 4096 Dec 18 07:56 etc
4 drwxr-xr-x 3 root root 4096 Oct 31 17:15 home
0 dr-xr-xr-x 129 root root 0 Dec 18 07:56 proc
4 drwxr-xr-x 2 root root 4096 Dec 18 07:56 root
0 drwxr-xr-x 17 1001 staff 420 Dec 18 00:27 rootfs
0 dr-xr-xr-x 13 root root 0 Dec 18 07:56 sys
4 drwxrwxrwt 2 root root 4096 Oct 31 17:15 tmp
4 drwxr-xr-x 3 root root 4096 Oct 31 17:15 usr
4 drwxr-xr-x 4 root root 4096 Oct 31 17:15 var

/ # ls -las /rootfs
total 8
0 drwxr-xr-x 17 1001 staff 420 Dec 18 00:27 .
4 drwxr-xr-x 20 root root 4096 Dec 18 07:56 ..
0 drwxr-xr-x 1 1000 staff 204 Nov 12 00:41 Users
0 drwxr-xr-x 2 root root 1500 Dec 18 00:27 bin
0 drwxrwxr-x 14 root staff 4400 Dec 18 00:27 dev
0 drwxr-xr-x 12 root root 1040 Dec 18 00:27 etc
0 drwxrwxr-x 3 root staff 60 Dec 18 00:27 home
4 -rwxr-xr-x 1 root root 496 Nov 20 19:34 init
0 drwxr-xr-x 5 root root 860 Dec 18 00:27 lib
0 lrwxrwxrwx 1 root root 3 Dec 18 00:27 lib64 -> lib
0 lrwxrwxrwx 1 root root 11 Dec 18 00:27 linuxrc -> bin/busybox
0 drwxr-xr-x 4 root root 80 Dec 18 00:27 mnt
0 drwxrwsr-x 2 root staff 180 Dec 18 00:27 opt
0 dr-xr-xr-x 129 root root 0 Dec 18 00:26 proc
0 drwxrwxr-x 2 root staff 80 Dec 18 00:27 root
0 drwxrwxr-x 4 root staff 80 Dec 18 00:27 run
0 drwxrwxr-x 2 root root 1400 Dec 18 00:27 sbin
0 dr-xr-xr-x 13 root root 0 Dec 18 00:27 sys
0 lrwxrwxrwx 1 root root 13 Dec 18 00:27 tmp -> /mnt/sda1/tmp
0 drwxr-xr-x 7 root root 140 Dec 18 00:27 usr
0 drwxrwxr-x 9 root staff 200 Dec 18 00:27 var

When the ‘-v’ option is used with ‘docker run’ it by default mounts the file system read/write.

Using ‘-v /:/rootfs’ I have therefore mounted the root file system of the Docker host, in this case running inside of the VM started by Docker Toolbox, inside of the container at the location ‘/rootfs’.
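
As an aside, if mounting a host directory into a container really is necessary, the ‘:ro’ suffix can at least be used to make the mount read only, which would block the sort of modifications demonstrated below:

$ docker run --rm -it -v /:/rootfs:ro busybox sh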

To show that it is indeed the file system of the Docker host we can ‘ssh’ into the Docker host and create a temporary file in the root directory of the filesystem.

$ docker-machine ssh default
                        ##         .
                  ## ## ##        ==
               ## ## ## ## ##    ===
           /"""""""""""""""""\___/ ===
      ~~~ {~~ ~~~~ ~~~ ~~~~ ~~~ ~ /  ===- ~~~
           \______ o          __/
             \    \        __/
              \____\_______/
 _                 _   ____     _            _
| |__   ___   ___ | |_|___ \ __| | ___   ___| | _____ _ __
| '_ \ / _ \ / _ \| __| __) / _` |/ _ \ / __| |/ / _ \ '__|
| |_) | (_) | (_) | |_ / __/ (_| | (_) | (__|   <  __/ |
|_.__/ \___/ \___/ \__|_____\__,_|\___/ \___|_|\_\___|_|
Boot2Docker version 1.9.1, build master : cef800b - Fri Nov 20 19:33:59 UTC 2015
Docker version 1.9.1, build a34a1d5

docker@default:~$ cd /

docker@default:/$ ls -las
total 4
0 drwxr-xr-x 17 tc staff 420 Dec 18 08:05 ./
0 drwxr-xr-x 17 tc staff 420 Dec 18 08:05 ../
0 drwxr-xr-x 1 docker staff 204 Nov 12 00:41 Users/
0 drwxr-xr-x 2 root root 1500 Dec 18 00:27 bin/
0 drwxrwxr-x 14 root staff 4400 Dec 18 00:27 dev/
0 drwxr-xr-x 12 root root 1040 Dec 18 00:27 etc/
0 drwxrwxr-x 3 root staff 60 Dec 18 00:27 home/
4 -rwxr-xr-x 1 root root 496 Nov 20 19:34 init
0 drwxr-xr-x 5 root root 860 Dec 18 00:27 lib/
0 lrwxrwxrwx 1 root root 3 Dec 18 00:27 lib64 -> lib/
0 lrwxrwxrwx 1 root root 11 Dec 18 00:27 linuxrc -> bin/busybox
0 drwxr-xr-x 4 root root 80 Dec 18 00:27 mnt/
0 drwxrwsr-x 2 root staff 180 Dec 18 00:27 opt/
0 dr-xr-xr-x 131 root root 0 Dec 18 00:26 proc/
0 drwxrwxr-x 2 root staff 80 Dec 18 00:27 root/
0 drwxrwxr-x 4 root staff 80 Dec 18 00:27 run/
0 drwxrwxr-x 2 root root 1400 Dec 18 00:27 sbin/
0 dr-xr-xr-x 13 root root 0 Dec 18 00:27 sys/
0 lrwxrwxrwx 1 root root 13 Dec 18 00:27 tmp -> /mnt/sda1/tmp/
0 drwxr-xr-x 7 root root 140 Dec 18 00:27 usr/
0 drwxrwxr-x 9 root staff 200 Dec 18 00:27 var/

docker@default:/$ sudo touch FROMDOCKERHOST

docker@default:/$ ls -las
total 4
0 drwxr-xr-x 17 tc staff 440 Dec 18 08:06 ./
0 drwxr-xr-x 17 tc staff 440 Dec 18 08:06 ../
0 -rw-r--r-- 1 root root 0 Dec 18 08:06 FROMDOCKERHOST
0 drwxr-xr-x 1 docker staff 204 Nov 12 00:41 Users/
0 drwxr-xr-x 2 root root 1500 Dec 18 00:27 bin/
0 drwxrwxr-x 14 root staff 4400 Dec 18 00:27 dev/
0 drwxr-xr-x 12 root root 1040 Dec 18 00:27 etc/
0 drwxrwxr-x 3 root staff 60 Dec 18 00:27 home/
4 -rwxr-xr-x 1 root root 496 Nov 20 19:34 init
0 drwxr-xr-x 5 root root 860 Dec 18 00:27 lib/
0 lrwxrwxrwx 1 root root 3 Dec 18 00:27 lib64 -> lib/
0 lrwxrwxrwx 1 root root 11 Dec 18 00:27 linuxrc -> bin/busybox
0 drwxr-xr-x 4 root root 80 Dec 18 00:27 mnt/
0 drwxrwsr-x 2 root staff 180 Dec 18 00:27 opt/
0 dr-xr-xr-x 131 root root 0 Dec 18 00:26 proc/
0 drwxrwxr-x 2 root staff 80 Dec 18 00:27 root/
0 drwxrwxr-x 4 root staff 80 Dec 18 00:27 run/
0 drwxrwxr-x 2 root root 1400 Dec 18 00:27 sbin/
0 dr-xr-xr-x 13 root root 0 Dec 18 00:27 sys/
0 lrwxrwxrwx 1 root root 13 Dec 18 00:27 tmp -> /mnt/sda1/tmp/
0 drwxr-xr-x 7 root root 140 Dec 18 00:27 usr/
0 drwxrwxr-x 9 root staff 200 Dec 18 00:27 var/

Switching back to our shell running inside of the container and listing what is in the mounted ‘/rootfs’ directory we can see the file which was created.

/ # ls -las /rootfs
total 8
0 drwxr-xr-x 17 1001 staff 440 Dec 18 08:06 .
4 drwxr-xr-x 20 root root 4096 Dec 18 07:56 ..
0 -rw-r--r-- 1 root root 0 Dec 18 08:06 FROMDOCKERHOST
0 drwxr-xr-x 1 1000 staff 204 Nov 12 00:41 Users
0 drwxr-xr-x 2 root root 1500 Dec 18 00:27 bin
0 drwxrwxr-x 14 root staff 4400 Dec 18 00:27 dev
0 drwxr-xr-x 12 root root 1040 Dec 18 00:27 etc
0 drwxrwxr-x 3 root staff 60 Dec 18 00:27 home
4 -rwxr-xr-x 1 root root 496 Nov 20 19:34 init
0 drwxr-xr-x 5 root root 860 Dec 18 00:27 lib
0 lrwxrwxrwx 1 root root 3 Dec 18 00:27 lib64 -> lib
0 lrwxrwxrwx 1 root root 11 Dec 18 00:27 linuxrc -> bin/busybox
0 drwxr-xr-x 4 root root 80 Dec 18 00:27 mnt
0 drwxrwsr-x 2 root staff 180 Dec 18 00:27 opt
0 dr-xr-xr-x 131 root root 0 Dec 18 00:26 proc
0 drwxrwxr-x 2 root staff 80 Dec 18 00:27 root
0 drwxrwxr-x 4 root staff 80 Dec 18 00:27 run
0 drwxrwxr-x 2 root root 1400 Dec 18 00:27 sbin
0 dr-xr-xr-x 13 root root 0 Dec 18 00:27 sys
0 lrwxrwxrwx 1 root root 13 Dec 18 00:27 tmp -> /mnt/sda1/tmp
0 drwxr-xr-x 7 root root 140 Dec 18 00:27 usr
0 drwxrwxr-x 9 root staff 200 Dec 18 00:27 var

Now let’s try updating the same file system from within the container.

/ # touch /rootfs/FROMCONTAINER

/ # ls -las /rootfs
total 8
0 drwxr-xr-x 17 1001 staff 460 Dec 18 08:13 .
4 drwxr-xr-x 20 root root 4096 Dec 18 08:13 ..
0 -rw-r--r-- 1 root root 0 Dec 18 08:13 FROMCONTAINER
0 -rw-r--r-- 1 root root 0 Dec 18 08:06 FROMDOCKERHOST
0 drwxr-xr-x 1 1000 staff 204 Nov 12 00:41 Users
0 drwxr-xr-x 2 root root 1500 Dec 18 00:27 bin
0 drwxrwxr-x 14 root staff 4400 Dec 18 00:27 dev
0 drwxr-xr-x 12 root root 1040 Dec 18 00:27 etc
0 drwxrwxr-x 3 root staff 60 Dec 18 00:27 home
4 -rwxr-xr-x 1 root root 496 Nov 20 19:34 init
0 drwxr-xr-x 5 root root 860 Dec 18 00:27 lib
0 lrwxrwxrwx 1 root root 3 Dec 18 00:27 lib64 -> lib
0 lrwxrwxrwx 1 root root 11 Dec 18 00:27 linuxrc -> bin/busybox
0 drwxr-xr-x 4 root root 80 Dec 18 00:27 mnt
0 drwxrwsr-x 2 root staff 180 Dec 18 00:27 opt
0 dr-xr-xr-x 131 root root 0 Dec 18 00:26 proc
0 drwxrwxr-x 2 root staff 80 Dec 18 00:27 root
0 drwxrwxr-x 4 root staff 80 Dec 18 00:27 run
0 drwxrwxr-x 2 root root 1400 Dec 18 00:27 sbin
0 dr-xr-xr-x 13 root root 0 Dec 18 00:27 sys
0 lrwxrwxrwx 1 root root 13 Dec 18 00:27 tmp -> /mnt/sda1/tmp
0 drwxr-xr-x 7 root root 140 Dec 18 00:27 usr
0 drwxrwxr-x 9 root staff 200 Dec 18 00:27 var

And finally checking that we can see the change from the Docker host.

docker@default:/$ ls -las
total 4
0 drwxr-xr-x 17 tc staff 460 Dec 18 08:13 ./
0 drwxr-xr-x 17 tc staff 460 Dec 18 08:13 ../
0 -rw-r--r-- 1 root root 0 Dec 18 08:13 FROMCONTAINER
0 -rw-r--r-- 1 root root 0 Dec 18 08:06 FROMDOCKERHOST
0 drwxr-xr-x 1 docker staff 204 Nov 12 00:41 Users/
0 drwxr-xr-x 2 root root 1500 Dec 18 00:27 bin/
0 drwxrwxr-x 14 root staff 4400 Dec 18 00:27 dev/
0 drwxr-xr-x 12 root root 1040 Dec 18 00:27 etc/
0 drwxrwxr-x 3 root staff 60 Dec 18 00:27 home/
4 -rwxr-xr-x 1 root root 496 Nov 20 19:34 init
0 drwxr-xr-x 5 root root 860 Dec 18 00:27 lib/
0 lrwxrwxrwx 1 root root 3 Dec 18 00:27 lib64 -> lib/
0 lrwxrwxrwx 1 root root 11 Dec 18 00:27 linuxrc -> bin/busybox
0 drwxr-xr-x 4 root root 80 Dec 18 00:27 mnt/
0 drwxrwsr-x 2 root staff 180 Dec 18 00:27 opt/
0 dr-xr-xr-x 131 root root 0 Dec 18 00:26 proc/
0 drwxrwxr-x 2 root staff 80 Dec 18 00:27 root/
0 drwxrwxr-x 4 root staff 80 Dec 18 00:27 run/
0 drwxrwxr-x 2 root root 1400 Dec 18 00:27 sbin/
0 dr-xr-xr-x 13 root root 0 Dec 18 00:27 sys/
0 lrwxrwxrwx 1 root root 13 Dec 18 00:27 tmp -> /mnt/sda1/tmp/
0 drwxr-xr-x 7 root root 140 Dec 18 00:27 usr/
0 drwxrwxr-x 9 root staff 200 Dec 18 00:27 var/

Take note here that the file we were able to create on the Docker host file system from within the Docker container is owned by ‘root’. The ownership of the file was thus preserved when it was created.

Let’s try something a bit more complicated. From inside of the Docker container, let’s make a copy of the program ‘/usr/bin/whoami’ and modify the permissions of the copy so that it runs as a ‘setuid’ executable.

/ # cp /rootfs/usr/bin/whoami /rootfs/whoami

/ # chmod 4711 /rootfs/whoami

/ # ls -las /rootfs
total 536
0 drwxr-xr-x 17 1001 staff 480 Dec 18 08:17 .
4 drwxr-xr-x 20 root root 4096 Dec 18 08:13 ..
0 -rw-r--r-- 1 root root 0 Dec 18 08:13 FROMCONTAINER
0 -rw-r--r-- 1 root root 0 Dec 18 08:06 FROMDOCKERHOST
0 drwxr-xr-x 1 1000 staff 204 Nov 12 00:41 Users
0 drwxr-xr-x 2 root root 1500 Dec 18 00:27 bin
0 drwxrwxr-x 14 root staff 4400 Dec 18 00:27 dev
0 drwxr-xr-x 12 root root 1040 Dec 18 00:27 etc
0 drwxrwxr-x 3 root staff 60 Dec 18 00:27 home
4 -rwxr-xr-x 1 root root 496 Nov 20 19:34 init
0 drwxr-xr-x 5 root root 860 Dec 18 00:27 lib
0 lrwxrwxrwx 1 root root 3 Dec 18 00:27 lib64 -> lib
0 lrwxrwxrwx 1 root root 11 Dec 18 00:27 linuxrc -> bin/busybox
0 drwxr-xr-x 4 root root 80 Dec 18 00:27 mnt
0 drwxrwsr-x 2 root staff 180 Dec 18 00:27 opt
0 dr-xr-xr-x 131 root root 0 Dec 18 00:26 proc
0 drwxrwxr-x 2 root staff 80 Dec 18 00:27 root
0 drwxrwxr-x 4 root staff 80 Dec 18 00:27 run
0 drwxrwxr-x 2 root root 1400 Dec 18 00:27 sbin
0 dr-xr-xr-x 13 root root 0 Dec 18 00:27 sys
0 lrwxrwxrwx 1 root root 13 Dec 18 00:27 tmp -> /mnt/sda1/tmp
0 drwxr-xr-x 7 root root 140 Dec 18 00:27 usr
0 drwxrwxr-x 9 root staff 200 Dec 18 00:27 var
528 -rws--x--x 1 root root 539203 Dec 18 08:17 whoami

Note how the permissions for the user are now ‘rws’ on the copy of the ‘whoami’ executable we made.

This ‘s’ indicates that it is now a ‘setuid’ executable. This means that when we run the program, it should run with the privileges of the user that owns the program, rather than the privileges of the user running it.

But will that actually be what occurs if we now run the copy of the program from the Docker host, given that we made the copy and set the permissions from inside of the Docker container?

docker@default:/$ ls -las
total 532
0 drwxr-xr-x 17 tc staff 480 Dec 18 08:17 ./
0 drwxr-xr-x 17 tc staff 480 Dec 18 08:17 ../
0 -rw-r--r-- 1 root root 0 Dec 18 08:13 FROMCONTAINER
0 -rw-r--r-- 1 root root 0 Dec 18 08:06 FROMDOCKERHOST
0 drwxr-xr-x 1 docker staff 204 Nov 12 00:41 Users/
0 drwxr-xr-x 2 root root 1500 Dec 18 00:27 bin/
0 drwxrwxr-x 14 root staff 4400 Dec 18 00:27 dev/
0 drwxr-xr-x 12 root root 1040 Dec 18 00:27 etc/
0 drwxrwxr-x 3 root staff 60 Dec 18 00:27 home/
4 -rwxr-xr-x 1 root root 496 Nov 20 19:34 init
0 drwxr-xr-x 5 root root 860 Dec 18 00:27 lib/
0 lrwxrwxrwx 1 root root 3 Dec 18 00:27 lib64 -> lib/
0 lrwxrwxrwx 1 root root 11 Dec 18 00:27 linuxrc -> bin/busybox
0 drwxr-xr-x 4 root root 80 Dec 18 00:27 mnt/
0 drwxrwsr-x 2 root staff 180 Dec 18 00:27 opt/
0 dr-xr-xr-x 131 root root 0 Dec 18 00:26 proc/
0 drwxrwxr-x 2 root staff 80 Dec 18 00:27 root/
0 drwxrwxr-x 4 root staff 80 Dec 18 00:27 run/
0 drwxrwxr-x 2 root root 1400 Dec 18 00:27 sbin/
0 dr-xr-xr-x 13 root root 0 Dec 18 00:27 sys/
0 lrwxrwxrwx 1 root root 13 Dec 18 00:27 tmp -> /mnt/sda1/tmp/
0 drwxr-xr-x 7 root root 140 Dec 18 00:27 usr/
0 drwxrwxr-x 9 root staff 200 Dec 18 00:27 var/
528 -rws--x--x 1 root root 539203 Dec 18 08:17 whoami

docker@default:/$ /usr/bin/whoami
docker

docker@default:/$ /whoami
root

And yes it does.

So what we have been able to do is modify the permissions of a program on the Docker host through the volume mount.

Sure, no one should ever really be mounting the root file system of the Docker host read/write inside of a container, so this really comes down to being a configuration issue. But if that did inadvertently happen for some reason, it means that arbitrary changes could be made to the Docker host file system.

I could have gone and replaced arbitrary executables, modified system startup scripts, or taken a copy of the ‘sh’ program and turned it into a ‘setuid’ executable so that a non ‘root’ user on the Docker host could become ‘root’, as sketched below.
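
That last trick would amount to little more than repeating the ‘whoami’ steps, sketched here with illustrative names; whether a given shell honours the ‘setuid’ bit when run does vary:

/ # cp /rootfs/bin/sh /rootfs/sh-setuid

/ # chmod 4755 /rootfs/sh-setuid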

As noted, I am using Docker Toolbox, and it may well be the case that it isn’t locked down in as secure a manner as your typical Docker service, but it was still possible with its configuration at least. The fact that you can ‘ssh’ into the Docker host created by Docker Toolbox and then ‘sudo’ to ‘root’ makes it somewhat of a moot point here anyway.

Anyway, the point I am trying to illustrate is that when running as ‘root’ inside of the container you truly are ‘root’, so if you can escape the container in some way you could well be able to get elevated privileges.

So why risk it? Run your containers as a non ‘root’ user in the first place and you remove one element of risk: that anyone able to access the container could somehow break out of it with ‘root’ privileges.

Running containers as a non ‘root’ user

If you are the developer of a Docker image, the first thing you can therefore do is make sure that you at least use the ‘USER’ statement to indicate what non ‘root’ user an application should run as when the container is started.

In doing this you will have to be very mindful of how you set up file system permissions, so that your application can write to any file system directory it still needs write access to.
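
A common pattern, sketched here with an illustrative directory, is to create any directories needing write access up front, granting access both to the intended user and to the ‘root’ group:

RUN mkdir -p /data && chown 1001:0 /data && chmod ug+rwx /data

Granting the group as well helps keep things working if the user ID is later overridden, as the overridden user will typically still run with ‘root’ as its group.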

Even then it becomes tricky, as when a container is run, the user specified by the ‘USER’ statement can still be overridden using the ‘-u’ option to ‘docker run’. If this is done, even though you may have fixed up the file system permissions to match the user specified by the ‘USER’ statement, your application could still fail.

It was the overriding of the user that the container ran as that was in part the issue I described in my last blog post about trying to run the IPython Docker image called ‘jupyter/notebook’ under OpenShift. In order to prohibit applications from running as ‘root’ in a Docker container, OpenShift uses the ‘-u’ option to ‘docker run’ to override the user to be a non privileged user.

In my next blog post I will delve more into the ‘-u’ option of ‘docker run’, what it does and the complications it causes. We will also return back to our IPython example in illustrating those issues.