Closing files without handles (POSIX)

I recently found myself having to close a sizeable number of files after losing the corresponding file objects (those you get with open(), for instance). This post tells the short tale of how I became slightly more enlightened on POSIX-compliant file access in Python… and how I managed to close the files, of course.

The problem

If you’re just interested in

How to close open files for which you don’t have the file handles (or don’t want to look for them) anymore

feel free to skip straight to the solution. Otherwise, keep reading this section and the next, and you may also learn something interesting about file descriptors.

As a rule of thumb, you shouldn’t work with files without wrapping your code inside a with statement, or using some other syntactic/design construct that ensures all your open files are closed at specified points. While I agree this is usually very good advice, there may be times when you need a dirty fix for a pedantic OS that won’t let you open additional files (OSError: Too many open files).
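When the number of files is only known at run time, a single with statement isn’t enough, and contextlib.ExitStack from the standard library is one construct that fills the gap. A minimal sketch, where the helper name read_first_bytes and the file names are made up for illustration:

```python
import contextlib
import os
import tempfile


def read_first_bytes(paths, count=4):
    """Open every path, read `count` bytes from each, and close them all."""
    with contextlib.ExitStack() as stack:
        # enter_context() registers each file so that it is closed when the
        # with block exits, even if a later open() or read() raises.
        handles = [stack.enter_context(open(path, 'rb')) for path in paths]
        chunks = [fh.read(count) for fh in handles]
    return chunks, handles


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for index in range(3):
        path = os.path.join(tmp, f'part{index}')
        with open(path, 'wb') as fh:
            fh.write(bytes([index]) * 8)
        paths.append(path)

    chunks, handles = read_first_bytes(paths)

print(chunks)
print(all(fh.closed for fh in handles))
```

The handles are returned here only so their closed state can be inspected afterwards; real code would normally return just the data.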

For instance, last week I was working on a handful of classes that provided access to binary files of a certain format. These files were fairly large and somewhat messy, so I parsed their contents into smaller, better-organized files for faster and easier access. Due to the complicated class hierarchy, with methods relying on functools caching and memory maps, some files remained open longer than they should have. I spent some time trying to fix these leaks, but my test suite didn’t have 100% coverage, so I couldn’t track down every leaked open file.
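To give an idea of how caching can keep files open, here is a hypothetical sketch (not the original classes; open_cached and the file name are invented for illustration). functools.lru_cache keeps a reference to every value it returns, so a cached file object stays open even after the caller forgets about it:

```python
import functools
import os
import tempfile


@functools.lru_cache(maxsize=None)
def open_cached(path):
    # The cache keeps a reference to the returned file object, so the file
    # stays open for as long as the cache entry exists.
    return open(path, 'rb')


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'blob')
    with open(path, 'wb') as fh:
        fh.write(b'\x00' * 16)

    handle = open_cached(path)
    fd = handle.fileno()
    del handle  # drop our only reference to the file object...

    # ...yet the descriptor is still valid, because the cache holds the object
    still_open = os.fstat(fd).st_size == 16

    # A cache hit returns the very same object, which lets us close it properly
    open_cached(path).close()
    open_cached.cache_clear()

print(still_open)
```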

Investigation

The file parameter of the open() function accepts absolute or relative paths as str or bytes objects (or path-like objects, to be more precise) as well as integer file descriptors. A file descriptor is an integer index into a per-process table describing the files the process has open; each entry carries information such as the open mode, current offset, buffer address, etc.
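To make the "integer" part concrete, the low-level os.open() call returns the descriptor itself, and open() will happily wrap it. A small sketch, with a throwaway file name chosen just for the example:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.txt')
    with open(path, 'w') as fh:
        fh.write('hello')

    # os.open() is the low-level call: it returns the descriptor itself
    fd = os.open(path, os.O_RDONLY)
    fd_is_int = isinstance(fd, int)

    # open() accepts that integer in place of a path and wraps it
    with open(fd, 'r') as fh:
        same_fd = fh.fileno() == fd
        text = fh.read()
    # Leaving the with block closed the wrapper and, with it, the descriptor

print(fd_is_int, same_fd, text)
```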

The open() function may return one of several object types from the built-in io module, depending on the mode and buffering requested, all of them conveniently implementing a (somewhat) common interface for text/binary file I/O. Internally, these objects rely on information provided by the file descriptor table to be able to perform their operations.
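A quick way to see which io class you get for a given mode is to ask for the type of the returned object. The sketch below uses a scratch file and a handful of common mode/buffering combinations:

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample')
    with open(path, 'wb') as fh:
        fh.write(b'abc')

    cases = [
        ('r', -1),    # text mode
        ('rb', -1),   # buffered binary, read-only
        ('ab', -1),   # buffered binary, write-only
        ('r+b', -1),  # buffered binary, read/write
        ('rb', 0),    # unbuffered binary
    ]
    returned = {}
    for mode, buffering in cases:
        with open(path, mode, buffering=buffering) as fh:
            returned[(mode, buffering)] = type(fh)

for case, cls in returned.items():
    print(case, cls.__name__)
```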

Setting up a test scenario

Let’s first create a temporary directory for housing some test files:

[2]:
import tempfile

directory = tempfile.TemporaryDirectory(prefix='notebook_')
directory.name
[2]:
'/tmp/notebook_wlq1g8_j'

Let’s create some test files in that directory (directory.name) and write some dummy content in them:

[3]:
import os
import struct

with open(os.path.join(directory.name, 'textfile'), 'w') as fh:
    fh.write('Someone told me long ago: there is a calm before the storm')

with open(os.path.join(directory.name, 'binaryfile'), 'wb') as fh:
    fh.write(struct.pack('6B', 4, 8, 15, 16, 23, 42))

open(os.path.join(directory.name, 'data1'), 'wb')
open(os.path.join(directory.name, 'data2'), 'wb')
[3]:
<_io.BufferedWriter name='/tmp/notebook_wlq1g8_j/data2'>

Inspecting the file descriptor table

We may peek at the file descriptor table with the help of the psutil module (easily installable with pip install psutil) as follows:

[3]:
import psutil

proc = psutil.Process()  # This object concerns the current process

proc.open_files()

As expected, the files opened before are not listed: since they were manipulated inside with statements, Python’s context management closed them for us. The only files psutil reports at this point belong to my running Jupyter notebook process. I’m going to introduce this utility function to obtain a filtered list containing only the files that matter for our demonstration:

[4]:
def relevant_fds():
    return [
        fh for fh in proc.open_files()
        if directory.name in fh.path
    ]
[5]:
relevant_fds()
[5]:
[]
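If psutil isn’t available, a rough equivalent can be put together with the standard library alone. This sketch assumes a Linux-style /proc filesystem (it is not portable to every POSIX system), and the function name open_paths_by_fd is made up for illustration:

```python
import os
import tempfile


def open_paths_by_fd():
    """Map each open descriptor of this process to the path it points at.

    Assumes a Linux-style /proc filesystem; returns {} elsewhere.
    """
    fd_dir = '/proc/self/fd'
    if not os.path.isdir(fd_dir):
        return {}
    table = {}
    for name in os.listdir(fd_dir):
        try:
            table[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor used for the listing itself is already gone
            continue
    return table


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.realpath(os.path.join(tmp, 'probe'))
    fd = os.open(path, os.O_CREAT | os.O_WRONLY)
    before = open_paths_by_fd()
    os.close(fd)
    after = open_paths_by_fd()

print(before.get(fd), after.get(fd))
```

Unlike psutil’s open_files(), this listing also includes sockets, pipes and terminals, so some filtering on the path is usually needed.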

Changing files and descriptors

Let’s add some fun, by opening a couple of file objects:

[4]:
os.chdir(directory.name)

txt_a = open('textfile', 'r')
bin_a = open('binaryfile', 'rb')

bin_b = open('binaryfile', 'r+b')

The file descriptor table now looks like this:

[7]:
relevant_fds()
[7]:
[popenfile(path='/tmp/notebook_48ejav_y/textfile', fd=55, position=0, mode='r', flags=557056),
 popenfile(path='/tmp/notebook_48ejav_y/binaryfile', fd=58, position=0, mode='r', flags=557056),
 popenfile(path='/tmp/notebook_48ejav_y/binaryfile', fd=59, position=0, mode='r+', flags=557058)]

Notice there are two descriptors for binaryfile, but they are opened in different modes ('r' and 'r+'). If we use bin_b to make some amendments

[5]:
bin_b.write(struct.pack('2b', 55, 55))
bin_b.flush()

then read the contents with bin_a

[6]:
struct.unpack('6b', bin_a.read())
[6]:
(55, 55, 15, 16, 23, 42)

we can see the change in binaryfile’s contents. Furthermore, the descriptor table now reports different offsets:

[10]:
relevant_fds()
[10]:
[popenfile(path='/tmp/notebook_48ejav_y/textfile', fd=55, position=0, mode='r', flags=557056),
 popenfile(path='/tmp/notebook_48ejav_y/binaryfile', fd=58, position=6, mode='r', flags=557056),
 popenfile(path='/tmp/notebook_48ejav_y/binaryfile', fd=59, position=2, mode='r+', flags=557058)]

Let’s close bin_a:

[7]:
bin_a.close()

The table now lists only two descriptors, corresponding to objects txt_a and bin_b:

[12]:
relevant_fds()
[12]:
[popenfile(path='/tmp/notebook_48ejav_y/textfile', fd=55, position=0, mode='r', flags=557056),
 popenfile(path='/tmp/notebook_48ejav_y/binaryfile', fd=59, position=2, mode='r+', flags=557058)]

Since bin_b is based on a file descriptor independent from that of bin_a, we may continue to perform operations on binaryfile through bin_b, like writing some contents at the end of the file

[8]:
bin_b.seek(0, os.SEEK_END)
bin_b.write(struct.pack('2b', 67, 85))
[8]:
2

or rewinding and reading everything from it:

[9]:
bin_b.seek(0, os.SEEK_SET)
content = bin_b.read()
struct.unpack('8b', content)
[9]:
(55, 55, 15, 16, 23, 42, 67, 85)
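The same independence can be observed one level down, without the io wrappers and their buffers in the way. The following sketch uses os.open(), os.read() and os.lseek() on a scratch copy of binaryfile; each os.open() call yields a descriptor with its own offset:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'binaryfile')
    with open(path, 'wb') as fh:
        fh.write(bytes([4, 8, 15, 16, 23, 42]))

    # Two separate os.open() calls on one path give two descriptors
    fd_a = os.open(path, os.O_RDONLY)
    fd_b = os.open(path, os.O_RDWR)

    os.read(fd_a, 4)  # advances only fd_a's offset
    offsets = (
        os.lseek(fd_a, 0, os.SEEK_CUR),
        os.lseek(fd_b, 0, os.SEEK_CUR),
    )

    os.close(fd_a)    # fd_b is unaffected by closing fd_a
    os.write(fd_b, bytes([55, 55]))
    os.lseek(fd_b, 0, os.SEEK_SET)
    content = list(os.read(fd_b, 16))
    os.close(fd_b)

print(offsets, content)
```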

Opening a file from its descriptor

Now let’s try something different: we’ll reopen textfile from its existing descriptor, but with a different access mode (read/write). We may read the existing file descriptor directly from txt_a using the .fileno() method:

[10]:
txt_b = open(txt_a.fileno(), 'r+')

Indeed, we see these objects correspond to the same file descriptor

[11]:
txt_a.fileno() == txt_b.fileno()
[11]:
True

Although they still have different opening modes:

[12]:
txt_a.mode, txt_b.mode
[12]:
('r', 'r+')

Interestingly, the psutil.Process.open_files() method will still return the original opening mode for this descriptor:

[18]:
relevant_fds()
[18]:
[popenfile(path='/tmp/notebook_48ejav_y/textfile', fd=55, position=0, mode='r', flags=557056),
 popenfile(path='/tmp/notebook_48ejav_y/binaryfile', fd=59, position=8, mode='r+', flags=557058)]
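One way to double-check what the OS thinks, independently of psutil, is to ask for the descriptor’s status flags with fcntl.fcntl() and F_GETFL. This is only a sketch on a scratch file; the ACCESS_MODE_MASK name is mine, built from the three access-mode constants, and closefd=False is used so the extra wrapper can be closed without touching the descriptor:

```python
import fcntl
import os
import tempfile

# Mask covering the access-mode bits of the status flags
ACCESS_MODE_MASK = os.O_RDONLY | os.O_WRONLY | os.O_RDWR

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'textfile')
    with open(path, 'w') as fh:
        fh.write('calm')

    original = open(path, 'r')
    # closefd=False so that closing the wrapper leaves the descriptor alone
    wrapper = open(original.fileno(), 'r+', closefd=False)

    flags = fcntl.fcntl(wrapper.fileno(), fcntl.F_GETFL)
    wrapper_mode = wrapper.mode
    descriptor_is_read_only = (flags & ACCESS_MODE_MASK) == os.O_RDONLY

    wrapper.close()
    original.close()

print(wrapper_mode, descriptor_is_read_only)
```

The wrapper reports the mode it was asked for, while the flags stored with the descriptor still say read-only.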

Now, what will happen if we try to write to txt_b?

[13]:
txt_b.write('I know, it has been calm here for some time')
[13]:
43

Apparently, it worked. But when we try to flush the contents (after all, the file object buffers writes), we get:

[14]:
txt_b.flush()
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-14-b99907a91092> in <module>()
----> 1 txt_b.flush()

OSError: [Errno 9] Bad file descriptor
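The error comes from the OS rather than from Python: writing through a descriptor that was opened read-only fails with EBADF. A minimal sketch that reproduces it with os.open() and os.write() on a scratch file, bypassing the io wrappers entirely:

```python
import errno
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'textfile')
    with open(path, 'w') as fh:
        fh.write('calm')

    fd = os.open(path, os.O_RDONLY)
    try:
        os.write(fd, b'storm')
        error_code = None
    except OSError as exc:
        # The descriptor is valid, just not open for writing
        error_code = exc.errno
    finally:
        os.close(fd)

    with open(path) as fh:
        content = fh.read()

print(errno.errorcode.get(error_code), content)
```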

So the takeaway is: when we reopen a file from its descriptor with open(), the descriptor’s original access mode prevails, even though the wrapping object thinks otherwise. Read operations still work on txt_a:

[15]:
txt_a.seek(0, os.SEEK_SET)
txt_a.read()
[15]:
'Someone told me long ago: there is a calm before the storm'

Although txt_b has been permanently spoiled:

[16]:
txt_b.tell()
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-16-76b559696301> in <module>()
----> 1 txt_b.tell()

OSError: [Errno 9] Bad file descriptor

Things work as expected if we preserve the access mode, however:

[17]:
txt_c = open(txt_a.fileno(), txt_a.mode)
txt_c.read()
[17]:
''

Wait, where’s the verse? Since txt_c was created from the same file descriptor as txt_a (the index returned by txt_a.fileno()), it is just another io.TextIOWrapper object that wraps the same file access information, including the offset! See for yourself:

[18]:
txt_a.tell(), txt_c.tell()
[18]:
(58, 58)

If we rewind txt_c to the beginning of the file:

[19]:
txt_c.seek(0, os.SEEK_SET)
[19]:
0

we’ll see that txt_a reflects this change as well:

[20]:
txt_a.tell()
[20]:
0

This leads us to our final solution: in order to close an open file (a limited OS resource), the most important information is not the file handle (a high-level wrapping object), but the file descriptor associated with it. If we close txt_c:

[21]:
txt_c.close()

No errors are raised, and now our file descriptor table looks like:

[ ]:
relevant_fds()

And, indeed, operations on txt_a do not work any longer:

[22]:
txt_a.tell()
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-22-b84e2f8ddf8b> in <module>()
----> 1 txt_a.tell()

OSError: [Errno 9] Bad file descriptor
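If closing the underlying descriptor is not what you want, open() has a closefd parameter for exactly this case: with closefd=False the new object only borrows the descriptor. A small sketch on a scratch file, with variable names picked for the example:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'textfile')
    with open(path, 'w') as fh:
        fh.write('calm before the storm')

    owner = open(path, 'r')

    # Default (closefd=True): closing the wrapper closes the descriptor too.
    # With closefd=False the wrapper only borrows the descriptor.
    borrower = open(owner.fileno(), 'r', closefd=False)
    borrower.close()

    text = owner.read()  # still works: the descriptor was left open
    owner.close()

print(text)
```

For the purposes of this post the default behaviour is the useful one, since closing the descriptor is the whole point.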

Solution

Get a list of all open file descriptors in the current process using the pip-installable psutil module:

[ ]:
psutil.Process().open_files()

Perform some filtering based on the paths or whatever other criteria, if you must (here I’m filtering out the files outside the temporary directory I created for this exercise):

[ ]:
closing_files = []

for fd in psutil.Process().open_files():
    if directory.name in fd.path:
        closing_files.append(fd)

Then close the files by creating a helper file handle with the open() standard library function and calling the close() method on them:

[ ]:
for fd in closing_files:
    open(fd.fd).close()

This will close all the selected files in the current process. It may not free the memory if you still hold file handle objects associated with these descriptors, but it will allow your process to open additional files. Also, be aware that if you keep any references to these now-broken file handles, you won’t be able to perform operations on them; any attempt will raise:

OSError: [Errno 9] Bad file descriptor
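As an alternative sketch, the descriptors can also be closed directly with os.close(), without creating a helper file object first. The close_descriptors name below is made up for illustration, and the leaked descriptors are simulated with os.open() on scratch files:

```python
import errno
import os
import tempfile


def close_descriptors(fds):
    """Close each integer descriptor, skipping the ones already closed."""
    closed = []
    for fd in fds:
        try:
            os.close(fd)
        except OSError as exc:
            if exc.errno != errno.EBADF:
                raise
        else:
            closed.append(fd)
    return closed


with tempfile.TemporaryDirectory() as tmp:
    leaked = [
        os.open(os.path.join(tmp, f'data{index}'), os.O_CREAT | os.O_WRONLY)
        for index in range(3)
    ]
    first_pass = close_descriptors(leaked)
    second_pass = close_descriptors(leaked)  # nothing left to close

print(first_pass == leaked, second_pass)
```

With the psutil listing from above, the call would look like close_descriptors(f.fd for f in closing_files).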
