# Hackweek #15 results

I grabbed this project hoping to pull in a few bug fixes from GitHub and then explore why djmount wouldn't read files from my [Raumfeld](https://www.teufel.de/wlan.html) media server correctly. To do this, I needed to enable debug messages in djmount. Unfortunately, as soon as I did, I started to encounter segmentation faults and ABORTs in djmount after a few minutes of runtime. The error stacks varied, but they were always related to memory management, usually malloc reporting corruption of its internal data structures.

I tested various memory debuggers (Valgrind, ElectricFence, [Address Sanitizer](https://en.wikipedia.org/wiki/AddressSanitizer)), but running under them seemed to prevent the problems from occurring - the program would now run stably for many hours. The only tool that seemed to find problems was glibc's *mtrace*, but these turned out to be false positives, as mtrace can't deal with multi-threaded programs. Valgrind, I learned on a SUSE mailing list, is not well suited to debugging multithreading-related corruption.

**djmount** uses the [talloc](https://talloc.samba.org/talloc/doc/html/index.html) library for its own memory management. At the same time, it uses threading heavily; it employs the MT main loop of FUSE, and uses other threads for both **libupnp** and internal purposes. The [talloc documentation](https://talloc.samba.org/talloc/doc/html/libtalloc__threads.html) mentions that talloc, by itself, is not MT-safe, and provides examples of how to deal with that. This is a general problem of the **djmount** code - nothing in it protects against races in the talloc library. It took me quite a while to realize that - initially I couldn't believe that the author Rémi Turboult, whose code otherwise meets high standards AFAICT, might have overlooked such a basic problem. This is not only a matter of protecting the main program's data structures - talloc's internal data can easily be corrupted if the same context is accessed by several threads in parallel.

Before diving into the djmount code directly, I wrote a test program to find out what kinds of problems occur with threaded talloc calls, and how best to avoid them. This test program ended up at more than 1000 lines of code. Eventually I could demonstrate that improper use of threading could cause the same phenomena that I'd observed with the djmount code, and that this could be avoided with proper locking around talloc calls. In short, what's necessary is to protect all talloc calls on shared contexts (=data structures) using mutexes. Thread-local talloc contexts can be used as usual, *if the application writer can prove that their talloc data structures will never be manipulated in another thread*.
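The locking rule can be sketched in a few lines. This is not my test program, and it deliberately does not use the talloc API: `struct shared_ctx`, `shared_ctx_alloc()` and the other names are hypothetical stand-ins for a parent context that keeps a linked list of children, which is the kind of internal bookkeeping that gets corrupted when several threads touch the same context without a lock. It assumes POSIX threads and would be built with `-pthread`.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a shared talloc context: a parent that owns
 * a linked list of children. This is NOT the talloc API. */
struct child {
    struct child *next;
};

struct shared_ctx {
    pthread_mutex_t lock;    /* guards 'children' */
    struct child *children;
};

static void shared_ctx_init(struct shared_ctx *ctx)
{
    pthread_mutex_init(&ctx->lock, NULL);
    ctx->children = NULL;
}

/* Allocate a child and link it into the parent while holding the lock. */
static struct child *shared_ctx_alloc(struct shared_ctx *ctx)
{
    struct child *c = malloc(sizeof(*c));

    if (c == NULL)
        return NULL;
    pthread_mutex_lock(&ctx->lock);
    c->next = ctx->children;
    ctx->children = c;
    pthread_mutex_unlock(&ctx->lock);
    return c;
}

/* Free every child of the parent; returns how many were freed. */
static size_t shared_ctx_free_all(struct shared_ctx *ctx)
{
    size_t freed = 0;

    pthread_mutex_lock(&ctx->lock);
    while (ctx->children != NULL) {
        struct child *c = ctx->children;

        ctx->children = c->next;
        free(c);
        freed++;
    }
    pthread_mutex_unlock(&ctx->lock);
    return freed;
}

#define DEMO_THREADS 4
#define DEMO_ALLOCS  1000

/* Each worker allocates DEMO_ALLOCS children under the same parent. */
static void *demo_worker(void *arg)
{
    struct shared_ctx *ctx = arg;

    for (int i = 0; i < DEMO_ALLOCS; i++) {
        if (shared_ctx_alloc(ctx) == NULL)
            break;
    }
    return NULL;
}

/* Run the workers, then count the children while freeing them. */
size_t run_shared_ctx_demo(void)
{
    struct shared_ctx ctx;
    pthread_t threads[DEMO_THREADS];
    int started = 0;

    shared_ctx_init(&ctx);
    for (int i = 0; i < DEMO_THREADS; i++) {
        if (pthread_create(&threads[i], NULL, demo_worker, &ctx) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    size_t freed = shared_ctx_free_all(&ctx);
    pthread_mutex_destroy(&ctx.lock);
    return freed;
}
```

If all threads start and no allocation fails, `run_shared_ctx_demo()` returns `DEMO_THREADS * DEMO_ALLOCS`. Removing the lock/unlock pair in `shared_ctx_alloc()` turns the list update into a data race, so children can be lost or the list can be corrupted.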

With this knowledge, I went back to the djmount source code and tried to separate "global" and "local" context. The unpleasant part of this is that it's hard to maintain - everyone working with the code needs to understand this distinction and needs to track cleanly which talloc context is used how, and where, in the code. It's not generally possible to detect wrong usage automatically. The djmount code passes talloc'ed memory around between different code modules, making it pretty hard to assess locality correctly.

Therefore I sat back again and started wondering whether it might be possible to actually make a thread-safe version of the talloc library. This would make it possible to use fine-grained per-context locking rather than a slow and clumsy global lock. I think it is certainly possible to do this, but it's far from easy. Various talloc operations involve two or three different context objects. Thread-safe operation would require locking all of them, which would pose a severe risk of deadlock unless the code was written very carefully. I suppose the current version of talloc is not MT-safe for a reason.
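One common way to keep multi-object locking free of deadlocks is to always acquire the locks in a fixed global order, for example by address. The sketch below illustrates only that idea. It is not talloc code and not a design for an MT-safe talloc: `struct locked_ctx`, `ctx_lock_pair()` and `ctx_move_child()` are hypothetical names, and the "move" merely adjusts counters to stand in for an operation that touches an old and a new parent. It assumes POSIX threads (`-pthread`).

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for a context with its own lock (not talloc). */
struct locked_ctx {
    pthread_mutex_t lock;
    long children;           /* guarded by 'lock' */
};

/* Lock two contexts in a fixed order (lowest address first), so that two
 * threads locking the same pair can never wait on each other in a cycle. */
static void ctx_lock_pair(struct locked_ctx *a, struct locked_ctx *b)
{
    if (a == b) {
        pthread_mutex_lock(&a->lock);
        return;
    }
    if ((uintptr_t)a > (uintptr_t)b) {
        struct locked_ctx *tmp = a;

        a = b;
        b = tmp;
    }
    pthread_mutex_lock(&a->lock);
    pthread_mutex_lock(&b->lock);
}

static void ctx_unlock_pair(struct locked_ctx *a, struct locked_ctx *b)
{
    pthread_mutex_unlock(&a->lock);
    if (a != b)
        pthread_mutex_unlock(&b->lock);
}

/* Move one child from 'from' to 'to'; returns 1 if a child was moved. */
static int ctx_move_child(struct locked_ctx *from, struct locked_ctx *to)
{
    int moved = 0;

    ctx_lock_pair(from, to);
    if (from != to && from->children > 0) {
        from->children--;
        to->children++;
        moved = 1;
    }
    ctx_unlock_pair(from, to);
    return moved;
}

struct move_job {
    struct locked_ctx *from;
    struct locked_ctx *to;
    int rounds;
};

static void *move_worker(void *arg)
{
    struct move_job *job = arg;

    for (int i = 0; i < job->rounds; i++)
        ctx_move_child(job->from, job->to);
    return NULL;
}

/* Two threads move children in opposite directions between two contexts.
 * Returns the total number of children afterwards. */
long run_lock_order_demo(void)
{
    struct locked_ctx a, b;
    struct move_job jobs[2] = {
        { &a, &b, 100000 },
        { &b, &a, 100000 },
    };
    pthread_t threads[2];
    int started = 0;

    pthread_mutex_init(&a.lock, NULL);
    pthread_mutex_init(&b.lock, NULL);
    a.children = 1000;
    b.children = 1000;

    for (int i = 0; i < 2; i++) {
        if (pthread_create(&threads[i], NULL, move_worker, &jobs[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    long total = a.children + b.children;

    pthread_mutex_destroy(&a.lock);
    pthread_mutex_destroy(&b.lock);
    return total;
}
```

Because both workers take the two locks in the same order regardless of the direction of the move, the demo finishes and the total number of children is unchanged. A real talloc tree is much harder: an operation may touch a chunk, its old parent and its new parent, and recursive frees walk whole subtrees, so a simple pairwise ordering like this would not be enough on its own.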

So, there's now a choice of options, each of which has pros and cons:

* Pursue analysis of the djmount code, adding locking primitives around talloc calls as appropriate. This is doable in limited time, but will likely not produce an optimal solution, and will result in hard-to-maintain code.
* Convert djmount from using talloc to some other memory management code, possibly plain malloc(). This would be possible and not too hard, but memory leaks are likely to result, and eliminating all of them will be costly. Avoiding leaks by recursively freeing memory is one of the key points of talloc, and heavily used in the djmount code.
* Try to create an MT-safe talloc library. This is the cleanest option and promises to provide the best result. It is also by far the most difficult and challenging option, and it's possible that I'll face problems I won't be able to solve.
