Droid is an open-source file format identification library and a near-indispensable part of any digital preservation toolkit. For this reason, we integrate it directly into the Arcsys platform.
When a client contacted us to report degraded performance during the archiving process, we didn’t expect it to turn into quite an adventure, one that would spark long discussions among Droid’s contributors about certain format identification mechanisms and ultimately benefit the entire preservation community.
Finding the Weak Link
The client opened a support ticket explaining that their archiving process was taking far longer than expected. The files in question were extremely large, several terabytes each. The process consistently stalled at the same stage: the format identification phase.
Our first question was naturally: what kind of files are we dealing with? The answer: ZIP archives.
Armed with that information, we began with the classic troubleshooting step familiar to all software developers: reproducing the issue.
An Unexpected Bandwidth Problem
We started archiving multi-gigabyte ZIP files and quickly confirmed that the delay originated from Droid’s format identification step. By attaching a Java profiling tool to monitor I/O throughput, we found that Droid was reading the entire file twice.
What surprised us even more was that when we used Droid’s graphical interface (DroidUI) to identify the same file, only a small portion of it was read. Identification completed in a fraction of the time and produced identical results.
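The measurement described above can be illustrated with a small Python sketch. This is not the Java profiling setup we used, and none of the names below (`CountingReader`, `read_amplification`, the two identifier functions) exist in Droid; they are hypothetical stand-ins. The idea is simply to count the bytes an identification routine reads and compare that with the file size.

```python
import io


class CountingReader:
    """Wraps a binary file object and counts the bytes actually read."""

    def __init__(self, raw):
        self._raw = raw
        self.bytes_read = 0

    def read(self, size=-1):
        data = self._raw.read(size)
        self.bytes_read += len(data)
        return data

    def seek(self, offset, whence=io.SEEK_SET):
        return self._raw.seek(offset, whence)


def read_amplification(reader, file_size):
    """Bytes read divided by file size; about 2.0 means two full passes."""
    return reader.bytes_read / file_size if file_size else 0.0


def identify_header_only(stream):
    # Hypothetical identifier that only inspects the first 4 bytes.
    stream.seek(0)
    return stream.read(4)


def identify_two_full_passes(stream):
    # Hypothetical identifier that scans the whole stream twice.
    for _ in range(2):
        stream.seek(0)
        while stream.read(64 * 1024):
            pass


# In-memory stand-in for a large ZIP file (assumption: 1 MB of padding).
payload = b"PK\x03\x04" + bytes(1_000_000)
size = len(payload)

cheap = CountingReader(io.BytesIO(payload))
identify_header_only(cheap)

costly = CountingReader(io.BytesIO(payload))
identify_two_full_passes(costly)

print("header only:", read_amplification(cheap, size))
print("two passes: ", read_amplification(costly, size))
```

A ratio close to zero means only a small portion of the file was touched, while a ratio near 2 corresponds to the "entire file read twice" behaviour we observed.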
Reporting the Issue on GitHub
With concrete data in hand, we submitted a bug report to Droid’s GitHub repository (Issue #906) while continuing our own investigation. The culprit turned out to be a combination of “binary signature files” and “container signature files.”
In simple terms, Droid recognizes file types using a binary signature file that lists unique byte sequences characteristic of each format. However, some files, such as ZIP containers, also have their own internal signature sets, described in a separate container signature file. Both components evolve over time, each with its own version numbers, as contributors continuously add support for new formats.
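To make the two-stage idea concrete, here is a minimal Python sketch. It is not Droid’s implementation and does not use Droid’s signature file formats: the `SIGNATURES` table, the function names, and the EPUB rule are assumptions chosen for illustration. The byte sequences themselves are the well-known leading bytes of PNG, ZIP, and TIFF files.

```python
import io
import zipfile

# Illustrative "binary signatures": (format, offset from start, byte sequence).
SIGNATURES = [
    ("PNG", 0, b"\x89PNG\r\n\x1a\n"),
    ("ZIP", 0, b"PK\x03\x04"),
    ("TIFF", 0, b"II*\x00"),   # little-endian TIFF
    ("TIFF", 0, b"MM\x00*"),   # big-endian TIFF
]


def identify_binary(stream):
    """Binary step: compare the first few bytes against known sequences."""
    header_len = max(offset + len(seq) for _, offset, seq in SIGNATURES)
    stream.seek(0)
    header = stream.read(header_len)  # only a handful of bytes are read
    for name, offset, seq in SIGNATURES:
        if header[offset:offset + len(seq)] == seq:
            return name
    return None


def refine_zip(stream):
    """Container step: look inside the ZIP for a marker entry (EPUB here)."""
    stream.seek(0)
    with zipfile.ZipFile(stream) as archive:
        if "mimetype" in archive.namelist():
            if archive.read("mimetype").strip() == b"application/epub+zip":
                return "EPUB"
    return "ZIP"


def identify(stream):
    fmt = identify_binary(stream)
    if fmt == "ZIP":
        return refine_zip(stream)
    return fmt


# Build a tiny EPUB-like container in memory to exercise both steps.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")

print(identify(sample))
print(identify(io.BytesIO(b"\x89PNG\r\n\x1a\n" + bytes(16))))
```

In this sketch, neither step needs to read the whole file: the binary step looks at a fixed-size header, and the container step relies on the ZIP directory to find one small entry.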
In our case, we discovered that depending on the combination of binary and container signature file versions used, the performance issue either occurred or disappeared. The DroidUI client did not exhibit the bug simply because it used a different version combination.
Working closely with the open-source community, we were able to identify the exact problematic entries. The maintainers fixed the issue by cleaning up the signature files. In the meantime, we applied a local correction for our client, who immediately saw a dramatic improvement in performance.
A Source of Debate and Reflection
The incident sparked some thoughtful discussions within the digital preservation world. Andy Jackson, a technical architect at the Digital Preservation Coalition, wrote a series of blog posts on the topic. He reflected on a suggestion by Martin Hoppenheit: customizing signature files to match each client’s context to improve performance.
For example, if an archive only contains TIFF and PNG images, why include other unused formats in the signature files and slow down the identification process? After careful consideration, Jackson pointed out that this approach could lead to incompatibilities between binary and container signatures, particularly during future updates. Instead, he suggested that Droid itself should avoid file-scanning combinations that trigger full reads.
Key Takeaways
Here are a few lessons learned from this debugging journey:
- When encountering slow format identification, check read bandwidths and ensure files aren’t being fully read multiple times, something that should rarely happen.
- Format identification isn’t “magic” and can represent a significant processing cost, especially for large files. Its importance should be considered in the specific context of each project.
- The open-source digital preservation community demonstrated impressive responsiveness in resolving the Droid anomaly.
- Andy Jackson’s blog posts underscored the passion and collaboration that drive this community, echoing the spirit we observed at iPres 2024.
This story had two goals: first, to provide a glimpse into the daily work of software development teams tackling real-world performance issues, and second, to highlight the strength and commitment of the open-source community, particularly within the field of digital preservation.
Special thanks to my colleague Raphaël Lample, who led much of the technical investigation described here and guided the community toward pinpointing the source of the issue.