By - binaryfor
Interesting. The 4GB limit is a bit of an issue for practical use though -- the primary case I'd want this is when you're looking to pull something out of a TB-class tar file. In a situation like that, the random access nature of the application could be a major benefit.
(Yes, I know that I should be using a random access data format for that size of archive. I do, but sometimes you come across a file where someone else didn't)
Quite a justified request for a program meant specifically for random access to archived files.
> The 4GB limit is a bit of an issue for practical use though
oh I had missed that. I wonder if that's a limit on the archive size, or on each individual file within (an unlimited size) archive
if the archive is less than 4 GB I may not really care about random access, and then tar+zstd beats the hell out of most things I have seen
What is a random access data format? Are they common?
Basically, the format needs an index that lists all the things in the archive and records the offset of each one. It's relatively uncommon for formats to work like this -- instead of just streaming out the data, the writer has to either make multiple passes or hold the index back until the end, so it knows what goes where up front.
So tar, for example, is just a header saying "file foo/bar.txt is a 78-byte text file", followed by its content. Then comes the next file's header, and so on. Then you can compress the whole thing or whatever. But if you want to read the contents of "foo/test.dat", you more or less need to scan the file from beginning to end, find the file you want, and then output it.
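To make the linear-scan point concrete, here's a small sketch using Python's stdlib `tarfile` (the file names and sizes mirror the example above; the in-memory archive is just a stand-in for a real one):

```python
import io
import tarfile

# Build a tiny tar in memory: header + content, header + content, ...
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    for name, payload in [("foo/bar.txt", b"x" * 78), ("foo/test.dat", b"hello")]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))
buf.seek(0)

# There is no index: to find "foo/test.dat" we walk member headers
# from the front of the archive until we hit the right name.
with tarfile.open(fileobj=buf, mode="r") as tf:
    for member in tf:          # sequential scan over headers
        if member.name == "foo/test.dat":
            data = tf.extractfile(member).read()
            break
print(data)  # b'hello'
```

With a compressed TB-class archive, that scan also means decompressing everything in front of the file you want.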
An example of a format that does maintain random access is squashfs: it works much more like a conventional filesystem, with a tree of basically-inodes, compressed data, and pointers to the locations in the compressed archive holding each file's data. So when you want to read "foo/test.dat", you look up foo, which directs you to test.dat, which directs you to the spot in the archive with that data.
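A toy sketch of the indexed idea (this is not squashfs's actual on-disk format, just the general "index maps names to offsets" principle): the archive ends with an index and a pointer to it, so reading one file is a seek plus a read rather than a scan.

```python
import io
import json
import struct

# Toy layout: [data blobs][JSON index][8-byte little-endian index offset].
def pack(files):
    out = io.BytesIO()
    index = {}
    for name, payload in files.items():
        index[name] = (out.tell(), len(payload))  # record offset + size
        out.write(payload)
    index_off = out.tell()
    out.write(json.dumps(index).encode())
    out.write(struct.pack("<Q", index_off))       # trailer points at index
    return out.getvalue()

def read_one(blob, name):
    (index_off,) = struct.unpack("<Q", blob[-8:])
    index = json.loads(blob[index_off:-8])
    off, size = index[name]
    return blob[off:off + size]   # direct jump; no scan through other files

archive = pack({"foo/bar.txt": b"x" * 78, "foo/test.dat": b"hello"})
print(read_one(archive, "foo/test.dat"))  # b'hello'
```

The cost of `read_one` doesn't grow with the number or size of the other files in the archive, which is the whole point once the archive is TB-class.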
impressive; your numbers are so small that your hyperfine shows wild variation (17 +/- 13) in the report
but the output size matters too, IMO; would be curious what you have
I've switched to squashfs for archival and transport because, apart from being faster than unzip for random access, *I can mount the whole thing* when I need it. I do this with archives from old projects, where I need the archive to be read-only but searchable -- I build the sqfs filesystem after adding a full "recoll" index (the index goes into the image too), so I can just mount it and search. Happens rarely, but boy, when it does, this is a huge convenience!