Provide a list of file offsets within an ISO9660 image, with the associated full path, filename, and file size

I've got an ISO9660 CD image from 20 years ago with a series of files stored on it. I need a list of file offsets within the .ISO image, along with the associated full path information, i.e. the directory and filename. Associated file sizes (or size on disk) are needed as well.

The tools I've found that do this seem to stop around 366MB, but there are definitely files present in the 500,000,000 - 520,000,000 byte range.

The deliverables would be a file listing like this:

DISK OFFSET    FILENAME         FILE SIZE
102400         /bin/gzip        2000
104448         /bin/hostname    4000
108544         /bin/ip          2000

and so on.

Your answer must include coverage of the whole ISO image, and there must be some results present in the range specified above. Incomplete results are not useful, as I already have them.

Lastly, you must describe in detail which tools and method you used to obtain the answers.

I will tip extra for prompt, correct answers.

Please note that I'm NOT looking for content extraction, just a directory listing sorted by .iso image file offset.

The file is linked below.

https://www.dropbox.com/s/e03zim82olbuefr/XUPH1020.iso?dl=0

awarded to CyteBode


1 Solution

Winning solution
Tipped

Output: https://gist.github.com/CyteBode/2b404a125e20e55485d6af387c084619

My first attempt was to use isoinfo -l -i XUPH1020.iso on Linux and parse the output, but I got the same results you've described. I then took a search-based approach: mount the ISO (with mount XUPH1020.iso /media/iso), traverse its files, and search for each file's sector(s) in the raw image so as to match files to offsets. To that end, I wrote the following script, which runs in about 25 seconds on my machine:

script.py

import os
import os.path


SECTOR_SIZE = 2048
OUTPUT_FILE = "output.txt"


def read_in_chunks(file_object, chunk_size=1024):
    # Yield the file's contents in chunk_size-byte chunks.
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


def sectors(chunk):
    # Split a chunk into consecutive SECTOR_SIZE-byte sectors.
    for i in range(0, len(chunk), SECTOR_SIZE):
        yield chunk[i:min(i+SECTOR_SIZE, len(chunk))]


def pad_sector(sector):
    # Zero-pad a partial sector to a full SECTOR_SIZE, matching how the
    # filesystem pads a file's final sector to a sector boundary.
    return sector + b"\0" * (SECTOR_SIZE - len(sector) % SECTOR_SIZE)


def treat_file(file_path, mount_point, all_files,
               data, problematic, doubles, sector_positions):
    # Locate a mounted file in the image by looking up its sector(s) in the
    # sector_positions index built from the raw ISO.
    path = file_path
    rel_path = "/" + os.path.relpath(file_path, mount_point)
    all_files.add(rel_path)
    size = os.path.getsize(path)
    with open(path, "rb") as fp:
        contents = fp.read()
    if size < SECTOR_SIZE:
        # A short file occupies a single zero-padded sector.
        sector = pad_sector(contents)
        assert sector in sector_positions
        offsets = sector_positions[sector]
    else:
        # Intersect the candidate start positions implied by successive
        # sectors until a single offset remains.
        positions = set()
        for i, sector in enumerate(sectors(contents)):
            if len(sector) % SECTOR_SIZE != 0:
                sector = pad_sector(sector)
            assert sector in sector_positions

            if i == 0:
                for pos in sector_positions[sector]:
                    positions.add(pos)
            else:
                possible_pos = set()
                for pos in sector_positions[sector]:
                    possible_pos.add(pos - i * SECTOR_SIZE)
                positions = positions.intersection(possible_pos)
                assert len(positions) > 0
            if len(positions) == 1:
                break

        offsets = sorted(list(positions))

    if len(offsets) == 1 and offsets[0] not in data:
        data[offsets[0]] = (rel_path, size)
    else:
        if rel_path in problematic:
            doubles.add(rel_path)
        problematic[rel_path] = (offsets, size)


def resolve_problems(problematic, data):
    # Assign ambiguous (duplicate-content) files to the candidate offset
    # whose neighbors in the image share the deepest path prefix.
    def neighbor_binary_search(needle, haystack):
        # Return the known offsets immediately below and above needle.
        assert needle not in haystack
        lower = 0
        upper = len(haystack) - 1

        while upper - lower > 1:
            mid = (lower + upper) // 2
            if haystack[mid] < needle:
                lower = mid
            else:
                upper = mid

        if needle < haystack[lower]:
            return (None, haystack[lower])
        elif needle > haystack[upper]:
            return (haystack[upper], None)
        return (haystack[lower], haystack[upper])

    offsets = sorted(data.keys())
    offsets_set = set(offsets)
    to_remove = []
    for rel_path in sorted(problematic.keys()):
        possible_offsets, size = problematic[rel_path]
        best_rating = 0
        best_offset = 0
        for offset in problematic[rel_path][0]:
            if offset in offsets_set:
                continue
            before, after = neighbor_binary_search(offset, offsets)
            # Rate a candidate by how many path components the neighboring
            # file shares with rel_path.
            if before is not None:
                common = os.path.commonpath([data[before][0], rel_path])
                rating_a = len([p for p in common.split(os.sep) if p])
            else:
                rating_a = 0
            if after is not None:
                common = os.path.commonpath([data[after][0], rel_path])
                rating_b = len([p for p in common.split(os.sep) if p])
            else:
                rating_b = 0
            rating = max(rating_a, rating_b)
            if rating > best_rating:
                best_offset = offset
                best_rating = rating

        if best_offset == 0:
            continue

        if best_offset not in data:
            data[best_offset] = (rel_path, size)
            to_remove.append(rel_path)
            offsets_set.add(best_offset)

    for rel_path in to_remove:
        del problematic[rel_path]
    to_remove.clear()

    for rel_path in problematic:
        size = problematic[rel_path][1]
        if len(problematic[rel_path][0]) == 1:
            offset = problematic[rel_path][0][0]
            if offset in data:
                data_path, data_size = data[offset]
                if rel_path == data_path and size == data_size:
                    to_remove.append(rel_path)

    for rel_path in to_remove:
        del problematic[rel_path]
    to_remove.clear()


def main():
    import sys

    if len(sys.argv) != 3:
        print("Usage: python3 script.py path_to_iso mount_point")
        sys.exit(-1)

    iso_path = sys.argv[1]
    mount_point = sys.argv[2]

    assert os.path.exists(iso_path) and os.path.isfile(iso_path)
    assert os.path.exists(mount_point) and os.path.isdir(mount_point)

    # Index every sector of the image by its contents: sector bytes -> list
    # of byte offsets at which that sector occurs.
    sector_positions = {}
    with open(iso_path, "rb") as f:
        pos = 0
        for chunk in read_in_chunks(f, 1024*1024):
            for sector in sectors(chunk):
                sector_positions.setdefault(sector, []).append(pos)
                pos += SECTOR_SIZE

    count = [0]  # number of files traversed (cf. the notes on double counting)

    def traverse(directory, file_fn):
        # Depth-first traversal: recurse into subdirectories first, then
        # process this directory's files.
        files = []
        for fd in sorted(os.listdir(directory)):
            path = os.path.join(directory, fd)
            if os.path.isdir(path):
                traverse(path, file_fn)
            elif os.path.isfile(path):
                files.append(path)
        count[0] += len(files)
        for file in files:
            file_fn(file)

    all_files = set()
    data = {}
    problematic = {}
    doubles = set()
    traverse(mount_point, lambda f: treat_file(f, mount_point, all_files,
                                               data, problematic, doubles,
                                               sector_positions))

    print("File count: %d" % len(all_files))

    if len(doubles) > 0:
        print("%d files appearing more than once:" % len(doubles))
        for file in doubles:
            print("   ", file)

    resolve_problems(problematic, data)
    if len(problematic) > 0:
        print("%d problems unresolved" % len(problematic))

    missing = len(all_files) - len(data)
    if missing > 0:
        print("%d files missing:" % missing)
        missing = set(all_files)
        for fname, _ in data.values():
            missing.remove(fname)
        for file in sorted(list(missing)):
            print("   ", file)

    HEADER = ("DISK OFFSET", "FILENAME", "FILE SIZE")
    SEPARATOR = "\t"
    with open(OUTPUT_FILE, "w+") as f:
        f.write(SEPARATOR.join(HEADER))
        f.write("\n")
        for offset in sorted(data.keys()):
            fname, size = data[offset]
            f.write(SEPARATOR.join([str(offset), fname, str(size)]))
            f.write("\n")


if __name__ == '__main__':
    main()

Notes:

  • Only tested on Linux with Python 3.
  • Around 6000 files have the same contents as some other file(s). These problems are resolved by finding the offset for which neighboring offsets have the most similar path. That's much better than just taking an offset at random, but I can't guarantee it's always valid.
  • Some of the 0-length files are likely at the wrong offset. That's a limitation of the search approach; the sanity-check sketch after these notes can at least verify the non-empty entries.
  • 2 files appear twice with the same path. I don't know why.
  • Around 100 files get traversed twice and create a discrepancy in the file count. I don't know why either.
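
If you want to sanity-check the listing, here is a minimal sketch (assuming the tab-separated output.txt the script writes and the same mount point) that re-reads each reported offset from the image and compares it against the start of the corresponding mounted file:

SECTOR_SIZE = 2048


def check_output(iso_path, mount_point, output_file="output.txt"):
    # Compare the data at each reported offset against the beginning of the
    # corresponding mounted file. Zero-length files are skipped, since any
    # zeroed sector would match them.
    mismatches = 0
    with open(iso_path, "rb") as iso, open(output_file) as listing:
        next(listing)  # skip the header line
        for line in listing:
            offset, fname, size = line.rstrip("\n").split("\t")
            if int(size) == 0:
                continue
            length = min(int(size), SECTOR_SIZE)
            iso.seek(int(offset))
            from_iso = iso.read(length)
            with open(mount_point + fname, "rb") as f:
                if f.read(length) != from_iso:
                    mismatches += 1
                    print("Mismatch:", fname)
    print("%d mismatches" % mismatches)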
Thank you for submitting a solution! I don't think any of the limitations are deal breakers. Your approach is an interesting one! I suppose the leap of faith we're making here is that the Linux isofs mounter does a better job of parsing the ISO than isoinfo/iso-read; I think that's a fair assumption. I was seeing about 29k files with isoinfo, but 37,446 with your solution. The Linux mounter (via find . -type f) counts 37,551, so I think that accounts for your last note. mount ~/ISO/mygame.iso /mnt/iso -o loop,map=off,check=relaxed fixes the forced-lowercase problem. There's still about 26MB left unaccounted for at the end of the image, which has real content. I think this might be a boot image, which is why the ISO doesn't "link" there.
KMbountify 19 days ago
If you look at offset 240 decimal, it points to 0x1E382000 = 506,994,688. Your last ISO entry is at 506,974,208 + file size 19,364 (rounded up to the nearest block multiple, 20,480) = 506,994,688! I think that's definitive!
KMbountify 19 days ago
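
For reference, that pointer can be checked from Python; this sketch assumes (per the big-endian table format described a couple of comments down) that the value at byte 240 is a big-endian uint32:

import struct

with open("XUPH1020.iso", "rb") as f:
    f.seek(240)
    ptr = struct.unpack(">I", f.read(4))[0]

print(hex(ptr), ptr)  # reportedly 0x1e382000 = 506,994,688

# Last listed file: offset 506,974,208, size 19,364 bytes, i.e. 10 full
# sectors (20,480 bytes) once rounded up to the 2048-byte block size.
assert 506974208 + ((19364 + 2047) // 2048) * 2048 == 506994688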
Interesting, it definitely looks like there's a secondary image there, but I don't know in what format. I tried mounting it at that offset, but it failed. Looking at it with a hex editor, there are definitely some files in there, including some scripts. Do you care about getting the offsets of those files? Maybe you could make it a new bounty.
CyteBode 19 days ago
Awarded this to you. The problem isn't necessarily getting the offsets; how would you get the filenames?
KMbountify 19 days ago
I made a bit of progress. There's a partition table of sorts at offset 2048, with names and big-endian uint32 offsets that are multiples of 2048. I dd'd them to separate files, and according to the file command, there are two text files, some PA-RISC 1.0 executables, an (apparently corrupt) UFS partition, and some .tar.gz files. But truth be told, I was hoping to be tipped $50 instead of $5. I can't work on this any further if you're not paying me more.
CyteBode 19 days ago
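
For anyone following along, the dd carving step looks roughly like this in Python. The entries list is hypothetical; the real (name, offset) pairs would be read from the table at byte offset 2048, and each region is assumed to run to the start of the next one (or to the end of the image):

import os

entries = [
    ("region0.bin", 506994688),  # placeholder values for illustration
    # ... remaining (name, offset) pairs taken from the table ...
]

iso_path = "XUPH1020.iso"
bounds = [offset for _, offset in entries] + [os.path.getsize(iso_path)]

with open(iso_path, "rb") as iso:
    for (name, start), end in zip(entries, bounds[1:]):
        iso.seek(start)
        with open(name, "wb") as out:
            out.write(iso.read(end - start))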