Python corrupted fragment finder

Write a Python 3.x script that locates the file position of a corrupted binary fragment within a larger file. Display the position at which the fragment was found, and how many bytes of the corrupted fragment match the corresponding target section.

The script should take two arguments, both raw binary files: the fragment and the whole file.

Approximate file sizes: fragment, 512 bytes to 100 KB; whole file, about 1 MB.

The most common use case is one where the majority of the fragment matches. Use 51% as the baseline for a match: more than 50% of the bytes must match the target, in the proper sequence, for the window to be considered a match.

Examples:

Fragment, 10 bytes: 09 44 32 1A 5F 2A 11 01 01 02

Whole file, 17 bytes: 01 02 03 04 09 44 32 1B 5F 2A 11 01 01 02 04 04 05

Output: Position : 5 (offset from start of whole file)

Output: 9 bytes of 10 match

Fragment, 10 bytes: 08 08 08 08 08 08 08 08 08 08

Whole file, 17 bytes: 01 02 03 04 09 44 32 1B 5F 2A 11 01 01 02 04 04 05

Output: No match found.

The matching bytes of the fragment must be found in order and consecutively, without skipping intermediate bytes. If the fragment is 10 bytes long, then the comparison window in the target is also 10 bytes. More than 5 bytes of each comparison window must match.

For a 17-byte file and a 10-byte fragment, there are 8 comparisons to make: no shift (i.e. the fragment lines up with the beginning of the file), then +1, +2, +3, +4, +5, +6, +7. Within each comparison window, at least 6 bytes must match. The last comparison window runs from offset (whole_file_size - fragment_file_size) to the end.
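For concreteness, here is a minimal sketch of that sliding-window comparison (an illustration of the spec, not either submitted solution; note the example output above counts the first byte of the file as position 1):

import sys

def find_fragment(fragment: bytes, whole: bytes):
    """Slide the fragment over the whole file, counting positional byte matches."""
    n = len(fragment)
    best_pos, best_matches = -1, 0
    # (whole_file_size - fragment_file_size + 1) comparison windows in total.
    for pos in range(len(whole) - n + 1):
        matches = sum(a == b for a, b in zip(fragment, whole[pos:pos + n]))
        if matches > best_matches:
            best_pos, best_matches = pos, matches
    if best_matches * 2 > n:  # more than 50% of the bytes match
        print(f"Position : {best_pos + 1} (offset from start of whole file)")
        print(f"{best_matches} bytes of {n} match")
    else:
        print("No match found.")

if __name__ == "__main__":
    with open(sys.argv[1], "rb") as frag, open(sys.argv[2], "rb") as whole:
        find_fragment(frag.read(), whole.read())

This is O(len(whole) x len(fragment)) in the worst case, which is the performance problem discussed in the comments below.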

Deliverables: a working Python script that produces no errors and minimal warnings under pylint. The script needs to run with minimal external dependencies on a modern Ubuntu box, consume a reasonable amount of memory, and finish in a reasonable amount of time.

Tips will be given where all the edge cases are covered, the code is properly commented, and a clean, fast solution is provided.

Hi! I've read your email. Sorry for the late reply; I was sick for the past 3 days. It would be helpful if you could provide more details/examples of how your algorithm should behave. Let's say we slightly adjust your first example. Fragment, 10 bytes: 09 44 32 1A 5F 2A 11 01 01 02. Whole file, 17 bytes: 01 09 44 32 04 03 10 1B 2F 2A 11 03 01 02 04 04 05. What would you like the output of this to be?
VladimirMikulic 15 days ago
No problem. I hope you are feeling better! I would like that output to be "No match found." The program does not need to consider extra filler bytes or a fragment that isn't found in the whole. It doesn't need to process the input as a binary stream or handle bit/byte shifts. The main application here, although there are a couple of others, is a corrupted floppy disk. The disk is made up of tracks, which all have fixed sizes. Let's say that tracks 50 and 52 read fine, but track 51 is corrupt. By feeding this tool tracks 51, 52, and 53 as a fragment, along with the original, correct "whole" disk, it would find that 2/3 (66.6%) of the bytes match and identify the offset from the beginning of the disk. That would allow the consumer of the data to repair the disk. Updated question.
KMbountify 15 days ago


2 Solutions

Winning solution

Solution

Hi!

I've programmed the script as you've specified.
At this link you can find the example binary files I used for testing, along with the final Python 3 script.

Let me know if you need anything more.

Thanks,
Vladimir

Thank you for your time on this. I'm going to check it out soon. I'll be interested to see whether there's any big performance hit when the fragment is large or the file to check is large.
KMbountify 13 days ago
No problem, take your time.
VladimirMikulic 13 days ago
I'm not sure what's happening, but something isn't working properly. Try this:

dd if=/dev/urandom of=1mbtest.bin bs=1 count=1000000
dd if=1mbtest.bin of=middle.bin bs=1 count=34000 skip=500000
python3 script.py middle.bin 1mbtest.bin

It shows "No match found." even though that unmodified(!) fragment should be found. Is it something to do with your chunking?
KMbountify 11 days ago
Hi! Thank you for giving it a try. The problem wasn't with chunking but rather with the comparison. I missed one step of the algorithm. This has been fixed and you can download the new script from here. Thanks.
VladimirMikulic 11 days ago
I've also added a detailed explanation, which should make it easier for you or another maintainer to wrap their head around it.
VladimirMikulic 11 days ago
Thanks for the new script. It fixes the problems I saw earlier. The performance doesn't seem great on real-world inputs, though: it took 12.5 minutes with the example I gave you, a 34K fragment and a 1MB target. I'll have to read your algorithm description more carefully. I guess if it's doing about 1 million 34K-byte comparisons... eeeesh. That might take a while.
KMbountify 10 days ago
We're running out of time here, so I'm going to award now, but it would be good if you could provide an optimized version. I mentioned one way that could be done (as a comment on the other solution); you could also do early termination of comparisons that can't meet the 50% standard. Once you've accumulated so many mismatches partway through a fragment comparison that 51% matching bytes is out of reach, the starting position isn't valid, even if the remainder of that comparison matches perfectly!
KMbountify 10 days ago
Hi! Thank you for accepting my solution. I agree that there is no point in continuing to compare bytes once we've already seen more than 50% non-matching bytes. It is a bit odd that it takes 12.5 minutes on your machine: before the change it took 20s on mine, and after this change the time was cut in half, to only 10s. Nonetheless, here is an improved version. Let me know if you need anything more.
VladimirMikulic 10 days ago
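For reference, the early-termination idea being discussed can be sketched like this (a hedged illustration, not the actual posted script): abandon a window as soon as so many bytes have mismatched that even a perfect remainder could not reach 51%.

def window_matches(fragment: bytes, window: bytes) -> int:
    """Count positional byte matches, bailing out once >50% is unreachable."""
    n = len(fragment)
    allowed_mismatches = (n - 1) // 2  # more than half of the bytes must match
    matches = mismatches = 0
    for a, b in zip(fragment, window):
        if a == b:
            matches += 1
        else:
            mismatches += 1
            if mismatches > allowed_mismatches:
                return 0  # this start position can no longer qualify
    return matches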
This did improve performance on my one dataset, moving from 12.5 minutes to 7.5 minutes. That's better, but still not great. I'm seeing an index-out-of-range error, though; here's how to reproduce it: https://pastebin.com/Be7upKzA
KMbountify 10 days ago
Are you going to fix that index out of range error above?
KMbountify 8 days ago
Hi @KMbountify. Sorry for the late reply. I've been trying to fix this, but I haven't come up with a perfect solution yet. The problem is that every fix I've tried increases the computational time.
VladimirMikulic 8 days ago
Hey @VladimirMikulic, you may want to hold off: it seems that kriander's new solution is dramatically faster, and simpler than what I was suggesting. I'm still testing it, but at first glance it seems awesome. On a quick run, it did my real-world task in 36 seconds and a full search in 53s.
KMbountify 8 days ago
I am truly sorry that my solution wasn't the one you were looking for. I can assure you, though, that it is the most precise one. If you later wish to resume the work, let me know!
VladimirMikulic 8 days ago

Here's a simple brute force solution (I think).

chmod +x fragsearch.py
./fragsearch.py -f middle.bin 1mbtest.bin
Thanks for jumping in here. I like the conciseness of your solution. The only problem, as with the other solution, is that for real-world inputs the simple brute force seems pretty slow. My first thought for increasing the speed is to traverse the target file computing checksums over larger blocks, say 8 or 32 bytes. You'd iterate over the target file once to generate them, perform the same checksum over the fragment once, and then compare checksums. I wasn't a math major, but I think 32-byte blocks would reduce the search space by 3 orders of magnitude. Then maybe use your original algorithm, over a much smaller window, to get the final counts; see the sketch below.
KMbountify 10 days ago
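To make that concrete, here is one possible reading of the suggestion as a sketch (CRC32 and the 32-byte block size are assumptions; it only surfaces candidates whose hit block is block-aligned in the target and survived corruption, so the byte-level pass is still needed to confirm and count):

import zlib

BLOCK = 32  # block size from the comment above; CRC32 is an assumed checksum

def candidate_starts(fragment: bytes, whole: bytes):
    """Yield fragment start offsets worth a full byte-level comparison."""
    # One pass over the target: checksum every block-aligned chunk.
    index = {}
    for pos in range(0, len(whole) - BLOCK + 1, BLOCK):
        index.setdefault(zlib.crc32(whole[pos:pos + BLOCK]), []).append(pos)
    # One pass over the fragment: a hit for fragment block k at target
    # offset t means the whole fragment may start at t - k.
    seen = set()
    for k in range(0, len(fragment) - BLOCK + 1, BLOCK):
        for t in index.get(zlib.crc32(fragment[k:k + BLOCK]), ()):
            start = t - k
            if 0 <= start <= len(whole) - len(fragment) and start not in seen:
                seen.add(start)
                yield start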
I agree that @VladimirMikulic is deserving of the bounty here. However, I couldn't resist making a minor change to my solution to do the byte comparisons using NumPy. This decreased the runtime for the 1 MB test data substantially: from 13 minutes to 18 seconds on my laptop. If you are OK with slightly less accuracy, the byte comparisons are best done using the natural integer size for your architecture (probably 64-bit); just change line 7 to DTYPE = np.uint64. The runtime using 64-bit integer comparisons for the 1 MB test is less than 1 second on my laptop.
kriander 9 days ago
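The linked script isn't quoted here, but the core of the NumPy approach is presumably something like the following sketch (DTYPE and the function shape are assumptions; with np.uint64 each element comparison covers 8 bytes at once):

import numpy as np

DTYPE = np.uint8  # np.uint64 compares 8 bytes per element: faster but coarser

def find_fragment(frag_path, src_path, dtype=DTYPE):
    """Return (byte_offset, matching_bytes, total_bytes), or None if no match."""
    frag = np.fromfile(frag_path, dtype=dtype)  # trailing bytes that don't
    src = np.fromfile(src_path, dtype=dtype)    # fill an element are dropped
    n = len(frag)
    best_pos, best_hits = -1, 0
    for pos in range(len(src) - n + 1):
        # One vectorized element-wise comparison per window.
        hits = int(np.count_nonzero(src[pos:pos + n] == frag))
        if hits > best_hits:
            best_pos, best_hits = pos, hits
    if best_hits * 2 <= n:
        return None
    size = np.dtype(dtype).itemsize
    return best_pos * size, best_hits * size, n * size

The Python-level loop over offsets remains, but each window comparison becomes a single vectorized operation, which is where the speedup comes from.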
I'm so glad you joined the conversation, and I'm glad you couldn't resist the optimization. What an improvement with NumPy! With the 64-bit version, I'm getting around 1.5s with real-world data. Awesome, thanks much! Does the lack of accuracy in the final output come solely from the NBYTE multiplier? So the position and total bytes (m) would be the closest multiple of NBYTE? How about n? It's less clear to me why that would vary by more than, say, 2x the block size.
KMbountify 8 days ago
The lack of accuracy using uint64 is mainly caused by comparing 8 bytes at a time. Even if 7 of the 8 bytes actually match, compare() returns 0 (instead of 7). The first example you provided actually fails using 64-bit integers and is instructive. For that test case, fragment = [0x01112a5f1a324409] and source = [0x1b32440904030201, 0x0404020101112a5f]. (The last two bytes of the fragment and last byte of the source file aren't read because there aren't enough bytes to fill a 64-bit integer; the endianness on your machine could also be different.) For this test case, only two comparisons are done and they both fail, so no match is found even though there is one. These errors become negligible as the problem size increases. For real, large data sets uint64 is the way to go.
kriander 6 days ago
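Those values are easy to reproduce (little-endian assumed, per the caveat above):

import numpy as np

frag = bytes.fromhex("09 44 32 1A 5F 2A 11 01 01 02")
src = bytes.fromhex("01 02 03 04 09 44 32 1B 5F 2A 11 01 01 02 04 04 05")

# Only complete 8-byte words survive the cast; trailing bytes are dropped.
f64 = np.frombuffer(frag[:len(frag) // 8 * 8], dtype=np.uint64)
s64 = np.frombuffer(src[:len(src) // 8 * 8], dtype=np.uint64)
print([hex(v) for v in f64])  # ['0x1112a5f1a324409']
print([hex(v) for v in s64])  # ['0x1b32440904030201', '0x404020101112a5f']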
Thinking about this more, there are pathological cases using uint64. It depends on whether the corrupted region in the source file aligns with a 64-bit boundary. Consider this test example: dd if=/dev/urandom of=source.bin bs=1 count=1048576 and dd if=source.bin of=fragment.bin bs=1 count=512000 skip=34007. Even though the source and fragment can both be perfectly cast to an array of 64-bit integers, the fragment won't be found because it doesn't start at an 8-byte boundary in the source file. (34007 isn't a multiple of 8.) This seems like a serious problem to me. It's a ~15-20x performance hit, but unless you know the corrupted region is going to start at an 8-byte boundary I guess you have to take it?
kriander 6 days ago
A 15-20x performance hit sounds bad... but if we're talking about a 20-second difference, it may not actually matter; 20s is an acceptable amount of time for this search. If it were 1 min vs 20 min, that would be a big difference. I suppose we could have a fallback mechanism where, if no match is found, it switches back to uint8. We could even handle that externally, i.e. some shell script that checks the exit code and then fires off the non_64.py script.
KMbountify 6 days ago
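The fallback doesn't strictly need a shell wrapper; assuming a dtype-parameterized find_fragment like the NumPy sketch above, it can live in the same script:

import numpy as np

def search_with_fallback(frag_path, src_path):
    """Fast 8-byte-aligned pass first; byte-wise pass only if it finds nothing."""
    result = find_fragment(frag_path, src_path, dtype=np.uint64)
    if result is None:
        # No aligned match: pay the ~15-20x cost of the uint8 search.
        result = find_fragment(frag_path, src_path, dtype=np.uint8)
    return result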