Fun Programming Question 2022-04-18

OvermindDL1 · 18 April 2022 16:54

An interesting and yet simple programming question making the rounds the past few days, how does everyone here answer:

Given an input file with precisely 4 billion randomly-sorted non-negative 32-bit integers, provide an algorithm to generate a non-negative 32-bit integer that is not contained in the file.

Now of course the canonical answer is to create a zeroed 2^32-bit sized bit field (which consumes 512 megs of memory), and set each bit to 1 as you encounter its index in the file, then search for a 0 bit after, but that is memory heavy and potentially requires up to a near full second pass, how would you do it in, say, 10 megs or memory or so? What different ways and pattern with what different tradeoffs can you come up with?

bot · 18 April 2022 17:08

Corresponding tweet for this thread:

Share link for this tweet.

ohm · 19 April 2022 04:52

Since the list is sorted you could subtract two consecutive entries. If the result is not 1, then you have a missing number.

OvermindDL1 · 19 April 2022 14:54

“Randomly sorted”, not sorted. ^.^;
(clarified with a dash in OP)

AstonJ · 19 April 2022 15:16

Could you simply sort the list in ascending order then when you are done and have the largest integer simply add 1 to it, or, if while sorting the list there is a gap greater than 1 stop it there and add yours there? Not sure how long it’d take to sort 4 billion tho

OvermindDL1 · 19 April 2022 15:22

That means that you’d have to have the entire list loaded into memory, which would be 4 billion * 4 bytes, so just over 14.901 gigabytes of memory at the smallest, and you’d probably allocate to an even page space for easier sorting via radix or so, so 16 gigs. That’s a lot larger than, say, 10 megs or so.

Unless 2^32-1 was already in that list, then you can’t add +1 as it exceeds the 32-bit address space.

Not as long as you might think, being integers then it’s well suited for a fast radix sort, but that’s still loading near 16 gigs of memory. Using just a bit array and setting a bit to 1 for the index of the number only takes 512megs and its “naturally” sorted, so that method is already substantially faster, but still much larger than 10 megs or so of memory used.

See, it’s an interesting problem while looking initially very simple, with lots of possible solutions, and the ones you’d normally immediately jump to are not feasible. ^.^

AstonJ · 19 April 2022 16:17

Ah I see… in that case then could you simply split the file into multiple equal parts and work through each file individually (or if with a BEAM lang, concurrently) sorting each in ascending order but as soon as you come to a gap that’s greater than 1 stop it there and insert your number

mindriot · 19 April 2022 17:32

Wouldn’t that run the risk of the number you pick to insert being in one of the other files? If you sort each of your split parts the other split parts could have higher/lower values or any values in the gap.

One thing I observed playing around with this at a scaled down level (brought it down to 1 byte just to see if I could find some patterns) is that if you are only missing a single value of the range then it seems you can find it by applying xor to all the values you have. If there are multiple missing values or duplicates in the set then it doesn’t work, there might be something more clever that could be done though.

AstonJ · 19 April 2022 17:41

Oh yeah, I am half asleep today

The only other way I can think of is splitting the file into multiple parts/files, but also creating an equal amount of additional new files, and so when sorting through each original file you write each number into the correct new file… and then when you have done that you go through each new file to check for a gap. Probably faster ways tho

mindriot · 19 April 2022 17:59

Yeah, it is quite a tricky one to think of a good solution to and I’m imagining there is some trick that changes everything. If you only want to reduce the RAM usage you could basically just implement the original solution and store it in a file instead of memory, which is slower but also would work and just straight up move the 512MB from RAM to disk with a bunch of seeking around the file. At that point I’d imagine you would be IO bound and probably unlikely to benefit from concurrency, if you were to go concurrent though then you would probably want to make the file bigger as it is usually pretty hard to navigate a file at bit level and likely byte level is the best you could do, having multiple things manipulating the same byte of the file would probably be bad.

OvermindDL1 · 19 April 2022 19:57

One of the more popular ways to solve this, spoiler in case you want to figure it out:

Probabilistic Method

Remember that there are 4 billion numbers, in a 32-bit address space, which is 4294967296 in size, meaning that 294967296 are available, just need to find one. So if you, as an example, just select 100 unique numbers, then parse through the input once weeding them out for what you find, then you have a greater than 99.9% chance of finding a valid number just in the first pass, and if it doesn’t work then you can repeat it again and again until it does, probabilistically it will work quickly, and even more quickly if you select even more numbers to test with, 10 megs worth of numbers (2500000 of them) means that even in the worst case you loop over the input again and again 1599 times, which is still not ‘too’ bad even if not great, and that would only happen if you didn’t select any of the 294967296 available numbers until the very end.

There are other ways still!

kokolegorille · 19 April 2022 23:26

If the problem is funny for You, it might mean it is unsolvable for me

OvermindDL1 · 20 April 2022 15:06

Nah, that’s the point of it, to be a fun challenge! ^.^

davearonson · 20 April 2022 23:34

When I saw “probabilistic” I was thinking maybe Bloom filters. Depending how you set it all up, you might wind up with a false-positive in some missing values, but you might still have a zero. Might even set up multiple Bloom filters with different parameters.

OvermindDL1 · 21 April 2022 17:05

Bloom filters keep coming up a lot, but they tend to only work well with more sparse data, which isn’t the case here, it’s very dense.