An interesting yet simple-looking programming question has been making the rounds the past few days; how would everyone here answer it?
Given an input file with precisely 4 billion randomly ordered non-negative 32-bit integers, provide an algorithm to generate a non-negative 32-bit integer that is not contained in the file.
Now of course the canonical answer is to create a zeroed 2^32-bit bit field (which consumes 512 megs of memory), set each bit to 1 as you encounter its index in the file, then search for a 0 bit afterwards. But that is memory heavy and potentially requires a near-full second pass, so how would you do it in, say, 10 megs of memory or so? What different approaches, with what different tradeoffs, can you come up with?
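For the curious, here's a minimal sketch of that bit-field approach, assuming the input is a raw binary file of little-endian unsigned 32-bit integers (the file layout is my assumption; the problem doesn't specify one):

```python
import struct

def find_missing(path):
    # 2^32 bits = 512 MiB, all zero; bit n gets set once n is seen
    seen = bytearray((1 << 32) // 8)
    with open(path, "rb") as f:
        while chunk := f.read(1 << 20):  # stream the file in 1 MiB chunks
            for (n,) in struct.iter_unpack("<I", chunk):
                seen[n >> 3] |= 1 << (n & 7)
    # With only 4 billion of 4,294,967,296 possible values present,
    # a 0 bit is guaranteed to exist somewhere
    for i, byte in enumerate(seen):
        if byte != 0xFF:
            for b in range(8):
                if not (byte >> b) & 1:
                    return (i << 3) | b
```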
Could you simply sort the list in ascending order, then when you are done take the largest integer and add 1 to it? Or, if while sorting the list there is a gap greater than 1, stop there and put your number in the gap? Not sure how long it'd take to sort 4 billion though.
That means you'd have to have the entire list loaded into memory, which at 4 billion × 4 bytes is 16 GB (just over 14.9 GiB) at the smallest, and you'd probably allocate to an even page boundary for easier sorting via radix or the like, so call it 16 gigs. That's a lot larger than, say, 10 megs or so.
Unless 2^32 - 1 was already in that list; then you can't add 1, as that would overflow the 32-bit range.
Not as long as you might think: integers are well suited to a fast radix sort, but that's still loading nearly 16 gigs into memory. Using just a bit array and setting the bit at each number's index to 1 takes only 512 megs and is "naturally" sorted, so that method is already substantially faster, but it's still much larger than 10 megs or so of memory.
See, it's an interesting problem: it looks very simple initially and has lots of possible solutions, but the ones you'd normally immediately jump to are not feasible. ^.^
Ah I see… in that case could you split the file into multiple equal parts and work through each file individually (or, with a BEAM lang, concurrently), sorting each in ascending order, but as soon as you come to a gap greater than 1, stop there and insert your number?
Wouldn't that run the risk of the number you pick being in one of the other files? Even if you sort each of your split parts, the other parts could hold higher or lower values, or any of the values in your gap.
One thing I observed playing around with this at a scaled-down level (I brought it down to 1 byte just to see if I could find some patterns): if you are only missing a single value of the range, it seems you can find it by applying XOR to all the values you have. If there are multiple missing values or duplicates in the set then it doesn't work, but there might be something more clever that could be done.
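A quick demo of that at the 1-byte scale. The reason it works: the XOR of any complete power-of-two range is 0 (each pair (2k, 2k+1) XORs to 1, and there's an even number of such pairs), so the XOR of the set with one value removed equals exactly that removed value. The same argument applies to the full 32-bit range:

```python
import random
from functools import reduce
from operator import xor

# Take the full 1-byte range, remove one value, shuffle the rest
full = list(range(256))
missing = full.pop(random.randrange(256))
random.shuffle(full)

# XOR of 0..255 is 0, so XOR-ing the remaining values yields
# exactly the value that was removed
recovered = reduce(xor, full, 0)
assert recovered == missing
```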
The only other way I can think of is splitting the file into multiple parts/files, but also creating an equal number of additional new files, so that when sorting through each original file you write each number into the correct new file… and then when that's done you go through each new file checking for a gap. Probably faster ways though.
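For what it's worth, there is a well-known refinement of that partitioning idea that skips most of the file writing; this is a sketch of that variant rather than your exact scheme, with the same assumed binary layout as above. Split the 32-bit range into 65,536 buckets by the high 16 bits and count hits in a first pass; by pigeonhole, since 4 billion < 2^32, some bucket must get fewer than 65,536 hits and therefore be missing a value, and a second pass with a tiny bitmap over just that bucket pins it down:

```python
import struct

def find_missing(path):
    # Pass 1: count values per high-16-bit bucket (65,536 counters)
    counts = [0] * (1 << 16)
    with open(path, "rb") as f:
        while chunk := f.read(1 << 20):
            for (n,) in struct.iter_unpack("<I", chunk):
                counts[n >> 16] += 1
    # Any bucket with fewer than 65,536 hits is missing a value
    bucket = next(i for i, c in enumerate(counts) if c < (1 << 16))

    # Pass 2: an 8 KiB bitmap over just that bucket's 65,536 values
    seen = bytearray(1 << 13)
    with open(path, "rb") as f:
        while chunk := f.read(1 << 20):
            for (n,) in struct.iter_unpack("<I", chunk):
                if n >> 16 == bucket:
                    low = n & 0xFFFF
                    seen[low >> 3] |= 1 << (low & 7)
    for low in range(1 << 16):
        if not (seen[low >> 3] >> (low & 7)) & 1:
            return (bucket << 16) | low
```

Two passes total, and the working memory is conceptually 65,536 counters plus an 8 KiB bitmap, comfortably inside the 10 meg budget (a Python list of ints is bulkier than raw counters would be, but the idea holds).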
Yeah, it is quite a tricky one to think of a good solution to, and I'm imagining there is some trick that changes everything. If you only want to reduce the RAM usage, you could basically implement the original solution but store the bit field in a file instead of memory. That's slower, but it works: it just moves the 512 MB from RAM to disk, with a bunch of seeking around the file. At that point I'd imagine you'd be IO bound and unlikely to benefit from concurrency. If you did go concurrent, you'd probably want to make the file bigger, since it is usually pretty hard to navigate a file at the bit level and byte level is likely the best you could do; having multiple things manipulating the same byte of the file would probably be bad.
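A rough sketch of that disk-backed version, with the byte-granularity read-modify-write you describe (the bitmap file name is made up, and the input layout is assumed as before):

```python
import struct

def find_missing(input_path, bitmap_path="bitmap.tmp"):
    # Create a zeroed 512 MiB scratch file to hold the bit field
    with open(bitmap_path, "wb") as bm:
        bm.truncate((1 << 32) // 8)
    with open(input_path, "rb") as f, open(bitmap_path, "r+b") as bm:
        while chunk := f.read(1 << 20):
            for (n,) in struct.iter_unpack("<I", chunk):
                bm.seek(n >> 3)          # byte-level read-modify-write
                byte = bm.read(1)[0]
                bm.seek(n >> 3)
                bm.write(bytes([byte | (1 << (n & 7))]))
        # Scan the bitmap file for the first 0 bit
        bm.seek(0)
        pos = 0
        while block := bm.read(1 << 16):
            for i, byte in enumerate(block):
                if byte != 0xFF:
                    for b in range(8):
                        if not (byte >> b) & 1:
                            return ((pos + i) << 3) | b
            pos += len(block)
```

Memory-mapping the bitmap file and letting the page cache soak up the seeking would probably be the more idiomatic way to do the same thing.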
One of the more popular ways to solve this, spoiler in case you want to figure it out:
Probabilistic Method
Remember that there are 4 billion numbers in a 32-bit space, which holds 4,294,967,296 values, meaning 294,967,296 values are guaranteed to be absent; you just need to find one. So if you, as an example, select 100 unique numbers at random, then make one pass through the input weeding out any you encounter, you have a greater than 99.9% chance of being left with a valid number after just that first pass. If it doesn't work, you repeat with a fresh batch until it does; probabilistically it will finish quickly, and even more quickly if you select more numbers to test with. 10 megs worth of numbers (2,500,000 of them) means that even in the absolute worst case, where you don't select any of the 294,967,296 absent numbers until the very end, you loop over the input roughly 1600 times, which is still not "too" bad even if not great.
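A sketch of that with a batch of 100 (binary file layout assumed, as elsewhere in the thread). Any candidate that survives a full pass is provably absent; each candidate is present with probability at most 4,000,000,000 / 2^32 ≈ 0.93, so the chance that all 100 get eliminated in one pass is about 0.93^100 ≈ 0.08%:

```python
import random
import struct

def find_missing(path, batch_size=100):
    while True:
        # A fresh batch of distinct random candidates each attempt
        candidates = set(random.sample(range(1 << 32), batch_size))
        with open(path, "rb") as f:
            while chunk := f.read(1 << 20):
                for (n,) in struct.iter_unpack("<I", chunk):
                    candidates.discard(n)
                if not candidates:
                    break  # whole batch eliminated, try a new one
        if candidates:
            return candidates.pop()  # any survivor is not in the file
```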
When I saw "probabilistic" I was thinking maybe Bloom filters. Depending on how you set it all up, you might wind up with false positives on some missing values, but you might still have a zero. You might even set up multiple Bloom filters with different parameters.
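A sketch of how that could look; the filter size, hash count, and hash construction here are all illustrative choices on my part, not tuned for 4 billion insertions. The useful property is that Bloom filters never give false negatives, so any candidate the filter reports as absent really is absent:

```python
import struct
from hashlib import blake2b

FILTER_BITS = 1 << 30  # 128 MiB; illustrative, not tuned
NUM_HASHES = 3

def _positions(n):
    # Derive NUM_HASHES bit positions from a salted hash of the value
    for k in range(NUM_HASHES):
        h = blake2b(n.to_bytes(4, "little"), digest_size=8, salt=bytes([k]))
        yield int.from_bytes(h.digest(), "little") % FILTER_BITS

def build_filter(path):
    bits = bytearray(FILTER_BITS // 8)
    with open(path, "rb") as f:
        while chunk := f.read(1 << 20):
            for (n,) in struct.iter_unpack("<I", chunk):
                for p in _positions(n):
                    bits[p >> 3] |= 1 << (p & 7)
    return bits

def definitely_missing(bits, candidate):
    # A single 0 bit among the hash positions proves absence
    return any(not (bits[p >> 3] >> (p & 7)) & 1 for p in _positions(candidate))

# Usage: bits = build_filter("ints.bin")
#        missing = next(c for c in range(1 << 32) if definitely_missing(bits, c))
```

The catch is exactly the false positives you mention: with 4 billion insertions the filter has to be fairly large, or nearly every probe comes back "maybe present".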