A Linear-Time Selection Algorithm
Quickly finding a median value in unordered data without sorting
A while back, I had to come up with a way to quickly find the median 32-bit unsigned integer in a large, unordered data set. We had space and complexity constraints, as it was intended to run in hardware, so sorting (or even the in-place reordering that Quickselect does) was out of the question. I came up with an approach that I haven’t come across since.
In short, I created a histogram-based selection algorithm that exploits the bounded size of the integer data type. It has fixed space complexity, linear time complexity, and doesn’t reorder or sort anything.
Interestingly, the algorithm doesn’t actually select the value in the end… it constructs it. And while my original algorithm was intended for unsigned integers, it works for anything that can be mapped to this type, including floats.
An Algorithm for Integers
A Bad Approach
It might be best to ease into the idea with a trivial but horribly inefficient algorithm. Let's start with 32-bit unsigned integers and a histogram that is 2³² in size, such that there is a one-to-one mapping from a value to a histogram bucket. To populate the histogram, the algorithm iterates through each value of the input once, incrementing its corresponding bucket in the histogram.
After iterating through data once, the algorithm iterates through the histogram once, counting up from the beginning to find which bucket the kth element lives in. The bucket index found directly corresponds to the kth value.
def select(data, kth):
    histogram = [0] * (2 ** 32)
    for value in data:
        histogram[value] += 1

    total_population = 0
    for bucket, bucket_population in enumerate(histogram):
        # Is the kth element in this bucket?
        if total_population + bucket_population > kth:
            return bucket
        total_population += bucket_population
Technically, it didn’t find the kth value itself, but where the value lives in the histogram. Also technically, this is a linear-time search algorithm (sort of), and it works. Problem solved, right?
No. This algorithm is stupidly slow and uses a ton of memory.
But, perhaps we can make do with a smaller, more manageable histogram?
A Better Approach
How about… divide and conquer?
If the histogram only tracks the top (most significant) half of each integer in the data set, it would give a rough estimate of where the kth element lives. Each bucket in the histogram would indicate the most significant bits of the kth element, and a corresponding population, but nothing more. It’s a good start, but it’s not precise. (Note: Let's refer to the difference between the bucket’s imprecise starting population offset and k as the residual offset.)
Let's apply this approach again to get a more precise result! A second round will build a histogram using the least significant bits, but only using values that start with the most significant bits from the first round. This round can be thought of as improving the precision of the result by finding the bucket that contains the residual-th element. The final, precise result is constructed by concatenating the results from each round.
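The two rounds can be sketched concretely for 16-bit values examined 8 bits at a time (the function name select16 is mine, not from the original):

```python
def select16(data, kth):
    # Round 1: histogram over the most significant byte
    hist = [0] * 256
    for v in data:
        hist[v >> 8] += 1
    total = 0
    for top, pop in enumerate(hist):
        if total + pop > kth:
            break
        total += pop
    # Residual offset: how far into the chosen bucket the kth element sits
    residual = kth - total

    # Round 2: histogram over the least significant byte, restricted to
    # values whose most significant byte matched round 1
    hist = [0] * 256
    for v in data:
        if v >> 8 == top:
            hist[v & 0xFF] += 1
    total = 0
    for low, pop in enumerate(hist):
        if total + pop > residual:
            break
        total += pop

    # Concatenate the two round results into the final value
    return (top << 8) | low
```

Nothing is ever moved or sorted: each round only counts, picks a bucket, and narrows the set of values the next round will count.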
A visual example is below, featuring two histograms generated when analyzing 150,000 random 16-bit values, 8-bits at a time. The median was 0x7fcc, or 32716. Not surprising, as it was a uniform distribution in this test.
This approach can be iteratively applied and generalized for any number of divisions of the input data type. Rounds could even use differently sized histograms if you want to spice things up.
An Implementation (In Python)
My chosen approach here is to examine 32-bit unsigned integers one byte (8 bits) at a time, but that can easily be tweaked:
DATA_BITS = 32
DATA_MASK = (1 << DATA_BITS) - 1
BUCKET_BITS = 8
BUCKET_MASK = (1 << BUCKET_BITS) - 1

def select(data, kth):
    assert DATA_BITS % BUCKET_BITS == 0
    assert len(data)
    assert 0 <= kth < len(data)

    # Resulting value is built up from the histogram
    result = 0
    # Refined for each histogram
    residual_offset = kth

    # Sweep through bits related to our bucket, starting with msb
    for bit_offset in range(0, DATA_BITS, BUCKET_BITS):
        histogram = [0] * (2 ** BUCKET_BITS)
        # Comparison mask between the running result and the input
        comp_mask = DATA_MASK << (DATA_BITS - bit_offset)
        for value in data:
            assert 0 <= value < 2 ** DATA_BITS
            if (result ^ value) & comp_mask == 0:
                bucket = bucket_from_value(value, bit_offset)
                histogram[bucket] += 1

        # Which bucket contains the desired element?
        bucket, bucket_population_offset = \
            get_bucket_and_offset(histogram, residual_offset)
        result |= value_from_bucket(bucket, bit_offset)
        residual_offset -= bucket_population_offset

    return result
Supporting functions:
def bucket_from_value(value, bit_offset):
    return (value >> (DATA_BITS - BUCKET_BITS - bit_offset)) & BUCKET_MASK

def value_from_bucket(bucket, bit_offset):
    return bucket << (DATA_BITS - BUCKET_BITS - bit_offset)

def get_bucket_and_offset(histogram, kth):
    """Return the bucket and pop offset containing the kth element

    Note: kth is 0-based
    """
    total_population = 0
    for bucket, bucket_population in enumerate(histogram):
        # Is the kth element in this bucket?
        if total_population + bucket_population > kth:
            return bucket, total_population
        total_population += bucket_population
    raise RuntimeError(f"Element {kth} not found")
Some notes on this implementation:
- Each integer could alternatively be represented as a tuple of unsigned values, divided up any way as you see fit.
- Python integers are weird, but they work fine for this example.
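With the functions above in scope, a quick sanity check against Python's built-in sorted() looks like this (a condensed restatement of select() is included here so the snippet runs standalone; the test data is made up):

```python
DATA_BITS = 32
BUCKET_BITS = 8
DATA_MASK = (1 << DATA_BITS) - 1
BUCKET_MASK = (1 << BUCKET_BITS) - 1

def select(data, kth):
    # Condensed restatement of the select() above, for a standalone demo
    result = 0
    residual_offset = kth
    for bit_offset in range(0, DATA_BITS, BUCKET_BITS):
        shift = DATA_BITS - BUCKET_BITS - bit_offset
        # Mask covering the bits already resolved by previous rounds
        comp_mask = (DATA_MASK << (DATA_BITS - bit_offset)) & DATA_MASK
        histogram = [0] * (1 << BUCKET_BITS)
        for value in data:
            if (result ^ value) & comp_mask == 0:
                histogram[(value >> shift) & BUCKET_MASK] += 1
        total = 0
        for bucket, population in enumerate(histogram):
            if total + population > residual_offset:
                break
            total += population
        result |= bucket << shift
        residual_offset -= total
    return result

data = [0xDEADBEEF, 0x00C0FFEE, 0x12345678, 0x0BADF00D]
assert select(data, len(data) // 2) == sorted(data)[len(data) // 2]
```

The same check holds for any kth, not just the median, and for data with duplicate values.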
Using Other Types
Any data type, even complex data containers, can also be used with this algorithm. The only thing needed is a one-to-one mapping to and from an unsigned integer, or a tuple of unsigned values.
So what about floats? Floats are really not that much different than integers. Well… OK, yes, they are different in lots of ways, but not so different that this algorithm won’t work.
IEEE 754 binary32 floats pack three fields into 32 bits: a sign bit, an 8-bit exponent, and a 23-bit fraction.
Let’s consider each part:
- The sign uses 0 representing positive, and 1 representing negative.
- The exponent is a biased exponent: the stored value is the true exponent plus 127. This can be treated as an unsigned value for the purpose of comparison.
- The fraction is multiplied against 2^exponent (with an implicit leading 1 for normal values). This is an unsigned value.
- The exponent and fraction combined can be quickly compared between two floats as an unsigned binary value.
A key observation: when considering two positive floats, the one with the larger exponent is always larger, no matter what is in the fraction. For the sake of comparison, and ignoring NaN, but even including ±inf, a float is pretty similar to a signed-magnitude integer, and can therefore be easily mapped to an unsigned integer space, or split up into a tuple of unsigned values for comparison.
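One way to sketch that mapping for binary32 (the function name is mine, and NaN is deliberately not handled): set the sign bit on positive floats, and invert all bits of negative floats, so the resulting unsigned integers sort in the same order as the floats they came from.

```python
import struct

def float_to_sortable_uint(f):
    """Map a binary32 float to an unsigned int with the same ordering.

    Positive floats already compare correctly against each other as raw
    bits; negative floats compare in reverse, so inverting all of their
    bits (and setting the sign bit on positives) makes the whole range
    monotonic. NaN is not handled here.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", f))
    if bits & 0x80000000:
        return bits ^ 0xFFFFFFFF  # negative: invert all bits
    return bits | 0x80000000      # positive: set the sign bit
```

Feeding these mapped values to the integer selection algorithm, then inverting the mapping on the constructed result, yields the kth float.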
Complexity
It should be somewhat obvious that there is a time-space tradeoff to this algorithm. We can decrease the memory requirement and histogram scanning time by examining only a small number of bits at a time. But, that increases the overall number of times we need to iterate through the input.
Writing the complexity out in terms of m bits examined per iteration, M total bits in our integer, and n input elements:
- Space requirements (histogram): 2ᵐ
- Number of times we iterate over data (input and histogram): M/m
Our runtime complexity is therefore M/m × (n + 2ᵐ), which is O(n) for fixed M and m.
Bonus: no matter the order of the data, or which element you’re looking for, runtime is nearly the same, only varying slightly based on the resulting histogram location. As we iterate over the bytes of each value, we build up a more precise result. A system that doesn’t need a perfectly accurate median value could even terminate early.
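For instance, stopping after the first 8-bit round yields the median's most significant byte, i.e. an estimate within 2²⁴ of the true value (this helper is a hypothetical sketch, not part of the original algorithm):

```python
def approx_median_top_byte(data):
    """Run only the first round of the algorithm on 32-bit values.

    Returns the lower bound of the histogram bucket containing the
    median, which is accurate to within 2**24 of the true median.
    """
    hist = [0] * 256
    for v in data:
        hist[v >> 24] += 1
    kth = len(data) // 2
    total = 0
    for bucket, pop in enumerate(hist):
        if total + pop > kth:
            return bucket << 24
        total += pop
```

Each additional round tightens the bound by another factor of 2⁸, so a system can trade accuracy for passes over the data.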
Conclusion
Like Quickselect, this algorithm takes a divide-and-conquer approach, but with data-independent divisions and consistent, fast, linear performance.
If you happen to have seen this before, or (even better) if you know of a proper name for this algorithm, leave a comment with a source!