Tomer Aberbach

To Dedupe Then Sort or Sort Then Dedupe?


I recently came across a deceptively simple problem. I wanted to dedupe and sort a list of integers in-place.

My initial instinct was to first remove duplicates and then sort using the programming language’s built-in sort function. In JavaScript that would look something like this:

const removeDuplicatesThenSort = array => {
  const seen = new Set()
  let insertIndex = 0
  for (let i = 0; i < array.length; i++) {
    const value = array[i]

    // Skip integers we've already seen.
    if (seen.has(value)) {
      continue
    }
    seen.add(value)

    // Place this unique integer at the next available index.
    array[insertIndex] = value
    // Increment to place the next unique integer at the next index.
    insertIndex++
  }
  // Shrink the array down to its new length.
  array.length = insertIndex

  array.sort((a, b) => a - b)
}
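
For example, calling the function on a small array mutates it in place (a quick usage sketch; the sample values are arbitrary):

const numbers = [5, 3, 5, 1, 3]
removeDuplicatesThenSort(numbers)
console.log(numbers)
// => [ 1, 3, 5 ]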

But then I realized this is not the only option. I could sort first and then remove duplicates:

const sortThenRemoveDuplicates = array => {
  array.sort((a, b) => a - b)

  if (array.length === 0) {
    return
  }
  let insertIndex = 1
  for (let i = 1; i < array.length; i++) {
    // Skip sequences of duplicate integers.
    const previous = array[i - 1]
    const current = array[i]
    if (current === previous) {
      continue
    }

    // Place this unique integer at the next available index.
    array[insertIndex] = current
    // Increment to place the next unique integer at the next index.
    insertIndex++
  }
  // Shrink the array down to its new length.
  array.length = insertIndex
}
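
It produces the same result as the first version (again a quick sketch with arbitrary sample values):

const numbers = [4, 9, 4, 2, 9]
sortThenRemoveDuplicates(numbers)
console.log(numbers)
// => [ 2, 4, 9 ]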

At first I thought this option must be more efficient than the other one because it doesn’t require tracking already seen integers in a set. But then I had another realization: removing duplicates first could shrink the size of the array, and that could speed up the subsequent sorting.

So which option is better?

Note

We won’t explore the possibility of removing duplicates while sorting because that would require rewriting the programming language’s built-in sorting algorithm, and it’s unlikely that we could match its performance¹.

Time and space complexity

Complexity analysis is a way to reason about the amount of resources, usually time and space, an algorithm uses as the size of its input increases. We can use it to objectively compare our two algorithms!

Let’s define some variables that describe our input array:

  • n is the total number of integers
  • d \leq n is the number of distinct integers
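
As a concrete point of reference, here’s how n and d relate for a small example array (purely illustrative; neither algorithm needs to compute d up front):

const array = [3, 1, 3, 7, 1]
const n = array.length // 5
const d = new Set(array).size // 3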

Dedupe then sort

Removing duplicates and then sorting takes O(n + d\log_2{d}) time (on average) and O(d) space because:

  • We loop through all n integers and perform constant time work (on average) for each one. That takes O(n) time.
  • We add d integers to a set by the end of the loop. That requires O(d) space.
  • We sort the remaining deduplicated integers, of which there are d. In general sorting takes quasilinear time, so sorting these integers takes O(d\log_2{d}) time.

I used the phrase “on average” a couple of times. That’s because if the set is a hash table, which it is in JavaScript, then insertion and lookup take linear time, not constant time, in the worst case.

That means that the algorithm takes O(n^2 + d\log_2{d}) = O(n^2) time in the worst case².

Sort then dedupe

Sorting then removing duplicates takes O(n\log_2{n}) time and O(1) space because:

  • We sort the input integers, of which there are n. That takes O(n\log_2{n}) time.
  • We loop through all n integers and perform constant time work for each one. That takes O(n) time.
  • We don’t allocate any space dependent on the input size. That requires only O(1) space.

Unlike the other algorithm, the average and worst case time complexities are the same.

Which is better?

The average case is most informative, but it’s still not immediately clear whether O(n + d\log_2{d}) is more efficient than O(n\log_2{n}) because we have two variables, not one. Let’s explore what happens when we assume different relationships between n and d.

<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>d</mi></mrow><annotation encoding="application/x-tex">d</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:.6944em"></span><span class="mord mathnormal">d</span></span></span></span> is <em>significantly</em> smaller than <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>n</mi></mrow><annotation encoding="application/x-tex">n</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:.4306em"></span><span class="mord mathnormal">n</span></span></span></span> permalinkdd is significantly smaller than nn

If d is significantly smaller than n, meaning there are many duplicate integers, then the d\log_2{d} term is dwarfed by n and the time complexity of removing duplicates and then sorting becomes O(n), which is certainly more efficient than O(n\log_2{n}).
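
For example, with n = 10^6 and d = 10^3 (values chosen purely for illustration), the two time bounds work out to roughly:

n + d\log_2{d} \approx 10^6 + 10^3 \cdot 10 \approx 1.01 \cdot 10^6
n\log_2{n} \approx 10^6 \cdot 20 = 2 \cdot 10^7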

There’s still the matter of the O(d) space complexity, but that’s unlikely to be a problem unless d, and consequently n, are very large.

<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>d</mi><mo>≈</mo><mi>n</mi></mrow><annotation encoding="application/x-tex">d \approx n</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:.6944em"></span><span class="mord mathnormal">d</span><span class="mspace" style="margin-right:.2778em"></span><span class="mrel">≈</span><span class="mspace" style="margin-right:.2778em"></span></span><span class="base"><span class="strut" style="height:.4306em"></span><span class="mord mathnormal">n</span></span></span></span> permalinkdnd \approx n

If d and n are not significantly different, meaning there are few or no duplicate integers, then the time complexity of both algorithms becomes O(n\log_2{n}).
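
Concretely, substituting d \approx n into the first algorithm’s time complexity gives:

O(n + d\log_2{d}) = O(n + n\log_2{n}) = O(n\log_2{n})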

However, removing duplicates and then sorting also requires O(d) space, which under our assumption is approximately O(n), while the other algorithm needs no extra space at all. Plus, in practice that set also takes additional time to allocate, insert into, and query, even if that time doesn’t show up in the time complexity.

So in this case sorting and then removing duplicates is more efficient.

Which is better empirically?

Our theoretical analysis suggests that we should sort and then remove duplicates if we expect few duplicates. Otherwise, we should remove duplicates and then sort. But is that empirically true? And if so, then how many duplicates does it take for switching algorithms to be worthwhile?

I benchmarked to find out! I ran each algorithm on a shuffled array of 100,000 integers with varying percentages of duplicate integers, and this was the result:

[Chart: running time of each algorithm versus the percentage of duplicate integers]

This confirms that the overhead from the set used when removing duplicates and then sorting isn’t always offset by the reduced number of integers to sort later³.

In the case of 100,000 integers the overhead is only offset once roughly 27% of the integers are duplicates. I also benchmarked some other array sizes and the boundary varies. For example, for 100 integers the boundary was around 10%. Do your own benchmarking if you’re going to depend on this!
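
If you want to run your own numbers, here’s a minimal sketch of what such a benchmark can look like. It’s not the exact harness behind the numbers above; the array construction and the duplicateFraction parameter are illustrative assumptions.

const benchmark = (size, duplicateFraction) => {
  // Build an array of `size` integers containing `distinctCount` distinct
  // values, so roughly `duplicateFraction` of the entries are duplicates.
  const distinctCount = Math.max(1, Math.round(size * (1 - duplicateFraction)))
  const array = Array.from({ length: size }, (_, i) => i % distinctCount)

  // Fisher-Yates shuffle.
  for (let i = array.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1))
    ;[array[i], array[j]] = [array[j], array[i]]
  }

  for (const algorithm of [removeDuplicatesThenSort, sortThenRemoveDuplicates]) {
    // Time each algorithm on its own copy so one run doesn't affect the other.
    const copy = [...array]
    const start = performance.now()
    algorithm(copy)
    console.log(`${algorithm.name}: ${(performance.now() - start).toFixed(2)} ms`)
  }
}

benchmark(100_000, 0.27)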

Conclusion

The appropriate approach depends on the expected percentage of duplicates. And problems that seem simple… aren’t.

Footnotes

  1. The sorting algorithms built into programming languages are meticulously optimized by being written in low-level programming languages and by using state-of-the-art algorithms like Timsort.

  2. n^2 asymptotically dominates d\log_2{d} because n \geq d, so we can omit d\log_2{d}.

  3. You might be wondering why sorting and then removing duplicates becomes so much more efficient when all integers in the input array are the same (100% duplicates). I don’t know for certain, but I would guess the JavaScript sorting algorithm has a linear time optimization that skips sorting when the values are already in order, which is the case if all the values are the same!