HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaMinor

Removing duplicates in a char[] O(n)

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
charduplicatesremoving

Problem

We're given a char[] containing characters A-Z. (For more characters this can easily be extended). Now we're asked to remove all duplicates in the char[].

For example, in: \$['B','A','A']\$, out \$['B','A']\$.

The algorithm:

public void removeDuplicates(final char[] s) {
  if (s == null || s.length < 2) {
    return;
  }

  int checker = 0;
  int j = 0;
  for (int i = 0; i < s.length; i++) {
    if ((checker & 1 << s[i] - 'A') == 0) {
      s[j] = s[i];
      j++;
      checker |= 1 << s[i] - 'A';
    }
  }

  while (j < s.length) {
    s[j] = 0;
    j++;
  }
}


Here, checker is an int whose bits represent whether we've seen a character before. For example, if we've seen A and C, checker would be 5 (101).

i is our next character to check, and for \$j > 0\$, \$[s[0], s[1], ... , s[j-1]]\$ contains no duplicate characters.

Worst case: \$2n-1 = O(n)\$

Test cases:

  • null -> null



  • [] -> []



  • ['A'] -> ['A']



  • ['A','B'] -> ['A','B']



  • ['A','A'] -> ['A']



  • ['B','A','A'] -> ['B','A']



  • ['A','B','C','D'] -> ['A','B','C','D']



  • ['A','B','A','D'] -> ['A','B','D']



  • ['A','B','A','D','D','A'] -> ['A','B','D']



I am looking for any flaws, mistakes and opinions in general.

Solution

Caveat: Unicode has been around for over 25 years. Unicode's 110,000 characters are four orders of magnitude more than the 26 under consideration. In the messy real world, any interesting problem involving character processing is highly likely to bump into it..

Data Structures

At scale, the bit by bit data representation is impractical because parsing 110,000 bit integers [or even 256 bit integers representing all of ASCII] is likely to be much harder than the original problem (i.e. implementing consistent 256 bit integer arithmetic is non-trival).

This is not to say that the general problem is overly difficult. To the contrary, it is well understood because it is so common. A common data structure addressing this class of problems is the Hash Table because hash tables have constant time [O(1)] lookup. For large ranges of keys, the overhead of implementing a Hash Table is often overwhelmed by the savings in lookup.

Implementation

Hash Table data structures are available as a robust Java implementation:

  • Hash Tables are available directly in Java via HashMap.



A pseudo-code implementation:

forEach character in characters
   myHashTable.insert(character, TRUE)

return myHashTable.keys.toArray()


Performance

For special applications an approach such as that described in the question may offer better performance than a generalized approach using HashMap. A possible alternative in such cases might be Bloom Filters. A field tested Java implementation of Bloom Filters is available via Google's Guava library.

Code Snippets

forEach character in characters
   myHashTable.insert(character, TRUE)

return myHashTable.keys.toArray()

Context

StackExchange Code Review Q#83103, answer score: 6

Revisions (0)

No revisions yet.