I personally hate it when people post clickbait titles and take their sweet time getting to the point, so let’s do this first:
TL;DR: Some Elixir string operations, most notably `String.at/2`, work in linear time rather than the constant time your intuition might suggest. This is because the `String` module is UTF-8 aware. UTF-8 encodes characters outside of ASCII with more than one byte, so in order to find the n-th character in a string you need to process it from the beginning; you can't just jump to an offset in memory.

In this blog post I go a bit more in depth on how UTF-8 works and compare some approaches to getting performant results, so even if you knew the TL;DR you might find something interesting here regardless.
Context
I was doing an algorithmic exercise involving string manipulation in Elixir and saw that even though my approach was correct and seemed pretty optimal, it was timing out on larger inputs - and timing out by a lot. After looking into it a bit I realised I was using `String.at/2` as if it were a C-style char array access operator, even though on some level I knew it didn't really work like that.
Let’s make it a bit more specific and look at a concrete example. Here’s an efficient way of checking if an (ASCII) string is a palindrome in C:
```c
// code taken from https://rosettacode.org/wiki/Palindrome_detection#C
#include <stdbool.h>
#include <string.h>

bool palindrome(const char *s)
{
   int i, l;
   l = strlen(s);
   for (i = 0; i < l / 2; i++)
   {
      if (s[i] != s[l - i - 1]) return false;
   }
   return true;
}
```
The approach is simple enough: it runs in $O(n)$ worst-case time and doesn't do the same work twice (it never crosses the middle of the string). Here's a naive implementation of the same idea in Elixir:
```elixir
def check_palindrome_with_string_at(s) do
  len = String.length(s)

  Range.new(0, div(len, 2) - 1, 1)
  |> Enum.reduce_while(true, fn index, _still_palindrome ->
    if String.at(s, index) == String.at(s, len - index - 1) do
      {:cont, true}
    else
      {:halt, false}
    end
  end)
end
```
Again, simple enough, and a good example of how you can use `Enum.reduce_while/2` where in an imperative language you'd reach for a for loop. The problem is, as I hinted before, that both `String.length/1` and `String.at/2` are linear-time operations, which brings the whole algorithm to $O(n^2)$. That might not sound so bad, but if you run some benchmarks you'll see that running it on an ASCII string 50kB in length takes more than half a minute.* That's unacceptable for something that should take essentially as much time as iterating through the string!

*Note: this is the worst-case running time - when the long string actually is a palindrome. For a random string that starts and ends with different characters it would be much faster, and not much slower than the more optimized approaches.
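If you want to reproduce the worst case yourself, the simplest palindromic input I can think of is a single repeated character (an assumption for illustration - the benchmark repo may build its inputs differently):

```elixir
# A 50 kB ASCII string that is trivially a palindrome - a worst-case
# input for the early-exit palindrome checks. Illustrative assumption;
# the benchmarks in the repo may generate their inputs differently.
input = String.duplicate("a", 50_000)
```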
Why is it like this?
A small ASCII / Unicode / UTF-8 refresher - feel free to skip ahead.
Unlike the ASCII encoding, where every character is encoded in exactly one byte and many characters can't be encoded at all, the dominant encoding we use these days is UTF-8, one of the encodings defined by the Unicode Standard.
Unicode lets us represent virtually any glyph humanity came up with (including emojis, but excluding Klingon), and most of them have their own code point - a number that identifies the glyph, usually written in hexadecimal. The way the standard is set up, the maximum code point value is 1114111 (0x10FFFF) - way more than you could represent in one, or even two, bytes.
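You can poke at code points directly in iex; the `?` prefix gives you the code point of a character literal:

```elixir
iex> 0x10FFFF  # the maximum Unicode code point, in decimal
1114111
iex> ?a        # code point of a character literal in Elixir
97
iex> ?é        # the precomposed é, U+00E9
233
```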
The neat thing is, that the codepoints from 0 to 127 are the same in ASCII and Unicode, and UTF-8 encodes them in just one byte. This means for most simple English texts an algorithm that works for ASCII encoding will happily work on a text encoded with UTF-8.
So, since many of the code points don't fit in a single byte, UTF-8 has a mechanism for expressing multibyte code points. It's a prefix encoding: if a byte starts with a `0` bit, it's a single-byte, ASCII-like value; if it starts with `11`, it's the first byte of a multi-byte sequence (and the number of leading `1` bits tells you how many bytes the sequence has); and if it starts with `10`, it's a continuation byte inside such a sequence. If this is a little confusing, don't worry, I'll add some examples below.
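This structure is easy to express with Elixir's bitstring pattern matching. Here's a minimal sketch (the module and function names are mine, made up just for this post):

```elixir
# Classify a single UTF-8 byte by its leading bits (illustrative
# sketch; the names are invented for this example).
defmodule Utf8Peek do
  def byte_class(<<0::1, _::7>>), do: :ascii             # 0xxxxxxx
  def byte_class(<<0b10::2, _::6>>), do: :continuation   # 10xxxxxx
  def byte_class(<<0b110::3, _::5>>), do: :leading_2     # 110xxxxx
  def byte_class(<<0b1110::4, _::4>>), do: :leading_3    # 1110xxxx
  def byte_class(<<0b11110::5, _::3>>), do: :leading_4   # 11110xxx
end

Utf8Peek.byte_class(<<0x65>>) # => :ascii
Utf8Peek.byte_class(<<0xCC>>) # => :leading_2
Utf8Peek.byte_class(<<0x81>>) # => :continuation
```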
To complicate things a bit more, some codepoints are not “standalone” but used in combination with others, to build grapheme clusters (also known as graphemes), which have a specific meaning.
Here are two examples:
Combining `e` (U+0065, Latin Small Letter E) with U+0301 (Combining Acute Accent) gives us é - which looks just like the precomposed é (U+00E9), even though it's encoded differently:
| grapheme   | é          |            |          |
|------------|------------|------------|----------|
| codepoints | e (U+0065) | ́ (U+0301) |          |
| bytes      | 0x65       | 0xCC       | 0x81     |
| binary     | 01100101   | 11001100   | 10000001 |
Similarly, this is how you construct the “OK hand sign with medium skin tone emoji”:
| grapheme   | 👌🏽          |          |          |          |             |          |          |          |
|------------|--------------|----------|----------|----------|-------------|----------|----------|----------|
| codepoints | 👌 (U+1F44C) |          |          |          | 🏽 (U+1F3FD) |          |          |          |
| bytes      | 0xF0         | 0x9F     | 0x91     | 0x8C     | 0xF0        | 0x9F     | 0x8F     | 0xBD     |
| binary     | 11110000     | 10011111 | 10010001 | 10001100 | 11110000    | 10011111 | 10001111 | 10111101 |
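You can verify both examples from iex - note how the grapheme count, the code point count, and the byte count all differ (the `\u`-escaped string below is the combining form of é):

```elixir
s = "e\u0301"                 # "é" built from e + combining acute accent
String.length(s)              # => 1  (one grapheme)
String.codepoints(s)          # => ["e", "́"]  (two code points)
byte_size(s)                  # => 3  (three bytes)

ok = "👌🏽"                     # OK hand sign + medium skin tone modifier
String.length(ok)             # => 1
length(String.codepoints(ok)) # => 2
byte_size(ok)                 # => 8
```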
If you want to pick apart arbitrary Unicode text yourself, ardislu's tool is great for it.
What about those palindromes then?
Now we have a good idea of how UTF-8 works, and understand that Elixir's `String` module "thinks" in terms of graphemes (not bytes). That explains why `String.length/1` is $O(n)$ - with variable grapheme byte sizes it needs to "read" the whole binary and count the characters. Similarly for `String.at/2` - in order to fish out the n-th character, it has to walk the binary from the start to find where that grapheme actually begins.
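You could implement `String.at/2` yourself on top of `String.next_grapheme/1`, which makes the linear walk explicit (a sketch, not the actual stdlib implementation):

```elixir
# Walks the string one grapheme at a time - there's no way to jump
# straight to the n-th grapheme. Sketch only; the real String.at/2 is
# implemented differently, but has the same O(n) shape.
def my_at(s, n) when n >= 0 do
  case String.next_grapheme(s) do
    {grapheme, _rest} when n == 0 -> grapheme
    {_grapheme, rest} -> my_at(rest, n - 1)
    nil -> nil
  end
end
```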
Does that mean our precious palindrome checking function can’t be made any faster? Of course not!
If we're sure we're only handling ASCII characters (like the C program from before was), we can use `byte_size/1` and `binary_part/3`, which make no assumptions about the data held by the binary:
```elixir
def check_palindrome_with_binary_part(s) do
  len = byte_size(s)

  Range.new(0, div(len, 2) - 1, 1)
  |> Enum.reduce_while(true, fn index, _still_palindrome ->
    if binary_part(s, index, 1) == binary_part(s, len - index - 1, 1) do
      {:cont, true}
    else
      {:halt, false}
    end
  end)
end
```
This can handle our 50kB palindrome in a mere 2ms - that's more than 20,000x faster!
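One caveat worth spelling out: `binary_part/3` slices raw bytes, so on non-ASCII input it can happily cut a multi-byte character in half:

```elixir
# The precomposed é is two bytes (0xC3 0xA9), so taking just one byte
# of it yields an invalid UTF-8 fragment:
iex> binary_part("é", 0, 1)
<<195>>
iex> String.valid?(binary_part("é", 0, 1))
false
```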
How about if we just do the check “by definition” and check if the reversed string is the same as the string forwards?
```elixir
def check_palindrome_with_string_reverse(s) do
  s == String.reverse(s)
end
```
This solution is embarrassingly simple, handles Unicode, and it turns out it’s almost as fast - under 3ms on my machine. Which makes sense, because we’re just doing linear operations sequentially in this one.
Another approach, suggested by @codeaddict, is using `to_charlist/1`:
```elixir
def check_palindrome_charlist(s) do
  erlang_str = to_charlist(s)
  :string.equal(erlang_str, :lists.reverse(erlang_str))
end
```
It is more than 2x faster than the `check_palindrome_with_binary_part` solution, but, interestingly, it uses much more memory - around 620kB for the 50k input, whereas `check_palindrome_with_binary_part` uses under 1kB. Intuitively the bigger memory consumption makes sense, since the function needs to build two very long lists in order to do its job, instead of working on basically just the input data structure.
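The benchmark results below also include a `check_palindrome_charlist_optimized` variant. I won't reproduce its exact code here, but one way to optimize the idea is to only reverse half of the list - a hedged sketch:

```elixir
# One possible optimization of the charlist approach: split the list in
# half and compare the front half against the reversed back half, so we
# only reverse half as much data. A sketch only - the
# check_palindrome_charlist_optimized from the benchmarks may be
# implemented differently.
def check_palindrome_charlist_halved(s) do
  chars = to_charlist(s)
  len = length(chars)
  {front, back} = :lists.split(div(len, 2), chars)
  # for odd lengths, skip the middle character
  back = if rem(len, 2) == 1, do: tl(back), else: back
  front == :lists.reverse(back)
end
```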
I added the full benchmarks for the longest input below.
Was it all for nothing then?
I don't think so.
Our toy example isn't the end-all-be-all of string manipulation algorithms - the `String.reverse/1` trick wouldn't have worked in my original task. But understanding the trade-offs helped me come up with a solution that's simple, correct and fast.
But mainly: understanding the behaviour and performance of the functions we use can be critical. The way the `String` module works helps programmers not shoot themselves in the foot in the most common scenarios, but there are plenty of situations where you might want to use the lower-level functions (either exclusively or as a complement to the high-level ones).
In the nietaki/string_playground repo I benchmarked the functions mentioned in this blog post and a couple of others, and added some property-based tests for them - you can use it as a starting point if you want to run some experiments of your own. As an example of how the learnings from this blog post can be used in a slightly more realistic scenario, I added the `Adaptive.slice_beginning/2` function, which can be used to slice chunks off of the beginning of a string without breaking any grapheme clusters, but also without doing the work of scanning the string from the beginning.
NOTE: In order not to obscure the main point I didn't cover string normalization. In our palindrome example we might want to normalize the strings before checking their palindromicity, to make sure we're not producing false negatives.
Benchmarks
You’ll find all the benchmarked approaches here.
```
##### With input 50000 #####
Name                                       ips        average  deviation         median         99th %
check_palindrome_charlist_optimized    1529.12        0.65 ms    ±71.22%        0.35 ms        1.88 ms
check_palindrome_charlist              1246.36        0.80 ms    ±61.33%        0.48 ms        2.08 ms
check_palindrome_with_binary_part       610.27        1.64 ms     ±9.25%        1.60 ms        2.35 ms
check_palindrome_with_string_reverse    391.64        2.55 ms    ±10.48%        2.48 ms        3.63 ms
check_palindrome_with_graphemes         129.10        7.75 ms    ±52.14%        6.58 ms       27.08 ms
check_palindrome_with_string_at         0.0264    37840.16 ms     ±0.00%    37840.16 ms    37840.16 ms

Comparison:
check_palindrome_charlist_optimized    1529.12
check_palindrome_charlist              1246.36 - 1.23x slower +0.148 ms
check_palindrome_with_binary_part       610.27 - 2.51x slower +0.98 ms
check_palindrome_with_string_reverse    391.64 - 3.90x slower +1.90 ms
check_palindrome_with_graphemes         129.10 - 11.84x slower +7.09 ms
check_palindrome_with_string_at         0.0264 - 57862.00x slower +37839.51 ms

Memory usage statistics:
Name                                   Memory usage
check_palindrome_charlist_optimized         0.62 MB
check_palindrome_charlist                   0.62 MB - 1.00x memory usage +0 MB
check_palindrome_with_binary_part        0.00019 MB - 0.00x memory usage -0.61948 MB
check_palindrome_with_string_reverse        5.34 MB - 8.62x memory usage +4.72 MB
check_palindrome_with_graphemes             7.25 MB - 11.70x memory usage +6.63 MB
check_palindrome_with_string_at        114448.97 MB - 184694.30x memory usage +114448.36 MB

**All measurements for memory usage were the same**
```