String.to_existing_atom/1

...is a double-edged sword

Posted by nietaki on December 4, 2018

I’d argue Elixir has relatively few gotchas. It’s a simple and consistent language, and when you first learn it there are only a few things that are genuinely counter-intuitive and catch you by surprise.

One of the examples could be the difference between binaries and charlists and why iex sometimes seems to do weird things to your lists:

iex> l = [19, 7, 16, 119, 97, 116]
[19, 7, 16, 119, 97, 116]
iex> Enum.drop(l, 1)
[7, 16, 119, 97, 116]
iex> Enum.drop(l, 2)
[16, 119, 97, 116]
iex> Enum.drop(l, 3)
'wat'
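What’s going on here: a charlist is just a list of integer code points, and a list gets printed as text as soon as every element in it is a printable character. A minimal sketch you can paste into iex to convince yourself (the variable names are arbitrary):

```elixir
# A charlist is an ordinary list of integer code points.
wat = [119, 97, 116]

# The ~c sigil builds exactly the same list.
same_list = wat == ~c"wat"

# Only the last list from the session above consists purely of printable characters.
printable_tail = List.ascii_printable?(wat)
printable_full = List.ascii_printable?([19, 7, 16, 119, 97, 116])

# Asking inspect/2 for plain lists shows the integers again.
as_list = inspect(wat, charlists: :as_lists)

IO.inspect({same_list, printable_tail, printable_full, as_list})
```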

One of the other ones comes when you start working with atoms and get a little too trigger-happy with them. What you could hear from your more experienced teammates is something like this:

You shouldn’t really use String.to_atom/1 on user-supplied data. The BEAM has a limit on how many different atoms you can have and they’re not garbage collected!

With data coming from outside the system, stick to strings or use String.to_existing_atom/1 instead!

This is good advice and the official docs agree. It seems like an easy choice too - if you take the approach that all atoms you expect to see in the system are known at compile time and you won’t be creating any new ones during runtime, there’s no reason not to do it! You get all the safety and no problems!
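To make the difference concrete, here is a small sketch of how the two functions behave for a string the VM has never seen as an atom. The input string is made up for illustration; the atom limit is 1,048,576 by default.

```elixir
# Atoms live in a fixed-size table for the lifetime of the VM.
atom_limit = :erlang.system_info(:atom_limit)
atom_count = :erlang.system_info(:atom_count)
IO.puts("atoms in use: #{atom_count} of #{atom_limit}")

# Hypothetical user input; the unique suffix makes sure no such atom exists yet.
input = "status_from_user_#{System.unique_integer([:positive])}"

# to_existing_atom/1 refuses to create anything new and raises instead.
existing_result =
  try do
    {:ok, String.to_existing_atom(input)}
  rescue
    ArgumentError -> :error
  end

# to_atom/1 always succeeds, at the price of a new, permanent table entry.
created = String.to_atom(input)

# Now that the atom exists, the "safe" function accepts the same string.
{:ok, ^created} = {:ok, String.to_existing_atom(input)}

IO.inspect(existing_result, label: "before the atom existed")
```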

From my personal experience, that’s true the vast majority of the time. But there are situations where it can blow up when you least expect it (or only in production). Pull up a chair, let me tell you a story.

Storing atoms in Postgres

In Curl we use Postgres and Ecto for some of our data storage. There are some situations where we use the structs representing database rows almost directly, so it makes sense to have the stored data as close to the desired Elixir representation as possible.

Let’s say we have a table representing users and we expect all users to be either “active” or “inactive” (whatever it means in the business context). In our Elixir code we’d like to see it as something like %User{status: :active} - an atom struct field value.

Postgres isn’t aware or bothered about what Elixir atoms are, but will happily store them for us as strings. We can hide the boilerplate string <-> atom conversion code by creating a custom Ecto type. The code and experience we end up with is very similar to the one described by Lew Parker in his A quick Dip into Ecto Types blog post.

Here’s pretty much the code we ended up with:

defmodule MyApp.Ecto.AtomType do
  @behaviour Ecto.Type

  @type t :: atom

  @impl true
  def type(), do: :string

  @impl true
  def cast(atom) when is_atom(atom), do: {:ok, atom}

  def cast(string) when is_binary(string), do: safe_string_to_atom(string)

  def cast(_), do: :error

  @impl true
  def load(value), do: safe_string_to_atom(value)

  @impl true
  def dump(atom) when is_atom(atom), do: {:ok, Atom.to_string(atom)}

  def dump(_), do: :error

  @spec safe_string_to_atom(String.t()) :: {:ok, atom} | :error
  defp safe_string_to_atom(str) do
    try do
      {:ok, String.to_existing_atom(str)}
    rescue
      ArgumentError -> :error
    end
  end
end

Looks pretty reasonable to me, and again: String.to_existing_atom/1 should never be a problem, because we have the :active and :inactive atom literals in our codebase and we even have a database constraint to make sure those are the only two values that can be stored in the status column.
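If you want to poke at the conversion logic without pulling in Ecto, here is a stripped-down, hypothetical copy of the load and dump functions above. The module name is made up and it is only a sketch of the round trip, not the type we actually ship:

```elixir
defmodule AtomRoundTripSketch do
  # Hypothetical, Ecto-free copy of the conversion logic above,
  # so it can be pasted into iex without any dependencies.

  # What the type does when writing to the database.
  def dump(atom) when is_atom(atom), do: {:ok, Atom.to_string(atom)}
  def dump(_), do: :error

  # What the type does when reading from the database.
  def load(string) when is_binary(string) do
    {:ok, String.to_existing_atom(string)}
  rescue
    ArgumentError -> :error
  end
end

# The :active literal below puts the atom in the atom table
# as soon as this snippet is compiled.
{:ok, "active"} = AtomRoundTripSketch.dump(:active)
{:ok, :active} = AtomRoundTripSketch.load("active")

# A string with no matching atom is rejected rather than converted.
unknown = "no_such_status_#{System.unique_integer([:positive])}"
:error = AtomRoundTripSketch.load(unknown)
```

Note that the round trip only works here because the :active literal sits right next to the call.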

Some time passes…

Time passes, features are added, refactors happen. One day we deploy to production (code that has behaved well on staging environment for a while and passed all system and unit tests with flying colours) and AppSignal starts notifying us about errors:

cannot load `"inactive"` as type MyApp.Ecto.AtomType for field `status` in schema (...)

What gives?! I know :inactive is an atom that exists and the same code has had no problems in the staging environment! We investigate and come up with a theory that gets confirmed by a two-year-old github comment by José.

Here’s what happened:

The :inactive literal was in our codebase, but not in the modules that get used most frequently on a day-to-day basis. The atoms in a module don’t get added to the atom table until that module gets loaded, which happens lazily - when a function in it is called, for example.

Those modules got loaded when system tests were run in the staging environment (when testing scenarios with users getting deactivated) but not straight away in production. In production the modules weren’t loaded (yet) and when an inactive user was retrieved from the database, the field loading errored out breaking some features.
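You can watch the lazy loading happen in an iex session. The sketch below uses :zip from Erlang’s standard library purely as an example of a module that usually hasn’t been touched yet; in a system where everything is loaded at boot, the "before" line will already say true.

```elixir
# :code.is_loaded/1 returns false until the VM has loaded the module.
is_loaded = fn module -> :code.is_loaded(module) != false end

# Any module that hasn't been used yet works here; :zip is only an example.
module = :zip

IO.inspect(is_loaded.(module), label: "loaded before")

# Code.ensure_loaded?/1 forces the load, which also puts the module's atoms
# into the atom table.
true = Code.ensure_loaded?(module)

IO.inspect(is_loaded.(module), label: "loaded after")
```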

Fixing it

How do we fix it then?

First we make sure the production system works - we go through some user scenario which uses a Module with the atoms we need. All systems are nominal again. Now we can approach the root cause with less urgency.

There’s a couple of ways of fixing the problem itself. One was suggested by José: Make sure whenever there’s a chance we’ll need the atoms we’re depending on, we’ll load their modules. In our case we could be doing this in our Repo module, which always gets used whenever we talk to the database:

# in lib/my_app/repo.ex

@on_load :load_atoms

def load_atoms() do
  relevant_modules = [
    MyApp.User,
    MyApp.DisablingFlow 
  ]
  
  Enum.each(relevant_modules, &Code.ensure_loaded?/1)
  :ok
end

You can see we’re using the module’s @on_load attribute to hook into the module’s lifecycle and “cascade” the module loading.

That’s not the only possible solution though. Another, technically simpler solution is just reverting to String.to_atom/1 when we load data from the database. That would be working under the assumption we knew what we were doing when we were storing them in the first place :) That would just be changing the AtomType.load/1 function:

# Before:
def load(value), do: safe_string_to_atom(value)

# After:
def load(value), do: {:ok, String.to_atom(value)}

The latter approach might look a bit naive, but it saves us from the hassle of what looks like manually tracking dependencies between modules.
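A middle ground between the two, which I’ll sketch here as an idea rather than something from our codebase: keep an explicit list of allowed atoms in the same module that does the conversion. The literals are then part of the code that needs them, so they are loaded whenever the conversion can run at all. The module name and the list are hypothetical, and the price is that the type is no longer generic.

```elixir
defmodule StatusConversionSketch do
  # Hypothetical whitelist; the atom literals live in this very module,
  # so they are in the atom table whenever this code is able to run.
  @allowed [:active, :inactive]
  @by_string Map.new(@allowed, fn atom -> {Atom.to_string(atom), atom} end)

  # Returns {:ok, atom} for a known string and :error for anything else.
  def load(string) when is_binary(string), do: Map.fetch(@by_string, string)
  def load(_), do: :error
end

{:ok, :inactive} = StatusConversionSketch.load("inactive")
:error = StatusConversionSketch.load("deleted")
```

This never creates atoms from outside data and never depends on some other module having been loaded first.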

There are more possible approaches here: we could start the system in embedded mode, where all code is loaded at startup, provided we’re deploying the app using releases. While simplifying the app’s lifecycle like this sounds like a clean solution, I think the conservative choice is not to depend on deployment details for the correctness of your app. Getting started with releases requires some extra work too.
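If you’re not sure which mode a running node is in, you can ask the code server. A tiny sketch, to be run in a remote console or iex session:

```elixir
# Returns :interactive (modules are loaded lazily, on first use)
# or :embedded (modules are loaded up front at boot).
mode = :code.get_mode()

IO.inspect(mode, label: "code loading mode")
```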

So anyways, crisis averted and we learned something!

End notes

This is not actually how we model our users and it’s a different entity which made the problem surface - it’s just a simplification for the sake of the blog post so I wouldn’t have to get too deep into describing how we model Curl’s domain.

There are probably some even cleaner ways of solving the problem we ran into. If you have some ideas, leave a comment below, I’d love to hear about them!