How to use data spanning multiple data sources in Elixir

Below is a reprint of the article I wrote for the Rekki Medium page.

REKKI builds tools that help people along the restaurant supply chain do their jobs better.

We have a free mobile app that lets restaurants order and chat with suppliers, and a web-based tool for suppliers that helps them process orders, manage product codes and catalogues, and communicate more easily with their customers. The majority of REKKI’s backend is written in Elixir, working hand in hand with services written in Go and Node. The Elixir services handle most of what the user sees in the app like the real-time communication with the supplier and the status of the orders.

We use Ecto to talk to our databases, which is the de facto standard. It provides a nice, composable data querying and manipulation DSL and an ORM-like way to map information stored in the database to Elixir structs using Schemas. In this post, we’re going to focus on advanced usage of one of its features: “preloading” of the related data.

If you’re already an Ecto power user and know how we usually use preloads, you might want to skip ahead to the “Non-trivial scenario,” or straight to the “Using preload functions” section.

setting the scene

Let’s say we’re building a blogging platform, a bit like Medium, where users can write posts and comments. We’d probably model posts and comments as separate tables in a relational db, where each comment belongs to a post, a post can have multiple comments, and both posts and comments belong to individual users.

In that scenario, if we wanted to load all comments written by a certain user, we’d write something like this:

When presenting those comments to the user, it might make sense to also display some information about the post they were written under. In that case we’d want to preload the post information like so:

There’s more than one way of going about it, and you can also use preloads in more complicated scenarios. The folks at Thoughtbot wrote a helpful blog post about different ways to execute nested preloads using the same classic “posts and comments” example that is also worth reading. But, even the simple case above has some nice properties:

It’s not implemented the naive way, where the related Post is fetched for each of the comments, which would be the definition of the n+1 query problem. Instead, all of the related posts are fetched in a single query.

The resulting SQL query also isn’t a simple join like

SELECT * FROM comments c INNER JOIN posts p ON c.post_id = p.id

That approach could be sub-optimal in many cases — if we were retrieving 500 comments under the same post, Postgres would give us 500 copies of the post information, one per each of the returned comments. Instead, Ecto splits it into two queries — one for the comments and one for the related posts.

You don’t need to execute the preload at the same time you’re performing the original query. If you have the Comment structs already fetched, you can preload the corresponding Posts afterwards:

Non-trivial scenario

We’ve seen how we’d use preloads to get data from related tables in the same database. We had foreign key relationships documented in the schemas and Ecto supported our use-case out of the box. However, we don’t always have that luxury. For legacy reasons or because of how we decided to separate concerns in our system, the related data could be in a different database, or even provided to us by a different service!

Let’s say (even if the example is a bit contrived) that each of the Posts (optionally) belongs to a Category, identified by its name. There also exists a service that for a given category name can give us all the relevant category information, such as description, statistics, notable authors and so on. What’s a flexible, efficient and idiomatic way of loading the category information for our Posts?

One thing we could do would be to create a virtual field in our Post schema and a function that would populate it for the Post struct:

This is a good solution, but it does have some disadvantages. Firstly, we can’t really use it when building our composable queries. Secondly, if we had a collection of Posts we wanted to add the category information to, we might be tempted to do something simple, like this:

This would work, but if the collection had multiple posts belonging to the same category, we’d be doing redundant work and fetching the same information multiple times. Plus it’s starting to feel like we’re solving a problem that should already be solved.

This is where the preload functions come in.

Using preload functions

There is a section of the Ecto documentation that’s easy to miss, that gives a brief explanation of preload functions. They give us the flexibility we want. Let’s see how we’d use them in our case.

It seems in the “posts in categories” scenario, the type of relationship we’re looking for would be belongs_to, where the foreign key is category_name. Let’s model this in our schema:

Now the only thing we need is the preloading function itself. It will accept a list of category names and fetch the category info from the api. If we want to, we can also parallelize the fetching from the get-go using Tasks.

The belongs_to(:category, …) in the Post schema informs Ecto what fields we use to identify the categories (on both the post and category side), allowing it to do the matching for us. The categories don’t necessarily need to be Ecto schemas, a plain map with the right name field would do the trick.

The cool thing about this solution is that Ecto takes care of the deduplication for us (so we don’t do redundant fetching). Even better, we can use what we wrote for preloading inside Ecto queries, for prefetching information for individual structs or collections of structs, or nested prefetching, as per the Thoughtbot blog post.

Ever since we discovered preload functions at REKKI, we’ve been using them with good results. While we store most of our data in Postgres, we still have a CouchDB instance we rely on for some things. Preload functions makes working with both of them together much more streamlined. I imagine we’re going to use them even more as the amount of different services that make up the REKKI backend grows.

At the same time, I haven’t seen the preload functions used widely by the community. If you weren’t aware of them before, I hope they become a useful tool in your toolbox. I’m also curious if there’s any other use-cases they are good for, feel free to leave some examples in the comments.

The examples in this post were written with Ecto 3.x. Take a look at the repo with the example code — all of the code snippets in this post were taken from the repo’s code or tests. Running the tests prints all the SQL queries Ecto is running, which is nice for reference.

edit: unfortunately the repo got deleted and I don’t have a copy of it.