How to get the top X results of a given category using Ecto?


I have to queries that return a colossal amount of data on their own. I cannot use Repo.all as doing so would materialize these into memory, which would then quickly run out.

So I am trying to push as much as I can to the pSQL DB, and force the DB to do as much work as possible.

My issue starts with 2 queries.

This ones counts fruits and veggies and aggregates everything into a neat map.

all_counts =
      |> join(:left, [item_A], item_B in table_B,
          item_A.home_id == item_B.home_id and
            item_A.path == item_B.path
      |> select([unfiltered_item, filtered_item], %{
        path: item_A.path,
        item_fruits_count: coalesce(item_A.fruits, 0),
        item_veggies_count: coalesce(item_B.veggies, 0),
        dataset_id: item_A.home_id
      |> subquery()

The second one, joins 2 tables as well (items and photos), with nothing fancy:

file_info =
      |> join(:inner, [item], file in table_D,
 == file.item_id and not file.deleted
      |> select([item, file], %{
        home_id: item.home_id,
        path: item.path,
        photo_key: file.photo_key
      |> subquery()


Now the problem is that I need to merge these 2 together.
At first, one would think to do something like this:

result =
      |> join(:inner, [c], f in ^file_info, on: c.home_id == f.home_id and c.path == f.path)
      |> select([c, f], %{
        item_id: f.item_id,
        home_id: f.home_id,
        path: f.path,
        photo_key: f.photo_key,
        # ... you get the idea
      |> Repo.all()

But this creates an issue, namely, the it will return so much data, the machines will run out of memory.


The approach I am using to solve this problem is to group items by home_id and path (since that is unique for each destination) and then return only a portion of the data I need, lets say, the top 3 items ordered by id.


Here is where my difficulties begin.
I cannot use pSQL directly, I must use Ecto (for reasons beyond this post).

Normally I would use CTEs or row_number():

With ctes:

 WITH cte AS
  ( SELECT name, value,
                              ORDER BY value DESC
             AS rn
    FROM t
SELECT name, value, rn
FROM cte
WHERE rn <= 3
ORDER BY name, rn ;

With row_number:

SELECT name, value, rn
  ( SELECT name, value,
                              ORDER BY value DESC
             AS rn
    FROM t
  ) tmp 
WHERE rn <= 3
ORDER BY name, rn ; 

However, I am not familiar enough with Ecto to know how to use them.

With CTEs, I understand I should avoid them, as they serve no purpose in Ecto:

With row_number() I would need to partition by both home_id and path (2 fields) instead of one:


How do I get the result, to return the top 3 results, grouped by home_id and path and ordered by item_id using Ecto?

Corresponding tweet for this thread:

Share link for this tweet.

1 Like


From what I gathered, there are two decent possible solutions to this conundrum.

row_number + over

One of them is using row_number() with over() from Ecto:

Assuming I join both file_info with all_counts in a single table, I can then perform the query as mentioned in the previous SO post I mentioned in my question:

|> select([fi], %{
        rn: over(row_number(), partition_by: [fi.home_id, fi.path], order_by: [asc: fi.item_id]),
        item_id: fi.item_id,
       # you get the idea ...
|> subquery()

IO.inspect(file_info_with_counts |> where([c], c.rn <= 3) |> Repo.all()

Which prints what I wanted.


Lateral inner joins

However, as mentioned by some people in the community, this solution is rather old, and these days lateral joins seem to also cover this use case.

tops = 
  from top in "file_info_with_counts", 
    where: top.home_id == parent_as(:parent).home_id,
    where: top.path == parent_as(:parent).path,
    order_by: [asc: top.item_id],
    limit: 3,
    select: %{id:}

from parent in "file_info_with_counts",
  as: :parent,
  group_by: [parent.home_id, parent.path],
  lateral_join: top in subquery(tops), 
  on: true,
  select: %{home_id: parent.home_id, path: parent.path, item_id:}


This solution is not without merit, however, given my familiarity with row_number I opted for that solution instead.

Unless there is a considerable performance difference between the two in favor of lateral joins, I will keep the previous solution.

For more info, here is the original source where I got these answers:

1 Like

Will Ecto be able to support NoSQL databases in the future?

1 Like

While I am not an expert, I believe there is a package called mongodb_ecto:

So if you use MongoDB, this might be your thing.

1 Like