Date:

Share:

Using JDK Collectors to De-duplicate parent/child nested collections – Java, SQL and jOOQ.

Related Articles

In classic SQL (i.e. before the amazing JOOQ MULTISET operator), nested collections were brought in using standard (external) connections. An example of such a query would be a query that works against the Sakila’s database Bring in actors and their films. Using jOOQ:

Result<?> result =
ctx.select(
        ACTOR.ACTOR_ID,
        ACTOR.FIRST_NAME,
        ACTOR.LAST_NAME,
        FILM.FILM_ID,
        FILM.TITLE)
   .from(ACTOR)
   .leftJoin(FILM_ACTOR).on(ACTOR.ACTOR_ID.eq(FILM_ACTOR.ACTOR_ID))
   .leftJoin(FILM).on(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
   .orderBy(
       ACTOR.ACTOR_ID,
       FILM.FILM_ID)
   .fetch();

The result of jOOQ’s debugging skill would look something like this:

+--------+----------+---------+-------+---------------------+
|actor_id|first_name|last_name|film_id|title                |
+--------+----------+---------+-------+---------------------+
|       1|PENELOPE  |GUINESS  |      1|ACADEMY DINOSAUR     |
|       1|PENELOPE  |GUINESS  |     23|ANACONDA CONFESSIONS |
|       1|PENELOPE  |GUINESS  |     25|ANGELS LIFE          |
|       1|PENELOPE  |GUINESS  |    106|BULWORTH COMMANDMENTS|
|       1|PENELOPE  |GUINESS  |    140|CHEAPER CLYDE        |
+--------+----------+---------+-------+---------------------+

As expected for the SQL join operation, it neutralizes the data, leading to double values ​​for each player, or if you sort differently, e.g. FILM_ID, So you will also see the double values ​​for each movie:

+--------+----------+---------+-------+----------------+
|actor_id|first_name|last_name|film_id|title           |
+--------+----------+---------+-------+----------------+
|       1|PENELOPE  |GUINESS  |      1|ACADEMY DINOSAUR|
|      10|CHRISTIAN |GABLE    |      1|ACADEMY DINOSAUR|
|      20|LUCILLE   |TRACY    |      1|ACADEMY DINOSAUR|
|      30|SANDRA    |PECK     |      1|ACADEMY DINOSAUR|
|      40|JOHNNY    |CAGE     |      1|ACADEMY DINOSAUR|
+--------+----------+---------+-------+----------------+

This is the only way a joining campaign works. It creates a Cartesian product and then filters by master key / foreign key matches.

Eliminate duplication and nesting of collections using fetchGroups ()

What we usually want is some kind of nested data structure, for example bring the movies to all player. A simple tool in jOOQ is simple to use fetchGroups():

Map<ActorRecord, Result<FilmRecord>> result =
ctx.select(
        ACTOR.ACTOR_ID,
        ACTOR.FIRST_NAME,
        ACTOR.LAST_NAME,
        FILM.FILM_ID,
        FILM.TITLE)
   .from(ACTOR)
   .leftJoin(FILM_ACTOR).on(ACTOR.ACTOR_ID.eq(FILM_ACTOR.ACTOR_ID))
   .leftJoin(FILM).on(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
   .orderBy(
       ACTOR.ACTOR_ID,
       FILM.FILM_ID)
   .fetchGroups(ACTOR, FILM);

It’s super comfortable, it keeps order, but it has a flaw. It’s not “smart” enough to remember LEFT JOIN Semantics and do the right thing, Which is to produce an empty list of FilmRecord In case the actor has no movies. If this is the case, then there is a NULL Movie in the set of results in SQL:

+--------+----------+---------+-------+---------------------+
|actor_id|first_name|last_name|film_id|title                |
+--------+----------+---------+-------+---------------------+
|       1|PENELOPE  |GUINESS  |      1|ACADEMY DINOSAUR     |
|       1|PENELOPE  |GUINESS  |     23|ANACONDA CONFESSIONS |
|       1|PENELOPE  |GUINESS  |     25|ANGELS LIFE          |
|       1|PENELOPE  |GUINESS  |    106|BULWORTH COMMANDMENTS|
|       1|PENELOPE  |GUINESS  |    140|CHEAPER CLYDE        |
|     ...|...       |...      |    ...|...                  |
|     201|UNKNOWN   |ACTOR    | null|null               |
+--------+----------+---------+-------+---------------------+

Therefore, we will not get an empty list, but a list that contains an empty list FilmRecord, Just like when you are GROUP BY Actor and COUNT(*) Number of films:

var r =
ctx.select(
        ACTOR.ACTOR_ID,
        ACTOR.FIRST_NAME,
        ACTOR.LAST_NAME,
        count(),
        count(FILM.FILM_ID))
   .from(ACTOR)
   .leftJoin(FILM_ACTOR).on(ACTOR.ACTOR_ID.eq(FILM_ACTOR.ACTOR_ID))
   .leftJoin(FILM).on(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
   .groupBy(
        ACTOR.ACTOR_ID,
        ACTOR.FIRST_NAME,
        ACTOR.LAST_NAME)
   .orderBy(
       ACTOR.ACTOR_ID)
   .fetch();

The result of this query may look like this:

+--------+----------+------------+-----+-----+
|actor_id|first_name|last_name   |count|count|
+--------+----------+------------+-----+-----+
|       1|PENELOPE  |GUINESS     |   19|   19|
|       2|NICK      |WAHLBERG    |   25|   25|
|       3|ED        |CHASE       |   22|   22|
|       4|JENNIFER  |DAVIS       |   22|   22|
|       5|JOHNNY    |LOLLOBRIGIDA|   29|   29|
|     ...|...       |...         |  ...|  ...|
|     201|UNKNOWN   |ACTOR       |    1|    0|
+--------+----------+------------+-----+-----+

See how to get only the desired count value of 0 When we pass the nullable FILM.FILM_ID A column as an argument. So what will be the equivalent fetchGroups() A conversation that exposes this behavior?

Eliminate duplication and nesting collections with JDK Collectors

A very unappreciated JDK feature Collectors. Although they were introduced specifically for use with the Stream API, can be used with any type of Iterable, In principle, and I still hope that JDK Future will offer Iterable.collect(), among the rest.

With jOOQ, you can collect the results of each query by calling ResultQuery.collect(). To the aforesaid translation fetchGroups() For example, we can write this, producing almost the same result:

Map<ActorRecord, List<FilmRecord>> result =
ctx.select(
        ACTOR.ACTOR_ID,
        ACTOR.FIRST_NAME,
        ACTOR.LAST_NAME,
        FILM.FILM_ID,
        FILM.TITLE)
   .from(ACTOR)
   .leftJoin(FILM_ACTOR).on(ACTOR.ACTOR_ID.eq(FILM_ACTOR.ACTOR_ID))
   .leftJoin(FILM).on(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
   .orderBy(
       ACTOR.ACTOR_ID,
       FILM.FILM_ID)
   .collect(groupingBy(
       r -> r.into(ACTOR), filtering(
           r -> r.get(FILM.FILM_ID) != null, mapping(
               r -> r.into(FILM), toList()
           )
       )
   ));

The call collect () above nests the following:

  • It’s grouped by player (just like fetchGroups())
  • It filters group content by their movies ID No NULL (This can not be achieved with fetchGroups()).
  • It maps the contents of the group so that it contains only FILM Content, not all screening.

So, it’s more literally, but also much more powerful and easy to display custom aggregation behavior on the client side, in case you can not pass the aggregation logic to your SQL statement.

More about powerful collectors in this article here:

Nesting collections directly in SQL

No article on this blog would be complete without putting together the awesome MULTISET An alternative to nest collections. After all, the above duplication algorithm really only works if you join a single parent-child relationship path, which is quite inefficient when there are a lot of duplicate data sets.

Assuming these types of auxiliary data:

record Film(String title) 
record Actor(String firstName, String lastName) 
record Category(String name) 

You can write a query like this:

// We're importing the new Records::mapping method for convenience
import static org.jooq.Records.mapping;

Result<Record3<Film, List<Actor>, List<Category>>> result = ctx
    .select(
        FILM.TITLE.convertFrom(Film::new),
        multiset(
            select(
                FILM_ACTOR.actor().FIRST_NAME,
                FILM_ACTOR.actor().LAST_NAME
            )
            .from(FILM_ACTOR)
            .where(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
        ).convertFrom(r -> r.map(mapping(Actor::new))),
        multiset(
            select(FILM_CATEGORY.category().NAME)
            .from(FILM_CATEGORY)
            .where(FILM_CATEGORY.FILM_ID.eq(FILM.FILM_ID))
        ).convertFrom(r -> r.map(mapping(Category::new)))
    )
    .from(FILM)
    .orderBy(FILM.TITLE)
    .fetch();

Note that we now get a results set containing

  • All movies
  • The actors for each film as a nesting collection
  • The categories for the film as another nested collection

The example is via Joining implicitly To avoid some of the more verbal joins within the MULTISET Phrases, but it is not entirely relevant for this example.

The result is something like:

+----------------------------+--------------------------------------------------+----------------------------+
|title                       |multiset                                          |multiset                    |
+----------------------------+--------------------------------------------------+----------------------------+
|Film[title=ACADEMY DINOSAUR]|[Actor[firstName=PENELOPE, lastName=GUINESS], A...|[Category[name=Documentary]]|
|Film[title=ACE GOLDFINGER]  |[Actor[firstName=BOB, lastName=FAWCETT], Actor[...|[Category[name=Horror]]     |
|Film[title=ADAPTATION HOLES]|[Actor[firstName=NICK, lastName=WAHLBERG], Acto...|[Category[name=Documentary]]|
|Film[title=AFFAIR PREJUDICE]|[Actor[firstName=JODIE, lastName=DEGENERES], Ac...|[Category[name=Horror]]     |
|Film[title=AFRICAN EGG]     |[Actor[firstName=GARY, lastName=PHOENIX], Actor...|[Category[name=Family]]     |
+----------------------------+--------------------------------------------------+----------------------------+

Summary

There are many roads leading to Rome. Classic SQL-based approaches to nest collections have used some duplication technique in the client. In jOOQ, you can always do it with fetchGroups(), And since recently also with collect() Directly (collect() Has always been available through an intermediary stream() call).

However, for truly powerful nesting of collections, our recommendation is to always transfer your logic into SQL directly, using original array functionality, or using MULTISET

Source

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Popular Articles