Hibernate 5 and Entities Duplication: Let’s Dive Under the Hood

If you have ever worked on a Spring application using Hibernate 5, you probably encountered the challenge of dealing with duplicated entities.

It is an odd scenario where your collection contains multiple entries referencing the same entity. This is a quite known issue, and there are ways to address it whether using the DISTINCT keyword, Sets, or JPA EntityGraph.

However, I happened to have quite disturbing behavior using EntityGraph returning duplicated entities! We will be diving a bit under the hood of Hibernate 5 to understand what happened.

A Simple Spring Boot Application

To illustrate Hibernate 5 behavior, let’s start with a basic Spring Boot project connected to a database: https://github.com/Shaolans/HibernateEntityGraphDeduplicationIssue

The database schema is a simple model as follows:

The Hibernate entities are defined as follows:

Our focus will be on the queries generated by this JPA repository:

@Repository
public interface AuthorRepository extends JpaRepository<Author, Integer> {

    @Query("SELECT a FROM Author a LEFT JOIN FETCH a.books")
    List<Author> findAuthorsLeftJoinFetchNoEntityGraph();

    @EntityGraph(attributePaths = "books")
    @Query("FROM Author a LEFT JOIN a.books")
    List<Author> findAuthorsLeftJoinWithEntityGraph();

    @Query("FROM Author a LEFT JOIN a.books")
    List<Author> findAuthorsLeftJoinWithoutEntityGraph();

    @EntityGraph(attributePaths = "books")
    @Query("FROM Author a INNER JOIN a.agent WHERE a.agent.name like '%John%'")
    List<Author> findAuthorsWithJoinNotFetchedAndEntityGraph();

    @EntityGraph(attributePaths = "books")
    @Query("SELECT a FROM Author a INNER JOIN a.agent WHERE a.agent.name like '%John%'")
    List<Author> findAuthorsWithJoinNotFetchedAndEntityGraphAndSelect();
}

Entity Duplication

Why Does It Happen?

To understand the occurrence of entity duplication, let’s consider a use case where the objective is to retrieve all authors and their respective books. In order to avoid the N+1 query issue I include in my JPQL query the keyword FETCH.

It enables the retrieval of both authors and their associated books in a single SQL query, eliminating the necessity to individually fetch books for each author.

@Query("SELECT a FROM Author a LEFT JOIN FETCH a.books")
List<Author> findAuthorsLeftJoinFetchNoEntityGraph();

Invoking this method will result in the compilation of an SQL query equivalent to the following:

SELECT A, B FROM AUTHOR A LEFT JOIN BOOK B ON A.ID = B.BOOK_AUTHOR_ID;

In a case where you have 2 authors that have respectively 2 and 3 books, the result of the SQL query will yield the following table:

Depending on the type of join used, the association between the Author and Book tables are established based on the provided condition.

This will produce for each Book that matched an Author, a new row containing the columns of the Book table but also the columns of the Author table.

Hibernate, upon retrieving this result set, intelligently resolves the entity without duplication. In our case, we end up with only two instances of the Author entity. However, it handles resolution on a row-by-row basis and does not inherently deduplicate the rows by default.

This explains why the List, returned by calling findAuthorsLeftJoinFetchNoEntityGraph, returns the following collection:

[Author@5640, Author@5640, Author@5641, Author@5641, Author@5641]

Author@5640 being the instance representing J.R.R Tolkien Author entity and Author@5641 representing the Author Isaac Asimov.

This duplication is mostly observed on @OneToMany and @ManyToMany relationships since joining with these types of relations might match multiple rows of the associated table.

On the contrary, @OneToOne and @ManyToOne relationships match at most only 1 entry on the associated table and will not induce duplication.

How to Deduplicate

Although this duplication issue is quite common and often occurs during the development of applications, there are ways to address and resolve this issue.

Set

A straightforward solution involves using the returning type Set<Author> instead of List<Author>:

@Query("FROM Author a LEFT JOIN FETCH a.books")
Set<Author> findAuthorsLeftJoinFetchNoEntityGraph();

When the results are collected, duplicated entities will be filtered through the Set.

DISTINCT keyword

Another approach is to use the DISTINCT keyword:

@Query("SELECT DISTINCT a FROM Author a LEFT JOIN FETCH a.books")
List<Author> findAuthorsLeftJoinFetchNoEntityGraph();

It should be noted that using DISTINCT as two semantic meanings:

SQL: The conventional SQL meaning involves filtering duplicate values in the selected columns.

Hibernate: It indicates a desire to filter duplications in the parent entity.

When the DISTINCT keyword is used in a JPQL query, it leads to the generation of an SQL query featuring the DISTINCT keyword but also instructing Hibernate to filter duplicated entities.

However, using DISTINCT in SQL induces an overhead when executed in the database, and in some cases, the goal is to only filter duplicate entities induced by the join. For such cases, the Hibernate hint HINT_PASS_DISTINCT_THROUGH can be employed to avoid the generation of DISTINCT in the SQL query:

@QueryHints(@QueryHint(name = HINT_PASS_DISTINCT_THROUGH, value = "false"))
@Query("SELECT DISTINCT a FROM Author a LEFT JOIN FETCH a.books")
Set<Author> findDistinctAuthorsLeftJoinFetchNoEntityGraph();

EntityGraph

In JPA 2.1, EntityGraph was introduced as a powerful feature allowing developers to define the loading strategies of entities. In our case, we can just use the @EntityGraph annotation to load the books along with the Author, as illustrated below:

@EntityGraph(attributePaths = "books")
@Query("FROM Author a LEFT JOIN a.books WHERE a.books.size > 2")
List<Author> findAuthorsLeftJoinWithEntityGraph();

This JPQL query is designed to retrieve all authors, along with their respective books, who possess more than 2 books.

Even without explicitly including the FETCH keyword in the SQL query, the EntityGraph ensures the books are loaded and the parent entity Author remains non-duplicated.

Note: This is a simple example of using EntityGraph; a more complete and sophisticated way of using it exists, but this is not the scope of this section.

EntityGraph and Entity Duplication

While it is established that EntityGraph handles entity deduplication, let's look into a more complicated scenario.

Consider the Author entity, which has a @ManyToOne relation with Agent. Suppose the objective is to retrieve all authors and their books where the author's agent has a name containing "John", yet we have no intention of loading the Agent entity for an arbitrary reason. In this case, we can accomplish this by the following code:

@EntityGraph(attributePaths = "books")
@Query("FROM Author a INNER JOIN a.agent WHERE a.agent.name like '%John%'")
List<Author> findAuthorsWithJoinNotFetchedAndEntityGraph();

Since we are using EntityGraph, we expect to avoid duplicated Author, but upon executing the method, the outcome is unexpected:

Duplicated entities are retrieved despite the use of EntityGraph. To understand the underlying reasons for this behavior, let’s dive a bit under the hood of Hibernate!

How Hibernate Resolves Entity Duplication

Hibernate is a quite complete and complex ORM, it can be hard to grasp entirely its fine details, but in our case, since we want to understand deduplication mechanisms, we will be focusing on the method list of QueryTranslatorImpl.

For the sake of clarity, some portions of the code have been omitted in order to highlight the most crucial aspects.

Upon invoking a method from the AuthorRepository, the execution will eventually reach this code snippet. Here is a breakdown of its functionality:

It computes a boolean called needsDistincting, and as the name implies, it identifies whether the result set needs to be deduplicated.

We can observe when we do use aDISTINCT keyword, EntityGraph, or limit the size of retrieval, AND there is a FETCHoperation in the JPQL query; the distinction needs to be applied.

In our issue, it can be observed that whenever EntityGraph is used, the distinction will be applied since EntityGraph will edit the source query to “FETCH” child entities given in the attribute paths.
This method invocation executes the query in the database, retrieves the result set, and resolves the entities. This list might contain duplicated entities.
This is the most interesting part here; when it does need to distinguish entities, Hibernate uses an IdentitySet to filter out the duplicated entity.

At this point, it is established that using EntityGraph should filter duplicated entities, but what happened to cause the unexpected outcome previously discussed?

Why EntityGraph Returns Duplicates in Some Cases

Let’s Compare

In order to have a better understanding of the situation, let’s compare the results of queryLoader.list( session, queryParametersToUse ) (2) when invoking the following methods:

@EntityGraph(attributePaths = "books")
@Query("FROM Author a LEFT JOIN a.books")
List<Author> findAuthorsLeftJoinWithEntityGraph();

and

@EntityGraph(attributePaths = "books")
@Query("FROM Author a INNER JOIN a.agent WHERE a.agent.name like '%John%'")
List<Author> findAuthorsWithJoinNotFetchedAndEntityGraph();

Upon calling findAuthorsLeftJoinWithEntityGraph, the returned results contains:

Given the EntityGraph is used, the duplication will be resolved through the IdentitySet, and at the end, the method will return a List<Author> consisting of 3 authors/instances.

However, upon callingfindAuthorsWithJoinNotFetchedAndEntityGraph, the returned results contains the following:

In this scenario, the result is not a List of Author instances but rather a List of distinct array instances of Objects; each encapsulating duplicated Author and Agent instances. The reason behind the presence of duplicate entities becomes evident.

Each array (

Object[2]@13551, Object[2]@135512, ..., Object[2]@13555), being a unique instance, is not filtered during the distinction process with the IdentitySet.

Consequently, all arrays are considered as distinct instances, leading to the observed duplication.

Explaining the Difference

In order to understand the difference between obtaining a List of Author instances in the former case and a list of arrays containing Author and Agent instances in the latter case, we need to explore deeper; let’s take a look at the class QueryLoader and especially the method initialize.

This initialization code has interesting parts, particularly the one highlighted in red. It creates an array of boolean includeInSelect where it checks which entities should be selected and be returned.

Let’s take for example:

@Query("SELECT a, b FROM Author a LEFT JOIN FETCH a.books b")

The initialize method checks theSELECT content one by one which is a and b; if the selected entity originates from a FETCH in the FROM clause, then it will be omitted since FETCH is used to load children entities (but not in the case of the parent entity). Otherwise, the entity should be selected.

In this example, the array includeInSelect contains [true (Author), false (Book)].

In our case, of entity duplication with EntityGraph:

@EntityGraph(attributePaths = "books")
@Query("FROM Author a INNER JOIN a.agent WHERE a.agent.name like 'John'")
List<Author> findAuthorsWithJoinNotFetchedAndEntityGraph();

We did not specify the SELECT clause, so by default, it will select all the entities from all the tables specified in the FROM clause.

The includeInSelect logic returns an array [true (Author), false (Book), true (Agent)] where Author and Agent (2 elements) should be selected because only Book is subject to a FETCH operation due to the use of EntityGraph.

This explains the difference between findAuthorsLeftJoinWithEntityGraph and findAuthorsWithJoinNotFetchedAndEntityGraph.

The former returns only one value per row because the SELECT clause contained only Author and Book entities (the select of Book was generated by EntityGraph), however, since Book was subject to a FETCH, it was not taken into account in the includeInSelect.

The latter returns an array per row because Author and Agent were in the SELECT clause and are not subject to a FETCH operation unlike Book. Thus, returning an array consisting of 2 elements per row.

In summary, Hibernate consistently returns an array, and the content of this array is determined by the entities specified in the SELECT clause that are not associated with FETCH operations.

However, if we follow the preceding logic, findAuthorsLeftJoinWithEntityGraph should also return an array but an array of a single element. This is accurate; it does. However, upon examining the implementation at **QueryLoader **lines 476 and 620, when the array consists of a single element, the element is just unwrapped.

How to Fix It

To ensure correct duplicate filtering when applying a distinction, the result set should be a list of entities rather than a list of entity arrays. This behavior is determined by the entities selected.

A straightforward solution is to include the desired entities in the SELECT clause:

@EntityGraph(attributePaths = "books")
@Query("SELECT a FROM Author a INNER JOIN a.agent WHERE a.agent.name like '%John%'")
List<Author> findAuthorsWithJoinNotFetchedAndEntityGraph();

The SELECT clause will only contain Author and Book (the Book select being generated by the EntityGraph):

Since Book is a FETCH operation, the includeInSelect will only select one element (Author), and instead of returning an array, it will return directly the Author instance which then will pass through the IdentitySet to be deduplicated.

Conclusion

Entity duplication occurs when tables are joined, and multiple rows are matched. There are several methods to address this issue, including using Set, the DISTINCT keyword, or EntityGraph to deduplicate the result set.

In the last two cases, deduplication is achieved through a Hibernate flag that accumulates results into an IdentitySet.

However, it may not work in certain scenarios due to the fact that, depending on your SELECT clause, it could return newly instantiated arrays of your selected entities for each row. This could undermine deduplication through the IdentitySet.

Therefore, it is advisable, when writing JPQL queries, to always specify the SELECT clause. If not supplied, all entities in the FROM clause that are not subject to a FETCH operation will be selected by default.

The issue of duplicated parent entities is addressed in Hibernate 6 (shipped in Spring Boot 3), as it has introduced automatic handling of duplicated entities. However, considering that the release of Hibernate 6 is relatively recent, many applications still run on Hibernate 5 or earlier versions, and migration from Hibernate 5 to 6 might be challenging.

Thus, understanding how Hibernate 5 handles these situations remains relevant for the considerable number of applications still using it.

Also published here