Hierarchical data types

19 sep 2024 - 5 minutes to read

Cet article est disponible en Français : Les types hiérarchiques (2024-09-19)

The SQL standard defines a set of rules so that database systems can be interchangeable, but there are small singularities in the wild. In this regard, the hierarchyid data type provided by SQL Server is a striking example. If you are switching to PostgreSQL, two solutions are available to you.

A first and simpler solution consists in linking each node to its parent using a new parentid column and applying a foreign key constraint. Another, more complete approach consists in using the ltree extension. This article deals with the latter case.

Searching for descendants

The hierarchyid type is designed to represent a hierarchical relationship as a binary tree. It stores the list of successive nodes up to the root in a single column. Thus, the node /1/1/2/ represents a level 3 node, child of the node /1/1/, itself child of the node /1/, itself child of the root /.

Many built-in methods are part of the Transact-SQL language to manipulate the data. They all traverse the binary tree to quickly extract the desired information without having to join the nodes of the table together.

ToString() : returns the textual representation of the current node.
GetLevel() : returns the depth level of the current node.
GetAncestor(n) : returns the n-level node in the tree.
GetDescendant(c1, c2) : returns a new child node between two nodes.

PostgreSQL brings us a new data type, ltree, to store hierarchical data. It is a complex data type that can store labels up to 1,000 characters long, separated by dots. The path can contain up to 65,635 labels.

CREATE EXTENSION ltree;

CREATE TABLE locations (
    id ltree PRIMARY KEY,
    location text NOT NULL,
    locationtype text NOT NULL
);

INSERT INTO locations VALUES
    ('1', 'Earth', 'Planet'),
    ('1.1', 'Europe', 'Continent'),
    ('1.1.1', 'France', 'Country'),
    ('1.1.1.1', 'Paris', 'City'),
    ('1.1.2', 'Spain', 'Country'),
    ('1.1.2.1', 'Madrid', 'City'),
    ('1.2', 'South-America', 'Continent'),
    ('1.2.1', 'Brazil', 'Country'),
    ('1.2.1.1', 'Brasilia', 'City'),
    ('1.2.2', 'Bahia', 'State'),
    ('1.2.2.1', 'Salvador', 'City'),
    ('1.3', 'Antarctica', 'Continent'),
    ('1.3.1', 'McMurdo Station', 'City');

Looking for the depth level of a node becomes trivial with the nlevel() function provided by the ltree extension.

SELECT id, location, locationtype, nlevel(id) AS level
  FROM locations ORDER BY id;

   id    |    location     | locationtype | level
---------+-----------------+--------------+-------
 1       | Earth           | Planet       |     1
 1.1     | Europe          | Continent    |     2
 1.1.1   | France          | Country      |     3
 1.1.1.1 | Paris           | City         |     4
 1.1.2   | Spain           | Country      |     3
 1.1.2.1 | Madrid          | City         |     4
 1.2     | South-America   | Continent    |     2
 1.2.1   | Brazil          | Country      |     3
 1.2.1.1 | Brasilia        | City         |     4
 1.2.2   | Bahia           | State        |     3
 1.2.2.1 | Salvador        | City         |     4
 1.3     | Antarctica      | Continent    |     2
 1.3.1   | McMurdo Station | City         |     3
 2.1.1   | unknown         | State        |     3
(14 rows)

Another method named subpath() allows you to retrieve a part of the path of a node. Let’s see how to get the parent node of each node in the table.

SELECT id, location, locationtype, subpath(id, 0, nlevel(id) - 1) AS parentid
  FROM locations ORDER BY id;

   id    |    location     | locationtype | parent
---------+-----------------+--------------+--------
 1       | Earth           | Planet       |
 1.1     | Europe          | Continent    | 1
 1.1.1   | France          | Country      | 1.1
 1.1.1.1 | Paris           | City         | 1.1.1
 1.1.2   | Spain           | Country      | 1.1
 1.1.2.1 | Madrid          | City         | 1.1.2
 1.2     | South-America   | Continent    | 1
 1.2.1   | Brazil          | Country      | 1.2
 1.2.1.1 | Brasilia        | City         | 1.2.1
 1.2.2   | Bahia           | State        | 1.2
 1.2.2.1 | Salvador        | City         | 1.2.2
 1.3     | Antarctica      | Continent    | 1
 1.3.1   | McMurdo Station | City         | 1.3
 2.1.1   | unknown         | State        | 2.1
(14 rows)

Least but not last, the search for child nodes from a given node is possible thanks to specialized comparison operators. To improve performance, it is recommended to create a GIST index on the primary key column id.

CREATE INDEX ON locations USING GIST (id);

All cities in Europe can be retrieved with the following query.

SELECT l1.*
  FROM locations l1
  JOIN locations l2 ON l1.id <@ l2.id
 WHERE l1.locationtype = 'City' AND l2.location = 'Europe';

   id    | location | locationtype
---------+----------+--------------
 1.1.1.1 | Paris    | City
 1.1.2.1 | Madrid   | City
(2 rows)

Working under constraints

The declared primary key constraint prevents me from inserting an existing path. However, it is still possible to add a new row whose path does not have any ancestor in the table.

INSERT INTO locations VALUES ('2.1.1', 'Unknown', 'Continent');

INSERT 0 1

Things does not change much between SQL Server and PostgreSQL, it is necessary to add an additional column, named parentid for instance, to enforce the foreign key constraint. The following query reuses the subpath() function ensuring a null value is inserted if it is a root node.

DELETE FROM locations WHERE id <@ '2';

ALTER TABLE locations ADD COLUMN parentid ltree
    REFERENCES locations (id)
    GENERATED ALWAYS AS (
        CASE subpath(id, 0, nlevel(id) - 1)
            WHEN '' THEN null
            ELSE subpath(id, 0, nlevel(id) - 1)
        END
    ) STORED;

By now, whenever a new row is inserted, the foreign key constraint is automatically checked.

INSERT INTO locations VALUES ('2.1.1', 'Unknown', 'Continent');

ERROR:  insert or update on table "locations" violates
        foreign key constraint "locations_parentid_fkey"
DETAIL:  Key (parentid)=(2.1) is not present in table "locations".

Just one solution among others

This kind of data type is a clever solution to store hierarchical data and is easily available with PostgreSQL. However, it is an extension of the SQL language and each database engine can propose an implementation that can radically change from one system to another.

As I mentioned in the introduction, the universal answer is to rebuild a hierarchical relationship using a recursive query and the WITH RECURSIVE syntax. To take the example of the locations table, the list of cities in Europe could be obtained as follows.

WITH RECURSIVE loc AS (
    SELECT id, parentid, location, locationtype
      FROM locations
     WHERE location = 'Europe'
     UNION ALL
    SELECT l.id, l.parentid, l.location, l.locationtype
      FROM locations l
      JOIN loc r ON l.parentid = r.id
)
SELECT * FROM loc
 WHERE locationtype = 'City';

 id | parentid | location | locationtype
----+----------+----------+--------------
  4 |        3 | Paris    | City
  6 |        5 | Madrid   | City
(2 rows)