Talks  Archives  About me

Draw me an (abstract) tree

Cet article est disponible en Français : Dessine-moi un arbre (abstrait) (2022-06-29)

The parser stage creates a parse tree using only fixed rules about the syntactic structure of SQL. It does not make any lookups in the system catalogs, so there is no possibility to understand the detailed semantics of the requested operations.

(Documentation: Transformation Process)

What is going on from when a user sends his SQL query to getting back a data result? This passionating question (by a limited amount of people, of course) has been studied by Stefan Simkovics during his Master’s Thesis at Vienna University of Technology in 1998.

His remarkable work was added in official documentation as “Overview of PostgreSQL Internals”, which is intended to share Simkovics’ research in a simplified way to reach a larger audience.

With this article, I’m thrilled to share recent thoughts about a subelement of these internals, the parser. It relies on a similar approach to compiling by using an advanced development pattern called AST (abstract syntax tree).


From code to machine

Writing a statement as a bunch of words, as we do with SQL, involves a need of understanding this specific statement by the execution engine. A simple comparison is common language, when grammar rules enforce the order of adjectives, nouns, and pronouns so that two interlocutors can understand each other.

In computing, this process is called compilation and transforms code instructions to their equivalent binary operations submitted to the machine. Since the dawn of computer sciences, a few software programs have been responsible for analyzing instructions, divided into several families:

  • Lexical analysis reads a sequence of keywords or lexemes to match with internal tokens, detects spacing or comments. The most famous scanners are Lex and Flex (a open-source alternative to Lex);
  • Parsing refers to the formal analysis of previous lexemes into its constituents, resulting in a parse tree showing their syntactic relations to each other. The main parsers are Yacc and Bison (a forward-compatible Yacc replacement);
  • Semantic analysis gathers necessary semantic information from previous steps, including variable declaration or type checking.

Compiling steps are scrupulously implemented in PostgreSQL when a SQL sentence sent by a user needs to be interpreted. The Simkovics’ thesis tales a journey into query parsing:

SELECT s.sname, se.pno
  FROM supplier s, sells se
 WHERE s.sno > 2 AND s.sno = se.sno;

Scanning step finds out every instructions words and categorizes them into lexemes (reserved keywords, identifiers, operators, literals). If any syntax misleading is encoutered, like a coma before FROM keyword, query parsing is halt and an explicit error message is thrown back to user:

ERROR:  syntax error at or near "FROM"
LINE 2:   FROM supplier s, sells se

At the end, if parsed query is syntactically correct, a parse tree is built in memory to link lexemes according to the grammar rules of the language. Thus, the main node SelectStmt is composed by different branches, like queried tables under their RangeVar node stored as an array into fromClause attribute. The same goes for the representation of columns and conditions through the targetList and whereClause nodes respectively.

Parse tree representation

Our parse tree is passed to an upper step, called rewriting, responsible for performing some optimizations and transformations to nodes and removing useless leaves. Then two others mechanisms take place, namely planner and executor. Our final parse tree will be use to build data result requested by user, but I will not discuss here.


Rebuilding an abstract tree

Recently, I wrote some dynamic SQL queries as part of a PL/pgSQL side-project. This feature is quite common, it involves putting several pieces of expressions together to write a SQL query whose parts (columns, tables, conditions) may vary. Here is former prototype of the code:

DO $prototype$
DECLARE
  r record;
  v_columns text;
  v_tabname text;
  v_values text := $$ 1, 'test' $$;
BEGIN
  SELECT value INTO v_tabname
    FROM config WHERE name = 'table_name';

  SELECT string_agg(value, ',') INTO v_columns
    FROM config WHERE name = 'column_name';

  EXECUTE format(
    'INSERT INTO %s (%s) VALUES (%s);',
    v_tabname, v_columns, v_values
  );
END;
$prototype$;

Content of the config table is under the logic and could be critical when constructing a syntactically correct INSERT statement. In addition, in the more than likely event that my needs are getting finer, this procedural code will getting more complex and finally may encounter troubles in maintenance and scalability.

Talking to one of my colleagues about the obvious complications that were growing in my prototype, he advised me to turn to a more advanced concept and make my code more modular using a new abstraction level, aforementioned AST pattern. This method is entirely based on a tree representation of a complex object that we can manipulate and design easily.

In my case, it was about:

  • Building a SQL statement as a parse tree;
  • Deparsing back without lexical or syntactic error when needed.

In few weeks after, a out-of-nowhere solution flashed in my Twitter timeline, a pure PL/pgSQL extension called postgres-ast-deparser. Its main goals are building abstract trees and deparsing back into SQL statements! After a few discussions with its author Dan Lynch, I used a series of AST functions to improve my procedural code.

DO $prototype$
DECLARE
  v_relation jsonb;
  v_columns jsonb;
  v_values jsonb := to_jsonb(ARRAY[ARRAY[
    ast.a_const(v_val := ast.integer(1)),
    ast.a_const(v_val := ast.string('test'))
  ]]);
BEGIN
  SELECT ast_helpers.range_var(
      v_schemaname := 'public', 
      v_relname := value) INTO v_relation
    FROM config WHERE name = 'table_name';

  SELECT jsonb_agg(ast.res_target(
      v_name := value)) INTO v_columns
    FROM config WHERE name = 'column_name';

  EXECUTE deparser.expression(ast.insert_stmt(
    v_relation := v_relation,
    v_cols := v_columns,
    v_selectStmt := ast.select_stmt(
      v_valuesLists := v_values,
      v_op := 'SETOP_NONE'
    )
  )) FROM config WHERE name = 'table_name';
END;
$prototype$;

The extension offers a bunch of methods in the ast and ast_helpers schemas to create tree nodes as JSONB structures. Nesting several calls let us have a entire tree with the upper node InsertStmt, as defined by PostgreSQL parser itself!


Conclusion

By manipulating trees with JSONB, I realized how powerful projects like postgres-ast-deparser are. This extension relies on a other work called libpg_query, provided by pganalyze engineers, which use the internal parser outside of PostgreSQL!

Use cases may be numerous, like syntax highlighting or validation, prettying query newlines or serializing a statement to easily drop or modify internal nodes, etc. Another parsing project wrote in Python, called pglast, suggests you in its documentation more examples, if by chance, this article has aroused your curiosity.