Draw me an (abstract) tree
- 5 minutes to read
The parser stage creates a parse tree using only fixed rules about the syntactic structure of SQL. It does not make any lookups in the system catalogs, so there is no possibility to understand the detailed semantics of the requested operations.
(Documentation: Transformation Process)
What is going on from when a user sends his SQL query to getting back a data result? This passionating question (by a limited amount of people, of course) has been studied by Stefan Simkovics during his Master’s Thesis at Vienna University of Technology in 1998.
His remarkable work was added in official documentation as “Overview of PostgreSQL Internals”, which is intended to share Simkovics’ research in a simplified way to reach a larger audience.
With this article, I’m thrilled to share recent thoughts about a subelement of these internals, the parser. It relies on a similar approach to compiling by using an advanced development pattern called AST (abstract syntax tree).
From code to machine
Writing a statement as a bunch of words, as we do with SQL, involves a need of understanding this specific statement by the execution engine. A simple comparison is common language, when grammar rules enforce the order of adjectives, nouns, and pronouns so that two interlocutors can understand each other.
In computing, this process is called compilation and transforms code instructions to their equivalent binary operations submitted to the machine. Since the dawn of computer sciences, a few software programs have been responsible for analyzing instructions, divided into several families:
- Lexical analysis reads a sequence of keywords or lexemes to match with internal tokens, detects spacing or comments. The most famous scanners are Lex and Flex (a open-source alternative to Lex);
- Parsing refers to the formal analysis of previous lexemes into its constituents, resulting in a parse tree showing their syntactic relations to each other. The main parsers are Yacc and Bison (a forward-compatible Yacc replacement);
- Semantic analysis gathers necessary semantic information from previous steps, including variable declaration or type checking.
Compiling steps are scrupulously implemented in PostgreSQL when a SQL sentence sent by a user needs to be interpreted. The Simkovics’ thesis tales a journey into query parsing:
SELECT s.sname, se.pno
FROM supplier s, sells se
WHERE s.sno > 2 AND s.sno = se.sno;
Scanning step finds out every instructions words and categorizes them into lexemes
(reserved keywords, identifiers, operators, literals). If any syntax misleading
is encoutered, like a coma before FROM
keyword, query parsing is halt and an
explicit error message is thrown back to user:
ERROR: syntax error at or near "FROM"
LINE 2: FROM supplier s, sells se
At the end, if parsed query is syntactically correct, a parse tree is built in
memory to link lexemes according to the grammar rules of the language. Thus, the
main node SelectStmt
is composed by different branches, like queried tables
under their RangeVar
node stored as an array into fromClause
attribute. The
same goes for the representation of columns and conditions through the targetList
and whereClause
nodes respectively.
Our parse tree is passed to an upper step, called rewriting, responsible for performing some optimizations and transformations to nodes and removing useless leaves. Then two others mechanisms take place, namely planner and executor. Our final parse tree will be use to build data result requested by user, but I will not discuss here.
Rebuilding an abstract tree
Recently, I wrote some dynamic SQL queries as part of a PL/pgSQL side-project. This feature is quite common, it involves putting several pieces of expressions together to write a SQL query whose parts (columns, tables, conditions) may vary. Here is former prototype of the code:
DO $prototype$
DECLARE
r record;
v_columns text;
v_tabname text;
v_values text := $$ 1, 'test' $$;
BEGIN
SELECT value INTO v_tabname
FROM config WHERE name = 'table_name';
SELECT string_agg(value, ',') INTO v_columns
FROM config WHERE name = 'column_name';
EXECUTE format(
'INSERT INTO %s (%s) VALUES (%s);',
v_tabname, v_columns, v_values
);
END;
$prototype$;
Content of the config
table is under the logic and could be critical when
constructing a syntactically correct INSERT
statement. In addition, in the more
than likely event that my needs are getting finer, this procedural code will
getting more complex and finally may encounter troubles in maintenance and
scalability.
Talking to one of my colleagues about the obvious complications that were growing in my prototype, he advised me to turn to a more advanced concept and make my code more modular using a new abstraction level, aforementioned AST pattern. This method is entirely based on a tree representation of a complex object that we can manipulate and design easily.
In my case, it was about:
- Building a SQL statement as a parse tree;
- Deparsing back without lexical or syntactic error when needed.
In few weeks after, a out-of-nowhere solution flashed in my Twitter timeline, a pure PL/pgSQL extension called postgres-ast-deparser. Its main goals are building abstract trees and deparsing back into SQL statements! After a few discussions with its author Dan Lynch, I used a series of AST functions to improve my procedural code.
DO $prototype$
DECLARE
v_relation jsonb;
v_columns jsonb;
v_values jsonb := to_jsonb(ARRAY[ARRAY[
ast.a_const(v_val := ast.integer(1)),
ast.a_const(v_val := ast.string('test'))
]]);
BEGIN
SELECT ast_helpers.range_var(
v_schemaname := 'public',
v_relname := value) INTO v_relation
FROM config WHERE name = 'table_name';
SELECT jsonb_agg(ast.res_target(
v_name := value)) INTO v_columns
FROM config WHERE name = 'column_name';
EXECUTE deparser.expression(ast.insert_stmt(
v_relation := v_relation,
v_cols := v_columns,
v_selectStmt := ast.select_stmt(
v_valuesLists := v_values,
v_op := 'SETOP_NONE'
)
)) FROM config WHERE name = 'table_name';
END;
$prototype$;
The extension offers a bunch of methods in the ast
and ast_helpers
schemas
to create tree nodes as JSONB structures. Nesting several calls let us have a
entire tree with the upper node InsertStmt
, as defined by PostgreSQL
parser itself!
Conclusion
By manipulating trees with JSONB, I realized how powerful projects like
postgres-ast-deparser
are. This extension relies on a other work called
libpg_query, provided by pganalyze engineers,
which use the internal parser outside of PostgreSQL!
Use cases may be numerous, like syntax highlighting or validation, prettying
query newlines or serializing a statement to easily drop or modify internal
nodes, etc. Another parsing project wrote in Python, called pglast
, suggests
you in its documentation more examples, if by chance, this article has
aroused your curiosity.