In my spare time I recently resurrected an old idea: a tool to transform a Backus-Naur Form (BNF) grammar into an ANTLR grammar. The purpose was to easily generate a parser from an existing BNF grammar. What I had in mind was converting a JPQL query string to Criteria objects. I had been surprised not to find such a grammar conversion tool available as open source. So started ēmaitijǭ. Simple was the idea, humble was the ambition: at first glance, only character replacements.
Limited was my experience in text parsing. A first idea was to take the definition of BNF given in the Wikipedia article, itself written as a set of BNF rules, and convert it to an ANTLR grammar in order to generate the parser for the transformation tool itself. Such a move seemed too complicated to me: I chose to process the input file in a single class. The first ANTLR grammars were generated a few minutes later. Unfortunately, ANTLR rejected these grammars, with many errors in the output. I finally came back to the idea of a traditional parsing tool, but wrote it myself: in hindsight, the initial plan of generating it with ANTLR would have been the right one.
A few hours later, my parser did the same job as the single text transformation class. As a result, there were just as many errors in the ANTLR output. I had a look at the JPQL BNF input. Here is a single expression that challenged me:
{AVG |MAX |MIN |SUM} ([DISTINCT] state_field_path_expression) | COUNT ([DISTINCT] identification_variable | state_field_path_expression | single_valued_association_path_expression)
Reading it properly was the first problem. I was misled by the parentheses: being used to common programming languages, I first took these characters for grouping symbols, as in Extended BNF. What a mistake: these were literal characters in the expression. As a consequence, the parser did not read the sequence properly. Just a bug: easy to solve.
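To make the difference between the two readings concrete, here is a minimal sketch in Python. It is not part of ēmaitijǭ: the two regular expressions are hypothetical stand-ins for the COUNT alternative of the rule, simplified so that the argument is a single word.

```python
import re

# Reading 1 (my mistake): parentheses only group terms, as in EBNF,
# so they never appear in the text being parsed.
grouping = re.compile(r"COUNT\s+(?:DISTINCT\s+)?(\w+)$")

# Reading 2 (what the JPQL grammar means): parentheses are literal
# characters that must be present in the query.
literal = re.compile(r"COUNT\s*\(\s*(?:DISTINCT\s+)?(\w+)\s*\)$")

query = "COUNT(DISTINCT e)"
print(grouping.match(query))          # None: the '(' is unexpected
print(literal.match(query).group(1))  # e
```

Under the first reading a real JPQL aggregate is rejected, while a parenthesis-free string such as COUNT DISTINCT e would be accepted.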
The following part of the expression raises another issue:
([DISTINCT] state_field_path_expression)
Syntax was my second problem. In Extended BNF, literals are clearly identified with quotes. This is not the rule in standard BNF. Because terms are concatenated without any separator, how can the parser distinguish the literal parentheses from the expressions to be interpreted (in this case [DISTINCT] and state_field_path_expression)?
I did not solve that problem properly and modified the input BNF instead: I just introduced a whitespace before and after each parenthesis. After all, ēmaitijǭ was probably not a very good idea in the first place. Donald Knuth said about the form of BNF that it was "not a normal form in the conventional sense". This is probably why such a tool was not available online before. A main problem is that there is no strict definition of BNF.
BNF is far more widely used than EBNF in formal definitions. Is the literal limitation a real problem? In conclusion, if it were definitively stated that literals and other expression terms must be delimited by whitespace (instead of quotes or anything else), wouldn't it simplify the use of BNF for everyone? Well, I can hear the objection coming: it would no longer be possible to express whitespace in a grammar. But is there a real need to specify whitespace precisely in a grammar? The underlying problem is that the definition of BNF, because of its origins, is fuzzy, and its current use is approximate.
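The whitespace workaround can be sketched in a few lines of Python. This is only an illustration, not the code of ēmaitijǭ: the function names are hypothetical, and the sketch assumes that braces mean grouping, brackets mean an optional part, lowercase names are rule references, and every other token is a literal to be quoted in the ANTLR output.

```python
import re

# Assumed mapping of BNF metasymbols to ANTLR notation.
META = {"{": "(", "}": ")", "[": "(", "]": ")?", "|": "|"}

def pad_parentheses(bnf):
    """Surround every parenthesis with spaces, as I did by hand in the input."""
    return re.sub(r"([()])", r" \1 ", bnf)

def tokenize(bnf):
    """Metasymbols are one-character tokens; anything else runs up to
    the next whitespace or metasymbol."""
    return re.findall(r"[\[\]{}|]|[^\s\[\]{}|]+", bnf)

def to_antlr(bnf, pad=True):
    """Translate one BNF expression into an ANTLR-style expression."""
    out = []
    for tok in tokenize(pad_parentheses(bnf) if pad else bnf):
        if tok in META:
            out.append(META[tok])
        elif re.fullmatch(r"[a-z_][a-z0-9_]*", tok):
            out.append(tok)              # rule reference, kept as-is
        else:
            out.append("'" + tok + "'")  # literal, quoted for ANTLR
    return " ".join(out)

expr = "([DISTINCT] state_field_path_expression)"
print(to_antlr(expr))
# '(' ( 'DISTINCT' )? state_field_path_expression ')'
print(to_antlr(expr, pad=False))
# '(' ( 'DISTINCT' )? 'state_field_path_expression)'
```

Without the padding, the closing parenthesis sticks to the rule name and the whole token is mistaken for a literal, which is exactly the ambiguity described above.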