
Tuesday, December 13, 2022

[FIXED] How to capture a literal in antlr4?

December 13, 2022 antlr, antlr4, grammar, regex, syntax

Issue

I am looking to make a rule for a regex character class that is of the form:

 character_range
   : '[' literal '-' literal ']'
   ;

For example, with [1-5]+ I could match the string "1234543" but not "129". However, I'm having a hard time figuring out how to define a "literal" in ANTLR4. Normally I would use [a-zA-Z], but that covers only ASCII and won't include something such as é. So how would I do that?

Solution

Actually, you don't want to match an entire literal, because a literal can be more than one character. Instead you only need a single character for the match.

In the parser:

character_range: OPEN_BRACKET LETTER DASH LETTER CLOSE_BRACKET;

And in the lexer:

OPEN_BRACKET: '[';
CLOSE_BRACKET: ']';
DASH: '-';
LETTER: [\p{L}];

The character class used in the LETTER lexer rule is the Unicode general category Letter (L), as described in ANTLR's Unicode documentation. Other possible property values are listed in UAX #44 (Unicode Character Database). You may need others such as Numbers, Punctuation or Separators to cover all possible regex character classes; note that LETTER alone does not match the digits in a range like [1-5].
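To get a feel for what the Letter category covers, here is a minimal Python sketch. It is not ANTLR code: it uses the standard unicodedata module as a stand-in for \p{L}, and the helper names is_letter, in_range and matches_range are made up for this illustration. The range check compares code points, which is one plausible way to interpret [lo-hi].

```python
import unicodedata


def is_letter(ch: str) -> bool:
    """True if ch is one code point whose general category starts with 'L'."""
    return len(ch) == 1 and unicodedata.category(ch).startswith("L")


def in_range(ch: str, lo: str, hi: str) -> bool:
    """True if ch lies between lo and hi by code point (assumed range semantics)."""
    return ord(lo) <= ord(ch) <= ord(hi)


def matches_range(text: str, lo: str, hi: str) -> bool:
    """Rough equivalent of the regex [lo-hi]+ applied to the whole string."""
    return len(text) > 0 and all(in_range(ch, lo, hi) for ch in text)


if __name__ == "__main__":
    print(is_letter("\u00e9"))               # é is a letter
    print(is_letter("1"))                    # digits are category Nd, not L
    print(matches_range("1234543", "1", "5"))
    print(matches_range("129", "1", "5"))
```

The digit case shows why a Numbers class would be needed in addition to LETTER for a range such as [1-5].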

Answered By - Mike Lischke

Answer Checked By - Clifford M. (PHPFixing Volunteer)

[FIXED] How does an empty regular expression evaluate?

December 12, 2022 grammar, parsing, regex, syntax

Issue

For doing something like the following:

select regexp_matches('X', '');

Is a regular expression of an empty-string defined behavior? If so, how does it normally work?

In other words, which of the following is the base production (ignoring some of the advanced constructs such as repetition, grouping, etc.)?

regex
    : atom+
    ;

Or:

regex
    : atom*
    ;

As an example:

regex101 shows no match for all 7 flavors, but Postgres returns true on select regexp_matches('X', '');.

Solution

The empty regex, by definition, matches the empty string. In a substring match (which is what PostgreSQL's regexp_matches performs), the match always succeeds since the empty string is a substring of every string, including itself. So it's not a very useful query, but it should work with any regex implementation. (It might be more useful as a full string match, but string equality would also work and probably with less overhead.)
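The difference between a substring match and a full string match can be tried out quickly. The sketch below uses Python's re module rather than PostgreSQL, so it only shows how one particular engine treats the empty pattern; the variable names are illustrative.

```python
import re

# Substring search: the empty pattern matches at the very start of any input.
substring = re.search("", "X")

# Full-string match: the empty pattern matches only the empty string.
full_nonempty = re.fullmatch("", "X")
full_empty = re.fullmatch("", "")

print(substring.span())        # span of the zero-length match
print(full_nonempty)           # no match object for a non-empty input
print(full_empty is not None)  # the empty string does match
```

In grammar terms this corresponds to the atom* production: zero atoms is a valid regex that matches the empty string.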

One aspect of empty matches which does vary between regex implementations is how they interact with the "global" (repeated application) flag or equivalent. Most regex engines will advance one character after a successful zero-length substring match, but there are exceptions. As a general rule, nullable regexes (including the empty regex) should not be used with a repeated application flag unless the result is explicitly documented by the regex library (and, for what it's worth, I couldn't find such documentation for PostgreSQL, but that doesn't mean that it doesn't exist somewhere).
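As an illustration of the "advance one character" behavior, here is what Python's re module does when the empty pattern is applied repeatedly. This is only one engine's documented behavior and says nothing about PostgreSQL.

```python
import re

# Repeated application of the empty pattern: one zero-length match at each
# position in the input, including the position after the last character.
positions = [m.span() for m in re.finditer("", "ab")]
matches = re.findall("", "ab")

print(positions)
print(matches)
```

For the two-character input "ab" this yields three zero-length matches, at offsets 0, 1 and 2.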

Answered By - rici

Answer Checked By - Cary Denson (PHPFixing Admin)

[FIXED] How to skip input according to keywords in ANTLR4

July 09, 2022 antlr4, grammar, keyword, skip

Issue

I am new to antlr4 and wonder if it can do what I am looking for. Here is an example input:

There is a lot of text 
in this file that i do not care 
about
Lithium 20 g/ml
Bor that should be skipped
Potassium  300g/ml
...

and code:

SempredParser.g4

parser grammar SempredParser;
options { tokenVocab=SempredLexer ;}

file        : line+ EOF;
line        : KEYWORD (NUM UNIT)+ '\n'+;

SempredLexer.g4:

lexer grammar SempredLexer;

//lexer rules

KEYWORD     : ('Lithium' | 'Potassium' ) ;
NL          : '\n';
NUM         : [0-9]+ ('.'[0-9]+)? ;
UNIT        : 'g/ml';
UNKNOWN     : . -> skip ;

I would like to skip all the lines that do not contain a KEYWORD (I have around 100 KEYWORDS). Note that I only use '\n' as delimiter here and would ideally not have it parsed to the output.

I read about Island grammars in the Definitive guide and also tried using lexer modes but could not make it work that way. Any hints and help greatly appreciated.

Solution

You are pretty close; just avoid defining the line break token twice. This grammar works for me (I put it into a combined grammar file):

grammar IslandTest;

start: NL+ line+ EOF;
line:  KEYWORD (NUM UNIT)+ NL+;

KEYWORD: ('Lithium' | 'Potassium');
NUM:     [0-9]+ ('.' [0-9]+)?;
UNIT:    'g/ml';

NL:      '\n';
UNKNOWN: . -> skip;

With your input that gives me this parse tree:

[image: parse tree for the sample input]

Note also: you cannot avoid the NL token in your output, because you decided to make your line parse rule line based, which requires the newline token.
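To see which tokens survive the UNKNOWN skip rule, the lexer can be approximated in Python. This is a rough sketch, not what ANTLR generates: ANTLR picks the longest match across all rules, while a Python alternation takes the first alternative that matches, which happens to give the same result for these five rules. The names TOKEN_RE and tokenize are made up for this example.

```python
import re

# One named group per lexer rule, in the same order as the grammar.
TOKEN_RE = re.compile(
    r"(?P<KEYWORD>Lithium|Potassium)"
    r"|(?P<NUM>[0-9]+(?:\.[0-9]+)?)"
    r"|(?P<UNIT>g/ml)"
    r"|(?P<NL>\n)"
    r"|(?P<UNKNOWN>.)"
)


def tokenize(text: str) -> list[tuple[str, str]]:
    """Return (rule name, text) pairs, dropping UNKNOWN like '-> skip' does."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        if m.lastgroup != "UNKNOWN":
            tokens.append((m.lastgroup, m.group()))
    return tokens


if __name__ == "__main__":
    sample = "noise\nLithium 20 g/ml\nBor that\nPotassium  300g/ml\n"
    for token in tokenize(sample):
        print(token)
```

Every skipped line still leaves its NL token behind, which is why the start rule begins with NL+ and each line rule ends with NL+.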

Answered By - Mike Lischke

Answer Checked By - Willingham (PHPFixing Volunteer)
