PHPFixing
Showing posts with label grammar. Show all posts

Tuesday, December 13, 2022

[FIXED] How to capture a literal in antlr4?

 December 13, 2022     antlr, antlr4, grammar, regex, syntax     No comments   

Issue

I am looking to make a rule for a regex character class that is of the form:

 character_range
   : '[' literal '-' literal ']'
   ;

For example, with [1-5]+ I could match the string "1234543" but not "129". However, I'm having a hard time figuring out how to define a "literal" in ANTLR4. Normally I would use [a-zA-Z], but that covers only ASCII and won't include a character such as é. How would I do that?


Solution

Actually, you don't want to match an entire literal, because a literal can be more than one character. Instead you only need a single character for the match.

In the parser:

character_range: OPEN_BRACKET LETTER DASH LETTER CLOSE_BRACKET;

And in the lexer:

OPEN_BRACKET: '[';
CLOSE_BRACKET: ']';
DASH: '-';        // referenced by the parser rule above
LETTER: [\p{L}];  // any single Unicode letter

The character class used in the LETTER lexer rule matches Unicode letters (general category L), as described in ANTLR's Unicode documentation. Other possible character classes are listed in Unicode Standard Annex #44 (Unicode Character Database). You may need others, such as Numbers, Punctuation, or Separators, to cover all possible regex character classes.
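To make the "single Unicode letter" idea concrete outside of ANTLR, here is a minimal Python sketch of the same check. It is not part of the answer's grammar: the helper names `is_letter` and `parse_character_range` are hypothetical, and the sketch assumes that testing the Unicode general category with the standard `unicodedata` module is an acceptable stand-in for `\p{L}`.

```python
import unicodedata


def is_letter(ch):
    """Return True if ch is one code point whose general category starts with 'L'."""
    return len(ch) == 1 and unicodedata.category(ch).startswith("L")


def parse_character_range(text):
    """Parse '[x-y]' where x and y are single Unicode letters; return (x, y).

    Hypothetical helper mirroring the parser rule
    OPEN_BRACKET LETTER DASH LETTER CLOSE_BRACKET.
    """
    if len(text) != 5 or text[0] != "[" or text[2] != "-" or text[4] != "]":
        raise ValueError("expected the form [x-y]")
    low, high = text[1], text[3]
    if not (is_letter(low) and is_letter(high)):
        raise ValueError("both range endpoints must be Unicode letters")
    return low, high


if __name__ == "__main__":
    # "\u00e9" is the precomposed character é, which is a letter (category Ll).
    print(parse_character_range("[a-\u00e9]"))
```

Note that a range such as [1-5] is rejected by this sketch because digits are in category Nd, not L, which is why additional classes such as Numbers would be needed for a complete regex grammar.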



Answered By - Mike Lischke
Answer Checked By - Clifford M. (PHPFixing Volunteer)

Monday, December 12, 2022

[FIXED] How does an empty regular expression evaluate?

 December 12, 2022     grammar, parsing, regex, syntax     No comments   

Issue

For doing something like the following:

select regexp_matches('X', '');

Is a regular expression of an empty-string defined behavior? If so, how does it normally work?

In other words, which of the following is the base production (ignoring some of the advanced constructs such as repetition, grouping, etc.)?

regex
    : atom+
    ;

Or:

regex
    : atom*
    ;

As an example:

[screenshot omitted]

regex101 shows no match for all 7 flavors, but Postgres returns true on select regexp_matches('X', '');.


Solution

The empty regex, by definition, matches the empty string. In a substring match (which is what PostgreSQL's regexp_matches performs), the match always succeeds, since the empty string is a substring of every string, including itself. So it's not a very useful query, but it should work with any regex implementation. (It might be more useful as a full-string match, but string equality would also work, and probably with less overhead.)

One aspect of empty matches which does vary between regex implementations is how they interact with the "global" (repeated application) flag or equivalent. Most regex engines will advance one character after a successful zero-length substring match, but there are exceptions. As a general rule, nullable regexes (including the empty regex) should not be used with a repeated application flag unless the result is explicitly documented by the regex library (and, for what it's worth, I couldn't find such documentation for PostgreSQL, but that doesn't mean that it doesn't exist somewhere).
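The behavior described above can be observed with Python's standard `re` module. This is only an illustration in one engine, not a statement about PostgreSQL: the sketch shows a substring search with the empty pattern, a full-string match, and repeated application via `findall`, which in Python advances one character after each zero-length match.

```python
import re

empty = re.compile("")

# Substring search: the empty pattern matches at position 0 of any string.
substring_match = empty.search("X")
print(substring_match.span())

# Full-string match: succeeds only when the subject itself is empty.
print(empty.fullmatch("X"))
print(empty.fullmatch("") is not None)

# Repeated application: one zero-length match at each position,
# including the position after the last character.
print(empty.findall("abc"))
```

In grammar terms, this corresponds to the `atom*` production: a regex consisting of zero atoms is valid and matches the empty string.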



Answered By - rici
Answer Checked By - Cary Denson (PHPFixing Admin)

Saturday, July 9, 2022

[FIXED] How to skip input according to keywords in ANTLR4

 July 09, 2022     antlr4, grammar, keyword, skip     No comments   

Issue

I am new to antlr4 and wonder if it can do what I am looking for. Here is an example input:

There is a lot of text 
in this file that i do not care 
about
Lithium 20 g/ml
Bor that should be skipped
Potassium  300g/ml
...

and code:

SempredParser.g4:

parser grammar SempredParser;
options { tokenVocab=SempredLexer ;}

file        : line+ EOF;
line        : KEYWORD (NUM UNIT)+ '\n'+;

SempredLexer.g4:

lexer grammar SempredLexer;

//lexer rules

KEYWORD     : ('Lithium' | 'Potassium' ) ;
NL          : '\n';
NUM         : [0-9]+ ('.'[0-9]+)? ;
UNIT        : 'g/ml';
UNKNOWN     : . -> skip ;

I would like to skip all the lines that do not contain a KEYWORD (I have around 100 KEYWORDS). Note that I only use '\n' as delimiter here and would ideally not have it parsed to the output.

I read about Island grammars in the Definitive guide and also tried using lexer modes but could not make it work that way. Any hints and help greatly appreciated.


Solution

You are pretty close; just avoid defining the line break token twice (once as the '\n' literal in the parser and once as NL in the lexer). This grammar works for me (I put it into a combined grammar file):

grammar IslandTest;

start: NL+ line+ EOF;
line:  KEYWORD (NUM UNIT)+ NL+;

KEYWORD: ('Lithium' | 'Potassium');
NUM:     [0-9]+ ('.' [0-9]+)?;
UNIT:    'g/ml';

NL:      '\n';
UNKNOWN: . -> skip;

With your input that gives me this parse tree:

[Image: parse tree for the sample input]

Note also: you cannot avoid the NL token in your output, because your line parser rule is line-based and relies on the newline token to terminate each line.
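For comparison, the same "keep only keyword lines" filtering can be sketched without ANTLR, using Python's `re` module. This is a different technique from the grammar above, not a translation of it: it rejects any line that does not match the pattern in full, whereas the ANTLR lexer skips unknown characters one at a time. The names `KEYWORDS`, `LINE_RE`, and `extract` are hypothetical.

```python
import re

# Hypothetical keyword list; the question mentions roughly 100 keywords.
KEYWORDS = ("Lithium", "Potassium")

NUM = r"\d+(?:\.\d+)?"
KEYWORD_ALT = "|".join(re.escape(k) for k in KEYWORDS)

# KEYWORD followed by one or more "NUM g/ml" pairs, with optional whitespace.
LINE_RE = re.compile(rf"\s*({KEYWORD_ALT})((?:\s*{NUM}\s*g/ml)+)\s*")
NUM_RE = re.compile(NUM)


def extract(text):
    """Return (keyword, [values]) for each line that matches; skip all others."""
    results = []
    for line in text.split("\n"):
        m = LINE_RE.fullmatch(line)
        if m:
            values = [float(n) for n in NUM_RE.findall(m.group(2))]
            results.append((m.group(1), values))
    return results


if __name__ == "__main__":
    sample = (
        "There is a lot of text \n"
        "in this file that i do not care \n"
        "about\n"
        "Lithium 20 g/ml\n"
        "Bor that should be skipped\n"
        "Potassium  300g/ml\n"
    )
    for keyword, values in extract(sample):
        print(keyword, values)
```

Because the newline is consumed by `split`, it never appears in the output here, which is the trade-off the answer points out: a line-based parser rule needs the newline token, while a line-by-line pre-filter does not.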



Answered By - Mike Lischke
Answer Checked By - Willingham (PHPFixing Volunteer)
Copyright © PHPFixing