Issue

I know that PHP's PCRE functions treat strings as byte sequences, so many sites suggest using the /u modifier so that both the input and the regex are handled as UTF-8.

But do I really always need it? My tests show that this flag makes no difference when I don't use escape sequences, the dot, or anything similar.

For example:

preg_match('/^[\da-f]{40}$/', $string); to check whether a string has the format of a SHA-1 hash

preg_replace('/[^a-zA-Z0-9]/', $spacer, $string); to replace every character that is not an ASCII letter or digit

preg_replace('/^\+\((.*)\)$/', '\1', $string); for getting the inner content of +(XYZ)

These regexes contain only single-byte ASCII symbols, so they should work on any input, regardless of encoding, shouldn't they? Note that the third regex uses the dot operator, but since I cut off some ASCII characters at the beginning and end of the string, this should work on UTF-8 too, correct?

Can anyone tell me if I'm overlooking something?

Solution

There is no problem with the first expression. The characters being quantified are explicitly single-byte ASCII, and ASCII bytes (0x00 to 0x7F) never occur inside a UTF-8 multibyte sequence.
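As a rough illustration of why the first pattern is byte-safe, the sketch below runs it against a real SHA-1 hash and against a string that happens to be 40 bytes long but consists only of multibyte characters. The variable names and the sample inputs are made up for this example.

```php
<?php
// Pattern from the question: exactly 40 characters from 0-9 and a-f, no /u.
$pattern = '/^[\da-f]{40}$/';

$hash  = sha1('example');                // 40 ASCII hex characters
$emoji = str_repeat("\u{1F4A9}", 10);    // 10 x 💩 = 40 bytes, every byte >= 0x80

$hashMatches  = preg_match($pattern, $hash);
$emojiMatches = preg_match($pattern, $emoji);

var_dump(strlen($emoji));  // int(40)
var_dump($hashMatches);    // int(1)
var_dump($emojiMatches);   // int(0): no byte of 💩 is an ASCII hex digit
```

The length in bytes is the same, but the class can only ever match ASCII bytes, so the multibyte string is rejected without /u.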

The second expression may give you more spacers than you expect, because without /u every byte of a multibyte character is replaced separately; for example:

echo preg_replace('/[^a-zA-Z0-9]/', "0", "💩");
// => 0000
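For comparison, here is a small sketch of the same replacement with and without /u. The variable names are only for illustration; the input is written as a Unicode escape so the example does not depend on the file's encoding.

```php
<?php
$input = "\u{1F4A9}"; // 💩, four bytes in UTF-8: F0 9F 92 A9

// Without /u the negated class matches each of the four bytes on its own.
$bytewise = preg_replace('/[^a-zA-Z0-9]/', "0", $input);

// With /u the class matches the whole code point once.
$unicode = preg_replace('/[^a-zA-Z0-9]/u', "0", $input);

var_dump($bytewise); // string(4) "0000"
var_dump($unicode);  // string(1) "0"
```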

The third expression also does not pose a problem, as the repeated dot is delimited by literal parentheses (which are ASCII-safe).
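A minimal sketch of that case follows, assuming the third pattern is meant to match a literal +( ... ) with escaped parentheses. The sample input is invented for the example.

```php
<?php
// Assumed form of the third pattern: literal "+(", anything, literal ")".
$pattern = '/^\+\((.*)\)$/';
$input   = "+(\u{1F4A9})"; // "+(💩)"

$inner  = preg_replace($pattern, '\1', $input);        // byte mode
$innerU = preg_replace($pattern . 'u', '\1', $input);  // UTF-8 mode

var_dump($inner === "\u{1F4A9}"); // bool(true): all four bytes were captured together
var_dump($inner === $innerU);     // bool(true): /u makes no difference here
```

In byte mode the dot consumes the four bytes one at a time, but since the whole run is captured and copied back unchanged, the character survives intact.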

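This is more dangerous, because without /u the dot matches a single byte, which splits the character and leaves invalid UTF-8 behind: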

echo preg_replace('/^(.)/', "0", "💩");
// => 0???
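To make the damage visible, the sketch below dumps the result as hex instead of echoing it, and repeats the call with /u. The variable names are illustrative only.

```php
<?php
$input = "\u{1F4A9}"; // 💩, bytes F0 9F 92 A9

// Without /u the dot consumes only the lead byte F0.
$bytewise = preg_replace('/^(.)/', "0", $input);

// With /u the dot consumes the whole four-byte character.
$unicode = preg_replace('/^(.)/u', "0", $input);

var_dump(bin2hex($bytewise)); // string(8) "309f92a9": "0" plus three orphaned continuation bytes
var_dump($unicode);           // string(1) "0"
```

The three leftover bytes are what typically show up as question marks or replacement characters when the string is printed.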

In general, without knowing how UTF-8 works, it can be tricky to predict which regexps are safe and which are not, so using /u for all text that might contain a character above U+007F is the best practice.

Answered By - Amadan

Answer Checked By - Candace Johnson (PHPFixing Volunteer)

Sunday, November 20, 2022

[FIXED] When do I need u-modifier in PHP regex?
