PHPFixing

Sunday, November 20, 2022

[FIXED] How to add restriction in regex

php, preg-replace, regex

Issue

I have a regex function that replaces the Xth occurrence of a word in a text. I am trying to add a condition: do not replace the word if it is inside an <h1>, <h2> or <h3> tag, or in the alt attribute of an image. Could someone help me edit the function to add this condition, please?

public function str_ireplace_n($search, $replace, $subject, $occurrence)
{
    $search = preg_quote($search);
    return preg_replace("/^((?:(?:.*?$search){" . --$occurrence . "}.*?))$search/i", "$1$replace", $subject);
}

Example:

$text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. <h1>Lorem ipsum dolor sit</h1> Proin libero erat, malesuada eget volutpat vitae, efficitur vitae ipsum. Vivamus et <h2>Lorem ipsum dolor sit</h2> justo non quam laoreet euismod. Ut eget dapibus ligula. <img src="url" alt="Lorem ipsum dolor sit"/> Vestibulum vestibulum.';

// I replace the second Lorem in this text with a link
$text = $this->str_ireplace_n('Lorem', ' <a href="' . $domain . '" alt="">Lorem</a> ', $text, 2); // 2 for the second occurrence

// The result adds a link on the Lorem inside the <h1>, and I want to avoid this.
// I want the regex to do nothing when the keyword is in an h1, an h2 or the alt of an image.

I don't choose which "Lorem" to replace; the occurrence is random. I have to make sure nothing happens when the occurrence is in an <h1>/<h2> or an image alt.

Thanks in advance.


Solution

Personally I would use something like preg_split first:

$string = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. <h1>Lorem ipsum dolor sit</h1> Proin libero erat, malesuada eget volutpat vitae, efficitur vitae ipsum. Vivamus et <h2>Lorem ipsum dolor sit</h2> justo non quam laoreet euismod. Ut eget dapibus ligula. <img src="url" alt="Lorem ipsum dolor sit"/> Vestibulum vestibulum.';

$split = preg_split('/(<[^\/]+(?:\/|<\/[^>]+)>)/', $string, -1, PREG_SPLIT_DELIM_CAPTURE);

This gives you the following array, which is the basic building block we need:

Array
(
    [0] => Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
    [1] => <h1>Lorem ipsum dolor sit</h1>
    [2] =>  Proin libero erat, malesuada eget volutpat vitae, efficitur vitae ipsum. Vivamus et 
    [3] => <h2>Lorem ipsum dolor sit</h2>
    [4] =>  justo non quam laoreet euismod. Ut eget dapibus ligula. 
    [5] => <img src="url" alt="Lorem ipsum dolor sit"/>
    [6] =>  Vestibulum vestibulum.
)

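Now the items inside tags are separated from the rest of the text. We can loop over this array and check whether the first character of each element is <, which tells us whether that element is inside or outside a tag. This works as long as your tags end in </...> or /> and contain no other slash (for example in a src URL), because [^\/]+ cannot match past a /.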

Basically, each HTML tag together with its content becomes a delimiter, which we also capture.
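To make the "captured delimiter" idea concrete, here is a minimal sketch. The sample string is a shortened, hypothetical version of the example text. It relies on the fact that, with a single capture group and PREG_SPLIT_DELIM_CAPTURE, preg_split alternates between plain text (even indexes) and captured tags (odd indexes).

```php
<?php
// Hypothetical sample input, shortened from the example text.
$html = 'Lorem ipsum. <h1>Lorem title</h1> Lorem again. <img src="url" alt="Lorem"/> End.';

// Splitting pattern restricted to h1, h2, h3 and img.
$pattern = '/(<(?:h1|h2|h3|img)[^\/]+(?:\/|<\/[^>]+)>)/';
$parts = preg_split($pattern, $html, -1, PREG_SPLIT_DELIM_CAPTURE);

$labels = [];
foreach ($parts as $i => $part) {
    // With one capture group, even indexes hold text and odd indexes hold the captured tags.
    $labels[] = ($i % 2 === 1 ? 'tag:  ' : 'text: ') . $part;
}

echo implode("\n", $labels), "\n";
```

Checking the index parity is an alternative to testing the first character for <. It may be a little more robust when a plain-text piece happens to start with some other tag such as <p>.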

The point is that a simple regex cannot parse HTML, because HTML is not a regular language. So we have to do some of the work in PHP to tie it all together. A simple regex can still break the problem down and simplify it, as I have done here.
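Even for this limited job the splitting pattern has limits: [^\/]+ cannot cross a slash, so a heading whose text contains a / or an image whose src is a full URL is not recognised. The sketch below compares it with a slightly more tolerant pattern. That second pattern is an assumption for illustration, not part of the original answer, and it is still not an HTML parser (an alt text containing > would break it).

```php
<?php
// Hypothetical input: a slash inside the heading text and inside the src URL.
$html = 'Intro <h2>See a/b</h2> text <img src="https://example.com/a.png" alt="Lorem"> end';

// Pattern used in this answer: [^\/]+ cannot cross a slash.
$strict = '/(<(?:h1|h2|h3|img)[^\/]+(?:\/|<\/[^>]+)>)/';
// Assumed alternative: match a heading up to its closing tag, or a whole <img ...> tag.
$tolerant = '/(<h[1-3]\b[^>]*>.*?<\/h[1-3]>|<img\b[^>]*>)/is';

$strictParts = preg_split($strict, $html, -1, PREG_SPLIT_DELIM_CAPTURE);
$tolerantParts = preg_split($tolerant, $html, -1, PREG_SPLIT_DELIM_CAPTURE);

echo count($strictParts), " part(s) with the original pattern\n";
echo count($tolerantParts), " part(s) with the tolerant pattern\n";
print_r($tolerantParts);
```

With this input the original pattern finds no tag at all and returns the string as a single piece, while the tolerant one isolates both the heading and the image.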

$subject = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. <h1>Lorem ipsum dolor sit</h1> Proin libero erat, malesuada eget volutpat vitae, efficitur vitae ipsum. Vivamus et <h2>Lorem ipsum dolor sit</h2> Lorem justo non quam laoreet euismod. Ut eget dapibus ligula. <img src="url" alt="Lorem ipsum dolor sit"/> Vestibulum vestibulum.';

//word to replace
$search = 'Lorem';
//stuff to replace with
$replace = '<a href="Lorem">foo</a>';
//what match to replace
$occurrence = 2;

function str_ireplace_n($search, $replace, $subject, $occurrence){
    //escape the search term, including the / delimiter used below
    $pattern = '/\b'.preg_quote($search, '/').'\b/i';

    //separate the HTML from the "body" text (-1 means no limit)
    $split = preg_split('/(<(?:h1|h2|h3|img)[^\/]+(?:\/|<\/[^>]+)>)/', $subject, -1, PREG_SPLIT_DELIM_CAPTURE);
    //the number of current matches
    $match = 0;

    foreach($split as &$s){
        //if strpos < is 0 it's the first character - meaning it's part of HTML (we don't want that)
        if(0 === strpos($s,'<')) continue;
        //count every match in this piece of text and replace only the nth one
        //(the replacement is inserted literally, $1 style references are not expanded)
        $s = preg_replace_callback($pattern, function($m) use (&$match, $replace, $occurrence){
            return ++$match == $occurrence ? $replace : $m[0];
        }, $s);
    }
    //break the reference left over from the loop
    unset($s);

    return implode($split);
}

echo str_ireplace_n($search, $replace, $subject, $occurrence);

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. <h1>Lorem ipsum dolor sit</h1> 
 Proin libero erat, malesuada eget volutpat vitae, efficitur vitae ipsum. Vivamus et 
  <h2>Lorem ipsum dolor sit</h2> <a href="Lorem">foo</a> justo non quam laoreet euismod. 
  Ut eget dapibus ligula. <img src="url" alt="Lorem ipsum dolor sit"/> Vestibulum vestibulum.

The replaced part is <a href="Lorem">foo</a>.

I added a few line breaks to the output for readability, and another "Lorem" to the input because there was no second one outside the HTML tags to match. As you can see, nothing inside the HTML tags was modified, and only the second match was changed.

It's not 100% clear exactly what you need (as is often the case with these types of questions), so I have tried to explain how to do it instead of just doing it.
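If the requirement is read the other way (every occurrence counts, including those inside <h1>, <h2>, <h3> or an alt attribute, and nothing should happen when the chosen one lands inside such a tag), the same split can be reused. This is only a sketch under that assumption. The function name str_ireplace_n_guarded is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical variant: counts every occurrence, including those inside protected tags,
// and leaves the text untouched when the nth occurrence falls inside one of them.
function str_ireplace_n_guarded(string $search, string $replace, string $subject, int $occurrence): string
{
    $word = '/\b' . preg_quote($search, '/') . '\b/i';
    $tags = '/(<(?:h1|h2|h3|img)[^\/]+(?:\/|<\/[^>]+)>)/';
    $parts = preg_split($tags, $subject, -1, PREG_SPLIT_DELIM_CAPTURE);
    $seen = 0;

    foreach ($parts as $i => $part) {
        $isTag = ($i % 2 === 1); // odd indexes are the captured tags
        $parts[$i] = preg_replace_callback($word, function (array $m) use (&$seen, $occurrence, $replace, $isTag) {
            ++$seen;
            // Replace only the nth occurrence, and only when it sits in plain text.
            return ($seen === $occurrence && !$isTag) ? $replace : $m[0];
        }, $part);
    }

    return implode('', $parts);
}

$text = 'Lorem one. <h1>Lorem two</h1> Lorem three.';
echo str_ireplace_n_guarded('Lorem', '<a href="#">Lorem</a>', $text, 2), "\n"; // 2nd is inside <h1>
echo str_ireplace_n_guarded('Lorem', '<a href="#">Lorem</a>', $text, 3), "\n"; // 3rd is plain text
```

The first call should return the text unchanged because the second "Lorem" is inside the <h1>. The second call should link only the last "Lorem".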




Answered By - ArtisticPhoenix
Answer Checked By - Mildred Charles (PHPFixing Admin)