Sunday, November 6, 2022

[FIXED] How do I create a regex that continues to match only if there is a comma or " and " after the last one?

November 06, 2022 python, python-re, regex, regex-group, string No comments

Issue

This code extracts a verb and the information that follows it, then creates a .txt file named after the verb and writes the information into it. Example input:

I have to run to win the race

import re, os

regex_patron_01 = r"\s*\¿?(?:have to|haveto|must to|mustto)\s*((?:\w\s*)+)\?"
n = re.search(regex_patron_01, input_text_to_check, re.IGNORECASE)

if n:
    word, = n.groups()
    try:
        word = word.strip()
    except AttributeError:
        print("no verb specified!!!")

    regex_patron_01 = r"\s*((?:\w+)?) \s*((?:\w\s*)+)\s*\??"
    n = re.search(regex_patron_01, word, re.IGNORECASE)
    if n:
        #This will have to be repeated for all the verbs that are present in the sentence.
        verb, order_to_remember = n.groups()
        verb = verb.strip()
        order_to_remember = order_to_remember.strip()

        target_file = target_file + verb + ".txt"
    
        with open(target_file, 'w') as f:
            f.write(order_to_remember)

This creates "run.txt" and writes "to win the race" into it.

Now I also need the regex to handle sentences with more than one verb, for example:

I have to run, jump and hurry to win the race

In that case it should create 3 files, named "run.txt", "jump.txt" and "hurry.txt", and write the line "to win the race" in each of them.

The problem I'm having is how to make it repeat the process whenever a comma (,) or an "and" follows a verb.

Other example:

I have to dance and sing more to be a pop star

This should create 2 files, "dance.txt" and "sing.txt", both containing the line "more to be a pop star".

Solution

I simplified the search and conditions somewhat and did this:

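import re

def fun(x):
    # Group 1: the verb list after "have to"; group 2: the trailing "to ..." clause.
    match = re.search(r"(?<=have to) ([\w\s,]+) (to [\w\s]+)", x)
    if match:
        # Split on commas or the whole word "and", dropping the surrounding spaces.
        for i in re.split(r"\s*(?:,|\band\b)\s*", match[1]):
            if i:
                with open(f'{i}.txt', 'w') as file:
                    file.write(match[2])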

If there is a match, the function creates one or more .txt files named after the captured verbs. If there is no match, it does nothing.

The regex looks for two groups. The first must be preceded by "have to" and may contain words and whitespace separated by commas or "and". The second must start with "to " and may contain only words and whitespace.

match[0] is a whole match
match[1] is the first group
match[2] is the second group

The 'for' loop iterates through the list obtained by separating the first group using comma and 'and' as separators. At each iteration a file with the name from this list is created.
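
To check what the pattern captures without creating any files, the same idea can be sketched as a function that returns the verbs and the clause. The name split_verbs is hypothetical, and the separator pattern (commas or the whole word "and", with surrounding whitespace removed) is an assumption about what counts as a separator.

```python
import re

# Same pattern as in the answer: a verb list after "have to", then a "to ..." clause.
PATTERN = re.compile(r"(?<=have to) ([\w\s,]+) (to [\w\s]+)")

def split_verbs(sentence):
    """Return (verbs, clause), or None when the sentence does not match."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    # Split on commas or the whole word "and", swallowing surrounding whitespace.
    parts = re.split(r"\s*(?:,|\band\b)\s*", match[1])
    verbs = [part for part in parts if part]  # drop empty pieces, e.g. from ", and"
    return verbs, match[2]

print(split_verbs("I have to run, jump and hurry to win the race"))
# (['run', 'jump', 'hurry'], 'to win the race')
print(split_verbs("I have to dance and sing more to be a pop star"))
# (['dance', 'sing more'], 'to be a pop star')
```

In the second sentence "sing more" stays together, because the pattern treats everything before the last " to " as the verb list. Moving "more" into the clause would need an extra rule that this sketch does not attempt.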

Answered By - Иван Балван

Answer Checked By - Robin (PHPFixing Admin)

[FIXED] How can I catch the following groups with a regex?

November 06, 2022 python, python-re, regex, regex-group No comments

Issue

Hello I have the following two strings

txt = '/path/to/photo/file.jpg'
txt = '/path/to/photo/file_crXXX.jpg'

In the second string, XXX stands for a long, variable suffix that carries processing information in the file name.

I want to extract the name 'file' from both paths.

To do this, I tried the following code:

re.search(".*/(.*)\.jpg", txt).group(1)
re.search(".*/(.*)_cr.*", txt).group(1)

But when I try to combine them into one expression with the following code,

re.search(".*/(.*)(_cr.*)|(\.jpg)*", txt).group(1)
re.search(".*/(.*)(\.jpg)|(_cr.*)", txt).group(1)

it doesn't work properly. How can I do this?

Thanks

Solution

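The problem was that you captured a group that does not need to be captured; your .*/(.*)(\.jpg)|(_cr.*) attempt was the closer one. In that attempt the | applies to the whole pattern, so it reads as ".*/(.*)(\.jpg)" or "(_cr.*)" rather than as a choice between the two suffixes. Use this regex to capture only the file name or its prefix: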

([^/]*?)(?:\.jpg|_cr.*)$

import re

paths = ["/path/to/photo/file.jpg", "/path/to/photo/file_crXXX.jpg"]
for path in paths:
    print(re.search(r"([^/]*?)(?:\.jpg|_cr.*)$", path).group(1))
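
As an alternative that avoids a regex, here is a minimal sketch using the standard library's pathlib. The function name base_name is hypothetical, and the sketch assumes that the processing suffix always begins with "_cr", as in the question.

```python
from pathlib import PurePosixPath

def base_name(path):
    """Return the file name without its extension and without any '_cr...' suffix."""
    stem = PurePosixPath(path).stem   # 'file' or 'file_crXXX'
    return stem.partition("_cr")[0]   # text before the first '_cr', or the whole stem

for path in ["/path/to/photo/file.jpg", "/path/to/photo/file_crXXX.jpg"]:
    print(base_name(path))  # prints "file" both times
```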

Answered By - Artyom Vancyan

Answer Checked By - Mildred Charles (PHPFixing Admin)

[FIXED] How to correctly apply a RE for obtaining the last name (of a file or folder) from a given path and print it on Python?

September 18, 2022 for-loop, path, printing, python, python-re No comments

Issue

I have written code that creates a dictionary storing the absolute paths of all folders in the current path as keys and the list of file names in each folder as values. This code would only be applied to paths whose folders contain only image files. Here:

import os
import re
# Main method
the_dictionary_list = {}

for name in os.listdir("."):
    if os.path.isdir(name):
        path = os.path.abspath(name)
        print(f'\u001b[45m{path}\033[0m')
        match = re.match(r'/(?:[^\\])[^\\]*$', path)
        print(match)
        list_of_file_contents = os.listdir(path)
        print(f'\033[46m{list_of_file_contents}')
        the_dictionary_list[path] = list_of_file_contents
        print('\n')
print('\u001b[43mthe_dictionary_list:\033[0m')
print(the_dictionary_list)

However, I want this dictionary to store only the last folder names as keys instead of the absolute paths. I was planning to use the regex /(?:[^\\])[^\\]*$ to obtain the last name (of a file or folder) from a given path, and then add those last names as keys to the dictionary in the for loop.

I wanted to test the code above first to see whether it did what I wanted, but it didn't: the value of the match variable was None in every iteration, which didn't make sense to me. Everything else works fine.

So I would like to know what I'm doing wrong here.

Solution

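I decided to rewrite the code above for the case where it is applied only to the current directory (where this program is located). Since os.listdir(".") already returns bare folder names, the name itself can serve as the dictionary key and no regex is needed.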

import os
# Main method

the_dictionary_list = {}

for subdir in os.listdir("."):
    if os.path.isdir(subdir):
        path = os.path.abspath(subdir)
        print(f'\u001b[45m{path}\033[0m')
        list_of_file_contents = os.listdir(path)
        print(f'\033[46m{list_of_file_contents}')
        the_dictionary_list[subdir] = list_of_file_contents
        print('\n')
print('\033[1;37;40mThe dictionary list:\033[0m')
for subdir in the_dictionary_list:
    print('\u001b[43m'+subdir+'\033[0m')
    for archivo in the_dictionary_list[subdir]:
        print("    ", archivo)
print('\n')
print(the_dictionary_list)

This is useful when the user wants to run the program by double-clicking it in a specific location (my personal case).
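
On why match was None in the question: re.match only tries to match at the very start of the string, and the pattern begins with a forward slash. An absolute path that starts with a drive letter and uses backslashes (an assumption about the asker's platform) can therefore never match. A minimal sketch of a regex that does find the last component follows; the function name last_component and the sample paths are hypothetical.

```python
import re

def last_component(path):
    """Return the last folder or file name of a path, for either slash style."""
    # re.search scans the whole string; re.match would only try position 0.
    found = re.search(r"[^\\/]+$", path.rstrip("\\/"))
    return found.group() if found else None

print(last_component(r"C:\Users\me\photos"))  # hypothetical Windows path -> photos
print(last_component("/home/me/photos/"))     # hypothetical POSIX path   -> photos
```

Without a regex, os.path.basename(path) returns the same last component for the separator of the platform the script runs on.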

Answered By - NoahVerner

Answer Checked By - Terry (PHPFixing Volunteer)

[FIXED] How to show only the first 20 entries

August 15, 2022 output, python, python-3.x, python-re No comments

Issue

The Python code here gets me the output I want. However, I need help limiting the result to the first 20 lines.

An input example is shown below:

>gi|170079688|ref|YP_001729008.1| bifunctional riboflavin kinase/FMN adenylyltransferase [Escherichia coli str. K-12 substr. DH10B]
MKLIRGIHNLSQAPQEGCVLTIGNFDGVHRGHRALLQGLQEEGRKRNLPVMVMLFEPQPLELFATDKAPA
RLTRLREKLRYLAECGVDYVLCVRFDRRFAALTAQNFISDLLVKHLRVKFLAVGDDFRFGAGREGDFLLL
QKAGMEYGFDITSTQTFCEGGVRISSTAVRQALADDNLALAESLLGHPFAISGRVVHGDELGRTIGFPTA
NVPLRRQVSPVKGVYAVEVLGLGEKPLPGVANIGTRPTVAGIRQQLEVHLLDVAMDLYGRHIQVVLRKKI
RNEQRFASLDELKAQIARDELTAREFFGLTKPA
>gi|170079689|ref|YP_001729009.1| isoleucyl-tRNA synthetase [Escherichia coli str. K-12 substr. DH10B]
MSDYKSTLNLPETGFPMRGDLAKREPGMLARWTDDDLYGIIRAAKKGKKTFILHDGPPYANGSIHIGHSV
NKILKDIIVKSKGLSGYDSPYVPGWDCHGLPIELKVEQEYGKPGEKFTAAEFRAKCREYAATQVDGQRKD
FIRLGVLGDWSHPYLTMDFKTEANIIRALGKIIGNGHLHKGAKPVHWCVDCRSALAEAEVEYYDKTSPSI
DVAFQAVDQDALKAKFAVSNVNGPISLVIWTTTPWTLPANRAISIAPDFDYALVQIDGQAVILAKDLVES
VMQRIGVTDYTILGTVKGAELELLRFTHPFMGFDVPAILGDHVTLDAGTGAVHTAPGHGPDDYVIGQKYG
LETANPVGPDGTYLPGTYPTLDGVNVFKANDIVVALLQEKGALLHVEKMQHSYPCCWRHKTPIIFRATPQ
WFVSMDQKGLRAQSLKEIKGVQWIPDWGQARIESMVANRPDWCISRQRTWGVPMSLFVHKDTEELHPRTL
ELMEEVAKRVEVDGIQAWWDLDAKEILGDEADQYVKVPDTLDVWFDSGSTHSSVVDVRPEFAGHAADMYL
EGSDQHRGWFMSSLMISTAMKGKAPYRQVLTHGFTVDGQGRKMSKSIGNTVSPQDVMNKLGADILRLWVA
STDYTGEMAVSDEILKRAADSYRRIRNTARFLLANLNGFDPAKDMVKPEEMVVLDRWAVGCAKAAQEDIL
KAYEAYDFHEVVQRLMRFCSVEMGSFYLDIIKDRQYTAKADSVARRSCQTALYHIAEALVRWMAPILSFT
ADEVWGYLPGEREKYVFTGEWYEGLFGLADSEAMNDAFWDELLKVRGEVNKVIEQARADKKVGGSLEAAV
TLYAEPELSAKLTALGDELRFVLLTSGATVADYNDAPADAQQSEVLKGLKVALSKAEGEKCPRCWHYTQD
VGKVAEHAEICGRCVSNVAGDGEKRKFA
>gi|170079690|ref|YP_001729010.1| lipoprotein signal peptidase [Escherichia coli str. K-12 substr. DH10B]
MSQSICSTGLRWLWLVVVVLIIDLGSKYLILQNFALGDTVPLFPSLNLHYARNYGAAFSFLADSGGWQRW
FFAGIAIGISVILAVMMYRSKATQKLNNIAYALIIGGALGNLFDRLWHGFVVDMIDFYVGDWHFATFNLA
DTAICVGAALIVLEGFLPSRAKKQ

import re

id = None
header = None
seq = ''

a_file = open('e_coli.faa')

for line in a_file:
    m = re.match(">(\S+)\s+(.+)", line.rstrip())
    if m:
        if id is not None:

            print("{0} length:{1} {2}".format(id, len(seq),header))

        id, header = m.groups()
        seq = ''
    else:
        seq += line.rstrip()

Solution

At the very top, add c = 0. Then change

        print("{0} length:{1} {2}".format(id, len(seq),header))

to

        if c < 20:
            print("{0} length:{1} {2}".format(id, len(seq),header))
            c += 1

Result with a few adjustments:

import re

id = None
header = None
seq = ''
c = 0  # number of entries printed so far

with open('e_coli.faa') as a_file:
    for line in a_file:
        m = re.match(r">(\S+)\s+(.+)", line.rstrip())
        if m:
            if id and c < 20:
                print("{0} length:{1} {2}".format(id, len(seq),header))
                c += 1

            id, header = m.groups()
            seq = ''
        else:
            seq += line.rstrip()

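To read only the first 20 lines of the file, you can slice the result of readlines(). Note that this limits the number of input lines, not the number of printed entries.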

Instead of:

for line in a_file:

use:

for line in a_file.readlines()[:20]:
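
If the goal is the first 20 entries rather than the first 20 lines, one option is to turn the parsing loop into a generator and cut it off with itertools.islice. This is a minimal sketch, not the answerer's code: the name iter_records and the tiny sample input are made up for illustration, and unlike the loop above it also emits the last record of the file.

```python
import re
from itertools import islice

def iter_records(lines):
    """Yield (id, header, sequence) for each '>' record in FASTA-style lines."""
    rec_id, header, chunks = None, None, []
    for line in lines:
        line = line.rstrip()
        m = re.match(r">(\S+)\s+(.+)", line)
        if m:
            if rec_id is not None:
                yield rec_id, header, "".join(chunks)
            rec_id, header = m.groups()
            chunks = []
        else:
            chunks.append(line)
    if rec_id is not None:  # the last record has no following header line
        yield rec_id, header, "".join(chunks)

# Made-up sample standing in for the contents of e_coli.faa.
sample = [">id1 first protein\n", "MKL\n", "IRG\n", ">id2 second protein\n", "MSD\n"]
for rec_id, header, seq in islice(iter_records(sample), 20):
    print("{0} length:{1} {2}".format(rec_id, len(seq), header))
```

With a real file, the open file object can be passed to iter_records in place of sample; islice stops reading once 20 records have been produced.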

Answered By - Ann Zen

Answer Checked By - Gilberto Lyons (PHPFixing Admin)

[FIXED] Why is my Regex search yielding more than expected and why isn't my loop removing certain duplicates?

May 13, 2022 append, duplicates, python, python-re, regex No comments

Issue

I'm working on a program that parses weather information. In this part of my code, I am trying to re-organise the results in order of time before continuing to append more items later on.

The time in these lines is usually the first 4 digits of any line (first 2 digits are the day and the others are the hour). The exception to this is the line that starts with 11010KT, this line is always assumed to be the first line in any weather report, and those numbers are a wind vector and NOT a time.

You will see that I am removing any line that contains TEMPO, INTER or PROB at the start of this example, because I want lines containing these words to be added to the end of the other restructured list. These lines can be thought of as a separate list, which I want organised by time in the same way as the other items.

I am trying to use Regex to pull the times from the lines that remain after removing the TEMPO INTER and PROB lines and then sort them, then once sorted, use regex again to find that line in full and create a restructured list. Once that list has been completed, I am sorting the TEMPO INTER and PROB list and then appending that to the newly completed list I had just made.

I have also tried a for loop that will remove any duplicate lines added, but this seems to only remove one duplicate of the TEMPO line???

Can someone please help me figure this out? I am kind of new to this, thank you...

This ideally should come back looking like this:

ETA IS 0230 which is 1430 local

11010KT 5000 MODERATE DRIZZLE BKN004
FM050200 12012KT 9999 LIGHT DRIZZLE BKN008 
TEMPO 0501/0502 2000 MODERATE DRIZZLE BKN002
INTER 0502/0506 4000 SHOWERS OF MODERATE RAIN BKN008

Instead of this, I am getting repeats of the line that starts with FM050200 and then repeats of the line starting with TEMPO. It doesn't find the line starting with INTER either...

I have made a minimal reproducible example for anyone to try and help me. I will include that here:

import re

total_print = ['\nFM050200 12012KT 9999 LIGHT DRIZZLE BKN008', '\n11010KT 5000 MODERATE DRIZZLE BKN004', '\nINTER 0502/0506 4000 SHOWERS OF MODERATE RAIN BKN008', '\nTEMPO 0501/0502 2000 MODERATE DRIZZLE BKN002']

removed_lines = []
for a in total_print:  # finding and removing lines with reference to TEMPO INTER PROB
    if 'TEMPO' in a:
        total_print.remove(a)
        removed_lines.append(a)
for b in total_print:
    if 'INTER' in b:
        total_print.remove(b)
        removed_lines.append(b)
for f in total_print:
    if 'PROB' in f:
        total_print.remove(f)
        removed_lines.append(f)

list_time_in_line = []
for line in total_print: # finding the times in the remaining lines
    time_in_line = re.search(r'\d\d\d\d', line)
    list_time_in_line.append(time_in_line.group())
sorted_time_list = sorted(list_time_in_line)

removed_time_in_line = []
for g in removed_lines:  # finding the times in the lines that were originally removed
    removed_times = re.search(r'\d\d\d\d', g)
    removed_time_in_line.append(removed_times.group())
sorted_removed_time_list = sorted(removed_time_in_line)


final = []
final.append('ETA IS 1230 which is 1430 local\n')  # appending the time display
search_for_first_line = re.search(r'[\n]\d\d\d\d\dKT', ' '.join(total_print))  # searching for line that has wind vector instead of time
search_for_first_line = search_for_first_line.group()

if search_for_first_line:  # adding wind vector line so that it's the first line listed in the group
    search_for_first_line = re.search(r'%s.*' % search_for_first_line, ' '.join(total_print)).group()
    final.append('\n' + search_for_first_line)

print(sorted_time_list)  # the list of possible times found (the second item in list is the wind vector and not a time)
d = 0
for c in sorted_time_list:  # finding the whole line for the corresponding time
    print(sorted_time_list[d])
    search_for_whole_line = re.search(r'.*\w+\s*%s.*' % sorted_time_list[d], ' '.join(total_print))
    print(search_for_whole_line.group())  # it is doubling up on the 0502 time???????
    d += 1
    final.append('\n' + str(search_for_whole_line.group()))

h = 0
for i in sorted_removed_time_list:  # finding the whole line for the corresponding times from the previously removed items
    whole_line_in_removed_srch = re.search(r'.*%s.*' % sorted_removed_time_list[h], ' '.join(removed_lines))
    h += 1
    final.append('\n' + str(whole_line_in_removed_srch.group()))  # appending them

l_new = []
for item in final:  # this doesn't seem to properly remove duplicates ?????
    if item not in l_new:
        l_new.append(item)
total_print = l_new

print(' '.join(total_print))

EDIT:

I had asked this recently and got an excellent answer to my problem from @diggusbickus. I have now hit a new problem with the sorting in the answer.

Because my original question had only one type of weather line (beginning with the letters 'FM') in my data['other'], the lambda with the split() was only looking at the first item of the line [0] for the time.

data['other'] = sorted(data['other'], key=lambda x: x.split()[0])

Which is where the time is located (in previous question it was FM050200 where 05 is the day and 0200 is the time). That works very well for when there are lines beginning with FM, but I have realised that occasionally lines like this exist:

'\nBECMG 0519/0520 27007KT 9999 SHOWERS OF LIGHT RAIN SCT020 BKN030'

The time in this style of line is the FIRST 4 digits located at index [1], in a 4-digit format instead of the 6-digit format used in FM050200. The time in this new line is 05 as the day and 19 as the hour (so 1900).

I need this style of line to be grouped with the FM lines, but the problem is that they don't sort together. I am trying to find a way to sort the lines by time regardless of whether the time is at index [0] in 6-digit format or at index [1] in 4-digit format.

I will include a new example with a couple of small changes to the originally answered question. This new question uses different data in the total_print variable. This is a working example.

I essentially need the lines to be sorted by the FIRST 4 digits of any line, and the results should look like this:

ETA IS 0230 which is 1430 local

FM131200 20010KT 9999 SHOWERS OF LIGHT RAIN SCT006 BKN010 
FM131400 20010KT 9999 SHOWERS OF LIGHT RAIN SCT006 BKN010 
BECMG 1315/1317 27007KT 9999 SHOWERS OF LIGHT RAIN SCT020 BKN030 
TEMPO 1312/1320 4000 SHOWERS OF MODERATE RAIN BKN007

NB. The TEMPO line is supposed to stay at the end, so don't worry about that one.

Here is the example, thank you so much to anyone who helps.

import re

total_print = ['\nBECMG 1315/1317 27007KT 9999 SHOWERS OF LIGHT RAIN SCT020 BKN030', '\nFM131200 20010KT 9999 SHOWERS OF LIGHT RAIN SCT006 BKN010', '\nFM131400 20010KT 9999 SHOWERS OF LIGHT RAIN SCT006 BKN010','\nTEMPO 1312/1320 4000 SHOWERS OF MODERATE RAIN BKN007']
data = {
    'windvector': [], # if it is the first line of the TAF
    'other': [], # anything with FM or BECMG
    'tip': []  # tempo/inter/prob
}

wind_vector = re.compile(r'^\s\d{5}KT')
for line in total_print:
    if 'TEMPO' in line \
            or 'INTER' in line \
            or 'PROB' in line:
        key = 'tip'
    elif re.match(wind_vector, line):
        key = 'windvector'
    else:
        key = 'other'
    data[key].append(line)

final = []
data['other'] = sorted(data['other'], key=lambda x: x.split()[0])
data['tip'] = sorted(data['tip'], key=lambda x: x.split()[1])


final.append('ETA IS 0230 which is 1430 local\n')

for lst in data.values():
    for line in lst:
        final.append('\n' + line[1:])  # get rid of newline

print(' '.join(final))

Solution

Just sort your data into a dict. You keep creating lists and removing items from them, which is too confusing.

Your regex to catch the wind vector also catches 12012KT, which is why that line was repeated. The ^ ensures the pattern matches only at the beginning of the line.

import re

total_print = ['\nFM050200 12012KT 9999 LIGHT DRIZZLE BKN008', '\n11010KT 5000 MODERATE DRIZZLE BKN004', '\nINTER 0502/0506 4000 SHOWERS OF MODERATE RAIN BKN008', '\nTEMPO 0501/0502 2000 MODERATE DRIZZLE BKN002']

data = {
    'windvector': [],
    'other': [],
    'tip': [] #tempo/inter/prob
}

wind_vector = re.compile(r'^\s\d{5}KT')
for line in total_print:
    if 'TEMPO' in line \
            or 'INTER' in line \
            or 'PROB' in line:
        key='tip'
    elif re.match(wind_vector, line):
        key='windvector'
    else:
        key='other'
    data[key].append(line)
        
data['tip']=sorted(data['tip'], key=lambda x: x.split()[1])
print('ETA IS 0230 which is 1430 local')
print()
for lst in data.values():
    for line in lst:
        print(line[1:]) #get rid of newline
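
For the follow-up in the question's edit (FM lines with a 6-digit time at index [0] and BECMG lines with a 4-digit time at index [1]), one possible approach is a sort key that takes the first run of four digits wherever it appears. This is a sketch and not part of the original answer; the name time_key is hypothetical, and it assumes the first four digits of every line in data['other'] are day plus hour.

```python
import re

def time_key(line):
    """Sort key: the first run of four digits in the line (day + hour)."""
    found = re.search(r"\d{4}", line)
    return found.group() if found else ""

other = [
    "\nBECMG 1315/1317 27007KT 9999 SHOWERS OF LIGHT RAIN SCT020 BKN030",
    "\nFM131200 20010KT 9999 SHOWERS OF LIGHT RAIN SCT006 BKN010",
    "\nFM131400 20010KT 9999 SHOWERS OF LIGHT RAIN SCT006 BKN010",
]
for line in sorted(other, key=time_key):
    print(line.strip())  # FM131200 ..., then FM131400 ..., then BECMG 1315/1317 ...
```

The key should only be applied to the 'other' and 'tip' groups: on the wind vector line the first four digits (1101 in 11010KT) are not a time.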

Answered By - diggusbickus

Answer Checked By - Katrina (PHPFixing Volunteer)
