PHPFixing

Wednesday, November 2, 2022

[FIXED] How to read the file without encoding and extract desired urls with python3?

Tags: encoding, file, python-3.x, utf-8

Issue

Environment: Python 3.
I have many files; some are encoded in GBK, others in UTF-8. I want to extract all the .jpg URLs from them with a regular expression.

For example, s.html is encoded in GBK. Reading it with the default encoding fails:

tree = open("/tmp/s.html","r").read()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 135: invalid start byte

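Specifying the encoding explicitly works:

import re

tree = open("/tmp/s.html", "r", encoding="gbk").read()
pat = r"http://.+\.jpg"  # raw string, so `\.` is not treated as a string escape
result = re.findall(pat, tree)
print(result)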

['http://somesite/2017/06/0_56.jpg']

Opening every file with an explicitly specified encoding is a lot of work. I want a smarter way to extract the .jpg URLs from all the files.


Solution

If they have mixed encoding, you could try one encoding and fall back to another:

# first open as binary
with open(..., 'rb') as f:
    f_contents = f.read()
    try:
        contents = f_contents.decode('UTF-8')
    except UnicodeDecodeError:
        contents = f_contents.decode('gbk')
    ...
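The fallback above can be wrapped in a small helper so it is easy to apply to many files. This is only a sketch: the names `ENCODINGS`, `decode_with_fallback`, and `read_text_any` are made up for illustration, and the final "replace undecodable bytes" step is an assumption about what you want when neither encoding fits.

```python
from pathlib import Path

# Candidate encodings, tried in order. UTF-8 goes first because it is strict:
# GBK-encoded Chinese text is very unlikely to also be valid UTF-8.
ENCODINGS = ("utf-8", "gbk")


def decode_with_fallback(raw: bytes, encodings=ENCODINGS) -> str:
    """Return raw decoded with the first encoding that succeeds."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort (an assumption): keep going and replace undecodable bytes.
    return raw.decode(encodings[0], errors="replace")


def read_text_any(path) -> str:
    """Read a file as bytes and decode it with the fallback chain."""
    return decode_with_fallback(Path(path).read_bytes())
```

With this in place, something like `re.findall(pat, read_text_any(path))` could be run in a loop over every file without knowing each file's encoding in advance.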

If they are html files, you may also be able to find the encoding tag, or search them as binary with a binary regex:

import re

contents = open(..., 'rb').read()
# raw bytes pattern; note that `.+` is greedy, so several URLs on one line come back as a single match
regex = re.compile(rb'http://.+\.jpg')
result = regex.findall(contents)
# now you'll probably want to `.decode()` each of the urls, but you should be able to do that pretty trivially with even the `ASCII` codec
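For the "find the encoding tag" idea, a rough sketch could look like the following. It assumes the page declares its charset in a `<meta>` tag near the top of the file; the names `META_CHARSET` and `sniff_charset` and the 4096-byte window are illustrative choices, not anything prescribed by HTML or Python.

```python
import re

# Matches charset=... inside a <meta> tag, e.g. <meta charset="gbk"> or
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def sniff_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset declared in the first 4096 bytes, or default."""
    match = META_CHARSET.search(raw[:4096])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()
```

The result can then be passed to `raw.decode(sniff_charset(raw), errors="replace")`. The declared charset may be wrong, or may be a name Python does not recognise (which raises `LookupError`), so this is best combined with a try/except fallback rather than trusted blindly.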

Though now that I think of it, you probably don't really want to use a regex to capture the links, since you'd then have to deal with HTML entities (such as &amp;). You may do better with something like pyquery.

Here's a quick example using pyquery

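import pyquery

contents = open(..., 'rb').read()
pq = pyquery.PyQuery(contents)
images = pq.find('img')
for img in images:
    # iterating yields raw lxml elements, so re-wrap each one
    img = pyquery.PyQuery(img)
    src = img.attr('src')  # None when the <img> has no src attribute
    if src and src.endswith('.jpg'):
        print(src)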

I'm not at a computer with these libraries installed, so the code samples above are untested and your mileage may vary.
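If installing pyquery is not an option, the same parser-based idea can be sketched with the standard library's `html.parser` instead. This swaps in a different tool from the one suggested above; the class and function names (`JpgImageCollector`, `extract_jpg_urls`) are hypothetical, and the case-insensitive `.jpg` check is an added assumption.

```python
from html.parser import HTMLParser


class JpgImageCollector(HTMLParser):
    """Collect the src of every <img> tag whose URL ends in .jpg."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        # attrs is a list of (name, value) pairs; HTMLParser has already
        # unescaped entities such as &amp; in the values.
        src = dict(attrs).get("src")
        # Assumption: treat .JPG the same as .jpg.
        if src and src.lower().endswith(".jpg"):
            self.urls.append(src)


def extract_jpg_urls(html_text: str) -> list:
    """Return the .jpg image URLs found in already-decoded HTML text."""
    collector = JpgImageCollector()
    collector.feed(html_text)
    collector.close()
    return collector.urls
```

This expects text that has already been decoded, for example with the UTF-8 then GBK fallback shown earlier in the answer.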



Answered By - anthony sottile
Answer Checked By - Marilyn (PHPFixing Volunteer)