
[FIXED] How to skip to the next link if a site can't be reached with BeautifulSoup?

August 04, 2022 · beautifulsoup, exception, python

Issue

I'm currently coding a Python project that needs to do the following:

- The user inputs multiple links to different sites.

- The script scrapes information from these sites and writes the output to a .txt file.

The problem I have is that if a site can't be reached (for example, a random link such as oflexertzue.com), the whole script stops and I have to restart it.

This is the error message I get when a site can't be reached ("getaddrinfo failed" means the hostname could not be resolved by DNS):

Failed to establish a new connection: [Errno 11001] getaddrinfo failed

I was trying to find a way to skip to the next link, or to build in an exception handler and write 'exception' to the text file. I have also tried using try/except, but I had no luck with it.
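For reference, the per-URL try/except the question is aiming for can be sketched like this with requests; the example URLs are placeholders, and requests.exceptions.RequestException is the library's base class for connection failures, timeouts and the like (it is not taken from the question's code):

import requests

for url in ['http://example.com', 'http://oflexertzue.com']:
    try:
        # DNS failures, refused connections and timeouts all raise a
        # subclass of requests.exceptions.RequestException
        website = requests.get(url, timeout=10)
    except requests.exceptions.RequestException as ex:
        print(f"skipping {url}: {ex}")
        continue
    print(f"{url} returned {len(website.content)} bytes")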

This is the code I currently have for my script:

from time import sleep
import requests
from bs4 import BeautifulSoup

http = 'http://'

input_1 = input("Link: ").split(',')      # comma-separated hostnames
link = [http + site for site in input_1]  # prefix each with http://

open("output.txt", 'w').close()

for url in link:
    sleep(1)

    website = requests.get(url)  # raises a ConnectionError and stops the script for unreachable hosts
    results = BeautifulSoup(website.content, 'html.parser')
    all_div = results.find_all("div", class_="rte", limit = 1)

    #[information I want to scrape from a site]
    #[...]

    file = open("output.txt", 'a', encoding="utf-8")
    file.write("\n")
    file.write('+++++++++' + ' ' + url + ' ' + '+++++++++')
    file.write(output)  # 'output' is built in the omitted scraping code above
    file.write("\n")
    file.close()

Solution

Simply put, let a context manager take care of the file I/O, and place the try block inside the loop so that one failing URL does not stop the rest. I also added a helper foo(), where you can add further operations on all_div:

from time import sleep

import requests
from bs4 import BeautifulSoup


def foo(x):
    # join the text of every matched tag, one per line
    return ''.join([f"{s.get_text()}\n" for s in x])


def scrape(input_links) -> None:
    # the context manager closes the file even if something goes wrong
    with open("output.txt", 'a', encoding="utf-8") as file:
        for url in input_links:
            try:
                sleep(1)
                website = requests.get(url)
                results = BeautifulSoup(website.content, 'html.parser')
                all_div = results.find_all("div", class_="rte", limit=1)
                output = foo(all_div)
            except Exception as ex:
                # an unreachable site ends up here; log it and move on
                file.write(f"\n>>>>>>>>>>>> {ex}\n")
            else:
                file.write(f"\n+++++++++ {url} +++++++++\n{output}\n")


scrape(link)  # 'link' is the list built from user input in the question's code
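One possible refinement, not part of the original answer: catch requests.exceptions.RequestException instead of a bare Exception, and pass a timeout, so that network failures are logged while genuine programming errors still surface. A sketch reusing foo() and the imports above (scrape_strict is a hypothetical name):

def scrape_strict(input_links) -> None:
    # hypothetical variant of scrape() that only swallows network/HTTP errors
    with open("output.txt", 'a', encoding="utf-8") as file:
        for url in input_links:
            sleep(1)
            try:
                website = requests.get(url, timeout=10)
            except requests.exceptions.RequestException as ex:
                file.write(f"\n>>>>>>>>>>>> {ex}\n")
                continue
            results = BeautifulSoup(website.content, 'html.parser')
            output = foo(results.find_all("div", class_="rte", limit=1))
            file.write(f"\n+++++++++ {url} +++++++++\n{output}\n")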


Answered By - Jonathan Ciapetti
Answer Checked By - Timothy Miller (PHPFixing Admin)