PHPFixing

Sunday, July 31, 2022

[FIXED] How to scrape multiple pages of a site by following pagination with BeautifulSoup and requests?

Tags: beautifulsoup, pagination, python

Issue

I created a scraper with BeautifulSoup and requests that scrapes Ask.com search results for the keywords entered by the user. For now, the scraper only handles a single page of search results. Here is the basic code of my scraper:


def search(request):
    if request.method == 'POST':
        search = request.POST['search']
        url = 'https://www.ask.com/web?q='+search
        res = requests.get(url)
        soup = bs(res.text, 'lxml')

        result_listings = soup.find_all('div', {'class': 'PartialSearchResults-item'})

        final_result = []

        for result in result_listings:
            result_title = result.find(class_='PartialSearchResults-item-title').text
            result_url = result.find('a').get('href')
            result_desc = result.find(class_='PartialSearchResults-item-abstract').text

           
            final_result.append((result_title, result_url, result_desc))

        context = {
            'final_result': final_result
        }

        


I would like BeautifulSoup to also scrape the other 5 pages of search results by following the pagination, so I modified my code like this:



def search(request):
    if request.method == 'POST':
        search = request.POST['search']
        url = 'https://www.ask.com/web?q='+search
        res = requests.get(url)
        soup = bs(res.text, 'lxml')
       

        result_listings = soup.find_all('div', {'class': 'PartialSearchResults-item'})

        final_result = []

        for result in result_listings:
            while True:
                result_title = result.find(class_='PartialSearchResults-item-title').text
                result_url = result.find('a').get('href')
                result_desc = result.find(class_='PartialSearchResults-item-abstract').text

                result_nextpage = result.find('a').get('PartialWebPagination-next')
                if result_nextpage.find_all('div', {'class': 'PartialSearchResults-item'}):
                    url = 'https://www.ask.com/web?q='+ search + result.find('a').get('PartialWebPagination-next')
                    return url
                else :
                    final_result.append((result_title, result_url, result_desc))


           
                

        context = {
            'final_result': final_result
        }

       

When I start my server with python manage.py runserver and enter keywords in the search bar, the page keeps loading indefinitely instead of showing the scraping results. I do not know where my error lies, so I am asking more experienced members of the community for help. Inspired by this question, I also modified the url variable:

url = "https://www.ask.com/search?q=" + search+ "&start=" + str((page - 1) * 5)

When I ran it, I got the following error: name 'page' is not defined. Any help would be appreciated. Thank you.


Solution

If your scraper works for a single page, a small change will make it work for the following pages too. Ask.com accepts a page number in the URL, so request each page in turn by changing that number.
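To illustrate just the URL change, here is a minimal sketch that builds the address of each results page. The helper name `build_page_url` is hypothetical, and the `qo` and `page` parameters are the ones used in this answer; whether Ask.com still accepts them is an assumption. Using `urlencode` also escapes spaces and special characters in the keywords, which plain string concatenation does not.

```python
from urllib.parse import urlencode

# Base search URL from the question. The "qo" and "page" parameters come
# from this answer; that the site still honours them is an assumption.
BASE_URL = "https://www.ask.com/web"


def build_page_url(keywords: str, page_num: int) -> str:
    """Return the search URL for one results page (hypothetical helper)."""
    query = urlencode({"q": keywords, "qo": "pagination", "page": page_num})
    return f"{BASE_URL}?{query}"


# One URL per results page, pages 1 to 3.
page_urls = [build_page_url("python tips", n) for n in range(1, 4)]
```

The same idea, applied inside the Django view from the question: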

import requests
from bs4 import BeautifulSoup as bs

def search(request):
    if request.method == 'POST':
        search = request.POST['search']
        max_pages_to_scrap = 5
        final_result = []
        # Request each results page in turn; the page is selected by the "page" parameter.
        for page_num in range(1, max_pages_to_scrap+1):
            # Passing params lets requests URL-encode the search keywords.
            params = {'q': search, 'qo': 'pagination', 'page': page_num}
            res = requests.get('https://www.ask.com/web', params=params)
            soup = bs(res.text, 'lxml')
            result_listings = soup.find_all('div', {'class': 'PartialSearchResults-item'})

            # Extract title, link and description from every result on this page.
            for result in result_listings:
                result_title = result.find(class_='PartialSearchResults-item-title').text
                result_url = result.find('a').get('href')
                result_desc = result.find(class_='PartialSearchResults-item-abstract').text

                final_result.append((result_title, result_url, result_desc))

        # Results from all pages end up in a single list.
        context = {'final_result': final_result}
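The loop above always requests all five pages. As a hedged variation, the sketch below stops as soon as a page yields no results, which avoids pointless requests when a search has fewer pages. It is not part of the original answer: `scrape_pages`, `fetch_page` and `parse_results` are hypothetical names. In the real view, `fetch_page` would wrap `requests.get(...).text` and `parse_results` would wrap the BeautifulSoup extraction shown above. Here they are replaced by stand-ins so the example runs without network access.

```python
from typing import Callable, List, Tuple

Result = Tuple[str, str, str]  # (title, url, description)


def scrape_pages(
    keywords: str,
    fetch_page: Callable[[str, int], str],
    parse_results: Callable[[str], List[Result]],
    max_pages: int = 5,
) -> List[Result]:
    """Collect results page by page, stopping early on an empty page."""
    final_result: List[Result] = []
    for page_num in range(1, max_pages + 1):
        html = fetch_page(keywords, page_num)
        page_results = parse_results(html)
        if not page_results:
            # No listings on this page: assume there are no further pages.
            break
        final_result.extend(page_results)
    return final_result


# Stand-ins for the network call and the HTML parsing (illustrative data only).
FAKE_PAGES = {1: "a,b", 2: "c", 3: ""}
requested_pages: List[int] = []


def fake_fetch(keywords: str, page_num: int) -> str:
    requested_pages.append(page_num)
    return FAKE_PAGES.get(page_num, "")


def fake_parse(html: str) -> List[Result]:
    return [(name, f"https://example.com/{name}", "") for name in html.split(",") if name]


demo = scrape_pages("anything", fake_fetch, fake_parse)
```

Because the fetching and parsing steps are passed in as functions, the paging logic can be checked on its own before pointing it at the live site.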


Answered By - Prashant Maurya
Answer Checked By - Dawn Plyler (PHPFixing Volunteer)