PHPFixing

Wednesday, November 9, 2022

[FIXED] How to get a <strong> tag as the title and the elements that follow it as its description using Beautiful Soup in Python

beautifulsoup, html, python

Issue

Given the HTML input below:

example = """<strong>First Title</strong><p>Content of first title</p><p>Content of first title</p><strong>Second title</strong><p>Content of second title</p></strong>"""

the output should be:

{'First Title': '<p>Content of first title</p> <p>Content of first title</p>', 'Second title': '<p>Content of second title</p>'}

and the code below produces exactly that:

from bs4 import BeautifulSoup, NavigableString, Tag

soup = BeautifulSoup(example, 'html.parser')

finalOutput = {}
for header in soup.find_all('strong'):
    title = header.get_text().strip()
    nextNode = header
    content = []
    while True:
        nextNode = nextNode.next_sibling

        if nextNode is None:
            # Ran out of siblings: store what was collected for this title.
            finalOutput[title] = " ".join(content)
            break
        elif isinstance(nextNode, NavigableString):
            # Keep bare text between tags, skipping whitespace-only strings.
            if nextNode.strip():
                content.append(nextNode.strip())
        elif isinstance(nextNode, Tag):
            if nextNode.name == "strong":
                # The next title starts here, so this section is complete.
                finalOutput[title] = " ".join(content)
                break
            content.append(str(nextNode))

print(finalOutput)

The problem is that the real HTML wraps each <strong> in a <p>, as in <p><strong>...</strong></p>, and the Python code above does not work for input like this:

example = """<p><strong>First Title</strong></p><p>Content of first title</p><p>Content of first title</p><p><strong>Second title</strong></p><p>Content of second title</p></strong>"""

So I want output like the one below: the text inside each <strong> should be the key, and the value should be the content that comes before the next <strong> tag.

Expected Output:

{'First Title': '<p>Content of first title</p> <p>Content of first title</p>', 'Second title': '<p>Content of second title</p>'}

Solution

You need to select the <strong> tags that sit inside a <p> and start walking from the parent <p> rather than from the <strong> itself:

for header in soup.select('p strong'):
    title = header.get_text().strip()
    nextNode = header.parent
    content = []
    ...
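
As a quick check of why the walk starts from header.parent, the sketch below parses a shortened version of the wrapped input (the variable name html is only for illustration) and compares the siblings of the <strong> with those of its parent <p>. It assumes the html.parser backend used in the question.

```python
from bs4 import BeautifulSoup

# Wrapped input from the question, shortened to a single section.
html = "<p><strong>First Title</strong></p><p>Content of first title</p>"
soup = BeautifulSoup(html, "html.parser")

header = soup.find("strong")

# The <strong> is the only child of its <p>, so it has no siblings to walk.
print(header.next_sibling)         # None

# The wrapping <p> does have a following sibling: the first content paragraph.
print(header.parent.next_sibling)  # <p>Content of first title</p>
```

This is why the original loop, which begins at the <strong>, stops immediately and collects nothing for each title.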

In the inner loop, check whether the first child of nextNode is a <strong>:

...
elif isinstance(nextNode, Tag):
    if next(nextNode.children).name == "strong":
        finalOutput[title] = " ".join(content)
        break
...
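
Putting the two fragments together, a complete version might look like the sketch below. It is one possible assembly, not the answerer's exact code: the function name sections_by_strong is hypothetical, the while loop is replaced by iteration over next_siblings, and the stop condition uses find("strong", recursive=False) instead of next(nextNode.children) so that an empty tag does not raise StopIteration. That substitution also stops on a <strong> that is a direct child but not the first child, which is an assumption about the intended behavior.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def sections_by_strong(html):
    """Map each <strong> title inside a <p> to the markup that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for header in soup.select("p strong"):
        title = header.get_text().strip()
        content = []
        # Walk the siblings of the wrapping <p>, not of the <strong> itself.
        for node in header.parent.next_siblings:
            if isinstance(node, NavigableString):
                # Keep bare text, skipping whitespace-only strings.
                if node.strip():
                    content.append(node.strip())
            elif isinstance(node, Tag):
                # Stop at the next element that directly contains a <strong>.
                if node.find("strong", recursive=False) is not None:
                    break
                content.append(str(node))
        result[title] = " ".join(content)
    return result


example = """<p><strong>First Title</strong></p><p>Content of first title</p><p>Content of first title</p><p><strong>Second title</strong></p><p>Content of second title</p></strong>"""

print(sections_by_strong(example))
```

With html.parser, the stray closing </strong> at the end of the example is ignored, and the printed dictionary should match the expected output from the question.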


Answered By - Guy
Answer Checked By - Mildred Charles (PHPFixing Admin)