PHPFixing

Wednesday, September 7, 2022

[FIXED] How to parse JavaScript code in html source in Python?

Tags: ajax, javascript, json, python, web-scraping

Issue

I am trying to scrape some data that sits inside a <script> tag in an HTML source.

The situation: I can get to the right <script></script> tag, but its content is one big string that needs to be converted and then parsed before I can pull out the exact value I need.

The problem: I don't know how to do that, and I can't find a clear answer.

My goal is to get the value "xe7fd4c285496ab91", which is the identification number of the content, also called "contentId".

Here is the code:

import requests
import bs4
import re

url = 'https://www.khanacademy.org/computing/computer-programming/programming/drawing-basics/pt/making-drawings-with-code'
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text, 'html.parser')  # I am not sure this is the right way to parse the page

# This line takes me straight to the text of the <script> tag that I need
item = soup.find(string=re.compile('contentId'))

# The result is a very long string; the desired value "xe7fd4c285496ab91" is somewhere inside it
print(item)

I tried json.parse(), but it does not work:

import json
jsonparsed=json.parse(item)

I get this error:

AttributeError: 'NavigableString' object has no attribute 'json'

My question is: How can I get the desired data? Is there a function to convert the string into JavaScript so I can parse it? Or a way to convert this string into a JSON file?

(Keep in mind that I will do this on multiple links with similar HTML/JavaScript).


Solution

You could stick with a regex applied to the raw response text, without locating the <script> tag first:

import re
import requests

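r = requests.get('https://www.khanacademy.org/computing/computer-programming/programming/drawing-basics/pt/making-drawings-with-code')
# Capture everything after contentId":" up to (but not including) the next double quote
p = re.compile(r'contentId":"((?:(?!").)*)')
i = p.findall(r.text)[0]  # first match; raises IndexError if the key is absent
print(i)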
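
Since the question mentions running this over multiple links, the same idea can be wrapped in a small helper that returns None instead of raising IndexError when a page has no contentId. This is only a sketch: the function name extract_content_id and the sample string are hypothetical, and the pattern uses a negated character class ([^"]*) rather than the lookahead form above, which should match the same text for this kind of input.

```python
import re

# Matches "contentId":"<value>" and captures <value> (anything up to the next quote).
# Optional whitespace around the colon is tolerated as a precaution.
CONTENT_ID_RE = re.compile(r'"contentId"\s*:\s*"([^"]*)"')


def extract_content_id(html_text):
    """Return the first contentId found in html_text, or None if there is none."""
    match = CONTENT_ID_RE.search(html_text)
    return match.group(1) if match else None


# Hypothetical sample standing in for r.text; the real page layout may differ.
sample = '<script>window.data = {"kind":"Video","contentId":"xe7fd4c285496ab91"};</script>'
print(extract_content_id(sample))
```

With requests, this would be called as extract_content_id(requests.get(url).text) for each URL in turn.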

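Regex

The pattern contentId":"((?:(?!").)*) first matches the literal text contentId":" and then captures, in group 1, every following character up to (but not including) the next double quote. The (?!"). part consumes one character at a time, and only if that character is not a double quote.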
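
On the JSON part of the question: Python's json module provides json.loads() for strings (there is no json.parse()). If the script text happens to embed a valid JSON object, one possible approach is to decode it with json.JSONDecoder.raw_decode, which reads one JSON value starting at a given index and ignores whatever follows. The sketch below assumes such an embedded object exists; the helper names and the sample string are hypothetical, and the real page may not be structured this way.

```python
import json


def _search(node, key):
    """Depth-first search for key in nested dicts/lists; returns its value or None."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = _search(child, key)
        if found is not None:
            return found
    return None


def find_key_in_embedded_json(script_text, key):
    """Try to decode a JSON object at each '{' and look for key inside it."""
    decoder = json.JSONDecoder()
    pos = script_text.find("{")
    while pos != -1:
        try:
            obj, end = decoder.raw_decode(script_text, pos)
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            pos = script_text.find("{", pos + 1)
            continue
        found = _search(obj, key)
        if found is not None:
            return found
        # Skip past the object that was just decoded.
        pos = script_text.find("{", end)
    return None


# Hypothetical script content; str(item) from BeautifulSoup would be passed here instead.
sample = 'window.__DATA__ = {"page": {"content": {"contentId": "xe7fd4c285496ab91", "kind": "Scratchpad"}}};'
print(find_key_in_embedded_json(sample, "contentId"))
```

This is heavier than the regex, but it yields the whole decoded structure, which may help if more than one field is needed.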



Answered By - QHarr
Answer Checked By - Terry (PHPFixing Volunteer)