Issue
I am trying to web scrape some data inside a JavaScript tag in a HTML source.
The situation: I can get to the appropriate <script></script>
tag. But inside that tag, there is a big string, which needs to be converted and then parsed so I can get the precise data that I need.
The problem is: I don't know how to do that and can't find a clear and satisfying answer to do it.
Here is the code:
My goal is to get this data: "xe7fd4c285496ab91"
which is the identification number of the content, also called "contentId"
.
import requests
import bs4
import re
url = 'https://www.khanacademy.org/computing/computer-programming/programming/drawing-basics/pt/making-drawings-with-code'
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text,'html.parser') # by the way I am not sure if this is the right way to parse the link
item = soup.find(string=re.compile('contentId')) # with this line I can get directly to the exact JavaScript tag that I need
print(item) # but as you can see, it's a pretty big string, and I need to parse it to get the desired data. But you can find that the desired data "xe7fd4c285496ab91" is in it.
I tried to use json.parse()
but it is not working:
import json
jsonparsed=json.parse(item)
Get this error:
AttributeError: 'NavigableString' object has no attribute 'json'
My question is: How can I get the desired data? Is there a function to convert the string into JavaScript so I can parse it? Or a way to convert this string into a JSON file?
(Keep in mind that I will do this on multiple links with similar HTML/JavaScript).
Solution
You could just stick with regex on text alone without searching for script
import re
import requests
r = requests.get('https://www.khanacademy.org/computing/computer-programming/programming/drawing-basics/pt/making-drawings-with-code')
p = re.compile(r'contentId":"((?:(?!").)*)')
i = p.findall(r.text)[0]
print(i)
Regex
Answered By - QHarr Answer Checked By - Terry (PHPFixing Volunteer)
0 Comments:
Post a Comment
Note: Only a member of this blog may post a comment.