Issue
I would like to make a Python script which takes as input a primary URL, for example:
https://stackoverflow.com/
and then recursively visits all the pages and builds a directed graph of all the pages (nodes) of the site and its sub-pages, with an edge from node a (page a) to node b (page b) if and only if there is a link in page a to page b. I assume something like that already exists, but I didn't find it on Google... If there are any ideas, maybe using wget or something else, I would love to hear them.
Solution
I'll only give you pointers to what you'll need to build such a tool using basic Python:
- First, you will need urllib to open URLs.
- Then, you can use either regexes or BeautifulSoup to find the links in your pages. The former is less CPU-expensive but less precise; the latter is a lenient HTML parser (meaning it accepts defective HTML).
- You can then store the URLs to crawl in a list, and for each link you find in a page, check whether you have already crawled it (to avoid infinite cycles) using a set.
- To build your graph: each new page you crawl is a new node, and each link you find is a new edge. A sketch tying these steps together follows this list.
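Here is a minimal sketch of those steps combined, assuming a breadth-first crawl that stays on the starting domain; names like crawl and MAX_PAGES are illustrative choices, not a standard API.

```python
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

from bs4 import BeautifulSoup  # pip install beautifulsoup4

MAX_PAGES = 100  # safety limit so the sketch terminates on large sites

def crawl(start_url):
    """Breadth-first crawl; returns the graph as an adjacency dict."""
    domain = urlparse(start_url).netloc
    graph = {}                  # node -> set of nodes it links to
    visited = {start_url}       # set avoids re-crawling (and infinite cycles)
    queue = deque([start_url])  # list/queue of URLs still to crawl

    while queue and len(visited) <= MAX_PAGES:
        url = queue.popleft()
        try:
            html = urlopen(url).read()
        except Exception:
            continue            # skip pages that fail to load
        soup = BeautifulSoup(html, "html.parser")
        graph[url] = set()
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])  # resolve relative links
            if urlparse(link).netloc != domain:
                continue        # stay within the original site
            graph[url].add(link)    # each link found is a new edge
            if link not in visited:
                visited.add(link)
                queue.append(link)  # each new page is a new node to crawl
    return graph

if __name__ == "__main__":
    for page, links in crawl("https://stackoverflow.com/").items():
        print(page, "->", len(links), "links")
```

The adjacency dict is easy to hand off to a graph library later if you want layout or analysis, but it is already the directed graph you described: keys are nodes, and each stored link is an edge.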
Or, you can use Scrapy, a Python library made for crawling.
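For comparison, here is a minimal Scrapy spider sketch. The Spider subclass, start_urls, and parse callback follow Scrapy's documented API; emitting one item per edge is just one possible output format, and Scrapy's built-in duplicate filter plays the role of the visited set above.

```python
import scrapy  # pip install scrapy

class SiteGraphSpider(scrapy.Spider):
    name = "sitegraph"
    start_urls = ["https://stackoverflow.com/"]
    allowed_domains = ["stackoverflow.com"]  # stay within the original site

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            target = response.urljoin(href)  # resolve relative links
            # Each link is an edge from the current page to the target page.
            yield {"source": response.url, "target": target}
            # Scrapy deduplicates already-seen requests automatically.
            yield response.follow(target, callback=self.parse)
```

You could run it without a project via scrapy runspider, e.g. scrapy runspider sitegraph_spider.py -o edges.json, then build the graph from the collected edge list.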
Answered By - Scharron