[Question] Beginner web scraping question, starlichin, PTT (批踢踢實業坊)

Author: starlichin (白星羽) 2018-10-30 23:30:58

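I am currently teaching myself Python web scraping on Coursera.
While doing an assignment I got stuck on one exercise: given a position entered by the user,
find the link at that position on the page, open that link,
then find the link at the same position on the next page, and repeat this counter times.
The original problem statement reads: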
In this assignment you will write a Python program that expands on
http://www.py4e.com/code3/urllinks.py. The program will use urllib to read
the HTML from the data files below, extract the href= values from the anchor
tags, scan for a tag that is in a particular position relative to the first
name in the list, follow that link and repeat the process a number of times
and report the last name you find.
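Below is what I have written so far. It only prints the link at the specified position on the first page.
I don't know how to follow that link and print the links on the following pages, repeating counter times. Any help would be appreciated.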
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'
counter = input('Enter counter: ')
position = input('Enter position: ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
# Retrieve all of the anchor tags
tags = soup('a')
lst = []
for tag in tags:
link = tag.get('href', None)
lst.append(link)
print(lst[int(position)-1])

Author: takingblue (takingblue) 2018-10-31 15:37:00

Write a loop driven by your counter, and on each pass send a new request to the link you just found.
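
The loop suggested above could look roughly like the sketch below. It is only one possible arrangement, not the course's reference solution. The names AnchorCollector, link_at, follow_links and fetch_page are made up for illustration, and the standard library's html.parser is swapped in for BeautifulSoup so the sketch runs without extra packages; with BeautifulSoup, soup('a') would supply the same list of tags. It assumes every href on the page is an absolute URL and that position is 1-based, as in the original post.

```python
import ssl
import urllib.request
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # Anchors without an href are kept as None, like tag.get('href', None).
            self.links.append(dict(attrs).get('href'))


def link_at(html, position):
    """Return the href of the anchor at a 1-based position."""
    parser = AnchorCollector()
    parser.feed(html)
    return parser.links[position - 1]


def follow_links(url, counter, position, fetch):
    """Follow the link at `position` on each page, `counter` times."""
    for _ in range(counter):
        # The link found on this page becomes the page requested next.
        url = link_at(fetch(url), position)
        print('Retrieving:', url)
    return url


def fetch_page(url):
    """Download a page as text, ignoring SSL certificate errors as in the post."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return urllib.request.urlopen(url, context=ctx).read().decode()
```

With the values read from input(), the call would be follow_links(url, int(counter), int(position), fetch_page). Passing the download function in as fetch is just a convenience so the loop can be tried on canned HTML without touching the network.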

Author: starlichin (白星羽) 2018-10-31 23:15:00

Solved! Thank you :)