Hi everyone,
I've recently been learning web scraping, so I picked pttweb.cc (the PTT web mirror) as a practice target.
The problem I've hit is that after the first 10 posts, my scraper keeps grabbing the same posts' titles and links over and over:
https://imgur.com/a/Bnqo2B1
My guess is that the page isn't loading any new data.
But isn't a dynamic page with infinite scroll supposed to load more content as soon as you scroll down?
Also, I can see in the ChromeDriver window that Selenium's scrolling is actually executing, so what could be causing this?
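From what I've read, the usual pattern for infinite-scroll pages is to scroll and then explicitly wait until the number of items actually grows, instead of sleeping for a fixed time. A minimal sketch of that pattern (it reuses the e7-right.ml-2 selector from my code below; the 10-second timeout is just an assumption, not a verified fix):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get('https://www.pttweb.cc/hot/all/today')

prev_count = len(driver.find_elements(By.CLASS_NAME, 'e7-right.ml-2'))
# Scroll to the bottom, then wait (up to 10 s) for new items to be appended.
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
WebDriverWait(driver, 10).until(
    lambda d: len(d.find_elements(By.CLASS_NAME, 'e7-right.ml-2')) > prev_count
)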
Here's my code:
import urllib.request as req
import requests
import selenium
import schedule
import time
import json
from time import sleep
import openpyxl
import random
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support import expected_conditions as EC
import bs4
pttWeb = openpyxl.load_workbook('pttweb.xlsx')
ws = pttWeb.active
i = 1
scroll_time = int(input("scroll_Times"))
options = Options()
# In Selenium 4 the driver path is passed through Service;
# Options has no chrome_executable_path attribute.
service = Service(r"C:\chromedriver_win32\chromedriver.exe")
driver = webdriver.Chrome(service=service, options=options)
sleep(3)
driver.get('https://www.pttweb.cc/hot/all/today')
sleep(5)
prev_ele = None
for now_time in range(1, scroll_time + 1):
    sleep(2)
    eles = driver.find_elements(by=By.CLASS_NAME, value='e7-right.ml-2')
    # If the previous round's last element is still in the list, only scrape
    # from that element onward.
    try:
        # print(eles)
        # print(prev_ele)
        eles = eles[eles.index(prev_ele):]
    except ValueError:
        pass
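    # Note: eles.index(prev_ele) compares WebElements by Selenium's element
    # identity, which only matches if the page kept the same DOM node across
    # the scroll. If the site re-renders the list (an assumption about
    # pttweb.cc), index() raises ValueError and the whole list is scraped
    # again from the top.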
    for ele in eles:
        try:
            titleInfo = ele.find_element(by=By.CLASS_NAME, value="e7-article-default")
            title = titleInfo.text
            href = titleInfo.get_attribute('href')
            ws.cell(i, 1, i)
            ws.cell(i, 2, title)
            ws.cell(i, 3, href)
            sleep(3)
            inner = req.Request(href, headers={
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
            })
            with req.urlopen(inner) as innerResponse:
                articleData = innerResponse.read().decode("utf-8")
            articleRoot = bs4.BeautifulSoup(articleData, "html.parser")
            main_content = articleRoot.find("div", itemprop="articleBody")
            boardInfo = articleRoot.find("span", class_="e7-board-name-standalone")
            authorInfo = articleRoot.find("span", itemprop="name")
            timeInfo = articleRoot.find("time", itemprop="datePublished")
            countInfo = articleRoot.find_all("span", class_="e7-head-content")
            board = boardInfo.text
            author = authorInfo.text
            Time = timeInfo.text
            count = countInfo[4].text
            allContent = main_content.text
            pre_text = allContent.split('