Re: [Share] Going further with RSelenium: scraping PTT content and sending notifications (celestialgod, PTT 批踢踢實業坊)

Author: celestialgod (天) 2016-07-24 17:56:52

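※ Quoting wanson (望生):
: I recently saw a tutorial on using RSelenium to scrape PTT content:
: https://www.youtube.com/watch?v=PYy5C9IIgp8
: I taught myself from it and found that it does work.
: I need to find carpool rides myself,
: especially on routes that few people offer,
: so I would like to take this method further and also get notified.
: The tutorial at the link above only saves the scraped data to a single file
: and is limited to the board's first page.
: I would like the extra features below, and wonder whether R can handle them:
: 1. Scrape more pages, or every page of the board.
: For this part I noticed
: that paging seems to be driven by the index number in the URL.
: If I go to the oldest page the index is 1, and the next page is 2,
: but on the newest page the URL only shows index, with no number.
: https://www.ptt.cc/bbs/car-pool/index2.html
: It looks like I could write a loop to fetch the pages.
: 2. Scrape automatically on a schedule.
: I want the computer to scrape every two hours, but I don't know how to set that up.
: Could someone show me how?
: Thanks.
: Below is the code I produced by following his tutorial: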
: library(RSelenium)
: url= "https://www.ptt.cc/bbs/car-pool/index.html"
: remDr <- remoteDriver(remoteServerAddr = "localhost"
: , port = 4444
: , browserName ="firefox"
: )
: remDr$open() #open browser
: remDr$getStatus()#check the status of browser
: remDr$navigate(url)# website to crawl
: # each post on a PTT index page sits in an element with class r-ent
: # get those elements from the page
: # CSS selectors: a leading period (.) selects a class, a leading # selects an id
: webElem <- remDr$findElements('css selector', ".r-ent")
: a = sapply(webElem, function(x){
:   author = x$findChildElement('css selector', '.author')
:   title = x$findChildElement('css selector', '.title')
:   date = x$findChildElement('css selector', '.date')
:   # getElementText() returns a list, so take its first element
:   c(author = author$getElementText()[[1]],
:     title = title$getElementText()[[1]],
:     date = date$getElementText()[[1]])
: })
: t = as.data.frame(t(a))
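For the first question, just grab the link behind the "previous page" (上頁) button and keep stepping backwards one page at a time; an example follows below.
RSelenium is a fairly easy tool to get started with.
Once you have a feel for it, you can gradually move to fetching pages with httr and xml2, which is considerably faster.
For example:
(stri_conv is only needed on Windows; on Linux/macOS you can leave it out)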
library(httr)
library(xml2)
library(pipeR)
library(stringi)
library(stringr)
# URL of the newest index page
url <- "https://www.ptt.cc/bbs/car-pool/index.html"
# fetch and parse the page
html_nodex <- GET(url) %>>% content
# find the page-navigation buttons
btnPageChange <- html_nodex %>>% xml_find_all("//div/a[@class='btn wide']")
# locate the "previous page" (上頁) button
locPrevPage <- btnPageChange %>>% xml_text %>>%
  stri_conv("UTF8", "BIG5") %>>% str_detect("上頁")
# read the index number of the previous page from its href
indexCount <- btnPageChange[locPrevPage] %>>% xml_attr("href") %>>%
  str_match("index(\\d+)") %>>% `[`(2) %>>% as.integer
# XPath of each field to extract
infoVec <- c(title = "//div[@class='title']",
             date = "//div[@class='date']",
             author = "//div[@class='author']")
# scrape the newest page first
info <- sapply(infoVec, function(xpath){
  xml_find_all(html_nodex, xpath) %>>% xml_text %>>%
    stri_conv("UTF8", "BIG5") %>>% str_replace_all("\t|\n| ", "")
})
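# Build the date of two days ago the way the index page shows it, i.e. the
# month without a leading zero (e.g. "7/22"); crawling stops once it appears.
# Caveat: a pinned post dated two days ago ends the crawl right away;
# adjust the rule yourself if that matters.
crawlDate <- format(Sys.Date()-2, "%m/%d") %>>%
  str_replace("^0", "")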
continueCrawl <- TRUE
while (continueCrawl)
{
  # replace index.html with the previous page, the one before that, ...
  html_nodex <- url %>>%
    str_replace("index.html", sprintf("index%i.html", indexCount)) %>>%
    GET %>>% content
  # extract the fields from that page
  tmpInfo <- sapply(infoVec, function(xpath){
    xml_find_all(html_nodex, xpath) %>>% xml_text %>>%
      stri_conv("UTF8", "BIG5") %>>% str_replace_all("\t|\n| ", "")
  })
  # append to what has been collected so far
  info <- rbind(info, tmpInfo)
  # stop once a post dated two days ago appears, or at the oldest page
  if (any(tmpInfo[ , "date"] == crawlDate) || indexCount <= 1)
    continueCrawl <- FALSE
  indexCount <- indexCount - 1
}
info # the collected result
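The stopping rule in the loop can be tripped by a pinned post, as the comment in the code warns. Here is a minimal sketch of one way around that. It assumes the index page marks the start of its pinned posts with a div of class r-list-sep that is a sibling of the r-ent post blocks; that markup is an assumption about PTT's web layout, so check the live page first. The string sample_html and the helper regular_dates are made up for illustration.

```r
library(xml2)

# Hypothetical stand-in for a PTT index page: two regular posts,
# then a separator, then one pinned post.
sample_html <- "
<div class='r-list-container'>
  <div class='r-ent'><div class='title'>post A</div><div class='date'> 7/24</div></div>
  <div class='r-ent'><div class='title'>post B</div><div class='date'> 7/23</div></div>
  <div class='r-list-sep'></div>
  <div class='r-ent'><div class='title'>pinned</div><div class='date'> 7/22</div></div>
</div>"

# Return the dates of the posts listed before the separator (assumed layout).
regular_dates <- function(doc) {
  entries <- xml_find_all(
    doc,
    "//div[@class='r-ent'][not(preceding-sibling::div[@class='r-list-sep'])]"
  )
  # one date node per remaining entry, with whitespace stripped
  dates <- xml_text(xml_find_first(entries, ".//div[@class='date']"))
  gsub("\\s", "", dates)
}

doc <- read_html(sample_html)
regular_dates(doc)
```

If the assumption holds, comparing crawlDate against these dates instead of every date div would keep a pinned post from ending the crawl early.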
To run this every two hours, wrap the whole thing in an outer while (TRUE) loop and pair it with Sys.sleep.
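A minimal sketch of that outer loop follows. The names run_every and crawl_once are hypothetical: crawl_once stands in for the scraping code wrapped in a function, and max_runs is added only so the demo can terminate. The tryCatch is an extra precaution so that one failed request does not end the schedule.

```r
# Call `task` repeatedly, sleeping `interval_secs` between runs.
# With max_runs = Inf this behaves like an endless while (TRUE) loop.
run_every <- function(task, interval_secs, max_runs = Inf) {
  runs <- 0
  while (TRUE) {
    # report an error and carry on instead of stopping the schedule
    tryCatch(task(),
             error = function(e) message("crawl failed: ", conditionMessage(e)))
    runs <- runs + 1
    if (runs >= max_runs) break
    Sys.sleep(interval_secs)
  }
  invisible(runs)
}

# Placeholder task; replace the body with the scraping code.
crawl_once <- function() {
  message("crawling at ", format(Sys.time(), "%Y-%m-%d %H:%M:%S"))
}

# Demo: three runs, one second apart.
# For a two-hour schedule: run_every(crawl_once, 2 * 60 * 60)
run_every(crawl_once, interval_secs = 1, max_runs = 3)
```

An R session has to stay open for this to keep working. If that is inconvenient, having the operating system's scheduler (cron, or Task Scheduler on Windows) launch the script with Rscript is another option.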
