Re: [Question] What to pass into htmlParse | oldjojotenya | PTT (批踢踢實業坊)

Author: oldjojotenya (舊舅舅) 2015-01-31 13:46:49

Afterwards I picked two kinds of web pages and ran some tests:
I. A page where all the information is on a single page:
https://tw.stock.yahoo.com/d/s/company_2330.html
1.
library(XML)
url<-"https://tw.stock.yahoo.com/d/s/company_2330.html"
content0<-htmlParse(url)
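Result: it runs, but shows a warning: XML content does not seem to be XML
I then looked it up on Stack Overflow, where someone had answered how to handle this situation:
"You can use RCurl to fetch the content and then XML seems to be able to
handle it", meaning that fetching the page with RCurl's getURL first makes it work.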
2.
library(RCurl)
url<-getURL("https://tw.stock.yahoo.com/d/s/company_2330.html")
content1<-htmlParse(url)
Result: success
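A minimal sketch of the pattern in attempt 2, assuming the XML package is installed. The literal page_text string is a made-up stand-in for what getURL() would return, so the snippet runs without network access. Passing asText = TRUE tells htmlParse to treat its argument as content rather than as a file name or URL.

```r
library(XML)

# Hypothetical stand-in for the string that getURL() would return.
page_text <- "<html><body><table><tr><td>2330</td><td>TSMC</td></tr></table></body></html>"

# asText = TRUE: parse the string itself instead of treating it as a path or URL.
doc <- htmlParse(page_text, asText = TRUE)

# Pull the text of every table cell with an XPath query.
cells <- xpathSApply(doc, "//td", xmlValue)
print(cells)
```

Being explicit with asText = TRUE avoids relying on htmlParse guessing whether the string is a location or content.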
3.
url<-"https://tw.stock.yahoo.com/d/s/company_2330.html"
f<-file(url)
f_size<-file.info(url)$size
content2<-readChar(f,f_size)
close(f)
Result:
#Error in readChar(f, f_size) : cannot open the connection
In addition: Warning message:
In readChar(f, f_size) : unsupported URL scheme
II. A search page:
http://www.taifex.com.tw/chinese/3/7_12_1.asp
1.
url<-"http://www.taifex.com.tw/chinese/3/7_12_1.asp"
content0<-htmlParse(url)
Result: success
2.
url<-getURL("http://www.taifex.com.tw/chinese/3/7_12_1.asp")
content1<-htmlParse(url)
Result: success
3.
url<-"http://www.taifex.com.tw/chinese/3/7_12_1.asp"
f<-file(url)
f_size<-file.info(url)$size
content2<-readChar(f,f_size)
close(f)
Result:
#Error: invalid 'nchars' argument
I checked how readChar is used: nchars cannot be NA, yet for some reason the f_size passed in here is NA.
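The NA can be reproduced without any network access: file.info() queries the local file system, and a URL string is not a path there, so its size comes back as NA. A small sketch using only base R; the temporary file and its contents are made up for illustration.

```r
# Write a small local file so there is something with a real size.
local_path <- tempfile(fileext = ".html")
writeLines("<html><body>hello</body></html>", local_path)

# Size in bytes of the local file.
local_size <- file.info(local_path)$size

# The URL is looked up as if it were a local path, which does not exist.
remote_size <- file.info("http://www.taifex.com.tw/chinese/3/7_12_1.asp")$size

print(local_size)
print(is.na(remote_size))

# With a known size, readChar works on the local file.
content_local <- readChar(local_path, local_size)
```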
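Summary:
1. Whatever the case, getURL is the safer choice.
2. When file.info is pointed at a local file, the size it returns is that file's size. When it is
pointed at a file on the network, for some reason it never reads the correct size (it always
shows NA), so readChar cannot be used to grab the page content.
May I ask why this is the case?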
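One possible way around the missing size, sketched under assumptions: read from a connection with readLines(), which needs no byte count up front. read_all_text is a hypothetical helper name, not part of any package. For a remote page one would pass url(...) instead of file(...), and whether https works there depends on the R version and build.

```r
# Hypothetical helper: read everything from a connection without knowing its size.
read_all_text <- function(con) {
  on.exit(close(con))                     # always release the connection
  lines <- readLines(con, warn = FALSE)   # reads until end of input
  paste(lines, collapse = "\n")
}

# Demonstration on a local file so the sketch runs offline.
demo_path <- tempfile(fileext = ".html")
writeLines(c("<html>", "<body>demo</body>", "</html>"), demo_path)
page_text <- read_all_text(file(demo_path))
print(page_text)

# For a remote page (assumption: network access and a supported URL scheme):
# page_text <- read_all_text(url("http://www.taifex.com.tw/chinese/3/7_12_1.asp"))
```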

Author: Wush978 (拒看低質媒體) 2015-02-01 22:43:00

Thank you for your spirit of inquiry!
