Data acquisition methods:
| STR | CLASS | TYPE.MAIN | COUNT | PERC |
|-----|-------|-----------|-------|------|
| YES | Sheets | CSV | 7224 | 35.03 % |
| | | XML | 6879 | 33.36 % |
| | | JSON | 1579 | 7.66 % |
| | | RSS | 113 | 0.55 % |
| | MAPs | KML | 51 | 0.25 % |
| | | WMS | 31 | 0.15 % |
| | | SHP | 82 | 0.4 % |
| | | KMZ | 9 | 0.04 % |
| | | WMTS | 7 | 0.03 % |
| | WebPage | WebPage | 2 | 0.01 % |
| | API | ASMX | 1 | 0 % |
| | | DEMDSM | 1 | 0 % |
The workflow relies on these functions:

url <- paste()                              # compose the download URL
as.Date("2016/12/08") %>% format("%Y%m%d")  # build a yyyymmdd date string (the %>% pipe needs magrittr/dplyr)
download.file()                             # fetch the file
list.files()                                # list the downloaded files
read.csv()                                  # read a CSV into a data frame
library(dplyr)                              # data manipulation
write.csv()                                 # save results as CSV
library(DBI)                                # database access
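As a small sketch of how the pieces above combine, the date helpers can build a dated file name that is then pasted into a download URL (the URL pattern here mirrors the TDCS example that follows and is purely illustrative):

```r
library(magrittr)  # provides the %>% pipe

# Format a Date as a compact yyyymmdd string.
day <- as.Date("2016/12/08") %>% format(format = "%Y%m%d")

# Splice the date string into the archive's file-naming pattern.
url <- paste0("http://tisvcloud.freeway.gov.tw/history/TDCS/M03A/M03A_",
              day, ".tar.gz")
print(url)
```

From here, `download.file(url, destfile = ...)` would fetch the archive as in the example below.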
url <- 'http://tisvcloud.freeway.gov.tw/history/TDCS/M03A/M03A_20161113.tar.gz'
download.file(url, destfile = "20161113.tar.gz")
untar("20161113.tar.gz")
all.files <- list.files(path = "./var", recursive = TRUE)
for (i in 1:10) {
  print(all.files[i])  # preview the first ten extracted files
}
# (1) .csv
url <- "http://data.gov.tw/iisi/logaccess/165?dataUrl=http://opendata.epa.gov.tw/ws/Data/AQX/?format=csv&ndctype=CSV&ndcnid=6074"
y <- read.csv(url, sep = ",", stringsAsFactors = F, header = T)
# (2) json files
library(jsonlite)
url <- 'http://data.gov.tw/iisi/logaccess/166?dataUrl=http://opendata.epa.gov.tw/ws/Data/AQX/?format=json&ndctype=JSON&ndcnid=6074'
y <- fromJSON(url, flatten = TRUE)
y <- as.data.frame(y$Records)
# (3) XML
library(XML)
url <- 'http://data.gov.tw/iisi/logaccess/167?dataUrl=http://opendata.epa.gov.tw/ws/Data/AQX/?format=xml&ndctype=XML&ndcnid=6074'
x <- xmlParse(url)            # parse the XML file with xmlParse
xmlfiles <- xmlRoot(x)        # set the root to the content level (a quick-and-dirty shortcut)
y <- xmlToDataFrame(xmlfiles) # convert to a data frame
# save the cleaned result as CSV
write.csv(file = 'open.csv', y, fileEncoding = 'big5')
The Open Data Protocol (OData) is an open protocol for identifying and querying database resources through simple URL parameters; it supports both XML and JSON formats.
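As a minimal sketch of how those URL parameters look in practice, the snippet below assembles an OData query string with the standard `$format`, `$top`, and `$filter` options. The service root `http://example.com/odata/AQX` and the filter field are hypothetical placeholders, not a real endpoint:

```r
# Hypothetical OData service root -- substitute a real endpoint.
base <- "http://example.com/odata/AQX"

# Standard OData query options, passed as plain URL parameters;
# URLencode() percent-encodes the spaces and quotes in the filter.
query <- paste0(
  "?$format=json",
  "&$top=5",
  "&$filter=", URLencode("County eq 'Taipei'", reserved = TRUE)
)
url <- paste0(base, query)
print(url)
```

The resulting URL could then be read directly with `jsonlite::fromJSON(url)`, just like the JSON example above.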
SELECT DISTINCT column_list
FROM table_list
JOIN table ON join_condition
WHERE row_filter
GROUP BY column
HAVING group_filter
ORDER BY column
LIMIT count OFFSET offset;
install.packages(c("DBI", "RSQLite"))
vignette("RSQLite")
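A minimal, self-contained sketch of the DBI + RSQLite workflow, using an in-memory database and R's built-in `mtcars` data (the table and column names are just demo data, not part of any course dataset):

```r
library(DBI)  # RSQLite supplies the SQLite driver used below

con <- dbConnect(RSQLite::SQLite(), ":memory:")  # throwaway in-memory database
dbWriteTable(con, "mtcars", mtcars)              # load a demo table

# The SELECT template above in action: filter, group, order, limit.
res <- dbGetQuery(con, "
  SELECT cyl, COUNT(*) AS n, AVG(mpg) AS avg_mpg
  FROM mtcars
  WHERE mpg > 15
  GROUP BY cyl
  ORDER BY avg_mpg DESC
  LIMIT 3")
print(res)

dbDisconnect(con)
```

The same `dbGetQuery()` call works unchanged against an on-disk SQLite file: replace `":memory:"` with a file path.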
library(data.table)
ubike <- fread(input = "./data/ubike-weather-big5.csv",
               data.table = FALSE,
               colClasses = c("factor", "integer", "integer", "factor", "factor",
                              "numeric", "numeric", "integer", "numeric", "integer",
                              "integer", "numeric", "numeric", "integer", "integer",
                              "numeric", "numeric", "numeric", "numeric", "numeric",
                              "numeric", "numeric", "numeric"),
               stringsAsFactors = FALSE)
Apache Spark is an open-source cluster-computing framework originally developed at UC Berkeley's AMPLab. It uses in-memory computing: data can be analyzed in memory before it is ever written to disk. Spark can run programs up to 100 times faster than Hadoop MapReduce when working in memory, and up to 10 times faster when working from disk, which makes it well suited to machine-learning algorithms.