Wednesday, July 13, 2016

【Pre-Processing a Transaction Dataset in R】(2)

Some of the code in the previous post was not as concise as it could have been; it was written mainly to be readable to me. R is a language built for vector and matrix computation, and if we don't use that strength, it is like owning a fine saber and leaving it in its scabbard.
Below is a more concise version, which also fixes a few logic bugs. So, let's draw the saber from the scabbard.
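As a small illustration of what "using R's vector strength" means here, the sketch below fills missing values first with an explicit loop and then with a single vectorized assignment. The `amount` vector is made up for this example and is not taken from the real dataset.

```r
# Hypothetical daily amounts, made up for illustration only.
amount <- c(120, NA, 75, NA, 300)

# Loop version: visit each element and replace missing values one by one.
filled_loop <- amount
for (i in seq_along(filled_loop)) {
  if (is.na(filled_loop[i])) {
    filled_loop[i] <- 0
  }
}

# Vectorized version: one logical index handles every missing value at once.
filled_vec <- amount
filled_vec[is.na(filled_vec)] <- 0

# Both approaches give the same result.
stopifnot(identical(filled_loop, filled_vec))
```

The vectorized form is shorter and usually faster on large data, which is the style the script in this post tries to follow.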

Here is the complete code.

I will explain this code in detail after I finish updating the package. Thanks.

================================================================

library(openxlsx) # read.xlsx(), write.xlsx()
library(zoo)
library(Hmisc)    # yearDays()

path <- "/Users/user/Desktop/myExData.xlsx"
data <- read.xlsx(path)
#=============
data[,'Date'] <- as.Date(data[,'Date'], origin = "1899-12-30")
colNameVector <- colnames(data)

data$Date <- as.POSIXlt(data$Date) # transform to POSIXlt type

year.list <- levels(factor(data$Date$year + 1900))

### sorting data
inc.order <- order(data$Date, decreasing = FALSE)
data <- data[inc.order,]

### building an empty data frame
final.data <- data.frame(data[,1:length(colNameVector)])
final.data[,] <- NA
final.data$Date <- as.POSIXlt(final.data$Date)

year <- substr(data[1,2],1,4)
origin <- paste(year, "-01-01", sep = "")
origin <- as.Date(origin)
diff <- as.Date(data[1,2])-origin
#=============

year.list <- sprintf("%s-01-01", year.list)
year.list <- as.Date(year.list)
yearDays.list <- mapply(yearDays, year.list)
daySum <- sum(yearDays.list)
daySum <- as.numeric(daySum - diff)

# extend the frame to one row per day; daySum already excludes the days before the first record,
# because the data does not start on 1/1
final.data[1:daySum,] <- NA
final.data[,2] <- seq(data[1,2], by = "1 days", length.out = daySum)

### reset row names after sorting
rownames(data) <- seq_len(nrow(data))

### make the column names of data identical to those of final.data
colnames(data) <- names(final.data)

my.index <- match(data$Date, as.POSIXlt(final.data$Date))
final.data[my.index,] <- data[,]

# index of a column other than Date, used to find the last filled record
x <- if (which(colnames(data) == "Date") == 1) 2 else 1

tag <- max(which(!is.na(final.data[, x])))
final.data <- head(final.data,tag)  # cutting off last empty records

#==============
# days without a record: copy columns 3 to 10 from the first record, set columns 11 to 14 to zero
final.data[which(is.na(final.data[, 3])), 3:10] <- data[1, 3:10]
final.data[which(is.na(final.data[, 11])), 11:14] <- numeric(4)

#==============
final.data$Date <- as.Date(final.data$Date) # convert back to Date so the dates display correctly in Excel
colnames(final.data) <- colNameVector
### write the result
write.xlsx(final.data, file = "/Users/user/Desktop/myExData_full.xlsx")
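
The core idea of the script above (build a complete daily calendar, then use match() to drop the observed rows into place in one assignment) can be tried on a toy data frame. This is only a minimal sketch: the `tx` data, the `Amount` column, and the choice to fill empty days with 0 are assumptions made for illustration, not the real dataset or its filling rules.

```r
# Toy transactions with gaps; all values are made up for illustration.
tx <- data.frame(
  Date   = as.Date(c("2016-01-01", "2016-01-03", "2016-01-06")),
  Amount = c(100, 250, 80)
)

# A complete daily calendar from the first to the last observed date.
all_days <- seq(min(tx$Date), max(tx$Date), by = "1 day")

# Start from an all-NA frame with one row per calendar day.
full <- data.frame(Date = all_days, Amount = NA_real_)

# match() returns, for every transaction, its row position in the calendar,
# so a single assignment places all observed rows at once.
idx <- match(tx$Date, full$Date)
full$Amount[idx] <- tx$Amount

# Days without a transaction get 0 (an assumption for this sketch).
full$Amount[is.na(full$Amount)] <- 0

print(full)
```

No loop over days is needed: the calendar, the lookup, and the fill are each a single vectorized step.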

