The code in the previous article was perhaps not as concise as it could be; it was merely readable to me. R is a language born for vector and matrix computation, and if we don't make good use of that strength, it's a bit like owning a legendary sword but leaving it in its scabbard. So let's draw the sword. Below is a more concise version, with a few logical bugs fixed as well.
Here is the complete code.
I will explain it in detail after the package is updated. Thanks.
================================================================
require(openxlsx)
require(zoo)
require(Hmisc)
path <- "/Users/user/Desktop/myExData.xlsx"
data <- read.xlsx(path)
#=============
data[,'Date'] <- as.Date(data[,'Date'], origin = "1899-12-30")
colNameVector <- colnames(data)
data$Date <- as.POSIXlt(data$Date) # transform to POSIXlt type
year.list <- levels(factor(data$Date$year + 1900))
### sorting data
inc.order <- order(data$Date, decreasing = FALSE)
data <- data[inc.order,]
### building an empty data frame
final.data <- data.frame(data[,1:length(colNameVector)])
final.data[,] <- NA
final.data$Date <- as.POSIXlt(final.data$Date)
year <- substr(data[1,2],1,4)
origin <- paste(year, "-01-01", sep = "")
origin <- as.Date(origin)
diff <- as.Date(data[1,2])-origin
#=============
year.list <- sprintf("%s-01-01", year.list)
year.list <- as.Date(year.list)
yearDays.list <- mapply(yearDays, year.list)
daySum <- sum(yearDays.list)
daySum <- as.numeric(daySum - diff)
final.data[1:daySum,] <- NA # resize final.data to daySum rows (the days before the first transaction are already excluded from the count)
final.data[,2] <- seq(data[1,2], by = "1 days", length.out = daySum)
### setting rownames
rownames(data) <- c(1:nrow(data))
### duplicate identical column names
colnames(data) <- names(final.data)
my.index <- match(data$Date, as.POSIXlt(final.data$Date))
final.data[my.index,] <- data[,]
x <- ifelse(which(colnames(data) == "Date") == 1, 2, 1 )
tag <- max(which(!is.na(final.data[, x])))
final.data <- head(final.data,tag) # cutting off last empty records
#==============
### fill the empty days: static columns (3-10) are copied from the first record,
### numeric columns (11-14) are set to zero
final.data[is.na(final.data[, 3]), 3:10] <- data[1, 3:10]
final.data[is.na(final.data[, 11]), 11:14] <- 0
#==============
final.data$Date <- as.Date(final.data$Date) # for correcting date time in excel
colnames(final.data) <- colNameVector
### return data
write.xlsx(final.data, file = "/Users/user/Desktop/myExData_full.xlsx")
R Language - Practical Applications
Wednesday, July 13, 2016
Friday, July 8, 2016
[Pre-Processing for Transaction Datasets in R] (1)
Today's topic is the pre-processing of time-series data. In a typical data analysis project, data pre-processing takes the most time, roughly 80% of the effort, sometimes more. Given "Garbage in, garbage out," perhaps the most-quoted maxim in the big-data world, it is no surprise that this stage is so time-consuming: skip it, and the final analysis stage is simply a waste of time.
Note: the details of pre-processing vary from case to case, but it usually covers extraction, transformation, cleaning, and loading (ETL, data munging, etc.).
Why does this topic involve time series? Recall the most basic requirement of a time series: a sequence of observations recorded at evenly spaced, continuous time intervals. Sometimes the data we pull from a database with SQL covers discontinuous time points, as shown below:
Note: the columns are transaction sequence, transaction date, store name, department code, store class, product, product code, customer, specification, unit, sales, sales amount, cost, and profit.
Clearly, 5/13, the date between 5/12 and 5/14, is missing from the dataset because its sales were zero. If we mistake such incomplete data for something a time-series model can run on directly, the credibility of both the analysis and the forecast drops sharply.
If only one or two dates are missing, it is easy to fix them directly in Excel, or by inserting a few rows in the database, as shown below:
But this kind of incomplete data usually has hundreds or even thousands of missing dates, and our goal is to fill in every one of them, so let's demonstrate how to handle such data in R.
1. Import the required packages (install them first if they are not yet on your machine).
2. Use the read.xlsx function from the <openxlsx> package to read an external xlsx file into R. Name the imported dataset myData and view its contents.
3. The Date column looks odd: it is not in year-month-day format. Checking the column's datatype shows that it is currently numeric.
That is because Excel's date origin is 1899-12-30. To understand R's behaviour, first convert the first record's date, 2013-01-07, to the number 15712, which counts days from the system origin 1970-01-01. Likewise, 1899-12-30 converts to -25569. Subtracting the two gives 41281, which is exactly the first value in myData's Date column.
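The arithmetic above can be checked directly in the R console; R's Date class counts days from 1970-01-01, while Excel serial dates count from 1899-12-30:

```r
# Day offset of the first record relative to Excel's origin
excel_serial <- as.numeric(as.Date("2013-01-07")) - as.numeric(as.Date("1899-12-30"))
excel_serial   # 15712 - (-25569) = 41281
# Going the other way, the Excel serial number turns back into the date
as.Date(41281, origin = "1899-12-30")   # "2013-01-07"
```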
4. Next, use the as.Date method from the <zoo> package to convert the Date column to the Date type. Then use as.POSIXlt, from the built-in <base> package, to convert it again into a date type that is more convenient for calculation.
5. Next, re-sort the transaction records by transaction date.
Note: when transactions are pulled with SQL, the resulting table is usually already sorted, but in some cases the stored data may be scattered. If you are not sure whether the data has been sorted, it is best to sort it by the rule you need.
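The sort in step 5 is a one-liner with order(); here is a minimal sketch using a toy three-row frame standing in for myData:

```r
# Toy transaction table, deliberately out of order
myData <- data.frame(Date  = as.Date(c("2013-05-14", "2013-05-12", "2013-05-15")),
                     Sales = c(3, 1, 4))
inc.order <- order(myData$Date, decreasing = FALSE)  # row positions in date order
myData <- myData[inc.order, ]
format(myData$Date[1])   # "2013-05-12", now the earliest transaction
```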
6. Build an empty data.frame (i.e. a table with no content yet) to serve as the container for our target, the complete dataset.
The first line builds an empty data.frame modelled on myData, with the same columns from myData's first to its last; the resulting final.data has the same number of rows and columns as myData. Because the goal is a table with empty content, the second line sets every cell of the new final.data to NA. The third line converts the second column of final.data, the date column, to the POSIXlt type for the calculations that follow. Finally, display the current final.data table: it is an empty container that we will fill in step by step.
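Step 6 can be sketched with a small stand-in for myData; the three lines mirror the ones described above:

```r
# Toy stand-in for myData
myData <- data.frame(Date = as.Date("2013-01-07") + 0:2, Sales = c(5, 2, 7))
final.data <- data.frame(myData[, 1:ncol(myData)])  # same columns as myData
final.data[, ] <- NA                                # blank out every cell
final.data$Date <- as.POSIXlt(final.data$Date)      # POSIXlt for later arithmetic
```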
7. The next step is to extract the first transaction date in the dataset, in order to compute which day of the year it falls on, so that we can cut the data accordingly. e.g. if the first transaction date is 2013-02-19, then the first deal happened on day 50 of 2013, meaning 49 days had already passed.
Then we display the year of the first transaction and the first day of that year (January 1), along with the number of days between the two.
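The day-gap computation in step 7, sketched with the example date 2013-02-19 from the text:

```r
first.date <- as.Date("2013-02-19")
# Build January 1 of the same year from the first four characters of the date
origin <- as.Date(paste(substr(first.date, 1, 4), "-01-01", sep = ""))
diff <- first.date - origin
diff   # Time difference of 49 days: 2013-02-19 is day 50 of the year
```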
8. The next step is to record every year the dataset spans. For instance, if the transactions run from 2010 to 2015, record the six years (2010, 2011, 2012, 2013, 2014, 2015).
As for why 1900 is added at the end: it is needed to display the correct year value. Once a date is converted to the POSIXlt type, we gain access to fields such as $year, $mon, $mday, $hour, $min and $sec. Take 2013-06-15 08:19:27 as an example:
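The POSIXlt fields for that example date look like this; note that $year counts from 1900 and $mon is zero-based (0 = January):

```r
d <- as.POSIXlt("2013-06-15 08:19:27")
d$year + 1900   # 2013 ($year is years since 1900, hence the +1900 above)
d$mon + 1       # 6    ($mon runs 0-11)
d$mday          # 15
d$hour          # 8
d$min           # 19
d$sec           # 27
```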
9. Calculate the total number of days across all years in the transaction period. e.g. if the first transaction is on 2013-02-19 and the last on 2015-12-03, the years spanned are (2013, 2014, 2015) and the total is 365 (2013) + 365 (2014) + 365 (2015) = 1095 days.
Check the total number of days spanned:
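Step 9 relies on yearDays() from the <Hmisc> package loaded at the start, which returns the number of days in the year of a given date (366 in leap years):

```r
library(Hmisc)
year.list <- as.Date(c("2013-01-01", "2014-01-01", "2015-01-01"))
daySum <- sum(mapply(yearDays, year.list))  # days per year, summed
daySum   # 365 + 365 + 365 = 1095
```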
10. Subtract the days that passed before the first transaction (the diff obtained earlier) from the total number of days.
View the total after subtracting diff:
11. Fill in the date column of final.data, starting from the first transaction date in myData and continuing day by day up to the last day of 2015.
View the date field of final.data (the total number of records is now 1089, exactly the value of daySum):
Check whether the last record of final.data is 2015-12-31:
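The date sequence of step 11 can be checked on its own, assuming the data starts on 2013-01-07 and daySum is 1089 (the 1095 days of 2013-2015 minus the 6 days before the first transaction):

```r
dates <- seq(as.Date("2013-01-07"), by = "1 day", length.out = 1089)
tail(dates, 1)   # "2015-12-31", so the sequence ends exactly on the last day
```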
12. Next, each transaction record in myData has to be matched to the correct date in final.data's date column. Before that step, however, we may notice a problem: some of myData's column names look odd.
The first column of myData is named NA, while the first column of final.data is named "NA.", which could cause trouble when matching the two tables. Check whether the column names and their order are identical in both tables:
The result, FALSE, means the column names or their order differ, so we must make the column names and their order consistent across the two tables.
Checking again shows that myData's column names and order now match final.data's.
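The name check and fix in step 12, sketched with two hypothetical frames whose first columns disagree:

```r
myData     <- data.frame(a = 1,  Date = as.Date("2013-01-07"))
final.data <- data.frame(b = NA, Date = as.Date("2013-01-07"))
identical(colnames(myData), colnames(final.data))   # FALSE: "a" vs "b"
colnames(myData) <- names(final.data)               # force both tables to agree
identical(colnames(myData), colnames(final.data))   # TRUE
```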
13. Now do the actual matching between the two tables. The idea is that every transaction date in myData is unique, so each one can be found in final.data. The method is therefore: record the row position in final.data of each transaction date in myData, then copy each transaction row of myData into final.data at the recorded position.
Check whether myData's records have been copied into final.data:
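The match-and-copy step, sketched with toy data: `full` has one row per day, `txn` only the days with sales.

```r
full <- data.frame(Date = as.Date("2013-05-12") + 0:3, Sales = NA)
txn  <- data.frame(Date  = as.Date(c("2013-05-12", "2013-05-14")),
                   Sales = c(10, 20))
idx <- match(txn$Date, full$Date)   # row position of each transaction date
full[idx, ] <- txn                  # copy the transaction rows into place
full$Sales                          # 10 NA 20 NA: 5/13 and 5/15 stay empty
```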
We can also see that some rows of final.data are empty except for the date column, whose values now connect seamlessly with the transaction dates above and below them; those empty rows are exactly the missing transactions we want to fill in.
The result we want is gradually taking shape, but if we scroll final.data to the bottom we find quite a few extra empty transaction rows: the last transaction of myData lands at row 1071 of final.data, so every row after that is redundant.
14. This step deletes the empty rows from row 1072 onward. Since we already know they are redundant, final.data = head(final.data, 1071) would do the job. In many cases, however, we cannot tell by eye where the redundant rows begin, so we need an automated way to let R find the last non-redundant row.
The purpose of this loop is to find the position of the last non-redundant row. Starting from the bottom of final.data: if a row is redundant, its first column must be NA, so move on to the row above it, and keep going until we reach a row whose first column is not NA; record that row number as K.
Then replace final.data with its first K rows.
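The backward search can also be written without a loop, as in the refined code at the top of this page: take the largest row index whose value column is not NA, then keep only the rows up to it. A toy sketch:

```r
final.data <- data.frame(Date  = as.Date("2013-01-01") + 0:5,
                         Sales = c(1, NA, 3, NA, NA, NA))
tag <- max(which(!is.na(final.data$Sales)))   # 3, the last real record
final.data <- head(final.data, tag)           # drop the trailing empty rows
nrow(final.data)   # 3
```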
Check that final.data is now correct: the table has 1071 rows, meaning the 18 empty records from 12/14 to 12/31 have been removed.
15. We are almost done; the only thing left is how to handle the other column values on the empty transaction days. Look at myData: which columns keep a fixed value that never needs to change, even on an empty transaction day? Clearly columns 3 through 10: these 8 columns must take the same values no matter what. Next, which columns do change, yet are constant for empty transaction days? In myData, columns 11 through 14 vary, but on a day with no transactions their values are fixed: e.g. if there were no deals that day, total sales are 0, quantity sold is 0, cost is 0, and profit is 0. (Reality may differ; this is a simplified scenario.)
The loop above starts from the first row of final.data and, whenever a row is empty, fills in the values we specified: columns 3 through 10 are copied from myData's columns 3 through 10, and columns 11 through 14 are given our chosen value, 0. If your case calls for different values, simply replace the 0 with whatever is needed.
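The fill of step 15 can also be done in two vectorized assignments, as in the refined code at the top. A toy sketch, where `Store` stands in for the static columns 3-10 and `Sales` for the numeric columns 11-14:

```r
myData     <- data.frame(Date = as.Date("2013-01-01"), Store = "A", Sales = 9)
final.data <- data.frame(Date  = as.Date("2013-01-01") + 0:2,
                         Store = c("A", NA, NA),
                         Sales = c(9, NA, NA))
# Static column: copy from the first transaction record
final.data[is.na(final.data$Store), "Store"] <- myData[1, "Store"]
# Numeric column: an empty day means zero sales
final.data[is.na(final.data$Sales), "Sales"] <- 0
final.data$Sales   # 9 0 0
```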
View the final result of final.data:
Everything that needed filling has been filled, but the first column is still NA; what should we do about it?
In this case we do not need the first column at all, so simply leave it as NA.
16. Finally, after deleting the variables we no longer need, export final.data as an xlsx (Excel) file.
Opening myExData_full.xlsx to inspect the result, we find that the date column looks different from the original one in myExData.xlsx: it now shows hour-minute-second values.
17. Since that is not what we want, go back to R and fix it. Convert the datatype of final.data's date column so that the exported file shows the format we expect.
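The fix in step 17 is to convert the POSIXlt column back to a plain Date, which carries no time-of-day, so Excel has no hours, minutes or seconds to display:

```r
d <- as.POSIXlt("2013-01-07 00:00:00")
as.Date(d)   # "2013-01-07": a plain Date with no time component
```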
Export final.data to xlsx one more time, open myExData_full.xlsx in Excel, and check whether the date column is now in the format we need.
Bingo! The output is exactly the format we wanted.
The 17 steps above are one way to handle missing transaction dates. Some steps could surely be made more concise or more efficient, but with the readability of future maintenance in mind, this is how I handle this type of data for now.
I have already written this method into a standardized package and uploaded it to CRAN. Its source code is not exactly the same as the 17 steps in this article, since standardizing the workflow required some changes, but the concept is identical.
The package works for most data of this type, but some dataset formats are not yet supported; I am fixing that now. (The fix itself is quick, but CRAN's submission policy does not allow a new version to be uploaded again so soon, so even though the logical bugs are fixed, the update cannot reach CRAN immediately. It will be updated as soon as possible.)
If you are wondering which dataset formats do not work with the package, feel free to leave a comment or email me and I will explain. Thanks.
Finally, the package is named <tigerhitteR>, currently at version 1.0.2.
The URL in its documentation points to my GitHub, where the package's source code is stored.
My email: aiien61will@gmail.com
Feel free to contact me with any questions or suggestions. Thanks!