Series Contents
Experiment 1: R Data Structures, Data Import, and Data Processing
Experiment 2: Basic Data Processing
Experimental Data
1. Fields of the item_feature1 dataset
Field | Description |
date | Date |
item_id | Item ID |
cate_id | Category ID |
cate_level_id | Category level ID |
brand_id | Brand ID |
supplier_id | Supplier ID |
pv_ipv | Number of page views |
cart_uv | Number of users who added the item to the cart |
collect_uv | Number of users who added the item to favorites |
cart_ipv | Number of times the item was added to the cart |
1. Experimental Purpose
- Variable creation, variable re-coding, missing values, date value processing, data type conversion, data sorting.
- Merge datasets, select subsets, use SQL to manipulate data frames, and integrate and reconstruct data.
- Control flow: conditions and loops.
- User-defined functions.
2. Experimental Contents
Task 1
- Read item_feature1.csv and store it in df; name the columns of df date, item_id, cate_id, cate_level_id, brand_id, supplier_id, pv_ipv, cart_uv, collect_uv and cart_ipv.
- Recode cart_uv in df into a new variable named recode: values below 5000 become "less", values from 5000 up to but not including 15000 become "common", and all other values become "many". View the last 10 rows.
- Check df for missing values; if there are any, delete every row of df that contains a missing value.
- Convert the date field of df to the Date type, for example 2015-02-13.
- Sort df in ascending order by the date field and save the result as df_asc; view the first 10 rows.
- Sort df by date ascending and item_id descending and save the result as df1; view the first 5 rows.
Task 2
- Select the date, item_id, cate_id, cart_uv, recode, collect_uv and cart_ipv fields from df and save them as df1; drop the cart_ipv field from df1 and save the result as df2; select the rows of df1 whose item_id is greater than 500 and whose recode is "less" and save them as df3.
- Select the rows of df whose date is 2015-02-14 and whose item_id is 300, keeping all columns from date through supplier_id, and save them as df_sub.
- Randomly draw 500 rows from df without replacement and save them as df4. View the dimensions of the sample and its first rows.
- Select the columns from item_id through cate_id in df1 and save them as df1_temp, then merge df1_temp with df by item_id and save the result as df5.
- Using SQL, select the rows of df1 whose item_id is 300 and save them as df6.
- Randomly draw, with replacement, as many rows from df2 as there are in df6 and save them as df_tem; then bind df_tem and df6 by column (horizontally) and save the result as df7.
- Select date, item_id, cate_id and cart_ipv from df and save them as feature; sort feature in ascending order by date and extract the unique cate_id values in feature (deduplication is sufficient).
3. Implementation process and experimental results
Task 1
1. Read item_feature1.csv and store it in df; name the columns of df date, item_id, cate_id, cate_level_id, brand_id, supplier_id, pv_ipv, cart_uv, collect_uv and cart_ipv.
# Read the data and store it in df
# Note: the names printed below are data values, so read.csv() appears to have used the first record as the header; header = FALSE would keep that record as data
df <- read.csv("R\\data\\ex2\\item_feature1.csv")
# View the original variable names
names(df)
## [1] "X20150628" "X300" "X36" "X4" "X657" "X294"
## [7] "X33" "X19" "X1" "X1.1"
# Rename the columns
names(df)[1:10] <- c("date", "item_id", "cate_id", "cate_level_id", "brand_id", "supplier_id", "pv_ipv", "cart_uv", "collect_uv", "cart_ipv")
# View the modified variable names
names(df)
## [1] "date" "item_id" "cate_id" "cate_level_id"
## [5] "brand_id" "supplier_id" "pv_ipv" "cart_uv"
## [9] "collect_uv" "cart_ipv"
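For a CSV file without a header row, the column names can be supplied at read time so that no record is consumed as a header. The following is a minimal sketch under that assumption; the temporary file and its second row are made up for illustration, and `toy` and `cols` are hypothetical names.

```r
# Hypothetical two-row file without a header line (illustration only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("20150628,300,36,4,657,294,33,19,1,1",
             "20150629,300,36,4,657,294,40,21,2,3"), tmp)

cols <- c("date", "item_id", "cate_id", "cate_level_id", "brand_id",
          "supplier_id", "pv_ipv", "cart_uv", "collect_uv", "cart_ipv")

# header = FALSE keeps the first record as data; col.names assigns the names on read
toy <- read.csv(tmp, header = FALSE, col.names = cols)
nrow(toy)    # both records are kept
names(toy)
unlink(tmp)  # remove the temporary file
```

With this approach the separate renaming step is not needed, and the row count reflects every record in the file.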
2. Recode cart_uv in df into a new variable named recode: values below 5000 become "less", values from 5000 up to but not including 15000 become "common", and all other values become "many". View the last 10 rows.
# Recode cart_uv and name the new variable recode
df$recode[df$cart_uv < 5000] <- "less"
df$recode[df$cart_uv >= 5000 & df$cart_uv < 15000] <- "common"
df$recode[df$cart_uv >= 15000] <- "many"
# Last 10 rows
tail(df, 10)
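The same three-way recoding can also be written with cut(), which bins a numeric vector in a single call. This is an alternative sketch, not the method used above; the toy values are chosen to sit on the class boundaries and are not taken from the dataset.

```r
# Toy values on and around the boundaries (not from the dataset)
cart_uv <- c(0, 4999, 5000, 14999, 15000, NA)

# right = FALSE makes each interval closed on the left:
# [-Inf, 5000), [5000, 15000), [15000, Inf)
recode <- cut(cart_uv,
              breaks = c(-Inf, 5000, 15000, Inf),
              labels = c("less", "common", "many"),
              right = FALSE)

# cut() returns a factor; convert if a character column is preferred
as.character(recode)
```

Missing input values stay NA in the result, which matches the behaviour of the three assignments above.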
3. Check df for missing values; if there are any, delete every row of df that contains a missing value.
# Number of rows with missing values
sum(rowSums(is.na(df)) > 0)
## [1] 2
# Delete rows with missing values
df <- na.omit(df)
# Number of rows with missing values after deletion
sum(rowSums(is.na(df)) > 0)
## [1] 0
# Number of rows in df
nrow(df)
## [1] 230352
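Before deleting rows it can help to see which columns hold the missing values. The sketch below uses a small made-up data frame (`toy` is a hypothetical name) to show a per-column count and a row filter with complete.cases(), which keeps the same rows as na.omit().

```r
# Made-up data with one NA in each of two columns
toy <- data.frame(item_id    = c(300, 301, 302, 303),
                  cart_uv    = c(19, NA, 7, 12),
                  collect_uv = c(1, 4, NA, 2))

# Missing values per column
na_per_col <- colSums(is.na(toy))
na_per_col

# TRUE for rows that contain no NA
keep <- complete.cases(toy)
toy_clean <- toy[keep, ]
nrow(toy_clean)
```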
4. Convert the date field of df to the Date type, for example 2015-02-13.
# Data type of date
class(df$date)
## [1] "integer"
# Convert to character
df$date <- as.character(df$date)
# Data type of date
class(df$date)
## [1] "character"
# Convert to Date
df$date <- as.Date(df$date, "%Y%m%d")
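The two conversion steps can be checked on a couple of values before applying them to the whole column. A minimal sketch, assuming the dates are stored as integers in yyyymmdd form as the class() output above indicates; `raw`, `d` and `bad` are hypothetical names.

```r
# Integer dates in yyyymmdd form (sample values for illustration)
raw <- c(20150213L, 20150628L)

# Integer -> character -> Date, in one expression
d <- as.Date(as.character(raw), format = "%Y%m%d")
class(d)
format(d, "%Y-%m-%d")

# A string that is not a valid date under the format becomes NA rather than an error
bad <- as.Date("20151345", format = "%Y%m%d")
is.na(bad)
```

Counting NA values in the date column after conversion is therefore a quick way to detect malformed entries.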
5. Sort df in ascending order by the date field and save the result as df_asc; view the first 10 rows.
# Sort ascending by the date field
df_asc <- df[order(df$date), ]
# First 10 rows
head(df_asc, 10)
6. Sort df by date ascending and item_id descending and save the result as df1; view the first 5 rows.
# Sort by date ascending and item_id descending
df1 <- df[order(df$date, -df$item_id), ]
# First 5 rows
head(df1, 5)
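Negating a key with a minus sign, as in the sort above, only works for numeric columns. As a hedged alternative, order() with method = "radix" accepts one decreasing flag per key, which also covers character keys. The toy data below is made up for illustration.

```r
toy <- data.frame(date    = as.Date(c("2015-02-14", "2015-02-13", "2015-02-13")),
                  item_id = c(300, 100, 500))

# Approach used above: negate the numeric key
by_neg <- toy[order(toy$date, -toy$item_id), ]

# Alternative: one decreasing flag per sort key (requires method = "radix")
by_radix <- toy[order(toy$date, toy$item_id,
                      decreasing = c(FALSE, TRUE), method = "radix"), ]

by_radix$item_id
```

Both approaches return the rows in the same order for this data: the two rows dated 2015-02-13 come first, with the larger item_id ahead of the smaller one.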
Task 2
1. Select the date, item_id, cate_id, cart_uv, recode, collect_uv and cart_ipv fields from df and save them as df1; drop the cart_ipv field from df1 and save the result as df2; select the rows of df1 whose item_id is greater than 500 and whose recode is "less" and save them as df3.
# df1: keep the listed columns
df1 <- df[c("date", "item_id", "cate_id", "cart_uv", "recode", "collect_uv", "cart_ipv")]
# df2: drop cart_ipv
df2 <- df1[!names(df1) %in% c("cart_ipv")]
# df3: rows with item_id > 500 and recode equal to "less"
df3 <- df1[df1$item_id > 500 & df1$recode == "less", ]
2. Select the rows of df whose date is 2015-02-14 and whose item_id is 300, keeping all columns from date through supplier_id, and save them as df_sub.
df_sub <- subset(df, date == "2015-02-14" & item_id == 300, select = date:supplier_id)
df_sub
##           date item_id cate_id cate_level_id brand_id supplier_id
## 432 2015-02-14     300      36             4      657         294
3. Randomly draw 500 rows from df without replacement and save them as df4. View the dimensions of the sample and its first rows.
# df4: 500 rows sampled without replacement
df4 <- df[sample(1:nrow(df), 500, replace = FALSE), ]
# Sample dimensions
dim(df4)
## [1] 500 11
# First rows
head(df4)
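Because sample() draws different rows on every run, the sample above cannot be reproduced exactly. A minimal sketch of fixing the random seed first; the seed value 42 and the toy data frame are arbitrary choices for illustration.

```r
toy <- data.frame(id = 1:1000)

set.seed(42)   # arbitrary seed; any fixed value makes the draw repeatable
idx1 <- sample(nrow(toy), 500, replace = FALSE)

set.seed(42)   # resetting the same seed reproduces the same draw
idx2 <- sample(nrow(toy), 500, replace = FALSE)

identical(idx1, idx2)   # TRUE: same seed, same sample
anyDuplicated(idx1)     # 0: without replacement, no row is drawn twice

toy_sample <- toy[idx1, , drop = FALSE]
dim(toy_sample)
```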
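4. Select the columns from item_id through cate_id in df1 and save them as df1_temp, then merge df1_temp with df by item_id and save the result as df5.
# df1_temp: columns item_id through cate_id
df1_temp <- subset(df1, select = item_id:cate_id)
# df5: merge on item_id; cate_id exists in both inputs, so the result holds cate_id.x and cate_id.y
df5 <- merge(df1_temp, df, by = "item_id")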
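When the key is repeated on both sides, merge() returns every pairing of matching rows, so the result can be much larger than either input. The sketch below shows the effect on made-up data and one possible way to limit it by deduplicating the lookup side first; `left`, `right`, `m` and `m2` are hypothetical names.

```r
left  <- data.frame(item_id = c(300, 300, 301), cate_id = c(36, 36, 35))
right <- data.frame(item_id = c(300, 300, 300, 301), cart_uv = c(19, 21, 7, 3))

# item_id 300: 2 rows x 3 rows = 6 pairings; item_id 301: 1 x 1 = 1
m <- merge(left, right, by = "item_id")
nrow(m)

# Deduplicating the lookup table first gives one match per row of 'right'
m2 <- merge(unique(left), right, by = "item_id")
nrow(m2)
```

Whether deduplication is appropriate depends on what the merged table is meant to represent, so this is an option to consider rather than a required step.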
5. Using SQL, select the rows of df1 whose item_id is 300 and save them as df6.
# install.packages("sqldf")
library(sqldf)
# Run the SQL query directly against the data frame df1
df6 <- sqldf("select * from df1 where item_id=300")
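For comparison, the same filter can be written in base R without the sqldf package. This sketch uses a small made-up data frame rather than df1, and shows two equivalent forms.

```r
toy <- data.frame(item_id = c(300, 301, 300), cart_uv = c(19, 5, 7))

# Base R counterparts of: select * from toy where item_id = 300
hit_index  <- toy[toy$item_id == 300, ]
hit_subset <- subset(toy, item_id == 300)

nrow(hit_index)
```

One difference worth noting: base R subsetting keeps the original row names, while a data frame returned by a query is renumbered from 1.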
6. Randomly draw, with replacement, as many rows from df2 as there are in df6 and save them as df_tem; then bind df_tem and df6 by column (horizontally) and save the result as df7.
# df_tem: sample with replacement, one row per row of df6
df_tem <- df2[sample(1:nrow(df2), nrow(df6), replace = TRUE), ]
# Dimensions of df_tem
dim(df_tem)
## [1] 436 6
# Bind df_tem and df6 horizontally
df7 <- cbind(df_tem, df6)
# Dimensions of df7
dim(df7)
## [1] 436 13
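cbind() on two data frames keeps column names as they are, so names shared by both inputs appear twice in the result and `$` then reaches only the first of them. A minimal sketch on made-up data showing how to detect this and one way to avoid it; the prefixes "s_" and "q_" are arbitrary.

```r
a <- data.frame(item_id = c(1, 2),     cart_uv  = c(10, 20))
b <- data.frame(item_id = c(300, 300), cart_ipv = c(5, 6))

ab <- cbind(a, b)
names(ab)                  # "item_id" appears twice
anyDuplicated(names(ab))   # index of the first repeated name, 0 if there is none

# One option: prefix the names of each source before binding
names(a) <- paste0("s_", names(a))
names(b) <- paste0("q_", names(b))
ab2 <- cbind(a, b)
names(ab2)
```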
7. Select date, item_id, cate_id and cart_ipv from df and save them as feature; sort feature in ascending order by date and extract the unique cate_id values in feature.
# feature: keep the four listed columns
feature <- df[c("date", "item_id", "cate_id", "cart_ipv")]
# Sort feature itself in ascending order by date (sorting df here would discard the column selection)
feature <- feature[order(feature$date), ]
# Extract the unique cate_id values in feature
unique(feature$cate_id)
## [1] 36 35 9 17 7 21 18 13 32 39 30 37 4 33 22 26 19 14 16 20 11 25 24 23 10
## [26] 5 15 28
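unique() returns values in order of first appearance, which is why the result above depends on the date sort. A small sketch on made-up data showing that behaviour, together with a sorted variant and a count of distinct values.

```r
toy <- data.frame(date    = as.Date(c("2015-02-14", "2015-02-13", "2015-02-15")),
                  cate_id = c(36, 35, 36))

toy <- toy[order(toy$date), ]

u <- unique(toy$cate_id)   # order of first appearance after sorting by date
u
sort(u)                    # numeric order, independent of the row order
length(u)                  # number of distinct categories
```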