Experiment 2: Basic Data Processing

Catalogue of Series Articles

Experiment 1: R Language Data Structures, Data Import and Data Processing
Experiment 2: Basic Data Processing

Experimental Data

1. Fields of the item_feature1 dataset

  date: date
  item_id: commodity ID
  cate_id: category ID
  cate_level_id: category level ID
  brand_id: brand ID
  supplier_id: supplier ID
  pv_ipv: number of views
  cart_uv: number of add-to-cart users
  collect_uv: number of favorites
  cart_ipv: number of add-to-cart events


1. Experimental Purpose

  1. Variable creation, variable re-coding, missing values, date value processing, data type conversion, data sorting.
  2. Merge datasets, select subsets, use SQL to manipulate data frames, and integrate and reconstruct data.
  3. Control flow: conditions and loops.
  4. User-defined functions.
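
Control flow and user-defined functions (items 3 and 4) are not exercised by the tasks below, so here is a minimal sketch combining both on toy data (the function name count_below is invented for illustration):

```r
# A user-defined function that combines a condition and a loop:
# count how many values in a numeric vector fall below a threshold.
count_below <- function(x, threshold) {
  n <- 0
  for (v in x) {          # loop over every element
    if (v < threshold) {  # condition
      n <- n + 1
    }
  }
  n
}

count_below(c(3, 8, 1, 9, 4), 5)
## [1] 3
```

In practice the loop collapses to sum(x < threshold); the explicit form is shown only to illustrate the control-flow constructs.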

2. Experimental Contents

Problem 1

  1. Read item_feature1.csv into df; name the columns of df date, item_id, cate_id, cate_level_id, brand_id, supplier_id, pv_ipv, cart_uv, collect_uv and cart_ipv.
  2. Recode cart_uv in df into a new variable recode: values less than 5000 are classified as "less", values from 5000 up to 15000 as "common", and the rest as "many". View the last 10 rows.
  3. Check df for missing values; if there are any, delete all rows in df that contain missing values.
  4. Convert the date field in df to a date type, e.g. 2015-02-13.
  5. Sort df in ascending order by the date field and save as df_asc; view the first 10 rows.
  6. Sort df in ascending order by the date field and in descending order by item_id, and save as df1; view the first 5 rows.

Problem 2

  1. Select the date, item_id, cate_id, cart_uv, recode, collect_uv and cart_ipv fields from df and save as df1; exclude the cart_ipv field from df1 and save as df2; select the rows of df1 with item_id greater than 500 and recode equal to "less" and save as df3.
  2. Select the rows of df where date is 2015-02-14 and item_id is 300, keeping all columns from date through supplier_id, and save as df_sub.
  3. Randomly draw 500 samples from df without replacement and save as df4. View the dimensions of the sample and its first rows.
  4. Select the columns from item_id through cate_id from df1 and save as df1_temp, then merge with df on item_id and save as df5.
  5. Using SQL, select the rows of df1 with item_id equal to 300 and save as df6.
  6. Randomly draw, with replacement, as many rows from df2 as df6 has, and save as df_tem; then merge with df6 by column (horizontally) and save as df7.
  7. Select the date, item_id, cate_id and cart_ipv fields from df and save as feature; sort feature in ascending order by date; then extract the unique cate_id values in feature (de-duplication is sufficient).

3. Implementation Process and Experimental Results

Problem 1

1. Read item_feature1.csv into df; name the columns of df date, item_id, cate_id, cate_level_id, brand_id, supplier_id, pv_ipv, cart_uv, collect_uv and cart_ipv.

# Read data stored in df
df <-
  read.csv(
    "R\\data\\ex2\\item_feature1.csv"
  )
# View the original variable name
names(df)
##  [1] "X20150628" "X300"      "X36"       "X4"        "X657"      "X294"     
##  [7] "X33"       "X19"       "X1"        "X1.1"
# rename
names(df)[1:10] <- c(
  "date",
  "item_id",
  "cate_id",
  "cate_level_id",
  "brand_id",
  "supplier_id",
  "pv_ipv",
  "cart_uv",
  "collect_uv",
  "cart_ipv"
)
# View modified variable names
names(df)
##  [1] "date"          "item_id"       "cate_id"       "cate_level_id"
##  [5] "brand_id"      "supplier_id"   "pv_ipv"        "cart_uv"      
##  [9] "collect_uv"    "cart_ipv"

2. Recode cart_uv in df into a new variable recode: values less than 5000 are classified as "less", values from 5000 up to 15000 as "common", and the rest as "many". View the last 10 rows.

# Recode cart_uv and name the new variable recode
df$recode[df$cart_uv < 5000] <- "less"
df$recode[df$cart_uv >= 5000 & df$cart_uv < 15000] <- "common"
df$recode[df$cart_uv >= 15000] <- "many"
# Tail 10 Data
tail(df, 10)
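
The three assignments above can be collapsed into a single cut() call; a sketch on toy values (not the real cart_uv column), with right = FALSE so the intervals are [-Inf, 5000), [5000, 15000) and [15000, Inf), matching the rules used above:

```r
# cut() bins a numeric vector into labelled intervals in one step.
cart_uv <- c(1200, 4999, 5000, 14999, 15000, 80000)
recode <- cut(cart_uv,
              breaks = c(-Inf, 5000, 15000, Inf),
              labels = c("less", "common", "many"),
              right  = FALSE)   # left-closed: 5000 falls in "common"
as.character(recode)
## [1] "less"   "less"   "common" "common" "many"   "many"
```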

3. Check df for missing values; if there are any, delete all rows in df that contain missing values.

# Number of rows with missing values
sum(rowSums(is.na(df)) > 0)
## [1] 2
# Delete rows with missing values
df <- na.omit(df)
# Number of rows with missing values
sum(rowSums(is.na(df)) > 0)
## [1] 0
# Number of rows in df sample
nrow(df)
## [1] 230352
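
na.omit() used above is equivalent to keeping only complete rows; a sketch of the same filter written with complete.cases() on toy data:

```r
# complete.cases() returns TRUE for rows with no NA;
# subsetting on it reproduces na.omit().
d <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
d_clean <- d[complete.cases(d), ]
nrow(d_clean)
## [1] 1
```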

4. Convert the date field in df to a date type, e.g. 2015-02-13.

# date data type
class(df$date)
## [1] "integer"
# Convert to Character
df$date <- as.character(df$date)
# date data type
class(df$date)
## [1] "character"
# Convert to Date Type
df$date <- as.Date(df$date, "%Y%m%d")
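
The "%Y%m%d" format string tells as.Date() how to parse compact stamps such as 20150213; a quick check on a literal:

```r
# "%Y%m%d" = 4-digit year, 2-digit month, 2-digit day, no separators.
d <- as.Date("20150213", "%Y%m%d")
format(d)
## [1] "2015-02-13"
```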

5. Sort df in ascending order by the date field and save as df_asc; view the first 10 rows.

# Sort ascending by date field
df_asc <- df[order(df$date), ]
# Top 10 Data
head(df_asc, 10)

6. Sort df in ascending order by the date field and in descending order by item_id, and save as df1; view the first 5 rows.

# Ascending by date field and item_id descending sort
df1 <- df[order(df$date, -df$item_id), ]
# Top 5 Data
head(df1, 5)
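
Negating a numeric key inside order() reverses its direction, which is how the ascending/descending mix above works; a toy sketch:

```r
# Ascending by day, descending by id: order() sorts -id ascending,
# which is id descending.
d <- data.frame(day = c(1, 1, 2), id = c(10, 20, 5))
d_sorted <- d[order(d$day, -d$id), ]
d_sorted$id
## [1] 20 10  5
```

Note that the minus-sign trick applies only to numeric keys; for a character key one can negate xtfrm() of the column instead.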

Problem 2

1. Select the date, item_id, cate_id, cart_uv, recode, collect_uv and cart_ipv fields from df and save as df1; exclude the cart_ipv field from df1 and save as df2; select the rows of df1 with item_id greater than 500 and recode equal to "less" and save as df3.

# df1
df1 <-
  df[c("date",
       "item_id",
       "cate_id",
       "cart_uv",
       "recode",
       "collect_uv",
       "cart_ipv")]
# df2
df2 <- df1[!names(df1) %in% c("cart_ipv")]
# df3
df3 <- df1[df1$item_id > 500 & df1$recode == "less", ]
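
Dropping a single column, as done for df2 above, can also be written with subset() and a negative select; a sketch on toy data:

```r
# subset(select = -col) removes a column by name.
d  <- data.frame(item_id = 1:3, cart_ipv = c(5, 6, 7), recode = "less")
d2 <- subset(d, select = -cart_ipv)
names(d2)
## [1] "item_id" "recode"
```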

2. Select the rows of df where date is 2015-02-14 and item_id is 300, keeping all columns from date through supplier_id, and save as df_sub.

df_sub <-
  subset(df, date == "2015-02-14" &
           item_id == 300, select = date:supplier_id)
df_sub
##           date item_id cate_id cate_level_id brand_id supplier_id
## 432 2015-02-14     300      36             4      657         294

3. Randomly draw 500 samples from df without replacement and save as df4. View the dimensions of the sample and its first rows.

# df4
df4 <- df[sample(1:nrow(df), 500, replace = FALSE), ]
# Sample dimension
dim(df4)
## [1] 500  11
# Header data
head(df4)
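
sample() draws different rows on every run, so df4 will change between runs; if reproducibility matters, fix the random seed first (a convention, not something the task requires):

```r
# A fixed seed makes a "random" sample reproducible.
set.seed(42)                       # any fixed integer works
s1 <- sample(1:1000, 5, replace = FALSE)
set.seed(42)
s2 <- sample(1:1000, 5, replace = FALSE)
identical(s1, s2)
## [1] TRUE
```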

4. Select the columns from item_id through cate_id from df1 and save as df1_temp, then merge with df on item_id and save as df5.

# df1_temp
df1_temp <- subset(df1, select = item_id:cate_id)
# df5
df5 <- merge(df1_temp, df, by = "item_id")
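
Because df1_temp and df share the non-key column cate_id, merge() disambiguates the duplicates with .x/.y suffixes; a toy sketch of that behaviour:

```r
# Shared non-key columns come back suffixed with .x (left) and .y (right).
left  <- data.frame(item_id = 1:2, cate_id = c(36, 35))
right <- data.frame(item_id = 1:2, cate_id = c(36, 35), pv = c(10, 20))
m <- merge(left, right, by = "item_id")
names(m)
## [1] "item_id"   "cate_id.x" "cate_id.y" "pv"
```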

5. Using SQL, select the rows of df1 with item_id equal to 300 and save as df6.

# install.packages("sqldf")
library(sqldf)
df6 <- sqldf("select * from df1 where item_id=300")
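
For comparison, the same selection needs no SQL at all; a base-R equivalent on toy data:

```r
# Logical row indexing is the base-R counterpart of the WHERE clause.
d  <- data.frame(item_id = c(100, 300, 300, 500), pv = 1:4)
d6 <- d[d$item_id == 300, ]
nrow(d6)
## [1] 2
```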

6. Randomly draw, with replacement, as many rows from df2 as df6 has, and save as df_tem; then merge with df6 by column (horizontally) and save as df7.

# df_tem
df_tem <- df2[sample(1:nrow(df2), nrow(df6), replace = TRUE), ]
# Data Dimension
dim(df_tem)
## [1] 436   6
# df_tem, df6 horizontal merge
df7 <- cbind(df_tem, df6)
# Data Dimension
dim(df7)
## [1] 436  13
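
cbind() requires both pieces to have the same number of rows, which is exactly why df_tem is drawn with nrow(df6) rows; a toy sketch:

```r
# cbind() binds columns side by side; row counts must match.
a  <- data.frame(x = 1:3)
b  <- data.frame(y = c("p", "q", "r"))
ab <- cbind(a, b)
dim(ab)
## [1] 3 2
```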

7. Select the date, item_id, cate_id and cart_ipv fields from df and save as feature; sort feature in ascending order by date; then extract the unique cate_id values in feature.

# feature
feature <- df[c("date", "item_id", "cate_id", "cart_ipv")]
# Sort in ascending order by date
feature <- feature[order(feature$date), ]
# Unique cate_id values in feature
unique(feature$cate_id)
##  [1] 36 35  9 17  7 21 18 13 32 39 30 37  4 33 22 26 19 14 16 20 11 25 24 23 10
## [26]  5 15 28
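
unique() keeps the first occurrence of each value; the same de-duplication can be spelled out with duplicated(), which generalizes to whole rows of a data frame:

```r
# unique(x) is equivalent to x[!duplicated(x)].
x  <- c(36, 35, 36, 9, 35)
u1 <- unique(x)
u2 <- x[!duplicated(x)]
identical(u1, u2)
## [1] TRUE
```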

4. Summary

That's all for today's experiment.

Keywords: R Language

Added by bjoerndalen on Thu, 06 Jan 2022 01:58:04 +0200