R language
Video link: https://www.bilibili.com/video/BV19x411X7C6
Data analysis process
Data collection → data storage → data analysis → data mining → data visualization → decision making
1 Rstudio use
1.1 Introduction
- Tab completion: press Tab to autocomplete names
- Blue: functions
- Box icon: data frames
- Pink: built-in datasets
- Alt+Shift+K: show keyboard shortcuts
1.2 Foundation
- list.files() / dir(): view the files in the working directory
- No declaration is required before variable assignment
1.3 Migrating R packages
Rpack <- installed.packages()[, 1]
save(Rpack, file = "Rpack.Rdata")
# On the new machine, load Rpack.Rdata, then reinstall the packages one by one
for (i in Rpack) install.packages(i)
2 data structure
2.1 R object
- Vector, scalar
- matrix
- array
- list
- Data frame
- factor
- time series
2.2 vector
- A one-dimensional array used to store numeric, character, or logical data
- Create a vector with function c
Note: strings in R must be quoted, otherwise they are treated as object names
- seq generates an arithmetic sequence
- rep repeat number
All elements of a vector must be of the same type; mixed types are coerced
Vectorization is central to R as statistical software: it is efficient and avoids explicit loops
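A minimal sketch of the creation functions above (the variable names are arbitrary):

```r
x <- c(1, 2, 3)               # combine values into a vector
s <- seq(1, 10, by = 2)       # arithmetic sequence: 1 3 5 7 9
r <- rep(c(1, 2), times = 3)  # repetition: 1 2 1 2 1 2
m <- c(1, "a")                # mixed types are coerced: both elements become character
```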
Vector index
- Vector indices in R start at 1, not 0
- A negative index excludes that element from the result
# Index by position; positive and negative indices cannot be mixed
> x[c(4:18)]
 [1]  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
# A logical vector selects the elements where the value is TRUE;
# a shorter logical vector is recycled along the target
y[c(T,T,T,F,F,F)]
y[y > 5]            # elements of y greater than 5
y[y > 5 & y < 9]
# Test string membership
> z <- c("one","two","three")
> "one" %in% z
[1] TRUE
> v
[1] 1 2 3 4 5 6
> v[20] <- 4          # assigning past the end fills the gap with NA
> v
 [1]  1  2  3  4  5  6 NA NA NA NA NA NA NA NA NA NA NA NA NA  4
> append(x = v, values = 99, after = 4)
 [1]  1  2  3  4 99  5  6 NA NA NA NA NA NA NA NA NA NA NA NA NA  4
Vector operation
- %%: modulo (remainder)
- %/%: integer division
When two vectors have unequal lengths, the shorter one is recycled; the longer length should be a multiple of the shorter
Logical comparison: x > 5 returns TRUE at the positions where the element exceeds 5 and FALSE elsewhere
ceiling | Returns the smallest integer not less than X |
---|---|
floor | Returns the largest integer not greater than X |
trunc | Returns the integer part |
round | Rounds; the first argument is a vector, the second the number of decimal places |
signif | Like round, but the second argument is the number of significant digits |
Statistical function
sum | Sum |
---|---|
max/min | Returns the maximum or minimum value |
range | Returns the maximum and minimum values |
mean | mean value |
var | variance |
median | median |
prod | Product of all elements |
which | Returns the indices of TRUE elements |
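A quick sketch of the statistical functions in the table, on a made-up vector:

```r
v <- c(3, 7, 1, 9, 5)
sum(v)        # 25
range(v)      # 1 9 (both min and max)
mean(v)       # 5
prod(1:4)     # 24
which(v > 4)  # positions of the elements greater than 4: 2 4 5
```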
2.3 matrix
- The data format of each element is required to be the same
- Generally, it is arranged by column first. You can set byrow=T to arrange by row
matrix(): creates a matrix, reshaping a one-dimensional vector into rows and columns
> x <- 1:20
> v <- matrix(x, 4, 5)
> v
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20
# Rename matrix rows and columns
> cname <- c("C1","C2","C3","C4","C5")
> rname <- c("R1","R2","R3","R4")
> dimnames(v) <- list(rname, cname)
> v
   C1 C2 C3 C4 C5
R1  1  5  9 13 17
R2  2  6 10 14 18
R3  3  7 11 15 19
R4  4  8 12 16 20
dim
# Assign dimensions to a vector
> dim(x) <- c(4, 5)
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20
Index of matrix
m[2,] | Access the second row |
---|---|
m[1,2] | The element in row 1, column 2 |
m[1,c(2,3,4)] | Row 1, columns 2, 3, and 4 |
m[-1,2] | Drop row 1, take column 2 |
- If the matrix is assigned a value, it can also be indexed by row and column names
- Pay attention to whether the row or column is accessed. If the row or column is accessed separately, add a comma before or after it
Matrix operation
- Same as matrix operations in linear algebra; the dimensions must conform
colSums | Sum of each column |
---|---|
rowSums | Sum of each row |
colMeans | Mean of each column |
diag | Returns the diagonal elements |
m*n | Element-wise product |
m %*% n | Matrix multiplication |
t(m) | Transpose of m |
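A small sketch contrasting the operations in the table, on a hypothetical 2×2 matrix:

```r
m <- matrix(1:4, 2, 2)  # columns (1,2) and (3,4)
m * m                   # element-wise product: each entry squared
m %*% diag(2)           # matrix multiplication; the identity leaves m unchanged
t(m)                    # transpose
colSums(m)              # 3 7
diag(m)                 # diagonal elements: 1 4
```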
2.4 array
- An array extends the vector/matrix idea to more than two dimensions
- Rarely used
> z <- array(1:24)
> dim1 <- c("A1","A2")
> dim2 <- c("B1","B2","B3")
> dim3 <- c("C1","C2","C3","C4")
> v <- array(z, c(2,3,4), dimnames = list(dim1, dim2, dim3))
> v
, , C1
   B1 B2 B3
A1  1  3  5
A2  2  4  6

, , C2
   B1 B2 B3
A1  7  9 11
A2  8 10 12

, , C3
   B1 B2 B3
A1 13 15 17
A2 14 16 18

, , C4
   B1 B2 B3
A1 19 21 23
A2 20 22 24
2.5 list
- An ordered collection of objects that can store combinations of several vectors, matrices, data frames, and even other lists
- The most complex and important
- A one-dimensional collection
list(): creates a list
Elements can be named, similar to a dictionary
> a <- 1:20
> b <- "Hello"
> c <- matrix(a, 4, 5)
# Create a list; naming the elements enables $ access
> mlist <- list(first = a, second = b, third = c)
> mlist[1]       # single bracket: a sub-list containing the first element
$first
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
# A vector of indices is needed to access several list elements at once
> mlist$first    # access a named element
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
# One bracket returns a list slice; two brackets return the element itself
> class(mlist[1])
[1] "list"
> class(mlist[[1]])
[1] "integer"
Adding a list element also requires double brackets
To delete a list element, assign NULL to that position
2.6 data frame
- A tabular data structure designed to simulate a dataset
- A rectangular structure: rows represent observations and columns represent variables; unlike a matrix, the columns may hold different data types
- The essence is a list. The list elements are vectors. Each column must have the same length. Therefore, the data frame is a rectangular structure, and the columns of the data frame must be named
- excel is a data frame structure
data.frame(): make data frame
Row and column names appear in the index
The indexing method is similar to the above
- When using lm for linear regression, you only need to give the column name
- attach(): attaches the data frame, so columns can be referenced without the $ prefix
- detach(): detaches it; afterwards you must again write dataframe$column to get the data
- with(mtcars, hp): same effect; the first argument is the data frame, the second an expression using its columns
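A minimal sketch of column access, using a made-up data frame:

```r
df <- data.frame(name = c("a", "b", "c"), score = c(90, 75, 88))
df$score               # access a column with $
with(df, mean(score))  # evaluate an expression inside the data frame, no $ needed
```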
2.7 factor
- The possible values of a categorical variable are called levels
- Nominal variables: independent of each other, without order
- Ordered variable
- Continuous variable: continuous relationship
- Nominal and ordered variables are called factors. The possible values of these categorical variables are their levels (e.g. good, better, best form three levels), and a vector of such values is a factor
- Function: it is suitable for recording different treatment levels or other types of classification variables met by the research objects in a study
- The maximum function is to classify and calculate frequency and frequency
- application
- Calculation frequency, independence test, correlation test, analysis of variance, principal component analysis, factor analysis, etc
- In many drawing tools, factors are used
table() | level contained in classification statistics factor |
---|---|
cut() | Partition function |
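A sketch of building an ordered factor and tabulating frequencies with table() and cut(), on invented data:

```r
answers <- c("good", "best", "good", "better")
f <- factor(answers, levels = c("good", "better", "best"), ordered = TRUE)
table(f)  # frequency of each level

ages <- c(21, 35, 48, 60, 73)
bins <- cut(ages, breaks = c(0, 30, 60, 100))  # partition into intervals
table(bins)                                    # frequency per interval
```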
2.8 time series
- Time series analysis
- Used to predict
Processing of time series
ts(): generate time series
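A minimal ts() sketch with hypothetical monthly figures starting January 2020:

```r
sales <- ts(c(12, 15, 14, 18, 20, 22), start = c(2020, 1), frequency = 12)
sales             # printed with month labels
frequency(sales)  # 12 observations per year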
3. Missing data
- NA indicates the missing value, not available. It stores the missing information, not necessarily 0. The missing value and the value of 0 are completely different
- When a vector contains NA, use sum(vector, na.rm = TRUE) to get the sum
is.na(a) | Check whether there is NA in a, and return TRUE if there is. It can be used to test the data set |
---|---|
colSums(is.na(x)) | Counts the missing values per column |
na.omit() | Removes the NA values from the vector |
Processing missing value packages
Identify missing values
- Delete missing values
- Listwise (row) deletion: na.omit()
- Pairwise deletion: some functions offer this as an option
- Maximum likelihood estimation: mvnmle package
- Impute missing values
- Single (simple) imputation: Hmisc package
- Multiple imputation: mi, mice, Amelia, mitools packages
Other missing data
- NaN represents an impossible value (e.g. 0/0)
- Inf represents infinity: positive infinity Inf and negative infinity -Inf
- is.na() treats NaN as missing and returns TRUE or FALSE; is.nan() and is.infinite() test these cases specifically
4. String
Functions for processing strings (strings in R also support regular expressions, which can be used in these functions)
nchar() | Returns the length of the string in the vector element (including spaces). Even if the element is not a string, it will be converted into a string for processing |
---|---|
length() | Returns the number of vector elements |
paste() | Concatenates strings; sep = "-" joins the pieces with "-". With vector arguments the elements are combined pairwise |
substr() | Extracts the characters from start to stop of each string element |
toupper / tolower | Converts a string to upper / lower case |
gsub() | Global pattern substitution: replaces every match |
strsplit() | Split string |
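A sketch exercising the string functions above on invented strings:

```r
s <- c("hello world", "R language")
nchar(s)                    # 11 10 (spaces count)
paste("a", "b", sep = "-")  # "a-b"
substr(s, 1, 5)             # first five characters of each element
toupper(s[2])               # "R LANGUAGE"
gsub("o", "0", s[1])        # "hell0 w0rld": every match is replaced
strsplit(s[1], " ")         # a list: "hello" "world"
```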
5. Obtain data
5.1 obtaining data by keyboard
> patientID <- c(1, 2, 3, 4)
> admdate <- c("10/15/2009","11/01/2009","10/21/2009","10/28/2009")
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> data <- data.frame(patientID, age, diabetes, status)
> data
  patientID age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
4         4  52    Type1      Poor
> data2 <- data.frame(patientID=character(0), admdate=character(0), age=numeric(), diabetes=character(), status=character())
# edit() opens a spreadsheet-like editor; undefined cells become NA.
# Assign the result back to a variable, or the edits made in the editor are lost
> data2 <- edit(data2)
> data2
  patientID admdate age diabetes status
1         5    <NA>  NA     <NA>   <NA>
2         6    <NA>  NA     <NA>   <NA>
3         7    <NA>  NA     <NA>   <NA>
4         5    <NA>  NA     <NA>   <NA>
# fix() edits the object in place and saves directly
5.2 by reading data stored on external files
5.3 obtaining data by accessing the database system
6. Read file
6.1 read.table()
- Place the file in the workspace directory
- Use read.table("filename") and save the result in a variable
- Note: generally used for .txt files; for .csv files change the separator (or use read.csv)
- You can use the head() and tail() functions to view the first and last lines
- Full path can be used
- sep sets what delimiter is used in the file. txt defaults to space, and csv files should set sep = ","
- header=TRUE, indicating that the first row of data is used as a header instead of data
- Set skip to skip some contents. Set skip=5 to read data from the sixth row
- nrows=100, indicating that 100 rows have been read, which can be used with skip
- na.string: process missing value information. If you know what symbol is used as the missing value, you can replace the missing value with NA
- stringsAsFactors: controls whether strings are converted to factors (usually set to FALSE)
- The data on the shear board can be read
- Select a region in Excel, then read.table("clipboard", header = TRUE, sep = "\t")
- You can also directly use readClipboard() to read the information on the clipboard
6.2 read.csv()
6.3 read.delim()
- The other read functions are simplified versions of read.table() with different default separators
6.4 reading network files
- Can read csv or txt files over http and other protocols
- But this is error-prone; for complex pages web-scraping (crawler) tools are needed
6.5 compressed files
- Can be read directly
> read.table(gzfile("input.txt.gz"))
6.6 reading non-standard files
- readLine()
- The file can be read according to each line and unit
- Set parameter n to limit the number of rows read in
- scan()
- Read one unit at a time
- The first parameter represents the file address
- what: unit expected to be read in
7. Write file
7.1 write.table()
- By default, it is saved in the workspace directory. The path must exist, and the R file will not create a new directory
- You can specify a new separator and save it into different types of files
- Row names are written by default; add row.names = FALSE to suppress them. If the row numbers are part of the data, a negative index can remove them
- Double quotes are added to the string by default, and you can set quote to FALSE
- na adjust missing values
- Files with the same name will be overwritten. You can set the parameter append to TRUE and append to write
- Can write directly to a compressed file; the file extension should match the compression type
- If you want to write the results of R in a format supported by other software, you can use the foreign package
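A sketch of the write.table() options above, using a throwaway temp file so nothing in the working directory is touched:

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
f <- tempfile(fileext = ".csv")  # a disposable path for the example
write.table(df, file = f, sep = ",",
            row.names = FALSE,   # no row numbers
            quote = FALSE)       # no quotes around strings
readLines(f)  # "x,y" "1,a" "2,b" "3,c"
unlink(f)     # clean up
```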
8. Read and write excel files
- If the excel file contains many macros, it is not suitable for direct reading
- openxlsx is only available for opening xlsx files
install.packages("openxlsx")
library(openxlsx)
a <- read.xlsx("exercise1.xlsx", sheet = 1)  # file name + sheet number, simple and direct
Write xlsx file
# Read the file contents into variable a, then write them back out
> library(openxlsx)
> a <- read.xlsx("data.xlsx")
> write.xlsx(a, file = "c.xlsx", sheetName = "she")
9. Read and write R format files
- When a dataset is saved in R's native format, the data are automatically compressed and all R metadata associated with the object is stored
- This matters when the data contain factors, dates/times, or class attributes, which plain-text formats lose
RDS files
- Only a single variable can be stored
The example uses iris, a dataset shipped with R
> saveRDS(iris, file = "iris.RDS")
> iris <- readRDS("E:/mathmodel/R_studio/RS_project/iris.RDS")
# Double-clicking an RDS file imports it automatically
> x <- readRDS("iris.RDS")
RData file
- Save multiple type variables, similar to a project file
- Packages, files, and other session information are saved
save.image(): save the workspace
load(): load a saved workspace
10. Data conversion
- Data converted to function processing
- as.data.frame can be cast to data frame format
- A matrix can be directly converted to a data frame, but a data frame cannot be directly converted to a matrix because it contains different types of data
10.1 subset
Method 1
> who <- read.csv("WHO.csv")
> who1 <- who[c(1:50), c(1:10)]
> View(who1)
> who4 <- who[which(who$CountryID > 50 & who$CountryID <= 100), ]  # note the comma
Method 2
> who5 <- subset(who,who$CountryID>50 & who$CountryID<=100)
random sampling
> x <- 1:20
> sample(x, 10)  # replace defaults to FALSE: sampling without replacement
 [1] 13  5 11  1 17  4  8 20  6 12
> who7 <- who[sort(sample(who$CountryID, 30, replace = FALSE)), ]
10.2 delete fixed line
The built-in data set mtcars is used
mtcars[-1:-5, ]  # delete fixed rows
mtcars[, -1:-5]  # delete fixed columns
10.3 data frame consolidation
cbind(x, y)  # add columns
rbind(x, y)  # add rows; both datasets must have the same column names
Duplicates are not deleted when there are overlaps
duplicated(data4)             # TRUE/FALSE: does this row duplicate an earlier one?
data4[duplicated(data4), ]    # extract the duplicated rows
data4[!duplicated(data4), ]   # remove the duplicated rows
unique(data4)                 # one-step equivalent of the previous line
10.4 data flip
t() function
- Transposes rows and columns
rev() function
- Reverses the element order
- Works on vectors and data frames
transform() function
- Directly modify the value of a column
transform(women, height = height * 2.54)  # the height column becomes 2.54 times its original value
10.5 sorting of data frames
- Sort can only be used to sort vectors, not data frames
- The order function can also sort vectors, but returns the index instead of the sorted vector
- For descending order, put a minus sign before the variable inside order(); this matches the effect of rev()
- You can also sort multiple at the same time
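A sketch of sort() versus order(), on a made-up vector:

```r
v <- c(30, 10, 20)
sort(v)       # 10 20 30: the sorted values
order(v)      # 2 3 1: the indices that would sort v
v[order(v)]   # same result as sort(v)
v[order(-v)]  # minus sign gives descending order: 30 20 10
# Sorting a data frame by several columns: cyl ascending, mpg descending
mtcars[order(mtcars$cyl, -mtcars$mpg), ]
```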
10.6 data frame calculation (apply series functions)
The trick is that FUN is a parameter
apply()
- Works on a data frame or matrix
- MARGIN = 1 applies the function over rows, MARGIN = 2 over columns
- FUN: the function to apply
Grouping calculation
- tapply()
- Parameters: vector
- Return value: vector
- apply()
- Parameters: list, data frame, array
- Return value: vector, matrix
Multi parameter calculation
- mapply()
- Parameter: vector, unlimited number
- Return value: vector, matrix
Cyclic iteration
- lapply()
- Parameters: list, data frame
- Return value: list
- Simplified version: sapply()
- vapply(): lets you specify the expected return type
- Return value: vector or matrix
- Recursive version: rapply()
- Parameter: list
- Return value: list
Environment space traversal
- eapply()
- Parameter: environment
- Return value: list
> state.name
 [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"
 [5] "California"     "Colorado"       "Connecticut"    "Delaware"
 [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"
[13] "Illinois"       "Indiana"        "Iowa"           "Kansas"
[17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"
[21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"
[25] "Missouri"       "Montana"        "Nebraska"       "Nevada"
[29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"
[33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"
[37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
[41] "South Dakota"   "Tennessee"      "Texas"          "Utah"
[45] "Vermont"        "Virginia"       "Washington"     "West Virginia"
[49] "Wisconsin"      "Wyoming"
> state.division
 [1] East South Central Pacific            Mountain           West South Central
 [5] Pacific            Mountain           New England        South Atlantic
 [9] South Atlantic     South Atlantic     Pacific            Mountain
[13] East North Central East North Central West North Central West North Central
[17] East South Central West South Central New England        South Atlantic
[21] New England        East North Central West North Central East South Central
[25] West North Central Mountain           West North Central Mountain
[29] New England        Middle Atlantic    Mountain           Middle Atlantic
[33] South Atlantic     West North Central East North Central West South Central
[37] Pacific            Middle Atlantic    New England        South Atlantic
[41] West North Central East South Central West South Central Mountain
[45] New England        South Atlantic     Pacific            South Atlantic
[49] East North Central Mountain
9 Levels: New England Middle Atlantic South Atlantic ... Pacific
> tapply(state.name, state.division, FUN = length)  # grouping by a factor
       New England    Middle Atlantic     South Atlantic East South Central
                 6                  3                  8                  4
West South Central East North Central West North Central           Mountain
                 4                  5                  7                  8
           Pacific
                 5
10.7 data centralization and standardization
Data centralization: refers to subtracting the mean value of the data set from the given data in the data set
Data standardization: after centralization, it is divided by the standard deviation of the data set
scale() function
- Parameters
- x: the data
- center: centering (subtract the column mean)
- scale: standardization (divide by the column standard deviation)
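A sketch of centering versus full standardization, on a hypothetical two-column matrix:

```r
x <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)
scale(x, center = TRUE, scale = FALSE)  # centering only: subtract each column mean
z <- scale(x)                           # standardization: center, then divide by the column sd
colMeans(z)                             # effectively 0 for every column
apply(z, 2, sd)                         # 1 for every column
```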
10.8 reshape2 package
- First the data are melted into long format; then chosen columns serve as id variables and the rest are stacked into a single value column
> library(reshape2)
> head(airquality)
> names(airquality) <- tolower(names(airquality))
> head(airquality)  # columns: ozone solar.r wind temp month day
> aql <- melt(airquality)                               # melt everything into variable/value pairs
> head(aql)
> aql <- melt(airquality, id.vars = c("month", "day"))  # keep month and day as id columns
> head(aql)                                             # columns: month day variable value
- acast()
- dcast()
10.9 tidyr package
- Features: relatively simple
- Each column represents a variable
- Each row represents an observation
- One variable and one observation determine a unique value
gather()
- The data to be collected is integrated into a single column, and the desired data is arranged into two columns according to key and value
spread()
- Contrary to gather
unite(): merges multiple columns into one column
- The opposite of separate()
separate(): splits one column into multiple columns
- col: the column to operate on
- into = c("A", "B"): the column names after splitting
- sep: the separator
10.10 dplyr package
- You can operate single tables or double tables
- Call with double colons (dplyr::fun) to avoid ambiguity with same-named functions from other packages
dplyr::distinct() | Removes duplicate rows |
---|---|
dplyr::filter() | Filters rows with Boolean conditions |
dplyr::slice() | Extracts rows by position |
sample_n(dataset, 10) | Randomly selects 10 rows |
sample_frac(dataset, 0.1) | Randomly samples a fraction (0.1) of the rows |
arrange() | Sorts (wrap a column in desc() for descending order) |
- The select() function takes a subset
- Based on the column name, or the beginning or end of a character
10.11 The chain (pipe) operator %>%
- Passes the output of one function to the next function as its input
- You can use multiple chained operators to pass content
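A sketch chaining the dplyr verbs above with %>% (requires the dplyr package):

```r
library(dplyr)  # %>% is re-exported by dplyr (originally from magrittr)
mtcars %>%
  filter(cyl == 4) %>%    # keep 4-cylinder cars
  arrange(desc(mpg)) %>%  # sort by mpg, descending
  head(3)                 # first three rows
```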
11. R function
- Option parameters
- Input control section
- Output control section
- Adjustment part
- Common options
- file: a file name or path
- data: generally a data frame input
- x: a single object, usually a vector, sometimes a matrix or list
- x and y: the function requires two input variables
- x, y, z: the function requires three input variables
- formula: a model formula
- na.rm: remove missing values
- ...: arguments passed on to other functions
- tune parameter
- main: string, not vector
- na.rm: TRUE or FALSE
- The axis: side parameter can only be 1 to 4 (determine the location of the drawing area)
- fig: vector containing four elements
12. Data statistics function
12.1 functions
d | probability density function |
---|---|
p | distribution function |
q | Inverse function of distribution function |
r | Generate random with the same distribution |
The function name is the distribution abbreviation with a d/p/q/r prefix
- rnorm: random number generator for the normal distribution
- The same pattern applies to the other probability distributions
- Knowing the distribution, you can plot its various curves
Distribution name | abbreviation | Distribution name | abbreviation |
---|---|---|---|
Beta distribution | beta | Logistic distribution | logis |
Binomial distribution | binom | Multinomial distribution | multinom |
Cauchy distribution | cauchy | Negative binomial distribution | nbinom |
(non central) chi square distribution | chisq | Normal distribution | norm |
exponential distribution | exp | Poisson distribution | pois |
F distribution | f | Wilcoxon signed rank distribution | signrank |
Gamma distribution | gamma | t distribution | t |
Geometric distribution | geom | uniform distribution | unif |
Hypergeometric distribution | hyper | Weibull distribution | weibull |
Lognormal distribution | lnorm | Wilcoxon rank sum distribution | wilcox |
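The d/p/q/r prefixes combined with the norm abbreviation from the table, as a quick sketch:

```r
dnorm(0)     # probability density of N(0,1) at 0, about 0.3989
pnorm(1.96)  # distribution function: P(X <= 1.96), about 0.975
qnorm(0.975) # inverse of the distribution function, about 1.96
set.seed(1)
rnorm(3, mean = 10, sd = 2)  # three random draws from N(10, 2^2)
```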
12.2 generate random number runif()
- Generates numbers between 0 and 1 by default
- The min and max parameters adjust the range
- To reproduce the same random numbers, fix the seed:
set.seed(66)  # bind the generator to a seed
runif(50)
runif(50)     # different from the first draw
set.seed(66)
runif(50)     # identical to the first draw
12.3 descriptive statistical functions
summary()
- Data sets can be counted
fivenum()
- Returns the five-number summary (minimum, lower hinge, median, upper hinge, maximum), similar to summary()
aggregate()
- Statistical function to group some values according to their own set information
- Only one function can be used to calculate and return one function value at a time
summaryBy()
- Multiple groups and statistical functions are completed at one time, and the results are displayed in one table
13. Frequency statistical function
13.1 table()
- It can directly complete the statistics of frequency
- Alternatively, you can set the column name as a factor, split it with the split function, and divide it into different data frames according to different values
- Can complete one-dimensional or two-dimensional statistics
13.2 xtabs()
- addmargins()
- Adds margins to the rows or columns of a contingency table
- The second argument: 1 adds a row margin, 2 a column margin
- The ftable() function converts the result into a flat contingency table
14. Independence test function
- P value
- Hypothesis test
- Null hypothesis: nothing happened; alternative hypothesis: something happened
- The P value is the probability, computed under the null hypothesis, of obtaining a test statistic at least as extreme as the one observed
- The threshold is usually set at 0.05: reject the null hypothesis when P < 0.05, do not reject it when P > 0.05
14.1 chi square test
> library(vcd)
> mytable <- table(Arthritis$Treatment, Arthritis$Improved)
> mytable
         None Some Marked
Placebo    29    7      7
Treated    13    7     21
> chisq.test(mytable)
	Pearson's Chi-squared test
data:  mytable
X-squared = 13.055, df = 2, p-value = 0.001463
> mytable <- table(Arthritis$Sex, Arthritis$Improved)
> chisq.test(mytable)
	Pearson's Chi-squared test
data:  mytable
X-squared = 4.8407, df = 2, p-value = 0.08889
Warning message:
In chisq.test(mytable) : Chi-squared approximation may be incorrect
14.2 Fisher test
- Used when the row and column margins of the contingency table are fixed
> mytable <- xtabs(~Treatment+Improved, data=Arthritis)
> fisher.test(mytable)
	Fisher's Exact Test for Count Data
data:  mytable
p-value = 0.001393
alternative hypothesis: two.sided
14.3 Cochran-Mantel-Haenszel test
- Tests whether two nominal variables are conditionally independent within each stratum of a third variable (three variables are required, and the order of the variables affects the result)
> mytable <- xtabs(~Treatment+Improved+Sex, data=Arthritis)
> mantelhaen.test(mytable)
	Cochran-Mantel-Haenszel test
data:  mytable
Cochran-Mantel-Haenszel M^2 = 14.632, df = 2, p-value = 0.0006647
> mytable <- xtabs(~Sex+Treatment+Improved, data=Arthritis)
> mantelhaen.test(mytable)
	Mantel-Haenszel chi-squared test with continuity correction
data:  mytable
Mantel-Haenszel X-squared = 2.0863, df = 1, p-value = 0.1486
alternative hypothesis: true common odds ratio is not equal to 1
95 percent confidence interval:
 0.8566711 8.0070521
sample estimates:
common odds ratio
         2.619048
15. Correlation function
- Measure relevance through quantitative indicators
15.1 correlation analysis function
cor()
- Correlation analysis uses this function
- pearson (default), kendall and spearman can be calculated, and other coefficients can only be implemented by R's extension package (ggm)
- Separate columns can be taken for comparison
cov()
- Calculate covariance
- Measure the overall error of two variables
- The problem reflected is similar to cor()
ggm package
- pcor(u, s) computes the partial correlation coefficient
- u is a numeric vector: the first two values are the subscripts of the variables whose correlation is computed; the remaining values are the subscripts of the conditioning variables
jsbl <- c(1, 5)     # subscripts of the variables to correlate
tjbl <- c(2, 3, 6)  # subscripts of the conditioning (control) variables, i.e. the variables to exclude
u <- c(jsbl, tjbl)
s <- cov(pcordata)  # covariance matrix of the variables
r <- pcor(u, s)     # partial correlation coefficient
15.2 correlation test function
- After analysis, it still needs to be quantified into P value to test
cor.test()
- pearson (default), kendall and spearman can be calculated, and other coefficients can only be implemented by R's extension package (ggm)
> cor.test(state.x77[,3], state.x77[,5])
	Pearson's product-moment correlation
data:  state.x77[, 3] and state.x77[, 5]
t = 6.8479, df = 48, p-value = 1.258e-08
alternative hypothesis: true correlation is not equal to 0
# Confidence interval: an interval estimate of a population parameter constructed from sample
# statistics; it shows the probability with which the true value falls around the estimate
95 percent confidence interval:
 0.5279280 0.8207295
sample estimates:
      cor
0.7029752
corr.test()
- In the psych package, recursive processing is possible
- The correlation coefficient is calculated and the detection value is given
pcor.test()
- Partial correlation test
- parameter
- Partial correlation coefficient
- Number of variables
- Number of samples
- Return value
- t statistic
- Degrees of freedom
- P value
Grouping data correlation test
- t-test
- Compares whether the difference between two means is significant
- Mainly used for small samples (n < 30) from normal distributions with unknown population standard deviation
Parameter test
- It is a method to infer the parameters of the overall distribution, such as mean and variance, when the overall distribution form is known (i.e. the data distribution is known, such as satisfying the normal distribution)
Nonparametric test
- When the population variance is unknown or little known, it is a method to infer the population distribution form by using sample data. The nonparametric test method is called nonparametric test because it does not involve parameters related to the overall distribution in the inference process
16. Drawing function
Four drawing systems of R language
- Basic drawing system
- lattice package
- ggplot2 package
- grid package
plot()
- Multiple data types are supported
par()
Set font style, color and other parameters
17. User defined function
Function declaration
myfun <- function(arguments) {
  # function body
}
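A minimal sketch following that template (mystat is a made-up name; the na.rm option mirrors the common option parameters described in section 11):

```r
mystat <- function(x, na.rm = TRUE) {   # option parameter with a default
  list(mean = mean(x, na.rm = na.rm),   # the function body returns a list
       sd   = sd(x, na.rm = na.rm))
}
mystat(c(1, 2, NA, 4))  # NA is ignored thanks to the default na.rm = TRUE
```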
🔺 18. Data analysis practice
18.1 linear regression
- The input to the lm function must be a data frame
- Regression: methods that use one or more predictor variables (also called independent or explanatory variables) to predict a response variable (also called a dependent, criterion, or outcome variable)
> fit <- lm(weight ~ height, data = women)
> fit
Call:
lm(formula = weight ~ height, data = women)
Coefficients:
(Intercept)       height
     -87.52         3.45
> summary(fit)
Call:
lm(formula = weight ~ height, data = women)
Residuals:  # residual = true value minus fitted value; smaller residuals mean a better fit
    Min      1Q  Median      3Q     Max
-1.7333 -1.1333 -0.3833  0.7417  3.1167
Coefficients:  # fitted model: weight = 3.45*height - 87.52
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***  # intercept term
height        3.45000    0.09114   37.85 1.09e-14 ***  # coefficient
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1  # significance stars
Residual standard error: 1.525 on 13 degrees of freedom
Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
# R-squared (0 to 1) judges the goodness of fit: the model explains 99.1% of the variance
F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
# The F statistic tests whether the model as a whole is significant; the smaller p, the more significant
# Read the output bottom-up: first check F (if p >= 0.05 the model has no value), then look at R-squared
- Other functions useful for fitting linear models
Function | Purpose
---|---
summary() | Shows detailed results of the fitted model
coefficients() | Lists the model parameters (intercept and slopes) of the fitted model
confint() | Provides confidence intervals for the model parameters
fitted() | Lists the predicted values of the fitted model
residuals() | Lists the residual values of the fitted model
anova() | Generates an ANOVA table for a fitted model, or compares the ANOVA tables of two or more fitted models
vcov() | Lists the covariance matrix of the model parameters
AIC() | Outputs the Akaike Information Criterion statistic
plot() | Generates diagnostic plots for evaluating the fitted model
predict() | Uses the fitted model to predict response values for a new data set
# Continuing from above: plot the fit
> plot(women$height, women$weight)
> abline(fit)   # draw the fitted regression line
# Improve the fit with a quadratic term
> fit2 <- lm(weight ~ height + I(height^2), data = women)   # for a cubic term, add I(height^3)
> lines(women$height, fitted(fit2), col = "red")            # red curve to distinguish the quadratic fit
18.2 multiple linear regression
> states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
> class(states)
[1] "data.frame"
> fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
> fit

Call:
lm(formula = Murder ~ Population + Illiteracy + Income + Frost, data = states)

Coefficients:
(Intercept)   Population   Illiteracy       Income        Frost
  1.235e+00    2.237e-04    4.143e+00    6.442e-05    5.813e-04

> summary(fit)

Call:
lm(formula = Murder ~ Population + Illiteracy + Income + Frost, data = states)

Residuals:
    Min      1Q  Median      3Q     Max
-4.7960 -1.6495 -0.0811  1.4815  7.6210

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.235e+00  3.866e+00   0.319   0.7510
Population  2.237e-04  9.052e-05   2.471   0.0173 *
Illiteracy  4.143e+00  8.744e-01   4.738 2.19e-05 ***
Income      6.442e-05  6.837e-04   0.094   0.9253
Frost       5.813e-04  1.005e-02   0.058   0.9541
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.535 on 45 degrees of freedom
Multiple R-squared: 0.567,  Adjusted R-squared: 0.5285
F-statistic: 14.73 on 4 and 45 DF,  p-value: 9.133e-08

> options(digits = 4)
> coef(fit)
(Intercept)  Population  Illiteracy      Income       Frost
  1.235e+00   2.237e-04   4.143e+00   6.442e-05   5.813e-04
Judging variable relationships (AIC)
- With multiple predictors, pay attention to the relationships between pairs of variables. If two variables interact but the form of the relationship is uncertain, connect them with a colon to model the interaction, as in the following example
> fit <- lm(mpg ~ hp + wt + hp:wt, data = mtcars)
# hp and wt interact, but it is not clear what the relationship is, so connect them with a colon
> summary(fit)

Call:
lm(formula = mpg ~ hp + wt + hp:wt, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max
-3.063 -1.649 -0.736  1.421  4.551

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 49.80842    3.60516   13.82  5.0e-14 ***
hp          -0.12010    0.02470   -4.86  4.0e-05 ***
wt          -8.21662    1.26971   -6.47  5.2e-07 ***
hp:wt        0.02785    0.00742    3.75  0.00081 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.15 on 28 degrees of freedom
Multiple R-squared: 0.885,  Adjusted R-squared: 0.872
F-statistic: 71.7 on 3 and 28 DF,  p-value: 2.98e-13
Akaike Information Criterion (AIC)
- Considers both the model's goodness of fit and the number of parameters used for fitting
- The smaller the AIC, the better: it means fewer variables achieve a comparable fit
- With many candidate parameters, stepwise regression or all-subsets regression can help choose among models
  - Stepwise regression: remove or add one variable at a time until the AIC stops improving (stepAIC() function in the MASS package)
    - Removing variables: backward stepwise regression
    - Adding variables: forward stepwise regression
  - All-subsets regression (regsubsets() function in the leaps package)
    - Fits every possible model and selects the best one
    - Takes longer when there are many variables
- Beware: a model that merely maximizes fit can lack practical significance and be useless
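A sketch of backward stepwise regression with stepAIC(), using the states example from above (MASS ships with R):

```r
library(MASS)

# rebuild the states data frame from the built-in state.x77 matrix
states <- as.data.frame(state.x77[, c("Murder", "Population",
                                      "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)

# remove one variable at a time while the AIC keeps decreasing;
# here Frost and Income are dropped, leaving Population + Illiteracy
step_fit <- stepAIC(fit, direction = "backward")
coef(step_fit)
```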
18.3 regression diagnosis
- Questions the diagnosis answers
  - Is this model the best possible model?
  - To what extent does the model satisfy the statistical assumptions of the OLS model?
  - Does the model hold up when tested against more data?
  - If the model's fit indices are poor, how should we proceed?
  - ......
- Methods for testing the statistical assumptions
  - summary(): generates the various fit indices
  - plot(): passing the result of lm() generates four diagnostic plots
> opar <- par(no.readonly = TRUE)
> fit <- lm(weight ~ height, data = women)
> par(mfrow = c(2, 2))   # display the four plots on one screen
> plot(fit)
- lm() fitting is appropriate only when the statistical assumptions of the OLS (ordinary least squares) model are met; note that independence cannot be judged from the four plots
  - Normality: for fixed values of the independent variables, the dependent variable is normally distributed
  - Independence: the values of the dependent variable are independent of one another
  - Linearity: the dependent variable is linearly related to the independent variables
  - Homoscedasticity: the variance of the dependent variable does not change with the level of the independent variables (also called constant variance)
- Validating the model by sampling
  - Suppose the data set has 1000 samples; randomly select 500 of them for the regression analysis
  - After the model is built, use the predict() function on the remaining 500 samples and compare the residuals
  - If the predictions are accurate, the model holds up; otherwise, adjust the model
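A minimal sketch of this hold-out validation, using the small built-in mtcars data in place of the 1000-sample set described above (the split sizes and formula are illustrative):

```r
set.seed(42)                          # reproducible random split
n <- nrow(mtcars)
idx <- sample(n, n %/% 2)             # half the rows for training
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

fit <- lm(mpg ~ wt, data = train)     # build the model on the training half
pred <- predict(fit, newdata = test)  # predict the held-out half

test_resid <- test$mpg - pred         # hold-out residuals
mean(abs(test_resid))                 # average absolute prediction error
```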
18.4 analysis of variance
- Used to test the significance of differences between the means of two or more samples. Analysis of variance is also a form of regression analysis, but linear regression usually has a continuous dependent variable; when the independent variables are factors, the focus of the research typically shifts from prediction to comparing differences between groups.
- The aov() function performs the analysis (the order of the variables matters)
Design | Expression
---|---
One-way ANOVA | y ~ A
One-way ANCOVA with one covariate | y ~ x + A
Two-way factorial ANOVA | y ~ A * B
Two-way factorial ANCOVA with two covariates | y ~ x1 + x2 + A * B
Randomized block | y ~ B + A (B is the blocking factor)
One-way within-groups ANOVA | y ~ A + Error(Subject/A)
Repeated-measures ANOVA with one within-groups factor (W) and one between-groups factor (B) | y ~ B * W + Error(Subject/W)
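A minimal one-way ANOVA (the y ~ A design above) on the built-in PlantGrowth data, chosen here only as an illustration: does plant weight differ among the control and two treatment groups?

```r
# group is a factor with levels ctrl, trt1, trt2
fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)   # F test for differences among the group means
```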
18.5 power analysis
- Regression analysis and analysis of variance model the data and judge the relationships within it
- Power analysis determines the sample size needed to detect a given effect size at a given confidence level; conversely, it can calculate the probability of detecting a given effect size with a given sample size at a given confidence level
Theoretical basis (given any three, the fourth can be derived)
- Sample size: the number of observations in each condition/group of the experimental design
- Significance level (alpha): the probability of a Type I error, i.e. of "finding" an effect that does not actually exist
- Power: 1 minus the probability of a Type II error, i.e. the probability of detecting a real effect
- Effect size: the magnitude of the effect under the alternative (research) hypothesis; how it is expressed depends on the statistical method used in the hypothesis test
Linear regression power analysis example
- Use the pwr package in R
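A sketch of the linear-regression case with pwr.f2.test(), assuming the pwr package is installed; the effect size f2 = 0.15 follows Cohen's convention for a "medium" effect and is an illustrative choice:

```r
library(pwr)

# How many subjects are needed to detect a medium effect (f2 = 0.15)
# in a regression with 3 predictors, at alpha = 0.05 and 90% power?
result <- pwr.f2.test(u = 3, f2 = 0.15, sig.level = 0.05, power = 0.90)
result$v   # denominator degrees of freedom; required n = v + u + 1
```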
18.6 generalized linear model
- There are many types of models. Linear regression and analysis of variance are based on the assumption of a normal distribution. The generalized linear model extends the linear-model framework to include analyses with non-normally distributed dependent variables
glm() function
- Similar to lm(), but with an extra parameter: the probability distribution family and its corresponding default link function
- Parameter estimation is based on maximum likelihood
Poisson regression
- A regression analysis for modeling count data and contingency tables. Poisson regression assumes the dependent variable follows a Poisson distribution and that the logarithm of its mean can be modeled by a linear combination of unknown parameters
Logistic regression
- A very useful tool for predicting a binary outcome variable from a series of continuous or categorical predictor variables
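A minimal logistic-regression sketch with glm() on the built-in mtcars data; the choice of outcome and predictors is illustrative, not from the source:

```r
# Predict transmission type (am: 0 = automatic, 1 = manual)
# from horsepower and weight
fit <- glm(am ~ hp + wt, data = mtcars, family = binomial())
summary(fit)

# Predicted probability of a manual transmission for a hypothetical car
predict(fit, newdata = data.frame(hp = 120, wt = 2.8), type = "response")
```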
18.7 principal component analysis / factor analysis
principal component analysis
- Like factor analysis, an unsupervised method used to simplify complex multivariate relationships
- A data dimensionality-reduction technique that transforms a large number of correlated variables into a small set of uncorrelated variables, called principal components. Each principal component is a linear recombination of the original variables: many correlated indicators are recombined into a new set of independent composite indicators
- The princomp() function in base R, or the principal() function in the psych package
factor analysis
- A family of methods for discovering the latent structure of a set of variables: by finding a smaller set of latent (hidden) factors, they explain the observed relationships among the manifest variables. Factor analysis is a generalization of principal components
- Finding the common factors, and then expressing and interpreting them, is harder than in principal component analysis
Steps of principal component analysis and factor analysis
- Data preprocessing
- Select analysis model
- Judge the number of principal components/factors to retain (analyze with a scree plot: fa.parallel() function)
- Select principal component / factor
- Rotate principal component / factor (optional)
- Interpretation results
- Calculate the principal component or factor score, which is also optional
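A sketch of the steps above using the psych package (assumed installed) on the built-in USJudgeRatings data; the dataset and the choice of one component are illustrative:

```r
library(psych)

ratings <- USJudgeRatings[, -1]   # drop the first column (number of contacts)

# Step 3: scree plot / parallel analysis to judge the number of components
fa.parallel(ratings, fa = "pc", n.iter = 100)

# Steps 4 and 7: extract one principal component and its scores
pc <- principal(ratings, nfactors = 1, scores = TRUE)
pc$loadings    # how strongly each rating loads on the component
head(pc$scores)
```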