R language
Video link: https://www.bilibili.com/video/BV19x411X7C6
Data analysis process
Data collection → data storage → data analysis → data mining → data visualization → decision making
1 Rstudio use
1.1 Introduction
- Tab completion: press Tab to autocomplete names
- Blue: functions
- Box icon: data frames
- Pink: built-in datasets
- Alt+Shift+K: show keyboard shortcuts
1.2 Foundation
- list.files() / dir(): view the files in the working directory
- No declaration is required before variable assignment
1.3 Migrating R packages
Rpack <- installed.packages()[, 1]
save(Rpack, file = "Rpack.Rdata")
# On the new machine, load Rpack.Rdata, then reinstall the packages one by one
for (i in Rpack) install.packages(i)
2 data structure
2.1 R object
- Vector, scalar
- matrix
- array
- list
- Data frame
- factor
- time series
2.2 vector
- A one-dimensional array used to store numeric, character, or logical data
- Create a vector with function c
Note: strings in R must be quoted, otherwise they are treated as object names
- seq generates an arithmetic sequence
- rep repeat number
All elements of a vector must be of the same type; mixed types are coerced
Vectorization is central to R as statistical software: it is efficient and avoids explicit loops
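A minimal sketch of the creation functions above (the variable names are arbitrary):

```r
x <- c(1, 2, 3)               # combine values into a vector
s <- seq(1, 10, by = 2)       # arithmetic sequence: 1 3 5 7 9
r <- rep(c(1, 2), times = 3)  # repetition: 1 2 1 2 1 2
m <- c(1, "a")                # mixed types are coerced: both elements become character
```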
Vector index
- Vector indices in R start at 1, not 0
- A negative index excludes that element from the result
# Index by position; positive and negative indices cannot be mixed
> x[c(4:18)]
 [1]  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
# A logical vector selects the elements where the value is TRUE;
# a shorter logical vector is recycled along the target
y[c(T,T,T,F,F,F)]
y[y > 5]            # elements of y greater than 5
y[y > 5 & y < 9]
# Test string membership
> z <- c("one","two","three")
> "one" %in% z
[1] TRUE
> v
[1] 1 2 3 4 5 6
> v[20] <- 4          # assigning past the end fills the gap with NA
> v
 [1]  1  2  3  4  5  6 NA NA NA NA NA NA NA NA NA NA NA NA NA  4
> append(x = v, values = 99, after = 4)
 [1]  1  2  3  4 99  5  6 NA NA NA NA NA NA NA NA NA NA NA NA NA  4
Vector operation
- %%: modulo (remainder)
- %/%: integer division
When two vectors have unequal lengths, the shorter one is recycled; the longer length should be a multiple of the shorter
Logical comparison: x > 5 returns TRUE at the positions where the element exceeds 5 and FALSE elsewhere
ceiling | Returns the smallest integer not less than X |
---|---|
floor | Returns the largest integer not greater than X |
trunc | Returns the integer part |
round | Rounds; the first argument is a vector, the second the number of decimal places |
signif | Like round, but the second argument is the number of significant digits |
Statistical function
sum | Sum |
---|---|
max/min | Returns the maximum or minimum value |
range | Returns the maximum and minimum values |
mean | mean value |
var | variance |
median | median |
prod | Product of all elements |
which | Returns the indices of TRUE elements |
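A quick sketch of the statistical functions in the table, on a made-up vector:

```r
v <- c(3, 7, 1, 9, 5)
sum(v)        # 25
range(v)      # 1 9 (both min and max)
mean(v)       # 5
prod(1:4)     # 24
which(v > 4)  # positions of the elements greater than 4: 2 4 5
```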
2.3 matrix
- The data format of each element is required to be the same
- Generally, it is arranged by column first. You can set byrow=T to arrange by row
matrix(): creates a matrix, reshaping a one-dimensional vector into rows and columns
> x <- 1:20
> v <- matrix(x, 4, 5)
> v
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20
# Rename matrix rows and columns
> cname <- c("C1","C2","C3","C4","C5")
> rname <- c("R1","R2","R3","R4")
> dimnames(v) <- list(rname, cname)
> v
   C1 C2 C3 C4 C5
R1  1  5  9 13 17
R2  2  6 10 14 18
R3  3  7 11 15 19
R4  4  8 12 16 20
dim
# Assign dimensions to a vector
> dim(x) <- c(4, 5)
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20
Index of matrix
m[2,] | Access the second row |
---|---|
m[1,2] | The element in row 1, column 2 |
m[1,c(2,3,4)] | Row 1, columns 2, 3, and 4 |
m[-1,2] | Drop row 1, take column 2 |
- If the matrix is assigned a value, it can also be indexed by row and column names
- Pay attention to whether the row or column is accessed. If the row or column is accessed separately, add a comma before or after it
Matrix operation
- Same as matrix operations in linear algebra; the dimensions must conform
colSums | Sum of each column |
---|---|
rowSums | Sum of each row |
colMeans | Mean of each column |
diag | Returns the diagonal elements |
m*n | Element-wise product |
m %*% n | Matrix multiplication |
t(m) | Transpose of m |
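A small sketch contrasting the operations in the table, on a hypothetical 2×2 matrix:

```r
m <- matrix(1:4, 2, 2)  # columns (1,2) and (3,4)
m * m                   # element-wise product: each entry squared
m %*% diag(2)           # matrix multiplication; the identity leaves m unchanged
t(m)                    # transpose
colSums(m)              # 3 7
diag(m)                 # diagonal elements: 1 4
```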
2.4 array
- An array extends the vector/matrix idea to more than two dimensions
- Rarely used
> z <- array(1:24)
> dim1 <- c("A1","A2")
> dim2 <- c("B1","B2","B3")
> dim3 <- c("C1","C2","C3","C4")
> v <- array(z, c(2,3,4), dimnames = list(dim1, dim2, dim3))
> v
, , C1
   B1 B2 B3
A1  1  3  5
A2  2  4  6

, , C2
   B1 B2 B3
A1  7  9 11
A2  8 10 12

, , C3
   B1 B2 B3
A1 13 15 17
A2 14 16 18

, , C4
   B1 B2 B3
A1 19 21 23
A2 20 22 24
2.5 list
- An ordered collection of objects that can store combinations of several vectors, matrices, data frames, and even other lists
- The most complex and important
- A one-dimensional collection
list(): creates a list
Elements can be named, similar to a dictionary
> a <- 1:20
> b <- "Hello"
> c <- matrix(a, 4, 5)
# Create a list; naming the elements enables $ access
> mlist <- list(first = a, second = b, third = c)
> mlist[1]       # single bracket: a sub-list containing the first element
$first
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
# A vector of indices is needed to access several list elements at once
> mlist$first    # access a named element
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
# One bracket returns a list slice; two brackets return the element itself
> class(mlist[1])
[1] "list"
> class(mlist[[1]])
[1] "integer"
Adding a list element also requires double brackets
To delete a list element, assign NULL to that position
2.6 data frame
- A tabular data structure designed to simulate a dataset
- A rectangular structure: rows represent observations and columns represent variables; unlike a matrix, the columns may hold different data types
- The essence is a list. The list elements are vectors. Each column must have the same length. Therefore, the data frame is a rectangular structure, and the columns of the data frame must be named
- excel is a data frame structure
data.frame(): make data frame
Row and column names appear in the index
The indexing method is similar to the above
- When using lm for linear regression, you only need to give the column name
- attach(): attaches the data frame, so columns can be referenced without the $ prefix
- detach(): detaches it; afterwards you must again write dataframe$column to get the data
- with(mtcars, hp): same effect; the first argument is the data frame, the second an expression using its columns
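A minimal sketch of column access, using a made-up data frame:

```r
df <- data.frame(name = c("a", "b", "c"), score = c(90, 75, 88))
df$score               # access a column with $
with(df, mean(score))  # evaluate an expression inside the data frame, no $ needed
```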
2.7 factor
- The possible values of a categorical variable are called levels
- Nominal variables: independent of each other, without order
- Ordered variable
- Continuous variable: continuous relationship
- Nominal and ordered variables are called factors. The possible values of these categorical variables are their levels (e.g. good, better, best form three levels), and a vector of such values is a factor
- Function: it is suitable for recording different treatment levels or other types of classification variables met by the research objects in a study
- The maximum function is to classify and calculate frequency and frequency
- application
- Calculation frequency, independence test, correlation test, analysis of variance, principal component analysis, factor analysis, etc
- In many drawing tools, factors are used
table() | level contained in classification statistics factor |
---|---|
cut() | Partition function |
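A sketch of building an ordered factor and tabulating frequencies with table() and cut(), on invented data:

```r
answers <- c("good", "best", "good", "better")
f <- factor(answers, levels = c("good", "better", "best"), ordered = TRUE)
table(f)  # frequency of each level

ages <- c(21, 35, 48, 60, 73)
bins <- cut(ages, breaks = c(0, 30, 60, 100))  # partition into intervals
table(bins)                                    # frequency per interval
```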
2.8 time series
- Time series analysis
- Used to predict
Processing of time series
ts(): generate time series
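A minimal ts() sketch with hypothetical monthly figures starting January 2020:

```r
sales <- ts(c(12, 15, 14, 18, 20, 22), start = c(2020, 1), frequency = 12)
sales             # printed with month labels
frequency(sales)  # 12 observations per year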
3. Missing data
- NA indicates the missing value, not available. It stores the missing information, not necessarily 0. The missing value and the value of 0 are completely different
- When a vector contains NA, use sum(vector, na.rm = TRUE) to get the sum
is.na(a) | Check whether there is NA in a, and return TRUE if there is. It can be used to test the data set |
---|---|
colSums(is.na(x)) | Counts the missing values per column |
na.omit() | Removes the NA values from the vector |
Processing missing value packages
Identify missing values
- Delete missing values
- Listwise (row) deletion: na.omit()
- Pairwise deletion: some functions offer this as an option
- Maximum likelihood estimation: mvnmle package
- Impute missing values
- Single (simple) imputation: Hmisc package
- Multiple imputation: mi, mice, Amelia, mitools packages
Other missing data
- NaN represents an impossible value (e.g. 0/0)
- Inf represents infinity: positive infinity Inf and negative infinity -Inf
- is.na() treats NaN as missing and returns TRUE or FALSE; is.nan() and is.infinite() test these cases specifically
4. String
Functions for processing strings (strings in R also support regular expressions, which can be used in these functions)
nchar() | Returns the length of the string in the vector element (including spaces). Even if the element is not a string, it will be converted into a string for processing |
---|---|
length() | Returns the number of vector elements |
paste() | Concatenates strings; sep = "-" joins the pieces with "-". With vector arguments the elements are combined pairwise |
substr() | Extracts the characters from start to stop of each string element |
toupper / tolower | Converts a string to upper / lower case |
gsub() | Global pattern substitution: replaces every match |
strsplit() | Split string |
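A sketch exercising the string functions above on invented strings:

```r
s <- c("hello world", "R language")
nchar(s)                    # 11 10 (spaces count)
paste("a", "b", sep = "-")  # "a-b"
substr(s, 1, 5)             # first five characters of each element
toupper(s[2])               # "R LANGUAGE"
gsub("o", "0", s[1])        # "hell0 w0rld": every match is replaced
strsplit(s[1], " ")         # a list: "hello" "world"
```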
5. Obtain data
5.1 obtaining data by keyboard
> patientID <- c(1, 2, 3, 4)
> admdate <- c("10/15/2009","11/01/2009","10/21/2009","10/28/2009")
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> data <- data.frame(patientID, age, diabetes, status)
> data
  patientID age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
4         4  52    Type1      Poor
> data2 <- data.frame(patientID=character(0), admdate=character(0), age=numeric(), diabetes=character(), status=character())
# edit() opens a spreadsheet-like editor; undefined cells become NA.
# Assign the result back to a variable, or the edits made in the editor are lost
> data2 <- edit(data2)
> data2
  patientID admdate age diabetes status
1         5    <NA>  NA     <NA>   <NA>
2         6    <NA>  NA     <NA>   <NA>
3         7    <NA>  NA     <NA>   <NA>
4         5    <NA>  NA     <NA>   <NA>
# fix() edits the object in place and saves directly
5.2 by reading data stored on external files
5.3 obtaining data by accessing the database system
6. Read file
6.1 read.table()
- Place the file in the workspace directory
- Use read.table("filename") and save the result in a variable
- Note: generally used for .txt files; for .csv files change the separator (or use read.csv)
- You can use the head() and tail() functions to view the first and last lines
- Full path can be used
- sep sets what delimiter is used in the file. txt defaults to space, and csv files should set sep = ","
- header=TRUE, indicating that the first row of data is used as a header instead of data
- Set skip to skip some contents. Set skip=5 to read data from the sixth row
- nrows=100, indicating that 100 rows have been read, which can be used with skip
- na.string: process missing value information. If you know what symbol is used as the missing value, you can replace the missing value with NA
- stringsAsFactors: controls whether strings are converted to factors (usually set to FALSE)
- The data on the shear board can be read
- Select a region in Excel, then read.table("clipboard", header = TRUE, sep = "\t")
- You can also directly use readClipboard() to read the information on the clipboard
6.2 read.csv()
6.3 read.delim()
- The other read functions are simplified versions of read.table() with different default separators
6.4 reading network files
- Can read csv or txt files over http and other protocols
- But this is error-prone; for complex pages web-scraping (crawler) tools are needed
6.5 compressed files
- Can be read directly
> read.table(gzfile("input.txt.gz"))
6.6 reading non-standard files
- readLine()
- The file can be read according to each line and unit
- Set parameter n to limit the number of rows read in
- scan()
- Read one unit at a time
- The first parameter represents the file address
- what: unit expected to be read in
7. Write file
7.1 write.table()
- By default, it is saved in the workspace directory. The path must exist, and the R file will not create a new directory
- You can specify a new separator and save it into different types of files
- Row names are written by default; add row.names = FALSE to suppress them. If the row numbers are part of the data, a negative index can remove them
- Double quotes are added to the string by default, and you can set quote to FALSE
- na adjust missing values
- Files with the same name will be overwritten. You can set the parameter append to TRUE and append to write
- Can write directly to a compressed file; the file extension should match the compression type
- If you want to write the results of R in a format supported by other software, you can use the foreign package
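A sketch of the write.table() options above, using a throwaway temp file so nothing in the working directory is touched:

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
f <- tempfile(fileext = ".csv")  # a disposable path for the example
write.table(df, file = f, sep = ",",
            row.names = FALSE,   # no row numbers
            quote = FALSE)       # no quotes around strings
readLines(f)  # "x,y" "1,a" "2,b" "3,c"
unlink(f)     # clean up
```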
8. Read and write excel files
- If the excel file contains many macros, it is not suitable for direct reading
- openxlsx is only available for opening xlsx files
install.packages("openxlsx")
library(openxlsx)
a <- read.xlsx("exercise1.xlsx", sheet = 1)  # file name + sheet number, simple and direct
Write xlsx file
# Read the file contents into variable a, then write them back out
> library(openxlsx)
> a <- read.xlsx("data.xlsx")
> write.xlsx(a, file = "c.xlsx", sheetName = "she")
9. Read and write R format files
- When a dataset is saved in R's native format, the data are automatically compressed and all R metadata associated with the object is stored
- This matters when the data contain factors, dates/times, or class attributes, which plain-text formats lose
RDS files
- Only a single variable can be stored
The example uses iris, a dataset shipped with R
> saveRDS(iris, file = "iris.RDS")
> iris <- readRDS("E:/mathmodel/R_studio/RS_project/iris.RDS")
# Double-clicking an RDS file imports it automatically
> x <- readRDS("iris.RDS")
RData file
- Save multiple type variables, similar to a project file
- Packages, files, and other session information are saved
save.image(): save the workspace
load(): load a saved workspace
10. Data conversion
- Data converted to function processing
- as.data.frame can be cast to data frame format
- A matrix can be directly converted to a data frame, but a data frame cannot be directly converted to a matrix because it contains different types of data
10.1 subset
Method 1
> who <- read.csv("WHO.csv")
> who1 <- who[c(1:50), c(1:10)]
> View(who1)
> who4 <- who[which(who$CountryID > 50 & who$CountryID <= 100), ]  # note the comma
Method 2
> who5 <- subset(who,who$CountryID>50 & who$CountryID<=100)
random sampling
> x <- 1:20
> sample(x, 10)  # replace defaults to FALSE: sampling without replacement
 [1] 13  5 11  1 17  4  8 20  6 12
> who7 <- who[sort(sample(who$CountryID, 30, replace = FALSE)), ]
10.2 delete fixed line
The built-in data set mtcars is used
mtcars[-1:-5, ]  # delete fixed rows
mtcars[, -1:-5]  # delete fixed columns
10.3 data frame consolidation
cbind(x, y)  # add columns
rbind(x, y)  # add rows; both datasets must have the same column names
Duplicates are not deleted when there are overlaps
duplicated(data4)             # TRUE/FALSE: does this row duplicate an earlier one?
data4[duplicated(data4), ]    # extract the duplicated rows
data4[!duplicated(data4), ]   # remove the duplicated rows
unique(data4)                 # one-step equivalent of the previous line
10.4 data flip
t() function
- Transposes rows and columns
rev() function
- Reverses the element order
- Works on vectors and data frames
transform() function
- Directly modify the value of a column
transform(women, height = height * 2.54)  # the height column becomes 2.54 times its original value
10.5 sorting of data frames
- Sort can only be used to sort vectors, not data frames
- The order function can also sort vectors, but returns the index instead of the sorted vector
- For descending order, put a minus sign before the variable inside order(); this matches the effect of rev()
- You can also sort multiple at the same time
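A sketch of sort() versus order(), on a made-up vector:

```r
v <- c(30, 10, 20)
sort(v)       # 10 20 30: the sorted values
order(v)      # 2 3 1: the indices that would sort v
v[order(v)]   # same result as sort(v)
v[order(-v)]  # minus sign gives descending order: 30 20 10
# Sorting a data frame by several columns: cyl ascending, mpg descending
mtcars[order(mtcars$cyl, -mtcars$mpg), ]
```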
10.6 data frame calculation (apply series functions)
The trick is that FUN is a parameter
apply()
- Works on a data frame or matrix
- MARGIN = 1 applies the function over rows, MARGIN = 2 over columns
- FUN: the function to apply
Grouping calculation
- tapply()
- Parameters: vector
- Return value: vector
- apply()
- Parameters: list, data frame, array
- Return value: vector, matrix
Multi parameter calculation
- mapply()
- Parameter: vector, unlimited number
- Return value: vector, matrix
Cyclic iteration
- lapply()
- Parameters: list, data frame
- Return value: list
- Simplified version: sapply()
- vapply(): lets you specify the expected return type
- Return value: vector or matrix
- Recursive version: rapply()
- Parameter: list
- Return value: list
Environment space traversal
- eapply()
- Parameter: environment
- Return value: list
> state.name
 [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"
 [5] "California"     "Colorado"       "Connecticut"    "Delaware"
 [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"
[13] "Illinois"       "Indiana"        "Iowa"           "Kansas"
[17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"
[21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"
[25] "Missouri"       "Montana"        "Nebraska"       "Nevada"
[29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"
[33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"
[37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
[41] "South Dakota"   "Tennessee"      "Texas"          "Utah"
[45] "Vermont"        "Virginia"       "Washington"     "West Virginia"
[49] "Wisconsin"      "Wyoming"
> state.division
 [1] East South Central Pacific            Mountain           West South Central
 [5] Pacific            Mountain           New England        South Atlantic
 [9] South Atlantic     South Atlantic     Pacific            Mountain
[13] East North Central East North Central West North Central West North Central
[17] East South Central West South Central New England        South Atlantic
[21] New England        East North Central West North Central East South Central
[25] West North Central Mountain           West North Central Mountain
[29] New England        Middle Atlantic    Mountain           Middle Atlantic
[33] South Atlantic     West North Central East North Central West South Central
[37] Pacific            Middle Atlantic    New England        South Atlantic
[41] West North Central East South Central West South Central Mountain
[45] New England        South Atlantic     Pacific            South Atlantic
[49] East North Central Mountain
9 Levels: New England Middle Atlantic South Atlantic ... Pacific
> tapply(state.name, state.division, FUN = length)  # grouping by a factor
       New England    Middle Atlantic     South Atlantic East South Central
                 6                  3                  8                  4
West South Central East North Central West North Central           Mountain
                 4                  5                  7                  8
           Pacific
                 5
10.7 data centralization and standardization
Data centralization: refers to subtracting the mean value of the data set from the given data in the data set
Data standardization: after centralization, it is divided by the standard deviation of the data set
scale() function
- Parameters
- x: the data
- center: centering (subtract the column mean)
- scale: standardization (divide by the column standard deviation)
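A sketch of centering versus full standardization, on a hypothetical two-column matrix:

```r
x <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)
scale(x, center = TRUE, scale = FALSE)  # centering only: subtract each column mean
z <- scale(x)                           # standardization: center, then divide by the column sd
colMeans(z)                             # effectively 0 for every column
apply(z, 2, sd)                         # 1 for every column
```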
10.8 reshape2 package
- First the data are melted into long format; then chosen columns serve as id variables and the rest are stacked into a single value column
> library(reshape2)
> head(airquality)
> names(airquality) <- tolower(names(airquality))
> head(airquality)  # columns: ozone solar.r wind temp month day
> aql <- melt(airquality)                               # melt everything into variable/value pairs
> head(aql)
> aql <- melt(airquality, id.vars = c("month", "day"))  # keep month and day as id columns
> head(aql)                                             # columns: month day variable value
- acast()
- dcast()
10.9 tidyr package
- Features: relatively simple
- Each column represents a variable
- Each row represents an observation
- One variable and one observation determine a unique value
gather()
- The data to be collected is integrated into a single column, and the desired data is arranged into two columns according to key and value
spread()
- Contrary to gather
unite(): merges multiple columns into one column
- The opposite of separate()
separate(): splits one column into multiple columns
- col: the column to operate on
- into = c("A", "B"): the column names after splitting
- sep: the separator
10.10 dplyr package
- You can operate single tables or double tables
- Call with double colons (dplyr::fun) to avoid ambiguity with same-named functions from other packages
dplyr::distinct() | Removes duplicate rows |
---|---|
dplyr::filter() | Filters rows with Boolean conditions |
dplyr::slice() | Extracts rows by position |
sample_n(dataset, 10) | Randomly selects 10 rows |
sample_frac(dataset, 0.1) | Randomly samples a fraction (0.1) of the rows |
arrange() | Sorts (wrap a column in desc() for descending order) |
- The select() function takes a subset
- Based on the column name, or the beginning or end of a character
10.11 The chain (pipe) operator %>%
- Passes the output of one function to the next function as its input
- You can use multiple chained operators to pass content
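A sketch chaining the dplyr verbs above with %>% (requires the dplyr package):

```r
library(dplyr)  # %>% is re-exported by dplyr (originally from magrittr)
mtcars %>%
  filter(cyl == 4) %>%    # keep 4-cylinder cars
  arrange(desc(mpg)) %>%  # sort by mpg, descending
  head(3)                 # first three rows
```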
11. R function
- Option parameters
- Input control section
- Output control section
- Adjustment part
- Common options
- file: a file name or path
- data: generally a data frame input
- x: a single object, usually a vector, sometimes a matrix or list
- x and y: the function requires two input variables
- x, y, z: the function requires three input variables
- formula: a model formula
- na.rm: remove missing values
- ...: arguments passed on to other functions
- tune parameter
- main: string, not vector
- na.rm: TRUE or FALSE
- The axis: side parameter can only be 1 to 4 (determine the location of the drawing area)
- fig: vector containing four elements
12. Data statistics function
12.1 functions
d | probability density function |
---|---|
p | distribution function |
q | Inverse function of distribution function |
r | Generate random with the same distribution |
The function name is the distribution abbreviation with a d/p/q/r prefix
- rnorm: random number generator for the normal distribution
- The same pattern applies to the other probability distributions
- Knowing the distribution, you can plot its various curves
Distribution name | abbreviation | Distribution name | abbreviation |
---|---|---|---|
Beta distribution | beta | Logistic distribution | logis |
Binomial distribution | binom | Multinomial distribution | multinom |
Cauchy distribution | cauchy | Negative binomial distribution | nbinom |
(non central) chi square distribution | chisq | Normal distribution | norm |
exponential distribution | exp | Poisson distribution | pois |
F distribution | f | Wilcoxon signed rank distribution | signrank |
Gamma distribution | gamma | t distribution | t |
Geometric distribution | geom | uniform distribution | unif |
Hypergeometric distribution | hyper | Weibull distribution | weibull |
Lognormal distribution | lnorm | Wilcoxon rank sum distribution | wilcox |
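The d/p/q/r prefixes combined with the norm abbreviation from the table, as a quick sketch:

```r
dnorm(0)     # probability density of N(0,1) at 0, about 0.3989
pnorm(1.96)  # distribution function: P(X <= 1.96), about 0.975
qnorm(0.975) # inverse of the distribution function, about 1.96
set.seed(1)
rnorm(3, mean = 10, sd = 2)  # three random draws from N(10, 2^2)
```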
12.2 generate random number runif()
- Generates numbers between 0 and 1 by default
- The min and max parameters adjust the range
- To reproduce the same random numbers, fix the seed:
set.seed(66)  # bind the generator to a seed
runif(50)
runif(50)     # different from the first draw
set.seed(66)
runif(50)     # identical to the first draw
12.3 descriptive statistical functions
summary()
- Data sets can be counted
fivenum()
- Returns the five-number summary (minimum, lower hinge, median, upper hinge, maximum), similar to summary()
aggregate()
- Statistical function to group some values according to their own set information
- Only one function can be used to calculate and return one function value at a time
summaryBy()
- Multiple groups and statistical functions are completed at one time, and the results are displayed in one table
13. Frequency statistical function
13.1 table()
- It can directly complete the statistics of frequency
- Alternatively, you can set the column name as a factor, split it with the split function, and divide it into different data frames according to different values
- Can complete one-dimensional or two-dimensional statistics
13.2 xtabs()
- addmargins()
- Adds margins to the rows or columns of a contingency table
- The second argument: 1 adds a row margin, 2 a column margin
- The ftable() function converts the result into a flat contingency table
14. Independence test function
- P value
- Hypothesis test
- Null hypothesis: nothing happened; alternative hypothesis: something happened
- The P value is the probability, computed under the null hypothesis, of obtaining a test statistic at least as extreme as the one observed
- The threshold is usually set at 0.05: reject the null hypothesis when P < 0.05, do not reject it when P > 0.05
14.1 chi square test
> library(vcd)
> mytable <- table(Arthritis$Treatment, Arthritis$Improved)
> mytable
         None Some Marked
Placebo    29    7      7
Treated    13    7     21
> chisq.test(mytable)
	Pearson's Chi-squared test
data:  mytable
X-squared = 13.055, df = 2, p-value = 0.001463
> mytable <- table(Arthritis$Sex, Arthritis$Improved)
> chisq.test(mytable)
	Pearson's Chi-squared test
data:  mytable
X-squared = 4.8407, df = 2, p-value = 0.08889
Warning message:
In chisq.test(mytable) : Chi-squared approximation may be incorrect
14.2 Fisher test
- Used when the row and column margins of the contingency table are fixed
> mytable <- xtabs(~Treatment+Improved, data=Arthritis)
> fisher.test(mytable)
	Fisher's Exact Test for Count Data
data:  mytable
p-value = 0.001393
alternative hypothesis: two.sided
14.3 Cochran-Mantel-Haenszel test
- Tests whether two nominal variables are conditionally independent within each stratum of a third variable (three variables are required, and the order of the variables affects the result)
> mytable <- xtabs(~Treatment+Improved+Sex, data=Arthritis)
> mantelhaen.test(mytable)
	Cochran-Mantel-Haenszel test
data:  mytable
Cochran-Mantel-Haenszel M^2 = 14.632, df = 2, p-value = 0.0006647
> mytable <- xtabs(~Sex+Treatment+Improved, data=Arthritis)
> mantelhaen.test(mytable)
	Mantel-Haenszel chi-squared test with continuity correction
data:  mytable
Mantel-Haenszel X-squared = 2.0863, df = 1, p-value = 0.1486
alternative hypothesis: true common odds ratio is not equal to 1
95 percent confidence interval:
 0.8566711 8.0070521
sample estimates:
common odds ratio
         2.619048
15. Correlation function
- Measure relevance through quantitative indicators
15.1 correlation analysis function
cor()
- Correlation analysis uses this function
- pearson (default), kendall and spearman can be calculated, and other coefficients can only be implemented by R's extension package (ggm)
- Separate columns can be taken for comparison
cov()
- Calculate covariance
- Measure the overall error of two variables
- The problem reflected is similar to cor()
ggm package
- pcor(u, s) computes the partial correlation coefficient
- u is a numeric vector: the first two values are the subscripts of the variables whose correlation is computed; the remaining values are the subscripts of the conditioning variables
jsbl <- c(1, 5)     # subscripts of the variables to correlate
tjbl <- c(2, 3, 6)  # subscripts of the conditioning (control) variables, i.e. the variables to exclude
u <- c(jsbl, tjbl)
s <- cov(pcordata)  # covariance matrix of the variables
r <- pcor(u, s)     # partial correlation coefficient
15.2 correlation test function
- After analysis, it still needs to be quantified into P value to test
cor.test()
- pearson (default), kendall and spearman can be calculated, and other coefficients can only be implemented by R's extension package (ggm)
> cor.test(state.x77[,3], state.x77[,5])
	Pearson's product-moment correlation
data:  state.x77[, 3] and state.x77[, 5]
t = 6.8479, df = 48, p-value = 1.258e-08
alternative hypothesis: true correlation is not equal to 0
# Confidence interval: an interval estimate of a population parameter constructed from sample
# statistics; it shows the probability with which the true value falls around the estimate
95 percent confidence interval:
 0.5279280 0.8207295
sample estimates:
      cor
0.7029752
corr.test()
- In the psych package, recursive processing is possible
- The correlation coefficient is calculated and the detection value is given
pcor.test()
- Partial correlation test
- parameter
- Partial correlation coefficient
- Number of variables
- Number of samples
- Return value
- t statistic
- Degrees of freedom
- P value
Grouping data correlation test
- t-test
- Compares whether the difference between two means is significant
- Mainly used for small samples (n < 30) from normal distributions with unknown population standard deviation
Parameter test
- It is a method to infer the parameters of the overall distribution, such as mean and variance, when the overall distribution form is known (i.e. the data distribution is known, such as satisfying the normal distribution)
Nonparametric test
- When the population variance is unknown or little known, it is a method to infer the population distribution form by using sample data. The nonparametric test method is called nonparametric test because it does not involve parameters related to the overall distribution in the inference process
16. Drawing function
Four drawing systems of R language
- Basic drawing system
- lattice package
- ggplot2 package
- grid package
plot()
- Multiple data types are supported
par()
Set font style, color and other parameters
17. User defined function
Function declaration
myfun <- function(arguments) {
  # function body
}
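A minimal sketch following that template (mystat is a made-up name; the na.rm option mirrors the common option parameters described in section 11):

```r
mystat <- function(x, na.rm = TRUE) {   # option parameter with a default
  list(mean = mean(x, na.rm = na.rm),   # the function body returns a list
       sd   = sd(x, na.rm = na.rm))
}
mystat(c(1, 2, NA, 4))  # NA is ignored thanks to the default na.rm = TRUE
```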
🔺 18. Data analysis practice
18.1 linear regression
- The input to the lm function must be a data frame
- Regression: methods that use one or more predictor variables (also called independent or explanatory variables) to predict a response variable (also called a dependent, criterion, or outcome variable)
> fit <- lm(weight ~ height, data = women)
> fit
Call:
lm(formula = weight ~ height, data = women)
Coefficients:
(Intercept)       height
     -87.52         3.45
> summary(fit)
Call:
lm(formula = weight ~ height, data = women)
Residuals:  # residual = true value minus fitted value; smaller residuals mean a better fit
    Min      1Q  Median      3Q     Max
-1.7333 -1.1333 -0.3833  0.7417  3.1167
Coefficients:  # fitted model: weight = 3.45*height - 87.52
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***  # intercept term
height        3.45000    0.09114   37.85 1.09e-14 ***  # coefficient
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1  # significance stars
Residual standard error: 1.525 on 13 degrees of freedom
Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
# R-squared (0 to 1) judges the goodness of fit: the model explains 99.1% of the variance
F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
# The F statistic tests whether the model as a whole is significant; the smaller p, the more significant
# Read the output bottom-up: first check F (if p >= 0.05 the model has no value), then look at R-squared
- Other functions useful for fitting linear models
Function | Purpose
---|---
summary() | Shows detailed results of the fitted model
coefficients() | Lists the model parameters (intercept and slopes) of the fitted model
confint() | Provides confidence intervals for the model parameters
fitted() | Lists the predicted values of the fitted model
residuals() | Lists the residual values of the fitted model
anova() | Generates an ANOVA table for a fitted model, or compares the ANOVA tables of two or more fitted models
vcov() | Lists the covariance matrix of the model parameters
AIC() | Outputs the Akaike Information Criterion statistic
plot() | Generates diagnostic plots for evaluating the fitted model
predict() | Uses the fitted model to predict response values for a new data set
# Continuing from above: plot the fit
> plot(women$height, women$weight)
> abline(fit)   # draw the fitted regression line
# Improve the fit with a quadratic term
> fit2 <- lm(weight ~ height + I(height^2), data = women)   # for a cubic term, add I(height^3)
> lines(women$height, fitted(fit2), col = "red")            # red curve to distinguish the quadratic fit
18.2 multiple linear regression
> states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
> class(states)
[1] "data.frame"
> fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
> fit

Call:
lm(formula = Murder ~ Population + Illiteracy + Income + Frost, data = states)

Coefficients:
(Intercept)   Population   Illiteracy       Income        Frost
  1.235e+00    2.237e-04    4.143e+00    6.442e-05    5.813e-04

> summary(fit)

Call:
lm(formula = Murder ~ Population + Illiteracy + Income + Frost, data = states)

Residuals:
    Min      1Q  Median      3Q     Max
-4.7960 -1.6495 -0.0811  1.4815  7.6210

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.235e+00  3.866e+00   0.319   0.7510
Population  2.237e-04  9.052e-05   2.471   0.0173 *
Illiteracy  4.143e+00  8.744e-01   4.738 2.19e-05 ***
Income      6.442e-05  6.837e-04   0.094   0.9253
Frost       5.813e-04  1.005e-02   0.058   0.9541
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.535 on 45 degrees of freedom
Multiple R-squared: 0.567,  Adjusted R-squared: 0.5285
F-statistic: 14.73 on 4 and 45 DF,  p-value: 9.133e-08

> options(digits = 4)
> coef(fit)
(Intercept)  Population  Illiteracy      Income       Frost
  1.235e+00   2.237e-04   4.143e+00   6.442e-05   5.813e-04
Judging variable relationships (AIC)
- With multiple predictors, pay attention to the relationships between pairs of variables. If two variables interact but the form of the relationship is uncertain, connect them with a colon to model the interaction, as in the following example
> fit <- lm(mpg ~ hp + wt + hp:wt, data = mtcars)
# hp and wt interact, but it is not clear what the relationship is, so connect them with a colon
> summary(fit)

Call:
lm(formula = mpg ~ hp + wt + hp:wt, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max
-3.063 -1.649 -0.736  1.421  4.551

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 49.80842    3.60516   13.82  5.0e-14 ***
hp          -0.12010    0.02470   -4.86  4.0e-05 ***
wt          -8.21662    1.26971   -6.47  5.2e-07 ***
hp:wt        0.02785    0.00742    3.75  0.00081 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.15 on 28 degrees of freedom
Multiple R-squared: 0.885,  Adjusted R-squared: 0.872
F-statistic: 71.7 on 3 and 28 DF,  p-value: 2.98e-13
Akaike Information Criterion (AIC)
- Considers both the model's goodness of fit and the number of parameters used for fitting
- The smaller the AIC, the better: it means fewer variables achieve a comparable fit
- With many candidate parameters, stepwise regression or all-subsets regression can help choose among models
  - Stepwise regression: remove or add one variable at a time until the AIC stops improving (stepAIC() function in the MASS package)
    - Removing variables: backward stepwise regression
    - Adding variables: forward stepwise regression
  - All-subsets regression (regsubsets() function in the leaps package)
    - Fits every possible model and selects the best one
    - Takes longer when there are many variables
- Beware: a model that merely maximizes fit can lack practical significance and be useless
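A sketch of backward stepwise regression with stepAIC(), using the states example from above (MASS ships with R):

```r
library(MASS)

# rebuild the states data frame from the built-in state.x77 matrix
states <- as.data.frame(state.x77[, c("Murder", "Population",
                                      "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)

# remove one variable at a time while the AIC keeps decreasing;
# here Frost and Income are dropped, leaving Population + Illiteracy
step_fit <- stepAIC(fit, direction = "backward")
coef(step_fit)
```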
18.3 regression diagnosis
- Questions the diagnosis answers
  - Is this model the best possible model?
  - To what extent does the model satisfy the statistical assumptions of the OLS model?
  - Does the model hold up when tested against more data?
  - If the model's fit indices are poor, how should we proceed?
  - ......
- Methods for testing the statistical assumptions
  - summary(): generates the various fit indices
  - plot(): passing the result of lm() generates four diagnostic plots
> opar <- par(no.readonly = TRUE)
> fit <- lm(weight ~ height, data = women)
> par(mfrow = c(2, 2))   # display the four plots on one screen
> plot(fit)
- lm() fitting is appropriate only when the statistical assumptions of the OLS (ordinary least squares) model are met; note that independence cannot be judged from the four plots
  - Normality: for fixed values of the independent variables, the dependent variable is normally distributed
  - Independence: the values of the dependent variable are independent of one another
  - Linearity: the dependent variable is linearly related to the independent variables
  - Homoscedasticity: the variance of the dependent variable does not change with the level of the independent variables (also called constant variance)
- Validating the model by sampling
  - Suppose the data set has 1000 samples; randomly select 500 of them for the regression analysis
  - After the model is built, use the predict() function on the remaining 500 samples and compare the residuals
  - If the predictions are accurate, the model holds up; otherwise, adjust the model
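A minimal sketch of this hold-out validation, using the small built-in mtcars data in place of the 1000-sample set described above (the split sizes and formula are illustrative):

```r
set.seed(42)                          # reproducible random split
n <- nrow(mtcars)
idx <- sample(n, n %/% 2)             # half the rows for training
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

fit <- lm(mpg ~ wt, data = train)     # build the model on the training half
pred <- predict(fit, newdata = test)  # predict the held-out half

test_resid <- test$mpg - pred         # hold-out residuals
mean(abs(test_resid))                 # average absolute prediction error
```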
18.4 analysis of variance
- Used to test the significance of differences between the means of two or more samples. Analysis of variance is also a form of regression analysis, but linear regression usually has a continuous dependent variable; when the independent variables are factors, the focus of the research typically shifts from prediction to comparing differences between groups.
- The aov() function performs the analysis (the order of the variables matters)
Design | Expression
---|---
One-way ANOVA | y ~ A
One-way ANCOVA with one covariate | y ~ x + A
Two-way factorial ANOVA | y ~ A * B
Two-way factorial ANCOVA with two covariates | y ~ x1 + x2 + A * B
Randomized block | y ~ B + A (B is the blocking factor)
One-way within-groups ANOVA | y ~ A + Error(Subject/A)
Repeated-measures ANOVA with one within-groups factor (W) and one between-groups factor (B) | y ~ B * W + Error(Subject/W)
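A minimal one-way ANOVA (the y ~ A design above) on the built-in PlantGrowth data, chosen here only as an illustration: does plant weight differ among the control and two treatment groups?

```r
# group is a factor with levels ctrl, trt1, trt2
fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)   # F test for differences among the group means
```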
18.5 power analysis
- Regression analysis and analysis of variance model the data and judge the relationships within it
- Power analysis determines the sample size needed to detect a given effect size at a given confidence level; conversely, it can calculate the probability of detecting a given effect size with a given sample size at a given confidence level
Theoretical basis (given any three, the fourth can be derived)
- Sample size: the number of observations in each condition/group of the experimental design
- Significance level (alpha): the probability of a Type I error, i.e. of "finding" an effect that does not actually exist
- Power: 1 minus the probability of a Type II error, i.e. the probability of detecting a real effect
- Effect size: the magnitude of the effect under the alternative (research) hypothesis; how it is expressed depends on the statistical method used in the hypothesis test
Linear regression power analysis example
- Use the pwr package in R
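A sketch of the linear-regression case with pwr.f2.test(), assuming the pwr package is installed; the effect size f2 = 0.15 follows Cohen's convention for a "medium" effect and is an illustrative choice:

```r
library(pwr)

# How many subjects are needed to detect a medium effect (f2 = 0.15)
# in a regression with 3 predictors, at alpha = 0.05 and 90% power?
result <- pwr.f2.test(u = 3, f2 = 0.15, sig.level = 0.05, power = 0.90)
result$v   # denominator degrees of freedom; required n = v + u + 1
```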
18.6 generalized linear model
- There are many types of models. Linear regression and analysis of variance are based on the assumption of a normal distribution. The generalized linear model extends the linear-model framework to include analyses with non-normally distributed dependent variables
glm() function
- Similar to lm(), but with an extra parameter: the probability distribution family and its corresponding default link function
- Parameter estimation is based on maximum likelihood
Poisson regression
- A regression analysis for modeling count data and contingency tables. Poisson regression assumes the dependent variable follows a Poisson distribution and that the logarithm of its mean can be modeled by a linear combination of unknown parameters
Logistic regression
- A very useful tool for predicting a binary outcome variable from a series of continuous or categorical predictor variables
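A minimal logistic-regression sketch with glm() on the built-in mtcars data; the choice of outcome and predictors is illustrative, not from the source:

```r
# Predict transmission type (am: 0 = automatic, 1 = manual)
# from horsepower and weight
fit <- glm(am ~ hp + wt, data = mtcars, family = binomial())
summary(fit)

# Predicted probability of a manual transmission for a hypothetical car
predict(fit, newdata = data.frame(hp = 120, wt = 2.8), type = "response")
```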
18.7 principal component analysis / factor analysis
principal component analysis
- Like factor analysis, an unsupervised method used to simplify complex multivariate relationships
- A data dimensionality-reduction technique that transforms a large number of correlated variables into a small set of uncorrelated variables, called principal components. Each principal component is a linear recombination of the original variables: many correlated indicators are recombined into a new set of independent composite indicators
- The princomp() function in base R, or the principal() function in the psych package
factor analysis
- A family of methods for discovering the latent structure of a set of variables: by finding a smaller set of latent (hidden) factors, they explain the observed relationships among the manifest variables. Factor analysis is a generalization of principal components
- Finding the common factors, and then expressing and interpreting them, is harder than in principal component analysis
Steps of principal component analysis and factor analysis
- Data preprocessing
- Select analysis model
- Judge the number of principal components/factors to retain (analyze with a scree plot: fa.parallel() function)
- Select principal component / factor
- Rotate principal component / factor (optional)
- Interpretation results
- Calculate the principal component or factor score, which is also optional
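A sketch of the steps above using the psych package (assumed installed) on the built-in USJudgeRatings data; the dataset and the choice of one component are illustrative:

```r
library(psych)

ratings <- USJudgeRatings[, -1]   # drop the first column (number of contacts)

# Step 3: scree plot / parallel analysis to judge the number of components
fa.parallel(ratings, fa = "pc", n.iter = 100)

# Steps 4 and 7: extract one principal component and its scores
pc <- principal(ratings, nfactors = 1, scores = TRUE)
pc$loadings    # how strongly each rating loads on the component
head(pc$scores)
```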