Introduction to R language

R language

Video link: https://www.bilibili.com/video/BV19x411X7C6

Data analysis process

Data collection - data storage - data analysis - data mining - data visualization - decision making

1 RStudio use

1.1 INTRODUCTION

  • Tab completion

    • Blue: functions
    • Box icon: data frames
    • Pink: built-in datasets
  • Alt+Shift+K: Show shortcuts

1.2 Foundation

  • list.files() / dir(): view the files in the working directory
  • No declaration is required before variable assignment

1.3 Migrating R packages

Rpack <- installed.packages()[,1]
save(Rpack,file="Rpack.Rdata")

#On the new machine, load Rpack.Rdata, then install the packages one by one
load("Rpack.Rdata")
for (i in Rpack) install.packages(i)

2 data structure

2.1 R object

  • Vector, scalar
  • matrix
  • array
  • list
  • Data frame
  • factor
  • time series

2.2 vector

  • A one-dimensional array used to store numeric, character, or logical data
  • Create a vector with function c

Note: strings in R must be quoted; otherwise they are treated as object names

  • seq() generates an arithmetic sequence
  • rep() repeats elements

All elements of a vector must be of the same type

Vectorized operations are central to R because it is statistical software: they are efficient and avoid explicit loops
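The vector constructors above can be sketched as follows (the values are illustrative):

```r
# c() combines values into a vector
x <- c(2, 4, 6)
# seq() generates an arithmetic sequence
s <- seq(from = 1, to = 9, by = 2)   # 1 3 5 7 9
# rep() repeats elements
r <- rep(c("a", "b"), times = 3)     # "a" "b" "a" "b" "a" "b"
```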

Vector index

  • The vector in R starts from 1, not 0

  • If a negative index is used, the element at that position is excluded from the output

    #Index by position; positive and negative indices cannot be mixed
    > x[c(4:18)]
     [1]  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
    #A logical vector keeps the elements where the value is TRUE; if it is shorter than the vector it is recycled, and indexing past the end yields NA
    y[c(T,T,T,F,F,F)]
    y[y>5]#Elements of y greater than 5
    y[y>5 & y<9]
    
    #String access
    >z <- c("one","two","three")
    >"one" %in% z
    [1] TRUE
    
    > v
    [1] 1 2 3 4 5 6
    > v[20]=4
    > v
     [1]  1  2  3  4  5  6 NA NA NA NA NA NA NA NA NA NA NA NA NA  4
    > append(x = v,values = 99,after = 4)
     [1]  1  2  3  4 99  5  6 NA NA NA NA NA NA NA NA NA NA NA NA NA  4
    

Vector operation

  • %%: modulo (remainder of division)
  • %/%: integer division

When two vectors have unequal lengths, the shorter one is recycled; the longer length must be a multiple of the shorter, otherwise R issues a warning

Logical comparison: positions where x > 5 holds become TRUE, the rest FALSE
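A minimal sketch of the operators and rules above, with made-up values:

```r
x <- 1:6
x %% 4          # modulo: 1 2 3 0 1 2
x %/% 4         # integer division: 0 0 0 1 1 1
x + c(10, 20)   # the shorter vector is recycled: 11 22 13 24 15 26
x > 5           # logical comparison: FALSE FALSE FALSE FALSE FALSE TRUE
```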

  • ceiling(): returns the smallest integer not less than x
  • floor(): returns the largest integer not greater than x
  • trunc(): returns the integer part
  • round(): rounds; the first argument is a vector, the second the number of decimal places to keep
  • signif(): similar to round, but the second argument is the number of significant digits
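The rounding functions in one sketch:

```r
ceiling(2.1)         # 3: smallest integer not less than 2.1
floor(2.9)           # 2: largest integer not greater than 2.9
trunc(-2.7)          # -2: integer part only
round(3.14159, 2)    # 3.14: two decimal places
signif(123.456, 2)   # 120: two significant digits
```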

Statistical function

  • sum(): sum
  • max() / min(): maximum or minimum value
  • range(): minimum and maximum values
  • mean(): mean
  • var(): variance
  • median(): median
  • prod(): product of all elements
  • which(): returns the indices for which a condition holds
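The statistical functions applied to a small example vector:

```r
x <- c(4, 1, 7, 3)
sum(x)        # 15
range(x)      # 1 7
mean(x)       # 3.75
var(x)        # 6.25 (sample variance)
median(x)     # 3.5
prod(x)       # 84
which(x > 3)  # 1 3: indices of elements greater than 3
```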

2.3 matrix

  • The data format of each element is required to be the same
  • Generally, it is arranged by column first. You can set byrow=T to arrange by row

matrix(): creates a matrix, reshaping a one-dimensional vector into the given dimensions

> x <- 1:20
> x
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
> v <- matrix(x,4,5)
> v
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20
#Rename matrix rows and columns
> cname <- c("C1","C2","C3","C4","C5")
> rname <- c("R1","R2","R3","R4")
> dimnames(v) <- list(rname,cname)
> v
   C1 C2 C3 C4 C5
R1  1  5  9 13 17
R2  2  6 10 14 18
R3  3  7 11 15 19
R4  4  8 12 16 20

dim

#Distributive dimension
> dim(x) <- c(4,5)
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20

Index of matrix

  • m[2,]: the second row
  • m[1,2]: the element in row 1, column 2
  • m[1,c(2,3,4)]: row 1, columns 2, 3, and 4
  • m[-1,2]: column 2 with row 1 removed
  • If the matrix has row and column names, it can also be indexed by those names
  • Note whether a row or a column is being accessed: to take a whole row or column, keep the comma and leave the other index empty

Matrix operation

  • Matrix arithmetic works as in linear algebra; the row and column dimensions must be conformable
  • colSums(): sum of each column
  • rowSums(): sum of each row
  • colMeans(): mean of each column
  • diag(): values on the diagonal
  • m*n: element-wise product
  • m %*% n: matrix multiplication
  • t(m): transpose of m
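A sketch of these operations on a small 2x2 matrix:

```r
m <- matrix(1:4, 2, 2)   # filled column-first: rows (1,3) and (2,4)
colSums(m)   # 3 7
rowSums(m)   # 4 6
diag(m)      # 1 4
m * m        # element-wise product
m %*% m      # matrix multiplication: rows (7,15) and (10,22)
t(m)         # transpose
```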

2.4 array

  • An array extends a matrix to more than two dimensions
  • Used less often
> z <- array(1:24)
> z
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
> dim1 <- c("A1","A2")
> dim2 <- c("B1","B2","B3")
> dim3 <- c("C1","C2","C3","C4")
> v <- array(z,c(2,3,4),dimnames = list(dim1,dim2,dim3))
> v
, , C1

   B1 B2 B3
A1  1  3  5
A2  2  4  6

, , C2

   B1 B2 B3
A1  7  9 11
A2  8 10 12

, , C3

   B1 B2 B3
A1 13 15 17
A2 14 16 18

, , C4

   B1 B2 B3
A1 19 21 23
A2 20 22 24

2.5 list

  • An ordered collection of objects that can store combinations of several vectors, matrices, data frames, and even other lists
  • The most complex and important
  • Is a one-dimensional data set

list(): creates a list

Elements can be given names, similar to a dictionary

> a <- (1:20)
> a
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
> b <- "Hello"
> b
[1] "Hello"
> c <- matrix(a,4,5)
> c
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20
#Create list
> mlist <- list(a,b,c)
> mlist
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

[[2]]
[1] "Hello"

[[3]]
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20

> mlist[1]
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
#Pass a vector of indices to access several list elements at once

mlist$  (pressing Tab after $ auto-completes the element names)
#Note: $ access requires named elements, e.g. mlist <- list(first=a, second=b, third=c)
> mlist$first
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

Single brackets return a sub-list; double brackets return the element itself
> class(mlist[1])
[1] "list"
> class(mlist[[1]])
[1] "integer"

Adding a list element also uses double brackets
Deleting a list element: assign NULL to that position
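These access and modification rules in one sketch (the element names are illustrative):

```r
ml <- list(first = 1:5, second = "Hello")
class(ml[1])     # "list": single brackets give a sub-list
class(ml[[1]])   # "integer": double brackets give the element itself
ml[["third"]] <- c(TRUE, FALSE)   # add an element with double brackets
ml$second <- NULL                 # delete an element by assigning NULL
names(ml)        # "first" "third"
```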

2.6 data frame

  • A tabular data structure designed to simulate a dataset
  • It is a rectangular structure: rows represent observations and columns represent variables. Unlike a matrix, the columns may hold different data types
  • In essence it is a list whose elements are vectors of equal length; hence the rectangular shape, and each column must be named
  • An Excel sheet has the same structure as a data frame

data.frame(): creates a data frame

Rows and columns can also be indexed by their names

The indexing method is similar to the above

  • When using lm for linear regression, you only need to give the column name
  • attach(): attaches the data frame so columns can be referenced without the $ symbol, i.e. you can type a column name directly
  • detach(): detaches it; afterwards you must again write dataframe$column to get the data
  • with(mtcars, hp): same effect; the first argument is the data frame, the second an expression using its columns
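A minimal sketch of with() versus $ access, using the built-in mtcars dataset:

```r
# with() evaluates an expression using the data frame's columns
m1 <- with(mtcars, mean(hp))
# equivalent to explicit $ access
m2 <- mean(mtcars$hp)

# attach() makes the columns visible directly; detach() undoes it
attach(mtcars)
m3 <- mean(hp)
detach(mtcars)
```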

2.7 factor

  • The possible values of a categorical variable are called its levels
    • Nominal variables: independent categories with no order
    • Ordinal variables: ordered categories
    • Continuous variables: values in a continuous range
  • Nominal and ordinal variables are represented as factors; the possible values of such a variable are its levels. For example, good, better, and best form three levels, and a vector of these values is a factor
    • Purpose: suited to recording treatment levels or other categorical variables encountered in a study
      • Its main use is classification and computing counts and frequencies
    • Applications
      • Frequency counts, independence tests, correlation tests, analysis of variance, principal component analysis, factor analysis, etc.
    • Many plotting tools use factors
  • table(): counts the occurrences of each level of a factor
  • cut(): bins values into intervals
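A sketch of factors, table(), and cut() (the level names are illustrative):

```r
ratings <- factor(c("good", "best", "good", "better"),
                  levels = c("good", "better", "best"), ordered = TRUE)
table(ratings)   # counts per level: good 2, better 1, best 1

# cut() bins a numeric vector into intervals
bins <- cut(c(1, 4, 6, 9), breaks = c(0, 5, 10))
levels(bins)     # "(0,5]" "(5,10]"
```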

2.8 time series

  • Time series analysis
    • Used to predict

Processing of time series

ts(): generate time series

3. Missing data

  • NA means "not available" and marks missing information; it is not necessarily 0 — a missing value and a value of 0 are completely different things
  • When a vector contains NA, sum it with sum(vector, na.rm=TRUE)
  • is.na(a): checks a for NA, returning TRUE where values are missing; useful for testing a data set
  • colSums(is.na(df)): counts the missing values per column
  • na.omit(): removes NA values from a vector (and incomplete rows from a data frame)
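The NA-handling functions in one sketch:

```r
x <- c(1, 2, NA, 4)
sum(x)                  # NA: a single missing value poisons the sum
sum(x, na.rm = TRUE)    # 7
is.na(x)                # FALSE FALSE TRUE FALSE
sum(is.na(x))           # 1: counts the missing values
as.numeric(na.omit(x))  # 1 2 4
```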

Processing missing value packages

Identify missing values

  • Delete missing values
    • Listwise deletion (removing rows): na.omit()
    • Pairwise deletion: some functions provide an option for it
  • Maximum likelihood estimation: mvnmle package
  • Impute missing values
    • Single imputation (simple): Hmisc package
    • Multiple imputation: mi, mice, Amelia, and mitools packages

Other missing data

  1. NaN ("not a number") represents an impossible value, e.g. 0/0

  2. Inf represents infinity: Inf is positive infinity and -Inf is negative infinity

is.na() also reports NaN as missing; is.nan() and is.infinite() test these cases specifically, again returning TRUE and FALSE
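A quick check of how NaN and Inf behave:

```r
v <- c(0/0, 1/0, -1/0, NA)   # NaN, Inf, -Inf, NA
is.nan(v)       # TRUE FALSE FALSE FALSE
is.infinite(v)  # FALSE TRUE TRUE FALSE
is.na(v)        # TRUE FALSE FALSE TRUE: is.na() treats NaN as missing too
```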

4. String

Functions for processing strings (strings in R also support regular expressions)

  • nchar(): returns the length of each string element (including spaces); non-string elements are coerced to strings first
  • length(): returns the number of vector elements
  • paste(): concatenates strings; sep = "-" joins the pieces with -; with vector arguments the elements are combined pairwise
  • substr(): extracts characters, from start to stop, of each string element
  • toupper() / tolower(): converts to uppercase / lowercase
  • gsub(): global pattern substitution (replaces all matches)
  • strsplit(): splits strings
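The string functions in one sketch:

```r
nchar("a bc")                  # 4: spaces count
paste("x", "y", sep = "-")     # "x-y"
substr("statistics", 1, 4)     # "stat"
toupper("abc")                 # "ABC"
gsub("s", "S", "statistics")   # "StatiSticS": every match is replaced
strsplit("a,b,c", ",")[[1]]    # "a" "b" "c"
```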

5. Obtain data

5.1 obtaining data by keyboard

> patientID <- c(1, 2, 3, 4)
> admdate <- c("10/15/2009","11/01/2009","10/21/2009","10/28/2009")
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> data <- data.frame(patientID,age,diabetes,status)
> data
  patientID age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
4         4  52    Type1      Poor

> data2 <- data.frame(patientID=character(0),admdate=character(0),age=numeric(),diabetes=character(),status=character())
> data2 <- edit(data2)  #This opens an editor window for modifying the data; undefined cells are NA. The result must be assigned to a variable like this, otherwise the edits made in the editor are lost
> data2
  patientID admdate age diabetes status
1         5    <NA>  NA     <NA>   <NA>
2         6    <NA>  NA     <NA>   <NA>
3         7    <NA>  NA     <NA>   <NA>
4         5    <NA>  NA     <NA>   <NA>

#The fix() function does the same and saves the edits in place

5.2 by reading data stored on external files

5.3 obtaining data by accessing the database system

6. Read file

6.1 read.table()

  • Place the file in the workspace directory
  • Use read.table("filename") and save the result in a variable
  • Note: it is generally used to read txt files; csv files work too with the right separator
    • You can use the head() and tail() functions to view the first and last lines
    • Full path can be used
    • sep sets the delimiter used in the file; txt defaults to whitespace, and csv files need sep = ","
    • header=TRUE, indicating that the first row of data is used as a header instead of data
    • Set skip to skip some contents. Set skip=5 to read data from the sixth row
    • nrows=100 reads only 100 rows; it can be combined with skip
    • na.strings: declares which symbol marks missing values, so it can be replaced with NA
    • stringsAsFactors: controls whether strings are converted to factors (generally set to FALSE)
  • Data on the clipboard can be read
    • Select a region in Excel, then read.table("clipboard", header=T, sep="\t")
    • readClipboard() also reads the clipboard contents directly (Windows only)

6.2 read.csv()

6.3 read.delim()

  • The other read functions are simplified versions of read.table(); only the default delimiter differs

6.4 reading network files

  • Can read csv, txt, and other text files over http and similar protocols
  • But this is error-prone; for complex pages, web-scraping tools are more suitable

6.5 compressed files

  • Can be read directly
> read.table(gzfile("input.txt.gz"))

6.6 reading non-standard files

  • readLines()
    • Reads the file line by line
    • The parameter n limits the number of lines read in
  • scan()
    • Read one unit at a time
    • The first parameter represents the file address
    • what: the type of item expected to be read in

7. Write file

7.1 write.table()

  • By default it saves to the working directory. The path must already exist; R will not create new directories
  • You can specify a different separator and save into different file types
  • Row names are written by default; add row.names = FALSE to omit them. If the data already carries its own row numbers, a negative index can remove that column
  • Strings are double-quoted by default; set quote to FALSE to disable this
  • na: sets how missing values are written
  • Files with the same name are overwritten; set the parameter append to TRUE to append instead
  • Output can be written directly as a compressed file; the file extension should match
  • To write R results in a format supported by other software, use the foreign package
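A round-trip sketch of the options above: write a small data frame to a temporary csv file and read it back (the file name comes from tempfile()):

```r
df <- data.frame(id = 1:3, val = c("a", "b", "c"))
f <- tempfile(fileext = ".csv")
write.table(df, f, sep = ",", row.names = FALSE, quote = FALSE)
back <- read.table(f, header = TRUE, sep = ",")
unlink(f)   # remove the temporary file
```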

8. Read and write excel files

  • If the Excel file contains many macros, it is not suitable for direct reading
  • openxlsx only handles xlsx files
install.packages("openxlsx")
library(openxlsx)
a <- read.xlsx("exercise1.xlsx",sheet=1)#File name + sheet number, simple and direct

Write xlsx file

#Write the contents of the file to variable a
> a <- read.xlsx("data.xlsx")

> library(openxlsx)
> write.xlsx(a,file = "c.xlsx",sheetName = "she")

9. Read and write R format files

  • When a dataset is saved, R automatically compresses it in its internal file format and stores all R metadata associated with the object
  • This matters when the data contains factors, dates and times, or class attributes, since these are preserved

RDS files

  • Can store only a single object
The examples use iris, a dataset bundled with R
> saveRDS(iris,file = "iris.RDS")
> iris <- readRDS("E:/mathmodel/R_studio/RS_project/iris.RDS") #Double click to open RDS file auto import
> x <- readRDS("iris.RDS")

RData file

  • Stores multiple variables of any type, similar to a project file
  • Packages, files, and other session information are saved
  • save.image(): saves the workspace
  • load(): loads a workspace

10. Data conversion

  • Type conversion is done with functions of the as.* family
  • as.data.frame can be cast to data frame format
  • A matrix can be converted directly to a data frame; converting a data frame to a matrix forces all columns to a single type, because a matrix cannot hold mixed types

10.1 subset

Method 1

> who <- read.csv("WHO.csv")
> who1 <- who[c(1:50),c(1:10)]
> View(who1)

> who4 <- who[which(who$CountryID>50 &who$CountryID<=100),]#Note the comma

Method 2

> who5 <- subset(who,who$CountryID>50 & who$CountryID<=100)

random sampling

> x <- 1:20
> sample(x,10)#The replace parameter is F by default, and there is no duplicate sampling
 [1] 13  5 11  1 17  4  8 20  6 12
 
 > who7 <- who[sort(sample(who$CountryID,30,replace = F)),]

10.2 delete fixed line

The built-in data set mtcars is used

mtcars[-1:-5,] #Delete fixed row
mtcars[,-1:-5] #Delete fixed column

10.3 data frame consolidation

cbind(dataset1, dataset2)  #Adds columns
rbind(dataset1, dataset2)  #Adds rows; both datasets must have the same column names

Overlapping rows are not deduplicated automatically

duplicated(data4)  #Returns TRUE for rows that duplicate an earlier row, FALSE otherwise
data4[duplicated(data4),]  #Extracts the duplicated rows
data4[!duplicated(data4),]  #Keeps only the non-duplicated rows, i.e. removes duplicates

Alternatively, unique() does this in one step:
unique(data4)

10.4 data flip

t() function

  • Transposes rows and columns directly

rev() function

  • Reverses the order of elements
  • Works on vectors and on data frames (where it reverses the columns)

transform() function

  • Directly modifies (or adds) the values of a column
transform(women,height=height*2.54) #In this way, the height column in the women dataset becomes 2.54 times the original

10.5 sorting of data frames

  • sort() can only sort vectors, not data frames
  • order() also works on vectors, but returns the sorting indices instead of the sorted vector
    • Prefix the argument with a - sign to sort in reverse order, like the effect of rev()
    • Several columns can be sorted at once, e.g. df[order(df$a, df$b),]
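sort() versus order() in one sketch (df is a made-up data frame):

```r
v <- c(30, 10, 20)
sort(v)        # 10 20 30: the sorted values
order(v)       # 2 3 1: the index permutation that sorts v
v[order(-v)]   # 30 20 10: decreasing order via the - sign

df <- data.frame(a = c(2, 1, 2), b = c(9, 5, 3))
df[order(df$a, df$b), ]   # sort rows by a, then by b
```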

10.6 data frame calculation (apply series functions)

The key point is that FUN is itself passed as an argument

apply()

  • Applicable to data frame or matrix

  • MARGIN=1 for row operation, = 2 for column operation

  • FUN: function used
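A minimal apply() sketch on a 2x3 matrix:

```r
m <- matrix(1:6, nrow = 2)       # rows (1,3,5) and (2,4,6)
apply(m, MARGIN = 1, FUN = sum)  # row sums: 9 12
apply(m, MARGIN = 2, FUN = max)  # column maxima: 2 4 6
```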

Grouping calculation

  • tapply()
    • Parameters: vector
    • Return value: vector
  • apply()
    • Parameters: list, data frame, array
    • Return value: vector, matrix

Multi parameter calculation

  • mapply()
    • Parameter: vector, unlimited number
    • Return value: vector, matrix

Cyclic iteration

  • lapply()
    • Parameters: list, data frame
    • Return value: list
  • Simplified version: sapply()
    • Returns a vector or matrix
    • vapply() additionally lets the return type be specified
  • Recursive version: rapply()
    • Parameter: list
    • Return value: list
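The main family members side by side (all data are made up):

```r
# tapply(): grouped calculation on a vector
tapply(c(1, 2, 3, 4), c("a", "b", "a", "b"), FUN = sum)   # a: 4, b: 6

# lapply() returns a list; sapply() simplifies it to a vector
lapply(list(x = 1:3, y = 4:6), mean)
sapply(list(x = 1:3, y = 4:6), mean)   # x: 2, y: 5

# mapply() iterates over several vectors in parallel
mapply(function(a, b) a + b, 1:3, 4:6)   # 5 7 9
```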

Environment space traversal

  • eapply()
    • Parameter: environment
    • Return value: list
> state.name
 [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
 [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
 [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
[13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
[17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
[21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
[25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
[29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
[33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
[37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
[41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
[45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
[49] "Wisconsin"      "Wyoming"    

> state.division
 [1] East South Central Pacific            Mountain           West South Central
 [5] Pacific            Mountain           New England        South Atlantic    
 [9] South Atlantic     South Atlantic     Pacific            Mountain          
[13] East North Central East North Central West North Central West North Central
[17] East South Central West South Central New England        South Atlantic    
[21] New England        East North Central West North Central East South Central
[25] West North Central Mountain           West North Central Mountain          
[29] New England        Middle Atlantic    Mountain           Middle Atlantic   
[33] South Atlantic     West North Central East North Central West South Central
[37] Pacific            Middle Atlantic    New England        South Atlantic    
[41] West North Central East South Central West South Central Mountain          
[45] New England        South Atlantic     Pacific            South Atlantic    
[49] East North Central Mountain          
9 Levels: New England Middle Atlantic South Atlantic ... Pacific


> tapply(state.name,state.division,FUN=length) #Groups state.name by the factor state.division
       New England    Middle Atlantic     South Atlantic East South Central 
                 6                  3                  8                  4 
West South Central East North Central West North Central           Mountain 
                 4                  5                  7                  8 
           Pacific 
                 5 

10.7 data centralization and standardization

Data centering: subtracting the mean of the data set from each value in it

Data standardization: after centering, dividing by the standard deviation of the data set

scale() function

  • Parameters
    • x: the data
    • center: subtract the column means
    • scale: divide by the column standard deviations
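Centering versus standardization in one sketch:

```r
x <- c(2, 4, 6)
scale(x, center = TRUE, scale = FALSE)   # centering only: -2 0 2
z <- as.numeric(scale(x))                # centered, then divided by the sd
mean(z)   # 0
sd(z)     # 1
```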

10.8 reshape2 package

  • First the data is melted into long form; the specified id columns serve as identifiers, and the remaining columns are stacked into a single column
> library(reshape2)
> head(airquality)
> names(airquality) <- tolower(names(airquality))
> head(airquality)
> aql <- melt(airquality)
> head(aql)

> head(airquality)
  ozone solar.r wind temp month day
  
> aql <- melt(airquality,id.vars=c("month","day"))
> aql
    month day variable value
  • acast()
  • dcast()

10.9 tidyr package

  • Features: relatively simple

  • Each column represents a variable

  • Each row represents an observation

  • A variable and an observation determine unique values

gather()

  • The chosen columns are gathered into key-value pairs: the column names go into a key column and the values into a value column

spread()

  • Contrary to gather

unite(): merge multiple columns into one column

  • The opposite of separate

separate(): a column is separated into multiple columns

  • col: the column to operate on
  • into=c("A", "B"): column names after splitting
  • sep: the separator

10.10 dplyr package

  • You can operate single tables or double tables
  • Use double colons (dplyr::) when calling, to avoid ambiguity with same-named functions from other packages
  • dplyr::distinct(): removes duplicate rows
  • dplyr::filter(): keeps rows matching a Boolean condition
  • dplyr::slice(): slices out arbitrary rows by position
  • sample_n(dataset, 10): randomly selects 10 rows
  • sample_frac(dataset, 0.1): randomly samples a fraction of 0.1
  • arrange(): sorts (wrap a column in desc() for reverse order)
  • The select() function takes a subset
    • Based on the column name, or the beginning or end of a character

10.11 chained operator %>%

  • Passes the output of one function to the next function as its input
  • You can use multiple chained operators to pass content

11. R function

  • Option parameters
    • Input control section
    • Output control section
    • Adjustment part
  • Common options
    • file: takes a file name
    • Data: generally refers to entering a data frame
    • x: Represents a single object, generally a vector, or a matrix or list
    • x and y: the function requires two input variables
    • x, y, z: the function requires three input variables
    • formula: formula
    • na.rm: delete missing values
    • ...: further arguments passed on to other functions
  • tune parameter
    • main: a string (the title), not a vector
    • na.rm: TRUE or FALSE
    • axis: the side parameter can only be 1 to 4 (it determines which side of the drawing area)
    • fig: vector containing four elements

12. Data statistics function

12.1 functions

  • d: probability density function
  • p: cumulative distribution function
  • q: quantile function (inverse of the distribution function)
  • r: random number generation following the distribution

The function name is formed by adding the d, p, q, or r prefix to the distribution's abbreviation

  • rnorm: random number function of the normal distribution

  • The same is true for other probability distribution functions

  • Knowing the distribution can draw various curves
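The four prefixes applied to the normal distribution:

```r
dnorm(0)     # density at 0: 1/sqrt(2*pi)
pnorm(0)     # distribution function: P(X <= 0) = 0.5
qnorm(0.5)   # quantile function, the inverse of pnorm: 0
set.seed(1)
rnorm(3)     # three random draws from N(0, 1)
```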

Distribution                Abbrev      Distribution                Abbrev
Beta                        beta        Logistic                    logis
Binomial                    binom       Multinomial                 multinom
Cauchy                      cauchy      Negative binomial           nbinom
(Non-central) chi-squared   chisq       Normal                      norm
Exponential                 exp         Poisson                     pois
F                           f           Wilcoxon signed rank        signrank
Gamma                       gamma       t                           t
Geometric                   geom        Uniform                     unif
Hypergeometric              hyper       Weibull                     weibull
Lognormal                   lnorm       Wilcoxon rank sum           wilcox

12.2 generate random number runif()

  • Generates numbers between 0 and 1 by default

  • The parameters min and max can be adjusted

  • To generate the same random numbers twice, fix the seed first:

    • set.seed(66)#Fix the random seed
      runif(50)
      
      runif(50) #These numbers differ from the first batch
      
      set.seed(66)
      runif(50)#Same numbers as the first batch
      

12.3 descriptive statistical functions

summary()

  • Data sets can be counted

fivenum()

  • Similar to summary(); returns the five-number summary

aggregate()

  • Groups values by the specified grouping information and applies a statistical function
  • Only one function can be applied at a time, returning one value per group
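An aggregate() sketch on the built-in mtcars dataset: mean horsepower per cylinder count.

```r
res <- aggregate(hp ~ cyl, data = mtcars, FUN = mean)
res   # one row per group, one statistic per call
```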

summaryBy()

  • Performs several groupings and statistics at once, with the results in one table (from the doBy package)

13. Frequency statistical function

13.1 table()

  • It can directly complete the statistics of frequency
  • Alternatively, treat a column as a factor and use split() to divide the data into separate data frames by its values
  • Can complete one-dimensional or two-dimensional statistics

13.2 xtabs()

  • addmargins()
    • Adds margins to the rows or columns of a contingency table
    • In the second parameter, 1 means rows and 2 means columns
  • The ftable() function converts the result into a flat contingency table

14. Independence test function

  • P value

    • Hypothesis testing
      • Null hypothesis: nothing happened; alternative hypothesis: something happened
    • The P value is the probability, computed assuming the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed
    • The threshold is usually set at 0.05: when P < 0.05 the null hypothesis is rejected, and when P > 0.05 it is not rejected

14.1 chi square test

> library(vcd)
> mytable <- table(Arthritis$Treatment,Arthritis$Improved)
> mytable
         
          None Some Marked
  Placebo   29    7      7
  Treated   13    7     21
> chisq.test(mytable)

	Pearson's Chi-squared test

data:  mytable
X-squared = 13.055, df = 2, p-value = 0.001463

> mytable <- table(Arthritis$Sex,Arthritis$Improved)
> chisq.test(mytable)

	Pearson's Chi-squared test

data:  mytable
X-squared = 4.8407, df = 2, p-value = 0.08889

Warning message:
In chisq.test(mytable) : Chi-squared approximation may be incorrect

14.2 Fisher test

  • Assumes the row and column marginal totals of the contingency table are fixed
> mytable <- xtabs(~Treatment+Improved,data=Arthritis)
> fisher.test(mytable)

	Fisher's Exact Test for Count Data

data:  mytable
p-value = 0.001393
alternative hypothesis: two.sided

14.3 Cochran-Mantel-Haenszel test

  • Tests whether two nominal variables are conditionally independent within each level of a third variable (three variables are required, and their order affects the result)
> mytable <- xtabs(~Treatment+Improved+Sex,data=Arthritis)
> mantelhaen.test(mytable)

	Cochran-Mantel-Haenszel test

data:  mytable
Cochran-Mantel-Haenszel M^2 = 14.632, df = 2, p-value = 0.0006647

> mytable <- xtabs(~Sex+Treatment+Improved,data=Arthritis)
> mantelhaen.test(mytable)

	Mantel-Haenszel chi-squared test with continuity correction

data:  mytable
Mantel-Haenszel X-squared = 2.0863, df = 1, p-value = 0.1486
alternative hypothesis: true common odds ratio is not equal to 1
95 percent confidence interval:
 0.8566711 8.0070521
sample estimates:
common odds ratio 
         2.619048 

15. Correlation function

  • Measure relevance through quantitative indicators

15.1 correlation analysis function

cor()

  • Correlation analysis uses this function
  • Can compute pearson (the default), kendall, and spearman coefficients; other coefficients require extension packages (e.g. ggm)
  • Separate columns can be taken for comparison

cov()

  • Calculate covariance
    • Measures how two variables vary together
    • The problem reflected is similar to cor()
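cor() and cov() on a perfectly linear made-up example:

```r
x <- 1:10
y <- 2 * x + 3
cor(x, y)                        # 1: perfect linear correlation
cor(x, y, method = "spearman")   # 1: rank correlation as well
cov(x, y)                        # equals 2 * var(x) here
```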

ggm package

  • The pcor(u, s) function computes the partial correlation coefficient

  • u is a numeric vector: the first two values are the subscripts of the variables whose correlation is to be computed, and the remaining values are the subscripts of the conditioning variables

    jsbl <- c(1,5)  #Subscripts of the variables whose correlation is to be computed
    tjbl <- c(2,3,6)  #Subscripts of the conditioning (control) variables, i.e. those to be partialled out
    u <- c(jsbl,tjbl)
    s <- cov(pcordata)  #Covariance matrix of the variables
    r <- pcor(u,s)  #Partial correlation coefficient

15.2 correlation test function

  • After the analysis, significance still needs to be quantified as a P value for testing

cor.test()

  • Can compute pearson (the default), kendall, and spearman coefficients; other coefficients require extension packages (e.g. ggm)
> cor.test(state.x77[,3],state.x77[,5])

	Pearson's product-moment correlation

data:  state.x77[, 3] and state.x77[, 5]
t = 6.8479, df = 48, p-value = 1.258e-08
alternative hypothesis: true correlation is not equal to 0

#Confidence interval: an interval estimate of a population parameter constructed from sample statistics; it indicates, with a stated probability, the range around the estimate in which the true parameter value lies
95 percent confidence interval: #confidence interval
 0.5279280 0.8207295
sample estimates:
      cor 
0.7029752 

corr.test()

  • In the psych package; can handle whole correlation matrices
  • Computes the correlation coefficients and gives the corresponding test values

pcor.test()

  • Partial correlation test
  • parameter
    • Partial correlation coefficient
    • Number of variables
    • Number of samples
  • Return value
    • Student's t statistic
    • Degrees of freedom
    • P value

Grouping data correlation test

  • T-test
    • Compares whether the difference between two means is significant
    • Mainly used for small samples (n < 30) of normally distributed data with unknown population standard deviation
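A one-sample t-test sketch on made-up measurements whose mean is exactly 5:

```r
x <- c(5.1, 4.9, 5.0, 5.2, 4.8)
res <- t.test(x, mu = 5)
res$p.value   # very large here, so the null hypothesis is not rejected
```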

Parametric tests

  • Methods for inferring the parameters of the population distribution (such as mean and variance) when the form of the distribution is known (e.g. the data are known to follow a normal distribution)

Nonparametric tests

  • Methods for inferring the form of the population distribution from sample data when the population variance is unknown or little is known about the population. They are called nonparametric because the inference does not involve the parameters of the population distribution

16. Drawing function

Four drawing systems of R language

  • Basic drawing system
  • lattice package
  • ggplot2 package
  • grid package

plot()

  • Multiple data types are supported

par()

Sets graphical parameters such as fonts, colors, point shapes, and line styles
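A minimal sketch of saving, changing, and restoring graphical parameters with par() (the specific settings are illustrative):

```r
opar <- par(no.readonly = TRUE)          # save the current settings
par(pch = 17, col.main = "darkblue")     # point shape, title color
plot(women$height, women$weight, main = "Height vs Weight")
par(opar)                                # restore the saved settings
```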

17. User defined function

Function declaration

myfun <- function(optional parameters)
{
    function body
}
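A concrete instance of the declaration pattern above (the function name and behavior are illustrative):

```r
# summarize a numeric vector: mean, standard deviation, and count
describe_vec <- function(x, na.rm = TRUE) {
  c(mean = mean(x, na.rm = na.rm),
    sd   = sd(x, na.rm = na.rm),
    n    = sum(!is.na(x)))
}

describe_vec(women$height)   # returns a named vector: mean, sd, n
```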

🔺 18. Data analysis practice

18.1 linear regression

  • The input to the lm() function must be a data frame
  • Regression: methods that use one or more predictor variables (also called independent or explanatory variables) to predict a response variable (also called a dependent, criterion, or outcome variable)

> fit <- lm(weight ~ height,data = women)
> fit

Call:
lm(formula = weight ~ height, data = women)

Coefficients:
(Intercept)       height  
     -87.52         3.45  

> summary(fit)

Call:
lm(formula = weight ~ height, data = women)

Residuals:   #Residuals: the differences between the observed values and the predicted values; the smaller the residuals, the more accurate the fit
    Min      1Q  Median      3Q     Max 
-1.7333 -1.1333 -0.3833  0.7417  3.1167 

Coefficients:  #Fitted model: weight = 3.45*height - 87.52
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***   #Intercept term
height        3.45000    0.09114   37.85 1.09e-14 ***   #coefficient
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1  #Star evaluation standard

Residual standard error: 1.525 on 13 degrees of freedom  #Residual standard error
Multiple R-squared:  0.991,	Adjusted R-squared:  0.9903   #R-squared (0 to 1) measures the fit of the model; 0.991 means the model explains 99.1% of the variance in the data
F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14   #The F statistic tests whether the model as a whole is significant, measured by its p value: the smaller p, the more significant

#Read the summary output from the bottom up:
#check the F statistic's p value first. If it is not below 0.05, the model is not useful; if it is below 0.05, then look at R-squared
  • Other functions useful for fitting linear models
    • summary(): shows detailed results of the fitted model
    • coefficients(): lists the model parameters (intercept and slopes) of the fitted model
    • confint(): provides confidence intervals for the model parameters
    • fitted(): lists the predicted values of the fitted model
    • residuals(): lists the residuals of the fitted model
    • anova(): generates an ANOVA table for a fitted model, or compares the ANOVA tables of two or more fitted models
    • vcov(): lists the covariance matrix of the model parameters
    • AIC(): outputs the Akaike information criterion
    • plot(): generates diagnostic plots for evaluating the fitted model
    • predict(): uses the fitted model to predict response values for a new data set
#Continuing from the fit above
> plot(women$height,women$weight)
> abline(fit)
#abline() adds the fitted regression line to the scatter plot

#Improve fit
> fit2 <- lm(weight ~ height+I(height^2),data = women) #Add a quadratic term; for a cubic term, add I(height^3)
> lines(women$height,fitted(fit2),col="red")  #Draw the fitted curve in red over the scatter plot
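A short sketch using two of the helper functions listed earlier (confint and predict) on the same simple fit:

```r
fit <- lm(weight ~ height, data = women)
confint(fit)                               # confidence intervals for intercept and slope
newdata <- data.frame(height = c(60, 70))
predict(fit, newdata)                      # predicted weights at heights 60 and 70
```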

18.2 multiple linear regression

> states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
> class(states)
[1] "data.frame"
> fit <- lm(Murder~Population+Illiteracy+Income+Frost,data=states)
> fit

Call:
lm(formula = Murder ~ Population + Illiteracy + Income + Frost, 
    data = states)

Coefficients:
(Intercept)   Population   Illiteracy       Income        Frost  
  1.235e+00    2.237e-04    4.143e+00    6.442e-05    5.813e-04  
  
> summary(fit)

Call:
lm(formula = Murder ~ Population + Illiteracy + Income + Frost, 
    data = states)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7960 -1.6495 -0.0811  1.4815  7.6210 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.235e+00  3.866e+00   0.319   0.7510    
Population  2.237e-04  9.052e-05   2.471   0.0173 *  
Illiteracy  4.143e+00  8.744e-01   4.738 2.19e-05 ***
Income      6.442e-05  6.837e-04   0.094   0.9253    
Frost       5.813e-04  1.005e-02   0.058   0.9541    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.535 on 45 degrees of freedom
Multiple R-squared:  0.567,	Adjusted R-squared:  0.5285 
F-statistic: 14.73 on 4 and 45 DF,  p-value: 9.133e-08

> options(digits = 4)
> coef(fit)
(Intercept)  Population  Illiteracy      Income       Frost 
  1.235e+00   2.237e-04   4.143e+00   6.442e-05   5.813e-04 

Judge variable relationship (AIC function)

  • When there are multiple variables, pay attention to possible interactions between pairs of variables. If you suspect two variables interact but the form of the relationship is unclear, join them with a colon, as shown below

  • > fit <- lm(mpg~hp+wt+hp:wt,data=mtcars)  #hp and wt are assumed to interact, but the form of the interaction is unknown, so they are joined with a colon
    > summary(fit)
    
    Call:
    lm(formula = mpg ~ hp + wt + hp:wt, data = mtcars)
    
    Residuals:
       Min     1Q Median     3Q    Max 
    -3.063 -1.649 -0.736  1.421  4.551 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept) 49.80842    3.60516   13.82  5.0e-14 ***
    hp          -0.12010    0.02470   -4.86  4.0e-05 ***
    wt          -8.21662    1.26971   -6.47  5.2e-07 ***
    hp:wt        0.02785    0.00742    3.75  0.00081 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 2.15 on 28 degrees of freedom
    Multiple R-squared:  0.885,	Adjusted R-squared:  0.872 
    F-statistic: 71.7 on 3 and 28 DF,  p-value: 2.98e-13
    
  • Akaike information criterion (AIC)

    • Considers both the model's goodness of fit and the number of parameters used in fitting
    • The smaller the AIC, the better: fewer variables achieve a comparable degree of fit
    • With many candidate parameters, stepwise regression or all-subsets regression can be used for model selection
      • Stepwise regression: remove / add one variable at a time until the AIC stops improving (stepAIC() function in the MASS package)
        • Removing: backward stepwise regression
        • Adding: forward stepwise regression
      • All-subsets regression (regsubsets() function in the leaps package)
        • Fits all possible models and selects the best one
        • But with many variables it takes much longer
  • A model that merely maximizes the fit but has no practical interpretation is useless
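A backward stepwise regression sketch on the states data from earlier, assuming the MASS package is installed:

```r
library(MASS)
states <- as.data.frame(state.x77[, c("Murder", "Population",
                                      "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
stepAIC(fit, direction = "backward")  # drops predictors one at a time until AIC stops improving
```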

18.3 regression diagnosis

  • Diagnose problems

    • Is this model the best one?
    • To what extent does the model meet the statistical assumptions of the OLS model?
    • Can the model stand up to more data?
    • If the fit indices of the model are poor, how should we proceed?
    • ......
  • Methods for checking the statistical assumptions

    • summary(): generate various indicators

    • plot(): enter the value after lm analysis to generate four graphs

    • > opar <- par(no.readonly = TRUE)
      > fit <- lm(weight~height,data=women)
      > par(mfrow=c(2,2))  #Set four diagrams to be displayed in one screen
      > plot(fit)
      
  • The lm() function is appropriate for fitting only when the statistical assumptions of the OLS (ordinary least squares) model are met; the four diagnostic plots can check these assumptions, except independence

    1. Normality: for fixed values of the independent variables, the dependent variable is normally distributed
    2. Independence: the values of the dependent variable are independent of each other
    3. Linearity: the dependent variable is linearly related to the independent variables
    4. Homoscedasticity: the variance of the dependent variable does not change with the levels of the independent variables (also called constant variance)
  • Validating the model by sampling

    1. From a data set of 1000 samples, randomly select 500 for the regression analysis
    2. After building the model, use the predict() function on the remaining 500 samples and compare the residuals
    3. If the predictions are accurate, the model is adequate; otherwise, adjust the model
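A sketch of the split-sample check described above, using the built-in mtcars data in place of a 1000-sample set:

```r
set.seed(1)                                    # reproducible split
idx   <- sample(nrow(mtcars), nrow(mtcars) %/% 2)
train <- mtcars[idx, ]                         # half the data for fitting
test  <- mtcars[-idx, ]                        # the other half for validation
fit   <- lm(mpg ~ wt + hp, data = train)
pred  <- predict(fit, newdata = test)
mean((test$mpg - pred)^2)                      # mean squared prediction error on held-out data
```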

18.4 analysis of variance

  • Used to test the significance of differences between the means of two or more samples. Analysis of variance is also a kind of regression analysis, but in linear regression the dependent variable is usually continuous; when the independent variables are factors, the research focus usually shifts from prediction to comparing differences between groups.
  • aov() function for analysis (the order of variables is important)

Design: expression

  • One-way ANOVA: y~A
  • One-way ANCOVA with one covariate: y~x+A
  • Two-way ANOVA: y~A*B
  • Two-way ANCOVA with two covariates: y~x1+x2+A*B
  • Randomized block: y~B+A (B is the block factor)
  • One-way within-groups ANOVA: y~A+Error(Subject/A)
  • Repeated-measures ANOVA with one within-groups factor (W) and one between-groups factor (B): y~B*W+Error(Subject/W)
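A one-way ANOVA sketch matching the first row of the table above, using the built-in mtcars data (the choice of variables is illustrative):

```r
# does mpg differ across cylinder counts?
dat <- mtcars
dat$cyl <- factor(dat$cyl)         # the grouping variable must be a factor
fit <- aov(mpg ~ cyl, data = dat)
summary(fit)                       # a small Pr(>F) means the group means differ
```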

18.5 efficacy analysis

  • Regression analysis and analysis of variance can be used to model and judge the relationship between data
  • Power analysis can determine the sample size required to detect a given effect size at a given confidence level. Conversely, it can calculate the probability of detecting a given effect size with a given sample size at a given confidence level

Theoretical basis (given any three of these, the fourth can be derived)

  1. Sample size: the number of observations in each condition/group of the experimental design
  2. Significance level (alpha): the probability of a Type I error, i.e. of declaring an effect that does not actually exist
  3. Power: one minus the probability of a Type II error; it can be regarded as the probability of detecting a real effect
  4. Effect size: the magnitude of the effect under the alternative (research) hypothesis; its expression depends on the statistical method used in the hypothesis test

Linear regression efficacy analysis case

  • Use the "pwr" package in R
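A minimal sketch assuming the pwr package is installed (install.packages("pwr")); the effect size and power values are illustrative:

```r
library(pwr)
# Power analysis for linear regression:
#   u = numerator degrees of freedom (number of predictors),
#   f2 = effect size; v is left unspecified so the function solves for it
pwr.f2.test(u = 3, f2 = 0.1, sig.level = 0.05, power = 0.9)
# the required sample size is roughly n = u + v + 1
```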

18.6 generalized linear model

  • There are many types of models. Linear regression and analysis of variance are based on the assumption of normality. The generalized linear model extends the linear model framework to include the analysis of non-normally distributed dependent variables

glm() function

  • Similar to lm(), but with additional parameters: a probability distribution family and its corresponding default link function
  • Estimation is based on maximum likelihood

Poisson regression

  • A regression analysis used to model count data and contingency tables. Poisson regression assumes the dependent variable follows a Poisson distribution and that the logarithm of its mean can be modeled by a linear combination of unknown parameters
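A Poisson regression sketch using the built-in InsectSprays data (counts of insects per spray treatment):

```r
fit <- glm(count ~ spray, data = InsectSprays, family = poisson())
summary(fit)   # coefficients are on the log scale of the expected count
```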

Logistic regression

  • A very useful tool for predicting a binary outcome variable from a series of continuous or categorical predictor variables
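A logistic regression sketch using the built-in mtcars data (predicting the binary transmission variable am is an illustrative choice):

```r
fit <- glm(am ~ wt, data = mtcars, family = binomial())
summary(fit)
# predicted probability of a manual transmission for a 3000-lb car
predict(fit, newdata = data.frame(wt = 3), type = "response")
```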

18.7 principal component analysis / factor analysis

principal component analysis

  • Principal component analysis and factor analysis are both methods used to simplify complex multivariable relationships
  • A data dimensionality reduction technique that transforms a large number of correlated variables into a small set of uncorrelated variables called principal components. Principal components are linear recombinations of the original variables: many correlated indicators are recombined into a new set of independent composite indicators
  • Use the base princomp() function or the principal() function in the psych package

factor analysis

  • A family of methods for discovering the latent structure of a set of variables: a smaller set of latent or hidden structures is sought to explain the observed relationships among the variables. It is a generalization of principal components
  • Finding the common factors, and expressing and interpreting them, is more difficult than principal component analysis

Steps of principal component analysis and factor analysis

  1. Data preprocessing
  2. Select analysis model
  3. Judge the number of principal components/factors to select (analyze with a scree plot: fa.parallel() function)
  4. Select principal component / factor
  5. Rotate principal component / factor (optional)
  6. Interpretation results
  7. Calculate the principal component or factor score, which is also optional
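A sketch of the steps above, assuming the psych package is installed; the built-in USJudgeRatings data set and the choice of one component are illustrative:

```r
library(psych)
fa.parallel(USJudgeRatings, fa = "pc")   # step 3: scree plot to judge the component count
pc <- principal(USJudgeRatings, nfactors = 1, rotate = "none")  # steps 4-5
pc$loadings                              # step 6: interpret the component loadings
```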

Keywords: R Language

Added by newbienewbie on Sun, 16 Jan 2022 15:25:37 +0200