Data Analysis Using Python

Chapter 7: Data Cleaning and Preparation

7.1 Handling Missing Values

  • All descriptive statistics on pandas objects exclude missing values by default.
  • For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing values.
  • Following a convention borrowed from the R language, pandas also refers to missing values as NA, which stands for "not available".
string_data = pd.Series(['aardvark','artichoke',np.nan,'avocado'])
string_data
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
string_data.isnull()
0    False
1    False
2     True
3    False
dtype: bool
#When cleaning data for analysis, it is often important to analyze the missing data itself to identify bias introduced by data collection problems or by data loss.
#Python's built-in None value is also treated as NA in object arrays
string_data[0] = None
string_data.isnull()
0     True
1    False
2     True
3    False
dtype: bool
  • NA handling methods

Method      Description
dropna      Filters axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate
fillna      Fills in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'
isnull      Returns boolean values indicating which values are missing
notnull     The negation of isnull

7.1.1 Filtering Missing Values

#Calling dropna on a Series returns the Series with only the non-null data and index values
from numpy import nan as  NA
data = pd.Series([1,NA,3.5,NA,7])
data
0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64
#dropna deletes rows with missing values by default
data.dropna()
0    1.0
2    3.5
4    7.0
dtype: float64
data[data.notnull()]
0    1.0
2    3.5
4    7.0
dtype: float64
data = pd.DataFrame([[1,6.5,3],[1,NA,NA],[NA,NA,NA],[NA,6.5,3]])
data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
#Passing how='all' drops only the rows that are all NA
data.dropna(how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
#If you want to delete columns in the same way, pass in the parameter axis=1
data[4] = NA
data
     0    1    2    4
0  1.0  6.5  3.0  NaN
1  1.0  NaN  NaN  NaN
2  NaN  NaN  NaN  NaN
3  NaN  6.5  3.0  NaN
data.dropna(axis=1,how = 'all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
df = pd.DataFrame(np.random.randn(7,3))
df
          0         1         2
0 -0.100288  0.117081  0.629897
1  0.145224  0.827820 -0.197561
2 -1.372610 -0.521075  0.783224
3 -0.679339  0.355698 -1.283404
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943
df.iloc[:4,1] = NA
df.iloc[:2,2] = NA
df
          0         1         2
0 -0.100288       NaN       NaN
1  0.145224       NaN       NaN
2 -1.372610       NaN  0.783224
3 -0.679339       NaN -1.283404
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943
df.dropna()
          0         1         2
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943
#Suppose you only want to keep rows containing at least a certain number of observations. You can indicate this with the thresh argument
#thresh=n keeps only the rows that have at least n non-NA values
df.dropna(thresh = 2)
          0         1         2
2 -1.372610       NaN  0.783224
3 -0.679339       NaN -1.283404
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943

7.1.2 Filling In Missing Values

  • The fillna method is the main tool for filling in missing values. When calling fillna, you can pass a constant to substitute for the missing values.
  • fillna function arguments

Parameter   Description
value       Scalar value or dictionary-like object to use to fill missing values
method      Interpolation method; defaults to 'ffill' if no other arguments are given
axis        Axis to fill on; default axis=0
inplace     Modify the calling object without producing a copy
limit       For forward and backward filling, the maximum number of consecutive periods to fill
df.fillna(0)
          0         1         2
0 -0.100288  0.000000  0.000000
1  0.145224  0.000000  0.000000
2 -1.372610  0.000000  0.783224
3 -0.679339  0.000000 -1.283404
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943
#Using the dictionary when calling fillna, you can set different fill values for different columns
df.fillna({1:0.5,2:0})
          0         1         2
0 -0.100288  0.500000  0.000000
1  0.145224  0.500000  0.000000
2 -1.372610  0.500000  0.783224
3 -0.679339  0.500000 -1.283404
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943
#fillna returns a new object, but you can also modify an existing object
_ = df.fillna(0,inplace = True)
df
          0         1         2
0 -0.100288  0.000000  0.000000
1  0.145224  0.000000  0.000000
2 -1.372610  0.000000  0.783224
3 -0.679339  0.000000 -1.283404
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943
#The same interpolation method used to rebuild the index can also be used for fillna
df = pd.DataFrame(np.random.randn(6,3))
df
          0         1         2
0 -0.428405  0.199383  0.354342
1  0.019782  0.921389  0.534736
2 -0.583158  0.390681 -2.386976
3 -0.076475 -0.034995  1.635065
4  0.528814  0.711717  0.696243
5 -0.193577  0.162206 -0.520191
df.iloc[2:,1] = NA
df
          0         1         2
0 -0.428405  0.199383  0.354342
1  0.019782  0.921389  0.534736
2 -0.583158       NaN -2.386976
3 -0.076475       NaN  1.635065
4  0.528814       NaN  0.696243
5 -0.193577       NaN -0.520191
df.iloc[4:,2] = NA
df
          0         1         2
0 -0.428405  0.199383  0.354342
1  0.019782  0.921389  0.534736
2 -0.583158       NaN -2.386976
3 -0.076475       NaN  1.635065
4  0.528814       NaN       NaN
5 -0.193577       NaN       NaN
df.fillna(method='ffill')
          0         1         2
0 -0.428405  0.199383  0.354342
1  0.019782  0.921389  0.534736
2 -0.583158  0.921389 -2.386976
3 -0.076475  0.921389  1.635065
4  0.528814  0.921389  1.635065
5 -0.193577  0.921389  1.635065
df.fillna(method='backfill')
          0         1         2
0 -0.428405  0.199383  0.354342
1  0.019782  0.921389  0.534736
2 -0.583158       NaN -2.386976
3 -0.076475       NaN  1.635065
4  0.528814       NaN       NaN
5 -0.193577       NaN       NaN
df.fillna(method='ffill',limit=2)
          0         1         2
0 -0.428405  0.199383  0.354342
1  0.019782  0.921389  0.534736
2 -0.583158  0.921389 -2.386976
3 -0.076475  0.921389  1.635065
4  0.528814       NaN  1.635065
5 -0.193577       NaN  1.635065
data = pd.Series([5,NA,3,NA,7])
data.fillna(data.mean())
0    5.0
1    5.0
2    3.0
3    5.0
4    7.0
dtype: float64
data.mean()
5.0

7.2 Data Transformation

7.2.1 Removing Duplicates

data = pd.DataFrame({'k1':['one','two']*3+['two'],
                    'k2':[1,1,2,3,3,4,4,]})
data
k1k2
0one1
1two1
2one2
3two3
4one3
5two4
6two4
#The duplicated method of DataFrame returns a Boolean Series,
#This Series reflects whether there are duplicates in each row (the same as the rows that have occurred before)
data.duplicated()
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
#drop_duplicates returns a DataFrame containing only the rows where the duplicated array is False
#i.e., duplicate rows are dropped and only unique rows are kept
#These methods consider all of the columns by default
data.drop_duplicates()
k1k2
0one1
1two1
2one2
3two3
4one3
5two4
data['v1'] = range(7)
data
k1k2v1
0one10
1two11
2one22
3two33
4one34
5two45
6two46
data.drop_duplicates(['k1'])
k1k2v1
0one10
1two11
#duplicated and drop_duplicates keep the first observed value combination by default. Passing keep='last' returns the last one
data.drop_duplicates(['k1','k2'],keep = 'last')
k1k2v1
0one10
1two11
2one22
3two33
4one34
6two46

7.2.2 Transforming Data Using a Function or Mapping

data = pd.DataFrame({'food':['bacon','pulled pork','bacon','pastrami','corned beef',
                             'bacon','pastrami','honey ham','nova lox'],
                     'ounces':[4.0,3.0,12.0,6.0,7.5,8.0,3.0,5.0,6.0]})

data
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     pastrami     6.0
4  corned beef     7.5
5        bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0
meat_to_animal = {
'bacon':'pig',
'pulled pork':'pig',  
'pastrami':'cow',
'corned beef':'cow',  
'honey ham':'pig', 
'nova lox':'salmon',
}
#The map method of Series receives a function or a dictionary object containing a mapping relationship
lowercased = data['food'].str.lower()
lowercased
0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object
data['animal'] = lowercased.map(meat_to_animal)
data
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     pastrami     6.0     cow
4  corned beef     7.5     cow
5        bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon
#Using map is a convenient way to perform element by element conversion and other cleaning related operations
data['food'].map(lambda x :meat_to_animal[x.lower()])
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

7.2.3 Replacing Values

  • The data.replace method is distinct from data.str.replace, which performs element-wise string substitution.
data = pd.Series([1,-999,2,-999,-1000,3])
data
0       1
1    -999
2       2
3    -999
4   -1000
5       3
dtype: int64
data.replace(-999,np.nan)
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64
#If you want to replace multiple values at once, you can pass a list of values followed by the substitute value
data.replace([-999,-1000],np.nan)
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64
#To use a different replacement for each value, pass a list of substitute values
data.replace([-999,-1000],[np.nan,0])
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
#Parameters can also be passed through a dictionary
data.replace({-999:np.nan,-1000:0})
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
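As a side note (a small sketch, not from the original notes), replace can also treat string patterns as regular expressions when you pass regex=True:
s = pd.Series(['foo 1', 'bar 2', 'baz 3'])
s.replace(r'\s+\d', '', regex=True)
#the trailing " <digit>" is stripped from each element, leaving 'foo', 'bar', 'baz'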

7.2.4 Renaming Axis Indexes

data = pd.DataFrame(np.arange(12).reshape(3,4),
                   index = ['Ohio','Colorado','New York'],
                   columns = ['one','two','three','four'])
data
onetwothreefour
Ohio0123
Colorado4567
New York891011
transform = lambda x :x[:4].upper()
data.index.map(transform)
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
data.index = data.index.map(transform)
data
onetwothreefour
OHIO0123
COLO4567
NEW891011
data.rename(index = str.title,columns = str.upper)
ONETWOTHREEFOUR
Ohio0123
Colo4567
New891011
#rename can be used in conjunction with dictionary objects to provide new values for a subset of axis labels
data.rename(index = {'OHIO':'INDIANA'},
           columns = {'three':'peekaboo'})
onetwopeekaboofour
INDIANA0123
COLO4567
NEW891011

7.2.5 Discretization and Binning

ages = [20,22,24,27,21,23,37,31,61,45,41,32]
bins = [18,25,35,60,100]
cats = pd.cut(ages,bins)
cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
cats.codes
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
cats.categories
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')
#pd.value_counts(cats) counts the number of values in each bin of the pandas.cut result
pd.value_counts(cats)
(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
dtype: int64
#Consistent with the mathematical symbol of the interval, parentheses indicate that the edge is open and square brackets indicate that it is closed (including edges).
#You can change which side is closed by passing right=False
pd.cut(ages,[18,26,36,61,100],right = False)
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
#You can also pass in a custom box name by passing a list or array to the labels option
group_names = ['youth','youngadult','middleaged','senior']
a = pd.cut(ages,bins,labels = group_names)
pd.value_counts(a)
youth         5
youngadult    3
middleaged    3
senior        1
dtype: int64
#If you pass an integer number of bins to cut instead of explicit bin edges, pandas computes equal-length bins based on the minimum and maximum values in the data
#The option precision=2 limits the decimal precision to two digits
data = np.random.rand(20)
pd.cut(data,4,precision=2)
[(0.51, 0.74], (0.29, 0.51], (0.74, 0.97], (0.29, 0.51], (0.06, 0.29], ..., (0.06, 0.29], (0.29, 0.51], (0.74, 0.97], (0.51, 0.74], (0.74, 0.97]]
Length: 20
Categories (4, interval[float64]): [(0.06, 0.29] < (0.29, 0.51] < (0.51, 0.74] < (0.74, 0.97]]
#qcut is a closely related function that bins the data based on sample quantiles. Depending on the distribution of the data, cut usually does not give each bin the same number of data points
#Because qcut uses sample quantiles, the bins it produces each contain roughly the same number of points
data = np.random.randn(1000)
cats = pd.qcut(data,4)#Cut into 4 parts
cats
[(-0.00707, 0.65], (-0.00707, 0.65], (-2.936, -0.626], (-0.626, -0.00707], (-2.936, -0.626], ..., (-0.626, -0.00707], (-0.626, -0.00707], (-0.626, -0.00707], (-0.626, -0.00707], (-0.00707, 0.65]]
Length: 1000
Categories (4, interval[float64]): [(-2.936, -0.626] < (-0.626, -0.00707] < (-0.00707, 0.65] < (0.65, 3.139]]
pd.value_counts(cats)
(-2.936, -0.626]      250
(-0.626, -0.00707]    250
(-0.00707, 0.65]      250
(0.65, 3.139]         250
dtype: int64
#Similar to cut, you can pass your own quantiles to qcut (numbers between 0 and 1, inclusive); with cut, as below, the numbers are treated as absolute bin edges, so values outside [0, 1] become NaN
pd.cut(data,[0,0.1,0.5,0.9,1])
[(0.5, 0.9], (0.1, 0.5], NaN, NaN, NaN, ..., NaN, NaN, NaN, NaN, (0.1, 0.5]]
Length: 1000
Categories (4, interval[float64]): [(0.0, 0.1] < (0.1, 0.5] < (0.5, 0.9] < (0.9, 1.0]]
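As the comment above suggests, qcut (rather than cut) accepts your own quantile edges between 0 and 1; a minimal sketch on the same data array:
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.0])
#bins bounded by the 10th, 50th and 90th percentiles of data, so every value falls into a bin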

7.2.6 Detecting and Filtering Outliers

data = pd.DataFrame(np.random.randn(1000,4))
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.013343     0.030142     0.020312     0.042330
std       1.012528     0.984443     0.999869     0.982124
min      -2.942920    -3.799121    -3.412855    -2.632107
25%      -0.668303    -0.629645    -0.654843    -0.643005
50%       0.010349     0.040064     0.026197     0.028003
75%       0.701525     0.679371     0.706170     0.714993
max       3.274496     3.998493     3.264216     2.907744
#Suppose you want to find the values in one of the columns whose absolute value exceeds 3
col = data[2]
col[np.abs(col) > 3]
91    -3.044972
711    3.264216
858   -3.412855
Name: 2, dtype: float64
data[(np.abs(data)>3).any(1)]
            0         1         2         3
91  -0.341046 -0.555910 -3.044972  0.474512
325  2.233400 -3.027404  0.845704  1.441757
332 -0.460361 -3.799121 -0.312931  0.478548
457  0.011004  3.998493  0.977419  0.577620
711 -0.603762 -1.650901  3.264216 -0.803395
746  1.455624 -3.178085 -0.387140  0.859193
858 -2.127923  0.163924 -3.412855 -0.073186
946  3.274496 -0.699596 -1.016879  0.358252
data[np.abs(data)>3] = np.sign(data)*3
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.013069     0.030148     0.020506     0.042330
std       1.011680     0.977459     0.997573     0.982124
min      -2.942920    -3.000000    -3.000000    -2.632107
25%      -0.668303    -0.629645    -0.654843    -0.643005
50%       0.010349     0.040064     0.026197     0.028003
75%       0.701525     0.679371     0.706170     0.714993
max       3.000000     3.000000     3.000000     2.907744
data
0123
00.9972850.352539-0.158277-0.069519
1-1.144523-0.173312-0.6512270.686972
20.6501310.271325-0.304344-0.281217
30.527442-2.0237650.827982-1.855424
4-0.578451-0.949705-0.582701-1.725697
...............
9950.4943110.528862-0.1910970.118121
996-0.5821541.251247-1.622055-0.436563
9970.687732-1.670059-0.272708-0.369290
998-0.4432300.984728-0.283506-1.473420
999-0.276277-0.5972561.269391-0.704337

1000 rows × 4 columns

#The statement np.sign(data) produces 1 and -1 values depending on whether the values in data are positive or negative
np.sign(data).head()
     0    1    2    3
0  1.0  1.0 -1.0 -1.0
1 -1.0 -1.0 -1.0  1.0
2  1.0  1.0 -1.0 -1.0
3  1.0 -1.0  1.0 -1.0
4 -1.0 -1.0 -1.0 -1.0
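A related sketch (not from the original notes): the same capping of values to the interval [-3, 3] can be done in one step with the clip method:
capped = data.clip(lower=-3, upper=3)
capped.describe()
#max and min are now bounded by 3 and -3, matching the np.sign-based assignment above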

7.2.7 Permutation and Random Sampling

  • Using numpy.random.permutation makes it easy to permute (randomly reorder) a Series or the rows of a DataFrame.
sampler = np.random.permutation(5)
sampler
array([3, 2, 0, 4, 1])
df = pd.DataFrame(np.arange(5*4).reshape(5,4))
df
0123
00123
14567
2891011
312131415
416171819
#Integer arrays can be used in iloc based indexes or equivalent take functions
df.take(sampler)
0123
312131415
2891011
00123
416171819
14567
#To select a random subset without replacement, you can use the sample method of Series and DataFrame
df.sample(n=3)
0123
00123
416171819
312131415
#To generate a sample with replacement (allowing repeated choices), pass replace=True to sample
choices = pd.Series([5,6,-1,6,4])
draws = choices.sample(n=10,replace = True)
draws
2   -1
0    5
2   -1
3    6
0    5
1    6
1    6
4    4
3    6
1    6
dtype: int64
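A small supplementary sketch: take also accepts axis=1 for permuting columns, and sample(frac=1) draws a full-length sample without replacement, i.e. shuffles the rows:
col_sampler = np.random.permutation(4)   #df from above has 4 columns
df.take(col_sampler, axis=1)             #columns of df in randomly permuted order
df.sample(frac=1)                        #rows of df in randomly shuffled order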

7.2.8 Computing Indicator/Dummy Variables

  • Converting a categorical variable into a "dummy" or "indicator" matrix is another transformation used for statistical modeling or machine learning.
  • pandas has a get_dummies function for doing this.
df = pd.DataFrame({'key':['b','b','a','c','a','b'],
                  'data1':range(6)})
df
keydata1
0b0
1b1
2a2
3c3
4a4
5b5
pd.get_dummies(df['key'])
abc
0010
1010
2100
3001
4100
5010
#In some cases, you may want to add a prefix to the columns of the indicator DataFrame, which can then be merged with the other data.
#The get_dummies method has a prefix argument for doing this
dummies = pd.get_dummies(df['key'],prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
data1key_akey_bkey_c
00010
11010
22100
33001
44100
55010
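As a brief aside (a sketch, not in the original notes), get_dummies can also be applied to a whole DataFrame; the columns argument selects which columns to encode and prefixes them with the column name automatically:
pd.get_dummies(df, columns=['key'])
#keeps data1 and adds key_a, key_b, key_c indicator columns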
mnames = ['movie_id','title','genres']
movies = pd.read_table(r'D:\PythonFlie\python\utilize python Conduct data analysis(Book notes)\pydata-book-2nd-edition\datasets\movielens\movies.dat'
                       ,sep='::',header=None,names = mnames)
<ipython-input-188-960ac40c2eea>:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  movies = pd.read_table(r'D:\PythonFlie\python\utilize python Conduct data analysis(Book notes)\pydata-book-2nd-edition\datasets\movielens\movies.dat'
movies[::10]
movie_idtitlegenres
01Toy Story (1995)Animation|Children's|Comedy
1011American President, The (1995)Comedy|Drama|Romance
2021Get Shorty (1995)Action|Comedy|Drama
3031Dangerous Minds (1995)Drama
4041Richard III (1995)Drama|War
............
38403910Dancer in the Dark (2000)Drama|Musical
38503920Faraway, So Close (In Weiter Ferne, So Nah!) (...Drama|Fantasy
38603930Creature From the Black Lagoon, The (1954)Horror
38703940Slumber Party Massacre III, The (1990)Horror
38803950Tigerland (2000)Drama

389 rows × 3 columns

#Adding index variables for each movie genre requires some data processing. First, we extract a list of all the different genres from the dataset
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)
genres
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)
zero_matrix = np.zeros((len(movies),len(genres)))
dummies = pd.DataFrame(zero_matrix,columns=genres)
zero_matrix
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
dummies
AnimationChildren'sComedyAdventureFantasyRomanceDramaActionCrimeThrillerHorrorSci-FiDocumentaryWarMusicalMysteryFilm-NoirWestern
00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
10.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
20.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
30.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
40.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
.........................................................
38780.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
38790.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
38800.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
38810.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
38820.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0

3883 rows × 18 columns

gen = movies.genres[0]
gen.split('|')
['Animation', "Children's", 'Comedy']
dummies.columns.get_indexer(gen.split("|"))
array([0, 1, 2], dtype=int64)
for i,gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split("|"))
    dummies.iloc[i,indices] = 1
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
movie_id                                       1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Animation                              1.0
Genre_Children's                             1.0
Genre_Comedy                                 1.0
Genre_Adventure                              0.0
Genre_Fantasy                                0.0
Genre_Romance                                0.0
Genre_Drama                                  0.0
Genre_Action                                 0.0
Genre_Crime                                  0.0
Genre_Thriller                               0.0
Genre_Horror                                 0.0
Genre_Sci-Fi                                 0.0
Genre_Documentary                            0.0
Genre_War                                    0.0
Genre_Musical                                0.0
Genre_Mystery                                0.0
Genre_Film-Noir                              0.0
Genre_Western                                0.0
Name: 0, dtype: object
#Combining get_dummies with a discretization function like cut is a useful technique for statistical applications
np.random.seed(12345)
values = np.random.rand(10)
values
array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
       0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])
bins = [0,0.2,0.4,0.6,0.8,1]
pd.get_dummies(pd.cut(values,bins))
(0.0, 0.2](0.2, 0.4](0.4, 0.6](0.6, 0.8](0.8, 1.0]
000001
101000
210000
301000
400100
500100
600001
700010
800010
900010

7.3 String Manipulation

7.3.1 String Object Methods

#A comma separated string can be split into multiple blocks using the split method:
val = 'a,b, guido'
val.split(',')
['a', 'b', ' guido']
#split is often used with strip to clear spaces (including line breaks)
pieces = [x.strip() for x in val.split(',')]
pieces
['a', 'b', 'guido']
#These substrings can be concatenated with two colon separators using addition
first,second,third = pieces
first+"::"+second+"::"+third
'a::b::guido'
#A faster and more Pythonic way is to pass a list or tuple to the join method of the string '::'
"::".join(pieces)
'a::b::guido'
#Using Python's in keyword is the best way to detect substrings, although index and find can do the same
'guido' in val
True
#Note that the difference between find and index is that index throws an exception when the string is not found (while find returns - 1)
val.index('guido')
5
val.find('guido')
5
#count returns the number of occurrences of a specific substring in the string

val.count('guido')
1
#replace replaces one pattern with another. It is also commonly used to pass in an empty string to delete a pattern
val.replace(',','::')
'a::b:: guido'
val.replace(',','')
'ab guido'
  • Python built-in string methods

Method                  Description
count                   Returns the number of non-overlapping occurrences of the substring in the string
endswith                Returns True if the string ends with the suffix
startswith              Returns True if the string starts with the prefix
join                    Uses the string as a delimiter for concatenating a sequence of other strings
index                   Returns the position of the first character of the first occurrence of the substring; raises ValueError if not found
find                    Returns the position of the first character of the first occurrence of the substring; like index, but returns -1 if not found
rfind                   Returns the position of the first character of the last occurrence of the substring; returns -1 if not found
replace                 Replaces occurrences of one string with another
strip, rstrip, lstrip   Trims whitespace, including newlines; equivalent to x.strip() (and rstrip, lstrip) for each element
split                   Breaks the string into a list of substrings using the passed delimiter
lower                   Converts alphabetic characters to lowercase
upper                   Converts alphabetic characters to uppercase
casefold                Converts characters to lowercase, converting any region-specific variable character combinations to a common comparable form
ljust, rjust            Left- or right-justifies the string; pads the opposite side with spaces (or some other fill character) to return a string with a minimum width

7.3.2 Regular Expressions

  • The re module functions fall into three categories: pattern matching, substitution, and splitting
  • The regular expression describing one or more whitespace characters is \s+
  • If you need to apply the same expression to many strings, it is recommended to create a reusable regular expression object with re.compile, which saves CPU cycles
  • To avoid unwanted escaping with \ in a regular expression, use raw string literals like r'C:\x' instead of the equivalent 'C:\\x'
  • Regular expression methods

Method      Description
findall     Returns all non-overlapping matches of the pattern in the string as a list
finditer    Like findall, but returns an iterator
match       Matches the pattern at the start of the string, optionally segmenting the match into groups; returns a match object if the pattern matches, otherwise None
search      Scans the string for a match of the pattern and returns the corresponding match object; unlike match, the match can be anywhere in the string, not just at the beginning
split       Breaks the string into pieces at each occurrence of the pattern
sub, subn   Replaces all (sub) or the first n (subn) occurrences of the pattern in the string with a replacement expression; use the symbols \1, \2, ... to refer to match group elements in the replacement string
import re
text = 'foo bar\t baz \tqux'
re.split('\s+',text)
['foo', 'bar', 'baz', 'qux']
#You can compile the regex yourself with re.compile, forming a reusable regular expression object
regex = re.compile('\s+')
regex.split(text)
['foo', 'bar', 'baz', 'qux']
#If you want to get a list of all the patterns that match the regular expression, you can use the findall method
regex.findall(text)
[' ', '\t ', ' \t']
text = """
Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9.%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern,flags = re.IGNORECASE)
regex.findall(text)
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
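The method table above also lists finditer, which is not demonstrated in these notes; a minimal sketch using the same compiled regex:
for m in regex.finditer(text):
    print(m.group())
#prints each matched email address in turn, one per line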
#search returns the first matching email address in the text
m = regex.search(text)
m
<re.Match object; span=(6, 21), match='dave@google.com'>
text[m.start():m.end()]
'dave@google.com'
#regex.match matches only when the pattern appears at the beginning of the string. If it does not match, it returns None
print(regex.match(text))
None
#sub will return a new string, and the pattern in the original string will be replaced by a new string
print(regex.sub('ABC',text))
Dave ABC
Steve ABC
Rob ABC
Ryan ABC

#Suppose you want to find an e-mail address and divide each address into three parts: user name, domain name and domain name suffix.
#To do this, you can wrap patterns in parentheses
pattern = r'([A-Z0-9.%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern,flags = re.IGNORECASE)
m = regex.match('wesm@bright.net')
m.groups()
('wesm', 'bright', 'net')
#When patterns can be grouped, findall returns a list containing tuples
regex.findall(text)
[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]
print(regex.sub(r'Username:\1,Domain:\2,Suffix:\3',text))
Dave Username:dave,Domain:google,Suffix:com
Steve Username:steve,Domain:gmail,Suffix:com
Rob Username:rob,Domain:gmail,Suffix:com
Ryan Username:ryan,Domain:yahoo,Suffix:com
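A related sketch (not from the original notes): groups can also be named with the (?P<name>...) syntax, which makes the parts of a match easier to access:
pattern_named = r'(?P<username>[A-Z0-9.%+-]+)@(?P<domain>[A-Z0-9.-]+)\.(?P<suffix>[A-Z]{2,4})'
m = re.match(pattern_named, 'wesm@bright.net', flags=re.IGNORECASE)
m.groupdict()
#{'username': 'wesm', 'domain': 'bright', 'suffix': 'net'}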

7.3.3 Vectorized String Functions in pandas

  • Partial listing of vectorized string methods

Method      Description
cat         Concatenates strings element-wise with an optional delimiter
contains    Returns a boolean array indicating whether each string contains a pattern/regex
count       Counts occurrences of the pattern
extract     Uses a regular expression with groups to extract one or more strings from a Series of strings; the result is a DataFrame with one column per group
endswith    Equivalent to x.endswith(pattern) for each element
data = {'Dave':'dave@google.com','Steve':'steve@gmail.com','Rob':'rob@gmail.com','Ryan':np.nan}
data = pd.Series(data)
data
Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Ryan                 NaN
dtype: object
data.isnull()
Dave     False
Steve    False
Rob      False
Ryan      True
dtype: bool
#You can apply string and regular expression methods to each value with data.map (passing a lambda or other function),
#but it will fail on NA (null) values. Series has array-oriented methods for string operations that skip NA values; these are accessed through Series's str attribute
data.str.contains('gmail')
Dave     False
Steve     True
Rob       True
Ryan       NaN
dtype: object
#Regular expressions can also be used with any re module option
data.str.findall(pattern,flags=re.IGNORECASE)
Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Ryan                       NaN
dtype: object
#You can use str.get or index inside the str attribute
matches = data.str.match(pattern,flags=re.IGNORECASE)
matches
Dave     True
Steve    True
Rob      True
Ryan      NaN
dtype: object
#To access the elements in the embedded list, we can pass the index to any of these functions
matches.str.get(l)
---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-245-8d76f9329d2a> in <module>
      1 #To access the elements in the embedded list, we can pass the index to any of these functions
----> 2 matches.str.get(l)


D:\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5459             or name in self._accessors
   5460         ):
-> 5461             return object.__getattribute__(self, name)
   5462         else:
   5463             if self._info_axis._can_hold_identifiers_and_holds_name(name):


D:\Anaconda3\lib\site-packages\pandas\core\accessor.py in __get__(self, obj, cls)
    178             # we're accessing the attribute of the class, i.e., Dataset.geo
    179             return self._accessor
--> 180         accessor_obj = self._accessor(obj)
    181         # Replace the property with the accessor object. Inspired by:
    182         # https://www.pydanny.com/cached-property.html


D:\Anaconda3\lib\site-packages\pandas\core\strings\accessor.py in __init__(self, data)
    152         from pandas.core.arrays.string_ import StringDtype
    153 
--> 154         self._inferred_dtype = self._validate(data)
    155         self._is_categorical = is_categorical_dtype(data.dtype)
    156         self._is_string = isinstance(data.dtype, StringDtype)


D:\Anaconda3\lib\site-packages\pandas\core\strings\accessor.py in _validate(data)
    215 
    216         if inferred_dtype not in allowed_types:
--> 217             raise AttributeError("Can only use .str accessor with string values!")
    218         return inferred_dtype
    219 


AttributeError: Can only use .str accessor with string values!
matches.str[0]
---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-246-10bdd22fd8b2> in <module>
----> 1 matches.str[0]


D:\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5459             or name in self._accessors
   5460         ):
-> 5461             return object.__getattribute__(self, name)
   5462         else:
   5463             if self._info_axis._can_hold_identifiers_and_holds_name(name):


D:\Anaconda3\lib\site-packages\pandas\core\accessor.py in __get__(self, obj, cls)
    178             # we're accessing the attribute of the class, i.e., Dataset.geo
    179             return self._accessor
--> 180         accessor_obj = self._accessor(obj)
    181         # Replace the property with the accessor object. Inspired by:
    182         # https://www.pydanny.com/cached-property.html


D:\Anaconda3\lib\site-packages\pandas\core\strings\accessor.py in __init__(self, data)
    152         from pandas.core.arrays.string_ import StringDtype
    153 
--> 154         self._inferred_dtype = self._validate(data)
    155         self._is_categorical = is_categorical_dtype(data.dtype)
    156         self._is_string = isinstance(data.dtype, StringDtype)


D:\Anaconda3\lib\site-packages\pandas\core\strings\accessor.py in _validate(data)
    215 
    216         if inferred_dtype not in allowed_types:
--> 217             raise AttributeError("Can only use .str accessor with string values!")
    218         return inferred_dtype
    219 


AttributeError: Can only use .str accessor with string values!
#You can use the syntax similar to string slicing for vectorization slicing
data.str[:5]
Dave     dave@
Steve    steve
Rob      rob@g
Ryan       NaN
dtype: object
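The vectorized-method table above also mentions extract, which is not demonstrated here; a short sketch using the grouped email pattern defined earlier returns a DataFrame with one column per capture group:
data.str.extract(pattern, flags=re.IGNORECASE)
#one row per element of data, with columns 0, 1, 2 holding username, domain and suffix (all NaN for Ryan)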

Chapter 8: Data Wrangling: Join, Combine, and Reshape

8.1 Hierarchical Indexing

#What you see is a prettified view of a Series with a MultiIndex. The "gaps" in the index display mean "use the label directly above"
data = pd.Series(np.random.randn(9),
                index = [['a','a','a','b','b','c','c','d','d'],
                        [1,2,3,1,3,1,2,2,3]])
data
a  1    1.007189
   2   -1.296221
   3    0.274992
b  1    0.228913
   3    1.352917
c  1    0.886429
   2   -2.001637
d  2   -0.371843
   3    1.669025
dtype: float64
data.index
MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )
data['b']
1    0.228913
3    1.352917
dtype: float64
data['b':'c']
b  1    0.228913
   3    1.352917
c  1    0.886429
   2   -2.001637
dtype: float64
data.loc[['b','c']]
b  1    0.228913
   3    1.352917
c  1    0.886429
   2   -2.001637
dtype: float64
#It is also possible to select at the internal level
data.loc[:,3]
a    0.274992
b    1.352917
d    1.669025
dtype: float64
#Hierarchical indexing plays an important role in reshaping data and in group-based operations like forming a pivot table
#You can rearrange the data into a DataFrame using the unstack method
data.unstack()
          1         2         3
a  1.007189 -1.296221  0.274992
b  0.228913       NaN  1.352917
c  0.886429 -2.001637       NaN
d       NaN -0.371843  1.669025
#The reverse operation of unstack is stack
data.unstack().stack()
a  1    1.007189
   2   -1.296221
   3    0.274992
b  1    0.228913
   3    1.352917
c  1    0.886429
   2   -2.001637
d  2   -0.371843
   3    1.669025
dtype: float64
#In the DataFrame, each axis can have a hierarchical index
frame = pd.DataFrame(np.arange(12).reshape(4,3),
                    index = [['a','a','b','b'],[1,2,1,2]],
                    columns = [['ohio','ohio','colorado'],['green','red','green']])
frame
ohiocolorado
greenredgreen
a1012
2345
b1678
291011
frame.index.names = ['key1','key2']
frame
ohiocolorado
greenredgreen
key1key2
a1012
2345
b1678
291011
frame.columns.names = ['state','color']
frame
stateohiocolorado
colorgreenredgreen
key1key2
a1012
2345
b1678
291011
frame['ohio']
colorgreenred
key1key2
a101
234
b167
2910
#A MultiIndex object can be created and reused using its own constructor
pd.MultiIndex.from_arrays([['ohio','ohio','colorado'],['green','red','green']],
                      names = ['state','color'])
MultiIndex([(    'ohio', 'green'),
            (    'ohio',   'red'),
            ('colorado', 'green')],
           names=['state', 'color'])
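A small related sketch: MultiIndex.from_product builds the same kind of index from the Cartesian product of several iterables:
pd.MultiIndex.from_product([['ohio', 'colorado'], ['green', 'red']],
                           names=['state', 'color'])
#four tuples: ('ohio', 'green'), ('ohio', 'red'), ('colorado', 'green'), ('colorado', 'red')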

8.1.1 Reordering and Sorting Levels

#swaplevel receives two level numbers or names and returns a new object with the levels interchanged (but the data is otherwise unaltered)
frame.swaplevel('key1','key2')
stateohiocolorado
colorgreenredgreen
key2key1
1a012
2a345
1b678
2b91011
#sort_index sorts the data using only the values in a single level
#When swapping levels, it is common to also use sort_index so that the result is lexicographically sorted by the indicated level
frame.sort_index(level=1)
stateohiocolorado
colorgreenredgreen
key1key2
a1012
b1678
a2345
b291011
frame.sort_index(level=0)
stateohiocolorado
colorgreenredgreen
key1key2
a1012
2345
b1678
291011
frame.swaplevel(0,1).sort_index(level=0)
stateohiocolorado
colorgreenredgreen
key2key1
1a012
b678
2a345
b91011

8.1.2 Summary Statistics by Level

  • Many descriptive and summary statistics on DataFrame and Series have a level option, with which you can specify the level you want to aggregate by on a particular axis
frame.sum(level='key2')
stateohiocolorado
colorgreenredgreen
key2
16810
2121416
frame.sum(level='color',axis=1)
colorgreenred
key1key2
a121
284
b1147
22010
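Under the hood this uses pandas's groupby machinery; in recent pandas versions the level keyword of sum is deprecated, so an equivalent spelling (a sketch, version permitting) is:
frame.groupby(level='key2').sum()            #same result as frame.sum(level='key2')
frame.groupby(level='color', axis=1).sum()   #same result as frame.sum(level='color', axis=1)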

8.1.3 Indexing with a DataFrame's Columns

  • It is common to want to use one or more columns from a DataFrame as the row index; alternatively, you may wish to move the row index into the DataFrame's columns.
  • Different join types with the how argument

Option     Behavior
'inner'    Use only the key combinations observed in both tables
'left'     Use all key combinations found in the left table
'right'    Use all key combinations found in the right table
'outer'    Use the union of the key combinations observed in both tables
frame = pd.DataFrame({'a':range(7),'b':range(7,0,-1),
                     'c':['one','one','one','two','two','two','two'],
                     'd':[0,1,2,0,1,2,3]})
frame
abcd
007one0
116one1
225one2
334two0
443two1
552two2
661two3
#DataFrame's set_index function creates a new DataFrame using one or more of its columns as the index
frame2 = frame.set_index(['c','d'])
frame2
ab
cd
one007
116
225
two034
143
252
361
#By default, these columns are removed from the DataFrame, or you can leave them in the DataFrame
frame.set_index(['c','d'],drop = False)
abcd
cd
one007one0
116one1
225one2
two034two0
143two1
252two2
361two3
#reset_index is the inverse of set_index; the hierarchical index levels are moved back into columns
frame2.reset_index()
cdab
0one007
1one116
2one225
3two034
4two143
5two252
6two361

8.2 Combining and Merging Datasets

  • pandas.merge connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.
  • pandas.concat concatenates or "stacks" objects together along an axis.
  • The combine_first instance method enables splicing together overlapping data to fill in missing values in one object with values from another.

8.2.1 Database-Style DataFrame Joins

  • A merge or join operation combines datasets by linking rows using one or more keys
  • merge function arguments

Parameter     Description
left          DataFrame to be merged on the left side
right         DataFrame to be merged on the right side
how           One of 'inner', 'outer', 'left', or 'right'; defaults to 'inner'
on            Column names to join on; must be found in both DataFrame objects. If not specified and no other join keys are given, the intersection of the column names in left and right is used as the join keys
left_on       Columns in the left DataFrame to use as join keys
right_on      Columns in the right DataFrame to use as join keys
left_index    Use the row index of left as its join key (or keys, in the case of a MultiIndex)
right_index   Use the row index of right as its join key (or keys, in the case of a MultiIndex)
sort          Sort merged data lexicographically by join keys; True by default (disable for better performance on large datasets in some cases)
suffixes      Tuple of strings to append to overlapping column names; defaults to ('_x', '_y') (e.g., if both DataFrames contain a 'data' column, 'data_x' and 'data_y' appear in the result)
copy          If False, avoid copying data into the resulting data structure in some exceptional cases; always copies by default
indicator     Adds a special column _merge indicating the source of each row; its value will be 'left_only', 'right_only', or 'both' based on the origin of the joined data in each row
df1 = pd.DataFrame({'key':['b','b','a','c','a','a','b'],
                   'data1':range(7)})
df1
keydata1
0b0
1b1
2a2
3c3
4a4
5a5
6b6
df2 = pd.DataFrame({'key':['a','b','d'],
                   'data2':range(3)})
df2
keydata2
0a0
1b1
2d2
pd.merge(df1,df2)
keydata1data2
0b01
1b11
2b61
3a20
4a40
5a50
#Note that I did not specify which column to connect on. If the key information of the connection is not specified, merge will automatically use the overlapping column name as the key of the connection.
#However, specifying the join key explicitly is a good implementation
pd.merge(df1,df2,on='key')
keydata1data2
0b01
1b11
2b61
3a20
4a40
5a50
#If the column names of each object are different, you can specify column names for them separately
df3 = pd.DataFrame({'Lkey':['b','b','a','c','a','a','b'],
                   'data1':range(7)})
df3
Lkeydata1
0b0
1b1
2a2
3c3
4a4
5a5
6b6
df4 = pd.DataFrame({'Rkey':['a','b','d'],
                   'data2':range(3)})
df4
Rkeydata2
0a0
1b1
2d2
pd.merge(df3,df4,left_on='Lkey',right_on='Rkey')
Lkeydata1Rkeydata2
0b0b1
1b1b1
2b6b1
3a2a0
4a4a0
5a5a0
#Other options are 'left', 'right' and 'outer'.
#outer join is the union of keys, which combines the effects of left join and right join
pd.merge(df1,df2,how='outer')
keydata1data2
0b0.01.0
1b1.01.0
2b6.01.0
3a2.00.0
4a4.00.0
5a5.00.0
6c3.0NaN
7dNaN2.0
df1 = pd.DataFrame({'key':['b','b','a','c','a','b'],
                   'data1':range(6)})
df1
keydata1
0b0
1b1
2a2
3c3
4a4
5b5
df2 = pd.DataFrame({'key':['a','b','a','b','d'],
                   'data2':range(5)})
df2
keydata2
0a0
1b1
2a2
3b3
4d4
pd.merge(df1,df2,on='key',how='left')
keydata1data2
0b01.0
1b03.0
2b11.0
3b13.0
4a20.0
5a22.0
6c3NaN
7a40.0
8a42.0
9b51.0
10b53.0
pd.merge(df1,df2,on='key',how='inner')
keydata1data2
0b01
1b03
2b11
3b13
4b51
5b53
6a20
7a22
8a40
9a42
#When merging with multiple keys, a list of column names is passed in
left = pd.DataFrame({'key1':['foo','foo','bar'],
                    'key2':['one','two','one'],
                    'lval':[1,2,3]})
left
key1key2lval
0fooone1
1footwo2
2barone3
right = pd.DataFrame({'key1':['foo','foo','bar','bar'],
                    'key2':['one','one','one','two'],
                    'rval':[4,5,6,7]})
right
key1key2rval
0fooone4
1fooone5
2barone6
3bartwo7
pd.merge(left,right,on=['key1','key2'],how='outer')
key1key2lvalrval
0fooone1.04.0
1fooone1.05.0
2footwo2.0NaN
3barone3.06.0
4bartwoNaN7.0
#merge has a suffixes option,
#Used to specify the string to be added after the overlapping column names of the left and right DataFrame objects
pd.merge(left,right,on=['key1'])
key1key2_xlvalkey2_yrval
0fooone1one4
1fooone1one5
2footwo2one4
3footwo2one5
4barone3one6
5barone3two7
pd.merge(left,right,on=['key1'],suffixes=('_left','_right'))
key1key2_leftlvalkey2_rightrval
0fooone1one4
1fooone1one5
2footwo2one4
3footwo2one5
4barone3one6
5barone3two7
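One merge argument from the table above that is not otherwise demonstrated is indicator; a short sketch with the df1 and df2 from earlier in this section:
pd.merge(df1, df2, on='key', how='outer', indicator=True)
#adds a _merge column whose value is 'left_only', 'right_only' or 'both' for each row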

8.2.2 Merging on Index

  • In some cases, the merge key(s) in a DataFrame will be found in its index. In that case, you can pass left_index=True or right_index=True (or both) to indicate that the index should be used as the merge key
left1 = pd.DataFrame({'key':['a','b','a','a','b','c'],
                     'value':range(6)})
left1
keyvalue
0a0
1b1
2a2
3a3
4b4
5c5
right1 = pd.DataFrame({'group_val':[3.5,7]},index = ['a','b'])
right1
group_val
a3.5
b7.0
pd.merge(left1,right1,left_on = 'key',right_index=True)
keyvaluegroup_val
0a03.5
2a23.5
3a33.5
1b17.0
4b47.0
pd.merge(left1,right1,left_on = 'key',right_index=True,how='outer')
keyvaluegroup_val
0a03.5
2a23.5
3a33.5
1b17.0
4b47.0
5c5NaN
#In the case of multi-layer index data, things will be more complex. Joining on the index is an implicit multi key merge
lefth = pd.DataFrame({'key1':['ohio','ohio','ohio','Nevada','Nevada'],
                     'key2':[2000,2001,2002,2001,2002],
                     'data':np.arange(5.)})
lefth
key1key2data
0ohio20000.0
1ohio20011.0
2ohio20022.0
3Nevada20013.0
4Nevada20024.0
righth = pd.DataFrame(np.arange(12).reshape(6,2),
                     index=[['nevada','nevada','ohio','ohio','ohio','ohio'],[2001,2000,2000,2000,2001,2002]],
                     columns = ['event1','event2'])
righth
event1event2
nevada200101
200023
ohio200045
200067
200189
20021011
pd.merge(lefth,righth,left_on=['key1','key2'],right_index=True,how='outer')
key1key2dataevent1event2
0ohio20000.04.05.0
0ohio20000.06.07.0
1ohio20011.08.09.0
2ohio20022.010.011.0
3Nevada20013.0NaNNaN
4Nevada20024.0NaNNaN
4nevada2001NaN0.01.0
4nevada2000NaN2.03.0
left2 = pd.DataFrame([[1,2],[3,4],[5,6]],
                    index = ['a','c','e'],
                    columns = ['ohio','nevada'])
left2
ohionevada
a12
c34
e56
right2 = pd.DataFrame([[7,8],[9,10],[11,12],[13,14]],
                    index = ['b','c','d','e'],
                    columns = ['missouri','alabama'])
right2
missourialabama
b78
c910
d1112
e1314
pd.merge(left2,right2,how='outer',left_index=True,right_index=True)
ohionevadamissourialabama
a1.02.0NaNNaN
bNaNNaN7.08.0
c3.04.09.010.0
dNaNNaN11.012.0
e5.06.013.014.0
#join instance method for merging by index
left2.join(right2,how='outer')
ohionevadamissourialabama
a1.02.0NaNNaN
bNaNNaN7.08.0
c3.04.09.010.0
dNaNNaN11.012.0
e5.06.013.014.0
left1.join(right1,on='key')
keyvaluegroup_val
0a03.5
1b17.0
2a23.5
3a33.5
4b47.0
5c5NaN
another = pd.DataFrame([[7,8],[9,10],[11,12],[16,17]],
                      index = ['a','c','e','f'],
                      columns = ['new york','oregon'])
another
new yorkoregon
a78
c910
e1112
f1617
left2.join([right2,another])
ohionevadamissourialabamanew yorkoregon
a1.02.0NaNNaN7.08.0
c3.04.09.010.09.010.0
e5.06.013.014.011.012.0
left2.join([right2,another],how='outer')
ohionevadamissourialabamanew yorkoregon
a1.02.0NaNNaN7.08.0
c3.04.09.010.09.010.0
e5.06.013.014.011.012.0
bNaNNaN7.08.0NaNNaN
dNaNNaN11.012.0NaNNaN
fNaNNaNNaNNaN16.017.0

8.2.3 Concatenating Along an Axis

  • Another kind of data combination operation is referred to interchangeably as concatenation, binding, or stacking.
  • NumPy's concatenate function can do this with NumPy arrays
  • concat function arguments

Parameter   Description
objs        List or dict of pandas objects to be concatenated; this is the only required argument
axis        Axis to concatenate along; defaults to 0 (along rows)
join        Either 'inner' or 'outer' ('outer' by default); whether to intersect (inner) or union (outer) the indexes along the other axes
join_axes   Specific indexes to use for the other n-1 axes instead of performing inner/outer set logic
keys        Values to associate with the objects being concatenated, forming a hierarchical index along the concatenation axis; can be a list or array of arbitrary values, an array of tuples, or a list of arrays (if multi-level arrays are passed to the levels argument)
levels      Specific indexes to use as the hierarchical index level or levels if keys are passed
arr = np.arange(12).reshape(3,4)
arr
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
np.concatenate([arr,arr],axis=1)
array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])
s1 = pd.Series([0,1],index=['a','b'])
s1
a    0
b    1
dtype: int64
s2 = pd.Series([2,3,4],index=['c','d','e'])
s2
c    2
d    3
e    4
dtype: int64
s3 = pd.Series([5,6],index=['f','g'])
s3
f    5
g    6
dtype: int64
pd.concat([s1,s2,s3])
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64
#The concat method takes effect along the axis of axis=0 to generate another Series.
#If you pass axis=1, the returned result is a DataFrame (column when axis=1)
pd.concat([s1,s2,s3],axis=1)
012
a0.0NaNNaN
b1.0NaNNaN
cNaN2.0NaN
dNaN3.0NaN
eNaN4.0NaN
fNaNNaN5.0
gNaNNaN6.0
s4 = pd.concat([s1,s3])
s4
a    0
b    1
f    5
g    6
dtype: int64
pd.concat([s1,s4],axis=1)
01
a0.00
b1.01
fNaN5
gNaN6
pd.concat([s1,s4],axis=1,join='inner')
01
a00
b11
#You can use join_axes to specify the axis used to connect other axes
pd.concat([s1,s4],axis=1,join_axes=[['a','c','b','e']])
01
a0.00
b1.01
fNaN5
gNaN6
result = pd.concat([s1,s2,s3],keys=['one','two','three'])
result
one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: int64
result.unstack()
abcdefg
one0.01.0NaNNaNNaNNaNNaN
twoNaNNaN2.03.04.0NaNNaN
threeNaNNaNNaNNaNNaN5.06.0
pd.concat([s1,s2,s3],axis = 1,keys=['one','two','three'])
onetwothree
a0.0NaNNaN
b1.0NaNNaN
cNaN2.0NaN
dNaN3.0NaN
eNaN4.0NaN
fNaNNaN5.0
gNaNNaN6.0
df1 = pd.DataFrame(np.arange(6).reshape(3,2),
                  index = ['a','b','c'],
                  columns = ['one','two'])
df1
onetwo
a01
b23
c45
df2 = pd.DataFrame(np.arange(4).reshape(2,2)+5,
                  index = ['a','c'],
                  columns = ['three','four'])
df2
threefour
a56
c78
pd.concat([df1,df2],axis=1,keys=['lever1','level2'])
lever1level2
onetwothreefour
a015.06.0
b23NaNNaN
c457.08.0
#If you pass the object's dictionary instead of the list, the dictionary key will be used for the keys option
pd.concat({'level1':df1,'level2':df2},axis=1)
level1level2
onetwothreefour
a015.06.0
b23NaNNaN
c457.08.0
pd.concat([df1,df2],axis=1,keys=['lever1','level2'],names=['upper','lower'])
upperlever1level2
loweronetwothreefour
a015.06.0
b23NaNNaN
c457.08.0
df1 = pd.DataFrame(np.random.randn(3,4),columns = ['a','b','c','d'])
df1
abcd
0-1.1195931.953114-1.514807-1.054782
10.5433931.1729030.9458290.656643
21.0126951.481920-0.413033-1.280521
df2 = pd.DataFrame(np.random.randn(2,3),columns = ['b','d','a'])
df2
bda
01.638046-0.8501121.895532
1-1.1759521.370474-0.992356
pd.concat([df1,df2],ignore_index=True)
abcd
0-1.1195931.953114-1.514807-1.054782
10.5433931.1729030.9458290.656643
21.0126951.481920-0.413033-1.280521
31.8955321.638046NaN-0.850112
4-0.992356-1.175952NaN1.370474

8.2.4 Combining Data with Overlap

a = pd.Series([np.nan,2.5,0,3.5,4.5,np.nan],
             index=['f','e','d','c','b','a'])
a
f    NaN
e    2.5
d    0.0
c    3.5
b    4.5
a    NaN
dtype: float64
b = pd.Series([0,np.nan,2,np.nan,np.nan,5],
             index=['a','b','c','d','e','f'])
b
a    0.0
b    NaN
c    2.0
d    NaN
e    NaN
f    5.0
dtype: float64
#Consider NumPy's where function, which can perform array oriented if else equivalent operations
np.where(pd.isnull(a),b,a)
array([0. , 2.5, 0. , 3.5, 4.5, 5. ])
#Series has a combine_first method, which performs the equivalent of this operation along with pandas's usual data alignment logic
b.combine_first(a)
a    0.0
b    4.5
c    2.0
d    0.0
e    2.5
f    5.0
dtype: float64
df1 = pd.DataFrame({'a':[1,np.nan,5,np.nan],
                   'b':[np.nan,2,np.nan,6],
                   'c':range(2,18,4)})
df1
abc
01.0NaN2
1NaN2.06
25.0NaN10
3NaN6.014
df2 = pd.DataFrame({'a':[5,4,np.nan,3,7],
                   'b':[np.nan,3,4,6,8]})
df2
ab
05.0NaN
14.03.0
2NaN4.0
33.06.0
47.08.0
df1.combine_first(df2)
abc
01.0NaN2.0
14.02.06.0
25.04.010.0
33.06.014.0
47.08.0NaN

8.3 Reshaping and Pivoting

  • There are a number of basic operations for rearranging tabular data, referred to as reshape or pivot operations.

8.3.1 Reshaping with Hierarchical Indexing

  • stack: this "rotates" or pivots from the columns in the data to the rows.
  • unstack: this pivots from the rows into the columns.
data = pd.DataFrame(np.arange(6).reshape(2,3),
                    index = pd.Index(['ohio','colorado'],name='state'),
                    columns = pd.Index(['one','two','three'],name='number'))
data
numberonetwothree
state
ohio012
colorado345
result = data.stack()
result
state     number
ohio      one       0
          two       1
          three     2
colorado  one       3
          two       4
          three     5
dtype: int32
result.unstack()
numberonetwothree
state
ohio012
colorado345
result.unstack(0)
stateohiocolorado
number
one03
two14
three25
result.unstack('state')
stateohiocolorado
number
one03
two14
three25
#Splitting may introduce missing values if all values in the hierarchy are not included in each subgroup
s1 = pd.Series([0,1,2,3],index=['a','b','c','d'])
s1
a    0
b    1
c    2
d    3
dtype: int64
s2 = pd.Series([4,5,6],index=['c','d','e'])
s2
c    4
d    5
e    6
dtype: int64
data = pd.concat([s1,s2],keys=['one','two'])
data
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64
data.unstack()
abcde
one0.01.02.03.0NaN
twoNaNNaN4.05.06.0
#By default, the stack will filter out missing values, so the operation of stacking and destacking is reversible
data.unstack().stack()
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64
data.unstack().stack(dropna = False)
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64
#When you unstack in a DataFrame, the level unstacked becomes the lowest level in the resulting columns
df = pd.DataFrame({'left':result,'right':result+5},
                 columns=pd.Index(['left','right'],name='side'))
df
sideleftright
statenumber
ohioone05
two16
three27
coloradoone38
two49
three510
df.unstack('state')
sideleftright
stateohiocoloradoohiocolorado
number
one0358
two1469
three25710
#When calling the stack method, we can indicate the name of the axis to be stacked
df.unstack('state').stack('side')
statecoloradoohio
numberside
oneleft30
right85
twoleft41
right96
threeleft52
right107
df.unstack('state').stack()
sideleftright
numberstate
oneohio05
colorado38
twoohio16
colorado49
threeohio27
colorado510

8.3.2 Pivoting "Long" to "Wide" Format

data = pd.read_csv('examples/macrodata.csv')
data.head()
yearquarterrealgdprealconsrealinvrealgovtrealdpicpim1tbilrateunemppopinflrealint
01959.01.02710.3491707.4286.898470.0451886.928.98139.72.825.8177.1460.000.00
11959.02.02778.8011733.7310.859481.3011919.729.15141.73.085.1177.8302.340.74
21959.03.02775.4881751.8289.226491.2601916.429.35140.53.825.3178.6572.741.09
31959.04.02785.2041753.7299.356484.0521931.329.37140.04.335.6179.3860.274.06
41960.01.02847.6991770.5331.722462.1991955.529.54139.63.505.2180.0072.311.19
periods = pd.PeriodIndex(year=data.year,quarter=data.quarter,name='date')
periods
PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='period[Q-DEC]', name='date', length=203, freq='Q-DEC')
columns = pd.Index(['realgdp','infl','unemp'],name='item')
columns
Index(['realgdp', 'infl', 'unemp'], dtype='object', name='item')
data = data.reindex(columns=columns)
data.head()
itemrealgdpinflunemp
02710.349NaN5.8
12778.801NaN5.1
22775.488NaN5.3
32785.204NaN5.6
42847.699NaN5.2
data.index = periods.to_timestamp('D','end')
data.head()
itemrealgdpinflunemp
date
1959-03-31 23:59:59.9999999992710.349NaN5.8
1959-06-30 23:59:59.9999999992778.801NaN5.1
1959-09-30 23:59:59.9999999992775.488NaN5.3
1959-12-31 23:59:59.9999999992785.204NaN5.6
1960-03-31 23:59:59.9999999992847.699NaN5.2
ldata = data.stack().reset_index().rename(columns={0:'value'})
ldata[:10]
dateitemvalue
01959-03-31 23:59:59.999999999realgdp2710.349
11959-03-31 23:59:59.999999999unemp5.800
21959-06-30 23:59:59.999999999realgdp2778.801
31959-06-30 23:59:59.999999999unemp5.100
41959-09-30 23:59:59.999999999realgdp2775.488
51959-09-30 23:59:59.999999999unemp5.300
61959-12-31 23:59:59.999999999realgdp2785.204
71959-12-31 23:59:59.999999999unemp5.600
81960-03-31 23:59:59.999999999realgdp2847.699
91960-03-31 23:59:59.999999999unemp5.200
pivoted = ldata.pivot('date','item','value')
pivoted
itemrealgdpunemp
date
1959-03-31 23:59:59.9999999992710.3495.8
1959-06-30 23:59:59.9999999992778.8015.1
1959-09-30 23:59:59.9999999992775.4885.3
1959-12-31 23:59:59.9999999992785.2045.6
1960-03-31 23:59:59.9999999992847.6995.2
.........
2008-09-30 23:59:59.99999999913324.6006.0
2008-12-31 23:59:59.99999999913141.9206.9
2009-03-31 23:59:59.99999999912925.4108.1
2009-06-30 23:59:59.99999999912901.5049.2
2009-09-30 23:59:59.99999999912990.3419.6

203 rows × 2 columns
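pivot here is equivalent to creating a hierarchical index with set_index and then reshaping with unstack; a sketch of the equivalent spelling:
unstacked = ldata.set_index(['date', 'item']).unstack('item')
unstacked.head()
#same values as pivoted, with an extra 'value' level on the columns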

8.3.3 Pivoting "Wide" to "Long" Format

  • In a DataFrame, the inverse operation of pivot is pandas.melt
df = pd.DataFrame({'key':['foo','bar','baz'],
                  'A':[1,2,3],
                  'B':[4,5,6],
                  'C':[7,8,9]})
df
keyABC
0foo147
1bar258
2baz369
#When using pandas.melt, we must indicate which columns (if any) are group indicators
melted = pd.melt(df)
melted
variablevalue
0keyfoo
1keybar
2keybaz
3A1
4A2
5A3
6B4
7B5
8B6
9C7
10C8
11C9
melted = pd.melt(df,['key'])
melted
keyvariablevalue
0fooA1
1barA2
2bazA3
3fooB4
4barB5
5bazB6
6fooC7
7barC8
8bazC9
reshaped = melted.pivot('key','variable','value')
reshaped
variableABC
key
bar258
baz369
foo147
#Using reset_index moves the data back into columns
reshaped.reset_index()
variablekeyABC
0bar258
1baz369
2foo147
pd.melt(df,id_vars=['key'],value_vars=['A','B'])
keyvariablevalue
0fooA1
1barA2
2bazA3
3fooB4
4barB5
5bazB6
pd.melt(df,value_vars=['A','B','C'])
variablevalue
0A1
1A2
2A3
3B4
4B5
5B6
6C7
7C8
8C9
pd.melt(df,value_vars=['A','B','key'])
variablevalue
0A1
1A2
2A3
3B4
4B5
5B6
6keyfoo
7keybar
8keybaz

Chapter 9: Plotting and Visualization

%matplotlib notebook

9.1 A Brief matplotlib API Primer

  • One detail to note when using Jupyter notebook is that the chart is reset after each cell is run, so for more complex charts you must put all of the plotting commands in a single notebook cell
import matplotlib.pyplot as plt
import numpy as np
data = np.arange(10)
data
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
plt.plot(data)

9.1.1 Figures and Subplots

  • Plots in matplotlib reside within a Figure object
  • pyplot.subplots options

Parameter    Description
nrows        Number of rows of subplots
ncols        Number of columns of subplots
sharex       All subplots should use the same x-axis ticks (adjusting xlim will affect all subplots)
sharey       All subplots should use the same y-axis ticks (adjusting ylim will affect all subplots)
subplot_kw   Dict of keyword arguments passed to add_subplot, used to create the subplots
**fig_kw     Additional keyword arguments used when creating the figure, such as plt.subplots(2, 2, figsize=(8, 6))
#You can create a new figure with plt.figure
fig = plt.figure()
ax1 = fig.add_subplot(2,2,1)
ax2 = fig.add_subplot(2,2,2)
ax3 = fig.add_subplot(2,2,3)
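A convenience sketch using the subplots options listed above: plt.subplots creates the figure and a NumPy array of subplot axes in a single call:
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
axes[0, 0].hist(np.random.randn(100), bins=20, color='k', alpha=0.3)   #draw into the top-left subplot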