Pandas: from shallow to deep

What is pandas?

pandas is a data analysis library built on top of NumPy. It draws on a number of other libraries and provides standard data models, together with a large set of functions and methods for working with large datasets quickly and conveniently. It is one of the main reasons Python is a powerful and efficient environment for data analysis.

Why learn pandas?

NumPy already helps us process data, and together with matplotlib it can handle many data analysis problems. So why learn pandas?
NumPy handles numerical data well, but that is not enough.
Most of the time our data also contains strings, time series, and other non-numerical values.
For example, data collected by a crawler and stored in a database.
For example, in the earlier YouTube example, the data included country information, video category (tag) information, title information, and so on, in addition to numerical values.
So NumPy helps us with numerical values, while pandas, which is built on NumPy, also helps us with the other types of data.

Common data types of pandas

Series: a one-dimensional, labeled array

Create a Series from an ndarray

If the data is an ndarray, any index passed must have the same length as the array. If no index is passed, the default index is range(n), where n is the array length, that is, [0, 1, 2, ..., len(array) - 1].
Case 1:

# Import the pandas library under the alias pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])  # a NumPy array object
s = pd.Series(data)
print(s)
Running the code above produces the following result:
0   a
1   b
2   c
3   d
dtype: object
[Note] No index is specified here, so the default index is used.
Case 2:

# Import the pandas library under the alias pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print(s)
Running the code above produces the following result:
100  a
101  b
102  c
103  d
dtype: object
[Note] An explicit index is passed here.

Create a series from a dictionary

A dictionary can be passed as input. If no index is specified, the dictionary keys are used to build the index (current pandas keeps them in insertion order; older versions sorted them). If an index is passed, the values whose keys match the index labels are pulled out.
Example 1

# Import the pandas library under the alias pd
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)
Running the code above produces the following result:
a 0.0
b 1.0
c 2.0
dtype: float64

Note: the dictionary keys are used to build the index. The dtype can also be specified explicitly through the dtype parameter.
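
The note about choosing the dtype yourself can be sketched as follows. This is a minimal illustration, assuming integer input values; the variable names s_default and s_float are made up for the example.

```python
import pandas as pd

data = {'a': 0, 'b': 1, 'c': 2}

# Without dtype, pandas infers an integer type from the values.
s_default = pd.Series(data)

# With dtype, the values are converted when the Series is built.
s_float = pd.Series(data, dtype='float64')

print(s_default.dtype)
print(s_float.dtype)
```

In both cases the dictionary keys become the index; only the type of the stored values changes.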

Example 2

# Import the pandas library under the alias pd
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data,index=['b','c','d','a'])
print(s)
Running the code above produces the following result:
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64

Note: if an index label has no matching key in the dictionary, its value is filled with NaN.

Access data from a Series by position

The data in a Series can be accessed by position, in a similar way to an ndarray.
Example 1:
Retrieve the first element. As with arrays, positions count from zero, so the first element is stored at position 0, and so on.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# Retrieve the first element by position
# (s[0] on a non-integer index is deprecated in recent pandas, so iloc is used)
print(s.iloc[0])
Running the code above produces the following result:
1

Example 2
Retrieve the first three elements in the Series. A slice written as :n extracts the items before position n, and n: extracts all items from position n onward. With two parameters separated by a colon (start:stop), the items between the two positions are extracted, excluding the stop position.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# Retrieve the first three elements
print(s[:3])
Running the code above produces the following result:
a  1
b  2
c  3
dtype: int64

Example 3
To retrieve the last three elements, refer to the following example code:

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# Retrieve the last three elements
print(s[-3:])
Running the code above produces the following result:
c  3
d  4
e  5
dtype: int64

Retrieving data using labels (index)

A Series is like a fixed-size dictionary: values can be read and set through index labels.
Example 1
Use an index label to retrieve a single element.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# Retrieve a single element
print(s['a'])
Running the code above produces the following result:
1

Example 2
Retrieve multiple elements using a list of index labels.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# Retrieve multiple elements
print(s[['a','c','d']])
Running the code above produces the following result:
a  1
c  3
d  4
dtype: int64

Example 3
If the label is not in the index, an exception is raised.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# Try to retrieve an element with a label that does not exist
print(s['f'])
Running the code above produces the following result:
...
KeyError: 'f'
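
The dictionary comparison also covers setting values and avoiding the KeyError shown above. The following is a small sketch rather than part of the original examples; the names has_f and missing are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# Assign through an existing label, as with a dictionary key.
s['a'] = 10

# Membership tests look at the index labels, not the values.
has_f = 'f' in s

# get() returns a default value instead of raising KeyError.
missing = s.get('f', default=-1)

print(s['a'], has_f, missing)
```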

Slicing and indexing a pandas Series
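
As a hedged sketch of the slicing and indexing styles this heading refers to (the original illustration is not available, so the example data is assumed), position slices exclude the stop position while label slices include the stop label:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

by_position = s.iloc[1:3]   # positions 1 and 2; the stop position is excluded
by_label = s.loc['b':'d']   # labels b, c and d; the stop label is included
over_three = s[s > 3]       # Boolean mask keeps the values greater than 3

print(by_position)
print(by_label)
print(over_three)
```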

Index and values of a pandas Series
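
A Series keeps its labels and its data separately, and both can be read directly. A minimal sketch with assumed example data:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

labels = s.index    # a pandas Index object holding the labels
values = s.values   # the underlying data as a NumPy ndarray

print(type(labels), list(labels))
print(type(values), values)
```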

Reading external data with pandas

Now suppose we have a set of statistics on dog names. How should we inspect these statistics?


[Note] A Boolean index was used to keep only the names that appear between 800 and 1,000 times. The result is no longer a Series but a new data structure: the DataFrame, introduced next.
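
The Boolean filter described in the note can be sketched like this. The column names Row_Labels and Count_AnimalName come from the dog-name code later in this article; the rows themselves are hypothetical stand-ins, not values from the real CSV file.

```python
import pandas as pd

# Hypothetical rows standing in for the dog-name statistics file
df = pd.DataFrame({
    'Row_Labels': ['BELLA', 'MAX', 'CHARLIE', 'ROCKY'],
    'Count_AnimalName': [1195, 1153, 856, 823],
})

counts = df['Count_AnimalName']

# Combine conditions with & and wrap each one in parentheses.
popular = df[(counts > 800) & (counts < 1000)]

print(popular)
print(type(popular))
```

Filtering a DataFrame with a Boolean mask keeps whole rows, which is why the result is a DataFrame and not a Series.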

DataFrame: a two-dimensional container of Series

DataFrame of pandas


DataFrame objects have both a row index and a column index.
Row index: labels each (horizontal) row, is called index, and corresponds to axis 0 (axis=0).
Column index: labels each (vertical) column, is called columns, and corresponds to axis 1 (axis=1).
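
To make the two axes concrete, here is a small sketch with assumed data. Summing along axis=0 collapses the rows and gives one result per column; summing along axis=1 collapses the columns and gives one result per row.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [10, 20, 30]},
                  index=['a', 'b', 'c'])

col_totals = df.sum(axis=0)  # one total per column, labeled by df.columns
row_totals = df.sum(axis=1)  # one total per row, labeled by df.index

print(col_totals)
print(row_totals)
```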


[Note] The figure above shows the parameters that can be passed to the DataFrame constructor.

Just as shape, ndim, and dtype give the basic information about an ndarray, a DataFrame offers the same kind of overview through attributes such as shape, ndim, and dtypes and methods such as head(), info(), and describe().
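
A quick way to inspect a DataFrame is sketched below, using a small assumed table. All of the attributes and methods shown are standard pandas ones.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve'], 'Age': [28, 34, 29]})

print(df.shape)       # (number of rows, number of columns)
print(df.ndim)        # number of dimensions
print(df.dtypes)      # data type of each column
print(df.index)       # row index
print(df.columns)     # column index
print(df.head(2))     # first two rows
df.info()             # row count, column types, memory usage
print(df.describe())  # summary statistics for the numeric columns
```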

Create a DataFrame from a list

Example 1

import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)

Running the code above produces the following result:
     0
0    1
1    2
2    3
3    4
4    5

Example 2

import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)

Running the code above produces the following result:
      Name      Age
0     Alex      10
1     Bob       12
2     Clarke    13

Example 3

import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
# Recent pandas versions reject dtype=float in the constructor here,
# because the Name column holds strings, so only Age is converted.
df = df.astype({'Age': float})
print(df)

Running the code above produces the following result:
      Name     Age
0     Alex     10.0
1     Bob      12.0
2     Clarke   13.0

Note: astype() changes the type of the Age column to floating point and leaves the Name column as it is.

Create a DataFrame from a dictionary of ndarrays / lists

All the ndarrays must have the same length. If an index is passed, its length must equal the length of the arrays.
If no index is passed, the default index is range(n), where n is the array length.
Example 1

import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)

Running the code above produces the following result (current pandas keeps the dictionary's key order for the columns):
      Name    Age
0      Tom     28
1     Jack     34
2    Steve     29
3    Ricky     42

Note: observe the values 0, 1, 2, 3. They are the default index assigned to the rows using range(n).

Example 2
Create a DataFrame with an explicit row index.

import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)

Running the code above produces the following result:
          Name    Age
rank1      Tom     28
rank2     Jack     34
rank3    Steve     29
rank4    Ricky     42

Note: the index parameter assigns a label to each row; the columns parameter does the same for the columns.

Create a DataFrame from a list of dictionaries

A list of dictionaries can be passed as input data to create a DataFrame. By default, the dictionary keys are used as the column names.
Example 1
The following example shows how to create a DataFrame by passing in a list of dictionaries.

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)

Running the code above produces the following result:
    a    b      c
0   1   2     NaN
1   5   10   20.0

Note: NaN (Not a Number) is filled in where values are missing.

Example 2
The following example shows how to create a DataFrame by passing a list of dictionaries and a row index.

import pandas as pd
# The dictionary keys become the column index: a, b, c
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)

Running the code above produces the following result:
        a   b       c
first   1   2     NaN
second  5   10   20.0

Example 3
The following example shows how to create a DataFrame using a list of dictionaries, a row index, and a column index.

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]

# Two column labels, both matching dictionary keys
# (leaving 'c' out of columns drops that column)
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])

# Two column labels, one of which ('b1') matches no dictionary key
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print(df1)
print(df2)

Running the code above produces the following result:
#df1 output
         a  b
first    1  2
second   5  10

#df2 output
         a  b1
first    1  NaN
second   5  NaN

Note: df2 is created with a column label (b1) that is not a dictionary key, so NaN is filled in for that column. df1 is created with column labels that match the dictionary keys, so no NaN appears.

Create a DataFrame from a dictionary of Series

A dictionary of Series can be passed to form a DataFrame. The resulting index is the union of the indexes of all the Series passed.
Example

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df)

Running the code above produces the following result:
      one    two
a     1.0    1
b     2.0    2
c     3.0    3
d     NaN    4


Note: the first Series has no label 'd', so in the result the value for label d in column one is NaN.

Column selection

Next, we select a column from the DataFrame.
Example

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df['one'])

Running the code above produces the following result:
a     1.0
b     2.0
c     3.0
d     NaN
Name: one, dtype: float64
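
Selecting with a single label returns a Series, as in the output above. As a short supplementary sketch, passing a list of labels returns a DataFrame instead; the names single and several are illustrative.

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

single = df['one']            # one label: a Series
several = df[['one', 'two']]  # a list of labels: a DataFrame

print(type(single))
print(type(several))
```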

Column addition

The following example adds new columns to an existing DataFrame.
Example

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

# Add a new column to an existing DataFrame by assigning a Series to a new column label

print("Adding a new column by passing as Series:")
df['three']=pd.Series([10,20,30],index=['a','b','c'])
print(df)

print("Adding a new column using the existing columns in DataFrame:")
df['four']=df['one']+df['three']

print(df)

Running the code above produces the following result:
Adding a new column by passing as Series:
     one   two   three
a    1.0    1    10.0
b    2.0    2    20.0
c    3.0    3    30.0
d    NaN    4    NaN

Adding a new column using the existing columns in DataFrame:
      one   two   three    four
a     1.0    1    10.0     11.0
b     2.0    2    20.0     22.0
c     3.0    3    30.0     33.0
d     NaN    4     NaN     NaN

Column deletion

Columns can be deleted with del or removed with pop(); take a look at the example below.
Example

# Build the same DataFrame as before, then delete columns from it
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']), 
     'three' : pd.Series([10,20,30], index=['a','b','c'])}

df = pd.DataFrame(d)
print("Our dataframe is:")
print(df)

# Using the del statement
print("Deleting the first column using DEL function:")
del df['one']
print(df)

# Using the pop() method
print("Deleting another column using POP function:")
df.pop('two')
print(df)
Running the code above produces the following result (current pandas keeps the dictionary's key order for the columns):
Our dataframe is:
      one   two   three
a     1.0    1    10.0
b     2.0    2    20.0
c     3.0    3    30.0
d     NaN    4     NaN

Deleting the first column using DEL function:
      two   three
a      1    10.0
b      2    20.0
c      3    30.0
d      4     NaN

Deleting another column using POP function:
   three
a  10.0
b  20.0
c  30.0
d  NaN
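
Both del and pop() change the DataFrame in place. As an alternative sketch (not part of the original example), drop(columns=...) returns a new DataFrame and leaves the original untouched, while pop() also hands back the removed column. The small table here is assumed for illustration.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2], 'two': [3, 4], 'three': [5, 6]})

# drop() returns a copy without the listed columns; df itself is unchanged.
trimmed = df.drop(columns=['one', 'two'])

# pop() removes the column from df and returns it as a Series.
popped = df.pop('three')

print(trimmed)
print(df)
print(popped)
```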

Row selection, adding and deleting

The following examples cover row selection, addition, and deletion. We start with selection.
Label selection
You can select a row by passing its row label to the loc[] indexer. Refer to the following example code:

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df.loc['b'])
Running the code above produces the following result:
one 2.0
two 2.0
Name: b, dtype: float64
The result is a Series whose index labels are the column names of the DataFrame, and whose name is the label that was retrieved.
Select by integer position
You can select a row by passing its integer position to the iloc[] indexer. Refer to the following example code:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df.iloc[2])
Running the code above produces the following result:
one   3.0
two   3.0
Name: c, dtype: float64
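
Both indexers also accept a row selector and a column selector together. The following is a brief sketch on the same data; the variable names cell, first_two, and block are illustrative.

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

cell = df.loc['b', 'two']             # single value by row and column label
first_two = df.iloc[0:2, 0]           # first two rows of the first column
block = df.loc[['a', 'd'], ['two']]   # lists of labels give a DataFrame

print(cell)
print(first_two)
print(block)
```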

Row slice

You can use the : operator to select multiple rows. Refer to the following example code:

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
    'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df[2:4])
Running the code above produces the following result:
      one    two
c     3.0     3
d     NaN     4

Appending rows

Use pd.concat() to add new rows to a DataFrame; the rows are appended at the end. (The older DataFrame.append() method was removed in pandas 2.0.)

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

# Stack the rows of df2 below the rows of df
df = pd.concat([df, df2])
print(df)
Running the code above produces the following result:
   a  b
0  1  2
1  3  4
0  5  6
1  7  8
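
The result above repeats the row labels 0 and 1 because each input keeps its own index. If unique labels are wanted, one option is to pass ignore_index=True to pd.concat(), sketched here with the same data:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

# ignore_index=True discards the original labels and renumbers the rows.
combined = pd.concat([df, df2], ignore_index=True)

print(combined)
```

With unique labels, dropping label 0 afterwards removes exactly one row.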

Deleting rows

Use an index label to delete rows from a DataFrame. If the label is duplicated, every row with that label is deleted.
Note that in the previous example the labels are repeated. The following code drops one label so you can see how many rows are removed.

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

# DataFrame.append() was removed in pandas 2.0, so pd.concat() is used
df = pd.concat([df, df2])

# Drop rows with label 0
df = df.drop(0)

print(df)
Running the code above produces the following result:
  a b
1 3 4
1 7 8
In the example above, two rows were deleted because both carry the label 0.

pandas case studies

Extracting Douban information from MongoDB

from pymongo import MongoClient
import pandas as pd

# Connect to MongoDB
client = MongoClient()
# Use the tv1 collection of the douban database
collection = client["douban"]["tv1"]
data = collection.find()  # returns a cursor over the documents; each document behaves like a (possibly nested) dictionary
data_list = []
for i in data:
    temp = {}
    temp["info"]= i["info"]
    # i["rating"]["count"] first takes the value stored under "rating", then the value stored under "count" inside it
    temp["rating_count"] = i["rating"]["count"]
    temp["rating_value"] = i["rating"]["value"]
    temp["title"] = i["title"]
    temp["country"] = i["tv_category"]
    temp["directors"] = i["directors"]
    temp["actors"] = i['actors']
    data_list.append(temp)
# t1 = data[0]
# t1 = pd.Series(t1)
# print(t1)

df = pd.DataFrame(data_list)
# print(df)

# Display the first row
print(df.head(1))
# print("*"*100)
# print(df.tail(2))

# Summarize df at a glance
# print(df.info())
# print(df.describe())
print(df["info"].str.split("/").tolist())
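
The last line relies on the .str accessor, which applies string methods to every element of a column. Since the MongoDB data is not available here, the sketch below uses hypothetical info strings in an assumed "a/b/c" shape to show what str.split() returns.

```python
import pandas as pd

# Hypothetical values; the real "info" field may look different.
df = pd.DataFrame({'info': ['China/Drama/2019', 'USA/Comedy/2018']})

parts = df['info'].str.split('/')   # Series whose elements are lists
as_lists = parts.tolist()           # plain Python list of lists

# expand=True spreads the pieces into separate columns instead.
expanded = df['info'].str.split('/', expand=True)

print(as_lists)
print(expanded)
```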

Sorting dog names by the number of times they appear

import pandas as pd

df = pd.read_csv("./dogNames2.csv")
# print(df.head())
# print(df.info())

# Sort the DataFrame by a column, largest count first
df = df.sort_values(by="Count_AnimalName",ascending=False)
# print(df.head(5))

# Notes on selecting rows or columns in pandas
# - A slice inside square brackets selects rows
# - A string inside square brackets is a column label and selects a column
print(df[:20])
print(df["Row_Labels"])
print(type(df["Row_Labels"]))
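
Because dogNames2.csv is not included here, the same steps can be tried on a small in-memory table. This is a sketch with hypothetical rows; only the column names are taken from the code above.

```python
import pandas as pd

# Hypothetical rows standing in for dogNames2.csv
df = pd.DataFrame({
    'Row_Labels': ['A', 'B', 'C'],
    'Count_AnimalName': [2, 9, 5],
})

df = df.sort_values(by='Count_AnimalName', ascending=False)

top_two = df[:2]           # a slice in square brackets selects rows
names = df['Row_Labels']   # a string in square brackets selects a column

print(top_two)
print(type(names))
```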

Keywords: Python shell JSON Database

Added by Braveheart on Wed, 17 Jun 2020 06:08:58 +0300