8 Python data cleaning codes, ready to use

hello everyone!

Whether you admit it or not, data cleaning is not a simple task. In most cases, this work is very time-consuming and boring, but it is very important.

If you go through the process of data cleaning, you will understand what I mean. This is the purpose of writing this article - to make it easier for readers to clean data.

In fact, I realized not long ago that some data have similar patterns when cleaning data. Since then, I have compiled some data cleansing code (see below), which I think can also be used in other common scenarios.

Because these common scenarios involve different types of data sets, this article focuses more on showing and explaining what these codes can be used to do, so that readers can use them more conveniently.

Data cleaning toolbox

In the following code snippet, the data cleaning code is encapsulated in some functions, and the purpose of the code is very intuitive. You can use this code directly without embedding them into functions that require a small number of parameter changes.

1. Delete multiple columns of data

def drop_multiple_col(col_names_list, df): 
    '''
    AIM    -> Drop multiple columns based on their column names 

    INPUT  -> List of column names, df

    OUTPUT -> updated df with dropped columns 
    ------
    '''
    df.drop(col_names_list, axis=1, inplace=True)
    return df

Sometimes, not all columns of data are useful for our data analysis work. Therefore, "df.drop" can easily delete the columns you selected.

2. Convert Dtypes

def change_dtypes(col_int, col_float, df): 
    '''
    AIM    -> Changing dtypes to save memory

    INPUT  -> List of column names (int, float), df

    OUTPUT -> updated df with smaller memory  
    ------
    '''
    df[col_int] = df[col_int].astype('int32')
    df[col_float] = df[col_float].astype('float32')

When we face larger data sets, we need to convert "dtypes" to save memory.

3. Convert classified variables to numeric variables

def convert_cat2num(df):
    # Convert categorical variable to numerical variable
    num_encode = {'col_1' : {'YES':1, 'NO':0},
                  'col_2'  : {'WON':1, 'LOSE':0, 'DRAW':0}}  
    df.replace(num_encode, inplace=True)  

Some machine learning models require variables to exist in numerical form. At this time, we need to convert classification variables into numerical variables, and then use them as the input of the model. For the data visualization task, I suggest you keep the classification variables, so that the visualization results can be more clearly explained and easy to understand.

4. Check for missing data

def check_missing_data(df):
    # check for any missing data in the df (display in descending order)
    return df.isnull().sum().sort_values(ascending=False)

If you want to check how much missing data is in each column, this may be the fastest way. This method can let you know more clearly which columns have more missing data, and help you decide what actions to take in data cleaning and data analysis.

5. Delete the string in the column

def remove_col_str(df):
    # remove a portion of string in a dataframe column - col_1
    df['col_1'].replace('\n', '', regex=True, inplace=True)

    # remove all the characters after &# (including &#) for column - col_1
    df['col_1'].replace(' &#.*', '', regex=True, inplace=True)

Sometimes you may see a new line of characters or some strange symbols in the string column. You can easily use df ['col_1 '] Replace to handle this problem, where "col_1" is a column in the data frame df.

6. Delete spaces in columns

def remove_col_white_space(df):
    # remove white space at the beginning of string 
    df[col] = df[col].str.lstrip()

When the data is very chaotic, many unexpected situations will happen. It is common to have some spaces at the beginning of a string. Therefore, this method is very useful when you want to delete the spaces at the beginning of the string in the column.

7. Splice two columns of string data (under certain conditions)

def concat_col_str_condition(df):
    # concat 2 columns with strings if the last 3 letters of the first column are 'pil'
    mask = df['col_1'].str.endswith('pil', na=False)
    col_new = df[mask]['col_1'] + df[mask]['col_2']
    col_new.replace('pil', ' ', regex=True, inplace=True)  # replace the 'pil' with emtpy space

This method is useful when you want to combine two columns of string data under certain conditions. For example, you want to splice the first and second columns of data together when the first column ends with some specific letters. According to your needs, you can also delete the ending letters after the splicing work is completed.

8. Convert timestamp (convert from string type to date "DateTime" format)

def convert_str_datetime(df): 
    '''
    AIM    -> Convert datetime(String) to datetime(format we want)

    INPUT  -> df

    OUTPUT -> updated df with new datetime format 
    ------
    '''
    df.insert(loc=2, column='timestamp', value=pd.to_datetime(df.transdate, format='%Y-%m-%d %H:%M:%S.%f'))

When processing time series data, you may encounter timestamp columns in string format. This means that we may have to convert the data in string format to the date "datetime" format specified according to our needs, so as to use these data for meaningful analysis and display.

Keywords: Python

Added by imartin on Wed, 12 Jan 2022 12:53:13 +0200