hello everyone!
Whether you admit it or not, data cleaning is not a simple task. In most cases, this work is very time-consuming and boring, but it is very important.
If you go through the process of data cleaning, you will understand what I mean. This is the purpose of writing this article - to make it easier for readers to clean data.
In fact, I realized not long ago that some data have similar patterns when cleaning data. Since then, I have compiled some data cleansing code (see below), which I think can also be used in other common scenarios.
Because these common scenarios involve different types of data sets, this article focuses more on showing and explaining what these codes can be used to do, so that readers can use them more conveniently.
Data cleaning toolbox
In the following code snippet, the data cleaning code is encapsulated in some functions, and the purpose of the code is very intuitive. You can use this code directly without embedding them into functions that require a small number of parameter changes.
1. Delete multiple columns of data
def drop_multiple_col(col_names_list, df): ''' AIM -> Drop multiple columns based on their column names INPUT -> List of column names, df OUTPUT -> updated df with dropped columns ------ ''' df.drop(col_names_list, axis=1, inplace=True) return df
Sometimes, not all columns of data are useful for our data analysis work. Therefore, "df.drop" can easily delete the columns you selected.
2. Convert Dtypes
def change_dtypes(col_int, col_float, df): ''' AIM -> Changing dtypes to save memory INPUT -> List of column names (int, float), df OUTPUT -> updated df with smaller memory ------ ''' df[col_int] = df[col_int].astype('int32') df[col_float] = df[col_float].astype('float32')
When we face larger data sets, we need to convert "dtypes" to save memory.
3. Convert classified variables to numeric variables
def convert_cat2num(df): # Convert categorical variable to numerical variable num_encode = {'col_1' : {'YES':1, 'NO':0}, 'col_2' : {'WON':1, 'LOSE':0, 'DRAW':0}} df.replace(num_encode, inplace=True)
Some machine learning models require variables to exist in numerical form. At this time, we need to convert classification variables into numerical variables, and then use them as the input of the model. For the data visualization task, I suggest you keep the classification variables, so that the visualization results can be more clearly explained and easy to understand.
4. Check for missing data
def check_missing_data(df): # check for any missing data in the df (display in descending order) return df.isnull().sum().sort_values(ascending=False)
If you want to check how much missing data is in each column, this may be the fastest way. This method can let you know more clearly which columns have more missing data, and help you decide what actions to take in data cleaning and data analysis.
5. Delete the string in the column
def remove_col_str(df): # remove a portion of string in a dataframe column - col_1 df['col_1'].replace('\n', '', regex=True, inplace=True) # remove all the characters after &# (including &#) for column - col_1 df['col_1'].replace(' &#.*', '', regex=True, inplace=True)
Sometimes you may see a new line of characters or some strange symbols in the string column. You can easily use df ['col_1 '] Replace to handle this problem, where "col_1" is a column in the data frame df.
6. Delete spaces in columns
def remove_col_white_space(df): # remove white space at the beginning of string df[col] = df[col].str.lstrip()
When the data is very chaotic, many unexpected situations will happen. It is common to have some spaces at the beginning of a string. Therefore, this method is very useful when you want to delete the spaces at the beginning of the string in the column.
7. Splice two columns of string data (under certain conditions)
def concat_col_str_condition(df): # concat 2 columns with strings if the last 3 letters of the first column are 'pil' mask = df['col_1'].str.endswith('pil', na=False) col_new = df[mask]['col_1'] + df[mask]['col_2'] col_new.replace('pil', ' ', regex=True, inplace=True) # replace the 'pil' with emtpy space
This method is useful when you want to combine two columns of string data under certain conditions. For example, you want to splice the first and second columns of data together when the first column ends with some specific letters. According to your needs, you can also delete the ending letters after the splicing work is completed.
8. Convert timestamp (convert from string type to date "DateTime" format)
def convert_str_datetime(df): ''' AIM -> Convert datetime(String) to datetime(format we want) INPUT -> df OUTPUT -> updated df with new datetime format ------ ''' df.insert(loc=2, column='timestamp', value=pd.to_datetime(df.transdate, format='%Y-%m-%d %H:%M:%S.%f'))
When processing time series data, you may encounter timestamp columns in string format. This means that we may have to convert the data in string format to the date "datetime" format specified according to our needs, so as to use these data for meaningful analysis and display.