Big data project 1 (data preprocessing)

1, Understand what preprocessing is

Data preprocessing is a data mining technique. Its essence is to convert raw data into an understandable format, or into the format that our mining task requires.

2, Why preprocessing

In the real world, data is usually incomplete (some attribute values of interest are missing), inconsistent (the same thing is recorded under different codes or names), and noisy (it contains errors or outliers). Because databases are large and data sets often come from multiple heterogeneous sources, these problems are common, and low-quality data leads to low-quality mining results.

3, Basic methods of data preprocessing

1. Basic method

  • Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies to "clean up" the data;
  • Data integration: combine data from multiple databases, data cubes, or files;
  • Data reduction: replace the original data with a smaller alternative representation while minimizing the loss of information; this includes dimensionality reduction, numerosity reduction, and data compression;
  • Data transformation: convert the data into a form suitable for mining.
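As a rough illustration of the first two items, here is a minimal sketch that integrates two small tables and then resolves an inconsistent code. The tables `source_a` and `source_b` and their values are invented for this example; they are not part of the data set used later.

```python
import pandas as pd

# Two hypothetical sources describing the same kind of records (invented values).
source_a = pd.DataFrame({"Country": ["France", "Spain"], "Age": [44, 27]})
source_b = pd.DataFrame({"Country": ["FR", "Germany"], "Age": [38, 30]})

# Data integration: stack the two sources into one table.
combined = pd.concat([source_a, source_b], ignore_index=True)

# Data cleaning: resolve an inconsistent country code.
combined["Country"] = combined["Country"].replace({"FR": "France"})

print(combined)
```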

4, Apply

To do the preprocessing in Python, we need two libraries, NumPy and pandas.

1. Prepare data

We have a data set that contains information about IT professionals, such as country, salary, and gender.

We are free to create copies of this dataset.

We can observe that the data set contains some null values.
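To follow along without the original file, one option is to build a small stand-in data set by hand. This is only a sketch: every value below is invented, and the column names `Country` and `Gender` are assumptions (only `Age` and `Salary` appear in the code of this post).

```python
import numpy as np
import pandas as pd

# Invented values for illustration only; the real file has its own rows.
data = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany"],
    "Age": [44, 27, 30, np.nan, 40],                               # one missing age
    "Salary": ["$72,000", "$48,000", "$54,000", "$61,000", np.nan],  # one missing salary
    "Gender": ["M", "F", "F", "M", "F"],
})

print(data)
```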

2. Import library

The NumPy library provides mathematical tools for the numerical computation in our code;
The pandas library is used to import and manage data sets.
Import the libraries as follows:

import pandas as pd
import numpy as np

3. Import dataset

My dataset file is a CSV file named profess.csv.

# Read the data (the dataset file is in the same directory as the Python script)
data = pd.read_csv("profess.csv")

After importing the dataset, let's look at its contents:

print(data)

4. Data cleaning - check for missing values

Use the isnull method of pandas to count the missing values in each column.

print(data.isnull().sum())

Both the Age and Salary columns have missing values (i.e. empty values), one in each column. There are several ways to deal with missing values; let's look at two commonly used ones.
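Besides the raw counts, it can help to look at the share of missing values per column, since that share is what usually decides whether to delete or to fill. A minimal sketch on a made-up frame (the name `df` and its values are assumptions for this example):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap in each column.
df = pd.DataFrame({
    "Age": [44, 27, np.nan, 40],
    "Salary": [72000, np.nan, 54000, 61000],
})

missing_count = df.isnull().sum()   # number of nulls per column
missing_share = df.isnull().mean()  # fraction of nulls per column

print(missing_count)
print(missing_share)
```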

(1) Deleting rows or columns that contain null values

If a row has a null value for a particular feature, the row is deleted. If a column has more than 75% missing values, the whole column is deleted. However, we should use this method only when we are sure there is enough sample data, because we need to make sure that deleting data does not introduce additional bias.

data.dropna(inplace=True)
print(data.isnull().sum())
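Note that dropna without arguments removes every row that contains any null value. The more selective rules described above could be sketched as follows; the frame `df`, the `Bonus` column, and all values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: Bonus is 80% missing, Age has a single gap.
df = pd.DataFrame({
    "Age": [44, np.nan, 30, 40, 35],
    "Salary": [72000, 48000, 54000, 61000, 58000],
    "Bonus": [np.nan, np.nan, np.nan, np.nan, 500],
})

# Keep only the columns whose share of missing values is at most 75%.
df = df.loc[:, df.isnull().mean() <= 0.75]

# Drop the rows that are missing a value in one specific feature.
df = df.dropna(subset=["Age"])

print(df)
```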

(2) Filling numeric data such as age, year and amount

This method applies to numeric features such as year, age, or amount. We can calculate the mean, median, or mode of the feature and use it to replace the missing values. Compared with the first method, this keeps the rows that would otherwise be deleted and often produces better results.

Let's look at the operation (applied to the original data, as an alternative to the deletion in method (1)):

# Replace the missing values in the Age column with the median of Age.
# median() is the pandas method that computes the median.
data['Age'] = data['Age'].fillna(data['Age'].median())

print(data['Age'])
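To compare the three fill values mentioned above, here is a small sketch on a made-up series; `age` and its values are assumptions for this example only.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing entry.
age = pd.Series([25, 30, 30, np.nan, 55], name="Age")

filled_mean = age.fillna(age.mean())          # mean of the known values is 35.0
filled_median = age.fillna(age.median())      # median of the known values is 30.0
filled_mode = age.fillna(age.mode().iloc[0])  # most frequent value is 30.0

print(filled_mean, filled_median, filled_mode, sep="\n")
```

The median is less sensitive to extreme values than the mean, which is one reason to prefer it for a skewed feature.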

5. Data reduction

To meet the mining requirements, we need to know which salary range each of these engineers falls into, but we only have the raw Salary column. To make mining easier, we add a salary level column to the data set, derived from the Salary column, which reduces the many distinct salary values to a few ranges. This method is called "attribute construction". Let's look at the operation:

# Data reduction (attribute construction)
# Assumes Salary is already numeric; see the conversion in step 6.
def section(d):
    if d < 50000:
        return "below 50000"
    if 50000 <= d < 100000:
        return "50000-100000"
    if d >= 100000:
        return "100000 and above"

data['level'] = data['Salary'].apply(section)

print(data['level'])


We define a function that decides which interval a Salary value falls into, apply it to the Salary column, and assign the result to the new level column.
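The same kind of binning can also be written with pandas' built-in cut function. This is an alternative sketch, not the method used in this post; the series `salary`, the bin edges, and the label texts are chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric salaries, including values on the bin edges.
salary = pd.Series([42000.0, 50000.0, 99999.0, 100000.0, 150000.0])

level = pd.cut(
    salary,
    bins=[-np.inf, 50000, 100000, np.inf],
    labels=["below 50000", "50000-100000", "100000 and above"],
    right=False,  # intervals are closed on the left, e.g. [50000, 100000)
)

print(level)
```

With right=False, a salary of exactly 50000 falls into the middle range and a salary of exactly 100000 into the top range, so no value on an edge is left without a level.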

6. Data transformation

We can see that the Salary column also has a null value. From a business perspective, Salary should be a number, but in our data set it is stored as a currency-formatted string (with "$" and ","), so we need to transform it into the numeric format we need. Let's look at the actual operation:

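# Data transformation
def convert_currency(d):
    # Remove the thousands separators and the currency symbol, then convert to float.
    # A missing value becomes the string "nan", which float() turns back into NaN.
    new_value = str(d).replace(",", "").replace("$", "")
    return float(new_value)

data['Salary'] = data['Salary'].apply(convert_currency)

# mean() is the pandas method that computes the average.
data['Salary'] = data['Salary'].fillna(data['Salary'].mean())
print(data['Salary'])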

The transformation succeeds.

We define a "data transformation" function, apply it to the Salary column, and then fill the missing values in the same way as in the data cleaning step. This time we fill with the mean instead of the median.
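For larger data sets, the same conversion can be done on the whole column at once instead of value by value. A minimal sketch, assuming the only formatting characters are "$" and ","; the series `salary` and its values are invented for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical currency-formatted salaries with one missing entry.
salary = pd.Series(["$72,000", "$48,000.50", np.nan, "61000"])

# Strip "$" and "," in one pass; missing entries stay missing.
cleaned = salary.str.replace(r"[$,]", "", regex=True)

# Convert to numbers; anything that still cannot be parsed becomes NaN.
numeric = pd.to_numeric(cleaned, errors="coerce")

# Fill the remaining gaps with the column mean.
numeric = numeric.fillna(numeric.mean())

print(numeric)
```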

Keywords: Python Database

Added by mjlogan on Tue, 18 Jan 2022 15:18:21 +0200