Summary of big data processing knowledge points 1

I Basic knowledge of Request Library

method	explain	Return object
.request ()	Construction request
.get()	GET HTML web page, corresponding to HTTP GET	The returned response object contains all the resources of the server
.head()	Get the page header information, corresponding to HEAD
.post()	Submit POST request
.put()	Submit request PUT
.patch()	Submit local modification request, corresponding to PATCH
.delete()	Submit a DELETE request, corresponding to DELETE

General framework for crawling pages:

try:
	r = requests.get(url,timeout = 30)
	r.raise_for_status()
	r.encoding = r.apparent_encoding
	return r.text
except:
	return "raise an exception"

II Loan data conversion and integration of Lending Club

1. Random sampling

loan.sample(frac=0.01)
loan.sample(n=10,replace=True)

n specifies the quantity, frac specifies the proportional sampling, and axis can specify the column sampling

2. Data fusion

2.1 merge

left.merge(right,how='inner',on=None,left_on=None,right_on=None,left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), indicator=False)

give an example:

# 'user' based on user information data_ The variables' user 'and' outer 'are connected.
test_user.merge(test_loan,how="inner",left_on="user_id",right_on="user")

Set the parameter indicator parameter to True, and the result is an additional column of merge.

2.2 join

left.join(right, on=None, how='left', lsuffix='', rsuffix='', sort=False)

Introduction to using join

data.sort_values(by='xxx')
data.sort_index(by='xxx')

ascending defaults to the last missing value

2.3 data conversion

object type: str.strip('%') first, then PD to_ numeric(xxx)

3. Variable discretization

annual_inc=pd.cut(combine.annual_inc,bins=[np.min(combine.annual_inc)-1,np.percentile(combine.annual_inc,50),np.max(combine.annual_inc)+1],labels=['low','high'])

The bins parameter sets the split point
qcut is divided directly by frequency

4. Variable value replacement

Writing method 1:

combine['loan_status'].replace(to_replace=['Fully Paid','Current','Charged Off','In Grace Period','Late (31-120 days)'],value=[0,0,1,1,1],inplace=True)

Writing method 2:

test_loan.replace(to_replace={'loan_status':{'Fully Paid':0,'Charged Off':0,'In Grace Period':1}}
# You can also specify different variables
test_loan.replace(to_replace={'loan_status':'Fully Paid','grade':'A'},value='Good')

map
Numerical substitution only for a series

test_loan['loan_status'].map({'Fully Paid':0,'Charged Off':0,'In Grace Period':1})

Input function mapping

combine['int_rate'][:5]
def f(x):
    if x < 12:
        return 'Low'
    else:
        return 'High'
combine['int_rate'][:5].map(f)

5. Dummy variable (dummy variable) processing

Properties to be processed

cat_vars=['term','grade','emp_length','annual_inc','home_ownership','verification_stat us']

Cyclic processing

for var in cat_vars:
	car_list = pd.get_dummies(combine[var],prefix=var,drop_first=True)
	combine = combine.join(cat_list)

drop_ If first is set to True, the first dummy variable obtained will be deleted. This is because in modeling, k-1 variables are required to describe k-class classification variables. If K variables are used, there will be a problem of complete collinearity.

6. Add constant column

const = pd.Series([1]*combine.shape[0],name='const')
X.reset_index(drop=True,inplace=True)
X=pd.concat([const,X],axis=1)
X.head()

Join is the method of DateFrame, X.join(const)
The object of concat is a list

III Planetary data grouping and aggregation

1. Data grouping

1.1 grouping method

Feature grouping

grouped = =planets.groupby('method')

array,list,series grouping

a=np.repeat([0,1],[500,535])
planets.groupby(a).mean()

Function grouping

new=plantes.set.index('year')
def test(x):
	if x<2000:
		return 'Before'2000
	else:
		return 'After 2000'
		
new.groupby(test).mean()
group_index=new.index.map(test)
new.groupby(group_index).mean()

1.2 group application

Get the required column grouped ['year']
View the median distance under different methods:

plants.groupby('method')['distance'].median()

The year becomes the corresponding year

decade = 10 * (planets['year'] // 10)
decade = decade.astype(str) + 's'
decade[:5]

Group by method and age

plants.groupby(['method',decade])['number'].sum()

Use unstack to split the hierarchical index into new column indexes, and use fillna to fill the missing values with 0.

planets.groupby(['method', decade])['number'].sum().unstack().fillna(0)

2. Analyze Planetary Data

2.1GroupBy.apply

grouped.apply(lambda x:.....)
grouped.apply(func)

Relatively slow

2.2GroupBy.agg

Calculate the average distance of planets discovered by various methods and the sum of the number of planets discovered

grouped.agg({'distance':'mean','number':'sum'})
grouped.agg({'distance':['min','max','mean','median'],'number':'sum'})

You can also customize functions
apply is to perform an operation on each group of data and the entire DataFrame.
agg calculates each feature in each set of data.
① Perform a function on each column in each set of data
② Merge the returned results of each column
③ Merge the returned results of each set of data

2.3GroupBy.transform

Use transform to standardize grouped data (data with the same dimension of grouped data):

grouped.transform(lambda x:x(x-x.mean())/x.std()).head()

You cannot use dictionary style to specify features for function operations

Keywords: Big Data Data Analysis pandas

Added by knowram on Tue, 08 Feb 2022 19:04:28 +0200

Programming VIP