Well, I know it's midnight... but I still think this latest idea is worth half an hour to write up, so let's get straight to the point.
Let's simulate a scenario: you need to scrape a page, that page contains many URLs, and each of those sub-URLs leads to the data you actually want. In short, there are three layers, and our code looks like this:
```python
def func_top(url):
    data_dict = {}
    # Get the sub URLs on this page
    sub_urls = xxxx
    data_list = []
    for it in sub_urls:
        data_list.append(func_sub(it))
    data_dict['data'] = data_list
    return data_dict

def func_sub(url):
    data_dict = {}
    # Get the bottom URLs on this page
    bottom_urls = xxxx
    data_list = []
    for it in bottom_urls:
        data_list.append(func_bottom(it))
    data_dict['data'] = data_list
    return data_dict

def func_bottom(url):
    # Get the data
    data = xxxx
    return data
```
func_top is the handler for the top-level page, func_sub is the handler for the sub pages, and func_bottom is the handler for the deepest pages. func_top collects the sub-page URLs and calls func_sub on each of them, and func_sub does the same with func_bottom.
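For concreteness, here is a minimal sketch of what the `xxxx` placeholders might look like, assuming the `requests` library is available; the link-extraction regex and helper name are purely illustrative:

```python
import re
import requests

def get_page_urls(url):
    # Hypothetical helper for the `sub_urls = xxxx` / `bottom_urls = xxxx` steps:
    # fetch the page and pull out absolute links.
    html = requests.get(url, timeout=10).text
    return re.findall(r'href="(https?://[^"]+)"', html)

def func_bottom(url):
    # Get the data: here, simply the raw text of the deepest page.
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.text
```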
Under normal circumstances this meets the need. But the website you want to scrape may be extremely unstable, with connections that often fail, leaving you with missing data.
So at this point you have two choices:
Stop when you hit an error, then re-run from where it broke off
Skip over errors and keep going, then re-run later; on the re-run, you don't want to pull data you already have from the website again, only the data you missed
The first option is basically impossible to implement reliably: if the site reorders its URLs, your recorded position becomes invalid. That leaves only the second option. To put it bluntly, cache the data you have already obtained, and read it from the cache whenever it is needed again.
OK, the goal is set. How do we achieve it?
In C++ this would be a troublesome thing to do, and the code would surely end up ugly. Fortunately, we use Python, and Python has function decorators.
So the implementation scheme is:
Define a decorator. If the data was fetched before, return it straight from the cache; if not, pull it from the website and store it in the cache.
The code is as follows:
```python
import functools
import hashlib
import os
from ast import literal_eval

def deco_args_recent_cache(category='dumps'):
    '''
    Decorator: return the most recently cached data for the same arguments.
    '''
    def deco_recent_cache(func):
        @functools.wraps(func)
        def func_wrapper(*args, **kargs):
            sig = _mk_cache_sig(*args, **kargs)
            data = _get_recent_cache(category, func.__name__, sig)
            if data is not None:
                return data
            data = func(*args, **kargs)
            if data is not None:
                _set_recent_cache(category, func.__name__, sig, data)
            return data
        return func_wrapper
    return deco_recent_cache

def _mk_cache_sig(*args, **kargs):
    '''
    Generate a unique signature from the call arguments.
    '''
    src_data = repr(args) + repr(kargs)
    return hashlib.md5(src_data.encode('utf-8')).hexdigest()

def _get_recent_cache(category, func_name, sig):
    full_file_path = os.path.join(category, func_name, sig)
    if os.path.isfile(full_file_path):
        with open(full_file_path, 'r') as f:
            # The cache stores repr(data); literal_eval parses it back safely.
            return literal_eval(f.read())
    return None

def _set_recent_cache(category, func_name, sig, data):
    full_dir_path = os.path.join(category, func_name)
    if not os.path.isdir(full_dir_path):
        os.makedirs(full_dir_path)
    full_file_path = os.path.join(full_dir_path, sig)
    with open(full_file_path, 'w') as f:
        f.write(repr(data))
```
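A note on the design: _mk_cache_sig hashes repr(args) + repr(kargs), so two calls with identical arguments map to the same cache file, and md5 serves only as a filename-safe fingerprint, not for security. Because the cache stores repr(data) and reads it back with ast.literal_eval, the cached values should be plain Python literals (dicts, lists, strings, numbers).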
Then we just need to add the deco_args_recent_cache decorator to each of func_top, func_sub, and func_bottom, and that's it~~
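For example (a sketch reusing the function names from above; the bodies are still the placeholders from the first listing):

```python
@deco_args_recent_cache()
def func_top(url):
    ...

@deco_args_recent_cache()
def func_sub(url):
    ...

@deco_args_recent_cache()
def func_bottom(url):
    ...
```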
Done! The biggest advantage of this approach is that every layer (top, sub, and bottom) dumps its own data, so once a sub layer's result has been cached, the corresponding bottom pages are never visited again, which saves a lot of overhead!
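To make that concrete, here is an illustrative re-run (the URL is hypothetical): the second call finds the cached file and returns immediately, without touching any bottom page:

```python
# First run: fetches the sub page and all its bottom pages,
# then writes dumps/func_sub/<md5-of-args>.
d1 = func_sub('http://example.com/sub/1')

# Second run (e.g. after a crash): the decorator finds the cached file
# and returns it directly, so func_bottom is never called.
d2 = func_sub('http://example.com/sub/1')
assert d1 == d2
```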
OK, that's it~ Life is short, I use Python!