Life is short, learn Python!
Python is a great language and one of the fastest-growing programming languages in the world. It has proven its usefulness again and again, both in developer roles and in cross-industry data science positions. The whole ecosystem of Python and its libraries makes it a suitable choice for users around the world, beginners and advanced users alike. One of the reasons for its success and popularity is its powerful collection of third-party libraries, which keep it dynamic and efficient.
In this article, we will look at some Python libraries for data science tasks other than the common ones such as pandas, scikit-learn, and matplotlib. Although libraries such as pandas and scikit-learn come up in most machine learning tasks, it is always worth knowing the other Python offerings in this field.
### 1. Wget

Extracting data from the web is one of the important tasks of a data scientist. Wget is a free utility for non-interactive download of files from the web. It supports the HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies. Because it is non-interactive, it can work in the background even when the user is not logged in. So the next time you want to download all the images from a website or a page, Wget can help you.
```
$ pip install wget
```
```python
import wget

url = 'http://www.futurecrew.com/skaven/song_files/mp3/razorback.mp3'
filename = wget.download(url)
# 100% [................................................] 3841532 / 3841532

filename
# 'razorback.mp3'
```

### 2. Pendulum
For those who get frustrated when dealing with dates and times in Python, Pendulum is for you. It is a Python package that simplifies datetime operations and is a drop-in replacement for Python's native datetime classes. Refer to the documentation for further study.
```
$ pip install pendulum
```
```python
import pendulum

dt_toronto = pendulum.datetime(2012, 1, 1, tz='America/Toronto')
dt_vancouver = pendulum.datetime(2012, 1, 1, tz='America/Vancouver')

print(dt_vancouver.diff(dt_toronto).in_hours())
# 3
```
### 3. Imbalanced-learn
Most classification algorithms work best when the number of samples in each class is roughly the same, that is, when the data are balanced. But most real-world datasets are imbalanced, which has a great impact on both the learning phase and the subsequent predictions of a machine learning algorithm. Fortunately, this library was created to solve exactly that problem. It is compatible with scikit-learn and is part of the scikit-learn-contrib project. Try it the next time you encounter an imbalanced dataset.
```
$ pip install -U imbalanced-learn
# or
$ conda install -c conda-forge imbalanced-learn
```
### 4. FlashText

In NLP tasks, cleaning text data often requires replacing keywords in sentences or extracting keywords from sentences. Usually this can be done with regular expressions, but it becomes cumbersome when the number of terms to search runs into the thousands. Python's FlashText module, based on the FlashText algorithm, provides a suitable alternative for such situations. The best part of FlashText is that the running time is the same regardless of the number of search terms. You can learn more here.
```
$ pip install flashtext
```
```python
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor()

# keyword_processor.add_keyword(<unclean name>, <standardised name>)
keyword_processor.add_keyword('Big Apple', 'New York')
keyword_processor.add_keyword('Bay Area')

keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')
keywords_found
# ['New York', 'Bay Area']
```
```python
keyword_processor.add_keyword('New Delhi', 'NCR region')

new_sentence = keyword_processor.replace_keywords('I love Big Apple and new delhi.')
new_sentence
# 'I love New York and NCR region.'
```
### 5. FuzzyWuzzy
The name of this library sounds strange, but FuzzyWuzzy is a very useful library when it comes to string matching. It makes it easy to compute string similarity ratios and token ratios, and it is also handy for matching records kept in different databases.
```
$ pip install fuzzywuzzy
```
```python
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# Simple matching ratio
fuzz.ratio("this is a test", "this is a test!")
# 97

# Partial matching ratio
fuzz.partial_ratio("this is a test", "this is a test!")
# 100
```
### 6. PyFlux

Time series analysis is one of the most common problems in machine learning. PyFlux is an open-source Python library built to deal with time series problems. The library has an excellent collection of modern time series models, including but not limited to ARIMA, GARCH, and VAR models. In short, PyFlux takes a probabilistic approach to time series modeling. It is worth trying.
```
$ pip install pyflux
```
The useful data science Python libraries above were carefully picked by me; they are not the common ones such as numpy and pandas. If you know other libraries that deserve a place on the list, please mention them in the comments below. And don't forget to try running them first.