Downloading primary and secondary school electronic textbooks from the People's Education Press with C#

Yesterday, while reading the news, I saw that the electronic versions of the primary and secondary school textbooks had been made freely available (download address: ), and I wanted to download them all for my son to use. But downloading them manually is cumbersome and tedious, so I wrote a crawler to save time. I originally planned to use Python, but then realized I hadn't used C# in a long time, so I wrote it in C# instead.

The specific ideas and implementation steps are as follows:

1. Analyze the structure and link navigation of the relevant web pages to figure out how to obtain the address of each e-book.

First, there are two main kinds of pages. The first is the category catalog page, which divides the textbooks into primary school, junior middle school, and senior high school; under each large category are subject subcategories. The second is the download details page for the e-books of each grade under each subject.

Based on these two pages, I decided to first get the addresses of all the categories and of each subject page under each category from the first page, then iterate over the content of each subject page to extract the address of each e-book, and finally download the e-books under each subject asynchronously and in parallel.

2. To extract e-book addresses from an html page, two class libraries are needed: one for accessing web pages and downloading over the network, and the other for parsing the html structure. Here I chose WebClient and HtmlAgilityPack.
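As a quick illustration of how HtmlAgilityPack is used below, here is a minimal, self-contained sketch; the HTML fragment and class name are made up for the example, and HtmlAgilityPack is available via NuGet:

```csharp
using System;
using HtmlAgilityPack;

class XPathDemo
{
    static void Main()
    {
        //A made-up HTML fragment standing in for a downloaded page
        string html = "<div id=\"container\"><a href=\"./books/math.html\">Math</a></div>";

        //Load the string into an HtmlDocument and query it with XPath
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var link = doc.DocumentNode.SelectSingleNode("//*[@id=\"container\"]//a");

        Console.WriteLine(link.InnerText);                // Math
        Console.WriteLine(link.Attributes["href"].Value); // ./books/math.html
    }
}
```

In the real crawler, the HTML string comes from WebClient.DownloadStringTaskAsync instead of a literal, and the XPath expressions match the actual page structure.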

3. Following the idea in step 1, first analyze the html structure of the category catalog page, and use the libraries chosen in step 2 to obtain the addresses of the categories and of the subject pages under them. The result is stored in a Dictionary&lt;string, List&lt;string&gt;&gt;, where the key is the category name (elementary school, junior middle school, or high school) and the List&lt;string&gt; holds the addresses of the subject pages under that category. The implementation code is as follows:

            //Get the page addresses for each subject
            public async Task<Dictionary<string, List<string>>> GetSubjectPageUrlsAsync()
            {
                var url = BASE_URL;
                Dictionary<string, List<string>> bookUrls = new Dictionary<string, List<string>>();

                var categoryXpath = "//*[@id=\"container\"]/div[@class=\"list_sjzl_jcdzs2020\"]";

                //Get the html page content
                WebClient webClient = new WebClient();
                var content = await webClient.DownloadStringTaskAsync(url);

                //Load the html content into an HtmlDocument so it can be processed
                HtmlDocument htmlDocument = new HtmlDocument();
                htmlDocument.LoadHtml(content);

                //Get the set of nodes matching the specified XPath
                HtmlNodeCollection booksListEle = htmlDocument.DocumentNode.SelectNodes(categoryXpath);

                if (booksListEle != null)
                {
                    foreach (var item in booksListEle)
                    {
                        //Get the category name, such as primary school, middle school, and so on
                        string title = string.Empty;
                        var titleNode = item.SelectSingleNode(".//div[@class=\"container_title_jcdzs2020\"]");
                        if (titleNode != null)
                            title = titleNode.InnerText;

                        //Get the addresses of the subject pages under this category
                        HtmlNodeCollection urlsNodes = item.SelectNodes(".//a");
                        if (urlsNodes?.Count > 0)
                        {
                            var list = new List<string>();
                            foreach (HtmlNode urlItem in urlsNodes)
                            {
                                var fullUrl = url + urlItem.Attributes["href"].Value.Substring(2);
                                list.Add(fullUrl);
                            }

                            if (!string.IsNullOrEmpty(title) && list.Count > 0)
                                bookUrls.Add(title, list);
                        }
                    }
                }
                return bookUrls;
            }


4. Iterate over the results of step 3 and extract the e-book addresses from the content of each subject page. The code is as follows:

            //Get e-book addresses on a subject page
            private async Task<(string Subject, List<(string BookName, string BookUrl)> Books)> GetSubjectBooksAsync(string url)
            {
                const string contentRootXpath = "//*[@id=\"container\"]/div[@class=\"con_list_jcdzs2020\"]";

                //Get html content
                WebClient client = new WebClient();
                string webcontent = await client.DownloadStringTaskAsync(url);

                //Load the html string into an HtmlDocument
                HtmlDocument htmlDocument = new HtmlDocument();
                htmlDocument.LoadHtml(webcontent);

                HtmlNode rootNode = htmlDocument.DocumentNode.SelectSingleNode(contentRootXpath);

                //Get the subject (discipline) name
                HtmlNode titleEle = rootNode.SelectSingleNode(".//div[@class=\"con_title_jcdzs2020\"]");
                string subject = string.Concat(titleEle?.InnerText.Where(c => !char.IsWhiteSpace(c)));

                //Get the list of all books under the subject
                HtmlNodeCollection bookNodes = rootNode.SelectNodes(".//li");
                List<(string BookName, string BookUrl)> books = new List<(string BookName, string BookUrl)>();
                if (bookNodes != null && bookNodes.Count > 0)
                {
                    foreach (HtmlNode liItem in bookNodes)
                    {
                        //Get the book's name
                        string bookName = FixFileName(string.Concat(liItem.ChildNodes["h6"].InnerText.Where(c => !char.IsWhiteSpace(c))));
                        //Get the url of the e-book
                        string bookUrl = liItem.ChildNodes["div"].ChildNodes[3].Attributes["href"].Value;

                        books.Add((bookName, bookUrl));
                    }
                }
                return (subject, books);
            }

5. Download the e-books from the addresses obtained in step 4. The code is as follows:

            //Download all books under a single subject
            private async Task DownloadBooksAsync(string dir, string baseUrl, (string Subject, List<(string BookName, string BookUrl)> Books) books, Action<string, string> callback)
            {
                //Create the subject subdirectory under the specified directory
                dir = Path.Combine(dir, books.Subject);
                dir = FixPath(dir);
                if (!Directory.Exists(dir))
                    Directory.CreateDirectory(dir);

                //Build the download task list
                List<Task> downloadTasks = new List<Task>();
                int count = 0;
                foreach (var book in books.Books)
                {
                    WebClient wc = new WebClient();
                    Uri.TryCreate(baseUrl + book.BookUrl[2..], UriKind.Absolute, out Uri bookUri);
                    var path = Path.Combine(dir, @$"{book.BookName}.pdf");
                    var fi = new FileInfo(path);
                    //Skip books that have already been downloaded
                    if (!fi.Exists || fi.Length == 0)
                    {
                        var task = wc.DownloadFileTaskAsync(bookUri, path);
                        downloadTasks.Add(task);
                        count++;
                    }
                }

                //Wait for all download tasks to complete, then execute the callback
                await Task.WhenAll(downloadTasks).ContinueWith((task) => { callback(books.Subject ?? string.Empty, count.ToString()); });
            }
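The methods above call two small helpers, FixFileName and FixPath, whose implementations are not shown in the snippets (they are in the github repository). A minimal sketch of what they might look like, assuming they simply strip characters that are illegal in file and path names:

```csharp
using System.IO;
using System.Linq;

static class PathHelpers
{
    //Remove characters that are not allowed in a file name (assumed behavior)
    public static string FixFileName(string fileName)
    {
        var invalid = Path.GetInvalidFileNameChars();
        return string.Concat(fileName.Where(c => !invalid.Contains(c)));
    }

    //Remove characters that are not allowed in a directory path (assumed behavior)
    public static string FixPath(string path)
    {
        var invalid = Path.GetInvalidPathChars();
        return string.Concat(path.Where(c => !invalid.Contains(c)));
    }
}
```

Note that Path.GetInvalidFileNameChars and Path.GetInvalidPathChars return different sets on Windows and Linux, so the cleaned names are platform-dependent.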


6. At this point, the core methods are complete. You can then choose an implementation that suits your interaction needs, such as a graphical interface, a console, or a web page, and write the application logic on top of it. For speed and simplicity, I chose the console. The specific code is not described here; if you are interested, you can download the full code from github. The github address is:


Added by Haberdasher on Thu, 13 Feb 2020 21:44:03 +0200