Downloading primary and secondary school e-textbooks from People's Education Press with C#

Yesterday I read in the news that People's Education Press had opened up the electronic versions of its primary and secondary school textbooks (download address: http://bp.pep.com.cn/jc/ ), and I wanted to download them all for my son to use. Downloading them by hand is tedious and dull, so I decided to write a crawler to save time. I originally planned to use Python, but I realized I hadn't used C# for a long time, so I wrote it in C#.

The approach and implementation steps are as follows.

1. Analyze the structure of the relevant web pages and the links between them to work out how to get the address of each e-book.

There are two kinds of page. The first is the catalog page, which groups the textbooks into large categories (such as primary school and junior middle school) and lists the subjects under each category. The second is the download page for a single subject, which lists the e-books for each grade in that subject.

Based on these two pages, I decided to first collect from the catalog page the address of every subject page under every category, then go through each subject page and extract the address of each e-book from its content, and finally download the e-books of each subject asynchronously and in parallel.

2. Getting an e-book address out of an HTML page takes two libraries: one for fetching pages and downloading files, and one for parsing the HTML structure. I chose WebClient and HtmlAgilityPack.

3. Following the idea in step 1, first analyze the HTML structure of the catalog page, then use the libraries chosen in step 2 to obtain the categories and the addresses of the subject pages under them. The result is stored in a Dictionary<string, List<string>>, where the key is the category name (elementary school, junior middle school, high school) and the List<string> holds the addresses of the subject pages under that category. The implementation is as follows:

            //Get the page addresses for each subject
            public async Task<Dictionary<string, List<string>>> GetSubjectPageUrlsAsync()
            {
                var url = BASE_URL;
                Dictionary<string, List<string>> bookUrls = new Dictionary<string, List<string>>();

                var categoryXpath = "//*[@id=\"container\"]/div[@class=\"list_sjzl_jcdzs2020\"]";

                //Download the HTML of the catalog page
                using WebClient webClient = new WebClient();
                var content = await webClient.DownloadStringTaskAsync(url);

                //Load the HTML into an HtmlDocument so it can be queried
                HtmlDocument htmlDocument = new HtmlDocument();
                htmlDocument.LoadHtml(content);

                //Select the node of each top-level category
                HtmlNodeCollection booksListEle = htmlDocument.DocumentNode.SelectNodes(categoryXpath);

                if (booksListEle != null)
                {
                    foreach (var item in booksListEle)
                    {
                        //Get the category name, e.g. primary school or junior middle school
                        var titleNode = item.SelectSingleNode(".//div[@class=\"container_title_jcdzs2020\"]");
                        string title = titleNode?.InnerText ?? string.Empty;

                        //Get the addresses of the subject pages under this category
                        HtmlNodeCollection urlsNodes = item.SelectNodes(".//a");
                        if (urlsNodes?.Count > 0)
                        {
                            var list = new List<string>();
                            foreach (HtmlNode urlItem in urlsNodes)
                            {
                                //The code assumes relative links of the form "./..."; skip anchors with no usable href
                                var href = urlItem.GetAttributeValue("href", string.Empty);
                                if (href.Length <= 2) continue;
                                var fullUrl = url + href.Substring(2);
                                list.Add(fullUrl);
                            }

                            if (!string.IsNullOrEmpty(title) && list.Count > 0)
                            {
                                bookUrls.Add(title, list);
                            }
                        }
                    }
                }
                return bookUrls;
            }
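
The listing above builds each subject address by cutting the first two characters off the href and appending the rest to the base address, which only works when every link starts with "./". As a possible alternative, the sketch below lets System.Uri resolve the relative link. The class name LinkResolver and the sample hrefs are made up for illustration and are not taken from the site or from the original project.

```csharp
using System;

public static class LinkResolver
{
    // Resolves an href found on a page against that page's own address.
    // Returns null when either value cannot be turned into an absolute URI.
    public static Uri Resolve(string pageUrl, string href)
    {
        if (string.IsNullOrWhiteSpace(href)) return null;
        if (!Uri.TryCreate(pageUrl, UriKind.Absolute, out Uri baseUri)) return null;
        return Uri.TryCreate(baseUri, href, out Uri absolute) ? absolute : null;
    }

    public static void Main()
    {
        // Hypothetical hrefs, used only to show how resolution behaves.
        const string page = "http://bp.pep.com.cn/jc/";
        Console.WriteLine(Resolve(page, "./example-subject/")); // http://bp.pep.com.cn/jc/example-subject/
        Console.WriteLine(Resolve(page, "../other/book.pdf"));  // http://bp.pep.com.cn/other/book.pdf
    }
}
```

With this approach, links written as "./x", "x", "../x" or as a full address are all handled the same way, at the cost of one extra null check at the call site.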

 

4. Iterate over the results from step 3 and extract the e-book addresses from the content of each subject page. The code is as follows:

            //Get e-book addresses on subject pages
            private async Task<(string Subject, List<(string BookName, string BookUrl)> Books)> GetSubjectBooksAsync(string url)
            {
                const string contentRootXpath = "//*[@id=\"container\"]/div[@class=\"con_list_jcdzs2020\"]";

                //Download the HTML of the subject page
                using WebClient client = new WebClient();
                string webcontent = await client.DownloadStringTaskAsync(url);

                //load html string with HtmlDocument
                HtmlDocument htmlDocument = new HtmlDocument();
                htmlDocument.LoadHtml(webcontent);

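                HtmlNode rootNode = htmlDocument.DocumentNode.SelectSingleNode(contentRootXpath);
                if (rootNode == null)
                {
                    //Unexpected page layout: return an empty result instead of throwing
                    return (string.Empty, new List<(string BookName, string BookUrl)>());
                }

                //Get the subject name with all whitespace removed
                HtmlNode titleEle = rootNode.SelectSingleNode(".//div[@class=\"con_title_jcdzs2020\"]");
                string subject = string.Concat((titleEle?.InnerText ?? string.Empty).Where(c => !char.IsWhiteSpace(c)));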

                //Get all books listed under the subject
                HtmlNodeCollection bookNodes = rootNode.SelectNodes(".//li");
                List<(string BookName, string BookUrl)> books = new List<(string BookName, string BookUrl)>();
                if (bookNodes != null && bookNodes.Count > 0)
                {
                    foreach (HtmlNode liItem in bookNodes)
                    {
                        //Book name: the <h6> text with whitespace removed, made safe for use as a file name
                        string bookName = FixFileName(string.Concat(liItem.ChildNodes["h6"].InnerText.Where(c => !char.IsWhiteSpace(c))));
                        //Book address: href of the fourth child node of the <div>; this depends on the page's exact markup
                        string bookUrl = liItem.ChildNodes["div"].ChildNodes[3].Attributes["href"].Value;

                        books.Add((bookName, bookUrl));
                    }
                }
                return (subject,books);
            }
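
The listing above calls a FixFileName helper whose body is not shown in this article. What follows is only a guess at what such a helper might do, not the project's actual implementation: it replaces characters the current platform does not allow in file names and falls back to a placeholder when nothing usable is left. The replacement character and the "untitled" fallback are assumptions.

```csharp
using System;
using System.IO;

public static class FileNames
{
    // Hypothetical stand-in for the FixFileName helper used in the article.
    public static string FixFileName(string name, char replacement = '_')
    {
        char[] invalid = Path.GetInvalidFileNameChars();
        char[] chars = (name ?? string.Empty).ToCharArray();
        for (int i = 0; i < chars.Length; i++)
        {
            // Swap every character the file system rejects for the replacement.
            if (Array.IndexOf(invalid, chars[i]) >= 0) chars[i] = replacement;
        }

        // Leading/trailing spaces and trailing dots cause trouble on Windows.
        string cleaned = new string(chars).Trim().TrimEnd('.', ' ');
        return cleaned.Length == 0 ? "untitled" : cleaned;
    }

    public static void Main()
    {
        Console.WriteLine(FixFileName("Grade 1/Volume 1")); // Grade 1_Volume 1
        Console.WriteLine(FixFileName("   "));              // untitled
    }
}
```

Note that Path.GetInvalidFileNameChars returns a different set on each operating system, so a name cleaned on Linux may still contain characters that Windows rejects.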

5. Download the e-books from the addresses obtained in step 4. The code is as follows:

            //Download all books under a single subject
            private async Task DownloadBooksAsync(string dir, string baseUrl, (string Subject, List<(string BookName, string BookUrl)> Books) books, Action<string, string> callback)
            {
                //Create a subdirectory for the subject under the specified directory
                //(CreateDirectory does nothing if the directory already exists)
                dir = Path.Combine(dir, books.Subject);
                dir = FixPath(dir);
                Directory.CreateDirectory(dir);

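                //Build the list of download tasks
                List<Task> downloadTasks = new List<Task>();
                int count = 0;
                foreach (var book in books.Books)
                {
                    //Book links are relative ("./..."): drop the leading "./" and skip anything that is not a valid address
                    if (!Uri.TryCreate(baseUrl + book.BookUrl[2..], UriKind.Absolute, out Uri bookUri))
                    {
                        continue;
                    }

                    //Skip books that have already been downloaded
                    var path = Path.Combine(dir, $"{book.BookName}.pdf");
                    var fi = new FileInfo(path);
                    if (!fi.Exists || fi.Length == 0)
                    {
                        //WebClient does not support concurrent operations, so each download gets its own instance
                        WebClient wc = new WebClient();
                        downloadTasks.Add(wc.DownloadFileTaskAsync(bookUri, path));
                        count++;
                    }
                }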

                //Wait for all downloads to finish, then report how many files were fetched.
                //Awaiting WhenAll directly lets a failed download surface as an exception instead of being swallowed by ContinueWith.
                await Task.WhenAll(downloadTasks);
                callback(books.Subject ?? string.Empty, count.ToString());
            }
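
The method above starts every download of a subject at once. If that turns out to put too much load on the server or the local connection, one common option is to cap the number of downloads in flight with a SemaphoreSlim. The sketch below is not part of the original project: the Throttle class, the RunAsync name and the limit of 3 are assumptions, and Task.Delay stands in for the real download so the example runs without network access.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class Throttle
{
    // Runs work(item) for every item, with at most maxParallel running at once.
    public static async Task RunAsync<T>(IEnumerable<T> items, int maxParallel, Func<T, Task> work)
    {
        using var gate = new SemaphoreSlim(maxParallel);
        List<Task> tasks = items.Select(async item =>
        {
            await gate.WaitAsync();           // wait for a free slot
            try { await work(item); }
            finally { gate.Release(); }       // free the slot even if work fails
        }).ToList();
        await Task.WhenAll(tasks);
    }

    public static async Task Main()
    {
        var sync = new object();
        int running = 0, peak = 0;

        // Ten simulated downloads, never more than three at a time.
        await RunAsync(Enumerable.Range(1, 10), 3, async n =>
        {
            lock (sync) { running++; peak = Math.Max(peak, running); }
            await Task.Delay(50);             // placeholder for a real download
            lock (sync) { running--; }
        });

        Console.WriteLine($"Peak concurrency: {peak}");
    }
}
```

In DownloadBooksAsync, the work delegate would be the call that downloads one book, and the list of books would be passed as items.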

 

6. At this point the core methods are complete. You can then choose whatever front end suits your needs, such as a graphical interface, a console application or a web page, and write the application logic around it. To keep things quick and simple, I chose a console application. The code for that part is not described here; if you are interested, you can download the full source from GitHub: https://github.com/topstarai/PepBookDownloader

Keywords: C# github Python network

Added by Haberdasher on Thu, 13 Feb 2020 21:44:03 +0200