Writing a Configurable Web Information Extraction Component

Introduction

A recent project needs to pull information from an old site and load it into a new system. Nobody maintains the old system and its data is scattered, but the data is presented fairly uniformly on its web pages, so the plan is to extract it by requesting the pages over HTTP and parsing them. Oddly enough, I was doing the same thing at this time two years ago. Fate is a funny thing.

Initial Idea

When collecting information, the most troublesome part is usually breaking down the different pages and extracting the data, because page designs and structures vary so much. For some pages you also have to take a detour to reach the data (AJAX, iframes, and so on). This makes data extraction the most time-consuming and painful step, because a lot of logic code is needed to connect the whole process. I remember thinking about this question in July 2015, at this time two years ago. Back then I introduced a type called CommonExtractor to solve the problem. Its general definition is as follows:

    public class CommonExtractor
    {
        public CommonExtractor(PageProcessConfig config)
        {
            PageProcessConfig = config;
        }

        protected PageProcessConfig PageProcessConfig;

        public virtual void Extract(CrawledHtmlDocument document)
        {
            if (!PageProcessConfig.IncludedUrlPattern.Any(i => Regex.IsMatch(document.FromUrl.ToString(), i)))
                return;
            var node = new WebHtmlNode { Node = document.Contnet.DocumentNode, FromUrl = document.FromUrl };
            ExtractData(node, PageProcessConfig);
        }

        protected Dictionary<string, ExtractionResult> ExtractData(WebHtmlNode node, PageProcessConfig blockConfig)
        {

            var data = new Dictionary<string, ExtractionResult>();
            foreach (var config in blockConfig.DataExtractionConfigs)
            {
                if (node == null)
                    continue;
                /* Prefix the XPath with '.' so it is evaluated relative to the current node */
                var selectedNodes = node.Node.SelectNodes("." + config.XPath);
                var result = new ExtractionResult(config, node.FromUrl);
                if (selectedNodes != null && selectedNodes.Any())
                {
                    foreach (var sNode in selectedNodes)
                    {
                        if (config.Attribute != null)
                            result.Fill(sNode.Attributes[config.Attribute].Value);
                        else
                            result.Fill(sNode.InnerText);
                    }
                    data[config.Key] = result;
                }
                else { data[config.Key] = null; }
            }

            if (DataExtracted != null)
            {
                var args = new DataExtractedEventArgs(data, node.FromUrl);
                DataExtracted(this, args);
            }

            return data;
        }

        public EventHandler<DataExtractedEventArgs> DataExtracted;
    }

The code is a bit messy (because Abot was used for crawling), but the intention is clear: extract useful information from an HTML document, with a configuration specifying how to extract it. The main problem with this approach is that it cannot cope with complex structures. Each special structure requires new configuration and a new process, and that new process is not very reusable.

Design

A Simple Start

To cope with the complexity of real situations, the most basic operation must be kept simple. Taking inspiration from the earlier code, what we really want from data extraction is:

- We give the program an HTML document.
- The program returns a value to us.

This gives the basic interface definition:

    public interface IContentProcessor
    {
        /// <summary>
        /// Processing content
        /// </summary>
        /// <param name="source"></param>
        /// <returns></returns>
        object Process(object source);
    }

Composability

With the interface above, an implementation of IContentProcessor that is large enough can extract data from any HTML page, but the larger it grows, the less reusable and the harder to maintain it becomes. We would therefore prefer each implementation to be small. The smaller it is, though, the less it can do, so to meet complex real-world needs the implementations must be combined. That is why a new element is added to the interface: the sub-processor.

    public interface IContentProcessor
    {
        /// <summary>
        /// Processing content
        /// </summary>
        /// <param name="source"></param>
        /// <returns></returns>
        object Process(object source);

        /// <summary>
        /// Execution order of the processor; smaller values run earlier.
        /// </summary>
        int Order { get; }

        /// <summary>
        /// Subprocessor
        /// </summary>
        IList<IContentProcessor> SubProcessors { get; }
    }

In this way, individual Processors can collaborate. The order of execution is determined by the nesting relationship and the Order property. The whole process also behaves like a pipeline: the result of one Processor becomes the source of the next.
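
The base class that drives this pipeline is not listed in this article, so the following is only a minimal sketch of how nesting and Order could produce the behavior described. It assumes that sub-processors run first, in ascending Order, and that each receives the previous result as its source. PipelineProcessor, TrimProcessor, UpperCaseProcessor and PassThroughProcessor are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public interface IContentProcessor
{
    object Process(object source);
    int Order { get; }
    IList<IContentProcessor> SubProcessors { get; }
}

// Hypothetical base class: runs the sub-processors in ascending Order,
// feeding each one the previous result, then applies its own step.
public abstract class PipelineProcessor : IContentProcessor
{
    public int Order { get; set; }
    public IList<IContentProcessor> SubProcessors { get; } = new List<IContentProcessor>();

    public virtual object Process(object source)
    {
        var current = source;
        foreach (var sub in SubProcessors.OrderBy(p => p.Order))
            current = sub.Process(current);
        return ProcessElement(current);
    }

    protected abstract object ProcessElement(object element);
}

public class TrimProcessor : PipelineProcessor
{
    protected override object ProcessElement(object element) => element?.ToString().Trim();
}

public class UpperCaseProcessor : PipelineProcessor
{
    protected override object ProcessElement(object element) => element?.ToString().ToUpperInvariant();
}

public class PassThroughProcessor : PipelineProcessor
{
    protected override object ProcessElement(object element) => element;
}

public static class PipelineDemo
{
    public static object Run(string input)
    {
        var root = new PassThroughProcessor();
        // Added out of order on purpose: Order decides, not insertion order.
        root.SubProcessors.Add(new UpperCaseProcessor { Order = 2 });
        root.SubProcessors.Add(new TrimProcessor { Order = 1 });
        return root.Process(input);
    }

    public static void Main()
    {
        Console.WriteLine(Run("  hello  ")); // prints "HELLO"
    }
}
```

Keeping the ordering inside one base class means each concrete Processor only has to implement its own small step.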

Composing Results

Although the processing steps are now composable, the results are not yet, so complex output structures still cannot be handled. To solve this, IContentCollector is introduced. It inherits from IContentProcessor and adds one requirement:

    public interface IContentCollector : IContentProcessor
    {
        /// <summary>
        /// Key corresponding to the value collected by the data collector
        /// </summary>
        string Key { get; }
    }

The interface requires a Key to identify each result. This lets us hold complex structures in a Dictionary<string, object>. Because a dictionary entry can itself be a Dictionary<string, object>, serializing the result as JSON makes it easy to deserialize it into complex classes.
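
As a rough illustration of that round trip, the sketch below serializes a nested dictionary and deserializes it into a typed class. It uses System.Text.Json from the framework rather than the Json.NET JsonConvert call that appears later in this article, purely to stay self-contained. The Article and Author classes and the sample values are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical target model for the collected data.
public class Author
{
    public string Name { get; set; }
}

public class Article
{
    public string Title { get; set; }
    public Author Author { get; set; }
}

public static class NestedResultDemo
{
    // Serializes the collected dictionary to JSON, then reads it back as a typed object.
    public static Article ToArticle(Dictionary<string, object> collected)
    {
        var json = JsonSerializer.Serialize(collected);
        return JsonSerializer.Deserialize<Article>(json);
    }

    public static void Main()
    {
        var collected = new Dictionary<string, object>
        {
            ["Title"] = "Hello",
            // A nested dictionary becomes a nested JSON object.
            ["Author"] = new Dictionary<string, object> { ["Name"] = "Alice" }
        };

        var article = ToArticle(collected);
        Console.WriteLine($"{article.Title} by {article.Author.Name}"); // prints "Hello by Alice"
    }
}
```

The dictionary keys have to match the property names of the target class for this to work without extra serializer options.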

As for why this interface inherits from IContentProcessor: it keeps the node type consistent, which makes it easier to build the entire processing flow from configuration.

Configuration

From the design above, the whole process is really a tree with a very regular structure. That makes configuration feasible. A ContentProcessorOptions type represents the type of each Processor node and the information needed to initialize it. The definition is as follows:

    public class ContentProcessorOptions
    {
        /// <summary>
        /// Property values used to construct the Processor
        /// </summary>
        public Dictionary<string, object> Properties { get; set; } = new Dictionary<string, object>();

        /// <summary>
        /// Processor Type Information
        /// </summary>
        public string ProcessorType { get; set; }

        /// <summary>
        /// Specify a sub-Processor for quick initialization of Children to reduce nesting.
        /// </summary>
        public string SubProcessorType { get; set; }

        /// <summary>
        /// Subitem configuration
        /// </summary>
        public List<ContentProcessorOptions> Children { get; set; } = new List<ContentProcessorOptions>();
    }

The SubProcessorType property in the options quickly initializes a Collector that has only one sub-processing node, which reduces nesting and keeps the configuration file clearer. The following method shows how to build a Processor from a ContentProcessorOptions. Reflection is used here, but because initialization happens rarely, it should not cause much of a problem.

        public static IContentProcessor BuildContentProcessor(ContentProcessorOptions contentProcessorOptions)
        {
            Type instanceType = null;
            try
            {
                instanceType = Type.GetType(contentProcessorOptions.ProcessorType, true);
            }
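            catch
            {
                // Without an assembly-qualified name, Type.GetType only searches the calling
                // assembly and the core library, so fall back to scanning all loaded assemblies.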
                foreach (var assembly in AppDomain.CurrentDomain.GetAssemblies())
                {
                    if (assembly.IsDynamic) continue;
                    instanceType = assembly.GetExportedTypes()
                        .FirstOrDefault(i => i.FullName == contentProcessorOptions.ProcessorType);
                    if (instanceType != null) break;
                }
            }

            if (instanceType == null) return null;

            var instance = Activator.CreateInstance(instanceType);
            foreach (var property in contentProcessorOptions.Properties)
            {
                var instanceProperty = instance.GetType().GetProperty(property.Key);
                if (instanceProperty == null) continue;
                var propertyType = instanceProperty.PropertyType;
                var sourceValue = property.Value.ToString();
                var dValue = sourceValue.Convert(propertyType);
                instanceProperty.SetValue(instance, dValue);
            }
            var processorInstance = (IContentProcessor) instance;
            // Shortcut: build the single sub-processor from the same Properties as its parent.
            if (!contentProcessorOptions.SubProcessorType.IsNullOrWhiteSpace())
            {
                var quickOptions = new ContentProcessorOptions
                {
                    ProcessorType = contentProcessorOptions.SubProcessorType,
                    Properties = contentProcessorOptions.Properties
                };
                var quickProcessor = BuildContentProcessor(quickOptions);
                processorInstance.SubProcessors.Add(quickProcessor);
            }
            foreach (var processorOption in contentProcessorOptions.Children)
            {
                var processor = BuildContentProcessor(processorOption);
                processorInstance.SubProcessors.Add(processor);
            }
            return processorInstance;
        }
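
The Convert extension method called above is not listed in this article. As a minimal sketch of what such a string-to-type conversion might look like, the code below uses Convert.ChangeType and Enum.Parse. ConvertTo, SampleProcessor, SampleValueKind and PropertyBinderDemo are hypothetical names, and the real helper may well behave differently.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class StringConvertExtensions
{
    // Hypothetical stand-in for the Convert extension: turns a configuration
    // string into the type of the property it is assigned to.
    public static object ConvertTo(this string value, Type targetType)
    {
        // Unwrap Nullable<T> so that "3" can be assigned to an int? property.
        var underlying = Nullable.GetUnderlyingType(targetType) ?? targetType;
        if (underlying.IsEnum)
            return Enum.Parse(underlying, value, ignoreCase: true);
        return System.Convert.ChangeType(value, underlying, CultureInfo.InvariantCulture);
    }
}

public enum SampleValueKind { InnerText, OuterHtml, Attribute }

public class SampleProcessor
{
    public string Xpath { get; set; }
    public int Order { get; set; }
    public int? NodeIndex { get; set; }
    public SampleValueKind Kind { get; set; }
}

public static class PropertyBinderDemo
{
    // Sets every matching writable property; unknown keys are skipped.
    public static T Build<T>(IDictionary<string, string> properties) where T : new()
    {
        var instance = new T();
        foreach (var pair in properties)
        {
            var property = typeof(T).GetProperty(pair.Key);
            if (property == null || !property.CanWrite) continue;
            property.SetValue(instance, pair.Value.ConvertTo(property.PropertyType));
        }
        return instance;
    }

    public static void Main()
    {
        var processor = Build<SampleProcessor>(new Dictionary<string, string>
        {
            ["Xpath"] = "//title",
            ["Order"] = "2",
            ["NodeIndex"] = "0",
            ["Kind"] = "outerhtml",
            ["Unknown"] = "ignored"
        });
        // prints "//title 2 0 OuterHtml"
        Console.WriteLine($"{processor.Xpath} {processor.Order} {processor.NodeIndex} {processor.Kind}");
    }
}
```

Converting from strings fits the configuration system, where every leaf value arrives as text.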

Some Constraints

Collections Must Converge

An example illustrates the problem. Suppose n p tags are extracted from an HTML document: a string[] is returned and passed as the source to the next processing node, which handles each string in turn. If that node also returns a string[] for each string, those arrays must be joined into a single string with a connector. Otherwise the result would become an array of two, three, or even more dimensions, and the logic of every node would become complex and uncontrollable. So collections must always converge to one dimension.
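
The convergence rule can be sketched as follows. This is an illustration under stated assumptions, not the component's actual code: ProcessEach is a hypothetical helper that applies a step to every element and, whenever the step returns a collection, joins it with the connector so the output stays one-dimensional.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

public static class ConvergeDemo
{
    // Hypothetical helper: applies step to every element. If a step returns a
    // collection, its items are joined with the connector into one string.
    public static List<string> ProcessEach(
        IEnumerable<string> source, Func<string, object> step, string connector)
    {
        var output = new List<string>();
        foreach (var element in source)
        {
            var result = step(element);
            if (result is string text)
                output.Add(text);                 // string is checked first: it is also IEnumerable
            else if (result is IEnumerable items)
                output.Add(string.Join(connector, items.Cast<object>()));
            else if (result != null)
                output.Add(result.ToString());
        }
        return output;
    }

    public static void Main()
    {
        var paragraphs = new[] { "a b", "c d e" };
        // Splitting yields a string[] per paragraph; the helper flattens each one.
        var words = ProcessEach(paragraphs, p => p.Split(' '), ",");
        Console.WriteLine(string.Join(" | ", words)); // prints "a,b | c,d,e"
    }
}
```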

Properties in Configuration Files Do Not Support Complex Structures

With the .NET Core configuration system currently in use, an entry in a Dictionary<string, object> cannot be bound to a collection.

Some Implementations

Implementation and Testing of Processor

HttpRequestContentProcessor

This processor downloads a piece of HTML text from the network and passes the text as the source to the next processor. It can either request a fixed URL given by its Url property or treat the source passed from the previous node as the URL to request. The implementation is as follows:

    public class HttpRequestContentProcessor : BaseContentProcessor
    {
        public bool UseUrlWhenSourceIsNull { get; set; } = true;

        public string Url { get; set; }

        public bool IgnoreBadUri { get; set; }

        protected override object ProcessElement(object element)
        {
            if (element == null) return null;
            // Reject anything that is not a well-formed absolute URL.
            if (!Uri.IsWellFormedUriString(element.ToString(), UriKind.Absolute))
            {
                if (IgnoreBadUri) return null;
                throw new FormatException($"Request URL '{element}' is not well formed");
            }
            return DownloadHtml(element.ToString());
        }

        public override object Process(object source)
        {
            if (source == null && UseUrlWhenSourceIsNull && !Url.IsNullOrWhiteSpace())
                return DownloadHtml(Url);
            return base.Process(source);
        }

        private static async Task<string> DownloadHtmlAsync(string url)
        {
            using (var client = new HttpClient())
            {
                var result = await client.GetAsync(url);
                var html = await result.Content.ReadAsStringAsync();
                return html;
            }
        }

        private string DownloadHtml(string url)
        {
            return AsyncHelper.Synchronize(() => DownloadHtmlAsync(url));
        }
    }

The tests are as follows:

        [TestMethod]
        public void HttpRequestContentProcessorTest()
        {
            var processor = new HttpRequestContentProcessor {Url = "https://www.baidu.com"};
            var result = processor.Process(null);
            Assert.IsTrue(result.ToString().Contains("baidu"));
        }

XpathContentProcessor

This processor extracts the specified information by evaluating an XPath expression. ValueProviderType and ValueProviderKey specify how to read the data from each matched node. The implementation is as follows:

    public class XpathContentProcessor : BaseContentProcessor
    {
        /// <summary>
        /// XPath of the elements to select
        /// </summary>
        public string Xpath { get; set; }

        /// <summary>
        /// Key passed to the value provider (for example, the attribute name)
        /// </summary>
        public string ValueProviderKey { get; set; }

        /// <summary>
        /// Type of the value provider
        /// </summary>
        public XpathNodeValueProviderType ValueProviderType { get; set; }

        /// <summary>
        /// Index of the node to take
        /// </summary>
        public int? NodeIndex { get; set; }

        /// <summary>
        /// Separator used to join the values of multiple nodes into one string
        /// </summary>
        public string ResultConnector { get; set; } = Constants.DefaultResultConnector;

        public override object Process(object source)
        {
            var result = base.Process(source);
            return DeterminAndReturn(result);
        }

        protected override object ProcessElement(object element)
        {
            var result = base.ProcessElement(element);
            if (result == null) return null;

            var str = result.ToString();
            
            return ProcessWithXpath(str, Xpath, false);
        }

        protected object ProcessWithXpath(string documentText, string xpath, bool returnArray)
        {
            if (documentText == null) return null;

            var document = new HtmlDocument();
            document.LoadHtml(documentText);
            var nodes = document.DocumentNode.SelectNodes(xpath);

            if (nodes == null)
                return null;

            if (returnArray && nodes.Count > 1)
            {
                var result = new List<string>();
                foreach (var node in nodes)
                {
                    var nodeResult = Helper.GetValueFromHtmlNode(node, ValueProviderType, ValueProviderKey);
                    if (!nodeResult.IsNullOrWhiteSpace())
                    {
                        result.Add(nodeResult);
                    }
                }
                return result;
            }
            else
            {
                var result = string.Empty;
                foreach (var node in nodes)
                {
                    var nodeResult = Helper.GetValueFromHtmlNode(node, ValueProviderType, ValueProviderKey);
                    if (!nodeResult.IsNullOrWhiteSpace())
                    {
                        if (result.IsNullOrWhiteSpace()) result = nodeResult;
                        else result = $"{result}{ResultConnector}{nodeResult}";
                    }
                }
                return result;
            }
        }
    }

Combining this Processor with the previous one, let's grab the title of Baidu's home page:

        [TestMethod]
        public void XpathContentProcessorTest()
        {
            var xpathProcessor = new XpathContentProcessor
            {
                Xpath = "//title",
                ValueProviderType = XpathNodeValueProviderType.InnerText
            };
            var processor = new HttpRequestContentProcessor { Url = "https://www.baidu.com" };
            xpathProcessor.SubProcessors.Add(processor);

            var result = xpathProcessor.Process(null);
            Assert.AreEqual("百度一下，你就知道", result.ToString());
        }

Implementation and Testing of Collector

The main role of a Collector is to handle complex output models. A Collector implementation for a complex data structure is as follows:

    public class ComplexContentCollector : BaseContentCollector
    {
        /// <summary>
        /// ComplexContentCollector needs its sub-collectors to supply a Key, so plain Processors among its children are ignored
        /// </summary>
        /// <param name="source"></param>
        /// <returns></returns>
        protected override object ProcessElement(object source)
        {
            var result = new Dictionary<string, object>();

            foreach (var contentCollector in SubProcessors.OfType<IContentCollector>())
            {
                result[contentCollector.Key] = contentCollector.Process(source);
            }

            return result;
        }
    }

The corresponding tests are as follows:

        [TestMethod]
        public void ComplexContentCollectorTest2()
        {
            var xpathProcessor = new XpathContentProcessor
            {
                Xpath = "//title",
                ValueProviderType = XpathNodeValueProviderType.InnerText
            };

            var xpathProcessor2 = new XpathContentProcessor
            {
                Xpath = "//p[@id=\"cp\"]",
                ValueProviderType = XpathNodeValueProviderType.InnerText,
                Order = 1
            };
            var processor = new HttpRequestContentProcessor {Url = "https://www.baidu.com", Order = -1};
            var complexCollector = new ComplexContentCollector();
            var baseCollector = new BaseContentCollector();

            baseCollector.SubProcessors.Add(processor);
            baseCollector.SubProcessors.Add(complexCollector);
            
            var titleCollector = new BaseContentCollector{Key = "Title"};
            titleCollector.SubProcessors.Add(xpathProcessor);
            var footerCollector = new BaseContentCollector {Key = "Footer"};
            footerCollector.SubProcessors.Add(xpathProcessor2);
            footerCollector.SubProcessors.Add(new HtmlCleanupContentProcessor{Order = 3});

            complexCollector.SubProcessors.Add(titleCollector);
            complexCollector.SubProcessors.Add(footerCollector);

            var result = (Dictionary<string,object>)baseCollector.Process(null);
            Assert.AreEqual("百度一下，你就知道", result["Title"]);
            Assert.AreEqual("©2014 Baidu Must read Beijing before using Baidu ICP Certificate No. 030173", result["Footer"]);

        }

Using Configuration to Handle Slightly More Complex Cases

Now, test with the following code:

        public void RunConfig(string section)
        {
            var builder = new ConfigurationBuilder()
                .SetBasePath(AppDomain.CurrentDomain.BaseDirectory)
                .AddJsonFile("appsettings1.json");
            var configurationRoot = builder.Build();

            var options = configurationRoot.GetSection(section).Get<ContentProcessorOptions>();
            var processor = Helper.BuildContentProcessor(options);

            var result = processor.Process(null);
            var json = JsonConvert.SerializeObject(result);
            System.Console.WriteLine(json);
        }

Grabbing the News List Titles from cnblogs

Configuration used:

"newsListOptions": {
    "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
    "Properties": {},
    "Children": [
      {
        "ProcessorType": "IC.Robot.ContentProcessor.HttpRequestContentProcessor",
        "Properties": {
          "Url": "https://www.cnblogs.com/news/",
          "Order": "0"
        }
      },
      {
        "ProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
        "Properties": {
          "Xpath": "//div[@class=\"post_item\"]",
          "Order": "1",
          "ValueProviderType": "OuterHtml",
          "OutputToArray": true
        }
      },
      {
        "ProcessorType": "IC.Robot.ContentCollector.ComplexContentCollector",
        "Properties": {
          "Order": "2"
        },
        "Children": [
          {
            "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
            "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
            "Properties": {
              "Xpath": "//a[@class=\"titlelnk\"]",
              "Key": "Url",
              "ValueProviderType": "Attribute",
              "ValueProviderKey": "href"
            }
          },
          {
            "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
            "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
            "Properties": {
              "Xpath": "//span[@class=\"article_comment\"]",
              "Key": "CommentCount",
              "ValueProviderType": "InnerText",
              "Order": "0"
            },
            "Children": [
              {
                "ProcessorType": "IC.Robot.ContentProcessor.RegexMatchContentProcessor",
                "Properties": {
                  "RegexPartten": "[0-9]+",
                  "Order": "1"
                }
              }
            ]
          },
          {
            "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
            "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
            "Properties": {
              "Xpath": "//*[@class=\"digg\"]//span",
              "Key": "LikeCount",
              "ValueProviderType": "InnerText"
            }
          },
          {
            "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
            "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
            "Properties": {
              "Xpath": "//a[@class=\"titlelnk\"]",
              "Key": "Title",
              "ValueProviderType": "InnerText"
            }
          }
        ]
      }
    ]
  },

The results obtained are as follows:

[
        {
            "Url": "//news.cnblogs.com/n/574269/",
            "CommentCount": "1",
            "LikeCount": "3",
            "Title": "Liu Qiangdong: After 13 years in Jingdong, few people really understand us."
        },
        {
            "Url": "//news.cnblogs.com/n/574267/",
            "CommentCount": "0",
            "LikeCount": "0",
            "Title": "Lenovo is also talking about AI, but its most urgent goal is to sell more. PC"
        },
        {
            "Url": "//news.cnblogs.com/n/574266/",
            "CommentCount": "0",
            "LikeCount": "0",
            "Title": "Almost all but millet 1 support - millet MIUI9 List of Upgraded Aircraft"
        },
        ...
]

Get details of the most commented news in this list

This involves computation and collection operations, and the collection elements are dictionaries, so we need to introduce two new Processors: one for filtering and one for mapping.

    public class ListItemPickContentProcessor : BaseContentProcessor
    {
        public string Key { get; set; }

        /// <summary>
        /// Full name of the type the values are converted to before comparison
        /// </summary>
        public string OperatorTypeFullName { get; set; }

        /// <summary>
        /// Value to compare against
        /// </summary>
        public string OperatorValue { get; set; }

        /// <summary>
        /// Position of the item to pick (used when PickMode is Index)
        /// </summary>
        public int Index { get; set; }

        /// <summary>
        /// Pick mode
        /// </summary>
        public ListItemPickMode PickMode { get; set; }

        /// <summary>
        /// Operator
        /// </summary>
        public ListItemPickOperator PickOperator { get; set; }

        public override object Process(object source)
        {
            var preResult = base.Process(source);

            // A single value: pick by Key from a dictionary, otherwise pass it through.
            if (!Helper.IsEnumerableExceptString(preResult))
            {
                if (preResult is Dictionary<string, object> dictionary)
                    return dictionary[Key];
                return preResult;
            }

            return Pick((IEnumerable) preResult);
        }

        private object Pick(IEnumerable source)
        {
            var objCollection = source.Cast<object>().ToList();
            if (objCollection.Count == 0)
                return objCollection;
            var item = objCollection[0];
            var compareDictionary = new Dictionary<object, IComparable>();
            if (item is IDictionary)
            {

                foreach (Dictionary<string, object> dic in objCollection)
                {
                    var key = (IComparable) dic[Key].ToString().Convert(ResolveType(OperatorTypeFullName));
                    compareDictionary.Add(dic, key);
                }
            }
            else
            {
                foreach (var objItem in objCollection)
                {
                    var key = (IComparable) objItem.ToString().Convert(ResolveType(OperatorTypeFullName));
                    compareDictionary.Add(objItem, key);
                }
            }

            IEnumerable<object> result;

            switch (PickOperator)
            {
                case ListItemPickOperator.OrderDesc:
                    result = compareDictionary.OrderByDescending(i => i.Value).Select(i => i.Key);
                    break;
                default: throw new NotSupportedException();
            }

            switch (PickMode)
            {
                case ListItemPickMode.First:
                    return result.FirstOrDefault();
                case ListItemPickMode.Last:
                    return result.LastOrDefault();
                case ListItemPickMode.Index:
                    return result.Skip(Index - 1).Take(1).FirstOrDefault();
                default:
                    throw new NotImplementedException();
            }
        }

        private Type ResolveType(string typeName)
        {
            if (typeName == typeof(Int32).FullName)
                return typeof(Int32);
            throw new NotSupportedException();
        }

        public enum ListItemPickMode
        {
            First,
            Last,
            Index
        }

        public enum ListItemPickOperator
        {
            LittleThan,
            GreaterThan,
            Order,
            OrderDesc
        }
    }

Quite a bit of reflection is used here, but performance is not a concern for now.

    public class DictionaryPickContentProcessor : BaseContentProcessor
    {
        public string Key { get; set; }

        protected override object ProcessElement(object element)
        {
            if (element is IDictionary)
            {
                return (element as IDictionary)[Key];
            }
            return element;
        }
    }

This Processor extracts the value stored under Key from a dictionary.
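
To make the combined effect of the two Processors concrete, the following sketch reproduces it with plain LINQ on data shaped like the list output shown earlier. It is not the component itself: MostCommentedDemo and PickUrlOfMostCommented are hypothetical names, and the sketch assumes CommentCount always holds an integer string.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MostCommentedDemo
{
    // Filtering step: order by CommentCount descending and take the first item.
    // Mapping step: return the value stored under "Url" in that dictionary.
    public static object PickUrlOfMostCommented(IEnumerable<Dictionary<string, object>> items)
    {
        var top = items
            .OrderByDescending(d => int.Parse(d["CommentCount"].ToString()))
            .FirstOrDefault();
        return top == null ? null : top["Url"];
    }

    public static void Main()
    {
        var items = new List<Dictionary<string, object>>
        {
            new Dictionary<string, object> { ["Url"] = "//news.cnblogs.com/n/574267/", ["CommentCount"] = "0" },
            new Dictionary<string, object> { ["Url"] = "//news.cnblogs.com/n/574269/", ["CommentCount"] = "1" },
            new Dictionary<string, object> { ["Url"] = "//news.cnblogs.com/n/574266/", ["CommentCount"] = "0" }
        };

        Console.WriteLine(PickUrlOfMostCommented(items)); // prints "//news.cnblogs.com/n/574269/"
    }
}
```

In the component, the first step corresponds to ListItemPickContentProcessor with OrderDesc and First, and the second to DictionaryPickContentProcessor with Key set to Url.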