Three ways to deduplicate (Distinct) in C# LINQ

Preface

There is no need to dwell on when C#'s built-in Distinct works: for simple value types it deduplicates just fine, but for collections of our own reference types the default implementation no longer does what we want, because it compares references rather than property values. We therefore need a custom implementation. Below we walk through several common deduplication approaches and compare their strengths and weaknesses.

First, give the objects we need to use, as follows:

public class Person
{
    public string Name { get; set; }
    public int Age { get; set; }
}

Next, we populate the list with roughly one million Person objects (only two distinct Age/Name combinations in total), as follows:

var list = new List<Person>();
for (int i = 0; i < 1000000; i++)
{
    list.Add(new Person() { Age = 18, Name = "jeffcky" });
}
for (int i = 0; i < 1000; i++)
{
    list.Add(new Person() { Age = 19, Name = "jeffcky" });
}
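Note that calling the built-in Distinct here achieves nothing: Person does not override Equals/GetHashCode, so Distinct falls back to reference equality, and every new Person(...) is a distinct reference. A quick check (a sketch, assuming the setup above):

// Person uses reference equality, so the plain Distinct keeps everything:
var naive = list.Distinct().ToList();
Console.WriteLine(naive.Count); // 1001000 - nothing was removed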

Method 1: GroupBy deduplication

Group by Age and Name, then take the first element of each group; the result is the deduplicated list:

var list1 = list.GroupBy(d => new { d.Age, d.Name })
    .Select(d => d.FirstOrDefault())
    .ToList();
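This works because the compiler generates member-wise Equals and GetHashCode for anonymous types, so two keys with equal Age and Name compare equal. A small illustration:

// Anonymous types with the same properties (names, types, order) share one
// compiler-generated type with value equality:
var k1 = new { Age = 18, Name = "jeffcky" };
var k2 = new { Age = 18, Name = "jeffcky" };
Console.WriteLine(k1.Equals(k2));                        // True
Console.WriteLine(k1.GetHashCode() == k2.GetHashCode()); // True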

Method 2: HashSet deduplication (extension method)

In C#, a HashSet<T> rejects duplicate elements: Add returns false when an equal element is already present. So we can write the following extension method (it must live in a static class) that iterates over the source elements and lets the HashSet filter out keys it has already seen:

public static class EnumerableExtensions
{
    public static IEnumerable<TSource> Distinct<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector)
    {
        var hashSet = new HashSet<TKey>();

        foreach (TSource element in source)
        {
            // Add returns false for a key that is already present,
            // so each key is yielded at most once.
            if (hashSet.Add(keySelector(element)))
            {
                yield return element;
            }
        }
    }
}

Calling this extension method deduplicates the list, as follows:

var list2 = list.Distinct(d => new { d.Age, d.Name }).ToList();
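Incidentally, if you target .NET 6 or later, LINQ ships a built-in DistinctBy that performs the same key-based filtering, so the extension method above is only needed on older frameworks:

// .NET 6+ equivalent of the extension method above:
var list2 = list.DistinctBy(d => new { d.Age, d.Name }).ToList();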

Method 3: IEqualityComparer deduplication (extension method)

In real projects, the usual approach is to write a concrete class per type that implements IEqualityComparer<T>, overriding Equals and GetHashCode on the properties being compared. Because every class needs its own comparer, this approach is not reusable, and the HashSet method above beats it on that front. We can, however, turn the comparer interface into a general-purpose solution.

The reason each class needs its own comparer is that the property comparison lives inside that class's implementation. If we move the property comparison outside, supplied as a delegate, we get a generic solution. How? The framework first compares hash codes and only calls Equals when the hash codes match. If we force GetHashCode to return a constant (such as 0), every comparison falls through to Equals, and inside Equals we simply invoke the delegate, as follows:

public static class Extensions
{
    public static IEnumerable<T> Distinct<T>(
        this IEnumerable<T> source, Func<T, T, bool> comparer)
        where T : class
        => source.Distinct(new DynamicEqualityComparer<T>(comparer));

    private sealed class DynamicEqualityComparer<T> : IEqualityComparer<T>
        where T : class
    {
        private readonly Func<T, T, bool> _func;

        public DynamicEqualityComparer(Func<T, T, bool> func)
        {
            _func = func;
        }

        public bool Equals(T x, T y) => _func(x, y);

        // A constant hash code puts every element in one bucket, so
        // Distinct always falls through to Equals (i.e. the delegate).
        public int GetHashCode(T obj) => 0;
    }
}

Finally, duplicates can be removed by comparing the specified properties, as follows:

var list3 = list.Distinct((a, b) => a.Age == b.Age && a.Name == b.Name).ToList();
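The constant hash code is what keeps this comparer generic, but it costs performance when the data contains many distinct values: every element lands in the same bucket, so Distinct degenerates into pairwise Equals calls. A possible refinement (a sketch, not part of the original article; the name DelegateEqualityComparer and the hasher parameter are my own) is to accept a second delegate that supplies a real hash code:

public static class HashedExtensions
{
    // Variant that also takes a hash selector, so elements spread across
    // buckets instead of all colliding on a constant hash code.
    public static IEnumerable<T> Distinct<T>(
        this IEnumerable<T> source,
        Func<T, T, bool> comparer,
        Func<T, int> hasher)
        where T : class
        => source.Distinct(new DelegateEqualityComparer<T>(comparer, hasher));

    private sealed class DelegateEqualityComparer<T> : IEqualityComparer<T>
        where T : class
    {
        private readonly Func<T, T, bool> _equals;
        private readonly Func<T, int> _hash;

        public DelegateEqualityComparer(Func<T, T, bool> equals, Func<T, int> hash)
        {
            _equals = equals;
            _hash = hash;
        }

        public bool Equals(T x, T y) => _equals(x, y);

        public int GetHashCode(T obj) => _hash(obj);
    }
}

A call would then look like:

var list4 = list.Distinct(
    (a, b) => a.Age == b.Age && a.Name == b.Name,
    p => HashCode.Combine(p.Age, p.Name)).ToList();

HashCode.Combine is available from .NET Core 2.1 onward; on older frameworks, any hand-rolled combination of the property hash codes works.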

Performance comparison

We have now covered three common approaches. With a small data set, the overhead of any of these operations is negligible; once the data set is large, performance matters and can save real time. So, using the one million records from above, let's measure how long each approach takes:

using System.Diagnostics;

var list = new List<Person>();
for (int i = 0; i < 1000000; i++)
{
    list.Add(new Person() { Age = 18, Name = "jeffcky" });
}

var time1 = Time(() =>
{
    list.GroupBy(d => new { d.Age, d.Name })
        .Select(d => d.FirstOrDefault())
        .ToList();
});
Console.WriteLine($"Grouping time:{time1}");

var time2 = Time(() =>
{
    list.Distinct(d => new { d.Age, d.Name }).ToList();
});
Console.WriteLine($"HashSet Time consuming:{time2}");

var time3 = Time(() =>
{
    list.Distinct((a, b) => a.Age == b.Age && a.Name == b.Name).ToList();
});
Console.WriteLine($"Entrusted time:{time3}");


static long Time(Action action)
{
    var stopwatch = new Stopwatch();
    stopwatch.Start();
    action();
    stopwatch.Stop();
    return stopwatch.ElapsedMilliseconds;
}
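Note that this test data contains essentially one distinct value, which flatters the delegate-based version. To see the cost of the constant hash code, here is a rough stress test (my own addition, not from the original article) on data with many distinct keys:

// Many distinct keys: the constant-hash comparer must call Equals against
// every previously kept element, so this grows much faster than the others.
var varied = new List<Person>();
for (int i = 0; i < 100000; i++)
{
    varied.Add(new Person() { Age = i % 100, Name = "p" + (i % 1000) });
}

var time4 = Time(() =>
{
    varied.Distinct((a, b) => a.Age == b.Age && a.Name == b.Name).ToList();
});
Console.WriteLine($"Delegate time, varied keys:{time4}");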

Reference: https://www.cnblogs.com/CreateMyself/p/12863407.html
