[Java required course] performance comparison of contains method between ArrayList and HashSet (JMH performance test)

1 Introduction

In daily development, ArrayList and HashSet are very common collection classes in Java.

  • ArrayList is the most commonly used implementation class of List interface.
  • HashSet is the implementation that holds the unique element Set.

This paper mainly makes a simple discussion on the common method contains(), mainly the performance comparison, and uses JMH(ava Microbenchmark Harness) to test and compare.

2. See JMH test results first

We use a Micro Benchmark Framework developed by the big bulls who have developed Java compilers in OpenJDK/Oracle to test. The following is a brief demonstration of the use process.

2.1 Maven import dependency

To import JMH related dependencies, you can go to the official website to view the latest version:

<dependencies>
  <dependency>
    <groupId>org.openjdk.jmh</groupId>
    <artifactId>jmh-core</artifactId>
    <version>${openjdk.jmh.version}</version>
  </dependency>
  <dependency>
    <groupId>org.openjdk.jmh</groupId>
    <artifactId>jmh-generator-annprocess</artifactId>
    <version>${openjdk.jmh.version}</version>
  </dependency>
</dependencies>

<properties>
  <openjdk.jmh.version>1.19</openjdk.jmh.version>
</properties>

2.2 create test related classes

2.2.1 class of collection storage object

Because we want to test the methods of the collection class, we create a class to represent the objects stored in the collection. As follows:

@Data
@AllArgsConstructor(staticName = "of")
public class Student {
    private Long id;
    private String name;
}

2.2.2 JMH test

Next, we will write the test performance comparison class, the code is as follows:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class ContainsPerformanceTest {
    @State(Scope.Thread)
    public static class MyState {
        private Set<Student> studentSet = new HashSet<>();
        private List<Student> studentList = new ArrayList<>();
        private Student targetStudent = Student.of(99L, "Larry");

        @Setup(Level.Trial)
        public void prepare() {
            long MAX_COUNT = 10000;
            for (long i = 0; i < MAX_COUNT; i++) {
                studentSet.add(Student.of(i, "MQ"));
                studentList.add(Student.of(i, "MQ"));
            }
            studentList.add(targetStudent);
            studentSet.add(targetStudent);
        }
    }

    @Benchmark
    public boolean arrayList(MyState state) {
        return state.studentList.contains(state.targetStudent);
    }

    @Benchmark
    public boolean hashSet(MyState state) {
        return state.studentSet.contains(state.targetStudent);
    }

    public static void main(String[] args) throws Exception {
        Options options = new OptionsBuilder()
                .include(ContainsPerformanceTest.class.getSimpleName())
                .threads(6)
                .forks(1)
                .warmupIterations(3)
                .measurementIterations(6)
                .shouldFailOnError(true)
                .shouldDoGC(true)
                .build();
        new Runner(options).run();
    }
}

Test class notes:

  • @Benchmark mode: indicates the mode used for benchmarking; AverageTime indicates the average time of test calls.
  • @OutputTimeUnit: the unit of measurement time for the test; NANOSECONDS means NANOSECONDS.
  • @State: accept a Scope parameter to represent the shared Scope of the state; Scope.Thread means that each thread is exclusive.
  • @Setup: execute before Benchmark, similar to JUnit's @ BeforeAll.
  • @Benchmark: the object for benchmark, similar to JUnit's @ Test.

Test class startup parameter Options Description:

  • include: the class name of the benchmark;
  • Threads: the number of test threads in each process;
  • Fork: number of processes. If it is 3, JMH will fork out 3 processes to test.
  • warmupIterations: number of iterations to warm up,
  • Measurement iterations: the number of iterations actually measured.

2.3 test results

After setting the parameters, you can run the test. The test results are as follows:

# Benchmark: ContainsPerformanceTest.arrayList

# Run progress: 0.00% complete, ETA 00:00:18
# Fork: 1 of 1
# Warmup Iteration   1: 42530.408 ±(99.9%) 2723.999 ns/op
# Warmup Iteration   2: 17841.988 ±(99.9%) 1882.026 ns/op
# Warmup Iteration   3: 18561.513 ±(99.9%) 2021.506 ns/op
Iteration   1: 18499.568 ±(99.9%) 2126.172 ns/op
Iteration   2: 18975.407 ±(99.9%) 2004.509 ns/op
Iteration   3: 19386.851 ±(99.9%) 2248.536 ns/op
Iteration   4: 19279.722 ±(99.9%) 2102.846 ns/op
Iteration   5: 19796.495 ±(99.9%) 1974.987 ns/op
Iteration   6: 21363.962 ±(99.9%) 2175.961 ns/op


Result "ContainsPerformanceTest.arrayList":
  19550.334 ±(99.9%) 2771.595 ns/op [Average]
  (min, avg, max) = (18499.568, 19550.334, 21363.962), stdev = 988.377
  CI (99.9%): [16778.739, 22321.929] (assumes normal distribution)


# Benchmark: ContainsPerformanceTest.hashSet

# Run progress: 50.00% complete, ETA 00:00:16
# Fork: 1 of 1
# Warmup Iteration   1: 10.662 ±(99.9%) 0.209 ns/op
# Warmup Iteration   2: 11.177 ±(99.9%) 1.077 ns/op
# Warmup Iteration   3: 9.467 ±(99.9%) 1.462 ns/op
Iteration   1: 9.540 ±(99.9%) 0.535 ns/op
Iteration   2: 9.388 ±(99.9%) 0.365 ns/op
Iteration   3: 10.604 ±(99.9%) 1.008 ns/op
Iteration   4: 9.361 ±(99.9%) 0.154 ns/op
Iteration   5: 9.366 ±(99.9%) 0.458 ns/op
Iteration   6: 9.274 ±(99.9%) 0.237 ns/op


Result "ContainsPerformanceTest.hashSet":
  9.589 ±(99.9%) 1.415 ns/op [Average]
  (min, avg, max) = (9.274, 9.589, 10.604), stdev = 0.505
  CI (99.9%): [8.174, 11.004] (assumes normal distribution)


# Run complete. Total time: 00:00:32

Benchmark                          Mode  Cnt      Score      Error  Units
ContainsPerformanceTest.arrayList  avgt    6  19550.334 ± 2771.595  ns/op
ContainsPerformanceTest.hashSet    avgt    6      9.589 ±    1.415  ns/op

After testing, it is found that the time-consuming difference between the two is great. The ArrayList is about 20K nanoseconds, while the HashSet is about 10 nanoseconds. They are not in the same order of magnitude at all.

3 source code analysis

Through the test, we know that there are great differences between the two. Let's take a look at the source code analysis.

3.1 contains() of ArrayList

The bottom layer of ArrayList uses array as data storage. When an Object is given to judge whether it exists, it needs to traverse the array and compare with each element.

public boolean contains(Object o) {
  return indexOf(o) >= 0;
}
public int indexOf(Object o) {
  if (o == null) {
    for (int i = 0; i < size; i++)
      if (elementData[i]==null)
        return i;
  } else {
    for (int i = 0; i < size; i++)
      if (o.equals(elementData[i]))
        return i;
  }
  return -1;
}

It can be found from the source code that the contains() method is judged by calling indexOf(), which is to traverse the array until the element equal to the input parameter is found. Because, the time complexity of the contains() method of ArrayList is O(n), that is to say, time depends on the length and is proportional.

3.2 contents() of HashSet

The bottom layer of HashSet is implemented by HashMap, and the bottom structure of HashMap is array + linked list. After JDK 8, it is changed to array + linked list + red black tree.

The relevant codes of HashMap are as follows:

public boolean containsKey(Object key) {
  return getNode(hash(key), key) != null;
}
final Node<K,V> getNode(int hash, Object key) {
  Node<K,V>[] tab; Node<K,V> first, e; int n; K k;
  if ((tab = table) != null && (n = tab.length) > 0 &&
      (first = tab[(n - 1) & hash]) != null) {
    if (first.hash == hash && // always check first node
        ((k = first.key) == key || (key != null && key.equals(k))))
      return first;
    if ((e = first.next) != null) {
      if (first instanceof TreeNode)
        return ((TreeNode<K,V>)first).getTreeNode(hash, key);
      do {
        if (e.hash == hash &&
            ((k = e.key) == key || (key != null && key.equals(k))))
          return e;
      } while ((e = e.next) != null);
    }
  }
  return null;
}

First, find it by getting the Hash value. If the Hash value is equal and the object is equal, find it. Generally speaking, in the case that the implementation of hashCode() method is OK, there are less Hash conflicts. Therefore, in most cases, the time complexity of contains() is O(1), and the number of elements does not affect its speed. In case of Hash conflict, when the length of the linked list is less than 8, the time complexity is O(n); when the length of the linked list is greater than 8, the time complexity is O(logn).

Generally, we think that the time complexity of HashSet/HashMap search is O(1).

4 Summary

Through JMH test, we find that the performance of contains() method of ArrayList and HashSet is quite different. According to the source code analysis, the time complexity of ArrayList is O(n), while that of HashSet is O(1).

Welcome to the public account "pumpkin slow talk", will continue to update for you...

Keywords: Java Junit less Oracle

Added by CodeEye on Wed, 23 Oct 2019 17:03:08 +0300