What box plots are for: cleaning abnormal data

A box plot, also known as a box-and-whisker plot, box chart, or box-line chart, is a statistical chart that shows the dispersion of a group of data. It is named for its box-like shape. It is used in many fields and is especially common in quality management. It mainly reflects the distribution characteristics of the raw data, and it can also compare the distributions of several groups of data. To draw one, first find the upper edge, lower edge, median, and two quartiles of the data; then draw the box between the two quartiles; finally, connect the upper and lower edges to the box. The median lies inside the box.

A box plot has six main data nodes. Sort the data from largest to smallest, then calculate the upper edge, the upper quartile Q3, the median, the lower quartile Q1, the lower edge, and any abnormal values (outliers).

A box plot summarizes a data set with only five points: the median, Q1, Q3, and the high and low ends of the distribution. It gives a compact visual picture of the center, the spread, and the overall range of the data.

The most important part of building a box plot is calculating these statistical points, which can be done with a percentile calculation.
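As an illustration of the percentile calculation, here is a minimal sketch that uses linear interpolation between the two closest ranks. This is only one of several conventions in common use, and the names `PercentileSketch` and `percentile` are hypothetical, chosen for this example.

```java
import java.util.Arrays;

class PercentileSketch {
    /**
     * Returns the p-th quantile of data (0 <= p <= 1), interpolating linearly
     * between the two closest ranks. Assumption: this is one common convention;
     * other quartile rules can give slightly different values.
     */
    static double percentile(double[] data, double p) {
        if (data == null || data.length == 0) {
            throw new IllegalArgumentException("data must not be empty");
        }
        if (p < 0.0 || p > 1.0) {
            throw new IllegalArgumentException("p must be between 0 and 1");
        }
        // Sort a copy in ascending order so the caller's array is left untouched.
        double[] sorted = data.clone();
        Arrays.sort(sorted);

        // Fractional position of the requested quantile in the sorted array.
        double pos = p * (sorted.length - 1);
        int lower = (int) Math.floor(pos);
        int upper = (int) Math.ceil(pos);
        double fraction = pos - lower;

        // Interpolate between the two neighbouring values.
        return sorted[lower] + fraction * (sorted[upper] - sorted[lower]);
    }
}
```

For the values 1 through 8, `percentile(data, 0.25)` gives 2.75 under this convention, while a median-of-halves rule gives 2.5. Both are legitimate; what matters is using the same rule consistently.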

Steps for drawing a box plot: [2]

1. Draw a number axis in the same unit of measurement as the data batch. Start it slightly below the minimum value and make it slightly longer than the full range of the data batch.

2. Draw a rectangular box whose two ends correspond to the upper and lower quartiles (Q3 and Q1) of the data batch. Inside the box, draw a line segment at the median (Xm) as the median line.

3. Draw two line segments at Q3 + 1.5 × IQR and Q1 - 1.5 × IQR, parallel to the median line. These two segments are the cutoff points for abnormal values and are called the inner limits. Draw two more line segments at Q3 + 3 × IQR and Q1 - 3 × IQR; these are called the outer limits. Every data point outside the inner limits is an outlier: outliers between the inner and outer limits are mild outliers, and outliers beyond the outer limits are extreme outliers. Here IQR is the interquartile range, IQR = Q3 - Q1.

4. From each end of the rectangular box, draw a line segment outward to the farthest point that is not an outlier. These segments show the range of the normal values in the data batch.

5. Mark mild outliers with a circle ("○") and extreme outliers with "*". Points with the same value are marked side by side at the same position on the axis, and points with different values are marked at different positions. This completes the box plot for a batch of data. Box plots drawn by statistical software generally do not mark the inner and outer limits.
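The limits in step 3 can be expressed as a small classification routine. The sketch below is illustrative only: `FenceSketch`, `Kind`, and `classify` are hypothetical names, and Q1 and Q3 are assumed to be already known. A value lying exactly on a limit is treated here as inside it, which is an assumption, since the steps do not say how ties are handled.

```java
class FenceSketch {
    enum Kind { NORMAL, MILD_OUTLIER, EXTREME_OUTLIER }

    /** Classifies a value against the inner and outer limits derived from q1 and q3. */
    static Kind classify(double value, double q1, double q3) {
        double iqr = q3 - q1;                 // interquartile range

        double innerLow = q1 - 1.5 * iqr;     // inner limits
        double innerHigh = q3 + 1.5 * iqr;
        double outerLow = q1 - 3.0 * iqr;     // outer limits
        double outerHigh = q3 + 3.0 * iqr;

        if (value < outerLow || value > outerHigh) {
            return Kind.EXTREME_OUTLIER;      // beyond the outer limits
        }
        if (value < innerLow || value > innerHigh) {
            return Kind.MILD_OUTLIER;         // between the inner and outer limits
        }
        return Kind.NORMAL;                   // within the inner limits
    }
}
```

With Q1 = 2 and Q3 = 4, the IQR is 2, so the inner limits are -1 and 7 and the outer limits are -4 and 10. A value of 8 is then a mild outlier and a value of 11 is an extreme one.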

```java
import java.util.Arrays;

// BoxPlot, log, and JSON are defined elsewhere in the project and are not shown here.

/**
 * Computes the box plot statistics for a batch of data.
 *
 * @param data the raw values; must contain at least two elements
 * @return the five summary points plus the inner and outer limits
 */
public static BoxPlot plot(double[] data) {
    if (data == null || data.length < 2) {
        throw new IllegalArgumentException("data must contain at least two values");
    }
    // Sort a copy in ascending order. Duplicates are kept so that repeated
    // values still count toward the median and the quartiles.
    double[] sorted = data.clone();
    Arrays.sort(sorted);
    int n = sorted.length;
    int half = n / 2;

    double median = medianOf(sorted, 0, n);
    // Lower quartile (0.25): median of the lower half
    double Q1 = medianOf(sorted, 0, half);
    // Upper quartile (0.75): median of the upper half
    double Q3 = medianOf(sorted, n - half, n);
    double min = sorted[0];
    double max = sorted[n - 1];
    // Interquartile range
    double IQR = Q3 - Q1;

    // Inner limits: values outside them are outliers
    double maxInRegion = Q3 + 1.5 * IQR;
    double mixInRegion = Q1 - 1.5 * IQR;
    // Outer limits: values outside them are extreme outliers
    double maxOutRegion = Q3 + 3 * IQR;
    double mixOutRegion = Q1 - 3 * IQR;

    BoxPlot box = new BoxPlot();
    box.setMin(min);
    box.setQ1(Q1);
    box.setMedian(median);
    box.setQ3(Q3);
    box.setMax(max);
    box.setMixInRegion(mixInRegion);
    box.setMaxInRegion(maxInRegion);
    box.setMixOutRegion(mixOutRegion);
    box.setMaxOutRegion(maxOutRegion);
    box.setIQR(IQR);
    log.info(JSON.toJSONString(box));
    return box;
}

/** Median of the non-empty range sorted[from, to). */
private static double medianOf(double[] sorted, int from, int to) {
    int len = to - from;
    int mid = from + len / 2;
    return len % 2 == 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}
```
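Since the stated purpose is cleaning abnormal data, the limits can then be used to filter the input. The sketch below shows one possible way to do that, assuming Q1 and Q3 have already been computed. The names `OutlierFilterSketch` and `removeOutliers` are hypothetical and not part of the original code. It keeps only the values that lie within the inner limits; whether to drop mild outliers or only extreme ones is a judgment call that depends on the data.

```java
import java.util.Arrays;

class OutlierFilterSketch {
    /**
     * Returns the values of data that lie within the inner limits
     * [q1 - 1.5 * IQR, q3 + 1.5 * IQR]; all other values are dropped as outliers.
     * The original order of the remaining values is preserved.
     */
    static double[] removeOutliers(double[] data, double q1, double q3) {
        double iqr = q3 - q1;
        double low = q1 - 1.5 * iqr;
        double high = q3 + 1.5 * iqr;
        return Arrays.stream(data)
                .filter(v -> v >= low && v <= high)
                .toArray();
    }
}
```

For example, with Q1 = 2 and Q3 = 4 the inner limits are -1 and 7, so filtering the values 1, 2, 3, 4, 100 removes only 100.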