Article Directory
Preface
Web site UV refers to the amount of Unique Visitor visited by independent users of a Web site, that is, multiple visits by the same users need to be weighted.
thinking
When it comes to UV weighting, it's assumed that the Set collection class will come to mind.
- It's a good idea to use Set collections, where the user's id is stored.When each user visits the page, we save the id directly into the Set and get the size of the Set.The question is how much capacity does a Set need to be set?If the application is distributed, do you need to merge?The first problem can actually be estimated by calculation, which requires a lot of storage space if you have hundreds of millions of users; the second problem can actually be stored by Redis, DB, etc., such as Redis's Set structure, DB's unique key.
- The DB we mentioned above is also a solution, but when there is a lot of writing, the pressure on the database will be greater.If there are many users, there are many row s, and you may need to break up your daily data.This can be done with a small amount of user access.
Although the above two methods can achieve the function of Statistics website UV, one takes up memory and the other takes up database resources.So how can we avoid these two questions?Here, we introduce another implementation, which uses the HyperLogLog structure in Redis and only takes up 12 K of space.
HyperLogLog
HyperLog is simple to use and slightly complex to implement.Let's first look at how HyperLogLog can be used for page UV statistics.
Use Redis command action
# Add Elements 127.0.0.1:6379> pfadd user zhangsan lisi wangwu # Add Successful Return1,Add Failure Return0 (integer) 1 # Statistics Quantity 127.0.0.1:6379> pfcount user # Return to the present quantity (integer) 3 # Regenerate a pfkey 127.0.0.1:6379> pfadd user2 zhangsan2 lisi2 wangwu (integer) 1 127.0.0.1:6379> pfcount user2 (integer) 3 # pfmerge merges values from subsequent pfkeys into previous pfkeys 127.0.0.1:6379> pfmerge user2 user OK # View user2 after merge 127.0.0.1:6379> pfcount user2 (integer) 5
Working with Java code
import org.springframework.data.redis.core.HyperLogLogOperations; import org.springframework.data.redis.core.RedisTemplate; import org.springframework.stereotype.Service; import javax.annotation.Resource; @Service public class RedisService { @Resource private RedisTemplate<String, String> redisTemplate; /** * Logging user access * * @param user */ public long statistic(String Key, String user) { HyperLogLogOperations<String, String> hyperLogLog = redisTemplate.opsForHyperLogLog(); return hyperLogLog.add(Key, user); } /** * Statistics Current UV * * @return */ public long size(String Key) { HyperLogLogOperations<String, String> hyperLogLog = redisTemplate.opsForHyperLogLog(); return hyperLogLog.size(Key); } /** * Delete the current key */ public void clear(String Key) { HyperLogLogOperations<String, String> hyperLogLog = redisTemplate.opsForHyperLogLog(); hyperLogLog.delete(Key); } }
The Principle and Feature of HyperLog Implementation
- Principle: Actually this is a probability problem.For example, in Java, when we put a string in HyperLogLog each time, we actually convert the string to a value that we can use as a hash value, convert the value to binary, and look back and forward at where the first 1 appears.So what is the probability that 1 will appear in the third position (xxxx x100)?(1/2) ^3 = 1/8, that is, when about eight numbers enter this data structure, the first 1 may have appeared in the third location more likely, so we just need to maintain a maximum of 1's occurrence (temporarily called max position) to know how many HyperLogLogs there are.
- Duplicate: When we talked about hash values above, in fact, the whole algorithm is to map a fixed value to a number to solve the duplicate problem.If zhangsan corresponds to 8, then max position=4, another zhangsan or 8, then max position does not change.
- Features: Because of the probability problem, there will always be inaccuracies, so when you use HyperLogLog, you can set a larger number of user s, such as 100W.But as a result, it's possible that you can see less than 100W, or that the calculated UV is larger than 100W.
Implement HyperLog using Java
public class HyperLogLogSelf { static class BitKeeper { private int maxBits; public void random() { // Here the random number can be treated as a hashCode of an object.Long value = new Object (). hashCode () ^ (2 << 32); long value = ThreadLocalRandom.current().nextLong(2L << 32); int bits = lowZeros(value); if (bits > this.maxBits) { this.maxBits = bits; } } /** * How many consecutive zeros are there in the low position * Ideas_Position of the first one to the last * * @param value * @return */ private int lowZeros(long value) { int i = 1; for (; i < 32; i++) { if (value >> i << i != value) { break; } } return i - 1; } } static class Experiment { private int n; private BitKeeper keeper; public Experiment(int n) { this.n = n; this.keeper = new BitKeeper(); } public void work() { for (int i = 0; i < n; i++) { this.keeper.random(); } } public void debug() { double v = Math.log(this.n) / Math.log(2); System.out.printf("%d %.2f %d\n", this.n, v, this.keeper.maxBits); } } public static void main(String[] args) { for (int i = 10000; i < 1000000; i += 10000) { Experiment exp = new Experiment(i); exp.work(); exp.debug(); } } }
As shown in the code above, if there is only one BitKeeper, the precision is difficult to control. The more BitKeepers there are, the more accurate they are. So Redis set up HyperLogLog with 16384 barrels, or 2^14. The maxbits of each barrel need 6 bits to store. The maximum is maxbits=63, so the total memory usage is 2^14 * 6/8 = 12K bytes.
Summary
Starting with the application scenario, we describe the use and implementation principle of HyperLog, and give a simple Java implementation of HyperLog.
Finally, when using HyperLogLog, we need to be aware that:
- HyperLogLog takes up 12k of memory (when there is a lot of data), so HyperLogLog is not suitable for storing a user-related information separately.
- HyperLogLog has a certain loss of precision, may be more or less than the true number, but it is basically within n(0<n<10).