Reference:
**https://www.youtube.com/watch?v=kx-XDoPjoHw**
Use Cases
- Find the 100 most:
  - Viewed videos on YouTube
  - Played songs on Spotify
  - Shared posts on Facebook
  - Retweeted tweets on Twitter
  - Liked photos on Instagram
  - Searched keywords on Google
- At such a scale, a database or a distributed cache is not an option: we might be dealing with 1M requests per second. If we used a DB to track view counts, the writes/updates would be very slow, and finding the top K items would require scanning the whole dataset. This might work for small datasets, but not for a large amount of data.
- Maybe MapReduce can help, but it is not sufficient on its own: we need to return heavy hitters in as close to real time as possible, and MapReduce is batch processing.
- e.g. calculate the top 100 list for the last 1 min, 5 min, 15 min, 60 min, etc.
- This gives the problem the flavor of a stream processing problem.
Requirements
- Functional
- topK(k, startTime, endTime): returns the k most frequent items within the given time interval (a minimal interface sketch follows after this list).
- Non-Functional
- Scalable (scales together with an increasing amount of data)
- Highly available (survives hardware/network failures, no SPOF)
- Highly performant (a few tens of ms to return the top 100 list)
- The performance requirement is a hint that the final list should be pre-calculated, and that we should avoid heavy computation at the time the top K API is called.
- Accurate
- e.g. with data sampling we would not count every event, only a small fraction of them, which sacrifices accuracy.
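A minimal sketch of the functional API implied by the requirements above; `TopKService`, `HeavyHitter`, and the Python types are illustrative assumptions, only the topK(k, startTime, endTime) signature comes from the notes.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class HeavyHitter:
    identifier: str   # e.g. a video ID
    frequency: int    # (possibly approximate) count within [start_time, end_time]

class TopKService:
    def top_k(self, k: int, start_time: datetime, end_time: datetime) -> List[HeavyHitter]:
        """Return the k most frequent identifiers observed in the interval."""
        raise NotImplementedError  # served from pre-calculated lists in the real design
```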
Approaches:
- Hash table, Single Host.
- Keep the count of incoming events in a hash map.
- Two approaches to extract the top K:
- Sort the hash map entries by frequency and return the first K elements. Time: O(n log n).
- Put the elements on a heap of size K. Time: O(n log K).
- This is not scalable: when the volume of incoming events gets too high, a single host becomes a bottleneck, and we need to process events in parallel. (A sketch of the single-host approach follows below.)
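A minimal sketch of the single-host approach, assuming events arrive as a plain Python iterable of identifiers; function and variable names are illustrative.

```python
import heapq
from collections import Counter

def top_k(events, k):
    """Count frequencies in a hash map, then keep the k most frequent
    identifiers on a min-heap of size k: O(n log k) overall."""
    counts = Counter(events)                # hash table: identifier -> frequency
    heap = []                               # min-heap of (frequency, identifier)
    for identifier, freq in counts.items():
        heapq.heappush(heap, (freq, identifier))
        if len(heap) > k:
            heapq.heappop(heap)             # evict the current least-frequent item
    return sorted(heap, reverse=True)       # highest frequency first

# top_k(["A", "B", "C", "A", "A", "B"], 2) -> [(3, 'A'), (2, 'B')]
```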
- HashTable, Multiple Hosts.
- We introduce a load balancer so that we can process events in parallel; the total throughput of the system definitely increases. Memory is still a problem, though: we cannot store the IDs of all YouTube videos in the memory of a single host.
- HashTable, Multiple Hosts, Partitioning.
- The data partitioner is responsible for routing each individual identifier to its own processor host, so each processor host stores only a subset of all the data. We do not send the full hash tables from all hosts to the storage host; instead we compute a topK list individually on each host and merge these sorted lists on the storage host. This is similar to merging K sorted linked lists (a sketch follows below).
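A minimal sketch of the merge step, assuming each processor host has already produced a list of (frequency, identifier) pairs sorted by descending frequency; because each identifier is routed to exactly one host, counts never need to be summed across lists. The function name is illustrative.

```python
import heapq
from itertools import islice

def merge_top_k(per_host_lists, k):
    """Merge per-partition topK lists into a global topK, the same way
    one merges K sorted linked lists."""
    # heapq.merge yields one stream in ascending key order; negate the
    # frequency so the stream comes out in descending-frequency order.
    merged = heapq.merge(*per_host_lists, key=lambda pair: -pair[0])
    return list(islice(merged, k))

# merge_top_k([[(9, "A"), (4, "D")], [(7, "B"), (2, "E")]], 3)
# -> [(9, 'A'), (7, 'B'), (4, 'D')]
```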
- Problem?
- We considered the dataset to be bounded; that is why we could partition it into multiple chunks. But streaming data is not bounded, it keeps coming. A processor host can therefore only accumulate data for a limited period of time, say 1 minute, before it runs out of memory.
- We flush the 1-minute data to the storage host.
- The storage host stores the list of heavy hitters for every minute.
- We intentionally lose all information about non-topK elements; we cannot afford to store information about every video in memory.
- But what if we want to find the topK for the last 1 hour, or the last 1 day? Can we build it from 60 one-minute lists?
- With the current approach, there is no correct way to do this: an element may miss every per-minute topK list and still accumulate enough events over the day to belong in the daily topK. To find the topK for the day, we need the full dataset for the whole day.
- Conflicting requirements: keep the whole day of data (to satisfy the requirement) or lose it to keep storage affordable.
- Let us store all the data on disk, since it cannot fit into memory, and use batch processing frameworks to build the topK lists.
- This is where a MapReduce architecture comes into play.
- Another problem with this architecture: even though it may seem simple, it is not. Think about:
- Every time we introduce data partitioning, we have to think about data replication, so that copies of each partition are stored on multiple nodes.
- We need to think about rebalancing when a node is added to or removed from the cluster.
- We need to deal with hot partitions.
- Before jumping into the approaches discussed above, let us think whether there is a simpler solution to the topK problem.
- Yes, but we need to make sacrifices along the way: accuracy is the sacrifice.
- There is a data structure that helps us count the topK with fixed-size memory, though the results may not be 100% accurate.
Count-Min Sketch.
- Think of it as a 2-dimensional array.
- The width is usually in the thousands, and the height is small (say 5, one row per hash function).
- Whenever a new element comes in, we calculate its hash with each hash function and add 1 to the corresponding cell in each row.
- There will be collisions in some cells, so when reading a count we take the smallest value across the rows for that element.
- A count-min sketch is a fixed-size data structure, even as the dataset grows.
- It can be thought of as a replacement for the hash table (a sketch follows below).
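A minimal sketch of a count-min sketch, using MD5 salted with the row index as a stand-in for independent hash functions; the width/depth defaults and the class name are illustrative assumptions.

```python
import hashlib

class CountMinSketch:
    """Fixed-size 2D array of counters: depth rows (one per hash function),
    width buckets per row. Estimates can only overcount (collisions add to
    a cell), never undercount; taking the min across rows limits the error."""

    def __init__(self, width=2000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, item, row):
        # Salt the hash with the row index to simulate `depth` independent hash functions.
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._bucket(item, row)] += count

    def estimate(self, item):
        # The minimum across rows is the least-collided, most accurate count.
        return min(self.table[row][self._bucket(item, row)] for row in range(self.depth))
```

On each processor host, such a sketch together with a heap of size K could take the place of the per-identifier hash table, keeping memory roughly constant no matter how many distinct identifiers are seen.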
High Level Design