Skip to content
Design Questions
- Match only from beginning or support match in the middle
- Ranking method: term frequency etc.
- Spell check method.
- Batch processing or online processing.
- Calculation
- QPS: 10 millions DAU * 10 queries / day * 20 bytes / query = 24k QPS
- Storage: 10 millions DAU * 10 new queries * 20 bytes * 20% new queries = 0.4 GB
CUJ: Gathering Service
- Store all user’s queries for a period of time
- Run the offline batch job to build Trie data structure
- How to build Trie data structure
- Limit the max length of the prefix to prevent the tree going too deep
- Cache top search queries at each node, so we don’t need to go down to the leaf nodes for the auto-completed search queries
- We can apply filter rules for users’ queries
- We can add more signals (not only term frequency) when building Trie
CUJ: Query Service
- User send autocomplete request to API servers
- API server query the Trie data structure to get the top auto-completed search queries
- How to serve Trie data structure efficiently
- Shard probably into different Trie DB
- Use cache to speed up access speed to Trie DB
- Also use cache on client side to avoid sending autocomplete request
- We can consider other signal (not only term frequency) when ranking candidates
Other Topics
- Extend to other languages: use unicode instead of ASCII
- Different ranking algorithm: use other signals, e.g. location, time of the query
- Support real-time / trending / online search query data, some ideas:
- Schedule more frequent offline job within a reduce working set.
- Combine Trie with other data structures e.g. hash table etc.