by: MicrostrongPosted on: December 29, 2024December 29, 2024

[SysDes1][Chap. 13] Design a Search Autocomplete System

Design Questions

Match only from beginning or support match in the middle
Ranking method: term frequency etc.
Spell check method.
Batch processing or online processing.
Calculation
- QPS: 10 millions DAU * 10 queries / day * 20 bytes / query = 24k QPS
- Storage: 10 millions DAU * 10 new queries * 20 bytes * 20% new queries = 0.4 GB

CUJ: Gathering Service

Store all user’s queries for a period of time
Run the offline batch job to build Trie data structure
How to build Trie data structure
- Limit the max length of the prefix to prevent the tree going too deep
- Cache top search queries at each node, so we don’t need to go down to the leaf nodes for the auto-completed search queries
We can apply filter rules for users’ queries
We can add more signals (not only term frequency) when building Trie

CUJ: Query Service

User send autocomplete request to API servers
API server query the Trie data structure to get the top auto-completed search queries
How to serve Trie data structure efficiently
- Shard probably into different Trie DB
- Use cache to speed up access speed to Trie DB
- Also use cache on client side to avoid sending autocomplete request
We can consider other signal (not only term frequency) when ranking candidates

Other Topics

Extend to other languages: use unicode instead of ASCII
Different ranking algorithm: use other signals, e.g. location, time of the query
Support real-time / trending / online search query data, some ideas:
- Schedule more frequent offline job within a reduce working set.
- Combine Trie with other data structures e.g. hash table etc.

Leave a Reply Cancel reply