[SysDes1][Chap. 13] Design a Search Autocomplete System

Design Questions

  • Match only from beginning or support match in the middle
  • Ranking method: term frequency etc.
  • Spell check method.
  • Batch processing or online processing.
  • Calculation
    • QPS: 10 millions DAU * 10 queries / day * 20 bytes / query = 24k QPS
    • Storage: 10 millions DAU * 10 new queries * 20 bytes * 20% new queries = 0.4 GB

CUJ: Gathering Service

  • Store all user’s queries for a period of time
  • Run the offline batch job to build Trie data structure
  • How to build Trie data structure
    • Limit the max length of the prefix to prevent the tree going too deep
    • Cache top search queries at each node, so we don’t need to go down to the leaf nodes for the auto-completed search queries
  • We can apply filter rules for users’ queries
  • We can add more signals (not only term frequency) when building Trie

CUJ: Query Service

  • User send autocomplete request to API servers
  • API server query the Trie data structure to get the top auto-completed search queries
  • How to serve Trie data structure efficiently
    • Shard probably into different Trie DB
    • Use cache to speed up access speed to Trie DB
    • Also use cache on client side to avoid sending autocomplete request
  • We can consider other signal (not only term frequency) when ranking candidates

Other Topics

  • Extend to other languages: use unicode instead of ASCII
  • Different ranking algorithm: use other signals, e.g. location, time of the query
  • Support real-time / trending / online search query data, some ideas:
    • Schedule more frequent offline job within a reduce working set.
    • Combine Trie with other data structures e.g. hash table etc.

Leave a Reply

Your email address will not be published. Required fields are marked *