Contents

字串比較各種演算法

Programming 2021-07-08 727 words 2 minutes

Contents

Problem Definition

給定 k 個匹配字串 P，各字串平均長度為 m，在目標字串 T 中尋找符合的字串，目標字串長度為 n

Brute Force

Time complexity: O(k*m*n)
Space complexity: O(k*m) 建立匹配字串的字典

KMP (Knuth Morris Pratt) Algorithm

Time complexity: O(k*m + k*n)
- Failure function construction complexity: O(k*m)
- Matching complexity: O(k*n)
Space complexity: O(k*m) 每個匹配字串 Pi 建立 Pi[j] 最長相同的前綴與後綴 for j in [1, m - 1]

Trie

Time complexity: O(k*m + n*m)
- Trie construction complexity: O(k*m)
- Matching complexity: O(n*m)
Space complexity: worse case O(k*m), best case O(m)

AC (Aho–Corasick) Algorithm

Time complexity: O(n + k*m + z)
- z is number of output matches (e.g. dictionary = a, aa, aaa, aaaa and input string is aaaa)
- Construction complexity: O(k*m)
- Matching complexity: O(n + z)
- Insert new matching pattern: worse case O(k*m), best case O(m)
Space complexity: worse case O(k*m), best case O(m)

Go further to Regex

NFA

DFA

問題

如果今天請你做出一個聊天室的訊息過濾器（不允許誤報），你會如何實作呢？共有 N 個敏感詞彙，每個敏感詞彙的平均長度為 M，每個句子包含 K 個字元。可參考網路相關資料，統整分析解決方案優缺點後，選擇你認為最佳的解法簡略回答即可

Q1. 使用何種資料結構？包含敏感詞本身總所需空間複雜度為何？

若考慮N會非常大，對速度的需求高於記憶體用量，且可以是先建好字典，可能要支援Unicode，那我會選擇使用 Aho–Corasick 演算法。
此演算法會用到 Trie 資料結構和 Failure table，空間複雜度是 O(N*M)

Q2. 使用何種方式過濾？過濾敏感詞所需時間複雜度為何？

此演算法結合了 KMP 的 Failure function 來減少時間複雜度，也使用了 Trie 來降低重疊的敏感詞，時間複雜度是 O(N*M + K + Z) 其中 Z 是輸入句子敏感單詞出現的個數

Q3. 使用何種方式進行測試？

自動產生敏感單詞，隨機加入敏感單詞來生成句子，使用現有的regex來產生對照組，寫個簡單程式測試兩者結果差異

Q4. 在運作期間要更新詞彙列表的話要如何處理？

加入 Trie 更新 Failure table，時間複雜度的 best case 是 O(M)

References

Комментарии