[082]Vector space model

Text Retrieval and Search Engines (2) - System Implementation

  • 2.1 Term frequency (TF) & Inverse Document Frequency (IDF)

  • 2.2 The upper bound of C(w, d)

  • 2.3 Document Length Normalization

2.1 Term frequency (TF) & Inverse Document Frequency (IDF)

  • Problem:

    • f(q,d2) = 3

    • f(q,d3) = 3

  • Even though d2 and d3 have the same f(q,d), we cannot conclude that d2 and d3 are equally relevant to the query.

  • Next, when computing Sim(q,d), some words in d (such as "about") are counted multiple times, so the ranking uses Term Frequency (TF) for the weighting.

  • TF alone has a problem, though: here "about" appears twice and "presidential" once, so "about" gets the higher TF, yet semantically "presidential" carries more importance. Hence a further improvement of the Vector Space Model: adding Inverse Document Frequency (IDF).

  • IDF(w) → IDF weighting: penalize popular terms; on the IDF curve, the x-axis is k, the number of documents containing w (its document frequency).
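The TF-IDF idea above can be sketched as follows. This is a minimal illustration, not the exact scoring function from the lecture; the toy corpus and the function names `idf` and `tf_idf_score` are made up for this example, and it assumes the common IDF form log((M+1)/k), where M is the total number of documents and k is the document frequency:

```python
import math

def idf(word, docs):
    """IDF(w) = log((M + 1) / k): M = total number of documents,
    k = number of documents containing w (its document frequency).
    Popular terms (large k) get a small weight."""
    M = len(docs)
    k = sum(1 for d in docs if word in d)
    return math.log((M + 1) / k) if k else 0.0

def tf_idf_score(query, doc, docs):
    """Sim(q, d) = sum over query words w of c(w, d) * IDF(w)."""
    return sum(doc.count(w) * idf(w, docs) for w in query)

# Toy corpus: "about" occurs in every document, "presidential" in one.
docs = [
    "news about stocks".split(),
    "news about weather".split(),
    "news about sports".split(),
    "news about presidential campaign".split(),
]
d = "news about about presidential campaign".split()
query = "news about presidential campaign".split()

# c("about", d) = 2, but IDF("about") is tiny, so a single occurrence
# of "presidential" contributes more to Sim(q, d) than two of "about".
```

This is exactly the fix the bullet describes: the popular word "about" is down-weighted even when its raw count is higher.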

2.2 The upper bound of C(w, d)

  • With IDF(w) addressing the popular-term problem, the next question is: "How effective is VSM with TF-IDF weighting?" (There are still problems...)

  • The problem: if c(w, d) keeps increasing, TF(w,d) grows without bound, so a single heavily repeated term can dominate the score.

  • The BM25 Transformation solves this by giving TF(w,d) an upper bound: y = (k+1)*x / (x+k), which never exceeds k+1.
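The bound is easy to verify numerically. A minimal sketch, assuming the typical default k = 1.2 (the parameter value is an assumption, not from the notes):

```python
def bm25_tf(x, k=1.2):
    """BM25 TF transformation y = (k + 1) * x / (x + k):
    increasing in the raw count x, but never exceeding k + 1.
    k = 1.2 is a commonly used default (an assumption here)."""
    return (k + 1) * x / (x + k)

# bm25_tf(1) is exactly 1 for any k, and even a count of 100
# stays below the ceiling of k + 1 = 2.2.
```

A nice property of this curve: the first occurrence of a word always contributes 1, and each further occurrence contributes less and less (diminishing returns), instead of growing linearly as raw TF does.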

2.3 Document Length Normalization

  • Even after TF, IDF, and the BM25 Transformation, a document may score high simply because it is long, so we penalize a long doc with a doc length normalizer; the method here is Pivoted Length Normalization.

  • Pivoted length normalizer: uses the average doc length as the "pivot".
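The pivot idea above can be sketched as follows, assuming the standard normalizer 1 - b + b*|d|/avdl and the common default b = 0.75 (the parameter value and function names are assumptions for illustration):

```python
def pivoted_norm(doc_len, avdl, b=0.75):
    """Pivoted length normalizer: 1 - b + b * |d| / avdl.
    avdl (average document length) is the pivot: the normalizer
    is exactly 1 there, > 1 for longer docs (penalty),
    < 1 for shorter docs (reward). b in [0, 1] sets the strength."""
    return 1 - b + b * doc_len / avdl

def bm25_weight(count, doc_len, avdl, k=1.2, b=0.75):
    """BM25 TF with length normalization folded in:
    (k + 1) * c / (c + k * normalizer)."""
    norm = pivoted_norm(doc_len, avdl, b)
    return (k + 1) * count / (count + k * norm)

# The same raw count c(w, d) = 3 is worth less in a document twice
# the average length than in one half the average length.
```

Setting b = 0 turns the normalizer off (every document is treated as average length), while b = 1 normalizes fully by length; values in between trade off the two.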
