[082]Vector space model
Text Retrieval and Search Engines (2)- System implementation
2.1 Term Frequency (TF) & Inverse Document Frequency (IDF)
2.2 The upper bound of C(w, d)
2.3 Document Length Normalization
2.1 Term Frequency (TF) & Inverse Document Frequency (IDF)
Problem:
f(q,d2)=3
f(q,d3)=3
Even though d2 and d3 have the same f(q,d), we cannot conclude that d2 and d3 are equally relevant to the query.
Next, when computing Sim(q,d), some words in d occur multiple times, such as "about", so the ranking is adjusted by weighting with Term Frequency (TF).
However, TF alone has a problem: "about" appears twice while "presidential" appears once, so "about" gets a higher TF, yet semantically "presidential" matters more. This motivates a further improvement of the Vector Space model: adding Inverse Document Frequency (IDF).
IDF(w) → IDF weighting: penalize popular terms. IDF(w) = log[(M+1)/k], where M is the total number of documents and k = df(w) is the document frequency of w (k is the x-axis of the IDF curve, which decays as a word appears in more documents).
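The TF-IDF weighting above can be sketched as a small scoring function. This is an illustrative sketch, not the lecture's exact formulation: the toy corpus below is invented to mirror the "about" vs. "presidential" example, and IDF is taken as log((M+1)/df(w)) per the definition above.

```python
import math
from collections import Counter

def tf_idf_score(query, doc, corpus):
    """Dot-product score: sum over query words of c(w, d) * IDF(w),
    with IDF(w) = log((M + 1) / df(w))."""
    M = len(corpus)
    doc_counts = Counter(doc)
    score = 0.0
    for w in query:
        df = sum(1 for d in corpus if w in d)  # document frequency of w
        if df == 0:
            continue  # query word absent from the corpus contributes nothing
        idf = math.log((M + 1) / df)
        score += doc_counts[w] * idf
    return score

# Hypothetical toy corpus echoing the "about" vs "presidential" example:
corpus = [
    "news about".split(),
    "news about organic food campaign".split(),
    "news of presidential campaign".split(),
    "news of presidential campaign presidential candidate".split(),
]
query = "news about presidential campaign".split()
for i, d in enumerate(corpus, 1):
    print(f"d{i}: {tf_idf_score(query, d, corpus):.3f}")
```

Because "news" occurs in every document, its IDF is low; rarer query terms like "presidential" carry more weight, so documents matching them rank higher.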
2.2 The upper bound of C(w, d)
After IDF(w) addresses the popular-term problem, we should ask: "How effective is VSM with TF-IDF weighting?" (Problems remain...)
The problem is that C(w, d) is unbounded: if a word keeps repeating in a document, TF(w, d) grows without limit, so a single over-repeated term can dominate the score.
The BM25 Transformation solves this by giving TF(w, d) an upper bound: y = (k+1)·x / (x + k), where x = C(w, d); as x grows, y approaches but never exceeds k+1.
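The BM25 transformation above is a one-line function; the sketch below just plots a few values to show the sublinear growth and the k+1 ceiling (k=1.2 is a common default, assumed here, not given in the notes):

```python
def bm25_tf(x, k=1.2):
    """BM25 term-frequency transformation y = (k+1)*x / (x + k).
    y = 0 at x = 0, y = 1 at x = 1, and y is bounded above by k+1."""
    return (k + 1) * x / (x + k)

for x in [0, 1, 2, 5, 100]:
    print(x, round(bm25_tf(x), 3))
```

Note that raising k makes the curve closer to raw TF (less damping), while k → 0 collapses it toward a 0/1 presence indicator.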
2.3 Document Length Normalization
After TF, IDF, and the BM25 transformation, long documents may still be favored simply because they have more chances to match, so we penalize a long document with a document length normalizer; here we look at Pivoted Length Normalization.
The pivoted length normalizer uses the average document length as the "pivot": normalizer = 1 − b + b·|d|/avdl, with b ∈ [0, 1]. Documents longer than average (|d| > avdl) are penalized, shorter ones are rewarded, and a document of exactly average length is unaffected.
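Putting the three pieces together (TF via the BM25 transform, IDF, and the pivoted length normalizer) gives a complete ranking function. This is a sketch under assumed defaults k=1.2 and b=0.75; the function name and toy inputs are hypothetical:

```python
import math
from collections import Counter

def pivoted_bm25_score(query, doc, corpus, k=1.2, b=0.75):
    """score(q,d) = sum over w of
        c(w,q) * (k+1)*c(w,d) / (c(w,d) + k*norm) * log((M+1)/df(w))
    where norm = 1 - b + b*|d|/avdl uses the average doc length as pivot."""
    M = len(corpus)
    avdl = sum(len(d) for d in corpus) / M
    norm = 1 - b + b * len(doc) / avdl  # > 1 for long docs, < 1 for short
    doc_counts = Counter(doc)
    score = 0.0
    for w, qc in Counter(query).items():
        df = sum(1 for d in corpus if w in d)
        if df == 0:
            continue
        x = doc_counts[w]
        tf = (k + 1) * x / (x + k * norm)  # length-normalized BM25 TF
        score += qc * tf * math.log((M + 1) / df)
    return score
```

With b = 0 the normalizer is constant 1 and length is ignored; b = 1 applies the full penalty, so b tunes how aggressively long documents are discounted.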