[082]Vector space model

Text Retrieval and Search Engines (2) - System Implementation

  • 2.1 Term frequency (TF) & Inverse Document Frequency (IDF)

  • 2.2 The upper bound of C(w, d)

  • 2.3 Document Length Normalization

2.1 Term frequency (TF) & Inverse Document Frequency (IDF)

  • Problem:

    • f(q,d2) = 3

    • f(q,d3) = 3

  • Even though d2 and d3 have the same f(q,d), we cannot conclude that d2 and d3 are equally relevant to the query.

  • Next, when computing Sim(q,d), some words in d (such as "about") are counted multiple times, so the ranking uses Term Frequency (TF) for the weighting.

  • TF alone has a problem, though: here "about" appears twice and "presidential" once, so "about" gets the higher TF, yet semantically "presidential" carries more importance. Hence a further improvement of the Vector Space Model: adding Inverse Document Frequency (IDF).

  • IDF(w) → IDF weighting: penalize popular terms; on the IDF curve, the x-axis is k, the number of documents containing w (its document frequency).
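The TF-IDF idea above can be sketched as follows. This is a minimal illustration, not the exact scoring function from the lecture; the toy corpus and the function names `idf` and `tf_idf_score` are made up for this example, and it assumes the common IDF form log((M+1)/k), where M is the total number of documents and k is the document frequency:

```python
import math

def idf(word, docs):
    """IDF(w) = log((M + 1) / k): M = total number of documents,
    k = number of documents containing w (its document frequency).
    Popular terms (large k) get a small weight."""
    M = len(docs)
    k = sum(1 for d in docs if word in d)
    return math.log((M + 1) / k) if k else 0.0

def tf_idf_score(query, doc, docs):
    """Sim(q, d) = sum over query words w of c(w, d) * IDF(w)."""
    return sum(doc.count(w) * idf(w, docs) for w in query)

# Toy corpus: "about" occurs in every document, "presidential" in one.
docs = [
    "news about stocks".split(),
    "news about weather".split(),
    "news about sports".split(),
    "news about presidential campaign".split(),
]
d = "news about about presidential campaign".split()
query = "news about presidential campaign".split()

# c("about", d) = 2, but IDF("about") is tiny, so a single occurrence
# of "presidential" contributes more to Sim(q, d) than two of "about".
```

This is exactly the fix the bullet describes: the popular word "about" is down-weighted even when its raw count is higher.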

2.2 The upper bound of C(w, d)

  • With IDF(w) addressing the popular-term problem, the next question is: "How effective is VSM with TF-IDF weighting?" (There are still problems...)

  • The problem: if c(w, d) keeps increasing, TF(w,d) grows without bound, so a single heavily repeated term can dominate the score.

  • The BM25 Transformation solves this by giving TF(w,d) an upper bound: y = (k+1)*x / (x+k), which never exceeds k+1.
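The bound is easy to verify numerically. A minimal sketch, assuming the typical default k = 1.2 (the parameter value is an assumption, not from the notes):

```python
def bm25_tf(x, k=1.2):
    """BM25 TF transformation y = (k + 1) * x / (x + k):
    increasing in the raw count x, but never exceeding k + 1.
    k = 1.2 is a commonly used default (an assumption here)."""
    return (k + 1) * x / (x + k)

# bm25_tf(1) is exactly 1 for any k, and even a count of 100
# stays below the ceiling of k + 1 = 2.2.
```

A nice property of this curve: the first occurrence of a word always contributes 1, and each further occurrence contributes less and less (diminishing returns), instead of growing linearly as raw TF does.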

2.3 Document Length Normalization

  • Even after TF, IDF, and the BM25 Transformation, a document may score high simply because it is long, so we penalize a long doc with a doc length normalizer; the method here is Pivoted Length Normalization.

  • Pivoted length normalizer: uses the average doc length as the "pivot".
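The pivot idea above can be sketched as follows, assuming the standard normalizer 1 - b + b*|d|/avdl and the common default b = 0.75 (the parameter value and function names are assumptions for illustration):

```python
def pivoted_norm(doc_len, avdl, b=0.75):
    """Pivoted length normalizer: 1 - b + b * |d| / avdl.
    avdl (average document length) is the pivot: the normalizer
    is exactly 1 there, > 1 for longer docs (penalty),
    < 1 for shorter docs (reward). b in [0, 1] sets the strength."""
    return 1 - b + b * doc_len / avdl

def bm25_weight(count, doc_len, avdl, k=1.2, b=0.75):
    """BM25 TF with length normalization folded in:
    (k + 1) * c / (c + k * normalizer)."""
    norm = pivoted_norm(doc_len, avdl, b)
    return (k + 1) * count / (count + k * norm)

# The same raw count c(w, d) = 3 is worth less in a document twice
# the average length than in one half the average length.
```

Setting b = 0 turns the normalizer off (every document is treated as average length), while b = 1 normalizes fully by length; values in between trade off the two.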
