Daily Discussion

[085]Feedback on Text Retrieval

2018-06-29(五)

Text Retrieval and Search Engines(5)-Feedback

[085]Feedback on Text Retrieval

5.1 Feedback in Text Retrieval
5.2 Feedback in Vector Space Model
5.3 Feedback with Language Models
5.4 Web Search
5.5 Web Indexing
5.6 Web Search:Link Analysis

5.1 Feedback in Text Retrieval

feedback有分為三種：
- 1.Relevance Feedback:Users make explicit relevance judgments on the initial results // user不用費太多力
- 2.Pseudo/Blind/Automatic Feedback:Top-k initial results are simply assumed to be relevant //前面K名是相關
- 3.Implicit Feedback:User-clicked docs are assumed to be relevant; skipped ones non-relevant //user點擊
Pseudo feedback 不會涉及到human

5.2 Feedback in Vector Space Model

在TR問題中，用到Vector Space Model, 而在feedback的計算時，也會用到VSM, 這時會用"Rocchio Feedback: Formula"

5.3 Feedback with Language Models

因為Query likelihood method 沒有自然地支援feedback的相關性，所以需要相關解法，而Kullback-Leibler (KL) divergence retrieval model 正是所選
Query Likelihood v.s KL-divergence
- 由 c(w,q) 轉為 p(w|θ)
混合式的model: Maximum Likelihood θ＝argmax logp(F|θ)

5.4 Web Search

Scalability(規模性)：能用Parallel indexing & searching (MapReduce)
機會：為了增加search accuracy, 用上Link analysis & multi-feature ranking

Basic Search Engine Technologies

Web 會經過 Crawler 把data存在 Cached pages
接著，Cached pages 會經過 indexer 把data converted to "(Inverted) Index"
Crawler: 會先在一個queue做一個seed pages的集合, 然後從網上fetch pages, 接著分析這些pages, 然後附上hyperlink, 主要策略是"Breadth-First is common"

5.5 Web Indexing

Google在indexing 部分有三個貢獻：
- Google File System (GFS): distributed file system
- MapReduce: Software framework for parallel computation
- Hadoop: Open source implementation of MapReduce
值得一提是word counting:Reduce Function的機制，接著利用該機制，做"Inverted Indexing with MapReduce"

5.6 Web Search:Link Analysis

做完indexing後，要對inverted index做ranking, 標準的TR model是不夠的，需要做更多的extensions

Exploiting Inter-Document Links

Hub:以我為中心點，指向很多地方 // Pages that cite many other pages are good hubs
Authority: 很多人指向我 // Pages that are widely cited are good authorities

PageRank: Capturing Page “Popularity”:

類似“citation counting”的概念，但要考慮到“indirect citations”與“ Smoothing of citations”問題
Algorithm: 用一矩陣呈現，p(di): PageRank score of di = average probability of visiting page di

Hypertext-Induced Topic Search

The key idea of HITS: Good authorities are cited by good hubs
可以參考Journal Citation report
Link Information: 有可能用到兩種的混合PageRank＆HITS

Previous[084]Probabilistic Model Next[086]Web search

Last updated 6 years ago

Was this helpful?