[085]Feedback on Text Retrieval
2018-06-29(五)
Text Retrieval and Search Engines(5)-Feedback
[085]Feedback on Text Retrieval
5.1 Feedback in Text Retrieval
5.2 Feedback in Vector Space Model
5.3 Feedback with Language Models
5.4 Web Search
5.5 Web Indexing
5.6 Web Search:Link Analysis
5.1 Feedback in Text Retrieval
feedback有分為三種:
1.Relevance Feedback:Users make explicit relevance judgments on the initial results // user不用費太多力
2.Pseudo/Blind/Automatic Feedback:Top-k initial results are simply assumed to be relevant //前面K名是相關
3.Implicit Feedback:User-clicked docs are assumed to be relevant; skipped ones non-relevant //user點擊
Pseudo feedback 不會涉及到human
5.2 Feedback in Vector Space Model
在TR問題中,用到Vector Space Model, 而在feedback的計算時,也會用到VSM, 這時會用"Rocchio Feedback: Formula"
5.3 Feedback with Language Models
因為Query likelihood method 沒有自然地支援feedback的相關性,所以需要相關解法,而Kullback-Leibler (KL) divergence retrieval model 正是所選
Query Likelihood v.s KL-divergence
由 c(w,q) 轉為 p(w|θ)
混合式的model: Maximum Likelihood θ=argmax logp(F|θ)
5.4 Web Search
Scalability(規模性):能用Parallel indexing & searching (MapReduce)
機會:為了增加search accuracy, 用上Link analysis & multi-feature ranking
Basic Search Engine Technologies
Web 會經過 Crawler 把data存在 Cached pages
接著,Cached pages 會經過 indexer 把data converted to "(Inverted) Index"
Crawler: 會先在一個queue做一個seed pages的集合, 然後從網上fetch pages, 接著分析這些pages, 然後附上hyperlink, 主要策略是"Breadth-First is common"
5.5 Web Indexing
Google在indexing 部分有三個貢獻:
Google File System (GFS): distributed file system
MapReduce: Software framework for parallel computation
Hadoop: Open source implementation of MapReduce
值得一提是word counting:Reduce Function的機制,接著利用該機制,做"Inverted Indexing with MapReduce"
5.6 Web Search:Link Analysis
做完indexing後,要對inverted index做ranking, 標準的TR model是不夠的,需要做更多的extensions
Exploiting Inter-Document Links
Hub:以我為中心點,指向很多地方 // Pages that cite many other pages are good hubs
Authority: 很多人指向我 // Pages that are widely cited are good authorities
PageRank: Capturing Page “Popularity”:
類似“citation counting”的概念,但要考慮到“indirect citations”與“ Smoothing of citations”問題
Algorithm: 用一矩陣呈現,p(di): PageRank score of di = average probability of visiting page di
Hypertext-Induced Topic Search
The key idea of HITS: Good authorities are cited by good hubs
Link Information: 有可能用到兩種的混合PageRank&HITS
Last updated
Was this helpful?