[085]Feedback on Text Retrieval

2018-06-29(五)

Text Retrieval and Search Engines(5)-Feedback

[085]Feedback on Text Retrieval

  • 5.1 Feedback in Text Retrieval

  • 5.2 Feedback in Vector Space Model

  • 5.3 Feedback with Language Models

  • 5.4 Web Search

  • 5.5 Web Indexing

  • 5.6 Web Search:Link Analysis

5.1 Feedback in Text Retrieval

  • feedback有分為三種:

    • 1.Relevance Feedback:Users make explicit relevance judgments on the initial results // user不用費太多力

    • 2.Pseudo/Blind/Automatic Feedback:Top-k initial results are simply assumed to be relevant //前面K名是相關

    • 3.Implicit Feedback:User-clicked docs are assumed to be relevant; skipped ones non-relevant //user點擊

  • Pseudo feedback 不會涉及到human

5.2 Feedback in Vector Space Model

  • 在TR問題中,用到Vector Space Model, 而在feedback的計算時,也會用到VSM, 這時會用"Rocchio Feedback: Formula"

5.3 Feedback with Language Models

  • 因為Query likelihood method 沒有自然地支援feedback的相關性,所以需要相關解法,而Kullback-Leibler (KL) divergence retrieval model 正是所選

  • Query Likelihood v.s KL-divergence

    • 由 c(w,q) 轉為 p(w|θ)

  • 混合式的model: Maximum Likelihood θ=argmax logp(F|θ)

  • Scalability(規模性):能用Parallel indexing & searching (MapReduce)

  • 機會:為了增加search accuracy, 用上Link analysis & multi-feature ranking

Basic Search Engine Technologies

  • Web 會經過 Crawler 把data存在 Cached pages

  • 接著,Cached pages 會經過 indexer 把data converted to "(Inverted) Index"

  • Crawler: 會先在一個queue做一個seed pages的集合, 然後從網上fetch pages, 接著分析這些pages, 然後附上hyperlink, 主要策略是"Breadth-First is common"

5.5 Web Indexing

  • Google在indexing 部分有三個貢獻:

    • Google File System (GFS): distributed file system

    • MapReduce: Software framework for parallel computation

    • Hadoop: Open source implementation of MapReduce

  • 值得一提是word counting:Reduce Function的機制,接著利用該機制,做"Inverted Indexing with MapReduce"

  • 做完indexing後,要對inverted index做ranking, 標準的TR model是不夠的,需要做更多的extensions

  • Hub:以我為中心點,指向很多地方 // Pages that cite many other pages are good hubs

  • Authority: 很多人指向我 // Pages that are widely cited are good authorities

PageRank: Capturing Page “Popularity”:

  • 類似“citation counting”的概念,但要考慮到“indirect citations”與“ Smoothing of citations”問題

  • Algorithm: 用一矩陣呈現,p(di): PageRank score of di = average probability of visiting page di

  • The key idea of HITS: Good authorities are cited by good hubs

  • Link Information: 有可能用到兩種的混合PageRank&HITS

Last updated

Was this helpful?