[086]Web search

2018-06-30 (Sat)

  • 6.1 Learning to Rank

  • 6.2 Future of web search

  • 6.3 Recommender Systems: Content-Based Filtering

  • 6.4 Recommender Systems: Collaborative Filtering

  • 6.5 Course Summary

6.1 Learning to Rank

  • Given a query-document pair (Q, D), define various kinds of features Xi(Q, D)

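Examples of features that can be combined:

  • 1. The number of overlapping terms between Q and D

  • 2. The BM25 score

  • 3. p(Q|D), the query likelihood

  • 4. PageRank of D

  • 5. BM25Anchor (BM25 computed on anchor text)

  • Assume p(R=1|Q,D) = s(X1(Q,D), …, Xn(Q,D), λ), and learn λ by fitting the function s to training data

  • The training data are 3-tuples such as (D, Q, 1), meaning that document D is relevant to query Q

  • More advanced algorithms can be applied to ranking problems beyond search, such as recommender systems and computational advertising

  • In summary, machine learning has been used in text retrieval for decades (e.g., Rocchio feedback); more recently it has been applied to large-scale training data and to combining many features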
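
The idea of learning λ from (D, Q, label) tuples can be sketched in a few lines. This is only an illustration under assumptions that are not in the notes: s is taken to be a logistic function of a weighted feature sum, each document has two made-up features (a BM25 score and a PageRank value), and λ is fitted by plain stochastic gradient ascent. The names `fit`, `rank`, `TRAINING` and `CANDIDATES` are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def relevance_probability(features, lam):
    """s(X1..Xn, lambda): logistic function of a weighted feature sum; lam[0] is the bias."""
    z = lam[0] + sum(w * x for w, x in zip(lam[1:], features))
    return sigmoid(z)

def fit(training, n_features, lr=0.1, epochs=2000):
    """Learn lambda by stochastic gradient ascent on the log-likelihood."""
    lam = [0.0] * (n_features + 1)
    for _ in range(epochs):
        for features, label in training:
            error = label - relevance_probability(features, lam)
            lam[0] += lr * error
            for i, x in enumerate(features):
                lam[i + 1] += lr * error * x
    return lam

# Toy training data (made up): ([BM25 score, PageRank], relevance label)
TRAINING = [
    ([2.5, 0.8], 1),
    ([2.0, 0.6], 1),
    ([1.8, 0.9], 1),
    ([0.5, 0.2], 0),
    ([0.3, 0.7], 0),
    ([0.9, 0.1], 0),
]

LAMBDA = fit(TRAINING, n_features=2)

def rank(candidates, lam):
    """Sort (doc_id, features) pairs by predicted p(R=1|Q,D), highest first."""
    return sorted(candidates,
                  key=lambda c: relevance_probability(c[1], lam),
                  reverse=True)

# Hypothetical candidate documents for one query
CANDIDATES = [("d1", [0.4, 0.5]), ("d2", [2.2, 0.7]), ("d3", [1.2, 0.3])]
RANKED = [doc_id for doc_id, _ in rank(CANDIDATES, LAMBDA)]
print(RANKED)
```

In a real system the feature list would be much longer and the labels would come from judgments or click logs; the point here is only that ranking reduces to sorting by the fitted probability.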

6.2 Future of web search

  • More specialized/customized search (vertical search engines)

  • Search for special user groups (e.g., Ei Compendex)

  • Personalized search (YouTube, Netflix)

  • Going beyond search to support tasks (e.g., shopping)

The data-user-service (DUS) Triangle

  • Data: web pages, news articles, blog articles, literature, email

  • Services: search, browsing, mining, task support

  • Users: lawyers, scientists, online shoppers

Future Intelligent Information Systems

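  • Starting from search, the service dimension moves on to information access and mining, and then to task support

  • The two other corners next to search are keyword queries and bag of words. The former extends to search history and then to a user model; the latter extends to entities and relations and then to knowledge representation (large-scale semantic analysis)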

6.3 Recommender Systems: Content-Based Filtering

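  • Of the push and pull modes, recommender systems are the typical example of push: the system takes the initiative, which works when the user has a stable information need or the system has a rich understanding of the user

  • A recommender is essentially a filtering system, and the basic filtering question is: will user U like item X?

  • There are two ways to answer this question:

    • Item similarity => content-based filtering

    • User similarity => collaborative filtering

    • The two can be combined.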

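  • A typical content-based filtering system:

    • Linear utility = 3 * #good - 2 * #bad (are these coefficients reasonable?)

    • Compare other coefficient pairs such as (10, -1) and (1, -10): the first encourages the system to deliver more documents, the second makes it very conservative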
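
To see what the coefficients do, the linear utility can be computed for one delivery outcome under each pair. This is a minimal sketch: the counts (6 good, 4 bad) are made up, and `break_even_precision` is a hypothetical helper that solves good_weight * p + bad_weight * (1 - p) = 0 for the precision p at which delivering documents stops losing utility.

```python
def linear_utility(n_good, n_bad, good_weight=3, bad_weight=-2):
    """Utility of the delivered set: reward good documents, penalize bad ones."""
    return good_weight * n_good + bad_weight * n_bad

def break_even_precision(good_weight, bad_weight):
    """Precision at which the expected utility per delivered document is zero."""
    return -bad_weight / (good_weight - bad_weight)

# Hypothetical outcome: 10 documents delivered, 6 good and 4 bad
DEFAULT = linear_utility(6, 4)           # 3*6 - 2*4
LENIENT = linear_utility(6, 4, 10, -1)   # 10*6 - 1*4
STRICT = linear_utility(6, 4, 1, -10)    # 1*6 - 10*4

for name, pair in [("(3, -2)", (3, -2)), ("(10, -1)", (10, -1)), ("(1, -10)", (1, -10))]:
    print(name, "break-even precision:", round(break_even_precision(*pair), 3))
```

With (3, -2) the system gains as long as more than 40% of what it delivers is good; (10, -1) lowers that bar to about 9%, and (1, -10) raises it to about 91%.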

Three Basic Problems in content-based filtering

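  • 1. Decision: make a yes/no decision for each incoming document

  • 2. Initialization: for example, Netflix asks a new user to pick three movies at the start

  • 3. Learning: learn from the yes/no judgments and from the documents seen so far

  • So we extend a retrieval system to do information filtering: reuse retrieval techniques to score documents, and use a new approach to set the threshold

  • We therefore take a general vector-space approach as the starting point: each document goes through scoring and then thresholding, and the resulting decisions are evaluated by utility

  • Setting the threshold is difficult, however. For example, many documents are not available for judgment. One solution is empirical utility optimization: compute the utility on the training data for each candidate score threshold

  • The concrete method is called beta-gamma threshold learning, which balances exploitation and exploration. The more the system explores, the closer the threshold moves to the zero-utility point, so more of the delivered documents turn out to be non-relevant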
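
The two threshold ideas above can be sketched together. The assumptions here go beyond the notes: the threshold is interpolated as theta = alpha * theta_zero + (1 - alpha) * theta_optimal with alpha = beta + (1 - beta) * exp(-N * gamma), which is the form commonly given for beta-gamma threshold learning, where N is the number of training examples. The judged scores, the default beta and gamma values, and the way the zero-utility cutoff is located are all illustrative choices.

```python
import math

def utility_curve(judged, good_weight=3, bad_weight=-2):
    """Cumulative utility at each cutoff, delivering documents in descending score order.

    judged: list of (score, is_relevant) pairs.
    Returns (sorted_docs, utilities); utilities[k] is the utility of the top k+1 documents.
    """
    docs = sorted(judged, key=lambda d: d[0], reverse=True)
    utilities, total = [], 0
    for _, relevant in docs:
        total += good_weight if relevant else bad_weight
        utilities.append(total)
    return docs, utilities

def optimal_and_zero_thresholds(judged):
    """Empirical utility optimization: pick the score cutoff with the highest utility.

    Also returns the lowest cutoff past the optimum whose utility is still
    non-negative (an assumed way of locating the zero-utility point).
    """
    docs, utilities = utility_curve(judged)
    k_opt = max(range(len(docs)), key=lambda k: utilities[k])
    k_zero = k_opt
    while k_zero + 1 < len(docs) and utilities[k_zero + 1] >= 0:
        k_zero += 1
    return docs[k_opt][0], docs[k_zero][0]

def beta_gamma_threshold(theta_opt, theta_zero, n_examples, beta=0.1, gamma=0.05):
    """Interpolate between the zero-utility and the optimal threshold."""
    alpha = beta + (1 - beta) * math.exp(-n_examples * gamma)
    return alpha * theta_zero + (1 - alpha) * theta_opt

# Made-up judged training documents: (score, is_relevant)
JUDGED = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.5, False),
          (0.4, False), (0.3, False), (0.2, False), (0.1, False)]

THETA_OPT, THETA_ZERO = optimal_and_zero_thresholds(JUDGED)
THETA_EARLY = beta_gamma_threshold(THETA_OPT, THETA_ZERO, n_examples=len(JUDGED))
THETA_LATE = beta_gamma_threshold(THETA_OPT, THETA_ZERO, n_examples=1000)
print(THETA_OPT, THETA_ZERO, round(THETA_EARLY, 3), round(THETA_LATE, 3))
```

With few training examples the threshold sits closer to the zero-utility point (more exploration); as N grows, alpha shrinks toward beta and the threshold approaches the empirically optimal one.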

6.4 Recommender Systems: Collaborative Filtering

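  • Definition of collaborative filtering: filter items for an individual user based on the judgments of other users

  • Given a user u, find similar users {u1, u2, …, um}. CF rests on the assumption that enough user preferences are available; otherwise the "cold start" problem arises

  • Start from the collaborative filtering problem itself: arrange users and objects into a rating table, with one cell per (user, object) pair

  • Memory-based approach: the overall idea is to express, as a formula, a user's rating of an object in terms of the ratings that other users gave to that object

  • Cold start: at the beginning there is very little data from other users, so filtering is hard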
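
A memory-based prediction over a small rating table might look like the sketch below. The specific choices are assumptions, not taken from the notes: ratings are centered on each user's own mean, user similarity is the Pearson correlation over co-rated objects, and the prediction is the user's mean plus a similarity-weighted average of the other users' centered ratings. The table `RATINGS` and its user and object names are made up.

```python
import math

# Hypothetical rating table: user -> {object: rating}; missing keys are unknown ratings
RATINGS = {
    "alice": {"o1": 5, "o2": 4, "o3": 1},
    "bob": {"o1": 4, "o2": 5, "o3": 2, "o4": 5},
    "carol": {"o1": 1, "o2": 2, "o3": 5, "o4": 1},
    "dave": {},  # new user with no ratings yet (cold start)
}

def mean_rating(user):
    values = RATINGS[user].values()
    return sum(values) / len(values)

def pearson(u, v):
    """Correlation over co-rated objects, centered on each user's own mean rating."""
    common = set(RATINGS[u]) & set(RATINGS[v])
    if not common:
        return 0.0
    mu, mv = mean_rating(u), mean_rating(v)
    num = sum((RATINGS[u][o] - mu) * (RATINGS[v][o] - mv) for o in common)
    den_u = math.sqrt(sum((RATINGS[u][o] - mu) ** 2 for o in common))
    den_v = math.sqrt(sum((RATINGS[v][o] - mv) ** 2 for o in common))
    return num / (den_u * den_v) if den_u and den_v else 0.0

def predict(user, obj):
    """User's mean plus the weighted average of other users' centered ratings of obj."""
    if not RATINGS[user]:
        return None  # cold start: nothing to compare this user against
    base = mean_rating(user)
    weighted, norm = 0.0, 0.0
    for other in RATINGS:
        if other == user or obj not in RATINGS[other]:
            continue
        w = pearson(user, other)
        weighted += w * (RATINGS[other][obj] - mean_rating(other))
        norm += abs(w)
    return base if norm == 0 else base + weighted / norm

print(predict("alice", "o4"), predict("dave", "o1"))
```

Here alice agrees with bob and disagrees with carol, so bob liking o4 and carol disliking it both push alice's predicted rating above her own average. For dave the function has nothing to work with, which is the cold start problem in miniature.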

6.5 Course Summary

  • 1. NLP is the foundation for text retrieval (TR), but current NLP isn't robust enough; bag of words (BOW) is sufficient for most search tasks.

  • 2. Push vs. pull; querying vs. browsing

  • 3. TR -> ranking problem

  • 4. Many retrieval methods: VSM (vector space model), LM (language model approach), TF-IDF (term frequency-inverse document frequency), Length Norm (document length normalization)

  • 6. Implementation: inverted index + fast search

  • 7. Evaluation: the Cranfield collection, MAP (Mean Average Precision), nDCG (Normalized Discounted Cumulative Gain), precision and recall

  • 9. Feedback: Rocchio in VSM, and the mixture model in the language model approach

  • 10. Web search: MapReduce for parallel indexing, the PageRank algorithm, Hypertext-Induced Topic Search (HITS), learning to rank, future of web search

  • 11. Recommendation: content-based + collaborative filtering
