[086]Web search
2018-06-30 (Sat)
Text Retrieval and Search Engines (6) - Web search
6.1 Learning to Rank
6.2 Future of web search
6.3 Recommender Systems: Content-Based Filtering
6.4 Recommender Systems: Collaborative Filtering
6.5 Course Summary
6.1 Learning to Rank
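Given a query-document pair (Q, D), define various kinds of features Xi(Q,D).
Example features to combine:
1. The number of terms that Q and D have in common
2. The BM25 score
3. p(Q|D)
4. PageRank of D
5. BM25Anchor (BM25 computed on anchor text)
Assume p(R=1|Q,D) = s(X1(Q,D), …, Xn(Q,D), λ). Learn λ by fitting the function s to training data,
for example 3-tuples like (D, Q, 1), meaning that document D is relevant to query Q.
More advanced algorithms can be applied to ranking problems beyond search, such as recommender systems and computational advertising.
In summary, machine learning has been used in text retrieval for decades (e.g. Rocchio feedback). More recently it has been applied to large-scale training data and to combining many features.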
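The fitting step above can be sketched in code. This is a minimal sketch, assuming s is a logistic function of a linear combination of the features (the notes only say "s(…, λ)", so the logistic form is an assumption), and using made-up training tuples with two hypothetical normalized features (a BM25 score and a PageRank score). The function names `score` and `fit` are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(features, lam):
    """p(R=1|Q,D) as a logistic function of the features; lam[0] is the bias."""
    z = lam[0] + sum(w * x for w, x in zip(lam[1:], features))
    return sigmoid(z)

def fit(training, n_features, lr=0.5, epochs=2000):
    """Learn lam by gradient ascent on the log-likelihood of (features, label) pairs."""
    lam = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * len(lam)
        for x, r in training:
            err = r - score(x, lam)          # label minus predicted probability
            grad[0] += err
            for i, xi in enumerate(x):
                grad[i + 1] += err * xi
        lam = [w + lr * g / len(training) for w, g in zip(lam, grad)]
    return lam

# Made-up training data: ([X1 = normalized BM25, X2 = normalized PageRank], R)
training = [
    ([0.9, 0.8], 1),
    ([0.8, 0.3], 1),
    ([0.7, 0.9], 1),
    ([0.3, 0.4], 0),
    ([0.2, 0.9], 0),
    ([0.1, 0.1], 0),
]
lam = fit(training, n_features=2)

# Rank two unseen (hypothetical) documents for a query by predicted relevance.
candidates = {"d1": [0.85, 0.5], "d2": [0.15, 0.7]}
ranking = sorted(candidates, key=lambda d: score(candidates[d], lam), reverse=True)
print(ranking)
```

Treating each (D, Q, label) tuple independently like this is the simplest "pointwise" setup; the more advanced algorithms mentioned above optimize the ordering more directly.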
6.2 Future of web search
More specialized/customized (vertical search engine)
Special groups (e.g. Ei Compendex)
Personalized (YouTube, Netflix)
Beyond search, to support tasks (e.g., shopping)
The data-user-service (DUS) Triangle
Data: web pages, news articles, blog articles, literature, email
Services: search, browsing, mining, task support
Users: lawyers, scientists, online shoppers
Future Intelligent Information Systems
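Starting from search, the system moves on to information access and mining, and then to task support.
The two points adjacent to search are keyword queries and bag of words. The former are stored in the search history and build up a user model; the latter develops into entities-relations and then knowledge representation (large-scale semantic analysis).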
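6.3 Recommender Systems: Content-Based Filtering
Of the push and pull modes, push is the one represented by recommender systems. It fits when the system takes the initiative, when the information need is stable, or when the system has a rich understanding of the user.
A recommender is more like a filtering system, and the basic filtering question is: Will User U Like Item X?
There are two ways to approach this question:
The first is item similarity => content-based filtering
The second is user similarity => collaborative filtering
The two can be combined.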
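A typical content-based filtering system:
Linear Utility = 3 × #good - 2 × #bad (is this a reasonable setting?)
Compare with other coefficient pairs such as (10, -1) or (1, -10): the first rewards delivering many docs, the second makes the system very conservative.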
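To see how the choice of coefficients changes the picture, here is a small sketch that evaluates the linear utility for one hypothetical batch of delivered docs (6 good, 4 bad; the counts are made up for illustration, and `linear_utility` is an illustrative name).

```python
def linear_utility(n_good, n_bad, good_gain=3, bad_cost=2):
    """Linear utility: reward for each good doc delivered, penalty for each bad one."""
    return good_gain * n_good - bad_cost * n_bad

# Hypothetical batch: 10 docs delivered, 6 good and 4 bad.
n_good, n_bad = 6, 4
for gain, cost in [(3, 2), (10, 1), (1, 10)]:
    print((gain, -cost), linear_utility(n_good, n_bad, gain, cost))
```

With (3, -2) the batch is worth 10, with (10, -1) it is worth 56, and with (1, -10) it is worth -34, so the same delivery decisions look good or bad depending on how the application weighs mistakes.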
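Three Basic Problems in content-based filtering
1. Decision: make a yes/no decision for each doc/text
2. Initialization: for example, Netflix asks a new user to pick three movies at the start
3. Learning: learn from the yes/no judgments and from the docs seen so far
So we extend a retrieval system to do information filtering, for example by reusing retrieval techniques to score docs and by using new approaches to set the threshold.
We therefore take "a general vector-space approach" as the starting point: each doc goes through scoring and then thresholding, and the resulting decisions are evaluated by their utility.
Setting the threshold raises some difficulties, though, such as: many documents are not available for judgments. One solution is "empirical utility optimization": compute the utility on the training data for each candidate score threshold,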
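and the concrete method is called "beta-gamma threshold learning" (a balance between exploitation and exploration). The more the system explores, the closer the threshold moves to the point where the utility drops to zero, which lets more non-relevant docs through.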
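The thresholding idea can be sketched as follows. This is a rough sketch under stated assumptions, not the course's reference implementation: it assumes the commonly cited form of beta-gamma threshold learning, where the threshold interpolates between the empirically optimal cutoff (theta_opt) and the zero-utility cutoff (theta_zero) with weight alpha = beta + (1 - beta) * exp(-N * gamma), N being the number of training examples. The notes do not give this formula, and the scores, labels, and parameter values below are made up.

```python
import math

def utility_at(threshold, scored, good_gain=3, bad_cost=2):
    """Utility on training data if every doc scoring >= threshold is delivered."""
    u = 0
    for s, relevant in scored:
        if s >= threshold:
            u += good_gain if relevant else -bad_cost
    return u

def empirical_thresholds(scored, good_gain=3, bad_cost=2):
    """Return (theta_opt, theta_zero), using the observed scores as candidate cutoffs."""
    candidates = sorted({s for s, _ in scored}, reverse=True)
    utilities = [utility_at(t, scored, good_gain, bad_cost) for t in candidates]
    best = max(range(len(candidates)), key=lambda i: utilities[i])
    theta_opt = candidates[best]
    # Walk below the optimum until the utility would turn negative.
    theta_zero = theta_opt
    for t, u in zip(candidates[best:], utilities[best:]):
        if u < 0:
            break
        theta_zero = t
    return theta_opt, theta_zero

def beta_gamma_threshold(theta_opt, theta_zero, n_examples, beta=0.1, gamma=0.05):
    """Interpolate between the two cutoffs; fewer examples means more exploration."""
    alpha = beta + (1 - beta) * math.exp(-n_examples * gamma)
    return alpha * theta_zero + (1 - alpha) * theta_opt

# Made-up training data: (score, judged relevant?)
scored = [(0.9, True), (0.8, True), (0.7, False), (0.6, True),
          (0.5, False), (0.4, False), (0.3, False), (0.2, False)]
theta_opt, theta_zero = empirical_thresholds(scored)
theta = beta_gamma_threshold(theta_opt, theta_zero, n_examples=len(scored))
print(theta_opt, theta_zero, round(theta, 3))
```

Setting the threshold below theta_opt deliberately delivers some docs that are expected to hurt utility, in exchange for judgments on docs that would otherwise never be seen. As N grows, alpha shrinks toward beta and the threshold drifts back toward theta_opt.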
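6.4 Recommender Systems: Collaborative Filtering
Definition of collaborative filtering (CF): filter docs for an individual user based on the judgments of other users.
Given a user u, find similar users {u1, u2, …, um}. CF rests on one assumption: there must be enough user preferences available. Otherwise we run into the "cold start" problem.
Start from the collaborative filtering problem itself: arrange users and objects into a rating table, one cell per (user, object) pair, and predict the missing cells.
Memory-based approach: the overall idea is to predict a user's rating of an object with a formula computed directly over the stored ratings that other users gave to the objects.
Cold start: at the beginning there is very little data from other users, so filtering is hard to do.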
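A memory-based prediction over the rating table can be sketched like this. It is a minimal sketch assuming one common formulation (Pearson correlation as the user similarity, and a similarity-weighted average of the other users' mean-centered ratings); the notes do not specify the exact formula, and the users, objects, and ratings below are made up.

```python
import math

def mean_rating(ratings):
    return sum(ratings.values()) / len(ratings)

def pearson(a, b):
    """Pearson correlation over objects both users rated; 0.0 if undefined."""
    common = set(a) & set(b)
    if len(common) < 2:
        return 0.0
    ma = sum(a[o] for o in common) / len(common)
    mb = sum(b[o] for o in common) / len(common)
    num = sum((a[o] - ma) * (b[o] - mb) for o in common)
    den = math.sqrt(sum((a[o] - ma) ** 2 for o in common)
                    * sum((b[o] - mb) ** 2 for o in common))
    return num / den if den else 0.0

def predict(table, user, obj):
    """User's mean rating plus the weighted average deviation of other users on obj.

    Returns None for a user with no ratings at all (cold start).
    """
    if not table[user]:
        return None
    base = mean_rating(table[user])
    num = den = 0.0
    for other, ratings in table.items():
        if other == user or obj not in ratings:
            continue
        w = pearson(table[user], ratings)
        num += w * (ratings[obj] - mean_rating(ratings))
        den += abs(w)
    return base if den == 0 else base + num / den

# Made-up rating table: user -> {object: rating}
table = {
    "u1": {"o1": 5, "o2": 3, "o3": 4},
    "u2": {"o1": 4, "o2": 2, "o3": 3, "o4": 5},
    "u3": {"o1": 1, "o2": 5, "o3": 2, "o4": 1},
    "new": {},
}
print(predict(table, "u1", "o4"))   # above u1's own mean of 4
print(predict(table, "new", "o4"))  # None: nothing to compare against yet
```

Here u2 rates like u1 and liked o4, while u3 rates in the opposite direction and disliked it, so both push the prediction for u1 upward. The raw prediction is not clipped to the rating scale in this sketch. The user "new" illustrates cold start: with no ratings there is no similarity to compute.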
6.5 Course Summary
1. NLP is the foundation for text retrieval (TR), but current NLP isn't robust enough; bag of words (BOW) is sufficient for most search tasks.
2. Push vs. pull; querying vs. browsing
3. TR -> ranking problem
4. Many trivial methods: VSM (vector space model), LM (language model approach), TF-IDF (term frequency-inverse document frequency), Length Norm (document length normalization)
6. Implementation: inverted index + fast search
7. Evaluation: the Cranfield collection, MAP (Mean Average Precision), nDCG (Normalized Discounted Cumulative Gain), precision and recall
9. Feedback: Rocchio in VSM, and the mixture model in the language model approach
10. Web search: MapReduce for parallel indexing, the PageRank algorithm, Hypertext-Induced Topic Search (HITS), learning to rank, future of web search
11. Recommendation: content-based + collaborative filtering