[086]Web search

2018-06-30 (Sat)

Text Retrieval and Search Engines(6)-Web search


6.1 Learning to Rank
6.2 Future of web search
6.3 Recommender Systems: Content-based Filtering
6.4 Recommender Systems: Collaborative Filtering
6.5 Course Summary

6.1 Learning to Rank

Given a query-document pair (Q, D), define various kinds of features Xi(Q,D).

Examples of features that can be combined:
1. Number of overlapping terms
2. BM25 score
3. p(Q|D)
4. PageRank of D
5. BM25Anchor (BM25 score on anchor text)

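Hypothesize p(R=1|Q,D) = s(X1(Q,D), …, Xn(Q,D), λ), where λ is a set of parameters. Learn λ by fitting the function s to training data,
e.g., 3-tuples such as (D, Q, 1), meaning that document D is relevant to query Q.
More advanced learning algorithms can be applied to ranking problems beyond search, such as recommender systems and computational advertising.
In summary, machine learning has been used in text retrieval for decades (e.g., Rocchio feedback); more recently it has been driven by large-scale training data and the need to combine many features.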
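
As a rough illustration of learning λ from (D, Q, relevance) tuples, the sketch below takes s to be a logistic function of a weighted feature sum and fits the weights by plain gradient ascent. The choice of logistic regression, the three feature columns (BM25, PageRank, BM25Anchor), and all feature values are assumptions made up for the example, not data from the course.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(features, weights):
    """p(R=1|Q,D) under a logistic model; weights[0] is the bias term."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return sigmoid(z)

def fit(examples, steps=2000, learning_rate=0.1):
    """Fit the weights (the lambda in the notes) by gradient ascent
    on the log-likelihood of the training tuples."""
    n_features = len(examples[0][0])
    weights = [0.0] * (n_features + 1)
    for _ in range(steps):
        grad = [0.0] * (n_features + 1)
        for features, label in examples:
            error = label - score(features, weights)
            grad[0] += error
            for i, x in enumerate(features):
                grad[i + 1] += error * x
        weights = [w + learning_rate * g / len(examples)
                   for w, g in zip(weights, grad)]
    return weights

# Hypothetical training data: ([BM25, PageRank, BM25Anchor], relevance),
# with every feature already scaled to [0, 1].
training = [
    ([0.9, 0.6, 0.8], 1), ([0.7, 0.2, 0.5], 1), ([0.8, 0.9, 0.3], 1),
    ([0.2, 0.1, 0.1], 0), ([0.3, 0.7, 0.0], 0), ([0.1, 0.3, 0.2], 0),
]
weights = fit(training)

# Rank two unseen documents for a query by their predicted relevance.
strong_match = score([0.85, 0.5, 0.6], weights)
weak_match = score([0.15, 0.5, 0.1], weights)
```

Documents are then ranked by the predicted probability, so here `strong_match` should come out ahead of `weak_match`.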

6.2 Future of web search

More specialized/customized (vertical search engine)
Special groups of users (e.g., Ei Compendex)
Personalized (e.g., YouTube, Netflix)
Beyond search, to support tasks (e.g., shopping)

The data-user-service (DUS) Triangle

Data: web pages, news articles, blog articles, literature, email
Services: search, browsing, mining, task support
Users: lawyers, scientists, online shoppers

Future Intelligent Information Systems

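Starting from search, the system moves on to information access and mining, and then to task support.
The two points adjacent to search are keyword queries and bag of words. The former extends to search history and then to a full user model; the latter extends to entities and relations and then to knowledge representation (large-scale semantic analysis).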

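6.3 Recommender Systems: Content-based Filtering

Of the push and pull modes, push is exemplified by recommender systems: the system takes the initiative, which works when the user has a stable information need or the system has a rich understanding of the user.
A recommender is essentially a filtering system, and the basic filtering question is: will user U like item X?
There are two ways to approach this question:
- Item similarity => content-based filtering
- User similarity => collaborative filtering
- The two can be combined.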
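A typical content-based filtering system is evaluated with a linear utility:
- Linear utility = 3 × #good - 2 × #bad (are these coefficients reasonable?)
- What about (10, -1) or (1, -10)? The first rewards delivering almost everything; the second makes the system very conservative.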
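
To see how the coefficients change what counts as a good filter, here is a small sketch that scores two hypothetical filters under each of the three coefficient pairs above. The filter names and their good/bad counts are invented for illustration.

```python
def linear_utility(n_good, n_bad, good_weight=3, bad_weight=-2):
    """Linear utility: reward per good delivery plus penalty per bad one."""
    return good_weight * n_good + bad_weight * n_bad

# Hypothetical outcomes: (good documents delivered, bad documents delivered).
cautious = (10, 2)     # delivers little, mostly relevant
aggressive = (30, 40)  # delivers a lot, much of it non-relevant

results = {}
for good_weight, bad_weight in [(3, -2), (10, -1), (1, -10)]:
    results[(good_weight, bad_weight)] = (
        linear_utility(*cautious, good_weight, bad_weight),
        linear_utility(*aggressive, good_weight, bad_weight),
    )

# results[(3, -2)]  == (26, 10)    -> cautious filter wins
# results[(10, -1)] == (98, 260)   -> aggressive filter wins
# results[(1, -10)] == (-10, -370) -> cautious filter loses less
```

The coefficients therefore encode how much the application tolerates non-relevant deliveries, which is why they have to be chosen per application.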

Three Basic Problems in content-based filtering

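1. Filtering decision: make a yes/no decision for each document (text).
2. Initialization: for example, Netflix asking a new user to pick three movies at the start.
3. Learning: learn from the yes/no judgments and from the documents seen so far.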
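We can therefore extend a retrieval system to do information filtering: reuse retrieval techniques to score documents, and use new approaches to set the threshold.
Taking a general vector-space approach as the starting point, each document is scored and then thresholded, and the resulting delivery decisions are evaluated by utility.
Setting the threshold is hard, however, because many documents are not available for judgment. One solution is empirical utility optimization: compute the utility on the training data for each candidate score threshold and choose the threshold with the highest utility.
A concrete method is beta-gamma threshold learning, which balances exploitation and exploration: the more the system explores, the further the threshold drops toward the point where utility is zero, so more non-relevant documents get delivered.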
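
The sketch below illustrates both steps on made-up training data. It first computes the utility at every candidate cutoff, then interpolates between the empirically optimal threshold and the zero-utility threshold. The interpolation formula (alpha = beta + (1 - beta) * exp(-N * gamma)) is the form usually given for beta-gamma threshold learning, reproduced here from memory rather than from these notes, so treat it and the default beta/gamma values as assumptions.

```python
import math

def cumulative_utility(judged, good=3.0, bad=-2.0):
    """Return [(score, utility)], where utility is the linear utility of
    delivering every training document scoring at or above `score`.
    Assumes all scores are distinct."""
    ranked = sorted(judged, key=lambda pair: pair[0], reverse=True)
    curve, utility = [], 0.0
    for score, relevant in ranked:
        utility += good if relevant else bad
        curve.append((score, utility))
    return curve

def optimal_and_zero_thresholds(curve):
    """theta_opt maximizes training utility; theta_zero is the lowest
    cutoff below it at which utility is still non-negative."""
    best = max(range(len(curve)), key=lambda i: curve[i][1])
    theta_opt = curve[best][0]
    theta_zero = theta_opt
    for score, utility in curve[best + 1:]:
        if utility < 0:
            break
        theta_zero = score
    return theta_opt, theta_zero

def beta_gamma_threshold(theta_opt, theta_zero, n_examples,
                         beta=0.1, gamma=0.05):
    """Few examples: alpha near 1, threshold near theta_zero (explore).
    Many examples: alpha near beta, threshold near theta_opt (exploit)."""
    alpha = beta + (1.0 - beta) * math.exp(-n_examples * gamma)
    return alpha * theta_zero + (1.0 - alpha) * theta_opt

# Hypothetical judged documents: (score, is_relevant).
judged = [(0.9, True), (0.8, True), (0.7, False), (0.6, True),
          (0.5, False), (0.4, False), (0.3, False), (0.2, False)]
curve = cumulative_utility(judged)
theta_opt, theta_zero = optimal_and_zero_thresholds(curve)
theta_few = beta_gamma_threshold(theta_opt, theta_zero, n_examples=len(judged))
theta_many = beta_gamma_threshold(theta_opt, theta_zero, n_examples=1000)
```

With only eight judged documents the threshold sits well below `theta_opt`, so the filter keeps exploring; with many examples it moves close to `theta_opt`, and beta keeps it from ever reaching it exactly.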

6.4 Recommender Systems: Collaborative Filtering

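Collaborative filtering (CF): make filtering decisions for an individual user based on the judgments of other users.
Given a user u, find similar users {u1, u2, …, um} and predict u's preferences from theirs. CF rests on the assumption that enough user preferences are available; otherwise there is a "cold start" problem.
The collaborative filtering problem can be set up as a ratings table indexed by users and objects, where the task is to predict the missing entries.
Memory-based approach: the general idea is a formula that predicts a user's rating of an object from other users' ratings of that object, weighted by how similar those users are.
Cold start: at the beginning there is very little data from other users, so filtering is hard.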
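
A minimal sketch of the memory-based idea, under some assumptions of mine: ratings are normalized by each user's mean, user similarity is Pearson correlation over co-rated objects, and the weighted sum is normalized by the sum of absolute weights. The users, objects, and ratings are invented for the example.

```python
import math

def mean(values):
    return sum(values) / len(values)

def similarity(a, b):
    """Pearson correlation over the objects both users have rated."""
    common = [obj for obj in a if obj in b]
    if len(common) < 2:
        return 0.0
    mean_a = mean([a[obj] for obj in common])
    mean_b = mean([b[obj] for obj in common])
    num = sum((a[obj] - mean_a) * (b[obj] - mean_b) for obj in common)
    den = (math.sqrt(sum((a[obj] - mean_a) ** 2 for obj in common))
           * math.sqrt(sum((b[obj] - mean_b) ** 2 for obj in common)))
    return num / den if den else 0.0

def predict(ratings, user, obj):
    """Predict `user`'s rating of `obj`; None signals a cold start."""
    own = ratings.get(user)
    if not own:
        return None  # no preferences yet, nothing to compare against
    base = mean(own.values())
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or obj not in theirs:
            continue
        weight = similarity(own, theirs)
        # Use the other user's deviation from their own average rating.
        num += weight * (theirs[obj] - mean(theirs.values()))
        den += abs(weight)
    return base if den == 0 else base + num / den

# Hypothetical ratings table: user -> {object: rating on a 1-5 scale}.
ratings = {
    "alice": {"m1": 5, "m2": 4, "m3": 1},
    "bob":   {"m1": 5, "m2": 5, "m3": 1, "m4": 5},
    "carol": {"m1": 1, "m2": 2, "m3": 5, "m4": 1},
    "dave":  {},
}
alice_m4 = predict(ratings, "alice", "m4")
dave_m4 = predict(ratings, "dave", "m4")
```

Alice agrees with Bob and disagrees with Carol, so both neighbours push her predicted rating of `m4` above her own average. Dave has no ratings at all, which is the cold-start case, and the sketch simply returns `None` for him.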

6.5 Course Summary

1. NLP is the foundation of text retrieval (TR), but current NLP isn't robust enough; bag of words (BOW) is sufficient for most search tasks.
2. Push vs. pull; querying vs. browsing
3. TR as a ranking problem
4. Many ranking methods and heuristics: VSM (vector space model), LM (language model approach), TF-IDF (term frequency-inverse document frequency), document length normalization
6. Implementation: inverted index + fast search
7. Evaluation: the Cranfield evaluation methodology, MAP (Mean Average Precision), nDCG (Normalized Discounted Cumulative Gain), precision and recall
9. Feedback: Rocchio in VSM, and the mixture model in the language model approach
10. Web search: MapReduce for parallel indexing, the PageRank algorithm, Hypertext-Induced Topic Search (HITS), learning to rank, future of web search
11. Recommendation: content-based + collaborative filtering
