[081]Natural Language Processing

1.1 Natural Language Content Analysis
1.2 Text Access
1.3 Text Retrieval Problem
1.4 Text Retrieval Methods
1.5 Vector Space Model

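When working with large amounts of text data, the first step is to understand how Natural Language Processing (NLP) works. The author illustrates NLP with the sentence "A dog is chasing a boy on the playground."
First, the sentence is split into individual words and each word is tagged with its part of speech (noun, adjective, verb, and so on). Next comes syntactic analysis (parsing), followed by semantic analysis. On top of these, inference derives further conclusions from the semantic representation, and pragmatic analysis interprets the speech act, that is, the purpose behind saying the sentence.
The author also points out the challenges NLP faces: there is not enough "common sense" knowledge, and natural language is full of ambiguities. Concrete examples: word-level ambiguity ("design" can be a noun or a verb); syntactic ambiguity such as prepositional phrase attachment, as in "A man saw a boy with a telescope"; anaphora resolution, the ambiguity caused by pronouns; and finally presupposition.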
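
To make the word-level ambiguity concrete, here is a minimal sketch of the first step (splitting into words and looking up candidate parts of speech). The lexicon `TOY_LEXICON`, the tag names, and the helper functions are all hypothetical and written only for this illustration; a real tagger uses context to choose one tag, which this sketch does not attempt.

```python
import re

# Hypothetical hand-written lexicon: each word maps to every tag it may take.
TOY_LEXICON = {
    "a": {"DET"},
    "the": {"DET"},
    "dog": {"NOUN"},
    "boy": {"NOUN"},
    "playground": {"NOUN"},
    "is": {"AUX"},
    "chasing": {"VERB"},
    "on": {"PREP"},
    "design": {"NOUN", "VERB"},  # word-level ambiguity: two possible tags
}


def tokenize(sentence):
    """Lowercase the sentence and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def candidate_tags(sentence):
    """Return (token, sorted candidate tags) pairs; unknown words get ['UNK']."""
    return [
        (token, sorted(TOY_LEXICON.get(token, {"UNK"})))
        for token in tokenize(sentence)
    ]


def ambiguous_tokens(sentence):
    """Return the tokens that have more than one candidate tag."""
    return [token for token, tags in candidate_tags(sentence) if len(tags) > 1]


tagged = candidate_tags("A dog is chasing a boy on the playground.")
```

Every word in the example sentence has exactly one candidate tag in this toy lexicon, while a sentence containing "design" would be flagged by `ambiguous_tokens`. Resolving that flag is the part that makes real POS tagging hard.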

POS tagging (part-of-speech tagging) cannot be done with 100% accuracy, and deeper semantic analysis is harder still, so current NLP remains at a relatively "shallow" stage.

– Building on the NLP principles and challenges above, the focus now narrows to text retrieval: the process of going from big text data to a small set of relevant data.

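Text access is introduced through two kinds of systems (push vs. pull). The first is push, as in recommender systems: the system takes the initiative and has a fairly good understanding of the user, for example Amazon's recommendations. The second is pull, as in search engines such as Google: the user takes the initiative, and the information need is more ad hoc.
Pull mode is further divided into querying and browsing. With querying, the user knows the keywords and what information to look for; with browsing, the user has only a rough idea, so it takes more time. In practice, we usually combine both modes to reach what we are looking for.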

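For text retrieval (TR), we first compare it with database retrieval. In database retrieval the data is more structured and has well-defined semantics, and an answer either matches the query or does not, rather than being merely relevant. The TR problem comes down to two mechanisms, document selection and document ranking; defining any "selection" criterion necessarily involves a mathematical model.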
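
The difference between document selection and document ranking can be sketched as follows. The relevance scores, the threshold, and the function names are made up for illustration; how the scores are computed is left to the retrieval model.

```python
def select_documents(scores, threshold):
    """Document selection: a binary decision per document.

    Keeps every document whose score reaches the (assumed) threshold.
    """
    return {doc_id for doc_id, score in scores.items() if score >= threshold}


def rank_documents(scores):
    """Document ranking: order all documents by descending score.

    Ties are broken by document id so the order is deterministic.
    """
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical relevance scores f(q, d) for four documents.
scores = {"d1": 0.9, "d2": 0.2, "d3": 0.55, "d4": 0.5}

selected = select_documents(scores, threshold=0.5)
ranked = rank_documents(scores)
```

Selection forces a hard cutoff, so a poorly chosen threshold returns too many or too few documents. Ranking leaves the cutoff to the user, who simply stops reading further down the list.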

Building a ranking function requires a retrieval model, whose purpose is the "formalization of relevance". Current models include BM25, query likelihood, and PL2, with BM25 probably the most popular.
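
As a rough illustration of what such a ranking function looks like, here is a minimal sketch of one common BM25 variant. The whitespace tokenization, the smoothed IDF form `ln((N - df + 0.5) / (df + 0.5) + 1)`, and the defaults `k1=1.2`, `b=0.75` are assumptions chosen for this example, not something the notes specify. The sketch also assumes a non-empty document collection.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with a BM25 variant.

    Assumes simple lowercase whitespace tokenization and a non-empty `docs`.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "news about presidential campaign",
    "news about organic food campaign",
    "news of presidential campaign presidential candidate",
]
scores = bm25_scores("presidential campaign", docs)
```

The second document matches only "campaign", so it should score below the two documents that also contain "presidential".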

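Similarity-based models: f(q,d) = similarity(q,d)
– Vector space model
The query and each document are mapped into the same vector space. The query is broken into individual terms (the bag-of-words idea) and represented as a vector, and documents are represented the same way, so the dot product of the query vector with the vectors of doc1 and doc2 measures how close each document is to the query.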
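
A minimal sketch of this idea, assuming raw term counts as vector entries and whitespace tokenization (both are choices made for this example; the notes do not fix a weighting scheme). The sample query and documents are made up.

```python
from collections import Counter


def to_vector(text, vocabulary):
    """Bag-of-words vector: one term count per vocabulary entry."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


query = "news about presidential campaign"
doc1 = "news about organic food campaign"
doc2 = "news of presidential campaign presidential candidate"

# One dimension per distinct term seen in the query or the documents.
vocabulary = sorted(set(" ".join([query, doc1, doc2]).lower().split()))

q_vec = to_vector(query, vocabulary)
sim1 = dot(q_vec, to_vector(doc1, vocabulary))  # f(q, doc1)
sim2 = dot(q_vec, to_vector(doc2, vocabulary))  # f(q, doc2)
```

Here doc1 shares three terms with the query (similarity 3), while doc2 shares three distinct terms and repeats "presidential" (similarity 4), so doc2 ranks higher under this simple weighting.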

Last updated 6 years ago
