Dragon Star Program (龙星计划) Course: Information Retrieval
Overview of Text Retrieval: Part 2

ChengXiang Zhai (翟成祥)
Department of Computer Science
Graduate School of Library & Information Science
Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai, [email protected]

Dragon Star Lecture at Beijing University, June 21-30, 2008. © 2008 ChengXiang Zhai.

Outline
•Other retrieval models
•Implementation of a TR system
•Applications of TR techniques

P-norm (Extended Boolean) (Salton et al. 83)
•Motivation: how to rank documents with a Boolean query?
•Intuitions
–Docs satisfying the query constraint should get the highest ranks
–Partial satisfaction of the query constraint can be used to rank the other docs
•Question: how to capture "partial satisfaction"?

P-norm: Basic Ideas
•Normalized term weights for the doc representation (in [0,1])
•Define the similarity between a Boolean query and a doc vector
•(Figure: a document with weights (x, y) for T1 and T2 is a point in the unit square with corners (0,0), (1,0), (0,1), (1,1); for Q = T1 AND T2 the ideal point is (1,1), for Q = T1 OR T2 the point to avoid is (0,0).)

P-norm: Formulas
•For a document with normalized weights (x, y) for the two query terms:
 sim(T1 OR_p T2, d) = ( (x^p + y^p) / 2 )^(1/p)
 sim(T1 AND_p T2, d) = 1 − ( ((1−x)^p + (1−y)^p) / 2 )^(1/p)
•Since the similarity value is normalized to [0,1], these two formulas can be applied recursively.
•As p ranges over 1 ≤ p ≤ +∞, the model moves from the vector-space model (p = 1) to Boolean/fuzzy logic (p = +∞).

P-norm: Summary
•A general (and elegant) similarity function for a Boolean query and a regular document vector
•Connects the Boolean model and the vector space model, with models in between
•Allows different "confidence" in Boolean operators (a different p for each operator)
•A model worth more exploration (how to learn optimal p values from feedback?)
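To make the two-term p-norm formulas above concrete, here is a minimal sketch in Python; the function names and the example weights are illustrative, not taken from any particular system.

```python
# A minimal sketch of the two-term p-norm (extended Boolean) similarity.
# Term weights x, y are assumed to be normalized to [0, 1].

def pnorm_or(x: float, y: float, p: float) -> float:
    """Similarity of a document (x, y) to the query T1 OR T2."""
    return ((x ** p + y ** p) / 2) ** (1 / p)

def pnorm_and(x: float, y: float, p: float) -> float:
    """Similarity of a document (x, y) to the query T1 AND T2."""
    return 1 - (((1 - x) ** p + (1 - y) ** p) / 2) ** (1 / p)

if __name__ == "__main__":
    d = (0.8, 0.3)   # normalized weights of T1 and T2 in the document
    for p in (1, 2, 10, 100):
        print(p, round(pnorm_and(*d, p), 3), round(pnorm_or(*d, p), 3))
    # As p grows, AND approaches min(x, y) and OR approaches max(x, y)
    # (the Boolean/fuzzy-logic limit); at p = 1 both reduce to the average,
    # i.e., vector-space-like behavior.
```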
Probabilistic Retrieval Models

Overview of Retrieval Models
•Relevance as similarity: Relevance ≈ Similarity(Rep(q), Rep(d))
–Different representations & similarity measures: vector space model (Salton et al., 75); prob. distribution model (Wong & Yao, 89); …
•Relevance as probability of relevance: P(R=1|q,d), R ∈ {0,1}
–Regression models (Fox 83)
–Generative models
 •Document generation: classical prob. model (Robertson & Sparck Jones, 76)
 •Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
–Learning to rank (Joachims 02; Burges et al. 05)
•Relevance as probabilistic inference: P(d→q) or P(q→d)
–Prob. concept space model (Wong & Yao, 95)
–Inference network model (Turtle & Croft, 91)

The Basic Question
•What is the probability that THIS document is relevant to THIS query?
•Formally: three random variables — query Q, document D, relevance R ∈ {0,1}
•Given a particular query q and a particular document d, what is p(R=1|Q=q,D=d)?

Probability of Relevance
•Three random variables
–Query Q
–Document D
–Relevance R ∈ {0,1}
•Goal: rank D based on P(R=1|Q,D)
–Evaluate P(R=1|Q,D)
–Actually, we only need to compare P(R=1|Q,D1) with P(R=1|Q,D2), i.e., rank the documents
•Several different ways to refine P(R=1|Q,D)

Refining P(R=1|Q,D), Method 1: Conditional Models
•Basic idea: relevance depends on how well a query matches a document
–Define features on Q × D, e.g., # matched terms, the highest IDF of a matched term, document length, …
–P(R=1|Q,D) = g(f1(Q,D), f2(Q,D), …, fn(Q,D), θ)
–Use training data (known relevance judgments) to estimate the parameters θ
–Apply the model to rank new documents
•Special case: logistic regression

Logistic Regression (Cooper 92, Gey 94)
•Model the log-odds (logit) of relevance as a linear function of six features X1, …, X6:
 logit(P(R=1|Q,D)) = β0 + β1 X1 + … + β6 X6
•Equivalently, P(R=1|Q,D) is the logistic (sigmoid) function of this linear combination, so it always lies in [0, 1].

Features/Attributes
•Average absolute query frequency
•Query length
•Average absolute document frequency
•Document length
•Average inverse document frequency
•Inverse document frequency
•Number of terms in common between query and document (logged)

Logistic Regression: Pros & Cons
•Advantages
–An absolute probability of relevance is available
–Can re-use all past relevance judgments
•Problems
–Performance is very sensitive to the selection of features
–Not much guidance on feature selection
•In practice, performance tends to be average

Refining P(R=1|Q,D), Method 2: Generative Models
•Basic idea
–Define P(Q,D|R)
–Compute the odds O(R=1|Q,D) using Bayes' rule
•Special cases
–Document "generation": P(Q,D|R) = P(D|Q,R)P(Q|R), where P(Q|R) can be ignored for ranking D
–Query "generation": P(Q,D|R) = P(Q|D,R)P(D|R)

Document Generation
•Rank by O(R=1|Q,D) ∝ P(D|Q,R=1) / P(D|Q,R=0): a model of relevant docs for Q vs. a model of non-relevant docs for Q
•Assume independent binary attributes A1…Ak (why?); let D = d1…dk, where dk ∈ {0,1} is the value of attribute Ak (similarly Q = q1…qk)
•Assume non-query terms are equally likely to appear in relevant and non-relevant docs, so only the query terms matter for ranking

Robertson-Sparck Jones Model (Robertson & Sparck Jones 76)
•Two parameters for each term Ai:
–pi = P(Ai=1|Q,R=1): prob. that term Ai occurs in a relevant doc
–qi = P(Ai=1|Q,R=0): prob. that term Ai occurs in a non-relevant doc
•The RSJ score sums, over query terms present in the document, the weights
 wi = log [ pi (1 − qi) / ( qi (1 − pi) ) ]
•How to estimate the parameters? Suppose we have relevance judgments: with R relevant docs of which ri contain term Ai, and ni docs in the collection (of size N) containing Ai,
 pi ≈ (ri + 0.5) / (R + 1),  qi ≈ (ni − ri + 0.5) / (N − R + 1)
 The "+0.5" and "+1" can be justified by Bayesian estimation.

RSJ Model: No Relevance Info (Croft & Harper 79)
•How to estimate the parameters when we do not have relevance judgments?
–Assume pi to be a constant
–Estimate qi by assuming all documents to be non-relevant: qi ≈ ni / N
•With pi = 0.5, the term weight reduces to an IDF-like weight, wi ≈ log [ (N − ni) / ni ]
 (N: # documents in the collection; ni: # documents in which term Ai occurs)
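As a concrete illustration of the RSJ weight and the Croft & Harper fallback above, here is a minimal sketch; the variable names are mine, the "+0.5 / +1" smoothed estimates follow the standard form of the model, and the example numbers are made up.

```python
import math

def rsj_weight(N, n_i, R=None, r_i=None):
    """
    Robertson-Sparck Jones term weight w_i = log[ p_i (1 - q_i) / (q_i (1 - p_i)) ].

    N   : number of documents in the collection
    n_i : number of documents containing term i
    R   : number of known relevant documents (None if no judgments)
    r_i : number of relevant documents containing term i
    """
    if R is not None and r_i is not None:
        # Smoothed estimates ("+0.5" / "+1") when relevance judgments exist
        p_i = (r_i + 0.5) / (R + 1)
        q_i = (n_i - r_i + 0.5) / (N - R + 1)
        return math.log(p_i * (1 - q_i) / (q_i * (1 - p_i)))
    # No judgments (Croft & Harper 79): constant p_i, q_i ~ n_i / N,
    # which reduces to an IDF-like weight (the 0.5's keep it finite).
    return math.log((N - n_i + 0.5) / (n_i + 0.5))

# Example: 1000 docs, term occurs in 50; among 10 judged-relevant docs it occurs in 8.
print(rsj_weight(1000, 50, R=10, r_i=8))
print(rsj_weight(1000, 50))   # fallback, no judgments
```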
RSJ Model: Summary
•The most important classic probabilistic IR model
•Uses only term presence/absence, and is therefore also referred to as the Binary Independence Model
•Essentially Naïve Bayes for doc ranking
•Most natural for relevance/pseudo feedback
•Without relevance judgments, the model parameters must be estimated in an ad hoc way
•Performance isn't as good as a tuned vector-space model

Improving RSJ: Adding TF
•Let D = d1…dk, where dk is now the frequency count of term Ak
•Basic doc generation model: a 2-Poisson mixture model (each term's frequency is drawn from a mixture of an "elite" and a "non-elite" Poisson distribution)
•Many more parameters to estimate! (how many exactly?)

BM25/Okapi Approximation (Robertson et al. 94)
•Idea: approximate p(R=1|Q,D) with a simpler function that shares similar properties
•Observations:
–log O(R=1|Q,D) is a sum of term weights Wi
–Wi = 0 if TFi = 0
–Wi increases monotonically with TFi
–Wi has an asymptotic upper limit
•A simple function with these properties is the Okapi TF transformation:
 Wi ∝ (k1 + 1) TFi / (k1 + TFi)

Adding Doc Length & Query TF
•Incorporating doc length
–Motivation: the 2-Poisson model assumes equal document lengths
–Implementation: "carefully" penalize long docs
•Incorporating query TF
–Motivation: heuristic; appears not to be well justified
–Implementation: a similar TF transformation on the query side
•The final formula is called BM25, which achieved top TREC performance

The BM25 Formula
•Summing over query terms, with TFi the term frequency in the document, QTFi the term frequency in the query, dl the document length, and avdl the average document length:
 score(D,Q) = Σi IDFi × (k1 + 1) TFi / ( k1 ((1 − b) + b · dl/avdl) + TFi ) × (k3 + 1) QTFi / (k3 + QTFi)
•The document-side factor is the "Okapi TF / BM25 TF" transformation; IDFi is the RSJ weight estimated without relevance information.
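The following is a minimal sketch of a BM25 scorer in the form given above, assuming typical default parameters (k1 ≈ 1.2, b ≈ 0.75, a large k3) and using the no-relevance-information RSJ weight as IDF; the function name and data structures are illustrative only.

```python
import math

def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, N, df,
               k1=1.2, b=0.75, k3=1000.0):
    """
    BM25 score of one document for one query.
    query_tf : {term: frequency in the query}
    doc_tf   : {term: frequency in the document}
    df       : {term: document frequency} in a collection of N documents
    """
    score = 0.0
    for t, qtf in query_tf.items():
        tf = doc_tf.get(t, 0)
        if tf == 0:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5))     # RSJ/IDF weight
        K = k1 * (1 - b + b * doc_len / avg_doc_len)          # length-normalized k1
        w_doc = (k1 + 1) * tf / (K + tf)                      # Okapi TF (asymptotic in tf)
        w_query = (k3 + 1) * qtf / (k3 + qtf)                 # query-side TF
        score += idf * w_doc * w_query
    return score
```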
Extensions of "Doc Generation" Models
•Capture term dependence (Rijsbergen & Harper 78)
•Alternative ways to incorporate TF (Croft 83, Kalt 96)
•Feature/term selection for feedback (Okapi's TREC reports)
•Other possibilities (machine learning, …)

Query Generation
•O(R=1|Q,D) ∝ P(Q|D,R=1) P(R=1|D); assuming a uniform document prior P(R=1|D), ranking reduces to the query likelihood p(q|d)
•Now, the question is how to compute p(q|d). This generally involves two steps:
(1) estimate a language model based on the document D
(2) compute the query likelihood according to the estimated model
•This leads to the so-called "language modeling approach" …

What is a Statistical LM?
•A probability distribution over word sequences
–p("Today is Wednesday") ≈ 0.001
–p("Today Wednesday is") ≈ 0.0000000000001
–p("The eigenvalue is positive") ≈ 0.00001
•Context-dependent!
•Can also be regarded as a probabilistic mechanism for "generating" text, thus also called a "generative" model

The Simplest Language Model (Unigram Model)
•Generate a piece of text by generating each word INDEPENDENTLY
•Thus p(w1 w2 … wn) = p(w1)p(w2)…p(wn)
•Parameters: {p(wi)}, with p(w1) + … + p(wN) = 1 (N is the vocabulary size)
•Essentially a multinomial distribution over words
•A piece of text can be regarded as a sample drawn according to this word distribution

Text Generation with Unigram LM
•Example of two topic models p(w|θ):
–Topic 1 (text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001
–Topic 2 (health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …
•Sampling from the first model tends to produce a text-mining paper; sampling from the second, a food-nutrition paper

Estimation of Unigram LM
•Given a "text mining paper" with 100 words in total and counts
 text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1, …
 the estimates are p(text|θ) = 10/100, p(mining|θ) = 5/100, p(association|θ) = 3/100, p(database|θ) = 3/100, …, p(query|θ) = 1/100, …

Language Models for Retrieval (Ponte & Croft 98)
•Estimate a language model for each document (e.g., one for a text-mining paper, one for a food-nutrition paper)
•Given the query q = "data mining algorithms", ask: which document's model would most likely have generated this query?

Ranking Docs by Query Likelihood
•For documents d1, d2, …, dN, estimate doc LMs θd1, θd2, …, θdN and rank by the query likelihoods p(q|θd1), p(q|θd2), …, p(q|θdN)

Retrieval as Language Model Estimation
•Document ranking based on the query likelihood: log p(q|d) = Σi log p(wi|d), where q = w1 w2 … wn and p(wi|d) is the document language model
•The retrieval problem thus becomes the estimation of p(wi|d)
•Smoothing is an important issue, and it distinguishes different approaches

How to Estimate p(w|d)?
•Simplest solution: the maximum likelihood estimator
–p(w|d) = relative frequency of word w in d
–What if a word doesn't appear in the text? Then p(w|d) = 0
•In general, what probability should we give a word that has not been observed?
•If we want to assign non-zero probabilities to such words, we have to discount the probabilities of the observed words
•This is what "smoothing" is about …

Language Model Smoothing (Illustration)
•(Figure: p(w) plotted over words w; the smoothed LM lowers the peaks of the maximum-likelihood estimate and lifts the zero-probability words.)

How to Smooth?
•All smoothing methods try to
–discount the probability of words seen in a document
–re-allocate the extra counts so that unseen words get a non-zero count
•A simple method (additive, "add one"/Laplace smoothing): add a constant to the count of each word:
 p(w|d) = ( c(w,d) + 1 ) / ( |d| + |V| )
 where c(w,d) is the count of w in d, |d| is the document length (total counts), and |V| is the vocabulary size
•Problems?

A General Smoothing Scheme
•All smoothing methods try to
–discount the probability of words seen in a doc
–re-allocate the extra probability so that unseen words get a non-zero probability
•Most methods use a reference model (the collection language model) to discriminate among unseen words:
 p(w|d) = p_seen(w|d) if w is seen in d (a discounted ML estimate)
 p(w|d) = α_d p(w|C) otherwise (α_d is a document-dependent constant; p(w|C) is the collection language model)

Smoothing & TF-IDF Weighting
•Plugging the general smoothing scheme into the query-likelihood retrieval formula, we obtain
 log p(q|d) = Σ_{w ∈ q, c(w,d)>0} c(w,q) log [ p_seen(w|d) / (α_d p(w|C)) ] + |q| log α_d + Σ_{w ∈ q} c(w,q) log p(w|C)
•The last term does not depend on d and can be ignored for ranking; the ratio inside the first sum behaves like TF-IDF weighting (p(w|C) plays the role of IDF), and |q| log α_d acts as document length normalization (a long doc is expected to have a smaller α_d)
•So smoothing with p(w|C) gives TF-IDF weighting + length normalization

Derivation of the Query Likelihood Retrieval Formula
•Start from log p(q|d) = Σ_{w ∈ q} c(w,q) log p(w|d); split the sum into words seen in d (scored with the discounted ML estimate p_seen(w|d)) and unseen words (scored with α_d p(w|C)); the key rewriting step is to add and subtract the α_d p(w|C) contribution for the seen words, which yields the formula above
•Similar rewritings are very common when using LMs for IR …

More Smoothing Methods
•Method 1 (absolute discounting): subtract a constant δ from the count of each seen word:
 p(w|d) = ( max(c(w,d) − δ, 0) + δ |d|_u p(w|REF) ) / |d|
 where |d|_u is the number of unique words in d and δ is the parameter
•Method 2 (linear interpolation, Jelinek-Mercer): "shrink" uniformly toward p(w|REF) with a parameter λ:
 p(w|d) = (1 − λ) c(w,d)/|d| + λ p(w|REF)
 (the first term is the ML estimate)
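A minimal sketch of query-likelihood scoring with Jelinek-Mercer smoothing as just described; the data structures, function name, and the choice λ = 0.5 are assumptions for illustration.

```python
import math
from collections import Counter

def query_likelihood(query_terms, doc_terms, collection_counts, collection_size, lam=0.5):
    """
    log p(q|d) with Jelinek-Mercer smoothing:
        p(w|d) = (1 - lam) * c(w,d)/|d| + lam * p(w|C)
    collection_counts : {word: total count in the collection}
    collection_size   : total number of word tokens in the collection
    """
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for w in query_terms:
        p_ml = doc_tf[w] / doc_len if doc_len else 0.0
        p_c = collection_counts.get(w, 0) / collection_size
        p = (1 - lam) * p_ml + lam * p_c
        if p > 0:
            score += math.log(p)
        else:
            return float("-inf")   # word unseen even in the collection
    return score
```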
More Smoothing Methods (cont.)
•Method 3 (Dirichlet prior / Bayesian): assume pseudo counts μ p(w|REF), with parameter μ:
 p(w|d) = ( c(w,d) + μ p(w|REF) ) / ( |d| + μ )
•Method 4 (Good-Turing): assume the total count of unseen events to be n1 (the number of singletons), and adjust the counts of the seen events in the same way

Dirichlet Prior Smoothing
•ML estimator: θ_ML = argmax_θ p(d|θ)
•Bayesian estimator:
–First consider the posterior: p(θ|d) = p(d|θ)p(θ)/p(d)
–Then take the mean or mode of the posterior distribution
•p(d|θ): the sampling distribution (of the data)
•p(θ) = p(θ1, …, θN): our prior on the model parameters
•"Conjugate" means the prior can be interpreted as "extra"/"pseudo" data
•The Dirichlet distribution is a conjugate prior for the multinomial sampling distribution; its parameters act as "extra"/"pseudo" word counts αi = μ p(wi|REF)

Dirichlet Prior Smoothing (cont.)
•The posterior distribution of the parameters is again Dirichlet, with parameters c(wi,d) + μ p(wi|REF)
•The predictive distribution equals the posterior mean, which gives exactly the Dirichlet prior smoothing formula:
 p(w|d) = ( c(w,d) + μ p(w|REF) ) / ( |d| + μ )

Advantages of Language Models
•Solid statistical foundation
•Parameters can be optimized automatically using statistical estimation methods
•Can easily model many different retrieval tasks
•To be covered in more depth later

What You Should Know
•The global relationship among the different probabilistic models
•How logistic regression works
•How the Robertson-Sparck Jones model works
•The BM25 formula
•All document-generation models have trouble when no relevance judgments are available
•How the language modeling approach (query likelihood scoring) works
•How Dirichlet prior smoothing works
•Three state-of-the-art retrieval models: pivoted length normalization, Okapi/BM25, and query likelihood (with Dirichlet prior smoothing)

Implementation of an IR System

IR System Architecture
•(Diagram: the user interacts through an interface; queries are turned into a query representation and documents into a document representation (indexing); the ranking component (searching) matches the two and returns results; user judgments feed back through query modification (feedback).)

Indexing
•Indexing = converting documents into data structures that enable fast search
•The inverted index is the dominant indexing method (used by all search engines)
•Other indices (e.g., a document index) may be needed for feedback

Inverted Index
•Fast access to all docs containing a given term (along with frequency and position information)
•For each term, we get a list of tuples (docID, freq, pos)
•Given a query, we fetch the lists for all query terms and work only on the documents involved
–Boolean query: set operations
–Natural language query: term weight summing
•More efficient than scanning the docs (why?)

Inverted Index Example
•Doc 1: "This is a sample document with one sample sentence"
•Doc 2: "This is another sample document"
•Dictionary (term, # docs, total freq): This (2, 2); is (2, 2); sample (2, 3); another (1, 1); …
•Postings (doc id, freq): This → (1,1), (2,1); is → (1,1), (2,1); sample → (1,2), (2,1); another → (2,1); …

Data Structures for Inverted Index
•Dictionary: modest size
–Needs fast random access
–Preferably kept in memory
–Hash table, B-tree, trie, …
•Postings: huge
–Sequential access is expected
–Can stay on disk
–May contain docID, term frequency, term positions, etc.
–Compression is desirable
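A minimal in-memory sketch of the dictionary-plus-postings structure described above; a production system would build the index with the sort-based method and compression discussed next, and all names here are illustrative.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """
    docs: {doc_id: list of tokens}
    Returns (dictionary, postings):
      dictionary: {term: (doc_freq, total_freq)}
      postings:   {term: [(doc_id, freq, [positions]), ...]}
    """
    postings = defaultdict(list)
    for doc_id, tokens in docs.items():
        positions = defaultdict(list)
        for pos, t in enumerate(tokens):
            positions[t].append(pos)
        for t, pos_list in positions.items():
            postings[t].append((doc_id, len(pos_list), pos_list))
    dictionary = {t: (len(pl), sum(f for _, f, _ in pl)) for t, pl in postings.items()}
    return dictionary, dict(postings)

docs = {
    1: "this is a sample document with one sample sentence".split(),
    2: "this is another sample document".split(),
}
dictionary, postings = build_inverted_index(docs)
print(dictionary["sample"])   # (2, 3): appears in 2 docs, 3 times in total
print(postings["sample"])     # [(1, 2, [3, 7]), (2, 1, [3])]
```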
Inverted Index Compression
•Observations
–An inverted list is sorted (e.g., by docID or term frequency)
–Small numbers tend to occur more frequently
•Implications
–Store "d-gaps" (differences): d1, d2 − d1, d3 − d2, …
–Exploit the skewed frequency distribution: use fewer bits for small (high-frequency) integers
•Binary code, unary code, γ-code, δ-code

Integer Compression Methods
•In general, the goal is to exploit a skewed distribution
•Binary: equal-length coding
•Unary: x ≥ 1 is coded as x−1 one bits followed by a 0, e.g., 3 => 110, 5 => 11110
•γ-code: x => the unary code for 1 + ⌊log₂ x⌋ followed by the binary code for x − 2^⌊log₂ x⌋ in ⌊log₂ x⌋ bits, e.g., 3 => 101, 5 => 11001
•δ-code: same as the γ-code, but the unary prefix is replaced by its γ-code, e.g., 3 => 1001, 5 => 10101

Constructing an Inverted Index
•The main difficulty is building a huge index with limited memory
•Memory-based methods: not usable for large collections
•Sort-based methods:
–Step 1: collect local (termID, docID, freq) tuples
–Step 2: sort the local tuples (to make "runs")
–Step 3: pair-wise merge the runs
–Step 4: output the inverted file

Sort-based Inversion
•(Diagram: documents doc1 … doc300 are parsed and counted into (termID, docID, freq) tuples using a term lexicon (the 1, cold 2, days 3, a 4, …) and a docID lexicon (doc1 1, doc2 2, doc3 3, …); the tuples are sorted locally by term id to form runs, and the runs are merge-sorted so that all information about a given term ends up together.)

Searching
•Given a query, score documents efficiently
•Boolean query
–Fetch the inverted lists for all query terms
–Perform set operations to get the subset of docs that satisfy the Boolean condition
–E.g., Q1 = "info" AND "security", Q2 = "info" OR "security"
 info: d1, d2, d3, d4
 security: d2, d4, d6
 Results: {d2, d4} (Q1), {d1, d2, d3, d4, d6} (Q2)

Ranking Documents
•Assumption: score(d,q) = f[ g(w(d,q,t1), …, w(d,q,tn)), w(d), w(q) ], where the ti's are the matched terms
•Maintain a score accumulator for each doc to compute the function g
•For each query term ti
–Fetch the inverted list {(d1,f1), …, (dn,fn)}
–For each entry (dj,fj), compute w(dj,q,ti) and update the score accumulator for doc dj
•Adjust the scores to compute f, and sort

Ranking Documents: Example
•Query = "info security"; S(d,q) = g(t1) + … + g(tn) (sum of the frequencies of the matched terms)
•info: (d1, 3), (d2, 4), (d3, 1), (d4, 5); security: (d2, 3), (d4, 1), (d5, 3)
•Accumulators (d1, d2, d3, d4, d5) start at (0, 0, 0, 0, 0):
 info: (d1,3) => (3, 0, 0, 0, 0); (d2,4) => (3, 4, 0, 0, 0); (d3,1) => (3, 4, 1, 0, 0); (d4,5) => (3, 4, 1, 5, 0)
 security: (d2,3) => (3, 7, 1, 5, 0); (d4,1) => (3, 7, 1, 6, 0); (d5,3) => (3, 7, 1, 6, 3)

Further Improving Efficiency
•Keep only the most promising accumulators
•Sort each inverted list in decreasing order of weight and fetch only the N entries with the highest weights
•Pre-compute as much as possible
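A minimal sketch of the accumulator-based, term-at-a-time ranking just described, using raw term frequency as the per-term weight so that it reproduces the worked example above; a real system would plug in TF-IDF, BM25, or a smoothed LM weight, and would prune accumulators as noted.

```python
from collections import defaultdict

def rank_documents(query_terms, inverted_lists, top_k=10):
    """
    Term-at-a-time scoring with score accumulators.
    inverted_lists: {term: [(doc_id, freq), ...]}
    Here w(d, q, t) is simply the term frequency in d.
    """
    accumulators = defaultdict(float)
    for t in query_terms:
        for doc_id, freq in inverted_lists.get(t, []):
            accumulators[doc_id] += freq
    return sorted(accumulators.items(), key=lambda x: -x[1])[:top_k]

inverted_lists = {
    "info":     [(1, 3), (2, 4), (3, 1), (4, 5)],
    "security": [(2, 3), (4, 1), (5, 3)],
}
print(rank_documents(["info", "security"], inverted_lists))
# [(2, 7.0), (4, 6.0), (1, 3.0), (5, 3.0), (3, 1.0)]  -- matches the trace above
```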
Open Source IR Toolkits
•Smart (Cornell)
•MG (RMIT & Melbourne, Australia; Waikato, New Zealand)
•Lemur (CMU / Univ. of Massachusetts)
•Terrier (Glasgow)
•Lucene (open source)

Smart
•The most influential IR system/toolkit
•Developed at Cornell since the 1960's
•Vector space model with lots of weighting options
•Written in C
•The Cornell/AT&T groups have used the Smart system to achieve top TREC performance

MG
•A highly efficient toolkit for retrieval of text and images
•Developed by people at Univ. of Waikato, Univ. of Melbourne, and RMIT in the 1990's
•Written in C, running on Unix
•Vector space model with lots of compression and speed-up tricks
•People have used it to achieve good TREC performance

Lemur/Indri
•An IR toolkit emphasizing language models
•Developed at CMU and Univ. of Massachusetts in the 2000's
•Written in C++, highly extensible
•Vector space and probabilistic models, including language models
•Achieves good TREC performance with a simple language model

Terrier
•A large-scale retrieval toolkit with lots of applications (e.g., desktop search) and TREC support
•Developed at the University of Glasgow, UK
•Written in Java, open source
•"Divergence from randomness" retrieval model and other modern retrieval formulas

Lucene
•Open-source IR toolkit
•Initially developed by Doug Cutting in Java
•Has since been ported to some other languages
•Good for building IR/Web applications
•Many applications have been built using Lucene (e.g., the Nutch search engine)
•At the time of this lecture (2008), its built-in retrieval algorithms had relatively poor accuracy

What You Should Know
•What an inverted index is
•Why an inverted index helps make search fast
•How to construct a large inverted index
•Simple integer compression methods
•How to use an inverted index to rank documents efficiently
•HOW TO IMPLEMENT A SIMPLE IR SYSTEM

Applications of Basic IR Techniques

Some "Basic" IR Techniques
•Stemming
•Stop words
•Weighting of terms (e.g., TF-IDF)
•Vector/unigram representation of text
•Text similarity (e.g., cosine)
•Relevance/pseudo feedback (e.g., Rocchio)
They are not just for retrieval!

Generality of Basic Techniques
•(Diagram: raw text is tokenized (stemming & stop-word removal), then term-weighted into a term-document matrix [w_ij] over terms t1 … tn and documents d1 … dm; term similarity, doc similarity, and vector centroids computed from this matrix support CLUSTERING, CATEGORIZATION, and META-DATA/ANNOTATION, while sentence selection supports SUMMARIZATION.)
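Since the applications that follow reuse the same primitives, here is a minimal sketch of TF-IDF weighting and cosine similarity; the function names and the dictionary-based vector representation are illustrative choices, not taken from any of the toolkits above.

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, N):
    """TF-IDF vector of one document; df = document frequencies, N = # docs in the collection."""
    tf = Counter(tokens)
    return {t: c * math.log(N / df[t]) for t, c in tf.items() if t in df}

def cosine(u, v):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```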
Sample Applications
•Information filtering (covered earlier)
•Text categorization
•Document/term clustering
•Text summarization

Text Categorization
•Pre-given categories and labeled document examples (the categories may form a hierarchy)
•Classify new documents
•A standard supervised learning problem (e.g., a categorization system assigning incoming documents to Sports, Business, Education, Science, …)

"Retrieval-based" Categorization
•Treat each category as representing an "information need"
•Treat the examples in each category as "relevant documents"
•Use feedback approaches to learn a good "query"
•Match all the learned queries against a new document
•A document gets the category (or categories) represented by the best matching query (or queries)

Prototype-based Classifier
•Key elements ("retrieval techniques")
–Prototype/document representation (e.g., term vector)
–Document-prototype distance measure (e.g., dot product)
–Prototype vector learning: Rocchio feedback

K-Nearest Neighbor Classifier
•Keep all training examples
•Find the k examples that are most similar to the new document (the "neighbor" documents)
•Assign the category that is most common among these neighbors (the neighbors vote for the category)
•Can be improved by considering the distance of a neighbor (a closer neighbor has more influence)
•Technical elements ("retrieval techniques")
–Document representation
–Document distance measure

Example of K-NN Classifier
•(Figure: with k = 1 the new document takes the class of its single nearest neighbor; with k = 4 the majority class among the four nearest neighbors wins.)

Examples of Text Categorization
•News article classification
•Meta-data annotation
•Automatic email sorting
•Web page classification

Sample Applications
•Information filtering
•Text categorization
⇒Document/term clustering
•Text summarization

The Clustering Problem
•Discover the "natural structure" of the data
•Group similar objects together
•Objects can be documents, terms, or passages

Similarity-based Clustering (as opposed to "model-based")
•Define a similarity function to measure the similarity between two objects
•Gradually group similar objects together in a bottom-up fashion
•Stop when some stopping criterion is met
•Variations: different ways to compute the group similarity based on individual object similarities

Similarity-induced Structure
•(Figure: repeatedly merging the most similar objects induces a hierarchical, dendrogram-like structure over the collection.)

How to Compute Group Similarity?
Given two groups g1 and g2, three popular methods:
•Single-link algorithm: s(g1,g2) = similarity of the closest pair
•Complete-link algorithm: s(g1,g2) = similarity of the farthest pair
•Average-link algorithm: s(g1,g2) = average similarity over all pairs

Three Methods Illustrated
•(Figure: for two groups g1 and g2, single-link looks at the closest pair of points, complete-link at the farthest pair, and average-link at all pairs.)

Examples of Doc/Term Clustering
•Clustering of retrieval results
•Clustering of documents in the whole collection
•Term clustering to define "concepts" or "themes"
•Automatic construction of hyperlinks
•In general, very useful for text mining
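A minimal sketch of bottom-up (agglomerative) clustering with the three group-similarity rules above; the cubic-time loop is for clarity only, and the function names and the pairwise similarity function `sim` are assumptions for illustration.

```python
def group_similarity(g1, g2, sim, method="average"):
    """
    Similarity between two groups of objects, given a pairwise similarity
    function sim(a, b), using single-, complete-, or average-link.
    """
    pairs = [sim(a, b) for a in g1 for b in g2]
    if method == "single":      # closest (most similar) pair
        return max(pairs)
    if method == "complete":    # farthest (least similar) pair
        return min(pairs)
    return sum(pairs) / len(pairs)   # average over all pairs

def agglomerative_cluster(objects, sim, k, method="average"):
    """Bottom-up clustering: repeatedly merge the two most similar groups until k remain."""
    groups = [[o] for o in objects]
    while len(groups) > k:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                s = group_similarity(groups[i], groups[j], sim, method)
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        groups[i] = groups[i] + groups[j]   # merge the best pair
        del groups[j]
    return groups
```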
Sample Applications
•Information filtering
•Text categorization
•Document/term clustering
⇒Text summarization

The Summarization Problem
•Essentially "semantic compression" of text
•Selection-based vs. generation-based summaries
•In general, we need a purpose for summarization, but it's hard to define one

"Retrieval-based" Summarization
•Observation: a term vector is itself a (crude) summary?
•Basic approach
–Rank the "sentences" and select the top N as a summary
•Methods for ranking sentences
–Based on term weights
–Based on the position of the sentences
–Based on the similarity between the sentence vector and the document vector

Simple Discourse Analysis
•(Figure: the document is split into sentences, each represented as a vector (vector 1, vector 2, …, vector n); similarity between adjacent sentence vectors is computed to find topical segments.)

A Simple Summarization Method
•(Figure: within each discourse segment, the sentence whose vector is most similar to the document vector is selected, and the selected sentences form the summary.)

Examples of Summarization
•News summaries
•Summarizing retrieval results
–Single-doc summaries
–Multi-doc summaries
•Summarizing a cluster of documents (automatic label creation for clusters)

What You Should Know
•Retrieval touches some basic issues in text information management (what are these basic issues?)
•How to apply simple retrieval techniques, such as the vector space model, to information filtering, text categorization, clustering, and summarization
•There are many other tasks that can potentially benefit from simple IR techniques

Roadmap
•This lecture
–Other retrieval models
–IR system implementation
–Applications of basic TR techniques
•Next lecture: a more in-depth treatment of language models
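As a closing illustration of the selection-based summarization method above: score each sentence by its similarity to the whole-document vector and keep the top N. Raw term counts, plain cosine, and the function names are simplifying assumptions; per-segment selection and TF-IDF weights, as discussed above, are natural refinements.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def summarize(sentences, n=3):
    """
    Selection-based summary: score each sentence by its cosine similarity
    to the whole-document term vector and keep the top n, in original order.
    """
    sent_vecs = [Counter(s.lower().split()) for s in sentences]
    doc_vec = sum(sent_vecs, Counter())
    top = sorted(range(len(sentences)),
                 key=lambda i: cosine(sent_vecs[i], doc_vec),
                 reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]
```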