Contents
1 Search Engines and Information Retrieval
  1.1 What Is Information Retrieval?
  1.2 The Big Issues
  1.3 Search Engines
  1.4 Search Engineers
2 Architecture of a Search Engine
  2.1 What Is an Architecture?
  2.2 Basic Building Blocks
  2.3 Breaking It Down
    2.3.1 Text Acquisition
    2.3.2 Text Transformation
    2.3.3 Index Creation
    2.3.4 User Interaction
    2.3.5 Ranking
    2.3.6 Evaluation
  2.4 How Does It Really Work?
3 Crawls and Feeds
  3.1 Deciding What to Search
  3.2 Crawling the Web
    3.2.1 Retrieving Web Pages
    3.2.2 The Web Crawler
    3.2.3 Freshness
    3.2.4 Focused Crawling
    3.2.5 Deep Web
    3.2.6 Sitemaps
    3.2.7 Distributed Crawling
  3.3 Crawling Documents and Email
  3.4 Document Feeds
  3.5 The Conversion Problem
    3.5.1 Character Encodings
  3.6 Storing the Documents
    3.6.1 Using a Database System
    3.6.2 Random Access
    3.6.3 Compression and Large Files
    3.6.4 Update
    3.6.5 BigTable
  3.7 Detecting Duplicates
  3.8 Removing Noise
4 Processing Text
  4.1 From Words to Terms
  4.2 Text Statistics
    4.2.1 Vocabulary Growth
    4.2.2 Estimating Collection and Result Set Sizes
  4.3 Document Parsing
    4.3.1 Overview
    4.3.2 Tokenizing
    4.3.3 Stopping
    4.3.4 Stemming
    4.3.5 Phrases and N-grams
  4.4 Document Structure and Markup
  4.5 Link Analysis
    4.5.1 Anchor Text
    4.5.2 PageRank
    4.5.3 Link Quality
  4.6 Information Extraction
    4.6.1 Hidden Markov Models for Extraction
  4.7 Internationalization
5 Ranking with Indexes
6 Queries and Interfaces
7 Retrieval Models
8 Evaluating Search Engines
9 Classification and Clustering
10 Social Search
11 Beyond Bag of Words
References
Index
This book is designed to help people understand search engines, evaluate and compare them, and modify them for specific applications. Searching for information on the Web is, for most people, a daily activity. Search and communication are by far the most popular uses of the computer. Not surprisingly, many people in companies and universities are trying to improve search by coming up with easier and faster ways to find the right information. These people, whether they call themselves computer scientists, software engineers, information scientists, search engine optimizers, or something else, are working in the field of Information Retrieval. So, before we launch into a detailed journey through the internals of search engines, we will take a few pages to provide a context for the rest of the book.
Gerard Salton, a pioneer in information retrieval and one of the leading figures from the 1960s to the 1990s, proposed the following definition in his classic 1968 textbook (Salton, 1968):
Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.
Despite the huge advances in the understanding and technology of search in the past 40 years, this definition is still appropriate and accurate. The term "informa……