Modeling the Internet and the Web: Probabilistic Methods and Algorithms / Edition 1

ISBN-10: 0470849061
ISBN-13: 9780470849064
Pub. Date: 07/07/2003
Publisher: Wiley

Hardcover

$113.95

Overview

Modeling the Internet and the Web covers the most important aspects of modeling the Web using a modern mathematical and probabilistic treatment. It focuses on the information and application layers, as well as some of the emerging properties of the Internet.

  • Provides a comprehensive introduction to the modeling of the Internet and the Web at the information level.
  • Takes a modern approach based on mathematical, probabilistic, and graphical modeling.
  • Provides an integrated presentation of theory, examples, exercises and applications.
  • Covers key topics such as text analysis, link analysis, crawling techniques, human behaviour, and commerce on the Web.

Interdisciplinary in nature, Modeling the Internet and the Web will be of interest to students and researchers from a variety of disciplines including computer science, machine learning, engineering, statistics, economics, business, and the social sciences.

"This book is fascinating!" - David Hand (Imperial College, UK)

"This book provides an extremely useful introduction to the intellectually stimulating problems of data mining electronic business." - Andreas S. Weigend (Chief Scientist, Amazon.com)


Product Details

ISBN-13: 9780470849064
Publisher: Wiley
Publication date: 07/07/2003
Series: Wiley Series in Probability and Statistics, #493
Pages: 306
Product dimensions: 6.24(w) x 9.11(h) x 0.90(d)

About the Author

Pierre Baldi is a Chancellor's Professor of computer science at the University of California, Irvine, and the director of its Institute for Genomics and Bioinformatics. Paolo Frasconi and Padhraic Smyth are his co-authors of Modeling the Internet and the Web: Probabilistic Methods and Algorithms, published by Wiley.

Read an Excerpt

Modeling the Internet and the Web

Probabilistic Methods and Algorithms
By Pierre Baldi, Paolo Frasconi and Padhraic Smyth

John Wiley & Sons

Copyright © 2003 P. Baldi, P. Frasconi and P. Smyth
All rights reserved.

ISBN: 0-470-84906-1


Chapter Four

Text Analysis

Having focused in earlier chapters on the general structure of the Web, in this chapter we will discuss in some detail techniques for analyzing the textual content of individual Web pages. The techniques presented here have been developed within the fields of information retrieval (IR) and machine learning and include indexing, scoring, and categorization of textual documents.

The focus of IR is to access, as efficiently and as accurately as possible, a small subset of documents that is maximally related to some user interest. User interest can be expressed, for example, by a query specified by the user. Retrieval includes two separate subproblems: indexing the collection of documents in order to improve the computational efficiency of access, and ranking documents according to some importance criterion in order to improve accuracy. Categorization or classification of documents is another useful technique, somewhat related to information retrieval, that consists of assigning a document to one or more predefined categories. A classifier can be used, for example, to distinguish between relevant and irrelevant documents (where the relevance can be personalized for a particular user or group of users), or to help in the semiautomatic construction of large Web-based knowledge bases or hierarchical directories of topics like the Open Directory.

A vast portion of the Web consists of text documents; thus, methods for automatically analyzing text have great importance in the context of the Web. Of course, retrieval and classification methods for text, such as those reviewed in this chapter, can be specialized or modified for other types of Web documents such as images, audio, or video (see, for example, Del Bimbo 1999), but our focus in this chapter will be on text.

4.1 Indexing

4.1.1 Basic concepts

In order to retrieve text documents efficiently it is necessary to enrich the collection with specialized data structures that facilitate access to documents in response to user queries. A substring search, even when implemented using sophisticated algorithms like suffix trees or suffix arrays (Manber and Myers 1990), is not adequate for searching very large text collections. Many different methods of text retrieval have been proposed in the literature, including early attempts such as clustering (Salton 1971) and the use of signature files (Faloutsos and Christodoulakis 1984). In practice, inversion (Berry and Browne 1999; Witten et al. 1999) is the only effective technique for dealing with very large sets of documents. The method relies on the construction of a data structure, called an inverted index, which associates lexical items with their occurrences in the collection of documents. Lexical items in text retrieval are called terms and may consist of words as well as expressions. The set of terms of interest is called the vocabulary, denoted V. In its simplest form, an inverted index is a dictionary where each key is a term ω ∈ V and the associated value b(ω) is a pointer to an additional intermediate data structure, called a bucket or posting list. The bucket associated with a certain term ω is essentially a list of pointers marking all the occurrences of ω in the text collection.

The general structure is illustrated in Figure 4.1. In the simplest case, the entries in each bucket can simply consist of the document identifier (DID), the ordinal number of the document within the collection. However, in many cases it will be more useful to have a separate entry for each occurrence of the term, where each entry consists of the DID and the offset (in characters) of the term's occurrence within the document. The latter approach has two advantages. First, recording all the occurrences of the term allows us to present a user with a short context (i.e. a fragment of surrounding text) in which the term occurs in the retrieved document. Second, the occurrence information enables vicinity queries such as retrieving all documents where two assigned terms are positionally close in the text.

The size of the inverted index is Ω(|V|) and this data structure can typically be stored in main memory.

It can be implemented using a hash table so that the expected access time is independent of the size of the vocabulary. Buckets and the document repository must typically be stored on disk. This poses a challenge during the construction phase of the inverted index. If the buckets can be stored in main memory, the construction algorithm is trivial: it simply consists of parsing documents, extracting terms, inserting each term in the inverted index if not already present, and finally inserting the occurrence in the corresponding bucket. However, if the buckets are on disk, this approach will generate a tremendous number of disk accesses, making the algorithm impractical, since accessing a random byte of information on disk can take up to 10⁵ times longer than accessing a random byte in main memory. Thus, for very large collections of text it is necessary to resort to specialized secondary memory algorithms (Witten et al. 1999).
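
To make the in-memory construction concrete, the following sketch (ours, not the authors'; the toy documents and function names are invented for illustration) builds an inverted index whose buckets store (DID, offset) pairs, as in Figure 4.1. Recording the offset of every occurrence, rather than just the DID, is what later makes context display and vicinity queries possible.

    import re
    from collections import defaultdict

    def build_inverted_index(documents):
        """Build an in-memory inverted index.

        documents: a list of strings; a document's position in the list
        serves as its DID. Returns a dictionary mapping each term to its
        bucket: a list of (DID, character offset) pairs recording every
        occurrence of the term in the collection.
        """
        index = defaultdict(list)
        for did, text in enumerate(documents):
            # A crude tokenizer: lowercase word tokens with their offsets.
            for match in re.finditer(r"\w+", text.lower()):
                index[match.group()].append((did, match.start()))
        return index

    docs = ["The fisher caught three fish.",
            "Fish populations and famine statistics."]
    index = build_inverted_index(docs)
    print(index["fish"])   # [(0, 24), (1, 0)]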

Searching for a single term ω in an indexed collection of documents is straightforward. First, we access the inverted index to obtain b(ω). Then we simply scan the bucket pointed to by b(ω) to obtain the list of occurrences. The case of Boolean queries with multiple terms is only slightly more complicated. Suppose the query contains literals associated with the presence or absence of k terms. The previous procedure can be repeated separately for each term, yielding k lists of occurrences. These lists are then combined by elementary set operations that correspond to the Boolean operators in the query: intersection for AND, union for OR, and complement for NOT (Harman et al. 1992).
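
A minimal sketch of this query-evaluation step (again ours, with a tiny two-document index hard-coded for brevity): each term's bucket is reduced to a set of DIDs, and the Boolean operators become the corresponding set operations.

    # A tiny inverted index of the kind built above: term -> (DID, offset) list.
    index = {
        "fish":   [(0, 24), (1, 0)],
        "famine": [(1, 21)],
    }
    all_dids = {0, 1}   # every DID in the collection

    def docs_containing(term):
        """Set of DIDs whose bucket contains at least one occurrence of term."""
        return {did for did, _ in index.get(term, [])}

    def boolean_and(*terms):
        """DIDs containing every term: intersection of the per-term sets."""
        return set.intersection(*(docs_containing(t) for t in terms))

    def boolean_or(*terms):
        """DIDs containing at least one term: union of the per-term sets."""
        return set.union(*(docs_containing(t) for t in terms))

    def boolean_not(term):
        """DIDs not containing the term: complement w.r.t. the collection."""
        return all_dids - docs_containing(term)

    print(boolean_and("fish", "famine"))   # {1}
    print(boolean_or("fish", "famine"))    # {0, 1}
    print(boolean_not("famine"))           # {0}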

4.1.2 Compression techniques

The memory required for each pointer in the buckets can be reduced by sorting the occurrences of each term by its DID. In this way, rather than storing each DID, it suffices to store the sequence of differences between successive DIDs as a list of gaps. For example, the sequence of DIDs

(14, 22, 38, 42, 66, 122, 131, 226, 312, 331)

may be stored as a sequence of gaps

(14, 8, 16, 4, 24, 56, 9, 95, 86, 19).

The advantage is that frequent terms will produce many small gaps, and small integers can be encoded by short variable-length codewords. In particular, γ encoding (Elias 1975) represents a gap g as the pair of bit strings (u_g, b_g), where u_g is the unary code for the number ⌊log₂ g⌋ + 1 (i.e. ⌊log₂ g⌋ ones followed by a zero) and b_g is the standard binary code for the number g − 2^⌊log₂ g⌋. For example, the integer '7' would be represented as (110, 11) and '9' would be represented as (1110, 001). In contrast, for a collection of n documents, storing an integer word for a DID would require ⌈log₂ n⌉ bits. Because terms are distributed according to the Zipf distribution (see Chapter 1), on average this represents a significant memory saving (see Exercise 4.1 for details). Moffat and Zobel (1996) discuss this and other index compression techniques, together with fast algorithms for query evaluation and ranking.
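
The gap transformation and the γ code are easy to reproduce. The sketch below (ours, not from the book) converts the example DID list into gaps and reproduces the codewords quoted above for 7 and 9.

    def elias_gamma(g):
        """Elias gamma codeword of a positive integer g, as a bit string:
        the unary code for floor(log2 g) + 1, i.e. floor(log2 g) ones
        followed by a zero, then g - 2**floor(log2 g) written in
        floor(log2 g) binary digits.
        """
        assert g >= 1
        n = g.bit_length() - 1                                  # floor(log2 g)
        unary = "1" * n + "0"
        binary = format(g - (1 << n), "0{}b".format(n)) if n > 0 else ""
        return unary + binary

    def dids_to_gaps(dids):
        """Turn a sorted list of DIDs into the equivalent list of gaps."""
        return [dids[0]] + [b - a for a, b in zip(dids, dids[1:])]

    dids = [14, 22, 38, 42, 66, 122, 131, 226, 312, 331]
    print(dids_to_gaps(dids))      # [14, 8, 16, 4, 24, 56, 9, 95, 86, 19]
    print(elias_gamma(7))          # 11011   -> (110, 11) in the text's notation
    print(elias_gamma(9))          # 1110001 -> (1110, 001)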

In the case of large collections of Web pages, efficient indexing may be challenging and traditional approaches may quickly become inadequate. The design of an efficient distributed index for Web documents is discussed in Melnik et al. (2001).

4.2 Lexical Processing

4.2.1 Tokenization

Tokenization is the extraction of plain words and terms from a document, stripping out administrative metadata and structural or formatting elements, e.g. removing HTML tags from the HTML source file for a Web page. This operation needs to be performed prior to indexing or before converting documents to vector representations that are used for retrieval or categorization (see Section 4.3.1). Tokenization appears to be a straightforward problem but in many practical situations the task can actually be quite challenging.

The case of HTML documents is relatively easy. The simplest approach consists of reducing the document to an unstructured representation, i.e. a plain sequence of words with no particular relationship among them other than serial order. This can be achieved by only retaining the text enclosed between <html> and </html> (see Section 2.1), removing tags, and perhaps converting strings that encode international characters to a standard representation (e.g. Unicode). The resulting unstructured representation allows simple queries related to the presence of terms and to their positional vicinity. More generally, the additional information embedded in HTML can be exploited to map documents to semi-structured representations. In this case, the representation is sensitive to the presence of terms in different elements of the document and allows more sophisticated queries like 'find documents containing population in a table header and famine in the title'. Extracting semi-structured representations from HTML documents is in principle very easy, since tags provide all the necessary structural information. However, the reader should be aware that, although HTML has a clearly defined formal grammar, real-world browsers do not strictly enforce syntax correctness and, as a result, most Web pages fail to comply rigorously with the HTML syntax. Hence, an HTML parser must tolerate errors and include recovery mechanisms to be of any practical use. A public domain parser is distributed with Libwww, the W3C Protocol Library. An HTML parser written in Perl (with methods for text extraction) is available at search.cpan.org/dist/HTML-Parser/.

Obviously, after plain text is extracted, punctuation and other special characters need to be stripped off. In addition, the character case may be folded (e.g. to all lowercase characters) to reduce the number of index terms.
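
A rough illustration of this unstructured-representation pipeline, using Python's standard-library HTML parser (which is conveniently tolerant of malformed markup); this is our sketch, not the book's code, and it ignores the semi-structured case.

    import re
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collect the character data of an HTML page, discarding all tags."""
        def __init__(self):
            super().__init__()
            self.chunks = []

        def handle_data(self, data):
            self.chunks.append(data)

    def tokenize(html):
        """Unstructured representation: lowercase word tokens in serial order,
        with tags removed, punctuation stripped, and case folded."""
        extractor = TextExtractor()
        extractor.feed(html)
        text = " ".join(extractor.chunks)
        return re.findall(r"\w+", text.lower())

    page = ("<html><head><title>Famine</title></head>"
            "<body><p>Population data.</p></body></html>")
    print(tokenize(page))   # ['famine', 'population', 'data']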

Besides HTML, textual documents on the Web come in a large variety of formats. Some formats are proprietary and undisclosed, which severely limits the ability to extract text from such files. Other formats are publicly documented, such as PostScript and the Portable Document Format (PDF).

Tokenization of PostScript or PDF files can be difficult to handle because these are not data formats but algorithmic descriptions of how the document should be rendered. In particular, PostScript is an interpreted programming language and text rendering is controlled by a set of show commands. Arguments to show commands are text strings and two-dimensional coordinates. However, these strings are not necessarily entire words. For example, in order to perform typographical operations such as kerning, ligatures, or hyphenation, words are typically split into fragments and separate show commands are issued for each of the fragments. Show commands do not need to appear in reading order, so it is necessary to track the two-dimensional position of each 'shown' string and use information about the font in order to correctly reconstruct word boundaries. PostScript is almost always generated automatically by other programs, such as typesetting systems and printer drivers, which further complicates matters because different generators follow different conventions and approaches. In fact, perfect conversion is not always possible. As an example of efforts in this area, the reader can consult Nevill-Manning and Reed (1996) for details on PreScript, a PostScript-to-plain-text converter developed within the New Zealand Digital Library project (Witten et al. 1996). Another converter is Pstotext, developed within the Virtual Paper project (research.compaq.com/SRC/virtualpaper/home.html).

PDF is a binary format that is based on the same core imaging model as PostScript but can contain additional pieces of information, including descriptive and administrative metadata, as well as structural elements, hyperlinks, and even sound or video. In terms of delivered contents, PDF files are therefore much closer in structure to Web pages than PostScript files are. PDF files can (and frequently do, in the case of digital libraries) embed raster images of scanned textual documents. In order to extract text from raster images, it is necessary to resort to optical character recognition (OCR) systems (Mori et al. 1992).

4.2.2 Text conflation and vocabulary reduction

Often it is desirable to reduce all the morphological variants of a given word to a single index term. For example, a document of interest might contain several occurrences of words like fish, fishes, fisher, and fishers but would not be retrieved by a query with the keyword fishing if the term 'fishing' never occurs in the text. Stemming consists of reducing words to their root form (such as fish in our example), which becomes the actual index term. Besides its effect on retrieval, stemming can also be useful for reducing the size of the index.

One well-known stemming algorithm for English was developed by Porter (1980) as a simplification and systematization of previous work by Lovins (1968). It relies on a preconstructed suffix list with associated rules. An example of a rewriting rule is 'if the suffix of the word is IZATION and the prefix contains at least one vowel followed by a consonant, then replace the suffix with IZE'. This would transform, for example, the word BINARIZATION into the word BINARIZE. Porter's stemmer applies rules in five consecutive steps. Source code in several languages and a detailed description of the algorithm are available at tartarus.org/~martin/PorterStemmer/.
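
As an illustration, here is the single rewrite rule quoted above coded in isolation (our sketch; the full Porter stemmer applies dozens of such rules, under a more precise 'measure' condition, across its five steps, and in practice one would use a complete implementation such as the one available at the URL above).

    import re

    def apply_ization_rule(word):
        """One illustrative Porter-style rule, not the full stemmer:
        if the word ends in IZATION and the remaining prefix contains at
        least one vowel followed by a consonant, replace IZATION with IZE.
        """
        word = word.lower()
        if word.endswith("ization"):
            prefix = word[: -len("ization")]
            if re.search(r"[aeiou][^aeiou]", prefix):
                return prefix + "ize"
        return word

    print(apply_ization_rule("BINARIZATION"))   # binarize
    print(apply_ization_rule("IZATION"))        # ization (condition not met)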

Another technique that is commonly used for controlling the vocabulary is the removal of stop words, i.e. common words such as articles, prepositions, and adverbs that are not informative about the semantic content of a document (Fox 1992). Since stop words are very common, removing them from the vocabulary can also significantly help in reducing the size of the index. In practice, stop words may account for a large percentage of text, up to 20-30%. Naturally, removal of stop words also improves computational efficiency during retrieval.
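
Stop-word removal is then a simple filter over the token stream; in the sketch below (ours) the stop list is a tiny invented sample, whereas real stop lists are typically much larger.

    # A minimal illustration of stop-word removal.
    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "to", "is"}

    def remove_stop_words(tokens):
        """Drop tokens that appear in the stop list."""
        return [t for t in tokens if t not in STOP_WORDS]

    tokens = ["the", "population", "of", "fish", "in", "the", "lake"]
    print(remove_stop_words(tokens))   # ['population', 'fish', 'lake']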

4.3 Content-Based Ranking

A Boolean query to a Web search engine may return several thousand matching documents, but a typical user will only be able to examine a small fraction of these. Ranking matching documents according to their relevance to the user is therefore a fundamental problem.

Continues...


Excerpted from Modeling the Internet and the Web by Pierre Baldi, Paolo Frasconi and Padhraic Smyth. Copyright © 2003 by P. Baldi, P. Frasconi and P. Smyth. Excerpted by permission.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.

Table of Contents

Preface.

1 Mathematical Background.

1.1 Probability and Learning from a Bayesian Perspective.

1.2 Parameter Estimation from Data.

1.3 Mixture Models and the Expectation Maximization Algorithm.

1.4 Graphical Models.

1.5 Classification.

1.6 Clustering.

1.7 Power-Law Distributions.

1.8 Exercises.

2 Basic WWW Technologies.

2.1 Web Documents.

2.2 Resource Identifiers: URI, URL, and URN.

2.3 Protocols.

2.4 Log Files.

2.5 Search Engines.

2.6 Exercises.

3 Web Graphs.

3.1 Internet and Web Graphs.

3.2 Generative Models for the Web Graph and Other Networks.

3.3 Applications.

3.4 Notes and Additional Technical References.

3.5 Exercises.

4 Text Analysis.

4.1 Indexing.

4.2 Lexical Processing.

4.3 Content-Based Ranking.

4.4 Probabilistic Retrieval.

4.5 Latent Semantic Analysis.

4.6 Text Categorization.

4.7 Exploiting Hyperlinks.

4.8 Document Clustering.

4.9 Information Extraction.

4.10 Exercises.

5 Link Analysis.

5.1 Early Approaches to Link Analysis.

5.2 Nonnegative Matrices and Dominant Eigenvectors.

5.3 Hubs and Authorities: HITS.

5.4 PageRank.

5.5 Stability.

5.6 Probabilistic Link Analysis.

5.7 Limitations of Link Analysis.

6 Advanced Crawling Techniques.

6.1 Selective Crawling.

6.2 Focused Crawling.

6.3 Distributed Crawling.

6.4 Web Dynamics.

7 Modeling and Understanding Human Behavior on the Web.

7.1 Introduction.

7.2 Web Data and Measurement Issues.

7.3 Empirical Client-Side Studies of Browsing Behavior.

7.4 Probabilistic Models of Browsing Behavior.

7.5 Modeling and Understanding Search Engine Querying.

7.6 Exercises.

8 Commerce on the Web: Models and Applications.

8.1 Introduction.

8.2 Customer Data on the Web.

8.3 Automated Recommender Systems.

8.4 Networks and Recommendations.

8.5 Web Path Analysis for Purchase Prediction.

8.6 Exercises.

Appendix A: Mathematical Complements.

A.1 Graph Theory.

A.2 Distributions.

A.3 Singular Value Decomposition.

A.4 Markov Chains.

A.5 Information Theory.

Appendix B: List of Main Symbols and Abbreviations.

References.

Index.
