Description

, c Ε. Γαλλόπουλος Επιστηµονικός Υπολογισµός ΙΙ Ε. Γαλλόπουλος 1 1 Τµήµα Μηχανικών Ηλεκτρονικών Υπολογιστών και Πληροφορικής Πολυτεχνική Σχολή, Πανεπιστήµιο Πατρών 20/5/13 Motivation Importance

Information

Category:
## Bills

Publish on:

Views: 8 | Pages: 71

Extension: PDF | Download: 0

Share

Transcript

, c Ε. Γαλλόπουλος Επιστηµονικός Υπολογισµός ΙΙ Ε. Γαλλόπουλος 1 1 Τµήµα Μηχανικών Ηλεκτρονικών Υπολογιστών και Πληροφορικής Πολυτεχνική Σχολή, Πανεπιστήµιο Πατρών 20/5/13 Motivation Importance at Google What makes a web page important for Google? many important pages contain links to it a page containing many links has reduced impact on the importance of the pages it contains links to Motivation Importance at Google What makes a web page important for Google? many important pages contain links to it a page containing many links has reduced impact on the importance of the pages it contains links to Ranking equations: x(p j ) x(p i ) = P j B Pi P j where x(p) is rank of node P and B(P) the set of incoming links to P PageRank vector x := [x(p 1 ),...,x(p n )] expresses the relative importance of Web pages for Google US Patent Method for node ranking in a linked database (Stanford) Motivation Importance at Google From Computer Networks and ISDN Systems 30 ( 1998) The anatomy of a large-scale hypertextual Web search engine Sergey Brin *, Lawrence Page *Z Computer Science Department. Stanford Univer.sity Stanford. CA 94305, USA Abstract In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of Web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the Web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and Web proliferation, creating a Web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale Web search engine - the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want Published by Elsevier Science B.V. All rights reserved. Keywords: World Wide Web; Search engines; Information retrieval; PageRank: Google 1. Introduction likely to surf the Web using its link graph, often starting with high quality human maintained indices such The Web creates new challenges for information as Yahoo! 3 or with search engines. Human mainretrieval. The amount of information on the Web is tained lists cover popular topics effectively but are growing rapidly, as well as the number of new users subjective, expensive to build and maintain, slow to inexperienced in the art of Web research. People are improve, and cannot cover all esoteric topics. Automated search engines that rely on keyword matching usually return too many low quality matches. To make Corresponding author. matters worse, some advertisers attempt to gain peo- There are two versions of this paper - a longer full version and a shorter printed version. The full version is available on the ple s attention by taking measures meant to mislead Web and the conference CD-ROM. (sergey, Google goals Co-founder Larry Page once described the perfect search engine as something that understands exactly what you mean and gives back exactly what you want. Basic factors: Relevance - Freshness - Comprehensiveness - Speed /9X/$ Published by Elsevier Science B.V. All rights reserved. PII SOl (98)001 IO-X Motivation Importance at Google From Relevance - Comprehesiveness - Speed... one key innovation was PageRank, a technology that determined the importance of a webpage by looking at what other pages link to it, as well as other data. Google indexes billions and billions of webpages, currently O(10 8 ) GB. Average query response roughly 0.25sec. Motivation Importance at Google Many rewards! Motivation Importance at Google PageRank computation is a Grand Challenge How to compute? ITERATE x k+1 (P i ) = P j B Pi x k (P j ) P j where x k (P) is rank of node P at iteration k Motivation Importance at Google PageRank computation is a Grand Challenge How to compute? ITERATE x k+1 (P i ) = P j B Pi x k (P j ) P j where x k (P) is rank of node P at iteration k... megascale to rank the Web, n = O(10 9 )... requires the use of HPC tools... distributed data... sparse and irregular, small data locality... sophisticated methods might not work... interpretability and benchmarking? Motivation Importance at Google PageRank computation is a Grand Challenge How to compute? ITERATE x k+1 (P i ) = P j B Pi x k (P j ) where x k (P) is rank of node P at iteration k... megascale to rank the Web, n = O(10 9 )... requires the use of HPC tools... distributed data... sparse and irregular, small data locality... sophisticated methods might not work... interpretability and benchmarking? Challenges: Numerical, Computational, Cognitive, Legal... P j Motivation Importance at Google Graph Adjacency matrix Adjacency matrix A { 1, i points to j page A ij = 0, otherwise nz = 2636 Harvard Motivation Importance at Google Substochastic transition matrix Dangling pages Transition matrix P P ij = where deg(i) = j A ij (outdegree) Dangling page problem { Aij deg(i), iff deg(i) 0 0, otherwise Rows of P sum to 1 or 0 (substochastic) For Harvard500: sum(sum(p )==0)=122 dangling pages, e.g. , E. Gall opouloc Motivation Importance at Google Stochasticity adjustment (Brin & Page) First rank-1 adjustment: S = P + w dt where w = n1 e (could also be any probability vector) e vector of ones, di = 1 if deg (i ) = 0 (dangling), else nz = nz = Motivation Importance at Google Primitivity adjustment (Brin & Page) Observe: S is (column) stochastic but could be reducible, imprimitive therefore could have several eigenvalues s.t. λ = 1. Motivation Importance at Google Primitivity adjustment (Brin & Page) Observe: S is (column) stochastic but could be reducible, imprimitive therefore could have several eigenvalues s.t. λ = 1. Fix (Brin & Page primitivity adjustment) G(µ) := µs + (1 µ) 1 n ee,µ (0,1) convex combination of S and rank-1 teleportation matrix H := 1 n ee. G(µ) 0, irreducible and primitive column stochastic, unique eigenvalue 1 = ρ(g(µ)) with corresponding left eigenvector 1 n e. Ergodic theorem: lim k G(µ) k = pe, where p is the right (Perron) eigenvector. Motivation Importance at Google Relaxed formulation to handle personalization, fight spam... Observe: Use more general v 0 rather than 1 e to correct. n Google matrix G(µ) := µs + (1 µ)ve,µ (0,1) where v 0 is also called personalization or trust vector. convex combination of S and rank-1 matrix H := ve. G(µ) 0, not necessarily irreducible or primitive. Can be shown that λ = 1 is its only eigenvalue of largest modulus. column stochastic, unique eigenvalue 1 = ρ(g(µ)) with corresponding (Perron) eigenvector 1 n e. Ergodic theorem: lim k G(µ) k = pe, where p is the right (Perron) eigenvector. G(µ) nicknamed parametric by Serra-Cappizano. Motivation Importance at Google Historical remarks on Spectral Ranking [Vigna 10] Working structure: Matrix where each row/column represents an entity and a value at entry (i,j) represents a form of endorsment of entity j from entity i. Goal: Use spectral information to rank entities [Luce & Perry 49] [A k ] i,j paths of size k exactly linking i to j. [Seeley 49] x = x P, where P is row stochastic. [Katz 53] Damping factor µ for convergence of e k 0 (µs) k. [Hubbell 65] Personalization vector v so that x = v + x S. Motivation Importance at Google Historical remarks on Spectral Ranking [Vigna 10] Working structure: Matrix where each row/column represents an entity and a value at entry (i,j) represents a form of endorsment of entity j from entity i. Goal: Use spectral information to rank entities [Luce & Perry 49] [A k ] i,j paths of size k exactly linking i to j. [Seeley 49] x = x P, where P is row stochastic. [Katz 53] Damping factor µ for convergence of e k 0 (µs) k. [Hubbell 65] Personalization vector v so that x = v + x S. Journals: Psychometrica ( 49, 53), Canadian J. Psych. 49, Sociometrics 65, Motivation Importance at Google Characterization PageRank vector satisfies x = G x Thus, x is the positive eigenvector corresponding to λ max = 1 of the irreducible, stochastic matrix G. When normalized as x e = 1, x represents the stationary probability distribution of a Markov chain represented by G. Random surfer interpretation: Web surfer who traverses the links, every time picking a random page from the existing hyperlinks with probability µ and any other one with probability 1 µ. Then x i will be the probability that a random walk through the web will land on page i Caution: Google tries to find pages that are both reputable and relevant. Warning: Only Google knows the exact recipe! Motivation Importance at Google Possibly the largest computation in the world! Motivation Importance at Google Rough taxonomy [Sakellaridi 06] PageRank Computation Spectral problemς Τ Τ π P = π Linear system π Τ ( I P) = 0 Τ Power method Krylov subspace methods Classical iterations / splittings Jacobi/ Gauss-Seidel Krylov subspace methods Matrix reordering P Power method acceleration Arnoldi based method reordered PageRank Extrapolation methods Matrix reordering P Adaptive PageRank Block- Structured PageRank Motivation Importance at Google Some contributors Brin, Page, Kleinberg Golub, Kamvar, Haveliwala, Manning, Lee, Zenios Ipsen, Meyer, Langville Berkhin, Gleich, Greif, Gray, Lau, Zhukov, Brinkmeier, del Corso, Boldi, Vigna, Bianchini, Brezinski, Serra-Capizzano Broder, Lempel, Maghoul, Pedersen, Avranchenkov, Bressan, Peserico, Akian, Gaubert, Ninove Comprehensive surveys A.N. Langville and C.D. Meyer. Google s PageRank and Beyond: The Science of Search Engine Rankings, Princeton University Press, P. Berkhin. A Survey on PageRank Computing. Internet Math., 2005 Motivation Importance at Google Building the matrix Need links from each website extensive crawl to obtain them or off-the-shelf Web crawls Production of links Small crawls e.g. using Moler s simple MATLAB crawler surfer.m Thus, x is the normalized positive solution of linear system satisfying the nonnegativity and normalization constraints , E. Gall opouloc Motivation PR characterization Characterizations Eigensystem: PageRank vector x PR satisfies x = G x x is the positive eigenvector corresponding to λ 1 := maxλ(g) = 1. When normalized as x e = 1, x represents the stationary probability distribution of the Markov chain for G Linear system: PageRank vector x PR satisfies (I µs) x = (1 µ) v, x 0, x 1 = 1. Motivation PR characterization Power Method x = Gx: Eigenvector problem (λ = 1) Power Method λ 1 λ 2... λ n, x 1,x 2,...,x n x = a 1 x 1 + a 2 x a n x n, a 1 0 vanishing terms for m {}}{ G m x = a 1 λ m 1 x }{{} 1 + a 2 λ m 2 x a n λ m n x n dominant term Note: Stochasticity preserved hence no need to normalize... can apply asynchronous techniques [KGSzyld 06] (important topic) Motivation Spectral characteristics Spectrum of Google matrix Theorem (Brauer 52) Let A R n n be an arbitrary matrix with eigenvalues λ 1,...,λ n, v be an eigenvector associated with eigenvalue λ k and q any n-dimensional vector. Then the matrix A + vq has eigenvalues λ 1,...,λ k 1,λ k + v q,λ k+1,...,λ n. Theorem (HaveliwalaKamvar, Langville&Meyer, Eldén, Serra-Capizzano) Let {1,λ 2,...,λ n } be the eigenvalues of S. Then the eigenvalues of G = µs + (1 µ)ve are {1,µλ 2,...,µλ n }. Note Convergence of PM depends on 1 µλ 2 λ 2 could be 1... so µ plays critical role in convergence. Choice of µ plays an important dual role: Mathematical and Behavioral (as probability). PageRank as matrix series PageRank via (matrix) power series in S From linear system interpretation, x PR := (I µs) 1 v(1 µ) the corresponding Neumann series converges since ρ(µs) 1, so x PR x = (1 µ) µ i S i v i=0 stochasticity preserved: e x = (1 µ) i=0 µ i e S i v = 1. see [Fogaras 03, Boldi.Santini.Vigna 05] see [Serra,Horn,Brezinsky.Redivo-Zaglia 06] summary of many alternative implicit and explicit representations (polynomial, rational). PageRank as matrix series Powers of adjacency matrices If A is the adjacency matrix for graph G A then [A] i,j := α ij 1 iff there is edge i j [A 2 ] i,j := α (2) ij # 2-stop paths i k j [A 3 ] i,j := α (3) ij # 3-stop paths i k 1 k 2 j [A + A 2 + A k ] i,j : # paths of up to k-stops from vertex i to j. e [A + A 2 + A k ] j : # paths of up to k-stops from any vertex to j. Weighted versions, e.g. exponential, make for network/graph measures, e.g. Estrada index [EstradaHigham 10] PageRank as matrix series PageRank as link-expressing power series in damping factor Let p = (η 1,η 2,...,η k ) be in G A, branching(p) = 1 out(ν 1 )out(ν 2 ) out(ν k 1 ) then [Brinkmeier 06,Newman.Strogatz.Watts 01] Recommendation (x) i = l=0 (1 µ)µ l branching(p)v η1 p i l... the representation of PageRank as power series provides deeper insight into the nature and properties of this ranking. The effects of the parameters, i.e. the graph, the personalization vector and the damping factor, are clearly separated from each other, and their influence on the resulting scores becomes clear [Brinkmeier 06]. PageRank as matrix series Functional rankings Functional ranking [BaezaYates.Boldi.Castillo 06] Generalize the Neumann representation by x df := j=0 ψ(j)s j v (1) Observations ψ is the damping function and the resulting ranking functional. Relation (1) is template for general ranking vector based on ψ. Generalization by selecting alternative series coefficients aka damping function : Ranking is well defined if the series is convergent and normalized if the coefficients add to 1. PageRank as matrix series Functional rankings Functional Rankings Total Rank [Boldi] x TR = 1 0 x PR 1 (µ)dµ = j=0 (j + 1)(j + 2) Sj v General Hyperbolic Rank [B-YBC] x GHR = 1 ζ(β) j=0 1 (j + 1) β Sj v, ζ(β) := j=0 1 (j + 1) β. PageRank as matrix series Functional rankings Series do not have to be infinite polynomials LinearRank x LR(k) = k j=0 2(k + 1 j) (k + 1)(k + 2) Sj v Minimal polynomial form Let q m (λ) be the minimal polynomial of S w.r.t. v, then q m (λ) = (λ 1)ˆq m 1 (λ), and x PR = ˆq m 1 (λ)v. Refs: cf. [BaezaYates.Boldi.Castillo 06, Boldi.Santini.Vigna 05, Brezinsky.Redivo-Zaglia 06] Multidamping Multidamped surfing [Kollias+EG 07,10] Set G(µ) = µ S + (1 µ) H to highlight dependence on µ. Definition Surfing process modeled by transitions with possibly varying damping parameters M := (µ 1,µ 2,...) [0,1): v G(µ 1 )v G(µ 2 )G(µ 1 )v G(µ i ) G(µ 1 )v When M = (µ 1,...,µ k ) is finite call this k-step multidamping modeled by M. Multidamping Multidamped surfing [Kollias+EG 07,10] Set G(µ) = µ S + (1 µ) H to highlight dependence on µ. Definition Surfing process modeled by transitions with possibly varying damping parameters M := (µ 1,µ 2,...) [0,1): v G(µ 1 )v G(µ 2 )G(µ 1 )v G(µ i ) G(µ 1 )v When M = (µ 1,...,µ k ) is finite call this k-step multidamping modeled by M. Properties: All G(µ j ) are stochastic ρ(g(µ j )) = 1 All products G(µ j1 )G(µ j2 ) G(µ ji ) are stochastic If e v = 1 e ( i k=1 G(µ j k )v) = 1 Multidamping Structure: Single µ Lemma (Bernstein.vanLoan 00) Given B R n n,g,h R n and j 0, then (B + gh ) j = B j + K j E j L j where K j = [g,bg,...,b j 1 g], L j = [h,(b + hg )h,...,(b + hg ) j 1 h], and E j = I j (:,j : 1 : 1) is the antidiagonal identity matrix of order j. Corollary: Using the above on B := µs, g := (1 µ)v, h := e from stochasticity follows that (G(µ)) j = µ j S j + (1 µ)(p j 1 (S)v)e where p j 1 (S) = 1 + µs + µ j 1 S j 1. that was an alternative proof for finite PageRank representation Multidamping Teleportation, Stochasticity and Personalization synergy Note j G(µ j ) s contains sum-of-products of S,H. Properties Hv = v H 2 = H HS = H HG(µ) = H Consequence Any word (product) that is made of H and S terms, is simplified as follows: Anything to the right of the first term H drops. Terms that end with Hv become v. Multidamping Structure: Multidamping Using induction, from stochasticity & teleportation properties, Theorem k j=1 G(µ k j+1 ) = ( k µ j )S k + E (2) j=1 where E := ( k 1 j=0 ζ js j v ) e so that rank(e) = 1 and ζ k 1 = µ k µ 2 (1 µ 1 ) (3) ζ 1 = µ k (1 µ k 1 ) ζ 0 = 1 µ k Multidamping Observations Related Extends generalized Sherman-Morrison formula of [BoldiSantiniCastillo.06] to the case of multidamping. Important We can do coefficient matching between this expanded product and the functional ranking template so that the two expressions agree... exactly or approximately. Coefficient generation Coefficients ζ j can be quickly generated from the µ j s; for instance, we can first compute all the prefixes π j of the set {µ k,...,µ 1 } (that is all the values π j = µ k µ k j+1,j = 1,...,k) and then compute ζ j 1 = π j 1 π j with π 0 = 1 for j = 1,...,k. Multidamping Tracing a path with multidamping Let v be the personalization vector, so Hv = v. Transition 1 G(µ 1 )v = µ 1 Sv + (1 µ 1 )Hv = µ 1 Sv + (1 µ 1 )v span v,sv Transition 2 G(µ 2 )G(µ 1 )v = µ 2 µ 1 S 2 v + µ 1 (1 µ 2 )HSv + (1 µ 1 )µ 2 SHv +(1 µ 1 )(1 µ 2 )H 2 v = µ 2 µ 1 S 2 v + µ 1 (1 µ 2 )v + (1 µ 1 )µ 2 Sv +(1 µ 1 )(1 µ 2 )v = µ 2 µ 1 S 2 v + (1 µ 1 )µ 2 Sv + (1 µ 2 )v span v,sv,s 2 v Multidamping Observations G(µ k ) G(µ 1 )v = (µ j S + (1 µ j )ve )v j=k: 1:1 = k j=0 for coefficients {ζ 0,...,ζ k }. Note: ζ k = k j=1 µ j k j=0 ζ j = 1 from stochasticity. Objective Seek transformation ζ j S j v K k+1 (S,v) := span v,sv,...,s k v {ζ 0,,ζ k } φ {µ 1,,µ k } Multidamping Coefficient generation Coefficient generation Theorem Given the scalars Z := {ζ j 0,j = 0,...,k k j=0 ζ j = 1} and the vector v so that e v = 1, then there exists a sequence M := {µ j [0,1),j = 1,...,k}, such that... k j=1 G(µ k j+1 )v = ζ k S k v + p k 1 (S)v, Multidamping Coefficient generation Theorem (continued) Damping parameters can be computed either by backward recurrence φ bwd (Z): µ k+1 = 1 forward recurrence φ fwd (Z): µ j = 1 ζ k j k+1 i=j+1 µ,j = k,,1. i µ 0 = 0 µ j = ρ, j = 1,...,k, k j+1 1 µ j 1 where ρ k j+1 = ζ k j+1 ζ k j. Multidamping Coefficient generation What we learn It is possible to express vectors produced by the finite form of the template for functional rankings as a finite number of multidamping steps, that is, in product form, where each term is a Google matrix. Example Z = { 1 2, 1 3, 1 6 } x = 1 6 S2 v Sv v M = { 1 3, 1 2 } x = ( 1 2 S H)( 1 3 S H)v. Multidamping Coefficient generation Normalization issues Issue: Typical damping functions are infinite series. In practice, we can only have a finite number of terms and coefficients Z = {...,ζ j,...}, hence might lose normalization, s := Z ζ j 1. Multidamping Coefficient generation Normalization issues Issue: Typical damping functions are infinite series. In practice, we can only have a finite number of terms and coefficients Z = {...,ζ j,...}, hence might lose normalization, s := Z ζ j 1. Multidamping Coefficient generation Normalization issues Issue: Typical damping functions are infinite series. In practice, we can on

Related Search

Similar documents

We Need Your Support

Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks