
[repost] GraphLab PageRank Example

original: http://graphlab.org/home/abstraction/

Designing and implementing efficient, bug-free parallel and distributed algorithms can be very challenging. To address this challenge, high-level data-parallel abstractions like Map-Reduce expose a simple computational pattern that isolates users from the complexities of large-scale parallel and distributed system design. Unfortunately, many important computational tasks are not inherently data-parallel and cannot be efficiently or intuitively expressed in data-parallel abstractions.

GraphLab is a high-level graph-parallel abstraction that efficiently and intuitively expresses computational dependencies. Unlike Map-Reduce, where computation is applied to independent records, computation in GraphLab is applied to dependent records stored as vertices in a large distributed data-graph. Computation in GraphLab is expressed as vertex-programs, which are executed in parallel on each vertex and can interact with neighboring vertices. In contrast to the more general message-passing and actor models, GraphLab constrains the interaction of vertex-programs to the graph structure, enabling a wide range of system optimizations. GraphLab programs interact by directly reading the state of neighboring vertices and by modifying the state of adjacent edges. In addition, vertex-programs can signal neighboring vertex-programs, causing them to be rerun at some point in the future.

PageRank Example

The easiest way to explain the GraphLab abstraction is through an example.  The PageRank algorithm is a popular technique for measuring the relative “importance” of webpages based on the hyperlink structure.  At a high level, the PageRank algorithm captures the modeling assumption that a link to a webpage is a vote of confidence and that votes from important pages are more valuable.  The PageRank of webpage i is given by the recurrence equation:

R(i) = (1 − α)/N + α · Σ_{j ∈ InNbrs(i)} R(j) / OutDegree(j)

where α is the damping factor (so 1 − α is the random reset probability) and N is the number of webpages.  Because the PageRank of page i depends on the PageRank of the pages that link to page i, we apply the equation iteratively until the PageRank of each page converges.  We can express this algorithm as the following vertex-program:

GraphLab1 PageRank Vertex-Program

PageRank_vertex_program(vertex i) {
  // (Gather) Compute the sum of my neighbors rank
  double sum = 0;
  foreach(vertex j : in_neighbors(i)) {
    sum = sum + j.rank / num_out_neighbors(j);
  }
  // (Apply) Update my Rank (i)
  i.old_rank = i.rank;
  i.rank = (1-ALPHA)/num_vertices + ALPHA*sum;
  // (Scatter) If necessary signal my neighbors to recompute their rank
  if(abs(i.old_rank - i.rank) > EPSILON) {
    foreach(vertex j : out_neighbors(i)) signal(j);
  }
}

The above vertex-program is the basis of the original GraphLab1 abstraction.  Notice that the vertex-program can directly access its neighboring vertices:

foreach(vertex j : in_neighbors(i)) {
  sum = sum + j.rank / num_out_neighbors(j);
}

Using the resulting sum the program directly updates the value of its vertex:

i.old_rank = i.rank;
i.rank = (1-ALPHA)/num_vertices + ALPHA*sum;

Finally, using one of the more powerful features of the GraphLab abstraction, the vertex-program signals its neighboring vertices to recompute their ranks only when its own PageRank has changed significantly.

if(abs(i.old_rank - i.rank) > EPSILON) {
  foreach(vertex j : out_neighbors(i)) signal(j);
}

The signal command tells the GraphLab computational engine to run the vertex-program on the neighboring vertices.

GraphLab2: Cooking with GAS

The original GraphLab1 abstraction encoded the vertex-program as a single function in the underlying language (C++/Java).  This made programming simple but limited the potential parallelism as well as the ability of the GraphLab runtime to apply the optimizations needed to tackle truly massive problems.  For example, in many real-world applications a small set of vertices has enormous neighborhoods (e.g., celebrities in a social network). These high-degree vertices lead to performance bottlenecks, since their vertex-programs are executed sequentially on a single machine. Even worse, because the GraphLab engine cannot choose the order in which neighbors are touched or move parts of the vertex-program to the machines that store the data, certain critical optimizations are not possible.

Through our research in graph-parallel algorithms we discovered a common pattern: the vast majority of vertex-programs can be further decomposed into three phases: Gather, Apply, and Scatter (GAS). During the gather phase the vertex-program typically collects information about its neighborhood.  More importantly, the computation in the gather phase typically resembles a micro map-reduce job over the neighborhood of the vertex.  The result of the gather phase is then passed to the apply phase, in which the vertex-program updates the value of the vertex.  Finally, during the scatter phase the vertex-program typically signals adjacent vertices.

The GraphLab2 abstraction refines the GraphLab1 abstraction by exploiting the GAS decomposition: it requires the user to explicitly break their program into gather, apply, and scatter functions.  This allows GraphLab to parallelize and distribute the gather and scatter phases and to employ sophisticated new techniques for data layout and caching.
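To make the “micro map-reduce” analogy concrete, here is a minimal C++ sketch of a gather phase decomposed into an edge-wise map and an associative reduce. The struct and function names are illustrative assumptions, not the GraphLab API; the point is that each in-neighbor is mapped independently and the results are combined with an associative sum, so the runtime is free to split a large neighborhood across threads or machines:

#include <cstdio>
#include <vector>

// Illustrative stand-in for GraphLab's vertex data (not the real API).
struct Vertex { double rank; int num_out_edges; };

// "Map": one in-neighbor's contribution, computed independently of the rest.
double gather_map(const Vertex& src) {
  return src.rank / src.num_out_edges;
}

// "Reduce": an associative, commutative combiner, so partial sums produced
// on different threads or machines can be merged in any order.
double gather_sum(double a, double b) { return a + b; }

// The engine is free to chunk this loop across workers however it likes.
double run_gather(const std::vector<Vertex>& in_neighbors) {
  double total = 0.0;
  for (const Vertex& v : in_neighbors)
    total = gather_sum(total, gather_map(v));
  return total;
}

int main() {
  std::vector<Vertex> in_neighbors = { {1.0, 2}, {0.5, 1} };
  std::printf("gathered sum = %f\n", run_gather(in_neighbors));  // prints 1.000000
}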

PageRank in GraphLab2:

We can easily decompose the PageRank algorithm into the Gather, Apply, and Scatter phases:

PageRank_vertex_program {
  // Gather Phase
  edge_dir gather_edges(vertex i) { return in_edges; }
  double gather(vertex i, edge e) { return e.source.rank / e.source.num_out_edges; }
  double gather_sum(double r1, double r2) { return r1 + r2; }
  // Apply Phase
  void apply(vertex i, double sum) {
    i.old_rank = i.rank;
    i.rank = (1-ALPHA)/num_vertices + ALPHA*sum;
  }
  // Scatter Phase
  edge_dir scatter_edges(vertex i) { return out_edges; }
  void scatter(vertex i, edge e) {
    if(abs(e.source.old_rank - e.source.rank) > EPSILON) {
      signal(e.target);
    }
  }
}

Actual C++ Code:

#include <graphlab.hpp>
float RESET_PROB = 0.15;
float TOLERANCE = 1.0E-2;
// The vertex data is just the pagerank value (a float).
// Extends POD type to enable serialization as "plain old data"
struct vertex_data : public graphlab::IS_POD_TYPE {
  float rank;
  vertex_data() : rank(1) { }
};
// There is no edge data in the pagerank application
typedef graphlab::empty edge_data;
// The graph type is determined by the vertex and edge data types
typedef graphlab::distributed_graph<vertex_data, edge_data> graph_type;
// PageRank Vertex Program
class pagerank :
  public graphlab::ivertex_program<graph_type, float>,
  public graphlab::IS_POD_TYPE {
  float last_change;
public:
  /* Gather the weighted rank of the adjacent page   */
  float gather(icontext_type& context, const vertex_type& vertex,
               edge_type& edge) const {
    return ((1.0 - RESET_PROB) / edge.source().num_out_edges()) *
      edge.source().data().rank;
  }
  /* Use the total rank of adjacent pages to update this page */
  void apply(icontext_type& context, vertex_type& vertex,
             const gather_type& total) {
    const double newval = total + RESET_PROB;
    last_change = std::fabs(newval - vertex.data().rank);
    vertex.data().rank = newval;
  }
  /* The scatter edges depend on whether the pagerank has converged */
  edge_dir_type scatter_edges(icontext_type& context,
                              const vertex_type& vertex) const {
    if (last_change > TOLERANCE) return graphlab::OUT_EDGES;
    else return graphlab::NO_EDGES;
  }
  /* The scatter function just signals adjacent pages */
  void scatter(icontext_type& context, const vertex_type& vertex,
               edge_type& edge) const {
    context.signal(edge.target());
  }
}; // end of factorized_pagerank update functor
int main(int argc, char** argv) {
  // Initialize control plane using MPI
  graphlab::mpi_tools::init(argc, argv);
  graphlab::distributed_control dc;
  // Parse command line options -----------------------------------------------
  graphlab::command_line_options clopts("PageRank algorithm.");
  std::string graph_dir;
  clopts.attach_option("graph", graph_dir,
                       "The graph file. Required ");
  if(!clopts.parse(argc, argv)) {
    dc.cout() << "Error in parsing command line arguments." << std::endl;
    return EXIT_FAILURE;
  }
  if (graph_dir == "") {
    dc.cout() << "Graph not specified. Cannot continue";
    return EXIT_FAILURE;
  }
  // Build the graph ----------------------------------------------------------
  graph_type graph(dc, clopts);
  graph.load_format(graph_dir, "tsv");
  // Running The Engine -------------------------------------------------------
  graphlab::omni_engine<pagerank> engine(dc, graph, "synchronous", clopts);
  engine.signal_all();
  engine.start();
  const float runtime = engine.elapsed_seconds();
  dc.cout() << "Finished Running engine in " << runtime
            << " seconds." << std::endl;
  // Tear-down communication layer and quit -------------------------------------
  graphlab::mpi_tools::finalize();
  return EXIT_SUCCESS;
} // End of main
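For reference, a sketch of how this example would typically be run (the exact build and launch commands depend on the installation and are not part of the code above): compile against the GraphLab headers with an MPI toolchain, then launch with something like mpiexec -n 4 ./pagerank --graph=path/to/graph, where --graph is the option registered via clopts.attach_option above and the input is a TSV edge list as expected by graph.load_format.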

 

 

[repost] The Mathematics Behind Google’s PageRank

original: http://www4.ncsu.edu/~ipsen/ps/slides_man.pdf

The Mathematics Behind Google’s PageRank

Ilse Ipsen
Department of Mathematics, North Carolina State University, Raleigh, USA
Joint work with Rebecca Wills

Two Factors

Two factors determine where Google displays a web page on the Search Engine Results Page:
1. PageRank (links): a page has high PageRank if many pages with high PageRank link to it
2. Hypertext Analysis (page contents): text, fonts, subdivisions, location of words, contents of neighbouring pages

PageRank
An objective measure of the citation importance of a web page [Brin & Page 1998]
• Assigns a rank to every web page
• Influences the order in which Google displays search results
• Based on the link structure of the web graph
• Does not depend on the contents of web pages
• Does not depend on the query

More PageRank, More Visitors [figure-only slide]

PageRank
“. . . continues to provide the basis for all of our web search tools” (http://www.google.com/technology/)
• “Links are the currency of the web”
• Exchanging & buying of links
• BO (backlink obsession)
• Search engine optimization

Overview
• Mathematical Model of Internet
• Computation of PageRank
• Is the Ranking Correct?
• Floating Point Arithmetic Issues

Mathematical Model of Internet
1. Represent the internet as a graph
2. Represent the graph as a stochastic matrix
3. Make the stochastic matrix more convenient ⇒ Google matrix
4. Dominant eigenvector of the Google matrix ⇒ PageRank

The Internet as a Graph

A link from one web page to another web page.
Web graph: web pages = nodes, links = edges
[figure-only slide: an example web graph]

The Web Graph as a Matrix
[Figure: a web graph on five pages, numbered 1–5]

S = [  0   1/2   0   1/2   0
       0    0   1/3  1/3  1/3
       0    0    0    1    0
       0    0    0    0    1
       1    0    0    0    0  ]

Links = nonzero elements in the matrix

Properties of Matrix S
• Row i of S: links from page i to other pages
• Column i of S: links into page i
• S is a stochastic matrix: all elements lie in [0, 1], and the elements in each row sum to 1
• Dominant left eigenvector: ωᵀS = ωᵀ, ω ≥ 0, ‖ω‖₁ = 1
• ωᵢ is the probability of visiting page i
• But: ω is not unique

Google Matrix
The Google matrix is the convex combination

G = αS + (1 − α)𝟙vᵀ

where the second term, 𝟙vᵀ, has rank 1.
• Stochastic matrix S
• Damping factor 0 ≤ α < 1, e.g. α = .85
• Column vector of all ones 𝟙
• Personalization vector v ≥ 0, ‖v‖₁ = 1; models teleportation

PageRank
G = αS + (1 − α)𝟙vᵀ

• G is stochastic, with eigenvalues 1 > α|λ₂(S)| ≥ α|λ₃(S)| ≥ …
• Unique dominant left eigenvector: πᵀG = πᵀ, π ≥ 0, ‖π‖₁ = 1
• πᵢ is the PageRank of web page i

[Haveliwala & Kamvar 2003, Eldén 2003, Serra-Capizzano 2005]

How Google Ranks Web Pages
• Model: internet → web graph → stochastic matrix G
• Computation: PageRank π is an eigenvector of G; πᵢ is the PageRank of page i
• Display: if πᵢ > πₖ then page i may be displayed before page k, depending on hypertext analysis

Facts
• The anatomy of a large-scale hypertextual web search engine [Brin & Page 1998]
• US patent for PageRank granted in 2001
• Google indexes tens of billions of web pages (1 billion = 10^9)
• Google serves ≥ 200 million queries per day
• Each query is processed by ≥ 1000 machines
• All search engines combined process more than 500 million queries per day [Desikan, 26 October 2006]

Computation of PageRank
The world’s largest matrix computation [Moler 2002]
• An eigenvector computation
• Matrix dimension is in the tens of billions
• The matrix changes often: 250,000 new domain names every day
• Fortunately, the matrix is sparse

Power Method
Want: π such that πᵀG = πᵀ

Power method:
  pick an initial guess x^(0)
  repeat
    [x^(k+1)]ᵀ := [x^(k)]ᵀ G
  until “termination criterion satisfied”

Each iteration is a matrix-vector multiply.

Matrix Vector Multiply
xᵀG = xᵀ(αS + (1 − α)𝟙vᵀ) = α xᵀS + (1 − α)vᵀ, since xᵀ𝟙 = 1 for a probability vector x.

Cost: the number of non-zero elements in S. A power method iteration is cheap.
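To make the cost claim concrete, here is a minimal C++ sketch of one power-method step, using the identity xᵀG = α xᵀS + (1 − α)vᵀ from above. The sparse-row storage and all names are assumptions for the example, not any particular library's API; the demo matrix is the 5-page S shown earlier.

#include <cstddef>
#include <cstdio>
#include <vector>

// One row of the sparse stochastic matrix S: column indices and values.
struct SparseRow { std::vector<int> cols; std::vector<double> vals; };

// One power-method step: y^T = alpha * x^T S + (1 - alpha) * v^T.
// Only the nonzeros of S are touched, which is why an iteration is cheap.
std::vector<double> power_step(const std::vector<SparseRow>& S,
                               const std::vector<double>& x,
                               const std::vector<double>& v,
                               double alpha) {
  std::vector<double> y(x.size());
  for (std::size_t j = 0; j < y.size(); ++j) y[j] = (1.0 - alpha) * v[j];
  for (std::size_t i = 0; i < S.size(); ++i)      // accumulate alpha * x^T S
    for (std::size_t k = 0; k < S[i].cols.size(); ++k)
      y[S[i].cols[k]] += alpha * x[i] * S[i].vals[k];
  return y;
}

int main() {
  // The 5-page example matrix S shown earlier, stored row by row.
  std::vector<SparseRow> S = {
    { {1, 3},    {0.5, 0.5} },
    { {2, 3, 4}, {1.0/3, 1.0/3, 1.0/3} },
    { {3},       {1.0} },
    { {4},       {1.0} },
    { {0},       {1.0} },
  };
  std::vector<double> x(5, 0.2), v(5, 0.2);   // uniform start and personalization
  for (int k = 0; k < 100; ++k) x = power_step(S, x, v, 0.85);
  for (double xi : x) std::printf("%.4f ", xi);  // converged PageRank vector
  std::printf("\n");
}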
Man – p.18Error Reduction in 1 Iteration
π
TG = π
T G = αS + (1 − α)11v
T
[x
(k+1) − π]
T = [x
(k)
]
TG − π
TG
= α[x
(k) − π]
T
S
Error: kx
(k+1) − πk1
| {z }
iteration k+1
≤ α kx
(k) − πk1
| {z }
iteration k
Man – p.19Error in Power Method
Recall πᵀG = πᵀ and G = αS + (1 − α)𝟙vᵀ. Error after k iterations:

‖x^(k) − π‖₁ ≤ α^k ‖x^(0) − π‖₁ ≤ 2 α^k

[Bianchini, Gori & Scarselli 2003]

The error bound does not depend on the matrix dimension.
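For completeness, a short derivation of why one iteration contracts the error by α (this step is implicit in the slides): since S is row-stochastic,

\[
\|z^{\mathsf{T}} S\|_1 = \sum_j \Bigl| \sum_i z_i S_{ij} \Bigr| \le \sum_i |z_i| \sum_j S_{ij} = \|z\|_1 ,
\]

so applying the one-iteration bound k times, and using \(\|x^{(0)} - \pi\|_1 \le \|x^{(0)}\|_1 + \|\pi\|_1 = 2\), gives the stated bound \(2\alpha^k\).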
Advantages of Power Method

• Simple implementation (few decisions)
• Cheap iterations (sparse matvec)
• Minimal storage (a few vectors)
• Robust convergence behaviour
• Convergence rate independent of the matrix dimension
• Numerically reliable and accurate (no subtractions, no overflow)
But: it can be slow

PageRank Computation
• Power method:
  Page, Brin, Motwani & Winograd 1999; Bianchini, Gori & Scarselli 2003
• Acceleration of the power method:
  Kamvar, Haveliwala, Manning & Golub 2003; Haveliwala, Kamvar, Klein, Manning & Golub 2003; Brezinski & Redivo-Zaglia 2004, 2006; Brezinski, Redivo-Zaglia & Serra-Capizzano 2005
• Aggregation/disaggregation:
  Langville & Meyer 2002, 2003, 2006; Ipsen & Kirkland 2006

PageRank Computation
• Methods that adapt to the web graph:
  Broder, Lempel, Maghoul & Pedersen 2004; Kamvar, Haveliwala & Golub 2004; Haveliwala, Kamvar, Manning & Golub 2003; Lee, Golub & Zenios 2003; Lu, Zhang, Xi, Chen, Liu, Lyu & Ma 2004; Ipsen & Selee 2006
• Krylov methods:
  Golub & Greif 2004; Del Corso, Gullí & Romani 2006

PageRank Computation
• Schwarz & asynchronous methods:
  Bru, Pedroche & Szyld 2005; Kollias, Gallopoulos & Szyld 2006
• Linear system solution:
  Arasu, Novak, Tomkins & Tomlin 2002; Arasu, Novak & Tomkins 2003; Bianchini, Gori & Scarselli 2003; Gleich, Zukov & Berkin 2004; Del Corso, Gullí & Romani 2004; Langville & Meyer 2006

PageRank Computation
• Surveys of numerical methods:
  Langville & Meyer 2004; Berkhin 2005; Langville & Meyer 2006 (book)

Is the Ranking Correct?
Suppose the true PageRank vector is πᵀ = ( .23  .24  .26  .27 ).

• xᵀ = ( .27  .26  .24  .23 ):  ‖x − π‖∞ = .04
  Small error, but incorrect ranking
• yᵀ = ( 0  .001  .002  .997 ):  ‖y − π‖∞ = .727
  Large error, but correct ranking

[repost] How Google Ranks Its Search Results

original: http://www.guokr.com/article/65304/

On September 27, Google unveiled a new doodle to celebrate its 13th birthday. Google is known to almost everyone in the world, but far fewer people know that 13 years ago Larry Page and Sergey Brin made their fortune, and founded Google, on the strength of an advanced algorithm. On the birthday of one of the world's most free and innovative companies, here is the mathematical story of how it all began.

[Figure: Google’s 13th-birthday doodle]

The Birth of PageRank and the Google Algorithm

For any proper search engine, the core function is naturally web search. But how should the search results be ordered? Before Google came to dominate internet search, this question caused real headaches.

At the time it was understood that being able to judge which web pages were more important would greatly help the development of search engines: obviously, a search engine should place the important pages near the top of its results.

The problem looks easy, but the solution was not as simple as one might imagine.

In the period before Google was born, the popular page-ranking algorithms were all similar, built on one very simple idea: the more important a page is, the more visits it receives. Many large companies ranked pages by measuring their traffic. But this approach has two glaring problems. First, traffic can only be sampled, so the statistics are inaccurate and fluctuate widely; obtaining reliable numbers takes a great deal of time and manpower, and the results stay valid only briefly. Second, traffic does not necessarily reflect a page's "importance": users who were online in those early days may remember the many services that existed purely to inflate visit counts. Is there a better way, one that ranks pages by importance without counting visits at all?

It was against this background that, in early 1996, Google's founders Page and Brin, then graduate students at Stanford University, began studying the page-ranking problem. In 1999 a paper with Page as first author was published, introducing an algorithm called PageRank. Its key idea: the more "important" a page is, the higher the quality of the links on it, and the more easily it is linked to by other "important" pages. The algorithm computes a page's importance purely from the link structure between pages, turning page ranking into a pure mathematics problem and finally escaping the traffic-statistics trap.

Three Children and a Game of Peas

Before describing the algorithm in detail, let us first simulate how PageRank runs with a simple game, to help the reader follow along.

Three brothers share 30 peas. Each starts with 10, and in every round each brother gives away all the peas in his hand, divided equally among the people he likes. The figure below shows each brother's initial number of peas and who likes whom (an arrow points from a brother to someone he likes; for example, the second brother likes the eldest, and the eldest likes the second and the third).

[Figure: the initial pea counts and the "likes" relations among the three brothers]

After the first round, the result is:

[Figure: pea counts after the first round]

The game continues in this fashion, round after round, until the number of peas in each brother's hand no longer changes.

Can the game actually end, and if so, what is the final outcome? Simulating the process on a computer gives the answer: the eldest and the second brother end up with 12 peas each, and the third with 6. From that point on, no matter how the game continues, the pea counts never change again.

A reader may ask at this point: what does this game have to do with ranking web pages? In fact, PageRank assigns every page a number; the higher the number, the more "important" the page. In the game, if we regard the number of peas as this value (it need not be an integer) and the children as web pages, then the process of the game is the PageRank algorithm, and the distribution of peas when the game ends gives the pages' PageRank values.
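As a sanity check, the game is easy to simulate. A minimal C++ sketch follows; note that the text names only part of the "likes" graph (the figure carried the rest), so the sketch assumes the third brother gives everything to the second, an assumption that reproduces the stated 12/12/6 outcome.

#include <cstdio>

// A small simulation of the pea game. Assumed "likes" pattern:
//   brother 1 -> brothers 2 and 3 (half each),
//   brother 2 -> brother 1,
//   brother 3 -> brother 2 (assumption; the figure carried this detail).
int main() {
  double peas[3] = {10, 10, 10};
  for (int round = 0; round < 100; ++round) {
    double next[3] = {0, 0, 0};
    next[1] += peas[0] / 2;  // brother 1 splits his peas between 2 and 3
    next[2] += peas[0] / 2;
    next[0] += peas[1];      // brother 2 gives everything to brother 1
    next[1] += peas[2];      // brother 3 gives everything to brother 2
    for (int i = 0; i < 3; ++i) peas[i] = next[i];
  }
  std::printf("%.1f %.1f %.1f\n", peas[0], peas[1], peas[2]);  // 12.0 12.0 6.0
}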

The Mathematical Model of PageRank

Unlike the earlier traffic statistics, PageRank solves the following problem: a person browses the web and, after reading each page, randomly clicks one of its links to visit a new page. If the page x currently being read is fixed, then the probability of each link on x being clicked is also fixed and can be written as a vector Nx. Under these conditions, after the person has clicked links infinitely many times, what is the probability that they are on each particular page?

In this model we write Rᵢ for the vector of probabilities of being on each page after i clicks (R₀ is the probability of starting on each page; we will show later that the choice of R₀ has no effect on the final result). Clearly the L1 norm of each Rᵢ is 1, which is also a requirement of the PageRank algorithm itself.

Again taking the game above as our example, at the very beginning of the browsing process we have:

[Formula: the uniform initial distribution R₀ = (1/3, 1/3, 1/3)ᵀ together with the transition matrix A of the pea game]

Here A is the matrix of link-click probabilities: the entry A_{i,j} in column i and row j means that if the currently visited page is page i, then the probability that the next click jumps to page j is A_{i,j}.

The advantage of designing A this way is that multiplying A by the vector Rₙ₋₁ yields Rₙ, the vector of probabilities of being on each page after one more click. For example, setting R₁ = A·R₀ gives the probabilities of being on each page after one click:

[Formula: R₁ = A·R₀ evaluated for the pea game]

Continuing the iteration, we have:

[Formula: Rₙ = A·Rₙ₋₁ = Aⁿ·R₀]

For the example above, the iteration results are plotted below:

[Figure: the probability of being on each page over successive iterations]

As the figure shows, the probability of staying on each page oscillates at first and then stabilizes.

In this steady state we know that, however long the iteration continues, Rₙ = Rₙ₋₁. This gives us an equation:

[Formula: R = AR]

The whole iterative process is thus a search for the solution of the equation R = AR; and whatever R₀ is, after infinitely many iterations the process arrives at a value of R satisfying R = AR. Solving for R in this way resembles a person wandering at random among the locations on a map, which is why it is called the "random walk model".
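One step worth making explicit: the equation R = AR says that R is an eigenvector of A with eigenvalue 1, and the iteration above is exactly the power method for computing it:

\[
R = A R \;\Longleftrightarrow\; (A - I)R = 0, \qquad
R_n = A R_{n-1} = A^n R_0 \;\longrightarrow\; R \quad (n \to \infty).
\]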

A distinctive feature of the random walk model is that each iteration depends only on the result of the previous one, and not at all on anything earlier. Such a process is called a Markov process or a Markov chain.

The mathematical definition of a Markov process: for a sequence of random variables X₀, X₁, X₂, …, where Xₙ is the state at time n, with transition probabilities P, if

P(Xₙ₊₁ | X₀, X₁, …, Xₙ) = P(Xₙ₊₁ | Xₙ),

i.e. Xₙ₊₁ is influenced only by Xₙ, then the process is called a Markov process. Here P(Xₙ₊₁ | Xₙ) is called the one-step transition probability; the two-step, three-step transition probabilities can be obtained by integrating the one-step probabilities.

When the state space is finite, the transition probabilities can be represented by a matrix A, called the transition matrix. The integral of transition probabilities then becomes a matrix power: the k-step transition probabilities are given by Aᵏ, which is exactly the situation in the random walk model. Furthermore, for a positive transition matrix A (one all of whose entries are positive), it can be shown that

[Formula: limₙ→∞ Aⁿ·R₀ exists and is the same for every valid R₀]

This fully explains why the choice of R₀ has no effect on the final result.

Repairing the Harm Done by "Dangling Pages"

But there is a problem here: even if R₀ does not affect the final result, is it really reasonable to use R as the basis for ranking web pages?

In fact it is not, because when a page has inbound links but no outbound links, it acts like a "black hole", gradually swallowing the PageRank that flows into it from the other pages in the same connected subgraph (once the algorithm's imaginary surfer enters such a page, the absence of outbound links traps them there forever). Such a page is called a "dangling page" (dangling link). This black-hole effect is so pronounced that on a well-connected web even a single dangling page can invalidate the ranking of the entire web: one bad apple spoils the barrel.

To fix this, Page and Brin introduced a correction. They realized that when users reach a dangling page they do not, and should not, simply stay there; they go on to visit other pages. Although which page a given user visits next depends on personal interests, on average Page and Brin assumed the user picks a page uniformly at random from the entire web.

So they added a new vector E to the PageRank algorithm. Its role is to redistribute, in the proportions it describes, the PageRank that a dangling page would otherwise swallow at each step over all web pages. This amounts to giving every dangling page outbound links to every page on the web, eliminating dangling links.
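One common way to write this correction down (a sketch of the idea; the original paper's notation may differ): let d be the indicator vector of dangling pages, with d_j = 1 exactly when page j has no outbound links. With the column-stochastic convention used above, the patched transition matrix is

\[
A' = A + E\,d^{\mathsf{T}},
\]

which fills each dangling (all-zero) column of A with the distribution E, so no PageRank is ever swallowed.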

That is the most important mathematical secret behind Google. Unlike earlier rankings based on keyword frequencies, a ranking determined by the mutual links of all web pages is much harder to fake: however extravagantly someone hypes their own page, if it has no genuinely attractive content, no one links to it and the hype is wasted. PageRank has another important property: it depends only on the structure of the web, not on what the user actually searches for. This means the ranking can be computed separately, in advance, rather than on the fly after the user types a query. The speed of Google search owes much to this.

Conclusion

It should be stressed that although PageRank is an important basis for ordering Google's search results, and the foundation the company was built on, it is not the only one: in fact, Google now uses hundreds of different algorithms together to determine the final order of the search results shown to users.

There is one more little story about PageRank. Larry Page is one of Google's founders and its current CEO. Amusingly, his surname "Page" coincides with the "Page" in "PageRank". Is this a coincidence, or deliberate? Many sources the author could find online state that PageRank is named after Larry Page's surname, but none of them gives the origin of this claim, so its truth cannot be verified.

Still, since Page himself has never explained, there is no need for us to agonize over what "Page" means here. Perhaps the word is simply a little pun that Page is playing on us all!


References:

[1] Changhai Lu (卢昌海), The Mathematics Behind Google (谷歌背后的数学)

[2] L. Page, S. Brin, R. Motwani and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Jan. 1998.

[3] Wikipedia: Markov process