# 分布式机器学习：PageRank算法的并行化实现（PySpark）

## 1. PageRank的两种串行迭代求解算法

$\left(\begin{array}{lll} 0 & 0 & 1 \\ \frac{1}{2} & 0 & 0 \\ \frac{1}{2} & 1 & 0 \end{array}\right)$

$G=\frac{q}{n} E+(1-q) M$

$x_{t+1}=G x_{t}$

(每轮迭代后要归一化)

PageRank还有一种求解算法（名字就叫“迭代算法”），它的迭代形式如下：

$x_{t+1} = \frac{q}{n}\bm{1} + (1-q)Mx_t$

## 4. 编程实现

import re
import sys
from typing import Iterable, Tuple

from pyspark.resultiterable import ResultIterable
from pyspark.sql import SparkSession

n_slices = 3  # Number of Slices
n_iterations = 10  # Number of iterations
q = 0.15 #the default value of q is 0.15

def computeContribs(neighbors: ResultIterable[int], rank: float) -> Iterable[Tuple[int, float]]:
# Calculates the contribution(rank/num_neighbors) of each vertex, and send it to its neighbours.
num_neighbors = len(neighbors)
for vertex in neighbors:
yield (vertex, rank / num_neighbors)

if __name__ == "__main__":
# Initialize the spark context.
spark = SparkSession\
.builder\
.appName("PythonPageRank")\
.getOrCreate()

[(1, 2), (1, 3), (2, 3), (3, 1)],
n_slices
)

# count the number of vertexes

# init the rank of each vertex, the default is 1.0/n_vertexes
ranks = adj_list.map(lambda vertex_neighbors: (vertex_neighbors[0], 1.0/n_vertexes))

# Calculates and updates vertex ranks continuously using PageRank algorithm.
for t in range(n_iterations):
# Calculates the contribution(rank/num_neighbors) of each vertex, and send it to its neighbours.
vertex_neighbors_rank[1][0], vertex_neighbors_rank[1][1]  # type: ignore[arg-type]
))

# Re-calculates rank of each vertex based on the contributions it received
ranks = contribs.reduceByKey(add).mapValues(lambda rank: q/n_vertexes + (1 - q)*rank)

# Collects all ranks of vertexs and dump them to console.
for (vertex, rank) in ranks.collect():
print("%s has rank: %s." % (vertex, rank))

spark.stop()


1 has rank: 0.38891305880091237.
2 has rank: 0.214416470596171.
3 has rank: 0.3966704706029163.


## 参考

• [1] Malewicz G, Austern M H, Bik A J C, et al. Pregel: a system for large-scale graph processing[C]//Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. 2010: 135-146.

• [2] Low Y, Gonzalez J, Kyrola A, et al. Distributed graphlab: A framework for machine learning in the cloud[J]. arXiv preprint arXiv:1204.6078, 2012.

• [3] Gonzalez J E, Low Y, Gu H, et al. {PowerGraph}: Distributed {Graph-Parallel} Computation on Natural Graphs[C]//10th USENIX symposium on operating systems design and implementation (OSDI 12). 2012: 17-30.

• [6] 许利杰，方亚芬. 大数据处理框架Apache Spark设计与实现[M]. 电子工业出版社, 2021.

• [9] 李航. 统计学习方法(第2版)[M]. 清华大学出版社, 2019.

• [10] Timothy sauer. 数值分析(第2版)[M].机械工业出版社, 2018.

原文作者：orion-orion
原文地址: https://www.cnblogs.com/orion-orion/p/16340839.html
本文转自网络文章，转载此文章仅为分享知识，如有侵权，请联系博主进行删除。