PageRank is an algorithm developed by Larry Page and Sergey Brin (co-founders of Google) for estimating the importance (rank) of nodes in a directed graph based on the structure of incoming links.
Core idea: A page is important if many important pages link to it. This is a recursive notion of importance.
KEY TAKEAWAY: PageRank models a “random surfer” who follows links at random. A page’s rank is proportional to the probability that the random surfer is visiting it at any given moment.
Input: A directed graph $G = (V, E)$ where nodes are web pages and directed edges represent hyperlinks
Output: A score $PR(v)$ for each vertex, representing its relative importance
Interpretation: Higher $PR(v)$ → more “important” node
For a vertex $v$ in a graph with $n$ vertices in total:
$$PR(v) = \frac{1-d}{n} + d \sum_{u \in \text{in}(v)} \frac{PR(u)}{\text{out}(u)}$$
Where:
- $d$ is the damping factor (typically 0.85) — probability of following a link
- $1 - d$ = probability of jumping to a random page (teleportation)
- $\text{in}(v)$ = set of vertices with edges pointing to $v$
- $\text{out}(u)$ = out-degree of vertex $u$ (number of links from $u$)
Imagine a web surfer who:
- With probability $d = 0.85$: clicks a random link on the current page
- With probability $1 - d = 0.15$: jumps to a completely random page
The PageRank of a page is the long-run fraction of time the surfer spends on that page.
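The random-surfer model above can be sketched as a short simulation. This is illustrative only: the function name, step count, and seed are choices made here, not part of the notes.

```python
import random

def random_surfer(graph, d=0.85, steps=100_000, seed=42):
    """Estimate PageRank empirically by simulating the random surfer.

    graph: dict mapping each page to the list of pages it links to.
    Returns the fraction of time spent on each page.
    """
    random.seed(seed)
    pages = list(graph)
    visits = {p: 0 for p in pages}
    current = random.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        if graph[current] and random.random() < d:
            current = random.choice(graph[current])   # follow a random link
        else:
            current = random.choice(pages)            # teleport to a random page
    return {p: count / steps for p, count in visits.items()}
```

Running this on a small graph gives visit frequencies that approximate the PageRank scores: the long-run fraction of time on each page is exactly what the formula computes analytically.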
PageRank is computed iteratively until convergence:
    ALGORITHM PageRank(graph, d, iterations)
        n ← number of vertices
        PR ← Dictionary (all vertices → 1/n)    // initialise equally
        for iter ← 1 to iterations do
            newPR ← Dictionary
            for each vertex v in graph do
                rank_sum ← 0
                for each u in inNeighbours(v) do
                    rank_sum ← rank_sum + PR[u] / outDegree(u)
                end for
                newPR[v] ← (1 - d) / n + d * rank_sum
            end for
            PR ← newPR
        end for
        return PR
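The pseudocode above translates almost line-for-line into Python. A sketch, assuming the graph is stored as a dict of outgoing adjacency lists (that representation and the identifier names are choices made here):

```python
def pagerank(graph, d=0.85, iterations=20):
    """graph: dict mapping each vertex to the list of vertices it links to."""
    n = len(graph)
    pr = {v: 1 / n for v in graph}              # initialise all ranks equally
    # Derive in-neighbours from the outgoing edge lists, since the
    # formula sums over pages that link *to* v.
    in_neighbours = {v: [] for v in graph}
    for u, targets in graph.items():
        for v in targets:
            in_neighbours[v].append(u)
    for _ in range(iterations):
        new_pr = {}
        for v in graph:
            rank_sum = sum(pr[u] / len(graph[u]) for u in in_neighbours[v])
            new_pr[v] = (1 - d) / n + d * rank_sum
        pr = new_pr                             # synchronous update
    return pr
```

Note the synchronous update: each iteration uses only the previous iteration's ranks, matching the `newPR` dictionary in the pseudocode.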
Worked example — a 3-page graph with edges A→B, A→C, B→C, C→A. $d = 0.85$, $n = 3$.
Initial: $PR(A) = PR(B) = PR(C) = 1/3$
Iteration 1:
- $PR(A) = 0.05 + 0.85 \times (PR(C)/1) = 0.05 + 0.85 \times 0.333 = 0.333$
- $PR(B) = 0.05 + 0.85 \times (PR(A)/2) = 0.05 + 0.85 \times 0.167 = 0.192$
- $PR(C) = 0.05 + 0.85 \times (PR(A)/2 + PR(B)/1) = 0.05 + 0.85 \times (0.167 + 0.333) = 0.475$
C has the highest rank — it receives links from both A and B.
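The iteration-1 arithmetic can be checked with exact fractions, avoiding rounding. A sketch using Python's standard-library `Fraction` (note that $PR(A)$ happens to remain exactly $1/3$ on this step, since $0.05 + 0.85/3 = 1/3$):

```python
from fractions import Fraction

d = Fraction(85, 100)
n = 3
pr = {v: Fraction(1, 3) for v in "ABC"}   # initial ranks

new_pr = {
    "A": (1 - d) / n + d * pr["C"],                   # in-link: C (out-degree 1)
    "B": (1 - d) / n + d * pr["A"] / 2,               # in-link: A (out-degree 2)
    "C": (1 - d) / n + d * (pr["A"] / 2 + pr["B"]),   # in-links: A and B
}
print({v: float(x) for v, x in new_pr.items()})
# A ≈ 0.333, B ≈ 0.192, C = 0.475 — matching the trace above
```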
PageRank corresponds to the stationary distribution of a Markov chain over the graph. With the damping factor:
- Every page has a non-zero probability of being visited
- The process is guaranteed to converge to a unique stationary distribution
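The Markov-chain view can be made concrete with power iteration on the "Google matrix" $G = dM + \frac{1-d}{n}J$, where $M$ is the column-stochastic link matrix and $J$ is the all-ones matrix. A sketch assuming NumPy is available, hard-coding the 3-page example from above:

```python
import numpy as np

# Column-stochastic link matrix: M[i][j] = 1/outdeg(j) if j links to i.
# Vertex order: A, B, C; edges A->B, A->C, B->C, C->A.
M = np.array([
    [0.0, 0.0, 1.0],   # links into A: from C
    [0.5, 0.0, 0.0],   # links into B: from A (out-degree 2)
    [0.5, 1.0, 0.0],   # links into C: from A and B
])
d, n = 0.85, 3
G = d * M + (1 - d) / n * np.ones((n, n))   # damping mixes in teleportation

v = np.full(n, 1 / n)      # start from the uniform distribution
for _ in range(100):
    v = G @ v              # power iteration converges to the stationary vector
print(v)
```

Because every entry of $G$ is positive, the Perron–Frobenius theorem guarantees a unique stationary distribution, which is why damping ensures convergence.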
| Limitation | Explanation |
|---|---|
| Link spam | Pages can inflate their rank by creating many inbound links |
| Dangling nodes | Pages with no outgoing links cause rank to leak from the system; a common fix treats them as linking to every page |
| Not semantic | Measures link structure, not content quality |
| Dynamic web | Web changes constantly; PageRank must be recomputed |
EXAM TIP: In VCE exams, PageRank questions typically ask you to interpret the formula, trace one or two iterations, or justify why a node has a high/low rank. Focus on understanding the formula components: damping factor, in-neighbours, and out-degrees.