PageRank Algorithm - StudyPulse
Algorithmics (HESS)
01 May 2026

PageRank Algorithm: Estimating Node Importance

What Is PageRank?

PageRank is an algorithm developed by Larry Page and Sergey Brin (co-founders of Google) for estimating the importance (rank) of nodes in a directed graph based on the structure of incoming links.

Core idea: A page is important if many important pages link to it. This is a recursive notion of importance.

KEY TAKEAWAY: PageRank models a “random surfer” who follows links at random. A page’s rank is proportional to the probability that the random surfer is visiting it at any given moment.


Problem Specification

Input: A directed graph $G = (V, E)$ where nodes are web pages and directed edges represent hyperlinks
Output: A score $PR(v)$ for each vertex, representing its relative importance
Interpretation: Higher $PR(v)$ → more “important” node
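Before looking at the formula, it helps to fix a concrete input representation. A minimal sketch in Python, assuming a dictionary-of-lists adjacency representation (the graph and variable names here are illustrative, matching the three-page example used in the worked example below):

```python
# A directed graph as an adjacency mapping: node -> list of nodes it links to.
# This mirrors the A->B, A->C, B->C, C->A example in the worked example below.
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

# Out-degree of a vertex is just the length of its adjacency list.
out_degree = {v: len(links) for v, links in graph.items()}

# In-neighbours are found by inverting every edge.
in_neighbours = {v: [] for v in graph}
for u, links in graph.items():
    for v in links:
        in_neighbours[v].append(u)

print(out_degree)     # {'A': 2, 'B': 1, 'C': 1}
print(in_neighbours)  # {'A': ['C'], 'B': ['A'], 'C': ['A', 'B']}
```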


The PageRank Formula

For a vertex $v$ in a graph with $n$ vertices:

$$PR(v) = \frac{1-d}{n} + d \sum_{u \in \text{in}(v)} \frac{PR(u)}{\text{out}(u)}$$

Where:
- $d$ is the damping factor (typically 0.85) — probability of following a link
- $1 - d$ = probability of jumping to a random page (teleportation)
- $\text{in}(v)$ = set of vertices with edges pointing to $v$
- $\text{out}(u)$ = out-degree of vertex $u$ (number of links from $u$)
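Plugging numbers into the formula makes the components concrete. A one-step sketch, assuming the three-page graph from the worked example below (A→B, A→C, B→C, C→A) and equal starting ranks:

```python
d = 0.85   # damping factor
n = 3      # number of vertices

# Equal initial ranks of 1/n for the graph A->B, A->C, B->C, C->A.
PR = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}

# One application of the formula for vertex C:
# in(C) = {A, B}, out(A) = 2, out(B) = 1.
pr_C = (1 - d) / n + d * (PR["A"] / 2 + PR["B"] / 1)
print(round(pr_C, 3))  # 0.475
```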


Intuition: The Random Surfer Model

Imagine a web surfer who:
- With probability $d = 0.85$: clicks a random link on the current page
- With probability $1 - d = 0.15$: jumps to a completely random page

The PageRank of a page is the long-run fraction of time the surfer spends on that page.
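The random-surfer interpretation can be checked by simulation: walk the graph for many steps and record how often each page is visited. A Monte Carlo sketch, assuming the three-page graph from the worked example below (the seed and step count are arbitrary choices for reproducibility):

```python
import random

# Simulate a random surfer on the graph A->B, A->C, B->C, C->A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(graph)
d = 0.85

random.seed(42)            # fixed seed so the run is repeatable
visits = {p: 0 for p in pages}
current = "A"
steps = 100_000
for _ in range(steps):
    visits[current] += 1
    if random.random() < d and graph[current]:
        current = random.choice(graph[current])  # follow a random link
    else:
        current = random.choice(pages)           # teleport anywhere

# Long-run visit frequencies approximate the PageRank scores;
# C and A tend to come out highest, B lowest.
freq = {p: visits[p] / steps for p in pages}
print(freq)
```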


Iterative Computation

PageRank is computed iteratively until convergence:

ALGORITHM PageRank(graph, d, iterations)
    n ← number of vertices
    PR ← Dictionary (all vertices mapped to 1/n)   // initialise equally

    for iter ← 1 to iterations do
        newPR ← Dictionary
        for each vertex v in graph do
            rank_sum ← 0
            for each u in inNeighbours(v) do
                rank_sum ← rank_sum + PR[u] / outDegree(u)
            end for
            newPR[v] ← (1 - d) / n + d * rank_sum
        end for
        PR ← newPR
    end for
    return PR
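The pseudocode translates almost line for line into Python. A sketch assuming the dictionary-of-lists graph representation (the function name `pagerank` is ours, not part of any standard library):

```python
def pagerank(graph, d=0.85, iterations=20):
    """Iterative PageRank following the pseudocode above.

    graph maps each vertex to the list of vertices it links to.
    """
    n = len(graph)
    pr = {v: 1 / n for v in graph}   # initialise equally

    # Precompute in-neighbours once, by inverting the edges.
    in_nb = {v: [] for v in graph}
    for u, links in graph.items():
        for v in links:
            in_nb[v].append(u)

    for _ in range(iterations):
        new_pr = {}
        for v in graph:
            rank_sum = sum(pr[u] / len(graph[u]) for u in in_nb[v])
            new_pr[v] = (1 - d) / n + d * rank_sum
        pr = new_pr
    return pr

# Run on the three-page graph from the worked example.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print({v: round(r, 3) for v, r in ranks.items()})
```

Note that the update is synchronous: each sweep reads only the previous sweep's ranks (`pr`) and writes into a fresh dictionary (`new_pr`), exactly as the pseudocode's `newPR` does.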

Worked Example

3-page graph: A→B, A→C, B→C, C→A. $d = 0.85$, $n = 3$.

Initial: $PR(A) = PR(B) = PR(C) = 1/3$

Iteration 1:
- $PR(A) = 0.05 + 0.85 \times (PR(C)/1) = 0.05 + 0.85 \times 0.333 = 0.333$
- $PR(B) = 0.05 + 0.85 \times (PR(A)/2) = 0.05 + 0.85 \times 0.167 = 0.192$
- $PR(C) = 0.05 + 0.85 \times (PR(A)/2 + PR(B)/1) = 0.05 + 0.85 \times (0.167 + 0.333) = 0.475$

C has the highest rank — it receives links from both A and B.


Correctness and Convergence

PageRank corresponds to the stationary distribution of a Markov chain over the graph. With the damping factor:
- Every page has a non-zero probability of being visited
- The process is guaranteed to converge to a unique stationary distribution
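Convergence can be observed directly by iterating until the ranks stop changing between sweeps. A sketch assuming the same dictionary-of-lists representation (the function name, tolerance, and iteration cap are illustrative choices):

```python
def pagerank_converged(graph, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate until the total (L1) change between sweeps falls below tol."""
    n = len(graph)
    pr = {v: 1 / n for v in graph}
    in_nb = {v: [] for v in graph}
    for u, links in graph.items():
        for v in links:
            in_nb[v].append(u)

    for i in range(max_iter):
        new_pr = {v: (1 - d) / n
                     + d * sum(pr[u] / len(graph[u]) for u in in_nb[v])
                  for v in graph}
        change = sum(abs(new_pr[v] - pr[v]) for v in graph)
        pr = new_pr
        if change < tol:
            return pr, i + 1   # converged: ranks plus sweeps used
    return pr, max_iter

ranks, iters = pagerank_converged({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print(iters, {v: round(r, 4) for v, r in ranks.items()})
```

Because the change shrinks by roughly a factor of $d$ each sweep, convergence to a tight tolerance takes only a modest number of iterations.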


Limitations

| Limitation | Explanation |
| --- | --- |
| Link spam | Pages can inflate their rank by creating many inbound links |
| Dangling nodes | Pages with no outgoing links can absorb rank (handled by teleportation) |
| Not semantic | Measures link structure, not content quality |
| Dynamic web | The web changes constantly; PageRank must be recomputed |
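One common way to handle the dangling-node limitation (among several used in practice) is to treat a page with no out-links as if it linked to every page, so its rank is redistributed rather than absorbed. A sketch assuming the dictionary-of-lists representation (the helper name `fix_dangling` is ours):

```python
def fix_dangling(graph):
    """Return a copy of graph where every vertex with no out-links
    is treated as linking to every vertex (one common dangling-node fix)."""
    all_nodes = list(graph)
    return {v: (links if links else list(all_nodes))
            for v, links in graph.items()}

g = {"A": ["B"], "B": []}   # B is a dangling node
print(fix_dangling(g))      # {'A': ['B'], 'B': ['A', 'B']}
```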

EXAM TIP: In VCE exams, PageRank questions typically ask you to interpret the formula, trace one or two iterations, or justify why a node has a high/low rank. Focus on understanding the formula components: damping factor, in-neighbours, and out-degrees.
