A Friendly Introduction to the PageRank Algorithm

Aloha Churchill
May 7, 2021


Introduction

Historical Context

The PageRank algorithm was the original algorithm that Google used to rank the web pages displayed to a user. For example, if a user searched for the term “linear algebra”, Google would use PageRank to choose which web pages to display first, in order of their importance. PageRank is built on the insight that web pages can be ranked by the number of links pointing to them. The mathematics behind PageRank guides page search algorithms to this day, and web page ranking was undoubtedly one of the key factors in Google’s success.

Overview of the PageRank Algorithm

The purpose of the PageRank algorithm is to give each page a score that determines where it appears in the search results. PageRank was first proposed as a solution to the following scenario: imagine a person randomly surfing the internet, clicking links on each page they visit. What is the probability that the person lands on a certain page? Naturally, higher-ranked pages have a higher probability of being visited because more pages link to them.

Mathematical Formulation

A Simplistic Example of the PageRank Algorithm

Imagine that the internet consists of three web pages: A, B, C. Figure 1 shows how these pages are linked together: A links to B and C, B links to C, and C links to A and B.

Figure 1. Simplified Web. The arrows away from the nodes are outbound links and links pointing toward nodes are inbound links.

To solve the PageRank (PR) for each page, we take an iterative approach. Let the initial PageRank, denoted by PR(u) where u is a page on the web, be equal for all pages such that

PR(A) = PR(B) = PR(C) = 1/3

Since there are three pages, each page is initialized to 1/3 and, in general, if there are k pages, each page can be initialized to 1/k.

The PageRank of each page depends on the number of inbound and outbound links for each page u. For example, the PageRank of A at each iteration is

PR(A) = 0 · PR(B)/L(B) + 1 · PR(C)/L(C)

L(u) denotes the number of outbound links that a page contains. The coefficient of 0 indicates that page B does not contain an outbound link to page A, while the coefficient of 1 indicates that page C does have an outbound link to page A.

The general formula for calculating the simplified page rank of a page u in a set of U pages on the web is

PR(u) = Σ_{v ∈ U} c(v) · PR(v)/L(v)

where c(v) is a coefficient of 1 if there is an outbound link to page u from page v and 0 if there is not.
Figure 2. Pseudocode for the simple page search algorithm

Running this update iteratively, the stable page ranks are approximately 0.22, 0.33, and 0.44 for pages A, B, and C respectively. The pages are then sorted from highest page rank to lowest, corresponding to the probability of landing on each page in the random surfer model. Therefore, in this scenario, the search engine would display C, then B, then A to a user.
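This simple iteration can be sketched in a few lines of Python (a minimal sketch of my own; the `links` dictionary encodes the three-page web of Figure 1, where A links to B and C, B links to C, and C links to A and B):

```python
# Outbound links of the three-page web from Figure 1:
# A links to B and C, B links to C, C links to A and B.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A", "B"]}
pages = list(links)

# Initialize every page to 1/k, where k is the number of pages.
pr = {p: 1 / len(pages) for p in pages}

# Repeatedly apply PR(u) = sum over pages v linking to u of PR(v) / L(v).
for _ in range(50):
    pr = {u: sum(pr[v] / len(links[v]) for v in pages if u in links[v])
          for u in pages}

print({u: round(r, 2) for u, r in pr.items()})
# {'A': 0.22, 'B': 0.33, 'C': 0.44}
```

After enough iterations the ranks stop changing, and sorting them from highest to lowest gives the display order C, B, A.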

Markov Processes and the PageRank Algorithm

This iterative approach for solving the PR for each page can be modeled as a Markov chain. A Markov chain is a system that transitions from one state to another, where the future state depends only on the current state. A Markov chain follows the formula

v_new = T · v_current

where v_new is a state vector representing the next state, v_current is a state vector representing the current state, and T is a matrix of transition probabilities called the transition matrix.

The state vector is said to converge, or reach equilibrium, when

v_new = v_current

that is, when the vector ceases to change under repeated multiplication by T.

In the PageRank algorithm, we can construct a transition matrix T based on the transition probabilities defined by links from one page to another. The pages are the possible different states and the state vector is the page rank vector with each element corresponding to the PR score of each page. The PageRank algorithm can be modeled in this way because the future page that the random internet surfer will navigate to only depends on the current page and the links contained in it. To better understand, let us return to the example where there are only three web pages on the internet.

For each page or node, let us construct a transition vector representation of the outbound links. Each vector takes the form

[a, b, c]^T

where a, b, c equal one if there is an outbound link to A, B, or C respectively, and zero otherwise.

Thus the un-normalized vectors for pages A, B, C are as follows:

A: [0, 1, 1]^T    B: [0, 0, 1]^T    C: [1, 1, 0]^T

To create a transition matrix of probabilities, we must normalize each transition vector so that its elements sum to one. The normalized vectors become

A: [0, 1/2, 1/2]^T    B: [0, 0, 1]^T    C: [1/2, 1/2, 0]^T

The transition matrix of the graph is the matrix whose columns are the normalized vectors, and is accordingly

T = | 0    0    1/2 |
    | 1/2  0    1/2 |
    | 1/2  1    0   |

Transition matrix T

Let the original rank/state vector, the vector which shows the relative importance of pages, be initialized so that each page is of equal rank. In this representation we have

v_current = [1/3, 1/3, 1/3]^T

Using Markov chains, we find the next state

v_new = T · v_current = [1/6, 1/3, 1/2]^T

We then update v_current = v_new and repeat the calculation until the rank vector has reached equilibrium. In this example, the page rank vector converges to approximately

v = [0.22, 0.33, 0.44]^T

which again ranks page C the highest, then B, then A.
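The same fixed point can be found with NumPy by repeatedly applying the transition matrix (a short sketch; T is the matrix constructed above, with pages in the order A, B, C):

```python
import numpy as np

# Transition matrix from above: column j holds the normalized
# outbound links of page j (order A, B, C).
T = np.array([[0.0, 0.0, 0.5],    # inbound links to A
              [0.5, 0.0, 0.5],    # inbound links to B
              [0.5, 1.0, 0.0]])   # inbound links to C

v = np.full(3, 1 / 3)             # equal initial ranks
for _ in range(50):
    v = T @ v                     # v_new = T v_current

print(np.round(v, 2))             # [0.22 0.33 0.44]
```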

PageRank with Damping Factor and Jumping

Problems arise when we use the simple page ranking algorithm. For example, if a page is disconnected from every other page, the simple algorithm will not converge correctly.

Figure 3. Web graphs that do not converge correctly with simple PageRank

To remedy this problem, PageRank introduces two ideas: the damping factor and jumping. The goal of jumping is to connect every node, or page, of the web to every other node. In essence, if the random surfer lands on a page that links to no other page, they can jump to an arbitrary page. The damping factor, denoted by 𝛽, is essentially the fraction of time the random web surfer spends clicking on links on the current page, and 1−𝛽 is the fraction of time the surfer teleports, or randomly jumps, to other pages in the network. The damping factor also lowers the probability that the random surfer keeps clicking links indefinitely. Going back to the original example presented in this article, the page rank for each page u in a collection U of N pages becomes

PR(u) = (1−𝛽)/N + 𝛽 Σ_{v ∈ U} c(v) · PR(v)/L(v)

where c(v) is a coefficient of 1 if there is an outbound link to page u from page v and 0 if there is not, L(v) is the number of outbound links page v contains, and N is the number of pages in the web.

The page rank vector, v, is the solution to the equation

v = (1−𝛽)/N · u + 𝛽 M v

where u is a vector of all ones and M is the matrix with entries l(u_i, u_j), the “ratio between the number of links outbound from page j to page i to the total number of outbound links of page j.”

The damping constant is found to work best when set to 0.85. This is discussed in more detail later in this paper.

Adding damping and jumping to the formulation is closely tied to Markov theory. Markov theory states that for any starting page rank vector, applying the power method with a transition matrix T converges to a unique positive vector named the stationary vector if T is stochastic, irreducible, and aperiodic. The new transition matrix T with damping and jumping satisfies all of these properties. First, stochasticity or randomness is guaranteed because the formulation of T ensures that each column adds to one since it is normalized over the number of outbound links for each page. In other words, every outbound link on a page has an equal chance of being clicked on.

However, stochasticity alone is not sufficient to ensure convergence, so the matrix must be modified further. To make the transition matrix irreducible and aperiodic, we adjust T to be of the form

T ← 𝛽 T + (1−𝛽)/N · J

where 𝛽 is the damping factor between 0 and 1, N is the number of pages in the network, and J is the N×N matrix of all ones.

T is irreducible because every page is connected to every other page, i.e. every element of the matrix is positive. Furthermore, T is aperiodic because every diagonal element is positive; aperiodicity means the chain does not return to any state only at fixed, regular intervals. With these properties, the constructed transition matrix guarantees convergence to a unique stationary vector.
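These properties are easy to verify numerically for the small example. The sketch below (my own construction, reusing the three-page T from earlier) builds the damped matrix and checks that it is stochastic with strictly positive entries:

```python
import numpy as np

beta = 0.85
T = np.array([[0.0, 0.0, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 1.0, 0.0]])
N = T.shape[0]

# Blend link-following with a uniform jump to any of the N pages.
T_damped = beta * T + (1 - beta) / N * np.ones((N, N))

# Stochastic: every column still sums to one.
assert np.allclose(T_damped.sum(axis=0), 1.0)
# Irreducible and aperiodic: every entry, including the diagonal,
# is strictly positive.
assert (T_damped > 0).all()
```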

Alternative Methods to Calculate the PageRank

Power Iteration

This paper has already introduced an iterative approach to solving the page rank algorithm, which uses the idea of a Markov chain to obtain the final page rank vector. To briefly review, the iterative approach initializes all the ranks of the pages in the page rank vector to the same value, 1/N, where N is the number of pages. For each iteration, the resulting page rank vector, v, becomes

v_new = (1−𝛽)/N · u + 𝛽 T v_current

where u is a vector of size N consisting of all ones.

T is the transition matrix, equivalent to

T = A K⁻¹

where A is the adjacency matrix and K is a diagonal matrix whose diagonal entries are the number of outbound links from each page.

This process is repeated until convergence.
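Put together, the power iteration with damping can be sketched as follows (a sketch with function and parameter names of my own; the update line implements v_new = (1−𝛽)/N · u + 𝛽 T v_current):

```python
import numpy as np

def pagerank(T, beta=0.85, eps=1e-8, max_iter=1000):
    """Power iteration: v_new = (1 - beta)/N * u + beta * T @ v_current."""
    N = T.shape[0]
    u = np.ones(N)
    v = np.full(N, 1 / N)              # initialize all ranks to 1/N
    for _ in range(max_iter):
        v_new = (1 - beta) / N * u + beta * (T @ v)
        if np.linalg.norm(v_new - v) < eps:
            return v_new               # converged
        v = v_new
    return v

T = np.array([[0.0, 0.0, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 1.0, 0.0]])
print(pagerank(T))                     # approximately [0.2339 0.3333 0.4327]
```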

Algebraic

As the number of iterations grows towards infinity, the page rank vector reaches the steady state

v = (1−𝛽)/N · u + 𝛽 T v

whose unique solution is given by

v = (1−𝛽)/N · (I − 𝛽T)⁻¹ u

where I is the identity matrix. This simple result is due to the Perron-Frobenius theorem, which states that for a nonnegative, regular matrix T:

  • there is an eigenvalue, λ_PF, the Perron-Frobenius (PF) eigenvalue, that is real and positive
  • for any other eigenvalue λ, |λ| < λ_PF
  • there is a unique (up to scaling) invariant vector v with Tv = λ_PF v.

In Markov chains, the transition matrix T is regular and nonnegative. Since T is stochastic, meaning that each entry is non-negative and each column sums to one, 1 is the PF eigenvalue, with

T v = v

which, of course, corresponds to the equilibrium distribution of the Markov chain. Since the PF eigenvalue is dominant, the iteration always converges to the unique invariant distribution.

This result applies beautifully to the page rank matrix, as the transition matrix is positive and stochastic and 1 is the PF eigenvalue. A short proof of the above solution is as follows:

v = (1−𝛽)/N · u + 𝛽 T v          (steady-state definition of v)
(I − 𝛽T) v = (1−𝛽)/N · u          (I the identity matrix)
v = (1−𝛽)/N · (I − 𝛽T)⁻¹ u

The inverse exists because 𝛽 < 1 while λ = 1 is the dominant eigenvalue of T by the Perron-Frobenius theorem, so I − 𝛽T is nonsingular.
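For a small network, this closed-form solution can be computed with a linear solver instead of iterating (a sketch on the three-page example; for a web-scale matrix the inverse would of course never be formed explicitly):

```python
import numpy as np

beta, N = 0.85, 3
T = np.array([[0.0, 0.0, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 1.0, 0.0]])
u = np.ones(N)

# Solve (I - beta*T) v = (1 - beta)/N * u directly.
v = np.linalg.solve(np.eye(N) - beta * T, (1 - beta) / N * u)
print(np.round(v, 4))   # [0.2339 0.3333 0.4327]
```

The solution agrees with the fixed point that the power iteration approaches.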

Coding the PageRank Algorithm

I was curious about the implementation of the PageRank algorithm in code and ended up coding it in the Python programming language because of its access to the powerful NumPy library. The code is included in the appendix through a link to a GitHub repository. The code is commented on GitHub; however, I will briefly go over my implementation.

In the initial implementation, the code first takes a user-defined constant, N, specifying how many pages to include in the algorithm. Each of the N pages is then assigned a vector in which each element is randomly set to one or zero, indicating whether the page has an outbound link to the page at that index; this vector is then normalized. The transition matrix T is then constructed using the normalized page link vectors as columns. I used the iterative method to implement the PageRank algorithm and included a convergence checker which terminates the Markov process if

‖v_new − v_current‖ < 𝜖

where 𝜖 is a small, user-defined epsilon and the magnitude is the Euclidean norm

‖x‖ = sqrt(x_1² + x_2² + … + x_N²)

As a sanity check, the example previously covered in this paper was checked. For an epsilon of 0.01 and damping factor of 0.85, the algorithm took 7 iterations to converge to the vector [0.23366914, 0.33333333, 0.43299752] which corresponds to the correct solution for this configuration.

Discussion and Conclusions

Exploring the Influence of the Damping Factor

To study the time complexity and convergence of the PageRank algorithm, multiple parameters must be taken into account. The first is the damping factor. The most common damping factor found in the literature on the PageRank algorithm is 𝛽 = 0.85, which is also the value originally published by Brin and Page.

In my simple implementation of the PageRank algorithm, I found it tricky to test different values of the damping factor against the number of iterations the algorithm took to converge. Using my original convergence checker, which compared the magnitude of the difference between the previous and current PageRank vectors, I found that the number of iterations increased linearly with the value of 𝛽, as shown in the figure below.

Figure 4. Number of iterations versus damping factor for 𝜖 = 0.00001 and size of network N = 10 (left) and N= 100 (right)

I created another convergence checker which terminated the page rank algorithm when the order of the pages was “close” to the “correct order”. In this case, the “correct order” of the pages corresponded to the ordering of the pages from highest to lowest rank with a damping factor of 𝛽=0.85 after 100 iterations. A “close” match happened when the Euclidean norm of the difference between the correct order and current order was smaller than a constant 𝜖_big.

Figure 5. Iterations until convergence based on order (network size = N): Left (N = 30, 𝜖_big = 50), Center (N = 50, 𝜖_big = 10), Right (N = 50, 𝜖_big = 50)

The iterations in the graphs above never exceed 100: if a damping value sorts the pages into a different order, the algorithm would never terminate, so it is cut off at 100 iterations. In other words, the damping values showing 100 iterations indicate that the pages were sorted in a different order. As the size of N increases, it becomes less likely that the pages will be in the same order for different damping factors. Inversely, as 𝜖_big increases, the ordering requirement becomes less stringent, and more damping values will converge. This approach is flawed as well, because the damping values that converge naturally cluster around 𝛽=0.85, which is assumed to be the damping value corresponding to the “correct” ordering of pages. However, the value 𝛽=0.85 is not arbitrary: larger values require more iterations, and numeric instability arises when 𝛽 is too close to one. 𝛽=0.85 has been tested extensively and seems to be a “sweet spot” for the PageRank algorithm.

Time Complexity of PageRank

Each iteration of the algorithm with a network of size N requires a matrix multiplication, which takes O(N²) time. However, since the transition matrix T is a sparse matrix, meaning that it contains many zeroes, the time complexity per iteration can be reduced to O(k), where k is the number of non-zero entries in T. In my implementation, an approximately O(N²) overall time complexity was observed, as shown by the graph below. This time estimate used the initial convergence checker discussed in the previous sections.
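The O(k) behavior comes from storing only the non-zero transitions. A minimal sketch of my own using plain adjacency lists (pages indexed 0–2 as A, B, C in the running example) shows how each iteration touches only the k actual links, while the dense jump term is added as a constant rather than as an N×N matrix:

```python
beta = 0.85
# Adjacency lists store only the k actual links
# (pages indexed 0..2 as A, B, C in the running example).
out_links = {0: [1, 2], 1: [2], 2: [0, 1]}
N = len(out_links)

v = [1 / N] * N
for _ in range(100):
    # The uniform jump term (1 - beta)/N is a constant added to
    # every entry, so it never requires a dense multiply.
    v_new = [(1 - beta) / N] * N
    for page, targets in out_links.items():
        share = beta * v[page] / len(targets)
        for t in targets:          # one update per non-zero entry: O(k)
            v_new[t] += share
    v = v_new

print([round(x, 4) for x in v])    # [0.2339, 0.3333, 0.4327]
```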

Figure 6. Size of network versus time.

Limitations of PageRank

While PageRank revolutionized search engines, it is not perfect, and today more complex algorithms are used by Google and other providers to optimize their search engines. The first limitation is that PageRank does not take time into account. In essence, old pages and new pages are treated the same; in fact, new pages are often at a disadvantage because fewer sources link to them. This poses a problem for areas such as news, where recent events are the most pertinent to users, and thus Google uses a different approach to provide news. Secondly, PageRank cannot process complex search queries. While it looks for keywords, it cannot process an entire sentence. Instead, Google uses natural language processing to make sense of these queries and filter search results.

Further Applications of the PageRank Algorithm

The PageRank algorithm is hardly limited to web networks and has in fact been used in fields from medicine to ecology. For example, PageRank has been used in sports to determine the best athletes: the athletes are the nodes, or pages, of the network, and matches between them serve as the links. Another area where PageRank has been applied is debugging, where a tool called MonitorRank analyzes the structure of the code and returns the most likely bugs that may be occurring.

Github

https://github.com/Aloha-Churchill/page_rank

References

Arians, Jamie. “The Google PageRank Algorithm.” College of William and Mary. College of William and Mary. Accessed April 27, 2021. https://cklixx.people.wm.edu/teaching/math410/google-pagerank.pdf.

Brin, Sergey, and Lawrence Page. “The Anatomy of a Large-Scale Hypertextual Web Search Engine.” The Anatomy of a Search Engine. Stanford University, 1998. http://infolab.stanford.edu/~backrub/google.html.

Knill, Oliver. “Lecture 34: Perron Frobenius Theorem.” Harvard. Harvard University, 2011. http://people.math.harvard.edu/~knill/teaching/math19b_2011/handouts/lecture34.pdf.

“Limitations of PageRank.” Networks Course Blog for INFO 2040/CS 2850/Econ 2040/SOC 2090. Cornell University, October 28, 2015. https://blogs.cornell.edu/info2040/2015/10/28/limitations-of-pagerank/.

Linear Algebra — Introduction to PageRank. YouTube. YouTube, 2018. https://www.youtube.com/watch?v=F5fcEtqysGs.

“Markov Chain.” Wikipedia. Wikimedia Foundation, April 27, 2021. https://en.wikipedia.org/wiki/Markov_chain.

“More than Just a Web Search Algorithm: Google’s PageRank in Non-Internet Contexts.” More than just a Web Search algorithm: Google’s PageRank in non-Internet contexts: Networks Course blog for INFO 2040/CS 2850/Econ 2040/SOC 2090. Cornell University, November 3, 2014. https://blogs.cornell.edu/info2040/2014/11/03/more-than-just-a-web-search-algorithm-googles-pagerank-in-non-internet-contexts/.

“PageRank.” Wikipedia. Wikimedia Foundation, April 17, 2021. https://en.wikipedia.org/wiki/PageRank.

Raluca Tanase, Remus Radu. “Lecture #3: PageRank Algorithm — The Mathematics of Google Search.” PageRank Algorithm — The Mathematics of Google Search. Cornell University, 2009. http://pi.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture3/lecture3.html.

Srivastava, Atul & Garg, Rakhi & Mishra, P.K.. (2017). Discussion on Damping Factor Value in PageRank Computation. International Journal of Intelligent Systems and Applications. 9. 19–28. 10.5815/ijisa.2017.09.03.

Varagouli, Erika. “Everything You Need to Know about Google PageRank (Why It Still Matters in 2021).” Semrush Blog, December 23, 2021. https://www.semrush.com/blog/pagerank/.
