Graph Theory + Implementation

Sheryl Li
14 min readJan 6, 2022

--

This article will take you from fundamental definitions to various algorithms, including dfs,bfs, dijkstras, and minimum spanning trees. A version of this article with linked outline (and better format overall) is also accessible on notion here. 💙

Why graphs?

graphs can be used to model various relationships in many fields (information systems, biology, etc.) and has numerous applications. DNA sequencing 🧬, social media 📱, recommendation engines, google maps 📍, and soo many more.

let’s dive in. 💻

Graphs/Definitions

fundamentally, nodes and edges/vertices make up a graph. A path gets you from one point to another; from node a to node b through the edges of the graph, and how many edges it takes to get from one node to another is the length of the path.

If the first and last node of a path is the same, then its a cycle- you’ve started from the same node that you ended up. It’s like running around the track full circle. 🏃🏻

A graph is connected if you can get from any node to any other node- meaning, if there is a path between any two nodes of the graph. In any graph, we can sort of group the connected parts of a graph, where each group is called a component.

Two nodes are neighbors/adjacent if theres an edge between them, and the degree of a node is the number of its neighbors. For example, for node 2 in the figure above, it has 3 neighbors: 3,4, and 5. Because it has 3 neighbors, it has a degree of 3.

A graph is directed if it’s edges can only get to a node from only one direction- visually, you can represent this with arrows.

Honestly, directed edges are like one-sided love, its not reciprocated 💔. In the figure above you sort of see a love triangle- 2 can only get to 4, 4 only to 5, and 5 only to 2; and 3 is kind of just there 😆. Back with the non reciprocated feelings, if 5 doesn’t have a directed edge to 4, it just won’t reach 4.

Graph representation

There are different ways to represent graphs, which vary based on what you’re working on. There’s two main ways to represent a graph:

  • adjacency list
  • adjacency matrix

Adjacency list:

each node x in the graph is assigned an adjacency list that consists of nodes to which there is an edge from x. It consists of an array of linked lists, one for each edge.

A nice way of storing adjacency lists is to declare an array of vectors

vector<int> adj[n]

For the figure above, you can store that as

adj[1].push_back(2);
adj[2].push_back(3);
adj[3].push_back(1);
adj[3[.push_back(4);

and so on. If the graph is undirected, it can be stored in a similar way except each edge will be added to both directions, same with weighted graphs- for this, you can use a pair data structure.

vector<pair<int,int>> adj[n];

For example, this can be stored as

adj[2].push_back({4,5}); //going to node 4 with a weight of 5
adj[3].push_back({5,6}); //going to node 5 with a weight of 6 from node 3
adj[4].push_back({5,2}); //going to node 5 with a weight of 2 from node 4
//basically:
//adj[where you're coming from].push_back({node you want, weight you're coming with})

With adjacency lists, we can find the nodes that we can move from given a node via an edge

for (auto u : adj[s]){
//process node u
}

Adjacency matrix:

its a 2d array thats more edge focused, and can be stored as an array

int adj[n][n] //indicates if edge from node a to node b exists

For example,

However, the matrix has n² elements so it should not be used if the graph is large.

Alright 😋- now that we know the basics, let’s learn some cool algorithms and implementations! There are many algorithms, if its one thing you must get out of this article, its DFS / BFS

DFS 🔎

Depth first search is a way to traverse the graph. We start at a node, and go to other nodes that are connected from the initial node using the edges. It’ll keep going as long as there are other reachable nodes from that starting node. It’s important to keep track of nodes that we’ve already visited so that we don’t revisit the same nodes.

DFS is recursive because once you pick a starting node, you keep visiting its neighbors until there are non reachable left.

Implementation:

vector<int> adj[N];
bool visited[N];
void dfs(int s) {
if (visited[s]) return;
visited[s] = true;
//process node s
for( auto u: adj[s]) {
dfs(u);
}
}

Connected Components

Connected components were introduced above. Let’s look at a code implementation intuitively. Reframing connected components into a direct implementation problem:

Given n cities and m roads between them, we want to construct a new road so that there is a route between any 2 cities (meaning, you can get from city to city via the edges)

We are given n and m for the first line, then m lines after that describing the roads- a and b (meaning an edge exists between a and b )

We want to print the number of roads required to connect the city! Then, also print which cities the roads should be between. Here’s a sample case:

Hmm. Given cities and nodes, I first definitely want to make sure if the cities even connected first. Why make more roads when cities are already connected? To make the cities all connected, that means there should only be 1 connected component in the end.

We want to find the various connected components, and state an edge that would connect the components. This means you can start off initially by finding the various connected components, then find any nodes from each component that would connect the components together to form 1 fully connected graph.

Implementation:

const int mn = 1e5+10;int n,m, ans, rep[mn];
vector<int> adj_list[mn];
bool visited[mn];
//normal DFS
void dfs(int node)
{
visited[node[ = true;
for(int u: edj_list[node])
if(!visited[n])
dfs(u);
}
//counts number of connected components
int count_components()
{
int count = 0;
for(int i=1; i<=n; ++i){
rep[count++]=i;
dfs(i);
}
return count;
}
//general intput and output
int main(){
scanf("%d%d", &N, &M); //take in that first fline input
for(int i=0,u,v; i<M; ++i) //take in edges from 2 nodes
scanf("%d%d", &u, &v), adj_list[u].push_back(v), adj_list[v].push_back(u);
ans = count_components();
printf("%d\\n", ans-1); //number of components -1
for(int i=1; i<ans; ++i)
printf("%d%d\\n", rep[i-1], rep[i]);
return 0;
}

notice that each edge decreases the number of connected components by 0 or 1. So you must add at last C-1 edges, where C is the number of connected components in the input graph.

DSU/UFDS: Bonus

Disjoint Set Union is a data structure, also goes by Disjoint Set Union Find.

disjoint set union basically is a data structure that allows you to add edges to an empty graph and see if two vertices of the graph are connected.

disjoint sets: 2 or more sets with nothing in common

basically: dsu/ufds lets you

  • add edges to graph
  • let you ask things about it, such as:
  • which component that vertex is in
  • are 2 vertices in the same component?
  • size of each component

To go deeper, watch this video:

BFS 🔍

breadth first search is another way of traversing a graph, but goes more width than depth. This means that the vertices that are closer to the current vertex are processed first, rather than following a single path in the graph as long as it finds new reachable parts of the graph (dfs). the graph can be directed or undirected.

With BFS:

  1. Start by visiting one arbitrary vertex, which we call the root.
  2. Visit all the root’s neighbours and push them into a queue.
  3. Pop the first node from the queue, visit all its neighbours and push into the queue those neighbours that have not been previously visited.
  4. Repeat step 3 while the queue is not empty.
  5. Finally, when the queue becomes empty it means all the reachable nodes have been visited, so the algorithm finishes.

if you want to calculate lengths of shortest paths, you need to store the distance and in order to do that, you need the parent node so you can calculate the distance from the parent to the current node.

Implementation

let’s draw out test case:

  • being in the same group as someone means that we have an edge between them (consider friendships to be edges)
  • the problem though, is the size. it can be up to 5x10⁵ -creating an edge between every person would be too much (but we also don’t need to) 🤔
  • you just need the whole group to be connected somehow
  • if you connected 1 to 2, and 2 to 3, theres no point in connecting 1 to 3 because its already connected through 2 (its adding another edge to connect them when theyre already connected:

when we give the news to node 1, how many other people will it reach? It’s the size of the component.

once we find the edges, we just find connected components and for each one, we print its component size.

implementation wise, we visit the initial node first, then visit everything thats distance 1 away from that node. Then, everything thats distance two away then three etc.

(bfs can only be used to find shortest paths if all the edges don’t have weights or all have the same weight. for clarity, the difference between bfs and dfs is that dfs uses a stack, and bfs uses a queue )

vector<long long> edges[500005];
bool vis[500005];
long long a[500005];
long long n, m;
int main() {
cin>>n>>m;
for(long long i=0; i< m; i++){
long long k;
cin>>k;
vector<long long> v(k);
for(long long j =0; j<k; j++){
cin>>v[j];
//its 1 indexed and we dont want that
--v[j];
}
for(long long j=0; j+1<k; j++){
edges[v[j]].push_back(v[j+1]);
edges[v[j+1]].push_back(v[j]);
//makes whole friend group connected without explicitly adding all the edges
}
}
//memset(vis,0,sizeof(vis));
//only in a new component if we haven't visited this
for(long long i=0; i<n; i++){
if(!vis[i]) {
vector<long long> component;
//start at any node and visit whats adjacent to it
queue<long long> q;
q.push(i); //push initial node in the queue
//while we still have stuff in the queue. If we don't, that means we've visited everything already
while(!q.empty()){
ll x = q.front(); //look at beginning of queue
q.pop();
if(vis[x]) continue;
vis[x] = 1;
//store all nodes in component- we need to know the final size. We dont know final size until bfs is done
component.push_back(x);
//for every other possible vertex
for(long long y: edges[x]){
if(!vis[y]){
q.push(y);
}
}
}
for(long long x: component) a[x] = component.size();
}
}
for(long long i=0; i<n; i++) cout<<a[i]<<" ";
}

Bipartiteness 🍃

formal definition:

a graph G= (U,V,E) is bipartite if its vertices can be divided into 2 disjoint sets, U and V, such that every edge (E) connects a vertex in U to 1 in V. Vertex sets U and V are usually termed as the parts of the graph. Equivalently, a graph that does not contain any odd-length cycles is by definition a bipartite graph, whereas bipartite graphs are also equivalent to 2 colorable graphs. Among the various types of graphs, trees, acyclic graphs, and circular graphs with an even number of vertices, are by definition bipartite. A bipartite graph represents a special case of a *k-*partite graph with k = 2. If a bipartite graph is not connected, it may have more than 1 bipartition; in this case, the (U, V, E) notation is helpful in specifying 1 particular bipartition that may be of importance in an application. If |U|=|V||U|=|V|⁠, that is, if the 2 subsets have equal cardinality, then G is called a balanced bipartite graph. If all vertices on the same side of the bipartition have the same degree, then G is called biregular.

read more about bipartite graphs and one of its applications: https://academic.oup.com/gigascience/article/7/4/giy014/4875933

a bipartite graph basically means that you can partition the graph into two sets, such that theres no edges that connect any two nodes in set 1, and same conditions for set two.

The edges can only be between set 1 and set 2.

An easier way to see what the above is saying is by coloring a graph using two colors so that there are no adjacent nodes with the same color.

if a graph is not bipartite, that means theres no way to do this partition (can’t color it)

  • a graph is not bipartite if it has cycle of odd length
  • for some vertex, everything adjacent to it has to be a different color, and everything adjacent to that to has to be a different color

we kind of do the same bfs/dfs traversal, except now we keep track of the color we’re at and alternate. If we ever have an edge that forces us to connect two vertices that are already predetermined (visited) and has the same color, then its not bipartite

Implementation

let’s draw out the tests case and understand the input and outputs:

from looking at the test cases, its pretty clear that we check for bipartiteness using dfs, and if its bipartite we print out the number of nodes for each coloring and the indices of each node (there should only be 2 of those sections because theres only 2 colors)

ll n,m,x,y,z,a,b,c;vector<ll>edges[100005];
bool vis[100005];
bool col[100005]; // color
bool possible = 1;void dfs(ll v, bool c) //vertex and color {
if(vis[v]) return;
vis[v] = 1; //if its not visited, we're visiting it now
col[v] = c // current color
//for edge edge coming from vertex v
for(ll x : edges[v]) {
if(!vis[x]) {
dfs(x, c^1); //traverse and set color to the other color
}else{
if(c == col[x]){
possible= 0; //we have contradiction
}
}
}
}
int solve(int tc = 0){
//read in vertices and edges
cin>>n>>m;
for(ll i = 0; i< m; i++){
ll u ,v; // read in the edge
cin >> u>> v;
--u, --v;
edges[u].push_back(v);
edges[v].push_back(u);
}
//mark components for each vertex
for(ll i=0; i<n; i++){
if(!vis[i]){
dfs(i,0);
}
}
//if its possible
if (possible) {
ll cnt[2] = {0}; //theres only two colors
//count number of vertices that are labeled 0
for (ll i =0; i<n; i++) ++cnt[col[i]];
cout << cnt[0] << '\\n';
for(ll i =0; i<n; i++) {
if(col[i] == 0){
cout<< i+1 <<" ";
}
}
cout<<'\\n';
//print number of vertices that are labled 1
cout<<cnt[1] << '\\n';
for(ll i=0; i<n; i++){
if(col[i] == 1){
cout<< i+1<< " ";
}
}
cout<< '\\n';
} else{
cout<< -1;
}
}

Dijkstra’s

Dijkstra’s is basically a way to find shortest paths, its more efficient for larger graphs because it only traverses through the edges once

starting off initially, all the nodes distances are going to be infinity. The data structure for implementation is going to be a priority queue- containing the nodes by their distances, thus the ext node can be processed in logarithmic time.

At each step, Dijkstra’s algorithm chooses a node that’s not been processed yet and whose distance is as small as possible. From there, after a node has been chosen, the algorithm goes to all its adjacent/neighboring nodes and reduces the distances.

Basically: look at all the nodes, and take the one thats the shortest distance away. For example:

Implementation

the way we do this in code is via priority queue (a priority queue spits out the smallest number, or greatest — depends how you do it). we want to spit out the smallest number, look at that number, and look at all its neighbors

The shortest path between two nodes is simply the path with the smallest cost.

Given number of vertices and edges in the first line, and for the following m lines, each line has 3 values- x,y,l. edge forms between vertices x and y with weight l.

sample case:

int n,m; //n vertices, m edges 
vector<pair<int,int>> adj[100];
//min priority queue
priority_queue<pair<int,int>, vector<pair<int,int>>,greater<pair<int,int>>> pq;
int dist[100];
int main(){
cin>>n>>m;
for(int i=0; i<m; i++){
int x,y,l;
cin>>x>>y>>l;
adj[x].push_back({1,y}); //edges are bidirectional
adj[y].push_back({1,x});
}
for(int i=0; i<n; i++){
dist[i] = -1;
}
pq.push({0,0});
while(!pq.empty()){
pair<int,int> dv;
dv = pq.top();
pq.pop();
if(dist[dv.second] != -1){
continue;
}
dist[dv.second] = dv.first;
//go through all the neighbors
for (pair<int,int> j : adj[dv.second]){
if(dist[j.second] == -1){
pq.push({dv.first + j.first, j.second});
}
}
}
for(int i=0; i<n; i++){
cout<<dist[i]<<endl;
}
}

Minimum Spanning trees 🌴

To learn what minimum spanning trees are, this video explains it super well:

https://www.youtube.com/watch?v=4ZlRH0eK-qQ&ab_channel=AbdulBari

tldr; there are multiple spanning trees for each graph, each with different cost. how do we find the minimum?

prims algorithm- select minimum cost edge. then, select the next one, except always make sure its connected to previously selected vertices. (in other words, spanning tree must be connected, can’t do for graphs in which their connected components > 1)

kruskal’s- always select the minimum cost edge, BUT if its forming a cycle, don’t include it. The time complexity is O(n²), but can be optimized with a min heap- with this optimization the time complexity is O(nlog(n))

Implementation

this problem is a very literal implementation of the algorithm. Walking through the test case:

/*<https://www.hackerearth.com/practice/algorithms/graphs/minimum-spanning-tree/practice-problems/algorithm/minimum-spanning-tree-5/> */
struct edge{
int a,b,w;
};
edge ar[100005];
//doing to use disjoint set union by path compression
int parent[10005];
bool comp(edge a, edge b){
if(a.w < b.w) return true;
return false;
}
//dsu by path compression
int find(int a){
if(parent[a] == -1) return a;
return parent[a] = find(parent[a]);
}
//unionc function
void merge(int a, int b){
parent[a] = b;
}
int main(){
int n,m,a,b,w;
cin>>n>>m;

for (int i=1; i<=n; i++) parent[i] = -1;
for(int i=0; i<m; i++){
cin>>ar[i].a>>ar[i].b>>ar[i].w;
}
int sum=0;
sort(ar, ar+m, comp);
//for each edge, traverse and find parent
for(int i=0; i<m; i++){
a = find(ar[i].a);
b = find(ar[i].b); //find parent of a and b
if(a!= b){ //parents are different
sum += ar[i].w;
merge(a,b);
}
}
cout<<sum;
}

phewww. alright, hope you learned something new 💙

--

--