Weighted Graphs :: Typedefs: Adventures in Data Structure

November 3, 2012


This post is dedicated to a student of mine. My first quarter teaching JavaScript, I had one person in class who was working on mapping, using Python to parse the data and JavaScript to present it. He was taking the class in order to better understand that presentation layer--in particular, he wanted to be able to cluster a large number of map points at wide zoom levels.

My first idea was to create a quadtree, but although this student was fairly advanced, he was still learning to process lists--processing a tree using recursion would have been tough to swallow. We ended up using Underscore to build groups filtered by distance from the first ungrouped item, which is workable and fast, but the clusters it creates are dependent on the order of the original list, and they can drift over time as items move in and out of the clustering groups. However, when I started reading Steven Skiena's Algorithm Design Manual this month and happened upon a mention that weighted graphs could be used for clustering, I leapt on the idea. Maybe now, finally, I could figure out a satisfactory answer to his problem?

Before we answer that question, let's talk a little bit about what a weighted graph is. If you're used to the idea of a tree, a graph is not a hard adjustment to make--instead of a tree with left/right branches (or some other scheme), a graph is composed of nodes connected by "edges," which may have weight (corresponding to distance, cost, change, etc.). For implementing a graph, we really care about the edges most of all--handling node values during/after traversal is mostly important at a higher abstraction level than the graph itself. We can think of each edge as looking similar to this:

var edge = {
  to: 1,    //index of the vertex at the end of the edge
  weight: 0 //used for deciding traversal path
}

Edge objects only record one endpoint--the vertex an edge starts from is implied by where it lives in the graph structure. If two vertexes are related to each other, the practice is usually to add two edges--one going each way. Since they're not real, it's not like they'll collide or anything. A graph with one-way linkages is called a "directed" graph, while one with all linkages going both ways is "undirected."

The graph structure that surrounds our edges also requires some explanation. Skiena's graph structure is built around tables of linked lists (which, in JavaScript, we can simply replace with arrays)--one table for vertexes, one for edges, and more tables for flags like "outdegree" or "search state"--with relationships between the tables expressed through common indexes. Consider the following simple graph:

var graph = {
  vertexes: ["A", "B", "C"],
  edges: [
    [{to: 1}, {to: 2}], // A's edges
    [{to: 0}],          // B's edges
    []                  // C's edges (or lack thereof)
  ]
}

In this case, A has edges linking it to B and C. B is linked only to A, and C is an end vertex--it doesn't link out to anywhere. The link between A and B is undirected, while the link from A to C is directed (one-way). This may be easier to envision as a table, where the numbers listed for each edge express which column the edge links to:

Column Index     0       1    2
Vertex values    A       B    C
Edge to          1, 2    0    (none)

Skiena sets up his structures this way because he's using C, and because he's trying to go fast, so it makes sense to keep data in flat arrays instead of objects or structs. In JavaScript, it feels a little bit odd to be using arrays linked by index instead of objects with properties, so after I implemented Skiena's version, I built a version of the graph and graph algorithms using Edge and Vertex constructors, as well as some functional techniques that felt more idiomatic for JavaScript. Unfortunately, these turned out to be much slower, so I'll be sticking with Skiena's versions. You can check out the object-oriented versions in the source below.
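
For the curious, here's roughly the shape such an object-oriented version might take--a simplified sketch rather than the actual source (check the link below for the real thing), with the connect() method name made up for illustration. Instead of parallel tables, each Vertex owns its own list of Edge objects that point directly at other Vertex instances:

function Edge(to, weight) {
  this.to = to;          // a reference to a Vertex, not an array index
  this.weight = weight;
}

function Vertex(value) {
  this.value = value;
  this.edges = [];
}

// hypothetical helper: link two vertexes with a pair of mirrored edges
Vertex.prototype.connect = function(other, weight) {
  this.edges.push(new Edge(other, weight));
  other.edges.push(new Edge(this, weight));
};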

It's also true that since actual vertexes turn out to be mostly ignored during traversal, there are good reasons to separate your edges and your vertexes anyway. JavaScript's lack of a real dictionary type (where object references can be used as keys) means that array indexes are as good as anything else for linking the two. In the examples below, I keep a copy of the pre-clustered edges and restore them after clustering, so that the search animations will run properly across the entire graph. We just have to be careful to delete the corresponding items from the edge list and other flag arrays whenever we remove a vertex (addition is simpler, since we don't have to worry about out-of-bounds errors in JavaScript if we handle undefined properly).
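
To illustrate that bookkeeping, here's a sketch of what removal might involve under this index-linked scheme. removeVertex() isn't part of the library--it's just an example of the splicing and re-indexing the paragraph above warns about:

function removeVertex(graph, index) {
  // every parallel table has to lose the same slot
  graph.vertexes.splice(index, 1);
  graph.edges.splice(index, 1);
  // edges that pointed at the removed vertex go away, and any edge
  // pointing past it has to shift its index down by one
  graph.edges = graph.edges.map(function(list) {
    return list.filter(function(edge) {
      return edge.to !== index;
    }).map(function(edge) {
      return {
        to: edge.to > index ? edge.to - 1 : edge.to,
        weight: edge.weight
      };
    });
  });
}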

The constructor that I've provided, WeightedGraph, accepts arrays of vertexes, edges, and a flag for directionality, but most of the time it's seeded with the vertexes only. We add our edges one at a time, depending on what kind of graph we're planning to create, using the addEdge() function.

addEdge: function(from, to, weight, directed) {
  // most of our graphs are undirected, so this will be false
  directed = directed || this.directed;
  // if there's not an edge list for the starting
  // vertex, create one
  if (!this.edges[from]) this.edges[from] = [];
  // add an edge to the list with a default weight
  this.edges[from].push({
    to: to,
    weight: weight || Infinity
  });
  // if undirected, add a second edge going the other way
  if (!directed) {
    this.addEdge(to, from, weight, true);
  }
}
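
Here's a quick usage sketch of that pattern, assuming the constructor behaves as described above (seeded with vertexes only, undirected unless told otherwise); the weights are just illustrative:

var graph = new WeightedGraph(["A", "B", "C"]);
// undirected by default: this adds an edge A -> B and a mirrored edge B -> A
graph.addEdge(0, 1, 10);
// passing true as the fourth argument adds only the one-way edge A -> C
graph.addEdge(0, 2, 4, true);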

With our edges added and our vertexes in place, we can traverse our graph. There are two ways to travel across a graph: breadth-first, which radiates outward from the starting vertex, and depth-first, which recursively travels as far as it can go before backtracking and continuing from the last branch. Here's our breadth-first traversal, with arguments for a starting index and a callback function to be run on each vertex along the graph:

breadthFirst: function(start, f) {
  // call f on our root node
  // we pass in the vertex, its edges, and its index
  f(this.vertexes[start], this.edges[start], start);
  // create a queue of edges to traverse from the root
  var queue = this.edges[start].slice();
  // visited keeps track of where we've been already
  var visited = [];
  visited[start] = true;
  // now we pull edges from the queue and process them,
  // adding their edges to be processed in turn
  var edge;
  while (edge = queue.shift()) {
    // skip previously-visited nodes
    if (visited[edge.to]) continue;
    visited[edge.to] = true;
    // process unvisited node
    f(this.vertexes[edge.to], this.edges[edge.to], edge.to);
    // add its edges to the queue for traversal
    var edges = this.edges[edge.to];
    queue = queue.concat(edges);
  }
}
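
For instance, using the small graph built a moment ago, a call might look like the sketch below; the callback arguments follow the order the function passes them (vertex value, its edge list, and its index):

// logs each vertex value in breadth-first order, starting from A
graph.breadthFirst(0, function(vertex, edges, index) {
  console.log(index + ": " + vertex);
});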

A breadth-first algorithm traverses all the vertexes at a given distance from the root before it moves outward. In depth-first, by contrast, we travel as far out as possible before sweeping around the rest of the graph. At heart, this is just the difference between first-in-first-out and first-in-last-out order. But in implementation, depth-first has a nice, tidy recursive version that means we don't have to keep any lists of upcoming edges:

depthFirst: function(start, f) {
  var visited = [];
  var self = this;
  // here's our recursive function to walk the graph
  var walk = function(nodeIndex) {
    // on entering a node, mark it as visited
    visited[nodeIndex] = true;
    var node = self.vertexes[nodeIndex];
    var edges = self.edges[nodeIndex];
    // call our traversal function
    f(node, edges, nodeIndex);
    // now walk through the edges
    for (var i = 0; i < edges.length; i++) {
      var edge = edges[i];
      // if we haven't been to the end of this edge, we
      // immediately walk down to it
      if (!visited[edge.to]) {
        walk(edge.to);
      }
    }
  };
  // trigger the recursive function
  walk(start);
}

After getting used to recursion while working on trees, depth-first seems cleaner to me. If you keep track of a counter value on entry and exit for each node, you can also use depth-first search to identify cycles (places where the graph loops in on itself) and bottlenecks, which is pretty useful.
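
The library doesn't necessarily expose those counters, but here's a rough sketch of how they might work on a directed graph built from the structures above--findBackEdges and its bookkeeping arrays are made up for illustration. An edge that points to a vertex we've entered but not yet exited closes a cycle. (On an undirected graph, the mirrored edge back to a vertex's parent would also register, so a real version would need to track the parent as well.)

function findBackEdges(graph, start) {
  var entered = [];
  var exited = [];
  var time = 0;
  var backEdges = [];
  var walk = function(index) {
    // stamp the vertex on the way in...
    entered[index] = time++;
    var edges = graph.edges[index] || [];
    for (var i = 0; i < edges.length; i++) {
      var next = edges[i].to;
      if (entered[next] === undefined) {
        walk(next);
      } else if (exited[next] === undefined) {
        // we've entered "next" but haven't left it yet: this edge loops back
        backEdges.push({from: index, to: next});
      }
    }
    // ...and again on the way out
    exited[index] = time++;
  };
  walk(start);
  return backEdges;
}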

These functions, though, are not the hard part. Here's the hard part: to do our clustering, we want to build what's called a "minimal spanning tree"--the subset of edges that connects every vertex in the graph while keeping the total edge weight as low as possible. In our case, our edges are weighted by the distance between two points, so that we can group together only points that are close. If we then remove any edges above a certain length, we're left with only the clustered "components" within the graph. Minimal spanning trees are also really useful for identifying fast or cheap paths through a network. In one of my favorite notes in the Algorithm Design Manual, Skiena suggests using them to identify line breaks in OCR'd text, by making lighter pixel edges cheaper than darker ones.

Given that we're dealing with a dense graph of points, where each point has a distance edge with every other point, I've implemented Prim's algorithm (whoever Prim is), because it's apparently better for dense graphs than Kruskal's (whoever that is). Prim's method works by creating a "tree" of one vertex, then repeatedly adding the closest non-tree vertex until the entire available graph has been traversed. As we add each point to the tree, we update a list of distance values corresponding to each non-tree point, which makes it easy to find the next candidate.

findMST: function(start) {
  var v = start;
  // we'll use two flag arrays in this algorithm:
  // - inTree keeps track of what's been added
  // - distance is the current distance of each non-tree vertex
  var inTree = [];
  var distance = this.edges.map(function() { return Infinity });
  // here's where we'll accumulate minimal spanning edges
  var tree = [];
  tree[v] = null;
  while (inTree[v] !== true) {
    inTree[v] = true;
    // for each node, we update the distance list if it moved
    // us closer to any non-tree points
    var edges = this.edges[v];
    for (var i = 0; i < edges.length; i++) {
      var edge = edges[i];
      if (distance[edge.to] > edge.weight && !inTree[edge.to]) {
        distance[edge.to] = edge.weight;
        // we create edges inward to the current vertex
        // they'll be overwritten if we find something closer
        tree[edge.to] = {to: v, weight: edge.weight};
      }
    }
    // find the closest non-tree vertex to process next
    var d = Infinity;
    for (var i = 0; i < distance.length; i++) {
      if (!inTree[i] && distance[i] < d) {
        d = distance[i];
        v = i;
      }
    }
  }
  // once we have all our edges, let's construct a new graph
  var newGraph = new WeightedGraph(this.vertexes);
  for (var i = 0; i < tree.length; i++) {
    if (tree[i]) {
      newGraph.addEdge(i, tree[i].to, tree[i].weight);
    }
  }
  return newGraph;
}
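
To make the setup concrete, here's a sketch of how that dense distance graph might be built and fed to findMST. The point coordinates and the straight-line distance formula are assumptions for illustration; only WeightedGraph, addEdge, and findMST come from the code above.

// a handful of made-up points
var points = [{x: 10, y: 20}, {x: 12, y: 21}, {x: 90, y: 5}, {x: 88, y: 8}];
var dense = new WeightedGraph(points);
// connect every point to every other point, weighted by straight-line distance
for (var i = 0; i < points.length; i++) {
  for (var j = i + 1; j < points.length; j++) {
    var dx = points[i].x - points[j].x;
    var dy = points[i].y - points[j].y;
    dense.addEdge(i, j, Math.sqrt(dx * dx + dy * dy));
  }
}
// the minimal spanning tree comes back as a new, much sparser graph over the same points
var mst = dense.findMST(0);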

This function is very C-like and tricky--so much so that I actually wrote it two more times in increasingly readable style. Take a look at the mst3K function for a simpler (but not shorter) loop, and then Vertex.prototype.findMST for the object-oriented version (which is much slower, but much easier to understand).

But with that out of the way, the rest of our clustering algorithm is pretty simple. cut() just removes any edges above the threshold weight, and findComponents() takes the now-shattered graph and creates a map of which vertex belongs to which coherent group. Using the canvas toy below, you can play with the different graph traversals, the minimum spanning tree, and the cluster cut threshold. You can also see an animation of the two different search algorithms--I recommend creating the MST first, then trying both searches. It's a great illustration of the difference between the queue and recursive exploration patterns.
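
For reference, here's a rough sketch of what those two steps might look like on top of the structures above; the real library versions may differ in detail, but the idea is the same--filter out heavy edges, then flood-fill group labels using one of the traversals:

cut: function(threshold) {
  // drop any edge heavier than the threshold
  this.edges = this.edges.map(function(list) {
    return list.filter(function(edge) {
      return edge.weight <= threshold;
    });
  });
},

findComponents: function() {
  // label each vertex with the group it belongs to
  var component = [];
  var current = 0;
  for (var i = 0; i < this.vertexes.length; i++) {
    // any vertex we haven't labeled yet starts a new component
    if (component[i] !== undefined) continue;
    var label = current++;
    if (this.edges[i] && this.edges[i].length) {
      this.breadthFirst(i, function(vertex, edges, index) {
        component[index] = label;
      });
    } else {
      // an isolated vertex is its own component
      component[i] = label;
    }
  }
  return component;
}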

[Interactive canvas demo: set a number of points, add or clear them, run the breadth-first and depth-first searches, create the minimum spanning tree, and set a cut threshold to cluster the graph.]

So what's our takeaway from this? Are graphs useful? Absolutely. Are they useful for this particular clustering problem? Eh, not so much. If you play with the clustering, you'll notice that it's possible to get some weird edge conditions that we probably wouldn't want for our map. For example, it's possible to have a chain of points that wraps around--but doesn't include--one or more points in the middle. Moreover, long chains of closely-positioned points can end up forming much larger clusters than we would really want on a map. I suspect that our hypothetical JavaScript map should probably still look at some kind of spatial or suffix tree to handle combining points at its widest zoom.

But graphs are useful problem-solvers elsewhere. Pathfinding comes to mind--if our map offered directions, or was a game of some kind, a graph traversal algorithm would definitely be key. And of course, graphs like these are invaluable for modeling networks (social or otherwise), traffic flows, and topology. Can you think of some other uses you'd like to explore? As always, the graph library and the example source code are yours to peruse.
