Sorting by feature vector distance

Sometimes, describing the real world requires a lot of numbers, each measuring a different feature. We call a set of these numerical qualities a feature vector. For example, let’s consider the flavor profiles of a variety of apples. Suppose we’ve rated each apple from 0-100 on five flavor characteristics:

Name    Crunchy  Tart  Sweet  Nutty  Bitter
Fuji       50     90     10     30     40
Gala       90     60     70     80     10
Cameo      40     30     90     50     80
Since we have 5 different numerical features, we can describe each apple as a point in 5-dimensional vector space.

Calculating distance

Suppose, given an apple from this list with 5-dimensional flavor characteristics, we want to find the top 3 other apples in our system that are most similar to it in flavor. In other words, we want to measure the “flavor distance” between apples, and find the 3 apples with the shortest distance to our original apple.

Fortunately, the simplest way to calculate distance in any number of dimensions is one we all learned in middle school: a variation on the familiar Pythagorean theorem known as Euclidean distance, which extends that theorem from 2 to an arbitrary number of dimensions. As in the Pythagorean theorem, we take the difference between the points in each dimension, square those differences, then add up the squares. Finally, we take the square root of that sum. So to find the distance between vectors q and p in n dimensions:
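The standard Euclidean distance formula:

```latex
d(q, p) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2} = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}
```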

If we’re modeling these apples in Ruby and ActiveRecord, we might have a model that looks like:
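The original code isn’t shown, but a minimal sketch might look like the following (all names here are assumptions). The distance math itself is plain Ruby, so we can keep it in a module and call it from the model:

```ruby
# Plain-Ruby Euclidean distance, kept in a module so it can be
# used from the ActiveRecord model (a sketch; names are assumed).
module FlavorDistance
  # Euclidean distance between two equal-length numeric vectors.
  def self.between(p, q)
    raise ArgumentError, "vectors must be the same length" unless p.length == q.length

    Math.sqrt(p.zip(q).sum { |a, b| (a - b)**2 })
  end
end

# In the app, the model might look like:
#
#   class Apple < ActiveRecord::Base
#     # columns: name, crunchy, tart, sweet, nutty, bitter (integers 0-100)
#     def flavor_vector
#       [crunchy, tart, sweet, nutty, bitter]
#     end
#
#     def flavor_distance_to(other)
#       FlavorDistance.between(flavor_vector, other.flavor_vector)
#     end
#   end
```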

Scaling

Now that we have a basic approach for finding the distance between 2 flavor vectors, what if we want to dramatically increase the number of apples in our database? While finding the distance between a single apple and every other apple runs in O(n) time, comparing each apple to every other apple runs in O(n²) time, which becomes prohibitive rather quickly. In addition, a naive approach using this function would be extra slow in Rails:
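The original snippet isn’t shown; presumably it loaded every other apple into memory and sorted in Ruby. A sketch under that assumption (the `Apple` model and helper names are hypothetical; a `Struct` stands in for the ActiveRecord model so the shape is runnable without a database):

```ruby
# Stand-in for the AR model; in Rails this would be Apple.where.not(id: apple.id),
# instantiating an ActiveRecord object per row on every call.
Apple = Struct.new(:id, :name, :flavor_vector)

def euclidean_distance(p, q)
  Math.sqrt(p.zip(q).sum { |a, b| (a - b)**2 })
end

# Naive approach: load all the other apples, compute every distance
# in Ruby, sort, and keep the closest few.
def top_apple_similarities(apple, all_apples, count = 3)
  all_apples
    .reject { |other| other.id == apple.id }
    .sort_by { |other| euclidean_distance(apple.flavor_vector, other.flavor_vector) }
    .first(count)
end
```

Called once per apple, this loads and sorts all n rows each time, so running it across the whole table costs O(n²) comparisons and O(n²) model instantiations.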

This is incredibly slow: not only are the comparisons O(n²), we’re also instantiating O(n²) ActiveRecord objects for a one-off comparison!

To speed this up, we can calculate and store these comparisons in a separate table, so we can query based on distance and avoid the overhead of instantiating that many AR objects. However, even with this approach we’ll need to take the hit of recalculating the distances every time a change is made to the flavor vectors of one of the apples. Fortunately, Postgres gives us tools to avoid ActiveRecord altogether.

Leveraging Postgres array and cube modules

Postgres provides an array column type, which can be used to store our feature vectors, and the Rails Postgres adapter allows us to use this column type in our AR model.
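A migration for this might look like the following (a sketch; table and column names are assumptions). The cube extension is enabled here as well, since we’ll lean on it shortly:

```ruby
# Adds a float[] column to hold the 5-dimensional flavor vector,
# and enables the cube extension for distance queries (assumed names).
class AddFlavorVectorToApples < ActiveRecord::Migration
  def change
    enable_extension "cube"
    add_column :apples, :flavor_vector, :float, array: true
  end
end
```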

We’ll need to keep this flavor_vector column in sync whenever one of the flavor fields changes; alternatively, we can drop the individual flavor columns altogether.
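Keeping the column in sync could hang off a save callback; the vector-building itself is plain Ruby (a sketch, field names assumed):

```ruby
# The ordered list of flavor fields that make up the vector (assumed names).
FLAVOR_FIELDS = %i[crunchy tart sweet nutty bitter].freeze

# Build the vector from a hash of attributes, in a fixed order.
def build_flavor_vector(attrs)
  FLAVOR_FIELDS.map { |field| attrs.fetch(field) }
end

# In the model, this would be wired into a callback, e.g.:
#
#   before_save do
#     self.flavor_vector = FLAVOR_FIELDS.map { |f| public_send(f) }
#   end
```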

Taking this a step further, Postgres’ cube module provides cube_distance, which calculates the distance between two vectors in a query. This allows us to sort by distance and return only the top 3 distances! These distances can be cached at the database level, making subsequent queries even faster.

This allows us to rewrite our top_apple_similarities function as follows:
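The original function isn’t shown; one way to sketch it (table, column, and method names are assumptions) is to build the SQL and hand it to find_by_sql, letting cube_distance do the sorting in the database:

```ruby
# Sketch of the cube-backed query (assumed schema: an apples table with
# a float[] flavor_vector column and the cube extension enabled).
def top_apple_similarities_sql(apple_id, count = 3)
  <<-SQL
    SELECT other.*,
           cube_distance(cube(other.flavor_vector), cube(target.flavor_vector)) AS flavor_distance
    FROM apples AS other
    JOIN apples AS target ON target.id = #{Integer(apple_id)}
    WHERE other.id <> target.id
    ORDER BY flavor_distance ASC
    LIMIT #{Integer(count)}
  SQL
end

# In the app:
#   Apple.find_by_sql(top_apple_similarities_sql(apple.id))
# Each returned Apple also responds to #flavor_distance, since we
# selected it as a column.
```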

We’re still in O(n²) territory, but our AR instantiations are now limited to O(n), pushing the bulk of the work onto the database, where it belongs. This optimization led to a 5-6x speedup of production code, from over 30 seconds to 5-8 seconds, with a table of 500 rows and 15 feature dimensions. Selecting the result of the cube_distance function as a column allows us to inspect the computed distance on the returned model.

By leveraging the database to do the hard work, we can achieve significant gains in performance even without dramatically changing the algorithm.

Main image from Flickr user originalbliss.