DBSCAN Clustering Algorithm and C# Implementation

DBSCAN

So, what is this DBSCAN actually. It is one of unsupervised learning and clustering algorithm in machine learning. What is unsupervised means is simply there is no labelled data in given set. So we don’t know how the each data called.

Clustering methods aim are finding useful classifications for given dataset. Here DBSCAN is one of them. The good thing about DBSCAN is you don’t give how many clusters (class labels) there will be in dataset. You tell to algorithm:

1- Dataset

2- Size of a Region (Topologically we can call it as a Ball Neighborhood)

3- Minimum # of points needs to be in a Region

And it produces the number of clusters based on this density rate. It clusters point on three base structures: Core, Density (Neighbor), Outlier (Noise).

DBSCAN-Illustration
Red points are Core, yellow points neighbors, and blue point is a noise for minimum 3 points with given region radius.

What is good about it, in some domains, you never know how many clusters you need to find. So giving the expected number of clusters like in k-Means algorithm are not suitable at all. At the other hand you may decide minimum density rate of the given dataset and set the threshold value for it. Especially it has very good use cases in video processing domains.

Some good things about DBSCAN are:

+ No predefined number of clusters

+ Built-in noise points estimation

+ Different sized and shaped clusters can be handled

Algorithm

DBSCAN algorithm is pretty straightforward. Here is the pseudo code of algorithm that I took from its Wikipedia article:

Implementation

Implementation is actually a piece of cake. But when it comes to think about performance and optimization you need to start scratching your head no doubt.

Here is actual implementation of Wikipedia pseudo code:

DbscanAlgorithm class takes metric function as constructor parameter for querying the data points. ComputeClusterDbscan method is only method that you need to call. You send dataset, region radius (epsilon), and minimum number of points that must be in a region. As a result you take a HashSet of data points.

And here is the other objects that you need to

And here is an example clustering dataset:

 

Feel free to update the code, and make some optimization.

And here is the GitHub repository of implementation.

I am csharp developer, mathematics graduated, visionary coder, tennis player, bad english speaker, blog reader, blog writer, and very lazy person. I will be sharing my personal thoughts, experiences, hobbies that I'd like to do and different news that takes my interest as a simple, regular person. Sometimes in English, sometimes in Turkish.
  • Konstantinos Papakonstantinou

    Hi Yusuf, can I use this algorithm to cluster markers with X and Y coordinates? If yes, I would like a cluster to cover an area of 20×20 pixels, how will I define this epsilon parameter?

    • Yusuf Uzun

      Yes, If I understand correctly, you want to cluster some two-dimensional points(markers). The epsilon parameter is constant value defined for your environment. Regular k-means and its derivatives takes the number of clusters as argument, and divides whole dataset to that number and creates clusters. DBSCAN power comes into play right here. In DBSCAN you don’t define number of clusters, you define neighborhood distance between your points and this is called epsilon. Hence, you don’t divide data into clusters, data decides for number of clusters with given epsilon and minimum neighbor numbers to the algorithm.

  • Pingback: Video Processing Using Optical Flows and DBSCAN Algorithm | A Developer's Side Notes()