Welcome to bldg_point_clustering’s documentation!

Introduction

A Python 3.5+ wrapper for clustering building point labels using KMeans, DBSCAN, and agglomerative clustering.

Installation

Using pip for Python 3.5+, run:

$ pip install bldg_point_clustering

Quick Start

Instantiate a Featurizer object and call one of its featurization methods to get a featurized Pandas DataFrame.

Instantiate a Cluster object, passing in the original DataFrame and the featurized DataFrame. Then call a clustering method with the appropriate parameters.

Use the plot_3D function in the plotter module to create a 3D plot of the metrics returned by any of the clustering trials.

Example Usage

Running one iteration of the KMeans algorithm:

import pandas as pd
import numpy as np
from bldg_point_clustering.cluster import Cluster
from bldg_point_clustering.featurizer import Featurizer

filename = "GBSF"

df = pd.read_csv("./datasets/" + filename + ".csv")

first_column = df.iloc[:, 0]

f = Featurizer(filename, corpus=first_column)

featurized_df = f.bag_of_words()

c = Cluster(df, featurized_df)

clustered_df = c.kmeans(n_clusters=300, plot=True, to_csv=True)

metrics = c.get_metrics_df()

avg_levenshtein_score = np.mean(c.get_levenshtein_scores())

Running several iterations of the KMeans algorithm:

from bldg_point_clustering.plotter import plot_3D

c.kmeans_trials()

metrics = c.get_metrics_df()

plot_3D(metrics, "n_clusters", "Avg Levenshtein Score", "Silhouette Score")

The process is similar for DBSCAN (dbscan, dbscan_trials) and agglomerative clustering (agglomerative, agglomerative_trials).

Featurizer

class bldg_point_clustering.featurizer.featurizer.Featurizer(filename, corpus)

Creates a Featurizer object instance.

Parameters:
  • filename – The name of the file containing the data to be featurized (Excluding file extension)
  • corpus – The Pandas Series of the strings to be clustered
bag_of_words(min_freq=1, max_freq=1.0, stop_words=None, tfidf=False)

Returns feature vectors based on the bag of words model, with each string tokenized using the arka tokenizer (Look in tokenizers for more information).

Parameters:
  • min_freq – Minimum frequency of a word to include in the featurization (float from 0.0 to 1.0)
  • max_freq – Maximum frequency of a word to include in the featurization (float from 0.0 to 1.0)
  • stop_words – Array of stop words to remove from the resulting tokens.
  • tfidf – Boolean indicating whether to use the term frequency–inverse document frequency (TF-IDF) model
Returns:

Pandas DataFrame of the document-term matrix featurization of the corpus (Featurized DataFrame)
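
To make the idea of a document-term matrix concrete, here is a minimal standard-library sketch. It is not the package's implementation: the tokenize function is a hypothetical stand-in for the arka tokenizer (it simply lowercases and splits on non-alphanumeric characters), the example labels are made up, and the min_freq, max_freq, and tfidf options are left out.

```python
import re
from collections import Counter


def tokenize(label):
    """Hypothetical tokenizer: lowercase, then split on non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^0-9a-z]+", label.lower()) if tok]


def bag_of_words(corpus, stop_words=None):
    """Return (vocabulary, matrix), where matrix[i][j] counts vocabulary[j] in corpus[i]."""
    stop = set(stop_words or [])
    tokenized = [[t for t in tokenize(doc) if t not in stop] for doc in corpus]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Illustrative point labels, not taken from any real dataset.
labels = ["AHU1.SupplyTemp", "AHU1.ReturnTemp", "AHU2.SupplyTemp"]
vocabulary, matrix = bag_of_words(labels)
```

Each row of matrix corresponds to one label and each column to one vocabulary term, which is the shape a featurized DataFrame would be expected to have.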

arka()

Returns feature vectors based on the arka thesis model.

Returns:

Pandas DataFrame of the document-term matrix featurization of the corpus

get_word_matrix_df()

Gets the Bag of Words document-term featurization matrix

Returns:

Pandas DataFrame of Bag of Words document-term featurization matrix

Cluster

class bldg_point_clustering.cluster.cluster.Cluster(df, featurized_df)

Creates a Cluster object instance.

Parameters:
  • df – The Pandas DataFrame of the original data.
  • featurized_df – The Pandas DataFrame of the featurized data
kmeans(n_clusters=2, max_iter=300, plot=False, levenshtein_min_samples=50, to_csv=False)

Runs one iteration of the kmeans clustering algorithm

Parameters:
  • n_clusters – The number of clusters to form as well as the number of centroids to generate.
  • max_iter – Maximum number of iterations of the k-means algorithm for a single run.
  • plot – Boolean value indicating whether to plot clusters.
  • levenshtein_min_samples – Minimum number of elements in a cluster for the cluster to be counted for calculating the Levenshtein score.
  • to_csv – Boolean value indicating whether to export clusters to a csv file.
Returns:

Pandas DataFrame with a column for each cluster.

dbscan(eps=2, min_samples=10, plot=False, levenshtein_min_samples=50, to_csv=False)

Runs one iteration of the dbscan clustering algorithm

Parameters:
  • eps – The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster.
  • min_samples – The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
  • levenshtein_min_samples – Minimum number of elements in a cluster for the cluster to be counted for calculating the Levenshtein score.
  • plot – Boolean value indicating whether to plot clusters.
  • to_csv – Boolean value indicating whether to export clusters to a csv file.
Returns:

Pandas DataFrame with a column for each cluster.
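
The interaction between eps and min_samples can be illustrated with a small sketch of the core-point test described above. This is an illustration under stated assumptions, not the code the package runs: the helper names and sample points are hypothetical, and Euclidean distance is assumed.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def core_point_mask(points, eps, min_samples):
    """Flag each point whose eps-neighborhood (itself included) holds >= min_samples points."""
    mask = []
    for p in points:
        neighbors = sum(1 for q in points if euclidean(p, q) <= eps)
        mask.append(neighbors >= min_samples)
    return mask


# Made-up points: three close together and one far away.
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 0.0)]
loose = core_point_mask(points, eps=0.6, min_samples=2)
strict = core_point_mask(points, eps=0.6, min_samples=3)
```

With min_samples=2 the three nearby points are all core points. With min_samples=3 only the middle one qualifies, because it is the only point with two neighbors within eps besides itself.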

agglomerative(n_clusters=2, plot=False, levenshtein_min_samples=50, to_csv=False)

Runs one iteration of the agglomerative/hierarchical clustering algorithm

Parameters:
  • n_clusters – The number of clusters to find.
  • plot – Boolean value indicating whether to plot clusters.
  • levenshtein_min_samples – Minimum number of elements in a cluster for the cluster to be counted for calculating the Levenshtein score.
  • to_csv – Boolean value indicating whether to export clusters to a csv file.
Returns:

Pandas DataFrame with a column for each cluster.

kmeans_trials(min_clusters=2, max_clusters=10, step=3, levenshtein_min_samples=50, plot=False)

Runs multiple iterations/trials of the kmeans clustering algorithm starting from ‘min_clusters’ and adding ‘step’ each time until the number of clusters reaches ‘max_clusters’. Calculates the metrics below at each iteration.

Parameters:
  • min_clusters – The lowest number of clusters to start with
  • max_clusters – The maximum number of clusters to end with
  • step – The number of clusters to increment by for each iteration
  • levenshtein_min_samples – Minimum number of elements in a cluster for the cluster to be counted for calculating the Levenshtein score.
  • plot – Boolean value indicating whether to plot clusters.
Returns:

Pandas DataFrame with all the metrics for each clustering iteration. Metrics include: Number of Clusters, Avg Levenshtein Score, STD Levenshtein Score, Min Levenshtein Score, Max Levenshtein Score, Silhouette Score, SSE, MSE, RMSE, Average Cluster Size, STD Cluster Size
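
As a rough guide to the SSE, MSE, and RMSE columns, the sketch below computes them for a toy clustering. It assumes that SSE is the sum of squared distances from each sample to its assigned centroid, that MSE is SSE divided by the number of samples, and that RMSE is the square root of MSE. The package may define these differently, and the function name and data are hypothetical.

```python
import math


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length coordinate tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def error_metrics(points, labels, centroids):
    """Assumed definitions: SSE over assigned centroids, MSE = SSE / n, RMSE = sqrt(MSE)."""
    sse = sum(squared_distance(p, centroids[k]) for p, k in zip(points, labels))
    mse = sse / len(points)
    return {"SSE": sse, "MSE": mse, "RMSE": math.sqrt(mse)}


# Made-up data: two clusters of two points each, with their centroids.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
labels = [0, 0, 1, 1]
centroids = [(1.0, 0.0), (11.0, 0.0)]
metrics = error_metrics(points, labels, centroids)
```

Every point here lies one unit from its centroid, so SSE is 4.0 and both MSE and RMSE are 1.0.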

dbscan_trials(min_eps=0.2, max_eps=1, eps_step=0.2, start_min_samples=10, max_min_samples=30, min_samples_step=5, levenshtein_min_samples=50, plot=False)

Runs multiple iterations/trials of the dbscan clustering algorithm starting from ‘min_eps’ and adding ‘eps_step’ each time until eps reaches ‘max_eps’. For each eps value, several iterations are run starting with ‘start_min_samples’ and incrementing by ‘min_samples_step’ up to ‘max_min_samples’.

Parameters:
  • min_eps – The lowest maximum distance between two samples for one to be considered as in the neighborhood of the other.
  • max_eps – The highest maximum distance between two samples for one to be considered as in the neighborhood of the other.
  • eps_step – The amount of eps to increment by for each iteration
  • start_min_samples – The lowest amount of the number of samples (or total weight) in a neighborhood for a point to be considered as a core point.
  • max_min_samples – The highest amount of the number of samples (or total weight) in a neighborhood for a point to be considered as a core point.
  • min_samples_step – The number of min_samples to increment by for each iteration
  • levenshtein_min_samples – Minimum number of elements in a cluster for the cluster to be counted for calculating the Levenshtein score.
  • plot – Boolean value indicating whether to plot clusters.
Returns:

Pandas DataFrame with all the metrics for each clustering iteration. Metrics include: EPS, Min Samples, Avg Levenshtein Score, STD Levenshtein Score, Min Levenshtein Score, Max Levenshtein Score, Estimated # of Clusters, Estimated # of Noise/Outlier Points, Silhouette Score, Average Cluster Size, STD Cluster Size
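
The nested schedule described above can be sketched as a grid of (eps, min_samples) pairs. This is only an illustration: it assumes both upper bounds are inclusive, which the description does not state, and the helper names are hypothetical.

```python
import math


def frange_inclusive(start, stop, step):
    """Return start, start + step, ... without exceeding stop, avoiding float drift."""
    count = int(math.floor((stop - start) / step + 1e-9))
    return [round(start + i * step, 10) for i in range(count + 1)]


def dbscan_trial_grid(min_eps=0.2, max_eps=1, eps_step=0.2,
                      start_min_samples=10, max_min_samples=30, min_samples_step=5):
    """List every (eps, min_samples) pair, assuming inclusive upper bounds."""
    grid = []
    for eps in frange_inclusive(min_eps, max_eps, eps_step):
        # Inner loop: sweep min_samples for the current eps value.
        for min_samples in range(start_min_samples, max_min_samples + 1, min_samples_step):
            grid.append((eps, min_samples))
    return grid


grid = dbscan_trial_grid()
```

Under these assumptions the default arguments give five eps values and five min_samples values, so 25 trials in total. The trial count grows multiplicatively when either range is widened.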

agglomerative_trials(min_clusters=2, max_clusters=10, step=3, levenshtein_min_samples=50, plot=False)

Runs multiple iterations/trials of the agglomerative/hierarchical clustering algorithm starting from ‘min_clusters’ and adding ‘step’ each time until the number of clusters reaches ‘max_clusters’. Calculates the metrics below at each iteration.

Parameters:
  • min_clusters – The lowest number of clusters to start with
  • max_clusters – The maximum number of clusters to end with
  • step – The number of clusters to increment by for each iteration
  • levenshtein_min_samples – Minimum number of elements in a cluster for the cluster to be counted for calculating the Levenshtein score.
  • plot – Boolean value indicating whether to plot clusters.
Returns:

Pandas DataFrame with all the metrics for each clustering iteration. Metrics include: Number of Clusters, Avg Levenshtein Score, STD Levenshtein Score, Min Levenshtein Score, Max Levenshtein Score, Silhouette Score, Average Cluster Size, STD Cluster Size

get_levenshtein_scores(min_samples=50)

Calculates the Levenshtein score of each cluster by averaging the Levenshtein scores of all pairs of strings in the cluster, and returns one score per cluster. Scores are calculated for the most recently run clustering algorithm.

Parameters:
  • min_samples – Minimum number of elements in a cluster for the cluster to be counted for calculating the Levenshtein score.
Returns:

Array of Levenshtein scores, one per cluster

get_metrics_df()

Gets the Pandas dataframe containing the metrics of the last run clustering algorithm.

Returns:

Pandas DataFrame containing the metrics of the last run clustering algorithm.

get_clustered_df()

Gets the Pandas dataframe of the data sorted into its respective clusters after running one of the clustering algorithms.

Returns:

Pandas DataFrame in which each column represents a cluster.

get_cluster_instance()

Gets the Sklearn Object of the previously called clustering algorithm.

Returns:

Sklearn Object of the previously called clustering algorithm.

get_cluster_fit_instance()

Gets the Sklearn Object of the previously called clustering algorithm after fitting the data.

Returns:

Sklearn Object of the previously called clustering algorithm after fitting the data.

Plotter

bldg_point_clustering.plotter.plotter.plot_3D(df, x, y, z)

Plots metrics data on a 3D plot with given axes x, y, and z.

Parameters:
  • df – A Pandas DataFrame of columns of numerical data (i.e. Metrics DataFrame)
  • x – The column of the dataframe to go on the x-axis (Column Name -> String)
  • y – The column of the dataframe to go on the y-axis (Column Name -> String)
  • z – The column of the dataframe to go on the z-axis (Column Name -> String)
Returns:

3D Plot of x, y, and z (using Plotly Express)

bldg_point_clustering.plotter.plotter.plot_silhouettes(X, labels)

Finds and plots silhouette samples for each label

Parameters:
  • X – A dataframe of the featurized data
  • labels – The labeled cluster assigned to each string (array)
bldg_point_clustering.plotter.plotter.plot_kmeans_inertia(X, max_clusters=10)

Finds and plots inertia/sum of squared errors of running kmeans on number of clusters from 1 to max_clusters inclusive.

Parameters:
  • X – A dataframe of the featurized data
  • max_clusters – The maximum number of clusters to plot the inertia (sum of squared error) for.
bldg_point_clustering.plotter.plotter.plot_kmeans(X, n_clusters=2)

Plots the PCA (2 components) reduced data with kmeans clustering and n_clusters

Parameters:
  • X – A dataframe of the featurized data
  • n_clusters – The number of clusters to apply kmeans clustering and plot
bldg_point_clustering.plotter.plotter.plot_dbscan(X, eps=0.2, min_samples=10)

Plots the PCA (2 components) reduced data with dbscan clustering with epsilon (eps) and min_samples

Parameters:
  • X – A dataframe of the featurized data
  • eps – The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster.
  • min_samples – The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
bldg_point_clustering.plotter.plotter.plot_agglomerative(X, n_clusters=2)

Plots the PCA (2 components) reduced data with agglomerative clustering with n_clusters

Parameters:
  • X – A dataframe of the featurized data
  • n_clusters – The number of clusters to apply agglomerative clustering and plot

Metrics

bldg_point_clustering.metrics.metrics.levenshtein_metric(X, min_samples=0)

Returns an array of average Levenshtein scores, one per cluster, computed by averaging the Levenshtein scores of all pairs of strings in each cluster.

Parameters:
  • X – Pandas DataFrame of clusters with their respective strings
  • min_samples – Minimum number of elements in a cluster for the cluster to be counted for calculating the Levenshtein score.
Returns:

Array of average Levenshtein scores of each cluster
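
The per-cluster averaging described above might look like the following sketch. It makes several assumptions: the score is the raw edit distance (the package may use a normalized ratio instead), clusters are passed as plain lists of strings rather than DataFrame columns, and clusters with fewer than two members are skipped because they have no pairs. The function names and labels are hypothetical.

```python
from itertools import combinations


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if characters match)
            ))
        previous = current
    return previous[-1]


def cluster_scores(clusters, min_samples=0):
    """Average pairwise edit distance for each cluster with enough members."""
    scores = []
    for members in clusters:
        if len(members) < max(min_samples, 2):
            continue  # too small to count, or no pairs to compare
        pairs = list(combinations(members, 2))
        scores.append(sum(levenshtein(a, b) for a, b in pairs) / len(pairs))
    return scores


# Made-up clusters of point labels.
clusters = [
    ["AHU1.SAT", "AHU1.RAT", "AHU2.SAT"],
    ["VAV1.ZoneTemp", "VAV2.ZoneTemp"],
    ["Chiller.kW"],
]
scores = cluster_scores(clusters)
```

With raw distances, a lower score means the strings in a cluster are more alike. Raising min_samples drops small clusters from the result, which shortens the returned array.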

bldg_point_clustering.metrics.metrics.silhouette_metric(X, labels)

Returns silhouette score of featurized Pandas DataFrame

Parameters:
  • X – Pandas DataFrame of featurized data
  • labels – The labeled cluster assigned to each string (array)
Returns:

Silhouette Score (Floating point value between -1 and 1)
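
For intuition about what the score measures, here is a small standard-library sketch of the mean silhouette coefficient. Each sample contributes (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The sketch assumes Euclidean distance and at least two distinct labels. The function name and sample points are hypothetical, and this is not the package's own code.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two distinct labels."""
    clusters = {}
    for p, k in zip(points, labels):
        clusters.setdefault(k, []).append(p)

    total = 0.0
    for p, k in zip(points, labels):
        own = clusters[k]
        if len(own) == 1:
            continue  # a singleton cluster contributes a coefficient of 0
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(p, q) for q in other) / len(other)
            for j, other in clusters.items() if j != k
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Made-up points forming two well-separated pairs.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = mean_silhouette(points, [0, 0, 1, 1])  # labels match the geometry
bad = mean_silhouette(points, [0, 1, 0, 1])   # labels cut across the geometry
```

Labels that match the geometry score close to 1, while labels that cut across it score below 0, so negative values are a sign that many samples sit nearer to another cluster than to their own.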