Welcome to bldg_point_clustering’s documentation!

Introduction

A Python 3.5+ wrapper for clustering building point labels using the KMeans, DBSCAN, and agglomerative clustering algorithms.

Installation

With pip for Python 3.5+, run:

$ pip install bldg_point_clustering

Quick Start

Instantiate a Featurizer object and call one of its featurization methods to get a featurized Pandas DataFrame.

Instantiate a Cluster object, passing in the original DataFrame and the featurized DataFrame. Then call a clustering method with the appropriate parameters.

Use the plot_3D function in the Plotter module to create a 3D plot of the metrics returned by any of the clustering trials.

Example Usage

Running one iteration of the KMeans algorithm:

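import numpy as np
import pandas as pd
from bldg_point_clustering.cluster import Cluster
from bldg_point_clustering.featurizer import Featurizer

# Name of the dataset file, without its extension.
filename = "GBSF"
df = pd.read_csv("./datasets/" + filename + ".csv")

# The first column is passed as the corpus of strings to cluster.
first_column = df.iloc[:, 0]

# Featurize the strings with the bag-of-words model.
f = Featurizer(filename, corpus=first_column)
featurized_df = f.bag_of_words()

# Run KMeans once, plotting the clusters and exporting them to a csv file.
c = Cluster(df, featurized_df)
clustered_df = c.kmeans(n_clusters=300, plot=True, to_csv=True)

# Metrics of this run, and the mean of the per-cluster Levenshtein scores.
metrics = c.get_metrics_df()
avg_levenshtein_score = np.mean(c.get_levenshtein_scores())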

Running several iterations of the KMeans algorithm:

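from bldg_point_clustering.plotter import plot_3D

# Reuses the Cluster object c from the previous example.
# Runs KMeans once for each cluster count in the default range.
c.kmeans_trials()

# Metrics collected for each trial.
metrics = c.get_metrics_df()

# Compare cluster count, average Levenshtein score, and silhouette score.
plot_3D(metrics, "n_clusters", "Avg Levenshtein Score", "Silhouette Score")

The process is similar for DBSCAN (dbscan, dbscan_trials) and agglomerative clustering (agglomerative, agglomerative_trials).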

Featurizer

class bldg_point_clustering.featurizer.featurizer.Featurizer(filename, corpus)

Creates a Featurizer object instance.

Parameters:

filename – The name of the file containing the data to be featurized (excluding the file extension)
corpus – The Pandas Series of the strings to be clustered

bag_of_words(min_freq=1, max_freq=1.0, stop_words=None, tfidf=False)

Returns feature vectors based on the bag-of-words model, with each string tokenized using the arka tokenizer (see tokenizers for more information).

Parameters:

min_freq – Minimum frequency of a word to include in the featurization (float from 0.0 to 1.0)
max_freq – Maximum frequency of a word to include in the featurization (float from 0.0 to 1.0)
stop_words – Array of stop words, all of which are removed from the resulting tokens.
tfidf – Boolean indicating whether to use the term frequency–inverse document frequency (TF-IDF) model

Returns:

Pandas DataFrame of the document-term matrix featurization of the corpus (Featurized DataFrame)
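As a rough illustration of what a document-term matrix contains, the sketch below builds one in plain Python. It is not the library's implementation: the tokenizer is a simple stand-in for the arka tokenizer (whose rules are not described here), the sample labels are invented, and reading min_freq and max_freq as the fraction of strings that contain a token is an assumption.

```python
import re
from collections import Counter


def tokenize(label):
    """Stand-in tokenizer (assumption): lowercase runs of letters or digits."""
    return re.findall(r"[a-z]+|[0-9]+", label.lower())


def bag_of_words(corpus, min_freq=0.0, max_freq=1.0, stop_words=None):
    """Build a document-term count matrix, returned as (vocabulary, rows)."""
    stop = set(stop_words or [])
    docs = [[t for t in tokenize(s) if t not in stop] for s in corpus]

    # Document frequency: number of strings that contain each token.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    n_docs = len(docs)
    vocab = sorted(
        tok for tok, n in doc_freq.items()
        if min_freq <= n / n_docs <= max_freq
    )

    # One row per input string, one column per vocabulary token.
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[tok] for tok in vocab])
    return vocab, rows


# Hypothetical point labels, invented for this example.
labels = ["AHU_1_SUPPLY_TEMP", "AHU_1_RETURN_TEMP", "VAV_12_ZONE_TEMP"]
vocab, matrix = bag_of_words(labels, stop_words=["temp"])
```

Here vocab holds the column names and each entry of matrix is the count row for one label, which is the shape a featurized DataFrame would take.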

arka()

Returns feature vectors based on the arka thesis model.

Returns:	Pandas DataFrame of the document-term matrix featurization of the corpus

get_word_matrix_df()

Gets the Bag of Words document-term featurization matrix.

Returns:	Pandas DataFrame of Bag of Words document-term featurization matrix

Cluster

class bldg_point_clustering.cluster.cluster.Cluster(df, featurized_df)

Creates a Cluster object instance.

Parameters:

df – The Pandas DataFrame of the original data.
featurized_df – The Pandas DataFrame of the featurized data

kmeans(n_clusters=2, max_iter=300, plot=False, levenshtein_min_samples=50, to_csv=False)

Runs one iteration of the kmeans clustering algorithm

Parameters:

n_clusters – The number of clusters to form as well as the number of centroids to generate.
max_iter – Maximum number of iterations of the k-means algorithm for a single run.
plot – Boolean value indicating whether to plot clusters.
levenshtein_min_samples – Minimum number of elements in a cluster for the cluster to be counted for calculating the Levenshtein score.
to_csv – Boolean value indicating whether to export clusters to a csv file.

Returns:

Pandas DataFrame with a column for each cluster.

dbscan(eps=2, min_samples=10, plot=False, levenshtein_min_samples=50, to_csv=False)

Runs one iteration of the dbscan clustering algorithm

Parameters:

eps – The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster.
min_samples – The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
levenshtein_min_samples – Minimum number of elements in a cluster for the cluster to be counted for calculating the Levenshtein score.
plot – Boolean value indicating whether to plot clusters.
to_csv – Boolean value indicating whether to export clusters to a csv file.

Returns:

Pandas DataFrame with a column for each cluster.
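To make the eps and min_samples definitions above concrete, here is a minimal plain-Python sketch of the core-point test only (not the full DBSCAN algorithm, and not the library's code). It assumes Euclidean distance and unweighted samples; the function name core_points and the sample points are invented for illustration.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def core_points(points, eps, min_samples):
    """Return indices of points with at least min_samples neighbors within eps."""
    core = []
    for i, p in enumerate(points):
        # The neighborhood includes the point itself (distance 0 <= eps).
        neighbors = sum(1 for q in points if euclidean(p, q) <= eps)
        if neighbors >= min_samples:
            core.append(i)
    return core


# Three nearby points and one distant outlier.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (5.0, 5.0)]
core = core_points(points, eps=1.0, min_samples=3)
```

With these values the three nearby points each have three neighbors (counting themselves) and qualify as core points, while the outlier does not.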

agglomerative(n_clusters=2, plot=False, levenshtein_min_samples=50, to_csv=False)

Runs one iteration of the agglomerative/hierarchical clustering algorithm

Parameters:

n_clusters – The number of clusters to form.
plot – Boolean value indicating whether to plot clusters.
levenshtein_min_samples – Minimum number of elements in a cluster for the cluster to be counted for calculating the Levenshtein score.
to_csv – Boolean value indicating whether to export clusters to a csv file.

Returns:

Pandas DataFrame with a column for each cluster.

kmeans_trials(min_clusters=2, max_clusters=10, step=3, levenshtein_min_samples=50, plot=False)

Runs multiple iterations/trials of the kmeans clustering algorithm starting from ‘min_clusters’ and adding ‘step’ each time until the number of clusters reaches ‘max_clusters’. Calculates the metrics below at each iteration.

Parameters:

min_clusters – The lowest number of clusters to start with
max_clusters – The maximum number of clusters to end with
step – The number of clusters to increment by for each iteration
levenshtein_min_samples – Minimum number of elements in a cluster for the cluster to be counted for calculating the Levenshtein score.
plot – Boolean value indicating whether to plot clusters.

Returns:

Pandas DataFrame with all the metrics for each clustering iteration. Metrics include: Number of Clusters, Avg Levenshtein Score, STD Levenshtein Score, Min Levenshtein Score, Max Levenshtein Score, Silhouette Score, SSE, MSE, RMSE, Average Cluster Size, STD Cluster Size
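The sweep over cluster counts described for kmeans_trials can be pictured with the short sketch below. The helper cluster_counts is hypothetical, and treating max_clusters as an inclusive upper bound is an assumption; the documentation only says the count increases by step until it reaches max_clusters.

```python
def cluster_counts(min_clusters=2, max_clusters=10, step=3):
    """List the n_clusters values a sweep would visit (inclusive bound assumed)."""
    return list(range(min_clusters, max_clusters + 1, step))


default_counts = cluster_counts()       # [2, 5, 8]
exact_hit = cluster_counts(2, 8, 3)     # [2, 5, 8]; 8 is kept only if inclusive
```

With the default arguments the visited counts are 2, 5, and 8 under either reading of the upper bound; the two readings differ only when a step lands exactly on max_clusters.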

dbscan_trials(min_eps=0.2, max_eps=1, eps_step=0.2, start_min_samples=10, max_min_samples=30, min_samples_step=5, levenshtein_min_samples=50, plot=False)

Runs multiple iterations/trials of the dbscan clustering algorithm, starting from ‘min_eps’ and adding ‘eps_step’ each time until eps reaches ‘max_eps’. For each eps value, several iterations are run, starting with ‘start_min_samples’ and incrementing by ‘min_samples_step’ up to ‘max_min_samples’.

Parameters:

min_eps – The lowest maximum distance between two samples for one to be considered as in the neighborhood of the other.
max_eps – The highest maximum distance between two samples for one to be considered as in the neighborhood of the other.
eps_step – The amount of eps to increment by for each iteration
start_min_samples – The lowest amount of the number of samples (or total weight) in a neighborhood for a point to be considered as a core point.
max_min_samples – The highest amount of the number of samples (or total weight) in a neighborhood for a point to be considered as a core point.
min_samples_step – The number of min_samples to increment by for each iteration
levenshtein_min_samples – Minimum number of elements in a cluster for the cluster to be counted for calculating the Levenshtein score.
plot – Boolean value indicating whether to plot clusters.

Returns:

Pandas DataFrame with all the metrics for each clustering iteration. Metrics include: EPS, Min Samples, Avg Levenshtein Score, STD Levenshtein Score, Min Levenshtein Score, Max Levenshtein Score, Estimated # of Clusters, Estimated # of Noise/Outlier Points, Silhouette Score, Average Cluster Size, STD Cluster Size
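The nested sweep over eps and min_samples amounts to a parameter grid. The sketch below enumerates such a grid in plain Python; the helper dbscan_grid is hypothetical, and treating both upper bounds as inclusive is an assumption rather than documented behavior.

```python
import math


def dbscan_grid(min_eps=0.2, max_eps=1, eps_step=0.2,
                start_min_samples=10, max_min_samples=30, min_samples_step=5):
    """Enumerate (eps, min_samples) pairs, assuming inclusive upper bounds."""
    # Count eps steps with an integer index so that repeated float addition
    # cannot drift past max_eps and silently drop the last value.
    n_eps = int(math.floor((max_eps - min_eps) / eps_step + 1e-9)) + 1
    eps_values = [round(min_eps + i * eps_step, 10) for i in range(n_eps)]
    sample_values = range(start_min_samples, max_min_samples + 1, min_samples_step)
    # Outer loop over eps, inner loop over min_samples.
    return [(eps, m) for eps in eps_values for m in sample_values]


grid = dbscan_grid()
```

Under these assumptions the default arguments give five eps values (0.2 through 1.0) and five min_samples values (10 through 30), so 25 trials in total.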

agglomerative_trials(min_clusters=2, max_clusters=10, step=3, levenshtein_min_samples=50, plot=False)

Runs multiple iterations/trials of the agglomerative/hierarchical clustering algorithm starting from ‘min_clusters’ and adding ‘step’ each time until the number of clusters reaches ‘max_clusters’. Calculates the metrics below at each iteration.

Parameters:

min_clusters – The lowest number of clusters to start with
max_clusters – The maximum number of clusters to end with
step – The number of clusters to increment by for each iteration
levenshtein_min_samples – Minimum number of elements in a cluster for the cluster to be counted for calculating the Levenshtein score.
plot – Boolean value indicating whether to plot clusters.

Returns:

Pandas DataFrame with all the metrics for each clustering iteration. Metrics include: Number of Clusters, Avg Levenshtein Score, STD Levenshtein Score, Min Levenshtein Score, Max Levenshtein Score, Silhouette Score, Average Cluster Size, STD Cluster Size

get_levenshtein_scores(min_samples=50)

Calculates the Levenshtein score of each cluster by averaging the Levenshtein scores of all pairs of strings in the cluster, and returns one score per cluster. Scores are calculated for the most recently run clustering algorithm.

Parameters:

min_samples – Minimum number of elements in a cluster for the cluster to be counted for calculating the Levenshtein score.

Returns:

Array of Levenshtein scores, one for each cluster
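The per-cluster averaging described above can be sketched in plain Python as follows. This is an illustration, not the library's code: it assumes the score is the raw edit distance (the library may normalize it differently), and the names levenshtein and cluster_scores and the sample strings are invented.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def cluster_scores(clusters, min_samples=50):
    """Average pairwise edit distance for each cluster with enough members."""
    scores = []
    for members in clusters:
        if len(members) < min_samples or len(members) < 2:
            continue  # too small to be counted
        pairs = list(combinations(members, 2))
        scores.append(sum(levenshtein(a, b) for a, b in pairs) / len(pairs))
    return scores


clusters = [["AHU1", "AHU2", "AHU3"], ["VAV1"]]
scores = cluster_scores(clusters, min_samples=2)
```

In this example only the first cluster meets min_samples, and every pair in it differs by one character, so a single score of 1.0 is returned. Lower averages indicate clusters whose strings are more alike.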

get_metrics_df()

Gets the Pandas dataframe containing the metrics of the last run clustering algorithm.

Returns:	Pandas DataFrame containing the metrics of the last run clustering algorithm.

get_clustered_df()

Gets the Pandas dataframe of the data sorted into its respective clusters after running one of the clustering algorithms.

Returns:	Pandas DataFrame in which each column represents a cluster.

get_cluster_instance()

Gets the Sklearn Object of the previously called clustering algorithm.

Returns:	Sklearn Object of the previously called clustering algorithm.

get_cluster_fit_instance()

Gets the Sklearn Object of the previously called clustering algorithm after fitting the data.

Returns:	Sklearn Object of the previously called clustering algorithm after fitting the data.

Plotter

bldg_point_clustering.plotter.plotter.plot_3D(df, x, y, z)

Plots metrics data on a 3D plot with given axes x, y, and z.

Parameters:

df – A Pandas DataFrame of columns of numerical data (i.e. Metrics DataFrame)
x – The column of the dataframe to go on the x-axis (Column Name -> String)
y – The column of the dataframe to go on the y-axis (Column Name -> String)
z – The column of the dataframe to go on the z-axis (Column Name -> String)

Returns:

3D Plot of x, y, and z (using Plotly Express)

bldg_point_clustering.plotter.plotter.plot_silhouettes(X, labels)

Finds and plots silhouette samples for each label

Parameters:

X – A dataframe of the featurized data
labels – The labeled cluster assigned to each string (array)

bldg_point_clustering.plotter.plotter.plot_kmeans_inertia(X, max_clusters=10)

Finds and plots the inertia (sum of squared errors) of running kmeans for each number of clusters from 1 to max_clusters inclusive.

Parameters:

X – A dataframe of the featurized data
max_clusters – The maximum number of clusters to plot the inertia (sum of squared errors) for.
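For readers unfamiliar with the term, inertia is the sum of squared distances from each sample to the centroid of its assigned cluster. The sketch below computes it in plain Python for a hand-made assignment; the function name and the sample data are invented, and the library itself presumably obtains this value from its fitted KMeans model rather than computing it this way.

```python
def inertia(points, centroids, labels):
    """Sum of squared Euclidean distances from each point to its centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total


points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
centroids = [(1.0, 0.0), (10.0, 0.0)]  # one centroid per cluster
labels = [0, 0, 1]                     # cluster index of each point
sse = inertia(points, centroids, labels)
```

Inertia can only stay the same or shrink as clusters are added, which is why it is usually plotted against the number of clusters to look for the point where the improvement levels off.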

bldg_point_clustering.plotter.plotter.plot_kmeans(X, n_clusters=2)

Plots the PCA (2 components) reduced data with kmeans clustering and n_clusters

Parameters:

X – A dataframe of the featurized data
n_clusters – The number of clusters to apply kmeans clustering and plot

bldg_point_clustering.plotter.plotter.plot_dbscan(X, eps=0.2, min_samples=10)

Plots the PCA (2 components) reduced data with dbscan clustering with epsilon (eps) and min_samples

Parameters:

X – A dataframe of the featurized data
eps – The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster.
min_samples – The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.

bldg_point_clustering.plotter.plotter.plot_agglomerative(X, n_clusters=2)

Plots the PCA (2 components) reduced data with agglomerative clustering with n_clusters

Parameters:

X – A dataframe of the featurized data
n_clusters – The number of clusters to apply agglomerative clustering and plot

Metrics

bldg_point_clustering.metrics.metrics.levenshtein_metric(X, min_samples=0)

Returns array of average levenshtein scores of each cluster by averaging the levenshtein scores of all pairwise strings in each cluster.

Parameters:

X – Pandas DataFrame of clusters with their respective strings
min_samples – Minimum number of elements in a cluster for the cluster to be counted for calculating the Levenshtein score.

Returns:

Array of average Levenshtein scores of each cluster

bldg_point_clustering.metrics.metrics.silhouette_metric(X, labels)

Returns silhouette score of featurized Pandas DataFrame

Parameters:

X – Pandas DataFrame of featurized data
labels – The labeled cluster assigned to each string (array)

Returns:

Silhouette Score (Floating point value between -1 and 1)
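As background on the metric, the silhouette coefficient of a sample is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the members of the nearest other cluster. The coefficient lies between -1 and 1. The plain-Python sketch below averages it over all samples; it assumes Euclidean distance, and the name mean_silhouette and the sample points are invented, so it illustrates the definition rather than the library's implementation.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_silhouette(points, labels):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself adds nothing to the sum).
        a = sum(euclidean(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(point, q) for q in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = mean_silhouette(points, [0, 0, 1, 1])  # nearby points grouped together
bad = mean_silhouette(points, [0, 1, 0, 1])   # distant points grouped together
```

The sensible grouping scores close to 1, while the grouping that pairs distant points scores below 0, which shows why negative values are possible.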
