Welcome to bldg_point_clustering’s documentation!
Introduction
A Python 3.5+ wrapper for clustering building point labels using KMeans, DBSCAN, and Agglomerative clustering.
Quick Start
Instantiate a Featurizer object and get a featurized Pandas DataFrame.
Instantiate a Cluster object and pass in the original and featurized DataFrames. Then, call a clustering method with the appropriate parameters.
Use the plot_3D function in the Plotter to create a 3D plot of the metrics returned by any of the clustering trials.
Example Usage
Running one iteration of the KMeans algorithm:
import pandas as pd
import numpy as np
from bldg_point_clustering.cluster import Cluster
from bldg_point_clustering.featurizer import Featurizer
# Load the point labels; the first column holds the strings to cluster.
filename = "GBSF"
df = pd.read_csv("./datasets/" + filename + ".csv")
first_column = df.iloc[:, 0]
# Featurize the strings with the bag-of-words model.
f = Featurizer(filename, corpus=first_column)
featurized_df = f.bag_of_words()
# Run KMeans once, then read back the metrics of that run.
c = Cluster(df, featurized_df)
clustered_df = c.kmeans(n_clusters=300, plot=True, to_csv=True)
metrics = c.get_metrics_df()
avg_levenshtein_score = np.mean(c.get_levenshtein_scores())
Running several iterations of the KMeans algorithm:
from bldg_point_clustering.plotter import plot_3D
c.kmeans_trials()
metrics = c.get_metrics_df()
plot_3D(metrics, "n_clusters", "Avg Levenshtein Score", "Silhouette Score")
This process is similar for DBScan and Agglomerative.
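As a rough sketch of what "similar" could look like, the helper below takes a Cluster object built as in the KMeans example and calls dbscan and agglomerative with the parameters listed in the Cluster section. The helper name run_other_algorithms and the parameter values are illustrative assumptions, not recommendations from the package.

```python
def run_other_algorithms(c, eps=0.5, min_samples=10, n_clusters=300):
    """Run DBSCAN and agglomerative clustering on an existing Cluster object.

    `c` is assumed to be a Cluster built as in the KMeans example.
    The helper name and the default values here are hypothetical.
    """
    results = {}

    # get_metrics_df() reports on the most recent run, so read the
    # metrics before starting the next algorithm.
    results["dbscan_clusters"] = c.dbscan(eps=eps, min_samples=min_samples)
    results["dbscan_metrics"] = c.get_metrics_df()

    results["agglomerative_clusters"] = c.agglomerative(n_clusters=n_clusters)
    results["agglomerative_metrics"] = c.get_metrics_df()
    return results
```

Calling run_other_algorithms(c) after the KMeans example would return the clustered DataFrames and the metrics for both algorithms in one dictionary.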
Featurizer
class bldg_point_clustering.featurizer.featurizer.Featurizer(filename, corpus)
Creates a Featurizer object instance.
Parameters:
- filename – The name of the file containing the data to be featurized (excluding the file extension)
- corpus – The Pandas Series of the strings to be clustered
bag_of_words(min_freq=1, max_freq=1.0, stop_words=None, tfidf=False)
Returns feature vectors based on the bag-of-words model, with each string tokenized using the arka tokenizer (see tokenizers for more information).
Parameters:
- min_freq – Minimum frequency of a word to include in the featurization (float from 0.0 to 1.0)
- max_freq – Maximum frequency of a word to include in the featurization (float from 0.0 to 1.0)
- stop_words – Array of stop words, all of which will be removed from the resulting tokens.
- tfidf – Boolean indicating whether to use the term frequency–inverse document frequency (TF-IDF) model
Returns: Pandas DataFrame of the document-term matrix featurization of the corpus (featurized DataFrame)
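To make the document-term matrix concrete, here is a minimal plain-Python sketch of a bag-of-words featurization with frequency bounds and stop words. It is not the package's implementation: the regex tokenizer is a stand-in for the arka tokenizer, min_freq and max_freq are assumed to mean the fraction of documents containing a token, and the function name bag_of_words_sketch is hypothetical.

```python
import re
from collections import Counter


def bag_of_words_sketch(corpus, min_freq=0.0, max_freq=1.0, stop_words=None):
    """Build a document-term count matrix from a list of strings.

    Assumptions (not taken from the package): tokens are runs of letters
    or runs of digits, and min_freq/max_freq are fractions of documents
    that contain a token.
    """
    stop_words = set(stop_words or [])
    # Stand-in tokenizer: "AHU1.SupplyTemp" -> ["ahu", "1", "supplytemp"].
    docs = [
        [t for t in re.findall(r"[a-z]+|[0-9]+", s.lower()) if t not in stop_words]
        for s in corpus
    ]

    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    n_docs = len(docs)
    vocab = sorted(
        t for t, n in doc_freq.items() if min_freq <= n / n_docs <= max_freq
    )

    # One row per document, one column per vocabulary token.
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        matrix.append([counts[t] for t in vocab])
    return vocab, matrix


vocab, matrix = bag_of_words_sketch(
    ["AHU1.SupplyTemp", "AHU1.ReturnTemp", "AHU2.SupplyTemp"],
    stop_words=["ahu"],
)
```

Each row of matrix corresponds to one input string and each column to one entry of vocab, which is the shape the featurized DataFrame is described as having.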
arka()
Returns feature vectors based on the arka thesis model.
Returns: Pandas DataFrame of the document-term matrix featurization of the corpus
get_word_matrix_df()
Gets the bag-of-words document-term featurization matrix.
Returns: Pandas DataFrame of the bag-of-words document-term featurization matrix
Cluster
class bldg_point_clustering.cluster.cluster.Cluster(df, featurized_df)
Creates a Cluster object instance.
Parameters:
- df – The Pandas DataFrame of the original data.
- featurized_df – The Pandas DataFrame of the featurized data.
kmeans(n_clusters=2, max_iter=300, plot=False, levenshtein_min_samples=50, to_csv=False)
Runs one iteration of the kmeans clustering algorithm.
Parameters:
- n_clusters – The number of clusters to form as well as the number of centroids to generate.
- max_iter – Maximum number of iterations of the k-means algorithm for a single run.
- plot – Boolean value indicating whether to plot clusters.
- levenshtein_min_samples – Minimum number of elements in a cluster for the cluster to be counted for calculating the Levenshtein score.
- to_csv – Boolean value indicating whether to export clusters to a csv file.
Returns: Pandas DataFrame with a column for each cluster.
dbscan(eps=2, min_samples=10, plot=False, levenshtein_min_samples=50, to_csv=False)
Runs one iteration of the dbscan clustering algorithm.
Parameters:
- eps – The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster.
- min_samples – The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
- levenshtein_min_samples – Minimum number of elements in a cluster for the cluster to be counted for calculating the Levenshtein score.
- plot – Boolean value indicating whether to plot clusters.
- to_csv – Boolean value indicating whether to export clusters to a csv file.
Returns: Pandas DataFrame with a column for each cluster.
agglomerative(n_clusters=2, plot=False, levenshtein_min_samples=50, to_csv=False)
Runs one iteration of the agglomerative/hierarchical clustering algorithm.
Parameters:
- n_clusters – The number of clusters to form.
- plot – Boolean value indicating whether to plot clusters.
- levenshtein_min_samples – Minimum number of elements in a cluster for the cluster to be counted for calculating the Levenshtein score.
- to_csv – Boolean value indicating whether to export clusters to a csv file.
Returns: Pandas DataFrame with a column for each cluster.
kmeans_trials(min_clusters=2, max_clusters=10, step=3, levenshtein_min_samples=50, plot=False)
Runs multiple iterations/trials of the kmeans clustering algorithm starting from ‘min_clusters’ and adding ‘step’ each time until the number of clusters reaches ‘max_clusters’. Calculates the metrics below at each iteration.
Parameters:
- min_clusters – The lowest number of clusters to start with
- max_clusters – The maximum number of clusters to end with
- step – The number of clusters to increment by for each iteration
- levenshtein_min_samples – Minimum number of elements in a cluster for the cluster to be counted for calculating the Levenshtein score.
- plot – Boolean value indicating whether to plot clusters.
Returns: Pandas DataFrame with all the metrics for each clustering iteration. Metrics include: Number of Clusters, Avg Levenshtein Score, STD Levenshtein Score, Min Levenshtein Score, Max Levenshtein Score, Silhouette Score, SSE, MSE, RMSE, Average Cluster Size, STD Cluster Size
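The sweep described above can be sketched as a simple range over cluster counts. This is an illustration only: the function name trial_cluster_counts is hypothetical, and treating max_clusters as an inclusive bound is an assumption, since the documentation does not say whether the final value is included.

```python
def trial_cluster_counts(min_clusters=2, max_clusters=10, step=3):
    """List the n_clusters values a trial sweep would visit.

    Assumption: max_clusters is an inclusive upper bound; the
    documentation does not state how the package treats it.
    """
    if step <= 0:
        raise ValueError("step must be a positive integer")
    return list(range(min_clusters, max_clusters + 1, step))


counts = trial_cluster_counts()  # defaults mirror kmeans_trials
```

Under that assumption the defaults visit 2, 5, and 8 clusters, so max_clusters itself is only reached when it is a whole number of steps above min_clusters. The same schedule would apply to agglomerative_trials, which takes the same three parameters.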
dbscan_trials(min_eps=0.2, max_eps=1, eps_step=0.2, start_min_samples=10, max_min_samples=30, min_samples_step=5, levenshtein_min_samples=50, plot=False)
Runs multiple iterations/trials of the dbscan clustering algorithm starting from ‘min_eps’ and adding ‘eps_step’ each time until eps reaches ‘max_eps’. For each eps value, several iterations are run starting with ‘start_min_samples’ up to ‘max_min_samples’, incremented by ‘min_samples_step’ at each iteration.
Parameters:
- min_eps – The lowest maximum distance between two samples for one to be considered as in the neighborhood of the other.
- max_eps – The highest maximum distance between two samples for one to be considered as in the neighborhood of the other.
- eps_step – The amount to increment eps by for each iteration
- start_min_samples – The lowest number of samples (or total weight) in a neighborhood for a point to be considered as a core point.
- max_min_samples – The highest number of samples (or total weight) in a neighborhood for a point to be considered as a core point.
- min_samples_step – The amount to increment min_samples by for each iteration
- levenshtein_min_samples – Minimum number of elements in a cluster for the cluster to be counted for calculating the Levenshtein score.
- plot – Boolean value indicating whether to plot clusters.
Returns: Pandas DataFrame with all the metrics for each clustering iteration. Metrics include: EPS, Min Samples, Avg Levenshtein Score, STD Levenshtein Score, Min Levenshtein Score, Max Levenshtein Score, Estimated # of Clusters, Estimated # of Noise/Outlier Points, Silhouette Score, Average Cluster Size, STD Cluster Size
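The two nested sweeps can be pictured as a grid of (eps, min_samples) pairs. The sketch below enumerates that grid for the default arguments; the function name dbscan_trial_grid is hypothetical, and it assumes both upper bounds are inclusive, which the documentation does not confirm.

```python
import math
from itertools import product


def dbscan_trial_grid(min_eps=0.2, max_eps=1, eps_step=0.2,
                      start_min_samples=10, max_min_samples=30,
                      min_samples_step=5):
    """Enumerate (eps, min_samples) pairs for a DBSCAN parameter sweep.

    Assumption: both upper bounds are inclusive; the documentation does
    not say whether the package includes them.
    """
    # Index the eps steps with an integer instead of adding eps_step
    # repeatedly, which would accumulate floating-point error.
    n_eps = math.floor((max_eps - min_eps) / eps_step + 1e-9) + 1
    eps_values = [round(min_eps + i * eps_step, 10) for i in range(n_eps)]
    min_samples_values = range(start_min_samples, max_min_samples + 1,
                               min_samples_step)
    # Outer loop over eps, inner loop over min_samples.
    return list(product(eps_values, min_samples_values))


grid = dbscan_trial_grid()
```

With the defaults this gives five eps values times five min_samples values, so the cost of a trial run grows with the product of the two ranges rather than their sum.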
agglomerative_trials(min_clusters=2, max_clusters=10, step=3, levenshtein_min_samples=50, plot=False)
Runs multiple iterations/trials of the agglomerative/hierarchical clustering algorithm starting from ‘min_clusters’ and adding ‘step’ each time until the number of clusters reaches ‘max_clusters’. Calculates the metrics below at each iteration.
Parameters:
- min_clusters – The lowest number of clusters to start with
- max_clusters – The maximum number of clusters to end with
- step – The number of clusters to increment by for each iteration
- levenshtein_min_samples – Minimum number of elements in a cluster for the cluster to be counted for calculating the Levenshtein score.
- plot – Boolean value indicating whether to plot clusters.
Returns: Pandas DataFrame with all the metrics for each clustering iteration. Metrics include: Number of Clusters, Avg Levenshtein Score, STD Levenshtein Score, Min Levenshtein Score, Max Levenshtein Score, Silhouette Score, Average Cluster Size, STD Cluster Size
get_levenshtein_scores(min_samples=50)
Calculates the Levenshtein score of each cluster by averaging the Levenshtein scores of all pairs of strings in that cluster, and returns one score per cluster. The scores are calculated for the most recently run clustering algorithm.
Parameters:
- min_samples – Minimum number of elements in a cluster for the cluster to be counted for calculating the Levenshtein score.
Returns: Array of Levenshtein scores, one for each cluster
get_metrics_df()
Gets the Pandas DataFrame containing the metrics of the last run clustering algorithm.
Returns: Pandas DataFrame containing the metrics of the last run clustering algorithm.
get_clustered_df()
Gets the Pandas DataFrame of the data sorted into its respective clusters after running one of the clustering algorithms.
Returns: Pandas DataFrame in which each column represents a cluster.
get_cluster_instance()
Gets the Sklearn object of the previously called clustering algorithm.
Returns: Sklearn object of the previously called clustering algorithm.
get_cluster_fit_instance()
Gets the Sklearn object of the previously called clustering algorithm after fitting the data.
Returns: Sklearn object of the previously called clustering algorithm after fitting the data.
Plotter
bldg_point_clustering.plotter.plotter.plot_3D(df, x, y, z)
Plots metrics data on a 3D plot with given axes x, y, and z.
Parameters:
- df – A Pandas DataFrame of columns of numerical data (i.e. Metrics DataFrame)
- x – The column of the dataframe to go on the x-axis (Column Name -> String)
- y – The column of the dataframe to go on the y-axis (Column Name -> String)
- z – The column of the dataframe to go on the z-axis (Column Name -> String)
Returns: 3D Plot of x, y, and z (using Plotly Express)
bldg_point_clustering.plotter.plotter.plot_silhouettes(X, labels)
Finds and plots silhouette samples for each label.
Parameters:
- X – A dataframe of the featurized data
- labels – The cluster label assigned to each string (array)
bldg_point_clustering.plotter.plotter.plot_kmeans_inertia(X, max_clusters=10)
Finds and plots the inertia (sum of squared errors) of running kmeans with the number of clusters ranging from 1 to max_clusters inclusive.
Parameters:
- X – A dataframe of the featurized data
- max_clusters – The maximum number of clusters to plot the inertia (sum of squared errors) for.
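For readers unfamiliar with the quantity being plotted, the sketch below computes inertia, the sum of squared distances from each point to the mean of its cluster, in plain Python. The cluster labels are supplied by hand rather than produced by k-means, and the function name inertia is an illustrative choice, not part of the package.

```python
def inertia(points, labels):
    """Sum of squared distances from each point to its cluster mean."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        # Centroid = per-dimension mean of the cluster's points.
        centroid = [sum(dim) / len(members) for dim in zip(*members)]
        total += sum(
            (x - c) ** 2 for p in members for x, c in zip(p, centroid)
        )
    return total


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_cluster = inertia(points, [0, 0, 0, 0])
two_clusters = inertia(points, [0, 0, 1, 1])
```

Splitting the four points into their two natural groups drops the inertia sharply, which is the kind of bend an inertia-versus-cluster-count plot is typically read for.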
bldg_point_clustering.plotter.plotter.plot_kmeans(X, n_clusters=2)
Plots the PCA-reduced data (2 components) with kmeans clustering and n_clusters.
Parameters:
- X – A dataframe of the featurized data
- n_clusters – The number of clusters to apply kmeans clustering and plot
bldg_point_clustering.plotter.plotter.plot_dbscan(X, eps=0.2, min_samples=10)
Plots the PCA-reduced data (2 components) with dbscan clustering with epsilon (eps) and min_samples.
Parameters:
- X – A dataframe of the featurized data
- eps – The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster.
- min_samples – The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
bldg_point_clustering.plotter.plotter.plot_agglomerative(X, n_clusters=2)
Plots the PCA-reduced data (2 components) with agglomerative clustering with n_clusters.
Parameters:
- X – A dataframe of the featurized data
- n_clusters – The number of clusters to apply agglomerative clustering and plot
Metrics
bldg_point_clustering.metrics.metrics.levenshtein_metric(X, min_samples=0)
Returns an array of average Levenshtein scores, one per cluster, computed by averaging the Levenshtein scores of all pairs of strings in each cluster.
Parameters:
- X – Pandas DataFrame of clusters with their respective strings
- min_samples – Minimum number of elements in a cluster for the cluster to be counted for calculating the Levenshtein score.
Returns: Array of average Levenshtein scores of each cluster
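The averaging described above can be sketched in plain Python as follows. This is not the package's code: it assumes the clusters are given as lists of strings rather than DataFrame columns, uses the raw (unnormalized) edit distance, and skips clusters that are below min_samples or have no pair to compare. Both function names are hypothetical.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # delete ch_a
                current[j - 1] + 1,                # insert ch_b
                previous[j - 1] + (ch_a != ch_b),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def average_pairwise_levenshtein(clusters, min_samples=0):
    """Average edit distance over all string pairs, one score per cluster.

    Assumptions: `clusters` is a list of lists of strings, and clusters
    smaller than `min_samples` (or with fewer than two strings) are skipped.
    """
    scores = []
    for strings in clusters:
        if len(strings) < max(min_samples, 2):
            continue
        distances = [levenshtein(a, b) for a, b in combinations(strings, 2)]
        scores.append(sum(distances) / len(distances))
    return scores


scores = average_pairwise_levenshtein(
    [["AHU1.SAT", "AHU2.SAT", "AHU3.SAT"], ["VAV1.ZoneTemp", "Chiller.kW"]]
)
```

In this sketch a low score means the strings in a cluster differ by only a few characters, so the first cluster of near-identical labels scores lower than the second, mixed one.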
bldg_point_clustering.metrics.metrics.silhouette_metric(X, labels)
Returns the silhouette score of a featurized Pandas DataFrame.
Parameters:
- X – Pandas DataFrame of featurized data
- labels – The cluster label assigned to each string (array)
Returns: Silhouette Score (floating-point value between -1 and 1)
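As background, the standard silhouette coefficient of a point compares its mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b), giving (b - a) / max(a, b), which lies between -1 and 1. The sketch below averages that value over all points in plain Python; it assumes Euclidean distance and is an illustration, not the package's implementation. The function name silhouette_score_sketch is hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def silhouette_score_sketch(points, labels):
    """Mean silhouette coefficient over all points."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    coefficients = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # Convention: a point alone in its cluster scores 0.
            coefficients.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance to the point itself is 0, so divide by len - 1).
        a = sum(euclidean(point, o) for o in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(euclidean(point, o) for o in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        coefficients.append((b - a) / denom if denom else 0.0)
    return sum(coefficients) / len(coefficients)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score_sketch(points, [0, 0, 1, 1])
bad = silhouette_score_sketch(points, [0, 1, 0, 1])
```

Here good is close to 1 because each cluster is tight and far from the other, while bad is negative because every point sits closer to the other cluster than to its own.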