Wednesday, 12 February 2025

Comparing Different Clustering Algorithms: k-Means, DBSCAN, GMM, and Hierarchical Clustering

Let's implement multiple clustering algorithms on the Wholesale Customer dataset and evaluate them using Silhouette Score and Davies-Bouldin Index.

Clustering Methods to Implement:

  1. k-Means Clustering (Partition-based)
  2. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) (Density-based)
  3. GMM (Gaussian Mixture Model) (Probabilistic / model-based)
  4. Hierarchical Clustering (Connectivity-based)

Evaluation Metrics:

  • Silhouette Score: measures how compact and well separated the clusters are; it ranges from -1 to 1, and higher is better.
  • Davies-Bouldin Index: measures the ratio of within-cluster spread to between-cluster separation; lower is better, with 0 the best possible value.
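
For reference, the standard definitions behind these two scores are:

  • Silhouette: for each point i, s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other points in its own cluster and b(i) is the mean distance from i to the points in the nearest other cluster; the overall score is the mean of s(i) over all points.
  • Davies-Bouldin: DB = (1/k) * sum_i max_{j != i} (sigma_i + sigma_j) / d(c_i, c_j), where sigma_i is the average distance of the points in cluster i to its centroid c_i and d(c_i, c_j) is the distance between centroids.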

Step-1: Mount the drive

from google.colab import drive
drive.mount('/content/drive')


 Step-2: Read the Wholesale_customers_data.csv dataset 
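
A minimal snippet for this step, assuming the CSV has been copied to your Drive (adjust the path to wherever the file actually lives):

import pandas as pd

# The path below is an assumption; point it at the actual location of the file in your Drive
df = pd.read_csv('/content/drive/MyDrive/Wholesale_customers_data.csv')
df.head()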


Step-3: Preprocessing: select the relevant features and standardize them
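
The feature-selection part might look like this (a minimal sketch; the column names assume the standard UCI Wholesale customers file, in which Channel and Region are nominal codes and the remaining six columns are annual spending figures):

# Keep only the continuous spending features; drop the nominal Channel and Region codes
features = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen']
X = df[features]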

 

Standardize the dataset for better clustering performance:

from sklearn.preprocessing import StandardScaler # Import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


Step-4: Store results in a dictionary for evaluation


clustering_results = {}

Step-5: Prepare a function to evaluate the clustering results

from sklearn.metrics import silhouette_score, davies_bouldin_score

def evaluate_clustering(labels, X_scaled):
    if len(set(labels)) > 1:  # Both metrics are defined only for 2 or more clusters
        silhouette = silhouette_score(X_scaled, labels)
        db_index = davies_bouldin_score(X_scaled, labels)
    else:
        silhouette = -1  # Sentinel value: undefined for a single cluster
        db_index = -1    # Sentinel value: treat as "not evaluated", not as a good score
    return silhouette, db_index

Step-6: k-Means Clustering


from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X_scaled)
clustering_results['k-Means'] = evaluate_clustering(kmeans_labels, X_scaled)

 Step-7: DBSCAN Clustering


from sklearn.cluster import DBSCAN

# eps and min_samples control the neighbourhood density; DBSCAN labels noise points as -1,
# and those points are counted as one extra group when the scores are computed below
dbscan = DBSCAN(eps=1.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_scaled)
clustering_results['DBSCAN'] = evaluate_clustering(dbscan_labels, X_scaled)

Step-8: Gaussian Mixture Model (GMM)


from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=3, random_state=42)
gmm_labels = gmm.fit_predict(X_scaled)
clustering_results['GMM'] = evaluate_clustering(gmm_labels, X_scaled)

Step-9: Hierarchical (Agglomerative) Clustering

from sklearn.cluster import AgglomerativeClustering

hierarchical = AgglomerativeClustering(n_clusters=3, linkage='ward')
hierarchical_labels = hierarchical.fit_predict(X_scaled)

# Evaluate Hierarchical Clustering
clustering_results['Hierarchical'] = evaluate_clustering(hierarchical_labels, X_scaled)

Step-10:  Convert results to DataFrame for easy comparison


clustering_eval_df = pd.DataFrame.from_dict(
    clustering_results, orient='index', columns=['Silhouette Score', 'Davies-Bouldin Index']
)

 Step-11: Comparing Different Clustering Algorithms

import pandas as pd # If pandas is not already imported

print("Clustering Evaluation Metrics (Including Hierarchical):")
display(clustering_eval_df)

Final output: a table with one row per algorithm, showing its Silhouette Score and Davies-Bouldin Index.


Agglomerative and Divisive Clustering in Hierarchical Clustering

Hierarchical clustering is a clustering technique that builds a hierarchy of clusters. It does not require specifying the number of clusters in advance and is particularly useful for understanding the structure of data. It is mainly divided into Agglomerative Clustering (Bottom-Up Approach) and Divisive Clustering (Top-Down Approach).

1. Agglomerative Clustering (Bottom-Up Approach)

Agglomerative clustering starts with each data point as an individual cluster and iteratively merges the closest clusters until only one cluster remains.

Steps in Agglomerative Clustering:

  1. Initialize each data point as a separate cluster.
  2. Compute pairwise distances between clusters.
  3. Merge the two closest clusters based on a linkage criterion.
  4. Repeat steps 2-3 until all points belong to a single cluster or until a desired number of clusters is reached.
  5. Cut the dendrogram at the chosen level to obtain the final clusters.

Types of Linkage Methods:

  • Single Linkage: Merges clusters based on the minimum distance between points.
  • Complete Linkage: Uses the maximum distance between points.
  • Average Linkage: Considers the average distance between all pairs of points in clusters.
  • Ward’s Method: Minimizes variance within clusters.

Advantages of Agglomerative Clustering:

  1. No need to predefine the number of clusters.
  2. Suitable for small to medium-sized datasets.
  3. Produces a dendrogram, which helps in deciding the optimal number of clusters.

Disadvantages:

  1. Computationally expensive for large datasets (at least O(n²) time and memory).
  2. Sensitive to noise and outliers.

2. Divisive Clustering (Top-Down Approach)

Divisive clustering starts with all data points in a single cluster and recursively splits clusters into smaller ones until each data point is its own cluster (a minimal code sketch of this idea follows the steps below).

Steps in Divisive Clustering:

  1. Consider all data points as a single cluster.
  2. Use a clustering algorithm (e.g., k-Means) to divide the cluster into two sub-clusters.
  3. Recursively repeat step 2 on each cluster until each point is in its own cluster or the stopping criterion is met.
  4. The resulting dendrogram is then cut at an appropriate level to define the final clusters.
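
Divisive clustering has no ready-made estimator in most standard libraries, but the top-down idea can be sketched by recursively bisecting the largest remaining cluster with 2-means. The function below is only a minimal illustration under that simplification (it assumes X is a NumPy array and stops once the requested number of clusters is reached), not a full divisive algorithm:

import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, n_clusters=3, random_state=42):
    # Start with every point in a single cluster (label 0)
    labels = np.zeros(len(X), dtype=int)
    while len(np.unique(labels)) < n_clusters:
        # Pick the largest current cluster and split it into two with 2-means
        sizes = {c: np.sum(labels == c) for c in np.unique(labels)}
        target = max(sizes, key=sizes.get)
        idx = np.where(labels == target)[0]
        if len(idx) < 2:  # Nothing left to split
            break
        km = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit(X[idx])
        # Points assigned to the second half receive a brand-new cluster label
        labels[idx[km.labels_ == 1]] = labels.max() + 1
    return labels

Recent scikit-learn releases (1.1 and later) also provide sklearn.cluster.BisectingKMeans, which implements essentially this top-down splitting strategy out of the box.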

Advantages of Divisive Clustering:

  1. More accurate in some cases, as it doesn't suffer from early erroneous merges.
  2. Can be more meaningful when the natural structure of data is divisive in nature.

Disadvantages:

  1. Computationally very expensive (O(2^n) complexity).
  2. Not widely implemented in standard libraries.
  3. Requires a predefined stopping criterion for splitting.

Comparison with k-Means: k-Means is faster but requires predefining the number of clusters, while hierarchical clustering is slower but provides more insights.

from sklearn.cluster import AgglomerativeClustering, KMeans
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import fcluster

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# Agglomerative Clustering
clustering = AgglomerativeClustering(n_clusters=2).fit(X)
print("Agglomerative Clustering Labels:", clustering.labels_)

# K-Means Clustering
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print("K-Means Clustering Labels:", kmeans.labels_)

# Compare the results (you can add more sophisticated comparison methods)
print("Are the cluster labels the same?", np.array_equal(clustering.labels_, kmeans.labels_))


Z = linkage(X, 'ward') # Ward Distance

dendrogram(Z)  # plot the dendrogram

plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Data point')
plt.ylabel('Distance')
plt.show()



max_dist = 3  # Example maximum distance. Adjust as needed based on the dendrogram.
clusters = fcluster(Z, max_dist, criterion='distance')
num_clusters = len(set(clusters))

print(f"Number of clusters (at max distance {max_dist}): {num_clusters}")


Output:

Agglomerative Clustering Labels: [1 1 1 0 0 0]
K-Means Clustering Labels: [1 0 1 0 0 0]
Are the cluster labels the same? False

 


Number of clusters (at max distance 3): 4

Thursday, 30 January 2025

Understanding Factor Analysis: A Key Tool for Data Reduction and Interpretation

 Factor analysis is a powerful statistical technique widely used in research to reduce large datasets into smaller, interpretable groups. It is an interdependence technique, meaning that it does not have predefined dependent or independent variables. Instead, it identifies underlying relationships among multiple observed variables to uncover common dimensions or "factors."

This article explores the concept, methods, and applications of factor analysis, highlighting its significance in fields such as marketing research, psychology, and social sciences.

What is Factor Analysis?

Factor analysis is a statistical method used to summarize and reduce data by identifying hidden structures in large datasets. It works by grouping together variables that share common underlying dimensions, making the data easier to interpret.

For example, if a researcher is analyzing consumer preferences, they may collect responses on 100 different product attributes. Instead of analyzing all 100 variables separately, factor analysis groups related attributes together into fewer categories (e.g., “Quality,” “Price Sensitivity,” “Brand Loyalty”), simplifying the dataset for further analysis.

Types of Factor Analysis

There are two primary types of factor analysis:

1. Principal Component Analysis (PCA)

  • PCA is the most widely used form of factor analysis.

  • It considers the total variance in the dataset, including unique and error variance.

  • It creates new variables (principal components) that explain the maximum variance in the data.

2. Common Factor Analysis (FA)

  • FA considers only the shared (common) variance among variables.

  • It ignores unique and error variance, making it more precise for identifying underlying factors.

  • FA is commonly used in psychology and social sciences.

Key Concepts in Factor Analysis

Factor analysis involves several important concepts that help researchers interpret their data effectively:

1. Factor Loadings

Factor loadings measure how strongly each variable is related to a particular factor. A high loading (closer to 1) indicates a strong relationship.

2. Eigenvalues

Eigenvalues indicate the amount of variance explained by each factor. Generally, factors with eigenvalues greater than 1 are considered significant.

3. Scree Plot

A scree plot is a graphical method used to determine the number of factors. It displays eigenvalues in descending order, allowing researchers to identify a natural cutoff point (the “elbow” point).

4. Factor Rotation

Factor rotation helps improve the interpretability of factor analysis results by redistributing factor loadings more evenly. There are two main types:

  • Orthogonal Rotation (e.g., Varimax) keeps factors uncorrelated.

  • Oblique Rotation (e.g., Oblimin) allows factors to be correlated.

Applications of Factor Analysis

Factor analysis has numerous applications across different fields:

1. Marketing Research

  • Identifying customer segments based on purchasing behavior.

  • Understanding brand perception by analyzing survey responses.

  • Reducing large datasets of product attributes into key dimensions.

2. Psychology and Social Sciences

  • Developing psychological scales (e.g., personality traits, intelligence tests).

  • Measuring abstract concepts like trust, honesty, or satisfaction.

  • Grouping related survey questions into meaningful constructs.

3. Healthcare and Medicine

  • Identifying risk factors for diseases based on patient data.

  • Grouping symptoms into broader syndromes for diagnosis.

4. Education and Academia

  • Analyzing student performance across multiple subjects.

  • Identifying key factors influencing learning outcomes.

How to Perform Factor Analysis

The following steps outline how to conduct a factor analysis; a minimal code sketch follows the steps.

Step 1: Data Preparation

  • Ensure that variables are quantitative (measured on interval or ratio scales).

  • Check for sufficient sample size (typically at least 100 participants, with 10 respondents per variable).

  • Verify correlations among variables using Bartlett’s test of sphericity.

Step 2: Factor Extraction

  • Decide on the method (PCA or Common Factor Analysis).

  • Use Eigenvalues > 1 or a scree plot to determine the number of factors.

Step 3: Factor Rotation

  • Apply rotation to simplify interpretation.

  • Choose between orthogonal (uncorrelated factors) or oblique (correlated factors) rotation methods.

Step 4: Interpret Factors

  • Examine factor loadings to identify meaningful relationships.

  • Name factors based on the common themes among the grouped variables.

Step 5: Apply Results

  • Use factor scores for further analysis (e.g., regression, clustering).

  • Summarize findings to inform decision-making.
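
As a rough illustration of Steps 2-4, the sketch below assumes a DataFrame df whose columns are quantitative survey items, and it arbitrarily extracts two factors with scikit-learn's FactorAnalysis (the rotation argument requires a reasonably recent scikit-learn version):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import FactorAnalysis

# df is assumed to be a DataFrame of quantitative (interval/ratio scale) survey items
X = StandardScaler().fit_transform(df)

# Scree plot: eigenvalues of the correlation matrix in descending order
eigenvalues = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker='o')
plt.axhline(1, linestyle='--')  # Kaiser criterion: keep factors with eigenvalue > 1
plt.xlabel('Factor number')
plt.ylabel('Eigenvalue')
plt.title('Scree Plot')
plt.show()

# Extract two factors with varimax (orthogonal) rotation and inspect the approximate loadings
fa = FactorAnalysis(n_components=2, rotation='varimax', random_state=0)
scores = fa.fit_transform(X)  # factor scores, reusable in later analysis (Step 5)
loadings = pd.DataFrame(fa.components_.T, index=df.columns,
                        columns=['Factor 1', 'Factor 2'])
print(loadings)

Variables with high absolute loadings on the same factor are then grouped and named according to their common theme, as described in Step 4.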

Advantages and Limitations of Factor Analysis

Advantages:

✔ Helps simplify complex datasets by reducing variables. 

✔ Improves data interpretation by identifying meaningful dimensions. 

✔ Supports decision-making in marketing, psychology, and other fields. 

✔ Can be used for survey design and validation.

Limitations:

✖ Requires subjective interpretation of factors. 

✖ Only works well with metric data (not suitable for categorical variables). 

✖ Results may change depending on the chosen extraction and rotation methods. 

✖ Sensitive to sample size and data quality.
