32 Clustering
Proximity Measures¶
- Similarity
- Dissimilarity
- Distance measure (a subclass of dissimilarity)
Range¶
May be
- \([0, 1], [0, 10], [0, 100]\)
- \([0, \infty)\)
Types of Proximity Measures¶
Similarity¶
For documents and sparse data
- Jaccard Similarity
- Cosine Similarity
Dissimilarity¶
For continuous data
- Correlation
- Euclidean
Transformations¶
Be careful with transformations: first study the problem, and apply a transformation only if it is logical for that problem
| Fixed Range \(\to [0, 1]\) | \([0, \infty) \to [0, 1]\) |
|---|---|
| \(s' = \frac{s - s_\text{min}}{s_\text{max} - s_\text{min}}\) | \(d' = \frac{d}{1+d}\) |
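As a quick illustration, here is a minimal Python sketch of both transformations; the sample values are invented:

```python
import numpy as np

# Fixed-range similarity, e.g. scores in [0, 10], rescaled to [0, 1]
s = np.array([0.0, 2.5, 7.0, 10.0])
s_norm = (s - s.min()) / (s.max() - s.min())
print(s_norm)   # [0.   0.25 0.7  1.  ]

# Unbounded dissimilarity in [0, inf), squashed into [0, 1)
d = np.array([0.0, 1.0, 4.0, 99.0])
d_norm = d / (1 + d)
print(d_norm)   # [0.   0.5  0.8  0.99]
```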
Proximity Measures by Attribute Type¶
| Attribute Type | Dissimilarity | Similarity |
|---|---|---|
| Nominal | \(\begin{cases} 0, & p=q \\ 1, & p \ne q \end{cases}\) | \(\begin{cases} 1, & p=q \\ 0, & p \ne q \end{cases}\) |
| Ordinal | \(\dfrac{\vert p-q \vert}{n-1}\), values mapped to integers \([0, n-1]\), where \(n\) is the no of values | \(1- \dfrac{\vert p-q \vert}{n-1}\) |
| Interval/Ratio | \(d = \vert p-q \vert\) | \(-d\), \(\dfrac{1}{1+d}\), \(1 - \dfrac{d-d_\text{min}}{d_\text{max}-d_\text{min}}\) |
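A minimal Python sketch of the table's dissimilarity column, assuming the ordinal values have already been mapped to the integers \(0 \dots n-1\):

```python
def nominal_dissim(p, q):
    # 0 when the categories match, 1 otherwise
    return 0 if p == q else 1

def ordinal_dissim(p, q, n):
    # p, q are integer ranks in [0, n-1]; n is the no of values
    return abs(p - q) / (n - 1)

def ratio_dissim(p, q):
    # plain absolute difference for interval/ratio attributes
    return abs(p - q)

# e.g. ordinal levels {poor, fair, good, excellent} mapped to {0, 1, 2, 3}
print(nominal_dissim("red", "blue"))  # 1
print(ordinal_dissim(1, 3, n=4))      # 0.666...
print(ratio_dissim(2.5, 7.0))         # 4.5
```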
Dissimilarity Matrix¶
A symmetric \(n \times n\) matrix that stores the dissimilarities between all pairs of \(n\) objects
- \(d(2, 1) = d(1, 2)\)
It gives the distance from every object to every other object
Example
| Object Identifier | Test 1 | Test 2 | Test 3 |
|---|---|---|---|
Compute the dissimilarity matrix for Test 2
| | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 1 | | | | |
| 2 | | | | |
| 3 | | | | |
| 4 | | | | |
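Since the actual test values are not given above, this Python sketch uses hypothetical Test 2 scores just to show how the \(4 \times 4\) matrix would be filled in:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical Test 2 scores for the 4 objects (not the original data)
test2 = np.array([[2.0], [10.0], [6.0], [4.0]])

# Pairwise |p - q|, arranged as a symmetric 4x4 dissimilarity matrix
D = squareform(pdist(test2, metric="cityblock"))
print(D)
# [[0. 8. 4. 2.]
#  [8. 0. 4. 6.]
#  [4. 4. 0. 2.]
#  [2. 6. 2. 0.]]
```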
Distance between data objects¶
Minkowski's Distance¶
Let
- \(a, b\) be data objects
- \(n\) be the no of attributes
- \(r\) be a parameter
The distance between \(a\) and \(b\) is
\(d(a, b) = \left( \sum_{k=1}^n \vert a_k - b_k \vert^r \right)^{1/r}\)
| \(r\) | Type of Distance | \(d(a, b)\) | Gives | Magnitude of Distance | Remarks |
|---|---|---|---|---|---|
| 1 | City block / Manhattan / Taxicab / \(L_1\) norm | \(\sum_{k=1}^n \vert a_k - b_k \vert\) | Distance along axes | Maximum | |
| 2 | Euclidean / \(L_2\) norm | \(\sqrt{ \sum_{k=1}^n \vert a_k - b_k \vert^2 }\) | Perpendicular distance | Shortest | We need to standardize the data first |
| \(\infty\) | Chebyshev / Supremum / \(L_{\max}\) norm / \(L_\infty\) norm | \(\max_k (\vert a_k - b_k \vert)\) | | Medium | |
There is also the squared Euclidean distance, which is sometimes used
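A small Python sketch of the three special cases and the general \(L_r\) form; the vectors are made up:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

l1   = np.sum(np.abs(a - b))          # Manhattan: 3 + 2 + 0 = 5
l2   = np.sqrt(np.sum((a - b) ** 2))  # Euclidean: sqrt(9 + 4) ~ 3.61
linf = np.max(np.abs(a - b))          # Chebyshev: max(3, 2, 0) = 3

def minkowski(a, b, r):
    # general L_r norm; r = 1, 2, inf give the rows of the table above
    return np.sum(np.abs(a - b) ** r) ** (1 / r)

print(l1, l2, linf, minkowski(a, b, 3))
```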
Properties of Distance Metrics¶
| Property | Meaning |
|---|---|
| Non-negativity | \(d(a, b) \ge 0\), with \(d(a, b) = 0\) iff \(a = b\) |
| Symmetry | \(d(a, b) = d(b, a)\) |
| Triangle inequality | \(d(a, c) \le d(a, b) + d(b, c)\) |
Similarity between Binary Vectors¶
Let \(M_{ij}\) be the number of attributes where \(p\) has value \(i\) and \(q\) has value \(j\). For example, \(M_{00}\) counts the attributes where both vectors are 0, and \(M_{11}\) those where both are 1.
Simple Matching Coefficient¶
\(\text{SMC} = \dfrac{M_{11} + M_{00}}{M_{00} + M_{01} + M_{10} + M_{11}}\)
Jaccard Coefficient¶
\(J = \dfrac{M_{11}}{M_{01} + M_{10} + M_{11}}\)
We ignore the \(M_{00}\) matches, since for sparse data they would dominate
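A minimal Python sketch of both coefficients on two invented binary vectors:

```python
import numpy as np

p = np.array([1, 0, 0, 0, 0, 0, 1, 0, 0, 1])
q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

m11 = np.sum((p == 1) & (q == 1))   # both 1
m00 = np.sum((p == 0) & (q == 0))   # both 0
m10 = np.sum((p == 1) & (q == 0))
m01 = np.sum((p == 0) & (q == 1))

smc = (m11 + m00) / (m00 + m01 + m10 + m11)
jaccard = m11 / (m01 + m10 + m11)   # M00 matches are ignored

print(smc)      # 0.9
print(jaccard)  # 0.666...
```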
Similarity between Document Vectors¶
Cosine Similarity¶
\(\cos(x, y) = \dfrac{x \cdot y}{\Vert x \Vert \, \Vert y \Vert}\)
| \(\cos (x, y)\) | Interpretation |
|---|---|
| 1 | Similarity |
| 0 | No similarity (orthogonal) |
| -1 | Dissimilarity |
Document Vector¶
Frequency of occurrence of each term in the document
Tanimoto Coefficient/Extended Jaccard Coefficient¶
\(T(x, y) = \dfrac{x \cdot y}{\Vert x \Vert^2 + \Vert y \Vert^2 - x \cdot y}\)
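A short Python sketch of cosine similarity and the Tanimoto coefficient on two hypothetical term-frequency vectors:

```python
import numpy as np

# Term-frequency vectors for two made-up documents
x = np.array([3.0, 2.0, 0.0, 5.0, 0.0])
y = np.array([1.0, 0.0, 0.0, 6.0, 2.0])

# Cosine similarity: dot product over the product of norms
cos = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Tanimoto / extended Jaccard coefficient
tanimoto = (x @ y) / (x @ x + y @ y - x @ y)

print(cos)       # ~0.836
print(tanimoto)  # ~0.717
```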
Correlation¶
Used for continuous attributes
Pearson's Correlation Coefficient (\(r\))¶
\(r = \dfrac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y}\)
Range = \([-1, +1]\)
| \(r\) | Interpretation |
|---|---|
| \(-1\) | High -ve correlation |
| \(0\) | No correlation |
| \(+1\) | High +ve correlation |
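A quick Python check of the definition against numpy's built-in, on invented data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # roughly y = 2x

# Pearson's r as covariance over the product of standard deviations
r = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

print(r)                        # ~0.999 (high +ve correlation)
print(np.corrcoef(x, y)[0, 1])  # same value from numpy's built-in
```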
Clustering¶
Assigning cluster labels to a set of unclassified items
Labels¶
Labels for classes are unknown during clustering
We may label the clusters later
K-Means Clustering¶
Input: \(k =\) number of clusters
Output: Set of \(k\) clusters
Steps¶
1. Randomly choose \(k\) objects from the dataset as the initial cluster centroids (centroid = center of a cluster)
2. For each of the objects:
    1. Compute the distance between the current object and each of the \(k\) cluster centroids
    2. Assign the current object to the cluster whose centroid is closest. If a point is equidistant from 2 centroids, assign it to the first one.
3. Compute the cluster center \(m\) of each cluster; these become the new cluster centroids
4. Repeat steps 2-3 until the convergence criterion is satisfied
5. Stop
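A minimal Python sketch of these steps (the sample data and the `k_means` name are just for illustration); it stops when the centroids, and hence the clusters, no longer change:

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose k objects as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2a: distance from every object to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 2b: assign each object to the closest centroid
        # (np.argmin breaks ties in favour of the first centroid)
        labels = np.argmin(dists, axis=1)
        # Step 3: recompute cluster centers as the new centroids,
        # keeping the old centroid if a cluster ends up empty
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: converged once the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels, centroids = k_means(X, k=2)
print(labels)     # e.g. [0 0 1 1]
print(centroids)  # [[1.25 1.5 ], [8.5  8.75]]
```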
Convergence Criterion¶
- A fixed number of iterations (we can derive this by testing and plotting a graph of accuracy vs iterations), or
- Clusters don't change over subsequent iterations
EM¶
EM = Expectation Maximization
Gaussian EM K-Means Clustering¶
Probabilistic clustering that also uses k-means: the output of k-means is fed into this model as its initialization
Gaussian¶
The multivariate Gaussian density:
\(\mathcal{N}(x \mid \mu, \Sigma) = \dfrac{1}{(2\pi)^{d/2} \vert \Sigma \vert^{1/2}} \exp \left( -\dfrac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right)\)
Gaussian Mixture¶
\(p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)\), where
- \(\mu_k =\) Means
- \(\Sigma_k =\) Covariances
- \(\pi_k =\) Mixing Coefficients
  - Proportion of each Gaussian in the mixture
EM for Gaussian Mixture¶
1. Initialize the Gaussian parameters \(\mu_k, \Sigma_k, \pi_k\) from the k-means output:
    - \(\mu_k \leftarrow\) centroid of k-means cluster \(k\)
    - \(\Sigma_k \leftarrow \operatorname{cov} \big( \text{cluster } k \big)\)
    - \(\pi_k = \frac{\text{No of points in } k}{\text{Total no of points}}\)
2. E Step: Assign each point \(x_n\) an assignment score (responsibility) \(\gamma(z_{nk})\) for each cluster \(k\):
   \(\gamma(z_{nk}) = \dfrac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}\)
3. M Step: Given the scores, adjust \(\mu_k, \pi_k, \Sigma_k\) for each cluster \(k\)
4. Evaluate the log likelihood
5. Stop if the likelihood/parameters converge; otherwise repeat from step 2
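A compact Python sketch of this loop; for simplicity it initializes randomly rather than from a k-means run, and the small value added to the covariance diagonals is a numerical-stability detail, not part of the notes:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, k, max_iters=100, tol=1e-6, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Step 1: initialize mu_k, Sigma_k, pi_k
    mu = X[rng.choice(n, size=k, replace=False)]
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    pi = np.full(k, 1.0 / k)
    prev_ll = -np.inf
    for _ in range(max_iters):
        # E step: responsibilities gamma(z_nk) for every point and cluster
        dens = np.column_stack([
            pi[j] * multivariate_normal.pdf(X, mu[j], sigma[j]) for j in range(k)
        ])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate mu_k, Sigma_k, pi_k from the soft assignments
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for j in range(k):
            diff = X - mu[j]
            sigma[j] = (gamma[:, j][:, None] * diff).T @ diff / Nk[j] + 1e-6 * np.eye(d)
        pi = Nk / n
        # Evaluate the log likelihood; stop once it stabilizes
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return mu, sigma, pi, gamma

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
mu, sigma, pi, gamma = gmm_em(X, k=2)
print(pi)        # mixing coefficients, roughly [0.5, 0.5]
print(gamma[0])  # soft assignment of the first point to each cluster
```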
K Means vs Gaussian Mixture¶
| | K-Means | Gaussian Mixture |
|---|---|---|
| Model | Classifier model | Probability model |
| Probabilistic? | ❌ | ✅ |
| Classifier type | Hard | Soft |
| Parameters to fit to data | \(\mu_k\) | \(\mu_k, \Sigma_k, \pi_k\) |
| ❌ Disadvantage | If a point may belong to multiple clusters, we have to assign it to the first one. If the sample size is too small, the initial grouping determines the clusters significantly | Complex |
| ✅ Advantage | Simple. Fast and efficient, with \(O(tkn)\), where \(t =\) no of iterations, \(k =\) no of clusters, \(n =\) no of sample points | If a point may belong to multiple clusters, we have a probability to back it up |