无监督（聚类）算法

Unsupervised Learning (Clustering) Algorithms

授课对象：计算机科学与技术专业二年级

课程名称：人工智能（专业必修）

节选内容：第六章机器学习

课程学分：3学分

什么是聚类分析？

Finding groups of objects such that the objects in a group will besimilar (or related) to one another and different from (or unrelated to)the objects in other groups

聚类分析的应用

⚫ Understanding

Group related documents forbrowsing

Group genes and proteins thathave similar functionality

Group stocks with similar pricefluctuations

⚫ Summarization

Reduce the size of large datasets

	Discovered Clusters	Industry Group
1	Applied-Matl-DOWN,Bay-Network-Down,3-COM-DOWN,Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,DSC-Comm-DOWN,INTEL-DOWN,LSI-Logic-DOWN,Micron-Tech-DOWN,Texas-Inst-Down,Tellabs-Inc-Down,Natl-Semiconductor-DOWN,Oracl-DOWN,SGI-DOWN,Sun-DOWN	Technology1-DOWN
2	Apple-Comp-DOWN,Autodesk-DOWN,DEC-DOWN,ADV-Micro-Device-DOWN,Andrew-Corp-DOWN,Computer-Assoc-DOWN,Circuit-City-DOWN,Compaq-DOWN,EMC-Corp-DOWN,Gen-Inst-DOWN,Motorola-DOWN,Microsoft-DOWN,Scientific-Atl-DOWN	Technology2-DOWN
3	Fannie-Mae-DOWN,Fed-Home-Loan-DOWN,MBNA-Corp-DOWN,Morgan-Stanley-DOWN	Financial-DOWN
4	Baker-Hughes-UP,Dresser-Inds-UP,Halliburton-HLD-UP,Louisiana-Land-UP,Phillips-Petro-UP,Unocal-UP,Schlumberger-UP	Oil-UP

聚类：社交网络

聚类：图像分割

聚类：视频检测

TVCP

Frame #50

F-m.=0.9630

Frame #70

$F - m. = 0.9752$

Nonnegative PARAFAC

Frame#50

F-m.=0.9209

Frame #70

F-m.=0.9512

什么不是聚类分析？

⚫ Supervised classification

➢ Have class label information

Simple segmentation

Dividing students into different registrationgroups alphabetically, by last name

簇的概念：二义性

How many clusters?

Six Clusters

Two Clusters

Four Clusters

簇的概念：二义性

What is a natural grouping among these objects?

聚类(Clustering)类型

⚫ A clustering is a set of clusters

⚫ Important distinction between hierarchical and partitional sets of clusters

Partitional Clustering (分割式聚类)

➢ A division data objects into non-overlapping subsets (clusters) such thateach data object is in exactly one subset

➢ # clusters is needed, kmeans….

Hierarchical clustering (阶层式聚类)

➢ A set of nested clusters organized as a hierarchical tree

➢ # clusters is not needed

分割式聚类

Original Points

A Partitional Clustering

阶层式聚类

Traditional Hierarchical Clustering

Traditional Dendrogram

Non-traditional Hierarchical Clustering

Non-traditional Dendrogram

不同聚类集合的其他差异

Fuzzy versus non-fuzzy 模糊聚类

➢ In fuzzy clustering, a point belongs to every cluster with someweight between 0 and 1

➢ Weights must sum to 1

➢ Probabilistic clustering has similar characteristics

Partial versus complete

➢ In some cases, we only want to cluster some of the data

⚫ Heterogeneous versus homogeneous 异质vs同质

➢ Cluster of widely different sizes, shapes, and densities

簇(Cluster)的类型

⚫ Well-separated clusters 分离良好

⚫ Center-based clusters 基于中心

⚫ Contiguous clusters 基于邻近程度

⚫ Density-based clusters 基于密度

Property or Conceptual 基于属性或概念

Described by an Objective Function 基于目标函数

Well-Separated Clusters

Well-Separated Clusters:

➢ A cluster is a set of points such that any point in a cluster iscloser (or more similar) to every other point in the cluster than toany point not in the cluster.

3 well-separated clusters

Center-Based Clusters

⚫ Center-based

➢ A cluster is a set of objects such that an object in a cluster is closer(more similar) to the “center” of a cluster, than to the center of anyother cluster

The center of a cluster is often a centroid, the average of all thepoints in the cluster, or a medoid, the most “representative” point of acluster

4 center-based clusters

Contiguity(邻近)-Based Clusters

⚫ Contiguous Cluster (Nearest neighbor or Transitive)

➢ A cluster is a set of points such that a point in a cluster is closer (ormore similar) to one or more other points in the cluster than to anypoint not in the cluster.

8 contiguous clusters

Density-Based Clusters

⚫ Density-based

➢ A cluster is a dense region of points, which is separated by low-density regions, from other regions of high density.

➢ Used when the clusters are irregular or intertwined(不规则or纠缠), andwhen noise and outliers are present.

6 density-based clusters

Conceptual(概念) Clusters

⚫ Shared Property or Conceptual Clusters

➢ Finds clusters that share some common property or represent aparticular concept.

2 Overlapping Circles

Objective Function

⚫ Clusters Defined by an Objective Function

➢ Finds clusters that minimize or maximize an objective function.

穷举法：Enumerate all possible ways of dividing the points intoclusters and evaluate the `goodness’ of each potential set of clustersby using the given objective function. (NP Hard)

➢ Can have global or local objectives.

Hierarchical clustering algorithms typically have local objectives

• Partitional algorithms typically have global objectives

Objective Function

Map the clustering problem to a different domain and solve a related problem in thatdomain

➢ Proximity（亲近度） matrix defines a weighted graph, where the nodes are thedata points, and the weighted edges represent the proximities (similarity ordissimilarity) between data points

point	x	y
p1	0	2
p2	2	0
p3	3	1
p4	5	1

	p1	p2	p3	p4
p1	0.000	2.828	3.162	5.099
p2	2.828	0.000	1.414	3.162
p3	3.162	1.414	0.000	2.000
p4	5.099	3.162	2.000	0.000

Figure Four points and their corresponding data and proximity (distance) matrices.

Similarity? Dissimilarity?

Euclidean Distance / Manhattan Distance / Chebyshev Distance/

Cosine Similarity /…..

Similarity=1-Dissimilarity

距离度量指标

$L_{p} (x_{1}, x_{2}) = (\sum_{l = 1}^{n} x_{i}^{(l)} - x_{j}^{(l)}^{p})^{\frac{1}{p}}$

$L_{1} (x_{1}, x_{2}) = \sum_{l = 1}^{n} x_{i}^{(l)} - x_{j}^{(l)}$ Manhattan Distance

⚫ $L_{2} (x_{1}, x_{2}) = \sum_{l = 1}^{n} (x_{i}^{(l)} - x_{j}^{(l)})^{2}$ Euclidean Distance

$L_{\infty} (x_{1}, x_{2}) = \sum_{l = 1}^{n} x_{i}^{(l)} - x_{j}^{(l)}^{\infty} = ma x (∣ x_{i} - x_{j} ∣)$

Chebyshev Distance

⚫ $L_{- \infty} (x_{1}, x_{2}) = \sum_{l = 1}^{n} x_{i}^{(l)} - x_{j}^{(l)}^{- \infty} = min (x_{i} - x_{j})$

⚫ Cosine Distance: $D (x_{1}, x_{2}) = 1 - cos (x_{1}, x_{2}) = 1 - \frac{x _{1} \cdot x _{2}}{∣ ∣ x _{1} ∣ ∣ _{2} ∣ ∣ x _{2} ∣ ∣ _{2}}$ ??1 2 ??2 2

Objective Function …

7 Clustering is equivalent to breaking the graph into connected components, one foreach cluster.

➢ Want to minimize the edge weight between clusters and maximize the edge weightwithin clusters

R a t i o C u t (π) = \frac{1}{k} i = 1 \sum k \frac{cut ( C _{i} , C ˉ _{i} )}{∣ C _{i} ∣},

N o r m a l i z e d Cut (π) = \frac{1}{k} i = 1 \sum k \frac{cut ( C _{i} , C ˉ _{i} )}{vol ( C _{i} )}

$C_{i}$ : a community

$∣ C i ∣$ : # nodes in $C_{i}$

$v o l (C i)$ : sum of degrees in $C_{i}$

聚类算法

Hierarchical clustering (Omit, because…)

⚫ K-means and its variants

⚫ Density-based clustering

聚类算法

⚫ Hierarchical clustering

Hierarchical clustering starts by treating each observation as aseparate cluster.

Then, it repeatedly executes the following two steps:

(1) identify the two clusters that are closest together,

(2) merge the two most similar clusters.

This iterative process continues until all the clusters are mergedtogether.

K-means聚类

Partitional clustering approach

◼ Each cluster is associated with a centroid (center point)

◼ Each point is assigned to the cluster with the closest centroid

➢ Number of clusters, K, must be specified

➢ The basic algorithm is very simple

Algorithm 1 Basic K-means Algorithm.

1: Select $K$ points as the initial centroids.

2: repeat

3:Form $K$ clusters by assigning all points to the closest centroid.

4:Recompute the centroid of each cluster.

5: until The centroids don’t change

标准K-means流程

K-mean聚类细节

⚫ Initial centroids are often chosen randomly.

➢ Clusters produced vary from one run to another.

⚫ The centroid is (typically) the mean of the points in the cluster.

‘Closeness’ is measured by Euclidean distance, cosine similarity,correlation, etc.

K-means will converge for common similarity measures mentionedabove.

Most of the convergence happens in the first few iterations.

➢ Often the stopping condition is changed to ‘Until relatively fewpoints change clusters’

⚫ Complexity is $O (n^{⋆} K^{⋆} ∣^{⋆} d)$

$n =$ number of points, $K =$ number of clusters,I = number of iterations, $d =$ number of attributes

两个不同的K-mean聚类

Optimal Clustering

Sub-optimal Clustering

选择初始质心的重要性

K-means簇的评估

⚫ Most common measure is Sum of Squared Error (SSE)

➢ For each point, the error is the distance to the nearest cluster

➢ To get SSE, we square these errors and sum them.

S S E = i = 1 \sum K x \in C_{i} \sum d i s t^{2} (m_{i}, x)

➢ x is a data point in cluster Ci and $m_{i}$ is the representative point forcluster Ci

• can show that $m_{i}$ corresponds to the center (mean) of the cluster

➢ Given two clusters, we can choose the one with the smallesterror

➢ One easy way to reduce SSE is to increase K, the number ofclusters

• A good clustering with smaller K can have a lower SSE than a poorclustering with higher K

K的选择

Elbow method: plot a line chart of the SSE for each value of k. If theline chart looks like an arm, then the “elbow” on the arm is the valueof K that is the best.

手臂曲线的肘关节

选择初始点的问题

If there are K ‘real’ clusters then the chance of selecting one centroidfrom each cluster is small.

➢ Chance is relatively small when K is large

➢ If clusters are the same size, n, then

P = \frac{numberofwaystoselectonecentroidfromeachcluster}{numberofwaystoselect K centroids} = \frac{K ! n ^{K}}{( K n ) ^{K}} = \frac{K !}{K ^{K}}

➢ For example, if $K = 10$ , then probabil $ty = 10! /1 0^{10} = 0.00036$

Sometimes the initial centroids will readjust themselves in ‘right’ way,and sometimes they don’t

➢ Consider an example of five pairs of clusters

例子：10个簇

Iteration 4

例子：10个簇

Iteration 1

Iteration 2

例子：10个簇

Iteration 4

Starting with some pairs of clusters having three initial centroids, while other have only one.

例子：10个簇

Iteration 1

Iteration 3

Iteration 2

Iteration 4

如何选择初始质心？

⚫ Multiple runs

➢ Helps, but probability is not on your side

⚫ Sample and use hierarchical clustering to determine initial centroids

⚫ Select more than k initial centroids and then select among these initialcentroids

➢ Select most widely separated

Bisecting K-means （二分kmeans）

➢ 不容易受到初始化问题的影响

Bisecting(二分) K-means

Bisecting K-means algorithm

7 Variant of K-means that can produce a partitional or ahierarchical clustering

Algorithm 3 Bisecting K-means Algorithm.

1: Initialize the list of clusters to contain the cluster containing all points.
2: repeat
3: Select a cluster from the list of clusters
4: for $i = 1$ to number_of_iterations do
5: Bisect the selected cluster using basic K-means
6: end for
7: Add the two clusters from the bisection with the lowest SSE to the list of clusters.
8: until Until the list of clusters contains $K$ clusters

例子：二分K-means

处理空簇

⚫ Basic K-means algorithm can yield empty clusters

处理空簇

⚫ Basic K-means algorithm can yield empty clusters

If happens, find a replacement centroid to replace the centroid of theempty cluster, several strategies:

➢ Choose the point that contributes most to SSE

➢ Choose a point from the cluster with the highest SSE

➢ If there are several empty clusters, the above can be repeatedseveral times.

Algorithm 1 Basic K-means Algorithm.

1: Select $K$ points as the initial centroids.

2: repeat

3:Form $K$ clusters by assigning all points to the closest centroid.

4: Recompute the centroid of each cluster.

5: until The centroids don’t change

预处理和后处理

⚫ Pre-processing

➢ Normalize the data

➢ Eliminate outliers

⚫ Post-processing

➢ Eliminate small clusters that may represent outliers

➢ Split ‘loose’ clusters, i.e., clusters with relatively high SSE

➢ Merge clusters that are ‘close’ and that have relatively low SSE

K-means的其他局限性

K-means has problems when clusters are of differing

➢ Sizes

➢ Densities

➢ Non-globular shapes

⚫ K-means has problems when the data contains outliers.

K-means的局限: 不同大小

Original Points

K-means (3 Clusters)

K-means的局限: 不同密度

Original Points

K-means (3 Clusters)

K-means的局限: 非球形

Original Points

K-means (2 Clusters)

克服K-means的局限性

Original Points

K-means Clusters

One solution is to use many clusters.

Find parts of clusters, but need to put together.

克服K-means的局限性

Original Points

K-means Clusters

克服K-means的局限性

Original Points

K-means Clusters

K-means++ [Arthur et al. ’07]

Spreads out the centers

⚫ Choose first center, x1, uniformly at random from the data set

Repeat for $2 \leq i \leq k$ :

➢ Choose $x^{'}$ to be equal to a data point sampled from the distribution:

\frac{D ( x ^{'} ) ^{2}}{\sum _{x \in X} D ( x ) ^{2}}

$▹ D (x) .$ : the shortest distance from a data point x to the closest centerwe have already chosen

K-means++ [Arthur et al. ’07]

\frac{D ( x ^{'} ) ^{2}}{\sum _{x \in X} D ( x ) ^{2}}

K-means++

K-means++ [Arthur et al. ’07]

What’s wrong with K-means++?

➢ Needs K passes over the data

➢ In large data applications, not only the data is massive, but also K istypically large (e.g., easily 1000).

➢ Does not scale!

可伸缩K-means: K-means||

Intuition of

➢ K-means++ samples one point per iteration and updates itsdistribution

➢ What if we oversample by sampling each point independently with alarger probability?

➢ Intuitively equivalent to updating the distribution much less frequently

➢ Coarser sampling

➢ Turns out to be sufficient: K-means||

可伸缩K-means: K-means||

主要思路在于改变每次遍历时的取样策略，并非按照 k-means++ 那样每次遍历只取样一个样本，而是每次遍历取样 O(k) 个样本，重复该取样过程大约 O(logn) 次，共得到 O(klogn) 个样本点组成的集合

⚫ 然后再聚类这 O(klogn) 个点成 k 个点，最后将这 k 个点作为初始聚类中心

实际实验证明 O(logn) 次重复取样是不需要的，一般5次重复取样就可以得到一个较好的聚类初始中心

可伸缩K-means: K-means||

⚫ K-means|| Initialization

K=4,

Oversampling factor $= 3$

可伸缩K-means: K-means||

⚫ K-means|| Initialization

K = 4,

O v e r s a m p l i n g f a c t o r = 3

Cluster the intermediate centers

几类kmeans算法的理论保障

K-means, K-means++, and K-means||

Number of iterations (R)

R=k: Simulating K-means++ (=1)→Strong gaantee

Small R: K-means| → Can it possibly give any guarantees?

$R = 0$ : Lloyd → No guarantees

可伸缩K-means

⚫ Theorem

➢ Optimal clustering： the sum of the squared distances between eachpoint and its closest center is minimal

• Solving this problem exactly is NP-hard

➢ Let ψ= cost of initial clustering, ψ’=cost of clustering after convergence,OPT $≗$ cost of the optimal clustering

➢ Using K-means++ for clustering the intermediate centers, the overallapproximation factor $=$ O(log k)

• It means ψ’ = O(log k)*OPT

也就是说最终确定的质心能够使cost达到最优聚类的cost的O（log k）倍

➢ K-means|| produces a constant-factor approximation to OPT, using onlyO(log (ψ/OPT)) iterations

可伸缩K-means

Experimental Results: Quality

Let $v = {c_{1}, ..., E_{k}}$ be a set of points and let $Y = λ$ We define the cost of $Y$ with respect to $C$ as

ϕ_{Y} (C) = y \in Y \sum d^{2} (y, C) = y \in Y \sum i = 1, \dots, k min ∥ y - c_{i} ∥^{2} .

	Clustering Cost Right After Initialization	Clustering Cost After Convergence
Random	NA	22,000
K-means++	430	65
K-means\|\|	16	14

GAUSSMIXTURE: 10,000 points in 15 dimensions $¥ 50$

Costs scaled down by $1 0^{4}$

可伸缩K-means

Experimental Results: Convergence

➢ K-means|| reduces number of iterations even more than K-means++

	Iterations After Initialization
Random	167
K-means++	42
K-means\|\|	28

SPAM: 4,601 points in 58 dimensions

$K = 50$

K-means 练习

假设有如下八个点：(1, 2)，(2, 4)，(1, 9)，(6, 5)，(4, 2)，(7, 2)，(8, 2)，(4, 3)，使用k-Means算法对其进行聚类。设K=2，初始聚类两个中心点坐标分别为(1, 2)，(8, 2)，算法使用曼哈顿距离（L1 距离）作为距离度量。请计算一次迭代后的聚类中心坐标，并写出上述八个点分别属于哪个簇（cluster）

对于点(1,2):

d_{11} = ∣1 - 1∣ + ∣2 - 2∣ = 0, d_{12} = ∣1 - 8∣ + ∣2 - 2∣ = 7 。

因此，(1,2)应当被聚为第1类;

对于点(2,4):

d_{21} = ∣2 - 1∣ + ∣4 - 2∣ = 3, d_{22} = ∣2 - 8∣ + ∣4 - 2∣ = 8 。

因此，(2,4)应当被聚为第1类;

对于点(1,9):

d_{31} = ∣1 - 1∣ + ∣9 - 2∣ = 7, d_{32} = ∣1 - 8∣ + ∣9 - 2∣ = 14 。

因此，(1,9)应当被聚类为第1类;

对于点(6,5):

d_{41} = ∣6 - 1∣ + ∣5 - 2∣ = 8, d_{42} = ∣6 - 8∣ + ∣5 - 2∣ = 5 。

因此，(6,5)应当被聚类为第 2类;

K-means 练习

对于点(4,2):

d_{51} = ∣4 - 1∣ + ∣2 - 2∣ = 3, d_{52} = ∣4 - 8∣ + ∣2 - 2∣ = 4 。

因此，点(4,2)应当被聚类为第1类;

对于点(7,2):

d_{61} = ∣7 - 1∣ + ∣2 - 2∣ = 6, d_{62} = ∣7 - 8∣ + ∣2 - 2∣ = 1 。

因此，点(7,2)应当被聚类为第 2类；

对于点(8,2):

d_{71} = ∣8 - 1∣ + ∣2 - 2∣ = 7, d_{72} = ∣8 - 8∣ + ∣2 - 2∣ = 0 。

因此，点(8,2)应当被聚类为第2类;

对于点(4,3):

d_{81} = ∣4 - 1∣ + ∣3 - 2∣ = 4, d_{82} = ∣4 - 8∣ + ∣3 - 2∣ = 5 。

因此，点(4,3)应当被聚类为第1类。

综上，类1的点有(1,2)(2,4)(1,9) (4,2)(4,3)，则对应的聚类中心为：(2.4,4)。

类2的点有(6,5)(7,2) (8,2)，则对应的聚类中心为：(7,3)。

DBSCAN

⚫ DBSCAN is a density-based algorithm.

Density $=$ number of points within a specified radius (Eps)

A point is a core point if it has more than a specified number ofpoints (MinPts) within Eps（半径范围内有一定数量的点）

• These are points that are at the interior of a cluster

A border point has fewer than MinPts within Eps, but is in theneighborhood of a core point （半径范围内未达到一定数量的点，但是在core point 范围内）

A noise point is any point that is not a core point or a border point.（啥都不是）

DBSCAN

Density = 7 points

Figure 8.20. Center-baseddensity.

Density $\underline{\underline{δ π}}$ number of points within a specified radius(Eps)

DBSCAN：核心点、边界点、噪声点

MinPts = 4

Figure 8.21. Core,border,and noise points.

A point is a core point if it has more than a specified number of points (MinPts) within Eps. These are points that are at the interior of a cluster

· A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point

· A noise point is any point that is not a core point or aborder point.

核心点

边界点

噪声点

DBSCAN：核心点、边界点、噪声点

DBSCAN 算法

⚫ Center-based approach

Algorithm 8.4 DBSCAN algorithm.

1: Label all points as core,border, or noise points.

2: Eliminate noise points.

3: Put an edge between all core points that are within Eps of each other.

4: Make each group of connected core points into a separate cluster.

5: Assign each border point to one of the clusters of its associated core points.

DBSCAN：核心点、边界点、噪声点

Original Points

Point types: core,border and noise

E p s = 10, M in P t s = 4

DBSCAN擅长的情况

Original Points

Clusters

• Resistant to Noise

• Can handle clusters of different shapes and sizes

DBSCAN

DBSCAN不擅长的情况

Original Points

• Varying densities

• High-dimensional data

维度灾难：距离失效

(MinPts=4, Eps=9.75).

(MinPts=4, Eps=9.92)

DBSCAN: 确定EPS和MinPts

Idea is that for points in a cluster, their kth nearest neighbors are atroughly the same distance

Noise points have the kth nearest neighbor at farther distance

So, plot sorted distance of every point to its $k^{th}$ nearest neighbor

时间、空间复杂度

Time and Space Complexity

The basic time complexity of the DBSCAN algorithm is $O (m \times$ time to findpoints in the Eps-neighborhood),where $m$ is the number of points. In the worst case, this complexity is $O (m^{2})$ . However, in low-dimensional spaces, there are data structures, such as kd-trees, that allow effcient retrieval of all points within a given distance of a specified point,and the time complexity can be as low as $O (m lo g m)$ . The space requirement of DBSCAN, even forhigh-dimensional data, is $O (m)$ because it is only necessary to keep a small amount of data for each point, i.e., the cluster label and the identification of each point as a core, border, or noise point .

2014 KDD TEST OF TIME AWARD

⚫ A Density-Based Algorithm for Discovering Clusters in LargeSpatial Databases with Noise [KDD 1996]

2014 KDD TEST OF TIME AWARD

⚫ A Density-Based Algorithm for Discovering Clusters in LargeSpatial Databases with Noise [KDD 1996]

➢ Proposed the now well-known clustering algorithm DBSCAN

➢ 10000+ citations

[PDF] A density-based algorithm for discovering clusters in large spatial databaseswith noise.

M Ester. HP Krieqel, J Sander, X Xu - Kdd, 1996 - aaai.orgAbstract Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases rises the followingrequirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape and good efficiency on large databases. The well-known clustering algorithms offer no solution to

被引用次数：10010相关文章所有75个版本引用保存更多

2014 KDD TEST OF TIME AWARD

⚫ A Density-Based Algorithm for Discovering Clusters in LargeSpatial Databases with Noise [KDD 1996]

➢ Claimed that the average run time complexity of DBSCAN is（ O(n*logn)

➢ Follow papers mis-claimed that the complexity of DBSCAN isO(n*logn)

2015 SIGMOD Best Paper Award

⚫ DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation.

SIGMOD Best Paper Award
Recipients of the award are the following:
2015	DESCAN Revisited: Mis-Claim, Un-Fixability, and Approximation. Yufei Tao, Junhao Gan
2014	Materialization Optimizations for Feature Selection Workloads. Ce Zhang, Arun Kumar, Christopher Ré
2013	Massive Graph Triangulation. Xiaocheng Hu, Yufei Tao, Chin-Wan Chung
2012	High-Pe 2013 SIGMOD Best Paper Award: Processing over XML Streams. Barzan Mozafari, Kai Zeng, Carlo Zaniolo RSS SIGMOD News
2011	Entangled Queries: Enabling Declarative Data-Driven Coordination. Nitin Gupta, Lucja Kot, Sudip Roy, Gabriel Bender, Johannes Gehrke, Christoph Koch SIGMOD Record
Website	FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs. Changkyu Kim, Jatin Chhugani, Nadathur Satish, Eric Sedlar, Anthony Nguyen, Tim Kaldewey, Victor Lee, Scott Brandt, and Pradeep Dubey [award citation] SIGMOD RECORD Weih
Facebook

2015 SIGMOD Best Paper Award

⚫ DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation.

➢ For d ≥ 3, DBSCAN can’t have O(n*logn) time complexity.

D Prove that the DBSCAN problem requires $Ω (n^{4/3})$ time to solve in $d \geq 3$ .

➢ Introduce a new concept called ρ-approximate DBSCAN as analternative to DBSCAN on large datasets of $d \geq 3$ .

2015 SIGMOD Best Paper Award

Yufei Tao

Profes sor

Room 633, Building 78

School of Information Technology and Electrical Engineering

University of Queensland

Brisbane, Queensland 4072

Aus tralia

Tel:+61-7-33658319

Enail:taoyf@i tee.uq. edu au

Publications and Google Scholar

General

Short Bio

Yufei Tao joined as a Frofessor the School of Information Technology and Electrical Engineering,theUniversityof Queensland (UQinJan 2016.He obtained his PhDdegree from Hong Kong Universityof Scienceand Technology (HKusT) in 2002,proudlyunder the supervision of Prof.Dimitris Papadias.He wasa VisitingScientist at the CarmegieMellon Universityduring 2oo2-20o3,and then an Assistant Professor at the CityUniversityof Hong Kong dring 20o3-2006,after which he worked till Jan 2016at the ChineseUniversity ofHong Kong (CUHK)，where he was promoted to Professor in 2009.From 2011 to 2013,he was simultaneouslyaVisiting Professorunderthe World Class Universityprogranof the Koreangoverment,inthe Korea AdvancedInstitute of Science and Technology KAIST)，Korea.Currently，he also holds an honorary position of AdjunctProfessor in the Department of Computer Science and Engineering,CUHK.

He servedasanassociateeditor of ACM Transactians an Database Systems (ToDS from 2008 to2015,and ofIETransactions onKnowledge and DataErgineering(TKDE from 2012 to 2014.He servedasaPCco-chairofIntemational Conference on Data Engineering(ICDE2014,and of International Symposium on Spatial andTemporal Databases (SSTD) 2011.He gavea keynote speech at Intermational Conference on Database Theory(ICDD 2016.

Ma jor Awards

SIGMOD Best Paper Award 2015

SIGMOD Best Paper Award 2013

Hong Kong Young Scientist Award 2002

2015 SIGMOD Best Paper Award

怀疑的精神

善于思考，善于发现问题

科学研究的发展是在曲折中不断前进的！

Reference

Ester, Martin, et al. “A density-based algorithm for discovering clusters in large spatialdatabases with noise.” Kdd. Vol. 96. No. 34. 1996.

Gan, Junhao, and Yufei Tao. “DBSCAN Revisited: Mis-Claim, Un-Fixability, andApproximation.” Proceedings of the 2015 ACM SIGMOD International Conference onManagement of Data. ACM, 2015.

如何评价 SIGMOD 2015 最佳论文《DBSCAN Revisited》？http://www.zhihu.com/question/31845624

K-means和DBSCAN的比较

⚫ Both are partitional clustering that assign each object to a single cluster

➢ K-means: all object

➢ DBSCAN: discard noise

⚫ Type

➢ K-means: prototype-based notion

➢ DBSCAN: density-based notion

Both perform poorly with widely differing densities

➢ K-means: has difficult with non-globular cluster and clusters withdifferent size

➢ DBSCAN: can handle clusters of different size and shapes, notaffected by nosie

K-means和DBSCAN的比较

⚫ Definition requirement

➢ K-means: need well-defined centroid

➢ DBSCAN: need definition of density

⚫ Sparse, high-dimensional data

➢ K-means: work well, such as document data

➢ DBSCAN: performs poorly for Euclidean definition of density

⚫ Distribution of data

➢ K-means: spherical Gaussian distributions

➢ DBSCAN: no assumption

K-means和DBSCAN的比较

Both using all attributes

⚫ Number of clusters

➢ K-means: user define

➢ DBSCAN: automatically,

Two parameters: Eps and Minpts

⚫ Determination

➢ K-means: no

➢ DBSCAN: yes

簇有效性

⚫ For cluster analysis, the analogous question is how to evaluate the“goodness” of the resulting clusters?

⚫ But “clusters are in the eye of the beholder”! 主观性太强！

⚫ Then why do we want to evaluate them?

➢ To compare clustering algorithms

➢ To compare two sets of clusters

➢ To compare two clusters

簇评估(Cluster Validation)的方面

Determining the clustering tendency聚类趋势 of a set of data, i.e.,distinguishing whether non-random structure actually exists in the data先看看是不是真的有簇结构存在
Comparing the results of a cluster analysis to externally knowninformation, e.g., to externally given class labels.
Evaluating how well the results of a cluster analysis fit the data withoutreference to external information.

Use only the data

Comparing the results of two different sets of cluster analyses todetermine which is better.
Determining the ‘correct’ number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluatethe entire clustering or just individual clusters.

簇有效性指标

⚫ Numerical measures that are applied to judge various aspects of clustervalidity, are classified into the following three types.

External Index: Used to measure the extent to which cluster labels matchexternally supplied class labels.

• Entropy 熵

Internal Index: Used to measure the goodness of a clustering structurewithout respect to external information.

• Sum of Squared Error (SSE)

Relative Index: Used to compare two different clusterings or clusters. 实际上不是个单独的簇评估类型，而是度量的具体使用

• Often an external or internal index is used for this function, e.g., SSE or entropy

⚫ Sometimes these are referred to as criteria instead of indices

➢ However, sometimes criterion is the general strategy and index is thenumerical measure that implements the criterion.

簇有效性指标：相关性

⚫ Two matrices

➢ Proximity Matrix (Similarity/Dissimilarity Matrix)

➢ “Incidence” Matrix

• One row and one column for each data point

• An entry is 1 if the associated pair of points belong to the same cluster

• An entry is 0 if the associated pair of points belongs to different clusters

⚫ Compute the correlation between the two matrices

➢ Since the matrices are symmetric, only the correlation betweenn(n-1) / 2 entries needs to be calculated.

⚫ High correlation indicates that points that belong to the same cluster are closeto each other.

⚫ Not a good measure for some density or contiguity based clusters.

簇有效性指标：相关性

⚫ Correlation of incidence and proximity matrices for the K-means clusterings ofthe following two data sets.

Corr = 0.9235

Corr = 0.5810

簇评估：使用相似矩阵

⚫ Order the similarity matrix with respect to cluster labels and inspect visually.

簇评估：使用相似矩阵

⚫ Clusters in random data are not so crisp

DBSCAN

簇评估：使用相似矩阵

⚫ Clusters in random data are not so crisp

K-means

簇评估：使用相似矩阵

⚫ Clusters in random data are not so crisp

Complete linkage hierarchical clustering

簇评估：使用相似矩阵

DBSCAN

内部指标: SSE

⚫ Clusters in more complicated figures aren’t well separated

Internal Index: Used to measure the goodness of a clustering structurewithout respect to external informationSSE

SSE is good for comparing two clusterings or two clusters (averageSSE).

⚫ Can also be used to estimate the number of clusters

内部指标: SSE

⚫ SSE curve for a more complicated data set

SSE of clusters found using K-means

内部指标: 凝聚度和分离度

Cluster Cohesion 凝聚度: Measures how closely related are objects in acluster

➢ Example: SSE

Cluster Separation 分离度: Measure how distinct or well-separated a clusteris from other clusters

Example: Squared Error

➢ Cohesion is measured by the within cluster sum of squares (SSE)

S S E = i \sum x \in C_{i} \sum (x - m_{i})^{2}

➢ Separation is measured by the between cluster sum of squares

S S B = i \sum ∣ C_{i} ∣ (m - m_{i})^{2}

• Where $∣ C i ∣$ is the size of cluster ??

内部指标: 凝聚度和分离度

Example:

SSE+SSB $^{1 =}$ TSS constant

T S S = x \sum (x - m)^{2}

$K = 1$ cluster: $S S E = (1 - 3)^{2} + (2 - 3)^{2} + (4 - 3)^{2} + (5 - 3)^{2} = 10$

S S B = 4 \times (3 - 3)^{2} = 0

T o t a l = 10 + 0 = 10

$K = 2$ clusters: $S S E = (1 - 1.5)^{2} + (2 - 1.5)^{2} + (4 - 4.5)^{2} + (5 - 4.5)^{2} = 11$

S S B = 2 \times (3 - 1.5)^{2} + 2 \times (4.5 - 3)^{2} = 9

T o t a l = 1 + 9 = 10

内部指标: 凝聚度和分离度

⚫ A proximity graph based approach can also be used for cohesion andseparation.

➢ Cluster cohesion is the sum of the weight of all links within a cluster.

Cluster separation is the sum of the weights between nodes in thecluster and nodes outside the cluster.

cohesion

separation

簇有效性：“Black Arts”

“The validation of clustering structures is the most difficult and frustrating part ofcluster analysis. Without a strong effort in this direction, cluster analysis willremain a black art accessible only to those true believers who haveexperience and great courage.” 聚类结构的验证是聚类分析中最困难和最令人沮丧的部分。如果不在这方面付出巨大努力，聚类分析将仍然是一门只有那些有经验和极大勇气的真正信徒才能掌握的黑暗魔法。

Algorithms for Clustering Data, Jain and Dubes

讨论

The entire time I knew Thanos, he only ever had one goal.To bring balance to the universe by wiping out half of all life.

Q：灭霸一个响指就能灭掉一半生命，那怎么保证物种均衡平衡嘞~？

A：聚类分析帮助你！

讨论

Method ：K-means Clustering

宇宙包含 “九大国度”，为防止某个国度灭绝，在每个国度选取一个聚类中心，通过K-means得到聚类簇，再在每个聚类中随机选取一半生物进行消灭，宇宙生物多样性得到进一步保证。

讨论

Method ：K-means Clustering

谢谢大家！

相关课程资源及参考文献请浏览

超算习堂：https://easyhpc.net/course/143

RichardYi's Notebook

Explorer

无监督 （聚类）算法

无监督 （聚类）算法

什么是聚类分析？

聚类分析的应用

⚫ Understanding

⚫ Summarization

聚类：社交网络

聚类：图像分割

聚类：视频检测

什么不是聚类分析？

簇的概念：二义性

簇的概念：二义性

聚类(Clustering)类型

分割式聚类

阶层式聚类

不同聚类集合的其他差异

簇(Cluster)的类型

Well-Separated Clusters

Center-Based Clusters

⚫ Center-based

Contiguity(邻近)-Based Clusters

Density-Based Clusters

⚫ Density-based

Conceptual(概念) Clusters

Objective Function

Objective Function

距离度量指标

Objective Function …

聚类算法

聚类算法

K-means聚类

Algorithm 1 Basic K-means Algorithm.

标准K-means流程

标准K-means流程

K-mean聚类细节

两个不同的K-mean聚类

选择初始质心的重要性

选择初始质心的重要性

选择初始质心的重要性

K-means簇的评估

K的选择

选择初始点的问题

例子：10个簇

例子：10个簇

例子：10个簇

例子：10个簇

如何选择初始质心？

Bisecting(二分) K-means

Bisecting K-means algorithm

例子：二分K-means

处理空簇

处理空簇

Algorithm 1 Basic K-means Algorithm.

预处理和后处理

K-means的其他局限性

K-means的局限: 不同大小

K-means的局限: 不同密度

K-means的局限: 非球形

克服K-means的局限性

克服K-means的局限性

克服K-means的局限性

K-means++ [Arthur et al. ’07]

K-means++ [Arthur et al. ’07]

K-means++ [Arthur et al. ’07]

可伸缩K-means: K-means||

Intuition of

可伸缩K-means: K-means||

可伸缩K-means: K-means||

可伸缩K-means: K-means||

几类kmeans算法的理论保障

K-means, K-means++, and K-means||

可伸缩K-means

⚫ Theorem

可伸缩K-means

Experimental Results: Quality

可伸缩K-means

K-means 练习

K-means 练习

无监督（聚类）算法

无监督（聚类）算法

时间、空间复杂度