This article addresses the following questions: what makes high-dimensional data hard to cluster, what subspace clustering is, and how the CLIQUE algorithm works, with a step-by-step Python implementation.

## The problem with high-dimensional data

High-dimensional data consists of inputs with tens to thousands of features (or dimensions). This is a typical setting in, for example, bioinformatics (all kinds of sequencing data) or NLP, where vocabularies can be very large. High-dimensional data is challenging because:

- pairwise distances become less and less discriminative as the number of dimensions grows (the "curse of dimensionality");
- clusters often exist only in a subset of the dimensions, with the remaining features acting as noise;
- exhaustively exploring every possible subset of features is computationally expensive.
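One well-known symptom is distance concentration: in high dimensions, the nearest and farthest neighbours of a point end up at almost the same distance. The following sketch (illustrative only, not from the original article) measures the ratio between the largest and smallest pairwise distance on random data as the dimensionality grows:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

def relative_contrast(n_points, n_dims):
    # Ratio between the farthest and nearest pairwise distances:
    # large means distances are discriminative, close to 1 means they are not
    X = rng.random((n_points, n_dims))
    d = pdist(X)
    return d.max() / d.min()

low = relative_contrast(100, 2)      # 2 dimensions
high = relative_contrast(100, 1000)  # 1000 dimensions
print(low, high)  # the contrast collapses as dimensionality grows
```

With only 2 dimensions the farthest pair is orders of magnitude more distant than the nearest one; with 1000 dimensions all pairs are at nearly the same distance, which is exactly what defeats traditional full-space clustering.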
## What is subspace clustering?

Subspace clustering is a technique for finding clusters in different subspaces (selections of one or more dimensions). The underlying assumption is that we can find valid clusters that are defined by only a subset of the dimensions (agreement across all N features is not required). Subspace clustering extends traditional N-dimensional cluster analysis by grouping both features and observations at the same time, creating row and column clusters. The resulting clusters may overlap in the space of features and observations. An example is shown in the figure below. Note that points from two different clusters can be very close together, which can confuse many traditional clustering algorithms that analyze the entire feature space. We can also see that subspace clustering succeeds in finding a subspace (dimensions a and c) in which the expected clusters are easy to identify.

## Types of subspace clustering

Based on the search strategy, we can distinguish two types of subspace clustering, as shown in the figure below: bottom-up approaches first find clusters in low-dimensional (1D) spaces and iteratively merge them to process higher-dimensional spaces (up to ND); top-down approaches find clusters in the full set of dimensions and then evaluate the subspace of each cluster. The figure below, taken from the paper (https://www./exploration_files/parsons.pdf), gives an overview of the most common subspace clustering algorithms.

## The CLIQUE algorithm

In a nutshell, the algorithm works as follows: for each dimension (feature), we split the space into nBins bins (an input parameter) and compute a histogram (the count of points per bin). We only keep the dense units, i.e. the bins whose count is above a threshold given as a second input parameter. A dense unit is characterized by its dimension, its bin index, the bin boundaries, and the points it contains, as implemented below.
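The binning-and-thresholding step just described can be sketched in a few lines of NumPy. This is an illustrative toy version, not the article's implementation; the data setup mirrors the example used later in the article, and `nBins` / `threshold` stand for the two input parameters just described:

```python
import numpy as np
from sklearn.datasets import make_blobs

# Toy data: 100 points, 2 features (assumed setup, mirroring the article's example)
data, _ = make_blobs(n_samples=100, centers=4, n_features=2, random_state=42)

nBins, threshold = 8, 2
dense_units = []  # (dimension, bin index, point count)
for dim in range(data.shape[1]):
    # Histogram of the current feature split into nBins bins
    counts, edges = np.histogram(data[:, dim], bins=nBins)
    for b, c in enumerate(counts):
        if c > threshold:  # keep only the dense bins
            dense_units.append((dim, b, c))

print(dense_units)
```

With 100 points spread over 8 bins per dimension, at least one bin per dimension is guaranteed to exceed a threshold of 2, so each dimension contributes at least one dense unit.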
### Import the Python libraries

```python
import os
import sys
import numpy as np
import scipy.sparse.csgraph
from sklearn import preprocessing
from sklearn import metrics
import matplotlib.pyplot as plt
import pandas as pd
from functools import reduce
import seaborn as sns
from collections import Counter
import itertools
```

### Generate 4 clusters in 2D

```python
from sklearn.datasets import make_blobs  # sklearn.datasets.samples_generator is deprecated
from sklearn.preprocessing import StandardScaler

n_components = 4
data, truth = make_blobs(n_samples=100, centers=n_components,
                         random_state=42, n_features=2)
data = preprocessing.MinMaxScaler().fit_transform(data)
plt.scatter(data[:, 0], data[:, 1], s=50, c=truth)
plt.title(f'Example of a mixture of {n_components} distributions')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2');
```

A 1D dense unit stores its dimension, its bin index, the bin boundaries and the points it contains:

```python
class DenseUnit1D:
    '''
    A dense unit in a 1D subspace: a bin containing more than
    thresholdPoints observations
    '''
    def __init__(self, dimension, bin, minBin, maxBin, points):
        self.dimension = dimension  # dimension index
        self.bin = bin              # bin number
        self.minBin = minBin        # inferior bin limit
        self.maxBin = maxBin        # superior bin limit
        self.points = points        # observation indexes in input data

    def distance(self, du):
        # Not in the same dimension, can't be neighbours
        if self.dimension != du.dimension:
            return -1
        return abs(self.bin - du.bin)

    def __eq__(self, other):
        '''Overrides the default implementation'''
        if isinstance(other, DenseUnit1D):  # fixed: was the undefined name DenseUnit
            return (self.dimension == other.dimension and
                    Counter(self.points) == Counter(other.points))
        return False

    def __hash__(self):
        return hash(str(self))

    def __str__(self):
        return (f'Dimension {self.dimension}, bin {self.bin}, points {len(self.points)}, '
                f'[{round(self.minBin, 2)}, {round(self.maxBin, 2)}]')


def neighbour(denseUnits1, denseUnits2):
    '''
    Determines whether 2 dense units are neighbouring
    '''
    # We allow only 1 bin of deviation, in one subspace only
    distance = 0
    for subspace in range(len(denseUnits1)):
        subspaceDistance = denseUnits1[subspace].distance(denseUnits2[subspace])
        if subspaceDistance == -1:
            return False
        distance += subspaceDistance
        if distance > 1:
            return False
    return True
```
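The adjacency rule behind `neighbour` (same dimension, bin indexes differing by at most 1) can be illustrated with plain tuples. This standalone toy is not part of the article's code; it just restates the rule on minimal `(dimension, bin)` pairs:

```python
def bins_adjacent(u1, u2):
    # Each unit is a (dimension, bin) pair; two units are neighbours when
    # they live in the same dimension and their bins differ by at most 1
    dim1, bin1 = u1
    dim2, bin2 = u2
    if dim1 != dim2:
        return False
    return abs(bin1 - bin2) <= 1

print(bins_adjacent((3, 4), (3, 5)))  # True: same dimension, adjacent bins
print(bins_adjacent((3, 4), (3, 6)))  # False: bins too far apart
print(bins_adjacent((0, 4), (1, 4)))  # False: different dimensions
```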
### Set the parameters

```python
thresholdPoints = 2  # minimum count for a bin to be considered dense
nbBins = 8           # number of bins per dimension
```

### Create the 1-dimensional dense units

```python
def createDenseUnitsAndGrid(data, thresholdPoints=thresholdPoints, nbBins=nbBins):
    '''
    Returns an array of lists, each list containing 1D dense units
    (in a 1D subspace, each list contains only one element)
    '''
    denseUnits1D = []
    grid = []  # this is used for rendering purposes
    for curDim in range(data.shape[1]):
        minDim = min(data[:, curDim])
        maxDim = max(data[:, curDim])
        binSize = (maxDim - minDim) / nbBins
        points = data[:, curDim]
        g = []  # grid lines for current dimension
        g.append(minDim)
        for i in range(nbBins):
            endBin = minDim + binSize
            g.append(endBin)
            # Retrieve bin points per dimension
            if i == nbBins - 1:  # last bin, make sure all points are included
                binPoints = np.where((points >= minDim) & (points <= maxDim))[0]
                endBin = maxDim
            else:
                binPoints = np.where((points >= minDim) & (points < endBin))[0]
            # Store only dense bins
            if len(binPoints) > thresholdPoints:
                denseUnits1D.append([DenseUnit1D(curDim, i, minDim, endBin, binPoints)])
            minDim = endBin
        grid.append(g)
    return denseUnits1D, grid


denseUnits1D, grid = createDenseUnitsAndGrid(data)
```

### Plot the original dataset on the grid

```python
plt.scatter(data[:, 0], data[:, 1])
for g in grid[0]:
    plt.axvline(x=g, c='red', linestyle='--')
for g in grid[1]:
    plt.axhline(y=g, c='red', linestyle='--')
plt.xlabel('Feature 0')
plt.ylabel('Feature 1')
```

The intuition behind the CLIQUE algorithm is that a cluster that exists in a k-dimensional space can also be found in k-1 dimensions. We start in 1D: in each dimension we look for dense bins, and if 2 or more dense bins are adjacent, we merge them into a larger bin. This operation is easily implemented by projecting all existing dense bins onto a graph, where an edge is drawn between 2 dense units if they belong to the same dimension and the difference between their bin indexes is at most 1 (for example, the dense unit for feature 3, bin 4 is adjacent to the dense unit for feature 3, bin 5). The dense units to merge can then be identified by computing the connected components of this graph.

```python
def denseBinsToClusters(candidates, plot=False, debug=False):
    '''
    Takes as input a collection of subspace candidates (a subspace
    candidate is a list of 1D dense units) and merges neighbouring
    units by projecting them onto a graph, where connected components
    are easy to compute
    '''
    graph = np.identity(len(candidates))
    for i in range(len(candidates)):
        for j in range(len(candidates)):
            graph[i, j] = int(neighbour(candidates[i], candidates[j]))
    # Find connected components in order to merge neighbouring bins
    nbConnectedComponents, components = scipy.sparse.csgraph.connected_components(
        graph, directed=False)
    if debug:
        print(graph)
        print(nbConnectedComponents, components)

    candidates = np.array(candidates)
    clusterAssignment = -1 * np.ones(data.shape[0])
    # For every cluster
    for i in range(nbConnectedComponents):
        # Get the dense units of the cluster
        cluster_dense_units = candidates[np.where(components == i)[0]]
        if debug:
            for v in cluster_dense_units:
                for z in v:
                    print(z)
        # Collect, per dimension, the points covered by the cluster's dense units
        clusterDimensions = {}
        for j in range(len(cluster_dense_units)):
            for k in range(len(cluster_dense_units[j])):
                if cluster_dense_units[j][k].dimension not in clusterDimensions:
                    clusterDimensions[cluster_dense_units[j][k].dimension] = []
                clusterDimensions[cluster_dense_units[j][k].dimension].extend(
                    cluster_dense_units[j][k].points)
        # A point belongs to the cluster if it is covered in every dimension
        points = reduce(np.intersect1d, list(clusterDimensions.values()))
        clusterAssignment[points] = i
        if plot:
            pred = -1 * np.ones(data.shape[0])
            pred[points] = i
            plt.figure()
            plt.title(f'In yellow, clusters in {list(clusterDimensions.keys())} dimensions')
            plt.scatter(data[:, 0], data[:, 1], c=pred)
            for g in grid[0]:
                plt.axvline(x=g, c='red', linestyle='--')
            for g in grid[1]:
                plt.axhline(y=g, c='red', linestyle='--')
            plt.show()
        if debug:
            print(clusterDimensions.keys(), points)
    return clusterAssignment


denseBinsToClusters(denseUnits1D, plot=True, debug=False);
```

Next, we want to compute all valid clusters in every subspace, from 2 dimensions up to the number of input dimensions. This operation boils down to computing combinations of dense units across k dimensions, keeping only those combinations of dense, contiguous bins whose overlap is larger than the initial minimum density threshold. Once the dense units in k-1 dimensions have been computed, we can extend the search to k dimensions by computing all combinations of the previous k-1 dimensional candidates.

```python
def getSubspaceCandidates(previousUnits, subspaceDimension=2):
    candidates = []
    for ix1, ix2 in itertools.combinations(range(len(previousUnits)), 2):
        dims = []
        candidate = []
        for i in range(len(previousUnits[ix1])):
            dims.append(previousUnits[ix1][i].dimension)
            candidate.append(previousUnits[ix1][i])
            points1 = previousUnits[ix1][i].points
        for i in range(len(previousUnits[ix2])):
            dims.append(previousUnits[ix2][i].dimension)
            candidate.append(previousUnits[ix2][i])
            points2 = previousUnits[ix2][i].points
        points = np.intersect1d(points1, points2)  # points in common
        # Keep the candidate if it spans the right number of distinct
        # dimensions and the overlap is still dense
        if (np.unique(dims).shape[0] == subspaceDimension
                and points.shape[0] > thresholdPoints):
            candidates.append(candidate)
    return candidates


for subspaceDim in range(2, data.shape[1] + 1):
    subspaceCandidates = getSubspaceCandidates(denseUnits1D, subspaceDimension=subspaceDim)
    pred = denseBinsToClusters(subspaceCandidates, plot=True, debug=False);
```

In 2D, we are able to retrieve the clusters shown in the figure below. The purple points are excluded from the clusters because they lie in sparse grid bins.

```python
plt.scatter(data[:, 0], data[:, 1], c=pred)
for g in grid[0]:
    plt.axvline(x=g, c='red', linestyle='--')
for g in grid[1]:
    plt.axhline(y=g, c='red', linestyle='--')
plt.xlabel('Feature 0')
plt.ylabel('Feature 1')
```

### Validating the results

```python
from sklearn.metrics.cluster import adjusted_rand_score
adjusted_rand_score(truth, pred)
```

```
0.9090183974614624
```

CLIQUE clustering is criticized for its high sensitivity to its input parameters (the number of bins and the minimum density), which can lead to very different results. It is, however, a foundational algorithm in the bottom-up subspace clustering family, and there are multiple ways to optimize it.
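This parameter sensitivity is easy to demonstrate: on the very same 1D data, the number of dense bins found changes entirely with the bin count and the density threshold. The following toy example (deterministic data, not taken from the article) shows how:

```python
import numpy as np

# Two well-separated groups of 30 evenly spaced points in [0, 1]
points = np.concatenate([np.linspace(0.0, 0.3, 30), np.linspace(0.7, 1.0, 30)])

def count_dense_bins(points, nbBins, threshold):
    # Number of bins whose count exceeds the density threshold
    counts, _ = np.histogram(points, bins=nbBins, range=(0.0, 1.0))
    return int(np.sum(counts > threshold))

print(count_dense_bins(points, nbBins=2, threshold=5))    # 2 dense bins
print(count_dense_bins(points, nbBins=10, threshold=5))   # 6 dense bins
print(count_dense_bins(points, nbBins=10, threshold=15))  # 0 dense bins
```

The same two clusters appear as 2 coarse dense bins, 6 finer ones, or vanish altogether, depending only on the two input parameters.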