當(dāng)前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

K-均值对地图上的点进行聚类（2）

發(fā)布時(shí)間：2024/9/20 编程问答 31 豆豆

生活随笔收集整理的這篇文章主要介紹了 K-均值对地图上的点进行聚类（2）小編覺得挺不錯(cuò)的,現(xiàn)在分享給大家,幫大家做個(gè)參考.

3. 對(duì)地圖上的點(diǎn)進(jìn)行聚類

示例：對(duì)于地理數(shù)據(jù)應(yīng)用二分K-均值算法

(1)收集數(shù)據(jù)：使用Yahoo! placeFinder API收集數(shù)據(jù)。
(2)準(zhǔn)備數(shù)據(jù)：只保留經(jīng)緯度信息。
(3)分析數(shù)據(jù)：使用matplotlib來構(gòu)建一個(gè)二維數(shù)據(jù)圖，其中包含簇與地圖。
(4)訓(xùn)練算法：訓(xùn)練不適用無監(jiān)督學(xué)習(xí)。
(5)測(cè)試算法：使用上篇中的bikmeans（）函教。
(6)使用算法：最后的輸出是包含簇及簇中心的地圖。

3.1 收集數(shù)據(jù)

# -*- coding: utf-8 -*- import urllib import json def geoGrab(stAddress, city): apiStem = 'http://where.yahooapis.com/geocode?' params = {}params['flags'] = 'J' # 返回JSONparams['appid'] = 'aaa0VN6k'params['location'] = '%s %s' % (stAddress, city)url_params = urllib.urlencode(params) # 將params字典轉(zhuǎn)換為可以通過URL進(jìn)行傳遞的字符串格式yahooApi = apiStem + url_params print yahooApi # 輸出URLc=urllib.urlopen(yahooApi) #讀取返回值return json.loads(c.read()) # 返回一個(gè)字典from time import sleep def massPlaceFind(fileName):fw = open('places.txt', 'w')for line in open(fileName).readlines():line = line.strip()lineArr = line.split('\t') # 是以tab分隔的文本文件retDict = geoGrab(lineArr[1], lineArr[2]) # 讀取2列和第3列if retDict['ResultSet']['Error'] == 0: # 檢查輸出字典，判斷有沒有出錯(cuò)lat = float(retDict['ResultSet']['Results'][0]['latitude']) # 讀取經(jīng)緯度lng = float(retDict['ResultSet']['Results'][0]['longitude'])print "%s\t%f\t%f" % (lineArr[0], lat, lng)fw.write('%s\t%f\t%f\n' % (line, lat, lng)) # 添加到對(duì)應(yīng)的行上else: print 'error fetching' # 有錯(cuò)誤時(shí)不需要抽取經(jīng)緯度sleep(1) # 避免頻繁調(diào)用API，過于頻繁的話請(qǐng)求會(huì)被封掉fw.close()geoGrab('1 VA Center', 'Augusta,ME')

Yahoo! placeFinder API：

為了使用該服務(wù)，需要注冊(cè)以獲得一個(gè)APIkey。具體地，需要在Yahoo！開發(fā)者網(wǎng)絡(luò)（Http://developer.yahoo.com)中進(jìn)行注冊(cè)。創(chuàng)建一個(gè)桌面應(yīng)用后會(huì)獲得一個(gè)appid。需要appid來使用geocoder。一個(gè)geocoder接受給定地址，然后返回該地址對(duì)應(yīng)的經(jīng)緯度。

JSON：

是一種用于序列化數(shù)組和字典的文件格式。JSON是javascript object
Notation的縮寫，接下來使用urllib的urlencode（）函數(shù)將創(chuàng)建的字典轉(zhuǎn)換為可以通過URL進(jìn)行傳遞的字符串格式。最后，打開URL讀取返回值。由于返回值是json格式的，所以可以使用JSON的Python模塊來將其解碼為一個(gè)字典。一旦返回了解碼后的字典，也就意味著成功地對(duì)一個(gè)地址進(jìn)行了地理編碼。

這里沒能實(shí)現(xiàn)，只是對(duì)代碼分析了下。。

3.2 對(duì)地理坐標(biāo)進(jìn)行聚類

# -*- coding: utf-8 -*- from numpy import * # K-均值聚類支持函數(shù) def loadDataSet(fileName): dataMat = [] fr = open(fileName)for line in fr.readlines():curLine = line.strip().split('\t')fltLine = map(float,curLine) dataMat.append(fltLine)return dataMat# 計(jì)算兩個(gè)向量的歐式距離 def distEclud(vecA, vecB):return sqrt(sum(power(vecA - vecB, 2))) # 為給定數(shù)據(jù)集構(gòu)建一個(gè)包含k個(gè)隨機(jī)質(zhì)心的集合,是以每列的形式生成的 def randCent(dataSet, k):n = shape(dataSet)[1]centroids = mat(zeros((k,n))) for j in range(n): minJ = min(dataSet[:,j]) # 找到每一維的最小rangeJ = float(max(dataSet[:,j]) - minJ) # 每一維的最大和最小值之差centroids[:,j] = mat(minJ + rangeJ * random.rand(k,1)) # 生成隨機(jī)值#print centroids[:,j]return centroids # 返回隨機(jī)質(zhì)心,是和數(shù)據(jù)點(diǎn)相同的結(jié)構(gòu)# k--均值聚類算法(計(jì)算質(zhì)心--分配--重新計(jì)算) def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent): # k是簇的數(shù)目m = shape(dataSet)[0] # 得到樣本的數(shù)目clusterAssment = mat(zeros((m,2))) # 創(chuàng)建矩陣來存儲(chǔ)每個(gè)點(diǎn)的簇分配結(jié)果# 第一列：記錄簇索引值，第二列：存儲(chǔ)誤差，歐式距離的平方centroids = createCent(dataSet, k) # 創(chuàng)建k個(gè)隨機(jī)質(zhì)心clusterChanged = Truewhile clusterChanged: # 迭代使用while循環(huán)來實(shí)現(xiàn)clusterChanged = False for i in range(m): # 遍歷每個(gè)數(shù)據(jù)點(diǎn)，找到距離每個(gè)點(diǎn)最近的質(zhì)心minDist = inf; minIndex = -1for j in range(k): # 尋找最近的質(zhì)心distJI = distMeas(centroids[j,:],dataSet[i,:])if distJI < minDist:minDist = distJI; minIndex = jif clusterAssment[i,0] != minIndex: # 更新停止的條件clusterChanged = TrueclusterAssment[i,:] = minIndex,minDist**2 # minDist**2就去掉了根號(hào) for cent in range(k): # 更新質(zhì)心的位置ptsInClust = dataSet[nonzero(clusterAssment[:,0].A==cent)[0]] centroids[cent,:] = mean(ptsInClust, axis=0) # 然后計(jì)算均值，axis=0:沿列方向 #print 'centroids:',centroidsreturn centroids, clusterAssment # 返回簇和每個(gè)簇的誤差值，誤差值是當(dāng)前點(diǎn)到該簇的質(zhì)心的距離# 二分k--均值聚類算法 def biKmeans(dataSet, k, distMeas=distEclud):m = shape(dataSet)[0] clusterAssment = mat(zeros((m,2))) # 存儲(chǔ)數(shù)據(jù)集中每個(gè)點(diǎn)的簇分配結(jié)果及平方誤差centroid0 = mean(dataSet, axis=0).tolist()[0] # 計(jì)算整個(gè)數(shù)據(jù)集的質(zhì)心：1*2的向量centList =[centroid0] # []的意思是使用一個(gè)列表保存所有的質(zhì)心,簇列表,[]的作用很大for j in range(m): # 遍歷所有的數(shù)據(jù)點(diǎn)，計(jì)算到初始質(zhì)心的誤差值，存儲(chǔ)在第1列clusterAssment[j,1] = distMeas(mat(centroid0), dataSet[j,:])**2while (len(centList) < k): # 不斷對(duì)簇進(jìn)行劃分，直到klowestSSE = inf # 初始化SSE為無窮大for i in range(len(centList)): # 遍歷每一個(gè)簇#print 'i:',i # 數(shù)組過濾得到所有的類別簇等于i的數(shù)據(jù)集ptsInCurrCluster = dataSet[nonzero(clusterAssment[:,0].A==i)[0],:]# 得到2個(gè)簇和每個(gè)簇的誤差，centroidMat：簇矩陣 splitClustAss：[索引值,誤差]centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas) # centroidMat是矩陣sseSplit = sum(splitClustAss[:,1]) # 求二分k劃分后所有數(shù)據(jù)點(diǎn)的誤差和 # 數(shù)組過濾得到整個(gè)數(shù)據(jù)點(diǎn)集的簇中不等于i的點(diǎn)集#print nonzero(clusterAssment[:,0].A!=i)[0]sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:,0].A!=i)[0],1])# 所有剩余數(shù)據(jù)集的誤差之和#print "sseSplit and notSplit: ",sseSplit,',',sseNotSplitif (sseSplit + sseNotSplit) < lowestSSE: # 劃分后的誤差和小于當(dāng)前的誤差，本次劃分被保存#print 'here..........'bestCentToSplit = i # i代表簇?cái)?shù)bestNewCents = centroidMat # 保存簇矩陣#print 'bestNewCents',bestNewCentsbestClustAss = splitClustAss.copy() # 拷貝所有數(shù)據(jù)點(diǎn)的簇索引和誤差lowestSSE = sseSplit + sseNotSplit # 保存當(dāng)前誤差和# centList是原劃分的簇向量，bestCentToSplit是i值#print 'len(centList) and bestCentToSplit ',len(centList),',',bestCentToSplit# 數(shù)組過濾得到的是新劃分的簇類別是1的數(shù)據(jù)集的類別簇重新劃為新的類別值為最大的類別數(shù)bestClustAss[nonzero(bestClustAss[:,0].A == 1)[0],0] = len(centList) # 數(shù)組過濾得到的是新劃分的簇類別是0的數(shù)據(jù)集的類別簇重新劃為新的類別值為ibestClustAss[nonzero(bestClustAss[:,0].A == 0)[0],0] = bestCentToSplit#print 'the bestCentToSplit is: ',bestCentToSplit # 代表的是劃分的簇個(gè)數(shù)-1#print 'the len of bestClustAss is: ', len(bestClustAss) # 數(shù)據(jù)簇的數(shù)據(jù)點(diǎn)個(gè)數(shù)# 新劃分簇矩陣的第0簇向量新增到當(dāng)前的簇列表中centList[bestCentToSplit] = bestNewCents[0,:].tolist()[0] #print 'centList[bestCentToSplit]:',centList[bestCentToSplit]# 新劃分簇矩陣的第1簇向量添加到當(dāng)前的簇列表中centList.append(bestNewCents[1,:].tolist()[0]) # centList是列表的格式#print 'centList',centList# 數(shù)組過濾得到所有數(shù)據(jù)集中簇類別是新簇的數(shù)據(jù)點(diǎn)clusterAssment[nonzero(clusterAssment[:,0].A == bestCentToSplit)[0],:]= bestClustAssreturn mat(centList), clusterAssment # 返回質(zhì)心列表和簇分配結(jié)果# 球面距離計(jì)算，這里是利用球面余弦定理 def distSLC(vecA, vecB): # 經(jīng)度和緯度用角度作為單位，這里用角度除以180然后乘以pi作為余弦函數(shù)的輸入a = sin(vecA[0,1]*pi/180) * sin(vecB[0,1]*pi/180) b = cos(vecA[0,1]*pi/180) * cos(vecB[0,1]*pi/180) * \cos(pi * (vecB[0,0]-vecA[0,0]) /180)return arccos(a + b)*6371.0 # 返回地球表面兩點(diǎn)之間的距離import matplotlib import matplotlib.pyplot as plt # 及簇繪圖函數(shù) def clusterClubs(numClust=5): # 希望分得的簇?cái)?shù)datList = [] # 創(chuàng)建一個(gè)空列表for line in open('places.txt').readlines():lineArr = line.split('\t')datList.append([float(lineArr[4]), float(lineArr[3])]) # 對(duì)應(yīng)的是緯度和經(jīng)度datMat = mat(datList) # 創(chuàng)建一個(gè)矩陣myCentroids, clustAssing = biKmeans(datMat, numClust, distMeas=distSLC)fig = plt.figure() # 創(chuàng)建一幅圖rect=[0.1,0.1,0.8,0.8] # 創(chuàng)建一個(gè)矩形來決定繪制圖的哪一部分scatterMarkers=['s', 'o', '^', '8', 'p', 'd', 'v', 'h', '>', '<'] # 構(gòu)建一個(gè)標(biāo)記形狀的列表來繪制散點(diǎn)圖axprops = dict(xticks=[], yticks=[])ax0=fig.add_axes(rect, label='ax0', **axprops) # 創(chuàng)建一個(gè)子圖imgP = plt.imread('Portland.png') # imread()函數(shù)基于一幅圖像來創(chuàng)建矩陣ax0.imshow(imgP) # imshow()繪制該矩陣ax1=fig.add_axes(rect, label='ax1', frameon=False) # 在同一張圖上又創(chuàng)建一個(gè)字圖for i in range(numClust): # 遍歷每一個(gè)簇ptsInCurrCluster = datMat[nonzero(clustAssing[:,0].A==i)[0],:]markerStyle = scatterMarkers[i % len(scatterMarkers)]ax1.scatter(ptsInCurrCluster[:,0].flatten().A[0], ptsInCurrCluster[:,1].flatten().A[0], marker=markerStyle, s=90)ax1.scatter(myCentroids[:,0].flatten().A[0], myCentroids[:,1].flatten().A[0], marker='+', s=300)plt.show() # 主函數(shù) clusterClubs(5)

由于源代碼里給出了從Yahoo! placeFinder API得到的數(shù)據(jù)，所以這里直接拿來用了，得到的結(jié)果：

也可以通過改變聚類簇?cái)?shù)來查看聚類效果，在此不再演示。

4. 注意的幾點(diǎn)：

4.1球面距離公式是計(jì)算球面上兩點(diǎn)間距離的公式

設(shè)所求
點(diǎn)A ：緯度β1 ，經(jīng)度α1 ；
點(diǎn)B：緯度β2，經(jīng)度α2；
則距離S=R?arccos[cosβ1cosβ2cos（α1?α2）+sinβ1sinβ2]
其中R<script type="math/tex" id="MathJax-Element-8">R</script>為球體半徑。

4.2 fig.add_axes()和Subplot（）用法

ax3 = fig.add_axes([0.1, 0.1, 0.8, 0.8]) ax4 = fig.add_axes([0.72, 0.72, 0.16, 0.16])

這里看幾個(gè)例子：

Axes - Subplot - Axis 之間的關(guān)系：

figure就是畫板，是畫紙的載體，但是具體畫畫等操作是在畫紙上完成的。在pyplot中，畫紙的概念對(duì)應(yīng)的就是Axes/Subplot。

（1）subplot的操作

import matplotlib import matplotlib.pyplot as plt fig = plt.figure() ax = fig.add_subplot(111) ax.set(xlim=[0.5, 4.5], ylim=[-2, 8], title='An Example Axes', ylabel='Y-Axis', xlabel='X-Axis') plt.show()

運(yùn)行結(jié)果：

（2）Axes 和 Subplot 的區(qū)別

Axes的使用：

fig = plt.figure() ax3 = fig.add_axes([0.1, 0.1, 0.8, 0.8]) ax4 = fig.add_axes([0.72, 0.72, 0.16, 0.16]) plt.show() print type(ax3)

運(yùn)行結(jié)果：

這里軸域（Axes）可以理解成一些軸（Axis）的集合，當(dāng)然這個(gè)集合還有很多軸（Axis）的屬性，標(biāo)注等等。

我們用add_axes()方法生成一個(gè)軸域（Axes），括號(hào)里面的值前兩個(gè)是軸域原點(diǎn)坐標(biāo)（這里指的是以整個(gè)figure為參照，前兩個(gè)值是相對(duì)于figure的左下角而言的，而不是當(dāng)前子圖的起始坐標(biāo)），后兩個(gè)是顯示坐標(biāo)軸的長度（這里指的是子圖的軸長，也是相對(duì)于figure的長度而言的）。當(dāng)我們生成了軸域的時(shí)候，從結(jié)果上看確實(shí)是生成了一個(gè)可以畫圖的子圖。

我們還可以分別在兩個(gè)軸域（Axes）中畫圖。對(duì)比兩種方法，兩種對(duì)象，我們可以總結(jié)總結(jié)： add_subplot()方法在生成子圖過程，簡單明了，而用add_axes()方法,則生成子圖的靈活性更強(qiáng)，完全可以實(shí)現(xiàn)add_subplot()方法的功能，可以控制子圖顯示位置，甚至實(shí)現(xiàn)相互重疊的效果。下面是幾個(gè)例子：

subplot

fig = plt.figure() ax1 = fig.add_subplot(211) ax2 = fig.add_subplot(212) plt.show()

Axes

fig = plt.figure() ax3 = fig.add_axes([0, 0, 0.5, 0.5], title='An Example Axes', ylabel='Y-Axis', xlabel='X-Axis') ax4 = fig.add_axes([0.6, 0.6, 0.2, 0.16], title='An Example Axes', ylabel='Y-Axis', xlabel='X-Axis') plt.show()

可以看出ax3是從畫板的原點(diǎn)（0，0）開始畫，橫縱坐標(biāo)軸都是0.5
ax4是從畫板的（0.6，0.6）開始畫，橫縱坐標(biāo)軸分別是0.2和0.6

fig = plt.figure() ax3 = fig.add_axes([0, 0, 0.3, 1], title='An Example Axes', ylabel='Y-Axis', xlabel='X-Axis') ax4 = fig.add_axes([0.6, 0.3, 0.2, 0.3], title='An Example Axes', ylabel='Y-Axis', xlabel='X-Axis') plt.show()

參考：https://www.zhihu.com/question/51745620

4.3 scatterMarkers[i % len(scatterMarkers)]

markerStyle = scatterMarkers[i % len(scatterMarkers)]

使用：

>>> scatterMarkers=['s', 'o', '^', '8', 'p', 'd', 'v', 'h', '>', '<'] >>> for i in range(5): ... markerStyle = scatterMarkers[i % len(scatterMarkers)] ... print markerStyle ... s o ^ 8 p >>>

類似于列表推到式，但此還有一個(gè)特殊的地方是可以循環(huán)使用，也就是更多簇時(shí)，可以循環(huán)標(biāo)記。

總結(jié)

以上是生活随笔為你收集整理的K-均值对地图上的点进行聚类（2）的全部內(nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯(cuò)，歡迎將生活随笔推薦給好友。

上一篇：纯黑猫咪（纯黑猫）
下一篇：坐着感觉头晕是怎么回事（感觉头晕是怎么回