LINE: Large-scale Information Network Embedding
Why LINE matters:
1. It applies to arbitrary types of networks: directed or undirected, weighted or unweighted.
2. It has a clearly defined objective function that preserves both first-order and second-order proximity.
3. It scales: a network with millions of vertices and billions of edges can be trained in a few hours.
4. LINE is the most-cited paper of WWW 2015.
5. Together with DeepWalk (2014) and node2vec (2016), it is representative early work on network embedding and a classic baseline.
6. It inspired a large body of follow-up work on representation learning from network structure.
Positioning: graph learning (hand-crafted features, feature engineering) <-- DeepWalk, LINE, node2vec --> deep learning (features and classification learned jointly, neural networks).
Paper download: http://de.arxiv.org/pdf/1503.03578

The main structure of the paper is as follows:
1. Abstract
Introduces the background and proposes the LINE model, which preserves both local and global network structure.
1. LINE applies to arbitrary types of information networks: directed/undirected, weighted/unweighted.
2. Its objective function preserves both the local and the global network structure.
3. Its effectiveness is verified on networks from several domains: word co-occurrence networks, social networks, and citation networks.
4. It is scalable: a network with millions of vertices can be trained on a single machine within a few hours.
2. Introduction
Motivates the importance of graphs, compares against earlier methods, and points out that DeepWalk lacks a clear objective function.
1. Chronology of the methods: DeepWalk (2014), LINE (2015), node2vec (2016).
2. node2vec with p = q = 1 is equivalent to DeepWalk.
3. DeepWalk has few tunable hyperparameters; LINE can use first-order proximity, second-order proximity, or a combination of the two.
4. DeepWalk and node2vec are random-walk-based algorithms; LINE is motivated directly by the network structure.
3. Related Work
Reviews classical graph-based algorithms.
4. Edge Sampling
Introduces the alias sampling technique, the time-complexity analysis, and the treatment of low-degree and new vertices (both formalized below).
Low-degree vertices: hard to learn, because a low degree means few neighbors and second-order proximity becomes unreliable; the fix is to bring in higher-order information, here the neighbors of neighbors.
New vertices: keep the embeddings of all existing vertices fixed and optimize only the new vertex's embedding.
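In symbols (notation follows the paper: $w_{ij}$ is the weight of edge $(i,j)$, $d_k$ the degree of vertex $k$, $d$ the embedding dimension, $K$ the number of negative samples):

```latex
% Edges are drawn with probability proportional to weight and then treated
% as binary, so the SGD step size no longer depends on the edge weights:
\Pr\big(\text{sample edge } (i,j)\big) = \frac{w_{ij}}{\sum_{(m,n)\in E} w_{mn}}

% Low-degree fix: connect v_i to its neighbors-of-neighbors v_j with weight
w_{ij} = \sum_{k \in N(i)} w_{ik}\,\frac{w_{kj}}{d_k}

% With alias tables each draw costs O(1), giving an overall training cost
% of O(dK|E|), linear in the number of edges.
```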
5. Model Optimization
Covers how the model is optimized: negative sampling and stochastic gradient descent (the per-edge objective is written out below).
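Concretely, negative sampling replaces the expensive full softmax of the second-order model; for every sampled edge $(i,j)$ the paper maximizes (same notation as in Section 6 below):

```latex
\log \sigma\big(\vec{u}_j'^{\top}\vec{u}_i\big)
+ \sum_{n=1}^{K} \mathbb{E}_{v_n \sim P_n(v)}
  \Big[\log \sigma\big(-\vec{u}_{v_n}'^{\top}\vec{u}_i\big)\Big],
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}
```

with the noise distribution $P_n(v) \propto d_v^{3/4}$, where $d_v$ is the out-degree of $v$.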
6. The LINE Algorithm
Defines first-order and second-order proximity and how the two are combined (the objectives are reproduced below).
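For reference, the two objectives from the paper, with $\vec{u}_i$ the embedding of vertex $v_i$ and $\vec{u}_i'$ its representation as a "context":

```latex
% First-order proximity: joint probability of observing the undirected edge (v_i, v_j)
p_1(v_i, v_j) = \frac{1}{1 + \exp(-\vec{u}_i^{\top}\vec{u}_j)},
\qquad
O_1 = -\sum_{(i,j)\in E} w_{ij}\,\log p_1(v_i, v_j)

% Second-order proximity: probability that v_j appears as a "context" of v_i
p_2(v_j \mid v_i) = \frac{\exp(\vec{u}_j'^{\top}\vec{u}_i)}
                         {\sum_{k=1}^{|V|}\exp(\vec{u}_k'^{\top}\vec{u}_i)},
\qquad
O_2 = -\sum_{(i,j)\in E} w_{ij}\,\log p_2(v_j \mid v_i)
```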
7. Effectiveness
Experiments probing the model's effectiveness: choice of baselines, parameter settings, and multiple tasks on several datasets.
8. Experiments
Visualization, performance, network sparsity, parameter sensitivity, and scalability.
9. Discussion
Summary: the paper proposes a network representation learning method whose objective function is built directly from the network structure.
Key points: understanding first-order and second-order proximity on graphs; turning the two proximities into objective functions; the derivation of the formulas; the time-complexity analysis; negative sampling and alias sampling on graphs.
Innovations: modeling directly from the graph's own information; application to large-scale datasets; thorough experimental validation.
Takeaways: how an understanding of graphs drives network representation learning; the design choice of comparing the model's predicted distribution against the empirical one via KL divergence (made concrete below).
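To make the KL-divergence takeaway concrete: the first-order objective comes from matching the model distribution $p_1$ to the empirical edge distribution $\hat p_1(i,j) = w_{ij}/W$, where $W = \sum_{(m,n)\in E} w_{mn}$; choosing KL divergence as the distance and dropping constants recovers $O_1$:

```latex
O_1 = d\big(\hat p_1(\cdot,\cdot),\, p_1(\cdot,\cdot)\big)
    = \mathrm{KL}\big(\hat p_1 \,\|\, p_1\big)
    \;\propto\; -\sum_{(i,j)\in E} w_{ij}\,\log p_1(v_i, v_j)
```

The second-order objective $O_2$ is obtained the same way, vertex by vertex, with the conditional distributions $\hat p_2(\cdot \mid v_i)$ and $p_2(\cdot \mid v_i)$ weighted by a vertex importance $\lambda_i$, which the paper sets to the degree $d_i$.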
10. Code
First build a small dataset. Each line of the input file is an edge in the format (node1, node2, weight); here it is the well-known karate-club graph with all weights set to 1, saved as ./data/weighted.karate.edgelist:

```
1 32 1
1 22 1
1 20 1
1 18 1
1 14 1
1 13 1
1 12 1
1 11 1
1 9 1
1 8 1
1 7 1
1 6 1
1 5 1
1 4 1
1 3 1
1 2 1
2 31 1
2 22 1
2 20 1
2 18 1
2 14 1
2 8 1
2 4 1
2 3 1
3 14 1
3 9 1
3 10 1
3 33 1
3 29 1
3 28 1
3 8 1
3 4 1
4 14 1
4 13 1
4 8 1
5 11 1
5 7 1
6 17 1
6 11 1
6 7 1
7 17 1
9 34 1
9 33 1
9 33 1
10 34 1
14 34 1
15 34 1
15 33 1
16 34 1
16 33 1
19 34 1
19 33 1
20 34 1
21 34 1
21 33 1
23 34 1
23 33 1
24 30 1
24 34 1
24 33 1
24 28 1
24 26 1
25 32 1
25 28 1
25 26 1
26 32 1
27 34 1
27 30 1
28 34 1
29 34 1
29 32 1
30 34 1
30 33 1
31 34 1
31 33 1
32 34 1
32 33 1
33 34 1
```

The code is organized as follows:
1) Set the model hyperparameters: first-order vs. second-order proximity, number of negative samples (K), embedding dimension, batch size, number of epochs, learning rate, etc.
2) Input file './data/weighted.karate.edgelist'; output file './model.pt'.

/*** utils.py ***/

```python
import random
import collections
from decimal import Decimal

import numpy as np
from tqdm import tqdm


class VoseAlias(object):
    """Adding a few modifs to https://github.com/asmith26/Vose-Alias-Method"""

    def __init__(self, dist):
        """(VoseAlias, dict) -> NoneType"""
        self.dist = dist
        self.alias_initialisation()

    def alias_initialisation(self):
        """Construct probability and alias tables for the distribution."""
        # Initialise variables
        n = len(self.dist)
        self.table_prob = {}   # probability table
        self.table_alias = {}  # alias table
        scaled_prob = {}       # scaled probabilities
        small = []             # stack for probabilities smaller than 1
        large = []             # stack for probabilities greater than or equal to 1

        # Construct and sort the scaled probabilities into their appropriate stacks
        print("1/2. Building and sorting scaled probabilities for alias table...")
        for o, p in tqdm(self.dist.items()):
            scaled_prob[o] = Decimal(p) * n
            if scaled_prob[o] < 1:
                small.append(o)
            else:
                large.append(o)

        print("2/2. Building alias table...")
        # Construct the probability and alias tables
        while small and large:
            s = small.pop()
            l = large.pop()
            self.table_prob[s] = scaled_prob[s]
            self.table_alias[s] = l
            scaled_prob[l] = (scaled_prob[l] + scaled_prob[s]) - Decimal(1)
            if scaled_prob[l] < 1:
                small.append(l)
            else:
                large.append(l)

        # The remaining outcomes (of one stack) must have probability 1
        while large:
            self.table_prob[large.pop()] = Decimal(1)
        while small:
            self.table_prob[small.pop()] = Decimal(1)
        self.listprobs = list(self.table_prob)

    def alias_generation(self):
        """Return a random outcome from the distribution in O(1)."""
        # Determine which column of table_prob to inspect
        col = random.choice(self.listprobs)
        # Determine which outcome to pick in that column
        if self.table_prob[col] >= random.uniform(0, 1):
            return col
        else:
            return self.table_alias[col]

    def sample_n(self, size):
        """Yield a sample of the given size from the distribution."""
        for i in range(size):
            yield self.alias_generation()


def makeDist(graphpath, power=0.75):
    """Read the edgelist and build the edge-sampling and negative-sampling
    distributions."""
    edgedistdict = collections.defaultdict(int)
    nodedistdict = collections.defaultdict(int)

    weightsdict = collections.defaultdict(int)
    nodedegrees = collections.defaultdict(int)

    weightsum = 0

    # first pass: count lines so tqdm can show progress
    nlines = 0
    with open(graphpath, "r") as graphfile:
        for l in graphfile:
            nlines += 1

    print("Reading edgelist file...")
    maxindex = 0
    with open(graphpath, "r") as graphfile:
        for l in tqdm(graphfile, total=nlines):
            line = [int(i) for i in l.replace("\n", "").split(" ")]
            node1, node2, weight = line[0], line[1], line[2]

            edgedistdict[tuple([node1, node2])] = weight
            nodedistdict[node1] += weight

            weightsdict[tuple([node1, node2])] = weight
            nodedegrees[node1] += weight

            weightsum += weight
            # track the largest vertex id on either endpoint
            maxindex = max(maxindex, node1, node2)

    # noise distribution for negative sampling: P_n(v) ~ (out-degree)^0.75,
    # normalised over vertices so the probabilities sum to one
    negprobsum = sum(np.power(degree, power) for degree in nodedistdict.values())
    for node, outdegree in nodedistdict.items():
        nodedistdict[node] = np.power(outdegree, power) / negprobsum

    # edge-sampling distribution: proportional to edge weight
    for edge, weight in edgedistdict.items():
        edgedistdict[edge] = weight / weightsum

    return edgedistdict, nodedistdict, weightsdict, nodedegrees, maxindex


def negSampleBatch(sourcenode, targetnode, negsamplesize, nodesaliassampler):
    """Draw negsamplesize negative vertices for one edge, rejecting the
    edge's own endpoints."""
    negsamples = 0
    while negsamples < negsamplesize:
        samplednode = nodesaliassampler.alias_generation()
        if (samplednode == sourcenode) or (samplednode == targetnode):
            continue
        negsamples += 1
        yield samplednode


def makeData(samplededges, negsamplesize, weights, nodedegrees, nodesaliassampler):
    """Turn each sampled edge into a training row (node i, node j, negative nodes...)."""
    for e in samplededges:
        sourcenode, targetnode = e[0], e[1]
        negnodes = list(negSampleBatch(sourcenode, targetnode, negsamplesize,
                                       nodesaliassampler))
        yield [e[0], e[1]] + negnodes
```

/*** line.py ***/

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Line(nn.Module):
    def __init__(self, size, embed_dim=128, order=1):
        super(Line, self).__init__()

        assert order in [1, 2], "Order should either be int(1) or int(2)"

        self.embed_dim = embed_dim
        self.order = order
        self.nodes_embeddings = nn.Embedding(size, embed_dim)

        if order == 2:
            # second order keeps a separate "context" embedding table
            self.contextnodes_embeddings = nn.Embedding(size, embed_dim)
            # Initialization
            self.contextnodes_embeddings.weight.data = self.contextnodes_embeddings.weight.data.uniform_(
                -.5, .5) / embed_dim

        # Initialization
        self.nodes_embeddings.weight.data = self.nodes_embeddings.weight.data.uniform_(
            -.5, .5) / embed_dim

    def forward(self, v_i, v_j, negsamples, device):
        v_i = self.nodes_embeddings(v_i).to(device)

        if self.order == 2:
            # target and negative vertices use the context table
            v_j = self.contextnodes_embeddings(v_j).to(device)
            negativenodes = -self.contextnodes_embeddings(negsamples).to(device)
        else:
            v_j = self.nodes_embeddings(v_j).to(device)
            negativenodes = -self.nodes_embeddings(negsamples).to(device)

        # log sigmoid(u_i . u_j) for the observed edge
        mulpositivebatch = torch.mul(v_i, v_j)
        positivebatch = F.logsigmoid(torch.sum(mulpositivebatch, dim=1))

        # sum over the K negative vertices of log sigmoid(-u_i . u_k)
        mulnegativebatch = torch.mul(v_i.view(len(v_i), 1, self.embed_dim), negativenodes)
        negativebatch = torch.sum(
            F.logsigmoid(torch.sum(mulnegativebatch, dim=2)),
            dim=1)
        loss = positivebatch + negativebatch
        return -torch.mean(loss)
```

/*** main script ***/

```python
import argparse
import pickle

import torch
import torch.optim as optim
from tqdm import trange

from utils.utils import *
from utils.line import Line

### 1. Parse arguments and hyperparameters
parser = argparse.ArgumentParser()
# input edgelist file
parser.add_argument("-g", "--graph_path", type=str, default='./data/weighted.karate.edgelist')
# output file for the trained model
parser.add_argument("-save", "--save_path", type=str, default='./model.pt')
# output file for the loss values
parser.add_argument("-lossdata", "--lossdata_path", type=str, default='./loss.pkl')

# first-order vs. second-order proximity, as in the paper
parser.add_argument("-order", "--order", type=int, default=2)
# number of negative samples K
parser.add_argument("-neg", "--negsamplesize", type=int, default=5)
# embedding dimension
parser.add_argument("-dim", "--dimension", type=int, default=128)
# batch size
parser.add_argument("-batchsize", "--batchsize", type=int, default=5)
# number of epochs
parser.add_argument("-epochs", "--epochs", type=int, default=1)
# learning rate
parser.add_argument("-lr", "--learning_rate", type=float, default=0.025)  # starting value in the paper
# exponent of the negative-sampling distribution
parser.add_argument("-negpow", "--negativepower", type=float, default=0.75)
args = parser.parse_args()

### 2. Read the graph; store nodes and edges and normalize
edgedistdict, nodedistdict, weights, nodedegrees, maxindex = makeDist(
    args.graph_path, args.negativepower)

### 3. Build the alias tables for nodes and edges (O(1) sampling)
edgesaliassampler = VoseAlias(edgedistdict)
nodesaliassampler = VoseAlias(nodedistdict)

### 4. Instantiate the LINE model
# number of mini-batches per epoch
batchrange = int(len(edgedistdict) / args.batchsize)
print(maxindex)
# the nn.Module defined in line.py
line = Line(maxindex + 1, embed_dim=args.dimension, order=args.order)
# optimize with SGD
opt = optim.SGD(line.parameters(), lr=args.learning_rate,
                momentum=0.9, nesterov=True)

### 5. Train edge by edge, with negative sampling
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

lossdata = {"it": [], "loss": []}
it = 0
helper = 0

print("\nTraining on {}...\n".format(device))
for epoch in range(args.epochs):
    print("Epoch {}".format(epoch))
    for b in trange(batchrange):
        # edgesaliassampler is the VoseAlias edge sampler: draw batchsize edges
        samplededges = edgesaliassampler.sample_n(args.batchsize)
        # makeData (utils.py) adds K negative vertices per edge;
        # each row is (node i, node j, negative nodes...)
        batch = list(makeData(samplededges, args.negsamplesize, weights,
                              nodedegrees, nodesaliassampler))
        # convert to a tensor of vertex ids
        batch = torch.LongTensor(batch)
        # print the first batch once, as a sanity check
        if helper == 0:
            print(batch)
            helper = 1
        # column 0: source vertices
        v_i = batch[:, 0]
        # column 1: target vertices
        v_j = batch[:, 1]
        # columns 2 onwards: negative vertices
        negsamples = batch[:, 2:]
        # zero the gradients before backprop, since they accumulate
        line.zero_grad()
        # forward pass of the LINE model
        loss = line(v_i, v_j, negsamples, device)
        # compute gradients
        loss.backward()
        # update the parameters
        opt.step()

        lossdata["loss"].append(loss.item())
        lossdata["it"].append(it)
        it += 1

print("\nDone training, saving model to {}".format(args.save_path))
torch.save(line, "{}".format(args.save_path))

print("Saving loss data at {}".format(args.lossdata_path))
with open(args.lossdata_path, "wb") as ldata:
    pickle.dump(lossdata, ldata)

### 6. Inspect the result: cluster the learned embeddings
from sklearn import cluster
import numpy as np

embedding_node = []
for i in range(1, 35):
    idx = torch.LongTensor([i])
    t = line.nodes_embeddings(idx)
    embedding_node.append(t.tolist()[0])
embedding_node = np.matrix(embedding_node).reshape((34, -1))
# k-means over the 34 vertex embeddings
y_pred = cluster.KMeans(n_clusters=3, random_state=9).fit_predict(embedding_node)
print(y_pred)
```
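The script above trains a single order. To follow the paper's LINE(1st + 2nd) combination, you would train twice, once with `-order 1` and once with `-order 2`, then concatenate the two embeddings per vertex. A minimal sketch, assuming the two runs were saved under the hypothetical names `model_1st.pt` and `model_2nd.pt`:

```python
# Sketch: concatenate LINE(1st) and LINE(2nd) embeddings per vertex.
# The file names are hypothetical; L2-normalising each half before
# concatenation is a common practice, not something the script above does.
import torch
import torch.nn.functional as F

line1 = torch.load("model_1st.pt")  # model trained with -order 1
line2 = torch.load("model_2nd.pt")  # model trained with -order 2

with torch.no_grad():
    ids = torch.arange(1, 35)  # vertex ids of the karate-club graph
    emb1 = F.normalize(line1.nodes_embeddings(ids), dim=1)
    emb2 = F.normalize(line2.nodes_embeddings(ids), dim=1)
    combined = torch.cat([emb1, emb2], dim=1)  # shape (34, 2 * embed_dim)
print(combined.shape)
```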