PathSim

Network Schema: network schema is a meta template for a heterogeneous \(G = (V, E)\) with the object type mapping \(\phi: V \to A\) and the link mapping \(\psi: E \to R\), which is a directed graph defined over object types \(\mathcal{A}\), with edges as relations from \(\mathcal{R}\), denoted as \(T_G = (\mathcal{A}, \mathcal{R})\)

Meta path: 元路径是指定义在不同类型定点之间的一系列关系序列组成的一条路径。

PathSim: 是使用元路径来定义的一种新颖的相似搜索度量方法,用于找到异构网络中相同类型的相似顶点。

\[s(x, y)=\frac{2 \times\left|\left\{p_{x \leadsto y}: p_{x \leadsto y} \in \mathcal{P}\right\}\right|}{\left|\left\{p_{x \leadsto x}: p_{x \leadsto x} \in \mathcal{P}\right\}\right|+\left|\left\{p_{y \leadsto y}: p_{y \leadsto y} \in \mathcal{P}\right\}\right|}\]

其中\(p_{x \to y}\)是\(x\)和\(y\)之间的路径实例,\(p_{x \to x}\)是\(x\)和\(x\)之间的路径实例,\(p_{y \to y}\)是\(y\)和\(y\)之间的路径实例。

例如有一个简单的表示作者和发表会议的邻接矩阵\(W_{AC}\),矩阵中每个元素表示作者在每个会议发表的数量:

- SIGMOD VLDB ICDE KDD
Mike 2 1 0 0
Jim 50 20 0 0
Mary 2 0 1 0
Bob 2 1 0 0
Ann 0 0 1 1

那么Mike和Jim的相似度就可以计算为:

\[S(Mike, Jim) = \frac{2 \times (2 \times 50 + 1 \times 20)}{(2^2 + 1^2) + (50^2 + 20^2)} = 0.0826\]

Commuting Matrix: 给定一个网络\(G = (V, E)\),它的network schema为\(T_G\),那么对于一个元路径\(\mathcal{P} = (A_1 A_2 \dots A_l)\)的commuting matrix \(M\)可以表示为\(M=W_{A_{1} A_{2}} W_{A_{2} A_{3}} \dots W_{A_{l-1} A_{l}}\),其中\(W_{A_i A_j}\)是在类型\(A_i\)和类型\(A_j\)之间的邻接矩阵。\(M(i, j)\)代表在给定的meta path \(\mathcal{P}\)下物品\(x_i \in A_1\)和物品\(y_j \in A_l\)之间的路径数量。

给定一个对称的元路径\(\mathcal{P}\),两个相同类型的物品\(x_i\)和\(x_j\)之间的PathSim可以计算为\(s\left(x_{i}, x_{j}\right)=\frac{2 M_{i j}}{M_{i i}+M_{j j}}\)

以上面包含Mike和Jim的那个邻接为例,每个author之间的相似度矩阵可以通过先计算commuting matrix,然后除去一个分母项得到。其python实现的代码如下:

import numpy as np

AC = np.array([
     [2,  1, 0, 0],
     [50, 20, 0, 0],
     [2,  0, 1, 0],
     [2,  1, 0, 0],
     [0,  0, 1, 1],
])
AC = AC.astype(np.float)
CA = AC.T
ACA = np.dot(AC, CA)

Mij = ACA.copy()
Mii = np.repeat(ACA.diagonal()[:, None], 5, axis=1)
Mjj = np.repeat(ACA.diagonal()[None, :], 5, axis=0)

S = 2 * Mij / (Mii + Mjj)

# array([[1.        , 0.08261618, 1.        , 0.        ],
#        [0.08261618, 1.        , 0.08261618, 0.        ],
#        [1.        , 0.08261618, 1.        , 0.        ],
#        [0.        , 0.        , 0.        , 1.        ]])

Variation

  • PathConstrained Random Walks (PCRW)
  • HeteSim

Reference