
NLP Series

Question Semantic Similarity Matching with DSSM

CNTK 303: Deep Structured Semantic Modeling with LSTM Networks

DSSM stands for Deep Structured Semantic Model, also called Deep Semantic Similarity Model.
DSSM was developed at the deep learning research center of Microsoft Research. It is a model and method that uses deep neural networks to represent text (sentences, queries, entities, etc.) as vectors and to compute the similarity between texts.
DSSM has been widely applied to information retrieval and web document ranking (Huang et al. 2013; Shen et al. 2014a, 2014b; Palangi et al. 2016), as well as to ad relevance, entity search and interestingness tasks (Gao et al. 2014a), question answering (Yih et al. 2014), image captioning (Fang et al. 2014), and machine translation (Gao et al. 2014b), etc.

DSSM can be used to develop latent semantic models that project different entities into a common low-dimensional semantic space, which can then be used for tasks such as text classification and ranking. For example, in web search the relevance of a document to a query can be represented by the distance between their vectors (He et al., 2014).

Goal

Given a pair of texts, for example a search query and a set of web documents, the model converts each of them into a low-dimensional continuous vector and then uses cosine similarity to measure how similar the texts are.
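Concretely, the relevance of a document $D$ to a query $Q$ is just the cosine of the angle between their semantic vectors $y_Q$ and $y_D$ (the symbols here follow the usual DSSM notation and are not tied to the variable names used later in the code):

$$R(Q, D) = \cos(y_Q, y_D) = \frac{y_Q^{\top} y_D}{\lVert y_Q \rVert \, \lVert y_D \rVert}$$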

As the figure above shows, given a query ($Q$) and a set of documents ($D_1, D_2, \ldots, D_n$), the model produces a latent vector representation (semantic features) for each of them. These semantic features can then be used to compute text similarity, which in turn is used to rank the documents.

As the figure above shows, both the query and the document are encoded as vectors.
Although bag-of-words is a commonly used text representation, it throws away the information about word order.
Convolutional and recurrent neural networks, thanks to their ability to encode word order, perform better on many NLP problems. In this tutorial we use an LSTM model to encode the term vectors (Palangi et al.).

We train the model on a fairly small question-answering dataset. The purpose of this notebook is to show how to build a DSSM model, not to reach state-of-the-art performance with it.

# Import the relevant libraries
from __future__ import print_function # Use the print() function even on a Python 2.7 interpreter
import math
import numpy as np
import os

import cntk as C                      # the rest of this notebook refers to CNTK as C
# import cntk.tests.test_utils
# cntk.tests.test_utils.set_device_from_pytest_env() # (only needed for our build system)
C.cntk_py.set_fixed_random_seed(1)    # fix a random seed for CNTK components

Data Preparation

Download

We use a question-answering dataset to show how to use the DSSM model.
The dataset consists of pairs of question and answer sentences.
We preprocess the data into two parts:

  • Vocabulary files: one word list each for questions and answers, containing 1204 and 1019 words respectively.
  • Question-answer pairs: a training set and a validation set, both converted to CTF format. The training set contains 3500 sentence pairs and the validation set contains 409.
location = os.path.normpath('data/DSSM')
data = {
  'train': { 'file': 'train.pair.tok.ctf' },
  'val':   { 'file': 'valid.pair.tok.ctf' },
  'query': { 'file': 'vocab_Q.wl' },
  'answer':{ 'file': 'vocab_A.wl' }
}

import requests

def download(url, filename):
    """ utility function to download a file """
    response = requests.get(url, stream=True)
    with open(filename, "wb") as handle:
        for data in response.iter_content():
            handle.write(data)

if not os.path.exists(location):
    os.mkdir(location)

for item in data.values():
    path = os.path.normpath(os.path.join(location, item['file']))

    if os.path.exists(path):
        print("Reusing locally cached:", path)

    else:
        print("Starting download:", item['file'])
        url = "http://www.cntk.ai/jup/dat/DSSM/%s.csv"%(item['file'])
        print(url)
        download(url, path)
        print("Download completed")
    item['file'] = path
Starting download: train.pair.tok.ctf
http://www.cntk.ai/jup/dat/DSSM/train.pair.tok.ctf.csv
Download completed
Starting download: valid.pair.tok.ctf
http://www.cntk.ai/jup/dat/DSSM/valid.pair.tok.ctf.csv
Download completed
Starting download: vocab_Q.wl
http://www.cntk.ai/jup/dat/DSSM/vocab_Q.wl.csv
Download completed
Starting download: vocab_A.wl
http://www.cntk.ai/jup/dat/DSSM/vocab_A.wl.csv
Download completed
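If you are curious what the CTF format looks like on disk, you can peek at the first couple of lines of the downloaded training file (a minimal sketch that only assumes the download above succeeded). Each line carries sparse one-hot word indices for the query stream S0 and the answer stream S1:

# Show the first two lines of the CTF-formatted training file (illustrative only)
with open(data['train']['file']) as f:
    for _ in range(2):
        print(f.readline().rstrip())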

Data reading

We use the CTF deserializer to read the data. You can of course choose another approach and preprocess the data yourself. The CTF reader used here can also shuffle the order of the samples.

# Define the vocabulary sizes (QRY stands for query/question and ANS stands for answer)
QRY_SIZE = 1204
ANS_SIZE = 1019

def create_reader(path, is_training):
    return C.io.MinibatchSource(C.io.CTFDeserializer(path, C.io.StreamDefs(
        query  = C.io.StreamDef(field='S0', shape=QRY_SIZE, is_sparse=True),
        answer = C.io.StreamDef(field='S1', shape=ANS_SIZE, is_sparse=True)
    )), randomize=is_training, max_sweeps = C.io.INFINITELY_REPEAT if is_training else 1)
train_file = data['train']['file']
print(train_file)

if os.path.exists(train_file):
    train_source = create_reader(train_file, is_training=True)
else:
    raise ValueError("Cannot locate file {0} in current directory {1}".format(train_file, os.getcwd()))

validation_file = data['val']['file']
print(validation_file)
if os.path.exists(validation_file):
    val_source = create_reader(validation_file, is_training=False)
else:
    raise ValueError("Cannot locate file {0} in current directory {1}".format(validation_file, os.getcwd()))
data\DSSM\train.pair.tok.ctf
data\DSSM\valid.pair.tok.ctf

Model creation

An LSTM-RNN reads the words of a sentence one by one, extracts information from them, and embeds the sentence into a vector.
In the DSSM model we take the last hidden state of the sentence as the vector representation of the whole sentence.
Passing this vector through two feed-forward (dense) layers then yields the query vector.

                                                    "query vector"
                                                          ^
                                                          |
                                                      +-------+  
                                                      | Dense |  
                                                      +-------+  
                                                          ^         
                                                          |         
                                                     +---------+  
                                                     | Dropout |  
                                                     +---------+
                                                          ^
                                                          |         
                                                      +-------+  
                                                      | Dense |  
                                                      +-------+  
                                                          ^         
                                                          |         
                                                      +------+   
                                                      | last |  
                                                      +------+  
                                                          ^  
                                                          |         
          +------+   +------+   +------+   +------+   +------+   
     0 -->| LSTM |-->| LSTM |-->| LSTM |-->| LSTM |-->| LSTM |
          +------+   +------+   +------+   +------+   +------+   
              ^          ^          ^          ^          ^
              |          |          |          |          |
          +-------+  +-------+  +-------+  +-------+  +-------+
          | Embed |  | Embed |  | Embed |  | Embed |  | Embed | 
          +-------+  +-------+  +-------+  +-------+  +-------+
              ^          ^          ^          ^          ^
              |          |          |          |          |
query  ------>+--------->+--------->+--------->+--------->+

Similarly, we encode the answer sentence into an answer vector. We first define the inputs of the model, namely the query sequence and the answer sequence.

# Create the containers for input feature (x) and the label (y)
qry = C.sequence.input_variable(QRY_SIZE)
ans = C.sequence.input_variable(ANS_SIZE)

Every CNTK sequence carries a dynamic axis that represents the length of the sequence.
Intuitively, when your sequences have different lengths and different vocabulary sizes, each of them should have its own dynamic axis.
In that case you need to declare named axes.

# Create the containers for input feature (x) and the label (y)
axis_qry = C.Axis.new_unique_dynamic_axis('axis_qry')
qry = C.sequence.input_variable(QRY_SIZE, sequence_axis=axis_qry)

axis_ans = C.Axis.new_unique_dynamic_axis('axis_ans')
ans = C.sequence.input_variable(ANS_SIZE, sequence_axis=axis_ans)

Before creating the model, we first define some model parameters.

EMB_DIM   = 25 # Embedding dimension
HIDDEN_DIM = 50 # LSTM dimension
DSSM_DIM = 25 # Dense layer dimension
NEGATIVE_SAMPLES = 5
DROPOUT_RATIO = 0.2
def create_model(qry, ans):
    with C.layers.default_options(initial_state=0.1):
        qry_vector = C.layers.Sequential([
            C.layers.Embedding(EMB_DIM, name='embed'),
            C.layers.Recurrence(C.layers.LSTM(HIDDEN_DIM), go_backwards=False),
            C.sequence.last,
            C.layers.Dense(DSSM_DIM, activation=C.relu, name='q_proj'),
            C.layers.Dropout(DROPOUT_RATIO, name='dropout qdo1'),
            C.layers.Dense(DSSM_DIM, activation=C.tanh, name='q_enc')
        ])

        ans_vector = C.layers.Sequential([
            C.layers.Embedding(EMB_DIM, name='embed'),
            C.layers.Recurrence(C.layers.LSTM(HIDDEN_DIM), go_backwards=False),
            C.sequence.last,
            C.layers.Dense(DSSM_DIM, activation=C.relu, name='a_proj'),
            C.layers.Dropout(DROPOUT_RATIO, name='dropout ado1'),
            C.layers.Dense(DSSM_DIM, activation=C.tanh, name='a_enc')
        ])

    return {
        'query_vector': qry_vector(qry),
        'answer_vector': ans_vector(ans)
    }

# Create the model and store reference in `network` dictionary
network = create_model(qry, ans)

network['query'], network['axis_qry'] = qry, axis_qry
network['answer'], network['axis_ans'] = ans, axis_ans

Training

Now that we have created the model, the next step is to choose a suitable loss function. The loss should be a small number close to 0 when a question is paired with its correct answer, and close to 1 when the question and answer do not match. In other words, the loss function should maximize the similarity between a question and its correct answer and minimize the similarity between the question and incorrect answers.

DSSM is often used in information-retrieval settings: given a search phrase or question, we need to find the correct answer within a huge amount of text. The input is a question together with a potential answer (a document or an ad) that may get clicked, and the goal is to increase the probability of a click, i.e. to make the retrieved document or ad relevant to the search query. One approach is to train a classifier that predicts whether a link will be clicked. To train such a model we need query-link pairs that were clicked as well as links that were not clicked. One way to simulate un-clicked links is to randomly sample links produced by the other queries in the current minibatch, and that is exactly what the cosine_distance_with_negative_samples function does. Note that this function returns a similarity of 1 for a correct question-answer pair and 0 for an incorrect one, so we use 1 - cosine_distance_with_negative_samples as the loss function.
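To make the negative-sampling trick concrete, here is a NumPy-only sketch (not CNTK code; the toy vectors are made up for illustration) of how shifting the answers inside a minibatch turns the answers of other queries into negative samples:

# Toy illustration of negative sampling by shifting within a minibatch (NumPy only)
import numpy as np

def cosine_sim(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.RandomState(0)
toy_queries = rng.randn(4, 25)                        # 4 query vectors in a toy minibatch
toy_answers = toy_queries + 0.1 * rng.randn(4, 25)    # their matching answers (correlated on purpose)

for i in range(len(toy_queries)):
    pos = cosine_sim(toy_queries[i], toy_answers[i])            # clicked (positive) pair
    neg = cosine_sim(toy_queries[i], toy_answers[(i + 1) % 4])  # shift=1: answer of another query
    print('row %d: positive %.3f vs shifted negative %.3f' % (i, pos, neg))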

def create_loss(vector_a, vector_b):
    qry_ans_similarity = C.cosine_distance_with_negative_samples(vector_a, \
                                                                 vector_b, \
                                                                 shift=1, \
                                                                 num_negative_samples=5)
    return 1 - qry_ans_similarity
# Model parameters
MAX_EPOCHS = 5
EPOCH_SIZE = 10000
MINIBATCH_SIZE = 50
# Create trainer
def create_trainer(reader, network):

    # Setup the progress updater
    progress_writer = C.logging.ProgressPrinter(tag='Training', num_epochs=MAX_EPOCHS)

    # Set learning parameters
    lr_per_sample = [0.0015625]*20 + \
                    [0.00046875]*20 + \
                    [0.00015625]*20 + \
                    [0.000046875]*10 + \
                    [0.000015625]
    lr_schedule = C.learning_parameter_schedule_per_sample(lr_per_sample, \
                                                           epoch_size=EPOCH_SIZE)
    mms = [0]*20 + [0.9200444146293233]*20 + [0.9591894571091382]
    mm_schedule = C.learners.momentum_schedule(mms, \
                                               epoch_size=EPOCH_SIZE, \
                                               minibatch_size=MINIBATCH_SIZE)
    l2_reg_weight = 0.0002

    model = C.combine(network['query_vector'], network['answer_vector'])

    # Notify the network that the two dynamic axes are indeed the same
    query_reconciled = C.reconcile_dynamic_axes(network['query_vector'], network['answer_vector'])

    network['loss'] = create_loss(query_reconciled, network['answer_vector'])
    network['error'] = None

    print('Using momentum sgd with no l2')
    dssm_learner = C.learners.momentum_sgd(model.parameters, lr_schedule, mm_schedule)

    network['learner'] = dssm_learner

    print('Using local learner')
    # Create trainer
    return C.Trainer(model, (network['loss'], network['error']), network['learner'], progress_writer)
# Instantiate the trainer
trainer = create_trainer(train_source, network)
Using momentum sgd with no l2
Using local learner
# Train
def do_train(network, trainer, train_source):
    # define mapping from input streams to network inputs
    input_map = {
        network['query']: train_source.streams.query,
        network['answer']: train_source.streams.answer
    }

    t = 0
    for epoch in range(MAX_EPOCHS):          # loop over epochs
        epoch_end = (epoch+1) * EPOCH_SIZE
        while t < epoch_end:                 # loop over minibatches on the epoch
            data = train_source.next_minibatch(MINIBATCH_SIZE, input_map=input_map)  # fetch minibatch
            trainer.train_minibatch(data)    # update model with it
            t += MINIBATCH_SIZE

        trainer.summarize_training_progress()
do_train(network, trainer, train_source)
Learning rate per 1 samples: 0.0015625
Momentum per 1 samples: 0.0
Finished Epoch[1 of 5]: [Training] loss = 0.343046 * 1522, metric = 0.00% * 1522 5.720s (266.1 samples/s);
Finished Epoch[2 of 5]: [Training] loss = 0.102804 * 1530, metric = 0.00% * 1530 3.464s (441.7 samples/s);
Finished Epoch[3 of 5]: [Training] loss = 0.066461 * 1525, metric = 0.00% * 1525 3.402s (448.3 samples/s);
Finished Epoch[4 of 5]: [Training] loss = 0.048511 * 1534, metric = 0.00% * 1534 3.390s (452.5 samples/s);
Finished Epoch[5 of 5]: [Training] loss = 0.035384 * 1510, metric = 0.00% * 1510 3.383s (446.3 samples/s);

Validate

After training, we need to pick a model whose training and validation error rates are close to each other.
We can find a better model by trying different numbers of epochs.
The model selected in this way is the one finally used for prediction.

# Validate
def do_validate(network, val_source):
    # process minibatches and perform evaluation
    progress_printer = C.logging.ProgressPrinter(tag='Evaluation', num_epochs=0)

    val_map = {
        network['query']: val_source.streams.query,
        network['answer']: val_source.streams.answer
    }

    evaluator = C.eval.Evaluator(network['loss'], progress_printer)

    while True:
        minibatch_size = 100
        data = val_source.next_minibatch(minibatch_size, input_map=val_map)
        if not data:                          # until we hit the end
            break

        evaluator.test_minibatch(data)

    evaluator.summarize_test_progress()
do_validate(network, val_source)
Finished Evaluation [1]: Minibatch[1-35]: metric = 0.02% * 410;

Prediction

We convert both the query and the answer into vectors and then compute the cosine similarity between them. These cosine similarity scores can be used to rank the retrieved web documents.

# load dictionaries
query_wl = [line.rstrip('\n') for line in open(data['query']['file'])]
answers_wl = [line.rstrip('\n') for line in open(data['answer']['file'])]
query_dict = {query_wl[i]:i for i in range(len(query_wl))}
answers_dict = {answers_wl[i]:i for i in range(len(answers_wl))}

# let's run a sequence through
qry = 'BOS what contribution did e1 made to science in 1665 EOS'
ans = 'BOS book author book_editions_published EOS'
ans_poor = 'BOS language human_language main_country EOS'

qry_idx = [query_dict[w+' '] for w in qry.split()] # convert to query word indices
print('Query Indices:', qry_idx)

ans_idx = [answers_dict[w+' '] for w in ans.split()] # convert to answer word indices
print('Answer Indices:', ans_idx)

ans_poor_idx = [answers_dict[w+' '] for w in ans_poor.split()] # convert to fake answer word indices
print('Poor Answer Indices:', ans_poor_idx)
Query Indices: [1202, 1154, 267, 321, 357, 648, 1070, 905, 549, 6, 1203]
Answer Indices: [1017, 135, 91, 137, 1018]
Poor Answer Indices: [1017, 501, 452, 533, 1018]
# Create the one hot representations
qry_onehot = np.zeros([len(qry_idx),len(query_dict)], np.float32)
for t in range(len(qry_idx)):
    qry_onehot[t,qry_idx[t]] = 1

ans_onehot = np.zeros([len(ans_idx),len(answers_dict)], np.float32)
for t in range(len(ans_idx)):
    ans_onehot[t,ans_idx[t]] = 1

ans_poor_onehot = np.zeros([len(ans_poor_idx),len(answers_dict)], np.float32)
for t in range(len(ans_poor_idx)):
    ans_poor_onehot[t, ans_poor_idx[t]] = 1
qry_embedding = network['query_vector'].eval([qry_onehot])
ans_embedding = network['answer_vector'].eval([ans_onehot])
ans_poor_embedding = network['answer_vector'].eval([ans_poor_onehot])

from scipy.spatial.distance import cosine

print('Query to Answer similarity:', 1-cosine(qry_embedding, ans_embedding))
print('Query to poor-answer similarity:', 1-cosine(qry_embedding, ans_poor_embedding))
Query to Answer similarity: 0.99995367043
Query to poor-answer similarity: 0.999941420215
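
Since the scores are plain cosine similarities, ranking several candidate answers for one query is just a sort. Below is a minimal sketch that reuses the embeddings computed above; the candidate labels are made up for illustration:

# Rank candidate answers for the query by cosine similarity (higher means more relevant)
candidates = {
    'good answer': ans_embedding,
    'poor answer': ans_poor_embedding,
}
ranked = sorted(candidates.items(),
                key=lambda kv: 1 - cosine(qry_embedding, kv[1]),
                reverse=True)
for rank, (name, emb) in enumerate(ranked, 1):
    print(rank, name, 1 - cosine(qry_embedding, emb))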