
NLP Series

Question Semantic Similarity Matching with DSSM

CNTK 303: Deep Structured Semantic Modeling with LSTM Networks

DSSM stands for Deep Structured Semantic Model, also called Deep Semantic Similarity Model.
DSSM was developed at the deep learning research center of Microsoft Research. It is a model and method that uses deep neural networks to represent text (sentences, queries, entities, etc.) as vectors and to compute the similarity between texts.
DSSM has been widely applied to information retrieval and web document ranking (Huang et al. 2013; Shen et al. 2014a, 2014b; Palangi et al. 2016), as well as to ad relevance, entity search and interestingness tasks (Gao et al. 2014a), question answering (Yih et al. 2014), image captioning (Fang et al. 2014), and machine translation (Gao et al. 2014b), etc.

DSSM can be used to develop latent semantic models that project different entities into a common low-dimensional semantic space, which can then be used for tasks such as text classification and ranking. For example, in web search the relevance of a document to a query can be represented by the distance between their vectors (He et al., 2014).

Goal

Given a pair of texts, for example a search query and a set of web documents, the model converts each of them into a low-dimensional continuous vector and then uses cosine similarity to measure how similar the texts are.
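Concretely, the relevance of a document $D$ to a query $Q$ is just the cosine of the angle between their semantic vectors $y_Q$ and $y_D$ (the symbols here follow the usual DSSM notation and are not tied to the variable names used later in the code):

$$R(Q, D) = \cos(y_Q, y_D) = \frac{y_Q^{\top} y_D}{\lVert y_Q \rVert \, \lVert y_D \rVert}$$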

As the figure above shows, given a query ($Q$) and a set of documents ($D_1, D_2, \ldots, D_n$), the model produces a latent vector representation (semantic features) for each of them. These semantic features can then be used to compute text similarity, which in turn is used to rank the documents.

As the figure above shows, both the query and the document are encoded as vectors.
Although bag-of-words is a commonly used text representation, it throws away the information about word order.
Convolutional and recurrent neural networks, thanks to their ability to encode word order, perform better on many NLP problems. In this tutorial we use an LSTM model to encode the term vectors (Palangi et al.).

We train the model on a fairly small question-answering dataset. The purpose of this notebook is to show how to build a DSSM model, not to reach state-of-the-art performance with it.

# Import the relevant libraries
from __future__ import print_function # Use the print() function even on a Python 2.7 interpreter
import math
import numpy as np
import os

import cntk as C                      # the rest of this notebook refers to CNTK as C
# import cntk.tests.test_utils
# cntk.tests.test_utils.set_device_from_pytest_env() # (only needed for our build system)
C.cntk_py.set_fixed_random_seed(1)    # fix a random seed for CNTK components

Data Preparation

Download

We use a question-answering dataset to show how to use the DSSM model.
The dataset consists of pairs of question and answer sentences.
We preprocess the data into two parts:

  • Vocabulary files: one word list each for questions and answers, containing 1204 and 1019 words respectively.
  • Question-answer pairs: a training set and a validation set, both converted to CTF format. The training set contains 3500 sentence pairs and the validation set contains 409.
location = os.path.normpath('data/DSSM')
data = {
  'train': { 'file': 'train.pair.tok.ctf' },
  'val':   { 'file': 'valid.pair.tok.ctf' },
  'query': { 'file': 'vocab_Q.wl' },
  'answer':{ 'file': 'vocab_A.wl' }
}

import requests

def download(url, filename):
    """ utility function to download a file """
    response = requests.get(url, stream=True)
    with open(filename, "wb") as handle:
        for data in response.iter_content():
            handle.write(data)

if not os.path.exists(location):
    os.mkdir(location)

for item in data.values():
    path = os.path.normpath(os.path.join(location, item['file']))

    if os.path.exists(path):
        print("Reusing locally cached:", path)

    else:
        print("Starting download:", item['file'])
        url = "http://www.cntk.ai/jup/dat/DSSM/%s.csv"%(item['file'])
        print(url)
        download(url, path)
        print("Download completed")
    item['file'] = path
Starting download: train.pair.tok.ctf
http://www.cntk.ai/jup/dat/DSSM/train.pair.tok.ctf.csv
Download completed
Starting download: valid.pair.tok.ctf
http://www.cntk.ai/jup/dat/DSSM/valid.pair.tok.ctf.csv
Download completed
Starting download: vocab_Q.wl
http://www.cntk.ai/jup/dat/DSSM/vocab_Q.wl.csv
Download completed
Starting download: vocab_A.wl
http://www.cntk.ai/jup/dat/DSSM/vocab_A.wl.csv
Download completed
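If you are curious what the CTF format looks like on disk, you can peek at the first couple of lines of the downloaded training file (a minimal sketch that only assumes the download above succeeded). Each line carries sparse one-hot word indices for the query stream S0 and the answer stream S1:

# Show the first two lines of the CTF-formatted training file (illustrative only)
with open(data['train']['file']) as f:
    for _ in range(2):
        print(f.readline().rstrip())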

Data reading

We use the CTF deserializer to read the data. You can of course choose another approach and preprocess the data yourself. The CTF reader used here can also shuffle the order of the samples.

# Define the vocabulary sizes (QRY stands for query/question and ANS stands for answer)
QRY_SIZE = 1204
ANS_SIZE = 1019

def create_reader(path, is_training):
    return C.io.MinibatchSource(C.io.CTFDeserializer(path, C.io.StreamDefs(
        query  = C.io.StreamDef(field='S0', shape=QRY_SIZE, is_sparse=True),
        answer = C.io.StreamDef(field='S1', shape=ANS_SIZE, is_sparse=True)
    )), randomize=is_training, max_sweeps = C.io.INFINITELY_REPEAT if is_training else 1)
train_file = data['train']['file']
print(train_file)

if os.path.exists(train_file):
    train_source = create_reader(train_file, is_training=True)
else:
    raise ValueError("Cannot locate file {0} in current directory {1}".format(train_file, os.getcwd()))

validation_file = data['val']['file']
print(validation_file)
if os.path.exists(validation_file):
    val_source = create_reader(validation_file, is_training=False)
else:
    raise ValueError("Cannot locate file {0} in current directory {1}".format(validation_file, os.getcwd()))
data\DSSM\train.pair.tok.ctf
data\DSSM\valid.pair.tok.ctf

Model creation

An LSTM-RNN reads the words of a sentence one by one, extracts information from them, and embeds the sentence into a vector.
In the DSSM model we take the last hidden state of the sentence as the vector representation of the whole sentence.
Passing this vector through two feed-forward (dense) layers then yields the query vector.

                                                    "query vector"
                                                          ^
                                                          |
                                                      +-------+  
                                                      | Dense |  
                                                      +-------+  
                                                          ^         
                                                          |         
                                                     +---------+  
                                                     | Dropout |  
                                                     +---------+
                                                          ^
                                                          |         
                                                      +-------+  
                                                      | Dense |  
                                                      +-------+  
                                                          ^         
                                                          |         
                                                      +------+   
                                                      | last |  
                                                      +------+  
                                                          ^  
                                                          |         
          +------+   +------+   +------+   +------+   +------+   
     0 -->| LSTM |-->| LSTM |-->| LSTM |-->| LSTM |-->| LSTM |
          +------+   +------+   +------+   +------+   +------+   
              ^          ^          ^          ^          ^
              |          |          |          |          |
          +-------+  +-------+  +-------+  +-------+  +-------+
          | Embed |  | Embed |  | Embed |  | Embed |  | Embed | 
          +-------+  +-------+  +-------+  +-------+  +-------+
              ^          ^          ^          ^          ^
              |          |          |          |          |
query  ------>+--------->+--------->+--------->+--------->+

Similarly, we encode the answer sentence into an answer vector. We first define the inputs of the model, namely the query sequence and the answer sequence.

# Create the containers for input feature (x) and the label (y)
qry = C.sequence.input_variable(QRY_SIZE)
ans = C.sequence.input_variable(ANS_SIZE)

Every CNTK sequence carries a dynamic axis that represents the length of the sequence.
Intuitively, when your sequences have different lengths and different vocabulary sizes, each of them should have its own dynamic axis.
In that case you need to declare named axes.

# Create the containers for input feature (x) and the label (y)
axis_qry = C.Axis.new_unique_dynamic_axis('axis_qry')
qry = C.sequence.input_variable(QRY_SIZE, sequence_axis=axis_qry)

axis_ans = C.Axis.new_unique_dynamic_axis('axis_ans')
ans = C.sequence.input_variable(ANS_SIZE, sequence_axis=axis_ans)

Before creating the model, we first define some model parameters.

EMB_DIM   = 25 # Embedding dimension
HIDDEN_DIM = 50 # LSTM dimension
DSSM_DIM = 25 # Dense layer dimension
NEGATIVE_SAMPLES = 5
DROPOUT_RATIO = 0.2
def create_model(qry, ans):
    with C.layers.default_options(initial_state=0.1):
        qry_vector = C.layers.Sequential([
            C.layers.Embedding(EMB_DIM, name='embed'),
            C.layers.Recurrence(C.layers.LSTM(HIDDEN_DIM), go_backwards=False),
            C.sequence.last,
            C.layers.Dense(DSSM_DIM, activation=C.relu, name='q_proj'),
            C.layers.Dropout(DROPOUT_RATIO, name='dropout qdo1'),
            C.layers.Dense(DSSM_DIM, activation=C.tanh, name='q_enc')
        ])

        ans_vector = C.layers.Sequential([
            C.layers.Embedding(EMB_DIM, name='embed'),
            C.layers.Recurrence(C.layers.LSTM(HIDDEN_DIM), go_backwards=False),
            C.sequence.last,
            C.layers.Dense(DSSM_DIM, activation=C.relu, name='a_proj'),
            C.layers.Dropout(DROPOUT_RATIO, name='dropout ado1'),
            C.layers.Dense(DSSM_DIM, activation=C.tanh, name='a_enc')
        ])

    return {
        'query_vector': qry_vector(qry),
        'answer_vector': ans_vector(ans)
    }

# Create the model and store reference in `network` dictionary
network = create_model(qry, ans)

network['query'], network['axis_qry'] = qry, axis_qry
network['answer'], network['axis_ans'] = ans, axis_ans

Training

Now that we have created the model, the next step is to choose a suitable loss function. The loss should be a small number close to 0 when a question is paired with its correct answer, and close to 1 when the question and answer do not match. In other words, the loss function should maximize the similarity between a question and its correct answer and minimize the similarity between the question and incorrect answers.

DSSM is often used in information-retrieval settings: given a search phrase or question, we need to find the correct answer within a huge amount of text. The input is a question together with a potential answer (a document or an ad) that may get clicked, and the goal is to increase the probability of a click, i.e. to make the retrieved document or ad relevant to the search query. One approach is to train a classifier that predicts whether a link will be clicked. To train such a model we need query-link pairs that were clicked as well as links that were not clicked. One way to simulate un-clicked links is to randomly sample links produced by the other queries in the current minibatch, and that is exactly what the cosine_distance_with_negative_samples function does. Note that this function returns a similarity of 1 for a correct question-answer pair and 0 for an incorrect one, so we use 1 - cosine_distance_with_negative_samples as the loss function.
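To make the negative-sampling trick concrete, here is a NumPy-only sketch (not CNTK code; the toy vectors are made up for illustration) of how shifting the answers inside a minibatch turns the answers of other queries into negative samples:

# Toy illustration of negative sampling by shifting within a minibatch (NumPy only)
import numpy as np

def cosine_sim(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.RandomState(0)
toy_queries = rng.randn(4, 25)                        # 4 query vectors in a toy minibatch
toy_answers = toy_queries + 0.1 * rng.randn(4, 25)    # their matching answers (correlated on purpose)

for i in range(len(toy_queries)):
    pos = cosine_sim(toy_queries[i], toy_answers[i])            # clicked (positive) pair
    neg = cosine_sim(toy_queries[i], toy_answers[(i + 1) % 4])  # shift=1: answer of another query
    print('row %d: positive %.3f vs shifted negative %.3f' % (i, pos, neg))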

def create_loss(vector_a, vector_b):
    qry_ans_similarity = C.cosine_distance_with_negative_samples(vector_a, \
                                                                 vector_b, \
                                                                 shift=1, \
                                                                 num_negative_samples=5)
    return 1 - qry_ans_similarity
# Model parameters
MAX_EPOCHS = 5
EPOCH_SIZE = 10000
MINIBATCH_SIZE = 50
# Create trainer
def create_trainer(reader, network):

    # Setup the progress updater
    progress_writer = C.logging.ProgressPrinter(tag='Training', num_epochs=MAX_EPOCHS)

    # Set learning parameters
    lr_per_sample = [0.0015625]*20 + \
                    [0.00046875]*20 + \
                    [0.00015625]*20 + \
                    [0.000046875]*10 + \
                    [0.000015625]
    lr_schedule = C.learning_parameter_schedule_per_sample(lr_per_sample, \
                                                           epoch_size=EPOCH_SIZE)
    mms = [0]*20 + [0.9200444146293233]*20 + [0.9591894571091382]
    mm_schedule = C.learners.momentum_schedule(mms, \
                                               epoch_size=EPOCH_SIZE, \
                                               minibatch_size=MINIBATCH_SIZE)
    l2_reg_weight = 0.0002

    model = C.combine(network['query_vector'], network['answer_vector'])

    # Notify the network that the two dynamic axes are indeed the same
    query_reconciled = C.reconcile_dynamic_axes(network['query_vector'], network['answer_vector'])

    network['loss'] = create_loss(query_reconciled, network['answer_vector'])
    network['error'] = None

    print('Using momentum sgd with no l2')
    dssm_learner = C.learners.momentum_sgd(model.parameters, lr_schedule, mm_schedule)

    network['learner'] = dssm_learner

    print('Using local learner')
    # Create trainer
    return C.Trainer(model, (network['loss'], network['error']), network['learner'], progress_writer)
# Instantiate the trainer
trainer = create_trainer(train_source, network)
Using momentum sgd with no l2
Using local learner
# Train
def do_train(network, trainer, train_source):
    # define mapping from input streams to network inputs
    input_map = {
        network['query']: train_source.streams.query,
        network['answer']: train_source.streams.answer
    }

    t = 0
    for epoch in range(MAX_EPOCHS):          # loop over epochs
        epoch_end = (epoch+1) * EPOCH_SIZE
        while t < epoch_end:                 # loop over minibatches on the epoch
            data = train_source.next_minibatch(MINIBATCH_SIZE, input_map=input_map)  # fetch minibatch
            trainer.train_minibatch(data)    # update model with it
            t += MINIBATCH_SIZE

        trainer.summarize_training_progress()
do_train(network, trainer, train_source)
Learning rate per 1 samples: 0.0015625
Momentum per 1 samples: 0.0
Finished Epoch[1 of 5]: [Training] loss = 0.343046 * 1522, metric = 0.00% * 1522 5.720s (266.1 samples/s);
Finished Epoch[2 of 5]: [Training] loss = 0.102804 * 1530, metric = 0.00% * 1530 3.464s (441.7 samples/s);
Finished Epoch[3 of 5]: [Training] loss = 0.066461 * 1525, metric = 0.00% * 1525 3.402s (448.3 samples/s);
Finished Epoch[4 of 5]: [Training] loss = 0.048511 * 1534, metric = 0.00% * 1534 3.390s (452.5 samples/s);
Finished Epoch[5 of 5]: [Training] loss = 0.035384 * 1510, metric = 0.00% * 1510 3.383s (446.3 samples/s);

Validate

After training, we need to pick a model whose training and validation error rates are close to each other.
We can find a better model by trying different numbers of epochs.
The model selected in this way is the one finally used for prediction.

# Validate
def do_validate(network, val_source):
    # process minibatches and perform evaluation
    progress_printer = C.logging.ProgressPrinter(tag='Evaluation', num_epochs=0)

    val_map = {
        network['query']: val_source.streams.query,
        network['answer']: val_source.streams.answer
    }

    evaluator = C.eval.Evaluator(network['loss'], progress_printer)

    while True:
        minibatch_size = 100
        data = val_source.next_minibatch(minibatch_size, input_map=val_map)
        if not data:                          # until we hit the end
            break

        evaluator.test_minibatch(data)

    evaluator.summarize_test_progress()
do_validate(network, val_source)
Finished Evaluation [1]: Minibatch[1-35]: metric = 0.02% * 410;

Prediction

We convert both the query and the answer into vectors and then compute the cosine similarity between them. These cosine similarity scores can be used to rank the retrieved web documents.

# load dictionaries
query_wl = [line.rstrip('\n') for line in open(data['query']['file'])]
answers_wl = [line.rstrip('\n') for line in open(data['answer']['file'])]
query_dict = {query_wl[i]:i for i in range(len(query_wl))}
answers_dict = {answers_wl[i]:i for i in range(len(answers_wl))}

# let's run a sequence through
qry = 'BOS what contribution did e1 made to science in 1665 EOS'
ans = 'BOS book author book_editions_published EOS'
ans_poor = 'BOS language human_language main_country EOS'

qry_idx = [query_dict[w+' '] for w in qry.split()] # convert to query word indices
print('Query Indices:', qry_idx)

ans_idx = [answers_dict[w+' '] for w in ans.split()] # convert to answer word indices
print('Answer Indices:', ans_idx)

ans_poor_idx = [answers_dict[w+' '] for w in ans_poor.split()] # convert to fake answer word indices
print('Poor Answer Indices:', ans_poor_idx)
Query Indices: [1202, 1154, 267, 321, 357, 648, 1070, 905, 549, 6, 1203]
Answer Indices: [1017, 135, 91, 137, 1018]
Poor Answer Indices: [1017, 501, 452, 533, 1018]
# Create the one hot representations
qry_onehot = np.zeros([len(qry_idx),len(query_dict)], np.float32)
for t in range(len(qry_idx)):
    qry_onehot[t,qry_idx[t]] = 1

ans_onehot = np.zeros([len(ans_idx),len(answers_dict)], np.float32)
for t in range(len(ans_idx)):
    ans_onehot[t,ans_idx[t]] = 1

ans_poor_onehot = np.zeros([len(ans_poor_idx),len(answers_dict)], np.float32)
for t in range(len(ans_poor_idx)):
    ans_poor_onehot[t, ans_poor_idx[t]] = 1
qry_embedding = network['query_vector'].eval([qry_onehot])
ans_embedding = network['answer_vector'].eval([ans_onehot])
ans_poor_embedding = network['answer_vector'].eval([ans_poor_onehot])

from scipy.spatial.distance import cosine

print('Query to Answer similarity:', 1-cosine(qry_embedding, ans_embedding))
print('Query to poor-answer similarity:', 1-cosine(qry_embedding, ans_poor_embedding))
Query to Answer similarity: 0.99995367043
Query to poor-answer similarity: 0.999941420215
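
Since the scores are plain cosine similarities, ranking several candidate answers for one query is just a sort. Below is a minimal sketch that reuses the embeddings computed above; the candidate labels are made up for illustration:

# Rank candidate answers for the query by cosine similarity (higher means more relevant)
candidates = {
    'good answer': ans_embedding,
    'poor answer': ans_poor_embedding,
}
ranked = sorted(candidates.items(),
                key=lambda kv: 1 - cosine(qry_embedding, kv[1]),
                reverse=True)
for rank, (name, emb) in enumerate(ranked, 1):
    print(rank, name, 1 - cosine(qry_embedding, emb))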