
NLP Series

Visual Question Answering (VQA): Principles and Implementation

Chapter Overview

2.1 Introduction to the visual question answering problem
2.2 VQA by extracting and matching image and text features
2.3 Attention-based deep learning approaches to VQA
2.4 [Hands-on] A basic CNN+RNN VQA model in Keras
2.5 [Hands-on] Implementing an attention-based deep learning VQA model

2.1 Introduction to the visual question answering problem

  • Visual question answering is defined as follows: given an image and a question about that image, the machine must answer the question based on the image content.
  • Input: an image and a question about the image; common question formats are multiple-choice and yes/no questions.
  • Output: the correct answer.
  • Question: how many players are in the image?
  • Answer: eleven
  • A human can easily pick out the players in the image without counting the spectators. We would like an AI system to understand the image in the same way, filter the relevant information according to the question, and then return the correct answer.

2.1 Introduction to the visual question answering problem

  • Visual question answering is inherently a multimodal research problem: answering requires combining techniques from natural language processing (NLP) and computer vision (CV).
  • Natural language processing (NLP)
    • First understand the question
    • Then produce the answer
    • Consider a typical text-based Q&A problem in NLP: how many bridges are there in Paris?
    • An NLP Q&A system first needs to identify the question type; here it is a "how many" counting question, so the answer should be a number. Next it needs to extract the object to count, here "bridges". Finally it needs to extract the context of the question, here that the counting is restricted to the city of Paris.
    • Once the question has been analyzed, the system looks up the answer in a knowledge base.
  • Computer vision (CV)
    • VQA differs from traditional text QA in that both the answer search and the reasoning are grounded in the image content. The system therefore needs to perform object detection, then classification, and then reason about the relationships between the objects in the image.
  • In summary, a good VQA system must be able to solve the classic basic tasks of both NLP and CV, which makes this an interdisciplinary, multimodal research problem.

2.1 Introduction to the visual question answering problem

  • Image dataset
    • Microsoft Common Objects in Context (MSCOCO) contains 328,000 images, 91 object categories, and 2,500,000 labeled instances; the objects are easy enough for a 4-year-old to recognize.
  • Common VQA datasets: a good dataset should minimize bias introduced during data collection. For example, if 90% of the yes/no questions in a dataset have the answer "yes", then a system that always outputs "yes" already reaches 90% accuracy.

    • DAtaset for QUestion Answering on Real-world images (DAQUAR) was the first significant VQA dataset. It contains 6,794 training samples and 5,674 test samples; all images come from the NYU-Depth V2 dataset, and each image has 9 question-answer (QA) pairs on average. Its main drawback is its size: it is too small to train a complex VQA system.

    • COCO-QA uses 123,287 images from MSCOCO, with 78,736 QA pairs for training and 38,948 for testing. The QA pairs were generated automatically from the MSCOCO image captions using NLP tools; for example, the caption "two chairs in a room" yields the question "how many chairs are there?". All answers are single words. Although the dataset is large enough, this way of generating QA pairs introduces grammatical errors and incomplete information. Moreover, it covers only four question types, and they are unevenly distributed: object (69.84%), color (16.59%), counting (7.47%), and location (6.10%).

    • The VQA dataset is considerably larger: besides 204,721 images from MSCOCO, it also includes 50,000 abstract cartoon images. Each image has 3 questions on average, and each question has 10 answers, for a total of more than 760,000 questions and about 10,000,000 answers. All QA pairs were annotated by humans on Amazon Mechanical Turk. The questions include both open-ended and multiple-choice formats. For open-ended questions, an answer counts as correct only if at least 3 annotators gave exactly the same answer. For multiple-choice questions, 18 candidate answers were created: the correct answer is the one the 10 annotators agreed on; plausible answers are three answers provided by annotators who saw only the question and not the image; popular answers are the 10 most frequent answers overall (yes, no, 1, 2, 3, 4, white, red, blue, green); and random answers are correct answers randomly sampled from other questions. One drawback of this dataset is that some questions are quite subjective; another is that some questions can be answered without looking at the image, e.g. "how many legs does the dog have?" or "what color are the trees?".

2.2 VQA by extracting and matching image and text features

  • A typical VQA system consists of the following three steps:
    1. Extract question features
    2. Extract image features
    3. Combine the image and question features to generate the answer

2.2 VQA by extracting and matching image and text features

  • Extracting question features
    • A question is usually encoded with Bag-of-Words (BOW) or an LSTM
  • Extracting image features
    • We usually use a CNN pre-trained on ImageNet
  • Answer generation is often simplified to a classification problem
  • Where the various methods differ most is in how the text and image features are combined. For example, we can concatenate the two feature vectors and feed them into a linear classifier, or use Bayesian methods to model the relationship between the feature distributions of the question, the image, and the answer. (A short sketch of common fusion operations follows below.)
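As an illustration only (not code from any particular paper), the snippet below sketches the three simple fusion operations mentioned above on a pair of feature vectors: concatenation, element-wise sum, and element-wise product.

import numpy as np

# Hypothetical pre-computed features: a 1024-d image vector and a 1024-d question vector.
# Element-wise fusion requires both vectors to have the same dimensionality; concatenation does not.
image_feat = np.random.randn(1024)
question_feat = np.random.randn(1024)

fused_concat = np.concatenate([image_feat, question_feat])   # shape (2048,)
fused_sum = image_feat + question_feat                       # shape (1024,)
fused_product = image_feat * question_feat                   # shape (1024,)

# Any of these fused vectors can then be passed to a linear/softmax classifier over candidate answers.
print(fused_concat.shape, fused_sum.shape, fused_product.shape)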

2.2 VQA by extracting and matching image and text features

  • Baselines: Antol et al. (2016), "VQA: Visual Question Answering", fuses the text and image features by simple concatenation or element-wise sum/product. The image feature is the 1024-dimensional output of the last layer of VGGNet; the text feature is obtained in one of two ways:
    1. Encode the question with BOW and predict the answer with a multi-layer perceptron (MLP). The MLP has two hidden layers with 1000 hidden units each, tanh non-linearities, and dropout of 0.5.
    2. Encode the question with an LSTM and predict the answer with a softmax.
  • The results of these baselines are interesting: a model that uses only the text features reaches 48.09% accuracy, a model that uses only the image features reaches 28.13%, and their best model, which encodes the question with an LSTM, reaches 53.74%. Multiple-choice results are clearly better than open-ended results, and all models remain far below human performance. (A minimal Keras sketch of the BOW + MLP baseline follows below.)
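The sketch below is a rough Keras reconstruction of the BOW + MLP baseline described above, not the authors' code; the BOW vocabulary size and the number of answer classes are illustrative assumptions, while the two 1000-unit tanh hidden layers with 0.5 dropout and the 1024-d image feature follow the description in the text.

from keras.layers import Input, Dense, Dropout, Activation, concatenate
from keras.models import Model

bow_vocab_size = 5000      # assumption: size of the question BOW vocabulary
image_feature_size = 1024  # per the text: last-layer VGGNet feature
num_answers = 1000         # assumption: top answers treated as output classes

question_bow = Input(shape=(bow_vocab_size,))     # BOW-encoded question
image_feat = Input(shape=(image_feature_size,))   # pre-computed CNN image feature

x = concatenate([question_bow, image_feat])       # simple feature concatenation
for _ in range(2):                                # two hidden layers: 1000 units, tanh, dropout 0.5
    x = Dense(1000)(x)
    x = Activation('tanh')(x)
    x = Dropout(0.5)(x)
answer = Dense(num_answers, activation='softmax')(x)

bow_mlp_baseline = Model(inputs=[question_bow, image_feat], outputs=answer)
bow_mlp_baseline.compile(optimizer='adam', loss='categorical_crossentropy')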

2.3 Attention-based deep learning approaches to VQA

  • Attention-based deep learning VQA methods answer the question by focusing on the relevant parts of the image. For the question "what color is the ball?", the small region containing the ball is more informative and more relevant than the rest of the image; similarly, the words "color" and "ball" are more relevant than the other words in the question.
  • Another common VQA approach uses spatial attention to produce region-level features and trains a CNN on them. There are generally two ways to obtain spatial regions of an image:
    1. Divide the image into a grid and predict an attention weight for each grid cell from the question and image features; the attention-weighted sum of the CNN feature map then highlights the more important regions.
    2. Generate many bounding boxes with an object detector.
  • Given the generated regions, the question is used to find the most relevant ones, and these regions are used to generate the answer. (A small sketch of grid-based attention pooling follows below.)
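As a toy illustration of the grid-based variant (made-up sizes, not code from a specific paper): an attention score is produced for every grid cell, normalized with a softmax, and used to pool the CNN feature map into a single attention-weighted feature.

import numpy as np

grid_h, grid_w, feat_dim = 14, 14, 512                     # assumed grid and feature sizes
feature_map = np.random.randn(grid_h * grid_w, feat_dim)   # one CNN feature vector per grid cell

# In a real model these scores come from a small network over the question and image
# features; here they are random placeholders.
attention_scores = np.random.randn(grid_h * grid_w)

# Softmax over the grid cells so the weights sum to 1.
weights = np.exp(attention_scores - attention_scores.max())
weights /= weights.sum()

# Attention-weighted sum of the grid features: a single feat_dim-dimensional vector.
attention_weighted_feature = (weights[:, None] * feature_map).sum(axis=0)
print(attention_weighted_feature.shape)  # (512,)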

2.3 Attention-based deep learning approaches to VQA

  • Yang et al. (2016), Stacked Attention Networks for Image Question Answering, proposes a VQA system based on stacked attention
  • The image is encoded with a CNN
    $$f_I = CNN_{vgg}(I)$$
  • The question is encoded with an LSTM (or, alternatively, a CNN over the word embeddings)
    $$v_Q = LSTM(q) \quad \text{or} \quad v_Q = CNN(q)$$

  • Stacked attention: the question-image attention step is repeated multiple times (one hop is sketched below)
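As a sketch of a single attention hop (notation adapted from the stacked attention formulation; treat this as a paraphrase rather than a verbatim quote of the paper): the region features $v_i$, collected in $v_I$, are scored against the question vector $v_Q$, normalized into attention weights $p_I$, and pooled into a refined query $u$ that can be fed into the next attention layer, which is what "stacking" refers to.
    $$h_A = \tanh\big(W_{I,A} v_I \oplus (W_{Q,A} v_Q + b_A)\big)$$
    $$p_I = softmax(W_P h_A + b_P)$$
    $$\tilde{v}_I = \sum_i p_i v_i, \quad u = \tilde{v}_I + v_Q$$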

2.3 Attention-based deep learning approaches to VQA

  • Kazemi et al. (2017), Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering, proposes an attention-based VQA system
  • The image is encoded with a CNN
    $$\phi = CNN(I)$$
  • The question is encoded with an LSTM
    $$s = LSTM(E_q)$$
  • Stacked attention (the weighted-sum step is illustrated numerically after this list)
    $$\alpha_{c,l} \propto \exp F_c(s, \phi_l), \quad \sum_{l=1}^L \alpha_{c,l} = 1, \quad x_c = \sum_l \alpha_{c,l}\phi_l$$
  • Classifier, where $G = [G_1, G_2, \ldots, G_M]$ consists of two fully connected layers
    $$P(a_i \mid I, q) \propto \exp G_i(x, s), \quad x = [x_1, x_2, \ldots, x_C]$$
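A tiny numerical sketch of the attention-weighted sum above; the region count, feature size, and number of glimpses are illustrative assumptions, and the random scores merely stand in for $F_c(s, \phi_l)$:

import numpy as np

L, D, C = 196, 2048, 2           # assumed: 14x14 = 196 regions, 2048-d features, 2 glimpses
phi = np.random.randn(L, D)      # region features phi_l
scores = np.random.randn(C, L)   # placeholder for F_c(s, phi_l)

# alpha_{c,l} proportional to exp F_c(s, phi_l), normalized over the regions l
alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)

# x_c = sum_l alpha_{c,l} phi_l, one pooled feature per glimpse c
x = alpha @ phi                  # shape (C, D)
x_flat = x.reshape(-1)           # x = [x_1, ..., x_C], the input to the classifier G
print(alpha.sum(axis=1), x_flat.shape)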

2.4 [Hands-on] A basic CNN+RNN VQA model in Keras

  • Keras VQA Demo https://github.com/iamaaditya/VQA_Demo

    1. Keras version 2.0+
    2. Tensorflow 1.2+
    3. scikit-learn
    4. Spacy version 2.0+, used to download the GloVe word embeddings

      python -m spacy download en_vectors_web_lg
    5. OpenCV, used to resize images to 224x224

    6. VGG 16 pre-trained weights
python demo.py -image_file_name test.jpg -question "Is there a man in the picture?"

%%bash
git clone https://github.com/iamaaditya/VQA_Demo
cd VQA_Demo
Cloning into 'VQA_Demo'...
def VQA_MODEL():
    image_feature_size = 4096
    word_feature_size = 300
    number_of_LSTM = 3
    number_of_hidden_units_LSTM = 512
    max_length_questions = 30
    number_of_dense_layers = 3
    number_of_hidden_units = 1024
    activation_function = 'tanh'
    dropout_pct = 0.5

    # Image model
    model_image = Sequential()
    model_image.add(Reshape((image_feature_size,), input_shape=(image_feature_size,)))

    # Language Model
    model_language = Sequential()
    model_language.add(LSTM(number_of_hidden_units_LSTM, return_sequences=True, input_shape=(max_length_questions, word_feature_size)))
    model_language.add(LSTM(number_of_hidden_units_LSTM, return_sequences=True))
    model_language.add(LSTM(number_of_hidden_units_LSTM, return_sequences=False))

    # combined model
    model = Sequential()
    model.add(Merge([model_language, model_image], mode='concat', concat_axis=1))

    for _ in xrange(number_of_dense_layers):
        model.add(Dense(number_of_hidden_units, kernel_initializer='uniform'))
        model.add(Activation(activation_function))
        model.add(Dropout(dropout_pct))

    model.add(Dense(1000))
    model.add(Activation('softmax'))

    return model

# load libraries
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
import os, argparse
import cv2, spacy, numpy as np
from keras.models import model_from_json
from keras.optimizers import SGD
from sklearn.externals import joblib
from keras import backend as K
from keras.utils.vis_utils import plot_model
K.set_image_data_format('channels_first')
#K.set_image_dim_ordering('th')
Using TensorFlow backend.
# load the model weights
# the VGG weights need to be downloaded separately
VQA_model_file_name = 'models/VQA/VQA_MODEL.json'
VQA_weights_file_name = 'models/VQA/VQA_MODEL_WEIGHTS.hdf5'
label_encoder_file_name = 'models/VQA/FULL_labelencoder_trainval.pkl'
CNN_weights_file_name = 'models/CNN/vgg16_weights.h5'
# build and compile the image model
def get_image_model(CNN_weights_file_name):
    ''' Takes the CNN weights file, and returns the VGG model updated
    with the weights. Requires the file VGG.py inside models/CNN '''
    from models.CNN.VGG import VGG_16
    image_model = VGG_16(CNN_weights_file_name)
    image_model.layers.pop()
    image_model.layers.pop()
    # this is standard VGG 16 without the last two layers
    sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
    # one may experiment with the "adam" optimizer, but the loss function for
    # this kind of task is pretty standard
    image_model.compile(optimizer=sgd, loss='categorical_crossentropy')
    return image_model
# extract image features
def get_image_features(image_file_name):
    ''' Runs the given image file through the VGG 16 model and returns the
    activations of the last kept layer as a 1 x 4096 dimension vector '''
    image_features = np.zeros((1, 4096))
    # Magic_Number = 4096 > Comes from last layer of VGG Model

    # Since VGG was trained on 224x224 images, every new image
    # is required to go through the same transformation
    im = cv2.resize(cv2.imread(image_file_name), (224, 224))
    im = im.transpose((2, 0, 1))  # move channels first: (H, W, C) -> (C, H, W)

    # this axis dimension is required because VGG was trained on a dimension
    # of 1, 3, 224, 224 (the first axis is the batch size);
    # even though we are using only one image, we have to keep the dimensions consistent
    im = np.expand_dims(im, axis=0)

    image_features[0, :] = image_model.predict(im)[0]
    return image_features
# extract question features
def get_question_features(question):
    ''' For a given question, a unicode string, returns the time series vector
    with each word (token) transformed into a 300 dimension representation
    calculated using the GloVe vectors '''
    word_embeddings = spacy.load('en_vectors_web_lg')
    tokens = word_embeddings(question)
    question_tensor = np.zeros((1, 30, 300))
    for j in xrange(len(tokens)):
        question_tensor[0, j, :] = tokens[j].vector
    return question_tensor
# build the VQA model
def get_VQA_model(VQA_model_file_name, VQA_weights_file_name):
    ''' Given the VQA model and its weights, compiles and returns the model '''

    # thanks to the Keras function for loading a model from JSON, this becomes
    # very easy to understand and work with. The alternative would be to load the
    # model from a binary like cPickle, but then the model would be obfuscated to users
    vqa_model = model_from_json(open(VQA_model_file_name).read())
    vqa_model.load_weights(VQA_weights_file_name)
    vqa_model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
    return vqa_model
image_model = get_image_model(CNN_weights_file_name)
plot_model(image_model, to_file='model_vgg.png')
# test with one image and one question
image_file_name = 'test.jpg'
question = u"What vehicle is in the picture?"
# get the image features
image_features = get_image_features(image_file_name)
# get the question features
question_features = get_question_features(question)

# build the VQA model (needed before calling predict below)
model_vqa = get_VQA_model(VQA_model_file_name, VQA_weights_file_name)
y_output = model_vqa.predict([question_features, image_features])

# The task is represented here as a classification into the 1000 top answers;
# this means some of the answers were not part of training and thus would
# not show up in the result.
# These 1000 answers are stored in the sklearn label encoder
warnings.filterwarnings("ignore", category=DeprecationWarning)
labelencoder = joblib.load(label_encoder_file_name)
for label in reversed(np.argsort(y_output)[0, -5:]):
    print(str(round(y_output[0, label] * 100, 2)).zfill(5), "% ", labelencoder.inverse_transform(label))

2.5 [Hands-on] Implementing an attention-based deep learning VQA model

%%bash
# clone the GitHub repo
git clone https://github.com/Cyanogenoid/pytorch-vqa --recursive
Submodule path 'resnet': checked out '9332392b01317d57e92f81e00933c48f423ff503'


Cloning into 'pytorch-vqa'...
Submodule 'resnet' (https://github.com/Cyanogenoid/pytorch-resnet) registered for path 'resnet'
Cloning into '/Users/jjhu/MT/slides/MT-course/vqa/pytorch-vqa/resnet'...
%%bash
# preprocess the images and build the vocabulary
python preprocess-images.py
python preprocess-vocab.py
%%bash
# start training the model
python train.py

Training code

# main training function
def main():
    if len(sys.argv) > 1:
        name = ' '.join(sys.argv[1:])
    else:
        from datetime import datetime
        name = datetime.now().strftime("%Y-%m-%d_%H:%M:%S")
    target_name = os.path.join('logs', '{}.pth'.format(name))
    print('will save to {}'.format(target_name))

    cudnn.benchmark = True

    # load the training and validation data
    train_loader = data.get_loader(train=True)
    val_loader = data.get_loader(val=True)

    # build the VQA model and the optimizer
    net = nn.DataParallel(model.Net(train_loader.dataset.num_tokens)).cuda()
    optimizer = optim.Adam([p for p in net.parameters() if p.requires_grad])

    tracker = utils.Tracker()
    config_as_dict = {k: v for k, v in vars(config).items() if not k.startswith('__')}

    for i in range(config.epochs):
        _ = run(net, train_loader, optimizer, tracker, train=True, prefix='train', epoch=i)
        r = run(net, val_loader, optimizer, tracker, train=False, prefix='val', epoch=i)

        results = {
            'name': name,
            'tracker': tracker.to_dict(),
            'config': config_as_dict,
            'weights': net.state_dict(),
            'eval': {
                'answers': r[0],
                'accuracies': r[1],
                'idx': r[2],
            },
            'vocab': train_loader.dataset.vocab,
        }
        torch.save(results, target_name)

def run(net, loader, optimizer, tracker, train=False, prefix='', epoch=0):
    """ Run an epoch over the given loader """
    if train:
        net.train()
        tracker_class, tracker_params = tracker.MovingMeanMonitor, {'momentum': 0.99}
    else:
        net.eval()
        tracker_class, tracker_params = tracker.MeanMonitor, {}
        answ = []
        idxs = []
        accs = []

    tq = tqdm(loader, desc='{} E{:03d}'.format(prefix, epoch), ncols=0)
    loss_tracker = tracker.track('{}_loss'.format(prefix), tracker_class(**tracker_params))
    acc_tracker = tracker.track('{}_acc'.format(prefix), tracker_class(**tracker_params))

    log_softmax = nn.LogSoftmax().cuda()
    for v, q, a, idx, q_len in tq:
        var_params = {
            'volatile': not train,
            'requires_grad': False,
        }
        v = Variable(v.cuda(async=True), **var_params)
        q = Variable(q.cuda(async=True), **var_params)
        a = Variable(a.cuda(async=True), **var_params)
        q_len = Variable(q_len.cuda(async=True), **var_params)

        out = net(v, q, q_len)
        nll = -log_softmax(out)
        loss = (nll * a / 10).sum(dim=1).mean()
        acc = utils.batch_accuracy(out.data, a.data).cpu()

        if train:
            global total_iterations
            update_learning_rate(optimizer, total_iterations)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_iterations += 1
        else:
            # store information about evaluation of this minibatch
            _, answer = out.data.cpu().max(dim=1)
            answ.append(answer.view(-1))
            accs.append(acc.view(-1))
            idxs.append(idx.view(-1).clone())

        loss_tracker.append(loss.data[0])
        # acc_tracker.append(acc.mean())
        for a in acc:
            acc_tracker.append(a.item())
        fmt = '{:.4f}'.format
        tq.set_postfix(loss=fmt(loss_tracker.mean.value), acc=fmt(acc_tracker.mean.value))

    if not train:
        answ = list(torch.cat(answ, dim=0))
        accs = list(torch.cat(accs, dim=0))
        idxs = list(torch.cat(idxs, dim=0))
        return answ, accs, idxs

Walkthrough of the attention VQA model code


class Net(nn.Module):
    """ Re-implementation of ``Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering'' [0]
    [0]: https://arxiv.org/abs/1704.03162
    """

    def __init__(self, embedding_tokens):
        super(Net, self).__init__()
        question_features = 1024
        vision_features = config.output_features
        glimpses = 2

        self.text = TextProcessor(
            embedding_tokens=embedding_tokens,
            embedding_features=300,
            lstm_features=question_features,
            drop=0.5,
        )
        self.attention = Attention(
            v_features=vision_features,
            q_features=question_features,
            mid_features=512,
            glimpses=2,
            drop=0.5,
        )
        self.classifier = Classifier(
            in_features=glimpses * vision_features + question_features,
            mid_features=1024,
            out_features=config.max_answers,
            drop=0.5,
        )

        for m in self.modules():
            if isinstance(m, nn.Linear) or isinstance(m, nn.Conv2d):
                init.xavier_uniform(m.weight)
                if m.bias is not None:
                    m.bias.data.zero_()

    def forward(self, v, q, q_len):
        q = self.text(q, list(q_len.data))

        v = v / (v.norm(p=2, dim=1, keepdim=True).expand_as(v) + 1e-8)
        a = self.attention(v, q)
        v = apply_attention(v, a)

        combined = torch.cat([v, q], dim=1)
        answer = self.classifier(combined)
        return answer

Classifier

class Classifier(nn.Sequential):
    def __init__(self, in_features, mid_features, out_features, drop=0.0):
        super(Classifier, self).__init__()
        self.add_module('drop1', nn.Dropout(drop))
        self.add_module('lin1', nn.Linear(in_features, mid_features))
        self.add_module('relu', nn.ReLU())
        self.add_module('drop2', nn.Dropout(drop))
        self.add_module('lin2', nn.Linear(mid_features, out_features))
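A quick, illustrative shape check for this classifier; the sizes below are assumptions that mirror Net above (2 glimpses of a hypothetical 2048-d visual feature plus a 1024-d question vector, and 3000 candidate answers) rather than values read from the repo's config:

import torch
from torch.autograd import Variable  # old-style PyTorch, matching the code above

clf = Classifier(in_features=2 * 2048 + 1024, mid_features=1024, out_features=3000, drop=0.5)
dummy = Variable(torch.randn(4, 2 * 2048 + 1024))  # a batch of 4 fused feature vectors
print(clf(dummy).size())  # torch.Size([4, 3000]): one score per candidate answer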

Attention layer

class Attention(nn.Module):
    def __init__(self, v_features, q_features, mid_features, glimpses, drop=0.0):
        super(Attention, self).__init__()
        self.v_conv = nn.Conv2d(v_features, mid_features, 1, bias=False)  # let self.q_lin take care of bias
        self.q_lin = nn.Linear(q_features, mid_features)
        self.x_conv = nn.Conv2d(mid_features, glimpses, 1)

        self.drop = nn.Dropout(drop)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, v, q):
        v = self.v_conv(self.drop(v))
        q = self.q_lin(self.drop(q))
        q = tile_2d_over_nd(q, v)
        x = self.relu(v + q)
        x = self.x_conv(self.drop(x))
        return x


def apply_attention(input, attention):
    """ Apply any number of attention maps over the input.
    The attention map has to have the same size in all dimensions except dim=1.
    """
    n, c = input.size()[:2]
    glimpses = attention.size(1)

    # flatten the spatial dims into the third dim, since we don't need to care about how they are arranged
    input = input.view(n, c, -1)
    attention = attention.view(n, glimpses, -1)
    s = input.size(2)

    # apply a softmax to each attention map separately
    # since softmax only takes 2d inputs, we have to collapse the first two dimensions together
    # so that each glimpse is normalized separately
    attention = attention.view(n * glimpses, -1)
    attention = F.softmax(attention)

    # apply the weighting by creating a new dim to tile both tensors over
    target_size = [n, glimpses, c, s]
    input = input.view(n, 1, c, s).expand(*target_size)
    attention = attention.view(n, glimpses, 1, s).expand(*target_size)
    weighted = input * attention
    # sum over only the spatial dimension
    weighted_mean = weighted.sum(dim=3)
    # the shape at this point is (n, glimpses, c, 1)
    return weighted_mean.view(n, -1)
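To make the tensor shapes concrete, here is a small illustrative check of apply_attention; the batch, channel, and grid sizes are assumptions, and it reuses the functions defined above together with old-style PyTorch Variables:

import torch
from torch.autograd import Variable

n, c, h, w, glimpses = 2, 2048, 14, 14, 2       # assumed batch, channels, grid, glimpses
feat = Variable(torch.randn(n, c, h, w))        # CNN feature map
att = Variable(torch.randn(n, glimpses, h, w))  # unnormalized attention maps from Attention.forward

pooled = apply_attention(feat, att)
print(pooled.size())  # torch.Size([2, 4096]): glimpses * channels per example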

Chapter Summary

2.1 Introduction to the visual question answering problem
2.2 VQA by extracting and matching image and text features
2.3 Attention-based deep learning approaches to VQA
2.4 [Hands-on] A basic CNN+RNN VQA model in Keras
2.5 [Hands-on] Implementing an attention-based deep learning VQA model