Moses Statistical Machine Translation System in Practice

%%bash

# Install Moses:
# http://www.statmt.org/moses/?n=Development.GetStarted

# Download the dataset
corpus="$PWD/corpus"
mkdir -p $corpus
cd $corpus
wget http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz
tar zxvf training-parallel-nc-v8.tgz
training/news-commentary-v8.cs-en.cs
training/news-commentary-v8.cs-en.en
training/news-commentary-v8.de-en.de
training/news-commentary-v8.de-en.en
training/news-commentary-v8.es-en.en
training/news-commentary-v8.es-en.es
training/news-commentary-v8.fr-en.en
training/news-commentary-v8.fr-en.fr
training/news-commentary-v8.ru-en.en
training/news-commentary-v8.ru-en.ru
%%bash
corpus="$PWD/corpus"
head -n 5 $corpus/training/news-commentary-v8.fr-en.en
SAN FRANCISCO – It has never been easy to have a rational conversation about the value of gold.
Lately, with gold prices up more than 300% over the last decade, it is harder than ever.
Just last December, fellow economists Martin Feldstein and Nouriel Roubini each penned op-eds bravely questioning bullish market sentiment, sensibly pointing out gold’s risks.
Wouldn’t you know it?
Since their articles appeared, the price of gold has moved up still further. Gold prices even hit a record-high $1,300 recently.
%%bash
corpus="$PWD/corpus"
head -n 5 $corpus/training/news-commentary-v8.fr-en.fr
SAN FRANCISCO – Il n’a jamais été facile d’avoir une discussion rationnelle sur la valeur du métal jaune.
Et aujourd’hui, alors que le cours de l’or a augmenté de 300 pour cent au cours de la dernière décennie, c’est plus difficile que jamais.
En décembre dernier, mes collègues économistes Martin Feldstein et Nouriel Roubini ont chacun publié une tribune libre dans laquelle ils doutaient courageusement du marché haussier, soulignant de manière sensée les risques liés à l’or.
Mais devinez ce qui s’est passé ?
Depuis la parution de leurs articles, le cours de l’or a encore grimpé, pour atteindre récemment un plus haut historique de 1300 dollars l’once.
%%bash
corpus="$PWD/corpus"
wc -l $corpus/training/news-commentary-v8.fr-en.{fr,en}
157168 /Users/jjhu/remoteshare/MT/corpus/training/news-commentary-v8.fr-en.fr
157168 /Users/jjhu/remoteshare/MT/corpus/training/news-commentary-v8.fr-en.en
314336 total
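
Both files report 157168 lines, which is what a sentence-aligned corpus requires: line i of the French file must be the translation of line i of the English file. A minimal sketch of an automated check follows; the helper name `same_line_count` is hypothetical and not part of Moses.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Moses): succeed only if the two files
# have the same number of lines, as a sentence-aligned corpus must.
same_line_count() {
    local a b
    a=$(wc -l < "$1") || return 2   # exit status 2: first file unreadable
    b=$(wc -l < "$2") || return 2   # exit status 2: second file unreadable
    (( a == b ))                    # exit status 0 if equal, 1 otherwise
}

# Example usage with the files from this walkthrough (paths assumed):
# same_line_count "$corpus/training/news-commentary-v8.fr-en.fr" \
#                 "$corpus/training/news-commentary-v8.fr-en.en" \
#     && echo "aligned" || echo "line counts differ"
```

Equal line counts are necessary but not sufficient: they cannot detect a pair of lines that is present but misaligned.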
%%bash
# Tokenization: split each sentence into tokens (words and punctuation)
corpus="$PWD/corpus"
mosesdecoder="$PWD/mosesdecoder"
$mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
< $corpus/training/news-commentary-v8.fr-en.en \
> $corpus/news-commentary-v8.fr-en.tok.en

$mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr \
< $corpus/training/news-commentary-v8.fr-en.fr \
> $corpus/news-commentary-v8.fr-en.tok.fr
Tokenizer Version 1.1
Language: en
Number of threads: 1
Tokenizer Version 1.1
Language: fr
Number of threads: 1
%%bash 

# Train the truecaser: learn each word's most frequent casing from the corpus
corpus="$PWD/corpus"
mosesdecoder="$PWD/mosesdecoder"
$mosesdecoder/scripts/recaser/train-truecaser.perl \
--model $corpus/truecase-model.en --corpus \
$corpus/news-commentary-v8.fr-en.tok.en
$mosesdecoder/scripts/recaser/train-truecaser.perl \
--model $corpus/truecase-model.fr --corpus \
$corpus/news-commentary-v8.fr-en.tok.fr
%%bash 

# Truecase: restore each sentence-initial word to its most frequent casing (usually lowercase)
corpus="$PWD/corpus"
mosesdecoder="$PWD/mosesdecoder"
$mosesdecoder/scripts/recaser/truecase.perl \
--model $corpus/truecase-model.en \
< $corpus/news-commentary-v8.fr-en.tok.en \
> $corpus/news-commentary-v8.fr-en.true.en
$mosesdecoder/scripts/recaser/truecase.perl \
--model $corpus/truecase-model.fr \
< $corpus/news-commentary-v8.fr-en.tok.fr \
> $corpus/news-commentary-v8.fr-en.true.fr
%%bash 

# Remove sentence pairs in which either side is empty or longer than 80 words
corpus="$PWD/corpus"
mosesdecoder="$PWD/mosesdecoder"
$mosesdecoder/scripts/training/clean-corpus-n.perl \
$corpus/news-commentary-v8.fr-en.true fr en \
$corpus/news-commentary-v8.fr-en.clean 1 80
clean-corpus.perl: processing /home/jjhu/MT/corpus/news-commentary-v8.fr-en.true.fr & .en to /home/jjhu/MT/corpus/news-commentary-v8.fr-en.clean, cutoff 1-80, ratio 9
..........(100000).....
Input sentences: 157168  Output sentences:  155362
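
The log line shows the thresholds in effect: a length cutoff of 1-80 tokens and a maximum length ratio of 9 between the two sides. In total, 157168 - 155362 = 1806 sentence pairs (about 1.1%) were dropped. The rule can be illustrated roughly as below, assuming whitespace-separated tokens. The function `keep_pair` is hypothetical and only approximates what `clean-corpus-n.perl` does; it is not the script's actual logic.

```bash
#!/usr/bin/env bash
# Hypothetical illustration of the cleaning rule (not the Moses script):
# keep a sentence pair only if both sides have between min and max tokens
# and neither side is more than `ratio` times longer than the other.
keep_pair() {
    local min=$3 max=$4 ratio=$5
    local -a src tgt
    read -r -a src <<< "$1"          # split the source side on whitespace
    read -r -a tgt <<< "$2"          # split the target side on whitespace
    local s=${#src[@]} t=${#tgt[@]}  # token counts of each side
    (( s >= min && s <= max && t >= min && t <= max )) || return 1
    (( s <= t * ratio && t <= s * ratio ))
}

# Example usage with the thresholds from the log above:
# keep_pair "le chat dort" "the cat sleeps" 1 80 9 && echo keep || echo drop
```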
%%bash

corpus="$PWD/corpus"
mosesdecoder="$PWD/mosesdecoder"
# Train the language model
lm="$PWD/lm"
mkdir -p "$lm"
cd $lm
$mosesdecoder/bin/lmplz -o 3 < $corpus/news-commentary-v8.fr-en.true.en > $lm/news-commentary-v8.fr-en.arpa.en
$mosesdecoder/bin/build_binary $lm/news-commentary-v8.fr-en.arpa.en $lm/news-commentary-v8.fr-en.blm.en
=== 1/5 Counting and sorting n-grams ===
Reading /home/jjhu/MT/corpus/news-commentary-v8.fr-en.true.en
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 4066728 types 62719
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:752628 2:9359020032 3:17548163072
Statistics:
1 62719 D1=0.622525 D2=0.981811 D3+=1.36293
2 906706 D1=0.742501 D2=1.07915 D3+=1.38352
3 2389081 D1=0.835943 D2=1.1625 D3+=1.34981
Memory estimate for binary LM:
type    MB
probing 63 assuming -p 1.5
probing 68 assuming -r models -p 1.5
trie    25 without quantization
trie    14 assuming -q 8 -b 8 quantization 
trie    24 assuming -a 22 array pointer compression
trie    12 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:752628 2:14507296 3:47781620
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:752628 2:14507296 3:47781620
=== 5/5 Writing ARPA model ===
Name:lmplz    VmPeak:26452684 kB    VmRSS:20124 kB    RSSMax:6136000 kB    user:4.572    sys:0.976    CPU:5.548    real:4.68449
Reading /home/jjhu/MT/lm/news-commentary-v8.fr-en.arpa.en
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS
%%bash 

corpus="$PWD/corpus"
mosesdecoder="$PWD/mosesdecoder"
working="$PWD/working"
lm="$PWD/lm"
# Train the translation model
mkdir -p "$working"
cd $working
nohup nice $mosesdecoder/scripts/training/train-model.perl -root-dir train \
-corpus $corpus/news-commentary-v8.fr-en.clean \
-f fr -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
-lm 0:3:$lm/news-commentary-v8.fr-en.blm.en:8 \
-external-bin-dir $mosesdecoder/tools >& training.out &
%%bash 
corpus="$PWD/corpus"
mosesdecoder="$PWD/mosesdecoder"
working="$PWD/working"
lm="$PWD/lm"

# Tune the model parameters on the dev data
cd $corpus
wget http://www.statmt.org/wmt12/dev.tgz
tar zxvf dev.tgz

# Preprocess the dev data: tokenization + truecasing
$mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
< $corpus/dev/news-test2008.en > $corpus/news-test2008.tok.en
$mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr \
< $corpus/dev/news-test2008.fr > $corpus/news-test2008.tok.fr
$mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en \
< $corpus/news-test2008.tok.en > $corpus/news-test2008.true.en
$mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.fr \
< $corpus/news-test2008.tok.fr > $corpus/news-test2008.true.fr
dev/
dev/newstest2009-src.fr.sgm
dev/news-test2008-src.hu.sgm
dev/newssyscomb2009-ref.cs.sgm
dev/newstest2009-ref.en.sgm
dev/newstest2009-ref.hu.sgm
dev/newstest2010.cs
dev/newssyscomb2009-ref.hu.sgm
dev/newstest2010-src.es.sgm
dev/newstest2011-src.es.sgm
dev/newstest2010-src.fr.sgm
dev/newssyscomb2009-src.fr.sgm
dev/newstest2011-ref.de.sgm
dev/news-test2008-ref.es.sgm
dev/newssyscomb2009-ref.de.sgm
dev/newstest2011.fr
dev/news-test2008.cs
dev/news-test2008-ref.fr.sgm
dev/newstest2009-ref.cz.sgm
dev/newstest2011-src.de.sgm
dev/newstest2011-ref.cs.sgm
dev/raw/
dev/raw/newstest2011-src.fr.raw.sgm
dev/raw/newstest2011-src.es.raw.sgm
dev/raw/newstest2011-src.cs.raw.sgm
dev/raw/newstest2011-ref.fr.raw.sgm
dev/raw/newstest2011-ref.es.raw.sgm
dev/raw/newstest2011-ref.en.raw.sgm
dev/raw/newstest2011-src.de.raw.sgm
dev/raw/newstest2011-src.en.raw.sgm
dev/raw/newstest2011-ref.de.raw.sgm
dev/raw/newstest2011-ref.cs.raw.sgm
dev/news-test2008-src.fr.sgm
dev/newssyscomb2009-src.it.sgm
dev/newstest2009.fr
dev/news-test2008-ref.hu.sgm
dev/newstest2009-src.it.sgm
dev/newstest2009.cs
dev/news-test2008-src.de.sgm
dev/newstest2009-src.es.sgm
dev/newstest2009.cz
dev/newstest2010-ref.en.sgm
dev/newstest2009-ref.it.sgm
dev/newstest2011-ref.es.sgm
dev/news-test2008.es
dev/news-test2008.cz
dev/newstest2010-src.en.sgm
dev/newstest2011.en
dev/news-test2008-ref.cz.sgm
dev/newstest2009-src.hu.sgm
dev/news-test2008.fr
dev/newstest2009.en
dev/newstest2009-ref.cs.sgm
dev/newssyscomb2009-src.cs.sgm
dev/news-test2008-ref.de.sgm
dev/newssyscomb2009-src.cz.sgm
dev/newssyscomb2009-ref.en.sgm
dev/newstest2011-ref.fr.sgm
dev/news-test2008.en
dev/newstest2009-ref.de.sgm
dev/newstest2009-src.en.sgm
dev/newstest2011-src.fr.sgm
dev/newstest2011-ref.en.sgm
dev/newstest2010-ref.cz.sgm
dev/newstest2010-src.cs.sgm
dev/newssyscomb2009.de
dev/newstest2010-ref.cs.sgm
dev/newssyscomb2009-ref.it.sgm
dev/newstest2009.de
dev/newssyscomb2009.cs
dev/newstest2009-src.de.sgm
dev/newstest2009-src.xx.sgm
dev/newssyscomb2009.fr
dev/news-test2008-ref.cs.sgm
dev/newstest2010.en
dev/newstest2010-src.cz.sgm
dev/newstest2011-src.en.sgm
dev/newssyscomb2009-src.en.sgm
dev/newssyscomb2009-src.de.sgm
dev/news-test2008-ref.en.sgm
dev/newstest2011.de
dev/newssyscomb2009.en
dev/newstest2011.es
dev/newstest2009-src.cz.sgm
dev/newssyscomb2009-ref.fr.sgm
dev/newstest2010.de
dev/newstest2010.es
dev/newstest2010-src.de.sgm
dev/newstest2009.es
dev/newstest2009-ref.es.sgm
dev/news-test2008-src.en.sgm
dev/newstest2009-ref.fr.sgm
dev/newssyscomb2009-src.hu.sgm
dev/newssyscomb2009-src.es.sgm
dev/news-test2008-src.es.sgm
dev/news-test2008.de
dev/newstest2011.cs
dev/newstest2010-ref.es.sgm
dev/news-test2008-src.cs.sgm
dev/newstest2010.fr
dev/newstest2009-src.cs.sgm
dev/newstest2010.cz
dev/newssyscomb2009-ref.cz.sgm
dev/newstest2010-ref.de.sgm
dev/newssyscomb2009.es
dev/newstest2010-ref.fr.sgm
dev/news-test2008-src.cz.sgm
dev/newstest2011-src.cs.sgm
dev/newssyscomb2009-ref.es.sgm


wget: /home/jjhu/Software/anaconda2/envs/py36/lib/libcrypto.so.1.0.0: no version information available (required by wget)
wget: /home/jjhu/Software/anaconda2/envs/py36/lib/libuuid.so.1: no version information available (required by wget)
wget: /home/jjhu/Software/anaconda2/envs/py36/lib/libssl.so.1.0.0: no version information available (required by wget)
--2019-01-20 01:52:41--  http://www.statmt.org/wmt12/dev.tgz
Resolving www.statmt.org (www.statmt.org)... 129.215.197.184
Connecting to www.statmt.org (www.statmt.org)|129.215.197.184|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13260990 (13M) [application/x-gzip]
Saving to: ‘dev.tgz’

2019-01-20 01:52:44 (3.91 MB/s) - ‘dev.tgz’ saved [13260990/13260990]

Tokenizer Version 1.1
Language: en
Number of threads: 1
Tokenizer Version 1.1
Language: fr
Number of threads: 1
%%bash 
corpus="$PWD/corpus"
mosesdecoder="$PWD/mosesdecoder"
working="$PWD/working"
lm="$PWD/lm"

# Tune the feature weights with MERT
cd $working
nohup nice $mosesdecoder/scripts/training/mert-moses.pl \
$corpus/news-test2008.true.fr $corpus/news-test2008.true.en \
$mosesdecoder/bin/moses train/model/moses.ini --mertdir $mosesdecoder/bin/ \
&> mert.out &
%%bash
corpus="$PWD/corpus"
mosesdecoder="$PWD/mosesdecoder"
working="$PWD/working"
lm="$PWD/lm"

# Preprocess the test data
cd $corpus
$mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
< $corpus/dev/newstest2011.en > $corpus/newstest2011.tok.en
$mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr \
< $corpus/dev/newstest2011.fr > $corpus/newstest2011.tok.fr
$mosesdecoder/scripts/recaser/truecase.perl --model $corpus/truecase-model.en \
< $corpus/newstest2011.tok.en > $corpus/newstest2011.true.en
$mosesdecoder/scripts/recaser/truecase.perl --model $corpus/truecase-model.fr \
< $corpus/newstest2011.tok.fr > $corpus/newstest2011.true.fr

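# Translate the test data.
# Note: no earlier step creates $working/filtered-newstest2011/. Assuming the
# standard Moses baseline workflow, it is produced by filtering the tuned model
# (mert-work/moses.ini, the default mert-moses.pl output) for this test set:
$mosesdecoder/scripts/training/filter-model-given-input.pl \
$working/filtered-newstest2011 $working/mert-work/moses.ini \
$corpus/newstest2011.true.fr
nohup nice $mosesdecoder/bin/moses \
-f $working/filtered-newstest2011/moses.ini \
< $corpus/newstest2011.true.fr \
> $working/newstest2011.translated.en \
2> $working/newstest2011.out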

# Automatic evaluation: BLEU
$mosesdecoder/scripts/generic/multi-bleu.perl \
-lc $corpus/newstest2011.true.en \
< $working/newstest2011.translated.en

Additional resources