
Improving Parallel Corpus Quality for Chinese-Vietnamese Statistical Machine Translation


Large-scale parallel corpora play an important role in language studies and statistical machine translation (SMT) research. They mainly provide training data for translation models[1-2] and serve as resources for automatic lexicon acquisition and enrichment[3-4] as well as for grammar induction[5].

Previous research has focused on building bilingual corpora from the Internet, but little of it involves Vietnamese. Ref.[6] collected French-Vietnamese bilingual data from the Vietnam News Agency website and aligned the documents through a series of filtering steps (publishing date, special words and sentence-alignment results). They used the Champollion toolkit to align the sentences, eventually obtaining 12 100 bilingual document pairs and 50 300 bilingual sentence pairs in the news domain. Ref.[7] designed a Chinese-Vietnamese bilingual parallel corpus management platform and collected more than 110 000 Chinese-Vietnamese sentence pairs from the Internet; 53 000 words in several fields were aligned manually in their work. In order to extract an English-Vietnamese bilingual corpus from the web, Ref.[8] designed two content-based features: cognates and translation segments. These content-based features, together with the structure of the web pages, were fed into a machine learning model to extract the bilingual texts. For bilingual e-book sources, they used various forms of linkage between the blocks of text in the two languages, adopting three steps: pre-processing, paragraph alignment and sentence alignment.

All the above works have one feature in common: they must search for websites containing bilingual data, and then collect, process and align the paragraphs, sentences and words. These methods may produce a quality corpus, but they make it difficult to construct a bilingual corpus on a large scale.

In addition, several sentence-alignment tools have been published, such as those of Refs.[9-12]. However, none of them supports aligning Vietnamese sentences with those of other languages, or vice versa.


Fortunately, there is a kind of large-scale multilingual corpus that is ever-growing and easily accessible online for SMT: movie subtitles. A bilingual corpus of movie or TV subtitles at the million-sentence scale is easy to obtain in various language pairs, and the best part is that the data possesses a natural rough alignment at the sentence level. Refs.[13-15] focused on this direction, and all the sources adopted in these works were obtained from OpenSubtitles.

OpenSubtitles has several versions (years: 2011, 2012, 2013, 2016), the latest being OpenSubtitles-2016. From a linguistic perspective, subtitles cover a wide and interesting breadth of genres, from colloquial language or slang to narrative and expository discourse (e.g. documentaries). OpenSubtitles-2016 includes a total of 1 689 bi-texts extracted from a collection of subtitles containing 2.6 billion sentences (17.2 billion tokens) distributed over 60 languages[16].

Ref.[16] pre-processed the raw subtitle files to generate XML subtitle files, and then performed a cross-lingual alignment to generate XML alignment files (1 XML file per language pair, encoded as a collection of alignments). Finally, they produced a bilingual corpus in the Moses format.

Although the amount of this kind of data is very large, many problems remain, such as sentence mismatches, translation errors, free translations, font errors, etc. In order to take full advantage of this large-scale subtitle corpus, three filtering methods are introduced to pick out the sentence pairs of good quality: ① sentence length difference, ② semantic similarity, and ③ machine learning.

The rest of this paper is organized as follows. Section 1 presents our proposed model. Section 2 presents a filtering method based on sentence length. Section 3 presents a filtering method based on machine translation references (using semantic similarity and machine learning methods). Section 4 describes our experiments and their results. Finally, Section 5 draws conclusions and gives future work directions.

1 Proposed Model

The raw data in this paper is obtained from OpenSubtitles-2016 (http:∥opus.lingfil.uu.se/OpenSubtitles2016.php). To clean up the noise in this corpus, pre-processing is conducted as follows: ① remove unnecessary symbols such as §, #, [, ], ※, -, *, @, 「, 」; ② remove sentence pairs that contain font errors; ③ remove sentences that contain English words; and ④ convert traditional Chinese characters to simplified Chinese characters. After pre-processing, a baseline corpus called C0-corpus is obtained. Fig.1 illustrates the architecture of the system, with the pre-processing in its first part. In this section the remaining parts of the system will be described in detail.
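The pre-processing rules ①-④ above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the symbol set shown is a subset of the one listed, and the traditional-to-simplified conversion (rule ④) is left as a stub.

```python
import re

# A subset of the unwanted symbols listed above (assumed representative).
SYMBOLS = re.compile(r"[§#\[\]※\-*@「」]")
# Latin letters on the Chinese side signal untranslated English fragments.
LATIN = re.compile(r"[A-Za-z]")

def clean_pair(zh, vi):
    """Return a cleaned (zh, vi) pair, or None if the pair should be dropped."""
    zh = SYMBOLS.sub("", zh).strip()
    vi = SYMBOLS.sub("", vi).strip()
    if LATIN.search(zh):   # rule ③: drop Chinese sentences containing English
        return None
    # Rule ④ (traditional -> simplified conversion) would go here,
    # e.g. via a character mapping table or the OpenCC library (omitted).
    return zh, vi
```

Note that the English-word check is applied only to the Chinese side, since Vietnamese itself is written in Latin script.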

C0-corpus is obtained through the pre-processing stage. Next, we perform the first filtering, based on a Chinese-Vietnamese dictionary: we remove all sentence pairs with large differences in length, obtaining C1-corpus. However, some sentences with very poor translation quality remain, and some translated sentences do not match the meaning of the original sentences. So, we apply a further filter to remove those poor-quality translations by employing a method based on a machine translation reference. Since the number of sentence pairs in C1-corpus is still very large, we randomly take 2 000 and 5 000 representative sentence pairs from the C1-corpus to find the most appropriate threshold values, as described below. This is done by combining two methods, namely automatic labelling (based on the measures Cosine, Jaccard, Dice and smoothed-BLEU (Smoothing 3)[17]) and manual labelling. With each measure, we keep a C1-V1 sentence pair from the C1-corpus in the C2-corpus if the semantic similarity of V1 and V1-google (where V1-google is translated from C1 using https:∥translate.google.com/) is greater than or equal to the threshold value. Moreover, to make further use of these measures, we use them as features in machine learning methods (support vector machine, SVM; logistic regression, LR) for classification. In the classification, Yes denotes good-quality sentence pairs, and No denotes pairs to be removed. The Yes results constitute the C2-corpus.

Fig.1 Architecture of corpus filtering system

2 Filtering Based on Sentence Length Difference

2.1 Characteristics of Vietnamese and Chinese

In terms of language typology, Chinese and Vietnamese are both isolating languages, so there are some similarities between them. The basic unit of Vietnamese is the syllable. In writing, syllables are separated from each other by white space. White space alone, however, cannot be used to determine word boundaries, because a word often comprises one or more syllables. Tab.1 presents an example of a Vietnamese sentence and its corresponding Chinese one segmented into syllables and words.

Tab.1 Example of a Chinese-Vietnamese sentence pair segmented into syllables and words

As the size of the C1-corpus is very large, manually evaluating the translation quality of every sentence pair in it is infeasible. Instead, we first randomly sample 2 000 sentence pairs representing C1-corpus, denoted 2K_ch and 2K_vi, and manually label them: Yes for good-quality sentence pairs and No for bad-quality pairs that will be removed. These 2 000 sentence pairs are used as a training data set.

2.2 Filtering Based on Relative Sentence Length

As mentioned above, sentence length is a good benchmark for determining whether two sentences are accurate translations of each other; this is intuitive for a bilingual corpus. Tab.2 shows examples of sentence alignment errors in C0-corpus.


Tab.2 Examples of sentence alignment errors in C0-corpus

Here, a Chinese-Vietnamese dictionary with 340 000 word pairs is taken as the reference, and the average relative length difference of its Chinese-Vietnamese word pairs is calculated by the following formula:


$$dic_{threshold}=\frac{1}{l_{dic}}\sum_{i=1}^{l_{dic}}\frac{\left|l^{i}_{ch\_word}-l^{i}_{vi\_word}\right|}{\min\left(l^{i}_{ch\_word},\,l^{i}_{vi\_word}\right)}\qquad(1)$$

where $dic_{threshold}$ is the threshold value derived from the dictionary, $l^{i}_{ch\_word}$ is the length of the $i$-th Chinese word, $l^{i}_{vi\_word}$ is the length of the corresponding Vietnamese word, and $l_{dic}$ is the number of word pairs in the dictionary.
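A sketch of Eq. (1) in code, assuming (as section 2.1 suggests) that Chinese word length is counted in characters and Vietnamese word length in white-space-separated syllables:

```python
def dic_threshold(word_pairs):
    """Average relative length difference over dictionary word pairs (Eq. (1)).

    word_pairs: iterable of (chinese_word, vietnamese_word) strings.
    """
    total = 0.0
    n = 0
    for ch, vi in word_pairs:
        l_ch = len(ch)           # Chinese length in characters
        l_vi = len(vi.split())   # Vietnamese length in syllables
        total += abs(l_ch - l_vi) / min(l_ch, l_vi)
        n += 1
    return total / n
```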

In C0-corpus, the difference of each sentence pair is calculated by

$$dif(ch,vi)=\frac{\left|l_{ch\_sen}-l_{vi\_sen}\right|}{\min\left(l_{ch\_sen},\,l_{vi\_sen}\right)}\qquad(2)$$

where $dif(ch, vi)$ is the relative length difference of a Chinese-Vietnamese sentence pair, $l_{ch\_sen}$ is the length of the Chinese sentence, $l_{vi\_sen}$ is the length of the Vietnamese sentence, and $\min(l_{ch\_sen}, l_{vi\_sen})$ is the smaller of the two.

Sentence pairs whose $dif(ch, vi)$ is greater than $dic_{threshold}$ are removed from C0-corpus. To estimate the impact of sentence length on C1-corpus quality, several values of $dic_{threshold}$ are tried: the original $dic_{threshold}$ value is multiplied by coefficients 1, 2, …, 6 (as mentioned in section 4.2.1) to get new threshold values. From each new threshold value, a corresponding C1-corpus is obtained from C0-corpus.
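The filtering step can then be sketched as follows (a hypothetical helper; sentence lengths are counted as above, Chinese in characters and Vietnamese in syllables):

```python
def filter_by_length(pairs, threshold):
    """Keep sentence pairs whose relative length difference (Eq. (2))
    does not exceed the threshold."""
    kept = []
    for ch, vi in pairs:
        l_ch, l_vi = len(ch), len(vi.split())
        dif = abs(l_ch - l_vi) / min(l_ch, l_vi)
        if dif <= threshold:
            kept.append((ch, vi))
    return kept
```

Trying the multiplied thresholds then amounts to calling `filter_by_length(pairs, k * dic_threshold)` for k = 1, …, 6.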


3 Filtering Based on Machine Translation Reference

After the previous step, although the sentence pairs collected in the C1-corpus have similar sentence lengths, these pairs still contain several mis-translated ones, as exemplified below.

Tab.3 shows some common mis-translations: ① sentence pairs (lines 1 and 2) that are aligned but differ completely in meaning; ② mis-alignment, i.e., the Chinese sentence in line 3 is aligned to the Vietnamese sentence in line 4 and the Chinese sentence in line 4 to the Vietnamese sentence in line 5, which in turn leads to ③ free translation, i.e., the Vietnamese sentence in line 3 and the Chinese sentence in line 5 have no matching translation aligned to them.

Tab.3 Examples of mis-translated sentences in C1-corpus

The initial corpus includes 1 120 000 sentence pairs. After pre-processing (see section 1) and manually selecting 5 000 relatively good sentence pairs, we obtained a C0-corpus of 997 424 sentence pairs. We combined the 5 000 extracted sentence pairs with another 5 000 sentence pairs collected from vietnamese.cri.cn to create the develop_set and test_set.

3.1 Filtering based on semantic similarity

Given the typological similarities described above, it is recognized that Chinese sentences and their Vietnamese translations often have proportional lengths. Therefore, the lengths of the sentences in a pair are a very good criterion for determining whether they are accurate translations of one another.

First, Google Translate(https:∥translate.google.com/) is utilized to translate the extracted 2 000 Chinese sentences (2K_ch) to Vietnamese (2K_google). Then, the semantic similarity of the 2 000 sentence pairs (2K_vi, 2K_google) is measured by using the measures: Cosine, Jaccard, Dice, and smoothed-BLEU.

For each measure, the most appropriate threshold value is identified to separate good-quality from bad-quality sentence pairs in C1-corpus. Only those sentence pairs (C1, V1) for which the semantic similarity between V1 and V1-google is greater than or equal to the threshold value of that measure are kept in C2-corpus.

Each Vietnamese sentence pair in the 2K_vi-2K_google set has a similarity of 0 if the two sentences are semantically completely different and 1 if they are completely identical; the similarity thus ranges over [0, 1].
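The three set-based measures can be sketched over white-space tokens as follows. The binary term weighting for Cosine is an assumption (the paper does not specify the weighting), and smoothed-BLEU is omitted here; all three functions return values in [0, 1] as described above.

```python
def _tokens(s):
    return set(s.lower().split())

def cosine_sim(a, b):
    """Cosine similarity over binary token vectors (weighting assumed)."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / (len(ta) ** 0.5 * len(tb) ** 0.5)

def jaccard_sim(a, b):
    ta, tb = _tokens(a), _tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def dice_sim(a, b):
    ta, tb = _tokens(a), _tokens(b)
    return 2 * len(ta & tb) / (len(ta) + len(tb)) if ta or tb else 0.0
```

For smoothed-BLEU (Smoothing 3)[17], an off-the-shelf implementation such as NLTK's `sentence_bleu` with `SmoothingFunction().method3` could be used instead of hand-rolling it.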

In this step, we search for the most appropriate threshold value for each measure, i.e. the value at which F1 reaches its maximum. Taking the Cosine measure as an example, Tab.4 is considered (the other measures are handled similarly).


As the threshold value lies in [0, 1], the original temporary threshold is set at t0 (=0.0) and the final threshold at 1. For the original temporary threshold t0, the corresponding values in the Tt0 and Nt0 columns are calculated as in Tab.4, where i is the index of a manually labelled sentence pair, i = 1, 2, …, 2 000.

系统性红斑狼疮的发病病因不明,临床症状隐匿不发且表现多样,发病时累及机体多个系统,现临床中大多应用激素治疗系统性红斑狼疮,激素具有减少自身抗体、抑制免疫系统以及抗炎的作用,其治疗效果显著且见效迅速,但是长期服用激素类药物,存在着诸多并发症发生和不良反应发生的风险,在停药后的复发率也相对较高,对会自身的脂肪代谢产生影响,会一定程度的导致微循环障碍,严重影响下丘脑-垂体-肾上腺功能[4]。因此,在治疗过程中要同时给予可以有效改善微循环障碍,尽可能减少对患者身体的损伤,本次研究将探究热滋阴活血祛瘀汤联合激素治疗系统性红斑狼疮的临床效果是否显著。

Tab.4 Cosine measure of threshold value

| No. | Manual labelling (M) | Cosine(2K_vi, 2K_google) | Tt0 | Nt0 |
|---|---|---|---|---|
| 1 | Yes | 0.0 | No | No |
| 2 | No | 0.5 | Yes | No |
| … | … | … | … | … |
| 1999 | No | 0.0 | No | No |
| 2000 | Yes | 0.9 | Yes | Yes |

Here, sentence pairs whose Cosine similarity is greater than t0 have Tt0 set to Yes, and No otherwise. In the Nt0 column, we combine the Manual labelling column with the Tt0 column: Nt0 is set to Yes when both columns are Yes, and No otherwise. For the other temporary threshold values (from t1 to 1), the corresponding Tt1, Nt1, etc. columns are calculated similarly.


$$T_{t_0}=\begin{cases}Yes,&Cosine(2K\_vi,\,2K\_google)>t_0\\No,&\text{otherwise}\end{cases}\qquad(3)$$

$$N_{t_0}=\begin{cases}Yes,&M=Yes\text{ and }T_{t_0}=Yes\\No,&\text{otherwise}\end{cases}\qquad(4)$$

Finally, we calculate the precision (P), recall (R) and F1 for the Cosine measure at temporary threshold t0 as follows:

$$P=\frac{N_{t_0}(Yes)}{T_{t_0}(Yes)}\qquad(5)$$

$$R=\frac{N_{t_0}(Yes)}{M(Yes)}\qquad(6)$$

$$F1=\frac{2PR}{P+R}\qquad(7)$$

where $M(Yes)$, $T_{t_0}(Yes)$ and $N_{t_0}(Yes)$ are the numbers of Yes occurrences in the M, $T_{t_0}$ and $N_{t_0}$ columns, respectively.
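The whole sweep over temporary thresholds can be sketched as follows (the function name is hypothetical; labels are the manual M column, scores the Cosine similarities):

```python
def best_threshold(labels, scores, step=0.1):
    """Return (threshold, F1) maximising F1 over thresholds 0.0..1.0.

    labels: manual labels, True for Yes / False for No (the M column).
    scores: similarity scores, e.g. Cosine(2K_vi, 2K_google).
    """
    best_t, best_f1 = 0.0, 0.0
    t = 0.0
    while t <= 1.0:
        pred = [s > t for s in scores]                    # the T_t column
        tp = sum(p and m for p, m in zip(pred, labels))   # N_t(Yes)
        if tp:
            precision = tp / sum(pred)                    # Eq. (5)
            recall = tp / sum(labels)                     # Eq. (6)
            f1 = 2 * precision * recall / (precision + recall)  # Eq. (7)
            if f1 > best_f1:
                best_t, best_f1 = t, f1
        t = round(t + step, 10)
    return best_t, best_f1
```

Passing `step=0.01` or `step=0.001` gives the finer threshold grids mentioned in this section.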

In detail, the alignment errors in Tab.2 are: ① one Vietnamese sentence is aligned with several Chinese sentences (1st row); ② one Chinese sentence is aligned with several Vietnamese sentences (2nd row); ③ a short Chinese sentence is aligned with a very long Vietnamese sentence (3rd row); and ④ a long Chinese sentence is aligned with a very short Vietnamese sentence (4th row). These errors cause large length differences between the Chinese and Vietnamese sentences of a pair, which indicates that they are not translated properly.


$Cosine_{threshold}$ is calculated as

$$Cosine_{threshold}=\underset{t_j,\;j=0,1,2,\dots,10}{\arg\max}\;F1(t_j)\qquad(8)$$

where j = 0, 1, 2, …, 10 corresponds to the temporary threshold values from 0.0 to 1.0 in steps of 0.1. If the step is instead set to 0.01 or 0.001 (i.e. j = 0, 1, 2, …, 100 or j = 0, 1, 2, …, 1 000, respectively), the threshold is increased by 0.01 or 0.001 at each repetition.

$Cosine_{threshold}$ is thus the threshold value at which F1 reaches its maximum when combining manual and automatic labelling. Similarly, we also calculate the $Jaccard_{threshold}$, $Dice_{threshold}$ and smoothed-BLEU$_{threshold}$ values.

By filtering with the different thresholds of the sentence-length step, different C1-corpora can be obtained from C0-corpus. Tab.7 shows the number of sentence pairs in each corpus and the corresponding BLEU score.


Tab.5 Each measure and their parameters

| Measure | Cosine | | Jaccard | | Dice | | Smoothed-BLEU | |
|---|---|---|---|---|---|---|---|---|
| Training set | 2 000 | 5 000 | 2 000 | 5 000 | 2 000 | 5 000 | 2 000 | 5 000 |
| Threshold | 0.321 | 0.338 | 0.100 | 0.083 | 0.174 | 0.160 | 0.037 | 0.033 |
| P | 0.600 | 0.625 | 0.667 | 0.678 | 0.664 | 0.682 | 0.602 | 0.621 |
| R | 0.878 | 0.847 | 0.838 | 0.845 | 0.844 | 0.839 | 0.943 | 0.946 |
| F1 | 0.713 | 0.720 | 0.743 | 0.752 | 0.743 | 0.753 | 0.735 | 0.750 |
| Number of Yes-Yes | 963 | 2 373 | 919 | 2 366 | 926 | 2 349 | 1 035 | 2 650 |

3.2 Filtering based on machine learning

Above, we manually labelled (Yes, No) the 2 000 sentence pairs (2K_ch, 2K_vi) and 5 000 sentence pairs (5K_ch, 5K_vi), and calculated the similarities (2K_vi, 2K_google) and (5K_vi, 5K_google) using the Cosine, Jaccard, Dice and smoothed-BLEU measures (2K_google and 5K_google are translated from 2K_ch and 5K_ch using https:∥translate.google.com/). In this step, we build SVM and LR classifiers based on these labelled data and measures. The classification features are as follows:

SVM: Cosine, Jaccard, Dice, smoothed-BLEU, absdif, reldif

LR: Cosine, Jaccard, Dice, smoothed-BLEU, absdif, reldif

where absdif and reldif are calculated using the formulas:

$$absdif=\left|l_{ch}-l_{vi}\right|\qquad(9)$$

$$reldif=\frac{\left|l_{ch}-l_{vi}\right|}{\min\left(l_{ch},\,l_{vi}\right)}\qquad(10)$$

where $l_{ch}$ is the length of the Chinese sentence and $l_{vi}$ is the length of the Vietnamese sentence; absdif and reldif are the absolute and relative length differences of each Chinese-Vietnamese sentence pair in the C1-corpus.
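Assembling one pair's feature vector could look like this (a hypothetical helper; `sims` holds the four similarity scores computed between V1 and its Google translation of C1):

```python
def pair_features(ch, vi, sims):
    """Feature vector for one sentence pair: the four similarity scores
    plus absdif and reldif from Eqs. (9) and (10)."""
    l_ch, l_vi = len(ch), len(vi.split())   # characters vs. syllables
    absdif = abs(l_ch - l_vi)               # Eq. (9)
    reldif = absdif / min(l_ch, l_vi)       # Eq. (10)
    return list(sims) + [absdif, reldif]
```

These six-dimensional vectors could then be fed to off-the-shelf classifiers such as scikit-learn's `SVC` or `LogisticRegression` (the paper does not name a specific toolkit).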

With each C1-corpus corresponding to a threshold value calculated using Eq. (1), we classify the sentence pairs of the C1-corpus as Yes/No and obtain the respective C2-corpus.

4 Experiments

4.1 Data setting

We redraw the subsequent steps of the method in Fig.2 and distinguish the data after each filtering step.


Therefore, in this step we aim to eliminate the sentence pairs involving translation errors, sentence mismatches, free translations, etc., relying on semantic criteria.

Fig.2 Process of the experiments

For the C1-corpus (C1, V1), Google Translate is used to translate the Chinese sentences (C1) into Vietnamese sentences (V1-google), and then the measures Cosine, Jaccard, Dice and smoothed-BLEU are used to measure the similarity of the sentence pairs (V1, V1-google). In the experiments below, depending on each threshold value calculated using Eq. (1), we extract from C0-corpus the corresponding number of (Chinese, Vietnamese) pairs for each measure to create training data sets for C1-corpus and C2-corpus.

Next, 2 000 and 5 000 pairs are extracted randomly from the C1-corpus (see section 3.1) to build the training_set for the next experiments. Note that the training_set is the only difference among the experiments; the develop_set and test_set are completely identical. Thus, the corpus preparation includes the 6 categories in Tab.6.

Tab.6 Data sets for the experiments

| Type of data | Number of sentence pairs | For experiments |
|---|---|---|
| develop_set | 5 000 | BLEU of C0-corpus, C1-corpus and C2-corpus |
| test_set | 5 000 | |
| C0-corpus | 997 424 | C0-BLEU score |
| (2K_ch, 2K_vi) | 2 000 | Finding the most appropriate threshold value for each measure, representing C1-corpus |
| (5K_ch, 5K_vi) | 5 000 | |
| C1-corpus, C2-corpus | Dependent on each threshold value calculated using Eq. (1) | C1-BLEU scores and C2-BLEU scores |

In this study, Moses[18] is used as the decoder, SRILM is applied to build the language model, Giza++ is used for word alignment, and BLEU for scoring the translations.

4.2 Experimental results

In order to empirically evaluate the quality of the C0-corpus, C1-corpora and C2-corpora, we used them as training sets for an SMT system based on Moses. First, an SMT system is built to evaluate C0-corpus by its BLEU score. With the training set of 997 424 sentence pairs, the C0-BLEU result is 18.78 (baseline).

4.2.1 C1-BLEU scores

In order to test the stability of the classifier, we similarly sample 5 000 sentence pairs (5K_ch, 5K_vi) at random from the C1-corpora to find the threshold value for each measure. Tab.5 gives the corresponding parameters of each measure for the 2 000 and 5 000 sentence pairs. These values are the criteria for deciding whether a new sentence pair is labelled Yes or No.

Tab.7 Change of BLEU score with their threshold

| Original threshold multiplied by coefficient (level) | Dictionary threshold | Number of sentence pairs | C1-BLEU score |
|---|---|---|---|
| 1 | 0.236 | 375 527 | 18.41 |
| 2 | 0.472 | 601 350 | 18.96 |
| 3 | 0.708 | 745 151 | 19.60 |
| 4 | 0.944 | 801 718 | 19.31 |
| 5 | 1.180 | 872 688 | 19.16 |
| 6 | 1.416 | 899 071 | 19.15 |

However, from level 4 to level 6, C1-BLEU tends to decrease. The reason is that the filter conditions are loosened, so the data added to the training set at each subsequent level contains more and more poor-quality sentence pairs. It is therefore logical that C1-BLEU reaches its lowest value in this range, 19.15, at level 6.


Fig.3 shows that, at level 1, the C1-BLEU score is 18.41, which is smaller than the baseline C0-BLEU score (18.78). This is because overly strict filtering discards too many translation pairs, and what remains is insufficient for training a translation model.

Fig.3 Thresholds and their corresponding BLEU scores

From level 2 onward, at each threshold, most C1-BLEU scores are higher than the C0-BLEU score, and the highest value occurs at level 3. This demonstrates that the filtering step based on sentence length is effective.

The initial threshold calculated by Eq. (1) is 0.236 (level 1). This threshold is then multiplied by the coefficients from 2 to 6, corresponding to threshold values from 0.472 to 1.416. For convenience, the levels (1, 2, …, 6) are used below to denote the different C1-corpora.

4.2.2 C2-BLEU scores

Now, based on the above C1-corpora with their various threshold values, further filtering is conducted based on the threshold values of the measures (Cosine, Jaccard, Dice and smoothed-BLEU) and on the machine learning methods (SVM, LR). Representative thresholds at levels 1-6 are taken, yielding the results shown in Fig.4.

The chart in Fig.4 has six intervals for the six levels (1, 2, …, 6). The first point of each interval is the C1-BLEU score; the six remaining points are the C2-BLEU scores for Cosine, Jaccard, Dice, smoothed-BLEU, SVM and LR, respectively. The horizontal line is the C0-BLEU score.

Fig.4 Thresholds and their C2-BLEU scores

According to this chart, when the threshold is at level 1, both the C1-BLEU and C2-BLEU scores are smaller than the C0-BLEU score. This is because, at the filtering step based on sentence length, the threshold value is too small (0.236) and too many good-quality sentence pairs are removed. Consequently, the number of good-quality sentence pairs retained in the C1- and C2-corpora is smaller than in the C0-corpus, driving down the BLEU score.

Thus, from level 1 to level 3, the C1-BLEU scores tend to increase, reaching their highest value at level 3 (19.60), and then decrease towards level 6 (19.15), as shown in Fig.3. Note that the two data sets (2 000 and 5 000 pairs) extracted in section 3.1 have no effect on the C1-BLEU scores at any level, only on the C2-BLEU scores. The C2-BLEU scores (with both data sets) are likewise neither good nor stable at level 1, as discussed above. From level 2 onwards, the C2-BLEU scores are consistently stable and higher than the C1-BLEU scores at the respective levels. C2-BLEU reaches its highest values at level 3 with SVM: 20.01 with the 2 000-pair data set and 20.10 with the 5 000-pair data set.

The BLEU score, percentage of sentence pairs and experimental results are presented in Tab.8.

Tab.8 Original corpus, C0-corpus, C1-corpus, C2-corpus and their BLEU scores

| Step | Original corpus | C0-corpus | C1-corpus (level 3) | C2-corpus (level 3, SVM, 2 000) | C2-corpus (level 3, SVM, 5 000) |
|---|---|---|---|---|---|
| Number of sentence pairs | 1 125 563 | 997 424 | 745 151 | 553 250 | 575 535 |
| Percent | | 100% | 74.70% | 55.47% | 57.70% |
| BLEU | 6.30 | 18.78 | 19.60 | 20.01 | 20.10 |

Tab.8 compares the results of our experiments at different filtering steps (i.e. C1-corpus and C2-corpus at level 3) with the baseline (C0-corpus) and the original corpus. From the results, it is clear that the data selected by our method yields better-quality corpora and a higher score with a smaller amount of data.


Noticeably, competitive performance is achieved with much less data, which reduces the computational load. To illustrate, C1-corpus uses 74.70% of the baseline data and attains a BLEU score of 19.60, while C2-corpus uses 55.47% (with 2 000 pairs) or 57.70% (with 5 000 pairs) of the baseline data and achieves BLEU scores of 20.01 and 20.10, respectively, both higher than those of the C1-corpus and the baseline. In other words, little more than half of the data yields performance competitive with using all of it.


5 Conclusions and Future Work

This paper proposes filtering methods based on sentence length difference, the semantics of sentence pairs, and machine learning for Chinese-Vietnamese parallel corpus construction. In the filtering step based on sentence length difference, for all the various threshold values, the C1-BLEU scores (from level 2) are higher than the C0-BLEU score. Next, two manually labelled data sets are used for the remaining two filtering steps. In the filtering step based on the semantic similarity of sentence pairs, the filter quality and quantity depend on the threshold values of step 1. The C2-BLEU scores are consistently higher than the C1-BLEU scores.

In addition, the machine learning methods (SVM, LR) achieve more stable and higher BLEU scores than the single-feature methods (based on each measure individually). Using a manual data set of size 2 000 or 5 000 did not significantly affect the C2-BLEU scores, as both are extracted randomly to represent the C1-corpora. These methods can also be easily transferred to other language pairs.

In the future, we will attempt to calculate C2-BLEU scores at the remaining threshold values. We hope that at these thresholds the C2-BLEU scores can exceed the current best value of 20.10.

References:

[1] Melamed D I. Models of translational equivalence among words[J]. Computational Linguistics,2000,26(2):221-249.

[2] Jin R, Chai J Y. Study of cross lingual information retrieval using on-line translation systems[C]∥International ACM Sigir Conference on Research & Development in Information Retrieval,2005:619-620.

[3] Gale W A, Church K W. Identifying word correspondences in parallel texts[C]∥Speech and Natural Language, Proceedings of a Workshop Held at Pacific Grove, California, USA, DBLP, 1991: 152-157.

[4] Widdows D, Dorow B, Chan C K. Using parallel corpora to enrich multilingual lexical resources[C]∥International Conference on Language Resources & Evaluation, 2002:240-245.

[5] Kuhn J. Experiments in parallel-text based grammar induction[C]∥Meeting on Association for Computational Linguistics, Association for Computational Linguistics, 2004:470. https:∥dl.acm.org/citation.cfm?id=1218955.1219015.

[6] Do Thi-Ngoc-Diep, Le Viet-Bac, Bigi Brigitte,et al. Mining a comparable text corpus for a Vietnamese-French statistical machine translation system[C]∥Proceedings of the Fourth Workshop on Statistical Machine Translation, Association for Computational Linguistics, 2009:165-172. https:∥dl.acm.org/citation.cfm?id=1626466.

[7] Luo L, Guo J Y, Yu Z T, et al. Construction of a large-scale Sino-Vietnamese bilingual parallel corpus[C]∥IEEE International Conference on System Science and Engineering, IEEE, 2014:154-157.

[8] Le Q H, Le A C. Extracting parallel texts from the web[C]∥Second International Conference on Knowledge and Systems Engineering, IEEE, 2010:147-151.

[9] Gale W A, Church K W. A program for aligning sentences in bilingual corpora[C]∥Meeting on Association for Computational Linguistics, Association for Computational Linguistics, 1991:177-184.

[10] Moore R C. Fast and accurate sentence alignment of bilingual corpora[C]∥Conference of the Association for Machine Translation in the Americas on Machine Translation: From Research to Real Users, Springer-Verlag, 2002:135-144.

[11] Braune F, Fraser A. Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora[C]∥COLING 2010, International Conference on Computational Linguistics, Posters Volume, 23-27 August 2010, Beijing, China, DBLP, 2010:81-89.

[12] Sennrich R, Volk M. MT-based sentence alignment for OCR-generated parallel texts[C]∥Proc of Amta, 2010: 175-182.

[13] Tiedemann J. Building a multilingual parallel subtitle corpus[J]. International Journal of Multilingualism, 2009, 11(2):266-268.

[14] Tiedemann J. Synchronizing translated movie subtitles[C]∥International Conference on Language Resources and Evaluation, Lrec 2008, 26 May-1 June 2008, Marrakech, Morocco, DBLP, 2012:1902-1906.

[15] Skadiņš R, Tiedemann J, Rozis R, et al. Billions of parallel words for free: building and using the EU bookshop corpus[C]∥Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), 2014: 1850-1855. https:∥pdfs.semanticscholar.org/359a/f0607033b19e91c5f07715d16a0f2efff85f.pdf.

[16] Lison P, Tiedemann J. OpenSubtitles2016: extracting large parallel corpora from movie and TV subtitles[C]∥LREC, 2016. https:∥www.duo.uio.no/handle/10852/50459.

[17] Chen B, Cherry C. A systematic comparison of smoothing techniques for sentence-level BLEU[C]∥The Workshop on Statistical Machine Translation, 2014:362-367.

[18] Koehn P, Hoang H, Birch A, et al. Moses: open source toolkit for statistical machine translation[C]∥Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, Association for Computational Linguistics, 2007:177-180.

Huu-anh Tran, Yuhang Guo, Ping Jian, Shumin Shi,Heyan Huang
《Journal of Beijing Institute of Technology》, 2018, No. 1
