# alignment-scripts

**Repository Path**: mirrors_alvations/alignment-scripts

## Basic Information

- **Project Name**: alignment-scripts
- **Description**: Scripts to preprocess training and test data and to run fast_align and giza
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-09-11
- **Last Updated**: 2026-04-11

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# alignment-scripts
Scripts to preprocess training and test data for alignment experiments and to run and evaluate FastAlign and Mgiza.


## Dependencies
* Python3
* [MosesDecoder](https://github.com/moses-smt/mosesdecoder): Used during preprocessing
* [Sentencepiece](https://github.com/google/sentencepiece): Optional, used for subword splitting at the end of preprocessing
* [FastAlign](https://github.com/clab/fast_align): Only used for FastAlign
* [Mgiza](https://github.com/moses-smt/mgiza/): Only used for Mgiza


## Usage Instructions
* Install all necessary dependencies
* Export install locations for dependencies: `export {MOSES_DIR,FASTALIGN_DIR,MGIZA_DIR}=/foo/bar`
* Make sure you set a reasonable default locale, e.g.: `export LC_ALL=en_US.UTF-8`
* Create folder for your test data: `mkdir -p test`
* Download [Test Data for German-English](https://www-i6.informatik.rwth-aachen.de/goldAlignment/) and move it into the folder `test`
* Run preprocessing: `./preprocess/run.sh`
* Run Fastalign: `./scripts/run_fast_align.sh`
* Run Giza: `./scripts/run_giza.sh` (This might take multiple days)


## Results
All results are in percent in the format: AlignmentErrorRate (Precision/Recall)

### German to English ###
| Method | DeEn | EnDe | Grow-Diag | Grow-Diag-Final |
| --- | ---- | --- | ---- | --------- |
| FastAlign | 28.4% (71.3%/71.8%) | 32.0% (69.7%/66.4%) | 27.0% (84.6%/64.1%) | 27.7% (80.7%/65.5%) |
| Mgiza | 21.0% (86.2%/72.8%) | 23.1% (86.6%/69.0%) | 21.4% (94.3%/67.2%) | 20.6% (91.3%/70.2%) |

### Romanian to English ###
| Method | RoEn | EnRo | Grow-Diag | Grow-Diag-Final |
| --- | ---- | --- | ---- | --------- |
| FastAlign | 33.8% (71.8%/61.3%) | 35.5% (70.6%/59.4%) | 32.1% (85.1%/56.5%) | 32.2% (81.4%/58.1%) |
| Mgiza | 28.7% (82.7%/62.6%) | 32.2% (79.5%/59.1%) | 27.9% (94.0%/58.5%) | 26.4% (90.9%/61.8%) |

### English to French ###
| Method | EnFr | FrEn | Grow-Diag | Grow-Diag-Final |
| --- | ---- | --- | ---- | --------- |
| FastAlign | 16.4% (80.0%/90.1%) | 15.9% (81.3%/88.7%) | 10.5% (90.8%/87.8%) | 12.1% (87.7%/88.3%) |
| Mgiza | 8.0% (91.4%/92.9%) | 9.8% (91.6%/88.3%) | 5.9% (97.5%/89.7%) | 6.2% (95.5%/91.6%) |


## Known Issues
* Does not work on MacOs
* Tokenization of the Canadian Hansards seems to be off when accents are present in the English text: `Ms. H é l è ne Alarie`, `Mr. Andr é Harvey :`, `Mr. R é al M é nard`