# BoB

**Repository Path**: myth_psy/BoB

## Basic Information

- **Project Name**: BoB
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2022-01-04
- **Last Updated**: 2022-01-04

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

## BoB: BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized Data

This repository provides the implementation details for the ACL 2021 main conference paper: **BoB: BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized Data**. [[paper]](https://aclanthology.org/2021.acl-long.14/)

## 1. Data Preparation

In this work, we carried out persona-based dialogue generation experiments under a persona-dense scenario (English **PersonaChat**) and a persona-sparse scenario (Chinese **PersonalDialog**), with the assistance of a series of auxiliary inference datasets. Here we summarize the key information of these datasets and provide links to download them where they are directly accessible.

* **For Persona-Dense Experiments**

| Dataset | Type | Language | Usage | Where to Download |
| ---- | ---- | ---- | ---- | ---- |
| ConvAI2 PersonaChat | Dialogue Generation | English | Training | [https://www.aclweb.org/anthology/P18-1205.pdf](https://www.aclweb.org/anthology/P18-1205.pdf) train\_self\_original\_no\_cands & valid\_self\_original\_no\_cands (7801 test dialogues) |
| MNLI | Non-dialogue Inference | English | Training | [https://cims.nyu.edu/~sbowman/multinli/multinli_1.0.zip](https://cims.nyu.edu/~sbowman/multinli/multinli_1.0.zip) entailment & contradiction |
| DNLI | Dialogue Inference | English | Evaluation | [https://www.aclweb.org/anthology/P19-1363.pdf](https://www.aclweb.org/anthology/P19-1363.pdf) |

* **For Persona-Sparse Experiments**

| Dataset | Type | Language | Usage | Where to Download |
| ---- | ---- | ---- | ---- | ---- |
| ECDT2019 PersonalDialog | Dialogue Generation | Chinese | Training | [https://arxiv.org/pdf/1901.09672.pdf](https://arxiv.org/pdf/1901.09672.pdf) dialogues\_train.json & test\_data\_random.json & test\_data\_biased.json |
| CMNLI | Non-dialogue Inference | Chinese | Training | [https://github.com/CLUEbenchmark/CLUECorpus2020/](https://github.com/CLUEbenchmark/CLUECorpus2020/) entailment & contradiction |
| KvPI | Dialogue Inference | Chinese | Evaluation | [https://github.com/songhaoyu/KvPI](https://github.com/songhaoyu/KvPI) |

* **Download Pre-trained BERT**

  The BoB model is initialized from public BERT checkpoints:

  * **English BERT**: [https://huggingface.co/bert-base-uncased/tree/main](https://huggingface.co/bert-base-uncased/tree/main)
  * **Chinese BERT**: [https://huggingface.co/bert-base-chinese/tree/main](https://huggingface.co/bert-base-chinese/tree/main)

## 2. How to Run

The `setup.sh` script lists the dependencies needed to run this project; simply running `./setup.sh` will install them.

Here we take the English PersonaChat dataset as an example to illustrate how to run the dialogue generation experiments.
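Before starting, it may help to sanity-check that the downloaded BERT checkpoint loads correctly. Below is a minimal sketch (not part of this repository) that assumes 🤗 Transformers is installed (e.g., via `setup.sh`) and that the English checkpoint was saved under `./pretrained_models/bert/bert-base-uncased/`, the path used by the commands in the following steps:

```
# Hypothetical sanity check -- not part of the released code.
# Loads the locally downloaded English BERT checkpoint with Hugging Face Transformers.
from transformers import BertModel, BertTokenizer

bert_path = "./pretrained_models/bert/bert-base-uncased/"  # adjust if stored elsewhere

tokenizer = BertTokenizer.from_pretrained(bert_path)
model = BertModel.from_pretrained(bert_path)

# Encode one PersonaChat-style utterance and run it through the encoder.
inputs = tokenizer("i am employed by the us postal service.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```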
Generally, there are three steps: **tokenization**, **training**, and **inference**.

* **Preprocessing**

```
python preprocess.py --dataset_type convai2 \
    --trainset ./data/ConvAI2/train_self_original_no_cands.txt \
    --testset ./data/ConvAI2/valid_self_original_no_cands.txt \
    --nliset ./data/ConvAI2/ \
    --encoder_model_name_or_path ./pretrained_models/bert/bert-base-uncased/ \
    --max_source_length 64 \
    --max_target_length 32
```

We have provided some data examples (dozens of lines) in the `./data` directory to show the data format. `preprocess.py` reads the different datasets and tokenizes the raw data into a series of vocab IDs to facilitate model training. The `--dataset_type` can be either `convai2` (for English PersonaChat) or `ecdt2019` (for Chinese PersonalDialog). Finally, the tokenized data are saved as a series of JSON files.

* **Model Training**

```
CUDA_VISIBLE_DEVICES=0 python bertoverbert.py --do_train \
    --encoder_model ./pretrained_models/bert/bert-base-uncased/ \
    --decoder_model ./pretrained_models/bert/bert-base-uncased/ \
    --decoder2_model ./pretrained_models/bert/bert-base-uncased/ \
    --save_model_path checkpoints/ConvAI2/bertoverbert --dataset_type convai2 \
    --dumped_token ./data/ConvAI2/convai2_tokenized/ \
    --learning_rate 7e-6 \
    --batch_size 32
```

Here we initialize the encoder and both decoders from the same downloaded BERT checkpoint (see the illustrative sketch at the end of this section). More parameter settings can be found in `bertoverbert.py`.

* **Evaluations**

```
CUDA_VISIBLE_DEVICES=0 python bertoverbert.py --dumped_token ./data/ConvAI2/convai2_tokenized/ \
    --dataset_type convai2 \
    --encoder_model ./pretrained_models/bert/bert-base-uncased/ \
    --do_evaluation --do_predict \
    --eval_epoch 7
```

Empirically, in the PersonaChat experiment with the default hyperparameter settings, the best-performing checkpoint should be found between epoch 5 and epoch 9. If the training procedure goes fine, you should see results like:

```
Perplexity on test set is 21.037 and 7.813.
```

where `21.037` is the perplexity from the first decoder and `7.813` is the final perplexity from the second decoder. The generated results are redirected to `test_result.tsv`; here is an example generated by the above checkpoint:

```
persona:i'm terrified of scorpions. i am employed by the us postal service. i've a german shepherd named barnaby. my father drove a car for nascar.
query:sorry to hear that. my dad is an army soldier.
gold:i thank him for his service.
response_from_d1:that's cool. i'm a train driver.
response_from_d2:that's cool. i'm a bit of a canadian who works for america.
```

where `d1` and `d2` are the two BERT decoders, respectively.

* **Computing Infrastructure**

  The released code was tested on **NVIDIA Tesla V100 32G** and **NVIDIA PCIe A100 40G** GPUs. Note that with `batch_size=32`, the BoB model needs at least 20 GB of GPU memory for training.
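As a side note on the training step above, "initializing the encoder and both decoders from the same BERT checkpoint" can be pictured with the generic 🤗 Transformers encoder-decoder API. The sketch below is only an illustration under that assumption; the actual BoB model is defined in `bertoverbert.py`, where the second decoder additionally attends over the first decoder's output to improve persona consistency:

```
# Illustrative sketch only -- the released model code lives in bertoverbert.py.
from transformers import BertLMHeadModel, EncoderDecoderModel

bert_path = "./pretrained_models/bert/bert-base-uncased/"

# BERT encoder + first (response) decoder, both warm-started from the same checkpoint.
enc_dec = EncoderDecoderModel.from_encoder_decoder_pretrained(bert_path, bert_path)

# A second BERT-based decoder, also initialized from the same checkpoint, with causal
# masking and cross-attention enabled (the cross-attention layers are newly initialized).
decoder2 = BertLMHeadModel.from_pretrained(
    bert_path, is_decoder=True, add_cross_attention=True
)

print(sum(p.numel() for p in enc_dec.parameters()))   # encoder + decoder 1
print(sum(p.numel() for p in decoder2.parameters()))  # decoder 2
```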
## MISC

* Built upon 🤗 [Transformers](https://github.com/huggingface/transformers).
* Bibtex:

	@inproceedings{song-etal-2021-bob,
	    title = "{B}o{B}: {BERT} Over {BERT} for Training Persona-based Dialogue Models from Limited Personalized Data",
	    author = "Song, Haoyu  and
	      Wang, Yan  and
	      Zhang, Kaiyan  and
	      Zhang, Wei-Nan  and
	      Liu, Ting",
	    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
	    month = aug,
	    year = "2021",
	    address = "Online",
	    publisher = "Association for Computational Linguistics",
	    url = "https://aclanthology.org/2021.acl-long.14",
	    doi = "10.18653/v1/2021.acl-long.14",
	    pages = "167--177",
	}
	
* Email: *hysong@ir.hit.edu.cn*.