diff --git a/README.md b/README.md
index 3a89cda..a57bd1a 100644
--- a/README.md
+++ b/README.md
@@ -42,7 +42,7 @@ interests you.
 
 Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
 to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
-generation you should look at model like GPT2.
+generation you should look at a model like GPT2.
 
 ### How to use
 
@@ -166,14 +166,14 @@ The RoBERTa model was pretrained on the reunion of five datasets:
 - [Stories](https://arxiv.org/abs/1806.02847) a dataset containing a subset of CommonCrawl data filtered to match the
   story-like style of Winograd schemas.
 
-Together theses datasets weight 160GB of text.
+Together these datasets weigh 160GB of text.
 
 ## Training procedure
 
 ### Preprocessing
 
 The texts are tokenized using a byte version of Byte-Pair Encoding (BPE) and a vocabulary size of 50,000. The inputs of
-the model take pieces of 512 contiguous token that may span over documents. The beginning of a new document is marked
+the model take pieces of 512 contiguous tokens that may span over documents. The beginning of a new document is marked
 with `<s>` and the end of one by `</s>`
 
 The details of the masking procedure for each sentence are the following:
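
For context on the preprocessing text touched by the second hunk, here is a minimal sketch of how the byte-level BPE tokenizer and the `<s>`/`</s>` document markers look in practice. It assumes the Hugging Face `transformers` package and the `roberta-base` checkpoint (neither is named in this patch); it is an illustration, not the model card's own usage example.

```python
# Sketch: inspect RoBERTa's byte-level BPE tokenizer and its special tokens.
# Assumes `pip install transformers` and the public `roberta-base` checkpoint.
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# Size of the byte-level BPE vocabulary (on the order of 50k entries).
print(tokenizer.vocab_size)

# The markers placed at the start and end of a document / sequence.
print(tokenizer.bos_token, tokenizer.eos_token)  # '<s>' '</s>'

# Encoding a sentence wraps it with those same markers.
encoded = tokenizer("Hello, I'm a RoBERTa model.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```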