Add size details

Lysandre 2021-01-13 15:06:44 +00:00 committed by huggingface-web
parent a1e0407207
commit 6b6560eaf5
1 changed file with 10 additions and 0 deletions


@@ -36,6 +36,16 @@ classifier using the features produced by the ALBERT model as inputs.
ALBERT is particular in that it shares its layers across its Transformer. Therefore, all layers have the same weights. Using repeating layers results in a small memory footprint; however, the computational cost remains similar to a BERT-like architecture with the same number of hidden layers, as it has to iterate through the same number of (repeating) layers.
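For illustration (not part of this card), the layer sharing is visible in the `transformers` configuration: many layers are executed, but only a single group of weights is stored and reused. A minimal sketch, assuming this card describes the `albert-base-v2` checkpoint:

```python
from transformers import AlbertConfig

# Assumption: the checkpoint name is albert-base-v2.
config = AlbertConfig.from_pretrained("albert-base-v2")

print(config.num_hidden_layers)  # 12 layers are run in the forward pass ...
print(config.num_hidden_groups)  # ... but only 1 group of weights is stored and repeated
```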
This is the second version of the base model. Version 2 is different from version 1 due to different dropout rates, additional training data, and longer training. It has better results in nearly all downstream tasks.
This model has the following configuration:
- 12 repeating layers
- 128 embedding dimension
- 768 hidden dimension
- 12 attention heads
- 11M parameters
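These size details can be checked against the published configuration. A short sketch, again assuming the `albert-base-v2` checkpoint name:

```python
from transformers import AlbertModel

# Assumption: the checkpoint name is albert-base-v2.
model = AlbertModel.from_pretrained("albert-base-v2")
cfg = model.config

print(cfg.num_hidden_layers)     # 12 repeating layers
print(cfg.embedding_size)        # 128 embedding dimension
print(cfg.hidden_size)           # 768 hidden dimension
print(cfg.num_attention_heads)   # 12 attention heads

# Rough parameter count; roughly matches the 11M figure above
# (the pooler adds a small amount on top).
print(sum(p.numel() for p in model.parameters()))
```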
## Intended uses & limitations
You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to