Automatic correction of README.md metadata. Contact website@huggingface.co for any question

upload flax model
allow flax
2021-09-23 15:45:34 +02:00 · 2021-05-19 14:14:14 +00:00 · 2021-05-19 14:13:57 +00:00 · 2021-01-25 15:12:06 +00:00 · 2021-01-25 14:18:23 +00:00 · 2020-11-19 14:16:34 +01:00
8 changed files with 32062 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -6,3 +6,4 @@
 *.tar.gz filter=lfs diff=lfs merge=lfs -text
 *.ot filter=lfs diff=lfs merge=lfs -text
 *.onnx filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
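
The added `*.msgpack` line routes Flax weight files through Git LFS, like the other binary formats in this hunk. Below is a minimal sketch of how such patterns select files. It assumes that `fnmatch`-style globbing on the basename approximates gitattributes matching for slash-free patterns, and it uses only the four patterns visible in the hunk above (the full `.gitattributes` may list more). The helper name `is_lfs_tracked` is hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns visible in the .gitattributes hunk above; the real file may list more.
LFS_PATTERNS = ["*.tar.gz", "*.ot", "*.onnx", "*.msgpack"]


def is_lfs_tracked(path, patterns=LFS_PATTERNS):
    """Return True if the file's basename matches any of the given patterns.

    Approximation: slash-free gitattributes patterns are treated as
    shell-style globs applied to the basename only.
    """
    basename = path.rsplit("/", 1)[-1]
    return any(fnmatchcase(basename, pattern) for pattern in patterns)


if __name__ == "__main__":
    for name in ["flax_model.msgpack", "config.json", "README.md"]:
        print(name, is_lfs_tracked(name))
```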
--- a/README.md
+++ b/README.md
@@ -0,0 +1,46 @@
+---
+language: ja
+license: cc-by-sa-4.0
+datasets:
+- wikipedia
+widget:
+- text: 東北大学で[MASK]の研究をしています。
+---
+
+# BERT base Japanese (IPA dictionary, whole word masking enabled)
+
+This is a [BERT](https://github.com/google-research/bert) model pretrained on texts in the Japanese language.
+
+This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by WordPiece subword tokenization.
+Additionally, the model is trained with whole word masking enabled for the masked language modeling (MLM) objective.
+
+The code for pretraining is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v1.0).
+
+## Model architecture
+
+The model architecture is the same as the original BERT base model: 12 layers, 768-dimensional hidden states, and 12 attention heads.
+
+## Training Data
+
+The model is trained on Japanese Wikipedia as of September 1, 2019.
+To generate the training corpus, [WikiExtractor](https://github.com/attardi/wikiextractor) is used to extract plain text from a dump file of Wikipedia articles.
+The text files used for the training are 2.6GB in size, consisting of approximately 17M sentences.
+
+## Tokenization
+
+The texts are first tokenized by the [MeCab](https://taku910.github.io/mecab/) morphological parser with the IPA dictionary and then split into subwords by the WordPiece algorithm.
+The vocabulary size is 32000.
+
+## Training
+
+The model is trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.
+
+For the training of the MLM (masked language modeling) objective, we introduced **Whole Word Masking**, in which all of the subword tokens corresponding to a single word (tokenized by MeCab) are masked at once.
+
+## Licenses
+
+The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/).
+
+## Acknowledgments
+
+For training models, we used Cloud TPUs provided by the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc/) program.
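
The README above describes Whole Word Masking as masking all subword tokens of a word at once. The following is a minimal sketch of that idea, not the authors' pretraining code. It assumes the usual WordPiece convention that a `##` prefix marks a continuation piece, always substitutes `[MASK]` (the random/keep replacement used in BERT pretraining is omitted), and uses a hypothetical token sequence. The names `group_word_spans` and `whole_word_mask` are illustrative only.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def group_word_spans(tokens):
    """Group token indices into words; a '##' prefix continues the previous word."""
    spans = []
    for index, token in enumerate(tokens):
        if token.startswith("##") and spans:
            spans[-1].append(index)
        else:
            spans.append([index])
    return spans


def whole_word_mask(tokens, mask_prob=0.15, seed=0, mask_token="[MASK]"):
    """Mask whole words until roughly mask_prob of the tokens are masked.

    Simplification: every selected token becomes mask_token.
    """
    rng = random.Random(seed)
    candidates = [
        span for span in group_word_spans(tokens)
        if tokens[span[0]] not in SPECIAL_TOKENS
    ]
    rng.shuffle(candidates)

    budget = max(1, round(len(tokens) * mask_prob))
    masked = list(tokens)
    used = 0
    for span in candidates:
        if used >= budget:
            break
        if used + len(span) > budget:
            continue  # skip words that would exceed the masking budget
        for index in span:
            masked[index] = mask_token
        used += len(span)
    return masked


if __name__ == "__main__":
    # Hypothetical segmentation, for illustration only.
    example = ["[CLS]", "東北", "大学", "で", "自然", "##言語", "##処理",
               "の", "研究", "[SEP]"]
    print(whole_word_mask(example, mask_prob=0.5, seed=1))
```

Because selection happens per word span rather than per token, a word such as `自然 ##言語 ##処理` is either masked in full or left untouched.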
--- a/config.json
+++ b/config.json
@@ -14,6 +14,7 @@
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
+  "tokenizer_class": "BertJapaneseTokenizer",
  "type_vocab_size": 2,
  "vocab_size": 32000
 }
--- a/flax_model.msgpack
+++ b/flax_model.msgpack
--- a/pytorch_model.bin
+++ b/pytorch_model.bin
--- a/tf_model.h5
+++ b/tf_model.h5
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
@@ -0,0 +1,5 @@
+{
+    "do_lower_case": false,
+    "word_tokenizer_type": "mecab",
+    "subword_tokenizer_type": "wordpiece"
+}
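
The `word_tokenizer_type` and `subword_tokenizer_type` settings above correspond to the two-stage tokenization the README describes: word segmentation first, then WordPiece on each word. The sketch below illustrates only the second stage, using greedy longest-match-first splitting. It assumes the words have already been segmented (a stand-in for MeCab output) and uses a tiny hypothetical vocabulary, not the model's real 32000-entry `vocab.txt`. The function names are illustrative.

```python
# Hypothetical vocabulary for illustration; the real vocab.txt has 32000 entries.
VOCAB = {"東北", "大学", "で", "自然", "##言語", "##処理", "の", "研究"}


def wordpiece(word, vocab, unk_token="[UNK]"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry '##'
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no prefix matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize(words, vocab):
    """Apply WordPiece to each pre-segmented word and flatten the result."""
    return [piece for word in words for piece in wordpiece(word, vocab)]


if __name__ == "__main__":
    # Stand-in for word-level segmentation output.
    words = ["東北", "大学", "で", "自然言語処理", "の", "研究"]
    print(tokenize(words, VOCAB))
```

Keeping word boundaries from the first stage is what lets the masking step treat all pieces of one word as a unit.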
--- a/vocab.txt
+++ b/vocab.txt
Author	SHA1	Message	Date
elishowk	ab68bf4a4d	Automatic correction of README.md metadata. Contact website@huggingface.co for any question	2021-09-23 15:45:34 +02:00
Patrick von Platen	5ae74ffc5c	upload flax model	2021-05-19 14:14:14 +00:00
Patrick von Platen	2b557f1f16	allow flax	2021-05-19 14:13:57 +00:00
Tohoku University	84425dd597	Correct the license metadata	2021-01-25 15:12:06 +00:00
Tohoku University	fb86e4760e	Create README.md	2021-01-25 14:18:23 +00:00
Julien Chaumond	855e787bb3	[config.json] tokenizer_class: BertJapaneseTokenizer	2020-11-19 14:16:34 +01:00
system	30931391d3	Update tokenizer_config.json	2020-06-07 12:26:07 +00:00
system	f9b8d28251	Update vocab.txt	2020-04-28 21:34:45 +00:00
system	234d4f6335	Update tf_model.h5	2020-04-28 21:34:41 +00:00
system	eadab49326	Update pytorch_model.bin	2020-04-28 21:34:36 +00:00