Automatic correction of README.md metadata. Contact website@huggingface.co for any question

upload flax model
allow flax
8 changed files with 32062 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -6,3 +6,4 @@
 *.tar.gz filter=lfs diff=lfs merge=lfs -text
 *.ot filter=lfs diff=lfs merge=lfs -text
 *.onnx filter=lfs diff=lfs merge=lfs -text
 *.msgpack filter=lfs diff=lfs merge=lfs -text
--- a/README.md
+++ b/README.md
@@ -0,0 +1,46 @@
 ---
 language: ja
 license: cc-by-sa-4.0
 datasets:
 - wikipedia
 widget:
 - text: 東北大学で[MASK]の研究をしています。
 ---
 # BERT base Japanese (IPA dictionary, whole word masking enabled)
 This is a [BERT](https://github.com/google-research/bert) model pretrained on Japanese text.
 This version of the model processes input text with word-level tokenization based on the IPA dictionary, followed by WordPiece subword tokenization.
 Additionally, the model is trained with whole word masking enabled for the masked language modeling (MLM) objective.
 The pretraining code is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v1.0).
 ## Model architecture
 The model architecture is the same as the original BERT base model: 12 layers, 768-dimensional hidden states, and 12 attention heads.
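As a rough sanity check on the model size, the sketch below counts the encoder parameters implied by these numbers. The hidden size, layer count, vocabulary size, sequence length, and `type_vocab_size` come from this repository; the feed-forward width of 3072 and the pooler layer are assumptions carried over from the original BERT base design and are not stated here.

```python
# Back-of-the-envelope parameter count for a BERT-base-style encoder.
HIDDEN = 768          # hidden state size (from the model card)
LAYERS = 12           # number of Transformer layers (from the model card)
VOCAB = 32000         # vocabulary size (from the model card)
MAX_POSITIONS = 512   # tokens per instance (from the model card)
TYPE_VOCAB = 2        # segment types (type_vocab_size in config.json)
INTERMEDIATE = 3072   # ASSUMPTION: feed-forward width of the original BERT base


def linear(n_in, n_out):
    """Weights plus biases of one dense layer."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # scale and shift vectors

# Token, position, and segment embeddings, followed by one LayerNorm.
embeddings = (VOCAB + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + layer_norm

# Query, key, value, and output projections. The number of attention
# heads does not change this count, because the heads split HIDDEN.
attention = 4 * linear(HIDDEN, HIDDEN)
feed_forward = linear(HIDDEN, INTERMEDIATE) + linear(INTERMEDIATE, HIDDEN)
per_layer = attention + feed_forward + 2 * layer_norm

pooler = linear(HIDDEN, HIDDEN)  # ASSUMPTION: pooler as in the original BERT

total_params = embeddings + LAYERS * per_layer + pooler
print(f"{total_params:,}")
```

Under these assumptions the encoder comes to roughly 110M parameters, not counting the MLM prediction head.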
 ## Training Data
 The model is trained on Japanese Wikipedia as of September 1, 2019.
 To generate the training corpus, [WikiExtractor](https://github.com/attardi/wikiextractor) is used to extract plain text from a dump file of Wikipedia articles.
 The text files used for training are 2.6 GB in size and consist of approximately 17M sentences.
 ## Tokenization
 The text is first tokenized by the [MeCab](https://taku910.github.io/mecab/) morphological parser with the IPA dictionary and then split into subwords by the WordPiece algorithm.
 The vocabulary size is 32000.
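The second stage of this pipeline can be sketched as a greedy longest-match-first split, which is how WordPiece inference is usually described. This is a minimal illustration, not the tokenizer shipped with the model: `TOY_VOCAB` is a hypothetical toy vocabulary, and the `words` list is a hand-written stand-in for word-level segmentation, not actual MeCab output.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; the real vocab.txt has 32000 entries.
TOY_VOCAB = {"東北", "大学", "で", "自然", "##言語", "##処理", "を", "研究"}

# Hand-written word-level segmentation standing in for the MeCab stage.
words = ["東北", "大学", "で", "自然言語処理", "を", "研究"]

tokens = [piece for word in words for piece in wordpiece(word, TOY_VOCAB)]
print(tokens)
```

Words found whole in the vocabulary pass through unchanged, while `自然言語処理` is split into `自然`, `##言語`, and `##処理` with this toy vocabulary.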
 ## Training
 The model is trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.
 For the MLM (masked language modeling) objective, we introduced **Whole Word Masking**, in which all of the subword tokens corresponding to a single word (as tokenized by MeCab) are masked at once.
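The grouping idea behind whole word masking can be sketched as follows. This is a simplified illustration under stated assumptions, not the pretraining code: the 15% masking rate is the original BERT default and is not stated in this model card, every selected token is replaced by `[MASK]` (the original BERT recipe also keeps or randomizes some of them), and special tokens are ignored. The function name `whole_word_mask` and the token list are hypothetical.

```python
import random


def whole_word_mask(tokens, mask_prob=0.15, seed=0, mask_token="[MASK]"):
    """Mask whole words: a word is a token plus its '##' continuations."""
    rng = random.Random(seed)

    # Group token indices so that each group covers one word.
    groups = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)
        else:
            groups.append([i])

    # ASSUMPTION: target about 15% of tokens, and at least one.
    n_to_mask = max(1, round(len(tokens) * mask_prob))

    rng.shuffle(groups)
    masked = list(tokens)
    n_masked = 0
    for group in groups:
        if n_masked >= n_to_mask:
            break
        # Mask every subword of the chosen word together.
        for i in group:
            masked[i] = mask_token
        n_masked += len(group)
    return masked


# Hypothetical subword sequence; indices 3 to 5 form a single word.
tokens = ["東北", "大学", "で", "自然", "##言語", "##処理", "を", "研究"]
masked = whole_word_mask(tokens, seed=0)
print(masked)
```

Whatever word the random draw selects, its subwords are masked together, so the three pieces at indices 3 to 5 are either all masked or all left intact.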
 ## Licenses
 The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/).
 ## Acknowledgments
 For training the models, we used Cloud TPUs provided by the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc/) program.
--- a/config.json
+++ b/config.json
@@ -14,6 +14,7 @@
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "tokenizer_class": "BertJapaneseTokenizer",
  "type_vocab_size": 2,
  "vocab_size": 32000
 }
--- a/flax_model.msgpack
+++ b/flax_model.msgpack
--- a/pytorch_model.bin
+++ b/pytorch_model.bin
--- a/tf_model.h5
+++ b/tf_model.h5
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
@@ -0,0 +1,5 @@
 {
    "do_lower_case": false,
    "word_tokenizer_type": "mecab",
    "subword_tokenizer_type": "wordpiece"
 }
--- a/vocab.txt
+++ b/vocab.txt
Author	SHA1	Message	Date
elishowk	ab68bf4a4d	Automatic correction of README.md metadata. Contact website@huggingface.co for any question	2021-09-23 15:45:34 +02:00
Patrick von Platen	5ae74ffc5c	upload flax model	2021-05-19 14:14:14 +00:00
Patrick von Platen	2b557f1f16	allow flax	2021-05-19 14:13:57 +00:00
Tohoku University	84425dd597	Correct the license metadata	2021-01-25 15:12:06 +00:00
Tohoku University	fb86e4760e	Create README.md	2021-01-25 14:18:23 +00:00
Julien Chaumond	855e787bb3	[config.json] tokenizer_class: BertJapaneseTokenizer	2020-11-19 14:16:34 +01:00
system	30931391d3	Update tokenizer_config.json	2020-06-07 12:26:07 +00:00
system	f9b8d28251	Update vocab.txt	2020-04-28 21:34:45 +00:00
system	234d4f6335	Update tf_model.h5	2020-04-28 21:34:41 +00:00
system	eadab49326	Update pytorch_model.bin	2020-04-28 21:34:36 +00:00