rinna/japanese-gpt-1b is a model repository hosted on Hugging Face. License: MIT

---
language: ja
thumbnail: https://github.com/rinnakk/japanese-pretrained-models/blob/master/rinna.png
tags:
- ja
- japanese
- gpt
- text-generation
- lm
- nlp
license: mit
datasets:
- cc100
- wikipedia
- c4
widget:
- text: "西田幾多郎は、"
---

# japanese-gpt-1b

![rinna-icon](./rinna.png)

This repository provides a 1.3B-parameter Japanese GPT model. The model was trained by rinna Co., Ltd.

## How to use the model

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# use_fast=False: the model ships a sentencepiece tokenizer without a fast variant
tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt-1b", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt-1b")

if torch.cuda.is_available():
    model = model.to("cuda")

text = "西田幾多郎は、"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_length=100,
        min_length=100,
        do_sample=True,
        top_k=500,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        bad_words_ids=[[tokenizer.unk_token_id]]  # suppress <unk> in the output
    )

output = tokenizer.decode(output_ids.tolist()[0])
print(output)
# sample output: 西田幾多郎は、その主著の「善の研究」などで、人間の内面に自然とその根源があると指摘し、その根源的な性格は、この西田哲学を象徴しているとして、カントの「純粋理性批判」と「判断力批判」を対比して捉えます。それは、「人が理性的存在であるかぎりにおいて、人はその当人に固有な道徳的に自覚された善悪の基準を持っている」とするもので、この理性的な善悪の観念を否定するのがカントの
```
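The `generate` call above combines top-k and nucleus (top-p) sampling. As a rough illustration of the nucleus step only (a toy sketch over a 4-token vocabulary, not the `transformers` implementation):

```python
import math

def softmax(logits):
    # numerically stable softmax over a list of raw scores
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_filter(probs, p=0.95):
    # keep the smallest set of highest-probability tokens whose cumulative
    # mass reaches p, then renormalize over that set; sampling then draws
    # only from the surviving tokens
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

probs = softmax([3.0, 1.5, 1.0, -2.0])  # toy next-token scores
filtered = nucleus_filter(probs, p=0.9)
print(sorted(filtered))  # [0, 1, 2] — the lowest-probability token is dropped
```

With `top_p=0.95` and `top_k=500` as in the snippet above, both filters are applied before each sampling step.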

## Model architecture

A 24-layer, 2048-hidden-size transformer-based language model.
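These sizes are consistent with the 1.3B figure; a back-of-the-envelope parameter count (the vocabulary size below is an assumption for illustration — read the real value from `config.json`):

```python
n_layer, d_model = 24, 2048  # from the model card
vocab_size = 44928           # hypothetical; check config.json for the real value

# each transformer block holds roughly 4*d^2 attention weights + 8*d^2 MLP weights
block_params = 12 * n_layer * d_model ** 2
embedding_params = vocab_size * d_model
total = block_params + embedding_params
print(f"{total / 1e9:.2f}B")  # ≈ 1.30B
```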

## Training

The model was trained on Japanese C4, Japanese CC-100, and Japanese Wikipedia to optimize a traditional language modelling objective. It reaches around 14 perplexity on a chosen validation set from the same data.
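Perplexity is the exponential of the average per-token negative log-likelihood, so the figure above corresponds to a mean loss of roughly 2.6 nats per token. A quick illustration with hypothetical per-token losses:

```python
import math

# hypothetical per-token cross-entropy losses in nats
nlls = [2.55, 2.70, 2.60, 2.68]
ppl = math.exp(sum(nlls) / len(nlls))
print(round(ppl, 1))  # ≈ 13.9, i.e. "around 14"
```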

## Tokenization

The model uses a sentencepiece-based tokenizer. The vocabulary was first trained on a selected subset from the training data using the official sentencepiece training script, and then augmented with emojis and symbols.

## License

The MIT license.