Compare commits


No commits in common. "c1c0c663ecf7a4de90db1bc2f8d4e2d38a4f93b4" and "c87aed3ce094f3ec19ae144d6e6fd010e34d7c57" have entirely different histories.

4 changed files with 6 additions and 26 deletions

README.md

@@ -1,19 +0,0 @@
----
-license: bsd-3-clause
-tags:
-- audio-classification
----
-
-# Audio Spectrogram Transformer (fine-tuned on AudioSet)
-
-Audio Spectrogram Transformer (AST) model fine-tuned on AudioSet. It was introduced in the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Gong et al. and first released in [this repository](https://github.com/YuanGongND/ast).
-
-Disclaimer: The team releasing Audio Spectrogram Transformer did not write a model card for this model so this model card has been written by the Hugging Face team.
-
-## Model description
-
-The Audio Spectrogram Transformer is equivalent to [ViT](https://huggingface.co/docs/transformers/model_doc/vit), but applied on audio. Audio is first turned into an image (as a spectrogram), after which a Vision Transformer is applied. The model gets state-of-the-art results on several audio classification benchmarks.
-
-## Usage
-
-You can use the raw model for classifying audio into one of the AudioSet classes. See the [documentation](https://huggingface.co/docs/transformers/main/en/model_doc/audio-spectrogram-transformer#transformers.ASTForAudioClassification.forward.example) for more info.
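As a concrete illustration of the Usage section in the deleted model card above, here is a minimal classification sketch using the transformers AST classes the card links to (ASTFeatureExtractor, ASTForAudioClassification, available since the 4.25 release named in the config below). The checkpoint id is an assumption standing in for this repository, and the silent waveform is a placeholder input.

```python
import torch
from transformers import ASTFeatureExtractor, ASTForAudioClassification

# Checkpoint id is illustrative; substitute the actual repository id.
ckpt = "MIT/ast-finetuned-audioset-10-10-0.4593"
feature_extractor = ASTFeatureExtractor.from_pretrained(ckpt)
model = ASTForAudioClassification.from_pretrained(ckpt)

# One second of silence at 16 kHz stands in for a real waveform.
waveform = torch.zeros(16000).numpy()
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Highest-scoring of the 527 AudioSet classes.
print(model.config.id2label[int(logits.argmax(-1))])
```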

config.json

@ -1,8 +1,9 @@
{ {
"architectures": [ "architectures": [
"ASTForAudioClassification" "AudioSpectrogramTransformerForSequenceClassification"
], ],
"attention_probs_dropout_prob": 0.0, "attention_probs_dropout_prob": 0.0,
"frequency_dimension": 128,
"frequency_stride": 10, "frequency_stride": 10,
"hidden_act": "gelu", "hidden_act": "gelu",
"hidden_dropout_prob": 0.0, "hidden_dropout_prob": 0.0,
@@ -1068,13 +1069,12 @@
     "Zither": 150
   },
   "layer_norm_eps": 1e-12,
-  "max_length": 1024,
   "model_type": "audio-spectrogram-transformer",
   "num_attention_heads": 12,
   "num_hidden_layers": 12,
-  "num_mel_bins": 128,
   "patch_size": 16,
   "qkv_bias": true,
+  "time_dimension": 1024,
   "time_stride": 10,
   "torch_dtype": "float32",
   "transformers_version": "4.25.0.dev0"

preprocessor_config.json

@ -1,8 +1,7 @@
{ {
"do_normalize": true, "do_normalize": true,
"feature_extractor_type": "ASTFeatureExtractor", "feature_extractor_type": "AudioSpectrogramTransformerFeatureExtractor",
"feature_size": 1, "feature_size": 1,
"max_length": 1024,
"mean": -4.2677393, "mean": -4.2677393,
"num_mel_bins": 128, "num_mel_bins": 128,
"padding_side": "right", "padding_side": "right",

pytorch_model.bin (Stored with Git LFS)

Binary file not shown.