How do I get all tokens in an AutoTokenizer?
To get all tokens from an `AutoTokenizer` in the Hugging Face Transformers library, you can use the `get_vocab()` method. This method returns a dictionary where the keys are the tokens and the values are their corresponding IDs in the tokenizer's vocabulary. Here’s how you can do it:
```python
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Get the vocabulary
vocab = tokenizer.get_vocab()

# Extract all tokens
tokens = list(vocab.keys())

# Print the tokens
print(tokens)
```
The output is a (truncated) list of every token in the vocabulary:

```
['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]', 'the', 'of', 'and', 'to', 'a', 'in', 'for', 'is', ...]
```
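Note that the dictionary returned by `get_vocab()` is not guaranteed to be ordered by token ID. If you want the tokens arranged so that position `i` corresponds to token ID `i`, you can sort by the ID values. Here is a minimal sketch, reusing the same `bert-base-uncased` tokenizer as above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
vocab = tokenizer.get_vocab()

# Sort the (token, id) pairs by id so that index i holds the token with ID i
tokens_by_id = [token for token, _ in sorted(vocab.items(), key=lambda item: item[1])]

print(tokens_by_id[:10])  # first ten tokens in ID order
print(len(tokens_by_id))  # total vocabulary size (30522 for bert-base-uncased)
```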
Testing if the previous output contains the string `get_vocab`: True