How do I get all tokens in an `AutoTokenizer`?
To retrieve all tokens from a Hugging Face `AutoTokenizer`, access its vocabulary, which includes all known tokens (base tokens and any added tokens). Here's how to do it:

### Method 1: Using `get_vocab()`

This method returns a dictionary where keys are tokens and values are their corresponding IDs. Extract the keys to get all tokens.
```python
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Get the vocabulary dictionary (token -> ID)
vocab = tokenizer.get_vocab()

# Extract tokens (unordered)
all_tokens = list(vocab.keys())

# To get tokens sorted by their IDs
sorted_tokens = [token for token, _ in sorted(vocab.items(), key=lambda item: item[1])]
```
### Method 2: Using `convert_ids_to_tokens()`

Alternatively, iterate over all IDs and convert each back to its token. Note that `len(tokenizer)` includes any added tokens, while `tokenizer.vocab_size` counts only the base vocabulary, so the two can differ.

```python
# Get the total vocabulary size (including added tokens)
vocab_size = len(tokenizer)

# Retrieve tokens in ID order
tokens_in_order = [tokenizer.convert_ids_to_tokens(i) for i in range(vocab_size)]
```
```python
print("Total tokens:", len(all_tokens))
print("First 5 tokens:", sorted_tokens[:5])

# Output (for bert-base-uncased):
# Total tokens: 30522
# First 5 tokens: ['[PAD]', '[unused0]', '[unused1]', '[unused2]', '[unused3]']
```
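The sorted-by-ID pattern from Method 1 works on any token-to-ID dictionary, not just a real tokenizer's. Here is a minimal self-contained sketch using a toy vocabulary (so no model download is needed) to illustrate the sorting step:

```python
# Toy vocabulary standing in for tokenizer.get_vocab()
# (hypothetical tokens chosen for illustration only)
vocab = {"[PAD]": 0, "hello": 2, "world": 1}

# Same pattern as Method 1: sort (token, id) pairs by id, keep the tokens
sorted_tokens = [token for token, _ in sorted(vocab.items(), key=lambda item: item[1])]

print(sorted_tokens)  # ['[PAD]', 'world', 'hello']
```

The index of each token in `sorted_tokens` then equals its ID, which is convenient when you need to map IDs back to tokens without calling the tokenizer.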