# GLiNER-bi-base-v2.0 → LiteRT
LiteRT (.tflite) exports of knowledgator/gliner-bi-base-v2.0
for zero-shot NER on Android (Pixel-class hardware) via the LiteRT runtime.
Matches upstream numerics at float32 precision (max diff 2.8e-05 vs
PyTorch reference, over diverse test inputs).
## Architecture
GLiNER-bi-v2 is a bi-encoder span-classification NER model:
- Text encoder: jhu-clsp/ettin-encoder-150m (ModernBERT). Output hidden dim 768 (projected to 512).
- Labels encoder: BAAI/bge-small-en-v1.5 (BERT). Output hidden dim 384 (projected to 512). Mean-pooled to produce one embedding per entity-type string.
- Span head: enumerates all (start, width) spans up to width 12, projects via 2-layer MLPs, scores against each label embedding via dot product.
- Output: `logits[batch, word_idx, width_idx, label_idx]` → apply sigmoid, then threshold to decide whether a span is an entity of that type.
### Note on the BiLSTM (if you're wondering why there's no LSTM here)
The upstream checkpoint contains trained BiLSTM weights, but `BiEncoderSpanModel.forward`
in `gliner/modeling/base.py` never calls `self.rnn(...)`. Verified four ways:
- Code grep: `self.rnn(...)` appears in `BaseUniEncoderModel` but not `BaseBiEncoderModel`.
- Forward hook: zero calls to `inner.rnn.lstm` during `gm.predict_entities(...)`.
- Numerics: our no-LSTM wrapper gives diff = 0.000000 vs upstream `inner(...).logits`.
- Weight ablation: zeroing or randomizing LSTM weights produces bit-identical predictions (same entities, same scores to 4 decimals).
So the deployed model's accuracy is achieved without the LSTM. We don't ship it.
## Files

### Text encoder (per-query hot path)
| file | size | purpose |
|---|---|---|
| `text_encoder_seq256_fp32.tflite` | 656.6 MB | baseline precision, seq_len 256 |
| `text_encoder_seq256_int8.tflite` | 182.6 MB | int8 weight-only, seq_len 256 |
| `text_encoder_seq512_fp32.tflite` | 677.2 MB | seq_len 512 |
| `text_encoder_seq512_int8.tflite` | 203.1 MB | seq_len 512 |
| `text_encoder_seq1024_fp32.tflite` | 718.4 MB | seq_len 1024 |
| `text_encoder_seq1024_int8.tflite` | 244.3 MB | seq_len 1024 |
| `text_encoder_seq2048_fp32.tflite` | 800.7 MB | seq_len 2048 |
| `text_encoder_seq2048_int8.tflite` | 326.6 MB | seq_len 2048 |
| `text_encoder_seq4096_fp32.tflite` | 965.3 MB | seq_len 4096 |
| `text_encoder_seq4096_int8.tflite` | 491.3 MB | seq_len 4096 |
| `text_encoder_seq8192_fp32.tflite` | 1043.0 MB | seq_len 8192 (ModernBERT max position 7999) |
| `text_encoder_seq8192_int8.tflite` | 568.9 MB | seq_len 8192 |
The seq_len is baked into each graph; pick the smallest variant whose seq_len is ≥ your tokenized input length to minimize compute.
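That selection rule can be written as a small helper (a sketch; the int8/fp32 suffix and file-name assembly are left to the caller):

```python
SEQ_VARIANTS = [256, 512, 1024, 2048, 4096, 8192]  # compiled seq_len options listed above

def pick_variant(n_tokens: int) -> int:
    """Return the smallest compiled seq_len that fits the tokenized input."""
    for s in SEQ_VARIANTS:
        if n_tokens <= s:
            return s
    raise ValueError(f"input of {n_tokens} tokens exceeds the largest variant (8192)")

print(pick_variant(300))   # 512
print(pick_variant(4096))  # 4096
```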
### Labels encoder (one-shot at bridge startup)

| file | size | purpose |
|---|---|---|
| `labels_encoder_fp32.tflite` | 133.4 MB | encodes up to 25 entity-type strings in parallel |
### Tokenizers

| dir | for |
|---|---|
| `text_tokenizer/` | ModernBERT tokenizer → tokenize input text |
| `labels_tokenizer/` | BERT tokenizer → tokenize entity-type strings |
### Config

| file | purpose |
|---|---|
| `gliner_config.json` | reference config (max_types=25, max_width=12, span_mode=markerV0) |
## Input / Output contract

### Labels encoder
Inputs (fixed shape, `[25, 32]`):

- `input_ids` `int64[N=25, L=32]` → tokenized entity-type strings, padded to 32 tokens; pad unused rows with empty strings
- `attention_mask` `int64[25, 32]`

Output:

- `labels_embeds` `float32[25, 512]` → one projected embedding per label row (BERT hidden dim 384, projected to 512)
Run once per distinct label set. Cache the output tensor on the bridge. Unused rows (beyond your actual N) are ignored downstream.
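The run-once-and-cache pattern amounts to memoizing on the label tuple. A minimal sketch, where `encode_labels` is a stand-in for the labels-encoder invocation shown in the usage section below:

```python
_label_cache: dict = {}

def cached_labels_embeds(labels, encode_labels):
    """Invoke the labels encoder once per distinct label set, then reuse the tensor."""
    key = tuple(labels)  # label order matters: it fixes the label_idx axis downstream
    if key not in _label_cache:
        _label_cache[key] = encode_labels(labels)
    return _label_cache[key]

# usage: embeds = cached_labels_embeds(["person", "location"], encode_labels_fn)
```

Keying on the ordered tuple (rather than a set) is deliberate: the row order of `labels_embeds` determines which `label_idx` each logit column refers to.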
### Text encoder

Inputs (fixed shape at compile time, chosen by seq_len variant):

| name | shape | dtype | description |
|---|---|---|---|
| `input_ids` | `[1, S]` | int64 | tokenized text (ModernBERT tokenizer) |
| `attention_mask` | `[1, S]` | int64 | 1 for real tokens, 0 for padding |
| `first_subword_positions` | `[1, S]` | int64 | for each word slot, the seq position of its first subword token; pad with 0 for unused slots |
| `word_valid_mask` | `[1, S]` | float32 | 1.0 for word slots [0..W_real), 0.0 for pad slots [W_real..S) |
| `labels_embeds` | `[1, 25, 512]` | float32 | precomputed from the labels encoder (broadcast a batch dim onto the [25, D] output) |
| `span_idx` | `[1, S*12, 2]` | int64 | (start, end) word-index pairs for each candidate span |
Output:

- `logits` `float32[1, S, 12, 25]` → score for each (word_idx, width_idx, label_idx) triple. Apply `sigmoid` and threshold (typically 0.3–0.5) to decide entity presence. Only spans where `span_idx[_, :, 0] < W_real` AND `span_idx[_, :, 1] < W_real` are meaningful; the rest are computed but ignored downstream.
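The three precomputed inputs can be built from the fast tokenizer's per-token word indices. A sketch under stated assumptions: `word_ids` is the HF fast-tokenizer `word_ids()` list (`None` for special/pad tokens), `S` is shortened here for illustration, and spans are generated as (start, start + width) word pairs:

```python
import numpy as np

S, MAXW = 16, 12  # compiled seq_len (toy value here), max span width

def precompute(word_ids):
    """Build first_subword_positions, word_valid_mask, span_idx for one query."""
    first_pos = np.zeros((1, S), dtype=np.int64)
    valid = np.zeros((1, S), dtype=np.float32)
    w = 0
    seen = set()
    for pos, wid in enumerate(word_ids):
        if wid is None or wid in seen:
            continue               # skip specials and non-first subwords
        seen.add(wid)
        first_pos[0, w] = pos      # seq position of this word's first subword
        valid[0, w] = 1.0
        w += 1
    # candidate spans: (start, end) word indices for every width 1..MAXW
    span_idx = np.zeros((1, S * MAXW, 2), dtype=np.int64)
    for start in range(S):
        for width in range(MAXW):
            span_idx[0, start * MAXW + width] = (start, start + width)
    return first_pos, valid, span_idx, w  # w == W_real

# e.g. word_ids = [None, 0, 0, 1, 2, None] -> 3 real words at seq positions 1, 3, 4
```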
## Numerics
Validated with ai_edge_litert.Interpreter (LiteRT desktop build, XNNPACK CPU):
| quant | max diff vs upstream PyTorch (valid spans, logit-scale) | decision-flip rate @ threshold=0.3 |
|---|---|---|
| fp32 | 2.8e-05 (consistent with float32 rounding) | 0 / 366 spans |
| int8 | 5.6e-01 (pre-sigmoid logits) | 0 / 366 spans (none of the score drifts cross the threshold) |
Test inputs were mixed English sentences (Elon Musk, Barack Obama, Satoshi Nakamoto, etc.) with 3-6 entity types each.
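The decision-flip metric in the table reduces to comparing thresholded sigmoids between two runs. A sketch (illustrative logit values, not the actual test set):

```python
import numpy as np

def flip_rate(ref_logits, test_logits, threshold=0.3):
    """Fraction of spans whose entity/non-entity decision differs between two runs."""
    sig = lambda x: 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=np.float64)))
    ref = sig(ref_logits) >= threshold
    test = sig(test_logits) >= threshold
    return float((ref != test).mean())

ref = np.array([2.0, -3.0, 0.5])
drifted = ref + np.array([0.4, -0.3, 0.2])  # int8-style drift that never crosses the threshold
print(flip_rate(ref, drifted))  # 0.0
```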
## Usage sketch (Python reference)
```python
import numpy as np
import torch
from transformers import AutoTokenizer
from ai_edge_litert.interpreter import Interpreter

# --- one-time setup ---
text_tok = AutoTokenizer.from_pretrained("ckg/gliner-bi-base-v20-litert", subfolder="text_tokenizer")
labels_tok = AutoTokenizer.from_pretrained("ckg/gliner-bi-base-v20-litert", subfolder="labels_tokenizer")
labels_interp = Interpreter(model_path="labels_encoder_fp32.tflite")
labels_interp.allocate_tensors()
text_interp = Interpreter(model_path="text_encoder_seq512_fp32.tflite")
text_interp.allocate_tensors()

# --- precompute label embeddings ONCE ---
labels = ["person", "organization", "location"]
LABELS_PAD = labels + [""] * (25 - len(labels))
enc = labels_tok(LABELS_PAD, padding="max_length", max_length=32,
                 truncation=True, return_tensors="np")
labels_in0 = labels_interp.get_input_details()[0]["index"]
labels_in1 = labels_interp.get_input_details()[1]["index"]
labels_out = labels_interp.get_output_details()[0]["index"]
labels_interp.set_tensor(labels_in0, enc["input_ids"].astype(np.int64))
labels_interp.set_tensor(labels_in1, enc["attention_mask"].astype(np.int64))
labels_interp.invoke()
labels_embeds = labels_interp.get_tensor(labels_out)[None, ...]  # [1, 25, D]

# --- per-query inference ---
# ... tokenize text + precompute first_subword_positions + word_valid_mask + span_idx
# ... text_interp.set_tensor(...) for each of the 6 inputs
# ... invoke, read logits, sigmoid + threshold, decode spans to (start_char, end_char, label, score)
```
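The final decode step elided above can be sketched as follows (NumPy only; `n_words` is W_real, the label list and threshold as above, and entities are returned as word-index spans rather than character offsets):

```python
import numpy as np

def decode(logits, labels, n_words, threshold=0.3):
    """logits: [1, S, 12, 25] -> list of (start_word, end_word, label, score)."""
    probs = 1.0 / (1.0 + np.exp(-logits[0]))  # [S, 12, 25]
    out = []
    for start in range(n_words):
        for width in range(probs.shape[1]):
            end = start + width
            if end >= n_words:        # span runs past the last real word
                break
            for k, label in enumerate(labels):
                score = float(probs[start, width, k])
                if score >= threshold:
                    out.append((start, end, label, score))
    return out

# toy check: one strong span at (word 0, width 0, label 1)
toy = np.full((1, 4, 12, 2), -10.0, dtype=np.float32)
toy[0, 0, 0, 1] = 5.0
print(decode(toy, ["person", "location"], n_words=2))
```

Mapping word indices back to character offsets is then a lookup into the tokenizer's word-to-char alignment.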
## Provenance

Converted with litert-torch 0.8.0 from the upstream knowledgator/gliner-bi-base-v2.0 checkpoint.
No retraining, no weight modification; only a graph restructuring:

- Skip the dead BiLSTM (confirmed unused by the upstream forward path)
- Replace upstream's dynamic-shape `torch.where`-based word/prompt extraction with static gather from precomputed positions (`first_subword_positions`)
- Replace 4D `einsum` scoring with 2D `bmm` + `view`
Total params shipped: 194M (upstream) → ~187M in the tflites (labels encoder ~33M, text encoder ~150M, span head ~4M, prompt MLP ~1M). The ~1M unused BiLSTM weights are excluded.
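The einsum-to-bmm rewrite can be checked numerically. A sketch in NumPy (illustrative random tensors; `matmul` plays the role of `bmm`, `reshape` the role of `view`):

```python
import numpy as np

B, W, H, K, D = 1, 6, 12, 25, 512
rng = np.random.default_rng(0)
span_rep = rng.standard_normal((B, W, H, D)).astype(np.float32)
labels = rng.standard_normal((B, K, D)).astype(np.float32)

# 4D einsum scoring, as in the upstream graph
ref = np.einsum("bwhd,bkd->bwhk", span_rep, labels)

# equivalent 2D batched matmul: flatten (W, H) into one axis, then view back
flat = span_rep.reshape(B, W * H, D)              # [B, W*H, D]
out = np.matmul(flat, labels.transpose(0, 2, 1))  # [B, W*H, K]
out = out.reshape(B, W, H, K)

print(np.abs(ref - out).max())  # ~0, up to float32 rounding
```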
## License
Apache-2.0, propagated from knowledgator/gliner-bi-base-v2.0.