GLiNER-bi-base-v2.0 → LiteRT

LiteRT (.tflite) exports of knowledgator/gliner-bi-base-v2.0 for zero-shot NER on Android (Pixel-class hardware) via the LiteRT runtime. Matches upstream numerics at float32 precision (max diff 2.8e-05 vs PyTorch reference, over diverse test inputs).

Architecture

GLiNER-bi-v2 is a bi-encoder span-classification NER model:

  • Text encoder: jhu-clsp/ettin-encoder-150m (ModernBERT). Output hidden dim 768 (projected to 512).
  • Labels encoder: BAAI/bge-small-en-v1.5 (BERT). Output hidden dim 384 (projected to 512). Mean-pooled to produce one embedding per entity-type string.
  • Span head: enumerates all (start, width) spans up to width 12, projects via 2-layer MLPs, scores against each label embedding via dot product.
  • Output: logits[batch, word_idx, width_idx, label_idx] β€” apply sigmoid
    • threshold to decide if span is an entity of that type.

Note on the BiLSTM (if you're wondering why there's no LSTM here)

The upstream checkpoint contains trained BiLSTM weights, but BiEncoderSpanModel.forward in gliner/modeling/base.py never calls self.rnn(...). Verified four ways:

  1. Code grep: self.rnn(...) appears in BaseUniEncoderModel but not BaseBiEncoderModel.
  2. Forward hook: zero calls to inner.rnn.lstm during gm.predict_entities(...).
  3. Numerics: our no-LSTM wrapper gives diff = 0.000000 vs upstream inner(...).logits.
  4. Weight ablation: zeroing or randomizing LSTM weights produces bit-identical predictions (same entities, same scores to 4 decimals).

So the deployed model's accuracy is achieved without the LSTM. We don't ship it.
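The forward-hook check (item 2 above) is easy to reproduce. A self-contained toy sketch of the technique, using a stand-in module rather than the actual GLiNER classes (the class and variable names here are illustrative only):

```python
import torch
import torch.nn as nn

# Toy stand-in: a module that owns an LSTM but never calls it in forward(),
# mirroring the upstream BiEncoderSpanModel situation.
class DeadLSTMModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(8, 8, batch_first=True)  # weights exist but are dead
        self.proj = nn.Linear(8, 8)

    def forward(self, x):
        return self.proj(x)  # forward path skips self.lstm entirely

model = DeadLSTMModel()
calls = {"n": 0}
hook = model.lstm.register_forward_hook(lambda m, i, o: calls.update(n=calls["n"] + 1))
model(torch.randn(1, 4, 8))
hook.remove()
print(calls["n"])  # 0 -> the LSTM is never invoked
```

Against the real checkpoint, the same hook on `inner.rnn.lstm` during `gm.predict_entities(...)` fires zero times.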

Files

Text encoder (per-query hot path)

| file | size | purpose |
|---|---|---|
| text_encoder_seq256_fp32.tflite | 656.6 MB | baseline precision, seq_len 256 |
| text_encoder_seq256_int8.tflite | 182.6 MB | int8 weight-only, seq_len 256 |
| text_encoder_seq512_fp32.tflite | 677.2 MB | seq_len 512 |
| text_encoder_seq512_int8.tflite | 203.1 MB | seq_len 512 |
| text_encoder_seq1024_fp32.tflite | 718.4 MB | seq_len 1024 |
| text_encoder_seq1024_int8.tflite | 244.3 MB | seq_len 1024 |
| text_encoder_seq2048_fp32.tflite | 800.7 MB | seq_len 2048 |
| text_encoder_seq2048_int8.tflite | 326.6 MB | seq_len 2048 |
| text_encoder_seq4096_fp32.tflite | 965.3 MB | seq_len 4096 |
| text_encoder_seq4096_int8.tflite | 491.3 MB | seq_len 4096 |
| text_encoder_seq8192_fp32.tflite | 1043.0 MB | seq_len 8192 (ModernBERT max position 7999) |
| text_encoder_seq8192_int8.tflite | 568.9 MB | seq_len 8192 |

The seq_len is baked into each graph — pick the smallest variant ≥ your tokenized input length to minimize compute.
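That selection is a simple lookup; a minimal sketch, with the variant list taken from the table above:

```python
# Baked-in seq_len variants shipped in this repo (see the files table).
SEQ_VARIANTS = [256, 512, 1024, 2048, 4096, 8192]

def pick_variant(num_tokens: int) -> int:
    """Return the smallest baked-in seq_len that fits the tokenized input."""
    for s in SEQ_VARIANTS:
        if num_tokens <= s:
            return s
    raise ValueError(f"input of {num_tokens} tokens exceeds the largest variant (8192)")

print(pick_variant(300))   # 512
print(pick_variant(4096))  # 4096
```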

Labels encoder (one-shot at bridge startup)

| file | size | purpose |
|---|---|---|
| labels_encoder_fp32.tflite | 133.4 MB | encodes up to 25 entity-type strings in parallel |

Tokenizers

| dir | for |
|---|---|
| text_tokenizer/ | ModernBERT tokenizer — tokenize input text |
| labels_tokenizer/ | BERT tokenizer — tokenize entity-type strings |

Config

| file | purpose |
|---|---|
| gliner_config.json | reference config (max_types=25, max_width=12, span_mode=markerV0) |

Input / Output contract

Labels encoder

Inputs (fixed shape, [25, 32]):

  • input_ids int64 [N=25, L=32] β€” tokenized entity-type strings, padded to 32 tokens; pad rows with empty strings
  • attention_mask int64 [25, 32]

Output:

  • labels_embeds float32 [25, 384 (projected to 512)] β€” one embedding per label row

Run once per distinct label set. Cache the output tensor on the bridge. Unused rows (beyond your actual N) are ignored downstream.

Text encoder

Inputs (fixed shape at compile time, chosen by seq_len variant):

| name | shape | dtype | description |
|---|---|---|---|
| input_ids | [1, S] | int64 | tokenized text (ModernBERT tokenizer) |
| attention_mask | [1, S] | int64 | 1 for real tokens, 0 for padding |
| first_subword_positions | [1, S] | int64 | for each word slot, the sequence position of its first subword token; pad with 0 for unused slots |
| word_valid_mask | [1, S] | float32 | 1.0 for word slots [0..W_real), 0.0 for pad slots [W_real..S) |
| labels_embeds | [1, 25, 512] | float32 | precomputed by the labels encoder (add a leading batch dim to its [25, 512] output) |
| span_idx | [1, S*12, 2] | int64 | (start, end) word-index pairs for each candidate span |

Output:

  • logits float32 [1, S, 12, 25] β€” score for each (word_idx, width_idx, label_idx) triple. Apply sigmoid and threshold (typically 0.3–0.5) to decide entity presence. Only spans where span_idx[_, :, 0] < W_real AND span_idx[_, :, 1] < W_real are meaningful; the rest are computed but ignored downstream.

Numerics

Validated with ai_edge_litert.Interpreter (LiteRT desktop build, XNNPACK CPU):

| quant | max diff vs upstream PyTorch (valid spans, logit scale) | decision-flip rate @ threshold=0.3 |
|---|---|---|
| fp32 | 2.8e-05 (float32 accumulation noise) | 0 / 366 spans |
| int8 | 5.6e-01 (pre-sigmoid logits) | 0 / 366 spans (no score drift crosses the threshold) |

Test inputs were mixed English sentences (Elon Musk, Barack Obama, Satoshi Nakamoto, etc.) with 3-6 entity types each.

Usage sketch (Python reference)

import numpy as np
from transformers import AutoTokenizer
from ai_edge_litert.interpreter import Interpreter

# --- one-time setup ---
text_tok = AutoTokenizer.from_pretrained("ckg/gliner-bi-base-v20-litert", subfolder="text_tokenizer")
labels_tok = AutoTokenizer.from_pretrained("ckg/gliner-bi-base-v20-litert", subfolder="labels_tokenizer")

labels_interp = Interpreter(model_path="labels_encoder_fp32.tflite")
labels_interp.allocate_tensors()

text_interp = Interpreter(model_path="text_encoder_seq512_fp32.tflite")
text_interp.allocate_tensors()

# --- precompute label embeddings ONCE ---
labels = ["person", "organization", "location"]
LABELS_PAD = labels + [""] * (25 - len(labels))
enc = labels_tok(LABELS_PAD, padding="max_length", max_length=32,
                  truncation=True, return_tensors="np")
# NOTE: assumes input order (input_ids, attention_mask); in production, match
# get_input_details() entries by their "name" field instead of position.
labels_in0 = labels_interp.get_input_details()[0]["index"]
labels_in1 = labels_interp.get_input_details()[1]["index"]
labels_out = labels_interp.get_output_details()[0]["index"]
labels_interp.set_tensor(labels_in0, enc["input_ids"].astype(np.int64))
labels_interp.set_tensor(labels_in1, enc["attention_mask"].astype(np.int64))
labels_interp.invoke()
labels_embeds = labels_interp.get_tensor(labels_out)[None, ...]  # [1, 25, D]

# --- per-query inference ---
# ... tokenize text + precompute first_subword_positions + word_valid_mask + span_idx
# ... text_interp.set_tensor(...) for each of the 6 inputs
# ... invoke, read logits, sigmoid + threshold, decode spans to (start_char, end_char, label, score)
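The final decode step (sigmoid + threshold over valid spans) can be sketched independently of the runtime. `decode_spans` is an illustrative helper, `logits` below is a dummy array with the contract's shape, the 0.3 threshold follows the numerics table, and mapping word indices back to character offsets is left out:

```python
import numpy as np

def decode_spans(logits, labels, w_real, threshold=0.3):
    """logits: [1, S, 12, 25] raw scores -> list of (start_word, end_word, label, score)."""
    probs = 1.0 / (1.0 + np.exp(-logits[0]))       # sigmoid over all candidate spans
    hits = []
    for start, width, li in zip(*np.where(probs >= threshold)):
        end = start + width                         # inclusive end-word index
        if end < w_real and li < len(labels):       # drop pad spans / pad label rows
            hits.append((int(start), int(end), labels[li],
                         float(probs[start, width, li])))
    return sorted(hits, key=lambda h: -h[3])

# Toy check with one strong span: words 0-1 scored as "person".
logits = np.full((1, 512, 12, 25), -10.0, dtype=np.float32)
logits[0, 0, 1, 0] = 5.0
print(decode_spans(logits, ["person", "organization", "location"], w_real=7))
```

Overlapping hits can then be resolved greedily by score, which is the usual GLiNER post-processing step.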

Provenance

Converted with litert-torch 0.8.0 from the upstream knowledgator/gliner-bi-base-v2.0 checkpoint. No retraining, no weight modification — only a graph restructuring:

  1. Skip the dead BiLSTM (confirmed unused by the upstream forward path)
  2. Replace upstream's dynamic-shape torch.where-based word/prompt extraction with static gather from precomputed positions (first_subword_positions)
  3. Replace 4D einsum scoring with 2D bmm + view

Total params shipped: 194M (upstream) → ~187M in the tflites (labels encoder ~33M, text encoder ~150M, span head ~4M, prompt MLP ~1M). The ~1M unused BiLSTM weights are excluded.

License

Apache-2.0, propagated from knowledgator/gliner-bi-base-v2.0.
