The Science of Detection:
Entropy & Burstiness
How we distinguish between the statistical predictability of machine generation and the chaotic creativity of human thought.
The Detection Pipeline
Normalization
Stripping zero-width characters and normalizing homoglyphs.
Tokenization
Breaking text into semantic units for vector embedding.
Entropy Scoring
Calculating randomness (Perplexity) and flow (Burstiness).
Verification
Cross-referencing against known LLM fingerprints.
Linguistic Entropy
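Entropy, reported here as perplexity, measures how unpredictable a text is to a language model: the lower the perplexity, the more closely each word matches what the model expected. AI models are designed to minimize surprise, so they tend to choose highly probable next words based on vast training data.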
Humans, conversely, are unpredictable. We use creative metaphors, slang, and non-linear logic that statistically “confuses” AI models. High perplexity usually indicates human authorship.
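To make the perplexity idea concrete, here is a minimal sketch that scores a sentence against a toy unigram model with add-one smoothing. This is an illustrative assumption, not CrossPlag's scoring method: a real detector would take token probabilities from a large language model, and the function name `unigram_perplexity` and its `corpus` argument are hypothetical.

```python
import math
from collections import Counter


def unigram_perplexity(text: str, corpus: str) -> float:
    """Perplexity of `text` under an add-one-smoothed unigram model fit on `corpus`.

    Toy stand-in for a language model: every unseen word shares one
    "unknown" slot in the vocabulary.
    """
    corpus_tokens = corpus.lower().split()
    counts = Counter(corpus_tokens)
    vocab_size = len(counts) + 1  # +1 for the shared "unknown word" slot
    total = len(corpus_tokens)

    tokens = text.lower().split()
    if not tokens:
        raise ValueError("text must contain at least one token")

    # Sum of log-probabilities of each token under the smoothed model.
    log_prob_sum = 0.0
    for tok in tokens:
        prob = (counts[tok] + 1) / (total + vocab_size)
        log_prob_sum += math.log(prob)

    # Perplexity is the exponential of the average negative log-probability.
    return math.exp(-log_prob_sum / len(tokens))
```

With a small corpus such as "the cat sat on the mat the dog sat on the rug", a sentence built from familiar words ("the cat sat on the mat") scores a lower perplexity than one made entirely of unseen words, which is the same contrast the detector relies on at a much larger scale.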
Structural Rhythm (Burstiness)
While Perplexity measures word choice, Burstiness measures variation in sentence structure. AI models tend to generate sentences of similar, mid-range length and standard structure, which makes the text read as monotonous.
Humans write in “bursts”: a short sentence followed by a long, complex one. This structural variation creates a rhythm that current LLMs struggle to replicate authentically without explicit prompting.
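One simple way to put a number on this rhythm is the coefficient of variation of sentence lengths (standard deviation divided by mean). The sketch below assumes that definition; it is one common proxy, not necessarily the formula our engine uses, and the helper names `sentence_lengths` and `burstiness` are hypothetical.

```python
import re
import statistics


def sentence_lengths(text: str) -> list[int]:
    """Word count of each sentence, splitting naively on ., ! and ?."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def burstiness(text: str) -> float:
    """Coefficient of variation of sentence lengths.

    0.0 means every sentence has the same length; larger values mean
    more variation between short and long sentences.
    """
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0  # variation is undefined for fewer than two sentences
    mean = statistics.mean(lengths)
    return statistics.pstdev(lengths) / mean
```

Three four-word sentences in a row score 0.0, while a one-word sentence followed by a thirteen-word sentence and another one-word sentence scores well above 1. The naive splitter will miscount abbreviations such as "e.g.", so a real pipeline would need a proper sentence segmenter.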
The Signature Engine
Beyond general entropy, CrossPlag maintains a database of Model Fingerprints. Each AI model (GPT, Claude, Gemini) has specific biases: preferred transition words, specific sentence starters, and refusal patterns.
Our engine scans for these micro-signatures not only to detect AI-generated text, but also to identify which model was likely used.
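As a rough sketch of how phrase-level fingerprint matching could work, the example below counts occurrences of marker phrases per model and reports the model with the most hits. Everything in it is assumed for illustration: the model labels, the phrases in `FINGERPRINTS`, and the function names are hypothetical placeholders, not entries from the actual fingerprint database.

```python
import re
from typing import Optional

# Hypothetical placeholder fingerprints, invented for this example only.
FINGERPRINTS = {
    "model_a": ["furthermore", "it is important to note"],
    "model_b": ["in summary", "as an ai"],
}


def fingerprint_hits(text: str) -> dict[str, int]:
    """Count whole-phrase matches of each model's marker phrases."""
    lowered = text.lower()
    hits = {}
    for model, phrases in FINGERPRINTS.items():
        hits[model] = sum(
            len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
            for phrase in phrases
        )
    return hits


def likely_model(text: str) -> Optional[str]:
    """Return the model with the most hits, or None if nothing matched."""
    hits = fingerprint_hits(text)
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None
```

Raw counts like these are only a starting point; a production system would presumably weight phrases by how distinctive they are and combine the result with the entropy scores rather than rely on it alone.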
Adversarial Robustness
Students and bad actors often attempt to bypass detection using “adversarial attacks.” Our engine includes a dedicated pre-processing layer to neutralize these evasion techniques before analysis begins.
Homoglyph Attacks
Replacing Latin characters (e.g., ‘a’, ‘o’) with visually identical Cyrillic counterparts. Our system maps these look-alike characters to a canonical form.
Zero-Width Spaces
Injecting invisible characters to break word tokenization. We strip all invisible formatting characters during the normalization phase.
Prompt Injection
Steering the generating model with prompts designed to confuse detectors (e.g., “Write with high burstiness”). We look for the artifacts such prompts leave in the output.
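The first two defenses above can be sketched with Python's standard `unicodedata` module. This is a minimal illustration under stated assumptions: the `HOMOGLYPHS` table is a small hand-picked subset of Cyrillic look-alikes, and the `normalize` function is a hypothetical name, not our production pre-processing layer.

```python
import unicodedata

# Illustrative subset of Cyrillic letters that render like Latin ones.
HOMOGLYPHS = str.maketrans({
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
    "\u0445": "x",  # Cyrillic small ha
})


def normalize(text: str) -> str:
    """Fold compatibility forms, drop invisible format characters, map look-alikes."""
    # NFKC folds compatibility variants such as full-width letters.
    text = unicodedata.normalize("NFKC", text)
    # Unicode category "Cf" (format) covers zero-width spaces and joiners
    # as well as the byte-order mark.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    # NFKC does not convert Cyrillic to Latin, so look-alikes need an explicit table.
    return text.translate(HOMOGLYPHS)
```

For example, `normalize("p\u0430ss\u200bw\u043erd")` returns `"password"`. Applied blindly, this mapping would also corrupt genuine Cyrillic text and strip joiners that some scripts and emoji sequences need, so a real system would likely restrict it to mixed-script words.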
Content Taxonomy
The four distinct classifications used in our Institutional Reports.
Pure Human
High perplexity and high burstiness. Contains natural inconsistencies and creative logic jumps.
AI-Edited
Human logic with AI-smoothed syntax. Often seen when tools like Grammarly are overused.
Paraphrased
AI text passed through re-phrasers (e.g., Quillbot). Identifiable by “unnatural smoothness” or synonym stuffing.
AI-Generated
Low perplexity and a monotonous rhythm. Closely matches the statistical patterns of major LLMs.