MiniMax M2.7

Early Echoes of Self-Evolution

56.22%
SWE-Pro Score
34%
Hallucination Rate
66.6%
MLE-Bench Medal Rate
30-50%
RL Research Automated

MiniMax · Released March 20, 2026 · First model to participate in its own training evolution

What is M2.7?

The First Model That Helped Build Itself

🧬 M2.7 is MiniMax's first model to deeply participate in its own evolution: updating its own memory, building its own skills, and improving its own training process.

🤖 Self-Evolving Architecture

During development, M2.7 was used to run its own RL experiments, build complex skills, and improve the training harness based on results.

⚙️ Proprietary LLM

Frontier-level closed model designed for AI agents, third-party harnesses, and tools like OpenClaw, Claude Code, and Kilo Code.

📈 Major Leap from M2.5

M2.5 (Feb 2026) focused on polyglot code mastery. M2.7 is built for real-world engineering: causal reasoning in live production systems.

🎯 Agent-First Design

Built to power agentic workflows: complex multi-step tasks, Agent Teams, dynamic tool search, and persistent memory at scale.

Goal: full autonomy in model training and inference architecture without human involvement.

How Self-Evolution Works

The Feedback Loop That Trained M2.7

1. Run Experiment: M2.7 runs a reinforcement learning experiment autonomously.
2. Short-Term Memory: It generates a markdown memory file capturing what happened.
3. Self-Criticism: It critiques its own results and identifies optimization directions.
4. Self-Optimize: The next round uses the full memory and feedback chain to improve.
5. Repeat: The cycle continues; models trained by M2.7 improve continuously over 24-hour runs.
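The five steps above amount to a simple loop: run, record, critique, adjust, repeat. As a toy illustration only (MiniMax has not published its harness code, so every function and file format here is hypothetical), the shape of that loop might look like:

```python
# Toy sketch of the five-step self-evolution loop. All names are illustrative
# stand-ins, not MiniMax's actual training harness.

def run_experiment(lr: float) -> float:
    """Step 1 stand-in for an RL run: score peaks at lr = 0.1 (made up)."""
    return 1.0 - abs(lr - 0.1)

def write_memory(step: int, lr: float, score: float) -> str:
    """Step 2: capture what happened as a short markdown note."""
    return f"## Run {step}\n- lr: {lr:.3f}\n- score: {score:.3f}\n"

def critique(history: list) -> float:
    """Step 3: compare the last two runs and pick a direction to move lr."""
    if len(history) < 2:
        return 0.01
    (prev_lr, prev_s), (lr, s) = history[-2], history[-1]
    step = lr - prev_lr
    return step if s > prev_s else -step  # keep going if improving, else reverse

memory_log = ""
history = []
lr = 0.01
for step in range(1, 21):                        # Step 5: repeat
    score = run_experiment(lr)                   # Step 1: run experiment
    memory_log += write_memory(step, lr, score)  # Step 2: short-term memory
    history.append((lr, score))
    lr += critique(history)                      # Steps 3-4: criticism feeds next run
```

Each pass appends to the memory log and nudges the next experiment; in this toy version the final run scores higher than the first, which is the whole point of the flywheel.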

๐Ÿ… MLE-Bench Results

3 trials ร— 24 hours each. Best run: 9 gold, 5 silver, 1 bronze medals. Average 66.6% medal rate โ€” second only to Claude Opus 4.6 (75.7%) and GPT-5.4 (71.2%). Ties with Gemini 3.1.

๐Ÿ“Š 30-50% Automation

M2.7 can now autonomously perform 30-50% of a reinforcement learning researcher's workflow โ€” data pipelines, training, infra, cross-team collab, persistent memory.

Benchmark Performance

Where M2.7 Stands Against the World's Best

56.22%
SWE-Pro (ties GPT-5.3-Codex)
1495
GDPval-AA Elo (Best OSS)
+1
AA-Omniscience (vs -40 for M2.5)
34%
Hallucination Rate

💻 Software Engineering

56.22% on SWE-Pro matches GPT-5.3-Codex, the strongest of the global competitors. The focus is real-world production debugging, not toy benchmarks.

📄 Professional Office

Elo 1495 on GDPval-AA document processing, the highest among open-source-accessible models globally.

🧠 Hallucination Rate

34% hallucination rate, significantly lower than Claude Sonnet 4.6 (46%) and Gemini 3.1 Pro Preview (50%). More reliable outputs.

🏆 MLE-Bench

66.6% medal rate on ML competitions ties Gemini 3.1, behind only Claude Opus 4.6 (75.7%) and GPT-5.4 (71.2%). Runs on a single A30 GPU.

📝 Omniscience Index

A massive leap: +1 on AA-Omniscience vs -40 for M2.5, virtually eliminating the factual errors that plagued the previous generation.

🤖 System Prompt Adherence

97% skill adherence rate with 40+ complex skills (each >2,000 tokens). Stays on task even with massive instruction sets.

Agent Capabilities

Built for Real-World Agentic Workflows

🤖 M2.7 is purpose-built as an agent backbone, not just a chatbot with tool use bolted on.

👥 Agent Teams

Coordinates multiple specialized sub-agents working in parallel; complex multi-step tasks are decomposed and executed autonomously.

🛠️ Complex Skills (40+)

Handles 40+ complex skills simultaneously with 97% adherence, each skill exceeding 2,000 tokens. Stays focused across massive instruction sets.

🔍 Dynamic Tool Search

Discovers and uses tools dynamically based on task requirements; it doesn't need pre-configured toolsets for every scenario.
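Conceptually, dynamic tool search replaces a hard-coded tool list with a lookup over tool descriptions at task time. This is not MiniMax's published mechanism; the registry and the word-overlap scoring below are purely illustrative:

```python
# Hypothetical sketch of dynamic tool search: rank a registry of tool
# descriptions against the task text instead of fixing the toolset up front.
# The tool names and descriptions are invented for illustration.

TOOLS = {
    "grep_logs": "search production log files for error patterns",
    "run_tests": "execute the project's unit test suite",
    "query_db": "run read-only SQL queries against the metrics database",
}

def search_tools(task: str, top_k: int = 1) -> list:
    """Score each tool by word overlap between the task and its description."""
    task_words = set(task.lower().split())
    scored = sorted(
        TOOLS,
        key=lambda name: len(task_words & set(TOOLS[name].split())),
        reverse=True,
    )
    return scored[:top_k]

print(search_tools("find error patterns in the production log files"))  # ['grep_logs']
```

A production system would use embeddings or a tool index rather than word overlap, but the contract is the same: the agent asks "which tools fit this task?" instead of being handed a fixed list.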

🧠 Persistent Memory

Maintains context across long sessions, updating its own memory files to retain state across multi-day agentic workflows.
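The mechanics of file-backed memory are straightforward to sketch. Assuming nothing about M2.7's actual format (the file name and JSON schema below are invented for illustration):

```python
# Minimal sketch of persistent agent memory as a JSON file on disk.
# File name and schema are illustrative assumptions, not M2.7's real format.
import json
import os

MEMORY_PATH = "agent_memory.json"

def load_memory() -> dict:
    """Read prior state if it exists; otherwise start a fresh memory."""
    if os.path.exists(MEMORY_PATH):
        with open(MEMORY_PATH) as f:
            return json.load(f)
    return {"sessions": []}

def save_session(note: str) -> None:
    """Append one session note and write the memory back to disk."""
    memory = load_memory()
    memory["sessions"].append(note)
    with open(MEMORY_PATH, "w") as f:
        json.dump(memory, f)

save_session("day 1: investigated flaky test in CI")
save_session("day 2: root cause traced to race in setup fixture")
print(len(load_memory()["sessions"]))  # 2 on a fresh run
```

Because state lives in a plain file rather than the context window, a workflow can be resumed days later by reloading the file at session start.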

🎭 Character Consistency

Strong character consistency and emotional intelligence: reliable persona maintenance across long conversations and role-based deployments.

🔗 Harness Integration

Designed as a backend for OpenClaw, Claude Code, Kilo Code, and other third-party agent harnesses: a drop-in frontier model.

Software Engineering Capabilities

Real-World Production Engineering, Not Toy Benchmarks

🐛 Production Debugging

Log analysis for bug hunting in live systems: causal reasoning across distributed logs to find root causes fast.

🔄 Refactoring

Understands large codebases holistically and refactors with awareness of downstream impact, not just local changes.

🔒 Code Security

Identifies security vulnerabilities, injection risks, and insecure patterns across real production codebases.

🤖 Machine Learning Code

Writes and debugs ML training pipelines, data preprocessing, and model evaluation code, with deep understanding of the full ML workflow.

📱 Android Development

Full Android development capability: Kotlin, Jetpack Compose, architecture patterns, and platform-specific debugging.

⚡ SWE-Pro: 56.22%

Matches GPT-5.3-Codex on the most challenging software engineering benchmark. A significant leap over M2.5, which focused on polyglot code breadth.

Hallucination & Reliability

The Most Reliable Frontier Model for Production Use

34%
M2.7 Hallucination Rate
46%
Claude Sonnet 4.6
50%
Gemini 3.1 Pro Preview
+1
AA-Omniscience (was -40)

📉 26% Lower Than Claude Sonnet

M2.7's 34% hallucination rate vs Claude Sonnet 4.6's 46% is a 12-point gap (a 26% relative reduction): a significant reliability advantage for production agentic deployments.

📉 32% Lower Than Gemini 3.1

Gemini 3.1 Pro Preview sits at a 50% hallucination rate; M2.7 is substantially more reliable for factual tasks and document processing.

🎯 Omniscience Leap

The AA-Omniscience Index jumped from -40 (M2.5) to +1 (M2.7), virtually eliminating the factual-error problem that limited the previous generation.

🏆 Best OSS Document Processing

Elo 1495 on GDPval-AA: the highest among open-source-accessible models globally for professional document understanding and processing.

M2.7 vs The Competition

Where MiniMax Fits in the Frontier Model Landscape

✅ M2.7 Wins On

  • Hallucination rate (34% vs 46-50%)
  • Document processing (best OSS Elo)
  • Self-evolving architecture (unique)
  • Agent harness integration
  • 97% skill adherence at 40+ skills
  • Cost efficiency vs closed models

⚖️ Ties With

  • Gemini 3.1: MLE-Bench 66.6%
  • GPT-5.3-Codex: SWE-Pro 56.22%
  • Frontier-level reasoning tasks

โŒ Behind On

  • MLE-Bench: Claude Opus 4.6 (75.7%) and GPT-5.4 (71.2%) still lead
  • Multimodal: vision/audio less mature than GPT-5.4
  • Brand recognition vs OpenAI/Anthropic
M2.7 is the most compelling frontier alternative for agent deployments โ€” lower hallucination, strong coding, open-accessible, and getting smarter by training itself.

Why MiniMax Matters

The Chinese AI Lab Punching Above Its Weight

🇨🇳 Chinese Frontier Lab

One of the most exciting Chinese AI startups: frontier-level LLMs with open-source licenses, competing directly with OpenAI and Anthropic.

🎬 Hailuo Video

Before M2.7, MiniMax built Hailuo, one of the best AI video generation models globally. The M2 series is their pivot to language/agent models.

⚡ Rapid Iteration

M2.5 launched in February 2026; M2.7 drops in March 2026. MiniMax is shipping faster than almost any other frontier lab.

🔓 Open-Accessible

Unlike pure closed models, M2.7 is accessible via API and integrates into open ecosystems: OpenClaw, Kilo Code, Claude Code.

🧬 Self-Evolution Vision

The end goal is clear: a model that trains itself without human involvement. M2.7 is step one. This is the most ambitious AI research agenda of 2026.

💰 Cost Efficiency

Frontier performance at a fraction of the cost of GPT-5.4 or Claude Opus 4.6, democratizing access to top-tier model capabilities.

The Road to Full Autonomy

What Self-Evolution Means for AI Development

M2.5 (Feb 2026): Polyglot code mastery; human-led fine-tuning, open weights.
M2.7 (Mar 2026): 30-50% of the RL workflow automated; the model participates in its own training.
M3.x (2026?): Full autonomy; the model trains itself end-to-end without human involvement.

🔄 The Flywheel

Better model → better RL experiments → better training → even better model. Each generation accelerates the next. The compounding effect is the breakthrough.
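To see why compounding matters, here is a back-of-envelope model built on an assumption that is ours, not MiniMax's: suppose each generation automates enough research work to cut the next iteration's wall-clock time by 30%.

```python
# Hypothetical flywheel arithmetic: each generation's cycle time shrinks
# by a fixed fraction (the 30% figure is assumed purely for illustration).
speedup_per_gen = 0.30
days = 30.0                      # assumed length of the first training cycle
timeline = []
for gen in range(5):
    timeline.append(round(days, 1))
    days *= 1 - speedup_per_gen  # next cycle is 30% shorter

print(timeline)  # [30.0, 21.0, 14.7, 10.3, 7.2]
```

Under this toy assumption, five generations compress a month-long cycle to about a week; that geometric shrinkage, not any single speedup, is what the flywheel argument rests on.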

⚡ Speed Advantage

If a model can run 30-50% of its own RL research, each iteration cycle is dramatically faster. Human researchers focus on direction, not execution.

🌐 Industry Implication

"The industry focus has shifted from simple chat interfaces to agentic workflows capable of executing complex, multi-step tasks without human intervention." M2.7 is at the frontier of that shift.

🎯 The Goal

Full autonomy in model training and inference architecture without human involvement. MiniMax is the only lab publicly pursuing this as an explicit product milestone.

Get Started with M2.7

Links & Resources

🌐 MiniMax M2.7

minimax.io/models/text/m27
API access and model documentation

📖 Official Blog

minimax.io/news/minimax-m27-en
"Early Echoes of Self-Evolution": the full technical writeup

📰 VentureBeat Coverage

"New MiniMax M2.7 proprietary AI model is self-evolving and can perform 30-50% of reinforcement learning research workflow"

🦞 OpenClaw Integration

M2.7 works as a drop-in backend for OpenClaw; configure via models.json for agent deployments on your existing infrastructure.
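As a concrete starting point, a minimal models.json could be generated like this. The field names, endpoint URL, and schema below are assumptions for illustration only; consult your harness's own documentation for its real configuration format:

```python
# Writes an illustrative models.json entry for an agent harness.
# CAUTION: the schema and URL here are invented examples, not a documented API.
import json

config = {
    "models": [
        {
            "name": "minimax-m2.7",
            "provider": "minimax",
            "api_base": "https://api.minimax.example/v1",  # placeholder endpoint
            "api_key_env": "MINIMAX_API_KEY",              # key read from an env var
        }
    ]
}

with open("models.json", "w") as f:
    json.dump(config, f, indent=2)
```

Keeping the API key in an environment variable rather than the config file keeps the file safe to commit alongside the rest of the deployment.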

🎬 Hailuo Video

hailuoai.video
MiniMax's AI video generation product: world-class video from text prompts.

boxmining.com | AI Tricks That Actually Work