Qwen 2.5 leads open-source models in coding benchmarks (Qwen2.5-Coder-32B beats GPT-4o on HumanEval), has a 151K-token multilingual vocabulary that handles Vietnamese and Chinese well, and is a strong foundation for AI Agent workflows. This guide covers coding benchmarks, multilingual capabilities, Ollama/vLLM deployment, and ReAct agent, multi-agent pipeline, and structured output patterns.
Qwen 2.5: Mastering Code, Multilingual Tasks, and Building AI Agent Workflows
Alibaba's Qwen 2.5 is the open-source leader in coding and multilingual tasks — and the ideal foundation for complex AI Agent workflows. With models ranging from 0.5B to 72B and specialized variants (Coder, Math), Qwen 2.5 covers everything from edge devices to enterprise servers. This guide dives into coding benchmarks, Vietnamese capabilities, and production-grade AI Agent patterns.
1. Qwen 2.5 Model Family Overview
Qwen 2.5 ships with multiple specialized variants:
| Model | Parameters | Characteristics | VRAM (Q4) |
|---|---|---|---|
| Qwen2.5-0.5B | 0.5B | Edge, mobile | ~0.5GB |
| Qwen2.5-1.5B | 1.5B | Edge, speculative draft | ~1.5GB |
| Qwen2.5-3B | 3B | Light tasks | ~2GB |
| Qwen2.5-7B | 7B | General purpose | ~5GB |
| Qwen2.5-14B | 14B | Strong reasoning | ~9GB |
| Qwen2.5-32B | 32B | Near-SOTA, balanced | ~20GB |
| Qwen2.5-72B | 72B | Top-tier open source | ~45GB |
| Qwen2.5-Coder-32B | 32B | Code specialist | ~20GB |
| Qwen2.5-Math-72B | 72B | Math specialist | ~45GB |
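The VRAM column can be sanity-checked with back-of-the-envelope arithmetic: weight memory ≈ parameters × bits per weight ÷ 8. The sketch below assumes roughly 4.8 bits per weight, about what llama.cpp's Q4_K_M format averages. That figure, and the use of the nominal sizes from the table rather than exact parameter counts, are assumptions. It estimates weights only. The KV cache and runtime buffers come on top.

```python
# Rough weight-memory estimate for quantized models (weights only, no KV cache).
# ASSUMPTION: ~4.8 bits per weight, roughly the llama.cpp Q4_K_M average.
def estimate_weight_gb(params_billion: float, bits_per_weight: float = 4.8) -> float:
    """Return approximate weight memory in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Nominal model sizes from the table
for name, size in [("0.5B", 0.5), ("1.5B", 1.5), ("3B", 3), ("7B", 7),
                   ("14B", 14), ("32B", 32), ("72B", 72)]:
    print(f"Qwen2.5-{name}: ~{estimate_weight_gb(size):.1f} GB")
```

This gives about 43 GB for 72B and 19 GB for 32B, close to the table's ~45GB and ~20GB. For the smallest models the estimate falls below the table's figures, which presumably include some runtime overhead.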
Key Architecture Points
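- GQA (Grouped Query Attention): several query heads share each key/value head, which shrinks the KV cache and helps with large batches and long context
- Context window: up to 128K tokens on the 7B and larger models (0.5B to 3B support 32K). The default configs are set to 32K, so using the full window requires enabling YaRN, and in Ollama `num_ctx` must be raised
- Vocab: 151,936 tokens, one of the largest among mainstream models, optimized for multilingual text such as Chinese and Vietnamese
- YaRN RoPE scaling: extrapolates RoPE position encoding beyond the native 32K window for long-context inference
- SwiGLU activation: the gated feed-forward activation that is standard in modern LLMs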
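To see why GQA matters for long context, here is a KV-cache size calculation. For every token, each layer stores one key vector and one value vector per KV head. The shape below (28 layers, 28 query heads sharing 4 KV heads, head dimension 128) is what we understand the published Qwen2.5-7B config to be. Treat these numbers as assumptions and check the model's `config.json`.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size for one sequence: K and V for every layer and KV head.

    bytes_per_elem=2 corresponds to an FP16/BF16 cache.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


GIB = 1024 ** 3
# ASSUMED Qwen2.5-7B shape: 28 layers, 28 query heads, 4 KV heads, head_dim 128
gqa = kv_cache_bytes(num_layers=28, num_kv_heads=4, head_dim=128, seq_len=131_072)
# Hypothetical full multi-head attention: one KV head per query head
mha = kv_cache_bytes(num_layers=28, num_kv_heads=28, head_dim=128, seq_len=131_072)
print(f"GQA (4 KV heads):  {gqa / GIB:.1f} GiB")
print(f"MHA (28 KV heads): {mha / GIB:.1f} GiB")
```

At the full 128K window this gives 7.0 GiB with GQA versus 49.0 GiB if every query head had its own K/V head. That is a 7x saving, and it applies to each concurrent sequence.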
6. Building AI Agent Workflows
Qwen 2.5 excels at agentic tasks: following complex instructions, calling tools, and maintaining context across many steps.
ReAct Agent
```python
import json
import subprocess

from openai import OpenAI

# Assumes a local vLLM OpenAI-compatible server on port 8000, launched with
# --enable-auto-tool-choice --tool-call-parser hermes so tool calls are parsed
client = OpenAI(base_url="http://localhost:8000/v1", api_key="qwen")

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search for information on the web",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "execute_python",
            "description": "Execute Python code and return the result",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read file contents",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    },
]


def run_agent(user_task: str, max_steps: int = 10) -> str:
    messages = [
        {
            "role": "system",
            "content": "You are an AI agent with tool access. Complete tasks by calling appropriate tools step by step.",
        },
        {"role": "user", "content": user_task},
    ]
    for _ in range(max_steps):
        response = client.chat.completions.create(
            model="qwen2.5-72b-instruct",  # must match the server's --served-model-name
            messages=messages,
            tools=tools,
            tool_choice="auto",
            temperature=0.1,  # Low temperature for stable agent behavior
        )
        msg = response.choices[0].message
        messages.append(msg)

        # Agent finishes when it stops calling tools
        if not msg.tool_calls:
            return msg.content

        # Execute tool calls and feed each result back to the model
        for tool_call in msg.tool_calls:
            result = execute_tool(
                tool_call.function.name,
                json.loads(tool_call.function.arguments),
            )
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": str(result),
            })
    return "Max steps reached"


def execute_tool(name: str, args: dict) -> str:
    if name == "search_web":
        # Integrate with Tavily, SerpAPI, etc.
        return f"[Search results for: {args['query']}]"
    elif name == "execute_python":
        # WARNING: runs model-generated code on the host; sandbox it in production
        result = subprocess.run(
            ["python3", "-c", args["code"]],
            capture_output=True, text=True, timeout=30,
        )
        return result.stdout or result.stderr
    elif name == "read_file":
        # Consider restricting paths to a working directory
        with open(args["path"]) as f:
            return f.read()
    return "Tool not found"


# Example usage
result = run_agent(
    "Analyze the sales_data.csv file and create a report with monthly revenue metrics."
)
print(result)
```
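The if/elif dispatcher raises on malformed arguments or a missing file, which aborts the whole agent loop. One alternative is a registry that turns every failure into an error string the model can read and recover from, and that confines `read_file` to a working directory. The sketch below follows that approach. The names `TOOL_REGISTRY` and `dispatch` and the `./workspace` root are hypothetical. Plug in your real search backend and a sandboxed executor.

```python
import json
from pathlib import Path
from typing import Callable, Dict

# HYPOTHETICAL sandbox root: read_file may only touch files under this directory
BASE_DIR = Path("./workspace").resolve()


def read_file(path: str) -> str:
    target = (BASE_DIR / path).resolve()
    if not target.is_relative_to(BASE_DIR):  # Path.is_relative_to needs Python 3.9+
        raise PermissionError(f"{path!r} is outside {BASE_DIR}")
    return target.read_text()


# Map tool names (as declared in `tools`) to plain Python callables
TOOL_REGISTRY: Dict[str, Callable[..., str]] = {
    "search_web": lambda query: f"[Search results for: {query}]",  # placeholder backend
    "read_file": read_file,
}


def dispatch(name: str, raw_arguments: str) -> str:
    """Run a tool call, reporting every failure as text instead of raising."""
    fn = TOOL_REGISTRY.get(name)
    if fn is None:
        return f"Error: unknown tool '{name}'"
    try:
        args = json.loads(raw_arguments)  # arguments arrive as a JSON string
        return str(fn(**args))
    except Exception as exc:  # bad JSON, wrong argument names, I/O errors, ...
        return f"Error: {type(exc).__name__}: {exc}"
```

Inside `run_agent`, the tool-call loop would then call `dispatch(tool_call.function.name, tool_call.function.arguments)` and append the returned string as the tool message. `execute_python` is deliberately left out of the registry until it runs in a real sandbox.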
7. Multi-Agent Pipeline
```python
from openai import OpenAI


class QwenAgent:
    def __init__(self, name: str, system_prompt: str, model: str = "qwen2.5:72b"):
        self.name = name
        self.system_prompt = system_prompt
        # Ollama's OpenAI-compatible endpoint; the SDK requires an api_key, Ollama ignores it
        self.client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
        self.model = model

    def run(self, task: str, context: str = "") -> str:
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\nTask: {task}"},
        ]
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=0.3,
        )
        return response.choices[0].message.content


# Define specialized agents
researcher = QwenAgent(
    "Researcher",
    "You are a research specialist. Collect and summarize information about the requested topic.",
)
coder = QwenAgent(
    "Coder",
    "You are a senior software engineer. Write clean Python code with tests and documentation.",
    model="qwen2.5-coder:32b",
)
reviewer = QwenAgent(
    "Reviewer",
    "You are a technical reviewer. Evaluate code for performance, security, and best practices.",
)


def run_pipeline(task: str) -> dict:
    # Step 1: Research
    research = researcher.run(f"Research and summarize: {task}")
    # Step 2: Code generation, using the research as context
    code = coder.run(f"Implement solution for: {task}", context=research)
    # Step 3: Code review
    review = reviewer.run("Review this code and suggest improvements", context=code)
    return {"research": research, "code": code, "review": review}


result = run_pipeline("Build a REST API endpoint for Vietnamese sentiment analysis")
```
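The pipeline runs each agent once, so the reviewer's suggestions never reach the coder. A possible extension, sketched below, loops review and revision until the reviewer approves or a round limit is hit. The `Agent` protocol, `run_pipeline_with_revision`, and the `APPROVED` reply convention are hypothetical. Any object with a `run(task, context)` method, such as `QwenAgent`, fits the protocol.

```python
from typing import Dict, Protocol


class Agent(Protocol):
    """Anything with the same run() signature as QwenAgent."""

    def run(self, task: str, context: str = "") -> str: ...


def run_pipeline_with_revision(task: str, researcher: Agent, coder: Agent,
                               reviewer: Agent, max_rounds: int = 2,
                               approve_token: str = "APPROVED") -> Dict[str, str]:
    research = researcher.run(f"Research and summarize: {task}")
    code = coder.run(f"Implement solution for: {task}", context=research)
    review = ""
    for _ in range(max_rounds):
        review = reviewer.run(
            f"Review this code. Reply {approve_token} if no changes are needed.",
            context=code,
        )
        if approve_token in review:
            break  # reviewer is satisfied
        # Send the review back to the coder, with the current code as context
        code = coder.run(f"Revise the code to address this review:\n{review}", context=code)
    # If max_rounds runs out, `review` describes the code before the last revision
    return {"research": research, "code": code, "review": review}
```

Checking for a substring is a deliberately simple approval signal. A reply like "not APPROVED" would pass it, so a structured verdict (see the next section) is more robust.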
8. Structured Output with Pydantic
```python
import json
from typing import List, Literal

from openai import OpenAI
from pydantic import BaseModel

# Ollama's OpenAI-compatible endpoint (matches the "qwen2.5:72b" model tag below)
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")


class TaskBreakdown(BaseModel):
    objective: str
    subtasks: List[str]
    estimated_complexity: Literal["low", "medium", "high"]  # other values fail validation
    dependencies: List[str]
    suggested_tech_stack: List[str]


def analyze_task(description: str) -> TaskBreakdown:
    response = client.chat.completions.create(
        model="qwen2.5:72b",
        messages=[
            {
                "role": "system",
                "content": "You are a technical architect. Analyze tasks and return JSON.",
            },
            {
                "role": "user",
                "content": f"""Analyze this task following the TaskBreakdown schema:
Task: {description}
Schema:
{{
  "objective": "string",
  "subtasks": ["string"],
  "estimated_complexity": "low|medium|high",
  "dependencies": ["string"],
  "suggested_tech_stack": ["string"]
}}
Return only pure JSON, no other text.""",
            },
        ],
        temperature=0.1,
        response_format={"type": "json_object"},  # JSON mode
    )
    data = json.loads(response.choices[0].message.content)
    return TaskBreakdown(**data)  # raises pydantic.ValidationError on schema mismatch


result = analyze_task("Build a RAG-powered customer service chatbot system")
print(f"Complexity: {result.estimated_complexity}")
print(f"Subtasks: {', '.join(result.subtasks)}")
```
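JSON mode makes syntactically valid JSON likely, but it does not guarantee that the schema is satisfied, and `TaskBreakdown(**data)` raises if a field is missing. A small retry wrapper can re-prompt with the error instead. The sketch below assumes you wrap the client call in a function that takes a prompt string and returns the reply text. `extract_json_object` and `call_with_retries` are hypothetical helpers. With pydantic you could pass `lambda d: TaskBreakdown(**d)` as `validate`, because pydantic's `ValidationError` subclasses `ValueError`.

```python
import json
from typing import Any, Callable, Dict


def extract_json_object(text: str) -> Dict[str, Any]:
    """Parse the outermost {...} span, tolerating prose around it."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in the reply")
    return json.loads(text[start:end + 1])


def call_with_retries(call_model: Callable[[str], str], prompt: str,
                      validate: Callable[[Dict[str, Any]], Any],
                      max_attempts: int = 3) -> Any:
    """Ask the model, validate its JSON, and re-prompt with the error on failure."""
    last_error = ""
    for _ in range(max_attempts):
        full_prompt = prompt
        if last_error:
            full_prompt += (f"\n\nYour previous reply was invalid ({last_error}). "
                            "Return only corrected JSON.")
        raw = call_model(full_prompt)
        try:
            return validate(extract_json_object(raw))
        except (ValueError, TypeError, KeyError) as exc:  # JSONDecodeError is a ValueError
            last_error = str(exc)
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts: {last_error}")
```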
9. Production Code Assistant
````python
from openai import OpenAI


class QwenCodeAssistant:
    def __init__(self):
        self.client = OpenAI(
            base_url="http://localhost:8001/v1",  # Coder model port
            api_key="qwen",
        )

    def generate_code(self, spec: str, language: str = "python") -> str:
        response = self.client.chat.completions.create(
            model="qwen2.5-coder-32b-instruct",
            messages=[
                {
                    "role": "system",
                    "content": f"You are an expert {language} developer. Write clean code with docstrings and unit tests.",
                },
                {"role": "user", "content": spec},
            ],
            temperature=0.1,
            max_tokens=4096,
        )
        return response.choices[0].message.content

    def review_code(self, code: str) -> str:
        response = self.client.chat.completions.create(
            model="qwen2.5-coder-32b-instruct",
            messages=[
                {
                    "role": "system",
                    "content": "Review code and identify: bugs, security issues, performance bottlenecks, style problems.",
                },
                {"role": "user", "content": f"```\n{code}\n```"},
            ],
            temperature=0.2,
        )
        return response.choices[0].message.content

    def explain_code(self, code: str, audience: str = "junior developer") -> str:
        response = self.client.chat.completions.create(
            model="qwen2.5-coder-32b-instruct",
            messages=[{
                "role": "user",
                "content": f"Explain this code for a {audience}:\n```\n{code}\n```",
            }],
            temperature=0.5,
        )
        return response.choices[0].message.content


assistant = QwenCodeAssistant()
code = assistant.generate_code(
    "Write a FastAPI endpoint to upload and analyze CSV files, returning basic statistics"
)
print(code)
````
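`generate_code` returns Markdown, usually with prose around one or more fenced code blocks. To save or run the generated code you first need to pull those blocks out. Below is a regex-based sketch. `extract_code_blocks` is a hypothetical helper, and nested fences are not handled.

```python
import re
from typing import List, Optional

# `{3} matches a run of three backticks (an opening or closing code fence)
FENCE_RE = re.compile(r"`{3}([\w+-]*)[^\n]*\n(.*?)`{3}", re.DOTALL)


def extract_code_blocks(markdown: str, language: Optional[str] = None) -> List[str]:
    """Return the bodies of fenced code blocks, optionally filtered by language tag."""
    blocks = []
    for lang, body in FENCE_RE.findall(markdown):
        if language is None or lang.lower() == language.lower():
            blocks.append(body.rstrip("\n"))
    return blocks
```

For example, `extract_code_blocks(code, "python")` keeps only the Python blocks from the assistant's answer, ready to write to a file or pass to `review_code`.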
10. Qwen 2.5 vs Other Models
| Criterion | Qwen 2.5-72B | Llama 3.3 70B | DeepSeek V3 | GPT-4o |
|---|---|---|---|---|
| Coding (HumanEval) | 88.2% | 88.4% | 91.6% | 90.2% |
| Specialized coding (Coder-32B) | 92.7% | N/A | N/A | N/A |
| Vietnamese | Best open-source | Good | Good | Best (API) |
| Chinese | Best | Weak | Good | Good |
| Agentic tasks | ✅ Excellent | ✅ Good | ✅ Good | ✅ Best |
| Context window | 128K | 128K | 64K | 128K |
| Self-host cost | Medium | Medium | High (671B) | N/A |
11. Model Selection by Use Case
| Use Case | Recommended Model | Reason |
|---|---|---|
| Code generation & review | Qwen2.5-Coder-32B | Top open-source coding model; beats GPT-4o on HumanEval (92.7% vs 90.2%) |
| Vietnamese/Chinese tasks | Qwen2.5-72B | 151K vocab, strong multilingual training data |
| Complex AI agents | Qwen2.5-72B | Strong tool calling, long context |
| Edge/mobile | Qwen2.5-1.5B or 3B | Compact, runs offline |
| Math reasoning | Qwen2.5-Math-72B | Specialized for mathematics |
| General API server | Qwen2.5-32B | Good performance/cost balance |
| Speculative draft | Qwen2.5-1.5B | Shares the tokenizer with 72B, so it can draft tokens for 72B to verify, speeding up generation |
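The speculative-draft pairing works because the Qwen 2.5 models share the same tokenizer, so a small model can propose tokens that the large model then checks. Below is a toy, pure-Python sketch of greedy speculative decoding. `target_next` and `draft_next` are hypothetical stand-ins for the 72B and 1.5B models' greedy next-token choices. Real engines, such as vLLM or assisted generation in Hugging Face transformers, verify all k draft tokens in one batched forward pass of the large model. This loop checks them one at a time only for readability, and counts each round as one verification pass.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]  # greedy next-token function (hypothetical)


def speculative_greedy(target_next: NextToken, draft_next: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of verification rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        # 1) The cheap draft model proposes k tokens autoregressively
        draft, ctx = [], list(out)
        for _ in range(k):
            tok = draft_next(ctx)
            draft.append(tok)
            ctx.append(tok)
        # 2) The target model verifies the proposals (one batched pass in real engines)
        rounds += 1
        ctx = list(out)
        for tok in draft:
            expected = target_next(ctx)
            if expected != tok:
                ctx.append(expected)  # first mismatch: keep the target's token, drop the rest
                break
            ctx.append(tok)  # draft token accepted
        else:
            ctx.append(target_next(ctx))  # all k accepted: target adds one bonus token
        out = ctx
    return out[:len(prompt) + max_new], rounds
```

Every kept token is one the target would have chosen greedily, so the output matches plain greedy decoding with the large model. The speed-up comes from needing fewer large-model passes whenever the draft model agrees with it.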
Conclusion
Qwen 2.5 stands out through three core strengths:
- Coding SOTA: Qwen2.5-Coder-32B is the best open-source model for coding, surpassing GPT-4o on HumanEval (92.7% vs 90.2%) and LiveCodeBench
- Multilingual leader: a 151K-token vocabulary and diverse training data make Qwen the top choice for Vietnamese and Chinese among self-hosted models
- Powerful AI agents: accurate tool calling, stable instruction following, and structured JSON output, all essential for production agent workflows

For enterprises that want to self-host the best multilingual model or build production-grade code assistants and AI agents, Qwen 2.5 is the leading choice in the open-source world.