MLC-LLM 端侧部署实践：在手机和浏览器上跑大模型

大模型只能在服务器上运行？MLC-LLM 打破了这个限制，让 7B 模型跑在手机和浏览器上！

什么是 MLC-LLM？

MLC-LLM（Machine Learning Compilation for LLM）是字节跳动开源的端侧大模型推理引擎：

特性	说明
📱 iOS	iPhone 15 Pro 可运行 7B 模型
🤖 Android	主流旗舰机支持
🌐 Web	Chrome/Firefox 浏览器运行
⚡ 高性能	接近原生应用体验

技术原理

1. 编译优化

MLC-LLM 使用 MLC Engineer 编译链：

┌─────────────────────────────────────────────────┐
│            MLC Engineer                         │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐      │
│  │  PyTorch │  │  TVM     │  │  Metal/  │      │
│  │  Model   │→ │  Compile │→ │  Vulkan  │      │
│  └──────────┘  └──────────┘  └──────────┘      │
│                      │                          │
│              ┌──────▼──────┐                   │
│              │   Codegen   │                   │
│              │   (GPU JS)  │                   │
│              └─────────────┘                   │
└─────────────────────────────────────────────────┘

2. 量化压缩

量化	模型大小	iPhone 15 Pro	Android (8Gen3)
FP16	14GB	❌	❌
INT4	3.8GB	✅ ~15 tok/s	✅ ~20 tok/s
INT3	2.8GB	✅ ~18 tok/s	✅ ~25 tok/s
INT2	1.9GB	✅ ~25 tok/s	✅ ~30 tok/s

3. 硬件适配

# 支持的后端
backends = {
    "ios": "Metal",      # Apple GPU
    "android": "Vulkan", # ARM Mali/Adreno
    "web": "WebGPU",     # 浏览器
    "linux": "CUDA",     # PC
}

iOS 部署

前提条件

Xcode 15+
iOS 17.0+
iPhone 15 Pro 及以上（或 iPad Pro M 芯片）

方式一：预编译 App（推荐）

# 使用开源 App：llama.cpp / MLC Chat
# App Store 搜索 "MLC Chat" 或 "llama.cpp"

方式二：自行编译

# 1. 克隆仓库
git clone --recursive https://github.com/mlc-ai/mlc-llm.git
cd mlc-llm

# 2. 安装 Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# 3. 构建 iOS
python3 build.py --platform ios \
    --model Llama-2-7b-chat-hf \
    --quantization q4f16_1

使用方式

// Swift 调用示例
import MLCChat

let chat = MLCChat(modelPath: "/models/llama-2-7b-q4")

// 异步生成
Task {
    for await token in await chat.generate("你好") {
        print(token, terminated: "")
    }
}

Android 部署

Gradle 依赖

// app/build.gradle
dependencies {
    implementation("ai.mlc:mlc4j:0.1.0")
}

Java 调用

import ai.mlc.mlc4j.MLCModel;
import ai.mlc.mlc4j.MLCChat;

// 加载模型
MLCModel model = MLCModel.fromAsset(context, "Llama-2-7b-q4");

// 创建对话实例
MLCChat chat = model.createChat();

// 生成
String response = chat.generate("你好");
System.out.println(response);

性能数据

设备	模型	量化	速度
iPhone 15 Pro	LLaMA-2-7B	INT4	~15 tok/s
iPhone 15 Pro	LLaMA-2-7B	INT3	~18 tok/s
小米 14 (8Gen3)	Qwen-2-7B	INT4	~25 tok/s
Pixel 8 Pro	LLaMA-2-7B	INT4	~12 tok/s

Web 部署（浏览器）

支持的浏览器

Chrome 113+（WebGPU）
Edge 113+
Firefox Nightly（WebGPU）

使用方式

<!DOCTYPE html>
<html>
<head>
  <title>MLC Web Demo</title>
  <script src="https://esm.run/@mlc-ai/web-llm@0.2.5"></script>
</head>
<body>
  <div id="chat"></div>
  <input type="text" id="input" placeholder="输入消息...">
  <button onclick="send()">发送</button>

  <script type="module">
    import * as webllm from "https://esm.run/@mlc-ai/web-llm@0.2.5";

    // 选择模型
    const selectedModel = "Llama-2-7b-chat-hf-q4f16_1-MLC";

    // 初始化
    const engine = await webllm.CreateMLCEngine(
      selectedModel,
      { initProgressCallback: (progress) => console.log(progress) }
    );

    // 对话
    async function send() {
      const input = document.getElementById("input").value;
      const messages = [{ role: "user", content: input }];
      
      const chunks = await engine.chat.completions.create({
        messages,
        stream: true,
      });

      for await (const chunk of chunks) {
        console.log(chunk.choices[0].delta.content);
      }
    }
  </script>
</body>
</html>

WebGPU 内存限制

// 检查 WebGPU 支持
if (!navigator.gpu) {
  console.error("WebGPU not supported");
}

// 查看显存限制
const adapter = await navigator.gpu.requestAdapter();
console.log("Max memory:", adapter.limits.maxStorageBufferBindingSize);

模型准备

1. 下载模型

# 从 HuggingFace 下载
git lfs install
git clone meta-llama/Llama-2-7b-chat-hf

2. 量化模型

# MLC 量化
python3 -m mlc_llm convert_weight \
    --quantization q4f16_1 \
    --model Llama-2-7b-chat-hf \
    --output dist/Llama-2-7b-chat-q4f16_1-mlc

3. 打包

# 打包为 MLC 格式
python3 -m mlc_llm package \
    --model dist/Llama-2-7b-chat-q4f16_1-mlc \
    --output-path dist/Llama-2-7b-chat-q4f16_1-mlc-py \
    --add-metadata '{"model_lib": "Llama-2-7b-chat-q4f16_1"}'

隐私与安全

端侧部署的优势：

优势	说明
🔒 隐私保护	数据不离开设备
📡 离线可用	无需网络连接
⚡ 低延迟	本地推理
💰 无 API 成本	无需云服务

适用场景

✅ 推荐使用端侧：

隐私敏感应用（医疗、法律、金融）
离线助手
嵌入式设备
成本敏感场景

❌ 不适合：

需要超大模型（>70B）
需要频繁更新知识
需要多模态能力

与云端模型对比

维度	端侧 (MLC-LLM)	云端 (API)
延迟	<100ms	1-5s
隐私	✅ 完全本地	❌ 数据上传
模型大小	≤7B	无限制
成本	硬件成本	按调用付费
更新	手动	实时

总结

MLC-LLM 让大模型真正走进端侧：

iOS/Android：原生应用体验
Web：浏览器直接运行
隐私优先：数据永不离开设备
量化优化：INT4 压缩 75%

对于隐私敏感或需要离线工作的场景，端侧部署是最佳选择！

本文是 AI Infra 系列最后一篇，感谢阅读！如有问题，欢迎留言讨论。