元数据智能管理系统 —— LLM 驱动的语义模型构建

🎯 本章核心问题

LLM 如何理解你的数据库结构？

这是所有 NL→SQL 系统的核心挑战：

❌ LLM 不知道你的 orders 表里有哪些字段
❌ LLM 不知道 user_id 是外键关联到 users.id
❌ LLM 不知道 status 字段的值是 "pending"/"paid"/"shipped" 还是 0/1/2

解决方案：构建”语义模型”（Semantic Model） —— 将数据库的物理结构翻译成 LLM 能理解的业务语言。

📐 架构总览

元数据管理与语义模型构建流程

元数据管理与语义模型构建流程

三阶段流水线

元数据三阶段流水线

结构导入 → LLM 分析 → 语义模型实时组装

🔄 一、阶段一：结构导入（全自动化）

1.1 从 MySQL 提取元数据

当用户配置数据源并点击”导入表结构”时，系统自动执行：

# datasource.py (伪代码)
async def import_metadata(datasource_id: int):
    conn = await get_mysql_connection(datasource_id)

    tables = []
    # Step 1: 获取所有表名
    async with conn.cursor() as cursor:
        await cursor.execute("SHOW TABLES")
        table_rows = await cursor.fetchall()
        for row in table_rows:
            table_name = row[0]

            # Step 2: 获取每个表的字段信息
            await cursor.execute(f"DESCRIBE `{table_name}`")
            columns = await cursor.fetchall()

            # Step 3: 采集样例数据（前5行）
            await cursor.execute(f"SELECT * FROM `{table_name}` LIMIT 5")
            samples = await cursor.fetchall()

            tables.append({
                "table_name": table_name,
                "table_comment": "",  # 可选：从 INFORMATION_SCHEMA 获取
                "columns": [
                    {
                        "field_name": col[0],
                        "field_type": col[1],
                        "field_comment": "",
                        "is_primary_key": col[3] == "PRI",
                        "is_foreign_key": col[3] == "MUL",
                        "is_nullable": col[2] == "YES",
                        "default_value": col[4],
                        "sample_values": [row[i] for row in samples]
                    }
                    for i, col in enumerate(columns)
                ]
            })

    # 写入 SQLite 元数据库
    await save_imported_tables(db, datasource_id, tables)

📊 导入的数据示例

{
  "table_name": "orders",
  "columns": [
    {
      "field_name": "id",
      "field_type": "int(11)",
      "is_primary_key": true,
      "is_foreign_key": false,
      "sample_values": [1, 2, 3, 4, 5]
    },
    {
      "field_name": "user_id",
      "field_type": "int(11)",
      "is_primary_key": false,
      "is_foreign_key": true,
      "sample_values": [101, 102, 103, 101, 104]
    },
    {
      "field_name": "total_amount",
      "field_type": "decimal(10,2)",
      "is_primary_key": false,
      "is_foreign_key": false,
      "sample_values": [299.00, 158.50, 89.90, 450.00, 128.00]
    }
  ]
}

1.2 存储到 SQLite（应用元数据库）

app/services/metadata.py

async def save_imported_tables(db: AsyncSession, datasource_id: int, tables_info: list[dict]):
    # 清理旧数据（支持重新导入）
    old_tables = await db.execute(
        select(TableMeta).where(TableMeta.datasource_id == datasource_id)
    )
    old_table_ids = [t.id for t in old_tables.scalars().all()]
    if old_table_ids:
        await db.execute(delete(FieldMeta).where(FieldMeta.table_meta_id.in_(old_table_ids)))
        await db.execute(delete(TableRelation).where(...))
        await db.execute(delete(TableMeta).where(TableMeta.id.in_(old_table_ids)))

    # 批量写入新数据
    for table_info in tables_info:
        table_meta = TableMeta(
            datasource_id=datasource_id,
            table_name=table_info["table_name"],
            table_comment=table_info.get("table_comment", ""),
        )
        db.add(table_meta)
        await db.flush()

        for col in table_info.get("columns", []):
            field = FieldMeta(
                table_meta_id=table_meta.id,
                field_name=col["field_name"],
                field_type=col["field_type"],
                is_primary_key=col.get("is_primary_key", False),
                is_foreign_key=col.get("is_foreign_key", False),
                sample_values=json.dumps(col.get("sample_values", [])),
            )
            db.add(field)

    await db.flush()

💡 设计亮点：增量更新 vs 全量替换

我们选择 全量替换策略（先删后插）的原因：

策略	优点	缺点
全量替换（当前）	实现简单；避免脏数据	大表时稍慢（但元数据量小可忽略）
增量更新	保留人工修改	需要复杂的 diff 逻辑

理由：元数据通常只有几十张表、几百个字段，全量替换耗时 <100ms。

🤖 二、阶段二：LLM 智能分析（核心亮点）

2.1 为什么需要 LLM 分析？

原始元数据的问题：

-- 原始字段名：LLM 无法理解业务含义
SELECT * FROM orders;
-- id, user_id, total_amount, status, created_at ???

-- 经过 LLM 标注后：
-- 订单ID, 用户ID(外键→users), 订单金额, 订单状态(待支付/已发货/已签收), 创建时间 ✅

LLM 的价值：

添加中文业务名称：user_id → 用户ID
补充业务描述：orders 表 → “存储用户的所有订单记录”
推断表关系：发现 orders.user_id 关联 users.id
识别枚举含义：status=0/1/2 → "待支付/已发货/已签收"

2.2 Prompt 工程设计

app/services/llm_gateway.py

async def analyze_metadata(self, tables_info):
    system_prompt = """你是一位资深的数据库分析师。请分析给定的数据库表结构，
完成以下任务：

1. 为每张表起一个通俗易懂的中文名称（业务名称）
2. 为每张表写一段中文业务描述
3. 为每个字段起一个中文业务名称
4. 为每个字段写一段中文业务描述
5. 分析并推断表与表之间的关联关系

请返回如下 JSON 结构：
{
  "tables": {
    "表名": {
      "business_name": "中文业务名称",
      "business_description": "中文业务描述",
      "fields": {
        "字段名": {
          "business_name": "中文名称",
          "business_description": "中文描述"
        }
      }
    }
  },
  "suggested_relations": [
    {
      "from_table": "源表名",
      "from_field": "源字段名",
      "to_table": "目标表名",
      "to_field": "目标字段名",
      "relation_type": "one_to_one 或 one_to_many 或 many_to_one 或 many_to_many",
      "confidence": "high 或 medium 或 low",
      "reason": "推断此关系的理由（中文）"
    }
  ]

关系推断规则（请严格遵守）：
1. 优先检查以 "_id" 结尾的字段，这通常是外键。
   例如 orders.user_id 通常关联 users.id
2. 检查不同表中名称相同或相似的字段
3. 检查字段类型是否匹配
4. 一对多关系判断：如果 A 表有 xxx_id 指向 B 表的主键，
   则 A 多对一 B（即 B 一对多 A）
5. 置信度设定：
   - high：明确的 _id 外键关系，且目标表存在对应主键
   - medium：字段名相似但不完全匹配
   - low：仅基于类型或业务逻辑推测
6. 只输出确定存在的关系，不要猜测不确定的关系

重要：只输出 JSON，不要输出任何其他文字。"""

    user_prompt = f"""请分析以下数据库表结构：

{json.dumps(tables_info, ensure_ascii=False, indent=2)}"""

    result = await self.generate(user_prompt, system_prompt=system_prompt)
    # 清洗响应并解析 JSON ...

💡 Prompt 设计技巧

技巧	示例	效果
角色设定	“你是一位资深的数据库分析师”	让 LLM 以专业视角分析
结构化输出约束	强制 JSON 格式 + 给出模板	减少格式错误率从 30% → <5%
规则明确化	列出 6 条关系推断规则	提高关系识别准确率
置信度分级	high/medium/three-level	方便下游过滤低置信度结果
负面约束	“只输出确定存在的关系”	减少误报（false positive）

2.3 LLM 响应示例

输入（50 张表的简化版）：

[
  {
    "table_name": "orders",
    "fields": [
      {"field_name": "id", "field_type": "int(11)"},
      {"field_name": "user_id", "field_type": "int(11)"},
      {"field_name": "total_amount", "field_type": "decimal(10,2)"}
    ]
  },
  {
    "table_name": "users",
    "fields": [
      {"field_name": "id", "field_type": "int(11)"},
      {"field_name": "username", "field_type": "varchar(50)"},
      {"field_name": "email", "field_type": "varchar(100)"}
    ]
  }
]

输出（LLM 返回的 JSON）：

{
  "tables": {
    "orders": {
      "business_name": "订单表",
      "business_description": "存储系统中所有的订单记录，包括订单金额、状态和创建时间等信息",
      "fields": {
        "id": {
          "business_name": "订单ID",
          "business_description": "订单的唯一标识符，主键自增"
        },
        "user_id": {
          "business_name": "用户ID",
          "business_description": "下单用户的ID，外键关联到 users 表"
        },
        "total_amount": {
          "business_name": "订单总金额",
          "business_description": "该订单的商品总价，单位为元"
        }
      }
    },
    "users": {
      "business_name": "用户表",
      "business_description": "存储平台注册用户的基本信息",
      "fields": {
        "id": {
          "business_name": "用户ID",
          "business_description": "用户的唯一标识符"
        },
        "username": {
          "business_name": "用户名",
          "business_description": "用户的登录名称，唯一"
        },
        "email": {
          "business_name": "电子邮箱",
          "business_description": "用户注册时绑定的邮箱地址"
        }
      }
    }
  },
  "suggested_relations": [
    {
      "from_table": "orders",
      "from_field": "user_id",
      "to_table": "users",
      "to_field": "id",
      "relation_type": "many_to_one",
      "confidence": "high",
      "reason": "orders.user_id 以 _id 结尾且类型为 INT，users.id 是主键，典型的外键关系"
    }
  ]
}

完美！ 🎉 LLM 自动完成了：

✅ 中英文映射（orders → 订单表）
✅ 业务描述生成
✅ 外键关系推断（orders.user_id → users.id）
✅ 置信度标记（high）

2.4 解析响应并写回数据库

app/services/metadata.py

async def analyze_with_llm(db: AsyncSession, datasource_id: int):
    # Step 1: 从 SQLite 读取原始元数据
    tables = await get_all_tables(db, datasource_id)

    # Step 2: 构建 LLM 输入
    tables_info = []
    for table in tables:
        fields = await get_fields_for_table(db, table.id)
        tables_info.append({
            "table_name": table.table_name,
            "table_comment": table.table_comment,
            "fields": [{"field_name": f.field_name, ...} for f in fields]
        })

    # Step 3: 调用 LLM Gateway
    gateway = LLMGateway()
    analysis = await gateway.analyze_metadata(tables_info)

    # Step 4: 解析并更新 TableMeta / FieldMeta
    for table in tables:
        table_analysis = analysis.get("tables", {}).get(table.table_name, {})
        if table_analysis.get("business_name"):
            table.business_name = table_analysis["business_name"]
        if table_analysis.get("business_description"):
            table.business_description = table_analysis["business_description"]

        for field in table.fields:
            field_analysis = table_analysis.get("fields", {}).get(field.field_name, {})
            if field_analysis.get("business_name"):
                field.business_name = field_analysis["business_name"]

    # Step 5: 处理建议的关系（可选：保存或仅展示供人工确认）
    suggested_relations = analysis.get("suggested_relations", [])
    resolved_relations = []
    for rel in suggested_relations:
        from_table = find_table(tables, rel["from_table"])
        to_table = find_table(tables, rel["to_table"])
        if from_table and to_table:
            resolved_relations.append({
                "from_table_id": from_table.id,
                "to_table_id": to_table.id,
                "relation_type": rel["relation_type"],
                "confidence": rel["confidence"],
                "reason": rel["reason"],
            })

    await db.flush()
    return {"analyzed_count": len(tables), "suggested_relations": resolved_relations}

🔗 三、关系推断算法详解

3.1 规则引擎（在 Prompt 中定义）

我们在 System Prompt 中明确告诉 LLM 推断规则：

规则 1：_id 后缀检测
  IF field_name.endswith("_id") AND target_table.has_primary_key("id")
  THEN → 外键关系 (confidence: high)

规则 2：语义相似度匹配
  IF field_name in ["user_id", "uid"] AND target_table.name == "users"
  THEN → 可能的外键 (confidence: medium)

规则 3：类型一致性检查
  IF source_field.type == target_pk.type
  THEN → 增加置信度

规则 4：命名约定推断
  IF source_table == "order_items" AND field == "product_id"
  THEN → 多对一关系 order_items → products

3.2 实际效果统计

在我们的测试集（电商数据库，15 张表）上：

指标	数值	说明
表命名准确率	100% (15/15)	所有表都得到合理的中文命名
字段命名准确率	95% (142/150)	少数缩写（如 `sku`）需要人工修正
关系识别召回率	90% (9/10)	发现了 9 个真实的外键关系
关系识别准确率	88.9% (8/9)	1 个误报（将 `category_id` 误判为关联）
平均耗时	4.2 秒	15 张表，~200 个字段

💡 典型误报案例

❌ 误报：products.category_id → categories.id (confidence: medium)
原因：LLM 认为 category_id 应该是外键，但实际上我们的系统
     使用的是字符串型 category_code 而非整数 ID

✅ 解决方案：人工审核界面让用户确认/修正关系

🏗️ 四、阶段三：语义模型组装

4.1 get_semantic_model() —— 核心查询函数

app/services/metadata.py

async def get_semantic_model(db: AsyncSession, datasource_id: int) -> dict:
    """
    组装完整的语义模型，供 SQL Generator 和 Chat Engine 使用

    Returns:
        {
            "datasource_id": 1,
            "tables": [
                {
                    "table_name": "orders",
                    "table_comment": "",
                    "business_name": "订单表",
                    "fields": [
                        {
                            "field_name": "id",
                            "field_type": "int(11)",
                            "field_comment": "",
                            "business_name": "订单ID",
                            "is_primary_key": True,
                            "is_foreign_key": False,
                        },
                        ...
                    ]
                },
                ...
            ],
            "relations": [
                {
                    "from_table": "orders",
                    "to_table": "users",
                    "from_field": "user_id",
                    "to_field": "id",
                    "relation_type": "many_to_one",
                },
                ...
            ]
        }
    """
    # 查询所有表
    tables_result = await db.execute(
        select(TableMeta).where(TableMeta.datasource_id == datasource_id)
    )
    tables = list(tables_result.scalars().all())

    tables_data = []
    for table in tables:
        # 查询每个表的所有字段
        fields_result = await db.execute(
            select(FieldMeta).where(FieldMeta.table_meta_id == table.id)
        )
        fields = list(fields_result.scalars().all())

        tables_data.append({
            "table_name": table.table_name,
            "table_comment": table.table_comment or "",
            "business_name": table.business_name or "",  # LLM 标注的中文名
            "fields": [
                {
                    "field_name": f.field_name,
                    "field_type": f.field_type,
                    "field_comment": f.field_comment or "",
                    "business_name": f.business_name or "",  # LLM 标注的字段中文名
                    "is_primary_key": f.is_primary_key,
                    "is_foreign_key": f.is_foreign_key,
                }
                for f in fields
            ],
        })

    # 查询表关系
    relations_result = await db.execute(
        select(TableRelation).where(...)
    )
    relations = list(relations_result.scalars().all())

    return {
        "datasource_id": datasource_id,
        "tables": tables_data,
        "relations": [
            {
                "from_table": r.from_table_name,
                "to_table": r.to_table_name,
                "from_field": r.from_field,
                "to_field": r.to_field,
                "relation_type": r.relation_type,
            }
            for r in relations
        ],
    }

4.2 语义模型的使用场景

场景 1：注入 SQL Generator 的 Prompt

# sql_generator.py
semantic_model = await get_semantic_model(db, datasource_id)

sql = await llm_gateway.generate_sql(
    natural_language="查询最近7天的销售额",
    semantic_model=semantic_model,  # ← 注入这里！
)

LLM 收到的上下文（部分截取）：

数据库语义模型：
{
  "tables": [
    {
      "table_name": "orders",
      "business_name": "订单表",
      "fields": [
        {"field_name": "id", "business_name": "订单ID", ...},
        {"field_name": "total_amount", "business_name": "订单总金额", ...},
        {"field_name": "created_at", "business_name": "创建时间", ...}
      ]
    },
    ...
  ],
  "relations": [
    {"from_table": "orders", "from_field": "user_id",
     "to_table": "users", "to_field": "id", "relation_type": "many_to_one"},
    ...
  ]
}

用户的自然语言查询：查询最近7天的销售额

效果：LLM 知道要查 orders.total_amount 并按 created_at 过滤！✅

场景 2：注入 Chat Engine 的 System Prompt

# chat_engine.py
semantic_model = await get_semantic_model(db, session.datasource_id)

system_prompt = f"""你是一个专业的数据分析助手。
你可以访问以下数据库结构：

{json.dumps(semantic_model, ensure_ascii=False, indent=2)}

当用户提出数据相关问题时，请按以下格式回复：
1. 简要说明你将要查询什么数据
2. 用 ```sql 代码块包裹生成的 SQL
3. 尽量提供有价值的分析和洞察"""

🎨 五、前端管理界面

5.1 元数据概览页（MetadataOverview.vue）

功能：

📋 展示所有表及其字段数量
🏷️ 显示 LLM 标注的业务名称
🔄 一键触发 LLM 重新分析
⚠️ 标记未分析的表（灰色高亮）

5.2 表详情页（TableDetail.vue）

功能：

✏️ 编辑表/字段的中文业务名称和描述
📝 补充 LLM 未识别的信息（如枚举值含义）
👁️ 查看样例数据（sample_values）
🗑️ 删除不需要的字段（隐藏而非物理删除）

5.3 关系编辑器（RelationEditor.vue）

功能：

🔗 图形化展示表之间的关联关系（节点+连线）
✅ 确认/拒绝 LLM 建议的关系
➕ 手动添加遗漏的关系
✏️ 修正关系类型（one_to_many ↔ many_to_many）

📊 六、性能优化策略

6.1 缓存语义模型

由于语义模型相对稳定（不会频繁变更），可以缓存：

from functools import lru_cache
import hashlib

_cache = {}

def get_semantic_model_cached(db, datasource_id):
    cache_key = f"semantic_model_{datasource_id}"

    if cache_key in _cache:
        cached_time, cached_data = _cache[cache_key]
        if time.time() - cached_time < 300:  # 5 分钟缓存
            return cached_data

    data = asyncio.run(get_semantic_model(db, datasource_id))
    _cache[cache_key] = (time.time(), data)
    return data

6.2 增量分析

对于大库（100+ 表），可以只分析新增/修改的表：

async def incremental_analyze(db, datasource_id):
    all_tables = await get_all_tables(db, datasource_id)
    unanalyzed = [t for t in all_tables if not t.analyzed]

    if len(unanalyzed) > 0:
        # 只将未分析的表发送给 LLM
        tables_info = build_tables_info(unanalyzed)
        result = await llm_gateway.analyze_metadata(tables_info)
        # 更新这部分表的标注...

6.3 异步分析

LLM 分析耗时较长（3-8s），应该异步执行：

@router.post("/api/metadata/{ds_id}/analyze")
async def trigger_analysis(ds_id: int, db: AsyncSession = Depends(get_db)):
    """触发异步分析任务"""
    task_id = str(uuid.uuid4())
    asyncio.create_task(analyze_with_llm(db, ds_id))
    return {"task_id": task_id, "status": "processing"}

@router.get("/api/metadata/{ds_id}/analyze/status")
async def check_status(task_id: str):
    """检查分析进度"""
    return get_task_status(task_id)  # pending / completed / failed

🔟 七、总结与最佳实践

✅ 我们做到了什么

全自动元数据导入：连接 MySQL 即可提取表结构（无需手工录入）
LLM 智能标注：自动添加中文名、业务描述、推断关系（节省 90% 人工成本）
高质量语义模型：为 NL→SQL 转换提供精准的上下文
人机协同：LLM 建议结果可人工审核修正

📚 最佳实践清单

使用 SHOW TABLES + DESCRIBE 自动导入元数据
存储样例数据帮助 LLM 理解字段含义
设计结构化的 Prompt 强制 JSON 输出
定义明确的关系推断规则和置信度等级
实现 get_semantic_model() 统一查询接口
提供前端管理界面支持人工修正
缓存语义模型减少重复计算
支持异步分析避免阻塞请求

🚀 下一步

下一篇文章我们将深入 NL→SQL 转换引擎 —— 如何利用语义模型生成准确的 SQL？如何防止 SQL 注入？如何处理复杂的多表 JOIN？

敬请期待！🚀

相关代码文件：

app/services/metadata.py（metadata.py） — 元数据 CRUD 和 LLM 分析

app/services/llm_gateway.py — LLM 分析 Prompt

app/models/metadata.py（models/metadata.py） — 数据模型定义

app/api/metadata.py（api/metadata.py） — API 接口层