LLM Microservice Architecture: Module Implementation Plan
Microservice module breakdown
- Data cleaning (client-specific)
- Model factory module (data in, model out)
- A human eval framework (generic)
- An eval framework (client-specific)
Module 1: Data cleaning (non-generic)
Write a separate filtering class for each dataset.
For standardisation, implement it as a PyTorch Dataset class.
For each dataset we follow this basic workflow:
- Decide how to judge conversation quality (length; outcome; implied meaning). Document the criteria in the dataset class docstring and expose the score as a @property: def label.
- Decide when human intervention is needed and what the trigger signals are, then use rules (or batched LLM calls) to locate the trigger points. Implement this as a @property: def human_intervention_idx pointing to the position in the conversation where human intervention is required.
- Write a __getitem__ that reads a single item + label + human_intervention_idx from file (files are enough here; no need to bring in a database and over-engineer this). The label encodes conversation quality.
- Define a convention for referencing auxiliary information, e.g. def get_context returning the user's extra information (tools invoked; images / scripts).
- Serve the cleaned data from a separate module behind an API (Flask); it should accept requests and return both the __getitem__ and get_context outputs (reusable).
Also write a statistics module to visualise data length, the quality distribution, and the distribution of human-intervention positions (reusable): def visualise(dataset: torch.Dataset): ...
Summary: three deliverables: 1. the dataset class 2. the Flask API 3. the visualisation function.
Module 1: Data cleaning module (non-generic)
Core class implementation (based on PyTorch Dataset)
```python
import json
from typing import Dict, List, Tuple

import matplotlib.pyplot as plt
from torch.utils.data import Dataset


class CustomDataset(Dataset):
    """Dataset class dedicated to data cleaning.

    Label definition:
    - quality = conversation length > 50 chars [0.3] + contains a conclusive
      statement [0.4] + has a clear implied meaning [0.3]
    - threshold: total score > 0.7 counts as a high-quality sample
    """

    def __init__(self, file_path: str):
        self.data = self._load_data(file_path)
        self.dialog: List[str] = []  # set by __getitem__ to the current sample's turns
        self.intervention_rules = [
            ("sensitive-word check", self._check_sensitive_words),
            ("logic-break check", self._check_logic_break),
        ]

    def __len__(self) -> int:
        return len(self.data)

    @property
    def label(self) -> float:
        """Dynamically compute the quality score of the current sample."""
        # smooth version of the "> 50 characters" criterion from the docstring
        total_chars = sum(len(turn) for turn in self.dialog)
        length_score = min(total_chars / 100, 1.0)
        result_score = 1.0 if any(kw in turn for turn in self.dialog
                                  for kw in ('结论', '建议')) else 0.0
        metaphor_score = self._detect_metaphor()
        return 0.3 * length_score + 0.4 * result_score + 0.3 * metaphor_score

    @property
    def human_intervention_idx(self) -> int:
        """Locate the turn that needs human intervention (-1 if none)."""
        for idx, turn in enumerate(self.dialog):
            for _rule_name, check_fn in self.intervention_rules:
                if check_fn(turn):
                    return idx
        return -1

    def __getitem__(self, idx: int) -> Tuple[Dict, float, int]:
        item = self.data[idx]
        self.dialog = item['conversation']  # label / human_intervention_idx refer to this sample
        context = item.get('context', {})
        return {'dialog': self.dialog, 'context': context}, self.label, self.human_intervention_idx

    def get_context(self, idx: int) -> Dict:
        """Return the auxiliary context information of a sample."""
        return self.data[idx].get('context', {})

    # private helpers
    def _load_data(self, path: str) -> List[Dict]:
        with open(path, 'r') as f:
            return json.load(f)

    def _check_sensitive_words(self, text: str) -> bool:
        # Chinese keywords meaning "confidential" / "internal"
        return any(word in text for word in ['机密', '内部'])

    def _check_logic_break(self, text: str) -> bool:
        return len(text.split()) < 3 and '?' in text

    def _detect_metaphor(self) -> float:
        # placeholder: a pretrained model could detect implied meaning here
        return 0.8 if any('比喻' in turn for turn in self.dialog) else 0.0


def visualize(dataset: Dataset):
    """Plot basic dataset statistics and save them to data_stats.png."""
    lengths = [len(item[0]['dialog']) for item in dataset]
    labels = [item[1] for item in dataset]
    interventions = [item[2] for item in dataset if item[2] != -1]

    plt.figure(figsize=(15, 5))
    plt.subplot(1, 3, 1)
    plt.hist(lengths, bins=20)
    plt.title('Dialog length distribution')
    plt.subplot(1, 3, 2)
    plt.hist(labels, bins=10)
    plt.title('Quality score distribution')
    plt.subplot(1, 3, 3)
    plt.hist(interventions, bins=max(len(interventions) // 10, 1))
    plt.title('Human-intervention position distribution')
    plt.tight_layout()
    plt.savefig('data_stats.png')
```
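A quick usage sketch; the file name cleaned_dialogs.json is a placeholder, and the JSON is assumed to be a list of objects with conversation and optional context fields, matching _load_data above:

```python
# assumed input: a list of {"conversation": [...], "context": {...}} objects
ds = CustomDataset("cleaned_dialogs.json")
sample, label, intervention_idx = ds[0]
print(label, intervention_idx, sample["context"])

# writes data_stats.png with length / quality / intervention distributions
visualize(ds)
```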
Flask API service
```python
from flask import Flask, request, jsonify, send_file

# CustomDataset and visualize come from the module above

app = Flask(__name__)
dataset = None  # loaded via /init


@app.route('/init', methods=['POST'])
def init_dataset():
    global dataset
    file_path = request.json['path']
    dataset = CustomDataset(file_path)
    return jsonify({"status": "loaded", "size": len(dataset)})


@app.route('/get_item/<int:idx>')
def get_item(idx: int):
    item, label, intervention = dataset[idx]
    return jsonify({"data": item, "label": float(label), "intervention_idx": intervention})


@app.route('/get_context/<int:idx>')
def get_context(idx: int):
    # serves the get_context output required by the Module 1 spec
    return jsonify(dataset.get_context(idx))


@app.route('/visualize')
def generate_visualization():
    visualize(dataset)
    return send_file('data_stats.png')


if __name__ == '__main__':
    app.run(port=5000)
```
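A usage sketch for the service above, assuming it runs locally on port 5000; path/to/data.json is a placeholder for the cleaned-data file:

```python
import requests

BASE = "http://localhost:5000"

# load the dataset into the service (path is a placeholder)
print(requests.post(f"{BASE}/init", json={"path": "path/to/data.json"}).json())

# fetch one cleaned sample plus its label and intervention index
print(requests.get(f"{BASE}/get_item/0").json())

# fetch only the auxiliary context for the same sample
print(requests.get(f"{BASE}/get_context/0").json())

# trigger visualisation and save the returned chart locally
with open("data_stats.png", "wb") as f:
    f.write(requests.get(f"{BASE}/visualize").content)
```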
Module 2: Model factory module
Model registry example (YAML format)
```yaml
models:
  - name: "llama-7b-ft"
    type: "finetune"
    base_model: "meta-llama/Llama-2-7b-hf"
    access_point: "api.nebius.ai/v1/llama7b"
    training_params:
      epochs: 3
      batch_size: 32
      lr: 2e-5
    metrics:
      accuracy: 0.89
      latency: 350ms
  - name: "gemini-prompt-engineered"
    type: "prompt"
    base_model: "gemini-2.5"
    access_point: "api.gemini.com/v2"
    prompt_template: |
      你是一个专业客服助手,请根据以下上下文回答问题:
      {context}
      历史对话:{history}
      当前问题:{question}
```
Core logic of the training module
```python
import torch
from typing import Dict
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    pipeline,
)


class ModelFactory:
    def __init__(self, base_model: str):
        self.model = AutoModelForCausalLM.from_pretrained(base_model)
        self.tokenizer = AutoTokenizer.from_pretrained(base_model)

    def train(self, dataset, config: Dict):
        args = TrainingArguments(
            output_dir=config['output_dir'],
            num_train_epochs=config.get('epochs', 3),
            per_device_train_batch_size=config.get('batch_size', 8),
            logging_dir='./logs',
            **config.get('advanced_params', {}),
        )
        trainer = Trainer(
            model=self.model,
            args=args,
            train_dataset=dataset,
            # assumes the dataset yields (input_ids, labels) tensor pairs
            data_collator=lambda data: {
                'input_ids': torch.stack([d[0] for d in data]),
                'labels': torch.stack([d[1] for d in data]),
            },
        )
        trainer.train()
        self.save_model(config['save_path'])

    def save_model(self, path: str):
        self.model.save_pretrained(path)
        self.tokenizer.save_pretrained(path)

    @staticmethod
    def load_for_inference(path: str):
        # reload the saved weights behind a text-generation pipeline
        return pipeline('text-generation', model=path)
```
Module 2: Fine-tuning / prompt-engineering module
- Simply put: [a model, in the form of an API endpoint] plus [data, in the form of the Flask endpoint above] go in -> [a tuned model, exposed as an API endpoint] comes out (see the sketch after this list).
- If the logic in the middle is unclear, call the technical manager.
- No restriction on the approach here (full fine-tuning, lightweight fine-tuning, or prompt engineering are all fine; keep a model registry). Use an average base model such as Llama as the baseline and run it through the human and automated eval of Module 3 below.
- Take a look at Nebius, but nothing is mandatory. Gemini 2.5 is currently the most cost-effective model; if there is a way to fine-tune it, that is fine too.
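A minimal sketch of this flow, under stated assumptions: the Module 1 Flask service runs on localhost:5000, ModelFactory is the class from the training module above, the base model is the Llama checkpoint from the registry example, and fetch_cleaned_data / tokenise are illustrative helper names, not part of any existing module:

```python
import requests
from transformers import AutoTokenizer


def fetch_cleaned_data(base_url: str = "http://localhost:5000", n_items: int = 100) -> list:
    """Pull cleaned samples from the Module 1 Flask API."""
    samples = []
    for idx in range(n_items):
        resp = requests.get(f"{base_url}/get_item/{idx}")
        if resp.status_code != 200:
            break
        samples.append(resp.json())
    return samples


def tokenise(samples: list, tokenizer, max_len: int = 512) -> list:
    """Turn each conversation into an (input_ids, labels) pair for causal-LM tuning."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    pairs = []
    for s in samples:
        text = "\n".join(s["data"]["dialog"])
        ids = tokenizer(text, truncation=True, max_length=max_len,
                        padding="max_length", return_tensors="pt")["input_ids"][0]
        pairs.append((ids, ids.clone()))  # labels = inputs for causal LM
    return pairs


if __name__ == "__main__":
    base = "meta-llama/Llama-2-7b-hf"
    data = tokenise(fetch_cleaned_data(), AutoTokenizer.from_pretrained(base))
    factory = ModelFactory(base)
    # config mirrors the training_params block of the YAML registry entry
    factory.train(data, {"output_dir": "./out", "epochs": 3,
                         "batch_size": 4, "save_path": "./llama-7b-ft"})
```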
research
[] API inference efficiency research + implementation [research]
some research on the cheapest way of performing inference (on a finetunable model; proprietary or open-source doesn't matter)
we do one round of research at the beginning to build an MVP
later, once the project is launched, this requires daily updates
deliverables:
1. a table with 5 columns: evidence source; model name; price per 1M in/out tokens; price estimate per session as per data stats; source date
save the output in markdown format in README.MD in the repo, and note the date/time this table was last updated in the footer (a skeleton follows below)
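A possible skeleton for this deliverable, with placeholder cells (no real figures) matching the five columns above:

| Evidence source | Model name | Price per 1M in/out tokens | Est. price per session | Source date |
|---|---|---|---|---|
| \<link\> | \<model\> | \<$ in\> / \<$ out\> | \<$\> | \<YYYY-MM-DD\> |

*Table last updated: \<YYYY-MM-DD HH:MM\>*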
[] finetuning / lora [research]
> some research on the cheapest way of doing finetuning (of course the model choice cannot be too weak; check out the SOTA cost-efficient models)
> don't spend too much time on this choice; we can always revisit it after building our MVP
> deliverables:
1. a table with 5 columns: evidence source; model name; price per 1M in/out tokens; price estimate per session as per data stats; source date
> put links to sources that show the SOTA cost-efficient models in the footnote
[] hardcode prompt module [code/text]
> we try to produce some models this way as well and test them against the same benchmark
> note that this is easier to develop but will be more expensive long-term
> models built with this method also go in the registry with "training-method" labelled as "prompt"
code/modules
[] finetune + inference module [code/module]
input:
1. a model of choice with minimal code for adapting the main loop (main loop has to be reusable for different providers)
2. conversation data (json format)
3. some hyperparameters
start with dummy data (chat with the stakeholder about this); either find a dataset (if readily available) or use a made-up one
start with a cost-efficient model as per the research in steps 1 and 2
deliverables:
1. a report (or PowerPoint) of what was done during this process, i.e. the amount of data used, data distribution stats (used for tuning the model), and the number of epochs (and the reason for that choice)
2. saved weights or an access point (if using cloud providers) for accessing the model
3. code for performing inference (a minimal sketch follows below)
4. we will keep a model registry as YAML or JSON with name, access point, method, description, and the hyperparams used for tuning the model, for full replicability
> maybe use https://studio.nebius.com/
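A minimal inference sketch, assuming the chosen provider exposes an OpenAI-compatible endpoint (Nebius Studio and many other providers do); the base URL, model name, and environment variable below are placeholders to be replaced with the values from the model registry entry:

```python
import os
from openai import OpenAI

# placeholders: take base_url / model from the registry entry, and the
# API key from whatever environment variable the team agrees on
client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",  # assumed OpenAI-compatible endpoint
    api_key=os.environ["PROVIDER_API_KEY"],
)


def infer(prompt: str, model: str = "meta-llama/Llama-2-7b-hf") -> str:
    """Send a single chat-completion request to the registered access point."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return resp.choices[0].message.content


if __name__ == "__main__":
    print(infer("你好,请介绍一下退款流程。"))
```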
Module 3: EVAL (HUMAN + BENCHMARKS)
Plainly: [a model API endpoint] goes in, and 1. a quick front end based on Gradio/Streamlit (or whatever is fast) comes out for human testing, 2. a few hand-rolled quality criteria (to be decided in a later meeting) are applied and a handful of scores come out.
code
- this looks simple but is actually the most important bit
- first clearly define what we need the agent to do
[] benchmark and evaluation framework (must be done in parallel with everything else) [code/module]
> a realistic scenario will be determined with the stakeholder
> a generic LLM will be plugged in at the beginning as a benchmark
> the agent must survive human eval; testers must actively try to break it; testers: the technical lead and the manager
> define and track some sort of error rate that covers any kind of glitch
> hardcode a sequence of undesired behaviours
> this will be a plug-in-and-score-out kind of thing; it must be evaluated automatically; e.g., we call this eval framework CS eval
[] the flow goes like this:
> input: a finished agent ready to plug in and use
> plug it in with x scenarios (e.g., new customers; catching up with previous customers, etc.)
> the code will send it a sequence of interactions and determine whether its behaviour fits
> depending on fit/unfit, a numeric score will be derived using some form of parser; string parsing or LLM eval are both fine
> exactly how we want to eval our models needs to be defined later in a discussion (a sketch of the scoring loop follows below)
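A minimal sketch of such a scenario-driven scoring loop, under stated assumptions: the agent is any callable that takes a user message and returns a reply, and each scenario hardcodes an interaction sequence plus simple keyword/regex checks; the names Scenario and run_cs_eval are illustrative, not an existing module:

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Scenario:
    """One hardcoded interaction sequence plus per-turn checks."""
    name: str
    turns: List[str]                                         # messages sent to the agent, in order
    must_match: List[str] = field(default_factory=list)      # regex the i-th reply should match
    must_not_match: List[str] = field(default_factory=list)  # regex marking undesired behaviour


def run_cs_eval(agent: Callable[[str], str], scenarios: List[Scenario]) -> dict:
    """Plug an agent in, replay each scenario, and derive a numeric score."""
    per_scenario = {}
    for sc in scenarios:
        hits, total = 0, 0
        for i, msg in enumerate(sc.turns):
            reply = agent(msg)
            if i < len(sc.must_match):
                total += 1
                hits += bool(re.search(sc.must_match[i], reply))
            if any(re.search(p, reply) for p in sc.must_not_match):
                hits = 0  # any undesired behaviour zeroes out the scenario
                break
        per_scenario[sc.name] = hits / total if total else 0.0
    return {"per_scenario": per_scenario,
            "overall": sum(per_scenario.values()) / len(per_scenario)}


# usage with a trivial placeholder agent
if __name__ == "__main__":
    scenarios = [Scenario(name="new customer",
                          turns=["你好,我是新客户", "怎么开通账户?"],
                          must_match=[r"欢迎", r"(开通|注册)"],
                          must_not_match=[r"机密"])]
    print(run_cs_eval(lambda msg: "欢迎!请按注册流程开通账户。", scenarios))
```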
Module 3: Evaluation framework
Automated evaluation module
```python
import time
from typing import Dict

import gradio as gr
from sentence_transformers import SentenceTransformer, util
from sklearn.metrics import accuracy_score


class EvalFramework:
    def __init__(self, model_endpoint: str):
        # ModelLoader is a placeholder for whatever client wraps the registered
        # access point; it must expose generate(text) -> str
        self.model = ModelLoader.load(model_endpoint)
        # separate Sentence-BERT encoder used only for similarity scoring
        self.scorer = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
        self.test_cases = [
            {"input": "如何重置密码?", "expected": "指导密码重置流程"},
            {"input": "我要投诉!", "expected": "安抚情绪并转接主管"},
        ]

    def auto_eval(self) -> Dict:
        results, times = [], []
        for case in self.test_cases:
            start = time.perf_counter()
            pred = self.model.generate(case["input"])
            times.append(time.perf_counter() - start)
            results.append(self._calculate_similarity(pred, case["expected"]))
        # "accuracy" here is the fraction of cases whose similarity exceeds 0.8
        return {
            "accuracy": accuracy_score([1] * len(results),
                                       [1 if s > 0.8 else 0 for s in results]),
            "avg_response_time": sum(times) / len(times),
        }

    def launch_interface(self):
        """Launch the human evaluation UI."""
        demo = gr.Interface(
            fn=self._evaluate_single,
            inputs=gr.Textbox(label="输入测试语句"),
            outputs=[gr.Textbox(label="模型响应"), gr.Number(label="匹配度得分")],
        )
        demo.launch()

    def _evaluate_single(self, text: str):
        """Single-sample human eval: model response plus its best match score."""
        pred = self.model.generate(text)
        best = max(self._calculate_similarity(pred, c["expected"]) for c in self.test_cases)
        return pred, best

    def _calculate_similarity(self, pred: str, expected: str) -> float:
        # semantic similarity via Sentence-BERT embeddings
        return util.cos_sim(self.scorer.encode(pred),
                            self.scorer.encode(expected)).item()
```
Evaluation flow example
```python
# Initialise the evaluation framework against a registered access point
eval_framework = EvalFramework("api.nebius.ai/v1/llama7b")

# Run the automated test cases
auto_results = eval_framework.auto_eval()
print(f"Automated eval results: {auto_results}")

# Launch the human evaluation interface
eval_framework.launch_interface()
```
Implementation notes
- Modular design: modules communicate through API interfaces, following the microservice principle of high cohesion and low coupling
- Extensibility: the model registry supports dynamically adding new models, and the evaluation framework supports configurable test cases
- Cost control: elastic resource scheduling through cloud platforms such as Nebius/Gemini (see the price table from the research phase)
- Quality assurance: the dual evaluation mechanism (automated + human) safeguards model reliability
- Observability: the data visualisation module helps in understanding data distribution characteristics
It is recommended to keep the price-research results up to date in the project README; example format:
| Model name | Provider | Input price | Output price | Est. cost per session |
|---------|--------|---------|---------|-------------|
| Llama-2-7B | Nebius | $0.15/M | $0.30/M | $0.02/session |
| Gemini 2.5 | Google | $0.20/M | $0.40/M | $0.03/session |
*Price data updated 2025-04-18, from each cloud platform's official pricing*
This implementation plan combines PyTorch's data-processing capabilities, Flask's API development efficiency, and Gradio's rapid prototyping, in line with microservice best practices. Each module can be deployed independently and works with the others through standardised interfaces.