RayLLM Learning Resources - An Open-Source LLM Serving Solution Built on Ray


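Introduction to RayLLM

RayLLM (formerly known as Aviary) is an LLM serving solution built on Ray Serve that makes it easy to deploy and manage a variety of open-source LLMs. Its features include:

A large collection of preconfigured open-source LLMs that work out of the box
Support for deploying Transformer models hosted on the Hugging Face Hub or stored on local disk
Simplified deployment of multiple LLMs and addition of new LLMs
Autoscaling support, including scale-to-zero
Full support for multi-GPU and multi-node model deployments
High-performance features such as continuous batching, quantization, and streaming
An OpenAI-like REST API, which eases migration and cross-testing
Support for multiple LLM backends, including vLLM and TensorRT-LLM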

Quick Start

Local Deployment

The recommended way to run RayLLM is the official anyscale/ray-llm Docker image:

cache_dir=${XDG_CACHE_HOME:-$HOME/.cache}

docker run -it --gpus all --shm-size 1g -p 8000:8000 -e HF_HOME=~/data -v $cache_dir:~/data anyscale/ray-llm:latest bash

# Run inside the Docker container
serve run ~/serve_configs/amazon--LightGPT.yaml

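Querying Models

Once a model is deployed, there are several ways to query the RayLLM deployment. The examples assume ENDPOINT_URL holds the API base URL, for example http://localhost:8000/v1 for the local deployment above.

Using curl: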

curl $ENDPOINT_URL/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"}],
    "temperature": 0.7
  }'

Using Python:

import os
import json
import requests

s = requests.Session()

api_base = os.getenv("ENDPOINT_URL")
url = f"{api_base}/chat/completions"
body = {
  "model": "meta-llama/Llama-2-7b-chat-hf",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me a long story with many words."}
  ],
  "temperature": 0.7,
  "stream": True,
}

with s.post(url, json=body, stream=True) as response:
    for chunk in response.iter_lines(decode_unicode=True):
        if chunk is not None:
            try:
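                # Process the response chunk to obtain `content`
                # ...
                print(content, end="", flush=True)
            except json.decoder.JSONDecodeError:
                # Chunk was not valid JSON
                pass
            except KeyError:
                # Chunk carried no message content
                pass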
    print("")
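
The loop above leaves out the step that turns each streamed line into `content`. The following is a minimal sketch of that step, assuming the server emits OpenAI-style server-sent events: payload lines of the form `data: {json}`, with `data: [DONE]` marking the end of the stream. The helper name `extract_delta_content` is hypothetical and not part of RayLLM; the actual wire format should be checked against your deployment.

```python
import json
from typing import Optional


def extract_delta_content(line: str) -> Optional[str]:
    """Return the text carried by one streamed line, or None if there is none.

    Assumption: OpenAI-style server-sent events, where each payload line
    looks like `data: {json}` and the stream ends with `data: [DONE]`.
    """
    prefix = "data:"
    if not line or not line.startswith(prefix):
        return None  # blank separator line or a non-data SSE field
    payload = line[len(prefix):].strip()
    if payload == "[DONE]":
        return None  # end-of-stream sentinel
    try:
        choices = json.loads(payload)["choices"]
        # The first chunk often carries only the role, so "content" may be absent.
        return choices[0]["delta"].get("content")
    except (json.JSONDecodeError, KeyError, IndexError, TypeError, AttributeError):
        return None  # malformed chunk, or no message in this chunk


# Usage with hand-written sample lines (not captured from a real server).
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
text = "".join(c for c in map(extract_delta_content, sample) if c)
print(text)  # prints: Hello
```

Inside the streaming loop, each `chunk` from `response.iter_lines(decode_unicode=True)` could be passed to this helper and printed whenever the result is not None. Returning None for anything unparseable keeps the loop free of exception handling.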

Using the OpenAI SDK:

# Note: this snippet uses the pre-1.0 openai SDK interface (openai<1.0).
import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "not_a_real_key"

chat_completion = openai.ChatCompletion.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Say 'test'."}
    ],
    temperature=0.7
)
print(chat_completion)
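
The snippet above relies on the module-level `openai.ChatCompletion` interface, which the OpenAI Python SDK removed in version 1.0. For readers on a newer SDK, here is a rough equivalent using the client object introduced in 1.0. It is a sketch under assumptions: that the RayLLM endpoint accepts requests from the newer client unchanged, and that ENDPOINT_URL (falling back to the localhost address used above) points at the API base. The helper names `build_chat_request` and `query_rayllm` are hypothetical.

```python
import os


def build_chat_request(model: str, user_prompt: str, temperature: float = 0.7) -> dict:
    """Assemble the keyword arguments for a chat completion call."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": temperature,
    }


def query_rayllm(user_prompt: str = "Say 'test'.") -> str:
    """Send one chat request and return the reply text (needs a running server)."""
    # Imported here so build_chat_request works without the SDK installed.
    from openai import OpenAI  # requires openai>=1.0

    client = OpenAI(
        base_url=os.getenv("ENDPOINT_URL", "http://localhost:8000/v1"),
        api_key="not_a_real_key",  # placeholder value, as in the snippet above
    )
    request = build_chat_request("meta-llama/Llama-2-7b-chat-hf", user_prompt)
    chat_completion = client.chat.completions.create(**request)
    return chat_completion.choices[0].message.content
```

With the server from the quick start running, `print(query_rayllm())` should print the model's reply. Building the request in a separate function makes it possible to check the payload without contacting the server.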