Chat & Generate#

Learn how to chat with an LLM in Xinference.

Introduction#

Models with the chat or generate ability are usually called large language models (LLMs) or text generation models. They are designed to respond with text output to the input they receive, commonly called a "prompt". In general, you can steer these models toward a task with specific instructions or by providing concrete examples.

Models with the generate ability are typically pre-trained large language models. Models with the chat ability, by contrast, are LLMs that have been fine-tuned and aligned, optimized specifically for dialogue. In most cases, models whose names end in "chat" (for example, llama-2-chat, qwen-chat) have the chat ability.

The Chat API and the Generate API offer two different ways to interact with LLMs:

The Chat API (similar to OpenAI's Chat Completion API) supports multi-turn conversations.
The Generate API (similar to OpenAI's Completions API) generates text from a text prompt.

Model ability	API endpoint	OpenAI-compatible endpoint
chat	Chat API	/v1/chat/completions
generate	Generate API	/v1/completions
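To make the contrast between the two endpoints concrete, here is a minimal sketch of the request body each one expects. The helper names `build_chat_body` and `build_generate_body` are hypothetical and only illustrate the shapes; the field names and paths are the ones listed in the table above and used in the examples later on this page.

```python
import json

# Paths taken from the table above.
CHAT_ENDPOINT = "/v1/chat/completions"
GENERATE_ENDPOINT = "/v1/completions"


def build_chat_body(model_uid, user_text, system_text=None):
    """Chat API body: the input is a list of role-tagged messages."""
    messages = []
    if system_text is not None:
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_text})
    return {"model": model_uid, "messages": messages}


def build_generate_body(model_uid, prompt):
    """Generate API body: the input is a single free-form prompt string."""
    return {"model": model_uid, "prompt": prompt}


chat_body = build_chat_body(
    "<MODEL_UID>", "What is the largest animal?", "You are a helpful assistant."
)
generate_body = build_generate_body("<MODEL_UID>", "What is the largest animal?")

print(CHAT_ENDPOINT, json.dumps(chat_body))
print(GENERATE_ENDPOINT, json.dumps(generate_body))
```

Both bodies accept further sampling options such as max_tokens and temperature, as the examples below show.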

Supported models list#

You can check the abilities of all the built-in LLMs in Xinference.

Chat model#

Chat API#

Try the Chat API using cURL, the OpenAI client, or Xinference's Python client:

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "What is the largest animal?"
        }
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }'

import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
client.chat.completions.create(
    model="<MODEL_UID>",
    messages=[
        {
            "content": "What is the largest animal?",
            "role": "user",
        }
    ],
    max_tokens=512,
    temperature=0.7
)

from xinference.client import RESTfulClient

client = RESTfulClient("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
messages = [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is the largest animal?"}]
model.chat(
    messages,
    generate_config={
      "max_tokens": 512,
      "temperature": 0.7
    }
)

{
  "id": "chatcmpl-8d76b65a-bad0-42ef-912d-4a0533d90d61",
  "model": "<MODEL_UID>",
  "object": "chat.completion",
  "created": 1688919187,
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The largest animal that has been scientifically measured is the blue whale, which has a maximum length of around 23 meters (75 feet) for adult animals and can weigh up to 150,000 pounds (68,000 kg). However, it is important to note that this is just an estimate and that the largest animal known to science may be larger still. Some scientists believe that the largest animals may not have a clear \"size\" in the same way that humans do, as their size can vary depending on the environment and the stage of their life."
      },
      "finish_reason": "None"
    }
  ],
  "usage": {
    "prompt_tokens": -1,
    "completion_tokens": -1,
    "total_tokens": -1
  }
}
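Since the Chat API supports multi-turn conversations, a follow-up question is sent together with the earlier turns. The sketch below shows one way to carry the history forward, assuming the response is available as a dict shaped like the JSON above (the OpenAI client returns an object instead, so attribute access would be needed there). The helper names and the shortened sample reply are hypothetical.

```python
def append_assistant_reply(messages, response):
    """Return a new history with the assistant message from `response` appended."""
    reply = response["choices"][0]["message"]
    return messages + [{"role": reply["role"], "content": reply["content"]}]


def append_user_turn(messages, text):
    """Return a new history with the next user message appended."""
    return messages + [{"role": "user", "content": text}]


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the largest animal?"},
]

# Stand-in for a real response, trimmed to the fields used here.
sample_response = {
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "The blue whale."},
            "finish_reason": "None",
        }
    ],
}

history = append_assistant_reply(history, sample_response)
history = append_user_turn(history, "How long can it grow?")

# `history` can now be sent as the `messages` field of the next request.
print([m["role"] for m in history])
```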

You can find more Chat API examples in the tutorial notebook.

Gradio Chat

An example of using the Xinference Chat API with the Python client.

https://github.com/xorbitsai/inference/blob/main/examples/gradio_chatinterface.py

Hybrid thinking model#

Some large language models are labeled as hybrid, and you can choose whether to enable thinking mode when running them.

Added in version v1.17.0: The request-level enable_thinking switch is supported since v1.17.0.

Xinference provides a request-level switch, enable_thinking, that applies across different model templates (for example, Qwen uses enable_thinking, while some DeepSeek templates use thinking).

Usage example:

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "messages": [
        {"role": "user", "content": "What is the largest animal?"}
    ],
    "enable_thinking": false
  }'

import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
client.chat.completions.create(
    model="<MODEL_UID>",
    messages=[
        {"role": "user", "content": "What is the largest animal?"}
    ],
    extra_body={"enable_thinking": False}
)

from xinference.client import RESTfulClient

client = RESTfulClient("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
model.chat(
    [{"role": "user", "content": "What is the largest animal?"}],
    enable_thinking=False,
)

model.chat(
    [{"role": "user", "content": "What is the largest animal?"}],
    generate_config={"chat_template_kwargs": {"enable_thinking": False}},
)
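The point of a single request-level switch is that callers do not need to know which keyword a given chat template reads. The sketch below illustrates that idea only; it is not Xinference's implementation. The mapping, the family labels "qwen" and "deepseek", and the function name are all hypothetical, chosen to mirror the Qwen and DeepSeek example mentioned above.

```python
# Hypothetical mapping from a template family to the keyword its chat template reads.
# The family labels are illustrative and are not identifiers taken from Xinference.
TEMPLATE_SWITCH_NAME = {
    "qwen": "enable_thinking",
    "deepseek": "thinking",
}


def template_kwargs(family, enable_thinking):
    """Translate the uniform switch into the keyword a given template expects."""
    # Assumption: unknown families fall back to the uniform name.
    key = TEMPLATE_SWITCH_NAME.get(family, "enable_thinking")
    return {key: enable_thinking}


print(template_kwargs("qwen", False))
print(template_kwargs("deepseek", False))
```

The caller always passes enable_thinking; only the translation layer knows the per-template name.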

Generate model#

Generate API#

The Generate API mirrors OpenAI's Completions API.

The difference between the Generate API and the Chat API lies mainly in the form of the input. The Chat API takes a list of messages as input, whereas the Generate API takes a free-form text string called the prompt.

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "prompt": "What is the largest animal?",
    "max_tokens": 512,
    "temperature": 0.7
  }'

import openai

client = openai.Client(api_key="cannot be empty", base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1")
# The Generate API maps to the Completions endpoint, which takes a prompt string.
client.completions.create(
    model="<MODEL_UID>",
    prompt="What is the largest animal?",
    max_tokens=512,
    temperature=0.7
)

from xinference.client import RESTfulClient

client = RESTfulClient("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
print(model.generate(
    prompt="What is the largest animal?",
    generate_config={
      "max_tokens": 512,
      "temperature": 0.7
    }
))

{
  "id": "cmpl-8d76b65a-bad0-42ef-912d-4a0533d90d61",
  "model": "<MODEL_UID>",
  "object": "text_completion",
  "created": 1688919187,
  "choices": [
    {
      "index": 0,
      "text": "The largest animal that has been scientifically measured is the blue whale, which has a maximum length of around 23 meters (75 feet) for adult animals and can weigh up to 150,000 pounds (68,000 kg). However, it is important to note that this is just an estimate and that the largest animal known to science may be larger still. Some scientists believe that the largest animals may not have a clear \"size\" in the same way that humans do, as their size can vary depending on the environment and the stage of their life.",
      "finish_reason": "None"
    }
  ],
  "usage": {
    "prompt_tokens": -1,
    "completion_tokens": -1,
    "total_tokens": -1
  }
}
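The two APIs return the generated text in different places: the Chat API under choices[0].message.content and the Generate API under choices[0].text. A minimal sketch of a helper that handles both, assuming the response is a dict shaped like the JSON samples on this page; the function name `extract_text` and the shortened sample values are hypothetical.

```python
def extract_text(response):
    """Return the generated text from a chat or text completion response dict."""
    first_choice = response["choices"][0]
    if response.get("object") == "chat.completion":
        # Chat API: the text lives inside the assistant message.
        return first_choice["message"]["content"]
    # Generate API ("text_completion"): the text is a direct field of the choice.
    return first_choice["text"]


# Stand-ins for real responses, trimmed to the fields used here.
chat_response = {
    "object": "chat.completion",
    "choices": [{"index": 0, "message": {"role": "assistant", "content": "The blue whale."}}],
}
generate_response = {
    "object": "text_completion",
    "choices": [{"index": 0, "text": "The blue whale."}],
}

print(extract_text(chat_response))
print(extract_text(generate_response))
```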

FAQ#

Does the Xinference LLM provide integration methods for LangChain or LlamaIndex?#

Yes, you can refer to the related sections in their respective official Xinference documentation. Here are the links: