Rescale ๋ฏธํŒ… ์˜ˆ์•ฝ

ํŒŒ์ด์ฌ์—์„œ ๋น„๋™๊ธฐ LLM API ํ˜ธ์ถœ: ํฌ๊ด„์ ์ธ ๊ฐ€์ด๋“œ

์ธ๊ณต์ง€๋Šฅ

ํŒŒ์ด์ฌ์—์„œ ๋น„๋™๊ธฐ LLM API ํ˜ธ์ถœ: ํฌ๊ด„์ ์ธ ๊ฐ€์ด๋“œ

mm
ํŒŒ์ด์ฌ์—์„œ ๋น„๋™๊ธฐ LLM API ํ˜ธ์ถœ: ํฌ๊ด„์ ์ธ ๊ฐ€์ด๋“œ

๊ฐœ๋ฐœ์ž์ด์ž dta ๊ณผํ•™์ž๋กœ์„œ ์šฐ๋ฆฌ๋Š” ์ข…์ข… API๋ฅผ ํ†ตํ•ด ์ด๋Ÿฌํ•œ ๊ฐ•๋ ฅํ•œ ๋ชจ๋ธ๊ณผ ์ƒํ˜ธ ์ž‘์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์˜ ๋ณต์žก์„ฑ๊ณผ ๊ทœ๋ชจ๊ฐ€ ์ปค์ง์— ๋”ฐ๋ผ ํšจ์œจ์ ์ด๊ณ  ์„ฑ๋Šฅ์ด ๋›ฐ์–ด๋‚œ API ์ƒํ˜ธ ์ž‘์šฉ์— ๋Œ€ํ•œ ํ•„์š”์„ฑ์ด ์ค‘์š”ํ•ด์ง‘๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ ๋น„๋™๊ธฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์ด ๋น›์„ ๋ฐœํ•˜๋ฉฐ, LLM API๋กœ ์ž‘์—…ํ•  ๋•Œ ์ฒ˜๋ฆฌ๋Ÿ‰์„ ๊ทน๋Œ€ํ™”ํ•˜๊ณ  ์ง€์—ฐ ์‹œ๊ฐ„์„ ์ตœ์†Œํ™”ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ด ์ข…ํ•ฉ ๊ฐ€์ด๋“œ์—์„œ๋Š” Python์—์„œ ๋น„๋™๊ธฐ LLM API ํ˜ธ์ถœ์˜ ์„ธ๊ณ„๋ฅผ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. ๋น„๋™๊ธฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ๊ธฐ๋ณธ๋ถ€ํ„ฐ ๋ณต์žกํ•œ ์›Œํฌํ”Œ๋กœ์šฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ณ ๊ธ‰ ๊ธฐ์ˆ ๊นŒ์ง€ ๋ชจ๋“  ๊ฒƒ์„ ๋‹ค๋ฃน๋‹ˆ๋‹ค. ์ด ๊ธ€์„ ๋งˆ์น˜๋ฉด ๋น„๋™๊ธฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ํ™œ์šฉํ•˜์—ฌ LLM ๊ธฐ๋ฐ˜ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์„ ๊ฐ•ํ™”ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ํ™•์‹คํžˆ ์ดํ•ดํ•˜๊ฒŒ ๋  ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๋น„๋™๊ธฐ LLM API ํ˜ธ์ถœ์˜ ์„ธ๋ถ€ ์‚ฌํ•ญ์„ ์‚ดํŽด๋ณด๊ธฐ ์ „์— ๋จผ์ € ๋น„๋™๊ธฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ฐœ๋…์— ๋Œ€ํ•œ ํŠผํŠผํ•œ ๊ธฐ์ดˆ๋ฅผ ๋‹ค์ ธ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

๋น„๋™๊ธฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์€ ์‹คํ–‰์˜ ์ฃผ ์Šค๋ ˆ๋“œ๋ฅผ ์ฐจ๋‹จํ•˜์ง€ ์•Š๊ณ  ์—ฌ๋Ÿฌ ์ž‘์—…์„ ๋™์‹œ์— ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์ค๋‹ˆ๋‹ค. Python์—์„œ ์ด๋Š” ์ฃผ๋กœ ๋‹ค์Œ์„ ํ†ตํ•ด ๋‹ฌ์„ฑ๋ฉ๋‹ˆ๋‹ค. ๋น„๋™๊ธฐ ์ฝ”๋ฃจํ‹ด, ์ด๋ฒคํŠธ ๋ฃจํ”„, ํ“จ์ฒ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋™์‹œ์„ฑ ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ•˜๊ธฐ ์œ„ํ•œ ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์ œ๊ณตํ•˜๋Š” ๋ชจ๋“ˆ์ž…๋‹ˆ๋‹ค.

์ฃผ์š” ๊ฐœ๋…:

  • ์ฝ”๋ฃจํ‹ด: ์ •์˜๋œ ํ•จ์ˆ˜ ๋น„๋™๊ธฐ ์ •์˜ ์ผ์‹œ ์ •์ง€ ๋ฐ ์žฌ๊ฐœ๊ฐ€ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.
  • ์ด๋ฒคํŠธ ๋ฃจํ”„: ๋น„๋™๊ธฐ ์ž‘์—…์„ ๊ด€๋ฆฌํ•˜๊ณ  ์‹คํ–‰ํ•˜๋Š” ์ค‘์•™ ์‹คํ–‰ ๋ฉ”์ปค๋‹ˆ์ฆ˜์ž…๋‹ˆ๋‹ค.
  • ๊ธฐ๋Œ€๋˜๋Š” ๊ฒƒ๋“ค: await ํ‚ค์›Œ๋“œ์™€ ํ•จ๊ป˜ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ๊ฐ์ฒด(์ฝ”๋ฃจํ‹ด, ํƒœ์Šคํฌ, ํ“จ์ฒ˜).

์ด๋Ÿฌํ•œ ๊ฐœ๋…์„ ์„ค๋ช…ํ•˜๊ธฐ ์œ„ํ•œ ๊ฐ„๋‹จํ•œ ์˜ˆ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

import asyncio

async def greet(name):
    await asyncio.sleep(1)  # Simulate an I/O operation
    print(f"Hello, {name}!")

async def main():
    await asyncio.gather(
        greet("Alice"),
        greet("Bob"),
        greet("Charlie")
    )

asyncio.run(main())

์ด ์˜ˆ์—์„œ ์šฐ๋ฆฌ๋Š” ๋น„๋™๊ธฐ ํ•จ์ˆ˜๋ฅผ ์ •์˜ํ•ฉ๋‹ˆ๋‹ค. greet I/O ์ž‘์—…์„ ์‹œ๋ฎฌ๋ ˆ์ด์…˜ํ•˜๋Š” asyncio.sleep(). ๊ทธ๋งŒํผ main ๊ธฐ๋Šฅ ์‚ฌ์šฉ asyncio.gather() ์—ฌ๋Ÿฌ ์ธ์‚ฌ๋ง์„ ๋™์‹œ์— ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค. sleep ์ง€์—ฐ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  ์„ธ ์ธ์‚ฌ๋ง์ด ๋ชจ๋‘ ์•ฝ 1์ดˆ ํ›„์— ์ธ์‡„๋˜์–ด ๋น„๋™๊ธฐ ์‹คํ–‰์˜ ํž˜์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

LLM API ํ˜ธ์ถœ์—์„œ ๋น„๋™๊ธฐ์˜ ํ•„์š”์„ฑ

LLM API๋กœ ์ž‘์—…ํ•  ๋•Œ, ์šฐ๋ฆฌ๋Š” ์ข…์ข… ์—ฌ๋Ÿฌ API ํ˜ธ์ถœ์„ ์ˆœ์„œ๋Œ€๋กœ ๋˜๋Š” ๋ณ‘๋ ฌ๋กœ ํ•ด์•ผ ํ•˜๋Š” ์‹œ๋‚˜๋ฆฌ์˜ค์— ์ง๋ฉดํ•ฉ๋‹ˆ๋‹ค. ๊ธฐ์กด์˜ ๋™๊ธฐ ์ฝ”๋“œ๋Š” ์ƒ๋‹นํ•œ ์„ฑ๋Šฅ ๋ณ‘๋ชฉ ํ˜„์ƒ์œผ๋กœ ์ด์–ด์งˆ ์ˆ˜ ์žˆ์œผ๋ฉฐ, ํŠนํžˆ LLM ์„œ๋น„์Šค์— ๋Œ€ํ•œ ๋„คํŠธ์›Œํฌ ์š”์ฒญ๊ณผ ๊ฐ™์€ ๊ณ  ์ง€์—ฐ ์ž‘์—…์„ ์ฒ˜๋ฆฌํ•  ๋•Œ ๊ทธ๋ ‡์Šต๋‹ˆ๋‹ค.

LLM API๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ 100๊ฐœ์˜ ์„œ๋กœ ๋‹ค๋ฅธ ๊ธฐ์‚ฌ์— ๋Œ€ํ•œ ์š”์•ฝ์„ ์ƒ์„ฑํ•ด์•ผ ํ•˜๋Š” ์‹œ๋‚˜๋ฆฌ์˜ค๋ฅผ ์ƒ๊ฐํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. ๋™๊ธฐ์  ์ ‘๊ทผ ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜๋ฉด ๊ฐ API ํ˜ธ์ถœ์€ ์‘๋‹ต์„ ๋ฐ›์„ ๋•Œ๊นŒ์ง€ ์ฐจ๋‹จ๋˜์–ด ๋ชจ๋“  ์š”์ฒญ์„ ์™„๋ฃŒํ•˜๋Š” ๋ฐ ๋ช‡ ๋ถ„์ด ๊ฑธ๋ฆด ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฐ˜๋ฉด ๋น„๋™๊ธฐ์  ์ ‘๊ทผ ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜๋ฉด ์—ฌ๋Ÿฌ API ํ˜ธ์ถœ์„ ๋™์‹œ์— ์‹œ์ž‘ํ•  ์ˆ˜ ์žˆ์–ด ์ „์ฒด ์‹คํ–‰ ์‹œ๊ฐ„์„ ํฌ๊ฒŒ ์ค„์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ™˜๊ฒฝ ์„ค์ •

๋น„๋™๊ธฐ LLM API ํ˜ธ์ถœ์„ ์‹œ์ž‘ํ•˜๋ ค๋ฉด ํ•„์š”ํ•œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ Python ํ™˜๊ฒฝ์„ ์„ค์ •ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ํ•„์š”ํ•œ ์‚ฌํ•ญ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

  • ํŒŒ์ด์ฌ 3.7 ๋˜๋Š” ๊ทธ ์ด์ƒ(๋„ค์ดํ‹ฐ๋ธŒ asyncio ์ง€์›์˜ ๊ฒฝ์šฐ)
  • aiohttp: ๋น„๋™๊ธฐ HTTP ํด๋ผ์ด์–ธํŠธ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ
  • Openai: ๊ณต์‹ OpenAI Python ํด๋ผ์ด์–ธํŠธ (OpenAI์˜ GPT ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝ์šฐ)
  • ๋žญ์ฒด์ธ: LLM์„ ์‚ฌ์šฉํ•˜์—ฌ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์„ ๊ตฌ์ถ•ํ•˜๊ธฐ ์œ„ํ•œ ํ”„๋ ˆ์ž„์›Œํฌ(์„ ํƒ ์‚ฌํ•ญ์ด์ง€๋งŒ ๋ณต์žกํ•œ ์›Œํฌํ”Œ๋กœ์— ๊ถŒ์žฅ๋จ)

pip๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ด๋Ÿฌํ•œ ์ข…์†์„ฑ์„ ์„ค์น˜ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

pip install aiohttp openai langchain

asyncio ๋ฐ aiohttp๋ฅผ ์‚ฌ์šฉํ•œ ๊ธฐ๋ณธ ๋น„๋™๊ธฐ LLM API ํ˜ธ์ถœ

aiohttp๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ LLM API์— ๋Œ€ํ•œ ๊ฐ„๋‹จํ•œ ๋น„๋™๊ธฐ ํ˜ธ์ถœ์„ ๋งŒ๋“ค์–ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. OpenAI์˜ GPT-3.5 API๋ฅผ ์˜ˆ๋กœ ๋“ค์–ด ์„ค๋ช…ํ•˜๊ฒ ์ง€๋งŒ, ์ด ๊ฐœ๋…์€ ๋‹ค๋ฅธ LLM API์—๋„ ์ ์šฉ๋ฉ๋‹ˆ๋‹ค.

import asyncio
import aiohttp
from openai import AsyncOpenAI

async def generate_text(prompt, client):
    response = await client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

async def main():
    prompts = [
        "Explain quantum computing in simple terms.",
        "Write a haiku about artificial intelligence.",
        "Describe the process of photosynthesis."
    ]
    
    async with AsyncOpenAI() as client:
        tasks = [generate_text(prompt, client) for prompt in prompts]
        results = await asyncio.gather(*tasks)
    
    for prompt, result in zip(prompts, results):
        print(f"Prompt: {prompt}\nResponse: {result}\n")

asyncio.run(main())

์ด ์˜ˆ์—์„œ ์šฐ๋ฆฌ๋Š” ๋น„๋™๊ธฐ ํ•จ์ˆ˜๋ฅผ ์ •์˜ํ•ฉ๋‹ˆ๋‹ค. generate_text AsyncOpenAI ํด๋ผ์ด์–ธํŠธ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ OpenAI API๋ฅผ ํ˜ธ์ถœํ•ฉ๋‹ˆ๋‹ค. main ์ด ๊ธฐ๋Šฅ์€ ๋‹ค์–‘ํ•œ ํ”„๋กฌํ”„ํŠธ์™€ ์šฉ๋„์— ๋Œ€ํ•ด ์—ฌ๋Ÿฌ ์ž‘์—…์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. asyncio.gather() ๋™์‹œ์— ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค.

์ด ์ ‘๊ทผ ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜๋ฉด LLM API์— ์—ฌ๋Ÿฌ ์š”์ฒญ์„ ๋™์‹œ์— ๋ณด๋‚ผ ์ˆ˜ ์žˆ์–ด ๋ชจ๋“  ํ”„๋กฌํ”„ํŠธ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐ ํ•„์š”ํ•œ ์ด ์‹œ๊ฐ„์„ ํฌ๊ฒŒ ์ค„์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ณ ๊ธ‰ ๊ธฐ์ˆ : ๋ฐฐ์น˜ ๋ฐ ๋™์‹œ์„ฑ ์ œ์–ด

์ด์ „ ์˜ˆ์ œ๋Š” ๋น„๋™๊ธฐ LLM API ํ˜ธ์ถœ์˜ ๊ธฐ๋ณธ์„ ๋ณด์—ฌ์ฃผ์ง€๋งŒ, ์‹ค์ œ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์—์„œ๋Š” ๋” ์ •๊ตํ•œ ์ ‘๊ทผ ๋ฐฉ์‹์ด ํ•„์š”ํ•œ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค. ๋‘ ๊ฐ€์ง€ ์ค‘์š”ํ•œ ๊ธฐ์ˆ , ์ฆ‰ ์š”์ฒญ ์ผ๊ด„ ์ฒ˜๋ฆฌ์™€ ๋™์‹œ์„ฑ ์ œ์–ด๋ฅผ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

์š”์ฒญ ์ผ๊ด„ ์ฒ˜๋ฆฌ: ๋งŽ์€ ์ˆ˜์˜ ํ”„๋กฌํ”„ํŠธ๋ฅผ ์ฒ˜๋ฆฌํ•  ๋•Œ, ๊ฐ ํ”„๋กฌํ”„ํŠธ์— ๋Œ€ํ•ด ๊ฐœ๋ณ„ ์š”์ฒญ์„ ๋ณด๋‚ด๋Š” ๊ฒƒ๋ณด๋‹ค ์—ฌ๋Ÿฌ ๊ฐœ์˜ ํ”„๋กฌํ”„ํŠธ๋ฅผ ๋ฌถ์–ด ์ผ๊ด„ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ฒƒ์ด ๋” ํšจ์œจ์ ์ธ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด ์—ฌ๋Ÿฌ API ํ˜ธ์ถœ๋กœ ์ธํ•œ ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ค„์ด๊ณ  ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

import asyncio
from openai import AsyncOpenAI

async def process_batch(batch, client):
    responses = await asyncio.gather(*[
        client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        ) for prompt in batch
    ])
    return [response.choices[0].message.content for response in responses]

async def main():
    prompts = [f"Tell me a fact about number {i}" for i in range(100)]
    batch_size = 10
    
    async with AsyncOpenAI() as client:
        results = []
        for i in range(0, len(prompts), batch_size):
            batch = prompts[i:i+batch_size]
            batch_results = await process_batch(batch, client)
            results.extend(batch_results)
    
    for prompt, result in zip(prompts, results):
        print(f"Prompt: {prompt}\nResponse: {result}\n")

asyncio.run(main())

๋™์‹œ์„ฑ ์ œ์–ด: ๋น„๋™๊ธฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์€ ๋™์‹œ ์‹คํ–‰์„ ํ—ˆ์šฉํ•˜์ง€๋งŒ, API ์„œ๋ฒ„์— ๊ณผ๋ถ€ํ•˜๊ฐ€ ๊ฑธ๋ฆฌ๊ฑฐ๋‚˜ ์†๋„ ์ œํ•œ์„ ์ดˆ๊ณผํ•˜์ง€ ์•Š๋„๋ก ๋™์‹œ์„ฑ ์ˆ˜์ค€์„ ์ œ์–ดํ•˜๋Š” โ€‹โ€‹๊ฒƒ์ด ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด asyncio.Semaphore๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

import asyncio
from openai import AsyncOpenAI

async def generate_text(prompt, client, semaphore):
    async with semaphore:
        response = await client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content

async def main():
    prompts = [f"Tell me a fact about number {i}" for i in range(100)]
    max_concurrent_requests = 5
    semaphore = asyncio.Semaphore(max_concurrent_requests)
    
    async with AsyncOpenAI() as client:
        tasks = [generate_text(prompt, client, semaphore) for prompt in prompts]
        results = await asyncio.gather(*tasks)
    
    for prompt, result in zip(prompts, results):
        print(f"Prompt: {prompt}\nResponse: {result}\n")

asyncio.run(main())

์ด ์˜ˆ์—์„œ๋Š” ์„ธ๋งˆํฌ์–ด๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋™์‹œ ์š”์ฒญ ์ˆ˜๋ฅผ 5๊ฐœ๋กœ ์ œํ•œํ•˜์—ฌ API ์„œ๋ฒ„์— ๊ณผ๋ถ€ํ•˜๊ฐ€ ๊ฑธ๋ฆฌ์ง€ ์•Š๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.

๋น„๋™๊ธฐ LLM ํ˜ธ์ถœ์—์„œ์˜ ์˜ค๋ฅ˜ ์ฒ˜๋ฆฌ ๋ฐ ์žฌ์‹œ๋„

์™ธ๋ถ€ API๋ฅผ ์‚ฌ์šฉํ•  ๋•Œ๋Š” ๊ฐ•๋ ฅํ•œ ์˜ค๋ฅ˜ ์ฒ˜๋ฆฌ ๋ฐ ์žฌ์‹œ๋„ ๋ฉ”์ปค๋‹ˆ์ฆ˜์„ ๊ตฌํ˜„ํ•˜๋Š” ๊ฒƒ์ด ๋งค์šฐ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค. ์ผ๋ฐ˜์ ์ธ ์˜ค๋ฅ˜๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ณ  ์žฌ์‹œ๋„์— ๋Œ€ํ•œ ์ง€์ˆ˜ ๋ฐฑ์˜คํ”„๋ฅผ ๊ตฌํ˜„ํ•˜๋„๋ก ์ฝ”๋“œ๋ฅผ ๊ฐœ์„ ํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

import asyncio
from openai import AsyncOpenAI
from tenacity import retry, stop_after_attempt, wait_exponential

class APIError(Exception):
    pass

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
async def generate_text_with_retry(prompt, client):
    try:
        response = await client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error occurred: {e}")
        raise APIError("Failed to generate text")

async def process_prompt(prompt, client, semaphore):
    async with semaphore:
        try:
            result = await generate_text_with_retry(prompt, client)
            return prompt, result
        except APIError:
            return prompt, "Failed to generate response after multiple attempts."

async def main():
    prompts = [f"Tell me a fact about number {i}" for i in range(20)]
    max_concurrent_requests = 5
    semaphore = asyncio.Semaphore(max_concurrent_requests)
    
    async with AsyncOpenAI() as client:
        tasks = [process_prompt(prompt, client, semaphore) for prompt in prompts]
        results = await asyncio.gather(*tasks)
    
    for prompt, result in results:
        print(f"Prompt: {prompt}\nResponse: {result}\n")

asyncio.run(main())

์ด ํ–ฅ์ƒ๋œ ๋ฒ„์ „์—๋Š” ๋‹ค์Œ์ด ํฌํ•จ๋ฉ๋‹ˆ๋‹ค.

  • ์‚ฌ์šฉ์ž ์ง€์ • APIError API ๊ด€๋ จ ์˜ค๋ฅ˜์— ๋Œ€ํ•œ ์˜ˆ์™ธ์ž…๋‹ˆ๋‹ค.
  • A generate_text_with_retry ์žฅ์‹๋œ ๊ธฐ๋Šฅ @retry tenacity ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์—์„œ ์ง€์ˆ˜ ๋ฐฑ์˜คํ”„๋ฅผ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.
  • ์˜ค๋ฅ˜ ์ฒ˜๋ฆฌ process_prompt ์˜ค๋ฅ˜๋ฅผ ํฌ์ฐฉํ•˜๊ณ  ๋ณด๊ณ ํ•˜๋Š” ๊ธฐ๋Šฅ.

์„ฑ๋Šฅ ์ตœ์ ํ™”: ์ŠคํŠธ๋ฆฌ๋ฐ ์‘๋‹ต

์žฅ๋ฌธ ์ฝ˜ํ…์ธ  ์ƒ์„ฑ์˜ ๊ฒฝ์šฐ ์ŠคํŠธ๋ฆฌ๋ฐ ์‘๋‹ต์€ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์˜ ์ธ์ง€๋œ ์„ฑ๋Šฅ์„ ํฌ๊ฒŒ ๊ฐœ์„ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ „์ฒด ์‘๋‹ต์„ ๊ธฐ๋‹ค๋ฆฌ๋Š” ๋Œ€์‹ , ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•ด์ง€๋ฉด ํ…์ŠคํŠธ ์ฒญํฌ๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ณ  ํ‘œ์‹œํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

import asyncio
from openai import AsyncOpenAI

async def stream_text(prompt, client):
    stream = await client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    
    full_response = ""
    async for chunk in stream:
        if chunk.choices[0].delta.content is not None:
            content = chunk.choices[0].delta.content
            full_response += content
            print(content, end='', flush=True)
    
    print("\n")
    return full_response

async def main():
    prompt = "Write a short story about a time-traveling scientist."
    
    async with AsyncOpenAI() as client:
        result = await stream_text(prompt, client)
    
    print(f"Full response:\n{result}")

asyncio.run(main())

์ด ์˜ˆ์ œ๋Š” API์—์„œ ์‘๋‹ต์„ ์ŠคํŠธ๋ฆฌ๋ฐํ•˜๊ณ  ๋„์ฐฉํ•˜๋Š” ๋Œ€๋กœ ๊ฐ ์ฒญํฌ๋ฅผ ์ธ์‡„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ์ด ์ ‘๊ทผ ๋ฐฉ์‹์€ ํŠนํžˆ ์ฑ„ํŒ… ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์ด๋‚˜ ์‚ฌ์šฉ์ž์—๊ฒŒ ์‹ค์‹œ๊ฐ„ ํ”ผ๋“œ๋ฐฑ์„ ์ œ๊ณตํ•˜๋ ค๋Š” ๋ชจ๋“  ์‹œ๋‚˜๋ฆฌ์˜ค์— ์œ ์šฉํ•ฉ๋‹ˆ๋‹ค.

LangChain์„ ์‚ฌ์šฉํ•˜์—ฌ ๋น„๋™๊ธฐ ์›Œํฌํ”Œ๋กœ ๊ตฌ์ถ•

๋” ๋ณต์žกํ•œ LLM ๊ธฐ๋ฐ˜ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์˜ ๊ฒฝ์šฐ LangChain ํ”„๋ ˆ์ž„์›Œํฌ ์—ฌ๋Ÿฌ LLM ํ˜ธ์ถœ์„ ์ฒด์ธ์œผ๋กœ ์—ฐ๊ฒฐํ•˜๊ณ  ๋‹ค๋ฅธ ๋„๊ตฌ๋ฅผ ํ†ตํ•ฉํ•˜๋Š” ๊ณผ์ •์„ ๊ฐ„์†Œํ™”ํ•˜๋Š” ๊ณ ์ˆ˜์ค€ ์ถ”์ƒํ™”๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ๋น„๋™๊ธฐ ๊ธฐ๋Šฅ์„ ๊ฐ–์ถ˜ LangChain์„ ์‚ฌ์šฉํ•˜๋Š” ์˜ˆ๋ฅผ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

์ด ์˜ˆ์—์„œ๋Š” LangChain์„ ์‚ฌ์šฉํ•˜์—ฌ ์ŠคํŠธ๋ฆฌ๋ฐ ๋ฐ ๋น„๋™๊ธฐ ์‹คํ–‰์„ ํ†ตํ•ด ๋” ๋ณต์žกํ•œ ์›Œํฌํ”Œ๋กœ๋ฅผ ๋งŒ๋“œ๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. AsyncCallbackManager ๊ทธ๋ฆฌ๊ณ  StreamingStdOutCallbackHandler ์ƒ์„ฑ๋œ ์ฝ˜ํ…์ธ ์˜ ์‹ค์‹œ๊ฐ„ ์ŠคํŠธ๋ฆฌ๋ฐ์„ ํ™œ์„ฑํ™”ํ•ฉ๋‹ˆ๋‹ค.

import asyncio
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.callbacks.manager import AsyncCallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

async def generate_story(topic):
    llm = OpenAI(temperature=0.7, streaming=True, callback_manager=AsyncCallbackManager([StreamingStdOutCallbackHandler()]))
    prompt = PromptTemplate(
        input_variables=["topic"],
        template="Write a short story about {topic}."
    )
    chain = LLMChain(llm=llm, prompt=prompt)
    return await chain.arun(topic=topic)

async def main():
    topics = ["a magical forest", "a futuristic city", "an underwater civilization"]
    tasks = [generate_story(topic) for topic in topics]
    stories = await asyncio.gather(*tasks)
    
    for topic, story in zip(topics, stories):
        print(f"\nTopic: {topic}\nStory: {story}\n{'='*50}\n")

asyncio.run(main())

FastAPI๋ฅผ ์‚ฌ์šฉํ•œ ๋น„๋™๊ธฐ LLM ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜ ์ œ๊ณต

๋น„๋™๊ธฐ LLM ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์„ ์›น ์„œ๋น„์Šค๋กœ ์ œ๊ณตํ•˜๋ ค๋ฉด ๋น„๋™๊ธฐ ์ž‘์—…์„ ๊ธฐ๋ณธ์ ์œผ๋กœ ์ง€์›ํ•˜๋Š” FastAPI๊ฐ€ ๋งค์šฐ ์ ํ•ฉํ•ฉ๋‹ˆ๋‹ค. ๋‹ค์Œ์€ ํ…์ŠคํŠธ ์ƒ์„ฑ์„ ์œ„ํ•œ ๊ฐ„๋‹จํ•œ API ์—”๋“œํฌ์ธํŠธ๋ฅผ ๋งŒ๋“œ๋Š” ๋ฐฉ๋ฒ•์˜ ์˜ˆ์ž…๋‹ˆ๋‹ค.

import asyncio

from fastapi import FastAPI, BackgroundTasks
from pydantic import BaseModel
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI()

class GenerationRequest(BaseModel):
    prompt: str

class GenerationResponse(BaseModel):
    generated_text: str

@app.post("/generate", response_model=GenerationResponse)
async def generate_text(request: GenerationRequest, background_tasks: BackgroundTasks):
    response = await client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": request.prompt}]
    )
    generated_text = response.choices[0].message.content
    
    # Simulate some post-processing in the background
    background_tasks.add_task(log_generation, request.prompt, generated_text)
    
    return GenerationResponse(generated_text=generated_text)

async def log_generation(prompt: str, generated_text: str):
    # Simulate logging or additional processing
    await asyncio.sleep(2)
    print(f"Logged: Prompt '{prompt}' generated text of length {len(generated_text)}")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

์ด FastAPI ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์€ ์—”๋“œํฌ์ธํŠธ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. /generate ํ”„๋กฌํ”„ํŠธ๋ฅผ ๋ฐ›์•„๋“ค์ด๊ณ  ์ƒ์„ฑ๋œ ํ…์ŠคํŠธ๋ฅผ ๋ฐ˜ํ™˜ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋˜ํ•œ ์‘๋‹ต์„ ์ฐจ๋‹จํ•˜์ง€ ์•Š๊ณ  ์ถ”๊ฐ€ ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•ด ๋ฐฑ๊ทธ๋ผ์šด๋“œ ์ž‘์—…์„ ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•๋„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

๋ชจ๋ฒ” ์‚ฌ๋ก€ ๋ฐ ์ผ๋ฐ˜์ ์ธ ํ•จ์ •

๋น„๋™๊ธฐ LLM API๋ฅผ ์‚ฌ์šฉํ•  ๋•Œ ๋‹ค์Œ ๋ชจ๋ฒ” ์‚ฌ๋ก€๋ฅผ ์—ผ๋‘์— ๋‘์‹ญ์‹œ์˜ค.

  1. ์—ฐ๊ฒฐ ํ’€๋ง ์‚ฌ์šฉ: ์—ฌ๋Ÿฌ ์š”์ฒญ์„ ํ•˜๋Š” ๊ฒฝ์šฐ, ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ค„์ด๊ธฐ ์œ„ํ•ด ์—ฐ๊ฒฐ์„ ์žฌ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
  2. ์ ์ ˆํ•œ ์˜ค๋ฅ˜ ์ฒ˜๋ฆฌ๋ฅผ ๊ตฌํ˜„ํ•˜์„ธ์š”: ๋„คํŠธ์›Œํฌ ๋ฌธ์ œ, API ์˜ค๋ฅ˜, ์˜ˆ์ƒ์น˜ ๋ชปํ•œ ์‘๋‹ต์— ํ•ญ์ƒ ๋Œ€๋น„ํ•˜์„ธ์š”.
  3. ์š”๊ธˆ ์ œํ•œ์„ ์กด์ค‘ํ•˜์„ธ์š”: API์— ๊ณผ๋ถ€ํ•˜๊ฐ€ ๊ฑธ๋ฆฌ๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€ํ•˜๋ ค๋ฉด ์„ธ๋งˆํฌ์–ด๋‚˜ ๊ธฐํƒ€ ๋™์‹œ์„ฑ ์ œ์–ด ๋ฉ”์ปค๋‹ˆ์ฆ˜์„ ์‚ฌ์šฉํ•˜์„ธ์š”.
  4. ๋ชจ๋‹ˆํ„ฐ ๋ฐ ๊ธฐ๋ก: ์„ฑ๋Šฅ์„ ์ถ”์ ํ•˜๊ณ  ๋ฌธ์ œ๋ฅผ ์‹๋ณ„ํ•˜๊ธฐ ์œ„ํ•ด ํฌ๊ด„์ ์ธ ๋กœ๊น…์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.
  5. ์žฅํŽธ ์ฝ˜ํ…์ธ ์—๋Š” ์ŠคํŠธ๋ฆฌ๋ฐ์„ ์‚ฌ์šฉํ•˜์„ธ์š”: ์‚ฌ์šฉ์ž ๊ฒฝํ—˜์„ ํ–ฅ์ƒ์‹œํ‚ค๊ณ  ๋ถ€๋ถ„์ ์ธ ๊ฒฐ๊ณผ์˜ ์กฐ๊ธฐ ์ฒ˜๋ฆฌ๊ฐ€ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

์ €๋Š” ์ง€๋‚œ 50๋…„ ๋™์•ˆ ๊ธฐ๊ณ„ ํ•™์Šต๊ณผ ๋”ฅ ๋Ÿฌ๋‹์˜ ๋งคํ˜น์ ์ธ ์„ธ๊ณ„์— ๋ชฐ๋‘ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ €์˜ ์—ด์ •๊ณผ ์ „๋ฌธ โ€‹โ€‹์ง€์‹์€ ํŠนํžˆ AI/ML์— ์ค‘์ ์„ ๋‘” XNUMX๊ฐœ ์ด์ƒ์˜ ๋‹ค์–‘ํ•œ ์†Œํ”„ํŠธ์›จ์–ด ์—”์ง€๋‹ˆ์–ด๋ง ํ”„๋กœ์ ํŠธ์— ๊ธฐ์—ฌํ•˜๋„๋ก ์ด๋Œ์—ˆ์Šต๋‹ˆ๋‹ค. ๋‚˜์˜ ๊ณ„์†๋˜๋Š” ํ˜ธ๊ธฐ์‹ฌ์€ ๋˜ํ•œ ๋‚ด๊ฐ€ ๋” ํƒ๊ตฌํ•˜๊ณ  ์‹ถ์€ ๋ถ„์•ผ์ธ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ๋กœ ๋‚˜๋ฅผ ์ด๋Œ์—ˆ์Šต๋‹ˆ๋‹ค.