Parallelism and batch jobs

Simple example

To get your answers back as quickly as possible, you will probably want to run your jobs with some form of parallelism.

The best approach is to use the wait_for_model field from options. This tells the API to hold your request until a model is available instead of failing immediately. When running large jobs, it gives the API time to ramp up resources for your requests, regardless of how much parallelism you are running.

Requests during the first few minutes will be stalled while the API scales up, so expect them to run a bit slower initially. Once enough resources are available, each request should run as fast as a single request, and you can process your whole load as quickly as possible.
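For a single request, the option is just an extra field in the payload. Here is a minimal sketch; it reuses the model and candidate labels from the full example below and assumes your token is stored in the HF_API_TOKEN environment variable:

import os

import requests

API_URL = "https://api-inference.huggingface.co/models/joeddav/bart-large-mnli-yahoo-answers"
headers = {"Authorization": f"Bearer {os.environ['HF_API_TOKEN']}"}

response = requests.post(
    API_URL,
    headers=headers,
    json={
        "inputs": "This is a test",
        "parameters": {"candidate_labels": ["medical", "fashion", "politics"]},
        # Wait for the model to be loaded instead of failing immediately.
        "options": {"wait_for_model": True},
    },
)
print(response.json())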

Here is a small example running many such requests in parallel:

import json
import os
from multiprocessing import Pool

import requests

MODEL = "joeddav/bart-large-mnli-yahoo-answers"
NUM_REQUESTS = 30
NUM_JOBS = 5
api_url = f"https://api-inference.huggingface.co/models/{MODEL}"
# Read the API token from the environment so the script is self-contained.
HF_API_TOKEN = os.environ["HF_API_TOKEN"]
headers = {"Authorization": f"Bearer {HF_API_TOKEN}"}

payload = {
    "inputs": "This is a test about a new dress shop that opened up on 5th avenue",
    # The key option is "wait_for_model". "use_cache": False is only needed here
    # because we send the exact same request many times for testing purposes.
    "options": {"use_cache": False, "wait_for_model": True},
    "parameters": {"candidate_labels": ["medical", "fashion", "politics"]},
}
payloads = [payload] * NUM_REQUESTS

def run(payload):
    # POST the payload and parse the JSON response.
    response = requests.post(api_url, headers=headers, data=json.dumps(payload))
    answer = response.json()

    # Return the candidate label with the highest score.
    max_score = 0
    max_label = None
    for label, score in zip(answer["labels"], answer["scores"]):
        if score > max_score:
            max_score = score
            max_label = label

    return max_label

if __name__ == "__main__":
    # The guard is required for multiprocessing on platforms that spawn
    # worker processes (Windows, macOS) rather than forking.
    with Pool(NUM_JOBS) as p:
        results = p.map(run, payloads)
    # All results are "fashion"
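Since these jobs spend their time waiting on the network rather than on the CPU, a thread pool works just as well as multiprocessing and avoids the process startup overhead. A minimal sketch of the same loop using concurrent.futures, reusing run and payloads from the example above:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=NUM_JOBS) as executor:
    results = list(executor.map(run, payloads))

You can raise NUM_JOBS to increase throughput; with wait_for_model set, the API will simply hold the extra requests until resources catch up.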