Parallelism and batch jobs
Simple example
In order to get your answers as quickly as possible, you probably want to run some kind of parallelism on your jobs.
The best way is to use the wait_for_model option from options. This tells your request to wait for an available model, so when running large jobs it gives the API time to ramp up resources for your requests, no matter how much parallelism you are running.
Requests during the first few minutes will stall a bit, so expect them to run slower initially; then, as the API scales up, each request should run as fast as a single request would, and you can process your load as quickly as possible.
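To see what the option does in isolation: a request to a model that is not yet loaded normally fails with an HTTP 503 error, while with wait_for_model the call blocks until the model is ready. Here is a minimal single-request sketch (the API_TOKEN placeholder is an assumption; substitute your own token):

import json

import requests

API_TOKEN = "hf_xxx"  # placeholder: your Hugging Face API token

api_url = "https://api-inference.huggingface.co/models/joeddav/bart-large-mnli-yahoo-answers"
headers = {"Authorization": f"Bearer {API_TOKEN}"}
payload = {
    "inputs": "This is a test",
    "parameters": {"candidate_labels": ["medical", "fashion", "politics"]},
    # Block until the model is loaded instead of failing with a 503.
    "options": {"wait_for_model": True},
}
response = requests.post(api_url, headers=headers, data=json.dumps(payload))
print(json.loads(response.content))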
Here is a fuller example that runs a batch of such requests in parallel:
import json
from multiprocessing import Pool

import requests

API_TOKEN = "hf_xxx"  # placeholder: your Hugging Face API token

MODEL = "joeddav/bart-large-mnli-yahoo-answers"
NUM_REQUESTS = 30
NUM_JOBS = 5

api_url = f"https://api-inference.huggingface.co/models/{MODEL}"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

payload = {
    "inputs": "This is a test about a new dress shop that opened up on 5th avenue",
    # The key option is "wait_for_model"; "use_cache": False is only necessary
    # for testing purposes because we run the exact same request repeatedly.
    "options": {"use_cache": False, "wait_for_model": True},
    "parameters": {"candidate_labels": ["medical", "fashion", "politics"]},
}
payloads = [payload] * NUM_REQUESTS


def run(payload):
    response = requests.post(api_url, headers=headers, data=json.dumps(payload))
    answer = json.loads(response.content)
    # Pick the candidate label with the highest score.
    max_score = 0
    max_label = None
    for label, score in zip(answer["labels"], answer["scores"]):
        if score > max_score:
            max_score = score
            max_label = label
    return max_label


if __name__ == "__main__":
    # The __main__ guard is required for multiprocessing on platforms that
    # use the "spawn" start method (Windows, macOS).
    with Pool(NUM_JOBS) as p:
        results = p.map(run, payloads)
    # All results are "fashion"
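Because these requests spend their time waiting on the network rather than computing, threads work just as well as processes and are cheaper to start. As a sketch of an alternative (not part of the original example), the same batch can be run with concurrent.futures, reusing the run function and payloads list from above:

from concurrent.futures import ThreadPoolExecutor

# Threads avoid the __main__ guard and payload pickling that
# multiprocessing requires, which is fine here since run() is I/O-bound.
with ThreadPoolExecutor(max_workers=NUM_JOBS) as executor:
    results = list(executor.map(run, payloads))
print(results)

Either primitive works; the important part is the wait_for_model option, which lets the API absorb the burst no matter how the requests are parallelized.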