Overview

Let’s have a quick look at the 🤗 Accelerated Inference API.

Main features:

  • Leverage 5,000+ Transformer models (T5, Blenderbot, Bart, GPT-2, Pegasus…)

  • Upload, manage and serve your own models privately

  • Run Classification, NER, Conversational, Summarization, Translation, Question-Answering, Embeddings Extraction tasks

  • Get up to 10x inference speedup to reduce user latency

  • Accelerated inference on CPU and GPU (GPU requires a Startup or Enterprise plan)

  • Run large models that are challenging to deploy in production

  • Scale to 1,000 requests per second with automatic scaling built-in

  • Ship new NLP features faster as new models become available

  • Build your business on a platform powered by the reference open source project in NLP

Get your API Token

To get started you need to:

  • Register or Login to your Hugging Face account

  • Retrieve your API token in your account settings

You should see a token api_XXXXXXXX or api_org_XXXXXXX.

If you do not submit your API token when sending requests to the API, you will not be able to run inference on your private models or benefit from the model pinning and acceleration features of the API.

Running Inference with API Requests

The first step is to choose which model you are going to run. Go to the Model Hub and select the model you want to use. If you are unsure where to start, make sure to check our recommended models for each NLP task available.

ENDPOINT = https://api-inference.huggingface.co/models/<MODEL_ID>

Let’s use gpt2 as an example. To run inference, simply use this code:

import json
import requests

API_TOKEN = "api_XXXXXXXX"  # your API token, see "Get your API Token" above
API_URL = "https://api-inference.huggingface.co/models/gpt2"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

def query(payload):
    data = json.dumps(payload)
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8"))

data = query("Can you please let us know more details about your ")
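The response for a text-generation model like gpt2 is a JSON list of completions; the exact shape is described in the Detailed parameters documentation, but a minimal sketch for reading it (assuming the standard generated_text field) looks like this:

for completion in data:
    # each completion is a dict; "generated_text" holds the continuation of the prompt
    print(completion["generated_text"])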

API Options and Parameters

Depending on the task (aka pipeline) the model is configured for, the request will accept specific parameters. When sending requests to run any model, API options allow you to specify the caching and model loading behavior, and inference on GPU (Startup or Enterprise plan required). All API options and parameters are detailed in Detailed parameters.
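As an illustration, a text-generation request can combine task-specific parameters with API options in a single payload. The parameter names (max_new_tokens, temperature) and options (use_cache, wait_for_model) below follow the Detailed parameters documentation; the specific values are just an assumption for this sketch:

payload = {
    "inputs": "Can you please let us know more details about your ",
    # task-specific parameters (here for text generation)
    "parameters": {"max_new_tokens": 50, "temperature": 0.7},
    # API options controlling caching and model loading behavior
    "options": {"use_cache": True, "wait_for_model": True},
}
data = query(payload)  # reuses the query() helper defined above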

Using CPU-Accelerated Inference (~10x speedup)

As an API customer, your API token will automatically enable CPU-Accelerated inference on your requests. For instance, if you compare inference on the gpt2 model through our API with CPU-Acceleration to running inference on the model out of the box on a local setup, you should measure a ~10x speedup. The specific performance boost depends on the model and input payload.

To verify you are using the CPU-Accelerated version of a model, you can check the x-compute-type header of your requests, which should be cpu+optimized. If you do not see it, it simply means not all optimizations are turned on. If you contact us at api-enterprise@huggingface.co, we would probably be able to increase the inference speed, depending on your actual use case.
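To inspect this header from Python, look at the response object itself rather than only its body; a minimal sketch reusing the API_URL and headers defined above:

import requests

response = requests.post(API_URL, headers=headers, json={"inputs": "Hello"})
# "cpu+optimized" indicates CPU-Accelerated inference is active for this request
print(response.headers.get("x-compute-type"))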

Using GPU-Accelerated Inference

In order to use GPU-Accelerated inference, you need a Startup or Enterprise plan. To run any model on a GPU, you need to specify it via an option in your request:

{"inputs": "...REGULAR INPUT...", "options": {"use_gpu": true}}

Using GPU-Accelerated inference should produce a significant speedup for all models.

To verify you are using the GPU-Accelerated version of the model you can check the x-compute-type header of your requests, which should be gpu.
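Putting the two together, here is a sketch of a GPU request that also verifies the compute type; it assumes a Startup or Enterprise plan and reuses the API_URL and headers defined earlier:

payload = {
    "inputs": "Can you please let us know more details about your ",
    "options": {"use_gpu": True},  # GPU inference requires a Startup or Enterprise plan
}
response = requests.post(API_URL, headers=headers, json=payload)
# should print "gpu" when GPU-Accelerated inference is used
print(response.headers.get("x-compute-type"))
print(response.json())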

Please note: Contact us at api-enterprise@huggingface.co to discuss your use case and usage profile when running GPU-Accelerated inference on many models or large models, so we can optimize the infrastructure accordingly.

Using Large Models (>10 GB)

Large models do not get loaded automatically to protect quality of service. Contact us at api-inference@huggingface.co so we can configure large models for your endpoints.

Model Pinning / Preloading

With over 5,000 models available in the Model Hub, not all of them can be loaded in compute memory to be instantly available for inference. To guarantee model availability for API customers who integrate them into production applications, we offer to pin frequently used models to their API endpoints, so these models are always instantly available for inference.

The number of models that can be pinned depends on the selected API plan. To get a model pinned to your account, please contact us at api-enterprise@huggingface.co.