Detailed parameters

Which task is used by this model?

In general the 🤗 Hosted API Inference accepts a simple string as an input. However, more advanced usage depends on the “task” that the model solves.

The “task” of a model is defined on its model page:

[Screenshot: the task shown on a model page]

Zero-shot classification task

This task is super useful for trying out classification with zero code: you simply pass a sentence/paragraph and the possible labels for that sentence, and you get a result.

See also

Recommended model: facebook/bart-large-mnli.

Request:

import json

import requests

headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/facebook/bart-large-mnli"

def query(payload):
    data = json.dumps(payload)
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8"))

data = query(
    {
        "inputs": "Hi, I recently bought a device from your company but it is not working as advertised and I would like to get reimbursed!",
        "parameters": {"candidate_labels": ["refund", "legal", "faq"]},
    }
)

When sending your request, you should send a JSON encoded payload. Here are all the options

All parameters

inputs (required)

a string or list of strings

parameters (required)

a dict containing the following keys:

- candidate_labels (required)

a list of strings that are potential classes for inputs.

- multi_label

(Default: false) Boolean that is set to True if classes can overlap

options

a dict containing the following keys:

- use_gpu

(Default: false). Boolean to use GPU instead of CPU for inference (requires Startup plan at least)

- use_cache

(Default: true). Boolean. There is a cache layer on the inference API to speed up requests we have already seen. Most models can use those results as-is, as models are deterministic (meaning the results will be the same anyway). However, if you use a non-deterministic model, you can set this parameter to prevent the caching mechanism from being used, resulting in a real new query.

- wait_for_model

(Default: false) Boolean. If the model is not ready, wait for it instead of receiving a 503 error. It limits the number of requests required to get your inference done. It is advised to only set this flag to true after receiving a 503 error, as it will limit hanging in your application to known places.

Return value is either a dict or a list of dicts if you sent a list of inputs

Response:

{
    "sequence": "Hi, I recently bought a device from your company but it is not working as advertised and I would like to get reimbursed!",
    "labels": ["refund", "faq", "legal"],
    "scores": [
        # 87% refund
        0.8777876496315002,
        0.10522646456956863,
        0.016985882073640823,
    ],
}
Returned values

sequence

The string sent as an input

labels

The list of strings for labels that you sent (in order)

scores

a list of floats that correspond to the probability of each label, in the same order as labels.
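
To combine the parameters and options described above, here is a minimal sketch (reusing the query helper and API_TOKEN defined in the example above; the input sentence is just an illustration):

# Sketch only: reuses the query() helper and API_TOKEN from the example above.
data = query(
    {
        "inputs": "I want my money back and I am considering taking legal action.",
        "parameters": {
            "candidate_labels": ["refund", "legal", "faq"],
            # multi_label lets classes overlap, so scores no longer sum to 1
            "multi_label": True,
        },
        # wait instead of receiving a 503 while the model loads
        "options": {"wait_for_model": True},
    }
)

# labels and scores are returned in the same order, so they can be zipped together
for label, score in zip(data["labels"], data["scores"]):
    print(f"{label}: {score:.2%}")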

Translation task

This task translates text from one language to another.

See also

Recommended model: Helsinki-NLP/opus-mt-ru-en (Helsinki-NLP uploaded many models covering many language pairs). Another recommended model: t5-base.

Example:

import json

import requests

headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/Helsinki-NLP/opus-mt-ru-en"

def query(payload):
    data = json.dumps(payload)
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8"))

data = query(
    {
        "inputs": "Меня зовут Вольфганг и я живу в Берлине",
    }
)

# Response
[
    {
        "translation_text": "My name is Wolfgang and I live in Berlin.",
    },
]

When sending your request, you should send a JSON encoded payload. Here are all the options

All parameters

inputs (required)

a string to be translated, in the original language

options

a dict containing the following keys:

- use_gpu

(Default: false). Boolean to use GPU instead of CPU for inference (requires Startup plan at least)

- use_cache

(Default: true). Boolean. There is a cache layer on the inference API to speed up requests we have already seen. Most models can use those results as-is, as models are deterministic (meaning the results will be the same anyway). However, if you use a non-deterministic model, you can set this parameter to prevent the caching mechanism from being used, resulting in a real new query.

- wait_for_model

(Default: false) Boolean. If the model is not ready, wait for it instead of receiving a 503 error. It limits the number of requests required to get your inference done. It is advised to only set this flag to true after receiving a 503 error, as it will limit hanging in your application to known places.

Return value is either a dict or a list of dicts if you sent a list of inputs

Returned values

translation_text

The string after translation
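
Since a list of inputs returns a list of dicts, a batched request might look like the sketch below (reusing the query helper from the example above; the second sentence is just an illustration):

# Sketch only: reuses the query() helper and API_TOKEN from the example above.
data = query(
    {
        "inputs": [
            "Меня зовут Вольфганг и я живу в Берлине",
            "Я живу в Москве",  # hypothetical second sentence, for illustration only
        ],
    }
)

# one dict per input, in the same order as the inputs
for item in data:
    print(item["translation_text"])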

Summarization task

This task summarizes a longer text into a shorter one. Be careful: some models have a maximum input length, so the summarizer cannot handle full books, for instance. Be careful when choosing your model. If you want to discuss your summarization needs, please get in touch: api-inference@huggingface.co

See also

Recommended model: facebook/bart-large-cnn.

Example:

import json

import requests

headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/facebook/bart-large-cnn"

def query(payload):
    data = json.dumps(payload)
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8"))

data = query(
    {
        "inputs": "The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct.",
    }
)

# Response
[
    {
        "summary_text": "The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world.",
    },
]

When sending your request, you should send a JSON encoded payload. Here are all the options

All parameters

inputs (required)

a string to be summarized

options

a dict containing the following keys:

- use_gpu

(Default: false). Boolean to use GPU instead of CPU for inference (requires Startup plan at least)

- use_cache

(Default: true). Boolean. There is a cache layer on the inference API to speed up requests we have already seen. Most models can use those results as-is, as models are deterministic (meaning the results will be the same anyway). However, if you use a non-deterministic model, you can set this parameter to prevent the caching mechanism from being used, resulting in a real new query.

- wait_for_model

(Default: false) Boolean. If the model is not ready, wait for it instead of receiving a 503 error. It limits the number of requests required to get your inference done. It is advised to only set this flag to true after receiving a 503 error, as it will limit hanging in your application to known places.

Return value is either a dict or a list of dicts if you sent a list of inputs

Returned values

summary_text

The string after summarization
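
If the model needs to be loaded first, the options described above can make the request more robust. A minimal sketch (reusing the query helper from the example above; the placeholder text stands in for your own document):

# Sketch only: reuses the query() helper and API_TOKEN from the example above.
long_text = "..."  # placeholder: your full article text goes here

data = query(
    {
        "inputs": long_text,
        "options": {
            "wait_for_model": True,  # wait instead of receiving a 503 while the model loads
            "use_cache": True,  # identical requests can be served from the cache
        },
    }
)

print(data[0]["summary_text"])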

Conversational task

This task corresponds to any chatbot-like structure. Models tend to have a shorter max_length, so please check carefully whether a given model fits your needs if you require long-range dependencies.

See also

Recommended model: microsoft/DialoGPT-large.

Example:

import json

import requests

headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/microsoft/DialoGPT-large"

def query(payload):
    data = json.dumps(payload)
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8"))

data = query(
    {
        "inputs": {
            "past_user_inputs": ["Which movie is the best ?"],
            "generated_responses": ["It's Die Hard for sure."],
            "text": "Can you explain why ?",
        },
    }
)

# Response
{
    "generated_text": "It's the best movie ever.",
    "conversation": {
        "past_user_inputs": [
            "Which movie is the best ?",
            "Can you explain why ?",
        ],
        "generated_responses": [
            "It's Die Hard for sure.",
            "It's the best movie ever.",
        ],
    },
    "warnings": ["Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation."],
}

When sending your request, you should send a JSON encoded payload. Here are all the options

All parameters

inputs (required)

a dict containing the following keys:

- text (required)

The last input from the user in the conversation.

- generated_responses

A list of strings corresponding to the earlier replies from the model.

- past_user_inputs

A list of strings corresponding to the earlier replies from the user. Should be the same length as generated_responses.

options

a dict containing the following keys:

- use_gpu

(Default: false). Boolean to use GPU instead of CPU for inference (requires Startup plan at least)

- use_cache

(Default: true). Boolean. There is a cache layer on the inference API to speed up requests we have already seen. Most models can use those results as-is, as models are deterministic (meaning the results will be the same anyway). However, if you use a non-deterministic model, you can set this parameter to prevent the caching mechanism from being used, resulting in a real new query.

- wait_for_model

(Default: false) Boolean. If the model is not ready, wait for it instead of receiving a 503 error. It limits the number of requests required to get your inference done. It is advised to only set this flag to true after receiving a 503 error, as it will limit hanging in your application to known places.

Return value is either a dict or a list of dicts if you sent a list of inputs

Returned values

generated_text

The answer of the bot

conversation

A convenience dictionary to send back as-is for the next request (with the new user input added).

- past_user_inputs

List of strings. The last inputs from the user in the conversation, after the model has run.

- generated_responses

List of strings. The last outputs from the model in the conversation, after the model has run.
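
The conversation dict is meant to be sent back on the next turn. A minimal sketch (reusing the query helper and the data response from the example above; the new user message is just an illustration):

# Sketch only: reuses the query() helper and the `data` response from the example above.
conversation = data["conversation"]

next_data = query(
    {
        "inputs": {
            # send the history back exactly as it was returned
            "past_user_inputs": conversation["past_user_inputs"],
            "generated_responses": conversation["generated_responses"],
            "text": "Who starred in it?",  # the new user message for this turn
        },
    }
)

print(next_data["generated_text"])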

Table question answering task

Don’t know SQL? Don’t want to dive into a large spreadsheet? Ask it questions in plain English!

See also

Recommended model: google/tapas-base-finetuned-wtq.

Example:

import json

import requests

headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/google/tapas-base-finetuned-wtq"

def query(payload):
    data = json.dumps(payload)
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8"))

data = query(
    {
        "inputs": {
            "query": "How many stars does the transformers repository have?",
            "table": {
                "Repository": ["Transformers", "Datasets", "Tokenizers"],
                "Stars": ["36542", "4512", "3934"],
                "Contributors": ["651", "77", "34"],
                "Programming language": [
                    "Python",
                    "Python",
                    "Rust, Python and NodeJS",
                ],
            },
        }
    }
)

When sending your request, you should send a JSON encoded payload. Here are all the options

All parameters

inputs (required)

a dict containing the following keys:

- query (required)

The query in plain text that you want to ask the table

- table

A table of data represented as a dict of lists, where the keys are the column headers and the lists are the column values; all lists must have the same length.

options

a dict containing the following keys:

- use_gpu

(Default: false). Boolean to use GPU instead of CPU for inference (requires Startup plan at least)

- use_cache

(Default: true). Boolean. There is a cache layer on the inference API to speed up requests we have already seen. Most models can use those results as-is, as models are deterministic (meaning the results will be the same anyway). However, if you use a non-deterministic model, you can set this parameter to prevent the caching mechanism from being used, resulting in a real new query.

- wait_for_model

(Default: false) Boolean. If the model is not ready, wait for it instead of receiving a 503 error. It limits the number of requests required to get your inference done. It is advised to only set this flag to true after receiving a 503 error, as it will limit hanging in your application to known places.

Return value is either a dict or a list of dicts if you sent a list of inputs

Response:

{
    "answer": "AVERAGE > 36542",
    "coordinates": [[0, 1]],
    "cells": ["36542"],
    "aggregator": "AVERAGE",
}
Returned values

answer

The plaintext answer

coordinates

a list of coordinates of the cells referenced in the answer

cells

a list of the contents of the cells referenced in the answer

aggregator

The aggregator used to get the answer
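
Because the table is a dict of lists keyed by column headers, it maps naturally from a pandas DataFrame. A hedged sketch (pandas is an assumption here, not a requirement of the API; it reuses the query helper from the example above):

# Sketch only: pandas is an assumption, not required by the API itself.
import pandas as pd

df = pd.DataFrame(
    {
        "Repository": ["Transformers", "Datasets", "Tokenizers"],
        "Stars": ["36542", "4512", "3934"],
    }
)

# the API expects every cell as a string and all columns of the same length
table = {column: df[column].astype(str).tolist() for column in df.columns}

data = query(
    {
        "inputs": {
            "query": "How many stars does the transformers repository have?",
            "table": table,
        }
    }
)
print(data["answer"])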

Question answering task

Want to have a nice know-it-all bot that can answer any question?

See also

Recommended model: deepset/roberta-base-squad2.

Example:

import json

import requests

headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/deepset/roberta-base-squad2"

def query(payload):
    data = json.dumps(payload)
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8"))

data = query(
    {
        "inputs": {
            "question": "What's my name?",
            "context": "My name is Clara and I live in Berkeley.",
        }
    }
)

When sending your request, you should send a JSON encoded payload. Here are all the options

All parameters

inputs (required)

a dict containing the following keys:

- question (required)

The question as a string that has an answer within context.

- context (required)

A string that contains the answer to the question

options

a dict containing the following keys:

- use_gpu

(Default: false). Boolean to use GPU instead of CPU for inference (requires Startup plan at least)

- use_cache

(Default: true). Boolean. There is a cache layer on the inference API to speed up requests we have already seen. Most models can use those results as-is, as models are deterministic (meaning the results will be the same anyway). However, if you use a non-deterministic model, you can set this parameter to prevent the caching mechanism from being used, resulting in a real new query.

- wait_for_model

(Default: false) Boolean. If the model is not ready, wait for it instead of receiving a 503 error. It limits the number of requests required to get your inference done. It is advised to only set this flag to true after receiving a 503 error, as it will limit hanging in your application to known places.

Return value is either a dict or a list of dicts if you sent a list of inputs

Response:

{"score": 0.9842838644981384, "start": 11, "end": 16, "answer": "Clara"}
Returned values

answer

A string that’s the answer within the text.

score

A float that represents how likely it is that the answer is correct

start

The index (string-wise) of the start of the answer within context.

end

The index (string-wise) of the end of the answer within context.
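
Since start and end are string indices into context, they can be used to recover or highlight the answer. A minimal sketch using the data response from the example above:

# Sketch only: uses the `data` response from the example above.
context = "My name is Clara and I live in Berkeley."

answer_span = context[data["start"]:data["end"]]
assert answer_span == data["answer"]  # "Clara"

print(f"Answer: {data['answer']} (score {data['score']:.2f})")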

Text-classification task

Usually used for sentiment analysis, this will output the likelihood of each class for a given input.

Example:

import json

import requests

headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english"

def query(payload):
    data = json.dumps(payload)
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8"))

data = query({"inputs": "I like you. I love you"})

When sending your request, you should send a JSON encoded payload. Here are all the options

All parameters

inputs (required)

a string to be classified

options

a dict containing the following keys:

- use_gpu

(Default: false). Boolean to use GPU instead of CPU for inference (requires Startup plan at least)

- use_cache

(Default: true). Boolean. There is a cache layer on the inference API to speed up requests we have already seen. Most models can use those results as-is, as models are deterministic (meaning the results will be the same anyway). However, if you use a non-deterministic model, you can set this parameter to prevent the caching mechanism from being used, resulting in a real new query.

- wait_for_model

(Default: false) Boolean. If the model is not ready, wait for it instead of receiving a 503 error. It limits the number of requests required to get your inference done. It is advised to only set this flag to true after receiving a 503 error, as it will limit hanging in your application to known places.

Return value is either a dict or a list of dicts if you sent a list of inputs

Response:

[
    [
        {"label": "NEGATIVE", "score": 0.0001261125144083053},
        {"label": "POSITIVE", "score": 0.9998738765716553},
    ]
]
Returned values

label

The label for the class (model specific)

score

A float that represents how likely it is that the text belongs to this class.
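
The response is a list with one entry per input, each entry being a list of label/score dicts, so picking the most likely class is a one-liner. A minimal sketch using the data response from the example above:

# Sketch only: uses the `data` response from the example above.
# data[0] holds the scores for the first (and here only) input
best = max(data[0], key=lambda item: item["score"])
print(best["label"], best["score"])  # e.g. POSITIVE 0.9998...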

Named Entity Recognition (NER) task

See Token-classification task

Token-classification task

Usually used for sentence parsing, either grammatical, or Named Entity Recognition (NER) to understand keywords contained within text.

Example:

import json

import requests

headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/dbmdz/bert-large-cased-finetuned-conll03-english"

def query(payload):
    data = json.dumps(payload)
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8"))

data = query({"inputs": "My name is Sarah Jessica Parker but you can call me Jessica"})

When sending your request, you should send a JSON encoded payload. Here are all the options

All parameters

inputs (required)

a string to be classified

options

a dict containing the following keys:

- use_gpu

(Default: false). Boolean to use GPU instead of CPU for inference (requires Startup plan at least)

- use_cache

(Default: true). Boolean. There is a cache layer on the inference API to speed up requests we have already seen. Most models can use those results as-is, as models are deterministic (meaning the results will be the same anyway). However, if you use a non-deterministic model, you can set this parameter to prevent the caching mechanism from being used, resulting in a real new query.

- wait_for_model

(Default: false) Boolean. If the model is not ready, wait for it instead of receiving a 503 error. It limits the number of requests required to get your inference done. It is advised to only set this flag to true after receiving a 503 error, as it will limit hanging in your application to known places.

Return value is either a dict or a list of dicts if you sent a list of inputs

Response:

[
    {
        "entity_group": "PER",
        "score": 0.9991337060928345,
        "word": "Sarah Jessica Parker",
        "start": 11,
        "end": 31,
    },
    {
        "entity_group": "PER",
        "score": 0.9979912042617798,
        "word": "Jessica",
        "start": 52,
        "end": 59,
    },
]
Returned values

entity_group

The type for the entity being recognized (model specific).

score

How likely the entity was recognized.

word

The string that was captured

start

The string offset where the entity starts. Useful to disambiguate if the word occurs multiple times.

end

The string offset where the entity ends. Useful to disambiguate if the word occurs multiple times.
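
Because start and end index into the original input string, they can be used to recover each entity span, even when the same word occurs several times. A minimal sketch using the data response from the example above:

# Sketch only: uses the `data` response from the example above.
text = "My name is Sarah Jessica Parker but you can call me Jessica"

for entity in data:
    span = text[entity["start"]:entity["end"]]
    print(f"{entity['entity_group']}: {span!r} (score {entity['score']:.3f})")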

Text-generation task

Used to continue text from a prompt. This is a very generic task.

See also

Recommended model: gpt2 (it’s a simple model, but fun to play with).

Example:

import json

import requests

headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/gpt2"

def query(payload):
    data = json.dumps(payload)
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8"))

data = query({"inputs": "The answer to the universe is"})

When sending your request, you should send a JSON encoded payload. Here are all the options

All parameters

inputs (required):

a string to be generated from

parameters

dict containing the following keys:

- top_k

(Default: None). Integer to define the top tokens considered within the sample operation to create new text.

- top_p

(Default: None). Float to define the tokens that are within the sample operation of text generation. Tokens are added to the sample, from most probable to least probable, until the sum of their probabilities is greater than top_p.

- temperature

(Default: 1.0). Float (0.0-100.0). The temperature of the sampling operation. 1 means regular sampling, 0 means top_k=1, and 100.0 gets closer to uniform probability.

- repetition_penalty

(Default: None). Float (0.0-100.0). The more a token is used within the generation, the more it is penalized, so it is less likely to be picked in successive generation passes.

- max_new_tokens

(Default: None). Int (0-256). The number of new tokens to be generated. This does not include the input length; it is an estimate of the size of the generated text you want. Each new token slows down the request, so look for a balance between response time and length of text generated.

- max_time

(Default: None). Float (0-120.0). The maximum amount of time in seconds that the query should take. Networking can cause some overhead, so it is a soft limit. Use it in combination with max_new_tokens for best results.

- return_full_text

(Default: True). Bool. If set to False, the returned results will not contain the original query, making it easier for prompting.

- num_return_sequences

(Default: 1). Integer. The number of propositions you want to be returned.

options

a dict containing the following keys:

- use_gpu

(Default: false). Boolean to use GPU instead of CPU for inference (requires Startup plan at least)

- use_cache

(Default: true). Boolean. There is a cache layer on the inference API to speed up requests we have already seen. Most models can use those results as-is, as models are deterministic (meaning the results will be the same anyway). However, if you use a non-deterministic model, you can set this parameter to prevent the caching mechanism from being used, resulting in a real new query.

- wait_for_model

(Default: false) Boolean. If the model is not ready, wait for it instead of receiving a 503 error. It limits the number of requests required to get your inference done. It is advised to only set this flag to true after receiving a 503 error, as it will limit hanging in your application to known places.

Return value is either a dict or a list of dicts if you sent a list of inputs

Response:

[
    {
        "generated_text": "The answer to the universe is not a question of how hard or soft our bones might find them, or how much blood they might experience. There are no super-hard, soft, hard bones in the universe. Some may just be in the bottom"
    }
]
Returned values

generated_text

The generated continuation of the input string
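
A hedged sketch combining several of the parameters above (reusing the query helper from the example above); since sampling is involved, the exact output will vary:

# Sketch only: reuses the query() helper and API_TOKEN from the example above.
data = query(
    {
        "inputs": "The answer to the universe is",
        "parameters": {
            "max_new_tokens": 50,  # cap on the generated length
            "temperature": 0.7,  # below 1.0 means less random sampling
            "top_k": 50,  # only sample from the 50 most likely tokens
            "return_full_text": False,  # do not repeat the prompt in the output
            "num_return_sequences": 2,  # ask for two different continuations
        },
        # sampling is non-deterministic, so skip the cache to get fresh generations
        "options": {"use_cache": False},
    }
)

for item in data:
    print(item["generated_text"])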

Text2text-generation task

Essentially the same as the Text-generation task, but it uses an encoder-decoder architecture, so more options might be added in the future.

Fill mask task

Tries to fill in a hole with a missing word (token to be precise). That’s the base task for BERT models.

See also

Recommended model: bert-base-uncased (it’s a simple model, but fun to play with).

Example:

import json

import requests

headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/bert-base-uncased"

def query(payload):
    data = json.dumps(payload)
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8"))

data = query({"inputs": "The answer to the universe is [MASK]."})

When sending your request, you should send a JSON encoded payload. Here are all the options

All parameters

inputs (required):

a string to be filled from, must contain the [MASK] token (check model card for exact name of the mask)

options

a dict containing the following keys:

- use_gpu

(Default: false). Boolean to use GPU instead of CPU for inference (requires Startup plan at least)

- use_cache

(Default: true). Boolean. There is a cache layer on the inference API to speed up requests we have already seen. Most models can use those results as-is, as models are deterministic (meaning the results will be the same anyway). However, if you use a non-deterministic model, you can set this parameter to prevent the caching mechanism from being used, resulting in a real new query.

- wait_for_model

(Default: false) Boolean. If the model is not ready, wait for it instead of receiving a 503 error. It limits the number of requests required to get your inference done. It is advised to only set this flag to true after receiving a 503 error, as it will limit hanging in your application to known places.

Return value is either a dict or a list of dicts if you sent a list of inputs

Response:

[
    {
        "sequence": "the answer to the universe is no.",
        "score": 0.16963955760002136,
        "token": 2053,
        "token_str": "no",
    },
    {
        "sequence": "the answer to the universe is nothing.",
        "score": 0.07344776391983032,
        "token": 2498,
        "token_str": "nothing",
    },
    {
        "sequence": "the answer to the universe is yes.",
        "score": 0.05803241208195686,
        "token": 2748,
        "token_str": "yes",
    },
    {
        "sequence": "the answer to the universe is unknown.",
        "score": 0.043957844376564026,
        "token": 4242,
        "token_str": "unknown",
    },
    {
        "sequence": "the answer to the universe is simple.",
        "score": 0.04015745222568512,
        "token": 3722,
        "token_str": "simple",
    },
]
Returned values

sequence

The actual sequence of tokens that ran against the model (may contain special tokens)

score

The probability for this token.

token

The id of the token

token_str

The string representation of the token
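
The candidates come back ordered by score, highest first, so inspecting them or picking the top fill is straightforward. A minimal sketch using the data response from the example above:

# Sketch only: uses the `data` response from the example above.
for candidate in data:
    print(f"{candidate['token_str']!r}: {candidate['score']:.3f}")

best = data[0]  # highest-scoring candidate
print("Top fill:", best["sequence"])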

Automatic speech recognition task

This task reads some audio input and outputs the words spoken within the audio file.

See also

Recommended model: check for a model trained on your language (the example below uses facebook/wav2vec2-base-960h for English).

Request:

import json

import requests

headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/facebook/wav2vec2-base-960h"

def query(filename):
    with open(filename, "rb") as f:
        data = f.read()
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8"))

data = query("sample1.flac")

When sending your request, you should send a binary payload that simply contains your audio file. We try to support most formats (FLAC, WAV, MP3, Ogg, etc.), and we automatically resample the audio to the appropriate sampling rate for the given model (usually 16kHz).

All parameters

no parameters

The request body is a binary representation of the audio file. No other parameters are currently allowed.

Return value is either a dict or a list of dicts if you sent a list of inputs

Response:

{
    "text": "GOING ALONG SLUSHY COUNTRY ROADS AND SPEAKING TO DAMP AUDIENCES IN DRAUGHTY SCHOOL ROOMS DAY AFTER DAY FOR A FORTNIGHT HE'LL HAVE TO PUT IN AN APPEARANCE AT SOME PLACE OF WORSHIP ON SUNDAY MORNING AND HE CAN COME TO US IMMEDIATELY AFTERWARDS"
}
Returned values

text

The string that was recognized within the audio file.
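
Since the endpoint takes one binary audio file per request, transcribing several files simply means looping over them. A minimal sketch reusing the query helper from the example above (the file names are placeholders):

# Sketch only: reuses the query() helper and API_TOKEN from the example above.
# the file names below are placeholders for your own audio files
for filename in ["sample1.flac", "sample2.flac"]:
    result = query(filename)
    print(f"{filename}: {result['text']}")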