AI Engineer Guide

Code Embeddings with Codestral Embed by Mistral

Mistral has recently launched Codestral Embed, an embedding model they've trained specifically for code.

They claim 👇

Codestral Embed significantly outperforms leading code embedders in the market today: Voyage Code 3, Cohere Embed v4.0 and OpenAI’s large embedding model.

You can use it by specifying codestral-embed as the model name.

When generating embeddings, you can configure both output_dimension and output_dtype.

Output DType

output_dtype lets you select the precision and format of the returned embeddings, so you can pick the level of numerical accuracy and representation that suits your use case.

These are the types they've mentioned in their API docs:

float (default): A list of 32-bit (4-byte) single-precision floating-point numbers. Provides the highest precision and retrieval accuracy.
int8: A list of 8-bit (1-byte) integers ranging from -128 to 127.
uint8: A list of 8-bit (1-byte) integers ranging from 0 to 255.
binary: A list of 8-bit integers that represent bit-packed, quantized single-bit embedding values, using the int8 type. The length of the returned list is 1/8 of output_dimension. This type uses the offset binary method.
ubinary: Similar to binary, but uses the uint8 type for the bit-packed, quantized single-bit embedding values.
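
The binary and ubinary values arrive bit-packed, so you need to unpack them client-side before computing similarities. Here's a minimal sketch of how that could look with numpy; the helper name and the excess-128 reading of "offset binary" are my assumptions, not something spelled out in Mistral's docs.

import numpy as np

def unpack_bit_embedding(values, output_dimension, dtype="binary"):
    # values: the packed integers returned by the API
    #         (list length == output_dimension / 8)
    if dtype == "binary":
        # Offset binary: shift the int8 range [-128, 127] up to [0, 255]
        packed = (np.asarray(values, dtype=np.int16) + 128).astype(np.uint8)
    elif dtype == "ubinary":
        packed = np.asarray(values, dtype=np.uint8)
    else:
        raise ValueError("expected 'binary' or 'ubinary'")
    bits = np.unpackbits(packed)[:output_dimension]
    return bits  # 0/1 vector; map to -1/+1 if your similarity metric needs it

# e.g. a 1536-dimensional binary embedding arrives as 192 packed integers:
# bits = unpack_bit_embedding(packed_values, output_dimension=1536)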

Output Dimension

The model's output dimension is flexible: you can configure the dimension you want (it defaults to 1536, with a maximum of 3072).

For any integer target dimension n, you can retain just the first n dimensions. Dimensions are ordered by relevance, so keeping the first n gives a smooth trade-off between quality and cost.
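
In other words, you can request the full-size embedding once and truncate it locally whenever you need a cheaper representation. Here's a hedged sketch of that truncation; the re-normalization step is my own addition (it's the usual practice before cosine similarity), not something Mistral prescribes.

import numpy as np

def truncate_embedding(embedding, n):
    # Keep the first n (most relevant) dimensions and re-normalize
    truncated = np.asarray(embedding, dtype=np.float32)[:n]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated

# e.g. store 3072-dimensional embeddings, serve search from the first 512 dims
# small = truncate_embedding(full_embedding, 512)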

Example

problem_description="Given two numbers, return their sum. Example: Input: a = 3, b = 5 Output: 8"

solution="def add_numbers(a, b): return a + b"

curl -X POST "https://api.mistral.ai/v1/embeddings" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer ${API_KEY}" \
     -d '{
           "model": "codestral-embed",
           "output_dimension": 10,
           "output_dtype": "float",
           "input": [
             "'"$problem_description"'",
             "'"$solution"'"
           ]
         }' \
     -o embedding.json
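
The response lands in embedding.json with one entry per input under data, in the same order as the input array. As a quick sanity check you can compare the two vectors with a few lines of Python; this sketch assumes the usual data[i].embedding response shape, so adjust if your response differs.

import json
import math

with open("embedding.json") as f:
    response = json.load(f)

# data[0] is the problem description, data[1] is the solution
problem_vec = response["data"][0]["embedding"]
solution_vec = response["data"][1]["embedding"]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(problem_vec, solution_vec))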

Use cases

Codestral Embed is designed to excel in code retrieval and semantic understanding. According to their website, key use cases include retrieval-augmented generation over code, semantic code search, similarity search and duplicate detection, and semantic clustering for code analytics.
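
To make the code search case concrete, here's a hedged sketch that embeds a few snippets plus a natural-language query through the same /v1/embeddings endpoint and ranks the snippets by cosine similarity. The corpus, helper names, and use of the requests library are illustrative assumptions, not part of Mistral's docs.

import os
import numpy as np
import requests

API_URL = "https://api.mistral.ai/v1/embeddings"
API_KEY = os.environ["API_KEY"]  # same variable as in the curl example above

def embed(texts):
    # Call the embeddings endpoint and return one vector per input text
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "codestral-embed", "input": texts},
    )
    resp.raise_for_status()
    return np.array([item["embedding"] for item in resp.json()["data"]])

# Illustrative corpus of snippets to search over
corpus = [
    "def add_numbers(a, b): return a + b",
    "def read_file(path): return open(path).read()",
    "def reverse_string(s): return s[::-1]",
]
query = "function that sums two integers"

corpus_vecs = embed(corpus)
query_vec = embed([query])[0]

# Rank snippets by cosine similarity to the query
scores = corpus_vecs @ query_vec / (
    np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(query_vec)
)
for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {corpus[idx]}")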

What’s the catch?

Codestral Embed is available on Mistral's API under the name codestral-embed-2505 at a price of $0.15 per million tokens. It is also available on their batch API at a 50% discount.

Mistral does not provide open weights; you can use the model via their API only.

If you want an open-weights code embedding model, you can try something like nomic-embed-code or jina-embeddings-v2-base-code.
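
If you go the open-weights route, a local setup could look something like the sketch below. It assumes jina-embeddings-v2-base-code loads through sentence-transformers with trust_remote_code=True; check the model cards for the exact recommended usage.

from sentence_transformers import SentenceTransformer

# Assumption: the Jina code model loads via sentence-transformers with
# remote code enabled; see the model card for the canonical snippet
model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-code", trust_remote_code=True
)

snippets = [
    "def add_numbers(a, b): return a + b",
    "def reverse_string(s): return s[::-1]",
]
embeddings = model.encode(snippets, normalize_embeddings=True)
print(embeddings.shape)  # (number of snippets, embedding dimension)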

Happy embedding code!

#Bookmark #Mistral #Codestral-Embed #Embedding