Working with AI data stored in S3-compatible object storage

Suggest edits

We recommend you to prepare your own S3 compatible object storage bucket with some test data and try the steps in this section with that. But it is possible to simply use the example S3 bucket data as is in the examples here even with your custom access key and secret key credentials because these have been configured for public access.

In addition we use image data and an according image encoder LLM in this example instead of text data. But you could also use plain text data on object storage similar to the examples in the previous section.

First let's create a retriever for images stored on s3-compatible object storage as the source. We specify torsten as the bucket name and an endpoint URL where the bucket is created. We specify an empty string as prefix because we want all the objects in that bucket. We use the clip-vit-base-patch32 open encoder model for image data from HuggingFace. We provide a name for the retriever so that we can identify and reference it subsequent operations:

SELECT pgai.create_s3_retriever(
    'image_embeddings', -- Name of the similarity retrieval setup
    'public', -- Schema of the source table
    'clip-vit-base-patch32', -- Embeddings encoder model for similarity data
    'img', -- data type, could be either img or text
    'torsten', -- S3 bucket name
    '', -- prefix
    'https://s3.us-south.cloud-object-storage.appdomain.cloud' -- s3 endpoint address
);
Output
 create_s3_retriever 
---------------------
 
(1 row)

Next, run the refresh_retriever function.

SELECT pgai.refresh_retriever('image_embeddings');
Output
 refresh_retriever
-------------------
    
(1 row)

Finally, run the retrieve_via_s3 function with the required parameters to retrieve the top K most relevant (most similar) AI data items. Be aware that the object type is currently limited to image and text files.

SELECT data from pgai.retrieve_via_s3(
    'image_embeddings', -- retriever's name
     1, -- top K
    'torsten', -- S3 bucket name
    'foto.jpg', -- object name
    'https://s3.us-south.cloud-object-storage.appdomain.cloud'
);
Output
data        
--------------------
 {'img_id': 'foto'}
 (1 row)

Could this page be better? Report a problem or suggest an addition!