Screenshot of RecipeSnap.

Annotating Images with S3 and Prodigy


GitHub Code

A couple of years ago I was working on a project where I needed to stream images from S3 to Prodigy for annotation. I love how customizable Prodigy is, so I wrote a recipe to create annotation tasks from images stored in an S3 bucket. The custom recipe worked for the small use case I had at the time, but it didn't scale well.

The initial version of the recipe read each image from S3, hashed it, and then checked the database to see if the image had already been annotated. As more images are annotated, the S3 paginator has to download more and more already-seen images from the bucket before it finds new ones, which gets slow and costly.

So, when I faced another project where I needed to annotate images stored in S3 I addressed a couple of the earlier issues and updated the recipe. You can find the latest version on GitHub.

By default, Prodigy stores the raw image in the Prodigy database. Storing data locally that already exists in S3 seemed redundant (and my MacBook was also running out of storage), so I added a before_db function that replaces the raw image data with the corresponding S3 key before each example is written, saving storage space in the database.

    from typing import Any, Dict, Generator, List, Text

    def before_db(examples: List[Dict[Text, Any]]) -> List[Dict[Text, Any]]:
        '''
        Replaces the raw image data with the S3 key of the image before writing to the Prodigy DB.
        '''
        for eg in examples:
            # The S3 key was stashed in the task's meta by get_stream.
            eg["image"] = eg["meta"]["key"]
        return examples
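
To illustrate the effect, here's what before_db does to a single task (the key and base64 payload below are made-up values, just for illustration):

    task = {
        "image": "data:image/jpg;base64,/9j/4AAQ...",  # large base64 payload (truncated, hypothetical)
        "meta": {"key": "images/0001.jpg"},            # hypothetical S3 key
    }
    task = before_db([task])[0]
    # task["image"] is now "images/0001.jpg" -- only the key is saved to the DB.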

The S3 keys stored in the Prodigy database can then be used to speed up loading annotation tasks in the get_stream function. Instead of downloading an image from S3 and then checking whether it already exists in the Prodigy database, the recipe first checks for the key and only downloads the image if the key isn't there.

    def get_stream() -> Generator:
        # `s3`, `bucket`, `prefix`, and `in_db` are defined in the enclosing
        # stream_from_s3 recipe and captured by this closure.

        # Build paginator for S3 objects.
        paginator = s3.get_paginator('list_objects')
        paginate_params = {
            'Bucket': bucket
        }

        if prefix:
            paginate_params['Prefix'] = prefix

        page_iterator = paginator.paginate(**paginate_params)

        # Iterate through the pages.
        for page in page_iterator:
            # Iterate through items on the page.
            for obj in page['Contents']:
                img_key = obj['Key']

                # Skip the record if equal to the prefix.
                if img_key == f"{prefix}/":
                    continue

                # Skip the record if the key is already in the Prodigy database.
                if in_db and img_key in in_db:
                    continue

                # Read the image from S3.
                _img_bytes = s3.get_object(Bucket=bucket, Key=img_key).get('Body').read()

                # Yield an annotation task: the image as a base64 data URI, with
                # the S3 key stored in the task metadata for before_db to use.
                yield {'image': img_to_b64_uri(_img_bytes, 'image/jpg'), "meta": {"key": img_key}}
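
For reference, in_db is the set of S3 keys already saved for the dataset, built when the recipe starts. A minimal sketch of how it can be constructed with Prodigy's database API (the lookup method is get_dataset in older Prodigy versions and get_dataset_examples in newer ones):

    from prodigy.components.db import connect

    # Connect to the Prodigy database and collect the S3 keys of examples
    # already saved for this dataset. Because before_db replaced the raw
    # image data with the S3 key, eg["image"] holds the key.
    db = connect()
    existing_examples = db.get_dataset(dataset) or []
    in_db = {eg["image"] for eg in existing_examples}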

Now the stream skips already-annotated images without downloading them, so annotation tasks load much faster!
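
The img_to_b64_uri helper simply wraps the raw bytes in a data URI so the Prodigy UI can render the image inline. The version in the repo may differ slightly, but a minimal implementation looks like this:

    import base64

    def img_to_b64_uri(img_bytes: bytes, mimetype: str) -> str:
        '''Encode raw image bytes as a data URI the Prodigy UI can render inline.'''
        encoded = base64.b64encode(img_bytes).decode('utf-8')
        return f"data:{mimetype};base64,{encoded}"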

Using the recipe requires two steps. First, define the Prodigy interface and annotation task at the bottom of the stream_from_s3 function (a sketch of what that looks like follows the parameter list below). Second, run the recipe from the command line with the following parameters:

DATASET: Name of the Prodigy dataset.
BUCKET: Name of the S3 bucket to load images from.
PREFIX: (Optional) Key prefix to limit which objects are loaded from the bucket.
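
The bottom of stream_from_s3 is where the stream, the before_db callback, and the annotation interface get wired together. The exact arguments and interface depend on your task; this sketch assumes the image_manual view and Prodigy's standard recipe-argument annotations:

    import boto3
    import prodigy

    @prodigy.recipe(
        "stream_from_s3",
        dataset=("Name of the Prodigy dataset", "positional", None, str),
        bucket=("Name of the S3 bucket to load images from", "positional", None, str),
        prefix=("Optional prefix to limit the objects loaded", "positional", None, str),
    )
    def stream_from_s3(dataset, bucket, prefix=None):
        s3 = boto3.client("s3")
        # ... in_db lookup, get_stream, and before_db (shown above) are defined here ...

        return {
            "dataset": dataset,         # dataset the annotations are saved to
            "stream": get_stream(),     # generator of annotation tasks
            "view_id": "image_manual",  # assumed interface; adjust to your task
            "before_db": before_db,     # replace raw image data with the S3 key
        }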

Then run the code with the following:

    prodigy stream_from_s3 DATASET BUCKET PREFIX -F s3_loader.py
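
For example, with a hypothetical dataset name, bucket, and prefix:

    prodigy stream_from_s3 image_annotations my-images-bucket raw-images/ -F s3_loader.py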

I hope you find this example helpful. Once again, you can get the full code on GitHub. Follow along for more Prodigy recipes and other projects!