The emergence of Large Language Models (LLMs) like Google's Gemini or OpenAI's ChatGPT has enabled image analysis at unprecedented speed and scale. So what if we asked an LLM to look at a picture and determine the weight of a trash bag on a scale?
On our platform you can donate to tangible outcomes for nature, for example removing 10 kilograms of ocean-bound plastic from a river. The partner that receives your donation gets to work removing those 10 kg of plastic and shares the results with you via our platform.
Project partners run cleanups that are much bigger than just the 10 kg you donated. As a platform, Sumthing aggregates the individual donations into larger orders of a few tons of plastic, which match the operational reality of the project partner. When the cleanup starts, it's impractical for the project partners to fill each bag of trash to precisely 10 kilograms. Instead, they send us images of the collected bags, whose weights vary from 10 up to 50 kilograms. The challenge is then to automatically allocate the correct number of donations to the weight of each bag: a 30-kilogram bag, for instance, equates to three donations. This is where AI automation comes in.
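The allocation itself is simple arithmetic. Here is a minimal sketch; the function name and the rounding-down rule are assumptions for illustration, not our production logic:

```python
def donations_for_bag(weight_kg: float, unit_kg: float = 10.0) -> int:
    """Number of 10 kg donations a bag covers, rounding down.

    Hypothetical helper: how leftover kilograms are handled in
    practice may differ from simply rounding down.
    """
    if weight_kg < unit_kg:
        raise ValueError("bag is lighter than one donation unit")
    return int(weight_kg // unit_kg)

# A 30 kg bag equates to three donations.
assert donations_for_bag(30) == 3
```

The hard part, of course, is getting a reliable `weight_kg` out of a photo in the first place.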
My exploration started by testing a few models with a simple question: "What is happening in this image?" I was amazed by the accuracy of the answers from some of the models out there.
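For context, such a call looks roughly like the sketch below, here using OpenAI's Python SDK. The model name and file path are examples, not the exact setup we used:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode a local photo of a weighed trash bag as base64.
with open("trash_bag.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # example; any vision-capable model works similarly
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is happening in this image?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```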
However, when asked about a trash bag's weight, the LLMs often started to hallucinate a number, which made the approach unreliable at first.
Considering the potential for noise in the images, I refined my approach. What if, instead of analyzing the entire image, we asked the LLM to zoom in on just the scale? This significantly improved accuracy, to the point where I started feeling confident the approach could work.
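The "zoom in" can be done purely in the prompt, by directing the model to the scale's display and giving it an explicit way out when it can't read it. The exact wording below is an assumption, not our production prompt:

```python
# Hypothetical focused prompt: look only at the scale, answer with a
# bare number, and allow an honest "unknown" instead of a guess.
FOCUSED_PROMPT = (
    "Look only at the scale in this photo. "
    "What weight does its display show, in kilograms? "
    "Reply with just the number, or 'unknown' if the display "
    "is not clearly readable."
)
```

The "unknown" escape hatch matters: it turns hallucinated guesses into cases we can catch and handle, as described below.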
In my quest to achieve at least 80% accuracy with the model, I explored several factors affecting accuracy.
In summary, our current approach achieves about 90% accuracy in determining weight. The remaining 10% is classified as unknown or is occasionally read incorrectly. Below are some examples. Can you guess why?
All images undergo a human review after the automated categorization, pushing accuracy to 99.99%. Yes, humans make mistakes too.
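Tying it together, the post-processing can be as simple as validating the model's reply and routing anything doubtful to a reviewer. A minimal sketch, where the plausible-weight bounds and function name are assumptions:

```python
from typing import Optional

MIN_KG, MAX_KG = 10.0, 50.0  # plausible bag weights per our project partners

def parse_weight(reply: str) -> Optional[float]:
    """Parse the LLM's reply into a weight, or None for human review.

    Hypothetical post-processing step: anything the model flags as
    'unknown', fails to parse, or falls outside the plausible range
    goes to a human reviewer instead of being auto-allocated.
    """
    text = reply.strip().lower().rstrip(".")
    if text == "unknown":
        return None
    try:
        weight = float(text.replace("kg", "").strip())
    except ValueError:
        return None
    return weight if MIN_KG <= weight <= MAX_KG else None

assert parse_weight("30") == 30.0
assert parse_weight("unknown") is None
assert parse_weight("7") is None  # below a single donation unit
```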
But why not train our own model to read the scale, which would likely increase accuracy? It seems feasible, especially after reading a promising article on how to do so. The challenge lies in the specificity required for each scale: since we currently use three different scales, the model would need to be tailored to each one individually. Using an LLM offers a more flexible solution, as it isn't affected by the scale's size or shape, making it broadly applicable.
For more details, or if you're interested in contributing to our next project, please drop us a line at hello@sumthing.org.