The emergence of Large Language Models (LLMs) like Google's Gemini or OpenAI's ChatGPT has enabled image analysis at unprecedented speed and scale. So what if we asked an LLM to look at a picture and determine the weight of a trash bag on a scale?
On our platform you can donate to tangible outcomes for nature, for example removing 10 kilograms of ocean-bound plastic from a river. The partner that receives your donation gets to work removing those 10 kg of plastic and shares the results with you via our platform.
Project partners run cleanups that are much bigger than just the 10 kg you donated. As a platform, Sumthing aggregates the individual donations into larger orders of a few tons of plastic, which match the operational reality of the project partner. When the cleanup starts, it's impractical for the project partners to fill each bag of trash to precisely 10 kilograms. Instead, they send us images of the collected bags, whose weights vary from 10 up to 50 kilograms. The challenge is then to automatically allocate the correct number of donations to the weight of each bag: a 30-kilogram bag, for instance, equates to three donations. This is where AI automation comes in.
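The allocation itself is simple arithmetic. Here is a minimal sketch; the function name and the rounding-down rule are assumptions for illustration, not our production logic:

```python
def donations_for_bag(weight_kg: float, unit_kg: float = 10.0) -> int:
    """Number of 10 kg donations a bag covers, rounding down.

    Hypothetical helper: how leftover kilograms are handled in
    practice may differ from simply rounding down.
    """
    if weight_kg < unit_kg:
        raise ValueError("bag is lighter than one donation unit")
    return int(weight_kg // unit_kg)

# A 30 kg bag equates to three donations.
assert donations_for_bag(30) == 3
```

The hard part, of course, is getting a reliable `weight_kg` out of a photo in the first place.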
My exploration started by testing a few models with a simple question: "What is happening in this image?" I was amazed by the accuracy of the answers from some of the models out there.
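For context, such a call looks roughly like the sketch below, here using OpenAI's Python SDK. The model name and file path are examples, not the exact setup we used:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode a local photo of a weighed trash bag as base64.
with open("trash_bag.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # example; any vision-capable model works similarly
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is happening in this image?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```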
However, when asked about a trash bag's weight, the LLMs often started to hallucinate a number, which made the approach unreliable at first.
Considering the potential for noise in the images, I refined my approach. What if, instead of analyzing the entire image, we asked the LLM to zoom in on just the scale? This significantly improved accuracy, to the point where I started feeling confident the approach could work.
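The "zoom in" can be done purely in the prompt, by directing the model to the scale's display and giving it an explicit way out when it can't read it. The exact wording below is an assumption, not our production prompt:

```python
# Hypothetical focused prompt: look only at the scale, answer with a
# bare number, and allow an honest "unknown" instead of a guess.
FOCUSED_PROMPT = (
    "Look only at the scale in this photo. "
    "What weight does its display show, in kilograms? "
    "Reply with just the number, or 'unknown' if the display "
    "is not clearly readable."
)
```

The "unknown" escape hatch matters: it turns hallucinated guesses into cases we can catch and handle, as described below.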
In my quest to achieve at least 80% accuracy with the model, I explored several factors affecting accuracy.
In summary, our current approach achieves about 90% accuracy in determining weight. The remaining 10% is classified as unknown or is occasionally read incorrectly. Below are some examples. Can you guess why?
All images undergo a human review after the automated categorization, pushing accuracy to 99.99%. Yes, humans make mistakes too.
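Tying it together, the post-processing can be as simple as validating the model's reply and routing anything doubtful to a reviewer. A minimal sketch, where the plausible-weight bounds and function name are assumptions:

```python
from typing import Optional

MIN_KG, MAX_KG = 10.0, 50.0  # plausible bag weights per our project partners

def parse_weight(reply: str) -> Optional[float]:
    """Parse the LLM's reply into a weight, or None for human review.

    Hypothetical post-processing step: anything the model flags as
    'unknown', fails to parse, or falls outside the plausible range
    goes to a human reviewer instead of being auto-allocated.
    """
    text = reply.strip().lower().rstrip(".")
    if text == "unknown":
        return None
    try:
        weight = float(text.replace("kg", "").strip())
    except ValueError:
        return None
    return weight if MIN_KG <= weight <= MAX_KG else None

assert parse_weight("30") == 30.0
assert parse_weight("unknown") is None
assert parse_weight("7") is None  # below a single donation unit
```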
But why not train our own model to read the scale, which would likely increase accuracy? It seems feasible, especially after reading a promising article on how to do so. The challenge lies in the specificity required for each scale: since we currently use three different scales, the model would need to be tailored to each one individually. Using an LLM offers a more flexible solution, as it isn't affected by the scale's size or shape, making it broadly applicable.
For more details, or if you're interested in contributing to our next project, please drop us a line at hello@sumthing.org.