This is a summarized version of my original article published on Process Point Technology’s blog.
In the world of AI solutions for process industries, creating custom models for specialized client problems often requires extensive datasets with detailed annotations. However, manual annotation is resource-intensive, and data confidentiality concerns often prevent the use of public annotation tools. This is where Vision Language Models (VLMs) come in as a promising solution.
Our Approach
We explored whether VLMs could effectively analyze and annotate specialized industrial imagery using natural language prompts, focusing on Personal Protective Equipment (PPE) analysis. Our investigation involved two distinct phases:
Phase 1: Ollama-based Implementation
- Utilized LLaVA-13B and Llama 3.2 Vision models
- Focused on rapid prototyping and quick experimentation
- Demonstrated strong performance on simple annotation tasks
- Processing time: 15-20 minutes for 185 images
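To make the Phase 1 workflow concrete, here is a minimal sketch of how a PPE annotation prompt can be sent to a locally served LLaVA model through the Ollama Python client. The prompt wording, image path, and expected JSON fields are illustrative assumptions, not the exact prompts used in the original experiments.

```python
# Minimal sketch of the Ollama-based annotation loop (assumes a local Ollama
# server with the llava:13b model already pulled via `ollama pull llava:13b`).
# The prompt and image path below are illustrative, not the study's exact setup.
import ollama

PPE_PROMPT = (
    "You are annotating industrial safety images. For each person visible, "
    "state whether they are wearing a hard hat, a safety vest, and gloves. "
    "Respond as a JSON list with one object per person."
)

def annotate_image(image_path: str, model: str = "llava:13b") -> str:
    """Send one image plus the PPE prompt to a local VLM and return its reply."""
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": PPE_PROMPT, "images": [image_path]}],
    )
    return response["message"]["content"]

if __name__ == "__main__":
    # Hypothetical file name; point this at your own dataset.
    print(annotate_image("images/site_worker_001.jpg"))
```

Because the model answers in free text, the returned JSON still needs a quick validation pass before it is stored as annotations; a simple schema check catches most malformed replies.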
Phase 2: Transformers-based Implementation
- Employed Ovis1.6-Gemma2-9B and Qwen2-VL-9B models
- Offered higher accuracy but slower processing
- Better suited for complex annotation scenarios
- Processing time: 900-1200 minutes for 185 images
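For the Phase 2 path, the sketch below shows a typical Hugging Face Transformers inference loop for a Qwen2-VL checkpoint. The checkpoint name (Qwen/Qwen2-VL-7B-Instruct), prompt, and image path are assumptions for illustration and may differ from the exact models and prompts used in the study, which are described in the full article.

```python
# Minimal sketch of Transformers-based annotation with a Qwen2-VL checkpoint.
# Assumes `pip install transformers qwen-vl-utils` and a GPU with enough memory;
# the checkpoint, prompt, and image path are illustrative, not the study's setup.
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "images/site_worker_001.jpg"},
        {"type": "text", "text": "Describe the PPE worn by each person as JSON."},
    ],
}]

# Build the chat-formatted prompt and pack the image into model tensors.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate, strip the prompt tokens from the output, and decode the reply.
output_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```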
Key Findings
Our experimental evaluation revealed several important insights:
- All models achieved their highest detection accuracy on simple images
- Performance dropped noticeably as image complexity increased
- LLaVA-13B delivered the strongest results on simple-image detection
- Ovis1.6 handled moderately complex images better than the other models
Practical Implications
The study demonstrated that VLMs can significantly reduce manual annotation effort, though model selection should be matched to the use case: Ollama-served models for fast passes over simple imagery, and Transformers-based models where accuracy on complex scenes outweighs throughput. A hybrid approach that combines VLM pre-annotation with human verification may be optimal for industrial applications.
This is a summary of my detailed technical analysis originally published here. The original article includes complete experimental details, code implementations, and detailed performance metrics.