# Image Analysis with InternVL2

This project uses the InternVL2-40B-AWQ model for high-quality image analysis, description, and understanding. It provides a Gradio web interface where users can upload images and receive detailed analyses.
## Features

- **High-Quality Image Analysis**: Uses InternVL2-40B (4-bit quantized) for state-of-the-art image understanding
- **Multiple Analysis Types**: General description, text extraction, chart analysis, people description, and technical analysis
- **Simple UI**: User-friendly Gradio interface for easy image uploading and analysis
- **Efficient Resource Usage**: 4-bit quantized model (AWQ) for reduced memory footprint and faster inference
## Requirements

The application requires:

- Python 3.9+
- CUDA-compatible GPU (24GB+ VRAM recommended)
- Transformers 4.37.2+
- lmdeploy 0.5.3+
- Gradio 3.38.0
- Other dependencies listed in `requirements.txt`
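For reference, a `requirements.txt` covering the versions above might look like the following sketch (the exact pins and any extra dependencies are assumptions; check the repository's actual file):

```
lmdeploy>=0.5.3
transformers>=4.37.2
gradio==3.38.0
torch
pillow
```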
## Setup

### Docker Setup (Recommended)

1. **Build the Docker image**:

   ```
   docker build -t internvl2-image-analysis .
   ```

2. **Run the Docker container**:

   ```
   docker run --gpus all -p 7860:7860 internvl2-image-analysis
   ```
### Local Setup

1. **Create a virtual environment**:

   ```
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

2. **Install dependencies**:

   ```
   pip install -r requirements.txt
   ```

3. **Run the application**:

   ```
   python app_internvl2.py
   ```
## Usage

1. Open your browser and navigate to `http://localhost:7860`
2. Upload an image using the upload box
3. Choose an analysis type from the options
4. Click "Analyze Image" and wait for the results

### Analysis Types

- **General**: Provides a comprehensive description of the image content
- **Text**: Focuses on identifying and extracting text from the image
- **Chart**: Analyzes charts, graphs, and diagrams in detail
- **People**: Describes people in the image: appearance, actions, and expressions
- **Technical**: Provides technical analysis of objects and their relationships
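Internally, an analysis type typically just selects a prompt template before the image is sent to the model. A minimal sketch of that mapping (the function name and prompt wording are hypothetical, not taken from `app_internvl2.py`):

```python
# Hypothetical mapping from analysis type to model prompt;
# the actual prompts in app_internvl2.py may differ.
ANALYSIS_PROMPTS = {
    "General": "Describe this image in detail.",
    "Text": "Identify and extract all text visible in this image.",
    "Chart": "Analyze the charts, graphs, or diagrams in this image in detail.",
    "People": "Describe the people in this image: appearance, actions, and expressions.",
    "Technical": "Give a technical analysis of the objects in this image and their relationships.",
}

def build_prompt(analysis_type: str) -> str:
    """Return the prompt for a given analysis type, falling back to General."""
    return ANALYSIS_PROMPTS.get(analysis_type, ANALYSIS_PROMPTS["General"])
```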
## Testing

To test the model directly from the command line:

```
python test_internvl2.py --image path/to/your/image.jpg --prompt "Describe this image in detail."
```
## Deployment to Hugging Face

To deploy to Hugging Face Spaces:

```
python upload_internvl2_to_hf.py
```
## Model Details

This application uses InternVL2-40B-AWQ, a 4-bit (AWQ) quantized version of InternVL2-40B. The original model consists of:

- **Vision Component**: InternViT-6B-448px-V1-5
- **Language Component**: Nous-Hermes-2-Yi-34B
- **Total Parameters**: ~40B (6B vision + 34B language)
## License

This project is released under the MIT License, the same license as the InternVL2 model.
## Acknowledgements

- [OpenGVLab](https://github.com/OpenGVLab) for creating the InternVL2 models
- [Hugging Face](https://huggingface.co/) for model hosting
- [lmdeploy](https://github.com/InternLM/lmdeploy) for model optimization
- [Gradio](https://gradio.app/) for the web interface