Image Captioning System Using Artificial Intelligence
Keywords: artificial intelligence, machine learning, neural networks, image captioning
People who are blind or visually impaired often struggle to engage fully with the world around them because they cannot perceive the visual information that sighted people take for granted. As a result, they frequently depend on human assistance to navigate their environment and access information. In this project, we apply image captioning techniques to address this problem. By leveraging a large dataset and machine learning algorithms, we aim to convert captured and stored images into text and speech that blind individuals can easily understand.

Our project is inspired by recent advances in multimodal neural networks, which have been used successfully in image captioning systems. Specifically, we combine Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to translate images into text: the CNN performs visual feature extraction, while the RNN is trained on image-sentence ground-truth pairs to generate captions.

One of the main challenges we face is the language barrier. While there have been numerous studies on image captioning for a single target language, we aim to develop a system that generates captions in multiple languages. To achieve this, we use the googletrans library, a Python wrapper around the Google Translate service.

Our project follows the AI Essentials framework for designing AI products, as well as the Scrum methodology for managing the software development lifecycle. We begin with the 8,000 images of the Flickr8k dataset, using Tesseract OCR to extract text from any images that contain it. A pre-trained CNN then serves as a feature extractor, and its features are fed into an LSTM that generates captions. These captions are translated into multiple languages using googletrans, and finally converted to speech using the gTTS library.
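The caption-generation step described above can be sketched as a greedy decoding loop: the image features are computed once, then the decoder predicts one word at a time until it emits an end token. The snippet below is a minimal, runnable illustration of that control flow only; `extract_features` and `next_word` are hypothetical toy stand-ins for the pre-trained CNN and the trained LSTM, which a real system would supply.

```python
# Minimal sketch of the captioning pipeline's control flow.
# NOTE: extract_features and next_word are toy placeholders, not the
# actual pre-trained CNN / trained LSTM described in this project.

START, END = "<start>", "<end>"

def extract_features(image_pixels):
    """Stand-in for the CNN: maps an image to a fixed-length feature vector.
    A real extractor would take the activations of a pre-trained CNN's
    penultimate layer."""
    return [sum(image_pixels) % 7, len(image_pixels) % 5]

def next_word(features, partial_caption):
    """Stand-in for the LSTM decoder: predicts the next token given the
    image features and the words generated so far. Here it simply replays
    a canned caption to keep the example self-contained."""
    canned = ["a", "dog", "runs", END]
    step = len(partial_caption) - 1  # exclude the <start> token
    return canned[min(step, len(canned) - 1)]

def generate_caption(features, max_len=20):
    """Greedy decoding: repeatedly append the most likely word until the
    decoder emits <end> or the caption reaches max_len tokens."""
    caption = [START]
    while len(caption) < max_len:
        word = next_word(features, caption)
        if word == END:
            break
        caption.append(word)
    return " ".join(caption[1:])  # drop the <start> token

features = extract_features([3, 1, 4, 1, 5])
print(generate_caption(features))  # a dog runs
```

In the full system, the string returned by `generate_caption` would then be passed to googletrans for translation and to gTTS for speech synthesis.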
Overall, our project aims to improve the lives of blind and visually impaired individuals by providing them with a more accurate and comprehensive understanding of their surroundings. By leveraging machine learning and neural networks, we hope to develop a system that is capable of generating accurate and useful captions in multiple languages, thereby bridging the language barrier and making it easier for blind individuals to access information.