Related Works

Image Captioning. The neural encoder-decoder framework [35, 37] underlies most image captioning models [3, 9, 37, 39, 40]: a CNN encodes the image into a fixed-length vector, and an RNN [13] decodes it by sequentially generating words. To capture fine-grained visual details, attentive image captioning models [3, 23, 40] were proposed to dynamically ground each generated word in the relevant image regions. To reduce the exposure bias and metric mismatch inherent in sequential training [29], notable efforts optimise non-differentiable evaluation metrics directly with reinforcement learning [22, 31, 41]. To further boost accuracy, detected semantic concepts [9, 39, 45] are incorporated into the captioning framework; visual concepts learned from large-scale external datasets also enable models to generate captions with novel objects beyond those in paired image-caption datasets [1, 24]. A more structured representation over concepts, the scene graph [16], has been further explored in image captioning [43, 44], which can take advantage of both detected objects and their relationships. (A minimal code sketch of the encoder-decoder framework is given after the references below.)

References

[3] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086, 2018.
[9] Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, and Li Deng. Semantic compositional networks for visual captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5630–5639, 2017.
[23] Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 375–383, 2017.
[35] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[37] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.
[39] Qi Wu, Chunhua Shen, Lingqiao Liu, Anthony Dick, and Anton van den Hengel. What value do explicit high level concepts have in vision to language problems? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 203–212, 2016.
[40] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, pages 2048–2057, 2015.
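For concreteness, below is a minimal sketch (in PyTorch) of the CNN encoder / RNN decoder framework described above, in the spirit of [35, 37]. All module names, layer sizes, and the toy vocabulary size are illustrative assumptions, not the implementation of any cited paper.

import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # CNN encoder: maps an image to a single fixed-length vector.
        # (A toy stack; the cited works use pretrained classification CNNs.)
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # RNN decoder: generates the caption one word at a time.
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # Prepend the image vector as the first "token" of the sequence,
        # then predict each next word from the preceding ones (teacher forcing).
        img = self.encoder(images).unsqueeze(1)          # (B, 1, E)
        seq = torch.cat([img, self.embed(captions)], 1)  # (B, 1+T, E)
        out, _ = self.rnn(seq)
        return self.proj(out)                            # (B, 1+T, V)

model = CaptionModel(vocab_size=10000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 13, 10000])

At training time one would apply a cross-entropy loss between each position's logits and the next ground-truth word; the reinforcement-learning approaches [22, 31, 41] instead sample whole captions and use the non-differentiable evaluation metric as the reward, while the attention models [3, 23, 40] replace the single fixed-length image vector with a spatial grid of features attended to at every decoding step.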