Input image preprocessing
●Same as the CLIP image encoder
●224 x 224 resolution
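
A minimal sketch of what "same as CLIP" preprocessing typically involves, assuming the publicly released CLIP pipeline (resize the shorter side to 224, center-crop to 224 x 224, then normalize each RGB channel). The slide does not confirm these details; the function names below are hypothetical, and the actual pixel resampling (bicubic in CLIP) is left to an image library.

```python
# Normalization constants from the public CLIP release (RGB order).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)
TARGET = 224  # resolution stated on the slide


def resize_dims(width, height, target=TARGET):
    """Return (width, height) after scaling so the shorter side equals target."""
    scale = target / min(width, height)
    return max(target, round(width * scale)), max(target, round(height * scale))


def center_crop_box(width, height, target=TARGET):
    """Return (left, top, right, bottom) of a centered target x target crop."""
    left = (width - target) // 2
    top = (height - target) // 2
    return left, top, left + target, top + target


def normalize_pixel(rgb):
    """Map an 8-bit RGB pixel to per-channel normalized floats."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))
```

For a 640 x 480 input, the image would first be resized so that its height is 224, and the crop box then selects the central 224 columns.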
Others
●No [cls] token in CLIP image encoder
●(32K-1) tokens (text) for output
●No [eos] token (a [sep] token is used instead)
●We shuffle the labels of each image during training (?)
●The global batch size is 512
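
A rough sketch of how the label shuffling and the [sep] convention could fit together when building a training target. This is an illustration only: the helper name `build_target` is hypothetical, and placing one [sep] after every label is an assumption, since the slide only says that [sep] is used in place of [eos].

```python
import random

SEP = "[sep]"  # used in place of [eos], per the slide


def build_target(labels, rng):
    """Return a target string with the labels in a freshly shuffled order."""
    shuffled = list(labels)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)    # a new random order on every call
    # Assumption: each label is terminated by [sep]; no [eos] is appended.
    return " ".join(f"{label} {SEP}" for label in shuffled)


# Usage: call once per image per training step so the label order varies.
rng = random.Random(0)
target = build_target(["cat", "dog", "grass"], rng)
```

Shuffling per sample keeps the model from learning an arbitrary label ordering, which is the usual motivation for this kind of augmentation.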