Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding