About me

I am a joint PhD student at Microsoft Research Asia and Xi'an Jiaotong University, supervised by Dongmei Zhang, Shi Han, Yanlin Wang, and Prof. Hongbin Sun.

I mainly focus on code intelligence (the intersection of Software Engineering and Artificial Intelligence), which leverages artificial intelligence approaches to analyze and model source code and its related artifacts. Specifically, it uses machine learning models to mine knowledge from large-scale open source code data ("Big Code") available on GitHub and similar platforms, learns better code representations (based on code tokens, ASTs, PDGs, IRs, etc.), and applies these representations to downstream tasks such as code summarization, code search, clone detection, code completion, and program repair.

My research areas currently include: (1) Code Representation Learning; (2) Code Summarization; (3) Code Search; (4) Commit Message Generation.

Publications

2023

CoCoAST: Representing Source Code via Hierarchical Splitting and Reconstruction of Abstract Syntax Trees
EMSE (CCF-B) [pdf] [code]
Ensheng Shi, Yanlin Wang, Lun Du, Hongyu Zhang, Shi Han, Dongmei Zhang, Hongbin Sun
TLNR: We propose a novel model, CoCoAST, that hierarchically splits and reconstructs ASTs to comprehensively capture the syntactic and semantic information of code without losing AST structural information. We have applied our source code representation to two common program comprehension tasks: code summarization and code search.

You Augment Me: Exploring ChatGPT-based Data Augmentation for Semantic Code Search
ICSME2023 (CCF-B) [pdf] [code]
Yanlin Wang, Lianghong Guo, Ensheng Shi*, Wenqing Chen, Jiachi Chen, Wanjun Zhong, Menghan Wang, Hui Li, Ziyu Lyu, Hongyu Zhang, Zibin Zheng
TLNR: We propose a novel approach ChatDANCE, which utilizes high-quality and diverse augmented data generated by a large language model and leverages a filtering mechanism to eliminate low-quality augmentations.

Towards Efficient Fine-tuning of Pre-trained Code Models: An Experimental Study and Beyond
ISSTA2023 (CCF-A) [pdf] [code]
Ensheng Shi, Yanlin Wang, Hongyu Zhang, Lun Du, Shi Han, Dongmei Zhang, Hongbin Sun
TLNR: We conduct an extensive experimental study to explore what happens to layer-wise pre-trained representations and their encoded code knowledge during fine-tuning. We then propose efficient alternatives to fine-tune large pre-trained code models based on the findings below.

  • Lexical, syntactic, and structural properties of source code are encoded in the lower, intermediate, and higher layers, respectively, while the semantic property spans the entire model.
  • The fine-tuning process preserves most of the code properties: the basic code properties captured by the lower and intermediate layers are still preserved during fine-tuning. Furthermore, we find that the representations of the top two layers change the most during fine-tuning across various downstream tasks.
  • Based on the above findings, we propose Telly to efficiently fine-tune pre-trained code models via layer freezing, as sketched below.
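
A minimal sketch of the layer-freezing idea, assuming a RoBERTa-style pre-trained code model loaded with HuggingFace Transformers; the checkpoint name and the number of frozen layers are illustrative assumptions, not the exact Telly configuration:

import torch
from transformers import AutoModel

# Illustrative checkpoint; any 12-layer RoBERTa-style code model works here.
model = AutoModel.from_pretrained("microsoft/codebert-base")

FROZEN_LAYERS = 10  # assumption: freeze the bottom 10 of 12 encoder layers

# Freeze the embeddings and the lower encoder layers, which capture the basic
# code properties; only the top layers (whose representations change most
# during fine-tuning) remain trainable.
for param in model.embeddings.parameters():
    param.requires_grad = False
for layer in model.encoder.layer[:FROZEN_LAYERS]:
    for param in layer.parameters():
        param.requires_grad = False

# Only the remaining trainable parameters go to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5)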

CoCoSoDa: Effective Contrastive Learning for Code Search
ICSE2023 (CCF-A) [pdf] [code]
Ensheng Shi, Wenchao Gu, Yanlin Wang, Lun Du, Hongyu Zhang, Shi Han, Dongmei Zhang, Hongbin Sun
TLNR: We propose CoCoSoDa to effectively utilize contrastive learning for code search via two key factors: data augmentation and negative samples (see the sketch after the list below).

  • CoCoSoDa outperforms 14 baselines, and in particular exceeds CodeBERT, GraphCodeBERT, and UniXcoder by 13.3%, 10.5%, and 5.9% in average MRR, respectively.
  • The ablation studies show the effectiveness of each component of our approach.
  • We adapt our techniques to several different pre-trained models such as RoBERTa, CodeBERT, and GraphCodeBERT and observe a significant boost in their performance in code search.
  • Our model performs robustly under different hyperparameters. Furthermore, we perform qualitative and quantitative analyses to explore the reasons behind its good performance.
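
A minimal sketch of an InfoNCE-style contrastive objective with in-batch negatives, as commonly used for code search; this is a generic illustration of the contrastive setup, not CoCoSoDa's exact model or augmentation strategy:

import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, code_emb, temperature=0.05):
    """query_emb, code_emb: (batch, dim) embeddings of paired queries and code.
    Each query's positive is its paired code snippet; the other snippets in
    the batch serve as negatives."""
    query_emb = F.normalize(query_emb, dim=-1)
    code_emb = F.normalize(code_emb, dim=-1)
    logits = query_emb @ code_emb.t() / temperature  # (batch, batch) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)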

2022

RACE: Retrieval-augmented Commit Message Generation
EMNLP2022 (CCF-B) [pdf] [code]
Ensheng Shi, Yanlin Wang, Wei Tao, Lun Du, Hongyu Zhang, Shi Han, Dongmei Zhang, Hongbin Sun
TLNR: We propose a new retrieval-augmented neural commit message generation method, which treats the retrieved similar commit as an exemplar and leverages it to generate an accurate commit message.

A Large-Scale Empirical Study of Commit Message Generation: Models, Datasets and Evaluation
EMSE2022 (CCF-B) [pdf] [code]
Wei Tao, Yanlin Wang, Ensheng Shi, Lun Du, Shi Han, Hongyu Zhang, Dongmei Zhang, Wenqiang Zhang
TLNR: To better understand how existing approaches perform on commit message generation, this paper conducts a systematic and in-depth analysis of state-of-the-art models and datasets.

On the Evaluation of Neural Code Summarization
ICSE2022 (CCF-A) [pdf] [code] [blog]
Ensheng Shi, Yanlin Wang, Lun Du, Junjie Chen, Shi Han, Hongyu Zhang, Dongmei Zhang, Hongbin Sun
TLNR: We report several interesting and surprising findings on evaluation metrics, code pre-processing, and datasets, build a shared code summarization toolbox, and give actionable suggestions on the evaluation of neural code summarization.

  • The BLEU metric widely used in existing work to evaluate code summarization models has many variants. Ignoring the differences among these variants could greatly affect the validity of the claimed results.
  • Among the six widely used BLEU variants, BLEU_DC (sentence-level BLEU with smoothing method 4) correlates best with human perception when evaluating neural code summarization models (see the sketch after this list).
  • Code pre-processing choices can have a large (from -18% to +25%) impact on the summarization performance and should not be neglected.
  • Performing identifier splitting is always significantly better than not performing it.
  • Some important characteristics of datasets (corpus sizes, data splitting methods, and duplication ratios) have a significant impact on model evaluation.
  • Based on the experimental results, we give actionable suggestions for evaluating code summarization and choosing the best method in different scenarios. We also build a shared code summarization toolbox to facilitate future research.
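
For reference, a minimal sketch of computing BLEU_DC (sentence-level BLEU with smoothing method 4) using NLTK; the example summaries are illustrative only:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Illustrative reference/candidate summaries, tokenized by whitespace.
reference = "return the sum of two integers".split()
candidate = "returns the sum of two numbers".split()

smoother = SmoothingFunction().method4
score = sentence_bleu([reference], candidate, smoothing_function=smoother)
print(f"BLEU_DC: {score:.4f}")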

2021

CAST: Enhancing Code Summarization with Hierarchical Splitting and Reconstruction of Abstract Syntax Trees
EMNLP2021 (CCF-B) [pdf] [code]
Ensheng Shi, Yanlin Wang, Lun Du, Hongyu Zhang, Shi Han, Dongmei Zhang, Hongbin Sun
TLNR: Our model hierarchically splits and reconstructs ASTs to obtain better code representations for code summarization.

Is a Single Model Enough? MuCoS: A Multi-Model Ensemble Learning Approach for Semantic Code Search
CIKM 2021 (CCF-B) [pdf] [code]
Lun Du, Xiaozhou Shi, Yanlin Wang, Ensheng Shi, Shi Han, Dongmei Zhang
TLNR: We ensemble three models that separately focus on code structure, local variables, and API invocation information for semantic code search.

On the Evaluation of Commit Message Generation Models: An Experimental Study
ICSME 2021 (CCF-B) [pdf] [code]
Wei Tao, Yanlin Wang, Ensheng Shi, Lun Du, Shi Han, Dongmei Zhang, Wenqiang Zhang
TLNR: We conduct an empirical study on evaluation metrics and existing datasets. We also collect MCMD, a large-scale, information-rich, and multi-language commit message dataset, and evaluate existing models on it.

2020

CoCoGUM: Contextual Code Summarization with Multi-Relational GNN on UMLs
MSR-TR 2020 [pdf]
Yanlin Wang, Lun Du, Ensheng Shi, Yuxuan Hu, Shi Han, Dongmei Zhang
TLNR: We explore modeling two global contexts: intra-class level context and inter-class level context for code summarization.

Education

2019.8 ~ Present: Xi'an Jiaotong University

  • MSRA-XJTU Joint PhD

  • The College of Artificial Intelligence

2015.8 ~ 2019.7: Xi'an Jiaotong University

  • Outstanding Graduate

  • Automation Science and Technology

Experience

  • Research Intern at Microsoft Research Asia
    Advised by Dongmei Zhang, Shi Han, & Yanlin Wang in the Data, Knowledge, and Intelligence group, from Jun 2020 to Present.
  • Research Intern at Microsoft Research Asia
    Advised by Dongmei Zhang, Shi Han, Zhouyu Fu, & Mengyu Zhou in the Software Analytics group, from Nov 2018 to Aug 2019.

Awards

  • 2023 Outstanding Doctoral Graduate Student (The Highest Honor Awarded to Doctoral Students by Xi'an Jiaotong University)
  • 2023&2022&2021 Excellent Graduate Student
  • 2023&2018 National Scholarship
  • 2019 The Third Prize of Asia and Pacific Mathematical Contest
  • 2019 Outstanding Graduate Award
  • 2019 "Star of Tomorrow" Award by MSRA
  • 2018 Elite Class Scholarship of the Institute of Automation, Chinese Academy of Sciences
  • 2018 Grateful Scientist Bursary
  • 2017&2016 National Encouragement Scholarship
  • 2016 First Prize of Mathematical Modeling Contest at Provincial Level
  • 2017&2016 Award for Perseverance and Diligence
  • 2015 The Third Prize of Mathematical Modeling Contest at University Level