I'm actively looking for PhD positions!
Here is my statement of purpose.
I am a senior undergraduate student in the School of EECS at Peking University. I am currently working remotely with Prof. Tim Althoff and Prof. Amy Zhang. Last fall, I was a visiting student at the Paul G. Allen School of Computer Science & Engineering at the University of Washington.
I am broadly interested in studying the gap between human behavior and computers. As computers become ever more integrated into our daily lives, it is important to find ways to bridge this gap. I design, build, and study tools that help machines handle and interpret the complexity of human behavior, applying techniques from Human-Computer Interaction, Visualization, Machine Learning, and Software Engineering.
Currently, I am particularly interested in studying the behavior of data scientists. With an exponential increase in available data and a growing culture of "data-driven discovery", analysis tools have enjoyed widespread innovation and adoption. However, robust processes to guide the use of these systems remain in relatively short supply. My work addresses the following problems that data scientists typically encounter when programming:
Mining Collective Data Science Knowledge from Code on the Web to Suggest Alternative Data Analysis Approaches
Mike Merrill, Ge Zhang, Tim Althoff
In submission to WWW 2021
LOP-OCR: A Language-Oriented Pipeline for Large-chunk Text OCR
Zijun Sun*, Ge Zhang*, Junxu Lu, Jiwei Li
Large scale analysis of source code, and in particular scientific source code, holds the promise of better understanding the data science process, identifying analytical best practices, and providing insights to the builders of scientific toolkits. However, large corpora have remained unanalyzed in depth, as descriptive labels are absent and require expert domain knowledge to generate. We propose a novel weakly supervised transformer-based architecture for computing joint representations of code from both abstract syntax trees and surrounding natural language comments. Our model enables us to examine a set of 118,000 Jupyter Notebooks to uncover common data analysis patterns. Focusing on notebooks with relationships to academic articles, we conduct the largest ever study of scientific code and find that notebook composition correlates with the citation count of corresponding papers.
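The two input views the model draws on can be illustrated with Python's standard library alone. This is a minimal sketch, not the paper's architecture: it pairs a function's linearized AST node types with its accompanying natural language (here, a docstring); the example function `normalize` is hypothetical.

```python
import ast

# Illustrative only: extract the two views a joint code representation
# could be computed from -- the abstract syntax tree and the
# surrounding natural language.
cell_source = '''
def normalize(df):
    """Scale each column to zero mean and unit variance."""
    return (df - df.mean()) / df.std()
'''

tree = ast.parse(cell_source)
func = tree.body[0]

# View 1: linearized AST node types (breadth-first walk)
ast_tokens = [type(node).__name__ for node in ast.walk(func)]

# View 2: the surrounding natural language comment/docstring
doc = ast.get_docstring(func)

print(ast_tokens[:4])  # ['FunctionDef', 'arguments', 'Expr', 'Return']
print(doc)
```

At scale, the same extraction runs over every cell of every notebook in the corpus, and the two token streams feed the weakly supervised model.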
Data science projects are based on a series of decision points including data filtering, feature operationalization and selection, model specifications, and parametric assumptions. Importantly, although the robustness of such projects often hinges on their sensitivity to these decisions, prior work on Multiverse Analysis has shown that analysts' exploration is often limited based on their training, domain, and personal experience. This limited exploration of alternative analysis approaches can lead to highly sensitive or irreproducible outcomes and misguided decision-making. However, supporting novice analysts in exploring alternative approaches is challenging and typically requires expert feedback that is costly and hard to scale. Here, we leverage public collective data science knowledge in the form of code submissions to the popular data science platform Kaggle. Specifically, we mine this code repository for small differences between user submissions, which often highlight key decision points and alternative approaches in the respective analyses. We formulate the tasks of identifying decision points and suggesting alternatives as a classification task and a sequence-to-sequence prediction task. We leverage information on relationships within libraries through neural graph representation learning in a multitask learning framework. We demonstrate that our model, MORAY, is able to correctly predict decision points with state-of-the-art results.
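The mining step can be sketched with `difflib`: near-identical submission pairs whose few differing lines expose a single decision point. The submissions, the similarity threshold, and the scaler choice below are all illustrative assumptions, not the paper's actual pipeline.

```python
import difflib

# Hypothetical pair of Kaggle submissions differing in one decision:
# which scaler to apply before fitting the model.
submission_a = [
    "from sklearn.preprocessing import StandardScaler",
    "X = StandardScaler().fit_transform(X)",
    "model.fit(X, y)",
]
submission_b = [
    "from sklearn.preprocessing import MinMaxScaler",
    "X = MinMaxScaler().fit_transform(X)",
    "model.fit(X, y)",
]

# Keep only near-identical pairs (small-diff filter; threshold is
# arbitrary here), then record changed lines as
# (decision point, alternative) pairs.
ratio = difflib.SequenceMatcher(None, submission_a, submission_b).ratio()
changed = [(a, b) for a, b in zip(submission_a, submission_b) if a != b]

if ratio > 0.3:
    for original, alternative in changed:
        print(f"decision point: {original!r} -> {alternative!r}")
```

Pairs mined this way provide weak supervision for the two learning tasks: classifying which lines are decision points, and generating the alternative as a sequence-to-sequence prediction.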
There are two common modes of collaboration in data science. One is pair authoring, where person A does most of the programming and person B suggests which decisions to make. The other is divide and conquer, where collaborators each take on part of the programming work (e.g., A works on data preprocessing while B works on modeling). To better support collaborative cognition in both modes, we designed and developed a Jupyter Notebook extension that helps users split work and branch from the original code. First, we modularize notebooks by adding wrappers around cells so that users can split work by module. For each wrapper, users specify input and output variable types in advance to avoid conflicts in collaborative editing, such as modifying shared variables without making copies. Second, we allow users to branch from the original code by subdividing cells horizontally within each wrapper, where they can explore alternatives in a clear, organized view without writing much explanation. In addition, to support notebooks' role as presentation media, users can manually collapse wrappers or hide branches to tailor the view to their needs. The extension provides easy navigation and clear presentation views for both technical and non-technical audiences, and its modularization and branching features help programmers split work and manage versions strategically during collaboration.
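The wrapper contract can be sketched in plain Python. This is a hypothetical simplification of the extension's idea, not its implementation: each "module" declares its inputs and outputs up front, and only declared outputs flow back into the shared namespace, so collaborators cannot silently clobber each other's variables.

```python
# Hypothetical sketch of the wrapper contract: declared inputs are
# checked before a module runs, and only declared outputs are merged
# back into the shared notebook namespace.
def module(inputs, outputs):
    def wrap(fn):
        def run(namespace):
            missing = [k for k in inputs if k not in namespace]
            if missing:
                raise KeyError(f"{fn.__name__} missing inputs: {missing}")
            result = fn(**{k: namespace[k] for k in inputs})
            # Only declared outputs reach the shared namespace.
            namespace.update({k: result[k] for k in outputs})
            return namespace
        return run
    return wrap

@module(inputs=["raw"], outputs=["clean"])
def preprocess(raw):
    return {"clean": [x for x in raw if x is not None]}

@module(inputs=["clean"], outputs=["total"])
def model(clean):
    return {"total": sum(clean)}

ns = {"raw": [1, None, 2]}
ns = preprocess(ns)
ns = model(ns)
print(ns["total"])  # 3
```

Because each wrapper names what it reads and writes, two collaborators can edit `preprocess` and `model` independently, and branched alternatives inside a wrapper only need to honor the same input/output contract.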