2020-2021 Capstone Project: Data Engineering

Background.


Replicability is necessary to substantiate and increase confidence in existing research conclusions. However, the education and social sciences fields largely lack the data infrastructure needed to reproduce and replicate research studies efficiently. This replication crisis is enabled by a lack of transparency, consistency, and accessibility in data collection and analysis processes. Moreover, education researchers often rely on varied and inefficient tools because of limited training in data practices and limited awareness of effective tools. The volume and variety of data collected in education research contribute to the inconsistency of techniques, software, and conventions employed in the field. The lack of protocols makes large-scale projects and collaboration difficult and creates technology gaps for applied researchers that need to be filled through training. Additionally, education researchers often do not have the background knowledge needed to fully develop technical success measures for the data pipelines associated with their projects. Therefore, this team is tasked with building a platform that serves as infrastructure to address these limitations in education research.

 
Meet the Team

Jordan Machita

Originally from Pennsylvania, Jordan is currently pursuing his Master's in Data Science at the University of Virginia after graduating from the University of Tampa in May 2020 with degrees in Mathematical Programming and Finance. Jordan's internship and work experience are in Cybersecurity and Finance, while his formal research experience is in Graph Theory. Upon completion of the program in May 2021, Jordan plans to apply the knowledge and skills he has developed at the University of Virginia to the financial world.

e: jm8ux@virginia.edu


Taylor Rohrich

Taylor is a student in the Master’s in Data Science program at the University of Virginia. He completed his undergraduate degree in Computer Science at the University of Virginia in 2019. Currently working on a Data Engineering capstone project with the SERA team, he focuses on using cloud resources to develop research data pipelines that are robust and scalable. His GitHub profile is at https://github.com/taylorrohrich.

e: trr2as@virginia.edu


Yiran Zheng

Yusheng Jiang

Scope.


This project includes:

  • Automating aspects of the cleaning and analysis

  • Developing a data tracking dashboard

  • Understanding and replicating the current data cleaning process

  • Proposing and implementing efficiency improvements

  • Creating a cloud-based data store to host the database

This project is centered around the specifications of the TeachSim data. The infrastructure and schema will later be adapted to the SERA data, which is outside the scope of this year’s work. However, we will consider the needs of both projects in order to build a flexible infrastructure. Automation of the loading, cleaning, and pulling of data and the data tracking dashboard are optional components of the project that depend on the team’s progress on the main project goals.

Products


The Data Engineering Capstone team produced the desired cloud-based data store, automated some of the existing processes, and designed a data tracker. The code and user documentation for the capstone team’s project can be found on their GitHub page.