2020-2021 Capstone Project: Data Engineering
Background.
Replicability is necessary to substantiate and increase confidence in existing research conclusions. However, the education and social science fields largely lack the data infrastructure needed to reproduce and replicate research studies efficiently. The resulting replication crisis is enabled by a lack of transparency, consistency, and accessibility in data collection and analysis processes. Moreover, education researchers often use varied and inefficient tools because of limited training in data practices and limited awareness of effective tools. The volume and variety of data collected in education research contribute to the inconsistency of techniques, software, and conventions employed in the field. The lack of protocols makes large-scale projects and collaboration difficult and creates technology gaps for applied researchers that must be filled through training. Additionally, education researchers often do not have the background knowledge needed to fully develop technical success measures for the data pipelines associated with their projects. Therefore, this team is tasked with creating a platform that serves as infrastructure to address these limitations in education research.
Scope.
This project includes:
Automating aspects of the data cleaning and analysis
Designing a data tracking dashboard
Understanding and replicating the current data cleaning process
Proposing and implementing efficiency improvements
Creating a cloud-based data store to host the database
This project is centered on the specifications of the TeachSim data. The infrastructure and schema will later be adapted to the SERA data, which is outside the scope of this year’s work. However, we will consider the needs of both projects in order to build a flexible infrastructure. Automation of the loading, cleaning, and pulling of data and the data tracking dashboard are optional components of the project that depend on the team’s progress on the main project goals.
Products.
The Data Engineering Capstone team produced the desired cloud-based data store, automated some of the existing processes, and designed a data tracker. The code and user documentation for the capstone team’s project can be found on their GitHub page, here.