Data Engineering and Management

Led by Anandita Krishnamachari, Alexis Prijoles, and Brian Wright.

Researchers in the social and behavioral sciences are often not trained in computer science or in data collection and processing. Providing training materials and supports is therefore important, especially in interdisciplinary replication studies, where we need to balance replication as a research design with feasibility and content-area expertise. These researchers also face many software challenges because existing infrastructure does not meet their diverse needs.

To address these challenges, our work on the replication platform includes two major components: researcher supports and backend infrastructure. In both cases, we draw on branches of data science and computer science that help ensure high-quality data collection, processing, and analysis.

[Figure: SERA data life-cycle]

Researcher Supports

It is important to consider data management at every point of the data life-cycle. An illustration of the data life-cycle is provided above. Although the diagram suggests a linear process, in practice the stages may overlap or repeat.

To support researchers in making these considerations and implementing data science best practices, we are creating a Data Management Protocol. The protocol will provide an overview of the data management principles and practices that promote transparency and replicability, along with helpful templates and links.

Backend Infrastructure

Software challenges arise because of the lack of cyber-infrastructure to support social and behavioral science research from end to end. Researchers often turn to homegrown (internally built) systems, subscription-based or licensed products (e.g., Qualtrics or Box), and custom vendor-developed programs and applications. These approaches raise concerns about data security, limit customization and flexibility across contexts and research designs, and introduce inefficiencies and high costs.

In collaboration with the SDS team, we are exploring software integration and data storage options to define the data infrastructure needs of replication efforts more clearly and to create scalable solutions.