Data infrastructure resources for social science

April 26, 2023

For nearly a decade, NSF's Directorate for Social, Behavioral and Economic Sciences (SBE) has supported the development of data infrastructure that enables data-intensive research in the social sciences. The first program responsible for managing these infrastructure projects was the Building Community and Capacity for Data-Intensive Research in the Social, Behavioral and Economic Sciences and in Education and Human Resources Program (BCC-SBE/EHR). This program was followed by the Resource Implementations for Data Intensive Research in the Social, Behavioral and Economic Sciences Program (RIDIR). Now, the Human Networks and Data Science Infrastructure Program (HNDS-I) addresses the development of data resources and relevant analytic techniques that support fundamental SBE research.

In the spirit of HNDS-I and past programs, we present a list of infrastructure resources created by SBE-supported researchers with the goal of advancing all areas of social, behavioral and economic research.

Aquatic Resource Trade in Species (ARTIS)
The ARTIS database provides global estimates of seafood species and nutrient trade flows from 1996 to 2020.
NSF Award #2121238

Automatically Annotated Repository of Digital Video and Audio Resources Community (AARDVARC)
AARDVARC was an early version of the GORILLA platform, which is an open platform to safely store and share language resources and technologies. It also provides services to transform, annotate and generate linguistic data and analytical resources for specific languages or from primary data that contributors provide.
NSF Award #1519887

Circling the Research Triangle
This database integrates a number of diverse data sources useful for studying the industrial genesis of the region surrounding the Research Triangle in North Carolina.
NSF Award #1439532

Complex Group Interactions (XGI)
XGI is a software library to facilitate the analysis of networks with higher-order interactions and is intended to allow investigators from many disciplines to study how the spread of epidemics and opinions can be modified by simultaneous interactions between multiple people.
NSF Award #2121905
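
As an illustration of what "higher-order interactions" means, here is a minimal sketch in plain Python. It does not use XGI or reproduce its API; the names, the toy data and the helper functions are all hypothetical. Each group interaction is stored as a set of participants (a hyperedge), and the sketch shows what is lost when groups are flattened into ordinary pairwise links.

```python
from itertools import combinations

# Hypothetical toy data: each set is one simultaneous group interaction.
hyperedges = [
    {"ana", "ben", "cy"},          # a three-person meeting
    {"cy", "dee"},                 # a pairwise conversation
    {"ana", "ben", "dee", "eli"},  # a four-person meeting
]


def node_degrees(edges):
    """Count how many group interactions each person takes part in."""
    degrees = {}
    for edge in edges:
        for node in edge:
            degrees[node] = degrees.get(node, 0) + 1
    return degrees


def pairwise_projection(edges):
    """Flatten each group into the pairs it contains.

    The result is an ordinary network; the information about which
    people met together at the same time is discarded.
    """
    pairs = set()
    for edge in edges:
        for a, b in combinations(sorted(edge), 2):
            pairs.add((a, b))
    return pairs


degrees = node_degrees(hyperedges)
pairs = pairwise_projection(hyperedges)
edge_sizes = sorted(len(edge) for edge in hyperedges)

print("interactions per person:", degrees)
print("group sizes:", edge_sizes)
print("pairwise links after flattening:", len(pairs))
```

In the flattened version, a link between two people looks the same whether they met alone or as part of a larger group. Keeping the hyperedges preserves that distinction, which is the kind of structure a library such as XGI is designed to analyze.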

Criminal Justice Administrative Records System (CJARS)
The Criminal Justice Administrative Records System (CJARS) is a nationally integrated longitudinal database linking individuals' criminal history with social, demographic and economic information provided by the U.S. Census.
NSF Award #1925563

cyberSW
The cyberSW platform permits network analysis and ingestion of archaeological data from the American Southwest. It permits users to explore the distribution of artifacts and reconstruct demography and social networks of the pre-Hispanic world.
NSF Award #1738258

DataARC
The DataARC platform is a collection of datasets with an integrated search tool that supports transdisciplinary research on human-environment interactions in the North Atlantic. DataARC currently houses fifteen datasets, covering subjects from the humanities to the environmental sciences and spanning thousands of years and vast geography including Iceland, Greenland, the United Kingdom and Scandinavia.
NSF Award #1637076

eHRAF World Cultures Database
The electronic Human Relations Area Files (eHRAF) World Cultures database contains information on present and past aspects of cultural and social life for a worldwide sample of societies.
NSF Award #2024286

Freedom of Information Archive
The Freedom of Information Archive contains collections of over 3 million declassified documents and tools for textual analysis and visualization. It is the largest body of declassified documents available to anyone outside of the government.
NSF Award #1637159

International Historical Geographic Information System (IHGIS)
The IPUMS International Historical Geographic Information System (IHGIS) provides data tables from population and housing censuses as well as agricultural censuses from around the world, along with corresponding GIS boundary files.
NSF Award #1738369

Longitudinal, Intergenerational Family Electronic Microdata (LIFE-M)
The Longitudinal, Intergenerational Family Electronic Microdata (LIFE-M) links millions of individuals and families living in the late 19th and 20th centuries using vital records and decennial censuses. This combination of records provides a life-course and intergenerational perspective on the evolution of health and economic outcomes. Currently, the LIFE-M data contains records from Ohio and North Carolina.
NSF Award #1539228

National Environmental Policy Act database
The National Environmental Policy Act database is a knowledge-discovery platform for finding and analyzing decades of applied science and records of public participation in U.S. environmental decision-making processes. The database includes all draft and final Environmental Impact Statements and related public documents released by federal agencies between 2012 and 2022.
NSF Award #1831551

NeuLaw Criminal Record Database (NCRD)
The NeuLaw Criminal Record Database (NCRD) is a collection of tens of millions of crime records. It contains information about individual offenders, their crimes and their interactions with the criminal justice system for Harris County, TX, New York City, NY, and Miami-Dade, FL. The NCRD is part of the SciLaw Criminal Record Database.
NSF Award #1439453

PolicyFlow
PolicyFlow is an open-source visual analytic toolkit that uses the SPID database to explore time-evolving patterns of policy adoption.
NSF Award #1637067

The Python Spatial Analysis Library for Open Source, Cross Platform Geospatial Data Science
This is an open-source project to support spatial data science. It includes Python libraries and tools for exploratory analysis, modeling and visualization.
NSF Award #1831615

State Policy Innovation and Diffusion (SPID)
The State Policy Innovation and Diffusion (SPID) database includes the year of adoption of hundreds of policies that diffused across the United States, along with information about the policies.
NSF Award #1636668

Sub-National Data Archive System for Social and Behavioral Data (SUNGEO)
SUNGEO is a data repository that integrates multiple sources of sub-national data at multiple spatio-temporal scales and a suite of methods for data processing and analysis.
NSF Award #1925693

Synthesizing Knowledge of Past Environments (SKOPE)
Synthesizing Knowledge of Past Environments (SKOPE) is a data platform containing long-term, high-resolution paleoenvironmental data in the American Southwest reaching back more than 2000 years. The web application allows users to discover, explore, visualize and synthesize knowledge of environments in the recent or remote past.
NSF Award #1637189

TalkBank
TalkBank is the world's largest open-access collection of transcript and multimedia data on spoken language. Data in TalkBank have been contributed by hundreds of researchers working in over 34 languages internationally who are committed to principles of open data sharing.
NSF Award #1539129

UTD Event Data
This site provides data on political and social events occurring around the globe, with historical coverage and from multiple language sources. The data are available within hours of the events' occurrence and are accompanied by tools to analyze them.
NSF Award #1539302