This repository contains an R script that cleans a developer survey dataset and applies statistical sampling methods to it. The analysis focuses on survey responses about online learning resources.
- Dataset Overview
- Cleaning the Dataset
- Defining Study Variables
- Population Parameters
- Sampling Methods
- Results
The dataset used for this analysis is named survey_results_public.csv and consists of responses from a survey about online learning resources. The dataset contains 89,184 responses across 84 columns.
The helper function string_contains_checker checks whether at least one of the required strings is present as an item in a delimited string. It returns 1 if so and 0 otherwise.
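string_contains_checker <- function(input_str, delimiter, required_strs) {
  # Split the delimited string into its individual items
  split_strings = unlist(strsplit(input_str, delimiter))
  for (req_str in required_strs) {
    # Stop at the first required string that matches a whole item
    if (req_str %in% split_strings) {
      return(1)
    }
  }
  return(0)
}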
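As a rough illustration of how the helper behaves, the sketch below calls it on a made-up response string. The option labels and the variable names hit, miss, and partial are assumptions chosen for the example, not values confirmed by the script. The helper is repeated only so the snippet runs on its own.

```r
# Copy of the helper from this README, repeated so the snippet is self-contained.
string_contains_checker <- function(input_str, delimiter, required_strs) {
  split_strings <- unlist(strsplit(input_str, delimiter))
  for (req_str in required_strs) {
    if (req_str %in% split_strings) {
      return(1)
    }
  }
  return(0)
}

# Hypothetical response string; the option labels are illustrative only.
response <- "Books;Online Courses or Certification;Colleague"

# One of the required strings is an item of the response.
hit <- string_contains_checker(response, ";", c("Online Courses or Certification"))

# None of the required strings is an item of the response.
miss <- string_contains_checker(response, ";", c("Coding Bootcamp", "Friend"))

# Matching is on whole items, so a partial label does not count.
partial <- string_contains_checker(response, ";", c("Online Courses"))
```

Note that strsplit treats the delimiter as a regular expression by default, so a delimiter such as "|" would need to be escaped or passed with fixed = TRUE.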
Only the columns "Age", "LearnCode", and "Country" are kept for analysis. Responses are filtered based on whether participants reported using online resources for learning.
interested_colnames = c("Age", "LearnCode", "Country")
dataset = dataset[, interested_colnames]
The following binary study variables are defined based on responses:
- x: 1 if learning using online courses/certifications, and 0 otherwise.
- y: 1 if learning using other online resources, and 0 otherwise.
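One way the two indicators could be derived from the LearnCode column is sketched below on toy data. The option labels, the ";" delimiter, and the has_any function are assumptions made for illustration. has_any is a vectorized stand-in for string_contains_checker, not the script's own code.

```r
# Toy data standing in for the cleaned dataset; the labels are assumed, not confirmed.
online_course <- "Online Courses or Certification"
other_online  <- "Other online resources (e.g., videos, blogs, forum, online community)"

toy <- data.frame(
  LearnCode = c(
    paste(online_course, "Books / Physical media", sep = ";"),
    other_online,
    "Books / Physical media",
    paste(online_course, other_online, sep = ";")
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: 1 when any label is a whole item of the
# ";"-delimited string, 0 otherwise (one value per row).
has_any <- function(strings, labels) {
  vapply(
    strsplit(strings, ";", fixed = TRUE),
    function(items) as.integer(any(labels %in% items)),
    integer(1)
  )
}

toy$x <- has_any(toy$LearnCode, online_course)  # online courses/certifications
toy$y <- has_any(toy$LearnCode, other_online)   # other online resources
```

A respondent who selected both options gets x = 1 and y = 1, so the two variables are not mutually exclusive.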
The population parameters are calculated from the cleaned dataset. They include the population size N and the mean and variance of both x and y.
N = dim(dataset)[1]
mu_x.pop = mean(dataset$x)
sigma2_x.pop = var(dataset$x)
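R's var() divides by N - 1 rather than N, which matters when the value is read as a population variance. The sketch below, on a made-up binary vector, shows both versions side by side. The names s2_x and sigma2_x are hypothetical, and whether the script applies the N - 1 or the N convention is not stated here.

```r
# Toy binary population (hypothetical values).
x <- c(1, 0, 0, 1, 1, 0, 1, 1)
N <- length(x)

mu_x <- mean(x)                  # proportion of 1s in the population
s2_x <- var(x)                   # var() uses the divisor N - 1
sigma2_x <- s2_x * (N - 1) / N   # same quantity rescaled to the divisor N

# For a 0/1 variable the divisor-N variance equals p * (1 - p).
p_times_q <- mu_x * (1 - mu_x)
```

With tens of thousands of rows the two versions differ only negligibly, but the distinction matters when comparing against textbook formulas.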
This section implements simple random sampling without replacement (SRSWOR) to draw a sample of size n from the dataset. The mean and variance of each study variable are then calculated from the sample.
srswor_indices = sample(1:N, n, replace = FALSE)
sampled_df = dataset[srswor_indices, ]
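A minimal sketch of the estimation step that could follow the draw is shown below, on a simulated population. The sample size n = 100, the simulated data, and the names x_bar, s2_x, var_x_bar, and se_x_bar are assumptions for illustration. The variance formula is the standard SRSWOR estimator with the finite population correction, which may or may not match what the script computes.

```r
set.seed(1)  # fixed seed only so the sketch is reproducible

# Simulated population standing in for the cleaned dataset.
N <- 1000
dataset <- data.frame(x = rbinom(N, 1, 0.4))
n <- 100  # assumed sample size

# Draw n distinct row indices, then subset the data.
srswor_indices <- sample(1:N, n, replace = FALSE)
sampled_df <- dataset[srswor_indices, , drop = FALSE]

# Sample mean and sample variance of x.
x_bar <- mean(sampled_df$x)
s2_x <- var(sampled_df$x)

# Estimated variance of the sample mean under SRSWOR:
# (1 - n/N) is the finite population correction.
var_x_bar <- (1 - n / N) * s2_x / n
se_x_bar <- sqrt(var_x_bar)
```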
Stratified sampling is conducted by age and, separately, by region, allowing for more accurate estimates within each stratum.
Stratification by age divides the dataset into strata based on age groups, using proportional allocation to determine sample sizes.
ages = unique(dataset$Age)
strata.prop.n_A = rep(NA, length(ages))
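Under proportional allocation, each stratum receives a share of the total sample equal to its share of the population, n_h = n * N_h / N. The sketch below fills the allocation vector on toy data. The age labels, the total sample size n = 50, the use of round(), and the name strata.N_A are assumptions, not taken from the script.

```r
# Toy age column; the group labels and counts are illustrative.
dataset <- data.frame(
  Age = rep(c("18-24 years old", "25-34 years old", "35-44 years old"),
            times = c(200, 500, 300))
)
N <- nrow(dataset)
n <- 50  # assumed total sample size

ages <- unique(dataset$Age)
strata.N_A <- rep(NA, length(ages))       # stratum sizes N_h (hypothetical name)
strata.prop.n_A <- rep(NA, length(ages))  # allocated sample sizes n_h

for (i in seq_along(ages)) {
  strata.N_A[i] <- sum(dataset$Age == ages[i])
  # Proportional allocation: n_h = n * N_h / N, rounded to a whole number.
  strata.prop.n_A[i] <- round(n * strata.N_A[i] / N)
}
```

Because of rounding, the allocated sizes do not always sum exactly to n. In this toy case they do.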
Similar to age stratification, the dataset is stratified by region to enhance sample representation.
regions = unique(dataset$Region)
strata.prop.n_R = rep(NA, length(regions))
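Once the per-stratum sample sizes are fixed, a stratified estimate of the mean can be formed as a weighted sum of stratum sample means. The sketch below assumes a Region column already exists (for example, derived from Country) and uses simulated data. The region labels and the names strata.N_R, strata.mean_R, strata.var_R, W, x_bar_st, and var_x_bar_st are hypothetical.

```r
set.seed(2)  # fixed seed only so the sketch is reproducible

# Simulated data; the Region values are illustrative.
dataset <- data.frame(
  Region = rep(c("Americas", "Asia", "Europe"), times = c(300, 300, 400)),
  x = rbinom(1000, 1, 0.5)
)
N <- nrow(dataset)
n <- 100  # assumed total sample size

regions <- unique(dataset$Region)
strata.N_R <- rep(NA, length(regions))
strata.prop.n_R <- rep(NA, length(regions))
strata.mean_R <- rep(NA, length(regions))
strata.var_R <- rep(NA, length(regions))

for (i in seq_along(regions)) {
  stratum <- dataset[dataset$Region == regions[i], ]
  strata.N_R[i] <- nrow(stratum)
  strata.prop.n_R[i] <- round(n * strata.N_R[i] / N)
  # SRSWOR within the stratum.
  idx <- sample(1:strata.N_R[i], strata.prop.n_R[i], replace = FALSE)
  strata.mean_R[i] <- mean(stratum$x[idx])
  strata.var_R[i] <- var(stratum$x[idx])
}

# Stratum weights W_h = N_h / N.
W <- strata.N_R / N

# Stratified mean and its estimated variance (with finite population correction).
x_bar_st <- sum(W * strata.mean_R)
var_x_bar_st <- sum(W^2 * (1 - strata.prop.n_R / strata.N_R) *
                      strata.var_R / strata.prop.n_R)
```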
The results from both SRSWOR and stratified sampling are saved in CSV files and can be accessed for further analysis.
The results include means, variances, and sample sizes for each stratum.
To run the script, ensure you have R and the necessary libraries installed. You can install required packages with:
install.packages("countrycode")