Survey Analysis using R

This repository contains an R script that cleans a developer survey dataset and applies statistical sampling methods to it. The survey responses concern the resources developers use to learn, with a focus on online learning resources.

Dataset Overview

The analysis uses the file survey_results_public.csv, which contains 89,184 responses across 84 columns.

Cleaning the Dataset

Helper Functions

The helper function string_contains_checker checks whether at least one of the required strings appears as a complete item in a delimited string. It returns 1 if so and 0 otherwise.

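# Returns 1 if any element of required_strs is one of the items obtained by
# splitting input_str on delimiter, and 0 otherwise.
string_contains_checker <- function(input_str, delimiter, required_strs) {
  # strsplit() interprets delimiter as a regular expression
  split_strings <- unlist(strsplit(input_str, delimiter))

  for (req_str in required_strs) {
    # %in% compares whole items, so a partial match does not count
    if (req_str %in% split_strings) {
      return(1)
    }
  }
  return(0)
}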
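
A short usage sketch may help show what "present" means here. The response string and the item labels below are hypothetical and are not taken from the survey file; the helper is repeated only so the example runs on its own.

```r
# Same helper as above, repeated so this example is self-contained.
string_contains_checker <- function(input_str, delimiter, required_strs) {
  split_strings <- unlist(strsplit(input_str, delimiter))
  for (req_str in required_strs) {
    if (req_str %in% split_strings) {
      return(1)
    }
  }
  return(0)
}

# Hypothetical response: three items separated by ";".
response <- "Books;Online courses;Friend"

# "Online courses" is one of the items, so this is a hit.
hit <- string_contains_checker(response, ";", c("Online courses", "School"))

# "Online" is only a substring of an item, not an item itself, so this is a miss.
miss <- string_contains_checker(response, ";", c("Online"))
```

Because strsplit() treats the delimiter as a regular expression, a delimiter made of regex metacharacters would need escaping or fixed = TRUE; a plain ";" works as is.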

Data Filtering

Only the columns "Age", "LearnCode", and "Country" are kept for analysis. Responses are filtered based on whether participants reported using online resources for learning.

interested_colnames = c("Age", "LearnCode", "Country")
dataset = dataset[, interested_colnames]

Defining Study Variables

The following binary study variables are defined based on responses:

x: 1 if learning using online courses/certifications, and 0 otherwise.
y: 1 if learning using other online resources, and 0 otherwise.
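
The two definitions above could be derived from the LearnCode column roughly as follows. This is a sketch on toy data: the LearnCode labels, the ";" delimiter, and the rule of keeping rows with x or y equal to 1 are assumptions made for illustration, not details confirmed by the script.

```r
# Helper from the "Helper Functions" section, repeated for self-containment.
string_contains_checker <- function(input_str, delimiter, required_strs) {
  split_strings <- unlist(strsplit(input_str, delimiter))
  for (req_str in required_strs) {
    if (req_str %in% split_strings) {
      return(1)
    }
  }
  return(0)
}

# Toy data; every value here is a hypothetical placeholder.
dataset <- data.frame(
  Age = c("18-24", "25-34", "25-34", "35-44"),
  LearnCode = c("Books;Online courses",
                "Other online resources",
                "School",
                "Online courses;Other online resources"),
  Country = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# x: learned using online courses/certifications (assumed label).
dataset$x <- vapply(dataset$LearnCode, string_contains_checker, numeric(1),
                    delimiter = ";", required_strs = c("Online courses"),
                    USE.NAMES = FALSE)

# y: learned using other online resources (assumed label).
dataset$y <- vapply(dataset$LearnCode, string_contains_checker, numeric(1),
                    delimiter = ";", required_strs = c("Other online resources"),
                    USE.NAMES = FALSE)

# Assumed filter: keep respondents who reported at least one online resource.
dataset <- dataset[dataset$x == 1 | dataset$y == 1, ]
```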

Population Parameters

The population parameters are calculated from the cleaned dataset. They include the population size N and the mean and variance of both x and y. Note that var() in R divides by N - 1 rather than N.

N = dim(dataset)[1]
mu_x.pop = mean(dataset$x)
sigma2_x.pop = var(dataset$x)
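
Since x and y are binary, their variance follows directly from the proportion of ones. The sketch below checks this on simulated data (not the survey data); the names p, s2 and s2_from_p are introduced here for illustration only.

```r
set.seed(1)

# Simulated binary study variable standing in for dataset$x.
x <- rbinom(1000, size = 1, prob = 0.4)

N <- length(x)
p <- mean(x)   # proportion of ones, equal to the mean of a 0/1 variable
s2 <- var(x)   # var() uses the divisor N - 1

# For a 0/1 variable: sum((x - p)^2) = N * p * (1 - p),
# so the N - 1 variance equals N / (N - 1) * p * (1 - p).
s2_from_p <- N / (N - 1) * p * (1 - p)
```

For a population as large as this one, the factor N / (N - 1) is very close to 1, so the variance is approximately p(1 - p).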

Sampling Methods

Simple Random Sampling Without Replacement (SRSWOR)

This section draws a simple random sample of n rows, without replacement, from the N rows of the dataset. The mean and variance of the study variables are then calculated from the sample.

srswor_indices = sample(1:N, n, replace = FALSE)
sampled_df = dataset[srswor_indices, ]
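
As an illustration of what can be computed from such a sample, the sketch below estimates the population mean of x together with its standard error, using the standard finite population correction for SRSWOR. It runs on simulated data, and the sample size n = 200 is a hypothetical choice, not the value used in the script.

```r
set.seed(42)

# Simulated population standing in for the cleaned dataset.
N <- 5000
population <- data.frame(x = rbinom(N, size = 1, prob = 0.3))

n <- 200  # hypothetical sample size

# Draw n distinct row indices, as in the snippet above.
srswor_indices <- sample(1:N, n, replace = FALSE)
sampled_df <- population[srswor_indices, , drop = FALSE]

# Sample mean: unbiased estimate of the population mean under SRSWOR.
x_bar <- mean(sampled_df$x)

# Sample variance with divisor n - 1.
s2_x <- var(sampled_df$x)

# Estimated variance of the sample mean: (1 - n/N) * s^2 / n,
# where (1 - n/N) is the finite population correction.
var_x_bar <- (1 - n / N) * s2_x / n
se_x_bar <- sqrt(var_x_bar)
```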

Stratified Sampling

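Stratified sampling is carried out twice: once with age groups as strata and once with regions as strata. Stratification can give more precise estimates than SRSWOR when responses are more alike within a stratum than across strata.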

Stratification by Age

Stratification by age divides the dataset into strata based on age groups. Sample sizes are set by proportional allocation, meaning each stratum's share of the sample matches its share of the population.

ages = unique(dataset$Age)
strata.prop.n_A = rep(NA, length(ages))
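
Proportional allocation could be filled in along these lines. This is a sketch on toy data: the age labels, the total sample size n = 100, and the helper vector strata.N_A are assumptions introduced for illustration.

```r
# Toy population with three age groups of known sizes.
dataset <- data.frame(
  Age = rep(c("18-24", "25-34", "35-44"), times = c(500, 300, 200)),
  stringsAsFactors = FALSE
)

N <- nrow(dataset)
n <- 100  # hypothetical total sample size

ages <- unique(dataset$Age)
strata.N_A <- rep(NA, length(ages))       # stratum sizes (hypothetical name)
strata.prop.n_A <- rep(NA, length(ages))  # allocated sample sizes

for (i in seq_along(ages)) {
  # Number of population units in stratum i.
  strata.N_A[i] <- sum(dataset$Age == ages[i])
  # Proportional allocation: n_h = n * N_h / N, rounded to a whole number.
  strata.prop.n_A[i] <- round(n * strata.N_A[i] / N)
}
```

With real data the rounded stratum sizes may not add up to n exactly, so the total is worth checking after allocation.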

Stratification by Region

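As with age, the dataset is stratified by region so that every region is represented in the sample. Region is not one of the three columns kept during data filtering, so it has to be derived first, presumably from Country using the countrycode package listed under Installation.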

regions = unique(dataset$Region)
strata.prop.n_R = rep(NA, length(regions))
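
Once stratum sample sizes are fixed, the stratum results can be combined into a single estimate. The sketch below uses the standard stratified estimator of the mean on simulated data; the region labels, the sample size, and all variable names other than regions are hypothetical.

```r
set.seed(7)

# Simulated population with three regions and different rates of x = 1.
sizes <- c(600, 300, 100)
dataset <- data.frame(
  Region = rep(c("R1", "R2", "R3"), times = sizes),
  x = rbinom(sum(sizes), size = 1, prob = rep(c(0.2, 0.5, 0.8), times = sizes)),
  stringsAsFactors = FALSE
)

N <- nrow(dataset)
n <- 100  # hypothetical total sample size

regions <- unique(dataset$Region)
N_h <- n_h <- mean_h <- s2_h <- numeric(length(regions))

for (h in seq_along(regions)) {
  stratum <- dataset[dataset$Region == regions[h], ]
  N_h[h] <- nrow(stratum)
  n_h[h] <- round(n * N_h[h] / N)  # proportional allocation

  # SRSWOR within the stratum.
  sampled <- stratum[sample(1:N_h[h], n_h[h], replace = FALSE), ]
  mean_h[h] <- mean(sampled$x)
  s2_h[h] <- var(sampled$x)
}

# Stratum weights W_h = N_h / N.
W_h <- N_h / N

# Stratified mean: weighted sum of the stratum sample means.
x_bar_st <- sum(W_h * mean_h)

# Estimated variance: sum of W_h^2 * (1 - n_h/N_h) * s_h^2 / n_h.
var_x_bar_st <- sum(W_h^2 * (1 - n_h / N_h) * s2_h / n_h)
```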

Results

The results from both SRSWOR and stratified sampling are saved in CSV files and can be accessed for further analysis.

The results include means, variances, and sample sizes for each stratum.

Installation

To run the script, ensure you have R and the necessary libraries installed. You can install required packages with:

install.packages("countrycode")
