{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Projects - Example Project for ECON 325\n",
        "\n",
        "COMET Team <br> *Shiming Wu, Rathin Dharani, Jonathan Graves*  \n",
        "2023-01-12\n",
        "\n",
        "# Outline\n",
        "\n",
        "If you are reviewing the materials from ECON 325, or self-studying it,\n",
        "this is a good self-test to see if you understand all of the material.\n",
        "After completing this course, you should be able to:\n",
        "\n",
        "-   Read this notebook, and understand what the difference analyses are,\n",
        "    and how they are being used\n",
        "-   Critique the choices made, understanding their pros and cons\n",
        "-   Understand what the R code is doing, and how it implements the\n",
        "    analyses\n",
        "-   Be able to describe how to adjust or change this to do other\n",
        "    analysis or change the focus or assumptions made in the analysis so\n",
        "    far\n",
        "\n",
        "If you’re interested in getting started with econometric analysis, you\n",
        "may also use this as a model to guide your own project.\n",
        "\n",
        "# Education, Career and Inequality"
      ],
      "id": "49854027-b54d-4d1d-961c-5226174b66a3"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# load the packages\n",
        "\n",
        "library(tidyverse)\n",
        "library(haven)\n",
        "library(ggplot2)\n",
        "library(stargazer)\n",
        "\n",
        "#install.packages(\"vtable\") #run this line if you do not have \"vtable\" installed already. \n",
        "library(vtable)"
      ],
      "id": "38001d30-2092-4029-b610-0e5150315616"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Notes\n",
        "\n",
        "<span id=\"fn2\">[<sup>1</sup>](#fn2s)Stargazer package is due to: Hlavac,\n",
        "Marek (2022). stargazer: Well-Formatted Regression and Summary\n",
        "Statistics Tables. R package version 5.2.3.\n",
        "https://CRAN.R-project.org/package=stargazer</span>\n",
        "\n",
        "# Introduction\n",
        "\n",
        "Getting a university degree can be tough. We may wonder whether higher\n",
        "education level can help us to achieve a more successful career.\n",
        "Besides, is education the only factor that impacts our career? In this\n",
        "project, we are going to answer these questions. Since one of the\n",
        "important proxies for a successful career is income, we are going to\n",
        "study the relationship between education, income, and wealth.\n",
        "\n",
        "Now, let’s first import our data and clean our dataset. For this\n",
        "project, we will be using data from the 2019 Survey of Financial\n",
        "Security (SFS), provided by Statistics Canada (see the license notes).\n",
        "The survey was conducted by household unit, so income and wealth\n",
        "variables represent total income and wealth from a family, and household\n",
        "characteristics are usually the characteristics of main earners in the\n",
        "household. The main earner is the person who has highest income in a\n",
        "family. In order to study career outcomes, we restrict our samples to be\n",
        "households whose major sources of incomes are wages, salaries and\n",
        "self-employment incomes."
      ],
      "id": "e064e3cc-0b8d-436d-982b-a5ae3062e0b2"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# READING IN DATA\n",
        "SFS_data <- read_dta(\"../datasets_projects/SFS_2019_Eng.dta\")\n",
        "\n",
        "# FILTERING OUT NA VALUES FOR TARGET VARIABLES\n",
        "SFS_data <- filter(SFS_data, !is.na(SFS_data$pefmtinc))\n",
        "SFS_data <- filter(SFS_data, !is.na(SFS_data$pefatinc))\n",
        "SFS_data <- filter(SFS_data, !is.na(SFS_data$pwnetwpg))\n",
        "SFS_data <- filter(SFS_data, !is.na(SFS_data$pagemieg))\n",
        "\n",
        "# SUBSETTING DATA \n",
        "SFS_data <- subset(SFS_data, pefmjsif == \"02\" | pefmjsif == \"03\") \n",
        "# restrict samples to be households whose major sources of incomes are wages, salaries and self-employment incomes.\n",
        "\n",
        "# RENAMING VARIABLES TO READABLE NAMES\n",
        "SFS_data <- rename(SFS_data, income_before_tax = pefmtinc)\n",
        "SFS_data <- rename(SFS_data, income_after_tax = pefatinc)\n",
        "SFS_data <- rename(SFS_data, wealth = pwnetwpg)\n",
        "SFS_data <- rename(SFS_data, gender = pgdrmie)\n",
        "SFS_data <- rename(SFS_data, education = peducmie)\n",
        "SFS_data <- rename(SFS_data, business = pbusind)\n",
        "SFS_data <- rename(SFS_data, province = ppvres)\n",
        "SFS_data <- rename(SFS_data, credit_limit = pattlmlc)\n",
        "SFS_data <- rename(SFS_data, age = pagemieg)\n",
        "SFS_data <- rename(SFS_data, employment = plffptme)\n",
        "\n",
        "# REFACTORING SOME VARIABLES\n",
        "SFS_data<-SFS_data[!(SFS_data$education==\"9\"),] # remove observations that education is \"not stated\"\n",
        "SFS_data$education <- as.numeric(SFS_data$education)\n",
        "SFS_data <- SFS_data[order(SFS_data$education),] # sort dataset by variable `education`\n",
        "SFS_data$education <- as.character(SFS_data$education)\n",
        "SFS_data$education[SFS_data$education == \"1\"] <- \"Less than high school\" \n",
        "# replace content to be 'less than high school' if orginal content is '1' for variable education\n",
        "SFS_data$education[SFS_data$education == \"2\"] <- \"High school\"\n",
        "SFS_data$education[SFS_data$education == \"3\"] <- \"Non-university post-secondary\"\n",
        "SFS_data$education[SFS_data$education == \"4\"] <- \"University\"\n",
        "\n",
        "# CHANGING ALL CATEGORICAL VARIABLES TO FACTOR VARIABLES (originally they were string variables)\n",
        "SFS_data$gender <- as_factor(SFS_data$gender)\n",
        "SFS_data$education <- as_factor(SFS_data$education)\n",
        "SFS_data$business <- as_factor(SFS_data$business)\n",
        "SFS_data$province <- as_factor(SFS_data$province)\n",
        "SFS_data$age <- as_factor(SFS_data$age)\n",
        "SFS_data$employment <- as_factor(SFS_data$employment)"
      ],
      "id": "d0b3797e-759a-447b-817c-f5d05a76b3cb"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Since we want to study income, let’s restrict the sample to main working\n",
        "groups, mostly are people with age 25 to 65. The age variable we use is\n",
        "the age of main earner in a household."
      ],
      "id": "d80e804d-0520-4124-97e1-86ed0e15f55b"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "SFS_data <- \n",
        "        SFS_data %>%\n",
        "        mutate(agegr = case_when(\n",
        "              age == \"01\" ~ \"Under 30\", #under 20\n",
        "              age == \"02\" ~ \"Under 30\", #20-24\n",
        "              age == \"03\" ~ \"20s\", #25-29\n",
        "              age == \"04\" ~ \"30s\",\n",
        "            age == \"05\" ~ \"30s\",\n",
        "              age == \"06\" ~ \"40s\",\n",
        "              age == \"07\" ~ \"40s\",\n",
        "              age == \"08\" ~ \"50s\",\n",
        "              age == \"09\" ~ \"50s\",\n",
        "              age == \"10\" ~ \"60s\", #60-64\n",
        "              age == \"11\" ~ \"Above 65\", #65-69\n",
        "              age == \"12\" ~ \"Above 65\", #70-74\n",
        "              age == \"13\" ~ \"Above 75\", #75-79\n",
        "              age == \"14\" ~ \"Above 75\", #80 and above\n",
        "              )) %>%\n",
        "        mutate(agegr = as_factor(agegr))\n",
        "\n",
        "SFS_data <- subset(SFS_data, agegr == \"20s\" | agegr == \"30s\" | agegr == \"40s\" | agegr == \"50s\" | agegr == \"60s\" )\n",
        "SFS_data$agegr <- factor(SFS_data$agegr,levels = c(\"20s\", \"30s\", \"40s\", \"50s\", \"60s\"))"
      ],
      "id": "1b2d91ec-97f9-4dce-91bf-86b0245efcb3"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Summary Statistics\n",
        "\n",
        "Before we begin our exploration, let’s first describe variables of\n",
        "interest.\n",
        "\n",
        "-   Variables incomes and wealth are of the family unit. The wealth\n",
        "    variable represents net worth of the family unit.\n",
        "\n",
        "-   Mean of income before tax of a family in 2019 is about \\$124,804.5.\n",
        "\n",
        "-   The standard deviation of income after tax is smaller than income\n",
        "    before tax, which suggests tax and government transfers reduce\n",
        "    income inequality in Canada.\n",
        "\n",
        "-   Net wealth of a family unit is approximately 9.5 times of the income\n",
        "    after tax of a family. Unlike income after tax, standard deviation\n",
        "    of wealth is large, thus dispersion of wealth is larger than income,\n",
        "    which conforms with economic literature."
      ],
      "id": "7b7ef94a-caae-4ac6-8f32-59ceefa4248d"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "sumtable(SFS_data, \n",
        "      vars = c(\"income_before_tax\", \"income_after_tax\", \"wealth\"),\n",
        "       summ = c('mean(x)','median(x)', 'sd(x)' ),\n",
        "       digits = 7,\n",
        "       out = 'return')"
      ],
      "id": "aa17ae2d-f34f-42b1-81d7-a01d4d84ed27"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Since `age` and `education` are factor variables, we can visualize them\n",
        "with histograms. As we can see from the graph below, many main earners\n",
        "are in their 50s, and most main earners have a post-secondary degree."
      ],
      "id": "f77d8477-931a-47e4-a389-a1e2260b32ec"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "age_plot<-ggplot(SFS_data, aes(x=agegr)) + \n",
        "          geom_bar(fill=\"lightblue\") + \n",
        "          ggtitle(\"Wage earners by Age Category\") +\n",
        "          ylab(\"Number of Earners\")+\n",
        "          xlab(\"Age Groups\") \n",
        "options(repr.plot.width=7,repr.plot.height=7) #set size of the plot\n",
        "edu_plot<-ggplot(SFS_data, aes(x=education)) +\n",
        "          geom_bar(fill=\"lightgreen\") + \n",
        "          ggtitle(\"Wage earners by Highest Level of Education Recieved\") +\n",
        "          ylab(\"Number of Earners\") +\n",
        "          xlab(\"Education\") \n",
        "age_plot\n",
        "edu_plot"
      ],
      "id": "6ae6433a-f2eb-42e8-aeed-343b8be94c7e"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In order to preclude effects of extreme values, let’s remove\n",
        "observations that lie in highest 2% and lowest 2% of incomes and wealth.\n",
        "We see that both incomes and wealth are right-skewed: they have long\n",
        "right tails."
      ],
      "id": "f04ad507-ad49-40ba-96b6-1d5605ca9c7c"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# FINDS THE LOWER AND UPPER BOUND SPECIFIED \n",
        "pct2_income_before <- quantile(SFS_data$income_before_tax,c(0.02,0.98),type=1)\n",
        "\n",
        "# FILTERS OUT THE VALUES OUTSIDE OF THE BOUNDS SPECIFIED\n",
        "SFS_data <- filter(SFS_data, SFS_data$income_before_tax > pct2_income_before[1] & SFS_data$income_before_tax < pct2_income_before[2])\n",
        "\n",
        "pct2_income_after <- quantile(SFS_data$income_after_tax,c(0.02,0.98),type=1)\n",
        "SFS_data <- filter(SFS_data, SFS_data$income_after_tax > pct2_income_after[1] & SFS_data$income_after_tax < pct2_income_after[2])\n",
        "\n",
        "pct2_wealth <- quantile(SFS_data$wealth,c(0.02,0.98),type=1)\n",
        "SFS_data <- filter(SFS_data, SFS_data$wealth > pct2_wealth[1] & SFS_data$wealth < pct2_wealth[2])"
      ],
      "id": "8779547c-eb7d-4745-88aa-8f8f37f0a715"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "options(repr.plot.width=7,repr.plot.height=7)\n",
        "\n",
        "income_before_tax_plot<-ggplot(SFS_data, aes(x=income_before_tax)) + geom_histogram(colour = 4, fill = \"lightblue\", \n",
        "                 bins = 30)  + xlab(\"Income before Tax\") + ylab(\"Number of Observations\") + scale_x_continuous()\n",
        "income_after_tax_plot<-ggplot(SFS_data, aes(x=income_after_tax)) + geom_histogram(colour = 4, fill = \"lightblue\", \n",
        "                 bins = 30)  + xlab(\"Income after Tax\") + ylab(\"Number of Observations\") + scale_x_continuous()\n",
        "\n",
        "income_before_tax_plot\n",
        "income_after_tax_plot\n",
        "\n",
        "wealth_plot<-ggplot(SFS_data, aes(x=wealth)) + geom_histogram(colour = 4, fill = \"lightblue\", \n",
        "                 bins = 30)  + xlab(\"Wealth\") + ylab(\"Number of Observations\") + scale_x_continuous()\n",
        "wealth_plot"
      ],
      "id": "25bd9074-8617-4dd5-ae85-c004aa43c1b0"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Education and Career\n",
        "\n",
        "Now let’s study the relationship between education, income and wealth.\n",
        "\n",
        "1.  Create a function `CI` to calculate confidence interval.\n",
        "2.  Store the mean of income and wealth by groups of education in table\n",
        "    `results`.\n",
        "3.  Use the `CI` function to calculate confidence intervals and combine\n",
        "    results in table `df_gr`. We also display results in graphs.\n",
        "\n",
        "From table `df_gr`, we notice that incomes and wealth increase as\n",
        "education increases. Thus education does help us to achieve a more\n",
        "successful career.\n",
        "\n",
        "Meanwhile, standard deviations also increase as education increases.\n",
        "This implies income and wealth inequality increase as education\n",
        "increases.\n",
        "\n",
        "From a personal point of view, in order to have a successful career,\n",
        "obtaining a degree is just a starting point as career development is\n",
        "also crucial for a successful career. From a policy perspective,\n",
        "policies that only target pre-labor market skill accumulation will be\n",
        "less effective than policies that foster career progression in reducing\n",
        "inequality."
      ],
      "id": "7a57958c-6164-4357-aafa-24cc59c9fcde"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "CI <- function(data) {\n",
        "    x <- mean(data) # calculate mean of input\n",
        "    n <- length(data) # calculate number of observations\n",
        "    df <- n - 1 # degree of freedom is n-1\n",
        "    t <- qt(p = 0.05, df = df) # finding the t value for a confidence level of 95% \n",
        "    s <- sd(data) # finding the sample standard deviation of input\n",
        "    \n",
        "    # calculating the lower and upper bounds of the desired confidence interval\n",
        "    lower_bound <- x - (t*s/sqrt(n))\n",
        "    upper_bound <- x + (t*s/sqrt(n))\n",
        "    \n",
        "    bound<-c(lower_bound,upper_bound) # store lower bound and upper bound to a vector named `bound`\n",
        "    return(bound)\n",
        "}"
      ],
      "id": "367836c2-e405-45d6-b748-4e6212823c79"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "'Generate a Dataframe to Display Summary Statistics by Education Groups'\n",
        "\n",
        "#We can start by creating a new data frame named `results` to store mean and standard deviations of incomes and wealth by education groups. We can include our calculated confidence interval by running function `CI` with each of our variables of interest (`income_before_tax` and `wealth`).\n",
        "\n",
        "results <- data.frame(SFS_data %>%\n",
        "  group_by(education) %>%\n",
        "  summarize(m_income = mean(income_before_tax),\n",
        "            sd_income = sd(income_before_tax),\n",
        "            CI_L_income = c(CI(income_before_tax)[2]),\n",
        "            CI_U_income = c(CI(income_before_tax)[1]), #Calculate confidence interval for `income_before_tax` using the `CI` function \n",
        "            m_wealth = mean(wealth),\n",
        "            sd_wealth = sd(wealth),\n",
        "            CI_L_wealth = c(CI(wealth)[2]), #Calculate confidence interval for `wealth` using the `CI` function \n",
        "            CI_U_wealth = c(CI(wealth)[1])))\n",
        "\n",
        "results   "
      ],
      "id": "d9be47bf-cd7d-4a8b-b4cc-355ecfab4eac"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "options(repr.plot.width=8,repr.plot.height=3)\n",
        "g <- ggplot(data = SFS_data, aes(x = education, y = income_before_tax)) + xlab(\"Education of Main Earner\") + ylab(\"Income before Tax\") + scale_y_continuous() \n",
        "g1 <- g + geom_bar(stat = \"summary\", fun = \"mean\", fill = \"lightblue\") #produce a summary statistic, the mean\n",
        "g1 <- g1 + coord_flip() #make a horizontal bar graph!\n",
        "\n",
        "f <- ggplot(data = SFS_data, aes(x = education, y = wealth)) + xlab(\"Education of Main Earner\") + ylab(\"Wealth\") + scale_y_continuous() \n",
        "f1 <- f + geom_bar(stat = \"summary\", fun = \"mean\", fill = \"lightblue\") #produce a summary statistic, the mean\n",
        "f1 <- f1 + coord_flip() #make a horizontal bar graph!\n",
        "\n",
        "g1\n",
        "f1"
      ],
      "id": "abb6b6d4-82d8-494f-9e96-fa173dda23c9"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "From table and graphs above, we know incomes and wealth of different\n",
        "education groups are different. But is the difference statistically\n",
        "significant? We use two sample t-test to verify the results.\n",
        "\n",
        "The t-tests for income and wealth are significant. The results suggest\n",
        "people who obtain university degrees will have higher wages and wealth\n",
        "than people who own non-university post-secondary degrees."
      ],
      "id": "eb0b87df-2c77-420d-bd5d-cfdf74e71fd7"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "t1 = t.test(\n",
        "       x = filter(SFS_data, education == \"Non-university post-secondary\")$income_before_tax,\n",
        "       y = filter(SFS_data, education == \"University\")$income_before_tax,\n",
        "       alternative = \"two.sided\",\n",
        "       mu = 0,\n",
        "       conf.level = 0.95)\n",
        "\n",
        "t1 \n",
        "\n",
        "round(t1$estimate[1] - t1$estimate[2],2) "
      ],
      "id": "c12f3835-174c-43a9-9f92-894ee61b93f9"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "t2 = t.test(\n",
        "       x = filter(SFS_data, education == \"Non-university post-secondary\")$wealth,\n",
        "       y = filter(SFS_data, education == \"University\")$wealth,\n",
        "       alternative = \"two.sided\",\n",
        "       mu = 0,\n",
        "       conf.level = 0.95)\n",
        "\n",
        "t2 \n",
        "\n",
        "round(t2$estimate[1] - t2$estimate[2],2) "
      ],
      "id": "dd922900-1f8d-469a-be60-b8b865918170"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Next, we study correlations between education, incomes and wealth.\n",
        "\n",
        "We first draw graphs to depict correlations between education, incomes,\n",
        "and wealth. From graphs below, we find that they have positive\n",
        "correlations."
      ],
      "id": "2ce4b97a-7448-4d52-9ec9-25b6f43f7b9f"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "options(repr.plot.width=7,repr.plot.height=7)\n",
        "before_after_tax<-ggplot(SFS_data, aes(x = income_before_tax, y = income_after_tax)) + \n",
        "    xlab(\"Incomes before tax\") + ylab(\"Incomes after tax\") +\n",
        "    geom_point(shape = 1) + geom_smooth(method = lm, formula=y ~ x) + \n",
        "    scale_x_continuous() + scale_y_continuous() \n",
        "after_tax_wealth<-ggplot(SFS_data, aes(x = income_after_tax, y = wealth)) + \n",
        "    xlab(\"Incomes after tax\") + ylab(\"Wealth\") +\n",
        "    geom_point(shape = 1) + geom_smooth(method = lm, formula=y ~ x) + \n",
        "    scale_x_continuous() + scale_y_continuous()\n",
        "edu_before_tax <- ggplot(SFS_data, aes(x = education, y = income_before_tax)) + \n",
        "    xlab(\"Education\") + ylab(\"Incomes\") + geom_boxplot(fill = \"lightyellow\") + scale_y_continuous()\n",
        "edu_wealth <- ggplot(SFS_data, aes(x = education, y = wealth)) + \n",
        "    xlab(\"Education\") + ylab(\"Wealth\") + geom_boxplot(fill = \"lightyellow\") + scale_y_continuous()\n",
        "\n",
        "before_after_tax\n",
        "after_tax_wealth\n",
        "edu_before_tax\n",
        "edu_wealth"
      ],
      "id": "e2ff8345-9c1b-4621-8904-a777e830c5ff"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let’s find out how large the correlations are between variables. We\n",
        "first transform `education` to be a numeric variable which increases as\n",
        "education increases.\n",
        "\n",
        "From table below, we know the correlations of education and incomes, and\n",
        "education and wealth are 0.26 and 0.21 respectively. This conforms with\n",
        "our previous results that incomes and wealth increase as education\n",
        "increases. But since correlations are not very close to 1, that means\n",
        "education is not the only factor to a successful career."
      ],
      "id": "5aff9a32-8e3f-4ff2-91b3-73a6ef09acf6"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "SFS_data$education<-as.numeric(SFS_data$education)\n",
        "cor_data<-select(SFS_data, c('education','income_before_tax','income_after_tax','wealth'))\n",
        "mydata.cor = cor(cor_data)\n",
        "mydata.cor"
      ],
      "id": "4040c855-4798-4ca5-aa77-cd93a23a6836"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "From the table and graphs above, we know there is positive correlations\n",
        "between education and incomes, as well as wealth. Are the correlations\n",
        "significantly different from 0? To test our hypothesis we can perform\n",
        "Pearson correlation tests. The results suggest that correlations are\n",
        "significantly different from 0."
      ],
      "id": "75537844-c24f-47f7-82c5-96c1e22264c4"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Pearson correlation test\n",
        "cor.test(SFS_data$education, SFS_data$income_before_tax, use=\"complete.obs\") \n",
        "cor.test(SFS_data$education, SFS_data$wealth, use=\"complete.obs\") "
      ],
      "id": "09dff19c-d815-46a7-bb3a-1a9d7eb7bc7c"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "So far, we know that incomes and wealth increase as education increases.\n",
        "But there is one question which is not clear yet: is there another\n",
        "factor A that affects education and income at the same time, so that the\n",
        "real “force” that drives up income is factor A, rather than education?\n",
        "In order to answer this question, we can run regressions. Let’s first\n",
        "change `education` back to a factor variable."
      ],
      "id": "255fa16b-2418-46a2-89b2-cc8abca8293f"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "SFS_data$education <- as.character(SFS_data$education)\n",
        "SFS_data$education[SFS_data$education == \"1\"] <- \"Less than high school\"\n",
        "SFS_data$education[SFS_data$education == \"2\"] <- \"High school\"\n",
        "SFS_data$education[SFS_data$education == \"3\"] <- \"Non-university post-secondary\"\n",
        "SFS_data$education[SFS_data$education == \"4\"] <- \"University\"\n",
        "SFS_data$education <- as_factor(SFS_data$education)"
      ],
      "id": "658dbf7c-b700-441a-aaf0-72760e1ca0c7"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "regression1 <- lm(income_before_tax ~ education, data = SFS_data)\n",
        "regression2 <- lm(income_before_tax ~ education + gender, data = SFS_data)\n",
        "regression3 <- lm(income_before_tax ~ education + gender + agegr, data = SFS_data)\n",
        "\n",
        "stargazer(regression1, regression2, regression3, title=\"Comparison of Controls\",\n",
        "          align = TRUE, type=\"text\", keep.stat = c(\"n\",\"rsq\"))"
      ],
      "id": "e6c753c6-fff7-4f62-8e5b-d67d22f4cba7"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The table above displays results of 3 regressions.\n",
        "\n",
        "Column (1) shows the regression with education and incomes.\n",
        "\n",
        "-   The average income of households whose education is less than high\n",
        "    school is CAD 81239.44. Average income of family with high school\n",
        "    degree is CAD 81239.44 + CAD 13550.36 = CAD 94789.8.\n",
        "-   Similarly, we can calculate the average incomes of households with\n",
        "    non-university post-secondary degree and university degree (as seen\n",
        "    in the two sample t-tests).\n",
        "-   So the first column suggests incomes increase as education\n",
        "    increases.\n",
        "\n",
        "Column (2) includes education and gender in the regression.\n",
        "\n",
        "-   Compared with male-lead family, a family with a female primary\n",
        "    earner has a lower income.\n",
        "-   However, the coefficient of `education` is still positive,\n",
        "    significant and monotonically increasing, which means after\n",
        "    controlling effects of gender, income still increases as education\n",
        "    increases.\n",
        "\n",
        "Column (3) shows regression with gender and age group.\n",
        "\n",
        "-   We can see that people tend to earn more as they move on to later\n",
        "    stages of their career.\n",
        "-   Once again, the positive and monotonically increasing coefficients\n",
        "    of education show the positive correlation between education and\n",
        "    incomes. Since after we include other factors that may affect\n",
        "    incomes, the coefficients of education are still positive and\n",
        "    stable, we have stronger evidence that income increase as education\n",
        "    increases.\n",
        "\n",
        "# Going Further: Long-term Effect of Education\n",
        "\n",
        "From previous results, we find that incomes and wealth increase as\n",
        "education increases. The next interesting question is whether there is\n",
        "long-term effects of education on career outcomes. In order to study\n",
        "this question, we explore returns to education for different age groups.\n",
        "It will be great that if we can study this question via panel data, but\n",
        "since we do not observe the same person across time periods, we study\n",
        "returns to education from different age groups.\n",
        "\n",
        "We perform Welch two sample t-test for each age group. We study the\n",
        "difference between university degree holders and non-university\n",
        "post-secondary degree holders. We compare outcomes of age group 30s and\n",
        "50s, which represents early stage and mature stage of career\n",
        "respectively.\n",
        "\n",
        "We first study education impacts on incomes. For both groups,\n",
        "differences of incomes between university and non-university are\n",
        "significantly different from 0. Surprisingly, the income gap of 50s is\n",
        "about 3 times as the gap when they are in 30s.\n",
        "\n",
        "As for wealth, wealth gaps between the two education groups increase\n",
        "even more when people enter a mature stage of their career. The wealth\n",
        "gap of 50s is more than 4 times as the gap when they are in 30s."
      ],
      "id": "23d2a864-7594-4d04-a0a3-8f3ad90b78f2"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#Returns to education: University\n",
        "\n",
        "#30s\n",
        "\n",
        "retU = t.test(\n",
        "       x = filter(SFS_data, (agegr == \"30s\" & education=='Non-university post-secondary'))$income_before_tax,\n",
        "       y = filter(SFS_data, (agegr == \"30s\" & education=='University'))$income_before_tax,\n",
        "       alternative = \"two.sided\",\n",
        "       mu = 0,\n",
        "       conf.level = 0.95)\n",
        "\n",
        "retU\n",
        "round(retU$estimate[2] - retU$estimate[1],2)\n",
        "\n",
        "#50s\n",
        "\n",
        "retUF = t.test(\n",
        "       x = filter(SFS_data, (agegr == \"50s\" & education=='Non-university post-secondary'))$income_before_tax,\n",
        "       y = filter(SFS_data, (agegr == \"50s\" & education=='University'))$income_before_tax,\n",
        "       alternative = \"two.sided\",\n",
        "       mu = 0,\n",
        "       conf.level = 0.95)\n",
        "\n",
        "retUF\n",
        "round(retUF$estimate[2] - retUF$estimate[1],2)"
      ],
      "id": "ed192d76-959b-4d84-8a58-6e5889229dbc"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#Returns to education: University\n",
        "\n",
        "#30s\n",
        "\n",
        "retU = t.test(\n",
        "       x = filter(SFS_data, (agegr == \"30s\" & education=='Non-university post-secondary'))$wealth,\n",
        "       y = filter(SFS_data, (agegr == \"30s\" & education=='University'))$wealth,\n",
        "       alternative = \"two.sided\",\n",
        "       mu = 0,\n",
        "       conf.level = 0.95)\n",
        "\n",
        "retU\n",
        "round(retU$estimate[2] - retU$estimate[1],2)\n",
        "\n",
        "#50s\n",
        "\n",
        "retUF = t.test(\n",
        "       x = filter(SFS_data, (agegr == \"50s\" & education=='Non-university post-secondary'))$wealth,\n",
        "       y = filter(SFS_data, (agegr == \"50s\" & education=='University'))$wealth,\n",
        "       alternative = \"two.sided\",\n",
        "       mu = 0,\n",
        "       conf.level = 0.95)\n",
        "\n",
        "retUF\n",
        "round(retUF$estimate[2] - retUF$estimate[1],2)"
      ],
      "id": "0dcc5a7b-cd98-4981-a5ad-baff83e67aa8"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Summary\n",
        "\n",
        "In this project, we study the relationship between education and career\n",
        "outcomes. We use income and wealth as proxies for labor market outcomes.\n",
        "\n",
        "We first describe the income and wealth distributions of Canadian\n",
        "households, and we find both income inequality and wealth inequality in\n",
        "Canada.\n",
        "\n",
        "Next, we study the relationship between education and career outcomes\n",
        "by exploring correlations, two-sample t-tests and regressions. Our\n",
        "results suggest a significant, positive correlation between education\n",
        "and both income and wealth: income and wealth increase as education\n",
        "increases. This suggests that education helps us achieve a more\n",
        "successful career. Meanwhile, the standard deviations of income and\n",
        "wealth also increase as education increases. This implies that income\n",
        "and wealth inequality increase as education increases.\n",
        "\n",
        "Finally, we explore the long-term effects of education by studying the\n",
        "returns to education for different age groups. We find that the income\n",
        "and wealth gaps between university degree holders and their\n",
        "non-university post-secondary counterparts widen dramatically in the\n",
        "later stages of their careers. This result implies that education has a\n",
        "long-term effect on income and wealth.\n",
        "\n",
        "To sum up, from a personal point of view, not only is obtaining a degree\n",
        "important in order to have a successful career, but career development\n",
        "is also crucial. From a policy perspective, we need policies that target\n",
        "pre-labor market skill accumulation, as well as policies that foster\n",
        "career progression to effectively reduce inequality.\n",
        "\n",
        "## List of R Commands"
      ],
      "id": "3c389c6a-6a67-4093-a23b-23cdd9f8a6e9"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# 'Commands about Dataframe'\n",
        "# \n",
        "# read_dta(file) #read dta files\n",
        "# \n",
        "# !is.na(data) #return TRUE for each element of data that is not NA\n",
        "# \n",
        "# filter(dataframe, conditions) #return the rows of a dataframe that satisfy the conditions, e.g.\n",
        "# SFS_data <- filter(SFS_data, !is.na(SFS_data$pefmtinc)) #keep only the rows where pefmtinc is not NA\n",
        "# \n",
        "# subset(dataframe, conditions) #select the subset of a dataframe that satisfies the conditions, e.g.\n",
        "# SFS_data <- subset(SFS_data, pefmjsif == \"02\" | pefmjsif == \"03\") #keep only the rows whose major income source is salary or self-employment income\n",
        "# \n",
        "# rename(dataframe, new name = old name) #rename variables in dataframe, e.g.\n",
        "# SFS_data <- rename(SFS_data, income_after_tax = pefatinc)\n",
        "# \n",
        "# dataframe[order(data),] #sort dataframe according to a variable, e.g.\n",
        "# SFS_data <- SFS_data[order(SFS_data$education),]\n",
        "# \n",
        "# mutate(new variable = [operation] existing variables) #adds new variables and preserves existing ones\n",
        "# \n",
        "# case_when(conditions) #vectorise multiple if_else() statements\n",
        "# \n",
        "# #e.g.\n",
        "# SFS_data <- \n",
        "#         SFS_data %>%\n",
        "#         mutate(agegr = case_when(\n",
        "#               age == \"01\" ~ \"Under 30\", #under 20\n",
        "#               age == \"02\" ~ \"Under 30\", #20-24\n",
        "#               age == \"03\" ~ \"20s\", #25-29\n",
        "#               age == \"04\" ~ \"30s\",\n",
        "#               age == \"05\" ~ \"30s\",\n",
        "#               age == \"06\" ~ \"40s\",\n",
        "#               age == \"07\" ~ \"40s\",\n",
        "#               age == \"08\" ~ \"50s\",\n",
        "#               age == \"09\" ~ \"50s\",\n",
        "#               age == \"10\" ~ \"60s\", #60-64\n",
        "#               age == \"11\" ~ \"Above 65\", #65-69\n",
        "#               age == \"12\" ~ \"Above 65\", #70-74\n",
        "#               age == \"13\" ~ \"Above 75\", #75-79\n",
        "#               age == \"14\" ~ \"Above 75\", #80 and above\n",
        "#               ))\n",
        "# #create a new variable named `agegr` based on the variable `age`, e.g. if age==\"01\", agegr will be \"Under 30\".\n",
        "# \n",
        "# data.frame(variable=c(...),...) #create dataframe which contains variable, and we can define what's in the column with c()\n",
        "# #e.g.\n",
        "# df <- data.frame(variables=c('income before tax','income after tax','wealth'),\n",
        "#                  mean=round(c(mean(SFS_data$income_before_tax),mean(SFS_data$income_after_tax),mean(SFS_data$wealth)),2),\n",
        "#                  median=round(c(median(SFS_data$income_before_tax),median(SFS_data$income_after_tax),median(SFS_data$wealth)),2),\n",
        "#                  sd=round(c(sd(SFS_data$income_before_tax),sd(SFS_data$income_after_tax),sd(SFS_data$wealth)),2))\n",
        "# #create a dataframe to contain variable names, means, medians and standard deviations.\n",
        "# \n",
        "# select(dataframe, variables) #select columns from a dataframe, e.g.\n",
        "# cor_data<-select(SFS_data, c('education','income_before_tax','income_after_tax','wealth'))\n",
        "# \n",
        "# length(data) #length of the vector\n",
        "# \n",
        "# group_by(variable) #takes an existing tbl and converts it into a grouped tbl where operations are performed \"by group\"\n",
        "# \n",
        "# summarize(variables) #creates a new data frame with one row per group, holding the requested summary statistics\n",
        "# \n",
        "# #e.g.\n",
        "# results <- \n",
        "#     SFS_data %>% \n",
        "#     group_by(education) %>%\n",
        "#     summarize(m_income = mean(income_before_tax), sd_income = sd(income_before_tax),\n",
        "#               m_wealth = mean(wealth), sd_wealth = sd(wealth))\n",
        "# #convert SFS_data to a grouped table, grouped by `education`, then make a new table with the mean and standard deviation of income and wealth for each group"
      ],
      "id": "d766adec-bf60-4d1e-b020-95829bcbf19f"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#'Commands about Types'\n",
        "\n",
        "# as.numeric(data) #transform strings to numbers, e.g.\n",
        "# SFS_data$education <- as.numeric(SFS_data$education)\n",
        "# \n",
        "# as.character(data) #transform numbers to strings, e.g.\n",
        "# SFS_data$education <- as.character(SFS_data$education)\n",
        "# \n",
        "# as_factor(data) #transform strings to factor variables, e.g.\n",
        "# SFS_data$gender <- as_factor(SFS_data$gender)\n",
        "# \n",
        "# factor(factor variable,levels = c(\"A\", \"B\", \"C\", \"D\", \"E\")) #order factor variable, e.g.\n",
        "# SFS_data$agegr <- factor(SFS_data$agegr,levels = c(\"20s\", \"30s\", \"40s\", \"50s\", \"60s\"))"
      ],
      "id": "ec4a365f-4d2d-4878-8ad2-43e08165146e"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#'Commands about Calculation'\n",
        "\n",
        "# round(data, number of digits) #round numbers to have certain number of digits\n",
        "# \n",
        "# sqrt(n) #square root of n"
      ],
      "id": "8b4fa090-71a5-44bc-b7be-20255d9a4401"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#'Commands about Summary Statistics'\n",
        "\n",
        "# mean(data) #mean of data\n",
        "# \n",
        "# median(data) #median of data\n",
        "# \n",
        "# sd(data) #standard deviation of data \n",
        "# \n",
        "# quantile(data, vector of probabilities) #produces sample quantiles corresponding to the given probabilities, e.g.\n",
        "# pct2_income_before <- quantile(SFS_data$income_before_tax,c(0.02,0.98),type=1)\n",
        "# #sample quantiles of 0.02 and 0.98 of income_before_tax\n",
        "# \n",
        "# cor(data) #correlations of variables\n",
        "# \n",
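        "# qt(p = 0.05, df = df) #return the 5th percentile of the t distribution with df degrees of freedom, i.e. the lower critical value for a one-sided test at the 5% level; use p = 0.975 for a two-sided 95% confidence interval"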
      ],
      "id": "79bde1eb-e373-4fcb-9e5b-d31791cd96b1"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#'Commands about Plots'\n",
        "\n",
        "# ggplot(data,options) #create plots, e.g.\n",
        "# \n",
        "# age_plot<-ggplot(SFS_data, aes(x=factor(agegr))) + geom_bar(fill=\"lightblue\") + xlab(\"Age Groups\") + theme(axis.text.x = element_text(angle = 90))\n",
        "# #create a bar plot for the factor variable `agegr`, filled in light blue, with the x-axis labels rotated to vertical\n",
        "# \n",
        "# income_before_tax_plot<-ggplot(SFS_data, aes(x=income_before_tax)) + geom_histogram()  + xlab(\"Income before Tax\") + ylab(\"Number of Observations\")\n",
        "# #create histogram for `income_before_tax`, with labels of x-axis and y-axis\n",
        "# \n",
        "# before_after_tax<-ggplot(SFS_data, aes(x = income_before_tax, y = income_after_tax)) + geom_point(shape = 1) + geom_smooth(method = lm)\n",
        "# #create scatter plot with a regression line\n",
        "# \n",
        "# ggarrange(plots, ncol=number of columns, nrow = number of rows) #arrange plots (from the ggpubr package), e.g.\n",
        "# ggarrange(income_before_tax_plot,income_after_tax_plot,  ncol = 2, nrow = 1) # arrange 2 plots in the same row\n",
        "# \n",
        "# corrplot(mydata.cor) #plot a correlation matrix (from the corrplot package)"
      ],
      "id": "3e6ddd21-539d-4b7d-968e-33d937404432"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#'define a function named `CI`, input `data`, output `bound`'\n",
        "\n",
        "# CI <- function(data) {\n",
        "#     ...\n",
        "#     return(bound)\n",
        "# }"
      ],
      "id": "3c8b5042-a55a-43fa-aff7-45e1cd11dfd6"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#'Commands about Tests'\n",
        "\n",
        "# t.test(x, y, alternative = c(\"two.sided\", \"less\", \"greater\"),\n",
        "#        mu = 0, paired = FALSE, var.equal = FALSE,\n",
        "#        conf.level = 0.95, …) #Performs one and two sample t-tests on vectors of data.\n",
        "# \n",
        "# cor.test(x, y, use=\"complete.obs\") #Test for association between paired samples, using Pearson's product-moment correlation coefficient by default"
      ],
      "id": "de0687af-81e1-4223-b894-e5af671d027e"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#'Commands about Regressions'\n",
        "\n",
        "# lm(dependent variable ~ independent variable, data) #run a regression\n",
        "# \n",
        "# stargazer(regression1, regression2, regression3, ...) #show several regression results in a table"
      ],
      "id": "8dcdde13-c8b3-4de2-a076-1111f1d13834"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Quick Note on Commands\n",
        "\n",
        "It’s important to remember that you may need to use these commands in\n",
        "ways that aren’t specified above. For example, you may want to use the\n",
        "`t.test` function with a different confidence level, so you may set\n",
        "`conf.level = 0.99`. You may also have to add new parameters depending\n",
        "on what you are trying to accomplish. For example, to run a linear\n",
        "regression with a subset of the data, you would need to add a `subset`\n",
        "parameter to the `lm` function. To discover what each function can do,\n",
        "check the R documentation for a detailed list of the parameters and\n",
        "default values it accepts.\n",
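        "\n",
        "As a rough illustration of both adjustments, the sketch below uses a\n",
        "small simulated data frame (`toy`, with made-up `income`, `education`\n",
        "and `agegr` columns) rather than the SFS data. The names and numbers are\n",
        "assumptions chosen only for this example.\n",
        "\n",
        "```r\n",
        "# Simulated data for illustration only (hypothetical, not the SFS data)\n",
        "set.seed(325)\n",
        "toy <- data.frame(\n",
        "  education = rep(c(\"Non-university post-secondary\", \"University\"), each = 50),\n",
        "  agegr     = rep(c(\"30s\", \"50s\"), times = 50),\n",
        "  income    = c(rnorm(50, mean = 60000, sd = 15000),\n",
        "                rnorm(50, mean = 80000, sd = 20000))\n",
        ")\n",
        "\n",
        "# Two-sample t-test with a 99% confidence interval instead of the default 95%\n",
        "tt99 <- t.test(income ~ education, data = toy, conf.level = 0.99)\n",
        "tt99$conf.int\n",
        "\n",
        "# Linear regression on a subset of the data via the `subset` argument of lm()\n",
        "reg50 <- lm(income ~ education, data = toy, subset = agegr == \"50s\")\n",
        "summary(reg50)\n",
        "```\n",
        "\n",
        "The `subset` expression is evaluated inside `data`, so column names such\n",
        "as `agegr` can be used directly without the `toy$` prefix.\n",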
        "\n",
        "https://www.rdocumentation.org/"
      ],
      "id": "eb8c8c85-7162-4a35-bb67-5c3d11aa279a"
    }
  ],
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {
    "kernelspec": {
      "name": "ir",
      "display_name": "R",
      "language": "r"
    }
  }
}