{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 06 - Conducting Within Group Analysis\n", "\n", "Marina Adshade, Paul Corcuera, Giulia Lo Forte, Jane Platt \n", "2024-05-29\n", "\n", "## Prerequisites\n", "\n", "1. Inspect and clean the variables of a data set.\n", "2. Generate basic variables for a variety of purposes.\n", "\n", "## Learning Outcomes\n", "\n", "1. Use `arrange`, `group_by`, `group_keys`, and `ungroup` to sort and\n", " organize data for specific purposes.\n", "2. Generate variables with `summarise` to analyze patterns within\n", " groups of data.\n", "3. Reshape data frames using `pivot_wider` and `pivot_longer`.\n", "\n", "## 6.1 Key Functions for Group Analysis\n", "\n", "When we are working on a particular project, it is often quite important\n", "to know how to summarize data for specific groupings, whether of\n", "variables or observations meeting specific conditions. In this notebook,\n", "we will look at a variety of functions for conducting this group-level\n", "analysis. We will rely heavily on the `dyplr` package, which we have\n", "implicitly imported through the `tidyverse` package. Let’s import these\n", "packages and load in our “fake_data” now. Recall that this data set is\n", "simulating information of workers in the years 1982-2012 in a fake\n", "country where a training program was introduced in 2003 to boost their\n", "earnings." ], "id": "0d4a3cb5-e70e-403e-b4a6-a5b6349ec1ea" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "library(haven)\n", "library(tidyverse)\n", "library(IRdisplay)\n", "\n", "fake_data <- read_dta(\"../econ490-stata/fake_data.dta\") # change me!" ], "id": "244b66fc-2c7f-43e7-a470-69cfc106bb27" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we’ve loaded in our data and already know how to view it, clean\n", "it, and generate additional variables for it as needed, we can look at\n", "some helpful commands for grouping this data.\n", "\n", "### 6.1.1 `arrange`\n", "\n", "Before grouping data, we may want to order our data set based on the\n", "values of a particular variable. The `arrange` function helps us achieve\n", "this. It takes in a data frame and variable and rearranges our data\n", "frame in ascending order of the values of that variable, with the option\n", "to arrange in descending order requiring a further `desc` function. As\n", "an example, let’s rearrange our entire data set in order of the variable\n", "*year*." ], "id": "45d16e68-3892-4d60-b01d-840248f1b354" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# arrange the data frame by ascending year\n", "fake_data %>% arrange(year)\n", "\n", "# arrange the data frame by descending year\n", "fake_data %>% arrange(desc(year))" ], "id": "1b70bca9-40d4-44bd-85ec-c6a1dfa36c73" }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also pass multiple variable parameters to the `arrange` function\n", "to indicate how we should further sort our data within each year\n", "grouping. For instance, including the *region* variable will further\n", "sort each year grouping in order of region." ], "id": "438e8012-e3f3-4a27-94b8-8c28164f8f5a" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fake_data %>% arrange(year, region)" ], "id": "5a875c50-975b-4df9-aa51-85d1da0f0043" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.1.2 `group_by`\n", "\n", "This is one of the most pivotal functions in R. It allows us to group a\n", "data frame by the values of a specific variable and perform further\n", "operations on those groups. Let’s say that we wanted to group our data\n", "set by *region* and count the number of observations in each region. To\n", "accomplish this, we can simply pass this variable as a parameter to our\n", "`group_by` function and further pipe this result into the `tally`\n", "function (which counts the number of observations)." ], "id": "9b167ffc-396f-4a07-8d9d-8b0483f12c3b" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fake_data %>% group_by(region) %>% tally()" ], "id": "652736b3-f302-4879-a559-dc7a43fc6f92" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how the `group_by` function nicely groups the regions in\n", "ascending order for us automatically. Unlike with the `arrange`\n", "function, it does not preserve the data set in its entirety. It instead\n", "collapses our data set into groups, thus it is important not to redefine\n", "our “data” data frame by this `group_by` if we want to preserve our\n", "original data.\n", "\n", "We can also pass multiple arguments to `group_by`. If we pass both\n", "*region* and *treated* to our function as inputs, our region groups will\n", "be further grouped by observations which are and are not treated. Let’s\n", "count the number of treated and untreated observations in each region." ], "id": "3dbec210-3c92-4732-be74-de6ce2466f6f" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fake_data %>% group_by(region, treated) %>% tally()" ], "id": "2c70f220-7e62-4bd4-b910-669ba1414ff0" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can pipe a `group_by` object into another `group_by` object.\n", "In this case, the second `group_by` will simply overwrite the first. For\n", "example, if we wanted to pass our original *region* `group_by` into a\n", "mere *treated* `group_by`, we get as output a data frame counting the\n", "total number of observations that are treated and untreated." ], "id": "f717921c-837c-46c1-bfac-cd39ba3762d8" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fake_data %>% group_by(region) %>% group_by(treated) %>% tally()" ], "id": "07aca3e8-4cb9-4ffc-ba73-96232a2de7b5" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.1.3 `group_keys`\n", "\n", "This function allows us to see the specific groups for a `group_by` data\n", "frame we have created. For instance, if we wanted to see every year in\n", "the data, we could group by *year* and then apply the `group_keys`\n", "function." ], "id": "e344fc5d-9a08-4a31-a662-424cf2b3133b" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fake_data %>% group_by(year) %>% group_keys()" ], "id": "85d578a3-5279-406a-82cc-a0a3f7e3c8e6" }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is equivalent to using the `unique` function directly on a column\n", "of our data set. The `unique` function lists all unique values for a\n", "specified list." ], "id": "fec876c6-6d74-4541-a2cd-cdfbcda97dcc" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "unique(fake_data$year)" ], "id": "703df2b1-2f22-4cbd-b920-244a7635c553" }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output is just a list in this case, instead of another data frame as\n", "above.\n", "\n", "### 6.1.4 `ungroup`\n", "\n", "We can even selectively remove grouping variables from a grouped data\n", "frame. Say we grouped by *region* and *treated* but then wanted to just\n", "count how many treated groups there are by *region*. If this double\n", "grouped data frame is defined as A, we can simply use `ungroup` A to\n", "remove the grouping by treatment status." ], "id": "511af4cf-8ba4-4a70-b933-4dcecfe86863" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "A <- fake_data %>% group_by(region, treated) %>% tally()\n", "A" ], "id": "e0b601e3-4e8d-4629-bf3e-bcfbe0ccaa87" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "A %>% ungroup(treated) %>% tally()" ], "id": "6f12a655-89e6-4152-98bb-7ca5b40ba2e2" }, { "cell_type": "markdown", "metadata": {}, "source": [ "We may also be interested in knowing how many groupings we have created.\n", "We can remove all grouping variables by leaving the input of `ungroup()`\n", "empty." ], "id": "a3a1bcda-3686-438e-9465-2f541017b7d0" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "A %>% ungroup() %>% tally()" ], "id": "29e7f8f6-7355-48cb-b240-c198c196aad7" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6.2 Generating Variables for Group Analysis\n", "\n", "We have already seen how to redefine and add new variables to a data\n", "frame using the `df$ <-` format. We have also seen how to use the\n", "`mutate` function to add new variables to a data frame. However, we\n", "often want to add new variables to grouped data frames to display\n", "information about the different groups rather than different\n", "observations in the original data frame. That is where `summarise` comes\n", "in.\n", "\n", "The `summarise` function gives us access to a variety of common\n", "functions we can use to generate variables corresponding to groups. For\n", "instance, we may want to find the mean earnings of each region. To do\n", "this, we can group on *region* and then add a variable to our grouped\n", "data frame which aggregates the mean of the *earnings* variable for each\n", "region group. We must use the `summarise` function for this, since it\n", "gives us access to the earnings of every member of each group." ], "id": "e84514c4-a5d0-4757-b79c-c60f7cc76c4f" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fake_data %>% group_by(region) %>% summarise(meanearnings = mean(earnings))" ], "id": "82192933-79a8-4f36-8885-9749b016182d" }, { "cell_type": "markdown", "metadata": {}, "source": [ "We may want more detailed information about each region. We can pass a\n", "series of parameters to `summarise` and it will generate variables for\n", "all of these requests. Let’s say we want the mean and standard deviation\n", "of *earnings* for each group, as well as the range of *earnings* for\n", "each group." ], "id": "1ad12936-dae0-4053-90d9-f0cd268aa5b7" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fake_data %>% \n", " group_by(region) %>% \n", " summarise(meanearnings = mean(earnings), stdevearnings = sd(earnings), range = max(earnings) - min(earnings))" ], "id": "dac8344a-ac20-43f9-a4dc-6b8ad397ea37" }, { "cell_type": "markdown", "metadata": {}, "source": [ "We may also want to calculate the number of observations in each region\n", "as an additional variable. Before, we could simply group by our *region*\n", "variable and then immediately apply the `tally` function. However, now\n", "that we have defined a series of other variables, our data set on which\n", "`tally` operates is different. Watch what happens when we try to use\n", "`tally` after using `summarise`." ], "id": "9920c33b-24e6-4aad-b386-122ffcf0a324" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fake_data %>% \n", " group_by(region) %>% \n", " summarise(meanearnings = mean(earnings), stdevearnings = sd(earnings), range = max(earnings) - min(earnings)) %>%\n", " tally()" ], "id": "9043ffd5-ba7c-4074-8bbe-66305e395fe0" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now watch what happens when we try to use `tally` before using\n", "`summarise`." ], "id": "fd00cff1-4398-45f0-af50-e4ba319a1faa" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fake_data %>% \n", " group_by(region) %>% \n", " tally() %>%\n", " summarise(meanearnings = mean(earnings), stdevearnings = sd(earnings), range = max(earnings) - min(earnings))" ], "id": "9437d563-2f3b-4667-a63c-1d451740e331" }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the first case, tally does not have the necessary information left in\n", "the data frame to count the number of observations in each region. In\n", "the second case, tally has shrunk the data frame so much that the\n", "functions within `summarise` do not have the necessary information to\n", "make their calculations.\n", "\n", "This is where `n` comes in. This is a special function used within the\n", "`summarise` variable. It represents the number of observations within\n", "each group of a data frame. As such, it is directly paired with\n", "`group_by`, although it can be paired with `mutate` when we are working\n", "with the number of observations in a data set as a whole (i.e. with one\n", "group, meaning `n` represents the position of each observation)." ], "id": "615ebf98-08ce-4f48-86dc-795e088f0725" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fake_data %>% \n", " group_by(region) %>% \n", " summarise(meanearnings = mean(earnings), stdevearnings = sd(earnings), range = max(earnings) - min(earnings), total = n())" ], "id": "182c2b99-251a-4a29-9459-2cdbf6879cc5" }, { "cell_type": "markdown", "metadata": {}, "source": [ "The entire process of generating variables for group analysis in this\n", "section is similar to collapsing a data set in Stata. Luckily, it can be\n", "done more quickly here in R.\n", "\n", "## 6.3 Reshaping Data\n", "\n", "Sometimes in our process of data analysis, we want to restructure our\n", "data frame. To do this, we can take advantage of a series of functions\n", "within the `tidyr` package that we have imported implicitly through\n", "loading in the `tidyverse` package. These functions allow us to quickly\n", "change the format of our data frame without having to redefine all of\n", "its columns and rows manually.\n", "\n", "For instance, we often want to transform our data from “wide”\n", "(cross-sectional) to “long” (panel) format, or vice versa. Suppose that\n", "we wish to make our data set more “cross-sectional” in appearance by\n", "dropping the age variable and adding an earnings variable for each year,\n", "with the values in these new columns corresponding to the earnings of\n", "each person in that year. Effectively, by adding columns, we are making\n", "our data set “wider”, so it is no surprise that the function is called\n", "`pivot_wider`.\n", "\n", "It takes the following arguments:\n", "\n", "1. `names_from`: which columns to get the name of the output columns\n", " (in our example, `year`);\n", "2. `values_from`: which columns to get the cell values from (in our\n", " example, `earnings`)." ], "id": "7e21e208-5ed6-463d-b15e-4b15bdb0b118" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wide_data <- fake_data %>% arrange(year) %>% select(-age) %>% pivot_wider(names_from = \"year\", values_from = \"earnings\")\n", "head(wide_data)" ], "id": "f2f63953-7e67-4c25-9930-5d572aab43f8" }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the function above took the values from *year* and\n", "generated a new variable for each of them from 1982 to 2012, then\n", "supplied the corresponding values from *earnings* to each of these year\n", "variables. When a worker’s information isn’t recorded for a given year\n", "(and thus they have no recorded wage), the *earnings* variable is marked\n", "as missing.\n", "\n", "We can pivot more than one variable. Instead of pivoting only the\n", "variable *year*, we can pivot both the variables *year* and *age*. We do\n", "so by specifying both variables in the `values_from` argument." ], "id": "fdf675f9-443b-4847-a4dc-2acf098b22db" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fake_data %>% arrange(year) %>% pivot_wider(names_from = \"year\" , values_from = c(\"earnings\",\"age\"))" ], "id": "68048c3f-f706-4dff-8c8b-3436d11df3b1" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now suppose we want to work backward and transform this data set back\n", "into its original, “longer” shape (just now without the *age* variable).\n", "To do this, we can invoke the complementary `pivot_longer` function. The\n", "arguments we need to specify are:\n", "\n", "1. `cols`: the name of the columns we want to pivot to longer format\n", " (in our case, `'1995':'2011'`);\n", "2. `names_to`: the name of the new column that will be created from the\n", " information stored in the column names specified by `cols` (in our\n", " case, `\"year\"`);\n", "3. `values_to`: the name of the column to create from the data stored\n", " in cell values, `\"earnings\"`." ], "id": "bf3f94e1-b03e-47b4-ae9e-a156ea8ff19b" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_data <- wide_data %>% pivot_longer(cols = '1995':'2011', names_to = \"year\", values_to = \"earnings\")\n", "head(long_data)" ], "id": "15baae0e-dd4c-47d0-b164-74144f4465ea" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember that, when going from long to wide format, we created several\n", "missing values every time a worker information for a given year was not\n", "available. Now that we transform our data back from wide to long format,\n", "we may carry with us all those missing values we had created.We can ask\n", "R to automatically exclude them, by adding the option\n", "`values_drop_na = TRUE`." ], "id": "ed2f7e11-c0c2-48e4-a1b1-3581bc310829" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_data_short <- wide_data %>% pivot_longer(cols = '1995':'2011', names_to = \"year\", values_to = \"earnings\", values_drop_na = TRUE)\n", "head(long_data_short)" ], "id": "904ef4a6-5543-4a0c-a339-c66e8e6250b6" }, { "cell_type": "markdown", "metadata": {}, "source": [ "If this doesn’t seem intuitive or quickly comprehensible, don’t worry.\n", "Even many experienced coders struggle with the pivoting/reshaping\n", "functionality. With practice, it will become much more digestible!\n", "\n", "## 6.4 Common Mistakes\n", "\n", "It is easy to forget that `group_by()` creates a new data frame with a\n", "limited number of variables.\n", "\n", "Suppose we want to compute average earnings by region and treated\n", "status. We may try to do something like the following:" ], "id": "7dce2183-2d7d-4eb3-9c7a-5fd04b241c69" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "step1 <- fake_data %>%\n", " group_by(region) %>%\n", " summarise(meanearnings = mean(earnings))\n", "\n", "step2 <- step1 %>%\n", " group_by(treated) %>%\n", " summarise(meanearnings = mean(earnings))" ], "id": "b265d39c-0580-4d85-9ead-f18a96e3ea4c" }, { "cell_type": "markdown", "metadata": {}, "source": [ "This results in an error: the first `group_by` creates a new data frame\n", "that does not contain the variable *treated* anymore. We can see that\n", "also by looking at the error message: *column ‘treated’ is not found*.\n", "\n", "The right way of doing what we wanted is as follows:" ], "id": "ad73f393-e845-462e-ad4a-7a5b43ae5fb7" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fake_data %>% \n", " group_by(region, treated) %>% \n", " summarise(meanearnings = mean(earnings))" ], "id": "01faa0bd-ce19-4f21-9454-bea1c78584cd" }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we move from wide to long format, or vice versa, the variables that\n", "we *do not* pivot should remain constant over the variable that we pivot\n", "(namely, the variable we use in the `names_from` argument).\n", "\n", "Consider the example below. It is similar to what we did above but it\n", "has a crucial difference; can you spot it?" ], "id": "68756a2a-5d9f-4e5f-878d-227adef3a273" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fake_data %>% arrange(year) %>% pivot_wider(names_from = \"year\", values_from = \"earnings\")" ], "id": "80fb0219-2c05-42dc-96f4-f3517867dcbb" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Earlier we dropped the variable *age*, while now we are keeping it. The\n", "variable *age* now is treated as if it was constant during *year*, the\n", "variable we are using for pivoting the data.\n", "\n", "This is not necessarily a mistake, and in fact R allows us to do the\n", "reshape. However, it changes the way in which we interpret *age*: it is\n", "now the age of the worker in their first year of appearance in the\n", "dataset.\n", "\n", "## 6.5 Wrap Up\n", "\n", "Being able to generate new variables and modify a data set to suit your\n", "specific research is pivotal. Now you should hopefully have more\n", "confidence in your ability to perform these tasks. Next, we will explore\n", "the challenges posed by working with multiple data sets at once.\n", "\n", "## 6.6 Wrap-up Table\n", "\n", "| Function | Description |\n", "|-------------------------------|-----------------------------------------|\n", "| `arrange` | It orders observations based on the ascending or descending order of one or more variables. |\n", "| `group_by` | It groups observations based on the values of one or more variables. It may be combined with `summarise` to compute summary statistics by group. |\n", "| `ungroup` | It removes one or more grouping variables. |\n", "| `pivot_wider` | It pivots data from long to wide format. |\n", "| `pivot_longer` | It pivots data from wide to long format. |" ], "id": "f1f10556-97c0-43a0-ae8e-6bdcaf3fee62" } ], "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "ir", "display_name": "R", "language": "r" } } }