{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 05 - Opening Datasets\n", "\n", "Marina Adshade, Paul Corcuera, Giulia Lo Forte, Jane Platt \n", "2024-05-29\n", "\n", "## Prerequisites\n", "\n", "1. Understand how to effectively use Stata do files and know how to\n", " generate log files.\n", "2. Run basic Stata commands such as `help`, `describe`, `summarize`,\n", " `for` and `while`.\n", "3. Know how to use macros in writing Stata commands.\n", "\n", "## Learning Outcomes\n", "\n", "1. Understand how to use `clear` at the beginning of do-files.\n", "2. Know how to change directories so that Stata can find relevant\n", " files.\n", "3. Import datasets in csv and excel formats.\n", "4. Import datasets in dta format.\n", "5. Save data files.\n", "\n", "## 5.0 Intro" ], "id": "b57db9cb-3283-4320-b0ed-7ac19d6f3006" }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import stata_setup\n", "stata_setup.config('C:\\Program Files\\Stata18/','se')" ], "id": "7d80fe63" }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ ">>> import sys\n", ">>> sys.path.append('/Applications/Stata/utilities') # make sure this is the same as what you set up in Module 01, Section 1.3: Setting Up the STATA Path\n", ">>> from pystata import config\n", ">>> config.init('se')" ], "id": "114097cb" }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this repository, there is a folder named “data”, with a sub-folder\n", "named “raw”. In that sub-folder, there are two different versions of the\n", "same data set: “fake_data.csv” and “fake_data.dta”. The data sets\n", "simulate information from workers in the years 1982-2012 in a fake\n", "country where, in 2003, a policy was enacted that allowed some workers\n", "to enter a training program with the purpose of boosting their earnings.\n", "We will be using this data set to learn how to explore and manipulate\n", "real-world data sets.\n", "\n", "## 5.1 Clearing the Workspace\n", "\n", "Do-files should begin with a command that clears the previous work that\n", "has been open in Stata. This makes sure that:\n", "\n", "1. We do not waste computer memory on things other than the current\n", " project.\n", "2. Whatever result we obtain in the current session truly belongs to\n", " that session.\n", "\n", "We can clear the workspace of many different things (see `help clear` if\n", "needed). For the purpose of this lecture, the most comprehensive thing\n", "to do is to run the following:" ], "id": "864b7f74-1c11-4778-8f3a-2269db1c08ff" }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "%%stata\n", "clear *" ], "id": "420e7771" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that `*` selects everything.\n", "\n", "## 5.2 Changing Directories\n", "\n", "Before we get started on importing data into Stata, it is useful to know\n", "how to change the folder that Stata accesses whenever we run a command\n", "that either opens or saves a file. Once we instruct Stata to change the\n", "directory to a specific folder, from that point onward it will open\n", "files from that folder and save all files to that folder, including data\n", "files, do-files, and log-files. Stata will continue to do this until\n", "either the program is closed or we change to another directory. This\n", "means that every time we open Stata, we need to change the directory to\n", "the one we want to use.\n", "\n", "Having a main directory with sub-directories is extremely important for\n", "research. When we are doing large research projects that use many\n", "different data sets from different sources as well as do-files and\n", "log-files that perform different tasks, it can be easy to lose track of\n", "what is where. For that reason, we will want to create different folders\n", "for many of our different steps. We will go into more detail about this\n", "in [Module\n", "18](https://comet.arts.ubc.ca/docs/Research/econ490-stata/18_Wf_Guide2.html),\n", "but it is a good idea to get into the habit of doing this as soon as you\n", "start a project.\n", "\n", "We will begin by using the `pwd` command to view the current working\n", "directory." ], "id": "a4a7ba93-f326-4f6c-add7-49fc4e734352" }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "%%stata\n", "pwd" ], "id": "37d60561" }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** We write the directory path within quotation marks to make\n", "sure Stata interprets this as a single string of words. If we don’t do\n", "this, we may encounter issues with folders that include blank spaces.\n", "\n", "Now let’s change the directory to the specific location where we saved\n", "the fake_data file using the command below. We can change our workspace\n", "to a directory named “some_folder/some_sub_folder” by writing\n", "`cd \"some_folder/some_sub_folder\"`.\n", "\n", "Use the space below to do this on your own computer." ], "id": "7b0f9cdf-7f90-4ed8-9ddd-748f47a84cb2" }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "%%stata\n", "\n", "cd \" \" \n", "* type your file path to the folder containing the data between the quotation marks in the line above" ], "id": "0a9da062" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that once we change directories, Stata outputs the full name of\n", "the directory where we are currently working.\n", "\n", "One trick to using `cd` is that we can use periods (.) to move back\n", "folders: two period to move back one folder, three periods to move back\n", "two folders, etc. Try the command below to compare the folder Stata is\n", "now directed to with the command above. You can repeat this using two\n", "periods." ], "id": "4f67d5ee-3da2-424a-be57-ee609d4dfc7d" }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "%%stata\n", "\n", "cd .." ], "id": "290fa74b" }, { "cell_type": "markdown", "metadata": {}, "source": [ "An easier way to change the directory is by typing the `cd` command\n", "followed by the folder you want to set as your working directory. In\n", "this method, quotation marks are not necessary." ], "id": "75309f17-d3a7-46fc-b810-c5b1cbf1d898" }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "%%stata\n", "\n", "cd myfolder\n", "\n", "*Use myfolder as a placeholder for the folder you want to set as your working directory" ], "id": "13816b86" }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition, we can use the command `cd` on its own to go back to the\n", "home directory.\n", "\n", "The process for changing directories in Stata varies depending on the\n", "type of computer being used. If one approach does not work, it is\n", "possible that the method is not suitable to your computer. Please see\n", "the Stata manual for instructions on how to change directories according\n", "to the type of computer you are using:\n", "https://www.stata.com/manuals/dcd.pdf.\n", "\n", "## 5.3 Opening Data Sets\n", "\n", "### 5.3.1 Excel and CSV Files\n", "\n", "When looking for the data for your research, you will realize that many\n", "data sets are not formatted for Stata (.dta files). In some cases, data\n", "sets are formatted as Excel or CSV files. Not surprisingly, the command\n", "to load in data is called `import`. It comes in two main forms:\n", "`import excel` and `import delimited`.\n", "\n", "Let’s import the data set called `fake_data.csv`. We need to use\n", "`import delimited` to import this data into Stata. The syntax for this\n", "command is\n", "`import delimited [using] filename [, import_delimited_options]`.\n", "\n", "We always include the option `clear` when we use import to make sure\n", "we’re clearing any previous data set that was opened before in our Stata\n", "session. Recall that to use an option, we include a comma (`,`) after\n", "the command line and write the option name. Feel free to read the\n", "documentation of these commands by writing `help import delimited`.\n", "\n", "Note that the command below will not import the data unless you have\n", "changed your directory (above) to the folder which contains this file.\n", "\n", "**Ignore the following block of code that will create a csv file to be\n", "used as an example**" ], "id": "43dee160-9455-41bb-8c74-06e956bec757" }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "%%stata\n", "\n", "use fake_data, clear\n", "export delimited using \"fake_data.csv\", replace" ], "id": "331e4603" }, { "cell_type": "markdown", "metadata": {}, "source": [ "To load a .csv data set, we write:" ], "id": "078fad91-ffa6-48d1-917f-86179cc3c7b1" }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "%%stata\n", "\n", "import delimited using \"fake_data.csv\", clear" ], "id": "56e71e29" }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we run this command, Stata will print a message saying that there\n", "are 9 variables and almost 3 million observations. When we open data\n", "sets that are not in Stata format, it is very important to check whether\n", "the first row of the data includes the variable names.\n", "\n", "We can use the command `list` to look at our data. It is better to limit\n", "the observations we see since we don’t want to see all 3 million! Thus,\n", "we use `in` to constrain the list to the first 3 observations below." ], "id": "c9fd85aa-4a8c-4b7c-97e9-b8d972d65f7f" }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "%%stata\n", "\n", "list in 1/3 " ], "id": "687d9543" }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, the first row of data is interpreted as the variable names\n", "of the data set, which in this case is correct. If that’s not the case,\n", "we need to include the import delimited option `varnames(#|nonames)`,\n", "where we replace `#` by the observation number that includes the\n", "variable names. If the data has no names, the option is\n", "`varnames(nonames)`. Don’t forget that we can always check the\n", "documentation by writing `help import delimited`.\n", "\n", "### 5.3.2 Stata Files\n", "\n", "To open data sets in the Stata format, we use the command `use`. As we\n", "can observe from the example below, we can recognize a data set is\n", "stored in Stata format because the file’s name will end with .dta" ], "id": "0b1ad953-2081-422e-b359-7073e39f35f2" }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "%%stata\n", "\n", "use \"fake_data.dta\", clear" ], "id": "737aac9d" }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "%%stata\n", "\n", "list in 1/3 " ], "id": "b386d06e" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.3.3 Data from Online Sources\n", "\n", "It is also possible to download data sets directly from other sources,\n", "such as websites. This allows us to import data without saving it as a\n", ".csv or Excel file.\n", "\n", "For example, it is possible to import data from the World Bank using the\n", "command `wbopendata`. This requires us to install some packages first,\n", "but once it is done, it simplifies the process of importing only what we\n", "need. More information is available\n", "[here](https://datahelpdesk.worldbank.org/knowledgebase/articles/889464-wbopendata-stata-module-to-access-world-bank-data).\n", "\n", "Another useful command is `sdmxuse`. This allows us to import data from\n", "all sorts of statistical agencies, such as the IMF and the OECD. More\n", "information is available\n", "[here](https://journals.sagepub.com/doi/pdf/10.1177/1536867X1801800407).\n", "\n", "The types of tools are widely available and often the best approach for\n", "getting data from statistical agencies.\n", "\n", "### 5.3.4 Other Files\n", "\n", "We can open a number of different data files in Stata with no issues. If\n", "you are struggling, one option at UBC is to use the program StatTransfer\n", "to convert your file to .dta format. This program is available in the\n", "library on the UBC Vancouver Campus at one of the [Digital Scholarship\n", "workstations](https://researchcommons.library.ubc.ca/digital-scholarship-lab-use-policy-and-guideline/).\n", "Once your data is in .dta format, it can be imported with the `use`\n", "command seen above.\n", "\n", "**Note:** UBC has research support available for any student who needs\n", "help with data, including anyone who needs help getting data into a\n", "format that can be imported into Stata. You can find the contact\n", "information for the Economics Librarian on the [UBC Library ECON 490\n", "Research Guide](https://guides.library.ubc.ca/ECON490).\n", "\n", "## 5.4 Commands to Explore the Dataset\n", "\n", "### 5.4.1 `describe`\n", "\n", "The first command we are going to use describes the basic\n", "characteristics of the variables in the loaded data set." ], "id": "17bc7106-9033-4880-83dd-21ad056a67b5" }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "%%stata\n", "\n", "describe" ], "id": "954f74e3" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.4.2 `browse`\n", "\n", "In addition to the `describe` command, in the Stata interface we can\n", "also open the data editor and see the raw data as if it were an Excel\n", "file. To do this, we type `browse`. This command will open a new Stata\n", "window. If we want to do this from within Jupyter, we use the command\n", "with `%` before `browse`. Either approach gives us a “spreadsheet-like”\n", "view of the data set." ], "id": "fd0973ed-3612-4d91-a6d0-b675120193f8" }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "%%stata\n", "\n", "%browse" ], "id": "fcd0a660" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Opening the data editor has many benefits. Most importantly, we get to\n", "see the data as a whole, allowing us to have a clearer perspective of\n", "the information the data set is providing us. For example, here we\n", "observe that we have unique worker codes, the year where they are\n", "observed, worker characteristics (sex, age, and earnings), and whether\n", "or not they participated in the specified 2003 training program.\n", "\n", "### 5.4.3 `codebook`\n", "\n", "We can further analyze any variable by using the `codebook` command.\n", "Let’s do this here to learn more about the variable *earnings*." ], "id": "596ae51b-d04c-454e-97ba-83fc990aea31" }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "%%stata\n", "\n", "codebook earnings" ], "id": "0b29d54e" }, { "cell_type": "markdown", "metadata": {}, "source": [ "The codebook command gives us important information about this variable\n", "such as the type (i.e. string or numeric), how many missing observations\n", "it has (very useful to know!) and all unique values. If the variable is\n", "numeric, it will also provide some summary statistics. If the variable\n", "is a string, it will provided examples of some of the entries.\n", "\n", "Try changing the variable name in the cell above to see the codebook\n", "entries for different variables in the data set.\n", "\n", "### 5.4.4 `tabulate`\n", "\n", "We can also learn more about the frequency of the different measures of\n", "one variable by using the command `tabulate`." ], "id": "4a30f7a0-a8f7-4bab-a4b1-3481c95ce345" }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "%%stata\n", "\n", "tabulate region" ], "id": "43cefcc5" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we can see that there are five regions indicated in this data set.\n", "We can see that the majority of people surveyed came from region 1.\n", "\n", "We can actually include two variables in the `tabulate` command if we\n", "want more information. When we do this, below we see that there were\n", "234,355 female identified and 425,698 male identified persons surveyed\n", "in region 1." ], "id": "f09ec5b2-172c-465a-8624-f370e826b29f" }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "%%stata\n", "\n", "tabulate region sex" ], "id": "cff6453d" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.4.5 `lookfor`\n", "\n", "What if there’s a gazillion variables and we’re looking for a particular\n", "one? Thankfully, Stata provides a nice command called `lookfor` which\n", "helps us search for variables based on keywords. Suppose we want to look\n", "for a variable that is related to year." ], "id": "2e7c0c3f-3c76-475f-8d3c-2af5f985f8bb" }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "%%stata\n", "\n", "lookfor year" ], "id": "38dcaf17" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Stata found three variables that include the word year either in the\n", "variable name or in the variable label. This is super useful when we are\n", "getting to know a data set!\n", "\n", "## 5.5 Saving Data Sets\n", "\n", "We can save any opened data set in Stata format by writing\n", "`save using \"some_directory/dataset_name.dta\", replace`. The replace\n", "option overwrites a previous version of the file to keep our save\n", "current.\n", "\n", "Try saving the data we have been working on in a new folder named\n", "“cleaned_data”." ], "id": "2f92225e-86ee-4d0e-a486-016c99940979" }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "%%stata\n", "\n", "save using \"insert_directory/dataset_name.dta\", replace" ], "id": "6e05da79" }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also save files in different formats with the `export excel` and\n", "`export delimited` commands. Look at the help documentation for more\n", "details.\n", "\n", "## 5.6 Errors\n", "\n", "The `tabulate` command may be used in conjunction with conditional\n", "statements. When specifying the condition, ensure that you use quotation\n", "marks; otherwise, Stata will return an error code. Uncomment each line\n", "of code below to see it in action." ], "id": "8579427c-03b6-4e7c-b3f7-272c0bad5116" }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "%%stata\n", "\n", "\n", "*tabulate sex if sex==F //incorrect\n", "*tabulate sex if sex==\"F\" //correct" ], "id": "51dd6982" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.7 Wrap Up\n", "\n", "Now that we are able to import data into Stata, we can start doing\n", "analysis! Try finding a data set that interests you and practice some of\n", "the commands that you have already learned in the first few modules. In\n", "the next module, we will look at commands for working with data in\n", "greater depth.\n", "\n", "## 5.8 Wrap-up Table\n", "\n", "| Command | Function |\n", "|----------------------------------|--------------------------------------|\n", "| `clear` | used to clear the workspace |\n", "| `cd` | used to change the working directory |\n", "| `pwd` | used to view the current working directory |\n", "| `use` | used to open a Stata dataset |\n", "| `import delimited` | used to load a csv dataset |\n", "| `import excel` | used to load an excel dataset |\n", "| `list` | used to look at the data |\n", "| `describe` | used to describe the basic characteristics of the variables in the loaded dataset |\n", "| `browse` | used to open up the data editor and view the observations of the dataset |\n", "| `codebook` | used to describe data contents |\n", "| `tabulate` | used to summarize the frequency of the different measures of a variable |\n", "| `lookfor` | used to search for the variables of a dataset based on keywords |\n", "| `export excel` | used to save a dataset in excel format |\n", "| `export delimited` | used to save a dataset in csv format |\n", "\n", "## References\n", "\n", "[Import data from\n", "excel](https://www.youtube.com/watch?v=N5ZFgzN2_7c&list=PLN5IskQdgXWnnIVeA_Y0OBGmnw21fvcmU&index=5)\n", "
[Import delimited\n", "data](https://www.youtube.com/watch?v=8vwRchxkaZs&list=PLN5IskQdgXWnnIVeA_Y0OBGmnw21fvcmU&index=6)" ], "id": "563d76e5-8f3c-4704-a29d-666ac31122ed" } ], "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "python3", "display_name": "Python 3 (ipykernel)", "language": "python", "path": "/usr/local/share/jupyter/kernels/python3" }, "language_info": { "name": "python", "codemirror_mode": { "name": "ipython", "version": "3" }, "file_extension": ".py", "mimetype": "text/x-python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } } }