# Replication Readme for Sarada, Andrews, and Ziebarth (2019)

## Rough Overview of Data Construction Process

1. Create Census datasets from NBER servers for matching to patentee lists. **Cannot be replicated on a local machine.**
2. Assemble datasets from other sources (NHGIS, Jim Shaw, Census Summary Stats, First Names) using Stata.
3. Parse Annual Reports using Python.
4. Match parsed patentees to Census using Stata, controlled by the code in `Python/Matching`.
5. Build datasets of matched results using Stata.
6. Analyze these datasets using Stata.

## `Code/Python`

### `Matching`

*This folder contains a number of Python files that manage the matching process. Make sure that Stata is on your path. On Mac, this means modifying the `$PATH` variable from the Terminal.*
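A quick way to confirm that Stata is reachable before launching any matching jobs is to look it up on the path from Python. The following is a minimal sketch, not part of the replication code: the function name `find_stata` and the list of executable names are assumptions, since the actual executable name depends on the Stata edition and platform.

```python
import shutil

# Hypothetical executable names; adjust to match the local Stata installation.
CANDIDATE_NAMES = ("stata", "stata-mp", "stata-se", "StataMP", "StataSE")


def find_stata(candidates=CANDIDATE_NAMES):
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    location = find_stata()
    if location is None:
        print("Stata not found on PATH; add its directory to $PATH first.")
    else:
        print(f"Stata found at {location}")
```

If the check fails, add the directory containing the Stata executable to `$PATH` and run it again.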

1. `call_to_stata.py` - Calls Stata from Python. This assumes that Stata is on the path. If not, you can add it by changing the `$PATH` variable from the command line.
2. `check_results.py` - Checks whether a job exited "normally" by comparing the number of matched files to the product of the numbers of Origin and Target files.
3. `create_mask.py` - Used by update_log to identify jobs already finished.
4. `extract_prepare.py` - Unzip a zip file of dtas and then repeatedly call `prepare_dtas.py`.
5. `gen_dropbox_dict.py` - Creates a dictionary of directories to subfolders in the project Dropbox folder.
6. `gen_local_dict.py` - Creates a dictionary of directories to subfolders in the local folder.
7. `gen_manifest.py` - Creates the manifest of jobs (Origin-Target pairs) that need to be run. First, checks which zip files exist in the Matching directory and builds the log from that. Otherwise just takes the Cartesian product of Origin and Target zip files. The version of this with the _Adj suffix is for the case where the Census year differs from the patentee year.
8. `get_timestamp.py` - Get timestamp for the log of jobs.
9. `initialize_log.py` - Initialize the log of jobs. If no log exists yet, it calls `gen_manifest.py`. The version of this with the _Adj suffix is for the case where the Census year differs from the patentee year.
10. `parallelize_matching.py` - This is the file to run from the command line. The version of this with the _Adj suffix is for the case where the Census year differs from the patentee year. It takes the following arguments in this order:
    1. Census year to work on.
    2. Path to project files. Something like `~/Dropbox/Demographics of Patent Grantees`.
    3. Path where Origin, Target, and Matched files are written temporarily. This can be the same as the path to project files, but pointing it somewhere outside Dropbox, such as `~/Downloads`, avoids uploading and downloading the thousands of files created in the matching process.
    4. User name for the manifest of matching jobs.
11. `prepare_dtas.py` - Calls `prepare_dtas.do`.
12. `save_log.py` - Copies old log file and saves new one.
13. `serialize_file.py` - Used by update_log, this splits up file name.
14. `submit_job.py` - This runs a single matching job. The version of this with the _Adj suffix is for the case where the Census year differs from the patentee year.
15. `update_log.py` - Updates the log of files matched.
16. `which.py` - Identifies the type of machine the code is running on.
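The manifest and the completion check described above come down to a Cartesian product and a count comparison. The sketch below illustrates that logic only; the function names `build_manifest` and `exited_normally` and the file names are hypothetical, and the actual `gen_manifest.py` and `check_results.py` may be organized differently.

```python
from itertools import product


def build_manifest(origin_files, target_files):
    """Return one job per Origin-Target pair (the Cartesian product)."""
    return list(product(sorted(origin_files), sorted(target_files)))


def exited_normally(n_matched, n_origin, n_target):
    """A run is complete when there is one matched file per Origin-Target pair."""
    return n_matched == n_origin * n_target


# Hypothetical file names, for illustration only.
origins = ["origin_1.zip", "origin_2.zip"]
targets = ["target_A.zip", "target_B.zip", "target_C.zip"]

jobs = build_manifest(origins, targets)
print(len(jobs))                                            # 6
print(exited_normally(len(jobs), len(origins), len(targets)))  # True
```

A count of matched files below the product indicates that at least one job did not finish and should be rerun.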

### `Parser`

1. `build_parser_dict.py` - This creates a dictionary of regular expressions we use to extract patentee information from the text files.
2. `check_results.py` - Functions to check parsing results.
3. `clean_strings.py` - Functions to clean strings.
4. `gen_output.py` - Functions for outputting the parsed results.
5. `helper.py` - Two helper functions used to keep track of location in list.
6. `parse_patents.py` - Python file to call to parse a particular Annual Report.
7. `parser_funcs.py` - Functions related to parsing a given line of the report.
8. `single_job.py` - Functions related to running a whole year of parsing.
9. `timeout.py` - Function to timeout the process if taking too long.
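To make the idea of a dictionary of regular expressions concrete, here is a minimal sketch of how such a dictionary could be applied to one line of text. Everything in it is an assumption for illustration: the patterns, the field names, the `parse_line` function, and the sample line are invented and do not reproduce the contents of `build_parser_dict.py` or the layout of the Annual Reports.

```python
import re

# Hypothetical patterns; the real dictionary is built in build_parser_dict.py.
PARSER_DICT = {
    # Last name, then first name(s), each followed by a comma.
    "name": re.compile(
        r"^(?P<last>[A-Z][A-Za-z'\-]+),\s*(?P<first>[A-Z][A-Za-z.\s]*?),"
    ),
    # A patent number with optional thousands separators at the end of the line.
    "patent_number": re.compile(r"(?P<patent_number>\d{1,3}(?:,\d{3})*)\s*$"),
}


def parse_line(line, parser_dict=PARSER_DICT):
    """Apply every pattern to the line and collect the named groups that match."""
    fields = {}
    for pattern in parser_dict.values():
        match = pattern.search(line)
        if match:
            fields.update(match.groupdict())
    return fields


# Invented sample line, not taken from an actual Annual Report.
sample = "Smith, John B., Boston, Mass., Churn, 123,456"
print(parse_line(sample))
```

Keeping the patterns in one dictionary means a report with a different layout can be handled by swapping the dictionary rather than the parsing functions.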

## `Code/Stata`

1. `analyze.do` - Creates tables and figures in the paper.
2. `assemble.do` - Builds datasets used in the paper.

### Analyze

1. `figures_first_name.do` - This generates the figures using just first names.
2. `figures_JimShaw.do` - This generates the figures comparing our parsed samples to the data from Jim Shaw.
3. `figure_match_statistics.do` - This generates a table of basic statistics on our sample.
4. `figures_matched.do` - This generates the figures using our matched sample.
5. `figures_matched_unmatched.do` - This generates the figures comparing the sample of matched to unmatched patentees.
6. `figures_matched_unmatched_Adj.do` - A modified version of `figures_matched_unmatched.do` for use in plotting the adjacent year matched results.
7. `figure_parse_rates.do` - This generates a figure of the parse rates in Census years.
8. `regressions.do` - This generates all the regression results.

### Assemble

1. `assemble_data_for_figures.do` - This assembles data for the figures using all of the `matched_patentees_`\``year'.dta` and outputs `./Generated/data_for_figures.dta`.
2. `assemble_data_for_regressions.do` - This assembles data for the regressions using all of the `matched_patentees_`\``year'.dta` and outputting `./Generated/data_for_regressions.dta`.
3. `assemble_matched_patentees.do` - This appends files of matched patentees for a given year together generating `matched_patentees_`\``year'.dta`. This also merges in NHGIS and county summary statistics data.
4. `assemble_JimShaw.do` - This assembles the Jim Shaw data and creates some files in `./JimShaw/Stata Data` used by `figures_JimShaw.do`.
5. `assemble_sumstats_census.do` - This assembles demographic summary statistics from population census.
6. `assemble_sumstats_census_final.do` - Assembles year-state files into `./Generated/countySumStats_AllYears.dta`.
7. `assemble_NHGIS_data.do` - This assembles NHGIS data into `./Data/Generated/nhgis_AllYears.dta`. Individual census year files are saved to `Data/NHGIS/Stata Data`.
8. `assemble_firstNames_patentees.do` - This assembles a dataset of the fraction of female and black patenting based on first name and state of patentees by year into `Data/Generated/first_names_merged_patentees.dta`. It also creates datasets of these probabilities by first name in `Data/First Names/first_names_`\``year'.dta`.
9. `assemble_patentees.do` - This assembles lists of patentees from the results of our parser into the files `patentees_`\``year'.dta`. Also "chunks" these files into smaller ones for matching putting them in `./Origin`.
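The "chunking" step splits one long list of patentees into fixed-size pieces so that each piece can be matched as a separate job. The sketch below shows the idea in Python purely for illustration; the replication code does this in Stata inside `assemble_patentees.do`, and the function name `chunk` and the chunk size are assumptions.

```python
def chunk(records, chunk_size):
    """Split a list of records into consecutive pieces of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        records[start:start + chunk_size]
        for start in range(0, len(records), chunk_size)
    ]


# Five hypothetical patentee records split into pieces of two.
pieces = chunk(["p1", "p2", "p3", "p4", "p5"], 2)
print(pieces)  # [['p1', 'p2'], ['p3', 'p4'], ['p5']]
```

Smaller pieces mean more Origin-Target jobs, but each job needs less memory and can be rerun on its own if it fails.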

#### `./Helper`

1. `clean_BP.do` - Clean birthplace variable and generate flag for valid value of it.
2. `clean_first.do` - Clean first names.
3. `clean_last.do` - Clean last names.
4. `clean_post_parse.do` - Cleans the parsed text files, ensuring name, town, and assignee information is recorded under the correct variable name and removing clearly erroneously parsed records.
5. `clean_state.do` - Clean state names.
6. `clean_strings_NBER.do` - Cleans the strings for data on the NBER server. This is then integrated into the other `clean_*.do` files for consistency.
7. `clean_town.do` - Clean town names.
8. `first_name_abbrevs.do` - Replaces common first name abbreviations with full name using the dta file `first_name_abbrevs.dta`.
9. `fix_OCR.do` - Fix common OCR mistakes.
10. `gen_demographic_variables.do` - Generate demographic variables for matched patentees.
11. `gen_dist_match_variables.do` - Calculate distribution of demographic of matches for given patentee.
12. `gen_matching_variables.do` - Generate matching variables for matched patentees.
13. `gen_NHGIS_variables.do` - Generate variables from NHGIS data.
14. `install_dependencies.do` - Install necessary Stata packages and create folders for outputting results.
15. `label_demographic_vars` - Labels demographic variables of matched patentee datasets.
16. `label_matching_vars.do` - Labels variables in matched patentees datasets.
17. `label_NHGIS_vars.do` - Labels variables created from NHGIS data.
18. `merge_data.do` - Merge in other datasets into matched patentees datasets.
19. `prepare_Census_data_NBER` - Used to generate Census data from NBER servers.
20. `trans_State_StateAbbrev.do` - Translate from state names to abbreviations.
21. `trans_StateAbbrev_State.ado` - Translate from state abbreviations to names.

### Matching

1. `matching_patentees.do` - Matches patentees to Census. This is called by Python scripts in `./Python/Matching` subdirectory.
2. `matching_patentees_First.do`, `matching_patentees_Last.do`, and `matching_patentees_Town.do` - Variants of `matching_patentees.do` that place higher weights on the variables noted in the program names. They also read in only a 10% random sample of patentees. These are used in robustness checks for how we set the initial matching weights. The version of this with the _Adj suffix is for the case where the Census year differs from the patentee year.
3. `prepare_dtas.do` - Read a dta of individuals from Census and prepare for matching. This is called by `prepare_dtas.py`.

## Data

### 1900 Matching Quality Check  

*Data from manually matching patentee lists to population in Vermont.*

### Annual Reports

1. `./Parsed` - Text files of patentees parsed from the Annual Reports by Python. Other files in these subdirectories are outputs from our parsing of the original reports. These are read into Stata by `assemble_patentees.do`.
2. `./Original Reports` - These are the original reports we parsed.

### Census SumStats

`./year` - Contains state dta files. Generated by `assemble_sumstats_census.do` based on files in Target.

### First Names

`./year` - Contains state dta files. Generated by `assemble_firstNames_patentees.do` based on files in Target.

### Generated

*Derived datasets used in running regressions and creating figures.*

### NHGIS

*County-level demographic characteristics downloaded from NHGIS. These include the originals and Stata files generated by* `assemble_NHGIS_data.do`.

### Origin

*Datasets of patentees "chunked" into smaller files for matching. This is a temporary directory.*

### Other Patent Lists

*Other patent datasets including Jim Shaw and from the USPTO.*

### Patentees

1. `./Parsed/patentees_`\``year'` - Lists of patentees parsed from Annual Reports.
2. `./Matched/matched_patentees_`\``year'` - Matched patentees in \``year'`. Generated by assembly of matching results for a given year by `assemble_matched_patentees_year.do`.

### Target

*Datasets with individuals from census. These are created by the files in* `Code/Python/Census Data` *run on the NBER servers.*