Importing USPTO PatentsView Data Download Files with R.
In the last article we explored how to download the USPTO PatentsView patent data files. In the process we used web scraping with the rvest package to identify the files to download and to keep a record that we could store with the data.
In this article we are going to focus on importing the data files into R. We will use R, but very similar logic applies if you write this code in Python.
The USPTO PatentsView data files are a set of zip files that take up around 100 GB for the granted patents (grants) and around 26 GB for applications (called pregrant). In addition to the main tables there are separate yearly download tables for the main text segments: the brief summary, the description and the claims. If you have been following along then the entire grant directory should look something like Figure 1.
If you want to work with the entire US patent collection, including the description, brief summary and claims, it is probably best to allow for around 500 GB of disk space for the full set of files when unzipped.
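Before unzipping anything it can help to check what is already on disk. The following is a minimal sketch in base R, assuming the downloads sit in a single folder as `.zip` files. The function name `inventory_zips()` and the demo folder are hypothetical and only there for illustration; point the function at your own download directory.

```r
# List the zip files in a download folder with their compressed size in GB.
# `inventory_zips()` is a hypothetical helper, not part of any package.
inventory_zips <- function(dir) {
  files <- list.files(dir, pattern = "\\.zip$", full.names = TRUE,
                      recursive = TRUE)
  data.frame(
    file = basename(files),
    size_gb = file.size(files) / 1e9,  # bytes to gigabytes
    stringsAsFactors = FALSE
  )
}

# Demo on a throwaway folder with placeholder files so the sketch runs
# without the real downloads (the file names here are made up).
demo_dir <- file.path(tempdir(), "pv_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("placeholder", file.path(demo_dir, "patent.tsv.zip"))
writeLines("placeholder", file.path(demo_dir, "readme.txt"))

inv <- inventory_zips(demo_dir)
inv                # one row per zip file; readme.txt is ignored
sum(inv$size_gb)   # total compressed size in GB
```

The total reported here is the compressed size only, so the unzipped files will need considerably more space than this figure suggests.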
The question now is how to import this data.
We have a number of choices when planning to import this data. The best choice for your work will partly depend on what you want to do with the data afterwards, and in particular on how much of it you plan to use. In reality there are three main scenarios:
We will address scenario 1 in this article and then move to the others in the next articles.
In this article we focus on importing the bulk data using R, that is, the USPTO patent data files that we downloaded in the previous post. We will mainly address how to import some of the tables into R.
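As a first illustration of importing a single table, here is a minimal sketch in base R. It assumes that each table is a tab-separated file with a header row and that each zip archive contains one data file; neither point is confirmed in this article, so check your own downloads first. The helper `read_pv_table()` and the column names in the demo are hypothetical.

```r
# Read one table from a .zip archive or from an already unzipped file.
# Assumptions: tab-separated values, a header row, one file per archive.
read_pv_table <- function(path, n_max = -1) {
  if (grepl("\\.zip$", path)) {
    # Look up the file name inside the archive, then read it through a
    # connection so nothing has to be extracted to disk first
    inner <- utils::unzip(path, list = TRUE)$Name[1]
    src <- unz(path, inner)
  } else {
    src <- path
  }
  read.delim(
    src,
    nrows = n_max,              # -1 reads every row
    colClasses = "character",   # keep identifiers as text, not numbers
    stringsAsFactors = FALSE
  )
}

# Demo on a tiny made-up table so the sketch runs without the real data
demo_file <- tempfile(fileext = ".tsv")
writeLines(
  c("id\ttitle", "0001\tFirst example", "0002\tSecond example"),
  demo_file
)

demo <- read_pv_table(demo_file, n_max = 1)
str(demo)
```

Reading only the first few rows with `n_max` is a cheap way to inspect column names before committing to a full import, and reading every column as character avoids losing leading zeros in identifiers. For the largest tables a faster reader than base R is likely to be a better choice.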
If we downloaded some or all of the
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/wipo-analytics, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Oldham (2022, Jan. 13). WIPO Patent Analytics: Import Bulk PatentsView Data with R. Retrieved from https://wipo-analytics.github.io/posts/2022-01-13-patentsview-import-bulk-data/
BibTeX citation
@misc{oldham2022import, author = {Oldham, Paul}, title = {WIPO Patent Analytics: Import Bulk PatentsView Data with R}, url = {https://wipo-analytics.github.io/posts/2022-01-13-patentsview-import-bulk-data/}, year = {2022} }