When you request a dataset download from the Data Portal, there are many ways to work with the results. Sometimes, rather than accessing the data through THREDDS (such as via .ncml or the subset service), you just want to download all of the files and work with them on your own machine.
There are several methods you can use to download your delivered files from the server en masse, including:
- shell – curl or wget
- python – urllib (urllib2 in Python 2)
- java – java.net.URL
Below, we detail how you can use wget or python to do this.
It’s important to note that the email notification you receive from the system will contain two different web links. They look very similar, but the directories they point to differ slightly.
First Link: https://opendap.oceanobservatories.org/thredds/catalog/ooi/sage-marine-rutgers/20171012T172409-CE02SHSM-SBD11-06-METBKA000-telemetered-metbk_a_dcl_instrument/catalog.html
The first link (which includes thredds/catalog/ooi) points to your dataset on a THREDDS server. THREDDS provides additional capabilities to aggregate or subset the data files if you use a THREDDS- or OPeNDAP-compatible client, like ncread in Matlab or pydap in Python.
Second Link: https://opendap.oceanobservatories.org/async_results/sage-marine-rutgers/20171012T172409-CE02SHSM-SBD11-06-METBKA000-telemetered-metbk_a_dcl_instrument
The second link points to a traditional Apache web directory. From here, you can download files directly to your machine by simply clicking on them.
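Because the two example links differ only in their path prefix and the trailing catalog.html, one can be derived from the other. The following is a minimal sketch, assuming every request follows the same pattern as the two example links above; the function name thredds_to_apache is made up for this illustration.

```python
def thredds_to_apache(thredds_link):
    """Convert the first (THREDDS catalog) link into the second (Apache) link.

    Assumes the link has the same shape as the example in this article:
    .../thredds/catalog/ooi/<user>/<request>/catalog.html
    """
    prefix = "/thredds/catalog/ooi/"
    suffix = "/catalog.html"
    if prefix not in thredds_link or not thredds_link.endswith(suffix):
        raise ValueError("not a THREDDS catalog link of the expected form")
    # Drop the trailing catalog.html, then swap the THREDDS prefix
    # for the Apache results directory.
    return thredds_link[: -len(suffix)].replace(prefix, "/async_results/", 1)


first_link = (
    "https://opendap.oceanobservatories.org/thredds/catalog/ooi/"
    "sage-marine-rutgers/20171012T172409-CE02SHSM-SBD11-06-METBKA000-"
    "telemetered-metbk_a_dcl_instrument/catalog.html"
)
print(thredds_to_apache(first_link))
```

If a link does not match the expected pattern, the function raises an error rather than guessing.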
Using wget
First, make sure wget is installed on your machine. If you are on a Mac and have the Homebrew package manager installed, you can type the following in the terminal:
brew install wget
Alternatively, you can grab wget off GitHub here https://github.com/jay/wget
Once wget is installed, you can recursively download an entire directory of data using the following command, where URL is the second (Apache) web link provided by the system:
wget -r -l1 -nd -nc -np -e robots=off -A.nc --no-check-certificate URL
This simpler version may also work.
Here is an explanation of the specified flags.
- -r signifies that wget should recursively download data in any subdirectories it finds.
- -l1 sets the maximum recursion to 1 level of subfolders.
- -nd (no directories) saves all matching files to the current directory instead of recreating the server's folder hierarchy. If two files have identical names, it appends an extension.
- -nc does not download a file if it already exists.
- -np prevents files from parent directories from being downloaded.
- -e robots=off tells wget to ignore the robots.txt file. If this option is left out, wget obeys the server's robots.txt file, which disallows web crawlers, and the download will not work.
- -A.nc restricts downloading to the specified file types (with .nc suffix in this case)
- --no-check-certificate disregards the SSL certificate check. This is useful if the SSL certificate is set up incorrectly, but make sure you only do this on servers you trust.
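If you would rather drive wget from a script (for example, to fetch several requests in a row), the flags above can be assembled in Python. This is only a sketch: the helper name build_wget_command is hypothetical, and it assumes wget is installed and on your PATH.

```python
import shlex


def build_wget_command(url):
    """Return the wget argument list using the flags described above."""
    return [
        "wget",
        "-r",                      # recurse into subdirectories
        "-l1",                     # at most one level of recursion
        "-nd",                     # save everything into the current directory
        "-nc",                     # skip files that already exist locally
        "-np",                     # never ascend to the parent directory
        "-e", "robots=off",        # ignore robots.txt
        "-A.nc",                   # accept only files ending in .nc
        "--no-check-certificate",  # skip the SSL check (trusted servers only)
        url,
    ]


# The second (Apache) link from the notification email.
url = (
    "https://opendap.oceanobservatories.org/async_results/"
    "sage-marine-rutgers/20171012T172409-CE02SHSM-SBD11-06-METBKA000-"
    "telemetered-metbk_a_dcl_instrument"
)
# Print the command so it can be reviewed or pasted into a terminal.
print(shlex.join(build_wget_command(url)))
```

To actually start the download from Python, pass the same list to subprocess.run(build_wget_command(url), check=True). Note that shlex.join requires Python 3.8 or later.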
Using python
wget is rather blunt: it will download every file it finds in a directory, though, as noted above, you can restrict it to a specific file extension.
If you want to be more granular about which files you download, you can use Python to parse through the data file links it finds and have it download only the files you really want. This is especially useful when your download request results in a lot of large data files, or if the request includes files from many different instruments that you may not need.
Here is an example script that uses the THREDDS service to find all .nc files included in the download request. Under the hood, THREDDS provides a catalog.xml file, which we can use to extract the links to the available data files. This XML file is easier to parse than raw HTML.
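A minimal sketch of such a script follows. It assumes Python 3, that catalog.xml sits next to catalog.html in the first (THREDDS) link, and that files are served from the standard THREDDS fileServer path; check both against your own server. The helper name find_nc_paths is made up for this illustration, and request_url is left empty so nothing is downloaded until you fill it in.

```python
from urllib.request import urlopen, urlretrieve
from xml.dom import minidom

# Split the first (THREDDS) link from your email at "catalog/".
# Everything before "catalog/":
server_url = "https://opendap.oceanobservatories.org/thredds/"
# Everything after "catalog/", without "catalog.html", ending in "/":
request_url = ""


def find_nc_paths(catalog_xml):
    """Return the urlPath of every <dataset> in catalog.xml ending in .nc."""
    doc = minidom.parseString(catalog_xml)
    paths = [d.getAttribute("urlPath") for d in doc.getElementsByTagName("dataset")]
    return [p for p in paths if p.endswith(".nc")]


def main():
    if not request_url:
        print("Set request_url before running.")
        return
    # Part 1: build the list of files we would like to download.
    catalog_url = server_url + "catalog/" + request_url + "catalog.xml"
    with urlopen(catalog_url) as response:
        files = find_nc_paths(response.read())
    # Part 2: download each file into the current directory.
    for count, path in enumerate(files, start=1):
        file_url = server_url + "fileServer/" + path  # assumed THREDDS file path
        file_name = path.split("/")[-1]
        print("Downloading file %d of %d: %s" % (count, len(files), file_url))
        urlretrieve(file_url, file_name)


if __name__ == "__main__":
    main()
```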
The first part of the main() function creates a list of all of the files we would like to download (in this case, only ones ending in .nc), and the second part actually downloads them using urlretrieve() from urllib.request (urllib.urlretrieve() in Python 2). If you want to download only files from particular instruments, or within specific date ranges, you can customize the code to keep just the files you want (e.g. using regex).
Don’t forget to update the server_url and request_url variables before running the code. You may also need to install the required libraries if you don’t already have them on your machine.
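As one way to do the instrument and date-range filtering mentioned above, the sketch below selects paths with a regular expression. It rests on an assumption about file naming: that each file name ends in _<start>-<end>.nc, with both times written as YYYYMMDDTHHMMSS and optional fractional seconds. Check this against your own file list. The function name select_files and its parameters are hypothetical.

```python
import re
from datetime import datetime

# Assumed ending of each file name: _<start>-<end>.nc
TIME_RANGE = re.compile(r"_(\d{8}T\d{6})(?:\.\d+)?-(\d{8}T\d{6})(?:\.\d+)?\.nc$")


def select_files(paths, instrument=None, start=None, end=None):
    """Keep paths containing `instrument` whose time span overlaps [start, end].

    Paths whose names do not match the assumed pattern are kept, so an
    unexpected naming scheme never silently drops data.
    """
    selected = []
    for path in paths:
        if instrument and instrument not in path:
            continue
        match = TIME_RANGE.search(path)
        if match and (start or end):
            file_start, file_end = (
                datetime.strptime(t, "%Y%m%dT%H%M%S") for t in match.groups()
            )
            # Skip files that end before the window or start after it.
            if (start and file_end < start) or (end and file_start > end):
                continue
        selected.append(path)
    return selected
```

For example, select_files(files, instrument="METBKA", start=datetime(2017, 1, 1)) would keep only METBKA files containing data from 2017 onward; applying it to the file list before the download loop leaves the rest of the script unchanged.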
— Last revised on May 31, 2018 —