Geocaching Analytics – Part 2: GCInsight Data Gathering
This post dives into the details of actually getting the Geocaching data used for later analysis into a convenient and usable form.
Unlike many sample data sources out there, the data is not made (easily) publicly available, though anybody can create a free account and view most of the data on the website. Getting the data out of the website and into a SQL database is another story altogether…
Some things we’ll need to get started:
- Geocaching.com Premium Account – The first step is to upgrade to a Geocaching.com Premium Account, which as of this writing is $20/year. While geocaching.com does offer free accounts, those accounts cannot see all geocaches (many geocaches are flagged as “premium only”) and they also suffer from data export limitations.
- GSAK (The Geocaching Swiss Army Knife) – GSAK is a Windows-based application which is used for downloading and managing Geocache data. It’s one of the few tools which has the ability to download Geocache data from geocaching.com utilizing the Geocaching.com API (more on that later) and handles de-duplicating and merging cache data. This handy application is a one-time payment of $30.
There are three primary methods of retrieving data from the geocaching.com website as of this writing:
- Built-in Functions (Pocket Queries) – Pocket Queries are built in to the geocaching.com website for premium members. They’re intended as a way to gather a number of geocaches, based on search parameters, and export them to a single XML file (the GPX format). Each query returns at most 1,000 caches and at most 5 queries can be run per day, so this method exports a maximum of 5,000 caches per day. One additional limitation is that this method will not retrieve caches which have been retired (archived) but remain in the official geocaching.com database for historical purposes. One strange quirk/weakness is that it also will not retrieve the number of favorite points that have been awarded to a cache.
- API Queries (GSAK) – Geocaching.com does not make their API publicly available. Instead, you must use a program which has access to the API via their partner program; in this case, that application is GSAK. GSAK can download geocaches directly from the geocaching.com website via the API. The API allows a maximum of 6,000 caches per day to be downloaded with full details and 10,000 per day with light details (typically the light details are fine), for a combined 16,000 caches per day. Additionally, this method retrieves the number of favorite points that have been awarded to a cache.
- Web page scraping – Web page scraping means writing a utility which retrieves a cache detail web page and then scrapes the details off of it. Since this functions in the same manner as a client would, there is no limit to the number of geocaches which can be downloaded in this manner. However, it does require writing a custom application and updating it every time the geocaching.com site redesigns its cache detail pages. The c:geo Android app is an example of this approach. We won’t be using this method here.
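To put the limits above side by side, here is a minimal sketch of the arithmetic. The daily limits are the ones quoted in the list; the total of 20,000 caches is a made-up figure for illustration, not a real count for any state.

```python
import math

# Daily limits as quoted above (per account, per day).
DAILY_LIMITS = {
    "pocket_queries": 5 * 1000,            # 5 queries x 1,000 caches each
    "api_full": 6000,                      # GSAK via the API, full details
    "api_full_plus_light": 6000 + 10000,   # full and light details combined
}


def days_needed(total_caches: int, per_day: int) -> int:
    """Whole days required to fetch total_caches at per_day caches per day."""
    return math.ceil(total_caches / per_day)


if __name__ == "__main__":
    total = 20000  # hypothetical number of caches to download
    for method, per_day in DAILY_LIMITS.items():
        print(f"{method}: {days_needed(total, per_day)} day(s)")
```

Web page scraping is left out of the table because it has no stated daily limit.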
So, from the above methods, GSAK sounds like the best method, since it has the highest limits and the ability to retrieve the number of favorite points assigned to a cache as well. The one downside to GSAK is that, due to other limitations in the API, it does not have a good method for retrieving all caches in a state or placed within a date range, so here’s the hybrid approach I recommend:
- Use Pocket Queries for the Initial Load – Due to the 5 Pocket Queries per day limitation, this step will take the longest and may take a week or more. We will take advantage of the “Placed During – Between” filter to create enough Pocket Queries to span from the beginning of Geocaching (5/1/2000) until today. Create your first pocket query with the following options set (all others default):
- 1000 caches total (the maximum)
- Within States – Georgia (or your state)
- From Origin – By Coordinates (Enter coordinates for around the center of your state…it doesn’t have to be exact, but you should be able to get everywhere in the state within 500 miles of that point)
- Within a radius of: 500 miles (the maximum)
- Placed During – Between: Choose a date range that gets you fewer than 1,000 caches (it will report 1,000 even if the range contains more than 1,000). Pressing the submit button tells you the number of results. Simply refine your date range until you have a result between 900 and 999.
- Output to GPS Exchange Format (*.gpx)
The number of cache results returned from a given Pocket Query.
The full options to set for a Pocket Query.
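The "refine your date range" step is a manual search, and it can be thought of as a binary search on the end date. The sketch below only illustrates that idea: `count_between` is a hypothetical stand-in for pressing the submit button and reading the result count, and `make_counter` fakes it from a local list of placed dates. Neither is part of the geocaching.com site or of GSAK.

```python
from datetime import date, timedelta


def make_counter(placed_dates, cap=1000):
    """Build a fake result counter that is capped at `cap`, like the site's."""
    def count_between(start, end):
        n = sum(1 for d in placed_dates if start <= d <= end)
        return min(n, cap)
    return count_between


def find_end_date(start, latest, count_between, limit=999):
    """Return the latest end date in [start, latest] whose count is <= limit.

    Relies on the count never decreasing as the end date moves later.
    Returns None if the single day `start` already exceeds the limit.
    """
    if count_between(start, start) > limit:
        return None
    lo, hi = 0, (latest - start).days  # offsets in days from `start`
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if count_between(start, start + timedelta(days=mid)) <= limit:
            lo = mid       # this end date fits; try a later one
        else:
            hi = mid - 1   # too many results; try an earlier one
    return start + timedelta(days=lo)


if __name__ == "__main__":
    first_day = date(2000, 5, 1)
    # Made-up data: one cache placed per day for 3,000 days.
    fake_dates = [first_day + timedelta(days=i) for i in range(3000)]
    counter = make_counter(fake_dates)
    end = find_end_date(first_day, fake_dates[-1], counter)
    print(first_day, "to", end, "->", counter(first_day, end), "caches")
```

The next Pocket Query would then start the day after the returned end date, and so on until today.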
- Schedule Pocket Queries to Run One Time, 5/Day – With enough Pocket Queries created to cover all of the active caches in your state (about 20 for Georgia as of this writing), you now need to schedule them. As you can only run 5 per day, it may take a week or more of scheduled queries to run them all. When each one runs, a .ZIP file will be available for download (for up to 6 days) which can be downloaded manually or via the Download Pocket Queries option in GSAK.
Staggered 5/day Pocket Queries.
- Import Pocket Queries Into GSAK – With the Pocket Queries run, you can now import the data into GSAK. If you’ve downloaded the .ZIP files, you can load them via the File -> Load GPX/LOC/ZIP file option. If you have not downloaded them yet, you can load them via the Geocaching.com Access -> Download Pocket Queries… option. Make sure to set the County update option to Y on the load settings dialog to ensure you get county data as well.
GSAK Load Options with County
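If you want to peek inside a downloaded Pocket Query .ZIP outside of GSAK, something like the following sketch can work. It assumes the archive contains standard GPX files in which each `<wpt>` element carries `lat`/`lon` attributes plus `<name>` and `<desc>` children, and that `<name>` holds the GC code; check that against your own files. The function names are my own, and the geocaching-specific extension fields are not read here.

```python
import zipfile
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def parse_gpx(xml_bytes):
    """Return one dict per <wpt> element found in a GPX document."""
    root = ET.fromstring(xml_bytes)
    caches = []
    for wpt in root.iter():
        if local(wpt.tag) != "wpt":
            continue
        # Collect the direct children by local name, ignoring namespaces.
        fields = {local(child.tag): (child.text or "").strip() for child in wpt}
        caches.append({
            "code": fields.get("name", ""),    # assumed to be the GC code
            "title": fields.get("desc", ""),
            "lat": float(wpt.get("lat")),
            "lon": float(wpt.get("lon")),
        })
    return caches


def load_zip(path_or_file):
    """Parse every .gpx member of a ZIP archive (path or file-like object)."""
    caches = []
    with zipfile.ZipFile(path_or_file) as zf:
        for member in zf.namelist():
            if member.lower().endswith(".gpx"):
                caches.extend(parse_gpx(zf.read(member)))
    return caches
```

Calling `load_zip("my_query.zip")` (a placeholder file name) returns a list of dicts that is easy to insert into a SQL table later. GSAK's own import still does the de-duplication, merging, and county lookup, so this is only for inspection.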
- Refresh the Cache Data via GSAK for All Caches – With a copy of all of the active Geocaches in your area now downloaded to a local database, you now need to perform a cache refresh. This is necessary because the number of Favorite Points awarded to a cache is not loaded when a Pocket Query is downloaded, and that number is needed for some of the most interesting analysis. Due to the API limitation of 6,000 Full Detail caches, it is necessary to set a filter based on the Placed Date such that you have just under 6,000 caches shown (in the status bar at the bottom of the screen). This may take a couple of days to complete due to the API limitations.
The GSAK Placed Date Filter with 6,000 Caches Shown.
Set the Scope to All in Current Filter and the format to Full details.
- Use Pocket Queries for the Initial Load of Any New Caches – The hard part is completed! Going forward, use a Pocket Query, running say once a week or so, to download any caches placed after the last date covered by your initial load. This will keep the database up to date as new caches are released.
- Refresh the Cache Data via GSAK at regular intervals – Periodically, especially after new caches are entered via a Pocket Query, you will want to go through and refresh your entire cache database using the Placed Date intervals. Unless you are doing analysis using cache logs, the Light format (limit of 10,000) is adequate, and you can combine Light and Full to do 16,000 refreshes per day. Should an active cache be retired (archived), these refreshes will mark it as such.
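To get a feel for how many days a complete refresh takes when Light and Full are combined, here is a rough planning sketch. It only splits a list of cache codes into daily batches using the 6,000 and 10,000 limits mentioned above. In GSAK itself you would still pick each batch with a Placed Date filter, and the function and the codes used here are made up for illustration.

```python
FULL_PER_DAY = 6000     # Full details limit quoted above
LIGHT_PER_DAY = 10000   # Light details limit quoted above


def plan_refresh(codes, full_per_day=FULL_PER_DAY, light_per_day=LIGHT_PER_DAY):
    """Split cache codes into daily batches: full quota first, then light."""
    if full_per_day + light_per_day <= 0:
        raise ValueError("at least one daily quota must be positive")
    plan = []
    i = 0
    while i < len(codes):
        full = codes[i:i + full_per_day]
        i += len(full)
        light = codes[i:i + light_per_day]
        i += len(light)
        plan.append({"full": full, "light": light})
    return plan


if __name__ == "__main__":
    # Made-up cache codes standing in for a local database of 20,000 caches.
    codes = [f"GC{n:05d}" for n in range(20000)]
    for day, batch in enumerate(plan_refresh(codes), start=1):
        print(f"day {day}: {len(batch['full'])} full, {len(batch['light'])} light")
```

If you do need cache logs for some caches, put those codes at the front of the list so they land in the Full batches.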