Zonal Summations of GBIF Species Occurrence
Data On GPUs
Analyzing how species are
distributed on the Earth has been one of the fundamental questions
in the intersections of environmental sciences, geosciences and
biological sciences. With world-wide data contributions, more than
375 million species occurrence records for nearly 1.5 million
species have been deposited to the Global Biodiversity Information
Facility (GBIF) data portal. The sheer amounts of point and
polygon data and the computation-intensive point-in-polygon tests
for zonal statistics for biodiversity studies have imposed
significant technical challenges. In this study, we have
significantly extended our previous work on parallel primitives
based spatial joins on commodity Graphics Processing Units (GPUs)
and have developed new efficient and scalable techniques to enable
parallel zonal statistics on the GBIF data completely on GPUs with
limited memory capacity. Experimental
results have shown that an impressive end-to-end response time
under 100 seconds can be achieved for zonal statistics on the 375+
million species records over 15+ thousand global eco-regions with
4+ million vertices on a single Nvidia Quadro 6000 GPU device. The
achieved high performance, which is several orders of magnitude
faster than reference serial implementations using traditional
open source geospatial techniques, not only demonstrates the
potential of GPU computing for large scale geospatial processing,
but also makes interactive query driven visual exploration of
global biodiversity data possible.
*Code cleanup is in progress. There are two CUDA files:
pt2cell_preprocess_gpu.cu: aligns points to grid cells
(grid-file indexing). This step can easily be ported to multi-core
CPUs by using the GNU parallel sort API and can support large point
datasets that are beyond GPU memory.
pip_spatial_join_gpu.cu: indexes polygon data (also using
grid-file indexing) and performs the spatial join (including
filtering and refinement). Variations using multiple GPUs and hybrid
CPU-GPU code were also developed but are not included.
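As a rough illustration of what aligning points to grid cells involves on a CPU, the following is a minimal C++ sketch and not the code in pt2cell_preprocess_gpu.cu. The names Point, CellIndex, cellOf and buildCellIndex are hypothetical, and so are the global [-180,180] x [-90,90] grid extent and the square cell size in degrees. std::stable_sort stands in for whichever parallel sort is actually used.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical types for illustration only.
struct Point { double lon, lat; };

struct CellIndex {
    std::vector<Point> points;          // points reordered so each cell is contiguous
    std::vector<std::uint32_t> cellIds; // distinct non-empty cell ids, ascending
    std::vector<std::uint32_t> counts;  // number of points in each listed cell
};

// Map a lon/lat pair to a row-major cell id on a regular global grid
// with square cells of cellDeg degrees (assumed layout).
std::uint32_t cellOf(const Point& p, double cellDeg) {
    assert(cellDeg > 0.0);
    const std::uint32_t cols = static_cast<std::uint32_t>(std::ceil(360.0 / cellDeg));
    const std::uint32_t rows = static_cast<std::uint32_t>(std::ceil(180.0 / cellDeg));
    std::int64_t col = static_cast<std::int64_t>(std::floor((p.lon + 180.0) / cellDeg));
    std::int64_t row = static_cast<std::int64_t>(std::floor((p.lat + 90.0) / cellDeg));
    // Clamp so that lon == 180 or lat == 90 fall into the last column/row.
    col = std::max<std::int64_t>(0, std::min<std::int64_t>(col, cols - 1));
    row = std::max<std::int64_t>(0, std::min<std::int64_t>(row, rows - 1));
    return static_cast<std::uint32_t>(row) * cols + static_cast<std::uint32_t>(col);
}

// Sort points by cell id, then run-length encode the sorted keys to get
// the list of non-empty cells and the number of points in each.
CellIndex buildCellIndex(const std::vector<Point>& pts, double cellDeg) {
    typedef std::pair<std::uint32_t, Point> Keyed;
    std::vector<Keyed> keyed;
    keyed.reserve(pts.size());
    for (const Point& p : pts) keyed.push_back(Keyed(cellOf(p, cellDeg), p));

    // A parallel sort could be substituted here; the comparison is unchanged.
    std::stable_sort(keyed.begin(), keyed.end(),
                     [](const Keyed& a, const Keyed& b) { return a.first < b.first; });

    CellIndex idx;
    idx.points.reserve(pts.size());
    for (const Keyed& kp : keyed) {
        if (idx.cellIds.empty() || idx.cellIds.back() != kp.first) {
            idx.cellIds.push_back(kp.first); // first point of a new cell
            idx.counts.push_back(0);
        }
        ++idx.counts.back();
        idx.points.push_back(kp.second);
    }
    return idx;
}
```

Calling buildCellIndex(points, 1.0) on an in-memory point vector yields the reordered points together with the cell identifiers and per-cell point counts, which is the kind of input the SI2 description below refers to.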
Baseline Serial Algorithms and Implementations
Descriptions of the baseline implementations, which use a traditional
open source software stack and run on a single processor:
1) Serial Implementation 1 (SI1) builds an R-Tree on the WWF
eco-region polygons (14,458 polygons and 16,838 rings) by using
libspatialindex. For each species occurrence, a point is created to
query against the R-Tree index structure using the
ISpatialIndex::intersectsWithQuery function provided by
libspatialindex. If the point intersects with any of the MBRs stored
in the leaf nodes of the R-Tree, the point is tested against the
polygons that are represented by those MBRs using the Contains
function in the OGRGeometry class provided by the GDAL/OGR library.
If the point is within a polygon, the polygon FeatureID will be
assigned to the point; otherwise -1 will be assigned instead.
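The filter-and-refine logic of SI1 can be sketched in plain C++ as follows. This is an illustrative stand-in, not the baseline code: a linear scan over precomputed MBRs replaces the libspatialindex R-Tree query, an even-odd ray-crossing test on a single ring replaces OGRGeometry::Contains, and the names Pt, Box, Polygon, mbrOf, pointInRing and assignFeatureId are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration only.
struct Pt { double x, y; };
struct Box { double minX, minY, maxX, maxY; };
struct Polygon {
    int featureId;
    std::vector<Pt> ring; // one outer ring; holes are ignored in this sketch
};

// Minimum bounding rectangle of a polygon's ring.
Box mbrOf(const Polygon& poly) {
    assert(!poly.ring.empty());
    Box b = {poly.ring[0].x, poly.ring[0].y, poly.ring[0].x, poly.ring[0].y};
    for (const Pt& v : poly.ring) {
        b.minX = std::min(b.minX, v.x);
        b.minY = std::min(b.minY, v.y);
        b.maxX = std::max(b.maxX, v.x);
        b.maxY = std::max(b.maxY, v.y);
    }
    return b;
}

// Filter step: does the MBR contain the point?
bool boxContains(const Box& b, const Pt& p) {
    return p.x >= b.minX && p.x <= b.maxX && p.y >= b.minY && p.y <= b.maxY;
}

// Refinement step: even-odd ray-crossing point-in-polygon test.
bool pointInRing(const std::vector<Pt>& ring, const Pt& p) {
    bool inside = false;
    for (std::size_t i = 0, j = ring.size() - 1; i < ring.size(); j = i++) {
        const Pt& a = ring[i];
        const Pt& b = ring[j];
        const bool straddles = (a.y > p.y) != (b.y > p.y);
        if (straddles && p.x < (b.x - a.x) * (p.y - a.y) / (b.y - a.y) + a.x)
            inside = !inside;
    }
    return inside;
}

// Return the FeatureID of the first polygon containing p, or -1 if none does.
int assignFeatureId(const std::vector<Polygon>& polys,
                    const std::vector<Box>& mbrs, const Pt& p) {
    assert(polys.size() == mbrs.size());
    for (std::size_t i = 0; i < polys.size(); ++i) {
        if (boxContains(mbrs[i], p) && pointInRing(polys[i].ring, p))
            return polys[i].featureId;
    }
    return -1;
}
```

Counting how many points receive each FeatureID then gives the zonal summation. In the real baseline the linear scan is replaced by the R-Tree, so only a few MBRs are examined per point.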
2) Serial Implementation 2 (SI2) adds an optimization technique we
have used for the GPU design, and it uses the same input as the GPU
implementation. Recall that in the GPU implementation, in addition
to the .xy file that stores the long/lat pairs of the point
coordinates, the raster cell identifiers and the number of points in
the raster cells are also part of the input. For each grid cell, SI2
first queries the cell's bounding box against the R-Tree index
structure and locates all polygons whose MBRs intersect with the
grid cell bounding box. For each point in the grid cell, the
Contains function in the OGRGeometry class is then applied against
each of these candidate polygons.
One might expect that SI2 will always be faster than SI1. However,
our experiments have shown otherwise. This is because, while SI2
reduces R-Tree query costs, it might increase the number of
point-in-polygon tests, which are more expensive than R-Tree queries
given that the R-Tree index structure is likely to be completely
cached in memory (the R-Tree structure is only about 2.4 MB). As a
matter of fact, there are points whose cell bounding boxes intersect
with some polygon MBRs even though the points themselves do not
intersect with these MBRs. As such, SI2 is likely to introduce false
positives that can be avoided by querying the points against the
polygon R-Tree individually.
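The trade-off can be made concrete by counting refinement (point-in-polygon) tests under the two filtering strategies. The sketch below is a simplified model under stated assumptions, not the baseline code: MBRs are scanned linearly instead of through an R-Tree, and the function names refinementsPerPoint and refinementsPerCell are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration only.
struct Pt { double x, y; };
struct Box { double minX, minY, maxX, maxY; };

bool boxContains(const Box& b, const Pt& p) {
    return p.x >= b.minX && p.x <= b.maxX && p.y >= b.minY && p.y <= b.maxY;
}

bool boxesIntersect(const Box& a, const Box& b) {
    return a.minX <= b.maxX && b.minX <= a.maxX &&
           a.minY <= b.maxY && b.minY <= a.maxY;
}

// SI1-style filtering: each point is checked against the MBRs on its own,
// so a refinement test is issued only for MBRs that contain the point.
std::size_t refinementsPerPoint(const std::vector<Pt>& pts,
                                const std::vector<Box>& mbrs) {
    std::size_t tests = 0;
    for (const Pt& p : pts)
        for (const Box& m : mbrs)
            if (boxContains(m, p)) ++tests;
    return tests;
}

// SI2-style filtering: the cell bounding box is checked once, and every
// point in the cell is then refined against every candidate polygon.
std::size_t refinementsPerCell(const Box& cell, const std::vector<Pt>& pts,
                               const std::vector<Box>& mbrs) {
    assert(cell.minX <= cell.maxX && cell.minY <= cell.maxY);
    std::size_t candidates = 0;
    for (const Box& m : mbrs)
        if (boxesIntersect(cell, m)) ++candidates;
    return candidates * pts.size();
}
```

For a cell covering (0,0)-(10,10), a single polygon MBR covering (8,8)-(20,20), and points at (1,1), (2,2) and (9,9), the per-point strategy issues one refinement test while the per-cell strategy issues three. The two extra tests are the false positives described above.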
Code, Data and Steps to Repeat Baseline Experiments
The following three programs (C++ code) allow interested readers to
repeat our experiments using a sample dataset.
1) The code for generating the R-Tree index structure (used for
both SI1 and SI2) and for the two baseline implementations can be
downloaded as a gzipped tar ball by following this link.
2) The sample dataset with 746,302 species occurrences for 25
species with numbers of occurrences between 10,000 and 100,000 can
be downloaded as a gzipped tar ball by following this
link (G45_G01). Note that not all files in the tar ball are used;
please contact me if you are curious what the files are used for.
Please visit the GBIF data portal for more detailed information.
3) A copy of the WWF ecoregion shapefile can be downloaded by
following this link
(Warning: 51 MB). Please acknowledge WWF by following the details at
4) Two additional point datasets, one smaller and one larger
(9,397,443), are now provided for additional evaluations. They can
be accessed by following these two links: G45_G02
(warning: 8.7 MB).
To repeat our baseline experiments:
1) download and unpack the above three tar balls into your directory.
2) install the libspatialindex and GDAL/OGR libraries and make the
shared library files (.so files under Linux) accessible by modifying
the corresponding environment variable (or specify their paths on
the g++ command line using -L).
3) change the paths to the WWF eco-region shapefile in the three
C++ source programs (switching to command-line arguments is in
progress).
4) compile the three C++ programs by following the example command
line for compilation at the beginning of each program.
5) run GenPolyRTree first to generate the polygon index (newrtp.*
under the current directory).
6) run PIPGDALTest1 and/or PIPGDALTest2 by following the example
command line for execution at the beginning of each program. Please
note the differences between the two.
Jianting Zhang, Simin You* and Le Gruenwald (2015).
Efficient Parallel Zonal Statistics on Large-Scale Global
Biodiversity Data on GPUs. Proceedings of 2015 ACM SIGSPATIAL International
Workshop on Analytics for Big Geospatial Data, 10 formatted
pages, Nov. 3, Seattle, WA, USA (Technical report version