In this project, we extend Cloudera Impala with big spatial data processing capacity. ISP contains a set of UDFs (User-Defined Functions) to support spatial operations. In addition to spatial extensions using UDFs, we have also developed an efficient parallel spatial join module, which is accelerated by not only multicore CPUs but also GPUs. Since ISP is natively developed in C++ as Impala does, no additional overhead will be imposed when running the GPU module. Because of current inefficient Java support on GPUs, ISP has great potentials to outperform other GPU enabled big spatial data solutions that use Java based systems. The goal of ISP is to combine both Big Data technology and GPU parallel computing for large scale spatial data processing.
To support spatial operations, we have developed a set of UDFs using the existing open source package, i.e., GEOS. Spatial operations, such as
ST_Intersect, can be easily developed by wrapping GEOS APIs. The geometry objects are stored as text format, i.e., Well-Known Text format (WKT). During the process, the text strings are first parsed as geometry objects and then relative operations can be performed. However, we found GEOS has performance issue and cannot be used on the GPU. In ISP, we have adopted array based data structure for storing in-memory geometry objects. The design makes the geometry data can be easily streamed between CPU memory and GPU memory. Meanwhile, direct manipulation on the geometry objects for spatial operations is much faster than using the GEOS library. Based on our benchmark, significant improvement on point-in-polygon test is achieved by using the array based representations.
Although spatial operations have been provided, directly performing spatial join using cross product of two datasets will be very slow. In order to efficiently process large-scale datasets, the naive cross spatial join cannot meet the requirement. In ISP, we have developed indexed spatial join where spatial indexes are created on-demand. By integrating spatial index, the spatial join can be improved significantly. The spatial join module we have implemented is a broadcast spatial join, which is also developed in our SpatialSpark project. Ideally, this kind of spatial join is more suitable for joining a big dataset with a considerable small dataset (usually it fits in main memory). Similar to hash join in original Impala, our spatial join module is also parallelized for multi-core CPUs. Meanwhile, we also incorporate our previous GPU accelerated spatial join work to seek higher performance.
More details can be found from our technical reports.
We have tested several tasks on a 10-noode Amazon EC2 cluster (g2.2xlarge). For aggregating New York City Taxi data (~170 million points, from here) with census blocks (~40k polygons, data from here), ISP with GPU enabled can achieve 5x faster than SpatialSpark.
Download data and format it to tab-separated files, representing geometries using WKT format. Then import data to HDFS, and create appropriate table schema in Impala.
1. start HDFS
2. start PostgreSQL for Hive metastore
3. start Hive metastore
hive --service metastore
4. start Impala catalogd and statestored
5. start impalad
-geos=true will use GEOS library for geometry operation otherwise set it to
false to use our own implementation.
-mem_limit=15g will set memory limit for impalad.
Enable Spatial UDFs
/lib/ on HDFS, and import it in impala-shell.
create function st_within(string, string) returns boolean location '/lib/impala-sp.ll' symbol='ST_Within';
In impala shell, run
select count(*) from taxi_geom spatial join nycb where st_within(taxi_geom.geom, nycb.wkt);
For a small example, run
select count(*) from point spatial join pluto where st_within(point.geom, pluto.wkt);
We are implementing partitioned spatial join to support joining two big datasets as we did in our SpatialSpark. On the other hand, we are desigining spatial index support in Impala for efficient spatial range query using both multi-core GPUs and GPUs.
ISP is based on Impala 2.0 (cloned at 9/16/2014), additional required libraries include:
In ISP-MC, UDFs are wrappers of GEOS functions. For instance,
ST_Within corresponds to
within function in the geometry API. Since GEOS is not used for performing geometry operations in ISP-MC+ and ISP-GPU, UDFs are used as placeholders to support spatial query syntax.
To compile ISP-MC+ and ISP-GPU, the step is the same as compiling original Impala.
build_pip.shfor select ISP-MC+ and ISP-GPU
source set-impala.shfor environment setup first)