ISP: Big Spatial Data Processing On Impala Using Multicore CPUs and GPUs

Introduction

In this project, we extend Cloudera Impala with big spatial data processing capacity. ISP contains a set of UDFs (User-Defined Functions) to support spatial operations. In addition to spatial extensions using UDFs, we have also developed an efficient parallel spatial join module, which is accelerated by not only multicore CPUs but also GPUs. Since ISP is natively developed in C++ as Impala does, no additional overhead will be imposed when running the GPU module. Because of current inefficient Java support on GPUs, ISP has great potentials to outperform other GPU enabled big spatial data solutions that use Java based systems. The goal of ISP is to combine both Big Data technology and GPU parallel computing for large scale spatial data processing.

Spatial Extension

To support spatial operations, we have developed a set of UDFs using the existing open source package, i.e., GEOS. Spatial operations, such as ST_Intersect, can be easily developed by wrapping GEOS APIs. The geometry objects are stored as text format, i.e., Well-Known Text format (WKT). During the process, the text strings are first parsed as geometry objects and then relative operations can be performed. However, we found GEOS has performance issue and cannot be used on the GPU. In ISP, we have adopted array based data structure for storing in-memory geometry objects. The design makes the geometry data can be easily streamed between CPU memory and GPU memory. Meanwhile, direct manipulation on the geometry objects for spatial operations is much faster than using the GEOS library. Based on our benchmark, significant improvement on point-in-polygon test is achieved by using the array based representations.

Parallel Spatial Join

Although spatial operations have been provided, directly performing spatial join using cross product of two datasets will be very slow. In order to efficiently process large-scale datasets, the naive cross spatial join cannot meet the requirement. In ISP, we have developed indexed spatial join where spatial indexes are created on-demand. By integrating spatial index, the spatial join can be improved significantly. The spatial join module we have implemented is a broadcast spatial join, which is also developed in our SpatialSpark project. Ideally, this kind of spatial join is more suitable for joining a big dataset with a considerable small dataset (usually it fits in main memory). Similar to hash join in original Impala, our spatial join module is also parallelized for multi-core CPUs. Meanwhile, we also incorporate our previous GPU accelerated spatial join work to seek higher performance.

broadcast

More details can be found from our technical reports.

Performance

We have tested several tasks on a 10-noode Amazon EC2 cluster (g2.2xlarge). For aggregating New York City Taxi data (~170 million points, from here) with census blocks (~40k polygons, data from here), ISP with GPU enabled can achieve 5x faster than SpatialSpark.

Repeat Experiment on Taxi Data

Preprocessing
Download data and format it to tab-separated files, representing geometries using WKT format. Then import data to HDFS, and create appropriate table schema in Impala.

Environment Setup
1. start HDFS
2. start PostgreSQL for Hive metastore
3. start Hive metastore hive --service metastore
4. start Impala catalogd and statestored bin/start-catalogd.sh and bin/start-statestored.sh
5. start impalad bin/start-impalad.sh, -geos=true will use GEOS library for geometry operation otherwise set it to false to use our own implementation. -mem_limit=15g will set memory limit for impalad.

Enable Spatial UDFs
Upload impala-sp.ll to /lib/ on HDFS, and import it in impala-shell.
create function st_within(string, string) returns boolean location '/lib/impala-sp.ll' symbol='ST_Within';

Run Query
In impala shell, run
select count(*) from taxi_geom spatial join nycb where st_within(taxi_geom.geom, nycb.wkt);

For a small example, run
select count(*) from point spatial join pluto where st_within(point.geom, pluto.wkt);

Work In Progress

We are implementing partitioned spatial join to support joining two big datasets as we did in our SpatialSpark. On the other hand, we are desigining spatial index support in Impala for efficient spatial range query using both multi-core GPUs and GPUs.

Code

Dependency

ISP is based on Impala 2.0 (cloned at 9/16/2014), additional required libraries include:

Spatial UDFs using GEOS

In ISP-MC, UDFs are wrappers of GEOS functions. For instance, ST_Within corresponds to within function in the geometry API. Since GEOS is not used for performing geometry operations in ISP-MC+ and ISP-GPU, UDFs are used as placeholders to support spatial query syntax.

  1. Download and unpack source code from here
  2. Setup Impala compile environment source set-impala.sh
  3. cmake . and make
  4. Follow steps introduced previously

ISP-MC+ and ISP-GPU

To compile ISP-MC+ and ISP-GPU, the step is the same as compiling original Impala.

  1. Download and unpack source code from here (WARNING 1.7GB)
  2. Run build_pip.sh for select ISP-MC+ and ISP-GPU
  3. Proceed with normal Impala build steps (run source set-impala.sh for environment setup first)
  4. The major changes for Impala can be found in backend.diff2 file

Related Technical Reports

Contact