In the last few years, the volume of remote sensing data has increased significantly.

This growth has primarily been driven by an increase in the number of satellite launches and by drastic improvements in sensing technology (e.g., better cameras with hyperspectral imaging, synthetic aperture radar, and lidar). With improved resolution and spatiotemporal coverage comes demand for consuming the derived insights in business-critical systems across a wide range of industries, including insurance, urban planning, agriculture, climate change, and flood prevention. This creates a need for a spatiotemporal technology that can seamlessly ingest, store, and query terabytes to petabytes of remotely sensed data in a cost-efficient and developer-friendly manner.

A cost-efficient and developer-friendly solution

Although several IBM products (e.g., IBM Event Streams, IBM Db2, and IBM Db2 Warehouse) provide spatial and time series functionality, the cloud poses a unique challenge due to the disaggregation of the compute and (low-cost) storage layers. Typical cloud data lakes support indexing and queries on single-dimensional data, which falls short for multi-dimensional data such as geospatial data, where indexes and queries must span several dimensions at once.

Satellite data: Background

Remote sensing data comes in two flavors: (a) raster data and (b) vector data. Raster data is imagery, while vector data is geometries, where each geometry (typically a point, line segment, or polygon) has a set of values associated with it. An example raster image from the popular MODIS Terra product, depicting NDVI (Normalized Difference Vegetation Index, a simple graphical indicator of green vegetation) values for the New York region, is shown below:

Raster data is consumed in the form of images in downstream applications, for example, by a deep learning pipeline that extracts building footprints from RGB (visual) satellite imagery.

Vector data is extremely useful when performing analytic queries on the geometries and their values, for example, aligning the MODIS Aqua (water) layer with the cloud cover layer in order to enhance data quality. Such operations require the ability to evaluate functions like interpolation and forecasting on the values of these point geometries (pixels).

Both of these access patterns, image retrieval and analytic queries, are necessary to support the consumption of satellite data on a cloud platform.

Serverless architecture

We designed a uniform serverless pipeline for ingesting, storing, and querying remotely sensed satellite data on IBM Cloud. This approach utilizes the cheapest forms of cloud storage and compute while providing a simple developer experience for managing terabytes to petabytes of remote sensing data. The proposed architecture is depicted below:

Satellite data is a continuous data stream collected periodically through ground stations. After the signals are processed, the data is typically in a standard scientific format (e.g., HDF, netCDF, or GeoTIFF). As satellite imagery is global, it is usually timestamped and in the WGS84 coordinate system. Storing this raw data in object storage is not amenable to fast and efficient querying; the data needs to be transformed into a queryable format, such as Parquet or cloud-optimized GeoTIFF.

Our approach is to use IBM Cloud Object Storage to store the cloud-optimized GeoTIFF and Parquet data so that it is amenable to efficient queries downstream. The transformation from raw data formats to these big data formats is performed using IBM Cloud Code Engine to accommodate elastic scaling as well as the periodic, “spiky” activity.

IBM Cloud Code Engine is IBM’s Knative-based serverless platform for executing massively parallel jobs. Once the data is transformed, a cloud function can be invoked to query raster imagery given a region and time range. This leverages the ability of Cloud Object Storage to perform byte-range reads and return only the relevant data (for both cloud-optimized GeoTIFF and Parquet), reducing download costs as well as query times.
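
As an illustration, here is a minimal sketch of such a byte-range read against Cloud Object Storage using the S3-compatible ibm_boto3 client (the credentials, bucket, and object key below are placeholders, and the key layout is an assumption):

import ibm_boto3
from ibm_botocore.client import Config

# S3-compatible client for IBM Cloud Object Storage
# (HMAC credentials and endpoint are placeholders)
cos = ibm_boto3.client(
    's3',
    aws_access_key_id=access_key_id,
    aws_secret_access_key=secret_access_key,
    endpoint_url=endpoint,
    config=Config(signature_version='s3v4'),
)

# fetch only the first 16 KB of a cloud-optimized GeoTIFF (header and tile index)
# instead of downloading the entire object
resp = cos.get_object(
    Bucket=bucket,
    Key='layer=MOD13Q1_250m_16_days_NDVI/date=20200609/part-0.tif',  # hypothetical key
    Range='bytes=0-16383',
)
header = resp['Body'].read()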

Finally, in order to perform analytic queries like aligning layers and computing aggregations, we use IBM Cloud SQL Query and its geospatial data skipping features. SQL Query is a serverless ETL tool for analytic queries on big data at rest. It natively supports geospatial and time series analytic queries and reads only the spatial and temporal data relevant to a query, which further reduces download costs and query times.
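
As a sketch, such a query can be submitted with the ibmcloudsql Python client (the API key, instance CRN, and target Cloud Object Storage URL are placeholders):

import ibmcloudsql

# connect to a SQL Query instance; query results are written to the target COS location
sql_client = ibmcloudsql.SQLQuery(api_key, instance_crn, target_cos_url)
sql_client.logon()

# with a geospatial metaindex in place (see "Create spatial index on Parquet" below),
# only Parquet objects whose lat/lon ranges intersect the predicates are read
df = sql_client.run_sql("""
SELECT lat, lon, date, value
FROM ndvi
WHERE lat > 40.6 AND lat < 41.4 AND lon > -74.4 AND lon < -73.6
""")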

Examples

Below, we highlight code snippets that demonstrate the various steps of the above workflow.

Transform raw data to COG

# import libraries (reproject is assumed to live in pyrip.transform alongside hdf_to_tif)
from pyrip.transform import hdf_to_tif, reproject
from pyrip.cogeo import optimize

# various raster formats (GeoTIFF, HDF, netCDF, SAFE, etc.) are supported;
# raw files are converted to GeoTIFF first
raw_tif_files = hdf_to_tif(raw_file, match_substrs=['NDVI', 'EVI'])
for raw_tif_file in raw_tif_files:
    # reproject the image to EPSG:4326
    reprojected_tif_file = reproject(raw_tif_file)
    # optimize the GeoTIFF into a cloud-optimized GeoTIFF (COG)
    optimized_tif_file = optimize(reprojected_tif_file)
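
The optimized files can then be uploaded to Cloud Object Storage. Below is a sketch using the S3-compatible ibm_boto3 client; the layer=/date= key layout mirrors the VRT path used in the imagery query further down, while the bucket, prefix, layer, date, and credentials are placeholders:

import os
import ibm_boto3

cos = ibm_boto3.client('s3', aws_access_key_id=access_key_id,
                       aws_secret_access_key=secret_access_key,
                       endpoint_url=endpoint)

# partition objects by layer and date so that imagery queries can address them directly
key = '{}/layer={}/date={}/{}'.format(prefix, layer, date, os.path.basename(optimized_tif_file))
cos.upload_file(optimized_tif_file, bucket, key)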

Transform raw data to Parquet

# import libraries
from pyrip.transform import tif_to_df

# tif_to_df and df_to_tif provide easy transformations between the raster and vector formats
df = tif_to_df(reprojected_tif_file)
# save to a Parquet file
df.to_parquet(parquet_file, engine='pyarrow', compression='snappy', index=False)
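
As a quick sanity check, the table can be read back with pandas; each row is one pixel represented as a point geometry with its value (the lat, lon, and value columns are the ones used by the analytic queries below):

import pandas as pd

# each row is one pixel of the raster, stored as a point with its value
df = pd.read_parquet(parquet_file)
print(df.head())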

Create spatial index on Parquet

CREATE METAINDEX
MINMAX FOR lat,
MINMAX FOR lon
ON TABLE ndvi
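
With this metaindex in place, SQL Query keeps the minimum and maximum lat and lon per Parquet object and, at query time, skips any object whose ranges cannot satisfy the query’s spatial predicates, so only the relevant objects are downloaded and scanned.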

Imagery query

# import pyrip's imagery query helper (the module path is an assumption)
from pyrip.query import query

# specify layer, date (or date range), and region of interest
layer = 'MOD13Q1_250m_16_days_NDVI'
date = '20200609'
bbox = (-74.4, 40.6, -73.6, 41.4)  # (min lon, min lat, max lon, max lat)
result_tif = 'result.tif'

# path to the VRT index for this layer and date in Cloud Object Storage
vrt_file = '/vsis3/{}/{}/layer={}/date={}/index.vrt'.format(bucket, prefix, layer, date)
query(vrt_file, result_tif, bbox, access_key_id=access_key_id, secret_access_key=secret_access_key, endpoint=endpoint)
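
Here, query resolves the VRT index for the given layer and date, reads only the byte ranges that intersect the bounding box (as described above), and writes the clipped raster to result.tif.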

Analytical query

When layers are naturally aligned, we can join multiple layers directly:

# specify date range and region of interest
start_date = '20180915'
end_date = '20180930'
bbox = (-112, 41.5, -111, 42.5)

# this query can be run with either Spark SQL or IBM Cloud SQL Query
query = """
SELECT intr.lat, intr.lon, intr.date, 
CASE
    WHEN mask.value = 0 THEN intr.value
    WHEN mask.value > 7 THEN 0
    ELSE 9
END AS value
FROM 
    (SELECT *
    FROM landsat
    WHERE layer = 'INTR' AND date >= {0} AND date <= {1} AND lon > {2} AND lat > {3} AND lon < {4} AND lat < {5}) intr, 
    (SELECT *
    FROM landsat
    WHERE layer = 'MASK' AND date >= {0} AND date <= {1} AND lon > {2} AND lat > {3} AND lon < {4} AND lat < {5}) mask 
WHERE intr.lat = mask.lat AND intr.lon = mask.lon AND intr.date = mask.date
""".format(start_date, end_date, *bbox)

When layers are not naturally aligned, we can join them by using geohashes. The geohash bit depth gives us full flexibility over the spatial resolution at which the layers are aligned; each geohash character encodes 5 bits, so a bit depth of 35 corresponds to a 7-character geohash, i.e., cells of roughly 150 m by 150 m:

# specify geohash bit depth, which controls the spatial resolution of layer merging
bit_depth = 35

# this query can be run with either Spark SQL or IBM Cloud SQL Query
query = """
SELECT
ST_Y(ST_BoundingBoxCenter(ST_Envelope(ST_GeohashDecode(scl.geohash)))) AS lat,
ST_X(ST_BoundingBoxCenter(ST_Envelope(ST_GeohashDecode(scl.geohash)))) AS lon,
CASE
    WHEN scl.is_vegetation THEN (nir.value - red.value) / (nir.value + red.value)
    ELSE -1
END AS ndvi
FROM
    (SELECT substr(geohash_binary_str, 0, {0}) as geohash, array_contains(collect_list(value), 4) as is_vegetation
    FROM scl
    GROUP BY substr(geohash_binary_str, 0, {0})) scl,
    
    (SELECT substr(geohash_binary_str, 0, {0}) as geohash, avg(value) as value
    FROM nir
    GROUP BY substr(geohash_binary_str, 0, {0})) nir,
    
    (SELECT substr(geohash_binary_str, 0, {0}) as geohash, avg(value) as value
    FROM red
    GROUP BY substr(geohash_binary_str, 0, {0})) red
WHERE scl.geohash = nir.geohash AND nir.geohash = red.geohash
""".format(bit_depth)

Conclusion

In this post, we have covered how to efficiently ingest, query, and analyze remote sensing data in a cloud-native way on IBM Cloud. Remote sensing data plays an increasingly important role in deriving business insights across industries, and we believe the cloud is the right place to process and analyze it, since an arbitrary amount of data can be ingested and queried in a serverless way.

Ingesting and analyzing an arbitrary volume is crucial because remote sensing data can be extremely large and grows significantly on a daily basis. Querying in a serverless fashion is just as important: you pay only for the data you need, no matter how much underlying data you have, which is very cost-efficient (e.g., $5/TB for SQL Query).
