Switching from GDAL’s Python bindings
This document is written specifically for users of GDAL’s Python bindings
(osgeo.gdal
) who have read about Rasterio’s philosophy and
want to know what switching entails. The good news is that switching may not
be complicated. This document explains the key similarities and differences
between these two Python packages and highlights the features of Rasterio that
can help in switching.
Mutual Incompatibilities
Rasterio and GDAL’s bindings can contend for global GDAL objects. Unless you
have deep knowledge about both packages, choose exactly one of import
osgeo.gdal
or import rasterio
.
GDAL’s bindings (gdal
for the rest of this document) and Rasterio are not
entirely compatible and should not, without a great deal of care, be imported
and used in a single Python program. The reason is that the dynamic library
they each load (these are C extension modules, remember), libgdal.so
on
Linux, gdal.dll
on Windows, has a number of global objects and the two
modules take different approaches to managing these objects.
Static linking of the GDAL library for gdal
and rasterio
can avoid
this contention, but in practice you will almost never see distributions of
these modules that statically link the GDAL library.
Beyond the issues above, the modules have different styles – gdal
reads and
writes like C while rasterio
is more Pythonic – and don’t complement each
other well.
The GDAL Environment
GDAL library functions are excuted in a context of format drivers, error
handlers, and format-specific configuration options that this document will
call the “GDAL Environment.” Rasterio has an abstraction for the GDAL
environment, gdal
does not.
With gdal
, this context is initialized upon import of the module. This
makes sense because gdal
objects are thin wrappers around functions and
classes in the GDAL dynamic library that generally require registration of
drivers and error handlers. The gdal
module doesn’t have an abstraction
for the environment, but it can be modified using functions like
gdal.SetErrorHandler()
and gdal.UseExceptions()
.
Rasterio has modules that don’t require complete initialization and
configuration of GDAL (rasterio.dtypes
, rasterio.profiles
, and
rasterio.windows
, for example) and in the interest of reducing overhead
doesn’t register format drivers and error handlers until they are needed. The
functions that do need fully initialized GDAL environments will ensure that
they exist. rasterio.open()
is the foremost of this category of functions.
Consider the example code below.
import rasterio
# The GDAL environment has no registered format drivers or error
# handlers at this point.
with rasterio.open('example.tif') as src:
# Format drivers and error handlers are registered just before
# open() executes.
Importing rasterio
does not initialize the GDAL environment. Calling
rasterio.open()
does. This is different from gdal
where import
osgeo.gdal
, not osgeo.gdal.Open()
, initializes the GDAL environment.
Rasterio has an abstraction for the GDAL environment, rasterio.Env
, that
can be invoked explicitly for more control over the configuration of GDAL as
shown below.
import rasterio
# The GDAL environment has no registered format drivers or error
# handlers at this point.
with rasterio.Env(CPL_DEBUG=True, GDAL_CACHEMAX=128000000):
# This ensures that all drivers are registered in the global
# context. Within this block *only* GDAL's debugging messages
# are turned on and the raster block cache size is set to 128 MB.
with rasterio.open('example.tif') as src:
# Perform GDAL operations in this context.
# ...
# Done.
# At this point, configuration options are set back to their
# previous (possibly unset) values. The raster block cache size
# is returned to its default (5% of available RAM) and debugging
# messages are disabled.
As mentioned previously, gdal
has no such abstraction for the GDAL
environment. The nearest approximation would be something like the code
below.
from osgeo import gdal
# Define a new configuration, save the previous configuration,
# and then apply the new one.
new_config = {
'CPL_DEBUG': 'ON', 'GDAL_CACHEMAX': '512'}
prev_config = {
key: gdal.GetConfigOption(key) for key in new_config.keys()}
for key, val in new_config.items():
gdal.SetConfigOption(key, val)
# Perform GDAL operations in this context.
# ...
# Done.
# Restore previous configuration.
for key, val in prev_config.items():
gdal.SetConfigOption(key, val)
Rasterio achieves this with a single Python statement.
with rasterio.Env(CPL_DEBUG=True, GDAL_CACHEMAX=512000000):
# ...
Please note that to the Env class, GDAL_CACHEMAX
is strictly an integer number of bytes. GDAL’s shorthand notation is not supported.
Format Drivers
gdal
provides objects for each of the GDAL format drivers. With Rasterio,
format drivers are represented by strings and are used only as arguments to
functions like rasterio.open()
.
dst = rasterio.open('new.tif', 'w', format='GTiff', **kwargs)
Rasterio uses the same format driver names as GDAL does.
Dataset Identifiers
Rasterio uses URIs to identify datasets, with schemes for different protocols. The GDAL bindings have their own special syntax.
Unix-style filenames such as /var/data/example.tif
identify dataset files
for both Rasterio and gdal
. Rasterio also accepts ‘file’ scheme URIs
like file:///var/data/example.tif
.
Rasterio identifies datasets within ZIP or tar archives using Apache VFS style
identifiers like zip:///var/data/example.zip!example.tif
or
tar:///var/data/example.tar!example.tif
.
Datasets served via HTTPS are identified using ‘https’ URIs like
https://landsat-pds.s3.amazonaws.com/L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B1.TIF
.
Datasets on AWS S3 are identified using ‘s3’ scheme identifiers like
s3://landsat-pds/L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B1.TIF
.
With gdal
, the equivalent identifiers are respectively
/vsizip//var/data/example.zip/example.tif
,
/vsitar//var/data/example.tar/example.tif
,
/vsicurl/landsat-pds.s3.amazonaws.com/L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B1.TIF
,
and
/vsis3/landsat-pds/L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B1.TIF
.
To help developers switch, Rasterio will accept these identifiers and other format-specific connection strings, too, and dispatch them to the proper format drivers and protocols.
Dataset Objects
Rasterio and gdal
each have dataset objects. Not the same classes, of
course, but not radically different ones. In each case, you generally get
dataset objects through an “opener” function: rasterio.open()
or
gdal.Open()
.
So that Python developers can spend less time reading docs, the dataset object
returned by rasterio.open()
is modeled on Python’s file object. It even has
the close()
method that gdal
lacks so that you can actively close
dataset connections.
Bands
gdal
and Rasterio
both have band objects.
But unlike gdal’s band, Rasterio’s band is just a tuple of the dataset,
band index and some other band properties.
Thus Rasterio never has objects with dangling dataset pointers.
With Rasterio, bands are represented by a numerical
index, starting from 1 (as GDAL does), and are used as arguments to dataset
methods. To read the first band of a dataset as a numpy.ndarray
, do this.
with rasterio.open('example.tif') as src:
band1 = src.read(1)
A band object can be used to represent a single band (or a sequence of bands):
with rasterio.open('example.tif') as src:
bnd = rasterio.band(src, 1)
print(bnd.dtype)
Other attributes of GDAL band objects generally surface in Rasterio as tuples returned by dataset attributes, with one value per band, in order.
>>> src = rasterio.open('example.tif')
>>> src.indexes
(1, 2, 3)
>>> src.dtypes
('uint8', 'uint8', 'uint8')
>>> src.descriptions
('Red band', 'Green band', 'Blue band')
>>> src.units
('DN', 'DN', 'DN')
Developers that want read-only band objects for their applications can create them by zipping these tuples together.
from collections import namedtuple
Band = namedtuple('Band', ['idx', 'dtype', 'description', 'units'])
src = rasterio.open('example.tif')
bands = [Band(vals) for vals in zip(
src.indexes, src.dtypes, src.descriptions, src.units)]
Namedtuples are like lightweight classes.
>>> for band in bands:
... print(band.idx)
...
1
2
3
Geotransforms
The DatasetReader.transform
attribute is comparable to the
GeoTransform
attribute of a GDAL dataset, but Rasterio’s has more power.
It’s not just an array of affine transformation matrix elements, it’s an
instance of an Affine
class and has many handy methods. For example, the
spatial coordinates of the upper left corner of any raster element is the
product of the DatasetReader.transform
matrix and the (column, row)
index
of the element.
>>> src = rasterio.open('example.tif')
>>> src.transform * (0, 0)
(101985.0, 2826915.0)
The affine transformation matrix can be inverted as well.
>>> ~src.transform * (101985.0, 2826915.0)
(0.0, 0.0)
To help developers switch, Affine
instances can be created from or
converted to the sequences used by gdal
.
>>> from rasterio.transform import Affine
>>> Affine.from_gdal(101985.0, 300.0379266750948, 0.0,
... 2826915.0, 0.0, -300.041782729805).to_gdal()
...
(101985.0, 300.0379266750948, 0.0, 2826915.0, 0.0, -300.041782729805)
Coordinate Reference Systems
The DatasetReader.crs
attribute is an instance of Rasterio’s
CRS()
class and works well with pyproj
.
>>> from pyproj import Transformer
>>> src = rasterio.open('example.tif')
>>> transformer = Transformer.from_crs(src.crs, "EPSG:3857", always_xy=True)
>>> transformer.transfform(101985.0, 2826915.0)
(-8789636.707871985, 2938035.238323653)
Offsets and Windows
Rasterio adds an abstraction for subsets or windows of a raster array that
GDAL does not have. A window is a pair of tuples, the first of the pair being
the raster row indexes at which the window starts and stops, the second being
the column indexes at which the window starts and stops. Row before column,
as with ndarray
slices. Instances of Window
are created by passing the
four subset parameters used with gdal
to the class constructor.
src = rasterio.open('example.tif')
xoff, yoff = 0, 0
xsize, ysize = 10, 10
subset = src.read(1, window=Window(xoff, yoff, xsize, ysize))
Valid Data Masks
Rasterio provides an array for every dataset representing its valid data mask
using the same indicators as GDAL: 0
for invalid data and 255
for valid
data.
>>> src = rasterio.open('example.tif')
>>> src.dataset_mask()
array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]], dtype-uint8)
Arrays for dataset bands can also be had as a numpy.ma.MaskedArray
.
>>> src.read(1, masked=True)
masked_array(data =
[[-- -- -- ..., -- -- --]
[-- -- -- ..., -- -- --]
[-- -- -- ..., -- -- --]
...,
[-- -- -- ..., -- -- --]
[-- -- -- ..., -- -- --]
[-- -- -- ..., -- -- --]],
mask =
[[ True True True ..., True True True]
[ True True True ..., True True True]
[ True True True ..., True True True]
...,
[ True True True ..., True True True]
[ True True True ..., True True True]
[ True True True ..., True True True]],
fill_value = 0)
Where the masked array’s mask
is True
, the data is invalid and has been
masked “out” in the opposite sense of GDAL’s mask.
Errors and Exceptions
Rasterio always raises Python exceptions when an error occurs and never returns
an error code or None
to indicate an error. gdal
takes the opposite
approach, although developers can turn on exceptions by calling
gdal.UseExceptions()
.