Extend geospatial questions in Amazon Athena with UDFs and AWS Lambda

Amazon Athena is a serverless, interactive query service that lets you easily analyze data in Amazon Simple Storage Service (Amazon S3) and 25-plus data sources, including on-premises data sources or other cloud systems, using SQL or Python. Athena's built-in capabilities include querying geospatial data; for example, you can count the number of earthquakes in each California county. One drawback of analyzing at the county level is that it can give a misleading impression of which parts of California have had the most earthquakes. This is because counties aren't equally sized; a county may have had more earthquakes simply because it's a big county. What if we wanted a hierarchical system that allowed us to zoom in and out to aggregate data over different equally sized geographic areas?

In this post, we present a solution that uses Uber's Hexagonal Hierarchical Spatial Index (H3) to divide the globe into equally sized hexagons. We then use an Athena user-defined function (UDF) to determine which hexagon each historical earthquake occurred in. Because the hexagons are equally sized, this analysis gives a fair impression of where earthquakes tend to occur.

At the end, we'll produce a visualization like the one below that shows the number of historical earthquakes in different areas of the western United States.

H3 divides the globe into equal-sized regular hexagons. The number of hexagons depends on the chosen resolution, which may vary from 0 (122 hexagons, each with edge lengths of about 1,100 km) to 15 (569,707,381,193,162 hexagons, each with edge lengths of about 50 cm). H3 enables analysis at the area level, and each area has the same size and shape.
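The arithmetic behind those numbers can be sanity-checked: each step down in resolution splits a hexagon into roughly seven children, so cell counts grow about 7x per level while edge lengths shrink by about the square root of 7. Here is a back-of-envelope sketch (an approximation for intuition only, not the exact figures the H3 library reports):

```python
def approx_cell_count(resolution: int) -> int:
    """Rough H3 cell count: 122 base cells, roughly 7x more per resolution step."""
    return 122 * 7 ** resolution

def approx_edge_km(resolution: int) -> float:
    """Rough edge length in km: ~1,100 km at resolution 0, shrinking ~sqrt(7) per step."""
    return 1100 / (7 ** 0.5) ** resolution

for res in (0, 4, 15):
    print(res, approx_cell_count(res), approx_edge_km(res))
```

At resolution 15 this estimate gives about 5.8 x 10^14 cells with roughly half-meter edges, within a couple of percent of the exact figures quoted above.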

Solution overview

The solution extends Athena's built-in geospatial capabilities by creating a UDF powered by AWS Lambda. Finally, we use an Amazon SageMaker notebook to run Athena queries that are rendered as a choropleth map. The following diagram illustrates this architecture.

The end-to-end architecture is as follows:

  1. A CSV file of historical earthquakes is uploaded into an S3 bucket.
  2. An AWS Glue external table is created based on the earthquake CSV.
  3. A Lambda function calculates H3 hexagons for the parameters (latitude, longitude, resolution). The function is written in Java and can be called as a UDF in Athena queries.
  4. A SageMaker notebook uses the AWS SDK for pandas package to run a SQL query in Athena, including the UDF.
  5. A Plotly Express package renders a choropleth map of the number of earthquakes in each hexagon.


For this post, we use Athena to read data in Amazon S3 using the table defined in the AWS Glue Data Catalog associated with our earthquake dataset. In terms of permissions, there are two main requirements:

Configure Amazon S3

The first step is to create an S3 bucket to store the earthquake dataset, as follows:

  1. Download the CSV file of historical earthquakes from GitHub.
  2. On the Amazon S3 console, choose Buckets in the navigation pane.
  3. Choose Create bucket.
  4. For Bucket name, enter a globally unique name for your data bucket.
  5. Choose Create folder, and enter the folder name earthquakes.
  6. Upload the file to the S3 bucket. In this example, we upload the earthquakes.csv file to the earthquakes prefix.

Create a table in Athena

Navigate to the Athena console to create a table. Complete the following steps:

  1. On the Athena console, choose Query editor.
  2. Select your preferred Workgroup using the drop-down menu.
  3. In the SQL editor, use the following code to create a table in the default database:
     CREATE EXTERNAL TABLE earthquakes (
       earthquake_date STRING,
       latitude DOUBLE,
       longitude DOUBLE,
       depth DOUBLE,
       magnitude DOUBLE,
       magtype STRING,
       mbstations STRING,
       gap STRING,
       distance STRING,
       rms STRING,
       source STRING,
       eventid STRING
     )
     ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
     STORED AS TEXTFILE LOCATION 's3://<MY-DATA-BUCKET>/earthquakes/';

Create a Lambda function for the Athena UDF

For a detailed description of how to build Athena UDFs, see Querying with user defined functions. We use Java 11 and the Uber H3 Java binding to build the H3 UDF. We provide the implementation of the UDF on GitHub.

There are several options for deploying a UDF using Lambda. In this example, we use the AWS Management Console. For production deployments, you probably want to use infrastructure as code such as the AWS Cloud Development Kit (AWS CDK). For details about how to use the AWS CDK to deploy the Lambda function, refer to the project code repository. Another possible deployment option is the AWS Serverless Application Repository (SAR).

Deploy the UDF

Deploy the Uber H3 binding UDF using the console as follows:

  1. Go to the binary directory in the GitHub repository, and download aws-h3-athena-udf-*.jar to your local desktop.
  2. Create a Lambda function called H3UDF with Runtime set to Java 11 (Corretto), and Architecture set to x86_64.
  3. Upload the aws-h3-athena-udf*.jar file.
  4. Change the handler name to com.aws.athena.udf.h3.H3AthenaHandler.
  5. In the General configuration section, choose Edit to set the memory of the Lambda function to 4096 MB, an amount of memory that works for our examples. You might need to set the memory size larger for your use cases.

Use the Lambda function as an Athena UDF

After you create the Lambda function, you're ready to use it as a UDF. The following screenshot shows the function details.

You can now use the function as an Athena UDF. On the Athena console, run the following command:

 USING EXTERNAL FUNCTION lat_lng_to_cell_address(lat DOUBLE, lng DOUBLE, res INTEGER)
RETURNS VARCHAR
LAMBDA '<MY-LAMBDA-ARN>' -- Replace with the ARN of your Lambda function
SELECT *,
  lat_lng_to_cell_address(latitude, longitude, 4) AS h3_cell
FROM earthquakes
WHERE latitude BETWEEN 18 AND 70;

The udf/examples folder in the GitHub repository contains more examples of Athena queries.

Developing the UDFs

Now that we have shown you how to deploy a UDF for Athena using Lambda, let's dive deeper into how to develop these kinds of UDFs. As explained in Querying with user defined functions, to develop a UDF we first need to implement a class that extends UserDefinedFunctionHandler. Then we implement the functions inside that class that can be used as Athena UDFs.

We start the UDF implementation by defining a class H3AthenaHandler that extends UserDefinedFunctionHandler. Then we implement functions that serve as wrappers of functions defined in the Uber H3 Java binding. We make sure that all the functions defined in the H3 Java binding API are mapped, so that they can be used in Athena as UDFs. For example, we map the lat_lng_to_cell_address function used in the preceding example to the latLngToCell function of the H3 Java binding.

On top of the call to the Java binding, many of the functions in the H3AthenaHandler check whether the input parameter is null. The null check is useful because we do not assume the input to be non-null. In practice, null values for an H3 index or address are not uncommon.
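The pattern is easy to see in miniature. Below is an illustrative Python sketch of the same null-safe wrapper idea (the production UDF is Java; the `null_safe` helper is our own invention for this sketch, though H3 really does store the resolution in bits 52-55 of the 64-bit cell index):

```python
def null_safe(fn):
    """Wrap a one-argument function so that a None input passes through untouched."""
    def wrapper(x):
        return None if x is None else fn(x)
    return wrapper

@null_safe
def get_resolution(h3_index):
    # H3 packs the cell resolution into bits 52-55 of the 64-bit index.
    return (h3_index >> 52) & 0xF

print(get_resolution(None))                # None, no exception raised
print(get_resolution(622506764662964223))  # resolution of the example cell used later in this post
```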

The following code shows the implementation of the get_resolution function:

/** Returns the resolution of an index.
 *  @param h3 the H3 index.
 *  @return the resolution. Null when h3 is null.
 *  @throws IllegalArgumentException when the index is out of range.
 */
public Integer get_resolution(Long h3) {
    final Integer result;
    if (h3 == null) {
        result = null;
    } else {
        result = h3Core.getResolution(h3);
    }
    return result;
}
Some H3 API functions such as cellToLatLng return a List<Double> of two elements, where the first element is the latitude and the second is the longitude. The H3 UDF that we implement provides a function that returns a well-known text (WKT) representation instead. For example, we provide cell_to_lat_lng_wkt, which returns a Point WKT string rather than a List<Double>. We can then use the output of cell_to_lat_lng_wkt in combination with the built-in spatial Athena function ST_GeometryFromText as follows:

SELECT ST_GeometryFromText(cell_to_lat_lng_wkt(622506764662964223))
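For illustration, here is a sketch of the conversion cell_to_lat_lng_wkt performs, written in Python while the UDF itself is Java, and assuming the conventional WKT axis order of x = longitude, y = latitude:

```python
def lat_lng_to_point_wkt(lat: float, lng: float) -> str:
    """Render a latitude/longitude pair as a WKT Point (x = lng, y = lat)."""
    return f"POINT ({lng} {lat})"

print(lat_lng_to_point_wkt(43.604652, 1.444209))  # POINT (1.444209 43.604652)
```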

Athena UDFs only support scalar data types, not nested types. However, some H3 APIs return nested types. For example, the polygonToCells function in H3 takes a List<List<List<GeoCoord>>>. Our implementation of the polygon_to_cells UDF receives a Polygon WKT string instead. The following shows an example Athena query using this UDF:

-- get all h3 hexagons that cover Toulouse, Nantes, Lille, Paris, Nice
SELECT polygon_to_cells('POLYGON ((43.604652 1.444209, 47.218371 -1.553621, 50.62925 3.05726, 48.864716 2.349014, 43.6961 7.27178, 43.604652 1.444209))', 2)
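If you assemble such polygon strings programmatically, remember that a WKT polygon ring must end on the same point it starts on. A small helper for that, in illustrative Python that is not part of the UDF:

```python
def polygon_wkt(points):
    """Build a closed POLYGON WKT string from coordinate tuples (here lat lng, matching the query above)."""
    ring = list(points)
    if ring[0] != ring[-1]:
        ring.append(ring[0])  # close the ring on the first point, as WKT requires
    coords = ", ".join(f"{a} {b}" for a, b in ring)
    return f"POLYGON (({coords}))"

print(polygon_wkt([(43.604652, 1.444209), (47.218371, -1.553621), (50.62925, 3.05726)]))
```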

Use SageMaker notebooks for visualization

A SageMaker notebook is a managed machine learning compute instance that runs the Jupyter Notebook application. In this example, we use a SageMaker notebook to write and run the code that visualizes our results, but if your use case includes Apache Spark, then Amazon Athena for Apache Spark would be a great choice. For recommendations on security best practices for SageMaker, see Building secure machine learning environments with Amazon SageMaker. You can create your own SageMaker notebook by following these steps:

  1. On the SageMaker console, choose Notebook in the navigation pane.
  2. Choose Notebook instances.
  3. Choose Create notebook instance.
  4. Enter a name for the notebook instance.
  5. Choose an existing IAM role or create a role that allows you to run SageMaker and grants access to Amazon S3 and Athena.
  6. Choose Create notebook instance.
  7. Wait for the notebook status to change from Pending to InService.
  8. Open the notebook instance by choosing Jupyter or JupyterLab.

Explore the data

We're now ready to explore the data.

  1. On the Jupyter console, under New, choose Notebook.
  2. On the Select Kernel drop-down menu, choose conda_python3.
  3. Add new cells by choosing the plus sign.
  4. In your first cell, install the following Python modules, which aren't included in the standard SageMaker environment:
    !pip install geojson
    !pip install awswrangler
    !pip install geomet
    !pip install shapely

     GeoJSON is a popular format for storing spatial data as JSON. The geojson module allows you to easily read and write GeoJSON data with Python. The second module we install, awswrangler, is the AWS SDK for pandas, a very simple way to read data from various AWS data sources into pandas data frames. We use it to read earthquake data from the Athena table.

  5. Next, we import all the packages that we use to import the data, transform it, and visualize it:
     from geomet import wkt
     import plotly.express as px
     from shapely.geometry import Polygon, mapping
     import awswrangler as wr
     import pandas as pd
     from shapely.wkt import loads
     import geojson
     import ast

  6. We start importing our data using the athena.read_sql_query function in the AWS SDK for pandas. The Athena query has a subquery that uses the UDF to add a column h3_cell to each row in the earthquakes table, based on the latitude and longitude of the earthquake. The analytic function COUNT is then used to find the number of earthquakes in each H3 cell. For this visualization, we're only interested in earthquakes within the United States, so we filter out rows in the data frame that are outside the area of interest:
     def run_query(lambda_arn, db, resolution):
         query = f"""USING EXTERNAL FUNCTION cell_to_boundary_wkt(cell VARCHAR)
         RETURNS VARCHAR
         LAMBDA '{lambda_arn}'
         SELECT h3_cell, cell_to_boundary_wkt(h3_cell) AS boundary, quake_count FROM
         (USING EXTERNAL FUNCTION lat_lng_to_cell_address(lat DOUBLE, lng DOUBLE, res INTEGER)
         RETURNS VARCHAR
         LAMBDA '{lambda_arn}'
         SELECT h3_cell, COUNT(*) AS quake_count FROM
         (SELECT *,
             lat_lng_to_cell_address(latitude, longitude, {resolution}) AS h3_cell
          FROM earthquakes
          -- For this visualization, we're only interested in earthquakes within the USA
          WHERE latitude BETWEEN 18 AND 70
            AND longitude BETWEEN -175 AND -50)
         GROUP BY h3_cell ORDER BY quake_count DESC) cell_quake_count"""
         return wr.athena.read_sql_query(query, database=db)

     lambda_arn = '<MY-LAMBDA-ARN>'  # Replace with the ARN of your Lambda function
     db_name = '<MY-DATABASE-NAME>'  # Replace with the name of your AWS Glue database
     earthquakes_df = run_query(lambda_arn=lambda_arn, db=db_name, resolution=4)
     earthquakes_df.head()

The following screenshot shows our results. Follow along with the rest of the steps in our Jupyter notebook to see how we analyze and visualize our example with H3 UDF data.

Visualize the results

To visualize our results, we use the Plotly Express module to create a choropleth map of our data. A choropleth map is a type of visualization that is shaded based on quantitative values. This is a great visualization for our use case because we're shading different areas based on the frequency of earthquakes.
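To give an idea of the glue involved, here is a sketch of turning the boundary strings produced by run_query into the GeoJSON that px.choropleth consumes. This is illustrative Python, and it assumes cell_to_boundary_wkt returns a plain 'POLYGON ((lat lng, ...))' string; check the actual output format in your own results before relying on it.

```python
def boundary_to_feature(cell_id, boundary_wkt):
    """Convert one 'POLYGON ((lat lng, ...))' string into a GeoJSON Feature
    keyed by its H3 cell id (GeoJSON expects longitude-first coordinates)."""
    inner = boundary_wkt.strip()[len("POLYGON (("):-len("))")]
    ring = [[float(lng), float(lat)]
            for lat, lng in (point.split() for point in inner.split(", "))]
    return {"type": "Feature", "id": cell_id,
            "geometry": {"type": "Polygon", "coordinates": [ring]}}

def to_feature_collection(rows):
    """Build a FeatureCollection from an iterable of (h3_cell, boundary_wkt) pairs."""
    return {"type": "FeatureCollection",
            "features": [boundary_to_feature(c, b) for c, b in rows]}
```

Passing zip(earthquakes_df.h3_cell, earthquakes_df.boundary) to to_feature_collection and handing the result to px.choropleth(earthquakes_df, geojson=..., locations="h3_cell", color="quake_count") then shades each hexagon by its earthquake count.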

In the resulting visual, we can see the frequency of earthquakes in different areas of North America. Note that the H3 resolution in this map is lower than in the earlier map, which makes each hexagon cover a larger area of the globe.

Clean up

  1. To avoid incurring additional charges on your account, delete the resources you created. On the SageMaker console, select the notebook, and on the Actions menu, choose Stop.
  2. Wait for the status of the notebook to change to Stopped, then select the notebook again, and on the Actions menu, choose Delete.
  3. On the Amazon S3 console, select the bucket you created and choose Empty.
  4. Enter the bucket name and choose Empty.
  5. Select the bucket again and choose Delete.
  6. Enter the bucket name and choose Delete bucket.
  7. On the Lambda console, select the function name, and on the Actions menu, choose Delete.


Conclusion

In this post, you saw how to extend functions in Athena for geospatial analysis by adding your own user-defined function. Although we used Uber's H3 geospatial index in this demonstration, you can bring your own geospatial index for your own custom geospatial analysis. We used Athena, Lambda, and SageMaker notebooks to visualize the results of our UDFs in the western United States. Code examples are in the h3-udf-for-athena GitHub repo.

As a next step, you can modify the code in this post and customize it for your own needs to gain further insights from your own geographic data. For example, you could visualize other cases such as droughts, flooding, and deforestation.

About the Authors

John Telford is a Senior Consultant at Amazon Web Services. He is a specialist in big data and data warehouses. John has a Computer Science degree from Brunel University.

Anwar Rizal is a Senior Machine Learning consultant based in Paris. He works with AWS customers to develop data and AI solutions to sustainably grow their business.

Pauline Ting is a Data Scientist in the AWS Professional Services team. She supports customers in achieving and accelerating their business outcomes by developing sustainable AI/ML solutions. In her spare time, Pauline enjoys traveling, surfing, and trying new dessert places.
