Scratch a dog and you’ll find a permanent job.
– Franklin P. Jones
I am writing about one of my own experiences with AWS Athena, the serverless query engine.
AWS marketed this service as serverless, scalable, with no management or maintenance overhead service. This is somewhat I agree but when it comes to the variety of data, it has a problem.
All we know is that Athena uses Apache Hive to define tables and databases. Apart from that, it can simply run queries on data in Amazon S3 by defining the schema, which is great. but it never says what kind of data it can simply process!
Although AWS Athena claims it can process data in CSV, JSON, ORC, Avro and Parquet format, I have a bad experience with dealing with JSON data.
1. Idnetify Data
Let me explain. The data was in JSON format with file extension .json. The below data in the code block looks like an array of three records, which needs to be processed by the Glue crawler and populate the data into the tables. The below data and example is taken from AWS documentation.
[
{
"type": "constituency",
"id": "ocd-division\/country:us\/state:ak",
"name": "Alaska"
},
{
"type": "constituency",
"id": "ocd-division\/country:us\/state:al\/cd:1",
"name": "Alabama's 1st congressional district"
},
{
"type": "constituency",
"id": "ocd-division\/country:us\/state:al\/cd:2",
"name": "Alabama's 2nd congressional district"
}
]
To process the data the Glue Crawler need to have a custom JSON Classifier. When I run a crawler using the built-in JSON classifier, the entire file is used to define the schema. Because initially didn’t specify a JSON path, the crawler treats the data as one object, that is, just an array. For example, the schema might look like the following:
root
|-- record: array
However, to create a schema that is based on each record in the JSON array, create a custom JSON classifier and specify the JSON path as $[*]
. When this JSON path is specified, the classifier interrogates all 3 records in the array to determine the schema. The resulting schema contains separate fields for each object, similar to the following:
root
|-- type: string
|-- id: string
|-- name: string
JSON path
A JSON path that points to an object is used to define a table schema. The JSON path can be written in dot notation or bracket notation. The following operators are supported:
Operator | Description |
---|---|
$ | Root element of a JSON object. This starts all path expressions |
* | Wildcard character. Available anywhere a name or numeric are required in the JSON path. |
.<name> | Dot-notated child. Specifies a child field in a JSON object. |
['<name>'] | Bracket-notated child. Specifies child field in a JSON object. Only a single child field can be specified. |
[<number>] | Array index. Specifies the value of an array by index. |
Till this all fine but the pain starts here.
The Glue crawler creates a table in the Glue Data Catalog.