Scratch a dog and you’ll find a permanent job.
– Franklin P. Jones
This is my first script for provisioning AWS Glue Crawler resources. In this post, I walk through the key resources that need to be created when provisioning Glue, using a use case that builds the Data Catalog from various source datasets.
1. IAM Role
STEP-1: Adding a new Glue service role for a developer, e.g. “dev_AWSGlueServiceRole”.
resource "aws_iam_role" "glueRole" {
name = "dev_AWSGlueServiceRole"
assume_role_policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Action": "sts:AssumeRole",
"Principal": {
"Service": "glue.amazonaws.com"
},
"Effect": "Allow",
"Sid": ""
}
]
}
EOF
}
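On Terraform 0.12 and later, the same trust policy can also be expressed with jsonencode instead of a heredoc, which avoids JSON quoting mistakes (an equivalent drop-in for the resource above, not an additional resource):
resource "aws_iam_role" "glueRole" {
  name = "dev_AWSGlueServiceRole"

  # Same trust policy as above, built from a native Terraform object
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action    = "sts:AssumeRole"
        Principal = { Service = "glue.amazonaws.com" }
        Effect    = "Allow"
        Sid       = ""
      }
    ]
  })
}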
STEP-2: Attaching the AWS managed pre-defined policy “AWSGlueServiceRole” to the role “dev_AWSGlueServiceRole” created above.
resource "aws_iam_role_policy_attachment" "glue_service" {
role = "${aws_iam_role.glueRole.id}"
policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
}
STEP-3: Finally, as my crawler looks for data at a source location in an S3 bucket (e.g. “dev_source_data”), we need to define a policy that grants the Glue role access to that bucket.
resource "aws_iam_role_policy" "dev_glue_s3_policy" {
name = "dev_glue_s3_policy"
role = "${aws_iam_role.glueRole.id}"
policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:*"
],
"Resource": [
"arn:aws:s3:::dev_source_data",
"arn:aws:s3:::dev_source_data/*"
]
}
]
}
EOF
}
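Note that “s3:*” grants far more than a crawler needs. A tighter variant (a sketch, assuming the crawler only ever reads from the bucket; the resource name “dev_glue_s3_readonly_policy” is illustrative) would restrict the actions to listing the bucket and reading objects:
resource "aws_iam_role_policy" "dev_glue_s3_readonly_policy" {
  name = "dev_glue_s3_readonly_policy"
  role = aws_iam_role.glueRole.id

  # Read-only access is enough for a crawler that only scans the source data
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = ["s3:GetObject", "s3:ListBucket"]
        Resource = [
          "arn:aws:s3:::dev_source_data",
          "arn:aws:s3:::dev_source_data/*"
        ]
      }
    ]
  })
}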
2. Data Catalogs
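Catalog tables live inside a Glue catalog database, which both the table below and the crawler in the next section reference. The database resource itself is not shown in the original snippets; a minimal sketch (the database name “my_catalog_database” is an assumption) looks like this:
resource "aws_glue_catalog_database" "aws_glue_catalog_database" {
  # The database name is an assumption; use one that fits your environment
  name = "my_catalog_database"
}
Sample script to create a Glue Catalog Table, e.g. “aws_glue_catalog_table”: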
resource "aws_glue_catalog_table" "aws_glue_catalog_table" {
name = "mytable"
database_name = aws_glue_catalog_database.aws_glue_catalog_database.name
table_type = "EXTERNAL_TABLE"
parameters = {
"classification" = "json"
}
storage_descriptor {
location = "s3://mybucket/myprefix"
input_format = "org.apache.hadoop.mapred.TextInputFormat"
output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
ser_de_info {
name = "myserdeinfo"
serialization_library = "org.openx.data.jsonserde.JsonSerDe"
parameters = {
"paths" = "jsonrootname"
}
}
columns {
name = "column1"
type = "array<struct<resourcearn:string,tags:array<struct<key:string,value:string>>>>"
}
}
partition_keys {
name = "part1"
type = "string"
}
partition_keys {
name = "part2"
type = "string"
}
}
3. Glue Crawler
Before creating the resources, let me explain a bit about crawlers.
A crawler is a component that connects to various data stores, extracts metadata, and creates and manages table definitions in the AWS Glue Data Catalog. At the same time, it manages the data partitions of the Glue tables based on the source data.

Sample script to create a Glue Crawler, e.g. “events_crawler”:
resource "aws_glue_crawler" "events_crawler" {
database_name = aws_glue_catalog_database.glue_database.name
schedule = "cron(0 1 * * ? *)"
name = "events_crawler_${var.environment_name}"
role = aws_iam_role.glue_role.arn
tags = var.tags
configuration = jsonencode(
{
Grouping = {
TableGroupingPolicy = "CombineCompatibleSchemas"
}
CrawlerOutput = {
Partitions = { AddOrUpdateBehavior = "InheritFromTable" }
}
Version = 1
}
)
s3_target {
path = "s3://${aws_s3_bucket.data_lake_bucket.bucket}"
}
}
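A quick note on the configuration block: the “CombineCompatibleSchemas” grouping policy tells the crawler to combine objects with compatible schemas into a single table instead of creating one table per S3 path, and “InheritFromTable” makes newly discovered partitions inherit metadata properties (such as schema and SerDe information) from their parent table.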
4. Glue Triggers
We mostly encounter three types of scenarios for running Glue Crawlers:
- Conditional Trigger (fires based on a job or crawler condition)
- On-Demand Trigger (ad hoc, on-demand runs of the crawler)
- Scheduled Trigger (fires at a specific time of day/week/month; sketches of the on-demand and scheduled variants are shown right after this list)
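For reference, here are minimal sketches of the on-demand and scheduled trigger types (the names “ondemand_trigger” and “scheduled_trigger” are illustrative, not part of the original setup):
resource "aws_glue_trigger" "ondemand_trigger" {
  name = "ondemand_trigger"
  type = "ON_DEMAND"

  # Fired manually, e.g. from the console or the StartTrigger API
  actions {
    crawler_name = aws_glue_crawler.events_crawler.name
  }
}

resource "aws_glue_trigger" "scheduled_trigger" {
  name     = "scheduled_trigger"
  type     = "SCHEDULED"
  schedule = "cron(0 1 * * ? *)" # every day at 01:00 UTC

  actions {
    crawler_name = aws_glue_crawler.events_crawler.name
  }
}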
Apart from these, a trigger can combine a crawler action with a crawler condition. In the example below, the crawler “crawler_1” is started once the crawler “crawler_2” finishes with state “SUCCEEDED”.
resource "aws_glue_trigger" "conditional_trigger" {
name = "example"
type = "CONDITIONAL"
actions {
crawler_name = aws_glue_crawler.cralwer_1.name
}
predicate {
conditions {
job_name = aws_glue_job.cralwer_2.name
state = "SUCCEEDED"
}
}
}