AWS Glue is a serverless ETL (extract, transform, and load) service on the AWS cloud that makes it easy to prepare data for analytics. In this article, I will briefly touch on the basics of AWS Glue and the other AWS services involved.
I will then cover how to extract and transform CSV files stored in Amazon S3, convert them into a data catalog, and query them using Amazon Athena without the need for any EC2 instance or server. Toward the end, we will load the transformed data into Amazon Redshift, where it can later be used for analysis.
Components of AWS Glue
- Data catalog: The data catalog holds the metadata and the structure of the data.
- Database: Used to create or access the database for the sources and targets.
- Table: One or more tables in the database that can be used by the source and target.
- Crawler and classifier: A crawler retrieves the schema of the data from the source using built-in or custom classifiers. It creates/uses metadata tables that are pre-defined in the data catalog.
- Job: A job is the business logic that carries out an ETL task. Internally, it is an Apache Spark script written in Python or Scala.
- Trigger: A trigger starts an ETL job execution on demand or at a specific time.
- Development endpoint: A development environment where the ETL job script can be developed, tested, and debugged.
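These components map directly onto the AWS Glue API, so everything the console does can also be scripted. As a quick illustration, here is a minimal boto3 sketch (the database name db_demo1 is a placeholder) that lists the databases and metadata tables registered in the data catalog:

import boto3

glue = boto3.client("glue")  # uses your default AWS credentials and region

## List all databases in the data catalog
for db in glue.get_databases()["DatabaseList"]:
    print("Database:", db["Name"])

## List the metadata tables of one database (name is a placeholder)
for tbl in glue.get_tables(DatabaseName="db_demo1")["TableList"]:
    print("Table:", tbl["Name"])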
End-to-End ETL on AWS Glue
This tutorial helps you understand how AWS Glue works together with Amazon S3 and Amazon Redshift. This practical guide shows how to read data from different sources (we cover Amazon S3 in this guide), apply some commonly required transformations such as joins and filters on the tables, and finally load the transformed data into Amazon Redshift.
A JDBC connection connects data sources and targets using Amazon S3, Amazon RDS, Amazon Redshift, or any external database.
Before implementing any ETL job, you need to create an IAM role and upload the data into Amazon S3.
Create an IAM role to access AWS Glue + Amazon S3:
- Open the Amazon IAM console
- Click on Roles in the left pane. Then click on Create Role.
- Choose the AWS service from Select type of trusted entity section
- Choose Glue service from “Choose the service that will use this role” section
- Choose Glue from “Select your use case” section
- Click on Next:Permissions
- Select AWSGlueServiceRole from the Attach Permissions Policies section. AWSGlueServiceRole policy contains permissions to access Glue, CloudWatch, EC2, S3, and IAM
- Click on Next:Tags. Leave the Add tags section blank and click on Next:Review
- Provide a name to identify the service role AWSGlueServiceRole-<role name>(for simplicity add prefix ‘AWSGlueServiceRole-’ in the role name) for the role
- Click on Create role
- Your role with full access to AWS Glue and limited access to Amazon S3 has been created
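If you prefer scripting these console steps, a minimal boto3 sketch along the following lines should create an equivalent role (the role name is a placeholder, and the extra Amazon S3 permissions are omitted for brevity):

import json
import boto3

iam = boto3.client("iam")

## Trust policy that lets the AWS Glue service assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="AWSGlueServiceRole-demo",  # placeholder name with the suggested prefix
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

## Attach the managed AWSGlueServiceRole policy mentioned above
iam.attach_role_policy(
    RoleName="AWSGlueServiceRole-demo",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)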
Upload source CSV files to Amazon S3:
- On the Amazon S3 console, click on Create bucket to create a bucket where you can store files and folders.
- Enter a bucket name, select a Region and click on Next
- The remaining configuration settings for creating an S3 bucket are optional. Click Next to create your S3 bucket.
- Create a new folder in your bucket and upload the source CSV files
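Equivalently, the bucket and uploads can be done with boto3 (the bucket name, region, and file names below are placeholders):

import boto3

s3 = boto3.client("s3", region_name="us-west-2")  # placeholder region

## Create the bucket (LocationConstraint is required outside us-east-1)
s3.create_bucket(
    Bucket="my-glue-demo-bucket",
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)

## Upload the source CSV files into a folder (prefix) in the bucket
s3.upload_file("source_1.csv", "my-glue-demo-bucket", "csv/source_1.csv")
s3.upload_file("source_2.csv", "my-glue-demo-bucket", "csv/source_2.csv")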
Create a connection for the target database into Amazon Redshift:
- Prerequisite: You must have an existing cluster, database name and user for the database in Amazon Redshift.
- In the AWS Glue console, click on Add Connection in the left pane.
- In the dialog box, enter the connection name under Connection name and choose Amazon Redshift as the Connection type. Click Next to move to the next screen.

- Select your existing cluster in Amazon Redshift as the cluster for your connection
- Enter the database name, username, and password of the cluster and click Next

- Review the details of the connection and then click Finish. Your Amazon Redshift connection is now created and can be verified through the Test Connection option.
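The same connection can also be created programmatically with glue.create_connection; here is a hedged sketch (the connection name, JDBC URL, and credentials are placeholders, and depending on your VPC setup PhysicalConnectionRequirements such as subnet and security group may also be needed):

import boto3

glue = boto3.client("glue")

glue.create_connection(
    ConnectionInput={
        "Name": "my-redshift-1",   # connection name referenced later by the job
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            # placeholder endpoint, database, user, and password
            "JDBC_CONNECTION_URL": "jdbc:redshift://my-cluster.abc123.us-west-2.redshift.amazonaws.com:5439/dev",
            "USERNAME": "awsuser",
            "PASSWORD": "********",
        },
    }
)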

Adding a crawler to create a data catalog using Amazon S3 as a data source
On the left pane in the AWS Glue console, click on Crawlers -> Add Crawler
Enter the crawler name in the dialog box and click Next

Choose S3 as the data store from the drop-down list
Select the folder where your CSVs are stored in the Include path field

If you have any other data source, click on Yes and repeat the above steps. This guide has no other source, so we’ll click on No. Then, click Next.

Select Choose an existing IAM role, pick the previously created role from the dropdown list of IAM roles, and click Next

Leave the default option of Run on Demand and click Next

Choose an existing database. If you do not have one, click Add Database to create a new database on the fly.
Table prefixes are optional and left to the user to customize. The crawler will also generate table names automatically after it runs. Click Next.

Once you are satisfied with the configuration, click on Finish

You can now run the crawler by clicking Run it now
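For reference, the same crawler can be defined and started through the API; a minimal boto3 sketch (the crawler name, role, database, and S3 path are placeholders):

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="csv-crawler-demo",         # placeholder crawler name
    Role="AWSGlueServiceRole-demo",  # the IAM role created earlier
    DatabaseName="db_demo1",         # target database in the data catalog
    Targets={"S3Targets": [{"Path": "s3://my-glue-demo-bucket/csv/"}]},
    TablePrefix="tbl_syn_",          # optional table prefix
)

glue.start_crawler(Name="csv-crawler-demo")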

Databases in the left pane lets you verify that the tables were created automatically by the crawler, and Amazon Athena enables you to view the data in those tables.
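For example, you could preview one of the crawled tables from code; a hedged boto3 sketch for Athena (the database, table, and query-result location are placeholders):

import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT * FROM tbl_syn_source_1_csv LIMIT 10",
    QueryExecutionContext={"Database": "db_demo1"},
    ResultConfiguration={"OutputLocation": "s3://my-glue-demo-bucket/athena-results/"},
)
print("Query execution id:", response["QueryExecutionId"])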

Add a Job to execute ETL work
In the left pane, click on Jobs, then click on Add Job
Enter a name for the Job and then select an IAM role previously created for AWS Glue
Select Spark for the Type and select Python or Scala. For our purposes, we are using Python.
You can edit the DPU (data processing unit) value in the Maximum capacity field under Security configuration, script libraries, and job parameters (optional).
The remaining configuration is optional; the default values will work fine.
Click Next
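The job wizard can also be reproduced with the API; a hedged sketch of create_job and start_job_run (the job name and script location are placeholders; MaxCapacity corresponds to the Maximum capacity DPU field above):

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="demo-etl-job",             # placeholder job name
    Role="AWSGlueServiceRole-demo",  # the IAM role created earlier
    Command={
        "Name": "glueetl",           # a Spark ETL job
        "ScriptLocation": "s3://my-glue-demo-bucket/scripts/demo_etl.py",  # placeholder
        "PythonVersion": "3",
    },
    MaxCapacity=10.0,                # number of DPUs
)

glue.start_job_run(JobName="demo-etl-job")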

Choose a data source table from the Choose a data source section. You can choose only a single data source. Click Next.

Choose a data target table from the list of tables. You can either create a new table or choose an existing one.

If you haven’t created any target table, select Create tables in your data target option
Our target database is Amazon Redshift, so select JDBC from the Data store dropdown and the connection created earlier from the Connection list
Enter a database name that must exist in the target data store. Click Next.

You can map the columns of the source table to the columns of the target table. For this tutorial, we go ahead with the default mapping, which can also be modified later in the business logic, as sketched below.
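In the generated script, such a column mapping typically appears as an ApplyMapping transform; a minimal sketch (the column names and types here are hypothetical):

applymapping1 = ApplyMapping.apply(
    frame = datasource1,
    mappings = [("statecode", "string", "statecode", "string"),   # hypothetical columns
                ("population", "long", "population", "long")],
    transformation_ctx = "applymapping1")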
Click Save job and edit script.

Open the Python script by selecting the recently created job name. Click on Action -> Edit Script.
The left pane shows a visual representation of the ETL process. The right-hand pane shows the script code, and just below it you can see the logs of the running job.

Let’s walk through the script that performs the extract, transform, and load process on AWS Glue.
We begin by importing the necessary Python libraries for the ETL job.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
Get the job name (and the temporary directory) from the command-line arguments.
args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])
Initialize the GlueContext and SparkContext for the Job.
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
Extract the data of the tbl_syn_source_1_csv and tbl_syn_source_2_csv tables from the data catalog. AWS Glue represents the data as DynamicFrames.
datasource1 = glueContext.create_dynamic_frame.from_catalog(
    database = "db_demo1",
    table_name = "tbl_syn_source_1_csv",
    transformation_ctx = "datasource1")
datasource2 = glueContext.create_dynamic_frame.from_catalog(
    database = "db_demo1",
    table_name = "tbl_syn_source_2_csv",
    transformation_ctx = "datasource2")
Now, apply a transformation on the source tables. You can join the two tables on the statecode column of tbl_syn_source_1_csv and the code column of tbl_syn_source_2_csv.
Several built-in transformations are available in AWS Glue, such as RenameField, SelectFields, Join, etc. See https://docs.aws.amazon.com/glue/latest/dg/built-in-transforms.html.
join1 = Join.apply(frame1 = datasource1, frame2 = datasource2,
                   keys1 = "statecode", keys2 = "code",
                   transformation_ctx = "join1")
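Other built-in transforms follow the same apply() pattern. For instance, a hedged sketch that renames a field and then projects a subset of columns (the column names are hypothetical):

renamed = RenameField.apply(frame = join1, old_name = "code",
                            new_name = "state_code",
                            transformation_ctx = "renamed")
selected = SelectFields.apply(frame = renamed,
                              paths = ["statecode", "state_code"],  # hypothetical columns
                              transformation_ctx = "selected")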
Load the joined DynamicFrame into Amazon Redshift (Database=dev and Schema=sch_demo_1).
datasink1 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = join1,
    catalog_connection = "my-redshift-1",
    connection_options = {"dbtable": "sch_demo_1.tbl_joined", "database": "dev"},
    redshift_tmp_dir = args["TempDir"],
    transformation_ctx = "datasink1")
Finally, commit your Job.
job.commit()
Save and execute the Job by clicking on Run Job.
Code Sample
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])

## Initialize the GlueContext and SparkContext
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

## Read the data from Amazon S3 and have their structure in the data catalog.
datasource1 = glueContext.create_dynamic_frame.from_catalog(database = "db_demo1", table_name = "tbl_syn_source_1_csv", transformation_ctx = "datasource1")
datasource2 = glueContext.create_dynamic_frame.from_catalog(database = "db_demo1", table_name = "tbl_syn_source_2_csv", transformation_ctx = "datasource2")

## Apply transformation, join the tables
join1 = Join.apply(frame1 = datasource1, frame2 = datasource2, keys1 = "statecode", keys2 = "code", transformation_ctx = "join1")

## Write the transformed data into Amazon Redshift
datasink1 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = join1, catalog_connection = "my-redshift-1", connection_options = {"dbtable": "sch_demo_1.tbl_joined", "database": "dev"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink1")

job.commit()
More on transformation with AWS Glue
AWS Glue’s built-in transforms do not cover operations such as UNION, LEFT JOIN, RIGHT JOIN, etc. To overcome this limitation, we can use Spark: convert the AWS Glue DynamicFrame to a Spark DataFrame, apply Spark functions for the required transformations, and then convert back.
Example: A union transformation is not available in AWS Glue. However, you can use Spark’s union() to achieve a UNION of two tables.
from awsglue.dynamicframe import DynamicFrame

## Convert the Glue DynamicFrames to Spark DataFrames
spark_data_frame_1 = glue_dynamic_frame_1.toDF()
spark_data_frame_2 = glue_dynamic_frame_2.toDF()

## Apply the UNION transformation on the Spark DataFrames
spark_data_frame_union = spark_data_frame_1.union(spark_data_frame_2).distinct()

## Convert the Spark DataFrame back to a Glue DynamicFrame
glue_dynamic_frame_union = DynamicFrame.fromDF(spark_data_frame_union, glueContext, "spark_data_frame_union")
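The same pattern covers the other missing operations. For example, a LEFT JOIN (mentioned above) can be expressed on the converted DataFrames; this sketch reuses the statecode/code columns from the earlier join:

## LEFT JOIN on the converted Spark DataFrames
spark_left_join = spark_data_frame_1.join(
    spark_data_frame_2,
    spark_data_frame_1["statecode"] == spark_data_frame_2["code"],
    "left")

## Convert back to a Glue DynamicFrame if needed
glue_left_join = DynamicFrame.fromDF(spark_left_join, glueContext, "glue_left_join")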
Github link for source code: https://gist.github.com/nitinmlvya/ba4626e8ec40dc546119bb14a8349b45