05 February, 2020

Copying files from AWS S3 to Azure Blob with the Azure Data Factory service

In today's multi-cloud environment, copying files from one cloud provider to another is becoming a very common task. In one of my recent IoT engagements, the customer wanted to leverage the Azure cloud going forward but already had sensor data files (JSON) residing in an AWS S3 bucket. So the customer first wanted those copied to Azure Blob storage before completely switching to Azure. It's not just a few files but thousands of such files waiting to be copied.

Let's take up this challenge and build a small POC around it. Azure Data Factory comes to the rescue. If you are not already familiar with this service, I would encourage you to give it a look. It's a great service which can solve many of your data-related needs. In our case the need is to copy files from AWS to Azure.



Prerequisites:
  1. Azure Subscription (Free account will do)
  2. AWS account (Free tier will do)
First, let's take a look at the AWS side from where the files need to be copied, i.e. the source. You can see that I have an S3 bucket (iot-aws-data) containing 10 files under the "iot" folder. These files are nothing but plain JSON files containing the data generated from sensors.
{
   "readingId" : "1001eee1",
   "readingDate" : "2020-01-04",
   "sensorId" : 2345,
   "level" : 4
}

Now, before we can think of copying data, we need credentials to get into the AWS S3 bucket. It's not recommended to use the root credentials, so let's create a user with admin privileges in our AWS account.

Click on Services. Type IAM in the search bar and click as highlighted below


Click Users and then click the "Add user" button in the next screen


Give a user name of your choice (I have given AzureUser 😉). Tick the checkboxes and click the "Next: Permissions" button


On the next screen, select the options as shown below, then click Next >> Next and create the user


The user with administrator access will be created. Now a vital step: don't forget to note down the Access key ID and Secret access key from the screen. Keep those in Notepad; you will need them later.
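A quick side note: administrator access is the fast path for a demo, but it's far more than Data Factory actually needs. If you prefer least privilege, a read-only policy scoped to the bucket should roughly do the job (browsing the full bucket list from the ADF UI may need a bit more). A sketch, using the bucket name from above:

{
   "Version": "2012-10-17",
   "Statement": [
      {
         "Effect": "Allow",
         "Action": [ "s3:GetObject", "s3:ListBucket", "s3:GetBucketLocation" ],
         "Resource": [
            "arn:aws:s3:::iot-aws-data",
            "arn:aws:s3:::iot-aws-data/*"
         ]
      }
   ]
}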


DONE from the AWS cloud side. So, let's dive into Azure.

Search for "Data factory" in the Azure search bar and select as highlighted below


Fill in the details as per your choice. Make sure you un-check the "Enable GIT" checkbox as it's not required for this demo; otherwise you will have to fill in additional details pertaining to that. Click "Create"
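(If you would rather script this step than click through the portal, a data factory can also be declared as an ARM template resource. A minimal sketch, with the name and region left as placeholders for you to fill in:)

{
   "type": "Microsoft.DataFactory/factories",
   "apiVersion": "2018-06-01",
   "name": "<your data factory name>",
   "location": "<azure region>",
   "identity": { "type": "SystemAssigned" },
   "properties": {}
}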


Once created, go to your data factory overview page and click "Author & Monitor" to launch the ADF dashboard.


Click on "Create Pipeline"


Give a name of your choice for your pipeline (I have given AwsS3ToAzureBlob), then select the "Copy data" activity under "Move & transform" and drag it onto the canvas on the right as shown in the screen below


Select the activity if not already selected and give it a meaningful name of your choice (I have given CopyFilesAwsToAzure). Click on the "Source" tab (next to the General tab)
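By the way, everything you build in the designer is stored as JSON behind the scenes (the designer has a Code view that shows it). With the names above, the pipeline skeleton should look roughly like the sketch below; the dataset references are placeholders that get filled in as we configure the source and sink next:

{
   "name": "AwsS3ToAzureBlob",
   "properties": {
      "activities": [
         {
            "name": "CopyFilesAwsToAzure",
            "type": "Copy",
            "inputs": [ { "referenceName": "<source dataset>", "type": "DatasetReference" } ],
            "outputs": [ { "referenceName": "<sink dataset>", "type": "DatasetReference" } ],
            "typeProperties": {
               "source": { "type": "JsonSource" },
               "sink": { "type": "JsonSink" }
            }
         }
      ]
   }
}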


Click +New against the "Source dataset" drop-down and select Amazon S3. Click "Continue"


Select "JSON" as format and click "Continue"


Give a name of your choice in the properties window and select "New" from the list for linked service


In the next window we need to provide our AWS credentials:
Name: free text
Connect via integration runtime: leave the default AutoResolveIntegrationRuntime
Access key ID: << the access key ID you copied from the AWS create user screen >>
Secret access key: << the secret access key you copied from the AWS create user screen >>

Hit the "Test connection" at the bottom. If it succeeds (it should) Click Create


Click the "Browse" button and select your AWS S3 bucket from the list. Click OK


Now we need to copy all the files, not just one. So select "Wildcard file path" and provide the folder name of the AWS S3 bucket containing all the files, and "*.json" as the file name (this should be the default as well).
Scroll down a bit and also set Max concurrent connections to 2 (the choice is yours) as we also want to copy files in parallel.
You can also preview the data if you wish by hitting the "Preview data" button.
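In the pipeline JSON these source settings translate roughly into a block like the following sketch (the wildcard folder, wildcard file name and the concurrent connections limit all sit under the store settings):

"source": {
   "type": "JsonSource",
   "storeSettings": {
      "type": "AmazonS3ReadSettings",
      "recursive": true,
      "wildcardFolderPath": "iot",
      "wildcardFileName": "*.json",
      "maxConcurrentConnections": 2
   },
   "formatSettings": { "type": "JsonReadSettings" }
}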



We are done with the source part. Now click the Sink tab and hit "+New" against the Sink dataset drop-down. Select "Azure Blob Storage" and hit Continue


Select "JSON" as format and click "Continue"


Give a name of your choice in the properties window and select "New" from the list for linked service

In the next window we need to provide our Azure Blob storage information:
Name: free text
Connect via integration runtime: leave the default AutoResolveIntegrationRuntime
Authentication method: << Account key >> ..... there are other options as well
Azure subscription: << select the correct Azure subscription >>
Storage account name: << select the storage account >> .... you must have already created the storage account

Hit the "Test connection" at the bottom. If it succeeds (it should) Click Create


Click the "Browse" button and select your Azure storage container from the list. Click OK



We are done with the Sink part as well.
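In the pipeline JSON, the sink side of the copy activity then corresponds roughly to a block like this:

"sink": {
   "type": "JsonSink",
   "storeSettings": { "type": "AzureBlobStorageWriteSettings" },
   "formatSettings": { "type": "JsonWriteSettings" }
}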

Now, as a last step, hit the "Validate all" button at the top and then hit the "Publish all" button. Both should pass without any issue. Please note that you need to publish/deploy the pipeline before using it.

Cool... now let's run the pipeline. Click on Add trigger >> Trigger now. Click Finish in the next window to fire up the copy process.



Once fired, you can monitor the progress from the Monitor window. It's "In progress". Keeping fingers crossed 😮


And it succeeds 😀


Great job!... Now just do a quick check. Open your Azure blob container and yep... the files are indeed copied into it. 🙋


Congratulations for coming this far! I hope this article boosts your confidence and encourages you to explore this service further.

Do share your experience and what you have built upon this foundation. I would love to hear from you.