Wednesday, 21 September 2016

A well-populated AWS CloudFormation template for building an EMR Cluster

22/02/17: Updated the template to support EMR 5 and optional task nodes for processing

I've been working on a more efficient way of deploying EMR (Elastic MapReduce) clusters for "Big Data" processing using applications that come as part of the Hadoop suite. Originally I used a parameterised Jenkins job with a lengthy AWS CLI command, but that became difficult to maintain the more functionality I added to it. I won't share it here, as it was 100+ lines long. Then I came across AWS CloudFormation, which makes these deployments easy to build and maintain.

AWS CloudFormation enables you to create and manage AWS resources using Infrastructure as Code (I've included a link to the AWS CloudFormation product page at the end of this post). While investigating how to script EMR in CloudFormation, I noticed there weren't many resources available online for building a template tailored to a custom EMR cluster. I tried tools such as CloudFormer and the CloudFormation template designer, but no such luck. In the end, I took the most basic EMR template available in the AWS Knowledge Base and built on top of it. Feel free to use it; I'll keep it updated as I add to it.

---
AWSTemplateFormatVersion: '2010-09-09'
Description: CloudFormation template to spin up EMR clusters, V3 (EMR 5 only)
Parameters:
  clusterName:
    Description: Name of the cluster
    Type: String
  taskInstanceCount:
    Description: Number of task instances
    Type: String
    AllowedValues:
    - '1'
    - '2'
    - '3'
    - '4'
    - '5'
    - '6'
    - '7'
    ConstraintDescription: Up to 7 nodes only
  emrVersion:
    Description: Version of EMR
    Type: String
    AllowedPattern: emr-5\.[0-9]+\.[0-9]+
    ConstraintDescription: 'Must be an EMR 5 release label (e.g.: emr-5.3.0)'
  masterInstanceType:
    Description: Instance type of Master Node
    Type: String
  coreInstanceType:
    Description: Instance type of Core Node
    Type: String
  taskInstanceType:
    Description: Instance type of Task Node
    Type: String
  environmentType:
    Description: Environment the cluster will run in
    Type: String
  s3BucketBasePath:
    Description: Bucket to log EMR actions to
    Type: String
  taskBidPrice:
    Description: Bid price for Task nodes
    Type: String
  terminationProtected:
    Description: Whether termination protection is enabled on the cluster
    Type: String
    AllowedValues:
    - 'true'
    - 'false'
    ConstraintDescription: Boolean
  awsRegion:
    Description: awsRegion
    Default: eu-west-1
    AllowedValues:
    - eu-west-1
    - eu-central-1
    Type: String
Conditions:
  isLive:
    Fn::Equals:
    - Ref: environmentType
    - live
Resources:
  EMRClusterV5:
    Type: AWS::EMR::Cluster
    Properties:
      Instances:
        MasterInstanceGroup:
          InstanceCount: 1
          InstanceType:
            Ref: masterInstanceType
          Market: ON_DEMAND
          Name: Master instance group - 1
        CoreInstanceGroup:
          InstanceCount: 1
          InstanceType:
            Ref: coreInstanceType
          Market: ON_DEMAND
          Name: Core instance group - 2
        TerminationProtected:
          Ref: terminationProtected
        Ec2SubnetId: ENTER SUBNET HERE
        Ec2KeyName: ENTER NAME OF SSH KEY HERE
        EmrManagedMasterSecurityGroup: ENTER SECURITY GROUP HERE
        EmrManagedSlaveSecurityGroup: ENTER SECURITY GROUP HERE
        ServiceAccessSecurityGroup: ENTER SECURITY GROUP HERE
      BootstrapActions:
      - Name: NAME OF BOOTSTRAP
        ScriptBootstrapAction:
          Path: S3 LOCATION OF SHELL SCRIPT
      Configurations:
      - Classification: hadoop-log4j
        ConfigurationProperties:
          hadoop.log.maxfilesize: 256MB
          hadoop.log.maxbackupindex: '3'
          hadoop.security.log.maxfilesize: 256MB
          hadoop.security.log.maxbackupindex: '3'
          hdfs.audit.log.maxfilesize: 256MB
          hdfs.audit.log.maxbackupindex: '3'
          mapred.audit.log.maxfilesize: 256MB
          mapred.audit.log.maxbackupindex: '3'
          hadoop.mapreduce.jobsummary.log.maxfilesize: 256MB
          hadoop.mapreduce.jobsummary.log.maxbackupindex: '3'
      - Classification: hbase-log4j
        ConfigurationProperties:
          hbase.log.maxbackupindex: '3'
          hbase.log.maxfilesize: 10MB
          hbase.security.log.maxbackupindex: '3'
          hbase.security.log.maxfilesize: 10MB
      - Classification: yarn-site
        ConfigurationProperties:
          yarn.log-aggregation.retain-seconds: '43200'
      Applications:
      - Name: Hadoop
      - Name: Hive
      - Name: Pig
      - Name: Hue
      - Name: HCatalog
      - Name: Sqoop
      - Name: Ganglia
      - Name: Spark
      - Name: Oozie
      - Name: Tez
      Name:
        Ref: clusterName
      JobFlowRole: ENTER EMR ROLE HERE
      ServiceRole: ENTER EMR ROLE HERE
      ReleaseLabel:
        Ref: emrVersion
      LogUri:
        Fn::Join:
        - ''
        - - s3n://
          - Ref: s3BucketBasePath
          - "/logs/"
      VisibleToAllUsers: true
      Tags:
      - Key: Name
        Value:
          Fn::Join:
          - ''
          - - emr-instance-
            - Ref: AWS::StackName
            - ''
      - Key: Environment
        Value:
          Ref: environmentType
      - Key: Stack ID
        Value:
          Ref: AWS::StackName
  EMRTaskNodes:
    Type: AWS::EMR::InstanceGroupConfig
    Properties:
      InstanceCount:
        Ref: taskInstanceCount
      InstanceType:
        Ref: taskInstanceType
      BidPrice:
        Ref: taskBidPrice
      Market: SPOT
      InstanceRole: TASK
      Name: Task instance group - 3
      JobFlowId:
        Ref: EMRClusterV5

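Before creating a stack, the template itself can be checked for syntax and schema errors with the CLI's validate-template subcommand (a sketch; the S3 URL and file name are placeholders):

```shell
# Validate the template stored in S3 (catches YAML/schema errors before deployment)
aws cloudformation validate-template \
--template-url https://s3-eu-west-1.amazonaws.com/[BUCKET]/emr-cluster.yaml

# Or validate a local copy instead
aws cloudformation validate-template --template-body file://emr-cluster.yaml
```

This only checks that the template is well-formed CloudFormation; it won't catch things like an invalid subnet or security group ID, which only surface at create time.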

To deploy the stack, use the following command:

aws cloudformation create-stack --stack-name [STACK NAME] \
--template-url [LOCATION OF TEMPLATE] --parameters \
ParameterKey=clusterName,ParameterValue=$stackName \
ParameterKey=taskInstanceCount,ParameterValue=$taskNodeCount \
ParameterKey=coreInstanceType,ParameterValue=$coreNodeInstanceType \
ParameterKey=taskInstanceType,ParameterValue=$taskNodeInstanceType \
ParameterKey=emrVersion,ParameterValue=$emrVersion \
ParameterKey=environmentType,ParameterValue=$environmentType \
ParameterKey=masterInstanceType,ParameterValue=$masterNodeInstanceType \
ParameterKey=s3BucketBasePath,ParameterValue=$s3BucketBasePath \
ParameterKey=terminationProtected,ParameterValue=$terminationProtected \
ParameterKey=taskBidPrice,ParameterValue=$bidPrice --region $awsRegion

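Note that create-stack returns as soon as the request is accepted, not when the cluster is actually up. To block until provisioning finishes, you can follow it with one of the CLI's built-in waiters (assuming the same stack name and region variables as above):

```shell
# Poll until the stack reaches CREATE_COMPLETE (exits non-zero if it rolls back)
aws cloudformation wait stack-create-complete --stack-name [STACK NAME] --region $awsRegion

# Then fetch the EMR cluster ID from the stack's resources
aws cloudformation describe-stack-resources --stack-name [STACK NAME] \
--logical-resource-id EMRClusterV5 --region $awsRegion
```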
To update the stack (e.g. to change the number of task nodes):

aws cloudformation update-stack --stack-name [STACK NAME] \
--use-previous-template --parameters \
ParameterKey=clusterName,UsePreviousValue=true \
ParameterKey=taskInstanceCount,ParameterValue=$taskNodeCount \
ParameterKey=coreInstanceType,UsePreviousValue=true \
ParameterKey=taskInstanceType,UsePreviousValue=true \
ParameterKey=emrVersion,UsePreviousValue=true \
ParameterKey=environmentType,UsePreviousValue=true \
ParameterKey=masterInstanceType,UsePreviousValue=true \
ParameterKey=s3BucketBasePath,UsePreviousValue=true \
ParameterKey=terminationProtected,UsePreviousValue=true \
ParameterKey=taskBidPrice,UsePreviousValue=true --region $awsRegion

The "--use-previous-template" flag and the "UsePreviousValue" parameter setting ensure nothing else changes.

Finally, to delete the stack:

aws cloudformation update-stack --stack-name [STACK_NAME] \
--use-previous-template --parameters \
ParameterKey=clusterName,UsePreviousValue=true \
ParameterKey=taskInstanceCount,ParameterValue=$taskNodeCount \
ParameterKey=coreInstanceType,UsePreviousValue=true \
ParameterKey=taskInstanceType,UsePreviousValue=true \
ParameterKey=emrVersion,UsePreviousValue=true \
ParameterKey=environmentType,UsePreviousValue=true \
ParameterKey=masterInstanceType,UsePreviousValue=true \
ParameterKey=s3BucketBasePath,UsePreviousValue=true \
ParameterKey=terminationProtected,ParameterValue=false \
ParameterKey=taskBidPrice,UsePreviousValue=true --region $awsRegion
sleep 20
aws cloudformation delete-stack --stack-name [STACK_NAME] --region [REGION]

The first section of the command updates the stack by changing the termination protection value to 'false'. Once that has completed, the stack is then deleted.
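Rather than a fixed sleep, the CLI's built-in waiters can block until the update has actually finished before deleting (a sketch, assuming the same stack name and region as above):

```shell
# After the update-stack call that disables termination protection,
# wait for the update to finish instead of sleeping a fixed interval
aws cloudformation wait stack-update-complete --stack-name [STACK_NAME] --region $awsRegion

# Now the delete can proceed, and we can block until it finishes too
aws cloudformation delete-stack --stack-name [STACK_NAME] --region $awsRegion
aws cloudformation wait stack-delete-complete --stack-name [STACK_NAME] --region $awsRegion
```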

In conclusion, we've replaced a script of 100+ lines with commands that average 14 lines (including line continuations).

Link to AWS CloudFormation: https://aws.amazon.com/cloudformation/

Comments:

  1. Thanks for posting this, it has been very useful. A couple of issues I encountered:
    1. I received error: "Encountered unsupported property Configuration"
    I had to delete the configuration sections from your template.
    2. The parameter "s3BucketBasePath" must have a Type or it will be rejected.
    3. I changed the "Name" property to be "Ref": "clusterName"

    Replies
    1. Hey Tim

      No problem, I've updated this a lot since I posted it - will update when I get a second.

      1. I guess this depends on whether the S3 path was correct, or on your custom config (this section of the template was tailored to my requirements).
      2. That must have been a typo on my end - apologies.
      3. I did the same some time ago.

      Thanks for the feedback - much appreciated.

      Mike.

    2. Tim

      As promised, I've updated the template.

      Cheers

      Mike.

  2. Invalid bootstrap action path, must be a location in Amazon S3 or a local path starting with 'file:'.



  3. Regarding all those parameters in the update-stack call, do you really need to specify them if they aren't changing? Based on what it says here, it uses what's already in your template:
    http://docs.aws.amazon.com/AWSCloudFormation/latest/APIReference/API_Parameter.html

    "If you don't specify a key and value for a particular parameter, AWS CloudFormation uses the default value that is specified in your template."

    But then, why would they have "UsePreviousValue"? So maybe you do. Seems excessive.

  9. I want the master node's instance ID in the Outputs section; which attribute should I use to get that?


