Timothy Maxwell | bazaarvoice: engineering

One of the chief promises of the cloud is fast scalability, but what good is snappy scalability without load prediction to match? How many teams out there are still manually switching group sizes when load spikes? If you would like to make your Amazon EC2 scaling more predictive, less reactive and hopefully less expensive it is my intention to help you with this article.

Problem 1: AWS EC2 Autoscaling Groups can only scale in response to metrics in CloudWatch and most of the default metrics are not sufficient for predictive scaling.

For instance, by looking at the CloudWatch Namespaces reference page we can see that Amazon SQS queues, EC2 Instances and many other Amazon services post metrics to CloudWatch by default.

From SQS you get things like NumberOfMessagesSent and SentMessageSize. EC2 Instances post metrics like CPUUtilization and DiskReadOps. These metrics are helpful for monitoring. You could also use them to reactively scale your service.

The downside is that by the time you notice that you are using too much CPU or sending too few messages, you’re often too late. EC2 instances take time to start up and instances are billed by the hour, so you’re either starting to get a backlog of work while starting up or you might shut down too late to take advantage of an approaching hour boundary and get charged for a mostly unused instance hour.

More predictive scaling would start up the instances before the load became business critical or it would shut down instances when it becomes clear they are not going to be needed instead of when their workload drops to zero.

Problem 2: AWS CloudWatch default metrics are only published every 5 minutes.

In five minutes a lot can happen, with more granular metrics you could learn about your scaling needs quite a bit faster. Our team has instances that take about 10 minutes to come online, so 5 minutes can make a lot of difference to our responsiveness to changing load.

Solution 1 & 2: Publish your own CloudWatch metrics

Custom metrics can overcome both of these limitations, you can publish metrics related to your service’s needs and you can publish them much more often.

For example, one of our services runs on EC2 instances and processes messages off an SQS queue. The load profile can vary over time; some messages can be handled very quickly and some take significantly more time. It’s not sufficient to simply look at the number of messages in the queue as the average processing speed can vary between 2 and 60 messages per second depending on the data.

We prefer that all our messages be handled within 2 hours of being received. With this in mind I’ll describe the metric we publish to easily scale our EC2 instances.

ApproximateSecondsToCompleteQueue = MessagesInQueue / AverageMessageProcessRate

The metric we publish is called ApproximateSecondsToCompleteQueue. A scheduled executor on our primary instance runs every 15 seconds to calculate and publish it.

private AmazonCloudWatchClient _cloudWatchClient = new AmazonCloudWatchClient();
_cloudWatchClient.setRegion(RegionUtils.getRegion("us-east-1"));

...

PutMetricDataRequest request = new PutMetricDataRequest()
  .withNamespace(CUSTOM_SQS_NAMESPACE)
  .withMetricData(new MetricDatum()
  .withMetricName("ApproximateSecondsToCompleteQueue")
  .withDimensions(new Dimension()
                    .withName(DIMENSION_NAME)
                    .withValue(_queueName))
  .withUnit(StandardUnit.Seconds)
  .withValue(approximateSecondsToCompleteQueue));

_cloudWatchClient.putMetricData(request);

In our CloudFormation template we have a parameter calledDesiredSecondsToCompleteQueue and by default we have it set to 2 hours (7200 seconds). In the Auto Scaling Group we have a scale up action triggered by an Alarm that checks whether DesiredSecondsToCompleteQueue is less than ApproximateSecondsToCompleteQueue.

"EstimatedQueueCompleteTime" : {
  "Type": "AWS::CloudWatch::Alarm",
  "Condition": "HasScaleUp",
  "Properties": {
    "Namespace": "Custom/Namespace",
    "Dimensions": [{
      "Name": "QueueName",
      "Value": { "Fn::Join" : [ "", [ {"Ref": "Universe"}, "-event-queue" ] ] }
    }],
    "MetricName": "ApproximateSecondsToCompleteQueue",
    "Statistic": "Average",
    "ComparisonOperator": "GreaterThanThreshold",
    "Threshold": {"Ref": "DesiredSecondsToCompleteQueue"},
    "Period": "60",
    "EvaluationPeriods": "1",
    "AlarmActions" : [{
      "Ref": "ScaleUpAction"
    }]
  }
}

Visualizing the Outcome

What’s a cloud blog without some graphs? Here’s what our load and scaling looks like after implementing this custom metric and scaling. Each of the colors in the middle graph represents a service instance. The bottom graph is in minutes for readability. Note that our instances terminate themselves when there is nothing left to do.

I hope this blog has shown you that it’s quite easy to publish your own CloudWatch metrics and scale your EC2 AutoScalingGroups accordingly.

At Bazaarvoice we use Dropwizard for a lot of our java based SOA services. Recently I upgraded our Dropwizard dependency from 0.6 to the newer 0.7 version on a few different services. Based on this experience I have some observations that might help any other developers attempting to do the same thing.

Package Name Change
The first change to look at is the new package naming. The new io.dropwizard package replaces com.yammer.dropwizard. If you are using codahale’s metrics library as well, you’ll need to change com.yammer.metrics to com.codahale.metrics. I found that this was a good place to start the migration: if you remove the old dependencies from your pom.xml you can start to track down all the places in your code that will need attention (if you’re using a sufficiently nosy IDE).

- com.yammer.dropwizard -> io.dropwizard
- com.yammer.dropwizard.config -> io.dropwizard.setup
- com.yammer.metrics -> com.codahale.metrics

Class Name Change
aka: where did my Services go?

Something you may notice quickly is that the Service interface is gone, it has been moved to a new name: Application.

- Service -> Application

Configuration Changes
The Configuration object hierarchy and yaml organization has also changed. The http section in yaml has moved to server with significant working differences.

Here’s an old http configuration:

http:
  port: 8080
  adminPort: 8081
  connectorType: NONBLOCKING
  requestLog:
    console:
      enabled: true
    file:
      enabled: true
      archive: false
      currentLogFilename: target/request.log

and here is a new server configuration:

server:
  applicationConnectors:
    - type: http
      port: 8080
  adminConnectors:
    - type: http
      port: 8081
  requestLog:
    appenders:
      - type: console
      - type: file
        currentLogFilename: target/request.log
        archive: true

There are at least two major things to notice here:

You can create multiple connectors for either the admin or application context. You can now serve several different protocols on different ports.
Logging is now appender based, and you can configure a list of appenders for the request log.

Speaking of appender-based logging, the logging configuration has changed as well.

Here is an old logging configuration:

logging:
  console:
    enabled: true
  file:
    enabled: true
    archive: false
    currentLogFilename: target/diagnostic.log
  level: INFO
  loggers:
    "org.apache.zookeeper": WARN
    "com.sun.jersey.spi.container.servlet.WebComponent": ERROR

and here is a new one:

logging:
  level: INFO
  loggers:
    "org.apache.zookeeper": WARN
    "com.sun.jersey.spi.container.servlet.WebComponent": ERROR
  appenders:
    - type: console
    - type: file
      archive: false
      currentLogFilename: target/diagnostic.log

Now that you can configure a list of logback appenders, you can write your own or get one from a library. Previously this kind of logging configuration was not possible without significant hacking.

Environment Changes
The whole environment API has been re-designed for more logical access to different components. Rather than just making calls to methods on the environment object, there are now six component specific environment objects to access.

JerseyEnvironment jersey = environment.jersey();
ServletEnvironment servlets = environment.servlets();
AdminEnvironment admin = environment.admin();
LifecycleEnvironment lifecycle = environment.lifecycle();
MetricRegistry metrics = environment.metrics();
HealthCheckRegistry healthCheckRegistry = environment.healthChecks();

AdminEnvironment extends ServletEnvironment since it’s just the admin servlet context.

By treating the environment as a collection of libraries rather than a Dropwizard monolith, fine-grained control over several configurations is now possible and the underlying components are easier to interact with.

Here is a short rundown of the changes:

Lifecycle Environment
Several common methods were moved to the lifecycle environment, and the build pattern for Executor services has changed.

0.6:

     environment.manage(uselessManaged);
     environment.addServerLifecycleListener(uselessListener);
     ExecutorService service = environment.managedExecutorService("worker-%", minPoolSize, maxPoolSize, keepAliveTime, duration);
     ExecutorServiceManager esm = new ExecutorServiceManager(service, shutdownPeriod, unit, poolname);
     ScheduledExecutorService scheduledService = environment.managedScheduledExecutorService("scheduled-worker-%", corePoolSize);

0.7:

     environment.lifecycle().manage(uselessManaged);
     environment.lifecycle().addServerLifecycleListener(uselessListener);
     ExecutorService service = environment.lifecycle().executorService("worker-%")
             .minThreads(minPoolSize)
             .maxThreads(maxPoolSize)
             .keepAliveTime(Duration.minutes(keepAliveTime))
             .build();
     ExecutorServiceManager esm = new ExecutorServiceManager(service, Duration.seconds(shutdownPeriod), poolname);
     ScheduledExecutorService scheduledExecutorService = environment.lifecycle().scheduledExecutorService("scheduled-worker-%")
             .threads(corePoolSize)
             .build();

Other Miscellaneous Environment Changes
Here are a few more common environment configuration methods that have changed:

0.6

environment.addResource(Dropwizard6Resource.class);

environment.addHealthCheck(new DeadlockHealthCheck());

environment.addFilter(new LoggerContextFilter(), "/loggedpath");

environment.addServlet(PingServlet.class, "/ping");

0.7

environment.jersey().register(Dropwizard7Resource.class);

environment.healthChecks().register("deadlock-healthcheck", new ThreadDeadlockHealthCheck());

environment.servlets().addFilter("loggedContextFilter", new LoggerContextFilter()).addMappingForUrlPatterns(EnumSet.allOf(DispatcherType.class), true, "/loggedpath");

environment.servlets().addServlet("ping", PingServlet.class).addMapping("/ping");

Object Mapper Access

It can be useful to access the objectMapper for configuration and testing purposes.

0.6

ObjectMapper objectMapper = bootstrap.getObjectMapperFactory().build();

0.7

ObjectMapper objectMapper = bootstrap.getObjectMapper();

HttpConfiguration
This has changed a lot, it is much more configurable and not quite as simple as before.
0.6

HttpConfiguration httpConfiguration = configuration.getHttpConfiguration();
int applicationPort = httpConfiguration.getPort();

0.7

HttpConnectorFactory httpConnectorFactory = (HttpConnectorFactory) ((DefaultServerFactory) configuration.getServerFactory()).getApplicationConnectors().get(0);
int applicationPort = httpConnectorFactory.getPort();

Test Changes
The functionality provided by extending ResourceTest has been moved to ResourceTestRule.
0.6

import com.yammer.dropwizard.testing.ResourceTest;

public class Dropwizard6ServiceResourceTest extends ResourceTest {
  @Override
  protected void setUpResources() throws Exception {
    addResource(Dropwizard6Resource.class);
    addFeature("booleanFeature", false);
    addProperty("integerProperty", new Integer(1));
    addProvider(HelpfulServiceProvider.class);
  }
}

0.7

import io.dropwizard.testing.junit.ResourceTestRule;
import org.junit.Rule;

public class Dropwizard7ServiceResourceTest {

  @Rule
  ResourceTestRule resources = setUpResources();

  protected ResourceTestRule setUpResources() {
    return ResourceTestRule.builder()
      .addResource(Dropwizard6Resource.class)
      .addFeature("booleanFeature", false)
      .addProperty("integerProperty", new Integer(1))
      .addProvider(HelpfulServiceProvider.class)
      .build();
  }
}

Dependency Changes

Dropwizard 0.7 has new dependencies that might affect your project. I’ll go over some of the big ones that I ran into during my migrations.

Guava
Guava 18.0 has a few API changes:

Closeables.closeQuietly only works on objects implementing InputStream instead of anything implementing Closeable.
All the methods on HashCodes have been migrated to HashCode.

Metrics
Metric 3.0.2 is a pretty big revision to the old version, there is no longer a static Metrics object available as the default registry. Now MetricRegistries are instantiated objects that need to be managed by your application. Dropwizard 0.7 handles this by giving you a place to put the default registry for your application: bootstrap.getMetricRegistry().

Compatible library version changes
These libraries changed versions but required no other code changes. Some of them are changed to match Dropwizard dependencies, but are not directly used in Dropwizard.

Jackson
2.3.3

Jersey
1.18.1

Coursera Metrics-Datadog
1.0.2

Jetty
9.0.7.v20131107

Apache Curator
2.4.2

Amazon AWS SDK
1.9.21

Future Concerns
Dropwizard 0.8
The newest version of Dropwizard is now 0.8, once it is proven stable we’ll start migrating. Hopefully I’ll find time to write another post when that happens.

Thank You For Reading
I hope this article helps.

bazaarvoice: engineering

The official blog of Bazaarvoice R&D

Author Archives: Timothy Maxwell

Predictively Scaling EC2 Instances with Custom CloudWatch Metrics

Upgrading Dropwizard 0.6 to 0.7