Cloud Automation: problem & solution May 12, 2010
(See copyright info at the end of this article)
Imagine a world where cloud-computing is no longer a theory or implemented by just a few huge corporations. It’s already mainstream. People have been using it for years, and clouds-in-a-box are but a double-click away.
Information technology veterans would say cloud-computing has always been there wearing other costumes such as ‘grid computing’, ‘cluster’ and ‘utility computing services’. They argue that we wrap these old concepts with new buzz words like Vmotioning, Vbusiness and any other V’s you can imagine. All under the benevolent cloud.
Is cloud computing really just an old concept with new clothes?
Regardless of the answer, the cloud is here to stay. Giant companies advocate their cloud solutions. Amazon, Oracle, IBM and communication giants like BT all aim to sell you a brand new cloud to wrap around your business.
You can easily conclude that ANYTHING-As-a-Service (AaaS) is how we are going to deliver business solutions.
Let’s have an honest conversation about “the day after the cloud”.
A cloud supposedly saves you money because you consolidate your hardware, but it does more than that: a cloud environment is meant to give your business dynamically allocated resources.
Are we not forgetting performance during the migration to a cloud environment? Can you control operational performance in your post-cloud business? Surely, if a cloud saves you 30% of your operations costs, you should be happy. But what if you had to give up 7% of your performance in the process? Will your new cloud have the right to exist if the business behind the cloud is slowed down?
What happens once you get to your brand new shiny cloud? What happens on day one? On day two?
First, one needs to be reminded that users demand the same performance as before, or better.
Second, your physical machines in the cloud will be fully utilized, and your applications will receive more resources. Yes, it’s a good thing, but remember that fully utilized machines mean every box is now a critical machine, thus your data-center will exhibit more critical errors that used to be minor errors. To offset that problem, you may want to buy more hardware and spread the risk. This actually counters the savings everyone says cloud-computing should bring to a business.
Third, there are too many vmachines out there providing CPU/memory/temperature data, more than before. That data is only meaningful if you know what business application these servers are involved in, which is why you should be monitoring the applications to begin with.
Here’s an everyday example: Your main Java applications will each receive 3GB of RAM and 4.5 CPUs in the cloud. That’s great, but your users now complain that their application is slow to respond in the cloud. Where do you start looking? Your reaction will be measured by how fast you manage to pinpoint the specific faulty transaction.
Using classic IT concepts, we know an application belongs to one or more servers that you manage. To fix a performance problem, you can engage your IT, QA and development teams. IT people are trained to react to a complaint by applying a scripted troubleshooting procedure. Usually they will look at the server heartbeat stats (CPU/RAM/IO/number of application requests) in order to find where the problem is. In a complex environment, IT people have discovered they spend over 60% of their time (!) chasing those problems, searching for the issues causing the performance degradation. Typically, the investigation goes like this:
- The 24/7 call center will be alerted by end users.
- The alert reaches the supervisor. The supervisor will take time to investigate and will learn that the application is slow or not responding.
- The resources screen shows basic metrics. The number of requests and CPU/RAM/storage usage are the most you get.
- A special screen on the supervisor’s desk adds resources to the application with a click of a few buttons.
- As a result, the application is back to smooth running.
- The resolution came too late for some users, who took their business elsewhere.
Typically, IT people try to re-configure the application to receive more server-resources from each server or add more CPUs and memory up to the limit of your physical servers as well. Bottom line: it’s slow. Corporations lose business over these performance issues. IT teams spend their time hunting for problems when they could spend time pursuing positive goals.
Another option is getting quicker and more accurate answers from Application Performance Monitoring (APM) tools such as Precise Software, Correlsense, DBTuna and more.
Using cloud concepts, your applications belong to the virtual servers that make up your cloud. You can allocate more CPUs and memory, and the cloud itself is your only limit. However, monitoring the CPU/memory load of a single server is a lot less meaningful in a cloud environment. Your Vmachines are irrelevant to the nature of the problem, or to the nature of the necessary fix. You simply have too many virtual servers in the mix to be able to pinpoint a specific transaction that caused a bottleneck on your service.
If it’s a hardware fault, you will most likely get alerted by multiple Vmachines at once.
If a single Vmachine exhibits a transaction failure then the Vmachine statistics are probably not going to clue you into the cause of your pain.
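The two rules of thumb above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the `vm_to_host` inventory mapping and the `min_vms` threshold are all hypothetical and do not come from any vendor's tooling.

```python
from collections import defaultdict

def classify_alerts(alerting_vms, vm_to_host, min_vms=2):
    """Guess the likely fault domain for each physical host.

    alerting_vms: names of the Vmachines that are currently alerting.
    vm_to_host:   hypothetical inventory mapping Vmachine name -> physical host.
    min_vms:      assumed threshold; this many alerting Vmachines on one host
                  makes a hardware fault the first suspect.
    """
    by_host = defaultdict(set)
    for vm in alerting_vms:
        by_host[vm_to_host[vm]].add(vm)

    verdicts = {}
    for host, vms in by_host.items():
        if len(vms) >= min_vms:
            verdicts[host] = "suspect-hardware"
        else:
            # A lone Vmachine alert says little; look at the transactions instead.
            verdicts[host] = "inspect-transactions"
    return verdicts
```

Called with three alerting Vmachines, two of which share a host, the sketch points at hardware for the shared host and at the application transactions for the other.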
Application Performance Monitoring (APM) tools behave very well within a cloud. Meaningful monitoring options show not only the server heartbeat stats but also in-depth application activity. When performance is an issue, a relevant context is displayed for your service. For example: If your orders/purchases transaction is slow you will see that someone clicked on a web page trying to “buy 100 GOOG stocks” or “Submit order”. It takes 12 seconds for the transaction to complete and display a confirmation to the user. A resource-consuming Java-loop is involved. APM tools will display the problem and the location where the problem took place, down to the relevant piece of Java code. Alerts are sent to the pre-defined IT people on shift. Development teams are then engaged for a quick fix.
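To make the idea concrete, here is a minimal sketch of what transaction-level monitoring boils down to: add up the time spent in each step of a business transaction and, when the total is too slow, point at the step that consumed the most time. The data classes, the 3-second threshold and the step names are assumptions made for illustration; real APM products collect this data through their own agents and interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    location: str    # where the time was spent, e.g. a class and method name
    seconds: float   # time spent in this step

@dataclass
class Transaction:
    name: str                                 # business label, e.g. "Submit order"
    steps: list = field(default_factory=list)

def find_slow_transactions(transactions, threshold_seconds=3.0):
    """Return (name, total_seconds, slowest_location) for each slow transaction.

    threshold_seconds is an assumed service-level target, not a standard value.
    """
    slow = []
    for tx in transactions:
        total = sum(step.seconds for step in tx.steps)
        if total > threshold_seconds:
            worst = max(tx.steps, key=lambda step: step.seconds)
            slow.append((tx.name, total, worst.location))
    return slow
```

Fed a "Submit order" transaction whose steps add up to 12 seconds, most of it inside one Java loop, the sketch reports that transaction together with the loop's location, which is the context a development team needs for a quick fix.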
The same concept applies when APM tools engage database issues, networking, web and more. SAP, Citrix, DB2, Oracle, .Net, MSSql are only a partial list of platforms that benefit from the in-depth view of APM tools.
APM tools must be the trigger for any dynamic allocation of cloud resources, but cloud vendors have yet to provide a clear plan for how they will mobilize cloud resources in a live cloud environment. Are they hoping to rely on meaningless CPU/memory/number-of-requests data? If APM tools are involved, not only do you know the current behavior of your cloud, but you can also act automatically on statistical load-trend analysis that helps you predict your cloud’s behavior.
For some businesses the end of each quarter is a known and obvious load peak. Can APM tools predict less obvious peaks? The answer is yes. They have done so for years. Cloud vendors need to manage their cloud using this available APM data as a trigger for dynamic automated changes in the cloud.
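A very simple form of load-trend analysis can be sketched as follows. It assumes the APM tool can export hourly transaction counts for past days; the function name and the 1.5x factor are hypothetical choices for illustration, and commercial tools use far richer statistics.

```python
from statistics import mean

def predict_peak_hours(history, factor=1.5):
    """Return the hours of the day whose average load stands out.

    history: list of days, each a list of 24 hourly transaction counts
             (assumed to be exported from an APM tool).
    factor:  assumed sensitivity; an hour counts as a peak when its average
             exceeds factor times the overall hourly average.
    """
    hourly_avg = [mean(day[hour] for day in history) for hour in range(24)]
    overall = mean(hourly_avg)
    return [hour for hour, avg in enumerate(hourly_avg) if avg > factor * overall]
```

The hours it returns are the ones for which a cloud could allocate resources in advance, before users feel the peak.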
APM tools have always targeted the transaction. They had the ability to mask the server from your decision making process when servers were irrelevant. APM tools did that long before the cloud came to be our soupe du jour. The right management of cloud environments must include autonomous dynamic changes based on the business transaction heartbeat, and not the Vmachine heartbeat.
The answer to effective cloud automation management lies in the synergy between cloud performance and APM tools. Cloud vendors are already working on an SDK interface for cloud management, including automated functions that request more CPU/memory/disk resources from physical machines in the cloud, or vmotion entire virtual machines, to supply a dynamic boost when needed. Effective cloud management automation should use APM data as the trigger for such automated decisions.
A typical day will look like this e-commerce business example:
- A cloud functions as normal during the night.
- During the morning a peak is expected due to past trends and APM measurements. The cloud will allocate resources in advance to the main application group.
- When the peak is over, some virtual machines and CPU/memory resources will be de-allocated and used elsewhere or placed on reserve.
- An unexpected peak is detected by automated APM tools.
- A cloud SDK function is used to allocate the additional resources.
- End users or customers do not experience a problem.
- A notification is sent to the IT group and business owner about the unexpected modification.
- When APM tools detect normal behavior, the appropriate SDK function is called up to de-allocate the resources. Notifications are sent again.
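The unexpected-peak part of that day can be sketched as a small decision function. Everything here is an assumption made for illustration: `FakeCloud` stands in for a vendor SDK that does not yet exist in this form, and the response-time thresholds are arbitrary. The point is only that the trigger is the transaction response time reported by APM, not the Vmachine heartbeat.

```python
class FakeCloud:
    """Stand-in for a cloud management SDK; every method is hypothetical."""

    def __init__(self, units=2):
        self.units = units          # resource units currently allocated
        self.notifications = []     # messages sent to IT and the business owner

    def allocate(self, count):
        self.units += count

    def deallocate(self, count):
        self.units -= count

    def notify(self, message):
        self.notifications.append(message)


def react(cloud, response_seconds, baseline_units=2,
          slow_seconds=3.0, normal_seconds=1.0, step=1):
    """Adjust cloud resources from one APM response-time measurement."""
    if response_seconds > slow_seconds:
        # Unexpected peak: add resources before users feel it, then notify.
        cloud.allocate(step)
        cloud.notify(f"allocated {step} unit(s), response {response_seconds}s")
        return "allocated"
    if response_seconds < normal_seconds and cloud.units > baseline_units:
        # Back to normal: release the extra resources and notify again.
        cloud.deallocate(step)
        cloud.notify(f"de-allocated {step} unit(s), response {response_seconds}s")
        return "deallocated"
    return "unchanged"
```

A slow measurement adds a unit and sends a notification; once response times return to normal the extra unit is released and a second notification goes out, after which further normal measurements change nothing.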
The future of cloud computing is bright. However, it should be managed through expert automated resource allocation, which relies on APM to provide the necessary answer to a simple question: How is my business doing right now, and do I need to change my cloud to keep it running or run it faster?
Ask your cloud vendor how it plans to handle cloud automation. If the answer does not include application monitoring, you may want to look elsewhere. In order to manage clouds, good cloud management vendors will use over 12 years of experience accumulated by APM tools. Cloud vendors will either buy APM companies or make efforts to use these available tools via their APIs. They can also spend years trying to develop the same level of accuracy in monitoring infrastructure. Imagine a world that waits years for vendors to get it right without APM tools. Then go with APM.
Dor Juravski (dor@kynsloo.com)
Owner of kYnsloo consulting offering objective APM expertise in the market
http://www.kynsloo.com
Copyright © 2010 Dor Juravski and kYnsloo LTD. Do not use this article or any part of it without written permission from the author Dor Juravski. Information delivered in this article is delivered only to be considered for publication and no responsibility from reading or otherwise using this article shall be applied to the author or kYnsloo LTD.
