Which IT security roles fall to the ops team?

What priority should an IT operations team give to security tasks?

To help secure data and applications, an IT ops team needs to do much more than put up firewalls and apply other traditional security measures. Here’s where to begin.

Chris Moyer

VP of Technology – ACI Information Group – SearchCloudApplications

Spectre, Meltdown and similar zero-day vulnerabilities are the scary sorts of things that keep operations teams — especially those with IT security roles — awake at night. Fortunately for most cloud-based companies, those vulnerabilities can be addressed with the latest software updates or an adjustment to your Amazon Elastic Compute Cloud machine images. Organizations that run on serverless platforms have it even easier, needing only to wait for Amazon, Microsoft or Google to apply the patches to the underlying hardware.

Still, these vulnerabilities account for only a small fraction of the attack surface that modern-day operations teams must watch over.

To take their IT security roles seriously, these staffers need to be concerned with stolen credentials and corrupted code repositories, among numerous other threats. Custom alerts can help ops teams detect abnormal conditions, and software-testing procedures can be adjusted to include security risk detection. There’s plenty for ops to do.

Consider a Node.js application deployed on the AWS Lambda serverless platform. Every dependency included in the application could potentially become compromised and lead to malicious code running on your site. Such a calamity could result in the loss of your users’ data or your intellectual property.

Systems for continuous integration (CI) and continuous delivery (CD) allow developers to iterate much faster. This is generally a good thing: smaller, more frequent deployments tend to contain fewer bugs. Unfortunately, CI/CD tools rely on third-party systems to gather packages and requirements, and those repositories can become compromised.

One recent incident, for example, exposed a critical vulnerability in the NPM code repository: supposedly safe packages were replaced with nearly identical ones containing attack code. Because NPM packages can include build and deployment hooks as well, a compromised package could do anything from stealing the AWS credentials used to deploy your application to harvesting credit card numbers and passwords. Even packages you’ve fully vetted and have used for years could be compromised during a new deployment.

Previously, operations teams could mitigate some of this risk simply by controlling the hardware. Also, they could put in place specialized firewalls to prevent suspicious network traffic from causing issues, such as a site trying to upload credit card numbers to a known malicious IP address. With the move to cloud serverless technologies, much of this control has been taken away from ops, even while their IT security roles remain.

Adding detection to the CI/CD process

For teams with well-defined CI/CD practices, the build process should already include automated unit testing for bugs. It’s a natural progression to require that the build step also test for known security vulnerabilities. Many tools and organizations can help here, including Snyk and the Open Web Application Security Project, or OWASP. Ops teams are typically responsible for setting up these tools, and many of them can run one-time scans before a build as well as ongoing checks of production systems.
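
Here’s a minimal sketch of what such a build-time check might look like for a Node.js project, assuming the npm CLI is available on the build agent; the severity threshold is arbitrary, and the JSON layout varies a little between npm versions. A commercial scanner such as Snyk would slot into the same place in the pipeline.

    import json
    import subprocess
    import sys

    # Run npm's built-in dependency audit and capture machine-readable output.
    result = subprocess.run(["npm", "audit", "--json"], capture_output=True, text=True)
    report = json.loads(result.stdout or "{}")

    # npm audit summarizes findings under metadata.vulnerabilities by severity.
    severities = report.get("metadata", {}).get("vulnerabilities", {})
    high_or_critical = severities.get("high", 0) + severities.get("critical", 0)

    if high_or_critical:
        print(f"Build failed: {high_or_critical} high/critical vulnerabilities found")
        sys.exit(1)  # a non-zero exit code fails the CI stage

    print("Dependency audit passed")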

Additionally, ops teams with IT security roles or concerns may choose to create a custom in-house repository. For example, NPM Enterprise allows companies to include a feature-compatible version of NPM. This can be maintained by an internal team, behind a firewall, and it prevents the installation of third-party plug-ins that aren’t pre-approved. This can lead to faster, more secure and more reliable deployments.

Some attacks result from things that cannot be identified before a system is in production. For example, users’ accounts can be breached. Or, even worse, a developer’s account can be compromised.

With AWS, it’s critically important that each service has strict identity permissions. For example, a user’s API probably shouldn’t have the ability to create new Elastic Compute Cloud instances or to delete users. Developers should be brought along slowly and not granted write access until after they’ve proven they aren’t going to accidentally wipe out the entire database. And no one should have root AWS credentials, except maybe the CTO.
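
As a hedged illustration of what strict identity permissions look like in practice, the boto3 sketch below creates a policy that lets a hypothetical user-facing API read and write one DynamoDB table and nothing else; the table name, policy name and account ID are placeholders.

    import json
    import boto3

    iam = boto3.client("iam")

    # Least-privilege policy: the API can read and write one DynamoDB table,
    # but cannot touch EC2, IAM or anything else in the account.
    policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:Query"],
                "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/users",
            }
        ],
    }

    iam.create_policy(
        PolicyName="users-api-least-privilege",  # placeholder name
        PolicyDocument=json.dumps(policy_document),
    )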

It’s also important to make sure all identity and access management (IAM) users are required to have multifactor authentication (MFA) tokens set up, and it may be useful to turn on S3 versioning as well as require an MFA token to delete S3 objects.
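
Here’s a minimal boto3 sketch of turning both settings on; the bucket name and MFA device serial are placeholders, and AWS accepts an MFADelete change only from the bucket owner’s root credentials.

    import boto3

    s3 = boto3.client("s3")

    # Enable versioning and require an MFA token to permanently delete objects.
    # The MFA argument is the device serial (or ARN) plus the current code,
    # separated by a space.
    s3.put_bucket_versioning(
        Bucket="example-critical-data",                          # placeholder
        MFA="arn:aws:iam::123456789012:mfa/root-device 123456",  # placeholder
        VersioningConfiguration={
            "Status": "Enabled",
            "MFADelete": "Enabled",
        },
    )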

It’s always a good idea to back up critical data in another location — and encrypt it, if it’s sensitive. It’s important to note, however, that when you store backups in different locations, you’re increasing the exposure of that data to attackers. More backups are not always better.

Most cloud providers offer managed backup options. Those should always be the first choice.

Monitor for unusual activity

Even with strict policies in place and personnel focused on IT security roles, it’s inevitable that something will go wrong. Credentials will either be leaked accidentally or be exposed through some malicious code installed on someone’s computer.

It’s important for operations teams to monitor cloud activity. For AWS users, this is typically done via CloudWatch. In Azure, consider Operational Insights, Application Insights or other monitoring tools.

It’s also worth setting up custom alarms. These help you spot abnormalities, such as when an IAM account performs operations that deviate from normal patterns; that unusual behavior could indicate a system compromise.
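
For AWS users, a custom alarm like that takes only a few lines of boto3. The sketch below assumes a custom metric populated elsewhere, for example by a CloudTrail-to-CloudWatch Logs metric filter; the namespace, metric name and SNS topic are placeholders.

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Alarm on a custom metric assumed to count unauthorized API calls.
    cloudwatch.put_metric_alarm(
        AlarmName="unauthorized-api-calls",
        Namespace="CloudTrailMetrics",         # placeholder namespace
        MetricName="UnauthorizedAPICalls",     # placeholder metric
        Statistic="Sum",
        Period=300,                            # five-minute window
        EvaluationPeriods=1,
        Threshold=1,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        TreatMissingData="notBreaching",
        AlarmActions=[
            "arn:aws:sns:us-east-1:123456789012:security-alerts",  # placeholder
        ],
    )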

It can be trickier to identify issues with end users. While some problems can be obvious — such as when a user tries to log into their account from China an hour after logging in from California — other situations aren’t as readily apparent. Anomaly detection and manual approval of suspicious requests can be useful in preventing unwanted activity. Several services can help manage these types of rules, and most authentication services, such as Auth0, already provide built-in anomaly detection.

Web application firewalls can also add protection to your web-based access points by blocking traffic with community-maintained rule sets, as well as custom logic based on patterns your operations team identifies.

For example, if someone tries to reach a wp-admin URL on your custom in-house application, chances are they’re trying to hack into something. Many commonly targeted vulnerabilities are in WordPress and other applications written in PHP, so operations teams should watch for requests to suspicious URLs and be ready to block all traffic from the offending IP addresses.
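
As an illustrative sketch rather than a full WAF, the snippet below scans request records for URL patterns that have no business appearing on a custom application, such as wp-admin and phpMyAdmin paths, and collects the source IPs for blocking; the patterns and record format are assumptions.

    import re

    # URL fragments that should never appear on a custom, non-PHP application.
    SUSPICIOUS_PATTERNS = [
        re.compile(r"/wp-admin", re.IGNORECASE),
        re.compile(r"/wp-login\.php", re.IGNORECASE),
        re.compile(r"/phpmyadmin", re.IGNORECASE),
        re.compile(r"\.php$", re.IGNORECASE),
    ]

    def offending_ips(requests):
        """requests: iterable of (source_ip, path) tuples from your access logs."""
        flagged = set()
        for source_ip, path in requests:
            if any(pattern.search(path) for pattern in SUSPICIOUS_PATTERNS):
                flagged.add(source_ip)
        return flagged

    # The resulting set could feed a WAF IP block list or a firewall rule.
    print(offending_ips([("198.51.100.7", "/wp-admin/setup.php"),
                         ("203.0.113.9", "/api/v1/users")]))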

This was last published in May 2018

Application performance tuning tips and tricks

What common causes lead to slow application performance?

Tune hardware, resource allocations and prioritizations to make applications perform their best. Just don’t do it without tracking the results.

Brian Kirsch

IT Architect, Instructor – Milwaukee Area Technical College – SearchServerVirtualization

A slow application is a challenge for IT operations teams, as the problem generally comes with about as much background information as a warning light on a car dashboard that just says Service.

Application performance tuning covers a wide range of possible options. You might hear calls to fix the code, but often, the issue slowing down an application isn’t as deep as faults in the source code. Similarly, routine server and application maintenance — cleaning up the application install points and removing old logs and temporary files — helps but won’t correct larger issues. Application performance tuning comes down to the environment that surrounds the application and to matching the right IT resources to it.

Not all applications run the same on the same hardware, virtual or physical. It’s a myth that slow applications simply need more resources. In fact, overly liberal resource allocation can damage the overall environment by causing unnecessary contention. For example, a single-threaded application won’t run faster with multiple CPUs, no matter how many it gets, and other workloads in the virtual environment might end up starved for processing power. Overallocation is similar to overeating — not good for your health — and it’s expensive and time-consuming as well. Sometimes, extra expenditure is worth it: choices such as network bandwidth or storage tier become more than a cost question if the application fails to function properly due to resource constraints.

Performance tuning starts with resources

Approach application performance tuning through the idea of proper-sizing resources. Proper-sizing also sets applications up well for a move to the cloud, where the organization pays monthly for the resources the application consumes, rather than in cyclical data center capacity refreshes.

To properly size and scale a deployment, start with the application requirements as a guideline — but don’t treat them as laws. Every application designer publishes optimal settings for the application’s resources. Optimal might cover only the top of the range of acceptable performance, however, and IT teams should ask what the optimal resources are for the expected user base: 100 users or 10,000? Balance the designer’s settings with the available environment, and use this information as a starting point for reliability and performance from day one.

To start application performance tuning, dig into the resource usage profile. Does the application have a highly transaction-driven process or a more intense lookup process? The answer can quickly shift attention from a CPU-driven process to a memory-driven one. In virtual environments, set priority for CPU usage for high transactional loads and memory reservations for intense lookups. Use the same types of settings with both networking and I/O traffic priorities.
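
Here’s a rough, guest-level sketch of that profiling step using the psutil library; the sampling interval and the thresholds for calling a workload CPU-driven or memory-driven are arbitrary illustrations, and in a virtual environment you would also weigh hypervisor-level counters.

    import psutil

    def profile_workload(samples=12, interval=5):
        """Sample CPU and memory utilization and suggest where tuning should focus.

        Thresholds are illustrative; real profiling should run over a full
        business cycle, not a few minutes.
        """
        cpu_readings, mem_readings = [], []
        for _ in range(samples):
            cpu_readings.append(psutil.cpu_percent(interval=interval))
            mem_readings.append(psutil.virtual_memory().percent)

        avg_cpu = sum(cpu_readings) / len(cpu_readings)
        avg_mem = sum(mem_readings) / len(mem_readings)

        if avg_cpu > 80 and avg_mem < 60:
            return "CPU-driven: prioritize CPU shares for this workload"
        if avg_mem > 80 and avg_cpu < 60:
            return "Memory-driven: consider a memory reservation"
        return f"Mixed profile (CPU {avg_cpu:.0f}%, memory {avg_mem:.0f}%)"

    print(profile_workload())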

We often focus on restrictions and limitations in virtualized resources, but setting priority on workloads can be just as critical to the overall operation of the application. The actual delivery of the application from a VM, container or cloud resource must be configured correctly and tuned in each case, so take a narrow, case-by-case approach rather than a blanket one for the whole IT deployment.

In the cloud, the sky’s the limit

Cloud resources are essentially infinite, so resource availability is not a challenge for application performance. However, the budget is limited, so strongly consider where and how your company wants to spend its money. Careless resource management wastes money when it occurs on internal IT systems, but when the same problems move to the cloud, the cost is more apparent. When it’s in black and white on a bill, you tend to notice skyrocketing consumption quickly.

Relationships between the resources are critical. A change in one affects the others, and one change, in isolation, could fail to help performance. For example, moving a storage-heavy application from spinning disk to solid-state drives might not produce an improvement the user notices until additional memory is allocated for caching.

Observe the effects of application performance tuning

Let monitoring tools be your guide. As you make a change in the deployment or how it is managed, follow its effect on application response times and resource utilization. Compare the pre- and post-change metrics to see what impact allocation changes and prioritizations have.
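
If the monitoring tool can export raw response-time samples, the comparison itself is straightforward. Here’s a small sketch, with made-up numbers, that reports the shift in median and 95th-percentile latency after a change.

    import statistics

    def compare_change(before_ms, after_ms):
        """Compare response-time samples (milliseconds) captured before and
        after a tuning change and report the shift in median and p95."""
        def p95(samples):
            ordered = sorted(samples)
            return ordered[int(0.95 * (len(ordered) - 1))]

        report = {
            "median_before": statistics.median(before_ms),
            "median_after": statistics.median(after_ms),
            "p95_before": p95(before_ms),
            "p95_after": p95(after_ms),
        }
        report["p95_change_pct"] = round(
            100 * (report["p95_after"] - report["p95_before"]) / report["p95_before"], 1
        )
        return report

    # Example with made-up samples; a negative p95_change_pct means the change helped.
    print(compare_change([120, 135, 180, 210, 400], [100, 110, 150, 170, 260]))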

Monitoring data also provides insight into what changes to make next. Application performance tuning can move the hotspot or pinch point from one resource type to another.

Ideally, tuning eliminates any possible slow points in application operation — along with the complaints from users. At this point, apply the acquired performance information to the deployment setup. None of the effort to get the application to work correctly will matter if the changes are not reflected back in the initial deployment stage. Correct the base deployment, and avoid making the same mistakes repeatedly.

Adjust deployment best practices and expected ranges carefully. These changes have wide-reaching effects. If you didn’t do enough monitoring and adjustments beforehand, mistakes will propagate across all new deployments.

This was last published in August 2018

Clarifying what cloud and virtualized data storage really mean

How are you using storage virtualization to lower your storage costs?

Cloud storage doesn’t always mean the public cloud. Virtualization and virtualized data storage aren’t always about virtual servers and desktops. Find out what’s really going on.

Logan G. Harbaugh

IT consultant/freelance reviewer – Independent consultant – SearchStorage

Let’s clear up some misconceptions about storage. First, cloud storage isn’t always hosted on a public service, such as AWS and Microsoft Azure. And second, virtualization and virtualized data storage don’t just refer to virtual servers or desktop systems hosted on VMware ESX or Microsoft Hyper-V. These two misconceptions are related, because one true thing about cloud storage is that it is virtualized.

To a certain extent, all storage is virtualized. Even the most basic block-based hardware system — a single hard disk — is mapped by the storage controller attached to it. The controller translates the physical blocks, sectors and tracks on the drive’s platters into a virtual set of blocks, sectors and tracks that the motherboard and host storage controller use to communicate with the disk.

Likewise, file-based storage presents an SMB or NFS volume containing files and metadata, even though the underlying file system might be different from the one the storage system presents. Many file servers use modern on-disk file systems, such as ZFS, and then translate to SMB or NFS on the wire. Others present the same volume over both protocols, so an SMB share can also be accessed as an NFS export and vice versa. This, too, is a type of virtualized data storage.

The truth about virtualized data storage

Storage virtualization refers to storage that isn’t accessed directly, at the physical level, by the storage consumer, which can be a server, server instance, client system or any other system that needs storage. Nearly all storage in the data center and in public and private clouds is virtualized.

Even iSCSI volumes and Fibre Channel LUNs that appear to be block devices and theoretically identical to an internal hard disk can be considered virtualized. They’re generally RAID volumes, which means several physical disks are presented as one or more virtual disks. In addition, software features, such as tiering, snapshots and replication, require a virtualization layer between the physical storage and the consumer. Deduplication, compression and object storage add further layers of virtualized data storage.

Virtualization can be useful. A volume that appears to an application or end user as a single contiguous directory tree may include files hosted on different storage tiers, some on local hard disks and others on low-cost cloud storage tiers. This results in high-performance storage at the lowest possible cost, because virtualized data storage lets files that haven’t been accessed for a while be moved to inexpensive storage.
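
In cloud object storage, that kind of tiering is often expressed as a lifecycle policy. Below is a hedged boto3 sketch that moves objects to a colder storage class after 90 days; note that standard S3 lifecycle rules key off object age rather than last access, so the age threshold stands in for "hasn’t been accessed for a while." The bucket name and prefix are placeholders.

    import boto3

    s3 = boto3.client("s3")

    # Transition objects under the archive/ prefix to Glacier after 90 days.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-file-share",              # placeholder bucket
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "cold-tier-after-90-days",
                    "Filter": {"Prefix": "archive/"},   # placeholder prefix
                    "Status": "Enabled",
                    "Transitions": [
                        {"Days": 90, "StorageClass": "GLACIER"},
                    ],
                }
            ]
        },
    )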

Cloud options

Cloud storage is often assumed to be storage in the public cloud, like Amazon S3, Google Cloud Platform and Microsoft Azure. However, many vendors offer some form of cloud storage, ranging from backup vendors, such as Barracuda and Zetta; to Oracle, Salesforce and other cloud application vendors; to alternatives to the big three, such as DigitalOcean and Rackspace.

Data center cloud products also make storage easily available to applications, whether or not they’re running locally. Dell EMC, Hewlett Packard Enterprise, Hitachi Vantara and NetApp all offer these capabilities. Some of these products are proprietary, some are single-purpose and some are based on open source standards, such as Ceph.

This was last published in July 2018

AIOps use cases address complexity in virtual data centers

How would you use an AIOps platform in your deployment?

The ever-growing complexity of virtual infrastructures is making management a struggle, but an AIOps platform can enable admins to transform raw data into useful information.

Robert Sheldon

Contributor – SearchSQLServer

Artificial intelligence for IT operations, or AIOps, use cases are plentiful, but IT administrators might find the most significant benefits using machine learning technologies to optimize virtual resources.

An organization’s IT infrastructure generates massive amounts of system data, ranging from application latency rates to hardware temperatures. As the infrastructure grows and becomes more complex, so too does the management process, which often results in siloed operations and fragmented information. Even if IT has the capacity to centralize disparate resources, admins might still lack tools comprehensive enough to orchestrate efforts across all the domains.

AIOps can address these issues by correlating and analyzing the wide breadth of system data with machine learning and then determining the appropriate actions. AI refers broadly to the simulation of human intelligence; it uses learning and reasoning to arrive at conclusions based on input data, and it is self-correcting, continuously improving as operators add more data.

Machine learning is a subset of AI that enables a computer to learn from input data, predict outcomes and run operations without explicit programming. At the heart of this technology is a set of intelligent algorithms that performs in-depth statistical analysis against the data. AIOps use cases emerge from the resulting patterns, as the technology automatically adjusts its actions and evolves as new data becomes available.


Figure A. Machine learning builds on algorithms that can discover patterns in input data.

IT teams have begun to incorporate machine learning and other AI technologies into management strategies that address complex infrastructures. This trend uses AI concepts, such as big data analytics, to automatically correlate data across millions of data points, identify issues and provide ways to address those issues in as close to real time as possible.

AIOps platforms offer automation and clarification

AIOps use cases are compelling because the technology can augment or replace a wide range of IT operations tools. AIOps can reduce mundane administrative tasks, but it can also prevent outages, filter out noise from legitimate issues, reduce siloed operations, perform more accurate root cause analysis and generally streamline IT operations.

An AIOps platform uses a variety of strategies to facilitate administrative processes, but it begins with comprehensive data collection from data points across all the domains. The data can include log files, help desk ticketing services and output from monitoring tools. The platform monitors systems across all virtual and nonvirtual environments, using big data technologies to aggregate and organize the data into useful formats.

From there, the AIOps platform analyzes the data with machine learning algorithms and other AI technologies to correlate information, uncover patterns, detect anomalies, determine root causes and identify causal relationships between servers, systems and platforms.
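
The pattern-finding step can be as elaborate as a deep neural network, but the underlying idea is statistical. The toy sketch below flags metric samples that sit several standard deviations away from a rolling baseline; it is a deliberately simplified stand-in for the models an AIOps platform applies at far larger scale.

    import statistics

    def flag_anomalies(series, window=30, threshold=3.0):
        """Flag points more than `threshold` standard deviations from the
        mean of the preceding `window` samples."""
        anomalies = []
        for i in range(window, len(series)):
            baseline = series[i - window:i]
            mean = statistics.mean(baseline)
            stdev = statistics.pstdev(baseline) or 1e-9
            if abs(series[i] - mean) / stdev > threshold:
                anomalies.append((i, series[i]))
        return anomalies

    # Example: a latency series with one obvious spike at the end.
    latencies = [20 + (i % 5) for i in range(60)] + [95]
    print(flag_anomalies(latencies))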

An AIOps platform can then use the results of its analysis to automate and orchestrate operations, which triggers actions based on key points. For example, if an application running in a VM needs more processing power, an AIOps platform can automatically allocate the necessary resources to that application.

A comprehensive AIOps platform can also provide tools for admins to generate reports and visualizations that can help them identify issues, track changes to the environment, make decisions and provide general insight into the IT infrastructure.


Figure B. AIOps tools monitor infrastructure for data patterns that can inform automated operations.

Virtualization optimization tops AIOps use cases

The way virtualization platforms abstract resources can create IT infrastructure management challenges. Although virtualization can be highly beneficial for hardware utilization and workload management, the environment’s hidden nature can also make it more difficult to pinpoint potential issues.

Additionally, virtualization can introduce a proliferation of VMs that creates significantly more data points than physical resources would, which exponentially increases the amount of system data.

At the same time, capacity planning for virtual environments is difficult; admins often overprovision or underprovision resources. Many IT management tools aren’t comprehensive or accurate enough to effectively monitor and manage both virtual and nonvirtual environments.

Addressing these management limitations is a primary AIOps use case, as the technology can improve resource utilization, capacity planning, threat detection, anomaly detection and storage management. AIOps platforms can also correlate data points across all the environments and address the limitations of many management tools, all of which can lead to better resource usage.

For example, ScienceLogic offers an AIOps platform that provides visibility into the entire physical and virtual infrastructure, which makes it possible to track compute resources wherever they are and optimize application performance and availability. Dynatrace offers a similar service that enables admins to better understand their infrastructure, detect dependencies and anomalies, and identify root causes.

SIOS iQ — another AIOps service — is similar, but it’s specific to VMware virtual environments. It can identify resource concerns, such as idle VMs or unnecessary snapshots, while defining, prioritizing and resolving issues.

AIOps is still a relatively young field, especially when it comes to resource management across virtual and nonvirtual environments. Given the ever-growing complexity of IT infrastructures, however, AIOps might be one of the most promising technologies on the horizon. The proliferation of containers, microservices, serverless computing and other technologies that complicate modern environments makes AIOps use cases even more compelling.

This was last published in December 2018

How to realize server virtualization energy savings

What virtualization energy savings have you experienced in your data center?

Virtualizing servers is one of the most effective steps to reduce energy consumption, especially when the hypervisor and VM placement are taken into account.

Robert Sheldon

Contributor – SearchSQLServer

One of the most important benefits of virtualization is better hardware resource utilization; with fewer physical servers, organizations use less energy to deliver workloads and cool the underlying equipment. Careful decision-making can help IT administrators realize even more server virtualization energy savings.

In 2014, researchers demonstrated how virtualization reduces both energy consumption and cooling requirements. In their report, “Implementation of Server Virtualization to Build Energy Efficient Data Centers,” a small group of researchers described the effect of applying virtualization to a data center with 500 servers.

Prior to virtualizing the systems, the servers’ utilization rate averaged around 10%, with each machine consuming about 100 watts of power, for a total of 50,000 watts. The researchers broke workloads into three categories based on application type, and then implemented virtualization to support those workloads.

With this approach, they reduced the number of servers to 96, with each running an average of five VMs. At the same time, they increased the utilization rates to an average of 30% and the energy consumption rates to 275 watts per server. Although this is an increase in the number of watts per server, the reduction of servers to 96 resulted in a total energy consumption of only 26,400 watts, a reduction of 23,600 watts.
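
The arithmetic behind that result is easy to verify from the figures above:

    # Before virtualization: 500 servers at about 100 watts each.
    before_watts = 500 * 100      # 50,000 W

    # After virtualization: 96 hosts at about 275 watts each, ~5 VMs per host.
    after_watts = 96 * 275        # 26,400 W

    savings = before_watts - after_watts
    print(f"Total savings: {savings} W ({100 * savings / before_watts:.0f}%)")
    # Total savings: 23600 W (47%)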

The results of this study aren’t an anomaly. Other research provides similar results. One of the benefits of virtualization is that it leads to better resource utilization, which means fewer hardware resources, resulting in lower energy consumption for both running computers and keeping them cool.

The hypervisor difference

Research into server virtualization energy savings has also extended into more specific areas. For example, experiments suggest that the hypervisor that virtualizes the workloads also plays a role in energy consumption. This doesn’t necessarily mean one hypervisor is always better than another. In fact, several factors contribute to a hypervisor’s energy consumption, such as the type of workload, the servers on which the hypervisors are installed and the hypervisor itself.

In their 2017 report “Energy efficiency comparison of hypervisors,” researchers describe the extensive experiments they performed on four leading hypervisors: KVM, VMware ESXi, Citrix XenServer and Microsoft Hyper-V. For each hypervisor, they ran four workloads (very light, light, fair and very heavy) on four different server platforms (HP DL380 G6, Intel S2600GZ, Lenovo RD450 and APM X-C1).

Each experiment was specific to a hypervisor, workload and server platform. For every experiment, the researchers determined the real-time power consumption, the operation’s duration and the total energy consumed to run that operation.

The researchers concluded that hypervisors exhibit different power and energy consumption when running the same workload on the same server. What’s more important is that no one hypervisor proved to be more energy efficient than the other hypervisors across all workloads and platforms. In other words, no single hypervisor is always the most energy efficient or least energy efficient.

The researchers also concluded that lower power consumption doesn’t always translate to server virtualization energy savings. Under certain workloads, a hypervisor might consume less power than other hypervisors, but if the workload takes longer to run, the total energy consumption could be much higher.

Power consumption, completion time and energy consumption depend on the specific workload and platform. When choosing a hypervisor, IT teams should keep in mind the type of workloads they run and the platforms on which those workloads run. If possible, they should test potential hypervisors based on their anticipated workloads.

The VM difference

VM placement and its effect on energy consumption have also received a fair amount of attention, with much of the focus on how algorithms help server virtualization energy savings. The idea here is that intelligent software is used to strategically control where VMs are located across the available cluster to minimize energy consumption without impacting performance.

For example, in the 2016 report “Energy-efficient virtual machine placement using enhanced firefly algorithm,” researchers proposed two ways to modify the firefly algorithm to address VM placement issues in a cloud-based data center. The firefly algorithm, used to address optimization problems, is a meta-heuristic algorithm based on firefly behavior. The researchers compared their modified algorithms to algorithms used to map VMs to physical machines. They found that they could reduce energy consumption by as much as 12% using their own algorithms.

In 2017, another group of researchers published the report “Energy-Efficient Many-Objective Virtual Machine Placement Optimization in a Cloud Computing Environment,” which also proposed an algorithm for server virtualization energy savings through intelligent VM placement. Their algorithm is based on the knee point-driven evolutionary algorithm, a high-performance procedure for addressing many-objective problems. When the researchers compared their algorithm to several others, they saw a reduction in energy consumption between 1% and 28%.

In February 2018, a group of researchers published the report “An Energy Efficient Ant Colony System for Virtual Machine Placement in Cloud Computing.” In this case, the proposed algorithm is based on the ant colony system (ACS) algorithm, another meta-heuristic optimization procedure inspired by insects. When modifying the ACS algorithm, they also incorporated order exchange and migration local search techniques. The combination resulted in a 6% decrease in power consumption, compared to several existing algorithms.

Although much of the research on VM placement is focused on cloud computing, there’s also been a fair amount on VM placement in the more traditional data center. In either case, the important point is that an intelligent approach to VM placement results in server virtualization energy savings.

This was last published in November 2018

What VM automation tools are available?

What automation tools does your company use and why?

There are many different VM automation tools available — some of them part of much wider product and feature suites. Determine which features you need and find the tool that works.

Stephen J. Bigelow

Senior Technology Editor – TechTarget – SearchWindowsServer

Admins should examine the types of VM automation tools that are available and which product suites they are a part of to decide which tools they want to use.

There are myriad VM automation tools you can use to automate the creation and management of VMs in a modern data center. Far too many exist to provide a concise list, but these examples illustrate the available options.

IT rarely implements automation as a discrete automation platform; more often, automation arrives as a subset of features in broader virtualization or systems management tools.

Leading virtualization platforms typically provide native VM automation tools. VMware vSphere can provide automation scripting capabilities through interfaces such as VMware vSphere PowerCLI and VMware vSphere CLI, as well as the Windows PowerShell command-line interface. For example, you can use PowerCLI scripts to generate daily reports on the state of the virtual environment, identify users that created suspect VMs and summarize vSwitch ports.
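
PowerCLI is the native route for that kind of reporting. As a rough illustration of the same idea from Python, the sketch below uses the pyVmomi SDK, assuming it is installed and can reach a vCenter Server; the host name and credentials are placeholders.

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    # Placeholder connection details; certificate verification is disabled
    # only to keep the example short.
    context = ssl._create_unverified_context()
    si = SmartConnect(host="vcenter.example.com", user="reporting",
                      pwd="secret", sslContext=context)

    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.VirtualMachine], True)

        # A minimal daily report: VM name, power state and guest OS.
        for vm in view.view:
            summary = vm.summary
            print(summary.config.name, summary.runtime.powerState,
                  summary.config.guestFullName)
    finally:
        Disconnect(si)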

VMware’s vRealize Automation is an extensible tool that can help manage resource lifecycles, optimization and reclamation.

Explore third-party VM automation tools

There are also numerous popular third-party systems and configuration management tools that provide comprehensive automation capabilities.

Puppet Enterprise automates the provisioning, configuration, enforcement and management of software and servers. Red Hat Ansible specializes in complex workflows and streamlines complex tasks and deployments. Chef defines infrastructure as code, which enables you to automatically create and enforce complex infrastructure policies to reduce configuration drift. Salt Open automates data center infrastructure and workload deployments.

You can find automation capabilities in many other tools, as well, such as Jenkins, Terraform, Juju, Vagrant and Docker. You can implement automation for traditional, fixed data centers with established configurations, as well as for more dynamic environments, such as DevOps software projects where infrastructures and resources must follow busy continuous integration/continuous delivery workflow pipelines.

Automation is an essential complement in data center virtualization. Although not every VM requires automation in every circumstance, the benefits of the speed and consistency brought by automation can be compelling, and there are many VM automation tools readily available to implement automation and orchestration in virtualized environments. Plan and test automation deployments carefully to avoid some common automation mistakes and limitations.

This was last published in October 2018

When a NoOps implementation is — and when it isn’t — the right choice

How does your organization approach automation and NoOps initiatives?

NoOps skills and tools are highly useful regardless of the IT environment, but site reliability engineering brings operations admins into development when an organization can’t afford to lose them.

Emily Mell

Assistant Site Editor – SearchITOperations

For some organizations, NoOps is a no-go; for others, it’s the only way to go.

Some organizations envision that the NoOps promise of peak infrastructure and application automation and abstraction eliminates the need for operations personnel to manage the IT environment.

But not every IT environment is cut out for a NoOps implementation. Site reliability engineering (SRE) is a useful middle ground for organizations that aren’t ready or equipped for NoOps.

Will NoOps replace DevOps?

The truth is that NoOps is a scenario reserved almost exclusively for startup organizations that start out on A-to-Z IT automation tools and software. In cloud deployments, with no hardware or data centers to manage, developers can code or automate operations tasks — and operations administrators might struggle to find a seat at the table.

Instead, organizations that start from day one with automated provisioning, deployment, management and monitoring should hire an operations specialist or an IT infrastructure architect to help the development team set up environments and pipelines, said Maarten Moen, consultant at 25Friday, a consulting agency based in Hilversum, North Holland, which helps startups set up NoOps environments.

"Development or technology engineers move more to a full-stack engineering background. They can do the front and back end, but also the infrastructure and the cloud structure behind it," Moen said.

DevOps is still too siloed and interdependent for the speed of modern business activity, Moen said. This assessment holds weight for the startups 25Friday advises: there is no reason for a five-person company to split into distinct development and operations teams.

Instead, Moen suggests organizations instate a center of excellence, with one to three senior-level operations consultants or advisors to help development teams set up — but not implement — the infrastructure and share best practices. Development teams implement the infrastructure so that they’re familiar with how it works and how to maintain it.

Legacy applications are incompatible with NoOps

A NoOps changeover won’t work for organizations with established legacy applications — which, in this case, means any app five years old or more — and sizeable development and operations teams. The tools, hardware and software necessary to maintain these apps require operational management. Odds are high that a legacy application would need to be completely rebuilt to accommodate NoOps, which is neither cost-effective nor reasonably feasible.

Moreover, most organizations don’t have a loosely coupled system that facilitates a NoOps structure, said Gary Gruver, consultant and author of several DevOps books. Small independent teams can only do so much, and building a wall between them doesn’t make sense. In the end, someone must be accountable to ensure that the application and infrastructure functions in production, both on premises and in cloud environments, he said.

NoOps vs. SRE

When a larger organization adopts infrastructure-as-code tools, senior operations staff often act as advisors, and they then shift into engineering roles.

But downsizing isn’t always the answer. The SRE role, which emphasizes automation to reduce human error and ensure speed, accuracy and reliability, has supplanted the operations administrator role in many IT organizations because it places ops much closer to, or even within, the development team.

“It’s operations work, and it has been, whether we’re analysts, sys admins, IT operations, SREs, DevOps — it doesn’t matter,” said Jennifer Davis, principal SRE at RealSelf Inc., a cosmetic surgery review company based in Seattle, Wash., and an O’Reilly Media author, in a talk at the Velocity conference last month in New York.

Operations work can be done by anyone, but organizations differ on how they handle that fact. Does an organization eliminate the operations team altogether or simply reorganize them into SREs and retain their operational knowledge and experience?

At RealSelf, Davis’ SRE team educates and mentors developers in all aspects of operations via positions in sprint teams.

“Some of the critical skills that we have as operations folks [are] filling in gaps with relevant information, identifying information that is nonessential, and prioritizing a wide array of work,” she said.*

Is NoOps feasible?

NoOps implementations by 25Friday have been successful, Moen said. RealSelf’s Davis, however, argues that NoOps is, in general, not feasible for non-startups.

“Basically, NoOps is the same thing as no pilots or no doctors,” Davis said. “We need to have pathways to use the systems and software that we create. Those systems and software are created by humans — who are invaluable — but they will make mistakes. We need people to be responsible for gauging what’s happening.”

Human fallibility has driven the move to scripting and automation in IT organizations for decades. Companies should strive to have as little human error as possible, but also recognize that humans are still vital for success.

Comprehensive integration of AI into IT operations tools is still several years away, and even then, AI will rely on human interaction to operate with the precision expected. Davis likens the situation to the ongoing drive for autonomous cars: They only work if you eliminate all the other drivers on the road.

Next steps for operations careers

As IT organizations adopt modern approaches to deployment and management, such as automated tools and applications that live on cloud services, the operations team will undoubtedly shrink. Displaced operations professionals must then retrain in other areas of IT.

Some move into various development roles, but quality assurance and manual testing are popular career shifts in Europe, especially for old-fashioned professionals whose jobs have been automated out of their hands, Moen said. SRE is another path that requires additional training, but not a complete divergence from one’s existing job description.

Admins should brush up on scripting skills, as well as the intricacies of distributed platforms to be ready to both hand over any newer self-service infrastructure to developers and train them to use it properly. They should study chaos engineering to help developers create more resilient applications and team up with both the ops and dev teams to create a best practices guideline to manage technical debt in the organization. This becomes paramount as infrastructure layouts grow more complex.

NoOps and SRE are both possible future directions on IT organizations’ radars. NoOps mini-environments can live within a fuller DevOps environment, which can be managed by site reliability engineers. Ultimately, the drive for automation must reign, but not with the extermination of an entire profession that ensures its continued success.

Site editor David Carty contributed to this article.

*Information changed after publication.

This was last published in November 2018

AIOps tools supplement — not supplant — DevOps pipelines

How does your DevOps team apply AIOps to the data your CI/CD toolchain churns out?

While the line between DevOps and AIOps often seems blurred, the two disciplines aren’t synonymous. Instead, they present some key differences in terms of required skill sets and tools.

Will Kelly

DevOpsAgenda

Artificial intelligence for IT operations, or AIOps, applies AI to data-intensive and repetitive tasks across the continuous integration and continuous delivery (CI/CD) toolchain.

DevOps professionals cite monitoring, task automation and CI/CD pipelines as prime areas for AIOps tools, but there’s still a lack of clarity around when and how broadly teams should apply AI practices.

Where AI meets IT ops

The terms AIOps and DevOps are both common in product marketing, but they’re not always used accurately. DevOps is driving a cultural shift in how organizations are structured, said Andreas Grabner, DevOps activist at Dynatrace, an application performance management company based in Waltham, Mass.

AIOps tools enable an IT organization’s traditional development, test and operations teams to evolve into internal service providers to meet the current and future digital requirements of their customers — the organization’s employees.

AIOps platforms can also help enterprises monitor data across hybrid architectures that span legacy and cloud platforms, Grabner said. These complex IT environments demand new tools and technologies, which both require and generate more data. Organizations need a new approach to capture and manage that data throughout the toolchain — which, in turn, drives the need for AIOps tools and platforms.

AIOps can also be perceived as a layer that runs on top of DevOps tools and processes, said Darren Chait, COO and co-founder of Hugo, a provider of team collaboration tools based in San Francisco. Organizations that want to streamline data-intensive, manual and repetitive tasks — such as ticketing — are good candidates for an AIOps platform proof-of-concept project.

In addition, AIOps tools offer more sophisticated monitoring capabilities than other software in the modern DevOps stack. AIOps tools, for example, monitor any changes in data that might have a significant effect on the business, such as those related to performance and infrastructure configuration drift. That said, AIOps tools might be unnecessary for simple monitoring requirements that are linear and straightforward.

The line between AIOps and DevOps

DevOps and AIOps tools are both useful in CI/CD pipelines and for production operations tasks, such as monitoring, systems diagnosis and incident remediation. But while there is some overlap between them, each of these tool sets has unique requirements for effective implementation. For example, AIOps automates machine learning model training as part of completing a software build, so AIOps tools must adapt to machine-learning-specific workflows that handle recursion to support continuous model training.

The AIOps automation design approach is fundamentally different because of the repetitive machine learning training process: it’s recursive and conditional in nature, and largely dependent on the accuracy rating of the procured data. The design approach also depends on selective data-extraction algorithms.

In terms of tool sets, DevOps engineers see Jenkins, CircleCI, Travis, Spinnaker and Jenkins X as CI/CD industry standards, but they aren’t AIOps-ready like tools such as Argo — at least not yet.

So, while AIOps augments DevOps with machine learning technology, AIOps isn’t the new DevOps — and ops teams should ignore the hype that tells them otherwise.

This was last published in January 2019

As-a-service at your service: Cloud tools simplify DevOps automation

Which cloud services does your organization use for DevOps automation?

AWS offers the widest range of cloud-integrated DevOps tools. But Google Cloud has a strategy with more focus on integration, and Azure closely ties in with other Microsoft products.

Kurt Marko

Consultant – MarkoInsights – SearchMicroservices

DevOps has as many definitions as it has practitioners. But what started as a cultural and philosophical movement has evolved into an approach to software development and delivery.

Organizations can get components of the DevOps toolchain from a cloud service provider and pay only for what they use, rather than a la carte for capabilities, user accounts or system licenses. That makes as-a-service DevOps tools a compelling option — particularly in light of the variety of services and full-spectrum toolchain options available.

Cloud vendors bridge the worlds of development and IT operations through streamlined, automated and controlled IT management. While these services overlap with some stand-alone DevOps automation tools, they also include infrastructure services, such as workload balancing, autoscaling, deployment templates and configuration management.

DevOps automation is still relatively immature, but it is also dynamic enough that tool selection is a significant hurdle. Organizations also face the decision to build or buy a toolchain. The DevOps tools market is governed by few de facto standards — an organization’s best set of tools depends upon both its software development processes and the tools and languages with which developers and ops professionals are already familiar.

Examine the respective offerings for DevOps automation from AWS, Microsoft Azure and Google Cloud, briefly enumerated below, to find the right fit with a cloud-native application or one that’s migrating to public cloud to take advantage of scale and specialized services.

AWS DevOps services

AWS offers a suite of DevOps automation services that span the application lifecycle from development and test to deployment and operations.

Version control. CodeCommit is a managed Git source code repository that works with existing Git tools and integration tools, such as AWS CodeBuild — described below — and the Jenkins continuous integration (CI) server.

Code build and test. CodeBuild automates code compilation, module integration and test script execution and is compatible with the Java, Ruby and Python programming languages and Node.js, as well as .NET and Docker container environments.

Software release automation and delivery. CodePipeline automates the various stages of an application release process from beta testing environments to production. Use it for advanced deployment techniques, such as phased and blue/green deployment, to support continuous delivery (CD).

Fully packaged CI/CD pipeline. CodeStar provides an entire CI/CD pipeline and software development environment that automates the setup, configuration and instantiation of other services, such as CodeCommit, CodeBuild, CodePipeline and CodeDeploy, along with infrastructure deployment using the cloud provider’s EC2 IaaS templates, Elastic Beanstalk web app deployment orchestration or serverless Lambda.

Infrastructure deployment automation and infrastructure as code. AWS offers a variety of services for these operations-side tasks, including CodeDeploy, CloudFormation, OpsWorks and Elastic Beanstalk. CodeDeploy automates application deployment and configuration on EC2 instances, while CloudFormation provides a configuration language to describe and automate the creation of EC2 instances and other AWS resources, such as Relational Database Service. OpsWorks is a configuration management service that supports Chef or Puppet to automate EC2 configuration and deployment. Elastic Beanstalk is a PaaS for web applications that handles deployment, instance scaling and load balancing.
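
As a hedged sketch of the infrastructure-as-code idea, the snippet below hands CloudFormation a deliberately tiny template, just an S3 bucket, through boto3; the stack and bucket names are placeholders, and a real template would describe the full application environment.

    import json
    import boto3

    cloudformation = boto3.client("cloudformation")

    # A deliberately tiny template: real stacks would declare EC2 instances,
    # security groups, databases and so on in the same declarative style.
    template = {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "ArtifactBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {"BucketName": "example-build-artifacts"},  # placeholder
            }
        },
    }

    cloudformation.create_stack(
        StackName="example-pipeline-infrastructure",   # placeholder
        TemplateBody=json.dumps(template),
    )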

Configuration management. Systems Manager provides automation scripts for basic management across a fleet of EC2 instances. Use it for tasks such as command execution, system inventory, software patching and version control of system configurations. Config is a resource monitoring and policy auditing service that both monitors and records configuration changes to a fleet of EC2 instances and audits their state against a set of predefined rules.
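
Here’s a hedged boto3 sketch of Systems Manager’s command execution, running a patch command on every instance that carries a hypothetical Environment=production tag; the tag, command and document choice are illustrative.

    import boto3

    ssm = boto3.client("ssm")

    # Run a command across a fleet selected by tag rather than by instance ID.
    # AWS-RunShellScript is a standard SSM document for Linux instances.
    response = ssm.send_command(
        Targets=[{"Key": "tag:Environment", "Values": ["production"]}],  # placeholder tag
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": ["yum -y update --security"]},
        Comment="Apply security patches across the production fleet",
    )

    print(response["Command"]["CommandId"])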

Microsoft Azure tools

Azure segments its offering into DevOps automation services focused on the development and integration portion of the pipeline and tools for operational management and governance. Together, these provide similar capabilities to the suite available on AWS.

Development project planning and reporting. Azure Boards tracks ideas throughout the planning stages and includes built-in Scrum boards to enable sprints and standups. It also offers data analytics.

Source code management and version control. Azure Repos offers support for any Git client and enables custom and prebuilt webhooks and APIs.

CI/CD pipeline. Azure Pipelines automates application builds and deployments with support for Node.js, Python, Java, PHP, C/C++ and .NET, among others. Azure Pipelines also supports containerization and deployment to other cloud services.

Code test automation. Azure Test Plans executes a variety of performance tests for local and cloud applications and provides a breadth of data analytics for quality assurance.

Configuration management. Azure Automation is a way to control, integrate and maintain systems and applications across pure Azure cloud and hybrid environments.

Infrastructure deployment automation and infrastructure as code. Azure Quickstart Templates, Automation and PowerShell Desired State Configuration (DSC) combine to automate infrastructure deployment and manage code configuration in ways that ensure consistency and reduce human error. Azure Quickstart Templates are declarative documents that deploy resources through Azure Resource Manager. Automation turns manual updates with Windows and Linux systems into automated processes. And PowerShell DSC runs configurations as code through PowerShell.

Resource policy monitoring and auditing. Azure Policy enables the user to create, assign and manage various policies. It enforces rules for resource management and tracks compliance for various legal or corporate standards, as well as service-level agreements.

Design of complex system environments. Azure Blueprints — in preview at time of publication — automates the creation and maintenance of IT environments with limits on access that enable production governance.

Managed PaaS. Azure App Service offers scalability, security and flexibility to applications on web, mobile or cloud platforms, as well as APIs.

Microsoft bundles many of these services together into a suite of DevOps automation tools in the form of Visual Studio Team Services, which is available as Azure DevOps Services.

Google Cloud tools

Google Cloud tools fall into most categories identified above but also rely upon integrations to third-party offerings for several functions.

Version control. Google Cloud Source Repositories is a private Git repository that enables users to collaborate and scale their code in as many repositories as necessary without add-on charges.

Code build and test. Cloud Build enables application development without language or host-destination restrictions. Cloud Build also automates deployments with built-in integration to the vendor’s services, such as Kubernetes Engine, App Engine, Cloud Functions for serverless and Firebase managed app builds.

Release automation and CI/CD. Google Cloud offers integrations for many third-party CI/CD tools, including CircleCI, Jenkins and Spinnaker.

Infrastructure automation. Deployment Manager takes a declarative approach to creating and managing Google Cloud resources, including load-balancing and autoscaling configurations.

Configuration management. Google doesn’t have a configuration management service specifically for its cloud platform but rather integrates with popular third-party tools, such as Chef, Puppet and Salt.

Policy monitoring and management. Stackdriver Logging, which includes features such as Audit Logging, stores and monitors log data from both Google Cloud and AWS, as well as other sources.

PaaS. Like AWS and Azure, Google Cloud offers a PaaS option for DevOps teams. App Engine is designed to support many development languages and platforms to build scalable apps without direct infrastructure management.


Cloud providers offer several infrastructure services that IT operations and DevOps professionals should evaluate for their innate automation capabilities.

AWS offers the most comprehensive set of cloud-native DevOps automation services that combine to support diverse work styles and workflows. Azure matches AWS in most categories and offers tight integration with existing Microsoft Visual Studio tools, which makes it better for organizations that have standardized on Microsoft’s development tools. Google lags behind its major competitors in DevOps depth but integrates well with popular third-party DevOps automation tools, enabling organizations with existing CI/CD and configuration management pipelines to extend them to Google Cloud environments easily.

This was last published in November 2018

What roles do vCPE and uCPE have at the network edge?

Why might your organization avoid implementing universal or virtual CPE?

As service providers look to virtualize the edge and deliver network services faster, they’re turning to vCPE and uCPE that run services as software on generic hardware.

John Burke

CIO and Principal Research Analyst – Nemertes Research – SearchEnterpriseWAN

While software-defined networking is only just starting to gain significant traction within enterprise networks, it has transformed network service providers. They are bringing SDN technology to the enterprise edge now, in the form of virtualized customer premises equipment, or vCPE.

In the past, a provider delivering a set of network services would use specialized hardware to deliver the services. This hardware was most often a branch router, and it sometimes included one or more separate firewalls, a distributed denial-of-service defense device or WAN optimizer — all in the branch.

Now, the goal is to have generic hardware at the branch and have it run some or all of the software needed to provide all of the desired services.

Shifting the burden from hardware development to software development brings providers unprecedented flexibility and agility in addressing market needs. It also simplifies deployments if a single box — a so-called universal CPE, or uCPE — can run the branch end of any service needed.

VNFs: Making services more modular

The traditional software-only delivery of a network function has focused on virtual appliances, which tend to fully replicate the functions of a hardware appliance in a single unit. The network functions virtualization approach separates the appliance into smaller function parcels — virtual network functions (VNFs) that cooperate to deliver network functionality.

A service provider can dynamically push VNFs down to the CPE platform to run on the customer premises, run the VNFs in server resources on the provider side of the edge or even run them in their own core — wherever makes the most sense for that service and that customer. Firewall functionality, for example, can often be best delivered on the provider side of a link — why deliver a packet that will be thrown away as soon as it hits the CPE? But compression services are best delivered on the customer side to maximize their effect.

Changing the WAN

Virtualization, uCPE and vCPE are creating new opportunities for both providers and customers to adopt new services, try new platforms and transform their IT infrastructures. Enterprises are keenly interested in software-defined WAN right now, and many providers use a vCPE model to deliver SD-WAN.

Some providers adopt a fully edge-hosted model, in which a uCPE box hosts a complete SD-WAN package — one that could otherwise run on dedicated hardware. Others deploy a hybrid edge or cloud model, where the SD-WAN depends, to some extent, on services delivered from the provider’s cloud. Still others use a fully cloud-hosted model, like network-as-a-service providers delivering SD-WAN as part of a feature set.

Whichever model a service provider uses, the number and breadth of vCPE deployments are exploding in the wake of providers’ internal SDN transitions and with the strength of interest in SD-WAN.

This was last published in January 2019