Technical Details#
SARB Security Approval#
Admin Support Model#
Tower runs in Azure (AKS), and we have very few UMN dependencies:
- IAM
  - Auth is done against Azure, and groups are synced from Grouper
- Cloud Enablement Team
  - Manages the subscription, network IP blocks, etc.
- NTS
  - Maintains the VPN connection back to on-prem and the DNS records
- Hosting Engineering and Automation Teams (HEAT)
  - Manage the virtual machines (VMs) that customers run playbooks against, and the firewalls on those VMs
Application Lifecycle#
We maintain a test instance of AWX and keep a maintenance schedule. On the first Monday of the month we run through the updates in Test. On the second Monday we run those updates in Production. Updates include K8s versions, AWX Operators, Ingress, and Prometheus, depending on which have new versions available. Each Production update gets a TDX ticket for tracking.
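The schedule above can be computed rather than looked up. The following is a minimal sketch using only the Python standard library; the function names `nth_monday` and `maintenance_windows` are made up for illustration and are not part of any existing tooling.

```python
import calendar
from datetime import date


def nth_monday(year: int, month: int, n: int) -> date:
    """Return the n-th Monday (1-based) of the given month."""
    # Build weeks that start on Monday; days outside the month are 0.
    weeks = calendar.Calendar(firstweekday=calendar.MONDAY).monthdayscalendar(year, month)
    mondays = [week[0] for week in weeks if week[0] != 0]
    return date(year, month, mondays[n - 1])


def maintenance_windows(year: int, month: int) -> dict[str, date]:
    """Map each environment to its update day for the given month."""
    return {
        "test": nth_monday(year, month, 1),        # first Monday: Test
        "production": nth_monday(year, month, 2),  # second Monday: Production
    }


if __name__ == "__main__":
    for env, day in maintenance_windows(2024, 9).items():
        print(f"{env}: {day.isoformat()}")
```

Because Production always follows Test by exactly one week, a problem found in Test leaves seven days to resolve it before the Production window.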
Credential / Secrets Management#
Secrets Management Guidelines and Configuration
Azure Key Vaults have access control for Identity and Firewalls.
Data Handling#
Authentication is handled via Azure OAuth, and AWX does not need to store secrets, keys, or similar material. Each time a playbook is run, a new container is spun up that gets the code from GitHub and pulls secrets from Azure Key Vault (or any other vault a customer has configured). The credentials added to retrieve other secrets are encrypted by AWX and are write-only: once entered, they cannot be retrieved.
System Dependencies, Logical Separation / Segmentation, Network and Remote Access#
AWX is deployed as an Operator in Kubernetes and requires persistent storage. It consists of a Controller, a number of CRDs (schema extensions to k8s objects), and individual instances. The Controller spins up individual instances of AWX, of which there may be as many as needed. Each instance consists of a number of AWX components (web/worker/api), a Redis cache, and an external Azure PostgreSQL database. Each instance also gets its own storage and FQDN, but all instances share backup storage. We currently have one live production instance that is shared by all customers.
The architecture has been broken out into two different diagrams for simplicity. One covers the management of the cluster itself, and the other covers access by users. You may want to open the image in a new tab or download it!
Admin Diagram#
There are two main traffic flows. One is the administration of the cluster. All communication with the cluster flows through the Kubernetes API (the core of the control plane), which is provided as a service in Azure and exists separately from the worker nodes. The API sits behind a rudimentary firewall/IP whitelist, limited to UofM IPs. The other flow is via the WebUI of AWX, which is covered in more detail in the next section. That pathway is also the one AWX uses to communicate with VMs on-prem. The cluster vnet is paired with the cloud hosting team's hub network, which maintains the VPN connection back to on-prem. The firewalls are set by the Hosting Engineering and Automation Teams (HEAT) at the Managed Linux platform level to allow traffic from the cluster.
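The IP whitelist in front of the Kubernetes API amounts to a CIDR membership check. The sketch below illustrates that logic with Python's standard `ipaddress` module. The ranges are placeholders taken from the RFC 5737 documentation blocks, not actual UofM address space, and `is_allowed` is a hypothetical helper; the real restriction is enforced by Azure, not by code like this.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks). These are NOT the
# real UofM ranges; substitute the ranges configured on the cluster.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_allowed(client_ip: str) -> bool:
    """Return True if client_ip falls inside any allowed network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in network for network in ALLOWED_NETWORKS)


if __name__ == "__main__":
    for ip in ("192.0.2.17", "203.0.113.5"):
        print(ip, "allowed" if is_allowed(ip) else "denied")
```

A check of this kind is useful when debugging a refused connection: if the client address is outside every listed range, the request never reaches the API server.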
Users Diagram#
The only access users have is to the WebUI. The only credentials/secrets AWX should know are the ones needed to access a key vault. When an Ansible playbook is launched, AWX spins up a "worker" pod that fetches any secrets it needs from the key vault, runs the playbook, logs everything to Splunk, and is then deleted along with any secrets it pulled.
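The worker lifecycle described above (fetch secrets, run, log, discard) can be modelled as a scoped resource. This is only a conceptual sketch in plain Python: `InMemoryVault`, `ephemeral_worker`, and `run_job` are hypothetical names, the vault is an in-memory stand-in rather than an Azure Key Vault client, and a list stands in for Splunk. In the real system the cleanup is achieved by deleting the pod, not by application code.

```python
from contextlib import contextmanager


class InMemoryVault:
    """Hypothetical stand-in for a key vault client."""

    def __init__(self, secrets):
        self._secrets = dict(secrets)

    def get_secret(self, name):
        return self._secrets[name]


@contextmanager
def ephemeral_worker(vault, secret_names):
    """Fetch only the requested secrets and discard them on exit."""
    secrets = {name: vault.get_secret(name) for name in secret_names}
    try:
        yield secrets
    finally:
        # Mirrors pod deletion: nothing fetched outlives the job.
        secrets.clear()


def run_job(vault, secret_names, playbook, log):
    """Run a playbook callable inside a short-lived worker and log it."""
    with ephemeral_worker(vault, secret_names) as secrets:
        log.append("job started")
        result = playbook(secrets)
        log.append(f"job finished: {result}")
    return result
```

The `finally` clause is the important part of the design: the secrets are dropped even if the playbook raises, just as the pod is removed whether the job succeeds or fails.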
Encryption#
Security in Azure Database for PostgreSQL
Identity and Access Management#
AWX authenticates using Azure AD, which is supported by IAM. Group membership is managed manually and controls what content users can see.
Each instance of AWX has a single admin account whose logins are monitored.
Integrations#
AWX can be integrated with source code repositories using SSH keys. Key vaults can be integrated using credentials from a service principal.
Lifecycle Management#
We have a regular monthly update schedule. The end of AWX will go one of two ways:
1. We purchase Tower
2. TBD
Logging and Monitoring#
Splunk alerts have been set up for admin logins and for periods in which no logs are received. These alerts open a TDX ticket with the team, to be handled the same day.
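The "no logs received" alert is essentially a staleness check on the most recent event per source. The sketch below shows that logic in plain Python. The source names and the one-hour threshold are assumptions made for illustration (the actual period is not stated here), and the real alert is presumably defined as a search in Splunk rather than in code.

```python
from datetime import datetime, timedelta


def silent_sources(last_seen, now, max_gap):
    """Return the sources whose newest event is older than max_gap."""
    return sorted(
        source
        for source, timestamp in last_seen.items()
        if now - timestamp > max_gap
    )


if __name__ == "__main__":
    now = datetime(2024, 9, 2, 12, 0)
    # Hypothetical source names and timestamps.
    last_seen = {
        "awx-web": datetime(2024, 9, 2, 11, 55),
        "awx-task": datetime(2024, 9, 2, 9, 30),
    }
    print(silent_sources(last_seen, now, timedelta(hours=1)))
```

Alerting on silence complements alerting on events: a broken log pipeline would otherwise look identical to a quiet system.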
Non-critical alerts are sent to a notification channel in Slack (e.g. alerts are sent out when backups and other management jobs are run).
Tower is configured to send aggregated logs to Splunk.
Kubernetes, which is used to manage the Tower container architecture, runs the Banzai Logging Operator, which aggregates infrastructure logs and forwards them to Splunk.
Gap Analysis#
https://docs.google.com/spreadsheets/u/0/d/1iZyaNK4rzMx2jOBmf4wZ9sjbsZB6hXNlruI0zpijsNI/view