Note
This is a practical collection of instructions, troubleshooting tips, and playbooks for operating and maintaining the Alert Distribution System. An overview of the system is provided in DMTN-210 [3], which is essential background reading for this document.
1 Basic Tools¶
1.1 ArgoCD¶
Argo CD [1] is a tool for deploying software onto a Kubernetes cluster. Most changes to the Alert Distribution System are done through Argo.
There’s some official SQuARE-maintained documentation for Argo. This section tries to summarize what you’d need to know to work with the Alert Distribution System, but that thorough documentation is worth reading too.
1.1.1 Accessing Argo¶
The integration deployment uses the Argo installation at https://data-int.lsst.cloud/argo-cd. Access to that installation is managed by the SQuARE team.
When you go to the Argo UI for the first time, you’ll see a big mess of many “applications.” The primary one is named alert-stream-broker, but you may also be interested in the “strimzi” and “strimzi-registry-operator” ones.
1.1.2 Applying Changes by Syncing¶
Once you’re in an application, the main action you take is syncing. When you “sync” an application, you synchronize its state in Kubernetes with a “desired state” which is derived from Git.
You do this by clicking the big “SYNC” button at the top of the Argo UI. This brings up a daunting list of all the resources that will be synchronized. Usually you don’t want to make changes to the options here, although you might want to enable “prune.”
When “prune” is enabled, Argo will delete any orphaned resources that no longer seem to be desired. Without this option, they will linger around. They probably won’t cause harm, but this can be confusing.
Additionally, some resources may not update properly if they depend on updates in other applications. These may require you to delete the specific resource and then re-deploy it.
The recommended deployment order for easy troubleshooting is:
- Deploy the Kafka cluster
- Deploy the Strimzi registry operator
- Deploy the schema registry
- Deploy the ingress for the schema registry
- Deploy other services
The Strimzi registry operator and schema registry can be a bit of a chicken-and-egg problem: after the schema ingress and schema registry are deployed, you may have to re-deploy the operator, and then re-deploy the registry, for the two to reconcile with each other.
If, during deployment, any resource begins to error continuously, you can delete that specific resource while troubleshooting. This prevents the status channel from being continuously spammed with errors. It is recommended to grab the error log from the Logs tab before deleting the resource.
If for some reason the instance of alert-stream-broker has been removed from the active applications, you can re-deploy it by going to the science-platform application and re-syncing alert-stream-broker.
Troubleshooting: If all of the brokers have failed but everything else is running, the brokers may be out of storage. This means that Kafka needs either more storage allotted or the retention limits adjusted. This requires a restart of the brokers, and may require a full re-deployment of the whole system.
If you try to restart the brokers from a failed state (whether they have run out of storage or not) and they end up in crash loops, make sure to delete their persistent volume claims to ensure that they can rebuild.
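If you have kubectl access (see 2.1 Getting kubectl Access), deleting the claims looks roughly like this. The PVC names below are taken from the example output in 3.2 Checking disk usage and may differ in your deployment:
# only delete broker PVCs when you intend the brokers to rebuild their data
kubectl delete pvc \
  data-0-alert-broker-kafka-0 \
  data-0-alert-broker-kafka-1 \
  data-0-alert-broker-kafka-2 \
  --namespace alert-stream-broker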
1.1.3 What is “Desired State” in Argo?¶
The “desired state” of a service is based on whatever is currently in the master branch of the Phalanx repository.
Each application has a matching service in the Phalanx repo - for example, services/alert-stream-broker - which contains a Chart.yaml file, a charts directory containing several charts the broker depends on, and a values-idfint.yaml file (and possibly more values-*.yaml files if the service is deployed to more environments than just the IDF integration environment).
The Chart.yaml file lists Helm charts - and, very crucially, their versions - that define the actual configuration to be used.
The values file(s) list the particular configuration values that should be plugged in to the Helm chart templates used by that service. The values.yaml file should only contain information that is agnostic to which environment the service is deployed to.
Most of the sourced Helm charts are found in the charts directory of alert-stream-broker. The specific charts used are described in more detail in DMTN-210 [3].
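For orientation, a minimal sketch of what such a Chart.yaml looks like; the chart names are the ones mentioned in this document, while the version numbers are placeholders (the real file pins the actual versions and may list additional charts):
apiVersion: v2
name: alert-stream-broker
version: 1.0.0
dependencies:
  - name: alert-stream-broker
    version: "2.5.1"          # placeholder version; this pin is what gets deployed
  - name: alert-stream-schema-registry
    version: "2.5.1"          # placeholder
  - name: alert-stream-simulator
    version: "1.5.1"          # placeholder
  - name: alert-database
    version: "1.1.0"          # placeholder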
Argo is sometimes a little bit delayed from the state of the Phalanx repository, perhaps by a few minutes. You might want to refresh a few times and make sure that the Git reference listed under “Current Sync Status” on the Argo UI for an application matches what you expect to apply.
For troubleshooting: Changes to the strimzi-operator may cause the alert-schema-registry to not fully deploy.
1.2 1Password¶
1Password is a password management tool. LSST IT uses it to distribute passwords, and the SQuARE team has adapted it for managing secrets stored in Kubernetes.
- It’s worth reading the documentation in Phalanx on this subject.
Managing the Alert Distribution System requires 1Password access. The LSST IT team can grant that access. Then, you’ll also need access to the “RSP-Vault” vault in 1Password, which can be granted by the SQuARE team.
The idea is that credentials are stored in a special 1Password vault with carefully formatted fields. Then you can run the Phalanx installer/update_secrets.sh script to copy secrets from 1Password into Vault, a tool for storing and encrypting secret data.
In the background, a tool called Vault Secrets Operator copies secret data from Vault and puts it into Kubernetes Secrets for use by Kubernetes applications.
This is used to manage the passwords for the Kafka users that can access the alert stream: their passwords are set in 1Password, copied into Vault with the script, and then automatically synchronized into Strimzi KafkaUsers (see also: DMTN-210 3.2.3.1: 1Password, Vault, and Passwords).
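As a rough sketch of the manual step in that chain (the script lives in the Phalanx repository; its exact arguments and required environment variables are described in the Phalanx secrets documentation, so treat the invocation below as an assumption):
# copy secrets from 1Password into Vault for the integration environment
cd phalanx/installer
./update_secrets.sh data-int.lsst.cloud   # environment name is an assumption; check the Phalanx docs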
1.3 Terraform¶
Terraform is a tool for managing resources in Google Cloud through code. The IDF deployment of the alert distribution system uses the github.com/lsst/idf_deploy repository for its Terraform configuration.
That repository hosts documentation directly in its README. Changes are made entirely through GitHub Actions workflows, so they get applied simply by merging into the main branch.
1.4 Google Cloud Console¶
The Google Cloud project which hosts the IDF deployment of the alert distribution system is “science-platform-int”. You can use the Google Cloud Platform’s console to see some of the things going on inside the system’s cloud resources.
In particular, the Storage Browser can help with identifying anything going wrong with the storage buckets used by the Alert Database, and the Kubernetes Engine UIs might help with exploring the behavior of the deployed systems on Kubernetes.
1.5 Kowl¶
Kowl [2] is a web application that provides a UI for a Kafka broker. It can help with peeking at messages in the Kafka topics, viewing the broker’s configuration, monitoring the state of consumer groups, and more.
Kowl can be run locally using Docker. It requires superuser permissions in the Kafka broker, which can be first retrieved from 1Password (see 2.4 Retrieving Kafka superuser credentials). Then, here’s how to run it locally:
export KAFKA_USER=$(op item get "alert-stream idfint kafka-admin" --fields label=username)
export KAFKA_PASSWORD=$(op item get "alert-stream idfint kafka-admin" --fields label=password)
docker run \
-p 8080:8080 \
-e KAFKA_BROKERS=alert-stream-int.lsst.cloud:9094 \
-e KAFKA_TLS_ENABLED=true \
-e KAFKA_SASL_ENABLED=true \
-e KAFKA_SASL_USERNAME=$KAFKA_USER \
-e KAFKA_SASL_PASSWORD=$KAFKA_PASSWORD \
-e KAFKA_SASL_MECHANISM=SCRAM-SHA-512 \
-e KAFKA_SCHEMAREGISTRY_ENABLED=true \
-e KAFKA_SCHEMAREGISTRY_URLS=https://alert-schemas-int.lsst.cloud \
quay.io/cloudhut/kowl:master
Once the Kowl container is running, you can view its UI by going to http://localhost:8080.
You should see something like this:
By clicking on a topic, you can see the deserialized messages in the topic. You can expand them by clicking the “+” sign in each row next to the “Value” column. For example:
You can also look at the schema and its versions in the Schema Registry tab:
You can use the Consumer Groups tab to see the position of any consumers. For example, here we can see the Pitt-Google broker:
Kowl has many more capabilities. See the official Kowl documentation [2] for more.
Note: Do not use --network=host, as the current behavior doesn’t allow Docker to publish port 8080 and you won’t be able to access Kowl through localhost.
2 Tool Setup¶
2.1 Getting kubectl Access¶
While the instructions below are still valid, it is no longer recommended to use kubectl; instead, do everything in the Google Cloud web interface.
- Install kubectl: https://kubernetes.io/docs/tasks/tools/
- Install gcloud: https://cloud.google.com/sdk/docs/install
- Run “gcloud auth login <your google cloud account>”. For example, “gcloud auth login swnelson@lsst.cloud”.
- Run “gcloud container clusters get-credentials science-platform-int”.
You should now have kubectl access. Try kubectl get kafka --namespace alert-stream-broker to verify. You should see output like this:
-> % kubectl get kafka --namespace alert-stream-broker
NAME DESIRED KAFKA REPLICAS DESIRED ZK REPLICAS READY WARNINGS
alert-broker 3 3 True
If you haven’t set up your region correctly, you will see this error:
Fetching cluster endpoint and auth data. ERROR: (gcloud.container.clusters.get-credentials) ResponseError: code=404, message=Not found: projects/science-platform-int-dc5d/zones/us-west1-b/clusters/science-platform-int.
If that happens, open up ~/.config/gcloud/configurations/config_default and set the zone to the suggested zone.
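Alternatively, you can set the zone with gcloud instead of editing the file by hand; the zone below is only an example, so use the one suggested in the error message:
gcloud config set compute/zone us-central1-b   # use the zone gcloud suggests
gcloud container clusters get-credentials science-platform-int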
2.2 Secure Password Use¶
In the following sections, you can fill in both the username and the password manually on your command line. However, this is not secure and can leave the credentials in your command-line history. Instead, if you are using 1Password, use 1Password’s command-line tool (op) so that you do not enter your credentials directly.
2.3 Running Kowl¶
Make sure you have docker installed.
Make sure the Docker daemon is running. If using Docker Desktop start up the application.
Retrieve Kafka superuser credentials, as described in 2.4 Retrieving Kafka superuser credentials.
Run the following:
export KAFKA_USER=$(op item get "alert-stream idfint kafka-admin" --fields label=username)
export KAFKA_PASSWORD=$(op item get "alert-stream idfint kafka-admin" --fields label=password)
docker run \
  -p 8080:8080 \
  -e KAFKA_BROKERS=alert-stream-int.lsst.cloud:9094 \
  -e KAFKA_TLS_ENABLED=true \
  -e KAFKA_SASL_ENABLED=true \
  -e KAFKA_SASL_USERNAME=$KAFKA_USER \
  -e KAFKA_SASL_PASSWORD=$KAFKA_PASSWORD \
  -e KAFKA_SASL_MECHANISM=SCRAM-SHA-512 \
  -e KAFKA_SCHEMAREGISTRY_ENABLED=true \
  -e KAFKA_SCHEMAREGISTRY_URLS=https://alert-schemas-int.lsst.cloud \
  quay.io/cloudhut/kowl:master
- Go to http://localhost:8080
2.4 Retrieving Kafka superuser credentials¶
The superuser has access to do anything. Be careful with these credentials! To find the credentials:
- Log in to 1Password in the LSST IT account.
- Go to the “RSP-Vault” vault.
- Search for “alert-stream idfint kafka-admin”.
2.5 Retrieving development credentials¶
This user only has limited permissions, mimicking those of a community broker.
- Log in to 1Password in the LSST IT account.
- Go to the “RSP-Vault” vault.
- Search for “alert-stream idfint rubin-communitybroker-idfint”.
3 System Status¶
3.1 Testing connectivity¶
First, get the set of developer credentials (2.5 Retrieving development credentials).
Then, use one of the example consumer applications listed in sample_alert_info/examples. These will show whether you’re able to connect to the Kafka stream and receive sample alert packets, as well as whether you’re able to retrieve schemas from the Schema Registry.
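If you just want a quick command-line check that the listener and credentials work, kcat (formerly kafkacat) can list the cluster metadata over the same SASL/SSL listener. This is a sketch, not one of the example applications; the broker address and SASL mechanism are the ones given in 1.5 Kowl, and the credentials are assumed to be exported as KAFKA_USER and KAFKA_PASSWORD:
kcat -L \
  -b alert-stream-int.lsst.cloud:9094 \
  -X security.protocol=SASL_SSL \
  -X sasl.mechanisms=SCRAM-SHA-512 \
  -X sasl.username="$KAFKA_USER" \
  -X sasl.password="$KAFKA_PASSWORD"
A successful connection prints the brokers and the topics your credentials can see.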
3.2 Checking disk usage¶
First, check how much disk is used by Kafka:
Run Kowl, following the instructions in 2.2 Secure Password Use.
Navigate to the brokers view at http://localhost:8080/brokers.
You should see the amount of disk used by each broker in the right-most column under “size.”
Next, check how much is requested in the persistent volume claims used by the Kafka brokers:
Ensure you have kubectl access (2.1 Getting kubectl Access).
Run kubectl get pvc --namespace alert-stream-broker. You should see output like this:
-> % kubectl get pvc -n alert-stream-broker
NAME                            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-0-alert-broker-kafka-0     Bound    pvc-e5bf9fb1-e763-4c03-8294-b81a6955bde3   1500Gi     RWO            standard       77d
data-0-alert-broker-kafka-1     Bound    pvc-c289fc0d-39a0-44b1-b073-2aab5c47ba3a   1500Gi     RWO            standard       77d
data-0-alert-broker-kafka-2     Bound    pvc-6307f422-0448-45bd-985b-f7e559e54bb9   1500Gi     RWO            standard       77d
data-alert-broker-zookeeper-0   Bound    pvc-bd8bb38f-a5d3-47f9-a9a1-13c66f04f80e   1000Gi     RWO            standard       77d
data-alert-broker-zookeeper-1   Bound    pvc-01463914-9b1f-49bd-992f-de0b6b0284ca   1000Gi     RWO            standard       77d
data-alert-broker-zookeeper-2   Bound    pvc-eb37bbaa-cc49-4541-baf4-6f2444330d6f   1000Gi     RWO            standard       77d
3.3 Checking consumer group status¶
- Run Kowl, following the instructions in 2.2 Secure Password Use.
- Navigate to the consumer group view at http://localhost:8080/groups
There should be an entry for each consumer group that is connected or has connected recently.
The “Coordinator” column indicates which of the three Kafka broker nodes is used for coordinating the group’s partition ownership.
The “Members” column indicates the number of currently-active processes which are consuming data.
The “Lag” column indicates how many messages are unread by the consumer group.
3.4 Checking logs¶
In general, logs are available on the Google Cloud Log Explorer UI.
To access them:
Log in to the Google Cloud console at https://console.cloud.google.com.
Navigate to the Log Explorer UI, https://console.cloud.google.com/logs/query
Enter a search query. For example:
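resource.type="k8s_container"
resource.labels.container_name="kafka"
resource.labels.namespace_name="alert-stream-broker"
(This is the same query listed under 3.4.1 Checking Kafka logs.)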
This will bring up all logs from the Kafka brokers.
There are additional “Log fields” on the left column. You can use these to filter to a single one of the three brokers via the “Pod name” field.
You can pick a different time range by clicking on “Last 1 hour” in the top right:
See also: the GCP Log Explorer documentation: https://cloud.google.com/logging/docs/view/logs-viewer-interface
Each of the subsections lists search queries that can be used to filter logs.
3.4.1 Checking Kafka logs¶
Search for the following:
resource.type="k8s_container"
resource.labels.container_name="kafka"
resource.labels.namespace_name="alert-stream-broker"
3.4.2 Checking Strimzi logs¶
Search for the following:
resource.type="k8s_container"
resource.labels.namespace_name="strimzi"
3.4.3 Checking Strimzi Registry Operator logs¶
Search for the following:
resource.type="k8s_container"
resource.labels.namespace_name="strimzi-registry-operator"
3.4.4 Checking Schema Registry logs¶
Search for the following:
resource.type="k8s_container"
resource.labels.pod_name:"alert-schema-registry"
resource.labels.namespace_name="alert-stream-broker"
3.4.5 Checking Alert Database logs¶
Search for the following:
resource.type="k8s_container"
resource.labels.pod_name:"alert-database"
resource.labels.namespace_name="alert-stream-broker"
3.4.6 Checking Alert Stream Simulator logs¶
Search for the following:
resource.type="k8s_container"
resource.labels.pod_name="alert-stream-simulator"
resource.labels.namespace_name="alert-stream-broker"
4 Administration¶
4.1 Sharing passwords¶
Log in to 1Password in the LSST IT account.
Go to the “RSP-Vault” vault.
Search for the username of the account you want to share.
Click on the 3-dot menu in the top right and choose “Share…”:
This will open a new browser window for a sharing link.
Set the duration and availability as desired, and click “Get Link to Share”:
Share the link as you see fit.
Shared links can also be revoked; see 1Password Documentation for more.
4.2 Changing passwords¶
- Log in to 1Password in the LSST IT account.
- Go to the “RSP-Vault” vault.
- Search for the username of the account you want to modify.
- Click on the password field. Generate a new password and set it, and save your changes.
- Follow the instructions in Phalanx: Updating a secret stored in 1Password and VaultSecret.
Then verify that the change was successful by checking it in Argo.
- Log in to Argo (see also 1.1.1 Accessing Argo).
- Navigate to the “alert-stream-broker” application.
- In the “filters” on the left side, search for your targeted username in the “Name” field. You should see a filtered set of resources now.
- Click on the “secret” resource and check that it has an “updated” timestamp that is after you made your changes. If not, delete the “Secret” resource; it will be automatically recreated quickly. Once recreated, the user’s password will be updated automatically.
If this seems to be having trouble, consider checking:
- the Vault Secrets Operator logs to make sure it is updating secrets correctly
- the Strimzi Entity Operator logs to make sure they are updating user accounts correctly
- the Kafka broker logs to make sure it’s healthy
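If you have kubectl access, you can also spot-check the synchronized resources directly. This is a sketch: it assumes the Secret and the Strimzi-managed KafkaUser are named after the username and live in the alert-stream-broker namespace, as they do for the users described in this document:
# check when the Kubernetes Secret was last written
kubectl get secret <username> --namespace alert-stream-broker -o yaml
# check that the Strimzi-managed KafkaUser exists and is Ready
kubectl get kafkauser <username> --namespace alert-stream-broker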
4.3 Adding a new user account¶
First, generate new credentials for the user:
Log in to 1Password in the LSST IT account.
Go to the “RSP-Vault” vault.
Create a new secret.
- Name it “alert-stream idfint <username>”.
- Set the “Username” field to <username>.
- Set the “Password” field to something autogenerated.
- Add a field named “generate_secrets_key”. Set its value to “alert-stream-broker <username>-password”
- Add a field named “environment”. Set its value to “data-int.lsst.cloud”
If you’re running in a different environment than the IDF integration environment, replace “idfint” and “data-int.lsst.cloud” with the appropriate values.
Sync the secret into Vault following the instructions in Phalanx documentation.
Second, add the user to the configuration for the cluster:
Make a change to github.com/lsst-sqre/phalanx’s services/alert-stream-broker/values-idfint.yaml file. Add the new user to the list of users under alert-stream-broker.users: https://github.com/lsst-sqre/phalanx/blob/bb417e80e0d9d1148da6edccae400eec006576e1/services/alert-stream-broker/values-idfint.yaml#L33-L73
Make sure you use the same username, and grant it read-only access to the alerts-simulated topic by setting readonlyTopics: ["alerts-simulated"], just like the other entries (see the sketch below). If more topics should be available, add them. If running in a different environment than the IDF integration environment, modify the appropriate config file, not values-idfint.yaml.
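As a sketch, a user entry in that list looks roughly like the following; the exact set of fields is defined by the alert-stream-broker chart, so copy an existing entry from values-idfint.yaml rather than this one, and treat the username here as a placeholder:
users:
  - username: "example-newuser"               # placeholder username
    readonlyTopics: ["alerts-simulated"]      # topics this user may read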
Make a pull request with your changes, and make sure it passes automated checks, and get it reviewed.
Merge your PR. Wait a few minutes (perhaps 10) for Argo to pick up the change.
Log in to Argo CD.
Navigate to the ‘alert-stream-broker’ application.
Click “sync” and leave all the defaults to sync your changes, creating the new user.
Verify that the new KafkaUser was created by using the filters on the left side to search for the new username.
Verify that the user was added to Kafka by using Kowl and going to the “Access Control List” section (see 2.2 Secure Password Use).
Optionally verify that access works using a method similar to that in 3.1 Testing connectivity.
4.4 Removing a user account¶
- Delete the user from the list in github.com/lsst-sqre/phalanx’s services/alert-stream-broker/values-idfint.yaml file.
- Make a pull request with this change, and make sure it passes automated checks, and get it reviewed.
- Merge your PR.
- Delete the user’s credentials from 1Password in the RSP-Vault vault of the LSST IT account. You can find the credentials by searching by username.
- Log in to Argo CD.
- Navigate to the ‘alert-stream-broker’ application.
- Click “sync”. Click the “prune” checkbox to prune out the defunct user. Apply the sync.
Verify that the user was removed from Kafka by using Kowl and going to the “Access Control List” section (see 2.2 Secure Password Use). The user shouldn’t be in the ACLs anymore.
4.5 Granting users read-only access to a new topic¶
- Make a change to github.com/lsst-sqre/phalanx’s services/alert-stream-broker/values-idfint.yaml file. In the list of users under alert-stream-broker.users, add the new topic to the readonlyTopics list for each user that should have access.
- Make a pull request with your changes, and make sure it passes automated checks, and get it reviewed.
- Merge your PR. Wait a few minutes (perhaps 10) for Argo to pick up the change.
- Log in to Argo CD.
- Navigate to the ‘alert-stream-broker’ application.
- Click “sync” and leave all the defaults to sync your changes, modifying access.
Verify that the change worked by using Kowl and going to the “Access Control List” section (see 2.2 Secure Password Use). There should be matching permissions with Resource=TOPIC, Permission=ALLOW, and Principal being the users who were granted access.
4.6 Adding a new Kafka topic¶
Add a new KafkaTopic resource to the templates directory in one of the charts that composes the alert-stream-broker service. This will be in the alert-stream-broker/charts repository. For example, there is a KafkaTopic resource in the alert-stream-simulator/templates/kafka-topics.yaml file. These files use the Helm templating language. See The Chart Template Developer’s Guide for more information on this language.
Strimzi’s documentation (“5.2.1: Kafka topic resource”) may be helpful in configuring the topic. The schema for KafkaTopic resources has a complete reference at 11.2.90: KafkaTopic schema reference.
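For orientation, a minimal KafkaTopic sketch following the Strimzi schema referenced above; the topic name, partition and replica counts, and config values are placeholders, and the real templates in alert-stream-broker/charts parameterize some of these through Helm values:
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: example-new-topic                  # placeholder topic name
  labels:
    strimzi.io/cluster: alert-broker       # must match the Kafka cluster name
spec:
  partitions: 8                            # placeholder
  replicas: 3                              # placeholder
  config:
    retention.ms: 86400000                 # placeholder topic-level retention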
Pick the chart that is most relevant to the topic you are adding. If it is not relevant to any particular chart, use the general charts/alert-stream-broker chart.
Increment the version of the chart by updating the version field of its Chart.yaml file. For example, this line of the alert-stream-simulator chart.
Make a pull request with your changes to alert-stream-broker/charts, and make sure it passes automated checks, and get it reviewed. Merge your PR.
Next, you’ll update the services/alert-stream-broker/Chart.yaml file to reference the new version number of the chart you have updated. For example, this line would need to be updated if you were adding a topic to the alert-stream-simulator.
Make a pull request with your changes to github.com/lsst-sqre/phalanx, and make sure it passes automated checks, and get it reviewed. Merge your PR.
Wait a few minutes (perhaps 10) for Argo to pick up the change to Phalanx.
Log in to Argo CD.
Navigate to the ‘alert-stream-broker’ application.
Click ‘sync’ and leave all the defaults to sync your changes, creating the new topic.
Verify that the change worked by using Kowl and going to the “Topics” section (see 2.2 Secure Password Use). There should be a new topic created.
To let users read from the topic, see 4.5 Granting users read-only access to a new topic.
4.7 Granting Alert DB access¶
Alert DB access is governed by membership in GitHub organizations and teams.
The list of permitted GitHub groups for the IDF integration environment is in the services/gafaelfawr/values-idfint.yaml file in github.com/lsst-sqre/phalanx.
As of this writing, that list is composed of ‘lsst-sqre-square’ and ‘lsst-sqre-friends’, so any users who wish to have access need to be added to the “square” or “friends” teams in the lsst-sqre GitHub organization.
Invite a user to join one of those groups to grant access.
To change the set of permitted groups, modify the services/gafaelfawr/values-idfint.yaml file to change the list under the read:alertdb scope.
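As a sketch only (the exact nesting is defined by the Gafaelfawr configuration, so copy the structure from the existing services/gafaelfawr/values-idfint.yaml rather than from here), the mapping looks something like:
groupMapping:
  "read:alertdb":
    - "lsst-sqre-square"
    - "lsst-sqre-friends"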
Then, sync the change to Gafaelfawr via Argo CD.
5 Making Changes¶
5.1 Deploying a change with Argo¶
In general, to make any change with Argo CD, you update Helm charts, update Phalanx, and then “sync” the alert-stream-broker application:
- Make desired changes to Helm charts, if required, in alert-stream-broker/charts. Note that any changes to Helm charts always require the version to be updated.
- Merge your Helm chart changes.
- Update the services/alert-stream-broker/Chart.yaml file to reference the new version number of the chart you have updated, if you made any Helm chart changes.
- Update the services/alert-stream-broker/values-idfint.yaml file to pass in any new template parameters, or make modifications to existing ones.
- Merge your Phalanx changes.
- Wait a few minutes (perhaps 10) for Argo to pick up the change to Phalanx.
- Log in to Argo CD at https://data-int.lsst.cloud/argo-cd.
- Navigate to the ‘alert-stream-broker’ application.
- Click ‘sync’ to synchronize your changes.
5.2 Updating the Kafka version¶
The Kafka version is set in the alert-stream-broker/templates/kafka.yaml file in services/alert-stream-broker.
It is parameterized through the kafka.version value in the alert-stream-broker chart, which defaults to “2.8”.
When upgrading the Kafka version, you may also need to update kafka.logMessageFormatVersion and kafka.interBrokerProtocolVersion.
These change slowly, but old values can be incompatible with new Kafka versions.
See the Strimzi documentation on Kafka Versions to be sure.
So, to update the version of Kafka used, update the services/alert-stream-broker/values-idfint.yaml file in github.com/lsst-sqre/phalanx. Under alert-stream-broker, then under kafka, add a value: version: <whatever you want>. If necessary, also set logMessageFormatVersion and interBrokerProtocolVersion here, as in the sketch below.
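A sketch of the resulting values-idfint.yaml fragment; the version numbers here are illustrative, not a recommendation:
alert-stream-broker:
  kafka:
    version: "3.2.0"                      # the Kafka version you are upgrading to
    logMessageFormatVersion: "3.2"        # only if needed; see the Strimzi docs
    interBrokerProtocolVersion: "3.2"     # only if needed; see the Strimzi docs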
Then, follow the steps in 5.1 Deploying a change with Argo to apply these changes.
See also: the Strimzi Documentation’s “9.5: Upgrading Kafka”.
5.3 Updating the Strimzi version¶
First, you probably want to read the Strimzi Documentation’s “9. Upgrading Strimzi”.
The Strimzi version is governed by the version referenced in github.com/lsst-sqre/phalanx’s services/strimzi/Chart.yaml file. Update that version, and do anything else recommended by Strimzi in their documentation, such as changes to resources.
Then, apply the change in a way similar to that described in 5.1 Deploying a change with Argo. Note though that you’ll be synchronizing the ‘strimzi’ application in Argo, not the ‘alert-stream-broker’ application in Argo.
5.4 Resizing Kafka broker disk storage¶
Some reference reading:
- DMTN-210’s section 3.2.1.3: Storage.
- “Persistent storage improvements”
Change the alert-stream-broker.kafka.storage.size value in services/alert-stream-broker/values-idfint.yaml in github.com/lsst-sqre/phalanx. This is the amount of disk space per broker instance.
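A sketch of the values-idfint.yaml fragment; the size shown is illustrative:
alert-stream-broker:
  kafka:
    storage:
      size: 2000Gi     # per-broker disk space; can only ever be increased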
Apply the change, as described in 5.1 Deploying a change with Argo.
This may take a little while to apply, since it is handled through the asynchronous Kafka operator, which reconciles storage size every few minutes. When it starts reconciling, it rolls the change out gradually across the Kafka cluster to maintain availability.
Note that storage sizes can only be increased, never decreased.
5.5 Updating the alert schema¶
For background, you might want to read DMTN-210’s section 3.4.4: Schema Synchronization Job.
The high-level steps are to:
- Commit your changes in the lsst/alert_packet repository, obeying its particular versioning system
- Build a new lsstdm/lsst_alert_packet container
- Publish a new lsst-alert-packet Python package
- Load the schema into the schema registry, incrementing the Schema ID
- Update the alert-stream-simulator to use the new Python package and new schema ID
5.5.1 Making a new alert schema¶
First, make a new subdirectory in github.com/lsst/alert_packet’s python/lsst/alert/packet/schema directory. For example, the current latest version as of this writing is 4.0, so there’s a python/lsst/alert/packet/schema/4/0 directory which holds Avro schemas. You could put a new schema in python/lsst/alert/packet/schema/4/1.
Start by copying the current schema into the new directory, and then make your changes. Then, update python/lsst/alert/packet/schema/latest.txt to reference the new schema version number.
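A sketch of those steps, assuming the schema files use the .avsc extension and that latest.txt simply holds the version number (check the existing files to confirm both assumptions):
cd alert_packet/python/lsst/alert/packet/schema
mkdir -p 4/1
cp 4/0/*.avsc 4/1/          # start from a copy of the current schema
# ...edit the schemas in 4/1...
echo "4.1" > latest.txt     # point latest.txt at the new version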
5.5.2 Creating a container which loads the schema¶
When you are satisfied with your changes, push them and open a PR.
As long as your github branch starts with “tickets/” or is tagged, this will automatically kick off the “build_sync_container” GitHub Actions job, which will create a Docker container holding the alert schema.
The container will be named lsstdm/lsst_alert_packet:<tag-or-branch-name>; slashes are replaced with dashes in the tag-or-branch-name spot.
For example, if you’re working on a branch named tickets/DM-34567, then the container will be created and pushed to lsstdm/lsst_alert_packet:tickets-DM-34567.
You can use this ticket-number-based container tag while doing development, but once you’re sure of things, merge the PR and then tag a release. The release tag can be the version of the alert schema (for example “4.1”) if you like - it doesn’t really matter what value you pick; there are so many version numbers flying around with alert schemas that it’s going to be hard to find any scheme which is ideal.
To confirm that your container is working, you can run the container locally. For example, for the “w.2022.04” tag:
-> % docker run --rm lsstdm/lsst_alert_packet:w.2022.04 'syncLatestSchemaToRegistry.py --help'
usage: syncLatestSchemaToRegistry.py [-h]
[--schema-registry-url SCHEMA_REGISTRY_URL]
[--subject SUBJECT]
optional arguments:
-h, --help show this help message and exit
--schema-registry-url SCHEMA_REGISTRY_URL
URL of a Schema Registry service
--subject SUBJECT Schema Registry subject name to use
5.5.3 Loading the new schema into the schema registry¶
To load the new schema into the schema registry, update the alert-stream-schema-registry.schemaSync.image.tag value to the tag that you used for the container.
The defaults are set in the alert-stream-schema-registry’s values.yaml file. You can update the defaults, or you can update the parameters used in Phalanx for a particular environment under the alert-stream-schema-registry field.
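A sketch of the Phalanx values fragment; the tag shown is a placeholder for whatever container tag you built in the previous step:
alert-stream-schema-registry:
  schemaSync:
    image:
      tag: "4.1"     # placeholder; use your actual lsstdm/lsst_alert_packet tag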
Apply these changes as described in 5.1 Deploying a change with Argo. The result should be that a new schema is added to the schema registry.
Once the change is deployed, the job that loads the schema will start. You can monitor it in the Argo UI by looking for the Job named ‘sync-schema-job’.
You can confirm it worked by using Kowl (see 2.2 Secure Password Use) and using its UI for looking at the schema registry’s contents.
5.5.4 Publishing a new lsst-alert-packet Python package¶
The alert stream simulator gets its version of the alert packet schema from the lsst-alert-packet Python package.
The version of this package that it uses is set in setup.py of github.com/lsst-dm/alert-stream-simulator.
You’ll need to publish a new version of the lsst-alert-packet Python package in order to get a new version in alert-stream-simulator.
Start by updating the version in setup.cfg of github.com/lsst/alert_packet. Merge your change which includes the new version in setup.cfg.
The new version of the package needs to be published to PyPI, the Python Package Index: https://pypi.org/project/lsst-alert-packet/. It is managed by a user named ‘lsst-alert-packet-admin’, which has credentials stored in 1Password in the RSP-Vault vault. Use 1Password to get the credentials for that user.
Once you have credentials and have incremented the version, you’re ready to publish to PyPI. Explaining how to do that is out of scope of this guide, but Twine is a good tool for the job.
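A minimal sketch of a publish with build and Twine, assuming you run it from a clean checkout of lsst/alert_packet at the new version (the PyPI credentials are the ones retrieved above):
python -m pip install build twine
python -m build                  # produces the sdist and wheel under dist/
python -m twine upload dist/*    # prompts for the lsst-alert-packet-admin credentials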
5.5.5 Updating the Alert Stream Simulator package¶
The alert stream simulator needs to use the new version of the lsst-alert-packet package which you published to PyPI.
Second, the chart which runs the simulator needs to be updated to use the right ID of the schema in the schema registry.
The version of lsst-alert-packet is set in the setup.py file of github.com/lsst-dm/alert-stream-simulator. Update this to require the newly-published Python package.
Once you have made and merged a PR to this, tag a new release of the alert stream simulator using git tag. When your tag has been pushed to the alert stream simulator GitHub repository, an automated build will create a container (in a manner almost exactly the same as you saw for lsst/alert_packet).
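A sketch of the tagging step; the version number is a placeholder and should follow whatever release numbering the simulator is using:
git tag v1.2.2           # placeholder version
git push origin v1.2.2   # pushing the tag triggers the container build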
You can use docker run to verify that this worked.
For example, for version v1.2.1:
-> % docker run --rm lsstdm/alert-stream-simulator:v1.2.1 'rubin-alert-sim -h'
usage: rubin-alert-sim [-h] [-v] [-d]
{create-stream,play-stream,print-stream} ...
optional arguments:
-h, --help show this help message and exit
-v, --verbose enable info-level logging (default: False)
-d, --debug enable debug-level logging (default: False)
subcommands:
{create-stream,play-stream,print-stream}
create-stream create a stream dataset to be run through the
simulation.
play-stream play back a stream that has already been created
print-stream print the size of messages in the stream in real time
5.5.6 Getting the schema registry’s ID¶
Next, you’ll need to get the ID that is used by the schema registry so that you can use it in the alert stream simulator deployment. This is easiest to retrieve using Kowl.
Run Kowl (see 2.2 Secure Password Use) and then navigate to http://localhost:8080/schema-registry/alert-packet. There should be a drop-down with different versions. You probably want the latest version, which might already be the one being displayed. Select the desired version.
At the top of the screen, you should see the “Schema ID” of the schema you have selected. This integer is an ID we’ll need to reference later.
5.5.7 Updating the Alert Stream Simulator values¶
You’re almost done. We need to update the alert stream simulator deployment to use the new container version, and to use the new schema ID.
The container version is set in values-idfint.yaml’s alert-stream-simulator.image.tag field. Update this to match the tag you used in github.com/lsst-dm/alert-stream-simulator.
The schema ID is set in values-idfint.yaml as well, under alert-stream-simulator.schemaID. This is set to 1 by default.
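A sketch of the values-idfint.yaml fragment that ties both together; the tag and schema ID are placeholders for the values from your release and from 5.5.6:
alert-stream-simulator:
  image:
    tag: "v1.2.2"     # placeholder; the simulator release tag you pushed
  schemaID: 2         # placeholder; the Schema ID shown in Kowl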
Those changes to values-idfint.yaml are half the story. You probably also should update the defaults, which is done by editing the values.yaml files in the alert-stream-simulator chart. This values.yaml also sets dynamic configuration at the topic level (such as retention.ms or retention.bytes), which overrides the corresponding settings made at the broker level.
Once you have made those changes, apply them following the instructions in 5.1 Deploying a change with Argo.
The new simulator may take a few minutes to come online, as the data needs to be reloaded. Once the sync has completed, you can verify that the change worked.
Verify that it worked using Kowl (see 2.2 Secure Password Use) by looking at the Messages UI (keep in mind that it can take up to 37 seconds for messages to appear!). The messages should be encoded using your new schema.
Warning
You probably want to change the sample alert data (see 5.6 Changing the sample alert data) used by the alert stream simulator.
If you don’t do this, then the alert packets will be decoded using the version used when sample alerts were generated, then re-encoded using the new alert schema.
You can manage this transition using Avro’s aliases, but it might be simpler to simultaneously switch to a new version of the sample alert data.
5.6 Changing the sample alert data¶
The sample alert data used by the alert stream simulator is set in a Makefile:
.PHONY: datasets
datasets: data/rubin_single_ccd_sample.avro data/rubin_single_visit_sample.avro
data:
mkdir -p data
data/rubin_single_ccd_sample.avro: data
wget --no-verbose --output-document data/rubin_single_ccd_sample.avro https://lsst.ncsa.illinois.edu/~ebellm/sample_precursor_alerts/latest_single_ccd_sample.avro
data/rubin_single_visit_sample.avro: data
wget --no-verbose --output-document data/rubin_single_visit_sample.avro https://lsst.ncsa.illinois.edu/~ebellm/sample_precursor_alerts/latest_single_visit_sample.avro
The last two targets show what’s happening. The sample alerts are downloaded from https://lsst.ncsa.illinois.edu/~ebellm/sample_precursor_alerts/latest_single_visit_sample.avro.
The sample alerts could be retrieved from anywhere else. The important things are that they should be encoded in Avro Object Container File format (that is, with all alerts in one file, preceded by a single instance of the Avro schema), and that they should represent a single visit of alert packet data.
Make changes to the makefile to get data from somewhere else, and then merge your changes.
Make a git tag using the format vX.Y.Z, for example v1.3.10, and push that git tag up.
This will trigger a build job for the container using the new tag.
Next, copy that tag into charts/alert-stream-simulator/values.yaml, and follow the instructions from 5.1 Deploying a change with Argo. This will configure the alert stream simulator to use the new alert data, publishing it every 37 seconds.
5.7 Deploying on a new Kubernetes cluster on Google Kubernetes Engine¶
Deploying on a new Kubernetes cluster will take a lot of steps, and has not been done before, so this section is somewhat speculative.
5.7.1 Prerequisites¶
There are certain prerequisites before even starting. These are systems that are dependencies of the alert distribution system’s current implementation, so they must be present already.
They are:
- Argo CD should be installed and configured to make deployment possible using configuration from Phalanx and Helm. This means there should be some “environment” analogous to “idfint” which is used in the IDF integration deployment.
- Gafaelfawr should be installed to set up the ingress for the alert database.
- cert-manager should be installed so that broker TLS certificates can be automatically provisioned.
- The nginx ingress controller should be installed to set up the ingress for the schema registry.
- Workload Identity needs to be configured properly (for example, through Terraform) on the Google Kubernetes Engine instance to allow the alert database to gain permissions to interact with Google Cloud Storage buckets.
5.7.2 Preparation with Terraform¶
Before starting, some resources should be provisioned, presumably using Terraform:
- A node pool for Kafka instances to run on.
- Storage buckets for alert packets and schemas.
- IAM roles providing access to the storage buckets for the alert database ingester and server (as writer and reader, respectively).
The current node pool configuration in the IDFINT environment can be found in the environments/deployments/science-platform/env/integration-gke.tf file:
{
name = "kafka-pool"
machine_type = "n2-standard-32"
node_locations = "us-central1-b"
local_ssd_count = 0
auto_repair = true
auto_upgrade = true
preemptible = false
image_type = "cos_containerd"
enable_secure_boot = true
disk_size_gb = "500"
disk_type = "pd-standard"
autoscaling = true
initial_node_count = 1
min_count = 1
max_count = 10
}
]
node_pools_labels = {
core-pool = {
infrastructure = "ok",
jupyterlab = "ok"
},
dask-pool = {
dask = "ok"
},
kafka-pool = {
kafka = "ok"
}
}
node_pools_taints = {
core-pool = [],
dask-pool = []
kafka-pool = [
{
effect = "NO_SCHEDULE"
key = "kafka",
value = "ok"
}
]
}
Storage bucket configuration is in environment/deployments/science-platform/env/integration-alertdb.tfvars:
# Project
environment = "int"
project_id = "science-platform-int-dc5d"
# In integration, only keep 4 weeks of simulated alert data.
purge_old_alerts = true
maximum_alert_age = 28
writer_k8s_namespace = "alert-stream-broker"
writer_k8s_serviceaccount_name = "alert-database-writer"
reader_k8s_namespace = "alert-stream-broker"
reader_k8s_serviceaccount_name = "alert-database-reader"
# Increase this number to force Terraform to update the int environment.
# Serial: 2
This references the environment/deployments/science-platform/alertdb module.
Note that buckets and roles are already created in the RSP’s Dev and Prod projects.
It may be helpful to look at the PRs that originally configured the Int environment.
5.7.3 Provision the DNS for the schema registry¶
DNS is provisioned by the SQuARE team, so you’ll have to make requests to them for this part.
The target environment is running Gafaelfawr, so it has some base IP address used for the main ingress. The schema registry can run on the same IP address, even though it uses a different hostname.
So, request a DNS A record which points to the base IP of the targeted environment’s main ingress.
For example, ‘data-int.lsst.cloud’, which is the base URL for the INT IDF environment, is an A record for ‘35.238.192.49’. The schema registry therefore gets a DNS A record ‘alert-schemas-int.lsst.cloud’ which similarly points to 35.238.192.49.
5.7.4 Configuring a new Phalanx deployment¶
You’ll need to configure a new Phalanx deployment.
To do this, create a values-<environment>.yaml file in the services/alert-stream-broker directory of github.com/lsst-sqre/phalanx which matches the environment.
You must explicitly set a hostname for the schema registry (in alert-stream-schema-registry.hostname and alert-database.ingester.schemaRegistryURL). Use the one you provisioned in the previous step.
You will also need to explicitly pass in the alert database GCP project and bucket names. Be careful to set the fields of the alert database to the right values that match what you created in Terraform.
Finally, make sure not to set the alert-stream-broker.kafka.externalListener field yet. This field uses IPs and hostnames which we don’t yet know.
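Pulling the above together, a sketch of the initial values-<environment>.yaml; every value here is a placeholder that must match what you provisioned, and any additional keys (such as the alert database project and bucket names) are defined by the charts, so check their values.yaml files:
alert-stream-schema-registry:
  hostname: "alert-schemas-<environment>.lsst.cloud"     # the DNS record requested earlier
alert-database:
  ingester:
    schemaRegistryURL: "https://alert-schemas-<environment>.lsst.cloud"
# alert-stream-broker.kafka.externalListener is deliberately left unset at this stage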
You will similarly need to configure the values-<environment>.yaml file for Strimzi (in services/strimzi) and for the Strimzi Registry Operator (in services/strimzi_registry_operator).
You will also need to enable the alert_stream_broker, strimzi, and strimzi_registry_operator applications in the science-platform/values-<environment>.yaml file.
For example, see the science-platform/values-idfint.yaml file, which has enabled: true for those three applications. You need to do that for your target environment as well.
5.7.5 Enabling the new services in Argo¶
Argo needs to be synced - that is, the Argo application itself - in order to detect the newly-enabled alert_stream_broker, strimzi, and strimzi_registry_operator applications.
Do that first - log in to Argo in the target environment, and sync the Argo application.
Next, sync Strimzi. It should succeed without errors.
Next, sync the Strimzi Registry Operator. It should also succeed without errors.
Next, sync the alert stream broker application. Errors are expected at this stage. Our goal is just to do the initial setup so some of the resources come up, but not everything will work immediately.
5.7.6 Provisioning DNS records¶
Once the alert-stream-broker is synced into a half-broken, half-working state, we can start to get the IP addresses used by its services. This will let us provision more DNS records: those for the Kafka brokers.
In the current Google Cloud setup, this must be done through SQuARE. If you cannot use the existing static IPs, you must request that three be assigned for the Kafka brokers, and that the DNS records are updated to point to the correct static IPs.
You will then need to update values-idfint.yaml accordingly (the sketch at the end of 5.7.7 shows the shape of this change).
The Kafka brokers MUST point to static IPs; otherwise, restarting Kafka will cause the assigned IPs to change. If they do not, there will be problems with the SSL certificates and users will not be able to connect. See the following link for an explanation of why:
https://strimzi.io/blog/2021/05/07/deploying-kafka-with-lets-encrypt-certificates/
Previously, this setup was done through kubectl. However, it is now handled through SQuARE. The kubectl instructions have been kept in case there is a need to use them in the future.
5.7.7 Previous DNS provisioning workflow¶
To provision the Kafka broker IPs, we will use kubectl to look up the IP addresses provisioned for the broker (see 2.1 Getting kubectl Access).
Run kubectl get service --namespace alert-stream-broker to get a list of all the services running:
-> % kubectl get service -n alert-stream-broker
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
alert-broker-kafka-0 LoadBalancer 10.130.20.152 35.239.64.164 9094:31402/TCP 78d
alert-broker-kafka-1 LoadBalancer 10.130.23.65 34.122.165.155 9094:31828/TCP 78d
alert-broker-kafka-2 LoadBalancer 10.130.21.82 35.238.120.127 9094:31070/TCP 78d
alert-broker-kafka-bootstrap ClusterIP 10.130.20.156 <none> 9091/TCP,9092/TCP,9093/TCP 78d
alert-broker-kafka-brokers ClusterIP None <none> 9090/TCP,9091/TCP,9092/TCP,9093/TCP 78d
alert-broker-kafka-external-bootstrap LoadBalancer 10.130.16.127 35.188.169.31 9094:30118/TCP 78d
alert-broker-zookeeper-client ClusterIP 10.130.25.236 <none> 2181/TCP 78d
alert-broker-zookeeper-nodes ClusterIP None <none> 2181/TCP,2888/TCP,3888/TCP 78d
alert-schema-registry ClusterIP 10.130.27.137 <none> 8081/TCP 76d
alert-stream-broker-alert-database ClusterIP 10.130.27.41 <none> 3000/TCP 21d
The important column here is “EXTERNAL-IP.” Use it to discover the IP addresses for each of the individual broker hosts, and for the “external-bootstrap” service. Request DNS A records that map useful hostnames to these IP addresses - this is done by the SQuARE team, so you’ll need help.
Once you have DNS provisioned, make another change to values-<environment>.yaml to lock in the IP addresses and inform Kafka of the hostnames to use.
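The exact keys under externalListener are defined by the alert-stream-broker chart, so treat the following as an illustrative sketch only; the per-broker hostnames are hypothetical, and the IPs shown are the example external IPs from the service listing above:
alert-stream-broker:
  kafka:
    externalListener:
      tls:
        enabled: true
      bootstrap:
        host: alert-stream-int.lsst.cloud     # hostname of the external-bootstrap DNS record
        ip: "35.188.169.31"
      brokers:
        - host: alert-stream-int-broker-0.lsst.cloud   # hypothetical per-broker hostnames
          ip: "35.239.64.164"
        - host: alert-stream-int-broker-1.lsst.cloud
          ip: "34.122.165.155"
        - host: alert-stream-int-broker-2.lsst.cloud
          ip: "35.238.120.127"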
Apply this change as usual (see 5.1 Deploying a change with Argo). Now the broker should be accessible.
5.7.8 Adding users¶
Make new user credential sets in 1Password for the new targeted environment. See 4.3 Adding a new user account for how to do this.
In addition, make a user named ‘kafka-admin’ in 1Password in the same way.
Make sure to use the right value for the environment field of the 1Password items.
Then, set alert-stream-broker.vaultSecretsPath in values-<environment>.yaml to secret/k8s_operator/<environment>/alert-stream-broker. This will configure the Vault Secrets Operator to correctly feed secrets through.
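In values form, that setting looks like this (substitute your environment name):
alert-stream-broker:
  vaultSecretsPath: "secret/k8s_operator/<environment>/alert-stream-broker"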
5.7.9 Lingering issues¶
You may need to re-sync several times to trigger the data-loading job of the alert stream simulator. When the system is in its half-broken state, this job will fail, and it backs off exponentially, which can take a very long time to recover. It can also hit a max retry limit and stop attempting to load data.
Using Argo to “sync” will kick it off again, which may fix the problem.
5.7.10 Testing connectivity¶
You should now have a working cluster. You should be able to run Kowl with the new superuser identity and it ought to be able to connect.
5.8 Deploying on a new Kubernetes cluster off of Google¶
Deploying to a new Kubernetes cluster off of Google will require all the same steps as described in the previous section, but with a few additional wrinkles.
First, the alert-stream-broker chart uses the “load balancer” service type to provide external internet access to the Kafka nodes. Load balancer services are very platform-specific; on Google it corresponds to creation of TCP Load Balancers. On a non-Google platform, it might work very differently.
One option would be to use the targeted platform’s load balancers. Another option is to use Node Ports or Ingresses instead. The 5-part Strimzi blog post series “Accessing Kafka” goes into detail about these options.
Second, the alert database uses Google Cloud Storage buckets to store raw alert and schema data.
This would need to be replaced with something appropriate for the targeted environment.
The requirements are made clear in the storage.py files of the github.com/lsst-dm/alert_database_ingester and github.com/lsst-dm/alert_database_server repositories. An implementation would need to fulfill the abstract interface provided in those files.
There may be more requirements, but these two will certainly need investigation if you’re planning to move to a different Kubernetes provider.
5.9 Changing the schema registry hostname¶
The Schema Registry’s hostname is controlled by the ‘hostname’ value passed in to charts/alert-stream-schema-registry. Updating that will update the hostname expected by the service.
In addition, a new DNS record will need to be created by whoever is provisioning DNS for the target environment. For the INT IDF environment, that’s SQuARE. It should route the new hostname to the ingress IP address.
Finally, the new schema registry hostname needs to be passed in to the alert database in its ingester.schemaRegistryURL value.
5.10 Changing the Kafka broker hostnames¶
Kafka broker hostnames can be changed by modifying the values passed in to charts/alert-stream-broker. Once changed, the broker will not work until DNS records are also updated.
See also: 5.7.6 Provisioning DNS records.
5.11 Changing the alert database URL¶
The alert database’s URL is based off of that of the cluster’s main Gafaelfawr ingress, so it cannot be changed entirely. However, it uses a path prefix, which can be changed. This path prefix is controlled by a value passed in to the alert database chart.
5.12 Changing the Kafka hardware¶
To change the hardware used by Kafka, change the nodes used in the node pool. This is set in the terraform configuration in environment/deployments/science-platform/env/integration-gke.tfvars:
{
name = "kafka-pool"
machine_type = "n2-standard-32"
node_locations = "us-central1-b"
local_ssd_count = 0
auto_repair = true
auto_upgrade = true
preemptible = false
image_type = "cos_containerd"
enable_secure_boot = true
disk_size_gb = "500"
disk_type = "pd-standard"
autoscaling = true
initial_node_count = 1
min_count = 1
max_count = 10
}
Change this, and apply the terraform change.
This may cause some downtime as the kafka nodes are terminated and replaced with new ones, evicting the Kafka brokers, but this isn’t known for certain.
References
[1] | Argo CD: Declarative GitOps CD for Kubernetes. URL: https://argo-cd.readthedocs.io/en/stable/. |
[2] | Kowl: A Web UI for Apache Kafka. URL: https://github.com/cloudhut/kowl. |
[3] | (1, 2) [DMTN-210]. Spencer Nelson. Implementation of the LSST alert distribution system. 2021. URL: https://dmtn-210.lsst.io/. |