DMTN-214: Alert Distribution System Operator's Manual

  • Spencer Nelson and Brianna Smart

Latest Revision: 2023-03-24

Note

This is a practical collection of instructions, troubleshooting tips, and playbooks for managing and maintaining the Alert Distribution System.

This is a collection of instructions for how to operate the Alert Distribution System. An overview of the system is provided in DMTN-210 [3] which is essential background reading for this document.

1   Basic Tools

1.1   ArgoCD

Argo CD [1] is a tool for deploying software onto a Kubernetes cluster. Most changes to the Alert Distribution System are done through Argo.

There’s some official SQuARE-maintained documentation for Argo. This section tries to summarize what you’d need to know to work with the Alert Distribution System, but that thorough documentation is worth reading too.

1.1.1   Accessing Argo

The integration deployment uses the Argo installation at https://data-int.lsst.cloud/argo-cd. Access to that installation is managed by the SQuARE team.

When you go to the Argo UI for the first time, you’ll see a big mess of many “applications.” The primary one is named alert-stream-broker, but you may also be interested in the “strimzi” and “strimzi-registry-operator” ones.

1.1.2   Applying Changes by Syncing

Once you’re in an application, the main action you take is syncing. When you “sync” an application, you synchronize its state in Kubernetes with a “desired state” which is derived from Git.

You do this by clicking the big “SYNC” button at the top of the Argo UI. This brings up a daunting list of all the resources that will be synchronized. Usually you don’t want to make changes to the options here, although you might want to enable “prune.”

When “prune” is enabled, Argo will delete any orphaned resources that no longer seem to be desired. Without this option, they will linger around. They probably won’t cause harm, but this can be confusing.

Additionally, some resources may not update properly if they depend on updates in other applications. These may require you to delete the specific resource and then re-deploy it.

The recommended deployment order for easy troubleshooting is:

  1. Deploy Kafka Cluster
  2. Deploy strimzi registry operator
  3. Deploy the schema registry
  4. Deploy the ingress schema for the schema registry
  5. Deploy other services

The strimzi registry operator and the schema registry can be a bit of a chicken-and-egg problem: you may have to re-deploy the operator after the schema registry, and then re-deploy the registry once the schema ingress and schema registry are both in place, for the two to reconcile with each other.

If, during deployment, any resource begins to error continuously, you can delete that specific resource while troubleshooting. This prevents the status channel from being continuously spammed with errors. It is recommended to grab the error log from the logs tab before deleting the resource.

If for some reason the instance of alert-stream-broker has been removed from the active applications, you can re-deploy it by going to the science-platform application and re-syncing alert-stream-broker.

Troubleshooting: If all of the brokers have failed but everything else is running, the brokers may be out of storage. This means that Kafka needs to have either more storage allotted or its retention limits adjusted. This requires a restart of the brokers, and may require a full re-deployment of the whole system.

If you try to restart the brokers from a failed state (whether they have run out of storage or not) and they end up in crash loops, make sure to delete their persistent volume claims so that they can rebuild.
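For example, here is a minimal kubectl sketch for deleting the broker persistent volume claims; the claim names are assumptions based on the output shown in 3.2   Checking disk usage, so verify them first (see also 2.1   Getting kubectl Access):

# List the claims, then delete the ones for the crash-looping brokers so
# that Strimzi can recreate them.
kubectl get pvc --namespace alert-stream-broker
kubectl delete pvc data-0-alert-broker-kafka-0 \
    data-0-alert-broker-kafka-1 \
    data-0-alert-broker-kafka-2 \
    --namespace alert-stream-broker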

1.1.3   What is “Desired State” in Argo?

The “desired state” of a service is based on whatever is currently in the master branch of the Phalanx repository. Each application has a matching service in the Phalanx repo - for example, services/alert-stream-broker - which contains a Chart.yaml file, a charts directory containing several charts the broker depends on, and a values-idfint.yaml file (and possibly more values-*.yaml files if the service is deployed to more environments than just the IDF integration environment).

The Chart.yaml file lists Helm charts - and, very crucially, their versions - that define the actual configuration to be used. The values file(s) list the particular configuration values that should be plugged in to the Helm chart templates used by that service. The values.yaml file should only contain information that is agnostic to the environment the service is deployed to; environment-specific settings belong in the values-*.yaml files.

Most of the sourced Helm charts are found in the charts directory of alert-stream-broker. The specific charts used are described in more detail in DMTN-210 [3].

Argo is sometimes a little bit delayed from the state of the Phalanx repository, perhaps by a few minutes. You might want to refresh a few times and make sure that the Git reference listed under “Current Sync Status” on the Argo UI for an application matches what you expect to apply.

For troubleshooting: Changes to the strimzi-operator may cause the alert-schema-registry to not fully deploy.

1.2   1Password

1Password is a password management tool. LSST IT uses it to distribute passwords, and the SQuARE team has adapted it for managing secrets stored in Kubernetes.

It’s worth reading the documentation in Phalanx on this subject.

Managing the Alert Distribution System requires 1Password access. The LSST IT team can grant that access. Then, you’ll also need access to the “RSP-Vault” vault in 1Password, which can be granted by the SQuARE team.

The idea is that credentials are stored in a special 1Password vault with carefully formatted fields. Then you can run the phalanx installer/update_secrets.sh script to copy secrets from 1Password into Vault, which is a tool for encrypting secret data.

In the background, a tool called Vault Secrets Operator copies secret data from Vault and puts it into Kubernetes secrets for use in Kubernetes applications.

This is used to manage the passwords for the Kafka users that can access the alert stream: their passwords are set in 1Password, copied into Vault with the script, and then automatically synchronized into Strimzi KafkaUsers (see also: DMTN-210 3.2.3.1: 1Password, Vault, and Passwords).

1.3   Terraform

Terraform is a tool for managing resources stored in Google Cloud through code. The IDF deployment of the alert distribution system uses the github.com/lsst/idf_deploy repository for its Terraform configuration.

That repository hosts documentation directly in its README. Changes are made entirely through GitHub Actions workflows, so they get applied simply by merging into the main branch.

1.4   Google Cloud Console

The Google Cloud project which hosts the IDF deployment of the alert distribution system is “science-platform-int”. You can use the Google Cloud Platform’s console to see some of the things going on inside the system’s cloud resources.

In particular, the Storage Browser can help with identifying anything going wrong with the storage buckets used by the Alert Database, and the Kubernetes Engine UIs might help with exploring the behavior of the deployed systems on Kubernetes.

1.5   Kowl

Kowl [2] is a web application that provides a UI for a Kafka broker. It can help with peeking at messages in the Kafka topics, viewing the broker’s configuration, monitoring the state of consumer groups, and more.

Kowl can be run locally using Docker. It requires superuser credentials for the Kafka broker, which can be retrieved from 1Password (see 2.4   Retrieving Kafka superuser credentials). Then, here’s how to run it locally:

export KAFKA_USER=$(op item get "alert-stream idfint kafka-admin" --fields label=username)
export KAFKA_PASSWORD=$(op item get "alert-stream idfint kafka-admin" --fields label=password)

docker run \
    -p 8080:8080 \
    -e KAFKA_BROKERS=alert-stream-int.lsst.cloud:9094 \
    -e KAFKA_TLS_ENABLED=true \
    -e KAFKA_SASL_ENABLED=true \
    -e KAFKA_SASL_USERNAME=$KAFKA_USER \
    -e KAFKA_SASL_PASSWORD=$KAFKA_PASSWORD \
    -e KAFKA_SASL_MECHANISM=SCRAM-SHA-512 \
    -e KAFKA_SCHEMAREGISTRY_ENABLED=true \
    -e KAFKA_SCHEMAREGISTRY_URLS=https://alert-schemas-int.lsst.cloud \
    quay.io/cloudhut/kowl:master

Once the Kowl container is running, you can view its UI by going to http://localhost:8080.

You should see something like this:

_images/kowl_topics.png

By clicking on a topic, you can see the deserialized messages in the topic. You can expand them by clicking the “+” sign in each row next to the “Value” column. For example:

_images/kowl_messages.png

You can also look at the schema and its versions in the Schema Registry tab:

_images/kowl_schemas.png

You can use the Consumer Groups tab to see the position of any consumers. For example, here we can see the Pitt-Google broker:

_images/kowl_consumers.png

Kowl has many more capabilities. See the official Kowl documentation [2] for more.

Note: Do not use --network=host, as Docker will then not publish port 8080 and you won’t be able to access Kowl at localhost.

2   Tool Setup

2.1   Getting kubectl Access

While the instructions below are still valid, it is no longer recommended to use kubectl; instead, do everything in the Google Cloud web interface.

  1. Install kubectl: https://kubernetes.io/docs/tasks/tools/
  2. Install gcloud: https://cloud.google.com/sdk/docs/install
  3. Run “gcloud auth login <your google cloud account>”. For example, “gcloud auth login swnelson@lsst.cloud.”
  4. Run “gcloud container clusters get-credentials science-platform-int”.

You should now have kubectl access. Try kubectl get kafka --namespace alert-stream-broker to verify. You should see output like this:

-> % kubectl get kafka --namespace alert-stream-broker
NAME           DESIRED KAFKA REPLICAS   DESIRED ZK REPLICAS   READY   WARNINGS
alert-broker   3                        3                     True

If you haven’t set up your zone correctly, you will see this error:

Fetching cluster endpoint and auth data. ERROR: (gcloud.container.clusters.get-credentials) ResponseError: code=404, message=Not found: projects/science-platform-int-dc5d/zones/us-west1-b/clusters/science-platform-int.

If that happens, open up ~/.config/gcloud/configurations/config_default and set the zone to the suggested zone.
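Equivalently, you can set the zone with gcloud directly. This is a sketch; us-central1-b is the zone used by the Kafka node pool in the Terraform configuration and is an assumption here, so use whichever zone actually hosts the cluster:

# Set the default compute zone, then fetch the cluster credentials again.
gcloud config set compute/zone us-central1-b
gcloud container clusters get-credentials science-platform-int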

2.2   Secure Password Use

In the following sections, you can fill in both the username and the password manually on your command line. However, this is not secure and can leave the credentials in your command-line history. Instead, if you are using 1Password, use 1Password’s command-line tool (op) so that you do not directly enter your credentials.

2.3   Running Kowl

  1. Make sure you have docker installed.

  2. Make sure the Docker daemon is running. If using Docker Desktop, start the application.

  3. Retrieve Kafka superuser credentials, as described in 2.4   Retrieving Kafka superuser credentials.

  4. Run the following:

    export KAFKA_USER=$(op item get "alert-stream idfint kafka-admin" --fields label=username)
    export KAFKA_PASSWORD=$(op item get "alert-stream idfint kafka-admin" --fields label=password)
    
    docker run \
      -p 8080:8080 \
      -e KAFKA_BROKERS=alert-stream-int.lsst.cloud:9094 \
      -e KAFKA_TLS_ENABLED=true \
      -e KAFKA_SASL_ENABLED=true \
      -e KAFKA_SASL_USERNAME=$KAFKA_USER \
      -e KAFKA_SASL_PASSWORD=$KAFKA_PASSWORD \
      -e KAFKA_SASL_MECHANISM=SCRAM-SHA-512 \
      -e KAFKA_SCHEMAREGISTRY_ENABLED=true \
      -e KAFKA_SCHEMAREGISTRY_URLS=https://alert-schemas-int.lsst.cloud \
      quay.io/cloudhut/kowl:master
    
  5. Go to http://localhost:8080

2.4   Retrieving Kafka superuser credentials

The superuser has access to do anything. Be careful with these credentials! To find the credentials:

  1. Log in to 1Password in the LSST IT account.
  2. Go to the “RSP-Vault” vault.
  3. Search for “alert-stream idfint kafka-admin”.

2.5   Retrieving development credentials

This user only has limited permissions, mimicking those of a community broker.

  1. Log in to 1Password in the LSST IT account.
  2. Go to the “RSP-Vault” vault.
  3. Search for “alert-stream idfint rubin-communitybroker-idfint”.

3   System Status

3.1   Testing connectivity

First, get the set of developer credentials (2.5   Retrieving development credentials).

Then, use one of the example consumer applications listed in sample_alert_info/examples. These will show whether you’re able to connect to the Kafka stream and receive sample alert packets, as well as whether you’re able to retrieve schemas from the Schema Registry.

3.2   Checking disk usage

First, check how much disk is used by Kafka:

  1. Run Kowl, following the instructions in 2.3   Running Kowl.

  2. Navigate to the brokers view at http://localhost:8080/brokers.

    You should see the amount of disk used by each broker in the right-most column under “size.”

Next, check how much is requested in the persistent volume claims used by the Kafka brokers:

  1. Ensure you have kubectl access (2.1   Getting kubectl Access).

  2. Run kubectl get pvc --namespace alert-stream-broker. You should see output like this:

    -> % kubectl get pvc -n alert-stream-broker
    NAME                            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    data-0-alert-broker-kafka-0     Bound    pvc-e5bf9fb1-e763-4c03-8294-b81a6955bde3   1500Gi     RWO            standard       77d
    data-0-alert-broker-kafka-1     Bound    pvc-c289fc0d-39a0-44b1-b073-2aab5c47ba3a   1500Gi     RWO            standard       77d
    data-0-alert-broker-kafka-2     Bound    pvc-6307f422-0448-45bd-985b-f7e559e54bb9   1500Gi     RWO            standard       77d
    data-alert-broker-zookeeper-0   Bound    pvc-bd8bb38f-a5d3-47f9-a9a1-13c66f04f80e   1000Gi     RWO            standard       77d
    data-alert-broker-zookeeper-1   Bound    pvc-01463914-9b1f-49bd-992f-de0b6b0284ca   1000Gi     RWO            standard       77d
    data-alert-broker-zookeeper-2   Bound    pvc-eb37bbaa-cc49-4541-baf4-6f2444330d6f   1000Gi     RWO            standard       77d
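You can also check the disk usage seen inside a broker pod directly with kubectl. This is a sketch; the pod name follows the output above, and the data volume mount path is Strimzi's default, which is an assumption here:

# Show filesystem usage inside the first broker pod; look for the
# /var/lib/kafka data mount.
kubectl exec --namespace alert-stream-broker alert-broker-kafka-0 -- df -h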
    

3.3   Checking consumer group status

  1. Run Kowl, following the instructions in 2.3   Running Kowl.
  2. Navigate to the consumer group view at http://localhost:8080/groups

There should be an entry for each consumer group that is connected or has connected recently.

The “Coordinator” column indicates which of the three Kafka broker nodes is used for coordinating the group’s partition ownership.

The “Members” column indicates the number of currently-active processes which are consuming data.

The “Lag” column indicates how many messages are unread by the consumer group.
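If Kowl is unavailable, the same information can be read with Kafka's own tooling from inside a broker pod. This is a hedged sketch: whether the internal listener on port 9092 requires SASL credentials depends on the listener configuration; if it does, you would also need to pass --command-config with a client properties file containing the superuser credentials.

# Describe all consumer groups, including members and per-partition lag.
kubectl exec --namespace alert-stream-broker alert-broker-kafka-0 -- \
    /opt/kafka/bin/kafka-consumer-groups.sh \
    --bootstrap-server localhost:9092 \
    --describe --all-groups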

3.4   Checking logs

In general, logs are available on the Google Cloud Log Explorer UI.

To access them:

  1. Log in to the Google Cloud console at https://console.cloud.google.com.

  2. Navigate to the Log Explorer UI, https://console.cloud.google.com/logs/query

  3. Enter a search query. For example:

    The query shown in 3.4.1   Checking Kafka logs, for instance, will bring up all logs from the Kafka brokers:

    _images/console_kafka_logs.png

There are additional “Log fields” on the left column. You can use these to filter to a single one of the three brokers via the “Pod name” field.

You can pick a different time range by clicking on “Last 1 hour” in the top right:

_images/console_log_timerange.png

See also: the GCP Log Explorer documentation: https://cloud.google.com/logging/docs/view/logs-viewer-interface

Each of the subsections lists search queries that can be used to filter logs.

3.4.1   Checking Kafka logs

Search for the following:

resource.type="k8s_container"
resource.labels.container_name="kafka"
resource.labels.namespace_name="alert-stream-broker"
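The same query can also be run from the command line with gcloud. This is a sketch; the project ID is the one shown in the error message in 2.1   Getting kubectl Access:

# Read the last hour of Kafka broker logs, newest first.
gcloud logging read \
  'resource.type="k8s_container" AND resource.labels.container_name="kafka" AND resource.labels.namespace_name="alert-stream-broker"' \
  --project=science-platform-int-dc5d --freshness=1h --limit=50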

3.4.2   Checking Strimzi logs

Search for the following:

resource.type="k8s_container"
resource.labels.namespace_name="strimzi"

3.4.3   Checking Strimzi Registry Operator logs

Search for the following:

resource.type="k8s_container"
resource.labels.namespace_name="strimzi-registry-operator"

3.4.4   Checking Schema Registry logs

Search for the following:

resource.type="k8s_container"
resource.labels.pod_name:"alert-schema-registry"
resource.labels.namespace_name="alert-stream-broker"

3.4.5   Checking Alert Database logs

Search for the following:

resource.type="k8s_container"
resource.labels.pod_name:"alert-database"
resource.labels.namespace_name="alert-stream-broker"

3.4.6   Checking Alert Stream Simulator logs

Search for the following:

resource.type="k8s_container"
resource.labels.pod_name="alert-stream-simulator"
resource.labels.namespace_name="alert-stream-broker"

4   Administration

4.1   Sharing passwords

  1. Log in to 1Password in the LSST IT account.

  2. Go to the “RSP-Vault” vault.

  3. Search for the username of the account you want to share.

  4. Click on the 3-dot menu in the top right and choose “Share…”:

    _images/1password_sharing.png

    This will open a new browser window for a sharing link.

  5. Set the duration and availability as desired, and click “Get Link to Share”:

    _images/1password_sharing_link.png

Share the link as you see fit.

Shared links can also be revoked; see 1Password Documentation for more.

4.2   Changing passwords

  1. Log in to 1Password in the LSST IT account.
  2. Go to the “RSP-Vault” vault.
  3. Search for the username of the account you want to modify.
  4. Click on the password field. Generate a new password and set it, and save your changes.
  5. Follow the instructions in Phalanx: Updating a secret stored in 1Password and VaultSecret.

Then verify that the change was successful by checking it in Argo.

  1. Log in to Argo (see also 1.1.1   Accessing Argo).
  2. Navigate to the “alert-stream-broker” application.
  3. In the “filters” on the left side, search for your targeted username in the “Name” field. You should see a filtered set of resources now.
  4. Click on the “secret” resource and check that it has an “updated” timestamp that is after you made your changes. If not, delete the “Secret” resource; it will be automatically recreated quickly. Once recreated, the user’s password will be updated automatically.

If this seems to be having trouble, consider checking the following (hedged kubectl sketches follow the list):

  • the Vault Secrets Operator logs to make sure it is updating secrets correctly
  • the Strimzi Entity Operator logs to make sure they are updating user accounts correctly
  • the Kafka broker logs to make sure it’s healthy
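These checks can be done with kubectl. The namespace and deployment names below are assumptions based on the standard Phalanx and Strimzi naming conventions, so adjust them if your environment differs:

# Vault Secrets Operator logs:
kubectl logs --namespace vault-secrets-operator deploy/vault-secrets-operator --since=1h

# Strimzi entity (user) operator logs for the alert-broker cluster:
kubectl logs --namespace alert-stream-broker deploy/alert-broker-entity-operator -c user-operator --since=1h

# Kafka broker logs:
kubectl logs --namespace alert-stream-broker alert-broker-kafka-0 -c kafka --since=1h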

4.3   Adding a new user account

First, generate new credentials for the user:

  1. Log in to 1Password in the LSST IT account.

  2. Go to the “RSP-Vault” vault.

  3. Create a new secret.

    1. Name it “alert-stream idfint <username>”.
    2. Set the “Username” field to <username>.
    3. Set the “Password” field to something autogenerated.
    4. Add a field named “generate_secrets_key”. Set its value to “alert-stream-broker <username>-password”
    5. Add a field named “environment”. Set its value to “data-int.lsst.cloud”

    If you’re running in a different environment than the IDF integration environment, replace “idfint” and “data-int.lsst.cloud” with appropriate values. (A command-line sketch for creating this item follows this list.)

  4. Sync the secret into Vault following the instructions in Phalanx documentation.
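The 1Password item from step 3 can also be created with the op command-line tool. This is a hedged sketch: <username> is a placeholder, and the exact field-assignment syntax may vary between op CLI versions, so the UI steps above remain the canonical path.

# Create the login item in the RSP-Vault vault with an autogenerated password
# and the two custom fields described above.
op item create --category=login \
    --vault="RSP-Vault" \
    --title="alert-stream idfint <username>" \
    --generate-password \
    username="<username>" \
    'generate_secrets_key=alert-stream-broker <username>-password' \
    'environment=data-int.lsst.cloud'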

Second, add the user to the configuration for the cluster:

  1. Make a change to github.com/lsst-sqre/phalanx’s services/alert-stream-broker/values-idfint.yaml file. Add the new user to the list of users under alert-stream-broker.users: https://github.com/lsst-sqre/phalanx/blob/bb417e80e0d9d1148da6edccae400eec006576e1/services/alert-stream-broker/values-idfint.yaml#L33-L73

    Make sure you use the same username, and grant it read-only access to the alerts-simulated topic by setting readonlyTopics: ["alerts-simulated"] just like the other entries.

    If more topics should be available, add them. If running in a different environment than the IDF integration environment, modify the appropriate config file, not values-idfint.yaml.

  2. Make a pull request with your changes, and make sure it passes automated checks, and get it reviewed.

  3. Merge your PR. Wait a few minutes (perhaps 10) for Argo to pick up the change.

  4. Log in to Argo CD.

  5. Navigate to the ‘alert-stream-broker’ application.

  6. Click “sync” and leave all the defaults to sync your changes, creating the new user.

Verify that the new KafkaUser was created by using the filters on the left side to search for the new username.

Verify that the user was added to Kafka by using Kowl and going to the “Access Control List” section (see 2.3   Running Kowl).

Optionally verify that access works using a method similar to that in 3.1   Testing connectivity.

4.4   Removing a user account

  1. Delete the user from the list in github.com/lsst-sqre/phalanx’s services/alert-stream-broker/values-idfint.yaml file.
  2. Make a pull request with this change, and make sure it passes automated checks, and get it reviewed.
  3. Merge your PR.
  4. Delete the user’s credentials from 1Password in the RSP-Vault vault of the LSST IT account. You can find the credentials by searching by username.
  5. Log in to Argo CD.
  6. Navigate to the ‘alert-stream-broker’ application.
  7. Click “sync”. Click the “prune” checkbox to prune out the defunct user. Apply the sync.

Verify that the user was removed from Kafka by using Kowl and going to the “Access Control List” section (see 2.3   Running Kowl). The user shouldn’t be in the ACLs anymore.

4.5   Granting users read-only access to a new topic

  1. Make a change to github.com/lsst-sqre/phalanx’s services/alert-stream-broker/values-idfint.yaml file. In the list of users under alert-stream-broker.users, add the new topic to the readonlyTopics list for each user that should have access.
  2. Make a pull request with your changes, and make sure it passes automated checks, and get it reviewed.
  3. Merge your PR. Wait a few minutes (perhaps 10) for Argo to pick up the change.
  4. Log in to Argo CD.
  5. Navigate to the ‘alert-stream-broker’ application.
  6. Click “sync” and leave all the defaults to sync your changes, modifying access.

Verify that the change worked by using Kowl and going to the “Access Control List” section (see 2.3   Running Kowl). There should be matching permissions with Resource=TOPIC, Permission=ALLOW, and Principal being the users who were granted access.

4.6   Adding a new Kafka topic

  1. Add a new KafkaTopic resource to the templates directory in one of the charts that compose the alert-stream-broker service. This will be in the alert-stream-broker/charts repository. For example, there is a KafkaTopic resource in the alert-stream-simulator/templates/kafka-topics.yaml file.

    These files use the Helm templating language. See The Chart Template Developer’s Guide for more information on this language.

    Strimzi’s documentation (“5.2.1: Kafka topic resource”) may be helpful in configuring the topic. The schema for KafkaTopic resources has a complete reference at 11.2.90: KafkaTopic schema reference.

    Pick the chart that is most relevant to the topic you are adding. If it is not relevant to any particular chart, use the general charts/alert-stream-broker chart.

  2. Increment the version of the chart by updating the version field of its Chart.yaml file - for example, the version line of the alert-stream-simulator chart’s Chart.yaml.

  3. Make a pull request with your changes to alert-stream-broker/charts, and make sure it passes automated checks, and get it reviewed. Merge your PR.

  4. Next, you’ll update the services/alert-stream-broker/Chart.yaml file to reference the new version number of the chart you have updated. For example, the alert-stream-simulator dependency’s version would need to be updated if you were adding a topic to the alert-stream-simulator.

  5. Make a pull request with your changes to github.com/lsst-sqre/phalanx, and make sure it passes automated checks, and get it reviewed. Merge your PR.

  6. Wait a few minutes (perhaps 10) for Argo to pick up the change to Phalanx.

  7. Log in to Argo CD.

  8. Navigate to the ‘alert-stream-broker’ application.

  9. Click ‘sync’ and leave all the defaults to sync your changes, creating the new topic.

Verify that the change worked by using Kowl and going to the “Topics” section (see 2.3   Running Kowl). There should be a new topic created.
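If you have kubectl access, you can also confirm that the KafkaTopic resource exists (a quick sketch):

# List the Strimzi-managed topics in the broker namespace.
kubectl get kafkatopics --namespace alert-stream-broker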

To let users read from the topic, see 4.5   Granting users read-only access to a new topic.

4.7   Granting Alert DB access

Alert DB access is governed by membership in GitHub organizations and teams.

The list of permitted GitHub groups for the IDF integration environment is in the services/gafaelfawr/values-idfint.yaml file in github.com/lsst-sqre/phalanx.

As of this writing, that list is composed of ‘lsst-sqre-square’ and ‘lsst-sqre-friends’, so any users who wish to have access need to be added to the “square” or “friends” teams in the lsst-sqre GitHub organization.

Invite a user to join one of those groups to grant access.

To change the set of permitted groups, modify the services/gafaelfawr/values-idfint.yaml file to change the list under the read:alertdb scope. Then, sync the change to Gafaelfawr via Argo CD.

5   Making Changes

5.1   Deploying a change with Argo

In general, to make any change with ArgoCD, you update Helm charts, update Phalanx, and then “sync” the alert-stream-broker application:

  1. Make desired changes to Helm charts, if required, in alert-stream-broker/charts. Note that any changes to Helm charts always require the version to be updated.
  2. Merge your Helm chart changes.
  3. Update the services/alert-stream-broker/Chart.yaml file to reference the new version number of the chart you have updated, if you made any Helm chart changes.
  4. Update the services/alert-stream-broker/values-idfint.yaml file to pass in any new template parameters, or make modifications to existing ones.
  5. Merge your Phalanx changes.
  6. Wait a few minutes (perhaps 10) for Argo to pick up the change to Phalanx.
  7. Log in to Argo CD at https://data-int.lsst.cloud/argo-cd.
  8. Navigate to the ‘alert-stream-broker’ application.
  9. Click ‘sync’ to synchronize your changes.

5.2   Updating the Kafka version

The Kafka version is set in the alert-stream-broker/templates/kafka.yaml file in services/alert-stream-broker. It is parameterized through the kafka.version value in the alert-stream-broker chart, which defaults to “2.8”.

When upgrading the Kafka version, you also may need to update kafka.logMessageFormatVersion and kafka.interBrokerProtocolVersion. These change slowly, but old values can be incompatible with new Kafka versions. See Strimzi documentation on Kafka Versions to be sure.

So, to update the version of Kafka used, update the services/alert-stream-broker/values-idfint.yaml file in github.com/lsst-sqre/phalanx. Under alert-stream-broker, then under kafka, add a value: version: <whatever you want>. If necessary, also set logMessageFormatVersion and interBrokerProtocolVersion here.

Then, follow the steps in 5.1   Deploying a change with Argo to apply these changes.
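After the sync completes, you can confirm the version recorded on the Kafka custom resource. This is a sketch; the cluster name alert-broker is taken from the kubectl output in 2.1   Getting kubectl Access:

# Print the Kafka version currently specified on the cluster resource.
kubectl get kafka alert-broker --namespace alert-stream-broker \
    -o jsonpath='{.spec.kafka.version}'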

See also: the Strimzi Documentation’s “9.5: Upgrading Kafka”.

5.3   Updating the Strimzi version

First, you probably want to read the Strimzi Documentation’s “9. Upgrading Strimzi”.

The Strimzi version is governed by the version referenced in github.com/lsst-sqre/phalanx’s services/strimzi/Chart.yaml file. Update that version, and do anything else recommended by Strimzi in their documentation, such as changes to resources.

Then, apply the change in a way similar to that described in 5.1   Deploying a change with Argo. Note though that you’ll be synchronizing the ‘strimzi’ application in Argo, not the ‘alert-stream-broker’ application in Argo.

5.4   Resizing Kafka broker disk storage

Change the alert-stream-broker.kafka.storage.size value in services/alert-stream-broker/values-idfint.yaml in github.com/lsst-sqre/phalanx. This is the amount of disk space per broker instance.

Apply the change, as described in 5.1   Deploying a change with Argo.

This may take a little while to apply, since it is handled through the asynchronous Kafka operator, which reconciles storage size every few minutes. When it starts reconciling, it rolls the change out gradually across the Kafka cluster to maintain availability.
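You can watch the reconciliation progress by polling the persistent volume claims (a sketch; see 2.1   Getting kubectl Access):

# The CAPACITY column should grow to the new size, one broker at a time.
kubectl get pvc --namespace alert-stream-broker --watch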

Note that storage sizes can only be increased, never decreased.

5.5   Updating the alert schema

For background, you might want to read DMTN-210’s section 3.4.4: Schema Synchronization Job.

The high-level steps are to:

  • Commit your changes in the lsst/alert_packet repository, obeying its particular versioning system
  • Build a new lsstdm/lsst_alert_packet container
  • Publish a new lsst-alert-packet Python package
  • Load the schema into the schema registry, incrementing the Schema ID
  • Update the alert-stream-simulator to use the new Python package and new schema ID

5.5.1   Making a new alert schema

First, make a new subdirectory in github.com/lsst/alert_packet’s python/lsst/alert/packet/schema directory. For example, the current latest version as of this writing is 4.0, so there’s a python/lsst/alert/packet/schema/4/0 directory which holds Avro schemas. You could put a new schema in python/lsst/alert/packet/schema/4/1.

Start by copying the current schema into the new directory, and then make your changes. Then, update python/lsst/alert/packet/schema/latest.txt to reference the new schema version number.
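A minimal sketch of these steps, assuming (hypothetically) that the new schema version is 4.1:

# Run from a checkout of github.com/lsst/alert_packet.
cd python/lsst/alert/packet/schema
cp -r 4/0 4/1
# ... edit the Avro schema files under 4/1 ...
echo "4.1" > latest.txt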

5.5.2   Creating a container which loads the schema

When you are satisfied with your changes, push them and open a PR. As long as your GitHub branch starts with “tickets/” or is tagged, this will automatically kick off the “build_sync_container” GitHub Actions job, which will create a Docker container holding the alert schema. The container will be named lsstdm/lsst_alert_packet:<tag-or-branch-name>; slashes are replaced with dashes in the tag-or-branch-name spot.

For example, if you’re working on a branch named tickets/DM-34567, then the container will be created and pushed to lsstdm/lsst_alert_packet:tickets-DM-34567.

You can use this ticket-number-based container tag while doing development, but once you’re sure of things, merge the PR and then tag a release. The release tag can be the version of the alert schema (for example “4.1”) if you like - it doesn’t really matter what value you pick; there are so many version numbers flying around with alert schemas that it’s going to be hard to find any scheme which is ideal.

To confirm that your container is working, you can run the container locally. For example, for the “w.2022.04” tag:

-> % docker run --rm lsstdm/lsst_alert_packet:w.2022.04 'syncLatestSchemaToRegistry.py --help'
usage: syncLatestSchemaToRegistry.py [-h]
                                     [--schema-registry-url SCHEMA_REGISTRY_URL]
                                     [--subject SUBJECT]

optional arguments:
  -h, --help            show this help message and exit
  --schema-registry-url SCHEMA_REGISTRY_URL
                        URL of a Schema Registry service
  --subject SUBJECT     Schema Registry subject name to use

5.5.3   Loading the new schema into the schema registry

To load the new schema into the schema registry, update the alert-stream-schema-registry.schemaSync.image.tag value to the tag that you used for the container.

The defaults are set in the alert-stream-schema-registry’s values.yaml file. You can update the defaults, or you can update the parameters used in Phalanx for a particular environment under the alert-stream-schema-registry field.

Apply these changes as described in 5.1   Deploying a change with Argo. The result should be that a new schema is added to the schema registry.

Once the change is deployed, the job that loads the schema will start. You can monitor it in the Argo UI by looking for the Job named ‘sync-schema-job’.

You can confirm it worked by using Kowl (see 2.3   Running Kowl) and looking at the schema registry’s contents in its UI.

5.5.4   Publishing a new lsst-alert-packet Python package

The alert stream simulator gets its version of the alert packet schema from the lsst-alert-packet Python package. The version of this package that it uses is set in setup.py of github.com/lsst-dm/alert-stream-simulator.

You’ll need to publish a new version of the lsst-alert-packet Python package in order to get a new version in alert-stream-simulator.

Start by updating the version in setup.cfg of github.com/lsst/alert_packet. Merge your change which includes the new version in setup.cfg.

The new version of the package needs to be published to PyPI, the Python Package Index: https://pypi.org/project/lsst-alert-packet/. It is managed by a user named ‘lsst-alert-packet-admin’, which has credentials stored in 1Password in the RSP-Vault vault. Use 1Password to get the credentials for that user.

Once you have credentials and have incremented the version, you’re ready to publish to PyPI. Explaining how to do that is out of scope of this guide, but Twine is a good tool for the job.
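A sketch of a typical build-and-upload with Twine, run from the root of the alert_packet checkout (the exact release workflow is up to you):

# Build the source distribution and wheel, then upload to PyPI.
python -m pip install --upgrade build twine
python -m build
# Twine will prompt for credentials; use the lsst-alert-packet-admin account.
python -m twine upload dist/*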

5.5.5   Updating the Alert Stream Simulator package

First, the alert stream simulator needs to use the new version of the lsst-alert-packet package which you published to PyPI. Second, the chart which runs the simulator needs to be updated to use the right ID of the schema in the schema registry.

The version of lsst-alert-packet is set in the setup.py file of github.com/lsst-dm/alert-stream-simulator. Update this to include the newly-published Python package.

Once you have made and merged a PR to this, tag a new release of the alert stream simulator using git tag. When your tag has been pushed to the alert stream simulator GitHub repository, an automated build will create a container (in a manner almost exactly the same as you saw for lsst/alert_packet).
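For example (v1.2.2 is a hypothetical next version number):

# Tag the release and push the tag to trigger the container build.
git tag v1.2.2
git push origin v1.2.2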

You can use docker run to verify that this worked. For example, for version v1.2.1:

-> % docker run --rm lsstdm/alert-stream-simulator:v1.2.1 'rubin-alert-sim -h'
usage: rubin-alert-sim [-h] [-v] [-d]
                       {create-stream,play-stream,print-stream} ...

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         enable info-level logging (default: False)
  -d, --debug           enable debug-level logging (default: False)

subcommands:
  {create-stream,play-stream,print-stream}
    create-stream       create a stream dataset to be run through the
                        simulation.
    play-stream         play back a stream that has already been created
    print-stream        print the size of messages in the stream in real time

5.5.6   Getting the schema registry’s ID

Next, you’ll need to get the ID that is used by the schema registry so that you can use it in the alert stream simulator deployment. This is easiest to retrieve using Kowl.

Run Kowl (see 2.3   Running Kowl) and then navigate to http://localhost:8080/schema-registry/alert-packet. There should be a drop-down with different versions. You probably want the latest version, which might already be the one being displayed. Select the desired version.

At the top of the screen, you should see the “Schema ID” of the schema you have selected. This integer is an ID we’ll need to reference later.
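Alternatively, the schema registry’s REST API can report the ID directly. This is a sketch; the subject name alert-packet matches the Kowl URL above, and the registry is assumed to be readable without credentials:

# Ask the registry for the latest version of the alert-packet subject and
# print its schema ID. Requires curl and jq.
curl -s https://alert-schemas-int.lsst.cloud/subjects/alert-packet/versions/latest | jq .id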

5.5.7   Updating the Alert Stream Simulator values

You’re almost done. We need to update the alert stream simulator deployment to use the new container version, and to use the new schema ID.

The container version is set in values-idfint.yaml’s alert-stream-simulator.image.tag field. Update this to match the tag you used in github.com/lsst-dm/alert-stream-simulator.

The schema ID is set in values-idfint.yaml as well, under alert-stream-simulator.schemaID. This is set to 1 by default.

Those changes to values-idfint.yaml are half the story. You probably also should update the defaults, which is done by editing the values.yaml file in the alert-stream-simulator chart. This values.yaml also sets dynamic topic-level configuration, such as retention.ms or retention.bytes, which overrides any corresponding broker-level settings.

Once you have made those changes, apply them following the instructions in 5.1   Deploying a change with Argo.

The new simulator may take a few minutes to come online, as the data needs to be reloaded. Once the sync has completed, you can verify that the change worked.

Verify that it worked using Kowl (see 2.3   Running Kowl) by looking at the Messages UI (keep in mind that it can take up to 37 seconds for messages to appear!). The messages should be encoded using your new schema.

Warning

You probably want to change the sample alert data (see 5.6   Changing the sample alert data) used by the alert stream simulator.

If you don’t do this, then the alert packets will be decoded using the version used when sample alerts were generated, then re-encoded using the new alert schema.

You can manage this transition using Avro’s aliases, but it might be simpler to simultaneously switch to a new version of the sample alert data.

5.6   Changing the sample alert data

The sample alert data used by the alert stream simulator is set in a Makefile:

.PHONY: datasets
datasets: data/rubin_single_ccd_sample.avro data/rubin_single_visit_sample.avro

data:
        mkdir -p data

data/rubin_single_ccd_sample.avro: data
        wget --no-verbose --output-document data/rubin_single_ccd_sample.avro https://lsst.ncsa.illinois.edu/~ebellm/sample_precursor_alerts/latest_single_ccd_sample.avro

data/rubin_single_visit_sample.avro: data
        wget --no-verbose --output-document data/rubin_single_visit_sample.avro https://lsst.ncsa.illinois.edu/~ebellm/sample_precursor_alerts/latest_single_visit_sample.avro

The last two rules show what’s happening. The sample alerts are downloaded from https://lsst.ncsa.illinois.edu/~ebellm/sample_precursor_alerts/latest_single_visit_sample.avro.

The sample alerts could be retrieved from anywhere else. The important things are that they should be encoded in Avro Object Container File format (that is, with all alerts in one file, preceded by a single instance of the Avro schema), and that they should represent a single visit of alert packet data.

Make changes to the makefile to get data from somewhere else, and then merge your changes. Make a git tag using the format vX.Y.Z, for example v1.3.10, and push that git tag up. This will trigger a build job for the container using the new tag.

Next, copy that tag into charts/alert-stream-simulator/values.yaml, and follow the instructions from 5.1   Deploying a change with Argo. This will configure the alert stream simulator to use the new alert data, publishing it every 37 seconds.

5.7   Deploying on a new Kubernetes cluster on Google Kubernetes Engine

Deploying on a new Kubernetes cluster will take a lot of steps, and has not been done before, so this section is somewhat speculative.

5.7.1   Prerequisites

There are certain prerequisites before even starting. These are systems that are dependencies of the alert distribution system’s current implementation, so they must be present already.

They are:

  • Argo CD should be installed and configured to make deployment possible using configuration from Phalanx and Helm. This means there should be some “environment” analogous to “idfint” which is used in the IDF integration deployment.
  • Gafaelfawr should be installed to set up the ingress for the alert database.
  • cert-manager should be installed so that broker TLS certificates can be automatically provisioned.
  • The nginx ingress controller should be installed to set up the ingress for the schema registry.
  • Workload Identity needs to be configured properly (for example, through Terraform) on the Google Kubernetes Engine instance to allow the alert database to gain permissions to interact with Google Cloud Storage buckets.

5.7.2   Preparation with Terraform

Before starting, some resources should be provisioned, presumably using Terraform:

  • A node pool for Kafka instances to run on.
  • Storage buckets for alert packets and schemas.
  • IAM roles providing access to the storage buckets for the alert database ingester and server (as writer and reader, respectively).

The current node pool configuration in the IDFINT environment can be found in the environment/deployments/science-platform/env/integration-gke.tfvars file:

  {
    name = "kafka-pool"
    machine_type = "n2-standard-32"
    node_locations     = "us-central1-b"
    local_ssd_count    = 0
    auto_repair        = true
    auto_upgrade       = true
    preemptible        = false
    image_type         = "cos_containerd"
    enable_secure_boot = true
    disk_size_gb       = "500"
    disk_type          = "pd-standard"
    autoscaling        = true
    initial_node_count = 1
    min_count          = 1
    max_count          = 10
  }
]

node_pools_labels = {
  core-pool = {
    infrastructure = "ok",
    jupyterlab = "ok"
  },
  dask-pool = {
    dask = "ok"
  },
  kafka-pool = {
    kafka = "ok"
  }
}

node_pools_taints = {
  core-pool = [],
  dask-pool = []
  kafka-pool = [
    {
      effect = "NO_SCHEDULE"
      key = "kafka",
      value = "ok"
    }
  ]
}

Storage bucket configuration is in environment/deployments/science-platform/env/integration-alertdb.tfvars:

# Project
environment = "int"
project_id  = "science-platform-int-dc5d"

# In integration, only keep 4 weeks of simulated alert data.
purge_old_alerts  = true
maximum_alert_age = 28

writer_k8s_namespace           = "alert-stream-broker"
writer_k8s_serviceaccount_name = "alert-database-writer"
reader_k8s_namespace           = "alert-stream-broker"
reader_k8s_serviceaccount_name = "alert-database-reader"

# Increase this number to force Terraform to update the int environment.
# Serial: 2

This references the environment/deployments/science-platform/alertdb module.

Note that buckets and roles are already created in the RSP’s Dev and Prod projects.

It may be helpful to look at the PRs that originally configured the Int environment.

5.7.3   Provision the DNS for the schema registry

DNS is provisioned by the SQuARE team, so you’ll have to make requests to them for this part.

The target environment is running Gafaelfawr, so it has some base IP address used for the main ingress. The schema registry can run on the same IP address, even though it uses a different hostname.

So, request a DNS A record which points to the base IP of the targeted environment’s main ingress.

For example, ‘data-int.lsst.cloud’, which is the base URL for the INT IDF environment, is an A record for ‘35.238.192.49’. The schema registry therefore gets a DNS A record ‘alert-schemas-int.lsst.cloud’ which similarly points to 35.238.192.49.

5.7.4   Configuring a new Phalanx deployment

You’ll need to configure a new Phalanx deployment.

To do this, create a values-<environment>.yaml file in the services/alert-stream-broker directory of github.com/lsst-sqre/phalanx which matches the environment.

You must explicitly set a hostname for the schema registry (in alert-stream-schema-registry.hostname and alert-database.ingester.schemaRegistryURL). Use the one you provisioned in the previous step.

You will also need to explicitly pass in the alert database GCP project and bucket names. Be careful to set the fields of the alert database to the right values that match what you created in Terraform.

Finally, make sure to not set the alert-stream-broker.kafka.externalListener field yet. This field uses IPs and hostnames which we don’t yet know.

You will similarly need to configure the values-<environment>.yaml file for Strimzi (in services/strimzi) and for the Strimzi Registry Operator (in services/strimzi_registry_operator).

You will also need to enable the alert_stream_broker, strimzi, and strimzi_registry_operator applications in the science-platform/values-<environment>.yaml file. For example, see the science-platform/values-idfint.yaml file, which has enabled: true for those three applications. You need to do that for your target environment as well.

5.7.5   Enabling the new services in Argo

Argo needs to be synced - that is, the Argo application itself - in order to detect the newly-enabled alert_stream_broker, strimzi, and strimzi_registry_operator applications. Do that first - log in to Argo in the target environment, and sync the Argo application.

Next, sync Strimzi. It should succeed without errors.

Next, sync the Strimzi Registry Operator. It should also succeed without errors.

Next, sync the alert stream broker application. Errors are expected at this stage. Our goal is just to do the initial setup so some of the resources come up, but not everything will work immediately.

5.7.6   Provisioning DNS records

Once the alert-stream-broker is synced into a half-broken, half-working state, we can start to get the IP addresses used by its services. This will let us provision more DNS records: those for the Kafka brokers.

In the current gcloud setup, this must be done through SQuARE. If you cannot use the existing static IPs, you must request that three static IPs be assigned for the Kafka brokers, and that the DNS records be updated to point to them.

You will then need to update values-idfint.yaml with the new hostnames and IP addresses (see 5.7.7   Previous DNS provisioning workflow for the relevant fields).

The Kafka brokers MUST point to static IPs; otherwise, restarting Kafka will cause the assigned IPs to change. If they do not point to static IPs, there will be problems with the SSL certificates and users will not be able to connect. See the following link for an explanation of why:

https://strimzi.io/blog/2021/05/07/deploying-kafka-with-lets-encrypt-certificates/

Previously, this setup was done through kubectl. However, it is now handled through SQuARE. The kubectl instructions have been kept in case there is a need to use them in the future.

5.7.7   Previous DNS provisioning workflow

To provision the Kafka broker IPs, we will use kubectl to look up the IP addresses provisioned for the broker (see 2.1   Getting kubectl Access).

Run kubectl get service --namespace alert-stream-broker to get a list of all the services running:

-> % kubectl get service  -n alert-stream-broker
NAME                                    TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)                               AGE
alert-broker-kafka-0                    LoadBalancer   10.130.20.152   35.239.64.164    9094:31402/TCP                        78d
alert-broker-kafka-1                    LoadBalancer   10.130.23.65    34.122.165.155   9094:31828/TCP                        78d
alert-broker-kafka-2                    LoadBalancer   10.130.21.82    35.238.120.127   9094:31070/TCP                        78d
alert-broker-kafka-bootstrap            ClusterIP      10.130.20.156   <none>           9091/TCP,9092/TCP,9093/TCP            78d
alert-broker-kafka-brokers              ClusterIP      None            <none>           9090/TCP,9091/TCP,9092/TCP,9093/TCP   78d
alert-broker-kafka-external-bootstrap   LoadBalancer   10.130.16.127   35.188.169.31    9094:30118/TCP                        78d
alert-broker-zookeeper-client           ClusterIP      10.130.25.236   <none>           2181/TCP                              78d
alert-broker-zookeeper-nodes            ClusterIP      None            <none>           2181/TCP,2888/TCP,3888/TCP            78d
alert-schema-registry                   ClusterIP      10.130.27.137   <none>           8081/TCP                              76d
alert-stream-broker-alert-database      ClusterIP      10.130.27.41    <none>           3000/TCP                              21d

The important column here is “EXTERNAL-IP.” Use it to discover the IP addresses for each of the individual broker hosts, and for the “external-bootstrap” service. Request DNS A records that map useful hostnames to these IP addresses - this is done by the SQuARE team, so you’ll need help.

Once you have DNS provisioned, make another change to values-<environment>.yaml to lock in the IP addresses and inform Kafka of the hostnames to use; these go in the alert-stream-broker.kafka.externalListener section (values-idfint.yaml provides an example).

Apply this change as usual (see 5.1   Deploying a change with Argo). Now the broker should be accessible.

5.7.8   Adding users

Make new user credential sets in 1Password for the new targeted environment. See 4.3   Adding a new user account for how to do this.

In addition, make a user named ‘kafka-admin’ in 1Password in the same way.

Make sure to use the right value for the environment field of the 1Password items.

Then, set alert-stream-broker.vaultSecretsPath in values-<environment>.yaml to secret/k8s_operator/<environment>/alert-stream-broker. This will configure the Vault Secrets Operator to correctly feed secrets through.

5.7.9   Lingering issues

You may need to re-sync several times to trigger the data-loading job of the alert stream simulator. When the system is in its half-broken state, this job will fail, and it can back off exponentially, which can take a very long time to recover from. It can also hit a maximum retry limit and stop attempting to load data.

Using Argo to “sync” will kick it off again, which may fix the problem.

5.7.10   Testing connectivity

You should now have a working cluster. You should be able to run Kowl with the new superuser identity and it ought to be able to connect.

5.8   Deploying on a new Kubernetes cluster off of Google

Deploying to a new Kubernetes cluster off of Google will require all the same steps as described in the previous section, but with a few additional wrinkles.

First, the alert-stream-broker chart uses the “load balancer” service type to provide external internet access to the Kafka nodes. Load balancer services are very platform-specific; on Google it corresponds to creation of TCP Load Balancers. On a non-Google platform, it might work very differently.

One option would be to use the targeted platform’s load balancers. Another option is to use Node Ports or Ingresses instead. The 5-part Strimzi blog post series “Accessing Kafka” goes into detail about these options.

Second, the alert database uses Google Cloud Storage buckets to store raw alert and schema data. This would need to be replaced with something appropriate for the targeted environment. The requirements are made clear in the storage.py files of the github.com/lsst-dm/alert_database_ingester and github.com/lsst-dm/alert_database_server repositories. An implementation would need to fulfill the abstract interfaces provided in those files.

There may be more requirements, but these will certainly need investigation if you’re planning to move to a different Kubernetes provider.

5.9   Changing the schema registry hostname

The Schema Registry’s hostname is controlled by the ‘hostname’ value passed in to charts/alert-stream-schema-registry. Updating that will update the hostname expected by the service.

In addition, a new DNS record will need to be created by whoever is provisioning DNS for the target environment. For the INT IDF environment, that’s SQuARE. It should route the new hostname to the ingress IP address.

Finally, the new schema registry needs to be passed in to the alert database in its ingester.schemaRegistryURL value.

See also: 5.7.3   Provision the DNS for the schema registry.

5.10   Changing the Kafka broker hostnames

Kafka broker hostnames can be changed by modifying the values passed in to charts/alert-stream-broker. Once changed, the broker will not work until DNS records are also updated.

See also: 5.7.6   Provisioning DNS records.

5.11   Changing the alert database URL

The alert database’s URL is based off of that of the cluster’s main Gafaelfawr ingress, so it cannot be changed entirely. However, it uses a path prefix, which can be changed. This path prefix is controlled by a value passed in to the alert database chart.

5.12   Changing the Kafka hardware

To change the hardware used by Kafka, change the nodes used in the node pool. This is set in the terraform configuration in environment/deployments/science-platform/env/integration-gke.tfvars:

{
  name = "kafka-pool"
  machine_type = "n2-standard-32"
  node_locations     = "us-central1-b"
  local_ssd_count    = 0
  auto_repair        = true
  auto_upgrade       = true
  preemptible        = false
  image_type         = "cos_containerd"
  enable_secure_boot = true
  disk_size_gb       = "500"
  disk_type          = "pd-standard"
  autoscaling        = true
  initial_node_count = 1
  min_count          = 1
  max_count          = 10
}

Change this, and apply the terraform change.

This may cause some downtime as the kafka nodes are terminated and replaced with new ones, evicting the Kafka brokers, but this isn’t known for certain.

References

[1] Argo CD: Declarative GitOps CD for Kubernetes. URL: https://argo-cd.readthedocs.io/en/stable/.
[2] Kowl: A Web UI for Apache Kafka. URL: https://github.com/cloudhut/kowl.
[3] [DMTN-210] Spencer Nelson. Implementation of the LSST Alert Distribution System. 2021. URL: https://dmtn-210.lsst.io/.