Reliability Performance Roadmap
Features / Functionalities 🚀⏲📊
Category | Tags / Labels | Feature / Functionality | Status | Doc |
---|---|---|---|---|
Monitoring Metrics & Alerting | leverage monitoring-metrics-alerting prometheus grafana | Metrics: install and configure Prometheus (Node Exporter for EC2 / Blackbox Exporter / Alert Monitoring), install and configure Grafana (K8s plugin + Prometheus integration + CloudWatch integration) | ✅ | ❌ |
Monitoring Metrics & Alerting | leverage monitoring-metrics-alerting grafana cloudwatch | Metrics: Grafana + AWS CloudWatch integration config (https://github.com/monitoringartist/grafana-aws-cloudwatch-dashboards) | 2021 Q2 | ❌ |
Monitoring Metrics & Alerting | leverage monitoring-metrics-alerting apm | APM: review, analyze and implement (New Relic, Datadog, Elastic APM Agent/Server) | 2021 Q2 | ❌ |
Monitoring Metrics & Alerting | leverage monitoring-metrics-alerting documentation | Define and document reference notification/escalation procedure | ✅ | ❌ |
Monitoring Metrics & Alerting | leverage monitoring-metrics-alerting | Alerting: configure Alertmanager, ElastAlert (with optimized log rotation when running it from the Docker image), PagerDuty and Slack according to the procedure above | 2021 Q2 | ❌ |
Monitoring Metrics & Alerting | leverage monitoring-metrics-alerting prometheus | Monitor infra tool instances (WebHook Proxy, Jenkins, Vault, Pritunl, Prometheus, Grafana, etc.) / implement monitoring via Prometheus + Grafana or another solution | ✅ | ❌ |
Monitoring Distributed Tracing | leverage monitoring-tracing jaeger | Distributed Tracing Instrumentation: review, analyze and implement to detect and improve transaction performance and service dependency analysis (Jaeger, Instana, Lightstep, AWS X-Ray, etc.) | 2021 Q3 | ❌ |
Monitoring Logging | leverage monitoring-logs efk | Logging / EFK: use separate indices per K8s component and app/service for each cluster/env (segregating dev/stg from prd) + enable ES monitoring w/ X-Pack + configure Curator to rotate indices + tool to improve index management | 2021 Q2 | ❌ |
Performance & Optimization | leverage performance-optimization ci-cd-pipeline | Load Testing: set up and run continuous load test pipelines (Jenkins) to determine and improve app/service capacity over time (Apache ab, Gatling, iPerf, Locust, Taurus, BlazeMeter and https://github.com/loadimpact/k6) | 2021 Q3 | ❌ |
Performance & Optimization | leverage performance-optimization ci-cd-pipeline | Performance Testing (stress, soak, spike, etc.): set up and run continuous performance test pipelines (Jenkins) to measure performance over time (Apache ab, Gatling, iPerf, Locust, Taurus and BlazeMeter) | 2021 Q3 | ❌ |
Performance & Optimization | leverage performance-optimization kubernetes | Tune K8s nodes (EC2 family type, size and AWS ASG -> K8s HPA + Cluster Autoscaler) | 2021 Q3 | ❌ |
Performance & Optimization | leverage performance-optimization kubernetes | Tune K8s requests and limits per namespace (CPU and RAM) / https://github.com/FairwindsOps/goldilocks | 2021 Q2 | ❌ |
Performance & Optimization | leverage performance-optimization s3 | S3: ensure each bucket is using the proper storage types and persistence (automate moving objects into a lower-cost storage tier w/ Lifecycle Policies or w/ S3 Intelligent-Tiering) | ✅ | ❌ |
Disaster Recovery | leverage disaster-recovery backup | AWS Backup Service: RDS, EC2 (AMI), EBS, DynamoDB, EFS, FSx, Storage Gateway | ✅ | ❌ |
Disaster Recovery | leverage disaster-recovery backup | Replication: S3 (CRR cross-region replication or SRR same-region replication) | ✅ | ❌ |
Disaster Recovery | leverage disaster-recovery backup | Replication: VPC / Compute / Database (CRR cross-region replication) | ✅ | ❌ |
Disaster Recovery | leverage disaster-recovery backup kubernetes | Backup and migrate Kubernetes applications and their persistent volumes w/ https://velero.io/ | 2021 Q3 | ❌ |
Disaster Recovery | leverage documentation disaster-recovery | Review: Disaster recovery plan, missing resources, RTO / RPO, level of automation | 2021 Q4 | ❌ |
Disaster Recovery | leverage documentation disaster-recovery | Improve Plan: create a plan to improve the existing recovery plan and determine implementation phases | 2021 Q4 | ❌ |
Disaster Recovery | leverage documentation disaster-recovery | Execute Plan: implement according to the plan, review/measure and iterate | 2021 Q4 | ❌ |
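
As an illustration of the S3 storage-tier item above, a lifecycle policy can be expressed as a plain rules document. The sketch below is a minimal example under stated assumptions, not the configuration actually in use: the function name `build_lifecycle_config`, the rule ID, the prefix and the day thresholds are all hypothetical. It only builds the dictionary and makes no AWS calls.

```python
def build_lifecycle_config(prefix="", ia_after_days=30, glacier_after_days=90):
    """Build an S3 lifecycle configuration that tiers objects down over time.

    All defaults are illustrative assumptions, not values from the roadmap.
    """
    # S3 requires objects to be at least 30 days old before moving to STANDARD_IA.
    if ia_after_days < 30:
        raise ValueError("ia_after_days must be >= 30 for STANDARD_IA")
    # The colder tier must come strictly after the warmer one.
    if glacier_after_days <= ia_after_days:
        raise ValueError("glacier_after_days must be greater than ia_after_days")

    return {
        "Rules": [
            {
                "ID": "tier-down-objects",      # hypothetical rule name
                "Filter": {"Prefix": prefix},   # empty prefix matches every object
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


if __name__ == "__main__":
    config = build_lifecycle_config(prefix="logs/")
    for transition in config["Rules"][0]["Transitions"]:
        print(transition["Days"], transition["StorageClass"])
```

The resulting dictionary has the shape expected by the `LifecycleConfiguration` argument of boto3's S3 client method `put_bucket_lifecycle_configuration`, so it could be applied per bucket from an automation job. S3 Intelligent-Tiering is the alternative mentioned in the roadmap when access patterns are unknown.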