V Production

Production-readiness requirements depend on what your application needs. A small internal service that is not mission-critical does not need the same level of operational readiness as a customer-facing application that pays the bills. The following general guidelines should be considered, with a foolproof plan in place for each, before going into production.


General

  1. Ownership: Service owners are identified. Contact information and methods are provided.

  2. Onboarding: Integration instructions for APIs are documented.

  3. Defined service-level indicators (SLIs) / service-level objectives (SLOs) / service-level agreements (SLAs): The SLIs and SLOs are documented and accessible. If applicable, you’ve also documented the SLAs.
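
As a rough illustration of how an SLI relates to an SLO, the sketch below computes an availability SLI and the remaining error budget for one measurement window. The function name `error_budget`, its parameters, and the 99.9% target used in the example are assumptions chosen for illustration, not values prescribed by this checklist.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize an availability SLI against its SLO for one measurement window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    # SLI: the fraction of requests that succeeded in the window.
    sli = (total_requests - failed_requests) / total_requests

    # Error budget: how many failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests

    # Fraction of the budget still unspent; negative means the budget is blown.
    budget_remaining = 1.0 - failed_requests / allowed_failures

    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "allowed_failures": allowed_failures,
        "budget_remaining": budget_remaining,
    }
```

For example, `error_budget(0.999, 1_000_000, 400)` reports an SLI of 0.9996, which meets the target with roughly 60% of the error budget left.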

Disaster Recovery

  1. Disaster recovery (DR): DR plans have been documented and tested.

  2. Backups: Backups of data occur regularly.

  3. Redundancy: Each service runs at least two instances and, where required, is deployed in multiple regions or locations.

Deployment

  1. Deployment strategy: The automated deployment strategy has been documented. Examples include blue-green and canary deployments, which make zero-downtime releases safer.

  2. Continuous integration: When engineers commit their changes, the system kicks off automated builds, tests, and deployment to a lower-level environment.

  3. Continuous delivery: Deploying to production involves nothing more than approval and a click of a button. Changelogs and release notes indicate what changes exist in each environment.

  4. Static code analysis: Code is automatically scanned, formatted, or linted according to coding standards.
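
A canary rollout needs some rule for deciding whether to promote or roll back. The sketch below shows one possible rule that compares the canary's error rate with the baseline's. The function name `canary_verdict`, the 1.5x ratio, the 0.1% floor, and the 500-request minimum are all illustrative assumptions; suitable thresholds depend on the service.

```python
def canary_verdict(
    baseline_errors: int,
    baseline_total: int,
    canary_errors: int,
    canary_total: int,
    max_ratio: float = 1.5,    # assumed: canary may be up to 1.5x worse than baseline
    floor: float = 0.001,      # assumed: always tolerate up to 0.1% errors
    min_requests: int = 500,   # assumed: minimum canary traffic before judging
) -> str:
    """Return "wait", "promote", or "rollback" for a canary deployment."""
    # Too little canary traffic gives a noisy error rate, so defer the decision.
    if canary_total < min_requests:
        return "wait"

    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    canary_rate = canary_errors / canary_total

    # The floor keeps a near-zero baseline from failing the canary on one error.
    allowed_rate = max(baseline_rate * max_ratio, floor)

    return "promote" if canary_rate <= allowed_rate else "rollback"
```

A deployment pipeline could call this function at intervals while shifting traffic, continuing on "promote", pausing on "wait", and reverting on "rollback".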

Operations

  1. On-call policy: The service has an on-call system that pages the owning team for incidents. Ideally, this involves tools like PagerDuty or Squadcast.

  2. Incident management: The incident management and escalation processes have been documented. This includes processes for postmortem and long-term remediation.

  3. Runbooks: Runbooks covering known failure scenarios have been written and are accessible. You update runbooks whenever a new scenario is uncovered.

  4. Logging: The service utilizes centralized logs, and the logs can be accessed easily.

  5. Metrics: At a minimum, the Four Golden Signals (latency, traffic, errors, and saturation) are available for the service.

  6. Tracing: The application transactions can be traced, using the appropriate tools and sampling configuration for the service.
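
To make the metrics item more concrete, the sketch below derives the Four Golden Signals (latency, traffic, errors, and saturation) from a window of request records. The `Request` record, the use of 95th-percentile latency, the treatment of HTTP 5xx responses as errors, and worker-pool utilization as the saturation measure are assumptions for illustration; a real service would normally get these values from its metrics system.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one handled request."""
    latency_ms: float
    status: int  # HTTP status code


def golden_signals(requests, window_seconds, busy_workers, total_workers):
    """Compute the four signals for one observation window."""
    if not requests:
        raise ValueError("no requests in window")

    # Latency: 95th percentile using the nearest-rank method.
    latencies = sorted(r.latency_ms for r in requests)
    rank = math.ceil(len(latencies) * 95 / 100)
    latency_p95 = latencies[rank - 1]

    # Errors: here, any 5xx response counts as a failure.
    errors = sum(1 for r in requests if r.status >= 500)

    return {
        "latency_p95_ms": latency_p95,
        "traffic_rps": len(requests) / window_seconds,   # traffic
        "error_rate": errors / len(requests),            # errors
        "saturation": busy_workers / total_workers,      # saturation
    }
```

With 19 successful requests taking 1 to 19 ms and one failed request taking 900 ms in a 10-second window, and 6 of 8 workers busy, this yields a p95 latency of 19 ms, 2 requests per second, a 5% error rate, and 75% saturation.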

Testing

  1. Unit tests: Unit tests execute at every code push, automatically.

  2. Integration tests: If appropriate, automated integration tests execute and pass successfully.

  3. End-to-end or acceptance tests: Automated end-to-end or acceptance tests run as part of the continuous integration / continuous deployment (CI/CD) pipeline. If manual testing is required, test results are documented.

  4. Broken tests: Failing tests break the build.

Resiliency

  1. Load testing: Load tests are automated or occur on a regular cadence. You document and publish the results.

  2. Stress testing: Stress tests are automated or occur on a regular cadence. You document and publish the results.

  3. Chaos engineering: Once the applications have proven the ability to stand up to load and stress, chaos engineering is integrated to identify weak points and opportunities to reduce failures.

Security

  1. Authentication/authorization: Each service or application requires proper authentication and authorization.

  2. Secrets management: Secrets are secured properly in a vault or secret store. Tools like truffleHog or git-secrets scan code to identify potential secrets.

  3. Static application security testing (SAST): Static code analysis tools like Checkmarx or Snyk monitor code in the CI/CD pipeline. The build breaks any time there are security vulnerabilities above a certain threshold. Thresholds are set based on service needs.

  4. Dynamic application security testing (DAST) / penetration (pen) testing: Automated DAST runs at appropriate intervals. Manual DAST or pen testing runs according to the security requirements of the service or company. Some companies require DAST or pen testing before large changes or launches; others run them quarterly. Your production readiness checklist should include the appropriate cadence for your situation.

  5. Dependency scan: All dependencies are using the latest or patched versions. For this, consider automating the scan using tools like FOSSA or Nexus Vulnerability Scanner to validate versions and licenses.
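
The idea of breaking the build when scan findings cross a threshold can be sketched as a small gate step in the pipeline. The finding format below (dictionaries with `id` and `severity` keys), the severity names, and the choice to block findings at or above the threshold are hypothetical; they are not the output format or policy of Checkmarx, Snyk, FOSSA, or any other specific tool.

```python
# Assumed severity scale, ordered from least to most serious.
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


def blocking_findings(findings, threshold="high"):
    """Return the findings whose severity is at or above the threshold."""
    limit = SEVERITY_RANK[threshold]
    return [f for f in findings if SEVERITY_RANK[f["severity"]] >= limit]


def gate_exit_code(findings, threshold="high"):
    """Return 1 (fail the build) if any finding blocks, otherwise 0."""
    blocking = blocking_findings(findings, threshold)
    for finding in blocking:
        print(f"BLOCKING {finding['severity']}: {finding['id']}")
    return 1 if blocking else 0
```

A CI job could pass the result of `gate_exit_code` to `sys.exit`, so that a nonzero value fails the build. The threshold argument lets each service set its own tolerance, as the SAST item describes.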

Governance, Risk, and Compliance (GRC)

  1. GRC documentation: GRC checklists have been completed as required. Many companies have a separate GRC system available. In that case, this checklist records that the GRC process has been completed and documented there.

  2. Confidentiality, integrity, availability (CIA) rating: The CIA rating of the service has been documented and published.