The Signicat Blog
Jon Skarpeteig

Tribe Lead, Global Platform

Signicat Platform Resiliency – In depth

At Signicat, we take our role as a critical supplier very seriously. Don’t just take our word for it. Our information security is regularly audited through ISO27001, SOC2, and other certifications. This ensures adherence to strong security policies and adoption of industry best practices covering privacy, integrity and availability.

Signicat platform resiliency builds on top of Google Cloud data center resiliency design. 

Data center dispersion

Our software is distributed across several physical data centers in active-active mode. Capacity management ensures sufficient capacity to handle a full availability zone outage. In the unlikely event of a full region outage, a predefined backup region is prepared with full immutable backups for recovery.

Signicat Service Resilience Architecture

The Signicat platform design ensures service resiliency through several mechanisms, including redundant infrastructure, load balancing, fault tolerance, graceful degradation, continuous monitoring, and robust security measures.

Signicat Service Resilience Architecture

Redundancy

All services that make up the Signicat product portfolio have more than one running copy. This ensures that any single instance or server may fail without any data loss. Write operations are not considered complete until replication has been performed, so any operation requiring persistence will block until the data is successfully written. Replication is performed into other availability zones to ensure the copies do not share the same failure domain. To minimize the impact of component failure, workloads run in an active-active-active configuration.

Rolling upgrade strategy

When updates are rolled out, the new version is started side by side with the previous version. Only after the health checks pass for the new version is the previous version taken down. Rolling upgrades, a feature provided by Kubernetes, enable zero downtime during updates by running the previous version alongside the new version throughout the upgrade process. This strategy also allows for canary deployments through traffic shaping techniques.
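As a rough sketch, the same strategy can be expressed with the Kubernetes Go API types; the maxSurge and maxUnavailable values below are illustrative, not Signicat's actual settings.

package main

import (
    "fmt"

    appsv1 "k8s.io/api/apps/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
    // Allow one extra replica of the new version to start alongside the old one
    // (maxSurge), and never take the old replica down before the new one is
    // healthy (maxUnavailable: 0).
    maxSurge := intstr.FromInt(1)
    maxUnavailable := intstr.FromInt(0)

    strategy := appsv1.DeploymentStrategy{
        Type: appsv1.RollingUpdateDeploymentStrategyType,
        RollingUpdate: &appsv1.RollingUpdateDeployment{
            MaxSurge:       &maxSurge,
            MaxUnavailable: &maxUnavailable,
        },
    }
    fmt.Printf("%+v\n", strategy)
}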

Load balancing

Load balancing refers to the process of distributing incoming network traffic or workload evenly across multiple servers or computing resources. The goal is to ensure optimal resource utilization, maximize throughput, minimize response time, and avoid overloading any single server.

At Signicat there are multiple load balancers, and the mechanisms in place vary slightly depending on which aspect of the network flow is evaluated. The OSI model defines 7 network layers. For load balancing, layer 3 (network packets), layer 4 (transport datagrams) and layer 7 (application data) are the important ones in the Signicat resilient load balancing architecture.

Internet Communication

Network traffic to and from the internet is typically referred to as north-south traffic.

When a network packet bound for Signicat flows through the internet, it is routed across multiple routers between the client and a Signicat server. These routers typically determine the best network path through dynamic Border Gateway Protocol (BGP) configurations. Technically, this is done by announcing which IP addresses you have in your data center, so that connected routers send any traffic bound for those IPs to you. In the Signicat architecture the same IP is announced for multiple data centers, which is known as anycast. If a network router fails, the path through that router is automatically withdrawn, and the resiliency of the internet routes the packet along the next best path.

Once the network packet reaches the IP address, it is processed by the layer 4 load balancer.

At this point the network packet has reached one of the data center load balancing servers. These typically proxy network packets to backing servers. One common example is a TCP socket, where the TCP communication channel is kept open between the client and a load balancer node, and a new connection is then opened from the load balancer to a backend server that processes the request. From the backend servers it looks as if all requests come from the load balancer IP, typically an internal IP in a private IPv4 subnet, with a NAT gateway doubling as a firewall.
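A minimal sketch of this proxying behaviour in Go, assuming a hypothetical backend address; a real load balancer adds health checking, connection pooling and a balancing policy on top of this:

package main

import (
    "io"
    "log"
    "net"
)

// A toy layer-4 TCP proxy: accept the client connection, open a second
// connection to a backend, and copy bytes in both directions. The backend only
// ever sees the proxy's source IP, mirroring how a load balancer node fronts
// the backing servers.
func main() {
    ln, err := net.Listen("tcp", ":8443")
    if err != nil {
        log.Fatal(err)
    }
    for {
        client, err := ln.Accept()
        if err != nil {
            log.Print(err)
            continue
        }
        go func(client net.Conn) {
            defer client.Close()
            backend, err := net.Dial("tcp", "10.0.1.17:8443") // placeholder backend picked by the balancing policy
            if err != nil {
                log.Print(err)
                return
            }
            defer backend.Close()
            go io.Copy(backend, client) // client -> backend
            io.Copy(client, backend)    // backend -> client
        }(client)
    }
}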

Signicat also offloads TLS termination in the load balancer and leverages a layer 7 API Gateway before requests reach the backend server.

The API Gateway will load balance HTTP requests to specific Signicat products. A client attempting to reach https://customer1.app.signicat.com/auth will resolve DNS to the IP, open a TCP socket, negotiate TLS, then send an HTTP request starting with:

GET /auth 
Host: customer1.app.signicat.com

The API Gateway processes this information and sends the request to the authentication product backend servers. At this point, the request has reached the internal cluster, and even more advanced load balancing mechanisms are involved for what is known as east-west (internal) network traffic.
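A toy illustration of the same layer 7 routing decision in Go, using a hypothetical internal backend URL; the real API Gateway is configuration-driven rather than hand-written code:

package main

import (
    "log"
    "net/http"
    "net/http/httputil"
    "net/url"
    "strings"
)

func main() {
    // Backend pool for the authentication product (placeholder address).
    authBackend, err := url.Parse("http://auth-backend.internal:8080")
    if err != nil {
        log.Fatal(err)
    }
    authProxy := httputil.NewSingleHostReverseProxy(authBackend)

    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        // Route on the Host header and path prefix, as the API Gateway does.
        if strings.HasSuffix(r.Host, ".app.signicat.com") && strings.HasPrefix(r.URL.Path, "/auth") {
            authProxy.ServeHTTP(w, r)
            return
        }
        http.NotFound(w, r)
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}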

Service to service communication

East-west traffic, or service communication, is internal communication that does not leave the cluster. Any one end user request might involve multiple services in producing a response. In Signicat, any authenticated API will reach out to the policy based access control service to validate the requester's permissions. These requests do not go out to the external load balancer, but are distributed within the private network.

For any load balancer backend server, a health check takes place to ensure the backend server is in fact there. If it is no longer there, the load balancer will stop directing traffic to it. In the case of L3 and L4, this mechanism is fairly simple, while the application layer allows for a lot more logic.

Inside Kubernetes, there is a service catalogue which at all times keeps track of service state. This is done through health check probes, the most basic of which simply validates that the service is running. In Signicat, we have mandatory liveness and readiness probes. These ensure that only service replicas that are able to process requests are marked as ready. If this health check fails, the replica is immediately removed from the service catalogue.

Multiple Kubernetes features leverage this catalogue, including L3/L4 load balancing capabilities. This is at the heart of resiliency and dynamic scalability. If new services become available, they are added to the service catalogue, which in turn means they get enabled in the relevant load balancers and API Gateway components.

Traffic management

The layer 7 API Gateway load balancer enables Signicat to leverage traffic shaping for canary deployments. This software rollout strategy redirects only a small portion of the overall traffic to the newly released service. The filter for this can match on anything in an HTTP header (such as customers in a beta program) or simply divert a small percentage of requests.

Traffic shaping, or canary deploys, allows the majority of the system to remain unchanged, so any unforeseen issue with the latest release impacts only a small portion of the overall incoming requests. The rollout strategy then continuously reconfigures the API Gateway load balancer to scale up traffic into the latest version until all traffic has been shifted and the older version is shut down.
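Conceptually, the canary routing decision looks like the sketch below; the header name and the 5% share are made up for illustration, and in the real platform this lives in the API Gateway configuration, not in application code:

package main

import (
    "fmt"
    "math/rand"
    "net/http"
)

// chooseBackend picks the canary for beta-program customers and for roughly
// 5% of the remaining traffic; everything else goes to the stable release.
func chooseBackend(r *http.Request) string {
    if r.Header.Get("X-Beta-Program") == "true" { // hypothetical header match
        return "canary"
    }
    if rand.Float64() < 0.05 { // small percentage of overall requests
        return "canary"
    }
    return "stable"
}

func main() {
    req, _ := http.NewRequest(http.MethodGet, "https://customer1.app.signicat.com/auth", nil)
    req.Header.Set("X-Beta-Program", "true")
    fmt.Println(chooseBackend(req)) // "canary"
}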

The API Gateway additionally includes rate limits to instruct noisy clients to back off (429 Too Many Requests response). This ensures that the backend servers do not get overloaded by abnormal volumes, while still providing a coherent response to all clients.

The Signicat rate limits are designed to ensure availability of the overall system in the face of scanners, attackers or similar. They can be individually configured per tenant upon request.
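A minimal sketch of the rate limiting behaviour using Go's golang.org/x/time/rate token bucket; the numbers are placeholders, and the production limits are enforced in the API Gateway per tenant:

package main

import (
    "log"
    "net/http"

    "golang.org/x/time/rate"
)

func main() {
    // Token bucket: sustained 100 requests/second with a burst of 200.
    limiter := rate.NewLimiter(rate.Limit(100), 200)

    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        if !limiter.Allow() {
            // Coherent back-off signal instead of overloading the backends.
            http.Error(w, "Too Many Requests", http.StatusTooManyRequests)
            return
        }
        w.Write([]byte("ok"))
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}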

Fault Tolerance

With redundancies in all components, only capacity is lost in the case of component failure. The Signicat platform is designed to mask such events transparently.

 

Backup systems

With the active-active-active design, the concept of a software backup system is effectively reduced to capacity management. Capacity planning ensures that the system has sufficient capacity to serve clients at all times, including during traffic spikes. In the case of an availability zone outage, there should be no degradation of the service. The Signicat Business Continuity Plan (BCP) therefore requires a planned surplus of capacity at all times.

Auto scaling

All service workloads have mandatory resource requests specified per replica (typically CPU and memory needs). A horizontal autoscaler is configured for the workload to automatically add replicas once resource utilization exceeds the defined threshold for the service. The scheduler is instructed to spread replicas as evenly as possible across availability zones and nodes using a topologySpreadConstraint configuration.
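Expressed with the Kubernetes Go API types, the zone spreading and CPU-based horizontal autoscaling could look roughly like this; names, thresholds and replica counts are illustrative:

package main

import (
    "fmt"

    autoscalingv2 "k8s.io/api/autoscaling/v2"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
    // Spread replicas as evenly as possible across availability zones.
    spread := corev1.TopologySpreadConstraint{
        MaxSkew:           1,
        TopologyKey:       "topology.kubernetes.io/zone",
        WhenUnsatisfiable: corev1.ScheduleAnyway,
        LabelSelector:     &metav1.LabelSelector{MatchLabels: map[string]string{"app": "example-service"}},
    }

    // Scale out on CPU utilization above a threshold, up to a hard maximum.
    minReplicas := int32(3)
    targetCPU := int32(70)
    hpaSpec := autoscalingv2.HorizontalPodAutoscalerSpec{
        ScaleTargetRef: autoscalingv2.CrossVersionObjectReference{
            APIVersion: "apps/v1", Kind: "Deployment", Name: "example-service",
        },
        MinReplicas: &minReplicas,
        MaxReplicas: 12,
        Metrics: []autoscalingv2.MetricSpec{{
            Type: autoscalingv2.ResourceMetricSourceType,
            Resource: &autoscalingv2.ResourceMetricSource{
                Name: corev1.ResourceCPU,
                Target: autoscalingv2.MetricTarget{
                    Type:               autoscalingv2.UtilizationMetricType,
                    AverageUtilization: &targetCPU,
                },
            },
        }},
    }
    fmt.Printf("%+v\n%+v\n", spread, hpaSpec)
}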

If the provisioned virtual machines run out of capacity, a cluster autoscaler is configured to effectively buy additional hardware resources on demand. This scales automatically until a manually configured maximum is reached, so that targeted DDoS attacks or misbehaving workloads do not break the bank.

Self-healing workloads

Signicat has chosen Kubernetes as the main workload management system. One of its key features is the declarative configuration of services. If 3 replicas are configured, Kubernetes will continuously monitor and react if reality does not match. So once a replica becomes unavailable, Kubernetes immediately spawns a replacement into a healthy environment to regain the lost capacity, and the load balancer is automatically reconfigured to stop sending traffic to the faulty replica and start sending traffic to the newly spawned one.

An analogy of the declarative Kubernetes configuration is that of a thermostat. You can configure the temperature, and then it’s up to the heating and cooling system to continuously converge towards the desired state.

Publisher-Subscriber Pattern

Internal service communication strives to use the publisher-subscriber pattern where applicable. This asynchronous, queue-based approach ensures that messages can be picked up by another process in case of failure, effectively resuming the work from where it left off. Signicat uses Google Pub/Sub as the backend for this at-least-once delivery, where messages are only acknowledged after they have been successfully processed. Otherwise they are retried automatically.
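A minimal Go consumer against Google Pub/Sub showing the ack-after-processing pattern; the project and subscription IDs are placeholders:

package main

import (
    "context"
    "log"

    "cloud.google.com/go/pubsub"
)

// At-least-once consumption: the message is only acknowledged after processing
// succeeds; on failure it is nack'ed and redelivered, so another replica can
// pick up the work.
func main() {
    ctx := context.Background()
    client, err := pubsub.NewClient(ctx, "example-project")
    if err != nil {
        log.Fatal(err)
    }
    sub := client.Subscription("example-subscription")
    err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
        if err := process(m.Data); err != nil {
            m.Nack() // redeliver for retry
            return
        }
        m.Ack() // acknowledge only after successful processing
    })
    if err != nil {
        log.Fatal(err)
    }
}

func process(data []byte) error { return nil } // placeholder for the real work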

Retries

For in-flight requests, the Signicat service mesh enables retry logic at the network level. Idempotent requests heading towards a faulty component are retried against one of the healthy replicas.
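The mesh handles this transparently, but the idea is roughly equivalent to the application-level sketch below, where each attempt goes back through the internal service address so a different, healthy replica can be picked:

package retry

import (
    "context"
    "fmt"
    "net/http"
    "time"
)

// GetWithRetry retries an idempotent GET a limited number of times with a
// simple linear backoff. This is only an illustration of the concept; in the
// Signicat platform the retries happen in the service mesh, not in code.
func GetWithRetry(ctx context.Context, url string, attempts int) (*http.Response, error) {
    var lastErr error
    for i := 0; i < attempts; i++ {
        req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
        if err != nil {
            return nil, err
        }
        resp, err := http.DefaultClient.Do(req)
        if err == nil && resp.StatusCode < http.StatusInternalServerError {
            return resp, nil // success, or a client error a retry will not fix
        }
        if err != nil {
            lastErr = err
        } else {
            lastErr = fmt.Errorf("unexpected status %d", resp.StatusCode)
            resp.Body.Close()
        }
        time.Sleep(time.Duration(i+1) * 100 * time.Millisecond) // simple backoff
    }
    return nil, fmt.Errorf("all %d attempts failed: %w", attempts, lastErr)
}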

All workloads in Kubernetes have mandatory health probes defined, most importantly the readiness checks. The readiness probe polls an API endpoint which checks whether all dependencies and configuration needed to successfully serve requests are met, including any storage or database connections. If this check fails, the workload is automatically removed from the load balancer configuration and stops receiving traffic. Only after the check passes again will the service start receiving traffic. Requests that were already inbound to the failing service may get retried automatically.

As a side note: Signicat also mandates the use of a liveness probe. If this probe fails, Kubernetes will restart the entire process. The liveness API endpoint is implemented to only validate that the process itself is operational, without checking external dependencies. Otherwise, a network partition or similar could cascade into the full fleet of processes shutting down and spawning again, which is computationally expensive.
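A minimal sketch of the two probe endpoints in Go, with a database ping standing in for the dependency checks; the endpoint paths are illustrative:

package probes

import (
    "database/sql"
    "net/http"
)

// Liveness answers from the process itself with no dependency checks, so a
// network partition cannot cascade into a fleet-wide restart. Readiness checks
// the dependencies needed to serve traffic and pulls the replica out of load
// balancing while any of them fail.
func RegisterProbes(mux *http.ServeMux, db *sql.DB) {
    mux.HandleFunc("/livez", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK) // the process is up: that is all liveness asserts
    })
    mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
        if err := db.PingContext(r.Context()); err != nil {
            http.Error(w, "dependency unavailable", http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    })
}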

Graceful Degradation

Graceful degradation is the ability of a system or application to continue functioning with limited features or performance, even when some of its components fail or encounter issues.

The aim is to prevent a complete system breakdown by ensuring that critical operations can still be performed.

Fallback to stale data

For critical dependencies, Signicat leverages aggressive caching, including a full data copy. This allows any read operation to succeed if the dependency is unavailable. One example is Signicat's Account API, whose main purpose is to buffer towards our Customer Relationship Management (CRM) vendor. In the case of degraded performance, network issues or vendor outages, any read operation will continue to be served from cache. Self-service functionality around purchasing new products and onboarding new customers will be limited during the incident, but other functionality will appear unaffected.
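A simplified sketch of the read-through, serve-stale-on-error caching idea in Go; the TTL and fetch function are placeholders rather than the actual Account API implementation:

package accountcache

import (
    "sync"
    "time"
)

// Cache is a minimal read-through cache that keeps serving the last known data
// when the upstream (e.g. the CRM vendor) is unavailable. Writes are not shown,
// because they are the part that degrades during an outage.
type Cache struct {
    mu      sync.RWMutex
    value   map[string]string
    fetched time.Time
    ttl     time.Duration
    fetch   func() (map[string]string, error) // upstream lookup, injected by the caller
}

func (c *Cache) Get(key string) (string, bool) {
    c.mu.RLock()
    fresh := time.Since(c.fetched) < c.ttl
    c.mu.RUnlock()

    if !fresh && c.fetch != nil {
        if data, err := c.fetch(); err == nil {
            c.mu.Lock()
            c.value, c.fetched = data, time.Now()
            c.mu.Unlock()
        }
        // On error: keep serving the stale copy rather than failing the read.
    }

    c.mu.RLock()
    defer c.mu.RUnlock()
    v, ok := c.value[key]
    return v, ok
}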

Graceful PBAC degradation

Policy Based Access Control (PBAC) is used across all Signicat APIs. This allows us to write comprehensive policies for API access. Example: In order to access one of our authenticated API products, the request must:

  • Have a valid authentication token, signed by a trusted auth token issuer (Signicat OIDC)
  • Be linked with an active user account in the Signicat IAM system
  • Have required permissions (role) to access this particular API
  • Have required permissions (role) to access the customer tenant which the API request targets
  • Have the applicable product purchased and enabled
  • Not have maxed out its quota
  • Not exceed the request rate limit

If all of the above conditions are met, the request is accepted. If not, the request is denied. This is centrally managed by the Signicat PBAC engine to ensure consistent API access everywhere.

The PBAC pattern also allows for resiliency optimization. Let's say the link to the Customer Relationship Management (CRM) system is down and the PBAC cache has expired, making it impossible to determine which products are enabled for the client.

In this case, the Signicat PBAC will allow the request and skip the product check entirely! Why? Because the only risk introduced is that usage of products not under contract cannot be billed by Signicat. Compared with the alternative of blocking each and every API request, this graceful degradation leaves the system fully operational as seen by the customer.
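The decision logic can be sketched as below; the Request type and helper checks are hypothetical stand-ins for the real PBAC engine, and the point is only the fail-open versus fail-closed distinction:

package pbac

import "errors"

// ErrDependencyUnavailable marks a check that could not be evaluated because a
// backing system (CRM, quota store, ...) is unreachable, as opposed to a check
// that evaluated to "deny".
var ErrDependencyUnavailable = errors.New("dependency unavailable")

// Request and the helper checks below are hypothetical stand-ins.
type Request struct {
    Token   string
    Tenant  string
    Product string
}

func Authorize(req Request) bool {
    if !validToken(req) { // essential check: always fail closed
        return false
    }
    purchased, err := productPurchased(req)
    if errors.Is(err, ErrDependencyUnavailable) {
        // Fail open: the only risk is unbilled usage, which beats blocking
        // every API request while the CRM link is down.
        return hasRole(req)
    }
    if err != nil || !purchased {
        return false
    }
    return hasRole(req)
}

// Stubs standing in for token validation, the product/contract lookup and the
// role check described in the policy list above.
func validToken(req Request) bool                { return req.Token != "" }
func productPurchased(req Request) (bool, error) { return true, nil }
func hasRole(req Request) bool                   { return true }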

Service impact

  • Quota systems having issues? Ignore the policy and allow requests
  • CRM system down? Allow reading from cache, block write requests
  • Token validation having issues? Block all requests!

We operate against a higher internal Service Level Objective (SLO) for our essential services, like token validation. In some cases the degraded state is acceptable until normal working hours begin, while a token validation failure immediately alerts on-call to remedy the situation.

While not every functionality supports graceful degradation, it remains a key design goal to optimize availability across the Signicat platform.

Monitoring

All Signicat services are monitored 24/7. To ensure availability at all times, both internal monitoring and synthetic monitoring are in place.

Synthetic Monitoring

Synthetic monitoring is used to simulate users of the Signicat products. This includes both API usage and UI elements.

Synthetic API test

At regular intervals, API calls corresponding to critical product functionality are run. Any deviation from the expected result immediately shows up as an anomaly. The screenshot shows an example of what a service outage would look like.

API anomaly detection
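A bare-bones version of such a synthetic API probe in Go; the endpoint and interval are placeholders, and the real checks also assert on response content and feed an alerting pipeline:

package main

import (
    "log"
    "net/http"
    "time"
)

// Periodically call a critical endpoint and flag any deviation from the
// expected response as an anomaly.
func main() {
    ticker := time.NewTicker(1 * time.Minute)
    defer ticker.Stop()
    client := &http.Client{Timeout: 10 * time.Second}

    for range ticker.C {
        resp, err := client.Get("https://customer1.app.signicat.com/auth/health") // placeholder endpoint
        if err != nil {
            log.Printf("ANOMALY: %v", err) // would raise an alert in production
            continue
        }
        if resp.StatusCode != http.StatusOK {
            log.Printf("ANOMALY: unexpected status %d", resp.StatusCode)
        } else {
            log.Print("OK")
        }
        resp.Body.Close()
    }
}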

Load testing

Signicat also regularly performs production load testing using artificial traffic. This ensures that any undetected scalability issues are found by Signicat before they lead to customer impact.

End to end UI validation

In addition to API calls, Signicat runs continuous UI validation to ensure that what the user is presented with matches expectations. These are very high level test cases that touch upon a large part of the functionality, which means a large number of systems gets tested in each run.

The screenshots below show the corresponding result of loading MitID with an unexpected outcome.

Internal monitoring

Internal monitoring is in place to quickly identify root causes or get early warnings about potential issues. These sensors are typically very specific to a small surface area, like an individual service or component of the solution.

Service mesh live

Logging

At Signicat we leverage structured logs where possible. For application logs, this means JSON logs with predefined metadata fields to enable quick filtering on trace id, customer tenant, product or service name and other contextual data. Log events signaling issues have predefined thresholds and alarms, which notify on-call personnel. Alarms may trigger on the likes of elevated error rates or security events.
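A small Go example using the standard library's log/slog JSON handler; the field names are illustrative of the kind of predefined metadata described above:

package main

import (
    "log/slog"
    "os"
)

func main() {
    // JSON handler with a predefined service field attached to every event.
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil)).With(
        slog.String("service", "example-service"),
    )

    logger.Info("request completed",
        slog.String("trace_id", "4bf92f3577b34da6"),
        slog.String("tenant", "customer1"),
        slog.Int("status", 200),
    )

    // Error-level events like this are what predefined thresholds and alarms
    // would typically key on.
    logger.Error("upstream call failed",
        slog.String("trace_id", "4bf92f3577b34da6"),
        slog.String("dependency", "crm"),
    )
}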

Special logs, such as audit logs, security logs and billing events, are separated from application logs, because these logs have higher privacy constraints, specific retention needs and other special requirements. This also gives us more flexibility with the application logs, which make up the bulk of the volume, to the point where they may be treated as throwaway logs.

Metrics

Time series data, or metrics, is sampled across the systems making up the Signicat platform and product portfolio. This includes business metrics, like conversion rate, and purely technical measurements like CPU and memory consumption.

Example from one of our Kubernetes clusters

Metrics analysis

Predefined thresholds will trigger alarm and automatically notify Signicat personnel. In the case of outages, alarms will also wake up on-call, but the majority of alerts are designed to trigger ahead of time.

Automatic analysis takes place using predefined rules to detect unwanted scenarios. Examples include:

  • Disk space near depletion
  • Failure to respond or substantial increase in response time
  • Drop in success rate
  • Malicious traffic patterns

Additionally, manual inspection is part of the on-call routines through a daily checklist.

Alerting

Any alert is routed to the corresponding team through a third party tool, and responses to alarms are handled by 24/7 on-call rotations. This ensures a notification path to Signicat staff that operates completely independently of the Signicat services themselves.

Incident response

Runbooks are created for the alarms, outlining the required steps to resolution. Signicat also practices its Disaster Recovery Plans (DRP) once a year, rehearsing relevant disaster scenarios. The multiple on-call rotations include dedicated experts on various products and components, under the platform engineering model of “You build it, you own it”. This ensures the shortest possible time to recovery.

Break the glass procedures

  • Root accounts stored in a separate system. Alerting upon usage
  • Privilege escalation through infrastructure operations team

Consistent post-mortems

  • Blameless, asking the 5 whys
  • Enforced management review
    • Ensuring follow-up tasks and fixes get sufficient attention

Security

Signicat employs a foundational security model, with security layers built on top of each other. This approach reflects a defense-in-depth security strategy.

Signicat security layers

Foundational Security Architecture

The foundational security covers aspects that are used across the platform. This is generally operated by dedicated central functions responsible for managing the foundation resources that are consumed by multiple product areas and workloads.

  • Policy controls are programmatic constraints that enforce acceptable resource configurations and prevent risky configurations. These combine infrastructure-as-code (IaC) validation in pipelines with Signicat Information Security Management System (ISMS) policy constraints
  • Architecture controls are the configuration of cloud resources like networks and resource hierarchy. This architecture is based on security best practices
  • Detective controls detect anomalous or malicious behavior. Signicat uses platform features such as our observability stack, which provides capabilities to enforce custom detective controls

Policy Controls

Signicat employs a number of Information Security Management System (ISMS) policies as defined by ISO27001 and SOC2, ensuring high security and quality standards that meet customer expectations. The policies are all approved by the Signicat Information Security Board and managed via git for version control.

Access Management

Access follows the least privilege principle, with predefined roles and accesses. All access is managed through Single Sign-On (SSO), which simplifies onboarding, offboarding and role changes while providing a comprehensive overview of who has access to what and where. More granular permissions are handled per context: infrastructure access is managed by a cloud access policy describing several levels of access, and specific components like Kubernetes have even more granularity with a full predefined responsibility matrix.

Software Delivery LifeCycle (SDLC)

An important part of Signicat security policies is the Software Delivery LifeCycle (SDLC) which describes the Signicat change management process and way of work. Examples include:

  • Infrastructure defined as code (IaC)
  • Separation of concerns through mandatory merge request approval
  • Quality assurance targets

Adherence to these requirements is also subject to third-party audits, like the SOC2 Type 2 audit. Enforcement also includes automated alerts, for example when human activity is detected on resources governed by IaC.

Architecture Controls

Pipeline tooling ensures that the artifact deployed to production is identical to what has been built, tested, and security scanned, without any maliciously injected malware. Supply chain security is inspired by the SLSA framework.

Supply chain security

Source threats security

  • Source repo hosting with industry leading vendor
  • Mandatory merge request approvals
  • Static and Dynamic Application Security Scanning (SAST/DAST)
  • Infrastructure as Code scanning
  • Secret Detection

Dependency threats security

  • Software Bill Of Materials (SBOM) dependency security scanning
    • Detects known CVEs from SBOM

Build threats security

  • Ephemeral build containers blocking build worker attacks on software supply chain
    • Every pipeline job runs in a fresh container, deleting it once the job ends
  • Open Container Initiative (OCI) standard

Access management

Signicat follows the least privilege principle by limiting access to resources and services at the service or team level.

Signicat uses a single source of truth for employee accounts, where authorization is linked to group membership. Permissions are assigned using groups defined as IaC, and are scoped down to agile team granularity, ensuring product and team separation.

Cloud resources

Access to cloud resources (storage buckets, databases, keys and more) is scoped to teams. The team members and the team's service accounts are the only ones (apart from global admins) with access to the team's cloud resources. This translates to products having access only to their corresponding database or storage.

VPCs, Subnets, Firewall and Routing

From the public internet, only port 443 is open. Any new port opening is done using Infrastructure as Code (IaC) and merge requests. All changes must be approved by the Signicat platform team after evaluating the justified need. If an engineer from the platform team opens a merge request, a different member must approve it in order to fulfil the separation of concerns defined by the SDLC.

Network connectivity is segmented and protected by multiple firewalls. Examples include:

  • DMZ (internet)
  • Management network (Host network with virtual machines)
  • Pod (container) network

All incoming traffic is filtered through firewalls on the Google load balancer, the Istio API Gateway and iptables firewalls for each individual container. No hosts on the management network can be reached from the outside.

Host Operating System Hardening

Containers run on Container-Optimized OS with containerd, a hardened virtual machine image provided by Google Cloud:

  • Smaller attack surface: Container-Optimized OS has a smaller footprint, reducing the instance's potential attack surface
  • Locked-down by default: Container-Optimized OS instances include a locked-down firewall and other security settings by default
    • Container-Optimized OS does not include a package manager, blocking installation of software packages directly on an instance
    • Container-Optimized OS does not support execution of non-containerized applications
    • The Container-Optimized OS kernel is locked down; it is not possible to install third-party kernel modules or drivers

Security through isolation using containers

Signicat's containerization approach provides additional security hardening on top of the traditional foundational security layer. These additional measures rely heavily on the intrinsic security properties of Kubernetes and containerization. In particular, containers provide a hardened, repeatable way of ensuring that only the services, functions, processes and components specifically required for operation are included.

Container Images

Builds are based on Dockerfile, Jib or similar tools. Only explicitly required packages and ports are specified, and each container runs only a single process. When possible, the container is started with a read-only file system for immutability.

Process Isolation

Containers are namespaces and control groups (cgroups) put together

All processes that run inside containers are isolated using cgroups and namespaces to ensure that one compromised process does not compromise the system as a whole.

Control groups (cgroups)

Cgroup isolation provides hardware resource separation per process

  • Resource limiting: a group can be configured not to exceed a specified memory limit or use more than the desired amount of processors or be limited to specific peripheral devices
  • Prioritization: one or more groups may be configured to utilize fewer or more CPUs or disk I/O throughput
  • Accounting: a group's resource usage is monitored and measured
  • Control: groups of processes can be frozen or stopped and restarted

Namespaces

Processes inside a container are isolated from the rest of the system using Linux namespaces:

  • mnt – allows a container to have its own set of mounted filesystems and root directory
    • Processes in one mnt namespace cannot see the mounted filesystems of another mnt namespace
  • uts – allows different host and domain names for different processes
  • pid – provides processes with an independent set of process IDs
    • A parent namespace can see the children namespaces and affect them, but a child can neither see the parent namespace nor affect it
  • user – per namespace mapping users and group IDs
    • User root inside the namespace can be mapped to a non-privileged user on the host
  • ipc – Inter Process Communication provides semaphores, message queues and shared memory segments
    • It is not widely used, but some programs still depend on it
  • time – the ability to change the date and time in a container
  • net – network stack virtualization
    • Each namespace will have a private set of IP addresses, its own routing table, socket listing, connection tracking table, firewall and other network-related resources

Kubernetes Pod

  • One or more containers sharing relevant namespaces and control group
  • Processes inside containers run with an unprivileged user
  • Assigned a unique IP address shared by all containers

Networking

  • The pod network is by default an overlay network, without direct access to the host network
    • Separate IP segment and packet encapsulation (VXLAN)
  • Each pod has its own virtual ethernet (veth) pair connected to a virtual switch (Linux bridge)
    • Dedicated firewall rules per pod

Network Hardening

Network design is based on Zero-Trust networking

  • Leverage cloud load balancer as the first level of defense
    • Automatically block network protocol and volumetric DDoS attacks such as protocol floods (SYN, TCP, HTTP, and ICMP) and amplification attacks (NTP, UDP, DNS)
  • Incoming traffic passes through an API Gateway (Ingress), implemented using Istio
    • Filter all traffic not explicitly allowed at layer 7 (application layer / HTTPS)
  • Injection of default security headers for all services, like: Content-Security-Policy and Strict-Transport-Security
  • Incoming and outgoing pod network traffic is redirected to Envoy (Istio) to enforce mutual TLS (mTLS) encryption for all internal network communication

Network security hardening

Security of data in transit

External traffic is protected according to international standards, including several eID compliance requirements.

This configuration includes:

  • TLS 1.2 or higher
  • Only modern and safe ciphers, following the already mentioned requirements
    • Daily monitoring of TLS Ciphers against industry best practices and product specific requirements
  • DNSSEC signing of DNS records

Signicat API Security

Signicat APIs build on the same security fundamentals as our products providing authentication for banks, governments and other highly regulated industries. In fact, it is the very same Signicat certified OIDC service powering authentication for Signicat APIs. Additionally, Signicat has centralized authorization by implementing a Policy Based Access Control (PBAC) engine which combines validation of cryptographically strong authentication, Role Based Access Control (RBAC) and context aware rules such as product checks, quotas or graceful failure modes.

Conclusion

Signicat's robust and resilient infrastructure, coupled with defense in depth security measures and continuous auditing, ensures that our services remain reliable and secure. Our proactive approach to redundancy, failure modes and security underscores our dedication to delivering uninterrupted service to our clients. At Signicat, we prioritize resilience and reliability, ensuring that our clients can always trust in the security, integrity and availability of our products and services.