The Signicat Blog
An illustration showing a hand stopping a line of falling dominoes, with a shield icon above, symbolising fraud protection and risk prevention.
Jon Skarpeteig

Tribe Lead, Global Platform

Signicat Platform Resiliency – In depth

At Signicat, we take our role as a critical supplier very seriously. Don’t just take our word for it. Our information security is regularly audited through ISO27001, SOC2, and other certifications. This ensures adherence to strong security policies and adoption of industry best practices covering privacy, integrity and availability.

Signicat platform resiliency builds on top of Google Cloud data center resiliency design. 

A cloud infrastructure diagram illustrating a high-availability and disaster recovery setup. The diagram shows a primary region in the Netherlands, which contains three separate Availability Zones, each with its own data centers. These zones are interconnected with fiber paths. A separate Backup Region is shown in Finland, connected with fiber paths to the primary region for disaster recovery purposes.

Data center dispersion

Our software is distributed across several physical data centers in active-active mode. Capacity management ensures sufficient capacity to handle a full availability zone outage. In the unlikely event of a full region outage, a predefined backup region is prepared with full immutable backups for recovery.

Signicat Service Resilience Architecture

The Signicat platform design ensures service resiliency through several mechanisms: redundant infrastructure, load balancing, fault tolerance, graceful degradation, continuous monitoring, and robust security measures.

A diagram titled 'Signicat Service Resilience' that is divided into six sections. The sections are: 1. Redundancy: which includes Servers, Databases, Network, and Availability zones. The icon shows two servers with synchronised databases. 2. Load balancing: which includes Incoming requests, Resource utilization, and Traffic distribution. The icon shows one input branching to three outputs. 3. Fault Tolerance: which includes Error detection, Backup systems, and Self-healing. The icon shows a primary database connected to multiple backups where one has an x indicating error. 4. Graceful degradation: which includes Essential vs. non-essential and Reduced functionality. The icon shows a process degrading to a simpler state. 5. Monitoring: which includes Real-time metrics, Alerting, Log analysis, and Incident response. The icon shows a screen with analytics. 6. Security: which includes Information security management, Software delivery lifecycle, and Defence in depth. The icon is a shield.

Redundancy

All services that make up the Signicat product portfolio have more than one running copy. This ensures that any single instance or server may fail without any data loss. A write operation is not considered complete until replication has been performed, so any operation requiring persistence blocks until the data is successfully written. Replication targets other availability zones, so that the copies do not share the same failure domain. To minimize the impact of component failure, workloads run in an active-active-active configuration.

An icon illustrating data and server redundancy. It shows a primary server with its associated file folders and database, which are shown to be equal to or synchronised with a secondary, identical server setup.
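
To make the blocking write described above concrete, here is a minimal Python sketch. The `Replica` class, the `replicated_write` function and the `min_acks` quorum of two zones are illustrative assumptions, not Signicat's actual storage implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One copy of the data, living in its own availability zone."""
    zone: str
    healthy: bool = True
    data: dict = field(default_factory=dict)

    def store(self, key: str, value: str) -> bool:
        # A failed replica cannot acknowledge the write.
        if not self.healthy:
            return False
        self.data[key] = value
        return True


def replicated_write(replicas: list[Replica], key: str, value: str, min_acks: int = 2) -> bool:
    """Report success only once at least `min_acks` zones have stored the value.

    `min_acks` is an assumed quorum size chosen for this sketch.
    """
    acks = sum(1 for replica in replicas if replica.store(key, value))
    return acks >= min_acks


zones = [Replica("zone-a"), Replica("zone-b"), Replica("zone-c")]
first_write = replicated_write(zones, "order-1", "signed")

zones[0].healthy = False  # lose one zone: the write still completes elsewhere
second_write = replicated_write(zones, "order-2", "signed")
```

Because the caller only sees success after the copies exist in separate zones, losing any single zone afterwards cannot lose an acknowledged write.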

Rolling upgrade strategy

When an update is rolled out, the new version is started alongside the previous version. Only after the health checks pass for the new version is the previous version taken down. Rolling upgrades, a feature provided by Kubernetes, enable zero downtime during updates by keeping the previous version running throughout the upgrade process. This strategy also allows for canary deployments through traffic shaping techniques.
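
The ordering of a rolling upgrade can be illustrated with a short Python sketch. The function `rolling_upgrade` and the replica naming scheme are hypothetical; in practice Kubernetes performs this logic itself.

```python
from typing import Callable


def rolling_upgrade(old: list[str], new_version: str, is_healthy: Callable[[str], bool]) -> list[str]:
    """Replace replicas one at a time and return the replicas left running.

    An old replica is only retired after its successor passes the health check.
    """
    running = list(old)
    for index, previous in enumerate(old):
        candidate = f"{new_version}-{index}"
        running.append(candidate)      # start the new version next to the old one
        if not is_healthy(candidate):
            running.remove(candidate)  # abort: the remaining old replicas keep serving
            return running
        running.remove(previous)       # health check passed: retire the old replica
    return running


upgraded = rolling_upgrade(["v1-0", "v1-1"], "v2", lambda name: True)
aborted = rolling_upgrade(["v1-0", "v1-1"], "v2", lambda name: False)
```

In the aborted case the previous version is never touched, which is what makes the upgrade safe to attempt.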

Load balancing

Load balancing refers to the process of distributing incoming network traffic or workload evenly across multiple servers or computing resources. The goal is to ensure optimal resource utilization, maximize throughput, minimize response time, and avoid overloading any single server.

A network architecture diagram of a data center that illustrates the concepts of 'North-South' and 'East-West' traffic. North-South Traffic: A vertical arrow on the right shows traffic flowing from an external network at the top, down through layers of network devices, firewalls, and servers. This represents data entering or leaving the data center. East-West Traffic: A horizontal arrow at the bottom shows traffic flowing between virtual machines (VMs) within the data center. Architecture: The diagram shows servers connected in a fully meshed topology to a layer of switches, which in turn connect to the VMs at the lowest level.

Signicat uses multiple load balancers. The mechanisms in place vary slightly depending on which part of the network flow is evaluated. The OSI model defines 7 network layers. Of these, layer 3 (network packets), layer 4 (transport datagrams) and layer 7 (application data) matter most in the Signicat resilient load balancing architecture.

Internet Communication

Network traffic to and from the internet is typically referred to as north-south traffic.

A diagram illustrating a resilient mesh network. It shows numerous router-like nodes interconnected with each other in a web-like structure, indicating multiple, redundant data paths throughout the network.

When a network packet bound for Signicat flows through the internet, it is routed across multiple routers between the client and a Signicat server. These routers typically determine the best network path through dynamic Border Gateway Protocol (BGP) configurations. Technically, a data center announces which IP addresses it hosts, so that connected routers send any traffic bound for these IPs to it. In the Signicat architecture the same IP is announced from multiple data centers. This is known as anycast. If a network router fails, the network path through that router is automatically withdrawn, and the resiliency of the internet routes the packet along the next best path.

Once the network packet reaches the IP address, it is processed by the layer 4 load balancer.

A diagram illustrating a load balancing architecture. A single 'Client' box on the left sends traffic to a middle layer containing four 'Load Balancer' boxes. Each load balancer then distributes the traffic across a final layer of six boxes labeled 'Real Sevice' on the right, demonstrating redundancy and traffic distribution.

At this point the network packet has reached one of the data center load balancing servers. These typically proxy network packets to backing servers. One common example is a TCP socket, where the TCP communication channel is kept open between the client and a load balancer node. A new connection is then opened from the load balancer to a backend server processing the request. From the backend servers' point of view, all requests appear to come from the load balancer IP. This is typically an internal IP in a private subnet for IPv4, with a NAT Gateway doubling as a firewall.

Signicat also offloads TLS termination to the load balancer and leverages a layer 7 API Gateway before requests reach the backend server.

A diagram showing the architecture of an API Gateway. On the left, 'Client Apps,' which include 'Web Apps' and 'Mobile Apps,' send requests to a central 'API Gateway.' The API Gateway layer handles common functions such as 'Authentication,' 'Rate limiting,' and 'Logging.' It then routes the requests to the appropriate backend microservices on the right, which are 'Signature service,' 'eID Hub service,' and 'MobileID service.'

The API Gateway load balances HTTP requests to specific Signicat products. A client attempting to reach https://customer1.app.signicat.com/auth will resolve DNS to the IP, open a TCP socket, negotiate TLS, and then send an HTTP request starting with:

GET /auth 
Host: customer1.app.signicat.com

The API Gateway will process this information, and send the request to the authentication product backend servers. At this point, the request has reached the internal cluster, and even more advanced load balancing mechanisms are involved for what is known as east-west (internal) network traffic.
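
As a rough illustration of this layer 7 routing decision, the sketch below parses the start of such a request and picks a backend. The route table, the backend names and the idea of reading the tenant from the first host label are assumptions made for the example, not the actual gateway configuration.

```python
# Hypothetical path-prefix to backend mapping.
ROUTES = {
    "/auth": "authentication-backend",
    "/sign": "signature-backend",
}


def parse_request_head(raw: str) -> tuple[str, str, str]:
    """Extract method, path and Host header from the start of an HTTP request."""
    lines = raw.strip().splitlines()
    method, path = lines[0].split()[:2]
    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header section
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, path, headers.get("host", "")


def route(raw: str) -> tuple[str, str]:
    """Return (tenant, backend) for a request, based on Host and path."""
    _, path, host = parse_request_head(raw)
    tenant = host.split(".")[0]  # assumption: the tenant is the first host label
    for prefix, backend in ROUTES.items():
        if path == prefix or path.startswith(prefix + "/"):
            return tenant, backend
    raise LookupError(f"no route for {path}")


decision = route("GET /auth\nHost: customer1.app.signicat.com\n")
```

Here `decision` is `("customer1", "authentication-backend")`: the host identifies the customer and the path identifies the product.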

Service to service communication

East-west traffic, or service communication, is internal communication not leaving the cluster. For any one end user request there might be multiple services involved in producing a response. In Signicat, any authenticated API reaches out to the policy-based access control service to validate the requester's permissions. These requests do not go out to the external load balancer, but are distributed within the private network.

A diagram of a microservices architecture, showing six services labeled 'Service A,' 'Service B,' 'Service C,' 'Service D,' 'Service E,' and 'Service F.' The services are depicted as nodes in a fully interconnected mesh network, with lines showing that each service can communicate directly with every other service, indicating a high degree of availability and resilience.

Every load balancer backend server is subject to a health check, which verifies that the backend server is in fact there. If it’s no longer there, the load balancer stops directing traffic to it. For L3 and L4 this mechanism is fairly simple, while the application layer allows for a lot more logic.

Inside Kubernetes, there is a service catalogue which at all times keeps track of service state. This is done through health check probes. The most basic one just validates that the service is running. In Signicat, we have mandatory liveness and readiness probes. These ensure that only service replicas which are able to process requests are marked as ready. If a replica fails this health check, it is immediately removed from the service catalogue.

Multiple Kubernetes features leverage this catalogue, including L3/L4 load balancing capabilities. This is at the heart of resiliency and dynamic scalability. When new service instances become available, they are added to the service catalogue, which in turn means they are enabled in the relevant load balancers and API Gateway components.

Traffic management

The layer 7 API Gateway load balancer enables Signicat to leverage traffic shaping for canary deployments. This software rollout strategy redirects only a small portion of the overall traffic to the newly released service. The filter can match on anything in an HTTP header (such as customers who are part of a beta program) or simply select a small percentage of requests.

A diagram illustrating the canary release deployment strategy. It shows a group of users whose traffic is split between two versions of a product. The majority of users are directed to the stable 'Product Version,' while a small subset, as described by the text 'A portion of users receive canary,' is routed to the new 'Canary Version' for testing.

Traffic shaping, or canary deployment, allows the majority of the system to remain unchanged. If there are any unforeseen issues with the latest release, they impact only a small portion of the overall incoming requests. The rollout then continuously reconfigures the API Gateway load balancer to shift more traffic to the latest version until it receives all traffic and the older version is shut down.
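
The two filters mentioned above, a header match and a traffic percentage, could be sketched as follows. The header name `x-beta-program` and the hashing scheme are assumptions for illustration only.

```python
import hashlib


def pick_version(headers: dict[str, str], request_id: str, canary_percent: int) -> str:
    """Decide whether a request goes to the canary or the stable version."""
    # Filter 1: explicit opt-in through a (hypothetical) beta program header.
    if headers.get("x-beta-program") == "true":
        return "canary"
    # Filter 2: a stable hash maps each request id to a bucket from 0 to 99,
    # so roughly `canary_percent` percent of requests land on the canary.
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return "canary" if bucket < canary_percent else "stable"


sample = [pick_version({}, f"request-{n}", 10) for n in range(1000)]
canary_share = sample.count("canary") / len(sample)
```

Raising `canary_percent` step by step towards 100 corresponds to the gradual shift of traffic onto the new version.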

The API Gateway additionally includes rate limits to instruct noisy clients to back off (429 Too Many Requests response). This ensures that the backend servers do not get overloaded by abnormal volumes, while still providing a coherent response to all clients.

The Signicat rate limits are designed to ensure availability of the overall system in case of scanners, attackers or similar. They can be individually configured per tenant upon request.
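
One common way to implement such a limit is a token bucket per tenant. The sketch below is a generic example of that technique with made-up rate and burst values; the source does not state which algorithm or limits the Signicat API Gateway uses.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    rate_per_second: float  # tokens added per second
    burst: float            # maximum number of stored tokens
    tokens: float = 0.0
    last_refill: float = 0.0

    def __post_init__(self) -> None:
        self.tokens = self.burst  # start with a full bucket

    def allow(self, now: float) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_per_second)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def handle(buckets: dict[str, TokenBucket], tenant: str, now: float) -> int:
    """Return the HTTP status the gateway would answer with."""
    # Hypothetical default limits; each tenant gets its own bucket.
    bucket = buckets.setdefault(tenant, TokenBucket(rate_per_second=5.0, burst=10.0))
    return 200 if bucket.allow(now) else 429


buckets: dict[str, TokenBucket] = {}
statuses = [handle(buckets, "tenant-a", now=0.0) for _ in range(12)]
```

In this run the first ten requests are accepted and the last two receive 429, while other tenants remain unaffected because their buckets are separate.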

Fault Tolerance

With redundancies in all components, only capacity is lost in the case of component failure. The Signicat platform is designed to mask such events transparently.

A diagram illustrating network fault tolerance and redundancy. A primary network node on the left is shown with three connections to three nodes on the right. One of the connections is broken, indicated by a red line with a cross mark. The other two connections remain active, showing that the network can withstand a partial failure.

Backup systems

With the active-active-active design, the concept of a software backup system is effectively reduced to capacity management. Capacity planning ensures that at any time the system will have sufficient capacity to serve the clients, including traffic spikes. In the case of an availability zone outage, there should be no degradation in the service. Therefore the Signicat Business Continuity Plan (BCP) requires a planned surplus of capacity at all times.
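
The surplus this implies can be worked out with simple arithmetic. The helper below is an illustrative sketch; the replica counts are invented numbers, not Signicat's capacity figures.

```python
import math


def replicas_per_zone(peak_replicas: int, zones: int, tolerated_zone_failures: int = 1) -> int:
    """Replicas each zone must run so the surviving zones still cover peak load."""
    surviving = zones - tolerated_zone_failures
    if surviving < 1:
        raise ValueError("at least one zone must survive")
    return math.ceil(peak_replicas / surviving)


per_zone = replicas_per_zone(peak_replicas=12, zones=3)
total = per_zone * 3
surplus_percent = 100 * (total - 12) / 12
```

With three zones and a peak need of 12 replicas, each zone runs 6 replicas, 18 in total. That is a planned surplus of 50% during normal operation, which is what allows a full zone to disappear without degradation.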

Auto scaling

All service workloads have mandatory resource requests specified per replica (typically CPU and memory needs). A horizontal autoscaler is configured for each workload to automatically add replicas once resource utilization exceeds the threshold defined for the service. The scheduler is instructed to spread replicas as evenly as possible across availability zones and nodes using a topologySpreadConstraints configuration.
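
The Kubernetes documentation describes the core of the horizontal autoscaler as a ratio: the desired replica count is the current count multiplied by current utilization over target utilization, rounded up. The sketch below reproduces only that ratio with a minimum and maximum clamp; tolerances and stabilization windows are left out, and the numbers are made up.

```python
import math


def desired_replicas(current: int, current_util: float, target_util: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Scale the replica count in proportion to how far utilization is from target."""
    raw = math.ceil(current * current_util / target_util)
    return max(min_replicas, min(max_replicas, raw))


scale_up = desired_replicas(current=3, current_util=90, target_util=60,
                            min_replicas=3, max_replicas=10)
```

Three replicas running at 90% against a 60% target give `scale_up == 5`. The lower bound keeps the redundancy intact when traffic is low, and the upper bound caps cost.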

In case there is insufficient capacity of the provisioned virtual machines, a cluster autoscaler is configured to effectively buy additional hardware resources on demand. This will automatically scale until the manually configured maximum is reached (not to break the bank on targeted DDoS attacks or misbehaving workloads).

Self-healing workloads

Signicat has chosen Kubernetes as the main workload management system. One of its key features is the declarative configuration of services. If 3 replicas are configured, Kubernetes continuously monitors the workload and reacts if reality does not match. Once a replica is unavailable, Kubernetes immediately spawns a replacement in a healthy environment to regain the lost capacity. The load balancer is automatically reconfigured to stop sending traffic to the faulty replica and start sending traffic to the newly spawned one.

A diagram explaining a declarative control loop by comparing it to a thermostat. The diagram is split into three sections: On the left, World state: Represents the current status. The analogy is a 'Termometer,' and the technical example is 'System probes,' shown with a graph icon. In the middle, Desired state: Represents the target status. The analogy is a 'Thermostat dial,' and the technical example is 'Declarative configuration,' shown with a gear and wrench icon. On the right side, Actuator: Represents the tool that makes changes. The analogy is 'Radiators/boilers,' and the technical example is 'Kubernetes,' shown with the Kubernetes logo. The Desired state has one arrow coming out of it pointing to the Actuator and World State.

An analogy of the declarative Kubernetes configuration is that of a thermostat. You can configure the temperature, and then it’s up to the heating and cooling system to continuously converge towards the desired state.
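
A single pass of such a control loop can be sketched in a few lines of Python. The `reconcile` function and the replica names are hypothetical and greatly simplified compared with a real Kubernetes controller.

```python
def reconcile(desired: int, replicas: dict[str, bool]) -> dict[str, bool]:
    """Compare observed replicas (name -> healthy) with the desired count and converge."""
    # Observe: keep only the replicas that are currently healthy.
    healthy = {name: True for name, ok in replicas.items() if ok}
    # Act: spawn replacements until the desired count is reached.
    counter = 0
    while len(healthy) < desired:
        name = f"replica-{counter}"
        counter += 1
        if name not in replicas:  # never reuse the name of an existing replica
            healthy[name] = True
    # Act: remove surplus replicas if more than desired are running.
    while len(healthy) > desired:
        healthy.popitem()
    return healthy


observed = {"replica-0": True, "replica-1": False, "replica-2": True}
converged = reconcile(desired=3, replicas=observed)
```

The faulty `replica-1` is dropped and a fresh `replica-3` takes its place, so the world state matches the declared state again.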

Publisher-Subscriber Pattern

Service-to-service communication strives to use the publisher-subscriber pattern where applicable. This asynchronous, queue-based system ensures that, in case of failure, messages can be picked up by another process, which effectively resumes the work from where it left off. Signicat uses Google Pub/Sub as the backend for this at-least-once delivery, where messages are only acknowledged after they have been successfully processed. Otherwise they are retried automatically.
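
The acknowledge-after-processing behaviour can be simulated with an in-memory queue. This sketch deliberately does not use the Google Pub/Sub client library; `consume`, `flaky_handler` and the `max_attempts` cap are assumptions made to keep the example self-contained.

```python
from collections import deque
from typing import Callable


def consume(queue: deque[str], handler: Callable[[str], None], max_attempts: int = 5) -> list[str]:
    """Process messages, acknowledging each one only after the handler succeeds."""
    attempts: dict[str, int] = {}
    acknowledged: list[str] = []
    while queue:
        message = queue.popleft()
        attempts[message] = attempts.get(message, 0) + 1
        try:
            handler(message)
        except Exception:
            # No acknowledgement was sent, so the message is delivered again.
            if attempts[message] < max_attempts:
                queue.append(message)
            continue
        acknowledged.append(message)
    return acknowledged


crashed_once: set[str] = set()


def flaky_handler(message: str) -> None:
    # Simulate a worker that crashes the first time it sees message "b".
    if message == "b" and message not in crashed_once:
        crashed_once.add(message)
        raise RuntimeError("worker crashed mid-processing")


delivered = consume(deque(["a", "b", "c"]), flaky_handler)
```

Message "b" is processed on its second delivery, so nothing is lost. The price of at-least-once delivery is that handlers must tolerate seeing the same message more than once.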

Retries

For in-flight requests, the Signicat service mesh enables retry logic at the network level. Idempotent requests heading to a faulty component are retried against one of the healthy replicas.

All workloads in Kubernetes have mandatory health probes defined, most importantly the readiness check. The readiness probe polls an API endpoint which checks whether all dependencies and configuration needed to successfully serve requests are in place. This includes any storage or database connections. If this check fails, the workload is automatically removed from the load balancer configuration and does not receive traffic. Only after the check passes again will the service start receiving traffic. Requests that were already inbound to the failing service may get retried automatically.

As a side note, Signicat also mandates the use of a liveness probe. If this probe fails, Kubernetes restarts the full process. The liveness API endpoint for this probe only validates that the process itself is operational, without checking external dependencies. If it did check dependencies, a network partition or similar event would cause cascading failures, with the full fleet of processes shutting down and spawning again, which is computationally expensive.
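
A simplified version of such retry logic is sketched below. It assumes idempotency is judged by HTTP method, using the set of idempotent methods defined by the HTTP specification; how the Signicat service mesh actually classifies requests is not stated in this article. The function and replica names are hypothetical.

```python
from typing import Callable

# Methods defined as idempotent by the HTTP specification (RFC 9110).
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE"}


def send_with_retry(method: str, replicas: list[str], send: Callable[[str], str],
                    max_attempts: int = 3) -> str:
    """Try replicas in order; only idempotent requests get more than one attempt."""
    attempts = max_attempts if method.upper() in IDEMPOTENT_METHODS else 1
    last_error = None
    for replica in replicas[:attempts]:
        try:
            return send(replica)
        except ConnectionError as error:
            last_error = error  # remember the failure and move to the next replica
    raise ConnectionError("all attempts failed") from last_error


def flaky_send(replica: str) -> str:
    # Simulate one faulty replica among healthy ones.
    if replica == "replica-1":
        raise ConnectionError("replica-1 is down")
    return f"ok from {replica}"


response = send_with_retry("GET", ["replica-1", "replica-2"], flaky_send)
```

The GET request transparently succeeds on `replica-2`. A POST in the same situation fails after one attempt, since repeating it could apply the same change twice.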
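
The difference between the two probes can be summarised in code. This is a minimal sketch of the endpoint logic only; the function names and dependency names are hypothetical. Kubernetes HTTP probes treat status codes from 200 to 399 as success, so 503 marks a failed check.

```python
from typing import Callable


def liveness() -> int:
    """Liveness: the process can answer at all. No dependency checks on purpose."""
    return 200


def readiness(dependency_checks: dict[str, Callable[[], bool]]) -> tuple[int, list[str]]:
    """Readiness: every dependency needed to serve requests must be reachable."""
    failing = []
    for name, check in dependency_checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that raises counts as a failed dependency
        if not ok:
            failing.append(name)
    return (200 if not failing else 503), failing


status, failing = readiness({"database": lambda: True, "storage": lambda: False})
```

With storage unreachable the replica reports 503 and stops receiving traffic, but it stays alive because `liveness()` still returns 200, so it is not restarted needlessly.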

Graceful Degradation

Graceful degradation is the ability of a system or application to continue functioning with limited features or performance, even when some of its components fail or encounter issues.

A diagram illustrating a secure access or proxy pattern. It shows a validated identity, represented by an ID card with a checkmark, which is granted access to a central system or service. This central system then explicitly blocks direct access to a downstream resource of connected users, indicated by a red 'X' over the connecting path. This demonstrates that the central system acts as a gatekeeper.

The aim is to prevent a complete system breakdown by ensuring that critical operations can still be performed.

Fallback to stale data

For critical dependencies, Signicat leverages aggressive caching, including a full data copy. This allows any read operation to succeed when the dependency is unavailable. One example of this is Signicat’s Account API, whose main purpose is to act as a buffer towards our Customer Relationship Management (CRM) vendor. In the case of degraded performance, network issues or vendor outages, any read operation will continue to be served from cache. Self-service functionality around purchasing new products and onboarding of new customers will be limited during the incident, but other functionality will appear unaffected.
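
A stripped-down version of this pattern is shown below. The `StaleCache` class and `crm_lookup` function are invented for the example, and the fetch-first order is a simplification; the Account API's real caching strategy is not described in detail here.

```python
from typing import Callable


class StaleCache:
    """Serve reads from the upstream when possible, from the local copy otherwise."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch
        self._copy: dict[str, str] = {}

    def read(self, key: str) -> tuple[str, bool]:
        """Return (value, is_stale)."""
        try:
            value = self._fetch(key)
        except ConnectionError:
            if key in self._copy:
                return self._copy[key], True  # degraded: answer with stale data
            raise                             # nothing cached: the failure surfaces
        self._copy[key] = value
        return value, False


upstream = {"up": True}


def crm_lookup(key: str) -> str:
    # Hypothetical stand-in for a call to the CRM vendor.
    if not upstream["up"]:
        raise ConnectionError("CRM vendor unreachable")
    return f"account:{key}"


cache = StaleCache(crm_lookup)
fresh = cache.read("42")
upstream["up"] = False  # simulate a vendor outage
stale = cache.read("42")
```

During the simulated outage the read still succeeds with the last known value, flagged as stale so callers can decide whether that is acceptable.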

Graceful PBAC degradation

Policy Based Access Control (PBAC) is used across all Signicat APIs. This allows us to write comprehensive policies for API access. Example: In order to access one of our authenticated API products, the request must:

  • Have a valid authentication token, signed by a trusted auth token issuer (Signicat OIDC)
  • Be linked with an active user account in the Signicat IAM system
  • Have required permissions (role) to access this particular API
  • Have required permissions (role) to access the customer tenant which the API request targets
  • Have the applicable product purchased and enabled
  • Not have exhausted its quota
  • Not exceed request rate limit

If all of the above is met, the request is accepted. If not, the request is denied. This is centrally managed by the Signicat PBAC engine to ensure consistent API access everywhere.

The PBAC pattern also allows for resiliency optimization. Let’s say that the link to the Customer Relationship Management (CRM) system is down and the PBAC cache has expired, making it impossible to determine which products are enabled for the client.

In this case, the Signicat PBAC will allow the request and skip the product check entirely! Why? Because the only risk introduced is that usage of products not under contract can’t be billed by Signicat. Compared with the alternative, blocking each and every API request, this graceful degradation in the failure scenario leaves the systems fully operational as viewed by the customer.

Service impact

  • Quota systems having issues? Ignore the policy and allow requests
  • CRM system down? Allow reading from cache, block write requests
  • Token validation having issues? Block all requests!
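
The fail-open versus fail-closed choices above can be expressed as a small decision function. The check names, the `Outcome` enum and the `FAIL_OPEN` table are illustrative assumptions, not the actual PBAC engine.

```python
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"
    UNAVAILABLE = "unavailable"  # the check could not be evaluated


# True means: allow the request when this check cannot be evaluated (fail open).
FAIL_OPEN = {"token": False, "product": True, "quota": True}


def decide(results: dict[str, Outcome]) -> bool:
    """Allow the request unless a check fails or an essential check is unavailable."""
    for check, outcome in results.items():
        if outcome is Outcome.FAIL:
            return False
        # Checks not listed in FAIL_OPEN are treated as essential (fail closed).
        if outcome is Outcome.UNAVAILABLE and not FAIL_OPEN.get(check, False):
            return False
    return True


crm_down = decide({"token": Outcome.PASS, "product": Outcome.UNAVAILABLE, "quota": Outcome.PASS})
token_down = decide({"token": Outcome.UNAVAILABLE, "product": Outcome.PASS, "quota": Outcome.PASS})
```

With the CRM link down the request is still allowed, while an unavailable token validation blocks it.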

We operate against a higher internal Service Level Objective (SLO) for our essential services, like token validation. In some cases, the degraded state is acceptable until normal work hours begin. If token validation fails, however, on-call is immediately alerted to remedy the situation.

While not every functionality supports graceful degradation, it remains a key design goal to optimize availability across the Signicat platform.

Monitoring

All Signicat services are monitored 24/7. To ensure availability at all times, both internal monitoring and synthetic monitoring are in place.

Synthetic Monitoring

Synthetic monitoring is used to simulate users using the Signicat products. This includes both API usage and UI elements.

Synthetic API test

At regular intervals, API calls corresponding to critical product functionality are run. Any deviation from the expected result immediately shows up as an anomaly. The screenshot shows an example of what a service outage would look like.

A system monitoring dashboard for a service named 'DTP EID HUB Login Flow (Production)'. The dashboard is divided into three sections. In the upper left corner, Last Check Summary: This section shows the latest status. The last check was on 8/28/2024 at 9:17:31 AM, the total number of checks is 39,151, and the SLA uptime is 99.99%. In the upper right corner, Uptime & Confirmed Errors Graph: A line graph displays the uptime percentage for the month of August. The graph shows a consistent high uptime until August 28th and a large red bar between August 27th and August 29th that indicates a period of errors. In the bottom half of the page, Monitor Log Table: A detailed log lists individual checks with their timestamp, status, and description. Most entries show a status of 'OK.' However, the last entry at 9:04:04 AM shows an 'HTTP send failure' from the 'Lille - 1' checkpoint, which corresponds to the error event shown on the graph.

API anomaly detection
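
A single synthetic check boils down to calling an endpoint and comparing the result with expectations. The sketch below keeps the HTTP call injectable so it runs without network access; the function names and the thresholds for status and latency are assumptions.

```python
from typing import Callable, Optional


def synthetic_check(call: Callable[[], tuple[int, float]],
                    expected_status: int = 200,
                    max_seconds: float = 2.0) -> Optional[str]:
    """Run one probe. Return None when healthy, otherwise a description of the anomaly.

    `call` performs the request and returns (status code, response time in seconds).
    """
    try:
        status, seconds = call()
    except OSError as error:  # ConnectionError and timeouts are OSError subclasses
        return f"request failed: {error}"
    if status != expected_status:
        return f"unexpected status {status}"
    if seconds > max_seconds:
        return f"slow response: {seconds:.2f}s"
    return None


def unreachable() -> tuple[int, float]:
    # Simulate the kind of outage shown in the monitor log.
    raise ConnectionError("HTTP send failure")


healthy_result = synthetic_check(lambda: (200, 0.3))
outage_result = synthetic_check(unreachable)
```

Running such a check on a schedule from several locations gives an outside-in view of availability that does not depend on the platform's own telemetry.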

Load testing

Signicat also regularly performs production load testing using artificial traffic. This ensures that any undetected scalability issues are found by Signicat before they lead to customer impact.

End to end UI validation

In addition to API calls, Signicat has continuous UI validation to ensure that what the user is presented with matches expectations. These are very high-level test cases that touch upon a large part of the functionality. This means a large number of systems get tested in each run.

In the screenshots below you can see the corresponding results of loading MitID with an unexpected result.

A visual regression test showing an 8.32% difference, where the actual result is an error page with code 'IDP-3200' from an aborted MitID authentication, likely by the user.

Internal monitoring

Internal monitoring is in place to quickly identify root causes or get early warnings about potential issues. These sensors are typically very specific to a small surface area of the solution, like an individual service or component of the solution.

A service mesh monitoring graph showing traffic flow between microservices, where an 'istio-ingressgateway' node directs traffic to a 'broker' service (with a connection latency of 203ms), to an 'idp-nbid-prod' node (with a connection latency of 40ms), and to another 'idp-nbid-prod' node (with a connection latency of 37ms). The 'broker' service then routes traffic to other services, including 'idp-nbid-prod' and 'idp-sbid-prod,' with connections represented by lines showing latency values like 40ms and 52ms. A side panel provides more details, listing healthy services (broker, idp-ftn-prod, idp-nbid-prod, and idp-sbid-prod) with a green checkmark and a service named 'auth' with a warning icon. It also specifies 34 apps, 3 services, and 70 edges. This panel also details the total HTTP traffic for inbound at 92.31 requests per second with a 98.38% success rate and a 1.62% error, visualised by a mostly green status bar that breaks down responses into 'OK,' '3xx,' '4xx,' and '5xx' categories.

Service mesh live

Logging

At Signicat we leverage structured logs when possible. For application logs, this means JSON logs with predefined metadata fields to enable quick filtering on trace id, customer tenant, product or service name and other contextual data. Log events signaling issues have predefined thresholds and alarms, which notify on-call personnel. Alarms may trigger on the likes of elevated error rates or security events.
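
With Python's standard `logging` module, structured JSON logs of this kind might look as follows. The field names `trace_id`, `tenant` and `service` are hypothetical examples of predefined metadata fields.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with fixed metadata fields."""

    CONTEXT_FIELDS = ("trace_id", "tenant", "service")  # hypothetical field names

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        for name in self.CONTEXT_FIELDS:
            if hasattr(record, name):
                payload[name] = getattr(record, name)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("example-service")
logger.addHandler(handler)
logger.propagate = False

# Context is attached per event through the standard `extra` argument.
logger.error("login failed", extra={"trace_id": "abc123", "tenant": "customer1"})
```

Because every event is a JSON object with the same keys, filtering on a trace id or tenant becomes an exact field match rather than a text search.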

Special item logs, such as audit logs, security logs and billing events are separated from application logs. This is because these logs have higher privacy constraints, retention needs and other specific requirements. This also gives us more flexibility with the application logs, which is the bulk of the logs, to the point where they may be treated as throwaway logs.

Metrics

Time series data, or metrics, are sampled across the systems making up the Signicat platform and product portfolio. This includes business metrics, like conversion rate, and purely technical measurements like CPU and memory consumption.

An observability dashboard with an 'Overview' section showing Global CPU Usage as 14.4% Real, 65.8% Requests, and 0.660% Limits, and Global RAM Usage as 31.23% Real, 67.15% Requests, and 64.16% Limits. Below this, CPU Usage is detailed as Real 87.4, Requests 253, Limits 2.53, and Total 384, while RAM Usage is Real 745 GiB, Requests 1012 GiB, Limits 967 GiB, and Total 1.47 TiB. The dashboard also displays counts of 26 Nodes, 113 Namespaces, and 792 Running Pods, alongside a 'Kubernetes Resource Count' graph showing stable resource levels over time. A 'Resources' section contains two time-series graphs: 'Cluster CPU Utilization,' which shows CPU percentage starting around 10% and rising to above 13% over 8 hours, and 'Cluster Memory Utilization,' which shows memory fluctuating between 30.4% and 31.4% over approximately 4.5 hours.

Example from one of our Kubernetes clusters

Metrics analysis

Predefined thresholds trigger alarms and automatically notify Signicat personnel. In the case of outages, alarms will also wake up on-call, but the majority of alerts are designed to trigger ahead of time.

Automatic analysis takes place using predefined rules to detect unwanted scenarios. Examples include:

  • Disk space near depletion
  • Failure to respond or substantial increase in response time
  • Drop in success rate
  • Malicious traffic patterns

Additionally, manual inspection is part of on-call routines through a daily checklist.
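
Threshold rules like the automatic examples listed above can be sketched as simple predicates over a metrics sample. The metric names and threshold values below are invented for illustration.

```python
# Each rule returns True when the metric value breaches its (hypothetical) threshold.
RULES = {
    "disk_free_percent": lambda value: value < 10.0,
    "p95_latency_seconds": lambda value: value > 1.0,
    "success_rate_percent": lambda value: value < 99.0,
}


def firing_alerts(sample: dict[str, float]) -> list[str]:
    """Return the names of all rules breached by this metrics sample."""
    return sorted(
        name for name, breached in RULES.items()
        if name in sample and breached(sample[name])
    )


alerts = firing_alerts({
    "disk_free_percent": 5.0,
    "p95_latency_seconds": 0.2,
    "success_rate_percent": 97.5,
})
```

This sample fires the disk space and success rate rules. Setting thresholds with some margin is what lets alerts trigger ahead of an actual outage.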

Alerting

Every alert is routed to the corresponding team through a third-party tool. Responses to alarms are handled by 24/7 on-call rotations. This ensures a path to notify Signicat staff that operates completely independently of the Signicat services themselves.

Incident response

Runbooks are created for the alarms, outlining the steps required for resolution. Signicat also practices its Disaster Recovery Plans (DRP) once a year, rehearsing relevant disaster scenarios. The multiple on-call rotations include dedicated experts on various products and components, under the platform engineering model of “You build it, you own it”. This is to ensure the shortest possible time to recovery.

A diagram split into two sections. The left section contains a grid of sixteen logos in individual squares: Google Cloud, Kubernetes, IntelliJ, Istio, GitLab, Grafana, Argo CD, Google Compute Engine, Docker, Helm, Visual Studio Code, Prometheus, SonarQube, Jenkins, OpenTelemetry, and Grafana Loki logos. The right section contains a large illustration of a mobile device, overlaid with a large shield icon which itself contains a padlock icon. Behind the device is an illustration of a wallet containing documents, with a user icon and a text box floating above it.

Break the glass procedures

  • Root accounts stored in a separate system. Alerting upon usage
  • Privilege escalation through infrastructure operations team

Consistent post-mortems

  • Blameless, asking the 5 whys
  • Enforced management review
    • Ensuring follow-up tasks and fixes get sufficient attention

Security

Signicat employs a foundational security model, with security layers built on top of each other. This approach reflects a defense-in-depth security strategy.

A diagram showing a set of concentric circles on the left with a list of corresponding labels on the right. The center is a solid purple circle, identified by a line from the label 'Customer Data'. Moving outwards, the subsequent rings are pointed to by the following labels in order: 'Process Isolation', 'OS Hardening', 'Architecture Controls', 'Network Security', 'Encryption', 'API Security', 'Physical Security', and finally 'Policies and training' which points to the ninth and outermost ring. The two outermost rings are colored teal, while the inner rings and central circle are shades of purple.

Signicat security layers

Foundational Security Architecture

The foundational security covers aspects that are used across the platform. This is generally operated by dedicated central functions responsible for managing the foundation resources that are consumed by multiple product areas and workloads.

  • Policy controls are programmatic constraints that enforce acceptable resource configurations and prevent risky configurations. Signicat uses a combination of policy controls, including infrastructure-as-code (IaC) validation in pipelines combined with Signicat Information Security Management System (ISMS) policy constraints
  • Architecture controls are the configuration of cloud resources like networks and resource hierarchy. This architecture is based on security best practices
  • Detective controls detect anomalous or malicious behavior. Signicat uses platform features such as our observability stack, which provides capabilities to enforce custom detective controls
A diagram showing a process flow divided into four sections: 'Policy', 'Automated Deployment Pipeline', 'Architecture', and 'Detection'. The process begins with a 'Submitter' providing input to an 'Infrastructure as Code Repository', which also receives input from a 'Change Management System' under the 'Policy' section. The 'Infrastructure as Code Repository' then feeds into a step labeled 'Automated Compliance Validation', which is under the 'Automated Deployment Pipeline' section and also receives input from an 'Approver'. This validation step leads to a gear icon in the 'Architecture' section, which is associated with a list of items: 'Folders', 'Projects', 'VPCs', 'Subnets', 'Firewalls', 'Routing', and 'IAM Permissions'. Finally, the gear icon leads to a magnifying glass icon in the 'Detection' section, which is associated with a list of items: 'Logging', 'Monitoring', 'Analysis', and 'Alerting'.

Policy Controls

Signicat employs a number of Information Security Management System (ISMS) policies as defined by ISO27001 and SOC2, ensuring high security and quality requirements to meet customer expectations. The policies are all approved by the Signicat Information Security Board, and managed via git for version control.

A block diagram with a horizontal block at the top labeled 'ISMS'. Below it is another horizontal block labeled 'Business continuity plan'. The next level contains a vertical block on the far left labeled 'Policies and procedures'. To the right of that, several other blocks are arranged: 'Production system - high reliability design' and 'Event management' are side-by-side. Above a stacked block labeled 'Disaster recovery plans' and a block labeled 'Backup' is another block labeled 'Incident management'. The bottom layer of the diagram consists of three horizontal blocks: 'Preventive' is positioned under 'Policies and procedures' and 'Production system - high reliability design'; 'Detective' is positioned under 'Event management'; and 'Corrective' is positioned under 'Incident management', 'Disaster recovery plans', and 'Backup'.

Access Management

Access follows the least privilege principle, with predefined roles and access levels. All access is managed through Single Sign-On (SSO), which simplifies onboarding, offboarding, and role changes while providing a comprehensive overview of who has access to what and where. More granular permissions are handled per context. Infrastructure access is governed by a cloud access policy that describes several levels of access. Specific components such as Kubernetes have even finer granularity, with a full predefined responsibility matrix:

A responsibility matrix table with three columns ('Responsibilities', 'Service Owners', 'Platform owners') divided into two sections. The first section, 'Admin access to Namespace', details the following: for 'Manage Secrets', Service Owners have 'Yes - Full Access' and Platform owners 'Provide self-service'; for 'Cert Management for HTTPS or within Containers', Service Owners have 'Yes - Full Access' and Platform owners have 'Service owner self service'; for 'Kubectl and API Access', Service Owners have 'Yes - Full Access' and Platform owners are 'Responsible to provide access'; for 'Blue Green, Canary Deployments', Service Owners have 'Yes - Full Access to manage deployments, rollbacks' and Platform owners 'Provide self-service'; for 'View logs (E.G: ElasticSearch/Kibana)', Service Owners have 'Yes - Access to view logs and dashboards' and Platform owners are 'Responsible for log collection'; for 'View metrics (E.G Prometheus/Grafana)', Service Owners have 'Yes - Access to view logs and dashboards' and Platform owners are 'Responsible for metrics collection'; for 'Console access into Pods to debug application', Service Owners have 'Yes - Full Access' and Platform owners 'Provide self-service'; for 'Manage Configurations using Config Maps', Service Owners have 'Yes - Full Access' and Platform owners 'Provide self-service'; for 'Schedule Jobs', Service Owners have 'Yes - Full Access' and Platform owners 'Provide self-service'. 
The second section, 'Kubernetes Cluster Level', details: for 'Manage Ingress / Service Mesh', Service Owners are 'Not Responsible' and Platform owners are 'Responsible'; for 'DNS, CNI, RBAC', Service Owners are 'Not Responsible, except manage access to namespace' and Platform owners are 'Responsible'; for 'Backup & Restore', Service Owners are 'Not Responsible' and Platform owners are 'Responsible'; for 'S/W Upgrade - Docker, Kubernetes, Cluster components', Service Owners are 'Not Responsible' and Platform owners are 'Responsible'; for 'Infra Patching - Worker Nodes and Masters', Service Owners are 'Not Responsible' and Platform owners are 'Responsible'; for 'Observability - Logs, metrics, tracing', Service Owners are 'Implementing necessary support in their systems' and Platform owners 'Provide observability platform'; and for 'High Availability', Service Owners are 'Responsible from application architecture & HA' and Platform owners are 'Responsible to deliver platform availability to meet SLA'.

Software Delivery LifeCycle (SDLC)

An important part of Signicat's security policies is the Software Delivery LifeCycle (SDLC), which describes the Signicat change management process and way of working. Examples include:

  • Infrastructure defined as code (IaC)
  • Separation of concerns through mandatory merge request approval
  • Quality assurance targets

Adherence to these requirements is also subject to third-party audits, such as the SOC2 Type 2 audit. Enforcement also includes automated alerts, for example when human activity is detected on resources governed by IaC.
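
As an illustration of the alerting idea, the following minimal Python sketch flags changes to IaC-governed resources that were not made by the deployment pipeline. The event fields, the pipeline identity and the resource names are hypothetical and are not taken from Signicat's tooling.

```python
from dataclasses import dataclass

# Hypothetical identity used by the deployment pipeline (an assumption).
PIPELINE_PRINCIPALS = frozenset({"iac-pipeline@example.iam"})


@dataclass(frozen=True)
class AuditEvent:
    principal: str     # who made the change
    resource: str      # which resource was changed
    iac_managed: bool  # True if the resource is governed by IaC


def human_changes(events):
    """Return events where a non-pipeline identity changed an IaC-managed resource."""
    return [
        e for e in events
        if e.iac_managed and e.principal not in PIPELINE_PRINCIPALS
    ]


# Made-up sample events, for illustration only.
events = [
    AuditEvent("iac-pipeline@example.iam", "firewall/allow-https", True),
    AuditEvent("alice@example.com", "firewall/allow-https", True),
    AuditEvent("alice@example.com", "sandbox/scratch-bucket", False),
]
alerts = human_changes(events)
```

In this sketch only the second event would raise an alert: it touches an IaC-managed resource and does not come from the pipeline identity.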

Architecture Controls

Pipeline tooling ensures that the artifact deployed to production is identical to what has been built, tested, and security scanned, with no malicious code injected along the way. Supply chain security is inspired by the SLSA framework.

Supply chain security

A diagram of a software supply chain threat model showing a process flow from a 'Producer' to a 'Consumer' through 'Source', 'Build', and 'Package' stages, with a separate 'Dependencies' box feeding into the 'Build' stage. The diagram is overlaid with threat categories and specific, lettered threats pointing to different stages. A legend at the bottom explains these threats: 'SOURCE THREATS' are A) 'Submit unauthorized change', B) 'Compromise source repo', and C) 'Build from modified source', which are shown pointing to the 'Source' and 'Build' stages. 'DEPENDENCY THREATS' is D) 'Use compromised dependency', shown pointing to the 'Dependencies' box. 'BUILD THREATS' are E) 'Compromise build process', pointing to the 'Build' stage; F) 'Upload modified package' and G) 'Compromise package registry', both pointing to the 'Package' stage; and H) 'Use compromised package', pointing to the final 'Consumer' step.

Source threats security

  • Source repo hosting with industry leading vendor
  • Mandatory merge request approvals
  • Static and Dynamic Application Security Scanning (SAST/DAST)
  • Infrastructure as Code scanning
  • Secret Detection

Dependency threats security

  • Software Bill Of Materials (SBOM) dependency security scanning
    • Detects known CVEs from SBOM

Build threats security

  • Ephemeral build containers blocking build worker attacks on software supply chain
    • Every pipeline job runs in a fresh container, which is deleted once the job ends
  • Open Container Initiative (OCI) standard
A diagram called build threats security shows how different files reference each other. It displays a root directory containing an 'oci-layout' file, an 'index.json' file, and a 'blobs/sha256' directory. An arrow from 'index.json' points to a JSON snippet where a 'manifest' object contains a 'digest' with the value 'sha256:0578e4...'. This digest, in turn, points to a blob file named '0578e4...' inside the 'blobs/sha256' directory. This blob file is itself a JSON manifest, containing a 'config' object with a digest of 'sha256:9bd428a...' and a 'layers' array with a digest of 'sha256:4a56a43...'. An arrow from the 'config' digest points to the blob file named '9bd428...', and an arrow from the 'layers' digest points to the blob file named '4a56a4...', with both of these files also located in the 'blobs/sha256' directory.
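
The content-addressed structure described above can be sketched in Python. This is a simplified illustration rather than Signicat's pipeline code: it builds a toy layout with placeholder blobs, then walks from index.json to the manifest and on to the config and layer blobs, recomputing each SHA-256 digest. The helper names are hypothetical, the JSON carries only the `digest` fields, and the index key `manifests` follows the OCI image layout specification.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_blob(root: Path, data: bytes) -> str:
    """Store data under blobs/sha256/<hex digest> and return its digest string."""
    hex_digest = hashlib.sha256(data).hexdigest()
    blob_dir = root / "blobs" / "sha256"
    blob_dir.mkdir(parents=True, exist_ok=True)
    (blob_dir / hex_digest).write_bytes(data)
    return f"sha256:{hex_digest}"


def read_verified(root: Path, digest: str) -> bytes:
    """Read a blob and fail if its content does not match the digest naming it."""
    algo, _, hex_digest = digest.partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported digest algorithm: {algo}")
    data = (root / "blobs" / algo / hex_digest).read_bytes()
    if hashlib.sha256(data).hexdigest() != hex_digest:
        raise ValueError(f"digest mismatch for {digest}")
    return data


def verify_layout(root: Path) -> int:
    """Follow index.json -> manifest -> config/layers; return blobs verified."""
    index = json.loads((root / "index.json").read_text())
    verified = 0
    for entry in index["manifests"]:
        manifest = json.loads(read_verified(root, entry["digest"]))
        verified += 1
        for ref in [manifest["config"], *manifest["layers"]]:
            read_verified(root, ref["digest"])
            verified += 1
    return verified


def build_toy_layout(root: Path) -> str:
    """Create a minimal layout with placeholder content; return the layer digest."""
    config = write_blob(root, b'{"architecture": "amd64"}')
    layer = write_blob(root, b"placeholder layer bytes, not a real tarball")
    manifest_doc = {"config": {"digest": config}, "layers": [{"digest": layer}]}
    manifest = write_blob(root, json.dumps(manifest_doc).encode())
    (root / "index.json").write_text(json.dumps({"manifests": [{"digest": manifest}]}))
    (root / "oci-layout").write_text(json.dumps({"imageLayoutVersion": "1.0.0"}))
    return layer
```

Because every reference is a hash of the content it points to, changing a single byte in any blob makes verification fail, which is what lets a deploy step confirm it received exactly what was built.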

Access management

Signicat follows the least privilege principle by limiting access to resources and services at the service or team level.

Signicat uses a single source of truth for employee accounts, where authorization is linked to group membership. Permissions are assigned through groups, which are defined as IaC. Permissions are scoped down to agile team granularity, ensuring product and team separation.

Cloud resources

Access to cloud resources (storage buckets, databases, keys and more) is scoped to teams. Apart from global admins, only the team's members and the team's service accounts have access to the team's cloud resources. As a result, each product has access only to its own database or storage.
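
A minimal sketch of this team-scoped model, assuming made-up group, principal and resource names (none of them come from Signicat's configuration), could look as follows:

```python
# Hypothetical group membership, as it might be declared in IaC.
GROUP_MEMBERS = {
    "team-signature": {"alice", "svc-signature"},
    "team-mobileid": {"bob", "svc-mobileid"},
    "global-admins": {"root-admin"},
}

# Hypothetical mapping from each cloud resource to the team that owns it.
RESOURCE_OWNER = {
    "signature-db": "team-signature",
    "mobileid-bucket": "team-mobileid",
}


def can_access(principal: str, resource: str) -> bool:
    """Allow only members of the owning team (or global admins); deny by default."""
    owner = RESOURCE_OWNER.get(resource)
    if owner is None:
        return False  # unknown resources are never accessible
    return (
        principal in GROUP_MEMBERS.get(owner, set())
        or principal in GROUP_MEMBERS["global-admins"]
    )
```

The deny-by-default branch is a design choice in this sketch: a resource without a declared owner is accessible to nobody.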

VPCs, Subnets, Firewall and Routing

From the public internet, only port 443 is open. Any new port opening would be done using Infrastructure as Code (IaC) and merge requests. All changes must be approved by the Signicat platform team after evaluating whether the need is justified. If an engineer from the platform team opens a merge request, a different team member must approve it in order to fulfill the separation of concerns defined by the SDLC.
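
The approval rules in this paragraph can be expressed as a small automated check. The Python sketch below is only an illustration under assumptions: the `PortChange` fields, the team names and the error messages are hypothetical, and a real pipeline would read this information from the merge request.

```python
from dataclasses import dataclass

PUBLIC_RANGE = "0.0.0.0/0"
ALLOWED_PUBLIC_PORTS = frozenset({443})  # from the text: only 443 is open


@dataclass(frozen=True)
class PortChange:
    port: int
    source_range: str        # CIDR range the rule would allow traffic from
    author: str              # who opened the merge request
    approver: str            # who approved it
    justification: str = ""  # documented need for the change


def review_errors(change: PortChange, platform_team: set) -> list:
    """Return the reasons a firewall change should be blocked (empty if none)."""
    errors = []
    if change.approver not in platform_team:
        errors.append("approver is not on the platform team")
    if change.approver == change.author:
        errors.append("author and approver must be different people")
    if (
        change.source_range == PUBLIC_RANGE
        and change.port not in ALLOWED_PUBLIC_PORTS
        and not change.justification.strip()
    ):
        errors.append("new public port needs a documented justification")
    return errors
```

For example, a change authored and approved by the same platform engineer would be rejected by the second rule, even though that engineer is otherwise entitled to approve.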

Network connectivity is segmented and protected by multiple firewalls. Examples include:

  • DMZ (internet)
  • Management network (Host network with virtual machines)
  • Pod (container) network

All incoming traffic is filtered through firewalls on the Google load balancer, the Istio API Gateway, and iptables firewalls for each individual container. No hosts on the management network can be reached from the outside.

Host Operating System Hardening

Containers run on Container-Optimized OS with containerd, a hardened virtual machine image provided by Google Cloud:

  • Smaller attack surface: Container-Optimized OS has a smaller footprint, reducing the instance's potential attack surface
  • Locked-down by default: Container-Optimized OS instances include a locked-down firewall and other security settings by default
    • Container-Optimized OS does not include a package manager, blocking installation of software packages directly on an instance
    • Container-Optimized OS does not support execution of non-containerized applications
    • The Container-Optimized OS kernel is locked down; it is not possible to install third-party kernel modules or drivers

Security through isolation using containers

Signicat's containerization approach provides additional security hardening on top of the traditional foundational security layer. These additional measures rely heavily on the intrinsic security properties of Kubernetes and containerization. In particular, containers provide a hardened, repeatable process for ensuring that an image includes only the services, functions, processes and components specifically required for its operation.

A diagram of a container-based architecture showing a layered stack. The bottom layer is a solid block labeled 'Hardware'. Above it is a solid block labeled 'Host OS'. On top of the Host OS is a layer labeled 'Container Engine'. The top layer consists of four separate containers running side-by-side on the Container Engine. The first container contains a 'Web Service' and 'Binaries'; the second container contains an 'eID Hub Service' and 'Binaries'; the third container contains a 'Signature Service' and 'Binaries'; and the fourth container contains a 'MobileID Service' and 'Binaries'.

Container Images

Builds are based on Dockerfile, Jib or similar tools. Only explicitly required packages and ports are specified, and each container runs only a single process. When possible, the container is started with a read-only file system for immutability.
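
In Kubernetes, a read-only root file system and a non-root user are commonly requested through the container `securityContext` fields `readOnlyRootFilesystem` and `runAsNonRoot`. Whether Signicat uses these exact fields is an assumption. The Python sketch below checks a container spec (as a plain dictionary) for them; the function name and messages are hypothetical.

```python
def container_findings(container: dict) -> list:
    """List hardening settings that are missing from a container spec."""
    findings = []
    security_context = container.get("securityContext", {})
    if security_context.get("readOnlyRootFilesystem") is not True:
        findings.append("root filesystem is writable")
    if security_context.get("runAsNonRoot") is not True:
        findings.append("container may run as root")
    return findings


# Illustrative container spec; the name and image are placeholders.
hardened = {
    "name": "example-service",
    "image": "registry.example/example-service@sha256:<digest>",
    "securityContext": {"readOnlyRootFilesystem": True, "runAsNonRoot": True},
}
```

A check like this could run in a pipeline before deployment, so that a container without the expected settings is reported early rather than discovered in production.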

Process Isolation

Containers are namespaces and control groups (cgroups) put together

All processes that run inside containers are isolated using cgroups and namespaces to ensure one compromised process does not compromise the system as a whole.

A diagram illustrating the relationship between containers (namespaces), cgroups, and the kernel, divided into two main sections labelled 'Host' and 'Kernel'. The 'Host' section contains three separate boxes running side-by-side, labelled 'Container 1 (namespaces)', 'Container 2 (namespaces)', and 'Container 3 (namespaces)'. The 'Kernel' section below it contains three separate trapezoid-shaped blocks, each labelled 'cgroups'. A line connects each container in the 'Host' section to its corresponding 'cgroups' block in the 'Kernel' section below it.

Control groups (cgroups)

Cgroup isolation provides hardware resource separation per process

  • Resource limiting: a group can be configured not to exceed a specified memory limit or use more than the desired amount of processors or be limited to specific peripheral devices
  • Prioritization: one or more groups may be configured to utilize fewer or more CPUs or disk I/O throughput
  • Accounting: a group's resource usage is monitored and measured
  • Control: groups of processes can be frozen or stopped and restarted
A diagram divided into a top section labeled 'Control Group (cgroup)' and a bottom section with four labeled boxes. The 'Control Group' section contains four separate colored blocks, each labeled '25%': one is light grey, one is purple, one is teal, and one is dark purple. The bottom section has four corresponding boxes labeled 'RAM', 'CPU', 'NET', and 'I/O', and each of these boxes also contains four blocks labeled '25%'. An arrow from the light grey '25%' block in the top section points to the 'RAM' box; an arrow from the purple '25%' block points to the 'CPU' box; an arrow from the teal '25%' block points to the 'NET' box; and an arrow from the dark purple '25%' block points to the 'I/O' box.
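
To make resource limiting concrete: on hosts using cgroup v2, the kernel exposes a group's limits as small text files such as `memory.max` and `cpu.max`. The Python sketch below parses those two files. It assumes a cgroup v2 hierarchy mounted at `/sys/fs/cgroup`, which is common but not guaranteed, and returns an empty result where the files are absent.

```python
from pathlib import Path


def parse_memory_max(text: str):
    """memory.max holds a byte count, or 'max' when no limit is set."""
    value = text.strip()
    return None if value == "max" else int(value)


def parse_cpu_max(text: str):
    """cpu.max holds '<quota> <period>' in microseconds; 'max' means no quota.

    Returns the limit expressed as a number of CPUs, or None if unlimited.
    """
    quota, period = text.split()
    return None if quota == "max" else int(quota) / int(period)


def read_limits(cgroup_dir: str = "/sys/fs/cgroup") -> dict:
    """Read the limits of the current cgroup, skipping files that do not exist."""
    base = Path(cgroup_dir)
    limits = {}
    for name, parser in (("memory.max", parse_memory_max), ("cpu.max", parse_cpu_max)):
        path = base / name
        if path.exists():
            limits[name] = parser(path.read_text())
    return limits
```

Inside a container limited to half a CPU, `cpu.max` would typically contain a quota that is half of the period, which `parse_cpu_max` turns into `0.5`.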

Namespaces

Processes inside a container are isolated from the rest of the system using Linux namespaces:

  • mnt – allows a container to have its own set of mounted filesystems and root directory
    • Processes in one mnt namespace cannot see the mounted filesystems of another mnt namespace
  • uts – allows different processes to have different host and domain names
  • pid – provides processes with an independent set of process IDs
    • A parent namespace can see the children namespaces and affect them, but a child can neither see the parent namespace nor affect it
  • user – per-namespace mapping of user and group IDs
    • User root inside the namespace can be mapped to a non-privileged user on the host
  • ipc – isolates Inter Process Communication resources such as semaphores, message queues and shared memory segments
    • It is not widely used, but some programs still depend on it
  • time – ability to change date and time in a container
  • net – network stack virtualization
    • Each namespace will have a private set of IP addresses, its own routing table, socket listing, connection tracking table, firewall and other network-related resources
A diagram showing a series of eight nested squares. The innermost square is labelled 'FTN Mediator' and contains a smaller box labeled 'tomcat'. The subsequent squares, moving outwards from the center, are individually labelled with the following terms in order: 'mnt', 'uts', 'pid', 'user', 'ipc', 'time', and 'net' for the outermost square.
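
On Linux, the namespaces a process belongs to are visible as symbolic links under `/proc/<pid>/ns`, each resolving to a string such as `net:[4026531992]`. Two processes share a namespace when those identifiers match. The Python sketch below reads them for the current process and returns an empty dictionary on systems without `/proc`. The function names are hypothetical.

```python
import os


def parse_ns_link(link: str):
    """Split a link target like 'net:[4026531992]' into ('net', 4026531992)."""
    kind, _, rest = link.partition(":")
    return kind, int(rest.strip("[]"))


def current_namespaces(pid: str = "self") -> dict:
    """Map each entry in /proc/<pid>/ns to its namespace identifier."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    if not os.path.isdir(ns_dir):  # for example on non-Linux systems
        return result
    for name in sorted(os.listdir(ns_dir)):
        try:
            _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        except (OSError, ValueError):
            continue  # entry not readable or not in the expected format
        result[name] = inode
    return result
```

Comparing the output of `current_namespaces()` from inside a container with the output on the host is a quick way to see which namespaces differ.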

Kubernetes Pod

  • One or more containers sharing relevant namespaces and control group
  • Processes inside containers run with an unprivileged user
  • Assigned a unique IP address shared by all containers
A diagram composed of two distinct parts. The upper part shows a series of seven concentric, nested rectangles, with the innermost rectangle containing two side-by-side boxes. The left inner box is labeled 'FTN Mediator' and contains a smaller box labeled 'tomcat', while the right inner box is labeled 'Istio' and contains a smaller box labeled 'Envoy'; both of these inner boxes are themselves enclosed by two nested boxes labeled 'mnt' and 'uts'. The outer concentric rectangles are labeled from the inside out as 'pid', 'user', 'ipc', 'time', and 'net'. To the right of the main nested structure is a list of labels ('iptables', 'routes', 'eth0', 'lo', 'sockets'), with an arrow pointing from 'eth0' to the IP address '10.0.0.42'. The lower part of the diagram is labeled 'Control Group (cgroup)' and shows four colored blocks each labeled '25%', which are connected by arrows to four corresponding resource boxes below them labeled 'RAM', 'CPU', 'NET', and 'I/O', with each of these resource boxes also containing four blocks labeled '25%'.

Networking

  • Pod network is by default an overlay network, without direct access to the host network
    • Separate IP segment and packet encapsulation (VXLAN)
  • Each pod has its own virtual ethernet (veth) connected to a virtual switch (linux bridge)
    • Dedicated firewall rules per pod
A network diagram showing two pods on the left, labelled 'Pod 1' and 'Pod 2'. Each pod contains two interfaces, labelled 'lo' and 'eth0'. The 'eth0' interface of 'Pod 1' is connected via a line labelled 'net' to an interface labeled 'veth42'. The 'eth0' interface of 'Pod 2' is connected via a line labelled 'net' to an interface labelled 'veth44'. The 'veth42' and 'veth44' interfaces are shown inside a larger box labeled 'Linux bridge', which also contains another interface on the right labelled 'eth0'.

Network Hardening

Network design is based on Zero-Trust networking

  • Leverage cloud load balancer as the first level of defense
    • Automatically block network protocol and volumetric DDoS attacks such as protocol floods (SYN, TCP, HTTP, and ICMP) and amplification attacks (NTP, UDP, DNS)
  • Incoming traffic passes through an API Gateway (Ingress), implemented using Istio
    • Filter all traffic not explicitly allowed, at Layer 7 (Application layer / HTTPS)
  • Injection of default security headers for all services, such as Content-Security-Policy and Strict-Transport-Security
  • Incoming and outgoing pod network traffic is redirected to Envoy (Istio), which enforces encryption/decryption (mTLS) for all internal network communication
A diagram labeled 'Istio Mesh' that is divided into a 'Data plane' and a 'Control plane'. In the Data plane, a solid green line representing 'Data plane traffic' flows from a user icon labeled 'APIs Content' through a connection labeled 'JWT+TLS mTLS', into an 'Ingress' gateway. The traffic then goes to a 'Proxy' which communicates with 'Service A'. This proxy communicates with a second 'Proxy' for 'Service B' via a line labeled 'HTTP, gRPC, TCP' and 'mTLS'. The second proxy communicates with 'Service B' and then sends traffic to an 'Egress' gateway, which flows out to a user icon labeled 'External API' via another 'JWT+TLS mTLS' connection. In the Control plane, a component labeled 'istiod' contains boxes for 'Certificate authority', 'Authentication policies', 'Network configuration', 'Authorisation policies', and 'API server configuration'. Dashed blue lines representing 'Control plane traffic' connect 'istiod' to the Ingress, Egress, and both Proxy components via a 'Control Plane Interface'. A 'Key' at the bottom of the diagram defines the solid green line as 'Data plane traffic', the dashed blue line as 'Control plane traffic', a blue arrow icon as 'Local authorisation', and an orange gear-like icon as 'Certificate'.
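
The header injection step can be pictured as a small function applied to every response. In the platform described here this happens in the gateway, not in application code, so the Python sketch below is only a model of the behaviour. The header values are illustrative assumptions, not Signicat's actual policy.

```python
# Assumed default values, for illustration only.
DEFAULT_SECURITY_HEADERS = {
    "Strict-Transport-Security": "max-age=31536000; includeSubDomains",
    "Content-Security-Policy": "default-src 'self'",
}


def with_default_headers(headers: dict) -> dict:
    """Add default security headers unless the service already set them."""
    existing = {name.lower() for name in headers}  # header names are case-insensitive
    merged = dict(headers)
    for name, value in DEFAULT_SECURITY_HEADERS.items():
        if name.lower() not in existing:
            merged[name] = value
    return merged
```

Keeping a header that a service has already set lets individual services tighten a policy, while services that set nothing still get a safe default.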

Network security hardening

Security of data in transit

External traffic is protected according to international standards, including several eID compliance requirements

This configuration includes:

  • TLS 1.2 or higher
  • Only modern and safe ciphers, following the already mentioned requirements
    • Daily monitoring of TLS Ciphers against industry best practices and product specific requirements
    • Implementation:
  • DNSSEC signing of DNS records
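
From a client's point of view, the TLS requirements above can be mirrored with Python's standard `ssl` module. This is a sketch under assumptions: the cipher string is an example of restricting TLS 1.2 to forward-secret AEAD suites, not the list Signicat enforces, and `set_ciphers` does not affect TLS 1.3 suites.

```python
import ssl


def strict_client_context() -> ssl.SSLContext:
    """Build a client context that refuses anything older than TLS 1.2."""
    ctx = ssl.create_default_context()  # certificate and hostname checks enabled
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # Example restriction for TLS 1.2 (an assumption, see the lead-in).
    ctx.set_ciphers("ECDHE+AESGCM:ECDHE+CHACHA20")
    return ctx
```

A context built this way can be passed to standard library clients, for example as the `context` argument of `urllib.request.urlopen`.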

Signicat API Security

Signicat APIs are built on the same security fundamentals as our products providing authentication for banks, government and other highly regulated industries. In fact, it’s the very same Signicat certified OIDC service powering authentication for Signicat APIs. Additionally, Signicat has centralized authorization by implementing a Policy Based Access Control (PBAC) engine, which combines validation of cryptographically strong authentication, Role Based Access Control (RBAC), and context-aware rules such as product checks, quotas or graceful failure modes.
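
The way such an engine layers its checks can be sketched in a few lines of Python. This is a conceptual model only: the field names, the order of evaluation and the messages are assumptions, and token verification is represented by a boolean because it would be performed by the OIDC layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    token_valid: bool      # outcome of cryptographic token validation (done elsewhere)
    roles: frozenset       # roles granted to the caller (RBAC)
    required_role: str     # role the requested operation needs
    product_enabled: bool  # context rule: has the customer enabled this product?
    calls_used: int        # context rule: usage in the current period
    quota: int             # context rule: allowed usage in the current period


def decide(request: AccessRequest):
    """Evaluate authentication, then RBAC, then context rules; deny on first failure."""
    if not request.token_valid:
        return False, "authentication failed"
    if request.required_role not in request.roles:
        return False, "missing role"
    if not request.product_enabled:
        return False, "product not enabled"
    if request.calls_used >= request.quota:
        return False, "quota exceeded"
    return True, "allowed"
```

Centralizing the decision in one function, as in this sketch, means every API applies the same rules in the same order and each denial carries a specific reason.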

Conclusion

Signicat's robust and resilient infrastructure, coupled with defense in depth security measures and continuous auditing, ensures that our services remain reliable and secure. Our proactive approach to redundancy, failure modes and security underscores our dedication to delivering uninterrupted service to our clients. At Signicat, we prioritize resilience and reliability, ensuring that our clients can always trust in the security, integrity and availability of our products and services.