Cold-start stream parsing is the process by which a [CESR](/concept/cesr) stream parser initializes, or recovers from an error, by locating well-defined framing information so it can correctly parse groups of elements. Special count codes serve as synchronization points, letting the parser resynchronize without flushing its buffers and losing data.
Comprehensive Explanation
Process Definition
Cold-start stream parsing addresses a fundamental challenge in CESR (Composable Event Streaming Representation) stream processing: how parsers initialize, synchronize, and recover from errors when processing serialized cryptographic primitives and event data in KERI-based systems.
What It Accomplishes
The cold-start parsing process enables a stream parser to:
Initialize parsing state after system reboot or first connection to a stream source
Locate framing boundaries that delineate individual primitives or groups of primitives
Recover from malformed data without discarding buffered content
Re-synchronize at well-defined boundaries when encountering ambiguous or corrupted stream segments
Support mixed serialization formats (CESR, JSON, CBOR, MessagePack) within a single stream
When It's Used
Cold-start parsing occurs in several critical scenarios:
System Initialization: When a witness, watcher, or validator first connects to a KERI event stream
Error Recovery: When the parser encounters malformed or ambiguous data that disrupts normal parsing
Implementation Notes
Critical Implementation Details
Buffer Management
Preserve In-Transit Data: The fundamental principle of cold-start parsing is that recovery should never flush buffers unless absolutely necessary. Implementations must:
Maintain separate read and write pointers into buffers
Only discard data between error point and recovery boundary
Preserve all data after the recovery boundary
Use circular buffers or similar structures to avoid unnecessary copying
Buffer Sizing: Allocate buffers large enough to:
Contain at least one complete event group
Span typical count code boundaries
Allow scanning for recovery boundaries without overflow
Recommended minimum: 4KB for text streams, 3KB for binary streams
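The buffer-management rules above can be sketched with a hypothetical `StreamBuffer` that keeps a read pointer separate from the write end and, on recovery, drops only the bytes between the error point and the recovery boundary (the class and method names are illustrative, not a real API):

```python
class StreamBuffer:
    """Minimal sketch: preserve in-transit data across recovery."""

    def __init__(self, data=b""):
        self.data = bytearray(data)  # in-transit bytes
        self.pos = 0                 # read pointer; write end is len(self.data)

    def feed(self, chunk: bytes):
        self.data.extend(chunk)      # append newly received bytes

    def recover(self, error_at: int, boundary_at: int):
        # Drop only [error_at, boundary_at); bytes at and after the
        # recovery boundary survive, so nothing downstream is lost.
        del self.data[error_at:boundary_at]
        self.pos = error_at          # resume reading at the boundary

buf = StreamBuffer(b"goodGARBAGE-NEXTGROUP")
buf.recover(error_at=4, boundary_at=11)   # discard the 7 garbage bytes only
assert bytes(buf.data) == b"good-NEXTGROUP"
```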
Framing Code Recognition
Count Code Validation: When scanning for boundaries, implementations must:
Check that the candidate count code is valid for the current CESR version
Verify that the declared element count or byte length is reasonable
Confirm that subsequent data matches the declared structure
Validate that primitives following the count code are properly self-framing
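A minimal sketch of such plausibility checks, assuming CESR 1.0 conventions (text-domain count codes begin with `-` and contain only Base64 URL-safe characters, in multiples of four characters); a real parser would additionally consult the active code table to verify the declared count:

```python
import re

# Base64 URL-safe alphabet plus the '-' count-code prefix
B64_URLSAFE = re.compile(rb"\A[-A-Za-z0-9_]+\Z")

def plausible_count_code(candidate: bytes) -> bool:
    """Cheap plausibility checks before full code-table validation."""
    if not candidate.startswith(b"-"):   # text-domain count codes begin with '-'
        return False
    if len(candidate) % 4 != 0:          # 24-bit alignment: 4-character multiples
        return False
    if not B64_URLSAFE.match(candidate): # only Base64 URL-safe characters allowed
        return False
    return True

assert plausible_count_code(b"-AAB")
assert not plausible_count_code(b"{AAB")
```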
Three-Bit Signatures: The sniffer component should inspect the top three bits (the tritet) of the stream's first byte to detect format types:
# Tritet values for format detection, per the CESR specification's
# start-bit table (top three bits of the first byte)
TRITET_CESR_T_COUNT = 0b001  # CESR text-domain count code ('-' = 0x2D)
TRITET_CESR_T_OP    = 0b010  # CESR text-domain op code ('_' = 0x5F)
TRITET_JSON         = 0b011  # '{' = 0x7B
TRITET_MGPK_FIXMAP  = 0b100  # MessagePack fixmap
TRITET_CBOR_MAP     = 0b101  # CBOR major type 5 (map)
TRITET_MGPK_MAP     = 0b110  # MessagePack map16/map32
TRITET_CESR_B_COUNT = 0b111  # CESR binary-domain count code
Error Recovery Strategy
Scan Distance Limits: Implementations should limit how far they scan for boundaries:
Recommended maximum: 1024 bytes for most applications
High-throughput systems: May use larger limits (4KB-8KB)
Resource-constrained devices: May use smaller limits (256-512 bytes)
If no boundary found within limit, declare fatal error
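The scan-limit policy above can be sketched as follows (probing for `-`, the start of a CESR text-domain count code; a real parser would also validate each candidate before accepting it as a boundary):

```python
SCAN_LIMIT = 1024  # recommended maximum for most applications

def find_recovery_boundary(buf: bytes, start: int, limit: int = SCAN_LIMIT) -> int:
    """Scan forward at most `limit` bytes for a candidate frame start."""
    for i in range(start, min(start + limit, len(buf))):
        if buf[i:i+1] == b"-":   # candidate count-code start byte
            return i             # caller still validates the candidate
    # No boundary within the scan limit: declare a fatal error
    raise ValueError("no recovery boundary within scan limit")

assert find_recovery_boundary(b"????-AAB", 0) == 4
```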
Recovery Logging: Track all cold-start events for debugging, recording at minimum the error position, the scan distance, and the recovery boundary chosen
Key Terms
Version Codes: Specify which CESR code tables to use for parsing
Buffer Management: In-transit data storage that must be preserved during recovery
Process Flow
Initial Cold Start Sequence
When a parser performs a cold start from system initialization:
Step 1: Default Version Loading
The parser initializes with a default CESR version that specifies which code tables to load. This default ensures the parser can begin processing even if the stream doesn't start with an explicit version code.
Step 2: Stream Format Detection (Sniffing)
The sniffer component examines the first bytes/characters of the stream to detect format:
CESR Text: Begins with Base64 URL-safe character framing codes
JSON: Begins with { or [ characters
CBOR: Begins with CBOR major type indicators
MessagePack: Begins with MGPK format markers
Each format has unique three-bit combinations in its object codes or group codes that enable unambiguous detection. This property makes CESR streams sniffable.
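The sniffing step can be sketched as a lookup on the top three bits (the tritet) of the stream's first byte, assuming the start-bit assignments from the CESR specification:

```python
# Map each tritet to its detected format (per the CESR start-bit table)
TRITET_FORMATS = {
    0b001: "CESR-text",     # '-' count code
    0b010: "CESR-text-op",  # '_' op code
    0b011: "JSON",          # '{' = 0x7B
    0b100: "MGPK",          # fixmap
    0b101: "CBOR",          # major type 5 (map)
    0b110: "MGPK",          # map16 / map32
    0b111: "CESR-binary",   # binary-domain count code
}

def sniff(first_byte: int) -> str:
    """Detect the stream format from the first byte's top three bits."""
    return TRITET_FORMATS.get(first_byte >> 5, "unknown")

assert sniff(ord("{")) == "JSON"        # 0x7B >> 5 == 0b011
assert sniff(ord("-")) == "CESR-text"   # 0x2D >> 5 == 0b001
```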
Step 3: Framing Code Extraction
Once format is detected, the parser extracts the initial framing code:
For CESR streams: Reads the count code or group code that specifies how many primitives or bytes follow
For JSON/CBOR/MGPK: Uses regex or format-specific parsing to locate the version string field and determine the serialized section length
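For the JSON case, a sketch of version-string extraction, assuming the KERI v1 version string layout (e.g. `KERI10JSON0000fb_`: protocol, two hex version digits, serialization kind, six-hex-digit size, `_` terminator):

```python
import re

# Locate a KERI v1 version string near the start of a framed message
VEREX = re.compile(rb"(KERI|ACDC)([0-9a-f])([0-9a-f])(JSON|CBOR|MGPK)([0-9a-f]{6})_")

raw = b'{"v":"KERI10JSON0000fb_","t":"icp"}'
m = VEREX.search(raw)
proto, major, minor, kind, size = m.groups()
msg_len = int(size, 16)   # total serialized length of this message in bytes

assert kind == b"JSON"
assert msg_len == 0xFB    # 251 bytes declared
```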
Step 4: Boundary-Aligned Parsing
CESR's 24-bit alignment constraint ensures all primitives align on boundaries that are:
Integer multiples of 4 Base64 characters (24 bits) in text domain
Integer multiples of 3 bytes (24 bits) in binary domain
This alignment guarantees that framing codes always begin at predictable positions, enabling atomic extraction of elements.
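The alignment property is easy to verify: every three binary bytes encode to exactly four Base64 characters, so converting between domains never shifts frame boundaries:

```python
import base64

# 6 bytes = 2 full 3-byte (24-bit) quanta
raw = b"\x00\x01\x02\x03\x04\x05"
text = base64.urlsafe_b64encode(raw)

assert len(raw) % 3 == 0                 # binary domain: 3-byte multiples
assert len(text) == len(raw) // 3 * 4    # text domain: 4-character multiples, no padding
```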
Operational Benefits
Parser can recover from transient failures without service interruption
Monitoring cold-start frequency helps detect infrastructure issues
Graceful degradation maintains partial service during problems
Cross-Platform Compatibility
Implementations across languages (Python, Rust, TypeScript):
Must implement identical cold-start behavior
Should produce same recovery decisions for same input
Need consistent error reporting formats
Must maintain interoperability despite implementation differences
Conclusion
Cold-start stream parsing is a critical capability that enables CESR's robustness and composability. By providing well-defined synchronization points through count codes and supporting recovery without buffer flushing, CESR parsers can maintain high reliability even in the face of network issues, malformed data, or system restarts. This capability is essential for the production deployment of KERI-based identity systems where continuous operation and data integrity are paramount.
Default Version Selection: Choose default CESR version based on:
Most recent stable specification
Compatibility with deployed systems
Support for required primitive types
Current recommendation: CESR 1.0
Version Code Handling: When encountering version codes:
Verify version code appears at top level (not nested)
Load appropriate code tables atomically
Maintain backward compatibility with older versions
Log version transitions for audit trails
Performance Optimization
Fast Path for Normal Operation: Optimize for the common case:
Cache frequently used code table entries
Use lookup tables instead of conditional logic
Minimize allocations during normal parsing
Reserve cold-start overhead for actual error cases
Lazy Boundary Validation: During scanning:
Perform quick checks first (valid code range)
Only do expensive validation (length verification) for promising candidates
Use early exit conditions to avoid unnecessary work
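A sketch of the cheap-to-expensive ordering with early exits (function names are hypothetical; real validation would decode the declared count against the code table):

```python
def expensive_length_check(frame: bytes) -> bool:
    # Placeholder for decoding the declared count and verifying the
    # following bytes (hypothetical; stands in for code-table validation)
    return True

def validate_candidate(buf: bytes, i: int) -> bool:
    """Reject most non-boundary bytes with a single cheap comparison."""
    if buf[i:i+1] != b"-":                # cheap: valid code-range check first
        return False
    frame = buf[i:i+4]
    if len(frame) < 4:                    # cheap: enough bytes for a full code?
        return False
    return expensive_length_check(frame)  # expensive check only for survivors

assert validate_candidate(b"-AAB", 0)
assert not validate_candidate(b"xAAB", 0)
```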
Thread Safety
Shared Code Tables: In multi-threaded environments:
Code tables should be immutable after loading
Use copy-on-write for version transitions
Protect parser state with appropriate synchronization
Consider per-thread parser instances for high concurrency
Testing Requirements
Malformed Data Testing: Implementations must be tested with:
Truncated primitives at various positions
Invalid count codes with plausible values
Format confusion (CESR text that looks like JSON)
Repeated recovery scenarios
Boundary cases (recovery at buffer edges)
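A minimal sketch of generating the truncation cases from the checklist, using a deliberately simplified framing check (the frame and the check are illustrative, not real CESR validation):

```python
frame = b"-AABxxxx"   # hypothetical 8-character text-domain frame

def well_framed(buf: bytes) -> bool:
    # Simplified stand-in: count-code prefix plus 4-character alignment
    return buf.startswith(b"-") and len(buf) % 4 == 0

# Truncate at every misaligned position and confirm each case is rejected
truncated = [frame[:i] for i in range(1, len(frame)) if i % 4 != 0]
assert all(not well_framed(t) for t in truncated)
assert well_framed(frame)
```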
Interoperability Testing: Verify that:
Different implementations recover at same boundaries
Error reporting is consistent across implementations
Version transitions work identically
Mixed format streams parse identically
Security Considerations
Denial of Service Prevention: Protect against:
Malicious data designed to trigger excessive scanning
Crafted streams that cause repeated cold starts
Resource exhaustion through buffer overflow
Timing attacks based on recovery behavior
Isolation: Ensure that:
Recovery in one stream doesn't affect others
Malformed data can't corrupt parser state
Error conditions are properly contained
Cryptographic material is never exposed during recovery