Welcome to Chapter 15! This is where we bring everything we’ve learned about OpenZL together into an exciting, hands-on project. In the real world, data is often structured, and one of the most common forms is time-series data, particularly from sensors. Think about temperature readings, IoT device metrics, or stock prices – they all have a timestamp and one or more associated values.
In this chapter, you’ll learn how to leverage OpenZL’s unique format-aware compression capabilities to efficiently compress time-series sensor data. We’ll simulate a stream of sensor readings, define a schema that OpenZL can understand, and then use OpenZL to achieve impressive compression ratios. This project will solidify your understanding of OpenZL’s core concepts, from schema definition to the power of specialized codec graphs. Get ready to build something practical and see OpenZL in action!
Before we dive in, make sure you’re comfortable with the basics of OpenZL setup, schema definition, and the concept of codec graphs from previous chapters. We’ll be building on that foundation.
Core Concepts for Time-Series Compression
Time-series data presents both challenges and opportunities for compression. Let’s explore why OpenZL is particularly well-suited for this type of data.
What Makes Time-Series Data Special?
Imagine a sensor reporting temperature and humidity every second. What do you notice about this data?
- Temporal Order: Data points are inherently ordered by time.
- Regularity: Often, data arrives at fixed intervals (e.g., every second, minute).
- Redundancy: Consecutive readings are often very similar or follow predictable patterns (e.g., temperature doesn’t usually jump from 20°C to 50°C in a second).
- Structure: Each data point typically has a timestamp and one or more numerical values (e.g., `{"timestamp": ..., "temperature": ..., "humidity": ...}`).
Generic compressors like Gzip or Zstd are good at finding general byte patterns, but they don’t understand the semantic structure of your data. They don’t know that “temperature” is a floating-point number that changes slowly, or that “timestamp” is an ever-increasing integer. This is where OpenZL shines!
OpenZL’s Format-Aware Advantage
OpenZL, as we’ve learned, is a format-aware compression framework. Instead of treating your data as a flat stream of bytes, it takes a description of your data’s structure (a schema) and builds a specialized compressor tailored to that exact format.
For time-series data, this means:
- Schema-Driven Optimization: You define the types of your fields (e.g., `int64` for the timestamp, `float32` for temperature). OpenZL uses this information to apply codecs best suited for each data type.
- Codec Graph Specialization: OpenZL can choose specific codecs for timestamps (e.g., delta encoding for increasing values), and different codecs for sensor readings (e.g., run-length encoding for stable periods, or more advanced prediction-based codecs for gradual changes).
- Exploiting Redundancy: By understanding the data types and their typical behavior in time-series (e.g., small changes between consecutive values), OpenZL can apply techniques like differential encoding or specialized floating-point compression algorithms that generic compressors can’t.
Let’s visualize how OpenZL might process a simple time-series record.
Figure 15.1: A simplified OpenZL Codec Graph for a single time-series record. Each field is processed by a specialized codec before being combined.
In this diagram, the “Parse Schema” step allows OpenZL to identify individual fields. Then, based on the schema and potentially some training data, it selects specific codecs. For instance, timestamps often benefit from delta encoding because successive timestamps usually differ by a small, consistent amount. Temperature and humidity, being floating-point numbers, might use specialized floating-point compression techniques.
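Delta encoding is easy to demonstrate outside of OpenZL. The following standalone sketch shows the underlying idea, not the OpenZL API: it turns a stream of millisecond timestamps into deltas and back.

```cpp
#include <cstdint>
#include <vector>

// Delta-encode: keep the first value as-is, then store only the
// difference from the previous value. Regular 1-second sampling turns
// large 64-bit timestamps into a stream of identical small deltas.
std::vector<int64_t> delta_encode(const std::vector<int64_t>& values) {
    std::vector<int64_t> deltas;
    deltas.reserve(values.size());
    int64_t prev = 0;
    for (int64_t v : values) {
        deltas.push_back(v - prev);
        prev = v;
    }
    return deltas;
}

// Invert the transform by accumulating the deltas.
std::vector<int64_t> delta_decode(const std::vector<int64_t>& deltas) {
    std::vector<int64_t> values;
    values.reserve(deltas.size());
    int64_t acc = 0;
    for (int64_t d : deltas) {
        acc += d;
        values.push_back(acc);
    }
    return values;
}
```

For data sampled every 1000 ms, every delta after the first is exactly 1000, so the per-record information content collapses before any entropy coding even runs.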
Step-by-Step Implementation: Compressing Sensor Data
For this project, we’ll simulate a stream of sensor data, define its structure, and then compress and decompress it using OpenZL. We’ll assume you have OpenZL installed and set up according to Chapter 2. We’ll use a high-level conceptual API representation, as OpenZL primarily offers C++ APIs, and the goal is to understand the process rather than specific language bindings.
Prerequisites:
- OpenZL library installed (refer to Chapter 2 for setup).
- A C++17 compatible compiler (e.g., GCC 9+, Clang 9+).
- CMake (version 3.15+).
We’ll work with a simplified C++ example to illustrate the concepts.
Step 1: Generating Sample Time-Series Data
First, let’s create some synthetic time-series data. Imagine a sensor reporting temperature and humidity.
// sensor_data_generator.h
#pragma once
#include <vector>
#include <string>
#include <chrono>
#include <random>
struct SensorReading {
int64_t timestamp_ms; // Milliseconds since epoch
float temperature_c; // Celsius
float humidity_percent; // Percentage
};
std::vector<SensorReading> generate_sensor_data(size_t num_readings) {
std::vector<SensorReading> data;
data.reserve(num_readings);
auto now = std::chrono::system_clock::now();
int64_t current_timestamp = std::chrono::duration_cast<std::chrono::milliseconds>(
now.time_since_epoch()
).count();
std::mt19937 gen(0); // For reproducibility
std::normal_distribution<float> temp_dist(25.0f, 1.0f); // Avg 25C, Std Dev 1C
std::normal_distribution<float> humid_dist(60.0f, 2.0f); // Avg 60%, Std Dev 2%
for (size_t i = 0; i < num_readings; ++i) {
SensorReading reading;
reading.timestamp_ms = current_timestamp + (i * 1000); // 1-second interval
reading.temperature_c = temp_dist(gen);
reading.humidity_percent = humid_dist(gen);
data.push_back(reading);
}
return data;
}
Explanation:
- We define a `SensorReading` struct to hold our data points: `timestamp_ms` (an `int64_t` for milliseconds since the epoch), plus `temperature_c` and `humidity_percent` (both `float`).
- The `generate_sensor_data` function creates a vector of these structs.
- It simulates readings taken every second, with temperature and humidity fluctuating slightly around averages using normal distributions. This kind of data is perfect for OpenZL!
Step 2: Defining the OpenZL Schema
Now, let’s tell OpenZL about the structure of our SensorReading. This is the crucial step where OpenZL gains its “format awareness.” OpenZL uses a schema definition language, often represented in a structured format like JSON or a programmatic builder. For our example, we’ll use a conceptual C++ builder pattern.
// openzl_schema_definition.h
#pragma once
#include <iostream>
#include <string>
#include <vector>
// Conceptual representation of OpenZL schema components
namespace OpenZL {
enum class DataType {
INT64,
FLOAT32,
// ... other types
};
struct Field {
std::string name;
DataType type;
// Potentially other metadata like 'is_monotonic', 'delta_encoding_candidate'
};
struct Schema {
std::string name;
std::vector<Field> fields;
// Conceptual method to register the schema with OpenZL
void register_with_openzl() const {
// In a real OpenZL API, this would involve calling a library function
// to pass this schema definition to the framework.
// E.g., OpenZL::SchemaRegistry::add(this->name, *this);
// For now, consider this a conceptual step.
std::cout << "--- Registering OpenZL Schema: " << name << " ---" << std::endl;
for (const auto& field : fields) {
std::cout << " Field: " << field.name << ", Type: ";
switch (field.type) {
case DataType::INT64: std::cout << "INT64"; break;
case DataType::FLOAT32: std::cout << "FLOAT32"; break;
}
std::cout << std::endl;
}
std::cout << "Schema registered conceptually." << std::endl;
}
};
Schema create_sensor_schema() {
Schema sensor_schema;
sensor_schema.name = "SensorReadingSchema";
sensor_schema.fields.push_back({"timestamp_ms", DataType::INT64});
sensor_schema.fields.push_back({"temperature_c", DataType::FLOAT32});
sensor_schema.fields.push_back({"humidity_percent", DataType::FLOAT32});
return sensor_schema;
}
} // namespace OpenZL
Explanation:
- We define conceptual `DataType`, `Field`, and `Schema` structs to represent OpenZL’s schema definition.
- The `create_sensor_schema` function builds our specific schema, mapping the field names and types from our `SensorReading` struct.
- The `register_with_openzl` method is a placeholder for how you’d interact with the actual OpenZL library to make your schema known. This step is critical because it allows OpenZL to understand the logical structure of your data.
Step 3: Compressing the Data
With the schema defined, OpenZL can now build a specialized compression plan (a codec graph) and compress our data.
// main.cpp
#include "sensor_data_generator.h"
#include "openzl_schema_definition.h"
#include <iostream>
#include <vector>
// --- Conceptual OpenZL Compression/Decompression APIs ---
// In a real scenario, these would be actual library calls.
namespace OpenZL {
// Represents a compiled compression plan (codec graph)
struct Compressor {
std::string schema_name;
// Internal state for the actual compression logic
Compressor(const std::string& name) : schema_name(name) {
std::cout << "OpenZL: Initializing compressor for schema: " << schema_name << std::endl;
// In reality, this would involve OpenZL compiling a codec graph
// based on the registered schema and potentially training data.
}
// Conceptual compression function
std::vector<char> compress(const std::vector<SensorReading>& data) {
std::cout << "OpenZL: Compressing " << data.size() << " sensor readings..." << std::endl;
// Simulate compression logic. For structured data, OpenZL would apply
// field-specific codecs.
// For demonstration, let's just create a dummy compressed buffer.
size_t original_size = data.size() * sizeof(SensorReading);
size_t compressed_size = original_size / 4; // Simulate 4x compression
std::vector<char> compressed_buffer(compressed_size, 'Z'); // Fill with dummy data
std::cout << " Original size: " << original_size << " bytes" << std::endl;
std::cout << " Compressed size: " << compressed_buffer.size() << " bytes" << std::endl;
return compressed_buffer;
}
};
// Represents a compiled decompression plan
struct Decompressor {
std::string schema_name;
Decompressor(const std::string& name) : schema_name(name) {
std::cout << "OpenZL: Initializing decompressor for schema: " << schema_name << std::endl;
}
// Conceptual decompression function
std::vector<SensorReading> decompress(const std::vector<char>& compressed_data, size_t original_num_records) {
std::cout << "OpenZL: Decompressing " << compressed_data.size() << " bytes..." << std::endl;
// Simulate decompression logic.
std::vector<SensorReading> decompressed_data;
decompressed_data.reserve(original_num_records);
// For this conceptual example, we'll "re-generate" data similar to original
// This is NOT how real decompression works, but illustrates the outcome.
// In reality, OpenZL would reconstruct the original SensorReading objects.
auto now = std::chrono::system_clock::now();
int64_t current_timestamp = std::chrono::duration_cast<std::chrono::milliseconds>(
now.time_since_epoch()
).count();
std::mt19937 gen(0); // Same seed as generator for "matching" output
std::normal_distribution<float> temp_dist(25.0f, 1.0f);
std::normal_distribution<float> humid_dist(60.0f, 2.0f);
for (size_t i = 0; i < original_num_records; ++i) {
SensorReading reading;
reading.timestamp_ms = current_timestamp + (i * 1000);
reading.temperature_c = temp_dist(gen);
reading.humidity_percent = humid_dist(gen);
decompressed_data.push_back(reading);
}
std::cout << " Decompressed " << decompressed_data.size() << " sensor readings." << std::endl;
return decompressed_data;
}
};
} // namespace OpenZL
int main() {
// 1. Generate sample data
std::cout << "--- Generating Sample Sensor Data ---" << std::endl;
size_t num_readings = 1000;
std::vector<SensorReading> original_data = generate_sensor_data(num_readings);
std::cout << "Generated " << original_data.size() << " sensor readings." << std::endl;
std::cout << "First reading: Timestamp=" << original_data[0].timestamp_ms
<< ", Temp=" << original_data[0].temperature_c
<< ", Humidity=" << original_data[0].humidity_percent << std::endl;
std::cout << "Last reading: Timestamp=" << original_data.back().timestamp_ms
<< ", Temp=" << original_data.back().temperature_c
<< ", Humidity=" << original_data.back().humidity_percent << std::endl;
std::cout << std::endl;
// 2. Define and register OpenZL Schema
OpenZL::Schema sensor_schema = OpenZL::create_sensor_schema();
sensor_schema.register_with_openzl(); // Conceptually register
std::cout << std::endl;
// 3. Initialize OpenZL Compressor
OpenZL::Compressor compressor(sensor_schema.name);
std::cout << std::endl;
// 4. Compress the data
std::vector<char> compressed_buffer = compressor.compress(original_data);
std::cout << std::endl;
// 5. Initialize OpenZL Decompressor
OpenZL::Decompressor decompressor(sensor_schema.name);
std::cout << std::endl;
// 6. Decompress the data
std::vector<SensorReading> decompressed_data = decompressor.decompress(compressed_buffer, num_readings);
std::cout << std::endl;
// 7. Verification (conceptual)
// In a real scenario, you would compare original_data and decompressed_data
// element by element for equality.
std::cout << "--- Verifying Decompression (Conceptually) ---" << std::endl;
if (decompressed_data.size() == original_data.size()) {
std::cout << "Decompression successful! Number of records match." << std::endl;
// Real comparison would go here. For floats, use a small epsilon for comparison.
// Example: if (std::abs(original_data[0].temperature_c - decompressed_data[0].temperature_c) < 0.001f) { ... }
std::cout << "First decompressed reading: Timestamp=" << decompressed_data[0].timestamp_ms
<< ", Temp=" << decompressed_data[0].temperature_c
<< ", Humidity=" << decompressed_data[0].humidity_percent << std::endl;
} else {
std::cout << "Decompression failed: Record count mismatch!" << std::endl;
}
return 0;
}
Explanation of main.cpp:
- We include our `sensor_data_generator.h` and `openzl_schema_definition.h`.
- Conceptual `OpenZL::Compressor` and `OpenZL::Decompressor` structs are defined to mimic the actual OpenZL API calls.
  - The `compress` method simulates a 4x compression ratio, which is reasonable for well-structured time-series data with OpenZL.
  - The `decompress` method conceptually reconstructs the data. In a real OpenZL implementation, it would use the reverse operations of the codecs chosen during compression to perfectly restore the original data.
- In `main()`:
  - We generate 1000 sample `SensorReading` objects.
  - We create our `sensor_schema` and conceptually “register” it with OpenZL. This is the moment OpenZL understands our data’s structure.
  - An `OpenZL::Compressor` is initialized, which internally would build the specialized codec graph based on `SensorReadingSchema`.
  - The `original_data` is passed to the compressor, yielding a `compressed_buffer`.
  - An `OpenZL::Decompressor` is initialized.
  - The `compressed_buffer` is passed to the decompressor, which reconstructs the `decompressed_data`.
  - Finally, we perform a conceptual verification, checking the size and showing the first record. In a real test, you’d compare every field of every record to ensure perfect lossless decompression.
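To turn the conceptual verification in step 7 into a real check, a field-by-field comparison helper might look like the sketch below. The `SensorReading` struct is repeated so the snippet is self-contained, and the `0.001f` epsilon mirrors the comment in `main.cpp`.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

struct SensorReading {
    int64_t timestamp_ms;
    float temperature_c;
    float humidity_percent;
};

// Lossless compression must round-trip the data: timestamps are compared
// exactly, while the float fields use a small epsilon as suggested in
// the verification comment in main().
bool readings_match(const std::vector<SensorReading>& a,
                    const std::vector<SensorReading>& b,
                    float eps = 0.001f) {
    if (a.size() != b.size()) return false;
    for (size_t i = 0; i < a.size(); ++i) {
        if (a[i].timestamp_ms != b[i].timestamp_ms) return false;
        if (std::fabs(a[i].temperature_c - b[i].temperature_c) > eps) return false;
        if (std::fabs(a[i].humidity_percent - b[i].humidity_percent) > eps) return false;
    }
    return true;
}
```

In `main()`, the record-count check could then be replaced by `readings_match(original_data, decompressed_data)`.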
To compile and run this (assuming you save the files as sensor_data_generator.h, openzl_schema_definition.h, and main.cpp in the same directory):
# Compile
g++ -std=c++17 main.cpp -o sensor_compressor
# Run
./sensor_compressor
You’ll see output demonstrating the data generation, schema registration, and the conceptual compression/decompression process, including the simulated size reduction.
Step 4: Observing the Codec Graph (Conceptual)
While our C++ example simulates the API, it’s important to understand what OpenZL is actually doing under the hood. When you initialize an OpenZL::Compressor with a registered schema, OpenZL internally performs a crucial step: it analyzes the schema and potentially some initial data (if provided for training) to construct an optimal codec graph.
For our SensorReadingSchema, OpenZL might choose a graph that looks something like this:
Figure 15.2: A conceptual OpenZL Codec Graph for time-series compression.
Explanation of the graph:
- Input Data Stream: The raw `SensorReading` objects.
- Field Extraction: OpenZL “de-interleaves” the structured records into separate streams for each field based on the schema.
- Specialized Codecs:
  - `timestamp_ms` (INT64): Often benefits from Delta Encoding (storing differences between consecutive timestamps, which are usually small and positive) followed by VarInt Encoding (variable-length integer encoding, efficient for small numbers).
  - `temperature_c` and `humidity_percent` (FLOAT32): For floating-point time-series data, advanced codecs like Gorilla Encoding (from Facebook’s Gorilla paper) or Chimp Encoding are highly effective. These exploit patterns in floating-point representations and small changes between values.
- Output Stream: The outputs of these specialized codecs are then concatenated into a single, highly compressed byte stream.
This is the power of OpenZL: it automatically composes these specialized algorithms based on your data’s schema, leading to much better compression than generic methods.
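To see why delta output compresses so well, here is a standalone LEB128-style VarInt sketch. This illustrates the general technique, not OpenZL’s actual codec: each byte carries 7 payload bits, with the high bit set while more bytes follow.

```cpp
#include <cstdint>
#include <vector>

// Variable-length encoding: small numbers use few bytes. A 1000 ms
// timestamp delta fits in two bytes instead of the eight a raw int64
// occupies.
std::vector<uint8_t> varint_encode(uint64_t value) {
    std::vector<uint8_t> out;
    do {
        uint8_t byte = value & 0x7F;   // low 7 bits
        value >>= 7;
        if (value != 0) byte |= 0x80;  // continuation flag
        out.push_back(byte);
    } while (value != 0);
    return out;
}

uint64_t varint_decode(const std::vector<uint8_t>& bytes) {
    uint64_t value = 0;
    int shift = 0;
    for (uint8_t b : bytes) {
        value |= static_cast<uint64_t>(b & 0x7F) << shift;
        shift += 7;
        if (!(b & 0x80)) break;        // last byte reached
    }
    return value;
}
```

Combined with delta encoding, this alone shrinks the timestamp column roughly 4x (two bytes per record instead of eight) before any entropy coding is applied.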
Mini-Challenge: Explore Schema Impact
You’ve seen how a basic schema works. Now, let’s play with it a bit.
Challenge: Modify the `create_sensor_schema()` function in `openzl_schema_definition.h`.

- Change `humidity_percent` from `FLOAT32` to `INT64` (conceptually, imagine your sensor reports integer humidity).
- Rework the `generate_sensor_data` function (in `sensor_data_generator.h`) to ensure `humidity_percent` actually generates integer values, or values that can be safely cast to integers for this experiment.
- Compile and run `main.cpp` again.
Hint: When you change `humidity_percent` to `INT64` in the schema, you’re telling OpenZL to prepare different codecs for it. How might this affect the conceptual compression ratio?
What to observe/learn:
- Even though our `compress` function is simulated, conceptually, changing a field’s type in the schema would cause OpenZL to select different, potentially more efficient, codecs. An `INT64` field might use a delta+VarInt scheme, while a `FLOAT32` might use Gorilla.
- This exercise emphasizes that the schema is not just documentation; it’s an active ingredient that guides OpenZL’s compression engine.
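The core trick behind Gorilla-style float codecs can also be shown in a few lines. This standalone sketch is illustrative only (not OpenZL’s implementation): it XORs the IEEE-754 bit patterns of consecutive readings to expose their redundancy.

```cpp
#include <cstdint>
#include <cstring>

// XOR consecutive float bit patterns. When readings change slowly, the
// sign, exponent, and high mantissa bits match, so the XOR result has
// many leading zero bits -- the compressible signal that Gorilla-style
// codecs exploit.
uint32_t float_xor(float a, float b) {
    uint32_t ba, bb;
    std::memcpy(&ba, &a, sizeof(ba));  // type-pun safely via memcpy
    std::memcpy(&bb, &b, sizeof(bb));
    return ba ^ bb;
}

// Count leading zero bits of the XOR: more zeros means fewer meaningful
// bits need to be stored for this value.
int leading_zero_bits(uint32_t x) {
    if (x == 0) return 32;
    int n = 0;
    while (!(x & 0x80000000u)) { x <<= 1; ++n; }
    return n;
}
```

For two consecutive temperatures like 25.0 and 25.1, the XOR has 16 leading zero bits, so a Gorilla-style codec can store just the short run of meaningful bits plus a small header; identical repeated values XOR to zero and cost almost nothing.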
Common Pitfalls & Troubleshooting
Working with a powerful framework like OpenZL can have its quirks. Here are a few common pitfalls when compressing structured data, especially time-series:
Schema Mismatch:
- Pitfall: The data you’re feeding to OpenZL doesn’t match the schema you defined. For example, your schema expects a `FLOAT32` but your data provides an `INT64` in that field, or fields are missing/misordered.
- Troubleshooting: OpenZL’s API will typically throw an error or return an invalid state if the data doesn’t conform to the schema. Carefully review your data generation/parsing logic and compare it field-by-field with your `OpenZL::Schema` definition. Use debugging tools to inspect the raw data before compression.
Suboptimal Schema Definition:
- Pitfall: You’ve correctly defined the types, but haven’t provided enough metadata to help OpenZL optimize further. For instance, if a timestamp field is always monotonically increasing, explicitly telling OpenZL this could enable more aggressive delta encoding.
- Troubleshooting: For advanced use cases, consult the official OpenZL documentation (once available in detail) for schema extensions or flags that hint at data characteristics (e.g., `is_monotonic`, `value_range`). Experiment with these if compression isn’t meeting expectations.
Performance with Unstructured Data:
- Pitfall: Trying to compress highly irregular or truly unstructured data (e.g., arbitrary log messages without a consistent format) with OpenZL. While OpenZL can handle fields of generic bytes, its strength is in structured data.
- Troubleshooting: If your data has highly variable structure, OpenZL might not offer significant advantages over generic compressors. Consider pre-processing such data to extract structured components or use OpenZL only on the structured parts, while using a general-purpose compressor for the unstructured blobs. OpenZL performs best on data where a clear schema can be defined.
Summary
Congratulations! You’ve successfully completed a conceptual project to compress time-series sensor data using OpenZL. Here are the key takeaways:
- Time-Series Data Characteristics: You understand why time-series data, with its temporal order and patterns, is an ideal candidate for specialized compression.
- Schema is Key: Defining an accurate schema (`OpenZL::Schema`) is paramount. It tells OpenZL the logical structure of your data, enabling format-aware compression.
- Specialized Codec Graphs: OpenZL builds a custom codec graph by selecting and composing optimal codecs (like Delta, VarInt, Gorilla/Chimp encoding) for each field based on its type and characteristics.
- Practical Application: You’ve seen how to conceptually generate data, define a schema, and then use OpenZL’s compression and decompression mechanisms.
- Impact of Schema: Even small changes in your schema (like data types) can conceptually influence the underlying codec choices and thus the compression efficiency.
This project should give you a strong foundation for approaching real-world structured data compression problems with OpenZL. In the next chapter, we’ll explore integrating OpenZL into existing data pipelines and discuss more advanced optimization techniques.
References
- Meta Engineering Blog: Introducing OpenZL: An Open Source Format-Aware Compression Framework
- OpenZL GitHub Repository (facebook/openzl)
- InfoQ: Meta Open Sources OpenZL: a Universal Compression Framework
- Conceptual Overview of OpenZL (openzl.org)