SSTable Data File

The data file contains a part of the actual data stored in the database. Basically, it is one long list of rows, where each row lists its key and its columns.

The data file alone does not provide an efficient means of finding a row with a specific key. For this, the SSTables Index File and SSTables Summary File exist. Additionally, the SSTables Bloom Filter File exists for quickly determining whether a specific key exists in this SSTable (an Apache Cassandra table is written incrementally to several separate SSTables).

The data file may be compressed as described in SSTables Compression. As we explain there, the compression layer offers random access to the uncompressed data, like an ordinary file, so here we can assume the data file is uncompressed.

This document explains the format of the sstable data file, but glosses over the question of how higher-level Apache Cassandra concepts - such as clustering columns, static columns, and collections - translate to sstable data. This is discussed in SSTables interpretation.

The Data File

The data file is nothing more than a long sequence of rows:

struct data_file {
    struct row[];
};

The code usually skips directly to the position of a row with a desired key (using the index file), so we’ll want an API to efficiently read this whole row. We’ll probably (TODO: find what uses this… compaction?) also need an API to efficiently iterate over successive rows (without rereading the same disk blocks).
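
For illustration, here is a minimal Python sketch of these two access patterns. It is not ScyllaDB or Apache Cassandra code, and it assumes a hypothetical parse_row(f) helper implementing the row format described in the next section:

import os

# Minimal sketch, not ScyllaDB code. Assumes a hypothetical parse_row(f)
# that consumes one row from the (uncompressed) data file object f,
# following the row format described in the next section.

def read_row_at(f, offset):
    # Random access: jump to a row position obtained from the index file.
    f.seek(offset)
    return parse_row(f)

def iterate_rows(f):
    # Sequential access: yield successive rows until the end of the file.
    end = f.seek(0, os.SEEK_END)
    f.seek(0)
    while f.tell() < end:
        yield parse_row(f)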

Rows

References: SSTableIdentityIterator.java, constructor. DeletionTime.java

Each row begins with a header which includes the row’s key and whether (and when) it was deleted, followed by a sequence of cells (column names and values) or other types of atoms described below. The last atom in a row is a special row-end atom, marking the end of the row.

struct row {
    be16 key_length;
    char key[key_length];
    struct deletion_time deletion_time;
    struct atom atoms[];
};

Note that the row definition does not include its length - the reader reads the row incrementally until reaching the row-end atom. Alternatively, if we want to read an entire row into memory before parsing it, we can figure out its length using the SSTables Index File. If we reached this row from a particular index entry, the next index entry will point to the byte after the end of this row.

If we reached a particular row through the index, we may already know we have the right key, and can skip the key at the beginning of the row without bothering to deserialize it.

The deletion_time structure determines whether this is a row tombstone - i.e., whether the whole row has been deleted, and if so, when:

struct deletion_time {
    be32 local_deletion_time;
    be64 marked_for_delete_at;
};

The special value LIVE = (MAX_BE32, MIN_BE64), i.e., the bytes 7F FF FF FF 80 00 00 00 00 00 00 00, is the default for live, undeleted rows. marked_for_delete_at is a timestamp (typically in microseconds since the UNIX epoch) after which the data should be considered deleted. If set to MIN_BE64, the data has not been marked for deletion at all. local_deletion_time is the local server timestamp (in seconds since the UNIX epoch) when this tombstone was created - it is only used for purging the tombstone after gc_grace_seconds have elapsed.
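
To make this layout concrete, here is a minimal Python sketch (an illustration under the definitions above, not ScyllaDB code) that parses the row header and checks for the LIVE value, reading the be32/be64 fields as signed big-endian integers:

import struct

# LIVE = (MAX_BE32, MIN_BE64), read here as signed big-endian integers.
LIVE = (0x7fffffff, -2**63)

def parse_deletion_time(f):
    # be32 local_deletion_time, be64 marked_for_delete_at
    return struct.unpack('>iq', f.read(12))

def parse_row_header(f):
    key_length, = struct.unpack('>H', f.read(2))   # be16 key_length
    key = f.read(key_length)
    deletion_time = parse_deletion_time(f)         # (local_deletion_time, marked_for_delete_at)
    is_row_tombstone = deletion_time != LIVE
    return key, deletion_time, is_row_tombstone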

Atoms (cells, end-of-row, and more)

References: OnDiskAtom.java:deserializeFromSSTable(), ColumnSerializer:deserializeColumnBody(), RangeTombstone:deserializeBody().

A row’s value is a list of atoms, each of which is usually a cell (a column name and value) or an end-of-row atom, but can also be one of the additional atom types explained below.

Each atom, of any type, begins with a column name, a byte string with a 16-bit length. If the length of the name is 0 (in other words, the atom begins with two null bytes), this is an end-of-row atom, as the other atom types always have non-empty names. Note that, yes, the column names are repeated in each and every row. The compression layer eliminates much of the disk-space waste, but the overhead of parsing this verbose encoding remains.

struct atom {
    be16 column_name_length;
    char column_name[column_name_length];
}

If the atom has a non-empty name, it is not an end-of-row atom, and following column_name appears a single mask byte:

enum mask {
    DELETION_MASK        = 0x01,
    EXPIRATION_MASK      = 0x02,
    COUNTER_MASK         = 0x04,
    COUNTER_UPDATE_MASK  = 0x08,
    RANGE_TOMBSTONE_MASK = 0x10,
};
struct nonempty_atom : atom {
    char mask;
}

The mask determines which type of atom this is:

If mask & (RANGE_TOMBSTONE_MASK | COUNTER_MASK | EXPIRATION_MASK) == 0, we have a regular cell. This has a 64-bit timestamp (used to decide which value of a cell is the most recent) and a value, serialized as a byte array with a 32-bit length:

struct cell_atom : nonempty_atom {
    be64 timestamp;
    be32 value_length;
    char value[value_length];
};

(Note: The COUNTER_UPDATE_MASK and DELETION_MASK bits might be turned on for a cell_atom, modifying its meaning.)

If mask & RANGE_TOMBSTONE_MASK, we have a

struct range_tombstone_atom : nonempty_atom {
    be16 last_column_length;
    char last_column_name[last_column_length];
    struct deletion_time dt;
};

Such a range-tombstone atom affects not just the single column column_name, but the range between column_name and last_column_name (as usual, this range is defined using the underlying comparator of the column name type).

If mask & COUNTER_MASK, we have a

struct counter_cell_atom : nonempty_atom {
    be64 timestamp_of_last_delete;
    be64 timestamp;
    be32 value_length;
    char value[value_length];
};

If mask & EXPIRATION_MASK, we have a

struct expiring_cell_atom : nonempty_atom {
    be32 ttl;
    be32 expiration;
    be64 timestamp;
    be32 value_length;
    char value[value_length];
};

Note that it is not valid to have more than one of RANGE_TOMBSTONE_MASK, COUNTER_MASK, or EXPIRATION_MASK set on the same atom.
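
Putting the atom types together, here is a minimal Python sketch of reading one atom and dispatching on its mask. It is an illustration of the format described above, not ScyllaDB code; parse_deletion_time() is the helper sketched earlier:

import struct

DELETION_MASK        = 0x01
EXPIRATION_MASK      = 0x02
COUNTER_MASK         = 0x04
COUNTER_UPDATE_MASK  = 0x08
RANGE_TOMBSTONE_MASK = 0x10

def read_value(f):
    # A be32-length-prefixed byte string (the cell value).
    value_length, = struct.unpack('>I', f.read(4))
    return f.read(value_length)

def parse_atom(f):
    name_length, = struct.unpack('>H', f.read(2))      # be16 column_name_length
    if name_length == 0:
        return ('end_of_row',)                         # two null bytes mark the end of the row
    column_name = f.read(name_length)
    mask = f.read(1)[0]

    if mask & RANGE_TOMBSTONE_MASK:
        last_length, = struct.unpack('>H', f.read(2))
        last_column_name = f.read(last_length)
        dt = parse_deletion_time(f)                    # helper sketched earlier
        return ('range_tombstone', column_name, last_column_name, dt)

    if mask & COUNTER_MASK:
        timestamp_of_last_delete, timestamp = struct.unpack('>qq', f.read(16))
        return ('counter_cell', column_name, timestamp_of_last_delete, timestamp, read_value(f))

    if mask & EXPIRATION_MASK:
        ttl, expiration, timestamp = struct.unpack('>IIq', f.read(16))
        return ('expiring_cell', column_name, ttl, expiration, timestamp, read_value(f))

    # A regular cell; DELETION_MASK / COUNTER_UPDATE_MASK may still modify its meaning.
    timestamp, = struct.unpack('>q', f.read(8))
    return ('cell', column_name, mask, timestamp, read_value(f))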

Name and Value Serialization

References: Composite.java, CompositeType.java.

It is important to remember that both the column names and values described above are stored as byte strings (each preceded by its length, 16-bit or 32-bit respectively). But in Apache Cassandra, both names and values may have various types (as determined by the CQL schema), and those are serialized to a byte string before that byte string is written to disk as part of the atom.

This has a surprising effect on the encoding of column names in the data file. Starting with Apache Cassandra 1.2, unless the table is created “WITH compact storage”, column names are always composite, i.e., a sequence of components. A composite column name is serialized to a byte array like this:

struct serialized_composite_name {
    struct {
        be16 component_length;
        char component[component_length];
        char end_of_component; // usually 0, can be -1 (0xff) or 1 (0x01) - see below.
    } component[];
};

The end_of_component is usually 0, but can also be -1 or 1 to specify not a specific column but a range, as explained in comments in Composite.java and CompositeType.java.

So the surprising result is that even single-component column names produce wasteful double-serialization (unless the table has WITH compact storage): For example, the column name “age”, a composite name with just one component, is first serialized into \0 \3 a g e \0, and then this serialized string is written as the column name, preceded by its own length, 6: \0 \6 \0 \3 a g e \0.
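
A small Python sketch (illustration only, not ScyllaDB code) reproducing this double serialization for the “age” example:

import struct

def serialize_composite_name(components, end_of_component=b'\x00'):
    # Each component: be16 length, the component bytes, one end-of-component byte.
    return b''.join(struct.pack('>H', len(c)) + c + end_of_component
                    for c in components)

composite = serialize_composite_name([b'age'])
assert composite == b'\x00\x03age\x00'

# The composite is then written as the atom's column name,
# preceded by its own be16 length (6):
on_disk_name = struct.pack('>H', len(composite)) + composite
assert on_disk_name == b'\x00\x06\x00\x03age\x00'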

Note that the above means we need to know, when reading an sstable, whether its column names are composite or not. Therefore the sstable reader needs to know whether this table has “WITH compact storage” or not.

CQL Row Marker

In some cases (namely, tables built through CQL without “WITH compact storage”), each row will contain a bizarre extra cell called a “CQL Row Marker” which the Apache Cassandra developers (who apparently don’t care about wasting space…) added to allow a row to remain even if all its columns are deleted. It’s worth knowing that this extra cell exists, as its existence might surprise the uninitiated.

The “CQL Row Marker” is a normal cell in a row, which has an “empty composite” name and an empty value. Note that the cell’s column name is not empty - it can’t be (an empty name is an end-of-row marker). Rather, it is a composite name with one empty-string component. Such a composite name is serialized, as explained above, to \0 \0 \0 - the first two null bytes are the empty component’s length, and at the end we have the additional null added in the serialization. These three null bytes are what gets used as the column name.
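
With the serialization sketch above, the row marker’s name - a composite with a single empty component - indeed comes out as three null bytes:

# Continuing the illustration above: one empty-string component.
row_marker_name = serialize_composite_name([b''])
assert row_marker_name == b'\x00\x00\x00'   # be16 length 0, then the end-of-component byte

# As a cell's column name it is written preceded by its own be16 length (3);
# the row marker's value is an empty byte string.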
