4 Months with Cassandra, a love story from Cloudkick
![]()
At Cloudkick we track a ton of metrics about our customer's servers and it's quite a challenge to store such massive amounts of data. Early on, we made the decision to avoid using tools like RRDTool, so we could provide a more holistic look at infrastructure. We had two firm requirements: we wanted to show trends on a macro-level and to have very low latency so our users would never wait for graphs to build. We initially used PostgreSQL, but as we added billions of rows, the performance quickly degraded. We used cron jobs that would roll up the data on intervals to trade storage for throughput and lower latency. With clever partitioning, we were able to stretch the system to a certain point, but beyond that we faced issues of needing a much bigger machine and faster rotational disks to accommodate our business requirements; that's when started looking at other solutions.
We evaluated many of the NoSQL options but in the end, we chose Apache Cassandra because it provided many advantages for our use-cases.
Advantages of Cassandra
Linear scalability
To meet our requirements, we needed a fast solution that allowed us to easily throw cloud servers at the problem. Specifically, we needed to achieve thousands of writes per second. Cassandra's architecture, being based on Amazon's Dynamo, could cope with our write load and make it easy to add more nodes to an existing cluster.
Massive write performance
Write performance in Cassandra is excellent. The internals are specifically geared towards a heavy-write system. It writes to a memory table and a serial commit log, and every so often the memory table is flushed to disk in what the Big Table paper describes as a sorted strings table, often called an SSTable — an immutable data structure. There is a lot more happening behind the scenes, but the performance characteristics are clear: there is nothing slow in the write path. The Cassandra wiki page on Architecture Internals provides more details.
Low operational costs
Traditional sharding and replication with databases like MySQL and PostgreSQL have been shown to work even on the largest scale websites — but come at a large operational cost. Setting up replication for MySQL can be done quickly, but there are many issues you need to be aware of, such as slave replication lag. Sharding can be done once you reach write throughput limits, but you are almost always stuck writing your own sharding layer to fit how your data is created and operationally, it takes a lot of time to set everything up correctly. We skipped that step all together and added a couple hooks to make our data aggregation service siphon to both PostgreSQL and Cassandra for the initial integration.
Cassandra at Cloudkick
Cassandra has some unique terminology which may be unfamiliar. Please refer to the Data Model wiki page for an explanation of terms.
The data model is much less complex than a traditional SQL RDMS system. In the real world, this typically equates to denormalization of data. A great example of that is in Digg's blog post on their use of Cassandra. Our data model is a little bit different in that it doesn't really need to be denormalized since it's inherently simple.
Cloudkick's primary use of Cassandra is for storing monitoring data of different metrics, checked by our system. This equates down to a very simple model for us. To store all the monitoring information that Cloudkick collects, it only requires a few Column Families. In our system, we have three big classes of data we store:
- Metrics (numerics and text)
- Rolled up data
- Statuses at time slices
Of these different classes of data, all are "indexed" by timestamp.
Here's a Column Family declaration as it would appear in storage-conf.xml:
<ColumnFamily CompareWith="LongType" Name="NumericArchive" />This particular Column Family stores every single original point of data in the system. The LongType declaration specifies that the column names are 8-byte values that determine the order of the columns. It's analogous to choosing a primary key in a database, except in Cassandra you only get a couple "indexes" for free. Some people say Cassandra is a 4 or 5 level hash, depending on how you configure the column family. The pseudo-hash is below:
[KeySpace][CF][Key][Column]or[KeySpace][CF][Key][SuperColumn][SubColumn]You can also do range queries on one or two levels depending on the Partitioner you pick. With the Order Preserving Partitioner, you can perform a range_slice which can give you a slice of the keys and a slice of the columns. This is pretty powerful since it gives you a full cross-section of the data with a single call. However, we use the random partitioner which only allows access to a range query from a column level. In Python-land, we store each metric with the typical Thrift call. Storing the column names is as simple as using the struct module (!Q is an 8-byte value in network order).
import struct name = struct.pack('!Q', long_ts)This snippet of code simply converts the time stamp to a value so it's ready to insert values into the system. It's important to keep the long column names in network order as that's what the Cassandra Java process expects. For the curious, Thrift takes care of byte order for values that are explicitly typed as a value; however, since this is just a byte stream, it has no knowledge of the type of values and therefore doesn't reorder them.
Reading the values in Python is much of the same. If I wanted values for a time slice between '2009-09-01' and '2009-09-03', I could use the following:
t1 = int(time.mktime(datetime(2009,9,1).timetuple())) t2 = int(time.mktime(datetime(2009,9,3).timetuple())) client.get_slice('MonitorApp', '', ColumnParent(column_family='NumericArchive'), SlicePredicate(slice_range=SliceRange(start=t1, end=t2)), ONE)Another interesting bit of Cloudkick technology is how we roll values up into different time slices. To keep querying fast, a mechanism like this is necessary.
We keep track of the following roll-up intervals:
- 5 m
- 20 m
- 30 m
- 60 m
- 4 hr
- 12 hr
- 1 day
For instance, if you want to look over the "month" time period, there's a good chance that you don't want to look at 5 minute intervals. If you looked at the 5 minute intervals over a month, for one series you could potentially see 8640 points of data. (
12 five minute intervals * 24 hours * 30 days) This is obviously too much data, so you would look at something more manageable, like the one-hour or four-hour intervals, which would be 720 or 180 points, respectively. In each point, we also store additional information so the column definition looks like this:<ColumnFamily CompareWith="LongType" Name="Rollup60m" ColumnType="Super" CompareSubcolumnsWith="BytesType" />This declaration is different than the NumericArchive declaration because it is a SuperColumn. If you remember the hash example above, we have now added another layer which we can leverage. In the NumericArchive section we only stored the native point. Since we're combining multiple points into one data point, there's more data for us to store. So for the sub-columns, we store some simple key-value pairs. Notice that it's sorted by BytesType which does a string comparison, this part is less interesting for us since we typically only retrieve either all the sub-columns or none at all.
The different key-value pairs we store, with regard to the Rollup columns, are:
- average - the average of all the points over the interval
- count - the number of points accumulated in the interval
- derivative - the change in value over the change in time (great for counters or gauges like bandwidth)
We still get the sorted values with another level of a hash, which is useful as we can then retrieve the derivative and average all in one query.
Hybrid NoSQL
Cloudkick still uses traditional SQL systems for much of our data — data like our user accounts are stored in a standard master/slave MySQL setup. We even keep data that references keys in Cassandra in MySQL so we can quickly write a Django view that queries metadata about a monitoring check using the Django ORM, but still use Cassandra for the bulk of the data about that check. We'll likely keep moving more data into Cassandra as we need to, but for some data the ability to write arbitrary SQL queries is still very useful.
Administration and operational issues
nodetool, previously known as nodeprobe
Cassandra includes a utility called
nodetool, previously callednodeprobe. This utility lets you do common adminstrative tasks on your cluster, like checking if a node is up, decommissioning a node, or triggering a compaction. The Cassandra wiki page on nodetool provides more details.Major compactions
Because SSTables are never modified on-disk, only replaced, you need to do a major compaction in order to reclaim all of your disk space if you delete or modify a row.
![]()
You do need to keep a watch of your disk-space growth and make sure to trigger a major compaction periodically. Ideally, when your system is under lower load.
Tombstones
Because Cassandra is a distributed database, deleting rows can be complicated. Jonathan Ellis' recent blog post does a fantastic job explaining how distributed deletes and tombstones can affect maintenance and administration.
Random / Order Preserving Partitioner
Choosing tokens for each node is a major part of a Cassandra deployment. The token selection decides which node stores the data in Cassandra. If you're using the Order Preserving Paritioner, you must be very careful about how you pick tokens, because with bad ones your data will be lopsided, with signifigantly more being stored on a small number of nodes. If you know how your keys are generated, you need to partition them as evenly as possible. If you don't need to do ranged-based queries, we'd suggest using the Random Partitioner. The Cassandra Wiki contains more information on configuring the partitioner.
Client reconnection
Our original client would try a single Cassandra node and throw an exception if it was unable to connect. This might make sense if you're used to a data storage system like MySQL where, if the master is down, you can't do much. In Cassandra's case, you want to make sure you try to query multiple nodes before giving up — so make your clients try a list of servers before quitting.
Thrift issues
Since Cassandra uses Apache Thrift as the default RPC mechanism, exposing the Thrift layer to any non-controlled data can be dangerous. We use firewalls on our nodes to make sure our Thrift ports are only exposed to a very small set of machines, because even just telneting into the port and typing "hello" can cause the JVM to OOM. This is discussed in the THRIFT-601 issue report. In Cassandra trunk, Apache Avro is available as an alternative communication method, and shouldn't suffer from these types of issues.
What's missing?
There are many features we would like to see in Cassandra itself, but most of those (like compressed SSTables) are already being tackled by the developers.
We ran into many problems with the Ordered Partitioner, and it would be nice for certain data models if this partitioner would work in a more automated fashion.
We also believe Cassandra would benefit from integration with existing open source projects. For example, a Django ORM layer would instantly expose Cassandra to many more developers. Right now, everyone using Cassandra is building custom systems to communicate with it, but with a little work, many more projects could be "Cassandra Enabled."
Cassandra community
The Cassandra community recently graduated to a top-level project at the Apache Software Foundation. Through the hard work of all its committers and contributors, the project has come a long way in a very short amount of time. The open community model is yielding an increasingly impressive data storage system, which is being used by many companies around the world.
Cassandra is quickly approaching its next iteration of version 0.6. New features include: row-level caching, support for Hadoop/MapReduce, authenticated connections, new statistics exposed over JMX, per-keyspace replication factors, and a new batch_mutate method. The pace of innovation in the project is impressive and we're eager to upgrade.
We'd would like to sincerely thank all the users, contributors and developers who have made Apache Cassandra such a successful project!
Posted by Team Cloudkick on March 2, 2010 ♻ Tweet this!
Some good stuff when you need to grow.
