The Book Look: Cassandra Data Modeling and Schema Design

I love writing this column for TDAN. It lets me discuss what I learned from a newly released data management book. When I publish a book through Technics Publications, I see the manuscript mostly through the eyes of a publisher. But when I write this column, I see the manuscript through the eyes of a potential reader. I try to keep my summaries objective, even when I happen to be one of the co-authors, as in the case of this quarter’s column :).

A few years ago, a data manager at a large financial company asked me to teach my five-day Data Modeling Master Class to a group of NoSQL developers. I was warned not to use traditional data modeling terms such as “conceptual” or “third normal form.” I spent about six months rewriting all of my material for this group, finding it a fun challenge to convey the value data modeling brings to all projects without using the terms I’ve been using for over 30 years. I rewrote the entire course (over 600 slides!) for this particular audience and, more generally, for all organizations that use technologies beyond relational databases.

The major change to the training material was transitioning from Conceptual > Logical > Physical to Align > Refine > Design. Align is about aligning on a common business vocabulary (what the conceptual does). Refine is about refining that vocabulary into a precise set of business requirements (what the logical does). Design is about designing an efficient database structure (and you guessed it, the physical).

The class went well, and I started thinking about how else we could impact the industry in terms of Align > Refine > Design. Then it hit me: if I could coauthor a series of books on specific NoSQL technologies with experts, that could change the world!

The first book in the series to be released was “MongoDB Data Modeling and Schema Design,” in which I partnered with experts Daniel Coupal and Pascal Desmarets. Next came the book on Neo4j co-authored with David Fauth, followed by “TerminusDB” by Donny Winston, “Elasticsearch” by Rafid Reaz, and our most recently released book in the series: “Cassandra Data Modeling and Schema Design,” co-authored with Betul O’Reilly.

Each book in the series combines sound modeling practices with proven technology-specific designs and approaches. Each book’s introduction covers the three modeling characteristics of precise, minimal, and visual; the three model components of entities, relationships, and attributes (including keys); the three model levels of conceptual (align), logical (refine), and physical (design); and the three modeling perspectives of relational, dimensional, and query. Chapter 1 goes into detail on Align, Chapter 2 on Refine, and Chapter 3 on Design.

Below is an excerpt from this book (reproduced with permission from Technics Publications) that explains more about Cassandra and its many use cases:

Cassandra’s design originated at Facebook, drawing inspiration from Amazon’s Dynamo and Google’s Bigtable. Both systems were pioneers in providing scalable and reliable storage solutions, but they were not without flaws. Cassandra combines the strengths of both systems, supporting massive data volumes and efficiently handling intensive queries. Cassandra was released as an open-source project in 2008. It became a top-level Apache Software Foundation project in 2010 after joining the Apache Incubator in 2009. Cassandra evolved to be the database solution of choice for many companies, such as Apple, Instagram, Uber, Spotify, Facebook, and Netflix.

Traditional relational databases struggle with limited scalability as data grows, lack flexibility due to rigid schemas, and may prioritize strict consistency over availability and scalability. Moreover, large-scale deployments might be expensive to license and maintain. Cassandra, on the other hand, maintains large amounts of data efficiently across multiple nodes, ensuring high availability and eliminating single points of failure. Its flexible schema handles structured, semi-structured, and unstructured data, making it ideal for high-performance use cases. Cassandra shines in write-intensive applications, allowing for fast writes without sacrificing performance or availability. Another strength is its elastic scalability: when data volume and traffic increase, Cassandra clusters can easily scale up or down to accommodate the change.

Cassandra is a powerful and flexible database system, but it may not be the ideal solution for every application. Use Cassandra if your application requires:

  • Handling massive amounts of data across many nodes, providing high availability and no single point of failure. If your application can work on a database with just one server, you might reconsider using Cassandra.
  • Fast writes under write-intensive workloads, which Cassandra’s distributed architecture handles exceptionally well. However, a relational database is the way to go if your application handles heavy analytical workloads or has complex queries.
  • High availability with no single point of failure, meaning that even if one node fails, the system will continue to function without interruption.
  • Replicating data quickly everywhere, regardless of location. You can achieve a high standard of fault tolerance by replicating data across many datacenters, guaranteeing that data remains available even during outages. This strategic distribution of data also leads to low latency (see the sketch after this list).
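
To make that multi-datacenter replication concrete, here is a minimal sketch using the open-source Python driver (cassandra-driver). It assumes a locally reachable cluster, and the keyspace name and datacenter names (“dc1”, “dc2”) are illustrative placeholders rather than anything prescribed by the book.

```python
# Minimal sketch: a keyspace replicated across two datacenters.
# Assumes a Cassandra node is reachable at 127.0.0.1 and that the cluster
# defines datacenters named "dc1" and "dc2" (placeholder names).
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# NetworkTopologyStrategy keeps three replicas in each datacenter, so the
# data stays available even if an entire datacenter becomes unreachable.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS store
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'dc1': 3,
        'dc2': 3
    }
""")

cluster.shutdown()
```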

Cassandra is very flexible, so you can apply it to many use cases:

E-commerce and inventory management: E-commerce companies need a website that is always up and running, especially during peak periods, to avoid financial losses. They also need a database that can handle large amounts of data and scale their online inventory quickly and cost-effectively as customer demand shifts. To provide a seamless user experience, e-commerce websites must be fast and scalable.

Personalization, recommendations, and customer experience: Today, we see personalization and recommendation systems everywhere. It’s like having built-in helpers in apps and websites that tell us about events or articles we might enjoy. The Eventbrite phone app now uses Cassandra instead of MySQL to tell people about nearby fun events. Vector search can also significantly help e-commerce by improving product recommendations and search functionality. It compares embeddings and keywords to measure how similar items are to one another and to a user’s preferences, so customers see more relevant products based on how they’ve behaved and what they like, improving the user experience and ultimately leading to more sales. Vector search can also handle large catalogs of products and complex queries quickly and easily.
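
As a rough illustration of the similarity idea behind vector search, here is a minimal sketch in Python with NumPy. The embeddings are made-up numbers and the product names are hypothetical; it shows only the cosine-similarity ranking step, not a production vector index.

```python
# Minimal sketch of embedding-based similarity ranking (made-up data).
import numpy as np

# Hypothetical 4-dimensional embeddings for a few products and a shopper query.
products = {
    "running shoes": np.array([0.9, 0.1, 0.3, 0.0]),
    "trail jacket":  np.array([0.7, 0.2, 0.6, 0.1]),
    "coffee maker":  np.array([0.0, 0.9, 0.1, 0.8]),
}
query = np.array([0.8, 0.1, 0.4, 0.0])  # e.g., "lightweight running gear"

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: closer to 1.0 means the vectors point the same way."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank products by similarity to the query embedding (most relevant first).
ranked = sorted(products.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
for name, _ in ranked:
    print(name)
```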

Internet of Things (IoT) and edge computing: Monitoring the weather, traffic, energy use, stock levels, health vitals, video game scores, farming conditions, and much more relies on sensors, wearable tech, cars, machines, drones, and other devices that produce a constant stream of data. This information must be gathered reliably and monitored continuously. Cassandra is a great choice for the Internet of Things because it can handle a lot of data from many devices at once, spreading it across many nodes so no single node gets overwhelmed. It also stores and retrieves data very quickly, which matters for IoT, where speed is essential.
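
A common way to model this kind of sensor stream in Cassandra is to partition by device (often together with a time bucket) and cluster by timestamp, so each device’s recent readings are a single ordered read. The sketch below is illustrative only: the table and column names are hypothetical, and it assumes a reachable local cluster plus the placeholder keyspace from the earlier sketch.

```python
# Minimal sketch: a time-series table partitioned per sensor and day.
# Table and column names are hypothetical; assumes the "store" keyspace exists.
from datetime import date, datetime
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("store")

session.execute("""
    CREATE TABLE IF NOT EXISTS sensor_readings (
        sensor_id    text,
        reading_date date,
        reading_time timestamp,
        temperature  double,
        PRIMARY KEY ((sensor_id, reading_date), reading_time)
    ) WITH CLUSTERING ORDER BY (reading_time DESC)
""")

# Each (sensor, day) pair is its own partition, so heavy write traffic from
# many devices spreads across the cluster instead of piling onto one node.
insert = session.prepare(
    "INSERT INTO sensor_readings (sensor_id, reading_date, reading_time, temperature) "
    "VALUES (?, ?, ?, ?)"
)
session.execute(insert, ("sensor-42", date.today(), datetime.utcnow(), 21.7))

cluster.shutdown()
```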

Fraud detection and authentication: Companies need a lot of data to prevent unauthorized user access. They must constantly analyze big and varied data sets to find unusual patterns that might mean fraud. This is especially important in finance, banking, payments, and insurance. Another important part is confirming people’s identities. Authentication is critical to every application. The challenge is to make this process quick and easy while still being certain about who the user is. Like fraud detection, this needs real-time analysis of lots of different data. And since authentication is a big part of your systems, you can’t afford any breakdowns.

Steve Hoberman

Steve Hoberman has trained more than 10,000 people in data modeling since 1992. Steve is known for his entertaining and interactive teaching style (watch out for flying candy!), and organizations around the globe have brought Steve in to teach his Data Modeling Master Class, which is recognized as the most comprehensive data modeling course in the industry. Steve is the author of nine books on data modeling, including the bestseller Data Modeling Made Simple. Steve is also the author of the bestseller, Blockchainopoly. One of Steve’s frequent data modeling consulting assignments is to review data models using his Data Model Scorecard® technique. He is the founder of the Design Challenges group, Conference Chair of the Data Modeling Zone conferences, director of Technics Publications, and recipient of the Data Administration Management Association (DAMA) International Professional Achievement Award. He can be reached at me@stevehoberman.com.
