Saturday, 28 March 2015

Cassandra Database Overview

What is Apache Cassandra?
Apache Cassandra™ is a massively scalable open source NoSQL database. Cassandra is well suited to managing large amounts of data across multiple data centers and the cloud. It delivers continuous availability, linear scalability, and operational simplicity across many commodity servers with no single point of failure, along with a data model designed for flexibility and fast response times.

How does Cassandra work?
Cassandra has a “masterless” architecture, meaning all nodes are the same. Cassandra provides automatic data distribution across all nodes that participate in a “ring” or database cluster. A developer or administrator does not need to write any code to distribute data across a cluster, because data is transparently partitioned across all nodes in the cluster.
Cassandra also provides customizable replication, storing redundant copies of data across nodes that participate in a Cassandra ring. This means that if any node in a cluster goes down, one or more copies of that node’s data are still available on other machines in the cluster. Replication can be configured to work across one data center, many data centers, and multiple cloud availability zones.

Understanding the architecture
In this topic:
•Architecture in brief
 Essential information for understanding and using Cassandra.
•Internode communications (gossip)
 Cassandra uses a protocol called gossip to discover location and state information about the other nodes participating in a Cassandra cluster.
•Data distribution and replication
 How data is distributed and factors influencing replication. A partitioner determines how data is distributed across the nodes in the cluster (including replicas). A snitch determines which data centers and racks nodes belong to.
•Client requests
 Client read or write requests can be sent to any node in the cluster because all nodes in Cassandra are peers.
•Planning a cluster deployment
 Vital information about successfully deploying a Cassandra cluster.
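
The gossip item above can be pictured with a small simulation. This is only a toy sketch in Python, not Cassandra's actual protocol or message format: the node names, the `(version, status)` tuples, and the functions `merge`, `gossip_round`, and `run_until_converged` are all hypothetical, and each node contacts a single random peer per round for simplicity. It shows the core idea that nodes learn about the whole cluster by repeatedly exchanging versioned state with peers and keeping the newest entry.

```python
import random

def merge(local, remote):
    """Copy into `local` every entry from `remote` that is new or has a higher version."""
    for node, (version, status) in remote.items():
        if node not in local or version > local[node][0]:
            local[node] = (version, status)

def gossip_round(views, rng):
    """Each node picks one random peer; the two exchange and merge their views."""
    names = sorted(views)
    for name in names:
        peer = rng.choice([n for n in names if n != name])
        merge(views[name], views[peer])
        merge(views[peer], views[name])

def run_until_converged(views, rng, max_rounds=50):
    """Gossip until every node holds the same view; return the number of rounds used."""
    for rounds in range(1, max_rounds + 1):
        gossip_round(views, rng)
        first = next(iter(views.values()))
        if all(view == first for view in views.values()):
            return rounds
    return None  # did not converge within max_rounds

# Hypothetical four-node cluster: each node starts out knowing only itself.
views = {name: {name: (1, "UP")} for name in ["node1", "node2", "node3", "node4"]}
rng = random.Random(0)
rounds_to_discover = run_until_converged(views, rng)

# node3 announces a newer state about itself; gossip spreads it to everyone.
views["node3"]["node3"] = (2, "DOWN")
rounds_to_spread = run_until_converged(views, rng)

print("rounds to discover all nodes:", rounds_to_discover)
print("node1's view:", views["node1"])
```

No node is special in this exchange, which matches the peer-to-peer design described above: any node can start with partial knowledge and still end up with the full picture.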

Data distribution and replication
In Cassandra, data distribution and replication go together. Data is organized by table and identified by a primary key; the partition key portion of that key determines which node the data is stored on. Replicas are copies of rows. When data is first written, it is also referred to as a replica.
Factors influencing replication include:
•Virtual nodes: assign data ownership to physical machines.
•Partitioner: partitions the data across the cluster.
•Replication strategy: determines the replicas for each row of data.
•Snitch: defines the topology information that the replication strategy uses to place replicas.
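
How these factors fit together can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Cassandra's implementation: the `SNITCH` table, the node and rack names, and the functions `token` and `place_replicas` are hypothetical, MD5 stands in for the partitioner's hash, each node owns a single token (no vnodes), and the rack-aware walk is a simplified version of what a topology-aware replication strategy does.

```python
import hashlib
from bisect import bisect_left

# Hypothetical topology, standing in for what a snitch would report.
SNITCH = {
    "node1": "rack1", "node2": "rack1",
    "node3": "rack2", "node4": "rack2",
    "node5": "rack3", "node6": "rack3",
}

def token(value):
    """Partitioner stand-in: hash a string to an integer token."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

# The ring: one (token, node) pair per node, sorted by token.
RING = sorted((token(node), node) for node in SNITCH)
TOKENS = [t for t, _ in RING]

def place_replicas(partition_key, replication_factor):
    """Walk the ring clockwise from the key's token, preferring unseen racks."""
    start = bisect_left(TOKENS, token(partition_key)) % len(RING)
    walk = [RING[(start + i) % len(RING)][1] for i in range(len(RING))]
    chosen, skipped, racks = [], [], set()
    for node in walk:
        if SNITCH[node] not in racks:
            chosen.append(node)       # first node seen on this rack
            racks.add(SNITCH[node])
        else:
            skipped.append(node)      # rack already covered; use only if needed
    return (chosen + skipped)[:replication_factor]

replicas = place_replicas("user:42", 3)
print(replicas, [SNITCH[n] for n in replicas])
```

With three racks and a replication factor of 3, the sketch always places each replica on a different rack, so losing one rack still leaves two copies of the row.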

In this topic:
•Consistent hashing
 Consistent hashing distributes data across a cluster in a way that minimizes reorganization when nodes are added or removed.
•Virtual nodes
 Overview of virtual nodes (vnodes).
•Data replication
 Cassandra stores replicas on multiple nodes to ensure reliability and fault tolerance. A replication strategy determines the nodes where replicas are placed.
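
The claim that consistent hashing minimizes reorganization can be checked with a short Python sketch. It is a simplified model, not Cassandra's code: the functions `token`, `build_ring`, and `owners`, the node names, the row keys, and the choice of 64 virtual nodes per machine are all assumptions made for illustration, and MD5 stands in for the partitioner's hash function.

```python
import hashlib
from bisect import bisect_left

def token(value):
    """Hash a string to an integer position on the ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=64):
    """Give every node `vnodes` positions on the ring, sorted by token."""
    return sorted((token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes))

def owners(nodes, keys, vnodes=64):
    """Map each key to the node owning the first token at or after the key's token."""
    ring = build_ring(nodes, vnodes)
    tokens = [t for t, _ in ring]
    return {k: ring[bisect_left(tokens, token(k)) % len(ring)][1] for k in keys}

keys = [f"row-{i}" for i in range(2000)]
before = owners(["node1", "node2", "node3"], keys)
after = owners(["node1", "node2", "node3", "node4"], keys)

# Keys whose owner changed when node4 joined the cluster.
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved")
print("all moved keys went to node4:", all(after[k] == "node4" for k in moved))
```

Only the keys that fall into the new node's token ranges change owner; every other key stays where it was. With four equally weighted nodes one would expect roughly a quarter of the keys to move, whereas a naive `hash(key) % number_of_nodes` scheme would reshuffle most of them.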
