Chapter 23: Distributed Database Concepts

Advanced Database Management Systems - Final Term Elite Preparation

1. Introduction to Distributed Systems

A Distributed Computing System consists of several processing sites (nodes) interconnected by a computer network (LAN or Long-haul/WAN). Nodes cooperate by partitioning large tasks into smaller, efficiently solvable tasks.

Big Data Technologies

Modern Big Data solutions are born from combining traditional database technologies with distributed computing techniques to successfully mine and process vast amounts of unstructured and structured data across thousands of nodes.

Scalability & Partition Tolerance

Vertical Scalability (Scaling Up): Expanding the physical capacity (RAM/CPU) of an individual, single node.
Horizontal Scalability (Scaling Out): Expanding the total capacity by adding more nodes/servers to the distributed system network. (Preferred for distributed DBs).
Partition Tolerance: A critical requirement where the system must have the capacity to continue operating safely even if the network communication between nodes drops or is partitioned.

2. Distributed Database Concepts & Autonomy

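A Distributed Database is a collection of multiple logically interrelated databases distributed over a computer network. The sites may be heterogeneous (different hardware, operating systems, or DBMS software). The DDBMS (Distributed Database Management System) is the software that manages the distributed database while making the distribution transparent to the user.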

The Four Dimensions of Autonomy

Autonomy determines the exact extent to which individual nodes can operate completely independently of the global system.

Design Autonomy: Independence of data model usage and transaction management. Does Node A use Oracle while Node B uses MySQL?
Communication Autonomy: The node decides whether to communicate with another component DBS, and exactly how much information to share.
Execution Autonomy: The node executes local operations without interference from external operations and decides its own order of execution.
Association Autonomy: The node decides whether (and how much) to share its functionality and resources with the global system.

3. Distribution Transparency

Transparency is the fundamental goal of a DDBMS. It means hiding the complex implementation details of the network and distribution from the end user, making the system look like one massive centralized database.

Key Types of Transparency

Data Organization Transparency:
- Location Transparency: Users query tables without needing to know exactly which geographical node the data physically resides on.
- Naming Transparency: A unique name is provided for each object, avoiding naming conflicts across sites.
Replication Transparency: The user does not need to know that multiple copies of data exist. They execute one UPDATE, and the system invisibly propagates it.
Fragmentation Transparency: The user queries a table as if it is whole, completely unaware that it is horizontally or vertically sharded across multiple sites.
Execution / Design Transparency: The user does not need to know how the distributed database is designed or at which site a transaction executes.

4. Data Fragmentation, Replication & Allocation

Fragmentation (Sharding)

Dividing logical units of the database into smaller pieces (fragments) assigned to different nodes.

Horizontal Fragmentation: A subset of tuples (rows). Grouped by a condition (e.g., Location='San Francisco'). To reconstruct the full table, apply the UNION operation.
Vertical Fragmentation: A subset of attributes (columns). To reconstruct the full table, apply the OUTER UNION or FULL OUTER JOIN operation using a common key.
Mixed (Hybrid) Fragmentation: A combination of both horizontal and vertical.
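
The fragmentation and reconstruction rules above can be sketched in a few lines of Python. This is a minimal illustration, not a DDBMS implementation: the sample rows, the column names, and the helper names (`horizontal_fragment`, `vertical_fragment`, `union`, `join_on_key`) are all hypothetical. For a complete vertical fragmentation every key appears in both fragments, so the plain key join used here returns the same rows a FULL OUTER JOIN would.

```python
# Hypothetical sample rows; names and values are illustrative only.
EMPLOYEE = [
    {"EmpID": 1, "Name": "Ana", "Location": "San Francisco"},
    {"EmpID": 2, "Name": "Ben", "Location": "Houston"},
    {"EmpID": 3, "Name": "Cy", "Location": "San Francisco"},
]


def horizontal_fragment(rows, predicate):
    """Select the subset of tuples (rows) that satisfy a condition."""
    return [row for row in rows if predicate(row)]


def vertical_fragment(rows, key, attributes):
    """Project a subset of attributes (columns), always keeping the key."""
    return [{name: row[name] for name in (key, *attributes)} for row in rows]


def union(*fragments):
    """Rebuild a horizontally fragmented table: UNION of its fragments."""
    seen, result = set(), []
    for fragment in fragments:
        for row in fragment:
            identity = tuple(sorted(row.items()))
            if identity not in seen:  # UNION removes duplicate tuples
                seen.add(identity)
                result.append(row)
    return result


def join_on_key(left, right, key):
    """Rebuild a vertically fragmented table: join fragments on the common key."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]
```

For example, splitting `EMPLOYEE` on `Location == "San Francisco"` and its complement, then calling `union` on the two fragments, returns the original three rows. Likewise, joining the `(EmpID, Name)` and `(EmpID, Location)` fragments on `EmpID` restores every full row.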

Schemas defining Distribution

The Fragmentation Schema defines the set of fragments. The Allocation Schema defines exactly which node/site each fragment is physically allocated to.

Replication

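Fully Replicated: A copy of the entire database exists at every single site. Pros: High availability (the system keeps running as long as one site is up) and fast local reads. Cons: Slow updates, because every update must be applied at every site, and more expensive concurrency control and recovery.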
Nonredundant Allocation: No replication. Each fragment exists on exactly one site.
Partial Replication: Some critical fragments are replicated, others are not. Defined by the Replication Schema.
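
One way to picture the allocation and replication schemas is as a mapping from each fragment to the set of sites holding a copy. The sketch below assumes that representation; the fragment names, site names, and both helper functions are hypothetical and exist only to illustrate the three replication categories.

```python
# Hypothetical allocation: fragment name -> set of sites that hold a copy.
SITES = {"site1", "site2", "site3"}
ALLOCATION = {
    "EMP_SF": {"site1", "site2"},   # replicated at two sites
    "EMP_HOUSTON": {"site2"},       # stored once
    "PROJECTS": {"site3"},          # stored once
}


def classify_replication(allocation, sites):
    """Label an allocation as fully replicated, nonredundant, or partial."""
    if all(holders == sites for holders in allocation.values()):
        return "fully replicated"
    if all(len(holders) == 1 for holders in allocation.values()):
        return "nonredundant"
    return "partially replicated"


def sites_to_update(allocation, fragment):
    """An update to a fragment must reach every site that holds a copy."""
    return sorted(allocation[fragment])
```

With the sample data, `classify_replication(ALLOCATION, SITES)` reports partial replication, and an update to `EMP_SF` has to reach both `site1` and `site2`. That is the update cost that grows as more copies are added.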

5. Distributed Concurrency Control

Distributed concurrency is much harder than centralized because it must handle multiple copies of data, failed sites, and broken communication links.

Method A: Distinguished Copy (Primary Site)

A particular copy of a data item is designated as the "distinguished" copy. All locks are associated strictly with this copy.

Primary Site Technique: All distinguished copies are kept at a single, central master site. Vulnerable to a single point of failure.
Primary Site with Backup: Lock info is maintained at the primary site and shadowed to a backup site for failover.
Primary Copy Method: The distinguished copies are distributed among various sites, balancing the lock coordination load across the network.
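
The difference between the Primary Site Technique and the Primary Copy Method comes down to where lock requests are routed. The sketch below is a simplified illustration that assumes exclusive locks only. The `LockManager` and `PrimaryCopyRouter` classes and the item and site names are hypothetical.

```python
class LockManager:
    """Lock table kept at one site for the distinguished copies it owns."""

    def __init__(self, site):
        self.site = site
        self.holders = {}  # item -> transaction currently holding the lock

    def acquire(self, item, txn):
        # Exclusive locks only, to keep the sketch small.
        if self.holders.get(item, txn) != txn:
            return False  # another transaction already holds the lock
        self.holders[item] = txn
        return True

    def release(self, item, txn):
        if self.holders.get(item) == txn:
            del self.holders[item]


class PrimaryCopyRouter:
    """Send each lock request to the site holding the item's distinguished copy."""

    def __init__(self, primary_site_of):
        self.primary_site_of = primary_site_of  # item -> site name
        self.managers = {
            site: LockManager(site) for site in set(primary_site_of.values())
        }

    def lock(self, item, txn):
        return self.managers[self.primary_site_of[item]].acquire(item, txn)

    def unlock(self, item, txn):
        self.managers[self.primary_site_of[item]].release(item, txn)
```

Mapping every item to the same site, for example `{"tv": "hq", "laptop": "hq"}`, reproduces the Primary Site Technique with a single lock manager. Mapping items to different sites spreads the lock tables across the network, as in the Primary Copy Method.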

Method B: Voting Method

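There is no distinguished copy. When a transaction wants a lock, it sends a request to all sites containing a replica, and each copy maintains its own lock and can grant or deny the request. If a majority of the copies grant the lock before a time-out, the transaction holds the lock and informs all copies that it has been granted. Otherwise it cancels the request and informs all sites of the cancellation. (Con: Results in massive network message traffic).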
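
The majority rule and its message overhead can be sketched as follows. This is an illustration only: the function name, the vote encoding (`True`, `False`, or `None` for no reply before the time-out), and the way messages are counted (one request per site, one reply per responding site, one outcome notice per site) are assumptions made for the example.

```python
def request_lock_by_voting(replica_votes):
    """Decide a lock request from the votes of the sites holding a replica.

    replica_votes maps site -> True (granted), False (denied),
    or None (no reply before the time-out).
    Returns (lock_granted, messages_exchanged).
    """
    total = len(replica_votes)
    granted = sum(1 for vote in replica_votes.values() if vote is True)
    replies = sum(1 for vote in replica_votes.values() if vote is not None)
    majority = total // 2 + 1
    # Requests out + replies back + outcome notices (grant or cancellation).
    messages = total + replies + total
    return granted >= majority, messages
```

With three replicas where two grant and one times out, the lock is obtained at a cost of eight messages under this counting. Locking a single distinguished copy would need one request and one reply.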

6. Distributed Recovery & Commit Protocols

It is very difficult to determine if a site is actually down, or if just the network cable is cut. A Distributed Commit ensures that if a transaction updates data across 3 sites, it will NOT commit until it is certain its effect on every site cannot be lost.

Two-Phase Commit (2PC) Protocol

Requires a Coordinator and Local Recovery Managers.

Phase 1 (Prepare / Vote): The coordinator asks all participating sites to prepare to commit. Sites vote "Yes" (I saved data to log) or "No" (I failed).
Phase 2 (Commit / Abort): If ALL sites vote Yes, the coordinator broadcasts a "Global Commit". If even ONE votes No, the coordinator broadcasts a "Global Abort/Rollback".
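
The coordinator's decision rule in 2PC is small enough to sketch directly. The version below is a simplified, single-process illustration. It assumes each participant is represented by a callable that returns its vote and that an unreachable site surfaces as a `TimeoutError`. The function names and the decision strings are hypothetical, and logging and recovery are omitted.

```python
def two_phase_commit(prepare_by_site):
    """Run one 2PC round.

    prepare_by_site maps site name -> callable returning that site's vote
    (True for "Yes", False for "No"). A site that cannot be reached is
    assumed to raise TimeoutError, which the coordinator treats as "No".
    """
    # Phase 1 (Prepare / Vote): collect a vote from every participant.
    votes = {}
    for site, prepare in prepare_by_site.items():
        try:
            votes[site] = bool(prepare())
        except TimeoutError:
            votes[site] = False

    # Phase 2 (Commit / Abort): commit only on a unanimous "Yes".
    decision = "GLOBAL_COMMIT" if all(votes.values()) else "GLOBAL_ABORT"
    return decision, votes


def unreachable():
    """Stand-in for a site whose network link is down."""
    raise TimeoutError("no reply before the time-out")
```

Calling `two_phase_commit({"ny": lambda: True, "tokyo": unreachable})` yields a global abort, because a missing vote is never counted as a "Yes".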

Three-Phase Commit (3PC) Protocol

Divides the second phase of 2PC into two subphases to avoid certain blocking scenarios, in which participants would otherwise wait indefinitely because the coordinator has failed.

Phase 1: Vote Phase.
Phase 2: Prepare-to-Commit: Communicates the result of the vote phase to everyone (creating a safety buffer).
Phase 3: Final Commit subphase.

7. Distributed Query Processing

Querying across a network is bottlenecked by Data Transfer Costs (the cost of transferring intermediate result files across the network). The ultimate optimization criterion is minimizing data transfer.

Stages of a Distributed Query

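Query Mapping: The input query is translated into an algebraic query on global relations by referencing the global conceptual schema, without yet considering how the data is distributed or replicated.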
Localization: Maps the global query to separate queries on individual fragments.
Global Optimization: Strategy selected to minimize network transfer.
Local Optimization: The local site optimizes how it pulls from its local disk.

The Power of the SEMIJOIN

Instead of sending a massive 1-million row table across the network to join with a small table, a Semijoin first sends only the joining column of the small table to the remote site. The remote site finds the matching rows and sends back ONLY those rows, restricted to the attributes the query needs, to the original site, where the join is completed. This drastically reduces the network payload.
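
The three semijoin steps can be sketched as below. This is an in-memory illustration that only counts what would cross the network. The function name, the table contents, and the cost unit (join values sent out plus rows sent back) are assumptions made for the example.

```python
def semijoin_transfer(small_rows, big_rows, key):
    """Join a local small table with a remote big table using a semijoin.

    Returns (joined_rows, shipped), where shipped counts the join values
    sent to the remote site plus the matching rows sent back.
    """
    # Step 1: the local site ships only the distinct join-column values.
    join_values = {row[key] for row in small_rows}

    # Step 2: the remote site keeps only the rows whose join value matches.
    matching = [row for row in big_rows if row[key] in join_values]

    # Step 3: the matching rows are shipped back and joined locally.
    matches_by_key = {}
    for row in matching:
        matches_by_key.setdefault(row[key], []).append(row)
    joined = [
        {**small, **big}
        for small in small_rows
        for big in matches_by_key.get(small[key], [])
    ]
    return joined, len(join_values) + len(matching)


# Hypothetical data: 2 departments locally, 100 employees at the remote site.
DEPARTMENTS = [{"Dno": 1, "Dname": "R&D"}, {"Dno": 2, "Dname": "HR"}]
EMPLOYEES = [{"Eid": i, "Dno": i % 10} for i in range(100)]
```

With this sample data, `semijoin_transfer(DEPARTMENTS, EMPLOYEES, "Dno")` ships 2 join values and 20 matching rows, for a total of 22 units. Shipping the whole remote table would move all 100 rows.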

8. Types of DDBMS & System Architectures

Hardware System Architectures

Shared Memory (Tightly Coupled): Multiple CPUs share the same RAM.
Shared Disk (Loosely Coupled): Nodes have their own RAM but share access to a central SAN/Disk array.
Shared-Nothing: Every node has its own CPU, RAM, and Disk. (The standard for pure distributed databases).

Categorization of DDBMS Software

Classified based on Heterogeneity (same or different DB software) and Autonomy.

Homogeneous: Every node runs the exact same software (e.g., all Oracle).
Federated Database System (FDBS): High autonomy and high heterogeneity. A global schema is shared by applications, but the component databases have differences in data models, constraints, and query languages.
Semantic Heterogeneity Problem in FDBS: Dealing with differences in meaning. Node A might store Price in Dollars, while Node B stores it in Euros. The Federated system must reconcile this.

🔥 Core Theory Q&A Preparation

Ensure you can articulate these core architectural principles flawlessly.

Concept: Horizontal vs Vertical Fragmentation

Q: How are Horizontal and Vertical fragments reconstructed to form the original table?

A: Horizontal fragmentation divides a table by rows (tuples). To reconstruct it, the DDBMS applies a relational UNION operation. Vertical fragmentation divides a table by columns (attributes), ensuring the primary key is included in every fragment. To reconstruct it, the DDBMS applies an OUTER UNION or a FULL OUTER JOIN using the common primary key.

Concept: Network Minimization

Q: Why is the Semi-Join operation crucial in Distributed Query Processing?

A: The primary bottleneck in a distributed database is not disk I/O, but Data Transfer Costs over the network. If we perform a standard join by sending a massive table across the network, it will saturate bandwidth. A Semijoin minimizes this by only sending the distinct joining keys across the network, evaluating the match remotely, and shipping back only the exact subset of required matching rows.

Concept: Distributed Concurrency Control

Q: What is the primary disadvantage of the "Voting Method" for Concurrency Control?

A: While the Voting Method avoids the single-point-of-failure inherent in the Primary Site technique, its major disadvantage is massive message traffic. Because there is no distinguished copy, a transaction must broadcast lock requests to every site holding a replica, wait for majority votes, and handle potential time-outs, which severely degrades network performance.

🏆 10-Mark Scenario Questions

These complex scenarios mimic high-weight university final exam questions, requiring you to synthesize architecture, protocols, and fragmentation strategies.

Scenario 1: Replication & Fault Tolerance

A global banking institution has three data centers: New York, London, and Tokyo. The CEO demands that user balance queries must be instantaneous globally, but balance updates must be absolutely consistent across all sites, even if one site's network cable is cut.

Recommend a Data Replication Strategy and a Commit Protocol. Justify why these solve the CEO's requirements.

Elite Answer Formulation:

1. Data Replication Strategy: Fully Replicated Database

To satisfy the requirement of "instantaneous global queries," the database must be Fully Replicated. This means an identical copy of the entire database exists at all three sites. Because every read request is handled entirely at the local site (data locality), query performance is maximized. While fully replicated databases suffer from slow update speeds, the CEO prioritized consistency and read-speed over update speed.

2. Commit Protocol: Two-Phase Commit (2PC)

To keep every replica consistent even when a network link fails, the system should use the Two-Phase Commit Protocol. 2PC does not keep updates available during a partition. It preserves consistency by refusing to commit until every site can.
• In Phase 1 (Prepare), the Global Transaction Manager (the coordinator) asks New York, London, and Tokyo to prepare and vote.
• If the network cable to Tokyo is cut, Tokyo cannot reply with a "Yes" vote before the coordinator's time-out.
• In Phase 2, because the Coordinator did not receive a unanimous "Yes", it broadcasts a Global Abort. No site applies the update unless ALL sites can, which prevents the replicas from diverging.

Scenario 2: DDBMS Architecture Choices

A hospital network is merging three previously independent clinics. Clinic A uses an Oracle database, Clinic B uses Microsoft SQL Server, and Clinic C uses a custom legacy system. The board wants a single unified application to query patient histories across all three, but the local clinic managers refuse to give up control over how their local data is formatted or managed.

Which specific DDBMS architecture should be deployed? Discuss how it handles the clinics' demands and identify one major technical hurdle it must overcome.

Elite Answer Formulation:

1. DDBMS Architecture: Federated Database System (FDBS)

The correct architecture is a Heterogeneous Federated Database System. An FDBS creates a single "Global Conceptual Schema" that sits above the clinics, allowing the board's new unified application to query data as if it were centralized.

2. Satisfying Local Demands: High Autonomy

This architecture inherently supports the clinic managers' demands through Design and Execution Autonomy. Clinic A can continue using Oracle's data models and constraints, and Clinic B can use SQL Server, operating independently. They only share what they explicitly map to the federated export schema.

3. Major Technical Hurdle: Semantic Heterogeneity

Because the databases were built independently, the FDBS must resolve Semantic Heterogeneity. For example, Clinic A might store Blood Pressure as a String ("120/80"), while Clinic B stores it as two Integers (Systolic: 120, Diastolic: 80). The Federated system must run complex translation logic at the middleware tier to standardize the meaning and intended use of this data before presenting it to the global application.
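
The translation logic for the blood pressure example might look like the sketch below. It assumes one wrapper function per clinic that maps local records into a common global format. The field names (`patient_id`, `pid`, `blood_pressure`, `Systolic`, `Diastolic`) and the function names are hypothetical, since the scenario does not specify the clinics' actual schemas.

```python
def from_clinic_a(record):
    """Clinic A (assumed format): blood pressure as a 'systolic/diastolic' string."""
    systolic, diastolic = record["blood_pressure"].split("/")
    return {
        "patient_id": record["patient_id"],
        "systolic": int(systolic),
        "diastolic": int(diastolic),
    }


def from_clinic_b(record):
    """Clinic B (assumed format): blood pressure as two integer fields."""
    return {
        "patient_id": record["pid"],
        "systolic": record["Systolic"],
        "diastolic": record["Diastolic"],
    }


# One wrapper per component database; each clinic keeps its local format.
WRAPPERS = {"A": from_clinic_a, "B": from_clinic_b}


def global_view(records_by_clinic):
    """Present records from all clinics in the single global format."""
    return [
        WRAPPERS[clinic](record)
        for clinic, records in records_by_clinic.items()
        for record in records
    ]
```

Each clinic continues to store data in its own format. Only the wrapper at the middleware tier needs to know how to convert it, which is what preserves local design autonomy.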

Scenario 3: Distributed Concurrency Control

A global e-commerce platform stores its inventory catalog across 10 distributed servers using partial replication. During a massive Black Friday sale, thousands of transactions attempt to lock the inventory record for a specific TV.

The DB team is deciding between the "Primary Site Technique" and the "Primary Copy Method" for concurrency control. Define both, and state which one the DB team must choose to survive the Black Friday traffic. Justify your answer.

Elite Answer Formulation:

1. Definitions:

Both techniques designate one specific copy of a replicated data item as the "distinguished copy" where all locks are managed.
• Primary Site Technique: Every single distinguished copy for every piece of data is held at one centralized master node.
• Primary Copy Method: The distinguished copies are spread out. Server 1 holds the primary locks for TVs, Server 2 holds the primary locks for Laptops, etc.

2. The Correct Choice: Primary Copy Method

The team must choose the Primary Copy Method. Black Friday generates thousands of concurrent lock requests. If they use the Primary Site Technique, that single centralized server will become an extreme bottleneck (receiving 100% of all lock traffic) and is highly vulnerable to a single-point-of-failure crash under load.

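By using the Primary Copy Method, the lock coordination load is spread across the 10 sites, because each site manages locks only for the items whose distinguished copy it holds. Requests for the one popular TV still go to that item's primary copy site, but that site no longer handles lock traffic for the rest of the catalog. This gives lock management horizontal scalability and helps the system survive the spike in concurrent traffic.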

Scenario 4: Fragmentation Strategy

A multinational corporation has an `EMPLOYEE (EmpID, Name, Salary, Region, Department)` table. The HR department in Europe queries European employee names frequently. The Payroll department at Global HQ in the USA runs massive batch queries exclusively calculating salaries for all employees worldwide.

Design a Hybrid Fragmentation schema for this table. Explain your horizontal and vertical fragmentation choices, and where you would allocate them.

Elite Answer Formulation:

To optimize for performance and data localization, I will implement a Mixed (Hybrid) Fragmentation strategy.

Step 1: Vertical Fragmentation (Optimizing for Payroll)

I will vertically fragment the table by columns.
• Fragment V1 will contain: (EmpID, Salary).
• Allocation: V1 will be allocated exclusively to the Global HQ node in the USA. Because Payroll only calculates salaries, removing the string-heavy Name, Region, and Department columns drastically reduces the disk blocks they must scan, massively speeding up their batch processing.

Step 2: Horizontal Fragmentation (Optimizing for European HR)

I will take the remaining columns (EmpID, Name, Region, Department) and horizontally fragment them (by rows) based on the Region attribute.
• Fragment H1: Region = 'Europe'
• Fragment H2: Region = 'USA' ... etc.
• Allocation: Fragment H1 will be allocated directly to the server at the European HR site. When European HR queries their local employees, the data is read from local disks without incurring any long-haul network data transfer costs.
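
The two steps above can be checked with a short sketch. The employee rows and the site names (`usa_hq`, `europe_hr`) are hypothetical, and only the allocations stated in the answer (V1 and H1) are recorded. The `reconstruct` function shows that the original table is still recoverable: UNION the horizontal fragments, then join with V1 on EmpID.

```python
# Hypothetical sample rows for EMPLOYEE (EmpID, Name, Salary, Region, Department).
EMPLOYEE = [
    {"EmpID": 1, "Name": "Ana", "Salary": 5000, "Region": "Europe", "Department": "HR"},
    {"EmpID": 2, "Name": "Ben", "Salary": 6000, "Region": "USA", "Department": "IT"},
    {"EmpID": 3, "Name": "Cleo", "Salary": 5500, "Region": "Europe", "Department": "IT"},
]


def project(rows, columns):
    """Keep only the listed columns of each row."""
    return [{column: row[column] for column in columns} for row in rows]


# Step 1: vertical fragment for Payroll.
V1 = project(EMPLOYEE, ["EmpID", "Salary"])

# Step 2: remaining columns, fragmented horizontally on Region.
REMAINING = project(EMPLOYEE, ["EmpID", "Name", "Region", "Department"])
H1 = [row for row in REMAINING if row["Region"] == "Europe"]
H2 = [row for row in REMAINING if row["Region"] == "USA"]

# Allocation schema for the fragments whose site is stated in the answer.
ALLOCATION = {"V1": "usa_hq", "H1": "europe_hr"}


def reconstruct():
    """UNION the horizontal fragments, then join with V1 on EmpID."""
    salary_by_id = {row["EmpID"]: row["Salary"] for row in V1}
    return [{**row, "Salary": salary_by_id[row["EmpID"]]} for row in H1 + H2]
```

Payroll's batch job reads only `V1`, and European HR reads only `H1`. A query that needs both names and salaries must combine fragments from two sites, which is the trade-off of this design.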