Course Introduction & Relational Model

This course is on the design and implementation of disk-oriented database management systems.

Course Outline

Relational Databases
Storage
Execution
Concurrency Control
Recovery
Distributed Databases
Potpourri

Textbook

Database System Concepts. 7th Edition. Silbersschatz, Korth, & Sudarshan

Projects

You will build your own storage manager from scratch of the course of the semester. Each project builds on the previous one.

Let’s get started!

About database:

A database is organized collection of inter-related data that models some aspect of the real-world.

Example: Create a database that models a digital music store, we need

Information about the Artists
What albums those Artists released

=> Store our databases as comma-separated value(CSV) files that we manage in our own code.

Use a separate file per entity
The application has to parse the files each time they want to read\update records.

Artist( name, year, country ), Album ( name, artist, year )

Question: How to find a certain entry, like a certain artist?

A simple implementation:

for line in file:
    record = parse(line)
    if "artist_name" == record[0]
    	print int(record[1])

About Data Integrity

How do we ensure that the artist is the same for each album entry ?
What if somebody overwrites the album year with an invalid string ?
How do we store that there are multiple artists on an album ?

About Implementation

How do you find a particular record ?
What if we now want to create a new application that uses the same database?
What if two threads try to write to the same file at the same time ?

About Durability

What if the machine crashes while our program is updating a record ?
What if we want to replicate the database on multiple machines for high availability ?

Database Management System

A DBMS is software that allows applications to store and analyze information in a database. A general-purpose DBMS is designed to allow the definition, creation, querying, update and administration of databases, without the need of knowing the details of the above questions.

Early DBMS

Tight coupling between logical and physical layers.

You have to (roughly) know what queries your app would execute before you deployed the database. (Hard to maintain)

–> solution: relational model

Relational Model

Proposed in 1970 by Ted Codd.

Database abstraction to avoid this maintenance:

Store database in simple data structures
Access data through high-level language
Physical storage left up to implementation.

A data model is collection of concepts for describing the data in a database.

A schema is a description of a particular collection of data, using a given data model.

Some categories of data models: Relational, Key/Value, Graph, Document, Column-family, Array/Matrix, Hierarchical, Network

Above all, the relation model is about

Structure: The definition of relations and their contents
Integrity: Ensure the database’s contents satisfy constraints.
Manipulation: How to access and modify a database’s contents.

Relation

A relation is unordered set that contain the relationship of attributes that represent entities

A tuple is a set of attribute values(also known as its domain) in the iteration.

Values are (normally) atomic / scalar.
The special value NULL is a member of every domain.

n-ary Relation = Table with n columns

Primary keys

A relation’s primary key uniquely identifies a single tuple.

For example, we can extend our Artist entity to Artist(id, name, year, country).

Foreign keys

A foreign key specifies that an attribute from one relation has to map to a tuple in another relation.

For example, we have the ids for Artist entity, but a album may have multiple artists, we need foreign keys to solve this:

We can create a new entity like ArtistAlbum(artist_id, album_id)

Data Manipulate Languages (DML)

We want to know: How to store and retrieve information from a database.

Procedural:

The query specifies the (high-level) strategy the DBMS should use to find the desired result.
Need relational algebra (What we focus in this class)

Non-Procedural (Declarative):

The query specifies only what data is wanted and not how to find it.
Need relational calculus

Relational Algebra

Fundamental operations to retrieve and manipulate tuples in a relation. (Based on set algebra)

Each operator takes one or more relations as its inputs and outputs a new relation.

Select: $\sigma$ Projection: $\pi$ Union:$\cup$ Intersection: $\cap$ Difference: - Product: $\times$ Join: $\Join$

Select

Choose a subset of the tuples from a relation that satisfies a selection predicate

Predicate acts as a filter to retain only tuples that fulfill its qualifying requirement.
Can combine multiple predicates using conjunctions / disjunctions

Syntax: $\sigma_{predicate}$(R)

In SQL:

SELECT * FROM R
	WHERE a_id = 'a2' AND b_id > 102;

Projection

Generate a relation with tuples that contains only the specified attributes.

Can rearrange attributes' ordering.
Can manipulate the values.

Syntax: $\pi_{A1,A2,…An}$(R)

In SQL:

SELECT b_id - 100, a_id
	FROM R WHERE a_id = 'a2';

Union

Generate a relation that contains all tuples that appear in either only one or both input relations

Syntax: (R $\cup$ S)

In SQL:

(SELECT * FROM R)
	UNION ALL
(SELECT * FROM S);

Difference

Syntax: (R - S)

In SQL:

(SELECT * FROM R)
	EXCEPT
(SELECT * FROM S);

Intersection

Syntax: (R $\cap$ S)

In SQL:

(SELECT * FROM R)
	INTERSECT
(SELECT * FROM S);

Product

Syntax: (R $\times$ S)

In SQL:

SELECT * FROM R CROSS JOIN S;

Join

Join takes in two relations and outputs a relation that contains all the tuples that are a combination of two tuples where for each attribute that the two relations share, the values for that attribute of both tuples is the same.

Syntax: (R $\Join$ S)

In SQL:

SELECT * FROM R NATURAL CROSS S;

The difference between “Join” and “Intersection” is that, Join doesn’t require that the two relations have the exact same attributes, it only requires that their common attributes should be the same.

Many more operations: Rename, Assignment, Duplicate Elimination …

Observation

Relational algebra still defines the high-level steps of how to compute a query.

We want to get all the tuples in relation R and S whose b_id equals 102.

$\sigma_{b_id=102}(R \Join S)$ vs. $R\Join (\sigma_{b_id=102}(S))$

From a high level, they both achieve this goal. However, their efficiency differs much.

A better approach is to say the result you want, and let the DBMS decide the steps it wants to compute a query according to the data size. SQL will do exactly this, and it is the de facto standard for writing queries. (Non-procedure)

The relational model is independent of any query language implementation.

for line in Artist.CSV:
	record = parse(line)
    if "Ice Cube" == record[0]
    	print int(record[1])

SELECT year FROM artists
	WHERE name = "Ice Cube"

SQL ’s robustness is much stronger. It doesn’t depend on the implementation of database, so if the low-level organization of the database is changed (e.g. using a tree instead of a list to store data), the application layer doesn’t need to change (while the python code becomes completely useless.)

Conclusion

Databases are ubiquitous.
Relational algebra defines the primitives for processing queries on a relational database.
We will see relational algebra again when we talk about query optimization + execution.