Database System Concepts. 7th Edition. Silbersschatz, Korth, & Sudarshan
You will build your own storage manager from scratch of the course of the semester. Each project builds on the previous one.
Let’s get started!
A database is organized collection of inter-related data that models some aspect of the real-world.
Example: Create a database that models a digital music store, we need
=> Store our databases as comma-separated value(CSV) files that we manage in our own code.
Artist( name, year, country ), Album ( name, artist, year )
Question: How to find a certain entry, like a certain artist?
A simple implementation:
for line in file:
record = parse(line)
if "artist_name" == record[0]
print int(record[1])
How do we ensure that the artist is the same for each album entry ?
What if somebody overwrites the album year with an invalid string ?
How do we store that there are multiple artists on an album ?
How do you find a particular record ?
What if we now want to create a new application that uses the same database?
What if two threads try to write to the same file at the same time ?
A DBMS is software that allows applications to store and analyze information in a database. A general-purpose DBMS is designed to allow the definition, creation, querying, update and administration of databases, without the need of knowing the details of the above questions.
Tight coupling between logical and physical layers.
You have to (roughly) know what queries your app would execute before you deployed the database. (Hard to maintain)
–> solution: relational model
Proposed in 1970 by Ted Codd.
Database abstraction to avoid this maintenance:
A data model is collection of concepts for describing the data in a database.
A schema is a description of a particular collection of data, using a given data model.
Some categories of data models: Relational, Key/Value, Graph, Document, Column-family, Array/Matrix, Hierarchical, Network
Above all, the relation model is about
A relation is unordered set that contain the relationship of attributes that represent entities
A tuple is a set of attribute values(also known as its domain) in the iteration.
n-ary Relation = Table with n columns
A relation’s primary key uniquely identifies a single tuple.
For example, we can extend our Artist entity to Artist(id, name, year, country).
A foreign key specifies that an attribute from one relation has to map to a tuple in another relation.
For example, we have the ids for Artist entity, but a album may have multiple artists, we need foreign keys to solve this:
We can create a new entity like ArtistAlbum(artist_id, album_id)
We want to know: How to store and retrieve information from a database.
Procedural:
Non-Procedural (Declarative):
Fundamental operations to retrieve and manipulate tuples in a relation. (Based on set algebra)
Each operator takes one or more relations as its inputs and outputs a new relation.
Select: $\sigma$ Projection: $\pi$ Union:$\cup$ Intersection: $\cap$ Difference: - Product: $\times$ Join: $\Join$
Choose a subset of the tuples from a relation that satisfies a selection predicate
Syntax: $\sigma_{predicate}$(R)
In SQL:
SELECT * FROM R
WHERE a_id = 'a2' AND b_id > 102;
Generate a relation with tuples that contains only the specified attributes.
Syntax: $\pi_{A1,A2,…An}$(R)
In SQL:
SELECT b_id - 100, a_id
FROM R WHERE a_id = 'a2';
Generate a relation that contains all tuples that appear in either only one or both input relations
Syntax: (R $\cup$ S)
In SQL:
(SELECT * FROM R)
UNION ALL
(SELECT * FROM S);
Syntax: (R - S)
In SQL:
(SELECT * FROM R)
EXCEPT
(SELECT * FROM S);
Syntax: (R $\cap$ S)
In SQL:
(SELECT * FROM R)
INTERSECT
(SELECT * FROM S);
Syntax: (R $\times$ S)
In SQL:
SELECT * FROM R CROSS JOIN S;
Join takes in two relations and outputs a relation that contains all the tuples that are a combination of two tuples where for each attribute that the two relations share, the values for that attribute of both tuples is the same.
Syntax: (R $\Join$ S)
In SQL:
SELECT * FROM R NATURAL CROSS S;
The difference between “Join” and “Intersection” is that, Join doesn’t require that the two relations have the exact same attributes, it only requires that their common attributes should be the same.
Relational algebra still defines the high-level steps of how to compute a query.
We want to get all the tuples in relation R and S whose
b_id
equals 102.
$\sigma_{b_id=102}(R \Join S)$ vs. $R\Join (\sigma_{b_id=102}(S))$
From a high level, they both achieve this goal. However, their efficiency differs much.
A better approach is to say the result you want, and let the DBMS decide the steps it wants to compute a query according to the data size. SQL will do exactly this, and it is the de facto standard for writing queries. (Non-procedure)
The relational model is independent of any query language implementation.
for line in Artist.CSV:
record = parse(line)
if "Ice Cube" == record[0]
print int(record[1])
SELECT year FROM artists
WHERE name = "Ice Cube"
SQL ’s robustness is much stronger. It doesn’t depend on the implementation of database, so if the low-level organization of the database is changed (e.g. using a tree instead of a list to store data), the application layer doesn’t need to change (while the python code becomes completely useless.)