
Comp 521: Files and Databases -- Fall 2010

Problem Set #3


Issued: 10/13/2010      Due: In class 10/27/2010


Homework Information: Some of the problems are probably too long to attempt the night before the due date, so plan accordingly. Late homework will be penalized by a factor of 70.71% for each late class period. Feel free to work with others, but the work you hand in should be your own.

Question 1.


Suppose you have a phone book with 60,000 entries in alphabetical order by last name.  Each page has 300 entries.

A. Give an upper bound on the number of pages that must be examined to find a given last name using a binary search.  (Assume that, having computed a page number, you can open directly to that page.)

B. Now suppose a 1-page table of contents is added to the phone book.  It lists the last name on each of the other pages.  (Assume no last name has more than 1 page of entries.)  How many pages must be examined to look up a given last name?

Now suppose that the phone book is reorganized such that each page contains a list of names with the same last two consonants.  For example, 'Gehrke' falls on the same page as 'Brook' (both have 'rk' as their last 2 consonants).  Assume that no pattern takes more than one page, and there are a total of 250 pages.  Finally, assume that each page has an instant-access index tab on the side.

C. How many pages must be checked to find a particular name?

D. Assuming a page still has room for up to 300 entries, what is the occupancy percentage of an average page in the new phone book?
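
If you would like to check your reasoning for part A empirically, the short simulation below counts the pages opened by a binary search over the 200 pages (60,000 entries at 300 per page). This is only an illustrative sketch, not part of the assignment; the pages_examined function and its even-page-layout assumption are hypothetical.

# Hypothetical sketch: count the pages opened by a binary search over
# 200 phone-book pages, assuming we can jump straight to any page number.
def pages_examined(target_page, num_pages=200):
    lo, hi, opened = 0, num_pages - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        opened += 1                 # opening a page counts as one examination
        if mid == target_page:      # this page holds the name
            return opened
        elif mid < target_page:
            lo = mid + 1
        else:
            hi = mid - 1
    return opened

print "Worst case pages examined:", max(pages_examined(p) for p in range(200))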


Question 2.


(Hint: use the Cost of Operations table to complete this problem)

Consider a database table of 200 disk pages, with 50 records per page.  Assume this database is running on a machine with an average disk access time of 5 ms.  Assume any tree indices mentioned in this problem have a fanout of 100.

  1. Suppose that 10% of all rows match a certain range query, and they are randomly distributed throughout the table.  About how many pages will contain at least one matching row?  (More formally, assume each row has a 10% chance of matching, independent of any other row or of this row's position.)  One way to sanity-check your answer is sketched after this list.
  2. For each of the following file organizations, compute the time in milliseconds that it will take to do an equality selection on a unique key.
    • Heap file (no indices)
    • Clustered tree index
    • Heap file with an unclustered tree index
    • Heap file with an unclustered hash index
  3. For each of the above file organizations, compute the time in milliseconds that it will take to do a range selection with 10% of all rows matching.  (For the unclustered tree index, assume matching rows are distributed as in part 1.)
  4. Would you expect a clustered hash index to perform better for either type of selection?  Why or why not?
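
As mentioned in part 1, a quick Monte Carlo estimate is one way to sanity-check your answer there. The sketch below is hypothetical and only approximates the expectation; it is not part of what you hand in.

# Hypothetical sanity check for part 1: estimate how many of the 200 pages
# hold at least one matching row when each of the 50 rows on a page matches
# independently with probability 0.10.
import random

trials = 100
total = 0
for _ in range(trials):
    for page in range(200):
        if any(random.random() < 0.10 for row in range(50)):
            total += 1              # this page contains at least one match
print "Average pages with a match:", total / float(trials)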


Question 3.


Consider the following table:

CREATE TABLE Rentals (
    cardNo  INTEGER,
    movieId INTEGER,
    date    DATE,
    rating  INTEGER, -- 1 to 5
    PRIMARY KEY(cardNo, movieId, date)
);

For each of the following queries, suggest an index that would best improve performance:

A)  SELECT cardNo, COUNT(*) FROM Rentals

WHERE rating = 1

GROUP BY cardNo;

B) SELECT DISTINCT movieId FROM Rentals

WHERE rating = 5 AND date > date('now', '-1 week');

-- note: date() is a SQLite-specific function.  The invocation above returns the date one week before the current date.

C) SELECT cardNo, MAX(date) FROM Rentals

GROUP BY cardNo;
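
One way to test a candidate index before settling on an answer is to build it and ask SQLite for the query plan; EXPLAIN QUERY PLAN is a standard SQLite statement. The sketch below assumes the Rentals table lives in the movies.db database from Question 5, and trial_idx is only a placeholder, not a suggested answer.

# Hypothetical workflow: create a candidate index, then check whether
# SQLite's plan for a query actually uses it.
import sqlite3

db = sqlite3.connect('movies.db')
cursor = db.cursor()
cursor.execute("CREATE INDEX trial_idx ON Rentals(rating)")   # placeholder index
cursor.execute("""EXPLAIN QUERY PLAN
                  SELECT cardNo, COUNT(*) FROM Rentals
                  WHERE rating = 1 GROUP BY cardNo""")
for row in cursor.fetchall():
    print row                      # plan rows mention trial_idx if it is used
cursor.execute("DROP INDEX trial_idx")   # clean up before trying another candidate
db.close()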


Question 4.


Consider a disk with a sector size of 512 bytes, 14409 tracks per surface, 16947 sectors per track, 8 double-sided platters, and an average seek time of 9 ms.

A) What is the capacity of a track in bytes?

B) What is the capacity of each surface?

C) What is the capacity of each disk?

D) How many cylinders does the disk have?

E) If the disks spin at 7200 revolutions per minute, what is the maximum rotational delay?

F) If one track of data can be transferred per revolution, what is the transfer rate?

Assuming a block size of 4096 bytes is used, suppose that a file containing 1,000,000 records of 100 bytes each is to be stored on such a disk and that no record is allowed to span two blocks.

G) How many records fit onto a block?

H) How many blocks are required to store the entire file? If the file is arranged sequentially on disk, how many surfaces are needed?

I) How many 100-byte records can be stored on this disk?
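
Every part above is a short chain of multiplications and divisions. If it helps to keep the relationships straight, here is a scratch calculator encoding the standard formulas (double-sided platters contribute two surfaces each); it is illustrative only, for checking your hand computations.

# Scratch calculator for the standard disk-geometry relationships.
bytes_per_sector   = 512
sectors_per_track  = 16947
tracks_per_surface = 14409
surfaces           = 8 * 2                               # 8 double-sided platters
rpm                = 7200

track_capacity    = bytes_per_sector * sectors_per_track  # part A, bytes
surface_capacity  = track_capacity * tracks_per_surface   # part B, bytes
disk_capacity     = surface_capacity * surfaces           # part C, bytes
cylinders         = tracks_per_surface                    # part D: one cylinder per track position
max_rot_delay     = 60.0 / rpm                            # part E: one full revolution, seconds
transfer_rate     = track_capacity / max_rot_delay        # part F: bytes per second
records_per_block = 4096 // 100                           # part G: no record spans two blocks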


Question 5.



Recall the database movies.db (~500 Mbytes) used in the previous assignment with the following schema:

CREATE TABLE Customers (
    cardNo  INTEGER PRIMARY KEY,
    first   TEXT,
    last    TEXT,
    sex     CHAR,
    dob     DATE
);

CREATE TABLE Movies (
    movieId INTEGER PRIMARY KEY,
    title TEXT,
    year INTEGER
);

CREATE TABLE Rentals (
    cardNo  INTEGER,
    movieId INTEGER,
    date    DATE,
    rating  INTEGER,
    PRIMARY KEY(cardNo, movieId, date),
    FOREIGN KEY (cardNo) REFERENCES Customers,
    FOREIGN KEY (movieId) REFERENCES Movies
);

The first part of this assignment is to time the following query:

SELECT first, last, COUNT(*)
FROM Customers, Rentals
WHERE Customers.cardNo=Rentals.cardNo
GROUP BY Rentals.cardNo

This can be accomplished using the following lines of Python.

import time
import sqlite3

Q1 = """SELECT first, last, COUNT(*)
            FROM Customers, Rentals
            WHERE Customers.cardNo=Rentals.cardNo
            GROUP BY Rentals.cardNo"""


db = sqlite3.connect('movies.db')
cursor = db.cursor()
cursor.execute(Q1)
# time only the fetch; sqlite3 defers most of the work until rows are pulled
start = time.clock()
rows = cursor.fetchall()
elapsed = time.clock() - start
print "Processing time =", elapsed

There are a few things to note here. First, notice that execution of queries in Python is deferred until the actual data is called for by the "cursor.fetchall()" statement. Second, be patient.

When completed (eat dinner, take a nap), add the following three lines before the line "cursor.execute(Q1)" and re-execute the script.

cursor.execute("CREATE INDEX cardNo_movieId on Rentals(cardNo,movieId)")
cursor.execute("CREATE INDEX Cust_cardNo on Customers(cardNo)")
cursor.execute("CREATE INDEX Mov_Id on Movies(movieId)")

Hopefully, you will notice a significant speedup, one even greater than the printed time suggests. Why do you think that the reported time underreports the actual speedup?

Now you are armed with all you need to do the following.

1) Make a copy of the "movies.db" database called "origMovies.db"
2) Write a Python script to add the three example indices to "movies.db", commit, and close it.
3) Write a second Python script that takes the name of the database as a command-line argument, connects to it, and then executes and times 5 repetitions of each of the following queries. The program should print out both the execution time and the number of tuples returned for each query. (A skeleton for this script is sketched after the query list below.)
4) Turn in a listing of your code and a printout of its execution with both "movies.db" and "origMovies.db"

The queries are:

A) The titles and the number of rentals of all movies rented more than 10000 times
B) The first and last names of all customers who rented "The Matrix"
C) The card number and title of the most recent movie rented by each customer

D) Finally, add your own index to improve the query evaluation time of query A from above, and demonstrate it with printouts of your modified code and the improved timings.
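
As promised in step 3, here is a minimal skeleton for the timing script. It is only a sketch under assumptions: the placeholder query strings must be replaced with your actual queries A-C, and the timing approach simply mirrors the example earlier in this question.

# Hypothetical skeleton for step 3: time 5 repetitions of each query
# against the database named on the command line.
import sys
import time
import sqlite3

queries = {'A': "SELECT 'replace me with query A'",   # placeholders only
           'B': "SELECT 'replace me with query B'",
           'C': "SELECT 'replace me with query C'"}

db = sqlite3.connect(sys.argv[1])   # e.g. movies.db or origMovies.db
cursor = db.cursor()
for label in sorted(queries):
    for trial in range(5):
        cursor.execute(queries[label])
        start = time.clock()
        rows = cursor.fetchall()    # time the fetch, as in the example above
        elapsed = time.clock() - start
        print "Query", label, "trial", trial + 1, ":", elapsed, "sec,", len(rows), "tuples"
db.close()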



