Logged in as: guest Log in
Problem Set #3 kemal / Version 25

Comp 521: Files and Databases -- Fall 2012

Problem Set #3


Issued: 9/25/2012     Due: In class 10/16/2012


Homework Information: Some of the problems are probably too long to attempt the night before the due date, so plan accordingly. Late homework will penalized by a factor of 70.71% for each late class period. Feel free to work with others, but the work you hand in should be your own.

Question 1.


 

Suppose you have a phone book with 366,000 entries in alphabetical order by first name.  Each page has 610 entries.

A. Give an upper bound on the number of pages that must be examined to find a given first name using a binary search.  (Assume that, having computed a page number, you can open directly to that page.)

B. Now suppose a 1-page table of contents is added to the phone book.  It lists the first "first name" on each of the other pages.  (Assume no first name has more than 1 page of entries) How many pages must be examined to look up a given first name?

Now suppose that the phone book is reorganized such that each page contains a list of names with the same first and last characters.  For example, 'Thomas' falls on the same page as 'Travis' (both have 'Ts' as their first and last characters).  Assume that no pattern takes more than one page, and there are a total of 1000 pages.  Finally, assume that each page has an instant-access index tab on the side.

C. How many pages must be checked to find a particular name?

D.Assuming a page still has room for up to 610 entries, what is the occupancy percentage of an average page in the new phone book?

 

Question 2.


Consider the following table:

CREATE TABLE Actors (
    aid  INTEGER PRIMARY KEY,
    FName TEXT,
    LName TEXT);

 

CREATE TABLE Casts (
    aid  INTEGER,
    mid INTEGER,
    role TEXT,
    PRIMARY KEY(aid, mid),
    FOREIGN KEY(aid) REFERENCES Actors
    FOREIGN KEY(mid) REFERENCES Movies);

CREATE TABLE Movies (
    mid  INTEGER PRIMARY KEY,
    Title TEXT,
    Year INTEGER);


For each of the following queries, suggest an index that would best improve performance:

A)  SELECT aid, COUNT(*) FROM Casts

WHERE role = 'Himself'

GROUP BY aid;

B) SELECT DISTINCT m.mid FROM Casts c, Movies m

WHERE m.mid=c.mid and  Year > 1990 ;

C) SELECT role, COUNT(*) FROM Casts c, Actors a

WHERE a.aid =c.aid

GROUP BY role;

 

Question 3.


Consider a disk with a secor size of 4096 bytes, 63 sectors per track, 364000 tracks per surface,  8 double-sided platters, and an averge seek time of 4 mS.

A) What is the capacity of a track in bytes?

B) What is the capacity of each surface?

C) What is the capacity of the disk?

D) How many cylinders does the disk have?

E) If the disks spin at 7200 revolutions per minute, what is the maximum rotational delay?

F) If one track of data can be transfered per revolution, what is the transfer rate?

Assuming a block size of 512 bytes is used, suppose that a file containing 350,000 records of 70 bytes each is to be stored on such a disk and that no recored is allowed to span  two blocks.

G) How many records fit onto a block?

H) How many blocks are required to store the entire file? If the file is arranged sequentially on disk, how many surfaces are needed?

I) How many 70 byte records can be stored on this disk?

 

Question 4. (Turn in your answers to this question by emailing .zip file to kemal@cs.unc.edu. Put your python scripts as separate files in this zipped file)


Consider the database actors.db (~300 Mbytes) with the following schema:

CREATE TABLE Actors (
    aid  INTEGER PRIMARY KEY,
    FName TEXT,
    LName TEXT);

 

CREATE TABLE Casts (
    aid  INTEGER,
    mid INTEGER,
    role TEXT,
    PRIMARY KEY(aid, mid),
    FOREIGN KEY(aid) REFERENCES Actors
    FOREIGN KEY(mid) REFERENCES Movies);

CREATE TABLE Movies (
    mid  INTEGER PRIMARY KEY,
    Title TEXT,
    Year INTEGER);

 

The first part of this assigment is to time the following query:

SELECT c.mid, MAX(Year)
FROM  actors a ,casts c,movies m
WHERE a.aid=c.aid and m.mid=c.mid
GROUP BY c.mid 

This can be accomplished using the following lines of Python.

import time
import sqlite3

Q1 = """SELECT FName, LName, COUNT(*)
            FROM Actors, Casts 
            WHERE Actors.aid=Casts.aid
            GROUP BY Casts.aid"""


db = sqlite3.connect('actors.db')
cursor = db.cursor()
start = time.clock()
cursor.execute(Q1)
t = cursor.fetchall(); 
end = time.clock() - start
print "Processing time =", end

When completed, add the following 3 lines before the line "cursor.execute(Q1)" and reexecute the script.

cursor.execute("CREATE INDEX actorId_movieId on Casts(aid,mid)")
cursor.execute("CREATE INDEX Act_id on Actors(aid)")
cursor.execute("CREATE INDEX Mov_Id on Movies(mid)")

Hopefully, you will notice a significant speedup. Now you are armed with all you need to do the following.

1) Make a copy of the "actors.db" database called "origActors.db"
2) Write a python script to add the three example indices to "actors.db", commit, and close it.
3) Write a second python script that takes, as a command-line argument the name of the database, connects to it, and then executes and times 5 repitions of the following queries. The program should print out both the execution time and number of tuples returned for each query.
4) Turn in a listing of your code and a print out of its execution with both "actors.db" and "origActors.db"

The queries are:

A) The titles and the number of actors of all movies where more than 10 actors played.
B) The first and last names of all actors who cast in the movie "Donnie Darko"
C) The first&last name of actor and title of most recent movie played by each actor
D) (Bonus) Come up with your own query and index(es) that runs at least 5 times faster on actors.db than origActors.db and explain why it is so.

E) Finally, add you own index to improve the query evaluation time of query A from above, and demonstrate it with printouts of your modified code and the improved timings.




Site built using pyWeb version 1.10
© 2010 Leonard McMillan, Alex Jackson and UNC Computational Genetics