CMPS 4420 Advanced Database Systems Final

Proposal

	The primary focus of the Advanced Database Systems class has been improving efficiency and the run time for various actions taken on a database. We can improve speed by using faster hardware, understanding the physical design of the hardware and optimizing our queries. We can also tune our database according to new needs or seasonal changes that affect the overall usage of the database. The most intriguing topic for me so far is the use of indexing to improve efficiency in a database which  was discussed in chapter 18 of our Fundamentals of Database Systems1 textbook.
	We performed a lab that demonstrated the improvement that could be made by implementing an index file and then using a binary search to make searches even faster. We could improve upon this by making a sparse index as opposed to a dense index. The sparse index reduced search time by reducing the number of block reads required. I was astonished by the improvement the index made but the book also discussed the concept of a clustered index. I am interested to figure out the effect that producing a clustered index can have on improving the search time of queries.
	Of course, the efficacy of the type of index you use depends on the data that you are searching and the queries that you use. Last year we used a csv file in CMPS 3500.2 This file has data on accidents that occurred in the United States from 2016 to 2021. Each accident has several attributes including city, state, zip, condition and the time stamp. I think that most queries would involve cities and states, so I think this is a perfect file to test the effectiveness of a clustered index.
	My idea is to produce 2 tables from the US accidents data file. One table will be indexed by the ID attribute. The second table will  have a clustered index which will be indexed by the state attribute. I think many queries would be based on the city and state location of accidents, so this type of index makes sense for this database. After setting up the 2 tables that have the same information but different indexes, I plan to write about 20 queries selecting a count of the number of accidents in different cities in different states in the U.S. I think there   will be a significant improvement when using the table with the clustered index versus searching the table with the ID as the primary index. The former should be able to jump to the state and find the accidents in the city while the latter will have to search through the whole table every time. There should especially be a significant difference when making queries on cities in smaller states. I also intend to include a query using the zip code, which I suspect will run faster on the ID index because both tables will need to be completely searched and the clustered index will have to go through the additional 50 cluster indexes.
	My original plan was to use SQLite as my DBMS and then write my queries with Python. However, I found out that SQLite does not support clustered indexes, so I will use MariaDb instead. PHP is readily compatible with MariaDb, so I have chosen to use PHP instead of Python for my coding language.

Work Plan

Time Line
Week 0	April 10 to April 14 
	brainstorm for final project
Week 1	April 15 to April 21
	write initial proposal
	develop plan
	insert csv file into database
	create table with ID as primary index
	create table with clustered index
Week 2 April 22 to April 28
	Develop queries for both the ID primary index and the index clustered by state
	Learn PHP syntax
	Initiate coding for queries and time testing
Week 3	April 29 to may 5
	finish writing code 
	debug
Week 4	May6 to May 12
	debug
	finish any assigned writings
week5	May 19
	graduate!!!

Requirements

MariaDb platform, PHP, Linux

References

1.	Ramez Elmasri, Shamkant B. Navathe. Fundamentals of Database Systems, 7th edition. Pearson, 2016.
2.	US_Accidents_data.csv from Walter Morales, CSUB 3500 Programming Languages, Spring 2022.