Database Homework #2, Spring 2018

For each question below (most anyway), you will write an SQL query. Additionally, for each question with a star (*), provide the SQLite output (the table that is printed). The output can be cut-and-pasted from your SQLite browser.

Each of these questions should be answered with a single SQL query that would hypothetically work for any logical database instance. In other words, you shouldn't "hard-code" anything in a query that will make the query return incorrect information if any of the information in the database changes.

(Easy SQL) Using SQLite, load the Flights database. The schema is as follows:
- customers (customerID, first, last, city, state, birthday)
  The customers table represents people traveling by airplane. customerID is a key.
- airports (name, code, city, state, airportLat, airportLong, cityLat, cityLong, population)
  The airports table represents airports that travelers will arrive at and depart from. For each airport, the table stores the name of the airport, a three-letter code that is unique for each airport, the major city and state the airport serves, the latitude and longitude of the airport itself, the latitude and longitude of the city served by the airport, and the population of the city.
- airlines (name, code)
  The airlines table stores information about the airlines that fly the planes between the airports. For each airline, the table stores the name of the airline and a two-character unique code that identifies the airline.
- schedule (airline, flightNumber, source, destination, departTime, arriveTime)
  The schedule table stores information about the individual flights that will happen on a particular day. For each flight, the table stores the airline's two-character code, the flight number, the three-digit codes for the airport where the flight departs (source) and lands (destination), and the departure and arrival times in a 24-hour clock format. Note that flight numbers may be duplicated across airlines, but not across flights. In other words, two airlines may both have a flight with number 42, but there is only one Delta flight #42 on a particular day.
- bookings (customerID, airline, flightNumber, boardingGroup, seatRow, seatLetter)
  The bookings table stores information about the customers who have made reservations to fly. For each reservation, the table stores the customerID, the two-character airline code, the flight number, the boarding group for the customer (an integer between 1 and 3, inclusive), and the seat assignment (a row between 1 and 30, and a letter from A to F).
1. *Select the IDs, first, and last names of all customers who live in Memphis. [Sanity check: 6 rows]
2. *Select the airline code, flight numbers, and departure/arrival times of all flights scheduled to fly from MEM to ATL (MEM and ATL are the airport codes for Memphis and Atlanta). [Sanity check: 7 rows]
3. *Find the names, codes, and cities of all the airports in Tennessee that serve cities of at least 100,000 people. [Sanity check: 6 rows]
4. Find the total number of airports in each state in the country. Return two columns: state, and number of airports.
5. *Find the total number of people on Delta flight 45. Hint: Delta's two-character code is DL.
6. *Find the total number of rows occupied on Delta flight 45. A row is occupied if at least one person is sitting in the row.
(Medium SQL) Consider the following relational design used at a Memphis pizza restaurant:
- Customer (cust_id, name, phone, ccn, neighborhood)
- Pizza (pizza_id, name, size, price)
- Order (cust_id, pizza_id, ordermonth, orderday, orderyear, ordertime, quantity)
- Supplies (supply_name, amountleft, unitprice)
- Ingredient (pizza_id, supply_name, amount)
The Customer table maintains the personal information for people who order pizzas, such as a unique ID, name, phone number, credit card number (ccn), and neighborhood (e.g., Midtown, East Memphis, Evergreen). It is possible that two different people have the same name.
In the Pizza table, every pizza has an ID, name (e.g., "the works"), particular size (e.g., 7 inches) and price. Note that different pizzas may have the same name, but different sizes.
The Order table includes the records about which customer ordered which pizza, quantity of pizzas, orderyear (e.g., 2014), ordermonth (e.g., 12), orderday (e.g., 27), and the order time (e.g., "6:13pm"). Note that an order can contain more than one pizza; there will just be multiple tuples with the same cust_id, date, and time.
The Supplies table includes the information of the various groceries used by the store: the name, unit price, and the amount left in the store (e.g. the store might have 3 lbs of mozzarella left with a unit price of $5 per lb).
The Ingredient table keeps the records about the amount of ingredients used by each pizza (so "The works 7 inch" might use only 0.1 lbs of mozzarella, while "Four cheese 12 inch" might use 0.4 lbs).
Write SQL queries to answer parts b through f. Use only the operators and SQL statements we have learned in class. In addition, avoid the operators IN, ALL, ANY, and EXISTS.
1. What are logical keys for the five tables?
2. Find the ids of the customers with name "Bob."
3. List pizza ids of the pizzas ordered by all the customers. Remove duplicates.
4. Find pizza ids of all the pizzas with the price more than 15 dollars.
5. List ids and names of the customers who ordered at least one pizza and live in Midtown.
6. Find all the pizzas (pizza id, pizza name, and size) where the store makes a profit of at least 10 dollars per whole pizza. (Here, we assume Profit = Pizza Price minus Ingredient Cost)
Suggestion: It is very tempting to create a database and throw some sample data into it to do this problem on a computer. I encourage you not to do this, or at least only do it to check your work. I purposefully did not give you a database for this question so you could practice writing SQL queries without being able to test them out, because the midterm will have questions like this.
(Advanced SQL) Using the Flights SQLite database again, write SQL queries to answer the following questions. Note that some of these queries will require multiple seconds to run. On my computer, the last query takes a good 15-20 seconds.
1. Find a listing of all flights and the number of people on each flight. Return a table with columns for airline code, flight number, and number of passengers. [Sanity check: 23983 rows]
2. *Find the largest number of passengers on any single flight. Return this number of passengers (one row/one column). [Sanity check: the number is between 100 and 200.]
3. *Find the airline code, flight number, source, and destination airport codes for the flight(s) with the largest number of passengers. Do not use any literal numbers in the query; your query should calculate the size of the largest flight and retrieve the relevant information in one query.
  Note that in general, there is no guarantee that there will be only one flight that has this number of passengers on it! Your query should be able to return them all if there is more than one.
  [Sanity check: For this particular database instance, there is only one flight with the largest number of passengers on it and it involves a New York City airport.]
4. *Find, for each airline, the most popular one-way route that airline flies, defined as the route (pair of source/destination airports) that carries the most passengers. Return the airline code, the source and destination airport codes, and the total number of passengers that are flying that route.
  Note that there are multiple airlines in the database that fly the same route, and possibly multiple flights per day for each airline on that route.
  Note that "one-way route" means we are not combining flights from Memphis to Atlanta with flights from Atlanta to Memphis. Those are two separate one-way routes.
  Hint: this is hard. You may want to use SQL VIEWs to save parts of your query as you develop them. An SQL view is similar to the relational algebra shorthand notation: a view allows you to give a name to a query that you plan to use later on (for instance, as a subquery). If you use views, you should write down your CREATE VIEW statements as well as your final SELECT statement.
  Hint 2: You may find it useful as well to read the book section on correlated subqueries. This may be helpful as well.
  Hint 3: You should end up with a 12-row table. [Sanity check: You should end up with the most popular Delta (DL) route being from Minneapolis/St. Paul (MSP) to Salt Lake City (SLC), carrying 572 passengers in total.

Other hints, guidelines, and clarifications

Do not use the LIMIT keyword. In particular, you should not use LIMIT 1 combined with ORDER BY to find the largest or smallest anything, in SQL. The reason is that sorting results with ORDER BY is an O(n log n) operation, and then using LIMIT 1 to find the largest or smallest item wastes much of that computation; you can do this with aggregation in O(n) time.
The second reason why LIMIT 1 is often deceptive is that while it can help you find the maximum or minimum something in SQL (albeit inefficiently), it can create problems when you try to expand your query to find other attributes related to the maximum or minimum something, if the maximum or minimum appears multiple times. For instance, pretend you have a database table that stores items sold in a store and their prices. You want to find the price of the most expensive item. You can do this with an ORDER BY/LIMIT 1 query (again, it's horribly inefficient and you shouldn't do it!) but it works. What doesn't work, however, is where you try to then get the name of the item that has the most expensive price --- what if there are multiple items with this price? Using LIMIT 1 will only retrieve one of them, and you don't know which one. This kind of query should be done with a subquery (see the link in the last question above for an example of this exact situation).