Warehouse Activity Profiling

Datamining the distribution center

We will build a database using the following datafiles and query it to answer questions about the demands on the distribution center. Each of the following datafiles is a plain ASCII file with comma-delimited fields and with field names in the first row.

sales.csv

This file contains the concatenated shopping lists of all customers over a given time period. This has been taken from the financial system and may be expected to be accurate. Note: This file is unrealistically small, containing only about 10,000 pick lines. Here is a more realistically sized version (about 2.25 million pick lines); but it can be very time-consuming to manipulate.

skus.csv

This describes about 21,000 skus. It is taken from the Item Master, a database that may be populated by data from the vendors or from the purchasing department or from the warehouse operations staff. Data of this sort is likely to contain some inaccuracies, especially in dimensions of the product. Beware especially of physical dimensions with value 0.

locations.csv

This table lists the addresses of all 20,000+ storage locations in the warehouse together with which sku is believed to be stored there. It is taken from the database that manages storage space, which is part of the warehouse management system. Note that it may contain “dummy locations” that do not correspond to actual locations in the warehouse. You will need a map of the warehouse to translate these addresses to actual physical locations.

Set up the database

If you already have a preferred database program, import the text files above into your database and continue with the next section. Otherwise, here are some open source databases that are sufficient to the purpose: MySQL,PostgreSQL, and H2. Each includes a rudimentary user interface. There are several free GUIs that connect to these databases and also paid versions. Another possibility is to install MAMP (Mac) or XAMPP (Mac/Windows/Linux) and use phpMyAdmin to interface with the MySQL database.

If you have never done this before it can be tedious and even stressful. But there is lots of on-line documentation, or you can ask a more-experienced classmate for help, or ask the TA.

Using SQL to extract information

There are many excellent on-line SQL references available; here is one.

For our purposes the most important part of SQL is the SELECT statement.

If you would like the database to save its work for future reference, use the CREATE TABLE...AS...WITH DATA construct, as in the following.

CREATE TABLE Weights AS
	(SELECT SkuID, UnitWeight FROM Skus)
WITH DATA;

This will save the results of the query as a table named Weights, which will become part of the database (until you choose to DROP it). To examine this table, execute the query SELECT * FROM Weights;

If you want only the fields corresponding to sku ID and unit weight then execute the query
SELECT SkuID, UnitWeight FROM Skus WHERE UnitWeight > 25;.

SQL also allows allows use of = (equal), <> (not equal), < (less than), <= (less than or equal to), and >= (greater than or equal to).

The % character is used as a wildcard. Note that strings of text must be surrounded by single quotes, whereas no quotes are used with numbers.

There are many ways to transmit SQL commands to the database. Most databases include some type of query input facility. For example, MS Access has a QueryBuilder tool. SQuirrel is a useful input facility that is also open source and can connect to most common database engines and so can be worth learning for future use. SQuirreL also comes with a SQL syntax verifier that can help you avoid SQL errors. You can also copy results directly from SQuirreL to paste into reporting tools such as spreadsheets.

Most front ends will also allow you to submit a file of SQL commands to be executed in batch, once you have debugged them.

Questions

Use the sales history, description of skus and their locations to answer the following. (SQL queries are provided for some questions to help you get started. Sometimes the query can be rewritten to run faster: You are urged to experiment.)


Question 1. What are the largest and the smallest dimensions among all skus?

--
-- This is a comment. Be sure to document each query
--
SELECT Max( UnitLength ) FROM Skus;

Write similar queries for the remaining dimensions. Or you can combine multiple queries into a single one.


Question 2. How many times was each sku was requested?

--
-- Sum all requests for each sku and list skus from most popular to least popular
--
CREATE TABLE Popularity AS
	(SELECT Sales.SkuID, Count(Sales.SkuID) AS TotalPicks
	FROM Sales
	GROUP BY Sales.SkuID
	ORDER BY TotalPicks DESC)
WITH DATA;


Question 3. How many orders arrived per day?

--
-- Number of orders per day, in calendar sequence
--
CREATE TABLE OrdersPerDate AS
	(SELECT Date, COUNT(Sales.OrderNumber) AS Orders
	FROM Sales
	GROUP BY Date
	ORDER BY Date)
WITH DATA;


Question 4. How many picks came out of each zone?

--
-- Number of picks per zone listed from most busy to least
--
SELECT Locations.Zone, COUNT(Sales.SkuID) AS Picks
FROM Locations INNER JOIN Sales
ON Locations.SkuID = Sales.SkuID
GROUP BY Locations.Zone
ORDER BY Picks DESC;


Question 5. What were the total quantities, measured in selling units, requested of each sku?

--
-- Sum quantities ordered of each sku
--
CREATE TABLE Quantities AS
	(SELECT SkuID, SUM(OrderQuantity) AS TotalQuantity
	FROM Sales GROUP BY SkuID)
WITH DATA;


Question 6. Compute the “flow” — that is, the physical volume sold, measured in cubic feet — for each product.

--
-- Compute volume ordered of each sku and save into a new table named "Flow"
--
CREATE TABLE Flow AS
	(SELECT Skus.SkuID,
		(Skus.UnitWidth * Skus.UnitHeight * Skus.UnitLength * OrderQuantity / 1728.0) AS CubicFeet
	FROM Skus INNER JOIN Sales ON Skus.SkuID = Sales.SkuID)
WITH DATA;

The following question(s) are harder and may require direct processing of the data files via some general programming language (Python, Ruby, Java, etc.).

Question 7. Construct the pick-lines-per-order distribution: How many orders consist of a single line and what percentage of all orders do they represent? How many of 2 lines? etc.


Thanks to Aloysius CHEN Yan Kai for help in preparing the queries on this page and to Andrew Kovacs for catching some typos, and Pisit Jarumaneeroj for fixing my SQL errors.