Richard Bucker

Walmart SQL Test

Posted at — Aug 9, 2022

I started to deep dive into the nuance of the queries and realized that was the wrong path. As an interview question this is meant to ilicit a discussion not just the answer. It’s as important to know:

the interview question

There is a phonelog table that has information about callers' call history.
wtite a SQL to find out callers whose first and last call was to the same person on a give day.

Sample data

create table phonelog(
    Callerid int,
    Recipientid int,
    Datecalled datetime
);

insert into phonelog(Callerid, Recipientid, Datecalled)
values(1, 2, '2019-01-01 09:00:00.000'),
       (1, 3, '2019-01-01 17:00:00.000'),
       (1, 4, '2019-01-01 23:00:00.000'),
       (2, 5, '2019-07-05 09:00:00.000'),
       (2, 3, '2019-07-05 17:00:00.000'),
       (2, 3, '2019-07-05 17:20:00.000'),
       (2, 5, '2019-07-05 23:00:00.000'),
       (2, 3, '2019-08-01 09:00:00.000'),
       (2, 3, '2019-08-01 17:00:00.000'),
       (2, 5, '2019-08-01 19:30:00.000'),
       (2, 4, '2019-08-02 09:00:00.000'),
       (2, 5, '2019-08-02 10:00:00.000'),
       (2, 5, '2019-08-02 10:45:00.000'),
       (2, 4, '2019-08-02 11:00:00.000');

my answer

select callerid, recipientid, datecalled from (
select *, row_number() over win as r, first_value(recipientid) over win as a, last_value(recipientid) over win as b, count(*) over win as c
from phonelog 
window win as (partition by callerid, date(datecalled) order by datecalled RANGE BETWEEN UNBOUNDED PRECEDING AND unbounded following )
) x where c>1 and r=1 and a=b;

my query plan

QUERY PLAN
|--CO-ROUTINE x
|  |--CO-ROUTINE (subquery-3)
|  |  |--CO-ROUTINE (subquery-4)
|  |  |  |--SCAN phonelog
|  |  |  `--USE TEMP B-TREE FOR ORDER BY
|  |  `--SCAN (subquery-4)
|  `--SCAN (subquery-3)
`--SCAN x

the presenters answer

with calls as (
select callerid, date(datecalled) called_date, min(datecalled) first_call, max(datecalled) as last_call
from phonelog
group by callerid, date(datecalled))

select c.*, p1.recipientid from calls c
inner join phonelog p1 on p1.callerid=c.callerid and p1.datecalled=c.first_call
inner join phonelog p2 on p2.callerid=c.callerid and p2.datecalled=c.last_call
where p1.recipientid=p2.recipientid;

his query plan

QUERY PLAN
|--MATERIALIZE calls
|  |--SCAN phonelog
|  `--USE TEMP B-TREE FOR GROUP BY
|--SCAN p1
|--SEARCH p2 USING AUTOMATIC COVERING INDEX (Recipientid=?)
`--SEARCH c USING AUTOMATIC COVERING INDEX (callerid=?)

A covered index is pretty expensive.

Only one call

insert into phonelog(Callerid, Recipientid, Datecalled)
values(3, 2, '2019-01-01 12:00:00.000');

My comment to the channel

what happens when there is only one call in the day for the one caller? (add the following to the data) excluding or including that record would be in error as an undefined state by the requester even though the grammar of the question implies there is at least 2 calls.

(see only one call)

I prefer a partition window.

(see my answer)

The answer provided is better than mine for two reasons… [a] depending on the sql engine the internal performance tuning is data dependent [b] the joins are likely to work on most sql engines where partitions are questionable (thanks oracle)

Looking at this data it’s clear upfront that it’s going to result in at exactly one scan. (think: select * from phonelog order by callerid, datecalled, and then write some perl or even SQL code to iterate over the results a row at a time) but that’s what the partition does. Also testing the number of rows in the partition is pretty simple. The only challenge is understanding the “range”)

Conclusion

It’s actually great question. The real challenge is that interview questions are subjective and at some point you have to rely on your intuition. Having working answer is not good enough; you have to be able to ask questions about the requirements and the data. (see only one call). I had a COBOL teacher in 1984 and your grade was based on generating the EXACT output. The number of spaces, headers, tabs everything. SQL lets you be a little sloppy but there are also places where joins go sideways depending on the data.