
I am always on the lookout for command-line tools that can get me from a raw data download to some basic numbers as quickly as possible. With DuckDB and its aggregate functions, I may have found my new default flow.

Below I query a salary spreadsheet for summary statistics on data science positions filled in the past quarter. I trust this data from the federal government more than job postings with wide ranges or self-reported surveys on Levels.FYI and elsewhere. The main flaw with relying on it is that the filings only report base pay, which can be the smaller part of a total compensation package for some companies and positions.

Before DuckDB, I would have used pandas's read_excel(), which spends a good bit of time parsing the text-heavy columns before the entire dataset lands in memory. DuckDB not only loads the data into memory faster, it also lets me skip the table entirely and keep querying the spreadsheet itself without as much lag.
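For reference, the old flow looked roughly like the sketch below. The file path and column names are assumptions copied from the DuckDB query that follows; the spreadsheet's actual headers may differ, and pandas, unlike DuckDB, matches them case-sensitively.

# Rough sketch of the pre-DuckDB pandas flow. Path and column names are
# assumptions borrowed from the DuckDB query below.
import pandas as pd

# read_excel() parses every text-heavy column up front, which is the slow part
df = pd.read_excel("Downloads/PERM_Disclosure_Data_FY2022_Q4.xlsx")

# keep rows whose job title mentions "data scientist", case-insensitively
ds = df[df["job_title"].str.contains("data scientist", case=False, na=False)]

# prefer the top of the offered wage range, fall back to the bottom when missing
base_pay = ds["wage_offer_to"].fillna(ds["wage_offer_from"])
print(base_pay.describe(percentiles=[0.25, 0.5, 0.75]))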

-- Data is 2024Q4 filings
.mode markdown
install spatial;
load spatial;
create or replace table df as
    select *
    from st_read('Downloads/PERM_Disclosure_Data_FY2022_Q4.xlsx');
summarize
select
    coalesce(wage_offer_to, wage_offer_from) as base_pay
from df
where job_title ilike '%data scientist%';

In case you’re interested, this produces the following output:

| column_name | column_type | min  | max      | approx_unique | avg                | std               | q25                | q50                | q75                | count | null_percentage |
|-------------|-------------|------|----------|---------------|--------------------|-------------------|--------------------|--------------------|--------------------|-------|-----------------|
| base_pay    | DOUBLE      | 73.0 | 650000.0 | 452           | 152044.49745940792 | 46640.04639860068 | 125499.03596938777 | 155487.80074193547 | 172934.37240649346 | 1047  | 0.0%            |
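And if I wanted the filtered numbers back as a DataFrame rather than a printed summary, DuckDB's Python API can run the same query against the spreadsheet directly. This is a minimal sketch assuming the duckdb package is installed and the file sits at the same path as above:

# Minimal sketch of the same query through DuckDB's Python API.
import duckdb

con = duckdb.connect()
con.install_extension("spatial")  # st_read() comes from the spatial extension
con.load_extension("spatial")

base_pay = con.sql("""
    select coalesce(wage_offer_to, wage_offer_from) as base_pay
    from st_read('Downloads/PERM_Disclosure_Data_FY2022_Q4.xlsx')
    where job_title ilike '%data scientist%'
""").df()  # .df() hands the result back as a pandas DataFrame

print(base_pay["base_pay"].describe())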