A few years ago I began looking for STATA commands to produce stacked area graphs that would allow us to look at the evolution of the distribution of a categorical variable (levels of a factor) over time. Applications included looking at the distributions of chushen (出身) of Qing officials over time, or the positions held by jinshi (进士) degree holders as a function of years since earning their degree. I know that this is straightforward in R, but as far as I can tell there is nothing in STATA that does this easily.
I created two commands, taking as my starting point the examples provided by Andrew Musau in his 2018 post in a thread on stacked area graphs on StataList.
The first is stackedcount, which plots the number of records in each category of a variable as a function of a cardinal x variable that takes on discrete values, for example, calendar year or age. I have deposited it at SSC and it can be installed with
ssc install stackedcount
The second is stackedpercent, which plots the percent of records in each category as a function of the x variable. I will deposit it at SSC after I am sure there are no problems with stackedcount.
As far as I could tell, STATA doesn’t have a native command to do this sort of basic descriptive graph. Interestingly, a few months ago when I was dabbling with R it turned out to be very straightforward there, so I am not sure why STATA doesn’t provide this.
If I missed something and there is already a package for this or it has been added to STATA, by all means point it out to me.
stackedcount
stackedcount shows the count of records in each of the categories of a y variable as a function of a discrete numeric x variable.
stackedcount varlist [if] [in] [,options]
where varlist is
y x
y is a categorical numeric variable
x is a numeric variable that takes on discrete values (for example, calendar year)
x can be non-integer, but again, the values should be discrete. The program will not bin values of x.
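Since the program does not bin x, any grouping has to be done before calling it. Here is a minimal sketch of one way to do that, assuming a hypothetical variable age with fractional values and a hypothetical categorical variable y_cat; neither name comes from the data used in this post.

```stata
* Hypothetical variable: age, recorded with fractional values
* Collapse age into discrete five-year groups labeled by their lower bound
generate age5 = 5 * floor(age / 5)

* Inspect the distinct values that will appear on the x axis
tabulate age5

* Pass the grouped variable as x (y_cat is a hypothetical categorical variable)
stackedcount y_cat age5
```

The width of five years is arbitrary; the point is only that x should arrive at the command already reduced to the discrete values you want plotted.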
If the categorical variable y you want to plot is string, run encode beforehand to create a labeled numeric categorical variable and pass that to the command.
The areas will be stacked according to the numeric value for each category, with the lowest on the bottom. I generally do the categorizations manually rather than relying on encode so that I can control the order in which the areas are stacked.
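One way to control the stacking order while still using encode is to define the value label before encoding, since encode will use an existing label mapping rather than assigning codes alphabetically. The sketch below assumes a hypothetical string variable chushen_str with the three values shown; the variable name, label name, and category values are illustrative only.

```stata
* Hypothetical: chushen_str is a string variable with values
* "other", "juren", and "jinshi"

* Define the value label first so the codes, and hence the stacking
* order (lowest code on the bottom), are fixed by hand
label define chushen_lbl 1 "other" 2 "juren" 3 "jinshi"

* encode uses the existing mapping in chushen_lbl instead of
* assigning codes in alphabetical order
encode chushen_str, generate(chushen_cat) label(chushen_lbl)

* Check the numeric codes behind the labels
tabulate chushen_cat, nolabel

stackedcount chushen_cat year
```

Any string value not already in the label is appended by encode with the next available code, so it is worth checking the tabulate output before plotting.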
The areas are presented as stacked bars: the height of each cumulative value of y is set at each value of x and remains fixed until the next available value of x. This is achieved by additional code that, if removed, would allow the cumulative values of y for successive values of x to be connected by diagonal lines. I may add an option to turn off the bars and connect the cumulative values of y instead, but haven’t decided yet.
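To make "cumulative value of y" concrete, here is a minimal sketch of how such heights can be computed by hand. It is an illustration of the general idea, not the code inside stackedcount, and it uses the variables chushen and year from the examples in this post; the names n and cum_n are made up for the sketch.

```stata
preserve

* One row per (year, chushen) cell, with the number of records in n
contract year chushen, freq(n) nomiss

* Running total across categories within each year, lowest code first;
* cum_n is the top edge of each category's area at that year
bysort year (chushen): generate cum_n = sum(n)

* Look at the first few cells
list year chushen n cum_n if _n <= 10

restore
```

For the highest category code in each year, cum_n equals the total number of records in that year, which is the top of the stack.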
Most of the important options for twoway are available via pass-through:
xlabel, xtick, xmtick, xtitle, xrange, ylabel, ytick, ymtick, ytitle, yscale, caption, scheme, note, legend
The only option that may need explanation is xrange, which allows for missing values of x to be filled in, so that no bars are plotted for them. The reason for this is that I commonly use stackedcount to plot the distributions of characteristics of officials recorded in quarterly editions of a historical source (缙绅录). Some of these editions are missing, and in that case, I don’t want the values from the most recent preceding edition to be carried forward to the next existing edition. Rather, I would prefer that no bar be plotted for that edition.
xrange will fill in values of x according to the numlist specified with it. numlist works as usual in STATA. In the sample figure below, editions of the 缙绅录 are quarterly, and year is coded as (for example) 1870, 1870.25, 1870.5, 1870.75 and so forth, where 1870 corresponds to spring, 1870.25 to summer, 1870.5 to autumn, and 1870.75 to winter. For the sample below, xrange is specified as 1830(0.25)1912 so every missing season is plotted as empty rather than carried forward from the most recent season.
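To see which seasons xrange would leave empty, the numlist can be expanded and checked against the data by hand. This is a rough sketch for inspection only, not how the option is implemented; it assumes the variable year coded in quarters as described above, and the local macro name expected is made up for the sketch.

```stata
* Expand the numlist into the full set of expected x values
numlist "1830(0.25)1912"
local expected `r(numlist)'

* Report each expected quarter that has no records in the data
foreach v of local expected {
    quietly count if year == `v'
    if r(N) == 0 {
        display "no records for " `v'
    }
}
```

The equality test is safe here because quarter fractions (0.25, 0.5, 0.75) are stored exactly; with other step sizes, such as 0.1, a float variable may not compare equal to the expanded values.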
Other twoway options are easy to add by following the model in the code.
To install:
ssc install stackedcount
As examples, here are a couple of figures from our forthcoming paper in the Journal of Chinese History introducing the China Government Employee Database-Qing (CGED-Q).
stackedcount chushen year if xuhao < 20000 & !qiren & ming != "" & year >= 1830 & !central & !fangkeben_only & !irregular & !ignore & !new_in_1911, xtitle("Year") xtick(1830(5)1910) xmtick(1830(1)1910) xlabel(1830(10)1910) ytitle("Records of officials") ylabel(0(2000)8000,labsize(small)) ymtick(0(1000)8000) legend(cols(4) size(vsmall)) xrange(1830(0.25)1912)
stackedcount chushen year if xuhao < 20000 & !qiren & ming != "" & year >= `start_year' & central & !fangkeben_only & !irregular & !ignore & !new_in_1911, xtick(1830(5)1910) xmtick(1830(1)1910) xlabel(1830(10)1910) ytitle("Records of officials") ylabel(0(500)2000, labsize(small)) ymtick(0(100)2000) legend(cols(4) size(vsmall))
stackedpercent
stackedpercent shows the percentage of records in each of the categories of a y variable as a function of a continuous x variable.
stackedpercent varlist [if] [in] [,options]
where varlist is
y x
y is a categorical numeric variable
x is a continuous numeric variable (for example, year)
If the categorical variable you want to plot is string, run encode beforehand to create a labeled numeric categorical variable and pass that to the command.
To install, you will need to download the ado file and place it in your personal ado directory. You can find out where that is by typing sysdir. If you have no idea what I am talking about, it is probably best to wait until I deposit it at SSC.
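For anyone who wants to try the manual route, these commands show where the personal ado directory is and check that the file was picked up. The file name stackedpercent.ado is assumed here, since an ado file has to be named after the command it defines.

```stata
* Show the system directories, including PERSONAL
sysdir

* Print only the PERSONAL directory path
display "`c(sysdir_personal)'"

* After copying stackedpercent.ado into that directory,
* confirm that the command can be found
which stackedpercent
```

If which reports that the command is not found, the file is most likely in a directory that is not on the ado path.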
The areas will be stacked according to the numeric value for each category, with the lowest on the bottom. I generally do the categorizations manually rather than relying on encode so that I can control the order in which the areas are stacked.
Most of the important options for twoway are available via pass-through:
xlabel, xtick, xmtick, xtitle, xrange, ylabel, ytick, ymtick, ytitle, yscale, caption, scheme, note, legend
Others are easy to add by following the model in the code.
Here is an example in which I use the command to plot the percent distribution of positions held by non-Banner jinshi (进士) degree holders in the Qing bureaucracy according to the number of years since they earned their degree, for the period covered by the China Government Employee Database-Qing (CGED-Q) that I have been working on with Bijia Chen, James Lee, Yuxue Ren and others. I produced two separate plots, one for first and second tier degree holders (一甲,二甲) and another for third tier degree holders (三甲). The results are generally in line with expectations based on the appointment regulations.
stackedpercent guanzhi_js gap if gap >= 0.5 & gap <= 20 & (甲第 == 1 | 甲第 == 2) & !qiren, legend(size(small) cols(4)) xtitle("Years since exam") ytitle("Percent") caption("Positions held by jinshi since years since exam 甲第 1 2 - non-Banner") note("$note_time_stamp")
stackedpercent guanzhi_js gap if gap >= 0.5 & gap <= 20 & (甲第 == 3) & !qiren, legend(size(small) cols(4)) xtitle("Years since exam") ytitle("Percent") caption("Positions held by jinshi since years since exam 甲第 3 - non-Banner") note("$note_time_stamp")