STATA: Quick Command Reference

Page history last edited by editor 11 years ago


Quick Command Reference Table
  1. Basic Estimation Commands
    1. Ordinary Least Squares
      1. reg lwage edu exp expsq tenu union
      2. predict lwagehat
      3. predict uhat, resid
      4. test north south east
      5. reg lwage educ exper expersq married black, robust
    2. Instrumental Variable (Two Stage Least Squares)
      1. ivreg lwage (edu =married) exp expsq tenu union
    3. Fixed Effect Model
      1. xtreg lwage edu exp expsq tenu union, fe
    4. Random Effect Model
      1. xtreg lwage edu exp expsq tenu union, re
    5. Between Effect Model
      1. xtreg lwage edu exp expsq tenu union, be
    6. Logistic Model
      1. logit married edu exp lwage
      2. probit married edu exp lwage
    7. Cox-Hazard Model
      1. cox tenu edu lwage union married
  2. Estimation and test techniques
    1. Time Series Analysis
      1. Generating lags and leads : gen xlag1= x[_n-1]
      1. Time series operators: reg consume gnp L.gnp L2.gnp D.gnp D2.gnp S2.gnp
      2. Autocorrelation: corrgram
      3. Box-Pierce Q: wntestq
      4. Augmented Dickey-Fuller test: dfuller
      5. Autoregressive Integrated Moving Average (ARIMA): arima
    2. Duration Model
      1. Cox Model: . cox studytim drug age, dead(died) hr
      2. Survival-time data: . stset studytim, failure(died)
      1. Weibull Model: weibull
    3. Simultaneous Equation
      1. Seemingly Unrelated Regression Models: . sureg (gdp g i) (m2 r)
      2. . reg3 (gdp m2 g i) (m2 gdp r)
    4. Constrained Regression
      1. cnsreg
      2. constraint (constraint list)
    5. Test and Diagnostics
      1. Linear tests: . test m2 g
      2. Non-linear tests: . testnl _b[m2]/_b[g] = 1
      3. Likelihood-ratio test: . lrtest
  3. Other diagnostics
      1. rvfplot       Graph residual-versus-fitted plot
      2. rvpplot      Graph residual-versus-predictor plot
      3. ovtest        Perform Ramsey RESET test for omitted variable test
      4. dwstat       Compute Durbin-Watson d statistic if the data is declared as time series
      5. vif             Calculate VIFs (variance inflation factors)
  4. Using Stata as a Calculator and Computing p-values
    1. Calculator
      1. di .048/(2*.0016)
    2. P-values
      1. di normprob(1.58)
      2. di tprob(df, t)
      3. di fprob(df1, df2, F)
  5. Getting Started
    1. Editing the Command Line
    1. Help and Search
      1. help regress
      2. search fixed effect
    2. Creating log (procedure and output) file
      1. log using a:\wage1, replace
      2. log close
      3. Increasing the amount of memory
      4. set memory 5M
    3. Batch or interactive
      1. Using commands from keyboard
      2. Using a do file
    4. Reading Data Files
      1. use a:\wage1
      2. infile
      3. save
      4. clear
    5. Looking At and Summarizing Your Data
      1. list wage edu
      2. list wage edu in 1/20
      3. list married age if hours == 0
      4. list married age if union == 1 & hours >= 40
      5. sum wage edu tenu married
      6. sum wage edu tenure married, detail
      7. sum wage edu tenu married if year == 1990
      8. sum wage edu in 1/20
      9. tab married
      10. by married: sum wage
      11. drop if ~union (or drop if union == 0)
      12. keep if (year >= 1986) & (year <= 1990)
      13. drop in 2672
    6. Defining New Variables
      1. generate expsq = exp^2
      2. gen lwage = ln(wage) if hours > 0
      3. gen ccrime = crime - crime[_n-1] if year == 1987
      4. replace expsq = exp^2
    7. General Plotting Commands
    8. General commands
    9. Generating new variables
    10. Regression
      1. Important Notes on the "stem" command
    11. Summary of These and Other Commands





Basic Estimation Commands


Ordinary Least Squares


For OLS regression, we use the command reg. Immediately following reg is the dependent variable, and after that, all of the independent variables (order of the independent variables is not, of course, important). An example is:


reg lwage edu exp expsq tenu union


This command produces OLS estimates, standard errors, t statistics, confidence intervals, and a variety of other statistics usually reported with OLS output. Unless a specific range of observations or logical statement is included, Stata uses all possible observations in obtaining estimates. It does not use observations for which data on the dependent variable or any of the independent variables are missing. Thus, you must be aware that adding another explanatory variable can result in fewer observations being used in the regression if some observations are missing for that variable. If a variable called "motheduc" (mother's education) is added to the independent variables in the above regression, and this variable is missing for, say, 10 percent of individuals, then the sample size used in obtaining OLS estimates decreases accordingly.
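The effect of missing values on the estimation sample is easy to check. The sketch below is only an illustration: it assumes auto.dta, the example data set shipped with Stata, rather than the wage data used in this guide. In auto.dta the variable rep78 is missing for a few cars, and e(N) holds the number of observations used by the most recent regression.

```stata
* Illustration with Stata's shipped auto data (not the wage data above)
sysuse auto, clear

* All 74 cars have price, mpg and weight
regress price mpg weight
display "observations used: " e(N)

* rep78 is missing for some cars, so fewer observations are used
count if rep78 >= .
regress price mpg weight rep78
display "observations used: " e(N)
```

Comparing the two e(N) values shows how many observations were lost by adding the extra regressor.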

Sometimes we want to restrict our regression analysis based on the size of one or more of the variables. For example,


reg lwage edu exp expsq tenu union if edu<16


restricts the analysis to people with fewer than 16 years of education. The regression can also be restricted to a particular year using a similar if statement, or to a particular observation range using the qualifier in m/n.

Predicted values are obtained using the predict command. Thus, if a regression is run with lwage as the dependent variable, to get the fitted values type:


predict lwagehat


The choice of the name lwagehat is arbitrary, provided it is no more than eight characters long (the limit in the Stata versions this guide was written for; later versions allow longer names) and is not already in use. The predict command saves the fitted values for the most recently run regression.

The residuals can be obtained by:


predict uhat, resid


where again the name uhat is arbitrary.

You can test multiple linear restrictions after an OLS regression by using the test command. Consider a regression which controls for four Census regions: north, south, east, and west. Because the regression includes a constant term, we can identify parameters for three of the four regional dummy variables. Suppose we exclude the "west" dummy from the regression and we wish to test whether there are any "regional effects" in the data. To test whether the coefficients for the north, south, and east dummy variables are jointly zero, you just list the variables hypothesized to have no effect:


test north south east


The result of this test tells you whether the three regional indicators can be excluded from the previously estimated model. Along with the value of the F-statistic, Stata also reports a p-value. As with the predict command, test is applied to the most recently estimated model.
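A joint test of this kind can be tried without the regional data. The following sketch assumes the shipped auto.dta data set and uses the five repair-record categories (rep78) in place of regions; the dummy names rep1 to rep5 are created by tabulate and are not part of the original example. After test, the F statistic and p-value are available in r(F) and r(p).

```stata
sysuse auto, clear

* Create one dummy per repair-record category: rep1 ... rep5
tabulate rep78, generate(rep)

* rep1 is the excluded (base) category, as "west" is in the text
regress price mpg rep2 rep3 rep4 rep5

* H0: all four category coefficients are zero
test rep2 rep3 rep4 rep5
display "F = " r(F) ",  p-value = " r(p)
```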

OLS estimates with heteroskedasticity-robust standard errors and t statistics can be obtained using the robust option. Remember, this is just OLS, but the asymptotic variance is estimated in a heteroskedasticity-robust fashion. For example,


reg lwage educ exper expersq married black, robust


Instrumental Variable (Two Stage Least Squares)


The ivreg command estimates models by 2SLS. After the dependent variable, list the exogenous explanatory variables; the endogenous explanatory variable (the one that is correlated with the error) goes in parentheses, followed by an equals sign and the additional instruments. Naturally, the list of instruments does not contain any endogenous variables. Stata automatically includes the exogenous explanatory variables in the instrument list.


An example of a 2SLS command is:


ivreg lwage (edu = married) exp expsq tenu union

ivreg lwage (edu = married exp expsq tenu union) exp expsq tenu union


This command produces 2SLS estimates, standard errors, t statistics, and so on. By looking at this command, we see that edu is an endogenous explanatory variable in the lwage equation while exp, expsq, tenu, and union are assumed to be exogenous explanatory variables. The variable married is assumed to be an additional exogenous variable that does not appear in the lwage structural equation but should have some correlation with edu. It serves as an instrument along with the exogenous explanatory variables.


The order in which we list the instruments is not important. The necessary condition for the model to be identified (the order condition) is that the total number of instruments, counting the exogenous explanatory variables, is at least as large as the total number of explanatory variables. In this example, the count is five (married, exp, expsq, tenu, union) to five (edu, exp, expsq, tenu, union), and so the order condition holds.


In the previous example, we allowed for just one endogenous explanatory variable, "edu". Allowing for more than one endogenous explanatory variable is also easy. After 2SLS, we can test multiple restrictions using the test command, just as with OLS.
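To see what "two stage" means, the two stages can be run by hand and compared with the one-step command. This is a sketch under stated assumptions: it uses the shipped auto.dta data set, treats mpg as endogenous and displacement as its instrument purely for illustration (no economic claim is intended), and uses ivregress 2sls, the command documented in newer Stata releases in place of ivreg. The variable name mpghat and the scalar b_manual are hypothetical.

```stata
sysuse auto, clear

* First stage: regress the endogenous variable on all exogenous variables
regress mpg displacement weight
predict double mpghat

* Second stage: replace mpg by its fitted value
* (coefficients match 2SLS; the reported standard errors do not)
regress price mpghat weight
scalar b_manual = _b[mpghat]

* One-step 2SLS with correct standard errors
ivregress 2sls price weight (mpg = displacement)
display "manual: " b_manual "   ivregress: " _b[mpg]
```

The manual route is useful only for understanding; inference should rely on the one-step command, because the second-stage OLS standard errors ignore the first-stage estimation.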


Fixed Effect Model


iis id

tis year

xtreg lwage edu exp expsq tenu union, fe





Random Effect Model

iis id

tis year

xtreg lwage edu exp expsq tenu union, re


Between Effect Model


iis id

tis year

xtreg lwage edu exp expsq tenu union, be
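The three panel estimators can be tried on a small simulated panel. The sketch below is an assumption-laden illustration, not the wage data: the variables id, year, x, and y are made up, and it declares the panel with xtset, which newer Stata versions use in place of the iis and tis commands shown above.

```stata
clear
set obs 30
set seed 12345

* 6 panel units observed for 5 years each
generate id = ceil(_n/5)
bysort id: generate year = 1985 + _n

* Outcome with a unit-specific effect (id) and true slope 2
generate x = rnormal()
generate y = 2*x + id + rnormal()

* Declare the panel structure, then fit the three models
xtset id year
xtreg y x, fe
xtreg y x, re
xtreg y x, be
```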


Logistic Model


logit married edu exp lwage

probit married edu exp lwage


Cox-Hazard Model


cox tenu edu lwage union married






Estimation and test techniques


  1. Time Series

  2. Duration Model

  3. Simultaneous Equation

  4. Constrained Regression

  5. Test and Diagnostics



Time Series Analysis


Generating lags and leads : gen xlag1= x[_n-1]


If you sort the data by date, then the lagged variable x can be obtained by typing


. gen xlag1= x[_n-1]


Of course you can use as many lags as you want.


. gen xlag2=x[_n-2]


Likewise, you can lead the data by using _n+1, _n+2.


If you are a serious user of time-series data, you will be better served by the time series operators. The time series operators are L (lag), F (lead), D (difference), and S (seasonal difference). You must first set the time variable using the tsset command.


Time series operators: reg consume gnp L.gnp L2.gnp D.gnp D2.gnp S2.gnp


. tsset  year, yearly        /*declare dataset to be time-series data*/

. reg consume gnp L.gnp L2.gnp D.gnp D2.gnp S2.gnp

. sum interest if F.gnp<gnp
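The two ways of lagging can be compared directly. The sketch below builds a tiny made-up yearly series (the variable names are hypothetical) and creates the same lag once by subscripting and once with the L operator.

```stata
clear
set obs 6
generate year = 1990 + _n
generate x = _n^2

* Lag by explicit subscripting (requires data sorted by year)
sort year
generate xlag1 = x[_n-1]

* Same lag with the time-series operator
tsset year
generate xlagop = L.x

* First difference: x minus its lag
generate dx = D.x

list year x xlag1 xlagop dx
```

The two lags agree here because the years are consecutive. If a year were missing from the data, L.x would be missing for the following year, while x[_n-1] would silently pick up the previous row.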




Autocorrelation: corrgram


corrgram lists a table of the autocorrelations, partial autocorrelations, and Q statistics.  It will also list a character-based plot of the autocorrelations and partial autocorrelations. The ac command produces a correlogram (the autocorrelations) with pointwise confidence intervals obtained from the Q statistic.


Box-Pierce Q: wntestq

The wntestq command produces the Box-Pierce Q test statistic. The null hypothesis is that the autocorrelation coefficients are simultaneously equal to zero.


. corrgram r

. corrgram r, lags(5)

. wntestq r


Augmented Dickey-Fuller test: dfuller

dfuller performs the augmented Dickey-Fuller test for unit roots. The test regresses the differenced variable on its lagged level and a user-specified number of lagged differences of the variable. Optionally, a trend term may be included (trend) and the associated regression displayed (regress). The null hypothesis is that there is a unit root.


 . dfuller r

 . dfuller r, lags(3) trend regress


Autoregressive Integrated Moving Average (ARIMA): arima

arima estimates a model of depvar on varlist where the disturbances are allowed to follow a linear autoregressive moving-average (ARMA) specification. The dependent and independent variables may be differenced or seasonally differenced to any degree.  When independent variables are not specified, these models reduce to autoregressive integrated moving-average (ARIMA) models in the dependent variable.  Missing data are allowed and are handled using the Kalman filter. arima allows time-series operators in the dependent variable and independent variable lists and it is often convenient to make extensive use of these operators.


    . arima r, arima(1,1,1)

    . arima D.r, ar(1) ma(1)  /*same as above*/

    . arima r, arima(3,2,4)

    . arima D2.r, ar(1/3) ma(1/4)   /*same as above*/


There are other test commands, such as the Granger causality test and cointegration tests. The ado program files are not installed by default, but you can find and download them from the STB (Stata Technical Bulletin) web site. If your computer is connected to a network, just click the search result.


Duration Model

If the data set is already duration data, then you can use a duration model, for example the Cox or the Weibull model. You can also use the hr option to report hazard ratios instead of coefficients.


Cox Model: . cox studytim drug age, dead(died) hr

     . cox studytim drug age, dead(died)

     . cox studytim drug age, dead(died) hr


If you are a serious user of the duration data set you might want to use stset command: stset declares data to be survival-time data.


Survival-time data: . stset studytim, failure(died)

     . stset studytim, failure(died)

     . stset studytim, failure(outcome==2)

    . stset studytim, failure(outcome==2) id(patientno)  /*multiple failure*/


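Once the data are declared as duration data, the time and failure variables no longer need to be repeated. The Cox model is then fit with stcox, which reports hazard ratios by default; the nohr option requests coefficients instead:

     . stcox drug age

     . stcox drug age, nohr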


Weibull Model: weibull


Compare these results with those above. You can also use weibull instead of cox.



Simultaneous Equation


Seemingly Unrelated Regression Models: . sureg (gdp g i) (m2 r)


sureg estimates Zellner's seemingly unrelated regression models. This is in fact FGLS, and it is especially useful when the explanatory variables in all equations are exogenous. Suppose GDP = f(G, I) and M2 = g(r). Type:


. sureg (gdp g i) (m2 r)

Compare the result with the two separate OLS results.


Now suppose some equations contain endogenous variables among the explanatory variables. I guess everyone is already familiar with the 2SLS method (ivreg). Another way to estimate such a model is to estimate the system of structural equations with the reg3 command. Suppose GDP = f(G, I, M2) and M2 = g(GDP, r). Notice that the two dependent variables now appear as explanatory variables in the equations. reg3 can also estimate systems of equations by seemingly unrelated regression (SURE).



. reg3 (gdp m2 g i) (m2 gdp r)

. reg3 (gdp m2 g i) (m2 gdp r), sure


Constrained Regression



constraint (constraint list)

cnsreg estimates constrained linear regression models. constraint(constraint list) is not optional; it specifies the constraint numbers of the constraints to be applied.  Constraints are defined using the constraint command.



        . constraint def 1 price = weight

        . cnsreg mpg price weight, constraint(1)

        . constraint def 2 gratio = -foreign

        . cnsreg mpg price weight displ gratio foreign length, c(1-2)

        . constraint define 3 _cons = 0

        . cnsreg mpg price weight displ gratio foreign length, c(1-3)


Test and Diagnostics


Linear tests: . test m2 g

Non-linear tests: . testnl _b[m2]/_b[g] = 1


test tests linear hypotheses about the estimated parameters from the most recently estimated model. testnl tests nonlinear (or linear) hypotheses about the estimated parameters from the most recently estimated model.


        . reg gdp m2 g i

        . test m2 g

        . test m2-5*g +3 = 0

        . testnl _b[m2]/_b[g] = 1

        . testnl (_b[m2]/_b[g] = 1)  (_b[m2] = 2)



test performs Wald tests. See help lrtest for likelihood-ratio tests.

Likelihood-ratio test: . lrtest


Other diagnostics


rvfplot       Graph residual-versus-fitted plot

rvpplot      Graph residual-versus-predictor plot

ovtest        Perform Ramsey RESET test for omitted variable test

dwstat       Compute Durbin-Watson d statistic if the data is declared as time series

vif             Calculate VIFs (variance inflation factors)



Using Stata as a Calculator and Computing p-values



Stata can be used to compute a variety of expressions, including certain functions that are not available on a standard calculator. The command to compute an expression is disp or di for short. The command:


di .048/(2*.0016)


will return "15." We can use the di command to compute natural logs, exponentials, squares, and so on. For example:


di exp(3.5 + 4*.06)



returns the value 42.098 (approximately). These calculations can be performed on most calculators. More importantly, we can use di to compute p-values after computing a test statistic. The command:

di normprob(1.58)


gives the probability that a standard normal random variable is less than the value 1.58 (about .943). Thus, if a standard normal test statistic takes on the value 1.58, the one-sided p-value is 1 - .943 = .057. Other functions are geared to give the p-value directly.


di tprob(df, t)


returns the p-value for a t test against a two-sided alternative (t is the absolute value of the “t” statistic and “df” is the degrees of freedom). For example, with df= 31 and t = 1.32, the command returns the value .196. To obtain the p-value for an F test, the command is:


di fprob(df1, df2, F)


where “df1” is the numerator degrees of freedom, “df2” is the denominator df, and F is the value of the F statistic. As an example:


di fprob(3, 142, 2.18)


returns the p-value .093.
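Newer Stata versions document the functions normal(), ttail(), and Ftail() for these calculations. The sketch below, which assumes such a version, should reproduce the values quoted above. Note that ttail() returns a one-sided upper-tail probability, so it is doubled to match the two-sided p-value.

```stata
* Probability that a standard normal is below 1.58, then the upper-tail p-value
display normal(1.58)
display 1 - normal(1.58)

* Two-sided p-value for t = 1.32 with 31 degrees of freedom
display 2*ttail(31, 1.32)

* p-value for F = 2.18 with (3, 142) degrees of freedom
display Ftail(3, 142, 2.18)
```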







Getting Started

Editing the Command Line


Stata has several shortcuts for entering commands. Two useful keys are Page Up and Page Down. If at any point you hit Page Up, the previously executed command appears on the command line. This can save a lot of typing because you can hit Page Up and edit the previous command. Among other things, this makes adding an independent variable to a regression, or expanding an instrument list, easier. Hitting Page Up repeatedly allows you to traverse through previously executed commands until you find the one you want. Hitting Page Down takes you back down through all of the commands.

It is easy to edit the command line. Hitting Home on the keyboard takes the cursor to the beginning of the line; hitting End moves the cursor to the end of the line. The Delete key deletes a single character to the right of the cursor; holding it down will delete many characters. The Backspace key (a left arrow on many keyboards) deletes a character to the left of the cursor. Hitting the left arrow moves you one character to the left, and the right arrow takes you one character to the right. You can hold down either to move several characters. The Ins key allows you to toggle between insert and overwrite modes. Both of these modes are useful for editing commands.


Help and Search


help regress

search fixed effect


Creating log (procedure and output) file


Suppose you want to print out the “tutorial” or results from I. Stata for Dummies. Yes, you should create an output file. In particular, for involved projects, you must create a record of what you have done (data transformations, regressions, and so on). To do this you can create a log file. With a diskette in the A: drive, before doing any analysis, type:


log using a:wage1, replace


This will create the file wage1.log on the diskette in the A: drive. Or just

Click the log start/stop icon, and follow the direction.


Stata log files are just standard ASCII files. They can also be sent directly to a printer. However, I do not like the font size and format. So the best way is to read the log in MS Word or any text editor (hint: Courier New at 8 points is best). So why don’t you open a log file, type “tutorial intro”, and print it out? When you are finished, you can close the log file for good:


log close


After typing this command, log on will not reopen the log file. If you decide to add onto the end of an existing log file, type:


log using a:\wage1, append


Increasing the amount of memory


The default is only 1Mb. If your data set is bigger than 1Mb, you cannot even open it with this default. Always allocate at least twice as much memory as the size of your data set. There are two ways to increase memory. The first is to type, at the beginning of the session (for 5Mb):


set memory 5M


However, allocating memory in every session is cumbersome. Instead, you can change the amount of memory Stata allocates when it starts. Create a shortcut to the wstata.exe file, click Properties, and choose the Shortcut tab. The target probably says something like


            C:\stata\wstata.exe /k1000


This tells Stata to allocate 1000k (1Mb). If you change the option to, say, k5000 (5Mb), Stata will allocate 5Mb. Change the number as you wish. The number needs to be a multiple of 1000. If your computer’s memory is smaller than the allocation, then Stata will use virtual memory (not recommended).


Batch or interactive


Using commands from keyboard

Using a do file

Stata can execute the commands stored in a file (batch mode) just as if they were entered from the keyboard (interactive mode): type do filename. If filename is specified without an extension, .do is assumed. Such a file is called a “do file”. You will find that batch mode is extremely helpful. Refer to the ancillary handouts.
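A do file is just a text file of commands. The following minimal sketch shows a typical layout; the file name example.do and the log name are hypothetical, and the shipped auto.dta data set stands in for your own data.

```stata
* example.do  (hypothetical file name)
* Run from the command line with:  do example

* Close any log left open; capture suppresses the error if none is open
capture log close

* Start a fresh log of commands and output
log using example, replace

* Shipped example data, standing in for your own data set
sysuse auto, clear
summarize price mpg weight
regress price mpg weight

log close
```

Keeping the log commands inside the do file means every run leaves a complete record of both the commands and their output.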



Reading Data Files


The command to read a Stata file is “use”. Of course you can instead use the Stata tool bar. If the Stata file is called wage1.dta, and the file is on the diskette in the A: drive, the command is:


use a:wage1


After entering this command the data file wage1 is loaded into memory. (Note that Stata is case sensitive.) However, life is not that easy. Not every data set is in Stata format. If yours is not, you should convert it to Stata format. There are many ways. Here are my hints.


1. If you want to enter data by hand, just use the “Data Editor” in Stata. After you enter the data, save it by pulling down File and choosing Save As. It is easy. The Data Editor is compatible with MS Excel, so as long as your data is in Excel format, you can just copy and paste it into the editor.


2. If the data is saved in another software format, for example Excel, SPSS, SAS, dBase, Limdep, RATS, Gauss, and so on, then you can use the Stat/Transfer program. This is the easiest way to convert a data set from one format to another. If you think you will use micro-data sets heavily in the future, consider buying it.


3. Or you can create Stata data from an ASCII file (text file). You may be able to convert your data set into ASCII format. In most cases, your data set is already an ASCII file. There is a command called infile that allows you to read an ASCII file. The file must be organized with one observation in each row and each variable in its own column.


a) If each number is separated by space, for example, suppose a wage data set is organized as


10.75   12   6   1   0
16.50   16   3   0   0
12.10   12   8   1   1


Each row corresponds to an individual. In the example above, the first variable is hourly wage, the second is years of education, the third is experience, the fourth is a dummy variable equal to unity for unionized firm, zero for non-unionized firm, and the last variable is an indicator variable for marital status (which equals unity for married individuals). The variables in this example are equally spaced but this spacing is not essential. If these data are in the file wage.raw on the A: drive, then the command:



infile wage edu exp union married using a:\wage.raw


reads in each row of data, and stores the data on each variable into the appropriate name. Once the ASCII file has been read, it is a good idea to save it as a Stata file. The command:



save a:\wage2 


creates the Stata file wage2.dta on the A: drive. Notice that the file extension dta denotes a Stata file. If you are working with multiple years of data for each individual, it is a good idea to include in your data set variables indicating the year and the id of the observation.


b) If each number is not separated by space or tab, then you should create a dictionary file to read the data. Create exer1.dct (ASCII) file as follows. Assume that there is a data set named original.dat


dictionary using C:\original.dat {
    year      %2f
    firmsize  %1f
    sampling  %3f
    union     %1f
    idnumber  %4f
    sex       %1s
    msts      %1f
    emptype   %1f
    shift     %2f
    expyears  %1f
}


“ind” and “sex” are strings (s) and the others are numbers (f). %2f means that the number occupies 2 columns. If you do not want to read all variables, you can jump ahead, for example to column 40 where the variable sex is.


            Then type


infile using a:\wage.dct


For more advanced features on inputting data, you can refer to the Stata User’s Guide which is published by the Stata Press.


If you have completed your analysis with a file such as wage1.dta, and then wish to use a different data set, you simply clear the existing data set from memory. The command to use is


clear


It is important to know that issuing this command discards any changes you made to the data set during your current Stata session unless you have saved them.



Looking At and Summarizing Your Data


After reading in a data file, you can get a list of the available variables by typing des. Often a short description has been given to each variable. To look at the observations of one or more variables, use the list command. For example, to look at the variables wage and edu for all observations, type:


list wage edu


This will list, one screen at a time, the data on wage and edu for every person in the sample. (Missing values in Stata are denoted by a period.) If the data set is large, you may not wish to look at all observations. You can always stop the listing by hitting Ctrl-Break on the keyboard. In fact, Ctrl-Break can be used to interrupt any Stata command.

Alternatively, there are various ways to restrict the range of the listing and many other Stata commands. To look at the first 20 observations on wage and edu type:  


list wage edu in 1/20


Rather than specify a range of observations, a logical command can be used instead. For example, to look at the data on marital status and age for people with zero hours worked type:


list married age if hours == 0


Notice how the double equal sign is used by Stata to test for equality. The other relational operators in Stata are > (greater than), < (less than), >= (greater than or equal), <= (less than or equal), and ~= (not equal). Or if you want to restrict attention to non-union members, type:


list married age if union == 0


The variable union is a binary indicator equal to unity for union members, and zero otherwise. The ~ is the logical "not" operator, so if ~union selects the same observations. We can combine many different logical statements. The command:


list married age if union == 1 & hours >= 40


restricts attention to union members who work at least 40 hours a week. (Logical and is denoted by “&” and logical or is denoted by "|" in Stata.)

Two useful commands for summarizing data are the sum and tab commands. The sum command computes the sample average, standard deviation, and the minimum and maximum values of all (nonmissing) observations. Because this command tells you how many observations were used for each variable in computing the summary statistics, you can easily find out how many missing data points there are for any variable. Thus, the command:


sum wage edu tenu married


computes the summary statistics for the four variables listed. Because married is a binary variable, its minimum and maximum values are not very interesting. The average value reported is simply the proportion of people in the sample who are married.

To obtain more summary information for each of these variables you must type:


sum wage edu tenu married, detail


By adding the detail option, Stata provides an extensive list of summary statistics for each of these variables including the median and other percentiles of the empirical distribution.

Stata also provides summary statistics for any subgroup of the sample if you add a logical statement:


sum wage edu tenu married if union


If the data is a pooled cross section or a panel data set, to summarize for 1990 type:


sum wage edu tenu married if year == 1990


The sample can be restricted to certain observation ranges by using the in m/n option, just as illustrated in the list command:


sum wage edu in 1/20


For variables that take on a relatively small number of values - such as number of children or number of times an individual was arrested during a year - you can use the tab command to get frequency tabulation:


tab married


This command reports the frequency associated with each value of married in the sample. You also can combine this command with logical statements or restrict the range of observations.


In order to summarize wage separately by marital status you will need to use the sort command. First you need to sort the data by married by typing:


sort married


Once the data is sorted then you summarize the variable by typing:


by married: sum wage


In order to summarize the wage by another variable, say year, you will need to re-sort the data by year and then use the by year: sum command.
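In Stata versions that provide bysort, the sort and by steps can be combined. A short sketch, assuming the shipped auto.dta data set with foreign playing the role of married and price the role of wage:

```stata
sysuse auto, clear

* Two steps, as in the text
sort foreign
by foreign: summarize price

* One step: bysort sorts the data and then applies the by prefix
bysort foreign: summarize price
```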

Sometimes, you may want to restrict all subsequent analysis to a particular subset of the data. In such cases it is useful to delete the data that will not be used subsequently. This can be done using the drop or keep commands. For example, if we want to analyze only union members in a wage equation, then we can type:


drop if ~union

(or, equivalently, drop if union == 0)


This drops everyone in the sample who is not a union member. Or, to analyze only the years between 1986 and 1990 (inclusive), we can type:


keep if (year >= 1986) & (year <= 1990)


In order to drop a particular observation, say observation 2672, you must type:


drop in 2672


It is important to know that the data dropped are gone from the current Stata session. If you want to get them back, you must reread the original data file. Along these lines, do not make the mistake of saving the smaller data set over the original one, or you will lose a chunk of your data.

BE SURE TO KEEP AN EXTRA BACKUP FILE OF YOUR STATA DATA SETS (for both beginners and experts!!). 




Defining New Variables


It is easy to create variables that are functions of existing variables. In Stata, this is accomplished using the gen command (short for generate). For example, to create the square of experience, type:


generate expsq = exp^2


The new variable, expsq, can be used in a regression or anywhere else Stata variables are used. (Stata does not allow us to put expressions such as exp^2 into regression commands; we must create the variables first.) When creating Stata variables, you should remember that variable names cannot be longer than eight characters in the Stata versions this guide covers. Stata will refuse to accept longer names in the gen command (and in all other Stata commands). If an observation has a missing value for exp then, naturally, expsq will also be missing for that observation. In fact, Stata will tell you how many missing observations were created after every gen command. If Stata reports nothing, then no missing observations were generated.

To find the natural log of a variable such as wage, type:


gen lwage = ln(wage)


If wage is missing then lwage will also be missing. For functions such as the natural log, there is an additional consideration: ln(wage) is not defined for wage <= 0. When a function is not defined for particular values of the variable, Stata sets the result to missing.

Logical commands can be used to restrict observations used for generating new variables. For example:


gen lwage = ln(wage) if hours > 0


creates ln(wage) for people who work (and therefore whose wage can be observed). Using the gen command without the qualifier if hours > 0 has the same effect in this example because wage is missing for those individuals who do not work.

Creating interaction terms is easy:


gen blckedu = black*edu


where “*” denotes multiplication; the division operator is “/”, addition is “+”, and subtraction is “-”.

The gen command also can be used to create binary variables. For example, if fratio is the funding ratio of a firm's pension plan, the dummy variable overfund can be created, which is unity when fratio > 1 and zero otherwise:


gen overfund = fratio > 1


The way this command works is that the logical statement on the right-hand side is evaluated as true or false; true is assigned the value unity and false the value zero. So overfund is unity if fratio > 1 and overfund is zero if fratio <= 1. As another example, we can create year dummies using a command such as:


gen y85 = (year == 1985)


where year is assumed to be a variable defined in the data set. The variable y85 is unity for observations corresponding to 1985, and zero otherwise. We can do this for each year in our sample to create a full set of year dummies.
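
Two refinements to the dummy-variable commands are sketched below, as suggestions rather than part of the original guide. First, Stata treats a missing numeric value as larger than any number, so fratio > 1 evaluates to true when fratio is missing; an if qualifier leaves the dummy missing in that case. Second, tabulate with the generate() option builds a full set of dummies in one step. The stub name yr is an arbitrary choice.

```stata
* Alternative definition: overfund stays missing when fratio is missing
gen overfund = fratio > 1 if !missing(fratio)

* One dummy per distinct value of year, named yr1, yr2, ...
tabulate year, generate(yr)
```

The first line is meant to replace the earlier gen overfund command, not to follow it, since gen will not overwrite an existing variable.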


The gen command also can be used to difference data across years. Suppose that, for a sample of cities, we have two years of data for each city (say 1982 and 1987). The data are stored so that the two years for each city are adjacent in the file, with the 1982 observation preceding the 1987 observation. To eliminate unobserved "fixed" effects, say in relating city crime rates to expenditures on crime and other city characteristics, we can relate changes over time. We store the changes between 1982 and 1987 alongside the 1987 data. It is important to remember that for 1982 there is no change from a previous time period because we do not have data on a previous time period. Therefore, we should define the change data so that it is missing in 1982. For example:


gen ccrime = crime - crime[_n-1] if year == 1987


gen cexpend = expend - expend[_n-1] if year == 1987


The variable "_n" is the reserved Stata symbol for the number of the current observation; thus, crime[_n-1] is the value of crime in the preceding observation, that is, crime lagged once. The variable ccrime is the change in crime between 1982 and 1987; cexpend is the change in expenditures between 1982 and 1987. These new change variables are stored next to the 1987 observations, and the corresponding change variables for 1982 are missing, denoted by ".". We can then use these change variables in a regression analysis, or some other analysis.
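
The differencing commands rely on the rows already being in the right order. A more defensive variant, sketched here, sorts within each city first. It assumes the data contain a city identifier, hypothetically named city; the text does not say what that variable is called.

```stata
* city is a hypothetical identifier; crime, expend and year are as in the text
* Within each city, sort by year and difference against the previous row
bysort city (year): gen ccrime = crime - crime[_n-1]
bysort city (year): gen cexpend = expend - expend[_n-1]
```

Under the by prefix, _n counts observations within each city, so the first (1982) row of every city refers to a nonexistent previous row and is set to missing automatically. These lines are an alternative to the two gen commands above, not an addition to them.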


The replace command is useful for correcting mistakes in definitions and redefining variables after values of other variables have changed. Suppose, for example, when creating the variable expsq, you mistakenly typed "gen expsq = exp^3". One possibility is to drop the variable expsq and try again:


drop expsq


gen expsq = exp^2


(Note that the drop command really has two purposes: to delete all variables for certain observations and to drop one or more variables for all observations.) A faster route is to use the replace command:


replace expsq = exp^2


Stata explicitly requires the replace command to write over the contents in a previously defined variable.



General Plotting Commands


  1. Plot a histogram of a variable:
    histogram vname
  2. Plot a histogram of a variable using frequencies:
    histogram vname, freq
    To set the number of bins and overlay a normal density curve:
    histogram vname, bin(xx) norm
    where xx is the number of bins.
  3. Plot a boxplot of a variable:
    graph box vname
  4. Plot side-by-side box plots for one variable (vone) by categories of another variable (vtwo, which should be categorical):
    graph box vone, over(vtwo)
  5. A scatter plot of two variables:
    scatter vone vtwo
  6. A matrix of scatter plots for three variables:
    graph matrix vone vtwo vthree
  7. A scatter plot of two variables with the values of a third variable used in place of points on the graph (vthree might contain numerical values or indicate categories, such as male ("m") and female ("f")):
    scatter vone vtwo, msymbol(none) mlabel(vthree) mlabposition(0)
  8. Normal quantile plot:
    qnorm vname
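
As a concrete run-through of the plotting commands listed above, the following sketch uses the auto data set that ships with Stata. The choice of variables (mpg, weight, length, foreign) is only for illustration.

```stata
* Load the example data set shipped with Stata
sysuse auto, clear
* Histogram with 10 bins and a normal density curve overlaid
histogram mpg, bin(10) normal
* Side-by-side box plots of mpg for domestic and foreign cars
graph box mpg, over(foreign)
* Scatter plot: mpg on the vertical axis, weight on the horizontal axis
scatter mpg weight
* Matrix of scatter plots for three variables
graph matrix mpg weight length
* Normal quantile plot
qnorm mpg
```

Each command replaces the graph drawn by the previous one, so run them one at a time when working interactively.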


General commands


  1. To compute means and standard deviations of all variables:
    summarize
    or, using an abbreviation,
    summ
  2. To compute means and standard deviations of select variables:
    summarize vone vtwo vthree
  3. Another way to compute means and standard deviations that allows the by option:
    tabstat vone vtwo, statistics(mean, sd) by(vthree)
  4. To get more numerical summaries for one variable:
    summ vone, detail
  5. See help tabstat to see the numerical summaries available. For example:
    tabstat vone, statistics(min, q, max, iqr, mean, sd)
  6. Correlation between two variables:
    correlate vone vtwo
  7. To see all values (all variables and all observations, not recommended for large data sets):
    list
    Hit the space bar to see the next page after "-more-" or type "q" to "break" (stop/interrupt the listing).
  8. To list the first 10 values for two variables:
    list vone vtwo in 1/10
  9. To list the last 10 values for two variables:
    list vone vtwo in -10/l
    (The end of this command is "minus 10" / "lowercase letter L".)
  10. Tabulate categorical variable vname:
    tabulate vname
    or, using an abbreviation,
    tab vname
  11. Cross tabulate two categorical variables:
    tab vone vtwo
  12. Cross tabulate two variables, include one or more of the options to produce column, row or cell percents and to suppress printing of frequencies:
    tab vone vtwo, column row cell
    tab vone vtwo, column row cell nofreq
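
The general commands above can be tried on the auto data set that ships with Stata. This is a sketch; the variable choices are illustrative only.

```stata
* Load the example data set shipped with Stata
sysuse auto, clear
* Means and standard deviations of selected variables
summarize price mpg
* The same statistics, separately for domestic and foreign cars
tabstat price mpg, statistics(mean sd) by(foreign)
* Percentiles, skewness and other detail for one variable
summarize price, detail
* Correlation between two variables
correlate price mpg
* First and last ten observations (the final character is a lowercase L)
list make price in 1/10
list make price in -10/l
* Cross tabulation with column percentages only
tabulate rep78 foreign, column nofreq
```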


Generating new variables


  1. General.
    1. Generate index of cases 1,2, ...,n (this may be useful if you sort the data, then want to restore the data to the original form without reloading the data):
      generate case= _n
      or, using an abbreviation,
      gen case=_n
    2. Multiply values in vx by b and add a, store results in vy:
      gen vy = a + b * vx
    3. Generate a variable with values 0 unless vtwo is greater than c, then make the value 1:
      gen vone=0
      replace vone=1 if vtwo>c
  2. Random numbers.
    1. Set numbers of observations to n:
      set obs n
    2. Set random number seed to XXXX (the default when Stata starts is 123456789):
      set seed XXXX
    3. Generate n uniform random variables (equal chance of all outcomes between 0 and 1):
      gen vname=uniform()
    4. Generate n uniform random variables (equal chance of all outcomes between a and b):
      gen vname=a + (b - a)*uniform()
    5. Generate n discrete uniform random variables (equal chance of all outcomes between 1 and 6)
      gen vname=1 + int(6*uniform())
      (These commands simulate rolling a six-sided die.)
    6. Generate normal data with mean 0 and standard deviation 1:
      gen vname= invnorm(uniform())
    7. Generate normal data with mean mu and standard deviation sigma:
      gen vname= mu + sigma * invnorm(uniform())




Regression


  1. Compute simple regression line (vy is response, vx is explanatory variable):
    regress vy vx
  2. Compute predictions, create new variable yhat:
    predict yhat
  3. Produce scatter plot with regression line added:
    graph twoway lfit vy vx || scatter vy vx
  4. Compute residuals, create new variable residuals:
    predict residuals, resid
  5. Produce a residual plot with horizontal line at 0:
    scatter residuals vx, yline(0)
  6. Identify points with largest and smallest residuals:
    sort residuals
    list in 1/5
    list in -5/l
    (The last command is "minus 5" / "lowercase letter L".)
  7. Compute multiple regression equation (vy is response; vone, vtwo, and vthree are explanatory variables):
    regress vy vone vtwo vthree
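
The regression steps above can be run end to end on the auto data set that ships with Stata. This is a sketch in which price plays the role of vy and weight the role of vx; these choices are illustrative only.

```stata
* Load the example data set shipped with Stata
sysuse auto, clear
* Simple regression of price on weight
regress price weight
* Fitted values and residuals from the regression just run
predict yhat
predict residuals, resid
* Scatter plot with the fitted line added
twoway (scatter price weight) (lfit price weight)
* Residual plot with a horizontal line at 0
scatter residuals weight, yline(0)
* Five smallest and five largest residuals
sort residuals
list make price yhat residuals in 1/5
list make price yhat residuals in -5/l
* Multiple regression with three explanatory variables
regress price weight mpg foreign
```

The predict commands always refer to the most recent estimation, so run them before fitting the next model.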


Important Notes on the "stem" command


In some versions of Stata, there is a potential glitch with Stata's stem command for stem-and-leaf plots. The stem function seems to permanently reorder the data so that they are sorted according to the variable that the stem-and-leaf plot was plotted for. The best way to avoid this problem is to avoid doing any stem-and-leaf plots (do histograms instead). However, if you really want to do a stem-and-leaf plot you should always create a variable containing the original observation numbers (called index, for example). A command to do so is:
generate index = _n

If you do this, then you can re-sort the data after the stem-and-leaf plot according to the index variable:
sort index
Then, the data are back in the original order.

Summary of These and Other Commands

Here is a list of the commands demonstrated above and some other commands that you may find useful (this is by no means an exhaustive list of all Stata commands):

anova       general ANOVA, ANCOVA, or regression
by          repeat operation for categories of a variable
ci          confidence intervals for means
clear       clears previous dataset out of memory
correlate   correlation between variables
describe    briefly describes the data (# of obs, variable names, etc.)
diagplot    distribution diagnostic plots
drop        eliminate variables or observations from memory
edit        better alternative to input for Macs
exit        leave Stata
generate    creates new variables (e.g., generate years = last - first)
graph       general graphing command (this command has many options)
help        online help
histogram   create a histogram graphic
if          lets you select a subset of observations (e.g., list if radius >= 3000)
infile      read non-Stata-format dataset (ASCII or text file)
input       type in raw data
insheet     read non-Stata-format spreadsheet with variable names on first line
list        lists the whole dataset in memory (you can also list only certain variables)
log         save or print Stata output (except graphs)
lookup      keyword search of commands, often precursor to help
oneway      oneway analysis of variance
pcorr       partial correlation coefficients
plot        text-mode (crude) scatterplots
predict     calculates predicted values (y-hat), residuals (ordinary, standardized and studentized), leverages, Cook's distance, standard error of predicted individual y, standard error of predicted mean y, standard error of residual from regression
qnorm       create a normal quantile plot
regress     regression
replace     lets you change individual values of a variable
save        saves data and labels in a Stata-format dataset
scatter     create a scatter plot of two numerical variables
set         set Stata system parameters (e.g., obs and seed)
sebarr      standard error-bar chart
sort        sorts observations from smallest to largest
stem        stem and leaf display
summarize   produces summary statistics (# obs, mean, sd, min, max) (has a detail option)
tabstat     produces summary statistics of your choice
tabulate    produces counts/frequencies for categorical data
test        conducts various hypothesis tests; refers back to the most recent model fit (e.g., regress or anova); see help for info and examples
ttest       one and two-sample t-tests
use         retrieve previously saved Stata dataset






Content adapted from: http://www2.hawaii.edu/~leesang/670/stata.htm



