Quick Command Reference Table 
 Basic Estimation Commands
 Ordinary Least Squares
 reg lwage edu exp expsq tenu union
 predict lwagehat
 predict uhat, resid
 test north south east
 reg lwage educ exper expersq married black, robust
 Instrumental Variable (Two Stage Least Squares)
ivreg lwage (edu = married) exp expsq tenu union
 Fixed Effect Model
 xtreg lwage edu exp expsq tenu union, fe
 Random Effect Model
 xtreg lwage edu exp expsq tenu union, re
 Between Effect Model
 xtreg lwage edu exp expsq tenu union, be
 Logistic Model
 logit married edu exp lwage
 probit married edu exp lwage
Cox Hazard Model
 cox tenu edu lwage union married
 Estimation and test techniques
 Time Series Analysis
Generating lags and leads: gen xlag1 = x[_n-1]
 Time series operators: reg consume gnp L.gnp L2.gnp D.gnp D2.gnp S2.gnp
 Autocorrelation: corrgram
Box-Pierce Q: wntestq
Augmented Dickey-Fuller test: dfuller
 Autoregressive Integrated Moving Average (ARIMA): arima
 Duration Model
 Cox Model: . cox studytim drug age, dead(died) hr
 Survivaltime data: . stset studytim, failure(died)
 Weibull Model: weibull
 Simultaneous Equation
 Seemingly Unrelated Regression Models: . sureg (gdp g i) (m2 r)
 . reg3 (gdp m2 g i) (m2 gdp r)
 Constrained Regression
 cnsreg
 constraint (constraint list)
 Test and Diagnostics
 Linear tests: . test m2 g
Nonlinear tests: . testnl _b[m2]/_b[g] == 1
Likelihood-ratio test: . lrtest
 Other diagnostics
rvfplot Graph residual-versus-fitted plot
rvpplot Graph residual-versus-predictor plot
ovtest Perform Ramsey RESET test for omitted variables
dwstat Compute Durbin-Watson d statistic if the data is declared as time series
 vif Calculate VIFs (variance inflation factors)
Using Stata as a Calculator and Computing p-values
 Calculator
 di .048/(2*.0016)
P-values
di normprob(1.58)
di tprob(df, t)
di fprob(df1, df2, F)
 Getting Started
 Editing the Command Line
 Help and Search
 help regress
 search fixed effect
 Creating log (procedure and output) file
 log using a:\wage1, replace
 log close
 Increasing the amount of memory
 set memory 5M
 Batch or interactive
 Using commands from keyboard
 Using a do file
 Reading Data Files
 use a:\wage1
 infile
 save
 clear
 Looking At and Summarizing Your Data
 list wage edu
 list wage edu in 1/20
list married age if hours == 0
list married age if union == 1 & hours >= 40
 sum wage edu tenu married
 sum wage edu tenure married, detail
sum wage edu tenu married if year == 1990
 sum wage edu in 1/20
 tab married
 by married: sum wage
drop if ~union (or drop if union == 0)
 keep if (year >= 1986) & (year <= 1990)
 drop in 2672
 Defining New Variables
 generate expsq = exp^2
 gen lwage = ln(wage) if hours > 0
gen ccrime = crime - crime[_n-1] if year == 1987
 replace expsq = exp^2
 General Plotting Commands
 General commands
 Generating new variables
 Regression
 Important Notes on the "stem" command
 Summary of These and Other Commands


Basic Estimation Commands
Ordinary Least Squares

For OLS regression, we use the command reg. Immediately following reg is the dependent variable, and after that, all of the independent variables (order of the independent variables is not, of course, important). An example is:
reg lwage edu exp expsq tenu union
This command produces OLS estimates, standard errors, t statistics, confidence intervals, and a variety of other statistics usually reported with OLS output. Unless a specific range of observations or logical statement is included, Stata uses all possible observations in obtaining estimates. It does not use observations for which data on the dependent or any of the independent variables are missing. Thus, you must be aware that adding another explanatory variable can result in fewer observations being used in the regression if some observations are missing for that variable. If a variable called "motheduc" (mother's education) is added to the independent variables in the above regression, and this variable is missing for, say, 10 percent of individuals, then the sample size used in obtaining the OLS estimates is decreased accordingly.
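To see this effect in practice, compare the number of observations reported in the output of the two regressions below; the second adds motheduc and will use fewer observations if motheduc is missing for some people:

```stata
reg lwage edu exp expsq tenu union
reg lwage edu exp expsq tenu union motheduc
```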
Sometimes we want to restrict our regression analysis based on the size of one or more of the variables. For example,
reg lwage edu exp expsq tenu union if edu<16
restricts the analysis to individuals with fewer than 16 years of education. The regression can also be restricted to a particular year using a similar if statement, or to a particular observation range using in m/n.
Predicted values are obtained using the predict command. Thus, if a regression is run with lwage as the dependent variable, to get the fitted values type:
predict lwagehat
The choice of the name lwagehat is arbitrary, subject to its being no more than eight characters and its not already being used. The predict command saves the fitted values for the most recently run regression.
The residuals can be obtained by:
predict uhat, resid
where again the name uhat is arbitrary.
You can test multiple linear restrictions after an OLS regression by using the test command. Consider a regression which controls for four Census regions: north, south, east, and west. Because the regression includes a constant term, we can identify parameters for only three of the four regional dummy variables. Suppose we exclude the "west" dummy from the regression and we wish to test whether there are any "regional effects" in the data. To test whether the coefficients for the north, south, and east dummy variables are jointly zero, you just list the variables hypothesized to have no effect:
test north south east
The result of this test tells you whether the three regional indicators can be excluded from the previously estimated model. Along with the value of the F statistic, Stata also reports a p-value. As with the predict command, test is applied to the most recently estimated model.
OLS estimates with heteroskedasticity-robust standard errors and t statistics can be obtained using the robust option. Remember, this is just OLS, but the asymptotic variance is estimated in a heteroskedasticity-robust fashion. For example,
reg lwage educ exper expersq married black, robust

Instrumental Variable (Two Stage Least Squares)

The ivreg command estimates models by 2SLS. After the dependent variable, one lists in parentheses the endogenous explanatory variable(s) (those correlated with the error), then an equal sign, then the instruments excluded from the structural equation; the remaining (exogenous) explanatory variables are listed outside the parentheses and serve as their own instruments. Naturally, the instrument list does not contain any endogenous variables.
An example of a 2SLS command is:
ivreg lwage (edu = married) exp expsq tenu union
ivreg lwage (edu = married exp expsq tenu union) exp expsq tenu union /* equivalent */
This command produces 2SLS estimates, standard errors, t statistics, and so on. By looking at this command, we see that edu is an endogenous explanatory variable in the lwage equation, while exp, expsq, tenu, and union are assumed to be exogenous explanatory variables. The variable married is assumed to be an additional exogenous variable that does not appear in the lwage structural equation but should have some correlation with edu. The exogenous explanatory variables appear in the instrument list along with married.
The order in which we list the instruments is not important. The necessary (order) condition for the model to be identified is that the total number of instruments is at least as large as the total number of explanatory variables. In this example the count is five to five, so the order condition holds and the equation is exactly identified.
In the previous example, we allowed for just one endogenous explanatory variable, "edu". Allowing for more than one endogenous explanatory variable is also easy. After 2SLS, we can test multiple restrictions using the test command, just as with OLS.
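As a sketch, suppose both edu and exp are endogenous; then at least two excluded instruments are needed. Here fatheduc is a hypothetical second instrument alongside married:

```stata
ivreg lwage (edu exp = married fatheduc) expsq tenu union
```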

Fixed Effect Model

iis id
tis year
xtreg lwage edu exp expsq tenu union, fe
xtreg lwage edu exp expsq tenu union, re
xtreg lwage edu exp expsq tenu union, be

Random Effect Model

iis id
tis year
xtreg lwage edu exp expsq tenu union, re

Between Effect Model

iis id
tis year
xtreg lwage edu exp expsq tenu union, be

Logistic Model

logit married edu exp lwage
probit married edu exp lwage

Cox Hazard Model

cox tenu edu lwage union married

Estimation and test techniques

Time Series

Duration Model

Simultaneous Equation

Constrained Regression

Test and Diagnostics
Time Series Analysis

Generating lags and leads: gen xlag1 = x[_n-1]
If you sort the data by date, then the one-period lag of x can be obtained by typing
. gen xlag1 = x[_n-1]
Of course, you can use as many lags as you want:
. gen xlag2 = x[_n-2]
Likewise, you can lead the data by using _n+1, _n+2, and so on.
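For example, a one-period lead (again assuming the data are sorted by date):

```stata
gen xlead1 = x[_n+1]
```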
If you are a serious user of time-series data, you would be better served by the time-series operators: L (lag), F (lead), D (difference), and S (seasonal difference). You must first declare the time variable using the tsset command.
Time series operators: reg consume gnp L.gnp L2.gnp D.gnp D2.gnp S2.gnp
. tsset year, yearly /* declare the data set to be time-series data */
. reg consume gnp L.gnp L2.gnp D.gnp D2.gnp S2.gnp
. sum interest if F.gnp<gnp
Autocorrelation: corrgram
wntestq
corrgram lists a table of the autocorrelations, partial autocorrelations, and Q statistics. It will also list a character-based plot of the autocorrelations and partial autocorrelations. The ac command produces a correlogram (the autocorrelations) with pointwise confidence intervals obtained from the Q statistic.
Box-Pierce Q: wntestq
The wntestq command produces the Box-Pierce Q test statistic. The null hypothesis is that the autocorrelation coefficients are jointly equal to zero.
. corrgram r
. corrgram r, lags(5)
. wntestq r
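The graphical correlogram mentioned above can be requested with ac, and the partial autocorrelations with pac:

```stata
. ac r
. pac r
```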
Augmented Dickey-Fuller test: dfuller
dfuller performs the augmented Dickey-Fuller test for a unit root. The test regresses the differenced variable on its lag and a user-specified number of lagged differences of the variable. Optionally, a trend term may be included (trend) and the underlying regression displayed (regress). The null hypothesis is that there is a unit root.
. dfuller r
. dfuller r, lags(3) trend regress
Autoregressive Integrated Moving Average (ARIMA): arima
arima estimates a model of depvar on varlist where the disturbances are allowed to follow a linear autoregressive moving-average (ARMA) specification. The dependent and independent variables may be differenced or seasonally differenced to any degree. When independent variables are not specified, these models reduce to autoregressive integrated moving-average (ARIMA) models in the dependent variable. Missing data are allowed and are handled using the Kalman filter. arima allows time-series operators in the dependent and independent variable lists, and it is often convenient to make extensive use of these operators.
. arima r, arima(1,1,1)
. arima D.r, ar(1) ma(1) /* same as above */
. arima r, arima(3,2,4)
. arima D2.r, ar(1/3) ma(1/4) /*same as above*/
There are other test commands, such as the Granger causality test and cointegration tests. Their ado program files are not installed by default, but you can find and download them from the STB (Stata Technical Bulletin) web site. If your computer is connected to the network, just click the search result.

Duration Model

If the data set is already in duration form, you can use a duration model; for example, the Cox or Weibull model. You can also use the hr option to report hazard ratios instead of coefficients.
Cox Model: . cox studytim drug age, dead(died) hr
. cox studytim drug age, dead(died)
. cox studytim drug age, dead(died) hr
If you are a serious user of duration data, you might want to use the stset command, which declares the data to be survival-time data.
Survival-time data: . stset studytim, failure(died)
. stset studytim, failure(died)
. stset studytim, failure(outcome==2)
. stset studytim, failure(outcome==2) id(patientno) /*multiple failure*/
Once declared as duration data, you can use
. cox drug age
. cox drug age, hr
Weibull Model: weibull
Compare these results with those above. You can also use weibull instead of cox.

Simultaneous Equation

Seemingly Unrelated Regression Models: . sureg (gdp g i) (m2 r)
sureg estimates Zellner's seemingly unrelated regression models. This is in fact FGLS, and it is especially useful when all equations contain only exogenous explanatory variables. Suppose GDP = f(G, I) and M2 = g(r). Do:
. sureg (gdp g i) (m2 r)
Compare the result with the two separate OLS results.
Now suppose some equations contain endogenous variables among the explanatory variables. Everyone is presumably already familiar with the 2SLS method (ivreg). Another way is to estimate the system of structural equations with the reg3 command. Suppose GDP = f(G, I, M2) and M2 = g(GDP, r). Notice that the two dependent variables are now included as explanatory variables in the equations. reg3 can also estimate systems of equations by seemingly unrelated regression (SURE).
. reg3 (gdp m2 g i) (m2 gdp r)
. reg3 (gdp m2 g i) (m2 gdp r), sure

Constrained Regression

cnsreg
constraint (constraint list)
cnsreg estimates constrained linear regression models. constraint(constraint list) is not optional; it specifies the constraint numbers of the constraints to be applied. Constraints are defined using the constraint command.
. constraint def 1 price = weight
. cnsreg mpg price weight, constraint(1)
. constraint def 2 gratio = foreign
. cnsreg mpg price weight displ gratio foreign length, c(1-2)
. constraint define 3 _cons = 0
. cnsreg mpg price weight displ gratio foreign length, c(1-3)

Test and Diagnostics

Linear tests: . test m2 g
Nonlinear tests: . testnl _b[m2]/_b[g] == 1
test tests linear hypotheses about the estimated parameters from the most recently estimated model. testnl tests nonlinear (or linear) hypotheses about the estimated parameters from the most recently estimated model.
. reg gdp m2 g i
. test m2 g
. test m2 - 5*g + 3 == 0
. testnl _b[m2]/_b[g] == 1
. testnl (_b[m2]/_b[g] == 1) (_b[m2] == 2)
test performs Wald tests. See help lrtest for likelihood-ratio tests.
Likelihood-ratio test: . lrtest

Other diagnostics

rvfplot Graph residual-versus-fitted plot
rvpplot Graph residual-versus-predictor plot
ovtest Perform Ramsey RESET test for omitted variables
dwstat Compute Durbin-Watson d statistic if the data is declared as time series
vif Calculate VIFs (variance inflation factors)

Using Stata as a Calculator and Computing p-values
Calculator
Stata can be used to compute a variety of expressions, including certain functions that are not available on a standard calculator. The command to compute an expression is display (disp or di for short). The command:
di .048/(2*.0016)
will return "15." We can use the di command to compute natural logs, exponentials, squares, and so on. For example:
di exp(3.5 + 4*.06)
returns the value 42.098 (approximately). These calculations can be performed on most calculators.
P-values
More importantly, we can use di to compute p-values after computing a test statistic. The command:
di normprob(1.58)
gives the probability that a standard normal random variable is less than or equal to the value 1.58 (about .943). Thus, if a standard normal test statistic takes on the value 1.58, the one-sided p-value is 1 - .943 = .057. Other functions are geared to give the p-value directly.
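For a two-sided alternative, double the tail area; a sketch using the same function:

```stata
di 2*(1 - normprob(abs(1.58)))
```

which returns about .114.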
di tprob(df, t)
returns the pvalue for a t test against a twosided alternative (t is the absolute value of the “t” statistic and “df” is the degrees of freedom). For example, with df= 31 and t = 1.32, the command returns the value .196. To obtain the pvalue for an F test, the command is:
di fprob(df1, df2, F)
where “df1” is the numerator degrees of freedom, “df2” is the denominator df, and F is the value of the F statistic. As an example:
di fprob(3, 142, 2.18)
returns the pvalue .093.
Getting Started
Editing the Command Line
Stata has several shortcuts for entering commands. Two useful keys are Page Up and Page Down. If at any point you hit Page Up, the previously executed command appears on the command line. This can save a lot of typing because you can hit Page Up and edit the previous command. Among other things, this makes adding an independent variable to a regression, or expanding an instrument list, easier. Hitting Page Up repeatedly lets you traverse previously executed commands until you find the one you want. Hitting Page Down takes you back down through the commands.
It is easy to edit the command line. Hitting Home takes the cursor to the beginning of the line; hitting End moves it to the end. The Delete key deletes a single character to the right of the cursor; holding it down will delete many characters. The Backspace key (a left arrow on many keyboards) deletes a character to the left of the cursor. Hitting the left arrow moves you one character to the left, and the right arrow one character to the right; you can hold down either to move several characters. The Ins key toggles between insert and overwrite modes. Both modes are useful for editing commands.
Help and Search
help regress
search fixed effect
Creating log (procedure and output) file
Suppose you want to print out the "tutorial" or the results from your session. You should create an output file. In particular, for involved projects, you must create a record of what you have done (data transformations, regressions, and so on). To do this, you can create a log file. With a diskette in the A: drive, before doing any analysis, type:
log using a:\wage1, replace
This will create the file wage1.log on the diskette in the A: drive. Or just
Click the log start/stop icon and follow the directions.
Stata log files are just standard ASCII files. They can also be sent directly to a printer. However, I do not like the font size and format, so the best way is to read the log with MS Word or any text editor (hint: Courier New at 8-point size is best). Try opening a log file this way and printing it out. When you are finished, you can close the log file by typing:
log close
After typing this command, nothing more will be written to the log file. If you decide to add onto the end of an existing log file, type:
log using a:\wage1, append
Increasing the amount of memory
The default is only 1Mb. If your data set is bigger than 1Mb, you cannot even open it with this default. Always allocate at least twice as much memory as the size of your data set. There are two ways to increase memory. The first is to type, at the beginning of your session (for 5Mb):
set memory 5M
However, allocating memory every session is cumbersome, so the second way is to change the default allocation in the Stata shortcut. Create a shortcut to the wstata.exe file, click Properties, and choose the Shortcut tab. The target probably says something like:
C:\stata\wstata.exe /k1000
This tells Stata to allocate 1000k (1Mb). If you change the option to, say, /k5000 (5Mb), Stata will allocate 5Mb. Change the number as you wish; it needs to be a multiple of 1000. If your computer's physical memory is smaller than the allocation, Stata will use virtual memory (not recommended).
Batch or interactive
Using commands from keyboard
Using a do file
Of course, it is possible to have Stata execute the commands stored in a file (batch mode) just as if they were entered from the keyboard (interactive mode). Such a file is called a "do file", and it is run with the do command; if the filename is given without an extension, .do is assumed. You will find batch mode extremely helpful. Refer to the ancillary handouts.
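For example, a do file (hypothetically named wage1.do) might contain:

```stata
log using a:\wage1, replace
use a:\wage1
reg lwage edu exp expsq tenu union
log close
```

and is executed by typing do wage1.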
Reading Data Files
The command to read a Stata file is “use”. Of course you can instead use the Stata tool bar. If the Stata file is called wage1.dta, and the file is on the diskette in the A: drive, the command is:
use a:\wage1
After entering this command, the data file wage1 is loaded into memory. (Note that Stata is case sensitive.) However, life is not that easy: not every data set comes in Stata format. If the data are not in Stata format, you must convert them. There are many ways; here are my hints.
1. If you want to input data by hand, just use the Data Editor in Stata. After you input the data, save it by pulling down File and choosing Save As. It is easy. The Data Editor is compatible with MS Excel, so as long as your data are in Excel format, you can just copy them into the editor.
2. If the data are saved in another software's format (for example, Excel, SPSS, SAS, Dbase, Limdep, RATS, Gauss, and so on), you can use the STAT/Transfer program. This is the easiest way to convert one data format to another. If you think you will use microdata sets heavily in the future, consider buying it.
3. Or you can create Stata data from an ASCII (text) file. You may be able to convert your data set into ASCII format; in most cases, your data set is already an ASCII file. The infile command reads an ASCII file. The file must be organized with one observation per row and each variable in its own column.
a) If each number is separated by spaces, suppose, for example, that a wage data set is organized as:
10.75 12 6 1 0
16.50 16 3 0 0
…
12.10 12 8 1 1
Each row corresponds to an individual. In the example above, the first variable is the hourly wage, the second is years of education, the third is experience, the fourth is a dummy variable equal to unity for a unionized firm and zero for a nonunionized firm, and the last is an indicator for marital status (which equals unity for married individuals). The variables in this example are equally spaced, but the spacing is not essential. If these data are in the file wage.raw on the A: drive, then the command:
infile
infile wage edu exp union married using a:\wage.raw
reads in each row of data, and stores the data on each variable into the appropriate name. Once the ASCII file has been read, it is a good idea to save it as a Stata file. The command:
save
save a:\wage2
creates the Stata file wage2.dta on the A: drive. Notice that the extension .dta denotes a Stata file. If you are working with multiple years of data for each individual, it is a good idea to include in your data set variables indicating the year and the id of each observation.
b) If the numbers are not separated by spaces or tabs, then you should create a dictionary file to read the data. Create the (ASCII) file exer1.dct as follows, assuming the data set is named original.dat.
dictionary using C:\original.dat {
year %2f
firmsize %1f
sampling %3f
union %1f
idnumber %4f
_column(40)
sex %1s
_column(44)
msts %1f
emptype %1f
shift %2f
expyears %1f
}
"sex" is a string (s) and the others are numbers (f). %2f means the value occupies 2 columns of numbers. If you do not want to read all the variables, you can jump ahead; here _column(40) skips to column 40, where the variable sex is.
Then type
infile using exer1.dct
For more advanced features on inputting data, you can refer to the Stata User’s Guide which is published by the Stata Press.
If you have completed your analysis with a file such as wage1.dta, and then wish to use a different data set, you simply clear the existing data set from memory. The command to use is
clear
Be aware that by issuing this command, any changes you made to the data set during your current Stata session will be lost.
Looking At and Summarizing Your Data
After reading in a data file, you can get a list of the available variables by typing des. Often a short description has been given to each variable. To look at the observations of one or more variables, use the list command. For example, to look at the variables wage and edu for all observations, type:
list wage edu
This will list, one screen at a time, the data on wage and edu for every person in the sample. (Missing values in Stata are denoted by a period.) If the data set is large, you may not wish to look at all observations. You can always stop the listing by hitting Ctrl-Break on the keyboard. In fact, Ctrl-Break can be used to interrupt any Stata command.
Alternatively, there are various ways to restrict the range of the listing and many other Stata commands. To look at the first 20 observations on wage and edu type:
list wage edu in 1/20
Rather than specify a range of observations, a logical command can be used instead. For example, to look at the data on marital status and age for people with zero hours worked type:
list married age if hours == 0
Notice how the double equal sign is used by Stata to test equivalence. The other relational operators in Stata are > (greater than), < (less than), >= (greater than or equal), <= (less than or equal), and ~= (not equal). Or, if you want to restrict attention to non-union members, type:
list married age if union == 0
The variable union is a binary indicator equal to unity for union members and zero otherwise. The ~ is the logical "not" operator, so list married age if ~union gives the same result. We can combine many different logical statements. The command:
list married age if union == 1 & hours >= 40
restricts attention to union members who work at least 40 hours a week. (Logical "and" is denoted by & and logical "or" by | in Stata.)
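Logical "or" (|) works the same way; for example (the 20-hour cutoff is arbitrary):

```stata
list married age if union == 1 | hours < 20
```

This lists union members along with anyone working fewer than 20 hours.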
Two useful commands for summarizing data are the sum and tab commands. The sum command computes the sample average, standard deviation, and the minimum and maximum values of all (nonmissing) observations. Because this command tells you how many observations were used for each variable in computing the summary statistics, you can easily find out how many missing data points there are for any variable. Thus, the command:
sum wage edu tenu married
computes the summary statistics for the four variables listed. Because married is a binary variable, its minimum and maximum values are not very interesting. The average value reported is simply the proportion of people in the sample who are married.
To obtain more summary information for each of these variables you must type:
sum wage edu tenure married, detail
By adding the detail option, Stata provides an extensive list of summary statistics for each of these variables including the median and other percentiles of the empirical distribution.
Stata also provides summary statistics for any subgroup of the sample if you add a logical statement:
sum wage edu tenu married if union
If the data is a pooled cross section or a panel data set, to summarize for 1990 type:
sum wage edu tenu married if year == 1990
The sample can be restricted to certain observation ranges by using the in m/n option, just as illustrated in the list command:
sum wage edu in 1/20
For variables that take on a relatively small number of values, such as the number of children or the number of times an individual was arrested during a year, you can use the tab command to get a frequency tabulation:
tab married
This command reports the frequency associated with each value of married in the sample. You can also combine this command with logical statements or restrict the range of observations.
To summarize one variable by the categories of another, you need the sort command. For example, to summarize wage by marital status, first sort the data by typing:
sort married
Once the data are sorted, summarize the variable by typing:
by married: sum wage
To calculate the wage by another variable, say year, you will need to re-sort the data by year and then use the sum command again.
Sometimes you may want to restrict all subsequent analysis to a particular subset of the data. In such cases it is useful to delete the data that will not be used subsequently. This can be done using the drop or keep commands. For example, if we want to analyze only union members in a wage equation, we can type:
drop if ~union (or drop if union == 0)
This drops everyone in the sample who is not a union member. Or, to analyze only the years between 1986 and 1990 (inclusive), we can type:
keep if (year >= 1986) & (year <= 1990)
In order to drop a particular observation, say observation 2672, type:
drop in 2672
It is important to know that the data dropped are gone from the current Stata session. If you want to get them back, you must reread the original data file. Along these lines, do not make the mistake of saving the smaller data set over the original one, or you will lose a chunk of your data.
BE SURE TO KEEP AN EXTRA BACKUP FILE OF YOUR STATA DATA SETS (for both beginners and experts!!).
Defining New Variables
It is easy to create variables that are functions of existing variables. In Stata, this is accomplished using the gen command (short for generate). For example, to create the square of experience, type:
generate expsq = exp^2
The new variable, expsq, can be used in a regression or any place else Stata variables are used. (Stata does not allow us to put expressions such as exp^2 into regression commands; we must create the variables first.) When creating Stata variables, remember that variable names cannot be longer than eight characters; Stata will refuse longer names in the gen command (and in all other Stata commands). If an observation has a missing value for exp then, naturally, expsq will also be missing for that observation. In fact, Stata will tell you how many missing values were created after every gen command. If Stata reports nothing, then no missing values were generated.
To find the natural log of a variable such as wage, type:
gen lwage = ln(wage)
If wage is missing, then lwage will also be missing. For functions such as the natural log, there is an additional consideration: ln(wage) is not defined for wage <= 0. When a function is not defined for particular values of the variable, Stata sets the result to missing.
Logical commands can be used to restrict observations used for generating new variables. For example:
gen lwage = ln(wage) if hours > 0
creates ln(wage) for people who work (and therefore whose wage is observed). Using the gen command without the statement if hours > 0 has the same effect in this example, because wage is missing for those individuals who do not work.
Creating interaction terms is easy:
gen blckedu = black*edu
where "*" denotes multiplication; the division operator is "/", addition is "+", and subtraction is "-". The gen command can also be used to create binary variables. For example, if fratio is the funding ratio of a firm's pension plan, the dummy variable overfund can be created, which is unity when fratio > 1 and zero otherwise:
gen overfund = fratio > 1
The way this command works is that the logical statement on the right-hand side is evaluated as true or false; true is assigned the value unity, and false the value zero. So overfund is unity if fratio > 1 and zero if fratio <= 1. As another example, we can create year dummies using a command such as:
gen y85 = (year == 1985)
where year is assumed to be a variable defined in the data set. The variable y85 is unity for observations corresponding to 1985, and zero otherwise. We can do this for each year in our sample to create a full set of year dummies.
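A quick alternative, assuming the variable year exists: the generate() option of tab creates one dummy per distinct value automatically.

```stata
tab year, gen(y)
```

This creates y1, y2, and so on, one for each year in the sample.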
The gen command can also be used to difference data across years. Suppose that, for a sample of cities, we have two years of data for each city (say 1982 and 1987). The data are stored so that the two years for each city are adjacent in the file, with the 1982 observation preceding the 1987 observation. To eliminate unobserved "fixed" effects, say in relating city crime rates to expenditures on crime and other city characteristics, we can relate changes over time. We will store the changes between 1982 and 1987 alongside the 1987 data. It is important to remember that for 1982 there is no change from a previous time period, because we do not have data on a previous period; therefore, we should define the change data so that they are missing in 1982. For example:
gen ccrime = crime - crime[_n-1] if year == 1987
gen cexpend = expend - expend[_n-1] if year == 1987
The symbol "_n" is the reserved Stata symbol for the current observation; thus, x[_n-1] is the variable lagged once. The variable ccrime is the change in crime between 1982 and 1987; cexpend is the change in expenditures between 1982 and 1987. These new change variables are stored next to the 1987 observations, and the corresponding values for 1982 are missing, denoted by ".". We can then use these change variables in a regression analysis, or some other analysis.
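The change variables can then be used in a first-differenced regression, for example:

```stata
reg ccrime cexpend if year == 1987
```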
The replace command is useful for correcting mistakes in definitions and for redefining variables after the values of other variables have changed. Suppose, for example, that when creating the variable expsq you mistakenly typed "gen expsq = exper^3". One possibility is to drop the variable expsq and try again:
drop expsq
gen expsq = exper^2
(Note that the drop command really has two purposes: to delete all variables for certain observations and to drop one or more variables for all observations.) A faster route is to use the replace command:
replace expsq = exper^2
Stata explicitly requires the replace command (rather than gen) to overwrite the contents of a previously defined variable.
General Plotting Commands
 Plot a histogram of a variable:
histogram vname
 Plot a histogram of a variable using frequencies:
histogram vname, freq
histogram vname, bin(xx) norm
where xx is the number of bins.
 Plot a boxplot of a variable:
graph box vname
 Plot side-by-side box plots of one variable (vone) by categories of another variable (vtwo, which should be categorical):
graph box vone, over(vtwo)
 A scatter plot of two variables:
scatter vone vtwo
 A matrix of scatter plots for three variables:
graph matrix vone vtwo vthree
 A scatter plot of two variables with the values of a third variable used as marker labels (vthree might contain numerical values or indicate categories, such as male ("m") and female ("f")):
scatter vone vtwo, mlabel(vthree)
 Normal quantile plot:
qnorm vname
General commands
 To compute means and standard deviations of all variables:
summarize
or, using an abbreviation,
summ
 To compute means and standard deviations of select variables:
summarize vone vtwo vthree
 Another way to compute means and standard deviations that allows the by option:
tabstat vone vtwo, statistics(mean sd) by(vthree)
 To get more numerical summaries for one variable:
summ vone, detail
 See help tabstat to see the numerical summaries available. For example:
tabstat vone, statistics(min q max iqr mean sd)
 Correlation between two variables:
correlate vone vtwo
 To see all values (all variables and all observations, not recommended for large data sets):
list
Hit the space bar to see the next page after "more", or type "q" to break (stop/interrupt) the listing.
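If you would rather not page through long output at all, the "more" prompt can be turned off for the session:
set more off
After this, Stata scrolls all output continuously without pausing.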
 To list the first 10 values for two variables:
list vone vtwo in 1/10
 To list the last 10 values for two variables:
list vone vtwo in -10/l
(The end of this command is "minus 10" / "lowercase letter L".)
 Tabulate categorical variable vname:
tabulate vname
or, using an abbreviation,
tab vname
 Cross tabulate two categorical variables:
tab vone vtwo
 Cross tabulate two variables, including one or more of the options to produce column, row, or cell percents and to suppress printing of frequencies:
tab vone vtwo, column row cell
tab vone vtwo, column row cell nofreq
Generating new variables
 General.
 Generate an index of cases 1, 2, ..., n (this may be useful if you sort the data and then want to restore the original order without reloading the data):
generate case= _n
or, using an abbreviation,
gen case=_n
 Multiply values in vx by b and add a, store results in vy:
gen vy = a + b * vx
 Generate a variable with values 0 unless vtwo is greater than c, then make the value 1:
gen vone=0
replace vone=1 if vtwo>c

 Random numbers.
 Set numbers of observations to n:
set obs n
 Set random number seed to XXXX, default is 1000:
set seed XXXX
 Generate n uniform random variables (equal chance of all outcomes between 0 and 1):
gen vname=uniform()
 Generate n uniform random variables (equal chance of all outcomes between a and b):
gen vname=a + (b - a)*uniform()
 Generate n discrete uniform random variables (equal chance of all outcomes between 1 and 6):
gen vname=1 + int(6*uniform())
(This command simulates rolling a six-sided die.)
 Generate normal data with mean 0 and standard deviation 1:
gen vname= invnorm(uniform())
 Generate normal data with mean mu and standard deviation sigma:
gen vname= mu + sigma * invnorm(uniform())
Regression
 Compute simple regression line (vy is response, vx is explanatory variable):
regress vy vx
 Compute predictions, create new variable yhat:
predict yhat
 Produce scatter plot with regression line added:
graph twoway lfit vy vx || scatter vy vx
 Compute residuals, create new variable residuals:
predict residuals, resid
 Produce a residual plot (residuals versus the fitted values computed above) with a horizontal line at 0:
scatter residuals yhat, yline(0)
 Identify points with largest and smallest residuals:
sort residuals
list in 1/5
list in -5/l
(The last command is "minus 5" / "lowercase letter L".)
 Compute a multiple regression equation (vy is the response; vone, vtwo, and vthree are explanatory variables):
regress vy vone vtwo vthree
Important Notes on the "stem" command
In some versions of Stata, there is a potential glitch with the stem command for stem-and-leaf plots: stem appears to permanently reorder the data, sorting them according to the variable that was plotted. The easiest way to avoid this problem is to avoid stem-and-leaf plots altogether (do histograms instead). However, if you really want to do a stem-and-leaf plot, you should first create a variable containing the original observation numbers (called index, for example). A command to do so is:
generate index = _n
If you do this, then you can re-sort the data after the stem-and-leaf plot according to the index variable:
sort index
Then, the data are back in the original order.
Summary of These and Other Commands
Here is a list of the commands demonstrated above and some other commands that you may find useful (this is by no means an exhaustive list of all Stata commands):
anova      general ANOVA, ANCOVA, or regression
by         repeat an operation for categories of a variable
ci         confidence intervals for means
clear      clears the previous dataset out of memory
correlate  correlation between variables
describe   briefly describes the data (# of obs, variable names, etc.)
diagplot   distribution diagnostic plots
drop       eliminate variables from memory
edit       better alternative to input for Macs
exit       leave Stata
generate   creates new variables (e.g., generate years = last - first)
graph      general graphing command (this command has many options)
help       online help
histogram  create a histogram graphic
if         lets you select a subset of observations (e.g., list if radius >= 3000)
infile     read a non-Stata-format dataset (ASCII or text file)
input      type in raw data
insheet    read a non-Stata-format spreadsheet with variable names on the first line
list       lists the whole dataset in memory (you can also list only certain variables)
log        save or print Stata output (except graphs)
lookup     keyword search of commands, often a precursor to help
oneway     one-way analysis of variance
pcorr      partial correlation coefficients
plot       text-mode (crude) scatterplots
predict    calculates predicted values (yhat), residuals (ordinary, standardized, and studentized), leverages, Cook's distance, the standard error of a predicted individual y, the standard error of a predicted mean y, and the standard error of a residual from regression
qnorm      create a normal quantile plot
regress    regression
replace    lets you change individual values of a variable
save       saves data and labels in a Stata-format dataset
scatter    create a scatter plot of two numerical variables
set        set Stata system parameters (e.g., obs and seed)
serrbar    standard error-bar chart
sort       sorts observations from smallest to largest
stem       stem-and-leaf display
summarize  produces summary statistics (# obs, mean, sd, min, max) (has a detail option)
tabstat    produces summary statistics of your choice
tabulate   produces counts/frequencies for categorical data
test       conducts various hypothesis tests (refers back to the most recent model fit, e.g., regress or anova; see help for info and examples)
ttest      one- and two-sample t-tests
use        retrieve a previously saved Stata dataset
Reference
Content adapted from: http://www2.hawaii.edu/~leesang/670/stata.htm
http://www.stat.uchicago.edu/~collins/resources/stata/statacommands.html