STATA: Quick Command Reference


 

Quick Command Reference Table
  1. Basic Estimation Commands
    1. Ordinary Least Squares
      1. reg lwage edu exp expsq tenu union
      2. predict lwagehat
      3. predict uhat, resid
      4. test north south east
      5. reg lwage educ exper expersq married black, robust
    2. Instrumental Variable (Two Stage Least Squares)
      1. ivreg lwage (edu = married) exp expsq tenu union
    3. Fixed Effect Model
      1. xtreg lwage edu exp expsq tenu union, fe
    4. Random Effect Model
      1. xtreg lwage edu exp expsq tenu union, re
    5. Between Effect Model
      1. xtreg lwage edu exp expsq tenu union, be
    6. Logistic Model
      1. logit married edu exp lwage
      2. probit married edu exp lwage
    7. Cox-Hazard Model
      1. cox tenu edu lwage union married
  2. Estimation and test techniques
    1. Time Series Analysis
      1. Generating lags and leads: gen xlag1 = x[_n-1]
      2. Time series operators: reg consume gnp L.gnp L2.gnp D.gnp D2.gnp S2.gnp
      3. Autocorrelation: corrgram
      4. Box-Pierce Q: wntestq
      5. Augmented Dickey-Fuller test: dfuller
      6. Autoregressive Integrated Moving Average (ARIMA): arima
    2. Duration Model
      1. Cox Model: . cox studytim drug age, dead(died) hr
      2. Survival-time data: . stset studytim, failure(died)
      3. Weibull Model: weibull
    3. Simultaneous Equation
      1. Seemingly Unrelated Regression Models: . sureg (gdp g i) (m2 r)
      2. . reg3 (gdp m2 g i) (m2 gdp r)
    4. Constrained Regression
      1. cnsreg
      2. constraint (constraint list)
    5. Test and Diagnostics
      1. Linear tests: . test m2 g
      2. Non-linear tests: . testnl _b[m2]/_b[g] == 1
      3. Likelihood-ratio test: . lrtest
  3. Other diagnostics
      1. rvfplot       Graph residual-versus-fitted plot
      2. rvpplot      Graph residual-versus-predictor plot
      3. ovtest        Perform Ramsey RESET test for omitted variables
      4. dwstat       Compute Durbin-Watson d statistic if the data is declared as time series
      5. vif             Calculate VIFs (variance inflation factors)
  4. Using Stata as a Calculator and Computing p-values
    1. Calculator
      1. di .048/(2*.0016)
    2. P-values
      1. di normprob(1.58)
      2. di tprob(df, t)
      3. di fprob(df1, df2, F)
  5. Getting Started
    1. Editing the Command Line
    2. Help and Search
      1. help regress
      2. search fixed effect
    3. Creating log (procedure and output) file
      1. log using a:\wage1, replace
      2. log close
    4. Increasing the amount of memory
      1. set memory 5M
    5. Batch or interactive
      1. Using commands from keyboard
      2. Using a do file
    6. Reading Data Files
      1. use a:\wage1
      2. infile
      3. save
      4. clear
    7. Looking At and Summarizing Your Data
      1. list wage edu
      2. list wage edu in 1/20
      3. list married age if hours == 0
      4. list married age if union == 1 & hours >= 40
      5. sum wage edu tenu married
      6. sum wage edu tenure married, detail
      7. sum wage edu tenu married if year == 1990
      8. sum wage edu in 1/20
      9. tab married
      10. by married: sum wage
      11. drop if ~union (or drop if union == 0)
      12. keep if (year >= 1986) & (year <= 1990)
      13. drop in 2672
    8. Defining New Variables
      1. generate expsq = exp^2
      2. gen lwage = ln(wage) if hours > 0
      3. gen ccrime = crime - crime[_n-1] if year == 1987
      4. replace expsq = exp^2
    9. General Plotting Commands
    10. General commands
    11. Generating new variables
    12. Regression
      1. Important Notes on the "stem" command
    13. Summary of These and Other Commands

 

 

 

 

Basic Estimation Commands

 

Ordinary Least Squares

 

For OLS regression, we use the command reg. Immediately following reg is the dependent variable, and after that, all of the independent variables (order of the independent variables is not, of course, important). An example is:

 

reg lwage edu exp expsq tenu union

 

This command produces OLS estimates, standard errors, t statistics, confidence intervals, and a variety of other statistics usually reported with OLS output. Unless a specific range of observations or logical statement is included, Stata uses all possible observations in obtaining estimates. It does not use observations for which data on the dependent variable or any of the independent variables are missing. Thus, you must be aware that adding another explanatory variable can result in fewer observations being used in the regression if some observations are missing for that variable. If a variable called "motheduc" (mother's education) is added to the independent variables in the above regression, and this variable is missing for, say, 10 percent of individuals, then the sample size used in obtaining the OLS estimates decreases accordingly.

Sometimes we want to restrict our regression analysis based on the size of one or more of the variables. For example,

 

reg lwage edu exp expsq tenu union if edu<16

 

restricts the analysis to individuals with fewer than 16 years of education. The regression can also be restricted to a particular year using a similar if statement, or to a particular observation range using in m/n.

Predicted values are obtained using the predict command. Thus, if a regression is run with lwage as the dependent variable, to get the fitted values type:

 

predict lwagehat

 

The choice of the name lwagehat is arbitrary, subject to its being no more than eight characters and its not already being used. The predict command saves the fitted values for the most recently run regression.

The residuals can be obtained by:

 

predict uhat, resid

 

where again the name uhat is arbitrary.

You can test multiple linear restrictions after an OLS regression by using the test command. Consider a regression which controls for four Census regions: north, south, east, and west. Because the regression includes a constant term, we can identify parameters for only three of the four regional dummy variables. Suppose we exclude the "west" dummy from the regression and we wish to test whether there are any "regional effects" in the data. To test whether the coefficients for the north, south, and east dummy variables are jointly zero, you can just list the variables hypothesized to have no effect:

 

test north south east

 

The result of this test tells you whether the three regional indicators can be excluded from the previously estimated model. Along with the value of the F-statistic, Stata also reports a p-value. As with the predict command, test is applied to the most recently estimated model.
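
You can also use test for other linear restrictions on the same regression; for example, to test whether two of the regional coefficients are equal (a hypothetical follow-up to the regression above):

test north = south

Stata again reports an F statistic and its p-value.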

OLS estimates with heteroskedasticity-robust standard errors and t statistics can be obtained using the robust option. Remember, this is just OLS, but the asymptotic variance is estimated in a heteroskedasticity-robust fashion. For example,

 

reg lwage educ exper expersq married black, robust

 

Instrumental Variable (Two Stage Least Squares)

 

The reg command can also be used to estimate models by 2SLS. After specifying the dependent variable and the explanatory variables - which presumably contain at least one endogenous variable (that is correlated with the error) - one then lists all of the exogenous variables as instruments in parentheses. Naturally, the list of instruments does not contain any endogenous variables.

 

An example of a 2SLS command is:

 

ivreg lwage (edu = married) exp expsq tenu union

ivreg lwage (edu = married exp expsq tenu union) exp expsq tenu union

 

This command produces 2SLS estimates, standard errors, t statistics, and so on. By looking at this command, we see that edu is an endogenous explanatory variable in the lwage equation, while exp, expsq, tenu, and union are assumed to be exogenous explanatory variables. The variable married is assumed to be an additional exogenous variable that does not appear in the lwage structural equation but should have some correlation with edu; it appears in the instrument list along with the exogenous explanatory variables. The two commands above are equivalent, because ivreg automatically includes the exogenous explanatory variables among the instruments.

 

The order in which we list the instruments is not important. The necessary (order) condition for identification is that the total number of instruments, the excluded exogenous variables plus the exogenous explanatory variables, be at least as large as the total number of explanatory variables. In this example there are five instruments (married, exp, expsq, tenu, and union) for five explanatory variables (edu, exp, expsq, tenu, and union), so the order condition holds and the equation is just identified.

 

In the previous example, we allowed for just one endogenous explanatory variable, "edu". Allowing for more than one endogenous explanatory variable is also easy. After 2SLS, we can test multiple restrictions using the test command, just as with OLS.
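
As a sketch of the syntax with two endogenous explanatory variables, suppose tenu is also treated as endogenous and that motheduc (mother's education, mentioned earlier) is available as a second excluded instrument along with married:

ivreg lwage (edu tenu = married motheduc) exp expsq union

With two excluded instruments for two endogenous variables, the order condition still holds.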

 

Fixed Effect Model

 

First declare the panel structure: iis sets the cross-section identifier (here id) and tis sets the time variable (here year). Then estimate the fixed effects model:

iis id

tis year

xtreg lwage edu exp expsq tenu union, fe

 

 


 

Random Effect Model

iis id

tis year

xtreg lwage edu exp expsq tenu union, re

 

Between Effect Model

 

iis id

tis year

xtreg lwage edu exp expsq tenu union, be

 

Logistic Model

 

logit married edu exp lwage

probit married edu exp lwage

 

Cox-Hazard Model

 

cox tenu edu lwage union married

 

 

 

 

 

Estimation and test techniques

 

  1. Time Series

  2. Duration Model

  3. Simultaneous Equation

  4. Constrained Regression

  5. Test and Diagnostics

 

 

Time Series Analysis

 

Generating lags and leads: gen xlag1 = x[_n-1]

 

If you sort the data by date, then a one-period lag of the variable x can be obtained by typing

 

. gen xlag1= x[_n-1]

 

Of course you can use as many lags as you want.

 

. gen xlag2=x[_n-2]

 

Likewise, you can lead the data by using _n+1, _n+2, and so on.
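
For example, a one-period lead (the name xlead1 is arbitrary, just like xlag1):

. gen xlead1 = x[_n+1]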

 

If you are a serious user of time-series data, you would be better served by the time-series operators. The time-series operators are L. (lag), F. (lead), D. (difference), and S. (seasonal difference). You must first declare the time variable using the tsset command.

 

Time series operators: reg consume gnp L.gnp L2.gnp D.gnp D2.gnp S2.gnp

 

. tsset  year, yearly        /*declare dataset to be time-series data*/

. reg consume gnp L.gnp L2.gnp D.gnp D2.gnp S2.gnp

. sum interest if F.gnp<gnp

 


 

Autocorrelation: corrgram


corrgram lists a table of the autocorrelations, partial autocorrelations, and Q statistics.  It will also list a character-based plot of the autocorrelations and partial autocorrelations. The ac command produces a correlogram (the autocorrelations) with pointwise confidence intervals obtained from the Q statistic.

 

Box-Pierce Q: wntestq

The wntestq command produces the Box-Pierce (portmanteau) Q test statistic. The null hypothesis is that the autocorrelation coefficients are jointly equal to zero.

 

. corrgram r

. corrgram r, lags(5)

. wntestq r
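
The ac command mentioned above produces the graphical correlogram with confidence bands; for example:

. ac r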

 

Augmented Dickey-Fuller test: dfuller

dfuller performs the augmented Dickey-Fuller test for a unit root. The test regresses the differenced variable on its lagged level and a user-specified number of lagged differences. Optionally, a trend term may be included (trend) and the underlying regression displayed (regress). The null hypothesis is that the series has a unit root.

 

 . dfuller r

 . dfuller r, lags(3) trend regress

 

Autoregressive Integrated Moving Average (ARIMA): arima

arima estimates a model of depvar on varlist where the disturbances are allowed to follow a linear autoregressive moving-average (ARMA) specification. The dependent and independent variables may be differenced or seasonally differenced to any degree.  When independent variables are not specified, these models reduce to autoregressive integrated moving-average (ARIMA) models in the dependent variable.  Missing data are allowed and are handled using the Kalman filter. arima allows time-series operators in the dependent variable and independent variable lists and it is often convenient to make extensive use of these operators.

 

    . arima r, arima(1,1,1)

    . arima D.r, ar(1) ma(1)  /*same as above*/

    . arima r, arima(3,2,4)

    . arima D2.r, ar(1/3) ma(1/4)   /*same as above*/

 

There are other test commands, such as the Granger causality test and cointegration tests. The ado-files are not installed by default, but you can find and download them from the STB (Stata Technical Bulletin) web site. If your computer is connected to a network, just click on the search result.

 

Duration Model

If the data set is already duration (survival-time) data, then you can use a duration model; for example, either the Cox or the Weibull model. You can also use the hr option to report hazard ratios instead of coefficients.

 

Cox Model: . cox studytim drug age, dead(died) hr

     . cox studytim drug age, dead(died)

     . cox studytim drug age, dead(died) hr

 

If you are a serious user of duration data, you might want to use the stset command: stset declares the data to be survival-time data.

 

Survival-time data: . stset studytim, failure(died)

     . stset studytim, failure(died)

     . stset studytim, failure(outcome==2)

    . stset studytim, failure(outcome==2) id(patientno)  /*multiple failure*/

 

Once declared as duration data, you can use

     . cox drug age

      . cox drug age, hr

 

Weibull Model: weibull

 

Compare these results with those above. You can also use weibull instead of cox.
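
A minimal sketch, assuming the older weibull command takes the same basic syntax as cox:

. weibull studytim drug age, dead(died)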

 

 

Simultaneous Equation

 

Seemingly Unrelated Regression Models: . sureg (gdp g i) (m2 r)

 

sureg estimates Zellner's seemingly unrelated regression model. This is in fact FGLS, and it is especially useful when every equation contains only exogenous explanatory variables. Suppose GDP = f(G, I) and M2 = g(r). Then do:

 

. sureg (gdp g i) (m2 r)

Compare the result with the two separate OLS results.

 

Now suppose some equations contain endogenous variables among the explanatory variables. You are probably already familiar with the 2SLS method (ivreg). Another way to proceed is to estimate the system of structural equations with the reg3 command. Suppose GDP = f(G, I, M2) and M2 = g(GDP, r). Notice that each dependent variable now appears as an explanatory variable in the other equation. reg3 can also estimate systems of equations by seemingly unrelated regression (SUR).

 

 

. reg3 (gdp m2 g i) (m2 gdp r)

. reg3 (gdp m2 g i) (m2 gdp r), sure

 

Constrained Regression

 

cnsreg

constraint (constraint list)

cnsreg estimates constrained linear regression models. constraint(constraint list) is not optional; it specifies the constraint numbers of the constraints to be applied.  Constraints are defined using the constraint command.

 

 

        . constraint def 1 price = weight

        . cnsreg mpg price weight, constraint(1)

        . constraint def 2 gratio = -foreign

        . cnsreg mpg price weight displ gratio foreign length, c(1-2)

        . constraint define 3 _cons = 0

        . cnsreg mpg price weight displ gratio foreign length, c(1-3)

 

Test and Diagnostics

 

Linear tests: . test m2 g

Non-linear tests: . testnl _b[m2]/_b[g] == 1

 

test tests linear hypotheses about the estimated parameters from the most recently estimated model. testnl tests nonlinear (or linear) hypotheses about the estimated parameters from the most recently estimated model.

 

        . reg gdp m2 g i

        . test m2 g

        . test m2 - 5*g + 3 = 0

        . testnl _b[m2]/_b[g] == 1

        . testnl (_b[m2]/_b[g] == 1) (_b[m2] == 2)

 

 

test performs Wald tests. See help lrtest for likelihood-ratio tests.

Likelihood-ratio test: . lrtest

 

Other diagnostics

 

rvfplot       Graph residual-versus-fitted plot

rvpplot      Graph residual-versus-predictor plot

ovtest        Perform Ramsey RESET test for omitted variables

dwstat       Compute Durbin-Watson d statistic if the data is declared as time series

vif             Calculate VIFs (variance inflation factors)
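
These commands are issued after estimating a regression. For example, continuing with the wage equation (a sketch; dwstat additionally requires the data to be declared as time series, as noted above):

        . reg lwage edu exp expsq tenu union

        . rvfplot

        . ovtest

        . vif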

 

 

Using Stata as a Calculator and Computing p-values


 

Calculator

Stata can be used to compute a variety of expressions, including certain functions that are not available on a standard calculator. The command to compute an expression is display (disp or di for short). The command:

 

di .048/(2*.0016)

 

will return "15." We can use the di command to compute natural logs, exponentials, squares, and so on. For example:

 

di exp(3.5 + 4*.06)

 

returns the value 42.098 (approximately). These calculations can be performed on most calculators.

 

P-values

More importantly, we can use di to compute p-values after computing a test statistic. The command:

di normprob(1.58)

 

gives the probability that a standard normal random variable is less than 1.58 (about .943). Thus, if a standard normal test statistic takes on the value 1.58, the one-sided p-value is 1 - .943 = .057. Other functions are geared to give the p-value directly.
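
For a two-sided alternative based on the standard normal, you can double the tail area directly; for example:

di 2*(1 - normprob(1.58))

returns approximately .114.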

 

di tprob(df, t)

 

returns the p-value for a t test against a two-sided alternative (t is the absolute value of the “t” statistic and “df” is the degrees of freedom). For example, with df= 31 and t = 1.32, the command returns the value .196. To obtain the p-value for an F test, the command is:

 

di fprob(df1, df2, F)

 

where “df1” is the numerator degrees of freedom, “df2” is the denominator df, and F is the value of the F statistic. As an example:

 

di fprob(3, 142, 2.18)

 

returns the p-value .093.

 

 

 

 

 

 

Getting Started


Editing the Command Line

 

Stata has several shortcuts for entering commands. Two useful keys are Page Up and Page Down. If at any point you hit Page Up, the previously executed command appears on the command line. This can save a lot of typing because you can hit Page Up and edit the previous command. Among other things, this makes adding an independent variable to a regression, or expanding an instrument list, easier. Hitting Page Up repeatedly allows you to traverse through previously executed commands until you find the one you want. Hitting Page Down takes you back down through all of the commands.

It is easy to edit the command line. Hitting Home on the keyboard takes the cursor to the beginning of the line; hitting End moves the cursor to the end of the line. The Delete key deletes a single character to the right of the cursor; holding it down will delete many characters. The Backspace key (a left arrow on many keyboards) deletes a character to the left of the cursor. Hitting the left arrow moves you one character to the left, and the right arrow takes you one character to the right. You can hold down either to move several characters. The Ins key allows you to toggle between insert and overwrite modes. Both of these modes are useful for editing commands.

 

Help and Search

 

help regress

search fixed effect

 

Creating log (procedure and output) file

 

Suppose you want to print out the tutorial or your results. To do so, you should create an output file. In particular, for involved projects, you should keep a record of what you have done (data transformations, regressions, and so on). To do this you can create a log file. With a diskette in the A: drive, before doing any analysis, type:

 

log using a:\wage1, replace

 

This will create the file wage1.log on the diskette in the A: drive. Alternatively, just click the log start/stop icon and follow the directions.

 

Stata log files are just standard ASCII files. They can also be sent directly to a printer, but the default font size and format are not ideal, so the best way is to read the log in MS Word or any text editor (hint: Courier New at 8 points works well). As an exercise, start a log, type tutorial intro in Stata, and then print the resulting log file. When you are finished, you can close the log file for good:

 

log close

 

After typing this command, log on will not reopen the log file. If you decide to add onto the end of an existing log file, type:

 

log using a:\wage1, append

 

Increasing the amount of memory

 

The default is only 1Mb. If your data set is bigger than 1Mb, you cannot even open it with this default. Allocate at least twice as much memory as the size of your data set. There are two ways to increase memory. The first is to type, at the beginning of your session (for 5Mb):

 

set memory 5M

 

However, allocating memory every time is cumbersome, so you can change the default allocation instead. Create a shortcut to the wstata.exe file, right-click it, choose Properties, and select the Shortcut tab. The Target field probably says something like

 

            C:\stata\wstata.exe /k1000

 

This tells Stata to allocate 1000k (1Mb). If you change the option to, say, /k5000 (5Mb), Stata will allocate 5Mb. Change the number as you wish; it needs to be a multiple of 1000. If your computer's physical memory is smaller than the allocation, Stata will use virtual memory (not recommended).

 

Batch or interactive

 

Using commands from keyboard

Using a do file

Of course, it is possible to have Stata execute the commands stored in a file (batch mode) just as if they were entered from the keyboard (interactive mode), using the do command; if the filename is specified without an extension, .do is assumed. Such a file is called a "do-file". You will find this batch mode extremely helpful; refer to the ancillary handouts.
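
As a minimal sketch (the file name myjob.do is hypothetical), a do-file might contain:

log using a:\wage1, replace
use a:\wage1
sum wage edu tenu
reg lwage edu exp expsq tenu union
log close

You would then run it by typing do a:\myjob (the .do extension is assumed).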

 

 

Reading Data Files

 

The command to read a Stata file is “use”. Of course you can instead use the Stata tool bar. If the Stata file is called wage1.dta, and the file is on the diskette in the A: drive, the command is:

 

use a:\wage1

 

After entering this command, the data file wage1 is loaded into memory (note that Stata is case sensitive). However, not every data set is in Stata format. If the data are not in Stata format, you must first convert them. There are many ways to do this; here are some hints.

 

1. If you want to type in data, just use the Data Editor in Stata. After you input the data, save it by pulling down File and choosing Save As. It is easy. The Data Editor is compatible with MS Excel, so as long as your data are in Excel format you can simply copy and paste them into the editor.

 

2. If the data are saved in another software format (for example, Excel, SPSS, SAS, dBase, Limdep, RATS, Gauss, and so on), you can use the Stat/Transfer program. This is the easiest way to convert a data set from one format to another. If you expect to work heavily with microdata in the future, consider buying it.

 

3. Or you can create a Stata data set from an ASCII (text) file. In many cases your data set is already an ASCII file, or you can convert it into one. The infile command reads an ASCII file; the file must be organized with each observation in its own row and each variable in its own column.

 

a) If the numbers are separated by spaces. For example, suppose a wage data set is organized as

 

10.75   12   6   1   0
16.50   16   3   0   0
12.10   12   8   1   1

 

Each row corresponds to an individual. In the example above, the first variable is hourly wage, the second is years of education, the third is experience, the fourth is a dummy variable equal to unity for a unionized firm and zero for a non-unionized firm, and the last variable is an indicator for marital status (which equals unity for married individuals). The variables in this example are equally spaced, but this spacing is not essential. If these data are in the file wage.raw on the A: drive, then the command:

 

infile wage edu exp union married using a:\wage.raw

 

reads in each row of data, and stores the data on each variable into the appropriate name. Once the ASCII file has been read, it is a good idea to save it as a Stata file. The command:

 

save a:\wage2 

 

creates the Stata file wage2.dta on the A: drive. Notice that the extension .dta denotes a Stata file. If you are working with multiple years of data for each individual, it is a good idea to include in your data set variables indicating the year and the id of each observation.

 

b) If the numbers are not separated by spaces or tabs, you should create a dictionary file to read the data. Create an ASCII file named exer1.dct as follows, assuming the raw data are in a file named original.dat:

 

dictionary using C:\original.dat {
    year        %2f
    firmsize    %1f
    sampling    %3f
    union       %1f
    idnumber    %4f
    _column(40)
    sex         %1s
    _column(44)
    msts        %1f
    emptype     %1f
    shift       %2f
    expyears    %1f
}

Here sex is a string variable (s) and the others are numeric (f). %2f means the value occupies 2 columns. If you do not want to read all of the columns, you can jump ahead, for example to column 40 where the variable sex starts, using _column(40).

 

            Then type

 

infile using a:\exer1.dct

 

For more advanced features on inputting data, you can refer to the Stata User’s Guide which is published by the Stata Press.

 

If you have completed your analysis with a file such as wage1.dta, and then wish to use a different data set, you simply clear the existing data set from memory. The command to use is

 

clear

 

When you issue this command, any changes you made to the data set during your current Stata session will be lost.

 

 

Looking At and Summarizing Your Data

 

After reading in a data file, you can get a list of the available variables by typing des. Often a short description has been given to each variable. To look at the observations of one or more variables, use the list command. For example, to look at the variables wage and edu for all observations, type:

 

list wage edu

 

This will list, one screen at a time, the data on wage and edu for every person in the sample. (Missing values in Stata are denoted by a period.) If the data set is large, you may not wish to look at all observations. You can always stop the listing by hitting Ctrl-Break on the keyboard. In fact, Ctrl-Break can be used to interrupt any Stata command.

Alternatively, there are various ways to restrict the range of the listing (and of many other Stata commands). To look at the first 20 observations on wage and edu, type:

 

list wage edu in 1/20

 

Rather than specify a range of observations, a logical command can be used instead. For example, to look at the data on marital status and age for people with zero hours worked type:

 

list married age if hours == 0

 

Notice how the double equal sign is used by Stata to test for equality. The other relational operators in Stata are > (greater than), < (less than), >= (greater than or equal), <= (less than or equal), and ~= (not equal). If you want to restrict attention to non-union members, type:

 

list married age if union == 0

 

The variable union is a binary indicator equal to unity for union members, and zero otherwise. The ~ is the logical "not" operator. We can combine many different logical statements. The command:

 

list married age if union == 1 & hours >= 40

 

restricts attention to union members who work at least 40 hours a week. (Logical and is denoted by “&” and logical or is denoted by "|" in Stata.)

Two useful commands for summarizing data are the sum and tab commands. The sum command computes the sample average, standard deviation, and the minimum and maximum values of all (nonmissing) observations. Because this command tells you how many observations were used for each variable in computing the summary statistics, you can easily find out how many missing data points there are for any variable. Thus, the command:

 

sum wage edu tenu married

 

computes the summary statistics for the four variables listed. Because married is a binary variable, its minimum and maximum values are not very interesting. The average value reported is simply the proportion of people in the sample who are married.

To obtain more summary information for each of these variables you must type:

 

sum wage edu tenure married, detail

 

By adding the detail option, Stata provides an extensive list of summary statistics for each of these variables including the median and other percentiles of the empirical distribution.

Stata also provides summary statistics for any subgroup of the sample if you add a logical statement:

 

sum wage edu tenu married if union

 

If the data is a pooled cross section or a panel data set, to summarize for 1990 type:

 

sum wage edu tenu married if year == 1990

 

The sample can be restricted to certain observation ranges by using the in m/n option, just as illustrated in the list command:

 

sum wage edu in 1/20

 

For variables that take on a relatively small number of values - such as the number of children or the number of times an individual was arrested during a year - you can use the tab command to get a frequency tabulation:

 

tab married

 

This command reports the frequency associated with each value of married in the sample. You can also combine this command with logical statements or restrict the range of observations, as in the examples below.
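
For example (using the year variable from the panel examples above):

tab married if year == 1990

tab married in 1/100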

 

To summarize one variable by the categories of another, say wage by marital status, you will need to use the sort command. First sort the data by married by typing:

 

sort married

 

Once the data are sorted, you can summarize wage within each category by typing:

 

by married: sum wage

 

To summarize wage by another variable, say year, you will need to re-sort the data by year and then use the sum command again, as shown below.
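
For example:

sort year

by year: sum wage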

Sometimes, you may want to restrict all subsequent analysis to a particular subset of the data. In such cases it is useful to delete the data that will not he used subsequently. This can be done using the drop or keep commands. For example, if we want to analyze only union members in a wage equation, then we can type:

 

drop if ~union   (or: drop if union == 0)

 

This drops everyone in the sample who is not a union member. Or, to analyze only the years between 1986 and 1990 (inclusive), we can type:

 

keep if (year >= 1986) & (year <= 1990)

 

In order to drop a particular observation, say observation 2672, you must type:

 

drop in 2672

 

It is important to know that the data dropped are gone from the current Stata session. If you want to get them back, you must reread the original data file. Along these lines, do not make the mistake of saving the smaller data set over the original one, or you will lose a chunk of your data.

BE SURE TO KEEP AN EXTRA BACKUP FILE OF YOUR STATA DATA SETS (for both beginners and experts!!). 

 

 

 

Defining New Variables

 

It is easy to create variables that are functions of existing variables. In Stata, this is accomplished using the gen command (short for generate). For example, to create the square of experience, type:

 

generate expsq = exp^2

 

The new variable, expsq, can be used in a regression or any place else Stata variables are used. (Stata does not allow us to put expressions such as exp^2 into regression commands; we must create the variables first.) When creating Stata variables, you should remember that variable names cannot be longer than eight characters; Stata will refuse to accept longer names in the gen command (and in all other Stata commands). If an observation has a missing value for exp then, naturally, expsq will also be missing for that observation. In fact, Stata will tell you how many missing observations were created after every gen command. If Stata reports nothing, then no missing observations were generated.

To find the natural log of a variable such as wage, type:

 

gen lwage = ln(wage)

 

If wage is missing, then lwage will also be missing. For functions such as the natural log, there is an additional consideration: ln(wage) is not defined for wage <= 0. When a function is not defined for particular values of the variable, Stata sets the result to missing.

Logical commands can be used to restrict observations used for generating new variables. For example:

 

gen lwage = ln(wage) if hours > 0

 

creates ln(wage) for people who work (and therefore whose wage can be observed). Using the gen command without the statement if hours > 0 has the same effect in this example because wage is missing for those individuals who do not work.

Creating interaction terms is easy:

 

gen blckedu = black*edu

 

where “*” denotes multiplication, the division operator is “/”, addition is “+”, and subtraction is “-”. The gen command also can be used to create binary variables. For example, if fratio is the funding ratio of a firm's pension plan, the dummy variable overfund can be created which is unity when fratio > 1 and zero otherwise:

 

gen overfund = fratio > 1

 

The way this command works is that the logical statement on the right-hand side is evaluated to be true or false; true is assigned the value unity, and false is assigned the value zero. So overfund is unity if fratio > 1 and overfund is zero if fratio <= 1. As another example, we can create year dummies using a command such as:

 

gen y85 = (year == 1985)

 

where year is assumed to be a variable defined in the data set. The variable y85 is unity for observations corresponding to 1985, and zero otherwise. We can do this for each year in our sample to create a full set of year dummies, as illustrated below.
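
For example, if the sample also includes 1986 and 1987:

gen y86 = (year == 1986)

gen y87 = (year == 1987)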

 

The gen command also can be used to difference data across years. Suppose that, for a sample of cities, we have two years of data for each city (say 1982 and 1987). The data are stored so that the two years for each city are adjacent in the file, with the 1982 observation preceding the 1987 observation. To eliminate unobserved "fixed" effects, say in relating city crime rates to expenditures on crime and other city characteristics, we can relate changes over time. Stata stores the changes between 1982 and 1987 alongside the 1987 data. It is important to remember that for 1982 there is no change from a previous time period because we do not have data on a previous time period. Therefore, we should define the change data so that it is missing in 1982. For example:

 

gen ccrime = crime - crime[_n-1] if year == 1987

 

gen cexpend = expend - expend[_n-1] if year == 1987

 

The variable "_n" is the reserved Stata symbol for the current observation; thus, _n-1 is the variable lagged once. The variable ccrime is the change in crime between 1982 and 1987; cexpend is the change in expenditures between 1982 and 1987. These new change variables are stored next to the 1987 observations, and the corresponding change variables for 1982 are missing denoted as a ".". We can then use these change variables in a regression analysis, or some other analysis.

 

The replace command is useful for correcting mistakes in definitions and for redefining variables after the values of other variables have changed. Suppose, for example, that when creating the variable expsq you mistakenly typed "gen expsq = exp^3". One possibility is to drop the variable expsq and try again:

 

drop expsq

 

gen expsq = exp^2

 

(Note that the drop command really has two purposes: to delete all variables for certain observations and to drop one or more variables for all observations.) A faster route is to use the replace command:

 

replace expsq = exp^2

 

Stata explicitly requires the replace command to write over the contents in a previously defined variable.

 


 

General Plotting Commands

 

  1. Plot a histogram of a variable:
    histogram vname
  2. Plot a histogram of a variable using frequencies:
    histogram vname, freq
    histogram vname, bin(xx) norm
    where xx is the number of bins.
  3. Plot a boxplot of a variable:
    graph box vname
  4. Plot side-by-side box plots for one variable (vone) by categories of another variable vtwo. (vtwo should be categorical)):
    graph box vone, over(vtwo)
  5. A scatter plot of two variables:
    scatter vone vtwo
  6. A matrix of scatter plots for three variables:
    graph matrix vone vtwo vthree
  7. A scatter plot of two variables with the values of a third variable used in place of points on the graph (vthree might contain numerical values or indicate categories, such as male ("m") and female ("f")):
    scatter vone vtwo, symbol([vthree])
  8. Normal quantile plot:
    qnorm vname

 

General commands

 

  1. To compute means and standard deviations of all variables:
    summarize
    or, using an abbreviation,
    summ
  2. To compute means and standard deviations of select variables:
    summarize vone vtwo vthree
  3. Another way to compute means and standard deviations that allows the by option:
    tabstat vone vtwo, statistics(mean, sd) by(vthree)
  4. To get more numerical summaries for one variable:
    summ vone, detail
  5. See help tabstat to see the numerical summaries available. For example:
    tabstat vone, statistics(min, q, max, iqr, mean, sd)
  6. Correlation between two variables:
    correlate vone vtwo
  7. To see all values (all variables and all observations, not recommended for large data sets):
    list
    Hit the space bar to see the next page after "-more-" or type "q" to "break" (stop/interrupt the listing).
  8. To list the first 10 values for two variables:
    list vone vtwo in 1/10
  9. To list the last 10 values for two variables:
    list vone vtwo in -10/l
    (The end of this command is "minus 10" / "lowercase letter L".)
  10. Tabulate categorical variable vname:
    tabulate vname
    or, using an abbreviation,
    tab vname
  11. Cross tabulate two categorical variables:
    tab vone vtwo
  12. Cross tabulate two variables, include one or more of the options to produce column, row or cell percents and to suppress printing of frequencies:
    tab vone vtwo, column row cell
    tab vone vtwo, column row cell nofreq

 

Generating new variables

 

  1. General.
    1. Generate index of cases 1,2, ...,n (this may be useful if you sort the data, then want to restore the data to the original form without reloading the data):
      generate case= _n
      or, using an abbreviation,
      gen case=_n
    2. Multiply values in vx by b and add a, store results in vy:
      gen vy = a + b * vx
    3. Generate a variable with values 0 unless vtwo is greater than c, then make the value 1:
      gen vone=0
      replace vone=1 if vtwo>c
  2. Random numbers.
    1. Set numbers of observations to n:
      set obs n
    2. Set random number seed to XXXX, default is 1000:
      set seed XXXX
    3. Generate n uniform random variables (equal chance of all outcomes between 0 and 1):
      gen vname=uniform()
    4. Generate n uniform random variables (equal chance of all outcomes between a and b):
      gen vname=a + (b - a)*uniform()
    5. Generate n discrete uniform random variables (equal chance of all outcomes between 1 and 6)
      gen vname=1 + int(6*uniform())
      (These commands simulate rolling a six-sided die.)
    6. Generate normal data with mean 0 and standard deviation 1:
      gen vname= invnorm(uniform())
    7. Generate normal data with mean mu and standard deviation sigma:
      gen vname= mu + sigma * invnorm(uniform())

 

Regression

 

  1. Compute simple regression line (vy is response, vx is explanatory variable):
    regress vy vx
  2. Compute predictions, create new variable yhat:
    predict yhat
  3. Produce scatter plot with regression line added:
    graph twoway lfit vy vx || scatter vy vx
  4. Compute residuals, create new variable residuals:
    predict residuals, resid
  5. Produce a residual plot with horizontal line at 0:
    scatter residuals vx, yline(0)
  6. Identify points with largest and smallest residuals:
    sort residuals
    list in 1/5
    list in -5/l
    (The last command is "minus 5" / "lowercase letter L".)
  7. Compute multiple regression equation (vy is the response; vone, vtwo, and vthree are explanatory variables):
    regress vy vone vtwo vthree

 

Important Notes on the "stem" command

 

In some versions of Stata, there is a potential glitch with Stata's stem command for stem-and-leaf plots. The stem function seems to permanently reorder the data so that they are sorted according to the variable that the stem-and-leaf plot was plotted for. The best way to avoid this problem is to avoid doing any stem-and-leaf plots (do histograms instead). However, if you really want to do a stem-and-leaf plot you should always create a variable containing the original observation numbers (called index, for example). A command to do so is:
generate index = _n

If you do this, then you can re-sort the data after the stem-and-leaf plot according to the index variable:
sort index
Then, the data are back in the original order.

Summary of These and Other Commands

Here is a list of the commands demonstrated above and some other commands that you may find useful (this is by no means an exhaustive list of all Stata commands):

anova general ANOVA, ANCOVA, or regression
by repeat operation for categories of a variable
ci confidence intervals for means
clear clears previous dataset out of memory
correlate correlation between variables
describe briefly describes the data (# of obs, variable names, etc.)
diagplot distribution diagnostic plots
drop eliminate variables from memory
edit better alternative to input for Macs
exit leave Stata
generate creates new variables (e.g., generate years = last - first)
graph general graphing command (this command has many options)
help online help
histogram create a histogram graphic
if lets you select a subset of observations (e.g., list if radius >= 3000)
infile read non-Stata-format dataset (ASCII or text file)
input type in raw data
insheet read non-Stata-format spreadsheet with variable names on first line
list lists the whole dataset in memory (you can also list only certain variables)
log save or print Stata output (except graphs)
lookup keyword search of commands, often precursor to help
oneway oneway analysis of variance
pcorr partial correlation coefficients
plot text-mode (crude) scatterplots
predict calculates predicted values (y-hat), residuals (ordinary, standardized, and studentized), leverages, Cook's distance, standard error of a predicted individual y, standard error of the predicted mean y, and standard error of the residual from the regression
qnorm create a normal quantile plot
regress regression
replace lets you change individual values of a variable
save saves data and labels in a Stata-format dataset
scatter create a scatter plot of two numerical variables
set set Stata system parameters (e.g., obs and seed)
serrbar standard error bar chart
sort sorts observations from smallest to largest
stem stem and leaf display
summarize produces summary statistics (# obs, mean, sd, min, max) (has a detail option)
tabstat produces summary statistics of your choice
tabulate produces counts/frequencies for categorical data
test conducts various hypothesis tests; refers back to the most recent model fit (e.g., regress or anova); see the online help for info and examples
ttest one and two-sample t-tests
use retrieve previously saved Stata dataset

 

 

 

 

Reference

Content adapted from:

http://www2.hawaii.edu/~leesang/670/stata.htm

http://www.stat.uchicago.edu/~collins/resources/stata/stata-commands.html