STA101 Introduction to Statistics

**Part B:**

**Step 1**

Download a dataset on a subject that interests you and email it to me for approval before you start the analysis. The dataset should have at least 50 observations and at least two numerical variables. Don’t limit your population of interest to 50 observations just because the minimum is 50; e.g., if you use countries, use all countries for which the data exist. Make sure you understand the meaning of the variables that you use in your analysis. Here are some examples of data sources that you could consult:

**The Dataverse project (https://dataverse.harvard.edu/)**. This open-source research data repository contains thousands of datasets collected for research purposes. You can search it using terms that interest you, e.g., ‘United Nations’ or ‘firm’. Make sure that the data files for your chosen dataset are readily available.

**Journal data archives**. Many journals are participating in the open data movement and are providing access to datasets of published papers. See, e.g., the

**Sources of macro-level data**. These sources report the performance of all countries of the world in a recent year on social, economic, or demographic measures (GDP, population, total imports, total exports, child mortality, unemployment rate, inflation rate, etc.). Make sure you understand the meaning of the variables: e.g., don’t choose the gross domestic product (GDP) if you don’t know what gross domestic product means. Make sure the data are comparable. For example, don’t use GDP in national currencies for all countries of the world, because in that case Afghanistan’s GDP will be measured in Afghani, Albania’s GDP in Lek, etc., and the numbers will not be comparable; use GDP expressed in a common currency (like the US dollar) instead. The data should be cross-sectional (measured in a given period or at a point in time). Don’t use time-series data, that is, data where the cases are successive fixed periods (such as annual GDP, 1950–2015). Some sources are:

– World Bank, World Development Indicators (__http://databank.worldbank.org/data/reports.aspx?source=world-development-indicators__)

– Gapminder (__https://www.gapminder.org/data/__)

– The Penn World Table (__https://www.rug.nl/ggdc/productivity/pwt/__)

– United Nations, Human Development Report (__http://hdr.undp.org/en/__)

– United Nations Statistics Division (__http://unstats.un.org/unsd/default.htm__)

– United Nations, International Trade Statistics Yearbook (__http://comtrade.un.org/pb/__)

– OECD Data (__https://data.oecd.org/__)

**Other data sources.** You are allowed to search for other sources of interesting datasets to analyze. Just remember to email the dataset to me for approval before you start the analysis. Some sources are:

– __http://koaning.io/fun-datasets.html__

– __https://www.dataquest.io/blog/free-datasets-for-projects/__

Ideally, your dataset should contain both numerical and categorical variables. If your dataset contains only numerical variables, you will need to transform some of them into categorical (ordinal) variables. The following page explains how to do this in Stata:

__https://stats.idre.ucla.edu/stata/faq/how-can-i-recode-continuous-variables-into-groups/__

I can also help you with the transformation if needed.
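
As a minimal sketch of such a recoding, the commands below use the `auto` example dataset that ships with Stata. The new variable names (`mpg_group`, `mpg_3cat`), the cut-points, and the labels are illustrative assumptions only; choose groups that make sense for your own variable.

```stata
* Load Stata's built-in example dataset (1978 automobile data)
sysuse auto, clear

* Option 1: three groups of roughly equal size (terciles of mpg)
xtile mpg_group = mpg, nquantiles(3)

* Option 2: groups with cut-points you choose yourself
* (hypothetical cut-points; mpg is recorded in whole numbers here)
recode mpg (min/19 = 1 "Low") (20/25 = 2 "Medium") (26/max = 3 "High"), generate(mpg_3cat)

* Check how many observations fall in each group
tabulate mpg_group
tabulate mpg_3cat
```

Option 1 lets the data determine the boundaries; Option 2 is preferable when meaningful thresholds exist for your variable.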

**Step 2**

Write a paragraph about the chosen data source, chosen dataset, and chosen variables. Explain why you chose those variables, i.e., why the reader should be interested in your analysis.

**Step 3**

If your chosen dataset is not in Stata format, start Stata and import your data from the data file. Inspect your data to see if they were correctly imported.
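
A minimal sketch of the import and inspection, assuming the data arrive as a comma-separated file with variable names in the first row. The file name `mydata.csv` is hypothetical; replace it with the path to your own file.

```stata
* Import a CSV file whose first row holds the variable names
* (hypothetical file name)
import delimited "mydata.csv", varnames(1) clear

* For an Excel workbook, the corresponding command would be:
* import excel "mydata.xlsx", firstrow clear

* Variable names, storage types, and number of observations
describe

* Observation counts, means, and ranges; look for implausible values
summarize

* Print the first five observations
list in 1/5
```

If a variable that should be numerical appears as a string in the `describe` output, the import probably picked up non-numeric characters in that column.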

**Step 4**

Use Stata to provide the statistical output listed below. After each exercise, you should interpret the results by focusing on the questions and guidelines provided. Try to come up with a “story” behind your results by answering the following questions: Are the results as expected based on your knowledge of the literature or based on your intuition? In case the results are not as expected, what could be the possible reasons?

- (a) Choose two numerical variables and generate a scatterplot. Interpret the graph by focusing on the following questions: Are the variables associated or independent? What type of relationship, if any, do you observe? Are there any unusual cases?
- (b) Choose a numerical variable and generate a histogram. Describe the distribution of the variable. The description should incorporate the center, variability, and shape of the distribution. A good description of the shape should include modality and whether the distribution is symmetric or skewed. Also, note any unusual cases.
- (c) For the numerical variable chosen in (b), generate a boxplot. Do the histogram and boxplot tell the same story about the distribution of the variable? What additional information can you see in the boxplot that you couldn’t see in the histogram?
- (d) For the numerical variable chosen in (b), generate descriptive statistics. At a minimum, you should calculate the mean, the median, and the standard deviation. Interpret the descriptive statistics by looking at, e.g., how the mean relates to the median and how the standard deviation relates to the mean.
- (e) Choose a categorical variable and generate a bar plot or a pie chart. Interpret the graph by comparing group sizes.
- (f) Choose one numerical and one categorical variable, and generate a side-by-side boxplot. Interpret the graph by comparing the numerical data across groups of the categorical variable.
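
The exercises above, taken in order, can be sketched with the following commands. The example uses Stata's built-in `auto` dataset, so `price`, `weight`, and `foreign` are stand-ins for your own variables, not a suggestion of what to analyze.

```stata
* Load Stata's built-in example dataset
sysuse auto, clear

* Scatterplot of two numerical variables (y-variable first, then x-variable)
scatter price weight

* Histogram of one numerical variable, with counts on the vertical axis
histogram price, frequency

* Boxplot of the same numerical variable
graph box price

* Descriptive statistics for the same variable
summarize price, detail
tabstat price, statistics(mean median sd)

* Bar plot of a categorical variable (bar height = number of observations)
graph bar (count), over(foreign)

* Alternative: pie chart of the same categorical variable
graph pie, over(foreign)

* Side-by-side boxplot: numerical variable across groups
graph box price, over(foreign)
```

Each graph command replaces the previous graph window, so export or copy a graph before running the next one.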

**Step 5**

Email me the Stata data file, syntax, and statistical output, together with a Word file containing your interpretation of the statistical results. Copy and paste into the Word file the parts of the statistical output that you are interpreting. **Please do not submit PDF files.** I will comment on your assignments using Review/Track Changes and Review/New Comment in Word, so please make sure that you have these tools enabled to review my feedback.
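
One possible way to produce the data file and the output for submission is sketched below. The file names are hypothetical, and the plain-text log is only one option; use whatever workflow you are comfortable with.

```stata
* Start a plain-text log that records commands and their output
* (hypothetical file name)
log using "partB_output.log", text replace

* ... run your Step 4 commands here ...

* Save the dataset in Stata format (hypothetical file name)
save "partB_data.dta", replace

* Stop recording
log close
```

The resulting `.log` file opens in any text editor, which makes it easy to copy the relevant output into the Word file.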