hochspringer-blog - Tumblr blog

hochspringer-blog · 5 years ago

4th Week Exercise

STEP 1: Create graphs of your variables one at a time (univariate graphs).

Examine both their center and spread.

STEP 2: Create a graph showing the association between your explanatory and response variables (bivariate graph).

Your output should be interpretable (i.e. organized and labeled).

For your convenience, you can find the code I used here:

https://drive.google.com/file/d/1HhH60e3SfD5CmfgZGQZLfylL3nO7VURw/view?usp=sharing

Together with the output here:

https://drive.google.com/file/d/1K2HeFRvfKU-G_eBgAHE54oRXWn-f5tre/view?usp=sharing

Oh boy. First of all, I have to say I am sorry. This weekend the SAS online server was down: https://odamid.oda.sas.com/SASODAControlCenter/

HTTP Status 404 - Not Found

As I write this it is switching to:

Service Unavailable

The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.

Apache/2.4.6 (CentOS) Server at odamid.oda.sas.com Port 443

First I tried to solve the exercise with Python, but after 3 weeks of SAS I could not get that working in just a couple of hours. So I downloaded and installed the SAS University Edition. In the end it worked, but not with the code from the course examples, since GPLOT and GCHART are not included in that license. You will therefore find different statements in my code.

Good examples of the new syntax can be found here:

http://support.sas.com/resources/papers/proceedings10/154-2010.pdf

Paper 154-2010: Using PROC SGPLOT for Quick High-Quality Graphs. Susan J. Slaughter, Avocet Solutions, Davis, CA; Lora D. Delwiche, University of California, Davis, CA.

Step 1)

First, have a look at the three variables:

incomeperperson="The average income per person per year in US$"

lifeexpectancy="The average number of years a newborn child would live"

alcconsumption="The average alcohol consumption per adult (age 15+) per year in liters"

incomeperperson

Summary:

190 of the 213 countries reported the income per person per year in US$.

The mean is 8740.97$.

The standard deviation is 14263$.

The median is 2553.50$.

As you can see in the graph, the distribution is heavily concentrated in the low-income range.

lifeexpectancy

Summary:

191 of the 213 countries reported the average number of years a newborn child would live.

The mean is 69.75

The standard deviation is 9.70

The median is 73.1

The mode is 72.97

We see two peaks: a small one at around 57 and a larger one at around 75.

alcconsumption

Summary:

187 of the 213 countries reported the average alcohol consumption per adult (age 15+) per year in liters.

The mean is 6.689

The standard deviation is 4.899

The median is 5.920000

The mean is somewhat above the median, which suggests a mild skew toward higher consumption.
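
The gap between mean and median already hints at the shape of each distribution. As a rough cross-check, here is a minimal Python sketch (the analysis itself was done in SAS, so this is only an illustration) that plugs the mean, median and standard deviation reported above into Pearson's second skewness coefficient, 3 * (mean - median) / std. The function name `pearson_median_skewness` is made up for this example.

```python
# Summary statistics copied from the three summaries above.
summaries = {
    "incomeperperson": {"mean": 8740.97, "median": 2553.50, "std": 14263.0},
    "lifeexpectancy": {"mean": 69.75, "median": 73.1, "std": 9.70},
    "alcconsumption": {"mean": 6.689, "median": 5.92, "std": 4.899},
}


def pearson_median_skewness(mean, median, std):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std."""
    return 3.0 * (mean - median) / std


# One coefficient per variable.
skew = {name: pearson_median_skewness(**s) for name, s in summaries.items()}

for name, value in skew.items():
    direction = "right" if value > 0 else "left"
    print(f"{name}: {value:+.2f} ({direction}-skewed)")
```

With the numbers above this gives about +1.30 for income, -1.04 for life expectancy and +0.47 for alcohol consumption. A positive value points to a long tail toward high values, a negative value to a long tail toward low values.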

Step 2)

Here are the associations between my explanatory and response variables (bivariate graphs).

1) Association between Life Expectancy and Alcohol Consumption

2) Association between Life Expectancy and Income

3) Association between Alcohol Consumption and Income

Summary:

Two of the three bivariate graphs, 1) and 3), show no association.

In bivariate graph 2) I can see an association between Life Expectancy and Income. Where the life expectancy of a newborn is above 70, there is a big increase in the average annual income in US$. I would suggest drawing a line as follows.
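
Reading an association off a scatter plot is a judgment call, so it can help to put a number next to it. The following is a minimal Python sketch of the Pearson correlation coefficient, not the SAS code from this post. The helper names `pearson_r` and `drop_missing` are made up, and the toy values are hypothetical and NOT taken from the GapMinder data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def drop_missing(xs, ys):
    """Keep only the pairs where both values are present (not None)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    return [p[0] for p in pairs], [p[1] for p in pairs]


# Hypothetical toy values, NOT taken from the GapMinder data.
toy_life = [50.0, 60.0, None, 70.0, 80.0]
toy_income = [500.0, 1500.0, 3000.0, None, 30000.0]

life, income = drop_missing(toy_life, toy_income)
r = pearson_r(life, income)
print(f"r = {r:.2f} on {len(life)} complete pairs")
```

A value near +1 or -1 indicates a strong linear association, a value near 0 indicates none. Because Pearson's r only measures linear association, a sharp bend like the one described for graph 2) can give a lower value than the plot suggests.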

If you do not want to open the links to the files at the top, here is a screenshot of the code I wrote for your convenience.

Thank you very much for the effort to review my work. I hope it was somewhat clear. I apologize again that I had to use the University Edition due to the server downtime.

Best Regards

Hochspringer

hochspringer-blog · 5 years ago

3rd Week Exercise

STEP 1: Make and implement data management decisions for the variables you selected.

Data management includes such things as coding out missing data, coding in valid data, recoding variables, creating secondary variables and binning or grouping variables. Not everyone does all of these, but some is required.

For your convenience, here are the links to the PDF documents:

Link to the code of the week:

https://drive.google.com/file/d/1Q9fSp5jYxEfgLDkstQXhFuzFExWxSEX6/view?usp=sharing

Link to the SAS result output of the week:

https://drive.google.com/file/d/16tPcm4pT5mRqMDWaBK8PeApmP7fVaiQo/view?usp=sharing

Since I am working with the GapMinder data set, you may be aware by now that the data is continuous and not grouped. This led to a frequency table in which almost every value occurred only once.

The variables I am interested in are:

incomeperperson="The average income per person per year in US$"

lifeexpectancy="The average number of years a newborn child would live"

alcconsumption="The average Alcohol consumption per adult (age 15+) per year in liters"

So my decision was to group each variable into 5 categories, with the aim of having close to 20% of all values in each category while keeping the boundaries easy to understand.

Here are the 3 new variables that I created:

IncomeGroup="Grouping of average income per person per year in US$"

1: Less than or equal to 600

2: Greater than 600 and less than or equal to 1,860

3: Greater than 1,860 and less than or equal to 4,700

4: Greater than 4,700 and less than or equal to 15,000

5: Greater than 15,000

LifeGroup="Grouping average number of years a newborn child would live"

1: Less than or equal to 60

2: Greater than 60 and less than or equal to 70

3: Greater than 70 and less than or equal to 74.5

4: Greater than 74.5 and less than or equal to 84

5: Greater than 84

AlcGroup="Grouping average Alcohol consumption per adult (age 15+) per year in liters"

1: Less than or equal to 2

2: Greater than 2 and less than or equal to 5

3: Greater than 5 and less than or equal to 8

4: Greater than 8 and less than or equal to 11

5: Greater than 11

The second decision was how to deal with missing values. Their number differs by variable (22, 23 and 26 observations), and I did not want them counted in the frequencies. Therefore, it was important to set them to ‘.’ in the new variables as well.
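
The two decisions above (five groups with inclusive upper bounds, and missing values staying missing) can be sketched in a few lines. This is a Python illustration, not the SAS code from the linked file. The names `UPPER_BOUNDS` and `assign_group` are made up, and Python's `None` stands in for the SAS missing value ‘.’.

```python
from bisect import bisect_left

# Inclusive upper bounds of groups 1 to 4, as listed above.
# Group 5 is everything above the last bound.
UPPER_BOUNDS = {
    "IncomeGroup": [600, 1860, 4700, 15000],
    "LifeGroup": [60, 70, 74.5, 84],
    "AlcGroup": [2, 5, 8, 11],
}


def assign_group(value, upper_bounds):
    """Return the group number 1..5, or None if the value is missing."""
    if value is None:
        return None  # missing stays missing, so it is not counted later
    # bisect_left counts the bounds strictly below the value, so a value
    # equal to a bound falls into the lower group ("less than or equal").
    return bisect_left(upper_bounds, value) + 1


for income in (600, 600.5, 15000, 15000.01, None):
    print(income, "->", assign_group(income, UPPER_BOUNDS["IncomeGroup"]))
```

Using `bisect_left` rather than `bisect_right` is what makes the upper bounds inclusive: 600 lands in group 1, while anything just above 600 lands in group 2.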

STEP 2: Run frequency distributions for your chosen variables and select columns, and possibly rows.

Your output should be interpretable (i.e. organized and labeled).

Data Discussion ( Describe the frequency distributions in terms of the values the variables take, how often they take them, the presence of missing data, etc)

I will go through my 3 variables IncomeGroup, LifeGroup and AlcGroup. For each group, “Count X” states that the value is taken X times, followed by its percentage. At the end, “Missing X” states how many of the 213 countries in total have no data. As mentioned, the original values are grouped so that each group holds close to 20%.

IncomeGroup="Grouping of average income per person per year in US$"

1: Less than or equal to 600 / Count 41 (21.58%)

2: Greater than 600 and less than or equal to 1,860 / Count 35 (18.42%)

3: Greater than 1,860 and less than or equal to 4,700 / Count 38 (20.00%)

4: Greater than 4,700 and less than or equal to 15,000 / Count 39 (20.53%)

5: Greater than 15,000 / Count 37 (19.47%)

Missing 23 out of 213.

LifeGroup="Grouping average number of years a newborn child would live"

1: Less than or equal to 60 / Count 38 (19.90%)

2: Greater than 60 and less than or equal to 70 / Count 38 (19.90%)

3: Greater than 70 and less than or equal to 74.5 / Count 41 (21.47%)

4: Greater than 74.5 and less than or equal to 84 / Count 51 (26.70%)

5: Greater than 84 / Count 23 (12.04%)

Missing 22 out of 213.

AlcGroup="Grouping average Alcohol consumption per adult (age 15+) per year in liters"

1: Less than or equal to 2 / Count 40 (21.39%)

2: Greater than 2 and less than or equal to 5 / Count 41 (21.93%)

3: Greater than 5 and less than or equal to 8 / Count 36 (19.25%)

4: Greater than 8 and less than or equal to 11 / Count 33 (17.65%)

5: Greater than 11 / Count 37 (19.79%)

Missing 26 out of 213.
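
The percentages above are shares of the non-missing observations, not of all 213 countries. A small Python sketch (an illustration only, the post's own code is SAS) recomputes them from the counts listed above. The name `frequency_table` is made up for this example.

```python
TOTAL_COUNTRIES = 213

# Counts for groups 1 to 5, copied from the lists above.
counts = {
    "IncomeGroup": [41, 35, 38, 39, 37],
    "LifeGroup": [38, 38, 41, 51, 23],
    "AlcGroup": [40, 41, 36, 33, 37],
}


def frequency_table(group_counts, total=TOTAL_COUNTRIES):
    """Percent of non-missing observations per group, plus the missing count."""
    valid = sum(group_counts)
    percents = [round(100.0 * c / valid, 2) for c in group_counts]
    return {"valid": valid, "missing": total - valid, "percent": percents}


tables = {name: frequency_table(c) for name, c in counts.items()}

for name, table in tables.items():
    print(name, table)
```

The group counts add up to 190, 191 and 187 valid observations, which leaves exactly the 23, 22 and 26 missing values reported above.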

Here is the SAS code I used for that:

I added the new print command that we learned, to give an overview of the observations. With that statement it is possible to check that the mapping worked. It also helped to show that the missing values are not always missing for all 3 variables at the same time.

I did not post the output here, since it is too lengthy. For that, you may have a look at the files I shared in the links at the top of this article.

Looking forward to your feedback.

Best Regards

Hochspringer

hochspringer-blog · 5 years ago

2nd Week Exercise

This week we had to choose between SAS and Python programming. After experimenting with both, I went with SAS, as it gives a more appealing output of the results.

Here are the expected deliverables of the week:

1) your program

2) the output that displays three of your variables as frequency tables

3) a few sentences describing your frequency distributions

Here is a link to the PDF, in case you are not satisfied with the embedded pictures I created for your convenience.

https://drive.google.com/file/d/1jTOV_tExr-0l3O3Cg0K1cjbPARWM-5tu/view?usp=sharing

I am working with the GapMinder data set, which gives an overview of certain variables in different countries.

1) My program looks like this

2) The results look like this

The above picture shows the first variable, incomeperperson. As you can see, 23 of the 213 observations are missing. (To keep it readable, I did not show the data for entries 144 to 157.)

Here is an overview about all 3 variables used:

incomeperperson

lifeexpectancy

alcconsumption

As you can see, the distribution is all over the place. Basically all values are taken only once. Since the values of incomeperperson and lifeexpectancy are given so precisely, each occurs for only one nation. Only for alcconsumption do we see a few values taken more than once.
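
The effect is easy to reproduce: a frequency table counts exact values, so precisely measured continuous data produces almost one row per observation. Here is a minimal Python sketch of that idea (my actual program is in SAS). The function names are made up, and the toy values are hypothetical and NOT taken from the GapMinder data.

```python
from collections import Counter

# Hypothetical toy values, NOT taken from the GapMinder data.
# None marks a missing observation.
toy_values = [69.75, 73.131, 73.131, 48.673, None, 81.012]


def frequency_table(values):
    """Sorted (value, count) pairs, ignoring missing observations."""
    counts = Counter(v for v in values if v is not None)
    return sorted(counts.items())


def share_of_single_values(values):
    """Fraction of distinct values that occur exactly once."""
    counts = Counter(v for v in values if v is not None)
    singles = sum(1 for c in counts.values() if c == 1)
    return singles / len(counts)


table = frequency_table(toy_values)
share = share_of_single_values(toy_values)

for value, count in table:
    print(value, count)
print("share of values taken only once:", share)
```

When that share is close to 1, the frequency table says little about the distribution, and grouping the values into ranges is the usual remedy.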

Missing values (blank values that have not been given for the country):

23 missing for incomeperperson

22 missing for lifeexpectancy

26 missing for alcconsumption

Values for average income per person range from US$103 to US$105,000 per year.

Values for average life expectancy of a newborn range from 48 to 84 years.

Values for average alcohol consumption per adult range from 0.03 liters to 23 liters.

hochspringer-blog · 6 years ago

1st Week Exercise

STEP 1: Choose a data set that you would like to work with.

GapMinder is the data set I am most interested in.

STEP 2. Identify a specific topic of interest

A good question would be:

Is there a relation between income and life expectancy?

STEP 3. Prepare a codebook of your own (i.e., print individual pages or copy screen and paste into a new document) from the larger codebook that includes the questions/items/variables that measure your selected topics.

incomeperperson

2010 Gross Domestic Product per capita in constant 2000 US$. The inflation but not the differences in the cost of living between countries has been taken into account.

lifeexpectancy

2011 life expectancy at birth (years). The average number of years a newborn child would live if current mortality patterns were to stay the same.

STEP 4. Identify a second topic that you would like to explore in terms of its association with your original topic.

STEP 5. Add questions/items/variables documenting this second topic to your personal codebook.

Is there a relationship between income per person, alcohol consumption and life expectancy?

alcconsumption

2008 alcohol consumption per adult (age 15+), liters. Recorded and estimated average alcohol consumption, adult (15+) per capita consumption in liters of pure alcohol.

STEP 6. Perform a literature review to see what research has been previously done on this topic. Use sites such as Google Scholar (http://scholar.google.com) to search for published academic work in the area(s) of interest. Try to find multiple sources, and take note of basic bibliographic information.

[1] Income distribution and life expectancy / R.G. Wilkinson

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1881178/pdf/bmj00056-0043.pdf

[2] The Association Between Income and Life Expectancy in the United States, 2001-2014 / Raj Chetty

https://jamanetwork.com/journals/jama/article-abstract/2513561

[3] Income distribution and life expectancy: a critical appraisal / Ken Judge

https://www.bmj.com/content/311/7015/1282.short

[4] Income Differences in Life Expectancy: The Changing Contribution of Harmful Consumption of Alcohol and Smoking / Martikainen, Pekka

https://journals.lww.com/epidem/fulltext/2014/03000/Income_Differences_in_Life_Expectancy__The.6.aspx

STEP 7. Based on your literature review, develop a hypothesis about what you believe the association might be between these topics. Be sure to integrate the specific variables you selected into the hypothesis.

The literature concludes that life expectancy depends on several factors, and that these dependencies are not yet resolved. “Despite this popular acclaim, a careful review of the evidence does not support the hypothesis that inequalities in income distribution largely explain differences in average life expectancy among rich countries.” [3].

Similar results are presented in [2]. “In the United States between 2001 and 2014, higher income was associated with greater longevity and differences in life expectancy across income groups increased over time. However, the association between life expectancy and income varied substantially across areas; differences in longevity across income groups decreased in some areas and increased in others. The differences in life expectancy were correlated with health behaviors and local area characteristics.”

Income, alcohol consumption and life expectancy seem to be related.

“Alcohol and smoking have a major influence on income differences in mortality and, with the exception of smoking among men, their contribution is increasing. Without alcohol and smoking, there would have been little change in life expectancy differentials.” [4].

Summary of my research questions for the data source “GapMinder”:

Is there a relation between income and life expectancy?

Is there a relationship between income per person, alcohol consumption and life expectancy?

Variables:

incomeperperson / lifeexpectancy / alcconsumption
