hochspringer-blog
Blogging Coursera
5 posts
Insights of Data Analytics
hochspringer-blog · 5 years ago
4th week exercise
STEP 1: Create graphs of your variables one at a time (univariate graphs).
Examine both their center and spread.
STEP 2: Create a graph showing the association between your explanatory and response variables (bivariate graph).
Your output should be interpretable (i.e. organized and labeled).
For your convenience, the code I used is here:
https://drive.google.com/file/d/1HhH60e3SfD5CmfgZGQZLfylL3nO7VURw/view?usp=sharing
Together with the output here:
https://drive.google.com/file/d/1K2HeFRvfKU-G_eBgAHE54oRXWn-f5tre/view?usp=sharing
Oh boy. First of all I have to say I am sorry. This weekend the SAS Online Server has been down. https://odamid.oda.sas.com/SASODAControlCenter/
HTTP Status 404 - Not Found
As I write this it is switching to:
Service Unavailable
The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.
Apache/2.4.6 (CentOS) Server at odamid.oda.sas.com Port 443
First I tried to solve it with Python, but after 3 weeks of SAS I could not get that to work in just a couple of hours. So I downloaded and installed the SAS University Edition. In the end it worked, but not with the code from the examples, since GPLOT and GCHART are not included in the license. You will therefore find different statements in my code.
Good examples of the new syntax can be found here:
http://support.sas.com/resources/papers/proceedings10/154-2010.pdf
Paper 154-2010 // Using PROC SGPLOT for Quick High-Quality Graphs // Susan J. Slaughter, Avocet Solutions, Davis, CA /// Lora D. Delwiche, University of California, Davis, CA
Step 1)
First have a look at the three variables:
incomeperperson "The average income per person per year in US$"
lifeexpectancy "The average number of years a newborn child would live"
alcconsumption "The average Alcohol consumption per adult (age 15+) per year in liters";
incomeperperson
[Image: univariate graph of incomeperperson]
Summary:
190 out of the 213 countries reported the income per person per year in US$.
The mean is 8740.97 US$.
The standard deviation is 14263 US$.
The median is 2553.50 US$.
As you can see in the graph, the distribution is heavily concentrated at the low-income end. The mean lies far above the median, which points to a long tail of high-income countries.
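My actual code is SAS and is only linked at the top. As a rough illustration of how center and spread can be computed while skipping missing values, here is a small Python sketch. The function name `describe` and the sample values are made up for illustration; they are not GapMinder data.

```python
import statistics

def describe(values):
    """Summarize center and spread, skipping missing values (None)."""
    present = [v for v in values if v is not None]
    return {
        "n": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),  # sample standard deviation
    }

# Made-up, right-skewed sample (not GapMinder values).
sample = [500.0, 1200.0, None, 2500.0, 9000.0, 40000.0]
summary = describe(sample)
print(summary)
```

With a skewed sample like this one, the mean ends up well above the median, which is the same pattern as in the income summary above.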
lifeexpectancy
[Image: univariate graph of lifeexpectancy]
Summary:
191 out of the 213 countries reported the average number of years a newborn child would live.
The mean is 69.75.
The standard deviation is 9.70.
The median is 73.1.
The mode is 72.97.
We see two peaks: a small one at around 57 and the other at around 75.
alcconsumption
[Image: univariate graph of alcconsumption]
Summary:
187 out of the 213 countries reported the average alcohol consumption per adult (age 15+) per year in liters.
The mean is 6.689.
The standard deviation is 4.899.
The median is 5.92.
The mean lies somewhat above the median, which suggests a slight skew towards higher values.
Step 2)
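Here are the associations between my explanatory and response variables (bivariate graphs).
1) Association between Life Expectancy and Alcohol Consumption
[Image: bivariate graph of life expectancy and alcohol consumption]
2) Association between Life Expectancy and Income
[Image: bivariate graph of life expectancy and income]
3) Association between Alcohol Consumption and Income
[Image: bivariate graph of alcohol consumption and income]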
Summary:
Two of the three bivariate graphs, 1) and 3), show no association.
In bivariate graph 2) I can see an association between Life Expectancy and Income. Where the life expectancy of a newborn is above 70, we see a big increase in the average annual income in US$. I would suggest drawing a line as follows.
[Image: bivariate graph 2) with the suggested line drawn in]
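One simple way to check the "above 70" observation numerically is to split the countries at that life expectancy and compare the mean income on each side. The following Python sketch shows the idea. The function name `mean_income_by_threshold` and the data pairs are made up for illustration, and rows with missing values are assumed to be removed beforehand.

```python
from statistics import mean

def mean_income_by_threshold(rows, threshold=70.0):
    """Split (life_expectancy, income) pairs at the threshold
    and return the mean income of each side as (below, above)."""
    below = [income for life, income in rows if life <= threshold]
    above = [income for life, income in rows if life > threshold]
    return mean(below), mean(above)

# Made-up pairs (not GapMinder values).
rows = [
    (55.0, 400.0), (62.0, 900.0), (68.0, 2000.0),
    (73.0, 6000.0), (78.0, 20000.0), (81.0, 35000.0),
]
low, high = mean_income_by_threshold(rows)
print(low, high)
```

A large gap between the two means would support the visual impression from the graph, but it says nothing about cause and effect.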
If you do not want to open the links to the files at the top, here is a screenshot of the code I wrote for your convenience.
[Image: screenshot of the SAS code]
Thank you very much for taking the effort to review my work. I hope it was somewhat clear. And I apologize again that I had to use the University Edition due to the server downtime.
Best Regards
Hochspringer
hochspringer-blog · 5 years ago
3rd Week Exercise
STEP 1: Make and implement data management decisions for the variables you selected.
Data management includes such things as coding out missing data, coding in valid data, recoding variables, creating secondary variables and binning or grouping variables. Not everyone does all of these, but some is required.
For your convenience, here are the links to the PDF documents:
Link to the code of the week:
https://drive.google.com/file/d/1Q9fSp5jYxEfgLDkstQXhFuzFExWxSEX6/view?usp=sharing
Link to the SAS result output of the week:
https://drive.google.com/file/d/16tPcm4pT5mRqMDWaBK8PeApmP7fVaiQo/view?usp=sharing
Since I am working with the GapMinder data set, you may be aware by now that the data is continuous and not grouped. This led to a frequency table in which almost every value occurred only once.
The variables I am interested in are:
incomeperperson="The average income per person per year in US$"
lifeexpectancy="The average number of years a newborn child would live"
alcconsumption="The average Alcohol consumption per adult (age 15+) per year in liters"
So my decision was to group each variable into 5 categories, with the aim of having close to 20% of the values in each category while keeping the boundaries easy for a human to read.
Here are the 3 new variables that I created:
IncomeGroup="Grouping of average income per person per year in US$"
1: Less or Equal 600
2: Greater 600 and Less or Equal 1860
3: Greater 1860 and Less or Equal 4700
4: Greater 4700 and Less or Equal 15,000
5: Greater 15,000
LifeGroup="Grouping average number of years a newborn child would live"
1: Less or Equal 60
2: Greater 60 and Less or Equal 70
3: Greater 70 and Less or Equal 74.5
4: Greater 74.5 and Less or Equal 84
5: Greater 84
AlcGroup="Grouping average Alcohol consumption per adult (age 15+) per year in liters";
1: Less or Equal 2
2: Greater 2 and Less or Equal 5
3: Greater 5 and Less or Equal 8
4: Greater 8 and Less or Equal 11
5: Greater 11
The second decision was how to deal with missing values. Their number differs between the variables (22, 23 and 26 observations), and I did not want them counted in the frequencies. Therefore, it was important to set them to ‘.’ (missing) in the new variables as well.
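My real implementation is the SAS code linked at the top. To make the grouping rule concrete, here is a minimal Python sketch of the same idea, using the boundaries listed above. The names `to_group`, `INCOME_BOUNDS` and `LIFE_BOUNDS` are made up for illustration, and Python's `None` stands in for the SAS missing value ‘.’.

```python
from bisect import bisect_left

# Upper bounds of groups 1 to 4; everything above the last bound is group 5.
INCOME_BOUNDS = [600, 1860, 4700, 15000]
LIFE_BOUNDS = [60, 70, 74.5, 84]

def to_group(value, upper_bounds):
    """Return the group number (1-based) for a value.

    A value equal to a bound belongs to the lower group ("Less or Equal").
    Missing values (None) stay missing, so they are not counted later.
    """
    if value is None:
        return None
    return bisect_left(upper_bounds, value) + 1

print(to_group(600, INCOME_BOUNDS))    # boundary value, lowest group
print(to_group(20000, INCOME_BOUNDS))  # above the last bound
print(to_group(None, INCOME_BOUNDS))   # missing stays missing
```

Keeping the bounds in one list per variable makes it easy to adjust them until the groups come out close to 20% each.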
STEP 2: Run frequency distributions for your chosen variables and select columns, and possibly rows.
Your output should be interpretable (i.e. organized and labeled).
[Image: frequency tables for the grouped variables]
Data Discussion (Describe the frequency distributions in terms of the values the variables take, how often they take them, the presence of missing data, etc.)
I will go through my 3 variables IncomeGroup, LifeGroup and AlcGroup. For each group, “Count X” states that the value is taken X times, followed by its percentage of the non-missing values. At the end, “Missing X” states how many of the 213 countries have no data. As mentioned, the original values are grouped so that each group holds close to 20%.
IncomeGroup="Grouping of average income per person per year in US$"
1: Less or Equal 600 / Count 41 (21.58%)
2: Greater 600 and Less or Equal 1860 / Count 35 (18.42%)
3: Greater 1860 and Less or Equal 4700 / Count 38 (20.00%)
4: Greater 4700 and Less or Equal 15,000 / Count 39 (20.53%)
5: Greater 15,000 / Count 37 (19.47%)
Missing 23 out of 213.
LifeGroup="Grouping average number of years a newborn child would live"
1: Less or Equal 60 / Count 38 (19.90%)
2: Greater 60 and Less or Equal 70 / Count 38 (19.90%)
3: Greater 70 and Less or Equal 74.5 / Count 41 (21.47%)
4: Greater 74.5 and Less or Equal 84 / Count 51 (26.70%)
5: Greater 84 / Count 23 (12.04%)
Missing 22 out of 213.
AlcGroup="Grouping average Alcohol consumption per adult (age 15+) per year in liters";
1: Less or Equal 2 / Count 40 (21.39%)
2: Greater 2 and Less or Equal 5 / Count 41 (21.93%)
3: Greater 5 and Less or Equal 8 / Count 36 (19.25%)
4: Greater 8 and Less or Equal 11 / Count 33 (17.65%)
5: Greater 11 / Count 37 (19.79%)
Missing 26 out of 213.
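The percentages above are taken over the non-missing values only. As a quick cross-check, here is a small Python sketch that recomputes them from the IncomeGroup counts listed above. The function name `frequency_table` is made up for illustration; the counts and the total of 213 countries come from the tables in this post.

```python
def frequency_table(counts):
    """Map each group to (count, percent of non-missing values)."""
    total = sum(counts.values())
    return {group: (count, round(100 * count / total, 2))
            for group, count in counts.items()}

TOTAL_COUNTRIES = 213
income_counts = {1: 41, 2: 35, 3: 38, 4: 39, 5: 37}

table = frequency_table(income_counts)
missing = TOTAL_COUNTRIES - sum(income_counts.values())

for group, (count, percent) in table.items():
    print(f"{group}: Count {count} ({percent:.2f}%)")
print(f"Missing {missing} out of {TOTAL_COUNTRIES}.")
```

The same function can be run on the LifeGroup and AlcGroup counts to check those tables as well.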
Here is the SAS Code I used for that:
[Image: screenshot of the SAS code]
I added the new print command that we learned, to give an overview of the observations. With that statement it is possible to check that the mapping worked. It also helped to show that the missing values are not always missing for all 3 variables at the same time.
I did not post that output here, since it is too lengthy. For that, you may have a look at the files I shared in the links at the top of this article.
Looking forward to your feedback.
Best Regards
Hochspringer
hochspringer-blog · 5 years ago
2nd Week Exercise
In this week we had to choose between SAS and Python programming. After experimenting with both, I went with SAS as it gives a more appealing output of the results.
Here are the expected deliverables of the week:
1) your program
2) the output that displays three of your variables as frequency tables
3) a few sentences describing your frequency distributions
Here is a link to the PDF, in case the embedded pictures I created for your convenience are not sufficient.
https://drive.google.com/file/d/1jTOV_tExr-0l3O3Cg0K1cjbPARWM-5tu/view?usp=sharing
I am working with the GapMinder data set, which gives an overview of certain variables in different countries.
1) My program looks like this
[Image: screenshot of the SAS program]
2) The results look like this
[Image: frequency table for incomeperperson]
The above picture shows the first variable, incomeperperson. As you can see, out of the 213 records, 23 are missing. (To keep it readable I did not show the data for 144 to 157.)
Here is an overview about all 3 variables used:
incomeperperson
lifeexpectancy
alcconsumption
[Image: frequency tables for all 3 variables]
3)
As you can see, the distribution is all over the place. Basically all values are taken only once. Since the values of incomeperperson and lifeexpectancy are given so precisely, each occurs only once per nation. Only for alcconsumption do we see a few values taken more than once.
Missing values are blank values that have not been reported for the country:
23 missing for incomeperperson
22 missing for lifeexpectancy
26 missing for alcconsumption
Values for average income per person range from 103 US$/year to 105,000 US$/year.
Values for average life expectancy of a newborn range from 48 to 84 years.
Values for average alcohol consumption per adult range from 0.03 liters to 23 liters.
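To illustrate why a frequency table on a continuous variable is of little use, here is a small Python sketch that measures how many distinct values occur exactly once. My own program was written in SAS. The function name `share_of_unique_values` and the sample values are made up for illustration; they are not GapMinder data.

```python
from collections import Counter

def share_of_unique_values(values):
    """Return the share of distinct values that occur exactly once,
    ignoring missing values (None)."""
    present = [v for v in values if v is not None]
    freq = Counter(present)
    singles = sum(1 for count in freq.values() if count == 1)
    return singles / len(freq)

# Made-up sample: one value repeats, one is missing.
sample = [0.03, 5.92, 5.92, 6.69, None, 23.01]
share = share_of_unique_values(sample)
print(share)
```

A share close to 1 means almost every value is its own row in the frequency table, which is what happened with incomeperperson and lifeexpectancy.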
hochspringer-blog · 5 years ago
First Week Exercise
STEP 1: Choose a data set that you would like to work with.
GAPMINDER is the data set I am most interested in.
 STEP 2. Identify a specific topic of interest
A good question would be:
Is there a relation between income and life expectancy?
 STEP 3. Prepare a codebook of your own (i.e., print individual pages or copy screen and paste into a new document) from the larger codebook that includes the questions/items/variables that measure your selected topics.)
 incomeperperson
2010 Gross Domestic Product per capita in constant 2000 US$. The inflation but not the differences in the cost of living between countries has been taken into account.
 lifeexpectancy
2011 life expectancy at birth (years). The average number of years a newborn child would live if current mortality patterns were to stay the same.
  STEP 4. Identify a second topic that you would like to explore in terms of its association with your original topic.
STEP 5. Add questions/items/variables documenting this second topic to your personal codebook.
 Is there a relationship between income per person, alcohol consumption and life expectancy?
 alcconsumption
2008 alcohol consumption per adult (age 15+), liters Recorded and estimated average alcohol consumption, adult (15+) per capita consumption in liters pure alcohol
  STEP 6. Perform a literature review to see what research has been previously done on this topic. Use sites such as Google Scholar (http://scholar.google.com) to search for published academic work in the area(s) of interest. Try to find multiple sources, and take note of basic bibliographic information.
 [1] Income distribution and life expectancy / R.G. Wilkinson
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1881178/pdf/bmj00056-0043.pdf
 [2] The Association Between Income and Life Expectancy in the United States, 2001-2014 / Raj Chetty
https://jamanetwork.com/journals/jama/article-abstract/2513561
 [3] Income distribution and life expectancy: a critical appraisal / Ken Judge
https://www.bmj.com/content/311/7015/1282.short
 [4] Income Differences in Life Expectancy: The Changing Contribution of Harmful Consumption of Alcohol and Smoking / Martikainen, Pekka
https://journals.lww.com/epidem/fulltext/2014/03000/Income_Differences_in_Life_Expectancy__The.6.aspx
  STEP 7. Based on your literature review, develop a hypothesis about what you believe the association might be between these topics. Be sure to integrate the specific variables you selected into the hypothesis.
The conclusion in the literature is that life expectancy depends on several factors and that the question is not settled. “Despite this popular acclaim, a careful review of the evidence does not support the hypothesis that inequalities in income distribution largely explain differences in average life expectancy among rich countries.” [3].
 Similar results are presented in [2]. “In the United States between 2001 and 2014, higher income was associated with greater longevity and differences in life expectancy across income groups increased over time. However, the association between life expectancy and income varied substantially across areas; differences in longevity across income groups decreased in some areas and increased in others. The differences in life expectancy were correlated with health behaviors and local area characteristics.”
Income, alcohol consumption and life expectancy therefore seem to be related.
“Alcohol and smoking have a major influence on income differences in mortality and, with the exception of smoking among men, their contribution is increasing. Without alcohol and smoking, there would have been little change in life expectancy differentials.” [4].
Summary of my research questions for the data source “GapMinder”:
Is there a relation between income and life expectancy?
Is there a relationship between income per person, alcohol consumption and life expectancy?
 Variables:
incomeperperson / lifeexpectancy / alcconsumption
hochspringer-blog · 5 years ago
Data Management and Visualization
First week started. Some videos introduced us to the power of Data Management. This is my first interaction with Coursera, so let's see how it works. Over the last week I signed up for a couple of OPEN.SAP.COM courses, and I like how new technology helps you gain knowledge of new topics.