anuworlduniverse-blog
COURSERA - DATA MANAGEMENT AND VISUALIZATION
anuworlduniverse-blog · 4 years ago
Milestone Assignment 2
Sample
The sample was drawn from the World Bank data set, which includes data for N=248 countries for the years 2012 and 2013. The data set contains 163 variables, of which 10 were chosen for this research. Only data from the year 2012 are considered.
Measures 
The response variable, Internet Users, was measured per 100 people for each country.
Predictors included 1) GDP per Capita (Current US$), 2) Household Fixed Consumption Expenditure (% of GDP), 3) Net National Income per Capita (Current US$), 4) Access to Electricity (% of population), 5) Population Growth (annual %), 6) Rural Population (% of total population), 7) Secure Internet Servers (per 1 million people), 8) Urban Population (% of total), 9) Industry Value Added (% of GDP).
Analyses
The distributions of the predictors and the internet users response variable were evaluated by calculating the mean, standard deviation, and minimum and maximum values for quantitative variables. Scatter plots and box plots were also examined, and Pearson correlation was used to test bivariate associations between individual predictors and the internet users response variable. Lasso regression with the least angle regression selection algorithm was used to identify the subset of variables that best predicted internet users. The lasso regression model was estimated on a training data set consisting of a random sample of 70% of the observations (N=121), and a test data set included the other 30% of the observations (N=48). All predictor variables were standardized to have a mean of 0 and a standard deviation of 1 prior to conducting the lasso regression analysis. Cross validation was performed using k-fold cross validation with 10 folds. The change in the cross validation mean squared error at each step was used to identify the best subset of predictor variables. Predictive accuracy was assessed by determining the mean squared error of the training data prediction algorithm when applied to observations in the test data set.
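The preparation steps described above (standardizing predictors, a random 70/30 split, and scoring predictions by mean squared error) can be sketched in a few lines. This is only an illustrative Python sketch, not the SAS code used for the assignment; the function names and the `gdp` values are hypothetical, and the lasso fit itself is not reproduced here.

```python
import random
from statistics import mean, pstdev

def standardize(column):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    m, s = mean(column), pstdev(column)
    return [(x - m) / s for x in column]

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Randomly split rows into a training set and a test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return mean((a - p) ** 2 for a, p in zip(actual, predicted))

# Hypothetical values for one predictor (not taken from the World Bank data).
gdp = [1200.0, 5400.0, 800.0, 43000.0, 9700.0,
       15000.0, 300.0, 27000.0, 6100.0, 2200.0]
gdp_z = standardize(gdp)

# Split row indices so the same split can be applied to every variable.
train_rows, test_rows = train_test_split(list(range(len(gdp))))
print(len(train_rows), len(test_rows))
```

A model fitted on the training rows would then be scored with `mean_squared_error` on the test rows only.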
anuworlduniverse-blog · 4 years ago
Milestone Assignment 1
Project title:
A study of the association between internet users and the factors affecting internet use.
Research Question:
Which factors, such as GDP, access to electricity, income, population and internet servers, affect the number of internet users?
Purpose:
The purpose of this study was to identify the best predictors of Internet Users from GDP per Capita,  Household Consumption Expenditure, Industry Value added, Access to Electricity, Population Growth, Rural Population, Secure Internet Servers, Urban Population and Income per Capita.
Motivation:
It is my responsibility to identify the factors that increase the number of Internet Users, because the world is moving towards digitalization. Having a better understanding of the factors that are most likely to increase or decrease internet use will allow me to identify which factors to focus on in order to increase the Internet User rate.
anuworlduniverse-blog · 4 years ago
4th Assignment(4)
1. Code:
Tumblr media Tumblr media
2. Output:
A k-means cluster analysis was conducted to identify underlying subgroups of countries based on the similarity of their values on 3 variables that represent characteristics that could have an impact on female employment rate.
Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media
All clustering variables were standardized to have a mean of 0 and a standard deviation of 1. Data were randomly split into a training set that included 70% of the observations and a test set that included 30% of the observations. A series of k-means cluster analyses was conducted on the training data specifying k=1-4 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each cluster solution in an elbow curve to provide guidance for choosing the number of clusters to interpret.
Tumblr media
The elbow curve was inconclusive, suggesting that the 2- and 3-cluster solutions might be interpreted. The results below are for an interpretation of the 4-cluster solution.
Tumblr media Tumblr media
Canonical discriminant analysis was used to reduce the 3 clustering variables down to a few variables that accounted for most of the variance in the clustering variables. A scatter plot of the first two canonical variables by cluster (figure shown below) indicated that the observations in clusters 2 and 3 were densely packed with relatively low within-cluster variance, and did not overlap very much with the other clusters.
Tumblr media
anuworlduniverse-blog · 4 years ago
3rd Assignment(4)
Lasso regression analysis
1. Code:
Tumblr media
2. Output:
Tumblr media
The first table summarizes the SURVEYSELECT procedure: the selection method is simple random sampling and the sample size is 150. The next table describes the lasso regression. The dependent variable is femaleemployrate and the predictors are urbanrate, employrate, internetuserate and lifeexpectancy. I chose the random cross validation method. The number of observations read is 213 and the number used is 167, of which 113 were used for training and 54 for testing.
The number of parameters to be estimated is 5. The next table is the LAR selection summary. The ASE (average squared error) declines as variables are added. However, the test ASE first decreases and then increases as variables are added. Life expectancy has an asterisk in the CV PRESS column, which indicates that this is the model selected by the procedure.
Tumblr media
This plot shows the change in the regression coefficients at each step, and the vertical line marks the selected model. Lifeexpectancy and employrate show the largest regression coefficients. Urbanrate and lifeexpectancy are negatively associated with female employ rate.
The next graph shows that the residual sum of squares decreases as variables are added and changes only slowly in the last steps.
Tumblr media
This graph shows which model to choose.
Tumblr media
This plot shows that the test squared error is not close to the training squared error, so the prediction accuracy is not very stable.
Tumblr media
The R-square value is 0.8244 and the adjusted R-square value is 0.8195. The ASE is 41.88596 for the training set and 85.03852 for the test set. The parameter estimates table shows the estimated coefficients of the predictors.
anuworlduniverse-blog · 4 years ago
4th Assignment(3)
Test a logistic regression model.
1. Code:
Tumblr media
2. Output:
Tumblr media
My categorical response variable is fer (female employ rate) with two levels, and the explanatory variables are urbanrate and internetuserate. The number of observations read is 213 but the number used is 190. Here 0 indicates a femaleemployrate of less than 40% and 1 indicates a femaleemployrate greater than 40%. The frequency for 0 is 71, which means that 71 countries have a femaleemployrate below 40%, whereas the frequency for 1 is 119.
Tumblr media
The p-values for urbanrate and internetuserate are not statistically significant. The parameter estimate for urbanrate is positive (0.00485) and for internetuserate is negative (-0.00582), and both are close to 0.
The odds ratio estimate for urbanrate is 1.005, which means that the odds of a country having fer = 1 are multiplied by 1.005 for each one-unit increase in urbanrate. For internetuserate the odds ratio is 0.994, so the odds of fer = 1 are multiplied by 0.994 for each one-unit increase in internetuserate. The 95% confidence interval for the urbanrate odds ratio is 0.989 to 1.021, meaning that in repeated samples from the population the odds ratio would be expected to fall in this range 95 times out of 100; for internetuserate the interval is 0.981 to 1.008. The odds ratio estimates change because of internetuserate, which is acting here as a confounding variable.
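The odds ratios reported above are the exponentials of the logistic regression parameter estimates. A short Python check (an illustration only, not part of the SAS output; the function name is hypothetical) reproduces them from the two coefficients quoted in the text:

```python
import math

def odds_ratio(coefficient):
    """Convert a logistic regression coefficient into an odds ratio."""
    return math.exp(coefficient)

print(round(odds_ratio(0.00485), 3))   # urbanrate       -> 1.005
print(round(odds_ratio(-0.00582), 3))  # internetuserate -> 0.994
```

An odds ratio above 1 corresponds to a positive coefficient and one below 1 to a negative coefficient, which matches the signs of the two estimates.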
anuworlduniverse-blog · 4 years ago
3rd Assignment(3)
1. Code:
Tumblr media
2. Output:
Tumblr media
Here I used femaleemployrate as the response variable and employrate, urbanrate and internetuserate as explanatory variables. The GLM procedure read 213 observations and actually used 167. The intercept is -7.49 (7.5% out of 100), and this is the value of femaleemployrate when all the explanatory variables are at their mean values. The p-values for employrate and employrate*employrate are 0.12224 and 0.3151, and neither is statistically significant. The p-value for urbanrate is 0.0083 and for internetuserate is 0.0007, and both are statistically significant.
The negative coefficient for urbanrate tells us that higher values of urbanrate tend to go with a lower female employment rate, and the positive coefficients for employrate and internetuserate tell us that higher values of these variables tend to go with a higher femaleemployrate. Together, the explanatory variables explain 75.9% of the variability in femaleemployrate.
Tumblr media
This is the Q-Q plot. The points do not follow a perfectly straight line and deviate more at both ends, which tells us that the residuals do not perfectly follow a normal distribution.
Tumblr media
This is the residual plot for femaleemployrate at different values of employrate. Most of the points lie around the centre of the employrate range, and the residuals tend towards zero as employrate increases.
Tumblr media
This is the residual plot for femaleemployrate at different values of employrate*employrate. Most of the points lie around the centre of the employrate*employrate range.
Tumblr media
This is the residual plot for femaleemployrate at different values of urbanrate. The residuals are randomly distributed across all values of urbanrate.
Tumblr media
This is the residual plot for femaleemployrate at different values of internetuserate. There is a clear funnel-shaped pattern in the residuals, which tend towards zero at higher values of internetuserate.
Tumblr media
This is the outlier and leverage diagnostics plot for femaleemployrate. Most of the values lie in the first box. There are a few outliers, that is, countries whose residuals fall outside the range of -2 to 2, and there are some values with high leverage (shown in green), but only one country is both an outlier and a high-leverage point. The plot also shows that these outliers have leverage values of about 0.01-0.025.
anuworlduniverse-blog · 4 years ago
2nd Assignment(4)
1. Code:
Tumblr media
2. Output:
Random forest analysis was performed to evaluate the importance of a series of explanatory variables in predicting a binary, categorical response variable. The following explanatory variables were included as possible contributors to a random forest evaluating femaleemployrate (the response variable): employrate, urbanrate and internetuserate.
Tumblr media
The model information section shows that the number of variables to try is 2 and indicates that there are 5 random explanatory variables. By default the forest has a maximum of 100 trees and selects 60% of the sample (the inbag fraction) for each tree. The split criterion is Gini and there are no missing values.
The number of observations read is 213 and the number used is 178. The misclassification rate is 97.8%, which means the forest classified only 2.2% of observations correctly.
Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media
This is the fit statistics table.
Tumblr media
The explanatory variables with the highest relative importance scores are employrate, urbanrate and internetuserate.
anuworlduniverse-blog · 4 years ago
1st Assignment(4)
1. Code:
Tumblr media
2. Output:
Tumblr media
We can see from the model information table that the decision tree that SAS grew has 26 leaves before pruning and 6 leaves after pruning. The Model Event Level here has the value 1, which is the value of our target variable femaleemployrate that is being modelled. The number of observations read from the data set is 213, while the number used is 167.
Tumblr media
This is a plot of the cross validation average misclassification rate, created by PROC HPSPLIT, against the number of leaves of each tree generated on the training sample. A vertical reference line is drawn for the tree with the number of leaves that has the lowest average misclassification rate, in this case the 3-leaf tree. The horizontal reference line represents the minimum average error plus one standard error for this cost-complexity parameter.
Tumblr media
This pruning plot is for the general model with 10 split levels and 3 leaves.
Tumblr media
This is the final, smaller tree. It first splits on employ rate (at about 46.848), then again on employ rate (at about 60.672), and then on internet use rate (at about 62.238).
Tumblr media
The first table is the model-based confusion matrix. The model correctly classifies 88% of countries when the female employ rate is less than 40% (that is, 1 minus the error rate) and 90% when the rate is greater than 40%.
Next is a receiver operating characteristic curve, known as the ROC curve, which shows sensitivity (the true positive rate) on the y axis and 1 - specificity (the false positive rate) on the x axis.
After that is the variable importance table, which lists the variables employ rate and internet use rate.
anuworlduniverse-blog · 4 years ago
2nd Assignment(3)
1. Code:
Tumblr media
2. Output:
a. GLM procedure:
I took female employ rate as the response variable and employ rate as the explanatory variable. The number of observations read is 213 and the number used is 178. The p-value is less than 0.0001, which indicates that the association is statistically significant. The R-square value is 0.735, which tells us that employrate explains 73.5% of the variability in femaleemployrate. We can write the fitted model as femaleemployrate = -22.4 + (1.2)*employrate.
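To see what the fitted equation implies, it can be evaluated for a few employ rates. This is a minimal Python sketch using the rounded intercept and slope quoted above; the function name and the input values are hypothetical, chosen only for illustration.

```python
def predict_female_employ_rate(employrate, intercept=-22.4, slope=1.2):
    """Predicted femaleemployrate from the fitted line quoted in the text."""
    return intercept + slope * employrate

# Hypothetical employ rates (percent), not taken from the Gapminder data.
for rate in (40, 50, 60):
    print(rate, round(predict_female_employ_rate(rate), 1))
```

Each additional percentage point of employrate raises the predicted femaleemployrate by 1.2 points, which is what the slope means.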
Tumblr media Tumblr media
b. Plot:
This graph clearly shows that femaleemployrate is linearly associated with employrate and increases as employrate increases. The graph also shows the 95% prediction limits, within which most of the values lie.
Tumblr media
c. Mean graph:
This graph shows the mean value of the response variable for each level of the explanatory variable. When er is 0 the mean femaleemployrate is 31.4, and when er is 1 the mean femaleemployrate is 51.8.
Tumblr media
d.  Frequency table for variable er:
Here er is the categorical explanatory variable: 0 indicates an employ rate of less than 50 and 1 indicates an employ rate greater than 50. 66% of countries (141) have an employ rate above 50%, and only 72 countries have an employ rate below 50%.
Tumblr media
anuworlduniverse-blog · 4 years ago
1st Assignment(3)
1.  Sample:
The sample is from Gapminder. Since its conception in 2005, Gapminder has grown to include over 200 indicators, including gross domestic product, total employment rate, and estimated HIV prevalence. Gapminder contains data for all 192 UN members, aggregating data for Serbia and Montenegro. Suicide rate comes from a combination of time series from WHO Violence and Injury Prevention (VIP) and data from the WHO Global Burden of Disease 2002 and 2004. Employ rate and female employ rate are from the International Labour Organization.
2. Procedure:
For suicide mortality, estimates were based largely on GBD (Global Burden of Disease) covariates. The ratio of the observed SDR (suicide death rate) to the rate expected in geographies globally with a similar GBD Socio-demographic Index was then calculated, along with 95% uncertainty intervals (UIs) for the point estimates. I used the 2005 data for suicide rate. I wanted to find the relation between suicide rate and unemployment. Employment rate and female employment rate are compiled by ILOSTAT (International Labour Organization statistics) using three main methods: automated data collection, microdata processing and the annual ILOSTAT questionnaire. Automation processes include systematizing the collection of data from a wide variety of online data repositories and reshaping, reprocessing and validating the data before publishing on ILOSTAT. After the datasets are collected, ILO experts systematically process them to generate harmonized indicators based on international statistical standards. This allows ILOSTAT to produce and publish a wide range of detailed and internationally comparable labour statistics. The Excel-based questionnaire is sent out each year to national statistical offices and labour ministries worldwide. They receive a link to a country-specific web page with additional information concerning the data collection, such as a list of contacts and an overview of data availability by indicator. I used employment data from the year 2007.
3. Measures:
Crude and age-standardized rates of suicide mortality and years of life lost were compared across regions and countries, and by age, sex, and Socio-demographic Index (a composite measure of fertility, income, and education). The suicide rate I used is suicides in 2005, age adjusted, per 100,000: mortality due to self-inflicted injury per 100,000 standard population. I took suicide rate as the explanatory variable. ILOSTAT currently has more than 10,000 household survey datasets across 151 countries. It also leverages the ILO Harmonized Microdata collection to respond to ad hoc user queries for different data tabulations. The annual response rate to the questionnaire is typically around 50 per cent of member States, which cover about 80 per cent of the world's population. I took employ rate as the response variable. The employment rate and female employment rate I used are for 2007: the percentage of the total population aged 15 and above that was employed during the given year. I then converted these quantitative variables into categorical variables to analyze the relation between employment and suicide rate, and how employment rate and female employment rate differ from each other.
anuworlduniverse-blog · 4 years ago
4th Assignment(1)
1. Code:
Tumblr media
2. Output:
Tumblr media
These are the frequency tables for suicide rate (sp), employ rate (er) and female employ rate (fer). 199 countries (93.43%) have a suicide rate of less than 20 per 100,000. The employ rate is 40-60% in 99 countries. Only six countries have an employ rate greater than 80%. About 90% of countries have a female employ rate of about 40-60%. However, only 4 countries have a female employ rate of more than 80%.
(a). Univariate graph of suicide rate :
Tumblr media
This graph is unimodal and has its highest peak when sp is 1. The values are high on the left side and low on the right side, so it is skewed right.
Univariate graph of employ rate :
Tumblr media
This graph is unimodal and its highest peak is at er = 3 (employ rate of 40-60%). It is not skewed in either direction because its peaks are irregular and not continuous.
Univariate graph of female employ rate:
Tumblr media
This graph is also unimodal; its highest peak is in the 40-60 category, with a frequency of around 45%. Its higher frequencies are on the left side, so it is right skewed.
(b). Bivariate graph between suicide rate and employ rate:
Tumblr media
Here my hypothesis was that employment rate affects suicide rate. The bivariate graph is between employ rate (response variable) and suicide rate (explanatory variable). The values are scattered and do not show any clear relation.
Bivariate graph between suicide rate and female employment rate :
Tumblr media
This bivariate graph is between suicide rate (response variable) and female employ rate (explanatory variable). The points are dense in the middle, which suggests that the relation may follow a curve.
Bivariate graph between employ rate and female employ rate:
Tumblr media
This bivariate graph is between employ rate and female employ rate. It shows a linear relationship between the two.
anuworlduniverse-blog · 4 years ago
4th Assignment(2)
1. Code:
Tumblr media
2. Output:
Tumblr media Tumblr media Tumblr media
3. Summary:
I am using the Pearson correlation test to test a potential moderator, with urbanrate (ur) as the moderator and employrate and femaleemployrate as the other variables. I divided urbanrate into 5 categories, ur = 1, 2, 3, 4, 5. Here ur is a categorical variable and the other two are quantitative variables. For every urban rate group the p-value is 0.0001, which is less than 0.05, so the association is statistically significant. For the lowest urban rate group, the correlation coefficient between employrate and femaleemployrate is 0.945. When urban rate is between 20 and 40%, the correlation coefficient is 0.897. For the 40-60% urban rate group, the correlation is 0.798. For the 60-80% urban rate group, the correlation is 0.826. For the urban rate group above 80%, the correlation is 0.65. The correlation coefficient decreases as urban rate increases, except at ur = 4, where the correlation coefficient is 0.826.
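Testing a moderator in this way amounts to computing the Pearson correlation separately within each urban rate group. The following Python sketch illustrates the idea; it is not the SAS code used for the assignment, the function names are hypothetical, and the rows are made-up values rather than Gapminder data.

```python
import math
from collections import defaultdict

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_by_group(rows):
    """rows are (group, x, y) tuples; returns {group: Pearson r within that group}."""
    grouped = defaultdict(lambda: ([], []))
    for group, x, y in rows:
        grouped[group][0].append(x)
        grouped[group][1].append(y)
    return {group: pearson_r(xs, ys) for group, (xs, ys) in grouped.items()}

# Hypothetical (ur, employrate, femaleemployrate) rows for two urban rate groups.
rows = [
    (1, 50.0, 40.0), (1, 60.0, 52.0), (1, 70.0, 60.0),
    (2, 50.0, 30.0), (2, 60.0, 55.0), (2, 70.0, 45.0),
]
for group, r in sorted(correlation_by_group(rows).items()):
    print(group, round(r, 3))
```

If the correlation differs noticeably between groups, as it does across the five ur categories above, the grouping variable is a candidate moderator.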
anuworlduniverse-blog · 4 years ago
3rd Assignment(2)
1. Code:
Tumblr media
2. Output:
Tumblr media
3. Summary: 
To analyze my hypothesis about the effect of unemployment on suicide rate, I generated correlation coefficients, which help in understanding the following. For the association between suicide rate and female employ rate, the correlation coefficient is 0.15 with a p-value of 0.0509, which tells us that the relation is not statistically significant. For the association between employ rate and female employ rate, the correlation coefficient is approximately 0.86 with a p-value of 0.0001, which shows that this relation is statistically significant. The association between femaleemployrate and employrate is fairly strong and positive, whereas the associations between suicideper100TH and employrate and between suicideper100TH and femaleemployrate are weak but positive. We do not need a post hoc test here because we are only using quantitative variables. If we square the correlation coefficient, we get the proportion of variability explained. For example, the correlation coefficient between employrate and femaleemployrate is 0.86; squared, its value is 0.74, which means that employrate can predict 74% of the variability we see in femaleemployrate.
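The step from the correlation coefficient to the share of variability explained is a single squaring, which can be checked in Python. The value 0.735 used for the reverse check is the R-square reported for the same pair of variables in the regression assignment on this blog.

```python
import math

r = 0.86                      # correlation between employrate and femaleemployrate
r_squared = r ** 2
print(round(r_squared, 2))    # 0.74

# Going the other way: the square root of a regression R-square of 0.735
# gives back approximately the same correlation coefficient.
print(round(math.sqrt(0.735), 2))  # 0.86
```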
anuworlduniverse-blog · 4 years ago
2nd Assignment(2)
1. Code:
Tumblr media
2. Output:
Tumblr media
3. Summary:
To examine the relation between suicide rate (response variable) and employment (explanatory variable), a Chi-Square test of independence was done. In the first table the column percentages differ considerably between sp = 1 and sp = 2, and the column percentages also differ among themselves; they are neither steadily increasing nor decreasing. The p-value of the Chi-Square statistic is greater than 0.05, so the null hypothesis cannot be rejected. There are no missing frequencies.
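The Chi-Square statistic compares the observed counts in the cross-tabulation with the counts expected if the two variables were independent. A minimal Python sketch of that calculation follows; it illustrates the formula rather than reproducing the SAS procedure, the function name is hypothetical, and the table values are made up.

```python
def chi_square(table):
    """Chi-square statistic and degrees of freedom for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Hypothetical counts: rows are sp = 1, 2; columns are five employ rate categories.
observed = [
    [5, 40, 95, 50, 9],
    [1, 3, 4, 5, 1],
]
statistic, dof = chi_square(observed)
print(round(statistic, 3), dof)
```

The statistic is then compared with the Chi-Square distribution for the given degrees of freedom to obtain the p-value.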
anuworlduniverse-blog · 4 years ago
1. Code: 
Tumblr media
2. Output: 
Tumblr media Tumblr media Tumblr media
3. Summary:
Here I ran the ANOVA procedure. The response variable is suicideper100TH and er is the explanatory variable. The F value is 2.30. The p-value is greater than 0.05 (0.0603 > 0.05), so the null hypothesis cannot be rejected. The F test alone does not tell us which group means differ, and because the explanatory variable has more than two groups a post hoc test is needed, so a Duncan test was done here. Group 3, the countries with an employ rate of 40-60%, has a higher mean suicide rate than any other group.
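The F value in a one-way ANOVA is the ratio of the between-group mean square to the within-group mean square. The Python sketch below shows that calculation on made-up groups; it is an illustration of the formula, not the SAS procedure, and the function name is hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    # Variation of the values around their own group mean.
    ss_within = sum(
        (value - sum(group) / len(group)) ** 2
        for group in groups
        for value in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical suicide rates for three employ rate groups.
f_value, df_between, df_within = one_way_anova_f(
    [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
)
print(round(f_value, 2), df_between, df_within)  # 13.0 2 6
```

A large F value means the group means differ by more than the spread within groups would suggest; a post hoc test such as Duncan's is what identifies which groups differ.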
anuworlduniverse-blog · 4 years ago
4th Assignment
1. Code:
Tumblr media
2. Univariate graphs:
Univariate graph for suicide rate:
Tumblr media
Univariate graph for employment rate:
Tumblr media
Univariate graph for female employment rate:
Tumblr media
Bivariate graph:
Tumblr media
3. Summary:
I have drawn univariate and bivariate graphs. The univariate graphs are for suicide rate, employ rate and female employ rate. The bivariate graph is drawn between employ rate and suicide rate. The 1st graph is unimodal; it has higher frequencies at the lower values and fewer at the higher values, so it is right skewed. The 2nd graph is also unimodal; its highest peak is in the 40-60 category, with a frequency of about 46.5%, and it does not show clear skewness. The 3rd graph is also unimodal; its highest peak is in the 40-60 category, with a frequency of around 45%. Its higher frequencies are on the left side, so it is right skewed. In the bivariate graph between employ rate and suicide rate, the values are scattered and do not show any clear relation.
anuworlduniverse-blog · 4 years ago
3rd Assignment
1. Code:
Tumblr media
2.   Frequency distributions:
Tumblr media
3. Summary:
I recoded suicideper100TH, employrate and femaleemployrate to create the new variables sp, re and fer respectively. I divided suicideper100TH into two categories, suicide rate less than twenty and suicide rate greater than twenty, and named the result sp. Employ rate is divided into five categories and named re. Female employ rate is also divided into five categories and named fer. 199 countries, or 93.43% of all countries, have a suicide rate of less than 20 per 100,000. The most common employ rate category is 40-60%, which contains 99 countries (46.48%). The most common female employ rate category is also 40-60%, which contains 96 countries (45.07%), approximately the same share as for employ rate.
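The recoding described above can be sketched in Python as follows. The category codes follow the text (sp = 1 or 2, and five 20-point bands for the rates), but the function names, the treatment of values that fall exactly on a boundary, and the sample values are assumptions made for illustration; the original SAS code is only available as an image.

```python
from collections import Counter

def suicide_category(rate):
    """sp: 1 if the suicide rate is below 20 per 100,000, otherwise 2."""
    return 1 if rate < 20 else 2

def rate_category(rate):
    """Five bands: 1 = 0-20, 2 = 20-40, 3 = 40-60, 4 = 60-80, 5 = 80-100.

    Assumption: a value on a boundary goes into the higher band, and 100 stays in band 5.
    """
    return min(int(rate // 20) + 1, 5)

# Hypothetical employ rates (percent), not taken from the Gapminder data.
employ_rates = [46.5, 58.2, 61.0, 83.2, 39.9, 55.7]
frequencies = Counter(rate_category(rate) for rate in employ_rates)
for category in sorted(frequencies):
    share = 100 * frequencies[category] / len(employ_rates)
    print(category, frequencies[category], round(share, 2))
```

The printed count and percentage per category correspond to the columns of a frequency table.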