perimichael
perimichael · 7 months ago
EXAMPLE USING ANOVA
/* Example Using ANOVA */
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.nesarc_pds;
/* BY-group processing needs the data sorted by the same BY variable */
PROC SORT; BY S2AQ6B;
PROC ANOVA; CLASS REGION; MODEL AGE=REGION; MEANS REGION; BY S2AQ6B;
RUN;
The ANOVA Procedure
S2AQ6B=.

Class Level Information
Class     Levels    Values
REGION    3         2 3 4

Number of Observations Read    5
Number of Observations Used    5
The ANOVA Procedure
Dependent Variable: AGE
S2AQ6B=.

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2       2400.300000    1200.150000      13.76    0.0678
Error               2        174.500000      87.250000
Corrected Total     4       2574.800000

R-Square    Coeff Var    Root MSE    AGE Mean
0.932228     25.10960    9.340771    37.20000

Source    DF       Anova SS    Mean Square    F Value    Pr > F
REGION     2    2400.300000    1200.150000      13.76    0.0678
Tumblr media
The ANOVA Procedure
S2AQ6B=.
Tumblr media
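Because the p-value of .0678 is above .05, the difference in age between regions is not significant in this S2AQ6B group (only 5 observations were used), so there is no evidence of moderation between age, region and drinking of wine.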
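The figures in the ANOVA table can be checked by hand. The sketch below recomputes the mean squares, F value, R-square and p-value from the printed sums of squares. The variable names are illustrative, and the closed-form p-value is a shortcut that only holds when the model has exactly 2 degrees of freedom, as it does here; it is not a description of how SAS computes the value.

```python
# Values copied from the ANOVA table for the S2AQ6B=. group
ss_model, df_model = 2400.3, 2
ss_error, df_error = 174.5, 2
ss_total = 2574.8

# Mean square = sum of squares / degrees of freedom
ms_model = ss_model / df_model
ms_error = ss_error / df_error

# F is the ratio of the two mean squares
f_value = ms_model / ms_error

# R-square is the share of total variation explained by the model
r_square = ss_model / ss_total

# Assumption: this survival-function shortcut is valid only for df_model == 2
p_value = (1 + 2 * f_value / df_error) ** (-df_error / 2)

print(round(f_value, 2), round(r_square, 6), round(p_value, 4))
```

With only 2 error degrees of freedom, even a large F value does not reach the .05 threshold, which is why the result is not significant.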
perimichael · 7 months ago
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.nesarc_pds;
/* Recode education into groups; codes 8 through 10 are not assigned by these conditions */
IF S1Q6A eq 1 THEN education=.;
ELSE IF S1Q6A LE 3 THEN education=1;
ELSE IF S1Q6A LE 5 THEN education=2;
ELSE IF S1Q6A LE 7 THEN education=3;
ELSE IF S1Q6A GT 10 THEN education=3;
PROC SORT; by IDNUM;
/* Correlate the two income variables */
PROC CORR; VAR S1Q10B S1Q11B;
RUN;
The CORR Procedure

2 Variables: S1Q10B S1Q11B

Simple Statistics
Variable        N       Mean    Std Dev       Sum    Minimum     Maximum
S1Q10B      43093    6.75750    4.40665    291201          0    17.00000
S1Q11B      43093    9.42141    4.84308    405997    1.00000    21.00000

Pearson Correlation Coefficients, N = 43093
Prob > |r| under H0: Rho=0
            S1Q10B     S1Q11B
S1Q10B     1.00000    0.67157
                       <.0001
S1Q11B     0.67157    1.00000
            <.0001
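Individual income (S1Q10B) and household income (S1Q11B) are positively correlated, as shown by the data above (r = 0.67157). The correlation is significant because the p-value is less than .0001. Education level is recoded in the DATA step but is not listed in the VAR statement, so this output does not test how education relates to either income variable.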
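PROC CORR reports the Pearson correlation coefficient for the variables in the VAR statement. As a minimal sketch of what that coefficient measures, the function below computes it from scratch. The function name and the six data points are hypothetical and are not taken from NESARC.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical income categories for six respondents (not NESARC data)
personal = [2, 5, 7, 9, 12, 15]
household = [4, 6, 11, 10, 16, 18]
r = pearson_r(personal, household)
print(round(r, 3))

# Squaring the r reported by PROC CORR gives the share of variance
# in one variable that is accounted for by the other
r_squared_reported = 0.67157 ** 2
print(round(r_squared_reported, 3))
```

Squaring the reported r of 0.67157 gives about 0.45, so roughly 45% of the variation in one variable is shared with the other.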
perimichael · 7 months ago
Chi-Square
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.nesarc_pds;
LABEL TAB12MDX="Tobacco Dependence Past 12 Months"
      CHECK321="Smoked Cigarettes in Past 12 Months"
      S3AQ3B1="Usual Smoking Frequency"
      S3AQ3C1="Usual Smoking Quantity"
      S1Q6A="Highest Grade or year of school completed"
      S2AQ8A="How often drank any alcohol in the last 12 months";
/* Set appropriate missing data as needed */
IF S3AQ3B1=9 THEN S3AQ3B1=.;
IF S3AQ3C1=99 THEN S3AQ3C1=.;
IF S2AQ8A=99 THEN S2AQ8A=.;
IF S1Q6A=1 THEN S1Q6A=.;
/* Group grade levels; each range needs two comparisons because 2-4 is evaluated as a subtraction */
IF 2 LE S1Q6A LE 4 THEN GRADE=1;
ELSE IF 5 LE S1Q6A LE 7 THEN GRADE=2;
ELSE IF 8 LE S1Q6A LE 9 THEN GRADE=3;
ELSE IF 10 LE S1Q6A LE 14 THEN GRADE=4;
INTELLIGENCE=GRADE*S1Q6A;
/* subsetting data to include only age 18 and older */
IF AGE GE 18;
PROC SORT; by IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
DATA COMPARISON1; SET NEW; IF S1Q6A=1 OR S1Q6A=2; PROC SORT; BY IDNUM; PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ; RUN;
DATA COMPARISON1; SET NEW; IF S1Q6A=1 OR S1Q6A=3; PROC SORT; BY IDNUM; PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ; RUN;
DATA COMPARISON1; SET NEW; IF S1Q6A=1 OR S1Q6A=4; PROC SORT; BY IDNUM; PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ; RUN;
DATA COMPARISON1; SET NEW; IF S1Q6A=1 OR S1Q6A=5; PROC SORT; BY IDNUM; PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ; RUN;
DATA COMPARISON1; SET NEW; IF S1Q6A=1 OR S1Q6A=6; PROC SORT; BY IDNUM; PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ; RUN;
DATA COMPARISON1; SET NEW; IF S1Q6A=1 OR S1Q6A=7; PROC SORT; BY IDNUM; PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ; RUN;
DATA COMPARISON1; SET NEW; IF S1Q6A=1 OR S1Q6A=8; PROC SORT; BY IDNUM; PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ; RUN;
DATA COMPARISON1; SET NEW; IF S1Q6A=1 OR S1Q6A=9; PROC SORT; BY IDNUM; PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ; RUN;
DATA COMPARISON1; SET NEW; IF S1Q6A=1 OR S1Q6A=10; PROC SORT; BY IDNUM; PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ; RUN;
DATA COMPARISON1; SET NEW; IF S1Q6A=1 OR S1Q6A=11; PROC SORT; BY IDNUM; PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ; RUN;
DATA COMPARISON1; SET NEW; IF S1Q6A=1 OR S1Q6A=12; PROC SORT; BY IDNUM; PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ; RUN;
DATA COMPARISON1; SET NEW; IF S1Q6A=1 OR S1Q6A=13; PROC SORT; BY IDNUM; PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ; RUN;
DATA COMPARISON1; SET NEW; IF S1Q6A=1 OR S1Q6A=14; PROC SORT; BY IDNUM; PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ; RUN;
Tumblr media
The table reports the chi-square test of association between the highest grade level completed and alcohol use. However, based on the chi-square value, I don't think I can draw any valid conclusions about these variables being associated.
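To make the test behind the CHISQ option more concrete, here is a minimal sketch of the Pearson chi-square statistic computed from a table of counts. The function name and the 2x2 counts are hypothetical. The Bonferroni adjustment at the end is one common way to handle the 13 pairwise comparisons that the code runs; it is an assumption here, since the post does not say which correction, if any, was intended.

```python
def chi_square(table):
    """Pearson chi-square statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical 2x2 counts (not NESARC data)
table = [[10, 20], [20, 10]]
stat = chi_square(table)
dof = (len(table) - 1) * (len(table[0]) - 1)
print(round(stat, 4), dof)

# Bonferroni adjustment (assumed): divide alpha by the number of comparisons
n_comparisons = 13
adjusted_alpha = 0.05 / n_comparisons
print(round(adjusted_alpha, 5))
```

Under this adjustment, a pairwise comparison would count as significant only if its p-value fell below the adjusted threshold rather than below .05.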
perimichael · 7 months ago
Data analysis for grade vs gender and race
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.nesarc_pds;
LABEL TAB12MDX="Tobacco Dependence Past 12 Months"
      CHECK321="Smoked Cigarettes in Past 12 Months"
      S3AQ3B1="Usual Smoking Frequency"
      S3AQ3C1="Usual Smoking Quantity"
      S1Q6A="Highest Grade or year of school completed"
      S2AQ8A="How often drank any alcohol in the last 12 months";
/* Set appropriate missing data as needed */
IF S3AQ3B1=9 THEN S3AQ3B1=.;
IF S3AQ3C1=99 THEN S3AQ3C1=.;
IF S2AQ8A=99 THEN S2AQ8A=.;
IF S1Q6A=1 THEN S1Q6A=.;
/* Group grade levels; each range needs two comparisons because 2-4 is evaluated as a subtraction */
IF 2 LE S1Q6A LE 4 THEN GRADE=1;
ELSE IF 5 LE S1Q6A LE 7 THEN GRADE=2;
ELSE IF 8 LE S1Q6A LE 9 THEN GRADE=3;
ELSE IF 10 LE S1Q6A LE 14 THEN GRADE=4;
INTELLIGENCE=GRADE*S1Q6A;
/* subsetting data to include only age 18 and older */
IF AGE GE 18;
PROC SORT; by IDNUM;
PROC ANOVA; CLASS SEX; MODEL S1Q6A=SEX; MEANS SEX;
PROC ANOVA; CLASS ETHRACE2A; MODEL S1Q6A=ETHRACE2A; MEANS ETHRACE2A/DUNCAN;
RUN;
The ANOVA Procedure

Class Level Information
Class    Levels    Values
SEX      2         1 2

Number of Observations Read    43093
Number of Observations Used    42875
The ANOVA Procedure
Dependent Variable: S1Q6A Highest Grade or year of school completed

Source                DF    Sum of Squares    Mean Square    F Value    Pr > F
Model                  1          112.9008       112.9008      18.75    <.0001
Error              42873       258168.3027         6.0217
Corrected Total    42874       258281.2035

R-Square    Coeff Var    Root MSE    S1Q6A Mean
0.000437     25.84702    2.453915      9.493994

Source    DF       Anova SS    Mean Square    F Value    Pr > F
SEX        1    112.9008106    112.9008106      18.75    <.0001
The ANOVA Procedure
Tumblr media
The ANOVA Procedure

Class Level Information
Class        Levels    Values
ETHRACE2A    5         1 2 3 4 5

Number of Observations Read    43093
Number of Observations Used    42875
The ANOVA Procedure
Dependent Variable: S1Q6A Highest Grade or year of school completed

Source                DF    Sum of Squares    Mean Square    F Value    Pr > F
Model                  4        15981.1633      3995.2908     706.88    <.0001
Error              42870       242300.0402         5.6520
Corrected Total    42874       258281.2035

R-Square    Coeff Var    Root MSE    S1Q6A Mean
0.061875     25.04096    2.377388      9.493994

Source       DF       Anova SS    Mean Square    F Value    Pr > F
ETHRACE2A     4    15981.16328     3995.29082     706.88    <.0001
The ANOVA Procedure
Tumblr media
The ANOVA Procedure
Duncan's Multiple Range Test for S1Q6A
Note: This test controls the Type I comparisonwise error rate, not the experimentwise error rate.

Alpha                          0.05
Error Degrees of Freedom       42870
Error Mean Square              5.651972
Harmonic Mean of Cell Sizes    2007.656

Note: Cell sizes are not equal.

Number of Means        2        3        4        5
Critical Range     .1471    .1549    .1601    .1639
Tumblr media
Males generally had a higher education level, but they were also more likely to never attend school or to drop out early.
Hispanic and Latino respondents have the lowest level of education. Asian respondents tend to have the highest.
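Both ANOVA results are significant at p < .0001, but the R-Square values show that the effects differ a lot in size. The sketch below recomputes R-square from the sums of squares printed in the two tables. The variable names are illustrative.

```python
# Sums of squares copied from the two ANOVA tables above
ss_total = 258281.2035
ss_sex = 112.9008
ss_race = 15981.1633

# R-square = model sum of squares / corrected total sum of squares
r2_sex = ss_sex / ss_total
r2_race = ss_race / ss_total

print(f"SEX explains {r2_sex:.4%} of the variation in S1Q6A")
print(f"ETHRACE2A explains {r2_race:.4%} of the variation in S1Q6A")
```

With more than 42,000 observations, even a very small difference between groups can produce a significant F value, so the R-square is a useful second check on how much the grouping variable actually explains.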
perimichael · 7 months ago
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Read the dataset
data = pd.read_csv('gapminder.csv', low_memory=False)

# Replace empty strings with NaN
data = data.replace(r'^\s*$', np.nan, regex=True)

# Set the variables you will be working with to numeric.
# country is a text identifier, so it is left unconverted,
# and employrate is read from its own column.
numeric_columns = ['internetuserate', 'urbanrate', 'incomeperperson', 'hivrate',
                   'alcconsumption', 'lifeexpectancy', 'employrate']
for column in numeric_columns:
    data[column] = pd.to_numeric(data[column], errors='coerce')

# Descriptive statistics
desc1 = data['alcconsumption'].describe()
print(desc1)

desc2 = data['employrate'].describe()
print(desc2)

# Basic scatterplot: Q -> Q
plt.figure(figsize=(10, 6))
sns.regplot(x="alcconsumption", y="hivrate", fit_reg=False, data=data)
plt.xlabel('alcconsumption')
plt.ylabel('hivrate')
plt.title('Scatterplot for the Association Between alcconsumption and HIVRate')
plt.show()

plt.figure(figsize=(10, 6))
sns.regplot(x="lifeexpectancy", y="employrate", data=data)
plt.xlabel('lifeexpectancy')
plt.ylabel('employrate')
plt.title('Scatterplot for the Association Between Life Expectancy and Employment Rate')
plt.show()

# Quartile split (use qcut function & ask for 4 groups - gives you quartile split)
print('Employrate - 4 categories - quartiles')
data['EMPLOYRATE4'] = pd.qcut(data['employrate'], 4,
                              labels=["1=25th%tile", "2=50%tile", "3=75%tile", "4=100%tile"])
c10 = data['EMPLOYRATE4'].value_counts(sort=False, dropna=True)
print(c10)

# Bivariate bar graph C -> Q
# catplot creates its own figure, so its size is set with height and aspect
sns.catplot(x='EMPLOYRATE4', y='alcconsumption', data=data, kind="bar", ci=None,
            height=6, aspect=10 / 6)
plt.xlabel('Employ Rate')
plt.ylabel('Alcohol Consumption Rate')
plt.title('Mean Alcohol Consumption Rate by Employment Rate Quartile')
plt.show()

c11 = data.groupby('EMPLOYRATE4').size()
print(c11)

result = data.sort_values(by=['EMPLOYRATE4'], ascending=True)
print(result)
Tumblr media Tumblr media Tumblr media
My data shows little correlation between alcohol consumption and HIV rate. When life expectancy is lower, the employment rate is higher, which is probably because people retire as they get older. Finally, I found that the average alcohol consumption rate goes down across the groups, which in this code are employment rate quartiles rather than income groups.
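The bar graph depends on splitting a quantitative variable into four groups and then averaging a second variable within each group. The sketch below shows that idea using only the standard library. The function name and all values are hypothetical, and the cut points from `statistics.quantiles` are not guaranteed to match the ones `pd.qcut` picks, so this is an illustration of the technique and not a reproduction of the plot.

```python
from bisect import bisect_left
from statistics import quantiles

def quartile_groups(values):
    """Assign each value to quartile group 1-4 using three cut points."""
    cuts = quantiles(values, n=4)
    # Values at or below a cut point fall in the lower group
    return [bisect_left(cuts, v) + 1 for v in values]

# Hypothetical employment rates and alcohol consumption for eight countries
employ = [41.2, 47.3, 52.0, 55.9, 58.7, 61.5, 65.0, 68.9]
alcohol = [9.1, 8.4, 7.9, 6.5, 6.8, 5.2, 4.9, 3.7]

groups = quartile_groups(employ)

# Mean alcohol consumption within each quartile group
means = {}
for g in sorted(set(groups)):
    in_group = [a for a, grp in zip(alcohol, groups) if grp == g]
    means[g] = sum(in_group) / len(in_group)

print(groups)
print(means)
```

Each bar in a chart like the one above corresponds to one entry in `means`.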
perimichael · 7 months ago
weight of smokers
import pandas as pd
import numpy as np

# Read the dataset
data = pd.read_csv('nesarc_pds.csv', low_memory=False)

# Bug fix for display formats to avoid runtime errors
pd.set_option('display.float_format', lambda x: '%f' % x)

# Setting variables to numeric (entries that cannot be parsed become NaN)
numeric_columns = ['TAB12MDX', 'CHECK321', 'S3AQ3B1', 'S3AQ3C1', 'WEIGHT']
data[numeric_columns] = data[numeric_columns].apply(pd.to_numeric, errors='coerce')

# Subset data to adults 100 to 600 lbs who have smoked in the past 12 months
sub1 = data[(data['WEIGHT'] >= 100) & (data['WEIGHT'] <= 600) & (data['CHECK321'] == 1)]

# Make a copy of the subsetted data
sub2 = sub1.copy()


def print_value_counts(df, column, description):
    """Print value counts for a specific column in the dataframe."""
    print(f'{description}')
    counts = df[column].value_counts(sort=False, dropna=False)
    print(counts)
    return counts


# Initial counts for S3AQ3B1
print_value_counts(sub2, 'S3AQ3B1', 'Counts for original S3AQ3B1')

# Recode missing values to NaN (assign back instead of chained inplace calls)
sub2['S3AQ3B1'] = sub2['S3AQ3B1'].replace(9, np.nan)
sub2['S3AQ3C1'] = sub2['S3AQ3C1'].replace(99, np.nan)

# Counts after recoding missing values
print_value_counts(sub2, 'S3AQ3B1', 'Counts for S3AQ3B1 with 9 set to NaN and number of missing requested')

# Recode missing values for S2AQ8A
sub2['S2AQ8A'] = sub2['S2AQ8A'].fillna(11)
sub2['S2AQ8A'] = sub2['S2AQ8A'].replace(99, np.nan)

# Check coding for S2AQ8A
print_value_counts(sub2, 'S2AQ8A', 'S2AQ8A with Blanks recoded as 11 and 99 set to NaN')
print(sub2['S2AQ8A'].describe())

# Recode values for S3AQ3B1 into new variables
recode1 = {1: 6, 2: 5, 3: 4, 4: 3, 5: 2, 6: 1}
recode2 = {1: 30, 2: 22, 3: 14, 4: 5, 5: 2.5, 6: 1}

sub2['USFREQ'] = sub2['S3AQ3B1'].map(recode1)
sub2['USFREQMO'] = sub2['S3AQ3B1'].map(recode2)

# Create secondary variable
sub2['NUMCIGMO_EST'] = sub2['USFREQMO'] * sub2['S3AQ3C1']

# Examine frequency distributions for WEIGHT
print_value_counts(sub2, 'WEIGHT', 'Counts for WEIGHT')
print('Percentages for WEIGHT')
print(sub2['WEIGHT'].value_counts(sort=False, normalize=True))

# Quartile split for WEIGHT
sub2['WEIGHTGROUP4'] = pd.qcut(sub2['WEIGHT'], 4,
                               labels=["1=0%tile", "2=25%tile", "3=50%tile", "4=75%tile"])
print_value_counts(sub2, 'WEIGHTGROUP4', 'WEIGHT - 4 categories - quartiles')

# Categorize WEIGHT into 3 groups (100-200 lbs, 201-300 lbs, 301-600 lbs)
sub2['WEIGHTGROUP3'] = pd.cut(sub2['WEIGHT'], [100, 200, 300, 600],
                              labels=["100-200 lbs", "201-300 lbs", "301-600 lbs"])
print_value_counts(sub2, 'WEIGHTGROUP3', 'Counts for WEIGHTGROUP3')

# Crosstab of WEIGHTGROUP3 and WEIGHT
print(pd.crosstab(sub2['WEIGHTGROUP3'], sub2['WEIGHT']))

# Frequency distribution for WEIGHTGROUP3
print_value_counts(sub2, 'WEIGHTGROUP3', 'Counts for WEIGHTGROUP3')
print('Percentages for WEIGHTGROUP3')
print(sub2['WEIGHTGROUP3'].value_counts(sort=False, normalize=True))
Counts for original S3AQ3B1
S3AQ3B1
1.000000    81
2.000000     6
5.000000     2
4.000000     6
3.000000     3
6.000000     4
Name: count, dtype: int64

Counts for S3AQ3B1 with 9 set to NaN and number of missing requested
S3AQ3B1
1.000000    81
2.000000     6
5.000000     2
4.000000     6
3.000000     3
6.000000     4
Name: count, dtype: int64

S2AQ8A with Blanks recoded as 11 and 99 set to NaN
S2AQ8A
6          12
4           2
7          14
5          16
(blank)    28
1           6
2           2
10          9
3           5
9           5
8           3
Name: count, dtype: int64
count     102
unique     11
top       (blank)
freq       28
Name: S2AQ8A, dtype: object

Counts for WEIGHT
WEIGHT
534.703087    1
476.841101    5
534.923423    1
568.208544    1
398.855701    1
..
584.984241    1
577.814060    1
502.267758    1
591.875275    1
483.885024    1
Name: count, Length: 86, dtype: int64

Percentages for WEIGHT
WEIGHT
534.703087   0.009804
476.841101   0.049020
534.923423   0.009804
568.208544   0.009804
398.855701   0.009804
..
584.984241   0.009804
577.814060   0.009804
502.267758   0.009804
591.875275   0.009804
483.885024   0.009804
Name: proportion, Length: 86, dtype: float64

WEIGHT - 4 categories - quartiles
WEIGHTGROUP4
1=0%tile     26
2=25%tile    25
3=50%tile    25
4=75%tile    26
Name: count, dtype: int64

Counts for WEIGHTGROUP3
WEIGHTGROUP3
100-200 lbs      0
201-300 lbs      0
301-600 lbs    102
Name: count, dtype: int64

WEIGHT        398.855701  437.144557  …  599.285226  599.720557
WEIGHTGROUP3                          …
301-600 lbs            1           1  …           1           1

[1 rows x 86 columns]

Counts for WEIGHTGROUP3
WEIGHTGROUP3
100-200 lbs      0
201-300 lbs      0
301-600 lbs    102
Name: count, dtype: int64

Percentages for WEIGHTGROUP3
WEIGHTGROUP3
100-200 lbs   0.000000
201-300 lbs   0.000000
301-600 lbs   1.000000
Name: proportion, dtype: float64
I changed the code to look at the weight of people who have smoked in the past year. All 102 smokers in the subset fall into the third weight group (301-600 lbs); none fall into the 100-200 lbs or 201-300 lbs groups.
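The three weight groups come from bin edges at 100, 200, 300 and 600. As a minimal sketch of how a value lands in a bin, the function below mimics right-closed intervals, which is how `pd.cut` treats its edges by default (`right=True`). The function name is hypothetical, and the two sample values are taken from the crosstab output.

```python
from bisect import bisect_left

EDGES = [100, 200, 300, 600]
LABELS = ["100-200 lbs", "201-300 lbs", "301-600 lbs"]

def weight_group(weight):
    """Return the label of the right-closed bin (low, high] holding weight, or None."""
    index = bisect_left(EDGES, weight) - 1
    if 0 <= index < len(LABELS):
        return LABELS[index]
    # Values at or below the lowest edge, or above the highest, fall outside every bin
    return None

# Two WEIGHT values that appear in the crosstab output
for value in (398.855701, 599.720557):
    print(value, weight_group(value))

# Edge cases: the upper edge belongs to the bin, the lowest edge does not
print(weight_group(200), weight_group(100))
```

Because every WEIGHT value in the subset is above 300, each one maps to the last label, which matches the counts of 0, 0 and 102.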
perimichael · 7 months ago
gapminder data review
# -*- coding: utf-8 -*-
"""
Spyder Editor

This is a temporary script file.
"""
import pandas
import numpy

# read file
data = pandas.read_csv('gapminder.csv', low_memory=False)

# set columns to uppercase
data.columns = map(str.upper, data.columns)

# bug fix for display formats
pandas.set_option("display.float_format", lambda x: "%f" % x)

# print # of rows and columns
print(len(data))  # number of observations (rows)
print(len(data.columns))  # number of variables (columns)

# print income per person and %
print("counts for incomeperperson = income per person")
c1 = data["INCOMEPERPERSON"].value_counts(sort=False)
print(c1)

print("percentages for income per person")
p1 = data["INCOMEPERPERSON"].value_counts(sort=False, normalize=True)
print(p1)

# print HIVRATE and %
print("counts for HIV RATE")
c2 = data["HIVRATE"].value_counts(sort=False)
print(c2)

print("percentages for HIV RATE")
p2 = data["HIVRATE"].value_counts(sort=False, normalize=True)
print(p2)

# print ALCOHOL CONSUMPTION and %
print("counts for ALCOHOL CONSUMPTION")
c3 = data["ALCCONSUMPTION"].value_counts(sort=False)
print(c3)

print("percentages for ALCOHOL CONSUMPTION")
p3 = data["ALCCONSUMPTION"].value_counts(sort=False, normalize=True)
print(p3)
213
16

counts for incomeperperson = income per person
INCOMEPERPERSON
(blank)             23
1914.99655094922     1
2231.99333515006     1
21943.3398976022     1
1381.00426770244     1
..
5528.36311387522     1
722.807558834445     1
610.3573673206       1
432.226336974583     1
320.771889948584     1
Name: count, Length: 191, dtype: int64

percentages for income per person
INCOMEPERPERSON
(blank)            0.107981
1914.99655094922   0.004695
2231.99333515006   0.004695
21943.3398976022   0.004695
1381.00426770244   0.004695
..
5528.36311387522   0.004695
722.807558834445   0.004695
610.3573673206     0.004695
432.226336974583   0.004695
320.771889948584   0.004695
Name: proportion, Length: 191, dtype: float64

counts for HIV RATE
HIVRATE
(blank)    66
.1         28
2           2
.5          5
.3         10
3.1         1
.06        16
1.4         1
.2         15
2.3         1
1.2         4
24.8        1
.45         1
3.3         1
5.3         1
4.7         1
3.4         3
.4          9
2.5         2
.9          4
.8          5
5           1
5.2         1
1.8         1
1.3         2
1.9         1
1.7         1
6.3         1
.7          3
23.6        1
1.5         2
11          1
1           4
11.5        1
.6          3
13.1        1
3.6         1
2.9         1
1.6         1
17.8        1
1.1         2
25.9        1
5.6         1
3.2         1
6.5         1
13.5        1
14.3        1
Name: count, dtype: int64

percentages for HIV RATE
HIVRATE
(blank)   0.309859
.1        0.131455
2         0.009390
.5        0.023474
.3        0.046948
3.1       0.004695
.06       0.075117
1.4       0.004695
.2        0.070423
2.3       0.004695
1.2       0.018779
24.8      0.004695
.45       0.004695
3.3       0.004695
5.3       0.004695
4.7       0.004695
3.4       0.014085
.4        0.042254
2.5       0.009390
.9        0.018779
.8        0.023474
5         0.004695
5.2       0.004695
1.8       0.004695
1.3       0.009390
1.9       0.004695
1.7       0.004695
6.3       0.004695
.7        0.014085
23.6      0.004695
1.5       0.009390
11        0.004695
1         0.018779
11.5      0.004695
.6        0.014085
13.1      0.004695
3.6       0.004695
2.9       0.004695
1.6       0.004695
17.8      0.004695
1.1       0.009390
25.9      0.004695
5.6       0.004695
3.2       0.004695
6.5       0.004695
13.5      0.004695
14.3      0.004695
Name: proportion, dtype: float64

counts for ALCOHOL CONSUMPTION
ALCCONSUMPTION
.03      1
7.29     1
.69      1
10.17    1
5.57     1
..
7.6      1
3.91     1
.2       1
3.56     1
4.96     1
Name: count, Length: 181, dtype: int64

percentages for ALCOHOL CONSUMPTION
ALCCONSUMPTION
.03     0.004695
7.29    0.004695
.69     0.004695
10.17   0.004695
5.57    0.004695
..
7.6     0.004695
3.91    0.004695
.2      0.004695
3.56    0.004695
4.96    0.004695
Name: proportion, Length: 181, dtype: float64
Analysis: The tables show that for income per person and alcohol consumption, the values listed each appear only once, with country being the unique identifier for each row. HIVRATE values repeat more often and cluster at low rates (for example, .1 appears 28 times and .06 appears 16 times), and 66 of the 213 rows have no HIV rate recorded. I am hoping to correlate all this data together to draw conclusions.
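The counts and percentages above are two views of the same tally: each percentage is a count divided by the number of rows. Here is a minimal sketch of that calculation using only the standard library. The function name and the short sample list are hypothetical; the final line uses the 66 blank HIVRATE rows out of 213 from the output.

```python
from collections import Counter

def value_proportions(values):
    """Map each distinct value to a (count, proportion) pair."""
    counts = Counter(values)
    total = len(values)
    return {value: (count, count / total) for value, count in counts.items()}

# Hypothetical HIVRATE-style entries, including blanks stored as empty strings
sample = ["", ".1", ".1", "2", ".06", ".1", ""]
for value, (count, share) in value_proportions(sample).items():
    print(repr(value), count, round(share, 6))

# Proportion of blank HIVRATE rows reported in the output above
blank_share = 66 / 213
print(round(blank_share, 6))
```

The blank entries are counted like any other value here, which is why they show up as their own row in the frequency tables.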
perimichael · 8 months ago
Data Analytics Project
The data set I have selected is the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC).
My question is whether there is a correlation between drug use and depression within the dataset. I believe there will be a positive correlation between the two variables.
In Google Scholar, I searched "drug abuse rates vs major depression" and found an article (drug abuse rates vs major depression - Google Scholar) that makes me think my hypothesis will be correct.
The article states the following: "For example, the crude lifetime prevalences for depression ranged from 0.7 to 8.6 per 100; for alcohol it was 7.5 to 32.6, and for drugs it was 3.1 to 10.5."
This makes me think drugs are not the leading cause of depression but are a contributing factor.