#replace values in pandas dataframe
codewithnazam · 6 months
Text
DataFrame in Pandas: Guide to Creating Awesome DataFrames
Explore how to create a dataframe in Pandas, including data input methods, customization options, and practical examples.
Data analysis used to be a daunting task, reserved for statisticians and mathematicians. But with the rise of powerful tools like Python and its fantastic library, Pandas, anyone can become a data whiz! Pandas, in particular, shines with its DataFrames, these nifty tables that organize and manipulate data like magic. But where do you start? Fear not, fellow data enthusiast, for this guide will…
Tumblr media
0 notes
eroz-codes · 21 days
Text
AI_MLCourse_Day04
Topic: Managing Data (Week 2 Summary)
Building a Data Matrix
When first making a matrix, it is important that the features (the columns, when looking at a Pandas DataFrame) are clearly defined. One of these features will be called the 'label', and that is what the model will try to predict.
Understand the samples and clearly define them. Samples can overlap one another so long as the model only looks at what it needs to.
Data types include numeric (continuous and integer) and categorical (ordinal and nominal). Continuous and integer features are generally ready for ML without much prep, but can lead to outliers (continuous especially). Ordinal values can be represented as integers, but the range is so small they might as well be categorical. Note that numerical ordinal values will be treated as numbers by the model, so special care must be taken. Regression and classification can both be used on these values.
Feature Engineering
When starting to feature engineer, focus on either mapping concepts to data representations or manipulating the data so it is appropriate for common ML APIs.
Oftentimes, you will need to adjust the features present so that they better fit a model. There are many different ways to accomplish this, but one important one is called one-hot encoding: a manner of transforming categorical values into binary features, as sketched below.
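A minimal sketch with pandas (the toy 'color' column is an assumption for illustration):
import pandas as pd
df = pd.DataFrame({'color': ['red', 'green', 'red', 'blue']})
encoded = pd.get_dummies(df, columns=['color'])  # one binary column per category
print(encoded)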
Exploring Data
When looking at data frames, the questions of how the data is distributed, whether it is redundant, and how features correlate with the chosen label should be at the forefront of the mind. The whole point of this exploration is to check for data quality issues.
Good Python libraries for this include Matplotlib and Seaborn. Matplotlib is good on its own, but Seaborn builds on Matplotlib to be even better for tabular data.
Avoid correlation between features by looking at the bivariate distributions (see the sketch below). Review old statistics notes. They are helpful.
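A rough sketch of both checks, assuming toy columns 'x1', 'x2', and 'label':
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'x1': [1, 2, 3, 4, 5], 'x2': [2, 4, 6, 8, 10], 'label': [0, 0, 1, 1, 1]})
sns.heatmap(df.corr(), annot=True)  # pairwise correlations, including with the label
plt.show()
sns.pairplot(df)  # bivariate distributions for every pair of features
plt.show()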
Cleaning Data
Look for any outliers in the data. They could show that you have fucked something up in getting the data or that you just have something weird happening. Methods of detection? Z-score and interquartile range (ha, you thought you were done with this; you are never done with anything in CS, there is always a callback).
Handle said outliers through execution (removal), or winsorization (where you replace the outliers with a reasonable high value). A sketch of both detection methods follows.
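A minimal pandas-only sketch (toy series; the 1.5×IQR fences are one common convention, not the only choice):
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])  # 95 is an obvious outlier
z = (s - s.mean()) / s.std()  # Z-score detection: flag large |z| values
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)  # IQR detection
winsorized = s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)  # cap instead of drop
print(s[outliers])
print(winsorized.tolist())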
__________
That's it for this week. I am tired but carrying on. I leave you with a quote, "It's not enough to win, everyone else must lose," or something along those lines.
0 notes
perimichael · 21 days
Text
weight of smokers
import pandas as pd
import numpy as np
Read the dataset
data = pd.read_csv('nesarc_pds.csv', low_memory=False)
Bug fix for display formats to avoid runtime errors
pd.set_option('display.float_format', lambda x: '%f' % x)
Setting variables to numeric
numeric_columns = ['TAB12MDX', 'CHECK321', 'S3AQ3B1', 'S3AQ3C1', 'WEIGHT']
data[numeric_columns] = data[numeric_columns].apply(pd.to_numeric)
Subset data to adults 100 to 600 lbs who have smoked in the past 12 months
sub1 = data[(data['WEIGHT'] >= 100) & (data['WEIGHT'] <= 600) & (data['CHECK321'] == 1)]
Make a copy of the subsetted data
sub2 = sub1.copy()
def print_value_counts(df, column, description):
    """Print value counts for a specific column in the dataframe."""
    print(f'{description}')
    counts = df[column].value_counts(sort=False, dropna=False)
    print(counts)
    return counts
Initial counts for S3AQ3B1
print_value_counts(sub2, 'S3AQ3B1', 'Counts for original S3AQ3B1')
Recode missing values to NaN
sub2['S3AQ3B1'].replace(9, np.nan, inplace=True)
sub2['S3AQ3C1'].replace(99, np.nan, inplace=True)
Counts after recoding missing values
print_value_counts(sub2, 'S3AQ3B1', 'Counts for S3AQ3B1 with 9 set to NaN and number of missing requested')
Recode missing values for S2AQ8A
sub2['S2AQ8A'].fillna(11, inplace=True)
sub2['S2AQ8A'].replace(99, np.nan, inplace=True)
Check coding for S2AQ8A
print_value_counts(sub2, 'S2AQ8A', 'S2AQ8A with Blanks recoded as 11 and 99 set to NaN')
print(sub2['S2AQ8A'].describe())
Recode values for S3AQ3B1 into new variables
recode1 = {1: 6, 2: 5, 3: 4, 4: 3, 5: 2, 6: 1}
recode2 = {1: 30, 2: 22, 3: 14, 4: 5, 5: 2.5, 6: 1}
sub2['USFREQ'] = sub2['S3AQ3B1'].map(recode1)
sub2['USFREQMO'] = sub2['S3AQ3B1'].map(recode2)
Create secondary variable
sub2['NUMCIGMO_EST'] = sub2['USFREQMO'] * sub2['S3AQ3C1']
Examine frequency distributions for WEIGHT
print_value_counts(sub2, 'WEIGHT', 'Counts for WEIGHT')
print('Percentages for WEIGHT')
print(sub2['WEIGHT'].value_counts(sort=False, normalize=True))
Quartile split for WEIGHT
sub2['WEIGHTGROUP4'] = pd.qcut(sub2['WEIGHT'], 4, labels=["1=0%tile", "2=25%tile", "3=50%tile", "4=75%tile"])
print_value_counts(sub2, 'WEIGHTGROUP4', 'WEIGHT - 4 categories - quartiles')
Categorize WEIGHT into 3 groups (100-200 lbs, 200-300 lbs, 300-600 lbs)
sub2['WEIGHTGROUP3'] = pd.cut(sub2['WEIGHT'], [100, 200, 300, 600], labels=["100-200 lbs", "201-300 lbs", "301-600 lbs"])
print_value_counts(sub2, 'WEIGHTGROUP3', 'Counts for WEIGHTGROUP3')
Crosstab of WEIGHTGROUP3 and WEIGHT
print(pd.crosstab(sub2['WEIGHTGROUP3'], sub2['WEIGHT']))
Frequency distribution for WEIGHTGROUP3
print_value_counts(sub2, 'WEIGHTGROUP3', 'Counts for WEIGHTGROUP3')
print('Percentages for WEIGHTGROUP3')
print(sub2['WEIGHTGROUP3'].value_counts(sort=False, normalize=True))
Counts for original S3AQ3B1
S3AQ3B1
1.000000    81
2.000000     6
5.000000     2
4.000000     6
3.000000     3
6.000000     4
Name: count, dtype: int64

Counts for S3AQ3B1 with 9 set to NaN and number of missing requested
S3AQ3B1
1.000000    81
2.000000     6
5.000000     2
4.000000     6
3.000000     3
6.000000     4
Name: count, dtype: int64

S2AQ8A with Blanks recoded as 11 and 99 set to NaN
S2AQ8A
6     12
4      2
7     14
5     16
      28
1      6
2      2
10     9
3      5
9      5
8      3
Name: count, dtype: int64

count     102
unique     11
top
freq       28
Name: S2AQ8A, dtype: object

Counts for WEIGHT
WEIGHT
534.703087    1
476.841101    5
534.923423    1
568.208544    1
398.855701    1
             ..
584.984241    1
577.814060    1
502.267758    1
591.875275    1
483.885024    1
Name: count, Length: 86, dtype: int64

Percentages for WEIGHT
WEIGHT
534.703087    0.009804
476.841101    0.049020
534.923423    0.009804
568.208544    0.009804
398.855701    0.009804
                ...
584.984241    0.009804
577.814060    0.009804
502.267758    0.009804
591.875275    0.009804
483.885024    0.009804
Name: proportion, Length: 86, dtype: float64

WEIGHT - 4 categories - quartiles
WEIGHTGROUP4
1=0%tile     26
2=25%tile    25
3=50%tile    25
4=75%tile    26
Name: count, dtype: int64

Counts for WEIGHTGROUP3
WEIGHTGROUP3
100-200 lbs      0
201-300 lbs      0
301-600 lbs    102
Name: count, dtype: int64

WEIGHT        398.855701  437.144557  ...  599.285226  599.720557
WEIGHTGROUP3                          ...
301-600 lbs            1           1  ...           1           1

[1 rows x 86 columns]

Counts for WEIGHTGROUP3
WEIGHTGROUP3
100-200 lbs      0
201-300 lbs      0
301-600 lbs    102
Name: count, dtype: int64

Percentages for WEIGHTGROUP3
WEIGHTGROUP3
100-200 lbs    0.000000
201-300 lbs    0.000000
301-600 lbs    1.000000
Name: proportion, dtype: float64
I changed the code to look at the weight of smokers who have smoked in the past year. For weight group 3, all 102 people in the subset who smoked in the last year fall into the 301-600 lbs category.
0 notes
Text
Making Data Management Decisions
Starting with import the libraries to use
import pandas as pd
import numpy as np
data=pd.read_csv("nesarc_pds.csv", low_memory=False)
Now we create a new dataset with the variables that we want
sub_data=data[[ 'AGE', 'S2AQ8A' , 'S2AQ8B' , 'S4AQ20C' , 'S9Q19C']]
I made a copy to work with
sub_data2=sub_data.copy()
We can obtain info about our dataframe to see the types of the variables
sub_data2.info()
Tumblr media
We see that four variables are objects, so we can convert them to type float by using pd.to_numeric
sub_data2 = sub_data2.apply(pd.to_numeric, errors='coerce')
sub_data2.info()
Tumblr media
At this point we can observe that some variables have values corresponding to answers that don't give us any information
Tumblr media Tumblr media Tumblr media
We can see that these four variables include the values 99 and 9 as unknown answers, so we can replace them with NaN using the following lines of code:
sub_data2 = sub_data2.replace(99, np.nan)
sub_data2 = sub_data2.replace(9, np.nan)
And drop these values with
sub_data2 = sub_data2.dropna()
print(len(sub_data2))
1058
I want to create a secondary variable that tells me how many drinks the individual consumed last year, so I recode the values of S2AQ8A as the number of times the individual consumed alcohol last year.
For example, the value 1 in S2AQ8A is the answer that the individual consumed alcohol every day, so he consumed alcohol 365 times last year. For the value 2, I code it as the individual consuming alcohol 29 days per month, which gives 348 times in the last year.
I did it with the following statement:
recode = {1: 365, 2: 348, 3: 192, 4: 96, 5: 48, 6: 36, 7: 12, 8: 11, 9: 6, 10: 2}
sub_data2['S2AQ8A'] = sub_data2['S2AQ8A'].map(recode)
Additionally, I group the individuals by their ages, dividing into 18 to 30, 31 to 50, and 51 to 99.
sub_data2['AGEGROUP'] = pd.cut(sub_data2.AGE, [17, 30, 50, 99])
And I can see the percentages of each interval
sub_data2['AGEGROUP'].value_counts(normalize=True)
Tumblr media
Now I create the variable 'DLY' (drinks consumed last year) with the following statement:
sub_data2['DLY'] = sub_data2['S2AQ8A'] * sub_data2['S2AQ8B']
sub_data2.head()
Tumblr media
The variables S4AQ20C and S9Q19C correspond to the questions:
DRANK ALCOHOL TO IMPROVE MOOD PRIOR TO LAST 12 MONTHS
DRANK ALCOHOL TO AVOID GENERALIZED ANXIETY PRIOR TO LAST 12 MONTHS
respectively.
The values for this question are:
1 = yes
2 = no
I want to know if people who decide to consume alcohol to avoid anxiety or improve mood tend to consume more alcohol than people who don't.
So I did this:
sub_data3=sub_data2.groupby(['S4AQ20C','S9Q19C','AGEGROUP'], observed=True)
And I use value_counts to analyze the frequency
sub_data3['S4AQ20C'].value_counts()
Tumblr media
From this we can see the following:
158 individuals consume alcohol to improve mood or avoid anxiety, which represents 14.93%
151 individuals consume alcohol to improve mood but not to avoid anxiety, which represents 14.27%
57 individuals consume alcohol to avoid anxiety but not to improve mood, which represents 5.40%
692 individuals don't consume alcohol to avoid anxiety or improve mood, which represents 65.40%
We can obtain more information by using
sub_data3[['S2AQ8A','S2AQ8B','DLY']].agg(['count','mean', 'std'])
Tumblr media
From this we can see, for example:
Most people are between 31 and 50 years old and don't consume alcohol to improve mood or avoid anxiety; they average 141 drinks in the last year, which is the lowest average.
The highest average of drinks consumed last year is 1032 and corresponds to individuals between 31 and 50 years old who consume alcohol to improve mood or avoid anxiety; second place is for individuals between 18 and 30 years old who also consume alcohol to improve mood or avoid anxiety.
This suggests that age is not a determining factor for 'DLY', but 'S2AQ8A' and 'S2AQ8B' are.
0 notes
edcater · 2 months
Text
Demystifying Data Science: Essential Concepts for Beginners
In today's data-driven world, the field of data science stands out as a beacon of opportunity. With Python programming as its cornerstone, data science opens doors to insights, predictions, and solutions across countless industries. If you're a beginner looking to dive into this exciting realm, fear not! This article will serve as your guide, breaking down essential concepts in a straightforward manner.
1. Introduction to Data Science
Data science is the art of extracting meaningful insights and knowledge from data. It combines aspects of statistics, computer science, and domain expertise to analyze complex data sets.
2. Why Python?
Python has emerged as the go-to language for data science, and for good reasons. It boasts simplicity, readability, and a vast array of libraries tailored for data manipulation, analysis, and visualization.
3. Setting Up Your Python Environment
Before we dive into coding, let's ensure your Python environment is set up. You'll need to install Python and a few key libraries such as Pandas, NumPy, and Matplotlib. These libraries will be your companions throughout your data science journey.
4. Understanding Data Types
In Python, everything is an object with a type. Common data types include integers, floats (decimal numbers), strings (text), booleans (True/False), and more. Understanding these types is crucial for data manipulation.
5. Data Structures in Python
Python offers versatile data structures like lists, dictionaries, tuples, and sets. These structures allow you to organize and work with data efficiently. For instance, lists are sequences of elements, while dictionaries are key-value pairs.
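A quick sketch of all four with toy values:
prices = [9.99, 14.50, 3.25]             # list: ordered, mutable sequence
person = {"name": "Ada", "age": 36}      # dictionary: key-value pairs
point = (3, 4)                           # tuple: ordered and immutable
tags = {"python", "pandas", "python"}    # set: duplicates collapse to unique items
print(prices[0], person["name"], point[1], tags)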
6. Introduction to Pandas
Pandas is a powerhouse library for data manipulation. It introduces two main data structures: Series (1-dimensional labeled array) and DataFrame (2-dimensional labeled data structure). These structures make it easy to clean, transform, and analyze data.
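As a minimal sketch with toy data:
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])  # 1-dimensional labeled array
df = pd.DataFrame({'name': ['Ada', 'Bob'], 'score': [91, 84]})  # 2-dimensional labeled table
print(s)
print(df)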
7. Data Cleaning and Preprocessing
Before diving into analysis, you'll often need to clean messy data. This involves handling missing values, removing duplicates, and standardizing formats. Pandas provides functions like dropna(), fillna(), and replace() for these tasks.
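A small sketch of those three functions on toy data (column names are made up for illustration):
import pandas as pd
import numpy as np

df = pd.DataFrame({'city': ['NY', 'ny', None, 'LA'], 'temp': [21.0, np.nan, 19.5, 23.1]})
df['temp'] = df['temp'].fillna(df['temp'].mean())  # fill missing temperatures with the mean
df['city'] = df['city'].replace({'ny': 'NY'})      # standardize formats
df = df.dropna(subset=['city']).drop_duplicates()  # drop rows with no city, then drop duplicates
print(df)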
8. Basic Data Analysis with Pandas
Now that your data is clean, let's analyze it! Pandas offers a plethora of functions for descriptive statistics, such as mean(), median(), min(), and max(). You can also group data using groupby() and create pivot tables for deeper insights.
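For example, with a toy table of hypothetical 'region' and 'sales' columns:
import pandas as pd

df = pd.DataFrame({'region': ['N', 'S', 'N', 'S'], 'sales': [100, 80, 120, 90]})
print(df['sales'].mean(), df['sales'].median(), df['sales'].min(), df['sales'].max())
print(df.groupby('region')['sales'].mean())  # average sales per region
print(df.pivot_table(values='sales', index='region', aggfunc='sum'))  # pivot table of totals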
9. Data Visualization with Matplotlib
They say a picture is worth a thousand words, and in data science, visualization is key. Matplotlib, a popular plotting library, allows you to create various charts, histograms, scatter plots, and more. Visualizing data helps in understanding trends and patterns.
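A first histogram takes only a few lines (toy values assumed):
import matplotlib.pyplot as plt

values = [3, 7, 7, 2, 9, 4, 7, 5]
plt.hist(values, bins=5)  # distribution of a toy sample
plt.xlabel('Value')
plt.ylabel('Count')
plt.title('A first Matplotlib histogram')
plt.show()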
Conclusion
Congratulations! You've embarked on your data science journey with Python as your trusty companion. This article has laid the groundwork, introducing you to essential concepts and tools. Remember, practice makes perfect. As you explore further, you'll uncover the vast possibilities data science offers—from predicting trends to making informed decisions. So, grab your Python interpreter and start exploring the world of data!
In the realm of data science, Python programming serves as the key to unlocking insights from vast amounts of information. This article aims to demystify the field, providing beginners with a solid foundation to begin their journey into the exciting world of data science.
0 notes
chieffurycollection · 10 months
Text
Linear regression alcohol consumption vs number of alcoholic parents
Code
import pandas
import numpy
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
import seaborn
import statsmodels.formula.api as smf
print("start import") data = pandas.read_csv('nesarc_pds.csv', low_memory=False) print("import done")
upper-case all Dataframe column names --> unification
data.columns = map(str.upper, data.columns)
bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)
print(len(data))          # number of observations (rows)
print(len(data.columns))  # number of variables (columns)
checking the format of your variables
setting variables you will be working with to numeric
data['S2DQ1'] = pandas.to_numeric(data['S2DQ1'])  # Blood/Natural Father
data['S2DQ2'] = pandas.to_numeric(data['S2DQ2'])  # Blood/Natural Mother
data['S2BQ3A'] = pandas.to_numeric(data['S2BQ3A'], errors='coerce')  # Age at first alcohol abuse
data['S3CQ14A3'] = pandas.to_numeric(data['S3CQ14A3'], errors='coerce')
Blood/Natural Father was alcoholic
print("number blood/natural father was alcoholic")
0 = no; 1= yes; unknown = nan
data['S2DQ1'] = data['S2DQ1'].replace({2: 0, 9: numpy.nan})
c1 = data['S2DQ1'].value_counts(sort=False).sort_index()
print(c1)
print("percentage blood/natural father was alcoholic") p1 = data['S2DQ1'].value_counts(sort=False, normalize=True).sort_index() print (p1)
Blood/Natural Mother was alcoholic
print("number blood/natural mother was alcoholic")
0 = no; 1= yes; unknown = nan
data['S2DQ2'] = data['S2DQ2'].replace({2: 0, 9: numpy.nan})
c2 = data['S2DQ2'].value_counts(sort=False).sort_index()
print(c2)
print("percentage blood/natural mother was alcoholic") p2 = data['S2DQ2'].value_counts(sort=False, normalize=True).sort_index() print (p2)
Data Management: Number of parents with background of alcoholism is calculated
0 = no parents; 1 = at least 1 parent (maybe one answer missing); 2 = 2 parents; nan = 1 unknown and 1 zero or both unknown
print("number blood/natural parents was alcoholic") data['Num_alcoholic_parents'] = numpy.where((data['S2DQ1'] == 1) & (data['S2DQ2'] == 1), 2, numpy.where((data['S2DQ1'] == 1) & (data['S2DQ2'] == 0), 1, numpy.where((data['S2DQ1'] == 0) & (data['S2DQ2'].isna()), numpy.nan, numpy.where((data['S2DQ1'] == 0) & (data['S2DQ2'].isna()), numpy.nan, numpy.where((data['S2DQ1'] == 0) & (data['S2DQ2'] == 0), 0, numpy.nan)))))
c5 = data['Num_alcoholic_parents'].value_counts(sort=False).sort_index()
print(c5)
print("percentage blood/natural parents was alcoholic") p5 = data['Num_alcoholic_parents'].value_counts(sort=False, normalize=True).sort_index() print (p5)
___________________________________________________________________Graphs_________________________________________________________________________
Create a chart for c5
plt.figure(figsize=(8, 5))
plt.bar(c5.index, c5.values)
plt.xlabel('Num_alcoholic_parents')
plt.ylabel('Frequency')
plt.title('Frequency distribution of Num_alcoholic_parents')
plt.xticks(c5.index)
plt.show()
Create a chart for p5
plt.figure(figsize=(8, 5))
plt.bar(p5.index, p5.values * 100)
plt.xlabel('Num_alcoholic_parents')
plt.ylabel('Frequency (%)')
plt.title('Frequency distribution of Num_alcoholic_parents')
plt.xticks(c5.index)
plt.show()
print("lineare Regression")
Remove rows with NaN values in the relevant columns.
data_cleaned = data.dropna(subset=['Num_alcoholic_parents', 'S2BQ3A'])
Define the independent variable (X) and the dependent variable (y).
X = data_cleaned['Num_alcoholic_parents']
y = data_cleaned['S2BQ3A']
Add a constant to compute the intercept.
X = sm.add_constant(X)
Create the linear regression model.
model = sm.OLS(y, X).fit()
Print the model summary.
print(model.summary())
Result
Tumblr media
The frequency distribution shows how many of the people from the study have alcoholic parents and whether it is only one or both parents.
Tumblr media
R-squared: The R-squared value is 0.002, indicating that only about 0.2% of the variation in "S2BQ3A" (quantity of alcoholic drinks consumed) is explained by the variable "Num_alcoholic_parents" (number of alcoholic parents). This suggests that there is likely no strong linear relationship between these two variables.
F-statistic: The F-statistic has a value of 21.29, and the associated probability (Prob (F-statistic)) is very small (3.99e-06). The F-statistic is used to assess the overall effectiveness of the model. In this case, the low probability suggests that at least one of the independent variables has a significant impact on the dependent variable.
Coefficients: The coefficients of the model show the estimated effects of the independent variables on the dependent variable. In this case, the constant (const) has a value of 29.1770, representing the estimated average value of "S2BQ3A" when "Num_alcoholic_parents" is zero. The coefficient for "Num_alcoholic_parents" is -1.6397, meaning that an additional alcoholic parent is associated with an average decrease of 1.6397 units in "S2BQ3A" (quantity of alcoholic drinks consumed).
P-Values (P>|t|): The p-values next to the coefficients indicate the significance of each coefficient. In this case, both the constant and the coefficient for "Num_alcoholic_parents" are highly significant (p-values close to zero). This suggests that "Num_alcoholic_parents" has a statistically significant impact on "S2BQ3A."
AIC and BIC: AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) are model evaluation measures. Lower values indicate a better model fit. In this case, both AIC and BIC are relatively low, which could indicate the adequacy of the model.
In summary, there is a statistically significant but very weak negative relationship between the number of alcoholic parents and the quantity of alcoholic drinks consumed. This means that an increase in the number of alcoholic parents is associated with a slight decrease in the quantity of alcohol consumed, but it explains only a very limited amount of the variation in the quantity of alcohol consumed.
0 notes
rbem-amo · 10 months
Text
Data analysis Tools: Module 3. Pearson correlation coefficient (r / r2)
According to my previous analyses (ANOVA and Chi-square test of independence), there is a relationship between the variables under observation:
Quantitative variable: Femaleemployrate (Percentage of female population, age above 15, that has been employed during the given year)
Quantitative variable: Incomeperperson (Gross Domestic Product per capita in constant 2000 US$).
I have calculated the Pearson Correlation coefficient and I have created the graphic:
“association between Income per person and Female employ rate
PearsonRResult(statistic=0.3212540576761761, pvalue=0.015769942076727345)”
r = 0.3213 → the correlation between female employ rate and income per person is positive but weak (near to 0)
p-value = 0.015 < 0.05 → there is a statistically significant relationship
The graphic shows the positive correlation.
In the graphic, a different correlation is observed in the low range of the x axis; perhaps this region should also be analyzed.
Tumblr media
And r² = 0.32125 × 0.32125 ≈ 0.10.
About 10% of the variability in the female employ rate can be predicted from income per person (the fraction of the variability of one variable that can be predicted by the other).
Program:
# -*- coding: utf-8 -*-
"""
Created on Thu Aug 31 13:59:25 2023
@author: ANA4MD
"""
import pandas
import numpy
import seaborn
import scipy
import matplotlib.pyplot as plt
data = pandas.read_csv('gapminder.csv', low_memory=False)
# lower-case all DataFrame column names - place after code for loading data above
data.columns = list(map(str.lower, data.columns))
# bug fix for display formats to avoid run time errors - put after code for loading data above
pandas.set_option('display.float_format', lambda x: '%f' % x)
# to fix empty data to avoid errors
data = data.replace(r'^\s*$', numpy.NaN, regex=True)
# checking the format of my variables and set to numeric
data['femaleemployrate'].dtype
data['polityscore'].dtype
data['incomeperperson'].dtype
data['urbanrate'].dtype
data['femaleemployrate'] = pandas.to_numeric(data['femaleemployrate'], errors='coerce', downcast=None)
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce', downcast=None)
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce', downcast=None)
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce', downcast=None)
data['incomeperperson']=data['incomeperperson'].replace(' ', numpy.nan)
# to create bivariate graph for the selected variables
print('relationship femaleemployrate & income per person')
# bivariate bar graph Q->Q
scat2 = seaborn.regplot(
    x="incomeperperson", y="femaleemployrate", fit_reg=False, data=data)
plt.xlabel('Income per Person')
plt.ylabel('Female Employ Rate')
plt.title('Scatterplot for the Association Between Income per Person & Femaleemployrate')
data_clean=data.dropna()
print ('association between Income per person and Female employ rate')
print (scipy.stats.pearsonr(data_clean['incomeperperson'], data_clean['femaleemployrate']))
Results:
relationship femaleemployrate & income per person
association between Income per person and Female employ rate
PearsonRResult(statistic=0.3212540576761761, pvalue=0.015769942076727345)
Tumblr media
0 notes
likeabosch · 1 year
Text
Course: Data Analysis Tools | Assignment Week 4
For assessment of a moderating effect by a third variable, I studied the association between the life expectancy and the breast cancer rate of the Gapminder dataset.
The scatterplot clearly indicates that breast cancer rate increases with life expectancy. This is not surprising, as the chance to develop breast cancer increases with longer life time. At the same time, low breast cancer rates are distributed over a wide range of life expectancy. Thus, the hypothesis was that the development of a country (i.e. the urban rate) poses a moderating effect on the correlation between breast cancer rate and life expectancy.
Tumblr media
Results:
The correlation is statistically significant for all three urban rate groups (Low, Mid, High)
The correlation gets more positive for higher urban rate
So I'd infer that there is no major moderation effect, but potentially a minor one that leads to a more positive correlation with increasing urban rate
association between life expectancy and breast cancer rate for countries with LOW urban rate
(0.40957438891713366, 0.0031404508682154326)
association between life expectancy and breast cancer rate for countries with MIDDLE urban rate
(0.5405502238971076, 1.1386264359630638e-06)
association between life expectancy and breast cancer rate for countries with HIGH urban rate
(0.6334067851215266, 6.086281365524001e-07)
Code:
Import Libraries
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
smf provides ANOVA F-test
import statsmodels.formula.api as smf
multi includes the package to do Post Hoc multi comparison test
import statsmodels.stats.multicomp as multi
scipy includes the Chi Squared Test of Independence
import scipy.stats
bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)
Set Pandas to show all colums and rows in Dataframes
pandas.set_option('display.max_columns', None)
pandas.set_option('display.max_rows', None)
Import gapminder.csv
data = pandas.read_csv('gapminder.csv', low_memory=False)
Replace all empty entries with 0
data = data.replace(r'^\s*$', numpy.NaN, regex=True)
Extract relevant variables from original dataset and save it in subdata set
print('List of extracted variables in subset')
subdata = data[['breastcancerper100th', 'lifeexpectancy', 'urbanrate']]
Safe backup file of reduced dataset
subdata2 = subdata.copy()
Convert all entries to numeric format
subdata2['breastcancerper100th'] = pandas.to_numeric(subdata2['breastcancerper100th'])
subdata2['lifeexpectancy'] = pandas.to_numeric(subdata2['lifeexpectancy'])
subdata2['urbanrate'] = pandas.to_numeric(subdata2['urbanrate'])
All rows containing value 0 / previously had no entry are deleted from the subdata set
subdata2 = subdata2.dropna()
print(subdata2)
Describe statistical distribution of variable values
print('Statistics on "Breastcancerper100th"') desc_income = subdata2['breastcancerper100th'].describe() print(desc_income) print('Statistics on "Life Expectancy"') desc_lifeexp = subdata2['lifeexpectancy'].describe() print(desc_lifeexp) print('Statistics on "urban rate"') desc_suicide = subdata2['urbanrate'].describe() print(desc_suicide)
Identify min & max values within each column
print('Minimum & Maximum Breastcancerper100th') min_bcr = min(subdata2['breastcancerper100th']) print(min_bcr) max_bcr = max(subdata2['breastcancerper100th']) print(max_bcr) print('')
print('Minimum & Maximum Life Expectancy') min_lifeexp = min(subdata2['lifeexpectancy']) print(min_lifeexp) max_lifeexp = max(subdata2['lifeexpectancy']) print(max_lifeexp) print('')
print('Minimum & Maximum Urban Rate') min_srate = min(subdata2['urbanrate']) print(min_srate) max_srate = max(subdata2['urbanrate']) print(max_srate) print('')
scat1 = seaborn.regplot(x="breastcancerper100th", y="lifeexpectancy", fit_reg=True, data=subdata2) plt.xlabel('Breast cancer rate per 100') plt.ylabel('Life expectancy, years') plt.title('Scatterplot for the Association Between Life Expectancy and Breast Cancer rate per 100')
scat2 = seaborn.regplot(x="urbanrate", y="breastcancerper100th", fit_reg=True, data=subdata2) plt.xlabel('Breast cancer rate per 100') plt.ylabel('urban rate') plt.title('Scatterplot for the Association Between Breast Cancer rate and urban rate')
print (scipy.stats.pearsonr(subdata2['breastcancerper100th'], subdata2['urbanrate']))
def urbanrategrp(row):
    if row['urbanrate'] <= 40:
        return 1
    elif row['urbanrate'] <= 70:
        return 2
    elif row['urbanrate'] <= 100:
        return 3
subdata2['urbanrategrp'] = subdata2.apply (lambda row: urbanrategrp (row),axis=1)
chk1 = subdata2['urbanrategrp'].value_counts(sort=False, dropna=False)
print(chk1)
sub1 = subdata2[(subdata2['urbanrategrp'] == 1)]
sub2 = subdata2[(subdata2['urbanrategrp'] == 2)]
sub3 = subdata2[(subdata2['urbanrategrp'] == 3)]
print('association between life expectancy and breast cancer rate for countries with LOW urban rate')
print(scipy.stats.pearsonr(sub1['lifeexpectancy'], sub1['breastcancerper100th']))
print(' ')
print('association between life expectancy and breast cancer rate for countries with MIDDLE urban rate')
print(scipy.stats.pearsonr(sub2['lifeexpectancy'], sub2['breastcancerper100th']))
print(' ')
print('association between life expectancy and breast cancer rate for countries with HIGH urban rate')
print(scipy.stats.pearsonr(sub3['lifeexpectancy'], sub3['breastcancerper100th']))
0 notes
codehunter · 1 year
Text
Pandas dataframe fillna() only some columns in place
I am trying to fill none values in a Pandas dataframe with 0's for only some subset of columns.
When I do:
import pandas as pd

df = pd.DataFrame(data={'a':[1,2,3,None],'b':[4,5,None,6],'c':[None,None,7,8]})
print(df)
df.fillna(value=0, inplace=True)
print(df)
The output:
     a    b    c
0  1.0  4.0  NaN
1  2.0  5.0  NaN
2  3.0  NaN  7.0
3  NaN  6.0  8.0

     a    b    c
0  1.0  4.0  0.0
1  2.0  5.0  0.0
2  3.0  0.0  7.0
3  0.0  6.0  8.0
It replaces every None with 0's. What I want to do is only replace the Nones in columns a and b, but not in c.
What is the best way of doing this?
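One common approach (a sketch; not necessarily the same as the answer behind the link below) is to pass fillna a dict that maps only the target columns to fill values:
import pandas as pd

df = pd.DataFrame(data={'a': [1, 2, 3, None], 'b': [4, 5, None, 6], 'c': [None, None, 7, 8]})
df.fillna({'a': 0, 'b': 0}, inplace=True)  # only 'a' and 'b' are filled; 'c' keeps its NaNs
print(df)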
https://codehunter.cc/a/python/pandas-dataframe-fillna-only-some-columns-in-place
0 notes
timothy-mokoka · 1 year
Text
Hypothesis Testing with Pearson Correlation
Introduction:
This assignment examines a sample of 2,412 Marijuana/Cannabis users from the NESARC dataset between the ages of 18 and 30. My research question is as follows:
Is the number of Cannabis joints smoked per day amongst young adults in USA between the Ages of 18 and 30 the leading cause of mental health disorders such as depression and anxiety?
My Hypothesis Test statements are as follows:
H0: The number of Cannabis joints smoked per day amongst young adults in USA between the Ages of 18 and 30 is not the leading cause of mental health disorders such as depression and anxiety.
Ha: The number of Cannabis joints smoked per day amongst young adults in USA between the Ages of 18 and 30 is the leading cause of mental health disorders such as depression and anxiety.
Explanation of the Code:
My research question involves only categorical variables, but for this Pearson correlation test I have selected three different quantitative variables from the NESARC dataset. Thus, I have refined the hypothesis and examined the correlation between the age when people began using cannabis the most, which is the quantitative explanatory variable ('S3BD5Q2F'), and the age when they experienced their first episode of general anxiety or major depression, which are the quantitative response variables ('S9Q6A') and ('S4AQ6A').
For visualizing the relationship and association between cannabis use and general anxiety and major depression episodes, I used the seaborn library to produce scatterplots for each of the mental health disorders separately and the interpretation thereof, by describing the direction as well as the strength and form of the relationships. Additionally I ran a Pearson correlation test twice, one for each mental health disorder, and measured the strength of the relationships between each of the quantitative variables by generating the correlation coefficients “r” and their associated p-values.
Code / Syntax:
# -*- coding: utf-8 -*-
"""
Created on Mon Apr 2 15:00:39 2023

@author: Oteng
"""
import pandas
import numpy
import seaborn
import scipy
import matplotlib.pyplot as plt
nesarc = pandas.read_csv ('nesarc_pds.csv' , low_memory=False)
Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)
nesarc.columns = map(str.upper , nesarc.columns)
pandas.set_option('display.float_format' , lambda x:'%f'%x)
Change my variables variables of interest to numeric
nesarc['AGE'] = pandas.to_numeric(nesarc['AGE'], errors='coerce')
nesarc['S3BQ4'] = pandas.to_numeric(nesarc['S3BQ4'], errors='coerce')
nesarc['S4AQ6A'] = pandas.to_numeric(nesarc['S4AQ6A'], errors='coerce')
nesarc['S3BD5Q2F'] = pandas.to_numeric(nesarc['S3BD5Q2F'], errors='coerce')
nesarc['S9Q6A'] = pandas.to_numeric(nesarc['S9Q6A'], errors='coerce')
nesarc['S4AQ7'] = pandas.to_numeric(nesarc['S4AQ7'], errors='coerce')
nesarc['S3BQ1A5'] = pandas.to_numeric(nesarc['S3BQ1A5'], errors='coerce')
Subset my sample
subset1 = nesarc[(nesarc['S3BQ1A5'] == 1)]  # Cannabis users
subsetc1 = subset1.copy()
Setting missing data
subsetc1['S3BQ1A5'] = subsetc1['S3BQ1A5'].replace(9, numpy.nan)
subsetc1['S3BD5Q2F'] = subsetc1['S3BD5Q2F'].replace('BL', numpy.nan)
subsetc1['S3BD5Q2F'] = subsetc1['S3BD5Q2F'].replace(99, numpy.nan)
subsetc1['S4AQ6A'] = subsetc1['S4AQ6A'].replace('BL', numpy.nan)
subsetc1['S4AQ6A'] = subsetc1['S4AQ6A'].replace(99, numpy.nan)
subsetc1['S9Q6A'] = subsetc1['S9Q6A'].replace('BL', numpy.nan)
subsetc1['S9Q6A'] = subsetc1['S9Q6A'].replace(99, numpy.nan)
Scatterplot for the age when began using cannabis the most and the age of first episode of major depression
plt.figure(figsize=(12, 4))  # Change plot size
scat1 = seaborn.regplot(x="S3BD5Q2F", y="S4AQ6A", fit_reg=True, data=subset1)
plt.xlabel('Age when began using cannabis the most')
plt.ylabel('Age when experienced the first episode of major depression')
plt.title('Scatterplot for the age when began using cannabis the most and the age of the first episode of major depression')
plt.show()
data_clean=subset1.dropna()
Pearson correlation coefficient for the age when began using cannabis the most and the age of first the episode of major depression
print('Association between the age when began using cannabis the most and the age of the first episode of major depression')
print(scipy.stats.pearsonr(data_clean['S3BD5Q2F'], data_clean['S4AQ6A']))
Scatterplot for the age when began using cannabis the most and the age of the first episode of general anxiety
plt.figure(figsize=(12, 4))  # Change plot size
scat2 = seaborn.regplot(x="S3BD5Q2F", y="S9Q6A", fit_reg=True, data=subset1)
plt.xlabel('Age when began using cannabis the most')
plt.ylabel('Age when experienced the first episode of general anxiety')
plt.title('Scatterplot for the age when began using cannabis the most and the age of the first episode of general anxiety')
plt.show()
Pearson correlation coefficient for the age when began using cannabis the most and the age of the first episode of general anxiety
print('Association between the age when began using cannabis the most and the age of the first episode of general anxiety')
print(scipy.stats.pearsonr(data_clean['S3BD5Q2F'], data_clean['S9Q6A']))
Output 1
Pearson Correlation test results are as follows:
Tumblr media Tumblr media Tumblr media
Output 2:
Tumblr media
The scatterplot illustrates the relationship between the age individuals started using cannabis the most, a quantitative explanatory variable, and the age when they experienced their first major depression episode, a quantitative response variable. The direction is positive: as the age when individuals began using cannabis the most increases, the age when they experience their first major depression episode increases. The Pearson correlation test resulted in a correlation coefficient of 0.23, indicating a weak positive linear relationship between the two quantitative variables of interest. The associated p-value is 2.27e-09, which is very small. This means that the relationship is statistically significant, although the association between the two variables is weak.
Output 3:
Tumblr media
The scatterplot above shows the association between the age when individuals began using cannabis the most, the quantitative explanatory variable, and the age when they experienced their first general anxiety episode, a quantitative response variable. The direction is a positive linear relationship. The Pearson correlation test resulted in a correlation coefficient of 0.1494, which indicates a weak positive linear relationship between the two quantitative variables. The associated p-value is 0.00012, which indicates a statistically significant relationship. Thus the relationship between the age when individuals began using cannabis the most and the age when they experience their first general anxiety episode is weak. The r², about 0.02, is very low: only about 2% of the variability in one variable can be predicted from the other.
0 notes
andidatachief56 · 1 year
Text
data analysis tools week 3
Hello,
The following correlation is to be calculated: is there a correlation between age and the quantity of beer consumed each day?
Dataset: nesarc_pds.csv
Two quantitative variables chosen:
342-343 S2AQ5D  NUMBER OF BEERS USUALLY CONSUMED ON DAYS WHEN DRANK BEER IN LAST 12 MONTHS
    18268  1-42. Number of beers
       78  99. Unknown
    24747  BL. NA, did not drink or unknown
68-69 AGE  AGE
    43079  18-97. Age in years
       14  98. 98 years or older
Program: A sub-dataframe was created with the two chosen variables 'AGE' and 'S2AQ5D'. All unknown and NA rows were dropped, as well as all drinkers of only 1 beer, to remove the ground noise. Using the Pearson correlation function of scipy.stats, the correlation between both variables is calculated.
Result: the r-value = -0.15, which means almost no correlation between age and the quantity of beer per day.
------------------------------------------------------------------------------------------------------- Program:
import os
import pandas
import numpy
import seaborn
import scipy.stats
import matplotlib.pyplot as plt  # needed for the plot labels below
# define individual name of dataset
data = pandas.read_csv('nesarc_pds.csv', low_memory=False)
# recode missing values to python missing (NaN)
data['S2AQ5D'] = data['S2AQ5D'].replace('99', numpy.nan)  # defined as char
data['S2AQ5D'] = data['S2AQ5D'].replace(' ', numpy.nan)   # needed before set to numeric
# new code setting variables you will be working with to numeric
data['S2AQ5D'] = pandas.to_numeric(data['S2AQ5D'], errors='coerce')
# data subset only for needed columns
sub1 = data[['AGE', 'S2AQ5D']]
# make a copy of my new subsetted data (important!) before removing NaN
sub2 = sub1.copy()
print(len(sub2))          # No. of rows
print(len(sub2.columns))
sub2 = sub2.dropna()
print(' --- after dropna ---')
print(len(sub2))          # No. of rows
print(len(sub2.columns))
print(sub2.value_counts(subset='S2AQ5D', normalize=True))  # as percentages
sub2=sub2[(sub2['S2AQ5D']>1)]      # drop all rows with 1 beer
print(' --- after removing 1-beer rows ---')
print(len(sub2))          # No. of rows
print(len(sub2.columns))
print(sub2.value_counts(subset='S2AQ5D', normalize=True))  # as percentages
scat1 = seaborn.regplot(x="AGE", y="S2AQ5D", fit_reg=True, data=sub2) plt.xlabel('AGE') plt.ylabel('Drinking beer per day') plt.title('Scatterplot for the Association Between AGE and consuming beer')
print('association between AGE and consuming beer')
print(scipy.stats.pearsonr(sub2['AGE'], sub2['S2AQ5D']))
---------------------------------------------------------------------------------------------
association between AGE and consuming beer
(-0.15448800873680543, 1.7107443117251883e-60)
0 notes
monuonrise · 1 year
Text
import pandas
import numpy
import seaborn
import scipy
import matplotlib.pyplot as plt
nesarc = pandas.read_csv ('nesarc_pds.csv', low_memory=False)
Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns' , None)
Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows' , None)
nesarc.columns = map(str.upper , nesarc.columns)
pandas.set_option('display.float_format' , lambda x:'%f'%x)
Change my variables to numeric
nesarc['AGE'] = pandas.to_numeric(nesarc['AGE'], errors='coerce')
nesarc['MAJORDEP12'] = pandas.to_numeric(nesarc['MAJORDEP12'], errors='coerce')
nesarc['S1Q231'] = pandas.to_numeric(nesarc['S1Q231'], errors='coerce')
nesarc['S3BQ1A5'] = pandas.to_numeric(nesarc['S3BQ1A5'], errors='coerce')
nesarc['S3BD5Q2E'] = pandas.to_numeric(nesarc['S3BD5Q2E'], errors='coerce')
Subset my sample
subset1 = nesarc[(nesarc['AGE'] >= 18) & (nesarc['AGE'] <= 30) & (nesarc['S3BQ1A5'] == 1)]  # Ages 18-30, cannabis users
subsetc1 = subset1.copy()
Setting missing data
subsetc1['S1Q231'] = subsetc1['S1Q231'].replace(9, numpy.nan)
subsetc1['S3BQ1A5'] = subsetc1['S3BQ1A5'].replace(9, numpy.nan)
subsetc1['S3BD5Q2E'] = subsetc1['S3BD5Q2E'].replace(99, numpy.nan)
subsetc1['S3BD5Q2E'] = subsetc1['S3BD5Q2E'].replace('BL', numpy.nan)
recode1 = {1: 9, 2: 8, 3: 7, 4: 6, 5: 5, 6: 4, 7: 3, 8: 2, 9: 1}  # Frequency of cannabis use variable reverse-recode
subsetc1['CUFREQ'] = subsetc1['S3BD5Q2E'].map(recode1)  # Change the variable name from S3BD5Q2E to CUFREQ
subsetc1['CUFREQ'] = subsetc1['CUFREQ'].astype('category')
Rename graph labels for better interpretation
subsetc1['CUFREQ'] = subsetc1['CUFREQ'].cat.rename_categories(["2 times/year","3-6 times/year","7-11 times/year","Once a month","2-3 times/month","1-2 times/week","3-4 times/week","Nearly every day","Every day"])
Contingency table of observed counts of major depression diagnosis (response variable) within frequency of cannabis use groups (explanatory variable), in ages 18-30
contab1 = pandas.crosstab(subsetc1['MAJORDEP12'], subsetc1['CUFREQ'])
print(contab1)
Column percentages
colsum = contab1.sum(axis=0)
colpcontab = contab1 / colsum
print(colpcontab)
Chi-square calculations for major depression within frequency of cannabis use groups
print('Chi-square value, p value, expected counts, for major depression within cannabis use status')
chsq1 = scipy.stats.chi2_contingency(contab1)
print(chsq1)
Bivariate bar graph for major depression percentages with each cannabis smoking frequency group
plt.figure(figsize=(12, 4))  # Change plot size
ax1 = seaborn.factorplot(x="CUFREQ", y="MAJORDEP12", data=subsetc1, kind="bar", ci=None)
ax1.set_xticklabels(rotation=40, ha="right")  # X-axis labels rotation
plt.xlabel('Frequency of cannabis use')
plt.ylabel('Proportion of Major Depression')
plt.show()
recode2 = {1: 10, 2: 9, 3: 8, 4: 7, 5: 6, 6: 5, 7: 4, 8: 3, 9: 2, 10: 1}  # Frequency of cannabis use variable reverse-recode
subsetc1['CUFREQ2'] = subsetc1['S3BD5Q2E'].map(recode2)  # Change the variable name from S3BD5Q2E to CUFREQ2
sub1 = subsetc1[(subsetc1['S1Q231'] == 1)]
sub2 = subsetc1[(subsetc1['S1Q231'] == 2)]
print('Association between cannabis use status and major depression for those who lost a family member or a close friend in the last 12 months')
contab2 = pandas.crosstab(sub1['MAJORDEP12'], sub1['CUFREQ2'])
print(contab2)
Column percentages
colsum2 = contab2.sum(axis=0)
colpcontab2 = contab2 / colsum2
print(colpcontab2)
Chi-square
print('Chi-square value, p value, expected counts')
chsq2 = scipy.stats.chi2_contingency(contab2)
print(chsq2)
Line graph for major depression percentages within each frequency group, for those who lost a family member or a close friend
plt.figure(figsize=(12, 4))  # Change plot size
ax2 = seaborn.factorplot(x="CUFREQ", y="MAJORDEP12", data=sub1, kind="point", ci=None)
ax2.set_xticklabels(rotation=40, ha="right")  # X-axis labels rotation
plt.xlabel('Frequency of cannabis use')
plt.ylabel('Proportion of Major Depression')
plt.title('Association between cannabis use status and major depression for those who lost a family member or a close friend in the last 12 months')
plt.show()
print('Association between cannabis use status and major depression for those who did NOT lose a family member or a close friend in the last 12 months')
contab3 = pandas.crosstab(sub2['MAJORDEP12'], sub2['CUFREQ2'])
print(contab3)
Column percentages
colsum3 = contab3.sum(axis=0)
colpcontab3 = contab3 / colsum3
print(colpcontab3)
Chi-square
print('Chi-square value, p value, expected counts')
chsq3 = scipy.stats.chi2_contingency(contab3)
print(chsq3)
Line graph for major depression percentages within each frequency group, for those who did NOT lose a family member or a close friend
plt.figure(figsize=(12, 4))  # Change plot size
ax3 = seaborn.factorplot(x="CUFREQ", y="MAJORDEP12", data=sub2, kind="point", ci=None)
ax3.set_xticklabels(rotation=40, ha="right")  # X-axis labels rotation
plt.xlabel('Frequency of cannabis use')
plt.ylabel('Proportion of Major Depression')
plt.title('Association between cannabis use status and major depression for those who did NOT lose a family member or a close friend in the last 12 months')
plt.show()
Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media
1 note · View note
Text
Correlation coefficient between the number of drinks in the last 12 months and the age when the first cigarette was smoked, with the moderator of 0, 1, or 2 parents with a history of alcoholism
code:
import pandas
import numpy
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import scipy.stats
import seaborn
print("start import") data = pandas.read_csv('nesarc_pds.csv', low_memory=False) print("import done") dataz = data
upper-case all Dataframe column names --> unification
data.columns = map(str.upper, data.columns.values.ravel())
dataz.columns = map(str.upper, data.columns.values.ravel())
dataz.replace({"": numpy.nan})
dataz['S2AQ4B'] = pandas.to_numeric(dataz['S2AQ4B'], errors='coerce')  # number of drinks last 12 months
dataz['S2AQ4B'] = dataz['S2AQ4B'].replace({99: numpy.nan})  # drop 99s (unknown)
dataz['S3AQ2A1'] = pandas.to_numeric(dataz['S3AQ2A1'], errors='coerce')  # age first smoked cigarette
dataz['S3AQ2A1'] = dataz['S3AQ2A1'].replace({99: numpy.nan})  # drop 99s (unknown)
data_clean = dataz.dropna()
# Num_alcoholic_parents is assumed to have been created as in the earlier linear-regression post
data0 = data_clean[(data_clean['Num_alcoholic_parents'] == 0)]
data1 = data_clean[(data_clean['Num_alcoholic_parents'] == 1)]
data2 = data_clean[(data_clean['Num_alcoholic_parents'] == 2)]
print('association between number of drinks in last 12 months and age at first cigarette with 0 Parents with history of alcoholism')
print(scipy.stats.pearsonr(data0['S2AQ4B'], data0['S3AQ2A1']))
print(' ')
plt.scatter(data0['S3AQ2A1'], data0['S2AQ4B'])
plt.xlabel('Age First Smoked Cigarette')
plt.ylabel('Number of Drinks Last 12 Months')
plt.title('Scatter Plot with 0 Parents with history of alcoholism')
Perform linear regression
coefficients = numpy.polyfit(data0['S3AQ2A1'], data0['S2AQ4B'], 1)
x = numpy.linspace(min(data0['S3AQ2A1']), max(data0['S3AQ2A1']), 100)
y = coefficients[0] * x + coefficients[1]
Draw the regression line
plt.plot(x, y, color='red', label='Regression Line')
Show the legend
plt.legend()
plt.show()
print(' ')
print('association between number of drinks in last 12 months and age at first cigarette with 1 Parent with history of alcoholism')
print(scipy.stats.pearsonr(data1['S2AQ4B'], data1['S3AQ2A1']))
print(' ')
plt.scatter(data1['S3AQ2A1'], data1['S2AQ4B'])
plt.xlabel('Age First Smoked Cigarette')
plt.ylabel('Number of Drinks Last 12 Months')
plt.title('Scatter Plot with 1 Parent with history of alcoholism')
Perform linear regression
coefficients = numpy.polyfit(data1['S3AQ2A1'], data1['S2AQ4B'], 1)
x = numpy.linspace(min(data1['S3AQ2A1']), max(data1['S3AQ2A1']), 100)
y = coefficients[0] * x + coefficients[1]
Draw the regression line
plt.plot(x, y, color='red', label='Regression Line')
Show the legend
plt.legend()
plt.show()
print(' ')
print('association between number of drinks in last 12 months and age at first cigarette with 2 Parents with history of alcoholism')
print(scipy.stats.pearsonr(data2['S2AQ4B'], data2['S3AQ2A1']))
print(' ')
plt.scatter(data2['S3AQ2A1'], data2['S2AQ4B'])
plt.xlabel('Age First Smoked Cigarette')
plt.ylabel('Number of Drinks Last 12 Months')
plt.title('Scatter Plot with 2 Parents with history of alcoholism')
Perform linear regression
coefficients = numpy.polyfit(data2['S3AQ2A1'], data2['S2AQ4B'], 1)
x = numpy.linspace(min(data2['S3AQ2A1']), max(data2['S3AQ2A1']), 100)
y = coefficients[0] * x + coefficients[1]
Draw the regression line
plt.plot(x, y, color='red', label='Regression Line')
Show the legend
plt.legend()
plt.show()
Result:
Tumblr media Tumblr media
The correlation coefficient is small and indicates that there is no correlation between the number of drinks in the last 12 months and the age when the first cigarette was smoked. The p-value is below 0.05 and shows that this result is statistically significant.
Tumblr media Tumblr media
The correlation coefficient is really small and indicates that there is no correlation between the number of drinks in the last 12 months and the age when the first cigarette was smoked. Nevertheless, the p-value is over 0.05 and shows that this result is not statistically significant.
Tumblr media Tumblr media
The correlation coefficient is small and indicates that there is no correlation between the number of drinks in the last 12 months and the age when the first cigarette was smoked. Nevertheless, the p-value is over 0.05 and shows that this result is not statistically significant.
0 notes
taj-15 · 1 year
Text
Testing a Potential Moderator
import pandas
import numpy
import seaborn
import scipy
import matplotlib.pyplot as plt
nesarc = pandas.read_csv ('nesarc_pds.csv', low_memory=False)
Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns' , None)
Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows' , None)
nesarc.columns = map(str.upper , nesarc.columns)
pandas.set_option('display.float_format' , lambda x:'%f'%x)
Change my variables to numeric
nesarc['AGE'] = pandas.to_numeric(nesarc['AGE'], errors='coerce')
nesarc['MAJORDEP12'] = pandas.to_numeric(nesarc['MAJORDEP12'], errors='coerce')
nesarc['S1Q231'] = pandas.to_numeric(nesarc['S1Q231'], errors='coerce')
nesarc['S3BQ1A5'] = pandas.to_numeric(nesarc['S3BQ1A5'], errors='coerce')
nesarc['S3BD5Q2E'] = pandas.to_numeric(nesarc['S3BD5Q2E'], errors='coerce')
Subset my sample
subset1 = nesarc[(nesarc['AGE'] >= 18) & (nesarc['AGE'] <= 30) & (nesarc['S3BQ1A5'] == 1)]  # Ages 18-30, cannabis users
subsetc1 = subset1.copy()
Setting missing data
subsetc1['S1Q231'] = subsetc1['S1Q231'].replace(9, numpy.nan)
subsetc1['S3BQ1A5'] = subsetc1['S3BQ1A5'].replace(9, numpy.nan)
subsetc1['S3BD5Q2E'] = subsetc1['S3BD5Q2E'].replace(99, numpy.nan)
subsetc1['S3BD5Q2E'] = subsetc1['S3BD5Q2E'].replace('BL', numpy.nan)
recode1 = {1: 9, 2: 8, 3: 7, 4: 6, 5: 5, 6: 4, 7: 3, 8: 2, 9: 1}  # Frequency of cannabis use variable reverse-recode
subsetc1['CUFREQ'] = subsetc1['S3BD5Q2E'].map(recode1)  # Change the variable name from S3BD5Q2E to CUFREQ
subsetc1['CUFREQ'] = subsetc1['CUFREQ'].astype('category')
Rename graph labels for better interpretation
subsetc1['CUFREQ'] = subsetc1['CUFREQ'].cat.rename_categories(["2 times/year","3-6 times/year","7-11 times/year","Once a month","2-3 times/month","1-2 times/week","3-4 times/week","Nearly every day","Every day"])
Contingency table of observed counts of major depression diagnosis (response variable) within frequency of cannabis use groups (explanatory variable), in ages 18-30
contab1 = pandas.crosstab(subsetc1['MAJORDEP12'], subsetc1['CUFREQ'])
print(contab1)
Column percentages
colsum = contab1.sum(axis=0)
colpcontab = contab1 / colsum
print(colpcontab)
Chi-square calculations for major depression within frequency of cannabis use groups
print('Chi-square value, p value, expected counts, for major depression within cannabis use status')
chsq1 = scipy.stats.chi2_contingency(contab1)
print(chsq1)
Bivariate bar graph for major depression percentages with each cannabis smoking frequency group
plt.figure(figsize=(12, 4))  # Change plot size
ax1 = seaborn.factorplot(x="CUFREQ", y="MAJORDEP12", data=subsetc1, kind="bar", ci=None)
ax1.set_xticklabels(rotation=40, ha="right")  # X-axis labels rotation
plt.xlabel('Frequency of cannabis use')
plt.ylabel('Proportion of Major Depression')
plt.show()
recode2 = {1: 10, 2: 9, 3: 8, 4: 7, 5: 6, 6: 5, 7: 4, 8: 3, 9: 2, 10: 1}  # Frequency of cannabis use variable reverse-recode
subsetc1['CUFREQ2'] = subsetc1['S3BD5Q2E'].map(recode2)  # Change the variable name from S3BD5Q2E to CUFREQ2
sub1 = subsetc1[(subsetc1['S1Q231'] == 1)]
sub2 = subsetc1[(subsetc1['S1Q231'] == 2)]
print('Association between cannabis use status and major depression for those who lost a family member or a close friend in the last 12 months')
contab2 = pandas.crosstab(sub1['MAJORDEP12'], sub1['CUFREQ2'])
print(contab2)
Column percentages
colsum2 = contab2.sum(axis=0)
colpcontab2 = contab2 / colsum2
print(colpcontab2)
Chi-square
print('Chi-square value, p value, expected counts')
chsq2 = scipy.stats.chi2_contingency(contab2)
print(chsq2)
Line graph for major depression percentages within each frequency group, for those who lost a family member or a close friend
plt.figure(figsize=(12, 4))  # Change plot size
ax2 = seaborn.factorplot(x="CUFREQ", y="MAJORDEP12", data=sub1, kind="point", ci=None)
ax2.set_xticklabels(rotation=40, ha="right")  # X-axis labels rotation
plt.xlabel('Frequency of cannabis use')
plt.ylabel('Proportion of Major Depression')
plt.title('Association between cannabis use status and major depression for those who lost a family member or a close friend in the last 12 months')
plt.show()
print('Association between cannabis use status and major depression for those who did NOT lose a family member or a close friend in the last 12 months')
contab3 = pandas.crosstab(sub2['MAJORDEP12'], sub2['CUFREQ2'])
print(contab3)
Column percentages
colsum3 = contab3.sum(axis=0)
colpcontab3 = contab3 / colsum3
print(colpcontab3)
Chi-square
print('Chi-square value, p value, expected counts')
chsq3 = scipy.stats.chi2_contingency(contab3)
print(chsq3)
Line graph for major depression percentages within each frequency group, for those who did NOT lose a family member or a close friend
plt.figure(figsize=(12, 4))  # Change plot size
ax3 = seaborn.factorplot(x="CUFREQ", y="MAJORDEP12", data=sub2, kind="point", ci=None)
ax3.set_xticklabels(rotation=40, ha="right")  # X-axis labels rotation
plt.xlabel('Frequency of cannabis use')
plt.ylabel('Proportion of Major Depression')
plt.title('Association between cannabis use status and major depression for those who did NOT lose a family member or a close friend in the last 12 months')
plt.show()
Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media
0 notes
likeabosch · 1 year
Text
Course: Data Analysis Tools | Assignment Week 3
The gapminder dataset is analyzed for correlations between income per person and life expectancy / suicide rate.
The results indicate that life expectancy is positively correlated with income with a p-value close to zero, meaning a dependency between both parameters.
Suicide rates are not correlated with income per person.
Pearson correlation related output:
association between Income per Person and Life Expectancy
(0.5928813911387393, 1.0481935313322037e-17)
association between Income per Person and Suicide rate
(0.01009696739050997, 0.8954128168470443)
Code:
Import Libraries
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
smf provides ANOVA F-test
import statsmodels.formula.api as smf
multi includes the package to do Post Hoc multi comparison test
import statsmodels.stats.multicomp as multi
scipy includes the Chi Squared Test of Independence
import scipy.stats
bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)
Set Pandas to show all colums and rows in Dataframes
pandas.set_option('display.max_columns', None)
pandas.set_option('display.max_rows', None)
Import gapminder.csv
data = pandas.read_csv('gapminder.csv', low_memory=False)
Replace all empty entries with 0
data = data.replace(r'^\s*$', numpy.NaN, regex=True)
Extract relevant variables from original dataset and save it in subdata set
print('List of extracted variables in subset')
subdata = data[['incomeperperson', 'lifeexpectancy', 'suicideper100th']]
Safe backup file of reduced dataset
subdata2 = subdata.copy()
Convert all entries to numeric format
subdata2['incomeperperson'] = pandas.to_numeric(subdata2['incomeperperson'])
subdata2['lifeexpectancy'] = pandas.to_numeric(subdata2['lifeexpectancy'])
subdata2['suicideper100th'] = pandas.to_numeric(subdata2['suicideper100th'])
All rows containing value 0 / previously had no entry are deleted from the subdata set
subdata2 = subdata2.dropna()
print(subdata2)
Describe statistical distribution of variable values
print('Statistics on "Income per Person"') desc_income = subdata2['incomeperperson'].describe() print(desc_income) print('Statistics on "Life Expectancy"') desc_lifeexp = subdata2['lifeexpectancy'].describe() print(desc_lifeexp) print('Statistics on "Suicide Rate per 100th"') desc_suicide = subdata2['suicideper100th'].describe() print(desc_suicide)
Identify min & max values within each column
print('Minimum & Maximum Income') min_income = min(subdata2['incomeperperson']) print(min_income) max_income = max(subdata2['incomeperperson']) print(max_income) print('')
print('Minimum & Maximum Life Expectancy') min_lifeexp = min(subdata2['lifeexpectancy']) print(min_lifeexp) max_lifeexp = max(subdata2['lifeexpectancy']) print(max_lifeexp) print('')
print('Minimum & Maximum Suicide Rate') min_srate = min(subdata2['suicideper100th']) print(min_srate) max_srate = max(subdata2['suicideper100th']) print(max_srate) print('')
scat1 = seaborn.regplot(x="incomeperperson", y="lifeexpectancy", fit_reg=True, data=subdata2) plt.xlabel('Income per person, $*1000') plt.ylabel('Life expectancy, years') plt.title('Scatterplot for the Association Between Urban Rate and Internet Use Rate')
scat2 = seaborn.regplot(x="incomeperperson", y="suicideper100th", fit_reg=True, data=subdata2) plt.xlabel('Income per person, $*1000') plt.ylabel('Suicide rate per 100th') plt.title('Scatterplot for the Association Between Income per Person and Suicide Rate')
print('association between Income per Person and Life Expectancy')
print(scipy.stats.pearsonr(subdata2['incomeperperson'], subdata2['lifeexpectancy']))
print('association between Income per Person and Suicide rate')
print(scipy.stats.pearsonr(subdata2['incomeperperson'], subdata2['suicideper100th']))
0 notes
codehunter · 1 year
Text
How to replace NaNs by preceding or next values in pandas DataFrame?
Suppose I have a DataFrame with some NaNs:
>>> import pandas as pd>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])>>> df 0 1 20 1 2 31 4 NaN NaN2 NaN NaN 9
What I need to do is replace every NaN with the first non-NaN value in the same column above it. It is assumed that the first row will never contain a NaN. So for the previous example the result would be
   0  1  2
0  1  2  3
1  4  2  3
2  4  2  9
I can just loop through the whole DataFrame column-by-column, element-by-element and set the values directly, but is there an easy (optimally a loop-free) way of achieving this?
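A loop-free sketch of the usual approach (not necessarily the same as the answer behind the link below): forward-fill each column so every NaN takes the last valid value above it.
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
print(df.ffill())  # propagates the last non-NaN value in each column downward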
https://codehunter.cc/a/python/how-to-replace-nans-by-preceding-or-next-values-in-pandas-dataframe
0 notes