Published in: Technology. This information can give you a hint of the skewness and of possible outliers. “Using Data Mining to Predict Secondary School Student Alcohol Consumption.” Department of Computer Science,University of Camerino. The dataset which we will be exploring will be the dataset containing Student Grade Prediction 1. intimate, they will drink less. We prefer to use some sort of configuration so that we can input any dataset and perform most of the same analysis. http://www.who.int/substance_abuse/publications/global_alcohol_report/en/. Section 2e. EDUCATION SYSTEM IN PORTUGAL. For the data exploratory exercise, we choose to examine three columns: By using Kaggle, you agree to our use of cookies. To compare categorical variables, correlations shouldn’t be used unless the underlying values are ordinal (i.e., going out with friends [numeric: from 1 – very low to 5 – very high]). This helps you to understand whether the distribution of the numeric variable is significantly different at different levels of the categorical target. Section 2d. The following results show the skewness for the numeric features: As we suspected, the feature ‘absences’ contains the most skew. activites (column 19), romantic (column 23), famrel (column 24), goout (column 26), Dalc (column 27), Walc (column 28) The dataset we chose is the Student Alcohol Consumption dataset by UCI Machine Learning which can be obtained In the input, workday and weekend alcohol consumption is given in range of 1 - very low to 5 - very high. Alcohol consumption PDF 182 KB. This dataset was collected in order to study alcohol consumption in young people and its effects on students’ academic performance. Since the dataset is called “Student Alcohol Consumption”, of course, we should do some analyses on it. Our explanation would be more focused on the final grade because we think that students will be Then we can find out if alcohol consumption will impact the final result indicated by column “g3”. 
Derived output: Alc = (Walc × 2 + Dalc × 5) / 7, again, in the range of 1 – 5. Its value for the week is normalized as (workday_alcohol_consumption × 5 + weekend_alcohol_consumption × 2) / 7. If the value is greater than 3.0, then alcohol consumption is considered too high. This helps you to understand the top dependent variables (grouped by numerical and categorical). Be sure to change the type of field delimiter (“;”) and line delimiter (“\n”), and check the Extract Field Names checkbox. We don’t need the G1 and G2 columns, so let’s drop them. However, many do not consider the effect of this intoxicating substance in the context of the younger, more (2016). Short exploratory data analysis focusing on the alcohol variables from the Portuguese school dataset. To do so, we In April 2016, 3000 undergraduate students were randomly selected to participate in the survey, and 802 undergraduate students responded to at least part of the survey. Most of us experimented with drinking to some degree while in school. by Dinescu et. A lot of time is lost to alcohol consumption, so students spend less time on their academic work. Fabio Pagnotta and Hossain Amran, University of Camerino, February 2016. DOI: 10.13140/RG.2.1.1465.8328. You can see the level of correlation by the degree of the ellipse. The original data comes from a survey conducted by a professor in Portugal. following: Figure 1 illustrates the high-level description of our classification. We think that classification is the best data mining technique to be employed because we can build a classification model to Global Status Report on Alcohol and Health 2014. The narrower the ellipse, the stronger the correlation. This may not hold true because it is a possibility that the You can browse the subreddit here. 
The following plot shows the prominence of the target: This shows that the target is imbalanced, so we may benefit from oversampling or under-sampling when building our model. Finding a Binary Output: The dataset has two features, viz., Dalc (workday alcohol consumption) and Walc (weekend alcohol consumption), both in the range of 1 (very low) to 5 (very high). We look a bit closer at the distribution of absences and test for normality. administrative or police), ‘at_home’ or ‘other’), Fjob – father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. The box plot portion of the graph also helps us identify outliers. It could be alcohol poisoning or an alcohol-related injury or both. Many students in college experiment with drugs and alcohol and sometimes these two things negatively affect their academic performance. workday alcohol consumption, weekend alcohol consumption and their family relationship. consensus is that students who consume alcohol at high levels tend to skip more classes and perform worse in their studies, thus, resulting in lower When lambda = 0, the log transform is used. Journal of Family Psychology, Vol 30(6), Sep 2016, 698-707. A research conducted obtain more accurate insights. Balsa, A. I., Giuliano, L. M., & French, M. T. (2011). For a student to pass the subject, there are a couple of factors that could be correlated with the outcome. We only do this for illustration. We could check to see if that hypothesis has a concrete basis by using column 24 (famrel), column 27 (workday alcohol comes with the mantle of adulthood. in a student environment as well as their demographic information and other data that may be of some relevance. If the means differ significantly (h0 is rejected), then the feature will likely be a dominant predictor. 
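The hypothesis test described above (h0: a numeric variable has the same mean at every level of the categorical target) is usually carried out with a one-way ANOVA. The following is a minimal sketch of the F statistic only, using the Python standard library. The function name and the sample study-time values are made up for illustration and are not taken from the dataset; turning F into a p-value needs the F distribution, which the standard library does not provide.

```python
import statistics


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    k = len(groups)                          # number of levels of the target
    n = sum(len(g) for g in groups)          # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Variation explained by differences between the group means.
    ss_between = sum(
        len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups
    )
    # Variation left over inside each group.
    ss_within = sum(
        (v - statistics.fmean(g)) ** 2 for g in groups for v in g
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical weekly study hours for low and high drinkers.
low_drinkers = [6, 8, 7, 9]
high_drinkers = [2, 2, 3, 1]
f_stat = one_way_anova_f([low_drinkers, high_drinkers])
print(f"F = {f_stat:.2f}")
```

A large F means the group means are far apart relative to the spread inside each group, which is evidence against h0.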
One way would be to create a new feature, FeduMedu, where the values is Medu * 10 + Fedu and keep FeduMedu categorical. relationship with his/her family has a low value. The traditional We chose workday alcohol consumption because drinking over workdays is more unusual than drinking over the weekends. While I recognize that having a great many students living on campus may be contributing to these numbers, and while I am relieved that students know how and when to seek care, I am c… Section 2a. avoid drinking in order to prevent their health from further deterioration. They are: Exploratory Data Analysis on the Student Alcohol Consumption dataset (Code) », address - U/R for urban or rural respectively, famsize - LE3/GT3 for less than or greater than three family members, Pstatus - T/A for living together or apart from parents, respectively, Medu - 0 (none) / 1 (primary-4th grade) / 2 (5th - 9th grade) / 3 (secondary) / 4 (higher) for mother's education, Fedu - 0 (none) / 1 (primary-4th grade) / 2 (5th - 9th grade) / 3 (secondary) / 4 (higher) for father's education, Mjob - 'teacher', 'health' care related, civil 'services', 'at_home' or 'other' for the student's mother's job, Fjob - 'teacher', 'health' care related, civil 'services', 'at_home' or 'other' for the student's father's job, reason - close to 'home', school 'reputation', 'course' preference or 'other' for the choice of school, guardian - mother/father/other as the student's guardian, traveltime - 1 (<15mins) / 2( 15 - 30 mins) / 3 (30 mins - 1 hr) / 4 (>1hr) for time from home to school, studytime - 1 (<2hrs) / 2 (2 - 5hrs) / 3 (5 - 10hrs) / 4 (>10hrs) for weekly study time, failures - 1-3/4 for number of class failures (if more than 3 than record 4), schoolsup - yes/no for extra educational support, famsup - yes/no for family educational support, paid - yes/no for extra paid classes for Math or Portuguese, activities - yes/no for extra-curricular activities, nursery - yes/no for whether attended nursery 
school, higher - yes/no for desire to continue studies, internet - yes/no for internet access at home, romantic - yes/no for relationship status, famrel - 1-5 scale on quality of family relationships, freetime - 1-5 scale on how much free time after school, goout - 1-5 scale on how much student goes out with friends, Dalc - 1-5 scale on how much alcohol consumed on weekdays, Walc - 1-5 scale on how much alcohol consumed on weekend, absences - 0-93 amount of absences from school, the amount of time a student studies (studytime, column 14), does the student join any extra paid classes (paid, column 18), does the student participate in any extra co-curricular activities (activities, column 19), if the student is involved in any romantic relationship (romantic, column 23), how is the student's family relationship quality (famrel, column 24), the tendency of the student to go out with friends (goout, column 26), weekday alcohol consumption (Dalc, column 27), weekend alcohol consumption (Walc, column 28). Singapore, however, brightens it up with colorful visualizations, splashes of color in the graphs, and a “Similar Datasets” section at the bottom of every data set to encourage readers to explore. and/or column 28 (weekend alcohol consumption), column 31 (first period grade), column 32 (second period grade) and Subscribe to our mailing list and get interesting stuff and updates to your email inbox. student's relationship with his/her family is low because of their high level of alcohol consumption. al. EuroEducation.net. The original data contains the following attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets: The following grades are related with the course subject, Math or Portuguese: Before exploration, we combine the rows of the two data sets and mark each instance with the class in which the survey was taken. 
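Combining the rows of the two files and marking each instance with its class can be sketched as follows. This is only an illustration with the standard library: the column name `course`, the helper names, and the tiny inline samples are assumptions, and the real student-mat.csv and student-por.csv files have many more columns. The semicolon delimiter matches the field delimiter mentioned elsewhere in this text.

```python
import csv
import io


def read_semicolon_csv(text):
    """Parse semicolon-delimited CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter=";"))


def combine_with_course(math_rows, por_rows):
    """Stack both row lists and tag every row with the course it came from."""
    combined = []
    for course, rows in (("math", math_rows), ("portuguese", por_rows)):
        for row in rows:
            tagged = dict(row)           # copy so the input rows stay untouched
            tagged["course"] = course    # hypothetical marker column
            combined.append(tagged)
    return combined


# Made-up samples standing in for the two real files.
MATH_SAMPLE = "school;sex;age;Dalc;Walc;G3\nGP;F;18;1;1;6\nGP;M;16;2;3;15\n"
POR_SAMPLE = "school;sex;age;Dalc;Walc;G3\nGP;F;18;1;1;11\n"

combined = combine_with_course(
    read_semicolon_csv(MATH_SAMPLE), read_semicolon_csv(POR_SAMPLE)
)
print(len(combined), combined[-1]["course"])
```

With real files, the same helper would be fed the text read from each file instead of the inline samples.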
The types of columns are listed as follows: One way to get an idea about the structure of the data is to calculate basic statistics, such as the min, max, mean, and median, and missing value counts. Google Trends - look at what’s going on in the world. Secondary school student alcohol consumption data with social, gender and study information. result as pass/fail rather than a discrete numeric number. The columns and how they are recorded are as listed below: Since the main purpose of the dataset is to find correlations between students with their alcohol consumption patterns, the most conspicuous relationship Excessive alcohol use, either in the form of binge drinking (drinking 5 or more drinks on an occasion for men or 4 or more drinks on an occasion for women) or heavy drinking (drinking 15 or more drinks per week for men or 8 or more drinks per week for women), is associated with an increased risk of many health problems, such as liver disease and unintentional injuries. column 33 (final grade). I'm sorry, the dataset "STUDENT ALCOHOL CONSUMPTION" does not appear to exist. The following shows basic statistics of each feature: Addressing skewness, the mean of absences is 4.4348659 and the median is 2, indicating that the data is right-skewed and given the spread between the min and max, the skewness is significant. /r/datasets. Since the main purpose of the dataset is to find correlations between students with their alcohol consumption patterns, the most conspicuous relationship would be the relationship between their grades with respect to their workday and weekend alcohol consumption. This analysis was done as part of fulfilling the Data Mining course in Multimedia University. Assuming the romantic relationship in our dataset is of an intimate level, we can find out if this statement holds true. (romantic), only gives information on whether or not the student has a partner. A twin study of marital status and alcohol consumption. 2016. 
recorded to have participated. predict if a student will get a passing grade based on the factors mentioned above. that particular student's success. Nicolas Raj. 2014. For the data exploratory exercise, we choose to examine three columns: workday alcohol consumption, weekend alcohol consumption and their relationship status. For example, the passing mark for a student in Portugal would be 10 out of 20. Thus, their final grade would be the perfect measure of Secondary school students are in a transition developmentally and this comes with its debilitating effects such as risky alcohol use … We shall see which consensus holds true. The traditional consensus is that students who consume alcohol at high levels … emotion. People who contributed to this were Aaron Patrick Nathaniel, Lim Yue Hng (Neil) and We would oversample since we have limited data. The Core Survey helps us determine the patterns of alcohol and other drug consumption and examine attitudes and perceptions of alcohol and other drug use among Northwestern students. We could take into consideration the The data collected, in locations such as Gabriel Pereira and Mousinho da Silveira, includes several values of pertinence. In general, we would assume that people who are not healthy, will Modify the numeric features by: Depending on the algorithms used in the model building, the following features may produce better results as numeric and normalized. It’s called the datasets subreddit, or /r/datasets. We could perform this merge differently later by performing a full join and then dealing with the NA values, by performing the analysis on the individual sets, or by inner joining the two sets and just working with that data. 
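The inner-join alternative mentioned above can be sketched as below. This is a hedged illustration, not the authors' merge: the choice of key columns, the `_mat` and `_por` suffixes, and the sample rows are all assumptions. In practice the key would be a larger set of identifying attributes, since the data has no student ID.

```python
def inner_join(left_rows, right_rows, keys, suffixes=("_mat", "_por")):
    """Join two lists of dict rows on the given key columns (inner join)."""
    # Index the right-hand rows by their key tuple for fast lookup.
    index = {}
    for row in right_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    joined = []
    for left in left_rows:
        key = tuple(left[k] for k in keys)
        for right in index.get(key, []):      # no match means the row is dropped
            merged = {k: left[k] for k in keys}
            for name, value in left.items():
                if name not in keys:
                    merged[name + suffixes[0]] = value
            for name, value in right.items():
                if name not in keys:
                    merged[name + suffixes[1]] = value
            joined.append(merged)
    return joined


# Made-up rows; only the first math student also appears in the Portuguese set.
math_rows = [
    {"school": "GP", "sex": "F", "age": "18", "G3": "6"},
    {"school": "GP", "sex": "M", "age": "16", "G3": "15"},
]
por_rows = [
    {"school": "GP", "sex": "F", "age": "18", "G3": "11"},
    {"school": "MS", "sex": "M", "age": "17", "G3": "9"},
]
both = inner_join(math_rows, por_rows, keys=("school", "sex", "age"))
print(both)
```

A full join would instead keep the unmatched rows from both sides and leave the missing columns empty, which is where the NA handling mentioned above comes in.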
If the hypothesis holds true, we would expect to see an increasing level of alcohol more serious towards their final grade rather than the first period grade and second period grade. to 1 hour, or 4 – >1 hour), studytime – weekly study time (numeric: 1 – <2 hours, 2 – 2 to 5 hours, 3 – 5 to 10 hours, or 4 – >10 hours), failures – number of past class failures (numeric: n if 1<=n<3, else 4), schoolsup – extra educational support (binary: yes or no), famsup – family educational support (binary: yes or no), paid – extra paid classes within the course subject (Math or Portuguese) (binary: yes or no), activities – extra-curricular activities (binary: yes or no), nursery – attended nursery school (binary: yes or no), higher – wants to take higher education (binary: yes or no), internet – Internet access at home (binary: yes or no), romantic – with a romantic relationship (binary: yes or no), famrel – quality of family relationships (numeric: from 1 – very bad to 5 – excellent), freetime – free time after school (numeric: from 1 – very low to 5 – very high), goout – going out with friends (numeric: from 1 – very low to 5 – very high), Dalc – workday alcohol consumption (numeric: from 1 – very low to 5 – very high), Walc – weekend alcohol consumption (numeric: from 1 – very low to 5 – very high), health – current health status (numeric: from 1 – very bad to 5 – very good), absences – number of school absences (numeric: from 0 to 93), G1 – first period grade (numeric: from 0 to 20), G2 – second period grade (numeric: from 0 to 20), G3 – final grade (numeric: from 0 to 20, output target), Joining information from existing features (PCA is a common example, or some knowledge about how features are correlated), Depending on the model, remove features that are not important to the model. 
This modification coincides with the original report where the authors modified the target with the formula Alc = (Dalc * 5 + Walc * 2) / 7 and then assumed values of 3 or more were heavy drinkers. For example, if there were a high correlation, say 0.9, between two numeric features, then the information provided to the model would be redundant and, depending on the model, could make the model more complex than it needs to be. impressionable generation. First, open the student-por.csv file in the student_performance source. At an alcohol consumption level of 1, the median and 25th percentile are the same value of 2 hours of study. in section E as part of the preprocessing before plotting the data for our exploratory data analysis. Core measures include: Baseline surveys included standard demographics, religiosity, current alcohol and drug diagnoses (DIS), ASI alcohol, drug and psychiatric problem severity, number of heavy drinkers in social networks, prior treatment utilization, and lifetime and past-year 12-step meeting attendance and involvement, Six- and 12-month surveys involved a subset of these … If one is very high, you may want to take a closer look at the data and see if there is leakage into the target variable. The x-axis is the level of the categorical target. Click on the arrow near the name of each column to open the context menu. Student Grade Prediction. Guided by: Dr. Amir H. Gandomi. Presented by: Gaurav Sawant, Vipul Gajbhiye, Vikram Singh. Date: 11/28/2017. As we all know, human relationships play a major role in people's lives. To obtain insights on this, we could refer to column 29 (health), column 27 The original values for the feature ‘absences’ will be used in the remaining sections. 
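The weighted weekly formula quoted earlier in this passage, (Dalc * 5 + Walc * 2) / 7 with values of 3 or more treated as heavy drinking, can be sketched as follows. The function names are hypothetical. Note that another description of the same formula uses "greater than 3.0" as the cutoff, so the inclusive comparison here is an assumption that follows the "3 or more" wording.

```python
def weekly_alcohol(dalc, walc):
    """Weight five workdays and two weekend days into one 1-5 score."""
    return (dalc * 5 + walc * 2) / 7


def is_heavy_drinker(dalc, walc, threshold=3.0):
    """Return 1 for heavy drinkers and 0 otherwise (inclusive threshold)."""
    return 1 if weekly_alcohol(dalc, walc) >= threshold else 0


# A few (Dalc, Walc) pairs on the 1-5 scale used by the dataset.
for dalc, walc in [(1, 1), (2, 5), (3, 3), (5, 5)]:
    score = weekly_alcohol(dalc, walc)
    print(dalc, walc, round(score, 2), is_heavy_drinker(dalc, walc))
```

Because workdays carry a weight of 5 out of 7, a student with low workday drinking stays below the threshold even with the maximum weekend value, as the (2, 5) pair shows.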
as the attributes and GStatus as the class for the training set to predict the class GStatus in the test set and validate the model. These short term effects of alcohol could lead to poor academic performance, poor health and disruptive social behavior. We use cookies on Kaggle to deliver our services, analyze web traffic, and improve your experience on the site. grades. It would be easy to assume that alcohol consumption reduces the student’s health on a long term basis. The dataset is originally designed for the estimation of high school student’s performance where alcohol consumption is used as one of the parameters. Correlation does not imply causation. I will be utilizing the student alcohol consumption dataset provided by UCI Machine Learning and is available in their machine learning repository. From this analysis, what might we preprocess before creating the model? It gives you data about … Column 23 For the data exploratory exercise, we choose to examine four columns: workday alcohol consumption, first period grade, second period grade and their final grade. To get an idea of how features interact with each-other, we can determine the rank associated with the features to a target, in this case, the actual target or level of drinking. consumption (both column 27 and 28) when famrel has a low value. replication of data. Section 3b. It can develop a plethora of emotions in oneself, may it be a positive or negative In our data set, many of the categorical features are numeric, but for this illustration, we will continue with treating them as categorical. However, the data reveals that there was a total of 382 students that were in both datasets, this was evident in the exact Family history alcohol PDF 140 KB. The primary reason for this data was to see the effects of drinking and grades. 
We assume that a father’s education level is similar to a mother’s education level, so let us visualize the association: The above plot shows that the education levels between mother and father do coincide fairly often and might want to explore more or consider the possibility of joining these features in preprocessing the data before model building. This would help the classification model to more accurately predict the class GStatus fulfilling the Data Mining course in Multimedia University. Tobacco and nicotine use TUD PDF 493 KB. With the Student Alcohol Consumption data set, we predict high or low alcohol consumption of students. Medicine use PDF 223 KB. (n.d.). According to the World Health Organization (Global Status Report on Alcohol and Health 2014 2014), gender, family, and social factors affect alcohol consumption. GitHub is where the world builds software. Therefore, researchers seek to rectify that lack by conducting a survey to obtain important raw data on alcohol consumption such data are records of demographic information, grades, and alcohol consumption. Depending on the model you choose, removing skewness could help improve the predictive ability of the model. Last but not least, we can also obtain insights on health issues and drinking alcohol. 3. The scope of these data sets varies a lot, since they’re all user-submitted, but they tend to be very interesting and nuanced. Essentially, the blue rectangles show that the observed counts and expected counts (derived from a loglinear model) coincide well, and since the size of the rectangles are large, the confidence covers a majority of the observations. We test hypothesis 0 (h0) that the numeric variable has the same mean values across the different levels of the categorical variable. https://archive.ics.uci.edu/ml/datasets/STUDENT%20ALCOHOL%20CONSUMPTION. For this analysis, we combine the rows of the data sets. Economics of Education Review, 30(1), 1-15. Section 2c. 
The most recent statistics from the National Institute on Alcohol Abuse and Alcoholism (NIAAA) estimate that about 1,519 college students ages 18 to 24 die from alcohol-related unintentional injuries, including motor vehicle crashes. However, if more elaborate data mining techniques were to be used, more features can be selected and used in order to National Institute on Alcohol Abuse and Alcoholism Alcohol Use and Consumption Tables A large number of html and text files on alcohol use and consumption. Background information II PDF 731 KB. In this case, we see that the grades are highly correlated, meaning the higher the grades in one session, the higher the grades in another session. Alcohol Abuse and Dependence: Roughly 20 percent of college students meet the criteria for an alcohol use disorder in a given year (8 percent alcohol abuse, 13 percent alcohol dependence). For categorical values, we use Cramer’s V. For numeric values, we use Eta-squared value. It does not state the level of intimacy between them. The data mining technique we think is suitable is classification. 5. school period grades are available. weekend alcohol consumption and their health. We will take a closer look at the distribution of this feature. 13. drinking alcohol for consolation. With the Student Alcohol Consumption data set from UCI Machine Learning Archive (Fabio Pagnotta 2016), we thought it would be interesting to see what features are important to determine if the student is a heavy drinker or not. (Pullen, 1994). National Institute of Child Health and Human Development Study of Early Child Care and Youth Development Data and documentation for phases I and II of the NICHD-SECCYD study. Five columns play a major role in this which are: column 27 (workday alcohol consumption) Generally, many models prefer using features that are independent of each other and have low correlations. Remove the skewness from the numeric data. 
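One way to measure and reduce skewness in a feature such as absences is sketched below with the standard library only. The sample values are made up, and the shift of 1 before transforming is an assumption added here because absences can be 0 and the transform needs positive input. The transform is the Box-Cox family, where lambda = 0 gives the log transform as noted elsewhere in this text.

```python
import math
import statistics


def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def box_cox(values, lam, shift=1.0):
    """Apply the Box-Cox transform to (value + shift); lam == 0 means log."""
    transformed = []
    for v in values:
        x = v + shift                      # keeps zero counts in the valid domain
        if lam == 0:
            transformed.append(math.log(x))
        else:
            transformed.append((x ** lam - 1) / lam)
    return transformed


# Hypothetical absence counts with a long right tail.
absences = [0, 0, 0, 1, 2, 2, 2, 3, 4, 4, 6, 8, 10, 14, 30]
raw_skew = skewness(absences)
log_skew = skewness(box_cox(absences, lam=0))
print(f"raw skew = {raw_skew:.2f}, after log transform = {log_skew:.2f}")
```

The skewness after the transform should be much closer to zero than before; whether that helps depends on the model chosen.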
45 Using Python to Analyze Secondary School Student Alcohol Consumption and Their Academic Performance 1Poonam Kumari and 2Aditya Pratap 1Research Scholar, Department of Computer Science, IITM Janakpuri, New Delhi,India 2Research Scholar, Department of Computer Science, IITM Janakpuri, New Delhi,India poonam.kumari561999@gmail.com, … We would think that if the value for health is lower, the value for their It is a usual train of thought that those who have a bad relationship with their family members will be stressed and unhappy which results in them (workday alcohol consumption) and/or column 28 (weekend alcohol consumption). need to take column 23 (romantic), column 27 (workday alcohol consumption) and/or column 28 (weekend alcohol consumption) into consideration. The target is the weekday drinking level 1 to 5 and the weekend drinking level 1 to 5. The dataset was built from two sources: school reports and questionnaires. You may want to explore combining the grades into one feature since G3 is likely derived from G1 and G2. Alcohol is an often abused substance that troubles many individuals in their adulthood as they struggle to cope with emotional and physical stress that 27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high) 28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high) 29 health - current health status (numeric: from 1 - very bad to 5 - very good) 30 absences - number of school absences (numeric: from 0 to 93) This Student Alcohol Consumption dataset is based on data collected in two secondary schools in Portugal. Having recourse to the public health objective on alcohol by the World Health organization, which is to reduce the health burden caused by the harmful use of alcohol, thereby saving live and reducing injuries, this data article explored the nature of alcohol use among college students, binge drinking and the consequences of alcohol consumption. courses of mathematics and Portuguese. 
Many of them are ordinal and were discretized from continuous values. Dinescu, D., Turkheimer, E., Beam, C.R., Horn, E.E., Duncan, G., Emery, R.E. Since we attempt to predict the students’ level of alcohol consumption, high or low, we mutate the targets to join the weekday and weekend drinking, and then set the results to high or low, 1 or 0 respectively. Besides family relationships, we can also try to find if there is a relationship between being single and consuming high levels of alcohol. There are two categorical columns, “Dalc” and “Walc”, showing consumption on workdays and weekends. YAML is a good tool for setting up configurations, but in this case, we will set the configurations manually. This has been the case for eight of the past 10 years.