Throughout my life, I have had many experiences with private music lessons. I took private piano lessons for about 8 years, and I have been working as a private piano teacher for the last 4 years. Therefore, I am very interested in analyzing what makes a lesson most effective, so that I can be an effective teacher.
My main research question is: “What factors in a private music lesson will help increase a student’s performance?”
The sub-questions I will be focusing on are: Do students who have a longer duration of lessons tend to get better performance scores than others? Do students whose lessons focus on playing the instrument (practical lessons) tend to get better performance scores than those whose lessons focus on theory? Does the age of the student affect their performance score?
Design the Study
For this analysis, I will be using data taken from a study done on Music Education Performance, where a variety of possible factors were tested, as well as an overall performance score. I will be looking at both quantitative and categorical data and I expect to do at least one t.test and an ANOVA test. The population for this analysis is all people who take private music lessons.
Response Variable: Performance Score
Possible Explanatory Variables:
1. Duration - Quantitative
2. Lesson Type - Categorical
3. Age - Categorical
Collect the Data
When I was first looking for data to analyze, my research question was “Does taking music lessons increase general education performance?” However, as I was looking into different data sets, I was drawn to this one about what affects musical performance. Due to the focus of the data being different from my original question, I had to adjust what I planned to research. I found the data through the website Kaggle, and the data was gathered in December of 2024.
Describe/Summarize the Data
#read in libraries and dataset
library(tidyverse)
library(mosaic)
library(rio)
library(ggplot2)
library(readr)
library(car)
Analyzing the linear relationship between performance score and duration of play time in lessons
\[H_0: \beta = 0\]
\[H_a: \beta > 0\]
\[\alpha = 0.05\]
#explore and visualize data
ggplot(clean_music, aes(y = Performance_Score, x = Duration)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Relationship Between Duration of Play Time and Performance Scores") +
  theme_bw()
`geom_smooth()` using formula = 'y ~ x'
The points in the scatterplot are spread across the whole graph with no clear pattern. The fitted line slopes slightly downward, so any linear relationship is a very weak negative one.
#testing for linear regression
lm_music <- lm(clean_music$Performance_Score ~ clean_music$Duration)
summary(lm_music)
Call:
lm(formula = clean_music$Performance_Score ~ clean_music$Duration)
Residuals:
Min 1Q Median 3Q Max
-19.6041 -10.0096 0.0233 9.5801 19.4098
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 81.679254 3.608076 22.638 <2e-16 ***
clean_music$Duration -0.003459 0.009303 -0.372 0.711
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.8 on 98 degrees of freedom
Multiple R-squared: 0.001408, Adjusted R-squared: -0.008781
F-statistic: 0.1382 on 1 and 98 DF, p-value: 0.7109
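Correlation Coefficient (r): -0.038, which is a very weak negative relationship
Slope: -0.003459
Intercept: 81.679254
p-value: 0.711 (two-sided, as reported by summary())
Interpretation: The p-value is not less than alpha (0.05), therefore I fail to reject the null hypothesis. Because the estimated slope is negative, the one-sided p-value for \(H_a: \beta > 0\) is even larger (about 0.64), so the conclusion does not change. I do not have sufficient evidence to suggest that a longer duration of play time in lessons increases a student’s performance score.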
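As a rough check on the numbers above, the p-values and the correlation coefficient can be recomputed from the printed regression output alone. This is only a sketch: the inputs are the rounded values copied from the summary, and the variable names are my own.

```r
# Values copied (rounded) from the summary(lm_music) output
t_value   <- -0.372
df_resid  <- 98
r_squared <- 0.001408
slope     <- -0.003459

# Two-sided p-value, which is what summary() reports
p_two_sided <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

# One-sided (upper-tail) p-value for H_a: beta > 0
p_one_sided <- pt(t_value, df = df_resid, lower.tail = FALSE)

# In simple linear regression, r has the sign of the slope
# and a magnitude equal to sqrt(R-squared)
r <- sign(slope) * sqrt(r_squared)

round(c(p_two_sided = p_two_sided, p_one_sided = p_one_sided, r = r), 3)
```

Since the t value is negative, the upper-tail p-value is larger than 0.5, which is why the one-sided test cannot reject the null hypothesis here.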
Confidence Interval: I am 95% confident that the true slope of the relationship between duration and performance score is between -0.022 and 0.015. In other words, for every additional second of play time in a lesson, a student’s average performance score changes by between -0.022 and 0.015 points. Because this interval contains 0, there is little to no difference in scores in relation to the duration of play time in lessons.
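The interval for the slope can be reproduced by hand from the estimate and standard error in the regression output. The sketch below uses the rounded printed values, so it is an approximation; calling confint(lm_music) on the fitted model should give the same interval without rounding.

```r
# Values copied from the coefficient table in the regression output
estimate  <- -0.003459
std_error <- 0.009303
df_resid  <- 98

# Critical t value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# estimate +/- t* x SE
ci <- estimate + c(-1, 1) * t_crit * std_error
round(ci, 3)
```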
Residuals: The residuals are normal
Variance: The variance is constant
Trust p-value: Yes, I can trust my p-value
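One possible way to check these two requirements is sketched below. It runs on simulated stand-in data, not on the actual data set: the data frame sim, the duration range, and the noise level are all made up for illustration, and only the column names mirror the ones used in this analysis.

```r
set.seed(1)

# Simulated stand-in data (NOT the Kaggle data); values are arbitrary
sim <- data.frame(Duration = runif(100, min = 30, max = 600))
sim$Performance_Score <- 80 + rnorm(100, sd = 10)

fit <- lm(Performance_Score ~ Duration, data = sim)

# Normality of residuals: Q-Q plot plus a Shapiro-Wilk test
qqnorm(resid(fit))
qqline(resid(fit))
shapiro_p <- shapiro.test(resid(fit))$p.value

# Constant variance: residuals vs. fitted values should show
# an even band around zero with no funnel shape
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

With the real model, resid(lm_music) and fitted(lm_music) would take the place of the simulated fit.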
Analyzing the relationship between Lesson Type and Performance Score through independent t.test
\[H_0: \mu_{Practical} = \mu_{Theory}\]
\[H_a: \mu_{Practical} > \mu_{Theory}\]
\[\alpha = 0.05\]
# explore and visualize data
favstats(clean_music$Performance_Score ~ clean_music$Lesson_Type)
  clean_music$Lesson_Type      min       Q1   median       Q3      max     mean       sd  n missing
1               Practical 60.32214 70.42106 79.33088 91.90368 99.71868 80.75567 12.64735 48       0
2                  Theory 60.90163 70.45786 81.01049 88.01304 98.35884 80.09406 10.97366 52       0
ggplot(clean_music, aes(y = Performance_Score, x = Lesson_Type, fill = Lesson_Type)) +
  geom_boxplot() +
  theme_bw() +
  labs(title = "Comparing Performance Score by Lesson Type")
Looking at the boxplot and summary statistics, the theory-focused lessons have a slightly larger median performance score than the practical lessons (81.01 vs. 79.33). However, the practical lessons have a slightly higher mean (80.76 vs. 80.09) and more variability (standard deviation of 12.65 vs. 10.97), so the two groups look very similar overall.
#t.test
t.test(clean_music$Performance_Score ~ clean_music$Lesson_Type, alternative = "greater")
Welch Two Sample t-test
data: clean_music$Performance_Score by clean_music$Lesson_Type
t = 0.27839, df = 93.437, p-value = 0.3907
alternative hypothesis: true difference in means between group Practical and group Theory is greater than 0
95 percent confidence interval:
-3.286678 Inf
sample estimates:
mean in group Practical mean in group Theory
80.75567 80.09406
t: 0.27839
df: 93.437
p-value: 0.3907
Interpretation: The p-value is not less than alpha (0.05), therefore I fail to reject the null hypothesis. I do not have sufficient evidence to suggest that students who take practical lessons get higher performance scores than those who take theory lessons.
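Confidence Interval: I am 95% confident that the true difference in mean performance scores (practical minus theory) is between -4.058 and 5.381 points. That is, students who take practical lessons score, on average, somewhere between 4.058 points lower and 5.381 points higher than those who take theory lessons. This is the two-sided interval; the one-sided test output above only reports a lower bound of -3.287. Because the interval contains 0, no difference between the two lesson types is a plausible value.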
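The Welch test statistic, its degrees of freedom, and both versions of the interval can be recomputed from the group summary statistics reported by favstats. This is a sketch using those printed values; the variable names are my own, and running t.test() on the data with the default two-sided alternative should give the same two-sided interval.

```r
# Group summaries copied from the favstats output
mean_p <- 80.75567; sd_p <- 12.64735; n_p <- 48   # Practical
mean_t <- 80.09406; sd_t <- 10.97366; n_t <- 52   # Theory

diff <- mean_p - mean_t

# Welch standard error (variances are not pooled)
v_p <- sd_p^2 / n_p
v_t <- sd_t^2 / n_t
se  <- sqrt(v_p + v_t)

t_stat <- diff / se

# Welch-Satterthwaite degrees of freedom
df_welch <- (v_p + v_t)^2 / (v_p^2 / (n_p - 1) + v_t^2 / (n_t - 1))

# Two-sided 95% interval and the one-sided 95% lower bound
ci_two_sided    <- diff + c(-1, 1) * qt(0.975, df_welch) * se
lower_one_sided <- diff - qt(0.95, df_welch) * se

round(c(t = t_stat, df = df_welch), 3)
round(ci_two_sided, 3)
round(lower_one_sided, 3)
```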
#check for normality
qqPlot(clean_music$Performance_Score ~ clean_music$Lesson_Type)
Normality: The data is normal
Trust p-value: Yes, I can trust my p-value
Analyzing the relationship between performance score and age of the student through ANOVA
\[H_0: \mu_{10} = \mu_{11} = \mu_{12} = \mu_{13} = \mu_{14} = \mu_{15} = \mu_{16} = \mu_{17}\]
\[H_a: \text{at least one age group mean is different from the others}\]
\[\alpha = 0.05\]
#explore and visualize data
favstats(clean_music$Performance_Score ~ clean_music$Age)
Df Sum Sq Mean Sq F value Pr(>F)
clean_music$Age 1 178 178.3 1.295 0.258
Residuals 98 13492 137.7
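F: 1.295
df: 1
p-value: 0.258
Interpretation: The p-value is not less than alpha (0.05), therefore I fail to reject the null hypothesis. I do not have sufficient evidence to suggest that the age of the student has a significant effect on their performance score.
Note: The Age row has 1 degree of freedom rather than the 7 expected when comparing eight age groups, which suggests that Age entered the model as a numeric variable instead of a factor. This test therefore checks for a linear trend in performance score across age rather than for any difference among the eight group means.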
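The ANOVA table reports 1 degree of freedom for Age, while a comparison of eight age groups would normally have 7. The sketch below shows how that difference arises depending on whether age is passed to aov() as a number or as a factor. It uses simulated stand-in data (the data frame sim and its values are made up and are not the Kaggle data), so only the degrees of freedom are meaningful, not the F values.

```r
set.seed(42)

# Simulated stand-in data (NOT the Kaggle data): ages 10 to 17
sim <- data.frame(
  Age = rep(10:17, length.out = 100),
  Performance_Score = runif(100, min = 60, max = 100)
)

# Age as a number: one slope is estimated, so Df = 1
fit_numeric <- aov(Performance_Score ~ Age, data = sim)

# Age as a factor: eight group means are compared, so Df = 8 - 1 = 7
fit_factor <- aov(Performance_Score ~ factor(Age), data = sim)

df_numeric <- summary(fit_numeric)[[1]][["Df"]]
df_factor  <- summary(fit_factor)[[1]][["Df"]]

df_numeric  # Age and residual degrees of freedom
df_factor
```

Under that assumption, wrapping the age column in factor() in the original aov() call would produce the eight-group comparison stated in the hypotheses.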
Residuals: The residuals are normal
SD: The standard deviations are approximately equal
Trust p-value: Yes, I can trust my p-value
Conclusion (Take Action)
From my findings, I conclude that there is not sufficient evidence that duration of play time, lesson type, or age has a significant effect on a student’s performance score. These findings did surprise me a little, as I expected to see a relationship between at least one of them and performance score. However, my findings do suggest that no matter what type of lesson a student takes, how long the lesson is, or how old they are, they can still do well in a performance. After analyzing this data, I would be interested in doing some more studies to see what factors do affect a student’s performance score.