I am currently a data science student at General Assembly. My first project was to analyze datasets on SAT exams, come up with a problem statement, and work toward a solution. In this post, I present the problem statement, the procedure used to address it, the challenges faced, and a conclusion.
The SAT is a standardized exam that most universities and colleges in the United States use to assess students' academic readiness. A high SAT score improves a student's chances of admission. SAT originally stood for Scholastic Aptitude Test.
2. Context & Challenge
This section describes the context that led to the project. It has three parts:
Background & Description
Many universities and colleges in the US ask for ACT or SAT scores as part of the admission process. These scores are combined with other measures, such as high school grade point average (GPA) and essay responses, to assess students' academic potential. The SAT has two sections: Math and EBRW (Evidence-Based Reading and Writing).
Problem Statement
Is there a relationship between SAT scores and participation rates across the states?
Goals & Objectives
● To clean the datasets and make them ready for analysis.
● To explore the data and extract valuable information that could help John’s preparation.
● To create visualizations that reveal relationships between the features and support the findings.
● Most importantly, to see if there is any relationship between SAT score and participation rates in the states.
3. Process & Insight
I used a Jupyter notebook to code, analyze, and visualize the data for this project. The work starts with importing the two datasets we were given: SAT_2018.csv and SAT_2019.csv.
Now let’s clean the datasets. Cleaning includes the following steps:
● Null values are checked, and rows containing nulls are removed.
● Inconsistencies in the data are removed.
● Data types are checked and converted to appropriate types with the help of supporting functions.
● Dataset columns are renamed, and unwanted columns are dropped.
● The two datasets are merged into one, which is then exported to a .csv file.
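The cleaning steps above could be sketched roughly as follows with pandas. This is only an illustration under stated assumptions: the inline CSV text stands in for SAT_2018.csv and SAT_2019.csv, and the column names, the sample values, and the helper names percent_to_float and clean are all hypothetical rather than taken from the actual notebook.

```python
import io

import pandas as pd

# Made-up stand-ins for SAT_2018.csv and SAT_2019.csv (assumed layout).
RAW_2018 = """State,Participation,Total
Alabama,6%,1166
Colorado,100%,1025
Nowhere,,
"""
RAW_2019 = """State,Participation Rate,Total,Notes
Alabama,7%,1143,x
Colorado,100%,1024,y
"""


def percent_to_float(value: str) -> float:
    """Convert a string such as '6%' to the fraction 0.06."""
    return float(value.strip().rstrip("%")) / 100


def clean(raw_csv: str, year: int) -> pd.DataFrame:
    """Apply the cleaning steps to one year's data (hypothetical helper)."""
    df = pd.read_csv(io.StringIO(raw_csv))
    # Drop unwanted columns and unify inconsistent column names.
    df = df.drop(columns=["Notes"], errors="ignore")
    df = df.rename(columns={"Participation Rate": "Participation"})
    # Null check: remove rows that contain missing values.
    df = df.dropna()
    # Convert each column to an appropriate data type.
    df["Participation"] = df["Participation"].map(percent_to_float)
    df["Total"] = df["Total"].astype(int)
    # Rename columns so the two years can sit side by side after merging.
    return df.rename(columns={
        "State": "state",
        "Participation": f"participation_{year}",
        "Total": f"total_{year}",
    })


# Merge the two cleaned datasets on the state name.
merged = clean(RAW_2018, 2018).merge(clean(RAW_2019, 2019), on="state")

# With no path argument, to_csv returns the CSV text; passing a file name
# instead would write the merged dataset to disk.
csv_text = merged.to_csv(index=False)
```

Merging on the state name keeps one row per state with both years' columns, which makes year-over-year comparisons straightforward.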
Our final dataset will look like this,
With the cleaned dataset in hand, we can build the visualizations.
The heatmap shows a strong relationship between SAT scores and participation rate. A negative correlation coefficient means that one variable tends to decrease as the other increases, and a positive coefficient means that the two tend to increase together.
The two scatter plots above show a negative correlation between the SAT total score and the participation rate.
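To make the heatmap values concrete, here is a small sketch of how a correlation matrix can be computed with pandas. The numbers are made up for illustration and are not the real SAT data, and the column names are assumptions. The manual calculation is included only to show what the Pearson coefficient measures.

```python
import pandas as pd

# Illustrative, made-up values (not the real SAT data).
df = pd.DataFrame({
    "participation": [0.05, 0.10, 0.60, 0.80, 1.00],
    "total": [1200, 1180, 1080, 1050, 1000],
})

# Pairwise Pearson correlation matrix; a heatmap is a colored view of this.
corr = df.corr(method="pearson")
r = corr.loc["participation", "total"]

# The same coefficient by hand: covariance divided by the product of the
# standard deviations (the normalizing factors cancel, so sums suffice).
dx = df["participation"] - df["participation"].mean()
dy = df["total"] - df["total"].mean()
r_manual = (dx * dy).sum() / ((dx ** 2).sum() * (dy ** 2).sum()) ** 0.5

print(corr)
print("r is negative:", r < 0)
```

A coefficient near -1 or +1 indicates a strong linear relationship, while a value near 0 indicates a weak one. The sign only gives the direction.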
5. The Results
Finally, we have some useful findings. The distributions of SAT score and participation rate are both bimodal, and there is a negative correlation between them. In other words, states with higher participation rates tend to have lower mean total SAT scores.
This work could be extended by building a predictive machine learning model that takes our final dataset as input and predicts the score from the participation rate.
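As a rough idea of what such a model might look like, the sketch below fits a straight line to made-up data with NumPy. Choosing simple linear regression is an assumption made here for illustration, as are the sample values and the function name predict_total. A real model would be trained on the final dataset and evaluated on held-out data.

```python
import numpy as np

# Illustrative, made-up values (not the real SAT data).
participation = np.array([0.05, 0.10, 0.60, 0.80, 1.00])
total = np.array([1200, 1180, 1080, 1050, 1000])

# Least-squares fit of total = slope * participation + intercept.
slope, intercept = np.polyfit(participation, total, deg=1)


def predict_total(rate: float) -> float:
    """Predict a state's mean total score from its participation rate."""
    return float(slope * rate + intercept)


print("slope is negative:", slope < 0)
print("predicted total at 50% participation:", round(predict_total(0.5)))
```

The negative slope is the regression counterpart of the negative correlation: each additional point of participation is associated with a lower predicted mean score.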