Commit f10387b4 authored by Iryna Nikolayeva

2021 improvements to lab 3

parent c22d4712
%% Cell type:markdown id: tags:
---
# Exploratory data analysis Lab
---
Some define statistics as the field that focuses on turning information into
knowledge. The first step in that process is to summarize and describe the raw
information: the data. In this lab we explore educational data gathered by the OECD,
specifically the results of the PISA (Programme for International Student Assessment) tests, which measure reading, science and mathematics levels at the age of 15.
As this is a large data set, along the way you'll also learn the indispensable skills of data
processing and subsetting.
## Getting started
Fill in the "..." placeholders wherever they appear.
### Help
1) Official Python documentation: https://docs.python.org/
2) To look up the exact parameters of a command: ?command
3) The Internet, and Stack Overflow in particular, for all other questions and issues
### Load packages
In this lab we will explore the data using the `numpy` and `pandas` packages and visualize it
using `pyplot`, a subpackage of the `matplotlib` package.
NB: A published package can also be called a "library". We may use the two words as synonyms.
Let's load the packages.
%% Cell type:code id: tags:
``` python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```
%% Cell type:markdown id: tags:
For plots to appear in the notebook, execute:
%% Cell type:code id: tags:
``` python
%matplotlib inline
```
%% Cell type:markdown id: tags:
### Data
The mission of the Organisation for Economic Co-operation and Development (OECD) is to promote policies that will improve the economic and social well-being of people around the world. It carries out a lot of data analysis and performance measurement. For education, it regularly compares the levels of students across OECD countries. These tests are called PISA (Programme for International Student Assessment) tests. Here we will explore some of the PISA 2015 results that the OECD makes publicly available after its analysis.
1.1 Let's begin by downloading the data
For Linux/macOS, download: https://gitlab.lip6.fr/bouchet/damivis/-/blob/master/data/OECD_PISA.csv
For Windows, download:
https://gitlab.lip6.fr/bouchet/damivis/-/blob/master/data/OECD_PISA_Windows.csv
1.2 Load the data into a dataframe `dfOecd`. Use the column 'location' as index:
%% Cell type:code id: tags:
``` python
dfOecd=pd.read_csv(...,..., index_col=1)
```
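%% Cell type:markdown id: tags:
As a hint for the `index_col` argument, here is a minimal sketch on an invented miniature CSV. The column names and values below are assumptions made for illustration only; they are not taken from the real PISA file, and the placeholders above are still yours to fill in.

``` python
import io
import pandas as pd

# Hypothetical miniature CSV standing in for the real file; all values are invented.
csv_text = """id,location,maths2015
0,AAA,500
1,BBB,480
2,CCC,450
"""

# index_col=1 tells pandas to use the second column (position 1) as the row index.
df_toy = pd.read_csv(io.StringIO(csv_text), sep=",", index_col=1)
print(df_toy)
```

With this toy input the rows are labelled by `location`, and the remaining columns are `id` and `maths2015`.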
%% Cell type:markdown id: tags:
2) View the first 3 lines:
%% Cell type:code id: tags:
``` python
dfOecd....
```
%% Cell type:markdown id: tags:
3) What are the names of the columns?
%% Cell type:code id: tags:
``` python
dfOecd....
```
%% Cell type:markdown id: tags:
4) Which countries are in this survey?
%% Cell type:code id: tags:
``` python
dfOecd...
```
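%% Cell type:markdown id: tags:
The three inspection steps above (first rows, column names, row labels) can be sketched on a small invented dataframe. `df_toy` and its values are hypothetical; only the column names `maths2015` and `reading2015` are borrowed from later in this lab.

``` python
import pandas as pd

# Invented data; the index plays the role of the 'location' column.
df_toy = pd.DataFrame(
    {"maths2015": [500, 480, 450, 520], "reading2015": [505, 470, 460, 515]},
    index=pd.Index(["AAA", "BBB", "CCC", "DDD"], name="location"),
)

first_rows = df_toy.head(3)            # first 3 rows
column_names = list(df_toy.columns)    # names of the columns
row_labels = list(df_toy.index)        # index values (here, the location codes)

print(first_rows)
print(column_names)
print(row_labels)
```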
%% Cell type:markdown id: tags:
5) Let's draw a histogram of the 2015 maths scores:
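One possible way to draw a histogram with `pyplot` is sketched below on invented scores (the series `scores` is hypothetical, not PISA data). The same call should work on a dataframe column.

``` python
import matplotlib.pyplot as plt
import pandas as pd

# Invented scores, for illustration only.
scores = pd.Series([420, 455, 470, 480, 490, 495, 500, 505, 510, 530], name="maths2015")

fig, ax = plt.subplots()
# With matplotlib's default settings, hist() splits the data range into 10 equal-width bins.
counts, bin_edges, _ = ax.hist(scores)
ax.set_xlabel("Maths score")
ax.set_ylabel("Number of countries")
```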
%% Cell type:markdown id: tags:
6) In the previous histogram, reduce the bin width. Is the distribution skewed? Is it unimodal? Does it have extreme values? On the left? On the right?
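To see what changing the bin width does, the sketch below counts the same invented scores with 3 and then 9 bins using `np.histogram`, and computes the sample skewness with `Series.skew()`. The data are hypothetical; in `plt.hist` the equivalent knob is the `bins` argument.

``` python
import numpy as np
import pandas as pd

# Invented scores with one low extreme value.
scores = pd.Series([380, 455, 470, 480, 490, 495, 500, 505, 510, 515])

# Fewer, wider bins hide detail; more, narrower bins reveal gaps and isolated values.
coarse_counts, coarse_edges = np.histogram(scores, bins=3)
fine_counts, fine_edges = np.histogram(scores, bins=9)

# Negative skewness indicates a longer tail on the left, positive on the right.
skewness = scores.skew()

print(coarse_counts, fine_counts, skewness)
```

With narrower bins, the isolated low value shows up as a separate bar followed by empty bins, which is the kind of pattern the questions above ask you to look for.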
%% Cell type:markdown id: tags:
7) For each score, what are the mean, median and standard deviation? From the graph above, for mathematics, does the mean or the median better summarise the performance of the majority of the countries?
%% Cell type:code id: tags:
``` python
dfOecd....
```
%% Cell type:code id: tags:
``` python
dfOecd....
```
%% Cell type:code id: tags:
``` python
dfOecd....
```
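%% Cell type:markdown id: tags:
To build intuition for the mean-versus-median question, here is a small sketch on invented scores containing one low extreme value. The numbers are hypothetical; the point is only that a single extreme value pulls the mean more than the median.

``` python
import pandas as pd

# Invented scores; 380 is a deliberately low extreme value.
scores = pd.Series([380, 455, 470, 480, 490, 495, 500, 505, 510, 515])

mean_score = scores.mean()      # arithmetic mean
median_score = scores.median()  # middle value of the sorted data
std_score = scores.std()        # sample standard deviation (pandas uses ddof=1)

print(mean_score, median_score, std_score)
```

Here the mean is 480.0 while the median is 492.5: the median stays closer to where most of the values lie.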
%% Cell type:markdown id: tags:
8) Actually, there is a command that shows all the summary statistics at once:
%% Cell type:code id: tags:
``` python
dfOecd.describe()
```
%% Cell type:markdown id: tags:
9) Let's now look at relationships between variables. Draw a scatterplot with reading2015 on one axis and maths2015 on the other, using the location column as point labels.
%% Cell type:code id: tags:
``` python
plt.plot(...)  # alternative: plt.scatter(dfOecd['reading2015'], dfOecd['maths2015'])
# Annotate each point with its index label (the location)
for label, x, y in zip(np.array(dfOecd.index), dfOecd['reading2015'], dfOecd['maths2015']):
    plt.annotate(label, xy=(x, y))
```
%% Cell type:markdown id: tags:
To go further, take a look at the graphs that have been generated from this data (and some other variables) on the PISA website: http://www.oecd.org/pisa/.
If you are interested in the subject, the report is worth reading!
%% Cell type:markdown id: tags:
10) We cannot clearly see what happens for countries where scores in maths or reading exceed 470. Can you filter the countries whose scores are higher than 470 and plot only those, to get a better view?
%% Cell type:code id: tags:
``` python
dfOecdHighReading=...
```
%% Cell type:code id: tags:
``` python
...
```
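%% Cell type:markdown id: tags:
Filtering rows by a condition is usually done with a boolean mask. The sketch below uses an invented dataframe (`df_toy` and its values are hypothetical) and shows both readings of the question: both scores above the threshold, or at least one of them.

``` python
import pandas as pd

# Invented data, for illustration only.
df_toy = pd.DataFrame(
    {"maths2015": [500, 480, 450, 465], "reading2015": [505, 460, 475, 440]},
    index=pd.Index(["AAA", "BBB", "CCC", "DDD"], name="location"),
)

threshold = 470
maths_high = df_toy["maths2015"] > threshold      # boolean Series, one value per row
reading_high = df_toy["reading2015"] > threshold

# & keeps rows where both conditions hold; | keeps rows where at least one holds.
high_both = df_toy[maths_high & reading_high]
high_either = df_toy[maths_high | reading_high]

print(high_both)
print(high_either)
```

The filtered dataframe can then be passed to the same plotting and annotation code as in step 9.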