Page 5: Visualizing Data

Unit 5, Lab 3, Page 5

DAT-2.D.2, DAT-2.D.6 bullet 4

Tables, diagrams, text, charts, graphs, and other visual tools help extract, modify, and communicate information from data.

On this page, you will create visualizations to help you analyze and communicate information from your dataset.

Grouping Data

DAT-2.E.3 classifying only

Classifying data means distributing data into groups based on common characteristics.

Another thing that’s often done in data science is grouping (or classifying) data. For example, here is the cars data grouped by vehicle make (column 14):

Column A shows all of the vehicle makes (field 14 of each record).
Column B shows the total number of vehicles of each make.
Column C contains a list of all the data from cars for the vehicles of that Make (such as all the data for the Acuras or all the data for the Nissans). If you double-click one of the lists in column C, another table will open showing the data for all cars of that make.
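For readers who want to see the same idea in a text language, here is a minimal Python sketch of grouping that produces the three columns described above. It is not how the group block is implemented. The tiny `cars` list and the helper name `group_by_field` are invented for illustration, and the sketch counts fields from 0 (Python's convention) instead of from 1.

```python
# Hypothetical miniature stand-in for the cars table: (make, year, city MPG).
# These records are invented for illustration; they are not from the real dataset.
cars = [
    ("Acura", 2009, 18),
    ("Nissan", 2010, 22),
    ("Acura", 2010, 20),
    ("Toyota", 2011, 27),
    ("Nissan", 2012, 24),
]

def group_by_field(records, field_index):
    """Return one row per distinct value: (value, count, matching records).

    Rows appear in order of first appearance; that ordering is an assumption
    of this sketch, not a claim about the Snap! block.
    """
    groups = {}
    for record in records:
        groups.setdefault(record[field_index], []).append(record)
    return [(value, len(members), members) for value, members in groups.items()]

grouped = group_by_field(cars, 0)  # field 0 is the make in this sketch
for make, count, members in grouped:
    print(make, count, members)
```

Each printed row corresponds to columns A, B, and C: the make, how many records have that make, and the list of those records.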

The "by intervals of" input to the group table block should be left empty when, as in this example, the field on which you’re grouping is text rather than numbers. Later on this page, you’ll see how to use intervals in graphing.

Open your U5L3-Data-Processing project if it isn’t open already.
Determine one question you can answer by grouping your data, and build code to answer it.
Example questions for which grouping is helpful:
- How many Toyotas are in the database?
- Which brand in the table has the most models listed?
- How many 2010 Hyundais are in the database? (This requires looking inside one of the lists in column C, so you’d need two keep functions.)
Pipe may be useful for questions that require looking inside the inner lists of the grouped data (in column C).
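The "two keep functions" idea can be sketched in Python as two filters applied one after the other: the first keeps the row for one make, and the second keeps only the records of one year inside that row's inner list. The `grouped` table and the helper name `count_make_and_year` are hypothetical and the data are invented; this is a sketch of the reasoning, not the Snap! solution.

```python
# Hypothetical grouped table: (make, count, list of (make, year) records).
# The data are invented for illustration.
grouped = [
    ("Hyundai", 3, [("Hyundai", 2010), ("Hyundai", 2011), ("Hyundai", 2010)]),
    ("Toyota", 2, [("Toyota", 2010), ("Toyota", 2012)]),
]

def count_make_and_year(table, make, year):
    """Count records of one make and one year using two filtering steps."""
    # First filter: keep only the rows whose column A matches the make.
    matching_rows = [row for row in table if row[0] == make]
    if not matching_rows:
        return 0
    # Second filter: look inside column C and keep only records from that year.
    inner_records = matching_rows[0][2]
    return len([record for record in inner_records if record[1] == year])

print(count_make_and_year(grouped, "Hyundai", 2010))
```

Chaining the two steps like this is the same shape of computation that a pipeline expresses: the output of the first filter becomes the input of the second.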

Plotting Data

The bar chart function works like the group function, but with special features for numeric data: it allows you to select upper and lower limits of the data; you can have a range of values in one bucket, such as values 6–10, values 11–15, and so on; and it sorts the groups. For example, here is the cars data grouped by city MPG (column 9):

Now, Column A shows city MPG (field 9 of each record) grouped into intervals of 5 and sorted.
As before, Column B shows the total number of vehicles within each MPG range (0–5, 6–10, 11–15, etc.).
As before, Column C contains a list of all the data from cars within that MPG range (such as all the data for the 879 cars that get between 21 and 25 MPG in the city).

The number in column A is the largest value included in each group. If the values aren’t all integers, the next group includes anything larger. For example, the group numbered 15 includes values from 10.0001 (or anything more than 10) to exactly 15.
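The labeling rule just described (each group is named by the largest value it can contain) can be sketched in Python by rounding each value up to the next multiple of the interval. The helper names `bucket_label` and `bucket_counts` and the sample values are assumptions made for illustration, and the sketch ignores the upper and lower limits that the bar chart function also accepts.

```python
import math
from collections import Counter

def bucket_label(value, interval):
    """Return the upper limit of the interval that contains value.

    With interval 5, anything greater than 10 and up to exactly 15 gets label 15.
    """
    return math.ceil(value / interval) * interval

def bucket_counts(values, interval):
    """Return (label, count) pairs sorted by label."""
    counts = Counter(bucket_label(value, interval) for value in values)
    return sorted(counts.items())

# Invented city MPG values, not taken from the real cars data.
city_mpg = [18, 22, 20, 27, 24, 15, 10.5, 21]
print(bucket_counts(city_mpg, 5))
```

Here 10.5 and 15 both land in the group labeled 15, while 18 and 20 land in the group labeled 20.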

You can plot the data from bar chart to visualize them:

Plot a few bar charts of some fields from your dataset and make at least one new observation about your data.
The mode of a data set is the value that appears most often in it.
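As a small illustration of that definition, the Python sketch below finds the mode of a short list. The sample values are invented, and `multimode` from Python's standard `statistics` module is used so that a tie returns every most-frequent value instead of just one.

```python
from statistics import multimode

# Invented sample values for illustration.
values = [3, 7, 7, 2, 7, 3]

# multimode returns a list of the most frequent value(s).
modes = multimode(values)
print(modes)

# When two values are tied for most frequent, both are reported.
tied = multimode([1, 1, 2, 2])
print(tied)
```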

Here is a bar chart of field 11 from the cars data set (highway MPG) with MPG values from 5 to 50, using an interval of 3. Identify the mode. (It will be a range of values such as 13–15 or 16–18.)
Here is another bar chart with all the inputs the same as before, but with an interval of 6. Identify the mode.
How can these results both be correct? (There’s nothing wrong with the graphs.)
Why would you ever use an interval larger than 1?

Research the question of why you would ever use an interval larger than 1.