Page 3: Importing and Accessing Data

Unit 5, Lab 3, Page 3

DAT-2.D.1, DAT-2.C.1

Computers can be useful in processing data to gain information, but your ability process data depends both on your capabilities and the tools you have available.

On this page, you will build various tools (specifically, selectors) to help you answer questions about data that interests you.

You learned about selectors on Unit 2 Lab 2 Page 2: Planning a Quiz App.

  1. “U5L3-Data-Processing”Start a New Project called U5L3-Crayon

  2. Visit the CORGIS Datasets Project and select a dataset you’d like to explore. Download the CSV file for the data you want to explore.
  3. Open Snap! and drag the downloaded file into the Snap! window. You should see a table full of data. Look over the data (including the column headings in the top row) to give yourself a sense of what kind of information is included. Then click “OK” to close the window. You can still see the data in the watcher on the Snap! stage.

DAT-2.D.5

a graph with a bunch of data points, in which they generally follow a straight line from top left to bottom right You may be familiar with tables from spreadsheet applications (such as Google Sheets, Apple Numbers, or Microsoft Excel) which store tabular data in different file formats. Spreadsheet programs help efficiently organize information, and they can find trends in data automatically (such as the line shown at right). CSV is an open spreadsheet format that works in any of these applications and in Snap!.

What does CSV stand for?

CSV stands for “comma separated values.” CSV files are tables of data stored with commas between each item in a row and line breaks between each row in the table. (You can see the commas and line breaks of a CSV by opening it in a text editor.)

Although spreadsheets are a common kind of data, they aren’t the only kind. Consider the list of words that you used in Lab 1 to check spelling. It’s not a list of lists with two dimensions of data (rows and columns); it’s just a list with one dimension of data. “Spreadsheet” refers to two-dimensional data organized in rows and columns. “Dataset” is the more general term for any collection of data, including simple, one-dimensional lists; spreadsheets (two-dimensional lists of lists); and more complicated datasets such as spreadsheets with multiple tabs.

  1. Experiment with different inputs to the item () of 'list input slot' block to access different rows in the dataset.
  2. Talk with Your Partner How does the first item of your data set (item (1) of) differ from the other items?
  3. Locate the all but first of 'list input slot' block and determine what it does with your data.
  4. Build selectors:

    1. headings of table () that reports just the headings in a table
    2. data of table () that reports just the data (not the headings)

A table is represented in Snap! as a list of lists. If you right-click (or control-click on a Mac) a table, you can switch to “list view” and see how the data (and column headings) are stored. See examples of table view and list view.

Table View

List View

report of cars dataset displayed as a table with columns and rows; the first row is the label of each column; the remaining rows each contain the data for a single record

report of cars dataset displayed as a list of lists; the first list contains the labels for each of the columns shown in the table view; the remaining lists each contain the data for a single record

: Records, Fields, and Columns

  • A record is one row in a dataset (other than the first row, which contains the column headings). A single record might be the data for one student in your school, the data for one earthquake that happened, the data for one hospital in the U.S, or the data for one contact in your contact list. In other words, a record is a horizontal slice of the dataset.
  • A field is one item of a record in a dataset. It might be one person’s homeroom teacher, the magnitude of an earthquake in Los Angeles last week, the owner of one hospital in Chicago, or the phone number of one person in your contact list.
  • A column is a list containing the data from one field for all records in a dataset. A column might be the homeroom teacher for every student in your school, the magnitude of every earthquake in the dataset, the owner of every hospital in the U.S., or the phone number of every person in your contact list. In other words, a column is a vertical slice of the dataset.

three frame animation of the report of cars dataset displayed as a table with columns and rows; in the first frame, the fourth row of the table is highlighted and labeled 'record (row)'; in the second frame, the third column of the table is highlighted and labeled 'column'; in the third frame, the cell in the fourth row and third column is highlighted and labeled 'field'

  1. Build record () of table () and field () of record () blocks to make it easier to think about the structure of the table.
  2. Talk with Your Partner Build column () of table () to report just one column of your dataset.

    Record and column are selectors for a table abstract data type. We don’t need a constructor for this abstract data type because we are importing the data from the Internet, but the selectors will be useful. (Field is a selector for record, which is, itself, an abstract data type.)

    Notice that these suggested block names include the word “table” or “record” before the second input. Including the expected input data type in the block name can help you avoid bugs caused by using a selector that doesn’t match the input you want to use.

    Try to figure out how to report just one column, but click if you really need a hint.

    map () over () performs the same function on every item in a list. Have I seen map before?

    Since a table is a list of lists (where each inner list is one record), map will perform the same function on every record in your dataset. You will need to determine what function to map over the dataset. You learned about map on Unit 3 Lab 2 Page 5: Transforming Every List Item.

  3. Talk with Your Partner Determine one question you can answer by looking at a single column of your data set, and then build code to answer your question. You may need to build additional blocks or import them from other projects.

    You can see the column number by holding your mouse pointer over the letter at the top of the column in table view.
    image of the top of the table view for the cars dataset with the mouse pointer over the top of the second column; the columns are now labeled A, 2 (where the pointer is), C, D, etc.

    You may need to use map, keep, or combine to answer your question. Click to see where you learned about these higher order functions.

    Click for example questions to ask about a single column.

    • What’s the average MPG that cars in this database get in the city? (You’d need an average block.)
    • What’s the year of the oldest car in this dataset? (You’d need a minimum block.)
    • How many cars in this dataset have manual transmission?

    Notice that all of these examples only require data from one column. If you want to ask a question that requires looking at another column (for example, “What’s the model of the car with the highest MPG?”), you can do the Take It Further Activity below.

Challenges in Collecting Data

This section looks long. We should do another pass to trim it down. –MF, 4/4/20

DAT-2.C.3, DAT-2.C.2 all bullets

Researchers often face challenges with data before they even begin analysis. Suppose you are combining data from different countries about distances between cities, and you discover that the distance data from the U.S. is measured in miles, but the distance data from Europe is measured in kilometers; to make meaningful comparisons, you need uniform data (all in miles or all in kilometers). As another example, if you use an online survey to collect data, the way participants abbreviate, spell, or capitalize their entries may vary. Data may also be incomplete (if some people didn’t complete the survey) or invalid (if some people made mistakes).

DAT-2.C.4, DAT-2.E.2

Cleaning data is the process of making the data uniform without changing its meaning (such as replacing abbreviations, spellings, and capitalizations with the intended word or converting miles to kilometers). Programmers can use programs to filter and clean digital data, thereby gaining insight and knowledge.

DAT-2.C.5

Imagine you read in the news that people who eat a lot of broccoli are less likely to get cancer. The conclusion that broccoli prevents cancer could be a result of bias. It could be that people who eat a lot of broccoli tend to be the same people who also get a lot of exercise, and it’s actually the exercise that makes the difference. In research, the term “bias” doesn’t have to mean prejudice; it’s about reasons the data might not mean what they seem to mean.

People sometimes think that the way to overcome bias is to use a bigger sample (asking more people if they eat broccoli and have cancer). But if the bigger sample has the same problem (people getting more exercise also eat more broccoli), then a bigger sample won’t eliminate the bias.

    DAT-2.C

  1. Write Out Your ThoughtsDescribe possible challenges about the data you might face if you were going to collect your own data to answer a question like the one you investigated in the previous problem.

I need to redo these images and code statements with “record” instead of “row.” –MF, 6/27/19

  1. Answer a question like “What’s the model of the car with the highest MPG?” by building a row with maximum in column () of table () block.

    Click for a hint about how to build row with maximum in column () of table ().

    One way to build a maximum of 'list input slot' function for a simple list (that isn’t a table) is shown below. You can build a maximum of () and () block that compares two inputs and use it with combine to find the maximum of a whole list.
    maximum of (a) and (b) {
if (a > b) {
report (a)
}
report (b)
} maximum of (data) {
report (combine (data) using (maximum of ( ) and ( )))
}
    You can use a similar approach here by first building a row with maximum in column () between row () and row () that compares a specific field (column) for two rows and reports the row with the higher value in the specified column.

    Need to fix “specific field (column)” in the TIF. –MF, 6/27/19

  2. Writing specific field numbers into the code is problematic for several reasons: It’s easy to get wrong, it makes for unreadable code, and (this actually happened with the cars database between edits of BJC) the layout of a table might change over time.

    • Write a block field named () from record () of table () that looks for the given name in the table header , finds what item number it is in the header list, and uses that as the field number to extract from the given record.
    • Also write column named () of table (), which should take the full table (header and data) as input, but should report the desired column from only the data of the table (so that the result can be used directly as input to blocks such as AVERAGE OF LIST).
    • You may find index of () in () helpful.