# Data Science: An Introduction/Definitions of Data

# NOTE:

The following page is a rough draft. The final draft can be found at:

http://en.wikibooks.org/wiki/Data_Science:_An_Introduction/Definitions_of_Data

**Data Science: An Introduction**

**Chapter 3: Definitions of Data**


## Note to Contributors (remove this section when the chapter is complete)

First, please register yourself with Wikibooks (and list yourself below), so that we know who our co-contributors are. Thank you.

Secondly, we only need basic, clear, straight-forward information in each chapter. We are not trying to be exhaustive or complete--the value of this book is in the simple synthesis across subjects. There are other venues in which to wax eloquent on the deepness and complexities of a particular subject. Please place yourself in a "beginner's mind" as you make contributions. Please also scope each chapter so that it can be taught in a one-hour class period. If the chapter requires more than an hour to teach, it is probably too detailed.

- To the extent possible, please use terms and concepts in the way in which they are defined in the Wikipedia and Wiktionary. This way students can refer to the corresponding Wikipedia / Wiktionary page to get a deeper understanding of the concept.

Thirdly, this is a cross-disciplinary book. We want to help people apply data science to all fields. Therefore, we need a wide variety of simple examples and simple exercises.

Fourthly, please adhere to the simple structure of each chapter: Summary of Main Points, Discussion, More Reading, Exercises, and References. We want the More Reading section to link to on-line resources. The References section may contain off-line resources.

Fifthly, as with any Wikibook please feel free to make corrections, expand explanations, and make additions where necessary, even if it is not "your" chapter. Use the discussion page to explain changes that might be controversial.

Sixthly, some syntax rules:

- Please **bold** key terms and phrases the student should learn.
- Put the names of functions and code snippets in 'code' tags: `<code>lm()</code>`
- Use in-line links `[[ ]]` to the Wikipedia, Wiktionary, WikiCommons, Wikibooks, and other Wikimedia Foundation properties.
- Use references `<ref> </ref>` to "external" sources--both on-line and off-line.
  - Use the citation templates to make citations: Template:Cite book, Template:Cite web, Template:Cite journal
- If you want to add an image or graph, you should load it into the Commons rather than uploading into Wikibooks.
  - If appropriate, add the tag `{{Created with R}}` when you upload the graph.
- If using a package other than the **R** standard packages, put the name of the package in bold in parentheses after each function: <code>MCMCprobit()</code> ('''MCMCpack''')

## Chapter Summary

The word "data" is a general purpose word denoting a collection of measurements. "Data points" refer to individual instances of data. A "data set" is a well-structured set of data points. Data points can be of several "data types," such as numbers, text, or date-times. When we collect data on similar objects in similar formats, we bundle the data points into a "variable." We could give a variable a name such as 'age,' which could represent the list of ages of everyone in a room. The data points associated with a variable are called the "values" of the variable. These concepts are foundational to understanding data science. There is some quirkiness in the way variables are treated in the R programming language.

## Discussion

#### What is Data?

The Wiktionary defines **data** as the plural form of datum; as pieces of information; and as a collection of object-units that are distinct from one another. (http://en.wiktionary.org/wiki/data)

The Wiktionary defines **datum** as a measurement of something on a scale understood by both the recorder (a person or device) and the reader (another person or device). The scale is arbitrarily defined, such as from 1 to 10 by ones, 1 to 100 by 0.1, or simply true or false, on or off, yes, no, or maybe, etc.; and as a fact known from direct observation.
(http://en.wiktionary.org/wiki/datum#English)

For our purposes, the key components of these definitions are that data are observations that are measured and communicated in such a way as to be intelligible to both the recorder and the reader. So, you as a person are not data, but recorded observations about you are data. For example, your name when written down is data; a digital recording of you speaking your name is data; and a digital photograph of your face or a video of you dancing is data.

#### What is a Data Point?

Rather than call a single measurement by the formal word "datum," we will use the conventional term **data point**. We may talk about a single data point or several data points. Just remember that when we talk of "data," what we mean is a set of aggregated data points.

#### What is a Data Set?

The Wiktionary, unhelpfully, defines a **data set** as a "set of data." (http://en.wiktionary.org/wiki/data_set#English) Let us define a data set as a collection of data points that have been observed on similar objects and formatted in similar ways. Thus, a compilation of the written names and the written ages of a room full of people is a data set. In computing, a data set is stored in a file on a disk. Storing the data set in a file makes it accessible to analysis.

#### What are Data Types?

As illustrated earlier, data can exist in many forms, such as text, numbers, images, audio, and video. People who work with data have taken great care to define different **data types** very specifically. They do this because they want to compute various operations on the data, and those operations only make sense for particular data types. For example, addition is an operation we can compute on integer data types (2+2=4), but not on text data types ("two"+"two"=???). Concatenation is an operation we can compute on text. To concatenate means to put together, so: `concatenate("two", "two") = "twotwo"`. For the purposes of this introduction, we will just concern ourselves with simple numeric and simple text data types and leave more complex data types--like images, audio, and video--to more advanced courses. Data scientists use the various data types from mathematics, statistics, and computer science to communicate with each other.
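As a minimal sketch of this distinction in R (the language used later in this chapter), the lines below add two integers, concatenate two strings, and show that adding two strings is rejected. The choice of `paste0()` for concatenation and the variable names are ours, made for illustration; `concatenate()` in the text above is pseudocode, not an R function.

```r
# Addition is defined for integer data.
sum_of_twos <- 2L + 2L            # 4

# Concatenation is defined for text data.
joined <- paste0("two", "two")    # "twotwo"

# Addition is not defined for text: R signals an error rather than guessing.
# tryCatch() turns that error into the string "error" so the script keeps going.
text_sum <- tryCatch("two" + "two", error = function(e) "error")

print(sum_of_twos)
print(joined)
print(text_sum)
```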

##### Data Types in Mathematics

We will introduce just the most commonly used data types in Mathematics. There are many more, but we'll save those for more advanced courses.

- **Integers** - According to the Wikipedia (http://en.wikipedia.org/wiki/Integer), integers are numbers that can be written without a fractional or decimal component, and fall within the set {..., −2, −1, 0, 1, 2, ...}. For example, 21, 4, and −2048 are integers; 9.75, 5½, and √2 are not integers.
- **Rational Numbers** - According to the Wikipedia (http://en.wikipedia.org/wiki/Rational_number), rational numbers are those that can be expressed as the quotient or fraction p/q of two integers, with the denominator q not equal to zero. Since q may be equal to 1, every integer is a rational number. The decimal expansion of a rational number always either terminates after a finite number of digits or begins to repeat the same finite sequence of digits over and over. For example, 9.75, 2/3, and 5.8144144144… are rational numbers.
- **Real Numbers** - According to the Wikipedia (http://en.wikipedia.org/wiki/Real_number), real numbers include all the rational numbers, such as the integer −5 and the fraction 4/3, plus all the irrational numbers such as √2 (1.41421356..., the square root of two), π (3.14159265...), and e (2.71828...).
- **Imaginary Numbers** - According to the Wikipedia (http://en.wikipedia.org/wiki/Imaginary_number), imaginary numbers are those whose square is less than or equal to zero. For example, √−25 is an imaginary number and its square is −25. An imaginary number can be written as a real number multiplied by the imaginary unit *i*, which is defined by its property *i*² = −1. Thus, √−25 = 5*i*.

Data scientists understand that the kind of mathematical operations they may perform depends on the data types reflected in their data.
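The examples above can be checked in R. This is only a sketch: the helper `is_whole()` is a hypothetical name introduced here for illustration, and R's `complex` type is used to stand in for imaginary numbers.

```r
# A number is an integer in the mathematical sense if it has no fractional part.
# is_whole() is a hypothetical helper, not a built-in R function.
is_whole <- function(x) x == round(x)
is_whole(21)       # TRUE
is_whole(9.75)     # FALSE

# The square root of -25 is the imaginary number 5i.
root <- sqrt(as.complex(-25))
root               # 0+5i
root * root        # -25+0i
```

Calling `sqrt(-25)` on an ordinary (real) number returns `NaN` with a warning, which is one reason the data type has to be stated before the operation is applied.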

##### Data Types in Statistics

We will introduce just the most commonly used data types in Statistics, as defined in the Wikipedia (http://en.wikipedia.org/wiki/Level_of_measurement). There are a few more data types in statistics, but we'll save those for more advanced courses.

- **Nominal** - Nominal data are recorded as categories. For this reason, nominal data are also known as categorical data. For example, rocks can be generally categorized as igneous, sedimentary, and metamorphic.
- **Ordinal** - Ordinal data are recorded as the rank order of scores (1st, 2nd, 3rd, etc.). An example of ordinal data is the result of a horse race, which says only which horses arrived first, second, or third but includes no information about race times.
- **Interval** - Interval data record not just the order of the data points, but also the size of the intervals between data points. A highly familiar example of interval scale measurement is temperature on the Celsius scale. In this particular scale, the unit of measurement is 1/100 of the temperature difference between the freezing and boiling points of water. The zero point, however, is arbitrary.
- **Ratio** - Ratio data are recorded on an interval scale with a true zero point. Mass, length, time, plane angle, energy, and electric charge are examples of physical measures that are ratio scales. Informally, the distinguishing feature of a ratio scale is the possession of a non-arbitrary zero value. For example, the Kelvin temperature scale has a non-arbitrary zero point of absolute zero.

Data scientists know that the kind of statistical analysis they will perform is determined by the kinds of data types they will be analyzing.
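One way to see why the statistical data type matters is to try each of them in R. The sketch below uses R's `factor()` for nominal and ordinal data; the variable names and the sample values are made up for illustration.

```r
# Nominal: categories with no inherent order.
rock <- factor(c("igneous", "sedimentary", "metamorphic", "igneous"))
levels(rock)                 # "igneous" "metamorphic" "sedimentary"

# Ordinal: categories with an order, but no measurable distance between them.
finish <- factor(c("second", "first", "third"),
                 levels = c("first", "second", "third"), ordered = TRUE)
finish[2] < finish[1]        # TRUE: "first" ranks ahead of "second"

# Interval vs. ratio: a ratio is only meaningful on a scale with a true zero.
celsius <- c(10, 20)
kelvin  <- celsius + 273.15
celsius[2] / celsius[1]      # 2, yet 20 degrees C is not "twice as hot" as 10
kelvin[2] / kelvin[1]        # about 1.035, the physically meaningful ratio
```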

##### Data Types in Computer Science

We will introduce just the most commonly used data types in Computer Science, as defined in the Wikipedia (http://en.wikipedia.org/wiki/Data_type). There are many more, but we'll save those for more advanced courses.

- **Bit** - A bit (a contraction of binary digit) is the basic unit of information in computing and telecommunications; a bit represents either 1 or 0 (one or zero) only. This kind of data is sometimes also called **binary** data. When 8 bits are grouped together we call that a **byte**. A byte can have values in the range 0-255 (00000000-11111111). For example, the byte 10110100 = 180.
- **Hexadecimal** - Bytes are often represented as Base 16 numbers. Base 16 is known as Hexadecimal (commonly shortened to **Hex**). Hex uses sixteen distinct symbols, most often the symbols 0–9 to represent values zero to nine, and A, B, C, D, E, F (or alternatively a–f) to represent values ten to fifteen. Each hexadecimal digit represents four bits, thus two hex digits fully represent one byte. As we mentioned, byte values can range from 0 to 255 (decimal), but may be more conveniently represented as two hexadecimal digits in the range 00 to FF. A two-byte number would also be called a 16-bit number. Rather than representing a number as 16 bits (0010101011110011), we would represent it as 2AF3 (hex) or 10995 (decimal). With practice, computer scientists become proficient in reading and thinking in hex. Data scientists must understand and recognize hex numbers. There are many websites that will translate numbers from binary to decimal to hexadecimal and back.

- **Boolean** - The Boolean data type encodes logical data, which has just two values (usually denoted "true" and "false"). It is intended to represent the truth values of logic and Boolean algebra. It is used to store the evaluation of the logical truth of an expression. Typically, two values are compared using **logical operators** such as .eq. (equal to), .gt. (greater than), and .le. (less than or equal to). For example, `b = (x .eq. y)` would assign the Boolean value of "true" to "b" if the value of "x" was the same as the value of "y"; otherwise it would assign the Boolean value of "false" to "b."
- **Alphanumeric** - This data type stores sequences of characters (a-z, A-Z, 0-9, special characters) in a **string**, drawn from a **character set** such as **ASCII** for English text or **UTF-8** (an encoding of Unicode) for text in most of the world's languages, including Middle Eastern and Asian languages. Because most character sets include the numeric digits, it is possible to have a string such as "1234". However, this would still be an alphanumeric value, not the integer value 1234.
- **Integers** - This data type has the same definition as the mathematical data type of the same name. In computer science, however, an integer can either be **signed** or **unsigned**. Let us consider a 16-bit (two byte) integer. In its unsigned form it can have values from 0 to 65535 (2¹⁶ − 1). However, if we reserve one bit for the sign, then the range in the usual two's-complement representation becomes −32768 to +32767 (−8000 to +7FFF in hex).
- **Floating Point** - This data type is a method of representing real numbers in a way that can support a wide range of values. The term floating point refers to the fact that the decimal point can "float"; that is, it can be placed anywhere relative to the significant digits of the number. This position is indicated separately in the internal representation, and floating-point representation can thus be thought of as a computer realization of **scientific notation**. In scientific notation, the given number is scaled by a power of 10 so that it lies within a certain range, typically between 1 and 10, with the decimal point appearing immediately after the first digit. The scaling factor, as a power of ten, is then indicated separately at the end of the number. For example, the revolution period of Jupiter's moon Io is 152853.5047 seconds, a value that would be represented in standard-form scientific notation as 1.528535047×10⁵ seconds. Floating-point representation is similar in concept to scientific notation. The base part of the number is called the **significand** (or sometimes the **mantissa**) and the exponent part of the number is, unsurprisingly, called the **exponent**.
  - The two most common ways in which floating point numbers are represented are either in 32-bit (4 byte) **single precision**, or in 64-bit (8 byte) **double precision**. Single precision devotes 24 bits (about 7 decimal digits) to its significand. Double precision devotes 53 bits (about 16 decimal digits) to its significand.

Data scientists understand the importance of how data are represented in computer science, because it affects the results they are generating. This is especially true when small rounding errors accumulate over a large number of iterations.
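Several of the computer science data types described above can be explored directly from the R console. The following is a sketch using base R functions (`strtoi()`, `as.hexmode()`, `all.equal()`); the variable names are chosen here for illustration only.

```r
# Binary and hexadecimal strings converted to decimal integers.
strtoi("10110100", base = 2L)       # 180
strtoi("2AF3", base = 16L)          # 10995
format(as.hexmode(255L))            # "ff"

# Boolean: the result of a comparison is a logical value.
x <- 3
y <- 3
b <- (x == y)                       # TRUE

# The alphanumeric value "1234" is not the integer 1234 until it is converted.
digits <- "1234"
is.numeric(digits)                  # FALSE
as.integer(digits) + 1L             # 1235

# Floating point: 0.1 and 0.2 have no exact binary representation,
# so their sum differs from 0.3 by a tiny rounding error.
(0.1 + 0.2) == 0.3                  # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: equal within a small tolerance
```

Comparing floating point results with a tolerance, as `all.equal()` does, is a common way to keep such rounding errors from changing the outcome of a test.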

##### Data Types in R

There are at least 24 data types in the R language. (http://stat.ethz.ch/R-manual/R-devel/doc/manual/R-lang.html#Objects) We will just introduce you to the 7 most commonly used data types. As you will see, they are a blend of the data types that exist in Mathematics, Statistics, and Computer Science, which is just what a data scientist would expect. The seven are:

- NULL, for something that is nothing
- logical, for something that is either TRUE or FALSE (on or off; 1 or 0)
- character, for alphanumeric strings
- integer, for positive, negative, and zero whole numbers (no decimal place)
- double, for real numbers (with a decimal place)
- complex, for complex numbers that have both real and imaginary parts (e.g., the square root of -1)
- POSIX, for dates and times (a POSIXct date-time is internally represented as the number of seconds since 1970-01-01 00:00:00 UTC, with negative values for earlier times; R's separate Date class counts days since 1970-01-01)

You can get R to tell you what a particular data object is by using the `typeof()` command, which reports the type R uses internally (in the C code that R is written in). If you want to know what a particular data object was called in the original definition of the S language by Becker, Chambers & Wilks (1988), you can use the `mode()` command. If you want to know the class of an object, which R uses to decide how functions such as printing and summarizing treat it, you can use the `class()` command. For the purposes of this book, we will mostly use the `typeof()` command.

Data scientists must know exactly how their data are being represented in the analysis package, so they can apply the correct mathematical operations and statistical analysis.
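A short sketch comparing what `typeof()`, `mode()`, and `class()` report for the same objects. The object names `whole` and `when` and the explicit `tz = "UTC"` argument are choices made here for illustration.

```r
whole <- 1L
when  <- as.POSIXct("2012-07-04 10:15:59", tz = "UTC")

typeof(whole)    # "integer"
mode(whole)      # "numeric"
class(whole)     # "integer"

typeof(when)     # "double": a date-time is stored as a number
mode(when)       # "numeric"
class(when)      # "POSIXct" "POSIXt"
```

The three commands agree for simple objects and diverge for richer ones such as date-times, which is why it helps to know which question each command answers.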

#### What are Variables and Values?

Let us start by noting that the opposite of a variable is a **constant**. If we declare that the symbol "X" is a constant and assign it a value of 5, then X=5. It does not change; X will always be equal to 5. Now, if we declare the symbol "Y" to be a **variable**, that means Y can have more than one **value** (see the Wiktionary entry for "variable" -- http://en.wiktionary.org/wiki/variable). For example, in the mathematical equation Y² = 4 (Y squared equals 4), the variable Y can have either the value 2 or the value -2 and satisfy the equation.

Imagine we take a piece of paper and make two columns. At the top of the first column we put the label "name" and the top of the second column we put the label "age." We then ask a room full of 20 people to each write down their name and age on the sheet of paper in the appropriate columns. We will end up with a list of 20 names and 20 ages. Let us use the label "name" to represent the entire list of 20 names and the label "age" to represent the entire list of 20 ages. This is what we mean by the term "variable." The variable "name" has 20 data points (the list of 20 names), and the variable "age" has 20 data points (the list of 20 ages). A **variable** is a symbol that represents multiple **data points** which we also call **values**. Other words that have approximately the same meaning as "value" are **measurement** and **observation**. Data scientists use these four terms (data point, value, measurement, and observation) interchangeably when they communicate with each other.

The word variable is a general purpose word used in many disciplines. However, various disciplines also use more technical terms that mean approximately the same thing. In mathematics another word that approximates the meaning of the term "variable" is **vector**. In computer science, another word that approximates the meaning of the term "variable" is **array**. In statistics, another word that approximates the meaning of the term "variable" is **distribution**. Data scientists use these four words (variable, vector, array, and distribution) interchangeably when they communicate with each other.

Let us think again of the term data set (defined above). A **data set** is usually two or more variables (and their associated values) combined together. Once our data are organized into variables, combined into a data set, and stored in a file on a disk, they are ready to be analyzed.

The R programming language is a little quirky when it comes to data types, variables, and data sets. R sometimes uses the term "vector" instead of "variable." When we combine and store multiple vectors (variables) into a data set in R, we call it a **data frame**. When R stores vectors in a data frame, it assigns each one a **role** to indicate how the data will be used in subsequent statistical analyses. In R data frames, for example, the "character" data type can be assigned the role of **Factor** (versions of R before 4.0.0 did this by default; later versions do so only when asked, for example with `stringsAsFactors = TRUE`). The "double" data type is assigned the role of **num** and the "integer" data type is assigned the role of **int**. (The "complex" data type is assigned the role of "cplx," "logical" is shown as "logi," and date-times are shown as "POSIXct," but don't worry about those now.) These roles correspond roughly to the statistical data types as follows: Factor = nominal, int = ordinal, and num = interval. (We usually transform the ratio data type into an interval data type before doing statistical analysis. This is normally done by taking the logarithm of the ratio data. More on this in later chapters.) We can discover the role each variable will play within a data frame by using the **structure** command in R: `str()`. We will explain what "factors" are in later chapters.

## Assignment/Exercise

This assignment should be done with one or two other people. All should interact with the R programming language. The group can help each other both learn the concepts and figure out how to make R work. Practice with R by trying out different ways of using the commands that are described below.

#### Find Data Types in R

Use the `typeof()` command to verify data types. See if you can guess what the output will look like before you press the enter key.

<source lang="rsplus">

> a <- as.integer(1)
> typeof(a)
> a

> b <- as.double(1)
> typeof(b)
> b

> c <- as.character(1)
> typeof(c)
> c

> d <- as.logical("true")
> typeof(d)
> d

> e <- as.complex(-25)
> typeof(e)
> e

> f <- as.null(0)
> typeof(f)
> f

> g <- as.POSIXct("2012/07/04 10:15:59")
> typeof(g)
> class(g)
> g

> g <- as.POSIXlt("2012/07/04 10:15:59")
> typeof(g)
> class(g)
> g

</source>

If you don't explicitly specify a data type through the as.* commands, R tries to figure out what data type you intended. It does not always guess your intent correctly. Play around with R, assigning some values to some variables, and then use the `typeof()` command to see the automatic assignments of data types that R made for you. Then see if you can convert a value from one data type to another.
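As a starting point for this exploration, the sketch below shows a few of the types R assigns automatically and a few conversions between types. The particular values are arbitrary examples.

```r
# R infers a type when none is specified.
typeof(1)          # "double": plain numbers are doubles, not integers
typeof(1L)         # "integer": the L suffix requests an integer
typeof("1")        # "character"
typeof(TRUE)       # "logical"
typeof(1:3)        # "integer"

# Converting a value from one data type to another.
as.integer("42") + 1L    # 43
as.character(3.5)        # "3.5"
as.integer(3.9)          # 3: the fractional part is dropped, not rounded
```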

#### Objects, Variables, Values, and Vectors in R

R is an **object-oriented** programming language. Thus, things in R are called **objects.** So, when we assign a value to the letter "X," in R we would say we have assigned a value to the object "X." Objects in R may have different properties from each other, depending on how they are used. For this exercise, we will concern ourselves with objects that behave like variables. Those types of objects are called **vector** objects. So, when we talk--in the language of data science--about the variable "X," in R we could call it the vector "X." As you remember, a variable is something that varies. Let's create a character vector in R and assign it three values. We will use the concatenate `c()` command in R. Let's also create an integer vector using the same concatenate command.

<source lang="rsplus">

> name <- c("Maria", "Fred", "Sakura")
> typeof(name)
> name

> age <- as.integer(c(24,19,21))
> typeof(age)
> age

</source>

Both vectors now have three values each. The character string "Maria" is in the first position of the vector "name," "Fred" is in the second position, and "Sakura" is in the third position. Similarly, the integer 24 is in the first position of the vector "age," 19 is in the second position, and 21 is in the third position. Let's examine each of these individually.

<source lang="rsplus">

> name[1]
> name[2]
> name[3]
> age[1]
> age[2]
> age[3]

</source>

The number within the brackets is called the **index** or the **subscript**.
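An index does not have to be a single number. The sketch below re-creates the two vectors from this exercise so that it can run on its own, then shows a few other forms of index that base R accepts.

```r
name <- c("Maria", "Fred", "Sakura")
age  <- as.integer(c(24, 19, 21))

length(name)       # 3: the number of values in the vector
name[c(1, 3)]      # "Maria" "Sakura": several positions at once
age[-1]            # 19 21: a negative index drops that position
name[age > 20]     # "Maria" "Sakura": positions where the condition is TRUE
```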

#### Data Sets and Data Frames

If we had observed the actual names and ages of three people so that `name[1]` corresponded to `age[1]`, we would have a data set that looks like the following.

    name      age
    --------  ---
    Maria     24
    Fred      19
    Sakura    21

Let us put our data set into an R **data frame** object. We need to think of a name for our data frame object. Let's call it "Project." After we put our data set into the data frame, we will inspect it using R's "structure" command, `str()`. Remember, upper and lower case are meaningful.

<source lang="rsplus">

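> # stringsAsFactors = TRUE was the default before R 4.0.0; state it explicitly in newer versions
> Project <- data.frame(name, age, stringsAsFactors = TRUE)
> str(Project)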

</source>

The structure command told us we had three observations and two variables. That is great. It told us the names of the variables were `$name` and `$age`. This tells us that when we put a data set into an R data frame object, we need to reference the variable WITHIN the data frame as follows: `Project$name` and `Project$age`. The structure command also told us that `Project$name` was assigned the role of a "Factor" variable and that `Project$age` was assigned the role of "int." These correspond to the "nominal" and "ordinal" data types that statisticians use. R needs to know the role variables play in order to perform the correct statistical functions on the data. One might argue that the age variable is more like the statistical interval data type than the statistical ordinal data type. We would then have to change the R data type from integer to double. This will change its role to "num" within the data frame.

Rather than change the data type of `Project$age`, it is a good practice to create a new variable, so the original is not lost. We will call the new variable `Project$age.n`, so we can tell that it is the transformed `Project$age` variable.

<source lang="rsplus">

> Project$age.n <- as.double(Project$age)
> str(Project)

</source>

We can now see that the `Project$age` and `Project$age.n` variables play different roles in the data frame, one as "int" and one as "num." Now, confirm that the complete data set has been properly implemented in R by displaying the data frame object.

<source lang="rsplus">

> Project
    name age age.n
1  Maria  24    24
2   Fred  19    19
3 Sakura  21    21

</source>

Now let's double check the data types.

<source lang="rsplus">

> typeof(Project$name)
> typeof(Project$age)
> typeof(Project$age.n)

</source>

Whoops! We see some of the quirkiness of R. When we created the variable "name," it had a data type of "character." When we put it into a data frame not only did R assign it the role of a "Factor" but it also changed its data type to "integer." What is going on here? This is more than you want to know right now. We will explain it now, but you really don't have to understand it until later.

Because all statistical computations are done on numbers, R gave each distinct value of the variable "name" an integer code, assigned in alphabetical order of the values. It calls the distinct values the **levels** of the factor, and it labels each integer code with its original value, so we know what is going on. So under the covers, `Project$name` has the values: 2 (labeled "Maria"), 1 (labeled "Fred"), and 3 (labeled "Sakura"). We can convert `Project$name` back into the character data type, but we won't be able to perform statistical calculations on it.

<source lang="rsplus">

> Project$name.c <- as.character(Project$name)
> typeof(Project$name.c)
> str(Project)
'data.frame':   3 obs. of  4 variables:
 $ name  : Factor w/ 3 levels "Fred","Maria",..: 2 1 3
 $ age   : int  24 19 21
 $ age.n : num  24 19 21
 $ name.c: chr  "Maria" "Fred" "Sakura"

</source>

We can now see that `Project$name.c` has a data type of character, and has been assigned a data frame role of "chr."
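To look under the covers of a factor yourself, the sketch below builds one explicitly with `factor()` and inspects its levels and integer codes. The object name `name_f` is a hypothetical name introduced here. Note that from R 4.0.0 onward, `data.frame()` leaves character vectors as "chr" unless `stringsAsFactors = TRUE` is given, so the explicit argument is included.

```r
name <- c("Maria", "Fred", "Sakura")

# Explicitly build the factor that older versions of R created automatically.
name_f <- factor(name)

levels(name_f)         # "Fred" "Maria" "Sakura": sorted alphabetically
as.integer(name_f)     # 2 1 3: the integer code stored for each value
typeof(name_f)         # "integer"
class(name_f)          # "factor"

# Ask data.frame() for the factor conversion explicitly.
Project <- data.frame(name, stringsAsFactors = TRUE)
class(Project$name)    # "factor"
```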