Difference between revisions of "Data Science: An Introduction/Definitions of Data"

From wiki.acadac.net, the Calvin Andrus wiki

Revision as of 19:48, 25 June 2012

Data Science: An Introduction

Chapter 3: Definitions of Data



Note to Contributors (remove this section when the chapter is complete)

First, please register yourself with Wikibooks (and list yourself below), so that we know who our co-contributors are. Thank you.

Secondly, we only need basic, clear, straight-forward information in each chapter. We are not trying to be exhaustive or complete--the value of this book is in the simple synthesis across subjects. There are other venues in which to wax eloquent on the deepness and complexities of a particular subject. Please place yourself in a "beginner's mind" as you make contributions. Please also scope each chapter so that it can be taught in a one-hour class period. If the chapter requires more than an hour to teach, it is probably too detailed.

  • To the extent possible, please use terms and concepts in the way in which they are defined in the Wikipedia and Wiktionary. This way students can refer to the corresponding Wikipedia / Wiktionary page to get a deeper understanding of the concept.

Thirdly, this is a cross-disciplinary book. We want to help people apply data science to all fields. Therefore, we need a wide variety of simple examples and simple exercises.

Fourthly, please adhere to the simple structure of each chapter: Summary of Main Points, Discussion, More Reading, Exercises, and References. We want the More Reading section to link to on-line resources. The References section may contain off-line resources.

Fifthly, as with any Wikibook please feel free to make corrections, expand explanations, and make additions where necessary, even if it is not "your" chapter. Use the discussion page to explain changes that might be controversial.

Sixthly, some syntax rules:

  • Please bold key terms and phrases the student should learn.
  • Put the names of functions and code snippets inside 'code' tags: <code>lm()</code>
  • Use in-line links [[ ]] to the Wikipedia, Wiktionary, WikiCommons, Wikibooks, and other Wikimedia Foundation properties.
  • Use references <ref> </ref> to "external" sources--both on-line and off-line.
  • If you want to add an image or graph, you should load it into the Commons rather than uploading into Wikibooks.
    • If appropriate, add the tag {{Created with R}} when you upload the graph.
  • If using a package other than the standard R packages, put the name of the package in bold in parentheses after each function: <code>MCMCprobit()</code> ('''MCMCpack''')

Chapter Summary

The word "data" is a general purpose word denoting a collection of measurements. "Data points" refer to individual instances of data. A "data set" is a well-structured set of data points. Data points can be of several "data types," such as numbers, text, or date-times. When we collect data on similar objects in similar formats, we bundle the data points into a "variable." We could give a variable a name such as 'age,' which could represent the list of ages of everyone in a room. The data points associated with a variable are called the "values" of the variable. There is some quirkiness in the naming conventions used in the R programming language to refer to these concepts.


What is Data?

The Wiktionary defines data as the plural form of datum; as pieces of information; and as a collection of object-units that are distinct from one another. (http://en.wiktionary.org/wiki/data)

The Wiktionary defines datum as a measurement of something on a scale understood by both the recorder (a person or device) and the reader (another person or device). The scale is arbitrarily defined, such as from 1 to 10 by ones, 1 to 100 by 0.1, or simply true or false, on or off, yes, no, or maybe, etc.; and as a fact known from direct observation. (http://en.wiktionary.org/wiki/datum#English)

For our purposes, the key components of these definitions are that data are observations that are measured and communicated in such a way as to be intelligible to both the recorder and the reader. So, you as a person are not data, but recorded observations about you are data. For example, your name when written down is data; a digital recording of you speaking your name is data; and a digital photograph of your face or a video of you dancing is data.

What is a Data Point?

Rather than call a single measurement by the formal word "datum," we will use the term data point. We may talk about a single data point or several data points. Just remember that when we talk of "data," what we mean is a set of aggregated data points.

What is a Data Set?

The Wiktionary, unhelpfully, defines a data set as a "set of data." (http://en.wiktionary.org/wiki/data_set#English) Let us define a data set as a collection of data points that has been observed on similar objects and formatted in similar ways. Thus, a compilation of the written names and the written ages of a room full of people is a data set. In computing, a data set is stored in a file on a disk. Storing the data set in a file makes it accessible to analysis.

What are Data Types?

As illustrated earlier, data can exist in many forms, such as text, numbers, images, audio, and video. People who work with data have taken great care to very specifically define different data types. They do this because they want to compute various operations on the data, and those operations only make sense for particular data types. For example, addition is an operation we can compute on integer data types (2+2=4), but not on text data types ("two"+"two"=???). Concatenation is an operation we can compute on text. To concatenate means to put together, so: concatenate(two, two) = twotwo. For the purposes of this introduction, we will just concern ourselves with simple numeric and simple text data types and leave more complex data types--like images, audio, and video--to more advanced courses. Data scientists use the various data types from mathematics, statistics, and computer science to communicate with each other.
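The point that operations only make sense for particular data types can be tried directly in R. The following is a minimal sketch, not part of the original chapter: paste0() is R's built-in concatenation function, and tryCatch() is used here only to capture the error R raises when asked to add two pieces of text. The message stored on error is our own wording.

```r
# Addition is defined for integer data
sum_of_integers <- 2L + 2L          # the L suffix marks an integer in R

# Addition is not defined for text: R signals an error, which we capture
text_addition <- tryCatch(
  "two" + "two",
  error = function(e) "error: cannot add text"
)

# Concatenation is defined for text
joined <- paste0("two", "two")

print(sum_of_integers)
print(text_addition)
print(joined)
```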

Data Types in Mathematics

We will introduce just the most commonly used data types in Mathematics. There are many more, but we'll save those for more advanced courses.

  1. Integers - According to the Wikipedia (http://en.wikipedia.org/wiki/Integer) integers are numbers that can be written without a fractional or decimal component, and fall within the set {..., −2, −1, 0, 1, 2, ...}. For example, 21, 4, and −2048 are integers; 9.75, 5½, and √2 are not integers.
  2. Rational Numbers - According to the Wikipedia (http://en.wikipedia.org/wiki/Rational_number) rational numbers are those that can be expressed as the quotient or fraction p/q of two integers, with the denominator q not equal to zero. Since q may be equal to 1, every integer is a rational number. The decimal expansion of a rational number always either terminates after a finite number of digits or begins to repeat the same finite sequence of digits over and over. For example, 9.75, 2/3, and 5.8144144144… are rational numbers.
  3. Real Numbers - According to the Wikipedia (http://en.wikipedia.org/wiki/Real_number) real numbers include all the rational numbers, such as the integer −5 and the fraction 4/3, plus all the irrational numbers such as √2 (1.41421356... the square root of two), π (3.14159265...), and e (2.71828...).
  4. Imaginary Numbers - According to the Wikipedia (http://en.wikipedia.org/wiki/Imaginary_number) imaginary numbers are those whose square is less than or equal to zero. For example, √−25 is an imaginary number and its square is −25. An imaginary number can be written as a real number multiplied by the imaginary unit i, which is defined by its property i² = −1. Thus, √−25 = 5i.

Data scientists understand that the kind of mathematical operations they may perform depends on the data types reflected in their data.
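The examples from the list above can be checked in R. This is only an illustrative sketch: the variable names are our own, and the whole-number test (remainder after dividing by 1) is one simple way to check for a fractional part, not a formal definition.

```r
# Integers: no fractional part, so the remainder after dividing by 1 is zero
is_whole <- c(21, 4, -2048, 9.75) %% 1 == 0

# A rational number is a quotient of two integers: 9.75 = 39/4
rational_example <- 39 / 4

# Irrational real numbers can only be approximated on a computer
root_two <- sqrt(2)

# Imaginary numbers: the square root of -25 is 5i, and its square is -25
z <- sqrt(as.complex(-25))
z_squared <- z^2

print(is_whole)
print(rational_example)
print(root_two)
print(z)
print(z_squared)
```

Note that sqrt(-25) without as.complex() returns NaN with a warning, because R treats a plain -25 as a real number.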

Data Types in Statistics

We will introduce just the most commonly used data types in Statistics, as defined in the Wikipedia (http://en.wikipedia.org/wiki/Level_of_measurement). There are a few more data types in statistics, but we'll save those for more advanced courses.

  1. Nominal - Nominal data are recorded as categories. For this reason, nominal data is also known as categorical data. For example, rocks can be generally categorized as igneous, sedimentary and metamorphic.
  2. Ordinal - Ordinal data are recorded as the rank order of scores (1st, 2nd, 3rd, etc.). An example of ordinal data is the result of a horse race, which says only which horses arrived first, second, or third but includes no information about race times.
  3. Interval - Interval data record not just the order of the data points, but also the size of the intervals between them. A highly familiar example of interval scale measurement is temperature on the Celsius scale. In this particular scale, the unit of measurement is 1/100 of the temperature difference between the freezing and boiling points of water. The zero point, however, is arbitrary.
  4. Ratio - Ratio data are recorded on an interval scale with a true zero point. Mass, length, time, plane angle, energy and electric charge are examples of physical measures that are ratio scales. Informally, the distinguishing feature of a ratio scale is the possession of a non-arbitrary zero value. For example, the Kelvin temperature scale has a non-arbitrary zero point of absolute zero.

Data scientists know that the kind of statistical analysis they will perform is determined by the kinds of data types they will be analyzing.
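As a rough illustration of the four levels of measurement, the sketch below represents each one in R. The choice of factor() for nominal data and an ordered factor for ordinal data is a common R convention; the variable names and the sample values are invented for this example.

```r
# Nominal: categories with no inherent order
rock_type <- factor(c("igneous", "sedimentary", "metamorphic", "igneous"))

# Ordinal: categories with a rank order but no distances between ranks
finish <- factor(c("second", "first", "third"),
                 levels = c("first", "second", "third"),
                 ordered = TRUE)
beat_third <- finish < "third"       # order comparisons are meaningful

# Interval: differences are meaningful, ratios are not (arbitrary zero)
celsius <- c(10, 20)
celsius_difference <- diff(celsius)  # 20 C is 10 degrees warmer than 10 C

# Ratio: a true zero makes ratios meaningful
kelvin <- celsius + 273.15
kelvin_ratio <- kelvin[2] / kelvin[1]  # far less than 2

print(table(rock_type))
print(beat_third)
print(celsius_difference)
print(kelvin_ratio)
```

The last two results show why the distinction matters: 20 °C is 10 degrees warmer than 10 °C, but it is not "twice as hot," as the ratio on the Kelvin scale makes clear.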

Data Types in Computer Science

We will introduce just the most commonly used data types in Computer Science, as defined in the Wikipedia (http://en.wikipedia.org/wiki/Data_type). There are many more, but we'll save those for more advanced courses.

  1. Bit - A bit (a contraction of binary digit) is the basic unit of information in computing and telecommunications; a bit represents either 1 or 0 (one or zero) only. This kind of data is sometimes also called binary data. When 8 bits are grouped together we call that a byte. A byte can have values in the range 0-255 (00000000-11111111). For example the byte 10110100 = 180.
    • Hexadecimal - Bytes are often represented as Base 16 numbers. Base 16 is known as Hexadecimal (commonly shortened to Hex). Hex uses sixteen distinct symbols, most often the symbols 0–9 to represent values zero to nine, and A, B, C, D, E, F (or alternatively a–f) to represent values ten to fifteen. Each hexadecimal digit represents four bits, thus two hex digits fully represent one byte. As we mentioned, byte values can range from 0 to 255 (decimal), but may be more conveniently represented as two hexadecimal digits in the range 00 to FF. A two-byte number would also be called a 16-bit number. Rather than representing a number as 16 bits (0010101011110011), we would represent it as 2AF3 (hex) or 10995 (decimal). With practice, computer scientists become proficient in reading and thinking in hex. Data scientists must understand and recognize hex numbers. There are many websites that will translate numbers from binary to decimal to hexadecimal and back.
  2. Boolean - The Boolean data type encodes logical data, which has just two values (usually denoted "true" and "false"). It is intended to represent the truth values of logic and Boolean algebra. It is used to store the evaluation of the logical truth of an expression. Typically, two values are compared using logical operators such as .eq. (equal to), .gt. (greater than), and .le. (less than or equal to). For example, b = (x .eq. y) would assign the boolean value of "true" to "b" if the value of "x" was the same as the value of "y," otherwise it would assign the logical value of "false" to "b."
  3. Alphanumeric - This data type stores sequences of characters (a-z, A-Z, 0-9, special characters) in a string, drawn from a character set such as ASCII for western languages or Unicode (commonly encoded as UTF-8), which also covers Middle Eastern and Asian languages. Because most character sets include the numeric digits, it is possible to have a string such as "1234". However, this would still be an alphanumeric value, not the integer value 1234.
  4. Integers - This data type has the same definition as the mathematical data type of the same name. In computer science, however, an integer can either be signed or unsigned. Let us consider a 16-bit (two byte) integer. In its unsigned form it can have values from 0 to 65535 (2^16 − 1). However, if we reserve one bit for the sign, as in the common two's complement representation, then the range becomes −32768 to +32767 (−8000 to +7FFF in hex).
  5. Floating Point - This data type is a method of representing real numbers in a way that can support a wide range of values. The term floating point refers to the fact that the decimal point can "float"; that is, it can be placed anywhere relative to the significant digits of the number. This position is indicated separately in the internal representation, and floating-point representation can thus be thought of as a computer realization of scientific notation. In scientific notation, the given number is scaled by a power of 10 so that it lies within a certain range, typically between 1 and 10, with the decimal point appearing immediately after the first digit. The scaling factor, as a power of ten, is then indicated separately at the end of the number. For example, the revolution period of Jupiter's moon Io is 152853.5047 seconds, a value that would be represented in standard-form scientific notation as 1.528535047×10^5 seconds. Floating-point representation is similar in concept to scientific notation. The base part of the number is called the significand (or sometimes the mantissa) and the exponent part of the number is unsurprisingly called the exponent.
    • The two most common ways in which floating point numbers are represented are either in 32-bit (4 byte) single precision, or in 64-bit (8 byte) double precision. Single precision devotes 24 bits (about 7 decimal digits) to its significand. Double precision devotes 53 bits (about 16 decimal digits) to its significand.
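Several of the computer science data types above can be explored from the R prompt. The following sketch uses only base R functions (strtoi(), as.hexmode(), as.integer(), and the .Machine list); the variable names are our own. One assumption worth flagging: R's own integer type is a signed 32-bit integer, not the 16-bit integer used as the example in the text.

```r
# Bit and byte: convert the binary string 10110100 to decimal
byte_value <- strtoi("10110100", base = 2)

# Hexadecimal: convert 2AF3 to decimal, and 255 to hex
hex_value <- strtoi("2AF3", base = 16L)
max_byte_hex <- toupper(format(as.hexmode(255)))

# Boolean: store the result of a comparison
x <- 5
y <- 5
b <- (x == y)

# Alphanumeric: "1234" is text until it is explicitly converted
digits_as_text <- "1234"
digits_as_integer <- as.integer(digits_as_text)

# Integer: the largest value R's signed 32-bit integer type can hold
largest_integer <- .Machine$integer.max

# Floating point: number of bits in the significand of R's double type
significand_bits <- .Machine$double.digits

print(c(byte_value, hex_value, largest_integer, significand_bits))
print(max_byte_hex)
print(b)
print(digits_as_integer + 1L)
```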

Data scientists understand the importance of how data is represented in computer science, because it affects the results they are generating. This is especially true when small rounding errors accumulate over a large number of iterations.
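A small experiment shows how rounding errors accumulate. The sketch below adds 0.1 ten times; because 0.1 has no exact binary floating-point representation, the total on typical hardware is very close to, but not exactly, 1. The function all.equal() compares numbers within a small tolerance and is the usual way to test floating-point values for equality in R. Variable names are our own.

```r
# Add 0.1 ten times using double precision arithmetic
total <- 0
for (i in 1:10) {
  total <- total + 0.1
}

exactly_one <- (total == 1)                 # exact comparison
nearly_one <- isTRUE(all.equal(total, 1))   # comparison within a tolerance
rounding_error <- abs(total - 1)            # size of the accumulated error

print(exactly_one)
print(nearly_one)
print(rounding_error)
```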

Data Types in R

There are at least 24 data types in the R language. (http://stat.ethz.ch/R-manual/R-devel/doc/manual/R-lang.html#Objects) We will just introduce you to the 7 most commonly used data types. As you will see they are a blend of the data types that exist in Mathematics, Statistics, and Computer Science. Just what a Data Scientist would expect. The seven are:

  1. NULL, for something that is nothing
  2. logical, for something that is either TRUE or FALSE (on or off; 1 or 0)
  3. character, for alphanumeric strings
  4. integer, for positive, negative, and zero whole numbers (no decimal place)
  5. double, for real numbers (with a decimal place)
  6. complex, for complex numbers that have both real and imaginary parts (e.g., square root of -1)
  7. POSIX (POSIXct and POSIXlt), for dates and times; strictly, these are classes built on top of the basic types rather than basic types themselves

You can get R to tell you the internal type of a particular data object by using the typeof() command. If you want to know what a particular data object was called in the original definition of the S language by Becker, Chambers & Wilks (1988), you can use the mode() command. If you want to know the class of a data object, which R uses to decide how functions such as print() treat it, you can use the class() command. For the purposes of this book, we will mostly use the typeof() command; class() is useful for dates and times, which are identified by their class.

Data scientists must know exactly how their data are being represented in the analysis package, so they can apply the correct mathematical operations and statistical analysis.
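To see how typeof(), mode(), and class() differ in practice, the sketch below applies all three to a few objects. The helper function report() and the variable names are hypothetical, introduced only for this illustration; the time zone is fixed to UTC so the example does not depend on the local settings of your computer.

```r
whole <- 42L
real <- 3.14
when <- as.POSIXct("2012-07-04 10:15:59", tz = "UTC")

# Hypothetical helper: collect the three descriptions of one object
report <- function(x) {
  c(typeof = typeof(x), mode = mode(x), class = class(x)[1])
}

# One row per object, one column per command
print(rbind(whole = report(whole), real = report(real), when = report(when)))
```

Notice that the date-time object is stored as a double, and only class() reveals that it should be treated as a date and time.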

What are Variables and Values?

Let us start by noting that the opposite of a variable is a constant. If we declare that the symbol "X" is a constant and assign it a value of 5, then X=5. It does not change; X will always be equal to 5. Now, if we declare the symbol "Y" to be a variable, that means Y can have more than one value (see the Wiktionary entry for "variable" -- http://en.wiktionary.org/wiki/variable). For example, in the mathematical equation Y² = 4 (Y squared equals 4), the variable Y can have either the value 2 or −2 and satisfy the equation.

Imagine we take a piece of paper and make two columns. At the top of the first column we put the label "name" and the top of the second column we put the label "age." We then ask a room full of 20 people to each write down their name and age on the sheet of paper in the appropriate columns. We will end up with a list of 20 names and 20 ages. Let us use the label "name" to represent the entire list of 20 names and the label "age" to represent the entire list of 20 ages. This is what we mean by the term "variable." The variable "name" has 20 data points (the list of 20 names), and the variable "age" has 20 data points (the list of 20 ages). A variable is a symbol that represents multiple data points which we also call values. Other words that have approximately the same meaning as "value" are measurement and observation. Data scientists use these four terms (data point, value, measurement, and observation) interchangeably when they communicate with each other.

The word variable is a general purpose word used in many disciplines. However, various disciplines also use more technical terms that mean approximately the same thing. In mathematics another word that approximates the meaning of the term "variable" is vector. In computer science, another word that approximates the meaning of the term "variable" is array. In statistics, another word that approximates the meaning of the term "variable" is distribution. Data scientists use these four words (variable, vector, array, and distribution) interchangeably when they communicate with each other.

Let us think again of the term data set (defined above). A data set is usually two or more variables (and their associated values) combined together. Once our data is organized into variables, combined into a data set, and stored in a file on a disk, it is ready to be analyzed.

The R programming language is a little quirky when it comes to data types, variables, and data sets. R uses the term "vector" instead of "variable." When we combine and store multiple vectors (variables) into a data set in R, we call it a data frame. When R displays the structure of a data frame, it abbreviates the data types to indicate how the data will be used in subsequent statistical analyses. So, for example, the "double" data type is shown as num, "integer" as int, "logical" as logi, and "character" as chr. Character data may instead be stored as a Factor; versions of R before 4.0.0 made this conversion by default when building a data frame. We can discover the data types within a data frame by using the structure command in R: str(). We will explain what "factors" are in later chapters.
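The name-and-age example can be rebuilt in R as a small data frame. This is a minimal sketch with three invented people rather than twenty; the names and ages are hypothetical values. We pass stringsAsFactors = FALSE explicitly so that the names stay character strings regardless of the version of R in use.

```r
# Hypothetical values standing in for the names and ages collected in the room
name <- c("Ann", "Bob", "Cara")
age <- c(34L, 27L, 45L)

# Each variable is a vector; its elements are the values (data points)
number_of_ages <- length(age)

# Combine the vectors into a data set (a data frame)
people <- data.frame(name = name, age = age, stringsAsFactors = FALSE)

# Inspect the structure: one line per variable, with its data type
str(people)

# The underlying type of each column, as reported by typeof()
column_types <- sapply(people, typeof)
print(column_types)
```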


Find Data Types in R

Use the typeof() command to verify data types. See if you can guess what the output will look like before you press the enter key.

<source lang="rsplus">

> a <- as.integer(1)
> typeof(a)
> a
> b <- as.double(1)
> typeof(b)
> b
> c <- as.character(1)
> typeof(c)
> c
> d <- as.logical("true")
> typeof(d)
> d
> e <- as.complex(-25)
> typeof(e)
> e
> f <- as.null(0)
> typeof(f)
> f
> # For dates and times, compare typeof() with class()
> g <- as.POSIXct("2012/07/04 10:15:59")
> typeof(g)
> class(g)
> g
> g <- as.POSIXlt("2012/07/04 10:15:59")
> typeof(g)
> class(g)
> g
</source>


More Reading