Difference between revisions of "Data Science: An Introduction"

From wiki.acadac.net, the Calvin Andrus wiki
The Current Version resides at:
<center>Welcome to
<big><div style="font-size:200%;margin:.5ex 0 .5ex 0"><font color="#0000FF">'''Data Science: An Introduction'''</font></div></big>
<small>(Back to [[D. Calvin Andrus, Ph.D.|Home]])</small><br>
[[Data Science: An Introduction/Navigation]]
{{Book Search}}
<noinclude>{{Data Science: An Introduction/Navigation}}</noinclude>
This is the beginning of a draft of a [http://en.wikibooks.org Wikibooks] book.  When I get it into reasonable shape, I will transfer it to Wikibooks for the wider community to improve.  As of today (14 April 2012), there is no Wikibooks book on data science.  These pages are locked to keep spammers from overrunning my wiki.  You will be able to contribute to the book once it is transferred.  If you want to make comments or contributions before then, please email me at calvin.andrus@gmail.com.
This book is a very basic introduction to data science.  It is designed for the advanced high school student or average college freshman with a high school-level understanding of math, science, word processing and spreadsheets.  No understanding of computer science is assumed.
'''Data science'''--as a profession and as an academic discipline unto itself--is new, having been born in the first decade of the 21st century.  It is a child born of the mature parental disciplines of scientific methods, data and software engineering, statistics, and visualization.  This book is not intended to do justice to any of those disciplines by themselves, but to bring them together in a productive synthesis.  As such, the student will be introduced to the parent disciplines and then given exercises that fuse them into data science.  In addition, "hacking," in the original positive sense of the term, is also a contributing parent of data science, even though it is not taught as an academic discipline.
A mature data scientist will be proficient in each of the parent disciplines, studying them individually and combining them to solve serious data problems.  This textbook is just a first tentative step in that direction.
Data science, as practiced today, arises out of the "big data/cloud computing" world.  This means data science is an advanced discipline, requiring proficiency in parallel processing, map-reduce computing, petabyte-sized NoSQL databases, machine learning, and advanced statistics.  In this sense, "true" data science is more appropriately taught at the master's and doctoral levels.  We believe, however, that data science is as much about mindset as it is about the skillful use of tools.  Thus we want to engage students early in their careers to start thinking holistically about data science.  This textbook will not address the more advanced technologies and techniques of data science.  It will, however, help students start thinking like data scientists.
We will do most of our data manipulation, computer programming, and statistical analysis in the open source [http://en.wikipedia.org/wiki/R_%28programming_language%29 '''R'''] language and environment.  We know that intermediate or advanced students would use other tools such as [http://en.wikipedia.org/wiki/Mysql MySQL], [http://en.wikipedia.org/wiki/PHP PHP], [http://en.wikipedia.org/wiki/Python_%28programming_language%29 Python], [http://en.wikipedia.org/wiki/Java_%28programming_language%29 Java], [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop], [http://en.wikipedia.org/wiki/Hbase HBase], [http://en.wikipedia.org/wiki/GraphStream GraphStream], [http://en.wikipedia.org/wiki/Apache_Mahout Mahout], [http://en.wikipedia.org/wiki/Matlab MATLAB], [http://en.wikipedia.org/wiki/SPSS SPSS], [http://en.wikipedia.org/wiki/SAS_%28software%29 SAS], etc.  For this introduction, however, we are keeping it simple and sticking to a single general-purpose computing environment.
Finally, we try to use terms and concepts that are already defined in Wikipedia and Wiktionary.  This way readers can refer to the corresponding Wikipedia or Wiktionary page to get a deeper understanding of a concept.
==Note to Instructors==
We have designed this text for a 16-week, 3-credit class.  That is, a class that has three classroom-hours of instruction per week for 16 weeks, or 48 class periods.  There are 42 chapters, which leaves some days for review and testing.  The design also assumes 1 to 2 hours of "homework" per class period, which includes readings, assignments, study, and projects.
In the professional world, data science is a team sport.  We assume the students will work in teams to do their projects.  We also assume there is a place students can go to get help with the R programming language.
==Note to Contributors==
First, please register yourself with Wikibooks (and list yourself below), so that we know who our co-contributors are.  Thank you.
Secondly, we only need basic, clear, straightforward information in each chapter.  We are not trying to be exhaustive or complete--the value of this book is in the simple synthesis across subjects.  There are other venues in which to wax eloquent on the depth and complexities of a particular subject.  Please place yourself in a "beginner's mind" as you make contributions.  Please also scope each chapter so that it can be taught in a one-hour class period.  If the chapter requires more than an hour to teach, it is probably too detailed.
*To the extent possible, please use terms and concepts as they are defined in Wikipedia and Wiktionary.  This way students can refer to the corresponding Wikipedia or Wiktionary page to get a deeper understanding of the concept.
Thirdly, this is a cross-disciplinary book.  We want to help people apply data science to all fields.  Therefore, we need a wide variety of simple examples and simple exercises.
Fourthly, please adhere to the simple structure of each chapter: Summary of Main Points, Discussion, More Reading, Exercises, and References.  We want the More Reading section to link to on-line resources.  The References section may contain off-line resources. 
Fifthly, as with any Wikibook, please feel free to make corrections, expand explanations, and make additions where necessary, even if it is not "your" chapter.  Use the discussion page to explain changes that might be controversial.
Sixthly, some syntax rules:
* Wrap function names and code snippets in 'code' tags: <code><nowiki><code>lm()</code></nowiki></code>
* Use in-line links <code><nowiki> [[ ]]</nowiki></code> to Wikipedia, Wiktionary, WikiCommons, Wikibooks, and other Wikimedia Foundation properties
* Use references <code><nowiki><ref> </ref></nowiki></code> for "external" sources--both on-line and off-line.
** Use the citation templates to make citations: [[Template:Cite book]], [[Template:Cite web]], [[Template:Cite journal]]
* If you want to add an image or graph, you should load it into the [[Commons:Special:UploadWizard|Commons]] and add the tag <code><nowiki>{{Created with R}}</nowiki></code>.
* If you use a function from a package outside the standard '''R''' packages, put the name of the package in parentheses after each function: <nowiki><code>MCMCprobit()</code> ('''MCMCpack''')</nowiki>
* Put the name of non-standard '''R''' packages in bold: <code><nowiki>'''MCMCpack'''</nowiki></code>
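As a rough sketch of how these rules combine, a contributor might write something like the wikitext below.  The sentences are hypothetical and serve only to illustrate the markup; the <code><nowiki>w:</nowiki></code> link prefix assumes the page is hosted on Wikibooks, and the citation reuses an entry from the References section of this page.

<pre>
A simple linear model can be fit with <code>lm()</code>, while a Bayesian probit model can be fit with <code>MCMCprobit()</code> ('''MCMCpack''').  See [[w:Linear regression|linear regression]] for background.

Data science draws on several parent disciplines.<ref>{{Cite web |url=http://cdn.oreilly.com/radar/2010/06/What_is_Data_Science.pdf |title=What is Data Science? |publisher=O'Reilly |year=2010}}</ref>
</pre>

Here the function from a standard package (<code>lm()</code>) carries no package name, the function from a non-standard package is followed by its package in bold, the Wikipedia link is in-line, and the external source goes through a reference with a citation template.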
==List of Contributors==
== See Also ==
See the following Wikibooks for good companion texts to this introduction:
* The Scientific Method - [http://en.wikibooks.org/wiki/Scientific_Method Scientific Method]
* Data Engineering - [http://en.wikibooks.org/wiki/Relational_Database_Design Relational Database Design], [http://en.wikibooks.org/wiki/Data_Structures Data Structures], [http://en.wikibooks.org/wiki/SQL SQL]
* Software Engineering - [http://en.wikibooks.org/wiki/The_Science_of_Programming The Science of Programming], [http://en.wikibooks.org/wiki/R_Programming R Programming]
* Statistical Analysis - [http://en.wikibooks.org/wiki/Statistics Statistics], [http://en.wikibooks.org/wiki/Statistical_Analysis:_an_Introduction_using_R Statistical Analysis: an Introduction using R], [http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R Data Mining Algorithms in R]
*Visualization -
*Hacking -
== References ==
* [http://www.emc.com/collateral/about/news/emc-data-science-study-wp.pdf (2012) EMC Data Science Study]
* [http://shop.oreilly.com/product/0636920022770.do (2011) O'Reilly's Building Data Science Teams]
* [http://cdn.oreilly.com/radar/2010/06/What_is_Data_Science.pdf (2010) O'Reilly's What is Data Science?]
==Copyright Notice==
While this book is in draft on my wiki, it is licensed under the [http://creativecommons.org/ Creative Commons] Attribution-NonCommercial-ShareAlike 3.0 license:
You are free:
* to '''Share''' — to copy, distribute, display, and perform the work (pages from this wiki)
* to '''Remix''' — to adapt or make derivative works
Under the following conditions:
* '''Attribution''' — You must attribute this work to me by name (Calvin Andrus), by page title, by source (wiki.acadac.net), by date, and by version number (if available).  You may not suggest that I, in any way, endorse you or your use of this work.
* '''Noncommercial''' — You may not use this work for commercial purposes.
* '''Share Alike''' — If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one.
* '''Waiver''' — Any of the above conditions can be waived if you get permission from the copyright holder.
* '''Public Domain''' — Where the work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
* '''Other Rights''' — In no way are any of the following rights affected by the license:
:* Your fair dealing or fair use rights, or other applicable copyright exceptions and limitations;
:* The author's moral rights;
:* Rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.
* '''Notice''' — For any reuse or distribution, you must make clear to others the license terms of this work.  The best way to do this is with a link to the following web page.

Latest revision as of 18:52, 11 August 2012
