CSI Communications | July 2014 | 34 www.csi-india.org
Learning Data Science - A Do-It-Yourself Approach
Dr. Prithwis Muker...

Learning Data Science - A Do-It-Yourself approach



Article published in the July 2014 issue of CSI Communications

Published in: Data & Analytics

Dr. Prithwis Mukerjee
Program Director, Business Analytics, Praxis Business School, Calcutta

Ever since the venerable Harvard Business Review declared that the Data Scientist was the sexiest job of the 21st century, everybody wants to become one. Unfortunately, so many tools and techniques claim to be a part of data science that no one really seems to know where to start or how to go about acquiring the right skills. Fortunately, there is a vast body of knowledge on this topic available on the web, and this article points out a concise set of resources, the URLs of which are given in the list of references, that will help the reader get going with a Do-It-Yourself approach to learning data science.

The journey begins with "An Introduction to Data Science"[1] and then traverses a technical landscape that includes tools like R, Python, SQL and Hadoop plus techniques like statistics, data mining, machine learning, visualisation and Map Reduce.

R : Statistical Programming Language

Investing time and effort in mastering R[2] is always better than using proprietary tools like SAS and SPSS because it is not only a free and excellent tool but also the most widely used tool in data science. The "Level 0 tutorial for getting started with R"[3] by April Galyardt is, as its name suggests, a good place to start. This tutorial shows how to install R, but one should also consider installing RStudio[4], a free GUI development environment that makes R easier to use. Galyardt's tutorial is very comprehensive, but a reader of the impatient type, for whom doing anything in depth is too tiring and who nevertheless wants to be familiar with this cutting-edge technology, may go through the 5-part "Beginners Guide to R"[5] from Computerworld.
Both of these tutorials require one to install R, which is not too difficult, but if even that is too much of a bother, one can take a look at Data Camp[6] or Try R at the O'Reilly Code School[7]. Either will give a flavour of what it is that people do with R.

Statistics

The ancient science of statistics has suddenly got a new lease of life and celebrity status because of the current surge of interest in data science. If for a moment we leave aside Map Reduce and the associated world of Hadoop, then [statistics + statistical tools] is virtually synonymous with data science. Hence it is essential to get a good grip on statistics. Since the subject is taught in many undergraduate and postgraduate courses, there are many excellent books on statistics, but today's data scientist must be proficient not only in statistics but also in a tool like R.

An excellent place to start learning statistics along with R is the open source, copyright-free OpenIntro website[8]. It allows one to download not only an excellent book in PDF format but also a set of superb lab exercises based on R, where one can try out what one has just learnt. Other fairly comprehensive tutorials on statistics with R are available from Kelly Black of Clarkson University[9] and from King of Coastal Carolina University[10]. However, those who are seriously interested in a career in data science should purchase a regular textbook on statistics for reference purposes, and for this one can look at either Levin's Statistics for Management[11] or Black's Applied Business Statistics[12].

Traditional statistics is generally restricted to descriptive and interpretive statistics, which are well covered in these links. However, data science is more interested in predictive statistics, which is the domain of areas like Data Mining and Machine Learning that are addressed later.
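The difference between descriptive and predictive statistics can be illustrated with a minimal sketch. It is written in Python, one of the tools covered in this article, and uses only the standard library. The spend and sales figures and the function names `describe`, `fit_line` and `predict` are made up for illustration and do not come from any of the referenced tutorials.

```python
import statistics

# Hypothetical data, made up for illustration:
# advertising spend (x) and the sales observed for it (y).
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]


def describe(values):
    """Descriptive statistics: summarise data that has already been observed."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predictive statistics: estimate y for an x that has not been observed."""
    return intercept + slope * x


summary = describe(sales)
slope, intercept = fit_line(spend, sales)
forecast = predict(6.0, slope, intercept)

print(summary)
print(f"forecast of sales at spend 6.0: {forecast:.2f}")
```

The first function only summarises what is already in the data, while the fitted line is used to say something about a spend level that was never observed. That second step is the one that data mining and machine learning techniques generalise.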
For another quick introduction, one may also check out the book Introduction to Data Science[13], which is available for download as a PDF file[14].

Python with Anaconda

Python is another very strong challenger for the position of the best, or most powerful, tool for data science and it makes sense to get the hang of it. This is because it not only supports much of the functionality that R provides, through the Pandas library, but is also quite compatible with Hadoop and the world of Map Reduce.

A quick way to get the hang of Python without installing it is to try out the Python tutorial at After Hours Programming[15] or somewhere similar. But a better way to get going is to download Anaconda[16], which not only installs Python but also provides a rich GUI to execute Python programs in addition to the standard command line approach. Once Python (with or without Anaconda) is installed, one can try out this workshop from Open Tech School[17] free of cost. Or one can jump straight into this very comprehensive tutorial on using Python for data science[18], which actually uses data from a Kaggle competition.

Visualisation

Visualisation is another key area of data science because a business wants data scientists to tell a good story with the data. The Vizualyse blog[19] is a nice starting point for all those who want to get a quick hang of the subject and to get started with free tools like Google Charts. Tableau is an excellent and widely used visualisation tool with a public version available for free download[20], which one can try out with the free tutorials listed in this blog post[21]. Other excellent tools are Google Fusion Tables[22] and Chartbuilder, which can be tried out online[23] or explored in this tutorial[24].
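The two ideas above, summarising tabular data in Python and then presenting the result as a picture, can be sketched in a few lines without installing anything. This is only a plain-Python illustration under stated assumptions: the records are made up, the helper names `total_by_group` and `text_bar_chart` are hypothetical, and in practice the grouping would normally be done with Pandas and the chart with one of the tools named above.

```python
from collections import defaultdict

# Hypothetical records of (region, sales); the figures are made up.
records = [
    ("East", 120), ("West", 80), ("East", 60),
    ("North", 40), ("West", 100), ("South", 90),
]


def total_by_group(rows):
    """Sum the values for each group key, similar in spirit to a group-by."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


def text_bar_chart(totals, width=40):
    """Return one line per group, largest first, scaled to `width` characters."""
    largest = max(totals.values())
    lines = []
    for key, value in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        bar = "#" * round(width * value / largest)
        lines.append(f"{key:<6} {bar} {value}")
    return lines


totals = total_by_group(records)
chart = text_bar_chart(totals)
print("\n".join(chart))
```

Even this crude chart makes the ranking of the regions visible at a glance, which is the point of visualisation: the reader sees the story before reading the numbers.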
Data Mining / Machine Learning

As mentioned earlier, traditional statistics is generally limited to descriptive and interpretive work, but the world wants predictions, and predictive statistics is where we get into data mining and machine learning. Data mining and machine learning form a complex subject and one needs a good grounding in the algorithms that are used for Classification[25], Clustering[26], Association Rules[27], Text Analytics and other complex tasks. A quick overview of all these techniques is available in this "Overview of Data Mining Techniques"[28].

It is difficult to take a shortcut through this very complex subject, but once one has a basic idea of these concepts, one can try out the examples and exercises given in this book on "R-DataMining"[29]. Rattle is an excellent add-on to R that provides a GUI for data mining. One can download Rattle[30] and then try it out with this short but descriptive tutorial[31] and this data[32].

On the other hand, if one is more of a programmer and less of a statistician, it may be preferable to use Python for data mining. A good way to get started is to download and read "A Programmers Guide to Data Mining"[33], which not only introduces the subject but also gives loads of ready-made Python code for one to try out.

Hadoop and Map Reduce

In the world of data science, when boys become men (or girls become women, to be politically correct) and data becomes Big Data, that is when Hadoop arrives with Map Reduce. Hadoop is a piece of software, a Java framework, and Map Reduce is a technique, an algorithm, that was developed to support "internet-scale" data processing requirements at Yahoo, Google and other internet giants. The primary requirement was to sort through and count vast amounts of data.

People are put off by Hadoop because they assume that it would need a cluster or gigantic servers to even start doing anything with this technology, but this elegant tutorial that helps "Demystify Hadoop and Map Reduce"[34] shows how Map Reduce programming can be learnt by installing the mighty Hadoop in single-node cluster mode on even a simple laptop running Ubuntu. In fact even Pig and Hive, the scripting language and the SQL engine that sit on top of Hadoop and shield users from the rigour of Map Reduce, are explained very lucidly in this set of tutorials[35].

Evergreen SQL

Last but not the least, relational databases and SQL are an absolute must for anyone who is interested in data science, but this technology is already so well known and so widely used in the community that there is little point in explaining it here once again.

Anyone going through all these links is well on the way to becoming a data scientist. However, technology changes and links die out. To keep abreast of what is happening in this area, it would be a good idea to subscribe to the Twitter feeds of KDNuggets[36], DataScience Central[37] and other websites that are relevant to data science.

Unlike law or medicine, where the value of a professional is proportional to the time that he or she has spent in the profession, information technology does not respect age or grey hair! To fight obsolescence and remain relevant in the profession, one must necessarily keep abreast of "The Next Big Thing." If going back to college for a full-time course on Data Science is not a viable option, then the Do-It-Yourself approach outlined in this article should be the next best alternative.

References

All URLs in this list were checked on 23 Jun 2014 and found to be operational.

[1] http://j.mp/dsdiy01
[2] http://j.mp/dsdiy02
[3] http://j.mp/dsdiy03
[4] http://j.mp/dsdiy04
[5] http://j.mp/dsdiy05
[6] http://j.mp/dsdiy06
[7] http://j.mp/dsdiy07
[8] http://j.mp/dsdiy08
[9] http://j.mp/dsdiy09
[10] http://j.mp/dsdiy10
[11] http://j.mp/dsdiy11
[12] http://j.mp/dsdiy12
[13] http://j.mp/dsdiy13
[14] http://j.mp/dsdiy14
[15] http://j.mp/dsdiy15
[16] http://j.mp/dsdiy16
[17] http://j.mp/dsdiy17
[18] http://j.mp/dsdiy18
[19] http://j.mp/dsdiy19
[20] http://j.mp/dsdiy20
[21] http://j.mp/dsdiy21
[22] http://j.mp/dsdiy22
[23] http://j.mp/dsdiy23
[24] http://j.mp/dsdiy24
[25] http://j.mp/dsdiy25
[26] http://j.mp/dsdiy26
[27] http://j.mp/dsdiy27
[28] http://j.mp/dsdiy28
[29] http://j.mp/dsdiy29
[30] http://j.mp/dsdiy30
[31] http://j.mp/dsdiy31
[32] http://j.mp/dsdiy32
[33] http://j.mp/dsdiy33
[34] http://j.mp/dsdiy34
[35] http://j.mp/dsdiy35
[36] http://j.mp/dsdiy36
[37] http://j.mp/dsdiy37

About the Author

Dr Prithwis Mukerjee is an engineer from IIT Kharagpur and did his PhD from the University of Texas at Dallas, USA. After nearly two decades in Tata Steel, PricewaterhouseCoopers and IBM, he joined academics, was a tenured Professor at IIT, Kharagpur and is now the Director of the Business Analytics program at Praxis Business School, Calcutta. He believes that "when Resources are Limited, Creativity is Unlimited" [https://www.linkedin.com/in/prithwis]