Introduction to data science

Published in: Technology, Education

  1. Dr. Bill Howe - Director of Research, Scalable Data Analytics
  2. What is data science? ◦ A set of theories and principles for performing data-related tasks, such as ◦ Data collection ◦ Data cleaning ◦ Data integration ◦ Data modeling ◦ Data visualization
  3. Data science is different from ◦ Business intelligence ◦ Statistics ◦ Database management ◦ Visualization ◦ Machine learning
  4. DBA: unstructured data; Statistician: data that doesn't fit into memory; Software engineer: statistical models and how to communicate results; Business analyst: algorithms and tradeoffs at scale
  5. Three common skills of a data scientist ◦ Statistics: traditional analysis ◦ Data munging: parsing, scraping, and formatting data ◦ Visualization: graphs, tools, etc.
  6. Three types of tasks: ◦ Preparing to run a model ◦ Running the model ◦ Communicating the results
  7. ◦ Preparing to run a model: gathering, cleaning, integrating, restructuring, transforming, loading, filtering
  8. ◦ Running the model: choosing appropriate machine learning algorithms for regression, classification, clustering, and recommendation; validating the model; improving the model ◦ Communicating the results
  9. Breadth: MapReduce / relational algebra / logistic regression / visualization; Depth: structure (relational algebra) / statistics (linear algebra); Scale: desktop (R) / cloud (Hadoop); Target: hackers (R, Java, Python) / analysts (little or no programming)
  10. Scale: cloud for big data. Big data can be measured by the three V's ◦ Volume: number of rows (size) ◦ Variety: number of columns or sources (text, images, audio, video) ◦ Velocity: number of rows or bytes per unit time (processing time)
  11. “Data exhaust” from customers; new and pervasive sensors; the ability to “keep everything”
  12. Prior programming experience (SQL, Python); basic statistics; basic database concepts
  13. Twitter sentiment analysis ◦ Extract the tweets from the Twitter API ◦ Calculate the sentiment score for tweets ◦ Calculate the sentiment score for terms in tweets ◦ Calculate the frequency of terms in tweets ◦ Identify the happiest state ◦ Identify the top ten hashtags
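
The Twitter sentiment analysis steps on the last slide can be sketched roughly as follows. This is a minimal illustration and not the course's reference solution. It assumes the tweets have already been fetched and are plain strings, and it uses a tiny made-up lexicon (`SENTIMENT`) in place of a real sentiment word list. All function names are hypothetical, and treating the "happiest state" as the one with the highest average tweet score is an assumption. Deriving scores for terms that are missing from the lexicon is not covered here.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy lexicon; a real analysis would load a full sentiment word list.
SENTIMENT = {"good": 2, "great": 3, "happy": 3, "bad": -2, "sad": -2, "awful": -3}


def tokenize(text):
    """Lowercase a tweet and split it into word and hashtag tokens."""
    return re.findall(r"#?\w+", text.lower())


def tweet_score(text, lexicon=SENTIMENT):
    """Sum the lexicon scores of a tweet's terms; unknown terms count as 0."""
    return sum(lexicon.get(tok, 0) for tok in tokenize(text))


def term_frequencies(tweets):
    """Return each term's share of all term occurrences across the tweets."""
    counts = Counter(tok for text in tweets for tok in tokenize(text))
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def top_hashtags(tweets, n=10):
    """Return the n most frequent hashtags as (hashtag, count) pairs."""
    counts = Counter(
        tok for text in tweets for tok in tokenize(text) if tok.startswith("#")
    )
    return counts.most_common(n)


def happiest_state(tweets_with_state, lexicon=SENTIMENT):
    """Return the state with the highest average tweet score (assumed definition)."""
    scores = defaultdict(list)
    for state, text in tweets_with_state:
        scores[state].append(tweet_score(text, lexicon))
    return max(scores, key=lambda s: sum(scores[s]) / len(scores[s]))


# Made-up sample data: (state, tweet text) pairs.
SAMPLE = [
    ("WA", "great day #sunny"),
    ("WA", "happy #sunny #coffee"),
    ("NY", "awful traffic #commute"),
]

if __name__ == "__main__":
    texts = [text for _, text in SAMPLE]
    print([tweet_score(t) for t in texts])
    print(top_hashtags(texts))
    print(happiest_state(SAMPLE))
```

Keeping each step as a separate small function mirrors the slide's task list, so each piece can be swapped out independently, for example replacing the toy lexicon with a published one.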