Introduction
to Big Data
Chapter 1 & 2 (Week 1)
Course overview & introduction
DCCS208(02) Korea University 2019 Fall
Asst. Prof. Minseok Seo
mins@korea.ac.kr
01
Course Overview
Introduction to Big Data
Contents
1. Course Overview
• Brief introduction of professor & course
• Object & Aim of the course
• Assignments & Quiz
• Evaluation
2. Introduction to Big Data
• Definition of Big Data
• Key techniques in Data Science
• Core technology of Informatics
copyrightⓒ 2018 All rights reserved by Korea University
Course Overview
Course information
Introduction to Big Data, DCCS208(02), Fall 2019.
• Lecture time: Wed. (6,7) and Thu. (6)
• Location: Wed. (7-310) and Thu. (7-315)
• Course type: Major elective
• Level: Junior / Senior
Definition of Big Data
Which is bigger, an elephant or a rat?
Definition of Big Data (Cont.)
• What is Data?
ID          Height   Weight   Age
Student 1   189 cm   81 kg    24
Student 2   210 cm   90 kg    26
Student 3   191 cm   92 kg    27
…           …        …        …
Student N   162 cm   71 kg    21
Rows: Objects (Samples, Individuals)
Columns: Attributes (Dimensions, Features, Variables)
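The table above can be sketched in code to make the two terms concrete. This is only an illustration in Python (one of the two languages named in these slides); the field names such as `height_cm` are assumptions, and "Student N" stands in for the last row of the real data.

```python
# Rows shown on the slide; "Student N" is a placeholder for the last of N students.
students = [
    {"id": "Student 1", "height_cm": 189, "weight_kg": 81, "age": 24},
    {"id": "Student 2", "height_cm": 210, "weight_kg": 90, "age": 26},
    {"id": "Student 3", "height_cm": 191, "weight_kg": 92, "age": 27},
    {"id": "Student N", "height_cm": 162, "weight_kg": 71, "age": 21},
]

# Attributes are the measured columns; the ID only labels the object.
attributes = [key for key in students[0] if key != "id"]

n_objects = len(students)        # number of rows (objects, samples)
n_attributes = len(attributes)   # number of columns (attributes, features)

print(f"{n_objects} objects x {n_attributes} attributes: {attributes}")
```

Each dictionary is one object (a row), and each key other than `id` is one attribute (a column).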
Definition of Big Data (Cont.)
• In a narrow sense, Big Data refers only to a large sample size.
• In a broad sense, Big Data refers to both a large sample size and high dimensionality.
Definition of Big Data (Cont.)
• 3V’s (Volume, Velocity, and Variety)
Definition of Big Data (Cont.)
• 5V’s (Volume, Velocity, Variety, Veracity, and Value)
• Volume: data size
• Velocity: speed of data production
• Variety: data originating from various sources
• Veracity: data accuracy (trustworthiness)
• Value: the value of the data
Relationship between Big Data & Data Science
• The amount of data and information is not directly correlated with knowledge generation.
• But the demand for data scientists will keep growing.
Job market of Big data
Furht B., Villanustre F. (2016) Introduction to Big Data. In: Big Data Technologies and Applications. Springer, Cham
It is time to prepare an academic course that trains data analysts in line with demand.
Object & Aim of the course
• Students who take this course can expect to learn:
Introduction to Big Data
• Concept of Big Data
• Computational approaches for Big Data
• Statistical approaches for Big Data
• Visualization for Big Data
• R programming
• Basic skills in Data Science
Course schedule (Before Mid-term exam)
Week  Period         Study Contents
1     09.02 - 09.08  Introduction to Big Data & Data Science
2     09.09 - 09.15  Overall workflow, Computer Software issues, and applications in the Big Data era
3     09.16 - 09.22  Introduction to R programming
4     09.23 - 09.29  Descriptive & Fundamental Statistics
5     09.30 - 10.06  Understanding Data Structures (Types of random variable)
6     10.07 - 10.13  Data Visualization
7     10.14 - 10.20  Preprocessing of Big Data (Quality Control and Prescreening)
8     10.21 - 10.27  Mid-term Exam
Course schedule (After Mid-term exam)
Week  Period         Study Contents
9     10.28 - 11.03  Parallel and Distributed Processing for Big Data
10    11.04 - 11.10  Statistical Estimation & Modeling
11    11.11 - 11.17  Computational approach for statistical modeling with robustness
12    11.18 - 11.24  Clustering analysis (Unsupervised learning methods)
13    11.25 - 12.01  Classification analysis (Supervised learning methods)
14    12.02 - 12.08  Algorithms of Dimensionality Reduction for Big Data
15    12.09 - 12.15  Trends in various academic & industrial fields for application of Big Data
16    12.16 - 12.22  Final Exam
Two types of lectures per week
• There are two representative computer languages for Big Data analysis: R and Python.
• R will be used in this class.
• No prior knowledge of R is required, because I plan to provide example code for students to practice with.
https://cran.r-project.org/
Wed. (2 hrs): theory lecture
Thu. (1 hr): hands-on lecture
The methods learned in the theory class will be practiced in the computer lab on Thursday.
Exam, Quiz, and Homework
Quiz
• There will be two simple in-class quizzes to check students' learning progress (one before and one after the midterm).
Homework
• There will be four assignments.
• Each will be a report on the theory and practice of data analysis learned in class.
Midterm and Final exams
• There will be two exams.
• They will test your understanding of the basic computational and statistical algorithms.
Evaluation plan
• Absolute grading system
Score ≥ 95: A+
Score ≥ 90: A
Score ≥ 85: B+
and...
• Weights: Midterm 30%, Final 30%, Quiz 10%, Assignment 20%, Attendance 10%
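As a rough sketch of how the weights and cutoffs combine, the Python snippet below computes a weighted total and looks up the letter grade. It assumes the percentages and labels on the slide are listed in the same order, and that each component is scored out of 100. The function names and the example student are hypothetical, and grades below B+ are left unspecified because the slide elides them.

```python
# Weights from the slide, assuming percentages and labels appear in the same order.
WEIGHTS = {
    "midterm": 0.30,
    "final": 0.30,
    "quiz": 0.10,
    "assignment": 0.20,
    "attendance": 0.10,
}

# Only the cutoffs the slide states; lower grades are elided there ("and...").
CUTOFFS = [(95, "A+"), (90, "A"), (85, "B+")]


def total_score(scores):
    """Weighted sum of component scores, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(total):
    """Return the grade for a total, or None if the slide does not specify it."""
    for cutoff, grade in CUTOFFS:
        if total >= cutoff:
            return grade
    return None  # below 85: not specified on the slide


# Hypothetical student, for illustration only.
example = {"midterm": 92, "final": 88, "quiz": 100, "assignment": 95, "attendance": 100}
total = total_score(example)
print(round(total, 1), letter_grade(total))
```

Because grading is absolute, the result depends only on the student's own total, not on how classmates score.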
Textbook
• No textbook
• This course will proceed based on the presentation slides.
• I will upload the slides to Blackboard and my homepage.
Homepage: https://scholar.harvard.edu/msseo
Teaching >> Introduction to Big Data >> Related Materials
• Reference 1 (Kor. Version)
R for Practical Data Analysis
(free online textbook)
http://r4pda.co.kr/pdf/r4pda_2014_03_02.pdf
• Reference 2 (Eng. Version)
Introduction to Data Science by Rafael A. Irizarry, 2019.
(free online textbook)
https://rafalab.github.io/dsbook/
• Reference 3 (Eng. Version)
R for Data Science by Hadley Wickham and Garrett Grolemund.
(free online textbook)
https://r4ds.had.co.nz/
Contact information
• Prof. Minseok Seo
Location: 7-203
Tel: 044-860-1379
Email: mins@korea.ac.kr
• TA Heechan Chae
Location: 7-328
Email: chay219@korea.ac.kr
• If you have any questions about the course, please email me and I will reply as soon as I see it.
• If you need to meet in person, please make an appointment by email first.
• Office hours: Mon 12:00 - 17:00 | Wed 10:00 - 13:00 | Thu 10:00 - 13:00.
End of
Orientation
Contents
1. Course Overview
• Brief introduction of professor & course
• Object & Aim of the course
• Assignments & Quiz
• Evaluation
2. Introduction to Big Data
• Concept of Big Data
• Key techniques in Data Science for Big Data
Characteristics of Big Data
Recap: the concept of Big Data
• 5V’s (Volume, Velocity, Variety, Veracity, and Value)
• Volume: data size
• Velocity: speed of data production
• Variety: data originating from various sources
• Veracity: data accuracy (trustworthiness)
• Value: the value of the data
Petabyte era
• [company logo] transferred about 197 PB of data through its network each day (2018)
• [company logo] processed about 24 petabytes daily (2009)
1 PB = 1,000,000,000,000,000 B = 10^15 bytes = 1,000 terabytes (TB)
1,000 PB = 1 exabyte (EB)
In fact, we can say that we have already entered the exabyte era.
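The unit arithmetic can be checked with a short Python sketch. The helper `convert` and the constant names are illustrative and not part of the slides. Decimal (SI) units are assumed, matching 1 PB = 10^15 bytes, and the yearly figure assumes the 197 PB per day rate held for a full 365 days.

```python
# Decimal (SI) units, as on the slide: 1 PB = 10**15 bytes, 1 EB = 1000 PB.
BYTES_PER_TB = 10**12
BYTES_PER_PB = 10**15
BYTES_PER_EB = 10**18


def convert(value, from_bytes, to_bytes):
    """Convert a quantity between two units, given each unit's size in bytes."""
    return value * from_bytes / to_bytes


pb_in_tb = convert(1, BYTES_PER_PB, BYTES_PER_TB)

# Assumption: the 197 PB/day figure is sustained over a 365-day year.
daily_pb = 197
yearly_eb = convert(daily_pb * 365, BYTES_PER_PB, BYTES_PER_EB)

print(f"1 PB = {pb_in_tb:.0f} TB")
print(f"{daily_pb} PB/day is about {yearly_eb:.1f} EB/year")
```

Under that assumption a single network moves tens of exabytes per year, which is consistent with the slide's remark about the exabyte era.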
How do you recognize if it's big data or not?
Computer Scientist
"My computer is low on memory for handling this data!!" That is Big Data.
"No!!!! This data is over 2 TB. Where do I store it?????" That is Big Data.
In short, if you're having trouble with data processing on your computer (to the point of panic), it is probably because of Big Data.
How do you recognize if it's big data or not?
Statistician
"When will this calculation end? I have already been waiting for 10 years..."
"Dimensionality is too high!!!! I can't build a statistical model using this data!!!"
That is Big Data.
In short, if you're having trouble with data analysis on your computer (to the point of panic), it is probably because of Big Data.
Core technologies of Big Data era
IT technologies to resolve issues arising from Big Data
Difficulties arise in both hardware and software.
Software: Prescreening techniques, Data Visualization, Feature selection
Hardware: Parallel processing, Cloud computing, Distributed processing
But students can tackle the software difficulties.
Computational language for Big Data
R and Python
• There are two representative computer languages for Big Data analysis: R and Python.
• The hands-on lectures use the R programming language (free and relatively easy).
• Let's visit the R homepage:
https://cran.r-project.org/
Wed. (2 hrs): theory lecture
Thu. (1 hr): hands-on lecture
Install R
(Step 1) Download the R installer
(Step 2) Download RStudio
• Download RStudio from https://www.rstudio.com/products/rstudio/download/
(Step 3) Install R and RStudio
What is R
• R is an interpreted computer language.
• For efficiency, it can interface with procedures written in C, C++, and other languages.
• System commands can be called from within R.
• R is used for data manipulation, statistics, and graphics.
R, S, and S-plus (History of R)
• S: an interactive environment for data analysis, developed at Bell Laboratories since 1976
1988 - S2: RA Becker, JM Chambers, A Wilks
1992 - S3: JM Chambers, TJ Hastie
1998 - S4: JM Chambers
• Exclusively licensed by AT&T/Lucent to Insightful Corporation, Seattle, WA. Product name: “S-plus”. Implementation languages: C, Fortran.
• R: initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics, University of Auckland, New Zealand, during the 1990s.
• Since 1997: an international “R-core” team of ca. 15 people with access to a common CVS archive.
What R does and does not do
• Possible
(1) Data handling and storage: numeric, textual
(2) Matrix algebra
(3) Hash tables and regular expressions
(4) High-level data analytic and statistical functions
(5) OOP (classes)
(6) Graphics
(7) Programming language: loops, branching, subroutines, etc.
• Impossible
(1) R is not a database, but it connects to DBMSs
(2) R has no GUI, but it connects to Java and Tcl/Tk
(3) R is fundamentally very slow, but it can call your own C/C++ code
(4) R has no spreadsheet view of data, but it connects to Excel/MS Office
(5) R has no professional or commercial support
• But all R users in the world are developers (the power of collective intelligence).
• If you make a meaningful package, you can publish it at any time, almost immediately.
• Therefore, the latest algorithms can be applied in R sooner than in any other programming language.
End of Slide
