Introduction
to Big Data
Chapter 1 & 2 (Week 1)
Course overview & introduction
DCCS208(02) Korea University 2019 Fall
Asst. Prof. Minseok Seo
mins@korea.ac.kr
01
Course Overview
Introduction to Big Data
Contents
1. Course Overview
• Brief introduction of professor & course
• Object & Aim of the course
• Assignments & Quiz
• Evaluation
2. Introduction to Big Data
• Definition of Big Data
• Key techniques in Data Science
• Core technology of Informatics
copyrightⓒ 2018 All rights reserved by Korea University
Course information
Introduction to Big Data, DCCS208(02), Fall 2019.
• Lecture time: Wed. (6,7) and Thu. (6)
• Location: Wed. (7-310) and Thu. (7-315)
• Course classification: Major elective
• Level: Junior / Senior
Definition of Big Data (Cont.)
Which is bigger, an elephant or a rat?
Definition of Big Data (Cont.)
• What is Data?

ID         Height  Weight  Age
Student 1  189 cm  81 kg   24
Student 2  210 cm  90 kg   26
Student 3  191 cm  92 kg   27
…          …       …       …
Student N  162 cm  71 kg   21

Rows: Objects (Samples, Individuals)
Columns: Attributes (Dimensions, Features, Variables)
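The table above can be sketched as a small data structure in which each row is an object (sample) and each column is an attribute (feature). The Python snippet below is only an illustration: the course itself uses R, the key names are assumptions, and the elided "…" rows of the table are left out.

```python
# Illustrative rows copied from the slide's table (the elided rows are omitted).
students = [
    {"id": "Student 1", "height_cm": 189, "weight_kg": 81, "age": 24},
    {"id": "Student 2", "height_cm": 210, "weight_kg": 90, "age": 26},
    {"id": "Student 3", "height_cm": 191, "weight_kg": 92, "age": 27},
]

# Objects (samples) are the rows; attributes (features) are the columns.
n_samples = len(students)
attributes = [key for key in students[0] if key != "id"]
n_attributes = len(attributes)

print(f"{n_samples} samples x {n_attributes} attributes: {attributes}")
```

Sample size grows with the number of rows and dimensionality with the number of columns.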
Definition of Big Data (Cont.)
• In a narrow sense, Big Data refers only to a large sample size.
• In a broad sense, Big Data refers to both a large sample size and high dimensionality.
Definition of Big Data (Cont.)
• 3 V's (Volume, Velocity, and Variety)
Definition of Big Data (Cont.)
• 5 V's (Volume, Velocity, Variety, Veracity, and Value)
  • Volume: Data size
  • Velocity: Data production speed
  • Variety: Data originating from various sources
  • Veracity: Data accuracy (trustworthiness)
  • Value: Data value
Relationship between Big Data & Data Science
• The amount of data and information does not translate directly into knowledge.
• But the demand for data scientists will keep growing.
Job market of Big data
Furht B., Villanustre F. (2016) Introduction to Big Data. In: Big Data Technologies and Applications. Springer, Cham
It is time to prepare an academic course that cultivates data analysts in step with demand.
Object & Aim of the course
• Students who take this course can expect to learn:
Introduction to Big Data
  • Concept of Big Data
  • Computational approaches for Big Data
  • Statistical approaches for Big Data
  • Visualization for Big Data
  • R programming
  • Basic Skill in Data Science
Course schedule (Before Mid-term exam)
Week  Period         Study Contents
1     09.02 - 09.08  Introduction to Big Data & Data Science
2     09.09 - 09.15  Overall workflow, Computer Software issues, and applications in the Big Data era
3     09.16 - 09.22  Introduction to R programming
4     09.23 - 09.29  Descriptive & Fundamental Statistics
5     09.30 - 10.06  Understanding Data Structures (Types of random variable)
6     10.07 - 10.13  Data Visualization
7     10.14 - 10.20  Preprocessing of Big Data (Quality Control and Prescreening)
8     10.21 - 10.27  Mid-term Exam
Course schedule (After Mid-term exam)
Week  Period         Study Contents
9     10.28 - 11.03  Parallel and Distributed Processing for Big Data
10    11.04 - 11.10  Statistical Estimation & Modeling
11    11.11 - 11.17  Computational approach for statistical modeling with robustness
12    11.18 - 11.24  Clustering analysis (Unsupervised learning methods)
13    11.25 - 12.01  Classification analysis (Supervised learning methods)
14    12.02 - 12.08  Algorithms of Dimensionality Reduction for Big Data
15    12.09 - 12.15  Trends in various academic & industrial fields for application of Big Data
16    12.16 - 12.22  Final Exam
Two types of lectures per week
• There are two representative computer languages for Big Data analysis, R and Python.
• R will be used in this class.
• No prior knowledge of the R language is required, because I plan to provide example code for students to practice with.
https://cran.r-project.org/
Wed. (2 hrs): Lecture for Theory
Thu. (1 hr): Hands-on lecture
The methodology learned in the theory class will be practiced in the computer lab on Thursday.
Exam, Quiz, and Homework
Quiz
• There will be two simple in-class quizzes to check students' learning progress (one before and one after the midterm).
Homework
• There will be four assignments.
• Each will be a report on the theory and practice of data analysis learned in class.
Midterm and Final exams
• There will be two exams.
• I will test your understanding of the basic computational/statistical algorithms.
Evaluation plan
• Absolute grading system
  Score ≥ 95, you will get A+
  Score ≥ 90, you will get A
  Score ≥ 85, you will get B+
  and...

Midterm  Final  Quiz  Assignment  Attendance
30%      30%    10%   20%         10%
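As a worked example of the evaluation plan, the Python sketch below combines component scores with the weights above and maps the total to a letter grade. It assumes every component is scored out of 100 and that the percentages pair with the components in the order listed. The function names and example scores are hypothetical, and grades below B+ are left unspecified because the slide elides them.

```python
# Weights as listed on the slide (assumed pairing: label order = percentage order).
WEIGHTS = {
    "midterm": 0.30,
    "final": 0.30,
    "quiz": 0.10,
    "assignment": 0.20,
    "attendance": 0.10,
}

def total_score(scores):
    """Weighted total, assuming every component is scored out of 100."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

def letter_grade(total):
    """Absolute grading; only the thresholds shown on the slide are encoded."""
    if total >= 95:
        return "A+"
    if total >= 90:
        return "A"
    if total >= 85:
        return "B+"
    return "below B+ (thresholds not given on the slide)"

# Hypothetical student scores, for illustration only.
example = {"midterm": 90, "final": 95, "quiz": 80, "assignment": 100, "attendance": 100}
total = total_score(example)
print(round(total, 1), letter_grade(total))
```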
Textbook
• No Textbook
• This course will proceed based on the presentation slides.
• I will upload the presentation slides to Blackboard & my homepage.
  Homepage: https://scholar.harvard.edu/msseo
  Teaching >> Introduction to Big Data >> Related Materials
• Reference 1 (Kor. Version)
  R for Practical Data Analysis
  (free online textbook)
  http://r4pda.co.kr/pdf/r4pda_2014_03_02.pdf
• Reference 2 (Eng. Version)
  Introduction to Data Science by Rafael A. Irizarry, 2019.
  (free online textbook)
  https://rafalab.github.io/dsbook/
• Reference 3 (Eng. Version)
  R for Data Science by Hadley Wickham and Garrett Grolemund.
  (free online textbook)
  https://r4ds.had.co.nz/
Contact information
• Prof. Minseok Seo
  Location: 7-203
  Tel: 044-860-1379
  Email: mins@korea.ac.kr
• TA Heechan Chae
  Location: 7-328
  Email: chay219@korea.ac.kr
• If you have any questions about the course, please email me and I will reply as soon as I see it.
• If you need to meet in person, please make an appointment by email first.
• I am available Mon: 12:00 - 17:00 | Wed: 10:00 - 13:00 | Thu: 10:00 - 13:00.
End of
Orientation
Contents
1. Course Overview
• Brief introduction of professor & course
• Object & Aim of the course
• Assignments & Quiz
• Evaluation
2. Introduction to Big Data
• Concept of Big Data
• Key techniques in Data Science for Big Data
Characteristics of Big Data
Recap: the concept of Big Data
• 5 V's (Volume, Velocity, Variety, Veracity, and Value)
  • Volume: Data size
  • Velocity: Data production speed
  • Variety: Data originating from various sources
  • Veracity: Data accuracy (trustworthiness)
  • Value: Data value
Petabyte era
• [logo] transferred about 197 PB of data through its network each day (2018)
• [logo] processed about 24 petabytes daily (2009)
1 PB = 1,000,000,000,000,000 B = 10^15 bytes = 1,000 terabytes (TB)
1,000 PB = 1 exabyte (EB)
In fact, we can say that we have already entered the exabyte era.
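The unit arithmetic above can be checked with a few lines of Python. The sketch uses decimal (SI) units, as on the slide, and the constant names are illustrative. It converts the two daily figures quoted above into bytes and shows how many days at each rate add up to one exabyte.

```python
# Decimal (SI) units, matching the slide: 1 PB = 10**15 bytes.
TB = 10**12
PB = 10**15
EB = 10**18

assert PB == 1000 * TB
assert EB == 1000 * PB

daily_2018_pb = 197  # "about 197 PB" per day (2018 figure from the slide)
daily_2009_pb = 24   # "about 24 petabytes daily" (2009 figure from the slide)

for pb_per_day in (daily_2018_pb, daily_2009_pb):
    bytes_per_day = pb_per_day * PB
    days_per_exabyte = EB / bytes_per_day
    print(f"{pb_per_day} PB/day = {bytes_per_day:.3e} bytes/day, "
          f"1 EB every {days_per_exabyte:.1f} days")
```

At the 2018 rate, an exabyte accumulates in roughly five days, which is the sense in which the exabyte era has already begun.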
How do you recognize whether it is Big Data or not?
Computer Scientist
"My computer is running low on memory handling this data!!"
That is Big Data
"No!!!! This data is over 2 TB. Where do I store it?????"
That is Big Data
In short, if you're having so much trouble processing data on your computer that you panic, it is probably due to Big Data.
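A rough way to anticipate the memory problem described above is to estimate the size of a dense numeric table before loading it. The Python sketch below assumes 8 bytes per value (double precision), hypothetical sample counts and dimensions, and an assumed 16 GB of RAM. None of these numbers come from the slides.

```python
BYTES_PER_DOUBLE = 8  # assumption: values stored as 64-bit floats

def dense_size_bytes(n_samples, n_features):
    """Approximate in-memory size of a dense numeric table."""
    return n_samples * n_features * BYTES_PER_DOUBLE

ram_bytes = 16 * 10**9  # hypothetical laptop with 16 GB of RAM

# Hypothetical data sets: (number of samples, number of features).
for n, p in [(10_000, 100), (1_000_000, 1_000), (100_000_000, 5_000)]:
    size = dense_size_bytes(n, p)
    verdict = "fits in RAM" if size < ram_bytes else "does not fit in RAM"
    print(f"{n:>11,} x {p:>5,}: {size / 10**9:10.3f} GB -> {verdict}")
```

Both the number of samples and the number of features multiply into the total, so either one alone can push a data set past what a single machine holds.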
How do you recognize whether it is Big Data or not?
Statistician
"When will this calculation end? I have already been waiting for 10 years ..."
"Dimensionality is too high!!!! I can't build a statistical model using this data!!!"
That is Big Data
In short, if you're having so much trouble analyzing data on your computer that you panic, it is probably due to Big Data.
Core technologies of Big Data era
IT technologies that resolve issues arising from Big Data
Difficulties arise in both hardware and software.
Software
• Prescreening techniques
• Data Visualization
• Feature selection
Hardware
• Parallel processing
• Cloud computing
• Distributed processing
But students can tackle the software difficulties.
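To make prescreening and feature selection concrete, here is a minimal Python sketch of one common prescreening idea: dropping features whose variance is too low to be informative. The slide does not say which techniques the course uses, so the variance filter, the threshold, the function name, and the `constant_flag` column are all assumptions. The height and weight values are the ones from the student table earlier in the deck.

```python
from statistics import pvariance

# Toy data: each key is a feature, each list holds one value per sample.
features = {
    "height_cm": [189, 210, 191, 162],
    "weight_kg": [81, 90, 92, 71],
    "constant_flag": [1, 1, 1, 1],  # hypothetical column that carries no information
}

def prescreen_by_variance(columns, min_variance):
    """Keep only the features whose population variance exceeds min_variance."""
    return {
        name: values
        for name, values in columns.items()
        if pvariance(values) > min_variance
    }

kept = prescreen_by_variance(features, min_variance=0.0)
print(sorted(kept))
```

Filtering of this kind reduces dimensionality in software, before any statistical model is fitted.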
Computational language for Big Data
R and Python
• There are two representative computer languages for Big Data analysis, R and Python.
• The R programming language (free and relatively easy) will be used for the hands-on lecture.
• Let's visit the R homepage:
https://cran.r-project.org/
Wed. (2 hrs): Lecture for Theory
Thu. (1 hr): Hands-on lecture
Install R
(Step 1) Download the R installer
(Step 2) Download RStudio
• Download RStudio from https://www.rstudio.com/products/rstudio/download/
(Step 3) Install R and RStudio
What is R
• R is an interpreted computer language.
• For efficiency, it can interface with procedures written in C, C++, and other languages.
• System commands can be called from within R.
• R is used for data manipulation, statistics, and graphics.
R, S, and S-plus (History of R)
• S: an interactive environment for data analysis developed at Bell Laboratories since 1976
  1988 - S2: RA Becker, JM Chambers, A Wilks
  1992 - S3: JM Chambers, TJ Hastie
  1998 - S4: JM Chambers
• Exclusively licensed by AT&T/Lucent to Insightful Corporation, Seattle WA. Product name: "S-plus".
  Implementation languages: C, Fortran.
• R: initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics, University of Auckland, New Zealand, during the 1990s.
• Since 1997: international "R-core" team of about 15 people with access to a common CVS archive.
What R does and does not do
• Possible
(1) Data handling and storage: numeric, textual
(2) Matrix algebra
(3) Hash tables and regular expressions
(4) High-level data analytic and statistical functions
(5) OOP (classes)
(6) Graphics
(7) Programming language: loops, branching, subroutines, etc.
• Impossible
(1) R is not a database, but it connects to DBMSs
(2) R has no GUI, but it connects to Java and Tcl/Tk
(3) R is fundamentally very slow, but it allows calling your own C/C++ code
(4) R has no spreadsheet view of data, but it connects to Excel/MS Office
(5) R has no professional or commercial support
• But all R users in the world are developers (the power of collective intelligence).
• If you make a meaningful package, you can publish it at any time, within 1 second.
• Therefore, the latest algorithms can be applied in R faster than in any other programming language.
End of Slide

More Related Content

Similar to 1.introduction_to_bigdata_chap1.pdf

KING’S OWN INSTITUTE Success in Higher Education ICT.docx
KING’S OWN INSTITUTE Success in Higher Education    ICT.docxKING’S OWN INSTITUTE Success in Higher Education    ICT.docx
KING’S OWN INSTITUTE Success in Higher Education ICT.docx
croysierkathey
 
Technology action plan
Technology action planTechnology action plan
Technology action plan
alan2939
 
DSAConclave Presentation based on introduction
DSAConclave Presentation based on introductionDSAConclave Presentation based on introduction
DSAConclave Presentation based on introduction
chmeghana1
 
Project Training Ppt
Project Training PptProject Training Ppt
Project Training Ppt
biinoida
 

Similar to 1.introduction_to_bigdata_chap1.pdf (20)

Nba sar ppt
Nba sar pptNba sar ppt
Nba sar ppt
 
Syllabus for fourth year of engineering
Syllabus for fourth year of engineeringSyllabus for fourth year of engineering
Syllabus for fourth year of engineering
 
Big Data analytics
Big Data analyticsBig Data analytics
Big Data analytics
 
Building a Computer Science Pathway for Endorsements
Building a Computer Science Pathway for EndorsementsBuilding a Computer Science Pathway for Endorsements
Building a Computer Science Pathway for Endorsements
 
Building a Computer Science Pathway for Endorsements
Building a Computer Science Pathway for EndorsementsBuilding a Computer Science Pathway for Endorsements
Building a Computer Science Pathway for Endorsements
 
Lecture_01.1.pptx
Lecture_01.1.pptxLecture_01.1.pptx
Lecture_01.1.pptx
 
GCSE year 9 options evening
GCSE year 9 options eveningGCSE year 9 options evening
GCSE year 9 options evening
 
Ate presentation schrag_102413
Ate presentation schrag_102413Ate presentation schrag_102413
Ate presentation schrag_102413
 
KING’S OWN INSTITUTE Success in Higher Education ICT.docx
KING’S OWN INSTITUTE Success in Higher Education    ICT.docxKING’S OWN INSTITUTE Success in Higher Education    ICT.docx
KING’S OWN INSTITUTE Success in Higher Education ICT.docx
 
Important dates and informatio for thapar institute of engineering and techno...
Important dates and informatio for thapar institute of engineering and techno...Important dates and informatio for thapar institute of engineering and techno...
Important dates and informatio for thapar institute of engineering and techno...
 
Engineering Student Engagement With Project Lead the Way
Engineering Student Engagement With Project Lead the WayEngineering Student Engagement With Project Lead the Way
Engineering Student Engagement With Project Lead the Way
 
Eecs6893 big dataanalytics-lecture1
Eecs6893 big dataanalytics-lecture1Eecs6893 big dataanalytics-lecture1
Eecs6893 big dataanalytics-lecture1
 
UKSG Jisc learninganalytics-3june2016
UKSG Jisc learninganalytics-3june2016UKSG Jisc learninganalytics-3june2016
UKSG Jisc learninganalytics-3june2016
 
Modified sd profile june 30
Modified sd profile june 30Modified sd profile june 30
Modified sd profile june 30
 
Technology action plan
Technology action planTechnology action plan
Technology action plan
 
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
 
Building a business case and institutional policy on a 10Y research data mana...
Building a business case and institutional policy on a 10Y research data mana...Building a business case and institutional policy on a 10Y research data mana...
Building a business case and institutional policy on a 10Y research data mana...
 
DSAConclave Presentation based on introduction
DSAConclave Presentation based on introductionDSAConclave Presentation based on introduction
DSAConclave Presentation based on introduction
 
Project Training Ppt
Project Training PptProject Training Ppt
Project Training Ppt
 
Data Science Course In Chennai-October
Data Science Course In Chennai-OctoberData Science Course In Chennai-October
Data Science Course In Chennai-October
 

Recently uploaded

Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
ptikerjasaptiker
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
wsppdmt
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
vexqp
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
cnajjemba
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Abortion pills in Riyadh +966572737505 get cytotec
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 

Recently uploaded (20)

Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
 
Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 

1.introduction_to_bigdata_chap1.pdf

  • 1. Introduction to Big Data Chapter 1 & 2 (Week 1) Course overview & introduction DCCS208(02) Korea University 2019 Fall Asst. Prof. Minseok Seo mins@korea.ac.kr
  • 3. Contents  Definition of Big Data Introduction to Big Data 2.  Brief introduction of professor & course Course Overview 1.  Object & Aim of the course  Assignments & Quiz  Evaluation  Key techniques in Data Science  Core technology of Informatics
  • 4. 4 / 20 copyrightⓒ 2018 All rights reserved by Korea University Course Overview Course information Introduction to Big Data, DCCS208(02), Fall 2019.  Lecture time: Wed. (6,7) and Thu. (6)  Location: Wed. (7-310) and Thu. (7-315)  Completion division: Major elective subject  Level: Junior / Senior
  • 5. 5 / 20 copyrightⓒ 2018 All rights reserved by Korea University Course Overview Definition of Big Data (Cont.) Which is bigger, elephant or rat? VS.
  • 6. 6 / 20 copyrightⓒ 2018 All rights reserved by Korea University Course Overview Definition of Big Data (Cont.)  What is Data? ID Height Weight Age Student 1 189 cm 81 kg 24 Student 2 210 cm 90 kg 26 Student 3 191 cm 92 kg 27 … … … … Student N 162 cm 71 kg 21 Attributes (Dimension; Features; Variables) Objects (Samples, Individuals)
  • 7. 7 / 20 copyrightⓒ 2018 All rights reserved by Korea University Course Overview Definition of Big Data (Cont.)  In a narrow sense, Big Data means only sample size.  In a broad sense, Big Data represents both sample size and dimensionality.
  • 8. 8 / 20 copyrightⓒ 2018 All rights reserved by Korea University Course Overview Definition of Big Data (Cont.)  3V’s (Volume, Velocity, and Variety)
  • 9. 9 / 20 copyrightⓒ 2018 All rights reserved by Korea University Course Overview Definition of Big Data (Cont.)  5V’s (Volume, Velocity, Variety, Veracity, and Value)  Volume: Data size  Velocity: Data production speed  Variety: Data oriented from various things  Veracity: Data accuracy (Trustworthy)  Value: Data value Value*
  • 10. 10 / 20 copyrightⓒ 2018 All rights reserved by Korea University Course Overview Relationship between Big-data & Data Science  The amount of data and information is not directly correlated with knowledge generation. X  But the demand for data scientists will be growing.
  • 11. 11 / 20 copyrightⓒ 2018 All rights reserved by Korea University Course Overview Job market of Big data Furht B., Villanustre F. (2016) Introduction to Big Data. In: Big Data Technologies and Applications. Springer, Cham It is the time to prepare for an academic course to cultivate data analysts commensurate with demand.
  • 12. 12 / 20 copyrightⓒ 2018 All rights reserved by Korea University Course Overview Object & Aim of the course  Students who have taken this course expect to be able to learn: Introduction to Big Data Concept of Big Data Computational approaches for Big Data Statistical approaches for Big Data Visualization for Big Data R programming Basic Skill in Data Science
  • 13. 13 / 20 copyrightⓒ 2018 All rights reserved by Korea University Course Overview Course schedule (Before Mid-term exam) Week Period Study Contents 1 09.02 - 09.08 Introduction to Big Data & Data Science 2 09.09 - 09.15 Overall workflow, Computer Software issues, and applications in the Big Data era 3 09.16 - 09.22 Introduction to R programming 4 09.23 - 09.29 Descriptive & Fundamental Statistics 5 09.30 - 10.06 Understanding Data Structures (Types of random variable) 6 10.07 - 10.13 Data Visualization 7 10.14 - 10.20 Preprocessing of Big Data (Quality Control and Prescreening) 8 10.21 - 10.27 Mid-term Exam
  • 14. 14 / 20 copyrightⓒ 2018 All rights reserved by Korea University Course Overview Course schedule (After Mid-term exam) Week Period Study Contents 9 10.28 - 11.03 Parallel and Distributed Processing for Big Data 10 11.04 - 11.10 Statistical Estimation & Modeling 11 11.11 - 11.17 Computational approach for statistical modeling with robustness 12 11.18 - 11.24 Clustering analysis (Unsupervised learning methods) 13 11.25 - 12.01 Classification analysis (Supervised learning methods) 14 11.02 - 12.08 Algorithms of Dimensionality Reduction for Big Data 15 12.09 - 12.15 Trends in various academic & industrial fields for application of Big Data 16 12.16 - 12.22 Final Exam
  • 15. 15 / 20 copyrightⓒ 2018 All rights reserved by Korea University Course Overview Two types of lectures per week  There are two representative computer language for Big data analysis, R and Python.  R will be used in this class.  It is not required any prior knowledge of the R language because I plan to provide example code for student's practice. https://cran.r-project.org/ Wed. day 2hrs Thu. Day 1hr Lecture for Theory Hands-on lecture The methodology learned in theory class will be exercised in the computer lab. on Thursday.
  • 16. 16 / 20 copyrightⓒ 2018 All rights reserved by Korea University Course Overview Exam, Quiz, and Homework  There will be two simple quizzes in class to check the student's learning progress of the course (before and after midterm respectively). Quiz Homework  There will be 4 times assignments.  This will be a report on the theory and practice of data analysis learned in class.  There will be two exams.  I will ask you to understand the basic computational/statistical algorithm. Midterm and Final exams
  • 17. 17 / 20 copyrightⓒ 2018 All rights reserved by Korea University Course Overview Evaluation plan  Absolute grading system Score ≥ 95, you will get A+ Score ≥ 90, you will get A Score ≥ 85, you will get B+ and... 30% 30% 10% 20% 10% Midterm Final Quiz Assignment Attendance
  • 18. 18 / 20 copyrightⓒ 2018 All rights reserved by Korea University Course Overview Textbook  No Textbook  This course will be proceed based on the presentation slide  I will upload presentation slide in Blackboard & my homepage Homepage: https://scholar.harvard.edu/msseo Teaching >> Introduction to Big Data >> Related Materials  Reference 2 (Eng. Version) Introduction to Data Science by Rafael A. Irizarry, 2019. (online textbook and free) https://rafalab.github.io/dsbook/  Reference 3 (Eng. Version) R for Data Science by Garrett Grolemund. (online textbook and free) https://r4ds.had.co.nz/  Reference 1 (Kor. Version) R for Practical Data Analysis (online textbook and free) http://r4pda.co.kr/pdf/r4pda_2014_03_02.pdf
  • 19. 19 / 20 copyrightⓒ 2018 All rights reserved by Korea University Course Overview Contact information  Prof. Minseok Seo Location: 7-203 Tel: 044-860-1379 Email: mins@korea.ac.kr  TA. Heechan Chae Location: 7-328 Email: chay219@korea.ac.kr  If you have any questions about the course please email me and I will reply as soon as I see it.  If you need to meet in person, please make an appointment by email first.  I will be available at Mon: 12:00 - 17:00 | Wed: 10:00 - 13:00 | Thu: 10:00 - 13:00.
  • 21. Contents  Concept of Big Data Introduction to Big Data 2.  Brief introduction of professor & course Course Overview 1.  Object & Aim of the course  Assignments & Quiz  Evaluation  Key techniques in Data Science for Big data
  • 22. 22 / 20 copyrightⓒ 2018 All rights reserved by Korea University Characteristics of Big Data Remind concept of Big Data  5V’s (Volume, Velocity, Variety, Veracity, and Value)  Volume: Data size  Velocity: Data production speed  Variety: Data oriented from various things  Veracity: Data accuracy (Trustworthy)  Value: Data value Value*
  • 23. 23 / 20 copyrightⓒ 2018 All rights reserved by Korea University Petabyte era  transferred about 197 PB of data thorough its network each data (2018)  processed about 24 petabytes daily (2009) 1 PB = 1000000000000000B = 1015bytes = 1000terabytes 1000 PB = 1 exabyte (EB) In fact, we can say that we have already entered the exabyte era.
  • 24. Characteristics of Big Data: How do you recognize whether it is Big Data or not? (Computer scientist)
      - "My computer is low on memory for handling this data!!" That is Big Data.
      - "No!!!! This data is over 2 TB. Where do I store it?????" That is Big Data.
      - In short, if data processing on your computer is giving you so much trouble that you are panicking, you are probably dealing with Big Data.
  • 25. Characteristics of Big Data: How do you recognize whether it is Big Data or not? (Statistician)
      - "When does this calculation end? I have already been waiting for 10 years ..." That is Big Data.
      - "The dimensionality is too high!!!! I can't build a statistical model using this data!!!" That is Big Data.
      - In short, if data analysis on your computer is giving you so much trouble that you are panicking, you are probably dealing with Big Data.
  • 26. Core technologies of the Big Data era: IT technologies for resolving issues that arise from Big Data
      - Difficulties arise in both hardware and software.
      - Software: prescreening techniques, data visualization, feature selection
      - Hardware: parallel processing, cloud computing, distributed processing
      - Students, however, can tackle the software difficulties.
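Of the software-side techniques listed on this slide, prescreening and feature selection are the easiest to illustrate in a few lines. The following is a minimal sketch, assuming a simple variance filter as the prescreening rule (the slides do not specify one). The function name, the threshold, and the added "constant" column are hypothetical; the height and weight values are the ones from the example table earlier in the deck.

```python
from statistics import pvariance


def prescreen_by_variance(rows, feature_names, min_variance):
    """Keep only features whose population variance exceeds min_variance.

    Assumption: a variance filter stands in for "prescreening" here;
    it is one common choice, not the method prescribed by the course.
    """
    columns = list(zip(*rows))  # transpose: one tuple of values per feature
    kept = []
    for name, values in zip(feature_names, columns):
        if pvariance(values) > min_variance:
            kept.append(name)
    return kept


if __name__ == "__main__":
    # Toy data: height (cm), weight (kg), and a made-up constant column.
    names = ["height", "weight", "constant"]
    data = [
        (189, 81, 1),
        (210, 90, 1),
        (191, 92, 1),
        (162, 71, 1),
    ]
    print(prescreen_by_variance(data, names, min_variance=0.0))
```

A column that never varies carries no information for a statistical model, so dropping it reduces dimensionality at no cost. Real prescreening rules are usually more elaborate, but follow the same pattern of scoring each feature and keeping those above a threshold.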
  • 27. Computational language for Big Data: R and Python
      - There are two representative programming languages for Big Data analysis: R and Python.
      - This course uses the R programming language (free and relatively easy) for the hands-on lectures.
      - Let's visit the R homepage: https://cran.r-project.org/
      - Schedule: Wed. (2 hrs), lecture on theory; Thu. (1 hr), hands-on lecture.
  • 28. Install R (Step 1): Download the R installer
  • 29. Install R (Step 2): Download RStudio from https://www.rstudio.com/products/rstudio/download/
  • 30. Install R (Step 3): Install R and RStudio
  • 31. What is R
      - R is an interpreted computer language.
      - For efficiency, it can interface with procedures written in C, C++, and other languages.
      - System commands can be called from within R.
      - R is used for data manipulation, statistics, and graphics.
  • 32. R, S, and S-plus (History of R)
      - S: an interactive environment for data analysis, developed at Bell Laboratories since 1976.
          1988 - S2: RA Becker, JM Chambers, A Wilks
          1992 - S3: JM Chambers, TJ Hastie
          1998 - S4: JM Chambers
      - S was exclusively licensed by AT&T/Lucent to Insightful Corporation, Seattle, WA. Product name: “S-plus”. Implementation languages: C and Fortran.
      - R: initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics, University of Auckland, New Zealand, during the 1990s.
      - Since 1997: an international “R-core” team of about 15 people with access to a common CVS archive.
  • 33. What R does and does not do
      - What R provides:
          (1) Data handling and storage: numeric, textual
          (2) Matrix algebra
          (3) Hash tables and regular expressions
          (4) High-level data-analytic and statistical functions
          (5) OOP (classes)
          (6) Graphics
          (7) Programming-language constructs: loops, branching, subroutines, etc.
      - What R is not:
          (1) R is not a database, but it connects to DBMSs.
          (2) R has no GUI, but it connects to Java and Tcl/Tk.
          (3) R is fundamentally very slow, but it lets you call your own C/C++ code.
          (4) R has no spreadsheet view of data, but it connects to Excel/MS Office.
          (5) R has no professional or commercial support.
      - But all R users in the world are developers (the power of collective intelligence).
      - If you make a meaningful package, you can publish it right away.
      - Therefore, the latest algorithms become available in R faster than in any other programming language.