Research Data Management and Data Management Plans (DMP) - part02

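Research Data Management and
Data Management Plans (DMP)
2017.8.22 (Tue)
Korea Institute of Science and Technology Information (KISTI)
Scientific Data Research Center
Dr. Suntae Kim stkim@kisti.re.kr
2017 Korean Society for Information Management Summer Conference, 2017.8.22
Yonsei University, Widang Hall, College of Liberal Arts Centennial Memorial Hall
Part 02
• Data and datasets
• Metadata and research records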

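Contents
• True science and the researcher's environment
• The research environment and changing perceptions of data
• Data and datasets
• Metadata and research records
• Research data
• Data management plan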

Definition of data (1/4)
• Generally and in science, data is a gathered body of facts. Source: http://searchdatamanagement.techtarget.com/definition/data
• A reinterpretable representation of information in a formalized manner suitable for communication, interpretation, or processing.
Examples: a sequence of bits, a table of numbers, the characters on a page, the recording of sounds made by a person speaking, or a moon rock specimen. Source: http://public.ccsds.org/publications/archive/650x0m2.pdf
Data is a collection of facts
Data is a reinterpretable representation of information
In marine science, the Korean word "jaryo" (자료) is used instead of "deiteo" (데이터)

• Microdata are data on the lowest level of observation such as individual answers to questions.
• Summary Data is another way of describing data that has been processed, or summarised (see statistics).
• Raw Data are the actual observations that are made when the data is collected.
• Primary Data are data collected through your own research study directly through instruments such as surveys,
observations, etc.
• Secondary Data are data from a research study conducted by someone else. (Source: http://libguides.library.qut.edu.au/DatasetsForResearch)
Microdata, Summary Data, Raw Data, Primary Data, Secondary Data
"transmittable and storable computer information" (1946)
"data processing" (1954)
(Source: http://www.dictionary.com/browse/data)

• Raw Data & Processed Data
– (Computing) Data as it is put into a computer, without being analysed. http://www.investorwords.com/10791/raw_data.html#ixzz4oMtXVk6m
– (Engineering) Data which have to be processed to provide useful information to the user; data which has not been processed, or that has not been processed to the full extent intended. http://www.dictionaryofengineering.com/definition/raw-data.html
– (Marine geoscience) Raw data refers to data that have not been changed since acquisition. Editing, cleaning or modifying the raw data results in processed data. http://www.marine-geo.org/help/data_FAQ.php
These definitions emphasize the state of the data

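• Primary Data:
– Data observed and collected directly through the study itself
• Secondary Data:
– Published data
– Data collected in the past
– Primary data observed or collected by someone else
– Data gathered for a different purpose http://www.businessdictionary.com/definition/secondary-data.html
<Distinguishing criteria>
• Whether the data was produced directly
• The purpose of the research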

Data-related terms
• Data Archive preserves and makes accessible research data.
• Codebook provides information on the structure, contents, and layout of a data file.
• Time Series is a sequence of data points spaced over time intervals.
(Source: http://libguides.library.qut.edu.au/DatasetsForResearch)
Data Archive: preservation of and access to data
Codebook: information on the structure, contents, and format of a data file
Time Series: sequential data spaced at regular time intervals

Data classification
(Source: http://bit.ly/2w2xari)
Observational
Experimental
Simulation
Derived or compiled
Reference or canonical
(Source: http://www.bu.edu/datamanagement/background/whatisdata/)
Raw Data
(Unprocessed Data)
Processed Data
Result Data
Scientific Data ⊂
Research Data
Quantitative Data
Qualitative Data
First English use: 1640s
"transmittable and storable computer information" (1946)
"data processing" (1954)
Data classified by processing stage, data domain, and method of data production
Data production method
Primary Data
Secondary Data

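Data types and formats
Observational data
- Captured in real time
- Cannot be reproduced or replaced
- Examples: sensor data, sensory (human) observations, survey results, neuroimaging data, sample data
Experimental data
- Produced under controlled conditions (in real time, in the laboratory)
- Reproducible (but can be expensive)
- Examples: gene sequences, chromatograms, spectroscopy data, microscopy data, toroid magnetic field data
Derived or compiled data
- Reproducible (but can be expensive)
- Examples: text and data mining output, derived variables, compiled databases, 3D models
Simulation data
- Output produced by a model in order to study the behavior and performance of a real or theoretical system
- The model and its metadata are more important than the output data
- Examples: climate models, economic models, biogeochemical models
Reference data
- Verified static or organic collections of datasets
- Examples: gene sequence databanks, chemical structures, statistical data, spatial data portals
(Source: http://guides.library.oregonstate.edu/research-data-services/data-management-types-formats)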

Definition of a data set (1/4)
• A data set
– is a catch-all phrase that covers anything related to data.
– includes raw and processed data, grids, images, maps, data spreadsheets and tables, and so on.
– comprises a suite of data files collected or generated by one instrument or device.
(For example, a multibeam sonar data set contains hundreds of swath data files.)
http://www.marine-geo.org/help/data_FAQ.php
• A pouch that holds anything related to data
• May include raw data, intermediate processed data, grids, images, maps, tables, and so on
• May be collected from multiple sources or generated by a single device

• A collection of data (Wikipedia)
• A data set is a collection of related data and information (generally numeric, word oriented, sound, and/or image) organized to permit search and retrieval or processing and reorganizing.
• Many data sets are resources from which specific data points, facts, or textual information is extracted for use in building a derivative data set or data product. A derivative data set, also called a value-added or transformative data set, is built from one or more preexisting data set(s) and frequently includes extractions from multiple data sets as well as original data (Committee for a Study on Promoting Access to Scientific and Technical Data for the Public Interest, 1999, p. 15).
Data set = Data + Information
Data sets = Data set + Data set
Derivative data set = Value-added data set = Transformative data set

• A collection of data records for computer processing (Dictionary.com)
• A dataset (or data set) is a collection of data, usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given member of the dataset in question. (Wikipedia) → a database table
• A collection of data, published or curated by a single source, and available for access or download in one or more formats (W3C Data Catalog Vocabulary)
→ A collection of data in various formats that can be accessed and downloaded on the web
• …a group of data files, usually numeric or encoded, along with the documentation files (such as a codebook, technical or methodology report, data dictionary) which explain their production or use. Generally a dataset is unusable for sound analysis by a second party unless it is well documented. (JISC, Data Information Specialists Committee)

Definition of a dataset (4/4)
• A data set is a set of data that is collected for a specific purpose. There are many ways in which data can be collected, for example, as part of service delivery, one-off surveys, interviews, observations, and so on. In order to ensure that the meaning of data in the data set is clearly understood and data can be consistently collected and used, data are defined using metadata… ("A guide to data development" (2007) from the National Data Development and Standards Unit in Australia)
• Data should be shared in accordance with recognised data standards where these exist, and in a way that maximises opportunities for data linkage and interoperability. Sufficient metadata must be provided to enable the dataset to be used by others. Agreed best practice standards for metadata provision should be adopted where these are in place. (Wellcome Trust)
• + legal definitions + statistical definitions + RDF …
• The definition of a dataset varies from field to field.
• In general it means a collection of data, and it sometimes includes metadata to support reuse. In a broad sense, information that describes the data can also be part of the definition of a dataset.
Definitions of a dataset vary
Generally means a collection of data
Sometimes includes metadata
Descriptive information about the data can also be part of the definition

Types of datasets
Source: http://bit.ly/2unU33Z

Record-based data
• Data matrix: when every attribute is numeric, each record is a point in a multidimensional space
• Document data: each document is a term vector
• Transaction data: each record (transaction) contains multiple items
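The three record-based forms can be sketched in Python as follows. This is only a minimal illustration: the sample measurements, the two short documents, and the item names are assumptions made up for the example and do not come from the slides.

```python
from collections import Counter

# Data matrix: every attribute is numeric, so each row is a point
# in a 2-dimensional space (hypothetical measurements).
data_matrix = [
    [5.1, 3.5],
    [4.9, 3.0],
    [6.2, 2.9],
]

# Document data: each document becomes a term vector whose entries
# count how often each vocabulary term occurs (hypothetical documents).
documents = ["data management plan", "research data data"]
vocabulary = sorted({word for doc in documents for word in doc.split()})
term_vectors = [
    [Counter(doc.split())[term] for term in vocabulary]
    for doc in documents
]

# Transaction data: each record (transaction) holds a set of items.
transactions = {
    1: {"bread", "milk"},
    2: {"bread", "beer", "eggs"},
}
```

With these inputs the vocabulary is `data, management, plan, research`, so the second document, which repeats "data", gets a count of 2 in the first position of its term vector.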

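Graph-based data
• Structural formula: a formula that represents the three-dimensional structure of a compound with lines
• C6H12O6: molecular representation (chemical formula)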
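A molecule can be held as a graph whose nodes are atoms and whose edges are bonds; the chemical formula is then just a count of the node labels. The sketch below uses methane (CH4) rather than glucose to stay short. The molecule, the identifiers `atoms` and `bonds`, and the helper names are assumptions for illustration only.

```python
from collections import Counter

# Nodes: atom id -> element symbol (methane, chosen as a small example).
atoms = {"C1": "C", "H1": "H", "H2": "H", "H3": "H", "H4": "H"}

# Edges: one tuple per bond between two atom ids.
bonds = [("C1", "H1"), ("C1", "H2"), ("C1", "H3"), ("C1", "H4")]


def molecular_formula(atoms):
    """Collapse the graph's node labels into a chemical formula.

    Ordering is simplified: carbon first, hydrogen second, then the
    remaining elements alphabetically.
    """
    counts = Counter(atoms.values())
    ordered = [e for e in ("C", "H") if e in counts]
    ordered += sorted(e for e in counts if e not in ("C", "H"))
    return "".join(e + (str(counts[e]) if counts[e] > 1 else "") for e in ordered)


def neighbours(atom_id, bonds):
    """Return the ids of atoms bonded to atom_id (the structural view)."""
    return sorted(b if a == atom_id else a for a, b in bonds if atom_id in (a, b))


formula = molecular_formula(atoms)
```

The formula discards the connectivity, while `neighbours` recovers it; that difference is the one between the chemical formula and the structural formula on the slide.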

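Ordered (sequence-based) data (1/2)
• Spatial Data: generally refers to geospatial data
– Geospatial data consists of spatial data and attribute data
– Spatial data is information about location, such as the shape or coordinates of a road
– Attribute data is information about the road's properties (name, length, speed limit, direction, etc.)
• Temporal Data
– Represents a state in time
– Acquired from many sources (manual entry, observation sensors, simulation models, etc.)
– Examples: land-use patterns in Hong Kong in 1990; total rainfall in Honolulu on July 1, 2009; visualizing the locations of marine mammals; understanding urban population growth; studying deaths from a specific disease; changes in ocean climate and weather patterns
Source: http://arcg.is/2uEoYs5
Source: http://bit.ly/2fsgsL0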
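The split of geospatial data into spatial data and attribute data can be sketched as a small record type. The class name `RoadSegment`, the planar coordinates in metres, and the attribute values are all hypothetical; real GIS formats carry coordinate reference systems and much richer geometry.

```python
import math
from dataclasses import dataclass, field


@dataclass
class RoadSegment:
    # Spatial data: the shape of the road as (x, y) vertices in metres.
    coordinates: list
    # Attribute data: properties of the road such as name or speed limit.
    attributes: dict = field(default_factory=dict)

    def length(self):
        """Sum the straight-line distances between consecutive vertices."""
        pairs = zip(self.coordinates, self.coordinates[1:])
        return sum(math.dist(p, q) for p, q in pairs)


road = RoadSegment(
    coordinates=[(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)],
    attributes={"name": "Example Road", "speed_limit_kmh": 50, "direction": "northbound"},
)
road_length = road.length()
```

Here the length is derived from the spatial part alone, while the name and speed limit live only in the attribute part.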

Ordered (sequence-based) data (2/2)
• Sequential Data
– Sequential data is a sequence of item sets
– Each item set contains several items
– Items in the same item set share the same timestamp (Source: https://www.igi-global.com/dictionary/)
• Genetic Sequence Data
(Source: http://bit.ly/2uo3Nzd)
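A minimal sketch of sequential data as described above: a list of item sets, each carrying one timestamp shared by all of its items. The class name `ItemSet`, the dates, and the item labels are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemSet:
    timestamp: str      # ISO date; all items in the set share it
    items: frozenset    # the items observed at that time


# Hypothetical item sets, deliberately given out of order.
sequence = [
    ItemSet("2017-08-03", frozenset({"C"})),
    ItemSet("2017-08-01", frozenset({"A", "B"})),
    ItemSet("2017-08-02", frozenset({"A", "D"})),
]

# ISO dates sort chronologically as plain strings.
ordered = sorted(sequence, key=lambda s: s.timestamp)


def occurrences(item, seq):
    """Return the timestamps of the item sets that contain the item."""
    return [s.timestamp for s in seq if item in s.items]


a_times = occurrences("A", ordered)
```

The ordering between item sets is what distinguishes this from plain transaction data, where the records carry no sequence.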

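Data quality problems
• Noise: modification of the original values (e.g. distortion of a voice, flicker on a TV screen)
• Outliers: objects whose characteristics differ considerably from those of the other objects in the data set
• Duplicate data
• Missing values
Data cleaning is required
Causes of missing values
• Information not provided (the person did not agree to give the data)
• Attribute not applicable (e.g. annual income does not apply to children)
Handling missing values
• Delete the record
• Estimate the value
• Ignore the missing value during analysis
• Fill in the value using weighted estimates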
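Three of the handling options listed above (deleting the record, estimating the value, and ignoring it during analysis) can be sketched as follows. The records and the `income` field are hypothetical, and mean imputation is used here only as one simple way to estimate a value; the slides do not prescribe a particular estimator.

```python
from statistics import mean

# Hypothetical survey records; None marks a missing value,
# e.g. a respondent who declined to provide the figure.
records = [
    {"id": 1, "income": 3200},
    {"id": 2, "income": None},
    {"id": 3, "income": 4100},
]


def drop_missing(rows, field):
    """Option 1: delete records whose field is missing."""
    return [r for r in rows if r[field] is not None]


def impute_mean(rows, field):
    """Option 2: estimate the missing value from the observed ones."""
    fill = mean(r[field] for r in rows if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


def mean_ignoring_missing(rows, field):
    """Option 3: leave the data alone and skip missing values in the analysis."""
    return mean(r[field] for r in rows if r[field] is not None)


complete = drop_missing(records, "income")
imputed = impute_mean(records, "income")
average = mean_ignoring_missing(records, "income")
```

None of the three functions modifies `records`, so the raw data stays available for comparison after cleaning.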

Data preprocessing (1/4)
Aggregation, Sampling
Dimensionality reduction
Feature selection & extraction
How aggregation works
• Multiple attributes → a single attribute
• Multiple objects → a single object
Purpose
• Reduce the number of attributes or objects
• Change the scale of analysis (cities < regions < states < countries)
• Obtain more refined, more stable data
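Aggregation as a change of scale can be sketched by rolling city-level records up to regions. The city names, region labels, and sales figures are made-up values for the example.

```python
from collections import defaultdict

# Hypothetical city-level records (many objects).
city_sales = [
    {"city": "Seoul", "region": "Capital", "sales": 120},
    {"city": "Incheon", "region": "Capital", "sales": 80},
    {"city": "Busan", "region": "Southeast", "sales": 90},
]


def aggregate(records, key, value):
    """Combine all records that share the same key into one total."""
    totals = defaultdict(int)
    for record in records:
        totals[record[key]] += record[value]
    return dict(totals)


# Three city objects become two region objects.
region_sales = aggregate(city_sales, "region", "sales")
```

The same function applied with a coarser key (for example a country column) would move the analysis one more step up the cities < regions < states < countries scale.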

Why sample
• Obtaining and analyzing all of the data of interest is too costly and time-consuming
• For this reason, sampling is also used in data mining
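A minimal sketch of simple random sampling without replacement, using Python's standard `random` module. The population of 1,000 integers, the sample size of 50, and the seed are arbitrary assumptions chosen so the example is repeatable.

```python
import random

# Hypothetical population: stands in for "all the data of interest".
population = list(range(1000))

# A seeded generator keeps the example reproducible.
rng = random.Random(42)

# Draw 50 distinct elements; each element can be chosen at most once.
sample = rng.sample(population, k=50)

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
```

Comparing `sample_mean` with `population_mean` gives a rough sense of how representative the sample is; a sample is useful when such properties come out close to those of the full data.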

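• The dimensionality is the number of features
• Dimensionality reduction means picking out the features that properly represent the meaning of the data
Why reduce dimensionality
• As the dimensionality increases, the amount of data needed to represent it grows exponentially
• As a result, very high-dimensional data is hard to represent meaningfully
Source: http://bit.ly/2uLGeLT Source: http://bit.ly/2vTYabG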
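The exponential growth mentioned above can be made concrete with a simple count. Assume, purely for illustration, that each feature is divided into 10 bins and that at least one data point is wanted per cell; the number of cells is then 10 raised to the number of features.

```python
def cells_needed(bins_per_dimension, dimensions):
    """Number of grid cells when every dimension is split into equal bins."""
    return bins_per_dimension ** dimensions


# Cells to cover for 1, 2, 3 and 6 features at 10 bins per feature.
growth = [cells_needed(10, d) for d in (1, 2, 3, 6)]
```

Going from 3 to 6 features multiplies the number of cells by a thousand, which is why a fixed amount of data becomes sparse so quickly in high dimensions.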

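Ways to reduce the dimensionality of data: feature selection and feature extraction
Feature selection
• Select a subset of all the features to build a compact feature set
• That is, remove unnecessary features (variables) from the original data
• For example, if the features varX and varY are judged to have no influence on predicting jump height, remove them from the full feature set to obtain a compact feature set
Feature extraction
• Attempts to create new features from combinations of the original features
• For example, principal component analysis (PCA) finds orthogonal principal axes from the data and projects all of the data onto those axes. The projection function that maps the original data to the projected data therefore produces new features that are linear combinations of the original features
Source: http://bit.ly/2vnZy5v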
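The difference between the two approaches can be sketched in plain Python. Feature selection simply drops columns; feature extraction is shown here as the first PCA axis, found by power iteration on the covariance matrix. The feature names `leg_length` and `strength`, the sample values, and the use of power iteration instead of a full eigendecomposition are assumptions for this illustration; `varX` follows the slide's example.

```python
import math


def select_features(rows, keep):
    """Feature selection: keep only a subset of the original columns."""
    return [{name: row[name] for name in keep} for row in rows]


def first_principal_axis(points, iterations=100):
    """Feature extraction: first PCA axis and the data projected onto it.

    Assumes the data has non-zero variance (otherwise the norm is zero).
    """
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - means[j] for j in range(d)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    axis = [1.0] * d
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalise to unit length.
        nxt = [sum(cov[i][j] * axis[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        axis = [x / norm for x in nxt]
    # Each score is a linear combination of the original features.
    scores = [sum(c[j] * axis[j] for j in range(d)) for c in centered]
    return axis, scores


rows = [
    {"leg_length": 1.0, "strength": 2.0, "varX": 7.0},
    {"leg_length": 2.0, "strength": 4.0, "varX": 1.0},
    {"leg_length": 3.0, "strength": 6.0, "varX": 5.0},
]
reduced = select_features(rows, ["leg_length", "strength"])
points = [[r["leg_length"], r["strength"]] for r in reduced]
axis, scores = first_principal_axis(points)
```

Selection keeps `leg_length` and `strength` unchanged and discards `varX`, whereas extraction replaces the two remaining columns with one new feature per row, the score along the principal axis.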

Metadata and
Research Records

Metadata
• Metadata is structured data about data
Source: http://www.bu.edu/datamanagement/background/whatisdata/
• Metadata addresses data attributes that describe, provide context, indicate the quality, or document other object (or data) characteristics. Source: Greenberg (2005, p. 20), Metadata: A Cataloger's Primer
• Metadata are often classified by their purpose
– descriptive metadata
– structural metadata
– administrative metadata
• rights management (terms and conditions),
• provenance, and
• preservation metadata
Source: Greenberg, 2005; National Information Standards Organization [NISO], 2004
• Describes the attributes of data
• Provides context and data quality information
• Documents the characteristics of other objects or data
• Comes in several types
Research records
• Records are documents containing data or information of any kind and in any form (including both paper-based and electronic format) created or received by an organisation or person for use in the course of their work and subsequently kept by that organisation or individual as evidence of that work, or because of the informational value of the data that such documents contain. Records associated with the research process include correspondence (including electronic mail as well as paper-based correspondence); project files; grant applications; ethics applications; authorship agreements; technical reports; research reports; laboratory notebooks or research journals; master lists(?); signed consent forms; and information sheets for research participants. Source: https://policy.unimelb.edu.au/MPF1242
Research records
- Exist in paper or electronic form; documents containing data and information
- Records associated with the research process include (electronic) mail, project files, grant applications, ethics applications, authorship agreements, technical reports, research reports, laboratory notebooks, research journals, master lists, consent forms, and information sheets for research participants
• Research Records include Research Data and Materials (defined below), as well as documents, materials and information that relate to: administrative, financial, and human resource management of research, reporting of research results, and sponsored award applications. This includes, but is not limited to, financial, administrative, cost or pricing, or other management information that has been gathered or used to apply for or support specific research activities, such as grant proposals, progress reports, and communications with funders. Forms in which Research Records may appear can differ among and across academic disciplines, and can include data in electronic form, such as electronic mail and budget spreadsheets. https://vpr.harvard.edu/faq/what-are-research-records
Definitions: the first is from Melbourne, Australia; the second is from Harvard University, USA
