International Journal of Electrical and Computer Engineering (IJECE)
Vol. 8, No. 5, October 2018, pp. 4042 – 4046
ISSN: 2088-8708, DOI: 10.11591/ijece.v8i5.pp4042-4046
Journal Homepage: http://iaescore.com/journals/index.php/IJECE
Multi-modal Asian Conversation Mobile Video Dataset for
Recognition Task
Dewi Suryani, Valentino Ekaputra, and Andry Chowanda
Computer Science Department, School of Computer Science, Bina Nusantara University, Jakarta, Indonesia 11480
Article Info
Article history:
Received December 6, 2017
Revised July 15, 2018
Accepted August 12, 2018
Keyword:
Multi-modal dataset
Asian conversation dataset
Mobile video
Recognition task
Facial features expression
Emotion recognition
ABSTRACT
Images, audio, and videos have long been used by researchers to develop tasks related
to human facial recognition and emotion detection. Most of the available datasets focus
on either static expressions, short videos of an emotion changing from neutral to its
peak, or differences in sound to detect a person's current emotion. Moreover, the
common datasets were collected and processed in the United States (US) or Europe,
and only a few datasets originated from Asia. In this paper, we present our effort to
create a unique dataset that fills the gap left by currently available datasets. At the
time of writing, our dataset contains 10 full HD (1920×1080) video clips with annotated
JSON files, totaling 100 minutes in duration and 13 GB in size. We believe this dataset
will be useful as training and benchmark data for a variety of research topics in human
facial and emotion recognition.
Copyright © 2018 Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author:
Dewi Suryani, Valentino Ekaputra, Andry Chowanda
Computer Science Department, School of Computer Science, Bina Nusantara University
Jl. K. H. Syahdan No. 9, Palmerah, Jakarta, Indonesia 11480
(+6221) 534 5830, ext. 2188
{dsuryani, vekaputra, achowanda}@binus.edu
1. INTRODUCTION
Nowadays, mobile devices such as smartphones are equipped with high-quality cameras and capable
processors that enable people to record high-definition (HD) video. People who want to record
good-quality video no longer need to buy an expensive digital or DSLR camera. Based on
observations by Ofcom [1] in 2016, people in the USA spent approximately 87 hours a month on
average browsing on a smartphone, compared to 34 hours on a laptop or desktop. This signifies
that people tend to use a smartphone for most of their activities, which means that the features
on their smartphones, including video recording, are more than enough for their daily needs.
Statista [2] also reported that 71% of 44,761 respondents use their smartphone to take photos or
videos, the second most popular smartphone activity after accessing the internet. This indicates
that the cameras in their smartphones are already satisfactory for taking images and recording
video compared to a few years before.
Despite the increasing amount of time people spend on their smartphones, there is no publicly
available dataset of Asian facial features captured using a mobile smartphone camera. Generally,
existing datasets consist only of still images, and most video datasets cover only Western facial
features. As the literature suggests, although facial expressions are universal across all races
of humans, the perception of emotions from facial expression cues differs considerably from one
culture to another. Moreover, most datasets focus only on the facial features in the video and do
not capture the facial features of a person talking with others in the wild. Hence, by using
smartphones, we can capture a natural conversation between two interlocutors in the wild. Such
data can serve as a dataset for computers to learn emotion recognition, facial expression
recognition features, and classification.
In this work, we aim to address this gap by presenting a mobile video dataset that contains
videos of Asian people having natural conversations with each other. Our dataset was obtained by
recording a natural conversation between two people inside a controlled room with adequate
lighting, with the mobile camera in a steady position. In order to collect a variety of facial
features, we provided several topics for the interlocutors to choose from. The topics are mostly
general ones, such as food, lecturers, etc.
The rest of this paper is organized as follows. In Section 2, we review existing publicly
available datasets and explain how they differ from ours. Our data collection method and the
characteristics of the data are explained in Section 3. Potential applications of the dataset are
described in Section 4. Lastly, the conclusion is provided in Section 5.
2. RELATED WORKS
Currently, several datasets have been created for many kinds of recognition tasks, especially
facial expression analysis and recognition [3, 4, 5]. However, only a few datasets contain Asian
people or were recorded using a smartphone. The extended Cohn-Kanade database, CK+ [6], is
composed of 593 recordings of posed and non-posed sequences. It was recorded under controlled
conditions of light and head motion, and the sequences range between 9 and 60 frames. Each
sequence represents a single changing facial expression that starts with a neutral expression and
ends with a peak expression; the transitions between expressions are not included. There is also
the NRC-IIT database [7], which contains pairs of short low-resolution MPEG-1-encoded video
clips. Each clip shows the face of a user sitting in front of a monitor while exhibiting a wide
range of facial expressions and orientations, as captured by a USB webcam mounted on the
computer. Every video clip is about 15 seconds long, has a capture rate of 20 fps, and is
compressed with the AVI Intel codec at a 481 kbps bit rate.
On the other hand, the Cohn-Kanade DFAT-504 dataset [8] consists of 100 university students
ranging in age from 18 to 30 years; 65% were female, 15% were African-American, and 3% were Asian
or Latino. Students were instructed by an experimenter to perform a series of 23 facial
expressions, beginning and ending each display with a neutral face. Image sequences from neutral
to target display were digitized into 640 by 480 pixel arrays with 8-bit precision for grayscale
values. Similarly, the MMI database [9] contains a large collection of FACS-coded facial videos.
It consists of 1395 manually AU-coded video sequences, the majority of which are posed and
recorded in laboratory settings.
Most of the datasets mentioned above focus only on the visual part of the video without recording
the audio, which makes them rather unnatural for several expressions. They were also mostly
recorded with dedicated cameras or webcams, and the duration of each recording is too short for
real-life applications that combine different aspects and the context of the topic with the
facial expressions in the video. In summary, our dataset differs in the following points: (i) it
is recorded using the camera in a mobile device; (ii) it contains long-duration natural
conversation videos; (iii) it includes full HD videos; and (iv) it includes audio for matching
expressions with context.
3. PROPOSED METHOD
In this paper, we propose a new mobile video dataset, which can be used as benchmark data for
several recognition tasks as well as a dataset for machine learning tasks. We start by explaining
how we collected the dataset for this research. The aim of this research is to provide a publicly
available dataset of Asian (specifically Indonesian) facial features, expressions, and
conversations. We recruited twenty volunteers, most of whom are Indonesian students (aged 19 to
21), to participate in this research. The participants were given a list of possible topics to be
discussed during the 10-minute recording. In one recording session, two interlocutors sat facing
each other across a table. The participants started the conversation with the other interlocutor
when the researcher gave them a signal.
To record the conversation, the researchers set up two smartphones with identical camera
specifications. The smartphones used in this research were two Xiaomi Mi 4i devices with a 13 MP,
f/2.0 camera, and the video was recorded in a full HD setting with a resolution of 1920×1080
pixels at 30 fps. The smartphones were placed in a steady position in front of each interlocutor.
The recorded video is depicted in Figure 1, where Figure 1a shows the first interlocutor involved
in the conversation on the selected topic and Figure 1b shows the other interlocutor.
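As an illustration of how the recorded clips can be consumed, the following minimal sketch reads
one of the full HD recordings with OpenCV and reports its resolution, frame rate, and duration.
The file name is hypothetical and the snippet is not part of the published toolchain; it only
shows one plausible way to iterate over frames, assuming the clips are stored as standard MP4
files as described above.

```python
import cv2

# Hypothetical file name for one recorded clip; the actual naming scheme
# is described in Section 3.
video_path = "conversation01_camA.mp4"

cap = cv2.VideoCapture(video_path)
if not cap.isOpened():
    raise IOError(f"Cannot open {video_path}")

width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))    # expected 1920
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))  # expected 1080
fps = cap.get(cv2.CAP_PROP_FPS)                   # expected ~30
frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

print(f"{width}x{height} @ {fps:.1f} fps, "
      f"{frame_count / fps / 60:.1f} minutes")

# Iterate over frames, e.g. to feed a face detector or expression classifier.
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # ... process `frame` (an HxWx3 BGR numpy array) here ...

cap.release()
```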
During the conversation, the volunteers were encouraged to behave as if it were a normal, natural
conversation in order to obtain the most natural dataset possible. After 10 minutes, they were
reminded to stop the conversation, and the video was saved as an MP4 file (MPEG-4 format) on the
device before being exported to a computer.
(a) First Interlocutor (b) Second Interlocutor
Figure 1. Example of recorded video in a conversation between two interlocutors
Figure 2. Part of an annotated JSON example
Each file is named in the format ”CONVERSATIONID CAMERAID”, where CONVERSATIONID is the
identifier of the session and CAMERAID is the identifier of the device used. After the data was
completely recorded, we annotated the video for three different facial expressions, i.e., sad,
neutral, and happy. Using the Visual Object Tagging Tool (VoTT) [10] provided by Microsoft, we
obtained the annotation data in JSON format, as shown in Figure 2. Figure 3 shows an example
video frame during the annotation process.
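To give an idea of how the annotations can be used, the sketch below loads one annotation file
and tallies the expression labels. The path and the field names (frames, x1, y1, x2, y2, tags)
are assumptions based on the general structure of VoTT's JSON export and should be verified
against the dataset's own files; this is not the authors' processing pipeline.

```python
import json
from collections import Counter

# Hypothetical path to one annotation file exported by VoTT for a clip.
annotation_path = "conversation01_camA.json"

with open(annotation_path, "r", encoding="utf-8") as f:
    annotation = json.load(f)

# Assumed VoTT-style layout: a "frames" object mapping frame indices to
# lists of tagged regions, each with a bounding box and a list of tags.
labels = []  # (frame_index, bounding_box, expression_label)
for frame_index, regions in annotation.get("frames", {}).items():
    for region in regions:
        box = (region.get("x1"), region.get("y1"),
               region.get("x2"), region.get("y2"))
        for tag in region.get("tags", []):
            labels.append((int(frame_index), box, tag))

# Count how often each annotated expression (sad, neutral, happy) occurs.
print(Counter(tag for _, _, tag in labels))
```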
4. POTENTIAL APPLICATIONS
There are several potential applications that can take advantage of the availability of this
dataset, such as facial recognition and emotion detection.
Facial Recognition, our dataset provides additional data to increase the accuracy of facial
recognition tasks. Most research in this field uses the CK+ dataset, as done by Bartlett et al.
[11], Cohen et al. [12], and Cohn et al. [13]. Unfortunately, the datasets currently in use
mostly contain people from Western regions, which can strongly affect recognition of Asian faces.
Emotion Detection, most other research on emotion detection uses only single-modality datasets,
i.e., visual only or audio only [14]. Our dataset provides multi-modal information, i.e., images
and audio that are linked together. Moreover, the commonly used datasets contain posed
expressions that are not based on authentic emotions [15].
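Since the clips keep their original audio tracks, audio-based or multi-modal emotion work first
needs the speech separated from the video. A minimal sketch, assuming ffmpeg is installed on the
machine and using hypothetical file names, extracts a mono 16 kHz WAV from one clip as a common
input format for speech emotion recognition front ends.

```python
import subprocess

# Hypothetical input clip and output file names.
video_path = "conversation01_camA.mp4"
audio_path = "conversation01_camA.wav"

# Strip the video stream (-vn) and resample the audio to 16 kHz mono
# 16-bit PCM, then write it as a WAV file next to the clip.
subprocess.run(
    ["ffmpeg", "-y", "-i", video_path,
     "-vn", "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1",
     audio_path],
    check=True,
)
```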
Virtual Humans or Intelligent Virtual Agents, with the dataset, we can learn natural
conversational behavior between two interlocutors and implement it in a virtual human [16, 17].
The dataset provides several features to be learned: emotion recognition from audio (i.e., voice)
and video (i.e., facial expressions), natural language processing, and conversation.
Figure 3. Example video frame during the annotation process. The red square denotes the annotated
facial area.
Psychology or Social Science Study, with the dataset, researchers in psychology or social science
can also analyze and observe human behavior during the interaction. An ethnographic study can
likewise be applied to analyze or observe human behavior (specifically of Indonesian people)
through the videos.
5. CONCLUSION
This paper presents a conversation video dataset containing videos of real conversations
performed by pairs of volunteers recorded using a mobile device camera, along with JSON files of
the annotations. In the future, we intend to collect more data for the dataset by recruiting more
diverse volunteers in terms of age, gender, and occupation. We believe that this dataset will be
useful for several applications that require training on images, audio, or videos. We also plan
to record videos under different lighting conditions in order to observe the influence of
lighting.
REFERENCES
[1] Ofcom. (2016, Dec.) The communications market report: International. [Online]. Available:
https://www.ofcom.org.uk/research-and-data/multi-sector-research/cmr/cmr16/international
[2] Statista. (2016, Sep.) Weekly smartphone activities among adult users in the united states as of august
2016. [Online]. Available: https://www.statista.com/statistics/187128/leading-us-smartphone-activities/
[3] H. K. Palo and M. N. Mohanty, “Classification of emotional speech of children using probabilistic neural
network,” International Journal of Electrical and Computer Engineering, vol. 5, no. 2, p. 311, 2015.
[4] F. E. Gunawan and K. Idananta, “Predicting the level of emotion by means of indonesian speech signal.”
Telkomnika, vol. 15, no. 2, 2017.
[5] F. Z. Salmam, A. Madani, and M. Kissi, “Emotion recognition from facial expression based on fiducial
points detection and using neural network,” International Journal of Electrical and Computer Engineering
(IJECE), vol. 8, no. 1, 2017.
[6] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews, “The extended cohn-kanade
dataset (ck+): A complete dataset for action unit and emotion-specified expression,” in Computer Vision
and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on. IEEE,
2010, pp. 94–101.
[7] D. O. Gorodnichy, “Video-based framework for face recognition in video,” in Computer and Robot Vision,
2005. Proceedings. The 2nd Canadian Conference on. IEEE, 2005, pp. 330–338.
[8] T. Kanade, J. F. Cohn, and Y. Tian, “Comprehensive database for facial expression analysis,” in Automatic
Face and Gesture Recognition, 2000. Proceedings. Fourth IEEE International Conference on. IEEE,
2000, pp. 46–53.
[9] M. Valstar and M. Pantic, “Induced disgust, happiness and surprise: an addition to the MMI
facial expression database,” in Proc. 3rd Intern. Workshop on EMOTION (satellite of LREC):
Corpora for Research on Emotion and Affect, 2010, p. 65.
[10] Microsoft. (2017, Sep.) VoTT: Visual Object Tagging Tool. [Online]. Available:
https://github.com/Microsoft/VoTT
[11] M. S. Bartlett, G. Littlewort, I. Fasel, and J. R. Movellan, “Real time face detection and facial expression
recognition: Development and applications to human computer interaction.” in Computer Vision and
Pattern Recognition Workshop, 2003. CVPRW’03. Conference on, vol. 5. IEEE, 2003, pp. 53–53.
[12] I. Cohen, N. Sebe, A. Garg, L. S. Chen, and T. S. Huang, “Facial expression recognition from video
sequences: temporal and static modeling,” Computer Vision and image understanding, vol. 91, no. 1, pp.
160–187, 2003.
[13] J. F. Cohn, L. I. Reed, Z. Ambadar, J. Xiao, and T. Moriyama, “Automatic analysis and recognition of
brow actions and head motion in spontaneous facial behavior,” in Systems, Man and Cybernetics, 2004
IEEE International Conference on, vol. 1. IEEE, 2004, pp. 610–616.
[14] L. C. De Silva, T. Miyasato, and R. Nakatsu, “Facial emotion recognition using multi-modal information,”
in Information, Communications and Signal Processing, 1997. ICICS., Proceedings of 1997 International
Conference on, vol. 1. IEEE, 1997, pp. 397–401.
[15] Y. Sun, N. Sebe, M. S. Lew, and T. Gevers, “Authentic emotion detection in real-time video,”
in International Workshop on Computer Vision in Human-Computer Interaction. Springer, 2004, pp.
94–104.
[16] A. Chowanda, P. Blanchfield, M. Flintham, and M. Valstar, “Erisa: Building emotionally realistic social
game-agents companions,” in International Conference on Intelligent Virtual Agents. Springer, 2014,
pp. 134–143.
[17] ——, “Play smile game with erisa,” in IVA 2015, Fifteenth International Conference on Intelligent Virtual
Agents, 2015.
BIOGRAPHIES OF AUTHORS
Dewi Suryani is a Computer Science lecturer at Bina Nusantara University, Indonesia. She obtained
her Bachelor's degree in Computer Science from Bina Nusantara University in 2014 and recently
graduated with a Master of Engineering from the Sirindhorn Thai-German Graduate School of
Engineering (TGGS), King Mongkut's University of Technology North Bangkok (KMUTNB), Thailand. Her
research interests include image processing, handwriting recognition, and related areas of
computer science.
Valentino Ekaputra is a lecturer at Bina Nusantara University. He obtained his Bachelor's and
Master's degrees in Computer Science from Binus University in 2016. He has not yet settled on a
specific research focus.
Andry Chowanda is a Computer Science lecturer at Bina Nusantara University, Indonesia, and is
currently a PhD student at the University of Nottingham. His research is in agent architecture.
His work focuses mainly on modeling agents that can sense and perceive the environment, react
based on the perceived data, and build a social relationship with the user over time.
