Unauthorized Information Collection Using Integrated Sensors and Systems

Anton Nikolaevich Boyko, Candidate of Technical Sciences
NMST Institute, MIET
anton.bojko@mail.ru
Moscow, Zelenograd, 26 February 2021
II International Scientific and Practical Conference "High-Tech Law: Genesis and Prospects"

Sensors and systems in smartphones
Source: Yole Développement
Figure: Mobile value proposition (Yole Développement, June 2016)
• a smartphone is not just a "smart" device; it is a cyber-physical system

Design of motion sensors
Source: Ba, Zhongjie & Zheng, Tianhang & Zhang, Xinyu & Qin, Zhan & Li,
Baochun & Liu, Xue & Ren, Kaili. (2020). Learning-based Practical Smartphone
Eavesdropping with Built-in Accelerometer.
[...] accelerometer measurements. Extensive evaluations on the existing and our datasets show that the system significantly and consistently outperforms existing solutions. To the best of our knowledge, the proposed system gives the first trial on accelerometer-based speech reconstruction.
II. BACKGROUND AND RELATED WORK
In this section, we first describe the design of the motion sensors on current smartphones. We then review the existing works that exploit motion sensors to capture speech signals and other topics related to AccelEve.
A. MEMS Motion Sensors
Modern smartphones typically come equipped with a three-axis accelerometer and a three-axis gyroscope. These sensors are highly sensitive to the motion of the device and have been widely applied to sense orientation, vibration, shock, etc.
Accelerometer: a three-axis accelerometer is a device that captures the acceleration of its body along three sensing axes. Each axis is normally handled by a sensing unit consisting of a movable seismic mass, several fixed electrodes, and several spring legs, as shown in Fig. 2(a). When the accelerometer experiences an acceleration along a sensing axis, the corresponding seismic mass shifts to the opposite direction and creates a change in the capacitance between the electrodes. This change yields an analog signal that is then mapped to [...]
Fig. 2. Sketches of an accelerometer and gyroscope: (a) accelerometer structure, (b) gyroscope structure.
TABLE I. SAMPLING FREQUENCIES SUPPORTED BY ANDROID [2].
Delay option            Delay    Sampling rate
SENSOR_DELAY_NORMAL     200 ms   5 Hz
SENSOR_DELAY_UI         60 ms    16.7 Hz
SENSOR_DELAY_GAME       20 ms    50 Hz
SENSOR_DELAY_FASTEST    0 ms     As fast as possible
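The sampling rates in Table I follow from the delay values, since rate = 1 / delay. The sketch below shows that arithmetic in Python. The function names are illustrative and are not part of the Android API. It also prints the Nyquist limit (half the sampling rate), which is the highest frequency such a stream can represent without aliasing.

```python
def sampling_rate_hz(delay_ms: float) -> float:
    """Convert the delay between samples (ms) into a sampling rate (Hz)."""
    if delay_ms <= 0:
        # A zero delay means "as fast as possible", so no fixed rate exists.
        raise ValueError("delay must be positive to define a fixed rate")
    return 1000.0 / delay_ms


def nyquist_hz(rate_hz: float) -> float:
    """Highest frequency representable without aliasing at this rate."""
    return rate_hz / 2.0


if __name__ == "__main__":
    for delay_ms in (200, 60, 20):  # nominal non-zero delays from Table I
        rate = sampling_rate_hz(delay_ms)
        print(f"{delay_ms:>3} ms -> {rate:5.1f} Hz, Nyquist {nyquist_hz(rate):.2f} Hz")
```

Even the fastest fixed option stays far below the frequency range of human speech, which is why the cited works treat the captured signal as aliased.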
Accelerometer
(linear acceleration sensor)
Gyroscope
(angular rate sensor)
• the design of MEMS motion sensors allows them to be used as a
microphone

Information collection using built-in motion sensors
Source: Ba, Zhongjie & Zheng, Tianhang & Zhang, Xinyu & Qin, Zhan & Li, Baochun & Liu, Xue & Ren,
Kaili. (2020). Learning-based Practical Smartphone Eavesdropping with Built-in Accelerometer.
• apps are usually not required to request permission
to access motion sensors

"Gyrophone": the gyroscope as an eavesdropping sensor
Source: Michalevsky, Yan & Boneh, Dan & Nakibly, Gabi. (2014).
Gyrophone: Recognizing Speech from Gyroscope Signals.
• MEMS gyroscope
• measurement setup
Figure 1: STMicroelectronics 3-axis gyro design: (a) MEMS structure, (b) driving mass movement depending on the angular rate. (Taken from [16]. Figure copyright of STMicroelectronics. Used with permission.)
[...] Spectral Centroid statistical features. We used MIRToolbox [32] for the feature computation. It is important to note that while MFCC have a physical meaning for real speech signal, in our case of a narrow-band aliased signal, MFCC don't necessarily have an advantage, and were used partially because of availability in MIRToolbox. We attempted to identify the gender of the speaker, distinguish between different speakers of the same gender and distinguish between different speakers in a mixed set of male and female speakers. For gender identification we used a binary SVM, and for speaker identification we used multi-class SVM and GMM. We also attempted gender and speaker recognition using DTW with STFT features. All STFT features were computed with a window of 512 samples which, for sampling rate of 8 kHz, corresponds to 64 ms.
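The window length quoted above can be checked directly: duration = samples / sampling rate. The Python sketch below shows that arithmetic, together with the frequency spacing between STFT bins that such a window implies. The function names are illustrative and do not come from the cited paper or from MIRToolbox.

```python
def window_duration_ms(window_samples: int, sampling_rate_hz: float) -> float:
    """Duration of one analysis window in milliseconds."""
    return 1000.0 * window_samples / sampling_rate_hz


def bin_spacing_hz(window_samples: int, sampling_rate_hz: float) -> float:
    """Spacing between adjacent DFT bins for a window of this length."""
    return sampling_rate_hz / window_samples


if __name__ == "__main__":
    # Values quoted in the excerpt: 512 samples at 8 kHz.
    print(window_duration_ms(512, 8000.0), "ms per window")
    print(bin_spacing_hz(512, 8000.0), "Hz between bins")
```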
Figure 5: Experimental setup
[...] SNR, perhaps by filtering out the noise or applying some other preprocessing for emphasizing the speech signal.

Scenario: "on the same surface as the loudspeaker"
Source: Anand, S & Saxena, Nitesh. (2018). Speechless: Analyzing the Threat to Speech
Privacy from Smartphone Motion Sensors. 1000-1017. 10.1109/SP.2018.00004.
Fig. 3: Comparison of sensor behavior under ambient locations and in presence of speech in the Loudspeaker-Same-Surface scenario. Maximum variance in sensor readings (in absence of speech) at quiet locations 1, 2, 3, 4 is plotted alongside maximum variance in sensor readings (in presence of speech) to determine the effect of speech on sensors. Due to surface vibrations from loudspeaker, there is a noticeable effect on accelerometer readings that pushes the blue line plot significantly higher than the line plots of quiet locations (denoted by green, magenta, cyan, and red line plots).
Panels: (a)-(c) gyroscope readings along the x, y, and z axes (maximum range in rad/s vs. samples); (d)-(f) accelerometer readings along the x, y, and z axes (maximum range in m/s^2 vs. samples). Series: Location 1, Location 2, Location 3, Location 4, Speech.
• gyroscope
• accelerometer

Scenario: "within the same device"
Source: Ba, Zhongjie & Zheng, Tianhang & Zhang, Xinyu & Qin, Zhan & Li, Baochun & Liu, Xue & Ren, Kaili. (2020).
Learning-based Practical Smartphone Eavesdropping with Built-in Accelerometer.
• on a table
Fig. 8. The impact of self-noise and surface vibration: (a) acceleration signal along the x-axis, y-axis, and z-axis; (b) spectrogram of the signal along the z-axis. The accelerometer is placed on a table and is only affected by the vibration of the surface.
[...] external stimulus, we investigate the combined effect of self-noise and surface vibration. Surface vibration could affect the accelerometer's measurements along the z-axis when the smartphone is placed on a table. To measure the impact of these two noise sources, we place a Samsung S8 on a table and record its accelerometer measurements for 330 seconds. The table has a solid surface that could effectively hand over vibration to the smartphone and is placed in a building under construction. The output signal of the accelerometer is depicted in Fig. 8(a). It can be observed that the accelerometer has a constant noise output along the x-axis and the y-axis. The self-noise of the accelerometer contributes to the majority of these noise outputs. For the z-axis, the accelerometer outputs a constant noise signal as well as surface vibrations. The frequency distribution of the acceleration signal along three axes are similar. For illustration, Fig. 8(b) plots the spectrogram of the signal along the z-axis (with the DC offset removed). In this spectrogram, around 57% of the energy are distributed [...]
Fig. 9. Raw acceleration signals captured by smartphone accelerometer: (a) table setting, (b) handhold setting.
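The excerpt above removes the DC offset before looking at how signal energy is distributed over frequency. The Python sketch below illustrates one way to do such an analysis with a naive DFT. It is not the cited authors' code: the sampling rate, the cutoff frequency, and the synthetic test signal are all assumptions made for illustration, because the source text is cut off before it names the frequency band.

```python
import cmath
import math


def remove_dc(samples):
    """Subtract the mean so the constant (DC) component vanishes."""
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


def one_sided_power(samples):
    """Power per frequency bin 0..n//2, computed with a naive O(n^2) DFT."""
    n = len(samples)
    power = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        # Interior bins stand for a +/- frequency pair, so count them twice.
        weight = 1 if k == 0 or 2 * k == n else 2
        power.append(weight * abs(coeff) ** 2)
    return power


def energy_fraction_below(samples, sampling_rate_hz, cutoff_hz):
    """Share of the non-DC signal energy found in bins below cutoff_hz."""
    centered = remove_dc(samples)
    power = one_sided_power(centered)
    total = sum(power)
    if total == 0:
        return 0.0
    bin_hz = sampling_rate_hz / len(centered)
    low = sum(p for k, p in enumerate(power) if k * bin_hz < cutoff_hz)
    return low / total


if __name__ == "__main__":
    rate = 200.0  # Hz, hypothetical sampling rate
    # Synthetic z-axis signal: gravity offset plus two equal-amplitude tones.
    signal = [9.8 + math.sin(2 * math.pi * 5 * i / rate)
              + math.sin(2 * math.pi * 40 * i / rate) for i in range(200)]
    print(round(energy_fraction_below(signal, rate, 20.0), 3))
```

For real recordings, an FFT library would replace the quadratic-time loop. The naive form is used here only to keep the sketch dependency-free.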
• in the hand

Music recognition
Source: Matovu, Richard & Griswold-Steiner, Isaac & Serwadda, Abdul. (2019). Kinetic Song
Comprehension: Deciphering Personal Listening Habits via Phone Vibrations.
Fig. 1: Attack threat model.
Three of the above cited four papers (i.e., [6]–[8]) have an additional fundamental difference from our work, namely, (5) Threat Scenario Studied: Our work is focused on the scenario of a malicious entity such as an advertising company that has a rogue app which seeks to make inferences on the kind of multimedia content consumed by the owners [...]
[...] this text inference attack have since been studied, e.g., Xu et al. [14] used both the accelerometer and orientation sensors for keystroke inference, Marquardt et al. [15] focused on inference of text typed on nearby keyboards, and more recently Tang et al. [16] and Hodges et al. [17] respectively focused on validating the inference of PINs in a much larger [...]
Smartphone playing music
Information collection by a spyware app
Machine learning
Recognition
• Effectiveness: 80%
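The pipeline on this slide (collect sensor data, train a model, recognize the content) can be illustrated with a deliberately simple Python sketch. It is not the method of the cited paper. The three features, the nearest-centroid classifier, and all names below are assumptions chosen only to make the flow concrete.

```python
import math
from statistics import mean, pstdev


def extract_features(window):
    """Three simple statistics of one window of accelerometer samples."""
    avg = mean(window)
    # Count sign changes around the mean as a rough frequency indicator.
    crossings = sum(1 for a, b in zip(window, window[1:])
                    if (a - avg) * (b - avg) < 0)
    return (avg, pstdev(window), crossings / len(window))


def train_centroids(labelled_windows):
    """Average the feature vectors of each label into one centroid."""
    grouped = {}
    for label, window in labelled_windows:
        grouped.setdefault(label, []).append(extract_features(window))
    return {label: tuple(mean(column) for column in zip(*features))
            for label, features in grouped.items()}


def classify(window, centroids):
    """Return the label whose centroid is closest in feature space."""
    features = extract_features(window)
    return min(centroids, key=lambda label: math.dist(features, centroids[label]))


if __name__ == "__main__":
    def tone(cycles, n=100):
        # Synthetic stand-in for vibration caused by a played track.
        return [math.sin(2 * math.pi * cycles * i / n + 0.3) for i in range(n)]

    model = train_centroids([("track_a", tone(2)), ("track_a", tone(3)),
                             ("track_b", tone(20)), ("track_b", tone(22))])
    print(classify(tone(18), model))
```

A practical attack would need far richer features and a stronger classifier. The point of the sketch is only that labelled sensor windows are enough to build a recognizer.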

Speech recognition from facial movements
Figure 1: Location of the phone on the face of a user during
a typical phone conversation. As the jaws and face move
during the articulation of words, motion sensors in the phone
pick up some of these minute movements. Our work explores
if these movements might be distinct enough to enable the
decoding of spoken words. Observe that the phone’s speaker is
directed straight into the ear while the microphone is directed
at the mouth. The user’s quest to attain this speaker-to-ear
and microphone-to-mouth mapping is the reason why most
users hold the phone in some variation of this general posture.
However, the predictability of this phone holding posture means
that certain jaw movements emanating from speech might
produce consistent patterns in sensor data (within a reasonable
margin of error), which in turn lends itself to potential attacks
such as the one studied in this paper. Photograph by Hassan
OUAJBIR, via Unsplash [4].
[...] spoken on the phone. This experiment showed that not only
can phonetics be differentiated from one another, but that
words are more clearly separated due to their containing a
combination of phonetics. This preliminary investigation
prompted the development of additional experiments to
evaluate the risk posed by this side-channel attack. Our
work suggests a new family of attacks based on smart
devices placed against or connected to the head (e.g.,
smartphones or smart earbuds with sensors)
• Evaluating Word-Level Inference Attacks: Building
on patterns observed during initial experiments using
phonetics, we study how motion sensor data can predict
digits (0-9) spoken on the phone. We show, based on a
deep neural network comprised of Convolutional Neural
Network (CNN) and Long-Short Term Memory (LSTM)
layers, that for the most vulnerable users, spoken digits
can be recognized with an accuracy of up to 50+%. Even
for the least vulnerable users, we find that the attack
outperforms random guessing, making it appealing to the [...]
[...] numbers before embarking on decoding them. Although
our study is focused on digits for the sake of focused
exposition, its implications are beyond just these 10 digits.
Our findings imply that an entity aiming to detect a
small dictionary of words (e.g., see trigger words used by
Homeland Security [5]) might be able to extract them via
biomechanical movements extracted from motion sensor
data.
• Inference of User Identity and Gender: Finally, we
study the question of whether the face movements captured
by the phone’s sensors during phone calls could be used
to infer the identity and gender of the user. We find that
the user’s (or victim’s) identity could be determined with
an accuracy of between 36% and 45% on the first guess,
depending on the type of content being spoken (digits or
non-digits). Additional experiments find that the user’s
gender could be determined with a classification accuracy
of between 64% and 88% depending on whether some
data from the victim is part of the training set or not.
Potential applications of the gender and identity inference
attack include the surveillance of high value targets (e.g.,
terrorists) who might use multiple devices over time or
activists in nations where their activity is not sanctioned
by the government. Our findings, coupled with the fact
that motion sensor data on certain devices (e.g., Android)
can be accessed by any app without restriction, reveal
that such surveillance might be feasible.
• Demonstrating the Effect of Variations in Hardware
and User Behavior: Finally, we conducted a series
of experiments to evaluate the impact of variations in
hardware and user behavior on the performance of the
attack. Unlike our other experiments, participants used
10 different models of phone for this investigation. The
phones held by users for these experiments had a weight
ranging from 5.1 to 7.2 ounces, with a length of 5.84-
6.39 inches, and width of 2.76-3.06 inches. It’s possible
that varying sizes of phone are held differently by a user,
causing the attack to misidentify movements of the face if
it is not trained on similar data. Additionally, phones might
propagate facial movements to the device’s sensors in a
different manner based on weight or size characteristics.
Another difference between devices was in the sampling
rate of the sensors, with devices ranging from 50 to 243 Hz
for the accelerometer and gyroscope. Users also held their
devices differently, with some users only having contact
between the phone and their ear, while others held the
device at drastically different angles (e.g., nearly vertical
or horizontal). We found that large variations in either
the sampling rate of the data or the way the participant
held the phone led to the attacks performing poorly. For
example, participants that either held the phone with the
only contact point between their phone and head being
their ear were highly resistant to attacks classifying the
digit spoken or the participant's gender. These results [...]
(a) Phonetics from the IPA. (b) Corresponding words using the IPA phonetics.
Figure 2: Preliminary study to evaluate whether biomechanical movements during phone conversations can cause enough
movement of a smartphone to distinguish between phonetic sounds and words. The colors and shapes in Figures 2a and 2b are
paired in such a way that the word uses the matching phonetic.
These variations notwithstanding, the key takeaway for our
research is that even in these low dimensional spaces based on 3
simple features, we see a decent amount of separability between
words. The observation suggests that a more carefully crafted
set of features in conjunction with a rigorous classification
engine might attain good classification on the problem, making
the attack feasible. The attacks in the rest of this paper leverage [...]
[...] 6-26% accuracy for user-independent experimental conditions.
They were also able to identify the person whose voice was played
through the speakers at a rate of about 50% accuracy (for 10
participants). When conducting user-dependent analysis, they
found that the accuracy dramatically increased to 65% when
using Dynamic Time Warping (DTW) for classification.
Other researchers have attempted to leverage such information [...]
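Dynamic Time Warping, mentioned in the excerpt above, compares two sequences that may differ in speed by finding the cheapest monotonic alignment between them. The Python sketch below implements the textbook dynamic-programming form with an absolute-difference cost and a nearest-template classifier. The template data and function names are hypothetical, and the cited works may use different features and distance measures.

```python
def dtw_distance(a, b):
    """Classic DTW cost between two 1-D sequences (absolute-difference cost)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # stretch b
                                    cost[i][j - 1],      # stretch a
                                    cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def classify_by_dtw(query, templates):
    """Return the label of the template with the smallest DTW cost."""
    return min(templates, key=lambda label: dtw_distance(query, templates[label]))


if __name__ == "__main__":
    # Hypothetical per-word templates of a sensor feature over time.
    templates = {"one": [0, 1, 0], "two": [0, 2, 2, 0]}
    print(classify_by_dtw([0, 0, 1, 1, 0], templates))
```

Because the query is just a time-stretched copy of the first template, its DTW cost to that template is zero even though the sequences have different lengths.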
Phonemes of the International Phonetic Alphabet and
accelerometer displacements
Individual words and their relation to the amplitude of
accelerometer displacements
Source: Griswold-Steiner, Isaac & LeFevre, Zachary & Serwadda, Abdul.
(2020). Smartphone Speech Privacy Concerns from Side-Channel Attacks on
Facial Biomechanics. Computers & Security. 100. 10.1016/j.cose.2020.102110.
• digit recognition: up to 35%
• gender recognition: up to 88%
• user identification: up to 45%

Recognizing keyboard input with motion sensors
Source: Javed, A.R., Beg, M.O., Asim, M. et al. AlphaLogger: detecting motion-based side-
channel attack using smartphone keystrokes. J Ambient Intell Human Comput (2020).

Recognizing eye movements with the front camera
Source: Y. Wang, W. Cai, T. Gu and W. Shao, "Your Eyes Reveal Your Secrets: An Eye Movement
Based Password Inference on Smartphone," in IEEE Transactions on Mobile Computing, vol. 19, no.
11, pp. 2714-2730, 1 Nov. 2020, doi: 10.1109/TMC.2019.2934690.
Figure panels: (a) Face detection, (b) Eye detection on face area, (c) Pupil center location on eye ROI
1. face detection 2. eye detection
3. tracking of pupil movements
• recognition of a 6-digit number: up to 84.38%

Passive user identification systems
Source: Deb, D. et al. “Actions Speak Louder Than (Pass)words: Passive Authentication of Smartphone
Users via Deep Temporal Features.” 2019 International Conference on Biometrics (ICB) (2019): 1-8.
• touchscreen keyboard (keystroke dynamics)
• GPS (location)
• gyroscope
• magnetometer
• inclinometer
Figure 1: Authentication on smartphones by exploiting sensorial data has become an active field of research due to the growing number of available sensors in smartphones.
Implicit authentication using
smartphone sensors
97.15% within 3 seconds!

Specifics of using built-in sensors and systems
• sensors can be used indirectly to collect information
• it is difficult to balance the cost of a sensor network against its level of protection
• a biometric profile can be created and used without authorization

Moscow, Zelenograd, 26 February 2021
Thank you for your attention!
A.N. Boyko
