Search, Discovery and Analysis of Sensory Data Streams
1
Payam Barnaghi
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey
Care Technology & Research Centre, The UK Dementia Research Institute (DRI)
SAW2019: 1st International Workshop on Sensors and Actuators on the Web
46 years ago on the 5th of November (submission day)
2
Source: https://www.cs.princeton.edu/courses/archive/fall06/cos561/papers/cerf74.pdf
• A 32-bit IP address was used, of which the first 8 bits signified the network and the remaining 24 bits designated the host on that network.
• The assumption was that 256 networks would be sufficient for the foreseeable future…
• Obviously this was before LANs (Ethernet was under development at Xerox PARC at that time).
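As an aside, the 8/24 split above is easy to show in code. A minimal Python sketch (not from the 1974 paper; the helper name and the example address are made up for illustration):

```python
# Minimal sketch: split a 32-bit address into the original
# 8-bit network / 24-bit host fields described in the 1974 design.

def split_address(addr: int) -> tuple[int, int]:
    """Return (network, host) for a 32-bit address."""
    network = (addr >> 24) & 0xFF   # first 8 bits: at most 256 networks
    host = addr & 0x00FFFFFF        # remaining 24 bits: host on that network
    return network, host

# Example (hypothetical address 10.1.2.3): network 10, host 0x010203
addr = (10 << 24) | (1 << 16) | (2 << 8) | 3
print(split_address(addr))          # (10, 66051)
```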
Around 20 years later…
3
Web search in the early days
4
And there came Google!
5
Google says the web now has over 30 trillion unique pages. The exact number is probably not even that relevant anymore; many resources are dynamic…
The Crawling problem
6
Source: https://www.bruceclay.com/seo/submit-website/
The Web content search lifecycle
− Creation
− Upload
− Crawling
− Indexing
− Delete/Update
− Query
− Search and discovery
− Processing
− Ranking
− Presentation
7
(On the slide, these stages are grouped under two headings: Content and Access.)
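To make the lifecycle concrete, a deliberately toy sketch of the content side (crawl and index) and the access side (query, rank, present) follows. The in-memory inverted index and the match-count ranking are simplifying assumptions for illustration, not how a production search engine works:

```python
# Toy content/access pipeline: crawl and index documents, then query and rank them.
from collections import defaultdict

index = defaultdict(set)     # term -> set of document ids
documents = {}               # document id -> text

def crawl_and_index(doc_id: str, text: str) -> None:
    """Content side: ingest a crawled page and index its terms."""
    documents[doc_id] = text
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query: str) -> list[str]:
    """Access side: query, discover matching documents and rank them."""
    hits = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    return sorted(hits, key=hits.get, reverse=True)   # rank by number of matched terms

crawl_and_index("doc1", "sensors and actuators on the web")
crawl_and_index("doc2", "searching sensory data streams on the web")
print(search("sensory web"))   # ['doc2', 'doc1']
```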
However, pages are not the only things on the web…
8
Image source: Youmegeek.com
Internet of Things (IoT) Search
9
10
http://Thingful.net
11
http://Thingful.net
12
13
14
Image sources: Wolfram Alpha
Search and automation
15
Source: Passler.com
Sensory data
16
Sensor Data Flow on the Web
17
P. Barnaghi, A. Sheth, “On Searching the Internet of Things: Requirements and Challenges”, IEEE Intelligent Systems, 2016.
18
https://iotcrawler.eu
Searching for…
19
(Y. Fathy, P. Barnaghi, et al., 2018)
Searching for Sensory Devices
(i.e. Resources)
20
Semantic models
21
Semantic models
22
LSM : A Semantic Approach
23
(Danh Le-Phuoc et al., ISWC, 2011)
A discovery engine for the IoT
24
(Hosseini Tabatabaei, Barnaghi, et al., 2018)
A GMM model for indexing
25
Average success rates:
− First attempt: 92.3% (min)
− At first DS: 92.5% (min)
− At first DSL2: 98.5% (min)
[Figure: percentage of the total queries (log scale, 10^-4 to 10^0) vs. number of attempts (0–60), plotted for DSL2 capacity 1 to 4.]
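The cited indexing work is based on a Gaussian Mixture Model over resource descriptions; the sketch below only illustrates the general idea with scikit-learn on made-up attribute vectors (location and sampling rate are assumed features) and is not the method or data from the paper:

```python
# Rough illustration: fit a GMM over IoT resource attribute vectors and
# route a lookup to the mixture component the query most likely belongs to.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Fake attribute vectors (lat, lon, sampling rate) for 300 registered resources
resources = np.vstack([
    rng.normal([51.2, -0.6, 1.0], 0.05, size=(150, 3)),   # cluster A
    rng.normal([48.8,  2.3, 5.0], 0.05, size=(150, 3)),   # cluster B
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(resources)
labels = gmm.predict(resources)                    # index: resource -> component

def lookup(query_vector):
    """Return indices of resources in the component the query most likely falls in."""
    component = gmm.predict(np.asarray(query_vector).reshape(1, -1))[0]
    return np.where(labels == component)[0]

print(len(lookup([51.25, -0.58, 1.2])))            # ~150 resources from cluster A
```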
26
However, there are also other possible solutions:
(Y. Fathy, P. Barnaghi, et al., 2017)
(A. Hosseini Tabatabaei, P. Barnaghi, et al., 2019)
The Crawling and Update Issue
27
The Crawling Challenge
− Uniform policy: re-visiting all pages in the collection with the same frequency, regardless of their rates of change.
− Proportional policy: re-visiting more often the pages that change more frequently. The visiting frequency is directly proportional to the (estimated) change frequency.
28
Cho, Junghoo; Garcia-Molina, Hector (2003). "Effective page refresh policies for Web
crawlers". ACM Transactions on Database Systems. 28 (4): 390–426.
Web Crawling
− Cho and Garcia-Molina proved the surprising result that, in terms of average freshness, the uniform policy outperforms the proportional policy in both a simulated Web and a real Web crawl.
− The explanation is that the proportional policy allocates too many new crawls to rapidly changing pages at the expense of less frequently updated pages.
− A proportional policy allocates more resources to crawling frequently updated pages, but experiences less overall freshness time from them.
29
Source: Wikipedia
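A toy numerical check of this result is sketched below. It assumes Poisson page changes, equally spaced revisits and a fixed total crawl budget, and uses the standard closed form for expected freshness under those assumptions; it is not the simulation or crawl data from the cited study:

```python
# Compare average freshness of the uniform and proportional re-visit policies
# under a fixed crawl budget, assuming Poisson page changes (rate lam) and
# equally spaced revisits: expected freshness = (1 - exp(-lam*T)) / (lam*T).
import numpy as np

rng = np.random.default_rng(1)
lam = rng.lognormal(mean=0.0, sigma=1.5, size=1000)   # heavy-tailed change rates
budget = len(lam)                                     # total crawls per time unit

def avg_freshness(freq):
    """Average expected freshness given per-page revisit frequencies."""
    T = 1.0 / freq
    return np.mean((1.0 - np.exp(-lam * T)) / (lam * T))

uniform = np.full_like(lam, budget / len(lam))        # same frequency for every page
proportional = budget * lam / lam.sum()               # frequency proportional to change rate

print(f"uniform policy:      {avg_freshness(uniform):.3f}")
print(f"proportional policy: {avg_freshness(proportional):.3f}")
# The uniform policy comes out higher, in line with the result quoted above.
```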
Crawling and the Freshness Issue
− To improve freshness, the crawler should penalise the elements that change too often.
− The optimal re-visiting policy is neither the uniform policy nor the proportional policy.
− The optimal method for keeping average freshness high includes ignoring the pages that change too often, and the optimal for keeping average age low is to use access frequencies that monotonically (and sub-linearly) increase with the rate of change of each page.
30
Junghoo Cho; Hector Garcia-Molina (2003). "Estimating frequency of change". ACM
Transactions on Internet Technology. 3 (3): 256–290.
Source: Wikipedia
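To picture the "sub-linear" allocation mentioned above, the sketch below spreads a fixed crawl budget in proportion to the change rate raised to a power below one; the power-law form and the exponent are arbitrary illustrative choices, not the optimal schedule derived by Cho and Garcia-Molina:

```python
# Illustrative (not optimal) sub-linear allocation of a fixed crawl budget:
# revisit frequency grows like lam**alpha with alpha < 1, so fast-changing pages
# get more visits than under the uniform policy, but fewer than proportional.
import numpy as np

lam = np.array([0.1, 0.5, 1.0, 5.0, 20.0])   # estimated change rates (per day)
budget = 10.0                                # total crawls per day
alpha = 0.5                                  # sub-linear exponent (assumption)

raw = lam ** alpha
freq = budget * raw / raw.sum()              # normalise to the crawl budget

for rate, f in zip(lam, freq):
    print(f"change rate {rate:5.1f}/day -> revisit {f:5.2f} times/day")
```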
Searching the content of data streams
31
Patterns and segmentation of time-series data
32
But the data is often multidimensional and multivariate
33
Credit: Shirin Enshaeifar, CR&T Centre, UK Dementia Research Institute/CVSSP, Uni of Surrey
Creating patterns from streaming data
34
(Gonzalez-Vidal, Barnaghi, Skarmeta, IEEE TKDE, 2018)
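As a rough idea of what segmenting a stream into patterns can look like, here is a much simplified two-window change detector on synthetic data; it is neither BEATS nor the adaptive detector from the cited papers, and the window size and threshold are arbitrary illustrative choices:

```python
# Segment a univariate stream by comparing the means of two adjacent sliding windows.
import numpy as np

def segment(stream, window=20, threshold=1.5):
    """Return indices where a new quasi-stationary segment is detected."""
    boundaries = [0]
    for t in range(window, len(stream) - window):
        left = stream[t - window:t]
        right = stream[t:t + window]
        if abs(right.mean() - left.mean()) > threshold * left.std(ddof=1):
            if t - boundaries[-1] >= 2 * window:   # avoid re-firing on the same change
                boundaries.append(t)
    return boundaries

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(3, 1, 200), rng.normal(-2, 1, 200)])
print(segment(x))   # detected boundaries fall near the true change points at 200 and 400
```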
IoTCrawler search engine
35
http://iot-crawler.ee.surrey.ac.uk/search-engine/
36
http://iot-crawler.ee.surrey.ac.uk/search-engine/
Pattern analysis
37
[Figure: two panels of aggregated daily patterns over two weeks (Time vs. Days).]
(Enshaeifar, Barnaghi, et al., PLoS ONE, 2018)
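One simple reading of the aggregated daily patterns above is an average over each hour-of-day slot across a two-week window. The pandas sketch below does exactly that on synthetic counts; the real study uses in-home sensor data, not this toy series:

```python
# Build an aggregated daily pattern from two weeks of hourly sensor event counts.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
idx = pd.date_range("2019-01-01", periods=14 * 24, freq="H")        # two weeks, hourly
activity = pd.Series(rng.poisson(lam=5, size=len(idx)), index=idx)  # fake event counts

daily_pattern = activity.groupby(activity.index.hour).mean()        # 24 values, one per hour
print(daily_pattern)
```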
Developing end-to-end solutions
38
(Enshaeifar, Barnaghi, et al., 2019)
Some of the Research Challenges
− Provenance monitoring and fact-checking algorithms and tools
− Dealing with noisy, incomplete and dynamic data
− Handling and processing large data streams; search and identification of patterns
− Crawling, search and query of changing data
− Multi-modal information analysis, and continual and adaptive learning algorithms
− Security, privacy, trust and accessibility
− Solutions to keep (and make) the Web a safe, open, inclusive and collaborative environment
39
Some (other) important issues
40
How representative is your data?
41
The issue of trust and reliability
42
How stable are the models that you learn from your data?
43
Credits: Roonak Rezvani, CR&T Centre, UK Dementia Research Institute/CVSSP, Uni of Surrey
Dynamicity and machine learning issue
44
− Noise and missing data
− Pattern and change representation
− Continual and adaptive learning
− Network and causation analysis
Avoid (unnecessary) complexity
45
Be ready for setbacks
46
References
− S. Enshaeifar et al., "Health management and pattern analysis of daily living activities of people with Dementia using in-home sensors and machine learning techniques", PLoS ONE 13(5): e0195605, 2018.
− A. González Vidal, P. Barnaghi, A. F. Skarmeta, "BEATS: Blocks of Eigenvalues Algorithm for Time series Segmentation", IEEE Transactions on Knowledge and Data Engineering (TKDE), 2018.
− Y. Fathy, P. Barnaghi, R. Tafazolli, "An Online Adaptive Algorithm for Change Detection in Streaming Sensory Data", IEEE Systems Journal, 2018.
− Y. Fathy, P. Barnaghi, R. Tafazolli, "Large-Scale Indexing, Discovery and Ranking for the Internet of Things (IoT)", ACM Computing Surveys, 2017.
− S. A. Hosseini Tabatabaei, Y. Fathy, P. Barnaghi, C. Wang, R. Tafazolli, "A Novel Indexing Method for Scalable IoT Source Lookup", IEEE Internet of Things Journal, 2018.
− Y. Fathy, P. Barnaghi, R. Tafazolli, "Distributed Spatial Indexing for the Internet of Things Data Management", Proc. of IFIP/IEEE International Symposium on Integrated Network Management, Lisbon, Portugal, May 2017.
47
Acknowledgments
48
Thank you!
http://personal.ee.surrey.ac.uk/Personal/P.Barnaghi/
@pbarnaghi
p.barnaghi@surrey.ac.uk
https://ukdri.ac.uk/team/payam-barnaghi

Editor's Notes

• #25: The entropy of the (x,y,z) triple on D, where D is the set of data items.