1. (Final Presentation) Faculty of Information Technology
Supervisor: Assoc Prof David Taniar
By: Jingxuan Wei (Tom)
25025031
2. Research Background
Instrumented Ore Car Program issue
Problem of the existing database
Research Question
Related work
Research Aim
Data Acquisition
MongoDB Import and Export Tools
Spark-MongoDB Application
Result analysis
Data Retrieval
Data retrieval by Spark SQL
Data retrieval by Spark filter operation
How to improve searching efficiency?
Conclusion and Future work
3. Railway in Mining, Pilbara Region, WA
Trains loaded with iron ore
Equipped with sensors that collect data as the train runs
Trained professionals maintain the sensors
Aims of the program:
• Monitor track and wagon performance
• Detect track abnormalities
4. What are the issues?
• Sensor selection
• Smart sensors are expensive.
• Less expensive sensors are inaccurate
(semi-structured data)
• Database issues
• Low data ingestion speed
• Searching takes too much time
Expected outcome:
Equip the wagons with many cheap sensors to
collect data and still obtain the desired
outcome (reduced cost).
5. Low data ingestion speed in the current database
• High-velocity data input:
• Each wagon is fitted with 16 sensors
• Each sensor produces 25 records per second
• Approximately 200 wagons per train
• At least 30 trains running at the same time
• Data velocity = 16 × 25 × 200 × 30 = 2,400,000 records per second
• Transaction management overhead of the relational database
Searching takes too much time:
• Large volume of unstructured data
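The velocity estimate above is simple arithmetic and can be sanity-checked in a few lines:

```python
# Sanity-check the data velocity estimate from the slide.
sensors_per_wagon = 16
records_per_sensor_per_sec = 25
wagons_per_train = 200
concurrent_trains = 30

records_per_second = (sensors_per_wagon
                      * records_per_sensor_per_sec
                      * wagons_per_train
                      * concurrent_trains)
print(records_per_second)  # 2400000
```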
6. DATA
Data information:
• Twenty-one attributes, including
train acceleration and
geographic information
(latitude and longitude)
• Missing track information
Solution:
• Append track information
Concept used:
• Geo-hashing algorithm
(Wolfson & Rigoutsos, 1997)
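Geohashing maps a (latitude, longitude) pair to a short base-32 string whose shared prefixes identify nearby locations, which is what lets missing track information be appended by area. A minimal illustrative encoder (a sketch of the standard algorithm, not the project's actual implementation):

```python
# Minimal geohash encoder (illustration only, not the project's code).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=9):
    """Encode (lat, lon) into a geohash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    even = True  # geohash interleaves bits, starting with longitude
    while len(bits) < precision * 5:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid
        else:
            bits.append(0)
            rng[1] = mid
        even = not even
    # Pack each group of 5 bits into one base-32 character.
    chars = []
    for i in range(0, len(bits), 5):
        idx = 0
        for b in bits[i:i + 5]:
            idx = (idx << 1) | b
        chars.append(BASE32[idx])
    return "".join(chars)

# Nearby points share prefixes, so a code prefix identifies a track area.
print(geohash_encode(57.64911, 10.40744, 11))  # u4pruydqqvj
```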
7. “How to improve the performance of data ingestion into the
database?”
“How to perform fast data retrieval in the IRT project?”
8. Using MongoDB to enhance the management of unstructured data
(Stevic, Milosavljevic, & Perisic, 2015)
Improvement of MongoDB auto-sharding (Liu, Wang, & Jin, 2012)
Spark SQL (Armbrust et al., 2015)
Previous work (benchmark model):
Given the infrastructure we have for processing, we successfully
processed 40,000 records per second.
With the same infrastructure, based on the file system (CSV files provided
by IRT), we successfully retrieved results from 40 GB of data in less
than 85 seconds.
16. Hashed Sharding vs Ranged Sharding (ingestion time in seconds)
Number of records:  40K   80K   160K  320K
Ranged sharding:    4.0   6.3   13.0  25.3
Hashed sharding:    3.0   4.0   11.0  20.3
17. Sharded Database vs Regular Database (ingestion time in seconds)
Number of records:  40K   80K   160K  320K
Sharded MongoDB:    3.0   4.3   11.0  20.3
MongoDB (regular):  2.3   4.0   9.3   18.7
18. The bottleneck occurs in the first stage.
Compared with the sharding-enabled database, the
regular database performs better in data
acquisition.
Neither acquisition result meets the industry
requirement (80,000 records per second).
19. We need to set up the Spark environment first.
Then create 80,000 records as an input batch.
Finally, store them into MongoDB.
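The actual Spark program was shown as screenshots on this slide. As a stand-in, here is a minimal stdlib-only sketch of the batch-creation step; the field names and value ranges are illustrative assumptions, and the real program would distribute the batch as a Spark RDD/DataFrame and write it through the MongoDB Spark connector:

```python
import random
import time

def make_batch(n=80_000, sensors_per_wagon=16):
    """Build one input batch of n synthetic sensor records (illustrative schema)."""
    now = time.time()
    batch = []
    for i in range(n):
        batch.append({
            "ts": now,                               # record timestamp
            "sensor_id": i % sensors_per_wagon,      # which of the 16 sensors
            "acceleration": random.gauss(0.0, 1.0),  # synthetic acceleration value
            "lat": -22.0 + random.random(),          # illustrative latitude
            "lon": 117.0 + random.random(),          # illustrative longitude
        })
    return batch

batch = make_batch()
# In the real pipeline this batch would be parallelized with Spark
# (e.g. spark.sparkContext.parallelize(batch)) and written to MongoDB
# via the MongoDB Spark connector; here we only check the batch shape.
print(len(batch))  # 80000
```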
20. MongoImport:
Sharded database: 4.3 s
Regular database: 4.0 s
Spark program: 1.4 s

Data inserting – Router (Master) (4 CPUs), time in milliseconds:
Number of records:  40,000  50,000  60,000  70,000  80,000
Insert time (ms):   822     1007    1167    1302    1444
28. Adopt a hash partitioner to
partition the data, and use
mapPartitionsWithIndex
to locate the target partition.
Perform the search only in the
target partition.
This narrows the search scope.
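The idea on this slide can be sketched in plain Python: hash the search key to find which partition holds it, then scan only that partition instead of all the data. The real implementation uses Spark's HashPartitioner and mapPartitionsWithIndex; this sketch only illustrates why the search scope narrows:

```python
NUM_PARTITIONS = 8

def partition_of(key):
    # Same idea as Spark's HashPartitioner: hash(key) mod numPartitions.
    return hash(key) % NUM_PARTITIONS

def build_partitions(records):
    """records: iterable of (key, value) pairs, bucketed by key hash."""
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for key, value in records:
        parts[partition_of(key)].append((key, value))
    return parts

def search_by_hash_partition(parts, key):
    # Look only inside the one partition that can contain the key,
    # mirroring mapPartitionsWithIndex plus a partition-index filter.
    target = partition_of(key)
    return [v for k, v in parts[target] if k == key]

records = [(f"sensor-{i}", i * i) for i in range(1000)]
parts = build_partitions(records)
print(search_by_hash_partition(parts, "sensor-42"))  # [1764]
```

Only about 1/NUM_PARTITIONS of the records are scanned per lookup, which is the source of the speedup measured on the next slide.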
29. Comparing the performance of the two searching approaches (milliseconds)
Test run:                  1      2      3
Search by hash partition:  30724  29058  27440
Search over all data:      44737  42773  43200
30. We have successfully created a system that is able to
accept the data as batches or streams.
We solved the low data ingestion speed problem by writing
a Spark program.
We successfully imported 1,400,000 records in one second
into the MongoDB server.
We performed searching using Spark SQL, executing
SQL queries over 40 GB of data within 11 seconds.
31. How to measure MongoDB query execution time in a very
large database.
An efficient searching mechanism in sharded MongoDB using
Spark.
32. Wolfson, H. J., & Rigoutsos, I. (1997). Geometric hashing: An overview. IEEE Computational Science and Engineering, 4(4), 10-21.
Stevic, M. P., Milosavljevic, B., & Perisic, B. R. (2015). Enhancing the management of unstructured data in e-learning systems using MongoDB. Program, 49(1), 91-114.
Liu, Y., Wang, Y., & Jin, Y. (2012). Research on the improvement of MongoDB auto-sharding in cloud environment. Paper presented at the 2012 7th International Conference on Computer Science & Education (ICCSE).
Armbrust, M., Xin, R. S., Lian, C., Huai, Y., Liu, D., Bradley, J. K., . . . Ghodsi, A. (2015). Spark SQL: Relational data processing in Spark. Paper presented at the Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data.
33. TEAM
Dr. Maria Indrawan-Santiago
Senior Lecturer
Faculty of IT
Prajwol Sangat
Research Assistant
Faculty of IT
Assoc Prof David Taniar
Associate Professor
Faculty of IT
Jingxuan Wei
Student
Faculty of IT
Subudh Sali
Student
Faculty of IT
Good morning and welcome to my final presentation. My name is Jingxuan Wei. My presentation topic is …
In today’s presentation I’m hoping to cover these points:
Firstly, I will introduce the research background.
Then, I will talk about the research question and aim.
There are two main parts in my research. The first part is the DA part; in my research, DA means data acquisition: the database acquires the data. We get the data and import it into the database.
After the DA part, I want to talk about DT. DT means data retrieval: retrieving useful information from the database.
And finally, I’ll mention the future work and give the conclusion.
Let’s start with the background of this research. This research is a collaboration with the Institute of Railway Technology (IRT) at Monash University.
Look at the picture on the right side; this is a mine railway…
Train length: >2 km
Load: >10 tons per wagon
Speed: 5-10 km per hour
Usually there are 200 wagons in a train
……………………………
Engineers in the railway industry want to … Therefore, there is a program called the IOC program, which e…
For example, if the train acceleration increases or decreases significantly, we can consider that a track abnormality has occurred. Therefore, my searching mainly focuses on the acceleration attribute.
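As a hypothetical illustration of this note (the threshold value is invented, not taken from the project), a significant jump between consecutive acceleration records could be flagged as a possible track abnormality:

```python
def flag_abnormalities(accelerations, threshold=2.5):
    """Return indices where acceleration changes by more than `threshold`
    between consecutive records (threshold is an illustrative value)."""
    flags = []
    for i in range(1, len(accelerations)):
        if abs(accelerations[i] - accelerations[i - 1]) > threshold:
            flags.append(i)
    return flags

readings = [0.1, 0.2, 0.1, 3.4, 0.2, 0.1]  # spike at index 3
print(flag_abnormalities(readings))  # [3, 4]
```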
The engineers in the IOC program faced two problems.
If we can improve the database performance (especially DA and DT efficiency), we are more likely to reach the expected outcome.
In general, by improving the database performance we can address part of the sensor issue: equip the wagons with plenty of cheap sensors and get similar performance without increasing expenditure.
Let’s look at the first problem…
The existing database is a relational database. The sensor data is large and comes in at a very fast pace. For example, if we assume that …
However, the current system could accept the data through only a single port. When the data velocity becomes faster, congestion happens at the database I/O port. The low data ingestion speed is also caused by the transaction management of the relational database.
Searching takes too much time: a relational database cannot handle unstructured data well.
We can generate a geohash code from the corresponding latitude and longitude. The geohash code can be used to identify records in a specific area.
Now let’s move on to the research question.
In a relational database, storing a large volume of unstructured data in binary-based columns dramatically increases the demand for hardware resources. Therefore, using a relational database to manage a huge amount of unstructured data is not a good choice.
In my original assumption
Config servers store the cluster’s metadata. This metadata contains a mapping of the cluster’s data set to the shards. The query router uses this metadata to target operations to specific shards.
Shards store the data. To provide high availability and data consistency, in a production sharded cluster each shard is a replica set.
Reads / Writes
MongoDB distributes the read and write workload across the shards in the sharded cluster, allowing each shard to process a subset of cluster operations.
Storage Capacity
High Availability
A sharded cluster can continue to perform partial read / write operations even if one or more shards are unavailable.
How to distribute data in the sharded cluster ?
In this case, compared with ranged sharding, hashed sharding spends less time on the data acquisition task.
When I use the mongoimport command to input data into the sharded MongoDB database, the mongos master, also called the router (4 CPUs, 130 GB), performs the data input job. The overall data input speed is dominated by the first phase, in which the application sends data to the router.
Spark SQL includes a cost-based optimizer, columnar storage, and code generation to make queries fast. At the same time, it scales to thousands of nodes and multi-hour queries using the Spark engine.
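As an illustration of the kind of query the retrieval step runs (the actual runs used Spark SQL over 40 GB of data; here the same SQL shape is shown against a tiny in-memory SQLite table, and the column names and threshold are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (sensor_id INTEGER, acceleration REAL)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [(1, 0.2), (1, 3.1), (2, 0.4), (2, 2.8), (3, 0.1)],
)

# Find records whose acceleration exceeds a (hypothetical) threshold;
# in Spark SQL the same statement would go through spark.sql(...).
rows = conn.execute(
    "SELECT sensor_id, acceleration FROM records WHERE acceleration > 2.5"
).fetchall()
print(rows)  # [(1, 3.1), (2, 2.8)]
```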
Narrows the search scope