1. (Final Presentation) Faculty of Information Technology
Supervisor: Assoc Prof David Taniar
By: Jingxuan Wei (Tom)
25025031
2. Research Background
Instrumented Ore Car Program issue
Problem of the existing database
Research Question
Related work
Research Aim
Data Acquisition
MongoDB Import and Export Tools
Spark-MongoDB Application
Result analysis
Data Retrieval
Data retrieval by Spark SQL
Data retrieval by Spark filter operation
How to improve searching efficiency?
Conclusion and Future work
3. Railway in Mining, Pilbara Region, WA
Trains loaded with iron ore
Equipped with sensors that collect data as the train runs
Trained professionals maintain the sensors
Aims of the program:
• Monitor track and wagon performance
• Detect track abnormalities
4. What are the issues?
• Sensor selection
• Smart sensors are expensive.
• Less expensive sensors are inaccurate
(semi-structured data)
• Database issues
• Low data ingestion speed
• Searching takes too much time
Expected outcome:
Equip the wagons with many cheap sensors to
collect data and still obtain the desired
outcome (reduced cost).
5. Low data ingestion speed in the current database
• High-velocity data input:
• Each wagon is fitted with 16 sensors
• Each sensor produces 25 records per second
• Approximately 200 wagons per train
• At least 30 trains running at the same time
• Data velocity = 16 × 25 × 200 × 30 = 2,400,000 records per second
• Transaction management overhead of the relational database
Searching takes too much time:
• Large volume of unstructured data
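The velocity estimate above is simple arithmetic and can be sanity-checked in a few lines:

```python
# Sanity-check the data velocity estimate from the slide.
sensors_per_wagon = 16
records_per_sensor_per_sec = 25
wagons_per_train = 200
concurrent_trains = 30

records_per_second = (sensors_per_wagon
                      * records_per_sensor_per_sec
                      * wagons_per_train
                      * concurrent_trains)
print(records_per_second)  # 2400000
```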
6. DATA
Data information:
• Twenty-one attributes, including
train acceleration and
geographic information
(latitude and longitude)
• Missing track information
Solution:
• Append track information
Concept used:
• Geo-hashing algorithm
(Wolfson & Rigoutsos, 1997)
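Geohashing maps a (latitude, longitude) pair to a short base-32 string whose shared prefixes identify nearby locations, which is what lets missing track information be appended by area. A minimal illustrative encoder (a sketch of the standard algorithm, not the project's actual implementation):

```python
# Minimal geohash encoder (illustration only, not the project's code).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=9):
    """Encode (lat, lon) into a geohash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    even = True  # geohash interleaves bits, starting with longitude
    while len(bits) < precision * 5:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid
        else:
            bits.append(0)
            rng[1] = mid
        even = not even
    # Pack each group of 5 bits into one base-32 character.
    chars = []
    for i in range(0, len(bits), 5):
        idx = 0
        for b in bits[i:i + 5]:
            idx = (idx << 1) | b
        chars.append(BASE32[idx])
    return "".join(chars)

# Nearby points share prefixes, so a code prefix identifies a track area.
print(geohash_encode(57.64911, 10.40744, 11))  # u4pruydqqvj
```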
7. “How to improve the performance of data ingestion into the
database?”
“How to perform fast data retrieval in the IRT project?”
8. Using MongoDB to enhance the management of unstructured data
(Stevic, Milosavljevic, & Perisic, 2015)
Improvement of MongoDB auto-sharding (Liu, Wang, & Jin, 2012)
Spark SQL (Armbrust et al., 2015)
Previous work (benchmark model):
Given the infrastructure we have for processing, we successfully
processed 40,000 records per second.
With the same infrastructure, based on the file system (CSV files provided
by IRT), we successfully retrieved results from 40 GB of data in less
than 85 seconds.
16. Hashed Sharding vs Ranged Sharding (ingestion time in seconds)
Number of records:  40K   80K   160K  320K
Ranged sharding:    4.0   6.3   13.0  25.3
Hashed sharding:    3.0   4.0   11.0  20.3
17. Sharded Database vs Regular Database (ingestion time in seconds)
Number of records:  40K   80K   160K  320K
Sharded MongoDB:    3.0   4.3   11.0  20.3
MongoDB (regular):  2.3   4.0   9.3   18.7
18. The bottleneck occurs in the first stage.
Compared with the sharding-enabled database, the
regular database performs better in data
acquisition.
Neither acquisition result meets the industry
requirement (80,000 records per second).
19. We need to set up the Spark environment first.
Then create 80,000 records as an input batch.
Finally, store them into MongoDB.
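The actual Spark program was shown as screenshots on this slide. As a stand-in, here is a minimal stdlib-only sketch of the batch-creation step; the field names and value ranges are illustrative assumptions, and the real program would distribute the batch as a Spark RDD/DataFrame and write it through the MongoDB Spark connector:

```python
import random
import time

def make_batch(n=80_000, sensors_per_wagon=16):
    """Build one input batch of n synthetic sensor records (illustrative schema)."""
    now = time.time()
    batch = []
    for i in range(n):
        batch.append({
            "ts": now,                               # record timestamp
            "sensor_id": i % sensors_per_wagon,      # which of the 16 sensors
            "acceleration": random.gauss(0.0, 1.0),  # synthetic acceleration value
            "lat": -22.0 + random.random(),          # illustrative latitude
            "lon": 117.0 + random.random(),          # illustrative longitude
        })
    return batch

batch = make_batch()
# In the real pipeline this batch would be parallelized with Spark
# (e.g. spark.sparkContext.parallelize(batch)) and written to MongoDB
# via the MongoDB Spark connector; here we only check the batch shape.
print(len(batch))  # 80000
```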
20. MongoImport:
Sharded database: 4.3 s
Regular database: 4.0 s
Spark program: 1.4 s

Data inserting – Router (Master) (4 CPUs), time in milliseconds:
Number of records:  40,000  50,000  60,000  70,000  80,000
Insert time (ms):   822     1007    1167    1302    1444
28. Adopt a hash partitioner to
partition the data, and use
mapPartitionsWithIndex
to locate the target partition.
Perform the search only in the
target partition.
This narrows the search scope.
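The idea on this slide can be sketched in plain Python: hash the search key to find which partition holds it, then scan only that partition instead of all the data. The real implementation uses Spark's HashPartitioner and mapPartitionsWithIndex; this sketch only illustrates why the search scope narrows:

```python
NUM_PARTITIONS = 8

def partition_of(key):
    # Same idea as Spark's HashPartitioner: hash(key) mod numPartitions.
    return hash(key) % NUM_PARTITIONS

def build_partitions(records):
    """records: iterable of (key, value) pairs, bucketed by key hash."""
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for key, value in records:
        parts[partition_of(key)].append((key, value))
    return parts

def search_by_hash_partition(parts, key):
    # Look only inside the one partition that can contain the key,
    # mirroring mapPartitionsWithIndex plus a partition-index filter.
    target = partition_of(key)
    return [v for k, v in parts[target] if k == key]

records = [(f"sensor-{i}", i * i) for i in range(1000)]
parts = build_partitions(records)
print(search_by_hash_partition(parts, "sensor-42"))  # [1764]
```

Only about 1/NUM_PARTITIONS of the records are scanned per lookup, which is the source of the speedup measured on the next slide.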
29. Comparing the performance of the two searching approaches (milliseconds)
Test run:                  1      2      3
Search by hash partition:  30724  29058  27440
Search over all data:      44737  42773  43200
30. We have successfully created a system that is able to
accept the data as batches or streams.
We solved the low data ingestion speed problem by writing
a Spark program.
We successfully imported 1,400,000 records in one second
into the MongoDB server.
We performed searching using Spark SQL, executing
SQL queries over 40 GB of data within 11 seconds.
31. How to measure MongoDB query execution time in a very
large database.
An efficient searching mechanism in sharded MongoDB using
Spark.
32. Wolfson, H. J., & Rigoutsos, I. (1997). Geometric hashing: An overview. IEEE Computational Science and Engineering, 4(4), 10-21.
Stevic, M. P., Milosavljevic, B., & Perisic, B. R. (2015). Enhancing the management of unstructured data in e-learning systems using MongoDB. Program, 49(1), 91-114.
Liu, Y., Wang, Y., & Jin, Y. (2012). Research on the improvement of MongoDB auto-sharding in cloud environment. Paper presented at the 2012 7th International Conference on Computer Science & Education (ICCSE).
Armbrust, M., Xin, R. S., Lian, C., Huai, Y., Liu, D., Bradley, J. K., . . . Ghodsi, A. (2015). Spark SQL: Relational data processing in Spark. Paper presented at the Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data.
33. TEAM
Dr. Maria Indrawan-Santiago
Senior Lecturer
Faculty of IT
Prajwol Sangat
Research Assistant
Faculty of IT
Assoc Prof David Taniar
Associate Professor
Faculty of IT
Jingxuan Wei
Student
Faculty of IT
Subudh Sali
Student
Faculty of IT
Good morning and welcome to my final presentation. My name is Jingxuan Wei. My presentation topic is …
In today’s presentation I’m hoping to cover these points:
Firstly, I will introduce the research background.
Then, I will talk about the research question and aim.
There are two main parts in my research. The first part is the DA part; in my research, DA means data acquisition: the database acquires the data. We get the data and import it into the database.
After the DA part, I want to talk about DT. DT means data retrieval: retrieving useful information from the database.
And finally, I’ll mention the future work and give the conclusion.
Let’s start with the background of this research. This research is a collaboration with the Institute of Railway Technology (IRT) at Monash University.
Look at the picture on the right side; this is a mine railway…
Train length: >2 km
Load: >10 tons per wagon
Speed: 5-10 km per hour
Usually there are 200 wagons in a train
……………………………
Engineers in the railway industry want to … Therefore, there is a program called the IOC program, which e…
For example, if the train acceleration increases or decreases significantly, we can consider that a track abnormality has occurred. Therefore, my searching mainly focuses on the acceleration attribute.
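As a hypothetical illustration of this note (the threshold value is invented, not taken from the project), a significant jump between consecutive acceleration records could be flagged as a possible track abnormality:

```python
def flag_abnormalities(accelerations, threshold=2.5):
    """Return indices where acceleration changes by more than `threshold`
    between consecutive records (threshold is an illustrative value)."""
    flags = []
    for i in range(1, len(accelerations)):
        if abs(accelerations[i] - accelerations[i - 1]) > threshold:
            flags.append(i)
    return flags

readings = [0.1, 0.2, 0.1, 3.4, 0.2, 0.1]  # spike at index 3
print(flag_abnormalities(readings))  # [3, 4]
```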
The engineers in the IOC program faced two problems.
If we can improve the database performance (especially DA and DT efficiency), we are more likely to reach the expected outcome.
In general, by improving the database performance we can address part of the sensor issue: equip the wagons with plenty of cheap sensors and get similar performance without increasing expenditure.
Let’s look at the first problem…
The existing database is a relational database. The sensor data is large and comes in at a very fast pace. For example, if we assume that …
However, the current system could accept the data through only a single port. When the data velocity becomes faster, congestion happens at the database I/O port. The low data ingestion speed is also caused by the transaction management of the relational database.
Searching takes too much time: a relational database cannot handle unstructured data well.
We can generate a geohash code from the corresponding latitude and longitude. The geohash code can be used to identify records in a specific area.
Now let’s move on to the research question.
In a relational database, storing a large volume of unstructured data in binary-based columns dramatically increases the demand for hardware resources. Therefore, using a relational database to manage a huge amount of unstructured data is not a good choice.
In my original assumption
Config servers store the cluster’s metadata. This metadata contains a mapping of the cluster’s data set to the shards. The query router uses this metadata to target operations to specific shards.
Shards store the data. To provide high availability and data consistency, in a production sharded cluster each shard is a replica set.
Reads / Writes
MongoDB distributes the read and write workload across the shards in the sharded cluster, allowing each shard to process a subset of cluster operations.
Storage Capacity
High Availability
A sharded cluster can continue to perform partial read / write operations even if one or more shards are unavailable.
How to distribute data in the sharded cluster ?
In this case, compared with ranged sharding, hashed sharding spends less time on the data acquisition task.
When I use the mongoimport command to input data into the sharded MongoDB database, the mongos master, also called the router (4 CPUs, 130 GB), performs the data input job. The overall data input speed is dominated by the first phase, in which the application sends data to the router.
Spark SQL includes a cost-based optimizer, columnar storage, and code generation to make queries fast. At the same time, it scales to thousands of nodes and multi-hour queries using the Spark engine.
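As an illustration of the kind of query the retrieval step runs (the actual runs used Spark SQL over 40 GB of data; here the same SQL shape is shown against a tiny in-memory SQLite table, and the column names and threshold are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (sensor_id INTEGER, acceleration REAL)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [(1, 0.2), (1, 3.1), (2, 0.4), (2, 2.8), (3, 0.1)],
)

# Find records whose acceleration exceeds a (hypothetical) threshold;
# in Spark SQL the same statement would go through spark.sql(...).
rows = conn.execute(
    "SELECT sensor_id, acceleration FROM records WHERE acceleration > 2.5"
).fetchall()
print(rows)  # [(1, 3.1), (2, 2.8)]
```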
Narrows the search scope