(Final Presentation)
Faculty of Information Technology
Supervisor: Assoc Prof David Taniar
BY: JINGXUAN WEI (Tom)
25025031
1
 Research Background
 Instrumented Ore Car Program issues
 Problems of the existing database
 Research Question
 Related works
 Research Aim
 Data Acquisition
 MongoDB Import and Export Tools
 Spark-MongoDB Application
 Result analysis
 Data Retrieval
 Data retrieval by Spark SQL
 Data retrieval by Spark filter operation
 How to improve searching efficiency?
 Conclusion and Future work
2
3
 Railway in mining, Pilbara region, WA
 Loaded with iron ore
 Equipped with sensors to collect data as the train runs
 Trained professionals maintain the sensors
 Aim of the program:
• Monitor track and wagon performance
• Detect track abnormalities
What are the issues?
• Sensor selection
• Smart sensors are expensive.
• Less expensive sensors are inaccurate
(semi-structured data)
• Database issues
• Low data ingestion speed
• Too much time spent on searching
Expected outcome:
Equip the wagons with many cheap sensors to
collect data in order to obtain the desired
outcome (reduce cost).
4
Low data ingestion speed in the current database
• High-velocity data input:
• Each wagon is fitted with 16 sensors
• Each sensor produces 25 records per second
• Approximately 200 wagons in one train
• At least 30 trains running at the same time
• Data Velocity = 16 × 25 × 200 × 30 = 2,400,000 records per second
• Transaction management of the relational database
Too much time spent on searching
• Large volume of unstructured data
5
DATA
6
Data Information:
• Twenty-one attributes, including
train acceleration and geographic
information (latitude and longitude)
• Missing track information
Solution:
• Append track information
Concept used:
• Geohashing algorithm
(Wolfson & Rigoutsos, 1997)
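The slide names the concept only, so here is a minimal Scala sketch of the standard geohash encoding that could tag each record with an area code derived from its latitude and longitude; the precision default and the sample coordinates are illustrative assumptions, not the project's implementation.

object Geohash {
  private val Base32 = "0123456789bcdefghjkmnpqrstuvwxyz"

  // Standard geohash: repeatedly bisect the longitude and latitude ranges,
  // alternating between them, and emit one base-32 character per 5 bits.
  def encode(lat: Double, lon: Double, precision: Int = 7): String = {
    var (latMin, latMax) = (-90.0, 90.0)
    var (lonMin, lonMax) = (-180.0, 180.0)
    val sb = new StringBuilder
    var even = true; var bits = 0; var ch = 0
    while (sb.length < precision) {
      if (even) {
        val mid = (lonMin + lonMax) / 2
        if (lon >= mid) { ch = (ch << 1) | 1; lonMin = mid }
        else { ch = ch << 1; lonMax = mid }
      } else {
        val mid = (latMin + latMax) / 2
        if (lat >= mid) { ch = (ch << 1) | 1; latMin = mid }
        else { ch = ch << 1; latMax = mid }
      }
      even = !even; bits += 1
      if (bits == 5) { sb.append(Base32(ch)); bits = 0; ch = 0 }
    }
    sb.toString
  }
}

// Records whose geohash codes share a prefix lie in the same area, so the
// code can serve as the appended track identifier (coordinates hypothetical).
val cell = Geohash.encode(-22.59, 117.18)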
“How to improve the performance of data ingestion into the
database?”
“How to perform fast data retrieval in the IRT project?”
7
 Using MongoDB to enhance the management of unstructured data
(Stevic, Milosavljevic, & Perisic, 2015)
 Improvement of MongoDB auto-sharding (Liu, Wang, & Jin, 2012)
 Spark SQL (Armbrust et al., 2015)
Previous work (benchmark model)
 Given the infrastructure we have for processing, we have successfully
processed 40,000 records per second.
 With the same infrastructure, based on the file system (CSV files provided
by IRT), we have successfully retrieved results for 40 GB of data in less
than 85 seconds.
8
Scalable Techniques for Parallel Data Acquisition and
Retrieval of High-Velocity Data
9
 NoSQL Document Database
 Handles unstructured data well
 Improved storage capacity
10
Approaches taken:
MongoDB Default Import Tool
Regular MongoDB
MongoDB Sharded Cluster
Spark-MongoDB Application
11
12
Command:
mongoimport --db RegularDB --collection railwayDataCollection --type csv --headerline --file /mnt/data/IRTRailwayData80K.csv
13
 Sharded MongoDB cluster
 Divides the data set and distributes it over multiple shards. Each shard is an independent database.
14
 Reads / Writes
 Storage Capacity
 High Availability
 Hashed Sharding
 sh.shardCollection("<database>.<collection>", { <field> : "hashed" } )
 Ranged Sharding
 sh.shardCollection("<database>.<collection>", { <field> : <direction> } )
15
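As a concrete illustration of the two templates above (the key fields are hypothetical, and the two shardCollection calls are alternatives for the same collection, not meant to be run together), sharding is first enabled on the database used by the earlier mongoimport command:

sh.enableSharding("RegularDB")
// Hashed: documents are spread evenly by a hash of the key.
sh.shardCollection("RegularDB.railwayDataCollection", { recordId: "hashed" })
// Ranged: adjacent key values stay together on the same shard.
sh.shardCollection("RegularDB.railwayDataCollection", { timestamp: 1 })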
16
Hashed Sharding vs Ranged Sharding (import time in seconds, by number of records):

Records           40K   80K   160K  320K
Ranged Sharding   4.0   6.3   13.0  25.3
Hashed Sharding   3.0   4.0   11.0  20.3
17
Sharded Database vs Regular Database (import time in seconds, by number of records):

Records             40K   80K   160K  320K
Sharded MongoDB     3.0   4.3   11.0  20.3
MongoDB (Regular)   2.3   4.0   9.3   18.7
18
 The bottleneck occurs in the first phase (the application inputting data into the router).
 Compared with the sharding-enabled database, the regular database performs better in data acquisition.
 The acquisition result cannot meet the industry requirement (80,000 records per second).
19
 We need to set up the Spark environment first.
 Create 80,000 records as an input batch.
 Store the batch into MongoDB. (A sketch of these three steps follows.)
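The slide presented these steps as code screenshots that are missing from the extracted text. A minimal sketch, assuming the MongoDB Spark Connector (com.mongodb.spark) and its documented MongoSpark.save entry point; the URI and the field names are hypothetical stand-ins for the 21 real sensor attributes.

import org.apache.spark.{SparkConf, SparkContext}
import com.mongodb.spark.MongoSpark
import org.bson.Document

// 1. Set up the Spark environment; the connector takes the target
//    database and collection from this output URI (host is hypothetical).
val conf = new SparkConf()
  .setAppName("RailwayBatchIngest")
  .set("spark.mongodb.output.uri", "mongodb://127.0.0.1:27017/RegularDB.railwayDataCollection")
val sc = new SparkContext(conf)

// 2. Create 80,000 records as one input batch (field names made up).
val batch = sc.parallelize(1 to 80000).map { i =>
  Document.parse(s"""{ "recordId": $i, "accR3": ${math.random * 10} }""")
}

// 3. Store the batch into MongoDB in parallel from the Spark workers.
MongoSpark.save(batch)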
20
mongoimport comparison (80,000-record batch, matching the tables above):
Sharded database: 4.3 s
Regular database: 4.0 s
Spark program: 1.4 s

Data inserting – Router (Master) (4 CPUs), time by batch size:

Records        40000  50000  60000  70000  80000
Milliseconds   822    1007   1167   1302   1444
21
Data inserting – Server (16 CPUs), time by batch size:

Records        120000  130000  140000  150000  160000
Milliseconds   860     931     1031    1053    1134
22
 Database level
 Application level
23
Searching performance between sharded MongoDB and regular MongoDB (time in milliseconds, by number of records):

Records           2970000  5940000  8910000  11880000
Regular MongoDB   4997     9794     11942    14652
Sharded MongoDB   2330     7134     8509     11073
Conclusion:
1. Sharded MongoDB performs faster searching than regular MongoDB.
2. It is hard to measure query execution time when the dataset is too big.
db.getCollection('Test').find({ 'accR3': { $gt: 4, $lt: 6 } }).explain('executionStats')
 Create a Spark SQL object
24
 Register a temp table and run the search query.
 Sample result. (A sketch of these steps follows.)
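The bullets above correspond to code screenshots that did not survive extraction. Below is a minimal sketch in the Spark 1.x style matching the slide's "temp table" wording; the CSV path reuses the file from the mongoimport slide, while the spark-csv package, its options, and the show(5) call are assumptions.

import org.apache.spark.sql.SQLContext

// Create the Spark SQL object from an existing SparkContext (sc).
val sqlContext = new SQLContext(sc)

// Load the railway data (spark-csv package assumed for Spark 1.x).
val readData = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/mnt/data/IRTRailwayData80K.csv")

// Register a temp table and run the search query shown on the next slide.
readData.registerTempTable("railwayData")
val result = sqlContext.sql("SELECT * FROM railwayData WHERE accR3 > 4 AND accR3 < 6")

// Sample result: print the first few matching records.
result.show(5)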
 Perform the search using the filter operation
25
26
Data searching times in milliseconds:

                Local Machine (2 CPUs, i5)   Server (16 CPUs, regular database)
Data size       Filter Search  Spark SQL     Filter Search  Spark SQL
4G (3.92G)      16881          877           5736           557/602
8G (7.85G)      52229          2012          13281          1753
12G (11.78G)    N/A            N/A           19556          3323
16G (15.70G)    N/A            N/A           31179          4518
40G (39.93G)    N/A            N/A           79893          8783
45G (44.83G)    N/A            N/A           86399          10883

Query
1. SELECT * FROM railwayData WHERE accR3 > 4 AND accR3 < 6
2. val result = readData.filter(readData("acc.r3") >= 4 && readData("acc.r3") <= 6)
27
(Key-Value)
 Adopt a hash partitioner to partition the data, and use mapPartitionsWithIndex to reach the target partition (see the sketch below).
 Perform the search in the target partition only.
 Narrows the search scope.
28
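A minimal sketch of the idea under stated assumptions: records are keyed (here by a hypothetical geohash cell like those produced earlier), hash-partitioned, and only the partition that the target key hashes to is scanned; sc is an existing SparkContext.

import org.apache.spark.HashPartitioner

val numPartitions = 16
val partitioner = new HashPartitioner(numPartitions)

// Key each record and hash-partition, so every record with a given key
// lands in one known partition.
val keyed = sc.parallelize(Seq(("9me4k", 4.2), ("9me4m", 7.9), ("9me4k", 5.5)))
val partitioned = keyed.partitionBy(partitioner).cache()

// Compute which partition holds the key of interest...
val target = partitioner.getPartition("9me4k")

// ...and scan only that partition with mapPartitionsWithIndex, skipping
// the rest of the data and narrowing the search scope.
val hits = partitioned.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == target) iter.filter { case (_, v) => v >= 4 && v <= 6 } else Iterator.empty
}
hits.collect().foreach(println)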
29
Comparing the performance of the two searching approaches (time in milliseconds, over three test runs):

Test run                   1      2      3
Search by Hash Partition   30724  29058  27440
Search for all data        44737  42773  43200
We have successfully created a system that is able to accept the data as batches or streams.
We solved the low data ingestion speed problem by writing a Spark program.
We have successfully imported 1,400,000 records in one second into the MongoDB server.
We performed searching using Spark SQL, executing an SQL query over 40 GB of data within 11 seconds.
30
 How to measure MongoDB query execution time in a very large database.
 An efficient searching mechanism in sharded MongoDB using Spark.
31
 Wolfson, H. J., & Rigoutsos, I. (1997). Geometric hashing: An overview. IEEE Computational Science and Engineering, 4(4), 10-21.
 Stevic, M. P., Milosavljevic, B., & Perisic, B. R. (2015). Enhancing the management of unstructured data in e-learning systems using MongoDB. Program, 49(1), 91-114.
 Liu, Y., Wang, Y., & Jin, Y. (2012). Research on the improvement of MongoDB auto-sharding in cloud environment. Paper presented at the 2012 7th International Conference on Computer Science & Education (ICCSE).
 Armbrust, M., Xin, R. S., Lian, C., Huai, Y., Liu, D., Bradley, J. K., . . . Ghodsi, A. (2015). Spark SQL: Relational data processing in Spark. Paper presented at the Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data.
32
TEAM
33
Dr. Maria Indrawan-Santiago
Senior Lecturer
Faculty of IT
Prajwol Sangat
Research Assistant
Faculty of IT
Assoc Prof David Taniar
Associate Professor
Faculty of IT
Jingxuan Wei
Student
Faculty of IT
Subudh Sali
Student
Faculty of IT
34
(Final Presentation)
Supervisor: Assoc Prof David Taniar
BY: JINGXUAN WEI (Tom)
25025031
Editor's Notes
  1. Good morning and welcome to my final presentation. My name is Jingxuan Wei. My presentation topic is …
  2. In today’s presentation I’m hoping to cover these points: firstly, I will introduce the research background; then I will talk about the research question and aim. There are two main parts in my research. The first part is the DA part; in my research, DA means data acquisition, i.e. the database acquiring data: we get data and import it into the database. After the DA part, I want to talk about DT, which means retrieving useful information from the database. Finally, I’ll mention the future work and give the conclusion.
  3. Let’s start with the background of this research. This research is a collaboration with the Institute of Railway Technology (IRT) at Monash University. Look at the picture on the right side; this is a mine railway… Train length: >2 km; load: >10 tons per wagon; speed: 5-10 km per hour. Usually there are 200 wagons in a train. The engineers in the railway industry want to … Therefore, there is a program called the IOC program, which e… For example, if the train acceleration increases or decreases significantly, we can consider that a track abnormality has occurred. Therefore, my searching mainly focuses on the acceleration attribute.
  4. The engineers in the IOC program faced two problems. If we can improve the database performance (especially DA and DT efficiency), we are more likely to reach the expected outcome. In general, we can improve the database performance to address part of the sensor issue: equip the wagons with plenty of cheap sensors and get similar performance without increasing expenditure.
  5. Let’s look at the first problem… The existing database is a relational database. The sensor data is large and comes in at a very fast pace. For example, if we assume that … However, the current system was able to accept the data on only a single port. When the data velocity becomes faster, data congestion will happen at the database I/O port. Low data ingestion speed is also caused by the transaction management of the relational database. Too much time is spent on searching: the relational database cannot handle unstructured data well.
  6. We can generate a geohash code from the corresponding latitude and longitude. The geohash code can be used to identify the records in a specific area.
  7. Now let’s move on to the research question.
  8. In a relational database, storing a large volume of unstructured data in binary-based columns will dramatically increase the demand for hardware resources. Therefore, using a relational database to manage a huge amount of unstructured data is not a good choice.
  9. In my original assumption… Config servers store the cluster’s metadata. This data contains a mapping of the cluster’s data set to the shards. The query router uses this metadata to target operations to specific shards. Shards store the data. To provide high availability and data consistency, in a production sharded cluster each shard is a replica set.
  10. Reads / writes: MongoDB distributes the read and write workload across the shards in the sharded cluster, allowing each shard to process a subset of cluster operations. Storage capacity. High availability: a sharded cluster can continue to perform partial read / write operations even if one or more shards are unavailable. How do we distribute data in the sharded cluster?
  11. In this case, compared with the ranged sharding technique, hashed sharding spends less time on the data acquisition task.
  12. When I use the mongoimport command to input data into the sharded MongoDB database, the mongos master, also called the router (4 CPUs, 130 GB), performs the data input job. The overall data input speed was limited by the first phase, in which the application inputs data into the router.
  13. Spark SQL includes a cost-based optimizer, columnar storage and code generation to make queries fast. At the same time, it scales to thousands of nodes and multi-hour queries using the Spark engine.
  14. Narrows the search scope.
  15. We have successfully created a system that is able to accept the data as batches or streams.