Ten things to consider for interactive analytics on write once workloads
1. Ten things to consider for Interactive Analytics on high-volume, write-once workloads
Full talk and demo at Fifth Elephant 2014
Abinash Karan
abinash@Bizosys.com
www.bizosys.com
2. About
• CTO and Co-Founder at Bizosys Technologies since 2009
• Created HSearch – a real-time, distributed search and analytics engine built on the Hadoop platform
• Passionate about distributed systems and data structures
• Speaker at Fifth Elephant 2013, Microsoft TechEd 2012, Yahoo Hadoop India Summit 2011
• Developed the partitioning and read-optimized data structure modules for HSearch
• Worked with a range of search products including Lucene, Solr, Endeca and FAST
• Abinash is an engineering graduate of NIT Rourkela
3. Summary of what you will hear
CONTEXT – Write-once data load, e.g. time-series data. Which database?
1. SSD is Good
2. MPP is Good
3. Columnar is Good
4. Logical Partition is Good
5. Data Skew Partition is Good
6. Search Engine Index could lead to Index Explosion
7. Concurrent Users First, Single Query Performance Next
8. High Throughput File level Snapshot Loading
9. Calculate cost upfront
10. Data Structure makes a Big Difference
7. Concept#3 Columnar is Good
[Figure: 12 2 2 8 4 12; 228 bytes]
Opens 84 bytes* (filter on Col1 and display Col6)
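The columnar idea can be sketched as a byte count: a row store reads every column of each scanned row, while a column store reads only the columns the query touches. This is a minimal sketch, not the talk's demo. The column names and byte widths are hypothetical sample values, and the totals are not meant to reproduce the 228 and 84 bytes on the slide.

```python
# Hypothetical per-column widths in bytes for a six-column table (sample values only).
COLUMN_WIDTHS = {"col1": 12, "col2": 2, "col3": 2, "col4": 8, "col5": 4, "col6": 12}


def row_store_bytes(num_rows):
    """A row store reads every column of every row it scans."""
    return num_rows * sum(COLUMN_WIDTHS.values())


def column_store_bytes(num_rows, columns):
    """A column store reads only the columns named in the query."""
    return num_rows * sum(COLUMN_WIDTHS[name] for name in columns)


rows = 1_000
# Filter on col1 and display col6: only two of the six columns are opened.
print(row_store_bytes(rows))                       # 40000
print(column_store_bytes(rows, ["col1", "col6"]))  # 24000
```

The sketch overstates the columnar cost slightly, since a real engine would typically read col6 only for rows that pass the col1 filter.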
8. Concept#4 Logical Partition is Good
Select sum(col3) where col2= 2014
Complete Dataset (1 billion rows): 2012 Data, 180 million ….. 2014 Data, 500 million
Partitioned Data (500M rows)
Stringer
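A minimal sketch of why the partitioned query touches fewer rows, assuming the data is partitioned on col2 (the year). The function names and the tiny sample rows are hypothetical; only the query shape comes from the slide.

```python
from collections import defaultdict


def build_partitions(rows):
    """Group col3 values by their col2 value (the partition key)."""
    partitions = defaultdict(list)
    for col2, col3 in rows:
        partitions[col2].append(col3)
    return partitions


def sum_col3_full_scan(rows, year):
    """Without partitions, every row is inspected to evaluate col2 = year."""
    return sum(col3 for col2, col3 in rows if col2 == year)


def sum_col3_pruned(partitions, year):
    """With partitions, only the matching partition is read."""
    return sum(partitions.get(year, []))


rows = [(2012, 10), (2013, 20), (2014, 30), (2014, 40)]
partitions = build_partitions(rows)
print(sum_col3_full_scan(rows, 2014))     # 70
print(sum_col3_pruned(partitions, 2014))  # 70
```

Both calls return the same answer; the pruned version simply never looks at the 2012 and 2013 rows.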
9. Concept#5 Data Skew Partition is Good (Paging)
Select sum(col3) where col2= 2014
2012 Data, 180 million ….. 2014 Data, 500 million
500 million rows in memory
5 million … 5 million: 5 million rows in memory
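Paging a skewed partition can be sketched as follows: the large partition is cut into fixed-size pages and the aggregate is computed one page at a time, so only one page needs to be held at once. This is an illustration under assumptions, scaled down from the slide's 500 million rows and 5 million row pages to 500 values and pages of 5. The in-memory list stands in for data that would really be read page by page from storage, and the function names are hypothetical.

```python
def iter_pages(values, page_size):
    """Yield consecutive fixed-size slices (pages) of one large partition."""
    for start in range(0, len(values), page_size):
        yield values[start:start + page_size]


def paged_sum(values, page_size):
    """Sum a partition page by page, tracking the largest page held at once."""
    total = 0
    largest_page = 0
    for page in iter_pages(values, page_size):
        largest_page = max(largest_page, len(page))
        total += sum(page)
    return total, largest_page


partition_2014 = list(range(500))  # stand-in for the skewed 2014 partition
total, largest_page = paged_sum(partition_2014, page_size=5)
print(total, largest_page)  # 124750 5
```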
10. Concept#6 Search Index may lead to Index Explosion
Index size is X times larger than the original data size
Index size is X times smaller than the original data size
Repeated Value
Unique Value
1 2 2 2 8 4
1 2 2 2 8 4
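One way to see the effect of repeated versus unique values is to count the entries in a toy inverted index, which maps each distinct value to the row ids that contain it. This is a simplified sketch, not how HSearch or Lucene store their indexes. The repeated sample reuses the values shown on the slide, while the unique sample is made up.

```python
def build_inverted_index(values):
    """Map each distinct value to the list of row ids where it occurs."""
    index = {}
    for row_id, value in enumerate(values):
        index.setdefault(value, []).append(row_id)
    return index


repeated = [1, 2, 2, 2, 8, 4]            # values from the slide
unique = [101, 102, 103, 104, 105, 106]  # hypothetical all-distinct column

# Repeated values share one dictionary entry; unique values need one entry per row.
print(len(build_inverted_index(repeated)))  # 4
print(len(build_inverted_index(unique)))    # 6
```

With all-distinct values the dictionary has as many entries as the data has rows, which is where the index can outgrow the original data.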
11. Concept#7 Concurrent Users First, Single Query Performance Next
1 User: 10% CPU, 200ms
1 User: 70% CPU, 175ms
Support 6 Concurrent Users
12. Concept#8 High Throughput File level Snapshot Loading
Insert 1 row in 1 sec; 1 million rows in 1 sec
Insert 1 row in 1 ms; 1 million rows in 1 hour
Backup
Move the snapshot file
Distributed Index Building
Splitting
Compaction
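The "move the snapshot file" step can be sketched as a write-then-swap: the full dataset is written to a temporary file and then moved into place in one operation, instead of being inserted row by row. This is a minimal sketch under assumptions, not the HSearch loader. The function names, the JSON format and the file name are hypothetical.

```python
import json
import os
import tempfile


def publish_snapshot(rows, live_path):
    """Write all rows to a temporary file, then swap it into place in one move."""
    directory = os.path.dirname(os.path.abspath(live_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp_file:
        json.dump(rows, tmp_file)
    # os.replace overwrites the target; readers see the old or the new file, never a partial one.
    os.replace(tmp_path, live_path)


def read_snapshot(live_path):
    """Load the currently published snapshot."""
    with open(live_path) as live_file:
        return json.load(live_file)


workdir = tempfile.mkdtemp()
live = os.path.join(workdir, "snapshot.json")
publish_snapshot([[2014, 30], [2014, 40]], live)
print(read_snapshot(live))  # [[2014, 30], [2014, 40]]
```

The temporary file is created in the same directory as the live file so that the final move stays on one filesystem.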
13. Concept#9 Calculate cost upfront
Support existing SQLs, No new servers
New Process Instance
New Language
No Monitoring
Hardware Cost Per Byte
SSD-RAM, Engine Efficiency, Spot Instance – Reserved Instance, Indexes @ Compute Node - Data Node
Maintenance Cost
Skill Acquisition, Dashboard
App Dev/Migration Cost
Existing SQLs to custom SQL/JSON