Ten things to consider for interactive analytics on write once workloads


CONTEXT: write-once data load, e.g. time-series data. Which database? SSD is good; MPP is good; columnar is good; logical partition is good; data skew partition is good; a search engine index could lead to index explosion; concurrent users first, single query performance next; high-throughput file-level snapshot loading; calculate cost upfront; data structure makes a big difference.


Ten things to consider for Interactive Analytics on high volume, write-once workloads
Full talk and demo at Fifth Elephant 2014
Abinash Karan
abinash@Bizosys.com
www.bizosys.com

About
• CTO and Co-Founder at Bizosys Technologies since 2009
• Created HSearch, a real-time, distributed search and analytics engine built on the Hadoop platform
• Passionate about distributed systems and data structures
• Speaker at Fifth Elephant 2013, Microsoft TechEd 2012, Yahoo Hadoop India Summit 2011
• Developed the partitioning and read-optimized data structure modules for HSearch
• Worked with a range of search products including Lucene, Solr, Endeca and FAST
• Abinash is an engineering graduate of NIT, Rourkela

Summary of what you will hear
CONTEXT: Write-once data load, e.g. time-series data.
Which Database?
1. SSD is Good
2. MPP is Good
3. Columnar is Good
4. Logical Partition is Good
5. Data Skew Partition is Good
6. Search Engine Index could lead to Index Explosion
7. Concurrent Users First, Single Query Performance Next
8. High Throughput File level Snapshot Loading
9. Calculate cost upfront
10. Data Structure makes a Big Difference

Which Database?
HBase, MongoDB, Shark, SAP HANA, i1010, HSearch, Riak, Hive, Dremel, 1010data, Memcached, FoundationDB, Splunk, Elasticsearch, DynamoDB, Datameer, LevelDB, Netezza, Oracle TimesTen, Aerospike, Sybase IQ, Vertica, Accumulo, Hypertable, Solr

Concept#1 SSD and RAM are Good
• Application Server to Database Node over the network: 50 microseconds
• Disk access: 20 milliseconds
• SSD access: 100 microseconds
• RAM access: 100 nanoseconds
• Data-hotness-based caching on the Database Node
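To put the latency figures on this slide side by side, here is a minimal sketch that converts them to seconds and compares them. The function names and the 95% hit ratio are illustrative assumptions, not part of the talk; the cache model is a simple weighted average and ignores queuing and transfer size.

```python
# Latency figures quoted on the slide, converted to seconds.
LATENCY_S = {
    "ram": 100e-9,      # 100 nanoseconds
    "network": 50e-6,   # 50 microseconds
    "ssd": 100e-6,      # 100 microseconds
    "disk": 20e-3,      # 20 milliseconds
}


def slowdown(medium: str, baseline: str = "ram") -> float:
    """How many times slower one access on `medium` is than on `baseline`."""
    return LATENCY_S[medium] / LATENCY_S[baseline]


def expected_latency(hit_ratio: float, fast: str = "ram", slow: str = "disk") -> float:
    """Average access time when `hit_ratio` of reads are served from `fast`
    (a hypothetical hot-data cache) and the rest fall through to `slow`."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * LATENCY_S[fast] + (1.0 - hit_ratio) * LATENCY_S[slow]


if __name__ == "__main__":
    for medium in ("ssd", "disk"):
        print(f"{medium} is about {slowdown(medium):,.0f}x slower than RAM")
    # Assumed hit ratio, chosen only to show how the misses dominate.
    print(f"95% hot-data hit ratio: {expected_latency(0.95) * 1e3:.3f} ms per access")
```

Even with most reads served from RAM, the few that reach disk dominate the average, which is the case for keeping hot data on SSD or in RAM.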

Concept#2 MPP is Good
• Diagram: Application Server, Database Node, MPP Node, SSD, RAM, DISK
• Computed data versus all data
• MPP processing?

Concept#3 Columnar is Good
• Figure labels: 12 2 2 8 4 12; 228 bytes
• Filter on Col1 and display Col6: opens 84 bytes*
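A rough way to see why a columnar layout opens fewer bytes for a query such as "filter on Col1 and display Col6" is sketched below. The column widths, row counts and function names are illustrative assumptions and are not the figures behind the slide's 228 and 84 bytes; the model also ignores block sizes and compression.

```python
from typing import Dict


def row_store_bytes(widths: Dict[str, int], n_rows: int) -> int:
    """A row store reads every column of every row it scans."""
    return n_rows * sum(widths.values())


def column_store_bytes(
    widths: Dict[str, int],
    n_rows: int,
    filter_col: str,
    display_col: str,
    matching_rows: int,
) -> int:
    """A column store reads the whole filter column, then the display
    column only for the rows that matched the filter."""
    if not 0 <= matching_rows <= n_rows:
        raise ValueError("matching_rows must be between 0 and n_rows")
    return n_rows * widths[filter_col] + matching_rows * widths[display_col]


if __name__ == "__main__":
    # Hypothetical fixed column widths in bytes.
    widths = {"col1": 4, "col2": 8, "col3": 8, "col4": 8, "col5": 8, "col6": 4}
    print("row store   :", row_store_bytes(widths, 1000), "bytes")
    print("column store:", column_store_bytes(widths, 1000, "col1", "col6", 100), "bytes")
```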

Concept#4 Logical Partition is Good
• Query: Select sum(col3) where col2 = 2014
• Complete dataset: 1 billion rows
• Partitions: 2012 data (180 million) … 2014 data (500 million)
• Partitioned data scanned: 500M rows
• Stringer
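The effect of logical partitioning on the query above can be sketched as follows: when the data is partitioned on col2, the sum only touches the 2014 partition instead of the complete dataset. This is a toy in-memory model; the function names and the scaled-down row counts (including the 2013 filler partition, which the slide elides) are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[int, float]  # (col2 = year, col3 = measure)


def partition_by_year(rows: Iterable[Row]) -> Dict[int, List[float]]:
    """Group col3 values by their col2 (year) value."""
    partitions: Dict[int, List[float]] = defaultdict(list)
    for year, value in rows:
        partitions[year].append(value)
    return dict(partitions)


def sum_full_scan(rows: Iterable[Row], year: int) -> Tuple[float, int]:
    """Unpartitioned case: every row is inspected. Returns (sum, rows scanned)."""
    total, scanned = 0.0, 0
    for row_year, value in rows:
        scanned += 1
        if row_year == year:
            total += value
    return total, scanned


def sum_pruned(partitions: Dict[int, List[float]], year: int) -> Tuple[float, int]:
    """Partitioned case: only the matching partition is read."""
    values = partitions.get(year, [])
    return sum(values), len(values)


if __name__ == "__main__":
    # Scaled-down stand-in for the slide's dataset; 2013 is hypothetical filler.
    rows = [(2012, 1.0)] * 18 + [(2013, 2.0)] * 32 + [(2014, 3.0)] * 50
    print("full scan:", sum_full_scan(rows, 2014))
    print("pruned   :", sum_pruned(partition_by_year(rows), 2014))
```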

Concept#5 Data Skew Partition is Good (Paging)
• Query: Select sum(col3) where col2 = 2014
• Partitions: 2012 data (180 million) … 2014 data (500 million)
• Whole 2014 partition: 500 million rows in memory
• Paged into chunks of 5 million … 5 million: 5 million rows in memory
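Paging a skewed partition can be sketched with a generator that hands the aggregation one fixed-size page at a time, so peak memory is bounded by the page size rather than by the partition size. The names and the tiny sizes below are illustrative assumptions; a real engine would read pages from storage rather than from a Python iterable.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def pages(values: Iterable[float], page_size: int) -> Iterator[List[float]]:
    """Yield consecutive pages of at most `page_size` values."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    iterator = iter(values)
    while True:
        page = list(islice(iterator, page_size))
        if not page:
            return
        yield page


def paged_sum(values: Iterable[float], page_size: int) -> Tuple[float, int]:
    """Sum all values page by page. Returns (total, largest page held in memory)."""
    total, peak_rows = 0.0, 0
    for page in pages(values, page_size):
        peak_rows = max(peak_rows, len(page))
        total += sum(page)
    return total, peak_rows


if __name__ == "__main__":
    # 500 rows stand in for the 500 million row partition, 5 for the 5 million row page.
    partition_2014 = (1.0 for _ in range(500))
    print(paged_sum(partition_2014, page_size=5))
```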

Concept#6 Search Index may lead to Index Explosion
• Index size can be X times more than the original data size
• Index size can be X times less than the original data size
• Repeated values versus unique values
• Example values: 1 2 2 2 8 4
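One way to see how value repetition drives index size is a toy inverted index that maps each distinct value to the list of row ids containing it. The size model below (term bytes plus an assumed 4 bytes per posting) and all names are illustrative assumptions, not how any particular search engine lays out its index.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

POSTING_BYTES = 4  # assumed size of one row id in a posting list


def build_inverted_index(values: Sequence[str]) -> Dict[str, List[int]]:
    """Map each distinct value to the row ids where it occurs."""
    index: Dict[str, List[int]] = defaultdict(list)
    for row_id, value in enumerate(values):
        index[value].append(row_id)
    return dict(index)


def raw_size(values: Sequence[str]) -> int:
    """Bytes needed to store the values themselves."""
    return sum(len(v.encode("utf-8")) for v in values)


def index_size(index: Dict[str, List[int]]) -> int:
    """Each term is stored once, plus one posting per occurrence."""
    return sum(
        len(term.encode("utf-8")) + POSTING_BYTES * len(postings)
        for term, postings in index.items()
    )


def index_to_data_ratio(values: Sequence[str]) -> float:
    return index_size(build_inverted_index(values)) / raw_size(values)


if __name__ == "__main__":
    unique = [f"sensor-{i:06d}" for i in range(1000)]        # every value distinct
    repeated = [f"status-{i % 4:06d}" for i in range(1000)]  # only 4 distinct values
    print(f"unique values  : index is {index_to_data_ratio(unique):.2f}x the data")
    print(f"repeated values: index is {index_to_data_ratio(repeated):.2f}x the data")
```

Under these assumptions the index on unique values is larger than the data it indexes, while the index on heavily repeated values is a fraction of it.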

Concept#7 Concurrent Users First, Single Query Performance Next
• 1 user, 10% CPU, 200 ms
• 1 user, 70% CPU, 175 ms
• Support 6 concurrent users
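The trade-off between single-query latency and concurrency can be illustrated with a deliberately crude model: assume CPU is the only bottleneck and that each concurrent query costs the same CPU share as a lone one. The function names and the linear-scaling assumption are mine, and the numbers the model produces are illustrative; they are not meant to reproduce the slide's figure of 6 concurrent users.

```python
import math


def max_concurrent_users(cpu_pct_per_query: float) -> int:
    """Users that fit before CPU saturates, assuming linear scaling."""
    if not 0 < cpu_pct_per_query <= 100:
        raise ValueError("cpu_pct_per_query must be in (0, 100]")
    return math.floor(100.0 / cpu_pct_per_query)


def queries_per_second(cpu_pct_per_query: float, latency_ms: float) -> float:
    """Aggregate throughput when every available slot is busy."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return max_concurrent_users(cpu_pct_per_query) * 1000.0 / latency_ms


if __name__ == "__main__":
    for cpu_pct, latency_ms in ((10, 200), (70, 175)):
        users = max_concurrent_users(cpu_pct)
        qps = queries_per_second(cpu_pct, latency_ms)
        print(f"{cpu_pct}% CPU, {latency_ms} ms: {users} users, {qps:.1f} queries/sec")
```

In this model the plan that is slightly slower per query but uses far less CPU serves many more users at once, which is the ordering of priorities the concept title argues for.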

Concept#8 High Throughput File level Snapshot Loading
• Insert 1 row in 1 sec; 1 million rows in 1 sec
• Insert 1 row in 1 ms; 1 million rows in 1 hour
• Backup: move the snapshot file
• Distributed index building, splitting, compaction
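The contrast between the two loading styles on this slide can be approximated with a fixed-overhead plus per-row cost model. The parameter values below are assumptions chosen to match the slide's single-row figures (1 sec and 1 ms); the sketch only computes the idealized arithmetic, and the slide quotes a larger figure of 1 hour for the row-by-row million-row case.

```python
def load_seconds(n_rows: int, fixed_overhead_s: float, per_row_s: float) -> float:
    """Total load time as a fixed cost plus a cost per row."""
    if n_rows < 0:
        raise ValueError("n_rows must be non-negative")
    return fixed_overhead_s + n_rows * per_row_s


# Assumed cost profiles: a snapshot load pays about 1 sec regardless of size,
# a row-by-row insert pays about 1 ms for every row.
SNAPSHOT = {"fixed_overhead_s": 1.0, "per_row_s": 0.0}
ROW_BY_ROW = {"fixed_overhead_s": 0.0, "per_row_s": 0.001}

if __name__ == "__main__":
    for n_rows in (1, 1_000_000):
        snapshot = load_seconds(n_rows, **SNAPSHOT)
        row_by_row = load_seconds(n_rows, **ROW_BY_ROW)
        print(f"{n_rows:>9,} rows: snapshot {snapshot:,.3f} s, row-by-row {row_by_row:,.3f} s")
```

The single-row case favors row inserts, but the fixed cost of a snapshot is paid once, so it wins as soon as the batch is large.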

Concept#9 Calculate cost upfront
• Support existing SQLs, No new servers
• New Process Instance, New Language, No Monitoring
• Hardware Cost Per Byte: SSD-RAM, Engine Efficiency, Spot Instance - Reserved Instance, Indexes @ Compute Node - Data Node
• Maintenance Cost: Skill Acquisition, Dashboard
• App Dev/Migration Cost: Existing SQLs to custom SQL/JSON

Concept#10 Data Structure makes a Big Difference
• Data structures compared: CSV/JSON/TSV, KV, Secondary Index, Inverted Index, LazySorted Binary Serde
• Operations compared: Append, Update, Delete, GET, Select (Repeat Data), Select (Non-Repeat Data), Filter (Repeat Data), Filter (Non-Repeat Data), Nulls
• Goals: 1. Size Reduction on Index 2. Compressibility 3. Fast Access
* Custom Variations: RC File, ORC File, Parquet
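As a small illustration of how the choice of data structure affects size, compressibility and access, the sketch below stores the same integer column as CSV text and as fixed-width binary using only the Python standard library. The sample data and function names are assumptions; this is not HSearch's format nor any of the file formats named on the slide.

```python
import struct
import zlib
from typing import Sequence


def as_csv(values: Sequence[int]) -> bytes:
    """One value per line, as text."""
    return "\n".join(str(v) for v in values).encode("ascii")


def as_binary(values: Sequence[int]) -> bytes:
    """Fixed-width little-endian 32-bit integers."""
    return struct.pack(f"<{len(values)}i", *values)


def read_binary(blob: bytes, position: int) -> int:
    """Fixed width allows jumping straight to the n-th value without parsing."""
    return struct.unpack_from("<i", blob, 4 * position)[0]


if __name__ == "__main__":
    # Hypothetical sensor readings with heavy repetition.
    readings = [20000 + (i % 7) for i in range(10_000)]
    csv_blob, bin_blob = as_csv(readings), as_binary(readings)
    print("csv bytes      :", len(csv_blob), "compressed:", len(zlib.compress(csv_blob)))
    print("binary bytes   :", len(bin_blob), "compressed:", len(zlib.compress(bin_blob)))
    print("value at 1234  :", read_binary(bin_blob, 1234))
```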

10 CONCEPT DEMONSTRATION
HSEARCH DEMO
Columns: HVAC ID, BuildingID, READING_TIME, INLET TEMP, OUTLET TEMP, ERROR MESSAGE

