Big Data Storage Challenges and
Solutions
June 2013
Deependra
Bhathiya
About WSO2
•  Providing the only complete open source componentized cloud platform
•  Dedicated to removing all the stumbling blocks to enterprise agility
•  Enabling you to focus on business logic and business value
•  Recognized by leading analyst firms as visionaries and leaders
•  Gartner cites WSO2 as visionaries in all 3 categories of application infrastructure
•  Forrester places WSO2 in top 2 for API Management
•  Global corporation with offices in USA, UK & Sri Lanka
•  200+ employees and growing
•  Business model of selling comprehensive support & maintenance for our
products
150+ globally positioned support customers
Overview
Ø  Introduction
Ø  Big Data and Data Types
Ø  Optimal Data Storage
Ø  WSO2 Big Data Storage Solutions – WSO2 SS
Ø  Big Data Access
Ø  Big Data Summery tools – WSO2 BAM
“Big data” is high-volume, -velocity
and -variety information assets that
demand cost-effective, innovative
forms of information processing for
enhanced insight and decision making.
Growth Rate
Every day, we create 2.5 quintillion
bytes of data, so much that 90% of
the data in the world today has been
created in the last two years alone.
- IBM
Storage Requirements
•  Types of data
•  Scalability requirements
•  Nature of data retrieval
•  Consistency requirements
Data types to store
•  Structured
Location Temperature Humidity Rain
Colombo 32 C 77 % 22 mm
Kandy 28 C 79 % 12 mm
Galle 29 C 80 % 2 mm
Jaffna 33 C 76 % 0 mm
Badulla 27C 69 % 2 mm
Data types to store
•  Semi - structured
<dependency>
<groupId>com.h2database.wso2</groupId>
<artifactId>h2-database-engine</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.ws.commons.axiom.wso2</groupId>
<artifactId>axiom</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.ws.commons.schema.wso2</groupId>
<artifactId>XmlSchema</artifactId>
</dependency>
<dependency>
<groupId>org.wso2.carbon</groupId>
<artifactId>org.wso2.carbon.registry.api</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.wso2.carbon</groupId>
<artifactId>org.wso2.carbon.registry.xboot</artifactId>
<version>${project.version}</version>
</dependency>
Data types to store
•  Unstructured
(WSO2 CEP), is an enterprise grade server that integrates to various systems to collect, analyse, and notify meaningful patterns in real
time. The core back end runtime engine that powers WSO2 CEP 2.x server is WSO2 Siddhi which is a very high performing Java event
processing engine A join query always has two Handler processors, one for each input stream it joins. Here, when an event from one stream
reaches the In-Stream Join Processor, it is matched against all the available events of the other stream's Window Processor. When a match is
found, those matched events are then sent to the Query Projector to create the output in-events; at the same time, the original event will be
added to the Window Processor and it will remain there until it expires. Similarly, when an event expires from its Window Processor, it is matched
against all the available events of the other stream's Window Processor; when a match is found, those matched events are sent to the Query
Projector to create the output expired-events.
Note: Inspite of the optimizations, a join query is quite expensive when it comes to performance, and this is because the Window Processor will
be locked during the matching process to avoid race conditions and to achieve accuracy in joining process; therefore, users should avoid
matching huge windows in high volume streams. Based on the user scenario, using appropriate window sizes (by time or length) or using within
keywords will help to achieve maximum performance. Pattern and sequence queries can have many Handler Processors; here, they will have a
Handler Processor for each incoming event stream. After events are received by the Handler Processor, it passes them to the Inner Handler
Processors; these Inner Handler Processors are responsible for processing the states in pattern and sequence queries. Here, the Inner Handler
Processors contain all the events that are partially matched up to its state level, and when a new event arrives it tries to match whether it satisfies
its Filter condition along with the partially matched events. If there is a match, it passes the corresponding previously matched events and the
current event to the next state (Inner Handler Processor).
Optimal Data Store for Data Type
•  Structured Data :
•  RDBMS - MySQL, Oracle, MS SQL
•  KV systems - Accumulo, Oracle kv
•  CF systems - Cassandra
•  Semi-structured Data:
•  JSON - Mongo DB
•  XML - MSSQL, PostgreSQL
•  Unstructured :
•  File Systems - HDFS, OCFS2, GFS2
How Data becomes Big Data
•  Growth over time
• Dimensions and Insights
Capacity / Scaling Problem
Efficient Economical Solutions:
•  Cassandra
•  HBase
•  HDFS
•  Mongo DB
WSO2 Storage Server
Link: https://storage.stratoslive.wso2.com
Big Data Support in WSO2 Platform
Big Data Support in WSO2 Open PaaS
Big Data Accessibility
•  Cluster Support
•  Lock free READ / WRITE
•  External Indexes - Big Data Search
Why Summarization
•  Convert Big Data into simple / small data
•  Turn Big Data in the meaningful data
•  Genarate precentation friendly data
Summarization Technologies
MapReduce : Apache Hadoop
Reference: http://developer.yahoo.com/hadoop/tutorial/module4.html
Map Reduce is Hard
User Friendly Analytic Tools
•  Hive
o  Interpret SQL-like queries to map-reduce jobs
•  Pig
o  Pig Latin statements
Why Hive
SQL like high level language interface.
Tools to Summarize Big Data – WSO2BAM
Publish
Analyze and Summarize
Visualize
Published Data in Cassandra
Published Data in Cassandra
Hive Scripting in BAM
BAM Data Visualization
Engage with WSO2
•  Helping you get the most out of your deployments
•  From project evaluation and inception to development
and going into production, WSO2 is your partner in
ensuring 100% project success
Questions?
*
lean . enterprise . middleware
END

Big Data Storage Challenges and Solutions

  • 1.
    Big Data StorageChallenges and Solutions June 2013 Deependra Bhathiya
  • 2.
    About WSO2 •  Providingthe only complete open source componentized cloud platform •  Dedicated to removing all the stumbling blocks to enterprise agility •  Enabling you to focus on business logic and business value •  Recognized by leading analyst firms as visionaries and leaders •  Gartner cites WSO2 as visionaries in all 3 categories of application infrastructure •  Forrester places WSO2 in top 2 for API Management •  Global corporation with offices in USA, UK & Sri Lanka •  200+ employees and growing •  Business model of selling comprehensive support & maintenance for our products
  • 3.
    150+ globally positionedsupport customers
  • 4.
    Overview Ø  Introduction Ø  BigData and Data Types Ø  Optimal Data Storage Ø  WSO2 Big Data Storage Solutions – WSO2 SS Ø  Big Data Access Ø  Big Data Summery tools – WSO2 BAM
  • 5.
    “Big data” ishigh-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
  • 6.
    Growth Rate Every day,we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. - IBM
  • 7.
    Storage Requirements •  Typesof data •  Scalability requirements •  Nature of data retrieval •  Consistency requirements
  • 8.
    Data types tostore •  Structured Location Temperature Humidity Rain Colombo 32 C 77 % 22 mm Kandy 28 C 79 % 12 mm Galle 29 C 80 % 2 mm Jaffna 33 C 76 % 0 mm Badulla 27C 69 % 2 mm
  • 9.
    Data types tostore •  Semi - structured <dependency> <groupId>com.h2database.wso2</groupId> <artifactId>h2-database-engine</artifactId> <scope>test</scope> </dependency> <dependency> <groupId>org.apache.ws.commons.axiom.wso2</groupId> <artifactId>axiom</artifactId> <scope>test</scope> </dependency> <dependency> <groupId>org.apache.ws.commons.schema.wso2</groupId> <artifactId>XmlSchema</artifactId> </dependency> <dependency> <groupId>org.wso2.carbon</groupId> <artifactId>org.wso2.carbon.registry.api</artifactId> <version>${project.version}</version> </dependency> <dependency> <groupId>org.wso2.carbon</groupId> <artifactId>org.wso2.carbon.registry.xboot</artifactId> <version>${project.version}</version> </dependency>
  • 10.
    Data types tostore •  Unstructured (WSO2 CEP), is an enterprise grade server that integrates to various systems to collect, analyse, and notify meaningful patterns in real time. The core back end runtime engine that powers WSO2 CEP 2.x server is WSO2 Siddhi which is a very high performing Java event processing engine A join query always has two Handler processors, one for each input stream it joins. Here, when an event from one stream reaches the In-Stream Join Processor, it is matched against all the available events of the other stream's Window Processor. When a match is found, those matched events are then sent to the Query Projector to create the output in-events; at the same time, the original event will be added to the Window Processor and it will remain there until it expires. Similarly, when an event expires from its Window Processor, it is matched against all the available events of the other stream's Window Processor; when a match is found, those matched events are sent to the Query Projector to create the output expired-events. Note: Inspite of the optimizations, a join query is quite expensive when it comes to performance, and this is because the Window Processor will be locked during the matching process to avoid race conditions and to achieve accuracy in joining process; therefore, users should avoid matching huge windows in high volume streams. Based on the user scenario, using appropriate window sizes (by time or length) or using within keywords will help to achieve maximum performance. Pattern and sequence queries can have many Handler Processors; here, they will have a Handler Processor for each incoming event stream. After events are received by the Handler Processor, it passes them to the Inner Handler Processors; these Inner Handler Processors are responsible for processing the states in pattern and sequence queries. Here, the Inner Handler Processors contain all the events that are partially matched up to its state level, and when a new event arrives it tries to match whether it satisfies its Filter condition along with the partially matched events. If there is a match, it passes the corresponding previously matched events and the current event to the next state (Inner Handler Processor).
  • 11.
    Optimal Data Storefor Data Type •  Structured Data : •  RDBMS - MySQL, Oracle, MS SQL •  KV systems - Accumulo, Oracle kv •  CF systems - Cassandra •  Semi-structured Data: •  JSON - Mongo DB •  XML - MSSQL, PostgreSQL •  Unstructured : •  File Systems - HDFS, OCFS2, GFS2
  • 12.
    How Data becomesBig Data •  Growth over time • Dimensions and Insights
  • 13.
    Capacity / ScalingProblem Efficient Economical Solutions: •  Cassandra •  HBase •  HDFS •  Mongo DB
  • 14.
    WSO2 Storage Server Link:https://storage.stratoslive.wso2.com
  • 15.
    Big Data Supportin WSO2 Platform
  • 16.
    Big Data Supportin WSO2 Open PaaS
  • 17.
    Big Data Accessibility • Cluster Support •  Lock free READ / WRITE •  External Indexes - Big Data Search
  • 18.
    Why Summarization •  ConvertBig Data into simple / small data •  Turn Big Data in the meaningful data •  Genarate precentation friendly data
  • 19.
    Summarization Technologies MapReduce :Apache Hadoop Reference: http://developer.yahoo.com/hadoop/tutorial/module4.html
  • 20.
  • 21.
    User Friendly AnalyticTools •  Hive o  Interpret SQL-like queries to map-reduce jobs •  Pig o  Pig Latin statements
  • 22.
    Why Hive SQL likehigh level language interface.
  • 23.
    Tools to SummarizeBig Data – WSO2BAM Publish Analyze and Summarize Visualize
  • 24.
  • 25.
  • 26.
  • 27.
  • 28.
    Engage with WSO2 • Helping you get the most out of your deployments •  From project evaluation and inception to development and going into production, WSO2 is your partner in ensuring 100% project success
  • 29.
  • 30.
    * lean . enterprise. middleware END