Big Data Storage Challenges and Solutions

Published in: Technology

  1. 1. Big Data Storage Challenges and Solutions, June 2013, Deependra Bhathiya
  2. 2. About WSO2
     •  Providing the only complete open source componentized cloud platform
     •  Dedicated to removing all the stumbling blocks to enterprise agility
     •  Enabling you to focus on business logic and business value
     •  Recognized by leading analyst firms as visionaries and leaders
     •  Gartner cites WSO2 as visionaries in all 3 categories of application infrastructure
     •  Forrester places WSO2 in top 2 for API Management
     •  Global corporation with offices in USA, UK & Sri Lanka
     •  200+ employees and growing
     •  Business model of selling comprehensive support & maintenance for our products
  3. 3. 150+ globally positioned support customers
  4. 4. Overview
     •  Introduction
     •  Big Data and Data Types
     •  Optimal Data Storage
     •  WSO2 Big Data Storage Solutions – WSO2 SS
     •  Big Data Access
     •  Big Data Summary tools – WSO2 BAM
  5. 5. “Big data” is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
  6. 6. Growth Rate
     Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone.
     - IBM
  7. 7. Storage Requirements
     •  Types of data
     •  Scalability requirements
     •  Nature of data retrieval
     •  Consistency requirements
  8. 8. Data types to store
     •  Structured

        Location   Temperature   Humidity   Rain
        Colombo    32 C          77 %       22 mm
        Kandy      28 C          79 %       12 mm
        Galle      29 C          80 %       2 mm
        Jaffna     33 C          76 %       0 mm
        Badulla    27 C          69 %       2 mm
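
Rows like the weather readings on this slide map directly onto a relational table. The following is a minimal sketch, assuming Python's built-in `sqlite3` module as a stand-in for an RDBMS; the table name, column names, and helper functions are hypothetical and not part of the original deck.

```python
import sqlite3

# Readings from the slide: (location, temperature in C, humidity in %, rain in mm).
READINGS = [
    ("Colombo", 32, 77, 22),
    ("Kandy", 28, 79, 12),
    ("Galle", 29, 80, 2),
    ("Jaffna", 33, 76, 0),
    ("Badulla", 27, 69, 2),
]


def load_readings(conn):
    """Create a hypothetical 'weather' table and insert the slide's rows."""
    conn.execute(
        "CREATE TABLE weather ("
        "location TEXT PRIMARY KEY, temperature_c INTEGER, "
        "humidity_pct INTEGER, rain_mm INTEGER)"
    )
    conn.executemany("INSERT INTO weather VALUES (?, ?, ?, ?)", READINGS)
    conn.commit()


def locations_with_rain_above(conn, threshold_mm):
    """Return locations whose rainfall exceeds threshold_mm, sorted by name."""
    rows = conn.execute(
        "SELECT location FROM weather WHERE rain_mm > ? ORDER BY location",
        (threshold_mm,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
load_readings(conn)
wet = locations_with_rain_above(conn, 10)
```

Because every row has the same typed columns, the store can enforce a schema and answer column-based queries without inspecting each record's shape.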
  9. 9. Data types to store
     •  Semi - structured

        <dependency>
          <groupId>com.h2database.wso2</groupId>
          <artifactId>h2-database-engine</artifactId>
          <scope>test</scope>
        </dependency>
        <dependency>
          <groupId>org.apache.ws.commons.axiom.wso2</groupId>
          <artifactId>axiom</artifactId>
          <scope>test</scope>
        </dependency>
        <dependency>
          <groupId>org.apache.ws.commons.schema.wso2</groupId>
          <artifactId>XmlSchema</artifactId>
        </dependency>
        <dependency>
          <groupId>org.wso2.carbon</groupId>
          <artifactId>org.wso2.carbon.registry.api</artifactId>
          <version>${project.version}</version>
        </dependency>
        <dependency>
          <groupId>org.wso2.carbon</groupId>
          <artifactId>org.wso2.carbon.registry.xboot</artifactId>
          <version>${project.version}</version>
        </dependency>
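
What makes the XML on this slide semi-structured is that each entry is self-describing but not every entry carries the same fields (some have `scope`, some have `version`, one has neither). A minimal sketch of reading such data with Python's standard `xml.etree.ElementTree`; the `<dependencies>` wrapper element and the `parse_dependencies` helper are assumptions added so the fragment parses as one document.

```python
import xml.etree.ElementTree as ET

# Two entries abridged from the slide, wrapped in a hypothetical <dependencies>
# root element so the fragment forms a single well-formed document.
POM_FRAGMENT = """
<dependencies>
  <dependency>
    <groupId>com.h2database.wso2</groupId>
    <artifactId>h2-database-engine</artifactId>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.ws.commons.schema.wso2</groupId>
    <artifactId>XmlSchema</artifactId>
  </dependency>
</dependencies>
"""


def parse_dependencies(xml_text):
    """Turn each <dependency> into a dict holding only the fields it declares."""
    root = ET.fromstring(xml_text.strip())
    deps = []
    for dep in root.findall("dependency"):
        deps.append({child.tag: (child.text or "").strip() for child in dep})
    return deps


deps = parse_dependencies(POM_FRAGMENT)
# Optional fields must be read defensively because the schema is not fixed.
scopes = [dep.get("scope", "compile-or-unspecified") for dep in deps]
```

The fallback string in `scopes` is only a placeholder for "field absent"; a document store or XML-aware database handles this kind of per-record variation natively.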
  10. 10. Data types to store
      •  Unstructured

         (WSO2 CEP), is an enterprise grade server that integrates to various systems to collect, analyse, and notify meaningful patterns in real time. The core back end runtime engine that powers WSO2 CEP 2.x server is WSO2 Siddhi which is a very high performing Java event processing engine. A join query always has two Handler Processors, one for each input stream it joins. Here, when an event from one stream reaches the In-Stream Join Processor, it is matched against all the available events of the other stream's Window Processor. When a match is found, those matched events are then sent to the Query Projector to create the output in-events; at the same time, the original event will be added to the Window Processor and it will remain there until it expires. Similarly, when an event expires from its Window Processor, it is matched against all the available events of the other stream's Window Processor; when a match is found, those matched events are sent to the Query Projector to create the output expired-events.

         Note: In spite of the optimizations, a join query is quite expensive when it comes to performance, and this is because the Window Processor will be locked during the matching process to avoid race conditions and to achieve accuracy in the joining process; therefore, users should avoid matching huge windows in high volume streams. Based on the user scenario, using appropriate window sizes (by time or length) or using within keywords will help to achieve maximum performance. Pattern and sequence queries can have many Handler Processors; here, they will have a Handler Processor for each incoming event stream. After events are received by the Handler Processor, it passes them to the Inner Handler Processors; these Inner Handler Processors are responsible for processing the states in pattern and sequence queries. Here, the Inner Handler Processors contain all the events that are partially matched up to their state level, and when a new event arrives each one checks whether it satisfies its Filter condition along with the partially matched events. If there is a match, it passes the corresponding previously matched events and the current event to the next state (Inner Handler Processor).
  11. 11. Optimal Data Store for Data Type
      •  Structured data:
         •  RDBMS - MySQL, Oracle, MS SQL
         •  Key-value (KV) systems - Accumulo, Oracle KV
         •  Column-family (CF) systems - Cassandra
      •  Semi-structured data:
         •  JSON - MongoDB
         •  XML - MS SQL, PostgreSQL
      •  Unstructured data:
         •  File systems - HDFS, OCFS2, GFS2
  12. 12. How Data becomes Big Data
      •  Growth over time
      •  Dimensions and Insights
  13. 13. Capacity / Scaling Problem
      Efficient, economical solutions:
      •  Cassandra
      •  HBase
      •  HDFS
      •  MongoDB
  14. 14. WSO2 Storage Server
      Link: https://storage.stratoslive.wso2.com
  15. 15. Big Data Support in WSO2 Platform
  16. 16. Big Data Support in WSO2 Open PaaS
  17. 17. Big Data Accessibility
      •  Cluster Support
      •  Lock free READ / WRITE
      •  External Indexes - Big Data Search
  18. 18. Why Summarization
      •  Convert Big Data into simple / small data
      •  Turn Big Data into meaningful data
      •  Generate presentation-friendly data
  19. 19. Summarization Technologies
      MapReduce: Apache Hadoop
      Reference: http://developer.yahoo.com/hadoop/tutorial/module4.html
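
To make the MapReduce idea concrete, here is a single-process sketch of its three phases (map, shuffle, reduce) in plain Python. It does not use the Hadoop API and runs on one machine; the function names and the sample rainfall records are hypothetical illustrations, not data from the deck.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle_phase(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single summary value."""
    return {key: reducer(key, values) for key, values in groups.items()}


def rain_mapper(record):
    """Emit one (location, rain_mm) pair per reading."""
    location, rain_mm = record
    yield location, rain_mm


def sum_reducer(key, values):
    """Total the rainfall for one location."""
    return sum(values)


# Hypothetical raw readings: several measurements per location.
RECORDS = [("Colombo", 22), ("Kandy", 12), ("Colombo", 5), ("Galle", 2), ("Kandy", 3)]

totals = reduce_phase(shuffle_phase(map_phase(RECORDS, rain_mapper)), sum_reducer)
```

In a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data across the network; the sketch only shows how many raw records shrink into one small summary per key.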
  20. 20. Map Reduce is Hard
  21. 21. User Friendly Analytic Tools
      •  Hive
         o  Translates SQL-like queries into MapReduce jobs
      •  Pig
         o  Pig Latin statements
  22. 22. Why Hive
      SQL-like high-level language interface.
  23. 23. Tools to Summarize Big Data – WSO2 BAM
      •  Publish
      •  Analyze and Summarize
      •  Visualize
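
The appeal of a SQL-like interface is that a grouped summary fits in one declarative statement rather than hand-written map and reduce functions. The sketch below runs such a statement with Python's built-in `sqlite3`, which is only a local stand-in here: it is not Hive and launches no MapReduce jobs. The `readings` table and its rows are hypothetical.

```python
import sqlite3

# Hypothetical raw readings: several rainfall measurements per location.
RECORDS = [("Colombo", 22), ("Kandy", 12), ("Colombo", 5), ("Galle", 2), ("Kandy", 3)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (location TEXT, rain_mm INTEGER)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", RECORDS)

# A plain GROUP BY aggregate: the kind of statement a SQL-like layer
# would compile into distributed jobs on the user's behalf.
QUERY = (
    "SELECT location, SUM(rain_mm) AS total_rain_mm "
    "FROM readings GROUP BY location ORDER BY location"
)

totals = dict(conn.execute(QUERY).fetchall())
```

The user states what summary is wanted (total rain per location) and leaves the grouping and aggregation strategy to the engine.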
  24. 24. Published Data in Cassandra
  25. 25. Published Data in Cassandra
  26. 26. Hive Scripting in BAM
  27. 27. BAM Data Visualization
  28. 28. Engage with WSO2
      •  Helping you get the most out of your deployments
      •  From project evaluation and inception to development and going into production, WSO2 is your partner in ensuring 100% project success
  29. 29. Questions?
  30. 30. lean . enterprise . middleware
      END
