Webcast Q&A- Big Data Architectures Beyond Hadoop


Webcast Q&A

Impetus webcast ‘Big Data Architectures – Beyond Hadoop’ available at http://lf1.me/PB/



Webinar: Big Data Architectures – Beyond the Elephant Ride
June 29, 2012
Question and Answer Session

Q1. What are the differences between Storm and ESBs like Mule?

Storm and ESBs (like Mule) serve very different purposes and cannot be compared directly.

The motivation behind an ESB is to standardize and structure loosely coupled software components so that they can be independently deployed and run in a disparate environment. Communication happens through message passing, and with an ESB heterogeneous components are able to interact with each other.

Storm, by contrast, is for processing large volumes of data in real time. With Storm, we do not attempt to establish any common structure for different components to collaborate. Rather, Storm enables huge amounts of data to be processed through a chain of processing units.

So when you have large amounts of data that you want to process in real time, we advise you to use Storm. On the other hand, when you have numerous components and you want a layer that enables their interaction, use an ESB.

In fact, Storm and an ESB can in theory be integrated, with Storm handling the streaming analytics while the ESB caters to service orchestration and integration.

Q2. What is the advantage of Giraph and Pregel over more common graph DBs like Neo4j or InfiniteGraph?

Giraph is an open-source implementation of Pregel meant for large datasets. It provides a large-scale graph processing infrastructure over Hadoop. Some of the advantages I'd like to highlight include:

1. Distributed, and developed especially for large-scale graph processing
2. Bulk Synchronous Parallel (BSP) as the execution model
3. Fault tolerance via checkpointing
4. Runs on standard Hadoop infrastructure
5. Computation is executed in memory
6. It can be part of a pipeline in the form of a job
7. Vertex-centric API

Please see the answer to Question 9 as well.

Q3. What do you recommend for reporting on top of NoSQL databases?

Technologies under the NoSQL umbrella are relatively new and still evolving. Furthermore, there are a lot of them, and it is unlikely that one single tool would work on all of them.

It would be great if you could share the exact NoSQL technology you are either using or planning to use, and we will then be able to suggest the right tool.

There are very few reporting tools, like Intellicus and Jasper, that work on HBase, but I suspect they're still keeping an eye on the market to see the direction it's going to take. I strongly believe you should see some exciting features in these tools in the next 6–12 months.

Q4. What are the differences between Cassandra and Riak, and why would you choose one over the other?

Cassandra and Riak are popular NoSQL solutions and are best suited to solving different kinds of use cases in specific ways. So the choice of one over the other depends entirely on the business use case you are trying to solve.

Strengths of Riak over Cassandra:
- Adding nodes to a Riak cluster is very easy
- The data model doesn't need to be set up in advance
- You can access it using REST or the Protocol Buffers API
- Commercial support is available from Basho
Strengths of Cassandra over Riak:
- Cassandra is still more popular because of the bigger community using it
- You can access it using CQL, a SQL-like language
- Scales to petabytes and supports a columnar structure
- Enterprise features like rack awareness are free, which is helpful in large deployments
- Commercial product support is available from DataStax
- Implementation support is available from third-party commercial service providers like Impetus (http://wiki.apache.org/cassandra/ThirdPartySupport)

Q5. We planned for a SAN deployment as our storage solution. I have read that MPP database solutions are optimal on a shared-nothing architecture such as DAS rather than on SAN. Can you please comment on MPP databases on SAN vs. DAS?

Typically speaking, a SAN can offer higher throughput than DAS but can also have higher latency for lighter loads. Also, a SAN's available throughput is shared across all connected nodes. In an MPP data warehousing scenario, multiple nodes connect to the SAN, thereby sharing a common bandwidth.

Another point to note is that most queries served by MPP systems involve a high volume of scattered reads across multiple nodes, pushing the SAN's bandwidth utilization to its limits. However, with a large cache, high-speed HBAs, and high-speed disks (15K RPM), the SAN should be able to serve a 10–15 node MPP cluster.

On the other hand, DAS can also provide very good throughput and does not have to share its bandwidth across multiple nodes. The bandwidth offered can be further improved by using multiple SATA adapters and high-speed disks (10K–15K RPM). DAS will probably offer better performance on a cluster with a very high number of nodes.

To summarize, there is no clear winner; the choice of SAN vs. DAS depends on various factors like load, the underlying storage technology, cache, number of nodes, and so on. Both high-end fibre-based storage and newer SATA-based storage (e.g. SATA-3) can offer similar bandwidth. We suggest that a careful study and capacity planning be conducted on the underlying storage system before deciding on the storage solution.

Q6. What architecture components would satisfy the desire to have an integrated NewSQL environment and be able to marry that data with both ad-hoc user-defined tables and events detected during unstructured data stream processing?

NewSQL and NoSQL databases/datastores excel in areas where traditional RDBMS systems have limitations. In many scenarios, NoSQL/NewSQL databases can offer a significant improvement over an RDBMS. Some cases are:

1. Very high availability on high-traffic data
2. Storing CLOB/text data that holds denormalized/unstructured content
3. Journal data
4. Performance and scalability

Unstructured data stream processing falls more under the category of CEP (complex event processing), and eventually we will see NewSQL systems start providing support for pre-ingestion analytics rather than the current traditional post-ingestion analytics. For now, you will have to rely on a CEP component for event detection on streaming data, while NewSQL acts as the sink for that streaming data. NewSQL can also help with rapid event generation by executing analytical queries much faster than a traditional RDBMS.

Q7. Can you compare Neo4j with your recommended graph database?

Already answered as part of Q2.

Q8. What is your take on MongoDB?

An RDBMS is still the most commonly used data store for applications built today. But the flexibility offered by MongoDB provides advantages in development speed and overall application performance in many use cases. Like any other document store, instead of storing data in tables with rows and columns, MongoDB encapsulates data in loosely defined documents.
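To make "loosely defined documents" concrete, here is a toy sketch in plain Python, with dicts standing in for BSON documents. This is illustrative only, not MongoDB's API: the point is that records in the same collection need not share a schema.

```python
# Toy document "collection": a list of schema-free dicts.
# In MongoDB these would be BSON documents; here plain Python dicts
# stand in, to show that records need not share the same fields.
users = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Alan", "tags": ["crypto", "logic"], "age": 41},
    {"name": "Grace", "address": {"city": "Arlington", "zip": "22201"}},
]

def find(collection, **criteria):
    """Return documents whose top-level fields match all criteria,
    skipping documents that lack a queried field entirely."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

print(find(users, name="Ada"))
# Nested documents and arrays coexist with flat records in one collection:
print([d["name"] for d in users if "tags" in d])  # ['Alan']
```

A relational store would force all three records into one table schema (with many NULL columns); the document model simply stores whatever fields each record has.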
There are a lot of document-oriented stores, and the underlying implementation varies between them. Some represent a document as XML and some use JSON. The general rule is that documents are not rigidly defined, and you can expect a high degree of flexibility when defining data.

MongoDB is one of the most popular document stores. It is open source, schema-free, written in C++, with support for a wide array of programming languages and a SQL-like query language.

It is a relatively new technology and has a few challenges as well, but with attractive pricing and relative ease of use, it is definitely becoming a choice for various small and large companies.

Q9. You didn't mention Neo4j among the graph databases you recommend. Any particular reason Neo4j wasn't included?

No, there is no particular reason. What's important is to understand the differences between these technologies and where each fits. If you have an OLAP and data analytics scenario, Hadoop-based Pregel and Giraph will be a better fit. If you have an OLTP setup where you want to store and query connected data for online transaction processing, Neo4j comes into the picture.

We recommend this excellent read: http://jim.webber.name/2011/08/24/66f1fb4b-83c3-4f52-af40-ee6382ad2155.aspx

Q10. What is the limiting factor in analyzing all data on a real-time basis? Is it processing power, storage systems, DB systems, or something else?

There are challenges in each of the areas you raised, such as storage and processing. When you process the data, it usually has to be loaded into main memory, which is still expensive. The machines have to be powerful enough to get you the results fast. Hence, both processing and the storage system are the main bottlenecks.

There is also a paradigm shift in the way programming is done. In order to process the data efficiently, we need to come up with parallel algorithms that can work on this data and utilize the processing power of the machines.
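As a minimal sketch of what such a parallel algorithm looks like, here is the classic map-and-reduce word count in plain Python. In a real system like Hadoop, the independent map calls would be sharded across machines; the function names here are illustrative.

```python
from collections import Counter
from functools import reduce

def map_phase(chunk):
    """Map step: turn one chunk of text into partial word counts.
    Each call is independent, so chunks can run on separate nodes."""
    return Counter(chunk.split())

def reduce_phase(counts_a, counts_b):
    """Reduce step: merge two partial results. The operation is
    associative, so the merge order across a cluster does not matter."""
    return counts_a + counts_b

chunks = ["big data beyond hadoop", "hadoop and beyond", "big big data"]
partials = [map_phase(c) for c in chunks]           # parallelizable step
totals = reduce(reduce_phase, partials, Counter())  # aggregation step
print(totals["big"], totals["hadoop"])  # 3 2
```

The key property is that `map_phase` has no shared state and `reduce_phase` is associative, which is exactly what lets frameworks spread the work over many machines.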
So to summarize, the three limiting factors I would call out are memory, processing, and the right set of algorithms.

Q11. What do you recommend as an in-database but very scalable alternative to SAS for doing advanced math on large datasets?

Assuming the reference here is to the SAS language, R scripting can be a good alternative for working with large datasets, as it integrates well with Hadoop and can scale using a MapReduce programming interface over R scripts. Revolution Analytics offers a commercial product for R over Hadoop.

There are non-Hadoop options as well, such as Greenplum or Aster, which support specialized advanced math libraries.

Also, SAS now provides integration with Hadoop, which means you can reuse some of your SAS programming investments and use Hadoop as the underlying scalable processing engine for some of the analytical execution.

Q12. Are there any NewSQL platforms that have mastered workload management? For instance, without workload management, high-resource, intense transactions can get in the way of traditional reporting needs. In other words, is there a NewSQL environment that can be used for both traditional and advanced analysis on the same platform?

NewSQL platforms are evolving every day, with many more being built in stealth mode. We are not aware of any advanced workload management functionality in any NewSQL platform at present, but that may change any day now. Most NewSQL platforms have been designed to work efficiently with either an OLTP environment or an OLAP environment, but not both.

Q14. Is MongoDB a better solution for any of the scenarios discussed?

MongoDB can be a good option in some OLTP use cases, or in the transactional systems we discussed.
Q15. Do you have recommendations for an indexing solution?

Depending on the data size, you can consider Solr and ElasticSearch as options for indexing. There are commercial solutions as well, but Solr, with its new scalable version SolrCloud, can compete with any commercial offering.

Write to us at bigdata@impetus.com for more information.

© 2012 Impetus Technologies
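For background on Q15: engines such as Solr and ElasticSearch are built around an inverted index, which maps each term to the documents containing it. A minimal pure-Python sketch of that idea (illustrative only, not Solr's API):

```python
from collections import defaultdict

# Documents to index: id -> text.
docs = {
    1: "big data architectures beyond hadoop",
    2: "real time processing with storm",
    3: "hadoop based graph processing",
}

# Inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(*terms):
    """AND-query: ids of documents containing every given term."""
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for t in terms[1:]:
        result &= index.get(t, set())
    return result

print(sorted(search("hadoop")))                # [1, 3]
print(sorted(search("hadoop", "processing")))  # [3]
```

Production engines add tokenization, stemming, ranking, and (in SolrCloud's case) sharding of this index across nodes, but the lookup structure is the same.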