Big Data using NoSQL Technologies
Amit Kr. Singh
Senior Developer, Ericsson
December 14, 2012
My Background
- Part of the Java and Open Source Practice Area.
- Driving technology initiatives in the LockBox project.
- Part of the System-X development team.
- Contributing to JOSP Competence Development & Training.
Big Data
Ericsson defines:
“People, devices and things are constantly generating massive volumes of data. At work people create data, as do children at home, students at school, people and things on the move, as well as objects that are stationary. Devices and sensors attached to millions of things take measurements from their surroundings, providing up-to-date readings over the entire globe – data to be stored for later use by countless different applications.”
Big Data
IBM defines:
“Every day, we create 2.5 quintillion bytes of data – so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data.”
Big Data
Wikipedia defines:
Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools. The challenges include capture, curation, storage, search, sharing, analysis and visualization.
Big Data
Why so many definitions? I am really confused.
Big Data
In simple words: a set of technology advances that have made capturing and analyzing data at high scale and speed vastly more efficient.
Six insights from Facebook's former Head of Big Data Analytics on 900M users
25 PB of compressed data (125 PB uncompressed)
- New technologies have shifted the conversation from “what data to store” to “what can we do with more data”.
- Simplify data analytics for end users.
- More users means data analytics systems have to be more robust.
- Social networking works for Big Data.
- No single infrastructure can solve all Big Data problems.
- Building software is hard, but running a service is even harder.
The Three Vs of Big Data
Volume – big data comes in one size, XXL, and available storage cannot handle these volumes.
Velocity – data needs to be used quickly to maximize business benefit before the value of the information is lost.
Variety – data can be structured, unstructured, semi-structured or a mix of all three. It comes in many forms including text, audio, video, click streams and log files.
Big Data Technologies
Big-data technologies are usually engineered from the bottom up with two things in mind: scale and availability. Most solutions are distributed in nature and introduce new programming models for working with large volumes of data.
Technologies such as Not only SQL (NoSQL), characterized by non-adherence to the RDBMS model, are used in a wide variety of industry applications. These technologies have the flexibility to handle Big Data.
Scalability
Scalability refers to the ability of an application or product to increase in size as demand warrants. The base concept is consistent: the ability of a business or technology to accept increased volume without impacting the business.
- Scale horizontally (scale out)
- Scale vertically (scale up)
Scalability
Scale vertically (scale up)
Extra capacity can be obtained by adding more hardware to a specific computer or by moving applications to larger computers, a process known as vertical scaling. One limitation of this approach is the risk of outgrowing the capacity of the largest available computer. Vendor lock-in is a potential risk, and vertically scaled solutions can become prohibitively expensive.
Scalability
Scale horizontally (scale out)
Adding computers in parallel can also increase capacity. This approach is known as horizontal scaling, and Big Data technologies tend to favor it because it supports network expansion. Systems that are built in this way are more flexible, and because commodity computers can be operated together in parallel, the risk associated with single-vendor solutions is reduced. Horizontal scaling is also a natural fit for the cloud.
Availability
Availability is a guarantee that every request receives a response about whether it was successful or failed.
Users want their systems (Facebook, Twitter, telecom apps, etc.) to be ready to serve them at all times. If a user cannot access the system, it is said to be unavailable. Generally, the term downtime is used to refer to periods when a system is unavailable.
NoSQL
What NoSQL databases can do:
- Serve as an online processing database, so that it becomes the primary data source/operational datastore for online applications.
- Use data stored in primary source systems for real-time and batch analytics, and for enterprise search operations.
- Handle “big data” use cases that involve data velocity, variety, volume, and complexity.
- Excel at distributed database and multi-data center operations.
- Offer a flexible schema design that can be changed without downtime or service disruption.
- Accommodate structured, semi-structured, and unstructured data.
- Easily operate in the cloud and exploit the benefits of cloud computing.
Is NoSQL replacing the RDBMS?
The answer is both yes and no, because the choice between the two depends on the use case.
Most NoSQL databases do not provide full ACID transaction guarantees. Applications that depend on transaction support (banking, airlines, etc.) will continue to work with an RDBMS, while social media applications, which mostly deal with unstructured data, will look at alternative NoSQL solutions. A hybrid architecture may also prove beneficial, where the power of both RDBMS and NoSQL can be leveraged.
Is NoSQL replacing the RDBMS?
Many enterprises are choosing to leave some legacy RDBMS systems in place while directing new development towards NoSQL databases. This is especially the case when the applications in question demand high write throughput, need flexible schema designs, process large volumes of data, and are distributed in nature.
Technology aside, another reason many new development and/or migration efforts are being directed towards NoSQL databases is the high cost of legacy RDBMS vendors versus NoSQL software. In general, NoSQL software costs a fraction of what vendors such as IBM and Oracle charge for their databases.
RDBMS & Big Data
Tactics to extend the useful scope of RDBMS technology:
- Sharding
- Denormalizing
- Distributed caching
Sharding
If the data for an application will not fit on a single server or, more likely, if a single server is incapable of maintaining the I/O throughput required to serve many users simultaneously, then a tactic known as sharding is frequently employed.
Database sharding is the process of splitting up a database across multiple machines to improve the scalability of an application.
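One common way to split rows across machines is to hash a key and take the result modulo the number of shards. The sketch below is only an illustration of that idea, not something from the slides: the shard names and the `shard_for` and `route` helpers are hypothetical, and plain lists stand in for real database servers.

```python
import hashlib

# Hypothetical shard names; a real deployment would map these to database servers.
SHARDS = ["db-0", "db-1", "db-2", "db-3"]


def shard_for(key: str, shards=SHARDS) -> str:
    """Pick a shard by hashing the key and taking it modulo the shard count."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


def route(user_ids):
    """Group user ids by the shard that would store them."""
    placement = {name: [] for name in SHARDS}
    for uid in user_ids:
        placement[shard_for(uid)].append(uid)
    return placement


placement = route([f"user-{i}" for i in range(1000)])
for name, ids in placement.items():
    print(name, len(ids))
```

The same key always lands on the same shard, so the application can find a row again without a central lookup table.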
Sharding
This does work to spread the load, but there are some undesirable consequences to the approach.
- When you fill a shard, you have to change the sharding strategy in the application itself. For example, a site might place user profile information on one database server, friend lists on another, and user-generated content like photos and blogs on a third. The main problem with this approach is that if the site experiences additional growth, it may be necessary to further shard a feature-specific database across multiple servers.
- You lose some of the most important benefits of the relational model. You can’t do “joins” across shards. In addition, you can’t do cross-node locking when making updates.
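To make the cost of changing the sharding strategy concrete, the sketch below assumes a simple hash-modulo scheme (an assumption for illustration; the slides do not name a scheme) and counts how many keys change shard when a fifth server is added to four. The helper names are hypothetical.

```python
import hashlib


def shard_index(key: str, shard_count: int) -> int:
    """Hash-modulo placement: an assumed scheme, used here only for illustration."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if shard_index(k, old_count) != shard_index(k, new_count)
    )
    return moved / len(keys)


keys = [f"user-{i}" for i in range(10_000)]
fraction = moved_fraction(keys, old_count=4, new_count=5)
print(f"{fraction:.0%} of keys move when growing from 4 to 5 shards")
```

With a uniform hash, roughly four out of five keys end up on a different shard, which is why growing a manually sharded RDBMS tends to require an application-level data migration.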
Denormalizing
Denormalization is the process of attempting to optimize the read performance of a database by adding redundant data or by grouping data. In some cases, denormalization is a means of addressing performance or improving scalability in relational database software.
- Most of the time denormalization is application-specific and needs to be re-evaluated if the application changes.
- Denormalization can increase the size of tables.
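A small sketch of the trade-off, using Python's built-in `sqlite3` module with made-up `users`, `posts` and `posts_denorm` tables (all hypothetical): the denormalized table answers the read without a join, but a change to the redundant data must touch every copy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT);
    -- Denormalized copy: the author's name is stored redundantly with each post.
    CREATE TABLE posts_denorm (id INTEGER PRIMARY KEY, user_id INTEGER,
                               user_name TEXT, body TEXT);
    INSERT INTO users VALUES (1, 'alice');
    INSERT INTO posts VALUES (10, 1, 'hello'), (11, 1, 'world');
    INSERT INTO posts_denorm VALUES (10, 1, 'alice', 'hello'),
                                    (11, 1, 'alice', 'world');
""")

# Normalized read: needs a join to fetch the author's name.
normalized = conn.execute(
    "SELECT u.name, p.body FROM posts p JOIN users u ON u.id = p.user_id "
    "ORDER BY p.id"
).fetchall()

# Denormalized read: a single-table scan returns the same rows.
denormalized = conn.execute(
    "SELECT user_name, body FROM posts_denorm ORDER BY id"
).fetchall()

# The cost: renaming the user must also rewrite every redundant copy.
conn.execute("UPDATE users SET name = 'alicia' WHERE id = 1")
copies_rewritten = conn.execute(
    "UPDATE posts_denorm SET user_name = 'alicia' WHERE user_id = 1"
).rowcount
print(normalized == denormalized, copies_rewritten)
```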
Distributed Caching
Another tactic used to extend the useful scope of RDBMS technology is to employ distributed caching technologies, such as Memcached. Today, Memcached is a key ingredient in the data architecture behind 18 of the 20 largest (by user count) Web applications, including Google, Wikipedia, Twitter, YouTube and Facebook.
Memcached “sits in front” of an RDBMS system, caching recently accessed data in memory and storing that data across any number of servers or virtual machines. When an application needs access to data, rather than going directly to the RDBMS, it first checks Memcached to see if the data is available there; if it is not, the application reads it from the database and stores it in Memcached for quick access the next time it is needed.
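The read path described above is often called the cache-aside pattern. The sketch below shows only the control flow: plain dictionaries stand in for both Memcached and the RDBMS, and the `CacheAsideReader` class is a hypothetical name, not a Memcached client API.

```python
class CacheAsideReader:
    """Check the cache first; on a miss, read the database and fill the cache."""

    def __init__(self, database):
        self.database = database  # dict standing in for the RDBMS
        self.cache = {}           # dict standing in for Memcached
        self.db_reads = 0         # counts how often the database is actually hit

    def get(self, key):
        if key in self.cache:
            return self.cache[key]      # cache hit: no database access
        self.db_reads += 1
        value = self.database.get(key)  # cache miss: go to the database
        if value is not None:
            self.cache[key] = value     # store for quick access next time
        return value


reader = CacheAsideReader({"user:1": {"name": "alice"}})
first = reader.get("user:1")   # miss, reads the database
second = reader.get("user:1")  # hit, served from the cache
print(first, second, reader.db_reads)
```

Note that writes are not handled here; in practice the application also has to invalidate or update cached entries when the underlying row changes.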
Distributed Caching
Memcached and similar distributed caching technologies used for this purpose are not magic and can even create problems of their own:
- Memcached was designed to accelerate the reading of data by storing it in main memory; it was not designed to store data permanently. If a server is powered off or otherwise fails, or if memory fills up, data is lost.
- It is yet another tier to manage. Inserting another tier of infrastructure into the architecture to address some (but not all) of the failings of RDBMS technology in the modern interactive software use case creates its own set of problems: more capital costs, more operational expense, more points of failure and more complexity.
NoSQL Technologies
Sharding, denormalizing, distributed caching and other tactics are all attempts to paper over one simple fact: RDBMS technology is a forced fit for modern interactive software systems.
Because vendors of RDBMS technology have little incentive to disrupt a technology generating billions of dollars for them annually, a few application developers at Google (Bigtable) and Amazon (Dynamo) took the initiative and developed NoSQL database technologies.
NoSQL Characteristics:
- No schema required. Data can be inserted in a NoSQL database without first defining a rigid database schema. As a corollary, the format of the data being inserted can be changed at any time, without application disruption. This provides immense application flexibility, which ultimately delivers substantial business flexibility.
- Auto-sharding. A NoSQL database automatically spreads data across servers, without requiring applications to participate. Servers can be added or removed from the data layer without application downtime. Most NoSQL databases also support data replication, storing multiple copies of data across the cluster, and even across data centers, to ensure high availability and support disaster recovery.
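The slides do not say how auto-sharding is implemented. One widely used technique is consistent hashing, sketched below under that assumption; the `HashRing` class, the node names and the choice of 100 virtual nodes per server are all hypothetical. The point it illustrates is that adding a server relocates only the keys that the new server takes over.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring with virtual nodes (an illustrative sketch)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each server owns several points so load spreads more evenly.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first point after the key's hash, wrapping around.
        idx = bisect.bisect_right(self._ring, (_hash(key), "")) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["node-0", "node-1", "node-2", "node-3"])
keys = [f"user-{i}" for i in range(10_000)]
before = {k: ring.node_for(k) for k in keys}

ring.add("node-4")  # grow the cluster by one server
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved) / len(keys):.0%} of keys moved, all to the new node")
```

Because the placement logic lives inside the data layer, the application keeps calling the same lookup while servers are added or removed.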
NoSQL Characteristics:
- Distributed query support. “Sharding” an RDBMS can reduce, or in certain cases eliminate, the ability to perform complex data queries. NoSQL database systems are designed to retain their query expressive power even when distributed across hundreds or thousands of servers.
- Integrated caching. To reduce latency and increase sustained data throughput, advanced NoSQL database technologies cache data in system memory. This behavior is transparent to the application developer and the operations team, in contrast to RDBMS technology, where a caching tier is usually a separate infrastructure tier that must be developed to, deployed on separate servers, and explicitly managed by the ops team.
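One way a distributed store can answer a query that spans servers is scatter-gather: send the same query to every shard, then merge the partial results. The sketch below assumes that approach for illustration only; in-memory lists stand in for servers, and the `query_shard` and `scatter_gather` helpers and the sample documents are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands in for the documents held by one server.
SHARDS = [
    [{"name": "alice", "age": 31}, {"name": "bob", "age": 17}],
    [{"name": "carol", "age": 45}],
    [{"name": "dave", "age": 22}, {"name": "erin", "age": 15}],
]


def query_shard(shard, predicate):
    """Run the filter locally on one shard."""
    return [doc for doc in shard if predicate(doc)]


def scatter_gather(shards, predicate, sort_key):
    """Send the query to every shard in parallel, then merge and sort the results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: query_shard(shard, predicate), shards))
    merged = [doc for part in partials for doc in part]
    return sorted(merged, key=sort_key)


adults = scatter_gather(SHARDS, lambda d: d["age"] >= 18, lambda d: d["age"])
names = [d["name"] for d in adults]
print(names)
```

The caller sees one sorted result set and does not need to know which server held which document.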
Research activities in Big Data
- The White House has recently announced a national "Big Data Initiative" for improving the ability to extract knowledge and insights from large and complex collections of digital data. This initiative will help the US government in scientific discovery, environmental and biomedical research, education, and national security.
- NASA is working on a number of innovative approaches to advancing Big Data, including the Lunar Mapping and Modeling Activity.