Good morning, my name is Mike Doane and I’m a technical director at MarkLogic, where I’ve worked for about six years on the public sector team. My recent focus has been on healthcare, but I’ve been a part of every public sector team, including civilian agencies, the intel community and the DoD, so I have an understanding of the public sector’s big data needs. I only have a brief amount of time to talk to you today about Big Data, so I’m going to focus on how it has created a need for a different way to store data, and emphasize that the new storage mechanisms still need to treat data as a first-class citizen. Simply because the data is different and we refer to it differently does not make it less important, so it should be stored with the same robust features you’ve come to take for granted in relational databases.
Let’s take a quick look back through database history.
We start with mainframe systems: databases built specifically for applications and specific hardware. Hierarchical, application-specific, and hard to change.
The 1970s: the relational model. SQL, a standard language for querying data.
It changed the database world completely... BUT it’s designed for data that fits neatly in columns and rows.
Then came PCs. Networked PCs. That got faster every year.
Then came the internet. Many and varied data types. More people generating data. That requires a new way to store it.
We need a document model that also handles structured data.
Schema-agnostic design, shared-nothing scalability and a host of other features make MarkLogic an uncompromised solution to Big Data.
How many of you are familiar with NoSQL?
NoSQL means “not only SQL,” so it covers structured and unstructured data, AND much of it is very important data that requires enterprise-grade features. I’m referring to all the features of relational databases that large enterprises expect.
The graphic on the right side shows that MarkLogic is a combination of three products integrated in one. You don’t buy a database and integrate a separate search product to enable full-text search, and you don’t need to integrate an application server to provide web services or to serve up a full web application. You get all three of these “features normally found in discrete products” in one fully integrated, QA’d, hardened platform.
Search and Query: we provide full-text search capability across all data elements, including geospatial, proximity and boolean operations. All the features you’d expect in a search engine, but with MarkLogic it’s also structure-aware.
ACID transactions are rare among NoSQL databases. ACID is an acronym for atomicity, consistency, isolation and durability. The atomicity property ensures that if one part of a transaction fails, the entire transaction fails and the database state is left unchanged, in all situations including power failures. The consistency property ensures that any transaction brings the database from one valid state to another. The isolation property ensures that the concurrent execution of transactions results in the system state that would be obtained if the transactions were executed serially, i.e. one after the other. And the durability property means that once a transaction has been committed, it remains committed, even in the event of power loss, crashes, or errors.
HA/DR: High availability ensures that if a node in a MarkLogic cluster fails, the other nodes will take over its responsibilities and deliver the data.
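To make the atomicity property from the ACID discussion above concrete, here is a minimal sketch. It uses Python’s built-in sqlite3 module purely as a stand-in for any ACID-compliant store; it is not MarkLogic code, and the accounts table and the transfer function are made up for illustration. The point is only that a transfer which fails halfway leaves both balances untouched.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        # `with conn` commits if the block succeeds and rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                # Fail after the first update: atomicity must undo it.
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False


def balances(conn):
    """Return {account name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


# Hypothetical sample data held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()
```

Calling transfer(conn, "alice", "bob", 500) returns False, and balances(conn) afterwards shows the same values as before the call, because the partial debit was rolled back.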
Disaster Recovery provides continuity of operations in catastrophic situations: it allows the customer to fail over to an entirely different cluster located somewhere else.
Replication is the mechanism behind disaster recovery, and when enabled it happens in near real-time in the background.
Security is provided through role-based access control and attribute-based access control that was developed for the intelligence community. We allow for sub-document security that is akin to cell-based security, for example restricting access to elements like a patient’s social security number, and all of this is baked into the platform. Application developers do not need to write their own security.
We scale with a shared-nothing architecture with elasticity, in that a cluster can be expanded or contracted to more or fewer computing nodes, all automated according to rules.
We work in the cloud with certifications on Amazon, and we integrate with Hadoop; in fact, we can read and write our database’s partitions on HDFS. Lastly, we provide semantic capabilities that I’ll discuss further later in this talk.
The combination of all these features in one NoSQL product results in no compromises for your big data.
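The idea of sub-document security mentioned above can be sketched in a few lines. This is an illustration of the concept only, not MarkLogic’s API or implementation: the PROTECTED_FIELDS policy, the role names and the redact function are all hypothetical. It shows a document being filtered element by element according to the caller’s roles, so a protected element such as a social security number is never returned to a caller without the right role.

```python
from typing import Any

# Hypothetical policy: element name -> roles allowed to read it.
# Elements not listed here are visible to every role.
PROTECTED_FIELDS = {
    "ssn": {"billing"},
    "diagnosis": {"clinician"},
}


def redact(document: dict[str, Any], roles: set[str]) -> dict[str, Any]:
    """Return a copy of `document` without the elements these roles may not read."""
    visible = {}
    for key, value in document.items():
        allowed = PROTECTED_FIELDS.get(key)
        if allowed is not None and not (allowed & roles):
            continue  # the caller holds no role that grants access to this element
        # Recurse so protected elements nested in sub-documents are removed too.
        visible[key] = redact(value, roles) if isinstance(value, dict) else value
    return visible


# Made-up sample record.
patient = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "visit": {"date": "2014-01-15", "diagnosis": "flu"},
}
```

With this policy, redact(patient, {"clinician"}) keeps the diagnosis but drops the ssn element, while redact(patient, {"billing"}) does the reverse. In a platform with built-in security this filtering happens inside the database rather than in application code, which is the point made above.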
Hadoop has become synonymous with Big Data, but it’s not the sole solution to big data problems. Use Hadoop for what it is good at:
running batch analytics,
progressively enhancing your data, to extract metadata out of binaries or to find the people, places and things of interest in unstructured content and return the enriched data,
helping distribute data loading across a large cluster of NoSQL databases that perform in real time.
MarkLogic has connectors to send data to Hadoop for processing and to run MapReduce jobs inside MarkLogic, and it can even run its database partitions on the Hadoop Distributed File System.
Tiered Storage provides the ability to automatically store different types of data on different storage media to accommodate different SLAs or simply to reduce costs. With tiered storage you can:
Put your most important data on the fastest storage, like SSDs.
Move your older data to cheaper storage.
This can be done automatically.
You can use any combination of SSD, local disk, SAN, NAS, or even HDFS or Amazon S3.
A side benefit of storing on HDFS is that MapReduce can be run directly on the MarkLogic forests, obviating the need to export and re-import data.
No other database allows this.
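A rule for moving older data to cheaper storage can be as simple as an age threshold per tier. The sketch below is a hypothetical illustration of such a rule, not MarkLogic configuration: the tier names, the age limits and the choose_tier function are assumptions chosen for the example.

```python
from datetime import date

# Hypothetical rules, fastest storage first: (maximum age in days, tier name).
TIER_RULES = [
    (30, "ssd"),
    (365, "local-disk"),
    (5 * 365, "san"),
]
ARCHIVE_TIER = "hdfs"  # anything older than the last rule lands here


def choose_tier(last_modified: date, today: date) -> str:
    """Pick the storage tier for a document based on its age in days."""
    age_days = (today - last_modified).days
    for max_age, tier in TIER_RULES:
        if age_days <= max_age:
            return tier
    return ARCHIVE_TIER
```

A scheduled job could evaluate choose_tier for each partition and migrate the ones whose tier has changed, which is how "this can be done automatically" would look from the outside.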
Terrorist Database: hundreds of nodes providing analysts with many different products, and a flexible data model that accommodates the widely varying data, in size and type, available for every terrorist or suspect.
Healthcare.gov: provides an insurance marketplace to our citizens and uses MarkLogic as its database. The site has handled 80,000 requests per minute at peak with no data lost, despite what may be inferred from the press coverage.
Data Services Hub: runs entirely on MarkLogic, receives every health insurance application submitted and exchanges it with the IRS, SSA, DHS, Peace Corps, Medicaid and others. No problems reported.
The FAA has a wide variety of data types. These need to be assembled and presented via a dashboard for situational awareness in real time, especially during crises and emergencies.
DCGS: a metadata catalog that provides situational awareness on the battlefield. The DoD’s focus on counterinsurgency and counterterrorism has brought biometrics, CELLEX, SIGINT tipping and cueing, and all-source intelligence into the mix. That means multiple schemas, variable content, and different communities of interest needing different metadata extensions. Throw in the basic sparse data matrix problem with which metadata catalogs have to contend, and you can quickly see why an enterprise-grade NoSQL database was needed.
BBC: two Olympics, no downtime, and orders of magnitude more content and digital asset delivery than the previous static, non-semantic publishing model. I particularly like their quote: “We pushed MarkLogic as hard as I could and it wouldn’t break.”
Even banks like JPMC trust their trade stores to run on MarkLogic, because its schema-agnostic design reduces the number of disparate databases needed.
These are mission-critical applications of great variety, volume and velocity (including speed of delivery) that represent the need for NoSQL in big data applications.