Big Data is defined by an organization’s ability to analyze large volumes of data, harvest business intelligence from it, and derive actionable insights and business decisions from it. Dark data is the neglected portion – accumulating in archives, Hadoop clusters, remote data stores, grids, and log files – that no one knows what to do with. In a way it is anti-matter that clouds our vision. And if 80% of Big Data is dark data, then not tapping into it means the results of your analytics and intelligence become suspect and incomplete – ultimately not just useless but even harmful.

So what can we do about it? We can continue to ignore it; that’s easy enough. But in these modern times, with the re-emergence of newer and faster ways of accessing data of all types, there is a better option that could have a real impact on the business. Establishing data connectivity – either through a cloud-based service or directly – from any platform or mobile device to data stores residing anywhere is the key that unlocks the dark data mysteries. By using SQL or OData APIs to access the data, we can hide the complexities of proprietary interfaces and unique Web services and get to a variety of data sources, no matter where they reside, at light speed.

On the right-hand side of the slide, I show our support for Apache Hive, which comes with all of the popular commercial Hadoop distributions out there, like Hortonworks, MapR, IBM BigInsights, and Cloudera. We have ODBC and JDBC drivers that access Hive data directly. The data that ends up in Hive gets there via MapReduce jobs over the Hadoop distributed file system, which filter out the right data. And since Hive is SQL-oriented, it’s custom-made for accessibility and performance.
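To make the Hive point concrete, here is a minimal Python sketch of reaching Hive through an ODBC driver. It assumes the pyodbc package is installed and that a DSN named "HiveDSN" has been configured against a Hive ODBC driver – both the DSN name and the table/column names below are hypothetical, for illustration only.

```python
def build_hive_query(table, columns, limit=100):
    """Compose plain SQL; Hive's SQL dialect accepts this form."""
    col_list = ", ".join(columns)
    return f"SELECT {col_list} FROM {table} LIMIT {limit}"

def fetch_dark_data(dsn="HiveDSN"):
    """Connect over ODBC and pull rows. The same SQL text works
    unchanged against any back end the driver points at, which is
    what lets one interface hide many proprietary ones."""
    import pyodbc  # third-party ODBC binding; assumed installed
    query = build_hive_query("web_logs", ["ts", "url", "status"])
    conn = pyodbc.connect("DSN=" + dsn)
    try:
        cursor = conn.cursor()
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        conn.close()
```

The application only ever sees SQL and rows; the driver absorbs the Hive-specific wire protocol underneath.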
Having many different and varied connections makes API management too complex – this is not a scalable environment. Each API can be unique to its data source, with some native APIs even being proprietary. These native APIs also change frequently: the native Salesforce.com API, for example, gets updated every quarter. That means that if you are relying on it within your applications, you might need to update every app that uses it to accommodate those changes. Another point is development cost. With many different data APIs to handle – each demanding a unique skill set, and many changing often enough to become a maintenance nightmare – the cost of managing multiple APIs explodes along with the explosion of new, up-and-coming data sources out there that you need to access.
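The maintenance argument above can be sketched as a small adapter pattern: applications code against one uniform query surface, so when a vendor revs its native API (as Salesforce.com does quarterly), only the one adapter changes, not every application. The class names and the in-memory stand-in below are hypothetical illustrations, not a real product API.

```python
from abc import ABC, abstractmethod

class DataSource(ABC):
    """One uniform query surface for all back ends."""
    @abstractmethod
    def query(self, entity, **filters):
        ...

class InMemorySource(DataSource):
    """Stand-in for a native API; a real adapter would wrap a
    vendor's REST or SOAP calls behind this same method."""
    def __init__(self, tables):
        self._tables = tables

    def query(self, entity, **filters):
        rows = self._tables.get(entity, [])
        return [r for r in rows
                if all(r.get(k) == v for k, v in filters.items())]

# Applications call DataSource.query() only; swapping or updating the
# adapter underneath requires no application changes.
crm = InMemorySource({"accounts": [{"id": 1, "region": "EMEA"},
                                   {"id": 2, "region": "APAC"}]})
emea = crm.query("accounts", region="EMEA")
```

When the quarterly API update lands, you rewrite the inside of one adapter; the `query()` contract the apps depend on stays frozen.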
Enterprise data, typically held in on-premise relational databases behind a firewall, poses some security challenges that add latency to data access. Normally, if I wanted to get to that data from a cloud app or from a different network, I would need to open a new data port and essentially reconfigure the firewall – as well as establish separate or new authorization and credentials for every database I want to reach. And most likely you’ll need extensive SQL API knowledge for each relational database.
So let me summarize this way … I talked about the top data API headaches and here I have some aspirin-oriented recommendations for how to alleviate those headaches.
Cloud-based data connectivity services help ease the data variety chaos.
The OData standard – the Open Data Protocol – enables standardized data access for enterprise mobile applications.
New technology provides low-touch access to enterprise data through a firewall.
SQL access to Apache Hive and HBase in Big Data environments helps ease the headache of dealing with volumes of data – and dark data.
Direct access to databases using data drivers architected for speed and efficiency is paramount to preserving ACID transactions.
And finally, customized access to private application data can be built quickly and easily.
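On the OData recommendation: the reason the protocol suits mobile apps is that every compliant service understands the same system query options, so one small client works everywhere. A minimal sketch, assuming a hypothetical service root (the URL below is not a real endpoint):

```python
from urllib.parse import urlencode

def odata_url(service_root, entity_set, select=None, filter_expr=None):
    """Build an OData read URL from standard system query options
    ($select, $filter), which any OData-compliant service accepts."""
    params = {}
    if select:
        params["$select"] = ",".join(select)
    if filter_expr:
        params["$filter"] = filter_expr
    qs = urlencode(params, safe="$,")
    return f"{service_root}/{entity_set}" + (f"?{qs}" if qs else "")

url = odata_url("https://example.com/odata", "Orders",
                select=["OrderID", "Total"],
                filter_expr="Total gt 100")
# A client would then issue an HTTP GET against this URL and parse the
# JSON (or Atom) payload the service returns.
```

Because the query options are standardized, the mobile app needs no per-source native SDK – just HTTP and this one URL convention.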
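And on the ACID recommendation: a direct driver exposes real transactions, so a failed multi-statement update rolls back cleanly instead of leaving partial writes behind. Here Python's stdlib sqlite3 driver stands in for any database driver that speaks native transactions; the table and the invariant check are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
conn.commit()

try:
    with conn:  # one atomic transaction: both updates land, or neither
        conn.execute("UPDATE accounts SET balance = balance - 50 "
                     "WHERE name = 'a'")
        conn.execute("UPDATE accounts SET balance = balance + 50 "
                     "WHERE name = 'oops'")  # matches no row
        # Enforce the invariant; raising inside the block rolls back.
        total = conn.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]
        if total != 100:
            raise ValueError("funds vanished; roll back")
except ValueError:
    pass  # the partial debit above was undone automatically

balance_a = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'a'").fetchone()[0]
# balance_a is 100 again: the half-finished transfer never became visible.
```

That atomicity guarantee is exactly what gets lost when data is shuttled through layers that batch or replay writes outside a transaction.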