When building a high-performance data mining platform, key considerations include the current and anticipated size of the organization's data, whether the data is structured or unstructured, how much data movement projects will require, and what percentage of projects can be solved on a single machine. The purchasing decision will likely be made by IT, which should first understand the current and future problems to be solved, as well as stakeholder needs, to reach the best outcome for the organization.
1. High-performance data mining platforms: things to consider
The computing environment is critical to your success in big data and data mining projects. Your
computing environment comprises four important resources to consider: network, disk, central
processing units (CPUs), and memory. Time-to-solution goals, expected data volumes, and budget will
direct your decisions regarding computation resources. Appropriate computing platforms for data
analysis depend on many dimensions, primarily the volume of data (initial, working set, and output
data volumes), the pattern of access to the data, and the algorithm for analysis. These will vary
depending on the phase of the data analysis.
Here are some key questions to consider when you are trying to build or purchase a high-performance
data mining platform for your organization.
• What is the size of your data now?
• What is the anticipated growth rate of your data in the next few years?
• Is the data you are storing mostly structured data or unstructured data?
• What data movement will be required to complete your data mining projects?
• What percentage of your data mining projects can be solved on a single machine?
• What other changes are required to improve data processing power? Do you need to buy
additional multi-core machines or servers? How much could you save by processing the data in
the cloud?
• How can you leverage the existing IT infrastructure to run analytics applications faster and more smoothly?
• How many person-hours of effort are required to build new big data and advanced analytics
capabilities in the current environment?
• How many existing system applications and data sources must be integrated to build
efficient big data processing capabilities?
• Do you need to redesign organizational metadata to create a single, consistent version of the data?
• Is your data mining software a good match for the computing environment you are designing?
• What are your users' biggest complaints about the current system?
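The first few questions above (current size, growth rate, and the share of projects that fit on a single machine) lend themselves to quick back-of-the-envelope arithmetic. The sketch below is one possible way to do that, assuming steady compound annual growth and treating "fits on a single machine" as "working set no larger than that machine's memory". The function names and all numbers are hypothetical and are not taken from the text.

```python
from typing import Optional, Sequence


def projected_size_tb(current_tb: float, annual_growth: float, years: int) -> float:
    """Projected data size after `years`, assuming compound growth.

    `annual_growth` is a fraction, e.g. 0.5 means 50% growth per year.
    """
    return current_tb * (1.0 + annual_growth) ** years


def years_until_exceeds(current_tb: float, annual_growth: float,
                        capacity_tb: float, horizon: int = 50) -> Optional[int]:
    """First whole year in which the data strictly exceeds `capacity_tb`.

    Returns None if capacity is not exceeded within `horizon` years.
    """
    size = current_tb
    for year in range(horizon + 1):
        if size > capacity_tb:
            return year
        size *= 1.0 + annual_growth
    return None


def single_machine_share(working_sets_gb: Sequence[float], machine_memory_gb: float) -> float:
    """Fraction of projects whose working set fits in one machine's memory."""
    if not working_sets_gb:
        return 0.0
    fits = sum(1 for ws in working_sets_gb if ws <= machine_memory_gb)
    return fits / len(working_sets_gb)


if __name__ == "__main__":
    # Hypothetical inputs: 2 TB today, 50% annual growth, 10 TB of capacity.
    print(projected_size_tb(2.0, 0.5, 4))        # size after four years, in TB
    print(years_until_exceeds(2.0, 0.5, 10.0))   # year in which 10 TB is exceeded
    # Hypothetical project working sets (GB) against a 128 GB machine.
    print(single_machine_share([4, 16, 64, 256], 128))
```

Estimates like these are only a starting point for the discussion with IT; real growth is rarely smooth, and memory is only one of the four resources mentioned earlier.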
The purchasing decision for the data mining computing platform will likely be made by the IT
organization, but it must first understand the problems being solved now and the problems that
will need to be solved in the near future. Considering platform trade-offs and the needs of the
organization, with everything shared transparently, will lead to the best outcome for the entire
organization and the individual stakeholders.
V 1.0- Author: Ashish Jain Date: 13-Apr-2015
2. The figure below compares a number of big data technologies. It highlights the
different types of systems and their comparative strengths and weaknesses.
MPP Databases: Massively parallel processing databases, also known as enterprise
data warehouses (EDWs)
IMDBs: In-memory databases (SAP HANA, Oracle Exalytics)
Hadoop: Based on a distributed file system
NoSQL Databases: Cassandra, HBase, MongoDB, Couchbase, etc.
You can learn more about in-memory databases from the URLs below:
http://www.mcobject.com/in_memory_database
http://www.opensourceforu.com/2012/01/importance-of-in-memory-databases/
http://searchsap.techtarget.com/feature/Faceoff-SAP-HANA-a-full-in-memory-database-unlike-Oracle-Exalytics
You can learn more about NoSQL databases from the URLs below:
https://www.digitalocean.com/community/tutorials/a-comparison-of-nosql-database-management-systems-and-models
http://nosql-database.org/
                                         In-Memory   MPP        Big Data    NoSQL      Hadoop
                                         Database    Database   Appliance   Database
Consistent                               ●           ●          ●           ▲          ▲
Available                                ●           ●          ●           ▲          ▲
Fault tolerant                           ●           ●          ▲           ●          ●
Suitable for real-time transactions      ●           ●          ●           ♦          ♦
Suitable for analytics                   ▲           ▲          ●           ●          ♦
Suitable for extremely large data sizes  ♦           ▲          ▲           ●          ●
Suitable for unstructured data           ♦           ♦          ▲           ●          ●
● Meets widely held expectations
▲ Potentially meets widely held expectations
♦ Fails to meet widely held expectations
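The comparison in the figure can also be treated as a small lookup table for shortlisting technologies against a set of requirements. The sketch below is illustrative only. It assumes the figure's columns run, left to right, In-Memory Database, MPP Database, Big Data Appliance, NoSQL Database, Hadoop, and that the four unlabeled symbol rows correspond, in order, to the four "Suitable for" labels. The names `Rating`, `MATRIX`, and `candidates` are hypothetical.

```python
from enum import Enum
from typing import Dict, Iterable, List


class Rating(Enum):
    MEETS = "●"      # meets widely held expectations
    POTENTIAL = "▲"  # potentially meets widely held expectations
    FAILS = "♦"      # fails to meet widely held expectations


# Assumed column order of the figure (see the note above).
TECHNOLOGIES = ["In-Memory Database", "MPP Database", "Big Data Appliance",
                "NoSQL Database", "Hadoop"]

# One symbol per technology, in the assumed column order.
_ROWS = {
    "Consistent": "●●●▲▲",
    "Available": "●●●▲▲",
    "Fault tolerant": "●●▲●●",
    "Real-time transaction": "●●●♦♦",
    "Analytics": "▲▲●●♦",
    "Extremely large data size": "♦▲▲●●",
    "Unstructured data": "♦♦▲●●",
}

MATRIX: Dict[str, Dict[str, Rating]] = {
    criterion: {tech: Rating(sym) for tech, sym in zip(TECHNOLOGIES, symbols)}
    for criterion, symbols in _ROWS.items()
}


def candidates(required: Iterable[str], allow_potential: bool = False) -> List[str]:
    """Technologies rated acceptable on every required criterion.

    By default only MEETS is acceptable; with `allow_potential=True`,
    POTENTIAL ratings are accepted as well.
    """
    ok = {Rating.MEETS, Rating.POTENTIAL} if allow_potential else {Rating.MEETS}
    required = list(required)
    return [tech for tech in TECHNOLOGIES
            if all(MATRIX[criterion][tech] in ok for criterion in required)]


if __name__ == "__main__":
    print(candidates(["Extremely large data size", "Unstructured data"]))
    print(candidates(["Real-time transaction", "Analytics"]))
```

A filter like this only narrows the field. The ratings are coarse, so the questions listed in section 1 still decide the final choice.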