Optimizing Performance for Big Data Analysis

Unlocking insights from vast data volumes requires a scalable system that quickly processes both unstructured and structured data. The Intel® Distribution for Apache Hadoop provides enhancements that boost performance while streamlining deployment.

By Armando Acosta and Maggie Smith

Published in: Technology

Transcript of "Optimizing Performance for Big Data Analysis"

  1. 1. Business intelligence

     Optimizing performance for big data analysis

     By Armando Acosta and Maggie Smith

     Reprinted from Dell Power Solutions, 2013 Issue 3. Copyright © 2013 Dell Inc. All rights reserved.

     Unlocking insights from vast data volumes requires a scalable system that quickly processes both unstructured and structured data. The Intel® Distribution for Apache Hadoop provides enhancements that boost performance while streamlining deployment.

     Once a little-known technology, the Apache™ Hadoop® software framework was developed to support offline analytics. It has since evolved into a powerful platform for managing and processing the vast amounts of data deluging enterprise systems.

     From megabytes to yottabytes, information repositories are growing by the nanosecond through an influx of unstructured data from social networking sites, video, images, mobile devices, sensors and other sources. To gain insights from massive amounts of diverse data types, many organizations are looking beyond the restricted capacity and capabilities of standard relational database management systems (RDBMSs). The same architectural design of RDBMSs that helps ensure consistency and availability often results in scalability limitations. Also, the use of proprietary RDBMS extensions optimizes database performance but subjects organizations to vendor lock-in. And organizations may face costly per-processing licenses for commercial RDBMSs.

     As a result, the demand for innovative, cost-effective big data offerings is intensifying. The global big data technology and services market is projected to expand at a 31.7 percent compound annual growth rate through 2016, about seven
  2. 2. times greater than that of the information and communication technology market.1

     1 IDC, "Worldwide Big Data Technology and Services 2012-2016 Forecast," doc #238746, December 2012.

     Deriving actionable insights from huge data volumes calls for a system that can process multistructured data volumes rapidly and scale easily to accommodate growth in a stable and secure IT environment. The open-source Hadoop platform shows enormous promise for big data management and processing in a number of scenarios, ranging from mining social media profiles and flagging credit card fraud to identifying top job candidates and predicting weather patterns.

     [Figure: Taxonomy of the Intel Distribution for Apache Hadoop. Intel Manager for Apache Hadoop Software (deployment, configuration, monitoring, alerting and security) sits alongside Apache Hive (SQL-like query), Apache Pig™ (scripting), Apache Oozie™ (workflow), Apache Zookeeper™ (coordination), Apache Flume™ (log collector), Apache Sqoop™ (data exchange), Apache HBase (columnar storage), the MapReduce distributed processing framework and the Apache HDFS Hadoop Distributed File System.]

     Yet for all Hadoop’s data-crunching prowess, an absence of integrated support for strong data security has slowed deployment efforts. Consider, for example, a financial institution that combines multiple data warehouses into a large Hadoop cluster. Securing the data requires extensive use of embedded encryption tools. However, many Hadoop implementations are not optimized to handle the processing load incurred by encryption and decryption, which typically add considerable latency and consume substantial compute resources.

     To address organizational needs to run high-performance analytics on a secure platform, Dell has teamed up with Intel to optimize the Intel Distribution for Apache Hadoop software for deployment on Dell™ hardware. The Intel Distribution is designed to provide secure enterprise-quality distributed-processing and data-management software, as well as deployment support and consulting services.

     Finding the right fit

     Because of the wide variety of big data challenges, organizations require broadened flexibility and choice in a platform that helps them gain valuable insights based on their specific use cases. (For more information, see the sidebar, “Distributed processing in action.”) When it comes to big data management and analytics, one size does not fit all.

     To that end, Dell has expanded its Hadoop offerings to include the Intel Distribution for Apache Hadoop. The Intel Distribution joins the field-tested Dell | Cloudera Hadoop Solution, which combines Cloudera’s Distribution Including Apache Hadoop (CDH) with Dell servers, Dell-developed Crowbar deployment software and networking components, as well as management tools, training, technology support and professional services. (For more information, see the sidebar, “Insight acceleration.”)

     Enhancing performance, security and manageability

     The Intel Distribution is packaged with the Hadoop platform and other software components (see figure). Hadoop comprises the Hadoop Distributed File System (HDFS™) framework, designed for high-throughput data storage and access on commodity hardware, and the MapReduce framework, which enables developers to write applications that execute jobs in parallel on large clusters. Other core components of Hadoop are the Apache Hive™ data warehousing software and the Apache HBase™ database, a distributed, columnar big data store.

     With the power of Hadoop at its foundation, the Intel Distribution features a number of additional capabilities and optimizations designed to streamline deployment and improve
  3. 3. performance. Intel® Manager for Apache Hadoop is a web-based management console that facilitates the installation, configuration and administration of the Hadoop cluster. Intel Manager also supports resource monitoring and alerting through the open-source Nagios® and Ganglia monitoring systems, which are included in the Intel Distribution. By taking advantage of this powerful, easy-to-use tool, IT can focus critical resources and expertise on deriving business value from the Hadoop environment rather than managing the cluster.

     The Intel Distribution includes extensions to HBase and Hive that help improve real-time transactional performance and the end-user experience. Exceptional encryption and decryption capabilities heighten security and access control. The Intel Distribution is optimized to work with Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI) technology, which is built into Intel® Xeon® processors. Intel AES-NI is designed to accelerate compute-intensive encryption and decryption, helping eliminate latency and greatly reduce processor load.

     In addition to leveraging the capabilities of its processors, Intel can build and optimize hardware features of the company’s solid-state drives (SSDs) and 10 Gigabit Ethernet (10GbE) adapters to boost Hadoop performance, security and manageability.

     Also critical to accelerating Hadoop performance is server optimization. The Intel Distribution is designed to efficiently integrate Hadoop with Dell servers to deliver optimal solutions for a variety of use cases. The Dell PowerEdge™ R720xd server is well suited for Hadoop deployments because these environments often require a 1:1 spindle-to-core ratio for optimized performance. The PowerEdge R720xd features high spindle-to-core counts and includes options to avoid I/O bottlenecks.

     Sidebar: Insight acceleration

     Organizations worldwide are turning to the open-source Apache Hadoop software platform to support enterprise applications that analyze extremely large amounts of diverse data. However, the inherent nature of Hadoop, with its distributed architecture, adds layers of complexity, especially when it comes to deployment, management and security. As a result, many organizations may have delayed Hadoop deployments because they lack the necessary expertise in planning, design, implementation and maintenance.

     By providing the expert assistance, tools and technology resources needed, Dell Services helps organizations move their Hadoop activities from the sandbox to production environments to achieve business value. These services are tailored to an organization’s short- and/or long-term objectives and help optimize the use of emerging technologies, advance efficiencies and maximize the value of IT investments.

     Experts at Dell Solution Centers located in key sites around the globe are available to bolster the technical skills of those new to Hadoop and open-source technologies. They can help participants gain hands-on experience with a variety of topics, ranging from obtaining maximum performance from an application deployed on Dell servers and storage to exploring cloud computing and big data using Hadoop. At a Dell Solution Center, participants can attend a technical briefing with a Dell expert, investigate an architectural design session or build a proof-of-concept engagement to comprehensively validate a big data solution and streamline deployment. Using an organization’s specific configurations and test data, participants can discover how a big data solution from Dell meets their business needs.

     A recent addition to Dell’s global network of solution centers is the Big Data Innovation Center in Singapore, where organizations can test big data initiatives and proofs of concept. The facility provides a big data stack that includes Dell infrastructure using Intel Xeon E5 processor–based servers, Intel® 10 Gigabit Ethernet networking, Intel® Solid-State Drives, the Intel Distribution for Apache Hadoop and Revolution R Enterprise predictive analytics software. Organizations that need to test-run their big data workloads can use the center to determine the impact of big data initiatives on their business. The center also offers training to help equip participants with the skills necessary for improving the quality of data mining across a wide range of platforms and data sources.

     For more information on Dell Solution Centers, visit dell.com/solutioncenters.
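The 1:1 spindle-to-core guideline cited for Hadoop servers comes down to simple arithmetic: data drives divided by physical cores. The sketch below shows one way to check a node against that guideline. The `NodeSpec` type, the function names and the drive and core counts are all hypothetical and chosen for illustration; they are not published specifications for the PowerEdge R720xd or any other server.

```python
from dataclasses import dataclass


@dataclass
class NodeSpec:
    """Hypothetical description of one Hadoop worker node."""
    data_drives: int      # spindles dedicated to HDFS data
    physical_cores: int   # total physical CPU cores in the node


def spindle_to_core_ratio(node: NodeSpec) -> float:
    """Return data drives per physical core for the node."""
    if node.physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return node.data_drives / node.physical_cores


def meets_guideline(node: NodeSpec, target: float = 1.0) -> bool:
    """True when the node has at least `target` spindles per core."""
    return spindle_to_core_ratio(node) >= target


# Illustrative numbers only; not a real server configuration.
balanced = NodeSpec(data_drives=16, physical_cores=16)
drive_starved = NodeSpec(data_drives=6, physical_cores=16)

for name, node in (("balanced", balanced), ("drive_starved", drive_starved)):
    ratio = spindle_to_core_ratio(node)
    print(f"{name}: {ratio:.3f} spindles per core, ok={meets_guideline(node)}")
```

A node that falls below the target has more cores than its disks can keep supplied with data, which is the I/O bottleneck the article warns about.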
  4. 4. Sidebar: Distributed processing in action

     As big data becomes big business, organizations are discovering innovative ways to harness the value of their data. The Intel Distribution for Apache Hadoop helps these organizations get the most out of hardware performance, strengthen data security and improve data management and processing capabilities.

     One company, for example, used the Intel Distribution to support its powerful search-engine technology for life-science researchers. Dedicated to furthering genomics research, the company was having trouble managing its large data sets. To scale in a cost-effective manner, the company deployed the Intel Distribution and used Apache Hive and Apache Hadoop for query and search. The company also turned to Intel to optimize its hardware and software for increased performance. As a result, the company achieved an exceptional increase in throughput using fewer than half the nodes previously deployed.

     Another example is a large telecommunications company that was faced with eroding profits thanks in part to the high cost of maintaining a complex billing system. Poor-quality customer service stemming from the beleaguered billing system was prompting customer churn. Unfortunately, the company’s existing relational database management system (RDBMS) could not deliver storage scalability or real-time query access. So the telecommunications company selected the Intel Distribution for real-time analytics and decision support, as well as solid disaster recovery and failover. The result: exceptional support for a new business intelligence initiative that provided a lower total cost of ownership compared to its traditional RDBMS.

     Putting big data to work

     Originally a tool for offline analytics of web-scale data, Hadoop is fast on its way to becoming a business-critical platform for gathering intelligence and actionable insights from vast amounts of unstructured data. Helping to drive this transformation is the Intel Distribution for Apache Hadoop, an open-source offering that unites the power of Hadoop and other software elements with important performance enhancements and hardware optimizations from Intel. Together, this combination of capabilities not only enhances security, performance and manageability, but also provides a robust foundation for advancing innovation in analytics by the open-source community.

     Authors

     Armando Acosta is a senior product line consultant at Dell and has more than 15 years of experience in the IT industry.

     Maggie Smith is a senior marketing manager at Dell. She is focused on big data solutions for enterprises and has over 30 years of experience marketing technology products.

     Learn more

     Intel Distribution for Apache Hadoop on Dell PowerEdge Servers: qrs.ly/br3gyd4
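For readers new to the MapReduce model that the article describes as executing jobs in parallel, the following is a minimal single-machine sketch of its three stages (map, shuffle, reduce) applied to a word count. It is an illustration under stated assumptions, not Hadoop code: the function names and sample data are hypothetical, and a local thread pool stands in for the cluster nodes that a real Hadoop job would use.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(mapped_outputs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single total."""
    return key, sum(values)


def word_count(splits, workers=4):
    """Run the map tasks concurrently, then shuffle and reduce."""
    # Threads are a stand-in for cluster nodes (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, splits))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: each string plays the role of one block of a large file.
splits = ["big data big insights", "data drives insights", "big clusters"]
counts = word_count(splits)
print(counts)
```

Because each map task reads only its own split and each reduce call sees only one key, the work can be spread across many machines without coordination inside a phase, which is what lets the model scale to large clusters.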
