RC25281 (AUS1204-004) April 23, 2012
Computer Science




                        IBM Research Report
       Understanding System and Architecture for Big Data

      Anne E. Gattiker, Fadi H. Gebara, Ahmed Gheith, H. Peter Hofstee,
                   Damir A. Jamsek, Jian Li, Evan Speight
                             IBM Research Division
                          Austin Research Laboratory
                               11501 Burnet Road
                               Austin, TX 78758
                                     USA

                              Ju Wei Shi, Guan Cheng Chen
                                  IBM Research Division
                                China Research Laboratory
                          Building 19, Zhongguancun Software Park
                        8 Dongbeiwang West Road, Haidian District
                                      Beijing, 100193
                                         P.R.China

                                        Peter W. Wong
                                  IBM Linux Technology Center
                                      11501 Burnet Road
                                       Austin, TX 78758
                                             USA




                Research Division
                Almaden - Austin - Beijing - Cambridge - Haifa - India - T. J. Watson - Tokyo - Zurich
Understanding Systems and Architectures for Big Data

      William M. Buros, Guan Cheng Chen, Mei-Mei Fu, Anne E. Gattiker, Fadi H. Gebara,
      Ahmed Gheith, H. Peter Hofstee, Damir A. Jamsek, Thomas G. Lendacky, Jian Li, Yan
           Li, John S. Poelman, Steven Pratt, Ju Wei Shi, Evan Speight, Peter W. Wong
                                              International Business Machines Corp.
                      {wmb,chengc,mfu,gattiker,fhgebara,ahmedg,hofstee,jamsek,toml,
                       jianli,liyancrl,poelman,slpratt,jwshi,speight,wpeter}@us.ibm.com

ABSTRACT

The use of Big Data underpins critical activities in all sectors of our society. Achieving the full transformative potential of Big Data in this increasingly digital world requires both new data analysis algorithms and a new class of systems to handle the dramatic data growth, the demand to integrate structured and unstructured data analytics, and the increasing computing needs of massive-scale analytics. In this paper, we discuss several Big Data research activities at IBM Research: (1) Big Data benchmarking and methodology; (2) workload optimized systems for Big Data; and (3) a case study of Big Data workloads on IBM Power systems. Our case study shows that preliminary infrastructure tuning results in sorting 1TB of data in 9 minutes(1) on 10 PowerLinux(TM) 7R2 systems [5] running IBM InfoSphere BigInsights(TM). This translates to sorting 11.4GB/node/minute for the IO-intensive sort benchmark. We also show that 4 PowerLinux 7R2 nodes can sort 1TB of input in around 22 minutes. Further improvements are expected as we continue full-stack optimizations on both IBM software and hardware.

1. INTRODUCTION

The term "Big Data" refers to the continuing massive expansion in the data volume and variety as well as the velocity and veracity of data processing [12]. Volume refers to the scale of the data and processing needs. Whether for data at rest or in motion, i.e., a repository of information or a stream, the desire for high speed is constant, hence the notion of velocity. Variety indicates that Big Data may be structured in many different ways, and techniques are needed to understand and process such variation. Veracity refers to the quality of the information in the face of data uncertainty from many different sources: the data itself, sensor precision, model approximation, or inherent process uncertainty. Emerging social media also brings in new types of data uncertainty such as rumors, lies, falsehoods and wishful thinking. We believe the 4 Vs presently capture the main characteristics of Big Data.

The importance of Big Data in today's society cannot be overstated, and proper understanding and use of this important resource will require both new data analysis algorithms and new systems to handle the dramatic data growth, the demand to integrate structured and unstructured data analytics, and the increasing computing needs of massive-scale analytics. As a result, there is much research in critical aspects of emerging analytics systems for Big Data. Examples of current research in this area include: processor, memory, and system architectures for data analysis; benchmarks, metrics, and workload characterization for analytics and data-intensive computing; debugging and performance analysis tools for analytics and data-intensive computing; accelerators for analytics and data-intensive computing; implications of data analytics for mobile and embedded systems; energy efficiency and energy-efficient designs for analytics; availability, fault tolerance and recovery issues; scalable system and network designs for high-concurrency or high-bandwidth data streaming; data management and analytics for vast amounts of unstructured data; evaluation tools, methodologies and workload synthesis; OS, distributed systems and system management support; and MapReduce and other processing paradigms and algorithms for analytics.

This short paper briefly discusses three topics that IBM researchers and colleagues from other IBM divisions have been working on:

   • Big Data benchmarking and methodology
   • Workload optimized systems for Big Data
   • A case study of a Big Data workload on IBM Power systems

We highlight the research directions that we are pursuing in this paper. More technical details and progress updates will be covered in several papers in an upcoming issue of the IBM Journal of Research and Development.

(1) All performance data contained in this publication was obtained in the specific operating environment and under the conditions described below and is presented as an illustration. Performance obtained in other operating environments may vary, and customers should conduct their own testing.

Contact Author: Jian Li, IBM Research in Austin, 11501 Burnet Road, Austin, TX 78758; Phone: (512) 286-7228; Email: jianli@us.ibm.com
IBM Research Technical Report. July 10, 2012.
Copyright 2012.
2. BENCHMARKING METHODOLOGY

Massive-scale analytics is representative of a new class of workloads that justifies a re-thinking of how computing systems should be optimized. We start by tackling the absence of a set of benchmarks that system hardware designers can use to measure the quality of their designs and that customers can use to evaluate competing hardware offerings in this fast-growing and still rapidly changing market. Existing benchmarks, such as HiBench [11], fall short in terms of both scale and relevance. We conceive a methodology for peta-scale, data-size benchmark creation that includes representative Big Data workloads and can be used as a driver of total system performance, with demands balanced across storage, network, and computation. Creating such a benchmark requires meeting unique challenges associated with the data size and its unstructured nature. To be useful, the benchmark also needs to be generic enough to be accepted by the community at large. Below we describe unique challenges associated with massive-scale analytics benchmark creation, along with a specific benchmark we have created to meet them.

The first consequence of the massive scale of the data is that the benchmark must be descriptive, rather than prescriptive. In other words, our proposed benchmark is provided as instructions for acquiring the required data and processing it, rather than as benchmark code to run on supplied data. We propose the use of existing large datasets, such as the 25TB ClueWeb09 dataset [9] and the over-200TB Stanford WebBase repository [10]. Challenges of using such real-world large datasets include physical data delivery (e.g., via shipped disk drives) and data formatting/"cleaning" to allow robust processing.

We propose compute- and data-intensive processing tasks that are representative of key massive-scale analytics workloads to be applied to this unstructured data. These tasks cover major Big Data application areas: text analytics, graph analytics, and machine learning. Specifically, our benchmark efforts have focused on document categorization based on dictionary matching, document and page ranking, and topic determination via non-negative matrix factorization. The first of the three, in particular, required innovation in benchmark creation, as there is no "golden reference" to establish correct document categorization. Existing datasets typically used as references for text-categorization assessments, such as the Enron corpus [2], are orders of magnitude smaller than what we required. Our approach for overcoming this challenge included utilizing publicly accessible documents coded by subject, such as US Patent Office patents and applications, to create subject-specific dictionaries against which to match documents. Unique challenges of ensuring "real-world" relevance include covering non-word terms of importance, such as band names that contain regular-expression characters; a "wisdom of crowds" approach helps us meet those challenges.

It is our intention to make our benchmark public. The benchmark is complementary to existing prescriptive benchmarks, such as Terasort and its variations, which have been widely exercised in the Big Data community. In this paper, we use Terasort as a case study in Section 4.
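For illustration only, the minimal Java sketch below shows what the dictionary-matching step of such a categorization task might look like. It is not the benchmark implementation; the category names, dictionary terms, and simple hit-count scoring rule are hypothetical. It does show how Pattern.quote keeps terms containing regular-expression characters (such as some band names) from being misinterpreted as patterns.

    import java.util.*;
    import java.util.regex.*;

    // Minimal, illustrative sketch of dictionary-based document categorization.
    // Dictionaries and the hit-count scoring rule are examples, not the benchmark.
    public class DictionaryCategorizer {
        private final Map<String, List<Pattern>> dictionaries = new HashMap<>();

        // Compile one pattern per term; Pattern.quote escapes terms that contain
        // regular-expression metacharacters (e.g., "AC/DC").
        public void addDictionary(String category, Collection<String> terms) {
            List<Pattern> patterns = new ArrayList<>();
            for (String term : terms) {
                patterns.add(Pattern.compile("\\b" + Pattern.quote(term) + "\\b",
                                             Pattern.CASE_INSENSITIVE));
            }
            dictionaries.put(category, patterns);
        }

        // Return the category whose dictionary matches the document most often.
        public String categorize(String document) {
            String best = "unknown";
            int bestHits = 0;
            for (Map.Entry<String, List<Pattern>> e : dictionaries.entrySet()) {
                int hits = 0;
                for (Pattern p : e.getValue()) {
                    Matcher m = p.matcher(document);
                    while (m.find()) hits++;
                }
                if (hits > bestHits) { bestHits = hits; best = e.getKey(); }
            }
            return best;
        }

        public static void main(String[] args) {
            DictionaryCategorizer c = new DictionaryCategorizer();
            c.addDictionary("semiconductors", Arrays.asList("wafer", "lithography", "dopant"));
            c.addDictionary("music", Arrays.asList("AC/DC", "concert", "setlist"));
            System.out.println(c.categorize("The lithography step patterns each wafer."));
        }
    }

In the actual benchmark the dictionaries are built from subject-coded public documents, as described above, and the scoring must also cope with documents that match several dictionaries.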
3. WORKLOAD OPTIMIZED SYSTEMS

While industry has made substantial investments in extending its software capabilities in analytics and Big Data, thus far these new workloads are being executed on systems that were designed and configured in response to more traditional workload demands.

In this paper we present two key improvements to traditional system design. The first is the addition of reconfigurable acceleration. While reconfigurable acceleration has been used successfully in commercial systems and appliances before (e.g., DataPower(R) [6] and Netezza(R) [4]), we have demonstrated via a prototype system that such technology can benefit the processing of unstructured data.

The second innovation we discuss is a new modular and dense design that also leverages acceleration. Achieving the computational and storage densities that this design provides requires an increase in processing efficiency, achieved by a combination of power-efficient processing cores and offloading of performance-intensive functions to the reconfigurable logic.

With an eye towards how analytics workloads are likely to evolve and how to execute such workloads efficiently, we conceive of a system that leverages the accelerated dense scale-out design in combination with powerful global server nodes that orchestrate and manage computation. This system can also be used to perform deeper in-memory analytics on selected data.

4. BIG DATA ON POWER SYSTEMS

Apache Hadoop [1] has been widely deployed on clusters of relatively large numbers of moderately sized, commodity servers. However, it has not been widely used on large, multi-core, heavily threaded machines, even though smaller systems have increasingly large core and hardware-thread counts. We describe an initial performance evaluation and tuning of Hadoop on a large multi-core cluster with only a handful of machines. The evaluation environment comprises IBM InfoSphere BigInsights [3] on a 10-machine cluster of IBM PowerLinux(TM) 7R2 systems.

4.1 Evaluation Environment

Table 1 shows the evaluation environment, which comprises IBM InfoSphere BigInsights [3] on a 10-machine cluster of IBM PowerLinux 7R2 servers. IBM InfoSphere BigInsights provides the power of Apache Hadoop in an enterprise-ready package. BigInsights enhances Hadoop by adding robust administration, workflow orchestration, provisioning and security, along with best-in-class analytics tools from IBM Research. This version of BigInsights, version 1.3, uses Hadoop 0.20.2 and its built-in HDFS file system.

Table 1: Evaluation Environment

  Hardware
    Cluster             10 PowerLinux 7R2 servers
    CPU                 16 processor cores per server (160 total)
    Memory              128GB per server (1280GB total)
    Internal storage    6 x 600GB internal SAS drives per server (36TB total)
    Storage expansion   24 x 600GB SAS drives per server in an IBM EXP24S SFF Gen2-bay drawer (144TB total)
    Network             2 x 10GbE connections per server
    Switch              BNT BLADE RackSwitch G8264
  Software
    OS                  Red Hat Enterprise Linux 6.2
    Java                IBM Java Version 7 SR1
    BigInsights         Version 1.3 (Hadoop v0.20.2)
    MapReduce           One Power 730 as NameNode/JobTracker and ten 7R2s as DataNodes/TaskTrackers

Each PowerLinux 7R2 includes 16 POWER7(R) cores at 3.55GHz, which can scale to 3.86GHz in the "Dynamic Power Saving - Favor Performance" mode, up to 64 hardware threads, 128GB of memory, 6 x 600GB internal SAS drives, and 24 x 600GB SAS drives in an external Direct Attached Storage (DAS) drawer. We used software RAID0 over LVM for the 29 data drives attached to each machine; one internal SAS drive is dedicated as the boot disk. The machines are connected by a 10Gb Ethernet network, with two 10GbE connections from each machine to the top-of-rack switch. We used Red Hat Enterprise Linux 6.2 (RHEL 6.2). All 10 PowerLinux 7R2 machines function as Hadoop data nodes and task trackers. Note that we use one extra Power 730 Express server as the master node, which serves as the NameNode and JobTracker(2).

(2) Given the small cluster size in this study, our experiments show negligible performance overhead from using one server as both a master node (NameNode and JobTracker) and a slave node (DataNode and TaskTracker). We separate the Power 730 master node from the 10 7R2 slave nodes for convenience of cluster management.

4.2 Results and Analysis

We have some early experience with measuring and tuning standard Hadoop programs, including some of those used in the HiBench [11] benchmark suite and some from real-world customer engagements. In this paper, we use Terasort, a widely used sort program included in the Hadoop distribution, as a case study. While Terasort in Hadoop can be configured to sort differing amounts of input data, we only present results for sorting 1TB of input data. In addition, we compress neither input nor output data.

Our initial trial was done with the default Hadoop MapReduce configuration, e.g., limited map and reduce task slots, which does not utilize the 16 processor cores in a single PowerLinux 7R2 system. As expected, the test takes hours to finish. After an initial adjustment of the number of mappers and reducers to match the parallel processing capacity of the 16 cores in a PowerLinux 7R2 system, the execution time drastically decreases.

We then apply the following public-domain tuning methods, referencing the best practices from the PowerLinux Community [8], to gradually improve sort performance:

   • Four-way Simultaneous Multithreading (SMT4) to further increase computation parallelism and stress data parallelism via the large L3 caches and high memory bandwidth of POWER7(R);
   • Aggressive read-ahead settings and deadline disk IO scheduling at the OS level;
   • Large block size and buffer sizing in Hadoop;
   • Publicly available LZO compression [7] for intermediate data compression;
   • Preliminary intermediate control of the map and reduce stages to better utilize available memory capacity;
   • Reconfiguration of the storage subsystem to remove failover support of the storage adapters for an effective bandwidth improvement, since Hadoop handles storage failures by replication at the software level;
   • JVM, JIT, and GC tuning that better fits the POWER architecture.

Currently, we are able to achieve an execution time of less than 9 minutes to sort 1TB of input data from disk and write 1TB of output data back to disk storage. This translates to sorting 11.4GB/node/minute for the IO-intensive sort benchmark. Table 2 lists some of the BigInsights (Hadoop) job configuration parameters used in this Terasort run, which completed in 8 minutes and 44 seconds.
Table 2: Sample BigInsights (Hadoop) Job Configuration
    Parameters                               Values
    mapred.compress.map.output               true
    mapred.map.output.compression.codec      com.hadoop.compression.lzo.LzoCodec
    mapred.reduce.slowstart.completed.maps   0.01
    mapred.reduce.parallel.copies            1
    mapred.map.tasks                         640
    mapred.reduce.tasks                      380
    mapred.map.tasks.speculative.execution   true
    io.sort.factor                           120
    mapred.jobtracker.taskScheduler          org.apache.hadoop.mapred.JobQueueTaskScheduler
    flex.priority                             0
    adaptivemr.map.enable                    false
    io.sort.mb                               650
    mapred.job.reduce.input.buffer.percent    0.96
    mapred.job.shuffle.merge.percent           0.96
    mapred.job.shuffle.input.buffer.percent     0.7
    io.file.buffer.size                        524288
    io.sort.record.percent                   0.138
    io.sort.spill.percent                    1.0
    mapred.child.java.opts                   ’-server -Xlp -Xnoclassgc -Xgcpolicy:gencon -Xms890m -Xmx890m
                                             -Xjit:optLevel=hot -Xjit:disableProfiling -Xgcthreads4 -XlockReservation’
    mapred.tasktracker.map.tasks.maximum     64
    mapred.tasktracker.reduce.tasks.maximum 49
    dfs.replication                          1
    mapred.max.tracker.blacklists            20
    dfs.block.size                           536870912
    mapred.job.reuse.jvm.num.tasks           -1
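As a hedged illustration of how a subset of the Table 2 parameters might be applied programmatically, the Java sketch below uses the Hadoop 0.20 (old-API) JobConf; the values are copied from Table 2, but the class itself is hypothetical, since in our runs the settings were supplied through the BigInsights/Hadoop job configuration. Daemon-side limits (mapred.tasktracker.*.maximum) and OS-level settings (read-ahead, IO scheduler, SMT mode) are not shown because they are not per-job parameters.

    import org.apache.hadoop.mapred.JobConf;

    // Illustrative sketch: applying a subset of the Table 2 Terasort settings
    // through the Hadoop 0.20 JobConf API. Values are taken from Table 2.
    public class TerasortJobSettings {
        public static JobConf configure(JobConf conf) {
            conf.setNumMapTasks(640);                        // mapred.map.tasks
            conf.setNumReduceTasks(380);                     // mapred.reduce.tasks
            conf.setBoolean("mapred.compress.map.output", true);
            conf.set("mapred.map.output.compression.codec",
                     "com.hadoop.compression.lzo.LzoCodec"); // LZO for intermediate data
            conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.01f);
            conf.setInt("io.sort.factor", 120);
            conf.setInt("io.sort.mb", 650);
            conf.setInt("io.file.buffer.size", 524288);
            conf.setLong("dfs.block.size", 536870912L);      // 512MB HDFS blocks
            conf.setInt("dfs.replication", 1);
            conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
            conf.set("mapred.child.java.opts",
                     "-server -Xlp -Xnoclassgc -Xgcpolicy:gencon -Xms890m -Xmx890m "
                     + "-Xjit:optLevel=hot -Xjit:disableProfiling -Xgcthreads4 -XlockReservation");
            return conf;
        }

        public static void main(String[] args) {
            JobConf conf = configure(new JobConf());
            System.out.println("io.sort.mb = " + conf.get("io.sort.mb"));
        }
    }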



Figure 1 shows the CPU utilization of one of the 10 nodes in the cluster during the Terasort run; all nodes show similar CPU utilization. CPU utilization is very high during the map and map-reduce overlapping stages. As expected, it is only when all mappers have finished and only the reducers (380, as shown in Table 2, out of the 640 hardware threads supported by the 10 PowerLinux 7R2 servers) are writing output back to disk that CPU utilization drops.

[Figure 1: CPU utilization on one PowerLinux 7R2 node from a Terasort run of 8 minutes 44 seconds.]

While we have large quantities of other system profile data to help us understand the potential for further performance improvement, one thing is clear: high CPU utilization does not necessarily translate into high performance. In other words, the CPU may be busy doing inefficient computation (inside the Hadoop and Java frameworks, for instance) or executing inefficient algorithms that artificially inflate CPU utilization without improving performance. This leads us to examine full-stack optimization during the next phase of our research.

As indicated above, we have only applied infrastructure tuning in this stage of the study. In the next stage, we plan to incorporate the performance enhancement features in IBM InfoSphere BigInsights for further improvement. We are also working on patches for better intermediate control in MapReduce. Hadoop versions newer than 0.20.2 are expected to continue to deliver performance improvements. Furthermore, we plan to apply reconfigurable acceleration technology as indicated in Section 3. In the meantime, scalability studies are also important to understand the optimal approach to (A) strong scaling of the BigInsights cluster, i.e., scaling the number of nodes with constant input size, and (B) weak scaling of the BigInsights cluster, i.e., scaling the number of nodes with corresponding increases in input size(3).

(3) We borrow the two common notions of scalability, strong vs. weak scaling, from the High Performance Computing community; we think they fit Big Data analytics and data-intensive computing well.

As a preliminary proof of concept, our experiment shows that BigInsights v1.3 with only 4 PowerLinux 7R2 nodes can sort 1TB of input in slightly over 22 minutes. Thus, the Terasort benchmark exhibits nearly linear strong scaling from 4 up through the 10 nodes in our cluster, leading to a lower system cost and footprint for situations where a 22-minute Terasort time is acceptable.
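Taking the reported times at face value (8 minutes 44 seconds, i.e., about 8.73 minutes, on 10 nodes and roughly 22 minutes on 4 nodes, with 1TB taken as 1000GB), the per-node throughput and the strong-scaling behavior can be checked with simple arithmetic:

    throughput (10 nodes)  ≈ 1000GB / (10 nodes x 8.73 min) ≈ 11.4 GB/node/minute
    speedup (4 -> 10 nodes) ≈ T(4 nodes) / T(10 nodes) ≈ 22 min / 8.73 min ≈ 2.5
    ideal strong-scaling speedup = 10 / 4 = 2.5

The measured speedup is essentially the ideal value, consistent with the nearly linear strong scaling noted above.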
While we have not observed the network to be a bottleneck in our 10-node cluster, this may change as the system scales up and the workload changes. Similar observations apply to our disk subsystem, which currently has 30 disks in total per PowerLinux 7R2 server.

Finally, we expect others to apply performance analysis in a similar way and to judiciously size their systems for their particular workloads and needs. This can be useful either when they make purchasing decisions or when they reconfigure their systems in production.

5. CONCLUSIONS

In this paper, we have presented our initial study on Big Data benchmarking and methodology as well as workload optimized systems for Big Data. We have also discussed our initial experience of sorting 1TB of data on a 10-node PowerLinux 7R2 cluster. As of this writing, it takes less than 9 minutes to complete the sort, which translates to sorting 11.4GB/node/minute for the IO-intensive sort benchmark. We expect additional improvements in the near future. We also show that 4 PowerLinux 7R2 nodes can sort 1TB of input in around 22 minutes. Further improvement is expected as we continue full-stack optimization on both software and hardware.

6. ACKNOWLEDGMENTS

We would like to thank the IBM InfoSphere BigInsights team for their support. In particular, we owe a debt of gratitude to Berni Schiefer, Stewart Tate and Hui Liao for their guidance and insights. Susan Proietti Conti, Gina King, Angela S. Perez, Raguraman Tumaati-Krishnan, Anitra Powell, Demetrice Browder and Chuck Bryan have helped us streamline the process along the way. Last but not least, we could not have reached this stage without the generous support and advice of Dan P. Dumarot, Richard W. Bishop, Pat Buckland, Clark Anderson, Ken Blake, Geoffrey Cagle and John Williams, among others.

7. REFERENCES

 [1] Apache Hadoop. http://hadoop.apache.org/.
 [2] Enron Email Dataset. http://www.cs.cmu.edu/~enron/.
 [3] IBM InfoSphere BigInsights. http://www-01.ibm.com/software/data/infosphere/biginsights/.
 [4] IBM Netezza Data Warehouse Appliances. http://www-01.ibm.com/software/data/netezza/.
 [5] IBM PowerLinux 7R2 Server. http://www-03.ibm.com/systems/power/software/linux/powerlinux/7r2/index.html.
 [6] IBM WebSphere DataPower SOA Appliances. http://www-01.ibm.com/software/integration/datapower/.
 [7] LZO: A Portable Lossless Data Compression Library. http://www.oberhumer.com/opensource/lzo/.
 [8] PowerLinux Community. https://www.ibm.com/developerworks/group/tpl.
 [9] The ClueWeb09 Dataset. http://lemurproject.org/clueweb09/.
[10] The Stanford WebBase Project. http://diglib.stanford.edu:8091/~testbed/doc2/WebBase/.
[11] S. Huang, J. Huang, J. Dai, T. Xie, and B. Huang. The HiBench benchmark suite: Characterization of the MapReduce-based data analysis. In IEEE International Conference on Data Engineering Workshops (ICDEW), Long Beach, CA, USA, 2010.
[12] T. Morgan. IBM Global Technology Outlook 2012. In Technology Innovation Exchange, IBM Warwick, 2012.

More Related Content

What's hot

Big data privacy issues in public social media
Big data privacy issues in public social mediaBig data privacy issues in public social media
Big data privacy issues in public social mediaSupriya Radhakrishna
 
Idc info brief-choosing_dbms_to_address_challenges_of_the_third-platform
Idc info brief-choosing_dbms_to_address_challenges_of_the_third-platformIdc info brief-choosing_dbms_to_address_challenges_of_the_third-platform
Idc info brief-choosing_dbms_to_address_challenges_of_the_third-platformRobert Bira
 
A Survey on Data Mining
A Survey on Data MiningA Survey on Data Mining
A Survey on Data MiningIOSR Journals
 
Managing Big data using Hadoop Map Reduce in Telecom Domain
Managing Big data using Hadoop Map Reduce in Telecom DomainManaging Big data using Hadoop Map Reduce in Telecom Domain
Managing Big data using Hadoop Map Reduce in Telecom DomainAM Publications
 
An Comprehensive Study of Big Data Environment and its Challenges.
An Comprehensive Study of Big Data Environment and its Challenges.An Comprehensive Study of Big Data Environment and its Challenges.
An Comprehensive Study of Big Data Environment and its Challenges.ijceronline
 
A Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and ChallengesA Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and Challengesijcisjournal
 
Emerging database landscape july 2011
Emerging database landscape july 2011Emerging database landscape july 2011
Emerging database landscape july 2011navaidkhan
 
Integrating compression technique for data mining
Integrating compression technique for data  miningIntegrating compression technique for data  mining
Integrating compression technique for data miningDr.Manmohan Singh
 
Influence of Hadoop in Big Data Analysis and Its Aspects
Influence of Hadoop in Big Data Analysis and Its Aspects Influence of Hadoop in Big Data Analysis and Its Aspects
Influence of Hadoop in Big Data Analysis and Its Aspects IJMER
 
A Survey on Big Data Mining Challenges
A Survey on Big Data Mining ChallengesA Survey on Big Data Mining Challenges
A Survey on Big Data Mining ChallengesEditor IJMTER
 
A technical Introduction to Big Data Analytics
A technical Introduction to Big Data AnalyticsA technical Introduction to Big Data Analytics
A technical Introduction to Big Data AnalyticsPethuru Raj PhD
 

What's hot (16)

Big data privacy issues in public social media
Big data privacy issues in public social mediaBig data privacy issues in public social media
Big data privacy issues in public social media
 
Idc info brief-choosing_dbms_to_address_challenges_of_the_third-platform
Idc info brief-choosing_dbms_to_address_challenges_of_the_third-platformIdc info brief-choosing_dbms_to_address_challenges_of_the_third-platform
Idc info brief-choosing_dbms_to_address_challenges_of_the_third-platform
 
A Survey on Data Mining
A Survey on Data MiningA Survey on Data Mining
A Survey on Data Mining
 
Managing Big data using Hadoop Map Reduce in Telecom Domain
Managing Big data using Hadoop Map Reduce in Telecom DomainManaging Big data using Hadoop Map Reduce in Telecom Domain
Managing Big data using Hadoop Map Reduce in Telecom Domain
 
Data science unit1
Data science unit1Data science unit1
Data science unit1
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
An Comprehensive Study of Big Data Environment and its Challenges.
An Comprehensive Study of Big Data Environment and its Challenges.An Comprehensive Study of Big Data Environment and its Challenges.
An Comprehensive Study of Big Data Environment and its Challenges.
 
A Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and ChallengesA Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and Challenges
 
Emerging database landscape july 2011
Emerging database landscape july 2011Emerging database landscape july 2011
Emerging database landscape july 2011
 
Lecture1-IS322(Data&InfoMang-introduction)
Lecture1-IS322(Data&InfoMang-introduction)Lecture1-IS322(Data&InfoMang-introduction)
Lecture1-IS322(Data&InfoMang-introduction)
 
F035431037
F035431037F035431037
F035431037
 
Integrating compression technique for data mining
Integrating compression technique for data  miningIntegrating compression technique for data  mining
Integrating compression technique for data mining
 
Influence of Hadoop in Big Data Analysis and Its Aspects
Influence of Hadoop in Big Data Analysis and Its Aspects Influence of Hadoop in Big Data Analysis and Its Aspects
Influence of Hadoop in Big Data Analysis and Its Aspects
 
Big data road map
Big data road mapBig data road map
Big data road map
 
A Survey on Big Data Mining Challenges
A Survey on Big Data Mining ChallengesA Survey on Big Data Mining Challenges
A Survey on Big Data Mining Challenges
 
A technical Introduction to Big Data Analytics
A technical Introduction to Big Data AnalyticsA technical Introduction to Big Data Analytics
A technical Introduction to Big Data Analytics
 

Similar to Understanding System and Architecture for Big Data

Big Data Beyond Hadoop*: Research Directions for the Future
Big Data Beyond Hadoop*: Research Directions for the FutureBig Data Beyond Hadoop*: Research Directions for the Future
Big Data Beyond Hadoop*: Research Directions for the FutureOdinot Stanislas
 
Big Data Testing Using Hadoop Platform
Big Data Testing Using Hadoop PlatformBig Data Testing Using Hadoop Platform
Big Data Testing Using Hadoop PlatformIRJET Journal
 
Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19Prof.Balakrishnan S
 
Big data analytics, research report
Big data analytics, research reportBig data analytics, research report
Big data analytics, research reportJULIO GONZALEZ SANZ
 
Big data analytics
Big data analyticsBig data analytics
Big data analyticsAccenture
 
Ast 0060878 wayne-eckerson_research_report_big_data_analytics
Ast 0060878 wayne-eckerson_research_report_big_data_analyticsAst 0060878 wayne-eckerson_research_report_big_data_analytics
Ast 0060878 wayne-eckerson_research_report_big_data_analyticsAccenture
 
Ast 0060878 wayne-eckerson_research_report_big_data_analytics
Ast 0060878 wayne-eckerson_research_report_big_data_analyticsAst 0060878 wayne-eckerson_research_report_big_data_analytics
Ast 0060878 wayne-eckerson_research_report_big_data_analyticsAccenture
 
CASE STUDY ON METHODS AND TOOLS FOR THE BIG DATA ANALYSIS
CASE STUDY ON METHODS AND TOOLS FOR THE BIG DATA ANALYSISCASE STUDY ON METHODS AND TOOLS FOR THE BIG DATA ANALYSIS
CASE STUDY ON METHODS AND TOOLS FOR THE BIG DATA ANALYSISIRJET Journal
 
JPJ1417 Data Mining With Big Data
JPJ1417   Data Mining With Big DataJPJ1417   Data Mining With Big Data
JPJ1417 Data Mining With Big Datachennaijp
 
Big data seminor
Big data seminorBig data seminor
Big data seminorberasrujana
 
A Deep Dissertion Of Data Science Related Issues And Its Applications
A Deep Dissertion Of Data Science  Related Issues And Its ApplicationsA Deep Dissertion Of Data Science  Related Issues And Its Applications
A Deep Dissertion Of Data Science Related Issues And Its ApplicationsTracy Hill
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data SciencePouria Amirian
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data SciencePouria Amirian
 
How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights...
How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights...How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights...
How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights...Vladimir Bacvanski, PhD
 
How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights...
How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights...How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights...
How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights...DATAVERSITY
 
Big Data Analytics Research Report
Big Data Analytics Research ReportBig Data Analytics Research Report
Big Data Analytics Research ReportIla Group
 

Similar to Understanding System and Architecture for Big Data (20)

Complete-SRS.doc
Complete-SRS.docComplete-SRS.doc
Complete-SRS.doc
 
Big Data Beyond Hadoop*: Research Directions for the Future
Big Data Beyond Hadoop*: Research Directions for the FutureBig Data Beyond Hadoop*: Research Directions for the Future
Big Data Beyond Hadoop*: Research Directions for the Future
 
Using Big Data Smarter Decision Making
Using Big Data Smarter Decision MakingUsing Big Data Smarter Decision Making
Using Big Data Smarter Decision Making
 
Big Data Testing Using Hadoop Platform
Big Data Testing Using Hadoop PlatformBig Data Testing Using Hadoop Platform
Big Data Testing Using Hadoop Platform
 
E018142329
E018142329E018142329
E018142329
 
Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19
 
Big data analytics, research report
Big data analytics, research reportBig data analytics, research report
Big data analytics, research report
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
Ast 0060878 wayne-eckerson_research_report_big_data_analytics
Ast 0060878 wayne-eckerson_research_report_big_data_analyticsAst 0060878 wayne-eckerson_research_report_big_data_analytics
Ast 0060878 wayne-eckerson_research_report_big_data_analytics
 
Ast 0060878 wayne-eckerson_research_report_big_data_analytics
Ast 0060878 wayne-eckerson_research_report_big_data_analyticsAst 0060878 wayne-eckerson_research_report_big_data_analytics
Ast 0060878 wayne-eckerson_research_report_big_data_analytics
 
CASE STUDY ON METHODS AND TOOLS FOR THE BIG DATA ANALYSIS
CASE STUDY ON METHODS AND TOOLS FOR THE BIG DATA ANALYSISCASE STUDY ON METHODS AND TOOLS FOR THE BIG DATA ANALYSIS
CASE STUDY ON METHODS AND TOOLS FOR THE BIG DATA ANALYSIS
 
FC Brochure & Insert
FC Brochure & InsertFC Brochure & Insert
FC Brochure & Insert
 
JPJ1417 Data Mining With Big Data
JPJ1417   Data Mining With Big DataJPJ1417   Data Mining With Big Data
JPJ1417 Data Mining With Big Data
 
Big data seminor
Big data seminorBig data seminor
Big data seminor
 
A Deep Dissertion Of Data Science Related Issues And Its Applications
A Deep Dissertion Of Data Science  Related Issues And Its ApplicationsA Deep Dissertion Of Data Science  Related Issues And Its Applications
A Deep Dissertion Of Data Science Related Issues And Its Applications
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data Science
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data Science
 
How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights...
How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights...How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights...
How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights...
 
How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights...
How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights...How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights...
How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights...
 
Big Data Analytics Research Report
Big Data Analytics Research ReportBig Data Analytics Research Report
Big Data Analytics Research Report
 

More from IBM India Smarter Computing

Using the IBM XIV Storage System in OpenStack Cloud Environments
Using the IBM XIV Storage System in OpenStack Cloud Environments Using the IBM XIV Storage System in OpenStack Cloud Environments
Using the IBM XIV Storage System in OpenStack Cloud Environments IBM India Smarter Computing
 
TSL03104USEN Exploring VMware vSphere Storage API for Array Integration on th...
TSL03104USEN Exploring VMware vSphere Storage API for Array Integration on th...TSL03104USEN Exploring VMware vSphere Storage API for Array Integration on th...
TSL03104USEN Exploring VMware vSphere Storage API for Array Integration on th...IBM India Smarter Computing
 
A Comparison of PowerVM and Vmware Virtualization Performance
A Comparison of PowerVM and Vmware Virtualization PerformanceA Comparison of PowerVM and Vmware Virtualization Performance
A Comparison of PowerVM and Vmware Virtualization PerformanceIBM India Smarter Computing
 
IBM pureflex system and vmware vcloud enterprise suite reference architecture
IBM pureflex system and vmware vcloud enterprise suite reference architectureIBM pureflex system and vmware vcloud enterprise suite reference architecture
IBM pureflex system and vmware vcloud enterprise suite reference architectureIBM India Smarter Computing
 

More from IBM India Smarter Computing (20)

Using the IBM XIV Storage System in OpenStack Cloud Environments
Using the IBM XIV Storage System in OpenStack Cloud Environments Using the IBM XIV Storage System in OpenStack Cloud Environments
Using the IBM XIV Storage System in OpenStack Cloud Environments
 
All-flash Needs End to End Storage Efficiency
All-flash Needs End to End Storage EfficiencyAll-flash Needs End to End Storage Efficiency
All-flash Needs End to End Storage Efficiency
 
TSL03104USEN Exploring VMware vSphere Storage API for Array Integration on th...
TSL03104USEN Exploring VMware vSphere Storage API for Array Integration on th...TSL03104USEN Exploring VMware vSphere Storage API for Array Integration on th...
TSL03104USEN Exploring VMware vSphere Storage API for Array Integration on th...
 
IBM FlashSystem 840 Product Guide
IBM FlashSystem 840 Product GuideIBM FlashSystem 840 Product Guide
IBM FlashSystem 840 Product Guide
 
IBM System x3250 M5
IBM System x3250 M5IBM System x3250 M5
IBM System x3250 M5
 
IBM NeXtScale nx360 M4
IBM NeXtScale nx360 M4IBM NeXtScale nx360 M4
IBM NeXtScale nx360 M4
 
IBM System x3650 M4 HD
IBM System x3650 M4 HDIBM System x3650 M4 HD
IBM System x3650 M4 HD
 
IBM System x3300 M4
IBM System x3300 M4IBM System x3300 M4
IBM System x3300 M4
 
IBM System x iDataPlex dx360 M4
IBM System x iDataPlex dx360 M4IBM System x iDataPlex dx360 M4
IBM System x iDataPlex dx360 M4
 
IBM System x3500 M4
IBM System x3500 M4IBM System x3500 M4
IBM System x3500 M4
 
IBM System x3550 M4
IBM System x3550 M4IBM System x3550 M4
IBM System x3550 M4
 
IBM System x3650 M4
IBM System x3650 M4IBM System x3650 M4
IBM System x3650 M4
 
IBM System x3500 M3
IBM System x3500 M3IBM System x3500 M3
IBM System x3500 M3
 
IBM System x3400 M3
IBM System x3400 M3IBM System x3400 M3
IBM System x3400 M3
 
IBM System x3250 M3
IBM System x3250 M3IBM System x3250 M3
IBM System x3250 M3
 
IBM System x3200 M3
IBM System x3200 M3IBM System x3200 M3
IBM System x3200 M3
 
IBM PowerVC Introduction and Configuration
IBM PowerVC Introduction and ConfigurationIBM PowerVC Introduction and Configuration
IBM PowerVC Introduction and Configuration
 
A Comparison of PowerVM and Vmware Virtualization Performance
A Comparison of PowerVM and Vmware Virtualization PerformanceA Comparison of PowerVM and Vmware Virtualization Performance
A Comparison of PowerVM and Vmware Virtualization Performance
 
IBM pureflex system and vmware vcloud enterprise suite reference architecture
IBM pureflex system and vmware vcloud enterprise suite reference architectureIBM pureflex system and vmware vcloud enterprise suite reference architecture
IBM pureflex system and vmware vcloud enterprise suite reference architecture
 
X6: The sixth generation of EXA Technology
X6: The sixth generation of EXA TechnologyX6: The sixth generation of EXA Technology
X6: The sixth generation of EXA Technology
 

Understanding System and Architecture for Big Data

  • 1. RC25281 (AUS1204-004) April 23, 2012 Computer Science IBM Research Report Understanding System and Architecture for Big Data Anne E. Gattiker, Fadi H. Gebara, Ahmed Gheith, H. Peter Hofstee, Damir A. Jamsek, Jian Li, Evan Speight IBM Research Division Austin Research Laboratory 11501 Burnet Road Austin, TX 78758 USA Ju Wei Shi, Guan Cheng Chen IBM Research Division China Research Laboratory Building 19, Zhouguancun Software Park 8 Dongbeiwang West Road, Haidian District Beijing, 100193 P.R.China Peter W. Wong IBM Linux Technology Center 11501 Burnet Road Austin TX 78758 USA Research Division Almaden - Austin - Beijing - Cambridge - Haifa - India - T. J. Watson - Tokyo - Zurich
  • 2. Understanding Systems and Architectures for Big Data William M. Buros, Guan Cheng Chen, Mei-Mei Fu, Anne E. Gattiker, Fadi H. Gebara, Ahmed Gheith, H. Peter Hofstee, Damir A. Jamsek, Thomas G. Lendacky, Jian Li, Yan Li, John S. Poelman, Steven Pratt, Ju Wei Shi, Evan Speight, Peter W. Wong International Business Machines Corp. {wmb,chengc,mfu,gattiker,fhgebara,ahmedg,hofstee,jamsek,toml, jianli,liyancrl,poelman,slpratt,jwshi,speight,wpeter}@us.ibm.com ABSTRACT refers to the quality of the information in the face of data un- The use of Big Data underpins critical activities in all sec- certainty from many different places: the data itself, sensor tors of our society. Achieving the full transformative po- precision, model approximation, or inherent process uncer- tential of Big Data in this increasingly digital world re- tainty. Emerging social media also brings in new types of quires both new data analysis algorithms and a new class data uncertainty such as rumors, lies, falsehoods and wishful of systems to handle the dramatic data growth, the de- thinking. We believe the 4 Vs presently capture the main mand to integrate structured and unstructured data ana- characteristics of Big Data. lytics, and the increasing computing needs of massive-scale The importance of Big Data in today’s society can not analytics. In this paper, we discuss several Big Data re- be underestimated, and proper understanding and use of search activities at IBM Research: (1) Big Data benchmark- this important resource will require both new data analysis ing and methodology; (2) workload optimized systems for algorithms and new systems to handle the dramatic data Big Data; and (3) a case study of Big Data workloads on growth, the demand to integrate structured and unstruc- IBM Power systems. Our case study shows that prelimi- tured data analytics, and the increasing computing needs of nary infrastructure tuning results in sorting 1TB data in massive-scale analytics. As a result, there exists much re- TM search in critical aspects of emerging analytics systems for 9 minutes1 on 10 PowerLinux 7R2 systems [5] running TM Big Data. Examples of current research in this area include: IBM InfoSphere BigInsights . This translates to sorting processor, memory, and system architectures for data analy- 11.4GB/node/minute for the IO intensive sort benchmark. sis; benchmarks, metrics, and workload characterization for We also show that 4 PowerLinux 7R2 nodes can sort 1TB analytics and data-intensive computing; debugging and per- input with around 22 minutes. Further improvements are formance analysis tools for analytics and data-intensive com- expected as we continue full-stack optimizations on both puting; accelerators for analytics and data-intensive com- IBM software and hardware. puting; implications of data analytics to mobile and em- bedded systems; energy efficiency and energy-efficient de- 1. INTRODUCTION signs for analytics; availability, fault tolerance and recovery The term “Big Data” refers to the continuing massive ex- issues; scalable system and network designs for high con- pansion in the data volume and variety as well as the velocity currency or high bandwidth data streaming; data manage- and veracity of data processing [12]. Volume refers to the ment and analytics for vast amounts of unstructured data; scale of the data and processing needs. 
Whether for data evaluation tools, methodologies and workload synthesis; OS, at rest or in motion, i.e., being a repository of information distributed systems and system management support; and or a stream, the desire for high speed is constant, hence MapReduce and other processing paradigms and algorithms the notion of velocity. Variety indicates that Big Data may for analytics. be structured in many different ways, and techniques are This short paper briefly discusses three topics that IBM needed to understand and process such variation. Veracity researchers and colleagues from other IBM divisions have been working on: 1 All performance data contained in this publication was ob- tained in the specific operating environment and under the • Big Data benchmarking and methodology conditions described below and is presented as an illustra- tion. Performance obtained in other operating environments • Workload optimized systems for Big Data may vary and customers should conduct their own testing. • A case study of a Big Data workload on IBM Power systems We highlight the research directions that we are pursuing in this paper. More technical details and progress updates will be covered in several papers in an upcoming issue of the IBM Journal of Research and Development. Contact Author: Jian Li, IBM Research in Austin, 11501 Burnet Road, Austin, TX 78758; Phone: (512)286-7228; Email: jianli@us.ibm.com IBM Research Technical Report. July 10, 2012. Copyright 2012 .
  • 3. 2. BENCHMARKING METHODOLOGY 3. WORKLOAD OPTIMIZED SYSTEMS Massive Scale Analytics is representative of a new class of While industry has made substantial investments in ex- workloads that justifies a re-thinking of how computing sys- tending its software capabilities in analytics and Big Data, tems should be optimized. We start by tackling the problem thus far these new workloads are being executed on systems of the absence of a set of benchmarks that system hardware that were designed and configured in response to more tra- designers can use to measure the quality of their designs ditional workload demands. and that customers can use to evaluate competing hardware In this paper we present two key improvements to tradi- offerings in this fast-growing and still rapidly-changing mar- tional system design. The first is the addition of reconfig- ket. Existing benchmarks, such as HiBench [11], fall short in urable acceleration. While reconfigurable acceleration has terms of both scale and relevance. We conceive a method- been used successfully in commercial systems and appliances ology for peta-scale data-size benchmark creation that in- before (e.g., DataPower R [6] and Netezza R [4]), we have cludes representative Big Data workloads and can be used demonstrated via a prototype system that such technology as a driver of total system performance, with demands bal- can benefit the processing of unstructured data. anced across storage, network and computation. Creating The second innovation we discuss is a new modular and such a benchmark requires meeting unique challenges asso- dense design that also leverages acceleration. Achieving the ciated with the data size and its unstructured nature. To computational and storage densities that this design pro- be useful, the benchmark also needs to be generic enough vides requires an increase in processing efficiency that is to be accepted by the community at large. We also ob- achieved by a combination of power-efficient processing cores serve unique challenges associated with massive scale ana- and offloading of performance-intensive functions to the re- lytics benchmark creation, along with a specific benchmark configurable logic. we have created to meet them. With an eye towards how analytics workloads are likely to The first consequence of the massive scale of the data is evolve and executing such workloads efficiently, we conceive that the benchmark must be descriptive, rather than pre- of a system that leverages the accelerated dense scale-out scriptive. In other words, our proposed benchmark is pro- design in combination with powerful global server nodes that vided as instructions for acquiring the required data and pro- orchestrate and manage computation. This system can also cessing it, rather than providing benchmark code to run on be used to perform deeper in-memory analytics on selected supplied data. We propose the use of existing large datasets, data. such as the 25TB ClueWeb09 dataset [9] and the over 200TB Stanford WebBase repository [10]. Challenges of using such real-world large datasets include physical data delivery (e.g., 4. BIG DATA ON POWER SYSTEMS via shipped disk drives), and data formatting/“cleaning” of Apache Hadoop [1] has been widely deployed on clusters the data to allow robust processing. of relatively large numbers of moderately sized, commodity We propose compute- and data-intensive processing tasks servers. 
However, it has not been widely used on large, that are representative of key massive-scale analytics work- multi-core, heavily threaded machines even though smaller loads to be applied to this unstructured data. These tasks systems have increasingly large core and hardware thread include major Big Data application areas, text analytics, counts. We describe an initial performance evaluation and graph analytics and machine-learning. Specifically our bench- tuning of Hadoop on a large multi-core cluster with only a mark efforts focused on document categorization based on handful of machines. The evaluation environment comprises dictionary-matching, document and page ranking, and topic IBM InfoSphere BigInsights [3] on a 10-machine cluster of TM determination via non-negative matrix factorization. The IBM PowerLinux 7R2 systems. first of the three, in particular, required innovation in bench- mark creation, as there is no “golden reference” to establish 4.1 Evaluation Environment correct document categorization. Existing datasets typically Table 1 shows the evaluation environment, which com- used as references for text-categorization assessments, such prises IBM InfoSphere BigInsights [3] on a 10-machine clus- as the Enron corpus [2], are orders of magnitude smaller than ter of IBM PowerLinux 7R2 servers. IBM InfoSphere BigIn- what we required. Our approach for overcoming this chal- sights provides the power of Apache Hadoop in an enterprise- lenge included utilizing publicly-accessible documents coded ready package. BigInsights enhances Hadoop by adding by subject, such as US Patent Office patents and applica- robust administration, workflow orchestration, provisioning tions, to create subject-specific dictionaries against which and security, along with best-in-class analytics tools from to match documents. Unique challenges of ensuring “real- IBM Research. This version of BigInsights, version 1.3, uses world” relevance includes covering non-word terms of impor- Hadoop 0.20.2 and its built-in HDFS file system. tance, such as band names that include regular expression Each PowerLinux 7R2 includes 16 POWER7 R cores @ characters, and a “wisdom of crowds” approach that helps 3.55GHz that can scale to 3.86GHz in a “Dynamic Power us meet those challenges. Saving - Favor Performance” mode, up to 64 hardware threads, It is our intention to make our benchmark public. The 128 GB of memory, 6 × 600GB internal SAS drives, and benchmark is complementary to existing prescriptive bench- 24 × 600GB SAS drives in an external Direct Attached Stor- marks, such as Terasort and its variations that have been age (DAS) drawer. We used software RAID0 over LVM for widely exercised in the Big Data community. In this paper, the 29 drives to each machine. One internal SAS drive is we use Terasort as a case study in Section 4. dedicated as the boot disk. The machines are connected by 10Gb Ethernet network. Each machine has two 10Gbe connections to the top of the rack switch. We used RedHat Linux (RHEL6.2). All 10 PowerLinux 7R2 machines func- tion as Hadoop data nodes and task trackers. Note that we
  • 4. • Preliminary intermediate control of map and reduce Table 1: Evaluation Environment stages to better utilize available memory capacity; Hardware • Reconfiguration of storage subsystem to remove fail- Cluster 10 PowerLinux 7R2 Servers over support of storage adapters for effective band- CPU 16 processor cores per server (160 total) width improvement since Hadoop handles storage fail- Memory 128GB per server (1280GB total) ures by replication at software level; Internal 6 600GB internal SAS drives Storage per server (36TB total) • JVM, Jitting, and GC tuning that better fit the POWER Storage 24 600GB SAS drives in IBM EXP24S SFF architecture. Expansion Gen2-bay Drawer, per server (144TB total) Network 2 10Gbe connections per server Currently, we are able to achieve an execution time of less Switch BNT BLADE RackSwitch G8264 than 9 minutes to sort 1TB input data from disk and writ- ing back 1TB output data to disk storage. This translates Software to sorting 11.4GB/node/minute for the IO intensive sort OS Red Hat Enterprise Linux 6.2 benchmark. Table 2 lists some of the BigInsights (Hadoop) Java IBM Java Version 7 SR1 job configuration parameters used in this Terasort run that BigInsights Version 1.3 (Hadoop v0.20.2) completed in 8 minutes and 44 seconds. MapReduce One Power 730 as NameNode/JobTracker Figure 1 shows the CPU utilization of one of the 10 nodes and 10 7R2s as DataNodes/TaskTrackers in the cluster during the Terasort run. All nodes have similar CPU utilization. Figure 1 shows that CPU utilization is very high during the map and map-reduce overlapping stages. As expected, it is only when all mappers finish and only use one extra Power 730 Express Server as the master node reducers (380 as shown in Table 2 out of the 640 hardware that serves as the Name Node and Job Tracker2 . threads supported by the 10 PowerLinux 7R2 servers) are 4.2 Results and Analysis writing output back to disks that CPU utilization drops. While we have large quantities of other system profile We have some early experiences with measuring and tun- data to help us understand the potential for further perfor- ing standard Hadoop programs, including some of the ones mance improvement, one thing is clear: high CPU utiliza- used in the HiBench [11] benchmark suite, and some from tion does not necessarily translate into high performance. In real-world customer engagements. In this paper, we use other words, the CPU may be busy doing inefficient comput- Terasort, a widely used sort program included in the Hadoop ing (during the Hadoop and Java framework, for instance), distribution, as a case study. While Terasort in Hadoop can or performing inefficient algorithms that artificially inflate be configured to sort differing amounts of input data, we CPU utilization without improving performance. This leads only present results for sorting 1TB of input data. In addi- us to examine full-stack optimization during the next phase tion we compress neither input nor output data. of our research. Our initial trial is done with the default Hadoop map- reduce configuration, e.g., limited map and reduce task slots, which does not utilize the 16 processor cores in a single Pow- erLinux 7R2 system. As expected, the test takes hours to finish. After initial adjustment of the number of mappers and reducers to fit to the parallel processing power of 16 cores in a PowerLinux 7R2 system, the execution time dras- tically decreases. 
We then apply the following public-domain tuning methods, referencing the best practices of the PowerLinux Community [8], to gradually improve the sort performance:

  • Four-way Simultaneous Multithreading (SMT4) to further increase computation parallelism and to stress data parallelism via the large L3 caches and high memory bandwidth of POWER7;
  • Aggressive read-ahead settings and deadline disk I/O scheduling at the OS level;
  • Large block size and buffer sizing in Hadoop;
  • Publicly available LZO compression [7] for intermediate data compression;
  • Preliminary intermediate control of the map and reduce stages to better utilize available memory capacity;
  • Reconfiguration of the storage subsystem to remove failover support in the storage adapters for an effective bandwidth improvement, since Hadoop handles storage failures by replication at the software level;
  • JVM, JIT, and GC tuning better suited to the POWER architecture.

Currently, we are able to achieve an execution time of less than 9 minutes to sort 1TB of input data from disk and write 1TB of output data back to disk storage. This translates to sorting 11.4GB/node/minute for this I/O-intensive sort benchmark. Table 2 lists some of the BigInsights (Hadoop) job configuration parameters used in this Terasort run, which completed in 8 minutes and 44 seconds.
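The Hadoop-side items in the list above (LZO intermediate compression, larger blocks and buffers) correspond to ordinary job configuration properties, several of which appear in Table 2 below. The following is a minimal sketch of how such settings could be applied to a Hadoop 0.20-era job; the class name is hypothetical, the values are copied from Table 2, and properties such as dfs.block.size would normally also be set cluster-wide rather than per job.

import org.apache.hadoop.mapred.JobConf;

/** Illustrative sketch: shuffle, buffer and compression settings mirroring Table 2. */
public class ShuffleTuningSketch {
    public static JobConf applyShuffleTuning(JobConf conf) {
        // Compress map output with LZO so the shuffle moves fewer bytes over the network.
        conf.setBoolean("mapred.compress.map.output", true);
        conf.set("mapred.map.output.compression.codec",
                 "com.hadoop.compression.lzo.LzoCodec");
        // Larger HDFS blocks and I/O buffers reduce per-task overhead and spill activity.
        conf.setLong("dfs.block.size", 536870912L);   // 512MB blocks
        conf.setInt("io.sort.mb", 650);               // in-memory sort buffer per map task
        conf.setInt("io.file.buffer.size", 524288);
        // Let reducers start fetching map output almost immediately.
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.01f);
        return conf;
    }
}

The OS-level items in the list (SMT4 mode, read-ahead, deadline scheduling) are applied outside Hadoop and are not shown here.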
Table 2: Sample BigInsights (Hadoop) Job Configuration

  Parameter                                    Value
  mapred.compress.map.output                   true
  mapred.map.output.compression.codec          com.hadoop.compression.lzo.LzoCodec
  mapred.reduce.slowstart.completed.maps       0.01
  mapred.reduce.parallel.copies                1
  mapred.map.tasks                             640
  mapred.reduce.tasks                          380
  mapred.map.tasks.speculative.execution       true
  io.sort.factor                               120
  mapred.jobtracker.taskScheduler              org.apache.hadoop.mapred.JobQueueTaskScheduler
  flex.priority                                0
  adaptivemr.map.enable                        false
  io.sort.mb                                   650
  mapred.job.reduce.input.buffer.percent       0.96
  mapred.job.shuffle.merge.percent             0.96
  mapred.job.shuffle.input.buffer.percent      0.7
  io.file.buffer.size                          524288
  io.sort.record.percent                       0.138
  io.sort.spill.percent                        1.0
  mapred.child.java.opts                       '-server -Xlp -Xnoclassgc -Xgcpolicy:gencon -Xms890m -Xmx890m -Xjit:optLevel=hot -Xjit:disableProfiling -Xgcthreads4 -XlockReservation'
  mapred.tasktracker.map.tasks.maximum         64
  mapred.tasktracker.reduce.tasks.maximum      49
  dfs.replication                              1
  mapred.max.tracker.blacklists                20
  dfs.block.size                               536870912
  mapred.job.reuse.jvm.num.tasks               -1

[Figure 1: CPU utilization on one PowerLinux 7R2 node from a Terasort run of 8 minutes and 44 seconds.]

Figure 1 shows the CPU utilization of one of the 10 nodes in the cluster during the Terasort run; all nodes have similar CPU utilization. CPU utilization is very high during the map and map-reduce overlapping stages. As expected, it is only when all mappers have finished and only reducers (380, as shown in Table 2, out of the 640 hardware threads supported by the 10 PowerLinux 7R2 servers) are writing output back to disk that CPU utilization drops.

While we have large quantities of other system profile data to help us understand the potential for further performance improvement, one thing is clear: high CPU utilization does not necessarily translate into high performance. In other words, the CPU may be busy doing inefficient computing (inside the Hadoop and Java framework, for instance) or executing inefficient algorithms that artificially inflate CPU utilization without improving performance. This leads us to examine full-stack optimization during the next phase of our research.

As indicated above, we have applied only infrastructure tuning in this stage of the study. In the next stage, we plan to incorporate the performance enhancement features in IBM InfoSphere BigInsights for further improvement. We are also working on patches for better intermediate control in MapReduce. Hadoop versions newer than 0.20.2 are expected to continue to deliver performance improvements. Furthermore, we plan to apply reconfigurable acceleration technology as indicated in Section 3. In the meantime, scalability studies are also important to understand the optimal approach to (A) strong scaling of the BigInsights cluster, scaling the number of nodes with constant input size, and (B) weak scaling of the BigInsights cluster, scaling the number of nodes with corresponding increases in input size. (We borrow the two common notions of scalability, strong vs. weak scaling, from the High Performance Computing community; we think they fit Big Data analytics and data-intensive computing well.)

As a preliminary proof of concept, our experiment shows that BigInsights v1.3 with only 4 PowerLinux 7R2 nodes can sort 1TB of input in slightly over 22 minutes. That is roughly 11.4GB/node/minute (1TB across 4 nodes in 22 minutes), essentially the same per-node rate as the 10-node run, so the Terasort benchmark exhibits nearly linear strong scaling from 4 up through the 10 nodes in our cluster. This leads to a lower system cost and footprint for situations where a 22-minute Terasort time is acceptable.

While we have not observed the network to be a bottleneck in our 10-node cluster, this may change as the system scales up or the workload changes. Similar observations apply to our disk subsystem, which currently has 30 disks in total per PowerLinux 7R2 server.

Finally, we expect people to use performance analysis similarly and to judiciously size their systems for their particular workloads and needs. This can be useful either when they make purchasing decisions or when they reconfigure their systems in production.

5. CONCLUSIONS
In this paper, we have presented our initial study on Big Data benchmarking and methodology as well as workload-optimized systems for Big Data. We have also discussed our initial experience of sorting 1TB of data on a 10-node PowerLinux 7R2 cluster. As of this writing, it takes less than 9 minutes to complete the sort, which translates to sorting 11.4GB/node/minute for this I/O-intensive sort benchmark. We expect additional improvements in the near future. We also show that 4 PowerLinux 7R2 nodes can sort 1TB of input in around 22 minutes. Further improvement is expected as we continue full-stack optimization of both software and hardware.

6. ACKNOWLEDGMENTS
We would like to thank the IBM InfoSphere BigInsights team for their support. In particular, we owe a debt of gratitude to Berni Schiefer, Stewart Tate and Hui Liao for their guidance and insights. Susan Proietti Conti, Gina King, Angela S. Perez, Raguraman Tumaati-Krishnan, Anitra Powell, Demetrice Browder and Chuck Bryan have supported us in streamlining the process along the way. Last but not least, we could not have reached this stage without the generous support and advice of Dan P. Dumarot, Richard W. Bishop, Pat Buckland, Clark Anderson, Ken Blake, Geoffrey Cagle and John Williams, among others.

7. REFERENCES
[1] Apache Hadoop. http://hadoop.apache.org/.
[2] Enron Email Dataset. http://www.cs.cmu.edu/~enron/.
[3] IBM InfoSphere BigInsights. http://www-01.ibm.com/software/data/infosphere/biginsights/.
[4] IBM Netezza Data Warehouse Appliances. http://www-01.ibm.com/software/data/netezza/.
[5] IBM PowerLinux 7R2 Server. http://www-03.ibm.com/systems/power/software/linux/powerlinux/7r2/index.html.
[6] IBM WebSphere DataPower SOA Appliances. http://www-01.ibm.com/software/integration/datapower/.
[7] LZO: A Portable Lossless Data Compression Library. http://www.oberhumer.com/opensource/lzo/.
[8] PowerLinux Community. https://www.ibm.com/developerworks/group/tpl.
[9] The ClueWeb09 Dataset. http://lemurproject.org/clueweb09/.
[10] The Stanford WebBase Project. http://diglib.stanford.edu:8091/testbed/doc2/WebBase/.
[11] S. Huang, J. Huang, J. Dai, T. Xie, and B. Huang. The HiBench benchmark suite: Characterization of the MapReduce-based data analysis. In IEEE International Conference on Data Engineering Workshops (ICDEW), Long Beach, CA, USA, 2010.
[12] T. Morgan. IBM Global Technology Outlook 2012. In Technology Innovation Exchange, IBM Warwick, 2012.