
HPCC (High Performance Computing Cluster) is a massively parallel processing computing platform that solves Big Data problems. The platform is now open source!

Published in: Technology

  2. 2. HPCC vs HADOOP
     • Declarative programming language: Describe what needs to be done and not how to do it.
     • Powerful: Unlike Java, high-level primitives such as JOIN, TRANSFORM, PROJECT, SORT, DISTRIBUTE, MAP, etc. are available. Higher-level code means fewer programmers and a shorter time to deliver complete projects.
     • Extensible: As new attributes are defined, they become primitives that other programmers can use.
     • Implicitly parallel: Parallelism is built into the underlying platform. The programmer need not be concerned with it.
     • Maintainable: A high-level programming language, no side effects, and attribute encapsulation provide for more succinct, reliable, and easier-to-troubleshoot code.
     • Complete: Unlike Pig and Hive, ECL provides for a complete programming paradigm.
     • Homogeneous: One language to express data algorithms across the entire HPCC platform, including data ETL and delivery.
  3. 3. The Enterprise Control Language (ECL)
     HPCC Systems Enterprise Control Language (ECL) is the query and control language developed to manage all aspects of massive data joins, sorts, and builds. ECL truly differentiates HPCC from other technologies in its ability to provide flexible data analysis on a massive scale.
     ECL is a declarative language optimized for the manipulation of massive data sets, and it provides for modular, structured programming. Moreover, ECL is a transparent and implicitly parallel programming language which is both powerful and flexible, allowing for faster and more effective development cycles through higher expressiveness, encapsulation, and code reuse.
     Data analysts can "express" complex queries without the need for the iterative, time-consuming data transformations and sorts associated with other programming languages. Traditional low-level languages (Java, C++, etc.) force the translation of business requirements to functional requirements before programming can occur. The abstract nature of ECL eliminates the need for this by making it easy to express business rules directly and succinctly.
  4. 4. HPCC System Architecture
     The HPCC system architecture includes two distinct cluster processing environments, each of which can be optimized independently for its parallel data processing purpose.
     The first of these platforms is called a Data Refinery. Its overall purpose is the general processing of massive volumes of raw data of any type for any purpose, but it is typically used for data cleansing and hygiene, ETL processing of the raw data, record linking and entity resolution, large-scale ad-hoc complex analytics, and creation of keyed data and indexes to support high-performance structured queries and data warehouse applications. The Data Refinery is also referred to as Thor.
     A Thor cluster is similar in its function, execution environment, filesystem, and capabilities to the Google and Hadoop MapReduce platforms.
  5. 5. This slide shows a representation of a physical Thor processing cluster, which functions as a batch job execution engine for scalable data-intensive computing applications. In addition to the Thor master and slave nodes, auxiliary and common components are needed to implement a complete HPCC processing environment.
  6. 6. Roxie (rapid data delivery engine)
     The second of the parallel data processing platforms is called Roxie and functions as a rapid data delivery engine.
     This platform is designed as an online high-performance structured query and analysis platform, or data warehouse, delivering the parallel data access processing requirements of online applications through Web services interfaces, supporting thousands of simultaneous queries and users with sub-second response times.
     Roxie utilizes a distributed indexed filesystem to provide parallel processing of queries, using an execution environment and filesystem optimized for high-performance online processing.
     A Roxie cluster is similar in its function and capabilities to Hadoop with HBase and Hive capabilities added, and provides for near-real-time, predictable query latencies.
     Both Thor and Roxie clusters utilize the ECL programming language for implementing applications, increasing continuity and programmer productivity.
  7. 7. Continued…
     This slide shows a representation of a physical Roxie processing cluster, which functions as an online query execution engine for high-performance query and data warehousing applications.
     A Roxie cluster includes multiple nodes with server and worker processes for processing queries; an additional auxiliary component called an ESP server, which provides interfaces for external client access to the cluster; and additional common components which are shared with a Thor cluster in an HPCC environment.
     Although a Thor processing cluster can be implemented and used without a Roxie cluster, an HPCC environment which includes a Roxie cluster should also include a Thor cluster. The Thor cluster is used to build the distributed index files used by the Roxie cluster and to develop online queries which will be deployed with the index files to the Roxie cluster.
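The division of labor described on this slide (a batch cluster builds index files, an online cluster answers queries from them) can be illustrated with a toy, single-process Python sketch. This is not HPCC code: `build_index` and `query` are hypothetical names introduced only for illustration, and an in-memory dictionary stands in for the distributed index files.

```python
from collections import defaultdict


def build_index(records, key_field):
    """Batch step (the role Thor plays): scan every record once and
    group the records by the value of key_field."""
    index = defaultdict(list)
    for record in records:
        index[record[key_field]].append(record)
    return dict(index)


def query(index, key):
    """Online step (the role Roxie plays): answer a lookup from the
    prebuilt index without rescanning the raw records."""
    return index.get(key, [])


if __name__ == "__main__":
    # Made-up sample records, purely for demonstration.
    people = [
        {"last": "Smith", "city": "Austin"},
        {"last": "Jones", "city": "Boston"},
        {"last": "Smith", "city": "Denver"},
    ]
    idx = build_index(people, "last")
    print(query(idx, "Smith"))
```

The point of the split is that the expensive full scan happens once, offline, so that each online query only pays for a keyed lookup.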
  8. 8. More on ECL (data-centric programming language)
     ECL is a declarative, data-centric programming language designed in 2000 to allow a team of programmers to process big data across a high performance computing cluster without the programmer being involved in many of the lower-level, imperative decisions.
     Sorting problem:
     // First declare a dataset with one column containing a list of strings
     // Datasets can also be binary, csv, xml or externally defined structures
     D := DATASET([{'ECL'},{'Declarative'},{'Data'},{'Centric'},{'Programming'},{'Language'}],{STRING Value;});
     // Then sort the dataset on its only column and write out the result
     SD := SORT(D,Value);
     OUTPUT(SD);
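For readers without an ECL environment, the sorting problem on this slide can be mirrored in Python as a rough analogue. The function name `sort_values` is hypothetical, and Python's `sorted` runs on a single machine, so this sketch only illustrates the "say what, not how" style and not ECL's implicit parallelism.

```python
def sort_values(values):
    """Return the values in ascending order. Like SORT in the ECL
    example, the caller states the desired result and leaves the
    choice of sorting algorithm to the runtime."""
    return sorted(values)


if __name__ == "__main__":
    d = ["ECL", "Declarative", "Data", "Centric", "Programming", "Language"]
    for value in sort_values(d):
        print(value)
```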
  9. 9. More on ECL (data-centric programming language)
     ECL primitives that act upon datasets include: SORT, ROLLUP, DEDUP, ITERATE, PROJECT, JOIN, NORMALIZE, DENORMALIZE, PARSE, CHOOSEN, ENTH, TOPN, DISTRIBUTE.
     Comparison to Map-Reduce
     The Hadoop Map-Reduce paradigm actually consists of three phases which correlate to ECL primitives as follows.
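As background for the comparison, the three phases are commonly described as map, shuffle, and reduce. The following is a minimal single-process Python sketch of those phases using a word count; the function names are hypothetical, and the sketch does not attempt to show the ECL primitives that correspond to each phase.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit one (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: bring all values with the same key together and order
    the keys, so each reducer sees one complete group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Reduce: collapse each group of values into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data refinery"]))
```

In a real cluster the shuffle moves data between nodes, which is why it is usually treated as a phase of its own rather than as a detail of map or reduce.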