
Innovation with Connection, The new HPCC Systems Plugins and Modules

As part of the 2018 HPCC Systems Summit Community Day event:

The HPCC Systems platform team continues to expand interoperability with third-party systems, which broadens the platform's feature set and facilitates custom solutions. James will share an update on the latest connectors available, including the Spark-HPCC connector and the upcoming HDFS connector plugin.

James McMullan has a broad range of software engineering experience, from developing low-level system drivers for X-ray fluorescence equipment to mobile video games and web applications. He is a recent addition to LexisNexis and is part of the HPCC Systems platform team, where he has been working on connectors integrating HPCC Systems with the Spark and Hadoop ecosystems.


  1. Innovation and Reinvention Driving Transformation. 2018 HPCC Systems® Community Day, October 9, 2018. James McMullan: Innovation with Connection, The new HPCC Systems Plugins and Modules
  2. Overview
     • Why Integrate HPCC Systems, Spark and Hadoop?
     • Spark-Thor Component: Goals, Features
     • Spark-HPCC Connector: Goals, Features, Demo
     • HDFS Connector: Goals, Features, Demo
     • Closing Thoughts
  3. Why Integrate HPCC Systems, Spark and Hadoop?
     • Our goal: allow you to do more
     • Combine strengths of the different ecosystems
     • Still in early stages
       • Exploring the potential of these integrations
     • More than one compelling use case
       • Python statistical and ML libraries through PySpark
       • New data formats
       • New methods of consuming data
  4. Spark-Thor Component
  5. HPCC Systems Spark-Thor Component - Goals
     • Easy setup of co-located Spark & HPCC Systems
     • Easy configuration
     • Allow custom configuration
     • Unified startup of Spark & HPCC Systems
     • Default configuration that works with HPCC Systems
       • Log directories
       • Work directories
       • Resource allocation
  6. HPCC Systems Spark-Thor Component - Features
     • Spark-Thor component installation
       • Packaged as a plugin
       • Platform build option
     • Easy configuration through configmgr
       • Spark cluster mirrors the Thor cluster
       • Resource allocation settings
       • Custom configuration through spark-env.sh
     • Default configuration
       • Fixes common issues
       • Works with the HPCC Systems configuration
     • Easily start & stop Spark
       • Unified startup
       • Uses existing HPCC Systems scripts
  7. Spark-HPCC Connector
  8. Spark-HPCC Connector - Goals
     • Read and write data from Spark
     • Reliable and easy to use
     • Performant
       • Memory usage
       • I/O throughput
       • Row construction cost
     • Allow co-location of Spark and HPCC Systems
     • Use HPCC Systems data with Spark MLlib
     • RDD and DataFrame support
  9. Spark-HPCC Connector - Features: Reading data from HPCC Systems
     • Co-located and remote
     • HPCC Systems record definitions
       • All scalar types, child datasets, and sets of scalars
     • Construct RDD of Rows
       • Easy translation from RDD to DataFrame (see the sketch after this slide)
       • Easy translation from RDD to ML datasets
     • Field & row filtering on the HPCC Systems side
     • Distributed reading
     (Slide diagram: row-filtering example with the condition "Accidents < 3")
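Below is a minimal PySpark sketch of the read path this slide describes: an RDD of Rows (one per HPCC Systems record) translated into a DataFrame and then into MLlib input. The connector's own read call is not shown; the small RDD is built locally as a stand-in for what the connector returns, and the field names (driver_id, state, accidents) are invented purely for illustration.

from pyspark.sql import SparkSession, Row
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("spark-hpcc-read-sketch").getOrCreate()

# Stand-in for the RDD of Rows the Spark-HPCC connector would return, one Row
# per HPCC Systems record (field and row filtering already applied remotely).
rows_rdd = spark.sparkContext.parallelize([
    Row(driver_id=1, state="GA", accidents=3),
    Row(driver_id=2, state="FL", accidents=0),
    Row(driver_id=3, state="NY", accidents=4),
])

# Easy translation from RDD to DataFrame; in practice the schema comes from
# the HPCC Systems record definition, here it is inferred from the Rows.
df = spark.createDataFrame(rows_rdd)

# Standard Spark tooling from here on, e.g. an MLlib feature vector and the
# "Accidents < 3" filter pictured on the slide.
features = VectorAssembler(inputCols=["accidents"], outputCol="features").transform(df)
features.filter(features.accidents < 3).show()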
  10. Spark-HPCC Connector - Features: Writing data to HPCC Systems
     • Co-located writes only
     • RDD<Row> of Spark SQL data types
       • Integer, Long, Float, Double, BigDecimal, String, Sequence, Row, byte[]
     • Automated Row to Record translation (see the sketch after this slide)
     • Distributed writing
     • Creation of new datasets only
     (Slide diagram: Row to Record mapping)
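The sketch below illustrates the write side under the constraints listed on this slide: an RDD<Row> limited to the supported Spark SQL value types, ready for the connector's automated Row-to-Record translation. Only the Spark-side preparation is shown; the connector's write call itself is omitted because its exact API is not covered on the slide, and the column names and values are made up for illustration.

from decimal import Decimal
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, LongType,
                               StringType, DoubleType, DecimalType)

spark = SparkSession.builder.appName("spark-hpcc-write-sketch").getOrCreate()

# Data restricted to value types the slide lists as supported
# (Integer/Long, Float/Double, BigDecimal, String, Sequence, Row, byte[]).
schema = StructType([
    StructField("id",      LongType(),         False),
    StructField("name",    StringType(),       True),
    StructField("balance", DecimalType(10, 2), True),
    StructField("score",   DoubleType(),       True),
])
df = spark.createDataFrame([
    (1, "Alice", Decimal("12.50"), 0.91),
    (2, "Bob",   Decimal("3.75"),  0.42),
], schema)

# The RDD<Row> a co-located write would consume; per the slide, the connector
# translates each Row into an HPCC Systems record and creates a new logical
# file (new datasets only). The write call itself is intentionally not shown.
rows_rdd = df.rdd
print(rows_rdd.collect())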
  11. Spark-HPCC Connector - Features: PySpark Support
     • Utilizes the Scala / Java API
       • Reading & writing
       • Same features & limitations
     • Py4J library used to construct JavaRDDs
     • PySpark picklers used for serialization / deserialization
       • Uses the PySpark-configured pickler
     • PySpark introduces additional overhead
       • RDD serialization and deserialization required (illustrated after this slide)
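The snippet below uses only stock PySpark (no connector calls) to illustrate the kind of overhead this slide refers to: once data drops from a DataFrame to an RDD, every row must be pickled across the JVM/Python boundary, which is the extra serialization cost PySpark adds over the Scala/Java API.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-overhead-sketch").getOrCreate()

df = spark.range(1000000).withColumnRenamed("id", "value")

# Stays on the JVM: no per-row pickling between Java and Python.
jvm_side = df.selectExpr("sum(value) AS total").collect()[0]["total"]

# Drops to an RDD: each row is serialized (pickled) to a Python worker,
# processed there, and the result shipped back through Py4J.
python_side = df.rdd.map(lambda r: r["value"]).sum()

print(jvm_side, python_side)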
  12. Spark-HPCC Connector - Demo
  13. Spark-HPCC Connector - Results
     • Read and write data from Spark ✔
     • Reliable and easy to use ✔
     • Performant ✔
       • Memory usage: no additional row / field overhead
       • I/O throughput: 1.5 GBit/s
       • Row construction cost: 30 million rows/s
     • Co-location of Spark and HPCC Systems ✔
     • HPCC Systems data with Spark MLlib ✔
     • RDD and DataFrame support ✔
  14. Spark-HPCC Connector - Roadmap
     • Remote writes
     • Improved performance
       • Better data locality planning
     • Scala/Java generic RDDs: RDD<YourClass>
       • Automated field mapping
       • Automated type conversion
       • Automated filtering
     • DataFrameReader and DataFrameWriter
       • No intermediate RDD
  15. Spark-HPCC Connector - Availability
     • HPCC Systems GitHub: https://github.com/hpcc-systems/Spark-HPCC
     • We are open for feedback & feature requests
     • We want to hear about your use cases!
     • Pull requests welcome!
  16. HDFS Connector
  17. HDFS Connector - Goals
     • Read and write HDFS data from HPCC Systems
     • Reliable and easy to use
     • Performant
       • Memory usage
       • I/O throughput
       • Row construction cost
     • Few dependencies
     • Allow co-location of HPCC Systems and HDFS
     • Support multiple file formats
  18. HDFS Connector - Features: Reading from HDFS
     • Thor and CSV files
       • Thor files support all HPCC Systems record layouts, including variable-length records and records with children
       • CSV files support only scalar datatypes: Integers, Reals, Decimals, Strings, Varstrings, UTF8
     • Automatic field mapping and filtering
     • Distributed reading
       • Dynamically splits datasets down to the HDFS block size (64 MiB)
     (Slide diagram: reads distributed across Node 1 and Node 2)
  19. HDFS Connector - Features: Writing to HDFS
     • Supports Thor and CSV files
       • Thor files support all HPCC Systems record layouts
       • CSV files support only scalar datatypes: Integers, Reals, Decimals, Strings, Varstrings, UTF8
     • Distributed writing
       • Aware of HPCC Systems cluster topology
     • Additional metadata added to Thor files
     • Record structure validation and dynamic splitting
     • Multiple write modes: Create Only, Overwrite, or Append
     (Slide diagram: writes distributed across Node 1 and Node 2)
  20. HDFS Connector - Demo
  21. HDFS Connector - Results
     • Read and write HDFS data ✔
     • Reliable and easy to use ✔
     • Performant ✔
       • Memory usage: no additional memory overhead
       • I/O throughput: TBD
       • Record construction cost: TBD
     • Few dependencies ✔
     • Co-location of HPCC Systems and HDFS ✔
     • Support multiple file formats ✔
  22. HDFS Connector - Roadmap
     • Parquet support
       • Reading & writing
       • Automatic column filtering
     • Expand library support to Hadoop libhdfs
       • Statically linking to Apache HAWQ libhdfs3
     • Support Hadoop HDFS add-ons
       • S3A client
     • Performance tuning
  23. HDFS Connector - Availability
     • Available as a Technical Preview
     • HPCC Systems GitHub: https://github.com/hpcc-systems/HDFS-Connector
     • We are open for feedback & feature requests
     • We want to hear about your use cases!
     • Pull requests welcome!
  24. Closing Thoughts
     • Integration work between HPCC Systems, Spark and Hadoop is ongoing
     • Our goal: allow you to do more with your data
     • Spark-HPCC and HDFS Connector available now
     • Feedback, feature requests and PRs wanted
     • Tell us about your use cases!
  25. Questions?
     • Spark-Thor Plugin: https://hpccsystems.com/download
     • Spark-HPCC Connector: https://github.com/hpcc-systems/Spark-HPCC
     • HDFS Connector Tech Preview: https://github.com/hpcc-systems/HDFS-Connector
