The document discusses accelerating Apache Hadoop with high-performance networking and I/O technologies. It describes how technologies such as InfiniBand, RoCE, SSDs, and NVMe can benefit big data applications by alleviating I/O and communication bottlenecks. It outlines packages from the High-Performance Big Data (HiBD) project that implement RDMA-based designs for Hadoop, Spark, HBase, and Memcached to improve performance. Evaluation results demonstrate significant acceleration of HDFS, MapReduce, and other workloads with these high-performance designs.
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data... – DataWorks Summit
In the last few years, the DevOps movement has introduced groundbreaking approaches to the way we manage the lifecycle of software development and deployment. Today organisations aspire to fully automate the deployment of microservices and web applications with tools such as Chef, Puppet and Ansible. However, the deployment of data-processing pipelines remains a relic from the dark ages of software development.
Processing large-scale data pipelines is the main engineering task of the Big Data era, and it should be treated with the same respect and craftsmanship as any other piece of software. That is why we created Apache Amaterasu (Incubating) - an open source framework that takes care of the specific needs of Big Data applications in the world of continuous delivery.
In this session, we will take a close look at Apache Amaterasu (Incubating), a simple and powerful framework to build and dispense pipelines. Amaterasu aims to help data engineers and data scientists to compose, configure, test, package, deploy and execute data pipelines written using multiple tools, languages and frameworks.
We will see what Amaterasu provides today, how it can help existing Big Data applications, and demo some of the new bits that are coming in the near future.
Speaker:
Yaniv Rodenski, Senior Solutions Architect, Couchbase
Hadoop has traditionally been an on-premises workload, with very few notable implementations in the cloud. With organizations having either jumped on the cloud bandwagon or started planning their expansion into the cloud ecosystem, it is imperative that we explore how Hadoop conforms to the cloud paradigm. With the coming of age of some very useful cloud paradigms, and the high seasonality of Big Data workloads, this has become a very common ask from customers. Robust architectures, elastic scale, open platforms, OSS integrations, and addressing complex pain points will all be part of this lively talk. To implement effective solutions for Big Data in the cloud, it is imperative that you understand the core principles and grasp the design principles of how the cloud can enhance the benefits of parallelized analytics. Join this session to understand the nitty-gritty of implementing Big Data in the cloud and the various options therein. Big Data plus cloud is a potent combination.
Dancing elephants - efficiently working with object stores from Apache Spark ... – DataWorks Summit
As Hadoop applications move into cloud deployments, object stores increasingly become the source and destination of data. But object stores are not filesystems: sometimes they are slower, security works differently, and their semantics break assumptions that applications make about filesystems.
What are the secret settings that get maximum performance from queries against data living in cloud object stores? Tuning applies at the filesystem client, the file format, and the query engine layers; it even extends to how you lay out the files: the directory structure and the names you give them.
We know these things from our work in all of these layers, from the benchmarking we've done, and from the support calls we get when people have problems. Now we'll show you.
This talk will start from the ground up with the "why isn't an object store a filesystem?" issue, showing how that breaks fundamental assumptions in code and causes performance issues you don't get when working with HDFS. We'll look at ways to get Apache Hive and Spark to work better, covering optimizations that have been done to enable this and the work that is ongoing. Finally, we'll consider what your own code needs to do in order to adapt to cloud execution.
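As a taste of the client-side tuning involved, here is a minimal PySpark sketch; the s3a property names are standard Hadoop options, but the values and the bucket are illustrative assumptions, not recommendations from the talk:

```python
from pyspark.sql import SparkSession

# Illustrative sketch: tuning the s3a filesystem client for query workloads.
# Property names are standard Hadoop s3a options; values are assumptions.
spark = (
    SparkSession.builder
    .appName("object-store-tuning-sketch")
    # Allow more parallel connections to the object store.
    .config("spark.hadoop.fs.s3a.connection.maximum", "100")
    # Random-access read pattern suits columnar formats such as ORC/Parquet.
    .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
    # Stream multipart uploads from buffers instead of staging a full local copy.
    .config("spark.hadoop.fs.s3a.fast.upload", "true")
    # The v2 commit algorithm avoids a second round of renames
    # (rename is a slow copy on object stores).
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/events/")  # hypothetical bucket
df.groupBy("event_type").count().show()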
Apache Hadoop 3 is coming! As the next major milestone for Hadoop and big data, it attracts everyone's attention as it showcases several bleeding-edge technologies and significant features across all components of Apache Hadoop: Erasure Coding in HDFS, Docker container support, Apache Slider integration and native service support, Application Timeline Service version 2, Hadoop library updates and client-side classpath isolation, and more. In this talk, we will first update the status of the Hadoop 3.0 release work in the Apache community and the feasible path through alpha and beta towards GA. Then we will dive deep into each new feature, including its development progress and maturity in Hadoop 3. Last but not least, as a new major release, Hadoop 3.0 will contain some incompatible API or CLI changes, which could be challenging for downstream projects and existing Hadoop users to upgrade across; we will go through these major changes and explore their impact on other projects and users.
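As a taste of one headline feature, erasure coding is managed through the new `hdfs ec` subcommand in Hadoop 3. In this sketch the directory path is hypothetical and the policy is one of the built-in ones:

```python
import subprocess

# Hadoop 3 ships an `hdfs ec` subcommand for managing erasure coding
# policies. This sets the built-in RS-6-3-1024k policy (6 data + 3 parity
# blocks) on a directory; /data/cold is a hypothetical path.
subprocess.run(
    ["hdfs", "ec", "-setPolicy", "-path", "/data/cold", "-policy", "RS-6-3-1024k"],
    check=True,
)

# Verify which policy now applies to the directory.
subprocess.run(["hdfs", "ec", "-getPolicy", "-path", "/data/cold"], check=True)
```

With RS-6-3, six data blocks are protected by three parity blocks, so cold data costs 1.5x raw capacity rather than the 3x of triple replication.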
Speaker: Sanjay Radia, Founder and Chief Architect, Hortonworks
Hadoop’s capabilities offer untapped potential for business insights, but companies often get weighed down with DIY platforms and fail to keep up with the requirements. Join this Dell EMC session, which will address this challenge with ready bundles that quickly deliver solutions for ETL offload, Single View, and IoT.
Get more value from your big data:
• Deploy big data applications faster
• Increase business agility
• Confidently deliver high performance and endless scale
• Improve IT operational efficiency
Speaker
Shawn Smith, Big Data Specialist, Dell EMC
R is a hugely popular platform for data scientists to create analytic models in many different domains. It is simple and ubiquitous, and a large number of readily available packages make it very powerful for statistical computing. But when these applications move from the science lab to the production environment of large enterprises, a new set of challenges arises. Independently of R, Spark has been very successful as a powerful general-purpose computing platform that leverages clusters of computers and is able to process data at a scale that was not feasible before.
With the introduction of SparkR an exciting new option to productionize Data Science applications has been made available. This talk will give insight into two real-life projects at major enterprises where Data Science applications in R have been migrated to SparkR.
Suggested Topics:
• Dealing with platform challenges: R was not installed on the cluster. We show how to execute SparkR on a YARN cluster with a dynamic deployment of R.
• Integrating Data Engineering and Data Science: we highlight the technical and cultural challenges that arise from closely integrating these two different areas.
• Separation of concerns: we describe how to disentangle ETL and data preparation from analytic computing and statistical methods.
• Scaling R with SparkR: we present what options SparkR offers to scale R applications and how we applied them to different areas such as time series forecasting and web analytics.
• Performance improvements: we will show benchmarks for an R application that took over 20 hours in a single-server, single-threaded setup. With moderate effort we were able to reduce that to 15 minutes with SparkR (the parallelization pattern is sketched after this list), and we will show how we plan to further reduce it to less than a minute in the future.
• Mixing SparkR, SparkSQL and MLlib: we show how we combined the three different libraries to maximize efficiency.
Summary and Outlook: we describe what we have learnt so far, what the biggest gaps currently are and what challenges we expect to solve in the short- to mid-term.
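The talk's code is in R; since the sketches in this document use Python, here is the analogous scaling pattern in PySpark (SparkR exposes the same pattern through functions such as spark.lapply and gapply): fit one independent per-key model in parallel across the cluster. All names and the toy forecasting routine are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("per-key-forecasting-sketch").getOrCreate()
sc = spark.sparkContext

def fit_forecast(series):
    """Hypothetical stand-in for a real forecasting routine: fit a model to
    one key's time series and return a one-step-ahead prediction."""
    values = [v for _, v in sorted(series)]
    # Naive drift forecast as a placeholder for a real model.
    return values[-1] + (values[-1] - values[0]) / max(len(values) - 1, 1)

# (key, (timestamp, value)) pairs; in practice this comes from a real table.
data = sc.parallelize([
    ("sensor-a", (1, 10.0)), ("sensor-a", (2, 12.0)), ("sensor-a", (3, 14.5)),
    ("sensor-b", (1, 5.0)),  ("sensor-b", (2, 4.5)),  ("sensor-b", (3, 4.0)),
])

# One model per key, fitted in parallel across the cluster: this is the
# pattern that turns a 20-hour single-threaded loop into minutes.
forecasts = data.groupByKey().mapValues(lambda s: fit_forecast(list(s))).collect()
print(forecasts)
```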
United Airlines is leveraging big data at the enterprise level to help drive revenue, improve the customer experience, optimize operations, and support our employees in their day-to-day activities. At the center of our big data stack is Apache Hadoop, supported by many other emerging open source frameworks that must be integrated with the myriad of operational systems that support a 90-year-old transportation company with worldwide operations. In addition, learn how streaming data and streaming data analytics are helping to drive operational decisions in real time and how this is being architected to scale horizontally to take advantage of high availability and parallel processing. With the rapidly evolving Hadoop ecosystem, and so many new open source technologies at our disposal, the options for solving long-standing industry problems such as modeling how customers make decisions, making timely and meaningful real-time offers, and optimizing logistical operations have never been better.
Speakers
Joe Olson, Senior Manager, Big Data Analytics, United Airlines
Jonathan Ingalls, Sr. Solutions Engineer, Hortonworks
Practice of large Hadoop cluster in China Mobile – DataWorks Summit
China Mobile Limited is the leading telecommunications services provider in China, with more than 800 million active users. In China Mobile, distributed big data clusters are built by branch companies in each province for their unique requirements. Meanwhile, we have built a centralized Hadoop cluster of more than 1,600 nodes, on which we collect data from dozens of distributed clusters and run analysis for our business.
In this session, we will introduce the architecture of the centralized Hadoop cluster and our experience constructing and tuning this large-scale cluster. Key points are as follows:
1. About Ambari: we improved Ambari with features such as HDFS Federation support and Ambari HA, improving its performance and enabling it to manage up to 1,600 nodes.
2. About HDFS: we built a large HDFS cluster holding up to 60 PB of data, using federation, ViewFS, and FairCallQueue. Our best practices for cluster operation and management will also be covered.
3. About Flume: we use a modified Flume to collect as much as 200 TB of data per day.
Speakers
Yuxuan Pan, Software Engineer, China Mobile Software Technology
Duan Yunfeng, Chief Designer of China Mobile's big data system, China Mobile Communications Corporation
Hadoop Infrastructure @Uber: Past, Present and Future – DataWorks Summit
Uber's mission is to provide transportation as reliable as running water, and data plays a critical role in fulfilling that mission. At Uber, Hadoop plays a critical role in the data infrastructure. We will talk about the journey of Hadoop at Uber and our future plans for scaling to billions of trips: the unique use cases Uber has, and how Hadoop and the ecosystem we built around it helped us in this journey. We will describe how we scaled from 10 to 2,000 nodes, and how we plan to scale to tens of thousands of nodes in the future. We will cover our mistakes, learnings and wins, how we process billions of events per day, and the unique challenges and real-world use cases involved in co-locating Uber's service architecture with batch workloads (e.g., data pipelines, machine learning and analytical workloads). Uber has made many improvements to the Hadoop ecosystem and has uniquely solved some problems in ways that have not been tried before. This presentation will give the audience a worked example and encourage them to enhance the ecosystem themselves, growing the community around these projects and helping the whole big data space. The audience is anybody who works on Big Data and wants to understand how to scale Hadoop and its ecosystem to tens of thousands of nodes. This talk will help them understand the Hadoop ecosystem and how to use it efficiently, and will introduce some of the technologies the Uber team is building in the big data space.
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes – DataWorks Summit
The Hadoop Distributed File System (HDFS) has evolved from a MapReduce-centric storage system into a generic, cost-effective storage infrastructure where organizations keep all of their data. This new use case presents a new set of challenges to the original HDFS architecture. One challenge is to scale the storage management of HDFS: the centralized scheme within the NameNode becomes the main bottleneck, limiting the total number of files stored. Although a typical large HDFS cluster is able to store several hundred petabytes of data, it is inefficient at handling large numbers of small files under the current architecture.
In this talk, we introduce our new design and in-progress work that re-architects HDFS to attack this limitation. Storage management is promoted to a distributed scheme, and a new concept, the storage container, is introduced for storing objects. HDFS blocks are stored and managed as objects in storage containers instead of being tracked only by the NameNode. Storage containers are replicated across DataNodes using a newly developed high-throughput protocol based on the Raft consensus algorithm. Our current prototype shows that under the new architecture the storage management of HDFS scales 10x better, demonstrating that HDFS is capable of storing billions of files.
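To see why small files strain a centralized NameNode, a back-of-the-envelope estimate; the ~150 bytes per namespace object is the commonly cited rule of thumb, not a figure from this talk:

```python
# Rule-of-thumb NameNode heap estimate: every file and block is an
# in-memory object of roughly 150 bytes (commonly cited figure, assumed here).
BYTES_PER_OBJECT = 150

def namenode_heap_gb(num_files, blocks_per_file=1):
    objects = num_files * (1 + blocks_per_file)  # one inode + its blocks
    return objects * BYTES_PER_OBJECT / 1024**3

# One billion small files (one block each) need on the order of:
print(f"{namenode_heap_gb(1_000_000_000):.0f} GB of NameNode heap")  # ~279 GB
```

Keeping that much metadata on the heap of a single process is what the distributed storage-container scheme is designed to avoid.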
Data processing at the speed of 100 Gbps @ Apache Crail (Incubating) – DataWorks Summit
Once the staple of HPC clusters, today high-performance network and storage devices are everywhere. For a fraction of the cost, one can rent 40/100 Gbps RDMA networks and high-end NVMe flash devices that support bandwidths of tens of GB/s, latencies below 100 microseconds, and millions of IOPS. How does one leverage this phenomenal performance for the popular data processing frameworks, such as Apache Spark, Flink and Hadoop, that we all know and love?
In this talk, I will introduce Apache Crail (Incubating), a fast, distributed data store designed specifically for high-performance network and storage devices. The goal of the project is to deliver true hardware performance to Apache data processing frameworks in the most accessible way. With its modular design, Crail supports multiple storage backends (DRAM, NVMe flash, and 3D XPoint) and networking protocols (RDMA and TCP/sockets). Crail provides multiple flexible APIs (file system, KV, HDFS, streaming) for better integration with the high-level data access operations in Apache compute frameworks. As a result, on a 100 Gbps network infrastructure, Crail delivers all-to-all shuffle operations at 80+ Gbps, broadcast operations at latencies below 10 microseconds, and more than 8M lookups per second per namenode. Moreover, Crail is a generic solution that integrates well with the Apache ecosystem, including frameworks like Spark, Hadoop and Hive.
I will present the case for Crail, its current status, and future plans. As Crail is a young Apache project, we are seeking to build a community and expand its application to other interesting domains.
Speaker
Animesh Trivedi, IBM Research, Research Staff Member (RSM)
Apache Hadoop 3.0 is coming! As the next major release, it attracts everyone's attention as it showcases several bleeding-edge technologies and significant features across all components of Apache Hadoop, including: Erasure Coding in HDFS, multiple standby NameNodes, YARN Timeline Service v2, JNI-based shuffle in MapReduce, Apache Slider integration and first-class service support, Hadoop library updates, and client-side classpath isolation.
In this talk, we will update the status of Hadoop 3, especially the release work in the community, and then dive deep into the new features included in Hadoop 3.0. As a new major release, Hadoop 3 also includes some incompatible changes; we will go through most of these changes and explore their impact on existing Hadoop users and operators. In the last part of the session, we will discuss ongoing efforts in the Hadoop 3 era and paint the big picture of how the big data landscape could be influenced by Hadoop 3.
Druid is an open-source analytics data store designed specifically to execute OLAP queries on event data. Its speed, scalability and efficiency have made it a popular choice to power user-facing analytic applications, including multiple BI tools and dashboards. However, Druid does not provide important features requested by many of these applications, such as a SQL interface or support for complex operations such as joins. This talk presents our work on extending Druid indexing and querying capabilities using Apache Hive. In particular, our solution makes it possible to index complex query results in Druid using Hive, query Druid data sources from Hive using SQL, and execute complex Hive queries on top of Druid data sources. We describe how we built an extension that brings benefits to both systems alike, leveraging Apache Calcite to overcome the challenge of transparently generating Druid JSON queries from the input Hive SQL queries. We conclude with a demo highlighting the performant and powerful integration of these projects.
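As a sketch of what this looks like from the SQL side, assuming the PyHive client (a third-party library): the storage handler class is the one shipped with Hive's Druid integration, while the host, datasource and columns are made up:

```python
from pyhive import hive  # third-party Hive client, assumed available

conn = hive.Connection(host="hive-server.example.com", port=10000)  # hypothetical host
cur = conn.cursor()

# Map an existing Druid datasource into Hive's metastore. The storage
# handler class ships with Hive's Druid integration; "wikiticker" is an
# example datasource name.
cur.execute("""
    CREATE EXTERNAL TABLE druid_wikiticker
    STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
    TBLPROPERTIES ("druid.datasource" = "wikiticker")
""")

# Plain SQL on the Druid-backed table: Calcite rewrites what it can into
# native Druid JSON queries, and Hive executes the rest.
cur.execute("""
    SELECT page, SUM(added) AS total_added
    FROM druid_wikiticker
    GROUP BY page
    ORDER BY total_added DESC
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)
```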
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr... – inside-BigData.com
DK Panda from Ohio State University presented this deck at the Switzerland HPC Conference.
"This talk will provide an overview of challenges in accelerating Hadoop, Spark and Mem- cached on modern HPC clusters. An overview of RDMA-based designs for multiple com- ponents of Hadoop (HDFS, MapReduce, RPC and HBase), Spark, and Memcached will be presented. Enhanced designs for these components to exploit in-memory technology and parallel file systems (such as Lustre) will be presented. Benefits of these designs on various cluster configurations using the publicly available RDMA-enabled packages from the OSU HiBD project (http://hibd.cse.ohio-state.edu) will be shown."
Watch the video presentation: https://www.youtube.com/watch?v=glf2KITDdVs
See more talks in the Swiss Conference Video Gallery: http://insidehpc.com/2016-swiss-hpc-conference/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Ahead of the NFV Curve with Truly Scale-out Network Function Cloudification – Mellanox Technologies
Presented at OpenStack Summit Vancouver by Chloe Jian Ma, Senior Director, Cloud Market Development (@chloe_ma)
Colin Tregenza Dancer, Director of Architecture
Published twice a year and publicly available at http://www.top500.org, the TOP500 supercomputing list ranks the world’s most powerful computer systems according to the Linpack benchmark rating system.
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr... – inside-BigData.com
In this deck from the Stanford HPC Conference, DK Panda from Ohio State University presents: Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Processing.
"This talk will provide an overview of challenges in accelerating Hadoop, Spark and Memcached on modern HPC clusters. An overview of RDMA-based designs for Hadoop (HDFS, MapReduce, RPC and HBase), Spark, Memcached, Swift, and Kafka using native RDMA support for InfiniBand and RoCE will be presented. Enhanced designs for these components to exploit NVM-based in-memory technology and parallel file systems (such as Lustre) will also be presented. Benefits of these designs on various cluster configurations using the publicly available RDMA-enabled packages from the OSU HiBD project (http://hibd.cse.ohio-state.edu) will be shown."
Watch the video: https://youtu.be/iLTYkTandEA
Learn more: http://web.cse.ohio-state.edu/~panda.2/
and
http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Accelerate Big Data Processing with High-Performance Computing Technologies – Intel® Software
Learn about opportunities and challenges for accelerating big data middleware on modern high-performance computing (HPC) clusters by exploiting HPC technologies.
Designing Convergent HPC and Big Data Software Stacks: An Overview of the HiB... – inside-BigData.com
In this deck from the 2019 Stanford HPC Conference, DK Panda from Ohio State University presents: Designing Convergent HPC and Big Data Software Stacks: An Overview of the HiBD Project.
"This talk will provide an overview of challenges in designing convergent HPC and BigData software stacks on modern HPC clusters. An overview of RDMA-based designs for Hadoop (HDFS, MapReduce, RPC and HBase), Spark, Memcached, Swift, and Kafka using native RDMA support for InfiniBand and RoCE will be presented. Enhanced designs for these components to exploit HPC scheduler (SLURM), parallel file systems (Lustre) and NVM-based in-memory technology will also be presented. Benefits of these designs on various cluster configurations using the publicly available RDMA-enabled packages from the OSU HiBD project (http://hibd.cse.ohio-state.edu) will be shown.
DK Panda is a Professor and University Distinguished Scholar of Computer Science and Engineering at the Ohio State University. He has published over 450 papers in the area of high-end computing and networking. The MVAPICH2 (High Performance MPI and PGAS over InfiniBand, Omni-Path, iWARP and RoCE) libraries, designed and developed by his research group (http://mvapich.cse.ohio-state.edu), are currently being used by more than 2,950 organizations worldwide (in 85 countries). More than 518,000 downloads of this software have taken place from the project's site. This software is empowering several InfiniBand clusters (including the 3rd, 14th, 17th, and 27th ranked ones) in the TOP500 list. The RDMA packages for Apache Spark, Apache Hadoop and Memcached together with OSU HiBD benchmarks from his group (http://hibd.cse.ohio-state.edu) are also publicly available. These libraries are currently being used by more than 300 organizations in 35 countries. More than 28,900 downloads of these libraries have taken place. High-performance and scalable versions of the Caffe and TensorFlow framework are available from https://hidl.cse.ohio-state.edu.
Prof. Panda is an IEEE Fellow. More details about Prof. Panda are available at http://www.cse.ohio-state.edu/~panda.
Watch the video: https://youtu.be/1QEq0EUErKM
Learn more: http://hpcadvisorycouncil.com/events/2019/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Accelerating Hadoop, Spark, and Memcached with HPC Technologies – inside-BigData.com
DK Panda from Ohio State University presented this deck at the OpenFabrics Workshop.
"Modern HPC clusters are having many advanced features, such as multi-/many-core architectures, highperformance RDMA-enabled interconnects, SSD-based storage devices, burst-buffers and parallel file systems. However, current generation Big Data processing middleware (such as Hadoop, Spark, and Memcached) have not fully exploited the benefits of the advanced features on modern HPC clusters. This talk will present RDMA-based designs using OpenFabrics Verbs and heterogeneous storage architectures to accelerate multiple components of Hadoop (HDFS, MapReduce, RPC, and HBase), Spark and Memcached. An overview of the associated RDMA-enabled software libraries (being designed and publicly distributed as a part of the HiBD project for Apache Hadoop (integrated and plug-ins for Apache, HDP, and Cloudera distributions), Apache Spark and Memcached will be presented. The talk will also address the need for designing benchmarks using a multi-layered and systematic approach, which can be used to evaluate the performance of these Big Data processing middleware."
Watch the video presentation: http://wp.me/p3RLHQ-gzg
Learn more: http://hibd.cse.ohio-state.edu/
and
https://www.openfabrics.org/index.php/abstracts-agenda.html
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ... – inside-BigData.com
In this deck from the 2019 Stanford HPC Conference, DK Panda from Ohio State University presents: How to Design Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems.
"This talk will focus on challenges in designing HPC, Deep Learning, and HPC Cloud middleware for Exascale systems with millions of processors and accelerators. For the HPC domain, we will discuss about the challenges in designing runtime environments for MPI+X (PGAS - OpenSHMEM/UPC/CAF/UPC++, OpenMP, and CUDA) programming models taking into account support for multi-core systems (Xeon, OpenPower, and ARM), high-performance networks, GPGPUs (including GPUDirect RDMA), and energy-awareness. Features and sample performance numbers from the MVAPICH2 libraries (http://mvapich.cse.ohio-state.edu) will be presented. For the Deep Learning domain, we will focus on popular Deep Learning frameworks (Caffe, CNTK, and TensorFlow) to extract performance and scalability with MVAPICH2-GDR MPI library and RDMA-Enabled Big Data stacks. Finally, we will outline the challenges in moving middleware to the Cloud environments."
Watch the video: https://youtu.be/hR8cnFVF8Zg
Learn more: http://www.cse.ohio-state.edu/~panda
and
http://hpcadvisorycouncil.com/events/2019/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016 – MLconf
Big Data Processing Above and Beyond Hadoop: Data-intensive computing represents a new computing paradigm to address Big Data processing requirements using high-performance architectures supporting scalable parallel processing to allow government, commercial organizations, and research environments to process massive amounts of data and implement new applications previously thought to be impractical or infeasible. The fundamental challenges of data-intensive computing are managing and processing exponentially growing data volumes, significantly reducing associated data analysis cycles to support practical, timely applications, and developing new algorithms which can scale to search and process massive amounts of data. The open source HPCC (High-Performance Computing Cluster) Systems platform offers a unified approach to Big Data processing requirements: (1) a scalable, integrated computer systems hardware and software architecture designed for parallel processing of data-intensive computing applications, and (2) a new programming paradigm in the form of a high-level, declarative, data-centric programming language designed specifically for big data processing. This presentation explores the challenges of data-intensive computing from a programming perspective, and describes the ECL programming language and the HPCC architecture designed for data-intensive computing applications. HPCC is an alternative to the Hadoop platform, and ECL is compared to Pig Latin, a high-level language developed for the Hadoop MapReduce architecture.
Hopsworks in the cloud – Berlin Buzzwords 2019 – Jim Dowling
This talk, given at Berlin Buzzwords 2019, describes the recent progress in making Hopsworks a cloud-native platform, with HA data-center support added for HopsFS.
Accelerating TensorFlow with RDMA for high-performance deep learning – DataWorks Summit
Google’s TensorFlow is one of the most popular deep learning (DL) frameworks. In distributed TensorFlow, gradient updates are a critical step governing the total model training time. These updates incur a massive volume of data transfer over the network.
In this talk, we first present a thorough analysis of the communication patterns in distributed TensorFlow. Then we propose a unified way of achieving high performance through enhancing the gRPC runtime with Remote Direct Memory Access (RDMA) technology on InfiniBand and RoCE. Through our proposed RDMA-gRPC design, TensorFlow only needs to run over the gRPC channel and gets the optimal performance. Our design includes advanced features such as message pipelining, message coalescing, zero-copy transmission, etc. The performance evaluations show that our proposed design can significantly speed up gRPC throughput by up to 1.5x compared to the default gRPC design. By integrating our RDMA-gRPC with TensorFlow, we are able to achieve up to 35% performance improvement for TensorFlow training with CNN models.
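For context, stock distributed TensorFlow 1.x already exposed an RDMA transport through the `grpc+verbs` protocol option; the design described above goes further by making the gRPC channel itself RDMA-capable. A minimal sketch of the stock option, assuming a verbs-enabled TensorFlow build and placeholder host names:

```python
import tensorflow as tf  # TensorFlow 1.x API, built with verbs (RDMA) support

# Hypothetical two-node cluster; replace with real host:port pairs.
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# protocol="grpc" is the default TCP path; a verbs-enabled build also
# accepts "grpc+verbs", which keeps gRPC for administrative messages and
# moves tensor payloads over RDMA.
server = tf.train.Server(
    cluster,
    job_name="worker",
    task_index=0,
    protocol="grpc+verbs",
)
server.join()
```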
Speakers
Dhabaleswar K (DK) Panda, Professor and University Distinguished Scholar, The Ohio State University
Xiaoyi Lu, Research Scientist, The Ohio State University
Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems – inside-BigData.com
In this deck from the Stanford HPC Conference, DK Panda from Ohio State University presents: Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems.
"This talk will focus on challenges in designing HPC, Deep Learning, and HPC Cloud middleware for Exascale systems with millions of processors and accelerators. For the HPC domain, we will discuss the challenges in designing runtime environments for MPI+X (PGAS-OpenSHMEM/UPC/CAF/UPC++, OpenMP and Cuda) programming models by taking into account support for multi-core systems (KNL and OpenPower), high networks, GPGPUs (including GPUDirect RDMA) and energy awareness. Features and sample performance numbers from MVAPICH2 libraries will be presented. For the Deep Learning domain, we will focus on popular Deep Learning framewords (Caffe, CNTK, and TensorFlow) to extract performance and scalability with MVAPICH2-GDR MPI library and RDMA-enabled Big Data stacks. Finally, we will outline the challenges in moving these middleware to the Cloud environments."
Watch the video: https://youtu.be/i2I6XqOAh_I
Learn more: http://web.cse.ohio-state.edu/~panda.2/
and
http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
High-Performance and Scalable Designs of Programming Models for Exascale Systems – inside-BigData.com
DK Panda from Ohio State University presented this deck at the Switzerland HPC Conference.
"This talk will focus on challenges in designing programming models and runtime environments for Exascale systems with millions of processors and accelerators to support various programming models. We will focus on MPI+X (PGAS - OpenSHMEM/UPC/CAF/UPC++, OpenMP, and CUDA) programming models by taking into account support for multi-core systems (KNL and OpenPower), high-performance networks, GPGPUs (including GPUDirect RDMA), and energy-awareness. Features and sample performance numbers from the MVAPICH2 libraries, will be presented."
Watch the video: http://wp.me/p3RLHQ-gCb
Learn more: http://hpcadvisorycouncil.com/events/2017/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Simplifying Big Data Integration with Syncsort DMX and DMX-h – Precisely
Today’s modern data strategies have to manage more than growing data volumes. They must also address the added complexity of integrating diverse data sources and types, adhere to security and governance mandates, and ensure the right tools and skills are in place to deliver business value from the data.
Learn how the latest enhancements to Syncsort DMX and DMX-h can help you achieve your modern data strategy goals with a single interface for accessing and integrating all your enterprise data sources – batch and streaming – across Hadoop, Spark, Linux, Windows or Unix – on premise or in the cloud.
Watch this on-demand customer education webcast to learn the latest product features introduced this year, including:
• Best-in-class data ingestion capabilities, with enhanced support for mainframes, RDBMSs, MPP, Avro/Parquet, Kafka, NoSQL and more
• A single interface for streaming and batch processes – now with support for Kafka and MapR Streams
• Secure data access, data governance and lineage, with seamless integration with Kerberos, Apache Ranger, Apache Ambari, Cloudera Manager, Cloudera Navigator and Sentry
• Evolution of our "design once, deploy anywhere" architecture – now with support for Spark!
Building Efficient HPC Clouds with MVAPICH2 and RDMA-Hadoop over SR-IOV Infin... – inside-BigData.com
Xiaoyi Lu from Ohio State University presented this deck at the OpenFabrics Workshop.
"Single Root I/O Virtualization (SR-IOV) technology has been steadily gaining momentum for high performance interconnects such as InfiniBand. SR-IOV can deliver near native performance but lacks locality-aware communication support. This talk presents an efficient approach to building HPC clouds based on MVAPICH2 and RDMA-Hadoop with SR-IOV. We discuss high-performance designs of the
virtual machine and container aware MVAPICH2 library over SR-IOV enabled HPC Clouds."
This talk will also present a high-performance virtual machine migration framework for MPI applications on SR-IOV enabled InfiniBand clouds. The MVAPICH2 software for building HPC Clouds presented in this talk is publicly available. We will also discuss how to leverage the high-performance networking features (e.g., RDMA, SR-IOV) on cloud environments to accelerate data processing through RDMAHadoop package, which is publicly available. Comprehensive performance evaluations on NSF-supported Chameleon Cloud show that our design can deliver the near bare-metal performance."
Watch the video: http://wp.me/p3RLHQ-gB3
Learn more: http://mvapich.cse.ohio-state.edu/
and
https://www.openfabrics.org/index.php/abstracts-agenda.html
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Designing HPC & Deep Learning Middleware for Exascale Systems – inside-BigData.com
DK Panda from Ohio State University presented this deck at the 2017 HPC Advisory Council Stanford Conference.
"This talk will focus on challenges in designing runtime environments for exascale systems with millions of processors and accelerators to support various programming models. We will focus on MPI, PGAS (OpenSHMEM, CAF, UPC and UPC++) and Hybrid MPI+PGAS programming models by taking into account support for multi-core, high-performance networks, accelerators (GPGPUs and Intel MIC), virtualization technologies (KVM, Docker, and Singularity), and energy-awareness. Features and sample performance numbers from the MVAPICH2 libraries will be presented."
Watch the video: http://wp.me/p3RLHQ-glW
Learn more: http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Many organizations currently process various types of data in different formats, and most often this data is free-form. As the consumers of this data grow, it is imperative that this free-flowing data adhere to a schema: it gives data consumers an expectation about the type of data they are getting, and it shields them from immediate impact if an upstream source changes its format. A uniform schema representation also gives a data pipeline an easy way to integrate and support various systems that use different data formats.
Schema Registry is a central repository for storing and evolving schemas. It provides an API and tooling to help developers and users register a schema and consume it without being impacted when the schema changes. Users can tag different schemas and versions, register for notifications of schema changes by version, and more.
In this talk, we will go through the need for a schema registry and schema evolution, and showcase the integration with Apache NiFi, Apache Kafka, and Apache Storm.
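As an illustration of the register-then-version workflow described above, a hedged sketch using plain HTTP; the endpoint paths and payload fields are assumptions modeled on the Hortonworks Schema Registry REST style, not verified routes:

```python
import json
import requests  # third-party HTTP client

# Hypothetical registry URL and endpoint paths; check your registry's
# documentation for the exact routes.
REGISTRY = "http://registry.example.com:9090/api/v1/schemaregistry"

# 1) Register schema metadata once: name, group, and compatibility policy.
meta = {
    "name": "truck_events",
    "type": "avro",
    "schemaGroup": "kafka",
    "description": "Truck telemetry events",
    "compatibility": "BACKWARD",
}
requests.post(f"{REGISTRY}/schemas", json=meta).raise_for_status()

# 2) Add a version: the actual Avro schema text. Consumers fetch by
#    name + version, so producers can evolve the schema without breaking them.
avro_schema = {
    "type": "record",
    "name": "TruckEvent",
    "fields": [
        {"name": "driverId", "type": "long"},
        {"name": "speed", "type": "double"},
    ],
}
requests.post(
    f"{REGISTRY}/schemas/truck_events/versions",
    json={"schemaText": json.dumps(avro_schema)},
).raise_for_status()
```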
There is an increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but for massive amounts of data, training a new model can take hours. This is a problem when the model needs to be more up-to-date: for example, when recommending TV programs while they are being transmitted, the model should take into consideration the users who are watching a program at that time.
The promise of online recommendation systems is fast adaptation to changes, but online machine learning from streams is commonly believed to be more restricted, and hence less accurate, than batch-trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system that unites batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in creating such a recommendation engine with Apache Flink and Apache Spark.
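A minimal sketch of the general idea (not the speakers' Flink/Spark implementation): serve recommendations from a periodically refreshed batch model, and fold in online events as an exponentially decayed popularity signal:

```python
import math
import time
from collections import defaultdict

class HybridRecommender:
    """Sketch of uniting batch and online learning: batch-trained scores are
    refreshed periodically, while online events adjust scores instantly."""

    def __init__(self, half_life_s=3600.0):
        self.batch_scores = {}                   # (user, item) -> batch-job score
        self.online_counts = defaultdict(float)  # item -> decayed live-watch count
        self.last_event = {}
        self.half_life_s = half_life_s

    def load_batch_model(self, scores):
        # Called e.g. hourly, when the batch job finishes retraining.
        self.batch_scores = dict(scores)

    def observe(self, item, now=None):
        # Online path: a user just started watching `item`.
        now = now or time.time()
        decay = 0.5 ** ((now - self.last_event.get(item, now)) / self.half_life_s)
        self.online_counts[item] = self.online_counts[item] * decay + 1.0
        self.last_event[item] = now

    def score(self, user, item, blend=0.3):
        # Blend the slow-but-accurate batch model with the fast-moving
        # popularity signal; `blend` is a tunable assumption.
        batch = self.batch_scores.get((user, item), 0.0)
        online = math.log1p(self.online_counts[item])
        return (1 - blend) * batch + blend * online
```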
Deep learning is not just hype: it outperforms state-of-the-art ML algorithms, one by one. In this talk we will show how deep learning can be used for detecting anomalies on IoT sensor data streams at high speed using DeepLearning4J on top of big data engines such as Apache Spark and Apache Flink. Key to this talk is the absence of any large training corpus, since we are using unsupervised machine learning, a domain that current deep learning research treats step-motherly. As we show in the demo, LSTM networks can learn very complex system behavior, in this case data coming from a physical model simulating bearing vibration. One drawback of deep learning is that a very large labeled training data set is normally required; this is particularly interesting because we show how unsupervised machine learning can be used in conjunction with deep learning, so no labeled data set is necessary. We are able to detect anomalies and predict breaking bearings with 10-fold confidence. All examples and code will be made publicly available as open source, and only open-source components are used.
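The talk itself uses DeepLearning4J; since the sketches in this document are in Python, here is the same unsupervised idea expressed with Keras: train an LSTM autoencoder on normal vibration windows only, then flag windows whose reconstruction error is unusually high. Shapes, the synthetic training data and the threshold rule are illustrative:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

TIMESTEPS, FEATURES = 100, 1  # illustrative window shape

# LSTM autoencoder: compress each window to a vector, then reconstruct it.
# Trained only on "normal" data, it reconstructs normal behavior well and
# anomalous behavior poorly - no labels needed.
model = keras.Sequential([
    keras.Input(shape=(TIMESTEPS, FEATURES)),
    layers.LSTM(32),                         # encoder
    layers.RepeatVector(TIMESTEPS),
    layers.LSTM(32, return_sequences=True),  # decoder
    layers.TimeDistributed(layers.Dense(FEATURES)),
])
model.compile(optimizer="adam", loss="mse")

# Stand-in for healthy bearing-vibration windows.
normal = np.sin(np.linspace(0, 50, 1000 * TIMESTEPS)).reshape(1000, TIMESTEPS, FEATURES)
model.fit(normal, normal, epochs=5, batch_size=32, verbose=0)

# Flag windows whose reconstruction error exceeds a threshold calibrated on
# the training data (mean + 3 sigma is an illustrative choice).
errors = np.mean((model.predict(normal, verbose=0) - normal) ** 2, axis=(1, 2))
threshold = errors.mean() + 3 * errors.std()

def is_anomaly(window):
    err = np.mean((model.predict(window[None], verbose=0) - window) ** 2)
    return err > threshold
```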
QE automation for large systems is a great step forward in increasing system reliability. In the big data world, multiple components have to come together to deliver business outcomes to end users. This means that QE automation scenarios need to be detailed around actual use cases that cut across components. The system tests potentially generate large amounts of data on a recurring basis, and verifying it is a tedious job. Given the multiple levels of indirection, the rate of false positives is higher than that of actual defects, and chasing them is generally wasteful.
At Hortonworks, we have designed and implemented an automated log analysis system, Mool, using statistical data science and ML. The current work in progress has a batch data pipeline followed by an ensemble ML pipeline that feeds into a recommendation engine. The system identifies the root cause of test failures by correlating failing test cases with current and historical error records across multiple components. It works in unsupervised mode, with no perfect model, stable build, or source-code version to refer to. In addition, the system provides limited recommendations to file or reopen past tickets, and compares run profiles with past runs.
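Mool itself is not public, but the core correlation step can be sketched: represent error records as TF-IDF vectors and match a failing test's log against historical, already-diagnosed errors, abstaining when the similarity is too low. The data and the threshold are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Historical error records that were already root-caused (hypothetical data).
history = [
    "java.net.ConnectException: Connection refused datanode heartbeat",
    "OutOfMemoryError: GC overhead limit exceeded in reducer",
    "KerberosAuthException: ticket expired during token renewal",
]
root_causes = ["datanode down", "reducer memory", "kerberos config"]

vectorizer = TfidfVectorizer()
hist_vecs = vectorizer.fit_transform(history)

def suggest_root_cause(failure_log: str, min_sim: float = 0.2):
    """Match a new failure against historical errors by cosine similarity
    over TF-IDF vectors; below min_sim we abstain rather than guess."""
    sims = cosine_similarity(vectorizer.transform([failure_log]), hist_vecs)[0]
    best = sims.argmax()
    return (root_causes[best], sims[best]) if sims[best] >= min_sim else (None, sims[best])

print(suggest_root_cause("reducer failed: GC overhead limit exceeded"))
```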
Improving business performance is never easy! The Natixis Pack is like Rugby. Working together is key to scrum success. Our data journey would undoubtedly have been so much more difficult if we had not made the move together.
This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Use the platform appropriately (not every problem is eligible for Big Data, and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security and project management as well as the industrial usage of Hive, HBase, Kafka, and Spark within an industrial Corporate and Investment Bank environment, framed by regulatory constraints.
HBase has established itself as the backend for many operational and interactive use cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features, HBase has come a long way, offering advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tunable write-ahead logging, and so on. This talk is based on the research for an upcoming second edition of the speaker's HBase book, combined with practical experience from medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of matching use cases, to determining the number of servers needed, leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
There has been an explosion of data digitising our physical world – from cameras, environmental sensors and embedded devices, right down to the phones in our pockets. This means that companies now have new ways to transform their businesses – both operationally, and through their products and services – by leveraging this data and applying fresh analytical techniques to make sense of it. But are they ready? The answer is “no” in most cases.
In this session, we’ll be discussing the challenges facing companies trying to embrace the Analytics of Things, and how Teradata has helped customers work through and turn those challenges to their advantage.
In this talk, we will present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16X, from 70K ops/s to 1.2 million ops/s on Spotify's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HDFS (HopsFS) and for the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata normalized in an in-memory, shared-nothing database. For YARN, we will discuss optimizations that enable a 2X throughput increase for the Capacity Scheduler, enabling scalability to clusters with >20K nodes. We will discuss the journey of how we reached this milestone, including some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open source and supports a pluggable database backend for distributed metadata, although it currently only supports MySQL Cluster as a backend. Hops opens up new directions for Hadoop when metadata is available for tinkering in a mature relational database.
In high-risk manufacturing industries, regulatory bodies stipulate continuous monitoring and documentation of critical product attributes and process parameters. On the other hand, sensor data coming from production processes can be used to gain deeper insights into optimization potentials. By establishing a central production data lake based on Hadoop and using Talend Data Fabric as a basis for a unified architecture, the German pharmaceutical company HERMES Arzneimittel was able to cater to compliance requirements as well as unlock new business opportunities, enabling use cases like predictive maintenance, predictive quality assurance or open world analytics. Learn how the Talend Data Fabric enabled HERMES Arzneimittel to become data-driven and transform Big Data projects from challenging, hard to maintain hand-coding jobs to repeatable, future-proof integration designs.
Talend Data Fabric combines Talend products into a common set of powerful, easy-to-use tools for any integration style: real-time or batch, big data or master data management, on-premises or in the cloud.
While you could be tempted to assume data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance, as far as backup and disaster recovery (BDR) is concerned, is not (yet) important. This talk first introduces the overarching issues and difficulties of backup and data safety, looking at each of the many components in Hadoop - including HDFS, HBase, YARN, Oozie, and the management components - and finally shows a viable approach using built-in tools. You will also learn not to take this topic lightly, and what is needed to implement and guarantee continuous operation of Hadoop-cluster-based solutions.
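As a concrete illustration of one building block mentioned above, here is a minimal sketch using the standard Hadoop 2.x FileSystem API to take an HDFS snapshot (the directory path and snapshot name are hypothetical, and the directory must first be made snapshottable by an administrator). Note the caveat from the abstract: files open for write are not treated as immutable by the snapshot.

```java
// Minimal sketch (standard Hadoop 2.x API): an HDFS snapshot as one building
// block of a backup strategy. Paths and names below are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SnapshotExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Directory must be snapshottable: hdfs dfsadmin -allowSnapshot /data/warehouse
        Path dir = new Path("/data/warehouse");
        // Creates /data/warehouse/.snapshot/backup-2016-04-01; files still open
        // for write at this moment are NOT frozen by the snapshot.
        fs.createSnapshot(dir, "backup-2016-04-01");
    }
}
```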
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities, spread across over 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on countries – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
The new frontiers of AI in RPA with UiPath Autopilot™UiPathCommunity
In this free online event, organized by the Italian UiPath Community, you can explore the new features of Autopilot, the tool that integrates artificial intelligence into the development and use of automations.
📕 Together we will look at some examples of how Autopilot is used across different tools in the UiPath Suite:
Autopilot for Studio Web
Autopilot for Studio
Autopilot for Apps
Clipboard AI
GenAI applied to Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk aims to encourage a more independent use of PHP frameworks, moving towards more flexible and future-proof PHP development.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux tools – Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns – and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint – a positive impact on the climate. Quality characteristics can be extended with sustainability and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security as an important piece of the software supply chain. Today, organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
1. Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
Talk at Hadoop Summit, Dublin, Ireland (April 2016) by Xiaoyi Lu
The Ohio State University
E-mail: luxi@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~luxi
2. Data Management and Processing on Modern Clusters
• Substantial impact on designing and utilizing data management and processing systems in multiple tiers
– Front-end data accessing and serving (Online)
• Memcached + DB (e.g. MySQL), HBase
– Back-end data analytics (Offline)
• HDFS, MapReduce, Spark
3. Trends for Commodity Computing Clusters in the Top500 List (http://www.top500.org)
[Chart: number of clusters and percentage of clusters in the Top500 over time; commodity clusters now account for 85% of the list.]
4. Drivers of Modern HPC Cluster Architectures
[Photos: Tianhe-2, Titan, Stampede, Tianhe-1A]
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
[Figure labels: Multi-core Processors; Accelerators / Coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); High Performance Interconnects – InfiniBand (<1 usec latency, 100 Gbps bandwidth); SSD, NVMe-SSD, NVRAM]
5. Trends in HPC Technologies
• Advanced Interconnects and RDMA protocols
– InfiniBand
– 10-40 Gigabit Ethernet/iWARP
– RDMA over Converged Enhanced Ethernet (RoCE)
• Delivering excellent performance (latency, bandwidth and CPU utilization)
• Has influenced re-designs of enhanced HPC middleware
– Message Passing Interface (MPI) and PGAS
– Parallel File Systems (Lustre, GPFS, ..)
• SSDs (SATA and NVMe)
• NVRAM and Burst Buffer
6. Interconnects and Protocols in OpenFabrics Stack for HPC (http://openfabrics.org)
[Diagram: protocol stacks from the application/middleware interface down to protocol, adapter, and switch]
• 1/10/40/100 GigE: sockets over kernel-space TCP/IP; Ethernet adapter, Ethernet switch
• 10/40 GigE-TOE: sockets over TCP/IP with hardware offload; Ethernet adapter, Ethernet switch
• IPoIB: sockets over kernel-space IPoIB; InfiniBand adapter, InfiniBand switch
• RSockets: sockets over user-space RSockets; InfiniBand adapter, InfiniBand switch
• SDP: sockets over user-space SDP (RDMA); InfiniBand adapter, InfiniBand switch
• iWARP: sockets over user-space TCP/IP (iWARP); iWARP adapter, Ethernet switch
• RoCE: verbs over user-space RDMA; RoCE adapter, Ethernet switch
• IB Native: verbs over user-space RDMA; InfiniBand adapter, InfiniBand switch
7. Large-scale InfiniBand Installations
• 235 IB clusters (47%) in the Nov 2015 Top500 list (http://www.top500.org)
• Installations in the Top 50 (21 systems):
– 462,462 cores (Stampede) at TACC (10th)
– 185,344 cores (Pleiades) at NASA/Ames (13th)
– 72,800 cores Cray CS-Storm in US (15th)
– 72,800 cores Cray CS-Storm in US (16th)
– 265,440 cores SGI ICE at Tulip Trading Australia (17th)
– 124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (18th)
– 72,000 cores (HPC2) in Italy (19th)
– 152,692 cores (Thunder) at AFRL/USA (21st)
– 147,456 cores (SuperMUC) in Germany (22nd)
– 86,016 cores (SuperMUC Phase 2) in Germany (24th)
– 76,032 cores (Tsubame 2.5) at Japan/GSIC (25th)
– 194,616 cores (Cascade) at PNNL (27th)
– 76,032 cores (Makman-2) at Saudi Aramco (32nd)
– 110,400 cores (Pangea) in France (33rd)
– 37,120 cores (Lomonosov-2) at Russia/MSU (35th)
– 57,600 cores (SwiftLucy) in US (37th)
– 55,728 cores (Prometheus) at Poland/Cyfronet (38th)
– 50,544 cores (Occigen) at France/GENCI-CINES (43rd)
– 76,896 cores (Salomon) SGI ICE in Czech Republic (47th)
– and many more!
8. Open Standard InfiniBand Networking Technology
• Introduced in Oct 2000
• High Performance Data Transfer
– Interprocessor communication and I/O
– Low latency (<1.0 microsec), high bandwidth (up to 12.5 GigaBytes/sec -> 100 Gbps), and low CPU utilization (5-10%)
• Multiple Operations
– Send/Recv
– RDMA Read/Write
– Atomic Operations (very unique)
• enable high-performance and scalable implementations of distributed locks, semaphores, and collective communication operations
• Leading to big changes in designing
– HPC clusters
– File systems
– Cloud computing systems
– Grid computing systems
9. Communication in the Memory Semantics (RDMA Model)
[Diagram: initiator and target processors, each with memory segments, a QP (send/recv), a CQ, and an InfiniBand device; a hardware ACK flows between the devices]
• The send WQE contains information about the send buffer (multiple segments) and the receive buffer (single segment)
• The initiator processor is involved only to: (1) post the send WQE, and (2) pull the completed CQE from the send CQ
• No involvement from the target processor
10. How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data Applications?
Bring HPC and Big Data processing into a “convergent trajectory”!
• What are the major bottlenecks in current Big Data processing middleware (e.g. Hadoop, Spark, and Memcached)?
• Can the bottlenecks be alleviated with new designs that take advantage of HPC technologies?
• Can RDMA-enabled high-performance interconnects benefit Big Data processing?
• Can HPC clusters with high-performance storage systems (e.g. SSD, parallel file systems) benefit Big Data applications?
• How much performance benefit can be achieved through enhanced designs?
• How to design benchmarks for evaluating the performance of Big Data middleware on HPC clusters?
11. Designing Communication and I/O Libraries for Big Data Systems: Challenges
[Architecture diagram, top to bottom:]
• Applications
• Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached) — upper-level changes?
• Programming Models (Sockets) — other protocols?
• Communication and I/O Library: point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, fault-tolerance, benchmarks
• Networking Technologies (InfiniBand, 1/10/40/100 GigE and intelligent NICs); Commodity Computing System Architectures (multi- and many-core architectures and accelerators); Storage Technologies (HDD, SSD, and NVMe-SSD)
12. Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols?
• Sockets not designed for high performance
– Stream semantics often mismatch for upper layers
– Zero-copy not available for non-blocking sockets
• Current design: Application -> Sockets -> 1/10/40/100 GigE network
• Our approach: Application -> OSU Design -> Verbs interface -> 10/40/100 GigE or InfiniBand
13. The High-Performance Big Data (HiBD) Project
• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
– Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• RDMA for Memcached (RDMA-Memcached)
• OSU HiBD-Benchmarks (OHB)
– HDFS and Memcached micro-benchmarks
• RDMA for Apache HBase and Impala (upcoming)
• Available for InfiniBand and RoCE
• http://hibd.cse.ohio-state.edu
• User base (based on voluntary registration): 160 organizations from 22 countries
• More than 16,000 downloads from the project site
14. RDMA for Apache Hadoop 2.x Distribution
• High-Performance Design of Hadoop over RDMA-enabled Interconnects
– High-performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for the HDFS, MapReduce, and RPC components
– Enhanced HDFS with in-memory and heterogeneous storage
– High-performance design of MapReduce over Lustre
– Plugin-based architecture supporting RDMA-based designs for Apache Hadoop, CDH and HDP
– Easily configurable for different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre) and different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.9
– Based on Apache Hadoop 2.7.1
– Compliant with Apache Hadoop 2.7.1, HDP 2.3.0.0 and CDH 5.6.0 APIs and applications
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR and FDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms
• Different file systems with disks and SSDs, and Lustre
– http://hibd.cse.ohio-state.edu
15. RDMA for Apache Spark Distribution
• High-Performance Design of Spark over RDMA-enabled Interconnects
– High-performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for Spark
– RDMA-based data shuffle and SEDA-based shuffle architecture
– Non-blocking and chunk-based data transfer
– Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.1
– Based on Apache Spark 1.5.1
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR and FDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms
• RAM disks, SSDs, and HDD
– http://hibd.cse.ohio-state.edu
16. OSU HiBD Micro-Benchmark (OHB) Suite – HDFS & Memcached
• Micro-benchmarks for Hadoop Distributed File System (HDFS)
– Sequential Write Latency (SWL), Sequential Read Latency (SRL), Random Read Latency (RRL), Sequential Write Throughput (SWT), and Sequential Read Throughput (SRT) benchmarks
– Support benchmarking of Apache Hadoop 1.x and 2.x HDFS, Hortonworks Data Platform (HDP) HDFS, and Cloudera Distribution of Hadoop (CDH) HDFS
• Micro-benchmarks for Memcached
– Get, Set, and Mixed Get/Set benchmarks
• Current release: 0.8
• http://hibd.cse.ohio-state.edu
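The OHB sources are released separately; as a rough illustration of what a sequential-write-latency (SWL) style measurement looks like, here is an independent sketch against the stock Hadoop FileSystem API (the path, record size, and record count are arbitrary choices for illustration, not OHB's):

```java
// Illustrative stand-in (not the OHB code): time a sequential write to HDFS
// using the standard Hadoop FileSystem API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SeqWriteLatency {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        byte[] buf = new byte[64 * 1024];            // 64 KB records
        long start = System.nanoTime();
        try (FSDataOutputStream out = fs.create(new Path("/bench/swl.dat"))) {
            for (int i = 0; i < 1024; i++) out.write(buf); // 64 MB total
        }
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.printf("wrote 64 MB sequentially in %.3f s%n", secs);
    }
}
```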
17. Different Modes of RDMA for Apache Hadoop 2.x
• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation, for better fault-tolerance as well as performance. This mode is enabled by default in the package.
• HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in memory and obtain as much performance benefit as possible.
• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.
• MapReduce over Lustre, with/without local disks: Besides HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.
• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in the different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).
18. Acceleration Case Studies and Performance Evaluation
• RDMA-based Designs and Performance Evaluation
– HDFS
– MapReduce
– RPC
– HBase
– Spark
– OSU HiBD Benchmarks (OHB)
19. Design Overview of HDFS with RDMA
[Architecture: Applications -> HDFS (Write / Others) -> Java Socket Interface over the 1/10/40/100 GigE or IPoIB network, or Java Native Interface (JNI) -> OSU Design -> Verbs -> RDMA-capable networks (IB, iWARP, RoCE, ..)]
• Enables high-performance RDMA communication, while supporting the traditional socket interface
• JNI layer bridges Java-based HDFS with a communication library written in native code
• Design Features
– RDMA-based HDFS write
– RDMA-based HDFS replication
– Parallel replication support
– On-demand connection setup
– InfiniBand/RoCE support
N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012
N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014
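For context, the client-side HDFS write path that this design accelerates looks like the sketch below (standard Hadoop 2.x API; the file path and sizes are hypothetical). The point of the plugin architecture is that application code like this is unchanged; only the transport underneath HDFS differs:

```java
// Sketch of an ordinary HDFS write whose replication pipeline the RDMA
// design accelerates; the application-facing API is unchanged.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/rdma-demo.dat");  // hypothetical path
        byte[] block = new byte[4 * 1024 * 1024];    // 4 MB writes
        try (FSDataOutputStream out = fs.create(file, (short) 3 /* replicas */)) {
            for (int i = 0; i < 256; i++) {
                out.write(block); // data flows down the 3-way replication pipeline
            }
        }
    }
}
```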
20. Enhanced HDFS with In-Memory and Heterogeneous Storage (Triple-H)
[Architecture: Applications -> Triple-H with data placement policies (hybrid replication, eviction/promotion) over RAM disk, SSD, HDD, and Lustre]
• Design Features
– Three modes: Default (HHH), In-Memory (HHH-M), Lustre-Integrated (HHH-L)
– Policies to efficiently utilize the heterogeneous storage devices (RAM, SSD, HDD, Lustre)
– Eviction/promotion based on data usage pattern
– Hybrid replication
– Lustre-Integrated mode: Lustre-based fault-tolerance
N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid '15, May 2015
21. Design Overview of MapReduce with RDMA
[Architecture: Applications -> MapReduce (Job Tracker, Task Tracker, Map, Reduce) -> Java Socket Interface over the 1/10/40/100 GigE or IPoIB network, or Java Native Interface (JNI) -> OSU Design -> Verbs -> RDMA-capable networks (IB, iWARP, RoCE, ..)]
• Enables high-performance RDMA communication, while supporting the traditional socket interface
• JNI layer bridges Java-based MapReduce with a communication library written in native code
• Design Features
– RDMA-based shuffle
– Prefetching and caching of map output
– Efficient shuffle algorithms
– In-memory merge
– On-demand shuffle adjustment
– Advanced overlapping of map, shuffle, and merge, and of shuffle, merge, and reduce
– On-demand connection setup
– InfiniBand/RoCE support
M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014
22. Advanced Overlapping among Different Phases
[Diagrams: default architecture; enhanced overlapping with in-memory merge; advanced hybrid overlapping]
• A hybrid approach to achieve the maximum possible overlapping in MapReduce across all phases, compared to other approaches
– Efficient shuffle algorithms
– Dynamic and efficient switching
– On-demand shuffle adjustment
M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014
23. Design Overview of Hadoop RPC with RDMA
[Architecture: Applications -> Hadoop RPC -> default Java Socket Interface over the 1/10/40/100 GigE or IPoIB network, or Java Native Interface (JNI) -> OSU Design -> Verbs -> RDMA-capable networks (IB, iWARP, RoCE, ..)]
• Enables high-performance RDMA communication, while supporting the traditional socket interface
• JNI layer bridges Java-based RPC with a communication library written in native code
• Design Features
– JVM-bypassed buffer management
– RDMA or send/recv based adaptive communication
– Intelligent buffer allocation and adjustment for serialization
– On-demand connection setup
– InfiniBand/RoCE support
X. Lu, N. Islam, M. W. Rahman, J. Jose, H. Subramoni, H. Wang, and D. K. Panda, High-Performance Design of Hadoop RPC with RDMA over InfiniBand, Int'l Conference on Parallel Processing (ICPP '13), October 2013
24. Performance Benefits – RandomWriter & TeraGen in TACC-Stampede
[Charts: execution time (s) vs. data size (80, 100, 120 GB) for IPoIB (FDR) vs. OSU-IB (FDR); RandomWriter reduced by 3x, TeraGen reduced by 4x]
• Cluster with 32 nodes and a total of 128 maps
• RandomWriter: 3-4x improvement over IPoIB for 80-120 GB file size
• TeraGen: 4-5x improvement over IPoIB for 80-120 GB file size
25. Performance Benefits – Sort & TeraSort in TACC-Stampede
[Charts: execution time (s) vs. data size (80, 100, 120 GB) for IPoIB (FDR) vs. OSU-IB (FDR); Sort reduced by 52%, TeraSort reduced by 44%. Cluster configurations: 32 nodes with a total of 128 maps and 64 reduces; 32 nodes with a total of 128 maps and 57 reduces]
• Sort with single HDD per node: 40-52% improvement over IPoIB for 80-120 GB data
• TeraSort with single HDD per node: 42-44% improvement over IPoIB for 80-120 GB data
26. Evaluation of HHH and HHH-L with Applications
[Chart: execution time (s) vs. concurrent maps per host (4, 6, 8) for HDFS, Lustre, and HHH-L; reduced by 79%. Table: CloudBurst with HDFS (FDR) 60.24 s vs. HHH (FDR) 48.3 s]
• MR-MSPolygraph on OSU RI with 1,000 maps
– HHH-L reduces the execution time by 79% over Lustre, and by 30% over HDFS
• CloudBurst on TACC Stampede
– With HHH: 19% improvement over HDFS
27. Evaluation with Spark on SDSC Gordon (HHH vs. Tachyon/Alluxio)
[Charts: execution time (s) vs. cluster size : data size (8:50, 16:100, 32:200 GB) for IPoIB (QDR), Tachyon, and OSU-IB (QDR); TeraGen reduced by 2.4x, TeraSort reduced by 25.2%]
• For 200GB TeraGen on 32 nodes
– Spark-TeraGen: HHH has 2.4x improvement over Tachyon; 2.3x over HDFS-IPoIB (QDR)
– Spark-TeraSort: HHH has 25.2% improvement over Tachyon; 17% over HDFS-IPoIB (QDR)
N. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of In-Memory File Systems for Hadoop and Spark Applications on HPC Clusters, IEEE BigData '15, October 2015
28. Design Overview of Shuffle Strategies for MapReduce over Lustre
[Diagram: Map 1/2/3 write to the intermediate data directory on Lustre; Reduce 1/2 obtain map output via Lustre read or RDMA, followed by in-memory merge/sort and reduce]
• Design Features
– Two shuffle approaches: Lustre read based shuffle, and RDMA based shuffle
– Hybrid shuffle algorithm to benefit from both shuffle approaches
– Dynamically adapts to the better shuffle approach for each shuffle request, based on profiling values for each Lustre read operation
– In-memory merge and overlapping of different phases are kept similar to the RDMA-enhanced MapReduce design
M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, High Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA, IPDPS, May 2015
29. Performance Improvement of MapReduce over Lustre on TACC-Stampede
[Charts: job execution time (sec) vs. data size for IPoIB (FDR) vs. OSU-IB (FDR): 300-500 GB on 64 nodes, and 20-640 GB on clusters of 4-128 nodes; reduced by 44% and 48%]
• Local disk is used as the intermediate data directory
• For 500GB Sort on 64 nodes: 44% improvement over IPoIB (FDR)
• For 640GB Sort on 128 nodes: 48% improvement over IPoIB (FDR)
M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-based Approach Benefit?, Euro-Par, August 2014
30. Case Study – Performance Improvement of MapReduce over Lustre on SDSC-Gordon
[Charts: job execution time (sec) vs. data size (40-120 GB) for IPoIB (QDR), OSU-Lustre-Read (QDR), OSU-RDMA-IB (QDR), and OSU-Hybrid-IB (QDR); reduced by 34% and 25%]
• Lustre is used as the intermediate data directory
• For 80GB Sort on 8 nodes: 34% improvement over IPoIB (QDR)
• For 120GB TeraSort on 16 nodes: 25% improvement over IPoIB (QDR)
31. Acceleration Case Studies and Performance Evaluation
• RDMA-based Designs and Performance Evaluation
– HDFS
– MapReduce
– RPC
– HBase
– Spark
– OSU HiBD Benchmarks (OHB)
32. HBase-RDMA Design Overview
[Architecture: Applications -> HBase -> Java Socket Interface over the 1/10/40/100 GigE or IPoIB networks, or Java Native Interface (JNI) -> OSU-IB Design -> IB Verbs -> RDMA-capable networks (IB, iWARP, RoCE, ..)]
• JNI layer bridges Java-based HBase with a communication library written in native code
• Enables high-performance RDMA communication, while supporting the traditional socket interface
33. HBase – YCSB Read-Write Workload
[Charts: read and write latency (us) vs. number of clients (8-128) for 10GigE, IPoIB (QDR), and OSU-IB (QDR)]
• HBase Get latency (QDR, 10GigE)
– 64 clients: 2.0 ms; 128 clients: 3.5 ms
– 42% improvement over IPoIB for 128 clients
• HBase Put latency (QDR, 10GigE)
– 64 clients: 1.9 ms; 128 clients: 3.5 ms
– 40% improvement over IPoIB for 128 clients
J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS '12
34. HBase – YCSB Get Latency and Throughput on SDSC-Comet
[Charts: average Get latency (ms) and total Get throughput (Kops/sec) vs. number of client threads (1-4) for IPoIB (FDR) vs. OSU-IB (FDR); 59% latency improvement, 2.4x throughput improvement]
• HBase Get average latency (FDR)
– 4 client threads: 38 us
– 59% improvement over IPoIB for 4 client threads
• HBase Get total throughput
– 4 client threads: 102 Kops/sec
– 2.4x improvement over IPoIB for 4 client threads
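A hedged sketch of the kind of single-row Get/Put operations YCSB times in these experiments, written against the standard HBase 1.x client API (the table, column family, and row names are assumptions; YCSB's default table is commonly "usertable"):

```java
// Illustrative single-row Put/Get latency probe using the HBase 1.x client.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLatencyProbe {
    public static void main(String[] args) throws Exception {
        try (Connection conn =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("usertable"))) {
            byte[] row = Bytes.toBytes("user1");
            byte[] cf = Bytes.toBytes("f"), q = Bytes.toBytes("v");
            Put put = new Put(row).addColumn(cf, q, Bytes.toBytes("value"));
            long t0 = System.nanoTime();
            table.put(put);                      // Put latency
            long t1 = System.nanoTime();
            Result r = table.get(new Get(row));  // Get latency
            long t2 = System.nanoTime();
            System.out.printf("put %.1f us, get %.1f us, value present: %b%n",
                    (t1 - t0) / 1e3, (t2 - t1) / 1e3, !r.isEmpty());
        }
    }
}
```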
35. Acceleration Case Studies and Performance Evaluation
• RDMA-based Designs and Performance Evaluation
– HDFS
– MapReduce
– RPC
– HBase
– Spark
– OSU HiBD Benchmarks (OHB)
36. Design Overview of Spark with RDMA
[Architecture: Spark bridged via JNI to a native communication library over Verbs/RDMA, while the traditional socket path is preserved]
• Enables high-performance RDMA communication, while supporting the traditional socket interface
• JNI layer bridges Scala-based Spark with a communication library written in native code
• Design Features
– RDMA based shuffle
– SEDA-based plugins
– Dynamic connection management and sharing
– Non-blocking data transfer
– Off-JVM-heap buffer management
– InfiniBand/RoCE support
X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI '14), August 2014
37. Performance Evaluation on SDSC Comet – SortBy/GroupBy
[Charts: total time (sec) vs. data size (64, 128, 256 GB) for IPoIB vs. RDMA on 64 worker nodes / 1536 cores; SortByTest reduced by 80%, GroupByTest reduced by 57%]
• InfiniBand FDR, SSD, 64 worker nodes, 1536 cores (1536M 1536R)
• RDMA-based design for Spark 1.5.1
• RDMA vs. IPoIB with 1536 concurrent tasks, single SSD per node
– SortBy: total time reduced by up to 80% over IPoIB (56Gbps)
– GroupBy: total time reduced by up to 57% over IPoIB (56Gbps)
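For reference, a SortBy-style job similar to the benchmark above can be expressed with the stock Spark 1.x Java API, as in the sketch below (data size and partition count are illustrative); the RDMA shuffle design plugs in beneath this unchanged application code:

```java
// Sketch of a sortBy job; sortBy triggers the all-to-all shuffle phase that
// the RDMA design accelerates. Sizes/partition counts are illustrative.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SortByTest {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("SortByTest"));
        Random rnd = new Random(42);
        List<Long> sample = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) sample.add(rnd.nextLong());
        JavaRDD<Long> data = sc.parallelize(sample, 1536); // one task per core
        long count = data.sortBy(x -> x, true, 1536).count();
        System.out.println("sorted " + count + " records");
        sc.stop();
    }
}
```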
38. Performance Evaluation on SDSC Comet – HiBench PageRank
[Charts: total time (sec) vs. data size (Huge, BigData, Gigantic) for IPoIB vs. RDMA; 37% on 32 worker nodes / 768 cores, 43% on 64 worker nodes / 1536 cores]
• InfiniBand FDR, SSD, 32/64 worker nodes, 768/1536 cores (768/1536M 768/1536R)
• RDMA-based design for Spark 1.5.1
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node
– 32 nodes/768 cores: total time reduced by 37% over IPoIB (56Gbps)
– 64 nodes/1536 cores: total time reduced by 43% over IPoIB (56Gbps)
39. Acceleration Case Studies and Performance Evaluation
• RDMA-based Designs and Performance Evaluation
– HDFS
– MapReduce
– RPC
– HBase
– Spark
– OSU HiBD Benchmarks (OHB)
40. Are the Current Benchmarks Sufficient for Big Data?
• The current benchmarks provide some performance behavior
• However, they do not provide any information to the designer/developer on:
– What is happening at the lower layer?
– Where are the benefits coming from?
– Which design is leading to benefits or bottlenecks?
– Which component in the design needs to be changed, and what will be its impact?
– Can performance gain/loss at the lower layer be correlated to the performance gain/loss observed at the upper layer?
41. Challenges in Benchmarking of RDMA-based Designs
[Same layered architecture diagram as slide 11 - applications, Big Data middleware (HDFS, MapReduce, HBase, Spark and Memcached), programming models (sockets), communication and I/O library, and networking/storage/system technologies - annotated to show that current benchmarks exercise the upper layers, while no benchmarks target the RDMA protocols underneath. Can the two be correlated?]
42. Iterative Process – Requires Deeper Investigation and Design for Benchmarking Next Generation Big Data Systems and Applications
[Same layered architecture diagram, annotated: applications-level benchmarks at the top layers and micro-benchmarks at the communication/RDMA layers, iterating between the two]
43. OSU HiBD Benchmarks (OHB)
• HDFS Benchmarks
– Sequential Write Latency (SWL) Benchmark
– Sequential Read Latency (SRL) Benchmark
– Random Read Latency (RRL) Benchmark
– Sequential Write Throughput (SWT) Benchmark
– Sequential Read Throughput (SRT) Benchmark
• Memcached Benchmarks
– Get Benchmark
– Set Benchmark
– Mixed Get/Set Benchmark
• Available as part of OHB 0.8
• MapReduce and RPC micro-benchmarks (per the papers below): to be released
N. S. Islam, X. Lu, M. W. Rahman, J. Jose, and D. K. Panda, A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Int'l Workshop on Big Data Benchmarking (WBDB '12), December 2012
D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks, BPOE-5 (2014)
X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks, Int'l Workshop on Big Data Benchmarking (WBDB '13), July 2013
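The OHB Memcached benchmarks are distributed with the suite; as an independent illustration of a Get/Set latency probe, here is a sketch using the open-source spymemcached client (host, port, and object size are assumptions):

```java
// Not the OHB benchmark itself: a minimal Memcached Set/Get latency probe
// written with the spymemcached client library.
import net.spy.memcached.MemcachedClient;
import java.net.InetSocketAddress;

public class MemcachedProbe {
    public static void main(String[] args) throws Exception {
        MemcachedClient client =
            new MemcachedClient(new InetSocketAddress("127.0.0.1", 11211));
        byte[] value = new byte[4096]; // 4 KB object
        long t0 = System.nanoTime();
        client.set("k1", 0, value).get(); // Set (block until completed)
        long t1 = System.nanoTime();
        Object v = client.get("k1");      // Get
        long t2 = System.nanoTime();
        System.out.printf("set %.1f us, get %.1f us%n",
                (t1 - t0) / 1e3, (t2 - t1) / 1e3);
        client.shutdown();
    }
}
```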
44. On-going and Future Plans of the OSU High Performance Big Data (HiBD) Project
• Upcoming releases of RDMA-enhanced packages will support
– HBase
– Impala
• Upcoming releases of OSU HiBD Micro-Benchmarks (OHB) will support
– MapReduce
– RPC
• Advanced designs with upper-level changes and optimizations
– Memcached with non-blocking API
– HDFS + Memcached-based burst buffer
45. Concluding Remarks
• Discussed challenges in accelerating Hadoop and Spark with HPC technologies
• Presented initial designs to take advantage of InfiniBand/RDMA for HDFS, MapReduce, RPC, and Spark
• Results are promising
• Many other open issues need to be solved
• Will enable the Big Data community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner
• Looking forward to collaboration with the community
46. Funding Acknowledgments
[Logos: funding support and equipment support organizations]
47. Personnel Acknowledgments
Current Students: A. Augustine (M.S.), A. Awan (Ph.D.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), N. Islam (Ph.D.), M. Li (Ph.D.)
Past Students: P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), K. Kandalla (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), S. Krishnamoorthy (M.S.), K. Kulkarni (M.S.), R. Kumar (M.S.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), M. Rahman (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), D. Shankar (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), H. Subramoni (Ph.D.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Venkatesh (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.), J. Zhang (Ph.D.)
Past Research Scientist: S. Sur
Past Post-Docs: X. Besseron, H.-W. Jin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne, H. Wang
Past Programmers: D. Bureddy
Current Research Scientists: H. Subramoni, X. Lu
Current Senior Research Associate: K. Hamidouche
Current Post-Docs: J. Lin, D. Banerjee
Current Programmer: J. Perkins
Current Research Specialist: M. Arnold
48. Second International Workshop on High-Performance Big Data Computing (HPBDC)
• HPBDC 2016 will be held with the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2016), Chicago, Illinois, USA, May 27th, 2016
• Keynote Talk: Dr. Chaitanya Baru, Senior Advisor for Data Science, National Science Foundation (NSF); Distinguished Scientist, San Diego Supercomputer Center (SDSC)
• Panel Moderator: Jianfeng Zhan (ICT/CAS)
• Panel Topic: Merge or Split: Mutual Influence between Big Data and HPC Techniques
• Six regular research papers and two short research papers
• http://web.cse.ohio-state.edu/~luxi/hpbdc2016
• HPBDC 2015 was held in conjunction with ICDCS '15: http://web.cse.ohio-state.edu/~luxi/hpbdc2015
49. Thank You!
{panda, luxi}@cse.ohio-state.edu
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/