This document discusses using Ceph storage with Apache Hadoop to provide a scalable and efficient storage solution for big data workloads. It outlines the core limitation of the native Hadoop Distributed File System: storage cannot be scaled independently of compute resources. The solution presented is to use the open source Ceph storage system in place of direct-attached storage, which allows Hadoop compute and storage resources to scale independently and provides a centralized storage platform for all enterprise data workloads. Performance tests showed the Ceph and Hadoop configuration delivering up to a 60% improvement in I/O performance when using Intel caching software and SSDs.
Red Hat and Verizon teamed up to take attendees of Red Hat Storage Day New York on 1/19/16 through a tour of containerized storage and why it's important to the future of storage.
Implementation of Dense Storage Utilizing HDDs with SSDs and PCIe Flash Acc... (Red_Hat_Storage)
At Red Hat Storage Day New York on 1/19/16, Red Hat partner Seagate presented on how to implement dense storage using HDDs with SSDs and PCIe flash accelerator cards.
Red Hat's Ross Turk took the podium at the Public Sector Red Hat Storage Days on 1/20/16 and 1/21/16 to explain just why software-defined storage matters.
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C... (Odinot Stanislas)
After a short introduction to distributed storage and a description of Ceph, Jian Zhang runs through some interesting benchmarks in this presentation: sequential tests, random tests, and above all a comparison of results before and after optimizations. The configuration parameters adjusted and the optimizations applied (large page numbers, omap data on a separate disk, ...) yield at least a 2x performance gain.
Scale-out Storage on Intel® Architecture Based Platforms: Characterizing and ... (Odinot Stanislas)
From Intel's developer conference (IDF), here is a rather nice presentation on so-called "scale-out" storage, with an overview of the various solution providers (slide 6), covering those offering file, block, and object storage, followed by benchmarks of some of them, including Swift, Ceph, and GlusterFS.
NVMe PCIe and TLC V-NAND: It's about Time (Dell World)
With an explosion in data and the relentless growth in demand for information, identifying a much more efficient means of storage has become extremely important. In this session, we will cover the key drivers behind the need for faster and more efficient storage. NVMe, a standardized protocol for PCIe-based storage, is giving users the huge leap in bandwidth required for demanding applications. Samsung, which makes the fastest NVMe SSDs on the market, will cover the benefits enabled by this technology in areas such as fraud prevention and surgical procedures.
The technology behind flash drives – NAND memory – will be spotlighted in this presentation. Memory manufacturers have improved NAND’s value by migrating from single-level-cell to multi-level-cell designs, but the most significant evolution will be a marriage of triple-level-cell and V-NAND flash manufacturing technologies. Samsung will also provide an overview of the prospects for TLC V-NAND with mobile device manufacturers, while examining the strong potential for a much wider TLC V-NAND market in data centers.
Devconf2017 - Can VM networking benefit from DPDK? (Maxime Coquelin)
DPDK brings high-performance, low-latency virtualization networking capabilities thanks to its Vhost/Virtio support. The session will first introduce DPDK and its Vhost/Virtio implementations, giving the audience examples of possible uses and the challenges that need to be addressed to achieve high performance, functionality, and reliability. Then, the Vhost/Virtio improvements introduced in the last DPDK release will be covered, such as receive-path optimizations, Virtio's indirect descriptors support, and transmit zero copy, to name a few. The speakers will explain which problems they aim to address and how they address them, mentioning their limitations.
Finally, the speakers, who are active DPDK's Virtio/Vhost contributors, will expose what new developments are in the pipe to tackle the remaining challenges.
The session will be presented so that DPDK developers and users find useful information on current developments and status. People not familiar with DPDK will get an overview and can exchange ideas with other projects.
Ceph on Intel: Intel Storage Components, Benchmarks, and Contributions (Red_Hat_Storage)
At Red Hat Storage Day Minneapolis on 4/12/16, Intel's Dan Ferber presented on Intel storage components, benchmarks, and contributions as they relate to Ceph.
Linux is usually at the leading edge of implementing new storage standards, and NVMe over Fabrics is no different in this regard. This presentation gives an overview of the Linux NVMe over Fabrics implementation on the host and target sides, highlighting how it influenced the design of the protocol through early prototyping feedback. It also covers the lessons learned while developing NVMe over Fabrics, and how they helped reshape parts of the Linux kernel to better support NVMe over Fabrics and other storage protocols.
This presentation was delivered at LinuxCon Japan 2016 by Christoph Hellwig
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory (Databricks)
The amount of data in big data environments grows rapidly, and more and more memory is consumed either by computation or by holding intermediate data for analytic jobs. For memory-intensive workloads, end users have to scale out the compute cluster or extend memory with storage such as HDDs or SSDs to meet the requirements of their computing tasks. Scaling out the cluster adds management, operation, and maintenance costs, increasing the total cost if the extra CPU resources are not fully utilized. To address this shortcoming, Intel Optane DC persistent memory (Optane DCPM) breaks the traditional memory/storage hierarchy and scales up the computing server with higher-capacity persistent memory, while bringing higher bandwidth and lower latency than storage such as SSDs or HDDs. Apache Spark is widely used for analytics such as SQL and machine learning in cloud environments, where the low performance of remote data access is typically a bottleneck, especially for I/O-intensive queries; for ML workloads, which are iterative, I/O bandwidth is key to end-to-end performance. In this talk, we will introduce how to accelerate Spark SQL with OAP (https://github.com/Intel-bigdata/OAP) to achieve an 8X performance gain on SQL in the cloud, and how to use RDD cache to improve K-means performance with a 2.5X gain, leveraging Intel Optane DCPM. We will also take a deep dive into how Optane DCPM delivers these performance gains.
Speakers: Cheng Xu, Piotr Balcer
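Optane DCPM is byte-addressable: on Linux, persistent memory is typically exposed to applications as a file on a DAX-enabled filesystem that is then memory-mapped, rather than accessed through block I/O. The talk itself includes no code, so the following is only a rough sketch of that programming model; an ordinary temp file stands in here for a real pmem-backed file (a path like /mnt/pmem/data is hypothetical).

```python
import mmap
import os
import tempfile

# An ordinary temp file stands in for a file on a DAX-mounted
# persistent-memory filesystem (e.g. /mnt/pmem/data -- hypothetical path).
path = os.path.join(tempfile.mkdtemp(), "pmem_standin")
SIZE = 4096

with open(path, "wb") as f:
    f.truncate(SIZE)  # reserve the region up front

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), SIZE)  # map the region into our address space
    mm[0:5] = b"hello"                # a plain byte-addressable store, no syscall
    mm.flush()                        # on real pmem this is the persistence point
    assert mm[0:5] == b"hello"        # byte-addressable load
    mm.close()
```

With real Optane DCPM in App Direct mode, libraries such as PMDK wrap this same mmap model with cache-flush instructions instead of msync; the stand-in file only mimics the API shape, not the latency characteristics.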
Optimizing Apache Spark Throughput Using Intel Optane and Intel Memory Drive... (Databricks)
Apache Spark is a popular data processing engine designed to execute advanced analytics on very large data sets which are common in today’s enterprise use cases. To enable Spark’s high performance for different workloads (e.g. machine-learning applications), in-memory data storage capabilities are built right in.
However, Spark’s in-memory capabilities are limited by the memory available in the server; it is common for computing resources to be idle during the execution of a Spark job, even though the system’s memory is saturated. To mitigate this limitation, Spark’s distributed architecture can run on a cluster of nodes, thus taking advantage of the memory available across all nodes. While employing additional nodes would solve the server DRAM capacity problem, it does so at an increased cost. Intel(R) Memory Drive Technology is a software-defined memory (SDM) technology which, combined with an Intel(R) Optane(TM) SSD, expands the system’s memory.
This combination of Intel(R) Optane(TM) SSD with Intel Memory Drive Technology alleviates those memory limitations that are inherent to Spark, by making more memory available to the operating system and to Spark jobs, transparently.
Best Practice of Compression/Decompression Codecs in Apache Spark with Sophia... (Databricks)
Nowadays, people are creating, sharing, and storing data at a faster pace than ever before, and effective data compression/decompression can significantly reduce the cost of data usage. Apache Spark is a general distributed computing engine for big data analytics that stores and shuffles large amounts of data across the cluster at runtime, so the compression/decompression codecs can impact end-to-end application performance in many ways.
However, there’s a trade-off between the storage size and compression/decompression throughput (CPU computation). Balancing the data compress speed and ratio is a very interesting topic, particularly while both software algorithms and the CPU instruction set keep evolving. Apache Spark provides a very flexible compression codecs interface with default implementations like GZip, Snappy, LZ4, ZSTD etc. and Intel Big Data Technologies team also implemented more codecs based on latest Intel platform like ISA-L(igzip), LZ4-IPP, Zlib-IPP and ZSTD for Apache Spark; in this session, we’d like to compare the characteristics of those algorithms and implementations, by running different micro workloads as well as end to end workloads, based on different generations of Intel x86 platform and disk.
This session is intended to give big data software engineers a best practice for choosing the proper compression/decompression codecs for their applications, and we will also present methodologies for measuring and tuning the performance bottlenecks of typical Apache Spark workloads.
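The storage-size vs. CPU trade-off described above is easy to reproduce with standard-library codecs. This sketch uses Python's zlib and lzma purely as stand-ins for Spark's GZip/Snappy/LZ4/ZSTD codecs; the payload is made-up shuffle-like data, and absolute numbers will vary by machine.

```python
import time
import zlib
import lzma

# Made-up, highly compressible payload standing in for Spark shuffle data.
data = b"user_id,item_id,score\n" + b"12345,67890,0.5\n" * 20000

def measure(name, compress):
    """Compress `data`, reporting compression ratio and wall time."""
    t0 = time.perf_counter()
    out = compress(data)
    dt = time.perf_counter() - t0
    print(f"{name}: ratio={len(data) / len(out):.1f}x, time={dt * 1000:.2f} ms")
    return out

fast = measure("zlib level 1", lambda d: zlib.compress(d, 1))
best = measure("zlib level 9", lambda d: zlib.compress(d, 9))
xz = measure("lzma (xz)", lzma.compress)

# More CPU effort generally buys a better ratio -- the trade-off that a
# setting like spark.io.compression.codec lets you pick a point on.
assert len(xz) <= len(best) <= len(fast)
assert zlib.decompress(fast) == data
```

On this kind of repetitive data the slower codec wins on ratio by a wide margin; on already-compressed or random data the ordering of ratios can flatten out while the CPU cost gap remains, which is why measuring with your own workload matters.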
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci (Intel® Software)
Preprocess, visualize, and build AI faster at scale on Intel Architecture. Develop end-to-end AI pipelines for inference, including data ingestion, preprocessing, and model inference with tabular, NLP, RecSys, video, and image data, using the Intel oneAPI AI Analytics Toolkit and other optimized libraries. Build performant at-scale pipelines with Databricks and end-to-end Xeon optimizations. Learn how to visualize with the OmniSci Immerse Platform and experience a live demonstration of the Intel Distribution of Modin and OmniSci.
Spring Hill (NNP-I 1000): Intel's Data Center Inference Chip (inside-BigData.com)
Today at Hot Chips 2019, Intel revealed new details of upcoming high-performance AI accelerators: Intel Nervana neural network processors, with the NNP-T for training and the NNP-I for inference. Intel engineers also presented technical details on hybrid chip packaging technology, Intel Optane DC persistent memory and chiplet technology for optical I/O.
"To get to a future state of ‘AI everywhere,’ we’ll need to address the crush of data being generated and ensure enterprises are empowered to make efficient use of their data, processing it where it’s collected when it makes sense and making smarter use of their upstream resources," said Naveen Rao, Intel vice president and GM, Artificial Intelligence Products Group. "Data centers and the cloud need to have access to performant and scalable general purpose computing and specialized acceleration for complex AI applications. In this future vision of AI everywhere, a holistic approach is needed—from hardware to software to applications.”
Learn more: https://www.intel.ai/accelerating-for-ai/?elq_cid=1192980
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Apache CarbonData & Spark meetup
"QATCodec: past, present and future" is from Intel.
Apache Spark™ is a unified analytics engine for large-scale data processing.
CarbonData is a high-performance data solution that supports various data analytic scenarios, including BI analysis, ad-hoc SQL query, fast filter lookup on detail records, streaming analytics, and so on. CarbonData has been deployed in many enterprise production environments; in one of the largest deployments it supports queries on a single table with 3 PB of data (more than 5 trillion records) with response times of less than 3 seconds!
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence (inside-BigData.com)
In this deck, Johann Lombardi from Intel presents: DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence.
"Intel has been building an entirely open source software ecosystem for data-centric computing, fully optimized for Intel® architecture and non-volatile memory (NVM) technologies, including Intel Optane DC persistent memory and Intel Optane DC SSDs. Distributed Asynchronous Object Storage (DAOS) is the foundation of the Intel exascale storage stack. DAOS is an open source software-defined scale-out object store that provides high bandwidth, low latency, and high I/O operations per second (IOPS) storage containers to HPC applications. It enables next-generation data-centric workflows that combine simulation, data analytics, and AI."
Unlike traditional storage stacks that were primarily designed for rotating media, DAOS is architected from the ground up to make use of new NVM technologies, and it is extremely lightweight because it operates end-to-end in user space with full operating system bypass. DAOS offers a shift away from an I/O model designed for block-based, high-latency storage to one that inherently supports fine-grained data access and unlocks the performance of next-generation storage technologies.
Watch the video: https://youtu.be/wnGBW31yhLM
Learn more: https://www.intel.com/content/www/us/en/high-performance-computing/daos-high-performance-storage-brief.html
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Accelerating Virtual Machine Access with the Storage Performance Development ... (Michelle Holley)
Abstract: Although new non-volatile media inherently offers very low latency, remote access using protocols such as NVMe-oF and presenting the data to VMs via virtualized interfaces such as virtio adds considerable software overhead. One way to reduce the overhead is to use the Storage Performance Development Kit (SPDK), an open-source software project that provides building blocks for scalable and efficient storage applications with breakthrough performance. Comparing the software paths for virtualizing block storage I/O illustrates the advantages of the SPDK-based approach. Empirical data shows that using SPDK can improve CPU efficiency by up to 10x and reduce latency by up to 50% over existing methods. Future enhancements for SPDK will make its advantages even greater.
Speaker Bio: Anu Rao is a product line manager for storage software in Intel's Data Center Group. She helps customers ease into and adopt open source storage software like the Storage Performance Development Kit (SPDK) and the Intelligent Storage Acceleration Library (ISA-L).
Webinar: Dell VRTX - an all-in-one datacenter at a great price / 7 Oct 2013 (Jaroslav Prodelal)
Can you imagine running your datacenter in an office environment? Yes, it is possible. Dell has introduced a so-called datacenter-in-a-box (all-in-one) product that is optimized (noise reduction, power) to run even in an office; of course, you can also put it in a separate room.
The Dell VRTX combines compute power (up to four 2-CPU servers), disk storage (up to 24 HDDs), and networking in a single 5U chassis.
In the webinar we will introduce this very attractively priced product and show how it differs from the alternative of separate servers, a disk array, and network switches.
Agenda:
* What is Dell VRTX?
* Customer segments for VRTX
* What VRTX offers
* Solutions running on VRTX
* Technical specifications
* Possible uses
* Price
* Current offers and promotions
Intel’s Big Data and Hadoop Security Initiatives - StampedeCon 2014 (StampedeCon)
At StampedeCon 2014, Todd Speck (Intel) presented "Intel’s Big Data and Hadoop Security Initiatives."
In this talk, we will cover various aspects of software and hardware initiatives that Intel is contributing to Hadoop as well as other aspects of our involvement in solutions for Big Data and Hadoop, with a special focus on security. We will discuss specific security initiatives as well as our recent partnership with Cloudera. You should leave the session with a clear understanding of Intel’s involvement and contributions to Hadoop today and coming in the near future.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work, along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis from the DASA Connect conference on 30 May 2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We also held a lovely workshop in which participants explored different ways to think about quality and testing in the various parts of the DevOps infinity loop.
GraphRAG is All You Need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Essentials of Automations: Optimizing FME Workflows with Parameters (Safe Software)
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Accelerate your Kubernetes clusters with Varnish Caching (Thijs Feryn)
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Key Trends Shaping the Future of Infrastructure.pdf (Cheryl Hung)
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
This keynote covers the key trends across hardware, cloud, and open source, exploring how these areas are likely to mature and develop over the short and long term, and considering how organisations can position themselves to adapt and thrive.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
Connector Corner: Automate dynamic content and events by pushing a button (DianaGray10)
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
UiPath Test Automation using UiPath Test Suite series, part 3 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
- UI automation introduction
- UI automation sample
- Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... (DanBrown980551)
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... (UiPathCommunity)
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... (Ramesh Iyer)
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. Fostering a culture of innovation, however, takes much work: it takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
JMeter webinar - integration with InfluxDB and Grafana (RTTS)
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring of JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
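Under the hood, JMeter's InfluxDB Backend Listener reports by writing metric lines to InfluxDB's HTTP write endpoint in the line protocol. As a hedged sketch of that wire format only, here is a tiny formatter; the measurement, tag, and field names are illustrative, not JMeter's exact schema.

```python
def to_line_protocol(measurement, tags, fields, ts_ns):
    """Format one InfluxDB line-protocol point:
    measurement,tag1=v1,... field1=v1,... timestamp_ns"""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

line = to_line_protocol(
    "jmeter",
    {"application": "webapp", "transaction": "login"},
    {"count": 42, "avg": 187.5},
    1700000000000000000,
)
print(line)

# A real setup would POST batches of such lines to an endpoint like
# http://<influx-host>:8086/write?db=jmeter, and Grafana would then
# query that database to draw the dashboards shown in the webinar.
```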
Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Efficiency with Storage Disaggregation
1. Unlocking Big Data Infrastructure Efficiency with Storage Disaggregation
Anjaneya “Reddy” Chagam, Chief SDS Architect, Data Center Group, Intel Corporation
3. Challenges for Cloud Service Providers (Intel Confidential)
Tier-2 cloud service providers (CSPs) must meet the demands of fast data growth while driving differentiation and value-added services:
- Nearly continuous acquisition of storage is needed.
- Petabyte-scale data footprints are common.
- A >35-percent annual rate of storage growth is expected.¹
- Inefficiencies of storage acquisition are magnified over time.
¹ IDC. “Extracting Value from Chaos.” Sponsored by EMC Corporation. June 2011. emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
4. Challenges with Scaling Apache Hadoop* Storage
Native Hadoop storage and compute can’t be scaled independently; both storage and compute resources are bound to Hadoop nodes:
- Excess compute capacity: when more storage is needed, IT ends up with more compute than it needs.
- Inefficient resource allocation and IT spending result.
- Inefficiencies are highly consequential for large firms such as tier-2 CSPs.
5. Challenges with Scaling Apache Hadoop* Storage
Native Hadoop storage can be used only for Hadoop workloads, so additional storage is needed for non-big-data workloads:
- Greater investments are required for other workloads: higher IT costs.
- Multiple storage environments are needed: low storage-capacity utilization across workloads.
- No multi-tenancy support in Hadoop: decreased operational agility.
- Lack of a central, unified storage technology: data must be replicated from other storage environments and applications to the Hadoop cluster on a regular basis, resulting in unsustainable “data islands” that increase total cost of ownership (TCO) and reduce decision agility.
6. Solution: Apache Hadoop* with Ceph*
Use Ceph instead of local, direct-attached hard drives for back-end storage:
- Disaggregate Hadoop storage and compute.
- Ceph is open source and scalable.
- Ceph enables storage for all data types.
Optimize performance with Intel® technologies:
- Intel® Xeon® processors
- Intel network solutions
- Intel® Cache Acceleration Software (Intel® CAS)
- Intel® Solid-State Drives (SSDs) using high-speed Non-Volatile Memory Express* (NVMe*)
Results:
- Compute and storage scale separately.
- Unified storage for all enterprise needs.
- Increased organizational agility.
- More efficient use of IT resources.
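The test environment later in the deck shows each data node consuming twelve 6 TB RBD volumes (blkdev<Host#>_{0..11}) in place of local drives. As a hedged sketch of what provisioning one such host involves, this helper only assembles the standard rbd/mkfs/mount command strings for a dry run; the pool name and mount points are hypothetical, and a real deployment would also handle /etc/fstab, ownership, and the datanode's dfs.datanode.data.dir configuration.

```python
def rbd_provision_cmds(host_id, num_vols=12, size_gb=6144, pool="hadoop"):
    """Assemble the shell commands to create, map, format, and mount the
    RBD volumes that replace a data node's direct-attached drives.
    Dry run only: the commands are returned as strings, not executed."""
    cmds = []
    for i in range(num_vols):
        vol = f"blkdev{host_id}_{i}"
        cmds += [
            f"rbd create {pool}/{vol} --size {size_gb}G",   # carve the volume
            f"rbd map {pool}/{vol}",                        # expose as /dev/rbd*
            f"mkfs.xfs /dev/rbd/{pool}/{vol}",              # local filesystem
            f"mkdir -p /hadoop/{vol} && mount /dev/rbd/{pool}/{vol} /hadoop/{vol}",
        ]
    return cmds

for cmd in rbd_provision_cmds(11)[:4]:   # show the first volume's steps
    print(cmd)
```

Each mounted /hadoop/blkdev... directory would then be listed as a DataNode data directory, so HDFS sees ordinary local disks while the bytes actually live in the Ceph cluster.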
7. Advantages of Ceph* Storage vs. Local Storage
- Open source; free (if self-supported)
- Supports all data types: file, block, and object
- Supports many different workloads and applications
- Works on commodity hardware
- Provides one centralized, standardized, and scalable storage solution for all enterprise needs
8. Intel Confidential 8
Apache Hadoop* with Ceph* Storage: Logical Architecture
Workloads (SQL, in-memory, MapReduce, NoSQL, stream, search, custom) run on HDFS + YARN, which sits on Ceph-backed storage.
Deployment Options
• Hadoop Services: virtual, container, or bare metal
• Storage Integration: Ceph block, file, or object
• Data Protection: HDFS and/or Ceph replication or erasure codes
• Tiering: HDFS and/or Ceph tiering
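With the block-integration option, each HDFS data directory sits on a mounted RBD image instead of a local disk. A minimal sketch, assuming a pool named `hadoop` and illustrative image, device, and mount names (none of these specifics appear in the deck):

```shell
# Provision one 6 TB RBD image for a Hadoop data node (size given in MB).
rbd create hadoop/blkdev23_0 --size 6291456
# Map it on the data node; the kernel assigns a device such as /dev/rbd0
# (the exact name can vary per host).
rbd map hadoop/blkdev23_0
# Put a local filesystem on it and mount where HDFS expects its data dir.
mkfs.xfs /dev/rbd0
mkdir -p /hadoop/data0
mount /dev/rbd0 /hadoop/data0
# Finally, list the mount point in dfs.datanode.data.dir so the DataNode
# stores its blocks on Ceph-backed storage.
```

The same pattern is repeated per volume per data node; HDFS then treats the RBD-backed directories exactly like local disks.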
10. Intel Confidential 10
QCT Test Lab Environment (Cloudera Hadoop 5.7.0 & Ceph Jewel 10.2.1/FileStore)
[Diagram: cluster topology, summarized]
• Hadoop side: RMS32 management node (AP, ES, HM, SM, SNN, B, JHS, RM, S); Hadoop24 name node (NN, G, S); Hadoop11-14 and Hadoop21-23 data nodes (DN, G, NM). Each data node mounts 6 TB RBD volumes as blkdev<Host#>_{0..11} (110 RBD vols). Hadoop nodes boot from HDDs that also hold CDH.
• Ceph side: StarbaseMON41-43 monitors; Starbase51-54 and Starbase55-57 OSD nodes, each running OSD 1-24 on 6 TB HDDs with two NVMe drives (nvme0n1, nvme1n1) for the Ceph journal and Intel CAS cache. OSD nodes boot from SSDs; on two of them the boot SSD also hosts a co-located monitor (Boot & Mon).
• Networks: 10.10.241.0/24 and 10.10.242.0/24 public (p10p1); 10.10.150.0/24 private/cluster (p10p2); 10.10.100.0/24 and 10.10.200.0/24 private/cluster on the Ceph side (p2p1), with bonded bond0 (p255p1+p255p2) interfaces on the storage nodes.
NOTE: BMC management network is not shown. HDFS replication 1, Ceph replication 2
*Other names and brands may be claimed as the property of others.
11. Intel Confidential 11
Intel CAS and Ceph Journal Configuration
Each storage node’s two NVMe drives (NVMe1, NVMe2) are shared between journaling and caching, arranged crosswise: one NVMe holds the Ceph journals for HDD1-12 and the Intel CAS cache for HDD13-24, while the other holds the journals for HDD13-24 and the cache for HDD1-12. Reads are served from the CAS cache; writes go through the Ceph journal.
• Ceph Journal[1-24]: 20 GB each, 480 GB in total
• Intel CAS[1-4]: 880 GB each, ~3,520 GB in total
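The partition budget above can be sanity-checked with a little arithmetic, assuming each NVMe drive is one of the 2 TB Intel SSD DC P3700 devices listed in the configuration (so the cache total is ~3,520 GB, not TB):

```shell
# Sanity-check the NVMe partition budget from the slide (sizes in GB).
nvme_gb=2000                       # usable size per 2 TB NVMe (assumed)
journals_per_nvme=12
journal_gb=20
journal_total=$((2 * journals_per_nvme * journal_gb))   # all 24 journals
cache_total=$((4 * 880))                                # 4 CAS partitions
echo "journal total: ${journal_total} GB"               # 480 GB
echo "cache total:   ${cache_total} GB"                 # 3520 GB
# Leftover per NVMe after journals, split across 2 cache partitions:
echo "per-cache: $(( (nvme_gb - journals_per_nvme * journal_gb) / 2 )) GB"
```

The leftover-space calculation lands on 880 GB per cache partition, matching the slide.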
12. Validated Solution: Apache Hadoop* with Ceph* Storage
A highly performant proof-of-concept (POC) has been built by Intel and QCT.2
• Disaggregate storage and compute in Hadoop by using Ceph storage instead of direct-attached storage (DAS)
• Optimize performance with Intel® CAS and Intel® SSDs using NVMe*
• Resolve input/output (I/O) bottlenecks
• Provide better customer service-level-agreement (SLA) support
• Provide up to a 60-percent I/O performance improvement2
HDFS replication 1, Ceph replication 2
2 Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit intel.com/performance. For more information, see Legal Notices and Disclaimers.
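The POC’s I/O numbers come from the industry-standard Hadoop benchmarks named in the Legal Notices (TeraGen*, TeraSort*, TeraValidate*, and DFSIO*). A hedged sketch of a typical run on a CDH 5.7-era cluster; the jar locations, row count, and file sizing here are illustrative, not the deck’s exact settings:

```shell
# TeraGen writes synthetic rows (100 bytes each), TeraSort sorts them,
# TeraValidate checks the result; jar path assumes a CDH parcel layout.
EXAMPLES=/opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-examples.jar
hadoop jar "$EXAMPLES" teragen 10000000000 /bench/tera-in     # ~1 TB input
hadoop jar "$EXAMPLES" terasort /bench/tera-in /bench/tera-out
hadoop jar "$EXAMPLES" teravalidate /bench/tera-out /bench/tera-report

# TestDFSIO measures HDFS throughput; the tests jar ships alongside the
# MapReduce client libraries (exact filename varies by release).
hadoop jar hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO \
    -write -nrFiles 64 -size 1GB
```

Comparing wall-clock times and DFSIO throughput between a baseline run and a run with Intel CAS caching enabled is how an improvement like the quoted 60 percent would be measured.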
13. Intel Confidential 13
Benefits of the Apache Hadoop* with Ceph* Solution
• Multi-protocol storage support
• Independent scaling of storage and compute
• Enhanced organizational agility
• Decreased capital expenditures (CapEx)
• No loss in performance
• Can use resources for any workload
14. Intel Confidential 14
Find Out More
To learn more about Intel® CAS and request a trial copy, visit: intel.com/content/www/us/en/software/intel-cache-acceleration-software-performance.html
To find the Intel® SSD that’s right for you, visit: intel.com/go/ssd
To learn about QCT QxStor* Red Hat* Ceph* Storage Edition, visit: qct.io/solution/software-defined-infrastructure/storage-virtualization/qxstor-red-hat-ceph-storage-edition-p365c225c226c230
17. Intel Confidential 17
Legal Notices and Disclaimers
1 IDC. “Extracting Value from Chaos.” Sponsored by EMC Corporation. June 2011.
emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf.
2 Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as
SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors
may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases,
including the performance of that product when combined with other products.
Configurations:
• Ceph* storage nodes, each server: 16 Intel® Xeon® processor E5-2680 v3, 128 GB RAM, twenty-four 6 TB Seagate Enterprise* hard drives, and two 2 TB
Intel® Solid-State Drive (SSD) DC P3700 NVMe* drives with 10 gigabit Ethernet (GbE) Intel® Ethernet Converged Network Adapter X540-T2 network cards,
20 GbE public network, and 40 GbE private Ceph network.
• Apache Hadoop* data nodes, each server: 16 Intel Xeon processor E5-2620 v3 single socket, 128 GB RAM, with 10 GbE Intel Ethernet Converged Network
Adapter X540-T2 network cards, bonded.
The difference between the version with Intel® Cache Acceleration Software (Intel® CAS) and the baseline is that in the baseline Intel CAS is not caching and runs in pass-through mode; the change is software only, so no hardware changes are needed. The tests used were TeraGen*, TeraSort*, TeraValidate*, and DFSIO*, which are the industry-standard Hadoop performance tests. For more complete information, visit intel.com/performance.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced website and confirm
whether referenced data are accurate.
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel
microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability,
functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are
intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the
applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice Revision #20110804
19.
Intel’s role in storage
• Advance the industry: open source & standards
• Build an open ecosystem: Intel® Storage Builders, 73+ partners; cloud & enterprise partner storage solution architectures; end-user solutions for cloud and enterprise
• Intel technology leadership:
• Storage-optimized platforms: Intel® Xeon® E5-2600 v4 platform, Intel® Xeon® processor D-1500 platform, Intel® Converged Network Adapters 10/40GbE, Intel® SSDs for DC & cloud
• Storage-optimized software: Intel® Intelligent Storage Acceleration Library, Storage Performance Development Kit, Intel® Cache Acceleration Software
• SSD & non-volatile memory – interfaces: SATA, NVMe PCIe; form factors: 2.5”, M.2, U.2, PCIe AIC; new technologies: 3D NAND, Intel® Optane™
• Next-gen solution architectures: Intel solution architects have deep expertise in Ceph for low-cost and high-performance usage, helping customers enable a modern storage infrastructure
21. Intel Confidential
3D XPoint™ and Ceph
First 3D XPoint use cases for BlueStore:
§ BlueStore backend, RocksDB backend, RocksDB WAL
Two methods for accessing PMEM devices:
§ Raw PMEM block device (libpmemblk)
§ DAX-enabled file system (mmap + libpmemlib)
[Diagram: BlueStore stack – RocksDB over BlueFS, with metadata and data on PMEMDevices, accessed either through the libpmemblk file API or through a DAX-enabled file system via mmap load/store (libpmemlib)]
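The second access method above (DAX-enabled file system) can be sketched from the host side: a persistent-memory region shows up as a pmem block device, and mounting its filesystem with `-o dax` lets `mmap` loads and stores reach the media without the page cache. Device and mount-point names here are illustrative and assume the kernel pmem driver:

```shell
# Create a filesystem on the persistent-memory block device. A 4 KiB
# block size keeps ext4 blocks aligned with memory pages, which DAX needs.
mkfs.ext4 -b 4096 /dev/pmem0

# Mount with the dax option so file mmaps map the media directly.
mkdir -p /mnt/pmem
mount -o dax /dev/pmem0 /mnt/pmem

# Verify the dax option took effect.
mount | grep /mnt/pmem
```

BlueFS/RocksDB files placed under such a mount would then be accessed via mmap load/store, as in the diagram’s right-hand path.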
22. Intel Confidential
3D NAND – Ceph cost-effective solution
Enterprise-class, highly reliable, feature-rich, and cost-effective AFA solution
§ NVMe SSD is today’s SSD, and 3D NAND or TLC SSD is today’s HDD
– NVMe as journal, high-capacity SATA SSD or 3D NAND SSD as data store
– Provides high performance and high capacity in a more cost-effective solution
– 1M 4K random-read IOPS delivered by 5 Ceph nodes
– Cost-effective: roughly 1,000 HDD Ceph nodes (10K HDDs) would be needed to deliver the same throughput
– High capacity: 100 TB in 5 nodes
§ With special software optimization on the FileStore and BlueStore backends
[Diagram: example node configurations – a Ceph node with four 1.6 TB Intel SSD DC S3510 SATA NAND data drives plus a P3700 M.2 800 GB NVMe journal; and a Ceph node with five 4 TB P3520 NVMe 3D NAND data drives plus P3700 & 3D XPoint™ NVMe SSDs]
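A random-read figure like the 1M 4K IOPS above is typically measured against RBD images with fio’s rbd engine. A hedged sketch; the pool/image names, client name, and job sizing are illustrative, not the deck’s test parameters:

```shell
# 4 KiB random-read load against an RBD image using fio's librbd engine.
# Assumes a pool "rbd" containing a pre-filled image "bench" and a
# cephx client named "admin".
fio --name=4k-randread --ioengine=rbd --clientname=admin \
    --pool=rbd --rbdname=bench \
    --rw=randread --bs=4k --direct=1 \
    --iodepth=32 --numjobs=8 --runtime=60 --time_based \
    --group_reporting
```

Cluster-wide IOPS is then the sum across client hosts; several such fio clients are usually needed before five all-flash Ceph nodes become the bottleneck.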
26. Intel Confidential
Test Setup (Hadoop)
Parameter | Value | Comment
Container Memory, yarn.nodemanager.resource.memory-mb | 80.52 GiB | Default: amount of physical memory, in MiB, that can be allocated for containers. NOTE: In a different document, it recommends
Container Virtual CPU Cores, yarn.nodemanager.resource.cpu-vcores | 48 | Default: number of virtual CPU cores that can be allocated for containers.
Container Memory Maximum, yarn.scheduler.maximum-allocation-mb | 12 GiB | The largest amount of physical memory, in MiB, that can be requested for a container.
Container Virtual CPU Cores Maximum, yarn.scheduler.maximum-allocation-vcores | 48 | Default: the largest number of virtual CPU cores that can be requested for a container.
Container Virtual CPU Cores Minimum, yarn.scheduler.minimum-allocation-vcores | 2 | The smallest number of virtual CPU cores that can be requested for a container. If using the Capacity or FIFO scheduler (or any scheduler, prior to CDH 5), virtual core requests are rounded up to the nearest multiple of this number.
JobTracker MetaInfo Maxsize, mapreduce.job.split.metainfo.maxsize | 1000000000 | The maximum permissible size of the split metainfo file. The JobTracker won’t attempt to read split metainfo files bigger than the configured value. No limit if set to -1.
I/O Sort Memory Buffer, mapreduce.task.io.sort.mb | 400 MiB | To enable a larger block size without spills.
yarn.scheduler.minimum-allocation-mb | 2 GiB | Default: minimum container size.
mapreduce.map.memory.mb | 1 GiB | Memory required for each map container; may want to increase for some apps.
mapreduce.reduce.memory.mb | 1.5 GiB | Memory required for each reduce container; may want to increase for some apps.
mapreduce.map.cpu.vcores | 1 | Default: number of vcores required for each map container.
mapreduce.reduce.cpu.vcores | 1 | Default: number of vcores required for each reduce container.
mapreduce.job.heap.memory-mb.ratio | 0.8 | Default. Sets Java heap size to 800/1200 MiB for mapreduce.{map|reduce}.memory.mb = 1/1.5 GiB.
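As a quick sanity check on the settings above, the number of concurrent map containers a node manager can run is bounded by both its memory and its vcore budget; with the table’s values (~80 GiB of container memory, 48 vcores, 1 GiB and 1 vcore per map container) the vcore budget is the binding limit:

```shell
# Concurrent 1 GiB / 1 vcore map containers per node manager, using the
# table's values (node memory rounded down to whole GiB).
node_mem_gib=80     # yarn.nodemanager.resource.memory-mb (~80.52 GiB)
node_vcores=48      # yarn.nodemanager.resource.cpu-vcores
map_mem_gib=1       # mapreduce.map.memory.mb
map_vcores=1        # mapreduce.map.cpu.vcores
by_mem=$((node_mem_gib / map_mem_gib))
by_cpu=$((node_vcores / map_vcores))
# YARN can run only as many containers as the tighter of the two limits.
echo "containers: $(( by_mem < by_cpu ? by_mem : by_cpu ))"   # 48
```

So on this hardware each data node tops out at 48 simultaneous map tasks, set by the vcore count rather than memory.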
27. Intel Confidential
Test Setup (Hadoop)
Parameter | Value | Comment
dfs.blocksize | 128 MiB | Default.
dfs.replication | 1 | Default block replication. The number of replications to make when the file is created. The default value is used if a replication number is not specified.
Java Heap Size of NameNode in Bytes | 4127 MiB | Default: maximum size in bytes for the Java process heap memory. Passed to Java -Xmx.
Java Heap Size of Secondary NameNode in Bytes | 4127 MiB | Default: maximum size in bytes for the Java process heap memory. Passed to Java -Xmx.
Memory overcommit validation threshold | 0.9 | Threshold used when validating the allocation of RAM on a host. 0 means all of the memory is reserved for the system; 1 means none is reserved. Values range from 0 to 1.
28. Intel Confidential
Test Setup (CAS NVMe, Journal NVMe)
NVMe0n1
• Ceph journals for the first 12 HDDs: /dev/nvme0n1p1 - /dev/nvme0n1p12, each partition 20 GiB.
• Intel CAS cache for HDDs 13-24 (e.g., /dev/sdo - /dev/sdz) comes from this SSD: the rest of the free space is split evenly into two cache partitions.
cache 1 /dev/nvme0n1p13 Running wo -
├core 1 /dev/sdo1 - - /dev/intelcas1-1
├core 2 /dev/sdp1 - - /dev/intelcas1-2
├core 3 /dev/sdq1 - - /dev/intelcas1-3
├core 4 /dev/sdr1 - - /dev/intelcas1-4
├core 5 /dev/sds1 - - /dev/intelcas1-5
└core 6 /dev/sdt1 - - /dev/intelcas1-6
cache 2 /dev/nvme0n1p14 Running wo -
├core 1 /dev/sdu1 - - /dev/intelcas2-1
├core 2 /dev/sdv1 - - /dev/intelcas2-2
├core 3 /dev/sdw1 - - /dev/intelcas2-3
├core 4 /dev/sdx1 - - /dev/intelcas2-4
├core 5 /dev/sdy1 - - /dev/intelcas2-5
└core 6 /dev/sdz1 - - /dev/intelcas2-6
NVMe1n1
• Ceph journals for the remaining 12 HDDs: /dev/nvme1n1p1 - /dev/nvme1n1p12, each partition 20 GiB.
• Intel CAS cache for HDDs 1-12 (e.g., /dev/sdc - /dev/sdn) comes from this SSD: the rest of the free space is split evenly into two cache partitions.
cache 1 /dev/nvme1n1p13 Running wo -
├core 1 /dev/sdc1 - - /dev/intelcas1-1
├core 2 /dev/sdd1 - - /dev/intelcas1-2
├core 3 /dev/sde1 - - /dev/intelcas1-3
├core 4 /dev/sdf1 - - /dev/intelcas1-4
├core 5 /dev/sdg1 - - /dev/intelcas1-5
└core 6 /dev/sdh1 - - /dev/intelcas1-6
cache 2 /dev/nvme1n1p14 Running wo -
├core 1 /dev/sdi1 - - /dev/intelcas2-1
├core 2 /dev/sdj1 - - /dev/intelcas2-2
├core 3 /dev/sdk1 - - /dev/intelcas2-3
├core 4 /dev/sdl1 - - /dev/intelcas2-4
├core 5 /dev/sdm1 - - /dev/intelcas2-5
└core 6 /dev/sdn1 - - /dev/intelcas2-6
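A listing like the one above is built with the Intel CAS (now Open CAS) `casadm` admin tool. A hedged sketch of creating the first cache instance; the cache mode, device names, and cache ID mirror the listing, but exact flag behavior can vary between CAS releases:

```shell
# Start cache instance 1 on the NVMe cache partition in "wo" mode
# (the mode shown in the listing above):
#   -S start cache, -i cache ID, -d cache device, -c cache mode
casadm -S -i 1 -d /dev/nvme0n1p13 -c wo

# Attach the six HDD partitions as core devices of cache 1; each gains
# an /dev/intelcas1-N cached block device.
for core in /dev/sdo1 /dev/sdp1 /dev/sdq1 /dev/sdr1 /dev/sds1 /dev/sdt1; do
    casadm -A -i 1 -d "$core"
done

# Show the resulting cache/core topology (the listing format above).
casadm -L
```

The OSDs are then created on the `/dev/intelcas*` devices instead of the raw HDD partitions, so all OSD I/O passes through the cache.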