Storage is one of the most important parts of a data center, and the complexity of designing, building and delivering an always-available ("24/forever") service continues to increase every year. One of the best solutions to these problems is a distributed file system (DFS). This talk describes the basic architectures of DFSs and compares different free software solutions in order to show what makes a DFS suitable for large-scale distributed environments. We explain how to use and deploy each solution, along with its advantages and disadvantages, performance and layout. We also introduce some case studies on implementations based on OpenAFS, GlusterFS and Hadoop, aimed at building your own cloud storage.
2. Agenda
Introduction
Next Generation Data Center
Distributed File system
OpenAFS
GlusterFS
HDFS
Ceph
Case Studies
Conclusion
16/02/2012
3. Class Exam
What do you know about DFS?
How can you create petabyte-scale storage?
How can you build a centralized system log?
How can you allocate space for your users or systems when you have thousands of them?
How can you retrieve data from everywhere?
4. Introduction
Next Generation Data Center: the "FABRIC"
Key categories:
Continuous data protection and disaster recovery
File and block data migration across heterogeneous environments
Server and storage virtualization
Encryption for data in-flight and at-rest
In other words: Cloud data center
5. Introduction
Storage Tier in the "FABRIC"
High Performance
Scalability
Simplified Management
Security
High Availability
Solutions
Storage Area Network
Network Attached Storage
Distributed file system
6. Introduction
What is a Distributed File System?
"A distributed file system takes advantage of the interconnected nature of the network by storing files on more than one computer in the network and making them accessible to all of them."
7. Introduction
What do you expect from a distributed file system?
• Uniform access: global file name support
• Security: global authentication and authorization
• Reliability: elimination of every single point of failure
• Availability: administrators can perform routine maintenance while the file server is in operation, without disrupting users' routines
• Scalability: handle terabytes of data
• Standard conformance: support for some IEEE POSIX file system semantics
• Performance: high performance
9. OpenAFS: introduction
OpenAFS is the open source implementation of the Andrew File System (AFS) of IBM.
Key ideas:
Make clients do work whenever possible.
Cache whenever possible.
Exploit file usage properties. Understand them. One-third of Unix files are temporary.
Minimize system-wide knowledge and change. Do not hardwire locations.
Trust the fewest possible entities. Do not trust workstations.
Batch if possible to group operations.
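The "make clients do work" and "cache whenever possible" ideas can be sketched in a few lines of Python. This is a toy model and not OpenAFS code: the class names `FileServer` and `CachingClient`, the example path, and the invalidation mechanism shown here are all assumptions made for illustration.

```python
class FileServer:
    """Toy file server that remembers which clients cache which file."""

    def __init__(self):
        self.files = {}        # path -> file contents
        self.callbacks = {}    # path -> set of clients holding a cached copy
        self.reads = 0         # number of fetches actually served

    def fetch(self, path, client):
        self.reads += 1
        self.callbacks.setdefault(path, set()).add(client)
        return self.files[path]

    def store(self, path, data, writer=None):
        self.files[path] = data
        # Tell every other client that its cached copy is now stale.
        for client in self.callbacks.pop(path, set()):
            if client is not writer:
                client.invalidate(path)


class CachingClient:
    """Toy client that only contacts the server on a cache miss."""

    def __init__(self, server):
        self.server = server
        self.cache = {}

    def read(self, path):
        if path not in self.cache:
            self.cache[path] = self.server.fetch(path, self)
        return self.cache[path]

    def invalidate(self, path):
        self.cache.pop(path, None)


server = FileServer()
server.store("/afs/example.org/readme", b"v1")   # hypothetical path
client = CachingClient(server)
client.read("/afs/example.org/readme")
client.read("/afs/example.org/readme")           # served from the local cache
reads_before_update = server.reads
server.store("/afs/example.org/readme", b"v2")   # invalidates the cached copy
latest = client.read("/afs/example.org/readme")  # fetched again from the server
```

In this sketch, repeated reads cost the server nothing until the file changes, which is the property that lets one server handle many clients.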
11. OpenAFS: components
Cell
• A cell is a collection of file servers and workstations
• The directories under /afs are cells, forming a unique tree
• A file server contains volumes
Volumes
• Volumes are "containers" or sets of related files and directories
• They have a size limit
• Three types: rw, ro, backup
Mount Point Directory
• Access to a volume is provided through a mount point
• A mount point is just like a static directory
(Diagram labels: Server A, Server C, Server A+B)
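The relationship between paths, mount points, volumes and servers can be sketched as a pair of lookup tables. The table contents, the cell name `example.org`, the server names and the function `resolve` are hypothetical; this is a simplified illustration, not how the OpenAFS client is implemented.

```python
# Hypothetical volume location table: volume name -> server hosting it.
VOLUME_LOCATION = {
    "root.cell": "server-a",
    "user.alice": "server-b",
    "proj.docs": "server-c",
}

# Hypothetical mount points: directory path -> volume mounted there.
MOUNT_POINTS = {
    "/afs/example.org": "root.cell",
    "/afs/example.org/user/alice": "user.alice",
    "/afs/example.org/proj/docs": "proj.docs",
}


def resolve(path):
    """Return (volume, server) for the longest mount point that prefixes path."""
    best = None
    for mount in MOUNT_POINTS:
        if path == mount or path.startswith(mount + "/"):
            if best is None or len(mount) > len(best):
                best = mount
    if best is None:
        raise FileNotFoundError(path)
    volume = MOUNT_POINTS[best]
    return volume, VOLUME_LOCATION[volume]


alice_file = resolve("/afs/example.org/user/alice/notes.txt")
cell_file = resolve("/afs/example.org/tmp")
```

Because clients see only paths, moving a volume to another server changes one entry in the location table and leaves every path unchanged.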
13. OpenAFS: features
Uniform name space: same path on all workstations
Security: based on krb4/krb5, extended ACLs, traffic encryption
Reliability: read-only replication, HA database, read/write replicas in the OSD version
Availability: maintenance tasks without stopping the service
Scalability: server aggregation
Administration: administration delegation
Performance: client-side disk-based persistent cache, high ratio of clients per server
14. OpenAFS: who uses it?
Morgan Stanley IT
• Internal usage
• Storage: 450 TB (ro) + 15 TB (rw)
• Clients: 22,000
Pictage, Inc
• Online picture album
• Storage: 265 TB (planned growth to 425 TB in twelve months)
• Volumes: 800,000
• Files: 200,000,000
Embian
• Internet shared folder
• Storage: 500 TB
• Servers: 200 storage servers, 300 app servers
RZH
• Internal usage: 210 TB
15. OpenAFS: good for ...
Good
• Wide Area Network
• Heterogeneous System
• Read operation > write operation
• Large number of clients/systems
• Usage directly by end-users
• Federation
Bad
• Locking
• Database
• Unicode
• Large File
• Some limitations on ..
16. GlusterFS
"Gluster can manage data in a single global namespace on commodity hardware."
Keys:
Lower storage cost: open source software runs on commodity hardware
Scalability: linearly scales to hundreds of petabytes
Performance: no metadata server means no bottlenecks
High availability: data mirroring and real-time self-healing
Virtual storage for virtual servers: simplifies storage and keeps VMs always-on
Simplicity: complete web-based management suite
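The "no metadata server" point can be illustrated with hash-based placement: every client computes a file's location from its path alone, so no lookup service is involved. The sketch below uses plain modulo hashing, which is an assumption chosen for brevity and not GlusterFS's actual placement algorithm; the brick names and the function `brick_for` are hypothetical.

```python
import hashlib

# Hypothetical list of storage bricks known to every client.
BRICKS = ["server1:/export", "server2:/export", "server3:/export"]


def brick_for(path, bricks=BRICKS):
    """Pick a brick from the file path alone, so no lookup service is needed."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(bricks)
    return bricks[index]


first_lookup = brick_for("/videos/trailer.mp4")
second_lookup = brick_for("/videos/trailer.mp4")
```

Any two clients with the same brick list reach the same answer independently. A drawback of this simplified variant is that changing the number of bricks remaps most files, which is why real systems use more elaborate schemes.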
18. GlusterFS: components
Volume
• Volume is the basic element for data export
• The volumes can be stacked for extension
Capabilities
• Specific options (features) can be enabled for each volume (cache, pre-fetch, etc.)
• Simple creation of custom extensions with the API interface
Services
• Access to a volume is provided through services like tcp, unix socket, infiniband

Example volume file:

    volume posix1
      type storage/posix
      option directory /home/export1
    end-volume

    volume brick1
      type features/posix-locks
      option mandatory
      subvolumes posix1
    end-volume

    volume server
      type protocol/server
      option transport-type tcp
      option transport.socket.listen-port 6996
      subvolumes brick1
      option auth.addr.brick1.allow *
    end-volume
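How stacked volumes form a chain can be seen by walking the `subvolumes` references in a volume file. The sketch below parses a trimmed copy of the slide's example; the functions `parse_volfile` and `stack` are hypothetical helpers written for this illustration and are not part of GlusterFS.

```python
# Trimmed copy of the volume file shown on the slide.
VOLFILE = """
volume posix1
  type storage/posix
  option directory /home/export1
end-volume

volume brick1
  type features/posix-locks
  subvolumes posix1
end-volume

volume server
  type protocol/server
  option transport-type tcp
  subvolumes brick1
end-volume
"""


def parse_volfile(text):
    """Parse volume blocks into {name: {"type", "options", "subvolumes"}}."""
    volumes, current = {}, None
    for raw in text.splitlines():
        words = raw.split()
        if not words:
            continue
        if words[0] == "volume":
            current = {"type": None, "options": {}, "subvolumes": []}
            volumes[words[1]] = current
        elif words[0] == "type":
            current["type"] = words[1]
        elif words[0] == "option":
            current["options"][words[1]] = " ".join(words[2:])
        elif words[0] == "subvolumes":
            current["subvolumes"] = words[1:]
        elif words[0] == "end-volume":
            current = None
    return volumes


def stack(volumes, top):
    """List translator types from the top volume down to the storage layer."""
    chain = [volumes[top]["type"]]
    for child in volumes[top]["subvolumes"]:
        chain.extend(stack(volumes, child))
    return chain


layers = stack(parse_volfile(VOLFILE), "server")
```

Here `layers` lists the network service first, then the locking feature, then the on-disk storage, mirroring the order in which a request passes through the stack.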
21. Gluster: characteristics
Uniform name space: same path on all workstations
Reliability: read-1 replication, asynchronous replication for disaster recovery
Availability: no system downtime for maintenance (better in the next release)
Scalability: truly linear scalability
Administration: self-healing, centralized logging and reporting, appliance version
Performance: stripe files across dozens of storage blocks, automatic load balancing, per-volume I/O tuning
22. Gluster: who uses it ?
Avail TVN (USA)
400 TB for video on demand, video storage
Fido Film (Sweden)
Visual FX and animation studio
University of Minnesota (USA)
142 TB supercomputing
Partners Healthcare (USA)
336 TB integrated health system
Origo (Switzerland)
Open source software development and collaboration platform
23. Gluster: good for ...
Good
• Large amounts of data
• Access with different protocols
• Direct access from applications (API layer)
• Disaster recovery (better in the next release)
• SAN replacement, VM storage
Bad
• User-space
• Low granularity in security settings
• High volumes of operations on the same file
24. Implementations
Old way
Metadata and data in the same place
Single stream per file
New way
Multiple streams are parallel channels through which data can flow
Files are striped across a set of nodes in order to facilitate parallel access
OSD: separation of file metadata management (MDS) from the storage of file data
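Striping a file across nodes can be sketched as cutting the data into fixed-size chunks and dealing them out round-robin. The functions `stripe` and `reassemble`, the chunk size and the node count are illustrative assumptions; real systems add metadata, replication and network transfer on top of this idea.

```python
def stripe(data, node_count, chunk_size):
    """Distribute fixed-size chunks of data round-robin over node_count nodes."""
    nodes = [[] for _ in range(node_count)]
    for i in range(0, len(data), chunk_size):
        nodes[(i // chunk_size) % node_count].append(data[i:i + chunk_size])
    return nodes


def reassemble(nodes):
    """Rebuild the original bytes by reading chunks back in round-robin order."""
    out = []
    depth = max((len(chunks) for chunks in nodes), default=0)
    for row in range(depth):
        for chunks in nodes:
            if row < len(chunks):
                out.append(chunks[row])
    return b"".join(out)


payload = b"abcdefghijklmnopqrstuvwxyz"
striped = stripe(payload, node_count=3, chunk_size=4)
restored = reassemble(striped)
```

Each node holds only every third chunk, so three nodes can serve different parts of the same file at the same time.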
25. HDFS: Hadoop
HDFS is part of the Apache Hadoop project, which develops open-source software for reliable, scalable, distributed computing.
Hadoop was inspired by Google’s MapReduce and Google File System
26. HDFS: Google File System
“Design of a file system for a different environment where assumptions of a general purpose file system do not hold; interesting to see how new assumptions lead to a different type of system…”
Key ideas:
Component failures are the norm.
Huge files (not just the occasional file)
Append rather than overwrite is typical
Co-design of application and file system API—specialization.
For example can have relaxed consistency.
27. HDFS: MapReduce
“Moving Computation is Cheaper than Moving Data”
Map
• Input is split and mapped into key-value pairs
Combine
• For efficiency reasons, the combiner works directly on the map operation outputs.
Reduce
• The files are then merged, sorted and reduced
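The three steps above can be sketched as a word count running in a single process. This is a teaching sketch, not the Hadoop API: the function names `map_phase`, `combine_phase`, `reduce_phase` and `word_count` are hypothetical, and a real job would run the map and combine steps on the nodes that hold each split.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(split: str) -> List[Tuple[str, int]]:
    """Map: turn one input split into (key, value) pairs."""
    return [(word.lower(), 1) for word in split.split()]


def combine_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Combine: pre-aggregate the map output locally, before any data moves."""
    local: Counter = Counter()
    for key, value in pairs:
        local[key] += value
    return dict(local)


def reduce_phase(partials: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Reduce: merge the partial results, then sort them by key."""
    merged: Dict[str, int] = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            merged[key] += value
    return dict(sorted(merged.items()))


def word_count(splits: Iterable[str]) -> Dict[str, int]:
    return reduce_phase(combine_phase(map_phase(split)) for split in splits)


if __name__ == "__main__":
    print(word_count(["a b a", "b c"]))
```

The combiner is what keeps network traffic low: each split ships one count per distinct word instead of one pair per word occurrence.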
28. HDFS: goals
Goals
Scalable: can reliably store and process petabytes.
Economical: distributes the data and processing across clusters of commonly available computers.
Efficient: can process data in parallel on the nodes where the data is located.
Reliable: automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.
30. HDFS: components
Namenode
• An HDFS cluster consists of a single
NameNode
• It is a master server that manages
the file system namespace and
regulates access to files by clients.
Datanodes
• DataNodes manage the storage attached to the system they run on
• Apply the map step of MapReduce
Blocks
• File is split into one or more blocks
and these blocks are stored in a set
of DataNodes
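The relationship between the three components can be sketched as a toy block map. This is a simplified model, not Hadoop code: the class `ToyNameNode`, its methods, the round-robin placement and the tiny block size are all assumptions made for illustration.

```python
from typing import Dict, List


class ToyNameNode:
    """Keeps only metadata: which DataNodes hold each block of each file."""

    def __init__(self, datanodes: List[str], block_size: int, replication: int = 3):
        if replication > len(datanodes):
            raise ValueError("not enough DataNodes for the requested replication")
        self.datanodes = datanodes
        self.block_size = block_size
        self.replication = replication
        self.block_map: Dict[str, List[List[str]]] = {}
        self._next = 0  # hypothetical round-robin cursor

    def add_file(self, path: str, size: int) -> List[List[str]]:
        """Split a file of `size` bytes into blocks and choose replicas for each."""
        n_blocks = -(-size // self.block_size)  # ceiling division
        placements: List[List[str]] = []
        for _ in range(n_blocks):
            replicas = [
                self.datanodes[(self._next + r) % len(self.datanodes)]
                for r in range(self.replication)
            ]
            self._next = (self._next + 1) % len(self.datanodes)
            placements.append(replicas)
        self.block_map[path] = placements
        return placements

    def locate(self, path: str) -> List[List[str]]:
        """What a client asks for before reading data directly from DataNodes."""
        return self.block_map[path]


if __name__ == "__main__":
    nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"], block_size=64)
    print(nn.add_file("/logs/a", size=130))
```

Note that the file data never passes through the NameNode in this model; it only answers the question "where are the blocks?".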
31. HDFS: features
Uniform name space: same path on all
workstations
Reliability: rw replication, re-balancing, copy
in different locations
Availability: hot deploy
Scalability: server aggregation
Administration: HOD
Performance: “grid” computation, parallel
transfer
32. HDFS: who uses it?
Major players
Yahoo!
A9.com
AOL
Booz Allen Hamilton
EHarmony
Facebook
Freebase
Fox Interactive Media
IBM
ImageShack
ISI
Joost
Last.fm
LinkedIn
Metaweb
Meebo
Ning
Powerset (now part of Microsoft)
Proteus Technologies
The New York Times
Rackspace
Veoh
Twitter
…
33. HDFS: good for ...
Good
• Task distribution (basic grid infrastructure)
• Distribution of content (high throughput of data access)
• Archiving
• Heterogeneous environments
Bad
• Not a general-purpose file system
• Not POSIX compliant
• Low granularity in security settings
• Java
34. Ceph
“Ceph is designed to handle workloads in which tens of thousands of clients or more simultaneously access the same file or write to the same directory – usage scenarios that bring typical enterprise storage systems to their knees.”
Keys:
Seamless scaling — The file system can be seamlessly expanded by simply
adding storage nodes (OSDs). However, unlike most existing file systems, Ceph
proactively migrates data onto new devices in order to maintain a balanced
distribution of data.
Strong reliability and fast recovery — All data is replicated across multiple
OSDs. If any OSD fails, data is automatically re-replicated to other devices.
Adaptive MDS — The Ceph metadata server (MDS) is designed to dynamically
adapt its behavior to the current workload.
36. Ceph: features
Dynamic Distributed Metadata
• Metadata Storage
• Dynamic Subtree Partitioning
• Traffic Control
Reliable Autonomic Distributed Object Storage
• Data Distribution
• Replication
• Data Safety
• Failure Detection
• Recovery and Cluster Updates
37. Ceph: features
Pseudo-random data distribution function (CRUSH)
Reliable object storage service (RADOS)
Extent B-tree object File System (today btrfs)
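To give a feeling for what a pseudo-random data distribution function does, here is a small sketch. It is not CRUSH: it uses rendezvous (highest-random-weight) hashing as a simpler stand-in, and the names `place` and `_score` and the OSD names are hypothetical. What it shares with the idea on the slide is that any client can compute where an object lives from its name alone, without asking a central table.

```python
import hashlib
from typing import List


def _score(object_id: str, osd: str) -> int:
    """Deterministic pseudo-random weight for an (object, OSD) pair."""
    digest = hashlib.sha256(f"{object_id}:{osd}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def place(object_id: str, osds: List[str], replicas: int = 2) -> List[str]:
    """Pick the `replicas` OSDs with the highest score for this object."""
    ranked = sorted(osds, key=lambda osd: _score(object_id, osd), reverse=True)
    return ranked[:replicas]


if __name__ == "__main__":
    osds = ["osd0", "osd1", "osd2", "osd3"]
    print(place("volume1/object42", osds))
    # Adding a device only moves objects onto the new device.
    print(place("volume1/object42", osds + ["osd4"]))
```

With this scheme, adding a storage node changes the placement of an object only if the new node enters its replica set, which is one way to keep rebalancing traffic small.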
38. Ceph: features
Splay Replication
• Only after it has been safely committed to disk is a final commit
notification sent to the client.
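The acknowledgement rule in this bullet can be sketched as follows. This shows only the "notify the client after every replica has committed" rule, not the full splay replication protocol; the classes and names (`Replica`, `replicated_write`) are hypothetical and the "disk" is just a dictionary.

```python
from typing import Dict, List


class Replica:
    """A stand-in for one storage device holding a copy of the data."""

    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy
        self.disk: Dict[str, bytes] = {}

    def commit(self, key: str, data: bytes) -> bool:
        """Return True only when the data is safely stored."""
        if not self.healthy:
            return False
        self.disk[key] = data
        return True


def replicated_write(replicas: List[Replica], key: str, data: bytes) -> bool:
    """Send the final commit notification only if every replica committed."""
    results = [replica.commit(key, data) for replica in replicas]
    return all(results)


if __name__ == "__main__":
    good = [Replica("osd0"), Replica("osd1")]
    print(replicated_write(good, "obj", b"payload"))
    degraded = [Replica("osd0"), Replica("osd1", healthy=False)]
    print(replicated_write(degraded, "obj", b"payload"))
```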
39. Ceph: good for …
Good
• Scientific applications, high throughput of data access
• Heavy read/write operations
• It is the most advanced distributed file system
Bad
• Young (Linux 2.6.34)
• Linux only
• Complex
42. Class Exam
What can a DFS do for you?
How can you create petabyte-scale storage?
How can you build a centralized system log?
How can you allocate space for your users or systems when you have thousands of users/systems?
How can you retrieve data from everywhere?
43. File sharing
Problem
•Share documents across a wide area network
•Share home folders across different terminal servers
Solution
•OpenAFS
•Samba
Results
•Single ID, Kerberos/LDAP
•Single file system
Usage
•800 users
•15 branch offices
•File sharing /home dir
44. Web Service
Problem
• Big Storage on a little budget
Solution
• Gluster
Results
• High Availability data storage
• Low price
Usage
• 100 TB image archive
• Multimedia content for web site
46. Log concentrator
Problem
• Log concentrator
Solution
• Hadoop cluster
• Syslog-NG
Results
• High availability
• Fast search
• “Storage without limits”
Usage
• Security audit and access control
47. Private cloud
Problems
• Low cost VM storage
• VM self provisioning
Solution
• GlusterFS
• openAFS
• Custom provisioning
Results
• Auto provisioning
• Low cost
• Flexible solution
Usage
• Development env
• Production env
48. Conclusion: problems
Do you have enough bandwidth?
Failure
For 10 PB of storage, you will have an average of 22 consumer-grade SATA drives failing per day.
Read/write time
Each of the 2 TB drives takes, in the best case, approximately 24,390 seconds to be read and written over the network.
Data Replication
Data replication is the number of the disk drives, plus difference.
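The read/write figure can be checked with a little arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes); the function names are hypothetical. Under that assumption, 24,390 seconds for a 2 TB drive corresponds to a sustained rate of roughly 82 MB/s, or close to seven hours per drive.

```python
def implied_throughput_mb_s(drive_bytes: float, seconds: float) -> float:
    """Sustained rate (in MB/s) needed to move a whole drive in `seconds`."""
    return drive_bytes / seconds / 1e6


def transfer_hours(seconds: float) -> float:
    """The same duration expressed in hours."""
    return seconds / 3600


if __name__ == "__main__":
    drive_bytes = 2e12   # one 2 TB drive, decimal units (assumption)
    seconds = 24_390     # best-case figure quoted on the slide
    print(f"{implied_throughput_mb_s(drive_bytes, seconds):.1f} MB/s")
    print(f"{transfer_hours(seconds):.2f} hours per drive")
```

This is why the first question on the slide is about bandwidth: every failed drive has to be re-replicated over the network at about this rate.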
49. Conclusion
Environment Analysis
• No true generic DFS
• Moving 800 TB between different solutions is not simple
Dimension
• Start with the right size
• The number of servers depends on the speed needed and the number of clients
• Network for replication
Divide the system into classes of service
• Different disk types
• Different computer types
System Management
• Monitoring Tools
• System/Software Deploy Tools
52. I look forward to meeting you…
XVII European AFS meeting 2010
PILSEN - CZECH REPUBLIC
September 13-15
Who should attend:
Everyone interested in deploying a globally accessible
file system
Everyone interested in learning more about real
world usage of Kerberos authentication in single
realm and federated single sign-on environments
Everyone who wants to share their knowledge and
experience with other members of the AFS and
Kerberos communities
Everyone who wants to find out the latest
developments affecting AFS and Kerberos
More Info: http://afs2010.civ.zcu.cz/
The session is composed of three main parts. In the first part we will see some definitions of DFS and the new trends in the data center, in particular what the big players are trying to do. The second part explains the architectures of four distributed file systems. At the end I will give you some examples and real case studies, with the technologies explained. Last but not least, the conclusion ... and dinner.
I want to start with some questions, to understand who you are. You will find the answers to these questions in the case studies part.
Today the big players make announcements, and sometimes offer solutions, under the name Fabric; Cisco, for example, calls its solution Unified Fabric. But what is the idea behind this name? With Fabric we go back to the grid idea, with many nodes, but this time we also have some other categories; in particular, the fabric has the concepts of ... It is probably not a new idea, but anyway, this is the future that we will see advertised.
Of the five categories shown before, the most important one today, in my opinion, is the storage tier, because we still don't have the right approach: we have a lot of solutions that are little known or in beta stage, and only a few consolidated old architectures. For the fabric, the storage tier needs to be ... The last two are not directly connected to storage, but could be. The most used solution today is the Storage Area Network, and sometimes Network Attached Storage. The big players today say the future is something over Ethernet; no one talks about distributed file systems. Do you know distributed file systems?
Then the first question is: given this explanation, could a DFS be useful for a data center? What do you think?
What do you expect from a DFS? What do you need for the fabric?
Unfortunately, or fortunately, we have dozens of DFSs. We can group them into five categories, each of which tries to solve a specific problem; this means we don't have a truly generic file system like ext3 or any other file system used on a local hard drive.
The first file system I will talk about is OpenAFS; AFS originated at Carnegie Mellon University. The key ideas behind the design of OpenAFS are: ... (that means time and location); use a persistent cache on the client side; hide the data location. This is the opposite of NFS. AFS used Kerberos 4 (krb4) and now uses Kerberos 5 (krb5).
We have two types of servers. One is the database server, which is a collection of databases (the name probably gives you a hint), and the other is the file server; in this case too, you can guess the function from the name. On the database server you have four services. The first is for searching and looking up data: your information is spread across many servers, so how do you find where it is? Simple: you use the Volume Location service, which gives you the server where the information is stored. Another service is the ptserver, a database that handles the mapping between IDs and user names, and the same for groups. It also contains the owner and the members of each group. The BU server is the database with information on the last backup and other related information for the backup service. The last one is deprecated: it is a special version of Kerberos 4; now you can use standard Kerberos 5. That is the database server. On the other hand we have the file server, which reads and saves data on a specific partition. OpenAFS data is a set of files in a standard file system; the blocks are handled with a map of the partition's inodes, and for this reason it is much better to use a separate partition. The last component is the client. On the client you have a kernel module and the cache manager. With the Kerberos ticket, all your requests are authenticated and handled by the kernel; the cache manager controls and handles all the entries of the cache. OpenAFS works with RPC and callbacks: the file server knows you have a copy of a file, and if the file changes, the file server breaks the callback to the users. With this mechanism the cache is not a timer-based cache but a coherent one, and network traffic is reduced.
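The callback mechanism described here can be sketched in a few lines. This is a toy model, not OpenAFS code: the classes `ToyFileServer` and `ToyClient`, their methods and the in-memory "cache" are all hypothetical, and authentication and RPC are left out entirely.

```python
from typing import Dict, Set


class ToyFileServer:
    """Remembers which clients cache each file (the callback promises)."""

    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}
        self.callbacks: Dict[str, Set["ToyClient"]] = {}

    def fetch(self, client: "ToyClient", path: str) -> bytes:
        self.callbacks.setdefault(path, set()).add(client)
        return self.files[path]

    def store(self, writer: "ToyClient", path: str, data: bytes) -> None:
        self.files[path] = data
        # Break the callback for every other client holding a copy.
        for client in self.callbacks.get(path, set()) - {writer}:
            client.break_callback(path)
        self.callbacks[path] = {writer}


class ToyClient:
    """Cache manager: serves reads locally until the server breaks the callback."""

    def __init__(self, server: ToyFileServer) -> None:
        self.server = server
        self.cache: Dict[str, bytes] = {}
        self.fetches = 0  # counts round trips to the server

    def read(self, path: str) -> bytes:
        if path not in self.cache:
            self.cache[path] = self.server.fetch(self, path)
            self.fetches += 1
        return self.cache[path]

    def write(self, path: str, data: bytes) -> None:
        self.server.store(self, path, data)
        self.cache[path] = data

    def break_callback(self, path: str) -> None:
        self.cache.pop(path, None)


if __name__ == "__main__":
    server = ToyFileServer()
    server.files["/afs/cell/doc"] = b"v1"
    alice, bob = ToyClient(server), ToyClient(server)
    alice.read("/afs/cell/doc")
    alice.read("/afs/cell/doc")          # served from cache, no network traffic
    bob.write("/afs/cell/doc", b"v2")    # server breaks alice's callback
    print(alice.read("/afs/cell/doc"), alice.fetches)
```

The second read costs nothing on the network, and the cache never serves stale data, because invalidation is driven by the server rather than by a timer.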
Now we see how the information is stored. Volumes are similar to logical volumes: the quota works as a quota and you can expand it as you want, depending on the size of the underlying file system. You can move a volume wherever you want and you can replicate a volume; unfortunately the read-only copy is more a snapshot than a real-time replica. There is a specific command to handle synchronization between volumes.
Users can define their own groups and ACLs.
With the latest cache changes, you lose 5% of speed with a warm cache compared with a read from the local file system.
The basic idea of Gluster is to replace attached storage with a bunch of low-cost servers, with no single point of failure, in a simple way. The goals of the project are a high level of scalability, performance and high availability ... today with the idea of virtualized storage.
The basic idea of Gluster is to be simple: you can use it like Lego, with different bricks. You have a server side, where the partition is exported, and a client side. Most of the work is done on the client side because there is no metadata server, so all decisions are made by the clients using information stored on the servers. You have different types of interconnection between server and client. One of the big advantages of Gluster is that you can re-export the file system with other protocols.
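How can a client decide where a file lives without a metadata server? One common answer is hashing the file name. The sketch below is a deliberately simplified illustration, not Gluster's actual distribution algorithm: the function `pick_brick`, the modulo scheme and the brick names are assumptions.

```python
import hashlib
from typing import List


def pick_brick(filename: str, bricks: List[str]) -> str:
    """Every client computes the same brick from the file name alone."""
    digest = hashlib.sha256(filename.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(bricks)
    return bricks[index]


if __name__ == "__main__":
    # Hypothetical bricks: one exported directory per server.
    bricks = [
        "server1:/export/brick1",
        "server2:/export/brick1",
        "server3:/export/brick1",
    ]
    for name in ["photo.jpg", "report.pdf", "vm-disk.img"]:
        print(name, "->", pick_brick(name, bricks))
```

Since the result depends only on the name and the brick list, no server has to be asked "where is this file?", which removes the metadata server as a bottleneck and as a single point of failure.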
I have mentioned bricks. Here too we have the volume, which is a brick; you can add capabilities incrementally by extending the volume, an idea that comes from Hurd.
It is used as storage and as a SAN replacement; today there is more attention on the VMware world and virtual machines, with specific features for handling failover.
We have seen two implementations that do not separate metadata from data and use a single stream: one server sends you the entire file, block by block. The new way maintains the metadata on a separate infrastructure and introduces the concept of object storage.
Components: you have a NameNode, where the metadata is stored, and many DataNodes, where pieces of files are copied. The client sends a request to the NameNode, and the NameNode sends back the list of blocks and DataNodes.
The NameNode is a single point of failure, so you need some high-availability solution. The information is stored in memory, but the NameNode also writes a journal file to disk; in case of a crash you can ... Copies: two in the same rack, one in an external rack.
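The copy rule mentioned at the end of this note (two copies in the same rack, one in an external rack) can be sketched as below. It follows the rule as stated in the talk, not necessarily the exact policy of any Hadoop release; the function `place_replicas` and the rack and node names are hypothetical.

```python
from typing import Dict, List


def place_replicas(racks: Dict[str, List[str]], local_rack: str) -> List[str]:
    """Return three nodes: two in `local_rack`, one in any other rack."""
    local_nodes = racks[local_rack]
    if len(local_nodes) < 2:
        raise ValueError("need at least two nodes in the local rack")
    remote_nodes = [
        node
        for rack, nodes in racks.items()
        if rack != local_rack
        for node in nodes
    ]
    if not remote_nodes:
        raise ValueError("need at least one node in another rack")
    return [local_nodes[0], local_nodes[1], remote_nodes[0]]


if __name__ == "__main__":
    racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
    print(place_replicas(racks, "rack1"))
```

The two local copies keep most replication traffic inside one rack switch, while the external copy survives the loss of the whole rack.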
Basically it is very good for logs and/or analysis; it could also be used for coordination, like distributed locking. Some interesting projects are Hive, a meta-language similar to SQL used to query the data, and HBase, the implementation of Google's Bigtable.
Ceph tries to solve some limitations present in the OSD model, in particular in the separation of metadata and data. The main objectives are scalability, reliability and high availability.
At the high level we have the usual components: metadata server, object storage cluster ... The name is changed but ...
On the MDS side you have dynamic subtree partitioning based on traffic control. In the storage system you have automatic replication and failure detection.