The document discusses the challenges of managing the massive amount of data that will be generated by the Internet of Things (IoT). It argues that storing all raw IoT data authoritatively and allowing for hypothesis exploration through computation will require new approaches. ZFS storage combined with OS-level containers through systems like Manta may provide a way to cope with IoT-generated data at scale while enabling flexible analysis of non-fungible data. The document also discusses how Unix and its model of small, interconnected programs was well-suited for ad hoc data processing tasks.
4. Big Data circa 2004: Internet exhaust
• Through the 1990s, Moore's Law + Kryder's Law grew faster than transaction rates, and what was "overwhelming" in 1994 was manageable by 2004
• But large internet concerns (Google, Facebook, Yahoo!) encountered a new class of problem: analyzing massive amounts of data emitted as a byproduct of activity
• Data scaled with activity, not transactions – changing both data sizes and economics
• Data sizes were too large for extant data warehousing solutions – and were embarrassingly parallel besides
5. Big Data circa 2004: MapReduce
• MapReduce, pioneered by Google and later emulated by Hadoop, pointed to a new paradigm where compute tasks are broken into map and reduce phases
• Serves to explicitly divide the work that can be parallelized from that which must be run sequentially
• Map phases are farmed out to a storage layer that attempts to co-locate them with the data being mapped
• Made for commodity scale-out systems; relatively cheap storage allowed for sloppy but effective solutions (e.g., storing data in triplicate; see the sketch below)
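A toy illustration of the map/reduce split using nothing but stock Unix tools – not Hadoop or Google's implementation, and the logs/ and map-out/ paths are hypothetical: the "map" tasks run in parallel, one per input file, emitting partial word counts, and a single "reduce" pass then merges them.

# map phase: one task per input file, run in parallel
mkdir -p map-out
for f in logs/*.txt; do
  ( tr -cs 'A-Za-z' '\n' < "$f" | sort | uniq -c > "map-out/$(basename "$f").counts" ) &
done
wait
# reduce phase: merge the per-file partial counts into one total per word
cat map-out/*.counts | awk '{ total[$2] += $1 } END { for (w in total) print total[w], w }' | sort -rn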
6. Big Data circa 2014
• Hadoop has become the de facto big data processing engine – and HDFS the de facto storage substrate
• But HDFS is designed around availability during/for computation; it is not designed to be authoritative
• HDFS is used primarily for data that is redundant, transient, replaceable or otherwise fungible
• Authoritative storage remains either enterprise storage (on premises) or object storage (in the cloud)
• For analysis of non-fungible data, the pattern is to ingest data into a Hadoop cluster from authoritative storage
• But a new set of problems is poised to emerge...
7. Big Data circa 2014: Internet-of-things
• IDC forecasts that the "digital universe" will grow from 130 exabytes in 2005 to 40,000 exabytes in 2020 – with as much as a third having "analytic value"
• This doesn't even factor in the (long forecasted) rise of the internet-of-things/industrial internet...
• Machine-generated data at the edge will effect a step function in data sizes and processing methodologies
• No one really knows how much data will be generated by IoT, but the numbers are insane (e.g., an HD camera generates 20 GB/hour; a Ford Energi engine generates 25 GB/hour; a GE jet engine generates 1 TB/flight)
8. How to cope with IoT-generated data?
• IoT presents so much more data that we will increasingly need data science to make sense of it
• To assure data, we need to retain as much raw data as possible, storing it once and authoritatively
• Storing data authoritatively has ramifications for the storage substrate
• To allow for science, we need to place an emphasis on hypothesis exploration: it must be quick to iterate from hypothesis to experiment to result to new hypothesis
• Emphasizing hypothesis exploration has ramifications for the compute abstractions and data movement
9. The coming ramifications of IoT
• It will no longer be acceptable to discard data: all data will need to be retained to explore future hypotheses
• It will no longer be acceptable to store three copies: 3X on storage costs is too acute when data is massive
• It will no longer be acceptable to move data for analysis: in some cases, not even over the internet!
• It will no longer be acceptable to dictate the abstraction: we must accommodate anything that can process data
• These shifts are as significant as the shift from traditional data warehousing to scale-out MapReduce!
10. IoT: Authoritative storage?
• "Filesystems" that are really just user-level programs layered on local filesystems lack device-level visibility, sacrificing reliability and performance
• Even in-kernel, we have seen the corrosiveness of an abstraction divide in the historic divide between logical volume management and the filesystem:
• The volume manager understands multiple disks, but nothing of the higher level semantics of the filesystem
• The filesystem understands the higher semantics of the data, but has no physical device understanding
• This divide became entrenched over the 1990s, and had devastating ramifications for reliability and performance
11. The ZFS revolution
• Starting in 2001, Sun began a revolutionary new software effort: to unify storage and eliminate the divide
• In this model, filesystems would lose their one-to-one association with devices: many filesystems would be multiplexed on many devices
• By starting with a clean sheet of paper, ZFS opened up vistas of innovation – and by its architecture was able to solve many otherwise intractable problems
• Sun shipped ZFS in 2005, and used it as the foundation of its enterprise storage products starting in 2008
• ZFS was open sourced in 2005; it remains the only open source enterprise-grade filesystem
12. ZFS advantages
• Copy-on-write design allows on-disk consistency to be always assured (eliminating filesystem check)
• Copy-on-write design allows constant-time snapshots in unlimited quantity – and writable clones!
• Filesystem architecture allows filesystems to be created instantly and expanded – or shrunk! – on-the-fly
• Integrated volume management allows for intelligent device behavior with respect to disk failure and recovery
• Adaptive replacement cache (ARC) allows for optimal use of DRAM – especially on high DRAM systems
• Support for dedicated log and cache devices allows for optimal use of flash-based SSDs (see the command sketch below)
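A minimal command sketch of the capabilities above, assuming a hypothetical pool named "tank" built from two mirrored disks; the device, dataset and snapshot names are illustrative only.

# pooled storage: many filesystems multiplexed onto the same devices
zpool create tank mirror c1t0d0 c1t1d0      # integrated volume management with redundancy
zfs create tank/sensors                     # filesystems are created instantly
zfs set quota=500G tank/sensors             # ...and can be resized on the fly
# constant-time snapshots and writable clones (cheap thanks to copy-on-write)
zfs snapshot tank/sensors@ingest-2014-06-01
zfs clone tank/sensors@ingest-2014-06-01 tank/experiments/run1
# dedicated log and cache devices for flash-based SSDs
zpool add tank log c2t0d0
zpool add tank cache c2t1d0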
13. ZFS at Joyent
• Joyent was the earliest ZFS adopter, becoming (in 2005) the first production user of ZFS outside of Sun
• ZFS is one of the four foundational technologies of Joyent's SmartOS, our illumos derivative
• The other three foundational technologies in SmartOS are DTrace, Zones and KVM
• Search "fork yeah illumos" for the (uncensored) history of OpenSolaris, illumos, SmartOS and derivatives
• Joyent has extended ZFS to better support multi-tenant operation with I/O throttling
14. ZFS as the basis for IoT?
• ZFS offers commodity hardware economics with enterprise-grade reliability – and obviates the need for cross-machine mirroring for durability
• But ZFS is not itself a scale-out distributed system, and is ill suited to become one
• Conclusion: ZFS is a good building block for the data explosion from IoT, but not the whole puzzle
15. IoT: Compute abstraction?
• To facilitate hypothesis exploration, we need to carefully consider the abstraction for computation
• How is data exploration programmatically expressed?
• How can this be made to be multi-tenant?
• The key enabling technology for multi-tenancy is virtualization – but where in the stack to virtualize?
16. Hardware-level virtualization?
• The historical answer – since the 1960s – has been to virtualize at the level of the hardware:
• A virtual machine is presented upon which each tenant runs an operating system of their choosing
• There are as many operating systems as tenants
• The historical motivation for hardware virtualization remains its advantage today: it can run entire legacy stacks unmodified
• However, hardware virtualization exacts a heavy toll: operating systems are not designed to share resources like DRAM, CPU, I/O devices or the network
• Hardware virtualization limits tenancy and inhibits performance!
17. Platform-level virtualization?
• Virtualizing at the application platform layer addresses the tenancy challenges of hardware virtualization...
• ...but at the cost of dictating abstraction to the developer
• With IoT, this is especially problematic: we can expect much more analog data and much deeper numerical analysis – and dependencies on native libraries and/or domain-specific languages
• Virtualizing at the application platform layer poses many other challenges:
• Security, resource containment, language specificity, environment-specific engineering costs
18. Joyent's solution: OS containers
• Containers virtualizing the OS hit the sweet spot:
• A single OS (single kernel) allows for efficient use of hardware resources, and therefore allows load factors to be high
• Disjoint instances are securely compartmentalized by the operating system
• Gives customers what appears to be a virtual machine (albeit a very fast one) on which to run higher-level software
• Gives customers PaaS when the abstractions work for them, IaaS when they need more generality
• OS-level virtualization allows for high levels of tenancy without dictating abstraction or sacrificing efficiency
• Zones is a bullet-proof implementation of OS-level virtualization – and is the core abstraction in Joyent's SmartOS (see the sketch below)
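A minimal sketch of standing up a zone with the stock illumos/Solaris tooling; the zone name and path are hypothetical, and on SmartOS zones are normally provisioned by higher-level tooling rather than by hand.

# define the zone; its root typically lives on its own ZFS dataset
zonecfg -z tenant1 'create; set zonepath=/zones/tenant1; commit'
# install and boot it: the tenant sees what looks like its own machine,
# but a single kernel multiplexes CPU, DRAM, I/O and the network
zoneadm -z tenant1 install
zoneadm -z tenant1 boot
# get a shell inside the running zone
zlogin tenant1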
20. Manta: ZFS + Containers!
• Building a sophisticated distributed system on top of ZFS and zones, we have built Manta, an internet-facing object storage system offering in situ compute
• That is, the description of compute can be brought to where objects reside instead of having to backhaul objects to transient compute
• The abstractions made available for computation are anything that can run on the OS...
• ...and as a reminder, the OS – Unix – was built around the notion of ad hoc unstructured data processing, and allows for remarkably terse expressions of computation (see the sketch below)
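A minimal sketch of what bringing compute to the objects looks like from a client; the account and object names are illustrative, and the mput/mjob commands come from the node-manta CLI.

# store an object once, authoritatively
mput -f sensors-2014-06-01.log /myaccount/stor/sensors-2014-06-01.log
# run a computation where the object lives: the map task is an ordinary
# Unix command, executed in a transient zone alongside the data
echo /myaccount/stor/sensors-2014-06-01.log | mjob create -o -m wc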
21. Aside: Unix
• When Unix appeared in the early 1970s, it was not just a new system, but a new way of thinking about systems
• Instead of a sealed monolith, the operating system was a collection of small, easily understood programs
• First Edition Unix (1971) contained many programs that we still use today (ls, rm, cat, mv)
• Its very name conveyed this minimalist aesthetic: Unix is a homophone of "eunuchs" – a castrated Multics
"We were a bit oppressed by the big system mentality. Ken wanted to do something simple." – Dennis Ritchie
22. Unix: Let there be light
• In 1969, Doug McIlroy had the idea of connecting different components:
"At the same time that Thompson and Ritchie were sketching out a file system, I was sketching out how to do data processing on the blackboard by connecting together cascades of processes"
• This was the primordial pipe, but it took three years to persuade Thompson to adopt it:
"And one day I came up with a syntax for the shell that went along with the piping, and Ken said, 'I'm going to do it!'"
23. Unix: ...and there was light
"And the next morning we had this orgy of one-liners." – Doug McIlroy
24. The Unix philosophy
• The pipe – coupled with the small-system aesthetic – gave rise to the Unix philosophy, as articulated by Doug McIlroy:
• Write programs that do one thing and do it well
• Write programs to work together
• Write programs that handle text streams, because that is a universal interface
• Four decades later, this philosophy remains the single most important revolution in software systems thinking!
25. Doug McIlroy v. Don Knuth: FIGHT!
• In 1986, Jon Bentley posed the challenge that became the Epic Rap Battle of computer science history:
"Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies."
• Don Knuth's solution: an elaborate program in WEB, a Pascal-like literate programming system of his own invention, using a purpose-built algorithm
• Doug McIlroy's solution shows the power of the Unix philosophy:
tr -cs A-Za-z '\n' | tr A-Z a-z |
sort | uniq -c | sort -rn | sed ${1}q
26. Big Data: History repeats itself?
• The original Google MapReduce paper (Dean et al., OSDI '04) poses a problem disturbingly similar to Bentley's challenge nearly two decades prior:
"Count of URL Access Frequency: The function processes logs of web page requests and outputs ⟨URL, 1⟩. The reduce function adds together all values for the same URL and emits a ⟨URL, total count⟩ pair"
• But the solutions do not adhere to the Unix philosophy...
• ...and nor do they make use of the substantial Unix foundation for data processing
• e.g., Appendix A of the OSDI '04 paper has a 71-line word count in C++ – with nary a wc in sight (a Unix rendering of the URL-count problem follows below)
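For contrast, a sketch of the URL-access-frequency computation in the Unix idiom: assuming a common-format web access log named access.log, in which the request path is the seventh whitespace-separated field, the "map" and "reduce" collapse into a single pipeline.

# map: extract the URL from each request; reduce: count and rank
awk '{ print $7 }' access.log | sort | uniq -c | sort -rn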
27. Manta: Unix for Big Data – and IoT
• Manta allows for an arbitrarily scalable variant of McIlroy's solution to Bentley's challenge:
mfind -t o /bcantrill/public/v7/usr/man |
mjob create -o -m "tr -cs A-Za-z '\n' |
tr A-Z a-z | sort | uniq -c" -r
"awk '{ x[$2] += $1 }
END { for (w in x) { print x[w] " " w } }' |
sort -rn | sed ${1}q"
• This description is not only terse, it is high performing: data is left at rest – with the "map" phase doing heavy reduction of the data stream
• As such, Manta – like Unix – is not merely syntactic sugar; it converges compute and data in a new way
28. Manta: CAP tradeoffs
• Eventual consistency represents the wrong CAP tradeoffs for most; we prefer consistency over availability for writes (but still availability for reads)
• Many more details: http://dtrace.org/blogs/dap/2013/07/03/fault-tolerance-in-manta/
• Celebrity endorsement:
29. Manta: Other design principles
• Hierarchical storage is an excellent idea (ht: Multics); Manta implements proper directories, delimited with a forward slash
• Manta implements a snapshot/link hybrid dubbed a snaplink; it can be used to effect versioning
• Manta has full support for CORS headers
• Manta uses SSH-based HTTP auth for client-side tooling (IETF draft-cavage-http-signatures-00)
• Manta SDKs exist for node.js, R, Go, Java, Ruby, Python – and of course, compute jobs may be in any of these (plus Perl, Clojure, Lisp, Erlang, Forth, Prolog, Fortran, Haskell, Lua, Mono, COBOL, etc.)
• "npm install manta" for the command line interface (see the sketch below)
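A minimal sketch of the client-side tooling; the account name and object paths are illustrative, MANTA_URL/MANTA_USER/MANTA_KEY_ID are the environment variables the node-manta CLI reads, and requests are signed with the corresponding SSH key per the draft above.

npm install -g manta    # installs mput, mget, mls, mmkdir, mfind, mjob, ...
export MANTA_URL=https://us-east.manta.joyent.com
export MANTA_USER=myaccount
export MANTA_KEY_ID=$(ssh-keygen -l -f ~/.ssh/id_rsa.pub | awk '{print $2}')
mmkdir /myaccount/stor/iot                              # hierarchical storage: real directories
mput -f readings.csv /myaccount/stor/iot/readings.csv   # store an object
mls /myaccount/stor/iot                                 # list the directory
mget /myaccount/stor/iot/readings.csv > readings-copy.csv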
30. Manta and IoT
• We believe compute/data convergence to be a constraint imposed by IoT: stores of record must support computation as a first-class, in situ operation
• We believe that some (and perhaps many) IoT workloads will require computing at the edge – internet transit may be prohibitive for certain applications
• We believe that Unix is a natural way of expressing this computation – and that OS containers are the right way to support this securely
• We believe that ZFS is the only sane storage underpinning for such a system
• Manta will surely not be the only system to represent the confluence of these – but it is the first
31. Manta: More information
• Product page: http://joyent.com/products/manta
• node.js module: https://github.com/joyent/node-manta
• Manta documentation: http://apidocs.joyent.com/manta/
• IRC, e-mail, Twitter, etc.: #manta on freenode, manta@joyent.com, @mcavage, @dapsays, @yunongx, @joyent