There have never been more commercial tools available for building distributed data apps — from cloud hosting services, to cloud-native databases, to cloud-based analytics platforms. So why is it still so hard to make a successful app with a global user base?
One of the toughest challenges cloud offerings take on is the problem of consensus, abstracting away most of its complexity. That's no small feat, given that consensus is a hard enough problem that people spend years earning a PhD just to understand it! Unfortunately, while buying off-the-shelf cloud services can accelerate the path to an MVP, it also makes optimization tough. How will we scale during a period of rapid user growth? How do we handle i18n and l10n, or guarantee a good UX for users on the other side of the world? How do we prevent replication that might get us into legal trouble?
In this talk, we'll consider several case studies of global apps (both successful and otherwise!), talk about the limitations of off-the-shelf consensus, and consider a future where everyday developers can use open source tools to build distributed data apps that are easier to reason about, maintain, and tune.
An Overview of Spanner: Google's Globally Distributed Database (Benjamin Bengfort)
Spanner is a globally distributed database that provides external consistency between data centers and stores data in a schema-based, semi-relational data structure. Not only that, Spanner provides a versioned view of the data that allows for instantaneous snapshot isolation across any segment of the data. This versioning allows Spanner to provide globally consistent reads of the database at a particular timestamp, enabling lock-free read-only transactions (and therefore no consensus communication overhead during these types of reads). Spanner also provides externally consistent reads and writes with timestamp-based linear execution of transactions and two-phase commit. Spanner is the first distributed database to provide global sharding and replication with strong consistency semantics.
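As a purely illustrative sketch (not Spanner's actual implementation; all names here are made up), here is how timestamped multiversioning lets reads at a chosen timestamp proceed without locks:

```python
import bisect

class MVStore:
    """Toy multiversion key-value store: every write is tagged with a
    commit timestamp, and a read at timestamp t sees the latest version
    written at or before t -- so snapshot reads take no locks."""

    def __init__(self):
        self.stamps = {}  # key -> sorted list of commit timestamps
        self.values = {}  # key -> values parallel to self.stamps

    def write(self, key, value, ts):
        i = bisect.bisect_right(self.stamps.setdefault(key, []), ts)
        self.stamps[key].insert(i, ts)
        self.values.setdefault(key, []).insert(i, value)

    def read_at(self, key, ts):
        i = bisect.bisect_right(self.stamps.get(key, []), ts)
        return self.values[key][i - 1] if i else None

store = MVStore()
store.write("balance", 100, ts=10)
store.write("balance", 250, ts=20)
print(store.read_at("balance", 15))  # 100: the snapshot at ts=15
print(store.read_at("balance", 25))  # 250: the snapshot at ts=25
```

Because each read is pinned to a timestamp, two readers at the same timestamp always agree, no matter which replica serves them.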
This talk provides an introduction to various concepts that are essential to the understanding of distributed systems. Concepts covered include the 8 fallacies of distributed computing, the anatomy of a distributed system, system models, the CAP theorem, consistency models, partitioning, replication, leader election, failure detection, and consensus algorithms. This is the first in a three-part series designed to familiarize the audience with the design and usage of distributed systems.
Developing a Globally Distributed Purging System (Fastly)
(Surge 2014) How do you build a distributed cache invalidation system that can invalidate content in 150 milliseconds across a global network of servers? Fastly CTO Tyler McMullen and engineer Bruce Spang will discuss the process of constructing a production-ready distributed system built on solid theoretical foundations. This talk will cover using research to design systems, the bimodal multicast algorithm, and the behavior of this system in production.
This paper describes the use of Storm at Twitter. Storm is a real-time, fault-tolerant, distributed stream data processing system. Storm is currently being used to run various critical computations at Twitter at scale, and in real time. This paper describes the architecture of Storm and its methods for distributed scale-out and fault tolerance. It also describes how queries (a.k.a. topologies) are executed in Storm, and presents some operational stories based on running Storm at Twitter. We also present results from an empirical evaluation demonstrating the resilience of Storm in dealing with machine failures. Storm is under active development at Twitter, and we also present some potential directions for future work.
Google has designed and implemented a scalable distributed file system for their large, distributed, data-intensive applications. They named it the Google File System (GFS).
Best Practice: A High Availability Solution for Geo-Distributed Data (Marco Tusa)
Nowadays, implementing different grades of business continuity for the data layer is a common requirement. When designing architectures that include MySQL as a data layer, we have different options to cover the required target. Nevertheless, we still see a lot of confusion around concepts such as High Availability and Disaster Recovery, confusion that often leads to improper architecture design and wrong solution implementation. This presentation aims to remove that confusion and provide clear guidelines for designing a robust, flexible, resilient architecture for your data layer.
The Zen of High Performance Messaging with NATS, Strange Loop 2016 (wallyqs)
Video: https://www.youtube.com/watch?v=dYrYCt2dTkw
HTML5: https://wallyqs.github.io/stl-nats-talk/
NATS is an open source, high-performance messaging system designed to be as simple and reliable as possible without trading off scalability. Originally written in Ruby and then rewritten in Go, a NATS server can nowadays push over 11M messages per second.
In this talk, we will cover how following simplicity as the main design constraint, along with focusing on a limited built-in feature set, resulted in a system that is easy to operate and reason about, making it an attractive choice when building many types of distributed systems where low latency and high availability are very important.
PHP Backends for Real-Time User Interaction Using Apache Storm (DECK36)
Engaging users in real time is the topic of our times. Whether it's a game, a shop, or a content network, the aim remains the same: providing a personalized experience. In this workshop we will look under the hood of Apache Storm and lay a firm foundation for how to use it with PHP. With that, you can leverage your existing codebase and PHP expertise for an entirely new world: real-time analytics and business logic working on message streams. During the course of the workshop, we will introduce Apache Storm and take a look at all of its components. We will then skyrocket the applicability of Storm by showing you how to implement its components with PHP. All exercises will be conducted using an example project, the infamous and most exhilarating lolcat kitten game ever conceived: Plan 9 From Outer Kitten. In order to follow the hands-on exercises, you will need a development VM prepared by us with all relevant system components and our project repositories. To make the workshop experience as smooth as possible for all participants, please bring a prepared computer to the workshop, as there will be no time to deal with installation and setup issues. Please download all prerequisites and install them as described: VM, Plan 9 webapp, Plan 9 storm backend (tutorial: https://github.com/DECK36/plan9_workshop_tutorial).
From Mainframe to Microservice: An Introduction to Distributed Systems (Tyler Treat)
An introductory overview of distributed systems—what they are and why they're difficult to build. We explore fundamental ideas and practical concepts in distributed programming. What is the CAP theorem? What is distributed consensus? What are CRDTs? We also look at options for solving the split-brain problem while considering the trade-off of high availability as well as options for scaling shared data.
In this presentation you will learn what the CAP theorem, the PACELC theorem, and the ACID/BASE principles are, and how to use them when describing a distributed database.
This presentation gives an overview of the basic principles of high availability architectures and presents how to deploy FIWARE data management services in high availability configurations.
Distributed Consensus: Making Impossible Possible [Revised] (Heidi Howard)
In this talk, we explore how to construct resilient distributed systems on top of unreliable components, starting, almost two decades ago, with Leslie Lamport's work on organising parliament for a Greek island. We will take a journey to today's datacenters and the systems powering companies like Google, Amazon, and Microsoft. Along the way, we will face interesting impossibility results, machines acting maliciously, and the complexity of today's networks. Ultimately, we will discover how to reach agreement between many parties and, from this, how to construct new fault-tolerant systems that we can depend upon every day.
Event-Driven Infrastructure - Mike Place, SaltStack (DevOpsDays Tel Aviv 2016)
As we move into the age of containerization, it becomes more important than ever to figure out how to automate, monitor, and deploy systems which are resilient and well understood.
In this talk, we'll discuss methods for building infrastructures with universal event buses and reactive systems which can act as a nervous system for our computing environments.
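To make the event-bus idea concrete, here is a minimal in-process sketch in Python (a toy stand-in, not SaltStack's actual event system; all names are our own):

```python
from collections import defaultdict

class EventBus:
    """Minimal in-process event bus: handlers subscribe to topics and
    react to every event published on those topics."""

    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.handlers[topic]:
            handler(event)

bus = EventBus()
seen = []
# A "reactor" that responds to deployment events on the bus.
bus.subscribe("deploy.finished", lambda e: seen.append(f"restart {e['service']}"))
bus.publish("deploy.finished", {"service": "web"})
print(seen)  # ['restart web']
```

In a real infrastructure the bus spans machines and the reactors trigger remediation, scaling, or alerts, but the decoupling between publishers and subscribers is the same.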
This presentation goes over consensus fundamentals, which consensus algorithms are used in Hyperledger blockchain projects today, and how they work. It was presented at the April 2nd SF Hyperledger Meetup @ PubNub.
Distributed Shared Memory (DSM): an outline
- General architecture
- Design and implementation issues of DSM
- Granularity, and factors influencing block size selection
- Consistency models
- Replacement strategy: which block to replace, and where to place a replaced block
- Thrashing
- Heterogeneous DSM
- Issues: deadlock
Data Structures for Data Privacy: Lessons Learned in Production (Rebecca Bilbro)
In this talk we present a decentralized messaging protocol and storage system in use across North America, Europe, and Southeast Asia. Built at the behest of an international nonprofit working group, the protocol and system are designed to address a unique problem at the intersection of financial crime regulation, distributed ledger technology, and user privacy. In our talk we discuss the many lessons learned in the process of architecting, implementing, and fostering adoption of this system. We present the Secure Envelope, a data structure that employs a combination of methods (symmetric and asymmetric encryption, mTLS, protocol buffers, etc.) to safeguard data privacy both at rest and in flight.
Conflict-Free Replicated Data Types, PyCon 2022 (Rebecca Bilbro)
Jupyter Notebook may be one of the most controversial open source projects released in the last ten years! Love them or hate them, they’ve become a mainstay of data science and machine learning, and a significant part of the Python ecosystem. While Jupyter can simplify experimentation, rapid prototyping, documentation, and visualization, it often impedes version control, code review, and test coverage. Dev teams must accept the good with the bad… but what if they didn’t have to? In this talk we introduce conflict-free replicated data types (CRDT), a special object that supports strong consistency, and which can be used to enhance Jupyter notebooks for a truly collaborative experience.
First proposed by Shapiro et al. in 2011, conflict-free replicated data types (CRDTs) evolved out of the distributed systems community for replication of data across a network of replicas. CRDTs are objects that come with a special guarantee: two different copies of the object can be strongly consistent, meaning they can be kept in sync. While CRDTs have enjoyed a good amount of attention from academia over the last decade, primarily among database and cloud researchers, they have not led to many practical applications for everyday developers. However, recent work by Kleppmann et al. shows CRDTs can be used for real-time rich-text collaboration, creating a "Google Doc"-type experience with any document in a networked file system.
In this talk, we’ll present the basics of CRDTs and demonstrate how they work with examples written in Python. Next, we’ll explain how CRDTs can enable more collaborative Jupyter notebooks, opening up features such as synchronous insertions, diffs, and auto-merges, even with multiple simultaneous contributors!
(Py)testing the Limits of Machine Learning (Rebecca Bilbro)
Despite the hype cycle, each day machine learning becomes a little less magic and a little more real. Predictions increasingly drive our everyday lives, embedded into more of our everyday applications. To support this creative surge, development teams are evolving, integrating novel open source software and state-of-the-art GPU hardware, and bringing on essential new teammates like data ethicists and machine learning engineers. Software teams are also now challenged to build and maintain codebases that are intentionally not fully deterministic.
This nondeterminism can manifest in a number of surprising and oftentimes very stressful ways! Successive runs of model training may produce slight but meaningful variations. Data wrangling pipelines turn out to be extremely sensitive to the order in which transformations are applied, and require thoughtful orchestration to avoid leakage. Model hyperparameters that can be tuned independently may have mutually exclusive conditions. Models can also degrade over time, producing increasingly unreliable predictions. Moreover, open source libraries are living, dynamic things; the latest release of your team's favorite library might cause your code to suddenly behave in unexpected ways.
Put simply, as ML becomes more of an expectation than an exception in our industry, testing has never been more important! Fortunately, we are lucky to have a rich open source ecosystem to support us in our journey to build the next generation of apps in a safe, stable way. In this talk we'll share some hard-won lessons, favorite open source packages, and reusable techniques for testing ML software components.
Anti-Entropy Replication for Cost-Effective Eventual Consistency (Rebecca Bilbro)
Eventually consistent systems are often more cost-effective to implement and maintain than their strongly consistent cousins. Gossip-based anti-entropy methods can be used to improve the consistency of such systems even as they expand geographically.
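A minimal sketch of the idea, assuming replicas hold key -> (timestamp, value) maps with last-write-wins resolution (a deliberate simplification of real gossip protocols):

```python
def anti_entropy(local, remote):
    """One anti-entropy round between two replicas, each a dict of
    key -> (timestamp, value). The entry with the later timestamp wins,
    so repeated pairwise exchanges drive replicas toward the same state."""
    for key in set(local) | set(remote):
        a = local.get(key, (0, None))
        b = remote.get(key, (0, None))
        winner = max(a, b)  # compares timestamps first
        local[key] = winner
        remote[key] = winner

r1 = {"user:1": (5, "alice")}
r2 = {"user:1": (9, "alice w."), "user:2": (3, "bob")}
anti_entropy(r1, r2)
print(r1 == r2)  # True: both replicas converge after one exchange
```

In a deployed system each replica periodically gossips with a randomly chosen peer, so convergence cost grows gently even as the replica set spreads geographically.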
In the machine learning community, we're trained to think of size as inversely proportional to bias, driving us to ever larger datasets, increasingly complex model architectures, and ever better accuracy scores. But bigger doesn't always mean better.
What data quality issues emerge in large datasets? What complications surface as features become more geodistributed (e.g., diurnal patterns, seasonal variations, datetime formatting, multilingual text, etc.)? What happens as models attempt to extrapolate bigger and bigger patterns? Why is it that the pursuit of megamodels has driven a wedge between the ML definition of “bias” and the more colloquial sense of the word?
Perhaps the time has come to move away from monolithic models that reduce rich variations and complexities to a simple argmax on the output layer and instead embrace a new generation of model architectures that are just as organic and diverse as the data they seek to encode.
We live with an abundance of ML resources, from open source tools, to GPU workstations, to cloud-hosted AutoML. What's more, the lines between AI research and everyday ML have blurred; you can recreate a state-of-the-art model from arXiv papers at home. But can you afford to? In this talk, we explore ways to recession-proof your ML process without sacrificing accuracy, explainability, or value.
EuroSciPy 2019: Visual Diagnostics at Scale (Rebecca Bilbro)
The hunt for the most effective machine learning model is hard enough with a modest dataset, and much more so as our data grow! As we search for the optimal combination of features, algorithm, and hyperparameters, we often use tools like histograms, heatmaps, embeddings, and other plots to make our processes more informed and effective. However, large, high-dimensional datasets can prove particularly challenging. In this talk, we’ll explore a suite of visual diagnostics, investigate their strengths and weaknesses in face of increasingly big data, and consider how we can steer the machine learning process, not only purposefully but at scale!
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019 (Rebecca Bilbro)
Machine learning is ultimately a search for the best combination of features, algorithm, and hyperparameters that result in the best performing model. Oftentimes, this leads us to stay in our algorithmic comfort zones, or to resort to automated processes such as grid searches and random walks. Whether we stick to what we know or try many combinations, we are sometimes left wondering if we have actually succeeded.
By enhancing model selection with visual diagnostics, data scientists can inject human guidance to steer the search process. Visualizing feature transformations, algorithmic behavior, cross-validation methods, and model performance allows us a peek into the high-dimensional realm in which our models operate. As we continue to tune our models, trying to minimize both bias and variance, these glimpses allow us to be more strategic in our choices. The result is more effective modeling, speedier results, and greater understanding of underlying processes.
Visualization is an integral part of the data science workflow, but visual diagnostics are directly tied to machine learning transformers and models. The Yellowbrick library extends the scikit-learn API providing a Visualizer object, an estimator that learns from data and produces a visualization as a result. In this tutorial, we will explore feature visualizers, visualizers for classification, clustering, and regression, as well as model analysis visualizers. We'll work through several examples and show how visual diagnostics steer model selection, making machine learning more informed, and more effective.
A Visual Exploration of Distance, Documents, and Distributions (Rebecca Bilbro)
Machine learning often requires us to think spatially and make choices about what it means for two instances to be close or far apart. So which is best - Euclidean? Manhattan? Cosine? It all depends! In this talk, we'll explore open source tools and visual diagnostic strategies for picking good distance metrics when doing machine learning on text.
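To see why the choice matters, here is a stdlib-only comparison of the three metrics on toy term-frequency vectors (illustrative values only):

```python
import math

def euclidean(a, b):
    """Straight-line distance: sensitive to vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sums per-dimension differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity: compares direction, ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

# Two toy term-frequency vectors: same direction, different magnitude,
# e.g. a short document and a longer one about the same topic.
doc_a, doc_b = [1, 2, 0], [2, 4, 0]
print(euclidean(doc_a, doc_b))        # ~2.236
print(manhattan(doc_a, doc_b))        # 3
print(cosine_distance(doc_a, doc_b))  # ~0.0: cosine ignores length
```

For text, cosine often wins precisely because document length is usually noise, but as the talk argues, "it all depends" on what closeness should mean for your data.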
The Incredible Disappearing Data Scientist (Rebecca Bilbro)
The last decade saw advances in compute power combine with an avalanche of open source software development, resulting in a revolution in machine learning and scalable analytics. “Data science” and “data product” are now household terms. This led to a new job description, the Data Scientist, which quickly became one of the most significant, exciting, and misunderstood jobs of the 21st century. One part statistician, one part computer scientist, and one part domain expert, data scientists seem poised to become the most pivotal value creators of the information age. And yet, danger (supposedly) lies ahead: human decisions are increasingly outsourced to algorithms of questionable ethical design; we’re putting everything on the blockchain; and perhaps most disturbingly, data science salaries are dropping precipitously as new graduates and Machine Learning as a Service (MLaaS) offerings flood the market. As we move into a future where predictive analytics is no longer a differentiator but instead a core business function, will data scientists proliferate or be automated out of a job?
In this talk, one humble data scientist attempts to cut through the hype to present an alternate vision of what data science is and can become. If not the “Sexiest Job of the 21st Century" as the Harvard Business Review once quipped, what is it like to be a workaday data scientist? What problems are we solving? How do we integrate with mature engineering teams? How do we engage with clients and product owners? How do we deploy non-deterministic models in production? In particular, we’ll examine critical integration points — technological and otherwise — we are currently tackling, which will ultimately determine our success, and our viability, over the next 10 years.
While data privacy challenges long predate current trends in machine-learning-as-a-service (MLAAS) offerings, predictive APIs do expose significant new attack vectors. To provide users with tailored recommendations, these applications often expose endpoints either to dynamic models or to pre-trained model artifacts, which learn patterns from data to surface insights. Problems arise when training data are collected, stored, and modeled in ways that jeopardize privacy. Even when user data is not exposed directly, private information can often be inferred using a technique called model inversion. In this talk, I discuss current research in black box model inversion and present a machine learning approach to discovering the model families of deployed black box models using only their decision topologies. Prior work suggests the efficacy of model family specific attack vectors (i.e., once the model is no longer a black box, it is easier to exploit). As such, we approach the problem only of model discovery and not of model inversion, reasoning that by solving the problem of model identification, we clear a path for information security and cryptography experts to use domain-specific tools for model inversion.
Learning Machine Learning with Yellowbrick (Rebecca Bilbro)
Yellowbrick is an open source Python library that provides visual diagnostic tools called “Visualizers” that extend the Scikit-Learn API to allow human steering of the model selection process. For teachers and students of machine learning, Yellowbrick can be used as a framework for teaching and understanding a large variety of algorithms and methods.
In machine learning, model selection is a bit more nuanced than simply picking the 'right' or 'wrong' algorithm. In practice, the workflow includes (1) selecting and/or engineering the smallest and most predictive feature set, (2) choosing a set of algorithms from a model family, and (3) tuning the algorithm hyperparameters to optimize performance. Recently, much of this workflow has been automated through grid search methods, standardized APIs, and GUI-based applications. In practice, however, human intuition and guidance can more effectively hone in on quality models than exhaustive search.
This talk presents a new open source Python library, Yellowbrick, which extends the Scikit-Learn API with a visual transformer (visualizer) that can incorporate visualizations of the model selection process into pipelines and modeling workflows. Visualizers enable machine learning practitioners to visually interpret the model selection process, steer workflows toward more predictive models, and avoid common pitfalls and traps. For users, Yellowbrick can help evaluate the performance, stability, and predictive value of machine learning models, and assist in diagnosing problems throughout the machine learning workflow.
Data Intelligence 2017: Building a Gigaword Corpus (Rebecca Bilbro)
2. Question #1
Think about the app you're currently working on. How much consistency does it require?
● I have no idea 😳
● Eventual consistency is good enough for us.
● We require strong consistency.
3. Question #2
Think about the app you're currently working on. How concerned are you about compliance with CCPA, GDPR, LGPD, etc.?
● We have total visibility and control over the region(s) where user data is stored and replicated.
● We added that cookie banner; wasn't that enough?
● It's on our roadmap, but we haven't tackled it yet.
4. Question #3
Think about the app you're currently working on. How well does your app support i18n/l10n?
● I think our frontend uses gettext and CLDR.
● We track geographic deployments and data replication to guarantee consistent UX around the world.
● We haven't started thinking about global markets yet.
5. Table of contents
01 What is Consensus? (Clocks, quorums, and distracted parliamentarians)
02 Commercial Consensus (etcd, Spanner, Aurora, and more)
03 Growing Pains (The downsides of success)
04 An API for Consensus (Consensus made open source)
6. Dr. Rebecca Bilbro
● Founder & CTO, Rotational Labs, LLC
● Adjunct Faculty, Georgetown University
● Applied Text Analysis with Python, O’Reilly
● Creator & Maintainer, Scikit-Yellowbrick
@rebeccabilbro
7. What if data systems were a little smarter?
rotational.io
9. Context: A Single Server
Picture a simple data storage
system where clients can:
- GET(key) → value
- PUT(key, value)
- DEL(key)
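The single-server interface above can be sketched in a few lines of Go (names like `Store` and `NewStore` are illustrative, not from any real system):

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is returned by Get when the key does not exist.
var ErrNotFound = errors.New("not found")

// Store is a single-server key-value store: one process applies every
// operation, so requests have a single total order and the store is
// trivially consistent.
type Store struct {
	data map[string]string
}

func NewStore() *Store {
	return &Store{data: make(map[string]string)}
}

func (s *Store) Put(key, value string) {
	s.data[key] = value
}

func (s *Store) Get(key string) (string, error) {
	v, ok := s.data[key]
	if !ok {
		return "", ErrNotFound
	}
	return v, nil
}

func (s *Store) Del(key string) {
	delete(s.data, key)
}

func main() {
	s := NewStore()
	s.Put("x", "42")
	v, _ := s.Get("x")
	fmt.Println(v) // 42
	s.Del("x")
	_, err := s.Get("x")
	fmt.Println(err) // not found
}
```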
10. Context: A Single Server
The order of the operations determines the responses the system gives.
In a single-server system, the server can determine any order it wants. It is always consistent.
(diagram: six operations, numbered 1-6, applied in whatever order the server chooses)
12. Failure: A Single Server
Failure can occur for many reasons, from crashes to network outages, and is routine.
In the single-server system, when the server fails the entire system becomes unavailable.
Worse, data loss might occur for any information stored in volatile memory.
13. Context: A Distributed System
A system is distributed when it contains more than one server that must communicate.
If one server fails, the system doesn't necessarily become unavailable, because other servers can answer requests.
If data is replicated, we can also avoid data loss.
14. Inconsistency
When multiple servers participate in the system, they need to communicate in order to ensure they remain in the same state.
Communication takes time (latency), and the more servers in the system, the more time it takes to synchronize.
(diagram: PUT(x, 42) returns ok on one replica, but a GET(x) on another replica returns not found)
15. Concurrency
Because of delays in communication, it is possible for two clients to perform operations concurrently. In other words, from the system's perspective, they happen at the same time.
The order of these operations determines the response.
(diagram: PUT(x, 42) and PUT(x, 27) arrive concurrently; GET(x) → ?)
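The race above is easy to reproduce. In this Go sketch (all names invented), a mutex keeps each operation atomic, yet the final value of x still depends on which client's PUT the scheduler happens to run last:

```go
package main

import (
	"fmt"
	"sync"
)

// Store serializes each operation with a mutex. The mutex makes every
// PUT atomic, but it does not decide which client's PUT is applied
// last. That depends on goroutine scheduling, so the final value of a
// concurrently written key is nondeterministic.
type Store struct {
	mu   sync.Mutex
	data map[string]int
}

func (s *Store) Put(key string, value int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = value
}

func (s *Store) Get(key string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.data[key]
}

func main() {
	s := &Store{data: make(map[string]int)}

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); s.Put("x", 42) }() // client A
	go func() { defer wg.Done(); s.Put("x", 27) }() // client B
	wg.Wait()

	// Prints either 42 or 27, depending on which PUT ran last.
	fmt.Println(s.Get("x"))
}
```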
16. Consistency Levels
Strong Consistency: any GET request is guaranteed to return the most recent PUT.
Causal Consistency: a PUT request is delayed only by dependent keys, not all keys.
Eventual Consistency: in the absence of PUT requests, the system eventually becomes consistent.
Monotonic Reads: every GET returns a value at least as up to date as the one before it.
Synchronization and ordering increase the amount of time it takes to serve requests; however, consistency can be relaxed to improve performance.
17. Consistency Levels
Most data systems that exist today are eventually consistent, trading a low likelihood of inconsistency for better performance.
18. Consistency Levels
However, many applications require strong consistency in addition to performance, and so have to rely on centralized systems.
20. Paxos Consensus
Every server is a state machine that can apply commands in a single order.
Each server maintains a log of operations (e.g. GET or PUT), and applies an operation when its entry in the log is committed.
Committing entries happens by majority vote, as follows.
21. Paxos Consensus
When a client makes a request, the server requests a slot in the log in which to apply a command: the "prepare" phase.
22. Paxos Consensus
If the other servers have that slot free, they will reserve it for the requesting server.
23. Paxos Consensus
If a majority of servers respond to the prepare phase, the originating server sends the command to be applied to the log in that slot: the "accept" phase.
24. Paxos Consensus
If a majority of servers reply to the accept phase, the entry in the log is committed.
25. Paxos Consensus
Even if servers fail, decisions can still be made so long as a majority of servers are running.
Once a failed server returns, it can be brought back up to date by the other servers.
Rules for responding to the prepare and accept phases ensure that there will only ever be one log order.
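The prepare/accept flow described in the last few slides can be sketched as a toy, single-slot Paxos round (illustrative only: a real implementation runs over a network, handles competing proposers, and adopts any previously accepted value reported with a promise):

```go
package main

import "fmt"

// Acceptor holds the minimal Paxos acceptor state for one log slot.
type Acceptor struct {
	promised int    // highest ballot promised in the prepare phase
	accepted int    // ballot of the accepted value, 0 if none
	value    string // the accepted command, if any
}

// Prepare: the acceptor promises to ignore ballots lower than this one.
func (a *Acceptor) Prepare(ballot int) bool {
	if ballot <= a.promised {
		return false
	}
	a.promised = ballot
	return true
}

// Accept: the acceptor stores the value unless it promised a higher ballot.
func (a *Acceptor) Accept(ballot int, value string) bool {
	if ballot < a.promised {
		return false
	}
	a.promised = ballot
	a.accepted = ballot
	a.value = value
	return true
}

// Propose runs the two phases against a set of acceptors and reports
// whether the value was committed by a majority.
func Propose(acceptors []*Acceptor, ballot int, value string) bool {
	majority := len(acceptors)/2 + 1

	promises := 0
	for _, a := range acceptors {
		if a.Prepare(ballot) {
			promises++
		}
	}
	if promises < majority {
		return false // could not reserve the slot
	}

	accepts := 0
	for _, a := range acceptors {
		if a.Accept(ballot, value) {
			accepts++
		}
	}
	return accepts >= majority // committed iff a majority accepted
}

func main() {
	cluster := []*Acceptor{{}, {}, {}, {}, {}}
	fmt.Println(Propose(cluster, 1, "PUT(x, 42)")) // true
}
```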
26. Leader Optimization: Raft & Multi-Paxos
An optimization where the prepare phase is performed once, ahead of time, by electing a designated leader.
Heartbeats are used to determine whether the leader has died and a new leader must be elected.
This results in faster responses to clients, but small outages during elections.
27. Ballot Optimization: Mencius
To avoid a prepare phase, slots are granted to leaders in a predetermined fashion (e.g. round-robin).
Clients tend to broadcast to multiple leaders in order to apply the command to the next available slot.
Good for dense workloads with continuous accesses. Compaction and forwarding also help manage "empty" log slots.
28. Optimistic Consensus: Fast Paxos, ePaxos
Attempt to commit a command in the first phase, known as the "fast path". During the fast path commit, conflict detection is applied.
If a conflict is detected, fall back to regular two-phase Paxos (the slow path).
If conflicts are rare, most accesses take the fast path, but each conflict requires three communication phases.
30. Chubby, Zookeeper, etcd, etcetera
2001: Lamport publishes "Paxos Made Simple"
2006: Google develops Chubby, a distributed locking service based on Paxos, to rescue the GFS
2010: Apache Zookeeper offers an open source version of Chubby using a Paxos variant called ZAB
2013: CoreOS releases etcd, based on Raft, to manage a cluster of Container Linux
2014: Google launches k8s using etcd as the configuration store
31. Chubby, Zookeeper, etcd, etcetera
“ZooKeeper… got popular and became the de facto coordination service for cloud computing applications. However, since the bar on using the ZooKeeper interface was so low, it has been abused/misused by many applications. When ZooKeeper is improperly used, it often constituted the bottleneck in performance of these applications and caused scalability problems.”
Ailijiang, Charapko, Demirbas (2016)
32. Chubby, Zookeeper, etcd, etcetera
“Despite the increased choices and specialization of Paxos protocols and Paxos systems, the confusion remains about the proper use cases of these systems and about which systems are more suitable for which tasks.”
Ailijiang, Charapko, Demirbas (2016)
35. But… scaling consensus is hard
● The larger the quorum, the slower it is to respond to requests.
● Things get worse when you have servers in different data centers:
○ Latency increases because of physical limits.
○ Network partitions can cut off groups of servers.
○ Servers respond more quickly to colocated clients.
37. Legalities of Global Systems
Systems are regulated in the countries where the data resides (where the servers are).
Building a global app now means navigating the complex waters of data compliance.
40. “The launch of the augmented reality game Pokémon Go was an unmitigated disaster. Due to extremely overloaded servers from the release’s extreme popularity, users could not download the game, login, create avatars, or find augmented reality artifacts in their locales.
The game world was hosted by a suite of Google Cloud services, primarily backed by the Cloud Datastore, a geographically distributed NoSQL database. Scaling the application to millions of users therefore involved provisioning extra capacity to the database by increasing the number of shards as well as improving load balancing and autoscaling of application logic run in Kubernetes containers.”
Bengfort (2019)
41. 21 February 2018: 501(c)(3) Signal Technology Foundation formed
4 January 2021: WhatsApp updates their privacy policy re: data sharing with Facebook
7 January 2021: Elon Musk tells his 60M followers to switch to Signal
13 January 2021: Signal goes from 10M users to 50M users in under 24 hours, bringing the service down for several days
42. Dropbox Annual Revenue 2016-2020
“Between February and October of 2015, Dropbox successfully relocated 90 percent of an estimated 600 petabytes of its customer data to its in-house network of data centers dubbed Magic Pocket.”
“It was clear to us from the beginning that we’d have to build everything from scratch,” wrote Dropbox infrastructure VP Akhil Gupta on his company blog in 2016, “since there’s nothing in the open source community that’s proven to work reliably at our scale. Few companies in the world have the same requirements for scale of storage as we do.”
Fulton (2020)
44. An API for Consensus (Consensus made open source)
45. scikit-learn
(diagram: a Transformer exposes fit() and transform(); an Estimator exposes fit() and predict(). A pipeline chains a Data Loader through Transformers into a final Estimator: X, y flow in, each Transformer emits X′, and the Estimator's predict() yields ŷ.)
2007: Google Summer of Code project
2010: INRIA releases first open source version
2013: Authors publish "API design for machine learning software: experiences from the scikit-learn project"
2021: 2k contributors, 47k stars, used by 246k projects on GitHub, including Scikit-Yellowbrick!
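To see why this API shape composes so well, here is the Transformer/Estimator/Pipeline pattern transliterated into Go interfaces (toy stages with invented names; scikit-learn itself is a Python library):

```go
package main

import "fmt"

// Every pipeline stage shares a tiny, uniform surface (fit/transform
// or fit/predict), so stages compose without knowing about each other.

type Transformer interface {
	Fit(X [][]float64)
	Transform(X [][]float64) [][]float64
}

type Estimator interface {
	Fit(X [][]float64, y []float64)
	Predict(X [][]float64) []float64
}

// Scale is a toy transformer that "learns" a factor during Fit.
type Scale struct{ factor float64 }

func (s *Scale) Fit(X [][]float64) { s.factor = 0.5 }
func (s *Scale) Transform(X [][]float64) [][]float64 {
	out := make([][]float64, len(X))
	for i, row := range X {
		out[i] = make([]float64, len(row))
		for j, v := range row {
			out[i][j] = v * s.factor
		}
	}
	return out
}

// Mean is a toy estimator that predicts the mean of y for every input.
type Mean struct{ mean float64 }

func (m *Mean) Fit(X [][]float64, y []float64) {
	sum := 0.0
	for _, v := range y {
		sum += v
	}
	m.mean = sum / float64(len(y))
}
func (m *Mean) Predict(X [][]float64) []float64 {
	out := make([]float64, len(X))
	for i := range out {
		out[i] = m.mean
	}
	return out
}

// Pipeline chains transformers into a final estimator, in the spirit
// of sklearn.pipeline.Pipeline.
type Pipeline struct {
	steps []Transformer
	final Estimator
}

func (p *Pipeline) Fit(X [][]float64, y []float64) {
	for _, t := range p.steps {
		t.Fit(X)
		X = t.Transform(X)
	}
	p.final.Fit(X, y)
}

func (p *Pipeline) Predict(X [][]float64) []float64 {
	for _, t := range p.steps {
		X = t.Transform(X)
	}
	return p.final.Predict(X)
}

func main() {
	p := &Pipeline{steps: []Transformer{&Scale{}}, final: &Mean{}}
	p.Fit([][]float64{{2}, {4}}, []float64{1, 3})
	fmt.Println(p.Predict([][]float64{{6}})) // [2]
}
```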
46. The central insight of Scikit-Yellowbrick is that there is no one "best" machine learning model, only a set of best practices for finding the best model for a given dataset.
47. Concur: An API for Consensus
Network: How do we send messages? What does messaging imply?
Peer Management: Who's in? How do we reconfigure? How do newcomers join?
Decision Making: How do we vote? How do we detect conflict? Who's the leader?
Execution: When is a decision final?
// Hypothetical Concur configuration (the system, message, and
// strategy names are the talk's sketch, not a published library):
sys := &system.System{
	QuorumSize:     7,
	Network:        &message.Stream{Protocol: grpc, Volume: broadcast},
	PeerManagement: dynamic,
	DecisionMaking: LeadershipStrategy,
	Execution:      onCommit,
}
if err := sys.Validate(); err != nil {
	return fmt.Errorf("invalid system: %w", err)
}
sys.Concur()