Presentation given at LinuxCon Japan 2016 by Satoshi "Moris" Tagomori (@tagomoris), Treasure Data. Describes various strategies for aggregating log data in a microservices architecture using containers, e.g. Docker.
Fluentd is a data collector for unified logging that allows for structured logging, reliable forwarding, and a pluggable architecture. It is written in Ruby and uses JSON to stream data between containers. Fluentd can aggregate logs from containers in different patterns, such as a single-level or two-level aggregation. A new Docker logging driver called "fluentd" may allow containers to send logs directly to Fluentd.
Good Things and Hard Things of SaaS Development/Operations (Satoshi Tagomori)
This document discusses the good and hard things about developing and operating a SaaS platform. It describes how the backend team at Treasure Data owns and manages various components of their distributed platform. It also discusses how they have modernized their deployment process from a periodic Chef-based approach to using CodeDeploy for more frequent deployments. This allows them to move faster by doing many small releases that minimize the number of affected components and customers.
Fluentd and Distributed Logging at Kubecon (N Masahiro)
This document discusses distributed logging with containers using Fluentd. It notes the challenges of logging in container environments where logs need to be collected from ephemeral containers and transferred to storage. It introduces Fluentd as a flexible data collection tool that can collect logs from containers using various plugins and methods like log drivers, shared volumes, and application libraries. The document discusses deployment patterns for Fluentd including using it for source-side aggregation to buffer and transfer logs more efficiently and for destination-side aggregation to scale log storage.
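The source-side aggregation pattern mentioned above can be sketched in a few lines of Python: a forwarder buffers events locally and flushes them in batches, so the destination receives fewer, larger transfers. The class name and flush policy here are illustrative assumptions, not Fluentd's actual buffer implementation.

```python
class BufferedForwarder:
    """Buffers events and flushes them in batches (illustrative sketch,
    not Fluentd's real buffering code)."""

    def __init__(self, sink, chunk_size=3):
        self.sink = sink              # callable receiving a list of events
        self.chunk_size = chunk_size
        self.buffer = []

    def emit(self, tag, record):
        self.buffer.append({"tag": tag, "record": record})
        if len(self.buffer) >= self.chunk_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(self.buffer)    # one network call per chunk
            self.buffer = []

batches = []
fwd = BufferedForwarder(batches.append, chunk_size=2)
fwd.emit("app.access", {"code": 200})
fwd.emit("app.access", {"code": 500})   # second event triggers a flush
fwd.emit("app.error", {"msg": "boom"})
fwd.flush()                             # drain the remainder
print(len(batches))                     # 2 batches instead of 3 single sends
```

Batching like this is what lets a source-side aggregator amortize connection and serialization costs before logs ever leave the node.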
Fluentd Project Intro at Kubecon 2019 EU (N Masahiro)
Fluentd is a streaming data collector that can unify logging and metrics collection. It collects data from sources using input plugins, processes and filters the data, and outputs it to destinations using output plugins. It is commonly used for container logging, collecting logs from files or Docker and adding metadata before outputting to Elasticsearch or other targets. Fluentbit is a lightweight version of Fluentd that is better suited for edge collection and forwarding logs to a Fluentd instance for aggregation.
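The input-plugin / filter / output-plugin pipeline described above can be illustrated with a minimal Python sketch. The function names and the in-memory "store" are our illustrative assumptions; real Fluentd plugins tail files, enrich records with metadata, and write to backends such as Elasticsearch.

```python
# Minimal collect -> filter -> output pipeline in the spirit of Fluentd.
def input_plugin():
    # stand-in for tailing a container log file
    yield {"tag": "docker.app", "log": "GET /index 200"}

def filter_plugin(event):
    # enrich with metadata, as a kubernetes_metadata filter would
    event["hostname"] = "node-1"
    return event

def output_plugin(event, store):
    # stand-in for writing to Elasticsearch or another target
    store.append(event)

store = []
for ev in input_plugin():
    output_plugin(filter_plugin(ev), store)
print(store[0]["hostname"])  # node-1
```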
Mapbox runs its map tile services across 9 data centers globally to provide high availability and low latency for its customers worldwide. A map request first hits the nearest content delivery network, then the local load balancer which routes it to an application server in that region. The server authenticates the request and retrieves the tile data from a distributed database and object storage, checking a local cache first. This global infrastructure allows Mapbox to meet its service level agreement of 99.9% uptime while minimizing latency for users around the world.
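The "check a local cache first, then fall back to the distributed store" step in that request path is a classic read-through cache. The sketch below shows the idea in Python; the class and key layout are illustrative assumptions, not Mapbox's actual API.

```python
class TileServer:
    """Read-through cache sketch: serve from the local cache when possible,
    otherwise fetch from the distributed store and populate the cache."""

    def __init__(self, backend):
        self.cache = {}
        self.backend = backend
        self.backend_hits = 0

    def get_tile(self, key):
        if key in self.cache:
            return self.cache[key]
        self.backend_hits += 1
        tile = self.backend[key]      # distributed DB / object storage
        self.cache[key] = tile        # populate the local cache
        return tile

backend = {("z1", 0, 0): b"tile-bytes"}
server = TileServer(backend)
server.get_tile(("z1", 0, 0))   # miss: fetched from the backend
server.get_tile(("z1", 0, 0))   # hit: served locally
print(server.backend_hits)       # 1
```

Keeping repeated requests off the backing store is what lets each region absorb traffic while still meeting the latency target.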
This document summarizes recent updates to Norikra, an open source stream processing server. Key updates include:
1) The addition of suspended queries, which allow queries to be temporarily stopped and resumed later, and NULLABLE fields, which handle missing fields as null values.
2) New listener plugins that allow processing query outputs in customizable ways, such as pushing to users, enqueueing to Kafka, or filtering records.
3) Dynamic plugin reloading that loads newly installed plugins without requiring a restart, improving uptime.
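The NULLABLE-fields behaviour in point 1 can be sketched in Python: instead of dropping records that lack a field, fill the missing field with null before query evaluation. The helper below is an illustration of the idea, not Norikra's implementation.

```python
def nullable(records, fields):
    """Treat missing fields as null instead of rejecting the record
    (illustrative sketch of NULLABLE field handling)."""
    for r in records:
        yield {f: r.get(f) for f in fields}

events = [{"user": "a", "status": 200},
          {"user": "b"}]              # second event lacks "status"
rows = list(nullable(events, ["user", "status"]))
print(rows[1])  # {'user': 'b', 'status': None}
```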
Presentation at LinuxCon Europe 2016 (Berlin). I introduced the concepts of logging for containers, aggregation patterns, distributed logging, data serialization, Fluentd internals and architecture, and Fluent Bit and its library API.
Fluentd is an open source data collector that provides a unified logging layer between data sources and backend systems. It decouples these systems by collecting and processing logs and events in a flexible and scalable way. Fluentd uses plugins and buffers to make data collection reliable even in the case of errors or failures. It can forward data between Fluentd nodes for high availability and load balancing.
Ryan will expand on his popular blog series and drill down into the internals of the database. Ryan will discuss optimizing query performance, best indexing schemes, how to manage clustering (including meta and data nodes), the impact of IFQL on the database, the impact of cardinality on performance, TSI, and other internals that will help you architect better solutions around InfluxDB.
This document summarizes a presentation about using FluentD for end-to-end monitoring. It discusses the challenges of monitoring modern distributed applications and introduces FluentD as a highly pluggable framework that can capture logs and metrics from various sources and filter, aggregate, and route the data to various outputs like databases, alerting services, and visualization tools. It then provides examples of using FluentD to address challenges like consolidating logs from microservices and filtering critical events. Potential approaches for scaling FluentD in containerized environments are also discussed.
This document discusses logging for containers and microservices. It covers structured logging formats like JSON, logging drivers for Docker, challenges of logging at scale, and logging solutions like Fluentd and Fluent Bit. It highlights features like pluggable architectures, high performance, and support for aggregation patterns to optimize logging workflows.
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD... (InfluxData)
This document discusses the components and architecture of InfluxDB IOx for replication, durability, and subscriptions. It describes the write buffer, how writes are routed and distributed across shards, replication between buffers to ensure durability, and how subscriptions are handled for querying data.
This document summarizes a presentation about log forwarding at scale. It discusses how logging works internally and requires understanding the logging pipeline of parsing, filtering, buffering and routing logs. It then introduces Fluent Bit as a lightweight log forwarder that can be used to cheaply forward logs from edge nodes to log aggregators in a scalable way, especially in cloud native environments like Kubernetes. Hands-on demos show how Fluent Bit can parse and add metadata to Kubernetes logs.
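The "parse and add metadata to Kubernetes logs" step can be sketched as follows: derive pod, namespace, and container from the log file path and attach them to the record. The path layout below follows the common `/var/log/containers` naming convention, but the regex and field names are our illustrative assumptions, not Fluent Bit's actual filter.

```python
import re

# Sketch of enriching a record with Kubernetes metadata parsed from the
# log path, similar in spirit to a tail input plus kubernetes filter.
PATTERN = re.compile(
    r"(?P<pod>[^_/]+)_(?P<namespace>[^_/]+)_(?P<container>[^/]+)\.log$")

def enrich(path, record):
    m = PATTERN.search(path)
    if m:
        record["kubernetes"] = m.groupdict()
    return record

rec = enrich("/var/log/containers/web-1_prod_nginx.log", {"log": "GET / 200"})
print(rec["kubernetes"]["namespace"])  # prod
```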
- The document discusses logging for containers using Fluentd, an open source data collector. It describes how Fluentd can provide a unified logging layer, reliably forwarding and aggregating logs from multiple containers and applications in a pluggable way.
- Key points covered include using Fluentd with the new Docker logging drivers to directly collect logs from containers, avoiding performance penalties from other approaches. A demo of Fluentd is also mentioned.
Apache Kafka is a distributed streaming platform that can be used to build real-time data pipelines. It publishes and subscribes to streams of records in a fault-tolerant and durable way, and helps process streams of records as they occur. Key characteristics of Kafka include high throughput ingestion, fault-tolerant storage, high availability, scalability, and support for concurrent processing and ordering guarantees. It provides functionality similar to a messaging system but with a unique design as a distributed, partitioned, replicated commit log service.
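Kafka's core abstraction, a partitioned, replicated commit log with per-key ordering, can be reduced to a toy Python model. This is an illustration of the data structure only, not the broker protocol; replication is omitted.

```python
class PartitionedLog:
    """Toy partitioned append-only log in the spirit of Kafka."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)        # append-only commit log
        return p, len(self.partitions[p]) - 1   # (partition, offset)

    def consume(self, partition, offset):
        return self.partitions[partition][offset]

log = PartitionedLog(num_partitions=4)
p1, o1 = log.produce("user-42", "click")
p2, o2 = log.produce("user-42", "purchase")
assert p1 == p2             # same key -> same partition -> ordering preserved
print(log.consume(p1, o1))  # click
```

Hashing the key to a partition is what gives Kafka its per-key ordering guarantee while still allowing concurrent consumption across partitions.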
This document summarizes Fluentd v1.0 and provides details about its new features and release plan. It notes that Fluentd v1.0 will provide stable APIs and compatibility with previous versions while improving plugin APIs, adding Windows and multicore support, and increasing event time resolution to nanoseconds. The release is planned for Q3 2017 to allow feedback on v0.14 before finalizing v1.0 features.
Data Security Governance and Consumer Cloud Storage (Daniel Rohan)
A brief on the most popular consumer cloud storage protocols, along with suggestions to mitigate the threat of data exfiltration via these services on corporate networks.
Oleksandr Nitavskyi, "Kafka deployment at Scale" (Fwdays)
This document discusses Kafka deployment at online advertising company Criteo. Some key points:
1. Criteo uses Kafka to process up to 10 million messages per second and 180 TB of data per day across 13 Kafka clusters spanning multiple datacenters.
2. They define partitioning based on retaining 72GB of data per partition over a 72 hour period. This has led to topics with over 1,300 partitions.
3. Criteo developed an in-house C# Kafka client optimized for their use cases of high throughput and ability to blacklist partitions when needed. They are looking to upgrade to support new Kafka features like idempotent producers and transactions.
4. Monitoring lag is a key metric.
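The sizing rule in point 2 amounts to simple arithmetic: hold about 72 GB per partition over the 72-hour retention window, i.e. roughly 1 GB/hour per partition. The Python sketch below works through that rule; the topic throughput used is an illustrative assumption, not a Criteo figure.

```python
# Back-of-the-envelope partition sizing: cap each partition at ~72 GB
# of data over a 72-hour retention window.
RETENTION_HOURS = 72
GB_PER_PARTITION = 72

def partitions_needed(topic_gb_per_hour):
    total_gb = topic_gb_per_hour * RETENTION_HOURS
    # round up so no partition exceeds the target size
    return -(-total_gb // GB_PER_PARTITION)

print(partitions_needed(1300))  # a very hot topic -> 1300 partitions
```

At this kind of ingest rate the rule quickly produces topics with over a thousand partitions, which matches the scale described above.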
EVCache: Lowering Costs for a Low Latency Cache with RocksDB (Scott Mansfield)
EVCache is a distributed, sharded, replicated key-value store optimized for Netflix's use cases on AWS. It is based on Memcached but uses RocksDB for persistent storage, lowering costs compared to storing all data in memory. Moneta is the next generation EVCache server, using Rend and Mnemonic libraries to intelligently manage data placement in RAM and SSD. This provides high performance for both volatile and batch workloads while reducing costs by 70% compared to the original Memcached-based design.
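The RAM-plus-SSD placement idea behind Moneta can be sketched with a small two-tier store: hot items live in a bounded in-memory LRU, and evictions fall through to a cheaper tier (a plain dict standing in for RocksDB on SSD). This is an illustration of the tiering concept only, not Netflix's implementation.

```python
from collections import OrderedDict

class TieredStore:
    """Two-tier cache sketch: in-memory LRU backed by a slower tier."""

    def __init__(self, ram_capacity):
        self.ram = OrderedDict()
        self.ssd = {}                 # stand-in for RocksDB on SSD
        self.ram_capacity = ram_capacity

    def put(self, key, value):
        self.ram[key] = value
        self.ram.move_to_end(key)
        if len(self.ram) > self.ram_capacity:
            cold_key, cold_val = self.ram.popitem(last=False)
            self.ssd[cold_key] = cold_val     # demote the coldest item

    def get(self, key):
        if key in self.ram:
            self.ram.move_to_end(key)         # refresh recency
            return self.ram[key]
        return self.ssd.get(key)              # slower tier

store = TieredStore(ram_capacity=2)
store.put("a", 1); store.put("b", 2); store.put("c", 3)   # "a" demoted
print("a" in store.ssd, store.get("a"))  # True 1
```

Serving cold keys from the cheaper tier instead of holding everything in memory is exactly where the cost reduction comes from.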
Atmosphere 2014: Centralized log management based on Logstash and Kibana - ca... (PROIDEA)
Nowadays cloud environments are the primary platform for applications. We no longer have multipurpose machines, but rather multiple smaller virtual servers with dedicated roles. Therefore there is a need for one place where we can manage application and system logs. I wish to share the experience I gained while building a centralized log management system using Nxlog, Logstash and Kibana. With these tools we are building a cost-effective and scalable log management platform.
Dariusz Eliasz works in Allegro Group as a Solution Architect, where he is responsible for organizing cooperation with infrastructure teams and leads some of the infrastructure projects. Earlier, as an Expert System Administrator, he was involved in building and maintaining shared infrastructure services (e.g. an image hosting platform) within Allegro Group.
At DiDi Chuxing, China's most popular ride-sharing company, we use HBase when we have a big data problem.
We run three clusters which serve different business needs. We backported the Region Grouping feature to our internal HBase version so we could isolate the different use cases.
We built the Didi HBase Service platform, which is popular amongst engineers at our company. It includes workflow and project management functions as well as a user monitoring view.
Internally we recommend that users use Phoenix to simplify access. Moreover, we used row timestamps and a multidimensional table schema to solve multi-dimension query problems.
C++, Go, Python, and PHP clients get to HBase via thrift2 proxies and QueryServer.
We run many important business applications out of our HBase cluster, such as ETA, GPS, history orders, API metrics monitoring, and Traffic in the Cloud. If you are interested in any aspects listed above, please come to our talk. We would like to share our experiences with you.
How Criteo is managing one of the largest Kafka Infrastructures in Europe (Ricardo Paiva)
This document discusses Criteo's large Kafka infrastructure in Europe. Some key details:
- Criteo uses Kafka to process up to 7 million messages per second (400 billion per day) across about 200 brokers in 13 Kafka clusters across multiple datacenters.
- They have developed an in-house C# Kafka client optimized for their high-throughput use case of no key partitioning and no order guarantees.
- Criteo monitors lag and message ordering using "watermark" messages containing timestamps that are tracked across partitions to measure stream processing lag.
- Data is replicated between clusters for redundancy using custom Kafka Connect connectors that write offsets to the destination.
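The watermark technique above can be sketched simply: periodically produce a message carrying its production timestamp, and have the consumer compare it with its own clock when the message arrives. Timestamps below are passed explicitly so the example is deterministic; the message shape is our illustrative assumption.

```python
# Watermark-based lag measurement sketch.
def make_watermark(produced_at):
    return {"type": "watermark", "ts": produced_at}

def observe_lag(message, consumed_at):
    """Return end-to-end lag in seconds for watermark messages, else None."""
    if message.get("type") == "watermark":
        return consumed_at - message["ts"]
    return None

wm = make_watermark(produced_at=1_000.0)
lag = observe_lag(wm, consumed_at=1_002.5)
print(lag)  # 2.5 seconds between production and consumption
```

Because the watermark flows through the same partitions as real traffic, the measured lag reflects the whole stream-processing path, not just consumer offsets.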
Why You Definitely Don’t Want to Build Your Own Time Series Database (InfluxData)
At Outlyer, an infrastructure monitoring tool, we had to build our own TSDB back in 2015 to support our service. Two years later, we decided to take a different direction after seeing for ourselves how hard it is to build and scale a TSDB. This talk will review our journey, the challenges we hit trying to scale a TSDB for large customers and hopefully talk some people out of trying to build one themselves because it is not easy!
Kafka Summit SF 2017 - MultiCluster, MultiTenant and Hierarchical Kafka Messa... (confluent)
This document discusses scaling challenges with large Kafka clusters and proposes a solution of using multiple, smaller Kafka clusters organized hierarchically. The key points are: 1) Large monolithic Kafka clusters have issues like slow operations and increased latency; 2) The solution is to create many smaller "immutable" Kafka clusters and connect them with a routing service; 3) This allows scaling producers and consumers across clusters rather than just brokers.
This document discusses open source relational databases. It begins by introducing the presenter and topic, which is the current state of components in open source SQL databases. It then covers key components such as the storage engine, query planner, protocols, transaction model, and others. For each component, it discusses the approaches taken by different databases like PostgreSQL, MySQL, CockroachDB, and ClickHouse. It also addresses topics like horizontal scalability and replication strategies. Overall, the document provides a detailed overview and comparison of the architectural components and capabilities across major open source relational database management systems.
Apache Geode is an open source in-memory data grid that provides data distribution, replication and high availability. It can be used for caching, messaging and interactive queries. The presentation discusses Geode concepts like cache, region and member. It provides examples of how large companies use Geode for applications requiring real-time response, high concurrency and global data visibility. Geode's performance comes from minimizing data copying and contention through flexible consistency and partitioning. The project is now hosted by Apache and the community is encouraged to get involved through mailing lists, code contributions and example applications.
Data Analytics Service Company and Its Ruby Usage (Satoshi Tagomori)
This document summarizes Satoshi Tagomori's presentation on Treasure Data, a data analytics service company. It discusses Treasure Data's use of Ruby for various components of its platform including its logging (Fluentd), ETL (Embulk), scheduling (PerfectSched), and storage (PlazmaDB) technologies. The document also provides an overview of Treasure Data's architecture including how it collects, stores, processes, and visualizes customer data using open source tools integrated with services like Hadoop and Presto.
ApacheCon Core: Service Discovery in OSGi: Beyond the JVM using Docker and Co... (Frank Lyaruu)
OSGi offers an excellent service discovery mechanism, but it is limited to services inside the JVM. With Docker nowadays it is trivially easy to deploy all kinds of (micro)services using pretty much any technology stack, so we’d like to discover those as easily as the ones inside the JVM. We will have a look at how we can use the Docker API to discover services in other containers, and how we can use Consul to expand service discovery to other hosts.
The document discusses Oracle TimesTen In-Memory Database architecture, performance tips, and use cases. It provides an overview of TimesTen Classic and Scaleout architectures, how TimesTen handles persistence through checkpointing and transaction logging, and high-performance concurrency controls. The agenda covers TimesTen functionality, architectures, performance optimization, and when to use TimesTen versus other Oracle in-memory options.
Reactive Development: Commands, Actors and Events. Oh My!! (David Hoerster)
Distributed applications are becoming more popular with the increasing popularity of microservices (however you want to define that term). But the principles of distributed application development are key if you want to build a system that is resilient, responsive, elastic and maintainable. In this workshop, we’ll review the principles of CQRS and the Reactive Manifesto, and how they complement each other. We’ll build an application that can handle a large stream of data, and allow users to still have a responsive experience while interacting with real-time and near-real-time data.
We’ll look at Akka.NET as the workhorse inside your services, and how the principles of CQRS can help with your service-to-service communications.
We’ll also look at how Event Sourcing can aid in managing your domain state, and how an event stream can be used to project data for your system for a number of different uses. We’ll build our own simple event store, but also look at commercially available stores, too.
This session will focus on using Akka.NET along with a few other tools and technologies, such as EventStore and MongoDB. The concepts learned in this session will be applicable to a number of different tools, technologies and languages.
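The Event Sourcing idea in that session can be reduced to its core in a few lines. The talk uses Akka.NET; the Python rendering and the account domain below are our illustrative choices. State is not stored directly but recomputed by folding (projecting) the append-only event stream.

```python
# Minimal event-sourcing sketch: state is a fold over an event stream.
events = []  # our "event store": an append-only list

def apply(state, event):
    kind, amount = event
    if kind == "Deposited":
        return state + amount
    if kind == "Withdrawn":
        return state - amount
    return state

def current_balance():
    balance = 0
    for e in events:          # replay / project the stream
        balance = apply(balance, e)
    return balance

events.append(("Deposited", 100))
events.append(("Withdrawn", 30))
print(current_balance())  # 70
```

Because the events are the source of truth, the same stream can be projected into any number of read models, which is the point made above about using an event stream "for a number of different uses".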
The document discusses a tool called the BizTalk Migrator that helps migrate BizTalk applications to Azure Integration Services. It summarizes the tool's capabilities like discovering and parsing BizTalk artifacts. It also discusses what is and isn't supported when migrating things like adapters, pipelines, and orchestrations. The document provides installation instructions and highlights differences between the original and migrated applications that users may encounter.
Conceptos básicos. Seminario web 6: Despliegue de producciónMongoDB
Este es el último seminario web de la serie Conceptos básicos, en la que se realiza una introducción a la base de datos MongoDB. En este seminario web le guiaremos por el despliegue en producción.
This document provides a summary of a presentation on Big Data and NoSQL databases. It introduces the presenters, Melissa Demsak and Don Demsak, and their backgrounds. It then discusses how data storage needs have changed with the rise of Big Data, including the problems created by large volumes of data. The presentation contrasts traditional relational database implementations with NoSQL data stores, identifying five categories of NoSQL data models: document, key-value, graph, and column family. It provides examples of databases that fall under each category. The presentation concludes with a comparison of real-world scenarios and which data storage solutions might be best suited to each scenario.
MongoDB and Machine Learning with FlowableFlowable
Joram Barrez, Principal Software Engineer at Flowable, explains how to run Flowable on MongoDB.
It was presented at the Flowfest 2018 in Barcelona, Spain
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)Emprovise
Highlights of AWS ReInvent 2023 in Las Vegas. Contains new announcements, deep dive into existing services and best practices, recommended design patterns.
- MongoDB is well-suited for systems of engagement that have demanding real-time requirements, diverse and mixed data sets, massive concurrency, global deployment, and no downtime tolerance.
- It performs well for workloads with mixed reads, writes, and updates and scales horizontally on demand. However, it is less suited for analytical workloads, data warehousing, business intelligence, or transaction processing workloads.
- MongoDB shines for use cases involving single views of data, mobile and geospatial applications, real-time analytics, catalogs, personalization, content management, and log aggregation. It is less optimal for workloads requiring joins, full collection scans, high-latency writes, or five nines u
(A talk given at Wix R&D in Dnipro, Ukraine on March 2017. Video available at https://www.youtube.com/watch?v=eIX33mQdkAI&feature=youtu.be)
While microservices are conceptually simple, it's a deep rabbit hole to go down. Deceptively simple questions can have far-reaching implications: Which communication protocol should I choose? Is event-driven the way to go? What monitoring tools should I put in place?
In this talk we'll cover some of the fundamental questions, outline the solutions adopted or developed by Wix, and share our hindsight on what worked well for us, what didn't and thoughts on future directions for our stack.
Latest (storage IO) patterns for cloud-native applications OpenEBS
Applying micro service patterns to storage giving each workload its own Container Attached Storage (CAS) system. This puts the DevOps persona within full control of the storage requirements and brings data agility to k8s persistent workloads. We will go over the concept and the implementation of CAS, as well as its orchestration.
Centralizing Kubernetes and Container OperationsKublr
While developers see and realize the benefits of Kubernetes, how it improves efficiencies, saves time, and enables focus on the unique business requirements of each project; InfoSec, infrastructure, and software operations teams still face challenges when managing a new set of tools and technologies, and integrating them into an existing enterprise infrastructure.
These meetup slides go over what’s needed for a general architecture of a centralized Kubernetes operations layer based on open source components such as Prometheus, Grafana, ELK Stack, Keycloak, etc., and how to set up reliable clusters and multi-master configuration without a load balancer. It also outlines how these components should be combined into an operations-friendly enterprise Kubernetes management platform with centralized monitoring and log collection, identity and access management, backup and disaster recovery, and infrastructure management capabilities. This presentation will show real-world open source projects use cases to implement an ops-friendly environment.
Check out this and more webinars in our BrightTalk channel: https://goo.gl/QPE5rZ
This presentation will describe how to go beyond a "Hello world" stream application and build a real-time data-driven product. We will present architectural patterns, go through tradeoffs and considerations when deciding on technology and implementation strategy, and describe how to put the pieces together. We will also cover necessary practical pieces for building real products: testing streaming applications, and how to evolve products over time.
Presented at highloadstrategy.com 2016 by Øyvind Løkling (Schibsted Products & Technology), joint work with Lars Albertsson (independent, www.mapflat.com).
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleEvan Chan
My keynote presentation about how we developed FiloDB, a distributed, Prometheus-compatible time series database, productionized it at Apple and scaled it out to handle a huge amount of operational data, based on the stack of Kafka, Cassandra, Scala/Akka.
Following simple patterns of good application design can allow you to scale your application for your customers easily. This presentation dives into the 12 factor application design and demo how this applies to containers and deployments on Amazon ECS and Fargate. We'll take a look at tooling that can be used to simplify your workflow and help you adopt the principles of the 12 factor application.
Kubernetes – An open platform for container orchestrationinovex GmbH
Datum: 30.08.2017
Event: GridKA School 2017
Speaker: Johannes M. Scheuermann
Mehr Tech-Vorträge: https://www.inovex.de/de/content-pool/vortraege/
Mehr Tech-Artikel: https://www.inovex.de/blog/
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...Lucas Jellema
This presentation gives an brief overview of the history of relational databases, ACID and SQL and presents some of the key strentgths and potential weaknesses. It introduces the rise of NoSQL - why it arose, what is entails, when to use it. The presentation focuses on MongoDB as prime example of NoSQL document store and it shows how to interact with MongoDB from JavaScript (NodeJS) and Java.
Similar to Distributed Logging Architecture in the Container Era (20)
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Marlon Dumas
This webinar discusses the limitations of traditional approaches for business process simulation based on had-crafted model with restrictive assumptions. It shows how process mining techniques can be assembled together to discover high-fidelity digital twins of end-to-end processes from event data.
Enhanced data collection methods can help uncover the true extent of child abuse and neglect. This includes Integrated Data Systems from various sources (e.g., schools, healthcare providers, social services) to identify patterns and potential cases of abuse and neglect.
Generative Classifiers: Classifying with Bayesian decision theory, Bayes’ rule, Naïve Bayes classifier.
Discriminative Classifiers: Logistic Regression, Decision Trees: Training and Visualizing a Decision Tree, Making Predictions, Estimating Class Probabilities, The CART Training Algorithm, Attribute selection measures- Gini impurity; Entropy, Regularization Hyperparameters, Regression Trees, Linear Support vector machines.
We are pleased to share with you the latest VCOSA statistical report on the cotton and yarn industry for the month of March 2024.
Starting from January 2024, the full weekly and monthly reports will only be available for free to VCOSA members. To access the complete weekly report with figures, charts, and detailed analysis of the cotton fiber market in the past week, interested parties are kindly requested to contact VCOSA to subscribe to the newsletter.
5. Topics
• Microservices and logging in various industries
• Difficulties of logging with containers
• Distributed logging architecture
• Patterns of distributed logging architecture
• Case Study: Docker and Fluentd
7. Logging in Various Industries
• Web access logs
• Views/visitors on media
• Views/clicks on Ads
• Commercial transactions (e-commerce, games, ...)
• Data from devices
• Operation logs on Apps of phones
• Various sensor data
8. Microservices and Logging
• Monolithic service
• one service produces all data
about a user's behavior
• Microservices
• many services produce data
about a user's access
• logs must be collected
from many services to know
what is happening
(Diagram: users → one service → one log stream; users → many services → many log streams)
10. Containers:
"a must" for microservices
• Dividing a service into smaller services
• each service requires fewer computing resources
(VMs -> containers)
• Making services independent from each other
• but it is very difficult :(
• some dependencies must be resolved even in the
development environment
(containers on desktop)
11. Redesign Logging: Why?
• No permanent storages
• No fixed physical/network address
• No fixed mapping between servers and roles
• We should parse/label logs at the source and push
them to their destination ASAP
12. Containers:
immutable & disposable
• No permanent storage
• Where do we write logs?
• files in the container
→ gone with the container instance 😞
• directories shared from the host
→ hosts are shared by many containers/services ☹
• TODO: ship logs out of containers ASAP
13. Containers:
unfixed addresses
• No fixed physical / network address
• Where should we go to fetch logs?
• Service discovery (e.g., consul)
→ one more component 😞
• rsync? ssh+tail? something else? Are they installed in containers?
→ one more tool to depend on ☹
• TODO: push logs to anywhere from containers
14. Containers:
instances per role
• No fixed mapping between servers and roles
• How can we parse / store these logs?
• Central repository of log formats
→ very hard to maintain 😞
• Label logs by source address
→ many containers/roles per host ☹
• TODO: label & parse logs at source of logs
17. Collecting and Storing Data
• Parse/Label (collector)
• Raw logs are not good for processing
• Convert logs to structured data (key-value pairs)
• Split/Sort (aggregator)
• Mixed logs are not good for searching
• Split the whole data stream into per-service streams
• Store (destination)
• Format logs (records) as the destination expects
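The parse/label step above can be sketched in a few lines of Ruby. This is a minimal, hypothetical example (the log format and field names are assumptions, not from the talk) that turns a raw access-log line into a structured key-value record:

```ruby
require 'json'
require 'time'

# Hypothetical Apache-style access log pattern with named captures
LINE = /^(?<host>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<method>\S+) (?<path>\S+) \S+" (?<code>\d+) (?<size>\d+)/

def parse_access_log(line)
  m = LINE.match(line)
  return nil unless m
  {
    'host'   => m[:host],
    'time'   => Time.strptime(m[:time], '%d/%b/%Y:%H:%M:%S %z').iso8601,
    'method' => m[:method],
    'path'   => m[:path],
    'code'   => m[:code].to_i,
    'size'   => m[:size].to_i,
  }
end

record = parse_access_log('192.0.2.1 - - [09/Jul/2016:12:34:56 +0900] "GET /index.html HTTP/1.1" 200 512')
puts JSON.generate(record)
```

Once logs are key-value pairs like this, the aggregator can split streams by any field and the destination can store them without re-parsing.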
18. Scaling Logging
• Network traffic
• CPU load to parse / format
• Parse logs on each collector (distributed)
• Format logs on aggregator (to be distributed)
• Capability
• Make aggregators redundant
• Controlling delay
• to bound how long until we can see what's
happening in our systems
22. Without Source Aggregation
• Pros:
• Simple configuration
• Cons:
• fixed aggregator (endpoint) address
• many network connections
• high load on the aggregator
(Diagram: collectors → single aggregator)
23. With Source Aggregation
• Pros:
• fewer connections
• lower load on the aggregator
• less configuration in containers
(just specify localhost)
• highly flexible configuration
(deploy only the aggregate containers)
• Cons:
• slightly more resource usage (+1 container per host)
(Diagram: containers → per-host aggregate container → aggregator)
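A minimal Fluentd configuration sketch for the per-host aggregate container (the upstream hostname is hypothetical): it accepts logs from local containers via the forward protocol and relays everything upstream.

```
# fluentd.conf on the per-host aggregate container (sketch)
<source>
  @type forward      # receive logs from containers on localhost
  port 24224
</source>

<match **>
  @type forward      # relay everything to the central aggregator
  <server>
    host aggregator.example.internal
    port 24224
  </server>
</match>
```

Containers only ever need to know "localhost:24224"; retargeting the central aggregator means redeploying only this one container per host.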
25. Without Destination Aggregation
• Pros:
• Fewer nodes
• Simpler configuration
• Cons:
• Storage-side changes affect the collector side
• Worse performance: many small write requests
on storage
26. With Destination Aggregation
• Pros:
• Collector-side configuration is
unaffected by storage-side changes
• Better performance via fine-tuning
the destination-side aggregator
• Cons:
• More nodes
• A bit more complex configuration
(Diagram: collectors → destination-side aggregator → storage)
28. Scaling Up Endpoints
• Pros:
• Simple configuration
in collector nodes
• Cons:
• Limits to how far a single endpoint can scale up
(Diagram: collector nodes → load balancer → backend nodes)
29. Scaling Out Endpoints
• Pros:
• Unlimited scaling
by adding aggregator nodes
• Cons:
• Complex configuration
• Clients need round-robin support
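Fluentd's forward output provides this client-side round-robin itself; a sketch with hypothetical hostnames, listing multiple servers that are load-balanced:

```
# Collector-side sketch: forward output balances across listed servers
<match **>
  @type forward
  <server>
    host aggregator-1.example.internal
    port 24224
  </server>
  <server>
    host aggregator-2.example.internal
    port 24224
  </server>
</match>
```

Adding an aggregator node means adding one more &lt;server&gt; entry on the collectors, which is the configuration cost this slide's "Cons" refers to.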
30. Destination Aggregation × Endpoint Scaling
• Scaling up endpoints
• without destination aggregation: systems in early stages
• with destination aggregation: collecting logs over the Internet, or using queues
• Scaling out endpoints
• without destination aggregation: impossible :(
(collector nodes must know all endpoints → uncontrollable)
• with destination aggregation: collecting logs in the datacenter
33. Why Fluentd?
• Docker Fluentd logging driver
• Docker containers can send logs to Fluentd
directly - less overhead
• Pluggable architecture
• Various destination systems
• Small memory footprint
• Source aggregation requires +1 container per host
• Less additional resource usage ( < 100MB )
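Using the Docker fluentd logging driver looks roughly like this (the image name and tag pattern are placeholders):

```
docker run --log-driver=fluentd \
  --log-opt fluentd-address=localhost:24224 \
  --log-opt tag=docker.{{.Name}} \
  myapp:latest
```

Everything the container writes to STDOUT/STDERR is then shipped to the Fluentd instance at the given address, tagged per container.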
34. Destination aggregation + scaling up
• Sending logs directly over TCP via a Fluentd logger
library in application code
• Same pattern as New Relic
• Easy to implement
- good for startups
(Diagram: application code → endpoint)
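With the fluent-logger gem, sending a structured event from application code takes a few lines. This is a sketch only (the hostname and event fields are hypothetical, and it needs a reachable Fluentd endpoint to actually deliver anything):

```ruby
require 'fluent-logger'   # gem install fluent-logger

log = Fluent::Logger::FluentLogger.new(
  'myapp',                              # tag prefix
  host: 'aggregator.example.internal',  # hypothetical endpoint
  port: 24224
)

# Emits tag "myapp.purchase" with a structured record
log.post('purchase', 'user_id' => 42, 'amount' => 1200)
```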
35. Source aggregation + scaling up
• Kubernetes: JSON logger + Fluentd + Elasticsearch
• Applications write logs to STDOUT
• Docker writes logs as JSON in files
• Fluentd
reads logs from files,
parses JSON objects,
writes logs to Elasticsearch
• EFK stack (like the ELK stack)
http://kubernetes.io/docs/getting-started-guides/logging-elasticsearch/
(Diagram: application code → files (JSON) → Fluentd → Elasticsearch)
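A configuration sketch for this pattern (the log path is Docker's default; the Elasticsearch output requires fluent-plugin-elasticsearch, and the hostname is hypothetical):

```
<source>
  @type tail
  path /var/lib/docker/containers/*/*-json.log
  pos_file /var/log/fluentd-docker.pos
  tag docker.*
  <parse>
    @type json         # each line is a JSON object written by Docker
  </parse>
</source>

<match docker.**>
  @type elasticsearch  # fluent-plugin-elasticsearch
  host elasticsearch.example.internal
  port 9200
  logstash_format true
</match>
```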
36. Source aggregation + scaling up/out
• Docker fluentd logging driver + Fluentd + Kafka
• Applications write logs to STDOUT
• Docker sends logs
to localhost Fluentd
• Fluentd
receives logs over TCP,
pushes logs into Kafka
• Highly scalable & less overhead
- very good for huge deployments
(Diagram: application code → Docker → local Fluentd → Kafka)
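A sketch of the per-host Fluentd in this pattern (broker addresses and topic are hypothetical; the kafka2 output comes from fluent-plugin-kafka):

```
<source>
  @type forward        # logs arrive from the Docker fluentd logging driver
  port 24224
</source>

<match docker.**>
  @type kafka2         # fluent-plugin-kafka
  brokers kafka-1:9092,kafka-2:9092
  default_topic logs
  <format>
    @type json
  </format>
</match>
```

Kafka then acts as the scalable, buffered destination, so no separate destination-side aggregator tier is needed.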
37. Source/Destination aggregation +
scaling out
• Docker fluentd logging driver + Fluentd
• Applications write logs to STDOUT
• Docker sends logs
to localhost Fluentd
• Fluentd
receives logs over TCP,
sends logs to aggregator Fluentd nodes
with round-robin load balancing
• Highly flexible
- good for complex data processing
requirements
(Diagram: application code → local Fluentd → aggregator Fluentd → any other storage)
38. What's the Best?
• Writing logs from containers: pick any workable way
• Docker logging driver
• Write logs to files + read/parse them
• Send logs from apps directly
• Make the platform scalable!
• Source aggregation: Fluentd on localhost
• Scalable storage (Kafka, external services, ...):
no destination aggregation + scaling up
• Non-scalable storage (filesystems, RDBMSs, ...):
destination aggregation + scaling out
40. Why OSS?
• The logging layer is an interface
• transparency
• interoperability
• Keep the platform scalable
• in the number of nodes
• in the number of source/destination types