The IPTC's news exchange formats, including NewsML-G2 and QCodes: updating the documentation and fixing a problem with the XML Schema normalizedString type.
In Apache Cassandra Lunch #59: Functions in Cassandra, we discussed the functions that are usable inside of the Cassandra database. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live.
The Science Working Group is an international collaboration of scientific organizations that develops open-source software tools for scientific research. It has 15 member organizations from fields like neutron sources and synchrotrons. The group created reusable software like the General Data Analysis framework and DAWN data analysis workbench. Recent projects included adopting new technologies like OSGi and developing SWMR file support and extensions to DAWN like a Fano factor image filter.
This document provides an overview of Graphite and StatsD. StatsD is a network daemon that listens for statistics like counters or timers over UDP and sends them to Carbon. Carbon is another network daemon that listens for statistics over TCP and stores them on disk using Whisper, a fixed-size database like RRD. Graphite is a web-based interface for visualizing the stored metrics, allowing users to render graphs and create dashboards from the collected time-series data.
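The front of this pipeline can be exercised from a few lines of code. A minimal sketch of the StatsD side (the line format and default UDP port 8125 are real; the metric names are made up for illustration):

```python
import socket

def statsd_packet(metric: str, value: int, metric_type: str = "c") -> bytes:
    """Encode a metric in StatsD's plain-text line format: <name>:<value>|<type>."""
    return f"{metric}:{value}|{metric_type}".encode("ascii")

def send_metric(metric: str, value: int, metric_type: str = "c",
                host: str = "127.0.0.1", port: int = 8125) -> None:
    # UDP is fire-and-forget: a missing StatsD daemon never blocks the application.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(statsd_packet(metric, value, metric_type), (host, port))

send_metric("page.views", 1)           # a counter
send_metric("db.query_ms", 320, "ms")  # a timer
```

StatsD then aggregates these raw packets per flush interval before forwarding summaries to Carbon.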
Red Hat Summit 2017 - LT107508 - Better Managing your Red Hat footprint with ... - Miguel Pérez Colino
The Red Hat portfolio is well suited to deliver cloud solutions to customers. We're going beyond solution-building and delivery to improve operations by launching an effort to improve log aggregation. Learn how new capabilities can help you better manage your Red Hat footprint.
This document discusses using InfluxDB and Grafana together for analyzing IoT data. It provides benchmarks showing InfluxDB's fast performance for ingesting and querying large time series data compared to PostgreSQL. It also covers hosting InfluxDB on AWS for horizontal scalability and high availability using InfluxDB relays.
MongoDB IoT City Tour EINDHOVEN: Managing the Database Complexity - MongoDB
The value of the fast growing class of NoSQL databases is the ability to handle high velocity and volumes of data while enabling greater agility with dynamic schemas. MongoDB gives you those benefits while also providing a rich querying capability and a document model for developer productivity. Arthur Viegers will outline the reasons for MongoDB's popularity in IoT applications and how you can leverage the core concepts of NoSQL to build robust and highly scalable IoT applications.
This document provides an overview of a toy model for simulating particle collisions. It describes sampling particle data from experimental measurements to generate events. A jet finding algorithm is used to cluster particles into jets using FastJet. The current status indicates particle generation works as expected but jet finding results appear buggy. Next steps involve analyzing jet distributions and performance of the jet finder on simulated events without embedded jets. Possible extensions include jet fragmentation.
Challenges in knowledge graph visualization - GraphAware
Visualizing a complex graph is a task of simplifying the graph and providing well-thought-out visual cues; the best UI goes unnoticed. This talk will summarize current approaches and present a novel user-interaction pattern that takes advantage of the performant Neo4j graph engine.
About the speaker:
Jan Zak - Senior Consultant at GraphAware; Data visualizations, graphs, maps; Based in Prague, Czech Republic
C* Summit 2013: Time-Series Metrics with Cassandra by Mike Heffner - DataStax Academy
This document discusses using Cassandra to store time-series metrics data. It describes how the schema was matched to storage by using a measurement column family with rows organized by metric ID and time. It also covers optimizing data expiration through techniques like TTL expiration, synchronized compactions, and leveraging immutable sstable modification times. Effective monitoring is emphasized as well, including dashboards to track the ring and using Cassandra log volumes to identify issues.
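The row-organization idea, partitioning measurements by metric ID and a bounded time bucket, can be sketched as a small key function. This is an illustration of the pattern, not the talk's actual schema; the names and the one-day bucket width are hypothetical:

```python
ROW_WIDTH_SECONDS = 24 * 3600  # hypothetical: one row per metric per day

def row_key(metric_id: str, ts: float) -> str:
    """Bucket a measurement into a bounded-width row: <metric>:<bucket-start>.
    Bounding row width keeps partitions small and lets whole rows expire together
    (e.g. via TTL), instead of tombstoning individual columns."""
    bucket = int(ts) - int(ts) % ROW_WIDTH_SECONDS
    return f"{metric_id}:{bucket}"

# Two measurements on the same day land in the same row...
assert row_key("cpu.user", 950400) == row_key("cpu.user", 1036799)
# ...and the next day starts a fresh row.
assert row_key("cpu.user", 1036800) != row_key("cpu.user", 1036799)
```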
You’ve spent considerable time picking your orchestrator, choosing the right cloud provider and configuring all the intricate details of your new Docker environment, but what about monitoring? In this talk we will cover the tools available on the market: upsides, downsides and upcoming changes. We’ll open the floor to questions, comments and feedback for each tool, so you have a complete view on the monitoring landscape.
Temporal Performance Modelling of Serverless Computing Platforms - WoSC6 - Nima Mahmoudi
This presentation is an overview of the "Temporal Performance Modeling of Serverless Computing Platforms" paper published at the Sixth International Workshop on Serverless Computing (WoSC6) 2020, part of the IEEE Middleware conference.
Authors: Nima Mahmoudi and Hamzeh Khazaei
Paper: https://www.serverlesscomputing.org/wosc6/#p1
Preprint and Artifacts: https://research.nima-dev.com/publication/mahmoudi-2020-tempperf/
Full Presentation: https://youtu.be/9r3j_1B5t8c
Lightning Talk (1 min): https://youtu.be/E5KigIq0Z1E
PACS Lab: https://pacs.eecs.yorku.ca/
This document summarizes CloudModule, a Zabbix loadable module that enables monitoring of hybrid cloud environments by integrating with Apache Deltacloud. It automatically registers host and metric information from cloud instances and supports monitoring instance details like hardware profile and state, as well as common EC2 metrics from CloudWatch like CPU utilization, network traffic, and disk usage. The module architecture involves Zabbix communicating with Deltacloud and AWS through the CloudModule to discover instances and metrics and store them in a shared CloudCache.
Next-generation API Development with GraphQL and Prisma - Nikolas Burk
This document summarizes a presentation about next-generation API development with GraphQL and Prisma. The presentation covers an introduction to GraphQL, understanding GraphQL servers, and building GraphQL servers with Prisma and Nexus. Key points include: GraphQL is a query language for APIs that allows clients to request specific data in a single request; Prisma helps implement GraphQL resolvers against a database by providing type-safe database access, migrations, and other tools; Prisma and GraphQL work well together by saving boilerplate and ensuring end-to-end type safety from database to frontend.
In a short time, Kafka has become a central tool in data-analytics architectures.
This distributed log is built on a few simple principles that make it a fast, robust tool with near-linear scalability. We propose to explore some of its technical concepts hands-on by coding a message production/consumption system.
Required equipment: a computer (Unix machine or VM only, since Kafka does not run under Windows) with your favorite IDE (IntelliJ/Eclipse) and Maven or SBT pre-installed. You will also need to clone the repository: git clone https://github.com/xebia-france/kafka_the_north_face
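Kafka's central abstraction, an append-only log that each consumer reads from its own offset, can be sketched as an in-memory toy. This illustrates the concept only; it is not the real client API used in the workshop:

```python
class Log:
    """Toy append-only log: producers append, consumers poll from an offset."""
    def __init__(self):
        self.records = []

    def produce(self, value) -> int:
        self.records.append(value)
        return len(self.records) - 1  # offset of the appended record

class Consumer:
    """Each consumer tracks its own offset, so reading never mutates the log."""
    def __init__(self, log: Log):
        self.log, self.offset = log, 0

    def poll(self, max_records: int = 10):
        batch = self.log.records[self.offset:self.offset + max_records]
        self.offset += len(batch)
        return batch

log = Log()
for msg in ("a", "b", "c"):
    log.produce(msg)
c1, c2 = Consumer(log), Consumer(log)
assert c1.poll(2) == ["a", "b"]
assert c2.poll() == ["a", "b", "c"]  # independent offsets: c2 still sees everything
assert c1.poll() == ["c"]
```

Because consumption is just "advance an offset over an immutable sequence", many consumers can replay the same data independently, which is what makes the real Kafka scale so well.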
The document discusses updates and new features for InfluxDB Platform in year 1, including new AWS regions for InfluxDB Cloud, pricing and signup improvements, expanded Telegraf plugin and SDK support, InfluxDB templates for sharing configurations, performance improvements for querying and visualizing data, alerting capabilities, and open sourcing InfluxDB 2.0 with a single binary. It also advertises upcoming talks on integrating OSS at the edge and Cloud at the core, and seamlessly migrating from InfluxDB 1.x to 2.0.
This document summarizes Netflix's big data capabilities and how they use Tableau to analyze and visualize their data. Some key points:
1. Netflix collects up to 100 billion data events per day across multiple tables exceeding 10 billion rows daily, totaling over 2 petabytes of compressed data stored in Amazon S3 buckets.
2. Their Hadoop cluster contains 2,000 EC2 nodes with 22.5 terabytes of RAM used to process this massive amount of data.
3. Tableau is used across many Netflix teams like Data Science, Platform, and IT to visually explore, analyze, and present their big data in a more user-friendly way than Excel.
4. Tableau enables teams
Critical Run files can be missing or corrupt after the Run folder is transferred from the HiSeq storage to the cluster storage. This presentation discusses the issue and suggests four workarounds.
The Current Messaging Landscape: RabbitMQ, ZeroMQ, nsq, Kafka - All Things Open
The document discusses Michael Laing's role as an architect at Edge Engineering and his work on various projects. It mentions his "nyt⨍aбrik" project, work with distributed graphs using Cassandra and Titan, and notes he is exploring using Spark Streaming for online analytical processing. The document contains technical details about distributed systems, databases, and graph structures.
MongoDB - Warehouse and Aggregator of Events - Maxim Ligus
This document discusses using MongoDB to warehouse and aggregate events from different sources. MongoDB can scale simply to handle large volumes of event data, provide 99.999% uptime, and integrate smoothly with other infrastructure components. It describes how MongoDB can distribute data across multiple shards to improve performance and scale to handle large workloads of event data over long retention periods in a cost effective manner using reasonable hardware requirements. The document compares MongoDB to Elasticsearch and provides an overview of how event data would flow through the system from ingestion to storage to retrieval.
The document summarizes the evolution of the ELK stack architecture at a company from a single cluster handling all data to three specialized clusters for logs, core data processing, and a testing environment. It also provides monitoring strategies using Elastizabbix to track cluster metrics and configure alerts in Zabbix. Key lessons learned are discussed around data modeling, indexing performance, and common query issues.
Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19 - Sujit Pal
Python has a great ecosystem of tools for natural language processing (NLP) pipelines, but challenges arise when data sizes and computational complexity grow. Best case, a pipeline is left to run overnight or even over several days. Worst case, certain analyses or computations are just not possible. Dask is a Python-native parallel processing tool that enables Python users to easily scale their code across a cluster of machines.
This talk presents an example of an NLP entity extraction pipeline using SciSpacy with Dask for parallelization, which was built and executed on Saturn Cloud. Saturn Cloud is an end-to-end data science and machine learning platform that provides an easy interface for Python environments and Dask clusters, removing many barriers to accessing parallel computing. This pipeline extracts named entities from the CORD-19 dataset, using trained models from the SciSpaCy project, and makes them available for downstream tasks in the form of structured Parquet files. We will provide an introduction to Dask and Saturn Cloud, then walk through the NLP code.
Code-first GraphQL Server Development with Prisma - Nikolas Burk
This document discusses code-first and SDL-first approaches to building GraphQL schemas and servers. It defines the terminology and compares the two approaches. Code-first involves programmatically defining types and resolvers, while SDL-first uses a string-based schema definition language. Both have tradeoffs like inconsistencies or lack of tooling for SDL-first, and lack of documentation for code-first. Prisma is introduced as a tool that can generate a GraphQL schema from a database using either approach. The document concludes with a demonstration of building a GraphQL server and schema with Prisma and Nexus using a code-first approach.
This document discusses logging for containers and microservices. It covers structured logging formats like JSON, logging drivers for Docker, challenges of logging at scale, and logging solutions like Fluentd and Fluent Bit. It highlights features like pluggable architectures, high performance, and support for aggregation patterns to optimize logging workflows.
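A minimal sketch of structured logging with the standard library (the field names are an arbitrary choice; Fluentd and Fluent Bit both ship JSON parsers for one-object-per-line output like this):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, ready for log shippers."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("order placed")  # prints: {"level": "INFO", "logger": "orders", "message": "order placed"}
```

Structured output like this avoids the fragile regex parsing that plain-text logs require downstream.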
My Talk at GCPUG-Taiwan on 2015/5/8.
You use BigQuery with SQL, but BigQuery's internals are very different from those of the traditional relational database systems you may be familiar with.
One way to understand how BigQuery works is to look at the cost you pay for it: knowing how to save money while using BigQuery means understanding, to some extent, how BigQuery works.
In this session, let’s talk about practical knowledge (saving money) and exciting technology (how BigQuery works)!
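The cost model is simple to sketch: on-demand queries are billed by the bytes scanned in the columns a query actually references, which is why selecting one column is far cheaper than SELECT * over a columnar store. Assuming the $5-per-TB on-demand list price from around the time of this talk:

```python
PRICE_PER_TB = 5.00  # USD; assumed on-demand list price at the time of the talk

def query_cost(bytes_scanned: int) -> float:
    """On-demand BigQuery bills by bytes scanned in the referenced columns,
    not by rows returned, so filtering rows does not reduce the bill."""
    return bytes_scanned / 2**40 * PRICE_PER_TB

full_table = query_cost(100 * 2**40)  # SELECT * over a 100 TB table
one_column = query_cost(8 * 10**9)    # one 8-byte INT64 column over 1 billion rows
assert full_table == 500.0
assert round(one_column, 4) == 0.0364
```

The practical lesson falls straight out of the model: name only the columns you need.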
DDS Advanced Tutorial - OMG June 2013 Berlin Meeting - Jaime Martin Losa
An extended, in-depth tutorial explaining how to fully exploit the standard's unique communication capabilities. Presented at the OMG June 2013 Berlin Meeting.
Users upgrading to DDS from a homegrown solution or a legacy-messaging infrastructure often limit themselves to using its most basic publish-subscribe features. This allows applications to take advantage of reliable multicast and other performance and scalability features of the DDS wire protocol, as well as the enhanced robustness of the DDS peer-to-peer architecture. However, applications that do not use DDS's data-centricity do not take advantage of many of its QoS-related, scalability and availability features, such as the KeepLast History Cache, Instance Ownership and Deadline Monitoring. As a consequence some developers duplicate these features in custom application code, resulting in increased costs, lower performance, and compromised portability and interoperability.
This tutorial will formally define the data-centric publish-subscribe model as specified in the OMG DDS specification and define a set of best-practice guidelines and patterns for the design and implementation of systems based on DDS.
Managing your Black Friday Logs - NDC Oslo - David Pilato
Monitoring an entire application is not a simple task, but with the right tools it is not a hard task either. However, events like Black Friday can push your application to the limit and even cause crashes. As the system is stressed, it generates many more logs, which may crash the monitoring system as well. In this talk I will walk through the best practices when using the Elastic Stack to centralize and monitor your logs. I will also share some tricks to help you cope with the huge increase in traffic typical of Black Friday.
Topics include:
* monitoring architectures
* optimal bulk size
* distributing the load
* index and shard size
* optimizing disk IO
Takeaway: best practices when building a monitoring system with the Elastic Stack, advanced tuning to optimize and increase event ingestion performance.
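The bulk-size point above can be sketched as a chunker bounded by both document count and payload bytes. The bounds shown are placeholders, not recommendations; the talk's advice is to benchmark them for your documents and cluster:

```python
def chunk_bulks(events, max_docs=1000, max_bytes=5 * 2**20):
    """Split a stream of JSON documents into bulk requests bounded by both
    document count and payload size. The right bounds depend on document
    size and cluster capacity, so measure rather than guess."""
    bulk, size = [], 0
    for doc in events:
        doc_size = len(doc.encode("utf-8"))
        if bulk and (len(bulk) >= max_docs or size + doc_size > max_bytes):
            yield bulk
            bulk, size = [], 0
        bulk.append(doc)
        size += doc_size
    if bulk:
        yield bulk

docs = ['{"msg": "event %d"}' % i for i in range(25)]
bulks = list(chunk_bulks(docs, max_docs=10))
assert [len(b) for b in bulks] == [10, 10, 5]
```

Bounding by bytes as well as count matters because a few unusually large documents can blow past a count-only limit.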
This document provides an overview of Weather.com's analytics architecture using Apache Cassandra and Spark. It summarizes Weather.com's initial attempts using Cassandra, lessons learned, and its improved architecture. The improved architecture uses Cassandra for streaming event data with time-window compaction, stores all other data in Amazon S3 for batch processing in Spark, and replaces Kafka with Amazon SQS for event ingestion. It discusses best practices for data modeling in Cassandra including partitioning, secondary indexes, and avoiding wide rows and nulls. The document also highlights how Weather.com uses Apache Zeppelin notebooks for data exploration and visualization.
Managing your Black Friday logs - Code Europe - David Pilato
The document discusses optimally configuring Elasticsearch clusters for ingesting time-based data like logs. It recommends using time-based indices with a new index created each day. It also discusses techniques for scaling clusters by adding more shards as data volumes increase and distributing the data across nodes to avoid bottlenecks. The optimal bulk size for indexing may vary depending on factors like document size and should be tested.
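The daily-index idea fits in a few lines; the `logs-YYYY.MM.DD` naming below follows the common Logstash convention:

```python
from datetime import datetime, timezone

def daily_index(prefix: str, ts: datetime) -> str:
    """Route a log event to its daily index. Retention then becomes
    'delete indices older than N days', which is far cheaper than
    deleting individual documents."""
    return f"{prefix}-{ts:%Y.%m.%d}"

event_time = datetime(2017, 11, 24, 13, 37, tzinfo=timezone.utc)
assert daily_index("logs", event_time) == "logs-2017.11.24"
```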
This document provides an overview of streaming analytics, including definitions, common use cases, and key concepts like streaming engines, processing models, and guarantees. It also provides examples of analyzing data streams using Apache Spark Structured Streaming, Apache Flink, and Kafka Streams APIs. Code snippets demonstrate windowing, triggers, and working with event-time.
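As a library-free illustration of the windowing concept those three APIs share, a tumbling (fixed, non-overlapping) count keyed on event time might look like:

```python
from collections import defaultdict

def tumbling_counts(events, window_seconds):
    """Count events per (window, key), assigning each event to a window
    by its own timestamp (event time), not by when it arrived."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = ts - ts % window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(3, "GET"), (7, "GET"), (12, "POST"), (14, "GET")]
assert tumbling_counts(events, 10) == {(0, "GET"): 2, (10, "POST"): 1, (10, "GET"): 1}
```

What the real engines add on top of this core is handling of late and out-of-order events (watermarks, triggers) and fault-tolerant state.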
Imagine that self-driving cars now exist and are becoming widespread around the world. To facilitate the transition, it's necessary to set up a central service that monitors traffic conditions nationwide: sensors deployed throughout the interstate system report traffic conditions including car speeds, pavement and weather conditions, as well as accidents, construction, and other sources of traffic tie-ups.
MongoDB has been selected as the database for this application. In this webinar, we will walk through designing the application’s schema that will both support the high update and read volumes as well as the data aggregation and analytics queries.
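One plausible shape for such a schema, offered as a sketch rather than the webinar's actual design, is hourly bucketing of readings per sensor: high update volume becomes appends to a single document, and hourly analytics reads fetch one document per sensor:

```python
from datetime import datetime, timezone

def reading_bucket(sensor_id: str, ts: datetime, speed_mph: float) -> dict:
    """Hypothetical bucketed document: one doc per sensor per hour.
    New readings would be appended to 'readings' (e.g. with $push),
    keeping write amplification and document counts low."""
    hour = ts.replace(minute=0, second=0, microsecond=0)
    return {
        "_id": f"{sensor_id}:{hour:%Y%m%d%H}",
        "sensor_id": sensor_id,
        "hour": hour.isoformat(),
        "readings": [{"t": ts.isoformat(), "speed_mph": speed_mph}],
    }

doc = reading_bucket("I90-mm42", datetime(2014, 6, 1, 8, 15, tzinfo=timezone.utc), 61.5)
assert doc["_id"] == "I90-mm42:2014060108"
assert doc["readings"][0]["speed_mph"] == 61.5
```

The sensor ID and field names here are invented; the point is the bucketing pattern, which trades a little read-side unpacking for much cheaper writes.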
Active Data is a data-centric approach to data life-cycle management that uses a Petri net-based model to represent data states and transitions between systems. It exposes distributed data sets and allows clients to react to life cycle events in a scalable way. A prototype implemented the publish-subscribe model and demonstrated handling over 30,000 transitions per second. Active Data provides advantages like formal verification and fault tolerance but requires more work to standardize and represent complex data operations.
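The Petri net idea at the heart of this model, reduced to its core: a transition may fire only when its input places hold enough tokens, and firing moves tokens from inputs to outputs. A toy sketch with illustrative place names (not Active Data's actual model):

```python
def fire(marking: dict, transition: dict) -> dict:
    """Fire a Petri net transition: consume tokens from input places and
    produce tokens in output places; refuse if the transition is not enabled."""
    if any(marking.get(p, 0) < n for p, n in transition["consume"].items()):
        raise ValueError("transition not enabled")
    next_marking = dict(marking)
    for p, n in transition["consume"].items():
        next_marking[p] -= n
    for p, n in transition["produce"].items():
        next_marking[p] = next_marking.get(p, 0) + n
    return next_marking

# A hypothetical life-cycle step: a data set moves from "created" to "transferred".
transfer = {"consume": {"created": 1}, "produce": {"transferred": 1}}
assert fire({"created": 1}, transfer) == {"created": 0, "transferred": 1}
```

Representing life-cycle states as markings is what makes formal verification possible: properties like "a data set cannot be deleted before it is transferred" become reachability questions on the net.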
An Inter-Wiki Page Data Processor for a M2M System @ Matsue, 1 Sep., Eskm2013 - Takashi Yamanoue
The document describes an inter-wiki page data processor for a machine-to-machine (M2M) system. The data processor reads data from sensors or wiki pages, processes the data, and outputs the results to wiki pages or controls actuators. It is controlled by programs written on wiki pages and has functions for communicating with mobile terminals, sensors, actuators, and web pages. An example application involves monitoring temperature, light, and human activity data in a room and controlling LEDs based on the results.
Experiences in ELK with D3.js for Large Log Analysis and VisualizationSurasak Sanguanpong
This document discusses experiences using the ELK stack (Elasticsearch, Logstash, Kibana) and D3.js for large log analysis and visualization. It begins with an overview of network traffic logging at Kasetsart University, which generates over 30 terabytes of log data per day. It then demonstrates setting up an ELK testbed to index these logs in real-time for fast search and exploration in Kibana. Finally, it shows how D3.js can be used to create dynamic, real-time visualizations of the logged data.
Building Conclave: a decentralized, real-time collaborative text editorSun-Li Beatteay
Conclave is an Open Source real time, collaborative text editor for the browser.
I worked in a remote, three person team to:
- Design and build a custom CRDT (conflict-free replicated data type) to increase the throughput speed of operations by over 1000% and guarantee consistency across all users.
- Reduce network latency by up to 3000% by utilizing WebRTC to create a distributed, peer-to-peer architecture.
- Implement a load-balancing algorithm to scale the application to dozens of concurrent users.
- Build a Version Vector to guarantee causality and merge non-commutative operations.
- Give users complete control over their content by removing the need for a central data store and allowing users to download their content directly to their computer.
- Write an extensive case study (http://bit.ly/conclave-site) and Medium article (http://bit.ly/conclave-post) that have garnered more than 20K views.
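The version-vector idea used above for causality tracking can be sketched in a few lines. This is a generic, minimal illustration of the data structure (not Conclave's actual implementation): each site keeps a counter, and one vector "happened before" another iff it is component-wise less-than-or-equal and strictly smaller somewhere.

```python
def vv_increment(vv, site):
    """Return a copy of the version vector with this site's counter bumped."""
    vv = dict(vv)
    vv[site] = vv.get(site, 0) + 1
    return vv

def vv_happened_before(a, b):
    """True iff `a` causally precedes `b`: every counter in `a` is <= its
    counterpart in `b`, and at least one is strictly smaller."""
    keys = set(a) | set(b)
    le = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    lt = any(a.get(k, 0) < b.get(k, 0) for k in keys)
    return le and lt

v1 = vv_increment({}, "alice")          # {'alice': 1}
v2 = vv_increment(v1, "bob")            # builds on v1
concurrent = vv_increment(v1, "carol")  # also builds on v1
print(vv_happened_before(v1, v2))          # True
print(vv_happened_before(v2, concurrent))  # False: concurrent edits
```

When neither vector happened before the other (as with `v2` and `concurrent`), the edits are concurrent, which is exactly the case where a CRDT's merge rules take over.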
For more than 15 years we were very happy to work with JCAPS and its predecessors. When Oracle announced it would cease support for the platform, we created a task force with the mission to assess which platform offered the best assets to become a true successor to JCAPS. We finally selected WSO2, a great, stable, flexible and performant SOA platform. Our experience confirmed it to be the best fit for successful migration projects. We have written a complete set of migration tools for JCAPS (5.x) and eGate (4.x) to WSO2 ESB. These tools really help speed up the migration while preserving quality and even improving performance.
This document describes a U-SQL case study using Azure Data Lake to analyze web analytics data from Cegid websites. It involves developing three U-SQL scripts: 1) to extract and convert JSON log files to TSV format, extracting and pivoting custom dimensions, 2) to aggregate events into sessions, and 3) to further aggregate sessions into visitors. The case study demonstrates how U-SQL allows SQL-like querying and manipulation of large datasets with integrated C# code for custom logic, and discusses best practices for optimizing U-SQL scripts.
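The second step in the case study, aggregating events into sessions, is a common pattern worth making concrete. The following is a minimal Python sketch (not actual U-SQL) using the usual heuristic: a new session starts whenever the gap to the previous event exceeds an inactivity timeout; the timestamps and the 30-minute cutoff are illustrative assumptions.

```python
def sessionize(events, gap_s=1800):
    """Group a single user's event timestamps (seconds) into sessions.
    A new session starts when the gap since the previous event exceeds gap_s."""
    sessions = []
    for t in sorted(events):
        if sessions and t - sessions[-1][-1] <= gap_s:
            sessions[-1].append(t)   # continue the current session
        else:
            sessions.append([t])     # start a new session
    return sessions

# Hypothetical page-view timestamps; 30-minute inactivity gap
print(sessionize([0, 600, 1200, 5000, 5300]))
# [[0, 600, 1200], [5000, 5300]]
```

The third step, rolling sessions up into visitors, is then just another group-by over a visitor identifier, which is why the pipeline factors cleanly into three scripts.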
The document provides an overview of the Open Grid Computing Environments (OGCE) project, which develops and packages software for science gateways and resources. Key components discussed include the OGCE portal for building grid portals, Axis services for resource discovery and prediction, a workflow suite, and JavaScript and tag libraries. The document describes downloading and installing the OGCE software, which can be done with a single command, and discusses some of the portlets, services, and components included in the OGCE toolkit.
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
Flink Forward San Francisco 2022.
Probably everyone who has written stateful Apache Flink applications has used one of the fault-tolerant keyed state primitives ValueState, ListState, and MapState. With RocksDB, however, retrieving and updating items comes at an increased cost that you should be aware of. Sometimes, these costs cannot be avoided with the current API, e.g., for efficient event-time stream-sorting or streaming joins where you need to iterate one or two buffered streams in the right order. With FLIP-220, we are introducing a new state primitive: BinarySortedMultiMapState. This new form of state lets you (a) efficiently store lists of values for a user-provided key, and (b) iterate keyed state in a well-defined sort order. Both features can be backed efficiently by RocksDB with a 2x performance improvement over the current workarounds. This talk will go into the details of the new API and its implementation, present how to use it in your application, and talk about the process of getting it into Flink.
by Nico Kruber
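The two capabilities the talk describes, lists of values per key plus iteration in key-sort order, can be illustrated with a toy in-memory analogue. This is a conceptual sketch in Python, not Flink's actual API or its RocksDB-backed implementation:

```python
import bisect

class SortedMultiMap:
    """Toy analogue of a sorted multimap state primitive:
    store lists of values per user key, iterable in key-sort order."""
    def __init__(self):
        self._keys = []    # user keys (e.g. event timestamps), kept sorted
        self._values = {}  # key -> list of buffered values

    def add(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
            self._values[key] = []
        self._values[key].append(value)

    def ordered_items(self):
        """Yield (key, values) pairs in ascending key order."""
        for key in self._keys:
            yield key, self._values[key]

m = SortedMultiMap()
m.add(20, "b")
m.add(10, "a1")
m.add(10, "a2")
print(list(m.ordered_items()))  # [(10, ['a1', 'a2']), (20, ['b'])]
```

Keying by timestamp and iterating `ordered_items()` is exactly the shape needed for event-time stream-sorting: buffered records come back in time order regardless of arrival order.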
This document proposes the LT-Innovate OSCAR project, which would provide a standardized open standards compliance assessment report (OSCAR) for translation tools. The OSCAR would assign compliance levels from 0-4 to various standards, providing a score that could help buyers choose tools and enforce progress on open standards compliance in the industry. The OSCAR project would be run annually by LT-Innovate's Standards Interest Group and provide the assessment for free to LT-Innovate members and potentially for a fee to non-members.
Architecture of a Kafka camus infrastructuremattlieber
This document summarizes the results of a performance evaluation of Kafka and Camus to ingest streaming data into Hadoop. It finds that Kafka can ingest data at rates from 15,000-50,000 messages per second depending on data format (Avro is fastest). Camus can move the data to HDFS at rates from 54,000-662,000 records per second. Once in HDFS, queries on Avro-formatted data are fastest, with count and max aggregation queries completing in under 100 seconds for 20 million records. The customer's goal of 5000 events per second can be easily achieved with this architecture.
The document discusses a generic programming toolkit called PADS/ML that can be used to parse, analyze, and transform semi-structured or "ad hoc" data from various domains. It describes how PADS/ML uses generated type representations and typecase analysis to write functions that can operate on any data format described by a PADS/ML type. Case studies of PADX and Harmony are presented, which use PADS/ML to build tools for querying and synchronizing different data formats.
This document proposes OM-JSON, a JSON implementation of the OGC Observations and Measurements (O&M) standard. It provides JSON schemas for representing different types of observations, such as single measurements, time series, geometries, specimens, and collections. Examples are given for each. Issues discussed include how to wrap the encodings for use in APIs/services, differences from other JSON schemas like those from 52North, and potential changes needed to the O&M abstract specification. The motion at the end recommends publishing OM-JSON as an OGC Discussion Paper.
Similar to IPTC News Exchange Working Group 2013 Autumn Meeting (20)
A proposal to adopt an approach inspired by rightsstatements.org:
1. Create a set of rights statements specific to news and media
2. Host the rights statements using the IPTC CV server
3. Create an editorial process for adding new rights statements
4. Document how to use the rights statements – and maybe even implement an evaluation engine with explanations
5. Document how to mix in custom statements with IPTC ones
Presented at the IPTC Spring 2019 meeting https://iptc.org/events/spring-meeting-2019/
Presented at the IPTC Spring 2019 meeting, three proposals for taxonomies:
1. Document how to use 3rd party entity schemes
2. Develop taxonomies for “perceived” metadata - for photo, video and audio items
3. Develop a way to “delegate” to wikidata as a way to extend IPTC Media Topics into more granular topics
The document discusses the International Press Telecommunications Council (IPTC) and its goals for 2018/19. It aims to expand the scope of its work, broaden participation in standards development, and ensure financial viability. It provides an overview of IPTC's structure, membership levels, efforts to increase voting members and revenue, and seeks suggestions on topics, groups to engage, and locations for meetings to achieve its goals.
This document provides an agenda and information for the IPTC Spring Meeting taking place from April 8-10, 2019 in Lisbon, Portugal. The agenda includes discussions on topics like NewsML-G2, SportsML, image and video metadata standards, rights management, and AI/text analysis. Presentations will cover credibility, image regions, copyright directives, and more. Future meeting dates and locations are also listed, along with tips for attendees to introduce themselves, ask questions, and help spread information about IPTC's work.
Automation in the Newsroom and the impact on editorial labour: a case study. AP's image recognition technology project and how it requires new types of editorial tasks.
Presented on 1st February 2019 at COMPUTATION + JOURNALISM SYMPOSIUM 2019 http://cplusj.org/
IPTC Rights Working Group Toronto October 2018Stuart Myles
Why is rights metadata necessary for modern news and media organizations? How does IPTC's RightsML help solve those requirements? What are the opportunities to work with Google, Europeana, MINDS or other organizations to make progress with addressing the challenge of rights for news and media?
Welcome to IPTC's 2018 Annual General Meeting.
Three-day face-to-face conference discussing news metadata standards for photo, video, text and more, including news companies discussing their approaches to news search.
https://iptc.org/events/autumn-meeting-2018/
An update on the EXTRA project, an open source rules-based classifier for news content, including the application for additional funding from Google DNI for FRANCIS.
IPTC Machine Readable Rights for News and Media: Solving Three Challenges wit...Stuart Myles
How News and Media Publishers Can Optimize their Content Licensing by Adopting Standard Machine-Processable Rights
Presented at IPTC's Spring 2018 meeting
Ap Taxonomy Localization Requirements and ChallengesStuart Myles
AP's Taxonomy is - currently - US English. What are the challenges of localizing - not just translating - for other languages? What would be the ideal approach?
IPTC Spring Meeting Welcome To Athens April 2018Stuart Myles
The document summarizes the agenda and goals for the IPTC Spring Meeting in Athens. Over the next three days, the meeting will cover topics like news codes, video and photo metadata standards, rights expression, and GDPR compliance. They will also welcome a new managing director and thank the outgoing one. Attendees are encouraged to introduce themselves to others, ask questions, and help spread knowledge about IPTC's work on technical solutions for the news industry. The document introduces the format of the meeting and urges participation and networking.
Sustaining Television News Technical ChallengesStuart Myles
This document summarizes a presentation about technical challenges in sustaining television news archives for future generations. It discusses challenges in managing and sharing video clip metadata across different standards. It introduces the IPTC's Video Metadata Hub, which maps metadata fields between standards to enable uniform searching and preserve metadata across formats. The presentation outlines metadata standards supported by the Hub and next steps, such as engaging partners to spread adoption and mapping new metadata types for immersive media. It encourages participation in developing the Hub through IPTC membership meetings.
How to Train Your Classifier: Create a Serverless Machine Learning System wit...Stuart Myles
How to train a custom tagger to classify text using scikit-learn, with practical tuning advice to get more accurate results. How to create a REST API to train and host your tagger using AWS services including Lambda, API Gateway and Step Functions. Tips on how to overcome limitations in AWS and scikit-learn when creating your own custom tagger.
Presented at PyData NYC 2017 by Stuart Myles, Veronika Zielinska and David Fox
https://pydata.org/nyc2017/schedule/presentation/21/
The Search for IPTC's Next Managing DirectorStuart Myles
The IPTC is seeking a new Managing Director to replace Michael Steidl, who is retiring in mid-2018. The ideal candidate will have experience working in news technology, promoting organizations, and managing non-profits or membership groups. The position can be full-time or part-time. The IPTC Board will begin reviewing applications on December 1st, and the planned start date for the new Managing Director is May 1st, 2018. Candidates should email a letter of interest, CV, and references.
ninjs is IPTC's news in JSON standard. How was the design of ninjs approached? What were the different options which were considered? What is different about designing in JSON versus other formats, such as XML and RDF?
ninjs is the IPTC's standard for news in JSON. An overview of the standard as it is today - for representing text, photo, video and audio items - together with our plans for enhancements.
EXTRA is an open source rules-based classification engine, developed by IPTC with support from a Google DNI grant. Why are rules better than machine learning for breaking news? How can automation better support the manual crafting of news rules?
Welcome to Barcelona - IPTC November 2017Stuart Myles
Welcome to the November IPTC 2017 Annual General Meeting. The IPTC is the global standards body of the news media. We provide the technical foundation for the news ecosystem.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and OpenAI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and OpenAI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
UiPath Test Automation using UiPath Test Suite series, part 5DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD with in UiPath
End-to-end overview of CI/CD pipeline with Azure devops
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides an introduction to UiPath Communications Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communications Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATOs (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
What do a Lego brick and the XZ backdoor have in common?Speck&Tech
ABSTRACT: At first glance, a Lego brick and the XZ backdoor might seem to have in common only the fact that they are both building blocks, or dependencies, of creative and software projects. In reality, a Lego brick and the XZ backdoor case have much more in common than that.
Join the presentation to immerse yourself in a story of interoperability, standards, and open formats, and then discuss the important role contributors play in a sustainable open source community.
BIO: Advocate for free software and for standard, open formats. She has been an active member of the Fedora and openSUSE projects and co-founded the LibreItalia Association, where she was involved in several events, migrations, and training activities related to LibreOffice. Previously she worked on LibreOffice migrations and training courses for several public administrations and private companies. Since January 2020 she has worked at SUSE as a Software Release Engineer for Uyuni and SUSE Manager, and when she is not pursuing her passion for computers and for Geeko, she cultivates her curiosity about astronomy (which is where her nickname deneb_alpha comes from).
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides from the talk given at the IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), 2022.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AIVladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations