Half of the work that it takes to do data science is plumbing and wrangling. I’ll discuss some tricks we’ve learned while building AddThis over the years to collect and process data at web scale.
Analytic Insights in Retail Using Apache Spark with Hari Shreedharan (Databricks)
StreamSets Data Collector (SDC) is designed to make data ingestion and processing easy. SDC integrates with Apache Spark at several levels to make data analysis with Spark straightforward, and it works with Databricks Cloud to trigger jobs based on incoming data.
In this talk, you will learn how a large retail player with thousands of outlets uses StreamSets to power Spark jobs on the Databricks cloud, combining real-time foot traffic data with historical behavioral and transaction data for analytic insights that improve revenue per square foot.
Fraugster data scientist Oxana Goriuc presents her work on implementing graph databases for fraud solutions at the Women in Machine Learning & Data Science (WiMLDS) meetup in Berlin, hosted by Babbel.
1) The document describes SoftNews, a distributed solution for acquiring online news articles from over 30,000 sources in 20 languages.
2) SoftNews uses Perl modules to fetch, filter, compare, transform and store large numbers of news articles at set intervals while respecting time constraints. The articles are indexed using KinoSearch for fast retrieval.
3) A control GUI allows configuration and monitoring of the acquisition process. Delivered content is presented through a "Stich&glue" portal that provides enhanced search, tagging and visualization features for the large text collection.
Build real-time stream processing applications using Apache Kafka (Hotstar)
This talk was presented at the Hotstar Scale Meetup in Bangalore by Jayesh Sidhwani
In this talk, the presenter introduces Apache Kafka and the Kafka Streams library, starting from the need for streaming applications and moving on to framing use cases as streaming jobs, covering the technical details along the way.
It ends with a short description of how Kafka is deployed and used at Hotstar.
Druid provides sub-second query latency, and Flink provides SQL on streams, allowing rich transformation and enrichment of events as they happen. In this talk we will learn how Lyft uses Flink SQL and Druid together to support real-time analytics.
Meetup: https://www.meetup.com/druidio/events/252515792/
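As a rough illustration of the Flink SQL half of such a pipeline, here is a minimal PyFlink sketch; the topic, fields, and broker address are hypothetical, not Lyft's actual setup, and in practice the windowed counts would be written to a Kafka sink for Druid to ingest rather than printed.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Declare a Kafka-backed table of ride events with an event-time watermark.
t_env.execute_sql("""
    CREATE TABLE ride_events (
        ride_id STRING,
        status  STRING,
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'ride-events',
        'properties.bootstrap.servers' = 'kafka:9092',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")

# One-minute tumbling-window counts per ride status.
t_env.execute_sql("""
    SELECT TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
           status,
           COUNT(*) AS events
    FROM ride_events
    GROUP BY TUMBLE(event_time, INTERVAL '1' MINUTE), status
""").print()
```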
Introducing MagnetoDB, a key-value storage service for OpenStack (Mirantis)
MagnetoDB is an open source implementation of the Amazon DynamoDB API for OpenStack. It provides a key-value database service for storing unlimited data with scalability and predictable performance. MagnetoDB's API is compatible with existing DynamoDB clients, allowing applications using DynamoDB storage to run on OpenStack. The pilot implementation provides basic CRUD operations for items and tables and is available on GitHub under an Apache 2 license.
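For a sense of what DynamoDB compatibility means in practice, here is a minimal sketch using boto3 pointed at a MagnetoDB endpoint; the endpoint URL, port, table, and credentials are hypothetical, and a real OpenStack deployment would authenticate through Keystone.

```python
import boto3

magnetodb = boto3.client(
    "dynamodb",  # MagnetoDB speaks the DynamoDB-compatible API
    endpoint_url="http://magnetodb.example.com:8480",  # hypothetical endpoint
    region_name="regionone",
    aws_access_key_id="demo",
    aws_secret_access_key="demo",
)

# Standard DynamoDB-style item operations against the OpenStack service.
magnetodb.put_item(
    TableName="users",
    Item={"id": {"S": "u-42"}, "email": {"S": "alice@example.com"}},
)
resp = magnetodb.get_item(TableName="users", Key={"id": {"S": "u-42"}})
print(resp.get("Item"))
```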
Webinar slides: Free Monitoring (on Steroids) for MySQL, MariaDB, PostgreSQL ... (Severalnines)
Traditional server monitoring tools are not built for modern distributed database architectures. Let’s face it, most production databases today run in some kind of high availability setup - from simpler master-slave replication to multi-master clusters fronted by redundant load balancers. Operations teams deal with dozens, often hundreds of services that make up the database environment.
This is why we built ClusterControl - to address modern, highly distributed database setups based on replication or clustering. We wanted something that could provide a systems view of all the components of a distributed cluster, including load balancers.
Watch this replay of a webinar on free database monitoring using ClusterControl Community Edition. We show you how to monitor all your MySQL, MariaDB, PostgreSQL and MongoDB systems from a single point of control, whether they are deployed as Galera Clusters, sharded clusters or replication setups across on-prem and cloud data centers. We also show how to use Advisors to improve performance.
AGENDA
- Requirements for monitoring distributed database systems
- Cloud-based vs On-prem monitoring solutions
- Agent-based vs Agentless monitoring
- Deep dive into ClusterControl Community Edition
  - Architecture
  - Metrics Collection
  - Trending
  - Dashboards
  - Queries
  - Performance Advisors
- Other features available to Community users
SPEAKER
Bartlomiej Oles is a MySQL and Oracle DBA with over 15 years' experience managing highly available production systems at IBM, Nordea Bank, Acxiom, Lufthansa, and other Fortune 500 companies. In the past five years, his focus has been on building and applying automation tools to manage multi-datacenter database environments.
MongoDB .local Chicago 2019: MongoDB Atlas Data Lake Technical Deep Dive (MongoDB)
MongoDB Atlas Data Lake is a new service offered by MongoDB Atlas. Many organizations store long-term archival data in cost-effective storage like S3, GCP, and Azure Blobs, but many lack robust systems or tools to effectively use that data to inform decision making. MongoDB Atlas Data Lake lets organizations analyze their long-term data to discover a wealth of information about their business.
This session will take a deep dive into the features that are currently available in MongoDB Atlas Data Lake and how they are implemented. In addition, we'll discuss future plans and opportunities and offer ample Q&A time with the engineers on the project.
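Since the service exposes archived data through the ordinary MongoDB wire protocol, querying it looks like querying any other cluster. A minimal sketch with pymongo, assuming a hypothetical Data Lake connection string:

```python
from pymongo import MongoClient

# Hypothetical Data Lake URI; in practice this comes from the Atlas console.
client = MongoClient("mongodb://user:pass@datalake0.example.mongodb.net/?ssl=true")
events = client["archive"]["events"]

# Aggregate archived S3 data as if it were a live collection.
pipeline = [
    {"$match": {"type": "purchase"}},
    {"$group": {"_id": "$region", "total": {"$sum": "$amount"}}},
    {"$sort": {"total": -1}},
]
for row in events.aggregate(pipeline):
    print(row)
```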
Improve your SQL workload with observability (OVHcloud)
Most of OVH's information systems run on relational databases (PostgreSQL, MySQL, MariaDB). In terms of volume, this represents 400 databases weighing more than 20 TB of data, spread across 60 clusters in two geographic regions and powering 3,000 applications.
How do we get visibility across our whole fleet? Better still, how do we let everyone follow the activity of their own database? That is the challenge we set for ourselves, and one year on we can share our experience.
What if observability were not just a buzzword, but had a real impact on production?
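As one concrete flavor of the per-database visibility described here, a sketch that polls PostgreSQL's pg_stat_statements (column names as of PostgreSQL 13; the DSN is hypothetical):

```python
import psycopg2

conn = psycopg2.connect("host=db1.example.com dbname=app user=monitor")
with conn, conn.cursor() as cur:
    # Top 10 queries by cumulative execution time across the workload.
    cur.execute("""
        SELECT query, calls, mean_exec_time, total_exec_time
        FROM pg_stat_statements
        ORDER BY total_exec_time DESC
        LIMIT 10
    """)
    for query, calls, mean_ms, total_ms in cur.fetchall():
        print(f"{total_ms:10.1f} ms  {calls:8d} calls  {query[:60]}")
```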
Kafka as an Eventing System to Replatform a Monolith into Microservices (Confluent)
(Madhulika Tripathi, Intuit) Kafka Summit SF 2018
Breaking down monolithic applications into smaller manageable microservices can be a tough challenge. But the benefits are many. Faster changes, developer productivity, maintainability, scalability and high performance are a few of the motivators that make companies undertake this difficult journey.
At Intuit, we have our fair share of monolithic applications. One such application is QuickBooks Online, our accounting product for small businesses. To decompose the application, we needed to create new services and reduce the data footprint in the monolith by moving it to new services in a phased manner. As more and more data and services move out of the monolith, this data, now distributed across multiple microservices, needs to be synchronized in near real time to provide a seamless and fast experience to our customers.
To achieve this, we are using Kafka as our eventing backbone to keep distributed data in sync without compromising performance or user experience. Guaranteed publishing of financial events with no loss, high accuracy and performance is of utmost importance, as the majority of Intuit products deal with highly sensitive financial data. Strong ordering is another important guarantee that Kafka provides with low latency and high throughput. Use cases for data and streaming analytics, insights, personalization, and machine-learning-based predictions can all be unlocked by adopting Kafka as our distributed streaming platform.
This talk will take you through Intuit's journey of building a distributed, asynchronous system using Kafka: the choices made, the challenges faced, the adaptations clients had to make, and how we see Kafka powering our future!
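As a sketch of the loss-free, ordered publishing the talk describes, here is an idempotent confluent-kafka producer; the topic, brokers, and payload are illustrative, not Intuit's actual setup:

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "kafka:9092",
    "enable.idempotence": True,  # forces acks=all; no duplicates or reordering
})

def on_delivery(err, msg):
    if err is not None:
        raise RuntimeError(f"event lost: {err}")  # surface failed publishes

producer.produce(
    "financial-events",
    key=b"account-1234",  # same key -> same partition -> per-account ordering
    value=b'{"type":"invoice.created","amount":99.5}',
    on_delivery=on_delivery,
)
producer.flush()  # block until the broker acknowledges the write
```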
This presentation shows how we started doing Big Data at Ocado, what obstacles we hit, and how we tried to fix them later. You'll see how to deal with data sources, or, most importantly, how not to deal with them.
This document discusses the journey of Ocado, the largest online-only grocery retailer in the UK, to move its large and growing data to the cloud. It describes Ocado's initial use of traditional databases that became insufficient to handle the scale of data. It then discusses Ocado's move to Google Cloud Platform and use of services like Google BigQuery and Cloud Dataflow. While this helped with scalability and analytics, some challenges remained. The document evaluates different cloud-based options like Hadoop and Spark before concluding that BigQuery provided the best performance and ease of use, though could still be improved.
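For flavor, a minimal sketch of the BigQuery usage described, with the google-cloud-bigquery client; the project, dataset, and table names are hypothetical, not Ocado's:

```python
from google.cloud import bigquery

client = bigquery.Client()  # credentials picked up from the environment
query = """
    SELECT warehouse, COUNT(*) AS orders
    FROM `my-project.retail.orders`      -- hypothetical table
    WHERE DATE(created_at) = CURRENT_DATE()
    GROUP BY warehouse
    ORDER BY orders DESC
"""
# Run the query job and iterate the result rows.
for row in client.query(query).result():
    print(row.warehouse, row.orders)
```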
The document introduces the WSO2 Analytics Platform, which allows users to collect, store, analyze, visualize and communicate data. It discusses how the platform can help organizations reduce costs, improve customer satisfaction and efficiency. The key capabilities of the platform include interactive, batch, real-time and predictive analytics. It also provides tools for developers, solutions for various use cases, and discusses how to get started with the platform.
Data pipelines observability: OpenLineage & Marquez (Julien Le Dem)
This document discusses OpenLineage and Marquez, which aim to provide standardized metadata and data lineage collection for data pipelines. OpenLineage defines an open standard for collecting metadata as data moves through pipelines, similar to metadata collected by EXIF for images. Marquez is an open source implementation of this standard, which can collect metadata from various data tools and store it in a graph database for querying lineage and understanding dependencies. This collected metadata helps with tasks like troubleshooting, impact analysis, and understanding how data flows through complex pipelines over time.
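A hedged sketch of what emitting one such event to Marquez can look like over HTTP; the host, namespaces, and job names are hypothetical, and the payload fields follow the OpenLineage run-event model:

```python
import uuid
import datetime
import requests

event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.datetime.utcnow().isoformat() + "Z",
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "analytics", "name": "daily_rollup"},
    "inputs": [{"namespace": "warehouse", "name": "public.events"}],
    "outputs": [{"namespace": "warehouse", "name": "public.daily_counts"}],
    "producer": "https://example.com/my-scheduler",  # identifies the emitter
}

# Marquez ingests OpenLineage events and updates its lineage graph.
resp = requests.post("http://marquez.example.com:5000/api/v1/lineage", json=event)
resp.raise_for_status()
```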
Open core summit: Observability for data pipelines with OpenLineage (Julien Le Dem)
This document discusses Open Lineage and the Marquez project for collecting metadata and data lineage information from data pipelines. It describes how Open Lineage defines a standard model and protocol for instrumentation to collect metadata on jobs, datasets, and runs in a consistent way. This metadata can then provide context on the data source, schema, owners, usage, and changes. The document outlines how Marquez implements the Open Lineage standard by defining entities, relationships, and facets to store this metadata and enable use cases like data governance, discovery, and debugging. It also positions Marquez as a centralized but modular framework to integrate various data platforms and extensions like Datakin's lineage analysis tools.
Voxxed Days Thessaloniki 21/10/2016 - Streaming Engines for Big Data (Stavros Kontopoulos)
This document discusses streaming engines for big data and provides a case study on Spark Streaming. It begins with an overview of streaming concepts like streams, stream processing, and time in modern data stream analysis. Next, it covers key design considerations for streaming engines and examples of state-of-the-art stream analysis tools like Apache Flink, Spark Streaming, and Apache Beam. It then focuses on Spark Streaming, describing its DStream and Structured Streaming APIs. Code examples are provided for the DStream API and Structured Streaming. The document concludes with a recommendation to first consider Flink, Spark, or Kafka Streams when choosing a streaming engine.
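In the spirit of the code examples the talk mentions, here is a minimal PySpark Structured Streaming sketch; the Kafka topic and broker address are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# Read an unbounded stream of events from Kafka.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(value AS STRING) AS body", "timestamp"))

# Count events per 1-minute window, tolerating 5 minutes of lateness.
counts = (events
          .withWatermark("timestamp", "5 minutes")
          .groupBy(window(col("timestamp"), "1 minute"))
          .count())

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()
```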
MongoDB .local Houston 2019: MongoDB Atlas Data Lake Technical Deep Dive (MongoDB)
MongoDB Atlas Data Lake is a new service offered by MongoDB Atlas. Many organizations store long-term archival data in cost-effective storage like S3, GCP, and Azure Blobs, but many lack robust systems or tools to effectively use that data to inform decision making. MongoDB Atlas Data Lake lets organizations analyze their long-term data to discover a wealth of information about their business.
This session will take a deep dive into the features that are currently available in MongoDB Atlas Data Lake and how they are implemented.
Data lineage and observability with Marquez - Subsurface 2020 (Julien Le Dem)
This document discusses Marquez, an open source metadata management system. It provides an overview of Marquez and how it can be used to track metadata in data pipelines. Specifically:
- Marquez collects and stores metadata about data sources, datasets, jobs, and runs to provide data lineage and observability.
- It has a modular framework to support data governance, data lineage, and data discovery. Metadata can be collected via REST APIs or language SDKs.
- Marquez integrates with Apache Airflow to collect task-level metadata, dependencies between DAGs, and link tasks to code versions. This enables understanding of operational dependencies and troubleshooting.
- The Marquez community aims to build an open
DocumentDB is a fully managed, scalable NoSQL document database service hosted on Azure. It provides a rich queryable schema-free JSON document model with transactional processing. Applications can leverage features like stored procedures, triggers, user-defined functions and consistency options to balance performance and data consistency needs. Documents in DocumentDB can contain arbitrary JSON content and applications work with data through HTTP/REST endpoints.
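DocumentDB lives on as Azure Cosmos DB, so a sketch with today's azure-cosmos Python SDK is shown instead of raw HTTP/REST calls; the account URL, key, and database/container names are hypothetical:

```python
from azure.cosmos import CosmosClient

client = CosmosClient("https://myaccount.documents.azure.com:443/",
                      credential="<primary-key>")
container = client.get_database_client("shop").get_container_client("items")

# Documents are arbitrary JSON; no schema is declared up front.
container.create_item(body={"id": "1", "category": "books", "title": "Dune"})

# Parameterized SQL-like queries over the JSON documents.
for item in container.query_items(
        query="SELECT c.title FROM c WHERE c.category = @cat",
        parameters=[{"name": "@cat", "value": "books"}],
        enable_cross_partition_query=True):
    print(item)
```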
Kafka Streams - From the Ground Up to the Cloud (VMware Tanzu)
Kafka Streams is a client library for processing and transforming streams of data stored in Apache Kafka clusters. It allows embedding stream processing logic directly into applications using a simple Java DSL. Kafka Streams applications can perform stateful transformations like filtering, mapping, aggregations and joins on Kafka data. The processing is integrated with Kafka's storage and replication capabilities to ensure exactly-once semantics even in the cloud.
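Kafka Streams itself is a Java DSL, so as a rough Python analogue of the filter/map portion (without the library's state stores and exactly-once machinery), here is a consume-transform-produce loop with confluent-kafka; topic names are illustrative:

```python
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "order-cleaner",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "kafka:9092"})
consumer.subscribe(["orders"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    order = json.loads(msg.value())
    if order.get("amount", 0) <= 0:  # filter: drop non-positive orders
        continue
    order["currency"] = order.get("currency", "usd").upper()  # map
    producer.produce("orders-clean", key=msg.key(), value=json.dumps(order))
    producer.poll(0)  # serve delivery callbacks
```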
Pomerania Cloud case study - OpenStack Day Warsaw 2017 (Łukasz Klimek)
This document describes Pomerania Cloud, an OpenStack-based cloud computing platform located in Szczecin, Poland. It has two independent data centers connected by fiber with a total of 64 servers and over 1000 CPU cores. The backend uses OpenStack for infrastructure and OpenShift for PaaS. The frontend includes a website, e-commerce, and self-service portal built on Drupal for ordering, billing, and managing cloud resources. Customers include members of the local Cloud for Cities technology partnership.
FIWARE Global Summit - QuantumLeap: Time-series and Geographic Queries (FIWARE)
This document describes QuantumLeap, an open source software that stores and queries spatial-temporal IoT data from NGSI entities. It converts NGSI entities to a tabular format and stores them in time series and geo-spatial databases for efficient querying over space and time. QuantumLeap can be easily deployed using Docker containers on platforms like Kubernetes and supports multiple database backends. It provides a REST API and Grafana integration for querying and visualizing IoT data.
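A sketch of a QuantumLeap time-series query over its REST API; the host, entity, and attribute names are hypothetical, and the path shape follows QuantumLeap's query API:

```python
import requests

# Fetch one day of temperature readings for a hypothetical entity "Room1".
resp = requests.get(
    "http://quantumleap.example.com:8668/v2/entities/Room1/attrs/temperature",
    params={"fromDate": "2019-01-01T00:00:00", "toDate": "2019-01-02T00:00:00"},
    headers={"Fiware-Service": "smartcity"},  # multi-tenant header, if used
)
resp.raise_for_status()
print(resp.json())  # time index plus the attribute's values over the range
```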
This document describes uberVU's use of big data to monitor social media mentions and provide analytics to clients. It discusses how uberVU ingests large amounts of social media data daily using distributed technologies like Amazon Web Services, MongoDB, and Redis. Machine learning algorithms are used to analyze and classify data, though batch processing is more efficient. Signals like influencers and trends are identified. Lessons learned include the importance of monitoring systems and planning for failures.
The document discusses KB DataSpace, which is a platform for linked open data. It describes Virtuoso, an open source triplestore used to store RDF data. It also discusses HTTP and content negotiation standards used to make data accessible on the web. Finally, it outlines the process of converting raw data into structured RDF data using SPARQL updates, and tools like OntoWiki for authoring and linking semantic datasets as part of the linked open data cycle.
The document discusses the Helix Nebula Science Cloud procurement project. It provides updates on:
- Ramping up computing and storage resources for the project over 2018.
- Testing and consolidating the approach across procurers to provide shared resources for large-scale tests.
- Upcoming events where the project will demonstrate resources and tools.
- Two proposed use cases, PanCancer and ALICE, detailing their computing, storage and network requirements.
- Introducing vouchers as a means for procurers to provide short-term access to resources for additional users.
Scalable Dynamic Data Consumption on the Web (Ruben Taelman)
The document discusses reducing server load for dynamic web data by moving continuous query evaluation from servers to clients. It proposes doing this in three steps: scalable data storage and publication, efficient data transmission using compression and caching, and continuous evaluation on clients. Several research questions are posed around how to publish real-time and historical data together so it can be queried efficiently, how to store it in a way that allows efficient transfer, and how to enable client-side query evaluation over both static and dynamic data. The hypotheses are that new data can be stored and retrieved in time linear in the amount of data, and that server costs will be lower than the alternatives, with data transfer being the main factor influencing query times.
This document discusses functional prototyping for mobile apps. It begins by defining various types of prototypes like paper drawings, wireframes, and mockups. It then outlines several popular prototyping tools like POP, Balsamiq, Flinto, and Marvel. The document emphasizes that prototyping can save significant money on app development projects by clarifying requirements and creating a unified vision. It also argues for cross-functional teams that include disciplines like security, testing, and operations from the beginning rather than as an afterthought. Finally, it provides some resources for prototyping with Sketch and Framer.
Charlie Reverte, VP of Engineering at AddThis, discusses lessons learned from processing large-scale web data. AddThis processes data from 14 million domains, including 100 billion monthly page views and 50,000 events per second. Reverte outlines challenges around distributed ID generation, counting unique values, joining distributed data, sampling large datasets, and deploying systems that invalidate over 1.4 billion browser caches. He advocates for loose coupling between systems using approaches like Kafka for asynchronous event logging. Reverte also discusses techniques for columnar compression, tunable quality of service, and open sourcing Hydra, AddThis' custom processing system optimized for real-time data.
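Reverte's exact ID scheme isn't shown in the summary, but a common approach to distributed ID generation at this scale is a Snowflake-style 64-bit ID (timestamp, worker ID, sequence); a self-contained sketch:

```python
import time
import threading

class SnowflakeId:
    """64-bit IDs: 41-bit ms timestamp | 10-bit worker id | 12-bit sequence."""
    EPOCH = 1288834974657  # any fixed past epoch in ms works

    def __init__(self, worker_id: int):
        assert 0 <= worker_id < 1024
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:          # sequence exhausted this ms
                    while now <= self.last_ms:  # spin to the next millisecond
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - self.EPOCH) << 22) | (self.worker_id << 12) | self.sequence

gen = SnowflakeId(worker_id=7)
print(gen.next_id())  # unique across workers, roughly time-ordered
```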
What it's like to switch from working in the Federal IT space to commercial technology companies in DC. Where to look for companies and get a job at one you like.
UI testing tools like Selenium allow testing user interfaces in real browsers to ensure proper rendering. Traditional UI testing requires development skills and test maintenance is tedious. Visual testing tools provide higher productivity by automating tests visually without code. Visual tests can be used to test complex applications like Gmail by recording user flows and validating page elements and differences. Visual testing empowers non-technical users and complements unit and API tests.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2lGNybu.
Stefan Krawczyk discusses how his team at StitchFix uses the cloud to enable over 80 data scientists to be productive. He also talks about prototyping ideas, algorithms and analyses; how they set up and keep schemas in sync between Hive, Presto, Redshift and Spark; and how they make access easy for their data scientists. Filmed at qconsf.com.
Stefan Krawczyk is Algo Dev Platform Lead at StitchFix, where he’s leading development of the algorithm development platform. He spent formative years at Stanford, LinkedIn, Nextdoor & Idibon, working on everything from growth engineering, product engineering, data engineering, to recommendation systems, NLP, data science and business intelligence.
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas (MongoDB)
Moving to a new home is daunting. Packing up all your things, getting a vehicle to move it all, unpacking it, updating your mailing address, and making sure you did not leave anything behind. Well, the move to MongoDB Atlas is similar, but all the logistics are already figured out for you by MongoDB.
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S... (Spark Summit)
Streaming applications have often been complex to design and maintain because of the significant upfront infrastructure investment required. However, with the advent of Spark an easy transition to stream processing is now available, enabling personalization applications and experiments to consume near real-time data without massive development cycles.
Our decision to evaluate Spark as our stream processing engine was primarily led by the following considerations: 1) Ease of development for the team (already familiar with spark for batch), 2) the scope/requirements of our problem, 3) re-usability of code from spark batch jobs, and 4) Spark support from infrastructure teams within the company.
In this session, we will present our experience using Spark for stream processing unbounded datasets in the personalization space. The datasets consisted of, but were not limited to, the stream of playback events that are used as feedback for all personalization algorithms. These plays are used to extract specific behaviors that are highly predictive of a customer's enjoyment of our service. This dataset is massive and has to be further enriched by other online and offline Netflix data sources. These datasets, when consumed by our machine learning models, directly affect the customer's personalized experience, which means the impact is high and the tolerance for failure is low. We'll talk about the experiments we did to compare Spark with other streaming solutions like Apache Flink, the impact we had on our customers, and, most importantly, the challenges we faced.
Take-aways for the audience:
1) A great example of stream processing large, personalization datasets at scale.
2) An increased awareness of the costs/requirements for making the transition from batch to streaming successfully.
3) Exposure to some of the technical challenges that should be expected along the way.
Extracting Insights from Data at Twitter (Prasad Wagle)
Prasad Wagle's talk discussed how Twitter extracts insights from its large volumes of data. Twitter collects hundreds of millions of tweets and interactions per day from over 300 million monthly active users, creating big data challenges around velocity, volume, and variety. Twitter stores this data in hundreds of petabytes across large Hadoop clusters and processes it using batch tools like Hadoop and Spark as well as real-time tools like Heron. Insights are generated through basic analytics like user counts, A/B testing of new features, and custom data science work including machine learning models for recommendations, content filtering, and ad targeting. Systems, programming, and statistical skills are needed to effectively extract value from Twitter's big data.
Big Data in 200 km/h | AWS Big Data Demystified #1.3 (Omid Vahdaty)
What we're about
A while ago I entered the challenging world of Big Data. As an engineer, at first, I was not so impressed with this field. As time went by, I realised more and more that the technological challenges in this area are too great for one person to master. Just look at the picture in this article; it covers only a small fraction of the technologies in the Big Data industry…
Consequently, I created a meetup detailing all the challenges of Big Data, especially in the world of cloud. I am using AWS infrastructure to answer the basic questions of anyone starting their way in the big data world.
- How to transform data (TXT, CSV, TSV, JSON) into Parquet or ORC (see the sketch below)
- Which technology should we use to model the data? EMR? Athena? Redshift? Spectrum? Glue? Spark? SparkSQL?
- How to handle streaming?
- How to manage costs?
- Performance tips
- Security tips
- Cloud best-practice tips
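As one answer to the first question above, a PySpark sketch that rewrites raw CSV as partitioned Parquet; the bucket paths and partition column are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

(spark.read
 .option("header", True)
 .option("inferSchema", True)
 .csv("s3://my-bucket/raw/events/")   # TSV/TXT work via .option("sep", "\t")
 .write
 .mode("overwrite")
 .partitionBy("event_date")           # assumes such a column exists
 .parquet("s3://my-bucket/curated/events/"))
```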
Some of our online materials:
Website:
https://big-data-demystified.ninja/
YouTube channels:
https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber
Meetup:
https://www.meetup.com/AWS-Big-Data-Demystified/
https://www.meetup.com/Big-Data-Demystified
Facebook Group :
https://www.facebook.com/groups/amazon.aws.big.data.demystified/
Facebook page (https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/)
Audience:
Data Engineers
Data Science
DevOps Engineers
Big Data Architects
Solution Architects
CTO
VP R&D
Dyn delivers exceptional Internet performance. Enabling high-quality services requires data centers around the globe, and in order to manage services, customers need timely insight collected from all over the world. Dyn uses DataStax Enterprise (DSE) to deploy complex clusters across multiple datacenters to enable sub-50 ms query responses for hundreds of billions of data points. From granular DNS traffic data to aggregated counts for a variety of report dimensions, DSE at Dyn has been up since 2013 and has shined through upgrades, data center migrations, DDoS attacks and hardware failures. In this webinar, Principal Engineers Tim Chadwick and Rick Bross cover the requirements that led them to choose DSE as their go-to Big Data solution, the path that led to Spark, and the lessons learned in the process.
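Dyn's schema isn't public, so purely as illustration, here is a typical DSE/Cassandra time-series read with the DataStax Python driver, using a hypothetical keyspace and table:

```python
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1", "10.0.0.2"])  # contact points per datacenter
session = cluster.connect("dns_metrics")      # hypothetical keyspace

# Prepared statement: partition by zone and day for fast, bounded reads.
stmt = session.prepare("""
    SELECT ts, qps FROM traffic_by_zone
    WHERE zone = ? AND day = ?
""")
for row in session.execute(stmt, ("example.com", "2016-06-01")):
    print(row.ts, row.qps)
```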
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H... (Data Con LA)
Enabling real-time exploration and analytics at scale to drive operational intelligence at Hulu by Indrasis Mondal, Director, Data Engineering and Data Products, Hulu
Data is one of the most powerful assets for companies today and a key driver of innovation, product development and business efficiency. Operational intelligence allows a modern organization to use that data asset in real time to gain immediate insight into its business operations and to make rapid decisions for strategic advantage. In this presentation we will walk through the operational intelligence capabilities Hulu has built to process tens of millions of events per minute, enabling fast exploration of data and real-time decision making.
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach... (Flink Forward)
Apache Flink is a popular stream computing framework for real-time stream computing. Many stream compute algorithms require trailing data in order to compute the intended result. One example is computing the number of user logins in the last 7 days. This creates a dilemma where the results of the stream program are incomplete until the runtime of the program exceeds 7 days. The alternative is to bootstrap the program using historic data to seed the state before shifting to use real-time data. This talk will discuss alternatives to bootstrap programs in Flink. Some alternatives rely on technologies exogenous to the stream program, such as enhancements to the pub/sub layer, that are more generally applicable to other stream compute engines. Other alternatives include enhancements to Flink source implementations. Lyft is exploring another alternative using orchestration of multiple Flink programs. The talk will cover why Lyft pursued this alternative and future directions to further enhance bootstrapping support in Flink.
AWS Big Data Demystified #1: Big data architecture lessons learned (Omid Vahdaty)
AWS Big Data Demystified #1: big data architecture lessons learned. A quick overview of the big data technologies that were selected or disregarded at our company.
The video: https://youtu.be/l5KmaZNQxaU
Don't forget to subscribe to the YouTube channel.
The website: https://amazon-aws-big-data-demystified.ninja/
The meetup : https://www.meetup.com/AWS-Big-Data-Demystified/
The facebook group : https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/
The document discusses techniques for improving web performance, including reducing time to first byte, using content delivery networks and HTTP compression, caching resources, keeping connections alive and reducing request sizes. It also covers optimizing images, loading JavaScript asynchronously to avoid blocking, and prefetching content. The overall goal is to reduce page load times and improve user experience.
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned (Omid Vahdaty)
A while ago I entered the challenging world of Big Data. As an engineer, at first, I was not so impressed with this field. As time went by, I realised more and more that the technological challenges in this area are too great for one person to master. Just look at the picture in this article; it covers only a small fraction of the technologies in the Big Data industry…
Consequently, I created a meetup detailing all the challenges of Big Data, especially in the world of cloud. I am using AWS, GCP, and data-center infrastructure to answer the basic questions of anyone starting their way in the big data world.
- How to transform data (TXT, CSV, TSV, JSON) into Parquet, ORC, or AVRO
- Which technology should we use to model the data? EMR? Athena? Redshift? Spectrum? Glue? Spark? SparkSQL? GCS? BigQuery? Dataflow? Datalab? TensorFlow?
- How to handle streaming?
- How to manage costs?
- Performance tips
- Security tips
- Cloud best-practice tips
In this meetup we present speakers working with several cloud vendors, various big data platforms such as Hadoop, data warehouses, and startups building big data products. Basically, if it is related to big data, this is THE meetup.
Some of our online materials (mixed content from several cloud vendor):
Website:
https://big-data-demystified.ninja (under construction)
Meetups:
https://www.meetup.com/Big-Data-Demystified
https://www.meetup.com/AWS-Big-Data-Demystified/
YouTube channels:
https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber
https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
Audience:
Data Engineers
Data Science
DevOps Engineers
Big Data Architects
Solution Architects
CTO
VP R&D
This document discusses how StreamSets and Spark can be used together for analytics insights in retail. Some key points:
- StreamSets Data Collector (SDC) is used to ingest IoT and sensor data from various sources into a common format and process the data in real-time using Spark evaluators.
- The Spark evaluator allows running Spark transformations on batches of data within SDC pipelines to do tasks like anomaly detection, sentiment analysis, and fraud detection.
- SDC can also be used to move data to and from Spark for end-of-batch processing using a Spark executor, such as running jobs on Databricks after files land in S3.
- Together, SDC and Spark
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma (Spark Summit)
Learn about the big data processing ecosystem at Netflix and how Apache Spark fits into this platform. I talk about typical data flows and data pipeline architectures used at Netflix and address how Spark is helping us gain efficiency in our processes. As a bonus, I'll touch on some unconventional use cases, contrary to typical warehousing/analytics solutions, that are being served by Apache Spark.
Introduction to Data Engineer and Data Pipeline at Credit OK (Kriangkrai Chaonithi)
The document discusses the role of data engineers and data pipelines. It begins with an introduction to big data and why data volumes are increasing. It then covers what data engineers do, including building data architectures, working with cloud infrastructure, and programming for data ingestion, transformation, and loading. The document also explains data pipelines, describing extract, transform, load (ETL) processes and batch versus streaming data. It provides an example of Credit OK's data pipeline architecture on Google Cloud Platform that extracts raw data from various sources, cleanses and loads it into BigQuery, then distributes processed data to various applications. It emphasizes the importance of data engineers in processing and managing large, complex data sets.
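As a sketch of the "load cleansed data into BigQuery" step such a pipeline ends with, assuming hypothetical bucket and table names:

```python
from google.cloud import bigquery

client = bigquery.Client()
job = client.load_table_from_uri(
    "gs://creditok-clean/transactions/*.parquet",  # cleansed staging files
    "creditok.analytics.transactions",             # destination table
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    ),
)
job.result()  # block until the load job finishes
print(f"loaded {job.output_rows} rows")
```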
Web performance optimization - MercadoLibre (Pablo Moretti)
The document provides techniques and tools for improving web performance. It discusses how reducing response times can directly impact revenues and user experience. It then covers various ways to optimize the frontend, including reducing time to first byte through DNS optimization and caching, using content delivery networks, HTTP compression, keeping connections alive, parallel downloads, and prefetching. It also discusses optimizing images, JavaScript loading, and introducing new formats like WebP. The overall document aims to educate on measuring and enhancing web performance.
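Two of those ideas, keep-alive and compression, seen from the client side in a small sketch (the URL is a placeholder):

```python
import requests

session = requests.Session()  # reuses one TCP/TLS connection (keep-alive)
for path in ("/", "/search", "/item/123"):
    r = session.get("https://example.com" + path,
                    headers={"Accept-Encoding": "gzip"})
    # elapsed approximates the request round trip; Content-Encoding shows
    # whether the server actually compressed the response.
    print(path, r.status_code, r.elapsed.total_seconds(),
          r.headers.get("Content-Encoding"))
```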
A Day in the Life of a Druid Implementor and Druid's Roadmap (Itai Yaffe)
This document summarizes a typical day for a Druid architect. It describes common tasks like evaluating production clusters, analyzing data and queries, and recommending optimizations. The architect asks stakeholders questions to understand usage and helps evaluate if Druid is a good fit. When advising on Druid, the architect considers factors like data sources, query types, and technology stacks. The document also provides tips on configuring clusters for performance and controlling segment size.
This document summarizes a presentation about designing systems to handle high loads when Chuck Norris is your customer. It discusses scaling architectures vertically and horizontally, RESTful principles, using NoSQL databases like MongoDB, caching with Memcached, search engines like Sphinx, video/image storage, and bandwidth management. It emphasizes that the right technology depends on business needs, and high-load systems require robust architectures, qualified developers, and avoiding single points of failure.
Similar to Data Lessons Learned at Scale - Big Data DC
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Removing Uninteresting Bytes in Software Fuzzing (Aftab Hussain)
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating the uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at the IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), 2022.
Pushing the limits of ePRTC: 100ns holdover for 100 days (Adtran)
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0! (SOFTTECHHUB)
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
HCL Notes and Domino License Cost Reduction in the World of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your costs through an optimized configuration and keep them low going forward.
These topics will be covered:
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT stylesheets and schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating, explaining, or refactoring code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Communications Mining Series - Zero to Hero - Session 1 - DianaGray10
This session provides an introduction to UiPath Communications Mining, its importance, and an overview of the platform. You will acquire a good understanding of the phases in Communications Mining as we go over the platform with you. Topics covered:
• Communications Mining overview
• Why is it important?
• How it can help today’s business, and the benefits
• Phases in Communications Mining
• Demo of the platform
• Q/A
UiPath Test Automation using UiPath Test Suite series, part 5 - DianaGray10
Welcome to the UiPath Test Automation using UiPath Test Suite series, part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... - Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
UiPath Test Automation using UiPath Test Suite series, part 6 - DianaGray10
Welcome to the UiPath Test Automation using UiPath Test Suite series, part 6. In this session, we will cover test automation with generative AI and OpenAI.
This webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI with OpenAI’s advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI?
Test automation with generative AI and OpenAI
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence - IndexBug
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
TrustArc Webinar - 2024 Global Privacy Survey - TrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Full-RAG: A modern architecture for hyper-personalization - Zilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents “Full RAG,” a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG’s potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyper-personalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations when deploying cutting-edge AI solutions.
Infrastructure Challenges in Scaling RAG with Custom AI models - Zilliz
Building Retrieval-Augmented Generation (RAG) systems with open-source and custom AI models is a complex task. This talk explores the challenges in productionizing RAG systems, including retrieval performance, response synthesis, and evaluation. We’ll discuss how to leverage open-source models like text embeddings, language models, and custom fine-tuned models to enhance RAG performance. Additionally, we’ll cover how BentoML can help orchestrate and scale these AI components efficiently, ensuring seamless deployment and management of RAG systems in the cloud.
2. Topic
Half of the work that it takes to do data science is plumbing and wrangling. I’ll discuss some tricks we’ve learned over the years to collect and process data at web scale.
@numbakrrunch
4. Our Data
We process tool data:
● Sharing
● Following
● Visitation
● Content Classification
And feed it back to sites:
● Analytics
● Trending Content
● Personalized Recommendations
5. At Scale...
● 14 million domains
● 100 billion views/month
● 45k events/sec
● 160k concurrent firewall sessions
● 500k unique metrics in ganglia
6. Counting Things
Common operations:
● Cardinality
● Set membership
● Top-k elements
● Frequency
● Estimate when possible
● Sample when possible
● Often streaming vs. batch
● Mergeability is a big plus
  ○ Distributed counting
  ○ Checkpointing
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
Stream-lib: https://github.com/clearspring/stream-lib
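Since the slide already points at stream-lib, here is a minimal sketch of how its HyperLogLog estimator covers the cardinality and mergeability points. The log2m=14 setting is an arbitrary accuracy/memory choice (on the order of kilobytes per sketch for roughly 1% relative error), not something the deck specifies:

```java
import com.clearspring.analytics.stream.cardinality.CardinalityMergeException;
import com.clearspring.analytics.stream.cardinality.HyperLogLog;
import com.clearspring.analytics.stream.cardinality.ICardinality;

public class UniqueVisitors {
    public static void main(String[] args) throws CardinalityMergeException {
        // One sketch per shard; log2m=14 gives 2^14 registers, trading a
        // few kilobytes of memory for ~1% relative error on the estimate.
        HyperLogLog shardA = new HyperLogLog(14);
        HyperLogLog shardB = new HyperLogLog(14);

        for (int i = 0; i < 100_000; i++) shardA.offer("user-" + i);
        for (int i = 50_000; i < 150_000; i++) shardB.offer("user-" + i);

        // Mergeability is the big plus: combine per-shard sketches into a
        // global estimate without re-reading raw events, which is what
        // makes distributed counting and checkpointing cheap.
        ICardinality global = shardA.merge(shardB);
        System.out.println("estimated uniques: " + global.cardinality()); // ~150,000
    }
}
```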
7. Distributed ID Generation
● Session IDs are generated in the browser
● We concatenate time and a random value
  ○ [Diagram: 64-bit ID, time in the high bits (from bit 63), rand in the low bits (from bit 31 down to 0)]
  ○ Hex: 4f6934b6f54bd7c1, Base64: T2k0to403VS
● Time-bounded probabilistic uniqueness
  ○ (m choose 2) / n ≈ 0.142 collisions/sec (at 35k req/sec)
● Naturally time ordered, built-in DoB (date of birth)
● Compare to Twitter Snowflake: https://github.com/twitter/snowflake/
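A minimal sketch of the time-plus-random scheme, assuming the high 32 bits are unix seconds and the low 32 bits are random (0x4f6934b6, the high word of the slide's hex example, decodes to a March 2012 unix timestamp, which is consistent with that reading):

```java
import java.util.concurrent.ThreadLocalRandom;

public final class SessionId {
    // Pack unix seconds into the high 32 bits and a random value into
    // the low 32 bits, mirroring the slide's hex example
    // (4f6934b6 = timestamp half, f54bd7c1 = random half). Assumed layout.
    public static long next() {
        long seconds = System.currentTimeMillis() / 1000L;
        long rand = ThreadLocalRandom.current().nextInt() & 0xFFFFFFFFL;
        return (seconds << 32) | rand;
    }

    public static void main(String[] args) {
        System.out.printf("hex: %016x%n", next());
        // Birthday approximation: with m IDs sharing one timestamp second
        // and n possible random values, expected collisions per second are
        // ~ m*(m-1)/(2n); at m = 35,000 and n = 2^32 that is ~0.14/sec,
        // matching the slide's 0.142 figure.
    }
}
```

IDs built this way sort by creation time for free, which is the "naturally time ordered, built-in DoB" point.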
8. Joining Data
● Value of data increases with higher dimensionality
  ○ Geo, user profile, page attributes, external data
● Join and de-normalize data when you ingest
  ○ Disk is cheap
● Join your data in client-side storage
  ○ Browsers as a lossy distributed database
  ○ Mutability?
“The value is in the join” (or something like that)
https://github.com/stewartoallen
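A sketch of the join-at-ingest idea: the raw event carries only keys, and the stored record is denormalized at write time. GeoTable and ProfileTable are hypothetical lookup services invented for illustration, not anything named in the deck:

```java
import java.util.Map;

// Raw event as it arrives, versus the denormalized record we store.
record RawEvent(long time, String ip, String uid, String url) {}
record EnrichedEvent(long time, String ip, String uid, String url,
                     String country, Map<String, String> profile) {}

// Hypothetical lookup services (assumptions, not from the deck).
interface GeoTable { String countryOf(String ip); }
interface ProfileTable { Map<String, String> attributesOf(String uid); }

class IngestJoin {
    // Disk is cheap: pay for the join once at ingest and store the wide
    // row, so downstream queries never need a second lookup.
    static EnrichedEvent enrich(RawEvent e, GeoTable geo, ProfileTable profiles) {
        return new EnrichedEvent(e.time(), e.ip(), e.uid(), e.url(),
                geo.countryOf(e.ip()), profiles.attributesOf(e.uid()));
    }
}
```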
9. Sharding and Sampling
● Choose your shard keys wisely
○ High cardinality field to reduce lumpiness
○ What do you need to co-locate?
● Shards also useful for sampling
○ Law of large numbers
● Can yield statistical significance
○ Depending on the question
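One way to read the shards-as-samples point: hash a high-cardinality key into shards, then treat any single shard as a uniform sample. A sketch, with CRC32 as an arbitrary stable hash (the deck does not say which hash it used):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

class Sharding {
    static final int NUM_SHARDS = 64; // illustrative count

    // Stable hash of a high-cardinality shard key (the user ID), so the
    // load spreads evenly and stays free of hot-key lumpiness.
    static int shardOf(String userId) {
        CRC32 crc = new CRC32();
        crc.update(userId.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % NUM_SHARDS);
    }

    // All events for a user co-locate on one shard, and because the hash
    // is uniform, any single shard is an unbiased ~1/64 sample of users,
    // often enough for statistically significant answers.
    static boolean inSample(String userId) {
        return shardOf(userId) == 0;
    }
}
```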
10. Tunable QoS
● URL metadata stored in a 90-node Cassandra cluster
● We scrape and classify 20M URLs/day
● 750 million active records
● 2.2B reads/day
● Variable cache TTLs
  ○ Depending on write rate per record
● CDN cache
● Global TTL knob
  ○ Turn up to reduce load for maintenance
  ○ Turn down to improve responsiveness
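A sketch of how a per-record TTL with a global knob could be wired up. The base curve, floor, and ceiling values here are invented for illustration; the deck only states the policy (TTL varies with write rate, scaled by one global dial):

```java
// Per-record cache TTLs derived from write rate, scaled by one global
// multiplier operators can turn up (shed load during maintenance) or
// turn down (fresher responses). Constants are assumptions.
class TtlPolicy {
    private volatile double globalKnob = 1.0; // operator-controlled dial

    void setGlobalKnob(double knob) { this.globalKnob = knob; }

    // Hot records (many writes/hour) get short TTLs so caches stay
    // fresh; cold records can safely be cached much longer.
    long ttlSeconds(double writesPerHour) {
        double base = 3600.0 / (1.0 + writesPerHour); // shrinks with write rate
        double ttl = base * globalKnob;
        return (long) Math.max(60, Math.min(86_400, ttl)); // clamp 1 min..1 day
    }
}
```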
11. Deployment
● Continuous Deploy?
● Deploying our JavaScript costs $3k
○ Have to invalidate 1.4B browser caches
○ Several hours to flush to browsers (clench)
● 2PB of CDN data served per month
● Have DDoSed ourselves
○ Very interesting bugs
● Simulation is weak
○ The internet is a dirty place
○ Embrace incremental deploys
12. Columnar Compression
● Columnar storage techniques for row data
● Better compressor efficiency
● Different compressors per column
● >20% size savings
● by @abramsm
[Diagram: input rows with fields Time, IP, UID, URL, Geo are regrouped into fixed-size per-column blocks (Time, IP, UID, URL, Geo) in the stored data]
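A sketch of the per-column compression trick using java.util.zip.Deflater: group a block of rows, pull each field out into its own column, and compress each column separately so similar values (timestamps, IPs) sit adjacent. The per-column level choice below is illustrative, not the deck's actual compressor assignments:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

class ColumnarBlock {
    // Compress one column's bytes with a chosen Deflater level.
    static byte[] compress(byte[] column, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(column);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // "Different compressors per column": e.g. a fast setting for
    // high-entropy UIDs that barely compress, an aggressive one for
    // highly repetitive URL columns.
    static byte[][] compressBlock(byte[][] columns) {
        byte[][] stored = new byte[columns.length][];
        for (int i = 0; i < columns.length; i++) {
            int level = (i == 0) ? Deflater.BEST_SPEED : Deflater.BEST_COMPRESSION;
            stored[i] = compress(columns[i], level);
        }
        return stored;
    }
}
```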
13. Summary
● Are you more like the post office or the bank?
● Look for good-enough answers
● Fight your nerd tendency toward perfection
○ I’m still struggling with this