Data Lessons Learned at Scale - Big Data DC

Big data on google platform dev fest presentation

Sriskandarajah Suhothayan

This document discusses the journey of Ocado, the largest online-only grocery retailer in the UK, to move its large and growing data to the cloud. It describes Ocado's initial use of traditional databases, which became insufficient to handle the scale of data. It then discusses Ocado's move to Google Cloud Platform and use of services like Google BigQuery and Cloud Dataflow. While this helped with scalability and analytics, some challenges remained. The document evaluates different cloud-based options like Hadoop and Spark before concluding that BigQuery provided the best performance and ease of use, though it could still be improved.

An introduction to the WSO2 Analytics Platform

The document introduces the WSO2 Analytics Platform, which allows users to collect, store, analyze, visualize and communicate data. It discusses how the platform can help organizations reduce costs, improve customer satisfaction and efficiency. The key capabilities of the platform include interactive, batch, real-time and predictive analytics. It also provides tools for developers, solutions for various use cases, and discusses how to get started with the platform.

Data pipelines observability: OpenLineage & Marquez

This document discusses OpenLineage and Marquez, which aim to provide standardized metadata and data lineage collection for data pipelines. OpenLineage defines an open standard for collecting metadata as data moves through pipelines, similar to metadata collected by EXIF for images. Marquez is an open source implementation of this standard, which can collect metadata from various data tools and store it in a graph database for querying lineage and understanding dependencies. This collected metadata helps with tasks like troubleshooting, impact analysis, and understanding how data flows through complex pipelines over time.

Open core summit: Observability for data pipelines with OpenLineage

This document discusses OpenLineage and the Marquez project for collecting metadata and data lineage information from data pipelines. It describes how OpenLineage defines a standard model and protocol for instrumentation to collect metadata on jobs, datasets, and runs in a consistent way. This metadata can then provide context on the data source, schema, owners, usage, and changes. The document outlines how Marquez implements the OpenLineage standard by defining entities, relationships, and facets to store this metadata and enable use cases like data governance, discovery, and debugging. It also positions Marquez as a centralized but modular framework to integrate various data platforms and extensions like Datakin's lineage analysis tools.

Voxxed days thessaloniki 21/10/2016 - Streaming Engines for Big Data

Stavros Kontopoulos

This document discusses streaming engines for big data and provides a case study on Spark Streaming. It begins with an overview of streaming concepts like streams, stream processing, and time in modern data stream analysis. Next, it covers key design considerations for streaming engines and examples of state-of-the-art stream analysis tools like Apache Flink, Spark Streaming, and Apache Beam. It then focuses on Spark Streaming, describing its DStream and Structured Streaming APIs. Code examples are provided for the DStream API and Structured Streaming. The document concludes with a recommendation to first consider Flink, Spark, or Kafka Streams when choosing a streaming engine.

MongoDB .local Houston 2019: MongoDB Atlas Data Lake Technical Deep Dive

Data lineage and observability with Marquez - subsurface 2020

Helix Nebula The Science Cloud

This document discusses Marquez, an open source metadata management system. It provides an overview of Marquez and how it can be used to track metadata in data pipelines. Specifically:
- Marquez collects and stores metadata about data sources, datasets, jobs, and runs to provide data lineage and observability.
- It has a modular framework to support data governance, data lineage, and data discovery. Metadata can be collected via REST APIs or language SDKs.
- Marquez integrates with Apache Airflow to collect task-level metadata, dependencies between DAGs, and link tasks to code versions. This enables understanding of operational dependencies and troubleshooting.
- The Marquez community aims to build an open

DocumentDB - NoSQL on Cloud at Reboot2015

Vidyasagar Machupalli

DocumentDB is a fully managed, scalable NoSQL document database service hosted on Azure. It provides a rich queryable schema-free JSON document model with transactional processing. Applications can leverage features like stored procedures, triggers, user-defined functions and consistency options to balance performance and data consistency needs. Documents in DocumentDB can contain arbitrary JSON content and applications work with data through HTTP/REST endpoints.

Kafka Streams - From the Ground Up to the Cloud

VMware Tanzu

Kafka Streams is a client library for processing and transforming streams of data stored in Apache Kafka clusters. It allows embedding stream processing logic directly into applications using a simple Java DSL. Kafka Streams applications can perform stateful transformations like filtering, mapping, aggregations and joins on Kafka data. The processing is integrated with Kafka's storage and replication capabilities to ensure exactly-once semantics even in the cloud.

Pomerania Cloud case study - Openstack Day Warsaw 2017

Łukasz Klimek

This document describes Pomerania Cloud, an OpenStack-based cloud computing platform located in Szczecin, Poland. It has two independent data centers connected by fiber with a total of 64 servers and over 1000 CPU cores. The backend uses OpenStack for infrastructure and OpenShift for PaaS. The frontend includes a website, e-commerce, and self-service portal built on Drupal for ordering, billing, and managing cloud resources. Customers include members of the local Cloud for Cities technology partnership.

FIWARE Global Summit - QuantumLeap: Time-series and Geographic Queries

FIWARE

This document describes QuantumLeap, an open source software that stores and queries spatial-temporal IoT data from NGSI entities. It converts NGSI entities to a tabular format and stores them in time series and geo-spatial databases for efficient querying over space and time. QuantumLeap can be easily deployed using Docker containers on platforms like Kubernetes and supports multiple database backends. It provides a REST API and Grafana integration for querying and visualizing IoT data.

Big data @ uber vu (1)

Mihnea Giurgea

This document describes uberVU's use of big data to monitor social media mentions and provide analytics to clients. It discusses how uberVU ingests large amounts of social media data daily using distributed technologies like Amazon Web Services, MongoDB, and Redis. Machine learning algorithms are used to analyze and classify data, though batch processing is more efficient. Signals like influencers and trends are identified. Lessons learned include the importance of monitoring systems and planning for failures.

Dataspace presentatie

Roland Cornelissen

The document discusses KB DataSpace, which is a platform for linked open data. It describes Virtuoso, an open source triplestore used to store RDF data. It also discusses HTTP and content negotiation standards used to make data accessible on the web. Finally, it outlines the process of converting raw data into structured RDF data using SPARQL updates, and tools like OntoWiki for authoring and linking semantic datasets as part of the linked open data cycle.

M-PIL-3.2 Public Session

The document discusses the Helix Nebula Science Cloud procurement project. It provides updates on:
- Ramping up computing and storage resources for the project over 2018.
- Testing and consolidating the approach across procurers to provide shared resources for large-scale tests.
- Upcoming events where the project will demonstrate resources and tools.
- Two proposed use cases, PanCancer and ALICE, detailing their computing, storage and network requirements.
- Introducing vouchers as a means for procurers to provide short-term access to resources for additional users.

Scalable Dynamic Data Consumption on the Web

Ruben Taelman

The document discusses reducing server load for dynamic web data by moving continuous query evaluation from servers to clients. It proposes doing this through three steps: scalable data storage and publication, efficient data transmission using compression and caching, and continuous evaluation on clients. Several research questions are posed around how to combine publication of real-time and historical data to make it queryable efficiently while storing it in a way that allows efficient data transfer and enabling client-side query evaluation over both static and dynamic data. Hypotheses are made that new data can be stored and retrieved linearly based on amounts, and that server costs will be lower than alternatives with data transfer being the main factor influencing query times.

Functional Prototyping For Mobile Apps

Movel

This document discusses functional prototyping for mobile apps. It begins by defining various types of prototypes like paper drawings, wireframes, and mockups. It then outlines several popular prototyping tools like POP, Balsamiq, Flinto, and Marvel. The document emphasizes that prototyping can save significant money on app development projects by clarifying requirements and creating a unified vision. It also argues for cross-functional teams that include disciplines like security, testing, and operations from the beginning rather than as an afterthought. Finally, it provides some resources for prototyping with Sketch and Framer.

Data Lessons Learned at Scale

Charlie Reverte, VP of Engineering at AddThis, discusses lessons learned from processing large-scale web data. AddThis processes data from 14 million domains, including 100 billion monthly page views and 50,000 events per second. Reverte outlines challenges around distributed ID generation, counting unique values, joining distributed data, sampling large datasets, and deploying systems that invalidate over 1.4 billion browser caches. He advocates for loose coupling between systems using approaches like Kafka for asynchronous event logging. Reverte also discusses techniques for columnar compression, tunable quality of service, and open sourcing Hydra, AddThis' custom processing system optimized for real-time data.
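One of the challenges listed above, sampling large datasets, can be illustrated with reservoir sampling, a standard technique for drawing a fixed-size uniform sample from a stream whose length is not known in advance. The sketch below is illustrative only: the summary does not say which sampling method AddThis or Hydra uses, and the function name `reservoir_sample` and its parameters are assumptions made for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream.

    Hypothetical helper for illustration; it uses the classic
    "Algorithm R" and makes a single pass with O(k) memory.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item survives with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 5 event IDs from a simulated stream of one million events.
    print(reservoir_sample(range(1_000_000), 5, seed=42))
```

Because the reservoir never grows beyond k items, the same function works whether the stream holds a thousand events or billions, which is the property that matters at the event rates quoted above.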

What's hot

MongoDB .local Chicago 2019: MongoDB Atlas Data Lake Technical Deep Dive

Improve your SQL workload with observability

OVHcloud

Kafka as an Eventing System to Replatform a Monolith into Microservices

confluent

Scalable Application Development @ Picnic

Sander Mak (@Sander_Mak)

The Big Bad Data

Viewers also liked

Privacy Friendly Personalization

.Gov to .com

UI Testing Automation

AgileEngine

UI testing tools like Selenium allow testing user interfaces in real browsers to ensure proper rendering. Traditional UI testing requires development skills and test maintenance is tedious. Visual testing tools provide higher productivity by automating tests visually without code. Visual tests can be used to test complex applications like Gmail by recording user flows and validating page elements and differences. Visual testing empowers non-technical users and complements unit and API tests.

"Docker is NOT Container." ~ Understanding the relationship between Docker, container technology, and PaaS

Etsuji Nakai

Similar to Data Lessons Learned at Scale - Big Data DC

Data Science in the Cloud @StitchFix

C4Media

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2lGNybu. Stefan Krawczyk discusses how his team at StitchFix uses the cloud to enable over 80 data scientists to be productive. He also talks about prototyping ideas, algorithms and analyses, and how they set up and keep schemas in sync between Hive, Presto, Redshift & Spark and make access easy for their data scientists. Filmed at qconsf.com. Stefan Krawczyk is Algo Dev Platform Lead at StitchFix, where he’s leading development of the algorithm development platform. He spent formative years at Stanford, LinkedIn, Nextdoor & Idibon, working on everything from growth engineering, product engineering, and data engineering to recommendation systems, NLP, data science and business intelligence.

Netflix Big Data Paris 2017

Jason Flittner

MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas

Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...

Spark Summit

Streaming applications have often been complex to design and maintain because of the significant upfront infrastructure investment required. However, with the advent of Spark an easy transition to stream processing is now available, enabling personalization applications and experiments to consume near real-time data without massive development cycles. Our decision to evaluate Spark as our stream processing engine was primarily led by the following considerations: 1) ease of development for the team (already familiar with Spark for batch), 2) the scope/requirements of our problem, 3) re-usability of code from Spark batch jobs, and 4) Spark support from infrastructure teams within the company. In this session, we will present our experience using Spark for stream processing unbounded datasets in the personalization space. The datasets consisted of, but were not limited to, the stream of playback events that are used as feedback for all personalization algorithms. These plays are used to extract specific behaviors which are highly predictive of a customer’s enjoyment of our service. This dataset is massive and has to be further enriched by other online and offline Netflix data sources. These datasets, when consumed by our machine learning models, directly affect the customer’s personalized experience, which means that the impact is high and tolerance for failure is low. We’ll talk about the experiments we did to compare Spark with other streaming solutions like Apache Flink, the impact that we had on our customers, and most importantly, the challenges we faced. Take-aways for the audience: 1) A great example of stream processing large personalization datasets at scale. 2) An increased awareness of the costs/requirements for making the transition from batch to streaming successfully. 3) Exposure to some of the technical challenges that should be expected along the way.

Extracting Insights from Data at Twitter

Prasad Wagle

Prasad Wagle's talk discussed how Twitter extracts insights from its large volumes of data. Twitter collects hundreds of millions of tweets and interactions per day from over 300 million monthly active users, creating big data challenges around velocity, volume, and variety. Twitter stores this data in hundreds of petabytes across large Hadoop clusters and processes it using batch tools like Hadoop and Spark as well as real-time tools like Heron. Insights are generated through basic analytics like user counts, A/B testing of new features, and custom data science work including machine learning models for recommendations, content filtering, and ad targeting. Systems, programming, and statistical skills are needed to effectively extract value from Twitter's big data.

Big Data in 200 km/h | AWS Big Data Demystified #1.3

A while ago I entered the challenging world of Big Data. As an engineer, at first, I was not so impressed with this field. As time went by, I realised more and more that the technological challenges in this area are too great for one person to master. Just look at the picture in this article; it only covers a small fraction of the technologies in the Big Data industry… Consequently, I created a meetup detailing all the challenges of Big Data, especially in the world of cloud. I am using AWS infrastructure to answer the basic questions of anyone starting their way in the big data world:
- How to transform data (TXT, CSV, TSV, JSON) into Parquet, ORC?
- Which technology should we use to model the data? EMR? Athena? Redshift? Spectrum? Glue? Spark? SparkSQL?
- How to handle streaming?
- How to manage costs?
- Performance tips?
- Security tips?
- Cloud best practices tips?
Some of our online materials:
Website: https://big-data-demystified.ninja/
YouTube channels: https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber
Meetup: https://www.meetup.com/AWS-Big-Data-Demystified/ https://www.meetup.com/Big-Data-Demystified
Facebook group: https://www.facebook.com/groups/amazon.aws.big.data.demystified/
Facebook page: https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/
Audience: Data Engineers, Data Science, DevOps Engineers, Big Data Architects, Solution Architects, CTO, VP R&D

Big data at scrapinghub

Dana Brophy

Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...

DataStax

Dyn delivers exceptional Internet Performance. Enabling high quality services requires data centers around the globe. In order to manage services, customers need timely insight collected from all over the world. Dyn uses DataStax Enterprise (DSE) to deploy complex clusters across multiple datacenters to enable sub 50 ms query responses for hundreds of billions of data points. From granular DNS traffic data, to aggregated counts for a variety of report dimensions, DSE at Dyn has been up since 2013 and has shined through upgrades, data center migrations, DDoS attacks and hardware failures. In this webinar, Principal Engineers Tim Chadwick and Rick Bross cover the requirements which led them to choose DSE as their go-to Big Data solution, the path which led to SPARK, and the lessons that we’ve learned in the process.

Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...

Data Con LA

Enabling real-time exploration and analytics at scale to drive operational intelligence at Hulu, by Indrasis Mondal, Director, Data Engineering and Data Products, Hulu. Data is one of the most powerful assets for companies today and a key driver for innovation, product development and business efficiency. Operational intelligence allows modern organizations to use that data asset in real time to gain immediate insights into their business operations and make rapid decisions for strategic advantage. In this presentation we will walk through the operational intelligence capabilities Hulu has built to process tens of millions of events per minute to enable fast exploration of data and real-time decision making.

AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English

Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...

Flink Forward

Apache Flink is a popular stream computing framework for real-time stream computing. Many stream compute algorithms require trailing data in order to compute the intended result. One example is computing the number of user logins in the last 7 days. This creates a dilemma where the results of the stream program are incomplete until the runtime of the program exceeds 7 days. The alternative is to bootstrap the program using historic data to seed the state before shifting to use real-time data. This talk will discuss alternatives to bootstrap programs in Flink. Some alternatives rely on technologies exogenous to the stream program, such as enhancements to the pub/sub layer, that are more generally applicable to other stream compute engines. Other alternatives include enhancements to Flink source implementations. Lyft is exploring another alternative using orchestration of multiple Flink programs. The talk will cover why Lyft pursued this alternative and future directions to further enhance bootstrapping support in Flink.
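The bootstrapping dilemma described above can be made concrete with a small, framework-free sketch. It does not use Flink's API and is not the approach Lyft or the talk proposes; the class `TrailingLoginCounter` and the helper `bootstrap` are hypothetical names introduced only to show why a 7-day count is incomplete on a cold start and how seeding state from historic data fixes that.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Deque, Iterable


class TrailingLoginCounter:
    """Counts logins seen in a trailing time window (hypothetical example class)."""

    def __init__(self, window: timedelta = timedelta(days=7)) -> None:
        self.window = window
        self.events: Deque[datetime] = deque()  # timestamps in ascending order

    def add(self, ts: datetime) -> None:
        """Record one login; timestamps are assumed to arrive in order."""
        self.events.append(ts)
        self._evict(ts)

    def count(self, now: datetime) -> int:
        """Number of logins inside the window ending at `now`."""
        self._evict(now)
        return len(self.events)

    def _evict(self, now: datetime) -> None:
        # Drop logins that fell out of the trailing window.
        cutoff = now - self.window
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()


def bootstrap(counter: TrailingLoginCounter, historic: Iterable[datetime]) -> None:
    """Seed the counter's state by replaying historic logins in time order."""
    for ts in sorted(historic):
        counter.add(ts)


if __name__ == "__main__":
    now = datetime(2024, 1, 10)
    historic = [now - timedelta(days=d) for d in (1, 3, 6, 9)]

    cold = TrailingLoginCounter()
    cold.add(now)
    print("cold start:", cold.count(now))      # only sees the live event

    warm = TrailingLoginCounter()
    bootstrap(warm, historic)
    warm.add(now)
    print("bootstrapped:", warm.count(now))    # also sees logins from the last 7 days
```

In a real stream processor the replay step would read from an archive or a longer-retention log rather than an in-memory list, which is where the alternatives discussed in the talk differ.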

AWS Big Data Demystified #1: Big data architecture lessons learned

AWS Big Data Demystified #1: Big data architecture lessons learned. A quick overview of big data technologies that were selected and disregarded in our company. The video: https://youtu.be/l5KmaZNQxaU (don't forget to subscribe to the YouTube channel). The website: https://amazon-aws-big-data-demystified.ninja/ The meetup: https://www.meetup.com/AWS-Big-Data-Demystified/ The Facebook group: https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/

Web performance mercadolibre - ECI 2013

Santiago Aimetta

The document discusses techniques for improving web performance, including reducing time to first byte, using content delivery networks and HTTP compression, caching resources, keeping connections alive and reducing request sizes. It also covers optimizing images, loading JavaScript asynchronously to avoid blocking, and prefetching content. The overall goal is to reduce page load times and improve user experience.

AWS Big Data Demystified #1.2 | Big Data architecture lessons learned

A while ago I entered the challenging world of Big Data. As an engineer, at first, I was not so impressed with this field. As time went by, I realised more and more that the technological challenges in this area are too great for one person to master. Just look at the picture in this article; it only covers a small fraction of the technologies in the Big Data industry… Consequently, I created a meetup detailing all the challenges of Big Data, especially in the world of cloud. I am using AWS, GCP and data center infrastructure to answer the basic questions of anyone starting their way in the big data world:
- How to transform data (TXT, CSV, TSV, JSON) into Parquet, ORC, Avro?
- Which technology should we use to model the data? EMR? Athena? Redshift? Spectrum? Glue? Spark? SparkSQL? GCS? BigQuery? Dataflow? Datalab? TensorFlow?
- How to handle streaming?
- How to manage costs?
- Performance tips?
- Security tips?
- Cloud best practices tips?
In this meetup we shall present lecturers working with several cloud vendors, various big data platforms such as Hadoop, data warehouses, and startups working on big data products. Basically, if it is related to big data, this is THE meetup.
Some of our online materials (mixed content from several cloud vendors):
Website: https://big-data-demystified.ninja (under construction)
Meetups: https://www.meetup.com/Big-Data-Demystified https://www.meetup.com/AWS-Big-Data-Demystified/
YouTube channels: https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
Audience: Data Engineers, Data Science, DevOps Engineers, Big Data Architects, Solution Architects, CTO, VP R&D

Streamsets and spark in Retail

Hari Shreedharan

This document discusses how StreamSets and Spark can be used together for analytics insights in retail. Some key points:
- StreamSets Data Collector (SDC) is used to ingest IoT and sensor data from various sources into a common format and process the data in real-time using Spark evaluators.
- The Spark evaluator allows running Spark transformations on batches of data within SDC pipelines to do tasks like anomaly detection, sentiment analysis, and fraud detection.
- SDC can also be used to move data to and from Spark for end-of-batch processing using a Spark executor, such as running jobs on Databricks after files land in S3.
- Together, SDC and Spark

Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma

Spark Summit

Learn about the Big Data Processing ecosystem at Netflix and how Apache Spark sits in this platform. I talk about typical data flows and data pipeline architectures that are used at Netflix and address how Spark is helping us gain efficiency in our processes. As a bonus – I’ll touch on some unconventional use-cases contrary to typical warehousing / analytics solutions that are being served by Apache Spark.

Introduction to Data Engineer and Data Pipeline at Credit OK

Kriangkrai Chaonithi

The document discusses the role of data engineers and data pipelines. It begins with an introduction to big data and why data volumes are increasing. It then covers what data engineers do, including building data architectures, working with cloud infrastructure, and programming for data ingestion, transformation, and loading. The document also explains data pipelines, describing extract, transform, load (ETL) processes and batch versus streaming data. It provides an example of Credit OK's data pipeline architecture on Google Cloud Platform that extracts raw data from various sources, cleanses and loads it into BigQuery, then distributes processed data to various applications. It emphasizes the importance of data engineers in processing and managing large, complex data sets.

Web performance optimization - MercadoLibre

Pablo Moretti

The document provides techniques and tools for improving web performance. It discusses how reducing response times can directly impact revenues and user experience. It then covers various ways to optimize the frontend, including reducing time to first byte through DNS optimization and caching, using content delivery networks, HTTP compression, keeping connections alive, parallel downloads, and prefetching. It also discusses optimizing images, JavaScript loading, and introducing new formats like WebP. The overall document aims to educate on measuring and enhancing web performance.

A Day in the Life of a Druid Implementor and Druid's Roadmap

Itai Yaffe

This document summarizes a typical day for a Druid architect. It describes common tasks like evaluating production clusters, analyzing data and queries, and recommending optimizations. The architect asks stakeholders questions to understand usage and helps evaluate if Druid is a good fit. When advising on Druid, the architect considers factors like data sources, query types, and technology stacks. The document also provides tips on configuring clusters for performance and controlling segment size.

kranonit S06E01 Игорь Цинько: High load

Krivoy Rog IT Community

This document summarizes a presentation about designing systems to handle high loads when Chuck Norris is your customer. It discusses scaling architectures vertically and horizontally, RESTful principles, using NoSQL databases like MongoDB, caching with Memcached, search engines like Sphinx, video/image storage, and bandwidth management. It emphasizes that the right technology depends on business needs, and high-load systems require robust architectures, qualified developers, and avoiding single points of failure.

Recently uploaded

GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024

GenAI Pilot Implementation in the organizations

kumardaparthi1024

Removing Uninteresting Bytes in Software Fuzzing

Aftab Hussain

Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process. In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds. These are slides of the talk given at the IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.

Pushing the limits of ePRTC: 100ns holdover for 100 days

Adtran

Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!

SOFTTECHHUB

As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.

HCL Notes and Domino License Cost Reduction in the World of DLAU

panagenda

Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/ The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this! We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model. Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your cost through an optimized configuration and keep it low going forward. These topics will be covered: - Reducing license cost by finding and fixing misconfigurations and superfluous accounts - How do CCB and CCX licenses really work? - Understanding the DLAU tool and how to best utilize it - Tips for common problem areas, like team mailboxes, functional/test users, etc. - Practical examples and best practices to implement right away

GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...

Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank. Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.

Artificial Intelligence for XML Development

Octavian Nadolu

In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject. We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup. Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved. The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring. The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise. By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. 
We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.

Communications Mining Series - Zero to Hero - Session 1

DianaGray10

This session provides an introduction to UiPath Communication Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered: • Communication Mining Overview • Why is it important? • How it can help today’s business, and its benefits • Phases in Communication Mining • Demo on Platform overview • Q/A

UiPath Test Automation using UiPath Test Suite series, part 5

DianaGray10

Video Streaming: Then, Now, and in the Future

Alpen-Adria-Universität

In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.

GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...

Leonard Jayamohan, Partner & Generative AI Lead, Deloitte. This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.

UiPath Test Automation using UiPath Test Suite series, part 6

DianaGray10

Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and OpenAI. The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with OpenAI's advanced natural language processing capabilities. Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes. What will you get from this session? 1. Insights into integrating generative AI. 2. Understanding how this integration enhances test automation within the UiPath platform. 3. Practical demonstrations. 4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath. Topics covered: What is generative AI; Test Automation with generative AI and OpenAI; UiPath integration with generative AI. Speaker: Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP

AI 101: An Introduction to the Basics and Impact of Artificial Intelligence

IndexBug

TrustArc Webinar - 2024 Global Privacy Survey

TrustArc

How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024? In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores. See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe. This webinar will review: - The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey - The top challenges for privacy leaders, practitioners, and organizations in 2024 - Key themes to consider in developing and maintaining your privacy program

Serial Arm Control in Real Time Presentation

tolgahangng

GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...