Prasad Wagle's talk discussed how Twitter extracts insights from its large volumes of data. Twitter collects hundreds of millions of tweets and interactions per day from over 300 million monthly active users, creating big data challenges around velocity, volume, and variety. Twitter stores this data in hundreds of petabytes across large Hadoop clusters and processes it using batch tools like Hadoop and Spark as well as real-time tools like Heron. Insights are generated through basic analytics like user counts, A/B testing of new features, and custom data science work including machine learning models for recommendations, content filtering, and ad targeting. Systems, programming, and statistical skills are needed to effectively extract value from Twitter's big data.
Slides from a Heron tech talk by Karthik Ramasamy, Maosong Fu and Bill Graham at the Hive in August 2016. The presentation includes an overview of the Heron architecture, how Heron is used at Twitter and how to install Heron and run an example topology.
A video of the talk can be found at https://www.youtube.com/watch?v=FRvmeoJCZKU.
Big Data Day LA 2016 / Big Data Track - Twitter Heron @ Scale - Karthik Ramasa... (Data Con LA)
Twitter generates billions and billions of events per day. Analyzing these events in real time presents a massive challenge. Twitter designed and deployed a new streaming system called Heron. Heron has been in production for nearly 2 years and is widely used by several teams for diverse use cases. This talk looks at Twitter's operating experiences and challenges of running Heron at scale and the approaches taken to solve those challenges.
Real Time Processing Using Twitter Heron by Karthik Ramasamy (Data Con LA)
Abstract: Today's enterprises are not only producing data in high volume but also at high velocity. With velocity comes the need to process the data in real time. To meet these real-time needs, we developed and deployed Heron, the next-generation streaming engine at Twitter. Heron processes billions and billions of events per day at Twitter and has been in production for nearly 3 years. Heron provides unparalleled performance at large scale and has been successfully meeting Twitter's strict performance requirements for various streaming and IoT applications. Heron is an open source project with several major contributors from various institutions. As part of the project, we identified and implemented several optimizations that improved throughput by an additional 5x and further reduced latency by 50-60%. In this talk, we will describe Heron in detail and show how detailed profiling indicated the performance bottleneck areas, such as multiple serializations/deserializations and immutable data structures. After mitigating these costs, we were able to show much higher throughput and latencies as low as 12 ms.
Delta Lake delivers reliability, security and performance to data lakes. Join this session to learn how customers have achieved 48x faster data processing, leading to 50% faster time to insight after implementing Delta Lake. You’ll also learn how Delta Lake provides the perfect foundation for a cost-effective, highly scalable lakehouse architecture.
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic (DataScienceConferenc1)
Databricks' founders caused a seismic shift in the data analysis community when they created Apache Spark, which has become a cornerstone of Big Data processing pipelines and tools in large and small companies all around the world. Now they've built a revolutionary, comprehensive and easy-to-use platform around Apache Spark and their other inventions, such as the MLflow and Koalas frameworks and, most importantly, the Data Lakehouse: a concept of fusing data warehouse and data lake architectures into a single versatile and fast platform. The technical foundation for the Databricks Data Lakehouse is Delta Lake. More than 7000 organizations today rely on Databricks to enable massive-scale data engineering, collaborative data science, full-lifecycle machine learning and business analytics. Come to the talk and see the demo to find out why.
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than... (Databricks)
Upwork has the biggest closed-loop online dataset of jobs and job seekers in labor history (>10M Profiles, >100M Job Posts, Job Proposals and Hiring Decisions, >10B of Messages, Transaction and Feedback Data). Besides sheer quantity, our data is also contextually very rich. We have client and contractor data for the entire job-funnel – from finding jobs to getting the job done.
For various machine learning applications including search and recommendations and labor marketplace optimization (rate, supply and demand), we heavily relied on a Greenplum-based data warehouse solution for data processing and ad-hoc ML pipelines (weka, scikit-learn, R) for offline model development and online model scoring.
In this talk, we present our modernization efforts in moving towards 1) a holistic data processing infrastructure for batch and stream data processing using S3, Kinesis, Spark and Spark Structured Streaming, 2) model development using Spark MLlib and other ML libraries for Spark, 3) model serving using Databricks Model Scoring, Scoring over Structured Streams and microservices, and 4) orchestration and streamlining of all these processes using Apache Airflow and a CI/CD workflow customized to our Data Science product engineering needs. The focus of this talk is on how we were able to leverage the Databricks service offering to reduce DevOps overhead and costs, complete the entire modernization with moderate effort and adopt a collaborative notebook-based solution for all our data scientists to develop models, reuse features and share results. We will share the core lessons learned and the pitfalls we encountered during this journey.
Making Apache Spark Better with Delta Lake (Databricks)
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies the streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
In this talk, we will cover:
* What data quality problems Delta helps address
* How to convert your existing application to Delta Lake
* How the Delta Lake transaction protocol works internally
* The Delta Lake roadmap for the next few releases
* How to get involved!
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D... (Databricks)
Many have dubbed the 2020s the decade of data. This is indeed an era of data zeitgeist.
From code-centric software development 1.0, we are entering software development 2.0, a data-centric and data-driven approach, where data plays a central theme in our everyday lives.
As the volume and variety of data garnered from myriad data sources continue to grow at an astronomical scale and as cloud computing offers cheap computing and data storage resources at scale, the data platforms have to match in their abilities to process, analyze, and visualize at scale and speed and with ease — this involves data paradigm shifts in processing and storing and in providing programming frameworks to developers to access and work with these data platforms.
In this talk, we will survey some emerging technologies that address the challenges of data at scale, how these tools help data scientists and machine learning developers with their data tasks, why they scale, and how they facilitate the future data scientists to start quickly.
In particular, we will examine in detail two open-source tools MLflow (for machine learning life cycle development) and Delta Lake (for reliable storage for structured and unstructured data).
We will also look at other emerging tools such as Koalas, which help data scientists do exploratory data analysis at scale in a language and framework they are familiar with, as well as emerging data + AI trends in 2021.
You will understand the challenges of machine learning model development at scale, why you need reliable and scalable storage, and what other open source tools are at your disposal to do data science and machine learning at scale.
FLiP Into Trino
FLiP into Trino. Flink Pulsar Trino
Pulsar SQL (Trino/Presto)
Remember the days when you could wait until your batch data load was done and then run some simple queries or build stale dashboards? Those days are over; today you need instant analytics as the data streams in, in real time. You need universal analytics where that data is. I will show you how to do this utilizing the latest cloud native open source tools. In this talk we will utilize Trino, Apache Pulsar, Pulsar SQL and Apache Flink to instantly analyze data from IoT, sensors, transportation systems, logs, REST endpoints, XML, images, PDFs, documents, text, semistructured data, unstructured data, structured data and a hundred data sources you could never dream of streaming before. I will teach you how to use Pulsar SQL to run analytics on live data.
Tim Spann
Developer Advocate
StreamNative
David Kjerrumgaard
Developer Advocate
StreamNative
https://www.starburst.io/info/trinosummit/
https://github.com/tspannhw/FLiP-Into-Trino/blob/main/README.md
https://github.com/tspannhw/StreamingAnalyticsUsingFlinkSQL/tree/main/src/main/java
select * from pulsar."public/default"."weather";
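The query above double-quotes the schema because it contains a slash (it appears to follow a tenant/namespace pattern), and an unquoted SQL identifier cannot contain that character. As a minimal sketch, the Python helper below builds such a fully qualified table name using standard SQL identifier quoting. The function names `quote_ident` and `qualified_table` are hypothetical and are not part of Trino or Pulsar.

```python
def quote_ident(name: str) -> str:
    """Wrap an identifier in double quotes, doubling any embedded quotes
    (standard SQL delimited-identifier quoting)."""
    return '"' + name.replace('"', '""') + '"'


def qualified_table(catalog: str, schema: str, table: str) -> str:
    """Build catalog."schema"."table"; the catalog is assumed to be a plain
    identifier that needs no quoting (hypothetical helper)."""
    return f"{catalog}.{quote_ident(schema)}.{quote_ident(table)}"


# Rebuild the query shown above from its three parts.
query = f"select * from {qualified_table('pulsar', 'public/default', 'weather')};"
print(query)
```

The resulting string could then be passed to whichever SQL client is in use.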
Apache Pulsar plus Trino = fast analytics at scale
Real-time Analytics with Trino and Apache Pinot (Xiang Fu)
Trino summit 2021:
Overview of Trino Pinot Connector, which bridges the flexibility of Trino's full SQL support to the power of Apache Pinot's realtime analytics, giving you the best of both worlds.
Building Reliable Data Lakes at Scale with Delta Lake (Databricks)
Most data practitioners grapple with data reliability issues—it’s the bane of their existence. Data engineers, in particular, strive to design, deploy, and serve reliable data in a performant manner so that their organizations can make the most of their valuable corporate data assets.
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads. Built on open standards, Delta Lake employs co-designed compute and storage and is compatible with Spark APIs. It powers high data reliability and query performance to support big data use cases, from batch and streaming ingest and fast interactive queries to machine learning. In this tutorial we will discuss the requirements of modern data engineering, the challenges data engineers face when it comes to data reliability and performance, and how Delta Lake can help. Through presentation, code examples and notebooks, we will explain these challenges and the use of Delta Lake to address them. You will walk away with an understanding of how you can apply this innovation to your data architecture and the benefits you can gain.
This tutorial will be both an instructor-led and a hands-on interactive session. Instructions on how to get tutorial materials will be covered in class.
What you’ll learn:
* Understand the key data reliability challenges
* How Delta Lake brings reliability to data lakes at scale
* Understand how Delta Lake fits within an Apache Spark™ environment
* How to use Delta Lake to realize data reliability improvements
Prerequisites:
* A fully-charged laptop (8-16GB memory) with Chrome or Firefox
* Pre-register for Databricks Community Edition
Tomer Shiran is the founder and Chief Product Officer (CPO) of Dremio. Tomer was the 4th employee and VP of Product at MapR, a pioneer in Big Data analytics. He has also held numerous product management and engineering positions at IBM Research and Microsoft, and founded several websites that have served millions of users. He holds a Master's degree in computer engineering from Carnegie Mellon University and a Bachelor of Science in computer science from the Technion - Israel Institute of Technology.
The Modern Data Stack meetup is delighted to welcome Tomer Shiran. From Apache Drill and Apache Arrow to, now, Apache Iceberg, he and his teams have anchored Dremio's choices in a vision of an “open” data platform based on open source technologies. Beyond these values, which keep customers from being locked into proprietary formats, he is also mindful of the costs that such platforms incur. He also offers a number of features that transform data management, through initiatives such as Nessie, which paves the way for Data as Code and multi-process transactions.
The Modern Data Stack Meetup gives Tomer Shiran “carte blanche” to share his experience and his vision of the Open Data Lakehouse.
Delta from a Data Engineer's Perspective (Databricks)
Take a walk through the daily struggles of a data engineer in this presentation as we cover what is truly needed to create robust end to end Big Data solutions.
Building Data Quality pipelines with Apache Spark and Delta Lake (Databricks)
Technical Leads and Databricks Champions Darren Fuller & Sandy May will give a fast-paced view of how they have productionised Data Quality Pipelines across multiple enterprise customers. Their vision of empowering business decisions on data remediation actions and self-healing of Data Pipelines led them to build a library of Data Quality rule templates with an accompanying reporting Data Model and PowerBI reports.
With the drive for more and more intelligence to be driven from the Lake and less from the Warehouse, also known as the Lakehouse pattern, Data Quality at the Lake layer becomes pivotal. Tools like Delta Lake become building blocks for Data Quality with schema protection and simple column checking; however, for larger customers they often do not go far enough. Quick-fire notebook demos will show how Spark can be leveraged at the point of Staging or Curation to apply rules over data.
Expect to see simple rules, such as Net sales = Gross sales + Tax or values existing within a list, as well as complex rules such as validation of statistical distributions and complex pattern matching. The talk ends with a quick view into future work in the realm of Data Compliance for PII data, with generation of rules using regex patterns and Machine Learning rules based on transfer learning.
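The two simple rules named above can be illustrated with a small rule-template sketch. This is plain Python rather than Spark, and it is not the speakers' library: the names `Rule`, `net_sales_rule`, `in_list_rule` and `evaluate`, the column names, and the tolerance value are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


@dataclass(frozen=True)
class Rule:
    """A named data quality check applied to one row (hypothetical template)."""
    name: str
    check: Callable[[Row], bool]


def net_sales_rule(tolerance: float = 0.01) -> Rule:
    """Net sales = Gross sales + Tax, within a small tolerance (assumed value)."""
    def check(row: Row) -> bool:
        expected = row["gross_sales"] + row["tax"]
        return abs(row["net_sales"] - expected) <= tolerance
    return Rule("net_sales_equals_gross_plus_tax", check)


def in_list_rule(column: str, allowed: Iterable[object]) -> Rule:
    """The value in `column` must exist within a list of allowed values."""
    allowed_set = frozenset(allowed)
    return Rule(f"{column}_in_allowed_list", lambda row: row[column] in allowed_set)


def evaluate(rows: List[Row], rules: List[Rule]) -> Dict[str, List[int]]:
    """Return, per rule name, the indexes of the rows that fail the rule."""
    failures: Dict[str, List[int]] = {rule.name: [] for rule in rules}
    for index, row in enumerate(rows):
        for rule in rules:
            if not rule.check(row):
                failures[rule.name].append(index)
    return failures


rows = [
    {"net_sales": 108.0, "gross_sales": 100.0, "tax": 8.0, "region": "EU"},
    {"net_sales": 100.0, "gross_sales": 100.0, "tax": 8.0, "region": "EU"},
    {"net_sales": 54.0, "gross_sales": 50.0, "tax": 4.0, "region": "MARS"},
]
rules = [net_sales_rule(), in_list_rule("region", ["EU", "US"])]
failures = evaluate(rows, rules)
print(failures)
```

Keeping each rule as a named, parameterised template means the same check can be reused across datasets, and the per-rule failure lists could feed a reporting model. In a Spark setting the same idea would more likely be expressed as column expressions than as row-by-row Python.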
Frame - Feature Management for Productive Machine Learning (David Stein)
Presented at the ML Platforms Meetup at Pinterest HQ in San Francisco on August 16, 2018.
Abstract: At LinkedIn we observed that much of the complexity in our machine learning applications was in their feature preparation workflows. To address this problem, we built Frame, a shared virtual feature store that provides a unified abstraction layer for accessing features by name. Frame removes the need for feature consumers to deal directly with underlying data sources, which are often different across computing environments. By simplifying feature preparation, Frame has made ML applications at LinkedIn easier to build, modify, and understand.
Democratizing Data Quality Through a Centralized Platform (Databricks)
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
* Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
* Performing data quality validations using libraries built to work with Spark
* Dynamically generating pipelines that can be abstracted away from users
* Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
* Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
PyCaret is an open source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within minutes in your choice of environment. This talk is a practical demo of using PyCaret in your existing workflows to supercharge your data science team's productivity.
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro... (Databricks)
The ‘feature store’ is an emerging concept in data architecture that is motivated by the challenge of productionizing ML applications. The rapid iteration in experimental, data driven research applications creates new challenges for data management and application deployment.
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar (Databricks)
This session will give a new dimension to Apache Spark’s usage. See how Apache Spark and other open source projects can be used together in providing a scalable, real-time monitoring system. Apache Spark plays the central role in providing this scalable solution, since without Spark Streaming we would not be able to process millions of events in real time. This approach can provide a lot of learning to the DevOps/Infrastructure domain on how to build a scalable and automated logging and monitoring solution using Apache Spark, Apache Kafka, Grafana and some other open-source technologies.
Sony PlayStation’s monitoring pipeline processes about 40 billion events every day, and generates metrics in near real-time (within 30 seconds). All the components, used along with Apache Spark, are horizontally scalable using any auto-scaling techniques, which enhances the reliability of this efficient and highly available monitoring solution. Sony Interactive Entertainment has been using Apache Spark, and specifically Spark Streaming, for the last three years. Hear about some important lessons they have learned. For example, they still use Spark Streaming’s receiver-based method in certain use cases instead of Direct Streaming, and will share the application of both the methods, giving the knowledge back to the community.
A Thorough Comparison of Delta Lake, Iceberg and Hudi (Databricks)
Recently, a set of modern table formats such as Delta Lake, Hudi and Iceberg has sprung up. Along with Hive Metastore, these table formats are trying to solve problems that have long stood in traditional data lakes, with declared features like ACID, schema evolution, upsert, time travel, incremental consumption, etc.
Observability for Data Pipelines With OpenLineage (Databricks)
Data is increasingly becoming core to many products, whether to provide recommendations for users, to get insights into how they use the product, or to use machine learning to improve the experience. This creates a critical need for reliable data operations and for understanding how data flows through our systems. Data pipelines must be auditable, reliable, and run on time. This proves particularly difficult in a constantly changing, fast-paced environment.
Collecting this lineage metadata as data pipelines are running provides an understanding of the dependencies between the many teams consuming and producing data and of how constant changes impact them. It is the underlying foundation that enables the many use cases related to data operations. The OpenLineage project is an API standardizing this metadata across the ecosystem, reducing complexity and duplicate work in collecting lineage information. It enables many projects that consume lineage in the ecosystem, whether they focus on operations, governance or security.
Marquez is an open source project, part of the LF AI & Data Foundation, that instruments data pipelines to collect lineage and metadata and enable those use cases. It implements the OpenLineage API and provides context by making visible the dependencies across organizations and technologies as they change over time.
Performance Troubleshooting Using Apache Spark Metrics (Databricks)
Performance troubleshooting of distributed data processing systems is a complex task. Apache Spark comes to the rescue with a large set of metrics and instrumentation that you can use to understand and improve the performance of your Spark-based applications. You will learn about the available metric-based instrumentation in Apache Spark: executor task metrics and the Dropwizard-based metrics system. The talk will cover how the Hadoop and Spark service at CERN is using Apache Spark metrics for troubleshooting performance and measuring production workloads. Notably, the talk will cover how to deploy a performance dashboard for Spark workloads and the use of sparkMeasure, a tool based on the Spark Listener interface. The speaker will discuss the lessons learned so far and what improvements you can expect in this area in Apache Spark 3.0.
Data Warehouse or Data Lake, Which Do I Choose? (DATAVERSITY)
Today’s data-driven companies have a choice to make: where do we store our data? As the move to the cloud continues to be a driving factor, the choice becomes either the data warehouse (Snowflake et al.) or the data lake (AWS S3 et al.). There are pros and cons to each approach. While the data warehouse gives you strong data management with analytics, it doesn’t do well with semi-structured and unstructured data and comes with tightly coupled storage and compute, not to mention expensive vendor lock-in. On the other hand, data lakes allow you to store all kinds of data and are extremely affordable, but they’re only meant for storage and by themselves provide no direct value to an organization.
Enter the Open Data Lakehouse, the next evolution of the data stack that gives you the openness and flexibility of the data lake with the key aspects of the data warehouse like management and transaction support.
In this webinar, you’ll hear from Ali LeClerc who will discuss the data landscape and why many companies are moving to an open data lakehouse. Ali will share more perspective on how you should think about what fits best based on your use case and workloads, and how some real world customers are using Presto, a SQL query engine, to bring analytics to the data lakehouse.
FLiP Into Trino
FLiP into Trino. Flink Pulsar Trino
Pulsar SQL (Trino/Presto)
Remember the days when you could wait until your batch data load was done and then you could run some simple queries or build stale dashboards? Those days are over, today you need instant analytics as the data is streaming in real-time. You need universal analytics where that data is. I will show you how to do this utilizing the latest cloud native open source tools. In this talk we will utilize Trino, Apache Pulsar, Pulsar SQL and Apache Flink to analyze instantly data from IoT, sensors, transportation systems, Logs, REST endpoints, XML, Images, PDFs, Documents, Text, semistructured data, unstructured data, structured data and a hundred data sources you could never dream of streaming before. I will teach how to use Pulsar SQL to run analytics on live data.
Tim Spann
Developer Advocate
StreamNative
David Kjerrumgaard
Developer Advocate
StreamNative
https://www.starburst.io/info/trinosummit/
https://github.com/tspannhw/FLiP-Into-Trino/blob/main/README.md
https://github.com/tspannhw/StreamingAnalyticsUsingFlinkSQL/tree/main/src/main/java
select * from pulsar."public/default"."weather";
Apache Pulsar plus Trio = fast analytics at scale
Real-time Analytics with Trino and Apache PinotXiang Fu
Trino summit 2021:
Overview of Trino Pinot Connector, which bridges the flexibility of Trino's full SQL support to the power of Apache Pinot's realtime analytics, giving you the best of both worlds.
Building Reliable Data Lakes at Scale with Delta LakeDatabricks
Most data practitioners grapple with data reliability issues—it’s the bane of their existence. Data engineers, in particular, strive to design, deploy, and serve reliable data in a performant manner so that their organizations can make the most of their valuable corporate data assets.
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads. Built on open standards, Delta Lake employs co-designed compute and storage and is compatible with Spark APIs. It powers high data reliability and query performance to support big data use cases, from batch and streaming ingests and fast interactive queries to machine learning. In this tutorial we will discuss the requirements of modern data engineering, the challenges data engineers face when it comes to data reliability and performance, and how Delta Lake can help. Through presentation, code examples and notebooks, we will explain these challenges and the use of Delta Lake to address them. You will walk away with an understanding of how you can apply this innovation to your data architecture and the benefits you can gain.
This tutorial will be an instructor-led, hands-on interactive session. Instructions on how to get tutorial materials will be covered in class.
What you’ll learn:
Understand the key data reliability challenges
How Delta Lake brings reliability to data lakes at scale
Understand how Delta Lake fits within an Apache Spark™ environment
How to use Delta Lake to realize data reliability improvements
Prerequisites
A fully-charged laptop (8-16GB memory) with Chrome or Firefox
Pre-register for Databricks Community Edition
Tomer Shiran is the founder and Chief Product Officer (CPO) of Dremio. Tomer was the fourth employee and VP of Product at MapR, a pioneer in big data analytics. He has also held numerous product management and engineering positions at IBM Research and Microsoft, and founded several websites that have served millions of users. He holds a Master's degree in computer engineering from Carnegie Mellon University and a Bachelor of Science in computer science from the Technion - Israel Institute of Technology.
The Modern Data Stack meetup is delighted to welcome Tomer Shiran. From Apache Drill and Apache Arrow to, now, Apache Iceberg, he and his teams have anchored Dremio's choices in a vision of an “open” data platform built on open source technologies. Beyond these values, which keep customers from being locked into proprietary formats, he is also mindful of the costs such platforms incur. He also offers a number of features that transform data management through initiatives such as Nessie, which paves the way for Data as Code and multi-process transactions.
The Modern Data Stack Meetup is giving Tomer Shiran “carte blanche” to share his experience and vision of the Open Data Lakehouse.
Delta from a Data Engineer's PerspectiveDatabricks
Take a walk through the daily struggles of a data engineer in this presentation as we cover what is truly needed to create robust end to end Big Data solutions.
Building Data Quality pipelines with Apache Spark and Delta LakeDatabricks
Technical Leads and Databricks Champions Darren Fuller & Sandy May will give a fast paced view of how they have productionised Data Quality Pipelines across multiple enterprise customers. Their vision to empower business decisions on data remediation actions and self healing of Data Pipelines led them to build a library of Data Quality rule templates and accompanying reporting Data Model and PowerBI reports.
With the drive for more and more intelligence driven from the Lake and less from the Warehouse, also known as the Lakehouse pattern, Data Quality at the Lake layer becomes pivotal. Tools like Delta Lake become building blocks for Data Quality with Schema protection and simple column checking; however, for larger customers they often do not go far enough. Quick-fire notebook demos will show how Spark can be leveraged at the point of Staging or Curation to apply rules over data.
Expect to see simple rules such as Net sales = Gross sales + Tax, or values existing within a list, as well as complex rules such as validation of statistical distributions and complex pattern matching. The talk ends with a quick view into future work in the realm of Data Compliance for PII data, with generation of rules using regex patterns and Machine Learning rules based on transfer learning.
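The two simple rules named in this abstract can be illustrated in a few lines. The sketch below is plain Python rather than the Spark notebooks the speakers describe, and every name in it (`net_sales_rule`, `in_list_rule`, `apply_rules`, the column names) is a hypothetical stand-in, not the speakers' rule library.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]
Rule = Callable[[Row], bool]


def net_sales_rule(row: Row, tolerance: float = 0.01) -> bool:
    """Check the rule quoted in the abstract: net sales = gross sales + tax."""
    expected = row["gross_sales"] + row["tax"]
    return abs(row["net_sales"] - expected) <= tolerance


def in_list_rule(column: str, allowed: Iterable[object]) -> Rule:
    """Build a rule that passes when the column value is in an allowed list."""
    allowed_set = set(allowed)

    def rule(row: Row) -> bool:
        return row[column] in allowed_set

    return rule


def apply_rules(rows: List[Row], rules: Dict[str, Rule]) -> Dict[str, List[int]]:
    """Return, per rule name, the indexes of the rows that fail that rule."""
    failures: Dict[str, List[int]] = {name: [] for name in rules}
    for index, row in enumerate(rows):
        for name, rule in rules.items():
            if not rule(row):
                failures[name].append(index)
    return failures


rows = [
    {"gross_sales": 100.0, "tax": 10.0, "net_sales": 110.0, "currency": "GBP"},
    {"gross_sales": 100.0, "tax": 10.0, "net_sales": 120.0, "currency": "XXX"},
]
rules = {
    "net_sales": net_sales_rule,
    "currency": in_list_rule("currency", ["GBP", "USD", "EUR"]),
}
print(apply_rules(rows, rules))
```

Reporting failing row indexes per rule, rather than raising on the first failure, keeps the output usable for the kind of remediation reporting the abstract mentions.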
Frame - Feature Management for Productive Machine LearningDavid Stein
Presented at the ML Platforms Meetup at Pinterest HQ in San Francisco on August 16, 2018.
Abstract: At LinkedIn we observed that much of the complexity in our machine learning applications was in their feature preparation workflows. To address this problem, we built Frame, a shared virtual feature store that provides a unified abstraction layer for accessing features by name. Frame removes the need for feature consumers to deal directly with underlying data sources, which are often different across computing environments. By simplifying feature preparation, Frame has made ML applications at LinkedIn easier to build, modify, and understand.
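The idea of accessing features by name, independent of where the data lives in each environment, can be sketched as a small registry. This is only an illustration of the abstraction described in the abstract; `FeatureRegistry`, its methods, and the feature and environment names are all hypothetical and do not reflect Frame's actual API.

```python
from typing import Callable, Dict, Tuple

Loader = Callable[[str], object]


class FeatureRegistry:
    """Map (feature name, environment) to a loader so callers never see the source."""

    def __init__(self) -> None:
        self._loaders: Dict[Tuple[str, str], Loader] = {}

    def register(self, name: str, environment: str, loader: Loader) -> None:
        """Associate a feature name with the loader for one environment."""
        self._loaders[(name, environment)] = loader

    def get(self, name: str, environment: str, entity_id: str) -> object:
        """Fetch a feature value by name for one entity in one environment."""
        key = (name, environment)
        if key not in self._loaders:
            raise KeyError(f"feature {name!r} is not registered for {environment!r}")
        return self._loaders[key](entity_id)


# Hypothetical sources: an offline table and an online key-value store.
offline_table = {"member-1": 42}
online_store = {"member-1": 43}

registry = FeatureRegistry()
registry.register("connection_count", "offline", offline_table.__getitem__)
registry.register("connection_count", "online", online_store.__getitem__)

print(registry.get("connection_count", "offline", "member-1"))
print(registry.get("connection_count", "online", "member-1"))
```

The consumer code is the same in both environments; only the registration differs, which is the simplification the abstract attributes to a shared feature layer.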
Democratizing Data Quality Through a Centralized PlatformDatabricks
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with Spark
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
PyCaret is an open source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within minutes in your choice of environment. This talk is a practical demo of using PyCaret in your existing workflows to supercharge your data science team's productivity.
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...Databricks
The ‘feature store’ is an emerging concept in data architecture that is motivated by the challenge of productionizing ML applications. The rapid iteration in experimental, data driven research applications creates new challenges for data management and application deployment.
Scalable Monitoring Using Apache Spark and Friends with Utkarsh BhatnagarDatabricks
This session will give a new dimension to Apache Spark’s usage. See how Apache Spark and other open source projects can be used together in providing a scalable, real-time monitoring system. Apache Spark plays the central role in providing this scalable solution, since without Spark Streaming we would not be able to process millions of events in real time. This approach can provide a lot of learning to the DevOps/Infrastructure domain on how to build a scalable and automated logging and monitoring solution using Apache Spark, Apache Kafka, Grafana and some other open-source technologies.
Sony PlayStation’s monitoring pipeline processes about 40 billion events every day, and generates metrics in near real-time (within 30 seconds). All the components, used along with Apache Spark, are horizontally scalable using any auto-scaling techniques, which enhances the reliability of this efficient and highly available monitoring solution. Sony Interactive Entertainment has been using Apache Spark, and specifically Spark Streaming, for the last three years. Hear about some important lessons they have learned. For example, they still use Spark Streaming’s receiver-based method in certain use cases instead of Direct Streaming, and will share the application of both the methods, giving the knowledge back to the community.
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
Recently, a set of modern table formats such as Delta Lake, Hudi and Iceberg have sprung up. Along with Hive Metastore, these table formats are trying to solve problems that have long stood in traditional data lakes, with declared features like ACID, schema evolution, upsert, time travel, incremental consumption, etc.
Observability for Data Pipelines With OpenLineageDatabricks
Data is increasingly becoming core to many products, whether to provide recommendations for users, to get insights on how they use the product, or to improve the experience with machine learning. This creates a critical need for reliable data operations and for understanding how data is flowing through our systems. Data pipelines must be auditable, reliable, and run on time. This proves particularly difficult in a constantly changing, fast-paced environment.
Collecting this lineage metadata as data pipelines are running provides an understanding of dependencies between many teams consuming and producing data and how constant changes impact them. It is the underlying foundation that enables the many use cases related to data operations. The OpenLineage project is an API standardizing this metadata across the ecosystem, reducing complexity and duplicate work in collecting lineage information. It enables many projects that consume lineage in the ecosystem, whether they focus on operations, governance or security.
Marquez is an open source project, part of the LF AI & Data foundation, which instruments data pipelines to collect lineage and metadata and enable those use cases. It implements the OpenLineage API and provides context by making visible dependencies across organizations and technologies as they change over time.
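To make the idea of lineage metadata concrete, the sketch below builds a simplified run event as a plain dictionary. The field names follow my understanding of the OpenLineage run-event shape, but the sketch covers only a subset of fields (for example, it leaves out the schema URL) and is not validated against the specification; `make_run_event`, the dataset names, and the producer URL are hypothetical.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Dict, List, Optional, Tuple

Dataset = Tuple[str, str]  # (namespace, name)


def make_run_event(event_type: str, job_namespace: str, job_name: str,
                   inputs: List[Dataset], outputs: List[Dataset],
                   producer: str, run_id: Optional[str] = None) -> Dict[str, object]:
    """Build a simplified lineage event linking one job run to its datasets."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
        "producer": producer,
    }


event = make_run_event(
    "COMPLETE",
    "analytics",
    "daily_rollup",
    inputs=[("warehouse", "raw.events")],
    outputs=[("warehouse", "agg.daily")],
    producer="https://example.com/hypothetical-producer",
)
print(json.dumps(event, indent=2))
```

Because each event names both the job and the datasets it reads and writes, a collector can join events from many teams into a dependency graph without knowing how each pipeline is implemented.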
Performance Troubleshooting Using Apache Spark MetricsDatabricks
Performance troubleshooting of distributed data processing systems is a complex task. Apache Spark comes to the rescue with a large set of metrics and instrumentation that you can use to understand and improve the performance of your Spark-based applications. You will learn about the available metric-based instrumentation in Apache Spark: executor task metrics and the Dropwizard-based metrics system. The talk will cover how the Hadoop and Spark service at CERN is using Apache Spark metrics for troubleshooting performance and measuring production workloads. Notably, the talk will cover how to deploy a performance dashboard for Spark workloads and the use of sparkMeasure, a tool based on the Spark Listener interface. The speaker will discuss the lessons learned so far and what improvements you can expect in this area in Apache Spark 3.0.
Data Warehouse or Data Lake, Which Do I Choose?DATAVERSITY
Today’s data-driven companies have a choice to make: where do we store our data? As the move to the cloud continues to be a driving factor, the choice becomes either the data warehouse (Snowflake et al) or the data lake (AWS S3 et al). There are pros and cons for each approach. While the data warehouse will give you strong data management with analytics, it doesn’t do well with semi-structured and unstructured data, it tightly couples storage and compute, and it brings expensive vendor lock-in. On the other hand, data lakes allow you to store all kinds of data and are extremely affordable, but they’re only meant for storage and by themselves provide no direct value to an organization.
Enter the Open Data Lakehouse, the next evolution of the data stack that gives you the openness and flexibility of the data lake with the key aspects of the data warehouse like management and transaction support.
In this webinar, you’ll hear from Ali LeClerc who will discuss the data landscape and why many companies are moving to an open data lakehouse. Ali will share more perspective on how you should think about what fits best based on your use case and workloads, and how some real world customers are using Presto, a SQL query engine, to bring analytics to the data lakehouse.
We share our experience with Apache Kafka for event-driven collaboration in microservices-based architecture. Talk was a part of Meetup: https://www.meetup.com/de-DE/Apache-Kafka-Germany-Munich/events/236402498/
Getting started with Azure Event Hubs and Stream Analytics servicesVladimir Bychkov
The total amount of data in the world almost doubles every 2 years. Storing data for offline processing is no longer a viable business model. In the past few years, new technologies for real-time data processing emerged. Microsoft Azure offers a comprehensive set of tools to ingest and process data in motion. In this presentation we will go over and learn how to collect data from devices, how to process data in real time using Azure Stream Analytics jobs, and how to produce and handle actionable insights.
Landoop presents how to simplify your ETL process using Kafka Connect for (E) and (L). The talk introduces KCQL, the Kafka Connect Query Language, and how it can simplify fast-data (ingress & egress) pipelines. It shows how KCQL can be used to set up Kafka Connectors for popular in-memory and analytical systems, with live demos using Hazelcast, Redis and InfluxDB. It also covers how to get started with a fast-data Docker Kafka development environment and how to enhance your existing Cloudera (Hadoop) clusters with fast-data capabilities.
Apache Kafka is a distributed streaming platform that forms a key part of the infrastructure at many companies including Uber, Netflix and LinkedIn. In this talk, Matt gave a technical overview of Apache Kafka, discussed practical use cases of Kafka for IoT data and demonstrated how to ingest data from an MQTT server using Kafka Connect.
Explore IoT in Big Data while brewing beer. All verticals are instrumenting devices to learn more about their process to help cut costs or improve efficiency.
Strata+Hadoop 2017 San Jose - The Rise of Real Time: Apache Kafka and the Str...confluent
The move to streaming architectures from batch processing is a revolution in how companies use data. But what is the state of the union for stream processing, and what gaps remain in the technology we have? How will this technology impact the architectures and applications of the future? Jay Kreps explores the future of Apache Kafka and the stream processing ecosystem.
Paolo Castagna is a Senior Sales Engineer at Confluent. His background is in 'big data' and he has seen firsthand the industry's shift from batch to stream processing and from big data to fast data. His talk will introduce Kafka Streams and explain why Apache Kafka is a great option and simplification for stream processing.
Building Reactive Fast Data & the Data Lake with Akka, Kafka, SparkTodd Fritz
In this session, we will discuss:
* reactive architecture tenets
* distributed “fast data” streams
* application and analytics focused Data Lake
Enterprise level concerns and the importance of holistic governance, operational management, and a Metadata Lake will be conceptually investigated. The next level of detail will be to explore what a prospective architecture looks like at scale with Terabytes of ingestion per day, how scale puts pressure on an architecture, and how to be successful without losing data in a mission critical system via resilient, self-healing, scalable technologies. DevOps and application architecture concerns will be first-class themes throughout.
Reactive principles and technology will be the second act of this talk. Kafka. Akka. Spark. Various streaming technologies (Kafka Streams, Akka Streams, Spark Streaming) will be reviewed to identify what they are best suited for. The fast data pipeline discussion will center around Kafka, Akka, and Apache Flink (Lightbend Fast Data platform). We’ll also walk through an exciting addition to the Akka family, Alpakka, which is a Camel equivalent for Enterprise Integration Patterns.
The final act will be to dive into the Data Lake, from both an analytics and application development perspective. Technologies used to explain concepts will include Amazon and Hadoop. A Data Lake may service multiple analytics consumers with various “views” (and access levels) of data. It may also be a participant of various applications, perhaps by acting as a centralized source for reference data or common middleware (in turn feeding the analytics aspect). The concept of the Metadata Lake to apply structure, meaning and purpose will be an over-arching success factor for a Data Lake. The difference between the Data Lake and Metadata Lake is conceptually similar to a Halocline… Various technologies (Iglu/Snowplow and more) will be discussed from a feature standpoint to flesh out the technology capabilities needed for Data Lake governance.
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...Data Con LA
While the last few years have seen great advancements in computing paradigms for big data stores, there remains one critical bottleneck in this architecture - the ingestion process. Instead of immediate insights into the data, a poor ingestion process can cause headaches and problems to no end. On the other hand, a well-designed ingestion infrastructure should give you real-time visibility into how your systems are functioning at any given time. This can significantly increase the overall effectiveness of your ad-campaigns, fraud-detection systems, preventive-maintenance systems, or other critical applications underpinning your business.
In this session we will explore various modes of ingest including pipelining, pub-sub, and micro-batching, and identify the use-cases where these can be applied. We will present this in the context of open source frameworks such as Apache Flume, Kafka, among others that can be used to build related solutions. We will also present when and how to use multiple modes and frameworks together to form hybrid solutions that can address non-trivial ingest requirements with little or no extra overhead. Through this discussion we will drill-down into details of configuration and sizing for these frameworks to ensure optimal operations and utilization for long-running deployments.
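Of the ingest modes listed above, micro-batching is the easiest to show in isolation: records are buffered and handed to the sink when the batch is either large enough or old enough. The sketch below is a generic illustration, not code from Apache Flume or Kafka; `MicroBatcher` and its parameters are hypothetical names, and the age check runs only when a record arrives (there is no background timer).

```python
import time
from typing import Callable, List, Optional


class MicroBatcher:
    """Buffer records and flush them to a sink by size or by age."""

    def __init__(self, sink: Callable[[List[object]], None],
                 max_size: int = 3, max_age_seconds: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._sink = sink
        self._max_size = max_size
        self._max_age = max_age_seconds
        self._clock = clock
        self._buffer: List[object] = []
        self._opened_at: Optional[float] = None

    def add(self, record: object) -> None:
        """Add one record; flush if the batch is full or has been open too long."""
        if not self._buffer:
            self._opened_at = self._clock()  # the batch opens with its first record
        self._buffer.append(record)
        too_big = len(self._buffer) >= self._max_size
        too_old = self._clock() - self._opened_at >= self._max_age
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        """Send the buffered records to the sink as one batch, if any exist."""
        if self._buffer:
            self._sink(list(self._buffer))
            self._buffer.clear()
            self._opened_at = None


batches: List[List[object]] = []
batcher = MicroBatcher(batches.append, max_size=2, max_age_seconds=60.0)
for record in ["a", "b", "c", "d", "e"]:
    batcher.add(record)
batcher.flush()  # drain the final partial batch
print(batches)
```

The two thresholds express the usual trade-off: a larger batch size favors throughput, while a shorter maximum age bounds how stale a record can get before it is delivered.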
Comparison of various streaming technologies
This meetup will take us through the various streaming technologies such as Storm, Flink, Infosphere Streams and Spark Streaming.
Agenda
• Characteristics of streaming technologies
• Introduction to Apache Storm, Trident and Flink
• Examples of Code and API
• Deep-dive of Spark Streaming
• Comparison of Spark Streaming with other streaming technologies
• Benchmark of Spark Streaming (with code walkthrough)
We will supplement the theory with plenty of examples.
To view the recording of this webinar, use the URL below:
http://wso2.com/library/webinars/2016/06/analytics-in-your-enterprise/
Big data spans many fields and brings together technologies like distributed systems, machine learning, statistics and Internet of Things (IoT). It has now become a multi-billion dollar industry with use cases ranging from targeted advertising and fraud detection to product recommendations and market surveys.
Some use cases such as urban planning can be slower (done in batch mode), while others such as the stock market need results in milliseconds (done in a streaming fashion). Different technologies are used for each case: MapReduce for batch analytics, complex event processing for real-time analytics and machine learning for predictive analytics. Furthermore, the type of analysis ranges from basic statistics to complicated prediction models.
This webinar will discuss the big data landscape including
Concepts, use cases and technologies
Capabilities and applications of the WSO2 analytics platform
WSO2 Data Analytics Server
WSO2 Complex Event Processor
WSO2 Machine Learner
This session takes an in-depth look at:
- Trends in stream processing
- How streaming SQL has become a standard
- The advantages of Streaming SQL
- Ease of development with streaming SQL: Graphical and Streaming SQL query editors
- Business value of streaming SQL and its related tools: Domain-specific UIs
- Scalable deployment of streaming SQL: Distributed processing
Elasticsearch Performance Testing and Scaling @ SignalJoachim Draeger
In this talk I describe the specific challenges that we faced at Signal to make our use case scale. I then go into detail on how we benchmarked single queries and different shard configurations. You can try the experiments yourself using The Signal Media One-Million News Articles Dataset, a Docker Compose stack and some scripts provided here: https://github.com/joachimdraeger/elasticsearch-performance-experiments.
I also got the great advice to have a look at https://github.com/elastic/rally which can also give you summaries for test runs.
WSO2Con USA 2015: An Introduction to the WSO2 Analytics PlatformWSO2
In today’s connected world organizations have access to an enormous amount of data. We often don’t know what it means or how we can use it, in terms of hindsight, oversight, insight and foresight, to gain competitive advantage in the market. Use cases ranging from simple system monitoring to complex fraud analysis demand this.
The WSO2 Data Analytics platform lets you collect data, allows you to explore it through batch, real-time, interactive and predictive processing technologies and allows you to communicate your results. In this talk, we will discuss the WSO2 Data Analytics platform and how it brings together all analytics technologies into a single platform and user experience.
Data Con LA 2020
Description
Apache Druid is a cloud-native open-source database that enables developers to build highly-scalable, low-latency, real-time interactive dashboards and apps to explore huge quantities of data. This column-oriented database provides the microsecond query response times required for ad-hoc queries and programmatic analytics. Druid natively streams data from Apache Kafka (and more) and batch loads just about anything. At ingestion, Druid partitions data based on time so time-based queries run significantly faster than traditional databases, plus Druid offers SQL compatibility. Druid is used in production by Airbnb, Nielsen, Netflix and more for real-time and historical data analytics. This talk provides an introduction to Apache Druid, including Druid's core architecture and its advantages, working with streaming and batch data in Druid, querying data and building apps on Druid, and real-world examples of Apache Druid in action.
Speaker
Matt Sarrel, Imply Data, Developer Evangelist
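The description above says that partitioning data by time at ingestion makes time-based queries faster. A toy illustration of why: if rows are grouped into time buckets, a time-range query only has to look at the buckets that overlap the range. The sketch below is not Druid code or its storage format; `TimePartitionedStore` and the hourly granularity are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


class TimePartitionedStore:
    """Group rows into fixed-width time buckets and prune buckets at query time."""

    def __init__(self, granularity: timedelta = timedelta(hours=1)) -> None:
        self._granularity = granularity
        self._segments: Dict[int, List[Tuple[datetime, object]]] = defaultdict(list)

    def _bucket(self, ts: datetime) -> int:
        """Return the index of the time bucket that contains ts."""
        return (ts - EPOCH) // self._granularity

    def ingest(self, ts: datetime, value: object) -> None:
        """Store a row in the bucket for its timestamp (timezone-aware)."""
        self._segments[self._bucket(ts)].append((ts, value))

    def query(self, start: datetime, end: datetime) -> Tuple[List[object], int]:
        """Return (values with start <= ts < end, number of buckets scanned)."""
        values: List[object] = []
        scanned = 0
        for bucket in range(self._bucket(start), self._bucket(end) + 1):
            rows = self._segments.get(bucket)
            if rows is None:
                continue  # no data in this bucket, nothing to scan
            scanned += 1
            values.extend(v for ts, v in rows if start <= ts < end)
        return values, scanned


def at(hour: int, minute: int) -> datetime:
    """Build a UTC timestamp on an arbitrary example day."""
    return datetime(2020, 1, 1, hour, minute, tzinfo=timezone.utc)


store = TimePartitionedStore()
for ts, value in [(at(0, 10), 1), (at(0, 50), 2), (at(1, 20), 3), (at(5, 0), 4)]:
    store.ingest(ts, value)

values, scanned = store.query(at(0, 0), at(2, 0))
print(values, scanned)
```

In this example the store holds three buckets but the two-hour query scans only two of them; the 05:00 bucket is never touched.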
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014Jaroslav Gergic
The recent boom in big data processing and democratization of the big data space has been enabled by the fact that most of the concepts that originated in the research labs of companies such as Google, Amazon, Yahoo and Facebook are now available as open source. Technologies such as Hadoop and Cassandra let businesses around the world become more data driven and tap into their massive data feeds to mine valuable insights.
At the same time, we are still at a certain stage of the maturity curve of these new big data technologies and of the entire big data technology stack. Many of the technologies originated from a particular use case and attempts to apply them in a more generic fashion are hitting the limits of their technological foundations. In some areas, there are several competing technologies for the same set of use cases, which increases risks and costs of big data implementations.
We will show how GoodData solves the entire big data pipeline today, starting from raw data feeds all the way up to actionable business insights. All this is provided as a hosted multi-tenant environment that lets its customers solve their particular analytical use case, or many analytical use cases for thousands of their own customers, all using the same platform and tools while abstracting them away from the technological details of the big data stack.
Lyft’s data platform is at the heart of the company's business. Decisions from pricing to ETA to business operations rely on Lyft’s data platform. Moreover, it powers the enormous scale and speed at which Lyft operates. Mark Grover and Deepak Tiwari walk you through the choices Lyft made in the development and sustenance of the data platform, along with what lies ahead in the future.
Analytic Insights in Retail Using Apache Spark with Hari ShreedharanDatabricks
Streamsets Data Collector is designed to make data ingest and processing easy. SDC integrates at several levels with Apache Spark to make data analysis using Spark very easy. SDC works with Databricks Cloud to trigger jobs based on incoming data.
In this talk, you will learn how a large retail player with thousands of outlets is utilizing StreamSets to power Spark jobs on the Databricks cloud, combining real-time foot traffic data and historic behavioral & transaction data for analytic insights that improve revenue per square foot.
Analytical Innovation: How to Build the Next Generation Data PlatformVMware Tanzu
There was a time when the Enterprise Data Warehouse (EDW) was the only way to provide a 360-degree analytical view of the business. In recent years many organizations have deployed disparate analytics alternatives to the EDW, including: cloud data warehouses, machine learning frameworks, graph databases, geospatial tools, and other technologies. Often these new deployments have resulted in the creation of analytical silos that are too complex to integrate, seriously limiting global insights and innovation.
Join guest speaker, 451 Research’s Jim Curtis and Pivotal’s Jacque Istok for an interactive discussion about some of the overarching trends affecting the data warehousing market, as well as how to build a next generation data platform to accelerate business innovation. During this webinar you will learn:
- The significance of multi-cloud, infrastructure-agnostic analytics
- What is working and what isn’t, when it comes to analytics integration
- The importance of seamlessly integrating all your analytics in one platform
- How to innovate faster, taking advantage of open source and agile software
Speakers: James Curtis, Senior Analyst, Data Platforms & Analytics, 451 Research & Jacque Istok, Head of Data, Pivotal
An overview on how we have approached dataops to allow analysts and data scientists to work quickly and release frequently with high confidence. Covers:
- Cloud/multi-cloud architecture
- CI/CD in the data space
- Development, testing, and deployment
- Monitoring and alerting
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...Data Con LA
Enabling real-time exploration and analytics at scale to drive operational intelligence at Hulu by Indrasis Mondal, Director, Data Engineering and Data Products, Hulu
Data is one of the most powerful assets for companies today and a key driver for innovation, product development and business efficiency. Operational intelligence allows modern organizations to use that data asset in real time to gain immediate insights into their business operations and make rapid decisions for strategic advantage. In this presentation we will walk through the operational intelligence capabilities Hulu has built to process tens of millions of events per minute to enable fast exploration of data and real-time decision making.
Cloud Cost Management and Apache Spark with Xuan WangDatabricks
The cloud computing market is growing faster than virtually any other IT market today, according to Gartner [1]. Providing a unified analytics platform in public clouds, Databricks invests heavily in cloud computing. As a result, cloud expense becomes an imperative category of our cost of goods sold (COGS) and operating expense (OPEX). Many companies share the same story as ours, embracing the cloud while facing the rising challenge of managing its cost.
In this session, we will share our experience with cloud cost management, from mistakes we made, data garnered and lessons learned to the solutions we built. We will discuss general principles of managing accounts and services, assigning budget, and attributing cost to internal teams. Using AWS as a concrete example, with Databricks and Spark as part of our solution, we will show how we: 1) make AWS cost and usage data available to finance and budget owners, 2) build data products that help budget owners monitor cost and take action by buying reserved instances and setting retention policies, 3) use data science techniques to detect changes and forecast. The general principles and solutions we built are applicable to other cloud providers too.
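The abstract does not say which techniques are used to detect changes in cost. One common and simple option, shown here purely as an assumed stand-in, is to compare each day's cost with the mean and standard deviation of a trailing window. The function name `detect_cost_changes`, the window length, the threshold, and the sample figures are all hypothetical.

```python
import statistics
from typing import List, Sequence


def detect_cost_changes(daily_costs: Sequence[float], window: int = 7,
                        threshold: float = 3.0) -> List[int]:
    """Return indexes of days whose cost deviates sharply from the trailing window.

    A day is flagged when it lies more than `threshold` standard deviations
    from the mean of the previous `window` days.
    """
    flagged: List[int] = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        mean = statistics.fmean(history)
        stdev = statistics.pstdev(history)
        if stdev == 0:
            # A perfectly flat history: any different value counts as a change.
            if daily_costs[i] != mean:
                flagged.append(i)
            continue
        if abs(daily_costs[i] - mean) / stdev > threshold:
            flagged.append(i)
    return flagged


# Made-up daily costs with one spike on day 8.
costs = [100, 101, 99, 100, 102, 98, 100, 100, 180, 101]
print(detect_cost_changes(costs))
```

A trailing window adapts to gradual growth in spend, at the price of a spike inflating the baseline for the following days, which is why the day after the spike is not flagged here.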
Dirty data? Clean it up! - Datapalooza Denver 2016Dan Lynn
Dan Lynn (AgilData) & Patrick Russell (Craftsy) present on how to do data science in the real world. We discuss data cleansing, ETL, pipelines, hosting, and share several tools used in the industry.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2lGNybu.
Stefan Krawczyk discusses how his team at StitchFix uses the cloud to enable over 80 data scientists to be productive. He also talks about prototyping ideas, algorithms and analyses, how they set up & keep schemas in sync between Hive, Presto, Redshift & Spark and make access easy for their data scientists, etc. Filmed at qconsf.com.
Stefan Krawczyk is Algo Dev Platform Lead at StitchFix, where he’s leading development of the algorithm development platform. He spent formative years at Stanford, LinkedIn, Nextdoor & Idibon, working on everything from growth engineering, product engineering, data engineering, to recommendation systems, NLP, data science and business intelligence.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
Extracting Insights from Data at Twitter
1. Extracting Insights from Data at Twitter
Prasad Wagle
Technical Lead, Core Data and Metrics, Data Platform
twitter.com/prasadwagle
Jan 26, 2016
2. ● What are the properties of Big Data at Twitter?
● Where do we store it and how do we process it?
● What do we learn from the data?
Overview of the talk
3. ● Velocity: Rate at which data is created
○ 313 million monthly active users (June 2016)
○ Hundreds of millions of Tweets are sent per day. TPS record: one-second peak of 143,199 Tweets per second
○ 100 billion interaction events per day
● Volume: 100s of petabytes of data
● Variety: Tweets, Users, Client events and many more
○ Client event logs have a unified Thrift format for a wide variety of application events
3Vs of Big Data @Twitter
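As a rough back-of-the-envelope check on the velocity figures above, the daily totals can be converted to average per-second rates. This is only a sketch: the 500 million Tweets per day value is an assumed stand-in for "hundreds of millions" and is not a figure from the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def per_second(events_per_day: float) -> float:
    """Average events per second for a given daily total."""
    return events_per_day / SECONDS_PER_DAY


# 100 billion interaction events per day (figure from the slide)
interaction_rate = per_second(100e9)

# ASSUMPTION: 500 million Tweets per day, standing in for "hundreds of millions"
assumed_tweet_rate = per_second(500e6)

# One-second peak reported on the slide
peak_tps = 143_199

print(f"interaction events: ~{interaction_rate:,.0f} per second on average")
print(f"Tweets (assumed):   ~{assumed_tweet_rate:,.0f} per second on average")
print(f"peak vs assumed average: ~{peak_tps / assumed_tweet_rate:.0f}x")
```

The gap between the average and the one-second peak is the reason velocity has to be planned for bursts rather than for the mean rate.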
4. Data Processing Big Picture (diagram)
● Production systems
● Batch: Hadoop (HDFS, MapReduce)
● Real-time: Eventbus, Kafka Streams
● Analytics Tools
○ Batch: Scalding, Spark
○ Real-time: Heron
○ Lambda (Batch + Real-time): Summingbird, TSAR
○ Interactive: Presto, Vertica
● Analytics Front-ends: R, Custom Dashboards, Tableau, Apache Zeppelin, Command line tools
● Data Abstraction Layer (DAL), Pipeline Orchestration
6. ● Batch Processing Engine - Hadoop
● Real-time Processing Engine - Heron
● Core Data Libraries - Scalding, Summingbird, Tsar, Parquet
● Data Pipeline - Data Access Layer (DAL), Orchestration
● Interactive SQL - Presto, Vertica
● Data Visualization - Tableau, Apache Zeppelin
● Core Data and Metrics
Data Platform Projects
7. ● Largest Hadoop clusters in the world, some > 10K nodes
● Store 100s of petabytes of data
● More than 100K daily jobs
● Improvements to open source Hadoop software
● hRaven - tool that collects runtime data of Hadoop jobs and lets users visualize job metrics
○ YARN Timeline Server is next-gen hRaven
● Log pipeline software (Scribe -> HDFS)
○ Scribe is being replaced by Flume
Hadoop
8. ● Heron - a real-time, distributed, fault-tolerant stream processing engine
● Successor of Storm, API compatible with Storm
● Analyze data as it is being produced
● > 400 real-time jobs, 500 B events / day processed, 25 - 200 ms latency
● Use cases
○ Real-time impression and engagement counts
○ Real-time trends, recommendations, spam detection
Real-time Processing
9. ● Tools that make it easy to create MapReduce and Heron jobs
● Scalding
○ Scala DSL on top of Cascading
● Summingbird
○ Lambda architecture: real-time and batch
● Tsar: TimeSeries AggregatoR
○ DSL implemented on top of Summingbird
Core Data Libraries
10. ● DAL is a service that simplifies the discovery, usage, and maintainability of data
● Users work with logical datasets
● Physical dataset describes the serialization of a logical dataset to a specific location (Hadoop, Vertica) and format
● Logical dataset can simultaneously exist in multiple places
● Users can use logical dataset name to consume data with different tools like Scalding, Presto
Data Access Layer (DAL)
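The logical/physical split described above can be sketched as a small in-memory catalog. This is an illustration only: the class names, paths, table names, and the `resolve` method are hypothetical and are not DAL's actual API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PhysicalDataset:
    """One serialization of a logical dataset: where it lives and in what format."""
    system: str    # e.g. "hadoop" or "vertica"
    location: str  # hypothetical path or table name
    fmt: str       # e.g. "parquet"


@dataclass
class LogicalDataset:
    """The name users work with, plus every physical copy registered for it."""
    name: str
    physical: list = field(default_factory=list)


class Catalog:
    """Toy in-memory registry; a stand-in for the DAL service, not its real interface."""

    def __init__(self):
        self._datasets = {}

    def register(self, name, physical):
        self._datasets.setdefault(name, LogicalDataset(name)).physical.append(physical)

    def resolve(self, name, system):
        """Return the copy of a logical dataset that is stored in the given system."""
        for candidate in self._datasets[name].physical:
            if candidate.system == system:
                return candidate
        raise LookupError(f"{name!r} has no physical dataset in {system!r}")


catalog = Catalog()
# The same logical dataset exists in two places at once (hypothetical locations).
catalog.register("tweets", PhysicalDataset("hadoop", "/logs/tweets/2016/01/26", "parquet"))
catalog.register("tweets", PhysicalDataset("vertica", "analytics.tweets", "native"))

print(catalog.resolve("tweets", "vertica").location)
```

A tool only needs the logical name and its own system to find the right copy, which is the point of keeping location and format out of user code.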
11. ● The Eagleeye web application is the front end for end users
● Users discover datasets with Eagleeye
● Eagleeye displays metadata like owners and schema
● Application access to datasets is recorded
● Enables Eagleeye to show dependency graphs for a dataset - jobs that produce a dataset and jobs that consume it
Data Access Layer (DAL)
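One way to picture how recorded access turns into a dependency graph is to group access records by dataset. The sketch below assumes a hypothetical log of (job, dataset, mode) tuples; the job and dataset names are invented and the real record format is not described in the slides.

```python
from collections import defaultdict

# Hypothetical access log: (job name, dataset name, "read" or "write").
access_log = [
    ("ingest_client_events", "client_events", "write"),
    ("build_interactions", "client_events", "read"),
    ("build_interactions", "interactions", "write"),
    ("daily_impressions", "interactions", "read"),
    ("ab_test_metrics", "interactions", "read"),
]


def dependency_graph(log):
    """Group jobs by the datasets they produce (write) and consume (read)."""
    producers = defaultdict(set)
    consumers = defaultdict(set)
    for job, dataset, mode in log:
        target = producers if mode == "write" else consumers
        target[dataset].add(job)
    return producers, consumers


producers, consumers = dependency_graph(access_log)
print("produced by:", sorted(producers["interactions"]))
print("consumed by:", sorted(consumers["interactions"]))
```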
15. ● Statebird service
○ Tracks state of batch jobs
○ Used to manage dependencies
Pipeline Orchestration
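The idea of tracking batch job state to manage dependencies can be sketched with a small state store. Everything here is an assumption made for illustration: the state names, the per-date keying, and the `ready` check are not taken from Statebird itself.

```python
from enum import Enum


class JobState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class StateTracker:
    """Toy in-memory state store; NOT Statebird's actual interface."""

    def __init__(self):
        self._states = {}

    def set_state(self, job, run_date, state):
        self._states[(job, run_date)] = state

    def get_state(self, job, run_date):
        # Jobs with no recorded state are treated as not yet started.
        return self._states.get((job, run_date), JobState.PENDING)

    def ready(self, run_date, upstream_jobs):
        """A job may start once every upstream job has succeeded for the same date."""
        return all(
            self.get_state(job, run_date) is JobState.SUCCEEDED
            for job in upstream_jobs
        )


# Hypothetical dependency: daily_impressions needs build_interactions first.
dependencies = {"daily_impressions": ["build_interactions"]}
tracker = StateTracker()

tracker.set_state("build_interactions", "2016-01-25", JobState.RUNNING)
before = tracker.ready("2016-01-25", dependencies["daily_impressions"])

tracker.set_state("build_interactions", "2016-01-25", JobState.SUCCEEDED)
after = tracker.ready("2016-01-25", dependencies["daily_impressions"])

print(before, after)
```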
16. ● Interactive means that results of a query are available in the range of seconds to a few minutes
● SQL is still the lingua franca for ad hoc data analysis
● Vertica
○ Columnar architecture, high performance analytics queries
● Presto
○ Data in HDFS in Parquet format
Interactive SQL
17. ● Custom Dashboards
● Apache Zeppelin Strengths
○ Notebook metaphor - notebook is a collection of notes, each note is a collection of paragraphs (queries)
○ Web based report authoring, collaborative like Google Docs
○ Very easy to create a note and then share it
○ > 2K notes, 18K queries
○ Supports JDBC (Presto, Vertica, MySQL)
○ Open source, easy to add new interpreters like Scalding
Data Visualization
18. ● Tableau Strengths
○ Easy to create reports, does not require SQL expertise
○ Built-in analytics functions e.g. Rank, Percentile
○ Polished visualizations
○ Row level security
Data Visualization
19. ● Big part of data analysis is data cleansing
● Makes sense to do this once
● Core Data
○ Create pipelines to create “verified” datasets like Users, Tweets, Interactions
○ Reliable and easy to use
● Core Metrics
○ Create pipelines to compute Twitter’s important metrics
○ DAU, MAU, Tweet Impressions
Core Data and Metrics
21. ● Analytics - Basic Counting
● A/B Testing
● Data Science - Custom analysis
● Data Science - Machine Learning
Data Processing
22. ● Daily/Monthly Active Users
● Number of Tweets, Retweets, Likes
● Tweet Impressions
● Logic is relatively simple
● Challenges: scale and timeliness
○ Results for previous day should be available by 10 am
○ Some metrics are real-time
Basic Counting
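The counting logic really is simple, as the slide says; a minimal sketch of daily and monthly active users over a handful of events looks like this. The events are made up, and treating "monthly" as a calendar month is an assumption (a trailing 28- or 30-day window is another common definition).

```python
from datetime import date

# Hypothetical (user_id, activity_date) pairs, one per logged action.
events = [
    (1, date(2016, 1, 24)),
    (2, date(2016, 1, 24)),
    (1, date(2016, 1, 25)),
    (1, date(2016, 1, 25)),  # same user active twice on the same day
    (3, date(2016, 1, 25)),
    (4, date(2015, 12, 31)),
]


def daily_active_users(events, day):
    """Count distinct users with at least one event on the given day."""
    return len({user_id for user_id, d in events if d == day})


def monthly_active_users(events, year, month):
    """Count distinct users with at least one event in the given calendar month."""
    return len({user_id for user_id, d in events if (d.year, d.month) == (year, month)})


dau = daily_active_users(events, date(2016, 1, 25))
mau = monthly_active_users(events, 2016, 1)
print(dau, mau)
```

The hard part at Twitter's scale is not this logic but running the distinct count over billions of events in time for the morning deadline.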
23. ● Goal: find the number of impressions and engagements for a tweet
● Real-time
● Used in analytics.twitter.com
Example - Counting Tweet Impressions
25. ● TSAR job is converted to a Summingbird job
● Summingbird job creates
○ Real-time pipeline with Heron
○ Batch pipeline with Scalding
● Users access results using TSAR query service
● Write once, run batch and real-time
Example - Counting Tweet Impressions
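The "write once, run batch and real-time" idea can be illustrated without any of the actual libraries. In the sketch below the per-event logic is written once and reused by a batch runner and a streaming runner, and the two partial results are merged by addition. The function and class names are hypothetical; this is plain Python, not TSAR or Summingbird code.

```python
from collections import Counter


def extract(event):
    """Shared job logic, written once: turn one event into (key, value) pairs."""
    return [((event["tweet_id"], event["type"]), 1)]


def run_batch(events):
    """Batch mode: aggregate a complete, stored collection of events."""
    totals = Counter()
    for event in events:
        for key, value in extract(event):
            totals[key] += value
    return totals


class StreamingJob:
    """Real-time mode: update running totals one event at a time."""

    def __init__(self):
        self.totals = Counter()

    def on_event(self, event):
        for key, value in extract(event):
            self.totals[key] += value


# Hypothetical events: older ones are already in batch storage, newer ones arrive live.
historical = [
    {"tweet_id": 1, "type": "impression"},
    {"tweet_id": 1, "type": "impression"},
    {"tweet_id": 1, "type": "engagement"},
]
recent = [
    {"tweet_id": 1, "type": "impression"},
    {"tweet_id": 2, "type": "impression"},
]

batch_totals = run_batch(historical)
stream = StreamingJob()
for event in recent:
    stream.on_event(event)

# Addition is associative, so the two partial results merge into one answer.
combined = batch_totals + stream.totals
print(combined[(1, "impression")])
```

Because the aggregation is a simple sum, a query layer can serve the batch result for older data and add the real-time result for the most recent events.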
26. ● Experimentation is at the heart of Twitter’s product development cycle
● Expertise needed in Statistics and Technology
A/B Testing Framework
27. ● Goal: an informative experiment
● Minimize false positive and false negative errors
● How many users do we need to sample?
● How long should we run the experiment?
A/B Testing Statistics
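The two questions above (how many users, how long) are typically answered with a power calculation. The sketch below uses the standard normal-approximation formula for comparing two proportions; it is a textbook method and not necessarily the one Twitter's framework uses. The baseline rate, expected lift, and daily traffic are made-up inputs.

```python
from math import ceil
from statistics import NormalDist


def users_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """Sample size per group for a two-sided test of two proportions.

    alpha bounds the false positive rate; (1 - power) bounds the false negative rate.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_control - p_treatment
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


def days_to_run(n_per_group, new_users_per_group_per_day):
    """Days needed to reach the sample size at a given enrollment rate."""
    return ceil(n_per_group / new_users_per_group_per_day)


# ASSUMPTIONS: 10% baseline rate, hoping to detect a lift to 11%,
# with 2,000 newly enrolled users per group per day.
n = users_per_group(0.10, 0.11)
print("users per group:", n)
print("days to run:", days_to_run(n, 2_000))
```

Halving the detectable effect roughly quadruples the required sample, which is why small expected lifts need long experiments.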
28. ● Process 100 B events daily, compute-intensive.
● Metrics computed using Scalding pipeline that combines client event logs, internal user models, and other datasets.
● Lightweight statistics are computed in a streaming job using TSAR running on Heron.
A/B Testing Technology
29. ● Cause of spikes and dips in key metrics
● Growth Trends
○ By country, client
● Analysis to understand user behavior
○ Creators vs Consumers
○ Distribution of followers
○ User clusters
● Analysis to inform product feature decisions
Data Science - Custom Analysis
30. ● Recommendations
○ Users: WTF - who to follow
○ Tweets: Algorithmic timeline
● Cortex, Deep learning based on Torch framework
○ Identify NSFW images
○ Recognize what is happening in live feeds
Data Science - Machine Learning
31. ● Product Safety
○ Detect fake accounts
○ Detect tweet spam and abuse
● Ad Targeting
○ Promoted Trends, Accounts and Tweets
○ Show only if it is likely to be interesting and relevant to that user
○ Predict click probability using signals including what a user chooses to follow, how they interact with a Tweet and what they retweet
Machine Learning
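To make "predict click probability using signals" concrete, here is a minimal sketch using a logistic model. The slides do not say which model is used, so the logistic form, the signal names, the weights, and the display threshold are all assumptions chosen for illustration.

```python
from math import exp


def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))


# HYPOTHETICAL weights; a real model would learn them from logged impressions and clicks.
WEIGHTS = {
    "follows_related_account": 1.2,
    "engaged_with_similar_tweet": 0.8,
    "retweeted_similar_tweet": 0.5,
}
BIAS = -4.0  # hypothetical baseline: clicks are rare without any positive signal


def click_probability(signals):
    """Combine binary signals (0 or 1) into a predicted click probability."""
    score = BIAS + sum(WEIGHTS[name] * value for name, value in signals.items())
    return sigmoid(score)


def should_show(signals, threshold=0.05):
    """Show the promoted item only if the predicted probability clears the threshold."""
    return click_probability(signals) >= threshold


interested = {"follows_related_account": 1, "engaged_with_similar_tweet": 1, "retweeted_similar_tweet": 1}
uninterested = {"follows_related_account": 0, "engaged_with_similar_tweet": 0, "retweeted_similar_tweet": 0}

print(should_show(interested), should_show(uninterested))
```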
32. ● Systems (Hadoop, Vertica)
○ Necessary because higher-level abstractions are leaky
● Programming (Scala, Scalding, SQL)
● Math (Statistics, Linear Algebra)
Ideal Talent Stack
Skills: Systems, Programming, Statistics
Roles: Data Engineers, Data Scientists
33. Data Platform and Data Science work hand-in-hand to extract insights from Big Data at Twitter
Summary