- HiveHome provides smart home sensors that generate over 4 billion messages per day which are accessible through Kafka topics.
- Many of HiveHome and Connected Home's services are based on analyzing this big data.
- Lessons learned include decoupling applications, sticking to single responsibility principles, and making applications portable, immutable, and easy to test using Docker, Kubernetes, and other tools.
- The data platform team replaced Spark jobs with Kafka Connect and KCQL to define extract and load stages generically with less duplication and improved reusability.
- They are rethinking transformation stages using Kafka Streams instead of Spark for better performance and scalability without shared storage needs.
- Data scientists at Connected
In this presentation we go through the differences and similarities between Redshift and BigQuery. It was presented during the Athens Big Data meetup in May 2017.
BigQuery is Google's columnar, massively parallel data querying solution. This talk explores using it as an ad-hoc reporting solution and the limitations present in May 2013.
Vadim Solovey, CTO of DoiT International, has helped implement Google BigQuery as a cloud data warehouse for many medium- and large-sized data and analytics initiatives. BigQuery’s serverless architecture has redefined what it means to be fully managed for hundreds of Israeli startups.
Recently, Google announced an update to BigQuery that dramatically advances cloud data analytics for large-scale businesses: BigQuery now supports Standard SQL, implementing the SQL 2011 standard, and new ODBC drivers make it possible to use BigQuery with a number of tools, ranging from Microsoft Excel to traditional business intelligence systems such as MicroStrategy and Qlik.
Agenda:
• Partitioned tables
• The ability to update and delete rows and columns using SQL
• Integration with IAM for fine-grained security policies
• Monitoring with Stackdriver to track performance and usage
• Query sharing via links, to foster knowledge within orgs
• Cost optimisation strategies
Introduction to our data warehouse solution, BigQuery.
The Google Cloud Platform products are based on our internal systems which are powering Google AdWords, Search, YouTube and our leading research in the field of real-time data analysis.
You can get access ($300 for 60 days) to our free trial through google.com/cloud
Building a Real-time Stream Processing Pipeline - Kinesis Data Firehose, Amaz... - ★ Akshay Surve
Talks about our journey to stream processing; as part of the talk I shared with the audience the specifics of the solution we built on AWS cloud and pointers for others to help them think through their own use-cases.
Presto & differences between popular SQL engines (Spark, Redshift, and Hive) - Holden Ackerman
This is a presentation given at a Big Data Boulder / Denver Meetup event by Ashish Dubey, a Senior Solutions Architect at Qubole.
The following slides cover the background of Presto and its architecture, and how it differs in both performance and cost from traditional Hadoop / Hive for ad hoc queries, as well as from SparkSQL, Impala, Tez, and Redshift.
There are also several slides about how Qubole has been involved with the open-source Apache Presto project, along with performance optimizing contributions.
Qubole is a big data analytics software that has solved many headaches around the traditional model of big data (Hadoop, Spark, Presto) and cloud computing in popular IaaS providers: AWS, Google Cloud, Microsoft Azure, and Oracle BMC.
BigQuery is an analytical database designed to scale to petabytes of data. To optimize BigQuery we need to use practices and patterns that take advantage of the BigQuery architecture.
Migrating a multi tenant app to Azure (war biopic) - ★ Akshay Surve
P.S.: This was presented at the Software Architects Bangalore meetup, so it is not completely consumable on its own.
A war biopic on migrating a multi-tenant app to Azure. This presentation combines learnings and lessons from planning and executing the migration of a multi-tenant app to Azure (or to the cloud in general). It covers the original on-premise architecture, the challenges faced during migration, and the architecture after migrating to Azure.
https://www.meetup.com/SoftwareArchitectsBangalore/events/237117024/
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQuery - Chris Schalk
This is an introductory presentation given at DevFest Madrid 2010 by Google Developer Advocate Chris Schalk. It introduces new Google cloud technologies: Google Storage, Google Prediction API and BigQuery.
Build Real-Time Applications with Databricks Streaming - Databricks
In this presentation, we will study a use case we implemented recently. In this use case we are working with a large, metropolitan fire department. Our company has already created a complete analytics architecture for the department based upon Azure Data Factory, Databricks, Delta Lake, Azure SQL and Azure SQL Server Analytics Services (SSAS). While this architecture works very well for the department, they would like to add a real-time channel to their reporting infrastructure.
This channel should serve up the following information:
• The most up-to-date locations and status of equipment (fire trucks, ambulances, ladders, etc.)
• The current locations and status of firefighters, EMT personnel and other relevant fire department employees
• The current list of active incidents within the city
The above information should be visualized through an automatically updating dashboard. The central component of the dashboard will be a map which automatically updates with the locations and incidents. This view should be as real-time as possible and will be used by the fire chiefs to assist with real-time decision-making on resource and equipment deployments.
In this presentation, we will leverage Databricks, Spark Structured Streaming, Delta Lake and the Azure platform to create this real-time delivery channel.
Introducing the Hub for Data Orchestration - Alluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Introducing the Hub for Data Orchestration
Danny Linden, Chapter Lead Software Engineer (Ryte)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Building Data Lakes with Apache Airflow - Gary Stafford
Build a simple Data Lake on AWS using a combination of services, including Amazon Managed Workflows for Apache Airflow (Amazon MWAA), AWS Glue, AWS Glue Studio, Amazon Athena, and Amazon S3.
Blog post and link to the video: https://garystafford.medium.com/building-a-data-lake-with-apache-airflow-b48bd953c2b
Unified Data Analytics: Helping Data Teams Solve the World’s Toughest Problems - Databricks
In this talk, we will highlight the opportunity data presents to tackle the world’s toughest problems. In spite of the promise that data presents, most data teams are challenged with data, technology and organizational silos. Unified Data Analytics presents a radically different approach to unlock the data potential by unifying all your data with your analytics - from Business Intelligence to Machine Learning.
Simplify and Scale Data Engineering Pipelines with Delta Lake - Databricks
We’re always told to ‘Go for the Gold!,’ but how do we get there? This talk will walk you through the process of moving your data to the finish line to get that gold medal! A common data engineering pipeline architecture uses tables that correspond to different quality levels, progressively adding structure to the data: data ingestion (‘Bronze’ tables), transformation/feature engineering (‘Silver’ tables), and machine learning training or prediction (‘Gold’ tables). Combined, we refer to these tables as a ‘multi-hop’ architecture. It allows data engineers to build a pipeline that begins with raw data as a ‘single source of truth’ from which everything flows. In this session, we will show how to build a scalable data engineering pipeline using Delta Lake, so you can be the champion in your organization.
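As a rough illustration of the multi-hop idea described above, the following sketch uses plain Python lists in place of Delta tables. The record fields (`device`, `temp`), the function names, the cleaning rule and the choice of a per-device average as the 'Gold' output are all assumptions made for this example; they do not come from the talk, and no Delta Lake or Spark API is used.

```python
from collections import defaultdict

# Hypothetical raw events; the field names are invented for illustration.
RAW_EVENTS = [
    {"device": "a", "temp": "21.5"},
    {"device": "a", "temp": "bad"},   # malformed reading
    {"device": "b", "temp": "19.0"},
    {"device": "a", "temp": "22.5"},
]

def to_bronze(raw):
    """Bronze: keep every ingested record unchanged (single source of truth)."""
    return [dict(r) for r in raw]

def to_silver(bronze):
    """Silver: add structure by parsing types and dropping unparseable rows."""
    silver = []
    for row in bronze:
        try:
            silver.append({"device": row["device"], "temp": float(row["temp"])})
        except ValueError:
            continue  # the malformed row is still available in bronze
    return silver

def to_gold(silver):
    """Gold: aggregate to one average temperature per device."""
    readings = defaultdict(list)
    for row in silver:
        readings[row["device"]].append(row["temp"])
    return {device: sum(v) / len(v) for device, v in readings.items()}

bronze = to_bronze(RAW_EVENTS)
silver = to_silver(bronze)
gold = to_gold(silver)
```

Because each hop only reads the previous one, a changed cleaning rule can be replayed from the bronze data without re-ingesting anything.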
Stargate, the gateway for some multi-models data API - Data Con LA
Data Con LA 2020
Description
Join us to learn about Stargate! Stargate is a data gateway deployed between client applications and a database. It's built with extensibility as a first-class citizen and makes it easy to use a database for any application workload by adding plugin support for new APIs, data types, and access methods. After detailing the architecture and ideas behind the frameworks, we will demo the creation of REST and GraphQL APIs on top of Cassandra through simple configuration. Bring home a working sample!
Speaker
Cedrick Lunven, Director of Developer Advocacy, Datastax
AWS Athena vs. Google BigQuery for interactive SQL Queries - DoiT International
During re:Invent 2016, AWS released Amazon Athena, an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
We took a look at AWS Athena and compared it to Google BigQuery, another player in serverless interactive data analysis.
Would you like to know which one is the right tool for you? Join us for this meetup to learn about AWS Athena and to test-drive querying exactly the same dataset with AWS Athena and Google BigQuery, to see where each one shines (or totally blows it).
Accelerating Data Ingestion with Databricks Autoloader - Databricks
Tracking which incoming files have been processed has always required thought and design when implementing an ETL framework. The Autoloader feature of Databricks looks to simplify this, taking away the pain of file watching and queue management. However, there can also be a lot of nuance and complexity in setting up Autoloader and managing the process of ingesting data using it. After implementing an automated data loading process in a major US CPMG, Simon has some lessons to share from the experience.
This session will run through the initial setup and configuration of Autoloader in a Microsoft Azure environment, looking at the components used and what is created behind the scenes. We’ll then look at some of the limitations of the feature, before walking through the process of overcoming these limitations. We will build out a practical example that tackles evolving schemas, applying transformations to your stream, extracting telemetry from the process and finally, how to merge the incoming data into a Delta table.
After this session you will be better equipped to use Autoloader in a data ingestion platform, simplifying your production workloads and accelerating the time to realise value in your data!
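The abstract opens with the bookkeeping an ETL framework needs in order to know which incoming files it has already processed. A minimal sketch of that bookkeeping in plain Python is shown below. It is not Autoloader's API or implementation; the function name `discover_new_files`, the JSON checkpoint file and its layout are hypothetical and serve only to show the idea.

```python
import json
import os
import tempfile

def discover_new_files(landing_dir, checkpoint_path):
    """Return files in landing_dir that are not yet in the checkpoint,
    then record them so that the next call skips them."""
    seen = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            seen = set(json.load(fh))
    current = sorted(
        name for name in os.listdir(landing_dir)
        if os.path.isfile(os.path.join(landing_dir, name))
    )
    new_files = [name for name in current if name not in seen]
    with open(checkpoint_path, "w") as fh:
        json.dump(sorted(seen | set(new_files)), fh)
    return new_files

# Demonstration with a temporary landing directory and a separate state directory.
with tempfile.TemporaryDirectory() as landing, tempfile.TemporaryDirectory() as state:
    checkpoint = os.path.join(state, "processed.json")
    for name in ("a.csv", "b.csv"):
        open(os.path.join(landing, name), "w").close()
    first = discover_new_files(landing, checkpoint)   # both files are new
    open(os.path.join(landing, "c.csv"), "w").close()
    second = discover_new_files(landing, checkpoint)  # only the late arrival
    third = discover_new_files(landing, checkpoint)   # nothing left to do
```

A production version would record a file only after it has been processed successfully, and listing a whole directory on every run does not scale to large landing zones, which is part of the pain the abstract refers to.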
How Adobe uses Structured Streaming at Scale - Databricks
Adobe’s Unified Profile System is the heart of its Experience Platform. It ingests TBs of data a day and is PBs large. As part of this massive growth we have faced multiple challenges in our Apache Spark deployment, which is used from ingestion to processing. We want to share some of our learnings and hard-earned lessons from reaching this scale, specifically with Structured Streaming.
Know thy Lag
While consuming off a Kafka topic which sees sporadic loads, it's very important to monitor the consumer lag. It also makes you respect what a beast backpressure is.
Reading Data In
Fan Out Pattern using minPartitions to Use Kafka Efficiently
Overload protection using maxOffsetsPerTrigger
More Apache Spark Settings used to optimize Throughput
MicroBatching Best Practices
map() + foreach() vs mapPartitions() + foreachPartition()
Adobe Spark Speculation and its Effects
Calculating Streaming Statistics
Windowing
Importance of the State Store
RocksDB FTW
Broadcast joins
Custom Aggregators
OffHeap Counters using Redis
Pipelining
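The "Know thy Lag" and "Overload protection" items in the outline above can be made concrete with a little arithmetic. The sketch below is plain Python rather than Spark or Kafka client code. Consumer lag per partition is taken as the latest offset minus the last consumed offset, and a per-batch cap is split across partitions in proportion to their lag, which is how the Spark documentation describes the `maxOffsetsPerTrigger` option. The offsets, the function names and the rounding are assumptions for illustration.

```python
def consumer_lag(latest, consumed):
    """Lag per partition: messages produced but not yet consumed."""
    return {p: latest[p] - consumed.get(p, 0) for p in latest}

def plan_batch(latest, consumed, max_offsets):
    """Pick an end offset per partition so that at most max_offsets
    messages are read in total, split in proportion to each partition's lag."""
    lag = consumer_lag(latest, consumed)
    total = sum(lag.values())
    if total <= max_offsets:
        return dict(latest)  # small backlog: read everything
    plan = {}
    for p, behind in lag.items():
        share = int(max_offsets * behind / total)  # rounded down
        plan[p] = consumed.get(p, 0) + share
    return plan

# Hypothetical offsets for a three-partition topic.
latest = {0: 1000, 1: 400, 2: 100}
consumed = {0: 400, 1: 200, 2: 100}

lag = consumer_lag(latest, consumed)
batch = plan_batch(latest, consumed, max_offsets=400)
```

With these numbers the backlog is 800 messages, so a cap of 400 drains it over two batches instead of one oversized micro-batch after a burst.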
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis... - Spark Summit
Redis accelerates Apache Spark execution by 45 times, when used as a shared distributed in-memory datastore for Spark in analyses like time series data range queries. With the redis module for machine learning, redis-ml, implementation of spark-ml models gains a new real time serving layer that offloads processing of models directly in Redis, allows multiple applications to reuse the same models and speeds up classification and execution of these models by 13x. Join this session to learn more about the Redis Labs’ connector for Apache Spark that enhances production implementations of real-time big data processing.
New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ... - Databricks
In this talk, we will highlight major efforts happening in the Spark ecosystem. In particular, we will dive into the details of adaptive and static query optimizations in Spark 3.0 to make Spark easier to use and faster to run. We will also demonstrate how new features in Koalas, an open source library that provides a pandas-like API on top of Spark, help data scientists gain insights from their data quicker.
Deep Learning in the Cloud at Scale: A Data Orchestration Story - Alluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Deep Learning in the Cloud at Scale: A Data Orchestration Story
Mickey Zhang, Software Engineer (Microsoft)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Dynamic Partition Pruning in Apache Spark - Databricks
In data analytics frameworks such as Spark it is important to detect and avoid scanning data that is irrelevant to the executed query, an optimization which is known as partition pruning. Dynamic partition pruning occurs when the optimizer is unable to identify at parse time the partitions it has to eliminate. In particular, we consider a star schema which consists of one or multiple fact tables referencing any number of dimension tables. In such join operations, we can prune the partitions the join reads from a fact table by identifying those partitions that result from filtering the dimension tables. In this talk we present a mechanism for performing dynamic partition pruning at runtime by reusing the dimension table broadcast results in hash joins and we show significant improvements for most TPCDS queries.
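The idea can be sketched without Spark. In the toy example below, the fact table is a dictionary keyed by partition value, the dimension table is a small list, and the set of join keys that survive the dimension filter plays the role of the reused broadcast result. All table contents and names are invented for illustration, and this is not Spark's implementation.

```python
# Fact table partitioned by date_id: partition key -> rows of (date_id, amount).
FACT_SALES = {
    1: [(1, 10.0), (1, 5.0)],
    2: [(2, 7.5)],
    3: [(3, 2.0), (3, 8.0)],
    4: [(4, 1.0)],
}

# Dimension table rows: (date_id, year, is_holiday).
DIM_DATE = [
    (1, 2019, False),
    (2, 2019, True),
    (3, 2020, False),
    (4, 2020, True),
]

def join_with_pruning(fact, dim, dim_predicate):
    """Filter the dimension first, then scan only the fact partitions
    whose key appears among the surviving dimension keys."""
    keys = {row[0] for row in dim if dim_predicate(row)}  # known only at runtime
    scanned = [p for p in fact if p in keys]               # pruned partition list
    joined = [(date_id, amount) for p in scanned for date_id, amount in fact[p]]
    return joined, scanned

# Roughly: SELECT ... FROM sales JOIN dates USING (date_id) WHERE is_holiday
rows, scanned = join_with_pruning(FACT_SALES, DIM_DATE, lambda r: r[2])
total = sum(amount for _, amount in rows)
```

The filter mentions only the dimension table, so the partitions to skip cannot be read off the query text; they become known once the dimension side has been evaluated, which is why the pruning happens at runtime.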
Extending DevOps to Big Data Applications with Kubernetes - Nicola Ferraro
DevOps, continuous delivery and modern architectural trends can incredibly speed up the software development process. Big Data applications cannot be an exception and need to keep the same pace.
Using eBPF to Measure the k8s Cluster Health - ScyllaDB
As a k8s cluster admin, your app teams expect your cluster to be available for deploying services at any time without problems. While there is no shortage of metrics in k8s, it's important to have the right metrics to alert on issues and to give you enough data to react to potential availability issues. Prometheus has become a standard and sheds light on the inner behaviour of Kubernetes clusters and workloads. Many of the KPIs (CPU, I/O, network, etc.) we rely on in our on-premise environment are less precise when we start to work in a cloud environment. eBPF is the perfect technology to fill that gap, as it gives us information down to the kernel level. In 2018 Cloudflare shared an open source project to expose custom eBPF metrics in Prometheus. Join this session and learn about:
• What is eBPF?
• What types of metrics can we collect?
• How to expose those metrics in a K8s environment
This session will try to deliver a step-by-step guide on how to take advantage of the eBPF exporter.
In this one-day workshop, we will introduce Spark at a high level. Working with Spark is fundamentally different from writing MapReduce jobs, so no prior Hadoop experience is needed. You will learn how to interact with Spark on the command line and conduct rapid in-memory data analyses. We will then work on writing Spark applications to perform large cluster-based analyses including SQL-like aggregations, machine learning applications, and graph algorithms. The course will be conducted in Python using PySpark.
Effective Kubernetes - Is Kubernetes the new Linux? Is the new Application Se... - Wojciech Barczyński
I will tell you two stories about two different implementations of Kubernetes: one from fashion mobile e-commerce and one from a fintech. Kubernetes is not a silver bullet. But damn close ;).
Chris Lauer, NOAA Space Weather Prediction Center -
This is the story of how adopting a containerized workflow changed the way our small software team works at NOAA’s Space Weather Prediction Center. Our old architecture, a big ball of mud shared-database integration, just wasn’t cutting it - it was killing our agility. Over the past two years, our small team has adopted a microservice style architecture, using Docker with docker-compose and environment files as our deployment strategy for all new development. We’ve discovered the joys of using containers for identical dev, staging, and production environments. We work closely with scientists: much of the code we’re running has complicated and conflicting library dependencies. Docker captures these beautifully - we’ve even had some success teaching our scientists to use it! I’ll share what we’ve learned, some of the persistent challenges we face, and one place we really got it wrong. This talk builds off of a popular hallway track from DockerCon 2019.
OS for AI: Elastic Microservices & the Next Gen of ML - Nordic APIs
AI has been a hot topic lately, with advances being made constantly in what is possible, but there has not been as much discussion of the infrastructure and scaling challenges that come with it. How do you support dozens of different languages and frameworks, and make them interoperate invisibly? How do you scale to run abstract code from thousands of different developers, simultaneously and elastically, while maintaining less than 15ms of overhead?
At Algorithmia, we’ve built, deployed, and scaled thousands of algorithms and machine learning models, using every kind of framework (from scikit-learn to tensorflow). We’ve seen many of the challenges faced in this area, and in this talk I’ll share some insights into the problems you’re likely to face, and how to approach solving them.
In brief, we’ll examine the need for, and implementations of, a complete “Operating System for AI” – a common interface for different algorithms to be used and combined, and a general architecture for serverless machine learning which is discoverable, versioned, scalable and sharable.
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Landon Robinson
The Spark Listener interface provides a fast, simple and efficient route to monitoring and observing your Spark application - and you can start using it in minutes. In this talk, we'll introduce the Spark Listener interfaces available in core and streaming applications, and show a few ways in which they've changed our world for the better at SpotX. If you're looking for a "Eureka!" moment in monitoring or tracking of your Spark apps, look no further than Spark Listeners and this talk!
OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by ...NETWAYS
In this talk we’ll introduce an open source project being used to monitor large Power Systems clusters, such as in the IBM collaboration with Oak Ridge and Lawrence Livermore laboratories for the Summit project, a large deployment of custom AC922 Power Systems nodes augmented by GPUs that work in tandem to implement the (currently) largest Supercomputer in the world.
Data is collected out-of-band directly from the firmware layer and then redistributed to various components using an open source component called Crassd. In addition, in-band operating-system and service level metrics, logs and alerts can also be collected and used to enrich the visualization dashboards. Open source components such as the Elastic Stack (Elasticsearch, Logstash, Kibana and select Beats) and Netdata are used for monitoring scenarios appropriate to each tool’s strengths, with other components such as Prometheus and Grafana in the process of being implemented. We’ll briefly discuss our experience to put these components together, and the decisions we had to make in order to automate their deployment and configuration for our goals. Finally, we lay out collaboration possibilities and future directions to enhance our project as a convenient starting point for others in the open source community to easily monitor their own Power Systems environments.
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...GetInData
Did you like it? Check out our blog to stay up to date: https://getindata.com/blog
The talk is focused on administration, development and monitoring platform with Apache Spark, Apache Flink and Kubeflow in which the monitoring stack is based on Prometheus stack.
Author: Albert Lewandowski
Linkedin: https://www.linkedin.com/in/albert-lewandowski/
___
Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets.
Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries.
https://getindata.com
All the Ops: DataOps with GitOps for Streaming data on Kafka and KubernetesDevOps.com
Running Apache Kafka and Kubernetes is synonymous with containerized real time data. Many users have adopted the pairing to deploy and manage individual distributed real time applications.
While Kubernetes allows developers to scale applications in microservices quicker, there are still productivity blockers such as visibility and governance.
Enter DataOps.
In this webinar, you'll learn how to:
Enhance the productivity of your Kafka & Kubernetes stream with DataOps
Enable enterprise adoption and scaling
Govern & secure your stream
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringDatabricks
The Spark Listener interface provides a fast, simple and efficient route to monitoring and observing your Spark application - and you can start using it in minutes. In this talk, we'll introduce the Spark Listener interfaces available in core and streaming applications, and show a few ways in which they've changed our world for the better at SpotX. If you're looking for a "Eureka!" moment in monitoring or tracking of your Spark apps, look no further than Spark Listeners and this talk!
A talk I gave at Prairie Dev Con about how we deployed applications before Kubernetes, how to reason about what Kubernetes does, and why you should use it as a default.
Similar to Streaming 4 billion Messages per day. Lessons Learned. (20)
2. What is HiveHome doing ?
Provides a range of different sensors that all work together
to build a smart and connected home.
3. How is Big Data generated at Connected Home
…more devices to be released
How is it accessible ?
Avro messages through
Kafka Contracted topics
Some numbers ?
4+ Billion messages from input topics
to the Data Platform (increasing by 1000s every day)
Is it useful ?
Many of CH & BG services are based solely
on Big Data projects.
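For a sense of scale, the daily volume above can be turned into an average per-second rate. This is only back-of-the-envelope arithmetic; the function name is made up for illustration, and real traffic will have peaks above the average.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate_per_second(messages_per_day: float) -> float:
    """Average message rate, assuming traffic is spread evenly over the day."""
    return messages_per_day / SECONDS_PER_DAY


# 4 billion is the lower bound quoted on the slide ("4+ Billion")
rate = average_rate_per_second(4_000_000_000)
print(f"{rate:,.0f} msgs/s on average")
```

That works out to roughly 46K msgs/s, which is in line with the 50K msgs/s figure quoted on the next slide.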
4. Processing 50K msgs/s from IoT devices, you learn to:
* design a microservices architecture
that won’t wake you up at 03:00 for a simple restart
* not duplicate stuff (code or configs)
we spend a significant % of our time as plumbers... let’s make our lives easy
* be resilient to failures,
especially when dealing with stateful applications
* communicate/collaborate with data scientists
mathematicians != engineers
5. Try to :
* Decouple applications
* Stick to single responsibility principle
* Make apps portable
* Make apps immutable
* Make testing portable and easy
Docker & Kubernetes
GoCD (CI/CD)
Decouple ETL
Police EL with Schema Registry
Microservices in real time pipelines
6. Monolithic approach (T + L)
Average the internal temperatures per house per 30 minutes and persist to ES
Pros
* We only support/monitor one app!
* All in one place and you don’t have to remember git repos etc.
Cons
* Job has 2 responsibilities
* Hard to test
* If we want to persist to Cassandra we need to reprocess the messages
* We cannot reuse the app
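The transformation in this example (average internal temperature per house per 30 minutes) can be sketched on its own, independent of where the result is persisted. This is a minimal plain-Python illustration, not the team's Spark job: the tuple layout `(house_id, epoch_seconds, temperature)` and all function names are assumptions made for the sketch.

```python
from collections import defaultdict

WINDOW_SECONDS = 30 * 60  # 30-minute tumbling windows


def window_start(epoch_seconds: int) -> int:
    """Round a timestamp down to the start of its 30-minute window."""
    return epoch_seconds - (epoch_seconds % WINDOW_SECONDS)


def average_per_house(readings):
    """Average temperature per (house, window).

    readings: iterable of (house_id, epoch_seconds, temperature) tuples.
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for house_id, ts, temperature in readings:
        key = (house_id, window_start(ts))
        totals[key][0] += temperature
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical sample readings
readings = [
    ("house-1", 0, 20.0),
    ("house-1", 600, 22.0),    # same window as the first reading
    ("house-1", 1800, 19.0),   # next 30-minute window
    ("house-2", 100, 18.0),
]
averages = average_per_house(readings)
```

Because the function only computes and returns values, it can be unit tested without any datastore, which is the testing problem the monolithic job runs into.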
7. Microservices based approach
Pros
* 1 responsibility per app
* Easy to replace the load job to ES with a Cassandra job
* Easy to replay data
* We CAN generalise/reuse the L stage
Cons
* We need to support/monitor 2 apps :(
[Diagram labels: T, E, E, C*, ES]
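The "easy to replace the load job" and "easy to replay" points can be sketched as follows. This is an illustration under stated assumptions: a Python list stands in for the Kafka topic holding the output of the T stage, and the two sink functions are in-memory stand-ins for Elasticsearch and Cassandra, not real client calls.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


def run_load_stage(topic: List[Record], sink: Callable[[Record], None]) -> int:
    """Generic L stage: push every record of a topic into the given sink."""
    for record in topic:
        sink(record)
    return len(topic)


# Output of the T stage; the list stands in for a Kafka topic
averages_topic: List[Record] = [
    {"house": "house-1", "window_start": 0, "avg_temp": 21.0},
    {"house": "house-2", "window_start": 0, "avg_temp": 18.0},
]

# In-memory stand-ins for the two datastores (hypothetical)
elasticsearch_index: List[Record] = []
cassandra_table: Dict[Tuple[object, object], Record] = {}


def to_elasticsearch(record: Record) -> None:
    elasticsearch_index.append(record)


def to_cassandra(record: Record) -> None:
    cassandra_table[(record["house"], record["window_start"])] = record


# The same transformed data is loaded twice without re-running the T stage
run_load_stage(averages_topic, to_elasticsearch)
run_load_stage(averages_topic, to_cassandra)
```

Adding a new destination means writing one more sink and replaying the topic; the transformation is not touched.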
8.
9. We went through what our infrastructure looks like.
Let’s see what we deploy on that infrastructure
10. We used to write a lot of Spark apps for E & L operations
> internalTemperature to ElasticSearch
> internalTemperature to Cassandra
> motionDetected to ElasticSearch
> deviceSignal to Cassandra
> …
11. But we replaced our Spark jobs because…
We ended up with:
* Duplicated code all over the place for simple tasks
* Too many GitHub repos. Hard to keep them in your head
* Too much time to provision a small cluster to test the app
* Many resources (£££) wasted because of the master/driver dependencies of Spark
12. The goal was to define the E & L stages *once* as
a generic re-usable component that handles:
Offset Management
Serialization / de-serialization
Partitioning / Scalability
Fault tolerance / fail-over
Schema Registry integration.
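A minimal sketch of the offset management and fail-over items above, in plain Python with no Kafka dependency. The list `topic`, the `FlakySink` class and `load_from` are hypothetical stand-ins for illustration; Kafka Connect itself commits offsets periodically rather than after every record.

```python
class FlakySink:
    """Hypothetical sink that fails exactly once, on a chosen record."""

    def __init__(self, fail_at):
        self.fail_at = fail_at
        self.failed_once = False
        self.written = []

    def write(self, record):
        if not self.failed_once and record == self.fail_at:
            self.failed_once = True
            raise IOError("sink unavailable")
        self.written.append(record)


def load_from(topic, sink, committed_offset):
    """Write records starting at committed_offset; return the new offset.

    The offset only advances after a successful write, so a restarted
    task resumes at the record that failed instead of skipping it.
    """
    offset = committed_offset
    try:
        while offset < len(topic):
            sink.write(topic[offset])
            offset += 1
    except IOError:
        pass  # the task stops here; offset still points at the failed record
    return offset


topic = ["m0", "m1", "m2", "m3"]
sink = FlakySink(fail_at="m2")

offset_after_failure = load_from(topic, sink, committed_offset=0)
offset_after_restart = load_from(topic, sink, offset_after_failure)
```

The point of defining this once is that every E & L job gets the same resume-after-failure behaviour without re-implementing it per application.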
13. Kafka Connect to the rescue
Kafka Connect
* Suitable for EL operations (no T here)
* No notion of driver/master/worker
* No dependency on ZooKeeper
* Uses the well-tested Kafka consumers/producers
* Configurable via a REST API
14. But by default you need to write some code for every application
for the specific domain transformations.
15. KCQL (Kafka Connect Query Language)
KCQL is a SQL-like syntax allowing streamlined configuration of Kafka Sink Connectors
Examples:
* INSERT INTO transactionIndex SELECT * FROM transactionTopic
* INSERT INTO motionIndex SELECT motionDatetime AS motionAt FROM motionTopic
Available operations: rename, ignore fields, repartition messages and many more
(https://github.com/datamountaineer/kafka-connect-query-language)
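To make the rename example concrete, here is a toy interpreter for the small `INSERT INTO ... SELECT ... FROM ...` subset shown on the slide. It is not the real KCQL parser and supports nothing beyond `*` and `field AS alias`; the function names are made up, and the sketch assumes that selecting named fields keeps only those fields.

```python
import re

# Matches only: INSERT INTO <target> SELECT <fields> FROM <topic>
STATEMENT = re.compile(
    r"INSERT\s+INTO\s+(?P<target>\w+)\s+SELECT\s+(?P<fields>.+?)"
    r"\s+FROM\s+(?P<topic>\w+)\s*$",
    re.IGNORECASE,
)


def parse(statement):
    """Return (target, topic, mapping); mapping is None for SELECT *."""
    match = STATEMENT.match(statement.strip())
    if match is None:
        raise ValueError(f"unsupported statement: {statement!r}")
    fields = match.group("fields").strip()
    if fields == "*":
        mapping = None
    else:
        mapping = {}
        for part in fields.split(","):
            tokens = re.split(r"\s+AS\s+", part.strip(), flags=re.IGNORECASE)
            # "a AS b" maps a -> b; a bare "a" maps a -> a
            mapping[tokens[0]] = tokens[-1]
    return match.group("target"), match.group("topic"), mapping


def project(message, mapping):
    """Apply the field selection/renaming to one message (a dict)."""
    if mapping is None:
        return dict(message)
    return {alias: message[source] for source, alias in mapping.items()}


target, topic, mapping = parse(
    "INSERT INTO motionIndex SELECT motionDatetime AS motionAt FROM motionTopic"
)
# Hypothetical message; field values are invented for the example
row = project({"motionDatetime": "2017-05-01T10:00:00Z", "deviceId": "d1"}, mapping)
```

The idea is that a one-line statement in the connector configuration replaces the per-topic mapping code that each Spark job used to carry.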
17. Monitor your KC apps…
* JMX metrics and logs from the app (JMX metrics provide detailed granularity of the state of the KC app)
* Kafka Connect UI (logs and configs for each KC app available with 1 click - https://github.com/landoop)
18. E & L stages are now solid, well defined
with minimal duplication and highly reusable.
T needs some polishing. Time to rethink
our T stage.
19. What was the problem …
Spark is great but not always the best option:
-> has the notion of micro batches
-> handling state is not optimal
-> you need shared storage to store checkpoints and state
-> you need a cluster with master, driver & workers
SPARK :(
20. From Spark to Kafka Streams …
Kafka Streams is great because:
-> is cluster and framework free
-> uses Kafka to store the state
-> exposes the state via an API
-> has no notion of micro batches
-> KTables
-> No need for ZooKeeper
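The "uses Kafka to store the state" point can be illustrated with a toy state store. In Kafka Streams, a local state store is backed by a changelog topic in Kafka, so a new instance can rebuild its state by replaying that topic. The class below is a plain-Python stand-in, with a list playing the role of the changelog topic; it is not the Kafka Streams API.

```python
class ChangelogBackedStore:
    """Toy key-value store whose every write is also appended to a changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # list standing in for a Kafka changelog topic
        self.state = {}
        # Restore: replay the changelog, later entries overwrite earlier ones
        for key, value in changelog:
            self.state[key] = value

    def put(self, key, value):
        self.changelog.append((key, value))
        self.state[key] = value

    def get(self, key):
        return self.state.get(key)


changelog = []
store = ChangelogBackedStore(changelog)
store.put("house-1", 21.0)
store.put("house-1", 21.5)
store.put("house-2", 18.0)

# Simulate the instance dying and a fresh one starting elsewhere:
# no shared storage is needed, only the changelog.
restored = ChangelogBackedStore(changelog)
```

This is why no shared checkpoint storage is required: the durable copy of the state lives in Kafka itself.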
21. So we re-wrote one of our heavy CPU jobs in Kafka Streams
Results:
-> Again: No need to worry about where to store checkpoints. Everything is stored in Kafka.
-> No need for a cluster. Just execute `java -jar app.jar`
-> Less scripting!
-> We needed to do funny stuff to make it work with Scala :(
22. And now we have:
-> 50% fewer resources used in some cases. Better CPU/memory utilisation across instances.
-> Easier auto scaling. Just start more instances of your app and Kafka Streams will scale automatically.
-> Happier devops because they worry about the infrastructure and not the frameworks on top of that.
And since the state is exposed through an API, we now know what happens inside the app at any given time
23. Until now we described the engineering part of the Data
Platform team.
Let’s see who uses the data from our platform.
24. Data Science @ HiveHome
Some of the projects:
-> Energy Breakdown
Distribute the energy usage into categories (lighting, cooking etc) just by knowing the total hourly
consumed energy (patent pending)
-> Heating Failure Alert
Try to identify if a boiler is not working properly, knowing only the internal temperature of a house
25. Data Science @ Connected Home
What do scientists need?
-> as much data as possible
-> as soon as possible
-> as accessible as possible
26. Data Science @ Connected Home
how to work with data scientists
* Be proactive. Have the data ready in advance.
* Keep the data in a flexible datastore, i.e. Elasticsearch and not Cassandra.
* Side by side development during each iteration of a model. (Scientists do not unit test!)
* Jupyter/Zeppelin notebooks. Easily run and scale a model across your clusters.
27. So what we actually learned
(apart from all the cool stuff we can add to our CVs)
* Decouple everything.
* When you start copying code and configs -> tools down and rethink your application setup.
* Try new technologies. The initial learning curve will pay off later.
* Work closely with data scientists so they develop a similar mindset to an engineer.