Enabling blazing-fast search over a 1-billion-member social network demands special weapons and tactics. At Evojam we took advantage of the Scala ecosystem and modern tooling to handle persistence and processing of such a data set.
A case study focused on the client's requirements and the tools we used to meet them. Thoughts, injuries, and ideas from a real-life fast & big data challenge, brought to you straight from the trenches.
This talk was given at the Scalar 2016 conference.
2. Previous Experience
~10 million records
~60 million documents
~60 million vertices in a graph (non-commercial)
The datasets I had previously worked on were much smaller!
This time we were about to deal with a much bigger social network, one that can easily be represented as a graph. We were going to join the team…
5. 1 billion challenge
835,272,759 vertices
751,857,081 users
83,415,678 companies
6,956,990,209 relations
Subsets ranging from thousands to millions of users
When we finally put our hands on the data, we realized that the data size was smaller than expected. What more did we find?
6. Existing app workflow
• One engineer-evening, multiple custom scripts, JSON files
• Extract a subset (e.g. 3m profiles) out of 750m profiles
• Data duplication: only manually selected subsets, 60m profiles incl. duplicates
• A manual process, taking a few days from the user's perspective
• This was already in use by the final customers
8. PoC - Definition of done
• Graph traversable on demand
• First results available under 1 minute
• Entire subset ready in a few minutes
• REST API with search
9. PoC - Concept
[Diagram: the API backed by two engines, a Graph DB and a Document DB]
The whole dataset is stored in two engines: all 6,956,990,209 relations in the graph DB, all 751,857,081 profiles in the document DB.
10. Why graph DB?
• Fast graph traversal
• Easily extendable with new relations and vertices
• Convenient algorithm description
11. PoC - Flow
[Diagram: Graph DB → API → Document DB]
1. Generate the subset: traverse the graph, following the user and company relations
2. Tag the documents: mark each matching profile with a unique id in the database holding all profiles
3. Perform the search, with the tag as the primary filter
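To make the three-step flow concrete, here is a minimal sketch of the service interface it implies; the trait, method names and types are ours, not from the slides:

import scala.concurrent.Future
import akka.NotUsed
import akka.stream.scaladsl.Source

final case class UserId(value: Int)
final case class Profile(id: UserId, name: String)

// Hypothetical interface; names and shapes are invented for illustration.
trait SubsetSearch {
  // 1. Traverse the graph DB and stream back the ids of matching users
  def generateSubset(companyId: Long): Source[UserId, NotUsed]
  // 2. Tag the corresponding documents in the document DB
  def tagProfiles(ids: Seq[UserId], tag: String): Future[Unit]
  // 3. Fulltext search, with the tag as the primary filter
  def search(tag: String, phrase: String): Future[Seq[Profile]]
}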
12. Weapon of choice
[Diagram: the API in place, with the Graph DB and Document DB still marked "?"]
Scala and Play were the natural choice for the API app; some research was required for the databases.
13. First Steps: Extraction
Extract anonymized profiles, companies and relations. Clean up the data, sort it and generate the input files.
It took a few days to pull, streaming with Akka and Slick.
To do any research we first had to put our hands on the real data.
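The slides don't show the extraction code itself; as an illustration of the Akka + Slick streaming approach they mention, a minimal sketch (assuming Slick 3.x, a relational source with a profiles table, and the Akka Streams API of that era; the query and file name are made up):

import java.io.File
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{FileIO, Source}
import akka.util.ByteString
import slick.driver.PostgresDriver.api._

object ExtractProfiles extends App {
  implicit val system = ActorSystem("extraction")
  implicit val materializer = ActorMaterializer()

  val db = Database.forConfig("sourceDb")

  // db.stream turns the query into a Reactive Streams Publisher,
  // so rows flow into Akka Streams without loading 750m profiles into memory.
  Source.fromPublisher(db.stream(sql"SELECT id FROM profiles".as[Long]))
    .map(id => ByteString(s"$id\n"))
    .runWith(FileIO.toFile(new File("profile-ids.csv")))
}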
14. First Steps: Pushing forward
Push the profiles to the document DB for searching purposes; push the vertices and relations to the graph DB for traversal.
Two separate tools, highly dependent on the DB engines.
15. Fulltext Searchable Document DB
• Mature
• Horizontally scalable
• Fast indexing (~3k documents per second on a single node)
• Well documented
• With Scala libraries:
• https://github.com/sksamuel/elastic4s
• https://github.com/evojam/play-elastic4s
We already had significant experience with scaling Elasticsearch for 80 million documents.
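For a flavour of the elastic4s DSL mentioned above, a minimal indexing sketch (assuming the elastic4s 2.x API that was current at the time; the index, type and field names are invented):

import com.sksamuel.elastic4s.{ElasticClient, ElasticsearchClientUri}
import com.sksamuel.elastic4s.ElasticDsl._

object IndexOneProfile extends App {
  val client = ElasticClient.transport(ElasticsearchClientUri("elasticsearch://127.0.0.1:9300"))

  // Index a single profile document into profiles/user
  client.execute {
    index into "profiles" / "user" id 42 fields (
      "name" -> "John Doe",
      "companies" -> Seq("Foo Inc.")
    )
  }.await
}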
21. Final Setup on AWS
[Diagram: Neo4j on i2.2xlarge (8 vCPU, 61 GB), the API on m4.large (2 vCPU, 8 GB), and a two-node ES cluster (hr-1, hr-2) on i2.xlarge (4 vCPU, 30.5 GB each)]
ES cluster:
• 2 nodes
• 2 indexes
• 10 shards per index
• 0 replicas
22. Step #1 - Bulk loading into Neo4j
Importing the contents of these files into data/graph.db.
[…]
IMPORT DONE in 3h 24m 58s 140ms. Imported:
835273352 nodes
6956990209 relationships
0 properties
It took 12 hours on an Amazon instance half this size.
23. Step #2 - Bulk loading into ES
Grouped, throttled inserts:
1. Create a Source from the CSV file, framed line by line on '\n'
2. Decode the id from each ByteString and generate the user JSON
3. Group by the bulk size (e.g. 4,500)
4. Throttle (two bulks of 4,500 every 2 seconds, roughly 4,500 documents per second)
5. Execute the bulk insert into Elasticsearch
Akka advantage: CPU utilization, parallel data enrichment, human-readable flow description.
24. Step #2 - Bulk loading into ES

import java.io.File
import akka.stream.ThrottleMode
import akka.stream.scaladsl.{FileIO, Framing, Sink}
import akka.util.ByteString
import scala.concurrent.duration._

FileIO.fromFile(new File(sourceOfUserIds))
  .via(Framing.delimiter(ByteString("\n"),   // split the byte stream into one frame per line
    maximumFrameLength = 1024,
    allowTruncation = false))
  .mapAsyncUnordered(16)(prepareUserJson)    // enrich ids into JSON documents in parallel
  .grouped(4500)                             // one bulk request per 4,500 documents
  .throttle(
    elements = 2,
    per = 2.seconds,
    maximumBurst = 2,
    mode = ThrottleMode.Shaping)             // don't outrun the ES cluster
  .mapAsyncUnordered(2)(executeBulkInsert)
  .runWith(Sink.ignore)

Flow description with Akka Streams.
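The two mapAsyncUnordered stages call helpers the slides never show; hypothetical signatures, to make the flow above self-contained (the bodies are ours, not the authors'):

import akka.util.ByteString
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

val sourceOfUserIds: String = "profile-ids.csv"   // path produced by the extraction step

// Parse one id and enrich it into a user JSON document (hence the parallelism of 16)
def prepareUserJson(line: ByteString): Future[String] =
  Future(s"""{"id": ${line.utf8String.trim}}""")

// Send one Elasticsearch _bulk request per group of 4,500 documents
def executeBulkInsert(docs: Seq[String]): Future[Unit] = ???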
25. Step #3 - Tagging
[Diagram: Foo Inc. and Bar Ltd. connected to Users through :worked edges, with further Users attached through :knows edges; the query below picks the acquaintances of Foo Inc. employees who never worked at Foo Inc.]
MATCH (c:Company)-[:worked]-(employees:User)-[:knows]-(acquaintance:User)
WHERE ID(c)={foo-inc-id} AND NOT (acquaintance)-[:worked]-(c)
RETURN DISTINCT ID(acquaintance)
Neo4j traversal query in Cypher
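The CypherRow type on the next slide suggests the query was issued through AnormCypher; a rough sketch of what that looks like (connection setup varies across AnormCypher versions, and we substitute a valid parameter name for the slide's {foo-inc-id} placeholder):

import org.anormcypher._

// Assumed AnormCypher setup; host, credentials and the company id are illustrative.
implicit val connection = Neo4jREST("localhost", 7474, "neo4j", "password")

val acquaintanceIds: Stream[CypherRow] = Cypher(
  """MATCH (c:Company)-[:worked]-(employees:User)-[:knows]-(acquaintance:User)
    |WHERE ID(c)={fooIncId} AND NOT (acquaintance)-[:worked]-(c)
    |RETURN DISTINCT ID(acquaintance)""".stripMargin)
  .on("fooIncId" -> 12345)()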
26. Akka Streams to the rescue

import akka.stream.scaladsl.Source
import org.anormcypher.CypherRow
import play.api.libs.iteratee.Enumerator
import play.api.libs.streams.Streams

val idEnum: Enumerator[CypherRow] = ???   // rows streamed back from the Neo4j query

val src =
  Source.fromPublisher(Streams.enumeratorToPublisher(idEnum))
    .map(_.data.head.asInstanceOf[BigDecimal].toInt)   // pull the returned id out of the row
    .via(new TimeoutOnEmptyBuffer())                   // custom buffering stage, see below
    .map(UserId(_))
    .mapAsyncUnordered(parallelism = 1)(id =>
      dao.tagProfiles(id, companyId))

• Readable flow
• Buffering to protect Neo4j when indexing is too slow
• Timeout needed due to a bug in the underlying implementation
28. Bulk update with AkkaStream tuning

src
  .grouped(20000)                  // tag 20,000 profiles per bulk request
  .throttle(
    elements = 2,
    per = 6.seconds,
    maximumBurst = 2,
    mode = ThrottleMode.Shaping)
  .mapAsyncUnordered(parallelism = 1)(ids =>
    dao.bulkTag(ids, ...))         // remaining arguments elided on the slide

Bulk tagging in a few lines.
29. Tagging Foo-Company
~14 seconds until the first batch is tagged
~7 minutes 11 seconds until everything is tagged
Reference implementation: a few hours
2,222,840 profiles matching the criteria
A sample subset :)
30. Step #5 - Search
[Diagram: the same AWS setup. Neo4j on i2.2xlarge (8 vCPU, 61 GB), the API on m4.large (2 vCPU, 8 GB), and the two-node ES cluster (hr-1, hr-2) on i2.xlarge (4 vCPU, 30.5 GB each)]
With the data ready in ES, the search implementation was pretty straightforward.
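A minimal sketch of what such a query can look like with the elastic4s 2.x DSL (reusing the client from the earlier sketch; index, type and field names are still invented, and the exact query shape is our guess, not the slides'):

import com.sksamuel.elastic4s.ElasticDsl._

client.execute {
  search in "profiles" / "user" query {
    bool {
      must(
        termQuery("tags", "foo-company"),   // the subset tag written in step #3
        matchQuery("name", "John")          // the fulltext phrase
      )
    }
  } limit 50
}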
31. Step #5 - Search benchmark
Fulltext search on the 2 million profile subset:
GET /users?company=foo-company&phrase=John
2,000 phrases for the benchmark, requests issued in random order
Response: JSON with 50 profiles
Searching within the 750 million profile database
Phrases based on the real names and surnames used during profile enrichment
32. Step #5 - Search under siege
Response time ~0.14 s with 50 concurrent users
Constant latency, constant search rate
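The slides don't say which load-testing tool produced these numbers; purely as an illustration, a minimal Gatling 2.x simulation approximating the 50-concurrent-user scenario (base URL and duration are made up):

import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

class SearchUnderSiege extends Simulation {

  val httpConf = http.baseURL("http://localhost:9000")

  // 50 virtual users looping on the search request ≈ 50 concurrent requests
  val scn = scenario("fulltext search")
    .forever {
      exec(http("search").get("/users?company=foo-company&phrase=John"))
    }

  setUp(scn.inject(atOnceUsers(50))).protocols(httpConf).maxDuration(2.minutes)
}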
34. Summary
                 reference implementation   PoC
process          manual                     automatic
first results    few days                   14 seconds
entire subset    few days                   7 minutes
profiles         ~40 million                ~750 million
analytics        none                       GraphX ready

A tool ready for data scientists. We can implement core traversal modifications almost instantly.